Jin Qiao
440cfe18ed
LLM: GPU Example Updates for Windows ( #9992 )
...
* modify aquila
* modify aquila2
* add baichuan
* modify baichuan2
* modify blue-lm
* modify chatglm3
* modify chinese-llama2
* modify codellama
* modify distil-whisper
* modify dolly-v1
* modify dolly-v2
* modify falcon
* modify flan-t5
* modify gpt-j
* modify internlm
* modify llama2
* modify mistral
* modify mixtral
* modify mpt
* modify phi-1_5
* modify qwen
* modify qwen-vl
* modify replit
* modify solar
* modify starcoder
* modify vicuna
* modify voiceassistant
* modify whisper
* modify yi
* modify aquila2
* modify baichuan
* modify baichuan2
* modify blue-lm
* modify chatglm2
* modify chatglm3
* modify codellama
* modify distil-whisper
* modify dolly-v1
* modify dolly-v2
* modify flan-t5
* modify llama2
* modify llava
* modify mistral
* modify mixtral
* modify phi-1_5
* modify qwen-vl
* modify replit
* modify solar
* modify starcoder
* modify yi
* correct the comments
* remove cpu_embedding in code for whisper and distil-whisper
* remove comment
* remove cpu_embedding for voice assistant
* revert modify voice assistant
* modify for voice assistant
* add comment for voice assistant
* fix comments
* fix comments
2024-01-29 11:25:11 +08:00
Yuwen Hu
c6d4f91777
[LLM] Add UTs of load_low_bit for transformers-style API ( #10001 )
...
* Add uts for transformers api load_low_bit generation
* Small fixes
* Remove replit-code for CPU tests due to current load_low_bit issue on MPT
* Small change
* Small reorganization to llm unit tests on CPU
* Small fixes
2024-01-29 10:18:23 +08:00
Yishuo Wang
d720554d43
simplify quantize kv cache api ( #10011 )
2024-01-29 09:23:57 +08:00
Yina Chen
a3322e2a6c
add fp8 e5 to use_xmx ( #10015 )
2024-01-26 18:29:46 +08:00
Qiyuan Gong
9e18ea187f
[LLM] Avoid KV Cache OOM when seq len is larger than 1 ( #10006 )
...
* Avoid OOM during multi-round streaming chat with kv cache
* For llama-like kv cache, i.e., [bs, n_head, seq_len, head_dim], use is_enough_kv_cache_room_4_31 (see the sketch below).
* Other models need to compare kv cache size with kv_len.
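A minimal sketch of the room check described above, assuming a llama-style cache layout; the helper name and the dim-2 comparison are inferred from the bullets, not the exact bigdl-llm implementation:

```python
import torch

# Hedged sketch: check whether a preallocated llama-style KV cache
# [bs, n_head, seq_len, head_dim] still has room for the current kv length.
def is_enough_kv_cache_room(past_key_value, kv_len: int) -> bool:
    if past_key_value is None:
        return False
    key_states = past_key_value[0]
    return key_states.size(2) > kv_len  # seq_len lives on dim 2

# For other cache layouts, compare the cache's total size against kv_len
# instead, as the last bullet suggests.
```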
2024-01-26 17:30:08 +08:00
binbin Deng
e5ae6f2c13
LLM: fix truncation logic of past_key_values in chatglm multi turn chat ( #10007 )
...
* Avoid frequently truncating past_key_values when its length is larger than required (a sketch of the policy follows below).
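A hedged sketch of that truncation policy; the names (`maybe_truncate`, `max_cache_len`) and the seq-first cache layout are illustrative assumptions, not the actual chatglm identifiers:

```python
# Only truncate when the cache actually exceeds the limit, instead of
# slicing (and copying) past_key_values on every turn.
def maybe_truncate(past_key_values, max_cache_len):
    cur_len = past_key_values[0][0].size(0)  # chatglm caches are seq-first
    if cur_len <= max_cache_len:
        return past_key_values
    return tuple((k[-max_cache_len:], v[-max_cache_len:])
                 for k, v in past_key_values)
```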
2024-01-26 16:56:02 +08:00
Yuwen Hu
1eaaace2dc
Update perf test all-in-one config for batch_size arg ( #10012 )
2024-01-26 16:46:36 +08:00
Xin Qiu
7952bbc919
add conf batch_size to run_model ( #10010 )
2024-01-26 15:48:48 +08:00
SONG Ge
421e7cee80
[LLM] Add Text_Generation_WebUI Support ( #9884 )
...
* initially add text_generation_webui support
* add env requirements install
* add necessary dependencies
* update for starting webui
* update shared and note where to place models
* update heading of part3
* meet comments
* add copyright license
* remove extensions
* convert tutorial to windows side
* add warm-up to optimize performance
2024-01-26 15:12:49 +08:00
Yuwen Hu
f0da0c131b
Disable llama2 optimize model true or false test for now in Arc UTs ( #10008 )
2024-01-26 14:42:11 +08:00
Ruonan Wang
a00efa0564
LLM: add mlp & qkv fusion for FP16 Llama-7B ( #9932 )
...
* add mlp fusion for llama
* add mlp fusion
* fix style
* update
* add mm_qkv_out (see the qkv-fusion sketch below)
* fix style
* update
* meet code review
* meet code review
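The qkv part of the fusion can be illustrated with a plain PyTorch sketch: three projections collapse into one matmul. Names here are illustrative; the commit's actual fused FP16 kernels are separate.

```python
import torch

# Hedged sketch: concatenate q/k/v projection weights so one linear layer
# (one matmul) replaces three separate ones.
def fuse_qkv(q_proj, k_proj, v_proj):
    out_features = (q_proj.out_features + k_proj.out_features
                    + v_proj.out_features)
    fused = torch.nn.Linear(q_proj.in_features, out_features, bias=False)
    with torch.no_grad():
        fused.weight.copy_(torch.cat(
            [q_proj.weight, k_proj.weight, v_proj.weight], dim=0))
    return fused
```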
2024-01-26 11:50:38 +08:00
Wang, Jian4
98ea3459e5
LLM: Fix llama draft_model dtype error ( #10005 )
...
* fix llama draft_model dtype error
* update
2024-01-26 10:59:48 +08:00
Yishuo Wang
aae1870096
fix qwen kv cache length ( #9998 )
2024-01-26 10:15:01 +08:00
Chen, Zhentao
762adc4f9d
Reformat summary table ( #9942 )
...
* reformat the table
* refactor the file
* read result.json only
2024-01-25 23:49:00 +08:00
binbin Deng
171fb2d185
LLM: reorganize GPU finetuning examples ( #9952 )
2024-01-25 19:02:38 +08:00
Yuwen Hu
175027c90f
Small clarification for windows installation guide ( #10002 )
2024-01-25 18:39:11 +08:00
Yishuo Wang
24b34b6e46
change xmx condition ( #10000 )
2024-01-25 17:48:11 +08:00
Ziteng Zhang
8b08ad408b
Add batch_size in all_in_one ( #9999 )
...
Add batch_size in all_in_one, except run_native_int4
2024-01-25 17:43:49 +08:00
Wang, Jian4
093e6f8f73
LLM: Add qwen CPU speculative example ( #9985 )
...
* init from gpu
* update for cpu
* update
* update
* fix xpu readme
* update
* update example prompt
* update prompt and add 72b
* update
* update
2024-01-25 17:01:34 +08:00
Yishuo Wang
bf65548d29
Add quantize kv cache support for chatglm2/3 ( #9996 )
2024-01-25 16:55:59 +08:00
Chen, Zhentao
86055d76d5
fix optimize_model not working ( #9995 )
2024-01-25 16:39:05 +08:00
Wang, Jian4
9bff84e6fd
LLM: Convert draft_model kv_cache from bf16 to fp32 ( #9964 )
...
* convert bf16 to fp32 (see the sketch below)
* update
* change when init
* initialize first and cut off afterwards
* init and exchange
* update python type
* update
* fix bug
* update
* update
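A hedged sketch of the conversion described above; the cache layout and helper name are assumptions for illustration:

```python
import torch

# Convert a draft model's (key, value) cache pairs from bf16 to fp32 so
# they match the dtype the rest of the decoding loop expects.
def draft_kv_to_fp32(past_key_values):
    return tuple((k.to(torch.float32), v.to(torch.float32))
                 for k, v in past_key_values)
```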
2024-01-25 11:20:27 +08:00
ZehuaCao
51aa8b62b2
add gradio_web_ui to llm-serving image ( #9918 )
2024-01-25 11:11:39 +08:00
Yina Chen
99ff6cf048
Update gpu spec decoding baichuan2 example dependency ( #9990 )
...
* add dependency
* update
* update
2024-01-25 11:05:04 +08:00
Yina Chen
27338540c3
Fix repetition_penalty not activated issue ( #9989 )
2024-01-25 10:40:41 +08:00
Jason Dai
3bc3d0bbcd
Update self-speculative readme ( #9986 )
2024-01-24 22:37:32 +08:00
Yuwen Hu
b27e5a27b9
Remove the check for meta device in _replace_with_low_bit_linear ( #9984 )
2024-01-24 18:15:39 +08:00
Ruonan Wang
d4f65a6033
LLM: add mistral speculative example ( #9976 )
...
* add mistral example
* update
2024-01-24 17:35:15 +08:00
Yina Chen
b176cad75a
LLM: Add baichuan2 gpu spec example ( #9973 )
...
* add baichuan2 gpu spec example
* update readme & example
* remove print
* fix typo
* meet comments
* revert
* update
2024-01-24 16:40:16 +08:00
Jinyi Wan
ec2d9de0ea
Fix README.md for solar ( #9957 )
2024-01-24 15:50:54 +08:00
Mingyu Wei
bc9cff51a8
LLM GPU Example Update for Windows Support ( #9902 )
...
* Update README in LLM GPU Examples
* Update reference of Intel GPU
* add cpu_embedding=True in comment
* small fixes
* update GPU/README.md and add explanation for cpu_embedding=True (see the usage sketch below)
* address comments
* fix small typos
* add backtick for cpu_embedding=True
* remove extra backtick in the doc
* add period mark
* update readme
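A short usage sketch of the flag these examples document (the model path is a placeholder):

```python
from bigdl.llm.transformers import AutoModelForCausalLM

# On Windows Intel GPUs, keeping the embedding layer on the CPU sidesteps
# issues with the embedding op on XPU; cpu_embedding=True enables this.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # placeholder model path
    load_in_4bit=True,
    cpu_embedding=True,
).to("xpu")
```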
2024-01-24 13:42:27 +08:00
Chen, Zhentao
e0db44dcb6
fix unexpected keyword argument 'device' ( #9982 )
...
* add device for chatglm3 only
* add comment for this change
* fix style
* fix style
* fix style again..
* finally fixed style
2024-01-24 13:20:46 +08:00
Lilac09
de27ddd81a
Update Dockerfile ( #9981 )
2024-01-24 11:10:06 +08:00
Lilac09
a2718038f7
Fix qwen model adapter in docker ( #9969 )
...
* fix qwen in docker
* add patch for model_adapter.py in fastchat
* add patch for model_adapter.py in fastchat
2024-01-24 11:01:29 +08:00
Mingyu Wei
50a851e3b3
LLM: separate arc ut for disable XMX ( #9953 )
...
* separate test_optimize_model api with disabled xmx
* delete test_optimize_model in test_transformers_api.py
* set env variable in .sh / put back test_optimize_model
* unset env variable
* remove env setting in .py
* address errors in action
* remove import ipex
* lower tolerance
2024-01-23 19:04:47 +08:00
Yuwen Hu
8d28aa8e2b
[LLM] Fix the model.device problem when cpu_embedding=True ( #9971 )
...
* Overwrite the device attribute for CPUPinnedParam (see the sketch below)
* Expose cpu_embedding=True for Linux users
* Fix python style
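A hedged sketch of the idea: a CPU-pinned parameter that reports the compute device, so `model.device` stays consistent when cpu_embedding=True. The class shape is an assumption, not the exact bigdl-llm code.

```python
import torch

class CPUPinnedParam(torch.nn.Parameter):
    # The tensor data lives in CPU pinned memory, but we report the model's
    # compute device so code that checks `param.device` is not misled.
    @property
    def device(self):
        return torch.device("xpu")
```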
2024-01-23 18:51:11 +08:00
Yishuo Wang
f82782cd3b
fix starcoder ( #9975 )
2024-01-23 17:24:53 +08:00
WeiguangHan
be5836bee1
LLM: fix outlier value ( #9945 )
...
* fix outlier value
* small fix
2024-01-23 17:04:13 +08:00
Yishuo Wang
2c8a9aaf0d
fix qwen causal mask when quantize_kv_cache=True ( #9968 )
2024-01-23 16:34:05 +08:00
Yina Chen
5aa4b32c1b
LLM: Add qwen spec gpu example ( #9965 )
...
* add qwen spec gpu example
* update readme
---------
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-01-23 15:59:43 +08:00
Yina Chen
36c665667d
Add logits processor & qwen eos stop in speculative decoding ( #9963 )
...
* add logits processor & qwen eos
* fix style
* fix
* fix
* fix style
* fix style
* support transformers 4.31
* fix style
* fix style
---------
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-01-23 15:57:28 +08:00
Ruonan Wang
60b35db1f1
LLM: add chatglm3 speculative decoding example ( #9966 )
...
* add chatglm3 example
* update
* fix
2024-01-23 15:54:12 +08:00
Xin Qiu
da4687c917
fix fp16 ( #9970 )
2024-01-23 15:53:32 +08:00
Lilac09
052962dfa5
Use original fastchat and add bigdl worker in docker image ( #9967 )
...
* add vllm worker
* add options in entrypoint
2024-01-23 14:17:05 +08:00
Chen, Zhentao
301425e377
harness tests on pvc multiple xpus ( #9908 )
...
* add run_multi_llb.py
* update readme
* add job hint
2024-01-23 13:20:37 +08:00
Ruonan Wang
27b19106f3
LLM: add readme for speculative decoding gpu examples ( #9961 )
...
* add readme
* add readme
* meet code review
2024-01-23 12:54:19 +08:00
Chen, Zhentao
39219b7e9a
add default device meta when lcmu is enabled ( #9941 )
2024-01-23 11:00:49 +08:00
Xin Qiu
dacf680294
add fused rotary pos emb for qwen ( #9956 )
...
* add fused rotary pos emb for qwen (an unfused reference sketch follows below)
* update
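For reference, the unfused computation the fused kernel replaces is the standard rotary position embedding; the fused op computes the same result in one kernel:

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

# Standard (unfused) rotary position embedding applied to query/key states;
# cos/sin are precomputed per position and broadcast over heads.
def apply_rotary_pos_emb(q, k, cos, sin):
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin
```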
2024-01-23 10:37:56 +08:00
Ruonan Wang
7b1d9ad7c0
LLM: limit esimd sdp usage for k_len < 8 ( #9959 )
...
* update
* fix
2024-01-23 09:28:23 +08:00
Ruonan Wang
3e601f9a5d
LLM: Support speculative decoding in bigdl-llm ( #9951 )
...
* first commit
* fix error, add llama example
* hidden print
* update api usage (a usage sketch follows below)
* change to api v3
* update
* meet code review
* meet code review, fix style
* add reference, fix style
* fix style
* fix first token time
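A hedged usage sketch for the feature; the `speculative=True` flag and the dtype choice are inferred from the commit bullets and the llama example it mentions, not a verified API surface:

```python
import torch
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # placeholder model path
    optimize_model=True,
    torch_dtype=torch.bfloat16,
    speculative=True,  # assumed flag enabling self-speculative decoding
)
inputs = tokenizer("Once upon a time", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```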
2024-01-22 19:14:56 +08:00