Yishuo Wang | ea0d03fd28 | Refactor baichuan1 7B and 13B (#11258) | 2024-06-07 14:29:20 +08:00
Yishuo Wang | ef8e9b2ecd | Refactor qwen2 moe (#11244) | 2024-06-07 13:14:54 +08:00
Zhao Changmin | b7948671de | [WIP] Add look up table in 1st token stage (#11193) | 2024-06-07 10:51:05 +08:00
  * lookuptb
Xin Qiu | 2f809116e2 | optimize Chatglm4 (#11239) | 2024-06-06 18:25:20 +08:00
  * chatglm4
  * update
  * update
  * add rms norm
  * chatglm4
Yishuo Wang | 2e4ccd541c | fix qwen2 cpu (#11240) | 2024-06-06 16:24:19 +08:00
Yishuo Wang | e738ec38f4 | disable quantize kv in specific qwen model (#11238) | 2024-06-06 14:08:39 +08:00
Yishuo Wang | c4e5806e01 | add latest optimization in starcoder2 (#11236) | 2024-06-06 14:02:17 +08:00
Yishuo Wang | ba27e750b1 | refactor yuan2 (#11235) | 2024-06-06 13:17:54 +08:00
Guoqiong Song | f6d5c6af78 | fix issue 1407 (#11171) | 2024-06-05 13:35:57 -07:00
Yina Chen | ed67435491 | Support Fp6 k in ipex-llm (#11222) | 2024-06-05 17:34:36 +08:00
  * support fp6_k
  * support fp6_k
  * remove
  * fix style
binbin Deng | a6674f5bce | Fix should_use_fuse_rope error of Qwen1.5-MoE-A2.7B-Chat (#11216) | 2024-06-05 15:56:10 +08:00
Xin Qiu | 566691c5a3 | quantized attention forward for minicpm (#11200) | 2024-06-05 09:15:25 +08:00
  * quantized minicpm
  * fix style check
Jiao Wang | bb83bc23fd | Fix Starcoder issue on CPU on transformers 4.36+ (#11190) | 2024-06-04 10:05:40 -07:00
  * fix starcoder for sdpa
  * update
  * style
Xiangyu Tian | ac3d53ff5d | LLM: Fix vLLM CPU version error (#11206) | 2024-06-04 19:10:23 +08:00
  Fix vLLM CPU version error
Ruonan Wang | 1dde204775 | update q6k (#11205) | 2024-06-04 17:14:33 +08:00
Yishuo Wang | 6454655dcc | use sdp in baichuan2 13b (#11198) | 2024-06-04 15:39:00 +08:00
Yishuo Wang | d90cd977d0 | refactor stablelm (#11195) | 2024-06-04 13:14:43 +08:00
Xin Qiu | 5f13700c9f | optimize Minicpm (#11189) | 2024-06-03 18:28:29 +08:00
  * minicpm optimize
  * update
Shaojun Liu | 401013a630 | Remove chatglm_C Module to Eliminate LGPL Dependency (#11178) | 2024-05-31 17:03:11 +08:00
  * remove chatglm_C.**.pyd to solve ngsolve weak copyright vunl
  * fix style check error
  * remove chatglm native int4 from langchain
Ruonan Wang | 50b5f4476f | update q4k convert (#11179) | 2024-05-31 11:36:53 +08:00
ZehuaCao | 4127b99ed6 | Fix null pointer dereferences error. (#11125) | 2024-05-30 16:16:10 +08:00
  * delete unused function on tgi_server
  * update
  * update
  * fix style
Guancheng Fu | 50ee004ac7 | Fix vllm condition (#11169) | 2024-05-30 15:23:17 +08:00
  * add use-vllm
  * done
  * fix style
  * fix done
Ruonan Wang | 9bfbf78bf4 | update api usage of xe_batch & fp16 (#11164) | 2024-05-29 15:15:14 +08:00
  * update api usage
  * update setup.py
Yina Chen | e29e2f1c78 | Support new fp8 e4m3 (#11158) | 2024-05-29 14:27:14 +08:00
Yishuo Wang | bc5008f0d5 | disable sdp_causal in phi-3 to fix overflow (#11157) | 2024-05-28 17:25:53 +08:00
SONG Ge | 33852bd23e | Refactor pipeline parallel device config (#11149) | 2024-05-28 16:52:46 +08:00
  * refactor pipeline parallel device config
  * meet comments
  * update example
  * add warnings and update code doc
Yishuo Wang | d307622797 | fix first token sdp with batch (#11153) | 2024-05-28 15:03:06 +08:00
Yina Chen | 3464440839 | fix qwen import error (#11154) | 2024-05-28 14:50:12 +08:00
Yina Chen | b6b70d1ba0 | Divide core-xe packages (#11131) | 2024-05-28 12:00:18 +08:00
  * temp
  * add batch
  * fix style
  * update package name
  * fix style
  * add workflow
  * use temp version to run uts
  * trigger performance test
  * trigger win igpu perf
  * revert workflow & setup
binbin Deng | c9168b85b7 | Fix error during merging adapter (#11145) | 2024-05-27 19:41:42 +08:00
binbin Deng | 367de141f2 | Fix mixtral-8x7b with transformers=4.37.0 (#11132) | 2024-05-27 09:50:54 +08:00
ZehuaCao | 63e95698eb | [LLM]Reopen autotp generate_stream (#11120) | 2024-05-24 17:16:14 +08:00
  * reopen autotp generate_stream
  * fix style error
  * update
Yishuo Wang | 1dc680341b | fix phi-3-vision import (#11129) | 2024-05-24 15:57:15 +08:00
Guancheng Fu | 7f772c5a4f | Add half precision for fastchat models (#11130) | 2024-05-24 15:41:14 +08:00
Zhao Changmin | 65f4212f89 | Fix qwen 14b run into register attention fwd (#11128) | 2024-05-24 14:45:07 +08:00
  * fix qwen 14b
Yishuo Wang | 1db9d9a63b | optimize internlm2 xcomposer agin (#11124) | 2024-05-24 13:44:52 +08:00
Yishuo Wang | 9372ce87ce | fix internlm xcomposer2 fp16 (#11123) | 2024-05-24 11:03:31 +08:00
Cengguang Zhang | 011b9faa5c | LLM: unify baichuan2-13b alibi mask dtype with model dtype. (#11107) | 2024-05-24 10:27:53 +08:00
  * LLM: unify alibi mask dtype.
  * fix comments.
Yishuo Wang | 797dbc48b8 | fix phi-2 and phi-3 convert (#11116) | 2024-05-23 17:37:37 +08:00
Yishuo Wang | 37b98a531f | support running internlm xcomposer2 on gpu and add sdp optimization (#11115) | 2024-05-23 17:26:24 +08:00
Zhao Changmin | c5e8b90c8d | Add Qwen register attention implemention (#11110) | 2024-05-23 17:17:45 +08:00
  * qwen_register
Yishuo Wang | 0e53f20edb | support running internlm-xcomposer2 on cpu (#11111) | 2024-05-23 16:36:09 +08:00
Yishuo Wang | cd4dff09ee | support phi-3 vision (#11101) | 2024-05-22 17:43:50 +08:00
Xin Qiu | 71bcd18f44 | fix qwen vl (#11090) | 2024-05-21 18:40:29 +08:00
Yishuo Wang | f00625f9a4 | refactor qwen2 (#11087) | 2024-05-21 16:53:42 +08:00
Yishuo Wang | d830a63bb7 | refactor qwen (#11074) | 2024-05-20 18:08:37 +08:00
Yishuo Wang | 4e97047d70 | fix baichuan2 13b fp16 (#11071) | 2024-05-20 11:21:20 +08:00
Yishuo Wang | 31ce3e0c13 | refactor baichuan2-13b (#11064) | 2024-05-17 16:25:30 +08:00
Ruonan Wang | f1156e6b20 | support gguf_q4k_m / gguf_q4k_s (#10887) | 2024-05-17 14:30:09 +08:00
  * initial commit
  * UPDATE
  * fix style
  * fix style
  * add gguf_q4k_s
  * update comment
  * fix
Yishuo Wang | 981d668be6 | refactor baichuan2-7b (#11062) | 2024-05-17 13:01:34 +08:00
Ruonan Wang | 3a72e5df8c | disable mlp fusion of fp6 on mtl (#11059) | 2024-05-17 10:10:16 +08:00
SONG Ge | 192ae35012 | Add support for llama2 quantize_kv with transformers 4.38.0 (#11054) | 2024-05-16 22:23:39 +08:00
  * add support for llama2 quantize_kv with transformers 4.38.0
  * fix code style
  * fix code style
SONG Ge | 16b2a418be | hotfix native_sdp ut (#11046) | 2024-05-16 17:15:37 +08:00
  * hotfix native_sdp
  * update
Xin Qiu | 6be70283b7 | fix chatglm run error (#11045) | 2024-05-16 15:39:18 +08:00
  * fix chatglm
  * update
  * fix style
Yishuo Wang | 8cae897643 | use new rope in phi3 (#11047) | 2024-05-16 15:12:35 +08:00
Yishuo Wang | 59df750326 | Use new sdp again (#11025) | 2024-05-16 09:33:34 +08:00
SONG Ge | 9942a4ba69 | [WIP] Support llama2 with transformers==4.38.0 (#11024) | 2024-05-15 18:07:00 +08:00
  * support llama2 with transformers==4.38.0
  * add supprot for quantize_qkv
  * add original support for 4.38.0 now
  * code style fix
Yina Chen | 686f6038a8 | Support fp6 save & load (#11034) | 2024-05-15 17:52:02 +08:00
Ruonan Wang | ac384e0f45 | add fp6 mlp fusion (#11032) | 2024-05-15 17:42:50 +08:00
  * add fp6 fusion
  * add qkv fusion for fp6
  * remove qkv first
hxsz1997 | 93d40ab127 | Update lookahead strategy (#11021) | 2024-05-15 14:48:05 +08:00
  * update lookahead strategy
  * remove lines
  * fix python style check
Yishuo Wang | fad1dbaf60 | use sdp fp8 causal kernel (#11023) | 2024-05-15 10:22:35 +08:00
Yishuo Wang | ee325e9cc9 | fix phi3 (#11022) | 2024-05-15 09:32:12 +08:00
Zhao Changmin | 0a732bebe7 | Add phi3 cached RotaryEmbedding (#11013) | 2024-05-15 08:16:43 +08:00
  * phi3cachedrotaryembed
  * pep8
Yina Chen | 893197434d | Add fp6 support on gpu (#11008) | 2024-05-14 16:31:44 +08:00
  * add fp6 support
  * fix style
Zhao Changmin | b03c859278 | Add phi3RMS (#10988) | 2024-05-14 15:16:27 +08:00
  * phi3RMS
Yishuo Wang | 170e3d65e0 | use new sdp and fp32 sdp (#11007) | 2024-05-14 14:29:18 +08:00
Guancheng Fu | 74997a3ed1 | Adding load_low_bit interface for ipex_llm_worker (#11000) | 2024-05-13 15:30:19 +08:00
  * initial implementation, need tests
  * fix
  * fix baichuan issue
  * fix typo
Yishuo Wang | 1b3c7a6928 | remove phi3 empty cache (#10997) | 2024-05-13 14:09:55 +08:00
Yishuo Wang | ad96f32ce0 | optimize phi3 1st token performance (#10981) | 2024-05-10 17:33:46 +08:00
Cengguang Zhang | cfed76b2ed | LLM: add long-context support for Qwen1.5-7B/Baichuan2-7B/Mistral-7B. (#10937) | 2024-05-10 16:40:15 +08:00
  * LLM: add split tensor support for baichuan2-7b and qwen1.5-7b.
  * fix style.
  * fix style.
  * fix style.
  * add support for mistral and fix condition threshold.
  * fix style.
  * fix comments.
Kai Huang | a6342cc068 | Empty cache after phi first attention to support 4k input (#10972) | 2024-05-09 19:50:04 +08:00
  * empty cache
  * fix style
Yishuo Wang | e753125880 | use fp16_sdp when head_dim=96 (#10976) | 2024-05-09 17:02:59 +08:00
Yishuo Wang | 697ca79eca | use quantize kv and sdp in phi3-mini (#10973) | 2024-05-09 15:16:18 +08:00
Wang, Jian4 | 3209d6b057 | Fix spculative llama3 no stop error (#10963) | 2024-05-08 17:09:47 +08:00
  * fix normal
  * add eos_tokens_id on sp and add list if
  * update
  * no none
Yishuo Wang | 2ebec0395c | optimize phi-3-mini-128 (#10959) | 2024-05-08 16:33:17 +08:00
Zhao Changmin | 0d6e12036f | Disable fast_init_ in load_low_bit (#10945) | 2024-05-08 10:46:19 +08:00
  * fast_init_ disable
Yishuo Wang | c801c37bc6 | optimize phi3 again: use quantize kv if possible (#10953) | 2024-05-07 17:26:19 +08:00
Yishuo Wang | aa2fa9fde1 | optimize phi3 again: use sdp if possible (#10951) | 2024-05-07 15:53:08 +08:00
Qiyuan Gong | d7ca5d935b | Upgrade Peft version to 0.10.0 for LLM finetune (#10886) | 2024-05-07 15:09:14 +08:00
  * Upgrade Peft version to 0.10.0
  * Upgrade Peft version in ARC unit test and HF-Peft example.
Wang, Jian4 | 191b184341 | LLM: Optimize cohere model (#10878) | 2024-05-07 10:19:50 +08:00
  * use mlp and rms
  * optimize kv_cache
  * add fuse qkv
  * add flash attention and fp16 sdp
  * error fp8 sdp
  * fix optimized
  * fix style
  * update
  * add for pp
Guancheng Fu | 49ab5a2b0e | Add embeddings (#10931) | 2024-05-07 09:07:02 +08:00
Wang, Jian4 | 0e0bd309e2 | LLM: Enable Speculative on Fastchat (#10909) | 2024-05-06 10:06:20 +08:00
  * init
  * enable streamer
  * update
  * update
  * remove deprecated
  * update
  * update
  * add gpu example
Cengguang Zhang | 75dbf240ec | LLM: update split tensor conditions. (#10872) | 2024-04-30 17:07:21 +08:00
  * LLM: update split tensor condition.
  * add cond for split tensor.
  * update priority of env.
  * fix style.
  * update env name.
Guancheng Fu | 2c64754eb0 | Add vLLM to ipex-llm serving image (#10807) | 2024-04-29 17:25:42 +08:00
  * add vllm
  * done
  * doc work
  * fix done
  * temp
  * add docs
  * format
  * add start-fastchat-service.sh
  * fix
Yishuo Wang | d884c62dc4 | remove new_layout parameter (#10906) | 2024-04-29 10:31:50 +08:00
Guancheng Fu | fbcd7bc737 | Fix Loader issue with dtype fp16 (#10907) | 2024-04-29 10:16:02 +08:00
Guancheng Fu | c9fac8c26b | Fix sdp logic (#10896) | 2024-04-28 22:02:14 +08:00
  * fix
  * fix
Yina Chen | 015d07a58f | Fix lookahead sample error & add update strategy (#10894) | 2024-04-28 17:21:00 +08:00
  * Fix sample error & add update strategy
  * add mtl config
  * fix style
  * remove print
Cengguang Zhang | 9752ffe979 | LLM: update split qkv native sdp. (#10895) | 2024-04-26 18:47:35 +08:00
  * LLM: update split qkv native sdp.
  * fix typo.
Guancheng Fu | 990535b1cf | Add tensor parallel for vLLM (#10879) | 2024-04-26 17:10:49 +08:00
  * initial
  * test initial tp
  * initial sup
  * fix format
  * fix
  * fix
Yishuo Wang | 46ba962168 | use new quantize kv (#10888) | 2024-04-26 14:42:17 +08:00
Wang, Jian4 | 3e8ed54270 | LLM: Fix bigdl_ipex_int8 warning (#10890) | 2024-04-26 11:18:44 +08:00
Yina Chen | 8811f268ff | Use new fp16 sdp in Qwen and modify the constraint (#10882) | 2024-04-25 19:23:37 +08:00
Yang Wang | 1ce8d7bcd9 | Support the desc_act feature in GPTQ model (#10851) | 2024-04-24 10:17:13 -07:00
  * support act_order
  * update versions
  * fix style
  * fix bug
  * clean up
Yina Chen | dc27b3bc35 | Use sdp when rest token seq_len > 1 in llama & mistral (for lookup & spec) (#10790) | 2024-04-24 17:24:01 +08:00
  * update sdp condition
  * update
  * fix
  * update & test llama
  * mistral
  * fix style
  * update
  * fix style
  * remove pvc constrain
  * update ds on arc
  * fix style
binbin Deng | c9feffff9a | LLM: support Qwen1.5-MoE-A2.7B-Chat pipeline parallel inference (#10864) | 2024-04-24 16:02:27 +08:00
Yishuo Wang | 2d210817ff | add phi3 optimization (#10871) | 2024-04-24 15:17:40 +08:00
Cengguang Zhang | 763413b7e1 | LLM: support llama split tensor for long context in transformers>=4.36. (#10844) | 2024-04-23 16:13:25 +08:00
  * LLm: support llama split tensor for long context in transformers>=4.36.
  * fix dtype.
  * fix style.
  * fix style.
  * fix style.
  * fix style.
  * fix dtype.
  * fix style.
ZehuaCao | 92ea54b512 | Fix speculative decoding bug (#10855) | 2024-04-23 14:28:31 +08:00
Wang, Jian4 | 18c032652d | LLM: Add mixtral speculative CPU example (#10830) | 2024-04-23 10:05:51 +08:00
  * init mixtral sp example
  * use different prompt_format
  * update output
  * update