ipex-llm

Author	SHA1	Message	Date
Yishuo Wang	e738ec38f4	disable quantize kv in specific qwen model (#11238 )	2024-06-06 14:08:39 +08:00
Yishuo Wang	c4e5806e01	add latest optimization in starcoder2 (#11236 )	2024-06-06 14:02:17 +08:00
Yishuo Wang	ba27e750b1	refactor yuan2 (#11235 )	2024-06-06 13:17:54 +08:00
Guoqiong Song	f6d5c6af78	fix issue 1407 (#11171 )	2024-06-05 13:35:57 -07:00
Yina Chen	ed67435491	Support Fp6 k in ipex-llm (#11222 ) * support fp6_k * support fp6_k * remove * fix style	2024-06-05 17:34:36 +08:00
binbin Deng	a6674f5bce	Fix `should_use_fuse_rope` error of Qwen1.5-MoE-A2.7B-Chat (#11216 )	2024-06-05 15:56:10 +08:00
Xin Qiu	566691c5a3	quantized attention forward for minicpm (#11200 ) * quantized minicpm * fix style check	2024-06-05 09:15:25 +08:00
Jiao Wang	bb83bc23fd	Fix Starcoder issue on CPU on transformers 4.36+ (#11190 ) * fix starcoder for sdpa * update * style	2024-06-04 10:05:40 -07:00
Xiangyu Tian	ac3d53ff5d	LLM: Fix vLLM CPU version error (#11206 ) Fix vLLM CPU version error	2024-06-04 19:10:23 +08:00
Ruonan Wang	1dde204775	update q6k (#11205 )	2024-06-04 17:14:33 +08:00
Qiyuan Gong	ce3f08b25a	Fix IPEX auto importer (#11192 ) * Fix ipex auto importer with Python builtins. * Raise errors if the user imports ipex manually before importing ipex_llm. Do nothing if they import ipex after importing ipex_llm. * Remove import ipex in examples.	2024-06-04 16:57:18 +08:00
Yishuo Wang	6454655dcc	use sdp in baichuan2 13b (#11198 )	2024-06-04 15:39:00 +08:00
Yishuo Wang	d90cd977d0	refactor stablelm (#11195 )	2024-06-04 13:14:43 +08:00
Xin Qiu	5f13700c9f	optimize Minicpm (#11189 ) * minicpm optimize * update	2024-06-03 18:28:29 +08:00
Shaojun Liu	401013a630	Remove chatglm_C Module to Eliminate LGPL Dependency (#11178 ) * remove chatglm_C.*.pyd to solve ngsolve weak copyright vunl fix style check error * remove chatglm native int4 from langchain	2024-05-31 17:03:11 +08:00
Ruonan Wang	50b5f4476f	update q4k convert (#11179 )	2024-05-31 11:36:53 +08:00
ZehuaCao	4127b99ed6	Fix null pointer dereferences error. (#11125 ) * delete unused function on tgi_server * update * update * fix style	2024-05-30 16:16:10 +08:00
Guancheng Fu	50ee004ac7	Fix vllm condition (#11169 ) * add use-vllm * done * fix style * fix done	2024-05-30 15:23:17 +08:00
Ruonan Wang	9bfbf78bf4	update api usage of xe_batch & fp16 (#11164 ) * update api usage * update setup.py	2024-05-29 15:15:14 +08:00
Yina Chen	e29e2f1c78	Support new fp8 e4m3 (#11158 )	2024-05-29 14:27:14 +08:00
Yishuo Wang	bc5008f0d5	disable sdp_causal in phi-3 to fix overflow (#11157 )	2024-05-28 17:25:53 +08:00
SONG Ge	33852bd23e	Refactor pipeline parallel device config (#11149 ) * refactor pipeline parallel device config * meet comments * update example * add warnings and update code doc	2024-05-28 16:52:46 +08:00
Yishuo Wang	d307622797	fix first token sdp with batch (#11153 )	2024-05-28 15:03:06 +08:00
Yina Chen	3464440839	fix qwen import error (#11154 )	2024-05-28 14:50:12 +08:00
Yina Chen	b6b70d1ba0	Divide core-xe packages (#11131 ) * temp * add batch * fix style * update package name * fix style * add workflow * use temp version to run uts * trigger performance test * trigger win igpu perf * revert workflow & setup	2024-05-28 12:00:18 +08:00
binbin Deng	c9168b85b7	Fix error during merging adapter (#11145 )	2024-05-27 19:41:42 +08:00
Guancheng Fu	daf7b1cd56	[Docker] Fix image using two cards error (#11144 ) * fix all * done	2024-05-27 16:20:13 +08:00
binbin Deng	367de141f2	Fix mixtral-8x7b with transformers=4.37.0 (#11132 )	2024-05-27 09:50:54 +08:00
Guancheng Fu	fabc395d0d	add langchain vllm interface (#11121 ) * done * fix * fix * add vllm * add langchain vllm exampels * add docs * temp	2024-05-24 17:19:27 +08:00
ZehuaCao	63e95698eb	[LLM]Reopen autotp generate_stream (#11120 ) * reopen autotp generate_stream * fix style error * update	2024-05-24 17:16:14 +08:00
Yishuo Wang	1dc680341b	fix phi-3-vision import (#11129 )	2024-05-24 15:57:15 +08:00
Guancheng Fu	7f772c5a4f	Add half precision for fastchat models (#11130 )	2024-05-24 15:41:14 +08:00
Zhao Changmin	65f4212f89	Fix qwen 14b run into register attention fwd (#11128 ) * fix qwen 14b	2024-05-24 14:45:07 +08:00
Yishuo Wang	1db9d9a63b	optimize internlm2 xcomposer agin (#11124 )	2024-05-24 13:44:52 +08:00
Yishuo Wang	9372ce87ce	fix internlm xcomposer2 fp16 (#11123 )	2024-05-24 11:03:31 +08:00
Cengguang Zhang	011b9faa5c	LLM: unify baichuan2-13b alibi mask dtype with model dtype. (#11107 ) * LLM: unify alibi mask dtype. * fix comments.	2024-05-24 10:27:53 +08:00
Jiao Wang	0a06a6e1d4	Update tests for transformers 4.36 (#10858 ) * update unit test * update * update * update * update * update * fix gpu attention test * update * update * update * update * update * update * update example test * replace replit code * update * update * update * update * set safe_serialization false * perf test * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * delete * update * update * update * update * update * update * revert * update	2024-05-24 10:26:38 +08:00
Xiangyu Tian	b3f6faa038	LLM: Add CPU vLLM entrypoint (#11083 ) Add CPU vLLM entrypoint and update CPU vLLM serving example.	2024-05-24 09:16:59 +08:00
Yishuo Wang	797dbc48b8	fix phi-2 and phi-3 convert (#11116 )	2024-05-23 17:37:37 +08:00
Yishuo Wang	37b98a531f	support running internlm xcomposer2 on gpu and add sdp optimization (#11115 )	2024-05-23 17:26:24 +08:00
Zhao Changmin	c5e8b90c8d	Add Qwen register attention implemention (#11110 ) * qwen_register	2024-05-23 17:17:45 +08:00
Yishuo Wang	0e53f20edb	support running internlm-xcomposer2 on cpu (#11111 )	2024-05-23 16:36:09 +08:00
Yishuo Wang	cd4dff09ee	support phi-3 vision (#11101 )	2024-05-22 17:43:50 +08:00
Xin Qiu	71bcd18f44	fix qwen vl (#11090 )	2024-05-21 18:40:29 +08:00
Yishuo Wang	f00625f9a4	refactor qwen2 (#11087 )	2024-05-21 16:53:42 +08:00
Yishuo Wang	d830a63bb7	refactor qwen (#11074 )	2024-05-20 18:08:37 +08:00
Wang, Jian4	74950a152a	Fix tgi_api_server error file name (#11075 )	2024-05-20 16:48:40 +08:00
Yishuo Wang	4e97047d70	fix baichuan2 13b fp16 (#11071 )	2024-05-20 11:21:20 +08:00
Wang, Jian4	a2e1578fd9	Merge tgi_api_server to main (#11036 ) * init * fix style * speculative can not use benchmark * add tgi server readme	2024-05-20 09:15:03 +08:00
Yishuo Wang	31ce3e0c13	refactor baichuan2-13b (#11064 )	2024-05-17 16:25:30 +08:00
Ruonan Wang	f1156e6b20	support gguf_q4k_m / gguf_q4k_s (#10887 ) * initial commit * UPDATE * fix style * fix style * add gguf_q4k_s * update comment * fix	2024-05-17 14:30:09 +08:00
Yishuo Wang	981d668be6	refactor baichuan2-7b (#11062 )	2024-05-17 13:01:34 +08:00
Ruonan Wang	3a72e5df8c	disable mlp fusion of fp6 on mtl (#11059 )	2024-05-17 10:10:16 +08:00
SONG Ge	192ae35012	Add support for llama2 quantize_kv with transformers 4.38.0 (#11054 ) * add support for llama2 quantize_kv with transformers 4.38.0 * fix code style * fix code style	2024-05-16 22:23:39 +08:00
SONG Ge	16b2a418be	hotfix native_sdp ut (#11046 ) * hotfix native_sdp * update	2024-05-16 17:15:37 +08:00
Xin Qiu	6be70283b7	fix chatglm run error (#11045 ) * fix chatglm * update * fix style	2024-05-16 15:39:18 +08:00
Yishuo Wang	8cae897643	use new rope in phi3 (#11047 )	2024-05-16 15:12:35 +08:00
Yishuo Wang	59df750326	Use new sdp again (#11025 )	2024-05-16 09:33:34 +08:00
SONG Ge	9942a4ba69	[WIP] Support llama2 with transformers==4.38.0 (#11024 ) * support llama2 with transformers==4.38.0 * add supprot for quantize_qkv * add original support for 4.38.0 now * code style fix	2024-05-15 18:07:00 +08:00
Yina Chen	686f6038a8	Support fp6 save & load (#11034 )	2024-05-15 17:52:02 +08:00
Ruonan Wang	ac384e0f45	add fp6 mlp fusion (#11032 ) * add fp6 fusion * add qkv fusion for fp6 * remove qkv first	2024-05-15 17:42:50 +08:00
Wang, Jian4	2084ebe4ee	Enable fastchat benchmark latency (#11017 ) * enable fastchat benchmark * add readme * update readme * update	2024-05-15 14:52:09 +08:00
hxsz1997	93d40ab127	Update lookahead strategy (#11021 ) * update lookahead strategy * remove lines * fix python style check	2024-05-15 14:48:05 +08:00
Wang, Jian4	d9f71f1f53	Update benchmark util for example using (#11027 ) * mv benchmark_util.py to utils/ * remove * update	2024-05-15 14:16:35 +08:00
Yishuo Wang	fad1dbaf60	use sdp fp8 causal kernel (#11023 )	2024-05-15 10:22:35 +08:00
Yishuo Wang	ee325e9cc9	fix phi3 (#11022 )	2024-05-15 09:32:12 +08:00
Zhao Changmin	0a732bebe7	Add phi3 cached RotaryEmbedding (#11013 ) * phi3cachedrotaryembed * pep8	2024-05-15 08:16:43 +08:00
Yina Chen	893197434d	Add fp6 support on gpu (#11008 ) * add fp6 support * fix style	2024-05-14 16:31:44 +08:00
Zhao Changmin	b03c859278	Add phi3RMS (#10988 ) * phi3RMS	2024-05-14 15:16:27 +08:00
Yishuo Wang	170e3d65e0	use new sdp and fp32 sdp (#11007 )	2024-05-14 14:29:18 +08:00
Guancheng Fu	a465111cf4	Update README.md (#11003 )	2024-05-13 16:44:48 +08:00
Guancheng Fu	74997a3ed1	Adding load_low_bit interface for ipex_llm_worker (#11000 ) * initial implementation, need tests * fix * fix baichuan issue * fix typo	2024-05-13 15:30:19 +08:00
Yishuo Wang	1b3c7a6928	remove phi3 empty cache (#10997 )	2024-05-13 14:09:55 +08:00
Yishuo Wang	ad96f32ce0	optimize phi3 1st token performance (#10981 )	2024-05-10 17:33:46 +08:00
Cengguang Zhang	cfed76b2ed	LLM: add long-context support for Qwen1.5-7B/Baichuan2-7B/Mistral-7B. (#10937 ) * LLM: add split tensor support for baichuan2-7b and qwen1.5-7b. * fix style. * fix style. * fix style. * add support for mistral and fix condition threshold. * fix style. * fix comments.	2024-05-10 16:40:15 +08:00
Kai Huang	a6342cc068	Empty cache after phi first attention to support 4k input (#10972 ) * empty cache * fix style	2024-05-09 19:50:04 +08:00
Yishuo Wang	e753125880	use fp16_sdp when head_dim=96 (#10976 )	2024-05-09 17:02:59 +08:00
Yishuo Wang	697ca79eca	use quantize kv and sdp in phi3-mini (#10973 )	2024-05-09 15:16:18 +08:00
Wang, Jian4	3209d6b057	Fix spculative llama3 no stop error (#10963 ) * fix normal * add eos_tokens_id on sp and add list if * update * no none	2024-05-08 17:09:47 +08:00
Yishuo Wang	2ebec0395c	optimize phi-3-mini-128 (#10959 )	2024-05-08 16:33:17 +08:00
Zhao Changmin	0d6e12036f	Disable fast_init_ in load_low_bit (#10945 ) * fast_init_ disable	2024-05-08 10:46:19 +08:00
Yishuo Wang	c801c37bc6	optimize phi3 again: use quantize kv if possible (#10953 )	2024-05-07 17:26:19 +08:00
Yishuo Wang	aa2fa9fde1	optimize phi3 again: use sdp if possible (#10951 )	2024-05-07 15:53:08 +08:00
Qiyuan Gong	d7ca5d935b	Upgrade Peft version to 0.10.0 for LLM finetune (#10886 ) * Upgrade Peft version to 0.10.0 * Upgrade Peft version in ARC unit test and HF-Peft example.	2024-05-07 15:09:14 +08:00
Wang, Jian4	191b184341	LLM: Optimize cohere model (#10878 ) * use mlp and rms * optimize kv_cache * add fuse qkv * add flash attention and fp16 sdp * error fp8 sdp * fix optimized * fix style * update * add for pp	2024-05-07 10:19:50 +08:00
Guancheng Fu	49ab5a2b0e	Add embeddings (#10931 )	2024-05-07 09:07:02 +08:00
Wang, Jian4	0e0bd309e2	LLM: Enable Speculative on Fastchat (#10909 ) * init * enable streamer * update * update * remove deprecated * update * update * add gpu example	2024-05-06 10:06:20 +08:00
Cengguang Zhang	75dbf240ec	LLM: update split tensor conditions. (#10872 ) * LLM: update split tensor condition. * add cond for split tensor. * update priority of env. * fix style. * update env name.	2024-04-30 17:07:21 +08:00
Guancheng Fu	2c64754eb0	Add vLLM to ipex-llm serving image (#10807 ) * add vllm * done * doc work * fix done * temp * add docs * format * add start-fastchat-service.sh * fix	2024-04-29 17:25:42 +08:00
Yishuo Wang	d884c62dc4	remove new_layout parameter (#10906 )	2024-04-29 10:31:50 +08:00
Guancheng Fu	fbcd7bc737	Fix Loader issue with dtype fp16 (#10907 )	2024-04-29 10:16:02 +08:00
Guancheng Fu	c9fac8c26b	Fix sdp logic (#10896 ) * fix * fix	2024-04-28 22:02:14 +08:00
Yina Chen	015d07a58f	Fix lookahead sample error & add update strategy (#10894 ) * Fix sample error & add update strategy * add mtl config * fix style * remove print	2024-04-28 17:21:00 +08:00
Cengguang Zhang	9752ffe979	LLM: update split qkv native sdp. (#10895 ) * LLM: update split qkv native sdp. * fix typo.	2024-04-26 18:47:35 +08:00
Guancheng Fu	990535b1cf	Add tensor parallel for vLLM (#10879 ) * initial * test initial tp * initial sup * fix format * fix * fix	2024-04-26 17:10:49 +08:00
Yishuo Wang	46ba962168	use new quantize kv (#10888 )	2024-04-26 14:42:17 +08:00
Wang, Jian4	3e8ed54270	LLM: Fix bigdl_ipex_int8 warning (#10890 )	2024-04-26 11:18:44 +08:00
Yina Chen	8811f268ff	Use new fp16 sdp in Qwen and modify the constraint (#10882 )	2024-04-25 19:23:37 +08:00
Yang Wang	1ce8d7bcd9	Support the `desc_act` feature in GPTQ model (#10851 ) * support act_order * update versions * fix style * fix bug * clean up	2024-04-24 10:17:13 -07:00
Yina Chen	dc27b3bc35	Use sdp when rest token seq_len > 1 in llama & mistral (for lookup & spec) (#10790 ) * update sdp condition * update * fix * update & test llama * mistral * fix style * update * fix style * remove pvc constrain * update ds on arc * fix style	2024-04-24 17:24:01 +08:00
binbin Deng	c9feffff9a	LLM: support Qwen1.5-MoE-A2.7B-Chat pipeline parallel inference (#10864 )	2024-04-24 16:02:27 +08:00
Yishuo Wang	2d210817ff	add phi3 optimization (#10871 )	2024-04-24 15:17:40 +08:00
Cengguang Zhang	763413b7e1	LLM: support llama split tensor for long context in transformers>=4.36. (#10844 ) * LLm: support llama split tensor for long context in transformers>=4.36. * fix dtype. * fix style. * fix style. * fix style. * fix style. * fix dtype. * fix style.	2024-04-23 16:13:25 +08:00
ZehuaCao	92ea54b512	Fix speculative decoding bug (#10855 )	2024-04-23 14:28:31 +08:00
Wang, Jian4	18c032652d	LLM: Add mixtral speculative CPU example (#10830 ) * init mixtral sp example * use different prompt_format * update output * update	2024-04-23 10:05:51 +08:00
Yishuo Wang	fe5a082b84	add phi-2 optimization (#10843 )	2024-04-22 18:56:47 +08:00
Guancheng Fu	47bd5f504c	[vLLM]Remove vllm-v1, refactor v2 (#10842 ) * remove vllm-v1 * fix format	2024-04-22 17:51:32 +08:00
Wang, Jian4	23c6a52fb0	LLM: Fix ipex torchscript=True error (#10832 ) * remove * update * remove torchscript	2024-04-22 15:53:09 +08:00
Yina Chen	3daad242b8	Fix No module named 'transformers.cache_utils' with transformers < 4.36 (#10835 ) * update sdp condition * update * fix * fix 431 error * revert sdp & style fix * fix * meet comments	2024-04-22 14:05:50 +08:00
Guancheng Fu	caf75beef8	Disable sdpa (#10814 )	2024-04-19 17:33:18 +08:00
Yishuo Wang	57edf2033c	fix lookahead with transformers >= 4.36 (#10808 )	2024-04-19 16:24:56 +08:00
Ovo233	1a885020ee	Updated importing of top_k_top_p_filtering for transformers>=4.39.0 (#10794 ) * In transformers>=4.39.0, the top_k_top_p_filtering function has been deprecated and moved to the hugging face package trl. Thus, for versions >= 4.39.0, import this function from trl.	2024-04-19 15:34:39 +08:00
Yishuo Wang	08458b4f74	remove rms norm copy (#10793 )	2024-04-19 13:57:48 +08:00
Ruonan Wang	754b0ffecf	Fix pvc llama (#10798 ) * ifx * update	2024-04-18 10:44:57 -07:00
Ruonan Wang	439c834ed3	LLM: add mixed precision for lm_head (#10795 ) * add mixed_quantization * meet code review * update * fix style * meet review	2024-04-18 19:11:31 +08:00
Yina Chen	8796401b08	Support q4k in ipex-llm (#10796 ) * support q4k * update	2024-04-18 18:55:28 +08:00
Ruonan Wang	0e8aac19e3	add q6k precision in ipex-llm (#10792 ) * add q6k * add initial 16k * update * fix style	2024-04-18 16:52:09 +08:00
Wang, Jian4	14ca42a048	LLM：Fix moe indexs error on cpu (#10791 )	2024-04-18 15:56:52 +08:00
Guancheng Fu	cbe7b5753f	Add vLLM[xpu] related code (#10779 ) * Add ipex-llm side change * add runable offline_inference * refactor to call vllm2 * Verified async server * add new v2 example * add README * fix * change dir * refactor readme.md * add experimental * fix	2024-04-18 15:29:20 +08:00
Wang, Jian4	209c3501e6	LLM: Optimize qwen1.5 moe model (#10706 ) * update moe block * fix style * enable optmize MLP * enabel kv_cache * enable fuse rope * enable fused qkv * enable flash_attention * error sdp quantize * use old api * use fuse * use xetla * fix python style * update moe_blocks num * fix output error * add cpu sdpa * update * update * update	2024-04-18 14:54:05 +08:00
Ziteng Zhang	ff040c8f01	LISA Finetuning Example (#10743 ) * enabling xetla only supports qtype=SYM_INT4 or FP8E5 * LISA Finetuning Example on gpu * update readme * add licence * Explain parameters of lisa & Move backend codes to src dir * fix style * fix style * update readme * support chatglm * fix style * fix style * update readme * fix	2024-04-18 13:48:10 +08:00
Yang Wang	952e517db9	use config rope_theta (#10787 ) * use config rope_theta * fix style	2024-04-17 20:39:11 -07:00
Guancheng Fu	31ea2f9a9f	Fix wrong output for Llama models on CPU (#10742 )	2024-04-18 11:07:27 +08:00
Xin Qiu	e764f9b1b1	Disable fast fused rope on UHD (#10780 ) * use decoding fast path * update * update * cleanup	2024-04-18 10:03:53 +08:00
Yina Chen	ea5b373a97	Add lookahead GPU example (#10785 ) * Add lookahead example * fix style & attn mask * fix typo * address comments	2024-04-17 17:41:55 +08:00
Wang, Jian4	a20271ffe4	LLM: Fix yi-6b fp16 error on pvc (#10781 ) * updat for yi fp16 * update * update	2024-04-17 16:49:59 +08:00
Yina Chen	766fe45222	Fix spec error caused by lookup pr (#10777 ) * Fix spec error * remove * fix style	2024-04-17 11:27:35 +08:00
Qiyuan Gong	f2e923b3ca	Axolotl v0.4.0 support (#10773 ) * Add Axolotl 0.4.0, remove legacy 0.3.0 support. * replace is_torch_bf16_gpu_available * Add HF_HUB_OFFLINE=1 * Move transformers out of requirement * Refine readme and qlora.yml	2024-04-17 09:49:11 +08:00
Yina Chen	899d392e2f	Support prompt lookup in ipex-llm (#10768 ) * lookup init * add lookup * fix style * remove redundant code * change param name * fix style	2024-04-16 16:52:38 +08:00
binbin Deng	0a62933d36	LLM: fix qwen AutoTP (#10766 )	2024-04-16 09:56:17 +08:00
Cengguang Zhang	3e2662c87e	LLM: fix get env KV_CACHE_ALLOC_BLOCK_LENGTH type. (#10771 )	2024-04-16 09:32:30 +08:00
binbin Deng	3d561b60ac	LLM: add `enable_xetla` parameter for `optimize_model` API (#10753 )	2024-04-15 12:18:25 +08:00
binbin Deng	c3fc8f4b90	LLM: add bs limitation for llama softmax upcast to fp32 (#10752 )	2024-04-12 15:40:25 +08:00
Yishuo Wang	8086554d33	use new fp16 sdp in llama and mistral (#10734 )	2024-04-12 10:49:02 +08:00
Yang Wang	019293e1b9	Fuse MOE indexes computation (#10716 ) * try moe * use c++ cpu to compute indexes * fix style	2024-04-11 10:12:55 -07:00
binbin Deng	70ed9397f9	LLM: fix AttributeError of FP16Linear (#10740 )	2024-04-11 17:03:56 +08:00
Cengguang Zhang	4b024b7aac	LLM: optimize chatglm2 8k input. (#10723 ) * LLM: optimize chatglm2 8k input. * rename.	2024-04-10 16:59:06 +08:00
Wang, Jian4	c9e6d42ad1	LLM: Fix chatglm3-6b-32k error (#10719 ) * fix chatglm3-6b-32k * update style	2024-04-10 11:24:06 +08:00
Keyan (Kyrie) Zhang	585c174e92	Read the value of KV_CACHE_ALLOC_BLOCK_LENGTH from the environment variables (#10707 ) * Read the value of KV_CACHE_ALLOC_BLOCK_LENGTH from the environment variables. * Fix style	2024-04-10 10:48:46 +08:00
Jiao Wang	878a97077b	Fix llava example to support transformerds 4.36 (#10614 ) * fix llava example * update	2024-04-09 13:47:07 -07:00
Zhicun	b4147a97bb	Fix dtype mismatch error (#10609 ) * fix llama * fix * fix code style * add torch type in model.py --------- Co-authored-by: arda <arda@arda-arc19.sh.intel.com>	2024-04-09 17:50:33 +08:00
Yishuo Wang	8f45e22072	fix llama2 (#10710 )	2024-04-09 17:28:37 +08:00
Yishuo Wang	e438f941f2	disable rwkv5 fp16 (#10699 )	2024-04-09 16:42:11 +08:00
binbin Deng	44922bb5c2	LLM: support baichuan2-13b using AutoTP (#10691 )	2024-04-09 14:06:01 +08:00
Yina Chen	c7422712fc	mistral 4.36 use fp16 sdp (#10704 )	2024-04-09 13:50:33 +08:00
Ovo233	dcb2038aad	Enable optimization for sentence_transformers (#10679 ) * enable optimization for sentence_transformers * fix python style check failure	2024-04-09 12:33:46 +08:00
Yang Wang	5a1f446d3c	support fp8 in xetla (#10555 ) * support fp8 in xetla * change name * adjust model file * support convert back to cpu * factor * fix bug * fix style	2024-04-08 13:22:09 -07:00
Cengguang Zhang	7c43ac0164	LLM: optimize llama natvie sdp for split qkv tensor (#10693 ) * LLM: optimize llama natvie sdp for split qkv tensor. * fix block real size. * fix comment. * fix style. * refactor.	2024-04-08 17:48:11 +08:00
Xin Qiu	1274cba79b	stablelm fp8 kv cache (#10672 ) * stablelm fp8 kvcache * update * fix * change to fp8 matmul * fix style * fix * fix * meet code review * add comment	2024-04-08 15:16:46 +08:00
Cengguang Zhang	c0cd238e40	LLM: support llama2 8k input with w4a16. (#10677 ) * LLM: support llama2 8k input with w4a16. * fix comment and style. * fix style. * fix comments and split tensor to quantized attention forward. * fix style. * refactor name. * fix style. * fix style. * fix style. * refactor checker name. * refactor native sdp split qkv tensor name. * fix style. * fix comment rename variables. * fix co-exist of intermedia results.	2024-04-08 11:43:15 +08:00
Wang, Jian4	47cabe8fcc	LLM: Fix no return_last_logit running bigdl_ipex chatglm3 (#10678 ) * fix no return_last_logits * update only for chatglm	2024-04-07 15:27:58 +08:00
Zhicun	9d8ba64c0d	Llamaindex: add tokenizer_id and support chat (#10590 ) * add tokenizer_id * fix * modify * add from_model_id and from_mode_id_low_bit * fix typo and add comment * fix python code style --------- Co-authored-by: pengyb2001 <284261055@qq.com>	2024-04-07 13:51:34 +08:00
Xiangyu Tian	08018a18df	Remove not-imported MistralConfig (#10670 )	2024-04-07 10:32:05 +08:00
Cengguang Zhang	1a9b8204a4	LLM: support int4 fp16 chatglm2-6b 8k input. (#10648 )	2024-04-07 09:39:21 +08:00
Jiao Wang	69bdbf5806	Fix vllm print error message issue (#10664 ) * update chatglm readme * Add condition to invalidInputError * update * update * style	2024-04-05 15:08:13 -07:00
Xin Qiu	4c3e493b2d	fix stablelm2 1.6b (#10656 ) * fix stablelm2 1.6b * meet code review	2024-04-03 22:15:32 +08:00
Yishuo Wang	702e686901	optimize starcoder normal kv cache (#10642 )	2024-04-03 15:27:02 +08:00
Xin Qiu	3a9ab8f1ae	fix stablelm logits diff (#10636 ) * fix logits diff * Small fixes --------- Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>	2024-04-03 15:08:12 +08:00
Zhicun	b827f534d5	Add tokenizer_id in Langchain (#10588 ) * fix low-bit * fix * fix style --------- Co-authored-by: arda <arda@arda-arc12.sh.intel.com>	2024-04-03 14:25:35 +08:00
Kai Huang	c875b3c858	Add seq len check for llama softmax upcast to fp32 (#10629 )	2024-04-03 12:05:13 +08:00
Jiao Wang	23e33a0ca1	Fix qwen-vl style (#10633 ) * update * update	2024-04-02 18:41:38 -07:00
binbin Deng	2bbd8a1548	LLM: fix llama2 FP16 & bs>1 & autotp on PVC and ARC (#10611 )	2024-04-03 09:28:04 +08:00
Jiao Wang	654dc5ba57	Fix Qwen-VL example problem (#10582 ) * update * update * update * update	2024-04-02 12:17:30 -07:00
Yuwen Hu	fd384ddfb8	Optimize StableLM (#10619 ) * Initial commit for stablelm optimizations * Small style fix * add dependency * Add mlp optimizations * Small fix * add attention forward * Remove quantize kv for now as head_dim=80 * Add merged qkv * fix lisence * Python style fix --------- Co-authored-by: qiuxin2012 <qiuxin2012cs@gmail.com>	2024-04-02 18:58:38 +08:00
Yishuo Wang	ba8cc6bd68	optimize starcoder2-3b (#10625 )	2024-04-02 17:16:29 +08:00
Shaojun Liu	a10f5a1b8d	add python style check (#10620 ) * add python style check * fix style checks * update runner * add ipex-llm-finetune-qlora-cpu-k8s to manually_build workflow * update tag to 2.1.0-SNAPSHOT	2024-04-02 16:17:56 +08:00
Cengguang Zhang	58b57177e3	LLM: support bigdl quantize kv cache env and add warning. (#10623 ) * LLM: support bigdl quantize kv cache env and add warnning. * fix style. * fix comments.	2024-04-02 15:41:08 +08:00
Kai Huang	0a95c556a1	Fix starcoder first token perf (#10612 ) * add bias check * update	2024-04-02 09:21:38 +08:00
Cengguang Zhang	e567956121	LLM: add memory optimization for llama. (#10592 ) * add initial memory optimization. * fix logic. * fix logic, * remove env var check in mlp split.	2024-04-02 09:07:50 +08:00
Ruonan Wang	bfc1caa5e5	LLM: support iq1s for llama2-70b-hf (#10596 )	2024-04-01 13:13:13 +08:00
Yishuo Wang	437a349dd6	fix rwkv with pip installer (#10591 )	2024-03-29 17:56:45 +08:00
Ruonan Wang	0136fad1d4	LLM: support iq1_s (#10564 ) * init version * update utils * remove unsed code	2024-03-29 09:43:55 +08:00
Qiyuan Gong	f4537798c1	Enable kv cache quantization by default for flex when 1 < batch <= 8 (#10584 ) * Enable kv cache quantization by default for flex when 1 < batch <= 8. * Change up bound from <8 to <=8.	2024-03-29 09:43:42 +08:00
Cengguang Zhang	b44f7adbad	LLM: Disable esimd sdp for PVC GPU when batch size>1 (#10579 ) * llm: disable esimd sdp for pvc bz>1. * fix logic. * fix: avoid call get device name twice.	2024-03-28 22:55:48 +08:00
Xin Qiu	5963239b46	Fix qwen's position_ids no enough (#10572 ) * fix position_ids * fix position_ids	2024-03-28 17:05:49 +08:00
ZehuaCao	52a2135d83	Replace ipex with ipex-llm (#10554 ) * fix ipex with ipex_llm * fix ipex with ipex_llm * update * update * update * update * update * update * update * update	2024-03-28 13:54:40 +08:00
Cheen Hau, 俊豪	1c5eb14128	Update pip install to use --extra-index-url for ipex package (#10557 ) * Change to 'pip install .. --extra-index-url' for readthedocs * Change to 'pip install .. --extra-index-url' for examples * Change to 'pip install .. --extra-index-url' for remaining files * Fix URL for ipex * Add links for ipex US and CN servers * Update ipex cpu url * remove readme * Update for github actions * Update for dockerfiles	2024-03-28 09:56:23 +08:00
binbin Deng	92dfed77be	LLM: fix abnormal output of fp16 deepspeed autotp (#10558 )	2024-03-28 09:35:48 +08:00
Xiangyu Tian	51d34ca68e	Fix wrong import in speculative (#10562 )	2024-03-27 18:21:07 +08:00
Guancheng Fu	04baac5a2e	Fix fastchat top_k (#10560 ) * fix -1 top_k * fix * done	2024-03-27 16:01:58 +08:00
binbin Deng	fc8c7904f0	LLM: fix torch_dtype setting of apply fp16 optimization through optimize_model (#10556 )	2024-03-27 14:18:45 +08:00
Ruonan Wang	ea4bc450c4	LLM: add esimd sdp for pvc (#10543 ) * add esimd sdp for pvc * update * fix * fix batch	2024-03-26 19:04:40 +08:00
Xiangyu Tian	11550d3f25	LLM: Add length check for IPEX-CPU speculative decoding (#10529 ) Add length check for IPEX-CPU speculative decoding.	2024-03-26 17:47:10 +08:00
Guancheng Fu	a3b007f3b1	[Serving] Fix fastchat breaks (#10548 ) * fix fastchat * fix doc	2024-03-26 17:03:52 +08:00
Yishuo Wang	69a28d6b4c	fix chatglm (#10540 )	2024-03-26 16:01:00 +08:00
binbin Deng	0a3e4e788f	LLM: fix mistral hidden_size setting for deepspeed autotp (#10527 )	2024-03-26 10:55:44 +08:00
Xin Qiu	1dd40b429c	enable fp4 fused mlp and qkv (#10531 ) * enable fp4 fused mlp and qkv * update qwen * update qwen2	2024-03-26 08:34:00 +08:00
Wang, Jian4	16b2ef49c6	Update_document by heyang (#30 )	2024-03-25 10:06:02 +08:00
Wang, Jian4	a1048ca7f6	Update setup.py and add new actions and add compatible mode (#25 ) * update setup.py * add new action * add compatible mode	2024-03-22 15:44:59 +08:00
Wang, Jian4	9df70d95eb	Refactor bigdl.llm to ipex_llm (#24 ) * Rename bigdl/llm to ipex_llm * rm python/llm/src/bigdl * from bigdl.llm to from ipex_llm	2024-03-22 15:41:21 +08:00

340 commits