Yishuo Wang
1b3c7a6928
remove phi3 empty cache ( #10997 )
2024-05-13 14:09:55 +08:00
Yishuo Wang
ad96f32ce0
optimize phi3 1st token performance ( #10981 )
2024-05-10 17:33:46 +08:00
Cengguang Zhang
cfed76b2ed
LLM: add long-context support for Qwen1.5-7B/Baichuan2-7B/Mistral-7B. ( #10937 )
...
* LLM: add split tensor support for baichuan2-7b and qwen1.5-7b.
* fix style.
* fix style.
* fix style.
* add support for mistral and fix condition threshold.
* fix style.
* fix comments.
2024-05-10 16:40:15 +08:00
Kai Huang
a6342cc068
Empty cache after phi first attention to support 4k input ( #10972 )
...
* empty cache
* fix style
2024-05-09 19:50:04 +08:00
Yishuo Wang
e753125880
use fp16_sdp when head_dim=96 ( #10976 )
2024-05-09 17:02:59 +08:00
Yishuo Wang
697ca79eca
use quantize kv and sdp in phi3-mini ( #10973 )
2024-05-09 15:16:18 +08:00
Wang, Jian4
3209d6b057
Fix speculative llama3 no-stop error ( #10963 )
...
* fix normal
* add eos_tokens_id on sp and add list if
* update
* no none
2024-05-08 17:09:47 +08:00
Yishuo Wang
2ebec0395c
optimize phi-3-mini-128 ( #10959 )
2024-05-08 16:33:17 +08:00
Zhao Changmin
0d6e12036f
Disable fast_init_ in load_low_bit ( #10945 )
...
* fast_init_ disable
2024-05-08 10:46:19 +08:00
Yishuo Wang
c801c37bc6
optimize phi3 again: use quantize kv if possible ( #10953 )
2024-05-07 17:26:19 +08:00
Yishuo Wang
aa2fa9fde1
optimize phi3 again: use sdp if possible ( #10951 )
2024-05-07 15:53:08 +08:00
Qiyuan Gong
d7ca5d935b
Upgrade Peft version to 0.10.0 for LLM finetune ( #10886 )
...
* Upgrade Peft version to 0.10.0
* Upgrade Peft version in ARC unit test and HF-Peft example.
2024-05-07 15:09:14 +08:00
Wang, Jian4
191b184341
LLM: Optimize cohere model ( #10878 )
...
* use mlp and rms
* optimize kv_cache
* add fuse qkv
* add flash attention and fp16 sdp
* error fp8 sdp
* fix optimized
* fix style
* update
* add for pp
2024-05-07 10:19:50 +08:00
Guancheng Fu
49ab5a2b0e
Add embeddings ( #10931 )
2024-05-07 09:07:02 +08:00
Wang, Jian4
0e0bd309e2
LLM: Enable Speculative on Fastchat ( #10909 )
...
* init
* enable streamer
* update
* update
* remove deprecated
* update
* update
* add gpu example
2024-05-06 10:06:20 +08:00
Cengguang Zhang
75dbf240ec
LLM: update split tensor conditions. ( #10872 )
...
* LLM: update split tensor condition.
* add cond for split tensor.
* update priority of env.
* fix style.
* update env name.
2024-04-30 17:07:21 +08:00
Guancheng Fu
2c64754eb0
Add vLLM to ipex-llm serving image ( #10807 )
...
* add vllm
* done
* doc work
* fix done
* temp
* add docs
* format
* add start-fastchat-service.sh
* fix
2024-04-29 17:25:42 +08:00
Yishuo Wang
d884c62dc4
remove new_layout parameter ( #10906 )
2024-04-29 10:31:50 +08:00
Guancheng Fu
fbcd7bc737
Fix Loader issue with dtype fp16 ( #10907 )
2024-04-29 10:16:02 +08:00
Guancheng Fu
c9fac8c26b
Fix sdp logic ( #10896 )
...
* fix
* fix
2024-04-28 22:02:14 +08:00
Yina Chen
015d07a58f
Fix lookahead sample error & add update strategy ( #10894 )
...
* Fix sample error & add update strategy
* add mtl config
* fix style
* remove print
2024-04-28 17:21:00 +08:00
Cengguang Zhang
9752ffe979
LLM: update split qkv native sdp. ( #10895 )
...
* LLM: update split qkv native sdp.
* fix typo.
2024-04-26 18:47:35 +08:00
Guancheng Fu
990535b1cf
Add tensor parallel for vLLM ( #10879 )
...
* initial
* test initial tp
* initial sup
* fix format
* fix
* fix
2024-04-26 17:10:49 +08:00
Yishuo Wang
46ba962168
use new quantize kv ( #10888 )
2024-04-26 14:42:17 +08:00
Wang, Jian4
3e8ed54270
LLM: Fix bigdl_ipex_int8 warning ( #10890 )
2024-04-26 11:18:44 +08:00
Yina Chen
8811f268ff
Use new fp16 sdp in Qwen and modify the constraint ( #10882 )
2024-04-25 19:23:37 +08:00
Yang Wang
1ce8d7bcd9
Support the desc_act feature in GPTQ model ( #10851 )
...
* support act_order
* update versions
* fix style
* fix bug
* clean up
2024-04-24 10:17:13 -07:00
Yina Chen
dc27b3bc35
Use sdp when rest token seq_len > 1 in llama & mistral (for lookup & spec) ( #10790 )
...
* update sdp condition
* update
* fix
* update & test llama
* mistral
* fix style
* update
* fix style
* remove pvc constrain
* update ds on arc
* fix style
2024-04-24 17:24:01 +08:00
binbin Deng
c9feffff9a
LLM: support Qwen1.5-MoE-A2.7B-Chat pipeline parallel inference ( #10864 )
2024-04-24 16:02:27 +08:00
Yishuo Wang
2d210817ff
add phi3 optimization ( #10871 )
2024-04-24 15:17:40 +08:00
Cengguang Zhang
763413b7e1
LLM: support llama split tensor for long context in transformers>=4.36. ( #10844 )
...
* LLM: support llama split tensor for long context in transformers>=4.36.
* fix dtype.
* fix style.
* fix style.
* fix style.
* fix style.
* fix dtype.
* fix style.
2024-04-23 16:13:25 +08:00
ZehuaCao
92ea54b512
Fix speculative decoding bug ( #10855 )
2024-04-23 14:28:31 +08:00
Wang, Jian4
18c032652d
LLM: Add mixtral speculative CPU example ( #10830 )
...
* init mixtral sp example
* use different prompt_format
* update output
* update
2024-04-23 10:05:51 +08:00
Yishuo Wang
fe5a082b84
add phi-2 optimization ( #10843 )
2024-04-22 18:56:47 +08:00
Guancheng Fu
47bd5f504c
[vLLM] Remove vllm-v1, refactor v2 ( #10842 )
...
* remove vllm-v1
* fix format
2024-04-22 17:51:32 +08:00
Wang, Jian4
23c6a52fb0
LLM: Fix ipex torchscript=True error ( #10832 )
...
* remove
* update
* remove torchscript
2024-04-22 15:53:09 +08:00
Yina Chen
3daad242b8
Fix No module named 'transformers.cache_utils' with transformers < 4.36 ( #10835 )
...
* update sdp condition
* update
* fix
* fix 431 error
* revert sdp & style fix
* fix
* meet comments
2024-04-22 14:05:50 +08:00
Guancheng Fu
caf75beef8
Disable sdpa ( #10814 )
2024-04-19 17:33:18 +08:00
Yishuo Wang
57edf2033c
fix lookahead with transformers >= 4.36 ( #10808 )
2024-04-19 16:24:56 +08:00
Ovo233
1a885020ee
Updated importing of top_k_top_p_filtering for transformers>=4.39.0 ( #10794 )
...
* In transformers>=4.39.0, the top_k_top_p_filtering function has been deprecated and moved to the Hugging Face package trl. Thus, for versions >= 4.39.0, import this function from trl.
2024-04-19 15:34:39 +08:00
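For reference, the fix in #10794 amounts to a version-gated import; a minimal sketch, assuming trl still exposes the helper under trl.core (as it did around trl 0.7):

```python
import transformers
from packaging import version

# Pick the import location based on the installed transformers version;
# trl.core as the new home of the helper is an assumption from the commit body.
if version.parse(transformers.__version__) >= version.parse("4.39.0"):
    from trl.core import top_k_top_p_filtering
else:
    from transformers import top_k_top_p_filtering
```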
Yishuo Wang
08458b4f74
remove rms norm copy ( #10793 )
2024-04-19 13:57:48 +08:00
Ruonan Wang
754b0ffecf
Fix pvc llama ( #10798 )
...
* fix
* update
2024-04-18 10:44:57 -07:00
Ruonan Wang
439c834ed3
LLM: add mixed precision for lm_head ( #10795 )
...
* add mixed_quantization
* meet code review
* update
* fix style
* meet review
2024-04-18 19:11:31 +08:00
Yina Chen
8796401b08
Support q4k in ipex-llm ( #10796 )
...
* support q4k
* update
2024-04-18 18:55:28 +08:00
Ruonan Wang
0e8aac19e3
add q6k precision in ipex-llm ( #10792 )
...
* add q6k
* add initial 16k
* update
* fix style
2024-04-18 16:52:09 +08:00
Wang, Jian4
14ca42a048
LLM:Fix moe indexs error on cpu ( #10791 )
2024-04-18 15:56:52 +08:00
Guancheng Fu
cbe7b5753f
Add vLLM[xpu] related code ( #10779 )
...
* Add ipex-llm side change
* add runable offline_inference
* refactor to call vllm2
* Verified async server
* add new v2 example
* add README
* fix
* change dir
* refactor readme.md
* add experimental
* fix
2024-04-18 15:29:20 +08:00
Wang, Jian4
209c3501e6
LLM: Optimize qwen1.5 moe model ( #10706 )
...
* update moe block
* fix style
* enable optimize MLP
* enable kv_cache
* enable fuse rope
* enable fused qkv
* enable flash_attention
* error sdp quantize
* use old api
* use fuse
* use xetla
* fix python style
* update moe_blocks num
* fix output error
* add cpu sdpa
* update
* update
* update
2024-04-18 14:54:05 +08:00
Ziteng Zhang
ff040c8f01
LISA Finetuning Example ( #10743 )
...
* enabling xetla only supports qtype=SYM_INT4 or FP8E5
* LISA Finetuning Example on gpu
* update readme
* add licence
* Explain parameters of lisa & Move backend codes to src dir
* fix style
* fix style
* update readme
* support chatglm
* fix style
* fix style
* update readme
* fix
2024-04-18 13:48:10 +08:00
Yang Wang
952e517db9
use config rope_theta ( #10787 )
...
* use config rope_theta
* fix style
2024-04-17 20:39:11 -07:00
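The idea behind #10787 is to read the rope base from the model config rather than hard-coding 10000.0; a hedged sketch with a hypothetical helper name:

```python
import torch

def build_inv_freq(config, head_dim: int) -> torch.Tensor:
    # Use config.rope_theta when present (HF llama-style configs expose it),
    # falling back to the conventional 10000.0 default.
    rope_theta = getattr(config, "rope_theta", 10000.0)
    return 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
```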
Guancheng Fu
31ea2f9a9f
Fix wrong output for Llama models on CPU ( #10742 )
2024-04-18 11:07:27 +08:00
Xin Qiu
e764f9b1b1
Disable fast fused rope on UHD ( #10780 )
...
* use decoding fast path
* update
* update
* cleanup
2024-04-18 10:03:53 +08:00
Yina Chen
ea5b373a97
Add lookahead GPU example ( #10785 )
...
* Add lookahead example
* fix style & attn mask
* fix typo
* address comments
2024-04-17 17:41:55 +08:00
Wang, Jian4
a20271ffe4
LLM: Fix yi-6b fp16 error on pvc ( #10781 )
...
* update for yi fp16
* update
* update
2024-04-17 16:49:59 +08:00
Yina Chen
766fe45222
Fix spec error caused by lookup pr ( #10777 )
...
* Fix spec error
* remove
* fix style
2024-04-17 11:27:35 +08:00
Qiyuan Gong
f2e923b3ca
Axolotl v0.4.0 support ( #10773 )
...
* Add Axolotl 0.4.0, remove legacy 0.3.0 support.
* replace is_torch_bf16_gpu_available
* Add HF_HUB_OFFLINE=1
* Move transformers out of requirement
* Refine readme and qlora.yml
2024-04-17 09:49:11 +08:00
Yina Chen
899d392e2f
Support prompt lookup in ipex-llm ( #10768 )
...
* lookup init
* add lookup
* fix style
* remove redundant code
* change param name
* fix style
2024-04-16 16:52:38 +08:00
binbin Deng
0a62933d36
LLM: fix qwen AutoTP ( #10766 )
2024-04-16 09:56:17 +08:00
Cengguang Zhang
3e2662c87e
LLM: fix get env KV_CACHE_ALLOC_BLOCK_LENGTH type. ( #10771 )
2024-04-16 09:32:30 +08:00
binbin Deng
3d561b60ac
LLM: add enable_xetla parameter for optimize_model API ( #10753 )
2024-04-15 12:18:25 +08:00
binbin Deng
c3fc8f4b90
LLM: add bs limitation for llama softmax upcast to fp32 ( #10752 )
2024-04-12 15:40:25 +08:00
Yishuo Wang
8086554d33
use new fp16 sdp in llama and mistral ( #10734 )
2024-04-12 10:49:02 +08:00
Yang Wang
019293e1b9
Fuse MOE indexes computation ( #10716 )
...
* try moe
* use c++ cpu to compute indexes
* fix style
2024-04-11 10:12:55 -07:00
binbin Deng
70ed9397f9
LLM: fix AttributeError of FP16Linear ( #10740 )
2024-04-11 17:03:56 +08:00
Cengguang Zhang
4b024b7aac
LLM: optimize chatglm2 8k input. ( #10723 )
...
* LLM: optimize chatglm2 8k input.
* rename.
2024-04-10 16:59:06 +08:00
Wang, Jian4
c9e6d42ad1
LLM: Fix chatglm3-6b-32k error ( #10719 )
...
* fix chatglm3-6b-32k
* update style
2024-04-10 11:24:06 +08:00
Keyan (Kyrie) Zhang
585c174e92
Read the value of KV_CACHE_ALLOC_BLOCK_LENGTH from the environment variables ( #10707 )
...
* Read the value of KV_CACHE_ALLOC_BLOCK_LENGTH from the environment variables.
* Fix style
2024-04-10 10:48:46 +08:00
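What #10707 describes is a plain env-var read; a minimal sketch, where the 256 default is a placeholder rather than the repo's value, and the int() cast reflects the type fix that #10771 applied later:

```python
import os

# Fall back to a compile-time default when the variable is unset.
KV_CACHE_ALLOC_BLOCK_LENGTH = int(os.environ.get("KV_CACHE_ALLOC_BLOCK_LENGTH", 256))
```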
Jiao Wang
878a97077b
Fix llava example to support transformers 4.36 ( #10614 )
...
* fix llava example
* update
2024-04-09 13:47:07 -07:00
Zhicun
b4147a97bb
Fix dtype mismatch error ( #10609 )
...
* fix llama
* fix
* fix code style
* add torch type in model.py
---------
Co-authored-by: arda <arda@arda-arc19.sh.intel.com>
2024-04-09 17:50:33 +08:00
Yishuo Wang
8f45e22072
fix llama2 ( #10710 )
2024-04-09 17:28:37 +08:00
Yishuo Wang
e438f941f2
disable rwkv5 fp16 ( #10699 )
2024-04-09 16:42:11 +08:00
binbin Deng
44922bb5c2
LLM: support baichuan2-13b using AutoTP ( #10691 )
2024-04-09 14:06:01 +08:00
Yina Chen
c7422712fc
mistral 4.36 use fp16 sdp ( #10704 )
2024-04-09 13:50:33 +08:00
Ovo233
dcb2038aad
Enable optimization for sentence_transformers ( #10679 )
...
* enable optimization for sentence_transformers
* fix python style check failure
2024-04-09 12:33:46 +08:00
Yang Wang
5a1f446d3c
support fp8 in xetla ( #10555 )
...
* support fp8 in xetla
* change name
* adjust model file
* support convert back to cpu
* factor
* fix bug
* fix style
2024-04-08 13:22:09 -07:00
Cengguang Zhang
7c43ac0164
LLM: optimize llama native sdp for split qkv tensor ( #10693 )
...
* LLM: optimize llama native sdp for split qkv tensor.
* fix block real size.
* fix comment.
* fix style.
* refactor.
2024-04-08 17:48:11 +08:00
Xin Qiu
1274cba79b
stablelm fp8 kv cache ( #10672 )
...
* stablelm fp8 kvcache
* update
* fix
* change to fp8 matmul
* fix style
* fix
* fix
* meet code review
* add comment
2024-04-08 15:16:46 +08:00
Cengguang Zhang
c0cd238e40
LLM: support llama2 8k input with w4a16. ( #10677 )
...
* LLM: support llama2 8k input with w4a16.
* fix comment and style.
* fix style.
* fix comments and split tensor to quantized attention forward.
* fix style.
* refactor name.
* fix style.
* fix style.
* fix style.
* refactor checker name.
* refactor native sdp split qkv tensor name.
* fix style.
* fix comment rename variables.
* fix co-existence of intermediate results.
2024-04-08 11:43:15 +08:00
Wang, Jian4
47cabe8fcc
LLM: Fix no return_last_logit running bigdl_ipex chatglm3 ( #10678 )
...
* fix no return_last_logits
* update only for chatglm
2024-04-07 15:27:58 +08:00
Zhicun
9d8ba64c0d
Llamaindex: add tokenizer_id and support chat ( #10590 )
...
* add tokenizer_id
* fix
* modify
* add from_model_id and from_mode_id_low_bit
* fix typo and add comment
* fix python code style
---------
Co-authored-by: pengyb2001 <284261055@qq.com>
2024-04-07 13:51:34 +08:00
Xiangyu Tian
08018a18df
Remove not-imported MistralConfig ( #10670 )
2024-04-07 10:32:05 +08:00
Cengguang Zhang
1a9b8204a4
LLM: support int4 fp16 chatglm2-6b 8k input. ( #10648 )
2024-04-07 09:39:21 +08:00
Jiao Wang
69bdbf5806
Fix vllm print error message issue ( #10664 )
...
* update chatglm readme
* Add condition to invalidInputError
* update
* update
* style
2024-04-05 15:08:13 -07:00
Xin Qiu
4c3e493b2d
fix stablelm2 1.6b ( #10656 )
...
* fix stablelm2 1.6b
* meet code review
2024-04-03 22:15:32 +08:00
Yishuo Wang
702e686901
optimize starcoder normal kv cache ( #10642 )
2024-04-03 15:27:02 +08:00
Xin Qiu
3a9ab8f1ae
fix stablelm logits diff ( #10636 )
...
* fix logits diff
* Small fixes
---------
Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2024-04-03 15:08:12 +08:00
Zhicun
b827f534d5
Add tokenizer_id in Langchain ( #10588 )
...
* fix low-bit
* fix
* fix style
---------
Co-authored-by: arda <arda@arda-arc12.sh.intel.com>
2024-04-03 14:25:35 +08:00
Kai Huang
c875b3c858
Add seq len check for llama softmax upcast to fp32 ( #10629 )
2024-04-03 12:05:13 +08:00
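A hedged sketch of how #10629's seq-len check might combine with the batch-size limit from #10752; the gate's direction and both thresholds are assumptions, not the repo's actual values:

```python
def upcast_softmax_to_fp32(seq_len: int, batch_size: int,
                           min_seq_len: int = 512, max_batch: int = 8) -> bool:
    # Hypothetical gate: upcast only for long inputs (where fp16 softmax can
    # lose accuracy) and small batches (to bound the extra fp32 memory).
    return seq_len >= min_seq_len and batch_size <= max_batch
```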
Jiao Wang
23e33a0ca1
Fix qwen-vl style ( #10633 )
...
* update
* update
2024-04-02 18:41:38 -07:00
binbin Deng
2bbd8a1548
LLM: fix llama2 FP16 & bs>1 & autotp on PVC and ARC ( #10611 )
2024-04-03 09:28:04 +08:00
Jiao Wang
654dc5ba57
Fix Qwen-VL example problem ( #10582 )
...
* update
* update
* update
* update
2024-04-02 12:17:30 -07:00
Yuwen Hu
fd384ddfb8
Optimize StableLM ( #10619 )
...
* Initial commit for stablelm optimizations
* Small style fix
* add dependency
* Add mlp optimizations
* Small fix
* add attention forward
* Remove quantize kv for now as head_dim=80
* Add merged qkv
* fix lisence
* Python style fix
---------
Co-authored-by: qiuxin2012 <qiuxin2012cs@gmail.com>
2024-04-02 18:58:38 +08:00
Yishuo Wang
ba8cc6bd68
optimize starcoder2-3b ( #10625 )
2024-04-02 17:16:29 +08:00
Shaojun Liu
a10f5a1b8d
add python style check ( #10620 )
...
* add python style check
* fix style checks
* update runner
* add ipex-llm-finetune-qlora-cpu-k8s to manually_build workflow
* update tag to 2.1.0-SNAPSHOT
2024-04-02 16:17:56 +08:00
Cengguang Zhang
58b57177e3
LLM: support bigdl quantize kv cache env and add warning. ( #10623 )
...
* LLM: support bigdl quantize kv cache env and add warning.
* fix style.
* fix comments.
2024-04-02 15:41:08 +08:00
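A sketch of the env override plus warning that #10623 describes; the variable name follows the title's wording and may differ from the repo's actual name:

```python
import os
import warnings

def quantize_kv_from_env(default: bool) -> bool:
    # Honor an explicit env override, warning the user that it takes effect.
    val = os.environ.get("BIGDL_QUANTIZE_KV_CACHE")
    if val is None:
        return default
    warnings.warn("BIGDL_QUANTIZE_KV_CACHE overrides the default kv cache "
                  "quantization policy.")
    return val == "1"
```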
Kai Huang
0a95c556a1
Fix starcoder first token perf ( #10612 )
...
* add bias check
* update
2024-04-02 09:21:38 +08:00
Cengguang Zhang
e567956121
LLM: add memory optimization for llama. ( #10592 )
...
* add initial memory optimization.
* fix logic.
* fix logic,
* remove env var check in mlp split.
2024-04-02 09:07:50 +08:00
Ruonan Wang
bfc1caa5e5
LLM: support iq1s for llama2-70b-hf ( #10596 )
2024-04-01 13:13:13 +08:00
Yishuo Wang
437a349dd6
fix rwkv with pip installer ( #10591 )
2024-03-29 17:56:45 +08:00
Ruonan Wang
0136fad1d4
LLM: support iq1_s ( #10564 )
...
* init version
* update utils
* remove unsed code
2024-03-29 09:43:55 +08:00
Qiyuan Gong
f4537798c1
Enable kv cache quantization by default for flex when 1 < batch <= 8 ( #10584 )
...
* Enable kv cache quantization by default for flex when 1 < batch <= 8.
* Change upper bound from <8 to <=8.
2024-03-29 09:43:42 +08:00
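The default-on condition stated in #10584 is easy to express directly; the device-name match below is an assumption:

```python
def quantize_kv_default(device_name: str, batch_size: int) -> bool:
    # Enable kv cache quantization by default on Flex GPUs when 1 < batch <= 8,
    # per the commit title; how the device is detected is an assumption.
    return "flex" in device_name.lower() and 1 < batch_size <= 8
```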
Cengguang Zhang
b44f7adbad
LLM: Disable esimd sdp for PVC GPU when batch size>1 ( #10579 )
...
* llm: disable esimd sdp for pvc bz>1.
* fix logic.
* fix: avoid call get device name twice.
2024-03-28 22:55:48 +08:00
Xin Qiu
5963239b46
Fix qwen's insufficient position_ids ( #10572 )
...
* fix position_ids
* fix position_ids
2024-03-28 17:05:49 +08:00
ZehuaCao
52a2135d83
Replace ipex with ipex-llm ( #10554 )
...
* fix ipex with ipex_llm
* fix ipex with ipex_llm
* update
* update
* update
* update
* update
* update
* update
* update
2024-03-28 13:54:40 +08:00
Cheen Hau, 俊豪
1c5eb14128
Update pip install to use --extra-index-url for ipex package ( #10557 )
...
* Change to 'pip install .. --extra-index-url' for readthedocs
* Change to 'pip install .. --extra-index-url' for examples
* Change to 'pip install .. --extra-index-url' for remaining files
* Fix URL for ipex
* Add links for ipex US and CN servers
* Update ipex cpu url
* remove readme
* Update for github actions
* Update for dockerfiles
2024-03-28 09:56:23 +08:00
binbin Deng
92dfed77be
LLM: fix abnormal output of fp16 deepspeed autotp ( #10558 )
2024-03-28 09:35:48 +08:00
Xiangyu Tian
51d34ca68e
Fix wrong import in speculative ( #10562 )
2024-03-27 18:21:07 +08:00
Guancheng Fu
04baac5a2e
Fix fastchat top_k ( #10560 )
...
* fix -1 top_k
* fix
* done
2024-03-27 16:01:58 +08:00
binbin Deng
fc8c7904f0
LLM: fix torch_dtype setting of apply fp16 optimization through optimize_model ( #10556 )
2024-03-27 14:18:45 +08:00
Ruonan Wang
ea4bc450c4
LLM: add esimd sdp for pvc ( #10543 )
...
* add esimd sdp for pvc
* update
* fix
* fix batch
2024-03-26 19:04:40 +08:00
Xiangyu Tian
11550d3f25
LLM: Add length check for IPEX-CPU speculative decoding ( #10529 )
...
Add length check for IPEX-CPU speculative decoding.
2024-03-26 17:47:10 +08:00
Guancheng Fu
a3b007f3b1
[Serving] Fix fastchat breaks ( #10548 )
...
* fix fastchat
* fix doc
2024-03-26 17:03:52 +08:00
Yishuo Wang
69a28d6b4c
fix chatglm ( #10540 )
2024-03-26 16:01:00 +08:00
binbin Deng
0a3e4e788f
LLM: fix mistral hidden_size setting for deepspeed autotp ( #10527 )
2024-03-26 10:55:44 +08:00
Xin Qiu
1dd40b429c
enable fp4 fused mlp and qkv ( #10531 )
...
* enable fp4 fused mlp and qkv
* update qwen
* update qwen2
2024-03-26 08:34:00 +08:00
Wang, Jian4
16b2ef49c6
Update document by heyang ( #30 )
2024-03-25 10:06:02 +08:00
Wang, Jian4
a1048ca7f6
Update setup.py and add new actions and add compatible mode ( #25 )
...
* update setup.py
* add new action
* add compatible mode
2024-03-22 15:44:59 +08:00
Wang, Jian4
9df70d95eb
Refactor bigdl.llm to ipex_llm ( #24 )
...
* Rename bigdl/llm to ipex_llm
* rm python/llm/src/bigdl
* from bigdl.llm to from ipex_llm
2024-03-22 15:41:21 +08:00
Wang, Jian4
34d0a9328c
LLM: Speed-up mixtral in pipeline parallel inference ( #10472 )
...
* speed-up mixtral
* fix style
2024-03-22 11:06:28 +08:00
Cengguang Zhang
b9d4280892
LLM: fix baichuan7b quantize kv abnormal output. ( #10504 )
...
* fix abnormal output.
* fix style.
* fix style.
2024-03-22 10:00:08 +08:00
Yishuo Wang
f0f317b6cf
fix a typo in yuan ( #10503 )
2024-03-22 09:40:04 +08:00
Guancheng Fu
3a3756b51d
Add FastChat bigdl_worker ( #10493 )
...
* done
* fix format
* add licence
* done
* fix doc
* refactor folder
* add license
2024-03-21 18:35:05 +08:00
Xin Qiu
dba7ddaab3
add sdp fp8 for qwen llama436 baichuan mistral baichuan2 ( #10485 )
...
* add sdp fp8
* fix style
* fix qwen
* fix baichuan 13
* revert baichuan 13b and baichuan2-13b
* fix style
* update
2024-03-21 17:23:05 +08:00
Kai Huang
30f111cd32
lm_head empty_cache for more models ( #10490 )
...
* modify constraint
* fix style
2024-03-21 17:11:43 +08:00
binbin Deng
2958ca49c0
LLM: add patching function for llm finetuning ( #10247 )
2024-03-21 16:01:01 +08:00
Kai Huang
021d77fd22
Remove softmax upcast fp32 in llama ( #10481 )
...
* update
* fix style
2024-03-20 18:17:34 +08:00
Yishuo Wang
cfdf8ad496
Fix modules_not_to_convert argument ( #10483 )
2024-03-20 17:47:03 +08:00
Xiangyu Tian
cbe24cc7e6
LLM: Enable BigDL IPEX Int8 ( #10480 )
...
Enable BigDL IPEX Int8
2024-03-20 15:59:54 +08:00
ZehuaCao
1d062e24db
Update serving doc ( #10475 )
...
* update serving doc
* add tob
* update
* update
* update
* update vllm worker
2024-03-20 14:44:43 +08:00
Cengguang Zhang
4581e4f17f
LLM: fix whisper model missing config. ( #10473 )
...
* fix whisper model missing config.
* fix style.
* fix style.
* style.
2024-03-20 14:22:37 +08:00
Yishuo Wang
749bedaf1e
fix rwkv v5 fp16 ( #10474 )
2024-03-20 13:15:08 +08:00
Yuwen Hu
72bcc27da9
[LLM] Add TransformersBgeEmbeddings class in bigdl.llm.langchain.embeddings ( #10459 )
...
* Add TransformersBgeEmbeddings class in bigdl.llm.langchain.embeddings
* Small fixes
2024-03-19 18:04:35 +08:00
Cengguang Zhang
463a86cd5d
LLM: fix qwen-vl interpolation gpu abnormal results. ( #10457 )
...
* fix qwen-vl interpolation gpu abnormal results.
* fix style.
* update qwen-vl gpu example.
* fix comment and update example.
* fix style.
2024-03-19 16:59:39 +08:00
Xin Qiu
bbd749dceb
qwen2 fp8 cache ( #10446 )
...
* qwen2 fp8 cache
* fix style check
2024-03-19 08:32:39 +08:00
Yang Wang
9e763b049c
Support running pipeline parallel inference by vertically partitioning model to different devices ( #10392 )
...
* support pipeline parallel inference
* fix logging
* remove benchmark file
* fix
* need to warmup twice
* support qwen and qwen2
* fix lint
* remove genxir
* refine
2024-03-18 13:04:45 -07:00
Xiangyu Tian
dbdeaddd6a
LLM: Fix log condition for BIGDL_OPT_IPEX ( #10441 )
...
remove log for BIGDL_OPT_IPEX
2024-03-18 16:03:51 +08:00
Xin Qiu
399843faf0
Baichuan 7b fp16 sdp and qwen2 pvc sdp ( #10435 )
...
* add baichuan sdp
* update
* baichuan2
* fix
* fix style
* revert 13b
* revert
2024-03-18 10:15:34 +08:00
Yishuo Wang
bd64488b2a
add mask support for llama/chatglm fp8 sdp ( #10433 )
...
* add mask support for fp8 sdp
* fix chatglm2 dtype
* update
2024-03-15 17:36:52 +08:00
Xin Qiu
24473e331a
Qwen2 fp16 sdp ( #10427 )
...
* qwen2 sdp and refine
* update
* update
* fix style
* remove use_flash_attention
2024-03-15 13:12:03 +08:00
Ruonan Wang
b036205be2
LLM: add fp8 sdp for chatglm2/3 ( #10411 )
...
* add fp8 sdp for chatglm2
* fix style
2024-03-15 09:38:18 +08:00
Wang, Jian4
fe8976a00f
LLM: Support gguf models using low_bit and fix missing json ( #10408 )
...
* support others model use low_bit
* update readme
* update to add *.json
2024-03-15 09:34:18 +08:00
Xin Qiu
cda38f85a9
Qwen fp16 sdp ( #10401 )
...
* qwen sdp
* fix
* update
* update
* update sdp
* update
* fix style check
* add to origin type
2024-03-15 08:51:50 +08:00
dingbaorong
1c0f7ed3fa
add xpu support ( #10419 )
2024-03-14 17:13:48 +08:00
Heyang Sun
7d29765092
refactor qwen2 forward to enable XPU ( #10409 )
...
* refactor qwen2 forward to enable XPU
* Update qwen2.py
2024-03-14 11:03:05 +08:00
ZehuaCao
f66329e35d
Fix multiple get_enable_ipex function error ( #10400 )
...
* fix multiple get_enable_ipex function error
* remove get_enable_ipex_low_bit function
2024-03-14 10:14:13 +08:00
Kai Huang
76e30d8ec8
Empty cache for lm_head ( #10317 )
...
* empty cache
* add comments
2024-03-13 20:31:53 +08:00
Yishuo Wang
06a851afa9
support new baichuan model ( #10404 )
2024-03-13 17:45:50 +08:00
Yishuo Wang
b268baafd6
use fp8 sdp in llama ( #10396 )
2024-03-13 16:45:38 +08:00
Xiangyu Tian
60043a3ae8
LLM: Support Baichuan2-13b in BigDL-vLLM ( #10398 )
...
Support Baichuan2-13b in BigDL-vLLM.
2024-03-13 16:21:06 +08:00
Xiangyu Tian
e10de2c42d
[Fix] LLM: Fix condition check error for speculative decoding on CPU ( #10402 )
...
Fix condition check error for speculative decoding on CPU
2024-03-13 16:05:06 +08:00
Heyang Sun
d72c0fad0d
Qwen2 SDPA forward on CPU ( #10395 )
...
* Fix Qwen1.5 CPU forward
* Update convert.py
* Update qwen2.py
2024-03-13 13:10:03 +08:00
Wang, Jian4
0193f29411
LLM: Enable gguf float16 and Yuan2 model ( #10372 )
...
* enable float16
* add yuan files
* enable yuan
* enable set low_bit on yuan2
* update
* update license
* update generate
* update readme
* update python style
* update
2024-03-13 10:19:18 +08:00
Yina Chen
f5d65203c0
First token lm_head optimization ( #10318 )
...
* add lm head linear
* update
* address comments and fix style
* address comment
2024-03-13 10:11:32 +08:00
Xin Qiu
28c4a8cf5c
Qwen fused qkv ( #10368 )
...
* fused qkv + rope for qwen
* quantized kv cache
* fix
* update qwen
* fixed quantized qkv
* fix
* meet code review
* update split
* convert.py
* extend when not enough kv
* fix
2024-03-12 17:39:00 +08:00
Yishuo Wang
741c2bf1df
use new rms norm ( #10384 )
2024-03-12 17:29:51 +08:00
Xiangyu Tian
0ded0b4b13
LLM: Enable BigDL IPEX optimization for int4 ( #10319 )
...
Enable BigDL IPEX optimization for int4
2024-03-12 17:08:50 +08:00
Zhao Changmin
df2b84f7de
Enable kv cache on arc batch ( #10308 )
2024-03-12 16:46:04 +08:00
Guancheng Fu
cc4148636d
[FastChat-integration] Add initial implementation for loader ( #10323 )
...
* add initial implementation for loader
* add test method for model_loader
* data
* Refine
2024-03-12 10:54:59 +08:00
binbin Deng
dbcfc5c2fa
LLM: fix error of 'AI-ModelScope/phi-2' hosted by ModelScope hub ( #10364 )
2024-03-11 16:19:17 +08:00
Chen, Zhentao
a425eaabfc
fix from_pretrained when device_map=None ( #10361 )
...
* pr trigger
* fix error when device_map=None
* fix device_map=None
2024-03-11 16:06:12 +08:00
Yina Chen
d7b765fd3f
serving xpu memory opt ( #10358 )
2024-03-11 15:21:22 +08:00
Ruonan Wang
be29833b2b
LLM: fix qwen2 ( #10356 )
2024-03-11 09:29:08 +08:00
Zhicun
9026c08633
Fix llamaindex AutoTokenizer bug ( #10345 )
...
* fix tokenizer
* fix AutoTokenizer bug
* modify code style
2024-03-08 16:24:50 +08:00
Keyan (Kyrie) Zhang
7a621a4db0
Fix device_map bug by raise an error when using device_map=xpu ( #10340 )
...
* Fix device_map bug by raise an error when using device_map=xpu
* Fix sync error
* Fix python style
* Use invalidInputError instead of invalidOperationError
2024-03-08 13:38:52 +08:00
Yishuo Wang
1ac193ba02
add rope theta argument ( #10343 )
2024-03-07 17:27:19 +08:00
Cengguang Zhang
496d18ab6d
LLM: add quantize kv cache support for baichuan 7b and 13b. ( #10330 )
...
* add quantize kv cache for baichuan 7b and 13b.
* fix typo.
* fix.
* fix style.
* fix style.
2024-03-07 16:17:38 +08:00
Yina Chen
9ea499ca68
Optimize speculative decoding PVC memory usage ( #10329 )
...
* optimize memory
* update
* update
* update
* support other models
* update
* fix style
2024-03-06 09:54:21 +08:00
dingbaorong
cc796848ea
fix typos ( #10274 )
...
Co-authored-by: Ariadne <wyn2000330@126.com>
2024-03-05 18:38:22 +08:00
Yishuo Wang
0011ff9f64
optimize bge large performance ( #10324 )
2024-03-05 17:06:03 +08:00
Cengguang Zhang
30d009bca7
LLM: support quantized kv cache for Mistral in transformers >=4.36.0 ( #10326 )
...
* support quantize kv for mistral in transformers 4.36
* update mistral support.
* fix style.
2024-03-05 16:23:50 +08:00
dingbaorong
1e6f0c6f1a
Add llamaindex gpu example ( #10314 )
...
* add llamaindex example
* fix core dump
* refine readme
* add trouble shooting
* refine readme
---------
Co-authored-by: Ariadne <wyn2000330@126.com>
2024-03-05 13:36:00 +08:00
dingbaorong
fc7f10cd12
add langchain gpu example ( #10277 )
...
* first draft
* fix
* add readme for transformer_int4_gpu
* fix doc
* check device_map
* add arc ut test
* fix ut test
* fix langchain ut
* Refine README
* fix gpu mem too high
* fix ut test
---------
Co-authored-by: Ariadne <wyn2000330@126.com>
2024-03-05 13:33:57 +08:00
Cengguang Zhang
ab9fc2485f
LLM: add quantize kv support for llama transformer 4.36 ( #10298 )
...
* add quantize kv support for llama transformer 4.36
* fix style.
* fix style.
2024-03-04 10:33:35 +08:00
SONG Ge
0ab40917fb
[LLM] Split merged_qk into separate q/k linear ( #10299 )
...
* modify merge_qk_linear to separated q/k linear
* update
2024-03-01 16:48:55 +08:00
Yang Wang
f4d7dbcde2
use fused qkv forward in qwen2 ( #10185 )
...
* use fused qkv forward in qwen2
* support both
* fix style
* fix rope
* remove print
* fix style
* clean up
2024-03-01 16:46:35 +08:00
Wang, Jian4
beb9433cec
LLM: Reduce speculative _ipex_optimize_model memory use ( #10281 )
...
* use tpp
* update ipex
2024-03-01 13:48:23 +08:00
Yuwen Hu
f0ff0eebe1
[LLM] Support quantize kv cache for Baichuan2 7B ( #10280 )
...
* Add quantized kv cache framework for Baichuan2 7B
* Support quantize kv cache for baichuan2
* Small fix
* Fix python style
2024-03-01 13:35:42 +08:00
SONG Ge
273de341d7
hot-fix silu import error ( #10292 )
2024-03-01 10:11:37 +08:00
Xin Qiu
232273a1b5
Enable Gemma fused mlp + Gelu ( #10276 )
...
* update llama mlp forward
* add all
* fix style check
* split
* update
* update
* update
* fix style
2024-02-29 16:53:24 +08:00
Guancheng Fu
2d930bdca8
Add vLLM bf16 support ( #10278 )
...
* add argument load_in_low_bit
* add docs
* modify gpu doc
* done
---------
Co-authored-by: ivy-lv11 <lvzc@lamda.nju.edu.cn>
2024-02-29 16:33:42 +08:00
SONG Ge
13b0bc9075
[LLM] Add quantize_kv optimization for yuan2 model ( #10243 )
...
* add initial quantize_kv support for yuan2 model
* fix yuan2 quantize_kv generation
* apply fp16 conv layer optimizations
* disable mlp for quantize_kv
2024-02-29 16:33:26 +08:00
Zhicun
4e6cc424f1
Add LlamaIndex RAG ( #10263 )
...
* run demo
* format code
* add llamaindex
* add custom LLM with bigdl
* update
* add readme
* begin ut
* add unit test
* add license
* add license
* revised
* update
* modify docs
* remove data folder
* update
* modify prompt
* fixed
* fixed
* fixed
2024-02-29 15:21:19 +08:00
Ruonan Wang
a9fd20b6ba
LLM: Update qkv fusion for GGUF-IQ2 ( #10271 )
...
* first commit
* update mistral
* fix transformers==4.36.0
* fix
* disable qk for mixtral now
* fix style
2024-02-29 12:49:53 +08:00
Ruonan Wang
4b08bc1417
LLM: relax batch check of flash attention by double-checking attention mask ( #10270 )
...
* relax batch check
* fix
* fix style
2024-02-29 09:39:55 +08:00
Yina Chen
07f36fbfcc
Fix gptj failed to extend ( #10269 )
2024-02-29 09:39:27 +08:00
Yishuo Wang
cccb02dad1
fix baichuan2 13b 2k input ( #10267 )
2024-02-28 17:20:20 +08:00
Heyang Sun
7244fd1ba5
Fix Arc StarCoder wrong query_shape when input is long ( #10268 )
...
* Fix Arc StarCoder wrong query_shape when input is long
* Update gptbigcode.py
2024-02-28 17:07:08 +08:00
Cengguang Zhang
a4de3095f3
LLM: Support quantize kv cache in mistral. ( #10261 )
...
* init
* update quantize kv.
2024-02-28 14:08:08 +08:00
Zhicun
308e637d0d
Add DeepSeek-MoE-16B-Chat ( #10155 )
...
* dsmoe-hf add
* add dsmoe pytorch
* update README
* modify comment
* remove GPU example
* update model name
* format code
2024-02-28 10:12:09 +08:00
Yang Wang
c581c6db30
draft mmint4 ( #10031 )
...
* change to llm.cpp
* support transposed format
* revert
* implement qkv fuse
* fix style
* change to vertically pack
* change to enable_xetla
* fix mlp_fusion_check
* remove comments
* address comments
* add some comments
* fix style
2024-02-27 14:55:16 -08:00
Yishuo Wang
b4fa4ab46f
optimize yuan 2.0 again ( #10252 )
2024-02-27 14:51:42 +08:00
Heyang Sun
36a9e88104
Speculative Starcoder on CPU ( #10138 )
...
* Speculative Starcoder on CPU
* enable kv-cache pre-allocation
* refine codes
* refine
* fix style
* fix style
* fix style
* refine
* refine
* Update speculative.py
* Update gptbigcode.py
* fix style
* Update speculative.py
* enable mixed-datatype layernorm on top of torch API
* adaptive dtype
* Update README.md
2024-02-27 09:57:29 +08:00
Yishuo Wang
a47989c860
optimize yuan 2.0 performance ( #10244 )
2024-02-26 17:20:10 +08:00
Wang, Jian4
6c74b99a28
LLM: Update qwen readme ( #10245 )
2024-02-26 17:03:09 +08:00
Wang, Jian4
f9b75f900b
LLM: Enable qwen target_model ipex ( #10232 )
...
* change order
* enable qwen ipex
* update qwen example
* update
* fix style
* update
2024-02-26 16:41:12 +08:00
Yuwen Hu
e38e29511c
[LLM] Yuan2 MLP and Rotary optimization ( #10231 )
...
* Add optimization for rotary embedding
* Add mlp fused optimization
* Python style fix
* Fix rotary embedding due to logits difference
* Small fix
2024-02-26 15:10:08 +08:00
SONG Ge
df2f3885ba
[LLM] Enable kv_cache and forward_qkv optimizations for yuan2 ( #10225 )
...
* add init kv_cache support for yuan2
* add forward qkv in yuan
2024-02-26 11:29:48 +08:00
Ruonan Wang
28513f3978
LLM: support fp16 embedding & add mlp fusion for iq2_xxs ( #10219 )
...
* add fp16 embed
* small fixes
* fix style
* fix style
* fix comment
2024-02-23 17:26:24 +08:00
Yuwen Hu
eeecd9fc08
Python style fix ( #10230 )
2024-02-23 17:21:23 +08:00
Yuwen Hu
e511bbd8f1
[LLM] Add basic optimization framework for Yuan2 ( #10227 )
...
* Add basic optimization framework for Yuan2
* Small fix
* Python style fix
* Small fix
* Small fix
2024-02-23 17:05:00 +08:00
Xin Qiu
30795bdfbc
Gemma optimization: rms_norm, kv_cache, fused_rope, fused_rope+qkv ( #10212 )
...
* gemma optimization
* update
* update
* fix style
* meet code review
2024-02-23 10:07:24 +08:00
Guoqiong Song
63681af97e
falcon for transformers 4.36 ( #9960 )
...
* falcon for transformers 4.36
2024-02-22 17:04:40 -08:00
Yina Chen
ce5840a8b7
GPT-J rope optimization on xpu ( #10182 )
...
* optimize
* update
* fix style & move use_fuse_rope
* add ipex version check
* fix style
* update
* fix style
* meet comments
* address comments
* fix style
2024-02-22 16:25:12 +08:00
Xiangyu Tian
f445217d02
LLM: Update IPEX to 2.2.0+cpu and Refactor for _ipex_optimize ( #10189 )
...
Update IPEX to 2.2.0+cpu and refactor for _ipex_optimize.
2024-02-22 16:01:11 +08:00
Heyang Sun
c876d9b5ca
Support for MPT rotary embedding ( #10208 )
2024-02-22 15:16:31 +08:00
Ruonan Wang
5e1fee5e05
LLM: add GGUF-IQ2 examples ( #10207 )
...
* add iq2 examples
* small fix
* meet code review
* fix
* meet review
* small fix
2024-02-22 14:18:45 +08:00
SONG Ge
ca1166a0e5
[LLM] Add quantize kv_cache for Baichuan2-13B ( #10203 )
...
* add quantize kv_cache for baichuan2-13b
* style fix
2024-02-22 13:43:35 +08:00
Ruonan Wang
34ee1aa91f
LLM: add esimd sdp support for chatglm3 ( #10205 )
...
* add esimd sdp support
* fix style
2024-02-22 13:37:16 +08:00
Ruonan Wang
f7c96b19ef
LLM: support iq2 for mixtral ( #10191 )
...
* support name mapping for mixtral
* support mixtral mixed quantization
* fix style
* fix
2024-02-21 16:00:29 +08:00
Xin Qiu
56ad781f2f
qwen2 cpu fix ( #10187 )
2024-02-21 11:23:51 +08:00
Zhao Changmin
4fbf449c2d
for rwkv4 ( #10179 )
2024-02-21 10:11:10 +08:00
Ruonan Wang
3288acb8de
LLM : Support embedding quantization (only q2k now) ( #10170 )
...
* basic logic added
* basic support
* support save&load, update mixed strategy
* fix style
* use int8 for lm_head
* add check for xpu
2024-02-20 16:56:57 +08:00
binbin Deng
2bb96c775c
LLM: fix device setting during saving optimized model ( #10154 )
2024-02-20 09:52:59 +08:00
Xin Qiu
1f6d5b9f30
enable fused rmsnorm and rope qwen2 ( #10163 )
...
* qwen2
* change convert
* cleanup
2024-02-20 08:33:09 +08:00
Zhao Changmin
f8730e8dc1
Skip rescale rwkv linear when load_low_bit ( #10164 )
...
* rwkv_ld
2024-02-19 15:56:42 +08:00
Heyang Sun
3e2af5ec0a
Fix IPEX Baichuan Speculative ( #10162 )
...
* Fix IPEX Baichuan Speculative
* compatible with 13B
* Update speculative.py
2024-02-19 15:27:34 +08:00
Yina Chen
23c91cdce6
[LLM] Add min_step_draft in speculative decoding ( #10142 )
...
* Fix gptj kvcache & position id
* Add min_draft_tokens in speculative decoding
* fix style
* update
2024-02-19 14:31:41 +08:00
Wang, Jian4
f2417e083c
LLM: enable chatglm3-6b target_model ipex ( #10085 )
...
* init
* always make casual_mask
* not return last tensor
* update
* optimize_model = False
* enable optimized=False
* enable optimized_model=true
* speed_up ipex target_model
* remove if True
* use group_size
* update python style
* update
* update
2024-02-19 13:38:32 +08:00
Yina Chen
1508d6b089
Fix gptj kvcache & position id ( #10141 )
2024-02-18 10:02:49 +08:00
Yishuo Wang
4d33aac7f9
quick fix qwen2 fp8 kv cache ( #10135 )
2024-02-08 17:04:59 +08:00
Cengguang Zhang
39d90839aa
LLM: add quantize kv cache for llama. ( #10086 )
...
* feat: add quantize kv cache for llama.
* fix style.
* add quantized attention forward function.
* revert style.
* fix style.
* fix style.
* update quantized kv cache and add quantize_qkv
* fix style.
* fix style.
* optimize quantize kv cache.
* fix style.
2024-02-08 16:49:22 +08:00
Yishuo Wang
d848efe17c
add quantize kv cache support for qwen2 ( #10134 )
2024-02-08 16:17:21 +08:00
SONG Ge
3f79128ed7
[LLM] Enable kv_cache optimization for Qwen2 on transformers-v4.37.0 ( #10131 )
...
* add support for kv_cache optimization on transformers-v4.37.0
* enable attention forward
* style fix
* disable rotary for now
2024-02-08 14:20:26 +08:00
Ruonan Wang
063dc145ac
LLM: basic support for q2k ( #10132 )
...
* basic support for q2k
* fix style
2024-02-08 13:52:01 +08:00
Cengguang Zhang
0cf6a12691
LLM: add default torch_dtype for fp16. ( #10124 )
...
* set default torch_dtype for fp16.
* fix style.
* bug fix.
* update bug fix.
2024-02-08 10:24:16 +08:00
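A minimal sketch of the dtype defaulting #10124 describes; the helper is hypothetical:

```python
import torch

def resolve_torch_dtype(low_bit: str, torch_dtype=None):
    # Default to float16 when loading with fp16 optimization and the caller
    # did not pass an explicit dtype; otherwise leave the choice untouched.
    if torch_dtype is None and low_bit == "fp16":
        return torch.float16
    return torch_dtype
```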
Yishuo Wang
1aa0c623ce
disable fused layer norm on UHD ( #10130 )
2024-02-08 10:20:01 +08:00
Yuwen Hu
a8450fc300
[LLM] Support MLP optimization for Qwen1.5 ( #10123 )
2024-02-08 09:15:34 +08:00
binbin Deng
925f82107e
LLM: support models hosted by modelscope ( #10106 )
2024-02-07 16:46:36 +08:00
Xiangyu Tian
8953acd7d6
[LLM] Fix log condition for BIGDL_OPT_IPEX ( #10115 )
...
Fix log condition for BIGDL_OPT_IPEX
2024-02-07 10:27:10 +08:00
Yuwen Hu
518ef95abc
Small fix for Nonetype error ( #10104 )
2024-02-06 14:58:52 +08:00
Ruonan Wang
d61f4905ac
LLM: 2bit quantization initial support ( #10042 )
...
* basis quantize support
* fix new module name
* small update
* and mixed int4 with iq2_xxs
* remove print
* code refactor
* fix style
* meet code review
2024-02-06 14:58:32 +08:00
Jiao Wang
33b9e7744d
fix dimension ( #10097 )
2024-02-05 15:07:38 -08:00
Zhicun
7d2be7994f
add phixtral and optimize phi-moe ( #10052 )
2024-02-05 11:12:47 +08:00
Zhicun
676d6923f2
LLM: modify transformersembeddings.embed() in langchain ( #10051 )
2024-02-05 10:42:10 +08:00
Jin Qiao
ad050107b3
LLM: fix mpt load_low_bit issue ( #10075 )
...
* fix
* retry
* retry
2024-02-05 10:17:07 +08:00
Ruonan Wang
8e33cb0f38
LLM: support speecht5_tts ( #10077 )
...
* support speecht5_tts
* fix
2024-02-04 13:26:42 +08:00
ivy-lv11
428b7105f6
Add HF and PyTorch example InternLM2 ( #10061 )
2024-02-04 10:25:55 +08:00
Yina Chen
77be19bb97
LLM: Support gpt-j in speculative decoding ( #10067 )
...
* gptj
* support gptj in speculative decoding
* fix
* update readme
* small fix
2024-02-02 14:54:55 +08:00
Xin Qiu
6e0f1a1e92
use apply_rotary_pos_emb_cache_freq_xpu in mixtral ( #10060 )
...
* use apply_rotary_pos_emb_cache_freq_xpu in mixtral
* fix style
2024-02-01 15:40:49 +08:00
Heyang Sun
601024f418
Mistral CPU example of speculative decoding ( #10024 )
...
* Mistral CPU example of speculative decoding
* update transformres version
* update example
* Update README.md
2024-02-01 10:52:32 +08:00
Heyang Sun
968e70544d
Enable IPEX Mistral in Speculative ( #10059 )
2024-02-01 10:48:16 +08:00
Yina Chen
3ca03d4e97
Add deepmind sample into bigdl-llm speculative decoding ( #10041 )
...
* migrate deepmind sample
* update
* meet comments
* fix style
* fix style
2024-02-01 09:57:02 +08:00
Wang, Jian4
7e5cd42a5c
LLM: Update optimize ipex bf16 ( #10038 )
...
* use 4.35.2 and remove
* update rmsnorm
* remove
* remove
* update python style
* update
* update python style
* update
* fix style
* update
* remove whitespace
2024-01-31 10:59:55 +08:00
Ruonan Wang
3685622f29
LLM: fix llama 4.36 forward ( #10047 )
2024-01-31 10:31:10 +08:00
Yishuo Wang
53a5140eff
Optimize rwkv v5 rest token again ( #10043 )
2024-01-31 10:01:11 +08:00
Ruonan Wang
6b63ba23d1
LLM: add full module name during convert ( #10035 )
2024-01-30 14:43:07 +08:00
Yishuo Wang
7dfa6dbe46
add rwkv time shift optimization ( #10032 )
2024-01-30 14:10:55 +08:00
Xiangyu Tian
f57d0fda8b
[LLM] Use IPEX Optimization for Self Speculative Decoding ( #9997 )
...
Use IPEX Optimization for Self Speculative Decoding
2024-01-30 09:11:06 +08:00
Ruonan Wang
ccf8f613fb
LLM: update fp16 Linear on ARC/FLEX ( #10023 )
2024-01-29 18:25:26 +08:00
Shaojun Liu
824c8029d7
Fix "local variable 'model' referenced before assignment" ( #10022 )
2024-01-29 16:18:04 +08:00
Xiangyu Tian
f37e4702bc
[LLM] Use IPEX Optimization for BF16 Model ( #9988 )
...
Use IPEX Optimization for BF16 Model by env BIGDL_OPT_IPEX=true
2024-01-29 11:28:25 +08:00
Yishuo Wang
d720554d43
simplify quantize kv cache api ( #10011 )
2024-01-29 09:23:57 +08:00
Yina Chen
a3322e2a6c
add fp8 e5 to use_xmx ( #10015 )
2024-01-26 18:29:46 +08:00
Qiyuan Gong
9e18ea187f
[LLM] Avoid KV Cache OOM when seq len is larger than 1 ( #10006 )
...
* Avoid OOM during multi-round streaming chat with kv cache
* For llama-like kv cache, i.e., [bs, n_head, seq_len, head_dim], use is_enough_kv_cache_room_4_31.
* Other models need to compare kv cache size with kv_len.
2024-01-26 17:30:08 +08:00
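A hedged sketch of the kv-cache room check #10006 mentions for llama-like caches of shape [bs, n_head, seq_len, head_dim]; inferring the allocated capacity from the n_head stride is an assumption about the buffer layout:

```python
def is_enough_kv_cache_room(past_key_value, new_len: int = 1) -> bool:
    # For a pre-allocated buffer narrowed on the seq_len dim, the stride of
    # the n_head dim divided by head_dim recovers the allocated seq length.
    if past_key_value is None:
        return False
    k = past_key_value[0]                 # [bs, n_head, seq_len, head_dim]
    capacity = k.stride(1) // k.size(3)   # allocated seq positions per head
    return capacity >= k.size(2) + new_len
```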
Ruonan Wang
a00efa0564
LLM: add mlp & qkv fusion for FP16 Llama-7B ( #9932 )
...
* add mlp fusion for llama
* add mlp fusion
* fix style
* update
* add mm_qkv_out
* fix style
* update
* meet code review
* meet code review
2024-01-26 11:50:38 +08:00
Wang, Jian4
98ea3459e5
LLM: Fix llama draft_model dtype error ( #10005 )
...
* fix llama draft_model dtype error
* update
2024-01-26 10:59:48 +08:00
Yishuo Wang
aae1870096
fix qwen kv cache length ( #9998 )
2024-01-26 10:15:01 +08:00
Yishuo Wang
24b34b6e46
change xmx condition ( #10000 )
2024-01-25 17:48:11 +08:00
Yishuo Wang
bf65548d29
Add quantize kv cache support for chaglm2/3 ( #9996 )
2024-01-25 16:55:59 +08:00
Wang, Jian4
9bff84e6fd
LLM: Convert draft_model kv_cache from bf16 to fp32 ( #9964 )
...
* convert bf16 to fp32
* update
* change when init
* init first and cut off after
* init and exchange
* update python type
* update
* fix bug
* update
* update
2024-01-25 11:20:27 +08:00
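The conversion #9964 describes can be sketched as a cast over the (layers x (k, v)) cache tuple; a minimal, assumption-laden version:

```python
import torch

def draft_kv_to_fp32(past_key_values):
    # Cast every bf16 tensor in the nested cache tuple up to fp32,
    # leaving tensors of other dtypes untouched.
    return tuple(
        tuple(t.float() if t.dtype == torch.bfloat16 else t for t in layer)
        for layer in past_key_values
    )
```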
Yina Chen
27338540c3
Fix repetition_penalty not activated issue ( #9989 )
2024-01-25 10:40:41 +08:00
Yuwen Hu
b27e5a27b9
Remove the check for meta device in _replace_with_low_bit_linear ( #9984 )
2024-01-24 18:15:39 +08:00
Yina Chen
b176cad75a
LLM: Add baichuan2 gpu spec example ( #9973 )
...
* add baichuan2 gpu spec example
* update readme & example
* remove print
* fix typo
* meet comments
* revert
* update
2024-01-24 16:40:16 +08:00
Chen, Zhentao
e0db44dcb6
fix unexpected keyword argument 'device' ( #9982 )
...
* add device for chatglm3 only
* add comment for this change
* fix style
* fix style
* fix style again..
* finally fixed style
2024-01-24 13:20:46 +08:00
Yuwen Hu
8d28aa8e2b
[LLM] Fix the model.device problem when cpu_embedding=True ( #9971 )
...
* Overwrite the device attribute for CPUPinnedParam
* Expose cpu_embedding=True for Linux users
* Fix python style
2024-01-23 18:51:11 +08:00
Yishuo Wang
f82782cd3b
fix starcoder ( #9975 )
2024-01-23 17:24:53 +08:00
Yishuo Wang
2c8a9aaf0d
fix qwen causal mask when quantize_kv_cache=True ( #9968 )
2024-01-23 16:34:05 +08:00
Yina Chen
36c665667d
Add logits processor & qwen eos stop in speculative decoding ( #9963 )
...
* add logits processor & qwen eos
* fix style
* fix
* fix
* fix style
* fix style
* support transformers 4.31
* fix style
* fix style
---------
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-01-23 15:57:28 +08:00
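A sketch of the eos-stop behavior #9963 adds for speculative decoding: truncate the accepted draft tokens at the first eos. The helper is hypothetical:

```python
def cut_at_eos(tokens, eos_token_ids):
    # Keep accepted draft tokens only up to (and including) the first eos.
    out = []
    for t in tokens:
        out.append(t)
        if t in eos_token_ids:
            break
    return out
```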
Xin Qiu
da4687c917
fix fp16 ( #9970 )
2024-01-23 15:53:32 +08:00
Ruonan Wang
27b19106f3
LLM: add readme for speculative decoding gpu examples ( #9961 )
...
* add readme
* add readme
* meet code review
2024-01-23 12:54:19 +08:00
Chen, Zhentao
39219b7e9a
add default device meta when lcmu enabled ( #9941 )
2024-01-23 11:00:49 +08:00
Xin Qiu
dacf680294
add fused rotary pos emb for qwen ( #9956 )
...
* add fused rotary pos emb for qwen
* update
2024-01-23 10:37:56 +08:00
Ruonan Wang
7b1d9ad7c0
LLM: limit esimd sdp usage for k_len < 8 ( #9959 )
...
* update
* fix
2024-01-23 09:28:23 +08:00
Ruonan Wang
3e601f9a5d
LLM: Support speculative decoding in bigdl-llm ( #9951 )
...
* first commit
* fix error, add llama example
* hidden print
* update api usage
* change to api v3
* update
* meet code review
* meet code review, fix style
* add reference, fix style
* fix style
* fix first token time
2024-01-22 19:14:56 +08:00
Heyang Sun
fb91c97fe8
support for Baichuan/Baichuan2 13B Chat running speculative decoding ( #9921 )
...
* support for Baichuan/Baichuan2 13B Chat running speculative decoding
* fix style
2024-01-22 09:11:44 +08:00
Xin Qiu
97f0cd8975
optimize Decilm 7b ( #9922 )
...
* optimize deci
* update
* decilm attention forward
2024-01-19 17:31:13 +08:00
Wang, Jian4
bcaeb05272
Update optimize qwen ( #9943 )
...
* update for n tokens input
* fix dtype
* update
2024-01-19 16:54:59 +08:00
Ruonan Wang
bf37b3a670
LLM: optimize CPU speculative decoding of chatglm3 ( #9928 )
...
* update
* fix style
* meet code review
2024-01-19 14:10:22 +08:00
Shaojun Liu
967714bac8
gguf memory optimization for mixtral ( #9939 )
2024-01-19 11:13:15 +08:00
Lilac09
7032a2ad73
Optimize gguf load memory for mistral ( #9923 )
...
* optimize gguf load for mistral
* fix output of gguf mistral
* reset
2024-01-19 09:14:39 +08:00
Shaojun Liu
9a46f019d7
gguf memory optimization for baichuan ( #9937 )
2024-01-19 09:11:02 +08:00
Guancheng Fu
2e1448f08e
[Serving] Add vllm_worker to fastchat serving framework ( #9934 )
...
* add worker
* finish
* finish
* add license
* add more comments
2024-01-18 21:33:36 +08:00
Yishuo Wang
7bbb98abb6
Disable fused layer norm when using XMX to fix mpt UT ( #9933 )
2024-01-18 16:22:12 +08:00
Wang, Jian4
1fc9dfa265
LLM: Update for Qwen n tokens inputs ( #9931 )
...
* update for n tokens inputs
* update style
* update
2024-01-18 15:56:29 +08:00
Heyang Sun
5184f400f9
Fix Mixtral GGUF Wrong Output Issue ( #9930 )
...
* Fix Mixtral GGUF Wrong Output Issue
* fix style
* fix style
2024-01-18 14:11:27 +08:00
Yishuo Wang
453df868c9
add rwkv v5 attention kernel ( #9927 )
2024-01-18 10:16:29 +08:00
Ruonan Wang
054952f82f
LLM: Fix rope of chatglm3 to support speculative decoding on CPU ( #9926 )
2024-01-18 09:28:10 +08:00
Ziteng Zhang
18cd1f1432
[LLM] Solve the problem of calling bmm operator in BF16Linear ( #9924 )
...
* Solve the problem of calling bmm operator in BF16Linear
2024-01-17 18:08:35 +08:00
Yina Chen
98b86f83d4
Support fast rope for training ( #9745 )
...
* init
* init
* fix style
* add test and fix
* address comment
* update
* merge upstream main
2024-01-17 15:51:38 +08:00
Ruonan Wang
427f75000b
LLM: fix sdp of chatglm3 ( #9917 )
...
* fix
* fix
* fix
2024-01-17 13:37:28 +08:00
Yishuo Wang
94767da7cf
optimize rwkv v4 first token performance ( #9912 )
2024-01-17 09:27:41 +08:00
Shaojun Liu
b909c5c9c2
GGUF load memory optimization ( #9913 )
...
* block-wise
* convert linear for module
* revert
* Fix PEP8 checks Error
2024-01-16 18:54:39 +08:00
Xin Qiu
dee32f7d15
copy fused rms norm's result to avoid <unk> ( #9909 )
2024-01-16 16:54:08 +08:00
Ruonan Wang
8d7326ae03
LLM: fix chatglm3 sdp to support speculative decoding ( #9900 )
...
* fix chatglm3
* fix
* update
* meet code review
* fix
2024-01-16 11:29:13 +08:00
Guancheng Fu
9f34da7cdb
Update PVC XMX condition ( #9901 )
...
* update pvc xmx condition
* update condition
* update condition
2024-01-15 15:42:15 +08:00
Yishuo Wang
6637860ddf
change xmx condition ( #9896 )
2024-01-12 19:51:48 +08:00
Ruonan Wang
d9cf55bce9
LLM: fix MLP check of mixtral ( #9891 )
2024-01-11 18:01:59 +08:00
Ziteng Zhang
4af88a67b9
support chatglm3 with bf16 ( #9888 )
...
* support chatglm3 with bigdl-bf16
2024-01-11 16:45:21 +08:00
Yuwen Hu
0aef35a965
[LLM] Improve LLM doc regarding windows gpu related info ( #9880 )
...
* Improve runtime configuration for windows
* Add python 310/311 supports for wheel downloading
* Add troubleshooting for windows gpu
* Remove manually import ipex due to auto importer
* Add info regarding cpu_embedding=True on iGPU
* More info for Windows users
* Small updates to API docs
* Python style fix
* Remove tip for loading from saved optimize_model for now
* Updated based on comments
* Update win info for multi-intel gpus selection
* Small fix
* Small fix
2024-01-11 14:37:16 +08:00
Ruonan Wang
53531ae4ee
LLM: support qkv fusion for fp8e5 ( #9878 )
...
* update
* add mistral
* meet code review
2024-01-10 17:50:00 +08:00
Lilac09
cb32b985ec
add mistral and chatglm support to vllm ( #9879 )
...
* add mistral and chatglm support to vllm
* add mistral and chatglm support to vllm
2024-01-10 15:38:42 +08:00
Ruonan Wang
3e05c9e11b
LLM: update esimd sdp kernel ( #9871 )
2024-01-09 18:10:01 +08:00
Yishuo Wang
36496d60ac
only use quantize kv cache on MTL ( #9862 )
2024-01-09 13:24:02 +08:00
ZehuaCao
146076bdb5
Support llm-awq backend ( #9856 )
...
* Support for LLM-AWQ Backend
* fix
* Update README.md
* Add awqconfig
* modify init
* update
* support llm-awq
* fix style
* fix style
* update
* fix AwqBackendPackingMethod not found error
* fix style
* update README
* fix style
---------
Co-authored-by: Uxito-Ada <414416158@qq.com>
Co-authored-by: Heyang Sun <60865256+Uxito-Ada@users.noreply.github.com>
Co-authored-by: cyita <yitastudy@gmail.com>
2024-01-09 13:07:32 +08:00
Ruonan Wang
fea6f16057
LLM: add mlp fusion for fp8e5 and update related check ( #9860 )
...
* update mlp fusion
* fix style
* update
2024-01-09 09:56:32 +08:00
Jiao Wang
3b6372ab12
Fix Llama transformers 4.36 support ( #9852 )
...
* support 4.36
* style
* update
* update
* update
* fix merge
* update
2024-01-08 00:32:23 -08:00
Chen, Zhentao
1b585b0d40
set fp8 default as e5m2 ( #9859 )
2024-01-08 15:53:57 +08:00
Ruonan Wang
dc995006cc
LLM: add flash attention for mistral / mixtral ( #9846 )
...
* add flash attention for mistral
* update
* add flash attn for mixtral
* fix style
2024-01-08 09:51:34 +08:00
Yishuo Wang
afaa871144
[LLM] support quantize kv cache to fp8 ( #9812 )
2024-01-08 09:28:20 +08:00
Jiao Wang
248ae7fad2
LLama optimize_model to support transformers 4.36 ( #9818 )
...
* support 4.36
* style
* update
* update
* update
2024-01-05 11:30:18 -08:00
Ruonan Wang
a60bda3324
LLM: update check for deepspeed ( #9838 )
2024-01-05 16:44:10 +08:00
Ruonan Wang
16433dd959
LLM: fix first token judgement of flash attention ( #9841 )
...
* fix flash attention
* meet code review
* fix
2024-01-05 13:49:37 +08:00
Yina Chen
f919f5792a
fix kv cache out of bound ( #9827 )
2024-01-05 12:38:57 +08:00
Ruonan Wang
5df31db773
LLM: fix accuracy issue of chatglm3 ( #9830 )
...
* add attn mask for first token
* fix
* fix
* change attn calculation
* fix
* fix
* fix style
* fix style
2024-01-05 10:52:05 +08:00
Xiangyu Tian
38c05be1c0
[LLM] Fix dtype mismatch in Baichuan2-13b ( #9834 )
2024-01-04 15:34:42 +08:00
Ziteng Zhang
05b681fa85
[LLM] IPEX auto importer set on by default ( #9832 )
...
* Set BIGDL_IMPORT_IPEX default to True
* Remove import intel_extension_for_pytorch as ipex from GPU example
2024-01-04 13:33:29 +08:00
Wang, Jian4
4ceefc9b18
LLM: Support bitsandbytes config on qlora finetune ( #9715 )
...
* test support bitsandbytesconfig
* update style
* update cpu example
* update example
* update readme
* update unit test
* use bfloat16
* update logic
* use int4
* set default bnb_4bit_use_double_quant
* update
* update example
* update model.py
* update
* support lora example
2024-01-04 11:23:16 +08:00
Ruonan Wang
20e9742fa0
LLM: fix chatglm3 issue ( #9820 )
...
* fix chatglm3 issue
* small update
2024-01-03 16:15:55 +08:00
Wang, Jian4
a54cd767b1
LLM: Add gguf falcon ( #9801 )
...
* init falcon
* update convert.py
* update style
2024-01-03 14:49:02 +08:00
Qiyuan Gong
f0f9d45eac
[LLM] IPEX import support bigdl-core-xe-21 ( #9769 )
...
Add support for bigdl-core-xe-21.
2023-12-28 15:23:58 +08:00
Guancheng Fu
5857a38321
[vLLM] Add option to adjust KV_CACHE_ALLOC_BLOCK_LENGTH ( #9782 )
...
* add option kv_cache_block
* change var name
2023-12-28 14:41:47 +08:00
Ruonan Wang
99bddd3ab4
LLM: better FP16 support for Intel GPUs ( #9791 )
...
* initial support
* fix
* fix style
* fix
* limit esimd usage condition
* refactor code
* fix style
* small fix
* meet code review
* small fix
2023-12-28 13:30:13 +08:00
Yishuo Wang
7d9f6c6efc
fix cpuinfo error ( #9793 )
2023-12-28 09:23:44 +08:00
Wang, Jian4
7ed9538b9f
LLM: support gguf mpt ( #9773 )
...
* add gguf mpt
* update
2023-12-28 09:22:39 +08:00
Cengguang Zhang
d299f108d0
update falcon attention forward. ( #9796 )
2023-12-28 09:11:59 +08:00
Kai Huang
689889482c
Reduce max_cache_pos to reduce Baichuan2-13B memory ( #9694 )
...
* optimize baichuan2 memory
* fix
* style
* fp16 mask
* disable fp16
* fix style
* empty cache
* revert empty cache
2023-12-26 19:51:25 +08:00
Xiangyu Tian
0ea842231e
[LLM] vLLM: Add api_server entrypoint ( #9783 )
...
Add vllm.entrypoints.api_server for benchmark_serving.py in vllm.
2023-12-26 16:03:57 +08:00
Ruonan Wang
11d883301b
LLM: fix wrong batch output caused by flash attention ( #9780 )
...
* fix
* meet code review
* move batch size check to the beginning
* move qlen check inside function
* meet code review
2023-12-26 09:41:27 +08:00
Heyang Sun
66e286a73d
Support for Mixtral AWQ ( #9775 )
...
* Support for Mixtral AWQ
* Update README.md
* Update README.md
* Update awq_config.py
* Update README.md
* Update README.md
2023-12-25 16:08:09 +08:00
Ruonan Wang
1917bbe626
LLM: fix BF16Linear related training & inference issue ( #9755 )
...
* fix bf16 related issue
* fix
* update based on comment & add arc lora script
* update readme
* update based on comment
* update based on comment
* update
* force to bf16
* fix style
* move check input dtype into function
* update convert
* meet code review
* meet code review
* update merged model to support new training_mode api
* fix typo
2023-12-25 14:49:30 +08:00
Xiangyu Tian
30dab36f76
[LLM] vLLM: Fix kv cache init ( #9771 )
...
Fix kv cache init
2023-12-25 14:17:06 +08:00
Yina Chen
449b387125
Support relora in bigdl-llm ( #9687 )
...
* init
* fix style
* update
* support resume & update readme
* update
* update
* remove important
* add training mode
* meet comments
2023-12-25 14:04:28 +08:00
Ziteng Zhang
986f65cea9
[LLM] Add trust_remote_code for local renamed model in bigdl_llm_model.py ( #9762 )
2023-12-25 11:31:14 +08:00
Guancheng Fu
daf536fb2d
vLLM: Apply attention optimizations for selective batching ( #9758 )
...
* fuse_rope for prefil
* apply kv_cache optimizations
* apply fast_decoding_path
* Re-enable kv_cache optimizations for prefill
* reduce KV_CACHE_ALLOC_BLOCK for selective_batching
2023-12-25 10:29:31 +08:00
Qiyuan Gong
4c487313f2
Revert "[LLM] IPEX auto importer turn on by default for XPU ( #9730 )" ( #9759 )
...
This reverts commit 0284801fbd.
2023-12-22 16:38:24 +08:00
Qiyuan Gong
0284801fbd
[LLM] IPEX auto importer turn on by default for XPU ( #9730 )
...
* Set BIGDL_IMPORT_IPEX default to true, i.e., auto import IPEX for XPU.
* Remove import intel_extension_for_pytorch as ipex from GPU example.
* Add support for bigdl-core-xe-21.
2023-12-22 16:20:32 +08:00
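The auto importer #9730 turns on can be sketched as an env-gated import; the default-on behavior mirrors the commit message, while the error handling is an assumption:

```python
import importlib
import os

# Auto-import IPEX for XPU unless the user opts out via the env var.
if os.environ.get("BIGDL_IMPORT_IPEX", "true").lower() == "true":
    try:
        ipex = importlib.import_module("intel_extension_for_pytorch")
    except ImportError:
        ipex = None  # IPEX not installed; proceed without it
```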
Guancheng Fu
fdf93c9267
Implement selective batching for vLLM ( #9659 )
...
* add control to load hf model
* finish initial version of selective_batching
* temp
* finish
* Remove print statement
* fix error
* Apply yang's optimization
* a version that works
* We need to check kv_cache passed in, this could be an error. TODO: add fast decoding path
* format
* temp solution: not batching prefill requests
* a version that works for prefill batching
* format
* a solid version: works normally
* a temp version
* Solid version: remove redundant functions
* fix format
* format
* solid: add option to enable selective_batching
* remove logic for using transformer models
* format
* format
* solid: enable argument VLLM_ENABLE_SELECTIVE_BATCHING
* format
* finish
* format
2023-12-22 13:45:46 +08:00
Ruonan Wang
2f36769208
LLM: bigdl-llm lora support & lora example ( #9740 )
...
* lora support and single card example
* support multi-card, refactor code
* fix model id and style
* remove torch patch, add two new class for bf16, update example
* fix style
* change to training_mode
* small fix
* add more info in help
* fixstyle, update readme
* fix ut
* fix ut
* Handling compatibility issues with default LoraConfig
2023-12-22 11:05:39 +08:00
SONG Ge
ba0b939579
[LLM] Support transformers-v4.36.0 on mistral model ( #9744 )
...
* add support transformers-v4.36.0 on mistral model
* python/llm/src/bigdl/llm/transformers/models/mistral.py
* make the redundant implementation as utils
* fix code style
* fix
* fix style
* update with utils enough_kv_room
2023-12-22 09:59:27 +08:00
Xin Qiu
e36111e713
mixtral fused qkv and rope ( #9724 )
...
* mixtral fused qkv and rope
* fix and clean
* fix style
* update
* update
* fix
* update
* fix
2023-12-22 09:26:35 +08:00
Jiao Wang
e4f6e43675
safetensor to false ( #9728 )
2023-12-21 14:41:51 -08:00
Yishuo Wang
426660b88e
simplify qwen attention ( #9747 )
2023-12-21 17:53:29 +08:00
Wang, Jian4
984697afe2
LLM: Add bloom gguf support ( #9734 )
...
* init
* update bloom add merges
* update
* update readme
* update for llama error
* update
2023-12-21 14:06:25 +08:00
Heyang Sun
df775cf316
fix python style ( #9742 )
...
* fix python style
* fix
* fix
2023-12-21 11:25:05 +08:00
Xin Qiu
6c3e698bf1
mistral decoding_fast_path and fused mlp ( #9714 )
...
* mistral decoding_fast_path and fused mlp
* meet code review
2023-12-21 10:11:37 +08:00
Heyang Sun
d157f623b6
Load Mixtral gguf in a block-wise way ( #9725 )
...
* Load Mixtral gguf in a block-wise way
* refine
2023-12-21 10:03:23 +08:00
Zhao Changmin
4bda975a3e
LLM: Align lowbit model config ( #9735 )
...
* align lowbit model config
2023-12-21 09:48:58 +08:00
Wang, Jian4
e1e921f425
LLM: gguf other model using dtype ( #9729 )
2023-12-21 09:33:40 +08:00
Yishuo Wang
13ea6330bd
optimize qwen rope ( #9737 )
2023-12-20 17:34:34 +08:00
Ziteng Zhang
4c032a433e
[LLM] Add glibc checker ( #9624 )
...
* Add glibc checker
* Add env BIGDL_GLIBC_CHECK to control glibc checker. The default is false, i.e., don't check.
2023-12-20 16:52:43 +08:00
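A hedged sketch of the opt-in glibc checker #9624 adds (default off, per the commit body); the minimum version below is a placeholder:

```python
import ctypes
import os

def check_glibc(min_version=(2, 17)):
    # Only check when BIGDL_GLIBC_CHECK is explicitly enabled.
    if os.environ.get("BIGDL_GLIBC_CHECK", "false").lower() != "true":
        return
    libc = ctypes.CDLL("libc.so.6")
    libc.gnu_get_libc_version.restype = ctypes.c_char_p
    found = tuple(int(p) for p in libc.gnu_get_libc_version().decode().split("."))
    if found < min_version:
        raise RuntimeError(f"glibc {found} is older than required {min_version}")
```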