Commit graph

654 commits

Author SHA1 Message Date
Zhicun
b827f534d5
Add tokenizer_id in LangChain (#10588)
* fix low-bit

* fix

* fix style

---------

Co-authored-by: arda <arda@arda-arc12.sh.intel.com>
2024-04-03 14:25:35 +08:00
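A hypothetical usage sketch for this change, in Python. The wrapper class and the exact from_model_id/tokenizer_id signature are assumptions inferred from the commit title, not a verified API:

```python
# Hypothetical sketch of the tokenizer_id parameter added in #10588; class and
# method names are assumptions based on the commit title, not a verified API.
from ipex_llm.langchain.llms import TransformersLLM

llm = TransformersLLM.from_model_id(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    tokenizer_id="meta-llama/Llama-2-7b-chat-hf",  # load tokenizer separately from the model
    model_kwargs={"temperature": 0.7, "max_length": 64},
)
print(llm.invoke("What is AI?"))
```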
Kai Huang
c875b3c858
Add seq len check for llama softmax upcast to fp32 (#10629) 2024-04-03 12:05:13 +08:00
Jiao Wang
23e33a0ca1
Fix qwen-vl style (#10633)
* update

* update
2024-04-02 18:41:38 -07:00
binbin Deng
2bbd8a1548
LLM: fix llama2 FP16 & bs>1 & autotp on PVC and ARC (#10611) 2024-04-03 09:28:04 +08:00
Jiao Wang
654dc5ba57
Fix Qwen-VL example problem (#10582)
* update

* update

* update

* update
2024-04-02 12:17:30 -07:00
Yuwen Hu
fd384ddfb8
Optimize StableLM (#10619)
* Initial commit for stablelm optimizations

* Small style fix

* add dependency

* Add mlp optimizations

* Small fix

* add attention forward

* Remove quantize kv for now as head_dim=80

* Add merged qkv

* fix license

* Python style fix

---------

Co-authored-by: qiuxin2012 <qiuxin2012cs@gmail.com>
2024-04-02 18:58:38 +08:00
Yishuo Wang
ba8cc6bd68
optimize starcoder2-3b (#10625) 2024-04-02 17:16:29 +08:00
Shaojun Liu
a10f5a1b8d
add python style check (#10620)
* add python style check

* fix style checks

* update runner

* add ipex-llm-finetune-qlora-cpu-k8s to manually_build workflow

* update tag to 2.1.0-SNAPSHOT
2024-04-02 16:17:56 +08:00
Cengguang Zhang
58b57177e3
LLM: support bigdl quantize kv cache env and add warning. (#10623)
* LLM: support bigdl quantize kv cache env and add warning.

* fix style.

* fix comments.
2024-04-02 15:41:08 +08:00
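A minimal sketch of opting in via the environment variable this commit adds a warning for; the variable name and value are assumptions taken from the commit wording:

```python
# Sketch of enabling quantized KV cache via an environment variable; the exact
# variable name is an assumption based on the commit title.
import os
os.environ["BIGDL_QUANTIZE_KV_CACHE"] = "1"  # must be set before the model is loaded

from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    load_in_4bit=True,
    trust_remote_code=True,
)
```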
Kai Huang
0a95c556a1
Fix starcoder first token perf (#10612)
* add bias check

* update
2024-04-02 09:21:38 +08:00
Cengguang Zhang
e567956121
LLM: add memory optimization for llama. (#10592)
* add initial memory optimization.

* fix logic.

* fix logic,

* remove env var check in mlp split.
2024-04-02 09:07:50 +08:00
Ruonan Wang
bfc1caa5e5
LLM: support iq1s for llama2-70b-hf (#10596) 2024-04-01 13:13:13 +08:00
Yishuo Wang
437a349dd6
fix rwkv with pip installer (#10591) 2024-03-29 17:56:45 +08:00
Ruonan Wang
0136fad1d4
LLM: support iq1_s (#10564)
* init version

* update utils

* remove unused code
2024-03-29 09:43:55 +08:00
Qiyuan Gong
f4537798c1
Enable kv cache quantization by default for flex when 1 < batch <= 8 (#10584)
* Enable kv cache quantization by default for flex when 1 < batch <= 8.
* Change upper bound from <8 to <=8.
2024-03-29 09:43:42 +08:00
Cengguang Zhang
b44f7adbad
LLM: Disable esimd sdp for PVC GPU when batch size>1 (#10579)
* llm: disable esimd sdp for pvc bs>1.

* fix logic.

* fix: avoid call get device name twice.
2024-03-28 22:55:48 +08:00
Xin Qiu
5963239b46
Fix qwen's position_ids not long enough (#10572)
* fix position_ids

* fix position_ids
2024-03-28 17:05:49 +08:00
ZehuaCao
52a2135d83
Replace ipex with ipex-llm (#10554)
* fix ipex with ipex_llm

* fix ipex with ipex_llm

* update

* update

* update

* update

* update

* update

* update

* update
2024-03-28 13:54:40 +08:00
Cheen Hau, 俊豪
1c5eb14128
Update pip install to use --extra-index-url for ipex package (#10557)
* Change to 'pip install .. --extra-index-url' for readthedocs

* Change to 'pip install .. --extra-index-url' for examples

* Change to 'pip install .. --extra-index-url' for remaining files

* Fix URL for ipex

* Add links for ipex US and CN servers

* Update ipex cpu url

* remove readme

* Update for github actions

* Update for dockerfiles
2024-03-28 09:56:23 +08:00
binbin Deng
92dfed77be
LLM: fix abnormal output of fp16 deepspeed autotp (#10558) 2024-03-28 09:35:48 +08:00
Xiangyu Tian
51d34ca68e
Fix wrong import in speculative (#10562) 2024-03-27 18:21:07 +08:00
Guancheng Fu
04baac5a2e
Fix fastchat top_k (#10560)
* fix -1 top_k

* fix

* done
2024-03-27 16:01:58 +08:00
binbin Deng
fc8c7904f0
LLM: fix torch_dtype setting of apply fp16 optimization through optimize_model (#10556) 2024-03-27 14:18:45 +08:00
Ruonan Wang
ea4bc450c4
LLM: add esimd sdp for pvc (#10543)
* add esimd sdp for pvc

* update

* fix

* fix batch
2024-03-26 19:04:40 +08:00
Xiangyu Tian
11550d3f25
LLM: Add length check for IPEX-CPU speculative decoding (#10529)
Add length check for IPEX-CPU speculative decoding.
2024-03-26 17:47:10 +08:00
Guancheng Fu
a3b007f3b1
[Serving] Fix fastchat breaks (#10548)
* fix fastchat

* fix doc
2024-03-26 17:03:52 +08:00
Yishuo Wang
69a28d6b4c
fix chatglm (#10540) 2024-03-26 16:01:00 +08:00
binbin Deng
0a3e4e788f
LLM: fix mistral hidden_size setting for deepspeed autotp (#10527) 2024-03-26 10:55:44 +08:00
Xin Qiu
1dd40b429c
enable fp4 fused mlp and qkv (#10531)
* enable fp4 fused mlp and qkv

* update qwen

* update qwen2
2024-03-26 08:34:00 +08:00
Wang, Jian4
16b2ef49c6
Update_document by heyang (#30) 2024-03-25 10:06:02 +08:00
Wang, Jian4
a1048ca7f6
Update setup.py, add new actions, and add compatible mode (#25)
* update setup.py

* add new action

* add compatible mode
2024-03-22 15:44:59 +08:00
Wang, Jian4
9df70d95eb
Refactor bigdl.llm to ipex_llm (#24)
* Rename bigdl/llm to ipex_llm

* rm python/llm/src/bigdl

* from bigdl.llm to from ipex_llm
2024-03-22 15:41:21 +08:00
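For downstream code, the rename in #24 amounts to an import change, per the bullet "from bigdl.llm to from ipex_llm". A sketch:

```python
# Before the refactor (bigdl-llm):
# from bigdl.llm.transformers import AutoModelForCausalLM
# After the refactor (ipex-llm):
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat",
                                             load_in_4bit=True,
                                             trust_remote_code=True)
```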
Wang, Jian4
34d0a9328c LLM: Speed-up mixtral in pipeline parallel inference (#10472)
* speed-up mixtral

* fix style
2024-03-22 11:06:28 +08:00
Cengguang Zhang
b9d4280892 LLM: fix baichuan7b quantize kv abnormal output. (#10504)
* fix abnormal output.

* fix style.

* fix style.
2024-03-22 10:00:08 +08:00
Yishuo Wang
f0f317b6cf fix a typo in yuan (#10503) 2024-03-22 09:40:04 +08:00
Guancheng Fu
3a3756b51d Add FastChat bigdl_worker (#10493)
* done

* fix format

* add licence

* done

* fix doc

* refactor folder

* add license
2024-03-21 18:35:05 +08:00
Xin Qiu
dba7ddaab3 add sdp fp8 for qwen llama436 baichuan mistral baichuan2 (#10485)
* add sdp fp8

* fix style

* fix qwen

* fix baichuan 13

* revert baichuan 13b and baichuan2-13b

* fix style

* update
2024-03-21 17:23:05 +08:00
Kai Huang
30f111cd32 lm_head empty_cache for more models (#10490)
* modify constraint

* fix style
2024-03-21 17:11:43 +08:00
binbin Deng
2958ca49c0 LLM: add patching function for llm finetuning (#10247) 2024-03-21 16:01:01 +08:00
Kai Huang
021d77fd22 Remove softmax upcast fp32 in llama (#10481)
* update

* fix style
2024-03-20 18:17:34 +08:00
Yishuo Wang
cfdf8ad496 Fix modules_not_to_convert argument (#10483) 2024-03-20 17:47:03 +08:00
Xiangyu Tian
cbe24cc7e6 LLM: Enable BigDL IPEX Int8 (#10480)
Enable BigDL IPEX Int8
2024-03-20 15:59:54 +08:00
ZehuaCao
1d062e24db Update serving doc (#10475)
* update serving doc

* add tob

* update

* update

* update

* update vllm worker
2024-03-20 14:44:43 +08:00
Cengguang Zhang
4581e4f17f LLM: fix whisper model missing config. (#10473)
* fix whisper model missing config.

* fix style.

* fix style.

* style.
2024-03-20 14:22:37 +08:00
Yishuo Wang
749bedaf1e fix rwkv v5 fp16 (#10474) 2024-03-20 13:15:08 +08:00
Yuwen Hu
72bcc27da9 [LLM] Add TransformersBgeEmbeddings class in bigdl.llm.langchain.embeddings (#10459)
* Add TransformersBgeEmbeddings class in bigdl.llm.langchain.embeddings

* Small fixes
2024-03-19 18:04:35 +08:00
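A hypothetical usage sketch for the new class; the constructor arguments are assumptions modeled on LangChain's HuggingFaceBgeEmbeddings, not a confirmed signature:

```python
# Hypothetical usage of TransformersBgeEmbeddings added in #10459; constructor
# arguments are assumptions modeled on langchain's HuggingFaceBgeEmbeddings.
from bigdl.llm.langchain.embeddings import TransformersBgeEmbeddings

embeddings = TransformersBgeEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)
vec = embeddings.embed_query("What is IPEX-LLM?")
```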
Cengguang Zhang
463a86cd5d LLM: fix qwen-vl interpolation gpu abnormal results. (#10457)
* fix qwen-vl interpolation gpu abnormal results.

* fix style.

* update qwen-vl gpu example.

* fix comment and update example.

* fix style.
2024-03-19 16:59:39 +08:00
Xin Qiu
bbd749dceb qwen2 fp8 cache (#10446)
* qwen2 fp8 cache

* fix style check
2024-03-19 08:32:39 +08:00
Yang Wang
9e763b049c Support running pipeline parallel inference by vertically partitioning model to different devices (#10392)
* support pipeline parallel inference

* fix logging

* remove benchmark file

* fix

* need to warmup twice

* support qwen and qwen2

* fix lint

* remove genxir

* refine
2024-03-18 13:04:45 -07:00
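A conceptual sketch of vertical partitioning in plain PyTorch (not the PR's actual code): the first half of the layers lives on one device, the second half on another, and hidden states hop between them during forward:

```python
# Conceptual sketch of vertically partitioning a model across two devices;
# device names like "xpu:0"/"xpu:1" are the intended targets, "cpu" keeps it runnable.
import torch
import torch.nn as nn

class PipelineParallel(nn.Module):
    def __init__(self, layers, dev0="cpu", dev1="cpu"):
        super().__init__()
        half = len(layers) // 2
        self.stage0 = nn.Sequential(*layers[:half]).to(dev0)
        self.stage1 = nn.Sequential(*layers[half:]).to(dev1)
        self.dev0, self.dev1 = dev0, dev1

    def forward(self, x):
        h = self.stage0(x.to(self.dev0))
        return self.stage1(h.to(self.dev1))  # hidden states cross devices here

layers = [nn.Linear(32, 32) for _ in range(8)]  # stand-in for decoder layers
model = PipelineParallel(layers)
out = model(torch.randn(1, 32))
```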
Xiangyu Tian
dbdeaddd6a LLM: Fix log condition for BIGDL_OPT_IPEX (#10441)
remove log for BIGDL_OPT_IPEX
2024-03-18 16:03:51 +08:00
Xin Qiu
399843faf0 Baichuan 7b fp16 sdp and qwen2 pvc sdp (#10435)
* add baichuan sdp

* update

* baichuan2

* fix

* fix style

* revert 13b

* revert
2024-03-18 10:15:34 +08:00
Yishuo Wang
bd64488b2a add mask support for llama/chatglm fp8 sdp (#10433)
* add mask support for fp8 sdp

* fix chatglm2 dtype

* update
2024-03-15 17:36:52 +08:00
Xin Qiu
24473e331a Qwen2 fp16 sdp (#10427)
* qwen2 sdp and refine

* update

* update

* fix style

* remove use_flash_attention
2024-03-15 13:12:03 +08:00
Ruonan Wang
b036205be2 LLM: add fp8 sdp for chatglm2/3 (#10411)
* add fp8 sdp for chatglm2

* fix style
2024-03-15 09:38:18 +08:00
Wang, Jian4
fe8976a00f LLM: Support gguf models using low_bit and fix missing json (#10408)
* support other models using low_bit

* update readme

* update to add *.json
2024-03-15 09:34:18 +08:00
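A sketch of loading a GGUF checkpoint with an explicit low_bit, per this commit; the from_gguf name and signature are assumptions based on the repo's GGUF examples, and the file path is hypothetical:

```python
# Sketch of GGUF loading with low_bit; from_gguf's signature is an assumption,
# and the local file path is hypothetical.
from bigdl.llm.transformers import AutoModelForCausalLM

model, tokenizer = AutoModelForCausalLM.from_gguf(
    "yuan2-2b-ggml-model-q4_0.gguf",  # hypothetical local GGUF file
    low_bit="sym_int4",
)
```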
Xin Qiu
cda38f85a9 Qwen fp16 sdp (#10401)
* qwen sdp

* fix

* update

* update

* update sdp

* update

* fix style check

* add to origin type
2024-03-15 08:51:50 +08:00
dingbaorong
1c0f7ed3fa add xpu support (#10419) 2024-03-14 17:13:48 +08:00
Heyang Sun
7d29765092 refactor qwen2 forward to enable XPU (#10409)
* refactor qwen2 forward to enable XPU

* Update qwen2.py
2024-03-14 11:03:05 +08:00
ZehuaCao
f66329e35d Fix multiple get_enable_ipex function error (#10400)
* fix multiple get_enable_ipex function error

* remove get_enable_ipex_low_bit function
2024-03-14 10:14:13 +08:00
Kai Huang
76e30d8ec8 Empty cache for lm_head (#10317)
* empty cache

* add comments
2024-03-13 20:31:53 +08:00
Yishuo Wang
06a851afa9 support new baichuan model (#10404) 2024-03-13 17:45:50 +08:00
Yishuo Wang
b268baafd6 use fp8 sdp in llama (#10396) 2024-03-13 16:45:38 +08:00
Xiangyu Tian
60043a3ae8 LLM: Support Baichuan2-13b in BigDL-vLLM (#10398)
Support Baichuan2-13b in BigDL-vLLM.
2024-03-13 16:21:06 +08:00
Xiangyu Tian
e10de2c42d [Fix] LLM: Fix condition check error for speculative decoding on CPU (#10402)
Fix condition check error for speculative decoding on CPU
2024-03-13 16:05:06 +08:00
Heyang Sun
d72c0fad0d Qwen2 SDPA forward on CPU (#10395)
* Fix Qwen1.5 CPU forward

* Update convert.py

* Update qwen2.py
2024-03-13 13:10:03 +08:00
Wang, Jian4
0193f29411 LLM: Enable gguf float16 and Yuan2 model (#10372)
* enable float16

* add yuan files

* enable yuan

* enable set low_bit on yuan2

* update

* update license

* update generate

* update readme

* update python style

* update
2024-03-13 10:19:18 +08:00
Yina Chen
f5d65203c0 First token lm_head optimization (#10318)
* add lm head linear

* update

* address comments and fix style

* address comment
2024-03-13 10:11:32 +08:00
Xin Qiu
28c4a8cf5c Qwen fused qkv (#10368)
* fused qkv + rope for qwen

* quantized kv cache

* fix

* update qwen

* fixed quantized qkv

* fix

* meet code review

* update split

* convert.py

* extend when not enough kv

* fix
2024-03-12 17:39:00 +08:00
Yishuo Wang
741c2bf1df use new rms norm (#10384) 2024-03-12 17:29:51 +08:00
Xiangyu Tian
0ded0b4b13 LLM: Enable BigDL IPEX optimization for int4 (#10319)
Enable BigDL IPEX optimization for int4
2024-03-12 17:08:50 +08:00
Zhao Changmin
df2b84f7de Enable kv cache on arc batch (#10308) 2024-03-12 16:46:04 +08:00
Guancheng Fu
cc4148636d [FastChat-integration] Add initial implementation for loader (#10323)
* add initial implementation for loader

* add test method for model_loader

* data

* Refine
2024-03-12 10:54:59 +08:00
binbin Deng
dbcfc5c2fa LLM: fix error of 'AI-ModelScope/phi-2' hosted by ModelScope hub (#10364) 2024-03-11 16:19:17 +08:00
Chen, Zhentao
a425eaabfc fix from_pretrained when device_map=None (#10361)
* pr trigger

* fix error when device_map=None

* fix device_map=None
2024-03-11 16:06:12 +08:00
Yina Chen
d7b765fd3f serving xpu memory opt (#10358) 2024-03-11 15:21:22 +08:00
Ruonan Wang
be29833b2b LLM: fix qwen2 (#10356) 2024-03-11 09:29:08 +08:00
Zhicun
9026c08633 Fix llamaindex AutoTokenizer bug (#10345)
* fix tokenizer

* fix AutoTokenizer bug

* modify code style
2024-03-08 16:24:50 +08:00
Keyan (Kyrie) Zhang
7a621a4db0 Fix device_map bug by raising an error when using device_map=xpu (#10340)
* Fix device_map bug by raising an error when using device_map=xpu

* Fix sync error

* Fix python style

* Use invalidInputError instead of invalidOperationError
2024-03-08 13:38:52 +08:00
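With this fix, device_map="xpu" raises an error; the supported pattern (a sketch, using the pre-rename bigdl-llm import) is to load first and then move the model to XPU:

```python
# Sketch of the supported XPU pattern after this fix: don't pass
# device_map="xpu" to from_pretrained; move the model afterwards.
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    load_in_4bit=True,
)
model = model.to("xpu")  # requires an Intel XPU runtime to be installed
```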
Yishuo Wang
1ac193ba02 add rope theta argument (#10343) 2024-03-07 17:27:19 +08:00
Cengguang Zhang
496d18ab6d LLM: add quantize kv cache support for baichuan 7b and 13b. (#10330)
* add quantize kv cache for baichuan 7b and 13b.

* fix typo.

* fix.

* fix style.

* fix style.
2024-03-07 16:17:38 +08:00
Yina Chen
9ea499ca68 Optimize speculative decoding PVC memory usage (#10329)
* optimize memory

* update

* update

* update

* support other models

* update

* fix style
2024-03-06 09:54:21 +08:00
dingbaorong
cc796848ea fix typos (#10274)
Co-authored-by: Ariadne <wyn2000330@126.com>
2024-03-05 18:38:22 +08:00
Yishuo Wang
0011ff9f64 optimize bge large performance (#10324) 2024-03-05 17:06:03 +08:00
Cengguang Zhang
30d009bca7 LLM: support quantized kv cache for Mistral in transformers >=4.36.0 (#10326)
* support quantize kv for mistral in transformers 4.36

* update mistral support.

* fix style.
2024-03-05 16:23:50 +08:00
dingbaorong
1e6f0c6f1a Add llamaindex gpu example (#10314)
* add llamaindex example

* fix core dump

* refine readme

* add trouble shooting

* refine readme

---------

Co-authored-by: Ariadne <wyn2000330@126.com>
2024-03-05 13:36:00 +08:00
dingbaorong
fc7f10cd12 add langchain gpu example (#10277)
* first draft

* fix

* add readme for transformer_int4_gpu

* fix doc

* check device_map

* add arc ut test

* fix ut test

* fix langchain ut

* Refine README

* fix gpu mem too high

* fix ut test

---------

Co-authored-by: Ariadne <wyn2000330@126.com>
2024-03-05 13:33:57 +08:00
Cengguang Zhang
ab9fc2485f LLM: add quantize kv support for llama transformer 4.36 (#10298)
* add quantize kv support for llama transformer 4.36

* fix style.

* fix style.
2024-03-04 10:33:35 +08:00
SONG Ge
0ab40917fb [LLM] Split merged_qk into separate q/k linear (#10299)
* modify merge_qk_linear into separate q/k linear

* update
2024-03-01 16:48:55 +08:00
Yang Wang
f4d7dbcde2 use fused qkv forward in qwen2 (#10185)
* use fused qkv forward in qwen2

* support both

* fix style

* fix rope

* remove print

* fix style

* clean up
2024-03-01 16:46:35 +08:00
Wang, Jian4
beb9433cec LLM: Reduce speculative _ipex_optimize_model memory use (#10281)
* use tpp

* update ipex
2024-03-01 13:48:23 +08:00
Yuwen Hu
f0ff0eebe1 [LLM] Support quantize kv cache for Baichuan2 7B (#10280)
* Add quantized kv cache framework for Baichuan2 7B

* Support quantize kv cache for baichuan2

* Small fix

* Fix python style
2024-03-01 13:35:42 +08:00
SONG Ge
273de341d7 hot-fix silu error import (#10292) 2024-03-01 10:11:37 +08:00
Xin Qiu
232273a1b5 Enable Gemma fused mlp + Gelu (#10276)
* update llama mlp forward

* add all

* fix style check

* split

* update

* update

* update

* fix style
2024-02-29 16:53:24 +08:00
Guancheng Fu
2d930bdca8 Add vLLM bf16 support (#10278)
* add argument load_in_low_bit

* add docs

* modify gpu doc

* done

---------

Co-authored-by: ivy-lv11 <lvzc@lamda.nju.edu.cn>
2024-02-29 16:33:42 +08:00
SONG Ge
13b0bc9075 [LLM] Add quantize_kv optimization for yuan2 model (#10243)
* add initial quantize_kv support for yuan2 model

* fix yuan2 quantize_kv generation

* apply fp16 conv layer optimizations

* disable mlp for quantize_kv
2024-02-29 16:33:26 +08:00
Zhicun
4e6cc424f1 Add LlamaIndex RAG (#10263)
* run demo

* format code

* add llamaindex

* add custom LLM with bigdl

* update

* add readme

* begin ut

* add unit test

* add license

* add license

* revised

* update

* modify docs

* remove data folder

* update

* modify prompt

* fixed

* fixed

* fixed
2024-02-29 15:21:19 +08:00
Ruonan Wang
a9fd20b6ba LLM: Update qkv fusion for GGUF-IQ2 (#10271)
* first commit

* update mistral

* fix transformers==4.36.0

* fix

* disable qk for mixtral now

* fix style
2024-02-29 12:49:53 +08:00
Ruonan Wang
4b08bc1417 LLM: relax batch check of flash attention by double checking attention mask (#10270)
* relax batch check

* fix

* fix style
2024-02-29 09:39:55 +08:00
Yina Chen
07f36fbfcc Fix gptj failed to extend (#10269) 2024-02-29 09:39:27 +08:00
Yishuo Wang
cccb02dad1 fix baichuan2 13b 2k input (#10267) 2024-02-28 17:20:20 +08:00
Heyang Sun
7244fd1ba5 Fix Arc StarCoder wrong query_shape when input is long (#10268)
* Fix Arc StarCoder wrong query_shape when input is long

* Update gptbigcode.py
2024-02-28 17:07:08 +08:00
Cengguang Zhang
a4de3095f3 LLM: Support quantize kv cache in mistral. (#10261)
* init

* update quantize kv.
2024-02-28 14:08:08 +08:00
Zhicun
308e637d0d Add DeepSeek-MoE-16B-Chat (#10155)
* dsmoe-hf add

* add dsmoe pytorch

* update README

* modify comment

* remove GPU example

* update model name

* format code
2024-02-28 10:12:09 +08:00
Yang Wang
c581c6db30 draft mmint4 (#10031)
change to llm.cpp

support transposed format

revert

implement qkv fuse

fix style

change to vertically pack

change to enable_xetla

fix mlp_fusion_check

remove comments

address comments

add some comments

fix style
2024-02-27 14:55:16 -08:00
Yishuo Wang
b4fa4ab46f optimize yuan 2.0 again (#10252) 2024-02-27 14:51:42 +08:00
Heyang Sun
36a9e88104 Speculative Starcoder on CPU (#10138)
* Speculative Starcoder on CPU

* enable kv-cache pre-allocation

* refine codes

* refine

* fix style

* fix style

* fix style

* refine

* refine

* Update speculative.py

* Update gptbigcode.py

* fix style

* Update speculative.py

* enable mixed-datatype layernorm on top of torch API

* adaptive dtype

* Update README.md
2024-02-27 09:57:29 +08:00
Yishuo Wang
a47989c860 optimize yuan 2.0 performance (#10244) 2024-02-26 17:20:10 +08:00
Wang, Jian4
6c74b99a28 LLM: Update qwen readme (#10245) 2024-02-26 17:03:09 +08:00
Wang, Jian4
f9b75f900b LLM: Enable qwen target_model ipex (#10232)
* change order

* enable qwen ipex

* update qwen example

* update

* fix style

* update
2024-02-26 16:41:12 +08:00
Yuwen Hu
e38e29511c [LLM] Yuan2 MLP and Rotary optimization (#10231)
* Add optimization for rotary embedding

* Add mlp fused optimization

* Python style fix

* Fix rotary embedding due to logits difference

* Small fix
2024-02-26 15:10:08 +08:00
SONG Ge
df2f3885ba [LLM] Enable kv_cache and forward_qkv optimizations for yuan2 (#10225)
* add init kv_cache support for yuan2

* add forward qkv in yuan
2024-02-26 11:29:48 +08:00
Ruonan Wang
28513f3978 LLM: support fp16 embedding & add mlp fusion for iq2_xxs (#10219)
* add fp16 embed

* small fixes

* fix style

* fix style

* fix comment
2024-02-23 17:26:24 +08:00
Yuwen Hu
eeecd9fc08 Python style fix (#10230) 2024-02-23 17:21:23 +08:00
Yuwen Hu
e511bbd8f1 [LLM] Add basic optimization framework for Yuan2 (#10227)
* Add basic optimization framework for Yuan2

* Small fix

* Python style fix

* Small fix

* Small fix
2024-02-23 17:05:00 +08:00
Xin Qiu
30795bdfbc Gemma optimization: rms_norm, kv_cache, fused_rope, fused_rope+qkv (#10212)
* gemma optimization

* update

* update

* fix style

* meet code review
2024-02-23 10:07:24 +08:00
Guoqiong Song
63681af97e falcon for transformers 4.36 (#9960)
* falcon for transformers 4.36
2024-02-22 17:04:40 -08:00
Yina Chen
ce5840a8b7 GPT-J rope optimization on xpu (#10182)
* optimize

* update

* fix style & move use_fuse_rope

* add ipex version check

* fix style

* update

* fix style

* meet comments

* address comments

* fix style
2024-02-22 16:25:12 +08:00
Xiangyu Tian
f445217d02 LLM: Update IPEX to 2.2.0+cpu and Refactor for _ipex_optimize (#10189)
Update IPEX to 2.2.0+cpu and refactor for _ipex_optimize.
2024-02-22 16:01:11 +08:00
Heyang Sun
c876d9b5ca Support for MPT rotary embedding (#10208) 2024-02-22 15:16:31 +08:00
Ruonan Wang
5e1fee5e05 LLM: add GGUF-IQ2 examples (#10207)
* add iq2 examples

* small fix

* meet code review

* fix

* meet review

* small fix
2024-02-22 14:18:45 +08:00
SONG Ge
ca1166a0e5 [LLM] Add quantize kv_cache for Baichuan2-13B (#10203)
* add quantize kv_cache for baichuan2-13b

* style fix
2024-02-22 13:43:35 +08:00
Ruonan Wang
34ee1aa91f LLM: add esimd sdp support for chatglm3 (#10205)
* add esimd sdp support

* fix style
2024-02-22 13:37:16 +08:00
Ruonan Wang
f7c96b19ef LLM: support iq2 for mixtral (#10191)
* support name mapping for mixtral

* support mixtral mixed quantization

* fix style

* fix
2024-02-21 16:00:29 +08:00
Xin Qiu
56ad781f2f qwen2 cpu fix (#10187) 2024-02-21 11:23:51 +08:00
Zhao Changmin
4fbf449c2d for rwkv4 (#10179) 2024-02-21 10:11:10 +08:00
Ruonan Wang
3288acb8de LLM: Support embedding quantization (only q2k now) (#10170)
* basic logic added

* basic support

* support save&load, update mixed strategy

* fix style

* use int8 for lm_head

* add check for xpu
2024-02-20 16:56:57 +08:00
binbin Deng
2bb96c775c LLM: fix device setting during saving optimized model (#10154) 2024-02-20 09:52:59 +08:00
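A sketch of the save/load flow this fix concerns, using bigdl-llm's low-bit save/load methods as shown in the repo's examples:

```python
# Sketch of saving and reloading an optimized (low-bit) model, the flow whose
# device handling this commit fixes.
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("THUDM/chatglm3-6b",
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model.save_low_bit("./chatglm3-6b-4bit")  # save the already-quantized weights
model = AutoModelForCausalLM.load_low_bit("./chatglm3-6b-4bit",
                                          trust_remote_code=True)
```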
Xin Qiu
1f6d5b9f30 enable fused rmsnorm and rope qwen2 (#10163)
* qwen2

* change convert

* cleanup
2024-02-20 08:33:09 +08:00
Zhao Changmin
f8730e8dc1 Skip rescale rwkv linear when load_low_bit (#10164)
* rwkv_ld
2024-02-19 15:56:42 +08:00
Heyang Sun
3e2af5ec0a Fix IPEX Baichuan Speculative (#10162)
* Fix IPEX Baichuan Speculative

* compatible with 13B

* Update speculative.py
2024-02-19 15:27:34 +08:00
Yina Chen
23c91cdce6 [LLM] Add min_step_draft in speculative decoding (#10142)
* Fix gptj kvcache & position id

* Add min_draft_tokens in speculative decoding

* fix style

* update
2024-02-19 14:31:41 +08:00
Wang, Jian4
f2417e083c LLM: enable chatglm3-6b target_model ipex (#10085)
* init

* always make causal_mask

* not return last tensor

* update

* optimize_model = False

* enable optimized=False

* enable optimized_model=true

* speed_up ipex target_model

* remove if True

* use group_size

* update python style

* update

* update
2024-02-19 13:38:32 +08:00
Yina Chen
1508d6b089 Fix gptj kvcache & position id (#10141) 2024-02-18 10:02:49 +08:00
Yishuo Wang
4d33aac7f9 quick fix qwen2 fp8 kv cache (#10135) 2024-02-08 17:04:59 +08:00
Cengguang Zhang
39d90839aa LLM: add quantize kv cache for llama. (#10086)
* feat: add quantize kv cache for llama.

* fix style.

* add quantized attention forward function.

* revert style.

* fix style.

* fix style.

* update quantized kv cache and add quantize_qkv

* fix style.

* fix style.

* optimize quantize kv cache.

* fix style.
2024-02-08 16:49:22 +08:00
Yishuo Wang
d848efe17c add quantize kv cache support for qwen2 (#10134) 2024-02-08 16:17:21 +08:00
SONG Ge
3f79128ed7 [LLM] Enable kv_cache optimization for Qwen2 on transformers-v4.37.0 (#10131)
* add support for kv_cache optimization on transformers-v4.37.0

* enable attention forward

* style fix

* disable rotary for now
2024-02-08 14:20:26 +08:00
Ruonan Wang
063dc145ac LLM: basic support for q2k (#10132)
* basic support for q2k

* fix style
2024-02-08 13:52:01 +08:00
Cengguang Zhang
0cf6a12691 LLM: add default torch_dtype for fp16. (#10124)
* set default torch_dtype for fp16.

* fix style.

* bug fix.

* update bug fix.
2024-02-08 10:24:16 +08:00
Yishuo Wang
1aa0c623ce disable fused layer norm on UHD (#10130) 2024-02-08 10:20:01 +08:00
Yuwen Hu
a8450fc300 [LLM] Support MLP optimization for Qwen1.5 (#10123) 2024-02-08 09:15:34 +08:00
binbin Deng
925f82107e LLM: support models hosted by modelscope (#10106) 2024-02-07 16:46:36 +08:00
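A hypothetical sketch of loading a ModelScope-hosted model; the model_hub keyword is an assumption based on the commit title, not a verified signature:

```python
# Hypothetical sketch of ModelScope support; the model_hub keyword is an
# assumption inferred from the commit title.
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen-7B-Chat",       # ModelScope model id
    model_hub="modelscope",    # assumed switch between HF hub and ModelScope
    load_in_4bit=True,
    trust_remote_code=True,
)
```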
Xiangyu Tian
8953acd7d6 [LLM] Fix log condition for BIGDL_OPT_IPEX (#10115)
Fix log condition for BIGDL_OPT_IPEX
2024-02-07 10:27:10 +08:00
Yuwen Hu
518ef95abc Small fix for Nonetype error (#10104) 2024-02-06 14:58:52 +08:00
Ruonan Wang
d61f4905ac LLM: 2bit quantization initial support (#10042)
* basic quantize support

* fix new module name

* small update

* add mixed int4 with iq2_xxs

* remove print

* code refactor

* fix style

* meet code review
2024-02-06 14:58:32 +08:00
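A sketch of how the new 2-bit option would be requested; the exact low_bit string is an assumption taken from the bullet "mixed int4 with iq2_xxs":

```python
# Sketch of requesting 2-bit quantization; the "iq2_xxs" value is a hypothetical
# token inferred from the commit bullets, not a confirmed option name.
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    load_in_low_bit="iq2_xxs",  # hypothetical 2-bit quantization value
    trust_remote_code=True,
)
```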
Jiao Wang
33b9e7744d fix dimension (#10097) 2024-02-05 15:07:38 -08:00
Zhicun
7d2be7994f add phixtral and optimize phi-moe (#10052) 2024-02-05 11:12:47 +08:00
Zhicun
676d6923f2 LLM: modify transformersembeddings.embed() in langchain (#10051) 2024-02-05 10:42:10 +08:00
Jin Qiao
ad050107b3 LLM: fix mpt load_low_bit issue (#10075)
* fix

* retry

* retry
2024-02-05 10:17:07 +08:00
Ruonan Wang
8e33cb0f38 LLM: support speecht5_tts (#10077)
* support speecht5_tts

* fix
2024-02-04 13:26:42 +08:00