ipex-llm

Author	SHA1	Message	Date
Xin Qiu	1dd40b429c	enable fp4 fused mlp and qkv (#10531 ) * enable fp4 fused mlp and qkv * update qwen * update qwen2	2024-03-26 08:34:00 +08:00
Wang, Jian4	16b2ef49c6	Update_document by heyang (#30 )	2024-03-25 10:06:02 +08:00
Wang, Jian4	a1048ca7f6	Update setup.py and add new actions and add compatible mode (#25 ) * update setup.py * add new action * add compatible mode	2024-03-22 15:44:59 +08:00
Wang, Jian4	9df70d95eb	Refactor bigdl.llm to ipex_llm (#24 ) * Rename bigdl/llm to ipex_llm * rm python/llm/src/bigdl * from bigdl.llm to from ipex_llm	2024-03-22 15:41:21 +08:00
Wang, Jian4	34d0a9328c	LLM: Speed-up mixtral in pipeline parallel inference (#10472 ) * speed-up mixtral * fix style	2024-03-22 11:06:28 +08:00
Cengguang Zhang	b9d4280892	LLM: fix baichuan7b quantize kv abnormal output. (#10504 ) * fix abnormal output. * fix style. * fix style.	2024-03-22 10:00:08 +08:00
Yishuo Wang	f0f317b6cf	fix a typo in yuan (#10503 )	2024-03-22 09:40:04 +08:00
Guancheng Fu	3a3756b51d	Add FastChat bigdl_worker (#10493 ) * done * fix format * add licence * done * fix doc * refactor folder * add license	2024-03-21 18:35:05 +08:00
Xin Qiu	dba7ddaab3	add sdp fp8 for qwen llama436 baichuan mistral baichuan2 (#10485 ) * add sdp fp8 * fix style * fix qwen * fix baichuan 13 * revert baichuan 13b and baichuan2-13b * fix style * update	2024-03-21 17:23:05 +08:00
Kai Huang	30f111cd32	lm_head empty_cache for more models (#10490 ) * modify constraint * fix style	2024-03-21 17:11:43 +08:00
binbin Deng	2958ca49c0	LLM: add patching function for llm finetuning (#10247 )	2024-03-21 16:01:01 +08:00
Kai Huang	021d77fd22	Remove softmax upcast fp32 in llama (#10481 ) * update * fix style	2024-03-20 18:17:34 +08:00
Yishuo Wang	cfdf8ad496	Fix `modules_not_to_convert` argument (#10483 )	2024-03-20 17:47:03 +08:00
Xiangyu Tian	cbe24cc7e6	LLM: Enable BigDL IPEX Int8 (#10480 ) Enable BigDL IPEX Int8	2024-03-20 15:59:54 +08:00
ZehuaCao	1d062e24db	Update serving doc (#10475 ) * update serving doc * add tob * update * update * update * update vllm worker	2024-03-20 14:44:43 +08:00
Cengguang Zhang	4581e4f17f	LLM: fix whiper model missing config. (#10473 ) * fix whiper model missing config. * fix style. * fix style. * style.	2024-03-20 14:22:37 +08:00
Yishuo Wang	749bedaf1e	fix rwkv v5 fp16 (#10474 )	2024-03-20 13:15:08 +08:00
Yuwen Hu	72bcc27da9	[LLM] Add `TransformersBgeEmbeddings` class in `bigdl.llm.langchain.embeddings` (#10459 ) * Add TransformersBgeEmbeddings class in bigdl.llm.langchain.embeddings * Small fixes	2024-03-19 18:04:35 +08:00
Cengguang Zhang	463a86cd5d	LLM: fix qwen-vl interpolation gpu abnormal results. (#10457 ) * fix qwen-vl interpolation gpu abnormal results. * fix style. * update qwen-vl gpu example. * fix comment and update example. * fix style.	2024-03-19 16:59:39 +08:00
Xin Qiu	bbd749dceb	qwen2 fp8 cache (#10446 ) * qwen2 fp8 cache * fix style check	2024-03-19 08:32:39 +08:00
Yang Wang	9e763b049c	Support running pipeline parallel inference by vertically partitioning model to different devices (#10392 ) * support pipeline parallel inference * fix logging * remove benchmark file * fic * need to warmup twice * support qwen and qwen2 * fix lint * remove genxir * refine	2024-03-18 13:04:45 -07:00
Xiangyu Tian	dbdeaddd6a	LLM: Fix log condition for BIGDL_OPT_IPEX (#10441 ) remove log for BIGDL_OPT_IPEX	2024-03-18 16:03:51 +08:00
Xin Qiu	399843faf0	Baichuan 7b fp16 sdp and qwen2 pvc sdp (#10435 ) * add baichuan sdp * update * baichuan2 * fix * fix style * revert 13b * revert	2024-03-18 10:15:34 +08:00
Yishuo Wang	bd64488b2a	add mask support for llama/chatglm fp8 sdp (#10433 ) * add mask support for fp8 sdp * fix chatglm2 dtype * update	2024-03-15 17:36:52 +08:00
Xin Qiu	24473e331a	Qwen2 fp16 sdp (#10427 ) * qwen2 sdp and refine * update * update * fix style * remove use_flash_attention	2024-03-15 13:12:03 +08:00
Ruonan Wang	b036205be2	LLM: add fp8 sdp for chatglm2/3 (#10411 ) * add fp8 sdp for chatglm2 * fix style	2024-03-15 09:38:18 +08:00
Wang, Jian4	fe8976a00f	LLM: Support gguf models use low_bit and fix no json(#10408 ) * support others model use low_bit * update readme * update to add *.json	2024-03-15 09:34:18 +08:00
Xin Qiu	cda38f85a9	Qwen fp16 sdp (#10401 ) * qwen sdp * fix * update * update * update sdp * update * fix style check * add to origin type	2024-03-15 08:51:50 +08:00
dingbaorong	1c0f7ed3fa	add xpu support (#10419 )	2024-03-14 17:13:48 +08:00
Heyang Sun	7d29765092	refactor qwen2 forward to enable XPU (#10409 ) * refactor awen2 forward to enable XPU * Update qwen2.py	2024-03-14 11:03:05 +08:00
ZehuaCao	f66329e35d	Fix multiple get_enable_ipex function error (#10400 ) * fix multiple get_enable_ipex function error * remove get_enable_ipex_low_bit function	2024-03-14 10:14:13 +08:00
Kai Huang	76e30d8ec8	Empty cache for lm_head (#10317 ) * empty cache * add comments	2024-03-13 20:31:53 +08:00
Yishuo Wang	06a851afa9	support new baichuan model (#10404 )	2024-03-13 17:45:50 +08:00
Yishuo Wang	b268baafd6	use fp8 sdp in llama (#10396 )	2024-03-13 16:45:38 +08:00
Xiangyu Tian	60043a3ae8	LLM: Support Baichuan2-13b in BigDL-vLLM (#10398 ) Support Baichuan2-13b in BigDL-vLLM.	2024-03-13 16:21:06 +08:00
Xiangyu Tian	e10de2c42d	[Fix] LLM: Fix condition check error for speculative decoding on CPU (#10402 ) Fix condition check error for speculative decoding on CPU	2024-03-13 16:05:06 +08:00
Heyang Sun	d72c0fad0d	Qwen2 SDPA forward on CPU (#10395 ) * Fix Qwen1.5 CPU forward * Update convert.py * Update qwen2.py	2024-03-13 13:10:03 +08:00
Wang, Jian4	0193f29411	LLM : Enable gguf float16 and Yuan2 model (#10372 ) * enable float16 * add yun files * enable yun * enable set low_bit on yuan2 * update * update license * update generate * update readme * update python style * update	2024-03-13 10:19:18 +08:00
Yina Chen	f5d65203c0	First token lm_head optimization (#10318 ) * add lm head linear * update * address comments and fix style * address comment	2024-03-13 10:11:32 +08:00
Xin Qiu	28c4a8cf5c	Qwen fused qkv (#10368 ) * fused qkv + rope for qwen * quantized kv cache * fix * update qwen * fixed quantized qkv * fix * meet code review * update split * convert.py * extend when no enough kv * fix	2024-03-12 17:39:00 +08:00
Yishuo Wang	741c2bf1df	use new rms norm (#10384 )	2024-03-12 17:29:51 +08:00
Xiangyu Tian	0ded0b4b13	LLM: Enable BigDL IPEX optimization for int4 (#10319 ) Enable BigDL IPEX optimization for int4	2024-03-12 17:08:50 +08:00
Zhao Changmin	df2b84f7de	Enable kv cache on arc batch (#10308 )	2024-03-12 16:46:04 +08:00
Guancheng Fu	cc4148636d	[FastChat-integration] Add initial implementation for loader (#10323 ) * add initial implementation for loader * add test method for model_loader * data * Refine	2024-03-12 10:54:59 +08:00
binbin Deng	dbcfc5c2fa	LLM: fix error of 'AI-ModelScope/phi-2' hosted by ModelScope hub (#10364 )	2024-03-11 16:19:17 +08:00
Chen, Zhentao	a425eaabfc	fix from_pretrained when device_map=None (#10361 ) * pr trigger * fix error when device_map=None * fix device_map=None	2024-03-11 16:06:12 +08:00
Yina Chen	d7b765fd3f	serving xpu memory opt (#10358 )	2024-03-11 15:21:22 +08:00
Ruonan Wang	be29833b2b	LLM: fix qwen2 (#10356 )	2024-03-11 09:29:08 +08:00
Zhicun	9026c08633	Fix llamaindex AutoTokenizer bug (#10345 ) * fix tokenizer * fix AutoTokenizer bug * modify code style	2024-03-08 16:24:50 +08:00
Keyan (Kyrie) Zhang	7a621a4db0	Fix device_map bug by raise an error when using device_map=xpu (#10340 ) * Fix device_map bug by raise an error when using device_map=xpu * Fix sync error * Fix python style * Use invalidInputError instead of invalidOperationError	2024-03-08 13:38:52 +08:00
Yishuo Wang	1ac193ba02	add rope theta argument (#10343 )	2024-03-07 17:27:19 +08:00
Cengguang Zhang	496d18ab6d	LLM: add quantize kv cache support for baichuan 7b and 13b. (#10330 ) * add quantize kv cache for baichuan 7b and 13b. * fix typo. * fix. * fix style. * fix style.	2024-03-07 16:17:38 +08:00
Yina Chen	9ea499ca68	Optimize speculative decoding PVC memory usage (#10329 ) * optimize memory * update * update * update * support other models * update * fix style	2024-03-06 09:54:21 +08:00
dingbaorong	cc796848ea	fix typos (#10274 ) Co-authored-by: Ariadne <wyn2000330@126.com>	2024-03-05 18:38:22 +08:00
Yishuo Wang	0011ff9f64	optimize bge large performance (#10324 )	2024-03-05 17:06:03 +08:00
Cengguang Zhang	30d009bca7	LLM: support quantized kv cache for Mistral in transformers >=4.36.0 (#10326 ) * support quantize kv for mistral in transformers 4.36 * update mistral support. * fix style.	2024-03-05 16:23:50 +08:00
dingbaorong	1e6f0c6f1a	Add llamaindex gpu example (#10314 ) * add llamaindex example * fix core dump * refine readme * add trouble shooting * refine readme --------- Co-authored-by: Ariadne <wyn2000330@126.com>	2024-03-05 13:36:00 +08:00
dingbaorong	fc7f10cd12	add langchain gpu example (#10277 ) * first draft * fix * add readme for transformer_int4_gpu * fix doc * check device_map * add arc ut test * fix ut test * fix langchain ut * Refine README * fix gpu mem too high * fix ut test --------- Co-authored-by: Ariadne <wyn2000330@126.com>	2024-03-05 13:33:57 +08:00
Cengguang Zhang	ab9fc2485f	LLM: add quantize kv support for llama transformer 4.36 (#10298 ) * add quantize kv support for llama transformer 4.36 * fix style. * fix style.	2024-03-04 10:33:35 +08:00
SONG Ge	0ab40917fb	[LLM] Split merged_qk to separated q/k linear (#10299 ) * modify merge_qk_linear to separated q/k linear * update	2024-03-01 16:48:55 +08:00
Yang Wang	f4d7dbcde2	use fused qkv forward in qwen2 (#10185 ) * use fused qkv forward in qwen2 * support both * fix style * fix rope * remove pring * fix style * clean up	2024-03-01 16:46:35 +08:00
Wang, Jian4	beb9433cec	LLM: Reduce speculative _ipex_optimize_model memory use (#10281 ) * use tpp * update ipex	2024-03-01 13:48:23 +08:00
Yuwen Hu	f0ff0eebe1	[LLM] Support quantize kv cache for Baichuan2 7B (#10280 ) * Add quatized kv cache framework for Baichuan2 7B * Support quantize kv cache for baichuan2 * Small fix * Fix python style	2024-03-01 13:35:42 +08:00
SONG Ge	273de341d7	hot-fix silu error import (#10292 )	2024-03-01 10:11:37 +08:00
Xin Qiu	232273a1b5	Enable Gemma fused mlp + Gelu (#10276 ) * update llama mlp forward * add all * fix style check * split * update * update * update * fix style	2024-02-29 16:53:24 +08:00
Guancheng Fu	2d930bdca8	Add vLLM bf16 support (#10278 ) * add argument load_in_low_bit * add docs * modify gpu doc * done --------- Co-authored-by: ivy-lv11 <lvzc@lamda.nju.edu.cn>	2024-02-29 16:33:42 +08:00
SONG Ge	13b0bc9075	[LLM] Add quantize_kv optimization for yuan2 model (#10243 ) * add initial quantize_kv support for yuan2 model * fix yuan2 quantize_kv generation * apply fp16 conv layer optimizations * disable mlp for quantize_kv	2024-02-29 16:33:26 +08:00
Zhicun	4e6cc424f1	Add LlamaIndex RAG (#10263 ) * run demo * format code * add llamaindex * add custom LLM with bigdl * update * add readme * begin ut * add unit test * add license * add license * revised * update * modify docs * remove data folder * update * modify prompt * fixed * fixed * fixed	2024-02-29 15:21:19 +08:00
Ruonan Wang	a9fd20b6ba	LLM: Update qkv fusion for GGUF-IQ2 (#10271 ) * first commit * update mistral * fix transformers==4.36.0 * fix * disable qk for mixtral now * fix style	2024-02-29 12:49:53 +08:00
Ruonan Wang	4b08bc1417	LLM: relax batch check of flash atttention by double check attention mask (#10270 ) * relax batch check * fix * fix style	2024-02-29 09:39:55 +08:00
Yina Chen	07f36fbfcc	Fix gptj failed to extend (#10269 )	2024-02-29 09:39:27 +08:00
Yishuo Wang	cccb02dad1	fix baichuan2 13b 2k input (#10267 )	2024-02-28 17:20:20 +08:00
Heyang Sun	7244fd1ba5	Fix Arc StarCoder wrong query_shape when input is long (#10268 ) * Fix Arc StarCoder wrong query_shape when input is long * Update gptbigcode.py	2024-02-28 17:07:08 +08:00
Cengguang Zhang	a4de3095f3	LLM: Support quantize kv cache in mistral. (#10261 ) * init * update quantize kv.	2024-02-28 14:08:08 +08:00
Zhicun	308e637d0d	Add DeepSeek-MoE-16B-Chat (#10155 ) * dsmoe-hf add * add dsmoe pytorch * update README * modify comment * remove GPU example * update model name * format code	2024-02-28 10:12:09 +08:00
Yang Wang	c581c6db30	draft mmint4 (#10031 ) change to llm.cpp support transposed format revert implement qkv fuse fix style change to vertically pack change to enable_xetla fix mlp_fusion_check remove comments address comments add some comments fix style	2024-02-27 14:55:16 -08:00
Yishuo Wang	b4fa4ab46f	optimize yuan 2.0 again (#10252 )	2024-02-27 14:51:42 +08:00
Heyang Sun	36a9e88104	Speculative Starcoder on CPU (#10138 ) * Speculative Starcoder on CPU * enable kv-cache pre-allocation * refine codes * refine * fix style * fix style * fix style * refine * refine * Update speculative.py * Update gptbigcode.py * fix style * Update speculative.py * enable mixed-datatype layernorm on top of torch API * adaptive dtype * Update README.md	2024-02-27 09:57:29 +08:00
Yishuo Wang	a47989c860	optimize yuan 2.0 performance (#10244 )	2024-02-26 17:20:10 +08:00
Wang, Jian4	6c74b99a28	LLM: Update qwen readme (#10245 )	2024-02-26 17:03:09 +08:00
Wang, Jian4	f9b75f900b	LLM: Enable qwen target_model ipex (#10232 ) * change order * enable qwen ipex * update qwen example * update * fix style * update	2024-02-26 16:41:12 +08:00
Yuwen Hu	e38e29511c	[LLM] Yuan2 MLP and Rotary optimization (#10231 ) * Add optimization for rotary embedding * Add mlp fused optimizatgion * Python style fix * Fix rotary embedding due to logits difference * Small fix	2024-02-26 15:10:08 +08:00
SONG Ge	df2f3885ba	[LLM] Enable kv_cache and forward_qkv optimizations for yuan2 (#10225 ) * add init kv_cache support for yuan2 * add forward qkv in yuan	2024-02-26 11:29:48 +08:00
Ruonan Wang	28513f3978	LLM: support fp16 embedding & add mlp fusion for iq2_xxs (#10219 ) * add fp16 embed * small fixes * fix style * fix style * fix comment	2024-02-23 17:26:24 +08:00
Yuwen Hu	eeecd9fc08	Python style fix (#10230 )	2024-02-23 17:21:23 +08:00
Yuwen Hu	e511bbd8f1	[LLM] Add basic optimization framework for Yuan2 (#10227 ) * Add basic optimization framework for Yuan2 * Small fix * Python style fix * Small fix * Small fix	2024-02-23 17:05:00 +08:00
Xin Qiu	30795bdfbc	Gemma optimization: rms_norm, kv_cache, fused_rope, fused_rope+qkv (#10212 ) * gemma optimization * update * update * fix style * meet code review	2024-02-23 10:07:24 +08:00
Guoqiong Song	63681af97e	falcon for transformers 4.36 (#9960 ) * falcon for transformers 4.36	2024-02-22 17:04:40 -08:00
Yina Chen	ce5840a8b7	GPT-J rope optimization on xpu (#10182 ) * optimize * update * fix style & move use_fuse_rope * add ipex version check * fix style * update * fix style * meet comments * address comments * fix style	2024-02-22 16:25:12 +08:00
Xiangyu Tian	f445217d02	LLM: Update IPEX to 2.2.0+cpu and Refactor for _ipex_optimize (#10189 ) Update IPEX to 2.2.0+cpu and refactor for _ipex_optimize.	2024-02-22 16:01:11 +08:00
Heyang Sun	c876d9b5ca	Support for MPT rotary embedding (#10208 )	2024-02-22 15:16:31 +08:00
Ruonan Wang	5e1fee5e05	LLM: add GGUF-IQ2 examples (#10207 ) * add iq2 examples * small fix * meet code review * fix * meet review * small fix	2024-02-22 14:18:45 +08:00
SONG Ge	ca1166a0e5	[LLM] Add quantize kv_cache for Baichuan2-13B (#10203 ) * add quantize kv_cache for baichuan2-13b * style fix	2024-02-22 13:43:35 +08:00
Ruonan Wang	34ee1aa91f	LLM: add esimd sdp support for chatglm3 (#10205 ) * add esimd sdp support * fix style	2024-02-22 13:37:16 +08:00
Ruonan Wang	f7c96b19ef	LLM: support iq2 for mixtral (#10191 ) * support name mapping for mixtral * support mixtral mixed quantization * fix style * fix	2024-02-21 16:00:29 +08:00
Xin Qiu	56ad781f2f	qwen2 cpu fix (#10187 )	2024-02-21 11:23:51 +08:00
Zhao Changmin	4fbf449c2d	for rwkv4 (#10179 )	2024-02-21 10:11:10 +08:00
Ruonan Wang	3288acb8de	LLM : Support embedding quantization (only q2k now) (#10170 ) * basic logic added * basic support * support save&load, update mixed strategy * fix style * use int8 for lm_head * add check for xpu	2024-02-20 16:56:57 +08:00
binbin Deng	2bb96c775c	LLM: fix device setting during saving optimized model (#10154 )	2024-02-20 09:52:59 +08:00
Xin Qiu	1f6d5b9f30	enable fused rmsnorm and rope qwen2 (#10163 ) * qwen2 * change convert * cleanup	2024-02-20 08:33:09 +08:00
Zhao Changmin	f8730e8dc1	Skip rescale rwkv linear when load_low_bit (#10164 ) * rwkv_ld	2024-02-19 15:56:42 +08:00
Heyang Sun	3e2af5ec0a	Fix IPEX Baichuan Speculative (#10162 ) * Fix IPEX Baichuan Speculative * compatible with 13B * Update speculative.py	2024-02-19 15:27:34 +08:00
Yina Chen	23c91cdce6	[LLM] Add min_step_draft in speculative decoding (#10142 ) * Fix gptj kvcache & position id * Add min_draft_tokens in speculative decoding * fix style * update	2024-02-19 14:31:41 +08:00
Wang, Jian4	f2417e083c	LLM: enable chatglm3-6b target_model ipex (#10085 ) * init * always make casual_mask * not return last tensor * update * optimize_model = False * enable optimized=False * enable optimized_model=true * speed_up ipex target_model * remove if True * use group_size * update python style * update * update	2024-02-19 13:38:32 +08:00
Yina Chen	1508d6b089	Fix gptj kvcache & position id (#10141 )	2024-02-18 10:02:49 +08:00
Yishuo Wang	4d33aac7f9	quick fix qwen2 fp8 kv cache (#10135 )	2024-02-08 17:04:59 +08:00
Cengguang Zhang	39d90839aa	LLM: add quantize kv cache for llama. (#10086 ) * feat: add quantize kv cache for llama. * fix style. * add quantized attention forward function. * revert style. * fix style. * fix style. * update quantized kv cache and add quantize_qkv * fix style. * fix style. * optimize quantize kv cache. * fix style.	2024-02-08 16:49:22 +08:00
Yishuo Wang	d848efe17c	add quantize kv cache support for qwen2 (#10134 )	2024-02-08 16:17:21 +08:00
SONG Ge	3f79128ed7	[LLM] Enable kv_cache optimization for Qwen2 on transformers-v4.37.0 (#10131 ) * add support for kv_cache optimization on transformers-v4.37.0 * enable attention forward * style fix * disable rotary for now	2024-02-08 14:20:26 +08:00
Ruonan Wang	063dc145ac	LLM: basic support for q2k (#10132 ) * basic support for q2k * fix style	2024-02-08 13:52:01 +08:00
Cengguang Zhang	0cf6a12691	LLM: add default torch_dtype for fp16. (#10124 ) * set default torch_dtype for fp16. * fix style. * bug fix. * update bug fix.	2024-02-08 10:24:16 +08:00
Yishuo Wang	1aa0c623ce	disable fused layer norm on UHD (#10130 )	2024-02-08 10:20:01 +08:00
Yuwen Hu	a8450fc300	[LLM] Support MLP optimization for Qwen1.5 (#10123 )	2024-02-08 09:15:34 +08:00
binbin Deng	925f82107e	LLM: support models hosted by modelscope (#10106 )	2024-02-07 16:46:36 +08:00
Xiangyu Tian	8953acd7d6	[LLM] Fix log condition for BIGDL_OPT_IPEX (#10115 ) Fix log condition for BIGDL_OPT_IPEX	2024-02-07 10:27:10 +08:00
Yuwen Hu	518ef95abc	Small fix for Nonetype error (#10104 )	2024-02-06 14:58:52 +08:00
Ruonan Wang	d61f4905ac	LLM: 2bit quantization initial support (#10042 ) * basis quantize support * fix new module name * small update * and mixed int4 with iq2_xxs * remove print * code refactor * fix style * meet code review	2024-02-06 14:58:32 +08:00
Jiao Wang	33b9e7744d	fix dimension (#10097 )	2024-02-05 15:07:38 -08:00
Zhicun	7d2be7994f	add phixtral and optimize phi-moe (#10052 )	2024-02-05 11:12:47 +08:00
Zhicun	676d6923f2	LLM: modify transformersembeddings.embed() in langchain (#10051 )	2024-02-05 10:42:10 +08:00
Jin Qiao	ad050107b3	LLM: fix mpt load_low_bit issue (#10075 ) * fix * retry * retry	2024-02-05 10:17:07 +08:00
Ruonan Wang	8e33cb0f38	LLM: support speecht5_tts (#10077 ) * support speecht5_tts * fix	2024-02-04 13:26:42 +08:00
ivy-lv11	428b7105f6	Add HF and PyTorch example InternLM2 (#10061 )	2024-02-04 10:25:55 +08:00
Yina Chen	77be19bb97	LLM: Support gpt-j in speculative decoding (#10067 ) * gptj * support gptj in speculative decoding * fix * update readme * small fix	2024-02-02 14:54:55 +08:00
Xin Qiu	6e0f1a1e92	use apply_rotary_pos_emb_cache_freq_xpu in mixtral (#10060 ) * use apply_rotary_pos_emb_cache_freq_xpu in mixtral * fix style	2024-02-01 15:40:49 +08:00
Heyang Sun	601024f418	Mistral CPU example of speculative decoding (#10024 ) * Mistral CPU example of speculative decoding * update transformres version * update example * Update README.md	2024-02-01 10:52:32 +08:00
Heyang Sun	968e70544d	Enable IPEX Mistral in Speculative (#10059 )	2024-02-01 10:48:16 +08:00
Yina Chen	3ca03d4e97	Add deepmind sample into bigdl-llm speculative decoding (#10041 ) * migrate deepmind sample * update * meet comments * fix style * fix style	2024-02-01 09:57:02 +08:00
Wang, Jian4	7e5cd42a5c	LLM : Update optimize ipex bf16 (#10038 ) * use 4.35.2 and remove * update rmsnorm * remove * remove * update python style * update * update python style * update * fix style * update * remove whitespace	2024-01-31 10:59:55 +08:00
Ruonan Wang	3685622f29	LLM: fix llama 4.36 forward(#10047 )	2024-01-31 10:31:10 +08:00
Yishuo Wang	53a5140eff	Optimize rwkv v5 rest token again (#10043 )	2024-01-31 10:01:11 +08:00
Ruonan Wang	6b63ba23d1	LLM: add full module name during convert (#10035 )	2024-01-30 14:43:07 +08:00
Yishuo Wang	7dfa6dbe46	add rwkv time shift optimization (#10032 )	2024-01-30 14:10:55 +08:00
Xiangyu Tian	f57d0fda8b	[LLM] Use IPEX Optimization for Self Speculative Decoding (#9997 ) Use IPEX Optimization for Self Speculative Decoding	2024-01-30 09:11:06 +08:00
Ruonan Wang	ccf8f613fb	LLM: update fp16 Linear on ARC/FLEX (#10023 )	2024-01-29 18:25:26 +08:00
Shaojun Liu	824c8029d7	Fix "local variable 'model' referenced before assignment" (#10022 )	2024-01-29 16:18:04 +08:00
Xiangyu Tian	f37e4702bc	[LLM] Use IPEX Optimization for BF16 Model (#9988 ) Use IPEX Optimization for BF16 Model by env BIGDL_OPT_IPEX=true	2024-01-29 11:28:25 +08:00
Yishuo Wang	d720554d43	simplify quantize kv cache api (#10011 )	2024-01-29 09:23:57 +08:00
Yina Chen	a3322e2a6c	add fp8 e5 to use_xmx (#10015 )	2024-01-26 18:29:46 +08:00
Qiyuan Gong	9e18ea187f	[LLM] Avoid KV Cache OOM when seq len is larger than 1 (#10006 ) * Avoid OOM during muti-round streaming chat with kv cache * For llama like kv cache, i.e., [bs, n_head, seq_len, head_dim], use is_enough_kv_cache_room_4_31. * Other models need to compare kv cache size with kv_len.	2024-01-26 17:30:08 +08:00
Ruonan Wang	a00efa0564	LLM: add mlp & qkv fusion for FP16 Llama-7B (#9932 ) * add mlp fusion for llama * add mlp fusion * fix style * update * add mm_qkv_out * fix style * update * meet code review * meet code review	2024-01-26 11:50:38 +08:00
Wang, Jian4	98ea3459e5	LLM : Fix llama draft_model dtype error (#10005 ) * fix llama draft_model dtype error * updat	2024-01-26 10:59:48 +08:00
Yishuo Wang	aae1870096	fix qwen kv cache length (#9998 )	2024-01-26 10:15:01 +08:00
Yishuo Wang	24b34b6e46	change xmx condition (#10000 )	2024-01-25 17:48:11 +08:00
Yishuo Wang	bf65548d29	Add quantize kv cache support for chaglm2/3 (#9996 )	2024-01-25 16:55:59 +08:00
Wang, Jian4	9bff84e6fd	LLM: Convert draft_model kv_cache from bf16 to fp32 (#9964 ) * convert bf16 to fp32 * update * change when init * init first and cut off after * init and exchange * update python type * update * fix bug * update * update	2024-01-25 11:20:27 +08:00
Yina Chen	27338540c3	Fix repetition_penalty not activated issue (#9989 )	2024-01-25 10:40:41 +08:00
Yuwen Hu	b27e5a27b9	Remove the check for meta device in _replace_with_low_bit_linear (#9984 )	2024-01-24 18:15:39 +08:00
Yina Chen	b176cad75a	LLM: Add baichuan2 gpu spec example (#9973 ) * add baichuan2 gpu spec example * update readme & example * remove print * fix typo * meet comments * revert * update	2024-01-24 16:40:16 +08:00
Chen, Zhentao	e0db44dcb6	fix unexpected keyword argument 'device' (#9982 ) * add device for chatglm3 only * add comment for this change * fix style * fix style * fix style again.. * finally fixed style	2024-01-24 13:20:46 +08:00
Yuwen Hu	8d28aa8e2b	[LLM] Fix the model.device problem when `cpu_embedding=True` (#9971 ) * Overwrite the device attribute for CPUPinnedParam * Expose cpu_embedding=True for Linux users * Fix python style	2024-01-23 18:51:11 +08:00
Yishuo Wang	f82782cd3b	fix starcoder (#9975 )	2024-01-23 17:24:53 +08:00
Yishuo Wang	2c8a9aaf0d	fix qwen causal mask when quantize_kv_cache=True (#9968 )	2024-01-23 16:34:05 +08:00
Yina Chen	36c665667d	Add logits processor & qwen eos stop in speculative decoding (#9963 ) * add logits processor & qwen eos * fix style * fix * fix * fix style * fix style * support transformers 4.31 * fix style * fix style --------- Co-authored-by: rnwang04 <ruonan1.wang@intel.com>	2024-01-23 15:57:28 +08:00
Xin Qiu	da4687c917	fix fp16 (#9970 )	2024-01-23 15:53:32 +08:00
Ruonan Wang	27b19106f3	LLM: add readme for speculative decoding gpu examples (#9961 ) * add readme * add readme * meet code review	2024-01-23 12:54:19 +08:00
Chen, Zhentao	39219b7e9a	add default device meta when lcmu enabled (#9941 )	2024-01-23 11:00:49 +08:00
Xin Qiu	dacf680294	add fused rotary pos emb for qwen (#9956 ) * add fused rotary pos emb for qwen * update	2024-01-23 10:37:56 +08:00
Ruonan Wang	7b1d9ad7c0	LLM: limit esimd sdp usage for k_len < 8 (#9959 ) * update * fix	2024-01-23 09:28:23 +08:00
Ruonan Wang	3e601f9a5d	LLM: Support speculative decoding in bigdl-llm (#9951 ) * first commit * fix error, add llama example * hidden print * update api usage * change to api v3 * update * meet code review * meet code review, fix style * add reference, fix style * fix style * fix first token time	2024-01-22 19:14:56 +08:00
Heyang Sun	fb91c97fe8	support for Baichuan/Baichuan2 13B Chat running speculative decoding (#9921 ) * support for Baichuan/Baichuan2 13B Chat running speculative decoding * fix stype	2024-01-22 09:11:44 +08:00
Xin Qiu	97f0cd8975	optimize Decilm 7b (#9922 ) * optimize deci * update * decilm attension forward	2024-01-19 17:31:13 +08:00
Wang, Jian4	bcaeb05272	Update optimize qwen (#9943 ) * update for n tokens input * fix dtype * update	2024-01-19 16:54:59 +08:00
Ruonan Wang	bf37b3a670	LLM: optimize CPU speculative decoding of chatglm3 (#9928 ) * update * fix style * meet code review	2024-01-19 14:10:22 +08:00
Shaojun Liu	967714bac8	gguf memory optimization for mixtral (#9939 )	2024-01-19 11:13:15 +08:00
Lilac09	7032a2ad73	Optimize gguf load memory for mistral (#9923 ) * optimize gguf load for mistral * fix output of gguf mistral * reset	2024-01-19 09:14:39 +08:00
Shaojun Liu	9a46f019d7	gguf memory optimization for baichuan (#9937 )	2024-01-19 09:11:02 +08:00
Guancheng Fu	2e1448f08e	[Serving] Add vllm_worker to fastchat serving framework (#9934 ) * add worker * finish * finish * add license * add more comments	2024-01-18 21:33:36 +08:00
Yishuo Wang	7bbb98abb6	Disable fused layer norm when using XMX to fix mpt UT (#9933 )	2024-01-18 16:22:12 +08:00
Wang, Jian4	1fc9dfa265	LLM: Update for Qwen n tokens inputs (#9931 ) * update for n tokens inputs * update style * update	2024-01-18 15:56:29 +08:00
Heyang Sun	5184f400f9	Fix Mixtral GGUF Wrong Output Issue (#9930 ) * Fix Mixtral GGUF Wrong Output Issue * fix style * fix style	2024-01-18 14:11:27 +08:00
Yishuo Wang	453df868c9	add rwkv v5 attention kernel (#9927 )	2024-01-18 10:16:29 +08:00
Ruonan Wang	054952f82f	LLM: Fix rope of chatglm3 to support speculative decoding on CPU (#9926 )	2024-01-18 09:28:10 +08:00
Ziteng Zhang	18cd1f1432	[LLM]Solve the problem of calling bmm operator in BF16Linear (#9924 ) * Solve the problem of calling bmm operator in BF16Linear	2024-01-17 18:08:35 +08:00
Yina Chen	98b86f83d4	Support fast rope for training (#9745 ) * init * init * fix style * add test and fix * address comment * update * merge upstream main	2024-01-17 15:51:38 +08:00
Ruonan Wang	427f75000b	LLM: fix sdp of chatglm3 (#9917 ) * fix * fix * fix	2024-01-17 13:37:28 +08:00
Yishuo Wang	94767da7cf	optimize rwkv v4 first token performance (#9912 )	2024-01-17 09:27:41 +08:00
Shaojun Liu	b909c5c9c2	GGUF load memory optimization (#9913 ) * block-wise * convert linear for module * revert * Fix PEP8 checks Error	2024-01-16 18:54:39 +08:00
Xin Qiu	dee32f7d15	copy fused rms norm's reuslt to avoid <unk> (#9909 )	2024-01-16 16:54:08 +08:00
Ruonan Wang	8d7326ae03	LLM: fix chatglm3 sdp to support speculative decoding (#9900 ) * fix chatglm3 * fix * update * meet code review * fix	2024-01-16 11:29:13 +08:00
Guancheng Fu	9f34da7cdb	Update PVC XMX condition (#9901 ) * update pvc xmx condition * update condition * update conditon	2024-01-15 15:42:15 +08:00
Yishuo Wang	6637860ddf	change xmx condition (#9896 )	2024-01-12 19:51:48 +08:00
Ruonan Wang	d9cf55bce9	LLM: fix MLP check of mixtral (#9891 )	2024-01-11 18:01:59 +08:00
Ziteng Zhang	4af88a67b9	support chatglm3 with bf16 (#9888 ) * support chatglm3 with bigdl-bf16	2024-01-11 16:45:21 +08:00
Yuwen Hu	0aef35a965	[LLM] Improve LLM doc regarding windows gpu related info (#9880 ) * Improve runtime configuration for windows * Add python 310/311 supports for wheel downloading * Add troubleshooting for windows gpu * Remove manually import ipex due to auto importer * Add info regarding cpu_embedding=True on iGPU * More info for Windows users * Small updates to API docs * Python style fix * Remove tip for loading from saved optimize_model for now * Updated based on comments * Update win info for multi-intel gpus selection * Small fix * Small fix	2024-01-11 14:37:16 +08:00
Ruonan Wang	53531ae4ee	LLM: support qkv fusion for fp8e5 (#9878 ) * update * add mistral * meet code review	2024-01-10 17:50:00 +08:00
Lilac09	cb32b985ec	add mistral and chatglm support to vllm (#9879 ) * add mistral and chatglm support to vllm * add mistral and chatglm support to vllm	2024-01-10 15:38:42 +08:00
Ruonan Wang	3e05c9e11b	LLM: update esimd sdp kernel (#9871 )	2024-01-09 18:10:01 +08:00
Yishuo Wang	36496d60ac	only use quantize kv cache on MTL (#9862 )	2024-01-09 13:24:02 +08:00
ZehuaCao	146076bdb5	Support llm-awq backend (#9856 ) * Support for LLM-AWQ Backend * fix * Update README.md * Add awqconfig * modify init * update * support llm-awq * fix style * fix style * update * fix AwqBackendPackingMethod not found error * fix style * update README * fix style --------- Co-authored-by: Uxito-Ada <414416158@qq.com> Co-authored-by: Heyang Sun <60865256+Uxito-Ada@users.noreply.github.com> Co-authored-by: cyita <yitastudy@gmail.com>	2024-01-09 13:07:32 +08:00
Ruonan Wang	fea6f16057	LLM: add mlp fusion for fp8e5 and update related check (#9860 ) * update mlp fusion * fix style * update	2024-01-09 09:56:32 +08:00
Jiao Wang	3b6372ab12	Fix Llama transformers 4.36 support (#9852 ) * supoort 4.36 * style * update * update * update * fix merge * update	2024-01-08 00:32:23 -08:00
Chen, Zhentao	1b585b0d40	set fp8 default as e5m2 (#9859 )	2024-01-08 15:53:57 +08:00
Ruonan Wang	dc995006cc	LLM: add flash attention for mistral / mixtral (#9846 ) * add flash attention for mistral * update * add flash attn for mixtral * fix style	2024-01-08 09:51:34 +08:00
Yishuo Wang	afaa871144	[LLM] support quantize kv cache to fp8 (#9812 )	2024-01-08 09:28:20 +08:00
Jiao Wang	248ae7fad2	LLama optimize_model to support transformers 4.36 (#9818 ) * supoort 4.36 * style * update * update * update	2024-01-05 11:30:18 -08:00
Ruonan Wang	a60bda3324	LLM: update check for deepspeed (#9838 )	2024-01-05 16:44:10 +08:00
Ruonan Wang	16433dd959	LLM: fix first token judgement of flash attention (#9841 ) * fix flash attention * meet code review * fix	2024-01-05 13:49:37 +08:00
Yina Chen	f919f5792a	fix kv cache out of bound (#9827 )	2024-01-05 12:38:57 +08:00
Ruonan Wang	5df31db773	LLM: fix accuracy issue of chatglm3 (#9830 ) * add attn mask for first token * fix * fix * change attn calculation * fix * fix * fix style * fix style	2024-01-05 10:52:05 +08:00
Xiangyu Tian	38c05be1c0	[LLM] Fix dtype mismatch in Baichuan2-13b (#9834 )	2024-01-04 15:34:42 +08:00
Ziteng Zhang	05b681fa85	[LLM] IPEX auto importer set on by default (#9832 ) * Set BIGDL_IMPORT_IPEX default to True * Remove import intel_extension_for_pytorch as ipex from GPU example	2024-01-04 13:33:29 +08:00
Wang, Jian4	4ceefc9b18	LLM: Support bitsandbytes config on qlora finetune (#9715 ) * test support bitsandbytesconfig * update style * update cpu example * update example * update readme * update unit test * use bfloat16 * update logic * use int4 * set defalut bnb_4bit_use_double_quant * update * update example * update model.py * update * support lora example	2024-01-04 11:23:16 +08:00
Ruonan Wang	20e9742fa0	LLM: fix chatglm3 issue (#9820 ) * fix chatglm3 issue * small update	2024-01-03 16:15:55 +08:00
Wang, Jian4	a54cd767b1	LLM: Add gguf falcon (#9801 ) * init falcon * update convert.py * update style	2024-01-03 14:49:02 +08:00
Qiyuan Gong	f0f9d45eac	[LLM] IPEX import support bigdl-core-xe-21 (#9769 ) Add support for bigdl-core-xe-21.	2023-12-28 15:23:58 +08:00
Guancheng Fu	5857a38321	[vLLM] Add option to adjust KV_CACHE_ALLOC_BLOCK_LENGTH (#9782 ) * add option kv_cache_block * change var name	2023-12-28 14:41:47 +08:00
Ruonan Wang	99bddd3ab4	LLM: better FP16 support for Intel GPUs (#9791 ) * initial support * fix * fix style * fix * limi esimd usage condition * refactor code * fix style * small fix * meet code review * small fix	2023-12-28 13:30:13 +08:00
Yishuo Wang	7d9f6c6efc	fix cpuinfo error (#9793 )	2023-12-28 09:23:44 +08:00
Wang, Jian4	7ed9538b9f	LLM: support gguf mpt (#9773 ) * add gguf mpt * update	2023-12-28 09:22:39 +08:00
Cengguang Zhang	d299f108d0	update falcon attention forward. (#9796 )	2023-12-28 09:11:59 +08:00
Kai Huang	689889482c	Reduce max_cache_pos to reduce Baichuan2-13B memory (#9694 ) * optimize baichuan2 memory * fix * style * fp16 mask * disable fp16 * fix style * empty cache * revert empty cache	2023-12-26 19:51:25 +08:00
Xiangyu Tian	0ea842231e	[LLM] vLLM: Add api_server entrypoint (#9783 ) Add vllm.entrypoints.api_server for benchmark_serving.py in vllm.	2023-12-26 16:03:57 +08:00
Ruonan Wang	11d883301b	LLM: fix wrong batch output caused by flash attention (#9780 ) * fix * meet code review * move batch size check to the beginning * move qlen check inside function * meet code review	2023-12-26 09:41:27 +08:00
Heyang Sun	66e286a73d	Support for Mixtral AWQ (#9775 ) * Support for Mixtral AWQ * Update README.md * Update README.md * Update awq_config.py * Update README.md * Update README.md	2023-12-25 16:08:09 +08:00
Ruonan Wang	1917bbe626	LLM: fix `BF16Linear` related training & inference issue (#9755 ) * fix bf16 related issue * fix * update based on comment & add arc lora script * update readme * update based on comment * update based on comment * update * force to bf16 * fix style * move check input dtype into function * update convert * meet code review * meet code review * update merged model to support new training_mode api * fix typo	2023-12-25 14:49:30 +08:00
Xiangyu Tian	30dab36f76	[LLM] vLLM: Fix kv cache init (#9771 ) Fix kv cache init	2023-12-25 14:17:06 +08:00
Yina Chen	449b387125	Support relora in bigdl-llm (#9687 ) * init * fix style * update * support resume & update readme * update * update * remove important * add training mode * meet comments	2023-12-25 14:04:28 +08:00
Ziteng Zhang	986f65cea9	[LLM] Add trust_remote_code for local renamed model in bigdl_llm_model.py (#9762 )	2023-12-25 11:31:14 +08:00
Guancheng Fu	daf536fb2d	vLLM: Apply attention optimizations for selective batching (#9758 ) * fuse_rope for prefil * apply kv_cache optimizations * apply fast_decoding_path * Re-enable kv_cache optimizations for prefill * reduce KV_CACHE_ALLOC_BLOCK for selective_batching	2023-12-25 10:29:31 +08:00
Qiyuan Gong	4c487313f2	Revert "[LLM] IPEX auto importer turn on by default for XPU (#9730 )" (#9759 ) This reverts commit `0284801fbd`.	2023-12-22 16:38:24 +08:00
Qiyuan Gong	0284801fbd	[LLM] IPEX auto importer turn on by default for XPU (#9730 ) * Set BIGDL_IMPORT_IPEX default to true, i.e., auto import IPEX for XPU. * Remove import intel_extension_for_pytorch as ipex from GPU example. * Add support for bigdl-core-xe-21.	2023-12-22 16:20:32 +08:00
Guancheng Fu	fdf93c9267	Implement selective batching for vLLM (#9659 ) * add control to load hf model * finish initial version of selective_batching * temp * finish * Remove print statement * fix error * Apply yang's optimization * a version that works * We need to check kv_cache passed in, this could be an error. TODO: add fast decoding path * format * temp solution: not batching prefill requests * a version that works for prefill batching * format * a solid version: works normally * a temp version * Solid version: remove redundant functions * fix format * format * solid: add option to enable selective_batching * remove logic for using transformer models * format * format * solid: enable argument VLLM_ENABLE_SELECTIVE_BATCHING * format * finish * format	2023-12-22 13:45:46 +08:00
Ruonan Wang	2f36769208	LLM: bigdl-llm lora support & lora example (#9740 ) * lora support and single card example * support multi-card, refactor code * fix model id and style * remove torch patch, add two new class for bf16, update example * fix style * change to training_mode * small fix * add more info in help * fixstyle, update readme * fix ut * fix ut * Handling compatibility issues with default LoraConfig	2023-12-22 11:05:39 +08:00
SONG Ge	ba0b939579	[LLM] Support transformers-v4.36.0 on mistral model (#9744 ) * add support transformers-v4.36.0 on mistral model * python/llm/src/bigdl/llm/transformers/models/mistral.py * make the redundant implementation as utils * fix code style * fix * fix style * update with utils enough_kv_room	2023-12-22 09:59:27 +08:00
Xin Qiu	e36111e713	mixstral fused qkv and rope (#9724 ) * mixstral fused qkv and rope * fix and clean * fix style * update * update * fix * update * fix	2023-12-22 09:26:35 +08:00
Jiao Wang	e4f6e43675	safetenor to false (#9728 )	2023-12-21 14:41:51 -08:00
Yishuo Wang	426660b88e	simplify qwen attention (#9747 )	2023-12-21 17:53:29 +08:00
Wang, Jian4	984697afe2	LLM: Add bloom gguf support (#9734 ) * init * update bloom add merges * update * update readme * update for llama error * update	2023-12-21 14:06:25 +08:00
Heyang Sun	df775cf316	fix python style (#9742 ) * fix python style * fix * fix	2023-12-21 11:25:05 +08:00
Xin Qiu	6c3e698bf1	mistral decoding_fast_path and fused mlp (#9714 ) * mistral decoding_fast_path and fused mlp * meet code review	2023-12-21 10:11:37 +08:00
Heyang Sun	d157f623b6	Load Mixtral gguf in a block-wise way (#9725 ) * Load Mixtral gguf in a block-wise way * refine	2023-12-21 10:03:23 +08:00
Zhao Changmin	4bda975a3e	LLM: Align lowbit model config (#9735 ) * align lowbit model config	2023-12-21 09:48:58 +08:00
Wang, Jian4	e1e921f425	LLM: gguf other model using dtype (#9729 )	2023-12-21 09:33:40 +08:00
Yishuo Wang	13ea6330bd	optimize qwen rope (#9737 )	2023-12-20 17:34:34 +08:00
Ziteng Zhang	4c032a433e	[LLM] Add glibc checker (#9624 ) * Add glibc checker * Add env BIGDL_GLIBC_CHECK to control glibc checker. The default is false, i.e., don't check.	2023-12-20 16:52:43 +08:00
Yina Chen	cd652a1710	Support fp8 e5m2 on arc (#9711 ) * init * fix style * update * fix style * update	2023-12-20 16:26:17 +08:00
Yishuo Wang	e54c428d30	add bf16/fp16 fuse mlp support (#9726 )	2023-12-20 10:40:45 +08:00
Heyang Sun	612651cb5d	fix typo (#9723 )	2023-12-20 09:41:59 +08:00
Yishuo Wang	522cf5ed82	[LLM] Improve chatglm2/3 rest token performance with long context (#9716 )	2023-12-19 17:29:38 +08:00
Yishuo Wang	f2e6abb563	fix mlp batch size check (#9718 )	2023-12-19 14:22:22 +08:00
Heyang Sun	1fa7793fc0	Load Mixtral GGUF Model (#9690 ) * Load Mixtral GGUF Model * refactor * fix empty tensor when to cpu * update gpu and cpu readmes * add dtype when set tensor into module	2023-12-19 13:54:38 +08:00
Qiyuan Gong	d0a3095b97	[LLM] IPEX auto importer (#9706 ) * IPEX auto importer and get_ipex_version. * Add BIGDL_IMPORT_IPEX to control auto import, default is false.	2023-12-19 13:39:38 +08:00
Yang Wang	f4fb58d99c	fusing qkv project and rope (#9612 ) * Try fusing qkv project and rope * add fused mlp * fuse append cache * fix style and clean up code * clean up	2023-12-18 16:45:00 -08:00
Cengguang Zhang	4d22add4af	LLM: fix qwen efficiency issue in perf-test.	2023-12-18 18:32:54 +08:00
Ruonan Wang	8ed89557e5	LLM: add mlp optimization of mixtral (#9709 )	2023-12-18 16:59:52 +08:00
Xin Qiu	320110d158	handle empty fused norm result (#9688 ) * handle empty fused norm result * remove fast_rms_norm * fix style	2023-12-18 09:56:11 +08:00
SONG Ge	d5b81af7bd	Support mixtral attention optimization on transformers-v4.36.0 (#9674 ) * add example code to support mistral/mixtral attention on transformers v4.36.0 * update * style fix * add update for seen-tokens * support mixtral * rm mistral change * small fix * add more comments and remove use_cache part --------- Co-authored-by: plusbang <binbin1.deng@intel.com>	2023-12-15 14:30:23 +08:00
Cengguang Zhang	adbef56001	LLM: update qwen attention forward. (#9695 ) * feat: update qwen attention forward. * fix: style.	2023-12-15 14:06:15 +08:00
Wang, Jian4	b8437a1c1e	LLM: Add gguf mistral model support (#9691 ) * add mistral support * need to upgrade transformers version * update	2023-12-15 13:37:39 +08:00
Wang, Jian4	496bb2e845	LLM: Support load BaiChuan model family gguf model (#9685 ) * support baichuan model family gguf model * update gguf generate.py * add verify models * add support model_family * update * update style * update type * update readme * update * remove support model_family	2023-12-15 13:34:33 +08:00
Yishuo Wang	9a330bfc2b	fix fuse mlp when using q5_0 or fp8 (#9689 )	2023-12-14 16:16:05 +08:00
Xin Qiu	5e46e0e5af	fix baichuan2-7b 1st token performance regression on xpu (#9683 ) * fix baichuan2-7b 1st token performance regression * add comments * fix style	2023-12-14 09:58:32 +08:00
Yishuo Wang	09ca540f9b	use fuse mlp in qwen (#9672 )	2023-12-13 17:20:08 +08:00
Ruonan Wang	c7741c4e84	LLM: update moe block convert to optimize rest token latency of Mixtral (#9669 ) * update moe block convert * further accelerate final_hidden_states * fix style * fix style	2023-12-13 16:17:06 +08:00
Xiangyu Tian	1c6499e880	[LLM] vLLM: Support Mixtral Model (#9670 ) Add Mixtral support for BigDL vLLM.	2023-12-13 14:44:47 +08:00
Ruonan Wang	dc5b1d7e9d	LLM: integrate sdp kernel for FP16 rest token inference on GPU [DG2/ATSM] (#9633 ) * integrate sdp * update api * fix style * meet code review * fix * distinguish mtl from arc * small fix	2023-12-13 11:29:57 +08:00
Qiyuan Gong	5b0e7e308c	[LLM] Add support for empty activation (#9664 ) * Add support for empty activation, e.g., [0, 4096]. Empty activation is allowed by PyTorch. * Add comments.	2023-12-13 11:07:45 +08:00
SONG Ge	284e7697b1	[LLM] Optimize ChatGLM2 kv_cache to support beam_search on ARC (#9579 ) * optimize kv_cache to support beam_search on Arc * correctness test update * fix query_length issue * simplify implementation * only enable the optimization on gpu device * limit the beam_search support only enabled with gpu device and batch_size > 1 * add comments for beam_search case and revert ut change * meet comments * add more comments to describe the differece between multi-cases	2023-12-13 11:02:14 +08:00
Ziteng Zhang	8931f2eb62	[LLM] Fix transformer qwen size mismatch and rename causal_mask (#9655 ) * Fix size mismatching caused by context_layer * Change registered_causal_mask to causal_mask	2023-12-12 20:57:40 +08:00
binbin Deng	59ce86d292	LLM: support `optimize_model=True` for Mixtral model (#9657 )	2023-12-12 16:41:26 +08:00
Heyang Sun	9f02f96160	[LLM] support for Yi AWQ model (#9648 )	2023-12-11 14:07:34 +08:00
Xin Qiu	82255f9726	Enable fused layernorm (#9614 ) * bloom layernorm * fix * layernorm * fix * fix * fix * style fix * fix * replace nn.LayerNorm	2023-12-11 09:26:13 +08:00
Yina Chen	70f5e7bf0d	Support peft LoraConfig (#9636 ) * support peft loraconfig * use testcase to test * fix style * meet comments	2023-12-08 16:13:03 +08:00
Xin Qiu	0b6f29a7fc	add fused rms norm for Yi and Qwen (#9640 )	2023-12-08 16:04:38 +08:00
Xin Qiu	5636b0ba80	set new linear status (#9639 )	2023-12-08 11:02:49 +08:00
Yuwen Hu	6f34978b94	[LLM] Add more performance tests for win iGPU (more in-out pairs, RWKV model) (#9626 ) * Add supports for loading rwkv models using from_pretrained api * Temporarily enable pr tests * Add RWKV in tests and more in-out pairs * Add rwkv for 512 tests * Make iterations smaller * Change back to nightly trigger	2023-12-07 18:55:16 +08:00
Ruonan Wang	d9b0c01de3	LLM: fix unlora module in qlora finetune (#9621 ) * fix unlora module * split train and inference	2023-12-07 16:32:02 +08:00
Yishuo Wang	7319f2c227	use fused mlp in baichuan2 (#9620 )	2023-12-07 15:50:57 +08:00
Xiangyu Tian	deee65785c	[LLM] vLLM: Delete last_kv_cache before prefilling (#9619 ) Remove last_kv_cache before prefilling to reduce peak memory usage.	2023-12-07 11:32:33 +08:00
Xiangyu Tian	0327169b50	[LLM] vLLM: fix memory leak in prepare_kv_cache (#9616 ) Revert modification in prepare_kv_cache to fix memory leak.	2023-12-07 10:08:18 +08:00
Xin Qiu	13d47955a8	use fused rms norm in chatglm2 and baichuan (#9613 ) * use fused rms norm in chatglm2 and baichuan * style fix	2023-12-07 09:21:41 +08:00
Yina Chen	404e101ded	QALora example (#9551 ) * Support qa-lora * init * update * update * update * update * update * update merge * update * fix style & update scripts * update * address comments * fix typo * fix typo --------- Co-authored-by: Yang Wang <yang3.wang@intel.com>	2023-12-06 15:36:21 +08:00
Guancheng Fu	6978b2c316	[VLLM] Change padding patterns for vLLM & clean code (#9609 ) * optimize * fix minor error * optimizations * fix style	2023-12-06 15:27:26 +08:00
Zheng, Yi	d154b38bf9	Add llama2 gpu low memory example (#9514 ) * Add low memory example * Minor fixes * Update readme.md	2023-12-05 17:29:48 +08:00
Ziteng Zhang	65934c9f4f	[LLM] Fix Qwen causal_mask and attention_mask size mismatching (#9600 ) * Fix #9582 , caused by Qwen modified modeling_qwen.py `7f62181c94 (d2h-049182)`	2023-12-05 15:15:54 +08:00
Qiyuan Gong	f211f136b6	Configurable TORCH_LINEAR_THRESHOLD from env (#9588 ) * Add TORCH_LINEAR_THRESHOLD from env (BIGDL_LLM_LINEAR_THRESHOLD) * Change default to 512	2023-12-05 13:19:47 +08:00
Xiangyu Tian	5c03651309	[LLM] vLLM: Add Preempt for scheduler (#9568 ) Implement Preempt_by_recompute method for vllm.	2023-12-03 20:16:25 +08:00
Xin Qiu	69c49d21f5	use fused rms norm (#9572 ) * use fused rms norm * meet code review	2023-11-30 21:47:41 +08:00
Yishuo Wang	7f6465518a	support loading llama tokenizer from gguf model (#9565 )	2023-11-30 14:56:12 +08:00
Yuwen Hu	34503efa6a	Fix cpu pinned embedding (#9556 )	2023-11-29 18:27:56 +08:00
binbin Deng	4ff2ca9d0d	LLM: fix loss error on Arc (#9550 )	2023-11-29 15:16:18 +08:00
Yishuo Wang	65121c7997	support loading q4_1/q5_0/q5_1/q8_0 gguf model (#9546 )	2023-11-29 14:40:37 +08:00
Yuwen Hu	5f5ca38b74	[LLM Doc] Fix api doc rendering error (#9542 ) * Fix api rendering error * Fix python style	2023-11-29 09:17:09 +08:00
Yishuo Wang	a86c6e0b56	[LLM] support loading gguf model (#9544 )	2023-11-28 15:51:15 +08:00
Xiangyu Tian	916c338772	fix bugs in vllm length check (#9543 )	2023-11-28 11:09:54 +08:00
Zhao Changmin	e7e0cd3b5e	CPU Pinned embedding Layer (#9538 ) * CPU Pinned embedding	2023-11-28 09:46:31 +08:00
Guancheng Fu	963a5c8d79	Add vLLM-XPU version's README/examples (#9536 ) * test * test * fix last kv cache * add xpu readme * remove numactl for xpu example * fix link error * update max_num_batched_tokens logic * add explaination * add xpu environement version requirement * refine gpu memory * fix * fix style	2023-11-28 09:44:03 +08:00
Guancheng Fu	b6c3520748	Remove xformers from vLLM-CPU (#9535 )	2023-11-27 11:21:25 +08:00
binbin Deng	6bec0faea5	LLM: support Mistral AWQ models (#9520 )	2023-11-24 16:20:22 +08:00
Ruonan Wang	914a5a5a27	LLM: fix abnormal Mistral GPU accuracy by updating rms_norm (#9529 )	2023-11-24 15:37:50 +08:00
SONG Ge	3d24823cda	hot-fix mistral kv_cache (#9528 )	2023-11-24 14:33:04 +08:00
Zhao Changmin	42b7a16bc5	Replace torch.bmm with safe_bmm (#9519 ) * replace bmm with safe one * rename args and deprecated warning	2023-11-24 12:16:48 +08:00
Ruonan Wang	b63aae8a8e	LLM: add flash attention support for llama (#9518 ) * add initial flash attention for llama * accelerate fp32 first token by changing to fp16 in advance * support fp32	2023-11-23 18:40:18 +08:00
Guancheng Fu	bf579507c2	Integrate vllm (#9310 ) * done * Rename structure * add models * Add structure/sampling_params,sequence * add input_metadata * add outputs * Add policy,logger * add and update * add parallelconfig back * core/scheduler.py * Add llm_engine.py * Add async_llm_engine.py * Add tested entrypoint * fix minor error * Fix everything * fix kv cache view * fix * fix * fix * format&refine * remove logger from repo * try to add token latency * remove logger * Refine config.py * finish worker.py * delete utils.py * add license * refine * refine sequence.py * remove sampling_params.py * finish * add license * format * add license * refine * refine * Refine line too long * remove exception * so dumb style-check * refine * refine * refine * refine * refine * refine * add README * refine README * add warning instead error * fix padding * add license * format * format * format fix * Refine vllm dependency (#1) vllm dependency clear * fix licence * fix format * fix format * fix * adapt LLM engine * fix * add license * fix format * fix * Moving README.md to the correct position * Fix readme.md * done * guide for adding models * fix * Fix README.md * Add new model readme * remove ray-logic * refactor arg_utils.py * remove distributed_init_method logic * refactor entrypoints * refactor input_metadata * refactor model_loader * refactor utils.py * refactor models * fix api server * remove vllm.stucture * revert by txy 1120 * remove utils * format * fix license * add bigdl model * Refer to a specfic commit * Change code base * add comments * add async_llm_engine comment * refine * formatted * add worker comments * add comments * add comments * fix style * add changes --------- Co-authored-by: xiangyuT <xiangyu.tian@intel.com> Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com> Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>	2023-11-23 16:46:45 +08:00
Qiyuan Gong	0f0c6bb631	[LLM] Fix Qwen registered_causal_mask is None (#9513 ) * Add registered_causal_mask init based on `2abd8e5777`.	2023-11-23 09:28:04 +08:00
Ruonan Wang	076d106ef5	LLM: GPU QLoRA update to bf16 to accelerate gradient checkpointing (#9499 ) * update to bf16 to accelerate gradient checkpoint * add utils and fix ut	2023-11-21 17:08:36 +08:00
Xin Qiu	50b01058f1	enable new q4_1 (#9479 )	2023-11-17 14:58:57 +08:00
Zhao Changmin	30abd304a7	LLM: Fix baichuan pre-normalize model tensor assigning issue when loading (#9481 ) * No need to normalized when loading	2023-11-16 21:57:28 +08:00
Ruonan Wang	c0ef70df02	llm: quick fix of fast_rms_norm (#9480 )	2023-11-16 14:42:16 +08:00

... 4 5 6 7 8 ...

776 commits