ipex-llm

Author	SHA1	Message	Date
binbin Deng	4ff2ca9d0d	LLM: fix loss error on Arc (#9550 )	2023-11-29 15:16:18 +08:00
Yishuo Wang	65121c7997	support loading q4_1/q5_0/q5_1/q8_0 gguf model (#9546 )	2023-11-29 14:40:37 +08:00
Yuwen Hu	5f5ca38b74	[LLM Doc] Fix api doc rendering error (#9542 ) * Fix api rendering error * Fix python style	2023-11-29 09:17:09 +08:00
Yishuo Wang	a86c6e0b56	[LLM] support loading gguf model (#9544 )	2023-11-28 15:51:15 +08:00
Xiangyu Tian	916c338772	fix bugs in vllm length check (#9543 )	2023-11-28 11:09:54 +08:00
Zhao Changmin	e7e0cd3b5e	CPU Pinned embedding Layer (#9538 ) * CPU Pinned embedding	2023-11-28 09:46:31 +08:00
Guancheng Fu	963a5c8d79	Add vLLM-XPU version's README/examples (#9536 ) * test * test * fix last kv cache * add xpu readme * remove numactl for xpu example * fix link error * update max_num_batched_tokens logic * add explaination * add xpu environement version requirement * refine gpu memory * fix * fix style	2023-11-28 09:44:03 +08:00
Guancheng Fu	b6c3520748	Remove xformers from vLLM-CPU (#9535 )	2023-11-27 11:21:25 +08:00
binbin Deng	6bec0faea5	LLM: support Mistral AWQ models (#9520 )	2023-11-24 16:20:22 +08:00
Ruonan Wang	914a5a5a27	LLM: fix abnormal Mistral GPU accuracy by updating rms_norm (#9529 )	2023-11-24 15:37:50 +08:00
SONG Ge	3d24823cda	hot-fix mistral kv_cache (#9528 )	2023-11-24 14:33:04 +08:00
Zhao Changmin	42b7a16bc5	Replace torch.bmm with safe_bmm (#9519 ) * replace bmm with safe one * rename args and deprecated warning	2023-11-24 12:16:48 +08:00
Ruonan Wang	b63aae8a8e	LLM: add flash attention support for llama (#9518 ) * add initial flash attention for llama * accelerate fp32 first token by changing to fp16 in advance * support fp32	2023-11-23 18:40:18 +08:00
Guancheng Fu	bf579507c2	Integrate vllm (#9310 ) * done * Rename structure * add models * Add structure/sampling_params,sequence * add input_metadata * add outputs * Add policy,logger * add and update * add parallelconfig back * core/scheduler.py * Add llm_engine.py * Add async_llm_engine.py * Add tested entrypoint * fix minor error * Fix everything * fix kv cache view * fix * fix * fix * format&refine * remove logger from repo * try to add token latency * remove logger * Refine config.py * finish worker.py * delete utils.py * add license * refine * refine sequence.py * remove sampling_params.py * finish * add license * format * add license * refine * refine * Refine line too long * remove exception * so dumb style-check * refine * refine * refine * refine * refine * refine * add README * refine README * add warning instead error * fix padding * add license * format * format * format fix * Refine vllm dependency (#1) vllm dependency clear * fix licence * fix format * fix format * fix * adapt LLM engine * fix * add license * fix format * fix * Moving README.md to the correct position * Fix readme.md * done * guide for adding models * fix * Fix README.md * Add new model readme * remove ray-logic * refactor arg_utils.py * remove distributed_init_method logic * refactor entrypoints * refactor input_metadata * refactor model_loader * refactor utils.py * refactor models * fix api server * remove vllm.stucture * revert by txy 1120 * remove utils * format * fix license * add bigdl model * Refer to a specfic commit * Change code base * add comments * add async_llm_engine comment * refine * formatted * add worker comments * add comments * add comments * fix style * add changes --------- Co-authored-by: xiangyuT <xiangyu.tian@intel.com> Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com> Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>	2023-11-23 16:46:45 +08:00
Qiyuan Gong	0f0c6bb631	[LLM] Fix Qwen registered_causal_mask is None (#9513 ) * Add registered_causal_mask init based on `2abd8e5777`.	2023-11-23 09:28:04 +08:00
Ruonan Wang	076d106ef5	LLM: GPU QLoRA update to bf16 to accelerate gradient checkpointing (#9499 ) * update to bf16 to accelerate gradient checkpoint * add utils and fix ut	2023-11-21 17:08:36 +08:00
Xin Qiu	50b01058f1	enable new q4_1 (#9479 )	2023-11-17 14:58:57 +08:00
Zhao Changmin	30abd304a7	LLM: Fix baichuan pre-normalize model tensor assigning issue when loading (#9481 ) * No need to normalized when loading	2023-11-16 21:57:28 +08:00
Ruonan Wang	c0ef70df02	llm: quick fix of fast_rms_norm (#9480 )	2023-11-16 14:42:16 +08:00
Yina Chen	d5263e6681	Add awq load support (#9453 ) * Support directly loading GPTQ models from huggingface * fix style * fix tests * change example structure * address comments * fix style * init * address comments * add examples * fix style * fix style * fix style * fix style * update * remove * meet comments * fix style --------- Co-authored-by: Yang Wang <yang3.wang@intel.com>	2023-11-16 14:06:25 +08:00
Ruonan Wang	d2c064124a	LLM: update rms related usage to suport ipex 2.1 new api (#9466 ) * update rms related usage * fix style	2023-11-16 11:21:50 +08:00
Yuwen Hu	731b0aaade	Empty cache after embedding to cpu (#9477 )	2023-11-16 10:52:30 +08:00
Yang Wang	51d07a9fd8	Support directly loading gptq models from huggingface (#9391 ) * Support directly loading GPTQ models from huggingface * fix style * fix tests * change example structure * address comments * fix style * address comments	2023-11-13 20:48:12 -08:00
SONG Ge	2888818b3a	[LLM] Support mixed_fp8 on Arc (#9415 ) * ut gpu allocation memory fix * support mix_8bit on arc * rename mixed_4bit to mixed_fp4 and mixed_8bit to mixed_fp8 * revert unexpected changes * revert unexpected changes * unify common logits * rename in llm xmx_checker * fix typo error and re-unify	2023-11-13 09:26:30 +08:00
Heyang Sun	df8e4d7889	[LLM] apply allreduce and bias to training in LowBitLinear (#9395 )	2023-11-09 14:35:54 +08:00
Wang, Jian4	40cead6b5b	LLM: Fix CPU qlora dtype convert issue (#9394 )	2023-11-09 14:34:01 +08:00
Ruonan Wang	bfca76dfa7	LLM: optimize QLoRA by updating lora convert logic (#9372 ) * update convert logic of qlora * update * refactor and further improve performance * fix style * meet code review	2023-11-08 17:46:49 +08:00
Ruonan Wang	7e8fb29b7c	LLM: optimize QLoRA by reducing convert time (#9370 )	2023-11-08 13:14:34 +08:00
Yishuo Wang	bfd9f88f0d	[LLM] Use fp32 as dtype when batch_size <=8 and qtype is q4_0/q8_0/fp8 (#9365 )	2023-11-08 09:54:53 +08:00
Heyang Sun	fae6db3ddc	[LLM] refactor cpu low-bit forward logic (#9366 ) * [LLM] refactor cpu low-bit forward logic * fix style * Update low_bit_linear.py * Update low_bit_linear.py * refine	2023-11-07 15:09:16 +08:00
Heyang Sun	af94058203	[LLM] Support CPU deepspeed distributed inference (#9259 ) * [LLM] Support CPU Deepspeed distributed inference * Update run_deepspeed.py * Rename * fix style * add new codes * refine * remove annotated codes * refine * Update README.md * refine doc and example code	2023-11-06 17:56:42 +08:00
Xin Qiu	1420e45cc0	Chatglm2 rope optimization on xpu (#9350 )	2023-11-06 13:56:34 +08:00
Yuwen Hu	a0150bb205	[LLM] Move embedding layer to CPU for iGPU inference (#9343 ) * Move embedding layer to CPU for iGPU llm inference * Empty cache after to cpu * Remove empty cache as it seems to have some negative effect to first token	2023-11-03 11:13:45 +08:00
Yishuo Wang	726203d778	[LLM] Replace Embedding layer to fix it on CPU (#9254 )	2023-11-01 13:58:10 +08:00
Yang Wang	e1bc18f8eb	fix import ipex problem (#9323 ) * fix import ipex problem * fix style	2023-10-31 20:31:34 -07:00
Yina Chen	2262ae4d13	Support MoFQ4 on arc (#9301 ) * init * update * fix style * fix style * fix style * meet comments	2023-11-01 10:59:46 +08:00
Yang Wang	163d033616	Support qlora in CPU (#9233 ) * support qlora in CPU * revert example * fix style	2023-10-27 14:01:15 -07:00
Cengguang Zhang	44b5fcc190	LLM: fix pretraining_tp argument issue. (#9281 )	2023-10-26 18:43:58 +08:00
WeiguangHan	6b2a32eba2	LLM: add missing function for PyTorch InternLM model (#9285 )	2023-10-26 18:05:23 +08:00
Yina Chen	f879c48f98	fp8 convert use ggml code (#9277 )	2023-10-26 17:03:29 +08:00
Yina Chen	e2264e8845	Support arc fp4 (#9266 ) * support arc fp4 * fix style * fix style	2023-10-25 15:42:48 +08:00
Yang Wang	067c7e8098	Support deepspeed AutoTP (#9230 ) * Support deepspeed * add test script * refactor convert * refine example * refine * refine example * fix style * refine example and adapte latest ipex * fix style	2023-10-24 23:46:28 -07:00
Jin Qiao	90162264a3	LLM: replace torch.float32 with auto type (#9261 )	2023-10-24 17:12:13 +08:00
SONG Ge	bd5215d75b	[LLM] Reimplement chatglm fuse rms optimization (#9260 ) * re-implement chatglm rope rms * update	2023-10-24 16:35:12 +08:00
SONG Ge	bfc1e2d733	add fused rms optimization for chatglm model (#9256 )	2023-10-24 14:40:58 +08:00
Guancheng Fu	f37547249d	Refine README/CICD (#9253 )	2023-10-24 12:56:03 +08:00
binbin Deng	db37edae8a	LLM: update langchain api document page (#9222 )	2023-10-24 10:13:41 +08:00
Wang, Jian4	c14a61681b	Add load low-bit in model-serving for reduce EPC (#9239 ) * init load low-bit * fix * fix	2023-10-23 11:28:20 +08:00
Yina Chen	0383306688	Add arc fp8 support (#9232 ) * add fp8 support * add log * fix style	2023-10-20 17:15:07 +08:00
Yang Wang	118249b011	support transformers 4.34+ for llama (#9229 )	2023-10-19 22:36:30 -07:00
Chen, Zhentao	5850241423	correct Readme GPU example and API docstring (#9225 ) * update readme to correct GPU usage * update from_pretrained supported low bit options * fix stype check	2023-10-19 16:08:47 +08:00
Yang Wang	b0ddde0410	Fix removing convert dtype bug (#9216 ) * Fix removing convert dtype bug * fix style	2023-10-18 11:24:22 -07:00
Ruonan Wang	942d6418e7	LLM: fix chatglm kv cache (#9215 )	2023-10-18 19:09:53 +08:00
SONG Ge	0765f94770	[LLM] Optimize kv_cache for mistral model family (#9189 ) * add kv_cache optimization for mistral model * kv_cache optimize for mistral * update stylr * update	2023-10-18 15:13:37 +08:00
Ruonan Wang	3555ebc148	LLM: fix wrong length in gptj kv_cache optimization (#9210 ) * fix wrong length in gptj kv cache * update	2023-10-18 14:59:02 +08:00
Shengsheng Huang	6dad8d16df	optimize NormHead for Baichuan2 (#9205 ) * optimize NormHead for Baichuan2 * fix ut and change name * rename functions	2023-10-18 14:05:07 +08:00
Ruonan Wang	09815f7064	LLM: fix RMSNorm optimization of Baichuan2-13B/Baichuan-13B (#9204 ) * fix rmsnorm of baichuan2-13B * update baichuan1-13B too * fix style	2023-10-17 18:40:34 +08:00
Ruonan Wang	c0497ab41b	LLM: support kv_cache optimization for Qwen-VL-Chat (#9193 ) * dupport qwen_vl_chat * fix style	2023-10-17 13:33:56 +08:00
binbin Deng	1cd9ab15b8	LLM: fix ChatGLMConfig check (#9191 )	2023-10-17 11:52:56 +08:00
Yang Wang	7160afd4d1	Support XPU DDP training and autocast for LowBitMatmul (#9167 ) * support autocast in low bit matmul * Support XPU DDP training * fix amp	2023-10-16 20:47:19 -07:00
Ruonan Wang	77afb8796b	LLM: fix convert of chatglm (#9190 )	2023-10-17 10:48:13 +08:00
dingbaorong	af3b575c7e	expose modules_to_not_convert in optimize_model (#9180 ) * expose modules_to_not_convert in optimize_model * some fixes	2023-10-17 09:50:26 +08:00
Cengguang Zhang	5ca8a851e9	LLM: add fuse optimization for Mistral. (#9184 ) * add fuse optimization for mistral. * fix. * fix * fix style. * fix. * fix error. * fix style. * fix style.	2023-10-16 16:50:31 +08:00
Jiao Wang	49e1381c7f	update rope (#9155 )	2023-10-15 21:51:45 -07:00
binbin Deng	a164c24746	LLM: add kv_cache optimization for chatglm2-6b-32k (#9165 )	2023-10-16 10:43:15 +08:00
Yang Wang	7a2de00b48	Fixes for xpu Bf16 training (#9156 ) * Support bf16 training * Use a stable transformer version * remove env * fix style	2023-10-14 21:28:59 -07:00
Cengguang Zhang	51a133de56	LLM: add fuse rope and norm optimization for Baichuan. (#9166 ) * add fuse rope optimization. * add rms norm optimization.	2023-10-13 17:36:52 +08:00
Cengguang Zhang	433f408081	LLM: Add fuse rope and norm optimization for Aquila. (#9161 ) * add fuse norm optimization. * add fuse rope optimization	2023-10-13 14:18:37 +08:00
SONG Ge	e7aa67e141	[LLM] Add rope optimization for internlm (#9159 ) * add rope and norm optimization for internlm and gptneox * revert gptneox back and split with pr#9155 # * add norm_forward * style fix * update * update	2023-10-13 14:18:28 +08:00
Ruonan Wang	b8aee7bb1b	LLM: Fix Qwen kv_cache optimization (#9148 ) * first commit * ut pass * accelerate rotate half by using common util function * fix style	2023-10-12 15:49:42 +08:00
binbin Deng	69942d3826	LLM: fix model check before attention optimization (#9149 )	2023-10-12 15:21:51 +08:00
binbin Deng	eb3fb18eb4	LLM: improve PyTorch API doc (#9128 )	2023-10-11 15:03:39 +08:00
Zhao Changmin	1709beba5b	LLM: Explicitly close pickle file pointer before removing temporary directory (#9120 ) * fp close	2023-10-10 14:57:23 +08:00
binbin Deng	e4d1457a70	LLM: improve transformers style API doc (#9113 )	2023-10-10 09:31:00 +08:00
Zhao Changmin	edccfb2ed3	LLM: Check model device type (#9092 ) * check model device	2023-10-09 15:49:15 +08:00
Yina Chen	4c4f8d1663	[LLM]Fix Arc falcon abnormal output issue (#9096 ) * update * update * fix error & style * fix style * update train * to input_seq_size	2023-10-09 15:09:37 +08:00
Zhao Changmin	548e4dd5fe	LLM: Adapt transformers models for `optimize model` SL (#9022 ) * LLM: Adapt transformers model for SL	2023-10-09 11:13:44 +08:00
Ruonan Wang	f64257a093	LLM: basic api support for esimd fp16 (#9067 ) * basic api support for fp16 * fix style * fix * fix error and style * fix style * meet code review * update based on comments	2023-10-09 11:05:17 +08:00
Xin Qiu	b3e94a32d4	change log4error import (#9098 )	2023-10-08 09:23:28 +08:00
Kai Huang	78ea7ddb1c	Combine apply_rotary_pos_emb for gpt-neox (#9074 )	2023-10-07 16:27:46 +08:00
Yang Wang	36dd4afd61	Fix llama when rope scaling is not None (#9086 ) * Fix llama when rope scaling is not None * fix style * fix style	2023-10-06 13:27:37 -07:00
Yang Wang	fcb1c618a0	using bigdl-llm fused rope for llama (#9066 ) * optimize llama xpu rope * fix bug * fix style * refine append cache * remove check * do not cache cos sin * remove unnecessary changes * clean up * fix style * check for training	2023-10-06 09:57:29 -07:00
Jiao Wang	aefa5a5bfe	Qwen kv cache (#9079 ) * qwen and aquila * update * update * style	2023-10-05 11:59:17 -07:00
Jiao Wang	d5ca1f32b6	Aquila KV cache optimization (#9080 ) * update * update * style	2023-10-05 11:10:57 -07:00
Yang Wang	88565c76f6	add export merged model example (#9018 ) * add export merged model example * add sources * add script * fix style	2023-10-04 21:18:52 -07:00
Yang Wang	0cd8f1c79c	Use ipex fused rms norm for llama (#9081 ) * also apply rmsnorm * fix cpu	2023-10-04 21:04:55 -07:00
Cengguang Zhang	fb883100e7	LLM: support chatglm-18b convert attention forward in benchmark scripts. (#9072 ) * add chatglm-18b convert. * fix if statement. * fix	2023-09-28 14:04:52 +08:00
Yishuo Wang	6de2189e90	[LLM] fix chatglm main choice (#9073 )	2023-09-28 11:23:37 +08:00
Cengguang Zhang	b4a1266ef0	[WIP] LLM: add kv cache support for internlm. (#9036 ) * LLM: add kv cache support for internlm * add internlm apply_rotary_pos_emb * fix. * fix style.	2023-09-25 14:16:59 +08:00
Ruonan Wang	975da86e00	LLM: fix gptneox kv cache (#9044 )	2023-09-25 13:03:57 +08:00
Jiao Wang	028a6d9383	MPT model optimize for long sequence (#9020 ) * mpt_long_seq * update * update * update * style * style2 * update	2023-09-21 21:27:23 -07:00
Ruonan Wang	b943d73844	LLM: refactor kv cache (#9030 ) * refactor utils * meet code review; update all models * small fix	2023-09-21 21:28:03 +08:00
Cengguang Zhang	868511cf02	LLM: fix kv cache issue of bloom and falcon. (#9029 )	2023-09-21 18:12:20 +08:00
Ruonan Wang	bf51ec40b2	LLM: Fix empty cache (#9024 ) * fix * fix * update example	2023-09-21 17:16:07 +08:00
Yina Chen	714884414e	fix error (#9025 )	2023-09-21 16:42:11 +08:00
SONG Ge	fa47967583	[LLM] Optimize kv_cache for gptj model family (#9010 ) * optimize gptj model family attention * add license and comment for dolly-model * remove xpu mentioned * remove useless info * code sytle * style fix * code style in gptj fix * remove gptj arch * move apply_rotary_pos_emb into utils * kv_seq_length update * use hidden_states instead of query layer to reach batch size	2023-09-21 10:42:08 +08:00
Cengguang Zhang	b3cad7de57	LLM: add bloom kv cache support (#9012 ) * LLM: add bloom kv cache support * fix style.	2023-09-20 21:10:53 +08:00
Kai Huang	156af15d1e	Add NF3 (#9008 ) * add nf3 * grammar	2023-09-20 20:03:07 +08:00
Kai Huang	6981745fe4	Optimize kv_cache for gpt-neox model family (#9015 ) * override gptneox * style * move to utils * revert	2023-09-20 19:59:19 +08:00
Cengguang Zhang	735a17f7b4	LLM: add kv cache to falcon family. (#8995 ) * add kv cache to falcon family. * fix: import error. * refactor * update comments. * add two version falcon attention forward. * fix * fix. * fix. * fix. * fix style. * fix style.	2023-09-20 15:36:30 +08:00

295 commits