ipex-llm

Author	SHA1	Message	Date
Ruonan Wang	427f75000b	LLM: fix sdp of chatglm3 (#9917 ) * fix * fix * fix	2024-01-17 13:37:28 +08:00
Yishuo Wang	94767da7cf	optimize rwkv v4 first token performance (#9912 )	2024-01-17 09:27:41 +08:00
Shaojun Liu	b909c5c9c2	GGUF load memory optimization (#9913 ) * block-wise * convert linear for module * revert * Fix PEP8 checks Error	2024-01-16 18:54:39 +08:00
Xin Qiu	dee32f7d15	copy fused rms norm's result to avoid <unk> (#9909 )	2024-01-16 16:54:08 +08:00
Ruonan Wang	8d7326ae03	LLM: fix chatglm3 sdp to support speculative decoding (#9900 ) * fix chatglm3 * fix * update * meet code review * fix	2024-01-16 11:29:13 +08:00
Guancheng Fu	9f34da7cdb	Update PVC XMX condition (#9901 ) * update pvc xmx condition * update condition * update condition	2024-01-15 15:42:15 +08:00
Yishuo Wang	6637860ddf	change xmx condition (#9896 )	2024-01-12 19:51:48 +08:00
Ruonan Wang	d9cf55bce9	LLM: fix MLP check of mixtral (#9891 )	2024-01-11 18:01:59 +08:00
Ziteng Zhang	4af88a67b9	support chatglm3 with bf16 (#9888 ) * support chatglm3 with bigdl-bf16	2024-01-11 16:45:21 +08:00
Yuwen Hu	0aef35a965	[LLM] Improve LLM doc regarding windows gpu related info (#9880 ) * Improve runtime configuration for windows * Add python 310/311 supports for wheel downloading * Add troubleshooting for windows gpu * Remove manually import ipex due to auto importer * Add info regarding cpu_embedding=True on iGPU * More info for Windows users * Small updates to API docs * Python style fix * Remove tip for loading from saved optimize_model for now * Updated based on comments * Update win info for multi-intel gpus selection * Small fix * Small fix	2024-01-11 14:37:16 +08:00
Ruonan Wang	53531ae4ee	LLM: support qkv fusion for fp8e5 (#9878 ) * update * add mistral * meet code review	2024-01-10 17:50:00 +08:00
Lilac09	cb32b985ec	add mistral and chatglm support to vllm (#9879 ) * add mistral and chatglm support to vllm * add mistral and chatglm support to vllm	2024-01-10 15:38:42 +08:00
Ruonan Wang	3e05c9e11b	LLM: update esimd sdp kernel (#9871 )	2024-01-09 18:10:01 +08:00
Yishuo Wang	36496d60ac	only use quantize kv cache on MTL (#9862 )	2024-01-09 13:24:02 +08:00
ZehuaCao	146076bdb5	Support llm-awq backend (#9856 ) * Support for LLM-AWQ Backend * fix * Update README.md * Add awqconfig * modify init * update * support llm-awq * fix style * fix style * update * fix AwqBackendPackingMethod not found error * fix style * update README * fix style --------- Co-authored-by: Uxito-Ada <414416158@qq.com> Co-authored-by: Heyang Sun <60865256+Uxito-Ada@users.noreply.github.com> Co-authored-by: cyita <yitastudy@gmail.com>	2024-01-09 13:07:32 +08:00
Ruonan Wang	fea6f16057	LLM: add mlp fusion for fp8e5 and update related check (#9860 ) * update mlp fusion * fix style * update	2024-01-09 09:56:32 +08:00
Jiao Wang	3b6372ab12	Fix Llama transformers 4.36 support (#9852 ) * support 4.36 * style * update * update * update * fix merge * update	2024-01-08 00:32:23 -08:00
Chen, Zhentao	1b585b0d40	set fp8 default as e5m2 (#9859 )	2024-01-08 15:53:57 +08:00
Ruonan Wang	dc995006cc	LLM: add flash attention for mistral / mixtral (#9846 ) * add flash attention for mistral * update * add flash attn for mixtral * fix style	2024-01-08 09:51:34 +08:00
Yishuo Wang	afaa871144	[LLM] support quantize kv cache to fp8 (#9812 )	2024-01-08 09:28:20 +08:00
Jiao Wang	248ae7fad2	LLama optimize_model to support transformers 4.36 (#9818 ) * support 4.36 * style * update * update * update	2024-01-05 11:30:18 -08:00
Ruonan Wang	a60bda3324	LLM: update check for deepspeed (#9838 )	2024-01-05 16:44:10 +08:00
Ruonan Wang	16433dd959	LLM: fix first token judgement of flash attention (#9841 ) * fix flash attention * meet code review * fix	2024-01-05 13:49:37 +08:00
Yina Chen	f919f5792a	fix kv cache out of bound (#9827 )	2024-01-05 12:38:57 +08:00
Ruonan Wang	5df31db773	LLM: fix accuracy issue of chatglm3 (#9830 ) * add attn mask for first token * fix * fix * change attn calculation * fix * fix * fix style * fix style	2024-01-05 10:52:05 +08:00
Xiangyu Tian	38c05be1c0	[LLM] Fix dtype mismatch in Baichuan2-13b (#9834 )	2024-01-04 15:34:42 +08:00
Ziteng Zhang	05b681fa85	[LLM] IPEX auto importer set on by default (#9832 ) * Set BIGDL_IMPORT_IPEX default to True * Remove import intel_extension_for_pytorch as ipex from GPU example	2024-01-04 13:33:29 +08:00
Wang, Jian4	4ceefc9b18	LLM: Support bitsandbytes config on qlora finetune (#9715 ) * test support bitsandbytesconfig * update style * update cpu example * update example * update readme * update unit test * use bfloat16 * update logic * use int4 * set default bnb_4bit_use_double_quant * update * update example * update model.py * update * support lora example	2024-01-04 11:23:16 +08:00
Ruonan Wang	20e9742fa0	LLM: fix chatglm3 issue (#9820 ) * fix chatglm3 issue * small update	2024-01-03 16:15:55 +08:00
Wang, Jian4	a54cd767b1	LLM: Add gguf falcon (#9801 ) * init falcon * update convert.py * update style	2024-01-03 14:49:02 +08:00
Qiyuan Gong	f0f9d45eac	[LLM] IPEX import support bigdl-core-xe-21 (#9769 ) Add support for bigdl-core-xe-21.	2023-12-28 15:23:58 +08:00
Guancheng Fu	5857a38321	[vLLM] Add option to adjust KV_CACHE_ALLOC_BLOCK_LENGTH (#9782 ) * add option kv_cache_block * change var name	2023-12-28 14:41:47 +08:00
Ruonan Wang	99bddd3ab4	LLM: better FP16 support for Intel GPUs (#9791 ) * initial support * fix * fix style * fix * limit esimd usage condition * refactor code * fix style * small fix * meet code review * small fix	2023-12-28 13:30:13 +08:00
Yishuo Wang	7d9f6c6efc	fix cpuinfo error (#9793 )	2023-12-28 09:23:44 +08:00
Wang, Jian4	7ed9538b9f	LLM: support gguf mpt (#9773 ) * add gguf mpt * update	2023-12-28 09:22:39 +08:00
Cengguang Zhang	d299f108d0	update falcon attention forward. (#9796 )	2023-12-28 09:11:59 +08:00
Kai Huang	689889482c	Reduce max_cache_pos to reduce Baichuan2-13B memory (#9694 ) * optimize baichuan2 memory * fix * style * fp16 mask * disable fp16 * fix style * empty cache * revert empty cache	2023-12-26 19:51:25 +08:00
Xiangyu Tian	0ea842231e	[LLM] vLLM: Add api_server entrypoint (#9783 ) Add vllm.entrypoints.api_server for benchmark_serving.py in vllm.	2023-12-26 16:03:57 +08:00
Ruonan Wang	11d883301b	LLM: fix wrong batch output caused by flash attention (#9780 ) * fix * meet code review * move batch size check to the beginning * move qlen check inside function * meet code review	2023-12-26 09:41:27 +08:00
Heyang Sun	66e286a73d	Support for Mixtral AWQ (#9775 ) * Support for Mixtral AWQ * Update README.md * Update README.md * Update awq_config.py * Update README.md * Update README.md	2023-12-25 16:08:09 +08:00
Ruonan Wang	1917bbe626	LLM: fix `BF16Linear` related training & inference issue (#9755 ) * fix bf16 related issue * fix * update based on comment & add arc lora script * update readme * update based on comment * update based on comment * update * force to bf16 * fix style * move check input dtype into function * update convert * meet code review * meet code review * update merged model to support new training_mode api * fix typo	2023-12-25 14:49:30 +08:00
Xiangyu Tian	30dab36f76	[LLM] vLLM: Fix kv cache init (#9771 ) Fix kv cache init	2023-12-25 14:17:06 +08:00
Yina Chen	449b387125	Support relora in bigdl-llm (#9687 ) * init * fix style * update * support resume & update readme * update * update * remove important * add training mode * meet comments	2023-12-25 14:04:28 +08:00
Ziteng Zhang	986f65cea9	[LLM] Add trust_remote_code for local renamed model in bigdl_llm_model.py (#9762 )	2023-12-25 11:31:14 +08:00
Guancheng Fu	daf536fb2d	vLLM: Apply attention optimizations for selective batching (#9758 ) * fuse_rope for prefill * apply kv_cache optimizations * apply fast_decoding_path * Re-enable kv_cache optimizations for prefill * reduce KV_CACHE_ALLOC_BLOCK for selective_batching	2023-12-25 10:29:31 +08:00
Qiyuan Gong	4c487313f2	Revert "[LLM] IPEX auto importer turn on by default for XPU (#9730 )" (#9759 ) This reverts commit `0284801fbd`.	2023-12-22 16:38:24 +08:00
Qiyuan Gong	0284801fbd	[LLM] IPEX auto importer turn on by default for XPU (#9730 ) * Set BIGDL_IMPORT_IPEX default to true, i.e., auto import IPEX for XPU. * Remove import intel_extension_for_pytorch as ipex from GPU example. * Add support for bigdl-core-xe-21.	2023-12-22 16:20:32 +08:00
Guancheng Fu	fdf93c9267	Implement selective batching for vLLM (#9659 ) * add control to load hf model * finish initial version of selective_batching * temp * finish * Remove print statement * fix error * Apply yang's optimization * a version that works * We need to check kv_cache passed in, this could be an error. TODO: add fast decoding path * format * temp solution: not batching prefill requests * a version that works for prefill batching * format * a solid version: works normally * a temp version * Solid version: remove redundant functions * fix format * format * solid: add option to enable selective_batching * remove logic for using transformer models * format * format * solid: enable argument VLLM_ENABLE_SELECTIVE_BATCHING * format * finish * format	2023-12-22 13:45:46 +08:00
Ruonan Wang	2f36769208	LLM: bigdl-llm lora support & lora example (#9740 ) * lora support and single card example * support multi-card, refactor code * fix model id and style * remove torch patch, add two new class for bf16, update example * fix style * change to training_mode * small fix * add more info in help * fix style, update readme * fix ut * fix ut * Handling compatibility issues with default LoraConfig	2023-12-22 11:05:39 +08:00
SONG Ge	ba0b939579	[LLM] Support transformers-v4.36.0 on mistral model (#9744 ) * add support transformers-v4.36.0 on mistral model * python/llm/src/bigdl/llm/transformers/models/mistral.py * make the redundant implementation as utils * fix code style * fix * fix style * update with utils enough_kv_room	2023-12-22 09:59:27 +08:00

351 commits