binbin Deng
3feb58d1e4
Support baichuan2 for level0 pipeline ( #12289 )
2024-10-29 19:24:16 +08:00
Zhao Changmin
546f455e8e
Patch sdpa check function in specific module attributes table ( #12285 )
2024-10-29 18:41:09 +08:00
Ruonan Wang
821b0033ed
[NPU L0] update layernorm & code refactor ( #12287 )
* update layernorm & code refactor
* fix style
* add common utils
* change to Pool()
* remove print
2024-10-29 15:01:45 +08:00
Yina Chen
4467645088
[NPU] Support L0 Llama groupwise ( #12276 )
* except lm_head
* remove
* support gw lm_head
* update
* fix
* remove run.bat
* fix style
* support llama3
2024-10-28 17:06:55 +08:00
Ruonan Wang
3fe2ea3081
[NPU] Reuse prefill of acc lib for pipeline ( #12279 )
* first commit
* update example
* fix style
* update example
* embedding as const
* fix generate
* code refactor
* meet code review
* fix style
* change max_output_len to max_context_len (see the sketch after this entry)
* fix all-in-one
* fix example
* add check for new tokens
2024-10-28 16:05:49 +08:00
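A minimal sketch of how the renamed max_context_len argument surfaces on the level0 pipeline path; the pipeline flag is suggested by #12255 below ("make default value of pipeline as False"), while the model id and the remaining kwargs are illustrative assumptions based on typical ipex-llm NPU examples:

```python
# Sketch only: kwargs besides max_context_len follow the usual ipex-llm NPU
# examples and are assumptions, not taken verbatim from this commit.
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # hypothetical model choice
    load_in_low_bit="sym_int4",
    optimize_model=True,
    pipeline=True,           # level0 pipeline path (defaults to False per #12255)
    max_context_len=1024,    # renamed from max_output_len in this commit
    trust_remote_code=True,
)
```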
binbin Deng
ec362e6133
Add llama3 level0 example ( #12275 )
2024-10-28 09:24:51 +08:00
SONG Ge
08cb065370
hot-fix redundant funasr import ( #12277 )
2024-10-25 19:40:39 +08:00
SONG Ge
a0c6432899
[NPU] Add support for loading a FunASR model ( #12073 )
* add support for loading funasr model
* add initial support for paraformer-encoder
* add npu ops impl
* add encoder-decoder npu pipeline
* move the first 30 layers of the paraformer encoder to npu and keep the remaining layers on cpu
2024-10-25 17:22:01 +08:00
Ruonan Wang
854398f6e0
update example to reduce peak memory usage ( #12274 )
2024-10-25 17:09:26 +08:00
Yuwen Hu
e713296090
Update all-in-one benchmark ( #12272 )
* Update all-in-one benchmark
* Small fix
* Small fix
* Small fix
2024-10-25 16:52:59 +08:00
Yuwen Hu
43b25a2fe7
Fix llama 3.2 vision on LNL ( #12264 )
* Fix llama 3.2 vision on LNL
* Small fix
2024-10-25 16:23:31 +08:00
Yuwen Hu
93895b2ac2
OpenVINO all-in-one benchmark small fix ( #12269 )
* Small update for all-in-one benchmark readme to support OpenVINO tests
* Small fix
2024-10-25 14:13:52 +08:00
Zijie Li
f7f62a3fef
Add OpenVINO performance tests to all-in-one benchmark ( #12238 )
* add-openvino-to-all-in-one
* update on openvino API
* Update save_openvino.py
* Update save_openvino.py
* Update save_openvino.py
* update on run.py and save_openvino
* update references
* Create openvino-requirements.txt
* fix on comments
* Small updates
* Small fix
* Fix
---------
Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2024-10-25 13:53:53 +08:00
Ruonan Wang
ae57e23e4f
fix incompatibility between llama GW & llama pipeline ( #12267 )
* fix
* fix
2024-10-25 10:31:44 +08:00
Yina Chen
b5e663854b
[NPU] Support llama groupwise ( #12260 )
* support llama gw
* support llama gw lm_head
* fix style
* remove unused code
2024-10-24 18:06:45 +08:00
Xin Qiu
39c9d1de52
fix codegeex ( #12261 )
2024-10-24 14:34:01 +08:00
Yishuo Wang
f3a2b20e6b
Optimize gpt2 ( #12259 )
2024-10-24 13:44:24 +08:00
Ruonan Wang
821fd96367
Initial integration of our L0 Llama impl into ipex-llm ( #12255 )
* temp save
* initial support
* fix
* simplify code
* fix style
* fix example
* make default value of pipeline as False
2024-10-24 09:49:27 +08:00
Yishuo Wang
cacc891962
Fix PR validation ( #12253 )
2024-10-23 18:10:47 +08:00
binbin Deng
b685cf4349
Fix npu group size setting when optimize_model=False ( #12256 )
2024-10-23 17:53:54 +08:00
binbin Deng
567b77a76b
Support IR and blob format for llama level0 pipeline ( #12251 )
2024-10-23 16:02:35 +08:00
Yishuo Wang
578aef245d
Fix models auto-choosing SdpaAttention with ipex 2.3 ( #12252 )
2024-10-23 15:33:45 +08:00
Yishuo Wang
88dc120a4c
fix fp16 linear ( #12250 )
2024-10-23 14:35:19 +08:00
Yina Chen
e8cf7f32f5
npu gw small fix ( #12249 )
2024-10-23 14:26:01 +08:00
Shaojun Liu
aae2490cb8
fix UT ( #12247 )
* fix ut
* Update test_transformers_api_attention.py
* Update test_transformers_api_mlp.py
2024-10-23 14:13:06 +08:00
Yina Chen
e37f951cce
[NPU] Groupwise ( #12241 )
* dq divide
* fix
* support attn divide
* update qwen2 7b
* divide down_proj & other linear
* use concat & reduce sum
* support scale after
* support qwen2
* w/ mm
* update reshape
* sdpa
* split
* split 2+
* update
* lm head-> 28
* no scale
* update
* update
* update
* fix style
* fix style
* to split linear
* update
* update code
* address comments
* fix style & remove redundant code & revert benchmark scripts
* fix style & remove code
* update save & load
---------
Co-authored-by: Yang Wang <yang3.wang@intel.com>
2024-10-23 14:10:58 +08:00
Jin, Qiao
8fa98e2742
Remove Qwen2-7b from NPU example for "Run Optimized Models (Experimental)" ( #12245 )
* Remove qwen2-7b from npu example readme
* fix
2024-10-22 17:07:51 +08:00
Yina Chen
ec465fbcd7
Add lookup generate in load_low_bit ( #12243 )
* add lookup generate in load_low_bit
* update comment
2024-10-22 15:51:52 +08:00
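A rough sketch of what lookup generation on a model restored via load_low_bit might look like; the lookahead argument, its value, and the paths here are assumptions drawn from ipex-llm's lookup-generation usage elsewhere, not from this commit:

```python
# Illustrative sketch: assumes ipex-llm's patched generate() accepts a
# `lookahead` argument to enable lookup generation, now also after load_low_bit.
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

save_dir = "./llama3-low-bit"  # hypothetical dir of a previously saved low-bit model

model = AutoModelForCausalLM.load_low_bit(save_dir, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(save_dir, trust_remote_code=True)

inputs = tokenizer("What is AI?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32, lookahead=2)  # lookahead value is illustrative
print(tokenizer.decode(output[0], skip_special_tokens=True))
```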
Yuwen Hu
b3df47486d
Fix Gemma 2 on LNL ( #12240 )
* Fix gemma 2 on LNL
* Python style fix
2024-10-21 18:25:53 +08:00
Yuwen Hu
5935b25622
Further update windows gpu perf test regarding results integrity check ( #12232 )
2024-10-18 18:15:13 +08:00
Yuwen Hu
b88c1df324
Add Llama 3.1 & 3.2 to Arc Performance test ( #12225 )
* Add llama3.1 and llama3.2 in arc perf (#12202 )
* Add llama3.1 and llama3.2 in arc perf
* Uninstall trl after arc test on transformers>=4.40
* Fix arc llama3 perf (#12212 )
* Fix pip uninstall
* Uninstall trl after test on transformers==4.43.1
* Fix llama3 arc perf (#12218 )
---------
Co-authored-by: Jin, Qiao <89779290+JinBridger@users.noreply.github.com>
2024-10-17 21:12:45 +08:00
Yishuo Wang
9ea694484d
refactor to remove old rope usage ( #12224 )
2024-10-17 17:06:09 +08:00
Yishuo Wang
324bcb057e
refactor to reduce old rope usage ( #12219 )
2024-10-17 14:45:09 +08:00
Jiao Wang
667f0db466
Update Eagle example to Eagle2+ipex-llm integration ( #11717 )
* update to e2 example
* update
* update
2024-10-16 23:16:14 -07:00
Yishuo Wang
a4a758656a
refactor gemma to reduce old fuse rope usage ( #12215 )
2024-10-16 17:40:28 +08:00
Yishuo Wang
9104a168f6
refactor phi-2 to reduce old fuse rope usage ( #12214 )
2024-10-16 17:08:14 +08:00
Yishuo Wang
bb247e991b
refactor merge_qkv and attention_softmax ( #12213 )
2024-10-16 15:58:14 +08:00
Yishuo Wang
e279148aa0
optimize llama3.2 vision again ( #12211 )
2024-10-16 14:29:48 +08:00
Chu,Youcheng
f17cc4fdee
feat: add llama3.2-11b-vision in all in one ( #12207 )
* feat: add llama3.2-11b-vision in all in one
* fix: change model
* fix: change name
* fix: add a space
* fix: switch import
2024-10-16 10:32:11 +08:00
Yuwen Hu
c9ac39fc1e
Add Llama 3.2 to iGPU performance test (transformers 4.45) ( #12209 )
* Add Llama 3.2 to iGPU Perf (#12200 )
* Add Llama 3.2 to iGPU Perf
* Downgrade accelerate after step
* Temporarily disable model for test
* Temporarily change ERRORLEVEL check (#12201 )
* Restore llama3.2 perf (#12206 )
* Revert "Temporarily change ERRORLEVEL check"
This reverts commit 909dbbc930ab4283737161a55bb32006e6ca1991.
* Revert "Temporarily disable model for test"
This reverts commit 95322dc3c6429aa836f21bda0b5ba8d9b48592f8.
---------
Co-authored-by: Jin, Qiao <89779290+JinBridger@users.noreply.github.com>
2024-10-15 17:44:46 +08:00
Yishuo Wang
f6611f9d3a
optimize llama3.2 vision attention again ( #12204 )
2024-10-15 16:08:20 +08:00
Yishuo Wang
9b81236a2e
optimize qwen2-vl vision ( #12203 )
2024-10-15 15:54:25 +08:00
Yishuo Wang
d5344587ab
optimize internvl2 vision model's attention ( #12198 )
2024-10-15 10:51:00 +08:00
Yuwen Hu
f8d1adc573
Fix Llama 3.2 & 3.1 on LNL ( #12196 )
2024-10-14 17:39:20 +08:00
Yuwen Hu
516b578104
Support cpp release for ARL on Windows ( #12189 )
* Support cpp Windows release for ARL
* Temp commit for test
* Remove temp commit
2024-10-14 17:20:31 +08:00
Zijie Li
7d80db710e
Add benchmark_util for transformers >= 4.44.0 ( #12171 )
* Create benchmark_util_4_45.py
* Update __init__.py
* Update lint-python
* Update benchmark_util_4_45.py
* Update benchmark_util_4_45.py
* Create benchmark_util_4_44.py
2024-10-14 15:40:12 +08:00
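This commit keeps per-version copies (benchmark_util_4_45.py, benchmark_util_4_44.py); a sketch of how a caller might select the right variant at import time, where the module names come from the commit but the dispatch logic, import paths, and class name are assumptions:

```python
# Sketch of version-gated selection between the benchmark_util variants;
# import paths and the BenchmarkWrapper class name are assumptions.
import transformers
from packaging import version

v = version.parse(transformers.__version__)
if v >= version.parse("4.45.0"):
    from ipex_llm.utils import benchmark_util_4_45 as benchmark_util
elif v >= version.parse("4.44.0"):
    from ipex_llm.utils import benchmark_util_4_44 as benchmark_util
else:
    from ipex_llm.utils import benchmark_util  # pre-4.44 fallback (assumed)

BenchmarkWrapper = benchmark_util.BenchmarkWrapper
```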
Jin, Qiao
8e35800abe
Add llama 3.1 in igpu perf ( #12194 )
2024-10-14 15:14:34 +08:00
Yuwen Hu
ddcdf47539
Support Windows ARL release ( #12183 )
* Support release for ARL
* Small fix
* Small fix to doc
* Temp for test
* Remove temp commit for test
2024-10-11 18:30:52 +08:00
Jinhe
f983f1a8f4
Add Qwen2-VL gpu example ( #12135 )
* qwen2-vl readme
* add qwen2-vl example
* fix
* fix
* fix
* add link
* Update regarding modules_to_not_convert and readme
* Further fix
* Small fix
---------
Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2024-10-11 18:25:23 +08:00
Ruonan Wang
310f18c8af
update NPU pipeline generate ( #12182 )
* update
* fix style
2024-10-11 17:39:20 +08:00
Shaojun Liu
724b2ae66d
add npu-level0 pipeline.dll to ipex-llm ( #12181 )
* add npu-level0 pipeline.dll to ipex-llm
* test
* update runner label
* fix
* update
* fix
* fix
2024-10-11 16:05:20 +08:00
Ruonan Wang
4d93bb81fe
Initial support of NPU level0 Model ( #12177 )
* first commit to support load dll and init llm pipeline (see the ctypes sketch after this entry)
* add init generate
* fix style
* small updates
* fix style and check tokens number
2024-10-11 09:45:53 +08:00
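The first bullet above mentions loading a DLL and initializing the LLM pipeline; a purely illustrative ctypes sketch of that pattern, with every DLL name, function name, and signature hypothetical rather than taken from the actual implementation:

```python
# Hypothetical ctypes sketch of "load dll and init llm pipeline";
# none of these names are taken from the real code.
import ctypes

lib = ctypes.CDLL("pipeline.dll")  # the level0 pipeline DLL later shipped in #12181

lib.init_llm_pipeline.argtypes = [ctypes.c_char_p, ctypes.c_int]  # hypothetical entry point
lib.init_llm_pipeline.restype = ctypes.c_int

ret = lib.init_llm_pipeline(b"./converted_model_dir", 1024)  # model dir, max context (assumed)
if ret != 0:
    raise RuntimeError(f"pipeline init failed with code {ret}")
```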
Yuwen Hu
890662610b
Fix auto importer for LNL release ( #12175 )
2024-10-10 15:17:43 +08:00
Yishuo Wang
535bee5381
fix qwen2 vl again ( #12174 )
2024-10-10 13:50:01 +08:00
Yuwen Hu
aef1f671bd
Support LNL Windows release ( #12169 )
* Release for LNL on Windows
* Temp commit for release test
* Change option name
* Remove temp commit and change option name
* temp commit for test again
* Remove temp commit
2024-10-09 17:41:10 +08:00
Yishuo Wang
78d253165d
optimize qwen2 vl perf again ( #12167 )
2024-10-09 16:43:48 +08:00
Zijie Li
3d044dbf53
add llama3.2-vision PyTorch example ( #12165 )
2024-10-09 09:20:42 +08:00
Yishuo Wang
644af2a76e
add basic llama 3.2 vision support ( #12163 )
2024-10-08 10:46:48 +08:00
Ch1y0q
17c23cd759
add llama3.2 GPU example ( #12137 )
* add llama3.2 GPU example
* change prompt format reference url
* update
* add Meta-Llama-3.2-1B-Instruct sample output
* update wording
2024-09-29 14:41:54 +08:00
Yuwen Hu
f71b38a994
Update MiniCPM_V_26 GPU example with save & load ( #12127 )
2024-09-26 17:40:22 +08:00
Yishuo Wang
669ff1a97b
fix sd1.5 ( #12129 )
2024-09-26 17:15:16 +08:00
Yishuo Wang
a266528719
optimize llama 3.2 rope ( #12128 )
2024-09-26 16:08:10 +08:00
Yishuo Wang
584c3489e7
add basic support for llama3.2 ( #12125 )
2024-09-26 15:46:19 +08:00
Yishuo Wang
66f419f8b7
fix qwen2 vl ( #12126 )
2024-09-26 15:44:02 +08:00
Ch1y0q
2ea13d502f
Add minicpm3 gpu example ( #12114 )
* add minicpm3 gpu example
* update GPU example
* update
---------
Co-authored-by: Huang, Xinshengzi <xinshengzi.huang@intel.com>
2024-09-26 13:51:37 +08:00
Yishuo Wang
77af9bc5fa
support passing None to low_bit in optimize_model ( #12121 )
2024-09-26 11:09:35 +08:00
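A minimal sketch of the extended API, assuming that low_bit=None means ipex-llm applies its optimizations without converting weights to a low-bit format:

```python
# Sketch: low_bit=None presumably skips weight quantization while keeping
# the other ipex-llm optimizations (assumed semantics).
import torch
from transformers import AutoModelForCausalLM
from ipex_llm import optimize_model

model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16)
model = optimize_model(model, low_bit=None)  # previously required a low-bit string like "sym_int4"
```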
Yishuo Wang
47e0b83cbf
optimize sd 1.5 ( #12119 )
2024-09-25 15:45:13 +08:00
Jin, Qiao
2bedb17be7
Add Qwen2.5 NPU Example ( #12110 )
* Add Qwen2.5 NPU Example
* fix
* Merge qwen2.py and qwen2.5.py into qwen.py
* Fix description
2024-09-25 15:20:03 +08:00
Yishuo Wang
5d63aef60b
optimize qwen2 vl again ( #12109 )
2024-09-23 13:22:01 +08:00
Ruonan Wang
03bd01c99c
optimize npu qwen2 ( #12107 )
2024-09-20 19:46:16 +08:00
Jinhe
02399021d6
add npu load_low_bit api in all-in-one benchmark ( #12103 )
2024-09-20 17:56:08 +08:00
Yishuo Wang
9239fd4f12
add basic support and optimization for qwen2-vl ( #12104 )
2024-09-20 17:23:06 +08:00
Yuwen Hu
828fa01ad3
[NPU] Add mixed_precision for Qwen2 7B ( #12098 )
* Add mix_precision argument to control whether to use INT8 lm_head for Qwen2-7B-Instruct
* Small fix
* Fix load low bit with mixed precision
* Small fix
* Update example accordingly
* Update for default prompt
* Update base on comments
* Final fix
2024-09-20 16:36:21 +08:00
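A sketch of how the flag might be passed on the NPU loading path; the mixed_precision name follows the commit title, while the surrounding kwargs follow typical ipex-llm NPU examples and are assumptions:

```python
# Sketch: mixed_precision=True keeps lm_head in INT8 while the rest of the
# model uses INT4, per the commit description; other kwargs are assumptions.
import torch
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    torch_dtype=torch.float16,
    load_in_low_bit="sym_int4",
    optimize_model=True,
    mixed_precision=True,  # INT8 lm_head, INT4 elsewhere
    trust_remote_code=True,
)
```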
Ch1y0q
2269768e71
add internvl2 example ( #12102 )
* add internvl2 example
* add to README.md
* update
* add link to zh-CN readme
2024-09-20 16:31:54 +08:00
Ruonan Wang
09b8c80d9d
update code for NPU qwen2 ( #12094 )
* update code
* fix
2024-09-20 15:58:32 +08:00
Jin, Qiao
db7500bfd4
Add Qwen2.5 GPU example ( #12101 )
* Add Qwen2.5 GPU example
* fix end line
* fix description
2024-09-20 15:55:57 +08:00
Yishuo Wang
54b973c744
fix ipex_llm import in transformers 4.45 ( #12099 )
2024-09-20 15:24:59 +08:00
Ch1y0q
9650bf616a
add transpose_value_cache for NPU benchmark ( #12092 )
* add `transpose_value_cache`
* update
* update
2024-09-19 18:45:05 +08:00
Yuwen Hu
f7fb3c896c
Update lm_head optimization for Qwen2 7B ( #12090 )
2024-09-18 17:02:02 +08:00
Xu, Shuo
ee33b93464
LongBench: NV code to ipex-llm ( #11662 )
* add nv longbench
* LongBench: NV code to ipex-llm
* amend
* add more models support
* amend
* optimize LongBench's user experience
* amend
* amend
* fix typo
* amend
* remove cuda related information & add a readme
* add license to python scripts & polish the readme
* amend
* amend
---------
Co-authored-by: cyita <yitastudy@gmail.com>
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2024-09-18 15:55:14 +08:00
Wang, Jian4
40e463c66b
Enable vllm to load gptq model ( #12083 )
* enable vllm to load gptq model
* update
* update
* update
* update style
2024-09-18 14:41:00 +08:00
Ruonan Wang
081af41def
[NPU] Optimize Qwen2 lm_head to use INT4 ( #12072 )
* temp save
* update
* fix
* fix
* Split lm_head into 7 parts & remove int8 for lm_head when sym_int4
* Simplify and add condition to code
* Small fix
* refactor some code
* fix style
* fix style
* fix style
* fix
* fix
* temp save
* refactor
* fix style
* further refactor
* simplify code
* meet code review
* fix style
---------
Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2024-09-14 15:26:46 +08:00
Ch1y0q
b4b8c3e495
add lowbit_path for generate.py, fix npu_model ( #12077 )
* add `lowbit_path` for `generate.py`, fix `npu_model`
* update `README.md`
2024-09-13 17:28:05 +08:00
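A sketch of the save-or-load pattern a lowbit_path flag typically enables in these generate.py examples: reload a previously converted model when the directory exists, otherwise convert and save. The flag names and control flow here are assumptions:

```python
# Sketch of the --lowbit-path pattern; argument names and flow are assumptions.
import os
import argparse
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

parser = argparse.ArgumentParser()
parser.add_argument("--repo-id-or-model-path", default="baichuan-inc/Baichuan2-7B-Chat")
parser.add_argument("--lowbit-path", default="", help="dir to save/load the low-bit model")
args = parser.parse_args()

if args.lowbit_path and os.path.isdir(args.lowbit_path):
    # Reloading a converted model skips quantization and is much faster.
    model = AutoModelForCausalLM.load_low_bit(args.lowbit_path, trust_remote_code=True)
else:
    model = AutoModelForCausalLM.from_pretrained(
        args.repo_id_or_model_path, load_in_low_bit="sym_int4", trust_remote_code=True
    )
    if args.lowbit_path:
        model.save_low_bit(args.lowbit_path)
```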
Wang, Jian4
d703e4f127
Enable vllm multimodal minicpm-v-2-6 ( #12074 )
* enable minicpm-v-2-6
* add image_url readme
2024-09-13 13:28:35 +08:00
Ruonan Wang
48d9092b5a
upgrade OneAPI version for cpp Windows ( #12063 )
* update version
* update quickstart
2024-09-12 11:12:12 +08:00
Jinhe
e78e45ee01
update NPU readme: run conhost as administrator ( #12066 )
2024-09-11 17:54:04 +08:00
Jinhe
4ca330da15
Fix NPU load error message and add minicpm npu lowbit feat ( #12064 )
* fix npu_model raise sym_int4 error
* add load_lowbit
* remove print&perf
2024-09-11 16:56:35 +08:00
Jinhe
32e8362da7
added minicpm cpu examples ( #12027 )
* minicpm cpu examples
* add link for minicpm-2
2024-09-11 15:51:21 +08:00
Ruonan Wang
a0c73c26d8
clean NPU code ( #12060 )
* clean code
* remove time.perf_counter()
2024-09-11 15:10:35 +08:00
Wang, Jian4
c75f3dd874
vllm: no padding for glm4 to avoid nan error ( #12062 )
* no padding glm4
* add codegeex
2024-09-11 13:44:40 +08:00
Chu,Youcheng
649390c464
fix: textual and env variable adjustment ( #12038 )
2024-09-11 13:38:01 +08:00
Wang, Jian4
30a8680645
Update for vllm one card padding ( #12058 )
2024-09-11 10:52:55 +08:00
Zijie Li
c5fdfde1bd
fix npu-model prompt ( #12057 )
2024-09-11 10:06:45 +08:00
Yishuo Wang
d8c044e79d
optimize minicpm3 kv cache ( #12052 )
2024-09-10 16:51:21 +08:00
Wang, Jian4
5d3ab16a80
Add vllm glm and baichuan padding ( #12053 )
2024-09-10 15:57:28 +08:00
Guancheng Fu
69c8d36f16
Switching from vLLM v0.3.3 to vLLM v0.5.4 ( #12042 )
* Enable single card sync engine
* enable ipex-llm optimizations for vllm
* enable optimizations for lm_head
* Fix chatglm multi-reference problem
* Remove duplicate layer
* LLM: Update vLLM to v0.5.4 (#11746 )
* Enable single card sync engine
* enable ipex-llm optimizations for vllm
* enable optimizations for lm_head
* Fix chatglm multi-reference problem
* update 0.5.4 api_server
* add dockerfile
* fix
* fix
* refine
* fix
---------
Co-authored-by: gc-fu <guancheng.fu@intel.com>
* Add vllm-0.5.4 Dockerfile (#11838 )
* Update BIGDL_LLM_SDP_IGNORE_MASK in start-vllm-service.sh (#11957 )
* Fix vLLM not convert issues (#11817 ) (#11918 )
* Fix not convert issues
* refine
Co-authored-by: Guancheng Fu <110874468+gc-fu@users.noreply.github.com>
* Fix glm4-9b-chat nan error on vllm 0.5.4 (#11969 )
* init
* update mlp forward
* fix minicpm error in vllm 0.5.4
* fix dependabot alerts (#12008 )
* Update 0.5.4 dockerfile (#12021 )
* Add vllm awq loading logic (#11987 )
* [ADD] Add vllm awq loading logic
* [FIX] fix the module.linear_method path
* [FIX] fix quant_config path error
* Enable Qwen padding mlp to 256 to support batch_forward (#12030 )
* Enable padding mlp
* padding to 256
* update style
* Install 27191 runtime in 0.5.4 docker image (#12040 )
* fix rebase error
* fix rebase error
* vLLM: format for 0.5.4 rebase (#12043 )
* format
* Update model_convert.py
* Fix serving docker related modifications (#12046 )
* Fix undesired modifications (#12048 )
* fix
* Refine offline_inference arguments
---------
Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
Co-authored-by: Jun Wang <thoughts.times@gmail.com>
Co-authored-by: Wang, Jian4 <61138589+hzjane@users.noreply.github.com>
Co-authored-by: liu-shaojun <johnssalyn@outlook.com>
Co-authored-by: Shaojun Liu <61072813+liu-shaojun@users.noreply.github.com>
2024-09-10 15:37:43 +08:00
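For orientation, a sketch of offline inference against the rebased engine; the IPEX-LLM wrapper import path and the low-bit kwarg are assumptions based on ipex-llm's vLLM documentation, not taken from this commit:

```python
# Sketch: assumes ipex-llm wraps vLLM's LLM class under this import path;
# treat the names below as assumptions.
from vllm import SamplingParams
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

llm = LLM(model="Qwen/Qwen2-7B-Instruct",
          device="xpu",
          load_in_low_bit="sym_int4",  # ipex-llm low-bit option (assumed name)
          tensor_parallel_size=1)

outputs = llm.generate(["What is AI?"], SamplingParams(temperature=0.8, max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)
```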
Ch1y0q
73a4360f3f
update lowbit path for baichuan2, qwen2, generate.py ( #12051 )
* update lowbit path for baichuan2, qwen2, `generate.py`
* update readme
2024-09-10 15:35:24 +08:00
Ruonan Wang
dc4af02b2a
Fix qwen2 1.5B NPU load error ( #12049 )
2024-09-10 14:41:18 +08:00
Yishuo Wang
abc370728c
optimize minicpm3 again ( #12047 )
2024-09-10 14:19:57 +08:00
Ch1y0q
f0061a9916
remove local import os to fix Baichuan NPU load issue ( #12044 )
2024-09-10 14:13:24 +08:00