ipex-llm

Author	SHA1	Message	Date
Ruonan Wang	3fe2ea3081	[NPU] Reuse prefill of acc lib for pipeline (#12279) * first commit * update example * fix style * update example * embedding as const * fix generate * code refactor * meet code review * fix style * change max_output_len to max_context_len * fix all-in-one * fix example * add check for new tokens	2024-10-28 16:05:49 +08:00
SONG Ge	08cb065370	hot-fix redundant import funasr (#12277)	2024-10-25 19:40:39 +08:00
SONG Ge	a0c6432899	[NPU] Add support for loading a FunASR model (#12073) * add support for loading funasr model * add initial support for paraformer-encoder * add npu ops impl * add encoder-decoder npu pipeline * move paraformer encoders prefix 30 layers to npu and keep the rest layers on cpu	2024-10-25 17:22:01 +08:00
Yina Chen	b5e663854b	[NPU] Support llama groupwise (#12260) * support llama gw * support llama gw lm_head * fix style * remove unused code	2024-10-24 18:06:45 +08:00
Ruonan Wang	821fd96367	Initial integrate our L0 Llama impl into ipex-llm (#12255) * temp save * initial support * fix * simplify code * fix style * fix example * make default value of pipeline as False	2024-10-24 09:49:27 +08:00
binbin Deng	b685cf4349	Fix npu group size setting of optimize_model=False (#12256)	2024-10-23 17:53:54 +08:00
Yina Chen	e37f951cce	[NPU] Groupwise (#12241) * dq divide * fix * support attn divide * update qwen2 7b * divide down_proj & other linear * use concat & reduce sum * support scale after * support qwen2 * w/ mm * update reshape * spda * split * split 2+ * update * lm head-> 28 * no scale * update * update * update * fix style * fix style * to split linear * update * update code * address comments * fix style & remove redundant code & revert benchmark scripts * fix style & remove code * update save & load --------- Co-authored-by: Yang Wang <yang3.wang@intel.com>	2024-10-23 14:10:58 +08:00
Yuwen Hu	828fa01ad3	[NPU] Add `mixed_precision` for Qwen2 7B (#12098) * Add mix_precision argument to control whether use INT8 lm_head for Qwen2-7B-Instruct * Small fix * Fixed on load low bit with mixed precision * Small fix * Update example accordingly * Update for default prompt * Update base on comments * Final fix	2024-09-20 16:36:21 +08:00
Ch1y0q	b4b8c3e495	add `lowbit_path` for `generate.py`, fix `npu_model` (#12077) * add `lowbit_path` for `generate.py`, fix `npu_model` * update `README.md`	2024-09-13 17:28:05 +08:00
Jinhe	4ca330da15	Fix NPU load error message and add minicpm npu lowbit feat (#12064) * fix npu_model raise sym_int4 error * add load_lowbit * remove print&perf	2024-09-11 16:56:35 +08:00
Ruonan Wang	dc4af02b2a	Fix qwen2 1.5B NPU load error (#12049)	2024-09-10 14:41:18 +08:00
Ch1y0q	f0061a9916	remove local import os to fix Baichuan NPU load issue (#12044)	2024-09-10 14:13:24 +08:00
Yang Wang	58555bd9de	Optimize broadcast for npu llama (#12028)	2024-09-06 13:28:20 +08:00
Ruonan Wang	9eaff5e47d	add save & load support for NPU optimized model (#11999) * add save & load support * fix style	2024-09-03 20:53:22 +08:00
Ruonan Wang	60aa1a2c0f	Initial NPU support for MiniCPM-V-2_6 (#11966) * initial pr * update npu model * fix * fix kv cache type * fix * small fix * fix style * fix model id * change inter_pp=4 * address comment * fix * fix style * fix * rebase	2024-08-30 16:34:35 +08:00
SONG Ge	158289d205	[NPU] Add initial support for minicpm-llama-v2.5 (#11962) * add initial support for minicpm-llama-v2.5 * update impl * add minicpm-llama3-v2.5 example	2024-08-30 16:00:33 +08:00
Yina Chen	b38fb67bec	[NPU] lm head to cpu (#11943) * lm head to cpu * qwen2 * mv logic and add param to disable cpu_lm_head * use env and lm_head opt to mp file * fix * update * remove print	2024-08-28 16:34:07 +08:00
binbin Deng	bec00e2015	Improve baichuan2 NPU performance (#11942)	2024-08-27 18:37:08 +08:00
Zijie Li	90f692937d	Update npu baichuan2 (#11939)	2024-08-27 16:56:26 +08:00
binbin Deng	7c8c9a0670	Update benchmark script for NPU (#11932)	2024-08-27 14:41:14 +08:00
Zijie Li	6c3eb1e1e8	refactor from_pretrained API for NPU (#11927)	2024-08-27 09:50:30 +08:00
Yang Wang	99b05ba1dc	separate prefill into a process (#11787) * seperate prefill into a process * using model.share_memory() * might work * worked * use long prompt * refactor * cleanup * fix bug * clean up * changable inter and intra process stages * refactor * add max output len * fix npu_model changes that may cause generate down * fix npu_model generate import error * fix generare forward error --------- Co-authored-by: sgwhat <ge.song@intel.com>	2024-08-19 17:53:36 +08:00
Yang Wang	51bcac1229	follow up on experimental support of fused decoder layer for llama2 (#11785) * clean up and support transpose value cache * refine * fix style * fix style	2024-08-13 18:53:55 -07:00
binbin Deng	23d3acdc77	Add experimental support of fused decoder layer for llama2 (#11768)	2024-08-13 14:41:36 +08:00
Zhao Changmin	f7e957aaf9	Clean npu dtype branch (#11515) * clean branch * create_npu_kernels	2024-07-05 15:45:26 +08:00
Zhao Changmin	57b8adb189	[WIP] Support npu load_low_bit method (#11502) * npu_load_low_bit	2024-07-04 17:15:34 +08:00
Zhao Changmin	6a0134a9b2	support q4_0_rtn (#11477) * q4_0_rtn	2024-07-02 16:57:02 +08:00
Zhao Changmin	cf8eb7b128	Init NPU quantize method and support q8_0_rtn (#11452) * q8_0_rtn * fix float point	2024-07-01 13:45:07 +08:00
Yishuo Wang	ca0e69c3a7	optimize npu llama perf again (#11431)	2024-06-26 10:52:54 +08:00
Yishuo Wang	9f6e5b4fba	optimize llama npu perf (#11426)	2024-06-25 17:43:20 +08:00
Yishuo Wang	a5e7d93242	Add initial save/load low bit support for NPU(now only fp16 is supported) (#11359)	2024-06-20 10:49:39 +08:00
Yishuo Wang	ae7b662ed2	add fp16 NPU Linear support and fix intel_npu_acceleration_library version 1.0 support (#11352)	2024-06-19 09:14:59 +08:00
Yishuo Wang	83082e5cc7	add initial support for intel npu acceleration library (#11347)	2024-06-18 16:07:16 +08:00

33 commits