ipex-llm

Author	SHA1	Message	Date
Yina Chen	0763268e4c	[NPU]Qwen2 groupwise performance opt (#12299 ) * qwen2 gw performance opt * remove debug	2024-10-30 17:40:21 +08:00
binbin Deng	41b8064554	Support minicpm-1B in level0 pipeline (#12297 )	2024-10-30 17:21:47 +08:00
Yina Chen	70037ad55f	Groupwise prefill optimization (#12291 ) * except lm_head * remove * support gw lm_head * update * fix * remove run.bat * fix style * support llama3 * slice -> split * remove debug * fix style * add dpu	2024-10-30 14:59:45 +08:00
Ruonan Wang	2b2cb9c693	[NPU pipeline] Support save & load and update examples (#12293 ) * support save & load, update llama examples * update baichuan2 example * update readme	2024-10-30 10:02:00 +08:00
Yuwen Hu	5a15098835	Initial support for quantized forward on CPU when `quantization_group_size=0` (#12282 ) * Initial support for quantized forward on CPU when quantization_group_size=0 * Style fix * Style fix * Small fix * Small fix	2024-10-29 19:40:17 +08:00
binbin Deng	3feb58d1e4	Support baichuan2 for level0 pipeline (#12289 )	2024-10-29 19:24:16 +08:00
Ruonan Wang	3fe2ea3081	[NPU] Reuse prefill of acc lib for pipeline (#12279 ) * first commit * update example * fix style * update example * embedding as const * fix generate * code refactor * meet code review * fix style * change max_output_len to max_context_len * fix all-in-one * fix example * add check for new tokens	2024-10-28 16:05:49 +08:00
SONG Ge	a0c6432899	[NPU] Add support for loading a FunASR model (#12073 ) * add support for loading funasr model * add initial support for paraformer-encoder * add npu ops impl * add encoder-decoder npu pipeline * move paraformer encoders prefix 30 layers to npu and keep the rest layers on cpu	2024-10-25 17:22:01 +08:00
Yina Chen	b5e663854b	[NPU] Support llama groupwise (#12260 ) * support llama gw * support llama gw lm_head * fix style * remove unused code	2024-10-24 18:06:45 +08:00
binbin Deng	b685cf4349	Fix npu group size setting of optimize_model=False (#12256 )	2024-10-23 17:53:54 +08:00
binbin Deng	567b77a76b	Support IR and blob format for llama level0 pipeline (#12251 )	2024-10-23 16:02:35 +08:00
Yina Chen	e8cf7f32f5	npu gw small fix (#12249 )	2024-10-23 14:26:01 +08:00
Yina Chen	e37f951cce	[NPU] Groupwise (#12241 ) * dq divide * fix * support attn divide * update qwen2 7b * divide down_proj & other linear * use concat & reduce sum * support scale after * support qwen2 * w/ mm * update reshape * spda * split * split 2+ * update * lm head-> 28 * no scale * update * update * update * fix style * fix style * to split linear * update * update code * address comments * fix style & remove redundant code & revert benchmark scripts * fix style & remove code * update save & load --------- Co-authored-by: Yang Wang <yang3.wang@intel.com>	2024-10-23 14:10:58 +08:00
Ruonan Wang	03bd01c99c	optimize npu qwen2 (#12107 )	2024-09-20 19:46:16 +08:00
Yuwen Hu	828fa01ad3	[NPU] Add `mixed_precision` for Qwen2 7B (#12098 ) * Add mix_precision argument to control whether use INT8 lm_head for Qwen2-7B-Instruct * Small fix * Fixed on load low bit with mixed precision * Small fix * Update example accordingly * Update for default prompt * Update base on comments * Final fix	2024-09-20 16:36:21 +08:00
Ruonan Wang	09b8c80d9d	update code for NPU qwen2 (#12094 ) * update code * fix	2024-09-20 15:58:32 +08:00
Yuwen Hu	f7fb3c896c	Update lm_head optimization for Qwen2 7B (#12090 )	2024-09-18 17:02:02 +08:00
Ruonan Wang	081af41def	[NPU] Optimize Qwen2 lm_head to use INT4 (#12072 ) * temp save * update * fix * fix * Split lm_head into 7 parts & remove int8 for lm_head when sym_int4 * Simlify and add condition to code * Small fix * refactor some code * fix style * fix style * fix style * fix * fix * temp save * refactor * fix style * further refactor * simplify code * meet code review * fix style --------- Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>	2024-09-14 15:26:46 +08:00
Ruonan Wang	a0c73c26d8	clean NPU code (#12060 ) * clean code * remove time.perf_counter()	2024-09-11 15:10:35 +08:00
Ruonan Wang	640998edea	update inter_pp of qwen2 (#12041 )	2024-09-10 10:34:17 +08:00
binbin Deng	d2e1b9aaff	Add input padding during prefill for qwen2-7b (#12033 )	2024-09-06 16:39:59 +08:00
Ruonan Wang	0d04531ae0	update NPU readme of Qwen2 (#12032 ) * update readme * update broadcast	2024-09-06 15:02:39 +08:00
Yang Wang	58555bd9de	Optimize broadcast for npu llama (#12028 )	2024-09-06 13:28:20 +08:00
binbin Deng	845e5dc89e	Support lm_head of minicpm-2b on NPU (#12019 )	2024-09-05 16:19:22 +08:00
binbin Deng	01099f08ee	Revert prefill logic of qwen2-7b (#11992 )	2024-09-03 14:45:01 +08:00
binbin Deng	2f3d1bd0ec	hotfix qwen2-7b weight setting (#11991 )	2024-09-02 18:11:08 +08:00
binbin Deng	a40ea7038d	Fix AttributeError of qwen2-1.5B (#11990 )	2024-09-02 17:55:10 +08:00
Yang Wang	c48817bd43	Support Qwen2-7b MLP in int4 and transpose_value_cache=True (#11968 )	2024-09-02 14:37:44 +08:00
Ruonan Wang	573c20bae6	fix npu lm_head cpu condition (#11976 ) * fix * fix * fix * fix stype * fix style * fix style	2024-08-30 17:11:26 +08:00
Ruonan Wang	60aa1a2c0f	Initial NPU support for MiniCPM-V-2_6 (#11966 ) * initial pr * update npu model * fix * fix kv cache type * fix * small fix * fix style * fix model id * change inter_pp=4 * address comment * fix * fix style * fix * rebase	2024-08-30 16:34:35 +08:00
binbin Deng	cd077881f1	Disable lm head (#11972 )	2024-08-30 11:05:18 +08:00
Yang Wang	fbf088f61e	remove obselete npu code (#11967 )	2024-08-29 14:16:44 -07:00
Yina Chen	882f4a5ff7	Add lnl npu driver recommend version and enable cpu_lm_head on llama3 (#11952 ) * update lnl npu driver version and enable cpu_lm_head on llama3 * update * fix style * typo * address comments * update * add qwen2-7b	2024-08-29 15:01:18 +08:00
binbin Deng	71f03dcc39	Support qwen2-7b with fused decoderlayer optimization on NPU (#11912 )	2024-08-29 13:34:20 +08:00
Jiao Wang	63ac5f64bb	Refactor NPU baichuan multiple-process (#11945 ) * update * add baichuan mp * clean * refactor * merge * style * update * update	2024-08-28 11:33:40 -07:00
SONG Ge	5ca7390082	[NPU] Add minicpm-2b support for npu multi-processing (#11949 ) * add minicpm-2b support * update example for minicpm-2b * add LNL NPU driver requirement in readme	2024-08-28 18:08:49 +08:00
Yina Chen	b38fb67bec	[NPU] lm head to cpu (#11943 ) * lm head to cpu * qwen2 * mv logic and add param to disable cpu_lm_head * use env and lm_head opt to mp file * fix * update * remove print	2024-08-28 16:34:07 +08:00
binbin Deng	bec00e2015	Improve baichuan2 NPU performance (#11942 )	2024-08-27 18:37:08 +08:00
Zijie Li	90f692937d	Update npu baichuan2 (#11939 )	2024-08-27 16:56:26 +08:00
Jiao Wang	b4b6ddf73c	NPU Baichuan2 Multi- Process example (#11928 )	2024-08-27 15:25:49 +08:00
SONG Ge	e211a5b076	update minicpm to meet latest refactor (#11937 )	2024-08-27 15:08:01 +08:00
Zijie Li	6c3eb1e1e8	refactor from_pretrained API for NPU (#11927 )	2024-08-27 09:50:30 +08:00
SONG Ge	019f725d4d	[NPU] Add support for running mp minicpm model on npu (#11909 ) * add initial support for npu minicpm mp * fix minicpm-1b abnormal output error	2024-08-26 17:52:55 +08:00
binbin Deng	303a090a6b	Add lm_head optimization on NPU (#11903 )	2024-08-23 15:51:07 +08:00
binbin Deng	72a7bf624b	Support qwen2-1.5b with fused decoderlayer optimization on NPU (#11888 )	2024-08-22 11:09:12 +08:00
Yang Wang	209d42ab79	Refactor npu mp to make it easier to integrate new models (#11873 ) * Refactor npu mp to make it easier to integrate new models * fix style * move layer functions to base	2024-08-20 20:58:47 -07:00
Yang Wang	bdaeee1d63	Fix run_decoders bug (#11871 )	2024-08-20 12:04:59 -07:00
Yang Wang	99b05ba1dc	separate prefill into a process (#11787 ) * seperate prefill into a process * using model.share_memory() * might work * worked * use long prompt * refactor * cleanup * fix bug * clean up * changable inter and intra process stages * refactor * add max output len * fix npu_model changes that may cause generate down * fix npu_model generate import error * fix generare forward error --------- Co-authored-by: sgwhat <ge.song@intel.com>	2024-08-19 17:53:36 +08:00
Yang Wang	51bcac1229	follow up on experimental support of fused decoder layer for llama2 (#11785 ) * clean up and support transpose value cache * refine * fix style * fix style	2024-08-13 18:53:55 -07:00
binbin Deng	23d3acdc77	Add experimental support of fused decoder layer for llama2 (#11768 )	2024-08-13 14:41:36 +08:00

71 commits