Commit graph

118 commits

Author SHA1 Message Date
binbin Deng
4e7e988f70
[NPU] Fix MTL and ARL support (#12580) 2024-12-19 16:55:30 +08:00
Zijie Li
1a2ab12876
[NPU] support asym_int4 for minicpm (#12567) 2024-12-18 10:55:35 +08:00
Zijie Li
fcb474820d
[NPU] support asym_int4 for llama (#12556)
* add llama-imatrix

* fix bugs in llama.py

* style fix
2024-12-17 14:01:17 +08:00
binbin Deng
caf15cc5ef
[NPU] Add IPEX_LLM_NPU_MTL to enable support on mtl (#12543) 2024-12-13 17:01:13 +08:00
binbin Deng
6596c18489
[NPU] Modify IPEX_LLM_NPU_DISABLE_COMPILE_OPT setting for long input (#12537) 2024-12-13 13:49:56 +08:00
Ruonan Wang
7cc01fdc86
[NPU] further fix of new_value_states (#12538) 2024-12-13 13:42:00 +08:00
binbin Deng
f36c23664f
[NPU] Fix abnormal output with latest driver (#12530) 2024-12-12 17:56:30 +08:00
Yuwen Hu
dbaf4abcb3
[NPU] Update C++ example with repetition_penalty & update Python code accordingly (#12528)
* Update c++ npu examples with repetition penalty

* Fit python with updated C++ API

* Style fix

* Small fix

* Small fix
2024-12-12 13:42:55 +08:00
Ruonan Wang
41ef4974ab
[NPU] fix transpose_value = False for NPU optimize_model=True (#12525) 2024-12-11 15:51:39 +08:00
Ruonan Wang
588bfa24dc
support hqq (#12518)
* support

* fix
2024-12-11 15:43:02 +08:00
Yuwen Hu
68f2873bd3
[NPU] Support repetition penalty for simple generate, Python (cpp backend) (#12522)
* Initial support of repetition penalty on NPU (cpp backend) for simple generate

* Bug fix for generation config and others

* Remove unnecessary print and style fix

* Remove unnecessary print

* Fix based on comments
2024-12-11 14:55:25 +08:00
binbin Deng
ea55235cbd
[NPU] Support glm-edge models (#12511) 2024-12-09 14:06:27 +08:00
Ruonan Wang
49ab8974fa
[NPU] initial support of asym_int4_rtn (#12484)
* initial support of q4_1

* fix

* fix

* update

* update min to Z1

* update

* fix

* update

* fix style

* fix

* support qwen2 optimize_model=True mp version

* temp save

* fix

* fix style

* replace min with zero

* support split linear for q4_1

* fix lm_head with mixed_precision=True

* fix style

* revert test code

* add down proj back for q4_0

* remove print
2024-12-05 17:40:36 +08:00
Yuwen Hu
84f1c4ad57
Small fix for NPU Python cpp simple generate regarding eos tokens (#12501) 2024-12-04 18:54:06 +08:00
Kai Huang
7d27f134dd
Fix hf generate for llama3.2 (#12497)
* fix kv condition

* meet review
2024-12-04 17:54:40 +08:00
Kai Huang
7ff4533b39
Support hf generate (#12477)
* generate

* style

* update

* remove timing

* style

* style

* combine generate api

* simple in kwargs
2024-12-04 16:31:09 +08:00
Yuwen Hu
ef4028ac2d
[NPU] Support split lm_head for Qwen2 with CPP (#12491)
* Use split for Qwen2 lm_head instead of slice in optimize_pre

* Support split lm_head in Qwen2 python cpp backend

* Fit with Python acc lib pipeline

* Removed default mixed_precision=True in all-in-one and related examples

* Small fix

* Style fix

* Fix based on comments

* Fix based on comments

* Style fix
2024-12-04 14:41:08 +08:00
binbin Deng
c59284418c
Hotfix of BCE-Embedding model (#12490) 2024-12-03 18:16:04 +08:00
Yuwen Hu
4ac66db034
[NPU] Support streaming in Python (cpp backend) (#12488)
* Support streaming in NPU Python (cpp backend)

* Small fix
2024-12-03 17:17:26 +08:00
Jin, Qiao
5fe766788e
Fix MiniCPM-V-2_6 running on NPU (#12486) 2024-12-03 16:16:29 +08:00
Ruonan Wang
598603bea6
small fix of imatrix (#12480) 2024-12-03 10:46:36 +08:00
binbin Deng
ab01753b1c
[NPU] update save-load API usage (#12473) 2024-12-03 09:46:15 +08:00
Yuwen Hu
26adb82ee3
[NPU] Remove hard code (#12479) 2024-12-02 18:26:07 +08:00
binbin Deng
54d9a590d4
[NPU] Fix eos_token setting (#12475) 2024-12-02 14:18:22 +08:00
Ruonan Wang
4b6c3160be
Support imatrix-guided quantization for NPU CW (#12468)
* init commit

* remove print

* add interface

* fix

* fix

* fix style
2024-12-02 11:31:26 +08:00
binbin Deng
14d8d3d8af
Integrate NPU C++ impl into ipex-llm (#12461) 2024-11-29 09:25:37 +08:00
Yina Chen
1b533a105c
[NPU] Add env to enable scale search (#12462)
* add env enable scale search

* address comment

* move logic
2024-11-28 17:06:00 +08:00
Yuwen Hu
303b104c10
Fix abnormal output for Qwen2-7B when sym_int8 (#12446) 2024-11-26 15:53:04 +08:00
Ruonan Wang
0e23bd779f
Add support of llama3.2 for NPU C++ (#12442)
* initial support of llama3.2

* update

* update

* fix style

* fix style

* fix

* small fix
2024-11-26 09:26:55 +08:00
Ruonan Wang
b9abb8a285
Support qwen2.5 3B for NPU & update related examples (#12438)
* update qwen2.5-3B

* update convert

* small fix

* replace load_in_low_bit with low_bit

* small fix
2024-11-25 16:38:31 +08:00
SONG Ge
ff3f7cb25f
Fix speech_paraformer issue with unexpected changes (#12416)
* Fix speech_paraformer issue with unexpected changes

* Add paraformer version specified
2024-11-19 15:01:20 +08:00
binbin Deng
d4d949443f
[NPU] change attention_mask to fp16 (#12400) 2024-11-14 17:20:29 +08:00
SONG Ge
d2cbcb060c
Add initial support for modeling_xlm encoder on NPU (#12393)
* Add initial support for modeling_xlm encoder on NPU

* Add EmbeddingModel class to keep the same usage with bce and npu fp16 linear convert

* Optimize current implementation to support EmbeddingModel.encode API and convert other torch modules to NPU

* Add related example and documents
2024-11-14 10:50:27 +08:00
Yina Chen
59b01fa7d2
small fix (#12397) 2024-11-14 10:03:36 +08:00
Yina Chen
d6d63d6b84
[NPU] Qwen prefill attn_mask type hotfix (#12395)
* qwen prefill attn_mask type fp16

* update
2024-11-13 17:51:34 +08:00
Yina Chen
9220babaab
qwen prefill attn_mask type fp16 (#12394) 2024-11-13 17:45:26 +08:00
Ruonan Wang
6bf5a8c230
[NPU] Update qwen2 compile config (#12383)
* update

* fix
2024-11-12 16:59:44 +08:00
binbin Deng
7a97fbb779
Support vpm and resampler module of minicpm-v on NPU (#12375) 2024-11-12 15:59:55 +08:00
Yina Chen
b2e69a896c
[NPU] Support Baichuan groupwise & gw code refactor (#12337)
* support minicpm 1b & qwen 1.5b gw

* support minicpm 1b

* baichuan part

* update

* support minicpm 1b & qwen 1.5b gw

* support minicpm 1b

* baichuan part

* update

* update

* update

* baichuan support

* code refactor

* remove code

* fix style

* address comments

* revert
2024-11-08 11:42:42 +08:00
binbin Deng
812d5cc32e
[NPU L0] Support llama3.2 in L0 pipeline (#12361) 2024-11-08 10:01:23 +08:00
Yina Chen
d880e534d2
[NPU] acclib llama3.2 support groupwise (#12355)
* change inter_pp

* add comment
2024-11-07 11:19:55 +08:00
SONG Ge
a7b66683f1
[NPU] Add Optimized Support for Llama3.2-1B/3B on NPU (#12339)
* Add initial support for llama3.2-1b/3b

* move llama3.2 support into current llama_mp impl
2024-11-06 19:21:40 +08:00
Yina Chen
d872639395
[NPU] Llama3, Qwen2 1.5b, MiniCPM 1/2B groupwise support (#12327)
* support minicpm 1b & qwen 1.5b gw

* support minicpm 1b

* support minicpm 2b

* fix style & error

* fix style & update

* remove print
2024-11-05 15:51:31 +08:00
Yina Chen
94c4ce389f
[NPU] Add env to disable compile opt (#12330)
* add env to disable compile opt

* fix style

* fix style
2024-11-04 17:46:17 +08:00
Ch1y0q
48123af463
add npu_group_size for transformers_int4_npu_win in all-in-one benchmark api (#12316)
* add `npu_group_size` for `transformers_int4_npu_win`
small bugfix

* update
2024-11-01 18:44:27 +08:00
Yina Chen
05c5d0267a
[NPU] Llama2 prefill use ov sdp (#12310)
* prefill use sdp

* add param

* update

* fix style

* fix style

* meet comments
2024-11-01 11:05:20 +08:00
Kai Huang
416c19165c
Add Qwen pipeline and example (#12292)
* support qwen pipeline

* update error msg

* style

* meet review

* minor
2024-10-31 11:25:25 +08:00
Yina Chen
0763268e4c
[NPU] Qwen2 groupwise performance opt (#12299)
* qwen2 gw performance opt

* remove debug
2024-10-30 17:40:21 +08:00
binbin Deng
41b8064554
Support minicpm-1B in level0 pipeline (#12297) 2024-10-30 17:21:47 +08:00
Yina Chen
70037ad55f
Groupwise prefill optimization (#12291)
* except lm_head

* remove

* support gw lm_head

* update

* fix

* remove run.bat

* fix style

* support llama3

* slice -> split

* remove debug

* fix style

* add dpu
2024-10-30 14:59:45 +08:00