binbin Deng
4e7e988f70
[NPU] Fix MTL and ARL support ( #12580 )
2024-12-19 16:55:30 +08:00
Zijie Li
1a2ab12876
[NPU] support asym_int4 for minicpm ( #12567 )
2024-12-18 10:55:35 +08:00
Zijie Li
fcb474820d
[NPU] support asym_int4 for llama ( #12556 )
...
* add llama-imatrix
* fix bugs in llama.py
* style fix
2024-12-17 14:01:17 +08:00
binbin Deng
caf15cc5ef
[NPU] Add IPEX_LLM_NPU_MTL to enable support on mtl ( #12543 )
2024-12-13 17:01:13 +08:00
binbin Deng
6596c18489
[NPU] Modify IPEX_LLM_NPU_DISABLE_COMPILE_OPT setting for long input ( #12537 )
2024-12-13 13:49:56 +08:00
Ruonan Wang
7cc01fdc86
[NPU] further fix of new_value_states ( #12538 )
2024-12-13 13:42:00 +08:00
binbin Deng
f36c23664f
[NPU] Fix abnormal output with latest driver ( #12530 )
2024-12-12 17:56:30 +08:00
Yuwen Hu
dbaf4abcb3
[NPU] Update C++ example with repetition_penalty & update Python code accordingly ( #12528 )
...
* Update c++ npu examples with repetition penalty
* Fit python with updated C++ API
* Style fix
* Small fix
* Small fix
2024-12-12 13:42:55 +08:00
Ruonan Wang
41ef4974ab
[NPU] fix transpose_value = False for NPU optimize_model=True ( #12525 )
2024-12-11 15:51:39 +08:00
Ruonan Wang
588bfa24dc
support hqq ( #12518 )
...
* support
* fix
2024-12-11 15:43:02 +08:00
Yuwen Hu
68f2873bd3
[NPU] Support repetition penalty for simple generate, Python (cpp backend) ( #12522 )
...
* Initial support of repetition penalty on NPU (cpp backend) for simple generate
* Bug fix for generation config and others
* Remove unnecessary print and style fix
* Remove unnecessary print
* Fix based on comments
2024-12-11 14:55:25 +08:00
binbin Deng
ea55235cbd
[NPU] Support glm-edge models ( #12511 )
2024-12-09 14:06:27 +08:00
Ruonan Wang
49ab8974fa
[NPU] initial support of asym_int4_rtn ( #12484 )
...
* initial support of q4_1
* fix
* fix
* update
* update min to Z1
* update
* fix
* update
* fix style
* fix
* support qwen2 optimize_model=True mp version
* temp save
* fix
* fix style
* replace min with zero
* support split linear for q4_1
* fix lm_head with mixed_precision=True
* fix style
* revert test code
* add down proj back for q4_0
* remove print
2024-12-05 17:40:36 +08:00
Yuwen Hu
84f1c4ad57
Small fix for NPU Python cpp simple generate regarding eos tokens ( #12501 )
2024-12-04 18:54:06 +08:00
Kai Huang
7d27f134dd
Fix hf generate for llama3.2 ( #12497 )
...
* fix kv condition
* meet review
2024-12-04 17:54:40 +08:00
Kai Huang
7ff4533b39
Support hf generate ( #12477 )
...
* generate
* style
* update
* remove timing
* style
* style
* combine generate api
* simple in kwargs
2024-12-04 16:31:09 +08:00
Yuwen Hu
ef4028ac2d
[NPU] Support split lm_head for Qwen2 with CPP ( #12491 )
...
* Use split for Qwen2 lm_head instead of slice in optimize_pre
* Support split lm_head in Qwen2 python cpp backend
* Fit with Python acc lib pipeline
* Removed default mixed_precision=True in all-in-one and related examples
* Small fix
* Style fix
* Fix based on comments
* Fix based on comments
* Style fix
2024-12-04 14:41:08 +08:00
binbin Deng
c59284418c
Hotfix of BCE-Embedding model ( #12490 )
2024-12-03 18:16:04 +08:00
Yuwen Hu
4ac66db034
[NPU] Support streaming in Python (cpp backend) ( #12488 )
...
* Support streaming in NPU Python (cpp backend)
* Small fix
2024-12-03 17:17:26 +08:00
Jin, Qiao
5fe766788e
Fix MiniCPM-V-2_6 running on NPU ( #12486 )
2024-12-03 16:16:29 +08:00
Ruonan Wang
598603bea6
small fix of imatrix ( #12480 )
2024-12-03 10:46:36 +08:00
binbin Deng
ab01753b1c
[NPU] update save-load API usage ( #12473 )
2024-12-03 09:46:15 +08:00
Yuwen Hu
26adb82ee3
[NPU] Remove hard code ( #12479 )
2024-12-02 18:26:07 +08:00
binbin Deng
54d9a590d4
[NPU] Fix eos_token setting ( #12475 )
2024-12-02 14:18:22 +08:00
Ruonan Wang
4b6c3160be
Support imatrix-guided quantization for NPU CW ( #12468 )
...
* init commit
* remove print
* add interface
* fix
* fix
* fix style
2024-12-02 11:31:26 +08:00
binbin Deng
14d8d3d8af
Integrate NPU C++ implementation into ipex-llm ( #12461 )
2024-11-29 09:25:37 +08:00
Yina Chen
1b533a105c
[NPU] Add env to enable scale search ( #12462 )
...
* add env enable scale search
* address comment
* move logic
2024-11-28 17:06:00 +08:00
Yuwen Hu
303b104c10
Fix abnormal output for Qwen2-7B when sym_int8 ( #12446 )
2024-11-26 15:53:04 +08:00
Ruonan Wang
0e23bd779f
Add support of llama3.2 for NPU C++ ( #12442 )
...
* initial support of llama3.2
* update
* update
* fix style
* fix style
* fix
* small fix
2024-11-26 09:26:55 +08:00
Ruonan Wang
b9abb8a285
Support qwen2.5 3B for NPU & update related examples ( #12438 )
...
* update qwen2.5-3B
* update convert
* small fix
* replace load_in_low_bit with low_bit
* small fix
2024-11-25 16:38:31 +08:00
SONG Ge
ff3f7cb25f
Fix speech_paraformer issue with unexpected changes ( #12416 )
...
* Fix speech_paraformer issue with unexpected changes
* Add paraformer version specified
2024-11-19 15:01:20 +08:00
binbin Deng
d4d949443f
[NPU] change attention_mask to fp16 ( #12400 )
2024-11-14 17:20:29 +08:00
SONG Ge
d2cbcb060c
Add initial support for modeling_xlm encoder on NPU ( #12393 )
...
* Add initial support for modeling_xlm encoder on NPU
* Add EmbeddingModel class to keep the same usage with bce and npu fp16 linear convert
* Optimize current implementation to support EmbeddingModel.encode API and convert other torch modules to NPU
* Add related example and documents
2024-11-14 10:50:27 +08:00
Yina Chen
59b01fa7d2
small fix ( #12397 )
2024-11-14 10:03:36 +08:00
Yina Chen
d6d63d6b84
[NPU] Qwen prefill attn_mask type hotfix ( #12395 )
...
* qwen prefill attn_mask type fp16
* update
2024-11-13 17:51:34 +08:00
Yina Chen
9220babaab
qwen prefill attn_mask type fp16 ( #12394 )
2024-11-13 17:45:26 +08:00
Ruonan Wang
6bf5a8c230
[NPU] Update qwen2 compile config ( #12383 )
...
* update
* fix
2024-11-12 16:59:44 +08:00
binbin Deng
7a97fbb779
Support vpm and resampler module of minicpm-v on NPU ( #12375 )
2024-11-12 15:59:55 +08:00
Yina Chen
b2e69a896c
[NPU] Support Baichuan groupwise & gw code refactor ( #12337 )
...
* support minicpm 1b & qwen 1.5b gw
* support minicpm 1b
* baichuan part
* update
* support minicpm 1b & qwen 1.5b gw
* support minicpm 1b
* baichuan part
* update
* update
* update
* baichuan support
* code refactor
* remove code
* fix style
* address comments
* revert
2024-11-08 11:42:42 +08:00
binbin Deng
812d5cc32e
[NPU L0] Support llama3.2 in L0 pipeline ( #12361 )
2024-11-08 10:01:23 +08:00
Yina Chen
d880e534d2
[NPU] acclib llama3.2 support groupwise ( #12355 )
...
* change inter_pp
* add comment
2024-11-07 11:19:55 +08:00
SONG Ge
a7b66683f1
[NPU] Add Optimized Support for Llama3.2-1B/3B on NPU ( #12339 )
...
* Add initial support for llama3.2-1b/3b
* move llama3.2 support into current llama_mp impl
2024-11-06 19:21:40 +08:00
Yina Chen
d872639395
[NPU] Llama3, Qwen2 1.5b, MiniCPM 1/2B groupwise support ( #12327 )
...
* support minicpm 1b & qwen 1.5b gw
* support minicpm 1b
* support minicpm 2b
* fix style & error
* fix style & update
* remove print
2024-11-05 15:51:31 +08:00
Yina Chen
94c4ce389f
[NPU] Add env to disable compile opt ( #12330 )
...
* add env to disable compile opt
* fix style
* fix style
2024-11-04 17:46:17 +08:00
Ch1y0q
48123af463
add npu_group_size for transformers_int4_npu_win in all-in-one benchmark api ( #12316 )
...
* add `npu_group_size` for `transformers_int4_npu_win`
small bugfix
* update
2024-11-01 18:44:27 +08:00
Yina Chen
05c5d0267a
[NPU] Llama2 prefill use ov sdp ( #12310 )
...
* prefill use sdp
* add param
* update
* fix style
* fix style
* meet comments
2024-11-01 11:05:20 +08:00
Kai Huang
416c19165c
Add Qwen pipeline and example ( #12292 )
...
* support qwen pipeline
* update error msg
* style
* meet review
* minor
2024-10-31 11:25:25 +08:00
Yina Chen
0763268e4c
[NPU] Qwen2 groupwise performance opt ( #12299 )
...
* qwen2 gw performance opt
* remove debug
2024-10-30 17:40:21 +08:00
binbin Deng
41b8064554
Support minicpm-1B in level0 pipeline ( #12297 )
2024-10-30 17:21:47 +08:00
Yina Chen
70037ad55f
Groupwise prefill optimization ( #12291 )
...
* except lm_head
* remove
* support gw lm_head
* update
* fix
* remove run.bat
* fix style
* support llama3
* slice -> split
* remove debug
* fix style
* add dpu
2024-10-30 14:59:45 +08:00