Ruonan Wang
f8c2bb2943
[NPU] optimize qwen2 prefill performance for C++ (#12451)
2024-11-27 10:46:18 +08:00
Ruonan Wang
7b40f9b372
[NPU] Support GW for NPU C++ (#12450)
2024-11-26 17:46:40 +08:00
Ruonan Wang
24b46b2b19
[NPU] further fix of qwen2 int8 pipeline & C++ (#12449)
* fix
* fix style
2024-11-26 16:39:39 +08:00
Yuwen Hu
303b104c10
Fix abnormal output for Qwen2-7B when sym_int8 (#12446)
2024-11-26 15:53:04 +08:00
Ruonan Wang
52c17fe104
Optimize first token of C++ NPU by adding npu_dpu_groups (#12443)
* add npu_dpu_groups
* add check for env
* fix style
2024-11-26 11:41:32 +08:00
Jinhe
66bd7abae4
add sdxl and lora-lcm optimization (#12444)
* add sdxl and lora-lcm optimization
* fix openjourney speed drop
2024-11-26 11:38:09 +08:00
Ruonan Wang
0e23bd779f
Add support of llama3.2 for NPU C++ (#12442)
* initial support of llama3.2
* update
* update
* fix style
* fix style
* fix
* small fix
2024-11-26 09:26:55 +08:00
Yishuo Wang
cdd41f5e4c
optimize sdxl again (#12441)
2024-11-25 17:46:46 +08:00
Ruonan Wang
b9abb8a285
Support qwen2.5 3B for NPU & update related examples (#12438)
* update qwen2.5-3B
* update convert
* small fix
* replace load_in_low_bit with low_bit
* small fix
2024-11-25 16:38:31 +08:00
Yishuo Wang
8164aed802
small change (#12439)
2024-11-25 14:35:49 +08:00
Yishuo Wang
be132c4209
fix and optimize sd (#12436)
2024-11-25 14:09:48 +08:00
Ruonan Wang
f41405368a
Support minicpm for NPU C++ (#12434)
* support minicpm-1b
* update
* tune fused_layers
* update readme.md
2024-11-25 10:42:02 +08:00
Ruonan Wang
0819fad34e
support Llama2-7B / Llama3-8B for NPU C++ (#12431)
* support llama2
* update
* support fused_layers=4 for Llama2-7B
2024-11-22 18:47:19 +08:00
Ruonan Wang
4ffa6c752c
New convert support for C++ NPU (#12430)
* initial commit
* fix
* fix style
* fix style
* fix
* fix
2024-11-22 14:28:30 +08:00
Yuwen Hu
8fdc36c140
Optimize with new batch kernel when batch_size=1 on LNL (#12419)
* Add use batch kernel condition for LNL
* Fix for other device judgement
* Fix based on comment
2024-11-21 16:21:35 +08:00
Yishuo Wang
145e8b480f
update batch kernel condition (#12421)
2024-11-21 10:12:46 +08:00
Ruonan Wang
54c62feb74
[NPU] dump prefill IR for further C++ solution (#12402)
* save prefill ir
* fix
* shorten convert time
* fix
* fix
* fix
* fix
* fix style
* dump config.json
* meet review
* small fix
2024-11-20 15:20:05 +08:00
SONG Ge
ff3f7cb25f
Fix speech_paraformer issue with unexpected changes (#12416)
* Fix speech_paraformer issue with unexpected changes
* Add paraformer version specified
2024-11-19 15:01:20 +08:00
Yuwen Hu
a69395f31f
Support performance mode of GLM4 model (#12401)
* Initial support of prepare generation args for transformers 445
* Small fix to chatglm4 model optimization
* Small fix
* fix glm4 position id
* fix glm4 error
* Small change in condition & fix based on comments
* Style fixes
---------
Co-authored-by: cyita <yitastudy@gmail.com>
2024-11-18 18:46:52 +08:00
Song Fuchang
d2c821d458
Add missing arguments in pipeline parallel generate method (#12142)
Add two arguments, negative_prompt_ids and negative_prompt_attention_mask, to the generate method in pipeline_parallel.py.
2024-11-18 13:50:18 +08:00
Yishuo Wang
3d5fbf2069
update batch kernel condition (#12408)
2024-11-15 13:47:05 +08:00
binbin Deng
d4d949443f
[NPU] change attention_mask to fp16 (#12400)
2024-11-14 17:20:29 +08:00
SONG Ge
d2cbcb060c
Add initial support for modeling_xlm encoder on NPU (#12393)
* Add initial support for modeling_xlm encoder on NPU
* Add EmbeddingModel class to keep the same usage as bce and the npu fp16 linear convert
* Optimize the current implementation to support the EmbeddingModel.encode API and convert other torch modules to NPU
* Add related example and documents
2024-11-14 10:50:27 +08:00
Yina Chen
59b01fa7d2
small fix (#12397)
2024-11-14 10:03:36 +08:00
Yishuo Wang
00fce5c940
use new q4_0 batch kernel (#12396)
2024-11-13 18:37:34 +08:00
Yina Chen
d6d63d6b84
[NPU] Qwen prefill attn_mask type hotfix (#12395)
* qwen prefill attn_mask type fp16
* update
2024-11-13 17:51:34 +08:00
Yina Chen
9220babaab
qwen prefill attn_mask type fp16 (#12394)
2024-11-13 17:45:26 +08:00
Yuwen Hu
1158f91648
Fix llava with multi-image inputs (#12384)
2024-11-13 09:27:50 +08:00
Guancheng Fu
0ee54fc55f
Upgrade to vllm 0.6.2 (#12338)
* Initial updates for vllm 0.6.2
* fix
* Change Dockerfile to support v062
* Fix
* fix examples
* Fix
* done
* fix
* Update engine.py
* Fix Dockerfile to original path
* fix
* add option
* fix
* fix
* fix
* fix
---------
Co-authored-by: xiangyuT <xiangyu.tian@intel.com>
2024-11-12 20:35:34 +08:00
Ruonan Wang
6bf5a8c230
[NPU] Update qwen2 compile config (#12383)
* update
* fix
2024-11-12 16:59:44 +08:00
binbin Deng
7a97fbb779
Support vpm and resampler module of minicpm-v on NPU (#12375)
2024-11-12 15:59:55 +08:00
Yuwen Hu
e0918934c8
Add fused_mlp to glm4v models (#12378)
2024-11-11 17:10:25 +08:00
Yishuo Wang
dc34e8c51f
optimize glm4v vision attention (#12369)
2024-11-08 17:01:57 +08:00
Yishuo Wang
51f7f87768
fix ipex 2.3 bug (#12366)
2024-11-08 13:29:15 +08:00
Yina Chen
b2e69a896c
[NPU] Support Baichuan groupwise & gw code refactor (#12337)
* support minicpm 1b & qwen 1.5b gw
* support minicpm 1b
* baichuan part
* update
* update
* update
* baichuan support
* code refactor
* remove code
* fix style
* address comments
* revert
2024-11-08 11:42:42 +08:00
binbin Deng
812d5cc32e
[NPU L0] Support llama3.2 in L0 pipeline (#12361)
2024-11-08 10:01:23 +08:00
Yuwen Hu
1a6cbc473f
Add fused mlp optimizations to glm4 models (#12360)
* Add fused mlp to glm4 models
* Small fix
2024-11-07 18:52:47 +08:00
Yishuo Wang
ad68c56573
small improvement (#12359)
2024-11-07 15:57:41 +08:00
Yina Chen
d880e534d2
[NPU] acclib llama3.2 support groupwise (#12355)
* change inter_pp
* add comment
2024-11-07 11:19:55 +08:00
SONG Ge
a7b66683f1
[NPU] Add Optimized Support for Llama3.2-1B/3B on NPU (#12339)
* Add initial support for llama3.2-1b/3b
* move llama3.2 support into current llama_mp impl
2024-11-06 19:21:40 +08:00
Yuwen Hu
872a74481a
Small optimization to glm4 models (#12351)
2024-11-06 19:16:58 +08:00
Ruonan Wang
c267355b35
fix three NPU benchmark issues (#12350)
* fix three issues
* limit mixed_precision for CW only
2024-11-06 19:01:01 +08:00
Yina Chen
f24352aef9
llama 3.1/3.2 support compresskv (#12347)
* llama 3.1/3.2 support compresskv
* update
* fix transformers 4.45 error
* fix style
* fix typo
* disable llama3.2 1b compresskv
2024-11-06 17:33:43 +08:00
Yishuo Wang
e23ef7d088
optimize glm4v's vision part (#12346)
2024-11-06 15:43:40 +08:00
Yishuo Wang
c8b7265359
Add basic glm4v support (#12345)
2024-11-06 13:50:10 +08:00
binbin Deng
69e3a56943
[NPU] Hot fix of load_low_bit (#12344)
2024-11-06 10:07:00 +08:00
Yina Chen
d872639395
[NPU] Llama3, Qwen2 1.5b, MiniCPM 1/2B groupwise support (#12327)
* support minicpm 1b & qwen 1.5b gw
* support minicpm 1b
* support minicpm 2b
* fix style & error
* fix style & update
* remove print
2024-11-05 15:51:31 +08:00
Zhao Changmin
1b637e4477
Add chatglm2&3 fuse mlp (#12328)
* add chatglm fuse mlp
2024-11-04 18:04:41 +08:00
Yina Chen
94c4ce389f
[NPU] Add env to disable compile opt (#12330)
* add env to disable compile opt
* fix style
* fix style
2024-11-04 17:46:17 +08:00
Ch1y0q
e54af44ed6
Add transformers_int4_npu_pipeline_win in all-in-one benchmark (#12325)
* add transformers_int4_npu_pipeline_win
* bugfix
* bugfix: wrong actual_output_len
* fix format
* bugfix & update `README.md`
2024-11-04 16:00:20 +08:00