Commit graph

1990 commits

Author SHA1 Message Date
Qiyuan Gong
7e50ff113c
Add padding_token=eos_token for GPU trl QLora example (#12398)
* Avoid tokenizer doesn't have a padding token error.
2024-11-14 10:51:30 +08:00
SONG Ge
d2cbcb060c
Add initial support for modeling_xlm encoder on NPU (#12393)
* Add initial support for modeling_xlm encoder on NPU

* Add EmbeddingModel class to keep the same usage with bce and npu fp16 linear convert

* Optimize currently implementation to support EmbeddingModel.encode API and convert other torch modules to NPU

* Add related example and documents
2024-11-14 10:50:27 +08:00
Yina Chen
59b01fa7d2
small fix (#12397) 2024-11-14 10:03:36 +08:00
Yishuo Wang
00fce5c940
use new q4_0 batch kernel (#12396) 2024-11-13 18:37:34 +08:00
Yina Chen
d6d63d6b84
[NPU] Qwen prefill attn_mask type hotfix (#12395)
* qwen prefill attn_mask type fp16

* update
2024-11-13 17:51:34 +08:00
Yina Chen
9220babaab
qwen prefill attn_mask type fp16 (#12394) 2024-11-13 17:45:26 +08:00
Yuwen Hu
1158f91648
Fix llava with multi-image inputs (#12384) 2024-11-13 09:27:50 +08:00
Guancheng Fu
0ee54fc55f
Upgrade to vllm 0.6.2 (#12338)
* Initial updates for vllm 0.6.2

* fix

* Change Dockerfile to support v062

* Fix

* fix examples

* Fix

* done

* fix

* Update engine.py

* Fix Dockerfile to original path

* fix

* add option

* fix

* fix

* fix

* fix

---------

Co-authored-by: xiangyuT <xiangyu.tian@intel.com>
2024-11-12 20:35:34 +08:00
Ruonan Wang
6bf5a8c230
[NPU] Update qwen2 compile config (#12383)
* update

* fix
2024-11-12 16:59:44 +08:00
binbin Deng
7a97fbb779
Support vpm and resampler module of minicpm-v on NPU (#12375) 2024-11-12 15:59:55 +08:00
Yuwen Hu
e0918934c8
Add fused_mlp to glm4v models (#12378) 2024-11-11 17:10:25 +08:00
Yishuo Wang
dc34e8c51f
optimize glm4v vision attention (#12369) 2024-11-08 17:01:57 +08:00
Qiyuan Gong
2dfcc36825
Fix trl version and padding in trl qlora example (#12368)
* Change trl to 0.9.6
* Enable padding to avoid padding related errors.
2024-11-08 16:05:17 +08:00
Yishuo Wang
51f7f87768
fix ipex 2.3 bug (#12366) 2024-11-08 13:29:15 +08:00
Yina Chen
b2e69a896c
[NPU] Support Baichuan groupwise & gw code refactor (#12337)
* support minicpm 1b & qwen 1.5b gw

* support minicpm 1b

* baichuan part

* update

* support minicpm 1b & qwen 1.5b gw

* support minicpm 1b

* baichuan part

* update

* update

* update

* baichuan support

* code refactor

* remove code

* fix style

* address comments

* revert
2024-11-08 11:42:42 +08:00
binbin Deng
812d5cc32e
[NPU L0] Support llama3.2 in L0 pipeline (#12361) 2024-11-08 10:01:23 +08:00
Yuwen Hu
8fe294e01f
Small fix to all-in-one benchmark (#12362) 2024-11-07 18:56:34 +08:00
Yuwen Hu
1a6cbc473f
Add fused mlp optimizations to glm4 models (#12360)
* Add fused mlp to glm4 models

* Small fix
2024-11-07 18:52:47 +08:00
Yishuo Wang
ad68c56573
small improvement (#12359) 2024-11-07 15:57:41 +08:00
Yina Chen
d880e534d2
[NPU] acclib llama3.2 support groupwise (#12355)
* change inter_pp

* add comment
2024-11-07 11:19:55 +08:00
Jinhe
79f2877413
add minicpm-v models to transformers_int4_npu_win api (#12352)
* add minicpm npu

* optimize model
2024-11-07 10:05:10 +08:00
SONG Ge
a7b66683f1
[NPU] Add Optimized Support for Llama3.2-1B/3B on NPU (#12339)
* Add initial support for llama3.2-1b/3b

* move llama3.2 support into current llama_mp impl
2024-11-06 19:21:40 +08:00
Yuwen Hu
872a74481a
Small optimization to glm4 models (#12351) 2024-11-06 19:16:58 +08:00
Ruonan Wang
c267355b35
fix three NPU benchmark issues (#12350)
* fix three issues

* limit mixed_precision for CW only
2024-11-06 19:01:01 +08:00
Yina Chen
f24352aef9
llama 3.1/3.2 support compresskv (#12347)
* llama 3.1/3.2 support compresskv

* update

* fix transformers 4.45 error

* fix style

* fix typo

* disable llama3.2 1b compresskv
2024-11-06 17:33:43 +08:00
Jin, Qiao
d984c0672a
Add MiniCPM-V-2_6 to arc perf test (#12349) 2024-11-06 16:32:28 +08:00
Yishuo Wang
e23ef7d088
optimize glm4v's vision part (#12346) 2024-11-06 15:43:40 +08:00
Yishuo Wang
c8b7265359
Add basic glm4v support (#12345) 2024-11-06 13:50:10 +08:00
binbin Deng
69e3a56943
[NPU] Hot fix of load_low_bit (#12344) 2024-11-06 10:07:00 +08:00
Jin, Qiao
7240c283a3
Add dummy model in iGPU perf (#12341)
* Add dummy model in iGPU perf

* Add dummy model in iGPU perf

* Fix
2024-11-05 17:56:10 +08:00
Zhao Changmin
8e9a3a1158
fix chatglm2 cpu ut (#12336) 2024-11-05 16:43:57 +08:00
Yina Chen
d872639395
[NPU] Llama3, Qwen2 1.5b, MiniCPM 1/2B groupwise support (#12327)
* support minicpm 1b & qwen 1.5b gw

* support minicpm 1b

* support minicpm 2b

* fix style & error

* fix style & update

* remove print
2024-11-05 15:51:31 +08:00
Jin, Qiao
82a61b5cf3
Limit trl version in example (#12332)
* Limit trl version in example

* Limit trl version in example
2024-11-05 14:50:10 +08:00
Zijie Li
45b0d371aa
update benchmark readme (#12323)
* update benchmark readme

update new comment with memory usage included

* Update README.md
2024-11-05 08:19:08 +08:00
Zhao Changmin
1b637e4477
Add chatglm2&3 fuse mlp (#12328)
* add chatglm fuse mlp
2024-11-04 18:04:41 +08:00
Yina Chen
94c4ce389f
[NPU] Add env to disable compile opt (#12330)
* add env to disable compile opt

* fix style

* fix style
2024-11-04 17:46:17 +08:00
Ch1y0q
e54af44ed6
Add transformers_int4_npu_pipeline_win in all-in-one benchmark (#12325)
* add transformers_int4_npu_pipeline_win

* bugfix

* bugfix: wrong actual_output_len

* fix format

* bugfix & update `README.md`
2024-11-04 16:00:20 +08:00
binbin Deng
5ee6f97d6f
[NPU L0] Add layernorm weight as const / input setting (#12322) 2024-11-04 15:46:24 +08:00
Chu,Youcheng
a01371f90b
Doc: update harness readme (#12324) 2024-11-04 14:58:54 +08:00
Kai Huang
c8679ad592
Qwen layernorm as input (#12309)
* qwen layernorm as input

* add group size
2024-11-04 09:51:15 +08:00
Yuwen Hu
20755e8077
Small fix to all-in-one benchmark scripts (#12317) 2024-11-01 19:16:25 +08:00
Ch1y0q
48123af463
add npu_group_size for transformers_int4_npu_win in all-in-one benchmark api (#12316)
* add `npu_group_size` for `transformers_int4_npu_win`
small bugfix

* update
2024-11-01 18:44:27 +08:00
Zijie Li
cd5e22cee5
Update Llava GPU Example (#12311)
* update-llava-example

* add warmup

* small fix on llava example

* remove space& extra print prompt

* renew example

* small fix

---------
Co-authored-by: Jinhe Tang <jin.tang1337@gmail.com>
2024-11-01 17:06:00 +08:00
binbin Deng
f53bb4ea0b
[NPU L0] Update 1st token generation (#12314) 2024-11-01 17:02:07 +08:00
binbin Deng
d409d9d0eb
[NPU L0] Update streaming mode of example (#12312) 2024-11-01 15:38:10 +08:00
Jin, Qiao
126f95be80
Fix DPO finetuning example (#12313) 2024-11-01 13:29:44 +08:00
Yina Chen
05c5d0267a
[NPU] Llama2 prefill use ov sdp (#12310)
* prefill use sdp

* add param

* update

* fix style

* fix style

* meet comments
2024-11-01 11:05:20 +08:00
binbin Deng
eda764909c
Add minicpm-2b in L0 pipeline (#12308) 2024-11-01 09:30:01 +08:00
Yishuo Wang
b9853f98b3
fix qwen2 attention_mask slice (#12307) 2024-10-31 17:00:05 +08:00
Jin, Qiao
3df6195cb0
Fix application quickstart (#12305)
* fix graphrag quickstart

* fix axolotl quickstart

* fix ragflow quickstart

* fix ragflow quickstart

* fix graphrag toc

* fix comments

* fix comment

* fix comments
2024-10-31 16:57:35 +08:00