Commit graph

536 commits

Kai Huang
c8679ad592
Qwen layernorm as input (#12309)
* qwen layernorm as input

* add group size
2024-11-04 09:51:15 +08:00
Ch1y0q
48123af463
add npu_group_size for transformers_int4_npu_win in all-in-one benchmark api (#12316)
* add `npu_group_size` for `transformers_int4_npu_win`

* small bugfix

* update
2024-11-01 18:44:27 +08:00
binbin Deng
f53bb4ea0b
[NPU L0] Update 1st token generation (#12314) 2024-11-01 17:02:07 +08:00
binbin Deng
d409d9d0eb
[NPU L0] Update streaming mode of example (#12312) 2024-11-01 15:38:10 +08:00
Yina Chen
05c5d0267a
[NPU] Llama2 prefill use ov sdp (#12310)
* prefill use sdp

* add param

* update

* fix style

* fix style

* meet comments
2024-11-01 11:05:20 +08:00
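
The "prefill use sdp" change above routes Llama2's prefill attention through a fused scaled-dot-product kernel. Below is a minimal PyTorch sketch of that technique; the commit itself uses an OpenVINO sdp op whose exact interface is not shown in this log, so all names here are illustrative.

```python
# Minimal sketch of fused scaled-dot-product attention for the prefill
# pass. Stand-in for the OpenVINO sdp op referenced in the commit;
# function and parameter names are ours.
import torch
import torch.nn.functional as F

def prefill_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                      attn_mask: torch.Tensor = None) -> torch.Tensor:
    # q, k, v: [batch, num_heads, seq_len, head_dim]
    # With no explicit mask, let the kernel apply the causal mask itself,
    # which is what prefill over the whole prompt needs.
    return F.scaled_dot_product_attention(
        q, k, v, attn_mask=attn_mask, is_causal=attn_mask is None)
```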
binbin Deng
eda764909c
Add minicpm-2b in L0 pipeline (#12308) 2024-11-01 09:30:01 +08:00
Yishuo Wang
b9853f98b3
fix qwen2 attention_mask slice (#12307) 2024-10-31 17:00:05 +08:00
binbin Deng
4892df61c9
Add qwen2-1.5b in l0 pipeline example (#12306) 2024-10-31 16:44:25 +08:00
Xin Qiu
97a0f7fd35
Codegeex support (#12303)
* new codegeex attn

* use kv cache

* add compress/quantize kv

* remove compress/quantize kv

* fix style check

* fix style

* fix codegeex
2024-10-31 15:28:56 +08:00
Yishuo Wang
72605c7016
fix llama3.1/3.2 quantize kv check (#12302) 2024-10-31 11:55:07 +08:00
Kai Huang
416c19165c
Add Qwen pipeline and example (#12292)
* support qwen pipeline

* update error msg

* style

* meet review

* minor
2024-10-31 11:25:25 +08:00
Yina Chen
0763268e4c
[NPU] Qwen2 groupwise performance opt (#12299)
* qwen2 gw performance opt

* remove debug
2024-10-30 17:40:21 +08:00
binbin Deng
41b8064554
Support minicpm-1B in level0 pipeline (#12297) 2024-10-30 17:21:47 +08:00
Jinhe
46d8300f6b
bugfix for qlora finetuning on GPU (#12298)
* bugfix for qlora 100 step error

* indent fix

* annotation fix
2024-10-30 16:54:10 +08:00
Yina Chen
70037ad55f
Groupwise prefill optimization (#12291)
* except lm_head

* remove

* support gw lm_head

* update

* fix

* remove run.bat

* fix style

* support llama3

* slice -> split

* remove debug

* fix style

* add dpu
2024-10-30 14:59:45 +08:00
Yishuo Wang
540eaeb12c
refactor attention_softmax (#12295) 2024-10-30 13:20:50 +08:00
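
A refactor like the one above typically consolidates the per-model softmax code into one shared helper. A minimal sketch under that assumption (the signature is ours, not necessarily the repo's): add the mask, upcast to fp32 for numerical stability, softmax, then cast back.

```python
# Assumed shape of a shared attention_softmax helper; illustrative only.
from typing import Optional
import torch

def attention_softmax(attn_weights: torch.Tensor,
                      attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    # upcast to fp32 for a numerically stable softmax, then cast back
    return torch.softmax(attn_weights, dim=-1,
                         dtype=torch.float32).to(attn_weights.dtype)
```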
Ruonan Wang
2b2cb9c693
[NPU pipeline] Support save & load and update examples (#12293)
* support save & load, update llama examples

* update baichuan2 example

* update readme
2024-10-30 10:02:00 +08:00
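
The save & load support above builds on ipex-llm's general low-bit flow. A hedged usage sketch of that flow follows; the model id is an assumed example, and the NPU pipeline variants in the updated examples take further arguments not shown here.

```python
# Hedged sketch of ipex-llm's low-bit save & load; NPU pipeline
# arguments (e.g. pipeline=True, max_context_len) are omitted.
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",   # assumed example model
    load_in_low_bit="sym_int4",
    trust_remote_code=True,
)
model.save_low_bit("./llama2-7b-sym-int4")   # persist converted weights
# Later runs skip the original checkpoint and the conversion step:
model = AutoModelForCausalLM.load_low_bit(
    "./llama2-7b-sym-int4", trust_remote_code=True)
```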
Yuwen Hu
5a15098835
Initial support for quantized forward on CPU when quantization_group_size=0 (#12282)
* Initial support for quantized forward on CPU when quantization_group_size=0

* Style fix

* Style fix

* Small fix

* Small fix
2024-10-29 19:40:17 +08:00
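
Under our reading of the commit above, `quantization_group_size=0` selects one scale per output channel, while a positive group size scales fixed-size groups within each row. A sketch of that distinction, with int4 packing and the actual CPU kernel omitted:

```python
# Illustrative symmetric int4 quantization; names and numeric details
# are ours, not the repo's kernel.
import torch

def quantize_weight(w: torch.Tensor, group_size: int = 0):
    # w: [out_features, in_features], symmetric int4 range [-8, 7]
    if group_size == 0:
        # group size 0: one scale per output channel (row)
        scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-8)
        q = torch.clamp(torch.round(w / scale), -8, 7)
    else:
        # positive group size: one scale per fixed-size group within a row
        g = w.reshape(w.shape[0], -1, group_size)
        scale = (g.abs().amax(dim=2, keepdim=True) / 7.0).clamp(min=1e-8)
        q = torch.clamp(torch.round(g / scale), -8, 7).reshape_as(w)
    return q.to(torch.int8), scale
```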
binbin Deng
3feb58d1e4
Support baichuan2 for level0 pipeline (#12289) 2024-10-29 19:24:16 +08:00
Zhao Changmin
546f455e8e
Patch sdpa check function in specific module attributes table (#12285) 2024-10-29 18:41:09 +08:00
Ruonan Wang
821b0033ed
[NPU L0] update layernorm & code refactor (#12287)
* update layernorm & code refactor

* fix style

* add common utils

* change to Pool()

* remove print
2024-10-29 15:01:45 +08:00
Yina Chen
4467645088
[NPU] Support l0 Llama groupwise (#12276)
* except lm_head

* remove

* support gw lm_head

* update

* fix

* remove run.bat

* fix style

* support llama3
2024-10-28 17:06:55 +08:00
Ruonan Wang
3fe2ea3081
[NPU] Reuse prefill of acc lib for pipeline (#12279)
* first commit

* update example

* fix style

* update example

* embedding as const

* fix generate

* code refactor

* meet code review

* fix style

* change max_output_len to max_context_len

* fix all-in-one

* fix example

* add check for new tokens
2024-10-28 16:05:49 +08:00
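
The "max_output_len to max_context_len" rename and the new-token check above suggest a bound on the prompt plus all generated tokens together. A hypothetical sketch of such a check, with function and argument names that are ours rather than the repo's:

```python
# Hypothetical guard implied by the rename; illustrative only.
def check_new_tokens(prompt_len: int, max_new_tokens: int,
                     max_context_len: int) -> None:
    # max_context_len caps prompt + generated tokens together
    if prompt_len + max_new_tokens > max_context_len:
        raise ValueError(
            f"prompt ({prompt_len}) + max_new_tokens ({max_new_tokens}) "
            f"exceeds max_context_len ({max_context_len})")
```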
SONG Ge
08cb065370
hot-fix redundant import funasr (#12277) 2024-10-25 19:40:39 +08:00
SONG Ge
a0c6432899
[NPU] Add support for loading a FunASR model (#12073)
* add support for loading funasr model

* add initial support for paraformer-encoder

* add npu ops impl

* add encoder-decoder npu pipeline

* move the first 30 layers of the paraformer encoder to npu and keep the remaining layers on cpu
2024-10-25 17:22:01 +08:00
Yuwen Hu
43b25a2fe7
Fix llama 3.2 vision on LNL (#12264)
* Fix llama 3.2 vision on LNL

* Small fix
2024-10-25 16:23:31 +08:00
Ruonan Wang
ae57e23e4f
fix incompatibility between llama GW & llama pipeline (#12267)
* fix

* fix
2024-10-25 10:31:44 +08:00
Yina Chen
b5e663854b
[NPU] Support llama groupwise (#12260)
* support llama gw

* support llama gw lm_head

* fix style

* remove unused code
2024-10-24 18:06:45 +08:00
Xin Qiu
39c9d1de52
fix code geex (#12261) 2024-10-24 14:34:01 +08:00
Yishuo Wang
f3a2b20e6b
Optimize gpt2 (#12259) 2024-10-24 13:44:24 +08:00
Ruonan Wang
821fd96367
Initial integration of our L0 Llama impl into ipex-llm (#12255)
* temp save

* initial support

* fix

* simplify code

* fix style

* fix example

* set default value of pipeline to False
2024-10-24 09:49:27 +08:00
Yishuo Wang
cacc891962
Fix PR validation (#12253) 2024-10-23 18:10:47 +08:00
binbin Deng
b685cf4349
Fix npu group size setting when optimize_model=False (#12256) 2024-10-23 17:53:54 +08:00
binbin Deng
567b77a76b
Support IR and blob format for llama level0 pipeline (#12251) 2024-10-23 16:02:35 +08:00
Yishuo Wang
578aef245d
Fix models auto-choosing SdpaAttention with ipex 2.3 (#12252) 2024-10-23 15:33:45 +08:00
Yishuo Wang
88dc120a4c
fix fp16 linear (#12250) 2024-10-23 14:35:19 +08:00
Yina Chen
e8cf7f32f5
npu gw small fix (#12249) 2024-10-23 14:26:01 +08:00
Yina Chen
e37f951cce
[NPU] Groupwise (#12241)
* dq divide

* fix

* support attn divide

* update qwen2 7b

* divide down_proj & other linear

* use concat & reduce sum

* support scale after

* support qwen2

* w/ mm

* update reshape

* sdpa

* split

* split 2+

* update

* lm head -> 28

* no scale

* update

* update

* update

* fix style

* fix style

* to split linear

* update

* update code

* address comments

* fix style & remove redundant code & revert benchmark scripts

* fix style & remove code

* update save & load

---------

Co-authored-by: Yang Wang <yang3.wang@intel.com>
2024-10-23 14:10:58 +08:00
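
The "to split linear" and "use concat & reduce sum" steps above describe dividing a large matmul along the input dimension into chunks that fit the device, then summing the partial products. An illustrative sketch of that technique, mirroring the idea rather than the repo's actual kernels:

```python
# Split a linear layer's matmul into n_splits partial products and
# reduce-sum them; names are illustrative.
import torch

def split_linear(x: torch.Tensor, w: torch.Tensor, n_splits: int) -> torch.Tensor:
    # x: [..., in_features], w: [out_features, in_features]
    x_chunks = x.chunk(n_splits, dim=-1)
    w_chunks = w.chunk(n_splits, dim=-1)
    partials = [xc @ wc.t() for xc, wc in zip(x_chunks, w_chunks)]
    # stack the partial products, then reduce-sum over the split axis
    return torch.stack(partials, dim=0).sum(dim=0)
```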
Yina Chen
ec465fbcd7
Add lookup generate in load_low_bit (#12243)
* add lookup generate in load_low_bit

* update comment
2024-10-22 15:51:52 +08:00
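
A rough sketch of the idea behind lookup generate, i.e. prompt-lookup candidate generation: match the trailing n-gram of the sequence against the earlier context and propose the tokens that followed it as a draft. Simplified; the real path verifies the draft tokens with the model, and all names here are ours.

```python
# Hypothetical prompt-lookup draft step; illustrative only.
from typing import List

def lookup_candidates(tokens: List[int], ngram: int = 3, n_draft: int = 8) -> List[int]:
    key = tokens[-ngram:]
    # scan backwards, skipping the trailing n-gram itself
    for i in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[i:i + ngram] == key:
            return tokens[i + ngram:i + ngram + n_draft]
    return []  # no match: fall back to normal decoding
```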
Yuwen Hu
b3df47486d
Fix Gemma 2 on LNL (#12240)
* Fix gemma 2 on LNL

* Python style fix
2024-10-21 18:25:53 +08:00
Yishuo Wang
9ea694484d
refactor to remove old rope usage (#12224) 2024-10-17 17:06:09 +08:00
Yishuo Wang
324bcb057e
refactor to reduce old rope usage (#12219) 2024-10-17 14:45:09 +08:00
Yishuo Wang
a4a758656a
refactor gemma to reduce old fuse rope usage (#12215) 2024-10-16 17:40:28 +08:00
Yishuo Wang
9104a168f6
refactor phi-2 to reduce old fuse rope usage (#12214) 2024-10-16 17:08:14 +08:00
Yishuo Wang
bb247e991b
refactor merge_qkv and attention_softmax (#12213) 2024-10-16 15:58:14 +08:00
Yishuo Wang
e279148aa0
optimize llama3.2 vision again (#12211) 2024-10-16 14:29:48 +08:00
Yishuo Wang
f6611f9d3a
optimize llama3.2 vision attention again (#12204) 2024-10-15 16:08:20 +08:00
Yishuo Wang
9b81236a2e
optimize qwen2-vl vision (#12203) 2024-10-15 15:54:25 +08:00
Yishuo Wang
d5344587ab
optimize internvl2 vision model's attention (#12198) 2024-10-15 10:51:00 +08:00
Yuwen Hu
f8d1adc573
Fix Llama 3.2 & 3.1 on LNL (#12196) 2024-10-14 17:39:20 +08:00