Commit graph

262 commits

Author SHA1 Message Date
Yuwen Hu
8fdc36c140
Optimize with new batch kernel when batch_size=1 on LNL (#12419)
* Add use batch kernel condition for LNL

* Fix for other device judgement

* Fix based on comment
2024-11-21 16:21:35 +08:00
Yuwen Hu
a69395f31f
Support performance mode of GLM4 model (#12401)
* Initial support of prepare generation args for transformers 445

* Small fix to chatglm4 model optimization

* Small fix

* fix glm4 position id

* fix glm4 error

* Small change in condition & fix based on comments

* Style fixes

---------

Co-authored-by: cyita <yitastudy@gmail.com>
2024-11-18 18:46:52 +08:00
Yishuo Wang
dc34e8c51f
optimize glm4v vision attention (#12369) 2024-11-08 17:01:57 +08:00
Yuwen Hu
1a6cbc473f
Add fused mlp optimizations to glm4 models (#12360)
* Add fused mlp to glm4 models

* Small fix
2024-11-07 18:52:47 +08:00
Yishuo Wang
ad68c56573
small improvement (#12359) 2024-11-07 15:57:41 +08:00
Yuwen Hu
872a74481a
Small optimization to glm4 models (#12351) 2024-11-06 19:16:58 +08:00
Yina Chen
f24352aef9
llama 3.1/3.2 support compresskv (#12347)
* llama 3.1/3.2 support compresskv

* update

* fix transformers 4.45 error

* fix style

* fix typo

* disable llama3.2 1b compresskv
2024-11-06 17:33:43 +08:00
Yishuo Wang
e23ef7d088
optimize glm4v's vision part (#12346) 2024-11-06 15:43:40 +08:00
Zhao Changmin
1b637e4477
Add chatglm2&3 fuse mlp (#12328)
* add chatglm fuse mlp
2024-11-04 18:04:41 +08:00
Yishuo Wang
b9853f98b3
fix qwen2 attention_mask slice (#12307) 2024-10-31 17:00:05 +08:00
Xin Qiu
97a0f7fd35
Codegeex support (#12303)
* new codegeex attn

* use kv cache

* add compress/quantize kv

* remove compress/quantize kv

* fix style check

* fix style

* fix codegeex
2024-10-31 15:28:56 +08:00
Yishuo Wang
72605c7016
fix llama3.1/3.2 quantize kv check (#12302) 2024-10-31 11:55:07 +08:00
Yishuo Wang
540eaeb12c
refactor attention_softmax (#12295) 2024-10-30 13:20:50 +08:00
Xin Qiu
39c9d1de52
fix code geex (#12261) 2024-10-24 14:34:01 +08:00
Yishuo Wang
f3a2b20e6b
Optimize gpt2 (#12259) 2024-10-24 13:44:24 +08:00
Yishuo Wang
9ea694484d
refactor to remove old rope usage (#12224) 2024-10-17 17:06:09 +08:00
Yishuo Wang
324bcb057e
refactor to reduce old rope usage (#12219) 2024-10-17 14:45:09 +08:00
Yishuo Wang
a4a758656a
refactor gemma to reduce old fuse rope usage (#12215) 2024-10-16 17:40:28 +08:00
Yishuo Wang
9104a168f6
refactor phi-2 to reduce old fuse rope usage (#12214) 2024-10-16 17:08:14 +08:00
Yishuo Wang
bb247e991b
refactor merge_qkv and attention_softmax (#12213) 2024-10-16 15:58:14 +08:00
Yishuo Wang
e279148aa0
optimize llama3.2 vision again (#12211) 2024-10-16 14:29:48 +08:00
Yishuo Wang
f6611f9d3a
optimize llama3.2 vision attention again (#12204) 2024-10-15 16:08:20 +08:00
Yishuo Wang
9b81236a2e
optimize qwen2-vl vision (#12203) 2024-10-15 15:54:25 +08:00
Yishuo Wang
d5344587ab
optimize internvl2 vision model's attention (#12198) 2024-10-15 10:51:00 +08:00
Yuwen Hu
f8d1adc573
Fix Llama 3.2 & 3.1 on LNL (#12196) 2024-10-14 17:39:20 +08:00
Yishuo Wang
535bee5381
fix qwen2 vl again (#12174) 2024-10-10 13:50:01 +08:00
Yishuo Wang
78d253165d
optimize qwen2 vl perf again (#12167) 2024-10-09 16:43:48 +08:00
Yishuo Wang
644af2a76e
add basic llama 3.2 vision support (#12163) 2024-10-08 10:46:48 +08:00
Yishuo Wang
669ff1a97b
fix sd1.5 (#12129) 2024-09-26 17:15:16 +08:00
Yishuo Wang
a266528719
optimize llama 3.2 rope (#12128) 2024-09-26 16:08:10 +08:00
Yishuo Wang
584c3489e7
add basic support for llama3.2 (#12125) 2024-09-26 15:46:19 +08:00
Yishuo Wang
66f419f8b7
fix qwen2 vl (#12126) 2024-09-26 15:44:02 +08:00
Yishuo Wang
47e0b83cbf
optimize sd 1.5 (#12119) 2024-09-25 15:45:13 +08:00
Yishuo Wang
5d63aef60b
optimize qwen2 vl again (#12109) 2024-09-23 13:22:01 +08:00
Yishuo Wang
9239fd4f12
add basic support and optimization for qwen2-vl (#12104) 2024-09-20 17:23:06 +08:00
Yishuo Wang
d8c044e79d
optimize minicpm3 kv cache (#12052) 2024-09-10 16:51:21 +08:00
Yishuo Wang
abc370728c
optimize minicpm3 again (#12047) 2024-09-10 14:19:57 +08:00
Yishuo Wang
048b4590aa
add basic minicpm3 optimization (#12039) 2024-09-09 17:25:08 +08:00
Yishuo Wang
6cedb601e4
remove some useless code (#12035) 2024-09-06 17:51:08 +08:00
Guoqiong Song
8803242f5c
fix llama on cpu (#12018) 2024-09-04 19:17:54 -07:00
Yuwen Hu
a9e485eb1b
Support MiniCPM-V-2_6 multi-modal benchmarking with latency text streamer (#11963)
* Support MiniCPM-V-2_6 multi-modal benchmarking with latency text streamer

* Style fixes
2024-08-29 19:22:09 +08:00
Yishuo Wang
0fbb10259a
use sdp_causal to reduce internvl2-4b memory usage if set environment variable (#11953) 2024-08-28 17:35:05 +08:00
hxsz1997
650e6e6ce4
Merge pull request #11891 from hxsz1997/baichuan2-compresskv
Add compress_kv for Baichuan2
2024-08-23 06:09:58 +03:00
Ruonan Wang
4a61f7d20d
update mlp of llama (#11897)
* update mlp of llama

* relax threshold of mlp test

* revert code
2024-08-22 20:34:53 +08:00
Huang, Xinshengzi
eb1e65f8a9
add comment 2024-08-22 15:14:47 +08:00
Huang, Xinshengzi
a2be3d7501
add comment of compress kv in attention forward 2024-08-22 15:11:55 +08:00
Huang, Xinshengzi
ce7de77085
add comment of change in model forward 2024-08-22 14:29:27 +08:00
Huang, Xinshengzi
42398a0045
add comment 2024-08-22 13:17:13 +08:00
Huang, Xinshengzi
48a827aa07
fix typos 2024-08-22 11:35:47 +08:00
Huang, Xinshengzi
8a5df93de2
fix typos 2024-08-22 11:33:07 +08:00