Commit graph

30 commits

Author SHA1 Message Date
Yishuo Wang
0fbb10259a
use sdp_causal to reduce internvl2-4b memory usage if set environment variable (#11953) 2024-08-28 17:35:05 +08:00
Yishuo Wang
bd1e490d62
fix phi3 (#11878) 2024-08-21 10:31:41 +08:00
Yina Chen
c3c058373f
Update compresskv model forward type logic (#11868)
* update

* fix
2024-08-20 18:11:37 +08:00
Yishuo Wang
d4ee0a89f3
optimize phi3 memory usage (#11867) 2024-08-20 17:32:51 +08:00
Yishuo Wang
9490781aec
optimize phi3 memory usage again (#11848) 2024-08-19 17:26:59 +08:00
Yina Chen
3cd4e87168
Support compress KV with quantize KV (#11812)
* update llama

* support llama 4.41

* fix style

* support minicpm

* support qwen2

* support minicpm & update

* support chatglm4

* support chatglm

* remove print

* add DynamicCompressFp8Cache & support qwen

* support llama

* support minicpm phi3

* update chatglm2/4

* small fix & support qwen 4.42

* remove print
2024-08-19 15:32:32 +08:00
Yishuo Wang
828ab16537
fix phi3 and minicpmv cpu (#11818) 2024-08-15 17:43:29 +08:00
Yina Chen
7cd6ec9723
MiniCPM-V support compresskv (#11779)
* fix check error

* fix other models

* remove print
2024-08-13 19:03:40 +08:00
Yishuo Wang
aa861df066
use new fp32 softmax kernel (#11776) 2024-08-13 14:48:11 +08:00
Yina Chen
4b9c57cc60
Support compress kv with lookahead (#11752)
* support compress kv with lookahead

* enough kv miss param
2024-08-09 17:39:57 +08:00
Yina Chen
dd46c141bd
Phi3 support compresskv (#11733)
* phi3 support compresskv

* fix phi3 mtl error

* fix conflict with quant kv

* fix abnormal on mtl

* fix style

* use slide windows size to compress kv

* support sliding window

* fix style

* fix style

* temp: partial support quant kv

* support quant kv with compress kv, todo: model check

* temp

* fix style

* fix style

* remove prepare

* address comment

* default -> 1.8k
2024-08-09 15:43:43 +08:00
Yishuo Wang
c093f7d980
fix phi3 (#11729) 2024-08-07 09:39:46 +08:00
Yishuo Wang
929675aa6b
support latest phi3 (#11721) 2024-08-06 15:52:55 +08:00
SONG Ge
ef4b6519fb
Add phi-3 model support for pipeline parallel inference (#11334)
* add phi-3 model support

* add phi3 example
2024-06-17 17:44:24 +08:00
Yishuo Wang
bc5008f0d5
disable sdp_causal in phi-3 to fix overflow (#11157) 2024-05-28 17:25:53 +08:00
Yishuo Wang
d307622797
fix first token sdp with batch (#11153) 2024-05-28 15:03:06 +08:00
Yina Chen
b6b70d1ba0
Divide core-xe packages (#11131)
* temp

* add batch

* fix style

* update package name

* fix style

* add workflow

* use temp version to run uts

* trigger performance test

* trigger win igpu perf

* revert workflow & setup
2024-05-28 12:00:18 +08:00
Yishuo Wang
cd4dff09ee
support phi-3 vision (#11101) 2024-05-22 17:43:50 +08:00
Yishuo Wang
8cae897643
use new rope in phi3 (#11047) 2024-05-16 15:12:35 +08:00
Yishuo Wang
59df750326
Use new sdp again (#11025) 2024-05-16 09:33:34 +08:00
Yishuo Wang
fad1dbaf60
use sdp fp8 causal kernel (#11023) 2024-05-15 10:22:35 +08:00
Zhao Changmin
0a732bebe7
Add phi3 cached RotaryEmbedding (#11013)
* phi3cachedrotaryembed

* pep8
2024-05-15 08:16:43 +08:00
Zhao Changmin
b03c859278
Add phi3RMS (#10988)
* phi3RMS
2024-05-14 15:16:27 +08:00
Yishuo Wang
170e3d65e0
use new sdp and fp32 sdp (#11007) 2024-05-14 14:29:18 +08:00
Yishuo Wang
ad96f32ce0
optimize phi3 1st token performance (#10981) 2024-05-10 17:33:46 +08:00
Yishuo Wang
697ca79eca
use quantize kv and sdp in phi3-mini (#10973) 2024-05-09 15:16:18 +08:00
Yishuo Wang
2ebec0395c
optimize phi-3-mini-128 (#10959) 2024-05-08 16:33:17 +08:00
Yishuo Wang
c801c37bc6
optimize phi3 again: use quantize kv if possible (#10953) 2024-05-07 17:26:19 +08:00
Yishuo Wang
aa2fa9fde1
optimize phi3 again: use sdp if possible (#10951) 2024-05-07 15:53:08 +08:00
Yishuo Wang
2d210817ff
add phi3 optimization (#10871) 2024-04-24 15:17:40 +08:00