SONG Ge
|
192ae35012
|
Add support for llama2 quantize_kv with transformers 4.38.0 (#11054)
* add support for llama2 quantize_kv with transformers 4.38.0
* fix code style
* fix code style
|
2024-05-16 22:23:39 +08:00 |
|
SONG Ge
|
16b2a418be
|
hotfix native_sdp ut (#11046)
* hotfix native_sdp
* update
|
2024-05-16 17:15:37 +08:00 |
|
Xin Qiu
|
6be70283b7
|
fix chatglm run error (#11045)
* fix chatglm
* update
* fix style
|
2024-05-16 15:39:18 +08:00 |
|
Yishuo Wang
|
8cae897643
|
use new rope in phi3 (#11047)
|
2024-05-16 15:12:35 +08:00 |
|
Yishuo Wang
|
59df750326
|
Use new sdp again (#11025)
|
2024-05-16 09:33:34 +08:00 |
|
SONG Ge
|
9942a4ba69
|
[WIP] Support llama2 with transformers==4.38.0 (#11024)
* support llama2 with transformers==4.38.0
* add support for quantize_qkv
* add original support for 4.38.0 now
* code style fix
|
2024-05-15 18:07:00 +08:00 |
|
Ruonan Wang
|
ac384e0f45
|
add fp6 mlp fusion (#11032)
* add fp6 fusion
* add qkv fusion for fp6
* remove qkv first
|
2024-05-15 17:42:50 +08:00 |
|
Yishuo Wang
|
fad1dbaf60
|
use sdp fp8 causal kernel (#11023)
|
2024-05-15 10:22:35 +08:00 |
|
Zhao Changmin
|
0a732bebe7
|
Add phi3 cached RotaryEmbedding (#11013)
* phi3cachedrotaryembed
* pep8
|
2024-05-15 08:16:43 +08:00 |
|
Zhao Changmin
|
b03c859278
|
Add phi3RMS (#10988)
* phi3RMS
|
2024-05-14 15:16:27 +08:00 |
|
Yishuo Wang
|
170e3d65e0
|
use new sdp and fp32 sdp (#11007)
|
2024-05-14 14:29:18 +08:00 |
|
Yishuo Wang
|
ad96f32ce0
|
optimize phi3 1st token performance (#10981)
|
2024-05-10 17:33:46 +08:00 |
|
Cengguang Zhang
|
cfed76b2ed
|
LLM: add long-context support for Qwen1.5-7B/Baichuan2-7B/Mistral-7B. (#10937)
* LLM: add split tensor support for baichuan2-7b and qwen1.5-7b.
* fix style.
* fix style.
* fix style.
* add support for mistral and fix condition threshold.
* fix style.
* fix comments.
|
2024-05-10 16:40:15 +08:00 |
|
Yishuo Wang
|
e753125880
|
use fp16_sdp when head_dim=96 (#10976)
|
2024-05-09 17:02:59 +08:00 |
|
Yishuo Wang
|
697ca79eca
|
use quantize kv and sdp in phi3-mini (#10973)
|
2024-05-09 15:16:18 +08:00 |
|
Yishuo Wang
|
2ebec0395c
|
optimize phi-3-mini-128 (#10959)
|
2024-05-08 16:33:17 +08:00 |
|
Yishuo Wang
|
c801c37bc6
|
optimize phi3 again: use quantize kv if possible (#10953)
|
2024-05-07 17:26:19 +08:00 |
|
Yishuo Wang
|
aa2fa9fde1
|
optimize phi3 again: use sdp if possible (#10951)
|
2024-05-07 15:53:08 +08:00 |
|
Wang, Jian4
|
191b184341
|
LLM: Optimize cohere model (#10878)
* use mlp and rms
* optimize kv_cache
* add fuse qkv
* add flash attention and fp16 sdp
* error fp8 sdp
* fix optimized
* fix style
* update
* add for pp
|
2024-05-07 10:19:50 +08:00 |
|
Cengguang Zhang
|
75dbf240ec
|
LLM: update split tensor conditions. (#10872)
* LLM: update split tensor condition.
* add cond for split tensor.
* update priority of env.
* fix style.
* update env name.
|
2024-04-30 17:07:21 +08:00 |
|
Yishuo Wang
|
d884c62dc4
|
remove new_layout parameter (#10906)
|
2024-04-29 10:31:50 +08:00 |
|
Guancheng Fu
|
c9fac8c26b
|
Fix sdp logic (#10896)
* fix
* fix
|
2024-04-28 22:02:14 +08:00 |
|
Cengguang Zhang
|
9752ffe979
|
LLM: update split qkv native sdp. (#10895)
* LLM: update split qkv native sdp.
* fix typo.
|
2024-04-26 18:47:35 +08:00 |
|
Yishuo Wang
|
46ba962168
|
use new quantize kv (#10888)
|
2024-04-26 14:42:17 +08:00 |
|
Yina Chen
|
8811f268ff
|
Use new fp16 sdp in Qwen and modify the constraint (#10882)
|
2024-04-25 19:23:37 +08:00 |
|
Yina Chen
|
dc27b3bc35
|
Use sdp when rest token seq_len > 1 in llama & mistral (for lookup & spec) (#10790)
* update sdp condition
* update
* fix
* update & test llama
* mistral
* fix style
* update
* fix style
* remove pvc constraint
* update ds on arc
* fix style
|
2024-04-24 17:24:01 +08:00 |
|
binbin Deng
|
c9feffff9a
|
LLM: support Qwen1.5-MoE-A2.7B-Chat pipeline parallel inference (#10864)
|
2024-04-24 16:02:27 +08:00 |
|
Yishuo Wang
|
2d210817ff
|
add phi3 optimization (#10871)
|
2024-04-24 15:17:40 +08:00 |
|
Cengguang Zhang
|
763413b7e1
|
LLM: support llama split tensor for long context in transformers>=4.36. (#10844)
* LLM: support llama split tensor for long context in transformers>=4.36.
* fix dtype.
* fix style.
* fix style.
* fix style.
* fix style.
* fix dtype.
* fix style.
|
2024-04-23 16:13:25 +08:00 |
|
Wang, Jian4
|
18c032652d
|
LLM: Add mixtral speculative CPU example (#10830)
* init mixtral sp example
* use different prompt_format
* update output
* update
|
2024-04-23 10:05:51 +08:00 |
|
Yishuo Wang
|
fe5a082b84
|
add phi-2 optimization (#10843)
|
2024-04-22 18:56:47 +08:00 |
|
Guancheng Fu
|
caf75beef8
|
Disable sdpa (#10814)
|
2024-04-19 17:33:18 +08:00 |
|
Yishuo Wang
|
08458b4f74
|
remove rms norm copy (#10793)
|
2024-04-19 13:57:48 +08:00 |
|
Ruonan Wang
|
754b0ffecf
|
Fix pvc llama (#10798)
* fix
* update
|
2024-04-18 10:44:57 -07:00 |
|
Wang, Jian4
|
14ca42a048
|
LLM: Fix moe indexes error on cpu (#10791)
|
2024-04-18 15:56:52 +08:00 |
|
Wang, Jian4
|
209c3501e6
|
LLM: Optimize qwen1.5 moe model (#10706)
* update moe block
* fix style
* enable optimize MLP
* enable kv_cache
* enable fuse rope
* enable fused qkv
* enable flash_attention
* error sdp quantize
* use old api
* use fuse
* use xetla
* fix python style
* update moe_blocks num
* fix output error
* add cpu sdpa
* update
* update
* update
|
2024-04-18 14:54:05 +08:00 |
|
Ziteng Zhang
|
ff040c8f01
|
LISA Finetuning Example (#10743)
* enabling xetla only supports qtype=SYM_INT4 or FP8E5
* LISA Finetuning Example on gpu
* update readme
* add licence
* Explain parameters of lisa & Move backend codes to src dir
* fix style
* fix style
* update readme
* support chatglm
* fix style
* fix style
* update readme
* fix
|
2024-04-18 13:48:10 +08:00 |
|
Yang Wang
|
952e517db9
|
use config rope_theta (#10787)
* use config rope_theta
* fix style
|
2024-04-17 20:39:11 -07:00 |
|
Guancheng Fu
|
31ea2f9a9f
|
Fix wrong output for Llama models on CPU (#10742)
|
2024-04-18 11:07:27 +08:00 |
|
Xin Qiu
|
e764f9b1b1
|
Disable fast fused rope on UHD (#10780)
* use decoding fast path
* update
* update
* cleanup
|
2024-04-18 10:03:53 +08:00 |
|
Wang, Jian4
|
a20271ffe4
|
LLM: Fix yi-6b fp16 error on pvc (#10781)
* update for yi fp16
* update
* update
|
2024-04-17 16:49:59 +08:00 |
|
Cengguang Zhang
|
3e2662c87e
|
LLM: fix get env KV_CACHE_ALLOC_BLOCK_LENGTH type. (#10771)
|
2024-04-16 09:32:30 +08:00 |
|
binbin Deng
|
c3fc8f4b90
|
LLM: add bs limitation for llama softmax upcast to fp32 (#10752)
|
2024-04-12 15:40:25 +08:00 |
|
Yishuo Wang
|
8086554d33
|
use new fp16 sdp in llama and mistral (#10734)
|
2024-04-12 10:49:02 +08:00 |
|
Yang Wang
|
019293e1b9
|
Fuse MOE indexes computation (#10716)
* try moe
* use c++ cpu to compute indexes
* fix style
|
2024-04-11 10:12:55 -07:00 |
|
Cengguang Zhang
|
4b024b7aac
|
LLM: optimize chatglm2 8k input. (#10723)
* LLM: optimize chatglm2 8k input.
* rename.
|
2024-04-10 16:59:06 +08:00 |
|
Keyan (Kyrie) Zhang
|
585c174e92
|
Read the value of KV_CACHE_ALLOC_BLOCK_LENGTH from the environment variables (#10707)
* Read the value of KV_CACHE_ALLOC_BLOCK_LENGTH from the environment variables.
* Fix style
|
2024-04-10 10:48:46 +08:00 |
|
Jiao Wang
|
878a97077b
|
Fix llava example to support transformers 4.36 (#10614)
* fix llava example
* update
|
2024-04-09 13:47:07 -07:00 |
|
Yishuo Wang
|
8f45e22072
|
fix llama2 (#10710)
|
2024-04-09 17:28:37 +08:00 |
|
Yishuo Wang
|
e438f941f2
|
disable rwkv5 fp16 (#10699)
|
2024-04-09 16:42:11 +08:00 |
|