Commit graph

165 commits

Author SHA1 Message Date
Yina Chen
670ad887fc
Qwen support compress kv (#11680)
* Qwen support compress kv

* fix style

* fix
2024-07-30 11:16:42 +08:00
hxsz1997
9b36877897
disable default quantize_kv of GQA on MTL (#11679)
* disable default quantizekv of gqa in mtl

* fix stype

* fix stype

* fix stype

* fix stype

* fix stype

* fix stype
2024-07-30 09:38:46 +08:00
Yishuo Wang
c02003925b
add mlp for gemma2 (#11678)
2024-07-29 16:10:23 +08:00
Yishuo Wang
6f999e6e90
add sdp for gemma2 (#11677)
2024-07-29 15:15:47 +08:00
Ruonan Wang
c11d5301d7
add sdp fp8 for llama (#11671)
* add sdp fp8 for llama

* fix style

* refactor
2024-07-29 13:46:22 +08:00
Yishuo Wang
7f88ce23cd
add more gemma2 optimization (#11673)
2024-07-29 11:13:00 +08:00
Yishuo Wang
3e8819734b
add basic gemma2 optimization (#11672)
2024-07-29 10:46:51 +08:00
Yina Chen
fc7f8feb83
Support compress kv (#11642)
* mistral snapkv

* update

* mtl update

* update

* update

* update

* add comments

* style fix

* fix style

* support llama

* llama use compress kv

* support mistral 4.40

* fix style

* support diff transformers versions

* move snapkv util to kv

* fix style

* meet comments & small fix

* revert all in one

* fix indent

---------

Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2024-07-26 16:02:00 +08:00
Yishuo Wang
6bcdc6cc8f
fix qwen2 cpu (#11663)
2024-07-26 13:41:51 +08:00
Guoqiong Song
380717f50d
fix gemma for 4.41 (#11531)
* fix gemma for 4.41
2024-07-18 15:02:50 -07:00
Guoqiong Song
5a6211fd56
fix minicpm for transformers>=4.39 (#11533)
* fix minicpm for transformers>=4.39
2024-07-18 15:01:57 -07:00
Guoqiong Song
d64711900a
Fix cohere model on transformers>=4.41 (#11575)
* fix cohere model for 4-41
2024-07-17 17:18:59 -07:00
Yishuo Wang
019da6c0ab
use mlp silu_mul fusion in qwen2 to optimize memory usage (#11574)
2024-07-13 16:32:54 +08:00
Yishuo Wang
a945500a98
fix internlm xcomposser stream chat (#11564)
2024-07-11 18:21:17 +08:00
binbin Deng
2b8ad8731e
Support pipeline parallel for glm-4v (#11545)
2024-07-11 16:06:06 +08:00
Cengguang Zhang
70ab1a6f1a
LLM: unify memory optimization env variables. (#11549)
* LLM: unify memory optimization env variables.

* fix comments.
2024-07-11 11:01:28 +08:00
Yishuo Wang
994e49a510
optimize internlm xcomposser performance again (#11551)
2024-07-10 17:08:56 +08:00
Yishuo Wang
82f9514303
optimize internlm xcomposer2 performance (#11550)
2024-07-10 15:57:04 +08:00
Yishuo Wang
99b2802d3b
optimize qewn2 memory (#11535)
2024-07-09 17:14:01 +08:00
Yishuo Wang
2929eb262e
support npu glm4 (#11539)
2024-07-09 15:46:49 +08:00
Yishuo Wang
7cb09a8eac
optimize qwen2 memory usage again (#11520)
2024-07-05 17:32:34 +08:00
Xin Qiu
a31f2cbe13
update minicpm.py (#11517)
* update minicpm

* meet code review
2024-07-05 15:25:44 +08:00
binbin Deng
60de428b37
Support pipeline parallel for qwen-vl (#11503)
2024-07-04 18:03:57 +08:00
Yishuo Wang
1a8bab172e
add minicpm 1B/2B npu support (#11507)
2024-07-04 16:31:04 +08:00
binbin Deng
9274282ef7
Support pipeline parallel for glm-4-9b-chat (#11463)
2024-07-03 14:25:28 +08:00
Yishuo Wang
d97c2664ce
use new fuse rope in stablelm family (#11497)
2024-07-03 11:08:26 +08:00
Yishuo Wang
39bcb33a67
add sdp support for stablelm 3b (#11473)
2024-07-01 14:56:15 +08:00
Yishuo Wang
c6e5ad668d
fix internlm xcomposser meta-instruction typo (#11448)
2024-06-27 15:29:43 +08:00
Yishuo Wang
2a0f8087e3
optimize qwen2 gpu memory usage again (#11435)
2024-06-26 16:52:29 +08:00
Shaojun Liu
ab9f7f3ac5
FIX: Qwen1.5-GPTQ-Int4 inference error (#11432)
* merge_qkv if quant_method is 'gptq'

* fix python style checks

* refactor

* update GPU example
2024-06-26 15:36:22 +08:00
binbin Deng
e473b8d946
Add more qwen1.5 and qwen2 support for pipeline parallel inference (#11423)
2024-06-25 15:49:32 +08:00
binbin Deng
aacc1fd8c0
Fix shape error when run qwen1.5-14b using deepspeed autotp (#11420)
2024-06-25 13:48:37 +08:00
Yishuo Wang
abe53eaa4f
optimize qwen1.5/2 memory usage when running long input with fp16 (#11403)
2024-06-24 13:43:04 +08:00
Guoqiong Song
7507000ef2
Fix 1383 Llama model on transformers=4.41[WIP] (#11280)
2024-06-21 11:24:10 -07:00
binbin Deng
4ba82191f2
Support PP inference for chatglm3 (#11375)
2024-06-21 09:59:01 +08:00
Yishuo Wang
f0fdfa081b
Optimize qwen 1.5 14B batch performance (#11370)
2024-06-20 17:23:39 +08:00
Guoqiong Song
c44b1942ed
fix mistral for transformers>=4.39 (#11191)
* fix mistral for transformers>=4.39
2024-06-18 13:39:35 -07:00
SONG Ge
ef4b6519fb
Add phi-3 model support for pipeline parallel inference (#11334)
* add phi-3 model support

* add phi3 example
2024-06-17 17:44:24 +08:00
Xin Qiu
183e0c6cf5
glm-4v-9b support (#11327)
* chatglm4v support

* fix style check

* update glm4v
2024-06-17 13:52:37 +08:00
Yishuo Wang
e8dd8e97ef
fix chatglm lookahead on ARC (#11320)
2024-06-14 16:26:11 +08:00
Yishuo Wang
91965b5d05
add glm_sdpa back to fix chatglm-6b (#11313)
2024-06-14 10:31:43 +08:00
Yishuo Wang
7f65836cb9
fix chatglm2/3-32k/128k fp16 (#11311)
2024-06-14 09:58:07 +08:00
Xin Qiu
1b0c4c8cb8
use new rotary two in chatglm4 (#11312)
* use new rotary two in chatglm4

* rempve
2024-06-13 19:02:18 +08:00
Xin Qiu
f1410d6823
refactor chatglm4 (#11301)
* glm4

* remove useless code

* stype

* add rope_ratio

* update

* fix fp16

* fix style
2024-06-13 18:06:04 +08:00
Yishuo Wang
5e25766855
fix and optimize chatglm2-32k and chatglm3-128k (#11306)
2024-06-13 17:37:58 +08:00
binbin Deng
60cb1dac7c
Support PP for qwen1.5 (#11300)
2024-06-13 17:35:24 +08:00
Yishuo Wang
a24666b8f3
fix chatglm3-6b-32k (#11303)
2024-06-13 16:01:34 +08:00
Yishuo Wang
01fe0fc1a2
refactor chatglm2/3 (#11290)
2024-06-13 12:22:58 +08:00
Xin Qiu
592f7aa61e
Refine glm1-4 sdp (#11276)
* chatglm

* update

* update

* change chatglm

* update sdpa

* update

* fix style

* fix

* fix glm

* update glm2-32k

* update glm2-32k

* fix cpu

* update

* change lower_bound
2024-06-12 17:11:56 +08:00
Yishuo Wang
10e480ee96
refactor internlm and internlm2 (#11274)
2024-06-11 14:19:19 +08:00