Commit graph

154 commits

Author SHA1 Message Date
Guoqiong Song
d64711900a
Fix cohere model on transformers>=4.41 (#11575)
* fix cohere model for 4-41
2024-07-17 17:18:59 -07:00
Yishuo Wang
019da6c0ab
use mlp silu_mul fusion in qwen2 to optimize memory usage (#11574) 2024-07-13 16:32:54 +08:00
Yishuo Wang
a945500a98
fix internlm xcomposer stream chat (#11564) 2024-07-11 18:21:17 +08:00
binbin Deng
2b8ad8731e
Support pipeline parallel for glm-4v (#11545) 2024-07-11 16:06:06 +08:00
Cengguang Zhang
70ab1a6f1a
LLM: unify memory optimization env variables. (#11549)
* LLM: unify memory optimization env variables.

* fix comments.
2024-07-11 11:01:28 +08:00
Yishuo Wang
994e49a510
optimize internlm xcomposer performance again (#11551) 2024-07-10 17:08:56 +08:00
Yishuo Wang
82f9514303
optimize internlm xcomposer2 performance (#11550) 2024-07-10 15:57:04 +08:00
Yishuo Wang
99b2802d3b
optimize qwen2 memory (#11535) 2024-07-09 17:14:01 +08:00
Yishuo Wang
2929eb262e
support npu glm4 (#11539) 2024-07-09 15:46:49 +08:00
Yishuo Wang
7cb09a8eac
optimize qwen2 memory usage again (#11520) 2024-07-05 17:32:34 +08:00
Xin Qiu
a31f2cbe13
update minicpm.py (#11517)
* update minicpm

* meet code review
2024-07-05 15:25:44 +08:00
binbin Deng
60de428b37
Support pipeline parallel for qwen-vl (#11503) 2024-07-04 18:03:57 +08:00
Yishuo Wang
1a8bab172e
add minicpm 1B/2B npu support (#11507) 2024-07-04 16:31:04 +08:00
binbin Deng
9274282ef7
Support pipeline parallel for glm-4-9b-chat (#11463) 2024-07-03 14:25:28 +08:00
Yishuo Wang
d97c2664ce
use new fuse rope in stablelm family (#11497) 2024-07-03 11:08:26 +08:00
Yishuo Wang
39bcb33a67
add sdp support for stablelm 3b (#11473) 2024-07-01 14:56:15 +08:00
Yishuo Wang
c6e5ad668d
fix internlm xcomposer meta-instruction typo (#11448) 2024-06-27 15:29:43 +08:00
Yishuo Wang
2a0f8087e3
optimize qwen2 gpu memory usage again (#11435) 2024-06-26 16:52:29 +08:00
Shaojun Liu
ab9f7f3ac5
FIX: Qwen1.5-GPTQ-Int4 inference error (#11432)
* merge_qkv if quant_method is 'gptq'

* fix python style checks

* refactor

* update GPU example
2024-06-26 15:36:22 +08:00
binbin Deng
e473b8d946
Add more qwen1.5 and qwen2 support for pipeline parallel inference (#11423) 2024-06-25 15:49:32 +08:00
binbin Deng
aacc1fd8c0
Fix shape error when run qwen1.5-14b using deepspeed autotp (#11420) 2024-06-25 13:48:37 +08:00
Yishuo Wang
abe53eaa4f
optimize qwen1.5/2 memory usage when running long input with fp16 (#11403) 2024-06-24 13:43:04 +08:00
Guoqiong Song
7507000ef2
Fix 1383 Llama model on transformers=4.41[WIP] (#11280) 2024-06-21 11:24:10 -07:00
binbin Deng
4ba82191f2
Support PP inference for chatglm3 (#11375) 2024-06-21 09:59:01 +08:00
Yishuo Wang
f0fdfa081b
Optimize qwen 1.5 14B batch performance (#11370) 2024-06-20 17:23:39 +08:00
Guoqiong Song
c44b1942ed
fix mistral for transformers>=4.39 (#11191)
* fix mistral for transformers>=4.39
2024-06-18 13:39:35 -07:00
SONG Ge
ef4b6519fb
Add phi-3 model support for pipeline parallel inference (#11334)
* add phi-3 model support

* add phi3 example
2024-06-17 17:44:24 +08:00
Xin Qiu
183e0c6cf5
glm-4v-9b support (#11327)
* chatglm4v support

* fix style check

* update glm4v
2024-06-17 13:52:37 +08:00
Yishuo Wang
e8dd8e97ef
fix chatglm lookahead on ARC (#11320) 2024-06-14 16:26:11 +08:00
Yishuo Wang
91965b5d05
add glm_sdpa back to fix chatglm-6b (#11313) 2024-06-14 10:31:43 +08:00
Yishuo Wang
7f65836cb9
fix chatglm2/3-32k/128k fp16 (#11311) 2024-06-14 09:58:07 +08:00
Xin Qiu
1b0c4c8cb8
use new rotary two in chatglm4 (#11312)
* use new rotary two in chatglm4

* remove
2024-06-13 19:02:18 +08:00
Xin Qiu
f1410d6823
refactor chatglm4 (#11301)
* glm4

* remove useless code

* style

* add rope_ratio

* update

* fix fp16

* fix style
2024-06-13 18:06:04 +08:00
Yishuo Wang
5e25766855
fix and optimize chatglm2-32k and chatglm3-128k (#11306) 2024-06-13 17:37:58 +08:00
binbin Deng
60cb1dac7c
Support PP for qwen1.5 (#11300) 2024-06-13 17:35:24 +08:00
Yishuo Wang
a24666b8f3
fix chatglm3-6b-32k (#11303) 2024-06-13 16:01:34 +08:00
Yishuo Wang
01fe0fc1a2
refactor chatglm2/3 (#11290) 2024-06-13 12:22:58 +08:00
Xin Qiu
592f7aa61e
Refine glm1-4 sdp (#11276)
* chatglm

* update

* update

* change chatglm

* update sdpa

* update

* fix style

* fix

* fix glm

* update glm2-32k

* update glm2-32k

* fix cpu

* update

* change lower_bound
2024-06-12 17:11:56 +08:00
Yishuo Wang
10e480ee96
refactor internlm and internlm2 (#11274) 2024-06-11 14:19:19 +08:00
Yishuo Wang
42fab480ea
support stablelm2 12b (#11265) 2024-06-07 15:46:00 +08:00
Xin Qiu
dbc3c2d72d
glm4 sdp (#11253)
* glm4 sdp

* fix style

* update comment
2024-06-07 15:42:23 +08:00
Xin Qiu
151fcf37bb
check device name in use_flash_attention (#11263) 2024-06-07 15:07:47 +08:00
Yishuo Wang
2623944604
qwen2 sdpa small fix (#11261) 2024-06-07 14:42:18 +08:00
Yishuo Wang
ea0d03fd28
Refactor baichuan1 7B and 13B (#11258) 2024-06-07 14:29:20 +08:00
Yishuo Wang
ef8e9b2ecd
Refactor qwen2 moe (#11244) 2024-06-07 13:14:54 +08:00
Xin Qiu
2f809116e2
optimize Chatglm4 (#11239)
* chatglm4

* update

* update

* add rms norm

* chatglm4
2024-06-06 18:25:20 +08:00
Yishuo Wang
2e4ccd541c
fix qwen2 cpu (#11240) 2024-06-06 16:24:19 +08:00
Yishuo Wang
e738ec38f4
disable quantize kv in specific qwen model (#11238) 2024-06-06 14:08:39 +08:00
Yishuo Wang
c4e5806e01
add latest optimization in starcoder2 (#11236) 2024-06-06 14:02:17 +08:00
Yishuo Wang
ba27e750b1
refactor yuan2 (#11235) 2024-06-06 13:17:54 +08:00