Guoqiong Song
|
d64711900a
|
Fix cohere model on transformers>=4.41 (#11575)
* fix cohere model for 4-41
|
2024-07-17 17:18:59 -07:00 |
|
Yishuo Wang
|
019da6c0ab
|
use mlp silu_mul fusion in qwen2 to optimize memory usage (#11574)
|
2024-07-13 16:32:54 +08:00 |
|
Yishuo Wang
|
a945500a98
|
fix internlm xcomposer stream chat (#11564)
|
2024-07-11 18:21:17 +08:00 |
|
binbin Deng
|
2b8ad8731e
|
Support pipeline parallel for glm-4v (#11545)
|
2024-07-11 16:06:06 +08:00 |
|
Cengguang Zhang
|
70ab1a6f1a
|
LLM: unify memory optimization env variables. (#11549)
* LLM: unify memory optimization env variables.
* fix comments.
|
2024-07-11 11:01:28 +08:00 |
|
Yishuo Wang
|
994e49a510
|
optimize internlm xcomposer performance again (#11551)
|
2024-07-10 17:08:56 +08:00 |
|
Yishuo Wang
|
82f9514303
|
optimize internlm xcomposer2 performance (#11550)
|
2024-07-10 15:57:04 +08:00 |
|
Yishuo Wang
|
99b2802d3b
|
optimize qwen2 memory (#11535)
|
2024-07-09 17:14:01 +08:00 |
|
Yishuo Wang
|
2929eb262e
|
support npu glm4 (#11539)
|
2024-07-09 15:46:49 +08:00 |
|
Yishuo Wang
|
7cb09a8eac
|
optimize qwen2 memory usage again (#11520)
|
2024-07-05 17:32:34 +08:00 |
|
Xin Qiu
|
a31f2cbe13
|
update minicpm.py (#11517)
* update minicpm
* meet code review
|
2024-07-05 15:25:44 +08:00 |
|
binbin Deng
|
60de428b37
|
Support pipeline parallel for qwen-vl (#11503)
|
2024-07-04 18:03:57 +08:00 |
|
Yishuo Wang
|
1a8bab172e
|
add minicpm 1B/2B npu support (#11507)
|
2024-07-04 16:31:04 +08:00 |
|
binbin Deng
|
9274282ef7
|
Support pipeline parallel for glm-4-9b-chat (#11463)
|
2024-07-03 14:25:28 +08:00 |
|
Yishuo Wang
|
d97c2664ce
|
use new fuse rope in stablelm family (#11497)
|
2024-07-03 11:08:26 +08:00 |
|
Yishuo Wang
|
39bcb33a67
|
add sdp support for stablelm 3b (#11473)
|
2024-07-01 14:56:15 +08:00 |
|
Yishuo Wang
|
c6e5ad668d
|
fix internlm xcomposer meta-instruction typo (#11448)
|
2024-06-27 15:29:43 +08:00 |
|
Yishuo Wang
|
2a0f8087e3
|
optimize qwen2 gpu memory usage again (#11435)
|
2024-06-26 16:52:29 +08:00 |
|
Shaojun Liu
|
ab9f7f3ac5
|
FIX: Qwen1.5-GPTQ-Int4 inference error (#11432)
* merge_qkv if quant_method is 'gptq'
* fix python style checks
* refactor
* update GPU example
|
2024-06-26 15:36:22 +08:00 |
|
binbin Deng
|
e473b8d946
|
Add more qwen1.5 and qwen2 support for pipeline parallel inference (#11423)
|
2024-06-25 15:49:32 +08:00 |
|
binbin Deng
|
aacc1fd8c0
|
Fix shape error when run qwen1.5-14b using deepspeed autotp (#11420)
|
2024-06-25 13:48:37 +08:00 |
|
Yishuo Wang
|
abe53eaa4f
|
optimize qwen1.5/2 memory usage when running long input with fp16 (#11403)
|
2024-06-24 13:43:04 +08:00 |
|
Guoqiong Song
|
7507000ef2
|
Fix 1383 Llama model on transformers=4.41[WIP] (#11280)
|
2024-06-21 11:24:10 -07:00 |
|
binbin Deng
|
4ba82191f2
|
Support PP inference for chatglm3 (#11375)
|
2024-06-21 09:59:01 +08:00 |
|
Yishuo Wang
|
f0fdfa081b
|
Optimize qwen 1.5 14B batch performance (#11370)
|
2024-06-20 17:23:39 +08:00 |
|
Guoqiong Song
|
c44b1942ed
|
fix mistral for transformers>=4.39 (#11191)
* fix mistral for transformers>=4.39
|
2024-06-18 13:39:35 -07:00 |
|
SONG Ge
|
ef4b6519fb
|
Add phi-3 model support for pipeline parallel inference (#11334)
* add phi-3 model support
* add phi3 example
|
2024-06-17 17:44:24 +08:00 |
|
Xin Qiu
|
183e0c6cf5
|
glm-4v-9b support (#11327)
* chatglm4v support
* fix style check
* update glm4v
|
2024-06-17 13:52:37 +08:00 |
|
Yishuo Wang
|
e8dd8e97ef
|
fix chatglm lookahead on ARC (#11320)
|
2024-06-14 16:26:11 +08:00 |
|
Yishuo Wang
|
91965b5d05
|
add glm_sdpa back to fix chatglm-6b (#11313)
|
2024-06-14 10:31:43 +08:00 |
|
Yishuo Wang
|
7f65836cb9
|
fix chatglm2/3-32k/128k fp16 (#11311)
|
2024-06-14 09:58:07 +08:00 |
|
Xin Qiu
|
1b0c4c8cb8
|
use new rotary two in chatglm4 (#11312)
* use new rotary two in chatglm4
* remove
|
2024-06-13 19:02:18 +08:00 |
|
Xin Qiu
|
f1410d6823
|
refactor chatglm4 (#11301)
* glm4
* remove useless code
* style
* add rope_ratio
* update
* fix fp16
* fix style
|
2024-06-13 18:06:04 +08:00 |
|
Yishuo Wang
|
5e25766855
|
fix and optimize chatglm2-32k and chatglm3-128k (#11306)
|
2024-06-13 17:37:58 +08:00 |
|
binbin Deng
|
60cb1dac7c
|
Support PP for qwen1.5 (#11300)
|
2024-06-13 17:35:24 +08:00 |
|
Yishuo Wang
|
a24666b8f3
|
fix chatglm3-6b-32k (#11303)
|
2024-06-13 16:01:34 +08:00 |
|
Yishuo Wang
|
01fe0fc1a2
|
refactor chatglm2/3 (#11290)
|
2024-06-13 12:22:58 +08:00 |
|
Xin Qiu
|
592f7aa61e
|
Refine glm1-4 sdp (#11276)
* chatglm
* update
* update
* change chatglm
* update sdpa
* update
* fix style
* fix
* fix glm
* update glm2-32k
* update glm2-32k
* fix cpu
* update
* change lower_bound
|
2024-06-12 17:11:56 +08:00 |
|
Yishuo Wang
|
10e480ee96
|
refactor internlm and internlm2 (#11274)
|
2024-06-11 14:19:19 +08:00 |
|
Yishuo Wang
|
42fab480ea
|
support stablelm2 12b (#11265)
|
2024-06-07 15:46:00 +08:00 |
|
Xin Qiu
|
dbc3c2d72d
|
glm4 sdp (#11253)
* glm4 sdp
* fix style
* update comment
|
2024-06-07 15:42:23 +08:00 |
|
Xin Qiu
|
151fcf37bb
|
check device name in use_flash_attention (#11263)
|
2024-06-07 15:07:47 +08:00 |
|
Yishuo Wang
|
2623944604
|
qwen2 sdpa small fix (#11261)
|
2024-06-07 14:42:18 +08:00 |
|
Yishuo Wang
|
ea0d03fd28
|
Refactor baichuan1 7B and 13B (#11258)
|
2024-06-07 14:29:20 +08:00 |
|
Yishuo Wang
|
ef8e9b2ecd
|
Refactor qwen2 moe (#11244)
|
2024-06-07 13:14:54 +08:00 |
|
Xin Qiu
|
2f809116e2
|
optimize Chatglm4 (#11239)
* chatglm4
* update
* update
* add rms norm
* chatglm4
|
2024-06-06 18:25:20 +08:00 |
|
Yishuo Wang
|
2e4ccd541c
|
fix qwen2 cpu (#11240)
|
2024-06-06 16:24:19 +08:00 |
|
Yishuo Wang
|
e738ec38f4
|
disable quantize kv in specific qwen model (#11238)
|
2024-06-06 14:08:39 +08:00 |
|
Yishuo Wang
|
c4e5806e01
|
add latest optimization in starcoder2 (#11236)
|
2024-06-06 14:02:17 +08:00 |
|
Yishuo Wang
|
ba27e750b1
|
refactor yuan2 (#11235)
|
2024-06-06 13:17:54 +08:00 |
|