Commit graph

272 commits

Author SHA1 Message Date
binbin Deng
6ea1e71af0
Update PP inference benchmark script (#11323) 2024-06-17 09:59:36 +08:00
SONG Ge
be00380f1a
Fix pipeline parallel inference past_key_value error in Baichuan (#11318)
* fix past_key_value error

* add baichuan2 example

* fix style

* update doc

* add script link in doc

* fix import error

* update
2024-06-17 09:29:32 +08:00
Yina Chen
0af0102e61
Add quantization scale search switch (#11326)
* add scale_search switch

* remove llama3 instruct

* remove print
2024-06-14 18:46:52 +08:00
Ruonan Wang
8a3247ac71
support batch forward for q4_k, q6_k (#11325) 2024-06-14 18:25:50 +08:00
Yishuo Wang
e8dd8e97ef
fix chatglm lookahead on ARC (#11320) 2024-06-14 16:26:11 +08:00
Yishuo Wang
91965b5d05
add glm_sdpa back to fix chatglm-6b (#11313) 2024-06-14 10:31:43 +08:00
Yishuo Wang
7f65836cb9
fix chatglm2/3-32k/128k fp16 (#11311) 2024-06-14 09:58:07 +08:00
Xin Qiu
1b0c4c8cb8
use new rotary two in chatglm4 (#11312)
* use new rotary two in chatglm4

* rempve
2024-06-13 19:02:18 +08:00
Xin Qiu
f1410d6823
refactor chatglm4 (#11301)
* glm4

* remove useless code

* stype

* add rope_ratio

* update

* fix fp16

* fix style
2024-06-13 18:06:04 +08:00
Yishuo Wang
5e25766855
fix and optimize chatglm2-32k and chatglm3-128k (#11306) 2024-06-13 17:37:58 +08:00
binbin Deng
60cb1dac7c
Support PP for qwen1.5 (#11300) 2024-06-13 17:35:24 +08:00
Yishuo Wang
a24666b8f3
fix chatglm3-6b-32k (#11303) 2024-06-13 16:01:34 +08:00
Yishuo Wang
01fe0fc1a2
refactor chatglm2/3 (#11290) 2024-06-13 12:22:58 +08:00
Guancheng Fu
57a023aadc
Fix vllm tp (#11297) 2024-06-13 10:47:48 +08:00
binbin Deng
220151e2a1
Refactor pipeline parallel multi-stage implementation (#11286) 2024-06-13 10:00:23 +08:00
Ruonan Wang
14b1e6b699
Fix gguf_q4k (#11293)
* udpate embedding parameter

* update benchmark
2024-06-12 20:43:08 +08:00
Yuwen Hu
8edcdeb0e7
Fix bug that torch.ops.torch_ipex.matmul_bias_out cannot work on Linux MTL for short input (#11292) 2024-06-12 19:12:57 +08:00
Xin Qiu
592f7aa61e
Refine glm1-4 sdp (#11276)
* chatglm

* update

* update

* change chatglm

* update sdpa

* update

* fix style

* fix

* fix glm

* update glm2-32k

* update glm2-32k

* fix cpu

* update

* change lower_bound
2024-06-12 17:11:56 +08:00
Yuwen Hu
cffb932f05
Expose timeout for streamer for fastchat worker (#11288)
* Expose timeout for stremer for fastchat worker

* Change to read from env variables
2024-06-12 17:02:40 +08:00
Qiyuan Gong
0d9cc9c106
Remove duplicate check for ipex (#11281)
* Replacing builtin.import is causing lots of unpredicted problems. Remove this function.
2024-06-12 13:52:02 +08:00
Yishuo Wang
10e480ee96
refactor internlm and internlm2 (#11274) 2024-06-11 14:19:19 +08:00
Xiangyu Tian
4b07712fd8
LLM: Fix vLLM CPU model convert mismatch (#11254)
Fix vLLM CPU model convert mismatch.
2024-06-07 15:54:34 +08:00
Yishuo Wang
42fab480ea
support stablm2 12b (#11265) 2024-06-07 15:46:00 +08:00
Xin Qiu
dbc3c2d72d
glm4 sdp (#11253)
* glm4 sdp

* fix style

* update comment
2024-06-07 15:42:23 +08:00
Xin Qiu
151fcf37bb
check devie name in use_flash_attention (#11263) 2024-06-07 15:07:47 +08:00
Yishuo Wang
2623944604
qwen2 sdpa small fix (#11261) 2024-06-07 14:42:18 +08:00
Yishuo Wang
ea0d03fd28
Refactor baichuan1 7B and 13B (#11258) 2024-06-07 14:29:20 +08:00
Qiyuan Gong
1aa9c9597a
Avoid duplicate import in IPEX auto importer (#11227)
* Add custom import to avoid ipex duplicate importing
* Add scope limitation
2024-06-07 14:08:00 +08:00
Yishuo Wang
ef8e9b2ecd
Refactor qwen2 moe (#11244) 2024-06-07 13:14:54 +08:00
Zhao Changmin
b7948671de
[WIP] Add look up table in 1st token stage (#11193)
* lookuptb
2024-06-07 10:51:05 +08:00
Xin Qiu
2f809116e2
optimize Chatglm4 (#11239)
* chatglm4

* update

* update

* add rms norm

* chatglm4
2024-06-06 18:25:20 +08:00
Yishuo Wang
2e4ccd541c
fix qwen2 cpu (#11240) 2024-06-06 16:24:19 +08:00
Yishuo Wang
e738ec38f4
disable quantize kv in specific qwen model (#11238) 2024-06-06 14:08:39 +08:00
Yishuo Wang
c4e5806e01
add latest optimization in starcoder2 (#11236) 2024-06-06 14:02:17 +08:00
Yishuo Wang
ba27e750b1
refactor yuan2 (#11235) 2024-06-06 13:17:54 +08:00
Guoqiong Song
f6d5c6af78
fix issue 1407 (#11171) 2024-06-05 13:35:57 -07:00
Yina Chen
ed67435491
Support Fp6 k in ipex-llm (#11222)
* support fp6_k

* support fp6_k

* remove

* fix style
2024-06-05 17:34:36 +08:00
binbin Deng
a6674f5bce
Fix should_use_fuse_rope error of Qwen1.5-MoE-A2.7B-Chat (#11216) 2024-06-05 15:56:10 +08:00
Xin Qiu
566691c5a3
quantized attention forward for minicpm (#11200)
* quantized minicpm

* fix style check
2024-06-05 09:15:25 +08:00
Jiao Wang
bb83bc23fd
Fix Starcoder issue on CPU on transformers 4.36+ (#11190)
* fix starcoder for sdpa

* update

* style
2024-06-04 10:05:40 -07:00
Xiangyu Tian
ac3d53ff5d
LLM: Fix vLLM CPU version error (#11206)
Fix vLLM CPU version error
2024-06-04 19:10:23 +08:00
Ruonan Wang
1dde204775
update q6k (#11205) 2024-06-04 17:14:33 +08:00
Qiyuan Gong
ce3f08b25a
Fix IPEX auto importer (#11192)
* Fix ipex auto importer with Python builtins.
* Raise errors if the user imports ipex manually before importing ipex_llm. Do nothing if they import ipex after importing ipex_llm.
* Remove import ipex in examples.
2024-06-04 16:57:18 +08:00
Yishuo Wang
6454655dcc
use sdp in baichuan2 13b (#11198) 2024-06-04 15:39:00 +08:00
Yishuo Wang
d90cd977d0
refactor stablelm (#11195) 2024-06-04 13:14:43 +08:00
Xin Qiu
5f13700c9f
optimize Minicpm (#11189)
* minicpm optimize

* update
2024-06-03 18:28:29 +08:00
Shaojun Liu
401013a630
Remove chatglm_C Module to Eliminate LGPL Dependency (#11178)
* remove chatglm_C.**.pyd to solve ngsolve weak copyright vunl

* fix style check error

* remove chatglm native int4 from langchain
2024-05-31 17:03:11 +08:00
Ruonan Wang
50b5f4476f
update q4k convert (#11179) 2024-05-31 11:36:53 +08:00
ZehuaCao
4127b99ed6
Fix null pointer dereferences error. (#11125)
* delete unused function on tgi_server

* update

* update

* fix style
2024-05-30 16:16:10 +08:00
Guancheng Fu
50ee004ac7
Fix vllm condition (#11169)
* add use-vllm

* done

* fix style

* fix done
2024-05-30 15:23:17 +08:00
Ruonan Wang
9bfbf78bf4
update api usage of xe_batch & fp16 (#11164)
* update api usage

* update setup.py
2024-05-29 15:15:14 +08:00
Yina Chen
e29e2f1c78
Support new fp8 e4m3 (#11158) 2024-05-29 14:27:14 +08:00
Yishuo Wang
bc5008f0d5
disable sdp_causal in phi-3 to fix overflow (#11157) 2024-05-28 17:25:53 +08:00
SONG Ge
33852bd23e
Refactor pipeline parallel device config (#11149)
* refactor pipeline parallel device config

* meet comments

* update example

* add warnings and update code doc
2024-05-28 16:52:46 +08:00
Yishuo Wang
d307622797
fix first token sdp with batch (#11153) 2024-05-28 15:03:06 +08:00
Yina Chen
3464440839
fix qwen import error (#11154) 2024-05-28 14:50:12 +08:00
Yina Chen
b6b70d1ba0
Divide core-xe packages (#11131)
* temp

* add batch

* fix style

* update package name

* fix style

* add workflow

* use temp version to run uts

* trigger performance test

* trigger win igpu perf

* revert workflow & setup
2024-05-28 12:00:18 +08:00
binbin Deng
c9168b85b7
Fix error during merging adapter (#11145) 2024-05-27 19:41:42 +08:00
Guancheng Fu
daf7b1cd56
[Docker] Fix image using two cards error (#11144)
* fix all

* done
2024-05-27 16:20:13 +08:00
binbin Deng
367de141f2
Fix mixtral-8x7b with transformers=4.37.0 (#11132) 2024-05-27 09:50:54 +08:00
Guancheng Fu
fabc395d0d
add langchain vllm interface (#11121)
* done

* fix

* fix

* add vllm

* add langchain vllm exampels

* add docs

* temp
2024-05-24 17:19:27 +08:00
ZehuaCao
63e95698eb
[LLM]Reopen autotp generate_stream (#11120)
* reopen autotp generate_stream

* fix style error

* update
2024-05-24 17:16:14 +08:00
Yishuo Wang
1dc680341b
fix phi-3-vision import (#11129) 2024-05-24 15:57:15 +08:00
Guancheng Fu
7f772c5a4f
Add half precision for fastchat models (#11130) 2024-05-24 15:41:14 +08:00
Zhao Changmin
65f4212f89
Fix qwen 14b run into register attention fwd (#11128)
* fix qwen 14b
2024-05-24 14:45:07 +08:00
Yishuo Wang
1db9d9a63b
optimize internlm2 xcomposer agin (#11124) 2024-05-24 13:44:52 +08:00
Yishuo Wang
9372ce87ce
fix internlm xcomposer2 fp16 (#11123) 2024-05-24 11:03:31 +08:00
Cengguang Zhang
011b9faa5c
LLM: unify baichuan2-13b alibi mask dtype with model dtype. (#11107)
* LLM: unify alibi mask dtype.

* fix comments.
2024-05-24 10:27:53 +08:00
Jiao Wang
0a06a6e1d4
Update tests for transformers 4.36 (#10858)
* update unit test

* update

* update

* update

* update

* update

* fix gpu attention test

* update

* update

* update

* update

* update

* update

* update example test

* replace replit code

* update

* update

* update

* update

* set safe_serialization false

* perf test

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* delete

* update

* update

* update

* update

* update

* update

* revert

* update
2024-05-24 10:26:38 +08:00
Xiangyu Tian
b3f6faa038
LLM: Add CPU vLLM entrypoint (#11083)
Add CPU vLLM entrypoint and update CPU vLLM serving example.
2024-05-24 09:16:59 +08:00
Yishuo Wang
797dbc48b8
fix phi-2 and phi-3 convert (#11116) 2024-05-23 17:37:37 +08:00
Yishuo Wang
37b98a531f
support running internlm xcomposer2 on gpu and add sdp optimization (#11115) 2024-05-23 17:26:24 +08:00
Zhao Changmin
c5e8b90c8d
Add Qwen register attention implemention (#11110)
* qwen_register
2024-05-23 17:17:45 +08:00
Yishuo Wang
0e53f20edb
support running internlm-xcomposer2 on cpu (#11111) 2024-05-23 16:36:09 +08:00
Yishuo Wang
cd4dff09ee
support phi-3 vision (#11101) 2024-05-22 17:43:50 +08:00
Xin Qiu
71bcd18f44
fix qwen vl (#11090) 2024-05-21 18:40:29 +08:00
Yishuo Wang
f00625f9a4
refactor qwen2 (#11087) 2024-05-21 16:53:42 +08:00
Yishuo Wang
d830a63bb7
refactor qwen (#11074) 2024-05-20 18:08:37 +08:00
Wang, Jian4
74950a152a
Fix tgi_api_server error file name (#11075) 2024-05-20 16:48:40 +08:00
Yishuo Wang
4e97047d70
fix baichuan2 13b fp16 (#11071) 2024-05-20 11:21:20 +08:00
Wang, Jian4
a2e1578fd9
Merge tgi_api_server to main (#11036)
* init

* fix style

* speculative can not use benchmark

* add tgi server readme
2024-05-20 09:15:03 +08:00
Yishuo Wang
31ce3e0c13
refactor baichuan2-13b (#11064) 2024-05-17 16:25:30 +08:00
Ruonan Wang
f1156e6b20
support gguf_q4k_m / gguf_q4k_s (#10887)
* initial commit

* UPDATE

* fix style

* fix style

* add gguf_q4k_s

* update comment

* fix
2024-05-17 14:30:09 +08:00
Yishuo Wang
981d668be6
refactor baichuan2-7b (#11062) 2024-05-17 13:01:34 +08:00
Ruonan Wang
3a72e5df8c
disable mlp fusion of fp6 on mtl (#11059) 2024-05-17 10:10:16 +08:00
SONG Ge
192ae35012
Add support for llama2 quantize_kv with transformers 4.38.0 (#11054)
* add support for llama2 quantize_kv with transformers 4.38.0

* fix code style

* fix code style
2024-05-16 22:23:39 +08:00
SONG Ge
16b2a418be
hotfix native_sdp ut (#11046)
* hotfix native_sdp

* update
2024-05-16 17:15:37 +08:00
Xin Qiu
6be70283b7
fix chatglm run error (#11045)
* fix chatglm

* update

* fix style
2024-05-16 15:39:18 +08:00
Yishuo Wang
8cae897643
use new rope in phi3 (#11047) 2024-05-16 15:12:35 +08:00
Yishuo Wang
59df750326
Use new sdp again (#11025) 2024-05-16 09:33:34 +08:00
SONG Ge
9942a4ba69
[WIP] Support llama2 with transformers==4.38.0 (#11024)
* support llama2 with transformers==4.38.0

* add supprot for quantize_qkv

* add original support for 4.38.0 now

* code style fix
2024-05-15 18:07:00 +08:00
Yina Chen
686f6038a8
Support fp6 save & load (#11034) 2024-05-15 17:52:02 +08:00
Ruonan Wang
ac384e0f45
add fp6 mlp fusion (#11032)
* add fp6 fusion

* add qkv fusion for fp6

* remove qkv first
2024-05-15 17:42:50 +08:00
Wang, Jian4
2084ebe4ee
Enable fastchat benchmark latency (#11017)
* enable fastchat benchmark

* add readme

* update readme

* update
2024-05-15 14:52:09 +08:00
hxsz1997
93d40ab127
Update lookahead strategy (#11021)
* update lookahead strategy

* remove lines

* fix python style check
2024-05-15 14:48:05 +08:00
Wang, Jian4
d9f71f1f53
Update benchmark util for example using (#11027)
* mv benchmark_util.py to utils/

* remove

* update
2024-05-15 14:16:35 +08:00
Yishuo Wang
fad1dbaf60
use sdp fp8 causal kernel (#11023) 2024-05-15 10:22:35 +08:00
Yishuo Wang
ee325e9cc9
fix phi3 (#11022) 2024-05-15 09:32:12 +08:00
Zhao Changmin
0a732bebe7
Add phi3 cached RotaryEmbedding (#11013)
* phi3cachedrotaryembed

* pep8
2024-05-15 08:16:43 +08:00
Yina Chen
893197434d
Add fp6 support on gpu (#11008)
* add fp6 support

* fix style
2024-05-14 16:31:44 +08:00