Author | Commit | Subject | Date

Yishuo Wang | ae7b662ed2 | add fp16 NPU Linear support and fix intel_npu_acceleration_library version 1.0 support (#11352) | 2024-06-19 09:14:59 +08:00
Guoqiong Song | c44b1942ed | fix mistral for transformers>=4.39 (#11191) | 2024-06-18 13:39:35 -07:00
Yishuo Wang | 83082e5cc7 | add initial support for intel npu acceleration library (#11347) | 2024-06-18 16:07:16 +08:00
Yina Chen | 5dad33e5af | Support fp8_e4m3 scale search (#11339) | 2024-06-18 11:47:43 +08:00
  * fp8e4m3 switch off
  * fix style
binbin Deng | e50c890e1f | Support finishing PP inference once eos_token_id is found (#11336) | 2024-06-18 09:55:40 +08:00
SONG Ge | ef4b6519fb | Add phi-3 model support for pipeline parallel inference (#11334) | 2024-06-17 17:44:24 +08:00
  * add phi-3 model support
  * add phi3 example
Xin Qiu | 183e0c6cf5 | glm-4v-9b support (#11327) | 2024-06-17 13:52:37 +08:00
  * chatglm4v support
  * fix style check
  * update glm4v
binbin Deng | 6ea1e71af0 | Update PP inference benchmark script (#11323) | 2024-06-17 09:59:36 +08:00
SONG Ge | be00380f1a | Fix pipeline parallel inference past_key_value error in Baichuan (#11318) | 2024-06-17 09:29:32 +08:00
  * fix past_key_value error
  * add baichuan2 example
  * fix style
  * update doc
  * add script link in doc
  * fix import error
  * update
Yina Chen | 0af0102e61 | Add quantization scale search switch (#11326) | 2024-06-14 18:46:52 +08:00
  * add scale_search switch
  * remove llama3 instruct
  * remove print
Ruonan Wang | 8a3247ac71 | support batch forward for q4_k, q6_k (#11325) | 2024-06-14 18:25:50 +08:00
Yishuo Wang | e8dd8e97ef | fix chatglm lookahead on ARC (#11320) | 2024-06-14 16:26:11 +08:00
Yishuo Wang | 91965b5d05 | add glm_sdpa back to fix chatglm-6b (#11313) | 2024-06-14 10:31:43 +08:00
Yishuo Wang | 7f65836cb9 | fix chatglm2/3-32k/128k fp16 (#11311) | 2024-06-14 09:58:07 +08:00
Xin Qiu | 1b0c4c8cb8 | use new rotary two in chatglm4 (#11312) | 2024-06-13 19:02:18 +08:00
  * remove
Xin Qiu | f1410d6823 | refactor chatglm4 (#11301) | 2024-06-13 18:06:04 +08:00
  * glm4
  * remove useless code
  * style
  * add rope_ratio
  * update
  * fix fp16
  * fix style
Yishuo Wang | 5e25766855 | fix and optimize chatglm2-32k and chatglm3-128k (#11306) | 2024-06-13 17:37:58 +08:00
binbin Deng | 60cb1dac7c | Support PP for qwen1.5 (#11300) | 2024-06-13 17:35:24 +08:00
Yishuo Wang | a24666b8f3 | fix chatglm3-6b-32k (#11303) | 2024-06-13 16:01:34 +08:00
Yishuo Wang | 01fe0fc1a2 | refactor chatglm2/3 (#11290) | 2024-06-13 12:22:58 +08:00
Guancheng Fu | 57a023aadc | Fix vllm tp (#11297) | 2024-06-13 10:47:48 +08:00
binbin Deng | 220151e2a1 | Refactor pipeline parallel multi-stage implementation (#11286) | 2024-06-13 10:00:23 +08:00
Ruonan Wang | 14b1e6b699 | Fix gguf_q4k (#11293) | 2024-06-12 20:43:08 +08:00
  * update embedding parameter
  * update benchmark
Yuwen Hu | 8edcdeb0e7 | Fix bug that torch.ops.torch_ipex.matmul_bias_out cannot work on Linux MTL for short input (#11292) | 2024-06-12 19:12:57 +08:00
Xin Qiu | 592f7aa61e | Refine glm1-4 sdp (#11276) | 2024-06-12 17:11:56 +08:00
  * chatglm
  * update
  * update
  * change chatglm
  * update sdpa
  * update
  * fix style
  * fix
  * fix glm
  * update glm2-32k
  * update glm2-32k
  * fix cpu
  * update
  * change lower_bound
Yishuo Wang | 10e480ee96 | refactor internlm and internlm2 (#11274) | 2024-06-11 14:19:19 +08:00
Yishuo Wang | 42fab480ea | support stablelm2 12b (#11265) | 2024-06-07 15:46:00 +08:00
Xin Qiu | dbc3c2d72d | glm4 sdp (#11253) | 2024-06-07 15:42:23 +08:00
  * fix style
  * update comment
Xin Qiu | 151fcf37bb | check device name in use_flash_attention (#11263) | 2024-06-07 15:07:47 +08:00
Yishuo Wang | 2623944604 | qwen2 sdpa small fix (#11261) | 2024-06-07 14:42:18 +08:00
Yishuo Wang | ea0d03fd28 | Refactor baichuan1 7B and 13B (#11258) | 2024-06-07 14:29:20 +08:00
Yishuo Wang | ef8e9b2ecd | Refactor qwen2 moe (#11244) | 2024-06-07 13:14:54 +08:00
Zhao Changmin | b7948671de | [WIP] Add look up table in 1st token stage (#11193) | 2024-06-07 10:51:05 +08:00
  * lookup table
Xin Qiu | 2f809116e2 | optimize Chatglm4 (#11239) | 2024-06-06 18:25:20 +08:00
  * chatglm4
  * update
  * update
  * add rms norm
  * chatglm4
Yishuo Wang | 2e4ccd541c | fix qwen2 cpu (#11240) | 2024-06-06 16:24:19 +08:00
Yishuo Wang | e738ec38f4 | disable quantize kv in specific qwen model (#11238) | 2024-06-06 14:08:39 +08:00
Yishuo Wang | c4e5806e01 | add latest optimization in starcoder2 (#11236) | 2024-06-06 14:02:17 +08:00
Yishuo Wang | ba27e750b1 | refactor yuan2 (#11235) | 2024-06-06 13:17:54 +08:00
Guoqiong Song | f6d5c6af78 | fix issue 1407 (#11171) | 2024-06-05 13:35:57 -07:00
Yina Chen | ed67435491 | Support Fp6 k in ipex-llm (#11222) | 2024-06-05 17:34:36 +08:00
  * support fp6_k
  * remove
  * fix style
binbin Deng | a6674f5bce | Fix should_use_fuse_rope error of Qwen1.5-MoE-A2.7B-Chat (#11216) | 2024-06-05 15:56:10 +08:00
Xin Qiu | 566691c5a3 | quantized attention forward for minicpm (#11200) | 2024-06-05 09:15:25 +08:00
  * quantized minicpm
  * fix style check
Jiao Wang | bb83bc23fd | Fix Starcoder issue on CPU on transformers 4.36+ (#11190) | 2024-06-04 10:05:40 -07:00
  * fix starcoder for sdpa
  * update
  * style
Xiangyu Tian | ac3d53ff5d | LLM: Fix vLLM CPU version error (#11206) | 2024-06-04 19:10:23 +08:00
Ruonan Wang | 1dde204775 | update q6k (#11205) | 2024-06-04 17:14:33 +08:00
Yishuo Wang | 6454655dcc | use sdp in baichuan2 13b (#11198) | 2024-06-04 15:39:00 +08:00
Yishuo Wang | d90cd977d0 | refactor stablelm (#11195) | 2024-06-04 13:14:43 +08:00
Xin Qiu | 5f13700c9f | optimize Minicpm (#11189) | 2024-06-03 18:28:29 +08:00
  * minicpm optimize
  * update
Shaojun Liu | 401013a630 | Remove chatglm_C Module to Eliminate LGPL Dependency (#11178) | 2024-05-31 17:03:11 +08:00
  * remove chatglm_C.**.pyd to solve ngsolve weak copyright vuln
  * fix style check error
  * remove chatglm native int4 from langchain
Ruonan Wang | 50b5f4476f | update q4k convert (#11179) | 2024-05-31 11:36:53 +08:00
ZehuaCao | 4127b99ed6 | Fix null pointer dereference errors (#11125) | 2024-05-30 16:16:10 +08:00
  * delete unused function on tgi_server
  * update
  * update
  * fix style
Guancheng Fu | 50ee004ac7 | Fix vllm condition (#11169) | 2024-05-30 15:23:17 +08:00
  * add use-vllm
  * done
  * fix style
  * fix done
Ruonan Wang | 9bfbf78bf4 | update api usage of xe_batch & fp16 (#11164) | 2024-05-29 15:15:14 +08:00
  * update api usage
  * update setup.py
Yina Chen | e29e2f1c78 | Support new fp8 e4m3 (#11158) | 2024-05-29 14:27:14 +08:00
Yishuo Wang | bc5008f0d5 | disable sdp_causal in phi-3 to fix overflow (#11157) | 2024-05-28 17:25:53 +08:00
SONG Ge | 33852bd23e | Refactor pipeline parallel device config (#11149) | 2024-05-28 16:52:46 +08:00
  * meet comments
  * update example
  * add warnings and update code doc
Yishuo Wang | d307622797 | fix first token sdp with batch (#11153) | 2024-05-28 15:03:06 +08:00
Yina Chen | 3464440839 | fix qwen import error (#11154) | 2024-05-28 14:50:12 +08:00
Yina Chen | b6b70d1ba0 | Divide core-xe packages (#11131) | 2024-05-28 12:00:18 +08:00
  * temp
  * add batch
  * fix style
  * update package name
  * fix style
  * add workflow
  * use temp version to run uts
  * trigger performance test
  * trigger win igpu perf
  * revert workflow & setup
binbin Deng | c9168b85b7 | Fix error during merging adapter (#11145) | 2024-05-27 19:41:42 +08:00
binbin Deng | 367de141f2 | Fix mixtral-8x7b with transformers==4.37.0 (#11132) | 2024-05-27 09:50:54 +08:00
ZehuaCao | 63e95698eb | [LLM] Reopen autotp generate_stream (#11120) | 2024-05-24 17:16:14 +08:00
  * fix style error
  * update
Yishuo Wang | 1dc680341b | fix phi-3-vision import (#11129) | 2024-05-24 15:57:15 +08:00
Guancheng Fu | 7f772c5a4f | Add half precision for fastchat models (#11130) | 2024-05-24 15:41:14 +08:00
Zhao Changmin | 65f4212f89 | Fix qwen 14b run into register attention fwd (#11128) | 2024-05-24 14:45:07 +08:00
  * fix qwen 14b
Yishuo Wang | 1db9d9a63b | optimize internlm2 xcomposer again (#11124) | 2024-05-24 13:44:52 +08:00
Yishuo Wang | 9372ce87ce | fix internlm xcomposer2 fp16 (#11123) | 2024-05-24 11:03:31 +08:00
Cengguang Zhang | 011b9faa5c | LLM: unify baichuan2-13b alibi mask dtype with model dtype (#11107) | 2024-05-24 10:27:53 +08:00
  * LLM: unify alibi mask dtype
  * fix comments
Yishuo Wang | 797dbc48b8 | fix phi-2 and phi-3 convert (#11116) | 2024-05-23 17:37:37 +08:00
Yishuo Wang | 37b98a531f | support running internlm xcomposer2 on gpu and add sdp optimization (#11115) | 2024-05-23 17:26:24 +08:00
Zhao Changmin | c5e8b90c8d | Add Qwen register attention implementation (#11110) | 2024-05-23 17:17:45 +08:00
  * qwen_register
Yishuo Wang | 0e53f20edb | support running internlm-xcomposer2 on cpu (#11111) | 2024-05-23 16:36:09 +08:00
Yishuo Wang | cd4dff09ee | support phi-3 vision (#11101) | 2024-05-22 17:43:50 +08:00
Xin Qiu | 71bcd18f44 | fix qwen vl (#11090) | 2024-05-21 18:40:29 +08:00
|
Yishuo Wang
|
f00625f9a4
|
refactor qwen2 (#11087)
|
2024-05-21 16:53:42 +08:00 |
|
Yishuo Wang
|
d830a63bb7
|
refactor qwen (#11074)
|
2024-05-20 18:08:37 +08:00 |
|
Yishuo Wang
|
4e97047d70
|
fix baichuan2 13b fp16 (#11071)
|
2024-05-20 11:21:20 +08:00 |
|
Yishuo Wang
|
31ce3e0c13
|
refactor baichuan2-13b (#11064)
|
2024-05-17 16:25:30 +08:00 |
|
Ruonan Wang
|
f1156e6b20
|
support gguf_q4k_m / gguf_q4k_s (#10887)
* initial commit
* UPDATE
* fix style
* fix style
* add gguf_q4k_s
* update comment
* fix
|
2024-05-17 14:30:09 +08:00 |
|
Yishuo Wang
|
981d668be6
|
refactor baichuan2-7b (#11062)
|
2024-05-17 13:01:34 +08:00 |
|
Ruonan Wang
|
3a72e5df8c
|
disable mlp fusion of fp6 on mtl (#11059)
|
2024-05-17 10:10:16 +08:00 |
|
SONG Ge
|
192ae35012
|
Add support for llama2 quantize_kv with transformers 4.38.0 (#11054)
* add support for llama2 quantize_kv with transformers 4.38.0
* fix code style
* fix code style
|
2024-05-16 22:23:39 +08:00 |
|
SONG Ge
|
16b2a418be
|
hotfix native_sdp ut (#11046)
* hotfix native_sdp
* update
|
2024-05-16 17:15:37 +08:00 |
|
Xin Qiu
|
6be70283b7
|
fix chatglm run error (#11045)
* fix chatglm
* update
* fix style
|
2024-05-16 15:39:18 +08:00 |
|
Yishuo Wang
|
8cae897643
|
use new rope in phi3 (#11047)
|
2024-05-16 15:12:35 +08:00 |
|
Yishuo Wang
|
59df750326
|
Use new sdp again (#11025)
|
2024-05-16 09:33:34 +08:00 |
|
SONG Ge
|
9942a4ba69
|
[WIP] Support llama2 with transformers==4.38.0 (#11024)
* support llama2 with transformers==4.38.0
* add supprot for quantize_qkv
* add original support for 4.38.0 now
* code style fix
|
2024-05-15 18:07:00 +08:00 |
|
Yina Chen
|
686f6038a8
|
Support fp6 save & load (#11034)
|
2024-05-15 17:52:02 +08:00 |
|
Ruonan Wang
|
ac384e0f45
|
add fp6 mlp fusion (#11032)
* add fp6 fusion
* add qkv fusion for fp6
* remove qkv first
|
2024-05-15 17:42:50 +08:00 |
|
hxsz1997
|
93d40ab127
|
Update lookahead strategy (#11021)
* update lookahead strategy
* remove lines
* fix python style check
|
2024-05-15 14:48:05 +08:00 |
|
Yishuo Wang
|
fad1dbaf60
|
use sdp fp8 causal kernel (#11023)
|
2024-05-15 10:22:35 +08:00 |
|
Yishuo Wang
|
ee325e9cc9
|
fix phi3 (#11022)
|
2024-05-15 09:32:12 +08:00 |
|
Zhao Changmin
|
0a732bebe7
|
Add phi3 cached RotaryEmbedding (#11013)
* phi3cachedrotaryembed
* pep8
|
2024-05-15 08:16:43 +08:00 |
|
Yina Chen
|
893197434d
|
Add fp6 support on gpu (#11008)
* add fp6 support
* fix style
|
2024-05-14 16:31:44 +08:00 |
|
Zhao Changmin
|
b03c859278
|
Add phi3RMS (#10988)
* phi3RMS
|
2024-05-14 15:16:27 +08:00 |
|
Yishuo Wang
|
170e3d65e0
|
use new sdp and fp32 sdp (#11007)
|
2024-05-14 14:29:18 +08:00 |
|
Guancheng Fu
|
74997a3ed1
|
Adding load_low_bit interface for ipex_llm_worker (#11000)
* initial implementation, need tests
* fix
* fix baichuan issue
* fix typo
|
2024-05-13 15:30:19 +08:00 |
|
Yishuo Wang
|
1b3c7a6928
|
remove phi3 empty cache (#10997)
|
2024-05-13 14:09:55 +08:00 |
|
Yishuo Wang
|
ad96f32ce0
|
optimize phi3 1st token performance (#10981)
|
2024-05-10 17:33:46 +08:00 |
|
Cengguang Zhang
|
cfed76b2ed
|
LLM: add long-context support for Qwen1.5-7B/Baichuan2-7B/Mistral-7B. (#10937)
* LLM: add split tensor support for baichuan2-7b and qwen1.5-7b.
* fix style.
* fix style.
* fix style.
* add support for mistral and fix condition threshold.
* fix style.
* fix comments.
|
2024-05-10 16:40:15 +08:00 |
|
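Several entries above touch the low-bit weight formats and their save/load paths (fp6 support in #11008, fp6 save & load in #11034, the load_low_bit interface in #11000). As a minimal sketch of how these features are typically exercised through ipex-llm's transformers-style API: the model path and save directory below are illustrative, and the accepted load_in_low_bit values should be checked against the release in use.

```python
# Minimal sketch, assuming ipex-llm is installed and a Llama-2 checkpoint is
# available; "fp6" as a load_in_low_bit value follows the fp6 support noted
# in #11008/#11034 above.
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"  # illustrative checkpoint

# Quantize weights into a low-bit format at load time.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="fp6",  # other values include "sym_int4", "fp8", "fp16"
    trust_remote_code=True,
)

# Persist the already-quantized weights so later runs skip quantization (#11034).
model.save_low_bit("./llama2-7b-fp6")

# Reload the low-bit checkpoint directly; #11000 wires this same entry point
# into the fastchat ipex_llm_worker.
model = AutoModelForCausalLM.load_low_bit("./llama2-7b-fp6", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)
```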