Author | Commit | Subject | Date
Guancheng Fu | 57a023aadc | Fix vllm tp (#11297) | 2024-06-13 10:47:48 +08:00
Yishuo Wang | 10e480ee96 | refactor internlm and internlm2 (#11274) | 2024-06-11 14:19:19 +08:00
Yishuo Wang | ea0d03fd28 | Refactor baichuan1 7B and 13B (#11258) | 2024-06-07 14:29:20 +08:00
Yishuo Wang | ef8e9b2ecd | Refactor qwen2 moe (#11244) | 2024-06-07 13:14:54 +08:00
Xin Qiu | 2f809116e2 | optimize Chatglm4 (#11239) | 2024-06-06 18:25:20 +08:00
    * chatglm4
    * update
    * update
    * add rms norm
    * chatglm4
Yishuo Wang | 2e4ccd541c | fix qwen2 cpu (#11240) | 2024-06-06 16:24:19 +08:00
Yishuo Wang | ba27e750b1 | refactor yuan2 (#11235) | 2024-06-06 13:17:54 +08:00
Guoqiong Song | f6d5c6af78 | fix issue 1407 (#11171) | 2024-06-05 13:35:57 -07:00
Xin Qiu | 566691c5a3 | quantized attention forward for minicpm (#11200) | 2024-06-05 09:15:25 +08:00
    * quantized minicpm
    * fix style check
Jiao Wang | bb83bc23fd | Fix Starcoder issue on CPU on transformers 4.36+ (#11190) | 2024-06-04 10:05:40 -07:00
    * fix starcoder for sdpa
    * update
    * style
Xiangyu Tian | ac3d53ff5d | LLM: Fix vLLM CPU version error (#11206) | 2024-06-04 19:10:23 +08:00
Xin Qiu | 5f13700c9f | optimize Minicpm (#11189) | 2024-06-03 18:28:29 +08:00
    * minicpm optimize
    * update
ZehuaCao | 4127b99ed6 | Fix null pointer dereference error (#11125) | 2024-05-30 16:16:10 +08:00
    * delete unused function on tgi_server
    * update
    * update
    * fix style
Guancheng Fu | 50ee004ac7 | Fix vllm condition (#11169) | 2024-05-30 15:23:17 +08:00
    * add use-vllm
    * done
    * fix style
    * fix done
Zhao Changmin | 65f4212f89 | Fix qwen 14b run into register attention fwd (#11128) | 2024-05-24 14:45:07 +08:00
    * fix qwen 14b
Yishuo Wang | 797dbc48b8 | fix phi-2 and phi-3 convert (#11116) | 2024-05-23 17:37:37 +08:00
Yishuo Wang | 37b98a531f | support running internlm xcomposer2 on gpu and add sdp optimization (#11115) | 2024-05-23 17:26:24 +08:00
Zhao Changmin | c5e8b90c8d | Add Qwen register attention implementation (#11110) | 2024-05-23 17:17:45 +08:00
    * qwen_register
Yishuo Wang | 0e53f20edb | support running internlm-xcomposer2 on cpu (#11111) | 2024-05-23 16:36:09 +08:00
Yishuo Wang | cd4dff09ee | support phi-3 vision (#11101) | 2024-05-22 17:43:50 +08:00
Yishuo Wang | f00625f9a4 | refactor qwen2 (#11087) | 2024-05-21 16:53:42 +08:00
Yishuo Wang | d830a63bb7 | refactor qwen (#11074) | 2024-05-20 18:08:37 +08:00
Ruonan Wang | f1156e6b20 | support gguf_q4k_m / gguf_q4k_s (#10887) | 2024-05-17 14:30:09 +08:00
    * initial commit
    * UPDATE
    * fix style
    * fix style
    * add gguf_q4k_s
    * update comment
    * fix
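The GGUF Q4_K entry above adds two low-bit weight formats. A minimal sketch of how such a format would be selected at load time, assuming it is exposed through ipex-llm's `load_in_low_bit` argument (the exact value strings and the checkpoint name here are assumptions based on the commit title, not confirmed by this log):

```python
# Hedged sketch: loading a model with one of the new GGUF Q4_K low-bit
# formats via ipex-llm's transformers-style loader. The "gguf_q4k_m" /
# "gguf_q4k_s" strings are assumptions inferred from the commit title.
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",    # placeholder checkpoint
    load_in_low_bit="gguf_q4k_m",  # or "gguf_q4k_s" for the smaller variant
    trust_remote_code=True,
)
```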
Yishuo Wang | 981d668be6 | refactor baichuan2-7b (#11062) | 2024-05-17 13:01:34 +08:00
SONG Ge | 192ae35012 | Add support for llama2 quantize_kv with transformers 4.38.0 (#11054) | 2024-05-16 22:23:39 +08:00
    * add support for llama2 quantize_kv with transformers 4.38.0
    * fix code style
    * fix code style
Yishuo Wang | 8cae897643 | use new rope in phi3 (#11047) | 2024-05-16 15:12:35 +08:00
SONG Ge | 9942a4ba69 | [WIP] Support llama2 with transformers==4.38.0 (#11024) | 2024-05-15 18:07:00 +08:00
    * support llama2 with transformers==4.38.0
    * add support for quantize_qkv
    * add original support for 4.38.0 now
    * code style fix
Yishuo Wang | ee325e9cc9 | fix phi3 (#11022) | 2024-05-15 09:32:12 +08:00
Zhao Changmin | 0a732bebe7 | Add phi3 cached RotaryEmbedding (#11013) | 2024-05-15 08:16:43 +08:00
    * phi3cachedrotaryembed
    * pep8
Zhao Changmin | b03c859278 | Add phi3RMS (#10988) | 2024-05-14 15:16:27 +08:00
    * phi3RMS
Yishuo Wang | 1b3c7a6928 | remove phi3 empty cache (#10997) | 2024-05-13 14:09:55 +08:00
Kai Huang | a6342cc068 | Empty cache after phi first attention to support 4k input (#10972) | 2024-05-09 19:50:04 +08:00
    * empty cache
    * fix style
Yishuo Wang | 2ebec0395c | optimize phi-3-mini-128 (#10959) | 2024-05-08 16:33:17 +08:00
Wang, Jian4 | 191b184341 | LLM: Optimize cohere model (#10878) | 2024-05-07 10:19:50 +08:00
    * use mlp and rms
    * optimize kv_cache
    * add fuse qkv
    * add flash attention and fp16 sdp
    * error fp8 sdp
    * fix optimized
    * fix style
    * update
    * add for pp
Guancheng Fu | 49ab5a2b0e | Add embeddings (#10931) | 2024-05-07 09:07:02 +08:00
Guancheng Fu | 2c64754eb0 | Add vLLM to ipex-llm serving image (#10807) | 2024-04-29 17:25:42 +08:00
    * add vllm
    * done
    * doc work
    * fix done
    * temp
    * add docs
    * format
    * add start-fastchat-service.sh
    * fix
Guancheng Fu | 990535b1cf | Add tensor parallel for vLLM (#10879) | 2024-04-26 17:10:49 +08:00
    * initial
    * test initial tp
    * initial sup
    * fix format
    * fix
    * fix
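For context on the tensor-parallel entry above, a minimal offline-inference sketch using upstream vLLM's public API (the ipex-llm build may expose a different entry point; the checkpoint name and device count are placeholders):

```python
# Hedged sketch: vLLM offline inference with tensor parallelism enabled,
# via the upstream vLLM API; the ipex-llm integration may wrap this.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    tensor_parallel_size=2,            # shard the model across 2 devices
)
params = SamplingParams(temperature=0.8, max_tokens=64)
for out in llm.generate(["What does tensor parallelism do?"], params):
    print(out.outputs[0].text)
```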
Yang Wang | 1ce8d7bcd9 | Support the desc_act feature in GPTQ model (#10851) | 2024-04-24 10:17:13 -07:00
    * support act_order
    * update versions
    * fix style
    * fix bug
    * clean up
Yishuo Wang | 2d210817ff | add phi3 optimization (#10871) | 2024-04-24 15:17:40 +08:00
Yishuo Wang | fe5a082b84 | add phi-2 optimization (#10843) | 2024-04-22 18:56:47 +08:00
Ruonan Wang | 439c834ed3 | LLM: add mixed precision for lm_head (#10795) | 2024-04-18 19:11:31 +08:00
    * add mixed_quantization
    * meet code review
    * update
    * fix style
    * meet review
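The mixed-precision entry above keeps the lm_head in a higher-precision format while the rest of the model is quantized, which typically improves output quality at a small memory cost. A minimal sketch, assuming the feature is toggled by a `mixed_precision` keyword on the ipex-llm loader (the keyword name is an assumption inferred from the commit title):

```python
# Hedged sketch: low-bit loading with a higher-precision lm_head.
# The mixed_precision keyword is an assumption, not confirmed by this log.
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    load_in_4bit=True,           # quantize most weights to 4 bit
    mixed_precision=True,        # keep lm_head in a higher-precision format
)
```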
Guancheng Fu | cbe7b5753f | Add vLLM[xpu] related code (#10779) | 2024-04-18 15:29:20 +08:00
    * Add ipex-llm side change
    * add runnable offline_inference
    * refactor to call vllm2
    * Verified async server
    * add new v2 example
    * add README
    * fix
    * change dir
    * refactor readme.md
    * add experimental
    * fix
Wang, Jian4 | 209c3501e6 | LLM: Optimize qwen1.5 moe model (#10706) | 2024-04-18 14:54:05 +08:00
    * update moe block
    * fix style
    * enable optimize MLP
    * enable kv_cache
    * enable fuse rope
    * enable fused qkv
    * enable flash_attention
    * error sdp quantize
    * use old api
    * use fuse
    * use xetla
    * fix python style
    * update moe_blocks num
    * fix output error
    * add cpu sdpa
    * update
    * update
    * update
binbin Deng | 0a62933d36 | LLM: fix qwen AutoTP (#10766) | 2024-04-16 09:56:17 +08:00
Wang, Jian4 | c9e6d42ad1 | LLM: Fix chatglm3-6b-32k error (#10719) | 2024-04-10 11:24:06 +08:00
    * fix chatglm3-6b-32k
    * update style
binbin Deng | 44922bb5c2 | LLM: support baichuan2-13b using AutoTP (#10691) | 2024-04-09 14:06:01 +08:00
Ovo233 | dcb2038aad | Enable optimization for sentence_transformers (#10679) | 2024-04-09 12:33:46 +08:00
    * enable optimization for sentence_transformers
    * fix python style check failure
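A minimal sketch of what the sentence_transformers entry enables, assuming ipex-llm's generic `optimize_model` helper is the integration point (the model name and exact wiring are illustrative, not the PR's confirmed implementation):

```python
# Hedged sketch: applying ipex-llm's generic optimize_model helper to a
# sentence-transformers model; the PR's actual integration may differ.
from sentence_transformers import SentenceTransformer
from ipex_llm import optimize_model

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
model = optimize_model(model)  # low-bit optimization of the transformer
embeddings = model.encode(["hello world", "ipex-llm"])
print(embeddings.shape)
```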
Xin Qiu | 1274cba79b | stablelm fp8 kv cache (#10672) | 2024-04-08 15:16:46 +08:00
    * stablelm fp8 kvcache
    * update
    * fix
    * change to fp8 matmul
    * fix style
    * fix
    * fix
    * meet code review
    * add comment
Xin Qiu | 3a9ab8f1ae | fix stablelm logits diff (#10636) | 2024-04-03 15:08:12 +08:00
    * fix logits diff
    * Small fixes
    Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
Yuwen Hu | fd384ddfb8 | Optimize StableLM (#10619) | 2024-04-02 18:58:38 +08:00
    * Initial commit for stablelm optimizations
    * Small style fix
    * add dependency
    * Add mlp optimizations
    * Small fix
    * add attention forward
    * Remove quantize kv for now as head_dim=80
    * Add merged qkv
    * fix license
    * Python style fix
    Co-authored-by: qiuxin2012 <qiuxin2012cs@gmail.com>