Commit graph

119 commits

Author SHA1 Message Date
Yishuo Wang
59df750326
Use new sdp again (#11025) 2024-05-16 09:33:34 +08:00
SONG Ge
9942a4ba69
[WIP] Support llama2 with transformers==4.38.0 (#11024)
* support llama2 with transformers==4.38.0

* add supprot for quantize_qkv

* add original support for 4.38.0 now

* code style fix
2024-05-15 18:07:00 +08:00
Yina Chen
686f6038a8
Support fp6 save & load (#11034) 2024-05-15 17:52:02 +08:00
Ruonan Wang
ac384e0f45
add fp6 mlp fusion (#11032)
* add fp6 fusion

* add qkv fusion for fp6

* remove qkv first
2024-05-15 17:42:50 +08:00
hxsz1997
93d40ab127
Update lookahead strategy (#11021)
* update lookahead strategy

* remove lines

* fix python style check
2024-05-15 14:48:05 +08:00
Yishuo Wang
fad1dbaf60
use sdp fp8 causal kernel (#11023) 2024-05-15 10:22:35 +08:00
Yishuo Wang
ee325e9cc9
fix phi3 (#11022) 2024-05-15 09:32:12 +08:00
Zhao Changmin
0a732bebe7
Add phi3 cached RotaryEmbedding (#11013)
* phi3cachedrotaryembed

* pep8
2024-05-15 08:16:43 +08:00
Yina Chen
893197434d
Add fp6 support on gpu (#11008)
* add fp6 support

* fix style
2024-05-14 16:31:44 +08:00
Zhao Changmin
b03c859278
Add phi3RMS (#10988)
* phi3RMS
2024-05-14 15:16:27 +08:00
Yishuo Wang
170e3d65e0
use new sdp and fp32 sdp (#11007) 2024-05-14 14:29:18 +08:00
Guancheng Fu
74997a3ed1
Adding load_low_bit interface for ipex_llm_worker (#11000)
* initial implementation, need tests

* fix

* fix baichuan issue

* fix typo
2024-05-13 15:30:19 +08:00
Yishuo Wang
1b3c7a6928
remove phi3 empty cache (#10997) 2024-05-13 14:09:55 +08:00
Yishuo Wang
ad96f32ce0
optimize phi3 1st token performance (#10981) 2024-05-10 17:33:46 +08:00
Cengguang Zhang
cfed76b2ed
LLM: add long-context support for Qwen1.5-7B/Baichuan2-7B/Mistral-7B. (#10937)
* LLM: add split tensor support for baichuan2-7b and qwen1.5-7b.

* fix style.

* fix style.

* fix style.

* add support for mistral and fix condition threshold.

* fix  style.

* fix comments.
2024-05-10 16:40:15 +08:00
Kai Huang
a6342cc068
Empty cache after phi first attention to support 4k input (#10972)
* empty cache

* fix style
2024-05-09 19:50:04 +08:00
Yishuo Wang
e753125880
use fp16_sdp when head_dim=96 (#10976) 2024-05-09 17:02:59 +08:00
Yishuo Wang
697ca79eca
use quantize kv and sdp in phi3-mini (#10973) 2024-05-09 15:16:18 +08:00
Wang, Jian4
3209d6b057
Fix spculative llama3 no stop error (#10963)
* fix normal

* add eos_tokens_id on sp and add list if

* update

* no none
2024-05-08 17:09:47 +08:00
Yishuo Wang
2ebec0395c
optimize phi-3-mini-128 (#10959) 2024-05-08 16:33:17 +08:00
Zhao Changmin
0d6e12036f
Disable fast_init_ in load_low_bit (#10945)
* fast_init_ disable
2024-05-08 10:46:19 +08:00
Yishuo Wang
c801c37bc6
optimize phi3 again: use quantize kv if possible (#10953) 2024-05-07 17:26:19 +08:00
Yishuo Wang
aa2fa9fde1
optimize phi3 again: use sdp if possible (#10951) 2024-05-07 15:53:08 +08:00
Qiyuan Gong
d7ca5d935b
Upgrade Peft version to 0.10.0 for LLM finetune (#10886)
* Upgrade Peft version to 0.10.0
* Upgrade Peft version in ARC unit test and HF-Peft example.
2024-05-07 15:09:14 +08:00
Wang, Jian4
191b184341
LLM: Optimize cohere model (#10878)
* use mlp and rms

* optimize kv_cache

* add fuse qkv

* add flash attention and fp16 sdp

* error fp8 sdp

* fix optimized

* fix style

* update

* add for pp
2024-05-07 10:19:50 +08:00
Guancheng Fu
49ab5a2b0e
Add embeddings (#10931) 2024-05-07 09:07:02 +08:00
Wang, Jian4
0e0bd309e2
LLM: Enable Speculative on Fastchat (#10909)
* init

* enable streamer

* update

* update

* remove deprecated

* update

* update

* add gpu example
2024-05-06 10:06:20 +08:00
Cengguang Zhang
75dbf240ec
LLM: update split tensor conditions. (#10872)
* LLM: update split tensor condition.

* add cond for split tensor.

* update priority of env.

* fix style.

* update env name.
2024-04-30 17:07:21 +08:00
Guancheng Fu
2c64754eb0
Add vLLM to ipex-llm serving image (#10807)
* add vllm

* done

* doc work

* fix done

* temp

* add docs

* format

* add start-fastchat-service.sh

* fix
2024-04-29 17:25:42 +08:00
Yishuo Wang
d884c62dc4
remove new_layout parameter (#10906) 2024-04-29 10:31:50 +08:00
Guancheng Fu
fbcd7bc737
Fix Loader issue with dtype fp16 (#10907) 2024-04-29 10:16:02 +08:00
Guancheng Fu
c9fac8c26b
Fix sdp logic (#10896)
* fix

* fix
2024-04-28 22:02:14 +08:00
Yina Chen
015d07a58f
Fix lookahead sample error & add update strategy (#10894)
* Fix sample error & add update strategy

* add mtl config

* fix style

* remove print
2024-04-28 17:21:00 +08:00
Cengguang Zhang
9752ffe979
LLM: update split qkv native sdp. (#10895)
* LLM: update split qkv native sdp.

* fix typo.
2024-04-26 18:47:35 +08:00
Guancheng Fu
990535b1cf
Add tensor parallel for vLLM (#10879)
* initial

* test initial tp

* initial sup

* fix format

* fix

* fix
2024-04-26 17:10:49 +08:00
Yishuo Wang
46ba962168
use new quantize kv (#10888) 2024-04-26 14:42:17 +08:00
Wang, Jian4
3e8ed54270
LLM: Fix bigdl_ipex_int8 warning (#10890) 2024-04-26 11:18:44 +08:00
Yina Chen
8811f268ff
Use new fp16 sdp in Qwen and modify the constraint (#10882) 2024-04-25 19:23:37 +08:00
Yang Wang
1ce8d7bcd9
Support the desc_act feature in GPTQ model (#10851)
* support act_order

* update versions

* fix style

* fix bug

* clean up
2024-04-24 10:17:13 -07:00
Yina Chen
dc27b3bc35
Use sdp when rest token seq_len > 1 in llama & mistral (for lookup & spec) (#10790)
* update sdp condition

* update

* fix

* update & test llama

* mistral

* fix style

* update

* fix style

* remove pvc constrain

* update ds on arc

* fix style
2024-04-24 17:24:01 +08:00
binbin Deng
c9feffff9a
LLM: support Qwen1.5-MoE-A2.7B-Chat pipeline parallel inference (#10864) 2024-04-24 16:02:27 +08:00
Yishuo Wang
2d210817ff
add phi3 optimization (#10871) 2024-04-24 15:17:40 +08:00
Cengguang Zhang
763413b7e1
LLM: support llama split tensor for long context in transformers>=4.36. (#10844)
* LLm: support llama split tensor for long context in transformers>=4.36.

* fix dtype.

* fix style.

* fix style.

* fix style.

* fix style.

* fix dtype.

* fix style.
2024-04-23 16:13:25 +08:00
ZehuaCao
92ea54b512
Fix speculative decoding bug (#10855) 2024-04-23 14:28:31 +08:00
Wang, Jian4
18c032652d
LLM: Add mixtral speculative CPU example (#10830)
* init mixtral sp example

* use different prompt_format

* update output

* update
2024-04-23 10:05:51 +08:00
Yishuo Wang
fe5a082b84
add phi-2 optimization (#10843) 2024-04-22 18:56:47 +08:00
Wang, Jian4
23c6a52fb0
LLM: Fix ipex torchscript=True error (#10832)
* remove

* update

* remove torchscript
2024-04-22 15:53:09 +08:00
Yina Chen
3daad242b8
Fix No module named 'transformers.cache_utils' with transformers < 4.36 (#10835)
* update sdp condition

* update

* fix

* fix 431 error

* revert sdp & style fix

* fix

* meet comments
2024-04-22 14:05:50 +08:00
Guancheng Fu
caf75beef8
Disable sdpa (#10814) 2024-04-19 17:33:18 +08:00
Yishuo Wang
57edf2033c
fix lookahead with transformers >= 4.36 (#10808) 2024-04-19 16:24:56 +08:00