Wang, Jian4
0e0bd309e2
LLM: Enable Speculative on Fastchat ( #10909 )
...
* init
* enable streamer
* update
* update
* remove deprecated
* update
* update
* add gpu example
2024-05-06 10:06:20 +08:00
Cengguang Zhang
75dbf240ec
LLM: update split tensor conditions. ( #10872 )
...
* LLM: update split tensor condition.
* add cond for split tensor.
* update priority of env.
* fix style.
* update env name.
2024-04-30 17:07:21 +08:00
Guancheng Fu
2c64754eb0
Add vLLM to ipex-llm serving image ( #10807 )
...
* add vllm
* done
* doc work
* fix done
* temp
* add docs
* format
* add start-fastchat-service.sh
* fix
2024-04-29 17:25:42 +08:00
Yishuo Wang
d884c62dc4
remove new_layout parameter ( #10906 )
2024-04-29 10:31:50 +08:00
Guancheng Fu
fbcd7bc737
Fix Loader issue with dtype fp16 ( #10907 )
2024-04-29 10:16:02 +08:00
Guancheng Fu
c9fac8c26b
Fix sdp logic ( #10896 )
...
* fix
* fix
2024-04-28 22:02:14 +08:00
Yina Chen
015d07a58f
Fix lookahead sample error & add update strategy ( #10894 )
...
* Fix sample error & add update strategy
* add mtl config
* fix style
* remove print
2024-04-28 17:21:00 +08:00
Cengguang Zhang
9752ffe979
LLM: update split qkv native sdp. ( #10895 )
...
* LLM: update split qkv native sdp.
* fix typo.
2024-04-26 18:47:35 +08:00
Guancheng Fu
990535b1cf
Add tensor parallel for vLLM ( #10879 )
...
* initial
* test initial tp
* initial sup
* fix format
* fix
* fix
2024-04-26 17:10:49 +08:00
Yishuo Wang
46ba962168
use new quantize kv ( #10888 )
2024-04-26 14:42:17 +08:00
Wang, Jian4
3e8ed54270
LLM: Fix bigdl_ipex_int8 warning ( #10890 )
2024-04-26 11:18:44 +08:00
Yina Chen
8811f268ff
Use new fp16 sdp in Qwen and modify the constraint ( #10882 )
2024-04-25 19:23:37 +08:00
Yang Wang
1ce8d7bcd9
Support the desc_act feature in GPTQ model ( #10851 )
...
* support act_order
* update versions
* fix style
* fix bug
* clean up
2024-04-24 10:17:13 -07:00
Yina Chen
dc27b3bc35
Use sdp when rest token seq_len > 1 in llama & mistral (for lookup & spec) ( #10790 )
...
* update sdp condition
* update
* fix
* update & test llama
* mistral
* fix style
* update
* fix style
* remove pvc constrain
* update ds on arc
* fix style
2024-04-24 17:24:01 +08:00
binbin Deng
c9feffff9a
LLM: support Qwen1.5-MoE-A2.7B-Chat pipeline parallel inference ( #10864 )
2024-04-24 16:02:27 +08:00
Yishuo Wang
2d210817ff
add phi3 optimization ( #10871 )
2024-04-24 15:17:40 +08:00
Cengguang Zhang
763413b7e1
LLM: support llama split tensor for long context in transformers>=4.36. ( #10844 )
...
* LLm: support llama split tensor for long context in transformers>=4.36.
* fix dtype.
* fix style.
* fix style.
* fix style.
* fix style.
* fix dtype.
* fix style.
2024-04-23 16:13:25 +08:00
ZehuaCao
92ea54b512
Fix speculative decoding bug ( #10855 )
2024-04-23 14:28:31 +08:00
Wang, Jian4
18c032652d
LLM: Add mixtral speculative CPU example ( #10830 )
...
* init mixtral sp example
* use different prompt_format
* update output
* update
2024-04-23 10:05:51 +08:00
Yishuo Wang
fe5a082b84
add phi-2 optimization ( #10843 )
2024-04-22 18:56:47 +08:00
Wang, Jian4
23c6a52fb0
LLM: Fix ipex torchscript=True error ( #10832 )
...
* remove
* update
* remove torchscript
2024-04-22 15:53:09 +08:00
Yina Chen
3daad242b8
Fix No module named 'transformers.cache_utils' with transformers < 4.36 ( #10835 )
...
* update sdp condition
* update
* fix
* fix 431 error
* revert sdp & style fix
* fix
* meet comments
2024-04-22 14:05:50 +08:00
Guancheng Fu
caf75beef8
Disable sdpa ( #10814 )
2024-04-19 17:33:18 +08:00
Yishuo Wang
57edf2033c
fix lookahead with transformers >= 4.36 ( #10808 )
2024-04-19 16:24:56 +08:00
Ovo233
1a885020ee
Updated importing of top_k_top_p_filtering for transformers>=4.39.0 ( #10794 )
...
* In transformers>=4.39.0, the top_k_top_p_filtering function has been deprecated and moved to the hugging face package trl. Thus, for versions >= 4.39.0, import this function from trl.
2024-04-19 15:34:39 +08:00
Yishuo Wang
08458b4f74
remove rms norm copy ( #10793 )
2024-04-19 13:57:48 +08:00
Ruonan Wang
754b0ffecf
Fix pvc llama ( #10798 )
...
* ifx
* update
2024-04-18 10:44:57 -07:00
Ruonan Wang
439c834ed3
LLM: add mixed precision for lm_head ( #10795 )
...
* add mixed_quantization
* meet code review
* update
* fix style
* meet review
2024-04-18 19:11:31 +08:00
Yina Chen
8796401b08
Support q4k in ipex-llm ( #10796 )
...
* support q4k
* update
2024-04-18 18:55:28 +08:00
Ruonan Wang
0e8aac19e3
add q6k precision in ipex-llm ( #10792 )
...
* add q6k
* add initial 16k
* update
* fix style
2024-04-18 16:52:09 +08:00
Wang, Jian4
14ca42a048
LLM:Fix moe indexs error on cpu ( #10791 )
2024-04-18 15:56:52 +08:00
Guancheng Fu
cbe7b5753f
Add vLLM[xpu] related code ( #10779 )
...
* Add ipex-llm side change
* add runable offline_inference
* refactor to call vllm2
* Verified async server
* add new v2 example
* add README
* fix
* change dir
* refactor readme.md
* add experimental
* fix
2024-04-18 15:29:20 +08:00
Wang, Jian4
209c3501e6
LLM: Optimize qwen1.5 moe model ( #10706 )
...
* update moe block
* fix style
* enable optmize MLP
* enabel kv_cache
* enable fuse rope
* enable fused qkv
* enable flash_attention
* error sdp quantize
* use old api
* use fuse
* use xetla
* fix python style
* update moe_blocks num
* fix output error
* add cpu sdpa
* update
* update
* update
2024-04-18 14:54:05 +08:00
Ziteng Zhang
ff040c8f01
LISA Finetuning Example ( #10743 )
...
* enabling xetla only supports qtype=SYM_INT4 or FP8E5
* LISA Finetuning Example on gpu
* update readme
* add licence
* Explain parameters of lisa & Move backend codes to src dir
* fix style
* fix style
* update readme
* support chatglm
* fix style
* fix style
* update readme
* fix
2024-04-18 13:48:10 +08:00
Yang Wang
952e517db9
use config rope_theta ( #10787 )
...
* use config rope_theta
* fix style
2024-04-17 20:39:11 -07:00
Guancheng Fu
31ea2f9a9f
Fix wrong output for Llama models on CPU ( #10742 )
2024-04-18 11:07:27 +08:00
Xin Qiu
e764f9b1b1
Disable fast fused rope on UHD ( #10780 )
...
* use decoding fast path
* update
* update
* cleanup
2024-04-18 10:03:53 +08:00
Yina Chen
ea5b373a97
Add lookahead GPU example ( #10785 )
...
* Add lookahead example
* fix style & attn mask
* fix typo
* address comments
2024-04-17 17:41:55 +08:00
Wang, Jian4
a20271ffe4
LLM: Fix yi-6b fp16 error on pvc ( #10781 )
...
* updat for yi fp16
* update
* update
2024-04-17 16:49:59 +08:00
Yina Chen
766fe45222
Fix spec error caused by lookup pr ( #10777 )
...
* Fix spec error
* remove
* fix style
2024-04-17 11:27:35 +08:00
Qiyuan Gong
f2e923b3ca
Axolotl v0.4.0 support ( #10773 )
...
* Add Axolotl 0.4.0, remove legacy 0.3.0 support.
* replace is_torch_bf16_gpu_available
* Add HF_HUB_OFFLINE=1
* Move transformers out of requirement
* Refine readme and qlora.yml
2024-04-17 09:49:11 +08:00
Yina Chen
899d392e2f
Support prompt lookup in ipex-llm ( #10768 )
...
* lookup init
* add lookup
* fix style
* remove redundant code
* change param name
* fix style
2024-04-16 16:52:38 +08:00
binbin Deng
0a62933d36
LLM: fix qwen AutoTP ( #10766 )
2024-04-16 09:56:17 +08:00
Cengguang Zhang
3e2662c87e
LLM: fix get env KV_CACHE_ALLOC_BLOCK_LENGTH type. ( #10771 )
2024-04-16 09:32:30 +08:00
binbin Deng
c3fc8f4b90
LLM: add bs limitation for llama softmax upcast to fp32 ( #10752 )
2024-04-12 15:40:25 +08:00
Yishuo Wang
8086554d33
use new fp16 sdp in llama and mistral ( #10734 )
2024-04-12 10:49:02 +08:00
Yang Wang
019293e1b9
Fuse MOE indexes computation ( #10716 )
...
* try moe
* use c++ cpu to compute indexes
* fix style
2024-04-11 10:12:55 -07:00
binbin Deng
70ed9397f9
LLM: fix AttributeError of FP16Linear ( #10740 )
2024-04-11 17:03:56 +08:00
Cengguang Zhang
4b024b7aac
LLM: optimize chatglm2 8k input. ( #10723 )
...
* LLM: optimize chatglm2 8k input.
* rename.
2024-04-10 16:59:06 +08:00
Wang, Jian4
c9e6d42ad1
LLM: Fix chatglm3-6b-32k error ( #10719 )
...
* fix chatglm3-6b-32k
* update style
2024-04-10 11:24:06 +08:00