Yina Chen
ce5840a8b7
GPT-J rope optimization on xpu ( #10182 )
* optimize
* update
* fix style & move use_fuse_rope
* add ipex version check
* fix style
* update
* fix style
* meet comments
* address comments
* fix style
2024-02-22 16:25:12 +08:00
Xiangyu Tian
f445217d02
LLM: Update IPEX to 2.2.0+cpu and Refactor for _ipex_optimize ( #10189 )
Update IPEX to 2.2.0+cpu and refactor for _ipex_optimize.
2024-02-22 16:01:11 +08:00
Heyang Sun
c876d9b5ca
Support for MPT rotary embedding ( #10208 )
2024-02-22 15:16:31 +08:00
Ruonan Wang
5e1fee5e05
LLM: add GGUF-IQ2 examples ( #10207 )
* add iq2 examples
* small fix
* meet code review
* fix
* meet review
* small fix
2024-02-22 14:18:45 +08:00
SONG Ge
ca1166a0e5
[LLM] Add quantize kv_cache for Baichuan2-13B ( #10203 )
* add quantize kv_cache for baichuan2-13b
* style fix
2024-02-22 13:43:35 +08:00
Ruonan Wang
34ee1aa91f
LLM: add esimd sdp support for chatglm3 ( #10205 )
* add esimd sdp support
* fix style
2024-02-22 13:37:16 +08:00
Ruonan Wang
f7c96b19ef
LLM: support iq2 for mixtral ( #10191 )
* support name mapping for mixtral
* support mixtral mixed quantization
* fix style
* fix
2024-02-21 16:00:29 +08:00
Xin Qiu
56ad781f2f
qwen2 cpu fix ( #10187 )
2024-02-21 11:23:51 +08:00
Zhao Changmin
4fbf449c2d
for rwkv4 ( #10179 )
2024-02-21 10:11:10 +08:00
Ruonan Wang
3288acb8de
LLM: Support embedding quantization (only q2k now) ( #10170 )
* basic logic added
* basic support
* support save&load, update mixed strategy
* fix style
* use int8 for lm_head
* add check for xpu
2024-02-20 16:56:57 +08:00
binbin Deng
2bb96c775c
LLM: fix device setting during saving optimized model ( #10154 )
2024-02-20 09:52:59 +08:00
Xin Qiu
1f6d5b9f30
enable fused rmsnorm and rope qwen2 ( #10163 )
* qwen2
* change convert
* cleanup
2024-02-20 08:33:09 +08:00
Zhao Changmin
f8730e8dc1
Skip rescale rwkv linear when load_low_bit ( #10164 )
* rwkv_ld
2024-02-19 15:56:42 +08:00
Heyang Sun
3e2af5ec0a
Fix IPEX Baichuan Speculative ( #10162 )
* Fix IPEX Baichuan Speculative
* compatible with 13B
* Update speculative.py
2024-02-19 15:27:34 +08:00
Yina Chen
23c91cdce6
[LLM] Add min_step_draft in speculative decoding ( #10142 )
* Fix gptj kvcache & position id
* Add min_draft_tokens in speculative decoding
* fix style
* update
2024-02-19 14:31:41 +08:00
Wang, Jian4
f2417e083c
LLM: enable chatglm3-6b target_model ipex ( #10085 )
* init
* always make causal_mask
* not return last tensor
* update
* optimize_model = False
* enable optimized=False
* enable optimized_model=true
* speed_up ipex target_model
* remove if True
* use group_size
* update python style
* update
* update
2024-02-19 13:38:32 +08:00
Yina Chen
1508d6b089
Fix gptj kvcache & position id ( #10141 )
2024-02-18 10:02:49 +08:00
Yishuo Wang
4d33aac7f9
quick fix qwen2 fp8 kv cache ( #10135 )
2024-02-08 17:04:59 +08:00
Cengguang Zhang
39d90839aa
LLM: add quantize kv cache for llama. ( #10086 )
* feat: add quantize kv cache for llama.
* fix style.
* add quantized attention forward function.
* revert style.
* fix style.
* fix style.
* update quantized kv cache and add quantize_qkv
* fix style.
* fix style.
* optimize quantize kv cache.
* fix style.
2024-02-08 16:49:22 +08:00
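Several commits in this stretch add quantized kv cache support model by model (e.g. #10086 here, #10134 for qwen2, #9996 for chatglm2/3). As background only, here is a simplified Python sketch of the general idea, storing keys/values as int8 with per-token scales; it is an assumption-based illustration, not the repo's actual kernel:

    import torch

    # Simplified sketch of kv cache quantization: int8 storage plus a
    # per-(batch, head, token) scale, dequantized on use. Illustrative
    # placeholder code, not bigdl-llm's implementation.
    def quantize_kv(t: torch.Tensor):
        # t: [bs, n_head, seq_len, head_dim]
        scale = t.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
        q = torch.round(t / scale).to(torch.int8)
        return q, scale

    def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.to(scale.dtype) * scale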
Yishuo Wang
d848efe17c
add quantize kv cache support for qwen2 ( #10134 )
2024-02-08 16:17:21 +08:00
SONG Ge
3f79128ed7
[LLM] Enable kv_cache optimization for Qwen2 on transformers-v4.37.0 ( #10131 )
* add support for kv_cache optimization on transformers-v4.37.0
* enable attention forward
* style fix
* disable rotary for now
2024-02-08 14:20:26 +08:00
Ruonan Wang
063dc145ac
LLM: basic support for q2k ( #10132 )
* basic support for q2k
* fix style
2024-02-08 13:52:01 +08:00
Cengguang Zhang
0cf6a12691
LLM: add default torch_dtype for fp16. ( #10124 )
* set default torch_dtype for fp16.
* fix style.
* bug fix.
* update bug fix.
2024-02-08 10:24:16 +08:00
Yishuo Wang
1aa0c623ce
disable fused layer norm on UHD ( #10130 )
2024-02-08 10:20:01 +08:00
Yuwen Hu
a8450fc300
[LLM] Support MLP optimization for Qwen1.5 ( #10123 )
2024-02-08 09:15:34 +08:00
binbin Deng
925f82107e
LLM: support models hosted by modelscope ( #10106 )
2024-02-07 16:46:36 +08:00
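For the modelscope support in #10106, usage presumably resembles the sketch below; the model_hub parameter and the model id are assumptions for illustration, so check the repo's docs for the exact API:

    from bigdl.llm.transformers import AutoModelForCausalLM

    # Hedged sketch of loading a modelscope-hosted model (#10106).
    # model_hub and the model id are illustrative assumptions.
    model = AutoModelForCausalLM.from_pretrained(
        "qwen/Qwen-7B-Chat",      # placeholder modelscope model id
        load_in_4bit=True,
        model_hub="modelscope",   # assumed switch from the default huggingface hub
    )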
Xiangyu Tian
8953acd7d6
[LLM] Fix log condition for BIGDL_OPT_IPEX ( #10115 )
Fix log condition for BIGDL_OPT_IPEX
2024-02-07 10:27:10 +08:00
Yuwen Hu
518ef95abc
Small fix for NoneType error ( #10104 )
2024-02-06 14:58:52 +08:00
Ruonan Wang
d61f4905ac
LLM: 2bit quantization initial support ( #10042 )
* basic quantize support
* fix new module name
* small update
* add mixed int4 with iq2_xxs
* remove print
* code refactor
* fix style
* meet code review
2024-02-06 14:58:32 +08:00
Jiao Wang
33b9e7744d
fix dimension ( #10097 )
2024-02-05 15:07:38 -08:00
Zhicun
7d2be7994f
add phixtral and optimize phi-moe ( #10052 )
2024-02-05 11:12:47 +08:00
Zhicun
676d6923f2
LLM: modify TransformersEmbeddings.embed() in langchain ( #10051 )
2024-02-05 10:42:10 +08:00
Jin Qiao
ad050107b3
LLM: fix mpt load_low_bit issue ( #10075 )
* fix
* retry
* retry
2024-02-05 10:17:07 +08:00
Ruonan Wang
8e33cb0f38
LLM: support speecht5_tts ( #10077 )
* support speecht5_tts
* fix
2024-02-04 13:26:42 +08:00
ivy-lv11
428b7105f6
Add HF and PyTorch examples for InternLM2 ( #10061 )
2024-02-04 10:25:55 +08:00
Yina Chen
77be19bb97
LLM: Support gpt-j in speculative decoding ( #10067 )
* gptj
* support gptj in speculative decoding
* fix
* update readme
* small fix
2024-02-02 14:54:55 +08:00
Xin Qiu
6e0f1a1e92
use apply_rotary_pos_emb_cache_freq_xpu in mixtral ( #10060 )
* use apply_rotary_pos_emb_cache_freq_xpu in mixtral
* fix style
2024-02-01 15:40:49 +08:00
Heyang Sun
601024f418
Mistral CPU example of speculative decoding ( #10024 )
* Mistral CPU example of speculative decoding
* update transformers version
* update example
* Update README.md
2024-02-01 10:52:32 +08:00
Heyang Sun
968e70544d
Enable IPEX Mistral in Speculative ( #10059 )
2024-02-01 10:48:16 +08:00
Yina Chen
3ca03d4e97
Add deepmind sample into bigdl-llm speculative decoding ( #10041 )
* migrate deepmind sample
* update
* meet comments
* fix style
* fix style
2024-02-01 09:57:02 +08:00
Wang, Jian4
7e5cd42a5c
LLM: Update optimize ipex bf16 ( #10038 )
* use 4.35.2 and remove
* update rmsnorm
* remove
* remove
* update python style
* update
* update python style
* update
* fix style
* update
* remove whitespace
2024-01-31 10:59:55 +08:00
Ruonan Wang
3685622f29
LLM: fix llama 4.36 forward ( #10047 )
2024-01-31 10:31:10 +08:00
Yishuo Wang
53a5140eff
Optimize rwkv v5 rest token again ( #10043 )
2024-01-31 10:01:11 +08:00
Ruonan Wang
6b63ba23d1
LLM: add full module name during convert ( #10035 )
2024-01-30 14:43:07 +08:00
Yishuo Wang
7dfa6dbe46
add rwkv time shift optimization ( #10032 )
2024-01-30 14:10:55 +08:00
Xiangyu Tian
f57d0fda8b
[LLM] Use IPEX Optimization for Self Speculative Decoding ( #9997 )
Use IPEX Optimization for Self Speculative Decoding
2024-01-30 09:11:06 +08:00
Ruonan Wang
ccf8f613fb
LLM: update fp16 Linear on ARC/FLEX ( #10023 )
2024-01-29 18:25:26 +08:00
Shaojun Liu
824c8029d7
Fix "local variable 'model' referenced before assignment" ( #10022 )
2024-01-29 16:18:04 +08:00
Xiangyu Tian
f37e4702bc
[LLM] Use IPEX Optimization for BF16 Model ( #9988 )
Use IPEX Optimization for BF16 Model by env BIGDL_OPT_IPEX=true
2024-01-29 11:28:25 +08:00
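#9988 gates the IPEX BF16 path behind an environment variable. A minimal sketch of opting in, assuming the variable must be set before the library is imported and using a placeholder model id:

    import os

    # Enable the IPEX BF16 optimization path from #9988.
    # Assumption: the flag is read at import/load time.
    os.environ["BIGDL_OPT_IPEX"] = "true"

    import torch
    from bigdl.llm.transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",  # placeholder model id
        optimize_model=True,
        torch_dtype=torch.bfloat16,
    )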
Yishuo Wang
d720554d43
simplify quantize kv cache api ( #10011 )
2024-01-29 09:23:57 +08:00
Yina Chen
a3322e2a6c
add fp8 e5 to use_xmx ( #10015 )
2024-01-26 18:29:46 +08:00
Qiyuan Gong
9e18ea187f
[LLM] Avoid KV Cache OOM when seq len is larger than 1 ( #10006 )
* Avoid OOM during multi-round streaming chat with kv cache
* For llama-like kv cache, i.e., [bs, n_head, seq_len, head_dim], use is_enough_kv_cache_room_4_31.
* Other models need to compare kv cache size with kv_len.
2024-01-26 17:30:08 +08:00
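The entry above distinguishes llama-style caches, whose pre-allocated [bs, n_head, seq_len, head_dim] buffer can be checked for room, from models that compare cache size with kv_len directly. A hypothetical sketch of the llama-style check; the helper name comes from the commit message, while the signature and logic are assumptions:

    import torch

    # Hypothetical room check for a llama-style kv cache (#10006).
    # past_key_value is assumed to be a (key, value) pair of tensors
    # shaped [bs, n_head, seq_len, head_dim].
    def is_enough_kv_cache_room_4_31(past_key_value, new_seq_len: int) -> bool:
        if past_key_value is None:
            return False
        key_states = past_key_value[0]
        # There is room if the allocated buffer already covers the new length.
        return key_states.size(2) >= new_seq_len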
Ruonan Wang
a00efa0564
LLM: add mlp & qkv fusion for FP16 Llama-7B ( #9932 )
* add mlp fusion for llama
* add mlp fusion
* fix style
* update
* add mm_qkv_out
* fix style
* update
* meet code review
* meet code review
2024-01-26 11:50:38 +08:00
Wang, Jian4
98ea3459e5
LLM: Fix llama draft_model dtype error ( #10005 )
* fix llama draft_model dtype error
* update
2024-01-26 10:59:48 +08:00
Yishuo Wang
aae1870096
fix qwen kv cache length ( #9998 )
2024-01-26 10:15:01 +08:00
Yishuo Wang
24b34b6e46
change xmx condition ( #10000 )
2024-01-25 17:48:11 +08:00
Yishuo Wang
bf65548d29
Add quantize kv cache support for chatglm2/3 ( #9996 )
2024-01-25 16:55:59 +08:00
Wang, Jian4
9bff84e6fd
LLM: Convert draft_model kv_cache from bf16 to fp32 ( #9964 )
* convert bf16 to fp32
* update
* change when init
* init first and cut off after
* init and exchange
* update python type
* update
* fix bug
* update
* update
2024-01-25 11:20:27 +08:00
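A minimal sketch of the bf16-to-fp32 cast described in #9964, assuming the standard transformers tuple-of-(key, value) kv cache layout:

    import torch

    # Cast a draft model's past_key_values from bf16 to fp32 (idea of #9964).
    # Assumes the usual tuple-of-(key, value) transformers layout.
    def kv_cache_bf16_to_fp32(past_key_values):
        return tuple(
            tuple(t.float() if t.dtype == torch.bfloat16 else t for t in layer)
            for layer in past_key_values
        )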
Yina Chen
27338540c3
Fix repetition_penalty not activated issue ( #9989 )
2024-01-25 10:40:41 +08:00
Yuwen Hu
b27e5a27b9
Remove the check for meta device in _replace_with_low_bit_linear ( #9984 )
2024-01-24 18:15:39 +08:00
Yina Chen
b176cad75a
LLM: Add baichuan2 gpu spec example ( #9973 )
* add baichuan2 gpu spec example
* update readme & example
* remove print
* fix typo
* meet comments
* revert
* update
2024-01-24 16:40:16 +08:00
Chen, Zhentao
e0db44dcb6
fix unexpected keyword argument 'device' ( #9982 )
* add device for chatglm3 only
* add comment for this change
* fix style
* fix style
* fix style again..
* finally fixed style
2024-01-24 13:20:46 +08:00
Yuwen Hu
8d28aa8e2b
[LLM] Fix the model.device problem when cpu_embedding=True ( #9971 )
* Overwrite the device attribute for CPUPinnedParam
* Expose cpu_embedding=True for Linux users
* Fix python style
2024-01-23 18:51:11 +08:00
Yishuo Wang
f82782cd3b
fix starcoder ( #9975 )
2024-01-23 17:24:53 +08:00
Yishuo Wang
2c8a9aaf0d
fix qwen causal mask when quantize_kv_cache=True ( #9968 )
2024-01-23 16:34:05 +08:00
Yina Chen
36c665667d
Add logits processor & qwen eos stop in speculative decoding ( #9963 )
* add logits processor & qwen eos
* fix style
* fix
* fix
* fix style
* fix style
* support transformers 4.31
* fix style
* fix style
---------
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-01-23 15:57:28 +08:00
Xin Qiu
da4687c917
fix fp16 ( #9970 )
2024-01-23 15:53:32 +08:00
Ruonan Wang
27b19106f3
LLM: add readme for speculative decoding gpu examples ( #9961 )
* add readme
* add readme
* meet code review
2024-01-23 12:54:19 +08:00
Chen, Zhentao
39219b7e9a
add default device meta when lcmu enabled ( #9941 )
2024-01-23 11:00:49 +08:00
Xin Qiu
dacf680294
add fused rotary pos emb for qwen ( #9956 )
* add fused rotary pos emb for qwen
* update
2024-01-23 10:37:56 +08:00
Ruonan Wang
7b1d9ad7c0
LLM: limit esimd sdp usage for k_len < 8 ( #9959 )
* update
* fix
2024-01-23 09:28:23 +08:00
Ruonan Wang
3e601f9a5d
LLM: Support speculative decoding in bigdl-llm ( #9951 )
* first commit
* fix error, add llama example
* hidden print
* update api usage
* change to api v3
* update
* meet code review
* meet code review, fix style
* add reference, fix style
* fix style
* fix first token time
2024-01-22 19:14:56 +08:00
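#9951 lands speculative decoding in bigdl-llm. As background, a compact greedy draft-then-verify sketch (batch size 1, HF-style causal LMs assumed); it illustrates the general technique, not the repo's actual implementation:

    import torch

    # Generic speculative decoding step: the draft model proposes k tokens,
    # the target model verifies them in one forward pass, and the longest
    # agreeing prefix is accepted. Placeholder models, greedy variant.
    @torch.no_grad()
    def speculative_step(draft_model, target_model, input_ids, k=4):
        ids = input_ids
        for _ in range(k):  # 1) draft k tokens greedily
            logits = draft_model(ids).logits[:, -1, :]
            ids = torch.cat([ids, logits.argmax(-1, keepdim=True)], dim=-1)
        proposed = ids[:, input_ids.size(1):]  # [1, k]

        # 2) target scores all proposed positions in a single pass
        tgt_logits = target_model(ids).logits
        tgt_pred = tgt_logits[:, input_ids.size(1) - 1:-1, :].argmax(-1)

        # 3) accept the longest prefix where draft and target agree
        agree = (proposed == tgt_pred).int().cumprod(dim=-1)
        n_accept = int(agree.sum())
        return torch.cat([input_ids, proposed[:, :n_accept]], dim=-1)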
Heyang Sun
fb91c97fe8
support for Baichuan/Baichuan2 13B Chat running speculative decoding ( #9921 )
* support for Baichuan/Baichuan2 13B Chat running speculative decoding
* fix style
2024-01-22 09:11:44 +08:00
Xin Qiu
97f0cd8975
optimize Decilm 7b ( #9922 )
* optimize deci
* update
* decilm attention forward
2024-01-19 17:31:13 +08:00
Wang, Jian4
bcaeb05272
Update optimize qwen ( #9943 )
* update for n tokens input
* fix dtype
* update
2024-01-19 16:54:59 +08:00
Ruonan Wang
bf37b3a670
LLM: optimize CPU speculative decoding of chatglm3 ( #9928 )
* update
* fix style
* meet code review
2024-01-19 14:10:22 +08:00
Shaojun Liu
967714bac8
gguf memory optimization for mixtral ( #9939 )
2024-01-19 11:13:15 +08:00
Lilac09
7032a2ad73
Optimize gguf load memory for mistral ( #9923 )
* optimize gguf load for mistral
* fix output of gguf mistral
* reset
2024-01-19 09:14:39 +08:00
Shaojun Liu
9a46f019d7
gguf memory optimization for baichuan ( #9937 )
2024-01-19 09:11:02 +08:00
Guancheng Fu
2e1448f08e
[Serving] Add vllm_worker to fastchat serving framework ( #9934 )
* add worker
* finish
* finish
* add license
* add more comments
2024-01-18 21:33:36 +08:00
Yishuo Wang
7bbb98abb6
Disable fused layer norm when using XMX to fix mpt UT ( #9933 )
2024-01-18 16:22:12 +08:00
Wang, Jian4
1fc9dfa265
LLM: Update for Qwen n tokens inputs ( #9931 )
* update for n tokens inputs
* update style
* update
2024-01-18 15:56:29 +08:00
Heyang Sun
5184f400f9
Fix Mixtral GGUF Wrong Output Issue ( #9930 )
* Fix Mixtral GGUF Wrong Output Issue
* fix style
* fix style
2024-01-18 14:11:27 +08:00
Yishuo Wang
453df868c9
add rwkv v5 attention kernel ( #9927 )
2024-01-18 10:16:29 +08:00
Ruonan Wang
054952f82f
LLM: Fix rope of chatglm3 to support speculative decoding on CPU ( #9926 )
2024-01-18 09:28:10 +08:00
Ziteng Zhang
18cd1f1432
[LLM]Solve the problem of calling bmm operator in BF16Linear ( #9924 )
* Solve the problem of calling bmm operator in BF16Linear
2024-01-17 18:08:35 +08:00
Yina Chen
98b86f83d4
Support fast rope for training ( #9745 )
* init
* init
* fix style
* add test and fix
* address comment
* update
* merge upstream main
2024-01-17 15:51:38 +08:00
Ruonan Wang
427f75000b
LLM: fix sdp of chatglm3 ( #9917 )
* fix
* fix
* fix
2024-01-17 13:37:28 +08:00
Yishuo Wang
94767da7cf
optimize rwkv v4 first token performance ( #9912 )
2024-01-17 09:27:41 +08:00
Shaojun Liu
b909c5c9c2
GGUF load memory optimization ( #9913 )
* block-wise
* convert linear for module
* revert
* Fix PEP8 check errors
2024-01-16 18:54:39 +08:00
Xin Qiu
dee32f7d15
copy fused rms norm's result to avoid <unk> ( #9909 )
2024-01-16 16:54:08 +08:00
Ruonan Wang
8d7326ae03
LLM: fix chatglm3 sdp to support speculative decoding ( #9900 )
* fix chatglm3
* fix
* update
* meet code review
* fix
2024-01-16 11:29:13 +08:00
Guancheng Fu
9f34da7cdb
Update PVC XMX condition ( #9901 )
* update pvc xmx condition
* update condition
* update condition
2024-01-15 15:42:15 +08:00
Yishuo Wang
6637860ddf
change xmx condition ( #9896 )
2024-01-12 19:51:48 +08:00
Ruonan Wang
d9cf55bce9
LLM: fix MLP check of mixtral ( #9891 )
2024-01-11 18:01:59 +08:00
Ziteng Zhang
4af88a67b9
support chatglm3 with bf16 ( #9888 )
* support chatglm3 with bigdl-bf16
2024-01-11 16:45:21 +08:00
Yuwen Hu
0aef35a965
[LLM] Improve LLM doc regarding windows gpu related info ( #9880 )
* Improve runtime configuration for windows
* Add python 310/311 supports for wheel downloading
* Add troubleshooting for windows gpu
* Remove manual ipex import due to auto importer
* Add info regarding cpu_embedding=True on iGPU
* More info for Windows users
* Small updates to API docs
* Python style fix
* Remove tip for loading from saved optimize_model for now
* Updated based on comments
* Update win info for multi-intel gpus selection
* Small fix
* Small fix
2024-01-11 14:37:16 +08:00
Ruonan Wang
53531ae4ee
LLM: support qkv fusion for fp8e5 ( #9878 )
* update
* add mistral
* meet code review
2024-01-10 17:50:00 +08:00
Lilac09
cb32b985ec
add mistral and chatglm support to vllm ( #9879 )
* add mistral and chatglm support to vllm
* add mistral and chatglm support to vllm
2024-01-10 15:38:42 +08:00
Ruonan Wang
3e05c9e11b
LLM: update esimd sdp kernel ( #9871 )
2024-01-09 18:10:01 +08:00