382 commits

Author SHA1 Message Date
Yishuo Wang
bf65548d29 Add quantize kv cache support for chatglm2/3 (#9996) 2024-01-25 16:55:59 +08:00
Wang, Jian4
9bff84e6fd LLM: Convert draft_model kv_cache from bf16 to fp32 (#9964)
* convert bf16 to fp32

* update

* change when init

* init first and cut off after

* init and exchange

* update python type

* update

* fix bug

* update

* update
2024-01-25 11:20:27 +08:00
Yina Chen
27338540c3 Fix repetition_penalty not activated issue (#9989) 2024-01-25 10:40:41 +08:00
Yuwen Hu
b27e5a27b9 Remove the check for meta device in _replace_with_low_bit_linear (#9984) 2024-01-24 18:15:39 +08:00
Yina Chen
b176cad75a LLM: Add baichuan2 gpu spec example (#9973)
* add baichuan2 gpu spec example

* update readme & example

* remove print

* fix typo

* meet comments

* revert

* update
2024-01-24 16:40:16 +08:00
Chen, Zhentao
e0db44dcb6 fix unexpected keyword argument 'device' (#9982)
* add device for chatglm3 only

* add comment for this change

* fix style

* fix style

* fix style again..

* finally fixed style
2024-01-24 13:20:46 +08:00
Yuwen Hu
8d28aa8e2b [LLM] Fix the model.device problem when cpu_embedding=True (#9971)
* Overwrite the device attribute for CPUPinnedParam

* Expose cpu_embedding=True for Linux users

* Fix python style
2024-01-23 18:51:11 +08:00
Yishuo Wang
f82782cd3b fix starcoder (#9975) 2024-01-23 17:24:53 +08:00
Yishuo Wang
2c8a9aaf0d fix qwen causal mask when quantize_kv_cache=True (#9968) 2024-01-23 16:34:05 +08:00
Yina Chen
36c665667d Add logits processor & qwen eos stop in speculative decoding (#9963)
* add logits processor & qwen eos

* fix style

* fix

* fix

* fix style

* fix style

* support transformers 4.31

* fix style

* fix style

---------

Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-01-23 15:57:28 +08:00
Xin Qiu
da4687c917 fix fp16 (#9970) 2024-01-23 15:53:32 +08:00
Ruonan Wang
27b19106f3 LLM: add readme for speculative decoding gpu examples (#9961)
* add readme

* add readme

* meet code review
2024-01-23 12:54:19 +08:00
Chen, Zhentao
39219b7e9a add default device meta when lcmu enabled (#9941) 2024-01-23 11:00:49 +08:00
Xin Qiu
dacf680294 add fused rotary pos emb for qwen (#9956)
* add fused rotary pos emb for qwen

* update
2024-01-23 10:37:56 +08:00
Ruonan Wang
7b1d9ad7c0 LLM: limit esimd sdp usage for k_len < 8 (#9959)
* update

* fix
2024-01-23 09:28:23 +08:00
Ruonan Wang
3e601f9a5d LLM: Support speculative decoding in bigdl-llm (#9951)
* first commit

* fix error, add llama example

* hidden print

* update api usage

* change to api v3

* update

* meet code review

* meet code review, fix style

* add reference, fix style

* fix style

* fix first token time
2024-01-22 19:14:56 +08:00
Heyang Sun
fb91c97fe8 support for Baichuan/Baichuan2 13B Chat running speculative decoding (#9921)
* support for Baichuan/Baichuan2 13B Chat running speculative decoding

* fix style
2024-01-22 09:11:44 +08:00
Xin Qiu
97f0cd8975 optimize Decilm 7b (#9922)
* optimize deci

* update

* decilm attention forward
2024-01-19 17:31:13 +08:00
Wang, Jian4
bcaeb05272 Update optimize qwen (#9943)
* update for n tokens input

* fix dtype

* update
2024-01-19 16:54:59 +08:00
Ruonan Wang
bf37b3a670 LLM: optimize CPU speculative decoding of chatglm3 (#9928)
* update

* fix style

* meet code review
2024-01-19 14:10:22 +08:00
Shaojun Liu
967714bac8 gguf memory optimization for mixtral (#9939) 2024-01-19 11:13:15 +08:00
Lilac09
7032a2ad73 Optimize gguf load memory for mistral (#9923)
* optimize gguf load for mistral

* fix output of gguf mistral

* reset
2024-01-19 09:14:39 +08:00
Shaojun Liu
9a46f019d7 gguf memory optimization for baichuan (#9937) 2024-01-19 09:11:02 +08:00
Guancheng Fu
2e1448f08e [Serving] Add vllm_worker to fastchat serving framework (#9934)
* add worker

* finish

* finish

* add license

* add more comments
2024-01-18 21:33:36 +08:00
Yishuo Wang
7bbb98abb6 Disable fused layer norm when using XMX to fix mpt UT (#9933) 2024-01-18 16:22:12 +08:00
Wang, Jian4
1fc9dfa265 LLM: Update for Qwen n tokens inputs (#9931)
* update for n tokens inputs

* update style

* update
2024-01-18 15:56:29 +08:00
Heyang Sun
5184f400f9 Fix Mixtral GGUF Wrong Output Issue (#9930)
* Fix Mixtral GGUF Wrong Output Issue

* fix style

* fix style
2024-01-18 14:11:27 +08:00
Yishuo Wang
453df868c9 add rwkv v5 attention kernel (#9927) 2024-01-18 10:16:29 +08:00
Ruonan Wang
054952f82f LLM: Fix rope of chatglm3 to support speculative decoding on CPU (#9926) 2024-01-18 09:28:10 +08:00
Ziteng Zhang
18cd1f1432 [LLM]Solve the problem of calling bmm operator in BF16Linear (#9924)
* Solve the problem of calling bmm operator in BF16Linear
2024-01-17 18:08:35 +08:00
Yina Chen
98b86f83d4 Support fast rope for training (#9745)
* init

* init

* fix style

* add test and fix

* address comment

* update

* merge upstream main
2024-01-17 15:51:38 +08:00
Ruonan Wang
427f75000b LLM: fix sdp of chatglm3 (#9917)
* fix

* fix

* fix
2024-01-17 13:37:28 +08:00
Yishuo Wang
94767da7cf optimize rwkv v4 first token performance (#9912) 2024-01-17 09:27:41 +08:00
Shaojun Liu
b909c5c9c2 GGUF load memory optimization (#9913)
* block-wise

* convert linear for module

* revert

* Fix PEP8 checks Error
2024-01-16 18:54:39 +08:00
Xin Qiu
dee32f7d15 copy fused rms norm's result to avoid <unk> (#9909) 2024-01-16 16:54:08 +08:00
Ruonan Wang
8d7326ae03 LLM: fix chatglm3 sdp to support speculative decoding (#9900)
* fix chatglm3

* fix

* update

* meet code review

* fix
2024-01-16 11:29:13 +08:00
Guancheng Fu
9f34da7cdb Update PVC XMX condition (#9901)
* update pvc xmx condition

* update condition

* update condition
2024-01-15 15:42:15 +08:00
Yishuo Wang
6637860ddf change xmx condition (#9896) 2024-01-12 19:51:48 +08:00
Ruonan Wang
d9cf55bce9 LLM: fix MLP check of mixtral (#9891) 2024-01-11 18:01:59 +08:00
Ziteng Zhang
4af88a67b9 support chatglm3 with bf16 (#9888)
* support chatglm3 with bigdl-bf16
2024-01-11 16:45:21 +08:00
Yuwen Hu
0aef35a965 [LLM] Improve LLM doc regarding windows gpu related info (#9880)
* Improve runtime configuration for windows

* Add python 310/311 supports for wheel downloading

* Add troubleshooting for windows gpu

* Remove manually import ipex due to auto importer

* Add info regarding cpu_embedding=True on iGPU

* More info for Windows users

* Small updates to API docs

* Python style fix

* Remove tip for loading from saved optimize_model for now

* Updated based on comments

* Update win info for multi-intel gpus selection

* Small fix

* Small fix
2024-01-11 14:37:16 +08:00
Ruonan Wang
53531ae4ee LLM: support qkv fusion for fp8e5 (#9878)
* update

* add mistral

* meet code review
2024-01-10 17:50:00 +08:00
Lilac09
cb32b985ec add mistral and chatglm support to vllm (#9879)
* add mistral and chatglm support to vllm

* add mistral and chatglm support to vllm
2024-01-10 15:38:42 +08:00
Ruonan Wang
3e05c9e11b LLM: update esimd sdp kernel (#9871) 2024-01-09 18:10:01 +08:00
Yishuo Wang
36496d60ac only use quantize kv cache on MTL (#9862) 2024-01-09 13:24:02 +08:00
ZehuaCao
146076bdb5 Support llm-awq backend (#9856)
* Support for LLM-AWQ Backend

* fix

* Update README.md

* Add awqconfig

* modify init

* update

* support llm-awq

* fix style

* fix style

* update

* fix AwqBackendPackingMethod not found error

* fix style

* update README

* fix style

---------

Co-authored-by: Uxito-Ada <414416158@qq.com>
Co-authored-by: Heyang Sun <60865256+Uxito-Ada@users.noreply.github.com>
Co-authored-by: cyita <yitastudy@gmail.com>
2024-01-09 13:07:32 +08:00
Ruonan Wang
fea6f16057 LLM: add mlp fusion for fp8e5 and update related check (#9860)
* update mlp fusion

* fix style

* update
2024-01-09 09:56:32 +08:00
Jiao Wang
3b6372ab12 Fix Llama transformers 4.36 support (#9852)
* support 4.36

* style

* update

* update

* update

* fix merge

* update
2024-01-08 00:32:23 -08:00
Chen, Zhentao
1b585b0d40 set fp8 default as e5m2 (#9859) 2024-01-08 15:53:57 +08:00
Ruonan Wang
dc995006cc LLM: add flash attention for mistral / mixtral (#9846)
* add flash attention for mistral

* update

* add flash attn for mixtral

* fix style
2024-01-08 09:51:34 +08:00