Commit graph

295 commits

Author SHA1 Message Date
binbin Deng
4ff2ca9d0d LLM: fix loss error on Arc (#9550) 2023-11-29 15:16:18 +08:00
Yishuo Wang
65121c7997 support loading q4_1/q5_0/q5_1/q8_0 gguf model (#9546) 2023-11-29 14:40:37 +08:00
Yuwen Hu
5f5ca38b74 [LLM Doc] Fix api doc rendering error (#9542)
* Fix api rendering error

* Fix python style
2023-11-29 09:17:09 +08:00
Yishuo Wang
a86c6e0b56 [LLM] support loading gguf model (#9544) 2023-11-28 15:51:15 +08:00
Xiangyu Tian
916c338772 fix bugs in vllm length check (#9543) 2023-11-28 11:09:54 +08:00
Zhao Changmin
e7e0cd3b5e CPU Pinned embedding Layer (#9538)
* CPU Pinned embedding
2023-11-28 09:46:31 +08:00
Guancheng Fu
963a5c8d79 Add vLLM-XPU version's README/examples (#9536)
* test

* test

* fix last kv cache

* add xpu readme

* remove numactl for xpu example

* fix link error

* update max_num_batched_tokens logic

* add explaination

* add xpu environement version requirement

* refine gpu memory

* fix

* fix style
2023-11-28 09:44:03 +08:00
Guancheng Fu
b6c3520748 Remove xformers from vLLM-CPU (#9535) 2023-11-27 11:21:25 +08:00
binbin Deng
6bec0faea5 LLM: support Mistral AWQ models (#9520) 2023-11-24 16:20:22 +08:00
Ruonan Wang
914a5a5a27 LLM: fix abnormal Mistral GPU accuracy by updating rms_norm (#9529) 2023-11-24 15:37:50 +08:00
SONG Ge
3d24823cda hot-fix mistral kv_cache (#9528) 2023-11-24 14:33:04 +08:00
Zhao Changmin
42b7a16bc5 Replace torch.bmm with safe_bmm (#9519)
* replace bmm with safe one

* rename args and deprecated warning
2023-11-24 12:16:48 +08:00
Ruonan Wang
b63aae8a8e LLM: add flash attention support for llama (#9518)
* add initial flash attention for llama

* accelerate fp32 first token by changing to fp16 in advance

* support fp32
2023-11-23 18:40:18 +08:00
Guancheng Fu
bf579507c2 Integrate vllm (#9310)
* done

* Rename structure

* add models

* Add structure/sampling_params,sequence

* add input_metadata

* add outputs

* Add policy,logger

* add and update

* add parallelconfig back

* core/scheduler.py

* Add llm_engine.py

* Add async_llm_engine.py

* Add tested entrypoint

* fix minor error

* Fix everything

* fix kv cache view

* fix

* fix

* fix

* format&refine

* remove logger from repo

* try to add token latency

* remove logger

* Refine config.py

* finish worker.py

* delete utils.py

* add license

* refine

* refine sequence.py

* remove sampling_params.py

* finish

* add license

* format

* add license

* refine

* refine

* Refine line too long

* remove exception

* so dumb style-check

* refine

* refine

* refine

* refine

* refine

* refine

* add README

* refine README

* add warning instead error

* fix padding

* add license

* format

* format

* format fix

* Refine vllm dependency (#1)

vllm dependency clear

* fix licence

* fix format

* fix format

* fix

* adapt LLM engine

* fix

* add license

* fix format

* fix

* Moving README.md to the correct position

* Fix readme.md

* done

* guide for adding models

* fix

* Fix README.md

* Add new model readme

* remove ray-logic

* refactor arg_utils.py

* remove distributed_init_method logic

* refactor entrypoints

* refactor input_metadata

* refactor model_loader

* refactor utils.py

* refactor models

* fix api server

* remove vllm.stucture

* revert by txy 1120

* remove utils

* format

* fix license

* add bigdl model

* Refer to a specfic commit

* Change code base

* add comments

* add async_llm_engine comment

* refine

* formatted

* add worker comments

* add comments

* add comments

* fix style

* add changes

---------

Co-authored-by: xiangyuT <xiangyu.tian@intel.com>
Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2023-11-23 16:46:45 +08:00
Qiyuan Gong
0f0c6bb631 [LLM] Fix Qwen registered_causal_mask is None (#9513)
* Add registered_causal_mask init based on 2abd8e5777.
2023-11-23 09:28:04 +08:00
Ruonan Wang
076d106ef5 LLM: GPU QLoRA update to bf16 to accelerate gradient checkpointing (#9499)
* update to bf16 to accelerate gradient checkpoint

* add utils and fix ut
2023-11-21 17:08:36 +08:00
Xin Qiu
50b01058f1 enable new q4_1 (#9479) 2023-11-17 14:58:57 +08:00
Zhao Changmin
30abd304a7 LLM: Fix baichuan pre-normalize model tensor assigning issue when loading (#9481)
* No need to normalized when loading
2023-11-16 21:57:28 +08:00
Ruonan Wang
c0ef70df02 llm: quick fix of fast_rms_norm (#9480) 2023-11-16 14:42:16 +08:00
Yina Chen
d5263e6681 Add awq load support (#9453)
* Support directly loading GPTQ models from huggingface

* fix style

* fix tests

* change example structure

* address comments

* fix style

* init

* address comments

* add examples

* fix style

* fix style

* fix style

* fix style

* update

* remove

* meet comments

* fix style

---------

Co-authored-by: Yang Wang <yang3.wang@intel.com>
2023-11-16 14:06:25 +08:00
Ruonan Wang
d2c064124a LLM: update rms related usage to suport ipex 2.1 new api (#9466)
* update rms related usage

* fix style
2023-11-16 11:21:50 +08:00
Yuwen Hu
731b0aaade Empty cache after embedding to cpu (#9477) 2023-11-16 10:52:30 +08:00
Yang Wang
51d07a9fd8 Support directly loading gptq models from huggingface (#9391)
* Support directly loading GPTQ models from huggingface

* fix style

* fix tests

* change example structure

* address comments

* fix style

* address comments
2023-11-13 20:48:12 -08:00
SONG Ge
2888818b3a [LLM] Support mixed_fp8 on Arc (#9415)
* ut gpu allocation memory fix

* support mix_8bit on arc

* rename mixed_4bit to mixed_fp4 and mixed_8bit to mixed_fp8

* revert unexpected changes

* revert unexpected changes

* unify common logits

* rename in llm xmx_checker

* fix typo error and re-unify
2023-11-13 09:26:30 +08:00
Heyang Sun
df8e4d7889 [LLM] apply allreduce and bias to training in LowBitLinear (#9395) 2023-11-09 14:35:54 +08:00
Wang, Jian4
40cead6b5b LLM: Fix CPU qlora dtype convert issue (#9394) 2023-11-09 14:34:01 +08:00
Ruonan Wang
bfca76dfa7 LLM: optimize QLoRA by updating lora convert logic (#9372)
* update convert logic of qlora

* update

* refactor and further improve performance

* fix style

* meet code review
2023-11-08 17:46:49 +08:00
Ruonan Wang
7e8fb29b7c LLM: optimize QLoRA by reducing convert time (#9370) 2023-11-08 13:14:34 +08:00
Yishuo Wang
bfd9f88f0d [LLM] Use fp32 as dtype when batch_size <=8 and qtype is q4_0/q8_0/fp8 (#9365) 2023-11-08 09:54:53 +08:00
Heyang Sun
fae6db3ddc [LLM] refactor cpu low-bit forward logic (#9366)
* [LLM] refactor cpu low-bit forward logic

* fix style

* Update low_bit_linear.py

* Update low_bit_linear.py

* refine
2023-11-07 15:09:16 +08:00
Heyang Sun
af94058203 [LLM] Support CPU deepspeed distributed inference (#9259)
* [LLM] Support CPU Deepspeed distributed inference

* Update run_deepspeed.py

* Rename

* fix style

* add new codes

* refine

* remove annotated codes

* refine

* Update README.md

* refine doc and example code
2023-11-06 17:56:42 +08:00
Xin Qiu
1420e45cc0 Chatglm2 rope optimization on xpu (#9350) 2023-11-06 13:56:34 +08:00
Yuwen Hu
a0150bb205 [LLM] Move embedding layer to CPU for iGPU inference (#9343)
* Move embedding layer to CPU for iGPU llm inference

* Empty cache after to cpu

* Remove empty cache as it seems to have some negative effect to first token
2023-11-03 11:13:45 +08:00
Yishuo Wang
726203d778 [LLM] Replace Embedding layer to fix it on CPU (#9254) 2023-11-01 13:58:10 +08:00
Yang Wang
e1bc18f8eb fix import ipex problem (#9323)
* fix import ipex problem

* fix style
2023-10-31 20:31:34 -07:00
Yina Chen
2262ae4d13 Support MoFQ4 on arc (#9301)
* init

* update

* fix style

* fix style

* fix style

* meet comments
2023-11-01 10:59:46 +08:00
Yang Wang
163d033616 Support qlora in CPU (#9233)
* support qlora in CPU

* revert example

* fix style
2023-10-27 14:01:15 -07:00
Cengguang Zhang
44b5fcc190 LLM: fix pretraining_tp argument issue. (#9281) 2023-10-26 18:43:58 +08:00
WeiguangHan
6b2a32eba2 LLM: add missing function for PyTorch InternLM model (#9285) 2023-10-26 18:05:23 +08:00
Yina Chen
f879c48f98 fp8 convert use ggml code (#9277) 2023-10-26 17:03:29 +08:00
Yina Chen
e2264e8845 Support arc fp4 (#9266)
* support arc fp4

* fix style

* fix style
2023-10-25 15:42:48 +08:00
Yang Wang
067c7e8098 Support deepspeed AutoTP (#9230)
* Support deepspeed

* add test script

* refactor convert

* refine example

* refine

* refine example

* fix style

* refine example and adapte latest ipex

* fix style
2023-10-24 23:46:28 -07:00
Jin Qiao
90162264a3 LLM: replace torch.float32 with auto type (#9261) 2023-10-24 17:12:13 +08:00
SONG Ge
bd5215d75b [LLM] Reimplement chatglm fuse rms optimization (#9260)
* re-implement chatglm rope rms

* update
2023-10-24 16:35:12 +08:00
SONG Ge
bfc1e2d733 add fused rms optimization for chatglm model (#9256) 2023-10-24 14:40:58 +08:00
Guancheng Fu
f37547249d Refine README/CICD (#9253) 2023-10-24 12:56:03 +08:00
binbin Deng
db37edae8a LLM: update langchain api document page (#9222) 2023-10-24 10:13:41 +08:00
Wang, Jian4
c14a61681b Add load low-bit in model-serving for reduce EPC (#9239)
* init load low-bit

* fix

* fix
2023-10-23 11:28:20 +08:00
Yina Chen
0383306688 Add arc fp8 support (#9232)
* add fp8 support

* add log

* fix style
2023-10-20 17:15:07 +08:00
Yang Wang
118249b011 support transformers 4.34+ for llama (#9229) 2023-10-19 22:36:30 -07:00
Chen, Zhentao
5850241423 correct Readme GPU example and API docstring (#9225)
* update readme to correct GPU usage

* update from_pretrained supported low bit options

* fix stype check
2023-10-19 16:08:47 +08:00
Yang Wang
b0ddde0410 Fix removing convert dtype bug (#9216)
* Fix removing convert dtype bug

* fix style
2023-10-18 11:24:22 -07:00
Ruonan Wang
942d6418e7 LLM: fix chatglm kv cache (#9215) 2023-10-18 19:09:53 +08:00
SONG Ge
0765f94770 [LLM] Optimize kv_cache for mistral model family (#9189)
* add kv_cache optimization for mistral model

* kv_cache optimize for mistral

* update stylr

* update
2023-10-18 15:13:37 +08:00
Ruonan Wang
3555ebc148 LLM: fix wrong length in gptj kv_cache optimization (#9210)
* fix wrong length in gptj kv cache

* update
2023-10-18 14:59:02 +08:00
Shengsheng Huang
6dad8d16df optimize NormHead for Baichuan2 (#9205)
* optimize NormHead for Baichuan2

* fix ut and change name

* rename functions
2023-10-18 14:05:07 +08:00
Ruonan Wang
09815f7064 LLM: fix RMSNorm optimization of Baichuan2-13B/Baichuan-13B (#9204)
* fix rmsnorm of baichuan2-13B

* update baichuan1-13B too

* fix style
2023-10-17 18:40:34 +08:00
Ruonan Wang
c0497ab41b LLM: support kv_cache optimization for Qwen-VL-Chat (#9193)
* dupport qwen_vl_chat

* fix style
2023-10-17 13:33:56 +08:00
binbin Deng
1cd9ab15b8 LLM: fix ChatGLMConfig check (#9191) 2023-10-17 11:52:56 +08:00
Yang Wang
7160afd4d1 Support XPU DDP training and autocast for LowBitMatmul (#9167)
* support autocast in low bit matmul

* Support XPU DDP training

* fix  amp
2023-10-16 20:47:19 -07:00
Ruonan Wang
77afb8796b LLM: fix convert of chatglm (#9190) 2023-10-17 10:48:13 +08:00
dingbaorong
af3b575c7e expose modules_to_not_convert in optimize_model (#9180)
* expose modules_to_not_convert in optimize_model

* some fixes
2023-10-17 09:50:26 +08:00
Cengguang Zhang
5ca8a851e9 LLM: add fuse optimization for Mistral. (#9184)
* add fuse optimization for mistral.

* fix.

* fix

* fix style.

* fix.

* fix error.

* fix style.

* fix style.
2023-10-16 16:50:31 +08:00
Jiao Wang
49e1381c7f update rope (#9155) 2023-10-15 21:51:45 -07:00
binbin Deng
a164c24746 LLM: add kv_cache optimization for chatglm2-6b-32k (#9165) 2023-10-16 10:43:15 +08:00
Yang Wang
7a2de00b48 Fixes for xpu Bf16 training (#9156)
* Support bf16 training

* Use a stable transformer version

* remove env

* fix style
2023-10-14 21:28:59 -07:00
Cengguang Zhang
51a133de56 LLM: add fuse rope and norm optimization for Baichuan. (#9166)
* add fuse rope optimization.

* add rms norm optimization.
2023-10-13 17:36:52 +08:00
Cengguang Zhang
433f408081 LLM: Add fuse rope and norm optimization for Aquila. (#9161)
* add fuse norm optimization.

* add fuse rope optimization
2023-10-13 14:18:37 +08:00
SONG Ge
e7aa67e141 [LLM] Add rope optimization for internlm (#9159)
* add rope and norm optimization for internlm and gptneox

* revert gptneox back and split with pr#9155 #

* add norm_forward

* style fix

* update

* update
2023-10-13 14:18:28 +08:00
Ruonan Wang
b8aee7bb1b LLM: Fix Qwen kv_cache optimization (#9148)
* first commit

* ut pass

* accelerate rotate half by using common util function

* fix style
2023-10-12 15:49:42 +08:00
binbin Deng
69942d3826 LLM: fix model check before attention optimization (#9149) 2023-10-12 15:21:51 +08:00
binbin Deng
eb3fb18eb4 LLM: improve PyTorch API doc (#9128) 2023-10-11 15:03:39 +08:00
Zhao Changmin
1709beba5b LLM: Explicitly close pickle file pointer before removing temporary directory (#9120)
* fp close
2023-10-10 14:57:23 +08:00
binbin Deng
e4d1457a70 LLM: improve transformers style API doc (#9113) 2023-10-10 09:31:00 +08:00
Zhao Changmin
edccfb2ed3 LLM: Check model device type (#9092)
* check model device
2023-10-09 15:49:15 +08:00
Yina Chen
4c4f8d1663 [LLM]Fix Arc falcon abnormal output issue (#9096)
* update

* update

* fix error & style

* fix style

* update train

* to input_seq_size
2023-10-09 15:09:37 +08:00
Zhao Changmin
548e4dd5fe LLM: Adapt transformers models for optimize model SL (#9022)
* LLM: Adapt transformers model for SL
2023-10-09 11:13:44 +08:00
Ruonan Wang
f64257a093 LLM: basic api support for esimd fp16 (#9067)
* basic api support for fp16

* fix style

* fix

* fix error and style

* fix style

* meet code review

* update based on comments
2023-10-09 11:05:17 +08:00
Xin Qiu
b3e94a32d4 change log4error import (#9098) 2023-10-08 09:23:28 +08:00
Kai Huang
78ea7ddb1c Combine apply_rotary_pos_emb for gpt-neox (#9074) 2023-10-07 16:27:46 +08:00
Yang Wang
36dd4afd61 Fix llama when rope scaling is not None (#9086)
* Fix llama when rope scaling is not None

* fix style

* fix style
2023-10-06 13:27:37 -07:00
Yang Wang
fcb1c618a0 using bigdl-llm fused rope for llama (#9066)
* optimize llama xpu rope

* fix bug

* fix style

* refine append cache

* remove check

* do not cache cos sin

* remove unnecessary changes

* clean up

* fix style

* check for training
2023-10-06 09:57:29 -07:00
Jiao Wang
aefa5a5bfe Qwen kv cache (#9079)
* qwen and aquila

* update

* update

* style
2023-10-05 11:59:17 -07:00
Jiao Wang
d5ca1f32b6 Aquila KV cache optimization (#9080)
* update

* update

* style
2023-10-05 11:10:57 -07:00
Yang Wang
88565c76f6 add export merged model example (#9018)
* add export merged model example

* add sources

* add script

* fix style
2023-10-04 21:18:52 -07:00
Yang Wang
0cd8f1c79c Use ipex fused rms norm for llama (#9081)
* also apply rmsnorm

* fix cpu
2023-10-04 21:04:55 -07:00
Cengguang Zhang
fb883100e7 LLM: support chatglm-18b convert attention forward in benchmark scripts. (#9072)
* add chatglm-18b convert.

* fix if statement.

* fix
2023-09-28 14:04:52 +08:00
Yishuo Wang
6de2189e90 [LLM] fix chatglm main choice (#9073) 2023-09-28 11:23:37 +08:00
Cengguang Zhang
b4a1266ef0 [WIP] LLM: add kv cache support for internlm. (#9036)
* LLM: add kv cache support for internlm

* add internlm apply_rotary_pos_emb

* fix.

* fix style.
2023-09-25 14:16:59 +08:00
Ruonan Wang
975da86e00 LLM: fix gptneox kv cache (#9044) 2023-09-25 13:03:57 +08:00
Jiao Wang
028a6d9383 MPT model optimize for long sequence (#9020)
* mpt_long_seq

* update

* update

* update

* style

* style2

* update
2023-09-21 21:27:23 -07:00
Ruonan Wang
b943d73844 LLM: refactor kv cache (#9030)
* refactor utils

* meet code review; update all models

* small fix
2023-09-21 21:28:03 +08:00
Cengguang Zhang
868511cf02 LLM: fix kv cache issue of bloom and falcon. (#9029) 2023-09-21 18:12:20 +08:00
Ruonan Wang
bf51ec40b2 LLM: Fix empty cache (#9024)
* fix

* fix

* update example
2023-09-21 17:16:07 +08:00
Yina Chen
714884414e fix error (#9025) 2023-09-21 16:42:11 +08:00
SONG Ge
fa47967583 [LLM] Optimize kv_cache for gptj model family (#9010)
* optimize gptj model family attention

* add license and comment for dolly-model

* remove xpu mentioned

* remove useless info

* code sytle

* style fix

* code style in gptj fix

* remove gptj arch

* move apply_rotary_pos_emb into utils

* kv_seq_length update

* use hidden_states instead of query layer to reach batch size
2023-09-21 10:42:08 +08:00
Cengguang Zhang
b3cad7de57 LLM: add bloom kv cache support (#9012)
* LLM: add bloom kv cache support

* fix style.
2023-09-20 21:10:53 +08:00
Kai Huang
156af15d1e Add NF3 (#9008)
* add nf3

* grammar
2023-09-20 20:03:07 +08:00
Kai Huang
6981745fe4 Optimize kv_cache for gpt-neox model family (#9015)
* override gptneox

* style

* move to utils

* revert
2023-09-20 19:59:19 +08:00
Cengguang Zhang
735a17f7b4 LLM: add kv cache to falcon family. (#8995)
* add kv cache to falcon family.

* fix: import error.

* refactor

* update comments.

* add two version falcon attention forward.

* fix

* fix.

* fix.

* fix.

* fix style.

* fix style.
2023-09-20 15:36:30 +08:00