Commit graph

233 commits

Author SHA1 Message Date
Ruonan Wang
b63aae8a8e LLM: add flash attention support for llama (#9518)
* add initial flash attention for llama

* accelerate fp32 first token by changing to fp16 in advance

* support fp32
2023-11-23 18:40:18 +08:00
Guancheng Fu
bf579507c2 Integrate vllm (#9310)
* done

* Rename structure

* add models

* Add structure/sampling_params,sequence

* add input_metadata

* add outputs

* Add policy,logger

* add and update

* add parallelconfig back

* core/scheduler.py

* Add llm_engine.py

* Add async_llm_engine.py

* Add tested entrypoint

* fix minor error

* Fix everything

* fix kv cache view

* fix

* fix

* fix

* format&refine

* remove logger from repo

* try to add token latency

* remove logger

* Refine config.py

* finish worker.py

* delete utils.py

* add license

* refine

* refine sequence.py

* remove sampling_params.py

* finish

* add license

* format

* add license

* refine

* refine

* Refine line too long

* remove exception

* so dumb style-check

* refine

* refine

* refine

* refine

* refine

* refine

* add README

* refine README

* add warning instead error

* fix padding

* add license

* format

* format

* format fix

* Refine vllm dependency (#1)

vllm dependency clear

* fix licence

* fix format

* fix format

* fix

* adapt LLM engine

* fix

* add license

* fix format

* fix

* Moving README.md to the correct position

* Fix readme.md

* done

* guide for adding models

* fix

* Fix README.md

* Add new model readme

* remove ray-logic

* refactor arg_utils.py

* remove distributed_init_method logic

* refactor entrypoints

* refactor input_metadata

* refactor model_loader

* refactor utils.py

* refactor models

* fix api server

* remove vllm.stucture

* revert by txy 1120

* remove utils

* format

* fix license

* add bigdl model

* Refer to a specfic commit

* Change code base

* add comments

* add async_llm_engine comment

* refine

* formatted

* add worker comments

* add comments

* add comments

* fix style

* add changes

---------

Co-authored-by: xiangyuT <xiangyu.tian@intel.com>
Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2023-11-23 16:46:45 +08:00
Qiyuan Gong
0f0c6bb631 [LLM] Fix Qwen registered_causal_mask is None (#9513)
* Add registered_causal_mask init based on 2abd8e5777.
2023-11-23 09:28:04 +08:00
Ruonan Wang
076d106ef5 LLM: GPU QLoRA update to bf16 to accelerate gradient checkpointing (#9499)
* update to bf16 to accelerate gradient checkpoint

* add utils and fix ut
2023-11-21 17:08:36 +08:00
Xin Qiu
50b01058f1 enable new q4_1 (#9479) 2023-11-17 14:58:57 +08:00
Zhao Changmin
30abd304a7 LLM: Fix baichuan pre-normalize model tensor assigning issue when loading (#9481)
* No need to normalized when loading
2023-11-16 21:57:28 +08:00
Ruonan Wang
c0ef70df02 llm: quick fix of fast_rms_norm (#9480) 2023-11-16 14:42:16 +08:00
Yina Chen
d5263e6681 Add awq load support (#9453)
* Support directly loading GPTQ models from huggingface

* fix style

* fix tests

* change example structure

* address comments

* fix style

* init

* address comments

* add examples

* fix style

* fix style

* fix style

* fix style

* update

* remove

* meet comments

* fix style

---------

Co-authored-by: Yang Wang <yang3.wang@intel.com>
2023-11-16 14:06:25 +08:00
Ruonan Wang
d2c064124a LLM: update rms related usage to suport ipex 2.1 new api (#9466)
* update rms related usage

* fix style
2023-11-16 11:21:50 +08:00
Yuwen Hu
731b0aaade Empty cache after embedding to cpu (#9477) 2023-11-16 10:52:30 +08:00
Yang Wang
51d07a9fd8 Support directly loading gptq models from huggingface (#9391)
* Support directly loading GPTQ models from huggingface

* fix style

* fix tests

* change example structure

* address comments

* fix style

* address comments
2023-11-13 20:48:12 -08:00
SONG Ge
2888818b3a [LLM] Support mixed_fp8 on Arc (#9415)
* ut gpu allocation memory fix

* support mix_8bit on arc

* rename mixed_4bit to mixed_fp4 and mixed_8bit to mixed_fp8

* revert unexpected changes

* revert unexpected changes

* unify common logits

* rename in llm xmx_checker

* fix typo error and re-unify
2023-11-13 09:26:30 +08:00
Heyang Sun
df8e4d7889 [LLM] apply allreduce and bias to training in LowBitLinear (#9395) 2023-11-09 14:35:54 +08:00
Wang, Jian4
40cead6b5b LLM: Fix CPU qlora dtype convert issue (#9394) 2023-11-09 14:34:01 +08:00
Ruonan Wang
bfca76dfa7 LLM: optimize QLoRA by updating lora convert logic (#9372)
* update convert logic of qlora

* update

* refactor and further improve performance

* fix style

* meet code review
2023-11-08 17:46:49 +08:00
Ruonan Wang
7e8fb29b7c LLM: optimize QLoRA by reducing convert time (#9370) 2023-11-08 13:14:34 +08:00
Yishuo Wang
bfd9f88f0d [LLM] Use fp32 as dtype when batch_size <=8 and qtype is q4_0/q8_0/fp8 (#9365) 2023-11-08 09:54:53 +08:00
Heyang Sun
fae6db3ddc [LLM] refactor cpu low-bit forward logic (#9366)
* [LLM] refactor cpu low-bit forward logic

* fix style

* Update low_bit_linear.py

* Update low_bit_linear.py

* refine
2023-11-07 15:09:16 +08:00
Heyang Sun
af94058203 [LLM] Support CPU deepspeed distributed inference (#9259)
* [LLM] Support CPU Deepspeed distributed inference

* Update run_deepspeed.py

* Rename

* fix style

* add new codes

* refine

* remove annotated codes

* refine

* Update README.md

* refine doc and example code
2023-11-06 17:56:42 +08:00
Xin Qiu
1420e45cc0 Chatglm2 rope optimization on xpu (#9350) 2023-11-06 13:56:34 +08:00
Yuwen Hu
a0150bb205 [LLM] Move embedding layer to CPU for iGPU inference (#9343)
* Move embedding layer to CPU for iGPU llm inference

* Empty cache after to cpu

* Remove empty cache as it seems to have some negative effect to first token
2023-11-03 11:13:45 +08:00
Yishuo Wang
726203d778 [LLM] Replace Embedding layer to fix it on CPU (#9254) 2023-11-01 13:58:10 +08:00
Yang Wang
e1bc18f8eb fix import ipex problem (#9323)
* fix import ipex problem

* fix style
2023-10-31 20:31:34 -07:00
Yina Chen
2262ae4d13 Support MoFQ4 on arc (#9301)
* init

* update

* fix style

* fix style

* fix style

* meet comments
2023-11-01 10:59:46 +08:00
Yang Wang
163d033616 Support qlora in CPU (#9233)
* support qlora in CPU

* revert example

* fix style
2023-10-27 14:01:15 -07:00
Cengguang Zhang
44b5fcc190 LLM: fix pretraining_tp argument issue. (#9281) 2023-10-26 18:43:58 +08:00
WeiguangHan
6b2a32eba2 LLM: add missing function for PyTorch InternLM model (#9285) 2023-10-26 18:05:23 +08:00
Yina Chen
f879c48f98 fp8 convert use ggml code (#9277) 2023-10-26 17:03:29 +08:00
Yina Chen
e2264e8845 Support arc fp4 (#9266)
* support arc fp4

* fix style

* fix style
2023-10-25 15:42:48 +08:00
Yang Wang
067c7e8098 Support deepspeed AutoTP (#9230)
* Support deepspeed

* add test script

* refactor convert

* refine example

* refine

* refine example

* fix style

* refine example and adapte latest ipex

* fix style
2023-10-24 23:46:28 -07:00
Jin Qiao
90162264a3 LLM: replace torch.float32 with auto type (#9261) 2023-10-24 17:12:13 +08:00
SONG Ge
bd5215d75b [LLM] Reimplement chatglm fuse rms optimization (#9260)
* re-implement chatglm rope rms

* update
2023-10-24 16:35:12 +08:00
SONG Ge
bfc1e2d733 add fused rms optimization for chatglm model (#9256) 2023-10-24 14:40:58 +08:00
Guancheng Fu
f37547249d Refine README/CICD (#9253) 2023-10-24 12:56:03 +08:00
binbin Deng
db37edae8a LLM: update langchain api document page (#9222) 2023-10-24 10:13:41 +08:00
Wang, Jian4
c14a61681b Add load low-bit in model-serving for reduce EPC (#9239)
* init load low-bit

* fix

* fix
2023-10-23 11:28:20 +08:00
Yina Chen
0383306688 Add arc fp8 support (#9232)
* add fp8 support

* add log

* fix style
2023-10-20 17:15:07 +08:00
Yang Wang
118249b011 support transformers 4.34+ for llama (#9229) 2023-10-19 22:36:30 -07:00
Chen, Zhentao
5850241423 correct Readme GPU example and API docstring (#9225)
* update readme to correct GPU usage

* update from_pretrained supported low bit options

* fix stype check
2023-10-19 16:08:47 +08:00
Yang Wang
b0ddde0410 Fix removing convert dtype bug (#9216)
* Fix removing convert dtype bug

* fix style
2023-10-18 11:24:22 -07:00
Ruonan Wang
942d6418e7 LLM: fix chatglm kv cache (#9215) 2023-10-18 19:09:53 +08:00
SONG Ge
0765f94770 [LLM] Optimize kv_cache for mistral model family (#9189)
* add kv_cache optimization for mistral model

* kv_cache optimize for mistral

* update stylr

* update
2023-10-18 15:13:37 +08:00
Ruonan Wang
3555ebc148 LLM: fix wrong length in gptj kv_cache optimization (#9210)
* fix wrong length in gptj kv cache

* update
2023-10-18 14:59:02 +08:00
Shengsheng Huang
6dad8d16df optimize NormHead for Baichuan2 (#9205)
* optimize NormHead for Baichuan2

* fix ut and change name

* rename functions
2023-10-18 14:05:07 +08:00
Ruonan Wang
09815f7064 LLM: fix RMSNorm optimization of Baichuan2-13B/Baichuan-13B (#9204)
* fix rmsnorm of baichuan2-13B

* update baichuan1-13B too

* fix style
2023-10-17 18:40:34 +08:00
Ruonan Wang
c0497ab41b LLM: support kv_cache optimization for Qwen-VL-Chat (#9193)
* dupport qwen_vl_chat

* fix style
2023-10-17 13:33:56 +08:00
binbin Deng
1cd9ab15b8 LLM: fix ChatGLMConfig check (#9191) 2023-10-17 11:52:56 +08:00
Yang Wang
7160afd4d1 Support XPU DDP training and autocast for LowBitMatmul (#9167)
* support autocast in low bit matmul

* Support XPU DDP training

* fix  amp
2023-10-16 20:47:19 -07:00
Ruonan Wang
77afb8796b LLM: fix convert of chatglm (#9190) 2023-10-17 10:48:13 +08:00
dingbaorong
af3b575c7e expose modules_to_not_convert in optimize_model (#9180)
* expose modules_to_not_convert in optimize_model

* some fixes
2023-10-17 09:50:26 +08:00