Yang Wang
51d07a9fd8
Support directly loading GPTQ models from huggingface ( #9391 )
...
* Support directly loading GPTQ models from huggingface
* fix style
* fix tests
* change example structure
* address comments
* fix style
* address comments
2023-11-13 20:48:12 -08:00
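For context, a minimal sketch of what this change enables, assuming the transformers-style API bigdl-llm exposed at the time (the GPTQ checkpoint id is illustrative):

```python
import torch
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = "TheBloke/Llama-2-7B-GPTQ"  # illustrative GPTQ checkpoint
# load_in_4bit converts the GPTQ weights into bigdl-llm's low-bit format
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             torch_dtype=torch.float,
                                             trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```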
SONG Ge
2888818b3a
[LLM] Support mixed_fp8 on Arc ( #9415 )
...
* ut gpu allocation memory fix
* support mixed_8bit on arc
* rename mixed_4bit to mixed_fp4 and mixed_8bit to mixed_fp8
* revert unexpected changes
* revert unexpected changes
* unify common logic
* rename in llm xmx_checker
* fix typo error and re-unify
2023-11-13 09:26:30 +08:00
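The renamed qtypes are selected by name through `load_in_low_bit`; a minimal sketch, assuming the transformers-style API (model id illustrative, Arc GPUs addressed as the "xpu" device):

```python
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",      # illustrative model id
    load_in_low_bit="mixed_fp8",     # or "mixed_fp4", per the rename above
    trust_remote_code=True,
)
model = model.to("xpu")              # Arc GPUs are exposed as "xpu"
```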
Heyang Sun
df8e4d7889
[LLM] apply allreduce and bias to training in LowBitLinear ( #9395 )
2023-11-09 14:35:54 +08:00
Wang, Jian4
40cead6b5b
LLM: Fix CPU qlora dtype convert issue ( #9394 )
2023-11-09 14:34:01 +08:00
Ruonan Wang
bfca76dfa7
LLM: optimize QLoRA by updating lora convert logic ( #9372 )
...
* update convert logic of qlora
* update
* refactor and further improve performance
* fix style
* meet code review
2023-11-08 17:46:49 +08:00
Ruonan Wang
7e8fb29b7c
LLM: optimize QLoRA by reducing convert time ( #9370 )
2023-11-08 13:14:34 +08:00
Yishuo Wang
bfd9f88f0d
[LLM] Use fp32 as dtype when batch_size <= 8 and qtype is q4_0/q8_0/fp8 ( #9365 )
2023-11-08 09:54:53 +08:00
Heyang Sun
fae6db3ddc
[LLM] refactor cpu low-bit forward logic ( #9366 )
...
* [LLM] refactor cpu low-bit forward logic
* fix style
* Update low_bit_linear.py
* Update low_bit_linear.py
* refine
2023-11-07 15:09:16 +08:00
Heyang Sun
af94058203
[LLM] Support CPU deepspeed distributed inference ( #9259 )
...
* [LLM] Support CPU Deepspeed distributed inference
* Update run_deepspeed.py
* Rename
* fix style
* add new codes
* refine
* remove annotated codes
* refine
* Update README.md
* refine doc and example code
2023-11-06 17:56:42 +08:00
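A rough sketch of the shape of this feature, assuming DeepSpeed's `init_inference` API of this era (`mp_size`) and bigdl-llm's `optimize_model`; launch with the deepspeed launcher, one rank per socket:

```python
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM
from bigdl.llm import optimize_model

world_size = int(os.environ.get("WORLD_SIZE", "1"))
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",      # illustrative model id
    torch_dtype=torch.bfloat16,
)
# Shard the model across CPU ranks with AutoTP first, then apply the
# low-bit optimization to each shard.
model = deepspeed.init_inference(model, mp_size=world_size,
                                 replace_with_kernel_inject=False)
model = optimize_model(model.module)
```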
Xin Qiu
1420e45cc0
Chatglm2 rope optimization on xpu ( #9350 )
2023-11-06 13:56:34 +08:00
Yuwen Hu
a0150bb205
[LLM] Move embedding layer to CPU for iGPU inference ( #9343 )
...
* Move embedding layer to CPU for iGPU llm inference
* Empty cache after to cpu
* Remove empty cache as it seems to have some negative effect on first token
2023-11-03 11:13:45 +08:00
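The idea of this change, as an illustrative sketch (not the repository's actual class): keep the large embedding table in host memory and move only the much smaller activations to the iGPU:

```python
import torch
import torch.nn as nn

class CPUEmbedding(nn.Module):
    """Hypothetical wrapper that keeps the lookup table on CPU."""
    def __init__(self, embedding: nn.Embedding, device: str = "xpu"):
        super().__init__()
        self.embedding = embedding.to("cpu")
        self.device = device

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Look up on CPU, then ship only the hidden states to the GPU.
        return self.embedding(input_ids.to("cpu")).to(self.device)
```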
Yishuo Wang
726203d778
[LLM] Replace Embedding layer to fix it on CPU ( #9254 )
2023-11-01 13:58:10 +08:00
Yang Wang
e1bc18f8eb
fix import ipex problem ( #9323 )
...
* fix import ipex problem
* fix style
2023-10-31 20:31:34 -07:00
Yina Chen
2262ae4d13
Support MoFQ4 on arc ( #9301 )
...
* init
* update
* fix style
* fix style
* fix style
* meet comments
2023-11-01 10:59:46 +08:00
Yang Wang
163d033616
Support qlora in CPU ( #9233 )
...
* support qlora in CPU
* revert example
* fix style
2023-10-27 14:01:15 -07:00
Cengguang Zhang
44b5fcc190
LLM: fix pretraining_tp argument issue. ( #9281 )
2023-10-26 18:43:58 +08:00
WeiguangHan
6b2a32eba2
LLM: add missing function for PyTorch InternLM model ( #9285 )
2023-10-26 18:05:23 +08:00
Yina Chen
f879c48f98
fp8 convert use ggml code ( #9277 )
2023-10-26 17:03:29 +08:00
Yina Chen
e2264e8845
Support arc fp4 ( #9266 )
...
* support arc fp4
* fix style
* fix style
2023-10-25 15:42:48 +08:00
Yang Wang
067c7e8098
Support deepspeed AutoTP ( #9230 )
...
* Support deepspeed
* add test script
* refactor convert
* refine example
* refine
* refine example
* fix style
* refine example and adapt to latest ipex
* fix style
2023-10-24 23:46:28 -07:00
Jin Qiao
90162264a3
LLM: replace torch.float32 with auto type ( #9261 )
2023-10-24 17:12:13 +08:00
SONG Ge
bd5215d75b
[LLM] Reimplement chatglm fuse rms optimization ( #9260 )
...
* re-implement chatglm rope rms
* update
2023-10-24 16:35:12 +08:00
SONG Ge
bfc1e2d733
add fused rms optimization for chatglm model ( #9256 )
2023-10-24 14:40:58 +08:00
Guancheng Fu
f37547249d
Refine README/CICD ( #9253 )
2023-10-24 12:56:03 +08:00
binbin Deng
db37edae8a
LLM: update langchain api document page ( #9222 )
2023-10-24 10:13:41 +08:00
Wang, Jian4
c14a61681b
Add load low-bit in model-serving to reduce EPC ( #9239 )
...
* init load low-bit
* fix
* fix
2023-10-23 11:28:20 +08:00
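The flow this enables, sketched with the transformers-style save/load pair: quantize once, persist the low-bit weights, and let serving workers load them directly instead of re-converting the full-precision checkpoint (model id and paths illustrative):

```python
from bigdl.llm.transformers import AutoModelForCausalLM

# One-time conversion.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             load_in_4bit=True)
model.save_low_bit("./llama-2-7b-low-bit")

# In the serving worker: load the already-quantized weights directly.
model = AutoModelForCausalLM.load_low_bit("./llama-2-7b-low-bit")
```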
Yina Chen
0383306688
Add arc fp8 support ( #9232 )
...
* add fp8 support
* add log
* fix style
2023-10-20 17:15:07 +08:00
Yang Wang
118249b011
support transformers 4.34+ for llama ( #9229 )
2023-10-19 22:36:30 -07:00
Chen, Zhentao
5850241423
correct Readme GPU example and API docstring ( #9225 )
...
* update readme to correct GPU usage
* update from_pretrained supported low bit options
* fix style check
2023-10-19 16:08:47 +08:00
Yang Wang
b0ddde0410
Fix removing convert dtype bug ( #9216 )
...
* Fix removing convert dtype bug
* fix style
2023-10-18 11:24:22 -07:00
Ruonan Wang
942d6418e7
LLM: fix chatglm kv cache ( #9215 )
2023-10-18 19:09:53 +08:00
SONG Ge
0765f94770
[LLM] Optimize kv_cache for mistral model family ( #9189 )
...
* add kv_cache optimization for mistral model
* kv_cache optimize for mistral
* update style
* update
2023-10-18 15:13:37 +08:00
Ruonan Wang
3555ebc148
LLM: fix wrong length in gptj kv_cache optimization ( #9210 )
...
* fix wrong length in gptj kv cache
* update
2023-10-18 14:59:02 +08:00
Shengsheng Huang
6dad8d16df
optimize NormHead for Baichuan2 ( #9205 )
...
* optimize NormHead for Baichuan2
* fix ut and change name
* rename functions
2023-10-18 14:05:07 +08:00
Ruonan Wang
09815f7064
LLM: fix RMSNorm optimization of Baichuan2-13B/Baichuan-13B ( #9204 )
...
* fix rmsnorm of baichuan2-13B
* update baichuan1-13B too
* fix style
2023-10-17 18:40:34 +08:00
Ruonan Wang
c0497ab41b
LLM: support kv_cache optimization for Qwen-VL-Chat ( #9193 )
...
* support qwen_vl_chat
* fix style
2023-10-17 13:33:56 +08:00
binbin Deng
1cd9ab15b8
LLM: fix ChatGLMConfig check ( #9191 )
2023-10-17 11:52:56 +08:00
Yang Wang
7160afd4d1
Support XPU DDP training and autocast for LowBitMatmul ( #9167 )
...
* support autocast in low bit matmul
* Support XPU DDP training
* fix amp
2023-10-16 20:47:19 -07:00
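A hedged sketch of XPU DDP training with autocast; it assumes intel_extension_for_pytorch and the oneCCL bindings are installed, uses a toy Linear in place of a low-bit model, and the exact autocast entry point depends on the ipex version:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
import intel_extension_for_pytorch as ipex  # noqa: F401  registers the "xpu" device
import oneccl_bindings_for_pytorch          # noqa: F401  registers the "ccl" backend

dist.init_process_group(backend="ccl")
rank = dist.get_rank()

model = torch.nn.Linear(16, 16).to(f"xpu:{rank}")  # stand-in for a real model
model = DDP(model)

x = torch.randn(4, 16, device=f"xpu:{rank}")
with torch.autocast(device_type="xpu", dtype=torch.bfloat16):
    loss = model(x).sum()
loss.backward()
```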
Ruonan Wang
77afb8796b
LLM: fix convert of chatglm ( #9190 )
2023-10-17 10:48:13 +08:00
dingbaorong
af3b575c7e
expose modules_to_not_convert in optimize_model ( #9180 )
...
* expose modules_to_not_convert in optimize_model
* some fixes
2023-10-17 09:50:26 +08:00
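The newly exposed knob lets numerically sensitive modules stay in full precision while the rest is quantized; a short sketch (module names depend on the architecture and are illustrative):

```python
from transformers import AutoModelForCausalLM
from bigdl.llm import optimize_model

model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Chat",
                                             trust_remote_code=True)
# Keep the lm_head in full precision, convert everything else.
model = optimize_model(model, modules_to_not_convert=["lm_head"])
```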
Cengguang Zhang
5ca8a851e9
LLM: add fuse optimization for Mistral. ( #9184 )
...
* add fuse optimization for mistral.
* fix.
* fix
* fix style.
* fix.
* fix error.
* fix style.
* fix style.
2023-10-16 16:50:31 +08:00
Jiao Wang
49e1381c7f
update rope ( #9155 )
2023-10-15 21:51:45 -07:00
binbin Deng
a164c24746
LLM: add kv_cache optimization for chatglm2-6b-32k ( #9165 )
2023-10-16 10:43:15 +08:00
Yang Wang
7a2de00b48
Fixes for xpu Bf16 training ( #9156 )
...
* Support bf16 training
* Use a stable transformer version
* remove env
* fix style
2023-10-14 21:28:59 -07:00
Cengguang Zhang
51a133de56
LLM: add fuse rope and norm optimization for Baichuan. ( #9166 )
...
* add fuse rope optimization.
* add rms norm optimization.
2023-10-13 17:36:52 +08:00
Cengguang Zhang
433f408081
LLM: Add fuse rope and norm optimization for Aquila. ( #9161 )
...
* add fuse norm optimization.
* add fuse rope optimization
2023-10-13 14:18:37 +08:00
SONG Ge
e7aa67e141
[LLM] Add rope optimization for internlm ( #9159 )
...
* add rope and norm optimization for internlm and gptneox
* revert gptneox changes and split them out into PR #9155
* add norm_forward
* style fix
* update
* update
2023-10-13 14:18:28 +08:00
Ruonan Wang
b8aee7bb1b
LLM: Fix Qwen kv_cache optimization ( #9148 )
...
* first commit
* ut pass
* accelerate rotate half by using common util function
* fix style
2023-10-12 15:49:42 +08:00
binbin Deng
69942d3826
LLM: fix model check before attention optimization ( #9149 )
2023-10-12 15:21:51 +08:00
binbin Deng
eb3fb18eb4
LLM: improve PyTorch API doc ( #9128 )
2023-10-11 15:03:39 +08:00
Zhao Changmin
1709beba5b
LLM: Explicitly close pickle file pointer before removing temporary directory ( #9120 )
...
* fp close
2023-10-10 14:57:23 +08:00
binbin Deng
e4d1457a70
LLM: improve transformers style API doc ( #9113 )
2023-10-10 09:31:00 +08:00
Zhao Changmin
edccfb2ed3
LLM: Check model device type ( #9092 )
...
* check model device
2023-10-09 15:49:15 +08:00
Yina Chen
4c4f8d1663
[LLM] Fix Arc falcon abnormal output issue ( #9096 )
...
* update
* update
* fix error & style
* fix style
* update train
* to input_seq_size
2023-10-09 15:09:37 +08:00
Zhao Changmin
548e4dd5fe
LLM: Adapt transformers models for optimize_model save/load ( #9022 )
...
* LLM: Adapt transformers model for SL
2023-10-09 11:13:44 +08:00
Ruonan Wang
f64257a093
LLM: basic api support for esimd fp16 ( #9067 )
...
* basic api support for fp16
* fix style
* fix
* fix error and style
* fix style
* meet code review
* update based on comments
2023-10-09 11:05:17 +08:00
Xin Qiu
b3e94a32d4
change log4error import ( #9098 )
2023-10-08 09:23:28 +08:00
Kai Huang
78ea7ddb1c
Combine apply_rotary_pos_emb for gpt-neox ( #9074 )
2023-10-07 16:27:46 +08:00
Yang Wang
36dd4afd61
Fix llama when rope scaling is not None ( #9086 )
...
* Fix llama when rope scaling is not None
* fix style
* fix style
2023-10-06 13:27:37 -07:00
Yang Wang
fcb1c618a0
using bigdl-llm fused rope for llama ( #9066 )
...
* optimize llama xpu rope
* fix bug
* fix style
* refine append cache
* remove check
* do not cache cos sin
* remove unnecessary changes
* clean up
* fix style
* check for training
2023-10-06 09:57:29 -07:00
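For reference, the unfused rotation that a fused rope kernel replaces, mirroring the standard HF llama formulation (shapes simplified):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Rotate the second half of the head dim in front of the first.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, head_dim]
    sin = sin[position_ids].unsqueeze(1)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```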
Jiao Wang
aefa5a5bfe
Qwen kv cache ( #9079 )
...
* qwen and aquila
* update
* update
* style
2023-10-05 11:59:17 -07:00
Jiao Wang
d5ca1f32b6
Aquila KV cache optimization ( #9080 )
...
* update
* update
* style
2023-10-05 11:10:57 -07:00
Yang Wang
88565c76f6
add export merged model example ( #9018 )
...
* add export merged model example
* add sources
* add script
* fix style
2023-10-04 21:18:52 -07:00
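The likely shape of the export-merged-model example: fold the LoRA deltas back into the base weights with peft so the result can be served without adapters (paths illustrative; the repository may wrap these calls):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                            torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "./qlora-adapter")  # illustrative path
merged = model.merge_and_unload()   # bakes the adapter into the base weights
merged.save_pretrained("./llama-2-7b-merged")
```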
Yang Wang
0cd8f1c79c
Use ipex fused rms norm for llama ( #9081 )
...
* also apply rmsnorm
* fix cpu
2023-10-04 21:04:55 -07:00
Cengguang Zhang
fb883100e7
LLM: support chatglm-18b convert attention forward in benchmark scripts. ( #9072 )
...
* add chatglm-18b convert.
* fix if statement.
* fix
2023-09-28 14:04:52 +08:00
Yishuo Wang
6de2189e90
[LLM] fix chatglm main choice ( #9073 )
2023-09-28 11:23:37 +08:00
Cengguang Zhang
b4a1266ef0
[WIP] LLM: add kv cache support for internlm. ( #9036 )
...
* LLM: add kv cache support for internlm
* add internlm apply_rotary_pos_emb
* fix.
* fix style.
2023-09-25 14:16:59 +08:00
Ruonan Wang
975da86e00
LLM: fix gptneox kv cache ( #9044 )
2023-09-25 13:03:57 +08:00
Jiao Wang
028a6d9383
MPT model optimize for long sequence ( #9020 )
...
* mpt_long_seq
* update
* update
* update
* style
* style2
* update
2023-09-21 21:27:23 -07:00
Ruonan Wang
b943d73844
LLM: refactor kv cache ( #9030 )
...
* refactor utils
* meet code review; update all models
* small fix
2023-09-21 21:28:03 +08:00
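The general technique behind this kv-cache refactor, in plain torch rather than the repository's own utils: pre-allocate extra capacity and grow by slicing, so each decoding step avoids a full `torch.cat` reallocation:

```python
import torch

class PreallocKVCache:
    """Grow-by-slice kv cache sketch: one big buffer, no per-token torch.cat."""
    def __init__(self, batch, n_heads, head_dim, max_len,
                 dtype=torch.float16, device="cpu"):
        self.buf = torch.empty(batch, n_heads, max_len, head_dim,
                               dtype=dtype, device=device)
        self.len = 0

    def append(self, kv_step: torch.Tensor) -> torch.Tensor:
        # Write the new key/value slice in place and return a view.
        step = kv_step.size(2)
        self.buf[:, :, self.len:self.len + step, :] = kv_step
        self.len += step
        return self.buf[:, :, :self.len, :]  # view passed to attention
```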
Cengguang Zhang
868511cf02
LLM: fix kv cache issue of bloom and falcon. ( #9029 )
2023-09-21 18:12:20 +08:00
Ruonan Wang
bf51ec40b2
LLM: Fix empty cache ( #9024 )
...
* fix
* fix
* update example
2023-09-21 17:16:07 +08:00
Yina Chen
714884414e
fix error ( #9025 )
2023-09-21 16:42:11 +08:00
SONG Ge
fa47967583
[LLM] Optimize kv_cache for gptj model family ( #9010 )
...
* optimize gptj model family attention
* add license and comment for dolly-model
* remove xpu mentioned
* remove useless info
* code style
* style fix
* code style in gptj fix
* remove gptj arch
* move apply_rotary_pos_emb into utils
* kv_seq_length update
* use hidden_states instead of query layer to get batch size
2023-09-21 10:42:08 +08:00
Cengguang Zhang
b3cad7de57
LLM: add bloom kv cache support ( #9012 )
...
* LLM: add bloom kv cache support
* fix style.
2023-09-20 21:10:53 +08:00
Kai Huang
156af15d1e
Add NF3 ( #9008 )
...
* add nf3
* grammar
2023-09-20 20:03:07 +08:00
Kai Huang
6981745fe4
Optimize kv_cache for gpt-neox model family ( #9015 )
...
* override gptneox
* style
* move to utils
* revert
2023-09-20 19:59:19 +08:00
Cengguang Zhang
735a17f7b4
LLM: add kv cache to falcon family. ( #8995 )
...
* add kv cache to falcon family.
* fix: import error.
* refactor
* update comments.
* add two version falcon attention forward.
* fix
* fix.
* fix.
* fix.
* fix style.
* fix style.
2023-09-20 15:36:30 +08:00
Ruonan Wang
94a7f8917b
LLM: fix optimized kv cache for baichuan-13b ( #9009 )
...
* fix baichuan 13b
* fix style
* fix
* fix style
2023-09-20 15:30:14 +08:00
Yang Wang
c88f6ec457
Experiment XPU QLoRA Finetuning ( #8937 )
...
* Support xpu finetuning
* support xpu finetuning
* fix style
* fix style
* fix style
* refine example
* add readme
* refine readme
* refine api
* fix fp16
* fix example
* refactor
* fix style
* fix compute type
* add qlora
* refine training args
* fix example
* fix style
* fast path for inference
* address comments
* refine readme
* revert lint
2023-09-19 10:15:44 -07:00
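A condensed sketch of the experiment's flow; it assumes the qlora helpers live under `bigdl.llm.transformers.qlora` (treat module paths and arguments as approximate, model id illustrative):

```python
from bigdl.llm.transformers import AutoModelForCausalLM
from bigdl.llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training
from peft import LoraConfig

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             load_in_low_bit="nf4")
model = model.to("xpu")
model = prepare_model_for_kbit_training(model)
config = LoraConfig(r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                    lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, config)
# ...then train with transformers.Trainer or a manual loop on "xpu".
```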
Ruonan Wang
004c45c2be
LLM: Support optimized kv_cache for baichuan family ( #8997 )
...
* add initial support for baichuan attention
* support baichuan1
* update based on comment
* update based on comment
* support baichuan2
* update link, change how to judge baichuan2
* fix style
* add model parameter for pos emb
* update based on comment
2023-09-19 15:38:54 +08:00
Zhao Changmin
2a05581da7
LLM: Apply low_cpu_mem_usage algorithm on optimize_model API ( #8987 )
...
* low_cpu_mem_usage
2023-09-18 21:41:42 +08:00
Zhao Changmin
16b9412e80
tie_word_embeddings ( #8977 )
...
tie_word_embeddings
2023-09-15 10:17:09 +08:00
Yishuo Wang
bcf456070c
fix bloom-176b int overflow ( #8973 )
2023-09-14 14:37:57 +08:00
Ruonan Wang
dd57623650
LLM: reduce GPU memory for optimize_model=True ( #8965 )
...
* reduce gpu memory for llama & chatglm
* change to device type
2023-09-13 17:27:09 +08:00
SONG Ge
7132ef6081
[LLM Doc] Add optimize_model doc in transformers api ( #8957 )
...
* add optimize in from_pretrained
* add api doc for load_low_bit
* update api docs following comments
* update api docs
* update
* reword comments
2023-09-13 10:42:33 +08:00
Zhao Changmin
c32c260ce2
LLM: Add save/load API in optimize_model to support general pytorch model ( #8956 )
...
* support hf format SL
2023-09-13 10:22:00 +08:00
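A sketch of the save/load pair this adds for general pytorch models; the `bigdl.llm.optimize.load_low_bit` location is remembered, not verified, so treat it as approximate:

```python
import torch
import torch.nn as nn
from bigdl.llm import optimize_model
from bigdl.llm.optimize import load_low_bit  # assumed location

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(64, 64)

    def forward(self, x):
        return self.proj(x)

model = optimize_model(TinyNet())
model.save_low_bit("./tiny-low-bit")          # persist the converted weights

fresh = load_low_bit(TinyNet(), "./tiny-low-bit")
```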
Guancheng Fu
0bf5857908
[LLM] Integrate FastChat as a serving framework for BigDL-LLM ( #8821 )
...
* Finish changing
* format
* add licence
* Add licence
* fix
* fix
* Add xpu support for fschat
* Fix patch
* Also install webui dependencies
* change setup.py dependency installs
* fix
* format
* final test
2023-09-13 09:28:05 +08:00
Zhao Changmin
dcaa4dc130
LLM: Support GQA on llama kvcache ( #8938 )
...
* support GQA
2023-09-12 12:18:40 +08:00
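The standard GQA expansion this brings to the llama kv cache: each kv head is shared by `n_heads // n_kv_heads` query heads and is expanded just before attention:

```python
import torch

def repeat_kv(hidden: torch.Tensor, n_rep: int) -> torch.Tensor:
    """(batch, n_kv_heads, seq, head_dim) -> (batch, n_kv_heads * n_rep, seq, head_dim)."""
    if n_rep == 1:
        return hidden
    b, kv, s, d = hidden.shape
    hidden = hidden[:, :, None, :, :].expand(b, kv, n_rep, s, d)
    return hidden.reshape(b, kv * n_rep, s, d)
```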
Yang Wang
16761c58be
Make llama attention stateless ( #8928 )
...
* Make llama attention stateless
* fix style
* fix chatglm
* fix chatglm xpu
2023-09-11 18:21:50 -07:00
Zhao Changmin
e62eda74b8
refine ( #8912 )
...
Co-authored-by: leonardozcm <leonardozcm@gmail.com>
2023-09-11 16:40:33 +08:00
Yina Chen
df165ad165
init ( #8933 )
2023-09-11 14:30:55 +08:00
Ruonan Wang
b3f5dd5b5d
LLM: update q8 convert xpu&cpu ( #8930 )
2023-09-08 16:01:17 +08:00
Yina Chen
33d75adadf
[LLM] Support q5_0 on arc ( #8926 )
...
* support q5_0
* delete
* fix style
2023-09-08 15:52:36 +08:00
Yang Wang
ee98cdd85c
Support latest transformer version ( #8923 )
...
* Support latest transformer version
* fix style
2023-09-07 19:01:32 -07:00
Yang Wang
25428b22b4
Fix chatglm2 attention and kv cache ( #8924 )
...
* fix chatglm2 attention
* fix bf16 bug
* make model stateless
* add utils
* cleanup
* fix style
2023-09-07 18:54:29 -07:00
Yina Chen
b209b8f7b6
[LLM] Fix arc qtype != q4_0 generate issue ( #8920 )
...
* Fix arc precision!=q4_0 generate issue
* meet comments
2023-09-07 08:56:36 -07:00
Yang Wang
c34400e6b0
Use new layout for xpu qlinear ( #8896 )
...
* use new layout for xpu qlinear
* fix style
2023-09-06 21:55:33 -07:00
Zhao Changmin
8bc1d8a17c
LLM: Fix discards in optimize_model with non-hf models and add openai whisper example ( #8877 )
...
* openai-whisper
2023-09-07 10:35:59 +08:00
SONG Ge
7a71ced78f
[LLM Docs] Resolve remaining API docs issues ( #8780 )
...
* langchain readthedocs update
* solve langchain.llms.transformersllm issues
* langchain.embeddings.transformersembeddings/transformersllms issues
* update docs for get_num_tokens
* add low_bit api doc
* add optimizer model api doc
* update rst index
* fix comments style
* update docs following the comments
* update api doc
2023-09-06 16:29:34 +08:00