Xin Qiu
1dd40b429c
enable fp4 fused mlp and qkv ( #10531 )
...
* enable fp4 fused mlp and qkv
* update qwen
* update qwen2
2024-03-26 08:34:00 +08:00
Wang, Jian4
16b2ef49c6
Update_document by heyang ( #30 )
2024-03-25 10:06:02 +08:00
Wang, Jian4
a1048ca7f6
Update setup.py and add new actions and add compatible mode ( #25 )
...
* update setup.py
* add new action
* add compatible mode
2024-03-22 15:44:59 +08:00
Wang, Jian4
9df70d95eb
Refactor bigdl.llm to ipex_llm ( #24 )
...
* Rename bigdl/llm to ipex_llm
* rm python/llm/src/bigdl
* from bigdl.llm to from ipex_llm
2024-03-22 15:41:21 +08:00
Wang, Jian4
34d0a9328c
LLM: Speed-up mixtral in pipeline parallel inference ( #10472 )
...
* speed-up mixtral
* fix style
2024-03-22 11:06:28 +08:00
Cengguang Zhang
b9d4280892
LLM: fix baichuan7b quantize kv abnormal output. ( #10504 )
...
* fix abnormal output.
* fix style.
* fix style.
2024-03-22 10:00:08 +08:00
Yishuo Wang
f0f317b6cf
fix a typo in yuan ( #10503 )
2024-03-22 09:40:04 +08:00
Guancheng Fu
3a3756b51d
Add FastChat bigdl_worker ( #10493 )
...
* done
* fix format
* add licence
* done
* fix doc
* refactor folder
* add license
2024-03-21 18:35:05 +08:00
Xin Qiu
dba7ddaab3
add sdp fp8 for qwen llama436 baichuan mistral baichuan2 ( #10485 )
...
* add sdp fp8
* fix style
* fix qwen
* fix baichuan 13
* revert baichuan 13b and baichuan2-13b
* fix style
* update
2024-03-21 17:23:05 +08:00
Kai Huang
30f111cd32
lm_head empty_cache for more models ( #10490 )
...
* modify constraint
* fix style
2024-03-21 17:11:43 +08:00
binbin Deng
2958ca49c0
LLM: add patching function for llm finetuning ( #10247 )
2024-03-21 16:01:01 +08:00
Kai Huang
021d77fd22
Remove softmax upcast fp32 in llama ( #10481 )
...
* update
* fix style
2024-03-20 18:17:34 +08:00
Yishuo Wang
cfdf8ad496
Fix modules_not_to_convert argument ( #10483 )
2024-03-20 17:47:03 +08:00
Xiangyu Tian
cbe24cc7e6
LLM: Enable BigDL IPEX Int8 ( #10480 )
...
Enable BigDL IPEX Int8
2024-03-20 15:59:54 +08:00
ZehuaCao
1d062e24db
Update serving doc ( #10475 )
...
* update serving doc
* add tob
* update
* update
* update
* update vllm worker
2024-03-20 14:44:43 +08:00
Cengguang Zhang
4581e4f17f
LLM: fix whiper model missing config. ( #10473 )
...
* fix whiper model missing config.
* fix style.
* fix style.
* style.
2024-03-20 14:22:37 +08:00
Yishuo Wang
749bedaf1e
fix rwkv v5 fp16 ( #10474 )
2024-03-20 13:15:08 +08:00
Yuwen Hu
72bcc27da9
[LLM] Add TransformersBgeEmbeddings class in bigdl.llm.langchain.embeddings ( #10459 )
...
* Add TransformersBgeEmbeddings class in bigdl.llm.langchain.embeddings
* Small fixes
2024-03-19 18:04:35 +08:00
Cengguang Zhang
463a86cd5d
LLM: fix qwen-vl interpolation gpu abnormal results. ( #10457 )
...
* fix qwen-vl interpolation gpu abnormal results.
* fix style.
* update qwen-vl gpu example.
* fix comment and update example.
* fix style.
2024-03-19 16:59:39 +08:00
Xin Qiu
bbd749dceb
qwen2 fp8 cache ( #10446 )
...
* qwen2 fp8 cache
* fix style check
2024-03-19 08:32:39 +08:00
Yang Wang
9e763b049c
Support running pipeline parallel inference by vertically partitioning model to different devices ( #10392 )
...
* support pipeline parallel inference
* fix logging
* remove benchmark file
* fic
* need to warmup twice
* support qwen and qwen2
* fix lint
* remove genxir
* refine
2024-03-18 13:04:45 -07:00
Xiangyu Tian
dbdeaddd6a
LLM: Fix log condition for BIGDL_OPT_IPEX ( #10441 )
...
remove log for BIGDL_OPT_IPEX
2024-03-18 16:03:51 +08:00
Xin Qiu
399843faf0
Baichuan 7b fp16 sdp and qwen2 pvc sdp ( #10435 )
...
* add baichuan sdp
* update
* baichuan2
* fix
* fix style
* revert 13b
* revert
2024-03-18 10:15:34 +08:00
Yishuo Wang
bd64488b2a
add mask support for llama/chatglm fp8 sdp ( #10433 )
...
* add mask support for fp8 sdp
* fix chatglm2 dtype
* update
2024-03-15 17:36:52 +08:00
Xin Qiu
24473e331a
Qwen2 fp16 sdp ( #10427 )
...
* qwen2 sdp and refine
* update
* update
* fix style
* remove use_flash_attention
2024-03-15 13:12:03 +08:00
Ruonan Wang
b036205be2
LLM: add fp8 sdp for chatglm2/3 ( #10411 )
...
* add fp8 sdp for chatglm2
* fix style
2024-03-15 09:38:18 +08:00
Wang, Jian4
fe8976a00f
LLM: Support gguf models use low_bit and fix no json( #10408 )
...
* support others model use low_bit
* update readme
* update to add *.json
2024-03-15 09:34:18 +08:00
Xin Qiu
cda38f85a9
Qwen fp16 sdp ( #10401 )
...
* qwen sdp
* fix
* update
* update
* update sdp
* update
* fix style check
* add to origin type
2024-03-15 08:51:50 +08:00
dingbaorong
1c0f7ed3fa
add xpu support ( #10419 )
2024-03-14 17:13:48 +08:00
Heyang Sun
7d29765092
refactor qwen2 forward to enable XPU ( #10409 )
...
* refactor awen2 forward to enable XPU
* Update qwen2.py
2024-03-14 11:03:05 +08:00
ZehuaCao
f66329e35d
Fix multiple get_enable_ipex function error ( #10400 )
...
* fix multiple get_enable_ipex function error
* remove get_enable_ipex_low_bit function
2024-03-14 10:14:13 +08:00
Kai Huang
76e30d8ec8
Empty cache for lm_head ( #10317 )
...
* empty cache
* add comments
2024-03-13 20:31:53 +08:00
Yishuo Wang
06a851afa9
support new baichuan model ( #10404 )
2024-03-13 17:45:50 +08:00
Yishuo Wang
b268baafd6
use fp8 sdp in llama ( #10396 )
2024-03-13 16:45:38 +08:00
Xiangyu Tian
60043a3ae8
LLM: Support Baichuan2-13b in BigDL-vLLM ( #10398 )
...
Support Baichuan2-13b in BigDL-vLLM.
2024-03-13 16:21:06 +08:00
Xiangyu Tian
e10de2c42d
[Fix] LLM: Fix condition check error for speculative decoding on CPU ( #10402 )
...
Fix condition check error for speculative decoding on CPU
2024-03-13 16:05:06 +08:00
Heyang Sun
d72c0fad0d
Qwen2 SDPA forward on CPU ( #10395 )
...
* Fix Qwen1.5 CPU forward
* Update convert.py
* Update qwen2.py
2024-03-13 13:10:03 +08:00
Wang, Jian4
0193f29411
LLM : Enable gguf float16 and Yuan2 model ( #10372 )
...
* enable float16
* add yun files
* enable yun
* enable set low_bit on yuan2
* update
* update license
* update generate
* update readme
* update python style
* update
2024-03-13 10:19:18 +08:00
Yina Chen
f5d65203c0
First token lm_head optimization ( #10318 )
...
* add lm head linear
* update
* address comments and fix style
* address comment
2024-03-13 10:11:32 +08:00
Xin Qiu
28c4a8cf5c
Qwen fused qkv ( #10368 )
...
* fused qkv + rope for qwen
* quantized kv cache
* fix
* update qwen
* fixed quantized qkv
* fix
* meet code review
* update split
* convert.py
* extend when no enough kv
* fix
2024-03-12 17:39:00 +08:00
Yishuo Wang
741c2bf1df
use new rms norm ( #10384 )
2024-03-12 17:29:51 +08:00
Xiangyu Tian
0ded0b4b13
LLM: Enable BigDL IPEX optimization for int4 ( #10319 )
...
Enable BigDL IPEX optimization for int4
2024-03-12 17:08:50 +08:00
Zhao Changmin
df2b84f7de
Enable kv cache on arc batch ( #10308 )
2024-03-12 16:46:04 +08:00
Guancheng Fu
cc4148636d
[FastChat-integration] Add initial implementation for loader ( #10323 )
...
* add initial implementation for loader
* add test method for model_loader
* data
* Refine
2024-03-12 10:54:59 +08:00
binbin Deng
dbcfc5c2fa
LLM: fix error of 'AI-ModelScope/phi-2' hosted by ModelScope hub ( #10364 )
2024-03-11 16:19:17 +08:00
Chen, Zhentao
a425eaabfc
fix from_pretrained when device_map=None ( #10361 )
...
* pr trigger
* fix error when device_map=None
* fix device_map=None
2024-03-11 16:06:12 +08:00
Yina Chen
d7b765fd3f
serving xpu memory opt ( #10358 )
2024-03-11 15:21:22 +08:00
Ruonan Wang
be29833b2b
LLM: fix qwen2 ( #10356 )
2024-03-11 09:29:08 +08:00
Zhicun
9026c08633
Fix llamaindex AutoTokenizer bug ( #10345 )
...
* fix tokenizer
* fix AutoTokenizer bug
* modify code style
2024-03-08 16:24:50 +08:00
Keyan (Kyrie) Zhang
7a621a4db0
Fix device_map bug by raise an error when using device_map=xpu ( #10340 )
...
* Fix device_map bug by raise an error when using device_map=xpu
* Fix sync error
* Fix python style
* Use invalidInputError instead of invalidOperationError
2024-03-08 13:38:52 +08:00
Yishuo Wang
1ac193ba02
add rope theta argument ( #10343 )
2024-03-07 17:27:19 +08:00
Cengguang Zhang
496d18ab6d
LLM: add quantize kv cache support for baichuan 7b and 13b. ( #10330 )
...
* add quantize kv cache for baichuan 7b and 13b.
* fix typo.
* fix.
* fix style.
* fix style.
2024-03-07 16:17:38 +08:00
Yina Chen
9ea499ca68
Optimize speculative decoding PVC memory usage ( #10329 )
...
* optimize memory
* update
* update
* update
* support other models
* update
* fix style
2024-03-06 09:54:21 +08:00
dingbaorong
cc796848ea
fix typos ( #10274 )
...
Co-authored-by: Ariadne <wyn2000330@126.com>
2024-03-05 18:38:22 +08:00
Yishuo Wang
0011ff9f64
optimize bge large performance ( #10324 )
2024-03-05 17:06:03 +08:00
Cengguang Zhang
30d009bca7
LLM: support quantized kv cache for Mistral in transformers >=4.36.0 ( #10326 )
...
* support quantize kv for mistral in transformers 4.36
* update mistral support.
* fix style.
2024-03-05 16:23:50 +08:00
dingbaorong
1e6f0c6f1a
Add llamaindex gpu example ( #10314 )
...
* add llamaindex example
* fix core dump
* refine readme
* add trouble shooting
* refine readme
---------
Co-authored-by: Ariadne <wyn2000330@126.com>
2024-03-05 13:36:00 +08:00
dingbaorong
fc7f10cd12
add langchain gpu example ( #10277 )
...
* first draft
* fix
* add readme for transformer_int4_gpu
* fix doc
* check device_map
* add arc ut test
* fix ut test
* fix langchain ut
* Refine README
* fix gpu mem too high
* fix ut test
---------
Co-authored-by: Ariadne <wyn2000330@126.com>
2024-03-05 13:33:57 +08:00
Cengguang Zhang
ab9fc2485f
LLM: add quantize kv support for llama transformer 4.36 ( #10298 )
...
* add quantize kv support for llama transformer 4.36
* fix style.
* fix style.
2024-03-04 10:33:35 +08:00
SONG Ge
0ab40917fb
[LLM] Split merged_qk to separated q/k linear ( #10299 )
...
* modify merge_qk_linear to separated q/k linear
* update
2024-03-01 16:48:55 +08:00
Yang Wang
f4d7dbcde2
use fused qkv forward in qwen2 ( #10185 )
...
* use fused qkv forward in qwen2
* support both
* fix style
* fix rope
* remove pring
* fix style
* clean up
2024-03-01 16:46:35 +08:00
Wang, Jian4
beb9433cec
LLM: Reduce speculative _ipex_optimize_model memory use ( #10281 )
...
* use tpp
* update ipex
2024-03-01 13:48:23 +08:00
Yuwen Hu
f0ff0eebe1
[LLM] Support quantize kv cache for Baichuan2 7B ( #10280 )
...
* Add quatized kv cache framework for Baichuan2 7B
* Support quantize kv cache for baichuan2
* Small fix
* Fix python style
2024-03-01 13:35:42 +08:00
SONG Ge
273de341d7
hot-fix silu error import ( #10292 )
2024-03-01 10:11:37 +08:00
Xin Qiu
232273a1b5
Enable Gemma fused mlp + Gelu ( #10276 )
...
* update llama mlp forward
* add all
* fix style check
* split
* update
* update
* update
* fix style
2024-02-29 16:53:24 +08:00
Guancheng Fu
2d930bdca8
Add vLLM bf16 support ( #10278 )
...
* add argument load_in_low_bit
* add docs
* modify gpu doc
* done
---------
Co-authored-by: ivy-lv11 <lvzc@lamda.nju.edu.cn>
2024-02-29 16:33:42 +08:00
SONG Ge
13b0bc9075
[LLM] Add quantize_kv optimization for yuan2 model ( #10243 )
...
* add initial quantize_kv support for yuan2 model
* fix yuan2 quantize_kv generation
* apply fp16 conv layer optimizations
* disable mlp for quantize_kv
2024-02-29 16:33:26 +08:00
Zhicun
4e6cc424f1
Add LlamaIndex RAG ( #10263 )
...
* run demo
* format code
* add llamaindex
* add custom LLM with bigdl
* update
* add readme
* begin ut
* add unit test
* add license
* add license
* revised
* update
* modify docs
* remove data folder
* update
* modify prompt
* fixed
* fixed
* fixed
2024-02-29 15:21:19 +08:00
Ruonan Wang
a9fd20b6ba
LLM: Update qkv fusion for GGUF-IQ2 ( #10271 )
...
* first commit
* update mistral
* fix transformers==4.36.0
* fix
* disable qk for mixtral now
* fix style
2024-02-29 12:49:53 +08:00
Ruonan Wang
4b08bc1417
LLM: relax batch check of flash atttention by double check attention mask ( #10270 )
...
* relax batch check
* fix
* fix style
2024-02-29 09:39:55 +08:00
Yina Chen
07f36fbfcc
Fix gptj failed to extend ( #10269 )
2024-02-29 09:39:27 +08:00
Yishuo Wang
cccb02dad1
fix baichuan2 13b 2k input ( #10267 )
2024-02-28 17:20:20 +08:00
Heyang Sun
7244fd1ba5
Fix Arc StarCoder wrong query_shape when input is long ( #10268 )
...
* Fix Arc StarCoder wrong query_shape when input is long
* Update gptbigcode.py
2024-02-28 17:07:08 +08:00
Cengguang Zhang
a4de3095f3
LLM: Support quantize kv cache in mistral. ( #10261 )
...
* init
* update quantize kv.
2024-02-28 14:08:08 +08:00
Zhicun
308e637d0d
Add DeepSeek-MoE-16B-Chat ( #10155 )
...
* dsmoe-hf add
* add dsmoe pytorch
* update README
* modify comment
* remove GPU example
* update model name
* format code
2024-02-28 10:12:09 +08:00
Yang Wang
c581c6db30
draft mmint4 ( #10031 )
...
change to llm.cpp
support transposed format
revert
implement qkv fuse
fix style
change to vertically pack
change to enable_xetla
fix mlp_fusion_check
remove comments
address comments
add some comments
fix style
2024-02-27 14:55:16 -08:00
Yishuo Wang
b4fa4ab46f
optimize yuan 2.0 again ( #10252 )
2024-02-27 14:51:42 +08:00
Heyang Sun
36a9e88104
Speculative Starcoder on CPU ( #10138 )
...
* Speculative Starcoder on CPU
* enable kv-cache pre-allocation
* refine codes
* refine
* fix style
* fix style
* fix style
* refine
* refine
* Update speculative.py
* Update gptbigcode.py
* fix style
* Update speculative.py
* enable mixed-datatype layernorm on top of torch API
* adaptive dtype
* Update README.md
2024-02-27 09:57:29 +08:00
Yishuo Wang
a47989c860
optimize yuan 2.0 performance ( #10244 )
2024-02-26 17:20:10 +08:00
Wang, Jian4
6c74b99a28
LLM: Update qwen readme ( #10245 )
2024-02-26 17:03:09 +08:00
Wang, Jian4
f9b75f900b
LLM: Enable qwen target_model ipex ( #10232 )
...
* change order
* enable qwen ipex
* update qwen example
* update
* fix style
* update
2024-02-26 16:41:12 +08:00
Yuwen Hu
e38e29511c
[LLM] Yuan2 MLP and Rotary optimization ( #10231 )
...
* Add optimization for rotary embedding
* Add mlp fused optimizatgion
* Python style fix
* Fix rotary embedding due to logits difference
* Small fix
2024-02-26 15:10:08 +08:00
SONG Ge
df2f3885ba
[LLM] Enable kv_cache and forward_qkv optimizations for yuan2 ( #10225 )
...
* add init kv_cache support for yuan2
* add forward qkv in yuan
2024-02-26 11:29:48 +08:00
Ruonan Wang
28513f3978
LLM: support fp16 embedding & add mlp fusion for iq2_xxs ( #10219 )
...
* add fp16 embed
* small fixes
* fix style
* fix style
* fix comment
2024-02-23 17:26:24 +08:00
Yuwen Hu
eeecd9fc08
Python style fix ( #10230 )
2024-02-23 17:21:23 +08:00
Yuwen Hu
e511bbd8f1
[LLM] Add basic optimization framework for Yuan2 ( #10227 )
...
* Add basic optimization framework for Yuan2
* Small fix
* Python style fix
* Small fix
* Small fix
2024-02-23 17:05:00 +08:00
Xin Qiu
30795bdfbc
Gemma optimization: rms_norm, kv_cache, fused_rope, fused_rope+qkv ( #10212 )
...
* gemma optimization
* update
* update
* fix style
* meet code review
2024-02-23 10:07:24 +08:00
Guoqiong Song
63681af97e
falcon for transformers 4.36 ( #9960 )
...
* falcon for transformers 4.36
2024-02-22 17:04:40 -08:00
Yina Chen
ce5840a8b7
GPT-J rope optimization on xpu ( #10182 )
...
* optimize
* update
* fix style & move use_fuse_rope
* add ipex version check
* fix style
* update
* fix style
* meet comments
* address comments
* fix style
2024-02-22 16:25:12 +08:00
Xiangyu Tian
f445217d02
LLM: Update IPEX to 2.2.0+cpu and Refactor for _ipex_optimize ( #10189 )
...
Update IPEX to 2.2.0+cpu and refactor for _ipex_optimize.
2024-02-22 16:01:11 +08:00
Heyang Sun
c876d9b5ca
Support for MPT rotary embedding ( #10208 )
2024-02-22 15:16:31 +08:00
Ruonan Wang
5e1fee5e05
LLM: add GGUF-IQ2 examples ( #10207 )
...
* add iq2 examples
* small fix
* meet code review
* fix
* meet review
* small fix
2024-02-22 14:18:45 +08:00
SONG Ge
ca1166a0e5
[LLM] Add quantize kv_cache for Baichuan2-13B ( #10203 )
...
* add quantize kv_cache for baichuan2-13b
* style fix
2024-02-22 13:43:35 +08:00
Ruonan Wang
34ee1aa91f
LLM: add esimd sdp support for chatglm3 ( #10205 )
...
* add esimd sdp support
* fix style
2024-02-22 13:37:16 +08:00
Ruonan Wang
f7c96b19ef
LLM: support iq2 for mixtral ( #10191 )
...
* support name mapping for mixtral
* support mixtral mixed quantization
* fix style
* fix
2024-02-21 16:00:29 +08:00
Xin Qiu
56ad781f2f
qwen2 cpu fix ( #10187 )
2024-02-21 11:23:51 +08:00
Zhao Changmin
4fbf449c2d
for rwkv4 ( #10179 )
2024-02-21 10:11:10 +08:00
Ruonan Wang
3288acb8de
LLM : Support embedding quantization (only q2k now) ( #10170 )
...
* basic logic added
* basic support
* support save&load, update mixed strategy
* fix style
* use int8 for lm_head
* add check for xpu
2024-02-20 16:56:57 +08:00
binbin Deng
2bb96c775c
LLM: fix device setting during saving optimized model ( #10154 )
2024-02-20 09:52:59 +08:00
Xin Qiu
1f6d5b9f30
enable fused rmsnorm and rope qwen2 ( #10163 )
...
* qwen2
* change convert
* cleanup
2024-02-20 08:33:09 +08:00
Zhao Changmin
f8730e8dc1
Skip rescale rwkv linear when load_low_bit ( #10164 )
...
* rwkv_ld
2024-02-19 15:56:42 +08:00
Heyang Sun
3e2af5ec0a
Fix IPEX Baichuan Speculative ( #10162 )
...
* Fix IPEX Baichuan Speculative
* compatible with 13B
* Update speculative.py
2024-02-19 15:27:34 +08:00
Yina Chen
23c91cdce6
[LLM] Add min_step_draft in speculative decoding ( #10142 )
...
* Fix gptj kvcache & position id
* Add min_draft_tokens in speculative decoding
* fix style
* update
2024-02-19 14:31:41 +08:00
Wang, Jian4
f2417e083c
LLM: enable chatglm3-6b target_model ipex ( #10085 )
...
* init
* always make casual_mask
* not return last tensor
* update
* optimize_model = False
* enable optimized=False
* enable optimized_model=true
* speed_up ipex target_model
* remove if True
* use group_size
* update python style
* update
* update
2024-02-19 13:38:32 +08:00
Yina Chen
1508d6b089
Fix gptj kvcache & position id ( #10141 )
2024-02-18 10:02:49 +08:00
Yishuo Wang
4d33aac7f9
quick fix qwen2 fp8 kv cache ( #10135 )
2024-02-08 17:04:59 +08:00
Cengguang Zhang
39d90839aa
LLM: add quantize kv cache for llama. ( #10086 )
...
* feat: add quantize kv cache for llama.
* fix style.
* add quantized attention forward function.
* revert style.
* fix style.
* fix style.
* update quantized kv cache and add quantize_qkv
* fix style.
* fix style.
* optimize quantize kv cache.
* fix style.
2024-02-08 16:49:22 +08:00
Yishuo Wang
d848efe17c
add quantize kv cache support for qwen2 ( #10134 )
2024-02-08 16:17:21 +08:00
SONG Ge
3f79128ed7
[LLM] Enable kv_cache optimization for Qwen2 on transformers-v4.37.0 ( #10131 )
...
* add support for kv_cache optimization on transformers-v4.37.0
* enable attention forward
* style fix
* disable rotary for now
2024-02-08 14:20:26 +08:00
Ruonan Wang
063dc145ac
LLM: basic support for q2k ( #10132 )
...
* basic support for q2k
* fix style
2024-02-08 13:52:01 +08:00
Cengguang Zhang
0cf6a12691
LLM: add default torch_dtype for fp16. ( #10124 )
...
* set default torch_dtype for fp16.
* fix style.
* bug fix.
* update bug fix.
2024-02-08 10:24:16 +08:00
Yishuo Wang
1aa0c623ce
disable fused layer norm on UHD ( #10130 )
2024-02-08 10:20:01 +08:00
Yuwen Hu
a8450fc300
[LLM] Support MLP optimization for Qwen1.5 ( #10123 )
2024-02-08 09:15:34 +08:00
binbin Deng
925f82107e
LLM: support models hosted by modelscope ( #10106 )
2024-02-07 16:46:36 +08:00
Xiangyu Tian
8953acd7d6
[LLM] Fix log condition for BIGDL_OPT_IPEX ( #10115 )
...
Fix log condition for BIGDL_OPT_IPEX
2024-02-07 10:27:10 +08:00
Yuwen Hu
518ef95abc
Small fix for Nonetype error ( #10104 )
2024-02-06 14:58:52 +08:00
Ruonan Wang
d61f4905ac
LLM: 2bit quantization initial support ( #10042 )
...
* basis quantize support
* fix new module name
* small update
* and mixed int4 with iq2_xxs
* remove print
* code refactor
* fix style
* meet code review
2024-02-06 14:58:32 +08:00
Jiao Wang
33b9e7744d
fix dimension ( #10097 )
2024-02-05 15:07:38 -08:00
Zhicun
7d2be7994f
add phixtral and optimize phi-moe ( #10052 )
2024-02-05 11:12:47 +08:00
Zhicun
676d6923f2
LLM: modify transformersembeddings.embed() in langchain ( #10051 )
2024-02-05 10:42:10 +08:00
Jin Qiao
ad050107b3
LLM: fix mpt load_low_bit issue ( #10075 )
...
* fix
* retry
* retry
2024-02-05 10:17:07 +08:00
Ruonan Wang
8e33cb0f38
LLM: support speecht5_tts ( #10077 )
...
* support speecht5_tts
* fix
2024-02-04 13:26:42 +08:00
ivy-lv11
428b7105f6
Add HF and PyTorch example InternLM2 ( #10061 )
2024-02-04 10:25:55 +08:00
Yina Chen
77be19bb97
LLM: Support gpt-j in speculative decoding ( #10067 )
...
* gptj
* support gptj in speculative decoding
* fix
* update readme
* small fix
2024-02-02 14:54:55 +08:00
Xin Qiu
6e0f1a1e92
use apply_rotary_pos_emb_cache_freq_xpu in mixtral ( #10060 )
...
* use apply_rotary_pos_emb_cache_freq_xpu in mixtral
* fix style
2024-02-01 15:40:49 +08:00
Heyang Sun
601024f418
Mistral CPU example of speculative decoding ( #10024 )
...
* Mistral CPU example of speculative decoding
* update transformres version
* update example
* Update README.md
2024-02-01 10:52:32 +08:00
Heyang Sun
968e70544d
Enable IPEX Mistral in Speculative ( #10059 )
2024-02-01 10:48:16 +08:00
Yina Chen
3ca03d4e97
Add deepmind sample into bigdl-llm speculative decoding ( #10041 )
...
* migrate deepmind sample
* update
* meet comments
* fix style
* fix style
2024-02-01 09:57:02 +08:00
Wang, Jian4
7e5cd42a5c
LLM : Update optimize ipex bf16 ( #10038 )
...
* use 4.35.2 and remove
* update rmsnorm
* remove
* remove
* update python style
* update
* update python style
* update
* fix style
* update
* remove whitespace
2024-01-31 10:59:55 +08:00
Ruonan Wang
3685622f29
LLM: fix llama 4.36 forward( #10047 )
2024-01-31 10:31:10 +08:00
Yishuo Wang
53a5140eff
Optimize rwkv v5 rest token again ( #10043 )
2024-01-31 10:01:11 +08:00
Ruonan Wang
6b63ba23d1
LLM: add full module name during convert ( #10035 )
2024-01-30 14:43:07 +08:00
Yishuo Wang
7dfa6dbe46
add rwkv time shift optimization ( #10032 )
2024-01-30 14:10:55 +08:00
Xiangyu Tian
f57d0fda8b
[LLM] Use IPEX Optimization for Self Speculative Decoding ( #9997 )
...
Use IPEX Optimization for Self Speculative Decoding
2024-01-30 09:11:06 +08:00
Ruonan Wang
ccf8f613fb
LLM: update fp16 Linear on ARC/FLEX ( #10023 )
2024-01-29 18:25:26 +08:00
Shaojun Liu
824c8029d7
Fix "local variable 'model' referenced before assignment" ( #10022 )
2024-01-29 16:18:04 +08:00
Xiangyu Tian
f37e4702bc
[LLM] Use IPEX Optimization for BF16 Model ( #9988 )
...
Use IPEX Optimization for BF16 Model by env BIGDL_OPT_IPEX=true
2024-01-29 11:28:25 +08:00
Yishuo Wang
d720554d43
simplify quantize kv cache api ( #10011 )
2024-01-29 09:23:57 +08:00
Yina Chen
a3322e2a6c
add fp8 e5 to use_xmx ( #10015 )
2024-01-26 18:29:46 +08:00
Qiyuan Gong
9e18ea187f
[LLM] Avoid KV Cache OOM when seq len is larger than 1 ( #10006 )
...
* Avoid OOM during muti-round streaming chat with kv cache
* For llama like kv cache, i.e., [bs, n_head, seq_len, head_dim], use is_enough_kv_cache_room_4_31.
* Other models need to compare kv cache size with kv_len.
2024-01-26 17:30:08 +08:00
Ruonan Wang
a00efa0564
LLM: add mlp & qkv fusion for FP16 Llama-7B ( #9932 )
...
* add mlp fusion for llama
* add mlp fusion
* fix style
* update
* add mm_qkv_out
* fix style
* update
* meet code review
* meet code review
2024-01-26 11:50:38 +08:00
Wang, Jian4
98ea3459e5
LLM : Fix llama draft_model dtype error ( #10005 )
...
* fix llama draft_model dtype error
* updat
2024-01-26 10:59:48 +08:00
Yishuo Wang
aae1870096
fix qwen kv cache length ( #9998 )
2024-01-26 10:15:01 +08:00
Yishuo Wang
24b34b6e46
change xmx condition ( #10000 )
2024-01-25 17:48:11 +08:00
Yishuo Wang
bf65548d29
Add quantize kv cache support for chaglm2/3 ( #9996 )
2024-01-25 16:55:59 +08:00
Wang, Jian4
9bff84e6fd
LLM: Convert draft_model kv_cache from bf16 to fp32 ( #9964 )
...
* convert bf16 to fp32
* update
* change when init
* init first and cut off after
* init and exchange
* update python type
* update
* fix bug
* update
* update
2024-01-25 11:20:27 +08:00
Yina Chen
27338540c3
Fix repetition_penalty not activated issue ( #9989 )
2024-01-25 10:40:41 +08:00
Yuwen Hu
b27e5a27b9
Remove the check for meta device in _replace_with_low_bit_linear ( #9984 )
2024-01-24 18:15:39 +08:00
Yina Chen
b176cad75a
LLM: Add baichuan2 gpu spec example ( #9973 )
...
* add baichuan2 gpu spec example
* update readme & example
* remove print
* fix typo
* meet comments
* revert
* update
2024-01-24 16:40:16 +08:00
Chen, Zhentao
e0db44dcb6
fix unexpected keyword argument 'device' ( #9982 )
...
* add device for chatglm3 only
* add comment for this change
* fix style
* fix style
* fix style again..
* finally fixed style
2024-01-24 13:20:46 +08:00
Yuwen Hu
8d28aa8e2b
[LLM] Fix the model.device problem when cpu_embedding=True ( #9971 )
...
* Overwrite the device attribute for CPUPinnedParam
* Expose cpu_embedding=True for Linux users
* Fix python style
2024-01-23 18:51:11 +08:00
Yishuo Wang
f82782cd3b
fix starcoder ( #9975 )
2024-01-23 17:24:53 +08:00
Yishuo Wang
2c8a9aaf0d
fix qwen causal mask when quantize_kv_cache=True ( #9968 )
2024-01-23 16:34:05 +08:00
Yina Chen
36c665667d
Add logits processor & qwen eos stop in speculative decoding ( #9963 )
...
* add logits processor & qwen eos
* fix style
* fix
* fix
* fix style
* fix style
* support transformers 4.31
* fix style
* fix style
---------
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-01-23 15:57:28 +08:00
Xin Qiu
da4687c917
fix fp16 ( #9970 )
2024-01-23 15:53:32 +08:00
Ruonan Wang
27b19106f3
LLM: add readme for speculative decoding gpu examples ( #9961 )
...
* add readme
* add readme
* meet code review
2024-01-23 12:54:19 +08:00
Chen, Zhentao
39219b7e9a
add default device meta when lcmu enabled ( #9941 )
2024-01-23 11:00:49 +08:00
Xin Qiu
dacf680294
add fused rotary pos emb for qwen ( #9956 )
...
* add fused rotary pos emb for qwen
* update
2024-01-23 10:37:56 +08:00
Ruonan Wang
7b1d9ad7c0
LLM: limit esimd sdp usage for k_len < 8 ( #9959 )
...
* update
* fix
2024-01-23 09:28:23 +08:00
Ruonan Wang
3e601f9a5d
LLM: Support speculative decoding in bigdl-llm ( #9951 )
...
* first commit
* fix error, add llama example
* hidden print
* update api usage
* change to api v3
* update
* meet code review
* meet code review, fix style
* add reference, fix style
* fix style
* fix first token time
2024-01-22 19:14:56 +08:00
Heyang Sun
fb91c97fe8
support for Baichuan/Baichuan2 13B Chat running speculative decoding ( #9921 )
...
* support for Baichuan/Baichuan2 13B Chat running speculative decoding
* fix stype
2024-01-22 09:11:44 +08:00
Xin Qiu
97f0cd8975
optimize Decilm 7b ( #9922 )
...
* optimize deci
* update
* decilm attension forward
2024-01-19 17:31:13 +08:00
Wang, Jian4
bcaeb05272
Update optimize qwen ( #9943 )
...
* update for n tokens input
* fix dtype
* update
2024-01-19 16:54:59 +08:00
Ruonan Wang
bf37b3a670
LLM: optimize CPU speculative decoding of chatglm3 ( #9928 )
...
* update
* fix style
* meet code review
2024-01-19 14:10:22 +08:00
Shaojun Liu
967714bac8
gguf memory optimization for mixtral ( #9939 )
2024-01-19 11:13:15 +08:00
Lilac09
7032a2ad73
Optimize gguf load memory for mistral ( #9923 )
...
* optimize gguf load for mistral
* fix output of gguf mistral
* reset
2024-01-19 09:14:39 +08:00
Shaojun Liu
9a46f019d7
gguf memory optimization for baichuan ( #9937 )
2024-01-19 09:11:02 +08:00
Guancheng Fu
2e1448f08e
[Serving] Add vllm_worker to fastchat serving framework ( #9934 )
...
* add worker
* finish
* finish
* add license
* add more comments
2024-01-18 21:33:36 +08:00
Yishuo Wang
7bbb98abb6
Disable fused layer norm when using XMX to fix mpt UT ( #9933 )
2024-01-18 16:22:12 +08:00
Wang, Jian4
1fc9dfa265
LLM: Update for Qwen n tokens inputs ( #9931 )
...
* update for n tokens inputs
* update style
* update
2024-01-18 15:56:29 +08:00
Heyang Sun
5184f400f9
Fix Mixtral GGUF Wrong Output Issue ( #9930 )
...
* Fix Mixtral GGUF Wrong Output Issue
* fix style
* fix style
2024-01-18 14:11:27 +08:00
Yishuo Wang
453df868c9
add rwkv v5 attention kernel ( #9927 )
2024-01-18 10:16:29 +08:00
Ruonan Wang
054952f82f
LLM: Fix rope of chatglm3 to support speculative decoding on CPU ( #9926 )
2024-01-18 09:28:10 +08:00
Ziteng Zhang
18cd1f1432
[LLM]Solve the problem of calling bmm operator in BF16Linear ( #9924 )
...
* Solve the problem of calling bmm operator in BF16Linear
2024-01-17 18:08:35 +08:00
Yina Chen
98b86f83d4
Support fast rope for training ( #9745 )
...
* init
* init
* fix style
* add test and fix
* address comment
* update
* merge upstream main
2024-01-17 15:51:38 +08:00
Ruonan Wang
427f75000b
LLM: fix sdp of chatglm3 ( #9917 )
...
* fix
* fix
* fix
2024-01-17 13:37:28 +08:00
Yishuo Wang
94767da7cf
optimize rwkv v4 first token performance ( #9912 )
2024-01-17 09:27:41 +08:00
Shaojun Liu
b909c5c9c2
GGUF load memory optimization ( #9913 )
...
* block-wise
* convert linear for module
* revert
* Fix PEP8 checks Error
2024-01-16 18:54:39 +08:00
Xin Qiu
dee32f7d15
copy fused rms norm's reuslt to avoid <unk> ( #9909 )
2024-01-16 16:54:08 +08:00
Ruonan Wang
8d7326ae03
LLM: fix chatglm3 sdp to support speculative decoding ( #9900 )
...
* fix chatglm3
* fix
* update
* meet code review
* fix
2024-01-16 11:29:13 +08:00
Guancheng Fu
9f34da7cdb
Update PVC XMX condition ( #9901 )
...
* update pvc xmx condition
* update condition
* update conditon
2024-01-15 15:42:15 +08:00
Yishuo Wang
6637860ddf
change xmx condition ( #9896 )
2024-01-12 19:51:48 +08:00
Ruonan Wang
d9cf55bce9
LLM: fix MLP check of mixtral ( #9891 )
2024-01-11 18:01:59 +08:00
Ziteng Zhang
4af88a67b9
support chatglm3 with bf16 ( #9888 )
...
* support chatglm3 with bigdl-bf16
2024-01-11 16:45:21 +08:00
Yuwen Hu
0aef35a965
[LLM] Improve LLM doc regarding windows gpu related info ( #9880 )
...
* Improve runtime configuration for windows
* Add python 310/311 supports for wheel downloading
* Add troubleshooting for windows gpu
* Remove manually import ipex due to auto importer
* Add info regarding cpu_embedding=True on iGPU
* More info for Windows users
* Small updates to API docs
* Python style fix
* Remove tip for loading from saved optimize_model for now
* Updated based on comments
* Update win info for multi-intel gpus selection
* Small fix
* Small fix
2024-01-11 14:37:16 +08:00
Ruonan Wang
53531ae4ee
LLM: support qkv fusion for fp8e5 ( #9878 )
...
* update
* add mistral
* meet code review
2024-01-10 17:50:00 +08:00
Lilac09
cb32b985ec
add mistral and chatglm support to vllm ( #9879 )
...
* add mistral and chatglm support to vllm
* add mistral and chatglm support to vllm
2024-01-10 15:38:42 +08:00
Ruonan Wang
3e05c9e11b
LLM: update esimd sdp kernel ( #9871 )
2024-01-09 18:10:01 +08:00
Yishuo Wang
36496d60ac
only use quantize kv cache on MTL ( #9862 )
2024-01-09 13:24:02 +08:00
ZehuaCao
146076bdb5
Support llm-awq backend ( #9856 )
...
* Support for LLM-AWQ Backend
* fix
* Update README.md
* Add awqconfig
* modify init
* update
* support llm-awq
* fix style
* fix style
* update
* fix AwqBackendPackingMethod not found error
* fix style
* update README
* fix style
---------
Co-authored-by: Uxito-Ada <414416158@qq.com>
Co-authored-by: Heyang Sun <60865256+Uxito-Ada@users.noreply.github.com>
Co-authored-by: cyita <yitastudy@gmail.com>
2024-01-09 13:07:32 +08:00
Ruonan Wang
fea6f16057
LLM: add mlp fusion for fp8e5 and update related check ( #9860 )
...
* update mlp fusion
* fix style
* update
2024-01-09 09:56:32 +08:00
Jiao Wang
3b6372ab12
Fix Llama transformers 4.36 support ( #9852 )
...
* supoort 4.36
* style
* update
* update
* update
* fix merge
* update
2024-01-08 00:32:23 -08:00
Chen, Zhentao
1b585b0d40
set fp8 default as e5m2 ( #9859 )
2024-01-08 15:53:57 +08:00
Ruonan Wang
dc995006cc
LLM: add flash attention for mistral / mixtral ( #9846 )
...
* add flash attention for mistral
* update
* add flash attn for mixtral
* fix style
2024-01-08 09:51:34 +08:00
Yishuo Wang
afaa871144
[LLM] support quantize kv cache to fp8 ( #9812 )
2024-01-08 09:28:20 +08:00
Jiao Wang
248ae7fad2
LLama optimize_model to support transformers 4.36 ( #9818 )
...
* supoort 4.36
* style
* update
* update
* update
2024-01-05 11:30:18 -08:00
Ruonan Wang
a60bda3324
LLM: update check for deepspeed ( #9838 )
2024-01-05 16:44:10 +08:00
Ruonan Wang
16433dd959
LLM: fix first token judgement of flash attention ( #9841 )
...
* fix flash attention
* meet code review
* fix
2024-01-05 13:49:37 +08:00
Yina Chen
f919f5792a
fix kv cache out of bound ( #9827 )
2024-01-05 12:38:57 +08:00
Ruonan Wang
5df31db773
LLM: fix accuracy issue of chatglm3 ( #9830 )
...
* add attn mask for first token
* fix
* fix
* change attn calculation
* fix
* fix
* fix style
* fix style
2024-01-05 10:52:05 +08:00
Xiangyu Tian
38c05be1c0
[LLM] Fix dtype mismatch in Baichuan2-13b ( #9834 )
2024-01-04 15:34:42 +08:00
Ziteng Zhang
05b681fa85
[LLM] IPEX auto importer set on by default ( #9832 )
...
* Set BIGDL_IMPORT_IPEX default to True
* Remove import intel_extension_for_pytorch as ipex from GPU example
2024-01-04 13:33:29 +08:00
Wang, Jian4
4ceefc9b18
LLM: Support bitsandbytes config on qlora finetune ( #9715 )
...
* test support bitsandbytesconfig
* update style
* update cpu example
* update example
* update readme
* update unit test
* use bfloat16
* update logic
* use int4
* set defalut bnb_4bit_use_double_quant
* update
* update example
* update model.py
* update
* support lora example
2024-01-04 11:23:16 +08:00
Ruonan Wang
20e9742fa0
LLM: fix chatglm3 issue ( #9820 )
...
* fix chatglm3 issue
* small update
2024-01-03 16:15:55 +08:00
Wang, Jian4
a54cd767b1
LLM: Add gguf falcon ( #9801 )
...
* init falcon
* update convert.py
* update style
2024-01-03 14:49:02 +08:00
Qiyuan Gong
f0f9d45eac
[LLM] IPEX import support bigdl-core-xe-21 ( #9769 )
...
Add support for bigdl-core-xe-21.
2023-12-28 15:23:58 +08:00
Guancheng Fu
5857a38321
[vLLM] Add option to adjust KV_CACHE_ALLOC_BLOCK_LENGTH ( #9782 )
...
* add option kv_cache_block
* change var name
2023-12-28 14:41:47 +08:00
Ruonan Wang
99bddd3ab4
LLM: better FP16 support for Intel GPUs ( #9791 )
...
* initial support
* fix
* fix style
* fix
* limi esimd usage condition
* refactor code
* fix style
* small fix
* meet code review
* small fix
2023-12-28 13:30:13 +08:00
Yishuo Wang
7d9f6c6efc
fix cpuinfo error ( #9793 )
2023-12-28 09:23:44 +08:00
Wang, Jian4
7ed9538b9f
LLM: support gguf mpt ( #9773 )
...
* add gguf mpt
* update
2023-12-28 09:22:39 +08:00
Cengguang Zhang
d299f108d0
update falcon attention forward. ( #9796 )
2023-12-28 09:11:59 +08:00
Kai Huang
689889482c
Reduce max_cache_pos to reduce Baichuan2-13B memory ( #9694 )
...
* optimize baichuan2 memory
* fix
* style
* fp16 mask
* disable fp16
* fix style
* empty cache
* revert empty cache
2023-12-26 19:51:25 +08:00
Xiangyu Tian
0ea842231e
[LLM] vLLM: Add api_server entrypoint ( #9783 )
...
Add vllm.entrypoints.api_server for benchmark_serving.py in vllm.
2023-12-26 16:03:57 +08:00
Ruonan Wang
11d883301b
LLM: fix wrong batch output caused by flash attention ( #9780 )
...
* fix
* meet code review
* move batch size check to the beginning
* move qlen check inside function
* meet code review
2023-12-26 09:41:27 +08:00
Heyang Sun
66e286a73d
Support for Mixtral AWQ ( #9775 )
...
* Support for Mixtral AWQ
* Update README.md
* Update README.md
* Update awq_config.py
* Update README.md
* Update README.md
2023-12-25 16:08:09 +08:00
Ruonan Wang
1917bbe626
LLM: fix BF16Linear related training & inference issue ( #9755 )
...
* fix bf16 related issue
* fix
* update based on comment & add arc lora script
* update readme
* update based on comment
* update based on comment
* update
* force to bf16
* fix style
* move check input dtype into function
* update convert
* meet code review
* meet code review
* update merged model to support new training_mode api
* fix typo
2023-12-25 14:49:30 +08:00
Xiangyu Tian
30dab36f76
[LLM] vLLM: Fix kv cache init ( #9771 )
...
Fix kv cache init
2023-12-25 14:17:06 +08:00
Yina Chen
449b387125
Support relora in bigdl-llm ( #9687 )
...
* init
* fix style
* update
* support resume & update readme
* update
* update
* remove important
* add training mode
* meet comments
2023-12-25 14:04:28 +08:00
Ziteng Zhang
986f65cea9
[LLM] Add trust_remote_code for local renamed model in bigdl_llm_model.py ( #9762 )
2023-12-25 11:31:14 +08:00
Guancheng Fu
daf536fb2d
vLLM: Apply attention optimizations for selective batching ( #9758 )
...
* fuse_rope for prefil
* apply kv_cache optimizations
* apply fast_decoding_path
* Re-enable kv_cache optimizations for prefill
* reduce KV_CACHE_ALLOC_BLOCK for selective_batching
2023-12-25 10:29:31 +08:00
Qiyuan Gong
4c487313f2
Revert "[LLM] IPEX auto importer turn on by default for XPU ( #9730 )" ( #9759 )
...
This reverts commit 0284801fbd .
2023-12-22 16:38:24 +08:00
Qiyuan Gong
0284801fbd
[LLM] IPEX auto importer turn on by default for XPU ( #9730 )
...
* Set BIGDL_IMPORT_IPEX default to true, i.e., auto import IPEX for XPU.
* Remove import intel_extension_for_pytorch as ipex from GPU example.
* Add support for bigdl-core-xe-21.
2023-12-22 16:20:32 +08:00
Guancheng Fu
fdf93c9267
Implement selective batching for vLLM ( #9659 )
...
* add control to load hf model
* finish initial version of selective_batching
* temp
* finish
* Remove print statement
* fix error
* Apply yang's optimization
* a version that works
* We need to check kv_cache passed in, this could be an error. TODO: add fast decoding path
* format
* temp solution: not batching prefill requests
* a version that works for prefill batching
* format
* a solid version: works normally
* a temp version
* Solid version: remove redundant functions
* fix format
* format
* solid: add option to enable selective_batching
* remove logic for using transformer models
* format
* format
* solid: enable argument VLLM_ENABLE_SELECTIVE_BATCHING
* format
* finish
* format
2023-12-22 13:45:46 +08:00
Ruonan Wang
2f36769208
LLM: bigdl-llm lora support & lora example ( #9740 )
...
* lora support and single card example
* support multi-card, refactor code
* fix model id and style
* remove torch patch, add two new class for bf16, update example
* fix style
* change to training_mode
* small fix
* add more info in help
* fixstyle, update readme
* fix ut
* fix ut
* Handling compatibility issues with default LoraConfig
2023-12-22 11:05:39 +08:00
SONG Ge
ba0b939579
[LLM] Support transformers-v4.36.0 on mistral model ( #9744 )
...
* add support transformers-v4.36.0 on mistral model
* python/llm/src/bigdl/llm/transformers/models/mistral.py
* make the redundant implementation as utils
* fix code style
* fix
* fix style
* update with utils enough_kv_room
2023-12-22 09:59:27 +08:00
Xin Qiu
e36111e713
mixstral fused qkv and rope ( #9724 )
...
* mixstral fused qkv and rope
* fix and clean
* fix style
* update
* update
* fix
* update
* fix
2023-12-22 09:26:35 +08:00
Jiao Wang
e4f6e43675
safetenor to false ( #9728 )
2023-12-21 14:41:51 -08:00
Yishuo Wang
426660b88e
simplify qwen attention ( #9747 )
2023-12-21 17:53:29 +08:00
Wang, Jian4
984697afe2
LLM: Add bloom gguf support ( #9734 )
...
* init
* update bloom add merges
* update
* update readme
* update for llama error
* update
2023-12-21 14:06:25 +08:00
Heyang Sun
df775cf316
fix python style ( #9742 )
...
* fix python style
* fix
* fix
2023-12-21 11:25:05 +08:00
Xin Qiu
6c3e698bf1
mistral decoding_fast_path and fused mlp ( #9714 )
...
* mistral decoding_fast_path and fused mlp
* meet code review
2023-12-21 10:11:37 +08:00
Heyang Sun
d157f623b6
Load Mixtral gguf in a block-wise way ( #9725 )
...
* Load Mixtral gguf in a block-wise way
* refine
2023-12-21 10:03:23 +08:00
Zhao Changmin
4bda975a3e
LLM: Align lowbit model config ( #9735 )
...
* align lowbit model config
2023-12-21 09:48:58 +08:00
Wang, Jian4
e1e921f425
LLM: gguf other model using dtype ( #9729 )
2023-12-21 09:33:40 +08:00
Yishuo Wang
13ea6330bd
optimize qwen rope ( #9737 )
2023-12-20 17:34:34 +08:00
Ziteng Zhang
4c032a433e
[LLM] Add glibc checker ( #9624 )
...
* Add glibc checker
* Add env BIGDL_GLIBC_CHECK to control glibc checker. The default is false, i.e., don't check.
2023-12-20 16:52:43 +08:00
Yina Chen
cd652a1710
Support fp8 e5m2 on arc ( #9711 )
...
* init
* fix style
* update
* fix style
* update
2023-12-20 16:26:17 +08:00
Yishuo Wang
e54c428d30
add bf16/fp16 fuse mlp support ( #9726 )
2023-12-20 10:40:45 +08:00
Heyang Sun
612651cb5d
fix typo ( #9723 )
2023-12-20 09:41:59 +08:00
Yishuo Wang
522cf5ed82
[LLM] Improve chatglm2/3 rest token performance with long context ( #9716 )
2023-12-19 17:29:38 +08:00
Yishuo Wang
f2e6abb563
fix mlp batch size check ( #9718 )
2023-12-19 14:22:22 +08:00
Heyang Sun
1fa7793fc0
Load Mixtral GGUF Model ( #9690 )
...
* Load Mixtral GGUF Model
* refactor
* fix empty tensor when to cpu
* update gpu and cpu readmes
* add dtype when set tensor into module
2023-12-19 13:54:38 +08:00
Qiyuan Gong
d0a3095b97
[LLM] IPEX auto importer ( #9706 )
...
* IPEX auto importer and get_ipex_version.
* Add BIGDL_IMPORT_IPEX to control auto import, default is false.
2023-12-19 13:39:38 +08:00
Yang Wang
f4fb58d99c
fusing qkv project and rope ( #9612 )
...
* Try fusing qkv project and rope
* add fused mlp
* fuse append cache
* fix style and clean up code
* clean up
2023-12-18 16:45:00 -08:00
Cengguang Zhang
4d22add4af
LLM: fix qwen efficiency issue in perf-test.
2023-12-18 18:32:54 +08:00
Ruonan Wang
8ed89557e5
LLM: add mlp optimization of mixtral ( #9709 )
2023-12-18 16:59:52 +08:00
Xin Qiu
320110d158
handle empty fused norm result ( #9688 )
...
* handle empty fused norm result
* remove fast_rms_norm
* fix style
2023-12-18 09:56:11 +08:00
SONG Ge
d5b81af7bd
Support mixtral attention optimization on transformers-v4.36.0 ( #9674 )
...
* add example code to support mistral/mixtral attention on transformers v4.36.0
* update
* style fix
* add update for seen-tokens
* support mixtral
* rm mistral change
* small fix
* add more comments and remove use_cache part
---------
Co-authored-by: plusbang <binbin1.deng@intel.com>
2023-12-15 14:30:23 +08:00
Cengguang Zhang
adbef56001
LLM: update qwen attention forward. ( #9695 )
...
* feat: update qwen attention forward.
* fix: style.
2023-12-15 14:06:15 +08:00
Wang, Jian4
b8437a1c1e
LLM: Add gguf mistral model support ( #9691 )
...
* add mistral support
* need to upgrade transformers version
* update
2023-12-15 13:37:39 +08:00
Wang, Jian4
496bb2e845
LLM: Support load BaiChuan model family gguf model ( #9685 )
...
* support baichuan model family gguf model
* update gguf generate.py
* add verify models
* add support model_family
* update
* update style
* update type
* update readme
* update
* remove support model_family
2023-12-15 13:34:33 +08:00
Yishuo Wang
9a330bfc2b
fix fuse mlp when using q5_0 or fp8 ( #9689 )
2023-12-14 16:16:05 +08:00
Xin Qiu
5e46e0e5af
fix baichuan2-7b 1st token performance regression on xpu ( #9683 )
...
* fix baichuan2-7b 1st token performance regression
* add comments
* fix style
2023-12-14 09:58:32 +08:00
Yishuo Wang
09ca540f9b
use fuse mlp in qwen ( #9672 )
2023-12-13 17:20:08 +08:00
Ruonan Wang
c7741c4e84
LLM: update moe block convert to optimize rest token latency of Mixtral ( #9669 )
...
* update moe block convert
* further accelerate final_hidden_states
* fix style
* fix style
2023-12-13 16:17:06 +08:00
Xiangyu Tian
1c6499e880
[LLM] vLLM: Support Mixtral Model ( #9670 )
...
Add Mixtral support for BigDL vLLM.
2023-12-13 14:44:47 +08:00
Ruonan Wang
dc5b1d7e9d
LLM: integrate sdp kernel for FP16 rest token inference on GPU [DG2/ATSM] ( #9633 )
...
* integrate sdp
* update api
* fix style
* meet code review
* fix
* distinguish mtl from arc
* small fix
2023-12-13 11:29:57 +08:00
Qiyuan Gong
5b0e7e308c
[LLM] Add support for empty activation ( #9664 )
...
* Add support for empty activation, e.g., [0, 4096]. Empty activation is allowed by PyTorch.
* Add comments.
2023-12-13 11:07:45 +08:00
SONG Ge
284e7697b1
[LLM] Optimize ChatGLM2 kv_cache to support beam_search on ARC ( #9579 )
...
* optimize kv_cache to support beam_search on Arc
* correctness test update
* fix query_length issue
* simplify implementation
* only enable the optimization on gpu device
* limit the beam_search support only enabled with gpu device and batch_size > 1
* add comments for beam_search case and revert ut change
* meet comments
* add more comments to describe the differece between multi-cases
2023-12-13 11:02:14 +08:00
Ziteng Zhang
8931f2eb62
[LLM] Fix transformer qwen size mismatch and rename causal_mask ( #9655 )
...
* Fix size mismatching caused by context_layer
* Change registered_causal_mask to causal_mask
2023-12-12 20:57:40 +08:00
binbin Deng
59ce86d292
LLM: support optimize_model=True for Mixtral model ( #9657 )
2023-12-12 16:41:26 +08:00
Heyang Sun
9f02f96160
[LLM] support for Yi AWQ model ( #9648 )
2023-12-11 14:07:34 +08:00
Xin Qiu
82255f9726
Enable fused layernorm ( #9614 )
...
* bloom layernorm
* fix
* layernorm
* fix
* fix
* fix
* style fix
* fix
* replace nn.LayerNorm
2023-12-11 09:26:13 +08:00
Yina Chen
70f5e7bf0d
Support peft LoraConfig ( #9636 )
...
* support peft loraconfig
* use testcase to test
* fix style
* meet comments
2023-12-08 16:13:03 +08:00
Xin Qiu
0b6f29a7fc
add fused rms norm for Yi and Qwen ( #9640 )
2023-12-08 16:04:38 +08:00
Xin Qiu
5636b0ba80
set new linear status ( #9639 )
2023-12-08 11:02:49 +08:00
Yuwen Hu
6f34978b94
[LLM] Add more performance tests for win iGPU (more in-out pairs, RWKV model) ( #9626 )
...
* Add supports for loading rwkv models using from_pretrained api
* Temporarily enable pr tests
* Add RWKV in tests and more in-out pairs
* Add rwkv for 512 tests
* Make iterations smaller
* Change back to nightly trigger
2023-12-07 18:55:16 +08:00
Ruonan Wang
d9b0c01de3
LLM: fix unlora module in qlora finetune ( #9621 )
...
* fix unlora module
* split train and inference
2023-12-07 16:32:02 +08:00
Yishuo Wang
7319f2c227
use fused mlp in baichuan2 ( #9620 )
2023-12-07 15:50:57 +08:00
Xiangyu Tian
deee65785c
[LLM] vLLM: Delete last_kv_cache before prefilling ( #9619 )
...
Remove last_kv_cache before prefilling to reduce peak memory usage.
2023-12-07 11:32:33 +08:00
Xiangyu Tian
0327169b50
[LLM] vLLM: fix memory leak in prepare_kv_cache ( #9616 )
...
Revert modification in prepare_kv_cache to fix memory leak.
2023-12-07 10:08:18 +08:00
Xin Qiu
13d47955a8
use fused rms norm in chatglm2 and baichuan ( #9613 )
...
* use fused rms norm in chatglm2 and baichuan
* style fix
2023-12-07 09:21:41 +08:00
Yina Chen
404e101ded
QALora example ( #9551 )
...
* Support qa-lora
* init
* update
* update
* update
* update
* update
* update merge
* update
* fix style & update scripts
* update
* address comments
* fix typo
* fix typo
---------
Co-authored-by: Yang Wang <yang3.wang@intel.com>
2023-12-06 15:36:21 +08:00
Guancheng Fu
6978b2c316
[VLLM] Change padding patterns for vLLM & clean code ( #9609 )
...
* optimize
* fix minor error
* optimizations
* fix style
2023-12-06 15:27:26 +08:00
Zheng, Yi
d154b38bf9
Add llama2 gpu low memory example ( #9514 )
...
* Add low memory example
* Minor fixes
* Update readme.md
2023-12-05 17:29:48 +08:00
Ziteng Zhang
65934c9f4f
[LLM] Fix Qwen causal_mask and attention_mask size mismatching ( #9600 )
...
* Fix #9582 , caused by Qwen modified modeling_qwen.py 7f62181c94 (d2h-049182)
2023-12-05 15:15:54 +08:00
Qiyuan Gong
f211f136b6
Configurable TORCH_LINEAR_THRESHOLD from env ( #9588 )
...
* Add TORCH_LINEAR_THRESHOLD from env (BIGDL_LLM_LINEAR_THRESHOLD)
* Change default to 512
2023-12-05 13:19:47 +08:00
Xiangyu Tian
5c03651309
[LLM] vLLM: Add Preempt for scheduler ( #9568 )
...
Implement Preempt_by_recompute method for vllm.
2023-12-03 20:16:25 +08:00
Xin Qiu
69c49d21f5
use fused rms norm ( #9572 )
...
* use fused rms norm
* meet code review
2023-11-30 21:47:41 +08:00
Yishuo Wang
7f6465518a
support loading llama tokenizer from gguf model ( #9565 )
2023-11-30 14:56:12 +08:00
Yuwen Hu
34503efa6a
Fix cpu pinned embedding ( #9556 )
2023-11-29 18:27:56 +08:00
binbin Deng
4ff2ca9d0d
LLM: fix loss error on Arc ( #9550 )
2023-11-29 15:16:18 +08:00
Yishuo Wang
65121c7997
support loading q4_1/q5_0/q5_1/q8_0 gguf model ( #9546 )
2023-11-29 14:40:37 +08:00
Yuwen Hu
5f5ca38b74
[LLM Doc] Fix api doc rendering error ( #9542 )
...
* Fix api rendering error
* Fix python style
2023-11-29 09:17:09 +08:00
Yishuo Wang
a86c6e0b56
[LLM] support loading gguf model ( #9544 )
2023-11-28 15:51:15 +08:00
Xiangyu Tian
916c338772
fix bugs in vllm length check ( #9543 )
2023-11-28 11:09:54 +08:00
Zhao Changmin
e7e0cd3b5e
CPU Pinned embedding Layer ( #9538 )
...
* CPU Pinned embedding
2023-11-28 09:46:31 +08:00
Guancheng Fu
963a5c8d79
Add vLLM-XPU version's README/examples ( #9536 )
...
* test
* test
* fix last kv cache
* add xpu readme
* remove numactl for xpu example
* fix link error
* update max_num_batched_tokens logic
* add explaination
* add xpu environement version requirement
* refine gpu memory
* fix
* fix style
2023-11-28 09:44:03 +08:00
Guancheng Fu
b6c3520748
Remove xformers from vLLM-CPU ( #9535 )
2023-11-27 11:21:25 +08:00
binbin Deng
6bec0faea5
LLM: support Mistral AWQ models ( #9520 )
2023-11-24 16:20:22 +08:00
Ruonan Wang
914a5a5a27
LLM: fix abnormal Mistral GPU accuracy by updating rms_norm ( #9529 )
2023-11-24 15:37:50 +08:00
SONG Ge
3d24823cda
hot-fix mistral kv_cache ( #9528 )
2023-11-24 14:33:04 +08:00
Zhao Changmin
42b7a16bc5
Replace torch.bmm with safe_bmm ( #9519 )
...
* replace bmm with safe one
* rename args and deprecated warning
2023-11-24 12:16:48 +08:00
Ruonan Wang
b63aae8a8e
LLM: add flash attention support for llama ( #9518 )
...
* add initial flash attention for llama
* accelerate fp32 first token by changing to fp16 in advance
* support fp32
2023-11-23 18:40:18 +08:00
Guancheng Fu
bf579507c2
Integrate vllm ( #9310 )
...
* done
* Rename structure
* add models
* Add structure/sampling_params,sequence
* add input_metadata
* add outputs
* Add policy,logger
* add and update
* add parallelconfig back
* core/scheduler.py
* Add llm_engine.py
* Add async_llm_engine.py
* Add tested entrypoint
* fix minor error
* Fix everything
* fix kv cache view
* fix
* fix
* fix
* format&refine
* remove logger from repo
* try to add token latency
* remove logger
* Refine config.py
* finish worker.py
* delete utils.py
* add license
* refine
* refine sequence.py
* remove sampling_params.py
* finish
* add license
* format
* add license
* refine
* refine
* Refine line too long
* remove exception
* so dumb style-check
* refine
* refine
* refine
* refine
* refine
* refine
* add README
* refine README
* add warning instead error
* fix padding
* add license
* format
* format
* format fix
* Refine vllm dependency (#1 )
vllm dependency clear
* fix licence
* fix format
* fix format
* fix
* adapt LLM engine
* fix
* add license
* fix format
* fix
* Moving README.md to the correct position
* Fix readme.md
* done
* guide for adding models
* fix
* Fix README.md
* Add new model readme
* remove ray-logic
* refactor arg_utils.py
* remove distributed_init_method logic
* refactor entrypoints
* refactor input_metadata
* refactor model_loader
* refactor utils.py
* refactor models
* fix api server
* remove vllm.stucture
* revert by txy 1120
* remove utils
* format
* fix license
* add bigdl model
* Refer to a specfic commit
* Change code base
* add comments
* add async_llm_engine comment
* refine
* formatted
* add worker comments
* add comments
* add comments
* fix style
* add changes
---------
Co-authored-by: xiangyuT <xiangyu.tian@intel.com>
Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2023-11-23 16:46:45 +08:00
Qiyuan Gong
0f0c6bb631
[LLM] Fix Qwen registered_causal_mask is None ( #9513 )
...
* Add registered_causal_mask init based on 2abd8e5777 .
2023-11-23 09:28:04 +08:00
Ruonan Wang
076d106ef5
LLM: GPU QLoRA update to bf16 to accelerate gradient checkpointing ( #9499 )
...
* update to bf16 to accelerate gradient checkpoint
* add utils and fix ut
2023-11-21 17:08:36 +08:00
Xin Qiu
50b01058f1
enable new q4_1 ( #9479 )
2023-11-17 14:58:57 +08:00
Zhao Changmin
30abd304a7
LLM: Fix baichuan pre-normalize model tensor assigning issue when loading ( #9481 )
...
* No need to normalized when loading
2023-11-16 21:57:28 +08:00
Ruonan Wang
c0ef70df02
llm: quick fix of fast_rms_norm ( #9480 )
2023-11-16 14:42:16 +08:00