Yuxuan Xia
f36224aac4
Fix ceval run.sh ( #10410 )
2024-03-14 10:57:25 +08:00
ZehuaCao
f66329e35d
Fix multiple get_enable_ipex function error ( #10400 )
...
* fix multiple get_enable_ipex function error
* remove get_enable_ipex_low_bit function
2024-03-14 10:14:13 +08:00
Kai Huang
76e30d8ec8
Empty cache for lm_head ( #10317 )
...
* empty cache
* add comments
2024-03-13 20:31:53 +08:00
Ruonan Wang
2be8bbd236
LLM: add cpp option in setup.py ( #10403 )
...
* add llama_cpp option
* meet code review
2024-03-13 20:12:59 +08:00
Ovo233
0dbce53464
LLM: Add decoder/layernorm unit tests ( #10211 )
...
* add decoder/layernorm unit tests
* update tests
* delete decoder tests
* address comments
* remove none type check
* restore nonetype checks
* delete nonetype checks; add decoder tests for Llama
* add gc
* deal with tuple output
2024-03-13 19:41:47 +08:00
Yishuo Wang
06a851afa9
support new baichuan model ( #10404 )
2024-03-13 17:45:50 +08:00
Yuxuan Xia
a90e9b6ec2
Fix C-Eval Workflow ( #10359 )
...
* Fix Baichuan2 prompt format
* Fix ceval workflow errors
* Fix ceval workflow error
* Fix ceval error
* Fix ceval error
* Test ceval
* Fix ceval
* Fix ceval
* Fix ceval
* Fix ceval
* Fix ceval
* Fix ceval
* Fix ceval
* Fix ceval
* Fix ceval
* Fix ceval
* Fix ceval
* Fix ceval
* Add ceval dependency test
* Fix ceval
* Fix ceval
* Test full ceval
* Test full ceval
* Fix ceval
* Fix ceval
2024-03-13 17:23:17 +08:00
Yishuo Wang
b268baafd6
use fp8 sdp in llama ( #10396 )
2024-03-13 16:45:38 +08:00
Xiangyu Tian
60043a3ae8
LLM: Support Baichuan2-13b in BigDL-vLLM ( #10398 )
...
Support Baichuan2-13b in BigDL-vLLM.
2024-03-13 16:21:06 +08:00
Xiangyu Tian
e10de2c42d
[Fix] LLM: Fix condition check error for speculative decoding on CPU ( #10402 )
...
Fix condition check error for speculative decoding on CPU
2024-03-13 16:05:06 +08:00
Keyan (Kyrie) Zhang
f158b49835
[LLM] Recover arc ut test for Falcon ( #10385 )
2024-03-13 13:31:35 +08:00
Heyang Sun
d72c0fad0d
Qwen2 SDPA forward on CPU ( #10395 )
...
* Fix Qwen1.5 CPU forward
* Update convert.py
* Update qwen2.py
2024-03-13 13:10:03 +08:00
Yishuo Wang
ca58a69b97
fix arc rms norm UT ( #10394 )
2024-03-13 13:09:15 +08:00
Wang, Jian4
0193f29411
LLM: Enable gguf float16 and Yuan2 model ( #10372 )
...
* enable float16
* add yuan files
* enable yuan
* enable set low_bit on yuan2
* update
* update license
* update generate
* update readme
* update python style
* update
2024-03-13 10:19:18 +08:00
Yina Chen
f5d65203c0
First token lm_head optimization ( #10318 )
...
* add lm head linear
* update
* address comments and fix style
* address comment
2024-03-13 10:11:32 +08:00
Keyan (Kyrie) Zhang
7cf01e6ec8
Add LangChain upstream ut test ( #10349 )
...
* Add LangChain upstream ut test
* Add LangChain upstream ut test
* Specify version numbers in yml script
* Correct langchain-community version
2024-03-13 09:52:45 +08:00
Xin Qiu
28c4a8cf5c
Qwen fused qkv ( #10368 )
...
* fused qkv + rope for qwen
* quantized kv cache
* fix
* update qwen
* fixed quantized qkv
* fix
* meet code review
* update split
* convert.py
* extend when no enough kv
* fix
2024-03-12 17:39:00 +08:00
Yishuo Wang
741c2bf1df
use new rms norm ( #10384 )
2024-03-12 17:29:51 +08:00
Xiangyu Tian
0ded0b4b13
LLM: Enable BigDL IPEX optimization for int4 ( #10319 )
...
Enable BigDL IPEX optimization for int4
2024-03-12 17:08:50 +08:00
binbin Deng
5d7e044dbc
LLM: add low bit option in deepspeed autotp example ( #10382 )
2024-03-12 17:07:09 +08:00
binbin Deng
df3bcc0e65
LLM: remove english_quotes dataset ( #10370 )
2024-03-12 16:57:40 +08:00
Zhao Changmin
df2b84f7de
Enable kv cache on arc batch ( #10308 )
2024-03-12 16:46:04 +08:00
Lilac09
5809a3f5fe
Add run-hbm.sh & add user guide for spr and hbm ( #10357 )
...
* add run-hbm.sh
* add spr and hbm guide
* only support quad mode
* only support quad mode
* update special cases
* update special cases
2024-03-12 16:15:27 +08:00
binbin Deng
5d996a5caf
LLM: add benchmark script for deepspeed autotp on gpu ( #10380 )
2024-03-12 15:19:57 +08:00
Keyan (Kyrie) Zhang
f9c144dc4c
Fix final logits ut failure ( #10377 )
...
* Fix final logits ut failure
* Fix final logits ut failure
* Remove Falcon from completion test for now
* Remove Falcon from unit test for now
2024-03-12 14:34:01 +08:00
Guancheng Fu
cc4148636d
[FastChat-integration] Add initial implementation for loader ( #10323 )
...
* add initial implementation for loader
* add test method for model_loader
* data
* Refine
2024-03-12 10:54:59 +08:00
WeiguangHan
17bdb1a60b
LLM: add whisper models into nightly test ( #10193 )
...
* LLM: add whisper models into nightly test
* small fix
* small fix
* add more whisper models
* test all cases
* test specific cases
* collect the csv
* store the result
* to html
* small fix
* small test
* test all cases
* modify whisper_csv_to_html
2024-03-11 20:00:47 +08:00
binbin Deng
dbcfc5c2fa
LLM: fix error of 'AI-ModelScope/phi-2' hosted by ModelScope hub ( #10364 )
2024-03-11 16:19:17 +08:00
binbin Deng
fe27a6971c
LLM: update modelscope version ( #10367 )
2024-03-11 16:18:27 +08:00
Chen, Zhentao
a425eaabfc
fix from_pretrained when device_map=None ( #10361 )
...
* pr trigger
* fix error when device_map=None
* fix device_map=None
2024-03-11 16:06:12 +08:00
Yina Chen
d7b765fd3f
serving xpu memory opt ( #10358 )
2024-03-11 15:21:22 +08:00
Ruonan Wang
be29833b2b
LLM: fix qwen2 ( #10356 )
2024-03-11 09:29:08 +08:00
Zhicun
9026c08633
Fix llamaindex AutoTokenizer bug ( #10345 )
...
* fix tokenizer
* fix AutoTokenizer bug
* modify code style
2024-03-08 16:24:50 +08:00
Zhicun
2a10b53d73
rename docqa.py->rag.py ( #10353 )
2024-03-08 16:07:09 +08:00
Keyan (Kyrie) Zhang
f1825d7408
Add RMSNorm unit test ( #10190 )
2024-03-08 15:51:03 +08:00
Shengsheng Huang
370c52090c
Langchain readme ( #10348 )
...
* update langchain readme
* update readme
* create new README
* Update README_nativeint4.md
2024-03-08 14:57:24 +08:00
Keyan (Kyrie) Zhang
7a621a4db0
Fix device_map bug by raise an error when using device_map=xpu ( #10340 )
...
* Fix device_map bug by raise an error when using device_map=xpu
* Fix sync error
* Fix python style
* Use invalidInputError instead of invalidOperationError
2024-03-08 13:38:52 +08:00
Yishuo Wang
1ac193ba02
add rope theta argument ( #10343 )
2024-03-07 17:27:19 +08:00
Yuxuan Xia
0c8d3c9830
Add C-Eval HTML report ( #10294 )
...
* Add C-Eval HTML report
* Fix C-Eval workflow pr trigger path
* Fix C-Eval workflow typos
* Add permissions to C-Eval workflow
* Fix C-Eval workflow typo
* Add pandas dependency
* Fix C-Eval workflow typo
2024-03-07 16:44:49 +08:00
Cengguang Zhang
496d18ab6d
LLM: add quantize kv cache support for baichuan 7b and 13b. ( #10330 )
...
* add quantize kv cache for baichuan 7b and 13b.
* fix typo.
* fix.
* fix style.
* fix style.
2024-03-07 16:17:38 +08:00
hxsz1997
b7db21414e
Update llamaindex ut ( #10338 )
...
* add test_llamaindex of gpu
* add llamaindex gpu tests bash
* add llamaindex cpu tests bash
* update name of Run LLM langchain GPU test
* import llama_index in llamaindex gpu ut
* update the dependency of test_llamaindex
* add Run LLM llamaindex GPU test
* modify import dependency of llamaindex cpu test
* add Run LLM llamaindex test
* update llama_model_path
* delete unused model path
* add LLAMA2_7B_ORIGIN_PATH in llamaindex cpu test
2024-03-07 10:06:16 +08:00
ZehuaCao
267de7abc3
fix fschat DEP version error ( #10325 )
2024-03-06 16:15:27 +08:00
Yina Chen
9ea499ca68
Optimize speculative decoding PVC memory usage ( #10329 )
...
* optimize memory
* update
* update
* update
* support other models
* update
* fix style
2024-03-06 09:54:21 +08:00
dingbaorong
cc796848ea
fix typos ( #10274 )
...
Co-authored-by: Ariadne <wyn2000330@126.com>
2024-03-05 18:38:22 +08:00
hxsz1997
af11c53473
Add the installation step of postgresql and pgvector on windows in LlamaIndex GPU support ( #10328 )
...
* add the installation of postgresql and pgvector of windows
* fix some format
2024-03-05 18:31:19 +08:00
Yishuo Wang
0011ff9f64
optimize bge large performance ( #10324 )
2024-03-05 17:06:03 +08:00
Shaojun Liu
178eea5009
upload bigdl-llm wheel to sourceforge for backup ( #10321 )
...
* test: upload to sourceforge
* update scripts
* revert
2024-03-05 16:36:01 +08:00
Cengguang Zhang
30d009bca7
LLM: support quantized kv cache for Mistral in transformers >=4.36.0 ( #10326 )
...
* support quantize kv for mistral in transformers 4.36
* update mistral support.
* fix style.
2024-03-05 16:23:50 +08:00
dingbaorong
1e6f0c6f1a
Add llamaindex gpu example ( #10314 )
...
* add llamaindex example
* fix core dump
* refine readme
* add troubleshooting
* refine readme
---------
Co-authored-by: Ariadne <wyn2000330@126.com>
2024-03-05 13:36:00 +08:00
dingbaorong
fc7f10cd12
add langchain gpu example ( #10277 )
...
* first draft
* fix
* add readme for transformer_int4_gpu
* fix doc
* check device_map
* add arc ut test
* fix ut test
* fix langchain ut
* Refine README
* fix gpu mem too high
* fix ut test
---------
Co-authored-by: Ariadne <wyn2000330@126.com>
2024-03-05 13:33:57 +08:00
Yuwen Hu
5dbbe1a826
[LLM] Support for new arc ut runner ( #10311 )
...
* Support for new arc ut runner
* Comment unnecessary OMP_NUM_THREADS related settings for arc uts
2024-03-04 18:42:02 +08:00
Yuwen Hu
d45e577d8c
[LLM] Test load_low_bit in iGPU perf test on Windows ( #10313 )
2024-03-04 18:03:57 +08:00
WeiguangHan
fd81d66047
LLM: Compress some models to save space ( #10315 )
...
* LLM: compress some models to save space
* add deleted comments
2024-03-04 17:53:03 +08:00
Shaojun Liu
bab2ee5f9e
update nightly spr perf test ( #10178 )
...
* update nightly spr perf test
* update
* update runner label
* update
* update
* update folder
* revert
2024-03-04 13:46:33 +08:00
Cengguang Zhang
ab9fc2485f
LLM: add quantize kv support for llama transformer 4.36 ( #10298 )
...
* add quantize kv support for llama transformer 4.36
* fix style.
* fix style.
2024-03-04 10:33:35 +08:00
Xin Qiu
58208a5883
Update FAQ document. ( #10300 )
...
* Update install_gpu.md
* Update resolve_error.md
* Update README.md
* Update resolve_error.md
* Update README.md
* Update resolve_error.md
2024-03-04 08:35:11 +08:00
Yuwen Hu
27d9a14989
[LLM] all-on-one update: memory optimize and streaming output ( #10302 )
...
* Memory saving for continuous in-out pair run and add support for streaming output on MTL iGPU
* Small fix
* Small fix
* Add things back
2024-03-01 18:02:30 +08:00
SONG Ge
0ab40917fb
[LLM] Split merged_qk to separated q/k linear ( #10299 )
...
* modify merge_qk_linear to separated q/k linear
* update
2024-03-01 16:48:55 +08:00
Yang Wang
f4d7dbcde2
use fused qkv forward in qwen2 ( #10185 )
...
* use fused qkv forward in qwen2
* support both
* fix style
* fix rope
* remove print
* fix style
* clean up
2024-03-01 16:46:35 +08:00
Xin Qiu
509e206de0
update doc about gemma random and unreadable output. ( #10297 )
...
* Update install_gpu.md
* Update README.md
* Update README.md
2024-03-01 15:41:16 +08:00
Wang, Jian4
beb9433cec
LLM: Reduce speculative _ipex_optimize_model memory use ( #10281 )
...
* use tpp
* update ipex
2024-03-01 13:48:23 +08:00
Yuwen Hu
f0ff0eebe1
[LLM] Support quantize kv cache for Baichuan2 7B ( #10280 )
...
* Add quantized kv cache framework for Baichuan2 7B
* Support quantize kv cache for baichuan2
* Small fix
* Fix python style
2024-03-01 13:35:42 +08:00
SONG Ge
273de341d7
hot-fix silu error import ( #10292 )
2024-03-01 10:11:37 +08:00
Shengsheng Huang
bcfad555df
revise llamaindex readme ( #10283 )
2024-02-29 17:19:23 +08:00
Xin Qiu
232273a1b5
Enable Gemma fused mlp + Gelu ( #10276 )
...
* update llama mlp forward
* add all
* fix style check
* split
* update
* update
* update
* fix style
2024-02-29 16:53:24 +08:00
Guancheng Fu
2d930bdca8
Add vLLM bf16 support ( #10278 )
...
* add argument load_in_low_bit
* add docs
* modify gpu doc
* done
---------
Co-authored-by: ivy-lv11 <lvzc@lamda.nju.edu.cn>
2024-02-29 16:33:42 +08:00
SONG Ge
13b0bc9075
[LLM] Add quantize_kv optimization for yuan2 model ( #10243 )
...
* add initial quantize_kv support for yuan2 model
* fix yuan2 quantize_kv generation
* apply fp16 conv layer optimizations
* disable mlp for quantize_kv
2024-02-29 16:33:26 +08:00
Zhicun
4e6cc424f1
Add LlamaIndex RAG ( #10263 )
...
* run demo
* format code
* add llamaindex
* add custom LLM with bigdl
* update
* add readme
* begin ut
* add unit test
* add license
* add license
* revised
* update
* modify docs
* remove data folder
* update
* modify prompt
* fixed
* fixed
* fixed
2024-02-29 15:21:19 +08:00
Jin Qiao
5d7243067c
LLM: add Baichuan2-13B-Chat 2048-256 to MTL perf ( #10273 )
2024-02-29 13:48:55 +08:00
Ruonan Wang
a9fd20b6ba
LLM: Update qkv fusion for GGUF-IQ2 ( #10271 )
...
* first commit
* update mistral
* fix transformers==4.36.0
* fix
* disable qk for mixtral now
* fix style
2024-02-29 12:49:53 +08:00
Jiao Wang
6fb65bb9d2
fix in transformers 4.36 ( #10150 )
2024-02-28 18:43:01 -08:00
Shengsheng Huang
43dac97e03
Update README.md ( #10260 )
2024-02-29 10:41:14 +08:00
Ruonan Wang
4b08bc1417
LLM: relax batch check of flash attention by double check attention mask ( #10270 )
...
* relax batch check
* fix
* fix style
2024-02-29 09:39:55 +08:00
Yina Chen
07f36fbfcc
Fix gptj failed to extend ( #10269 )
2024-02-29 09:39:27 +08:00
Yishuo Wang
cccb02dad1
fix baichuan2 13b 2k input ( #10267 )
2024-02-28 17:20:20 +08:00
Heyang Sun
7244fd1ba5
Fix Arc StarCoder wrong query_shape when input is long ( #10268 )
...
* Fix Arc StarCoder wrong query_shape when input is long
* Update gptbigcode.py
2024-02-28 17:07:08 +08:00
Cengguang Zhang
a4de3095f3
LLM: Support quantize kv cache in mistral. ( #10261 )
...
* init
* update quantize kv.
2024-02-28 14:08:08 +08:00
Shengsheng Huang
db0d129226
Revert "Add rwkv example ( #9432 )" ( #10264 )
...
This reverts commit 6930422b42.
2024-02-28 11:48:31 +08:00
Yining Wang
6930422b42
Add rwkv example ( #9432 )
...
* codeshell fix wrong urls
* restart runner
* add RWKV CPU & GPU example (rwkv-4-world-7b)
* restart runner
* update submodule
* fix runner
* runner-test
---------
Co-authored-by: Shengsheng Huang <shengsheng.huang@intel.com>
2024-02-28 11:41:00 +08:00
Keyan (Kyrie) Zhang
59861f73e5
Add Deepseek-6.7B ( #9991 )
...
* Add new example Deepseek
* Add new example Deepseek
* Add new example Deepseek
* Add new example Deepseek
* Add new example Deepseek
* modify deepseek
* modify deepseek
* Add verified model in README
* Turn cpu_embedding=True in Deepseek example
---------
Co-authored-by: Shengsheng Huang <shengsheng.huang@intel.com>
2024-02-28 11:36:39 +08:00
Yuxuan Xia
2524273198
Update AutoGen README ( #10255 )
...
* Update AutoGen README
* Fix AutoGen README typos
* Update AutoGen README
* Update AutoGen README
2024-02-28 11:34:45 +08:00
Zheng, Yi
2347f611cf
Add cpu and gpu examples of Mamba ( #9797 )
...
* Add mamba cpu example
* Add mamba gpu example
* Use a smaller model as the example
* minor fixes
---------
Co-authored-by: Shengsheng Huang <shengsheng.huang@intel.com>
2024-02-28 11:33:29 +08:00
Zhao Changmin
937e1f7c74
rebase ( #9104 )
...
Co-authored-by: leonardozcm <leonardozcm@gmail.com>
2024-02-28 11:18:21 +08:00
JunX
4833067489
fix GPU example link in README.md ( #9533 )
...
* fix GPU example link in README.md
* fix GPU links in llm README.md
2024-02-28 11:13:18 +08:00
Zhicun
308e637d0d
Add DeepSeek-MoE-16B-Chat ( #10155 )
...
* dsmoe-hf add
* add dsmoe pytorch
* update README
* modify comment
* remove GPU example
* update model name
* format code
2024-02-28 10:12:09 +08:00
Guoqiong Song
f4a2e32106
Stream llm example for both GPU and CPU ( #9390 )
2024-02-27 15:54:47 -08:00
Yang Wang
c581c6db30
draft mmint4 ( #10031 )
...
change to llm.cpp
support transposed format
revert
implement qkv fuse
fix style
change to vertically pack
change to enable_xetla
fix mlp_fusion_check
remove comments
address comments
add some comments
fix style
2024-02-27 14:55:16 -08:00
hxsz1997
cba61a2909
Add html report of ppl ( #10218 )
...
* remove include and language option, select the corresponding dataset based on the model name in Run
* change the nightly test time
* change the nightly test time of harness and ppl
* save the ppl result to json file
* generate csv file and print table result
* generate html
* modify the way to get parent folder
* update html in parent folder
* add llm-ppl-summary and llm-ppl-summary-html
* modify echo single result
* remove download fp16.csv
* change model name of PR
* move ppl nightly related files to llm/test folder
* reformat
* separate make_table from make_table_and_csv.py
* separate make_csv from make_table_and_csv.py
* update llm-ppl-html
* remove comment
* add Download fp16.results
2024-02-27 17:37:08 +08:00
Zhicun
6d60982746
Env script: add license ( #10257 )
...
* env script
* update README.md
* modify README
* modify cpu info output
* add env-check.sh
* add env-check.bat
* add windows
* modify bat
* add license
2024-02-27 15:29:20 +08:00
Yishuo Wang
b4fa4ab46f
optimize yuan 2.0 again ( #10252 )
2024-02-27 14:51:42 +08:00
Zhicun
03b9c4930a
UX: Script to print env info ( #10088 )
...
* env script
* update README.md
* modify README
* modify cpu info output
* add env-check.sh
* add env-check.bat
* add windows
* modify bat
2024-02-27 14:45:36 +08:00
Keyan (Kyrie) Zhang
843fe546b0
Add CPU and GPU examples for DeciLM-7B ( #9867 )
...
* Add cpu and gpu examples for DeciLM-7B
* Add cpu and gpu examples for DeciLM-7B
* Add DeciLM-7B to README table
* modify deciLM
* modify deciLM
* modify deciLM
* Add verified model in README
* Add cpu_embedding=True
2024-02-27 13:15:49 +08:00
Yuwen Hu
38ae4b372f
Add yuan2-2b to win igpu perf test ( #10250 )
2024-02-27 11:08:33 +08:00
Heyang Sun
36a9e88104
Speculative Starcoder on CPU ( #10138 )
...
* Speculative Starcoder on CPU
* enable kv-cache pre-allocation
* refine codes
* refine
* fix style
* fix style
* fix style
* refine
* refine
* Update speculative.py
* Update gptbigcode.py
* fix style
* Update speculative.py
* enable mixed-datatype layernorm on top of torch API
* adaptive dtype
* Update README.md
2024-02-27 09:57:29 +08:00
Yishuo Wang
a47989c860
optimize yuan 2.0 performance ( #10244 )
2024-02-26 17:20:10 +08:00
Wang, Jian4
6c74b99a28
LLM: Update qwen readme ( #10245 )
2024-02-26 17:03:09 +08:00
hxsz1997
15ad2fd72e
Merge pull request #10226 from zhentaocc/fix_harness
...
Fix harness
2024-02-26 16:49:27 +08:00
Wang, Jian4
f9b75f900b
LLM: Enable qwen target_model ipex ( #10232 )
...
* change order
* enable qwen ipex
* update qwen example
* update
* fix style
* update
2024-02-26 16:41:12 +08:00
Jin Qiao
3e6d188553
LLM: add baichuan2-13b to mtl perf ( #10238 )
2024-02-26 15:55:56 +08:00
Yuwen Hu
e38e29511c
[LLM] Yuan2 MLP and Rotary optimization ( #10231 )
...
* Add optimization for rotary embedding
* Add mlp fused optimization
* Python style fix
* Fix rotary embedding due to logits difference
* Small fix
2024-02-26 15:10:08 +08:00
Ziteng Zhang
ea23afc8ec
[LLM]update ipex part in mistral example readme ( #10239 )
...
* update ipex part in mistral example readme
2024-02-26 14:35:20 +08:00
SONG Ge
df2f3885ba
[LLM] Enable kv_cache and forward_qkv optimizations for yuan2 ( #10225 )
...
* add init kv_cache support for yuan2
* add forward qkv in yuan
2024-02-26 11:29:48 +08:00
Xiangyu Tian
85a99e13e8
LLM: Fix ChatGLM3 Speculative Example ( #10236 )
...
Fix ChatGLM3 Speculative Example.
2024-02-26 10:57:28 +08:00
Chen, Zhentao
213ef06691
fix readme
2024-02-24 00:38:08 +08:00
Ruonan Wang
28513f3978
LLM: support fp16 embedding & add mlp fusion for iq2_xxs ( #10219 )
...
* add fp16 embed
* small fixes
* fix style
* fix style
* fix comment
2024-02-23 17:26:24 +08:00
Yuwen Hu
eeecd9fc08
Python style fix ( #10230 )
2024-02-23 17:21:23 +08:00
Yuwen Hu
e511bbd8f1
[LLM] Add basic optimization framework for Yuan2 ( #10227 )
...
* Add basic optimization framework for Yuan2
* Small fix
* Python style fix
* Small fix
* Small fix
2024-02-23 17:05:00 +08:00
Xin Qiu
8ef5482da2
update Gemma readme ( #10229 )
...
* Update README.md
* Update README.md
* Update README.md
* Update README.md
2024-02-23 16:57:08 +08:00
Chen, Zhentao
6fe5344fa6
separate make_csv from the file
2024-02-23 16:33:38 +08:00
Chen, Zhentao
bfa98666a6
fall back to make_table.py
2024-02-23 16:33:38 +08:00
Ruonan Wang
19260492c7
LLM: fix action/installation error of mpmath ( #10223 )
...
* fix
* test
* fix
* update
2024-02-23 16:14:53 +08:00
Xin Qiu
aabfc06977
add gemma example ( #10224 )
...
* add gemma gpu example
* Update README.md
* add cpu example
* Update README.md
* Update README.md
* Update generate.py
* Update generate.py
2024-02-23 15:20:57 +08:00
yb-peng
a2c1675546
Add CPU and GPU examples for Yuan2-2B-hf ( #9946 )
...
* Add a new CPU example of Yuan2-2B-hf
* Add a new CPU generate.py of Yuan2-2B-hf example
* Add a new GPU example of Yuan2-2B-hf
* Add Yuan2 to README table
* In CPU example: 1. Use English as default prompt; 2. Provide modified files in yuan2-2B-instruct
* In GPU example: 1. Use English as default prompt; 2. Provide modified files
* GPU example:update README
* update Yuan2-2B-hf in README table
* Add CPU example for Yuan2-2B in Pytorch-Models
* Add GPU example for Yuan2-2B in Pytorch-Models
* Add license in generate.py; Modify README
* In GPU Add license in generate.py; Modify README
* In CPU yuan2 modify README
* In GPU yuan2 modify README
* In CPU yuan2 modify README
* In GPU example, updated the readme for Windows GPU supports
* In GPU torch example, updated the readme for Windows GPU supports
* GPU hf example README modified
* GPU example README modified
2024-02-23 14:09:30 +08:00
yb-peng
f1f4094a09
Add CPU and GPU examples of phi-2 ( #10014 )
...
* Add CPU and GPU examples of phi-2
* In GPU hf example, updated the readme for Windows GPU supports
* In GPU torch example, updated the readme for Windows GPU supports
* update the table in BigDL/README.md
* update the table in BigDL/python/llm/README.md
2024-02-23 14:05:53 +08:00
Chen, Zhentao
f315c7f93a
Move harness nightly related files to llm/test folder ( #10209 )
...
* move harness nightly files to test folder
* change workflow file path accordingly
* use arc01 when pr
* fix path
* fix fp16 csv path
2024-02-23 11:12:36 +08:00
Xin Qiu
30795bdfbc
Gemma optimization: rms_norm, kv_cache, fused_rope, fused_rope+qkv ( #10212 )
...
* gemma optimization
* update
* update
* fix style
* meet code review
2024-02-23 10:07:24 +08:00
Guoqiong Song
63681af97e
falcon for transformers 4.36 ( #9960 )
...
* falcon for transformers 4.36
2024-02-22 17:04:40 -08:00
Jason Dai
84d5f40936
Update README.md ( #10213 )
2024-02-22 17:22:59 +08:00
Yina Chen
ce5840a8b7
GPT-J rope optimization on xpu ( #10182 )
...
* optimize
* update
* fix style & move use_fuse_rope
* add ipex version check
* fix style
* update
* fix style
* meet comments
* address comments
* fix style
2024-02-22 16:25:12 +08:00
Xiangyu Tian
f445217d02
LLM: Update IPEX to 2.2.0+cpu and Refactor for _ipex_optimize ( #10189 )
...
Update IPEX to 2.2.0+cpu and refactor for _ipex_optimize.
2024-02-22 16:01:11 +08:00
Heyang Sun
c876d9b5ca
Support for MPT rotary embedding ( #10208 )
2024-02-22 15:16:31 +08:00
Ruonan Wang
5e1fee5e05
LLM: add GGUF-IQ2 examples ( #10207 )
...
* add iq2 examples
* small fix
* meet code review
* fix
* meet review
* small fix
2024-02-22 14:18:45 +08:00
Yuwen Hu
21de2613ce
[LLM] Add model loading time record for all-in-one benchmark ( #10201 )
...
* Add model loading time record in csv for all-in-one benchmark
* Small fix
* Small fix to number after .
2024-02-22 13:57:18 +08:00
Ovo233
60e11b6739
LLM: Add mlp layer unit tests ( #10200 )
...
* add mlp layer unit tests
* add download baichuan-13b
* exclude llama for now
* install additional packages
* rename bash file
* switch to Baichuan2
* delete attention related code
* fix name errors in yml file
2024-02-22 13:44:45 +08:00
SONG Ge
ca1166a0e5
[LLM] Add quantize kv_cache for Baichuan2-13B ( #10203 )
...
* add quantize kv_cache for baichuan2-13b
* style fix
2024-02-22 13:43:35 +08:00
Ruonan Wang
34ee1aa91f
LLM: add esimd sdp support for chatglm3 ( #10205 )
...
* add esimd sdp support
* fix style
2024-02-22 13:37:16 +08:00
Yuxuan Xia
7cbc2429a6
Fix C-Eval ChatGLM loading issue ( #10206 )
...
* Add c-eval workflow and modify running files
* Modify the chatglm evaluator file
* Modify the ceval workflow for triggering test
* Modify the ceval workflow file
* Modify the ceval workflow file
* Modify ceval workflow
* Adjust the ceval dataset download
* Add ceval workflow dependencies
* Modify ceval workflow dataset download
* Add ceval test dependencies
* Add ceval test dependencies
* Correct the result print
* Fix the nightly test trigger time
* Fix ChatGLM loading issue
2024-02-22 10:00:43 +08:00
Yuwen Hu
94cb16fe40
[LLM] Small updates to Win GPU Install Doc ( #10199 )
...
* Make Offline installer as default for win gpu doc for oneAPI
* Small other fixes
2024-02-21 17:58:40 +08:00
binbin Deng
9975b029c5
LLM: add qlora finetuning example using trl.SFTTrainer ( #10183 )
2024-02-21 16:40:04 +08:00
Ruonan Wang
f7c96b19ef
LLM: support iq2 for mixtral ( #10191 )
...
* support name mapping for mixtral
* support mixtral mixed quantization
* fix style
* fix
2024-02-21 16:00:29 +08:00
yb-peng
b1a97b71a9
Harness eval: Add is_last parameter and fix logical operator in highlight_vals ( #10192 )
...
* Add is_last parameter and fix logical operator in highlight_vals
* Add script to update HTML files in parent folder
* Add running update_html_in_parent_folder.py in summarize step
* Add license info
* Remove update_html_in_parent_folder.py in Summarize the results for pull request
2024-02-21 14:45:32 +08:00
Zhicun
c7e839e66c
Add Qwen1.5-7B-Chat ( #10113 )
...
* add Qwen1.5-7B-Chat
* modify Qwen1.5 example
* update README
* update prompt format
* update folder name and example README
* add Chinese prompt sample output
* update link in README
* correct the link
* update transformer version
2024-02-21 13:29:29 +08:00
Xin Qiu
56ad781f2f
qwen2 cpu fix ( #10187 )
2024-02-21 11:23:51 +08:00
Chen, Zhentao
39d37bd042
upgrade harness package version in workflow ( #10188 )
...
* upgrade harness
* update readme
2024-02-21 11:21:30 +08:00
Yuwen Hu
001c13243e
[LLM] Add support for low_low_bit benchmark on Windows GPU ( #10167 )
...
* Add support for low_low_bit performance test on Windows GPU
* Small fix
* Small fix
* Save memory during converting model process
* Drop the results for first time when loading in low bit on mtl igpu for better performance
* Small fix
2024-02-21 10:51:52 +08:00
Ziteng Zhang
276ef0e885
Speculative Ziya on CPU ( #10160 )
...
* Speculative Ziya on CPU
* Without part of Accelerate with BIGDL_OPT_IPEX
2024-02-21 10:30:39 +08:00
Zhao Changmin
4fbf449c2d
for rwkv4 ( #10179 )
2024-02-21 10:11:10 +08:00
yb-peng
de3dc609ee
Modify harness evaluation workflow ( #10174 )
...
* Modify table head in harness
* Specify the file path of fp16.csv
* change run to run nightly and run pr to debug
* Modify the way to get fp16.csv to downloading from github
* Change the method to calculate diff in html table
* Change the method to calculate diff in html table
* Re-arrange job order
* Re-arrange job order
* Change limit
* Change fp16.csv path
* Change highlight rules
* Change limit
2024-02-20 18:55:43 +08:00
Ruonan Wang
3288acb8de
LLM: Support embedding quantization (only q2k now) ( #10170 )
...
* basic logic added
* basic support
* support save&load, update mixed strategy
* fix style
* use int8 for lm_head
* add check for xpu
2024-02-20 16:56:57 +08:00
hxsz1997
6e10d98a8d
Fix some typos ( #10175 )
...
* add llm-ppl workflow
* update the DATASET_DIR
* test multiple precisions
* modify nightly test
* match the updated ppl code
* add matrix.include
* fix the include error
* update the include
* add more model
* update the precision of include
* update nightly time and add more models
* fix the workflow_dispatch description, change default model of pr and modify the env
* modify workflow_dispatch language options
* modify options
* modify language options
* modify workflow_dispatch type
* modify type
* modify the type of language
* change seq_len type
* fix some typos
* revert changes to stress_test.txt
2024-02-20 14:14:53 +08:00
Zhicun
add3899311
Add ziya CPU example ( #10114 )
...
* ziya on CPU
* add README for ziya
* specify use_cache
* add arc CPU
* update prompt format
* update link
* add comments to emphasize use_cache
* update pip cmd
2024-02-20 13:59:52 +08:00
binbin Deng
2bb96c775c
LLM: fix device setting during saving optimized model ( #10154 )
2024-02-20 09:52:59 +08:00
Xin Qiu
1f6d5b9f30
enable fused rmsnorm and rope qwen2 ( #10163 )
...
* qwen2
* change convert
* cleanup
2024-02-20 08:33:09 +08:00
yb-peng
e31210ba00
Modify html table style and add fp16.csv in harness ( #10169 )
...
* Specify the version of pandas in harness evaluation workflow
* Specify the version of pandas in harness evaluation workflow
* Modify html table style and add fp16.csv in harness
* Modify comments
2024-02-19 18:13:40 +08:00
WeiguangHan
6c09aed90d
LLM: add qwen_1.5_7b model for arc perf test ( #10166 )
...
* LLM: add qwen_1.5_7b model for arc perf test
* small fix
* revert some codes
2024-02-19 17:21:00 +08:00
Yuxuan Xia
209122559a
Add Ceval workflow and modify the result printing ( #10140 )
...
* Add c-eval workflow and modify running files
* Modify the chatglm evaluator file
* Modify the ceval workflow for triggering test
* Modify the ceval workflow file
* Modify the ceval workflow file
* Modify ceval workflow
* Adjust the ceval dataset download
* Add ceval workflow dependencies
* Modify ceval workflow dataset download
* Add ceval test dependencies
* Add ceval test dependencies
* Correct the result print
2024-02-19 17:06:53 +08:00
Zhao Changmin
f8730e8dc1
Skip rescale rwkv linear when load_low_bit ( #10164 )
...
* rwkv_ld
2024-02-19 15:56:42 +08:00
Heyang Sun
3e2af5ec0a
Fix IPEX Baichuan Speculative ( #10162 )
...
* Fix IPEX Baichuan Speculative
* compatible with 13B
* Update speculative.py
2024-02-19 15:27:34 +08:00
Yina Chen
23c91cdce6
[LLM] Add min_step_draft in speculative decoding ( #10142 )
...
* Fix gptj kvcache & position id
* Add min_draft_tokens in speculative decoding
* fix style
* update
2024-02-19 14:31:41 +08:00
Chen, Zhentao
14ba2c5135
Harness: remove deprecated files ( #10165 )
2024-02-19 14:27:49 +08:00
Wang, Jian4
d3591383d5
LLM: Add CPU chatglm3 speculative example ( #10004 )
...
* init chatglm
* update
* update
2024-02-19 13:38:52 +08:00
Wang, Jian4
f2417e083c
LLM: enable chatglm3-6b target_model ipex ( #10085 )
...
* init
* always make causal_mask
* not return last tensor
* update
* optimize_model = False
* enable optimized=False
* enable optimized_model=true
* speed_up ipex target_model
* remove if True
* use group_size
* update python style
* update
* update
2024-02-19 13:38:32 +08:00
Heyang Sun
177273c1a4
IPEX Speculative Support for Baichuan2 7B ( #10112 )
...
* IPEX Speculative Support for Baichuan2 7B
* fix license problems
* refine
2024-02-19 09:12:57 +08:00
Yina Chen
1508d6b089
Fix gptj kvcache & position id ( #10141 )
2024-02-18 10:02:49 +08:00
yb-peng
b4dc33def6
In harness-evaluation workflow, add statistical tables ( #10118 )
...
* change storage
* fix typo
* change label
* change label to arc03
* change needs in the last step
* add generate csv in harness/make_table_results.py
* modify needs in the last job
* add csv to html
* fix path issue in llm-harness-summary-nightly
* modify output_path
* modify args in make_table_results.py
* modify make table command in summary
* change pr env label
* remove irrelevant code in summary; add set output path step; add limit in harness run
* re-organize code structure
* modify limit in run harness
* modify csv_to_html input path
* modify needs in summary-nightly
2024-02-08 19:01:05 +08:00
Yishuo Wang
4d33aac7f9
quick fix qwen2 fp8 kv cache ( #10135 )
2024-02-08 17:04:59 +08:00
Cengguang Zhang
39d90839aa
LLM: add quantize kv cache for llama. ( #10086 )
...
* feat: add quantize kv cache for llama.
* fix style.
* add quantized attention forward function.
* revert style.
* fix style.
* fix style.
* update quantized kv cache and add quantize_qkv
* fix style.
* fix style.
* optimize quantize kv cache.
* fix style.
2024-02-08 16:49:22 +08:00
Yishuo Wang
d848efe17c
add quantize kv cache support for qwen2 ( #10134 )
2024-02-08 16:17:21 +08:00
SONG Ge
3f79128ed7
[LLM] Enable kv_cache optimization for Qwen2 on transformers-v4.37.0 ( #10131 )
...
* add support for kv_cache optimization on transformers-v4.37.0
* enable attention forward
* style fix
* disable rotary for now
2024-02-08 14:20:26 +08:00
Ruonan Wang
063dc145ac
LLM: basic support for q2k ( #10132 )
...
* basic support for q2k
* fix style
2024-02-08 13:52:01 +08:00
binbin Deng
11fe5a87ec
LLM: add Modelscope model example ( #10126 )
2024-02-08 11:18:07 +08:00
Cengguang Zhang
0cf6a12691
LLM: add default torch_dtype for fp16. ( #10124 )
...
* set default torch_dtype for fp16.
* fix style.
* bug fix.
* update bug fix.
2024-02-08 10:24:16 +08:00
Yishuo Wang
1aa0c623ce
disable fused layer norm on UHD ( #10130 )
2024-02-08 10:20:01 +08:00
Yuwen Hu
a8450fc300
[LLM] Support MLP optimization for Qwen1.5 ( #10123 )
2024-02-08 09:15:34 +08:00
Yuwen Hu
81ed65fbe7
[LLM] Add qwen1.5-7B in iGPU perf ( #10127 )
...
* Add qwen1.5 test config yaml with transformers 4.37.0
* Update for yaml file
2024-02-07 22:31:20 +08:00
Jin Qiao
0fcfbfaf6f
LLM: add rwkv5 eagle GPU HF example ( #10122 )
...
* LLM: add rwkv5 eagle example
* fix
* fix link
2024-02-07 16:58:29 +08:00
binbin Deng
925f82107e
LLM: support models hosted by modelscope ( #10106 )
2024-02-07 16:46:36 +08:00
binbin Deng
c1ec3d8921
LLM: update FAQ about too many open files ( #10119 )
2024-02-07 15:02:24 +08:00
Keyan (Kyrie) Zhang
2e80701f58
Unit test on final logits and the logits of the last attention layer ( #10093 )
...
* Add unit test on final logits and attention
* Add unit test on final logits and attention
* Modify unit test on final logits and attention
2024-02-07 14:25:36 +08:00
Yuxuan Xia
3832eb0ce0
Add ChatGLM C-Eval Evaluator ( #10095 )
...
* Add ChatGLM ceval evaluator
* Modify ChatGLM Evaluator Reference
2024-02-07 11:27:06 +08:00
Jin Qiao
63050c954d
fix ( #10117 )
2024-02-07 11:05:11 +08:00
Jin Qiao
d3d2ee1b63
LLM: add speech T5 GPU example ( #10090 )
...
* add speech t5 example
* fix
* fix
2024-02-07 10:50:02 +08:00
Jin Qiao
2f4c754759
LLM: add bark gpu example ( #10091 )
...
* add bark gpu example
* fix
* fix license
* add bark
* add example
* fix
* another way
2024-02-07 10:47:11 +08:00
Xiangyu Tian
8953acd7d6
[LLM] Fix log condition for BIGDL_OPT_IPEX ( #10115 )
...
Fix log condition for BIGDL_OPT_IPEX
2024-02-07 10:27:10 +08:00
SONG Ge
0eccb94d75
remove text-generation-webui from bigdl repo ( #10107 )
2024-02-06 17:46:52 +08:00
Ovo233
2aaa21c41d
LLM: Update ppl tests ( #10092 )
...
* update ppl tests
* use load_dataset api
* add exception handling
* add language argument
* address comments
2024-02-06 17:31:48 +08:00
Yuwen Hu
3a46b57253
[LLM] Add RWKV4 HF GPU Example ( #10105 )
...
* Add GPU HF example for RWKV 4
* Add link to rwkv4
* fix
2024-02-06 16:30:24 +08:00
Yuwen Hu
518ef95abc
Small fix for Nonetype error ( #10104 )
2024-02-06 14:58:52 +08:00
Ruonan Wang
d61f4905ac
LLM: 2bit quantization initial support ( #10042 )
...
* basis quantize support
* fix new module name
* small update
* add mixed int4 with iq2_xxs
* remove print
* code refactor
* fix style
* meet code review
2024-02-06 14:58:32 +08:00
dingbaorong
36c9442c6d
Arc Stable version test ( #10087 )
...
* add batch_size in stable version test
* add batch_size in excludes
* add excludes for batch_size
* fix ci
* trigger regression test
* fix xpu version
* disable ci
* address kai's comment
---------
Co-authored-by: Ariadne <wyn2000330@126.com>
2024-02-06 10:23:50 +08:00
Jiao Wang
33b9e7744d
fix dimension ( #10097 )
2024-02-05 15:07:38 -08:00
SONG Ge
4b02ff188b
[WebUI] Add prompt format and stopping words for Qwen ( #10066 )
...
* add prompt format and stopping_words for qwen model
* performance optimization
* optimize
* update
* meet comments
2024-02-05 18:23:13 +08:00
WeiguangHan
0aecd8637b
LLM: small fix for the html script ( #10094 )
2024-02-05 17:27:34 +08:00
Zhicun
7d2be7994f
add phixtral and optimize phi-moe ( #10052 )
2024-02-05 11:12:47 +08:00
Zhicun
676d6923f2
LLM: modify transformersembeddings.embed() in langchain ( #10051 )
2024-02-05 10:42:10 +08:00
Jin Qiao
ad050107b3
LLM: fix mpt load_low_bit issue ( #10075 )
...
* fix
* retry
* retry
2024-02-05 10:17:07 +08:00
SONG Ge
9050991e4e
fix gradio check issue temporarily ( #10082 )
2024-02-04 16:46:29 +08:00
WeiguangHan
c2e562d037
LLM: add batch_size to the csv and html ( #10080 )
...
* LLM: add batch_size to the csv and html
* small fix
2024-02-04 16:35:44 +08:00
binbin Deng
7e49fbc5dd
LLM: make finetuning examples more common for other models ( #10078 )
2024-02-04 16:03:52 +08:00
Heyang Sun
90f004b80b
remove benchmarkwrapper from deepspeed example ( #10079 )
2024-02-04 15:42:15 +08:00
Ruonan Wang
8e33cb0f38
LLM: support speecht5_tts ( #10077 )
...
* support speecht5_tts
* fix
2024-02-04 13:26:42 +08:00
ivy-lv11
428b7105f6
Add HF and PyTorch example InternLM2 ( #10061 )
2024-02-04 10:25:55 +08:00
Yina Chen
77be19bb97
LLM: Support gpt-j in speculative decoding ( #10067 )
...
* gptj
* support gptj in speculative decoding
* fix
* update readme
* small fix
2024-02-02 14:54:55 +08:00
SONG Ge
19183ef476
[WebUI] Reset bigdl-llm loader options with default value ( #10064 )
...
* reset bigdl-llm loader options with default value
* remove options which maybe complex for naive users
2024-02-01 15:45:39 +08:00
Xin Qiu
6e0f1a1e92
use apply_rotary_pos_emb_cache_freq_xpu in mixtral ( #10060 )
...
* use apply_rotary_pos_emb_cache_freq_xpu in mixtral
* fix style
2024-02-01 15:40:49 +08:00
binbin Deng
aae20d728e
LLM: Add initial DPO finetuning example ( #10021 )
2024-02-01 14:18:08 +08:00
Heyang Sun
601024f418
Mistral CPU example of speculative decoding ( #10024 )
...
* Mistral CPU example of speculative decoding
* update transformers version
* update example
* Update README.md
2024-02-01 10:52:32 +08:00
Heyang Sun
968e70544d
Enable IPEX Mistral in Speculative ( #10059 )
2024-02-01 10:48:16 +08:00
Yina Chen
3ca03d4e97
Add deepmind sample into bigdl-llm speculative decoding ( #10041 )
...
* migrate deepmind sample
* update
* meet comments
* fix style
* fix style
2024-02-01 09:57:02 +08:00
WeiguangHan
d2d3f6b091
LLM: ensure the result of daily arc perf test ( #10016 )
...
* ensure the result of daily arc perf test
* small fix
* small fix
* small fix
* small fix
* small fix
* small fix
* small fix
* small fix
* small fix
* small fix
* concat more csvs
* small fix
* revert some files
2024-01-31 18:26:21 +08:00
WeiguangHan
9724939499
temporarily disable bloom 2k input ( #10056 )
2024-01-31 17:49:12 +08:00
Jin Qiao
8c8fc148c9
LLM: add rwkv 5 ( #10048 )
2024-01-31 15:54:55 +08:00
WeiguangHan
a9018a0e95
LLM: modify the GPU example for redpajama model ( #10044 )
...
* LLM: modify the GPU example for redpajama model
* small fix
2024-01-31 14:32:08 +08:00
Yuxuan Xia
95636cad97
Add AutoGen CPU and XPU Example ( #9980 )
...
* Add AutoGen example
* Adjust AutoGen README
* Adjust AutoGen README
* Change AutoGen README
* Change AutoGen README
2024-01-31 11:31:18 +08:00
Heyang Sun
7284edd9b7
Vicuna CPU example of speculative decoding ( #10018 )
...
* Vicuna CPU example of speculative decoding
* Update speculative.py
* Update README.md
* add requirements for ipex
* Update README.md
* Update speculative.py
* Update speculative.py
2024-01-31 11:23:50 +08:00
Wang, Jian4
7e5cd42a5c
LLM : Update optimize ipex bf16 ( #10038 )
...
* use 4.35.2 and remove
* update rmsnorm
* remove
* remove
* update python style
* update
* update python style
* update
* fix style
* update
* remove whitespace
2024-01-31 10:59:55 +08:00
Wang, Jian4
fb53b994f8
LLM : Add llama ipex optimized ( #10046 )
...
* init ipex
* remove padding
2024-01-31 10:38:46 +08:00
Ruonan Wang
3685622f29
LLM: fix llama 4.36 forward ( #10047 )
2024-01-31 10:31:10 +08:00
Yishuo Wang
53a5140eff
Optimize rwkv v5 rest token again ( #10043 )
2024-01-31 10:01:11 +08:00
Heyang Sun
b1ff28ceb6
LLama2 CPU example of speculative decoding ( #9962 )
...
* LLama2 example of speculative decoding
* add docs
* Update speculative.py
* Update README.md
* Update README.md
* Update speculative.py
* remove autocast
2024-01-31 09:45:20 +08:00
WeiguangHan
0fcad6ce14
LLM: add gpu example for redpajama models ( #10040 )
2024-01-30 19:39:28 +08:00
Xiangyu Tian
9978089796
[LLM] Enable BIGDL_OPT_IPEX in speculative baichuan2 13b example ( #10028 )
...
Enable BIGDL_OPT_IPEX in speculative baichuan2 13b example
2024-01-30 17:11:37 +08:00
Ovo233
226f398c2a
fix ppl test errors ( #10036 )
2024-01-30 16:26:21 +08:00
Xin Qiu
13e61738c5
hide detailed memory for each token in benchmark_utils.py ( #10037 )
2024-01-30 16:04:17 +08:00
Ruonan Wang
6b63ba23d1
LLM: add full module name during convert ( #10035 )
2024-01-30 14:43:07 +08:00
Yishuo Wang
7dfa6dbe46
add rwkv time shift optimization ( #10032 )
2024-01-30 14:10:55 +08:00
Xiangyu Tian
f57d0fda8b
[LLM] Use IPEX Optimization for Self Speculative Decoding ( #9997 )
...
Use IPEX Optimization for Self Speculative Decoding
2024-01-30 09:11:06 +08:00
Ruonan Wang
ccf8f613fb
LLM: update fp16 Linear on ARC/FLEX ( #10023 )
2024-01-29 18:25:26 +08:00
Shaojun Liu
824c8029d7
Fix "local variable 'model' referenced before assignment" ( #10022 )
2024-01-29 16:18:04 +08:00
Heyang Sun
cc3f122f6a
Baichuan2 CPU example of speculative decoding ( #10003 )
...
* Baichuan2 CPU example of speculative decoding
* Update generate.py
* Update README.md
* Update generate.py
* Update generate.py
* Update generate.py
* fix default model
* fix wrong Chinese encoding
* Update generate.py
* update prompt
* update sample outputs
* baichuan 7b needs transformers==4.31.0
* rename example file's name
2024-01-29 14:21:09 +08:00
Xiangyu Tian
f37e4702bc
[LLM] Use IPEX Optimization for BF16 Model ( #9988 )
...
Use IPEX Optimization for BF16 Model by env BIGDL_OPT_IPEX=true
2024-01-29 11:28:25 +08:00
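The commit above gates the IPEX BF16 optimization behind an environment variable; per the commit message, it is enabled with `BIGDL_OPT_IPEX=true`. A minimal sketch of turning it on from Python (the flag name comes from the commit; everything else here is illustrative):

```python
import os

# Enable the IPEX optimization path for BF16 models (flag named in the commit);
# it must be set before bigdl-llm and the model are imported/loaded.
# Equivalently, in a shell: export BIGDL_OPT_IPEX=true
os.environ["BIGDL_OPT_IPEX"] = "true"
```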
Jin Qiao
440cfe18ed
LLM: GPU Example Updates for Windows ( #9992 )
...
* modify aquila
* modify aquila2
* add baichuan
* modify baichuan2
* modify blue-lm
* modify chatglm3
* modify chinese-llama2
* modify codellama
* modify distil-whisper
* modify dolly-v1
* modify dolly-v2
* modify falcon
* modify flan-t5
* modify gpt-j
* modify internlm
* modify llama2
* modify mistral
* modify mixtral
* modify mpt
* modify phi-1_5
* modify qwen
* modify qwen-vl
* modify replit
* modify solar
* modify starcoder
* modify vicuna
* modify voiceassistant
* modify whisper
* modify yi
* modify aquila2
* modify baichuan
* modify baichuan2
* modify blue-lm
* modify chatglm2
* modify chatglm3
* modify codellama
* modify distil-whisper
* modify dolly-v1
* modify dolly-v2
* modify flan-t5
* modify llama2
* modify llava
* modify mistral
* modify mixtral
* modify phi-1_5
* modify qwen-vl
* modify replit
* modify solar
* modify starcoder
* modify yi
* correct the comments
* remove cpu_embedding in code for whisper and distil-whisper
* remove comment
* remove cpu_embedding for voice assistant
* revert modify voice assistant
* modify for voice assistant
* add comment for voice assistant
* fix comments
* fix comments
2024-01-29 11:25:11 +08:00
Yuwen Hu
c6d4f91777
[LLM] Add UTs of load_low_bit for transformers-style API ( #10001 )
...
* Add uts for transformers api load_low_bit generation
* Small fixes
* Remove replit-code for CPU tests due to current load_low_bit issue on MPT
* Small change
* Small reorganization to llm unit tests on CPU
* Small fixes
2024-01-29 10:18:23 +08:00
Yishuo Wang
d720554d43
simplify quantize kv cache api ( #10011 )
2024-01-29 09:23:57 +08:00
Yina Chen
a3322e2a6c
add fp8 e5 to use_xmx ( #10015 )
2024-01-26 18:29:46 +08:00
Qiyuan Gong
9e18ea187f
[LLM] Avoid KV Cache OOM when seq len is larger than 1 ( #10006 )
...
* Avoid OOM during multi-round streaming chat with kv cache
* For llama like kv cache, i.e., [bs, n_head, seq_len, head_dim], use is_enough_kv_cache_room_4_31.
* Other models need to compare kv cache size with kv_len.
2024-01-26 17:30:08 +08:00
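The bullets above describe the check: for llama-like caches of shape [bs, n_head, seq_len, head_dim], a decoding round may reuse the pre-allocated buffer only while it still has room; other models compare their cache size with kv_len directly. A simplified sketch of that test (the real helper is `is_enough_kv_cache_room_4_31`; the signature and shape handling here are illustrative assumptions):

```python
def is_enough_kv_cache_room(cache_shape, kv_len):
    """Return True if the pre-allocated kv cache still has room
    after kv_len tokens are already in use.

    cache_shape follows the llama-like layout [bs, n_head, seq_len, head_dim];
    per the commit, other models compare cache size with kv_len analogously.
    """
    _bs, _n_head, allocated_seq_len, _head_dim = cache_shape
    return allocated_seq_len > kv_len
```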
binbin Deng
e5ae6f2c13
LLM: fix truncation logic of past_key_values in chatglm multi turn chat ( #10007 )
...
* Avoid frequently truncating past_key_values when its length is larger than required.
2024-01-26 16:56:02 +08:00
Yuwen Hu
1eaaace2dc
Update perf test all-in-one config for batch_size arg ( #10012 )
2024-01-26 16:46:36 +08:00
Xin Qiu
7952bbc919
add conf batch_size to run_model ( #10010 )
2024-01-26 15:48:48 +08:00
SONG Ge
421e7cee80
[LLM] Add Text_Generation_WebUI Support ( #9884 )
...
* initially add text_generation_webui support
* add env requirements install
* add necessary dependencies
* update for starting webui
* update shared and noted to place models
* update heading of part3
* meet comments
* add copyright license
* remove extensions
* convert tutorial to windows side
* add warm-up to optimize performance
2024-01-26 15:12:49 +08:00
Yuwen Hu
f0da0c131b
Disable llama2 optimize model true or false test for now in Arc UTs ( #10008 )
2024-01-26 14:42:11 +08:00
Ruonan Wang
a00efa0564
LLM: add mlp & qkv fusion for FP16 Llama-7B ( #9932 )
...
* add mlp fusion for llama
* add mlp fusion
* fix style
* update
* add mm_qkv_out
* fix style
* update
* meet code review
* meet code review
2024-01-26 11:50:38 +08:00
Wang, Jian4
98ea3459e5
LLM : Fix llama draft_model dtype error ( #10005 )
...
* fix llama draft_model dtype error
* update
2024-01-26 10:59:48 +08:00
Yishuo Wang
aae1870096
fix qwen kv cache length ( #9998 )
2024-01-26 10:15:01 +08:00
Chen, Zhentao
762adc4f9d
Reformat summary table ( #9942 )
...
* reformat the table
* refactor the file
* read result.json only
2024-01-25 23:49:00 +08:00
binbin Deng
171fb2d185
LLM: reorganize GPU finetuning examples ( #9952 )
2024-01-25 19:02:38 +08:00
Yishuo Wang
24b34b6e46
change xmx condition ( #10000 )
2024-01-25 17:48:11 +08:00
Ziteng Zhang
8b08ad408b
Add batch_size in all_in_one ( #9999 )
...
Add batch_size in all_in_one, except run_native_int4
2024-01-25 17:43:49 +08:00
Wang, Jian4
093e6f8f73
LLM: Add qwen CPU speculative example ( #9985 )
...
* init from gpu
* update for cpu
* update
* update
* fix xpu readme
* update
* update example prompt
* update prompt and add 72b
* update
* update
2024-01-25 17:01:34 +08:00
Yishuo Wang
bf65548d29
Add quantize kv cache support for chatglm2/3 ( #9996 )
2024-01-25 16:55:59 +08:00
Chen, Zhentao
86055d76d5
fix optimize_model not working ( #9995 )
2024-01-25 16:39:05 +08:00
Wang, Jian4
9bff84e6fd
LLM: Convert draft_model kv_cache from bf16 to fp32 ( #9964 )
...
* convert bf16 to fp32
* update
* change when init
* init first and cut off after
* init and exchange
* update python type
* update
* fix bug
* update
* update
2024-01-25 11:20:27 +08:00
Yina Chen
99ff6cf048
Update gpu spec decoding baichuan2 example dependency ( #9990 )
...
* add dependency
* update
* update
2024-01-25 11:05:04 +08:00
Yina Chen
27338540c3
Fix repetition_penalty not activated issue ( #9989 )
2024-01-25 10:40:41 +08:00
Jason Dai
3bc3d0bbcd
Update self-speculative readme ( #9986 )
2024-01-24 22:37:32 +08:00
Yuwen Hu
b27e5a27b9
Remove the check for meta device in _replace_with_low_bit_linear ( #9984 )
2024-01-24 18:15:39 +08:00
Ruonan Wang
d4f65a6033
LLM: add mistral speculative example ( #9976 )
...
* add mistral example
* update
2024-01-24 17:35:15 +08:00
Yina Chen
b176cad75a
LLM: Add baichuan2 gpu spec example ( #9973 )
...
* add baichuan2 gpu spec example
* update readme & example
* remove print
* fix typo
* meet comments
* revert
* update
2024-01-24 16:40:16 +08:00
Jinyi Wan
ec2d9de0ea
Fix README.md for solar ( #9957 )
2024-01-24 15:50:54 +08:00
Mingyu Wei
bc9cff51a8
LLM GPU Example Update for Windows Support ( #9902 )
...
* Update README in LLM GPU Examples
* Update reference of Intel GPU
* add cpu_embedding=True in comment
* small fixes
* update GPU/README.md and add explanation for cpu_embedding=True
* address comments
* fix small typos
* add backtick for cpu_embedding=True
* remove extra backtick in the doc
* add period mark
* update readme
2024-01-24 13:42:27 +08:00
Chen, Zhentao
e0db44dcb6
fix unexpected keyword argument 'device' ( #9982 )
...
* add device for chatglm3 only
* add comment for this change
* fix style
* fix style
* fix style again..
* finally fixed style
2024-01-24 13:20:46 +08:00
Mingyu Wei
50a851e3b3
LLM: separate arc ut for disable XMX ( #9953 )
...
* separate test_optimize_model api with disabled xmx
* delete test_optimize_model in test_transformers_api.py
* set env variable in .sh/ put back test_optimize_model
* unset env variable
* remove env setting in .py
* address errors in action
* remove import ipex
* lower tolerance
2024-01-23 19:04:47 +08:00
Yuwen Hu
8d28aa8e2b
[LLM] Fix the model.device problem when cpu_embedding=True ( #9971 )
...
* Overwrite the device attribute for CPUPinnedParam
* Expose cpu_embedding=True for Linux users
* Fix python style
2024-01-23 18:51:11 +08:00
Yishuo Wang
f82782cd3b
fix starcoder ( #9975 )
2024-01-23 17:24:53 +08:00
WeiguangHan
be5836bee1
LLM: fix outlier value ( #9945 )
...
* fix outlier value
* small fix
2024-01-23 17:04:13 +08:00
Yishuo Wang
2c8a9aaf0d
fix qwen causal mask when quantize_kv_cache=True ( #9968 )
2024-01-23 16:34:05 +08:00
Yina Chen
5aa4b32c1b
LLM: Add qwen spec gpu example ( #9965 )
...
* add qwen spec gpu example
* update readme
---------
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-01-23 15:59:43 +08:00
Yina Chen
36c665667d
Add logits processor & qwen eos stop in speculative decoding ( #9963 )
...
* add logits processor & qwen eos
* fix style
* fix
* fix
* fix style
* fix style
* support transformers 4.31
* fix style
* fix style
---------
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-01-23 15:57:28 +08:00
Ruonan Wang
60b35db1f1
LLM: add chatglm3 speculative decoding example ( #9966 )
...
* add chatglm3 example
* update
* fix
2024-01-23 15:54:12 +08:00
Xin Qiu
da4687c917
fix fp16 ( #9970 )
2024-01-23 15:53:32 +08:00
Chen, Zhentao
301425e377
harness tests on pvc multiple xpus ( #9908 )
...
* add run_multi_llb.py
* update readme
* add job hint
2024-01-23 13:20:37 +08:00
Ruonan Wang
27b19106f3
LLM: add readme for speculative decoding gpu examples ( #9961 )
...
* add readme
* add readme
* meet code review
2024-01-23 12:54:19 +08:00
Chen, Zhentao
39219b7e9a
add default device meta when lcmu enabled ( #9941 )
2024-01-23 11:00:49 +08:00
Xin Qiu
dacf680294
add fused rotary pos emb for qwen ( #9956 )
...
* add fused rotary pos emb for qwen
* update
2024-01-23 10:37:56 +08:00
Ruonan Wang
7b1d9ad7c0
LLM: limit esimd sdp usage for k_len < 8 ( #9959 )
...
* update
* fix
2024-01-23 09:28:23 +08:00
Ruonan Wang
3e601f9a5d
LLM: Support speculative decoding in bigdl-llm ( #9951 )
...
* first commit
* fix error, add llama example
* hidden print
* update api usage
* change to api v3
* update
* meet code review
* meet code review, fix style
* add reference, fix style
* fix style
* fix first token time
2024-01-22 19:14:56 +08:00
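This commit introduces the draft-then-verify loop that the many speculative-decoding examples in this log (llama2, chatglm3, baichuan2, qwen, mistral, ...) build on: a small draft model proposes a few tokens, and the target model accepts the matching prefix and supplies one correction. A toy, model-free sketch of the loop; the function names and acceptance rule here are illustrative, not bigdl-llm's API:

```python
def speculative_decode(draft_next, target_next, prompt, num_draft=4, max_new=16):
    """Toy draft-then-verify loop: the draft proposes num_draft tokens, the
    target accepts the longest agreeing prefix plus one corrected token.
    May overshoot max_new by up to num_draft - 1 tokens; fine for a sketch."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        draft = []
        for _ in range(num_draft):
            draft.append(draft_next(tokens + draft))
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's correction replaces the mismatch
                break
        tokens.extend(accepted)
    return tokens[len(prompt):]
```

When draft and target mostly agree, each target verification yields several tokens instead of one, which is where the speedup comes from.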
Cheen Hau, 俊豪
947b1e27b7
Add readme for Whisper Test ( #9944 )
...
* Fix local data path
* Remove non-essential files
* Add readme
* Minor fixes to script
* Bugfix, refactor
* Add references to original source. Bugfixes.
* Reviewer comments
* Properly print and explain output
* Move files to dev/benchmark
* Fixes
2024-01-22 15:11:33 +08:00
Xin Qiu
6fb3f40f7e
fix error for benchmark_util.py running on cpu ( #9949 )
2024-01-22 10:14:40 +08:00
Heyang Sun
fb91c97fe8
support for Baichuan/Baichuan2 13B Chat running speculative decoding ( #9921 )
...
* support for Baichuan/Baichuan2 13B Chat running speculative decoding
* fix style
2024-01-22 09:11:44 +08:00
Xin Qiu
97f0cd8975
optimize Decilm 7b ( #9922 )
...
* optimize deci
* update
* decilm attention forward
2024-01-19 17:31:13 +08:00
Wang, Jian4
bcaeb05272
Update optimize qwen ( #9943 )
...
* update for n tokens input
* fix dtype
* update
2024-01-19 16:54:59 +08:00
binbin Deng
db8e90796a
LLM: add avg token latency information and benchmark guide of autotp ( #9940 )
2024-01-19 15:09:57 +08:00
Ruonan Wang
bf37b3a670
LLM: optimize CPU speculative decoding of chatglm3 ( #9928 )
...
* update
* fix style
* meet code review
2024-01-19 14:10:22 +08:00
Shaojun Liu
967714bac8
gguf memory optimization for mixtral ( #9939 )
2024-01-19 11:13:15 +08:00
Xin Qiu
610b5226be
move reserved memory to benchmark_utils.py ( #9907 )
...
* move reserved memory to benchmark_utils.py
* meet code review
2024-01-19 09:44:30 +08:00
Lilac09
7032a2ad73
Optimize gguf load memory for mistral ( #9923 )
...
* optimize gguf load for mistral
* fix output of gguf mistral
* reset
2024-01-19 09:14:39 +08:00
Shaojun Liu
9a46f019d7
gguf memory optimization for baichuan ( #9937 )
2024-01-19 09:11:02 +08:00
Guancheng Fu
2e1448f08e
[Serving] Add vllm_worker to fastchat serving framework ( #9934 )
...
* add worker
* finish
* finish
* add license
* add more comments
2024-01-18 21:33:36 +08:00
Chen, Zhentao
a8c866c32b
add ppl benchmark ( #9914 )
...
* add ppl benchmark
* add license
* add readme
* add dataset argument
* add dataset usage
* fixed low bit args
* correct result
* fix terminal display
* fix ppl update
* enable fp16 fp32 bf16
* format the desc
* fix model_kwargs
* add more readme
2024-01-18 17:54:28 +08:00
WeiguangHan
100e0a87e5
LLM: add compressed chatglm3 model ( #9892 )
...
* LLM: add compressed chatglm3 model
* small fix
* revert github action
2024-01-18 17:48:15 +08:00
Yuwen Hu
9e2ac5291b
Add rwkv v4 back for igpu perf test 32-512 ( #9938 )
2024-01-18 17:15:28 +08:00
Yishuo Wang
7bbb98abb6
Disable fused layer norm when using XMX to fix mpt UT ( #9933 )
2024-01-18 16:22:12 +08:00
Wang, Jian4
1fc9dfa265
LLM: Update for Qwen n tokens inputs ( #9931 )
...
* update for n tokens inputs
* update style
* update
2024-01-18 15:56:29 +08:00
Heyang Sun
5184f400f9
Fix Mixtral GGUF Wrong Output Issue ( #9930 )
...
* Fix Mixtral GGUF Wrong Output Issue
* fix style
* fix style
2024-01-18 14:11:27 +08:00
Yishuo Wang
453df868c9
add rwkv v5 attention kernel ( #9927 )
2024-01-18 10:16:29 +08:00
Ruonan Wang
054952f82f
LLM: Fix rope of chatglm3 to support speculative decoding on CPU ( #9926 )
2024-01-18 09:28:10 +08:00
Ziteng Zhang
18cd1f1432
[LLM]Solve the problem of calling bmm operator in BF16Linear ( #9924 )
...
* Solve the problem of calling bmm operator in BF16Linear
2024-01-17 18:08:35 +08:00
Yina Chen
98b86f83d4
Support fast rope for training ( #9745 )
...
* init
* init
* fix style
* add test and fix
* address comment
* update
* merge upstream main
2024-01-17 15:51:38 +08:00
Yuwen Hu
0c498a7b64
Add llama2-13b to igpu perf test ( #9920 )
2024-01-17 14:58:45 +08:00
Ruonan Wang
b059a32fff
LLM: add benchmark api for bigdl-llm fp16 on GPU ( #9919 )
...
* add bmk for bigdl fp16
* fix
2024-01-17 14:24:35 +08:00
Ruonan Wang
427f75000b
LLM: fix sdp of chatglm3 ( #9917 )
...
* fix
* fix
* fix
2024-01-17 13:37:28 +08:00
Yishuo Wang
94767da7cf
optimize rwkv v4 first token performance ( #9912 )
2024-01-17 09:27:41 +08:00
Cengguang Zhang
511cbcf773
LLM: add Ceval benchmark test. ( #9872 )
...
* init ceval benchmark test.
* upload dataset.
* add other tests.
* add qwen evaluator.
* fix qwen evaluator style.
* fix qwen evaluator style.
* update qwen evaluator.
* add llama evaluator.
* update eval
* fix typo.
* fix
* fix typo.
* fix llama evaluator.
* fix bug.
* fix style.
* delete dataset.
* fix style.
* fix style.
* add README.md and fix typo.
* fix comments.
* remove run scripts
2024-01-16 19:14:26 +08:00
Shaojun Liu
b909c5c9c2
GGUF load memory optimization ( #9913 )
...
* block-wise
* convert linear for module
* revert
* Fix PEP8 checks Error
2024-01-16 18:54:39 +08:00
Yuwen Hu
8643b62521
[LLM] Support longer context in iGPU perf tests (2048-256) ( #9910 )
2024-01-16 17:48:37 +08:00
Xin Qiu
dee32f7d15
copy fused rms norm's result to avoid <unk> ( #9909 )
2024-01-16 16:54:08 +08:00
Ruonan Wang
8d7326ae03
LLM: fix chatglm3 sdp to support speculative decoding ( #9900 )
...
* fix chatglm3
* fix
* update
* meet code review
* fix
2024-01-16 11:29:13 +08:00
Guancheng Fu
9f34da7cdb
Update PVC XMX condition ( #9901 )
...
* update pvc xmx condition
* update condition
* update condition
2024-01-15 15:42:15 +08:00
Yishuo Wang
6637860ddf
change xmx condition ( #9896 )
2024-01-12 19:51:48 +08:00
WeiguangHan
0e69bfe6b0
LLM: fix the performance drop of starcoder ( #9889 )
...
* LLM: fix the performance drop of starcoder
* small fix
* small fix
2024-01-12 09:14:15 +08:00
Ruonan Wang
d9cf55bce9
LLM: fix MLP check of mixtral ( #9891 )
2024-01-11 18:01:59 +08:00
Ziteng Zhang
4f4ce73f31
[LLM] Add transformer_autocast_bf16 into all-in-one ( #9890 )
...
* Add transformer_autocast_bf16 into all-in-one
2024-01-11 17:51:07 +08:00
Ziteng Zhang
4af88a67b9
support chatglm3 with bf16 ( #9888 )
...
* support chatglm3 with bigdl-bf16
2024-01-11 16:45:21 +08:00
Yuwen Hu
0aef35a965
[LLM] Improve LLM doc regarding windows gpu related info ( #9880 )
...
* Improve runtime configuration for windows
* Add python 310/311 supports for wheel downloading
* Add troubleshooting for windows gpu
* Remove manually import ipex due to auto importer
* Add info regarding cpu_embedding=True on iGPU
* More info for Windows users
* Small updates to API docs
* Python style fix
* Remove tip for loading from saved optimize_model for now
* Updated based on comments
* Update win info for multi-intel gpus selection
* Small fix
* Small fix
2024-01-11 14:37:16 +08:00
Jinyi Wan
07485eff5a
Add SOLAR-10.7B to README ( #9869 )
2024-01-11 14:28:41 +08:00
WeiguangHan
33fd1f9c76
LLM: fix input length logic for run_transformer_int4_gpu ( #9864 )
...
* LLM: fix input length logic for run_transformer_int4_gpu
* small fix
* small fix
* small fix
2024-01-10 18:20:14 +08:00
Ruonan Wang
53531ae4ee
LLM: support qkv fusion for fp8e5 ( #9878 )
...
* update
* add mistral
* meet code review
2024-01-10 17:50:00 +08:00
Lilac09
cb32b985ec
add mistral and chatglm support to vllm ( #9879 )
...
* add mistral and chatglm support to vllm
* add mistral and chatglm support to vllm
2024-01-10 15:38:42 +08:00
ZehuaCao
e76d984164
[LLM] Support llm-awq vicuna-7b-1.5 on arc ( #9874 )
...
* support llm-awq vicuna-7b-1.5 on arc
* support llm-awq vicuna-7b-1.5 on arc
2024-01-10 14:28:39 +08:00
Ruonan Wang
3e05c9e11b
LLM: update esimd sdp kernel ( #9871 )
2024-01-09 18:10:01 +08:00
Yuwen Hu
023679459e
[LLM] Small fixes for finetune related examples and UTs ( #9870 )
2024-01-09 18:05:03 +08:00
Cheen Hau, 俊豪
b2aa267f50
Enhance LLM GPU installation document ( #9828 )
...
* Improve gpu install doc
* Add troubleshooting - setvars.sh not done properly.
* Further improvements
* 2024.x.x -> 2024.0
* Fixes
* Fix Install BigDL-LLM From Wheel : bigdl-llm[xpu_2.0]
* Remove "export USE_XETLA=OFF" for Max GPU
2024-01-09 16:30:50 +08:00
Yuwen Hu
23fc888abe
Update llm gpu xpu default related info to PyTorch 2.1 ( #9866 )
2024-01-09 15:38:47 +08:00
Yishuo Wang
36496d60ac
only use quantize kv cache on MTL ( #9862 )
2024-01-09 13:24:02 +08:00
ZehuaCao
146076bdb5
Support llm-awq backend ( #9856 )
...
* Support for LLM-AWQ Backend
* fix
* Update README.md
* Add awqconfig
* modify init
* update
* support llm-awq
* fix style
* fix style
* update
* fix AwqBackendPackingMethod not found error
* fix style
* update README
* fix style
---------
Co-authored-by: Uxito-Ada <414416158@qq.com>
Co-authored-by: Heyang Sun <60865256+Uxito-Ada@users.noreply.github.com>
Co-authored-by: cyita <yitastudy@gmail.com>
2024-01-09 13:07:32 +08:00
Ruonan Wang
fea6f16057
LLM: add mlp fusion for fp8e5 and update related check ( #9860 )
...
* update mlp fusion
* fix style
* update
2024-01-09 09:56:32 +08:00
binbin Deng
294fd32787
LLM: update DeepSpeed AutoTP example with GPU memory optimization ( #9823 )
2024-01-09 09:22:49 +08:00
Yuwen Hu
5ba1dc38d4
[LLM] Change default Linux GPU install option to PyTorch 2.1 ( #9858 )
...
* Update default xpu to ipex 2.1
* Update related install ut support correspondingly
* Add arc ut tests for both ipex 2.0 and 2.1
* Small fix
* Disable ipex 2.1 test for now as oneapi 2024.0 has not been installed on the test machine
* Update document for default PyTorch 2.1
* Small fix
* Small fix
* Small doc fixes
* Small fixes
2024-01-08 17:16:17 +08:00
Mingyu Wei
ed81baa35e
LLM: Use default typing-extension in LangChain examples ( #9857 )
...
* remove typing extension downgrade in readme; minor fixes of code
* fix typos in README
* change default question of docqa.py
2024-01-08 16:50:55 +08:00
Jiao Wang
3b6372ab12
Fix Llama transformers 4.36 support ( #9852 )
...
* support 4.36
* style
* update
* update
* update
* fix merge
* update
2024-01-08 00:32:23 -08:00
Chen, Zhentao
1b585b0d40
set fp8 default as e5m2 ( #9859 )
2024-01-08 15:53:57 +08:00
Ruonan Wang
dc995006cc
LLM: add flash attention for mistral / mixtral ( #9846 )
...
* add flash attention for mistral
* update
* add flash attn for mixtral
* fix style
2024-01-08 09:51:34 +08:00
Yishuo Wang
afaa871144
[LLM] support quantize kv cache to fp8 ( #9812 )
2024-01-08 09:28:20 +08:00
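This commit stores the kv cache in fp8 (e5m2, per the "set fp8 default as e5m2" commit above), halving cache memory versus fp16. A toy sketch of what rounding a value to e5m2 (1 sign, 5 exponent, 2 mantissa bits) means numerically; pure illustration that ignores subnormals, saturation, and NaN/inf, whereas the real conversion happens in device kernels:

```python
import math

def to_e5m2(x: float) -> float:
    """Round a float to the nearest e5m2-representable value
    (toy sketch: no subnormal, overflow, or NaN/inf handling)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    m, e = math.frexp(abs(x))              # abs(x) = m * 2**e, with 0.5 <= m < 1
    significand, exponent = m * 2, e - 1   # rewrite as s * 2**exponent, 1 <= s < 2
    # a 2-bit mantissa gives the significand a resolution of 1/4
    s_q = round(significand * 4) / 4
    return sign * s_q * (2.0 ** exponent)
```

With only four mantissa steps per binade, e5m2 trades precision for range, which suits kv-cache activations better than int8-style scaling.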
Jiao Wang
248ae7fad2
LLama optimize_model to support transformers 4.36 ( #9818 )
...
* support 4.36
* style
* update
* update
* update
2024-01-05 11:30:18 -08:00
Ruonan Wang
a60bda3324
LLM: update check for deepspeed ( #9838 )
2024-01-05 16:44:10 +08:00
Ruonan Wang
16433dd959
LLM: fix first token judgement of flash attention ( #9841 )
...
* fix flash attention
* meet code review
* fix
2024-01-05 13:49:37 +08:00
Yina Chen
f919f5792a
fix kv cache out of bound ( #9827 )
2024-01-05 12:38:57 +08:00
Ruonan Wang
5df31db773
LLM: fix accuracy issue of chatglm3 ( #9830 )
...
* add attn mask for first token
* fix
* fix
* change attn calculation
* fix
* fix
* fix style
* fix style
2024-01-05 10:52:05 +08:00
Jinyi Wan
3147ebe63d
Add cpu and gpu examples for SOLAR-10.7B ( #9821 )
2024-01-05 09:50:28 +08:00
WeiguangHan
ad6b182916
LLM: change the color of peak diff ( #9836 )
2024-01-04 19:30:32 +08:00
Xiangyu Tian
38c05be1c0
[LLM] Fix dtype mismatch in Baichuan2-13b ( #9834 )
2024-01-04 15:34:42 +08:00
Ruonan Wang
8504a2bbca
LLM: update qlora alpaca example to change lora usage ( #9835 )
...
* update example
* fix style
2024-01-04 15:22:20 +08:00
Ziteng Zhang
05b681fa85
[LLM] IPEX auto importer set on by default ( #9832 )
...
* Set BIGDL_IMPORT_IPEX default to True
* Remove import intel_extension_for_pytorch as ipex from GPU example
2024-01-04 13:33:29 +08:00
Wang, Jian4
4ceefc9b18
LLM: Support bitsandbytes config on qlora finetune ( #9715 )
...
* test support bitsandbytesconfig
* update style
* update cpu example
* update example
* update readme
* update unit test
* use bfloat16
* update logic
* use int4
* set default bnb_4bit_use_double_quant
* update
* update example
* update model.py
* update
* support lora example
2024-01-04 11:23:16 +08:00
WeiguangHan
9a14465560
LLM: add peak diff ( #9789 )
...
* add peak diff
* small fix
* revert yml file
2024-01-03 18:18:19 +08:00
Mingyu Wei
f4eb5da42d
disable arc ut ( #9825 )
2024-01-03 18:10:34 +08:00
Ruonan Wang
20e9742fa0
LLM: fix chatglm3 issue ( #9820 )
...
* fix chatglm3 issue
* small update
2024-01-03 16:15:55 +08:00
Wang, Jian4
a54cd767b1
LLM: Add gguf falcon ( #9801 )
...
* init falcon
* update convert.py
* update style
2024-01-03 14:49:02 +08:00
Yuwen Hu
668c2095b1
Remove unnecessary warning when installing llm ( #9815 )
2024-01-03 10:30:05 +08:00
dingbaorong
f5752ead36
Add whisper test ( #9808 )
...
* add whisper benchmark code
* add librispeech_asr.py
* add bigdl license
2024-01-02 16:36:05 +08:00
binbin Deng
6584539c91
LLM: fix installation of codellama ( #9813 )
2024-01-02 14:32:50 +08:00
Kai Huang
4d01069302
Temp remove baichuan2-13b 1k from arc perf test ( #9810 )
2023-12-29 12:54:13 +08:00
dingbaorong
a2e668a61d
fix arc ut test ( #9736 )
2023-12-28 16:55:34 +08:00
Qiyuan Gong
f0f9d45eac
[LLM] IPEX import support bigdl-core-xe-21 ( #9769 )
...
Add support for bigdl-core-xe-21.
2023-12-28 15:23:58 +08:00
dingbaorong
a8baf68865
fix csv_to_html ( #9802 )
2023-12-28 14:58:51 +08:00
Guancheng Fu
5857a38321
[vLLM] Add option to adjust KV_CACHE_ALLOC_BLOCK_LENGTH ( #9782 )
...
* add option kv_cache_block
* change var name
2023-12-28 14:41:47 +08:00
Ruonan Wang
99bddd3ab4
LLM: better FP16 support for Intel GPUs ( #9791 )
...
* initial support
* fix
* fix style
* fix
* limit esimd usage condition
* refactor code
* fix style
* small fix
* meet code review
* small fix
2023-12-28 13:30:13 +08:00
Yishuo Wang
7d9f6c6efc
fix cpuinfo error ( #9793 )
2023-12-28 09:23:44 +08:00
Wang, Jian4
7ed9538b9f
LLM: support gguf mpt ( #9773 )
...
* add gguf mpt
* update
2023-12-28 09:22:39 +08:00
Cengguang Zhang
d299f108d0
update falcon attention forward. ( #9796 )
2023-12-28 09:11:59 +08:00
Shaojun Liu
a5e5c3daec
set warm_up: 3 num_trials: 50 for cpu stress test ( #9799 )
2023-12-28 08:55:43 +08:00
dingbaorong
f6bb4ab313
Arc stress test ( #9795 )
...
* add arc stress test
* trigger ci
* trigger CI
* trigger ci
* disable ci
2023-12-27 21:02:41 +08:00
Kai Huang
40eaf76ae3
Add baichuan2-13b to Arc perf ( #9794 )
...
* add baichuan2-13b
* fix indent
* revert
2023-12-27 19:38:53 +08:00
Shaojun Liu
6c75c689ea
bigdl-llm stress test for stable version ( #9781 )
...
* 1k-512 2k-512 baseline
* add cpu stress test
* update yaml name
* update
* update
* clean up
* test
* update
* update
* update
* test
* update
2023-12-27 15:40:53 +08:00
dingbaorong
5cfb4c4f5b
Arc stable version performance regression test ( #9785 )
...
* add arc stable version regression test
* empty gpu mem between different models
* trigger ci
* comment spr test
* trigger ci
* address kai's comments and disable ci
* merge fp8 and int4
* disable ci
2023-12-27 11:01:56 +08:00
binbin Deng
40edb7b5d7
LLM: fix get environment variables setting ( #9787 )
2023-12-27 09:11:37 +08:00
Kai Huang
689889482c
Reduce max_cache_pos to reduce Baichuan2-13B memory ( #9694 )
...
* optimize baichuan2 memory
* fix
* style
* fp16 mask
* disable fp16
* fix style
* empty cache
* revert empty cache
2023-12-26 19:51:25 +08:00
Jason Dai
361781bcd0
Update readme ( #9788 )
2023-12-26 19:46:11 +08:00
Yuwen Hu
c38e18f2ff
[LLM] Migrate iGPU perf tests to new machine ( #9784 )
...
* Move 1024 test just after 32-32 test; and enable all models for 1024-128
* Make sure python output encoding is utf-8 so that redirecting to txt always succeeds
* Upload results to ftp
* Small fix
2023-12-26 19:15:57 +08:00
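The UTF-8 note in the entry above comes down to one call on Python 3.7+. Demonstrated here on an in-memory stream so it runs anywhere; on a real benchmark run the same call would be `sys.stdout.reconfigure(encoding="utf-8")`.

```python
import io

# Simulate stdout with an in-memory binary buffer wrapped in a text layer.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding="ascii")

# Same method used on sys.stdout (Python 3.7+): after this, redirecting
# output to a .txt file cannot fail on non-ASCII characters.
out.reconfigure(encoding="utf-8")
out.write("latency \u2248 35 ms")  # would raise UnicodeEncodeError under ascii
out.flush()
print(buf.getvalue().decode("utf-8"))
```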
WeiguangHan
c05d7e1532
LLM: add star_corder_15.5b model ( #9772 )
...
* LLM: add star_corder_15.5b model
* revert llm_performance_tests.yml
2023-12-26 18:55:56 +08:00
Ziteng Zhang
44b4a0c9c5
[LLM] Correct prompt format of Yi, Llama2 and Qwen in generate.py ( #9786 )
...
* correct prompt format of Yi
* correct prompt format of llama2 in cpu generate.py
* correct prompt format of Qwen in GPU example
2023-12-26 16:57:55 +08:00
Xiangyu Tian
0ea842231e
[LLM] vLLM: Add api_server entrypoint ( #9783 )
...
Add vllm.entrypoints.api_server for benchmark_serving.py in vllm.
2023-12-26 16:03:57 +08:00
dingbaorong
64d05e581c
add peak gpu mem stats in transformer_int4_gpu ( #9766 )
...
* add peak gpu mem stats in transformer_int4_gpu
* address weiguang's comments
2023-12-26 15:38:28 +08:00
Ziteng Zhang
87b4100054
[LLM] Support Yi model in chat.py ( #9778 )
...
* Support Yi model
* code style & add reference link
2023-12-26 10:03:39 +08:00
Ruonan Wang
11d883301b
LLM: fix wrong batch output caused by flash attention ( #9780 )
...
* fix
* meet code review
* move batch size check to the beginning
* move qlen check inside function
* meet code review
2023-12-26 09:41:27 +08:00
Heyang Sun
66e286a73d
Support for Mixtral AWQ ( #9775 )
...
* Support for Mixtral AWQ
* Update README.md
* Update README.md
* Update awq_config.py
* Update README.md
* Update README.md
2023-12-25 16:08:09 +08:00
Ruonan Wang
1917bbe626
LLM: fix BF16Linear related training & inference issue ( #9755 )
...
* fix bf16 related issue
* fix
* update based on comment & add arc lora script
* update readme
* update based on comment
* update based on comment
* update
* force to bf16
* fix style
* move check input dtype into function
* update convert
* meet code review
* meet code review
* update merged model to support new training_mode api
* fix typo
2023-12-25 14:49:30 +08:00
Xiangyu Tian
30dab36f76
[LLM] vLLM: Fix kv cache init ( #9771 )
...
Fix kv cache init
2023-12-25 14:17:06 +08:00
Yina Chen
449b387125
Support relora in bigdl-llm ( #9687 )
...
* init
* fix style
* update
* support resume & update readme
* update
* update
* remove important
* add training mode
* meet comments
2023-12-25 14:04:28 +08:00
Shaojun Liu
b6222404b8
bigdl-llm stable version: let the perf test fail if the difference between perf and baseline is greater than 5% ( #9750 )
...
* test
* test
* test
* update
* revert
2023-12-25 13:47:11 +08:00
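The gate described in the entry above (fail when perf deviates from baseline by more than 5%) can be sketched as below; whether the real check is one-sided or symmetric is an assumption here.

```python
def perf_regressed(perf: float, baseline: float, tol: float = 0.05) -> bool:
    """Minimal sketch of the stable-version perf gate: flag a failure
    when the measured number deviates from the baseline by more than
    5% (symmetric comparison is an assumption)."""
    return abs(perf - baseline) / baseline > tol

print(perf_regressed(94.0, 100.0))   # 6% deviation -> True
print(perf_regressed(97.0, 100.0))   # 3% deviation -> False
```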
Ziteng Zhang
986f65cea9
[LLM] Add trust_remote_code for local renamed model in bigdl_llm_model.py ( #9762 )
2023-12-25 11:31:14 +08:00
Yishuo Wang
be13b162fe
add codeshell example ( #9743 )
2023-12-25 10:54:01 +08:00
Guancheng Fu
daf536fb2d
vLLM: Apply attention optimizations for selective batching ( #9758 )
...
* fuse_rope for prefill
* apply kv_cache optimizations
* apply fast_decoding_path
* Re-enable kv_cache optimizations for prefill
* reduce KV_CACHE_ALLOC_BLOCK for selective_batching
2023-12-25 10:29:31 +08:00
binbin Deng
ed8ed76d4f
LLM: update deepspeed autotp usage ( #9733 )
2023-12-25 09:41:14 +08:00
Yuwen Hu
02436c6cce
[LLM] Enable more long context in-out pairs for iGPU perf tests ( #9765 )
...
* Add test for 1024-128 and enable more tests for 512-64
* Fix date in results csv name to the time when the performance is triggered
* Small fix
* Small fix
* further fixes
2023-12-22 18:18:23 +08:00
Chen, Zhentao
7fd7c37e1b
Enable fp8e5 harness ( #9761 )
...
* fix precision format like fp8e5
* match fp8_e5m2
2023-12-22 16:59:48 +08:00
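The fp8e5 fix above maps a shorthand precision label onto the canonical `fp8_e5m2` spelling. A hypothetical normalizer in that spirit (the regex and fallback behavior are assumptions, not the harness's actual code):

```python
import re

def normalize_precision(label: str) -> str:
    """Hypothetical sketch: map shorthand such as 'fp8e5' to the
    canonical 'fp8_e5m2' spelling matched by the harness."""
    if re.fullmatch(r"fp8[_-]?e5(m2)?", label.lower()):
        return "fp8_e5m2"
    return label.lower()

print(normalize_precision("fp8e5"))     # fp8_e5m2
print(normalize_precision("sym_int4"))  # sym_int4
```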
Qiyuan Gong
4c487313f2
Revert "[LLM] IPEX auto importer turn on by default for XPU ( #9730 )" ( #9759 )
...
This reverts commit 0284801fbd .
2023-12-22 16:38:24 +08:00
Qiyuan Gong
0284801fbd
[LLM] IPEX auto importer turn on by default for XPU ( #9730 )
...
* Set BIGDL_IMPORT_IPEX default to true, i.e., auto import IPEX for XPU.
* Remove import intel_extension_for_pytorch as ipex from GPU example.
* Add support for bigdl-core-xe-21.
2023-12-22 16:20:32 +08:00
Chen, Zhentao
86a69e289c
fix harness runner label of manual trigger ( #9754 )
...
* fix runner
* update golden
2023-12-22 15:09:22 +08:00
Guancheng Fu
fdf93c9267
Implement selective batching for vLLM ( #9659 )
...
* add control to load hf model
* finish initial version of selective_batching
* temp
* finish
* Remove print statement
* fix error
* Apply yang's optimization
* a version that works
* We need to check kv_cache passed in, this could be an error. TODO: add fast decoding path
* format
* temp solution: not batching prefill requests
* a version that works for prefill batching
* format
* a solid version: works normally
* a temp version
* Solid version: remove redundant functions
* fix format
* format
* solid: add option to enable selective_batching
* remove logic for using transformer models
* format
* format
* solid: enable argument VLLM_ENABLE_SELECTIVE_BATCHING
* format
* finish
* format
2023-12-22 13:45:46 +08:00
Ruonan Wang
2f36769208
LLM: bigdl-llm lora support & lora example ( #9740 )
...
* lora support and single card example
* support multi-card, refactor code
* fix model id and style
* remove torch patch, add two new class for bf16, update example
* fix style
* change to training_mode
* small fix
* add more info in help
* fix style, update readme
* fix ut
* fix ut
* Handling compatibility issues with default LoraConfig
2023-12-22 11:05:39 +08:00
SONG Ge
ba0b939579
[LLM] Support transformers-v4.36.0 on mistral model ( #9744 )
...
* add support transformers-v4.36.0 on mistral model
* python/llm/src/bigdl/llm/transformers/models/mistral.py
* make the redundant implementation as utils
* fix code style
* fix
* fix style
* update with utils enough_kv_room
2023-12-22 09:59:27 +08:00
Xin Qiu
e36111e713
Mixtral fused qkv and rope ( #9724 )
...
* Mixtral fused qkv and rope
* fix and clean
* fix style
* update
* update
* fix
* update
* fix
2023-12-22 09:26:35 +08:00
Jiao Wang
e4f6e43675
safetensors to false ( #9728 )
2023-12-21 14:41:51 -08:00
Shaojun Liu
bb52239e0a
bigdl-llm stable version release & test ( #9732 )
...
* stable version test
* trigger spr test
* update
* trigger
* test
* test
* test
* test
* test
* refine
* release linux first
2023-12-21 22:55:33 +08:00
WeiguangHan
d4d2ccdd9d
LLM: remove starcoder-15.5b ( #9748 )
2023-12-21 18:52:52 +08:00
WeiguangHan
474c099559
LLM: using separate threads to do inference ( #9727 )
...
* using separate threads to do inference
* resolve some comments
* resolve some comments
* revert llm_performance_tests.yml file
2023-12-21 17:56:43 +08:00
Yishuo Wang
426660b88e
simplify qwen attention ( #9747 )
2023-12-21 17:53:29 +08:00
Wang, Jian4
984697afe2
LLM: Add bloom gguf support ( #9734 )
...
* init
* update bloom add merges
* update
* update readme
* update for llama error
* update
2023-12-21 14:06:25 +08:00
Heyang Sun
df775cf316
fix python style ( #9742 )
...
* fix python style
* fix
* fix
2023-12-21 11:25:05 +08:00
Chen, Zhentao
b06a3146c8
Fix 70b oom ( #9738 )
...
* add default value to bigdl llm
* fix model oom
2023-12-21 10:40:52 +08:00
Xin Qiu
6c3e698bf1
mistral decoding_fast_path and fused mlp ( #9714 )
...
* mistral decoding_fast_path and fused mlp
* meet code review
2023-12-21 10:11:37 +08:00
Heyang Sun
d157f623b6
Load Mixtral gguf in a block-wise way ( #9725 )
...
* Load Mixtral gguf in a block-wise way
* refine
2023-12-21 10:03:23 +08:00
WeiguangHan
34bb804189
LLM: check csv and its corresponding yaml file ( #9702 )
...
* LLM: check csv and its corresponding yaml file
* run PR arc perf test
* modify the name of some variables
* execute the check results script in right place
* use cp to replace mv command
* resolve some comments
* resolve more comments
* revert the llm_performance_test.yaml file
2023-12-21 09:54:33 +08:00
Zhao Changmin
4bda975a3e
LLM: Align lowbit model config ( #9735 )
...
* align lowbit model config
2023-12-21 09:48:58 +08:00
Wang, Jian4
e1e921f425
LLM: gguf other model using dtype ( #9729 )
2023-12-21 09:33:40 +08:00
Yishuo Wang
13ea6330bd
optimize qwen rope ( #9737 )
2023-12-20 17:34:34 +08:00
Ziteng Zhang
4c032a433e
[LLM] Add glibc checker ( #9624 )
...
* Add glibc checker
* Add env BIGDL_GLIBC_CHECK to control glibc checker. The default is false, i.e., don't check.
2023-12-20 16:52:43 +08:00
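The glibc checker above is controlled by `BIGDL_GLIBC_CHECK` and is off by default. A stdlib-only sketch of what such a guard might do (the minimum version shown is a made-up example, and the fallback behavior is an assumption):

```python
import os
import platform

def glibc_ok(min_version=(2, 17)):
    """Sketch of an opt-in glibc guard; 2.17 is a hypothetical floor.
    Per the commit, checking is disabled unless BIGDL_GLIBC_CHECK is set."""
    if os.environ.get("BIGDL_GLIBC_CHECK", "false").lower() != "true":
        return True  # default: don't check
    libc, ver = platform.libc_ver()
    if libc != "glibc" or not ver:
        return True  # can't determine the version; don't block
    found = tuple(int(p) for p in ver.split(".")[:2])
    return found >= min_version

print(glibc_ok())
```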
Yina Chen
cd652a1710
Support fp8 e5m2 on arc ( #9711 )
...
* init
* fix style
* update
* fix style
* update
2023-12-20 16:26:17 +08:00
Yishuo Wang
e54c428d30
add bf16/fp16 fuse mlp support ( #9726 )
2023-12-20 10:40:45 +08:00
Heyang Sun
612651cb5d
fix typo ( #9723 )
2023-12-20 09:41:59 +08:00
WeiguangHan
3aa8b66bc3
LLM: remove starcoder-15.5b model temporarily ( #9720 )
2023-12-19 20:14:46 +08:00
Yishuo Wang
522cf5ed82
[LLM] Improve chatglm2/3 rest token performance with long context ( #9716 )
2023-12-19 17:29:38 +08:00
Yishuo Wang
f2e6abb563
fix mlp batch size check ( #9718 )
2023-12-19 14:22:22 +08:00
Heyang Sun
1fa7793fc0
Load Mixtral GGUF Model ( #9690 )
...
* Load Mixtral GGUF Model
* refactor
* fix empty tensor when moving to cpu
* update gpu and cpu readmes
* add dtype when set tensor into module
2023-12-19 13:54:38 +08:00
Qiyuan Gong
d0a3095b97
[LLM] IPEX auto importer ( #9706 )
...
* IPEX auto importer and get_ipex_version.
* Add BIGDL_IMPORT_IPEX to control auto import, default is false.
2023-12-19 13:39:38 +08:00
Yang Wang
f4fb58d99c
fusing qkv project and rope ( #9612 )
...
* Try fusing qkv project and rope
* add fused mlp
* fuse append cache
* fix style and clean up code
* clean up
2023-12-18 16:45:00 -08:00
Kai Huang
4c112ee70c
Rename qwen in model name for arc perf test ( #9712 )
2023-12-18 20:34:31 +08:00
Cengguang Zhang
4d22add4af
LLM: fix qwen efficiency issue in perf-test.
2023-12-18 18:32:54 +08:00
Ruonan Wang
8ed89557e5
LLM: add mlp optimization of mixtral ( #9709 )
2023-12-18 16:59:52 +08:00
Chen, Zhentao
b3647507c0
Fix harness workflow ( #9704 )
...
* error when larger than 0.001
* fix env setup
* fix typo
* fix typo
2023-12-18 15:42:10 +08:00
binbin Deng
12df70953e
LLM: add resume_from_checkpoint related section ( #9705 )
2023-12-18 12:27:02 +08:00
Xin Qiu
320110d158
handle empty fused norm result ( #9688 )
...
* handle empty fused norm result
* remove fast_rms_norm
* fix style
2023-12-18 09:56:11 +08:00
Ziteng Zhang
67cc155771
[LLM] Correct chat format of llama and add llama_stream_chat in chat.py
...
* correct chat format of llama
* add llama_stream_chat
2023-12-15 16:36:46 +08:00
Ziteng Zhang
0d41b7ba7b
[LLM] Correct chat format & add stop words for chatglm3 in chat.py
...
* correct chat format of chatglm3
* correct stop words of chatglm3
2023-12-15 16:35:17 +08:00
Ziteng Zhang
d57efd8eb9
[LLM] Add stop_word for Qwen model and correct qwen chat format in chat.py ( #9642 )
...
* add stop words list for qwen
* change qwen chat format
2023-12-15 14:53:58 +08:00
SONG Ge
d5b81af7bd
Support mixtral attention optimization on transformers-v4.36.0 ( #9674 )
...
* add example code to support mistral/mixtral attention on transformers v4.36.0
* update
* style fix
* add update for seen-tokens
* support mixtral
* rm mistral change
* small fix
* add more comments and remove use_cache part
---------
Co-authored-by: plusbang <binbin1.deng@intel.com>
2023-12-15 14:30:23 +08:00
Cengguang Zhang
adbef56001
LLM: update qwen attention forward. ( #9695 )
...
* feat: update qwen attention forward.
* fix: style.
2023-12-15 14:06:15 +08:00
Wang, Jian4
b8437a1c1e
LLM: Add gguf mistral model support ( #9691 )
...
* add mistral support
* need to upgrade transformers version
* update
2023-12-15 13:37:39 +08:00
Wang, Jian4
496bb2e845
LLM: Support load BaiChuan model family gguf model ( #9685 )
...
* support baichuan model family gguf model
* update gguf generate.py
* add verify models
* add support model_family
* update
* update style
* update type
* update readme
* update
* remove support model_family
2023-12-15 13:34:33 +08:00
Lilac09
3afed99216
fix path issue ( #9696 )
2023-12-15 11:21:49 +08:00
Jason Dai
37f509bb95
Update readme ( #9692 )
2023-12-14 19:50:21 +08:00
WeiguangHan
1f0245039d
LLM: check the final csv results for arc perf test ( #9684 )
...
* LLM: check the final csv results for arc perf test
* delete useless python script
* change threshold
* revert the llm_performance_tests.yml
2023-12-14 19:46:08 +08:00
Yishuo Wang
9a330bfc2b
fix fuse mlp when using q5_0 or fp8 ( #9689 )
2023-12-14 16:16:05 +08:00
Yuwen Hu
82ac2dbf55
[LLM] Small fixes for win igpu test for ipex 2.1 ( #9686 )
...
* Fixes to install for igpu performance tests
* Small update for core performance tests model lists
2023-12-14 15:39:51 +08:00
WeiguangHan
3e8d198b57
LLM: add eval func ( #9662 )
...
* Add eval func
* add left eval
2023-12-14 14:59:02 +08:00
Ziteng Zhang
21c7503a42
[LLM] Correct prompt format of Qwen in generate.py ( #9678 )
...
* Change qwen prompt format to chatml
2023-12-14 14:01:30 +08:00
Qiyuan Gong
223c9622f7
[LLM] Mixtral CPU examples ( #9673 )
...
* Mixtral CPU PyTorch and hugging face examples, based on #9661 and #9671
2023-12-14 10:35:11 +08:00
Xin Qiu
5e46e0e5af
fix baichuan2-7b 1st token performance regression on xpu ( #9683 )
...
* fix baichuan2-7b 1st token performance regression
* add comments
* fix style
2023-12-14 09:58:32 +08:00
ZehuaCao
877229f3be
[LLM]Add Yi-34B-AWQ to verified AWQ model. ( #9676 )
...
* verify Yi-34B-AWQ
* update
2023-12-14 09:55:47 +08:00
binbin Deng
68a4be762f
remove disco mixtral, update oneapi version ( #9671 )
2023-12-13 23:24:59 +08:00
Ruonan Wang
1456d30765
LLM: add dot to option name in setup ( #9682 )
2023-12-13 20:57:27 +08:00
Yuwen Hu
cbdd49f229
[LLM] win igpu performance for ipex 2.1 and oneapi 2024.0 ( #9679 )
...
* Change igpu win tests for ipex 2.1 and oneapi 2024.0
* Qwen model repo id updates; updates model list for 512-64
* Add .eval for win igpu all-in-one benchmark for best performance
2023-12-13 18:52:29 +08:00
Mingyu Wei
16febc949c
[LLM] Add exclude option in all-in-one performance test ( #9632 )
...
* add exclude option in all-in-one perf test
* update arc-perf-test.yaml
* Exclude in_out_pairs in main function
* fix some bugs
* address Kai's comments
* define excludes at the beginning
* add bloomz:2048 to exclude
2023-12-13 18:13:06 +08:00
Ruonan Wang
9b9cd51de1
LLM: update setup to provide new install option to support ipex 2.1 & oneapi 2024 ( #9647 )
...
* update setup
* default to 2.0 now
* meet code review
2023-12-13 17:31:56 +08:00
Yishuo Wang
09ca540f9b
use fuse mlp in qwen ( #9672 )
2023-12-13 17:20:08 +08:00
Ruonan Wang
c7741c4e84
LLM: update moe block convert to optimize rest token latency of Mixtral ( #9669 )
...
* update moe block convert
* further accelerate final_hidden_states
* fix style
* fix style
2023-12-13 16:17:06 +08:00
ZehuaCao
503880809c
verify codeLlama ( #9668 )
2023-12-13 15:39:31 +08:00
Xiangyu Tian
1c6499e880
[LLM] vLLM: Support Mixtral Model ( #9670 )
...
Add Mixtral support for BigDL vLLM.
2023-12-13 14:44:47 +08:00
Ruonan Wang
dc5b1d7e9d
LLM: integrate sdp kernel for FP16 rest token inference on GPU [DG2/ATSM] ( #9633 )
...
* integrate sdp
* update api
* fix style
* meet code review
* fix
* distinguish mtl from arc
* small fix
2023-12-13 11:29:57 +08:00
Qiyuan Gong
5b0e7e308c
[LLM] Add support for empty activation ( #9664 )
...
* Add support for empty activation, e.g., [0, 4096]. Empty activation is allowed by PyTorch.
* Add comments.
2023-12-13 11:07:45 +08:00
SONG Ge
284e7697b1
[LLM] Optimize ChatGLM2 kv_cache to support beam_search on ARC ( #9579 )
...
* optimize kv_cache to support beam_search on Arc
* correctness test update
* fix query_length issue
* simplify implementation
* only enable the optimization on gpu device
* limit the beam_search support only enabled with gpu device and batch_size > 1
* add comments for beam_search case and revert ut change
* meet comments
* add more comments to describe the difference between multi-cases
2023-12-13 11:02:14 +08:00
Heyang Sun
c64e2248ef
fix str returned by get_int_from_str rather than expected int ( #9667 )
2023-12-13 11:01:21 +08:00
binbin Deng
bf1bcf4a14
add official Mixtral model support ( #9663 )
2023-12-12 22:27:07 +08:00
Ziteng Zhang
8931f2eb62
[LLM] Fix transformer qwen size mismatch and rename causal_mask ( #9655 )
...
* Fix size mismatching caused by context_layer
* Change registered_causal_mask to causal_mask
2023-12-12 20:57:40 +08:00
binbin Deng
2fe38b4b9b
LLM: add mixtral GPU examples ( #9661 )
2023-12-12 20:26:36 +08:00
Yuwen Hu
968d99e6f5
Remove empty cache between each iteration of generation ( #9660 )
2023-12-12 17:24:06 +08:00
Xin Qiu
0e639b920f
disable test_optimized_model.py temporarily due to out of memory on A730M(pr validation machine) ( #9658 )
...
* disable test_optimized_model.py
* disable seq2seq
2023-12-12 17:13:52 +08:00
binbin Deng
59ce86d292
LLM: support optimize_model=True for Mixtral model ( #9657 )
2023-12-12 16:41:26 +08:00
Yuwen Hu
d272b6dc47
[LLM] Enable generation of html again for win igpu tests ( #9652 )
...
* Enable generation of html again and comment out rwkv for 32-512 as it is not very stable
* Small fix
2023-12-11 19:15:17 +08:00
WeiguangHan
afa895877c
LLM: fix the issue that may generate blank html ( #9650 )
...
* LLM: fix the issue that may generate blank html
* resolve some comments
2023-12-11 19:14:57 +08:00
ZehuaCao
45721f3473
verify llava ( #9649 )
2023-12-11 14:26:05 +08:00
Heyang Sun
9f02f96160
[LLM] support for Yi AWQ model ( #9648 )
2023-12-11 14:07:34 +08:00
Xin Qiu
82255f9726
Enable fused layernorm ( #9614 )
...
* bloom layernorm
* fix
* layernorm
* fix
* fix
* fix
* style fix
* fix
* replace nn.LayerNorm
2023-12-11 09:26:13 +08:00
Yuwen Hu
894d0aaf5e
[LLM] iGPU win perf test reorg based on in-out pairs ( #9645 )
...
* trigger pr temporarily
* Separate benchmark run for win igpu based on in-out pairs
* Rename fix
* Test workflow
* Small fix
* Skip generation of html for now
* Change back to nightly triggered
2023-12-08 20:46:40 +08:00
Chen, Zhentao
972cdb9992
gsm8k OOM workaround ( #9597 )
...
* update bigdl_llm.py
* update the installation of harness
* fix partial function
* import ipex
* force seq len in decreasing order
* put func outside class
* move comments
* default 'trust_remote_code' as True
* Update llm-harness-evaluation.yml
2023-12-08 18:47:25 +08:00
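The "seq len in decreasing order" workaround above runs the longest requests first so an out-of-memory error, if any, surfaces at the start of the evaluation instead of hours in. A minimal sketch (ordering plain strings by length stands in for ordering tokenized requests):

```python
def order_for_early_oom(requests):
    """Sketch of the OOM workaround: evaluate longest sequences first
    so memory failures happen immediately, not mid-run."""
    return sorted(requests, key=len, reverse=True)

print(order_for_early_oom(["short", "a much longer prompt here", "medium one"]))
```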
WeiguangHan
1ff4bc43a6
degrade pandas version ( #9643 )
2023-12-08 17:44:51 +08:00
Yina Chen
70f5e7bf0d
Support peft LoraConfig ( #9636 )
...
* support peft loraconfig
* use testcase to test
* fix style
* meet comments
2023-12-08 16:13:03 +08:00
Xin Qiu
0b6f29a7fc
add fused rms norm for Yi and Qwen ( #9640 )
2023-12-08 16:04:38 +08:00
Xin Qiu
5636b0ba80
set new linear status ( #9639 )
2023-12-08 11:02:49 +08:00
binbin Deng
499100daf1
LLM: Add solution to fix oneccl related error ( #9630 )
2023-12-08 10:51:55 +08:00
ZehuaCao
6eca8a8bb5
update transformer version ( #9631 )
2023-12-08 09:36:00 +08:00
WeiguangHan
e9299adb3b
LLM: Highlight some values in the html ( #9635 )
...
* highlight some values in the html
* revert the llm_performance_tests.yml
2023-12-07 19:02:41 +08:00
Yuwen Hu
6f34978b94
[LLM] Add more performance tests for win iGPU (more in-out pairs, RWKV model) ( #9626 )
...
* Add supports for loading rwkv models using from_pretrained api
* Temporarily enable pr tests
* Add RWKV in tests and more in-out pairs
* Add rwkv for 512 tests
* Make iterations smaller
* Change back to nightly trigger
2023-12-07 18:55:16 +08:00
Ruonan Wang
d9b0c01de3
LLM: fix unlora module in qlora finetune ( #9621 )
...
* fix unlora module
* split train and inference
2023-12-07 16:32:02 +08:00
Heyang Sun
3811cf43c9
[LLM] update AWQ documents ( #9623 )
...
* [LLM] update AWQ and verified models' documents
* refine
* refine links
* refine
2023-12-07 16:02:20 +08:00
Yishuo Wang
7319f2c227
use fused mlp in baichuan2 ( #9620 )
2023-12-07 15:50:57 +08:00
Xiangyu Tian
deee65785c
[LLM] vLLM: Delete last_kv_cache before prefilling ( #9619 )
...
Remove last_kv_cache before prefilling to reduce peak memory usage.
2023-12-07 11:32:33 +08:00
Yuwen Hu
48b85593b3
Update all-in-one benchmark readme ( #9618 )
2023-12-07 10:32:09 +08:00
Xiangyu Tian
0327169b50
[LLM] vLLM: fix memory leak in prepare_kv_cache ( #9616 )
...
Revert modification in prepare_kv_cache to fix memory leak.
2023-12-07 10:08:18 +08:00
Xin Qiu
13d47955a8
use fused rms norm in chatglm2 and baichuan ( #9613 )
...
* use fused rms norm in chatglm2 and baichuan
* style fix
2023-12-07 09:21:41 +08:00
Jason Dai
51b668f229
Update GGUF readme ( #9611 )
2023-12-06 18:21:54 +08:00
dingbaorong
a7bc89b3a1
remove q4_1 in gguf example ( #9610 )
...
* remove q4_1
* fixes
2023-12-06 16:00:05 +08:00
Yina Chen
404e101ded
QALora example ( #9551 )
...
* Support qa-lora
* init
* update
* update
* update
* update
* update
* update merge
* update
* fix style & update scripts
* update
* address comments
* fix typo
* fix typo
---------
Co-authored-by: Yang Wang <yang3.wang@intel.com>
2023-12-06 15:36:21 +08:00
Guancheng Fu
6978b2c316
[VLLM] Change padding patterns for vLLM & clean code ( #9609 )
...
* optimize
* fix minor error
* optimizations
* fix style
2023-12-06 15:27:26 +08:00
dingbaorong
89069d6173
Add gpu gguf example ( #9603 )
...
* add gpu gguf example
* some fixes
* address kai's comments
* address json's comments
2023-12-06 15:17:54 +08:00
Yuwen Hu
0e8f4020e5
Add traceback error output for win igpu test api in benchmark ( #9607 )
2023-12-06 14:35:16 +08:00
Ziteng Zhang
aeb77b2ab1
Add minimum Qwen model version ( #9606 )
2023-12-06 11:49:14 +08:00
Yuwen Hu
c998f5f2ba
[LLM] iGPU long context tests ( #9598 )
...
* Temp enable PR
* Enable tests for 256-64
* Try again 128-64
* Empty cache after each iteration for igpu benchmark scripts
* Try tests for 512
* change order for 512
* Skip chatglm3 and llama2 for now
* Separate tests for 512-64
* Small fix
* Further fixes
* Change back to nightly again
2023-12-06 10:19:20 +08:00
Heyang Sun
4e70e33934
[LLM] code and document for distributed qlora ( #9585 )
...
* [LLM] code and document for distributed qlora
* doc
* refine for gradient checkpoint
* refine
* Update alpaca_qlora_finetuning_cpu.py
* Update alpaca_qlora_finetuning_cpu.py
* Update alpaca_qlora_finetuning_cpu.py
* add link in doc
2023-12-06 09:23:17 +08:00
Zheng, Yi
d154b38bf9
Add llama2 gpu low memory example ( #9514 )
...
* Add low memory example
* Minor fixes
* Update readme.md
2023-12-05 17:29:48 +08:00
Jason Dai
06febb5fa7
Update readme for FP8/FP4 inference examples ( #9601 )
2023-12-05 15:59:03 +08:00
dingbaorong
a66fbedd7e
add gpu more data types example ( #9592 )
...
* add gpu more data types example
* add int8
2023-12-05 15:45:38 +08:00
Ziteng Zhang
65934c9f4f
[LLM] Fix Qwen causal_mask and attention_mask size mismatching ( #9600 )
...
* Fix #9582, caused by Qwen's modified modeling_qwen.py 7f62181c94 (d2h-049182)
2023-12-05 15:15:54 +08:00
Jinyi Wan
b721138132
Add cpu and gpu examples for BlueLM ( #9589 )
...
* Add cpu int4 example for BlueLM
* add example optimize_model cpu for bluelm
* add example gpu int4 blueLM
* add example optimize_model GPU for bluelm
* Fixing naming issues and BigDL package version.
* Fixing naming issues...
* Add BlueLM in README.md "Verified Models"
2023-12-05 13:59:02 +08:00
Guancheng Fu
8b00653039
fix doc ( #9599 )
2023-12-05 13:49:31 +08:00
Qiyuan Gong
f211f136b6
Configurable TORCH_LINEAR_THRESHOLD from env ( #9588 )
...
* Add TORCH_LINEAR_THRESHOLD from env (BIGDL_LLM_LINEAR_THRESHOLD)
* Change default to 512
2023-12-05 13:19:47 +08:00
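The entry above makes the linear threshold configurable through `BIGDL_LLM_LINEAR_THRESHOLD`, defaulting to 512. A sketch of reading and applying it; which dimension is compared and whether the bound is inclusive are assumptions here, not the library's verified logic.

```python
import os

# Env-configurable threshold (default 512 per the commit); the
# comparison semantics below are an assumption for illustration.
THRESHOLD = int(os.environ.get("BIGDL_LLM_LINEAR_THRESHOLD", "512"))

def keep_torch_linear(seq_len: int) -> bool:
    # Hypothetical rule: short inputs stay on torch.nn.Linear, longer
    # ones take the optimized low-bit path.
    return seq_len <= THRESHOLD

print(THRESHOLD)
```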
Yuwen Hu
1012507a40
[LLM] Fix performance tests ( #9596 )
...
* Fix missing key for cpu_embedding
* Remove 512 as it stuck for now
* Small fix
2023-12-05 10:59:28 +08:00
Chen, Zhentao
8c8a27ded7
Add harness summary job ( #9457 )
...
* format yml
* add make_table_results
* add summary job
* add a job to print single result
* upload full directory
2023-12-05 10:04:10 +08:00
Yuwen Hu
3f4ad97929
[LLM] Add performance tests for windows iGPU ( #9584 )
...
* Add support for win gpu benchmark with peak gpu memory monitoring
* Add win igpu tests
* Small fix
* Forward outputs
* Small fix
* Test and small fixes
* Small fix
* Small fix and test
* Small fixes
* Add tests for 512-64 and change back to nightly tests
* Small fix
2023-12-04 20:50:02 +08:00
Chen, Zhentao
9557aa9c21
Fix harness nightly ( #9586 )
...
* update golden
* loose the restriction of diff
* only compare results when scheduled
2023-12-04 11:45:00 +08:00
Xiangyu Tian
5c03651309
[LLM] vLLM: Add Preempt for scheduler ( #9568 )
...
Implement Preempt_by_recompute method for vllm.
2023-12-03 20:16:25 +08:00
Chen, Zhentao
cb228c70ea
Add harness nightly ( #9552 )
...
* modify output_path as a directory
* schedule nightly at 21 on Friday
* add tasks and models for nightly
* add accuracy regression
* comment out if to test
* mixed fp4
* for test
* add missing delimiter
* remove comma
* fixed golden results
* add mixed 4 golden result
* add more options
* add mistral results
* get golden result of stable lm
* move nightly scripts and results to test folder
* add license
* add fp8 stable lm golden
* run on all available devices
* trigger only when ready for review
* fix new line
* update golden
* add mistral
2023-12-01 14:16:35 +08:00
Chen, Zhentao
4d7d5d4c59
Add 3 leaderboard tasks ( #9566 )
...
* update leaderboard map
* download model and dataset without overwritten
* fix task drop
* run on all available devices
2023-12-01 14:01:14 +08:00
Wang, Jian4
ed0dc57c6e
LLM: Add cpu qlora support other models guide ( #9567 )
...
* use bf16 flag
* add using baichuan model
* update merge
* remove
* update
2023-12-01 11:18:04 +08:00
Jason Dai
bda404fc8f
Update readme ( #9575 )
2023-11-30 22:45:52 +08:00
Xin Qiu
69c49d21f5
use fused rms norm ( #9572 )
...
* use fused rms norm
* meet code review
2023-11-30 21:47:41 +08:00
Yishuo Wang
66f5b45f57
[LLM] add a llama2 gguf example ( #9553 )
2023-11-30 16:37:17 +08:00
Yishuo Wang
7f6465518a
support loading llama tokenizer from gguf model ( #9565 )
2023-11-30 14:56:12 +08:00
Wang, Jian4
a0a80d232e
LLM: Add qlora cpu distributed readme ( #9561 )
...
* init readme
* add distributed guide
* update
2023-11-30 13:42:30 +08:00
Chen, Zhentao
c8e0c2ed48
Fixed dumped logs in harness ( #9549 )
...
* install transformers==4.34.0
* modify output_path as a directory
* add device and task to output dir parents
2023-11-30 12:47:56 +08:00
Qiyuan Gong
d85a430a8c
Using bigdl-llm-init instead of bigdl-nano-init ( #9558 )
...
* Replace `bigdl-nano-init` with `bigdl-llm-init`.
* Install `bigdl-llm` instead of `bigdl-nano`.
* Remove nano in README.
2023-11-30 10:10:29 +08:00
Yuwen Hu
34503efa6a
Fix cpu pinned embedding ( #9556 )
2023-11-29 18:27:56 +08:00
binbin Deng
4ff2ca9d0d
LLM: fix loss error on Arc ( #9550 )
2023-11-29 15:16:18 +08:00
Yishuo Wang
65121c7997
support loading q4_1/q5_0/q5_1/q8_0 gguf model ( #9546 )
2023-11-29 14:40:37 +08:00
Wang, Jian4
b824754256
LLM: Update for cpu qlora mpirun ( #9548 )
2023-11-29 10:56:17 +08:00
Yuwen Hu
5f5ca38b74
[LLM Doc] Fix api doc rendering error ( #9542 )
...
* Fix api rendering error
* Fix python style
2023-11-29 09:17:09 +08:00
Yishuo Wang
a86c6e0b56
[LLM] support loading gguf model ( #9544 )
2023-11-28 15:51:15 +08:00
Xiangyu Tian
916c338772
fix bugs in vllm length check ( #9543 )
2023-11-28 11:09:54 +08:00
WeiguangHan
5098bc3544
LLM: enable previous models ( #9505 )
...
* enable previous models
* test mistral model
* for test
* run models separately
* test all models
* for test
* revert the llm_performance_test.yaml
2023-11-28 10:21:07 +08:00
Zhao Changmin
e7e0cd3b5e
CPU Pinned embedding Layer ( #9538 )
...
* CPU Pinned embedding
2023-11-28 09:46:31 +08:00
Guancheng Fu
963a5c8d79
Add vLLM-XPU version's README/examples ( #9536 )
...
* test
* test
* fix last kv cache
* add xpu readme
* remove numactl for xpu example
* fix link error
* update max_num_batched_tokens logic
* add explanation
* add xpu environment version requirement
* refine gpu memory
* fix
* fix style
2023-11-28 09:44:03 +08:00
Guancheng Fu
b6c3520748
Remove xformers from vLLM-CPU ( #9535 )
2023-11-27 11:21:25 +08:00
binbin Deng
2b9c7d2a59
LLM: quick fix alpaca qlora finetuning script ( #9534 )
2023-11-27 11:04:27 +08:00
Yuwen Hu
11fa3de290
Add sutup support of win gpu for bigdl-llm ( #9512 )
2023-11-24 17:49:21 +08:00
Chen, Zhentao
45820cf3b9
add optimize model option ( #9530 )
2023-11-24 17:10:49 +08:00
binbin Deng
6bec0faea5
LLM: support Mistral AWQ models ( #9520 )
2023-11-24 16:20:22 +08:00
Ruonan Wang
914a5a5a27
LLM: fix abnormal Mistral GPU accuracy by updating rms_norm ( #9529 )
2023-11-24 15:37:50 +08:00
SONG Ge
3d24823cda
hot-fix mistral kv_cache ( #9528 )
2023-11-24 14:33:04 +08:00
Zhao Changmin
42b7a16bc5
Replace torch.bmm with safe_bmm ( #9519 )
...
* replace bmm with safe one
* rename args and add deprecation warning
2023-11-24 12:16:48 +08:00
Jason Dai
b3178d449f
Update README.md ( #9525 )
2023-11-23 21:45:20 +08:00
Jason Dai
82898a4203
Update GPU example README ( #9524 )
2023-11-23 21:20:26 +08:00
Jason Dai
064848028f
Update README.md ( #9523 )
2023-11-23 21:16:21 +08:00
Ruonan Wang
b63aae8a8e
LLM: add flash attention support for llama ( #9518 )
...
* add initial flash attention for llama
* accelerate fp32 first token by changing to fp16 in advance
* support fp32
2023-11-23 18:40:18 +08:00
Guancheng Fu
bf579507c2
Integrate vllm ( #9310 )
...
* done
* Rename structure
* add models
* Add structure/sampling_params,sequence
* add input_metadata
* add outputs
* Add policy,logger
* add and update
* add parallelconfig back
* core/scheduler.py
* Add llm_engine.py
* Add async_llm_engine.py
* Add tested entrypoint
* fix minor error
* Fix everything
* fix kv cache view
* fix
* fix
* fix
* format&refine
* remove logger from repo
* try to add token latency
* remove logger
* Refine config.py
* finish worker.py
* delete utils.py
* add license
* refine
* refine sequence.py
* remove sampling_params.py
* finish
* add license
* format
* add license
* refine
* refine
* Refine line too long
* remove exception
* so dumb style-check
* refine
* refine
* refine
* refine
* refine
* refine
* add README
* refine README
* add warning instead error
* fix padding
* add license
* format
* format
* format fix
* Refine vllm dependency (#1 )
vllm dependency clear
* fix licence
* fix format
* fix format
* fix
* adapt LLM engine
* fix
* add license
* fix format
* fix
* Moving README.md to the correct position
* Fix readme.md
* done
* guide for adding models
* fix
* Fix README.md
* Add new model readme
* remove ray-logic
* refactor arg_utils.py
* remove distributed_init_method logic
* refactor entrypoints
* refactor input_metadata
* refactor model_loader
* refactor utils.py
* refactor models
* fix api server
* remove vllm.stucture
* revert by txy 1120
* remove utils
* format
* fix license
* add bigdl model
* Refer to a specific commit

* Change code base
* add comments
* add async_llm_engine comment
* refine
* formatted
* add worker comments
* add comments
* add comments
* fix style
* add changes
---------
Co-authored-by: xiangyuT <xiangyu.tian@intel.com>
Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2023-11-23 16:46:45 +08:00
Heyang Sun
48fbb1eb94
support ccl (MPI) distributed mode in alpaca_qlora_finetuning_cpu ( #9507 )
2023-11-23 10:58:09 +08:00
Qiyuan Gong
0f0c6bb631
[LLM] Fix Qwen registered_causal_mask is None ( #9513 )
...
* Add registered_causal_mask init based on 2abd8e5777.
2023-11-23 09:28:04 +08:00
Heyang Sun
11fa5a8a0e
Fix QLoRA CPU dispatch_model issue about accelerate ( #9506 )
2023-11-23 08:41:25 +08:00
Heyang Sun
1453046938
install bigdl-llm in deepspeed cpu inference example ( #9508 )
2023-11-23 08:39:21 +08:00
binbin Deng
86743fb57b
LLM: fix transformers version in CPU finetuning example ( #9511 )
2023-11-22 15:53:07 +08:00
binbin Deng
1a2129221d
LLM: support resume from checkpoint in Alpaca QLoRA ( #9502 )
2023-11-22 13:49:14 +08:00
Ruonan Wang
139e98aa18
LLM: quick fix benchmark ( #9509 )
2023-11-22 10:19:57 +08:00
WeiguangHan
c2aeb4d1e8
del model after test ( #9504 )
2023-11-21 18:41:50 +08:00
Ruonan Wang
076d106ef5
LLM: GPU QLoRA update to bf16 to accelerate gradient checkpointing ( #9499 )
...
* update to bf16 to accelerate gradient checkpoint
* add utils and fix ut
2023-11-21 17:08:36 +08:00
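The bf16 gradient-checkpointing commit above trades recomputation for memory and halves the recompute/storage cost versus fp32. A hedged, CPU-runnable sketch of the idea (a toy layer standing in for the real QLoRA model; not the bigdl-llm code itself):

```python
import torch
import torch.utils.checkpoint

# Toy stand-in for a transformer block, cast to bfloat16.
layer = torch.nn.Linear(16, 16).to(torch.bfloat16)
x = torch.randn(4, 16, dtype=torch.bfloat16, requires_grad=True)

# checkpoint() discards activations in forward and recomputes them in
# backward; doing the recompute in bf16 is what accelerates it here.
y = torch.utils.checkpoint.checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
```

With a real model, the equivalent switch is loading weights in bf16 and enabling `model.gradient_checkpointing_enable()` before training.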
Cheen Hau, 俊豪
3e39828420
Update all in one benchmark readme ( #9496 )
...
* Add gperftools install to all in one benchmark readme
* Update readme
2023-11-21 14:57:16 +08:00
binbin Deng
b7ae572ac3
LLM: update Alpaca QLoRA finetuning example on GPU ( #9492 )
2023-11-21 14:22:19 +08:00
Wang, Jian4
c5cb3ab82e
LLM: Add CPU alpaca qlora example ( #9469 )
...
* init
* update xpu to cpu
* update
* update readme
* update example
* update
* add refer
* add guide to train different datasets
* update readme
* update
2023-11-21 09:19:58 +08:00
binbin Deng
96fd26759c
LLM: fix QLoRA finetuning example on CPU ( #9489 )
2023-11-20 14:31:24 +08:00
Xin Qiu
50b01058f1
enable new q4_1 ( #9479 )
2023-11-17 14:58:57 +08:00
binbin Deng
3dac21ac7b
LLM: add more example usages about alpaca qlora on different hardware ( #9458 )
2023-11-17 09:56:43 +08:00
Heyang Sun
921b263d6a
update deepspeed install and run guide in README ( #9441 )
2023-11-17 09:11:39 +08:00
Zhao Changmin
30abd304a7
LLM: Fix baichuan pre-normalize model tensor assigning issue when loading ( #9481 )
...
* No need to normalize when loading
2023-11-16 21:57:28 +08:00
WeiguangHan
bc06bec90e
LLM: modify the script to generate html results more accurately ( #9445 )
...
* modify the script to generate html results more accurately
* resolve some comments
* revert some codes
2023-11-16 19:50:23 +08:00
Ruonan Wang
c0ef70df02
llm: quick fix of fast_rms_norm ( #9480 )
2023-11-16 14:42:16 +08:00
Yina Chen
d5263e6681
Add awq load support ( #9453 )
...
* Support directly loading GPTQ models from huggingface
* fix style
* fix tests
* change example structure
* address comments
* fix style
* init
* address comments
* add examples
* fix style
* fix style
* fix style
* fix style
* update
* remove
* meet comments
* fix style
---------
Co-authored-by: Yang Wang <yang3.wang@intel.com>
2023-11-16 14:06:25 +08:00
Ruonan Wang
d2c064124a
LLM: update rms related usage to support ipex 2.1 new api ( #9466 )
...
* update rms related usage
* fix style
2023-11-16 11:21:50 +08:00
Yuwen Hu
731b0aaade
Empty cache after embedding to cpu ( #9477 )
2023-11-16 10:52:30 +08:00
WeiguangHan
c487b53f21
LLM: only run arc perf test nightly ( #9448 )
...
* LLM: only run arc perf test nightly
* deleted unused python scripts
* rebase main
2023-11-15 19:38:14 +08:00
WeiguangHan
0d55bbd9f1
LLM: adjust the order of some models ( #9470 )
2023-11-15 17:04:59 +08:00
Xin Qiu
170e0072af
chatglm2 correctness test ( #9450 )
...
* chatglm2 ut
* some update
* chatglm2 path
* fix
* add print
2023-11-15 15:44:56 +08:00