824 commits

Author SHA1 Message Date
Yishuo Wang
f82782cd3b fix starcoder (#9975) 2024-01-23 17:24:53 +08:00
WeiguangHan
be5836bee1 LLM: fix outlier value (#9945)
* fix outlier value

* small fix
2024-01-23 17:04:13 +08:00
Yishuo Wang
2c8a9aaf0d fix qwen causal mask when quantize_kv_cache=True (#9968) 2024-01-23 16:34:05 +08:00
Yina Chen
5aa4b32c1b LLM: Add qwen spec gpu example (#9965)
* add qwen spec gpu example

* update readme

---------

Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-01-23 15:59:43 +08:00
Yina Chen
36c665667d Add logits processor & qwen eos stop in speculative decoding (#9963)
* add logits processor & qwen eos

* fix style

* fix

* fix

* fix style

* fix style

* support transformers 4.31

* fix style

* fix style

---------

Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-01-23 15:57:28 +08:00
Ruonan Wang
60b35db1f1 LLM: add chatglm3 speculative decoding example (#9966)
* add chatglm3 example

* update

* fix
2024-01-23 15:54:12 +08:00
Xin Qiu
da4687c917 fix fp16 (#9970) 2024-01-23 15:53:32 +08:00
Chen, Zhentao
301425e377 harness tests on pvc multiple xpus (#9908)
* add run_multi_llb.py

* update readme

* add job hint
2024-01-23 13:20:37 +08:00
Ruonan Wang
27b19106f3 LLM: add readme for speculative decoding gpu examples (#9961)
* add readme

* add readme

* meet code review
2024-01-23 12:54:19 +08:00
Chen, Zhentao
39219b7e9a add default device meta when lcmu enabled (#9941) 2024-01-23 11:00:49 +08:00
Xin Qiu
dacf680294 add fused rotary pos emb for qwen (#9956)
* add fused rotary pos emb for qwen

* update
2024-01-23 10:37:56 +08:00
Ruonan Wang
7b1d9ad7c0 LLM: limit esimd sdp usage for k_len < 8 (#9959)
* update

* fix
2024-01-23 09:28:23 +08:00
Ruonan Wang
3e601f9a5d LLM: Support speculative decoding in bigdl-llm (#9951)
* first commit

* fix error, add llama example

* hidden print

* update api usage

* change to api v3

* update

* meet code review

* meet code review, fix style

* add reference, fix style

* fix style

* fix first token time
2024-01-22 19:14:56 +08:00
Cheen Hau, 俊豪
947b1e27b7 Add readme for Whisper Test (#9944)
* Fix local data path

* Remove non-essential files

* Add readme

* Minor fixes to script

* Bugfix, refactor

* Add references to original source. Bugfixes.

* Reviewer comments

* Properly print and explain output

* Move files to dev/benchmark

* Fixes
2024-01-22 15:11:33 +08:00
Xin Qiu
6fb3f40f7e fix error for benchmark_util.py running on cpu (#9949) 2024-01-22 10:14:40 +08:00
Heyang Sun
fb91c97fe8 support for Baichuan/Baichuan2 13B Chat running speculative decoding (#9921)
* support for Baichuan/Baichuan2 13B Chat running speculative decoding

* fix style
2024-01-22 09:11:44 +08:00
Xin Qiu
97f0cd8975 optimize Decilm 7b (#9922)
* optimize deci

* update

* decilm attention forward
2024-01-19 17:31:13 +08:00
Wang, Jian4
bcaeb05272 Update optimize qwen (#9943)
* update for n tokens input

* fix dtype

* update
2024-01-19 16:54:59 +08:00
binbin Deng
db8e90796a LLM: add avg token latency information and benchmark guide of autotp (#9940) 2024-01-19 15:09:57 +08:00
Ruonan Wang
bf37b3a670 LLM: optimize CPU speculative decoding of chatglm3 (#9928)
* update

* fix style

* meet code review
2024-01-19 14:10:22 +08:00
Shaojun Liu
967714bac8 gguf memory optimization for mixtral (#9939) 2024-01-19 11:13:15 +08:00
Xin Qiu
610b5226be move reserved memory to benchmark_utils.py (#9907)
* move reserved memory to benchmark_utils.py

* meet code review
2024-01-19 09:44:30 +08:00
Lilac09
7032a2ad73 Optimize gguf load memory for mistral (#9923)
* optimize gguf load for mistral

* fix output of gguf mistral

* reset
2024-01-19 09:14:39 +08:00
Shaojun Liu
9a46f019d7 gguf memory optimization for baichuan (#9937) 2024-01-19 09:11:02 +08:00
Guancheng Fu
2e1448f08e [Serving] Add vllm_worker to fastchat serving framework (#9934)
* add worker

* finish

* finish

* add license

* add more comments
2024-01-18 21:33:36 +08:00
Chen, Zhentao
a8c866c32b add ppl benchmark (#9914)
* add ppl benchmark

* add license

* add readme

* add dataset argument

* add dataset usage

* fixed low bit args

* correct result

* fix terminal display

* fix ppl update

* enable fp16 fp32 bf16

* format the desc

* fix model_kwargs

* add more readme
2024-01-18 17:54:28 +08:00
WeiguangHan
100e0a87e5 LLM: add compressed chatglm3 model (#9892)
* LLM: add compressed chatglm3 model

* small fix

* revert github action
2024-01-18 17:48:15 +08:00
Yuwen Hu
9e2ac5291b Add rwkv v4 back for igpu perf test 32-512 (#9938) 2024-01-18 17:15:28 +08:00
Yishuo Wang
7bbb98abb6 Disable fused layer norm when using XMX to fix mpt UT (#9933) 2024-01-18 16:22:12 +08:00
Wang, Jian4
1fc9dfa265 LLM: Update for Qwen n tokens inputs (#9931)
* update for n tokens inputs

* update style

* update
2024-01-18 15:56:29 +08:00
Heyang Sun
5184f400f9 Fix Mixtral GGUF Wrong Output Issue (#9930)
* Fix Mixtral GGUF Wrong Output Issue

* fix style

* fix style
2024-01-18 14:11:27 +08:00
Yishuo Wang
453df868c9 add rwkv v5 attention kernel (#9927) 2024-01-18 10:16:29 +08:00
Ruonan Wang
054952f82f LLM: Fix rope of chatglm3 to support speculative decoding on CPU (#9926) 2024-01-18 09:28:10 +08:00
Ziteng Zhang
18cd1f1432 [LLM]Solve the problem of calling bmm operator in BF16Linear (#9924)
* Solve the problem of calling bmm operator in BF16Linear
2024-01-17 18:08:35 +08:00
Yina Chen
98b86f83d4 Support fast rope for training (#9745)
* init

* init

* fix style

* add test and fix

* address comment

* update

* merge upstream main
2024-01-17 15:51:38 +08:00
Yuwen Hu
0c498a7b64 Add llama2-13b to igpu perf test (#9920) 2024-01-17 14:58:45 +08:00
Ruonan Wang
b059a32fff LLM: add benchmark api for bigdl-llm fp16 on GPU (#9919)
* add bmk for bigdl fp16

* fix
2024-01-17 14:24:35 +08:00
Ruonan Wang
427f75000b LLM: fix sdp of chatglm3 (#9917)
* fix

* fix

* fix
2024-01-17 13:37:28 +08:00
Yishuo Wang
94767da7cf optimize rwkv v4 first token performance (#9912) 2024-01-17 09:27:41 +08:00
Cengguang Zhang
511cbcf773 LLM: add Ceval benchmark test. (#9872)
* init ceval benchmark test.

* upload dataset.

* add other tests.

* add qwen evaluator.

* fix qwen evaluator style.

* fix qwen evaluator style.

* update qwen evaluator.

* add llama evaluator.

* update eval

* fix typo.

* fix

* fix typo.

* fix llama evaluator.

* fix bug.

* fix style.

* delete dataset.

* fix style.

* fix style.

* add README.md and fix typo.

* fix comments.

* remove run scripts
2024-01-16 19:14:26 +08:00
Shaojun Liu
b909c5c9c2 GGUF load memory optimization (#9913)
* block-wise

* convert linear for module

* revert

* Fix PEP8 checks Error
2024-01-16 18:54:39 +08:00
Yuwen Hu
8643b62521 [LLM] Support longer context in iGPU perf tests (2048-256) (#9910) 2024-01-16 17:48:37 +08:00
Xin Qiu
dee32f7d15 copy fused rms norm's result to avoid <unk> (#9909) 2024-01-16 16:54:08 +08:00
Ruonan Wang
8d7326ae03 LLM: fix chatglm3 sdp to support speculative decoding (#9900)
* fix chatglm3

* fix

* update

* meet code review

* fix
2024-01-16 11:29:13 +08:00
Guancheng Fu
9f34da7cdb Update PVC XMX condition (#9901)
* update pvc xmx condition

* update condition

* update condition
2024-01-15 15:42:15 +08:00
Yishuo Wang
6637860ddf change xmx condition (#9896) 2024-01-12 19:51:48 +08:00
WeiguangHan
0e69bfe6b0 LLM: fix the performance drop of starcoder (#9889)
* LLM: fix the performance drop of starcoder

* small fix

* small fix
2024-01-12 09:14:15 +08:00
Ruonan Wang
d9cf55bce9 LLM: fix MLP check of mixtral (#9891) 2024-01-11 18:01:59 +08:00
Ziteng Zhang
4f4ce73f31 [LLM] Add transformer_autocast_bf16 into all-in-one (#9890)
* Add transformer_autocast_bf16 into all-in-one
2024-01-11 17:51:07 +08:00
Ziteng Zhang
4af88a67b9 support chatglm3 with bf16 (#9888)
* support chatglm3 with bigdl-bf16
2024-01-11 16:45:21 +08:00
Yuwen Hu
0aef35a965 [LLM] Improve LLM doc regarding windows gpu related info (#9880)
* Improve runtime configuration for windows

* Add python 310/311 support for wheel downloading

* Add troubleshooting for windows gpu

* Remove manual ipex import due to auto importer

* Add info regarding cpu_embedding=True on iGPU

* More info for Windows users

* Small updates to API docs

* Python style fix

* Remove tip for loading from saved optimize_model for now

* Updated based on comments

* Update win info for multi-intel gpus selection

* Small fix

* Small fix
2024-01-11 14:37:16 +08:00
Jinyi Wan
07485eff5a Add SOLAR-10.7B to README (#9869) 2024-01-11 14:28:41 +08:00
WeiguangHan
33fd1f9c76 LLM: fix input length logic for run_transformer_int4_gpu (#9864)
* LLM: fix input length logic for run_transformer_int4_gpu

* small fix

* small fix

* small fix
2024-01-10 18:20:14 +08:00
Ruonan Wang
53531ae4ee LLM: support qkv fusion for fp8e5 (#9878)
* update

* add mistral

* meet code review
2024-01-10 17:50:00 +08:00
Lilac09
cb32b985ec add mistral and chatglm support to vllm (#9879)
* add mistral and chatglm support to vllm

* add mistral and chatglm support to vllm
2024-01-10 15:38:42 +08:00
ZehuaCao
e76d984164 [LLM] Support llm-awq vicuna-7b-1.5 on arc (#9874)
* support llm-awq vicuna-7b-1.5 on arc

* support llm-awq vicuna-7b-1.5 on arc
2024-01-10 14:28:39 +08:00
Ruonan Wang
3e05c9e11b LLM: update esimd sdp kernel (#9871) 2024-01-09 18:10:01 +08:00
Yuwen Hu
023679459e [LLM] Small fixes for finetune related examples and UTs (#9870) 2024-01-09 18:05:03 +08:00
Cheen Hau, 俊豪
b2aa267f50 Enhance LLM GPU installation document (#9828)
* Improve gpu install doc

* Add troubleshooting - setvars.sh not done properly.

* Further improvements

* 2024.x.x -> 2024.0

* Fixes

* Fix Install BigDL-LLM From Wheel : bigdl-llm[xpu_2.0]

* Remove "export USE_XETLA=OFF" for Max GPU
2024-01-09 16:30:50 +08:00
Yuwen Hu
23fc888abe Update llm gpu xpu default related info to PyTorch 2.1 (#9866) 2024-01-09 15:38:47 +08:00
Yishuo Wang
36496d60ac only use quantize kv cache on MTL (#9862) 2024-01-09 13:24:02 +08:00
ZehuaCao
146076bdb5 Support llm-awq backend (#9856)
* Support for LLM-AWQ Backend

* fix

* Update README.md

* Add awqconfig

* modify init

* update

* support llm-awq

* fix style

* fix style

* update

* fix AwqBackendPackingMethod not found error

* fix style

* update README

* fix style

---------

Co-authored-by: Uxito-Ada <414416158@qq.com>
Co-authored-by: Heyang Sun <60865256+Uxito-Ada@users.noreply.github.com>
Co-authored-by: cyita <yitastudy@gmail.com>
2024-01-09 13:07:32 +08:00
Ruonan Wang
fea6f16057 LLM: add mlp fusion for fp8e5 and update related check (#9860)
* update mlp fusion

* fix style

* update
2024-01-09 09:56:32 +08:00
binbin Deng
294fd32787 LLM: update DeepSpeed AutoTP example with GPU memory optimization (#9823) 2024-01-09 09:22:49 +08:00
Yuwen Hu
5ba1dc38d4 [LLM] Change default Linux GPU install option to PyTorch 2.1 (#9858)
* Update default xpu to ipex 2.1

* Update related install ut support correspondingly

* Add arc ut tests for both ipex 2.0 and 2.1

* Small fix

* Disable ipex 2.1 test for now as oneapi 2024.0 has not been installed on the test machine

* Update document for default PyTorch 2.1

* Small fix

* Small fix

* Small doc fixes

* Small fixes
2024-01-08 17:16:17 +08:00
Mingyu Wei
ed81baa35e LLM: Use default typing-extension in LangChain examples (#9857)
* remove typing extension downgrade in readme; minor fixes of code

* fix typos in README

* change default question of docqa.py
2024-01-08 16:50:55 +08:00
Jiao Wang
3b6372ab12 Fix Llama transformers 4.36 support (#9852)
* support 4.36

* style

* update

* update

* update

* fix merge

* update
2024-01-08 00:32:23 -08:00
Chen, Zhentao
1b585b0d40 set fp8 default as e5m2 (#9859) 2024-01-08 15:53:57 +08:00
Ruonan Wang
dc995006cc LLM: add flash attention for mistral / mixtral (#9846)
* add flash attention for mistral

* update

* add flash attn for mixtral

* fix style
2024-01-08 09:51:34 +08:00
Yishuo Wang
afaa871144 [LLM] support quantize kv cache to fp8 (#9812) 2024-01-08 09:28:20 +08:00
Jiao Wang
248ae7fad2 LLama optimize_model to support transformers 4.36 (#9818)
* support 4.36

* style

* update

* update

* update
2024-01-05 11:30:18 -08:00
Ruonan Wang
a60bda3324 LLM: update check for deepspeed (#9838) 2024-01-05 16:44:10 +08:00
Ruonan Wang
16433dd959 LLM: fix first token judgement of flash attention (#9841)
* fix flash attention

* meet code review

* fix
2024-01-05 13:49:37 +08:00
Yina Chen
f919f5792a fix kv cache out of bound (#9827) 2024-01-05 12:38:57 +08:00
Ruonan Wang
5df31db773 LLM: fix accuracy issue of chatglm3 (#9830)
* add attn mask for first token

* fix

* fix

* change attn calculation

* fix

* fix

* fix style

* fix style
2024-01-05 10:52:05 +08:00
Jinyi Wan
3147ebe63d Add cpu and gpu examples for SOLAR-10.7B (#9821) 2024-01-05 09:50:28 +08:00
WeiguangHan
ad6b182916 LLM: change the color of peak diff (#9836) 2024-01-04 19:30:32 +08:00
Xiangyu Tian
38c05be1c0 [LLM] Fix dtype mismatch in Baichuan2-13b (#9834) 2024-01-04 15:34:42 +08:00
Ruonan Wang
8504a2bbca LLM: update qlora alpaca example to change lora usage (#9835)
* update example

* fix style
2024-01-04 15:22:20 +08:00
Ziteng Zhang
05b681fa85 [LLM] IPEX auto importer set on by default (#9832)
* Set BIGDL_IMPORT_IPEX default to True

* Remove import intel_extension_for_pytorch as ipex from GPU example
2024-01-04 13:33:29 +08:00
Wang, Jian4
4ceefc9b18 LLM: Support bitsandbytes config on qlora finetune (#9715)
* test support bitsandbytesconfig

* update style

* update cpu example

* update example

* update readme

* update unit test

* use bfloat16

* update logic

* use int4

* set default bnb_4bit_use_double_quant

* update

* update example

* update model.py

* update

* support lora example
2024-01-04 11:23:16 +08:00
WeiguangHan
9a14465560 LLM: add peak diff (#9789)
* add peak diff

* small fix

* revert yml file
2024-01-03 18:18:19 +08:00
Mingyu Wei
f4eb5da42d disable arc ut (#9825) 2024-01-03 18:10:34 +08:00
Ruonan Wang
20e9742fa0 LLM: fix chatglm3 issue (#9820)
* fix chatglm3 issue

* small update
2024-01-03 16:15:55 +08:00
Wang, Jian4
a54cd767b1 LLM: Add gguf falcon (#9801)
* init falcon

* update convert.py

* update style
2024-01-03 14:49:02 +08:00
Yuwen Hu
668c2095b1 Remove unnecessary warning when installing llm (#9815) 2024-01-03 10:30:05 +08:00
dingbaorong
f5752ead36 Add whisper test (#9808)
* add whisper benchmark code

* add librispeech_asr.py

* add bigdl license
2024-01-02 16:36:05 +08:00
binbin Deng
6584539c91 LLM: fix installation of codellama (#9813) 2024-01-02 14:32:50 +08:00
Kai Huang
4d01069302 Temp remove baichuan2-13b 1k from arc perf test (#9810) 2023-12-29 12:54:13 +08:00
dingbaorong
a2e668a61d fix arc ut test (#9736) 2023-12-28 16:55:34 +08:00
Qiyuan Gong
f0f9d45eac [LLM] IPEX import support bigdl-core-xe-21 (#9769)
Add support for bigdl-core-xe-21.
2023-12-28 15:23:58 +08:00
dingbaorong
a8baf68865 fix csv_to_html (#9802) 2023-12-28 14:58:51 +08:00
Guancheng Fu
5857a38321 [vLLM] Add option to adjust KV_CACHE_ALLOC_BLOCK_LENGTH (#9782)
* add option kv_cache_block

* change var name
2023-12-28 14:41:47 +08:00
Ruonan Wang
99bddd3ab4 LLM: better FP16 support for Intel GPUs (#9791)
* initial support

* fix

* fix style

* fix

* limit esimd usage condition

* refactor code

* fix style

* small fix

* meet code review

* small fix
2023-12-28 13:30:13 +08:00
Yishuo Wang
7d9f6c6efc fix cpuinfo error (#9793) 2023-12-28 09:23:44 +08:00
Wang, Jian4
7ed9538b9f LLM: support gguf mpt (#9773)
* add gguf mpt

* update
2023-12-28 09:22:39 +08:00
Cengguang Zhang
d299f108d0 update falcon attention forward. (#9796) 2023-12-28 09:11:59 +08:00
Shaojun Liu
a5e5c3daec set warm_up: 3 num_trials: 50 for cpu stress test (#9799) 2023-12-28 08:55:43 +08:00
dingbaorong
f6bb4ab313 Arc stress test (#9795)
* add arc stress test

* trigger ci

* trigger CI

* trigger ci

* disable ci
2023-12-27 21:02:41 +08:00
Kai Huang
40eaf76ae3 Add baichuan2-13b to Arc perf (#9794)
* add baichuan2-13b

* fix indent

* revert
2023-12-27 19:38:53 +08:00