dingbaorong
89069d6173
Add gpu gguf example ( #9603 )
...
* add gpu gguf example
* some fixes
* address kai's comments
* address json's comments
2023-12-06 15:17:54 +08:00
Yuwen Hu
0e8f4020e5
Add traceback error output for win igpu test api in benchmark ( #9607 )
2023-12-06 14:35:16 +08:00
Ziteng Zhang
aeb77b2ab1
Add minimum Qwen model version ( #9606 )
2023-12-06 11:49:14 +08:00
Yuwen Hu
c998f5f2ba
[LLM] iGPU long context tests ( #9598 )
...
* Temp enable PR
* Enable tests for 256-64
* Try again 128-64
* Empty cache after each iteration for igpu benchmark scripts
* Try tests for 512
* change order for 512
* Skip chatglm3 and llama2 for now
* Separate tests for 512-64
* Small fix
* Further fixes
* Change back to nightly again
2023-12-06 10:19:20 +08:00
Heyang Sun
4e70e33934
[LLM] code and document for distributed qlora ( #9585 )
...
* [LLM] code and document for distributed qlora
* doc
* refine for gradient checkpoint
* refine
* Update alpaca_qlora_finetuning_cpu.py
* Update alpaca_qlora_finetuning_cpu.py
* Update alpaca_qlora_finetuning_cpu.py
* add link in doc
2023-12-06 09:23:17 +08:00
Zheng, Yi
d154b38bf9
Add llama2 gpu low memory example ( #9514 )
...
* Add low memory example
* Minor fixes
* Update readme.md
2023-12-05 17:29:48 +08:00
Jason Dai
06febb5fa7
Update readme for FP8/FP4 inference examples ( #9601 )
2023-12-05 15:59:03 +08:00
dingbaorong
a66fbedd7e
add gpu more data types example ( #9592 )
...
* add gpu more data types example
* add int8
2023-12-05 15:45:38 +08:00
Ziteng Zhang
65934c9f4f
[LLM] Fix Qwen causal_mask and attention_mask size mismatching ( #9600 )
...
* Fix #9582, caused by Qwen modifying modeling_qwen.py (7f62181c94)
2023-12-05 15:15:54 +08:00
Jinyi Wan
b721138132
Add cpu and gpu examples for BlueLM ( #9589 )
...
* Add cpu int4 example for BlueLM
* add example optimize_model cpu for bluelm
* add example gpu int4 blueLM
* add example optimize_model GPU for bluelm
* Fixing naming issues and BigDL package version.
* Fixing naming issues...
* Add BlueLM in README.md "Verified Models"
2023-12-05 13:59:02 +08:00
Guancheng Fu
8b00653039
fix doc ( #9599 )
2023-12-05 13:49:31 +08:00
Qiyuan Gong
f211f136b6
Configurable TORCH_LINEAR_THRESHOLD from env ( #9588 )
...
* Add TORCH_LINEAR_THRESHOLD from env (BIGDL_LLM_LINEAR_THRESHOLD)
* Change default to 512
2023-12-05 13:19:47 +08:00
Yuwen Hu
1012507a40
[LLM] Fix performance tests ( #9596 )
...
* Fix missing key for cpu_embedding
* Remove 512 as it is stuck for now
* Small fix
2023-12-05 10:59:28 +08:00
Chen, Zhentao
8c8a27ded7
Add harness summary job ( #9457 )
...
* format yml
* add make_table_results
* add summary job
* add a job to print single result
* upload full directory
2023-12-05 10:04:10 +08:00
Yuwen Hu
3f4ad97929
[LLM] Add performance tests for windows iGPU ( #9584 )
...
* Add support for win gpu benchmark with peak gpu memory monitoring
* Add win igpu tests
* Small fix
* Forward outputs
* Small fix
* Test and small fixes
* Small fix
* Small fix and test
* Small fixes
* Add tests for 512-64 and change back to nightly tests
* Small fix
2023-12-04 20:50:02 +08:00
Chen, Zhentao
9557aa9c21
Fix harness nightly ( #9586 )
...
* update golden
* loosen the restriction of diff
* only compare results when scheduled
2023-12-04 11:45:00 +08:00
Xiangyu Tian
5c03651309
[LLM] vLLM: Add Preempt for scheduler ( #9568 )
...
Implement Preempt_by_recompute method for vllm.
2023-12-03 20:16:25 +08:00
Chen, Zhentao
cb228c70ea
Add harness nightly ( #9552 )
...
* modify output_path as a directory
* schedule nightly at 21 on Friday
* add tasks and models for nightly
* add accuracy regression
* comment out if to test
* mixed fp4
* for test
* add missing delimiter
* remove comma
* fixed golden results
* add mixed 4 golden result
* add more options
* add mistral results
* get golden result of stable lm
* move nightly scripts and results to test folder
* add license
* add fp8 stable lm golden
* run on all available devices
* trigger only when ready for review
* fix new line
* update golden
* add mistral
2023-12-01 14:16:35 +08:00
Chen, Zhentao
4d7d5d4c59
Add 3 leaderboard tasks ( #9566 )
...
* update leaderboard map
* download model and dataset without overwriting
* fix task drop
* run on all available devices
2023-12-01 14:01:14 +08:00
Wang, Jian4
ed0dc57c6e
LLM: Add cpu qlora support other models guide ( #9567 )
...
* use bf16 flag
* add using baichuan model
* update merge
* remove
* update
2023-12-01 11:18:04 +08:00
Jason Dai
bda404fc8f
Update readme ( #9575 )
2023-11-30 22:45:52 +08:00
Xin Qiu
69c49d21f5
use fused rms norm ( #9572 )
...
* use fused rms norm
* meet code review
2023-11-30 21:47:41 +08:00
Yishuo Wang
66f5b45f57
[LLM] add a llama2 gguf example ( #9553 )
2023-11-30 16:37:17 +08:00
Yishuo Wang
7f6465518a
support loading llama tokenizer from gguf model ( #9565 )
2023-11-30 14:56:12 +08:00
Wang, Jian4
a0a80d232e
LLM: Add qlora cpu distributed readme ( #9561 )
...
* init readme
* add distributed guide
* update
2023-11-30 13:42:30 +08:00
Chen, Zhentao
c8e0c2ed48
Fixed dumped logs in harness ( #9549 )
...
* install transformers==4.34.0
* modify output_path as a directory
* add device and task to output dir parents
2023-11-30 12:47:56 +08:00
Qiyuan Gong
d85a430a8c
Using bigdl-llm-init instead of bigdl-nano-init ( #9558 )
...
* Replace `bigdl-nano-init` with `bigdl-llm-init`.
* Install `bigdl-llm` instead of `bigdl-nano`.
* Remove nano in README.
2023-11-30 10:10:29 +08:00
Yuwen Hu
34503efa6a
Fix cpu pinned embedding ( #9556 )
2023-11-29 18:27:56 +08:00
binbin Deng
4ff2ca9d0d
LLM: fix loss error on Arc ( #9550 )
2023-11-29 15:16:18 +08:00
Yishuo Wang
65121c7997
support loading q4_1/q5_0/q5_1/q8_0 gguf model ( #9546 )
2023-11-29 14:40:37 +08:00
Wang, Jian4
b824754256
LLM: Update for cpu qlora mpirun ( #9548 )
2023-11-29 10:56:17 +08:00
Yuwen Hu
5f5ca38b74
[LLM Doc] Fix api doc rendering error ( #9542 )
...
* Fix api rendering error
* Fix python style
2023-11-29 09:17:09 +08:00
Yishuo Wang
a86c6e0b56
[LLM] support loading gguf model ( #9544 )
2023-11-28 15:51:15 +08:00
Xiangyu Tian
916c338772
fix bugs in vllm length check ( #9543 )
2023-11-28 11:09:54 +08:00
WeiguangHan
5098bc3544
LLM: enable previous models ( #9505 )
...
* enable previous models
* test mistral model
* for test
* run models separately
* test all models
* for test
* revert the llm_performance_test.yaml
2023-11-28 10:21:07 +08:00
Zhao Changmin
e7e0cd3b5e
CPU Pinned embedding Layer ( #9538 )
...
* CPU Pinned embedding
2023-11-28 09:46:31 +08:00
Guancheng Fu
963a5c8d79
Add vLLM-XPU version's README/examples ( #9536 )
...
* test
* test
* fix last kv cache
* add xpu readme
* remove numactl for xpu example
* fix link error
* update max_num_batched_tokens logic
* add explanation
* add xpu environment version requirement
* refine gpu memory
* fix
* fix style
2023-11-28 09:44:03 +08:00
Guancheng Fu
b6c3520748
Remove xformers from vLLM-CPU ( #9535 )
2023-11-27 11:21:25 +08:00
binbin Deng
2b9c7d2a59
LLM: quick fix alpaca qlora finetuning script ( #9534 )
2023-11-27 11:04:27 +08:00
Yuwen Hu
11fa3de290
Add setup support of win gpu for bigdl-llm ( #9512 )
2023-11-24 17:49:21 +08:00
Chen, Zhentao
45820cf3b9
add optimize model option ( #9530 )
2023-11-24 17:10:49 +08:00
binbin Deng
6bec0faea5
LLM: support Mistral AWQ models ( #9520 )
2023-11-24 16:20:22 +08:00
Ruonan Wang
914a5a5a27
LLM: fix abnormal Mistral GPU accuracy by updating rms_norm ( #9529 )
2023-11-24 15:37:50 +08:00
SONG Ge
3d24823cda
hot-fix mistral kv_cache ( #9528 )
2023-11-24 14:33:04 +08:00
Zhao Changmin
42b7a16bc5
Replace torch.bmm with safe_bmm ( #9519 )
...
* replace bmm with safe one
* rename args and deprecated warning
2023-11-24 12:16:48 +08:00
Jason Dai
b3178d449f
Update README.md ( #9525 )
2023-11-23 21:45:20 +08:00
Jason Dai
82898a4203
Update GPU example README ( #9524 )
2023-11-23 21:20:26 +08:00
Jason Dai
064848028f
Update README.md ( #9523 )
2023-11-23 21:16:21 +08:00
Ruonan Wang
b63aae8a8e
LLM: add flash attention support for llama ( #9518 )
...
* add initial flash attention for llama
* accelerate fp32 first token by changing to fp16 in advance
* support fp32
2023-11-23 18:40:18 +08:00
Guancheng Fu
bf579507c2
Integrate vllm ( #9310 )
...
* done
* Rename structure
* add models
* Add structure/sampling_params,sequence
* add input_metadata
* add outputs
* Add policy,logger
* add and update
* add parallelconfig back
* core/scheduler.py
* Add llm_engine.py
* Add async_llm_engine.py
* Add tested entrypoint
* fix minor error
* Fix everything
* fix kv cache view
* fix
* fix
* fix
* format&refine
* remove logger from repo
* try to add token latency
* remove logger
* Refine config.py
* finish worker.py
* delete utils.py
* add license
* refine
* refine sequence.py
* remove sampling_params.py
* finish
* add license
* format
* add license
* refine
* refine
* Refine line too long
* remove exception
* so dumb style-check
* refine
* refine
* refine
* refine
* refine
* refine
* add README
* refine README
* add warning instead of error
* fix padding
* add license
* format
* format
* format fix
* Refine vllm dependency (#1 )
vllm dependency clear
* fix licence
* fix format
* fix format
* fix
* adapt LLM engine
* fix
* add license
* fix format
* fix
* Moving README.md to the correct position
* Fix readme.md
* done
* guide for adding models
* fix
* Fix README.md
* Add new model readme
* remove ray-logic
* refactor arg_utils.py
* remove distributed_init_method logic
* refactor entrypoints
* refactor input_metadata
* refactor model_loader
* refactor utils.py
* refactor models
* fix api server
* remove vllm.structure
* revert by txy 1120
* remove utils
* format
* fix license
* add bigdl model
* Refer to a specific commit
* Change code base
* add comments
* add async_llm_engine comment
* refine
* formatted
* add worker comments
* add comments
* add comments
* fix style
* add changes
---------
Co-authored-by: xiangyuT <xiangyu.tian@intel.com>
Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2023-11-23 16:46:45 +08:00