Jinyi Wan
b721138132
Add cpu and gpu examples for BlueLM ( #9589 )
* Add cpu int4 example for BlueLM
* add example optimize_model cpu for BlueLM
* add example gpu int4 BlueLM
* add example optimize_model GPU for BlueLM
* Fixing naming issues and BigDL package version.
* Fixing naming issues...
* Add BlueLM in README.md "Verified Models"
2023-12-05 13:59:02 +08:00
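The new BlueLM examples follow the standard BigDL-LLM transformers-style pattern. A minimal sketch of the INT4 flow, using vivo-ai/BlueLM-7B-Chat as a placeholder model id (the GPU variant additionally assumes intel_extension_for_pytorch and an Intel XPU):

```python
# Minimal sketch of the BlueLM INT4 example flow; model id and prompt are illustrative.
import torch
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = "vivo-ai/BlueLM-7B-Chat"  # assumed checkpoint; substitute a local path if needed
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,       # INT4 weight-only quantization
                                             trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# For the GPU example, the model is additionally moved to an Intel XPU device:
# import intel_extension_for_pytorch as ipex
# model = model.to("xpu")

inputs = tokenizer("What is AI?", return_tensors="pt")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```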
Guancheng Fu
8b00653039
fix doc ( #9599 )
2023-12-05 13:49:31 +08:00
Qiyuan Gong
f211f136b6
Configurable TORCH_LINEAR_THRESHOLD from env ( #9588 )
* Add TORCH_LINEAR_THRESHOLD from env (BIGDL_LLM_LINEAR_THRESHOLD)
* Change default to 512
2023-12-05 13:19:47 +08:00
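A hedged usage sketch: the threshold is read from the BIGDL_LLM_LINEAR_THRESHOLD environment variable named in the commit, so it can be tuned without code changes. Exactly where bigdl-llm reads it is not shown in this log, so set it before importing the library:

```python
# Set the linear threshold via the environment before bigdl-llm modules are imported.
import os
os.environ["BIGDL_LLM_LINEAR_THRESHOLD"] = "512"  # 512 is the default per this commit

from bigdl.llm.transformers import AutoModelForCausalLM  # picks up the threshold when optimizing linears
```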
Yuwen Hu
1012507a40
[LLM] Fix performance tests ( #9596 )
* Fix missing key for cpu_embedding
* Remove 512 as it stuck for now
* Small fix
2023-12-05 10:59:28 +08:00
Chen, Zhentao
8c8a27ded7
Add harness summary job ( #9457 )
* format yml
* add make_table_results
* add summary job
* add a job to print single result
* upload full directory
2023-12-05 10:04:10 +08:00
Yuwen Hu
3f4ad97929
[LLM] Add performance tests for windows iGPU ( #9584 )
* Add support for win gpu benchmark with peak gpu memory monitoring
* Add win igpu tests
* Small fix
* Forward outputs
* Small fix
* Test and small fixes
* Small fix
* Small fix and test
* Small fixes
* Add tests for 512-64 and change back to nightly tests
* Small fix
2023-12-04 20:50:02 +08:00
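A hedged sketch of the peak-GPU-memory monitoring mentioned above, assuming IPEX exposes CUDA-style memory statistics under torch.xpu (exact function names may vary by IPEX version):

```python
# Assumed torch.xpu memory-statistics API; verify against your intel_extension_for_pytorch version.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the xpu backend)

torch.xpu.reset_peak_memory_stats()
# ... run the benchmarked generation on the "xpu" device here ...
peak_bytes = torch.xpu.max_memory_allocated()
print(f"peak XPU memory: {peak_bytes / 1024**2:.1f} MiB")
```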
Chen, Zhentao
29d5bb8df4
Harness workflow dispatch ( #9591 )
* add set-matrix job
* add workflow_dispatch
* fix context
* fix manual run
* rename step
* add quotes
* add runner option
* not required labels
* add runner label to output
* use double quote
2023-12-04 15:53:29 +08:00
Chen, Zhentao
9557aa9c21
Fix harness nightly ( #9586 )
* update golden
* loose the restriction of diff
* only compare results when scheduled
2023-12-04 11:45:00 +08:00
Xiangyu Tian
5c03651309
[LLM] vLLM: Add Preempt for scheduler ( #9568 )
Implement Preempt_by_recompute method for vllm.
2023-12-03 20:16:25 +08:00
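Conceptually, preempt-by-recompute means that when KV-cache blocks run out, the scheduler evicts the lowest-priority running sequence, frees its blocks, and re-queues it so it is recomputed from its prompt later. A simplified sketch with hypothetical helper names (not the actual BigDL/vLLM code):

```python
# Conceptual sketch of preemption by recomputation; block_manager and sequence methods are hypothetical.
def schedule_step(running, waiting, block_manager):
    while running and not block_manager.can_append_slots(running):
        victim = running.pop()        # lowest-priority running sequence group
        block_manager.free(victim)    # release its KV-cache blocks
        victim.reset_to_prompt()      # it will redo prefill when rescheduled
        waiting.insert(0, victim)     # put it at the front of the waiting queue
    return running
```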
Kai Huang
f7e596d85a
Update doc ( #9580 )
* update aiohttp in docs
* update doc
2023-12-01 15:40:37 +08:00
Chen, Zhentao
5de92090b3
try to fix deps installation of bigdl ( #9578 )
2023-12-01 15:25:47 +08:00
Chen, Zhentao
cb228c70ea
Add harness nightly ( #9552 )
* modify output_path as a directory
* schedule nightly at 21 on Friday
* add tasks and models for nightly
* add accuracy regression
* comment out if to test
* mixed fp4
* for test
* add missing delimiter
* remove comma
* fixed golden results
* add mixed 4 golden result
* add more options
* add mistral results
* get golden result of stable lm
* move nightly scripts and results to test folder
* add license
* add fp8 stable lm golden
* run on all available devices
* trigger only when ready for review
* fix new line
* update golden
* add mistral
2023-12-01 14:16:35 +08:00
Chen, Zhentao
4d7d5d4c59
Add 3 leaderboard tasks ( #9566 )
* update leaderboard map
* download model and dataset without overwritten
* fix task drop
* run on all available devices
2023-12-01 14:01:14 +08:00
Heyang Sun
74fd7077a2
[LLM] Multi-process and distributed QLoRA on CPU platform ( #9491 )
* [LLM] Multi-process and distributed QLoRA on CPU platform
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* enable llm-init and bind to socket
* refine
* Update Dockerfile
* add all files of qlora cpu example to /bigdl
* fix
* fix k8s
* Update bigdl-qlora-finetuing-entrypoint.sh
* Update bigdl-qlora-finetuing-entrypoint.sh
* Update bigdl-qlora-finetuning-job.yaml
* fix train sync and performance issues
* add node affinity
* disable user to tune cpu per pod
* Update bigdl-qlora-finetuning-job.yaml
2023-12-01 13:47:19 +08:00
Wang, Jian4
ed0dc57c6e
LLM: Add cpu qlora support other models guide ( #9567 )
* use bf16 flag
* add using baichuan model
* update merge
* remove
* update
2023-12-01 11:18:04 +08:00
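The "update merge" step presumably refers to merging the trained QLoRA adapter back into the base model. A generic PEFT-based illustration with placeholder paths (not necessarily the exact script shipped with the guide):

```python
# Generic illustration of merging a trained LoRA/QLoRA adapter into the base weights; paths are placeholders.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan2-7B-Base",
                                            torch_dtype=torch.bfloat16,  # the guide mentions a bf16 flag
                                            trust_remote_code=True)
merged = PeftModel.from_pretrained(base, "./qlora-adapter").merge_and_unload()
merged.save_pretrained("./merged-model")
```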
Jason Dai
bda404fc8f
Update readme ( #9575 )
2023-11-30 22:45:52 +08:00
Xin Qiu
69c49d21f5
use fused rms norm ( #9572 )
* use fused rms norm
* meet code review
2023-11-30 21:47:41 +08:00
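For reference, the unfused RMSNorm that the fused kernel replaces normalizes each token by the root mean square of its features and applies a learned scale; a plain PyTorch sketch:

```python
# Reference (unfused) RMSNorm, for comparison with the fused kernel this commit switches to.
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return weight * (x * torch.rsqrt(variance + eps))
```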
Lilac09
b785376f5c
Add vllm-example to docker inference image ( #9570 )
* add vllm-serving to cpu image
* add vllm-serving to cpu image
* add vllm-serving
2023-11-30 17:04:53 +08:00
Yishuo Wang
66f5b45f57
[LLM] add a llama2 gguf example ( #9553 )
2023-11-30 16:37:17 +08:00
Yishuo Wang
7f6465518a
support loading llama tokenizer from gguf model ( #9565 )
2023-11-30 14:56:12 +08:00
Lilac09
2554ba0913
Add usage of vllm ( #9564 )
* add usage of vllm
* add usage of vllm
* add usage of vllm
* add usage of vllm
* add usage of vllm
* add usage of vllm
2023-11-30 14:19:23 +08:00
Wang, Jian4
a0a80d232e
LLM: Add qlora cpu distributed readme ( #9561 )
* init readme
* add distributed guide
* update
2023-11-30 13:42:30 +08:00
Chen, Zhentao
c8e0c2ed48
Fixed dumped logs in harness ( #9549 )
* install transformers==4.34.0
* modify output_path as a directory
* add device and task to output dir parents
2023-11-30 12:47:56 +08:00
Qiyuan Gong
d85a430a8c
Using bigdl-llm-init instead of bigdl-nano-init ( #9558 )
* Replace `bigdl-nano-init` with `bigdl-llm-init`.
* Install `bigdl-llm` instead of `bigdl-nano`.
* Remove nano in README.
2023-11-30 10:10:29 +08:00
Yuwen Hu
34503efa6a
Fix cpu pinned embedding ( #9556 )
2023-11-29 18:27:56 +08:00
Lilac09
557bb6bbdb
add judgement for running serve ( #9555 )
2023-11-29 16:57:00 +08:00
binbin Deng
4ff2ca9d0d
LLM: fix loss error on Arc ( #9550 )
2023-11-29 15:16:18 +08:00
Yishuo Wang
65121c7997
support loading q4_1/q5_0/q5_1/q8_0 gguf model ( #9546 )
2023-11-29 14:40:37 +08:00
Wang, Jian4
b824754256
LLM: Update for cpu qlora mpirun ( #9548 )
2023-11-29 10:56:17 +08:00
Yuwen Hu
5f5ca38b74
[LLM Doc] Fix api doc rendering error ( #9542 )
* Fix api rendering error
* Fix python style
2023-11-29 09:17:09 +08:00
Yishuo Wang
a86c6e0b56
[LLM] support loading gguf model ( #9544 )
2023-11-28 15:51:15 +08:00
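A hedged sketch of the GGUF loading flow; the from_gguf entry point and its return values are assumed from the llama2 GGUF example added alongside these commits, so check that example for the exact signature:

```python
# Assumed from_gguf-style loader; the GGUF file path is illustrative (q4_0, and later q4_1/q5_0/q5_1/q8_0,
# formats are covered by this series of commits).
from bigdl.llm.transformers import AutoModelForCausalLM

model, tokenizer = AutoModelForCausalLM.from_gguf("llama-2-7b-chat.Q4_0.gguf")
inputs = tokenizer("What is AI?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```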
Xin Qiu
32b37f3af7
Update gpu install.md ( #9541 )
* Update install_gpu.md
* Update install_gpu.md
2023-11-28 11:15:03 +08:00
Xiangyu Tian
916c338772
fix bugs in vllm length check ( #9543 )
2023-11-28 11:09:54 +08:00
WeiguangHan
5098bc3544
LLM: enable previous models ( #9505 )
* enable previous models
* test mistral model
* for test
* run models separately
* test all models
* for test
* revert the llm_performance_test.yaml
2023-11-28 10:21:07 +08:00
Zhao Changmin
e7e0cd3b5e
CPU Pinned embedding Layer ( #9538 )
* CPU Pinned embedding
2023-11-28 09:46:31 +08:00
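Conceptually, a CPU-pinned embedding keeps the large embedding table in host memory and copies only the looked-up rows to the accelerator, saving device memory for large vocabularies. A simplified sketch of the idea (not the actual implementation added here):

```python
# Conceptual sketch only; class name and details are illustrative.
import torch

class CPUPinnedEmbedding(torch.nn.Module):
    def __init__(self, num_embeddings: int, embedding_dim: int, device: str = "xpu"):
        super().__init__()
        # Table stays in host memory; .pin_memory() could be added for faster async host-to-device copies.
        self.weight = torch.nn.Parameter(torch.randn(num_embeddings, embedding_dim), requires_grad=False)
        self.device = device

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        rows = self.weight[input_ids.cpu()]              # gather the needed rows on the host
        return rows.to(self.device, non_blocking=True)   # ship only those rows to the device
```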
Guancheng Fu
963a5c8d79
Add vLLM-XPU version's README/examples ( #9536 )
* test
* test
* fix last kv cache
* add xpu readme
* remove numactl for xpu example
* fix link error
* update max_num_batched_tokens logic
* add explanation
* add xpu environment version requirement
* refine gpu memory
* fix
* fix style
2023-11-28 09:44:03 +08:00
Guancheng Fu
b6c3520748
Remove xformers from vLLM-CPU ( #9535 )
2023-11-27 11:21:25 +08:00
binbin Deng
2b9c7d2a59
LLM: quick fix alpaca qlora finetuning script ( #9534 )
2023-11-27 11:04:27 +08:00
Yuwen Hu
11fa3de290
Add setup support of win gpu for bigdl-llm ( #9512 )
2023-11-24 17:49:21 +08:00
Chen, Zhentao
45820cf3b9
add optimize model option ( #9530 )
2023-11-24 17:10:49 +08:00
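The option presumably toggles BigDL-LLM's optimize_model path, which wraps a natively loaded Hugging Face model; a minimal sketch of that API (model id is a placeholder):

```python
# Minimal sketch of the optimize_model API; the checkpoint is illustrative.
from transformers import AutoModelForCausalLM
from bigdl.llm import optimize_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             torch_dtype="auto",
                                             low_cpu_mem_usage=True)
model = optimize_model(model)  # apply BigDL-LLM low-bit optimizations to the loaded model
```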
binbin Deng
6bec0faea5
LLM: support Mistral AWQ models ( #9520 )
2023-11-24 16:20:22 +08:00
Ruonan Wang
914a5a5a27
LLM: fix abnormal Mistral GPU accuracy by updating rms_norm ( #9529 )
2023-11-24 15:37:50 +08:00
SONG Ge
3d24823cda
hot-fix mistral kv_cache ( #9528 )
2023-11-24 14:33:04 +08:00
Zhao Changmin
42b7a16bc5
Replace torch.bmm with safe_bmm ( #9519 )
* replace bmm with safe one
* rename args and deprecated warning
2023-11-24 12:16:48 +08:00
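The log does not show what safe_bmm does internally; purely as an illustration, a "safe" wrapper of this kind typically falls back to an equivalent matmul path when the direct torch.bmm call is problematic on a given device or dtype:

```python
# Purely illustrative placeholder -- the real safe_bmm implementation is not shown in this log.
import torch

def safe_bmm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    try:
        return torch.bmm(a, b)
    except RuntimeError:
        return torch.matmul(a, b)  # numerically equivalent fallback for 3-D inputs
```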
Jason Dai
b3178d449f
Update README.md ( #9525 )
2023-11-23 21:45:20 +08:00
Jason Dai
82898a4203
Update GPU example README ( #9524 )
2023-11-23 21:20:26 +08:00
Jason Dai
064848028f
Update README.md ( #9523 )
2023-11-23 21:16:21 +08:00
Ruonan Wang
b63aae8a8e
LLM: add flash attention support for llama ( #9518 )
* add initial flash attention for llama
* accelerate fp32 first token by changing to fp16 in advance
* support fp32
2023-11-23 18:40:18 +08:00
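As a point of reference, PyTorch's fused scaled-dot-product attention expresses the same computation that the flash-attention path accelerates; whether BigDL-LLM uses this exact kernel or an IPEX-specific one is not shown here:

```python
# Reference for what the flash-attention path computes; shapes are [batch, heads, seq, head_dim].
import torch
import torch.nn.functional as F

q = torch.randn(1, 32, 1024, 128)
k = torch.randn(1, 32, 1024, 128)
v = torch.randn(1, 32, 1024, 128)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # fused attention with a causal mask
```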
Guancheng Fu
bf579507c2
Integrate vllm ( #9310 )
* done
* Rename structure
* add models
* Add structure/sampling_params,sequence
* add input_metadata
* add outputs
* Add policy,logger
* add and update
* add parallelconfig back
* core/scheduler.py
* Add llm_engine.py
* Add async_llm_engine.py
* Add tested entrypoint
* fix minor error
* Fix everything
* fix kv cache view
* fix
* fix
* fix
* format&refine
* remove logger from repo
* try to add token latency
* remove logger
* Refine config.py
* finish worker.py
* delete utils.py
* add license
* refine
* refine sequence.py
* remove sampling_params.py
* finish
* add license
* format
* add license
* refine
* refine
* Refine line too long
* remove exception
* so dumb style-check
* refine
* refine
* refine
* refine
* refine
* refine
* add README
* refine README
* add warning instead error
* fix padding
* add license
* format
* format
* format fix
* Refine vllm dependency (#1 )
vllm dependency clear
* fix licence
* fix format
* fix format
* fix
* adapt LLM engine
* fix
* add license
* fix format
* fix
* Moving README.md to the correct position
* Fix readme.md
* done
* guide for adding models
* fix
* Fix README.md
* Add new model readme
* remove ray-logic
* refactor arg_utils.py
* remove distributed_init_method logic
* refactor entrypoints
* refactor input_metadata
* refactor model_loader
* refactor utils.py
* refactor models
* fix api server
* remove vllm.stucture
* revert by txy 1120
* remove utils
* format
* fix license
* add bigdl model
* Refer to a specific commit
* Change code base
* add comments
* add async_llm_engine comment
* refine
* formatted
* add worker comments
* add comments
* add comments
* fix style
* add changes
---------
Co-authored-by: xiangyuT <xiangyu.tian@intel.com>
Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2023-11-23 16:46:45 +08:00
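A hedged usage sketch based on upstream vLLM's offline-inference API, which this integration mirrors; the exact import path inside bigdl.llm.vllm may differ, so see the README added with the commit:

```python
# Upstream vLLM offline-inference style; in the BigDL port the classes likely live under bigdl.llm.vllm.*.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")          # model id is illustrative
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What is AI?"], params)
for out in outputs:
    print(out.outputs[0].text)
```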
Heyang Sun
48fbb1eb94
support ccl (MPI) distributed mode in alpaca_qlora_finetuning_cpu ( #9507 )
2023-11-23 10:58:09 +08:00
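A minimal sketch of enabling the CCL (oneCCL over MPI) backend that this commit wires up; the processes themselves would be launched via mpirun (or an equivalent launcher) that sets the usual rank and world-size environment variables:

```python
# Sketch of CCL distributed initialization; assumes oneccl_bindings_for_pytorch is installed
# and the script is launched under mpirun so that rank/world-size env vars are populated.
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  (registers the "ccl" backend with torch.distributed)

dist.init_process_group(backend="ccl")
print(f"rank {dist.get_rank()} / world size {dist.get_world_size()}")
```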