Commit graph

1827 commits

Author SHA1 Message Date
Qiyuan Gong
223c9622f7 [LLM] Mixtral CPU examples (#9673)
* Mixtral CPU PyTorch and Hugging Face examples, based on #9661 and #9671
2023-12-14 10:35:11 +08:00
Xin Qiu
5e46e0e5af fix baichuan2-7b 1st token performance regression on xpu (#9683)
* fix baichuan2-7b 1st token performance regression

* add comments

* fix style
2023-12-14 09:58:32 +08:00
ZehuaCao
877229f3be [LLM] Add Yi-34B-AWQ to verified AWQ models. (#9676)
* verify Yi-34B-AWQ

* update
2023-12-14 09:55:47 +08:00
binbin Deng
68a4be762f remove disco mixtral, update oneapi version (#9671) 2023-12-13 23:24:59 +08:00
Ruonan Wang
1456d30765 LLM: add dot to option name in setup (#9682) 2023-12-13 20:57:27 +08:00
Yuwen Hu
cbdd49f229 [LLM] win igpu performance for ipex 2.1 and oneapi 2024.0 (#9679)
* Change igpu win tests for ipex 2.1 and oneapi 2024.0

* Qwen model repo id updates; updates model list for 512-64

* Add .eval for win igpu all-in-one benchmark for best performance
2023-12-13 18:52:29 +08:00
Mingyu Wei
16febc949c [LLM] Add exclude option in all-in-one performance test (#9632)
* add exclude option in all-in-one perf test

* update arc-perf-test.yaml

* Exclude in_out_pairs in main function

* fix some bugs

* address Kai's comments

* define excludes at the beginning

* add bloomz:2048 to exclude
2023-12-13 18:13:06 +08:00
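For context on the exclude option above, a hypothetical sketch of filtering excluded model/in-out pairs before the benchmark loop; the `excludes` set and the `model:input_length` key format follow the bullets, but the real all-in-one config schema may differ.

```python
# Hypothetical sketch of the exclude logic; the actual all-in-one benchmark
# config (arc-perf-test.yaml) may use a different schema.
excludes = {"bloomz:2048"}  # "<model>:<input_length>" entries to skip

def filter_in_out_pairs(model_name, in_out_pairs):
    """Drop the in_out_pairs that are excluded for a given model."""
    kept = []
    for pair in in_out_pairs:           # e.g. "2048-256"
        input_len = pair.split("-")[0]  # input length, e.g. "2048"
        if f"{model_name}:{input_len}" not in excludes:
            kept.append(pair)
    return kept

print(filter_in_out_pairs("bloomz", ["32-32", "1024-128", "2048-256"]))
# ['32-32', '1024-128']
```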
Ruonan Wang
9b9cd51de1 LLM: update setup to provide new install option to support ipex 2.1 & oneapi 2024 (#9647)
* update setup

* default to 2.0 now

* meet code review
2023-12-13 17:31:56 +08:00
Yishuo Wang
09ca540f9b use fused mlp in qwen (#9672) 2023-12-13 17:20:08 +08:00
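As background on what a fused MLP buys: SwiGLU-style decoder MLPs (as in Qwen and Baichuan2) can merge the gate and up projections into a single GEMM. A minimal illustrative sketch of the general technique, not BigDL's actual kernel:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedSwiGLUMLP(nn.Module):
    """Illustrative fused MLP: gate and up projections share one GEMM."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        # One weight holds both projections, halving the matmul launches
        # for the first half of the MLP.
        self.gate_up_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        gate, up = self.gate_up_proj(x).chunk(2, dim=-1)
        return self.down_proj(F.silu(gate) * up)

mlp = FusedSwiGLUMLP(hidden_size=4096, intermediate_size=11008)
print(mlp(torch.randn(1, 8, 4096)).shape)  # torch.Size([1, 8, 4096])
```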
Ruonan Wang
c7741c4e84 LLM: update moe block convert to optimize rest token latency of Mixtral (#9669)
* update moe block convert

* further accelerate final_hidden_states

* fix style

* fix style
2023-12-13 16:17:06 +08:00
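For readers unfamiliar with the MoE block being converted here: Mixtral routes each token to its top-2 of 8 experts and mixes the expert outputs with softmax-normalized router weights. A simplified eager-mode sketch of that routing (illustrative only, not the optimized conversion in this commit):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def moe_forward(hidden, gate, experts, top_k=2):
    """Simplified Mixtral-style sparse MoE: route each token to its top-k
    experts and mix their outputs with softmax-normalized router weights."""
    router_logits = gate(hidden)                       # (tokens, num_experts)
    weights, selected = torch.topk(router_logits, top_k, dim=-1)
    weights = F.softmax(weights, dim=-1)
    final_hidden_states = torch.zeros_like(hidden)
    for i, expert in enumerate(experts):
        token_idx, slot = torch.where(selected == i)   # tokens routed to expert i
        if token_idx.numel():
            final_hidden_states[token_idx] += (
                weights[token_idx, slot, None] * expert(hidden[token_idx])
            )
    return final_hidden_states

hidden_size, num_experts = 16, 8
gate = nn.Linear(hidden_size, num_experts, bias=False)
experts = [nn.Linear(hidden_size, hidden_size) for _ in range(num_experts)]
print(moe_forward(torch.randn(5, hidden_size), gate, experts).shape)  # (5, 16)
```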
ZehuaCao
503880809c verify CodeLlama (#9668) 2023-12-13 15:39:31 +08:00
Xiangyu Tian
1c6499e880 [LLM] vLLM: Support Mixtral Model (#9670)
Add Mixtral support for BigDL vLLM.
2023-12-13 14:44:47 +08:00
Ruonan Wang
dc5b1d7e9d LLM: integrate sdp kernel for FP16 rest token inference on GPU [DG2/ATSM] (#9633)
* integrate sdp

* update api

* fix style

* meet code review

* fix

* distinguish mtl from arc

* small fix
2023-12-13 11:29:57 +08:00
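The fused SDP kernel integrated here is BigDL's own GPU path for FP16 rest-token (single-query decode) inference on DG2/ATSM; for reference, stock PyTorch exposes the same scaled dot-product attention computation, as sketched below.

```python
import torch
import torch.nn.functional as F

# Reference only: the computation an SDP kernel performs for the rest-token
# step, i.e. one query token attending over the cached keys/values.
batch, heads, kv_len, head_dim = 1, 32, 128, 128
q = torch.randn(batch, heads, 1, head_dim)       # the single new token
k = torch.randn(batch, heads, kv_len, head_dim)  # cached keys
v = torch.randn(batch, heads, kv_len, head_dim)  # cached values

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 32, 1, 128])
```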
Qiyuan Gong
5b0e7e308c [LLM] Add support for empty activation (#9664)
* Add support for empty activations, e.g., shape [0, 4096]; empty activations are allowed by PyTorch.
* Add comments.
2023-12-13 11:07:45 +08:00
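A quick illustration of the case described above: an activation with a zero-sized leading dimension, e.g. shape [0, 4096], is a valid PyTorch tensor and flows through layers unchanged in rank.

```python
import torch
import torch.nn as nn

x = torch.empty(0, 4096)          # empty activation: zero rows, 4096 features
linear = nn.Linear(4096, 11008)

y = linear(x)                     # PyTorch allows empty activations
print(y.shape)                    # torch.Size([0, 11008])
```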
SONG Ge
284e7697b1 [LLM] Optimize ChatGLM2 kv_cache to support beam_search on ARC (#9579)
* optimize kv_cache to support beam_search on Arc

* correctness test update

* fix query_length issue

* simplify implementation

* only enable the optimization on gpu device

* limit beam_search support to the gpu device with batch_size > 1

* add comments for beam_search case and revert ut change

* meet comments

* add more comments to describe the differences between the multiple cases
2023-12-13 11:02:14 +08:00
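For context, the beam_search path this optimization targets is reached through the standard transformers generate API with num_beams > 1. A minimal sketch, assuming the usual BigDL-LLM low-bit loading path and an Intel GPU ("xpu") device; the model id and generation settings are placeholders.

```python
# Sketch only: assumes BigDL-LLM's low-bit from_pretrained path and an
# Intel GPU ("xpu") device; model id and settings are placeholders.
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the xpu device)
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("THUDM/chatglm2-6b",
                                             load_in_4bit=True,
                                             trust_remote_code=True).to("xpu")
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

inputs = tokenizer("What is AI?", return_tensors="pt").to("xpu")
# num_beams > 1 exercises the beam_search kv_cache handling optimized here
output = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```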
Heyang Sun
c64e2248ef fix get_int_from_str returning a str rather than the expected int (#9667) 2023-12-13 11:01:21 +08:00
binbin Deng
bf1bcf4a14 add official Mixtral model support (#9663) 2023-12-12 22:27:07 +08:00
Ziteng Zhang
8931f2eb62 [LLM] Fix transformer qwen size mismatch and rename causal_mask (#9655)
* Fix size mismatch caused by context_layer
* Change registered_causal_mask to causal_mask
2023-12-12 20:57:40 +08:00
binbin Deng
2fe38b4b9b LLM: add mixtral GPU examples (#9661) 2023-12-12 20:26:36 +08:00
Yuwen Hu
968d99e6f5 Remove empty cache between each iteration of generation (#9660) 2023-12-12 17:24:06 +08:00
Xin Qiu
0e639b920f disable test_optimized_model.py temporarily due to out of memory on A730M (PR validation machine) (#9658)
* disable test_optimized_model.py

* disable seq2seq
2023-12-12 17:13:52 +08:00
binbin Deng
59ce86d292 LLM: support optimize_model=True for Mixtral model (#9657) 2023-12-12 16:41:26 +08:00
Yuwen Hu
017932a7fb Small fix for html generation (#9656) 2023-12-12 14:06:18 +08:00
WeiguangHan
1e25499de0 LLM: test new oneapi (#9654)
* test new oneapi

* revert llm_performance_tests.yml
2023-12-12 11:12:14 +08:00
Yuwen Hu
d272b6dc47 [LLM] Enable generation of html again for win igpu tests (#9652)
* Enable generation of html again and comment out rwkv for 32-512 as it is not very stable

* Small fix
2023-12-11 19:15:17 +08:00
WeiguangHan
afa895877c LLM: fix the issue that may generate blank html (#9650)
* LLM: fix the issue that may generate blank html

* resolve some comments
2023-12-11 19:14:57 +08:00
Yining Wang
a04a027b4c Edit gpu doc (#9583)
* harness: run llama2-7b

* edit-gpu-doc

* fix some format problems

* fix spelling problems

* fix evaluation yml

* delete redundant space

* fix some problems

* address comments

* change link
2023-12-11 14:59:07 +08:00
ZehuaCao
45721f3473 verify llava (#9649) 2023-12-11 14:26:05 +08:00
Heyang Sun
9f02f96160 [LLM] support for Yi AWQ model (#9648) 2023-12-11 14:07:34 +08:00
Xin Qiu
82255f9726 Enable fused layernorm (#9614)
* bloom layernorm

* fix

* layernorm

* fix

* fix

* fix

* style fix

* fix

* replace nn.LayerNorm
2023-12-11 09:26:13 +08:00
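The "replace nn.LayerNorm" bullet points at the usual module-swap pattern; below is a generic, hypothetical sketch of swapping every nn.LayerNorm in a model for another implementation. FusedLayerNorm here is a stand-in, not BigDL's fused op.

```python
import torch
import torch.nn as nn

class FusedLayerNorm(nn.LayerNorm):
    """Stand-in for a fused implementation; behaves like eager LayerNorm here."""
    pass

def replace_layernorm(model: nn.Module) -> nn.Module:
    """Recursively swap every nn.LayerNorm for the (hypothetical) fused version."""
    for name, module in model.named_children():
        if isinstance(module, nn.LayerNorm):
            fused = FusedLayerNorm(module.normalized_shape, eps=module.eps)
            fused.load_state_dict(module.state_dict())  # keep learned weight/bias
            setattr(model, name, fused)
        else:
            replace_layernorm(module)
    return model

model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16))
print(replace_layernorm(model))  # the LayerNorm is now a FusedLayerNorm
```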
Jason Dai
84a19705a6 Update readme (#9617) 2023-12-09 19:23:14 +08:00
Yuwen Hu
894d0aaf5e [LLM] iGPU win perf test reorg based on in-out pairs (#9645)
* trigger PR temporarily

* Separate benchmark runs for win iGPU based on in-out pairs

* Rename fix

* Test workflow

* Small fix

* Skip generation of html for now

* Change back to nightly triggered
2023-12-08 20:46:40 +08:00
Chen, Zhentao
972cdb9992 gsm8k OOM workaround (#9597)
* update bigdl_llm.py

* update the installation of harness

* fix partial function

* import ipex

* force seq len in decreasing order

* put func outside class

* move comments

* default 'trust_remote_code' to True

* Update llm-harness-evaluation.yml
2023-12-08 18:47:25 +08:00
WeiguangHan
1ff4bc43a6 degrade pandas version (#9643) 2023-12-08 17:44:51 +08:00
Yina Chen
70f5e7bf0d Support peft LoraConfig (#9636)
* support peft loraconfig

* use testcase to test

* fix style

* meet comments
2023-12-08 16:13:03 +08:00
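For reference, the kind of standard peft LoraConfig this commit adds support for; the hyperparameters and target_modules below are placeholder values, not recommendations.

```python
from peft import LoraConfig

# Placeholder hyperparameters; the point of the commit is that a stock peft
# LoraConfig like this can be passed through to the QLoRA finetuning path.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
print(lora_config)
```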
Xin Qiu
0b6f29a7fc add fused rms norm for Yi and Qwen (#9640) 2023-12-08 16:04:38 +08:00
Xin Qiu
5636b0ba80 set new linear status (#9639) 2023-12-08 11:02:49 +08:00
binbin Deng
499100daf1 LLM: Add solution to fix oneccl related error (#9630) 2023-12-08 10:51:55 +08:00
ZehuaCao
d204125e88 [LLM] Build a slimmer docker image for k8s (#9608)
* Create Dockerfile.k8s

* Update Dockerfile

A slimmer standalone image

* Update Dockerfile

* Update Dockerfile.k8s

* Update bigdl-qlora-finetuing-entrypoint.sh

* Update qlora_finetuning_cpu.py

* Update alpaca_qlora_finetuning_cpu.py

Referring to this [pr](https://github.com/intel-analytics/BigDL/pull/9551/files#diff-2025188afa54672d21236e6955c7c7f7686bec9239532e41c7983858cc9aaa89), update the LoraConfig.

* update

* update transformer version

* update Dockerfile

* update Docker image name

* fix error
2023-12-08 10:25:36 +08:00
ZehuaCao
6eca8a8bb5 update transformer version (#9631) 2023-12-08 09:36:00 +08:00
WeiguangHan
e9299adb3b LLM: Highlight some values in the html (#9635)
* highlight some values in the html

* revert the llm_performance_tests.yml
2023-12-07 19:02:41 +08:00
Yuwen Hu
6f34978b94 [LLM] Add more performance tests for win iGPU (more in-out pairs, RWKV model) (#9626)
* Add support for loading rwkv models using the from_pretrained API

* Temporarily enable pr tests

* Add RWKV in tests and more in-out pairs

* Add rwkv for 512 tests

* Make iterations smaller

* Change back to nightly trigger
2023-12-07 18:55:16 +08:00
Ruonan Wang
d9b0c01de3 LLM: fix unlora module in qlora finetune (#9621)
* fix unlora module

* split train and inference
2023-12-07 16:32:02 +08:00
Heyang Sun
3811cf43c9 [LLM] update AWQ documents (#9623)
* [LLM] update AWQ and verified models' documents

* refine

* refine links

* refine
2023-12-07 16:02:20 +08:00
Yishuo Wang
7319f2c227 use fused mlp in baichuan2 (#9620) 2023-12-07 15:50:57 +08:00
Xiangyu Tian
deee65785c [LLM] vLLM: Delete last_kv_cache before prefilling (#9619)
Remove last_kv_cache before prefilling to reduce peak memory usage.
2023-12-07 11:32:33 +08:00
Yuwen Hu
48b85593b3 Update all-in-one benchmark readme (#9618) 2023-12-07 10:32:09 +08:00
Xiangyu Tian
0327169b50 [LLM] vLLM: fix memory leak in prepare_kv_cache (#9616)
Revert modification in prepare_kv_cache to fix memory leak.
2023-12-07 10:08:18 +08:00
Xin Qiu
13d47955a8 use fused rms norm in chatglm2 and baichuan (#9613)
* use fused rms norm in chatglm2 and baichuan

* style fix
2023-12-07 09:21:41 +08:00
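As background, the unfused reference that a fused RMSNorm kernel replaces: normalize by the root mean square of the hidden states, then scale by a learned weight. A minimal eager-mode sketch (not the fused kernel itself):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Eager RMSNorm reference; a fused kernel computes the same result
    in a single pass instead of several separate elementwise ops."""
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)

norm = RMSNorm(4096)
print(norm(torch.randn(1, 8, 4096)).shape)  # torch.Size([1, 8, 4096])
```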
Jason Dai
51b668f229 Update GGUF readme (#9611) 2023-12-06 18:21:54 +08:00