Commit graph

1074 commits

Keyan (Kyrie) Zhang
585c174e92
Read the value of KV_CACHE_ALLOC_BLOCK_LENGTH from the environment variables (#10707)
* Read the value of KV_CACHE_ALLOC_BLOCK_LENGTH from the environment variables.

* Fix style
2024-04-10 10:48:46 +08:00
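
A minimal sketch of the environment override added in #10707, assuming the variable is read once at import time; the fallback default of 256 is an illustrative assumption, not taken from the commit:

    import os

    # Fall back to a default block length when the variable is unset;
    # the 256 here is an assumption for illustration only.
    KV_CACHE_ALLOC_BLOCK_LENGTH = int(
        os.environ.get("KV_CACHE_ALLOC_BLOCK_LENGTH", 256)
    )
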
Jiao Wang
878a97077b
Fix llava example to support transformers 4.36 (#10614)
* fix llava example

* update
2024-04-09 13:47:07 -07:00
Zhicun
b4147a97bb
Fix dtype mismatch error (#10609)
* fix llama

* fix

* fix code style

* add torch type in model.py

---------

Co-authored-by: arda <arda@arda-arc19.sh.intel.com>
2024-04-09 17:50:33 +08:00
Yishuo Wang
8f45e22072
fix llama2 (#10710) 2024-04-09 17:28:37 +08:00
Yishuo Wang
e438f941f2
disable rwkv5 fp16 (#10699) 2024-04-09 16:42:11 +08:00
binbin Deng
44922bb5c2
LLM: support baichuan2-13b using AutoTP (#10691) 2024-04-09 14:06:01 +08:00
Yina Chen
c7422712fc
mistral 4.36 use fp16 sdp (#10704) 2024-04-09 13:50:33 +08:00
Ovo233
dcb2038aad
Enable optimization for sentence_transformers (#10679)
* enable optimization for sentence_transformers

* fix python style check failure
2024-04-09 12:33:46 +08:00
Yang Wang
5a1f446d3c
support fp8 in xetla (#10555)
* support fp8 in xetla

* change name

* adjust model file

* support convert back to cpu

* factor

* fix bug

* fix style
2024-04-08 13:22:09 -07:00
Cengguang Zhang
7c43ac0164
LLM: optimize llama native sdp for split qkv tensor (#10693)
* LLM: optimize llama native sdp for split qkv tensor.

* fix block real size.

* fix comment.

* fix style.

* refactor.
2024-04-08 17:48:11 +08:00
Xin Qiu
1274cba79b
stablelm fp8 kv cache (#10672)
* stablelm fp8 kvcache

* update

* fix

* change to fp8 matmul

* fix style

* fix

* fix

* meet code review

* add comment
2024-04-08 15:16:46 +08:00
Cengguang Zhang
c0cd238e40
LLM: support llama2 8k input with w4a16. (#10677)
* LLM: support llama2 8k input with w4a16.

* fix comment and style.

* fix style.

* fix comments and split tensor to quantized attention forward.

* fix style.

* refactor name.

* fix style.

* fix style.

* fix style.

* refactor checker name.

* refactor native sdp split qkv tensor name.

* fix style.

* fix comment, rename variables.

* fix co-existence of intermediate results.
2024-04-08 11:43:15 +08:00
Wang, Jian4
47cabe8fcc
LLM: Fix missing return_last_logit when running bigdl_ipex chatglm3 (#10678)
* fix missing return_last_logits

* update only for chatglm
2024-04-07 15:27:58 +08:00
Zhicun
9d8ba64c0d
Llamaindex: add tokenizer_id and support chat (#10590)
* add tokenizer_id

* fix

* modify

* add from_model_id and from_model_id_low_bit

* fix typo and add comment

* fix python code style

---------

Co-authored-by: pengyb2001 <284261055@qq.com>
2024-04-07 13:51:34 +08:00
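
A hypothetical caller-side sketch for #10590; from_model_id and tokenizer_id come from the commit's own bullets, while the import path, class name, and model id are assumptions for illustration:

    # Assumed module path and class name -- not taken from the commit.
    from bigdl.llm.llamaindex.llms import BigdlLLM

    llm = BigdlLLM.from_model_id(
        model_id="meta-llama/Llama-2-7b-chat-hf",      # illustrative model
        tokenizer_id="meta-llama/Llama-2-7b-chat-hf",  # new in #10590
    )
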
Xiangyu Tian
08018a18df
Remove not-imported MistralConfig (#10670) 2024-04-07 10:32:05 +08:00
Cengguang Zhang
1a9b8204a4
LLM: support int4 fp16 chatglm2-6b 8k input. (#10648) 2024-04-07 09:39:21 +08:00
Jiao Wang
69bdbf5806
Fix vllm print error message issue (#10664)
* update chatglm readme

* Add condition to invalidInputError

* update

* update

* style
2024-04-05 15:08:13 -07:00
Xin Qiu
4c3e493b2d
fix stablelm2 1.6b (#10656)
* fix stablelm2 1.6b

* meet code review
2024-04-03 22:15:32 +08:00
Yishuo Wang
702e686901
optimize starcoder normal kv cache (#10642) 2024-04-03 15:27:02 +08:00
Xin Qiu
3a9ab8f1ae
fix stablelm logits diff (#10636)
* fix logits diff

* Small fixes

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2024-04-03 15:08:12 +08:00
Zhicun
b827f534d5
Add tokenizer_id in Langchain (#10588)
* fix low-bit

* fix

* fix style

---------

Co-authored-by: arda <arda@arda-arc12.sh.intel.com>
2024-04-03 14:25:35 +08:00
Kai Huang
c875b3c858
Add seq len check for llama softmax upcast to fp32 (#10629) 2024-04-03 12:05:13 +08:00
Jiao Wang
23e33a0ca1
Fix qwen-vl style (#10633)
* update

* update
2024-04-02 18:41:38 -07:00
binbin Deng
2bbd8a1548
LLM: fix llama2 FP16 & bs>1 & autotp on PVC and ARC (#10611) 2024-04-03 09:28:04 +08:00
Jiao Wang
654dc5ba57
Fix Qwen-VL example problem (#10582)
* update

* update

* update

* update
2024-04-02 12:17:30 -07:00
Yuwen Hu
fd384ddfb8
Optimize StableLM (#10619)
* Initial commit for stablelm optimizations

* Small style fix

* add dependency

* Add mlp optimizations

* Small fix

* add attention forward

* Remove quantize kv for now as head_dim=80

* Add merged qkv

* fix license

* Python style fix

---------

Co-authored-by: qiuxin2012 <qiuxin2012cs@gmail.com>
2024-04-02 18:58:38 +08:00
Yishuo Wang
ba8cc6bd68
optimize starcoder2-3b (#10625) 2024-04-02 17:16:29 +08:00
Shaojun Liu
a10f5a1b8d
add python style check (#10620)
* add python style check

* fix style checks

* update runner

* add ipex-llm-finetune-qlora-cpu-k8s to manually_build workflow

* update tag to 2.1.0-SNAPSHOT
2024-04-02 16:17:56 +08:00
Cengguang Zhang
58b57177e3
LLM: support bigdl quantize kv cache env and add warning. (#10623)
* LLM: support bigdl quantize kv cache env and add warning.

* fix style.

* fix comments.
2024-04-02 15:41:08 +08:00
Kai Huang
0a95c556a1
Fix starcoder first token perf (#10612)
* add bias check

* update
2024-04-02 09:21:38 +08:00
Cengguang Zhang
e567956121
LLM: add memory optimization for llama. (#10592)
* add initial memory optimization.

* fix logic.

* fix logic.

* remove env var check in mlp split.
2024-04-02 09:07:50 +08:00
Ruonan Wang
bfc1caa5e5
LLM: support iq1s for llama2-70b-hf (#10596) 2024-04-01 13:13:13 +08:00
Yishuo Wang
437a349dd6
fix rwkv with pip installer (#10591) 2024-03-29 17:56:45 +08:00
Ruonan Wang
0136fad1d4
LLM: support iq1_s (#10564)
* init version

* update utils

* remove unused code
2024-03-29 09:43:55 +08:00
Qiyuan Gong
f4537798c1
Enable kv cache quantization by default for flex when 1 < batch <= 8 (#10584)
* Enable kv cache quantization by default for flex when 1 < batch <= 8.
* Change up bound from <8 to <=8.
2024-03-29 09:43:42 +08:00
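
A sketch of the gating heuristic described in #10584, combined with the env override added in #10623; the variable name and the device-name check are assumptions:

    import os

    def use_quantized_kv_cache(device_name: str, batch_size: int) -> bool:
        # An explicit env setting wins (variable name assumed, modeled on
        # the env support added in #10623).
        env = os.environ.get("BIGDL_QUANTIZE_KV_CACHE")
        if env is not None:
            return env == "1"
        # Default heuristic from the commit: Flex GPUs, 1 < batch <= 8.
        return "flex" in device_name.lower() and 1 < batch_size <= 8
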
Cengguang Zhang
b44f7adbad
LLM: Disable esimd sdp for PVC GPU when batch size>1 (#10579)
* llm: disable esimd sdp for pvc bs>1.

* fix logic.

* fix: avoid call get device name twice.
2024-03-28 22:55:48 +08:00
Xin Qiu
5963239b46
Fix qwen's position_ids not long enough (#10572)
* fix position_ids

* fix position_ids
2024-03-28 17:05:49 +08:00
ZehuaCao
52a2135d83
Replace ipex with ipex-llm (#10554)
* fix ipex with ipex_llm

* fix ipex with ipex_llm

* update

* update

* update

* update

* update

* update

* update

* update
2024-03-28 13:54:40 +08:00
Cheen Hau, 俊豪
1c5eb14128
Update pip install to use --extra-index-url for ipex package (#10557)
* Change to 'pip install .. --extra-index-url' for readthedocs

* Change to 'pip install .. --extra-index-url' for examples

* Change to 'pip install .. --extra-index-url' for remaining files

* Fix URL for ipex

* Add links for ipex US and CN servers

* Update ipex cpu url

* remove readme

* Update for github actions

* Update for dockerfiles
2024-03-28 09:56:23 +08:00
binbin Deng
92dfed77be
LLM: fix abnormal output of fp16 deepspeed autotp (#10558) 2024-03-28 09:35:48 +08:00
Xiangyu Tian
51d34ca68e
Fix wrong import in speculative (#10562) 2024-03-27 18:21:07 +08:00
Guancheng Fu
04baac5a2e
Fix fastchat top_k (#10560)
* fix -1 top_k

* fix

* done
2024-03-27 16:01:58 +08:00
binbin Deng
fc8c7904f0
LLM: fix torch_dtype setting of apply fp16 optimization through optimize_model (#10556) 2024-03-27 14:18:45 +08:00
Ruonan Wang
ea4bc450c4
LLM: add esimd sdp for pvc (#10543)
* add esimd sdp for pvc

* update

* fix

* fix batch
2024-03-26 19:04:40 +08:00
Xiangyu Tian
11550d3f25
LLM: Add length check for IPEX-CPU speculative decoding (#10529)
Add length check for IPEX-CPU speculative decoding.
2024-03-26 17:47:10 +08:00
Guancheng Fu
a3b007f3b1
[Serving] Fix fastchat breaks (#10548)
* fix fastchat

* fix doc
2024-03-26 17:03:52 +08:00
Yishuo Wang
69a28d6b4c
fix chatglm (#10540) 2024-03-26 16:01:00 +08:00
binbin Deng
0a3e4e788f
LLM: fix mistral hidden_size setting for deepspeed autotp (#10527) 2024-03-26 10:55:44 +08:00
Xin Qiu
1dd40b429c
enable fp4 fused mlp and qkv (#10531)
* enable fp4 fused mlp and qkv

* update qwen

* update qwen2
2024-03-26 08:34:00 +08:00
Wang, Jian4
16b2ef49c6
Update_document by heyang (#30) 2024-03-25 10:06:02 +08:00
Wang, Jian4
a1048ca7f6
Update setup.py and add new actions and add compatible mode (#25)
* update setup.py

* add new action

* add compatible mode
2024-03-22 15:44:59 +08:00
Wang, Jian4
9df70d95eb
Refactor bigdl.llm to ipex_llm (#24)
* Rename bigdl/llm to ipex_llm

* rm python/llm/src/bigdl

* from bigdl.llm to from ipex_llm
2024-03-22 15:41:21 +08:00
Wang, Jian4
34d0a9328c LLM: Speed-up mixtral in pipeline parallel inference (#10472)
* speed-up mixtral

* fix style
2024-03-22 11:06:28 +08:00
Cengguang Zhang
b9d4280892 LLM: fix baichuan7b quantize kv abnormal output. (#10504)
* fix abnormal output.

* fix style.

* fix style.
2024-03-22 10:00:08 +08:00
Yishuo Wang
f0f317b6cf fix a typo in yuan (#10503) 2024-03-22 09:40:04 +08:00
Guancheng Fu
3a3756b51d Add FastChat bigdl_worker (#10493)
* done

* fix format

* add licence

* done

* fix doc

* refactor folder

* add license
2024-03-21 18:35:05 +08:00
Xin Qiu
dba7ddaab3 add sdp fp8 for qwen, llama (4.36), baichuan, mistral, baichuan2 (#10485)
* add sdp fp8

* fix style

* fix qwen

* fix baichuan 13

* revert baichuan 13b and baichuan2-13b

* fix style

* update
2024-03-21 17:23:05 +08:00
Kai Huang
30f111cd32 lm_head empty_cache for more models (#10490)
* modify constraint

* fix style
2024-03-21 17:11:43 +08:00
binbin Deng
2958ca49c0 LLM: add patching function for llm finetuning (#10247) 2024-03-21 16:01:01 +08:00
Kai Huang
021d77fd22 Remove softmax upcast fp32 in llama (#10481)
* update

* fix style
2024-03-20 18:17:34 +08:00
Yishuo Wang
cfdf8ad496 Fix modules_not_to_convert argument (#10483) 2024-03-20 17:47:03 +08:00
Xiangyu Tian
cbe24cc7e6 LLM: Enable BigDL IPEX Int8 (#10480)
Enable BigDL IPEX Int8
2024-03-20 15:59:54 +08:00
ZehuaCao
1d062e24db Update serving doc (#10475)
* update serving doc

* add tob

* update

* update

* update

* update vllm worker
2024-03-20 14:44:43 +08:00
Cengguang Zhang
4581e4f17f LLM: fix whisper model missing config. (#10473)
* fix whisper model missing config.

* fix style.

* fix style.

* style.
2024-03-20 14:22:37 +08:00
Yishuo Wang
749bedaf1e fix rwkv v5 fp16 (#10474) 2024-03-20 13:15:08 +08:00
Yuwen Hu
72bcc27da9 [LLM] Add TransformersBgeEmbeddings class in bigdl.llm.langchain.embeddings (#10459)
* Add TransformersBgeEmbeddings class in bigdl.llm.langchain.embeddings

* Small fixes
2024-03-19 18:04:35 +08:00
Cengguang Zhang
463a86cd5d LLM: fix qwen-vl interpolation gpu abnormal results. (#10457)
* fix qwen-vl interpolation gpu abnormal results.

* fix style.

* update qwen-vl gpu example.

* fix comment and update example.

* fix style.
2024-03-19 16:59:39 +08:00
Xin Qiu
bbd749dceb qwen2 fp8 cache (#10446)
* qwen2 fp8 cache

* fix style check
2024-03-19 08:32:39 +08:00
Yang Wang
9e763b049c Support running pipeline parallel inference by vertically partitioning the model across different devices (#10392)
* support pipeline parallel inference

* fix logging

* remove benchmark file

* fix

* need to warmup twice

* support qwen and qwen2

* fix lint

* remove genxir

* refine
2024-03-18 13:04:45 -07:00
Xiangyu Tian
dbdeaddd6a LLM: Fix log condition for BIGDL_OPT_IPEX (#10441)
remove log for BIGDL_OPT_IPEX
2024-03-18 16:03:51 +08:00
Xin Qiu
399843faf0 Baichuan 7b fp16 sdp and qwen2 pvc sdp (#10435)
* add baichuan sdp

* update

* baichuan2

* fix

* fix style

* revert 13b

* revert
2024-03-18 10:15:34 +08:00
Yishuo Wang
bd64488b2a add mask support for llama/chatglm fp8 sdp (#10433)
* add mask support for fp8 sdp

* fix chatglm2 dtype

* update
2024-03-15 17:36:52 +08:00
Xin Qiu
24473e331a Qwen2 fp16 sdp (#10427)
* qwen2 sdp and refine

* update

* update

* fix style

* remove use_flash_attention
2024-03-15 13:12:03 +08:00
Ruonan Wang
b036205be2 LLM: add fp8 sdp for chatglm2/3 (#10411)
* add fp8 sdp for chatglm2

* fix style
2024-03-15 09:38:18 +08:00
Wang, Jian4
fe8976a00f LLM: Support gguf models using low_bit and fix missing json (#10408)
* support other models using low_bit

* update readme

* update to add *.json
2024-03-15 09:34:18 +08:00
Xin Qiu
cda38f85a9 Qwen fp16 sdp (#10401)
* qwen sdp

* fix

* update

* update

* update sdp

* update

* fix style check

* add to origin type
2024-03-15 08:51:50 +08:00
dingbaorong
1c0f7ed3fa add xpu support (#10419) 2024-03-14 17:13:48 +08:00
Heyang Sun
7d29765092 refactor qwen2 forward to enable XPU (#10409)
* refactor qwen2 forward to enable XPU

* Update qwen2.py
2024-03-14 11:03:05 +08:00
ZehuaCao
f66329e35d Fix multiple get_enable_ipex function error (#10400)
* fix multiple get_enable_ipex function error

* remove get_enable_ipex_low_bit function
2024-03-14 10:14:13 +08:00
Kai Huang
76e30d8ec8 Empty cache for lm_head (#10317)
* empty cache

* add comments
2024-03-13 20:31:53 +08:00
Yishuo Wang
06a851afa9 support new baichuan model (#10404) 2024-03-13 17:45:50 +08:00
Yishuo Wang
b268baafd6 use fp8 sdp in llama (#10396) 2024-03-13 16:45:38 +08:00
Xiangyu Tian
60043a3ae8 LLM: Support Baichuan2-13b in BigDL-vLLM (#10398)
Support Baichuan2-13b in BigDL-vLLM.
2024-03-13 16:21:06 +08:00
Xiangyu Tian
e10de2c42d [Fix] LLM: Fix condition check error for speculative decoding on CPU (#10402)
Fix condition check error for speculative decoding on CPU
2024-03-13 16:05:06 +08:00
Heyang Sun
d72c0fad0d Qwen2 SDPA forward on CPU (#10395)
* Fix Qwen1.5 CPU forward

* Update convert.py

* Update qwen2.py
2024-03-13 13:10:03 +08:00
Wang, Jian4
0193f29411 LLM: Enable gguf float16 and Yuan2 model (#10372)
* enable float16

* add yuan files

* enable yuan

* enable set low_bit on yuan2

* update

* update license

* update generate

* update readme

* update python style

* update
2024-03-13 10:19:18 +08:00
Yina Chen
f5d65203c0 First token lm_head optimization (#10318)
* add lm head linear

* update

* address comments and fix style

* address comment
2024-03-13 10:11:32 +08:00
Xin Qiu
28c4a8cf5c Qwen fused qkv (#10368)
* fused qkv + rope for qwen

* quantized kv cache

* fix

* update qwen

* fixed quantized qkv

* fix

* meet code review

* update split

* convert.py

* extend when not enough kv

* fix
2024-03-12 17:39:00 +08:00
Yishuo Wang
741c2bf1df use new rms norm (#10384) 2024-03-12 17:29:51 +08:00
Xiangyu Tian
0ded0b4b13 LLM: Enable BigDL IPEX optimization for int4 (#10319)
Enable BigDL IPEX optimization for int4
2024-03-12 17:08:50 +08:00
Zhao Changmin
df2b84f7de Enable kv cache on arc batch (#10308) 2024-03-12 16:46:04 +08:00
Guancheng Fu
cc4148636d [FastChat-integration] Add initial implementation for loader (#10323)
* add initial implementation for loader

* add test method for model_loader

* data

* Refine
2024-03-12 10:54:59 +08:00
binbin Deng
dbcfc5c2fa LLM: fix error of 'AI-ModelScope/phi-2' hosted by ModelScope hub (#10364) 2024-03-11 16:19:17 +08:00
Chen, Zhentao
a425eaabfc fix from_pretrained when device_map=None (#10361)
* pr trigger

* fix error when device_map=None

* fix device_map=None
2024-03-11 16:06:12 +08:00
Yina Chen
d7b765fd3f serving xpu memory opt (#10358) 2024-03-11 15:21:22 +08:00
Ruonan Wang
be29833b2b LLM: fix qwen2 (#10356) 2024-03-11 09:29:08 +08:00
Zhicun
9026c08633 Fix llamaindex AutoTokenizer bug (#10345)
* fix tokenizer

* fix AutoTokenizer bug

* modify code style
2024-03-08 16:24:50 +08:00
Keyan (Kyrie) Zhang
7a621a4db0 Fix device_map bug by raising an error when using device_map=xpu (#10340)
* Fix device_map bug by raising an error when using device_map=xpu

* Fix sync error

* Fix python style

* Use invalidInputError instead of invalidOperationError
2024-03-08 13:38:52 +08:00
Yishuo Wang
1ac193ba02 add rope theta argument (#10343) 2024-03-07 17:27:19 +08:00
Cengguang Zhang
496d18ab6d LLM: add quantize kv cache support for baichuan 7b and 13b. (#10330)
* add quantize kv cache for baichuan 7b and 13b.

* fix typo.

* fix.

* fix style.

* fix style.
2024-03-07 16:17:38 +08:00
Yina Chen
9ea499ca68 Optimize speculative decoding PVC memory usage (#10329)
* optimize memory

* update

* update

* update

* support other models

* update

* fix style
2024-03-06 09:54:21 +08:00
dingbaorong
cc796848ea fix typos (#10274)
Co-authored-by: Ariadne <wyn2000330@126.com>
2024-03-05 18:38:22 +08:00
Yishuo Wang
0011ff9f64 optimize bge large performance (#10324) 2024-03-05 17:06:03 +08:00
Cengguang Zhang
30d009bca7 LLM: support quantized kv cache for Mistral in transformers >=4.36.0 (#10326)
* support quantize kv for mistral in transformers 4.36

* update mistral support.

* fix style.
2024-03-05 16:23:50 +08:00
dingbaorong
1e6f0c6f1a Add llamaindex gpu example (#10314)
* add llamaindex example

* fix core dump

* refine readme

* add troubleshooting

* refine readme

---------

Co-authored-by: Ariadne <wyn2000330@126.com>
2024-03-05 13:36:00 +08:00
dingbaorong
fc7f10cd12 add langchain gpu example (#10277)
* first draft

* fix

* add readme for transformer_int4_gpu

* fix doc

* check device_map

* add arc ut test

* fix ut test

* fix langchain ut

* Refine README

* fix gpu mem too high

* fix ut test

---------

Co-authored-by: Ariadne <wyn2000330@126.com>
2024-03-05 13:33:57 +08:00
Cengguang Zhang
ab9fc2485f LLM: add quantize kv support for llama transformer 4.36 (#10298)
* add quantize kv support for llama transformer 4.36

* fix style.

* fix style.
2024-03-04 10:33:35 +08:00
SONG Ge
0ab40917fb [LLM] Split merged_qk to separated q/k linear (#10299)
* modify merge_qk_linear to separated q/k linear

* update
2024-03-01 16:48:55 +08:00
Yang Wang
f4d7dbcde2 use fused qkv forward in qwen2 (#10185)
* use fused qkv forward in qwen2

* support both

* fix style

* fix rope

* remove print

* fix style

* clean up
2024-03-01 16:46:35 +08:00
Wang, Jian4
beb9433cec LLM: Reduce speculative _ipex_optimize_model memory use (#10281)
* use tpp

* update ipex
2024-03-01 13:48:23 +08:00
Yuwen Hu
f0ff0eebe1 [LLM] Support quantize kv cache for Baichuan2 7B (#10280)
* Add quantized kv cache framework for Baichuan2 7B

* Support quantize kv cache for baichuan2

* Small fix

* Fix python style
2024-03-01 13:35:42 +08:00
SONG Ge
273de341d7 hot-fix silu error import (#10292) 2024-03-01 10:11:37 +08:00
Xin Qiu
232273a1b5 Enable Gemma fused mlp + Gelu (#10276)
* update llama mlp forward

* add all

* fix style check

* split

* update

* update

* update

* fix style
2024-02-29 16:53:24 +08:00
Guancheng Fu
2d930bdca8 Add vLLM bf16 support (#10278)
* add argument load_in_low_bit

* add docs

* modify gpu doc

* done

---------

Co-authored-by: ivy-lv11 <lvzc@lamda.nju.edu.cn>
2024-02-29 16:33:42 +08:00
SONG Ge
13b0bc9075 [LLM] Add quantize_kv optimization for yuan2 model (#10243)
* add initial quantize_kv support for yuan2 model

* fix yuan2 quantize_kv generation

* apply fp16 conv layer optimizations

* disable mlp for quantize_kv
2024-02-29 16:33:26 +08:00
Zhicun
4e6cc424f1 Add LlamaIndex RAG (#10263)
* run demo

* format code

* add llamaindex

* add custom LLM with bigdl

* update

* add readme

* begin ut

* add unit test

* add license

* add license

* revised

* update

* modify docs

* remove data folder

* update

* modify prompt

* fixed

* fixed

* fixed
2024-02-29 15:21:19 +08:00
Ruonan Wang
a9fd20b6ba LLM: Update qkv fusion for GGUF-IQ2 (#10271)
* first commit

* update mistral

* fix transformers==4.36.0

* fix

* disable qk for mixtral now

* fix style
2024-02-29 12:49:53 +08:00
Ruonan Wang
4b08bc1417 LLM: relax batch check of flash attention by double-checking attention mask (#10270)
* relax batch check

* fix

* fix style
2024-02-29 09:39:55 +08:00
Yina Chen
07f36fbfcc Fix gptj failed to extend (#10269) 2024-02-29 09:39:27 +08:00
Yishuo Wang
cccb02dad1 fix baichuan2 13b 2k input (#10267) 2024-02-28 17:20:20 +08:00
Heyang Sun
7244fd1ba5 Fix Arc StarCoder wrong query_shape when input is long (#10268)
* Fix Arc StarCoder wrong query_shape when input is long

* Update gptbigcode.py
2024-02-28 17:07:08 +08:00
Cengguang Zhang
a4de3095f3 LLM: Support quantize kv cache in mistral. (#10261)
* init

* update quantize kv.
2024-02-28 14:08:08 +08:00
Zhicun
308e637d0d Add DeepSeek-MoE-16B-Chat (#10155)
* dsmoe-hf add

* add dsmoe pytorch

* update README

* modify comment

* remove GPU example

* update model name

* format code
2024-02-28 10:12:09 +08:00
Yang Wang
c581c6db30 draft mmint4 (#10031)
change to llm.cpp

support transposed format

revert

implement qkv fuse

fix style

change to vertically pack

change to enable_xetla

fix mlp_fusion_check

remove comments

address comments

add some comments

fix style
2024-02-27 14:55:16 -08:00
Yishuo Wang
b4fa4ab46f optimize yuan 2.0 again (#10252) 2024-02-27 14:51:42 +08:00
Heyang Sun
36a9e88104 Speculative Starcoder on CPU (#10138)
* Speculative Starcoder on CPU

* enable kv-cache pre-allocation

* refine codes

* refine

* fix style

* fix style

* fix style

* refine

* refine

* Update speculative.py

* Update gptbigcode.py

* fix style

* Update speculative.py

* enable mixed-datatype layernorm on top of torch API

* adaptive dtype

* Update README.md
2024-02-27 09:57:29 +08:00
Yishuo Wang
a47989c860 optimize yuan 2.0 performance (#10244) 2024-02-26 17:20:10 +08:00
Wang, Jian4
6c74b99a28 LLM: Update qwen readme (#10245) 2024-02-26 17:03:09 +08:00
Wang, Jian4
f9b75f900b LLM: Enable qwen target_model ipex (#10232)
* change order

* enable qwen ipex

* update qwen example

* update

* fix style

* update
2024-02-26 16:41:12 +08:00
Yuwen Hu
e38e29511c [LLM] Yuan2 MLP and Rotary optimization (#10231)
* Add optimization for rotary embedding

* Add mlp fused optimization

* Python style fix

* Fix rotary embedding due to logits difference

* Small fix
2024-02-26 15:10:08 +08:00
SONG Ge
df2f3885ba [LLM] Enable kv_cache and forward_qkv optimizations for yuan2 (#10225)
* add init kv_cache support for yuan2

* add forward qkv in yuan
2024-02-26 11:29:48 +08:00
Ruonan Wang
28513f3978 LLM: support fp16 embedding & add mlp fusion for iq2_xxs (#10219)
* add fp16 embed

* small fixes

* fix style

* fix style

* fix comment
2024-02-23 17:26:24 +08:00
Yuwen Hu
eeecd9fc08 Python style fix (#10230) 2024-02-23 17:21:23 +08:00
Yuwen Hu
e511bbd8f1 [LLM] Add basic optimization framework for Yuan2 (#10227)
* Add basic optimization framework for Yuan2

* Small fix

* Python style fix

* Small fix

* Small fix
2024-02-23 17:05:00 +08:00
Xin Qiu
30795bdfbc Gemma optimization: rms_norm, kv_cache, fused_rope, fused_rope+qkv (#10212)
* gemma optimization

* update

* update

* fix style

* meet code review
2024-02-23 10:07:24 +08:00
Guoqiong Song
63681af97e falcon for transformers 4.36 (#9960)
* falcon for transformers 4.36
2024-02-22 17:04:40 -08:00
Yina Chen
ce5840a8b7 GPT-J rope optimization on xpu (#10182)
* optimize

* update

* fix style & move use_fuse_rope

* add ipex version check

* fix style

* update

* fix style

* meet comments

* address comments

* fix style
2024-02-22 16:25:12 +08:00
Xiangyu Tian
f445217d02 LLM: Update IPEX to 2.2.0+cpu and Refactor for _ipex_optimize (#10189)
Update IPEX to 2.2.0+cpu and refactor for _ipex_optimize.
2024-02-22 16:01:11 +08:00
Heyang Sun
c876d9b5ca Support for MPT rotary embedding (#10208) 2024-02-22 15:16:31 +08:00
Ruonan Wang
5e1fee5e05 LLM: add GGUF-IQ2 examples (#10207)
* add iq2 examples

* small fix

* meet code review

* fix

* meet review

* small fix
2024-02-22 14:18:45 +08:00
SONG Ge
ca1166a0e5 [LLM] Add quantize kv_cache for Baichuan2-13B (#10203)
* add quantize kv_cache for baichuan2-13b

* style fix
2024-02-22 13:43:35 +08:00
Ruonan Wang
34ee1aa91f LLM: add esimd sdp support for chatglm3 (#10205)
* add esimd sdp support

* fix style
2024-02-22 13:37:16 +08:00
Ruonan Wang
f7c96b19ef LLM: support iq2 for mixtral (#10191)
* support name mapping for mixtral

* support mixtral mixed quantization

* fix style

* fix
2024-02-21 16:00:29 +08:00
Xin Qiu
56ad781f2f qwen2 cpu fix (#10187) 2024-02-21 11:23:51 +08:00
Zhao Changmin
4fbf449c2d for rwkv4 (#10179) 2024-02-21 10:11:10 +08:00
Ruonan Wang
3288acb8de LLM: Support embedding quantization (only q2k now) (#10170)
* basic logic added

* basic support

* support save&load, update mixed strategy

* fix style

* use int8 for lm_head

* add check for xpu
2024-02-20 16:56:57 +08:00
binbin Deng
2bb96c775c LLM: fix device setting during saving optimized model (#10154) 2024-02-20 09:52:59 +08:00
Xin Qiu
1f6d5b9f30 enable fused rmsnorm and rope qwen2 (#10163)
* qwen2

* change convert

* cleanup
2024-02-20 08:33:09 +08:00
Zhao Changmin
f8730e8dc1 Skip rescale rwkv linear when load_low_bit (#10164)
* rwkv_ld
2024-02-19 15:56:42 +08:00
Heyang Sun
3e2af5ec0a Fix IPEX Baichuan Speculative (#10162)
* Fix IPEX Baichuan Speculative

* compatible with 13B

* Update speculative.py
2024-02-19 15:27:34 +08:00
Yina Chen
23c91cdce6 [LLM] Add min_step_draft in speculative decoding (#10142)
* Fix gptj kvcache & position id

* Add min_draft_tokens in speculative decoding

* fix style

* update
2024-02-19 14:31:41 +08:00
Wang, Jian4
f2417e083c LLM: enable chatglm3-6b target_model ipex (#10085)
* init

* always make causal_mask

* not return last tensor

* update

* optimize_model = False

* enable optimized=False

* enable optimized_model=true

* speed_up ipex target_model

* remove if True

* use group_size

* update python style

* update

* update
2024-02-19 13:38:32 +08:00
Yina Chen
1508d6b089 Fix gptj kvcache & position id (#10141) 2024-02-18 10:02:49 +08:00
Yishuo Wang
4d33aac7f9 quick fix qwen2 fp8 kv cache (#10135) 2024-02-08 17:04:59 +08:00
Cengguang Zhang
39d90839aa LLM: add quantize kv cache for llama. (#10086)
* feat: add quantize kv cache for llama.

* fix style.

* add quantized attention forward function.

* revert style.

* fix style.

* fix style.

* update quantized kv cache and add quantize_qkv

* fix style.

* fix style.

* optimize quantize kv cache.

* fix style.
2024-02-08 16:49:22 +08:00
Yishuo Wang
d848efe17c add quantize kv cache support for qwen2 (#10134) 2024-02-08 16:17:21 +08:00
SONG Ge
3f79128ed7 [LLM] Enable kv_cache optimization for Qwen2 on transformers-v4.37.0 (#10131)
* add support for kv_cache optimization on transformers-v4.37.0

* enable attention forward

* style fix

* disable rotary for now
2024-02-08 14:20:26 +08:00
Ruonan Wang
063dc145ac LLM: basic support for q2k (#10132)
* basic support for q2k

* fix style
2024-02-08 13:52:01 +08:00
Cengguang Zhang
0cf6a12691 LLM: add default torch_dtype for fp16. (#10124)
* set default torch_dtype for fp16.

* fix style.

* bug fix.

* update bug fix.
2024-02-08 10:24:16 +08:00
Yishuo Wang
1aa0c623ce disable fused layer norm on UHD (#10130) 2024-02-08 10:20:01 +08:00
Yuwen Hu
a8450fc300 [LLM] Support MLP optimization for Qwen1.5 (#10123) 2024-02-08 09:15:34 +08:00
binbin Deng
925f82107e LLM: support models hosted by modelscope (#10106) 2024-02-07 16:46:36 +08:00
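
A hypothetical usage sketch for #10106; the model_hub keyword is an assumption about how hub selection is exposed, while the model id is one that appears in #10364:

    from bigdl.llm.transformers import AutoModelForCausalLM

    # model_hub="modelscope" is assumed; it would switch downloads from
    # Hugging Face Hub to ModelScope.
    model = AutoModelForCausalLM.from_pretrained(
        "AI-ModelScope/phi-2",
        load_in_4bit=True,
        model_hub="modelscope",
        trust_remote_code=True,
    )
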
Xiangyu Tian
8953acd7d6 [LLM] Fix log condition for BIGDL_OPT_IPEX (#10115)
Fix log condition for BIGDL_OPT_IPEX
2024-02-07 10:27:10 +08:00
Yuwen Hu
518ef95abc Small fix for Nonetype error (#10104) 2024-02-06 14:58:52 +08:00
Ruonan Wang
d61f4905ac LLM: 2bit quantization initial support (#10042)
* basis quantize support

* fix new module name

* small update

* add mixed int4 with iq2_xxs

* remove print

* code refactor

* fix style

* meet code review
2024-02-06 14:58:32 +08:00
Jiao Wang
33b9e7744d fix dimension (#10097) 2024-02-05 15:07:38 -08:00
Zhicun
7d2be7994f add phixtral and optimize phi-moe (#10052) 2024-02-05 11:12:47 +08:00
Zhicun
676d6923f2 LLM: modify transformersembeddings.embed() in langchain (#10051) 2024-02-05 10:42:10 +08:00
Jin Qiao
ad050107b3 LLM: fix mpt load_low_bit issue (#10075)
* fix

* retry

* retry
2024-02-05 10:17:07 +08:00
Ruonan Wang
8e33cb0f38 LLM: support speecht5_tts (#10077)
* support speecht5_tts

* fix
2024-02-04 13:26:42 +08:00
ivy-lv11
428b7105f6 Add HF and PyTorch example InternLM2 (#10061) 2024-02-04 10:25:55 +08:00
Yina Chen
77be19bb97 LLM: Support gpt-j in speculative decoding (#10067)
* gptj

* support gptj in speculative decoding

* fix

* update readme

* small fix
2024-02-02 14:54:55 +08:00
Xin Qiu
6e0f1a1e92 use apply_rotary_pos_emb_cache_freq_xpu in mixtral (#10060)
* use apply_rotary_pos_emb_cache_freq_xpu in mixtral

* fix style
2024-02-01 15:40:49 +08:00
Heyang Sun
601024f418 Mistral CPU example of speculative decoding (#10024)
* Mistral CPU example of speculative decoding

* update transformers version

* update example

* Update README.md
2024-02-01 10:52:32 +08:00
Heyang Sun
968e70544d Enable IPEX Mistral in Speculative (#10059) 2024-02-01 10:48:16 +08:00
Yina Chen
3ca03d4e97 Add deepmind sample into bigdl-llm speculative decoding (#10041)
* migrate deepmind sample

* update

* meet comments

* fix style

* fix style
2024-02-01 09:57:02 +08:00
Wang, Jian4
7e5cd42a5c LLM: Update optimize ipex bf16 (#10038)
* use 4.35.2 and remove

* update rmsnorm

* remove

* remove

* update python style

* update

* update python style

* update

* fix style

* update

* remove whitespace
2024-01-31 10:59:55 +08:00
Ruonan Wang
3685622f29 LLM: fix llama 4.36 forward (#10047) 2024-01-31 10:31:10 +08:00
Yishuo Wang
53a5140eff Optimize rwkv v5 rest token again (#10043) 2024-01-31 10:01:11 +08:00
Ruonan Wang
6b63ba23d1 LLM: add full module name during convert (#10035) 2024-01-30 14:43:07 +08:00
Yishuo Wang
7dfa6dbe46 add rwkv time shift optimization (#10032) 2024-01-30 14:10:55 +08:00
Xiangyu Tian
f57d0fda8b [LLM] Use IPEX Optimization for Self Speculative Decoding (#9997)
Use IPEX Optimization for Self Speculative Decoding
2024-01-30 09:11:06 +08:00
Ruonan Wang
ccf8f613fb LLM: update fp16 Linear on ARC/FLEX (#10023) 2024-01-29 18:25:26 +08:00
Shaojun Liu
824c8029d7 Fix "local variable 'model' referenced before assignment" (#10022) 2024-01-29 16:18:04 +08:00
Xiangyu Tian
f37e4702bc [LLM] Use IPEX Optimization for BF16 Model (#9988)
Use IPEX Optimization for BF16 Model by env BIGDL_OPT_IPEX=true
2024-01-29 11:28:25 +08:00
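
A minimal sketch of the env gate named in #9988; the ipex.optimize call below is an assumed stand-in for the library's internal _ipex_optimize helper (mentioned in #10189):

    import os
    import torch

    def maybe_ipex_optimize(model):
        # Apply the IPEX BF16 path only when BIGDL_OPT_IPEX is "true".
        if os.environ.get("BIGDL_OPT_IPEX", "false").lower() == "true":
            import intel_extension_for_pytorch as ipex
            model = ipex.optimize(model.eval(), dtype=torch.bfloat16)
        return model
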
Yishuo Wang
d720554d43 simplify quantize kv cache api (#10011) 2024-01-29 09:23:57 +08:00
Yina Chen
a3322e2a6c add fp8 e5 to use_xmx (#10015) 2024-01-26 18:29:46 +08:00
Qiyuan Gong
9e18ea187f [LLM] Avoid KV Cache OOM when seq len is larger than 1 (#10006)
* Avoid OOM during muti-round streaming chat with kv cache
* For llama like kv cache, i.e., [bs, n_head, seq_len, head_dim], use is_enough_kv_cache_room_4_31.
* Other models need to compare kv cache size with kv_len.
2024-01-26 17:30:08 +08:00
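
A simplified sketch of the room check named in the bullets of #10006; the real bookkeeping lives in the library's kv-cache utilities, so this is illustrative only:

    def is_enough_kv_cache_room(cache_len: int, allocated_len: int,
                                new_tokens: int = 1) -> bool:
        # For llama-like caches laid out as [bs, n_head, seq_len, head_dim],
        # there is room when the pre-allocated buffer can absorb the
        # incoming tokens without reallocating.
        return cache_len + new_tokens <= allocated_len
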
Ruonan Wang
a00efa0564 LLM: add mlp & qkv fusion for FP16 Llama-7B (#9932)
* add mlp fusion for llama

* add mlp fusion

* fix style

* update

* add mm_qkv_out

* fix style

* update

* meet code review

* meet code review
2024-01-26 11:50:38 +08:00
Wang, Jian4
98ea3459e5 LLM: Fix llama draft_model dtype error (#10005)
* fix llama draft_model dtype error

* update
2024-01-26 10:59:48 +08:00
Yishuo Wang
aae1870096 fix qwen kv cache length (#9998) 2024-01-26 10:15:01 +08:00
Yishuo Wang
24b34b6e46 change xmx condition (#10000) 2024-01-25 17:48:11 +08:00
Yishuo Wang
bf65548d29 Add quantize kv cache support for chatglm2/3 (#9996) 2024-01-25 16:55:59 +08:00
Wang, Jian4
9bff84e6fd LLM: Convert draft_model kv_cache from bf16 to fp32 (#9964)
* convert bf16 to fp32

* update

* change when init

* init first and cut off after

* init and exchange

* update python type

* update

* fix bug

* update

* update
2024-01-25 11:20:27 +08:00
Yina Chen
27338540c3 Fix repetition_penalty not activated issue (#9989) 2024-01-25 10:40:41 +08:00
Yuwen Hu
b27e5a27b9 Remove the check for meta device in _replace_with_low_bit_linear (#9984) 2024-01-24 18:15:39 +08:00
Yina Chen
b176cad75a LLM: Add baichuan2 gpu spec example (#9973)
* add baichuan2 gpu spec example

* update readme & example

* remove print

* fix typo

* meet comments

* revert

* update
2024-01-24 16:40:16 +08:00
Chen, Zhentao
e0db44dcb6 fix unexpected keyword argument 'device' (#9982)
* add device for chatglm3 only

* add comment for this change

* fix style

* fix style

* fix style again..

* finally fixed style
2024-01-24 13:20:46 +08:00
Yuwen Hu
8d28aa8e2b [LLM] Fix the model.device problem when cpu_embedding=True (#9971)
* Overwrite the device attribute for CPUPinnedParam

* Expose cpu_embedding=True for Linux users

* Fix python style
2024-01-23 18:51:11 +08:00
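
A usage sketch for the cpu_embedding flag this commit fixes; the model id is illustrative:

    from bigdl.llm.transformers import AutoModelForCausalLM

    # Keep the embedding table pinned on CPU (useful on iGPUs); after
    # #9971 model.device reports the XPU correctly despite the
    # CPU-pinned weights.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",  # illustrative model id
        load_in_4bit=True,
        cpu_embedding=True,
    ).to("xpu")
    print(model.device)
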
Yishuo Wang
f82782cd3b fix starcoder (#9975) 2024-01-23 17:24:53 +08:00
Yishuo Wang
2c8a9aaf0d fix qwen causal mask when quantize_kv_cache=True (#9968) 2024-01-23 16:34:05 +08:00
Yina Chen
36c665667d Add logits processor & qwen eos stop in speculative decoding (#9963)
* add logits processor & qwen eos

* fix style

* fix

* fix

* fix style

* fix style

* support transformers 4.31

* fix style

* fix style

---------

Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-01-23 15:57:28 +08:00
Xin Qiu
da4687c917 fix fp16 (#9970) 2024-01-23 15:53:32 +08:00
Ruonan Wang
27b19106f3 LLM: add readme for speculative decoding gpu examples (#9961)
* add readme

* add readme

* meet code review
2024-01-23 12:54:19 +08:00
Chen, Zhentao
39219b7e9a add default device meta when lcmu enabled (#9941) 2024-01-23 11:00:49 +08:00
Xin Qiu
dacf680294 add fused rotary pos emb for qwen (#9956)
* add fused rotary pos emb for qwen

* update
2024-01-23 10:37:56 +08:00
Ruonan Wang
7b1d9ad7c0 LLM: limit esimd sdp usage for k_len < 8 (#9959)
* update

* fix
2024-01-23 09:28:23 +08:00
Ruonan Wang
3e601f9a5d LLM: Support speculative decoding in bigdl-llm (#9951)
* first commit

* fix error, add llama example

* hidden print

* update api usage

* change to api v3

* update

* meet code review

* meet code review, fix style

* add reference, fix style

* fix style

* fix first token time
2024-01-22 19:14:56 +08:00
Heyang Sun
fb91c97fe8 support for Baichuan/Baichuan2 13B Chat running speculative decoding (#9921)
* support for Baichuan/Baichuan2 13B Chat running speculative decoding

* fix style
2024-01-22 09:11:44 +08:00
Xin Qiu
97f0cd8975 optimize DeciLM 7b (#9922)
* optimize deci

* update

* decilm attention forward
2024-01-19 17:31:13 +08:00
Wang, Jian4
bcaeb05272 Update optimize qwen (#9943)
* update for n tokens input

* fix dtype

* update
2024-01-19 16:54:59 +08:00
Ruonan Wang
bf37b3a670 LLM: optimize CPU speculative decoding of chatglm3 (#9928)
* update

* fix style

* meet code review
2024-01-19 14:10:22 +08:00
Shaojun Liu
967714bac8 gguf memory optimization for mixtral (#9939) 2024-01-19 11:13:15 +08:00
Lilac09
7032a2ad73 Optimize gguf load memory for mistral (#9923)
* optimize gguf load for mistral

* fix output of gguf mistral

* reset
2024-01-19 09:14:39 +08:00
Shaojun Liu
9a46f019d7 gguf memory optimization for baichuan (#9937) 2024-01-19 09:11:02 +08:00
Guancheng Fu
2e1448f08e [Serving] Add vllm_worker to fastchat serving framework (#9934)
* add worker

* finish

* finish

* add license

* add more comments
2024-01-18 21:33:36 +08:00
Yishuo Wang
7bbb98abb6 Disable fused layer norm when using XMX to fix mpt UT (#9933) 2024-01-18 16:22:12 +08:00
Wang, Jian4
1fc9dfa265 LLM: Update for Qwen n tokens inputs (#9931)
* update for n tokens inputs

* update style

* update
2024-01-18 15:56:29 +08:00
Heyang Sun
5184f400f9 Fix Mixtral GGUF Wrong Output Issue (#9930)
* Fix Mixtral GGUF Wrong Output Issue

* fix style

* fix style
2024-01-18 14:11:27 +08:00
Yishuo Wang
453df868c9 add rwkv v5 attention kernel (#9927) 2024-01-18 10:16:29 +08:00
Ruonan Wang
054952f82f LLM: Fix rope of chatglm3 to support speculative decoding on CPU (#9926) 2024-01-18 09:28:10 +08:00
Ziteng Zhang
18cd1f1432 [LLM] Solve the problem of calling bmm operator in BF16Linear (#9924)
* Solve the problem of calling bmm operator in BF16Linear
2024-01-17 18:08:35 +08:00
Yina Chen
98b86f83d4 Support fast rope for training (#9745)
* init

* init

* fix style

* add test and fix

* address comment

* update

* merge upstream main
2024-01-17 15:51:38 +08:00
Ruonan Wang
427f75000b LLM: fix sdp of chatglm3 (#9917)
* fix

* fix

* fix
2024-01-17 13:37:28 +08:00
Yishuo Wang
94767da7cf optimize rwkv v4 first token performance (#9912) 2024-01-17 09:27:41 +08:00
Shaojun Liu
b909c5c9c2 GGUF load memory optimization (#9913)
* block-wise

* convert linear for module

* revert

* Fix PEP8 checks Error
2024-01-16 18:54:39 +08:00
Xin Qiu
dee32f7d15 copy fused rms norm's result to avoid <unk> (#9909) 2024-01-16 16:54:08 +08:00
Ruonan Wang
8d7326ae03 LLM: fix chatglm3 sdp to support speculative decoding (#9900)
* fix chatglm3

* fix

* update

* meet code review

* fix
2024-01-16 11:29:13 +08:00
Guancheng Fu
9f34da7cdb Update PVC XMX condition (#9901)
* update pvc xmx condition

* update condition

* update condition
2024-01-15 15:42:15 +08:00
Yishuo Wang
6637860ddf change xmx condition (#9896) 2024-01-12 19:51:48 +08:00
Ruonan Wang
d9cf55bce9 LLM: fix MLP check of mixtral (#9891) 2024-01-11 18:01:59 +08:00
Ziteng Zhang
4af88a67b9 support chatglm3 with bf16 (#9888)
* support chatglm3 with bigdl-bf16
2024-01-11 16:45:21 +08:00
Yuwen Hu
0aef35a965 [LLM] Improve LLM doc regarding windows gpu related info (#9880)
* Improve runtime configuration for windows

* Add python 310/311 supports for wheel downloading

* Add troubleshooting for windows gpu

* Remove manually import ipex due to auto importer

* Add info regarding cpu_embedding=True on iGPU

* More info for Windows users

* Small updates to API docs

* Python style fix

* Remove tip for loading from saved optimize_model for now

* Updated based on comments

* Update win info for multi-intel gpus selection

* Small fix

* Small fix
2024-01-11 14:37:16 +08:00
Ruonan Wang
53531ae4ee LLM: support qkv fusion for fp8e5 (#9878)
* update

* add mistral

* meet code review
2024-01-10 17:50:00 +08:00
Lilac09
cb32b985ec add mistral and chatglm support to vllm (#9879)
* add mistral and chatglm support to vllm

* add mistral and chatglm support to vllm
2024-01-10 15:38:42 +08:00
Ruonan Wang
3e05c9e11b LLM: update esimd sdp kernel (#9871) 2024-01-09 18:10:01 +08:00
Yishuo Wang
36496d60ac only use quantize kv cache on MTL (#9862) 2024-01-09 13:24:02 +08:00
ZehuaCao
146076bdb5 Support llm-awq backend (#9856)
* Support for LLM-AWQ Backend

* fix

* Update README.md

* Add awqconfig

* modify init

* update

* support llm-awq

* fix style

* fix style

* update

* fix AwqBackendPackingMethod not found error

* fix style

* update README

* fix style

---------

Co-authored-by: Uxito-Ada <414416158@qq.com>
Co-authored-by: Heyang Sun <60865256+Uxito-Ada@users.noreply.github.com>
Co-authored-by: cyita <yitastudy@gmail.com>
2024-01-09 13:07:32 +08:00
Ruonan Wang
fea6f16057 LLM: add mlp fusion for fp8e5 and update related check (#9860)
* update mlp fusion

* fix style

* update
2024-01-09 09:56:32 +08:00
Jiao Wang
3b6372ab12 Fix Llama transformers 4.36 support (#9852)
* support 4.36

* style

* update

* update

* update

* fix merge

* update
2024-01-08 00:32:23 -08:00
Chen, Zhentao
1b585b0d40 set fp8 default as e5m2 (#9859) 2024-01-08 15:53:57 +08:00
Ruonan Wang
dc995006cc LLM: add flash attention for mistral / mixtral (#9846)
* add flash attention for mistral

* update

* add flash attn for mixtral

* fix style
2024-01-08 09:51:34 +08:00
Yishuo Wang
afaa871144 [LLM] support quantize kv cache to fp8 (#9812) 2024-01-08 09:28:20 +08:00
Jiao Wang
248ae7fad2 LLama optimize_model to support transformers 4.36 (#9818)
* support 4.36

* style

* update

* update

* update
2024-01-05 11:30:18 -08:00
Ruonan Wang
a60bda3324 LLM: update check for deepspeed (#9838) 2024-01-05 16:44:10 +08:00
Ruonan Wang
16433dd959 LLM: fix first token judgement of flash attention (#9841)
* fix flash attention

* meet code review

* fix
2024-01-05 13:49:37 +08:00
Yina Chen
f919f5792a fix kv cache out of bound (#9827) 2024-01-05 12:38:57 +08:00
Ruonan Wang
5df31db773 LLM: fix accuracy issue of chatglm3 (#9830)
* add attn mask for first token

* fix

* fix

* change attn calculation

* fix

* fix

* fix style

* fix style
2024-01-05 10:52:05 +08:00
Xiangyu Tian
38c05be1c0 [LLM] Fix dtype mismatch in Baichuan2-13b (#9834) 2024-01-04 15:34:42 +08:00
Ziteng Zhang
05b681fa85 [LLM] IPEX auto importer set on by default (#9832)
* Set BIGDL_IMPORT_IPEX default to True

* Remove import intel_extension_for_pytorch as ipex from GPU example
2024-01-04 13:33:29 +08:00
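
A sketch of the auto-importer behavior toggled here; the exact default handling and accepted values are assumptions:

    import os

    # When BIGDL_IMPORT_IPEX is not explicitly disabled, importing the
    # library pulls in IPEX itself, so user scripts no longer need
    # `import intel_extension_for_pytorch as ipex` by hand.
    if os.environ.get("BIGDL_IMPORT_IPEX", "true").lower() != "false":
        try:
            import intel_extension_for_pytorch as ipex  # noqa: F401
        except ImportError:
            pass  # keep CPU-only environments working
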
Wang, Jian4
4ceefc9b18 LLM: Support bitsandbytes config on qlora finetune (#9715)
* test support bitsandbytesconfig

* update style

* update cpu example

* update example

* update readme

* update unit test

* use bfloat16

* update logic

* use int4

* set default bnb_4bit_use_double_quant

* update

* update example

* update model.py

* update

* support lora example
2024-01-04 11:23:16 +08:00
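
The config wired in by #9715 is the standard transformers BitsAndBytesConfig; a sketch matching the bullets above (int4, bfloat16 compute, double quantization defaulted), with the quant type an assumption:

    import torch
    from transformers import BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,   # defaulted, per the bullets above
        bnb_4bit_quant_type="nf4",        # quant type is an assumption
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
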
Ruonan Wang
20e9742fa0 LLM: fix chatglm3 issue (#9820)
* fix chatglm3 issue

* small update
2024-01-03 16:15:55 +08:00
Wang, Jian4
a54cd767b1 LLM: Add gguf falcon (#9801)
* init falcon

* update convert.py

* update style
2024-01-03 14:49:02 +08:00
Qiyuan Gong
f0f9d45eac [LLM] IPEX import support bigdl-core-xe-21 (#9769)
Add support for bigdl-core-xe-21.
2023-12-28 15:23:58 +08:00
Guancheng Fu
5857a38321 [vLLM] Add option to adjust KV_CACHE_ALLOC_BLOCK_LENGTH (#9782)
* add option kv_cache_block

* change var name
2023-12-28 14:41:47 +08:00
Ruonan Wang
99bddd3ab4 LLM: better FP16 support for Intel GPUs (#9791)
* initial support

* fix

* fix style

* fix

* limit esimd usage condition

* refactor code

* fix style

* small fix

* meet code review

* small fix
2023-12-28 13:30:13 +08:00
Yishuo Wang
7d9f6c6efc fix cpuinfo error (#9793) 2023-12-28 09:23:44 +08:00
Wang, Jian4
7ed9538b9f LLM: support gguf mpt (#9773)
* add gguf mpt

* update
2023-12-28 09:22:39 +08:00
Cengguang Zhang
d299f108d0 update falcon attention forward. (#9796) 2023-12-28 09:11:59 +08:00
Kai Huang
689889482c Reduce max_cache_pos to reduce Baichuan2-13B memory (#9694)
* optimize baichuan2 memory

* fix

* style

* fp16 mask

* disable fp16

* fix style

* empty cache

* revert empty cache
2023-12-26 19:51:25 +08:00
Xiangyu Tian
0ea842231e [LLM] vLLM: Add api_server entrypoint (#9783)
Add vllm.entrypoints.api_server for benchmark_serving.py in vllm.
2023-12-26 16:03:57 +08:00
Ruonan Wang
11d883301b LLM: fix wrong batch output caused by flash attention (#9780)
* fix

* meet code review

* move batch size check to the beginning

* move qlen check inside function

* meet code review
2023-12-26 09:41:27 +08:00
Heyang Sun
66e286a73d Support for Mixtral AWQ (#9775)
* Support for Mixtral AWQ

* Update README.md

* Update README.md

* Update awq_config.py

* Update README.md

* Update README.md
2023-12-25 16:08:09 +08:00
Ruonan Wang
1917bbe626 LLM: fix BF16Linear related training & inference issue (#9755)
* fix bf16 related issue

* fix

* update based on comment & add arc lora script

* update readme

* update based on comment

* update based on comment

* update

* force to bf16

* fix style

* move check input dtype into function

* update convert

* meet code review

* meet code review

* update merged model to support new training_mode api

* fix typo
2023-12-25 14:49:30 +08:00
Xiangyu Tian
30dab36f76 [LLM] vLLM: Fix kv cache init (#9771)
Fix kv cache init
2023-12-25 14:17:06 +08:00
Yina Chen
449b387125 Support relora in bigdl-llm (#9687)
* init

* fix style

* update

* support resume & update readme

* update

* update

* remove important

* add training mode

* meet comments
2023-12-25 14:04:28 +08:00
Ziteng Zhang
986f65cea9 [LLM] Add trust_remote_code for local renamed model in bigdl_llm_model.py (#9762) 2023-12-25 11:31:14 +08:00
Guancheng Fu
daf536fb2d vLLM: Apply attention optimizations for selective batching (#9758)
* fuse_rope for prefill

* apply kv_cache optimizations

* apply fast_decoding_path

* Re-enable kv_cache optimizations for prefill

* reduce KV_CACHE_ALLOC_BLOCK for selective_batching
2023-12-25 10:29:31 +08:00
Qiyuan Gong
4c487313f2 Revert "[LLM] IPEX auto importer turn on by default for XPU (#9730)" (#9759)
This reverts commit 0284801fbd.
2023-12-22 16:38:24 +08:00
Qiyuan Gong
0284801fbd [LLM] IPEX auto importer turn on by default for XPU (#9730)
* Set BIGDL_IMPORT_IPEX default to true, i.e., auto import IPEX for XPU.
* Remove import intel_extension_for_pytorch as ipex from GPU example.
* Add support for bigdl-core-xe-21.
2023-12-22 16:20:32 +08:00
Guancheng Fu
fdf93c9267 Implement selective batching for vLLM (#9659)
* add control to load hf model

* finish initial version of selective_batching

* temp

* finish

* Remove print statement

* fix error

* Apply yang's optimization

* a version that works

* We need to check kv_cache passed in, this could be an error. TODO: add fast decoding path

* format

* temp solution: not batching prefill requests

* a version that works for prefill batching

* format

* a solid version: works normally

* a temp version

* Solid version: remove redundant functions

* fix format

* format

* solid: add option to enable selective_batching

* remove logic for using transformer models

* format

* format

* solid: enable argument VLLM_ENABLE_SELECTIVE_BATCHING

* format

* finish

* format
2023-12-22 13:45:46 +08:00
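
The bullets above add an env switch for the new path; a sketch of the flag parsing, with the accepted values assumed:

    import os

    VLLM_ENABLE_SELECTIVE_BATCHING = os.environ.get(
        "VLLM_ENABLE_SELECTIVE_BATCHING", ""
    ).lower() in ("1", "true")
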
Ruonan Wang
2f36769208 LLM: bigdl-llm lora support & lora example (#9740)
* lora support and single card example

* support multi-card, refactor code

* fix model id and style

* remove torch patch, add two new class for bf16, update example

* fix style

* change to training_mode

* small fix

* add more info in help

* fix style, update readme

* fix ut

* fix ut

* Handling compatibility issues with default LoraConfig
2023-12-22 11:05:39 +08:00
SONG Ge
ba0b939579 [LLM] Support transformers-v4.36.0 on mistral model (#9744)
* add support transformers-v4.36.0 on mistral model

* python/llm/src/bigdl/llm/transformers/models/mistral.py

* make the redundant implementation as utils

* fix code style

* fix

* fix style

* update with utils enough_kv_room
2023-12-22 09:59:27 +08:00
Xin Qiu
e36111e713 mixtral fused qkv and rope (#9724)
* mixtral fused qkv and rope

* fix and clean

* fix style

* update

* update

* fix

* update

* fix
2023-12-22 09:26:35 +08:00
Jiao Wang
e4f6e43675 safetensors to false (#9728) 2023-12-21 14:41:51 -08:00
Yishuo Wang
426660b88e simplify qwen attention (#9747) 2023-12-21 17:53:29 +08:00
Wang, Jian4
984697afe2 LLM: Add bloom gguf support (#9734)
* init

* update bloom add merges

* update

* update readme

* update for llama error

* update
2023-12-21 14:06:25 +08:00
Heyang Sun
df775cf316 fix python style (#9742)
* fix python style

* fix

* fix
2023-12-21 11:25:05 +08:00
Xin Qiu
6c3e698bf1 mistral decoding_fast_path and fused mlp (#9714)
* mistral decoding_fast_path and fused mlp

* meet code review
2023-12-21 10:11:37 +08:00
Heyang Sun
d157f623b6 Load Mixtral gguf in a block-wise way (#9725)
* Load Mixtral gguf in a block-wise way

* refine
2023-12-21 10:03:23 +08:00
Zhao Changmin
4bda975a3e LLM: Align lowbit model config (#9735)
* align lowbit model config
2023-12-21 09:48:58 +08:00
Wang, Jian4
e1e921f425 LLM: gguf other model using dtype (#9729) 2023-12-21 09:33:40 +08:00
Yishuo Wang
13ea6330bd optimize qwen rope (#9737) 2023-12-20 17:34:34 +08:00
Ziteng Zhang
4c032a433e [LLM] Add glibc checker (#9624)
* Add glibc checker
* Add env BIGDL_GLIBC_CHECK to control glibc checker. The default is false, i.e., don't check.
2023-12-20 16:52:43 +08:00
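
A sketch of the opt-in checker described above, using the stdlib's platform.libc_ver(); the minimum-version floor is an assumption:

    import os
    import platform

    def check_glibc(min_version: str = "2.17") -> None:
        # Off by default, per the commit; opt in via BIGDL_GLIBC_CHECK.
        # The 2.17 floor is an assumption for illustration.
        if os.environ.get("BIGDL_GLIBC_CHECK", "false").lower() != "true":
            return
        libc, version = platform.libc_ver()
        if libc == "glibc" and version:
            found = tuple(int(p) for p in version.split("."))
            needed = tuple(int(p) for p in min_version.split("."))
            if found < needed:
                raise RuntimeError(
                    f"glibc >= {min_version} required, found {version}")
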
Yina Chen
cd652a1710 Support fp8 e5m2 on arc (#9711)
* init

* fix style

* update

* fix style

* update
2023-12-20 16:26:17 +08:00
Yishuo Wang
e54c428d30 add bf16/fp16 fuse mlp support (#9726) 2023-12-20 10:40:45 +08:00
Heyang Sun
612651cb5d fix typo (#9723) 2023-12-20 09:41:59 +08:00
Yishuo Wang
522cf5ed82 [LLM] Improve chatglm2/3 rest token performance with long context (#9716) 2023-12-19 17:29:38 +08:00
Yishuo Wang
f2e6abb563 fix mlp batch size check (#9718) 2023-12-19 14:22:22 +08:00
Heyang Sun
1fa7793fc0 Load Mixtral GGUF Model (#9690)
* Load Mixtral GGUF Model

* refactor

* fix empty tensor when to cpu

* update gpu and cpu readmes

* add dtype when set tensor into module
2023-12-19 13:54:38 +08:00
Qiyuan Gong
d0a3095b97 [LLM] IPEX auto importer (#9706)
* IPEX auto importer and get_ipex_version.
* Add BIGDL_IMPORT_IPEX to control auto import, default is false.
2023-12-19 13:39:38 +08:00
Yang Wang
f4fb58d99c fusing qkv project and rope (#9612)
* Try fusing qkv project and rope

* add fused mlp

* fuse append cache

* fix style and clean up code

* clean up
2023-12-18 16:45:00 -08:00
Cengguang Zhang
4d22add4af LLM: fix qwen efficiency issue in perf-test. 2023-12-18 18:32:54 +08:00
Ruonan Wang
8ed89557e5 LLM: add mlp optimization of mixtral (#9709) 2023-12-18 16:59:52 +08:00
Xin Qiu
320110d158 handle empty fused norm result (#9688)
* handle empty fused norm result

* remove fast_rms_norm

* fix style
2023-12-18 09:56:11 +08:00
SONG Ge
d5b81af7bd Support mixtral attention optimization on transformers-v4.36.0 (#9674)
* add example code to support mistral/mixtral attention on transformers v4.36.0

* update

* style fix

* add update for seen-tokens

* support mixtral

* rm mistral change

* small fix

* add more comments and remove use_cache part

---------

Co-authored-by: plusbang <binbin1.deng@intel.com>
2023-12-15 14:30:23 +08:00
Cengguang Zhang
adbef56001 LLM: update qwen attention forward. (#9695)
* feat: update qwen attention forward.

* fix: style.
2023-12-15 14:06:15 +08:00
Wang, Jian4
b8437a1c1e LLM: Add gguf mistral model support (#9691)
* add mistral support

* need to upgrade transformers version

* update
2023-12-15 13:37:39 +08:00
Wang, Jian4
496bb2e845 LLM: Support loading BaiChuan model family gguf models (#9685)
* support baichuan model family gguf model

* update gguf generate.py

* add verify models

* add support model_family

* update

* update style

* update type

* update readme

* update

* remove support model_family
2023-12-15 13:34:33 +08:00
Yishuo Wang
9a330bfc2b fix fuse mlp when using q5_0 or fp8 (#9689) 2023-12-14 16:16:05 +08:00
Xin Qiu
5e46e0e5af fix baichuan2-7b 1st token performance regression on xpu (#9683)
* fix baichuan2-7b 1st token performance regression

* add comments

* fix style
2023-12-14 09:58:32 +08:00
Yishuo Wang
09ca540f9b use fuse mlp in qwen (#9672) 2023-12-13 17:20:08 +08:00
Ruonan Wang
c7741c4e84 LLM: update moe block convert to optimize rest token latency of Mixtral (#9669)
* update moe block convert

* further accelerate final_hidden_states

* fix style

* fix style
2023-12-13 16:17:06 +08:00
Xiangyu Tian
1c6499e880 [LLM] vLLM: Support Mixtral Model (#9670)
Add Mixtral support for BigDL vLLM.
2023-12-13 14:44:47 +08:00
Ruonan Wang
dc5b1d7e9d LLM: integrate sdp kernel for FP16 rest token inference on GPU [DG2/ATSM] (#9633)
* integrate sdp

* update api

* fix style

* meet code review

* fix

* distinguish mtl from arc

* small fix
2023-12-13 11:29:57 +08:00
Qiyuan Gong
5b0e7e308c [LLM] Add support for empty activation (#9664)
* Add support for empty activation, e.g., [0, 4096]. Empty activation is allowed by PyTorch.
* Add comments.
2023-12-13 11:07:45 +08:00
SONG Ge
284e7697b1 [LLM] Optimize ChatGLM2 kv_cache to support beam_search on ARC (#9579)
* optimize kv_cache to support beam_search on Arc

* correctness test update

* fix query_length issue

* simplify implementation

* only enable the optimization on gpu device

* limit the beam_search support only enabled with gpu device and batch_size > 1

* add comments for beam_search case and revert ut change

* meet comments

* add more comments to describe the difference between multiple cases
2023-12-13 11:02:14 +08:00
Ziteng Zhang
8931f2eb62 [LLM] Fix transformer qwen size mismatch and rename causal_mask (#9655)
* Fix size mismatching caused by context_layer
* Change registered_causal_mask to causal_mask
2023-12-12 20:57:40 +08:00
binbin Deng
59ce86d292 LLM: support optimize_model=True for Mixtral model (#9657) 2023-12-12 16:41:26 +08:00
Heyang Sun
9f02f96160 [LLM] support for Yi AWQ model (#9648) 2023-12-11 14:07:34 +08:00
Xin Qiu
82255f9726 Enable fused layernorm (#9614)
* bloom layernorm

* fix

* layernorm

* fix

* fix

* fix

* style fix

* fix

* replace nn.LayerNorm
2023-12-11 09:26:13 +08:00
Yina Chen
70f5e7bf0d Support peft LoraConfig (#9636)
* support peft loraconfig

* use testcase to test

* fix style

* meet comments
2023-12-08 16:13:03 +08:00
Xin Qiu
0b6f29a7fc add fused rms norm for Yi and Qwen (#9640) 2023-12-08 16:04:38 +08:00
Xin Qiu
5636b0ba80 set new linear status (#9639) 2023-12-08 11:02:49 +08:00
Yuwen Hu
6f34978b94 [LLM] Add more performance tests for win iGPU (more in-out pairs, RWKV model) (#9626)
* Add supports for loading rwkv models using from_pretrained api

* Temporarily enable pr tests

* Add RWKV in tests and more in-out pairs

* Add rwkv for 512 tests

* Make iterations smaller

* Change back to nightly trigger
2023-12-07 18:55:16 +08:00
Ruonan Wang
d9b0c01de3 LLM: fix unlora module in qlora finetune (#9621)
* fix unlora module

* split train and inference
2023-12-07 16:32:02 +08:00
Yishuo Wang
7319f2c227 use fused mlp in baichuan2 (#9620) 2023-12-07 15:50:57 +08:00
Xiangyu Tian
deee65785c [LLM] vLLM: Delete last_kv_cache before prefilling (#9619)
Remove last_kv_cache before prefilling to reduce peak memory usage.
2023-12-07 11:32:33 +08:00
Xiangyu Tian
0327169b50 [LLM] vLLM: fix memory leak in prepare_kv_cache (#9616)
Revert modification in prepare_kv_cache to fix memory leak.
2023-12-07 10:08:18 +08:00
Xin Qiu
13d47955a8 use fused rms norm in chatglm2 and baichuan (#9613)
* use fused rms norm in chatglm2 and baichuan

* style fix
2023-12-07 09:21:41 +08:00
Yina Chen
404e101ded QALora example (#9551)
* Support qa-lora

* init

* update

* update

* update

* update

* update

* update merge

* update

* fix style & update scripts

* update

* address comments

* fix typo

* fix typo

---------

Co-authored-by: Yang Wang <yang3.wang@intel.com>
2023-12-06 15:36:21 +08:00
Guancheng Fu
6978b2c316 [VLLM] Change padding patterns for vLLM & clean code (#9609)
* optimize

* fix minor error

* optimizations

* fix style
2023-12-06 15:27:26 +08:00
Zheng, Yi
d154b38bf9 Add llama2 gpu low memory example (#9514)
* Add low memory example

* Minor fixes

* Update readme.md
2023-12-05 17:29:48 +08:00
Ziteng Zhang
65934c9f4f [LLM] Fix Qwen causal_mask and attention_mask size mismatching (#9600)
* Fix #9582, caused by Qwen's modified modeling_qwen.py 7f62181c94 (d2h-049182)
2023-12-05 15:15:54 +08:00
Qiyuan Gong
f211f136b6 Configurable TORCH_LINEAR_THRESHOLD from env (#9588)
* Add TORCH_LINEAR_THRESHOLD from env (BIGDL_LLM_LINEAR_THRESHOLD)
* Change default to 512
2023-12-05 13:19:47 +08:00
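
A sketch of the override added here; the 512 default comes from the commit's own bullets:

    import os

    TORCH_LINEAR_THRESHOLD = int(
        os.environ.get("BIGDL_LLM_LINEAR_THRESHOLD", 512)
    )
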
Xiangyu Tian
5c03651309 [LLM] vLLM: Add Preempt for scheduler (#9568)
Implement Preempt_by_recompute method for vllm.
2023-12-03 20:16:25 +08:00
Xin Qiu
69c49d21f5 use fused rms norm (#9572)
* use fused rms norm

* meet code review
2023-11-30 21:47:41 +08:00
Yishuo Wang
7f6465518a support loading llama tokenizer from gguf model (#9565) 2023-11-30 14:56:12 +08:00
Yuwen Hu
34503efa6a Fix cpu pinned embedding (#9556) 2023-11-29 18:27:56 +08:00
binbin Deng
4ff2ca9d0d LLM: fix loss error on Arc (#9550) 2023-11-29 15:16:18 +08:00
Yishuo Wang
65121c7997 support loading q4_1/q5_0/q5_1/q8_0 gguf model (#9546) 2023-11-29 14:40:37 +08:00
Yuwen Hu
5f5ca38b74 [LLM Doc] Fix api doc rendering error (#9542)
* Fix api rendering error

* Fix python style
2023-11-29 09:17:09 +08:00
Yishuo Wang
a86c6e0b56 [LLM] support loading gguf model (#9544) 2023-11-28 15:51:15 +08:00
Xiangyu Tian
916c338772 fix bugs in vllm length check (#9543) 2023-11-28 11:09:54 +08:00
Zhao Changmin
e7e0cd3b5e CPU Pinned embedding Layer (#9538)
* CPU Pinned embedding
2023-11-28 09:46:31 +08:00
Guancheng Fu
963a5c8d79 Add vLLM-XPU version's README/examples (#9536)
* test

* test

* fix last kv cache

* add xpu readme

* remove numactl for xpu example

* fix link error

* update max_num_batched_tokens logic

* add explanation

* add xpu environment version requirement

* refine gpu memory

* fix

* fix style
2023-11-28 09:44:03 +08:00
Guancheng Fu
b6c3520748 Remove xformers from vLLM-CPU (#9535) 2023-11-27 11:21:25 +08:00
binbin Deng
6bec0faea5 LLM: support Mistral AWQ models (#9520) 2023-11-24 16:20:22 +08:00
Ruonan Wang
914a5a5a27 LLM: fix abnormal Mistral GPU accuracy by updating rms_norm (#9529) 2023-11-24 15:37:50 +08:00
SONG Ge
3d24823cda hot-fix mistral kv_cache (#9528) 2023-11-24 14:33:04 +08:00
Zhao Changmin
42b7a16bc5 Replace torch.bmm with safe_bmm (#9519)
* replace bmm with safe one

* rename args and add deprecation warning
2023-11-24 12:16:48 +08:00
Ruonan Wang
b63aae8a8e LLM: add flash attention support for llama (#9518)
* add initial flash attention for llama

* accelerate fp32 first token by changing to fp16 in advance

* support fp32
2023-11-23 18:40:18 +08:00
Guancheng Fu
bf579507c2 Integrate vllm (#9310)
* done

* Rename structure

* add models

* Add structure/sampling_params,sequence

* add input_metadata

* add outputs

* Add policy,logger

* add and update

* add parallelconfig back

* core/scheduler.py

* Add llm_engine.py

* Add async_llm_engine.py

* Add tested entrypoint

* fix minor error

* Fix everything

* fix kv cache view

* fix

* fix

* fix

* format&refine

* remove logger from repo

* try to add token latency

* remove logger

* Refine config.py

* finish worker.py

* delete utils.py

* add license

* refine

* refine sequence.py

* remove sampling_params.py

* finish

* add license

* format

* add license

* refine

* refine

* Refine line too long

* remove exception

* so dumb style-check

* refine

* refine

* refine

* refine

* refine

* refine

* add README

* refine README

* add warning instead of error

* fix padding

* add license

* format

* format

* format fix

* Refine vllm dependency (#1)

vllm dependency cleanup

* fix licence

* fix format

* fix format

* fix

* adapt LLM engine

* fix

* add license

* fix format

* fix

* Moving README.md to the correct position

* Fix readme.md

* done

* guide for adding models

* fix

* Fix README.md

* Add new model readme

* remove ray-logic

* refactor arg_utils.py

* remove distributed_init_method logic

* refactor entrypoints

* refactor input_metadata

* refactor model_loader

* refactor utils.py

* refactor models

* fix api server

* remove vllm.structure

* revert by txy 1120

* remove utils

* format

* fix license

* add bigdl model

* Refer to a specific commit

* Change code base

* add comments

* add async_llm_engine comment

* refine

* formatted

* add worker comments

* add comments

* add comments

* fix style

* add changes

---------

Co-authored-by: xiangyuT <xiangyu.tian@intel.com>
Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2023-11-23 16:46:45 +08:00
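The integration reuses upstream vLLM's structure (engine, scheduler, sampling params, entrypoints), so offline batched generation presumably keeps vLLM's familiar shape. A hedged sketch using the upstream API for orientation; the BigDL port's exact module path is not shown in this log and is not reproduced here:

```python
# Upstream vLLM's offline entrypoint, shown for illustration; this PR adapts
# the same engine/scheduler/entrypoint layout into the BigDL code base.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")  # llm_engine + scheduler + worker inside
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```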
Qiyuan Gong
0f0c6bb631 [LLM] Fix Qwen registered_causal_mask is None (#9513)
* Add registered_causal_mask init based on 2abd8e5777.
2023-11-23 09:28:04 +08:00
Ruonan Wang
076d106ef5 LLM: GPU QLoRA update to bf16 to accelerate gradient checkpointing (#9499)
* update to bf16 to accelerate gradient checkpoint

* add utils and fix ut
2023-11-21 17:08:36 +08:00
Xin Qiu
50b01058f1 enable new q4_1 (#9479) 2023-11-17 14:58:57 +08:00
Zhao Changmin
30abd304a7 LLM: Fix baichuan pre-normalize model tensor assigning issue when loading (#9481)
* No need to normalize when loading
2023-11-16 21:57:28 +08:00
Ruonan Wang
c0ef70df02 llm: quick fix of fast_rms_norm (#9480) 2023-11-16 14:42:16 +08:00
Yina Chen
d5263e6681 Add awq load support (#9453)
* Support directly loading GPTQ models from huggingface

* fix style

* fix tests

* change example structure

* address comments

* fix style

* init

* address comments

* add examples

* fix style

* fix style

* fix style

* fix style

* update

* remove

* meet comments

* fix style

---------

Co-authored-by: Yang Wang <yang3.wang@intel.com>
2023-11-16 14:06:25 +08:00
Ruonan Wang
d2c064124a LLM: update rms related usage to support ipex 2.1 new api (#9466)
* update rms related usage

* fix style
2023-11-16 11:21:50 +08:00
Yuwen Hu
731b0aaade Empty cache after embedding to cpu (#9477) 2023-11-16 10:52:30 +08:00
Yang Wang
51d07a9fd8 Support directly loading gptq models from huggingface (#9391)
* Support directly loading GPTQ models from huggingface

* fix style

* fix tests

* change example structure

* address comments

* fix style

* address comments
2023-11-13 20:48:12 -08:00
SONG Ge
2888818b3a [LLM] Support mixed_fp8 on Arc (#9415)
* ut gpu allocation memory fix

* support mix_8bit on arc

* rename mixed_4bit to mixed_fp4 and mixed_8bit to mixed_fp8

* revert unexpected changes

* revert unexpected changes

* unify common logits

* rename in llm xmx_checker

* fix typo error and re-unify
2023-11-13 09:26:30 +08:00
Heyang Sun
df8e4d7889 [LLM] apply allreduce and bias to training in LowBitLinear (#9395) 2023-11-09 14:35:54 +08:00
Wang, Jian4
40cead6b5b LLM: Fix CPU qlora dtype convert issue (#9394) 2023-11-09 14:34:01 +08:00
Ruonan Wang
bfca76dfa7 LLM: optimize QLoRA by updating lora convert logic (#9372)
* update convert logic of qlora

* update

* refactor and further improve performance

* fix style

* meet code review
2023-11-08 17:46:49 +08:00
Ruonan Wang
7e8fb29b7c LLM: optimize QLoRA by reducing convert time (#9370) 2023-11-08 13:14:34 +08:00
Yishuo Wang
bfd9f88f0d [LLM] Use fp32 as dtype when batch_size <=8 and qtype is q4_0/q8_0/fp8 (#9365) 2023-11-08 09:54:53 +08:00
Heyang Sun
fae6db3ddc [LLM] refactor cpu low-bit forward logic (#9366)
* [LLM] refactor cpu low-bit forward logic

* fix style

* Update low_bit_linear.py

* Update low_bit_linear.py

* refine
2023-11-07 15:09:16 +08:00
Heyang Sun
af94058203 [LLM] Support CPU deepspeed distributed inference (#9259)
* [LLM] Support CPU Deepspeed distributed inference

* Update run_deepspeed.py

* Rename

* fix style

* add new codes

* refine

* remove annotated codes

* refine

* Update README.md

* refine doc and example code
2023-11-06 17:56:42 +08:00
Xin Qiu
1420e45cc0 Chatglm2 rope optimization on xpu (#9350) 2023-11-06 13:56:34 +08:00
Yuwen Hu
a0150bb205 [LLM] Move embedding layer to CPU for iGPU inference (#9343)
* Move embedding layer to CPU for iGPU llm inference

* Empty cache after to cpu

* Remove empty cache as it seems to have some negative effect on first token
2023-11-03 11:13:45 +08:00
Yishuo Wang
726203d778 [LLM] Replace Embedding layer to fix it on CPU (#9254) 2023-11-01 13:58:10 +08:00
Yang Wang
e1bc18f8eb fix import ipex problem (#9323)
* fix import ipex problem

* fix style
2023-10-31 20:31:34 -07:00
Yina Chen
2262ae4d13 Support MoFQ4 on arc (#9301)
* init

* update

* fix style

* fix style

* fix style

* meet comments
2023-11-01 10:59:46 +08:00
Yang Wang
163d033616 Support qlora in CPU (#9233)
* support qlora in CPU

* revert example

* fix style
2023-10-27 14:01:15 -07:00
Cengguang Zhang
44b5fcc190 LLM: fix pretraining_tp argument issue. (#9281) 2023-10-26 18:43:58 +08:00
WeiguangHan
6b2a32eba2 LLM: add missing function for PyTorch InternLM model (#9285) 2023-10-26 18:05:23 +08:00
Yina Chen
f879c48f98 fp8 convert use ggml code (#9277) 2023-10-26 17:03:29 +08:00
Yina Chen
e2264e8845 Support arc fp4 (#9266)
* support arc fp4

* fix style

* fix style
2023-10-25 15:42:48 +08:00
Yang Wang
067c7e8098 Support deepspeed AutoTP (#9230)
* Support deepspeed

* add test script

* refactor convert

* refine example

* refine

* refine example

* fix style

* refine example and adapte latest ipex

* fix style
2023-10-24 23:46:28 -07:00
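AutoTP lets DeepSpeed shard a model's linear layers across ranks without hand-written injection policies. A rough sketch of the usual invocation, assuming two ranks launched via the deepspeed CLI; pairing it with BigDL's low-bit conversion is this PR's contribution and is not reproduced here:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             torch_dtype=torch.float16)

# Automatic tensor parallelism: kernel injection stays off, so DeepSpeed's
# generic policy splits the weights across mp_size ranks.
model = deepspeed.init_inference(model,
                                 mp_size=2,
                                 dtype=torch.float16,
                                 replace_with_kernel_inject=False)
```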
Jin Qiao
90162264a3 LLM: replace torch.float32 with auto type (#9261) 2023-10-24 17:12:13 +08:00
SONG Ge
bd5215d75b [LLM] Reimplement chatglm fuse rms optimization (#9260)
* re-implement chatglm rope rms

* update
2023-10-24 16:35:12 +08:00
SONG Ge
bfc1e2d733 add fused rms optimization for chatglm model (#9256) 2023-10-24 14:40:58 +08:00
Guancheng Fu
f37547249d Refine README/CICD (#9253) 2023-10-24 12:56:03 +08:00
binbin Deng
db37edae8a LLM: update langchain api document page (#9222) 2023-10-24 10:13:41 +08:00
Wang, Jian4
c14a61681b Add load low-bit in model-serving to reduce EPC (#9239)
* init load low-bit

* fix

* fix
2023-10-23 11:28:20 +08:00
Yina Chen
0383306688 Add arc fp8 support (#9232)
* add fp8 support

* add log

* fix style
2023-10-20 17:15:07 +08:00
Yang Wang
118249b011 support transformers 4.34+ for llama (#9229) 2023-10-19 22:36:30 -07:00
Chen, Zhentao
5850241423 correct Readme GPU example and API docstring (#9225)
* update readme to correct GPU usage

* update from_pretrained supported low bit options

* fix stype check
2023-10-19 16:08:47 +08:00
Yang Wang
b0ddde0410 Fix removing convert dtype bug (#9216)
* Fix removing convert dtype bug

* fix style
2023-10-18 11:24:22 -07:00
Ruonan Wang
942d6418e7 LLM: fix chatglm kv cache (#9215) 2023-10-18 19:09:53 +08:00
SONG Ge
0765f94770 [LLM] Optimize kv_cache for mistral model family (#9189)
* add kv_cache optimization for mistral model

* kv_cache optimize for mistral

* update style

* update
2023-10-18 15:13:37 +08:00
Ruonan Wang
3555ebc148 LLM: fix wrong length in gptj kv_cache optimization (#9210)
* fix wrong length in gptj kv cache

* update
2023-10-18 14:59:02 +08:00
Shengsheng Huang
6dad8d16df optimize NormHead for Baichuan2 (#9205)
* optimize NormHead for Baichuan2

* fix ut and change name

* rename functions
2023-10-18 14:05:07 +08:00
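Baichuan2's NormHead L2-normalizes the rows of the output projection before the matmul, so normalizing the weight once up front (rather than on every forward pass) is the kind of saving this optimization targets. A sketch of the underlying math only, not this PR's code:

```python
import torch
import torch.nn.functional as F

def norm_head(hidden: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # weight: (vocab_size, hidden_size); normalize each row, then project.
    # Pre-normalizing weight once at load time yields the same logits.
    return F.linear(hidden, F.normalize(weight))
```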
Ruonan Wang
09815f7064 LLM: fix RMSNorm optimization of Baichuan2-13B/Baichuan-13B (#9204)
* fix rmsnorm of baichuan2-13B

* update baichuan1-13B too

* fix style
2023-10-17 18:40:34 +08:00
Ruonan Wang
c0497ab41b LLM: support kv_cache optimization for Qwen-VL-Chat (#9193)
* support qwen_vl_chat

* fix style
2023-10-17 13:33:56 +08:00
binbin Deng
1cd9ab15b8 LLM: fix ChatGLMConfig check (#9191) 2023-10-17 11:52:56 +08:00
Yang Wang
7160afd4d1 Support XPU DDP training and autocast for LowBitMatmul (#9167)
* support autocast in low bit matmul

* Support XPU DDP training

* fix  amp
2023-10-16 20:47:19 -07:00
Ruonan Wang
77afb8796b LLM: fix convert of chatglm (#9190) 2023-10-17 10:48:13 +08:00
dingbaorong
af3b575c7e expose modules_to_not_convert in optimize_model (#9180)
* expose modules_to_not_convert in optimize_model

* some fixes
2023-10-17 09:50:26 +08:00
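Exposing modules_to_not_convert lets callers exempt chosen submodules from low-bit conversion. A minimal sketch; keeping lm_head in full precision is an illustrative choice, not something this commit prescribes:

```python
from transformers import AutoModelForCausalLM
from bigdl.llm import optimize_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# Exempt the output head from conversion (illustrative); names follow
# the model's named_modules() hierarchy.
model = optimize_model(model, modules_to_not_convert=["lm_head"])
```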
Cengguang Zhang
5ca8a851e9 LLM: add fuse optimization for Mistral. (#9184)
* add fuse optimization for mistral.

* fix.

* fix

* fix style.

* fix.

* fix error.

* fix style.

* fix style.
2023-10-16 16:50:31 +08:00
Jiao Wang
49e1381c7f update rope (#9155) 2023-10-15 21:51:45 -07:00
binbin Deng
a164c24746 LLM: add kv_cache optimization for chatglm2-6b-32k (#9165) 2023-10-16 10:43:15 +08:00
Yang Wang
7a2de00b48 Fixes for xpu Bf16 training (#9156)
* Support bf16 training

* Use a stable transformer version

* remove env

* fix style
2023-10-14 21:28:59 -07:00
Cengguang Zhang
51a133de56 LLM: add fuse rope and norm optimization for Baichuan. (#9166)
* add fuse rope optimization.

* add rms norm optimization.
2023-10-13 17:36:52 +08:00
Cengguang Zhang
433f408081 LLM: Add fuse rope and norm optimization for Aquila. (#9161)
* add fuse norm optimization.

* add fuse rope optimization
2023-10-13 14:18:37 +08:00
SONG Ge
e7aa67e141 [LLM] Add rope optimization for internlm (#9159)
* add rope and norm optimization for internlm and gptneox

* revert gptneox back and split from PR #9155

* add norm_forward

* style fix

* update

* update
2023-10-13 14:18:28 +08:00
Ruonan Wang
b8aee7bb1b LLM: Fix Qwen kv_cache optimization (#9148)
* first commit

* ut pass

* accelerate rotate half by using common util function

* fix style
2023-10-12 15:49:42 +08:00
binbin Deng
69942d3826 LLM: fix model check before attention optimization (#9149) 2023-10-12 15:21:51 +08:00
binbin Deng
eb3fb18eb4 LLM: improve PyTorch API doc (#9128) 2023-10-11 15:03:39 +08:00
Zhao Changmin
1709beba5b LLM: Explicitly close pickle file pointer before removing temporary directory (#9120)
* fp close
2023-10-10 14:57:23 +08:00
binbin Deng
e4d1457a70 LLM: improve transformers style API doc (#9113) 2023-10-10 09:31:00 +08:00
Zhao Changmin
edccfb2ed3 LLM: Check model device type (#9092)
* check model device
2023-10-09 15:49:15 +08:00
Yina Chen
4c4f8d1663 [LLM] Fix Arc falcon abnormal output issue (#9096)
* update

* update

* fix error & style

* fix style

* update train

* to input_seq_size
2023-10-09 15:09:37 +08:00
Zhao Changmin
548e4dd5fe LLM: Adapt transformers models for optimize model SL (#9022)
* LLM: Adapt transformers model for SL
2023-10-09 11:13:44 +08:00
Ruonan Wang
f64257a093 LLM: basic api support for esimd fp16 (#9067)
* basic api support for fp16

* fix style

* fix

* fix error and style

* fix style

* meet code review

* update based on comments
2023-10-09 11:05:17 +08:00
Xin Qiu
b3e94a32d4 change log4error import (#9098) 2023-10-08 09:23:28 +08:00
Kai Huang
78ea7ddb1c Combine apply_rotary_pos_emb for gpt-neox (#9074) 2023-10-07 16:27:46 +08:00
Yang Wang
36dd4afd61 Fix llama when rope scaling is not None (#9086)
* Fix llama when rope scaling is not None

* fix style

* fix style
2023-10-06 13:27:37 -07:00
Yang Wang
fcb1c618a0 using bigdl-llm fused rope for llama (#9066)
* optimize llama xpu rope

* fix bug

* fix style

* refine append cache

* remove check

* do not cache cos sin

* remove unnecessary changes

* clean up

* fix style

* check for training
2023-10-06 09:57:29 -07:00
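The fused XPU kernel collapses the standard rotary-embedding composition into a single call. For reference, the eager formulation it replaces looks like this (the stock llama math, not the fused kernel itself):

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    # Two multiplies plus a rotation per projection; a fused kernel does
    # this in one pass over q and k instead of several elementwise ops.
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin
```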
Jiao Wang
aefa5a5bfe Qwen kv cache (#9079)
* qwen and aquila

* update

* update

* style
2023-10-05 11:59:17 -07:00
Jiao Wang
d5ca1f32b6 Aquila KV cache optimization (#9080)
* update

* update

* style
2023-10-05 11:10:57 -07:00
Yang Wang
88565c76f6 add export merged model example (#9018)
* add export merged model example

* add sources

* add script

* fix style
2023-10-04 21:18:52 -07:00
Yang Wang
0cd8f1c79c Use ipex fused rms norm for llama (#9081)
* also apply rmsnorm

* fix cpu
2023-10-04 21:04:55 -07:00
Cengguang Zhang
fb883100e7 LLM: support chatglm-18b convert attention forward in benchmark scripts. (#9072)
* add chatglm-18b convert.

* fix if statement.

* fix
2023-09-28 14:04:52 +08:00
Yishuo Wang
6de2189e90 [LLM] fix chatglm main choice (#9073) 2023-09-28 11:23:37 +08:00
Cengguang Zhang
b4a1266ef0 [WIP] LLM: add kv cache support for internlm. (#9036)
* LLM: add kv cache support for internlm

* add internlm apply_rotary_pos_emb

* fix.

* fix style.
2023-09-25 14:16:59 +08:00
Ruonan Wang
975da86e00 LLM: fix gptneox kv cache (#9044) 2023-09-25 13:03:57 +08:00
Jiao Wang
028a6d9383 MPT model optimize for long sequence (#9020)
* mpt_long_seq

* update

* update

* update

* style

* style2

* update
2023-09-21 21:27:23 -07:00
Ruonan Wang
b943d73844 LLM: refactor kv cache (#9030)
* refactor utils

* meet code review; update all models

* small fix
2023-09-21 21:28:03 +08:00
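The refactored cache utilities pre-allocate K/V buffers and grow them in blocks, so decoding does not torch.cat two tensors per generated token. A self-contained sketch of that allocation strategy with illustrative shapes and block size, not the refactored code itself:

```python
import torch

def append_kv(cache_k, cache_v, new_k, new_v, used, block=256):
    # cache_*: (batch, heads, capacity, head_dim); new_*: one or more steps.
    # Grow by a fixed block when full, copying once, instead of
    # concatenating on every token.
    steps, capacity = new_k.size(2), cache_k.size(2)
    if used + steps > capacity:
        b, h, _, d = cache_k.shape
        grown_k = torch.empty(b, h, capacity + block, d,
                              dtype=cache_k.dtype, device=cache_k.device)
        grown_v = torch.empty_like(grown_k)
        grown_k[:, :, :used] = cache_k[:, :, :used]
        grown_v[:, :, :used] = cache_v[:, :, :used]
        cache_k, cache_v = grown_k, grown_v
    cache_k[:, :, used:used + steps] = new_k
    cache_v[:, :, used:used + steps] = new_v
    return cache_k, cache_v, used + steps
```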
Cengguang Zhang
868511cf02 LLM: fix kv cache issue of bloom and falcon. (#9029) 2023-09-21 18:12:20 +08:00
Ruonan Wang
bf51ec40b2 LLM: Fix empty cache (#9024)
* fix

* fix

* update example
2023-09-21 17:16:07 +08:00
Yina Chen
714884414e fix error (#9025) 2023-09-21 16:42:11 +08:00
SONG Ge
fa47967583 [LLM] Optimize kv_cache for gptj model family (#9010)
* optimize gptj model family attention

* add license and comment for dolly-model

* remove xpu mentioned

* remove useless info

* code style

* style fix

* code style in gptj fix

* remove gptj arch

* move apply_rotary_pos_emb into utils

* kv_seq_length update

* use hidden_states instead of query layer to get the batch size
2023-09-21 10:42:08 +08:00
Cengguang Zhang
b3cad7de57 LLM: add bloom kv cache support (#9012)
* LLM: add bloom kv cache support

* fix style.
2023-09-20 21:10:53 +08:00
Kai Huang
156af15d1e Add NF3 (#9008)
* add nf3

* grammar
2023-09-20 20:03:07 +08:00
Kai Huang
6981745fe4 Optimize kv_cache for gpt-neox model family (#9015)
* override gptneox

* style

* move to utils

* revert
2023-09-20 19:59:19 +08:00
Cengguang Zhang
735a17f7b4 LLM: add kv cache to falcon family. (#8995)
* add kv cache to falcon family.

* fix: import error.

* refactor

* update comments.

* add two versions of falcon attention forward.

* fix

* fix.

* fix.

* fix.

* fix style.

* fix style.
2023-09-20 15:36:30 +08:00
Ruonan Wang
94a7f8917b LLM: fix optimized kv cache for baichuan-13b (#9009)
* fix baichuan 13b

* fix style

* fix

* fix style
2023-09-20 15:30:14 +08:00
Yang Wang
c88f6ec457 Experiment XPU QLora Finetuning (#8937)
* Support xpu finetuning

* support xpu finetuning

* fix style

* fix style

* fix style

* refine example

* add readme

* refine readme

* refine api

* fix fp16

* fix example

* refactor

* fix style

* fix compute type

* add qlora

* refine training args

* fix example

* fix style

* fast path for inference

* address comments

* refine readme

* revert lint
2023-09-19 10:15:44 -07:00
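A hedged sketch of the QLoRA finetuning flow this commit enables on XPU, following the shape of the repo's later examples; treat the module path bigdl.llm.transformers.qlora, the optimize_model=False kwarg, and the NF4 choice as assumptions, not a quote of this PR:

```python
from bigdl.llm.transformers import AutoModelForCausalLM
from bigdl.llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training
from peft import LoraConfig

# Load the base weights in 4-bit NF4 with training-incompatible
# optimizations disabled (assumed flags, per later examples).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             load_in_low_bit="nf4",
                                             optimize_model=False)
model = model.to("xpu")
model = prepare_model_for_kbit_training(model)

config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05, bias="none",
                    target_modules=["q_proj", "k_proj", "v_proj"],
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)  # then train with a standard Trainer
```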
Ruonan Wang
004c45c2be LLM: Support optimized kv_cache for baichuan family (#8997)
* add initial support for baichuan attention

* support baichuan1

* update based on comment

* update based on comment

* support baichuan2

* update link, change how to judge baichuan2

* fix style

* add model parameter for pos emb

* update based on comment
2023-09-19 15:38:54 +08:00
Zhao Changmin
2a05581da7 LLM: Apply low_cpu_mem_usage algorithm on optimize_model API (#8987)
* low_cpu_mem_usage
2023-09-18 21:41:42 +08:00
Zhao Changmin
16b9412e80 tie_word_embeddings (#8977)
tie_word_embeddings
2023-09-15 10:17:09 +08:00
Yishuo Wang
bcf456070c fix bloom-176b int overflow (#8973) 2023-09-14 14:37:57 +08:00
Ruonan Wang
dd57623650 LLM: reduce GPU memory for optimize_model=True (#8965)
* reduce gpu memory for llama & chatglm

* change to device type
2023-09-13 17:27:09 +08:00
SONG Ge
7132ef6081 [LLM Doc] Add optimize_model doc in transformers api (#8957)
* add optimize in from_pretrained

* add api doc for load_low_bit

* update api docs following comments

* update api docs

* update

* reword comments
2023-09-13 10:42:33 +08:00
Zhao Changmin
c32c260ce2 LLM: Add save/load API in optimize_model to support general pytorch model (#8956)
* support hf format SL
2023-09-13 10:22:00 +08:00
Guancheng Fu
0bf5857908 [LLM] Integrate FastChat as a serving framework for BigDL-LLM (#8821)
* Finish changing

* format

* add licence

* Add licence

* fix

* fix

* Add xpu support for fschat

* Fix patch

* Also install webui dependencies

* change setup.py dependency installs

* fix

* format

* final test
2023-09-13 09:28:05 +08:00
Zhao Changmin
dcaa4dc130 LLM: Support GQA on llama kvcache (#8938)
* support GQA
2023-09-12 12:18:40 +08:00
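Grouped-query attention stores fewer K/V heads than query heads, so cached keys and values must be expanded before the attention matmul. The standard helper looks like this (a sketch of the common formulation, not this PR's exact code):

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # x: (batch, num_kv_heads, seq_len, head_dim); each KV head is shared
    # by n_rep query heads, so tile it via expand + reshape.
    if n_rep == 1:
        return x
    b, kv_heads, seq, dim = x.shape
    x = x[:, :, None, :, :].expand(b, kv_heads, n_rep, seq, dim)
    return x.reshape(b, kv_heads * n_rep, seq, dim)
```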
Yang Wang
16761c58be Make llama attention stateless (#8928)
* Make llama attention stateless

* fix style

* fix chatglm

* fix chatglm xpu
2023-09-11 18:21:50 -07:00
Zhao Changmin
e62eda74b8 refine (#8912)
Co-authored-by: leonardozcm <leonardozcm@gmail.com>
2023-09-11 16:40:33 +08:00
Yina Chen
df165ad165 init (#8933) 2023-09-11 14:30:55 +08:00
Ruonan Wang
b3f5dd5b5d LLM: update q8 convert xpu&cpu (#8930) 2023-09-08 16:01:17 +08:00
Yina Chen
33d75adadf [LLM] Support q5_0 on arc (#8926)
* support q5_0

* delete

* fix style
2023-09-08 15:52:36 +08:00
Yang Wang
ee98cdd85c Support latest transformer version (#8923)
* Support latest transformer version

* fix style
2023-09-07 19:01:32 -07:00
Yang Wang
25428b22b4 Fix chatglm2 attention and kv cache (#8924)
* fix chatglm2 attention

* fix bf16 bug

* make model stateless

* add utils

* cleanup

* fix style
2023-09-07 18:54:29 -07:00
Yina Chen
b209b8f7b6 [LLM] Fix arc qtype != q4_0 generate issue (#8920)
* Fix arc precision!=q4_0 generate issue

* meet comments
2023-09-07 08:56:36 -07:00
Yang Wang
c34400e6b0 Use new layout for xpu qlinear (#8896)
* use new layout for xpu qlinear

* fix style
2023-09-06 21:55:33 -07:00
Zhao Changmin
8bc1d8a17c LLM: Fix discards in optimize_model with non-hf models and add openai whisper example (#8877)
* openai-whisper
2023-09-07 10:35:59 +08:00
SONG Ge
7a71ced78f [LLM Docs] Remain API Docs Issues Solution (#8780)
* langchain readthedocs update

* solve langchain.llms.transformersllm issues

* langchain.embeddings.transformersembeddings/transformersllms issues

* update docs for get_num_tokens

* add low_bit api doc

* add optimizer model api doc

* update rst index

* fix comment style

* update docs following the comments

* update api doc
2023-09-06 16:29:34 +08:00
Kai Huang
4a9ff050a1 Add qlora nf4 (#8782)
* add nf4

* dequant nf4

* style
2023-09-06 09:39:22 +08:00
Zhao Changmin
95271f10e0 LLM: Rename low bit layer (#8875)
* rename lowbit

---------

Co-authored-by: leonardozcm <leonardozcm@gmail.com>
2023-09-05 13:21:12 +08:00
Yang Wang
242c9d6036 Fix chatglm2 multi-turn streamchat (#8867) 2023-08-31 22:13:49 -07:00
xingyuan li
de6c6bb17f [LLM] Downgrade amx build gcc version and remove avx flag display (#8856)
* downgrade to gcc 11
* remove avx display
2023-08-31 14:08:13 +09:00
Yang Wang
3b4f4e1c3d Fix llama attention optimization for XPU (#8855)
* Fix llama attention optimization for XPU

* fix chatglm2

* fix typo
2023-08-30 21:30:49 -07:00
Shengsheng Huang
7b566bf686 [LLM] add new API for optimize any pytorch models (#8827)
* add new API for optimize any pytorch models

* change test util name

* revise API and update UT

* fix python style

* update ut config, change default value

* change defaults, disable ut transcribe
2023-08-30 19:41:53 +08:00
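The new API takes an arbitrary PyTorch model rather than only the HF Auto classes. A hedged sketch with openai-whisper as the illustrative target (the repo later ships a whisper example); the default low-bit setting is an assumption:

```python
import whisper                      # pip install openai-whisper (illustrative)
from bigdl.llm import optimize_model

model = whisper.load_model("tiny")
model = optimize_model(model)       # one call; assumed to default to INT4
print(model.transcribe("audio.wav")["text"])
```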
Xin Qiu
8eca982301 windows add env (#8852) 2023-08-30 15:54:52 +08:00
Zhao Changmin
731916c639 LLM: Enable attempting loading method automatically (#8841)
* enable auto load method

* warning error

* logger info

---------

Co-authored-by: leonardozcm <leonardozcm@gmail.com>
2023-08-30 15:41:55 +08:00
Yishuo Wang
bba73ec9d2 [LLM] change chatglm native int4 checkpoint name (#8851) 2023-08-30 15:05:19 +08:00
Yina Chen
55e705a84c [LLM] Support the rest of AutoXXX classes in Transformers API (#8815)
* add transformers auto models

* fix
2023-08-30 11:16:14 +08:00
Yishuo Wang
7429ea0606 [LLM] support transformer int4 + amx int4 (#8838) 2023-08-29 17:27:18 +08:00
Zhao Changmin
bb31d4fe80 LLM: Implement hf low_cpu_mem_usage with 1xbinary file peak memory on transformer int4 (#8731)
* 1x peak memory
2023-08-29 09:33:17 +08:00
SONG Ge
d2926c7672 [LLM] Unify Langchain Native and Transformers LLM API (#8752)
* deprecate BigDLNativeTransformers and add specific LMEmbedding method

* deprecate and add LM methods for langchain llms

* add native params to native langchain

* new imple for embedding

* move ut from bigdlnative to casual llm

* rename embeddings api and update examples to align with usage changes

* docqa example hot-fix

* add more api docs

* add langchain ut for starcoder

* support model_kwargs for transformer methods when calling causalLM and add ut

* ut fix for transformers embedding

* update for langchain causal supporting transformers

* remove model_family in readme doc

* add model_families params to support more models

* update api docs and remove chatglm embeddings for now

* remove chatglm embeddings in examples

* new refactor for ut to add bloom and transformers llama ut

* disable llama transformers embedding ut
2023-08-25 11:14:21 +08:00
Yang Wang
bf3591e2ff Optimize chatglm2 for bf16 (#8725)
* make chatglm works with bf16

* fix style

* support chatglm v1

* fix style

* fix style

* add chatglm2 file
2023-08-24 10:04:25 -07:00
Yishuo Wang
611c1fb628 [LLM] change default n_threads of native int4 langchain API (#8779) 2023-08-21 13:30:12 +08:00
Yishuo Wang
3d1f2b44f8 LLM: change default n_threads of native int4 models (#8776) 2023-08-18 15:46:19 +08:00
Yishuo Wang
2ba2133613 fix starcoder chinese output (#8773) 2023-08-18 13:37:02 +08:00
binbin Deng
548f7a6cf7 LLM: update convert of llama family to support llama2-70B (#8747) 2023-08-18 09:30:35 +08:00
Yina Chen
4afea496ab support q8_0 (#8765) 2023-08-17 15:06:36 +08:00
Ruonan Wang
e9aa2bd890 LLM: reduce GPU 1st token latency and update example (#8763)
* reduce 1st token latency

* update example

* fix

* fix style

* update readme of gpu benchmark
2023-08-16 18:01:23 +08:00
SONG Ge
f4164e4492 [BigDL LLM] Update readme for unifying transformers API (#8737)
* update readme doc

* fix readthedocs error

* update comment

* update exception error info

* invalidInputError instead

* fix readme typo error and remove import error

* fix more typo
2023-08-16 14:22:32 +08:00
Yishuo Wang
77844125f2 [LLM] Support chatglm cache (#8745) 2023-08-14 15:10:46 +08:00
SONG Ge
aceea4dc29 [LLM] Unify Transformers and Native API (#8713)
* re-open pr to run on latest runner

* re-add examples and ut

* rename ut and change deprecation to a warning instead of raising an error

* ut fix
2023-08-11 19:45:47 +08:00
Yishuo Wang
f91035c298 [LLM] fix chatglm native int4 emoji output (#8739) 2023-08-11 15:38:41 +08:00
binbin Deng
77efcf7b1d LLM: fix ChatGLM2 native int4 stream output (#8733) 2023-08-11 14:51:50 +08:00
Ruonan Wang
ca3e59a1dc LLM: support stop for starcoder native int4 stream (#8734) 2023-08-11 14:51:30 +08:00
Yishuo Wang
3d5a7484a2 [LLM] fix bloom and starcoder memory release (#8728) 2023-08-11 11:18:19 +08:00
Ruonan Wang
1a7b698a83 [LLM] support ipex arc int4 & add basic llama2 example (#8700)
* first support of xpu

* make it work on gpu

update setup

update

add GPU llama2 examples

add use_optimize flag to disable optimize for gpu

fix style

update gpu example readme

fix

* update example, and update env

* fix setup to add cpp files

* replace jit with aot to avoid data leak

* rename to bigdl-core-xe

* update installation in example readme
2023-08-09 22:20:32 +08:00
Kai Huang
1b65288bdb Add api doc for LLM (#8605)
* api doc initial

* update desc
2023-08-08 18:17:16 +08:00
binbin Deng
ea5d7aff5b LLM: add chatglm native int4 transformers API (#8695) 2023-08-07 17:52:47 +08:00
Yishuo Wang
ef08250c21 [LLM] chatglm pybinding support (#8672) 2023-08-04 14:27:29 +08:00
Yang Wang
b6468bac43 optimize chatglm2 long sequence (#8662)
* add chatglm2

* optimize a little

* optimize chatglm long sequence

* fix style

* address comments and fix style

* fix bug
2023-08-03 17:56:24 -07:00
Yang Wang
3407f87075 Fix llama kv cache bug (#8674) 2023-08-03 17:54:55 -07:00
binbin Deng
a15a2516e6 add (#8659) 2023-08-03 10:12:10 +08:00
Yina Chen
119bf6d710 [LLM] Support linux cpp dynamic load .so (#8655)
* support linux cpp dynamic load .so

* update cli
2023-08-02 20:15:45 +08:00
Zhao Changmin
ca998cc6f2 LLM: Mute shape mismatch output (#8601)
* LLM: Mute shape mismatch output
2023-08-02 16:46:22 +08:00
Zhao Changmin
04c713ef06 LLM: Disable transformer api pretraining_tp (#8645)
* disable pretraining_tp
2023-08-02 11:26:01 +08:00
Yang Wang
cbeae97a26 Optimize Llama Attention to reduce KV cache memory copy (#8580)
* Optimize llama attention to reduce KV cache memory copy

* fix bug

* fix style

* remove git

* fix style

* fix style

* fix style

* fix tests

* move llama attention to another file

* revert

* fix style

* remove jit

* fix
2023-08-01 16:37:58 -07:00
xingyuan li
cdfbe652ca [LLM] Add chatglm support for llm-cli (#8641)
* add chatglm build
* add llm-cli support
* update git
* install cmake
* add ut for chatglm
* add files to setup
* fix bug causing permission error when sf lacks file
2023-08-01 14:30:17 +09:00
Zhao Changmin
3e10260c6d LLM: llm-convert support chatglm family (#8643)
* convert chatglm
2023-08-01 11:16:18 +08:00
Yina Chen
a607972c0b [LLM] LLM windows load -api.dll (#8631)
* temp

* update

* revert setup.py
2023-07-31 13:47:20 +08:00
xingyuan li
3361b66449 [LLM] Revert llm-cli to disable selecting executables on Windows (#8630)
* revert vnni file select
* revert setup.py
* add model-api.dll
2023-07-31 11:15:44 +09:00
binbin Deng
fb32fefcbe LLM: support tensor input of native int4 generate (#8620) 2023-07-27 17:59:49 +08:00
Zhao Changmin
5b484ab48d LLM: Support load_low_bit loading models in shards format (#8612)
* shards_model

---------

Co-authored-by: leonardozcm <leonaordo1997zcm@gmail.com>
2023-07-26 13:30:01 +08:00
Zhao Changmin
af201052db avoid malloc all missing keys in fp32 (#8600) 2023-07-25 09:48:51 +08:00
Yuwen Hu
ba42a6da63 [LLM] Set torch_dtype default value to 'auto' for transformers low bit from_pretrained API 2023-07-21 17:55:00 +08:00
Yang Wang
feb3af0567 Optimize transformer int4 memory footprint (#8579) 2023-07-20 20:22:13 -07:00
Yang Wang
57e880f63a [LLM] use pytorch linear for large input matrix (#8492)
* use pytorch linear for large input matrix

* only works on server

* fix style

* optimize memory

* first check server

* revert

* address comments

* fix style
2023-07-20 09:54:25 -07:00
Zhao Changmin
e680af45ea LLM: Optimize Langchain Pipeline (#8561)
* LLM: Optimize Langchain Pipeline

* load in low bit
2023-07-19 17:43:13 +08:00
Zhao Changmin
49d636e295 [LLM] whisper model transformer int4 verification and example (#8511)
* LLM: transformer api support

* va

* example

* revert

* pep8

* pep8
2023-07-19 08:33:20 +08:00
Yina Chen
9a7bc17ca1 [LLM] llm supports vnni link on windows (#8543)
* support win vnni link

* fix style

* fix style

* use isa_checker

* fix

* typo

* fix

* update
2023-07-18 16:43:45 +08:00
Yina Chen
4582b6939d [LLM] llm gptneox chat (#8527)
* linux

* support win

* merge upstream & support vnni lib in chat
2023-07-18 11:17:17 +08:00
Xin Qiu
fccae91461 Add load_low_bit save_load_bit to AutoModelForCausalLM (#8531)
* transformers save_low_bit load_low_bit

* update example and add readme

* update

* update

* update

* add ut

* update
2023-07-17 15:29:55 +08:00
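save_low_bit/load_low_bit let you quantize once and skip the original fp checkpoint on later runs. A minimal sketch; the model id and paths are illustrative:

```python
from bigdl.llm.transformers import AutoModelForCausalLM

# First run: quantize while loading, then persist the low-bit weights.
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.3",
                                             load_in_4bit=True)
model.save_low_bit("./vicuna-7b-int4")

# Later runs: reload the quantized weights directly, no fp checkpoint needed.
model = AutoModelForCausalLM.load_low_bit("./vicuna-7b-int4")
```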
xingyuan li
e57db777e0 [LLM] Setup.py & llm-cli update for windows vnni binary files (#8537)
* update setup.py
* update llm-cli
2023-07-17 12:28:38 +09:00
Yishuo Wang
6320bf201e LLM: fix memory access violation (#8519) 2023-07-13 17:08:08 +08:00
Xin Qiu
90e3d86bce rename low bit type name (#8512)
* change qx_0 to sym_intx

* update

* fix typo

* update

* fix type

* fix style

* add python doc

* meet code review

* fix style
2023-07-13 15:53:31 +08:00
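The rename maps the ggml-style qtype strings onto signedness-explicit names. Assuming the natural mapping (q4_0 to sym_int4, q4_1 to asym_int4, q5_0 to sym_int5, q5_1 to asym_int5, q8_0 to sym_int8), usage becomes:

```python
from bigdl.llm.transformers import AutoModelForCausalLM

# "sym_int4" is the renamed q4_0; any of the mapped names above should work
# with load_in_low_bit (mapping is inferred, not stated in this log).
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.3",
                                             load_in_low_bit="sym_int4")
```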
Zhao Changmin
ba0da17b40 LLM: Support AutoModelForSeq2SeqLM transformer API (#8449)
* LLM: support AutoModelForSeq2SeqLM transformer API
2023-07-13 13:33:51 +08:00
Yishuo Wang
86b5938075 LLM: fix llm pybinding (#8509) 2023-07-13 10:27:08 +08:00
Zhao Changmin
23f6a4c21f LLM: Optimize transformer int4 loading (#8499)
* LLM: Optimize transformer int4 loading
2023-07-12 15:25:42 +08:00
Yishuo Wang
dd3f953288 Support vnni check (#8497) 2023-07-12 10:11:15 +08:00
Xin Qiu
cd7a980ec4 Transformer int4 add qtype, support q4_1 q5_0 q5_1 q8_0 (#8481)
* quant in Q4 5 8

* meet code review

* update readme

* style

* update

* fix error

* fix error

* update

* fix style

* update

* Update README.md

* Add load_in_low_bit
2023-07-12 08:23:08 +08:00
Yishuo Wang
db39d0a6b3 LLM: disable mmap by default for better performance (#8467) 2023-07-11 09:26:26 +08:00
Zhao Changmin
81d655cda9 LLM: transformer int4 save and load (#8462)
* LLM: transformer int4 save and load
2023-07-10 16:34:41 +08:00
binbin Deng
d489775d2c LLM: fix inconsistency between output token number and max_new_token (#8479) 2023-07-07 17:31:05 +08:00
Ruonan Wang
2f77d485d8 LLM: Initial support of langchain transformer int4 API (#8459)
* first commit of transformer int4 and pipeline

* basic examples

temp save for embeddings

support embeddings and docqa example

* fix based on comment

* small fix
2023-07-06 17:50:05 +08:00
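This wires the transformer-int4 models into LangChain's LLM and embeddings interfaces. A hedged sketch using the class name from the later rename (#8470); treat the import path and the from_model_id signature as assumptions:

```python
from langchain import LLMChain, PromptTemplate
from bigdl.llm.langchain.llms import TransformersLLM  # path per later refactor

llm = TransformersLLM.from_model_id(
    model_id="lmsys/vicuna-7b-v1.3",
    model_kwargs={"temperature": 0, "max_length": 256},
)
prompt = PromptTemplate(input_variables=["question"],
                        template="Q: {question}\nA:")
print(LLMChain(llm=llm, prompt=prompt).run("What is BigDL-LLM?"))
```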
binbin Deng
14626fe05b LLM: refactor transformers and langchain class name (#8470) 2023-07-06 17:16:44 +08:00
binbin Deng
77808fa124 LLM: fix n_batch in starcoder pybinding (#8461) 2023-07-05 17:06:50 +08:00
Yina Chen
f2bb469847 [WIP] LLM llm-cli chat mode (#8440)
* fix timezone

* temp

* Update linux interactive mode

* modify init text for interactive mode

* meet comments

* update

* win script

* meet comments
2023-07-05 14:04:17 +08:00
binbin Deng
e54e52b438 LLM: fix n_batch in bloom pybinding (#8454) 2023-07-04 15:10:32 +08:00
Yang Wang
449aea7ffc Optimize transformer int4 loading memory (#8400)
* Optimize transformer int4 loading memory

* move cast to convert

* default setting low_cpu_mem_usage
2023-06-30 20:12:12 -07:00
Zhao Changmin
cc76ec809a check out dir (#8395) 2023-06-27 21:28:39 +08:00
Xin Qiu
e68d631c0a gptq2ggml: support loading safetensors model. (#8401)
* update convert gptq to ggml

* update convert gptq to ggml

* gptq to ggml

* update script

* meet code review

* meet code review
2023-06-27 11:19:33 +08:00
binbin Deng
19e19efb4c LLM: raise warning instead of error when use unsupported parameters (#8382) 2023-06-26 13:23:55 +08:00
Shengsheng Huang
c113ecb929 [LLM] langchain bloom, UT's, default parameters (#8357)
* update langchain default parameters to align w/ api

* add ut's for llm and embeddings

* update inference test script to install langchain deps

* update tests workflows

---------

Co-authored-by: leonardozcm <changmin.zhao@intel.com>
2023-06-25 17:38:00 +08:00
Shengsheng Huang
446175cc05 transformer api refactor (#8389)
* transformer api refactor

* fix style

* add huggingface tokenizer usage in example and make ggml tokenizer option 1 and huggingface tokenizer option 2

* fix style
2023-06-25 17:15:33 +08:00
Yang Wang
ce6d06eb0a Support directly quantizing huggingface transformers into 4bit format (#8371)
* Support directly quantizing huggingface transformers into 4bit format

* refine example

* license

* fix bias

* address comments

* move to ggml transformers

* fix example

* fix style

* fix style

* address comments

* rename

* change API

* fix style

* add lm head to conversion

* address comments
2023-06-25 16:35:06 +08:00
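This is the origin of the drop-in from_pretrained path: load any HF checkpoint and quantize to 4-bit on the fly. A sketch using the module layout from the later unified API; the model id is illustrative:

```python
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM  # later unified path

model = AutoModelForCausalLM.from_pretrained("openlm-research/open_llama_3b",
                                             load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")
inputs = tokenizer("Once upon a time", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```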
binbin Deng
03c5fb71a8 LLM: fix ModuleNotFoundError when use llm-cli (#8378) 2023-06-21 15:03:14 +08:00
Ruonan Wang
7296453f07 LLM: support starcoder in llm-cli (#8377)
* support starcoder in cli

* small fix
2023-06-21 14:38:30 +08:00
Ruonan Wang
50af0251e4 LLM: First commit of StarCoder pybinding (#8354)
* first commit of starcoder

* update setup.py and fix style

* add starcoder_cpp, fix style

* fix style

* support windows binary

* update pybinding

* fix style, add avx2 binary

* small fix

* fix style
2023-06-21 13:23:06 +08:00
Yuwen Hu
7ef1c890eb [LLM] Supports GPTQ convert in transfomers-like API, and supports folder outfile for llm-convert (#8366)
* Add docstrings to llm_convert

* Small docstrings fix

* Unify outfile type to be a folder path for either gptq or pth model_format

* Supports gptq model input for from_pretrained

* Fix example and readme

* Small fix

* Python style fix

* Bug fix in llm_convert

* Python style check

* Fix based on comments

* Small fix
2023-06-20 17:42:38 +08:00
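llm-convert (and its Python twin llm_convert) now writes into a folder for both pth and gptq inputs. A minimal sketch of the Python form; paths are placeholders:

```python
from bigdl.llm import llm_convert

# Convert a HF/pth checkpoint folder into a ggml-format INT4 binary;
# outfile is a folder, per this PR, for either pth or gptq model_format.
out_path = llm_convert(model="/path/to/llama-7b-hf/",
                       outfile="/path/to/output/",
                       outtype="int4",
                       model_family="llama")
print(out_path)
```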
Zhao Changmin
4ec46afa4f LLM: Align converting GPTQ model API with transformer style (#8365)
* LLM: Align GPTQ API with transformer style
2023-06-20 14:27:41 +08:00
Ruonan Wang
f99d348954 LLM: convert and quantize support for StarCoder (#8359)
* basic support for starcoder

* update from_pretrained

* fix bug and fix style
2023-06-20 13:39:35 +08:00
binbin Deng
5f4f399ca7 LLM: fix bugs during supporting bloom in langchain (#8362) 2023-06-20 13:30:37 +08:00
Zhao Changmin
30ac9a70f5 LLM: fix expected 2 blank lines (#8360) 2023-06-19 18:10:02 +08:00
Zhao Changmin
c256cd136b LLM: Fix ggml return value (#8358)
* ggml return original value
2023-06-19 17:02:56 +08:00
Zhao Changmin
d4027d7164 fix typos in llm_convert (#8355) 2023-06-19 16:17:21 +08:00
Zhao Changmin
4d177ca0a1 LLM: Merge convert pth/gptq model script into one shell script (#8348)
* convert model in one

* model type

* license

* readme and pep8

* ut path

* rename

* readme

* fix docs

* without lines
2023-06-19 11:50:05 +08:00
Ruonan Wang
9daf543e2f LLM: Update convert of gptneox to sync with new libgptneox.so (#8345)
Ruonan Wang
f7f4e65788 LLM: support int8 and tmp_path for from_pretrained (#8338) 2023-06-15 14:48:21 +08:00
Ruonan Wang
5094970175 LLM: update convert_model to support int8 (#8326)
* update example and convert_model for int8

* reset example

* fix style
2023-06-15 09:25:07 +08:00
binbin Deng
f64e703083 LLM: first add _tokenize, detokenize and _generate for bloom pybinding (#8316) 2023-06-14 17:29:57 +08:00
Xin Qiu
5576679a92 add convert-gptq-to-ggml.py to bigdl-llama (#8298) 2023-06-14 14:51:51 +08:00
Ruonan Wang
a6c4b733cb LLM: Update subprocess to show error message (#8323)
* update subprocess

* fix style
2023-06-13 16:43:37 +08:00
Shengsheng Huang
02c583144c [LLM] langchain integrations and examples (#8256)
* langchain intergrations and examples

* add licences and rename

* add licences

* fix license issues and change backbone to model_family

* update examples to use model_family param

* fix linting

* fix code style

* exclude langchain integration from stylecheck

* update langchain examples and update integrations based on latest changes

* update simple llama-cpp-python style API example

* remove bloom in README

* change default n_threads to 2 and remove redundant code

---------

Co-authored-by: leonardozcm <changmin.zhao@intel.com>
2023-06-12 19:22:07 +08:00
xingyuan li
c4028d507c [LLM] Add unified default value for cli programs (#8310)
* add unified default value for threads and n_predict
2023-06-12 16:30:27 +08:00
binbin Deng
5d5da7b2c7 LLM: optimize namespace and remove unused import logic (#8302) 2023-06-09 15:17:49 +08:00
Ruonan Wang
5d0e130605 LLM: fix convert path error of gptneox and bloom on windows (#8304) 2023-06-09 10:10:19 +08:00
Yina Chen
7bfa0fcdf9 fix style (#8300) 2023-06-08 16:52:17 +08:00
Yina Chen
637b72f2ad [LLM] llm transformers api support batch actions (#8288)
* llm transformers api support batch actions

* align with transformer

* meet comment
2023-06-08 15:10:08 +08:00