Ruonan Wang
dc995006cc
LLM: add flash attention for mistral / mixtral ( #9846 )
* add flash attention for mistral
* update
* add flash attn for mixtral
* fix style
2024-01-08 09:51:34 +08:00
Yishuo Wang
afaa871144
[LLM] support quantize kv cache to fp8 ( #9812 )
2024-01-08 09:28:20 +08:00
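The fp8 KV cache above is a runtime toggle rather than a new API; a minimal sketch of how it might be enabled, assuming an environment-variable switch (the variable name is not confirmed by this log):

```python
import os

# Assumed toggle name; would need to be set before bigdl.llm is imported.
os.environ["BIGDL_QUANTIZE_KV_CACHE"] = "1"

from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # placeholder model id
    load_in_4bit=True,
).to("xpu")
```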
Jiao Wang
248ae7fad2
LLaMA optimize_model to support transformers 4.36 ( #9818 )
* support 4.36
* style
* update
* update
* update
2024-01-05 11:30:18 -08:00
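For context on the optimize_model path this commit extends, a minimal sketch (the model id is a placeholder): load with stock transformers 4.36, then hand the model to bigdl-llm.

```python
from transformers import AutoModelForCausalLM
from bigdl.llm import optimize_model

# Load with stock transformers (4.36 here), then apply bigdl-llm's
# low-bit optimizations in place.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # placeholder model id
    torch_dtype="auto",
)
model = optimize_model(model)
```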
Ruonan Wang
a60bda3324
LLM: update check for deepspeed ( #9838 )
2024-01-05 16:44:10 +08:00
Ruonan Wang
16433dd959
LLM: fix first token judgement of flash attention ( #9841 )
* fix flash attention
* meet code review
* fix
2024-01-05 13:49:37 +08:00
Yina Chen
f919f5792a
fix kv cache out of bound ( #9827 )
2024-01-05 12:38:57 +08:00
Ruonan Wang
5df31db773
LLM: fix accuracy issue of chatglm3 ( #9830 )
* add attn mask for first token
* fix
* fix
* change attn calculation
* fix
* fix
* fix style
* fix style
2024-01-05 10:52:05 +08:00
Xiangyu Tian
38c05be1c0
[LLM] Fix dtype mismatch in Baichuan2-13b ( #9834 )
2024-01-04 15:34:42 +08:00
Ziteng Zhang
05b681fa85
[LLM] IPEX auto importer set on by default ( #9832 )
* Set BIGDL_IMPORT_IPEX default to True
* Remove import intel_extension_for_pytorch as ipex from GPU example
2024-01-04 13:33:29 +08:00
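In practice the change above means GPU scripts no longer need the explicit IPEX import; a sketch of before and after (only the env-var name comes from the commit body, the opt-out value is assumed):

```python
# Before: GPU examples imported IPEX by hand.
#   import intel_extension_for_pytorch as ipex
#
# After: with BIGDL_IMPORT_IPEX defaulting to True, importing bigdl.llm
# pulls IPEX in automatically.
from bigdl.llm.transformers import AutoModelForCausalLM

# To opt out, set the variable before any bigdl.llm import
# (accepted values assumed):
#   import os; os.environ["BIGDL_IMPORT_IPEX"] = "0"
```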
Wang, Jian4
4ceefc9b18
LLM: Support bitsandbytes config on qlora finetune ( #9715 )
* test support bitsandbytesconfig
* update style
* update cpu example
* update example
* update readme
* update unit test
* use bfloat16
* update logic
* use int4
* set default bnb_4bit_use_double_quant
* update
* update example
* update model.py
* update
* support lora example
2024-01-04 11:23:16 +08:00
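BitsAndBytesConfig itself is the standard transformers API; how the QLoRA path consumes it is sketched below from the commit bullets (int4, bfloat16, double quant), so treat the wiring as an assumption:

```python
import torch
from transformers import BitsAndBytesConfig
from bigdl.llm.transformers import AutoModelForCausalLM

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # "use int4"
    bnb_4bit_use_double_quant=True,         # default per the commit body
    bnb_4bit_quant_type="nf4",              # assumed quant type
    bnb_4bit_compute_dtype=torch.bfloat16,  # "use bfloat16"
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
)
```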
Ruonan Wang
20e9742fa0
LLM: fix chatglm3 issue ( #9820 )
* fix chatglm3 issue
* small update
2024-01-03 16:15:55 +08:00
Wang, Jian4
a54cd767b1
LLM: Add gguf falcon ( #9801 )
* init falcon
* update convert.py
* update style
2024-01-03 14:49:02 +08:00
Qiyuan Gong
f0f9d45eac
[LLM] IPEX import support bigdl-core-xe-21 ( #9769 )
Add support for bigdl-core-xe-21.
2023-12-28 15:23:58 +08:00
Guancheng Fu
5857a38321
[vLLM] Add option to adjust KV_CACHE_ALLOC_BLOCK_LENGTH ( #9782 )
* add option kv_cache_block
* change var name
2023-12-28 14:41:47 +08:00
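KV_CACHE_ALLOC_BLOCK_LENGTH is named in the title; how the new option is exposed is not shown here, so the environment-variable form below is purely an assumption:

```python
import os

# Assumed mechanism: override the block length before the engine starts.
# Only the constant's name comes from the commit title.
os.environ["KV_CACHE_ALLOC_BLOCK_LENGTH"] = "128"
```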
Ruonan Wang
99bddd3ab4
LLM: better FP16 support for Intel GPUs ( #9791 )
* initial support
* fix
* fix style
* fix
* limit esimd usage condition
* refactor code
* fix style
* small fix
* meet code review
* small fix
2023-12-28 13:30:13 +08:00
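The FP16 path above is reachable through the existing low-bit loader; a minimal sketch (the model id is a placeholder, and whether the new ESIMD kernels trigger for a given shape is decided internally):

```python
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # placeholder model id
    load_in_low_bit="fp16",           # FP16 weights on Intel GPU
    optimize_model=True,
).to("xpu")
```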
Yishuo Wang
7d9f6c6efc
fix cpuinfo error ( #9793 )
2023-12-28 09:23:44 +08:00
Wang, Jian4
7ed9538b9f
LLM: support gguf mpt ( #9773 )
* add gguf mpt
* update
2023-12-28 09:22:39 +08:00
Cengguang Zhang
d299f108d0
update falcon attention forward. ( #9796 )
2023-12-28 09:11:59 +08:00
Kai Huang
689889482c
Reduce max_cache_pos to reduce Baichuan2-13B memory ( #9694 )
* optimize baichuan2 memory
* fix
* style
* fp16 mask
* disable fp16
* fix style
* empty cache
* revert empty cache
2023-12-26 19:51:25 +08:00
Xiangyu Tian
0ea842231e
[LLM] vLLM: Add api_server entrypoint ( #9783 )
Add vllm.entrypoints.api_server for benchmark_serving.py in vllm.
2023-12-26 16:03:57 +08:00
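Assuming the new entrypoint mirrors upstream vLLM's api_server under the bigdl.llm namespace (the module path and flags below are assumptions), it would be launched and queried like this:

```python
# Launch (assumed module path; flags follow upstream api_server):
#   python -m bigdl.llm.vllm.entrypoints.api_server --model /path/to/model --port 8000

import requests

# /generate is upstream api_server's endpoint; benchmark_serving.py
# targets the same route.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "San Francisco is a", "n": 1, "temperature": 0.0},
)
print(resp.json())
```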
Ruonan Wang
11d883301b
LLM: fix wrong batch output caused by flash attention ( #9780 )
* fix
* meet code review
* move batch size check to the beginning
* move qlen check inside function
* meet code review
2023-12-26 09:41:27 +08:00
Heyang Sun
66e286a73d
Support for Mixtral AWQ ( #9775 )
* Support for Mixtral AWQ
* Update README.md
* Update README.md
* Update awq_config.py
* Update README.md
* Update README.md
2023-12-25 16:08:09 +08:00
Ruonan Wang
1917bbe626
LLM: fix BF16Linear related training & inference issue ( #9755 )
* fix bf16 related issue
* fix
* update based on comment & add arc lora script
* update readme
* update based on comment
* update based on comment
* update
* force to bf16
* fix style
* move check input dtype into function
* update convert
* meet code review
* meet code review
* update merged model to support new training_mode api
* fix typo
2023-12-25 14:49:30 +08:00
Xiangyu Tian
30dab36f76
[LLM] vLLM: Fix kv cache init ( #9771 )
Fix kv cache init
2023-12-25 14:17:06 +08:00
Yina Chen
449b387125
Support relora in bigdl-llm ( #9687 )
* init
* fix style
* update
* support resume & update readme
* update
* update
* remove important
* add training mode
* meet comments
2023-12-25 14:04:28 +08:00
Ziteng Zhang
986f65cea9
[LLM] Add trust_remote_code for local renamed model in bigdl_llm_model.py ( #9762 )
2023-12-25 11:31:14 +08:00
Guancheng Fu
daf536fb2d
vLLM: Apply attention optimizations for selective batching ( #9758 )
* fuse_rope for prefill
* apply kv_cache optimizations
* apply fast_decoding_path
* Re-enable kv_cache optimizations for prefill
* reduce KV_CACHE_ALLOC_BLOCK for selective_batching
2023-12-25 10:29:31 +08:00
Qiyuan Gong
4c487313f2
Revert "[LLM] IPEX auto importer turn on by default for XPU ( #9730 )" ( #9759 )
This reverts commit 0284801fbd.
2023-12-22 16:38:24 +08:00
Qiyuan Gong
0284801fbd
[LLM] IPEX auto importer turn on by default for XPU ( #9730 )
* Set BIGDL_IMPORT_IPEX default to true, i.e., auto import IPEX for XPU.
* Remove import intel_extension_for_pytorch as ipex from GPU example.
* Add support for bigdl-core-xe-21.
2023-12-22 16:20:32 +08:00
Guancheng Fu
fdf93c9267
Implement selective batching for vLLM ( #9659 )
* add control to load hf model
* finish initial version of selective_batching
* temp
* finish
* Remove print statement
* fix error
* Apply yang's optimization
* a version that works
* We need to check kv_cache passed in, this could be an error. TODO: add fast decoding path
* format
* temp solution: not batching prefill requests
* a version that works for prefill batching
* format
* a solid version: works normally
* a temp version
* Solid version: remove redundant functions
* fix format
* format
* solid: add option to enable selective_batching
* remove logic for using transformer models
* format
* format
* solid: enable argument VLLM_ENABLE_SELECTIVE_BATCHING
* format
* finish
* format
2023-12-22 13:45:46 +08:00
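VLLM_ENABLE_SELECTIVE_BATCHING is named in the commit body; the accepted values are not shown, so the snippet below is a sketch:

```python
import os

# Assumed value format; set before the vLLM engine is created.
os.environ["VLLM_ENABLE_SELECTIVE_BATCHING"] = "true"
```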
Ruonan Wang
2f36769208
LLM: bigdl-llm lora support & lora example ( #9740 )
* lora support and single card example
* support multi-card, refactor code
* fix model id and style
* remove torch patch, add two new class for bf16, update example
* fix style
* change to training_mode
* small fix
* add more info in help
* fix style, update readme
* fix ut
* fix ut
* Handling compatibility issues with default LoraConfig
2023-12-22 11:05:39 +08:00
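A rough sketch of the LoRA flow this commit adds, pieced together from the bullets above (BF16 classes, training_mode); the exact module layout is an assumption based on bigdl-llm's existing qlora helpers:

```python
from peft import LoraConfig
from bigdl.llm.transformers import AutoModelForCausalLM
from bigdl.llm.transformers.qlora import (
    get_peft_model,
    prepare_model_for_kbit_training,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model id
    load_in_low_bit="bf16",      # the new BF16 classes back this path
)
model = prepare_model_for_kbit_training(model)
config = LoraConfig(
    r=8, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
```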
SONG Ge
ba0b939579
[LLM] Support transformers-v4.36.0 on mistral model ( #9744 )
* add support transformers-v4.36.0 on mistral model
* python/llm/src/bigdl/llm/transformers/models/mistral.py
* make the redundant implementation as utils
* fix code style
* fix
* fix style
* update with utils enough_kv_room
2023-12-22 09:59:27 +08:00
Xin Qiu
e36111e713
mixtral fused qkv and rope ( #9724 )
* mixtral fused qkv and rope
* fix and clean
* fix style
* update
* update
* fix
* update
* fix
2023-12-22 09:26:35 +08:00
Jiao Wang
e4f6e43675
safetensors to false ( #9728 )
2023-12-21 14:41:51 -08:00
Yishuo Wang
426660b88e
simplify qwen attention ( #9747 )
2023-12-21 17:53:29 +08:00
Wang, Jian4
984697afe2
LLM: Add bloom gguf support ( #9734 )
* init
* update bloom add merges
* update
* update readme
* update for llama error
* update
2023-12-21 14:06:25 +08:00
Heyang Sun
df775cf316
fix python style ( #9742 )
* fix python style
* fix
* fix
2023-12-21 11:25:05 +08:00
Xin Qiu
6c3e698bf1
mistral decoding_fast_path and fused mlp ( #9714 )
* mistral decoding_fast_path and fused mlp
* meet code review
2023-12-21 10:11:37 +08:00
Heyang Sun
d157f623b6
Load Mixtral gguf in a block-wise way ( #9725 )
* Load Mixtral gguf in a block-wise way
* refine
2023-12-21 10:03:23 +08:00
Zhao Changmin
4bda975a3e
LLM: Align lowbit model config ( #9735 )
* align lowbit model config
2023-12-21 09:48:58 +08:00
Wang, Jian4
e1e921f425
LLM: gguf other model using dtype ( #9729 )
2023-12-21 09:33:40 +08:00
Yishuo Wang
13ea6330bd
optimize qwen rope ( #9737 )
2023-12-20 17:34:34 +08:00
Ziteng Zhang
4c032a433e
[LLM] Add glibc checker ( #9624 )
* Add glibc checker
* Add env BIGDL_GLIBC_CHECK to control glibc checker. The default is false, i.e., don't check.
2023-12-20 16:52:43 +08:00
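Both the variable name and its false default come from the commit body; enabling the check is then just:

```python
import os

os.environ["BIGDL_GLIBC_CHECK"] = "true"  # default is false, i.e. no check
import bigdl.llm  # the checker runs at import time (assumed behavior)
```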
Yina Chen
cd652a1710
Support fp8 e5m2 on arc ( #9711 )
* init
* fix style
* update
* fix style
* update
2023-12-20 16:26:17 +08:00
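fp8 e5m2 presumably surfaces as one more low-bit format string; "fp8_e5m2" below is an assumption extrapolated from the existing load_in_low_bit values, since the commit only says fp8 e5m2 is supported on Arc:

```python
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # placeholder model id
    load_in_low_bit="fp8_e5m2",       # assumed format string
).to("xpu")                           # Arc GPU
```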
Yishuo Wang
e54c428d30
add bf16/fp16 fuse mlp support ( #9726 )
2023-12-20 10:40:45 +08:00
Heyang Sun
612651cb5d
fix typo ( #9723 )
2023-12-20 09:41:59 +08:00
Yishuo Wang
522cf5ed82
[LLM] Improve chatglm2/3 rest token performance with long context ( #9716 )
2023-12-19 17:29:38 +08:00
Yishuo Wang
f2e6abb563
fix mlp batch size check ( #9718 )
2023-12-19 14:22:22 +08:00
Heyang Sun
1fa7793fc0
Load Mixtral GGUF Model ( #9690 )
* Load Mixtral GGUF Model
* refactor
* fix empty tensor when to cpu
* update gpu and cpu readmes
* add dtype when set tensor into module
2023-12-19 13:54:38 +08:00
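A sketch of loading a GGUF checkpoint through bigdl-llm; the from_gguf classmethod and its (model, tokenizer) return value are assumptions based on how the loader is documented elsewhere, not on this log:

```python
from bigdl.llm.transformers import AutoModelForCausalLM

# Assumed signature; the gguf path is a placeholder.
model, tokenizer = AutoModelForCausalLM.from_gguf(
    "mixtral-8x7b-instruct-v0.1.Q4_0.gguf",
)
```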
Qiyuan Gong
d0a3095b97
[LLM] IPEX auto importer ( #9706 )
* IPEX auto importer and get_ipex_version.
* Add BIGDL_IMPORT_IPEX to control auto import, default is false.
2023-12-19 13:39:38 +08:00