Xiangyu Tian
0ea842231e
[LLM] vLLM: Add api_server entrypoint ( #9783 )
...
Add vllm.entrypoints.api_server for benchmark_serving.py in vllm.
2023-12-26 16:03:57 +08:00
Ruonan Wang
11d883301b
LLM: fix wrong batch output caused by flash attention ( #9780 )
...
* fix
* meet code review
* move batch size check to the beginning
* move qlen check inside function
* meet code review
2023-12-26 09:41:27 +08:00
Heyang Sun
66e286a73d
Support for Mixtral AWQ ( #9775 )
...
* Support for Mixtral AWQ
* Update README.md
* Update README.md
* Update awq_config.py
* Update README.md
* Update README.md
2023-12-25 16:08:09 +08:00
Ruonan Wang
1917bbe626
LLM: fix BF16Linear related training & inference issue ( #9755 )
...
* fix bf16 related issue
* fix
* update based on comment & add arc lora script
* update readme
* update based on comment
* update based on comment
* update
* force to bf16
* fix style
* move check input dtype into function
* update convert
* meet code review
* meet code review
* update merged model to support new training_mode api
* fix typo
2023-12-25 14:49:30 +08:00
Xiangyu Tian
30dab36f76
[LLM] vLLM: Fix kv cache init ( #9771 )
...
Fix kv cache init
2023-12-25 14:17:06 +08:00
Yina Chen
449b387125
Support relora in bigdl-llm ( #9687 )
...
* init
* fix style
* update
* support resume & update readme
* update
* update
* remove important
* add training mode
* meet comments
2023-12-25 14:04:28 +08:00
Ziteng Zhang
986f65cea9
[LLM] Add trust_remote_code for local renamed model in bigdl_llm_model.py ( #9762 )
2023-12-25 11:31:14 +08:00
Guancheng Fu
daf536fb2d
vLLM: Apply attention optimizations for selective batching ( #9758 )
...
* fuse_rope for prefill
* apply kv_cache optimizations
* apply fast_decoding_path
* Re-enable kv_cache optimizations for prefill
* reduce KV_CACHE_ALLOC_BLOCK for selective_batching
2023-12-25 10:29:31 +08:00
Qiyuan Gong
4c487313f2
Revert "[LLM] IPEX auto importer turn on by default for XPU ( #9730 )" ( #9759 )
...
This reverts commit 0284801fbd.
2023-12-22 16:38:24 +08:00
Qiyuan Gong
0284801fbd
[LLM] IPEX auto importer turn on by default for XPU ( #9730 )
...
* Set BIGDL_IMPORT_IPEX default to true, i.e., auto import IPEX for XPU.
* Remove import intel_extension_for_pytorch as ipex from GPU example.
* Add support for bigdl-core-xe-21.
2023-12-22 16:20:32 +08:00
Guancheng Fu
fdf93c9267
Implement selective batching for vLLM ( #9659 )
...
* add control to load hf model
* finish initial version of selective_batching
* temp
* finish
* Remove print statement
* fix error
* Apply yang's optimization
* a version that works
* We need to check kv_cache passed in, this could be an error. TODO: add fast decoding path
* format
* temp solution: not batching prefill requests
* a version that works for prefill batching
* format
* a solid version: works normally
* a temp version
* Solid version: remove redundant functions
* fix format
* format
* solid: add option to enable selective_batching
* remove logic for using transformer models
* format
* format
* solid: enable argument VLLM_ENABLE_SELECTIVE_BATCHING
* format
* finish
* format
2023-12-22 13:45:46 +08:00
Ruonan Wang
2f36769208
LLM: bigdl-llm lora support & lora example ( #9740 )
...
* lora support and single card example
* support multi-card, refactor code
* fix model id and style
* remove torch patch, add two new classes for bf16, update example
* fix style
* change to training_mode
* small fix
* add more info in help
* fix style, update readme
* fix ut
* fix ut
* Handling compatibility issues with default LoraConfig
2023-12-22 11:05:39 +08:00
SONG Ge
ba0b939579
[LLM] Support transformers-v4.36.0 on mistral model ( #9744 )
...
* add support transformers-v4.36.0 on mistral model
* python/llm/src/bigdl/llm/transformers/models/mistral.py
* make the redundant implementation as utils
* fix code style
* fix
* fix style
* update with utils enough_kv_room
2023-12-22 09:59:27 +08:00
Xin Qiu
e36111e713
mixtral fused qkv and rope ( #9724 )
...
* mixtral fused qkv and rope
* fix and clean
* fix style
* update
* update
* fix
* update
* fix
2023-12-22 09:26:35 +08:00
Jiao Wang
e4f6e43675
safetensors to false ( #9728 )
2023-12-21 14:41:51 -08:00
Yishuo Wang
426660b88e
simplify qwen attention ( #9747 )
2023-12-21 17:53:29 +08:00
Wang, Jian4
984697afe2
LLM: Add bloom gguf support ( #9734 )
...
* init
* update bloom add merges
* update
* update readme
* update for llama error
* update
2023-12-21 14:06:25 +08:00
Heyang Sun
df775cf316
fix python style ( #9742 )
...
* fix python style
* fix
* fix
2023-12-21 11:25:05 +08:00
Xin Qiu
6c3e698bf1
mistral decoding_fast_path and fused mlp ( #9714 )
...
* mistral decoding_fast_path and fused mlp
* meet code review
2023-12-21 10:11:37 +08:00
Heyang Sun
d157f623b6
Load Mixtral gguf in a block-wise way ( #9725 )
...
* Load Mixtral gguf in a block-wise way
* refine
2023-12-21 10:03:23 +08:00
Zhao Changmin
4bda975a3e
LLM: Align lowbit model config ( #9735 )
...
* align lowbit model config
2023-12-21 09:48:58 +08:00
Wang, Jian4
e1e921f425
LLM: gguf other model using dtype ( #9729 )
2023-12-21 09:33:40 +08:00
Yishuo Wang
13ea6330bd
optimize qwen rope ( #9737 )
2023-12-20 17:34:34 +08:00
Ziteng Zhang
4c032a433e
[LLM] Add glibc checker ( #9624 )
...
* Add glibc checker
* Add env BIGDL_GLIBC_CHECK to control glibc checker. The default is false, i.e., don't check.
2023-12-20 16:52:43 +08:00
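The entry above describes an opt-in check controlled by the BIGDL_GLIBC_CHECK environment variable, disabled by default. A minimal sketch of how such a gate could look is given below. Only the variable name and its default come from the commit message; the helper names `env_flag` and `maybe_check_glibc`, the accepted truthy spellings, and the minimum version (2, 17) are assumptions for illustration, not the code in bigdl-llm.

```python
import os
import platform


def env_flag(name, default=False):
    """Read a boolean environment variable (hypothetical helper).

    Unset variables fall back to `default`; the accepted truthy
    spellings are an assumption of this sketch.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


def maybe_check_glibc(min_version=(2, 17)):
    """Return None when the check is disabled, otherwise True or False.

    `min_version` is a placeholder threshold, not a value taken from
    the commit.
    """
    if not env_flag("BIGDL_GLIBC_CHECK"):
        return None  # default behaviour: don't check
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        return False  # non-glibc platform or version not detectable
    try:
        found = tuple(int(part) for part in version.split(".")[:2])
    except ValueError:
        return False  # unexpected version string format
    return found >= min_version
```

With the variable unset, `maybe_check_glibc()` returns None and nothing is checked; setting `BIGDL_GLIBC_CHECK=true` in the environment makes the same call return a boolean.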
Yina Chen
cd652a1710
Support fp8 e5m2 on arc ( #9711 )
...
* init
* fix style
* update
* fix style
* update
2023-12-20 16:26:17 +08:00
Yishuo Wang
e54c428d30
add bf16/fp16 fuse mlp support ( #9726 )
2023-12-20 10:40:45 +08:00
Heyang Sun
612651cb5d
fix typo ( #9723 )
2023-12-20 09:41:59 +08:00
Yishuo Wang
522cf5ed82
[LLM] Improve chatglm2/3 rest token performance with long context ( #9716 )
2023-12-19 17:29:38 +08:00
Yishuo Wang
f2e6abb563
fix mlp batch size check ( #9718 )
2023-12-19 14:22:22 +08:00
Heyang Sun
1fa7793fc0
Load Mixtral GGUF Model ( #9690 )
...
* Load Mixtral GGUF Model
* refactor
* fix empty tensor when to cpu
* update gpu and cpu readmes
* add dtype when set tensor into module
2023-12-19 13:54:38 +08:00
Qiyuan Gong
d0a3095b97
[LLM] IPEX auto importer ( #9706 )
...
* IPEX auto importer and get_ipex_version.
* Add BIGDL_IMPORT_IPEX to control auto import, default is false.
2023-12-19 13:39:38 +08:00
Yang Wang
f4fb58d99c
fusing qkv project and rope ( #9612 )
...
* Try fusing qkv project and rope
* add fused mlp
* fuse append cache
* fix style and clean up code
* clean up
2023-12-18 16:45:00 -08:00
Cengguang Zhang
4d22add4af
LLM: fix qwen efficiency issue in perf-test.
2023-12-18 18:32:54 +08:00
Ruonan Wang
8ed89557e5
LLM: add mlp optimization of mixtral ( #9709 )
2023-12-18 16:59:52 +08:00
Xin Qiu
320110d158
handle empty fused norm result ( #9688 )
...
* handle empty fused norm result
* remove fast_rms_norm
* fix style
2023-12-18 09:56:11 +08:00
SONG Ge
d5b81af7bd
Support mixtral attention optimization on transformers-v4.36.0 ( #9674 )
...
* add example code to support mistral/mixtral attention on transformers v4.36.0
* update
* style fix
* add update for seen-tokens
* support mixtral
* rm mistral change
* small fix
* add more comments and remove use_cache part
---------
Co-authored-by: plusbang <binbin1.deng@intel.com>
2023-12-15 14:30:23 +08:00
Cengguang Zhang
adbef56001
LLM: update qwen attention forward. ( #9695 )
...
* feat: update qwen attention forward.
* fix: style.
2023-12-15 14:06:15 +08:00
Wang, Jian4
b8437a1c1e
LLM: Add gguf mistral model support ( #9691 )
...
* add mistral support
* need to upgrade transformers version
* update
2023-12-15 13:37:39 +08:00
Wang, Jian4
496bb2e845
LLM: Support load BaiChuan model family gguf model ( #9685 )
...
* support baichuan model family gguf model
* update gguf generate.py
* add verify models
* add support model_family
* update
* update style
* update type
* update readme
* update
* remove support model_family
2023-12-15 13:34:33 +08:00
Yishuo Wang
9a330bfc2b
fix fuse mlp when using q5_0 or fp8 ( #9689 )
2023-12-14 16:16:05 +08:00
Xin Qiu
5e46e0e5af
fix baichuan2-7b 1st token performance regression on xpu ( #9683 )
...
* fix baichuan2-7b 1st token performance regression
* add comments
* fix style
2023-12-14 09:58:32 +08:00
Yishuo Wang
09ca540f9b
use fuse mlp in qwen ( #9672 )
2023-12-13 17:20:08 +08:00
Ruonan Wang
c7741c4e84
LLM: update moe block convert to optimize rest token latency of Mixtral ( #9669 )
...
* update moe block convert
* further accelerate final_hidden_states
* fix style
* fix style
2023-12-13 16:17:06 +08:00
Xiangyu Tian
1c6499e880
[LLM] vLLM: Support Mixtral Model ( #9670 )
...
Add Mixtral support for BigDL vLLM.
2023-12-13 14:44:47 +08:00
Ruonan Wang
dc5b1d7e9d
LLM: integrate sdp kernel for FP16 rest token inference on GPU [DG2/ATSM] ( #9633 )
...
* integrate sdp
* update api
* fix style
* meet code review
* fix
* distinguish mtl from arc
* small fix
2023-12-13 11:29:57 +08:00
Qiyuan Gong
5b0e7e308c
[LLM] Add support for empty activation ( #9664 )
...
* Add support for empty activation, e.g., [0, 4096]. Empty activation is allowed by PyTorch.
* Add comments.
2023-12-13 11:07:45 +08:00
SONG Ge
284e7697b1
[LLM] Optimize ChatGLM2 kv_cache to support beam_search on ARC ( #9579 )
...
* optimize kv_cache to support beam_search on Arc
* correctness test update
* fix query_length issue
* simplify implementation
* only enable the optimization on gpu device
* limit the beam_search support only enabled with gpu device and batch_size > 1
* add comments for beam_search case and revert ut change
* meet comments
* add more comments to describe the difference between multi-cases
2023-12-13 11:02:14 +08:00
Ziteng Zhang
8931f2eb62
[LLM] Fix transformer qwen size mismatch and rename causal_mask ( #9655 )
...
* Fix size mismatching caused by context_layer
* Change registered_causal_mask to causal_mask
2023-12-12 20:57:40 +08:00
binbin Deng
59ce86d292
LLM: support optimize_model=True for Mixtral model ( #9657 )
2023-12-12 16:41:26 +08:00
Heyang Sun
9f02f96160
[LLM] support for Yi AWQ model ( #9648 )
2023-12-11 14:07:34 +08:00