Ruonan Wang
640998edea
update inter_pp of qwen2 ( #12041 )
2024-09-10 10:34:17 +08:00
Yishuo Wang
048b4590aa
add basic minicpm3 optimization ( #12039 )
2024-09-09 17:25:08 +08:00
Yishuo Wang
6cedb601e4
remove some useless code ( #12035 )
2024-09-06 17:51:08 +08:00
binbin Deng
d2e1b9aaff
Add input padding during prefill for qwen2-7b ( #12033 )
2024-09-06 16:39:59 +08:00
Ruonan Wang
0d04531ae0
update NPU readme of Qwen2 ( #12032 )
...
* update readme
* update broadcast
2024-09-06 15:02:39 +08:00
Yang Wang
58555bd9de
Optimize broadcast for npu llama ( #12028 )
2024-09-06 13:28:20 +08:00
binbin Deng
845e5dc89e
Support lm_head of minicpm-2b on NPU ( #12019 )
2024-09-05 16:19:22 +08:00
Guoqiong Song
8803242f5c
fix llama on cpu ( #12018 )
2024-09-04 19:17:54 -07:00
Wang, Jian4
b3b2cd64b4
Support lightweight-serving glm-4v-9b ( #11994 )
...
* enable glm-4v-9b serving
* update readme
* update for no image input
2024-09-05 09:25:08 +08:00
Wang, Jian4
2b993ad479
vllm update for glm-4 model automatic not_convert ( #12003 )
2024-09-04 13:50:32 +08:00
Ruonan Wang
9eaff5e47d
add save & load support for NPU optimized model ( #11999 )
...
* add save & load support
* fix style
2024-09-03 20:53:22 +08:00
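For readers unfamiliar with this flow, a minimal sketch of what save & load of an NPU-optimized model typically looks like in ipex-llm (the npu_model import path and the save_low_bit/load_low_bit names mirror ipex-llm's existing low-bit API and are assumptions here, not this PR's code):

    # Hedged sketch: convert once, save the optimized weights, then reload
    # later without repeating the conversion. Names are assumed, not taken
    # from #11999 itself.
    from ipex_llm.transformers.npu_model import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2-7B-Instruct", load_in_low_bit="sym_int4", trust_remote_code=True
    )
    model.save_low_bit("./qwen2-npu-low-bit")   # persist converted weights
    model = AutoModelForCausalLM.load_low_bit("./qwen2-npu-low-bit")  # reload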
Yuwen Hu
6eb55653ba
Performance mode strategy update for input_embeds input ( #11997 )
2024-09-03 17:46:16 +08:00
binbin Deng
01099f08ee
Revert prefill logic of qwen2-7b ( #11992 )
2024-09-03 14:45:01 +08:00
Yuwen Hu
659d15defc
Fix wrong attention mask and garbage output for inputs_embeds inputs during lookup generation ( #11989 )
...
* Fix garbage output for inputs_embeds inputs during lookup generation
* Fix on sliding windows
* Simplify code
2024-09-02 19:09:12 +08:00
binbin Deng
2f3d1bd0ec
hotfix qwen2-7b weight setting ( #11991 )
2024-09-02 18:11:08 +08:00
binbin Deng
a40ea7038d
Fix AttributeError of qwen2-1.5B ( #11990 )
2024-09-02 17:55:10 +08:00
Yang Wang
c48817bd43
Support Qwen2-7b MLP in int4 and transpose_value_cache=True ( #11968 )
2024-09-02 14:37:44 +08:00
Ruonan Wang
573c20bae6
fix npu lm_head cpu condition ( #11976 )
...
* fix
* fix
* fix
* fix style
* fix style
* fix style
2024-08-30 17:11:26 +08:00
Ruonan Wang
60aa1a2c0f
Initial NPU support for MiniCPM-V-2_6 ( #11966 )
...
* initial pr
* update npu model
* fix
* fix kv cache type
* fix
* small fix
* fix style
* fix model id
* change inter_pp=4
* address comment
* fix
* fix style
* fix
* rebase
2024-08-30 16:34:35 +08:00
SONG Ge
158289d205
[NPU] Add initial support for minicpm-llama-v2.5 ( #11962 )
...
* add initial support for minicpm-llama-v2.5
* update impl
* add minicpm-llama3-v2.5 example
2024-08-30 16:00:33 +08:00
binbin Deng
cd077881f1
Disable lm head ( #11972 )
2024-08-30 11:05:18 +08:00
Wang, Jian4
7d103417b8
Fix glm4-9b-chat nan error on vllm 0.3.3 ( #11970 )
...
* fix nan value
* update
2024-08-30 09:50:18 +08:00
Yang Wang
fbf088f61e
remove obsolete npu code ( #11967 )
2024-08-29 14:16:44 -07:00
Yuwen Hu
a9e485eb1b
Support MiniCPM-V-2_6 multi-modal benchmarking with latency text streamer ( #11963 )
...
* Support MiniCPM-V-2_6 multi-modal benchmarking with latency text streamer
* Style fixes
2024-08-29 19:22:09 +08:00
Yina Chen
882f4a5ff7
Add lnl npu driver recommend version and enable cpu_lm_head on llama3 ( #11952 )
...
* update lnl npu driver version and enable cpu_lm_head on llama3
* update
* fix style
* typo
* address comments
* update
* add qwen2-7b
2024-08-29 15:01:18 +08:00
binbin Deng
71f03dcc39
Support qwen2-7b with fused decoderlayer optimization on NPU ( #11912 )
2024-08-29 13:34:20 +08:00
Jiao Wang
63ac5f64bb
Refactor NPU baichuan multiple-process ( #11945 )
...
* update
* add baichuan mp
* clean
* refactor
* merge
* style
* update
* update
2024-08-28 11:33:40 -07:00
SONG Ge
5ca7390082
[NPU] Add minicpm-2b support for npu multi-processing ( #11949 )
...
* add minicpm-2b support
* update example for minicpm-2b
* add LNL NPU driver requirement in readme
2024-08-28 18:08:49 +08:00
Yishuo Wang
0fbb10259a
use sdp_causal to reduce internvl2-4b memory usage if environment variable is set ( #11953 )
2024-08-28 17:35:05 +08:00
Guancheng Fu
0a7bd274e2
Add vllm awq loading logic ( #11950 )
...
* add vllm awq loading logic
* fix
* refine
2024-08-28 16:46:18 +08:00
Yina Chen
b38fb67bec
[NPU] lm head to cpu ( #11943 )
...
* lm head to cpu
* qwen2
* mv logic and add param to disable cpu_lm_head
* use env and lm_head opt to mp file
* fix
* update
* remove print
2024-08-28 16:34:07 +08:00
binbin Deng
bec00e2015
Improve baichuan2 NPU performance ( #11942 )
2024-08-27 18:37:08 +08:00
Zijie Li
90f692937d
Update npu baichuan2 ( #11939 )
2024-08-27 16:56:26 +08:00
Jiao Wang
b4b6ddf73c
NPU Baichuan2 Multi-Process example ( #11928 )
2024-08-27 15:25:49 +08:00
SONG Ge
e211a5b076
update minicpm to meet latest refactor ( #11937 )
2024-08-27 15:08:01 +08:00
binbin Deng
7c8c9a0670
Update benchmark script for NPU ( #11932 )
2024-08-27 14:41:14 +08:00
Zijie Li
6c3eb1e1e8
refactor from_pretrained API for NPU ( #11927 )
2024-08-27 09:50:30 +08:00
Xiangyu Tian
7ca557aada
LLM: Fix vLLM CPU convert error ( #11926 )
2024-08-27 09:22:19 +08:00
Yuwen Hu
c1d07bc626
Support streaming for lookup generation ( #11922 )
...
* Support streaming for lookup generation
* Small update
* Style fixes
* Add original generate fallback for batch inference and beam search; support input length threshold judgement for direct input with input_ids
* Fix lookup stream generate with eos token
* Small fixes
* Small fix
* index fix
* Small fix
2024-08-26 19:33:31 +08:00
SONG Ge
019f725d4d
[NPU] Add support for running mp minicpm model on npu ( #11909 )
...
* add initial support for npu minicpm mp
* fix minicpm-1b abnormal output error
2024-08-26 17:52:55 +08:00
Yuwen Hu
24c279e0ae
Update IPEX_LLM_PERFORMANCE_MODE with input length threshold ( #11908 )
...
* Update IPEX_LLM_PERFORMANCE_MODE with input length threshold
* Update based on comments, and add judgement for inputs_embeds
* Fix for benchmarking purposes
* Update based on comments
* Small fix
2024-08-23 20:49:15 +08:00
binbin Deng
303a090a6b
Add lm_head optimization on NPU ( #11903 )
2024-08-23 15:51:07 +08:00
Yina Chen
23631cd357
disable lm_head opt for baichuan2-13b ( #11905 )
2024-08-23 15:39:47 +08:00
hxsz1997
650e6e6ce4
Merge pull request #11891 from hxsz1997/baichuan2-compresskv
...
Add compress_kv for Baichuan2
2024-08-23 06:09:58 +03:00
Ruonan Wang
4a61f7d20d
update mlp of llama ( #11897 )
...
* update mlp of llama
* relax threshold of mlp test
* revert code
2024-08-22 20:34:53 +08:00
Yuwen Hu
420ce7d164
Fix non-stop at eos token problem for lookup generation ( #11896 )
...
* Fix non-stop by eos_token_id problem for lookup
* Small fix
* Add judgement when generation_config.eos_token_id is None
* Fix based on comments
2024-08-22 18:55:59 +08:00
Huang, Xinshengzi
4cf03d6212
update baichuan-7b
2024-08-22 18:16:33 +08:00
Guancheng Fu
278b191dc1
Fix optimize lm head error ( #11899 )
2024-08-22 17:45:26 +08:00
Wang, Jian4
5c4ed00593
Add lightweight-serving whisper asr example ( #11847 )
...
* add asr init
* update for pp
* update style
* update readme
* update readme
2024-08-22 15:46:28 +08:00
Huang, Xinshengzi
eb1e65f8a9
add comment
2024-08-22 15:14:47 +08:00
Huang, Xinshengzi
a2be3d7501
add comment of compress kv in attention forward
2024-08-22 15:11:55 +08:00
Huang, Xinshengzi
ce7de77085
add comment of change in model forward
2024-08-22 14:29:27 +08:00
Huang, Xinshengzi
42398a0045
add comment
2024-08-22 13:17:13 +08:00
Huang, Xinshengzi
48a827aa07
fix typos
2024-08-22 11:35:47 +08:00
Huang, Xinshengzi
8a5df93de2
fix typos
2024-08-22 11:33:07 +08:00
Huang, Xinshengzi
01ed397e7a
fix typos
2024-08-22 11:31:25 +08:00
Huang, Xinshengzi
c6ed1c412d
fix typos
2024-08-22 11:26:49 +08:00
Huang, Xinshengzi
2a0aa9271b
fix typos
2024-08-22 11:23:22 +08:00
Huang, Xinshengzi
4adadddbbc
fix typos
2024-08-22 11:12:23 +08:00
Huang, Xinshengzi
6a5ca17afc
* fix typos
2024-08-22 11:09:58 +08:00
binbin Deng
72a7bf624b
Support qwen2-1.5b with fused decoderlayer optimization on NPU ( #11888 )
2024-08-22 11:09:12 +08:00
Huang, Xinshengzi
6bb9035788
fix typos
2024-08-22 11:08:48 +08:00
Huang, Xinshengzi
86248b0505
add compress_kv for baichuan2
2024-08-22 10:59:08 +08:00
Yina Chen
cc27321441
support chatglm4 in lookup ( #11855 )
2024-08-21 15:53:17 +08:00
Yina Chen
0236de3ac2
set IPEX_LLM_LAST_LM_HEAD=1 as default ( #11885 )
2024-08-21 15:06:12 +08:00
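A minimal sketch of what "=1 as default" can mean for such an environment-variable flag (only the variable name comes from the commit title; the fallback lookup below is an assumption):

    import os

    # Hedged sketch: treat the flag as enabled unless the user explicitly
    # sets IPEX_LLM_LAST_LM_HEAD=0.
    last_lm_head = os.environ.get("IPEX_LLM_LAST_LM_HEAD", "1") == "1"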
Yang Wang
209d42ab79
Refactor npu mp to make it easier to integrate new models ( #11873 )
...
* Refactor npu mp to make it easier to integrate new models
* fix style
* move layer functions to base
2024-08-20 20:58:47 -07:00
Guancheng Fu
537c0d2767
fix vllm qwen2 models ( #11879 )
2024-08-21 11:05:24 +08:00
Yishuo Wang
bd1e490d62
fix phi3 ( #11878 )
2024-08-21 10:31:41 +08:00
Yang Wang
bdaeee1d63
Fix run_decoders bug ( #11871 )
2024-08-20 12:04:59 -07:00
Yina Chen
c3c058373f
Update compresskv model forward type logic ( #11868 )
...
* update
* fix
2024-08-20 18:11:37 +08:00
Yishuo Wang
d4ee0a89f3
optimize phi3 memory usage ( #11867 )
2024-08-20 17:32:51 +08:00
Yishuo Wang
2946420e14
add minicpmv 2.6 load_low_bit workaround ( #11856 )
2024-08-20 11:16:02 +08:00
Yang Wang
99b05ba1dc
separate prefill into a process ( #11787 )
...
* separate prefill into a process
* using model.share_memory()
* might work
* worked
* use long prompt
* refactor
* cleanup
* fix bug
* clean up
* changeable inter and intra process stages
* refactor
* add max output len
* fix npu_model changes that may break generate
* fix npu_model generate import error
* fix generate forward error
---------
Co-authored-by: sgwhat <ge.song@intel.com>
2024-08-19 17:53:36 +08:00
Yishuo Wang
9490781aec
optimize phi3 memory usage again ( #11848 )
2024-08-19 17:26:59 +08:00
Yina Chen
3cd4e87168
Support compress KV with quantize KV ( #11812 )
...
* update llama
* support llama 4.41
* fix style
* support minicpm
* support qwen2
* support minicpm & update
* support chatglm4
* support chatglm
* remove print
* add DynamicCompressFp8Cache & support qwen
* support llama
* support minicpm phi3
* update chatglm2/4
* small fix & support qwen 4.42
* remove print
2024-08-19 15:32:32 +08:00
Zhao Changmin
6841a9ac8f
fix load low bit com dtype ( #11832 )
2024-08-19 13:43:19 +08:00
Yuwen Hu
96796f95cb
Update all-in-one benchmark prompts for continuation task & lookup update for minicpmv ( #11827 )
...
* Update all-in-one benchmark prompts for continuation task
* Small fix
* Add pure-text benchmark support for minicpm-v-2_6
* Support lookahead for model.llm generate of minicpmv
* Add prompt reference
* Small update
* Small fix
2024-08-16 17:16:35 +08:00
Yishuo Wang
e966e85df8
force lm_head optimization in any model if environment variable is set ( #11830 )
2024-08-16 16:48:45 +08:00
Yishuo Wang
17a0beb21f
optimize qwen2-audio again ( #11825 )
2024-08-16 11:11:35 +08:00
Yuwen Hu
9e9086cc2a
Update IPEX_LLM_PERFORMANCE_MODE ( #11823 )
2024-08-16 09:48:36 +08:00
Wang, Jian4
5a80fd2633
Fix lightweight-serving no streaming resp on mtl ( #11822 )
2024-08-16 09:43:03 +08:00
Guancheng Fu
e70ae0638e
Fix vLLM not convert issues ( #11817 )
...
* Fix not convert issues
* refine
2024-08-15 19:04:05 +08:00
Yishuo Wang
750d4ad5dc
fix minicpm-v-2 fp16 ( #11819 )
2024-08-15 18:34:40 +08:00
Yishuo Wang
828ab16537
fix phi3 and minicpmv cpu ( #11818 )
2024-08-15 17:43:29 +08:00
Yishuo Wang
4e178f0c5d
rewrite minicpmv optimization ( #11816 )
2024-08-15 17:27:12 +08:00
Yishuo Wang
07b7f13982
support and optimize qwen2-audio ( #11809 )
2024-08-15 14:59:04 +08:00
Yishuo Wang
9a93808fc5
fix and optimize minicpm v 2 ( #11799 )
2024-08-14 17:27:23 +08:00
Yishuo Wang
3d6cfa291d
optimize minicpm v 2.5 ( #11793 )
2024-08-14 16:07:24 +08:00
Yuwen Hu
356281cb80
Further all-in-one benchmark update continuation task ( #11784 )
...
* Further update prompt for continuation task, and disable lookup candidate update strategy on MTL
* style fix
2024-08-14 14:39:34 +08:00
Ruonan Wang
43cca3be27
fix gemma2 runtime error caused by sliding window ( #11788 )
...
* fix runtime error
* revert workflow
2024-08-14 10:43:33 +08:00
Yang Wang
51bcac1229
follow up on experimental support of fused decoder layer for llama2 ( #11785 )
...
* clean up and support transpose value cache
* refine
* fix style
* fix style
2024-08-13 18:53:55 -07:00
Yishuo Wang
cb79dcda93
refactor llama convert to fix minicpm-v 2.5 optimization ( #11783 )
2024-08-14 09:29:57 +08:00
Yina Chen
7cd6ec9723
MiniCPM-V support compresskv ( #11779 )
...
* fix check error
* fix other models
* remove print
2024-08-13 19:03:40 +08:00
Qiyuan Gong
3998de14f0
Fix mistral forward_qkv in q4_0 ( #11781 )
...
* Fix mistral forward_qkv without self.rotary_emb.base in q4_0.
* Replace apply_rotary_pos_emb_no_cache_xpu with rotary_half_inplaced.
* Revert https://github.com/intel-analytics/ipex-llm/pull/11765
2024-08-13 16:48:19 +08:00
Heyang Sun
70c828b87c
deepspeed zero3 QLoRA finetuning ( #11625 )
...
* deepspeed zero3 QLoRA finetuning
* Update convert.py
* Update low_bit_linear.py
* Update utils.py
* Update qlora_finetune_llama2_13b_arch_2_card.sh
* Update low_bit_linear.py
* Update alpaca_qlora_finetuning.py
* Update low_bit_linear.py
* Update utils.py
* Update convert.py
* Update alpaca_qlora_finetuning.py
* Update alpaca_qlora_finetuning.py
* Update low_bit_linear.py
* Update deepspeed_zero3.json
* Update qlora_finetune_llama2_13b_arch_2_card.sh
* Update low_bit_linear.py
* Update low_bit_linear.py
* Update utils.py
* fix style
* fix style
* Update alpaca_qlora_finetuning.py
* Update qlora_finetune_llama2_13b_arch_2_card.sh
* Update convert.py
* Update low_bit_linear.py
* Update model.py
* Update alpaca_qlora_finetuning.py
* Update low_bit_linear.py
* Update low_bit_linear.py
* Update low_bit_linear.py
2024-08-13 16:15:29 +08:00
Yishuo Wang
a184b120c9
fix minicpm-v 2.5 ( #11780 )
2024-08-13 16:14:00 +08:00
Qiyuan Gong
a88c132e54
Reduce Mistral softmax memory only in low memory mode ( #11775 )
...
* Reduce Mistral softmax memory only in low memory mode
2024-08-13 14:50:54 +08:00
Yishuo Wang
aa861df066
use new fp32 softmax kernel ( #11776 )
2024-08-13 14:48:11 +08:00
binbin Deng
23d3acdc77
Add experimental support of fused decoder layer for llama2 ( #11768 )
2024-08-13 14:41:36 +08:00
Yishuo Wang
a1eb793f70
optimize minicpm v 2_6 first token perf ( #11770 )
2024-08-13 09:51:18 +08:00
Yina Chen
841dbcdf3a
Fix compresskv with lookahead issue ( #11767 )
...
* fix compresskv + lookahead attn_mask qwen2
* support llama chatglm
* support mistral & chatglm
* address comments
* revert run.py
2024-08-12 18:53:55 +08:00
Xu, Shuo
1b05caba2b
Set mistral fuse rope to false except fp6 & fp16 ( #11765 )
...
* set mistral fuse rope to false except fp6 & fp16
* lint
* lint
---------
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-08-12 17:25:07 +08:00
Ruonan Wang
8db34057b4
optimize lookahead init time ( #11769 )
2024-08-12 17:19:12 +08:00
Yishuo Wang
57d177738d
optimize minicpm-v-2_6 repetition penalty ( #11763 )
2024-08-12 14:10:10 +08:00
Wang, Jian4
245dba0abc
Fix lightweight-serving codegeex error ( #11759 )
2024-08-12 10:35:37 +08:00
Ruonan Wang
66fe2ee464
initial support of IPEX_LLM_PERFORMANCE_MODE ( #11754 )
...
* add perf mode
* update
* fix style
2024-08-09 19:04:09 +08:00
Yina Chen
4b9c57cc60
Support compress kv with lookahead ( #11752 )
...
* support compress kv with lookahead
* enough kv miss param
2024-08-09 17:39:57 +08:00
Yishuo Wang
93455aac09
fix minicpm V 2.6 repeat output ( #11753 )
2024-08-09 17:39:24 +08:00
Ruonan Wang
7e917d6cfb
fix gptq of llama ( #11749 )
...
* fix gptq of llama
* small fix
2024-08-09 16:39:25 +08:00
Yina Chen
dd46c141bd
Phi3 support compresskv ( #11733 )
...
* phi3 support compresskv
* fix phi3 mtl error
* fix conflict with quant kv
* fix abnormal on mtl
* fix style
* use slide windows size to compress kv
* support sliding window
* fix style
* fix style
* temp: partial support quant kv
* support quant kv with compress kv, todo: model check
* temp
* fix style
* fix style
* remove prepare
* address comment
* default -> 1.8k
2024-08-09 15:43:43 +08:00
Qiyuan Gong
d8808cc2e3
Mistral apply_rotary_pos_emb_no_cache_xpu use rope_theta from config ( #11747 )
...
mistral-7B-instruct-v0.2 and mistral-7B-instruct-v0.1 use different rope_theta values (v0.2 uses 1e6, v0.1 uses 1e4). Pass self.config.rope_theta to apply_rotary_pos_emb_no_cache_xpu to avoid output differences.
2024-08-09 10:35:51 +08:00
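A minimal sketch of the fix described above (only the function name and the use of self.config.rope_theta come from the commit; the import path and argument order are assumptions):

    # Hedged sketch: read rope_theta from the model config instead of
    # hard-coding the v0.1 default, so both Mistral variants get correct
    # rotary embeddings.
    from ipex_llm.transformers.models.utils import apply_rotary_pos_emb_no_cache_xpu  # assumed path

    def mistral_forward_qkv(self, query_states, key_states, position_ids):
        rope_theta = getattr(self.config, "rope_theta", 10000.0)  # 1e4 = v0.1 default
        return apply_rotary_pos_emb_no_cache_xpu(
            query_states, key_states, position_ids, "mistral", rope_theta
        )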
Xiangyu Tian
044e486480
Fix vLLM CPU /chat endpoint ( #11748 )
2024-08-09 10:33:52 +08:00
Yishuo Wang
54cc9353db
support and optimize minicpm-v-2_6 ( #11738 )
2024-08-07 18:21:16 +08:00
Yina Chen
e956e71fc1
fix conflict with quant kv ( #11737 )
2024-08-07 18:10:30 +08:00
Ruonan Wang
00a5574c8a
Use merge_qkv to replace fused_qkv for llama2 ( #11727 )
...
* update 4.38
* support new versions
* update
* fix style
* fix style
* update rope
* temp test sdpa
* fix style
* fix cpu ut
2024-08-07 18:04:01 +08:00
Yina Chen
d2abc9711b
Fix MTL 4k input qwen2 compresskv error ( #11734 )
...
* fix
* fix style
2024-08-07 16:21:57 +08:00
Yina Chen
a71ae7c22b
Support minicpm compresskv & modify default compresskv config & default enable compresskv on mtl 2.5k~4.5k ( #11726 )
...
* support minicpm & modify default & default enable on mtl 2.5k~4.5k
* fix style
2024-08-07 11:35:39 +08:00
Yishuo Wang
c093f7d980
fix phi3 ( #11729 )
2024-08-07 09:39:46 +08:00
Zijie Li
e7f7141781
Add benchmark util for transformers 4.42 ( #11725 )
...
* add new benchmark_util.py
Add new benchmark_util.py for transformers>=4.43.1. The old one is renamed to benchmark_util_prev.py.
* Small fix to import code
* Update __init__.py
* fix file names
* Update lint-python to exclude benchmark_util_4_29.py and benchmark_util_4_43.py
* Update benchmark_util_4_43.py
* add benchmark_util for transformers 4.42
2024-08-07 08:48:07 +08:00
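Given the versioned file names in this entry, a hedged sketch of the kind of dispatch they imply (only the file names and the >=4.43.1 cutoff come from the commit body; the module path and selection logic are assumptions):

    import transformers
    from packaging import version

    # Hedged sketch: pick the benchmark util matching the installed
    # transformers version.
    if version.parse(transformers.__version__) >= version.parse("4.43.1"):
        from ipex_llm.utils import benchmark_util_4_43 as benchmark_util  # assumed path
    else:
        from ipex_llm.utils import benchmark_util_4_29 as benchmark_util  # assumed path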
Yishuo Wang
929675aa6b
support latest phi3 ( #11721 )
2024-08-06 15:52:55 +08:00
Yishuo Wang
bbdff6edeb
optimize internvl2 4b performance ( #11720 )
2024-08-06 14:25:08 +08:00
Yishuo Wang
f44b732aa8
support internvl2-4b ( #11718 )
2024-08-06 13:36:32 +08:00
Zijie Li
8fb36b9f4a
add new benchmark_util.py ( #11713 )
...
* add new benchmark_util.py
2024-08-05 16:18:48 +08:00
Wang, Jian4
493cbd9a36
Support lightweight-serving with internlm-xcomposer2-vl-7b multimodal input ( #11703 )
...
* init image_list
* enable internlm-xcomposer2 image input
* update style
* add readme
* update model
* update readme
2024-08-05 09:36:04 +08:00
Ruonan Wang
aa98ef96fe
change mixed_precision to q6_k ( #11706 )
2024-08-02 15:55:16 +08:00
Xiangyu Tian
1baa3efe0e
Optimizations for Pipeline Parallel Serving ( #11702 )
...
Optimizations for Pipeline Parallel Serving
2024-08-02 12:06:59 +08:00
Yina Chen
8d1e0bd2f4
add sdp causal support in llama ( #11705 )
2024-08-02 10:27:40 +08:00
Ruonan Wang
736a7ef72e
add sdp_causal for mistral 4.36 ( #11686 )
...
* add sdp_causal for mistral
* fix
* update
2024-08-01 18:57:31 +08:00
Yina Chen
45c730ff39
Chatglm support compresskv ( #11690 )
...
* chatglm4 support compresskv
* fix
* fix style
* support chatglm2
* fix quantkv conflict
* fix style
2024-08-01 18:20:20 +08:00
Guancheng Fu
afeca38a47
Fix import vllm condition ( #11682 )
2024-07-31 13:50:01 +08:00
Ruonan Wang
54bf3a23a6
add fallback for unsupported k-quants ( #11691 )
...
* add fallback
* fix style
* fix
2024-07-31 11:39:58 +08:00
Wang, Jian4
b119825152
Remove tgi parameter validation ( #11688 )
...
* remove validation
* add min warm up
* remove no need source
2024-07-30 16:37:44 +08:00
Yina Chen
670ad887fc
Qwen support compress kv ( #11680 )
...
* Qwen support compress kv
* fix style
* fix
2024-07-30 11:16:42 +08:00
hxsz1997
9b36877897
disable default quantize_kv of GQA on MTL ( #11679 )
...
* disable default quantize_kv of gqa in mtl
* fix style
* fix style
* fix style
* fix style
* fix style
* fix style
2024-07-30 09:38:46 +08:00
Yishuo Wang
c02003925b
add mlp for gemma2 ( #11678 )
2024-07-29 16:10:23 +08:00
Yishuo Wang
6f999e6e90
add sdp for gemma2 ( #11677 )
2024-07-29 15:15:47 +08:00
Ruonan Wang
c11d5301d7
add sdp fp8 for llama ( #11671 )
...
* add sdp fp8 for llama
* fix style
* refactor
2024-07-29 13:46:22 +08:00
Yishuo Wang
7f88ce23cd
add more gemma2 optimization ( #11673 )
2024-07-29 11:13:00 +08:00
Yishuo Wang
3e8819734b
add basic gemma2 optimization ( #11672 )
2024-07-29 10:46:51 +08:00
Heyang Sun
ba01b85c13
empty cache only for 1st token, not rest tokens, to speed up ( #11665 )
2024-07-26 16:46:21 +08:00
Yina Chen
fc7f8feb83
Support compress kv ( #11642 )
...
* mistral snapkv
* update
* mtl update
* update
* update
* update
* add comments
* style fix
* fix style
* support llama
* llama use compress kv
* support mistral 4.40
* fix style
* support diff transformers versions
* move snapkv util to kv
* fix style
* meet comments & small fix
* revert all in one
* fix indent
---------
Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2024-07-26 16:02:00 +08:00
Yishuo Wang
6bcdc6cc8f
fix qwen2 cpu ( #11663 )
2024-07-26 13:41:51 +08:00
Wang, Jian4
23681fbf5c
Support codegeex4-9b for lightweight-serving ( #11648 )
...
* add options, support prompt and not returning end_token
* enable openai parameter
* set do_sample None and update style
2024-07-26 09:41:03 +08:00
Guancheng Fu
a4d30a8211
Change logic for detecting if vllm is available ( #11657 )
...
* fix
* fix
2024-07-25 15:24:19 +08:00
Xiangyu Tian
4499d25c26
LLM: Fix ParallelLMHead convert in vLLM cpu ( #11654 )
2024-07-25 13:07:19 +08:00
binbin Deng
777e61d8c8
Fix qwen2 & int4 on NPU ( #11646 )
2024-07-24 13:14:39 +08:00
Yishuo Wang
1b3b46e54d
fix chatglm new model ( #11639 )
2024-07-23 13:44:56 +08:00
Xiangyu Tian
060792a648
LLM: Refine Pipeline Parallel FastAPI ( #11587 )
...
Refine Pipeline Parallel FastAPI
2024-07-22 15:52:05 +08:00
Wang, Jian4
1eed0635f2
Add lightweight serving and support tgi parameter ( #11600 )
...
* init tgi request
* update openai api
* update for pp
* update and add readme
* add to docker
* add start bash
* update
* update
* update
2024-07-19 13:15:56 +08:00
Xiangyu Tian
d27a8cd08c
Fix Pipeline Parallel dtype ( #11623 )
2024-07-19 13:07:40 +08:00