Yishuo Wang | 78d253165d | 2024-10-09 16:43:48 +08:00
    optimize qwen2 vl perf again (#12167)

Yishuo Wang | 644af2a76e | 2024-10-08 10:46:48 +08:00
    add basic llama 3.2 vision support (#12163)

Yishuo Wang | 584c3489e7 | 2024-09-26 15:46:19 +08:00
    add basic support for llama3.2 (#12125)

Yishuo Wang | 77af9bc5fa | 2024-09-26 11:09:35 +08:00
    support passing None to low_bit in optimize_model (#12121)

Yishuo Wang | 9239fd4f12 | 2024-09-20 17:23:06 +08:00
    add basic support and optimization for qwen2-vl (#12104)
Wang, Jian4 | 40e463c66b | 2024-09-18 14:41:00 +08:00
    Enable vllm load gptq model (#12083)
    * enable vllm load gptq model
    * update
    * update
    * update
    * update style

Yishuo Wang | d8c044e79d | 2024-09-10 16:51:21 +08:00
    optimize minicpm3 kv cache (#12052)
Guancheng Fu | 69c8d36f16 | 2024-09-10 15:37:43 +08:00
    Switching from vLLM v0.3.3 to vLLM 0.5.4 (#12042)
    * Enable single card sync engine
    * enable ipex-llm optimizations for vllm
    * enable optimizations for lm_head
    * Fix chatglm multi-reference problem
    * Remove duplicate layer
    * LLM: Update vLLM to v0.5.4 (#11746)
    * Enable single card sync engine
    * enable ipex-llm optimizations for vllm
    * enable optimizations for lm_head
    * Fix chatglm multi-reference problem
    * update 0.5.4 api_server
    * add dockerfile
    * fix
    * fix
    * refine
    * fix
    ---------
    Co-authored-by: gc-fu <guancheng.fu@intel.com>
    * Add vllm-0.5.4 Dockerfile (#11838)
    * Update BIGDL_LLM_SDP_IGNORE_MASK in start-vllm-service.sh (#11957)
    * Fix vLLM not convert issues (#11817) (#11918)
    * Fix not convert issues
    * refine
    Co-authored-by: Guancheng Fu <110874468+gc-fu@users.noreply.github.com>
    * Fix glm4-9b-chat nan error on vllm 0.5.4 (#11969)
    * init
    * update mlp forward
    * fix minicpm error in vllm 0.5.4
    * fix dependabot alerts (#12008)
    * Update 0.5.4 dockerfile (#12021)
    * Add vllm awq loading logic (#11987)
    * [ADD] Add vllm awq loading logic
    * [FIX] fix the module.linear_method path
    * [FIX] fix quant_config path error
    * Enable Qwen padding mlp to 256 to support batch_forward (#12030)
    * Enable padding mlp
    * padding to 256
    * update style
    * Install 27191 runtime in 0.5.4 docker image (#12040)
    * fix rebase error
    * fix rebase error
    * vLLM: format for 0.5.4 rebase (#12043)
    * format
    * Update model_convert.py
    * Fix serving docker related modifications (#12046)
    * Fix undesired modifications (#12048)
    * fix
    * Refine offline_inference arguments
    ---------
    Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
    Co-authored-by: Jun Wang <thoughts.times@gmail.com>
    Co-authored-by: Wang, Jian4 <61138589+hzjane@users.noreply.github.com>
    Co-authored-by: liu-shaojun <johnssalyn@outlook.com>
    Co-authored-by: Shaojun Liu <61072813+liu-shaojun@users.noreply.github.com>
Yishuo Wang | abc370728c | 2024-09-10 14:19:57 +08:00
    optimize minicpm3 again (#12047)

Yishuo Wang | 048b4590aa | 2024-09-09 17:25:08 +08:00
    add basic minicpm3 optimization (#12039)

Yuwen Hu | a9e485eb1b | 2024-08-29 19:22:09 +08:00
    Support MiniCPM-V-2_6 multi-modal benchmarking with latency text streamer (#11963)
    * Support MiniCPM-V-2_6 multi-modal benchmarking with latency text streamer
    * Style fixes

Guancheng Fu | 0a7bd274e2 | 2024-08-28 16:46:18 +08:00
    Add vllm awq loading logic (#11950)
    * add vllm awq loading logic
    * fix
    * refine
Yina Chen | 23631cd357 | 2024-08-23 15:39:47 +08:00
    disable lm_head opt for baichuan2-13b (#11905)

hxsz1997 | 650e6e6ce4 | 2024-08-23 06:09:58 +03:00
    Merge pull request #11891 from hxsz1997/baichuan2-compresskv
    Add compress_kv for Baichuan2

Huang, Xinshengzi | 4cf03d6212 | 2024-08-22 18:16:33 +08:00
    update baichuan-7b

Guancheng Fu | 278b191dc1 | 2024-08-22 17:45:26 +08:00
    Fix optimize lm head error (#11899)

Huang, Xinshengzi | 86248b0505 | 2024-08-22 10:59:08 +08:00
    add compress_kv for baichuan2

Yina Chen | 0236de3ac2 | 2024-08-21 15:06:12 +08:00
    set IPEX_LLM_LAST_LM_HEAD=1 as default (#11885)
Yishuo Wang | 2946420e14 | 2024-08-20 11:16:02 +08:00
    add minicpmv 2.6 load_low_bit workaround (#11856)

Zhao Changmin | 6841a9ac8f | 2024-08-19 13:43:19 +08:00
    fix load low bit com dtype (#11832)

Yishuo Wang | e966e85df8 | 2024-08-16 16:48:45 +08:00
    force lm_head optimization in any model if set environment variable (#11830)

Yishuo Wang | 17a0beb21f | 2024-08-16 11:11:35 +08:00
    optimize qwen2-audio again (#11825)

Guancheng Fu | e70ae0638e | 2024-08-15 19:04:05 +08:00
    Fix vLLM not convert issues (#11817)
    * Fix not convert issues
    * refine

Yishuo Wang | 750d4ad5dc | 2024-08-15 18:34:40 +08:00
    fix minicpm-v-2 fp16 (#11819)
Yishuo Wang | 4e178f0c5d | 2024-08-15 17:27:12 +08:00
    rewrite minicpmv optimization (#11816)

Yishuo Wang | 07b7f13982 | 2024-08-15 14:59:04 +08:00
    support and optimize qwen2-audio (#11809)

Yishuo Wang | 9a93808fc5 | 2024-08-14 17:27:23 +08:00
    fix and optimize minicpm v 2 (#11799)

Yishuo Wang | 3d6cfa291d | 2024-08-14 16:07:24 +08:00
    optimize minicpm v 2.5 (#11793)

Yishuo Wang | cb79dcda93 | 2024-08-14 09:29:57 +08:00
    refactor llama convert to fix minicpm-v 2.5 optimization (#11783)

Yishuo Wang | a184b120c9 | 2024-08-13 16:14:00 +08:00
    fix minicpm-v 2.5 (#11780)
Yishuo Wang | a1eb793f70 | 2024-08-13 09:51:18 +08:00
    optimize minicpm v 2_6 first token perf (#11770)
Yishuo Wang | 54cc9353db | 2024-08-07 18:21:16 +08:00
    support and optimize minicpm-v-2_6 (#11738)

Ruonan Wang | 00a5574c8a | 2024-08-07 18:04:01 +08:00
    Use merge_qkv to replace fused_qkv for llama2 (#11727)
    * update 4.38
    * support new versions
    * update
    * fix style
    * fix style
    * update rope
    * temp test sdpa
    * fix style
    * fix cpu ut

Yishuo Wang | bbdff6edeb | 2024-08-06 14:25:08 +08:00
    optimize internvl2 4b performance (#11720)

Yishuo Wang | f44b732aa8 | 2024-08-06 13:36:32 +08:00
    support internvl2-4b (#11718)

Ruonan Wang | aa98ef96fe | 2024-08-02 15:55:16 +08:00
    change mixed_precision to q6_k (#11706)
Guancheng Fu | afeca38a47 | 2024-07-31 13:50:01 +08:00
    Fix import vllm condition (#11682)

Ruonan Wang | 54bf3a23a6 | 2024-07-31 11:39:58 +08:00
    add fallback for unsupported k-quants (#11691)
    * add fallback
    * fix style
    * fix

Yishuo Wang | c02003925b | 2024-07-29 16:10:23 +08:00
    add mlp for gemma2 (#11678)

Yishuo Wang | 6f999e6e90 | 2024-07-29 15:15:47 +08:00
    add sdp for gemma2 (#11677)

Yishuo Wang | 7f88ce23cd | 2024-07-29 11:13:00 +08:00
    add more gemma2 optimization (#11673)

Yishuo Wang | 3e8819734b | 2024-07-29 10:46:51 +08:00
    add basic gemma2 optimization (#11672)
Yina Chen | fc7f8feb83 | 2024-07-26 16:02:00 +08:00
    Support compress kv (#11642)
    * mistral snapkv
    * update
    * mtl update
    * update
    * update
    * update
    * add comments
    * style fix
    * fix style
    * support llama
    * llama use compress kv
    * support mistral 4.40
    * fix style
    * support diff transformers versions
    * move snapkv util to kv
    * fix style
    * meet comments & small fix
    * revert all in one
    * fix indent
    ---------
    Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>

Guancheng Fu | a4d30a8211 | 2024-07-25 15:24:19 +08:00
    Change logic for detecting if vllm is available (#11657)
    * fix
    * fix

Xiangyu Tian | 4499d25c26 | 2024-07-25 13:07:19 +08:00
    LLM: Fix ParallelLMHead convert in vLLM cpu (#11654)
Yishuo Wang | 1b3b46e54d | 2024-07-23 13:44:56 +08:00
    fix chatglm new model (#11639)

Yishuo Wang | d020ad6397 | 2024-07-19 10:34:53 +08:00
    add save_low_bit support for DiskEmbedding (#11621)

Guoqiong Song | 380717f50d | 2024-07-18 15:02:50 -07:00
    fix gemma for 4.41 (#11531)
    * fix gemma for 4.41

Guoqiong Song | 5a6211fd56 | 2024-07-18 15:01:57 -07:00
    fix minicpm for transformers>=4.39 (#11533)
    * fix minicpm for transformers>=4.39

Yishuo Wang | 0209427cf4 | 2024-07-18 17:06:06 +08:00
    Add disk_embedding parameter to support putting the Embedding layer on CPU (#11617)