Author | Commit | Message | Date
Yishuo Wang | a9e3f7f14c | optimize minicpm (#12496) | 2024-12-04 17:14:16 +08:00
Yishuo Wang | 6f3441ba4c | fix glm4-9b overflow (#12455) | 2024-11-27 17:39:13 +08:00
Yishuo Wang | cdd41f5e4c | optimize sdxl again (#12441) | 2024-11-25 17:46:46 +08:00
Yishuo Wang | 8164aed802 | small change (#12439) | 2024-11-25 14:35:49 +08:00
Yishuo Wang | be132c4209 | fix and optimize sd (#12436) | 2024-11-25 14:09:48 +08:00
Yuwen Hu | e0918934c8 | Add fused_mlp to glm4v models (#12378) | 2024-11-11 17:10:25 +08:00
Yuwen Hu | 1a6cbc473f | Add fused mlp optimizations to glm4 models (#12360) | 2024-11-07 18:52:47 +08:00
    * Add fused mlp to glm4 models
    * Small fix
Yuwen Hu | 872a74481a | Small optimization to glm4 models (#12351) | 2024-11-06 19:16:58 +08:00
Yishuo Wang | e23ef7d088 | optimize glm4v's vision part (#12346) | 2024-11-06 15:43:40 +08:00
Yishuo Wang | c8b7265359 | Add basic glm4v support (#12345) | 2024-11-06 13:50:10 +08:00
Zhao Changmin | 1b637e4477 | Add chatglm2&3 fuse mlp (#12328) | 2024-11-04 18:04:41 +08:00
    * add chatglm fuse mlp
Xin Qiu | 97a0f7fd35 | Codegeex support (#12303) | 2024-10-31 15:28:56 +08:00
    * new codegeex attn
    * use kv cache
    * add compress/quantize kv
    * remove compress/quantize kv
    * fix style check
    * fix style
    * fix codegeex
Yuwen Hu | 43b25a2fe7 | Fix llama 3.2 vision on LNL (#12264) | 2024-10-25 16:23:31 +08:00
    * Fix llama 3.2 vision on LNL
    * Small fix
Yishuo Wang | f3a2b20e6b | Optimize gpt2 (#12259) | 2024-10-24 13:44:24 +08:00
Yuwen Hu | b3df47486d | Fix Gemma 2 on LNL (#12240) | 2024-10-21 18:25:53 +08:00
    * Fix gemma 2 on LNL
    * Python style fix
Yishuo Wang | a4a758656a | refactor gemma to reduce old fuse rope usage (#12215) | 2024-10-16 17:40:28 +08:00
Yishuo Wang | e279148aa0 | optimize llama3.2 vision again (#12211) | 2024-10-16 14:29:48 +08:00
Yishuo Wang | d5344587ab | optimize internvl2 vision model's attention (#12198) | 2024-10-15 10:51:00 +08:00
Yuwen Hu | f8d1adc573 | Fix Llama 3.2 & 3.1 on LNL (#12196) | 2024-10-14 17:39:20 +08:00
Yishuo Wang | 535bee5381 | fix qwen2 vl again (#12174) | 2024-10-10 13:50:01 +08:00
Yishuo Wang | 78d253165d | optimize qwen2 vl perf again (#12167) | 2024-10-09 16:43:48 +08:00
Yishuo Wang | 644af2a76e | add basic llama 3.2 vision support (#12163) | 2024-10-08 10:46:48 +08:00
Yishuo Wang | 584c3489e7 | add basic support for llama3.2 (#12125) | 2024-09-26 15:46:19 +08:00
Yishuo Wang | 77af9bc5fa | support passing None to low_bit in optimize_model (#12121) | 2024-09-26 11:09:35 +08:00
Yishuo Wang | 9239fd4f12 | add basic support and optimization for qwen2-vl (#12104) | 2024-09-20 17:23:06 +08:00
Wang, Jian4 | 40e463c66b | Enable vllm load gptq model (#12083) | 2024-09-18 14:41:00 +08:00
    * enable vllm load gptq model
    * update
    * update
    * update
    * update style
Yishuo Wang | d8c044e79d | optimize minicpm3 kv cache (#12052) | 2024-09-10 16:51:21 +08:00
Guancheng Fu | 69c8d36f16 | Switching from vLLM v0.3.3 to vLLM 0.5.4 (#12042) | 2024-09-10 15:37:43 +08:00
    * Enable single card sync engine
    * enable ipex-llm optimizations for vllm
    * enable optimizations for lm_head
    * Fix chatglm multi-reference problem
    * Remove duplicate layer
    * LLM: Update vLLM to v0.5.4 (#11746)
    * Enable single card sync engine
    * enable ipex-llm optimizations for vllm
    * enable optimizations for lm_head
    * Fix chatglm multi-reference problem
    * update 0.5.4 api_server
    * add dockerfile
    * fix
    * fix
    * refine
    * fix
    Co-authored-by: gc-fu <guancheng.fu@intel.com>
    * Add vllm-0.5.4 Dockerfile (#11838)
    * Update BIGDL_LLM_SDP_IGNORE_MASK in start-vllm-service.sh (#11957)
    * Fix vLLM not convert issues (#11817) (#11918)
    * Fix not convert issues
    * refine
    Co-authored-by: Guancheng Fu <110874468+gc-fu@users.noreply.github.com>
    * Fix glm4-9b-chat nan error on vllm 0.5.4 (#11969)
    * init
    * update mlp forward
    * fix minicpm error in vllm 0.5.4
    * fix dependabot alerts (#12008)
    * Update 0.5.4 dockerfile (#12021)
    * Add vllm awq loading logic (#11987)
    * [ADD] Add vllm awq loading logic
    * [FIX] fix the module.linear_method path
    * [FIX] fix quant_config path error
    * Enable Qwen padding mlp to 256 to support batch_forward (#12030)
    * Enable padding mlp
    * padding to 256
    * update style
    * Install 27191 runtime in 0.5.4 docker image (#12040)
    * fix rebase error
    * fix rebase error
    * vLLM: format for 0.5.4 rebase (#12043)
    * format
    * Update model_convert.py
    * Fix serving docker related modifications (#12046)
    * Fix undesired modifications (#12048)
    * fix
    * Refine offline_inference arguments
    Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
    Co-authored-by: Jun Wang <thoughts.times@gmail.com>
    Co-authored-by: Wang, Jian4 <61138589+hzjane@users.noreply.github.com>
    Co-authored-by: liu-shaojun <johnssalyn@outlook.com>
    Co-authored-by: Shaojun Liu <61072813+liu-shaojun@users.noreply.github.com>
Yishuo Wang | abc370728c | optimize minicpm3 again (#12047) | 2024-09-10 14:19:57 +08:00
Yishuo Wang | 048b4590aa | add basic minicpm3 optimization (#12039) | 2024-09-09 17:25:08 +08:00
Yuwen Hu | a9e485eb1b | Support MiniCPM-V-2_6 multi-modal benchmarking with latency text streamer (#11963) | 2024-08-29 19:22:09 +08:00
    * Support MiniCPM-V-2_6 multi-modal benchmarking with latency text streamer
    * Style fixes
Guancheng Fu | 0a7bd274e2 | Add vllm awq loading logic (#11950) | 2024-08-28 16:46:18 +08:00
    * add vllm awq loading logic
    * fix
    * refine
Yina Chen | 23631cd357 | disable lm_head opt for baichuan2-13b (#11905) | 2024-08-23 15:39:47 +08:00
hxsz1997 | 650e6e6ce4 | Merge pull request #11891 from hxsz1997/baichuan2-compresskv | 2024-08-23 06:09:58 +03:00
    Add compress_kv for Baichuan2
Huang, Xinshengzi | 4cf03d6212 | update baichuan-7b | 2024-08-22 18:16:33 +08:00
Guancheng Fu | 278b191dc1 | Fix optimize lm head error (#11899) | 2024-08-22 17:45:26 +08:00
Huang, Xinshengzi | 86248b0505 | add compress_kv for baichuan2 | 2024-08-22 10:59:08 +08:00
Yina Chen | 0236de3ac2 | set IPEX_LLM_LAST_LM_HEAD=1 as default (#11885) | 2024-08-21 15:06:12 +08:00
Yishuo Wang | 2946420e14 | add minicpmv 2.6 load_low_bit workaround (#11856) | 2024-08-20 11:16:02 +08:00
Zhao Changmin | 6841a9ac8f | fix load low bit com dtype (#11832) | 2024-08-19 13:43:19 +08:00
Yishuo Wang | e966e85df8 | force lm_head optimization in any model if set environment variable (#11830) | 2024-08-16 16:48:45 +08:00
Yishuo Wang | 17a0beb21f | optimize qwen2-audio again (#11825) | 2024-08-16 11:11:35 +08:00
Guancheng Fu | e70ae0638e | Fix vLLM not convert issues (#11817) | 2024-08-15 19:04:05 +08:00
    * Fix not convert issues
    * refine
Yishuo Wang | 750d4ad5dc | fix minicpm-v-2 fp16 (#11819) | 2024-08-15 18:34:40 +08:00
Yishuo Wang | 4e178f0c5d | rewrite minicpmv optimization (#11816) | 2024-08-15 17:27:12 +08:00
Yishuo Wang | 07b7f13982 | support and optimize qwen2-audio (#11809) | 2024-08-15 14:59:04 +08:00
Yishuo Wang | 9a93808fc5 | fix and optimize minicpm v 2 (#11799) | 2024-08-14 17:27:23 +08:00
Yishuo Wang | 3d6cfa291d | optimize minicpm v 2.5 (#11793) | 2024-08-14 16:07:24 +08:00
Yishuo Wang | cb79dcda93 | refactor llama convert to fix minicpm-v 2.5 optimization (#11783) | 2024-08-14 09:29:57 +08:00
Yishuo Wang | a184b120c9 | fix minicpm-v 2.5 (#11780) | 2024-08-13 16:14:00 +08:00