Yishuo Wang
47e0b83cbf
optimize sd 1.5 ( #12119 )
2024-09-25 15:45:13 +08:00
Yishuo Wang
5d63aef60b
optimize qwen2 vl again ( #12109 )
2024-09-23 13:22:01 +08:00
Ruonan Wang
03bd01c99c
optimize npu qwen2 ( #12107 )
2024-09-20 19:46:16 +08:00
Yishuo Wang
9239fd4f12
add basic support and optimization for qwen2-vl ( #12104 )
2024-09-20 17:23:06 +08:00
Yuwen Hu
828fa01ad3
[NPU] Add mixed_precision for Qwen2 7B ( #12098 )
...
* Add mix_precision argument to control whether to use INT8 lm_head for Qwen2-7B-Instruct
* Small fix
* Fixed on load low bit with mixed precision
* Small fix
* Update example accordingly
* Update for default prompt
* Update based on comments
* Final fix
2024-09-20 16:36:21 +08:00
Ruonan Wang
09b8c80d9d
update code for NPU qwen2 ( #12094 )
...
* update code
* fix
2024-09-20 15:58:32 +08:00
Yishuo Wang
54b973c744
fix ipex_llm import in transformers 4.45 ( #12099 )
2024-09-20 15:24:59 +08:00
Yuwen Hu
f7fb3c896c
Update lm_head optimization for Qwen2 7B ( #12090 )
2024-09-18 17:02:02 +08:00
Wang, Jian4
40e463c66b
Enable vllm load gptq model ( #12083 )
...
* enable vllm load gptq model
* update
* update
* update
* update style
2024-09-18 14:41:00 +08:00
Ruonan Wang
081af41def
[NPU] Optimize Qwen2 lm_head to use INT4 ( #12072 )
...
* temp save
* update
* fix
* fix
* Split lm_head into 7 parts & remove int8 for lm_head when sym_int4
* Simplify and add condition to code
* Small fix
* refactor some code
* fix style
* fix style
* fix style
* fix
* fix
* temp save
* refactor
* fix style
* further refactor
* simplify code
* meet code review
* fix style
---------
Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2024-09-14 15:26:46 +08:00
Ch1y0q
b4b8c3e495
add lowbit_path for generate.py, fix npu_model ( #12077 )
...
* add `lowbit_path` for `generate.py`, fix `npu_model`
* update `README.md`
2024-09-13 17:28:05 +08:00
Wang, Jian4
d703e4f127
Enable vllm multimodal minicpm-v-2-6 ( #12074 )
...
* enable minicpm-v-2-6
* add image_url readme
2024-09-13 13:28:35 +08:00
Jinhe
4ca330da15
Fix NPU load error message and add minicpm npu lowbit feat ( #12064 )
...
* fix npu_model raise sym_int4 error
* add load_lowbit
* remove print&perf
2024-09-11 16:56:35 +08:00
Ruonan Wang
a0c73c26d8
clean NPU code ( #12060 )
...
* clean code
* remove time.perf_counter()
2024-09-11 15:10:35 +08:00
Wang, Jian4
c75f3dd874
vllm no padding glm4 to avoid nan error ( #12062 )
...
* no padding glm4
* add codegeex
2024-09-11 13:44:40 +08:00
Wang, Jian4
30a8680645
Update for vllm one card padding ( #12058 )
2024-09-11 10:52:55 +08:00
Yishuo Wang
d8c044e79d
optimize minicpm3 kv cache ( #12052 )
2024-09-10 16:51:21 +08:00
Wang, Jian4
5d3ab16a80
Add vllm glm and baichuan padding ( #12053 )
2024-09-10 15:57:28 +08:00
Guancheng Fu
69c8d36f16
Switching from vLLM v0.3.3 to vLLM 0.5.4 ( #12042 )
...
* Enable single card sync engine
* enable ipex-llm optimizations for vllm
* enable optimizations for lm_head
* Fix chatglm multi-reference problem
* Remove duplicate layer
* LLM: Update vLLM to v0.5.4 (#11746 )
* Enable single card sync engine
* enable ipex-llm optimizations for vllm
* enable optimizations for lm_head
* Fix chatglm multi-reference problem
* update 0.5.4 api_server
* add dockerfile
* fix
* fix
* refine
* fix
---------
Co-authored-by: gc-fu <guancheng.fu@intel.com>
* Add vllm-0.5.4 Dockerfile (#11838 )
* Update BIGDL_LLM_SDP_IGNORE_MASK in start-vllm-service.sh (#11957 )
* Fix vLLM not convert issues (#11817 ) (#11918 )
* Fix not convert issues
* refine
Co-authored-by: Guancheng Fu <110874468+gc-fu@users.noreply.github.com>
* Fix glm4-9b-chat nan error on vllm 0.5.4 (#11969 )
* init
* update mlp forward
* fix minicpm error in vllm 0.5.4
* fix dependabot alerts (#12008 )
* Update 0.5.4 dockerfile (#12021 )
* Add vllm awq loading logic (#11987 )
* [ADD] Add vllm awq loading logic
* [FIX] fix the module.linear_method path
* [FIX] fix quant_config path error
* Enable Qwen padding mlp to 256 to support batch_forward (#12030 )
* Enable padding mlp
* padding to 256
* update style
* Install 27191 runtime in 0.5.4 docker image (#12040 )
* fix rebase error
* fix rebase error
* vLLM: format for 0.5.4 rebase (#12043 )
* format
* Update model_convert.py
* Fix serving docker related modifications (#12046 )
* Fix undesired modifications (#12048 )
* fix
* Refine offline_inference arguments
---------
Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
Co-authored-by: Jun Wang <thoughts.times@gmail.com>
Co-authored-by: Wang, Jian4 <61138589+hzjane@users.noreply.github.com>
Co-authored-by: liu-shaojun <johnssalyn@outlook.com>
Co-authored-by: Shaojun Liu <61072813+liu-shaojun@users.noreply.github.com>
2024-09-10 15:37:43 +08:00
Ruonan Wang
dc4af02b2a
Fix qwen2 1.5B NPU load error ( #12049 )
2024-09-10 14:41:18 +08:00
Yishuo Wang
abc370728c
optimize minicpm3 again ( #12047 )
2024-09-10 14:19:57 +08:00
Ch1y0q
f0061a9916
remove local import os to fix Baichuan NPU load issue ( #12044 )
2024-09-10 14:13:24 +08:00
Ruonan Wang
640998edea
update inter_pp of qwen2 ( #12041 )
2024-09-10 10:34:17 +08:00
Yishuo Wang
048b4590aa
add basic minicpm3 optimization ( #12039 )
2024-09-09 17:25:08 +08:00
Yishuo Wang
6cedb601e4
remove some useless code ( #12035 )
2024-09-06 17:51:08 +08:00
binbin Deng
d2e1b9aaff
Add input padding during prefill for qwen2-7b ( #12033 )
2024-09-06 16:39:59 +08:00
Ruonan Wang
0d04531ae0
update NPU readme of Qwen2 ( #12032 )
...
* update readme
* update broadcast
2024-09-06 15:02:39 +08:00
Yang Wang
58555bd9de
Optimize broadcast for npu llama ( #12028 )
2024-09-06 13:28:20 +08:00
binbin Deng
845e5dc89e
Support lm_head of minicpm-2b on NPU ( #12019 )
2024-09-05 16:19:22 +08:00
Guoqiong Song
8803242f5c
fix llama on cpu ( #12018 )
2024-09-04 19:17:54 -07:00
Wang, Jian4
b3b2cd64b4
Support lightweight-serving glm-4v-9b ( #11994 )
...
* enable glm-4v-9b serving
* update readme
* update for no image input
2024-09-05 09:25:08 +08:00
Wang, Jian4
2b993ad479
vllm update for glm-4 model automatic not_convert ( #12003 )
2024-09-04 13:50:32 +08:00
Ruonan Wang
9eaff5e47d
add save & load support for NPU optimized model ( #11999 )
...
* add save & load support
* fix style
2024-09-03 20:53:22 +08:00
Yuwen Hu
6eb55653ba
Performance mode strategy update for inputs_embeds input ( #11997 )
2024-09-03 17:46:16 +08:00
binbin Deng
01099f08ee
Revert prefill logic of qwen2-7b ( #11992 )
2024-09-03 14:45:01 +08:00
Yuwen Hu
659d15defc
Fix wrong attention mask and garbage output for inputs_embeds inputs during lookup generation ( #11989 )
...
* Fix garbage output for inputs_embeds inputs during lookup generation
* Fix on sliding windows
* Simplify code
2024-09-02 19:09:12 +08:00
binbin Deng
2f3d1bd0ec
hotfix qwen2-7b weight setting ( #11991 )
2024-09-02 18:11:08 +08:00
binbin Deng
a40ea7038d
Fix AttributeError of qwen2-1.5B ( #11990 )
2024-09-02 17:55:10 +08:00
Yang Wang
c48817bd43
Support Qwen2-7b MLP in int4 and transpose_value_cache=True ( #11968 )
2024-09-02 14:37:44 +08:00
Ruonan Wang
573c20bae6
fix npu lm_head cpu condition ( #11976 )
...
* fix
* fix
* fix
* fix style
* fix style
* fix style
2024-08-30 17:11:26 +08:00
Ruonan Wang
60aa1a2c0f
Initial NPU support for MiniCPM-V-2_6 ( #11966 )
...
* initial pr
* update npu model
* fix
* fix kv cache type
* fix
* small fix
* fix style
* fix model id
* change inter_pp=4
* address comment
* fix
* fix style
* fix
* rebase
2024-08-30 16:34:35 +08:00
SONG Ge
158289d205
[NPU] Add initial support for minicpm-llama-v2.5 ( #11962 )
...
* add initial support for minicpm-llama-v2.5
* update impl
* add minicpm-llama3-v2.5 example
2024-08-30 16:00:33 +08:00
binbin Deng
cd077881f1
Disable lm head ( #11972 )
2024-08-30 11:05:18 +08:00
Wang, Jian4
7d103417b8
Fix glm4-9b-chat nan error on vllm 0.3.3 ( #11970 )
...
* fix nan value
* update
2024-08-30 09:50:18 +08:00
Yang Wang
fbf088f61e
remove obsolete npu code ( #11967 )
2024-08-29 14:16:44 -07:00
Yuwen Hu
a9e485eb1b
Support MiniCPM-V-2_6 multi-modal benchmarking with latency text streamer ( #11963 )
...
* Support MiniCPM-V-2_6 multi-modal benchmarking with latency text streamer
* Style fixes
2024-08-29 19:22:09 +08:00
Yina Chen
882f4a5ff7
Add recommended LNL NPU driver version and enable cpu_lm_head on llama3 ( #11952 )
...
* update lnl npu driver version and enable cpu_lm_head on llama3
* update
* fix style
* typo
* address comments
* update
* add qwen2-7b
2024-08-29 15:01:18 +08:00
binbin Deng
71f03dcc39
Support qwen2-7b with fused decoderlayer optimization on NPU ( #11912 )
2024-08-29 13:34:20 +08:00
Jiao Wang
63ac5f64bb
Refactor NPU baichuan multiple-process ( #11945 )
...
* update
* add baichuan mp
* clean
* refactor
* merge
* style
* update
* update
2024-08-28 11:33:40 -07:00
SONG Ge
5ca7390082
[NPU] Add minicpm-2b support for npu multi-processing ( #11949 )
...
* add minicpm-2b support
* update example for minicpm-2b
* add LNL NPU driver requirement in readme
2024-08-28 18:08:49 +08:00