Commit graph

488 commits

Author SHA1 Message Date
Yishuo Wang
d5344587ab
optimize internvl2 vision model's attention (#12198) 2024-10-15 10:51:00 +08:00
Yuwen Hu
f8d1adc573
Fix Llama 3.2 & 3.1 on LNL (#12196) 2024-10-14 17:39:20 +08:00
Zijie Li
7d80db710e
Add benchmark_util for transformers >= 4.44.0 (#12171)
* Create benchmark_util_4_45.py

* Update __init__.py

* Update lint-python

* Update benchmark_util_4_45.py

* Update benchmark_util_4_45.py

* Create benchmark_util_4_44.py
2024-10-14 15:40:12 +08:00
Ruonan Wang
310f18c8af
update NPU pipeline generate (#12182)
* update

* fix style
2024-10-11 17:39:20 +08:00
Ruonan Wang
4d93bb81fe
Initial support of NPU level0 Model (#12177)
* first commit to support load dll and init llm pipeline

* add init generate

* fix style

* small updates

* fix style and check tokens number
2024-10-11 09:45:53 +08:00
Yuwen Hu
890662610b
Fix auto importer for LNL release (#12175) 2024-10-10 15:17:43 +08:00
Yishuo Wang
535bee5381
fix qwen2 vl again (#12174) 2024-10-10 13:50:01 +08:00
Yishuo Wang
78d253165d
optimize qwen2 vl perf again (#12167) 2024-10-09 16:43:48 +08:00
Yishuo Wang
644af2a76e
add basic llama 3.2 vision support (#12163) 2024-10-08 10:46:48 +08:00
Yishuo Wang
669ff1a97b
fix sd1.5 (#12129) 2024-09-26 17:15:16 +08:00
Yishuo Wang
a266528719
optimize llama 3.2 rope (#12128) 2024-09-26 16:08:10 +08:00
Yishuo Wang
584c3489e7
add basic support for llama3.2 (#12125) 2024-09-26 15:46:19 +08:00
Yishuo Wang
66f419f8b7
fix qwen2 vl (#12126) 2024-09-26 15:44:02 +08:00
Yishuo Wang
77af9bc5fa
support passing None to low_bit in optimize_model (#12121) 2024-09-26 11:09:35 +08:00
Yishuo Wang
47e0b83cbf
optimize sd 1.5 (#12119) 2024-09-25 15:45:13 +08:00
Yishuo Wang
5d63aef60b
optimize qwen2 vl again (#12109) 2024-09-23 13:22:01 +08:00
Ruonan Wang
03bd01c99c
optimize npu qwen2 (#12107) 2024-09-20 19:46:16 +08:00
Yishuo Wang
9239fd4f12
add basic support and optimization for qwen2-vl (#12104) 2024-09-20 17:23:06 +08:00
Yuwen Hu
828fa01ad3
[NPU] Add mixed_precision for Qwen2 7B (#12098)
* Add mix_precision argument to control whether use INT8 lm_head for Qwen2-7B-Instruct

* Small fix

* Fixed on load low bit with mixed precision

* Small fix

* Update example accordingly

* Update for default prompt

* Update base on comments

* Final fix
2024-09-20 16:36:21 +08:00
Ruonan Wang
09b8c80d9d
update code for NPU qwen2 (#12094)
* update code

* fix
2024-09-20 15:58:32 +08:00
Yishuo Wang
54b973c744
fix ipex_llm import in transformers 4.45 (#12099) 2024-09-20 15:24:59 +08:00
Yuwen Hu
f7fb3c896c
Update lm_head optimization for Qwen2 7B (#12090) 2024-09-18 17:02:02 +08:00
Wang, Jian4
40e463c66b
Enable vllm load gptq model (#12083)
* enable vllm load gptq model

* update

* update

* update

* update style
2024-09-18 14:41:00 +08:00
Ruonan Wang
081af41def
[NPU] Optimize Qwen2 lm_head to use INT4 (#12072)
* temp save

* update

* fix

* fix

* Split lm_head into 7 parts & remove int8 for lm_head when sym_int4

* Simlify and add condition to code

* Small fix

* refactor some code

* fix style

* fix style

* fix style

* fix

* fix

* temp sav e

* refactor

* fix style

* further refactor

* simplify code

* meet code review

* fix style

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2024-09-14 15:26:46 +08:00
Ch1y0q
b4b8c3e495
add lowbit_path for generate.py, fix npu_model (#12077)
* add `lowbit_path` for `generate.py`, fix `npu_model`

* update `README.md`
2024-09-13 17:28:05 +08:00
Wang, Jian4
d703e4f127
Enable vllm multimodal minicpm-v-2-6 (#12074)
* enable minicpm-v-2-6

* add image_url readme
2024-09-13 13:28:35 +08:00
Jinhe
4ca330da15
Fix NPU load error message and add minicpm npu lowbit feat (#12064)
* fix npu_model raise sym_int4 error

* add load_lowbit

* remove print&perf
2024-09-11 16:56:35 +08:00
Ruonan Wang
a0c73c26d8
clean NPU code (#12060)
* clean code

* remove time.perf_counter()
2024-09-11 15:10:35 +08:00
Wang, Jian4
c75f3dd874
vllm no padding glm4 to avoid nan error (#12062)
* no padding glm4

* add codegeex
2024-09-11 13:44:40 +08:00
Wang, Jian4
30a8680645
Update for vllm one card padding (#12058) 2024-09-11 10:52:55 +08:00
Yishuo Wang
d8c044e79d
optimize minicpm3 kv cache (#12052) 2024-09-10 16:51:21 +08:00
Wang, Jian4
5d3ab16a80
Add vllm glm and baichuan padding (#12053) 2024-09-10 15:57:28 +08:00
Guancheng Fu
69c8d36f16
Switching from vLLM v0.3.3 to vLLM 0.5.4 (#12042)
* Enable single card sync engine

* enable ipex-llm optimizations for vllm

* enable optimizations for lm_head

* Fix chatglm multi-reference problem

* Remove duplicate layer

* LLM: Update vLLM to v0.5.4 (#11746)

* Enable single card sync engine

* enable ipex-llm optimizations for vllm

* enable optimizations for lm_head

* Fix chatglm multi-reference problem

* update 0.5.4 api_server

* add dockerfile

* fix

* fix

* refine

* fix

---------

Co-authored-by: gc-fu <guancheng.fu@intel.com>

* Add vllm-0.5.4 Dockerfile (#11838)

* Update BIGDL_LLM_SDP_IGNORE_MASK in start-vllm-service.sh (#11957)

* Fix vLLM not convert issues (#11817) (#11918)

* Fix not convert issues

* refine

Co-authored-by: Guancheng Fu <110874468+gc-fu@users.noreply.github.com>

* Fix glm4-9b-chat nan error on vllm 0.5.4 (#11969)

* init

* update mlp forward

* fix minicpm error in vllm 0.5.4

* fix dependabot alerts (#12008)

* Update 0.5.4 dockerfile (#12021)

* Add vllm awq loading logic (#11987)

* [ADD] Add vllm awq loading logic

* [FIX] fix the module.linear_method path

* [FIX] fix quant_config path error

* Enable Qwen padding mlp to 256 to support batch_forward (#12030)

* Enable padding mlp

* padding to 256

* update style

* Install 27191 runtime in 0.5.4 docker image (#12040)

* fix rebase error

* fix rebase error

* vLLM: format for 0.5.4 rebase (#12043)

* format

* Update model_convert.py

* Fix serving docker related modifications (#12046)

* Fix undesired modifications (#12048)

* fix

* Refine offline_inference arguments

---------

Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
Co-authored-by: Jun Wang <thoughts.times@gmail.com>
Co-authored-by: Wang, Jian4 <61138589+hzjane@users.noreply.github.com>
Co-authored-by: liu-shaojun <johnssalyn@outlook.com>
Co-authored-by: Shaojun Liu <61072813+liu-shaojun@users.noreply.github.com>
2024-09-10 15:37:43 +08:00
Ruonan Wang
dc4af02b2a
Fix qwen2 1.5B NPU load error (#12049) 2024-09-10 14:41:18 +08:00
Yishuo Wang
abc370728c
optimize minicpm3 again (#12047) 2024-09-10 14:19:57 +08:00
Ch1y0q
f0061a9916
remove local import os to fix Baichuan NPU load issue (#12044) 2024-09-10 14:13:24 +08:00
Ruonan Wang
640998edea
update inter_pp of qwen2 (#12041) 2024-09-10 10:34:17 +08:00
Yishuo Wang
048b4590aa
add basic minicpm3 optimization (#12039) 2024-09-09 17:25:08 +08:00
Yishuo Wang
6cedb601e4
remove some useless code (#12035) 2024-09-06 17:51:08 +08:00
binbin Deng
d2e1b9aaff
Add input padding during prefill for qwen2-7b (#12033) 2024-09-06 16:39:59 +08:00
Ruonan Wang
0d04531ae0
update NPU readme of Qwen2 (#12032)
* update readme

* update broadcast
2024-09-06 15:02:39 +08:00
Yang Wang
58555bd9de
Optimize broadcast for npu llama (#12028) 2024-09-06 13:28:20 +08:00
binbin Deng
845e5dc89e
Support lm_head of minicpm-2b on NPU (#12019) 2024-09-05 16:19:22 +08:00
Guoqiong Song
8803242f5c
fix llama on cpu (#12018) 2024-09-04 19:17:54 -07:00
Wang, Jian4
b3b2cd64b4
Support lightweight-serving glm-4v-9b (#11994)
* enable glm-4v-9b serving

* update readme

* update for no image input
2024-09-05 09:25:08 +08:00
Wang, Jian4
2b993ad479
vllm update for glm-4 model automatic not_convert (#12003) 2024-09-04 13:50:32 +08:00
Ruonan Wang
9eaff5e47d
add save & load support for NPU optimized model (#11999)
* add save &  load support

* fix style
2024-09-03 20:53:22 +08:00
Yuwen Hu
6eb55653ba
Performance mode strategy update for input_embeds input (#11997) 2024-09-03 17:46:16 +08:00
binbin Deng
01099f08ee
Revert prefill logic of qwen2-7b (#11992) 2024-09-03 14:45:01 +08:00
Yuwen Hu
659d15defc
Fix wrong attention mask and garbage output for inputs_embeds inputs during lookup generation (#11989)
* Fix garbage output for input_embeds inputs during lookup generation

* Fix on sliding windows

* Simplify code
2024-09-02 19:09:12 +08:00