Ch1y0q
|
2269768e71
|
add internvl2 example (#12102)
* add internvl2 example
* add to README.md
* update
* add link to zh-CN readme
|
2024-09-20 16:31:54 +08:00 |
|
Ruonan Wang
|
09b8c80d9d
|
update code for NPU qwen2 (#12094)
* update code
* fix
|
2024-09-20 15:58:32 +08:00 |
|
Jin, Qiao
|
db7500bfd4
|
Add Qwen2.5 GPU example (#12101)
* Add Qwen2.5 GPU example
* fix end line
* fix description
|
2024-09-20 15:55:57 +08:00 |
|
Yishuo Wang
|
54b973c744
|
fix ipex_llm import in transformers 4.45 (#12099)
|
2024-09-20 15:24:59 +08:00 |
|
Ch1y0q
|
9650bf616a
|
add transpose_value_cache for NPU benchmark (#12092)
* add `transpose_value_cache`
* update
* update
|
2024-09-19 18:45:05 +08:00 |
|
Yuwen Hu
|
f7fb3c896c
|
Update lm_head optimization for Qwen2 7B (#12090)
|
2024-09-18 17:02:02 +08:00 |
|
Xu, Shuo
|
ee33b93464
|
Longbench: NV code to ipex-llm (#11662)
* add nv longbench
* LongBench: NV code to ipex-llm
* amend
* add more models support
* amend
* optimize LongBench's user experience
* amend
* amend
* fix typo
* amend
* remove cuda related information & add a readme
* add license to python scripts & polish the readme
* amend
* amend
---------
Co-authored-by: cyita <yitastudy@gmail.com>
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
|
2024-09-18 15:55:14 +08:00 |
|
Wang, Jian4
|
40e463c66b
|
Enable vllm load gptq model (#12083)
* enable vllm load gptq model
* update
* update
* update
* update style
|
2024-09-18 14:41:00 +08:00 |
|
Ruonan Wang
|
081af41def
|
[NPU] Optimize Qwen2 lm_head to use INT4 (#12072)
* temp save
* update
* fix
* fix
* Split lm_head into 7 parts & remove int8 for lm_head when sym_int4
* Simplify and add condition to code
* Small fix
* refactor some code
* fix style
* fix style
* fix style
* fix
* fix
* temp save
* refactor
* fix style
* further refactor
* simplify code
* meet code review
* fix style
---------
Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
|
2024-09-14 15:26:46 +08:00 |
|
Ch1y0q
|
b4b8c3e495
|
add lowbit_path for generate.py, fix npu_model (#12077)
* add `lowbit_path` for `generate.py`, fix `npu_model`
* update `README.md`
|
2024-09-13 17:28:05 +08:00 |
|
Wang, Jian4
|
d703e4f127
|
Enable vllm multimodal minicpm-v-2-6 (#12074)
* enable minicpm-v-2-6
* add image_url readme
|
2024-09-13 13:28:35 +08:00 |
|
Ruonan Wang
|
48d9092b5a
|
upgrade OneAPI version for cpp Windows (#12063)
* update version
* update quickstart
|
2024-09-12 11:12:12 +08:00 |
|
Jinhe
|
e78e45ee01
|
update NPU readme: run conhost as administrator (#12066)
|
2024-09-11 17:54:04 +08:00 |
|
Jinhe
|
4ca330da15
|
Fix NPU load error message and add minicpm npu lowbit feat (#12064)
* fix npu_model raise sym_int4 error
* add load_lowbit
* remove print&perf
|
2024-09-11 16:56:35 +08:00 |
|
Jinhe
|
32e8362da7
|
added minicpm cpu examples (#12027)
* minicpm cpu examples
* add link for minicpm-2
|
2024-09-11 15:51:21 +08:00 |
|
Ruonan Wang
|
a0c73c26d8
|
clean NPU code (#12060)
* clean code
* remove time.perf_counter()
|
2024-09-11 15:10:35 +08:00 |
|
Wang, Jian4
|
c75f3dd874
|
vllm no padding glm4 to avoid nan error (#12062)
* no padding glm4
* add codegeex
|
2024-09-11 13:44:40 +08:00 |
|
Chu,Youcheng
|
649390c464
|
fix: textual and env variable adjustment (#12038)
|
2024-09-11 13:38:01 +08:00 |
|
Wang, Jian4
|
30a8680645
|
Update for vllm one card padding (#12058)
|
2024-09-11 10:52:55 +08:00 |
|
Zijie Li
|
c5fdfde1bd
|
fix npu-model prompt (#12057)
|
2024-09-11 10:06:45 +08:00 |
|
Yishuo Wang
|
d8c044e79d
|
optimize minicpm3 kv cache (#12052)
|
2024-09-10 16:51:21 +08:00 |
|
Wang, Jian4
|
5d3ab16a80
|
Add vllm glm and baichuan padding (#12053)
|
2024-09-10 15:57:28 +08:00 |
|
Guancheng Fu
|
69c8d36f16
|
Switching from vLLM v0.3.3 to vLLM 0.5.4 (#12042)
* Enable single card sync engine
* enable ipex-llm optimizations for vllm
* enable optimizations for lm_head
* Fix chatglm multi-reference problem
* Remove duplicate layer
* LLM: Update vLLM to v0.5.4 (#11746)
* Enable single card sync engine
* enable ipex-llm optimizations for vllm
* enable optimizations for lm_head
* Fix chatglm multi-reference problem
* update 0.5.4 api_server
* add dockerfile
* fix
* fix
* refine
* fix
---------
Co-authored-by: gc-fu <guancheng.fu@intel.com>
* Add vllm-0.5.4 Dockerfile (#11838)
* Update BIGDL_LLM_SDP_IGNORE_MASK in start-vllm-service.sh (#11957)
* Fix vLLM not convert issues (#11817) (#11918)
* Fix not convert issues
* refine
Co-authored-by: Guancheng Fu <110874468+gc-fu@users.noreply.github.com>
* Fix glm4-9b-chat nan error on vllm 0.5.4 (#11969)
* init
* update mlp forward
* fix minicpm error in vllm 0.5.4
* fix dependabot alerts (#12008)
* Update 0.5.4 dockerfile (#12021)
* Add vllm awq loading logic (#11987)
* [ADD] Add vllm awq loading logic
* [FIX] fix the module.linear_method path
* [FIX] fix quant_config path error
* Enable Qwen padding mlp to 256 to support batch_forward (#12030)
* Enable padding mlp
* padding to 256
* update style
* Install 27191 runtime in 0.5.4 docker image (#12040)
* fix rebase error
* fix rebase error
* vLLM: format for 0.5.4 rebase (#12043)
* format
* Update model_convert.py
* Fix serving docker related modifications (#12046)
* Fix undesired modifications (#12048)
* fix
* Refine offline_inference arguments
---------
Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
Co-authored-by: Jun Wang <thoughts.times@gmail.com>
Co-authored-by: Wang, Jian4 <61138589+hzjane@users.noreply.github.com>
Co-authored-by: liu-shaojun <johnssalyn@outlook.com>
Co-authored-by: Shaojun Liu <61072813+liu-shaojun@users.noreply.github.com>
|
2024-09-10 15:37:43 +08:00 |
|
Ch1y0q
|
73a4360f3f
|
update lowbit path for baichuan2, qwen2, generate.py (#12051)
* update lowbit path for baichuan2, qwen2, `generate.py`
* update readme
|
2024-09-10 15:35:24 +08:00 |
|
Ruonan Wang
|
dc4af02b2a
|
Fix qwen2 1.5B NPU load error (#12049)
|
2024-09-10 14:41:18 +08:00 |
|
Yishuo Wang
|
abc370728c
|
optimize minicpm3 again (#12047)
|
2024-09-10 14:19:57 +08:00 |
|
Ch1y0q
|
f0061a9916
|
remove local import os to fix Baichuan NPU load issue (#12044)
|
2024-09-10 14:13:24 +08:00 |
|
Ruonan Wang
|
640998edea
|
update inter_pp of qwen2 (#12041)
|
2024-09-10 10:34:17 +08:00 |
|
Yishuo Wang
|
048b4590aa
|
add basic minicpm3 optimization (#12039)
|
2024-09-09 17:25:08 +08:00 |
|
Chu,Youcheng
|
16c658e732
|
LLM: add known issues to harness evaluation (#12036)
* feat: add known issue to harness
* fix: resolve comments
* fix: small fixes
|
2024-09-09 14:15:42 +08:00 |
|
Yishuo Wang
|
6cedb601e4
|
remove some useless code (#12035)
|
2024-09-06 17:51:08 +08:00 |
|
binbin Deng
|
d2e1b9aaff
|
Add input padding during prefill for qwen2-7b (#12033)
|
2024-09-06 16:39:59 +08:00 |
|
Yuwen Hu
|
f61b1785fb
|
Small update to NPU example readme (#12034)
* Small update to NPU example readme
* Small fix
|
2024-09-06 15:54:23 +08:00 |
|
Ruonan Wang
|
0d04531ae0
|
update NPU readme of Qwen2 (#12032)
* update readme
* update broadcast
|
2024-09-06 15:02:39 +08:00 |
|
Yang Wang
|
58555bd9de
|
Optimize broadcast for npu llama (#12028)
|
2024-09-06 13:28:20 +08:00 |
|
binbin Deng
|
5b18bb3c4a
|
Add recommended version for mtl npu (#12024)
|
2024-09-05 16:28:53 +08:00 |
|
binbin Deng
|
845e5dc89e
|
Support lm_head of minicpm-2b on NPU (#12019)
|
2024-09-05 16:19:22 +08:00 |
|
Ch1y0q
|
820f8a4554
|
add --lowbit-path option for NPU llama example (#12020)
* add option `--lowbit-path`
* add descriptions in `README.md` and formatting
* Update llama.py
|
2024-09-05 15:31:01 +08:00 |
|
Guoqiong Song
|
8803242f5c
|
fix llama on cpu (#12018)
|
2024-09-04 19:17:54 -07:00 |
|
Wang, Jian4
|
b3b2cd64b4
|
Support lightweight-serving glm-4v-9b (#11994)
* enable glm-4v-9b serving
* update readme
* update for no image input
|
2024-09-05 09:25:08 +08:00 |
|
Yishuo Wang
|
b1408a1f1c
|
fix UT (#12005)
|
2024-09-04 18:02:49 +08:00 |
|
Wang, Jian4
|
2b993ad479
|
vllm update for glm-4 model automatic not_convert (#12003)
|
2024-09-04 13:50:32 +08:00 |
|
Ruonan Wang
|
9eaff5e47d
|
add save & load support for NPU optimized model (#11999)
* add save & load support
* fix style
|
2024-09-03 20:53:22 +08:00 |
|
Yuwen Hu
|
6eb55653ba
|
Performance mode strategy update for input_embeds input (#11997)
|
2024-09-03 17:46:16 +08:00 |
|
Jinhe
|
164f47adbd
|
MiniCPM-V-2 & MiniCPM-Llama3-V-2_5 example updates (#11988)
* minicpm example updates
* --stream
|
2024-09-03 17:02:06 +08:00 |
|
Jin, Qiao
|
2e54f4402b
|
Rename MiniCPM-V-2_6 CPU example (#11998)
|
2024-09-03 16:50:42 +08:00 |
|
binbin Deng
|
01099f08ee
|
Revert prefill logic of qwen2-7b (#11992)
|
2024-09-03 14:45:01 +08:00 |
|
Yuwen Hu
|
659d15defc
|
Fix wrong attention mask and garbage output for inputs_embeds inputs during lookup generation (#11989)
* Fix garbage output for input_embeds inputs during lookup generation
* Fix on sliding windows
* Simplify code
|
2024-09-02 19:09:12 +08:00 |
|
binbin Deng
|
2f3d1bd0ec
|
hotfix qwen2-7b weight setting (#11991)
|
2024-09-02 18:11:08 +08:00 |
|
binbin Deng
|
a40ea7038d
|
Fix AttributeError of qwen2-1.5B (#11990)
|
2024-09-02 17:55:10 +08:00 |
|