Yishuo Wang
3d5fbf2069
update batch kernel condition (#12408)
2024-11-15 13:47:05 +08:00
Ruonan Wang
6c5e8fc70c
fix again (#12407)
2024-11-15 11:57:58 +08:00
Ruonan Wang
fcc0fa7316
fix workflow again (#12406)
* fix again
* fix name
2024-11-15 11:01:35 +08:00
Yuwen Hu
d1cde7fac4
Tiny doc fix (#12405)
2024-11-15 10:28:38 +08:00
Ruonan Wang
548dec5185
fix npu pipeline workflow (#12404)
2024-11-15 10:01:33 +08:00
binbin Deng
d4d949443f
[NPU] change attention_mask to fp16 (#12400)
2024-11-14 17:20:29 +08:00
Qiyuan Gong
7e50ff113c
Add padding_token=eos_token for GPU trl QLora example (#12398)
* Avoid the "tokenizer doesn't have a padding token" error.
2024-11-14 10:51:30 +08:00
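The padding commit above addresses a common situation: many causal-LM tokenizers ship without a padding token, so batched training raises a "tokenizer does not have a padding token" error, and the usual fix is to reuse the EOS token as the pad token. A minimal sketch of that pattern follows; `FakeTokenizer` is a hypothetical stand-in used here only so the snippet runs without downloading a model, but the assignment line is the same one you would write for a real Hugging Face tokenizer.

```python
class FakeTokenizer:
    """Hypothetical stand-in for a tokenizer that has no pad token."""
    def __init__(self):
        self.eos_token = "</s>"
        self.pad_token = None  # many causal-LM tokenizers start out like this

tokenizer = FakeTokenizer()

# Reuse EOS as the padding token so batch padding does not error out.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(tokenizer.pad_token)  # -> </s>
```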
SONG Ge
d2cbcb060c
Add initial support for modeling_xlm encoder on NPU (#12393)
* Add initial support for modeling_xlm encoder on NPU
* Add EmbeddingModel class to keep the same usage with bce and npu fp16 linear convert
* Optimize current implementation to support EmbeddingModel.encode API and convert other torch modules to NPU
* Add related example and documents
2024-11-14 10:50:27 +08:00
Xu, Shuo
6726b198fd
Update readme & doc for the vllm upgrade to v0.6.2 (#12399)
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-11-14 10:28:15 +08:00
Yina Chen
59b01fa7d2
small fix (#12397)
2024-11-14 10:03:36 +08:00
Yishuo Wang
00fce5c940
use new q4_0 batch kernel (#12396)
2024-11-13 18:37:34 +08:00
Yina Chen
d6d63d6b84
[NPU] Qwen prefill attn_mask type hotfix (#12395)
* qwen prefill attn_mask type fp16
* update
2024-11-13 17:51:34 +08:00
Yina Chen
9220babaab
qwen prefill attn_mask type fp16 (#12394)
2024-11-13 17:45:26 +08:00
Yuwen Hu
1158f91648
Fix llava with multi-image inputs (#12384)
2024-11-13 09:27:50 +08:00
Shaojun Liu
27152476e1
minor fix (#12389)
2024-11-12 22:36:43 +08:00
Xu, Shuo
dd8964ba9c
changed inference-cpp/Dockerfile (#12386)
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
Co-authored-by: Shaojun Liu <61072813+liu-shaojun@users.noreply.github.com>
2024-11-12 20:40:21 +08:00
Guancheng Fu
0ee54fc55f
Upgrade to vllm 0.6.2 (#12338)
* Initial updates for vllm 0.6.2
* fix
* Change Dockerfile to support v062
* Fix
* fix examples
* Fix
* done
* fix
* Update engine.py
* Fix Dockerfile to original path
* fix
* add option
* fix
* fix
* fix
* fix
---------
Co-authored-by: xiangyuT <xiangyu.tian@intel.com>
2024-11-12 20:35:34 +08:00
Jun Wang
4376fdee62
Decouple the openwebui and the ollama in inference-cpp-xpu dockerfile (#12382)
* remove the openwebui in inference-cpp-xpu dockerfile
* update docker_cpp_xpu_quickstart.md
* add sample output in inference-cpp/readme
* remove the openwebui in main readme
2024-11-12 20:15:23 +08:00
Ruonan Wang
6bf5a8c230
[NPU] Update qwen2 compile config (#12383)
* update
* fix
2024-11-12 16:59:44 +08:00
binbin Deng
7a97fbb779
Support vpm and resampler module of minicpm-v on NPU (#12375)
2024-11-12 15:59:55 +08:00
Wang, Jian4
85c9279e6e
Update llama-cpp docker usage (#12387)
2024-11-12 15:30:17 +08:00
Shaojun Liu
c92d76b997
Update oneccl-binding.patch (#12377)
* Add files via upload
* upload oneccl-binding.patch
* Update Dockerfile
2024-11-11 22:34:08 +08:00
Yuwen Hu
e0918934c8
Add fused_mlp to glm4v models (#12378)
2024-11-11 17:10:25 +08:00
Yishuo Wang
dc34e8c51f
optimize glm4v vision attention (#12369)
2024-11-08 17:01:57 +08:00
Qiyuan Gong
2dfcc36825
Fix trl version and padding in trl qlora example (#12368)
* Change trl to 0.9.6
* Enable padding to avoid padding-related errors.
2024-11-08 16:05:17 +08:00
Shaojun Liu
fad15c8ca0
Update fastchat demo script (#12367)
* Update README.md
* Update vllm_docker_quickstart.md
2024-11-08 15:42:17 +08:00
Yishuo Wang
51f7f87768
fix ipex 2.3 bug (#12366)
2024-11-08 13:29:15 +08:00
Yina Chen
b2e69a896c
[NPU] Support Baichuan groupwise & gw code refactor (#12337)
* support minicpm 1b & qwen 1.5b gw
* support minicpm 1b
* baichuan part
* update
* update
* update
* baichuan support
* code refactor
* remove code
* fix style
* address comments
* revert
2024-11-08 11:42:42 +08:00
binbin Deng
812d5cc32e
[NPU L0] Support llama3.2 in L0 pipeline (#12361)
2024-11-08 10:01:23 +08:00
Xin Qiu
7ef7696956
update linux installation doc (#12365)
* update linux doc
* update
2024-11-08 09:44:58 +08:00
Yuwen Hu
8fe294e01f
Small fix to all-in-one benchmark (#12362)
2024-11-07 18:56:34 +08:00
Yuwen Hu
1a6cbc473f
Add fused mlp optimizations to glm4 models (#12360)
* Add fused mlp to glm4 models
* Small fix
2024-11-07 18:52:47 +08:00
Xin Qiu
520af4e9b5
Update install_linux_gpu.md (#12353)
2024-11-07 16:08:01 +08:00
Yishuo Wang
ad68c56573
small improvement (#12359)
2024-11-07 15:57:41 +08:00
Jinhe
71ea539351
Add troubleshootings for ollama and llama.cpp (#12358)
* add ollama troubleshoot en
* zh ollama troubleshoot
* llamacpp trouble shoot
* fix
* save gpu memory
2024-11-07 15:49:20 +08:00
Xu, Shuo
ce0c6ae423
Update Readme for FastChat docker demo (#12354)
* update Readme for FastChat docker demo
* update readme
* add 'Serving with FastChat' part in docs
* polish docs
---------
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-11-07 15:22:42 +08:00
Yina Chen
d880e534d2
[NPU] acclib llama3.2 support groupwise (#12355)
* change inter_pp
* add comment
2024-11-07 11:19:55 +08:00
Jinhe
79f2877413
add minicpm-v models to transformers_int4_npu_win api (#12352)
* add minicpm npu
* optimize model
2024-11-07 10:05:10 +08:00
SONG Ge
a7b66683f1
[NPU] Add Optimized Support for Llama3.2-1B/3B on NPU (#12339)
* Add initial support for llama3.2-1b/3b
* move llama3.2 support into current llama_mp impl
2024-11-06 19:21:40 +08:00
Yuwen Hu
872a74481a
Small optimization to glm4 models (#12351)
2024-11-06 19:16:58 +08:00
Ruonan Wang
c267355b35
fix three NPU benchmark issues (#12350)
* fix three issues
* limit mixed_precision for CW only
2024-11-06 19:01:01 +08:00
Yina Chen
f24352aef9
llama 3.1/3.2 support compresskv (#12347)
* llama 3.1/3.2 support compresskv
* update
* fix transformers 4.45 error
* fix style
* fix typo
* disable llama3.2 1b compresskv
2024-11-06 17:33:43 +08:00
Jin, Qiao
d984c0672a
Add MiniCPM-V-2_6 to arc perf test (#12349)
2024-11-06 16:32:28 +08:00
Yishuo Wang
e23ef7d088
optimize glm4v's vision part (#12346)
2024-11-06 15:43:40 +08:00
Yishuo Wang
c8b7265359
Add basic glm4v support (#12345)
2024-11-06 13:50:10 +08:00
binbin Deng
69e3a56943
[NPU] Hot fix of load_low_bit (#12344)
2024-11-06 10:07:00 +08:00
Xu, Shuo
899a30331a
Replace gradio_web_server.patch to adjust webui (#12329)
* replace gradio_web_server.patch to adjust webui
* fix patch problem
---------
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-11-06 09:16:32 +08:00
Jin, Qiao
7240c283a3
Add dummy model in iGPU perf (#12341)
* Add dummy model in iGPU perf
* Fix
2024-11-05 17:56:10 +08:00
Zhao Changmin
8e9a3a1158
fix chatglm2 cpu ut (#12336)
2024-11-05 16:43:57 +08:00
Yina Chen
d872639395
[NPU] Llama3, Qwen2 1.5b, MiniCPM 1/2B groupwise support (#12327)
* support minicpm 1b & qwen 1.5b gw
* support minicpm 1b
* support minicpm 2b
* fix style & error
* fix style & update
* remove print
2024-11-05 15:51:31 +08:00