-
6f3441ba4c
fix glm4-9b overflow (#12455)
Yishuo Wang
2024-11-27 17:39:13 +0800
-
281c9b0bb9
[NPU] Add L0 support for NPU C++ (#12454)
Ruonan Wang
2024-11-27 01:04:13 -0800
-
ce6fcaa9ba
update transformers version in example of glm4 (#12453)
Chu,Youcheng
2024-11-27 15:02:25 +0800
-
effb9bb41c
Small update to LangChain examples readme (#12452)
Yuwen Hu
2024-11-27 14:02:25 +0800
-
acd77d9e87
Remove env variable BIGDL_LLM_XMX_DISABLED in documentation (#12445)
Chu,Youcheng
2024-11-27 11:16:36 +0800
-
f8c2bb2943
[NPU] optimize qwen2 prefill performance for C++ (#12451)
Ruonan Wang
2024-11-26 18:46:18 -0800
-
8331875f34
Fix (#12390)
Guancheng Fu
2024-11-27 10:41:58 +0800
-
cb7b08948b
update vllm-docker-quick-start for vllm0.6.2 (#12392)
Jun Wang
2024-11-27 08:47:03 +0800
-
7b40f9b372
[NPU] Support GW for NPU C++ (#12450)
Ruonan Wang
2024-11-26 01:46:40 -0800
-
c2efa264d9
Update LangChain examples to use upstream (#12388)
Jin, Qiao
2024-11-26 16:43:15 +0800
-
24b46b2b19
[NPU] further fix of qwen2 int8 pipeline & C++ (#12449)
Ruonan Wang
2024-11-26 00:39:39 -0800
-
303b104c10
Fix abnormal output for Qwen2-7B when sym_int8 (#12446)
Yuwen Hu
2024-11-26 15:53:04 +0800
-
71e1f11aa6
update serving image runtime (#12433)
Pepijn de Vos
2024-11-26 07:55:30 +0100
-
52c17fe104
Optimize first token of C++ NPU by adding npu_dpu_groups (#12443)
Ruonan Wang
2024-11-25 19:41:32 -0800
-
66bd7abae4
add sdxl and lora-lcm optimization (#12444)
Jinhe
2024-11-26 11:38:09 +0800
-
0e23bd779f
Add support of llama3.2 for NPU C++ (#12442)
Ruonan Wang
2024-11-25 17:26:55 -0800
-
cdd41f5e4c
optimize sdxl again (#12441)
Yishuo Wang
2024-11-25 17:46:46 +0800
-
b9abb8a285
Support qwen2.5 3B for NPU & update related examples (#12438)
Ruonan Wang
2024-11-25 00:38:31 -0800
-
b633fbf26c
add chinese prompt troubleshooting for npu cpp examples (#12437)
Jinhe
2024-11-25 15:28:47 +0800
-
8164aed802
small change (#12439)
Yishuo Wang
2024-11-25 14:35:49 +0800
-
be132c4209
fix and optimize sd (#12436)
Yishuo Wang
2024-11-25 14:09:48 +0800
-
f41405368a
Support minicpm for NPU C++ (#12434)
Ruonan Wang
2024-11-24 18:42:02 -0800
-
0819fad34e
support Llama2-7B / Llama3-8B for NPU C++ (#12431)
Ruonan Wang
2024-11-22 02:47:19 -0800
-
4ffa6c752c
New convert support for C++ NPU (#12430)
Ruonan Wang
2024-11-21 22:28:30 -0800
-
c089b6c10d
Update english prompt to 34k (#12429)
Shaojun Liu
2024-11-22 11:20:35 +0800
-
e61ae88c5b
Upgrade dependency for xpu_lnl and xpu_arl option (#12424)
Yuwen Hu
2024-11-21 18:37:15 +0800
-
2935e97610
small fix of cpp readme (#12425)
Ruonan Wang
2024-11-21 02:21:34 -0800
-
8fdc36c140
Optimize with new batch kernel when batch_size=1 on LNL (#12419)
Yuwen Hu
2024-11-21 16:21:35 +0800
-
7e0a840f74
add optimization to openjourney (#12423)
Jinhe
2024-11-21 15:23:51 +0800
-
145e8b480f
update batch kernel condition (#12421)
Yishuo Wang
2024-11-21 10:12:46 +0800
-
7288c759ce
Initial NPU C++ Example (#12417)
Ruonan Wang
2024-11-20 18:09:26 -0800
-
d2a37b6ab2
add Stable diffusion examples (#12418)
Jinhe
2024-11-20 17:18:36 +0800
-
54c62feb74
[NPU] dump prefill IR for further C++ solution (#12402)
Ruonan Wang
2024-11-19 23:20:05 -0800
-
1bfcbc0640
Add multimodal benchmark (#12415)
Wang, Jian4
2024-11-20 14:21:13 +0800
-
ff3f7cb25f
Fix speech_paraformer issue with unexpected changes (#12416)
SONG Ge
2024-11-18 23:01:20 -0800
-
a9cb70a71c
Add install_windows_gpu.zh-CN.md and install_linux_gpu.zh-CN.md (#12409)
joan726
2024-11-19 14:39:53 +0800
-
d6057f6dd2
Update benchmark_vllm_throughput.py (#12414)
Guancheng Fu
2024-11-19 10:41:43 +0800
-
a69395f31f
Support performance mode of GLM4 model (#12401)
Yuwen Hu
2024-11-18 18:46:52 +0800
-
d2c821d458
Add missing arguments in pipeline parallel generate method (#12142)
Song Fuchang
2024-11-18 13:50:18 +0800
-
3d5fbf2069
update batch kernel condition (#12408)
Yishuo Wang
2024-11-15 13:47:05 +0800
-
6c5e8fc70c
fix again (#12407)
Ruonan Wang
2024-11-15 11:57:58 +0800
-
fcc0fa7316
fix workflow again (#12406)
Ruonan Wang
2024-11-15 11:01:35 +0800
-
d1cde7fac4
Tiny doc fix (#12405)
Yuwen Hu
2024-11-15 10:28:38 +0800
-
548dec5185
fix npu pipeline workflow (#12404)
Ruonan Wang
2024-11-15 10:01:33 +0800
-
d4d949443f
[NPU] change attention_mask to fp16 (#12400)
binbin Deng
2024-11-14 17:20:29 +0800
-
7e50ff113c
Add padding_token=eos_token for GPU trl QLora example (#12398)
Qiyuan Gong
2024-11-14 10:51:30 +0800
-
d2cbcb060c
Add initial support for modeling_xlm encoder on NPU (#12393)
SONG Ge
2024-11-14 10:50:27 +0800
-
6726b198fd
Update readme & doc for the vllm upgrade to v0.6.2 (#12399)
Xu, Shuo
2024-11-14 10:28:15 +0800
-
59b01fa7d2
small fix (#12397)
Yina Chen
2024-11-14 04:03:36 +0200
-
00fce5c940
use new q4_0 batch kernel (#12396)
Yishuo Wang
2024-11-13 18:37:34 +0800
-
d6d63d6b84
[NPU] Qwen prefill attn_mask type hotfix (#12395)
Yina Chen
2024-11-13 11:51:34 +0200
-
9220babaab
qwen prefill attn_mask type fp16 (#12394)
Yina Chen
2024-11-13 11:45:26 +0200
-
1158f91648
Fix llava with multi-image inputs (#12384)
Yuwen Hu
2024-11-13 09:27:50 +0800
-
27152476e1
minor fix (#12389)
Shaojun Liu
2024-11-12 22:36:43 +0800
-
dd8964ba9c
changed inference-cpp/Dockerfile (#12386)
Xu, Shuo
2024-11-12 20:40:21 +0800
-
0ee54fc55f
Upgrade to vllm 0.6.2 (#12338)
Guancheng Fu
2024-11-12 20:35:34 +0800
-
4376fdee62
Decouple openwebui and ollama in inference-cpp-xpu dockerfile (#12382)
Jun Wang
2024-11-12 20:15:23 +0800
-
6bf5a8c230
[NPU] Update qwen2 compile config (#12383)
Ruonan Wang
2024-11-12 16:59:44 +0800
-
7a97fbb779
Support vpm and resampler module of minicpm-v on NPU (#12375)
binbin Deng
2024-11-12 15:59:55 +0800
-
85c9279e6e
Update llama-cpp docker usage (#12387)
Wang, Jian4
2024-11-12 15:30:17 +0800
-
c92d76b997
Update oneccl-binding.patch (#12377)
Shaojun Liu
2024-11-11 22:34:08 +0800
-
e0918934c8
Add fused_mlp to glm4v models (#12378)
Yuwen Hu
2024-11-11 17:10:25 +0800
-
dc34e8c51f
optimize glm4v vision attention (#12369)
Yishuo Wang
2024-11-08 17:01:57 +0800
-
2dfcc36825
Fix trl version and padding in trl qlora example (#12368)
Qiyuan Gong
2024-11-08 16:05:17 +0800
-
fad15c8ca0
Update fastchat demo script (#12367)
Shaojun Liu
2024-11-08 15:42:17 +0800
-
51f7f87768
fix ipex 2.3 bug (#12366)
Yishuo Wang
2024-11-08 13:29:15 +0800
-
b2e69a896c
[NPU] Support Baichuan groupwise & gw code refactor (#12337)
Yina Chen
2024-11-08 05:42:42 +0200
-
812d5cc32e
[NPU L0] Support llama3.2 in L0 pipeline (#12361)
binbin Deng
2024-11-08 10:01:23 +0800
-
7ef7696956
update linux installation doc (#12365)
Xin Qiu
2024-11-08 09:44:58 +0800
-
8fe294e01f
Small fix to all-in-one benchmark (#12362)
Yuwen Hu
2024-11-07 18:56:34 +0800
-
1a6cbc473f
Add fused mlp optimizations to glm4 models (#12360)
Yuwen Hu
2024-11-07 18:52:47 +0800
-
520af4e9b5
Update install_linux_gpu.md (#12353)
Xin Qiu
2024-11-07 16:08:01 +0800
-
ad68c56573
small improvement (#12359)
Yishuo Wang
2024-11-07 15:57:41 +0800
-
71ea539351
Add troubleshootings for ollama and llama.cpp (#12358)
Jinhe
2024-11-07 15:49:20 +0800
-
ce0c6ae423
Update Readme for FastChat docker demo (#12354)
Xu, Shuo
2024-11-07 15:22:42 +0800
-
d880e534d2
[NPU] acclib llama3.2 support groupwise (#12355)
Yina Chen
2024-11-07 05:19:55 +0200
-
79f2877413
add minicpm-v models to transformers_int4_npu_win api (#12352)
Jinhe
2024-11-07 10:05:10 +0800
-
a7b66683f1
[NPU] Add Optimized Support for Llama3.2-1B/3B on NPU (#12339)
SONG Ge
2024-11-06 19:21:40 +0800
-
872a74481a
Small optimization to glm4 models (#12351)
Yuwen Hu
2024-11-06 19:16:58 +0800
-
c267355b35
fix three NPU benchmark issues (#12350)
Ruonan Wang
2024-11-06 19:01:01 +0800
-
f24352aef9
llama 3.1/3.2 support compresskv (#12347)
Yina Chen
2024-11-06 11:33:43 +0200
-
d984c0672a
Add MiniCPM-V-2_6 to arc perf test (#12349)
Jin, Qiao
2024-11-06 16:32:28 +0800
-
e23ef7d088
optimize glm4v's vision part (#12346)
Yishuo Wang
2024-11-06 15:43:40 +0800
-
c8b7265359
Add basic glm4v support (#12345)
Yishuo Wang
2024-11-06 13:50:10 +0800
-
69e3a56943
[NPU] Hot fix of load_low_bit (#12344)
binbin Deng
2024-11-06 10:07:00 +0800
-
899a30331a
Replace gradio_web_server.patch to adjust webui (#12329)
Xu, Shuo
2024-11-06 09:16:32 +0800
-
7240c283a3
Add dummy model in iGPU perf (#12341)
Jin, Qiao
2024-11-05 17:56:10 +0800
-
8e9a3a1158
fix chatglm2 cpu ut (#12336)
Zhao Changmin
2024-11-05 16:43:57 +0800
-
d872639395
[NPU] Llama3, Qwen2 1.5b, MiniCPM 1/2B groupwise support (#12327)
Yina Chen
2024-11-05 09:51:31 +0200
-
82a61b5cf3
Limit trl version in example (#12332)
Jin, Qiao
2024-11-05 14:50:10 +0800
-
923d696854
Small fix to LNL performance tests (#12333)
Yuwen Hu
2024-11-05 13:24:58 +0800
-
45b0d371aa
update benchmark readme (#12323)
Zijie Li
2024-11-04 19:19:08 -0500
-
e2adc974fd
Small fix to LNL performance tests (#12331)
Yuwen Hu
2024-11-04 19:22:41 +0800
-
522cdf8e9d
Add initial support for LNL nightly performance tests (#12326)
Yuwen Hu
2024-11-04 18:53:51 +0800
-
1b637e4477
Add chatglm2&3 fuse mlp (#12328)
Zhao Changmin
2024-11-04 18:04:41 +0800
-
94c4ce389f
[NPU] Add env to disable compile opt (#12330)
Yina Chen
2024-11-04 11:46:17 +0200
-
e54af44ed6
Add transformers_int4_npu_pipeline_win in all-in-one benchmark (#12325)
Ch1y0q
2024-11-04 16:00:20 +0800
-
5ee6f97d6f
[NPU L0] Add layernorm weight as const / input setting (#12322)
binbin Deng
2024-11-04 15:46:24 +0800
-
a01371f90b
Doc: update harness readme (#12324)
Chu,Youcheng
2024-11-04 14:58:54 +0800
-
4644cb640c
Perf test further fix regarding trl version (#12321)
Yuwen Hu
2024-11-04 11:01:25 +0800