-
762ad49362
Add RANK_WAIT_TIME into DeepSpeed-AutoTP to avoid CPU memory OOM (#11704)
Qiyuan Gong
2024-08-01 18:16:21 +0800
-
8ef4caaf5d
add 3k and 4k input of nightly perf test on iGPU (#11701)
hxsz1997
2024-08-01 09:17:46 +0300
-
afeca38a47
Fix import vllm condition (#11682)
Guancheng Fu
2024-07-31 13:50:01 +0800
-
54bf3a23a6
add fallback for unsupported k-quants (#11691)
Ruonan Wang
2024-07-31 06:39:58 +0300
-
5079ed9e06
Add Llama3.1 example (#11689)
Zijie Li
2024-07-31 10:53:30 +0800
-
6e3ce28173
Upgrade glm-4 example transformers version (#11659)
Jin, Qiao
2024-07-31 10:24:50 +0800
-
a44ab32153
Switch to conhost when running on NPU (#11687)
Jin, Qiao
2024-07-30 17:08:06 +0800
-
b119825152
Remove tgi parameter validation (#11688)
Wang, Jian4
2024-07-30 16:37:44 +0800
-
670ad887fc
Qwen support compress kv (#11680)
Yina Chen
2024-07-30 06:16:42 +0300
-
9b36877897
disable default quantize_kv of GQA on MTL (#11679)
hxsz1997
2024-07-30 04:38:46 +0300
-
c02003925b
add mlp for gemma2 (#11678)
Yishuo Wang
2024-07-29 16:10:23 +0800
-
1da1f1dd0e
Combine two versions of run_wikitext.py (#11597)
RyuKosei
2024-07-29 00:56:16 -0700
-
6f999e6e90
add sdp for gemma2 (#11677)
Yishuo Wang
2024-07-29 15:15:47 +0800
-
c11d5301d7
add sdp fp8 for llama (#11671)
Ruonan Wang
2024-07-29 08:46:22 +0300
-
7f88ce23cd
add more gemma2 optimization (#11673)
Yishuo Wang
2024-07-29 11:13:00 +0800
-
3e8819734b
add basic gemma2 optimization (#11672)
Yishuo Wang
2024-07-29 10:46:51 +0800
-
418640e466
Update install_gpu.md
Jason Dai
2024-07-27 08:30:10 +0800
-
336dfc04b1
fix 1482 (#11661)
Guoqiong Song
2024-07-26 12:39:09 -0700
-
ba01b85c13
empty cache only for 1st token but rest token to speed up (#11665)
Heyang Sun
2024-07-26 16:46:21 +0800
-
fc7f8feb83
Support compress kv (#11642)
Yina Chen
2024-07-26 11:02:00 +0300
-
6bcdc6cc8f
fix qwen2 cpu (#11663)
Yishuo Wang
2024-07-26 13:41:51 +0800
-
23681fbf5c
Support codegeex4-9b for lightweight-serving (#11648)
Wang, Jian4
2024-07-26 09:41:03 +0800
-
86fc0492f4
Update oneccl used (#11647)
Guancheng Fu
2024-07-26 09:38:39 +0800
-
a4d30a8211
Change logic for detecting if vllm is available (#11657)
Guancheng Fu
2024-07-25 15:24:19 +0800
-
0c6e0b86c0
Refine continuation get input_str (#11652)
Qiyuan Gong
2024-07-25 14:41:19 +0800
-
2fbd375a94
update several models for nightly perf test (#11643)
RyuKosei
2024-07-24 23:06:08 -0700
-
4499d25c26
LLM: Fix ParallelLMHead convert in vLLM cpu (#11654)
Xiangyu Tian
2024-07-25 13:07:19 +0800
-
777e61d8c8
Fix qwen2 & int4 on NPU (#11646)
binbin Deng
2024-07-24 13:14:39 +0800
-
1b3b46e54d
fix chatglm new model (#11639)
Yishuo Wang
2024-07-23 13:44:56 +0800
-
7f80db95eb
Change run.py in benchmark to support phi-3-vision in arc-perf (#11638)
Xu, Shuo
2024-07-23 09:51:36 +0800
-
060792a648
LLM: Refine Pipeline Parallel FastAPI (#11587)
Xiangyu Tian
2024-07-22 15:52:05 +0800
-
4d56ef5646
Fix openssf issue (#11632)
Shaojun Liu
2024-07-22 14:14:28 +0800
-
ac97b31664
update cpp quickstart about ONEAPI_DEVICE_SELECTOR (#11630)
Ruonan Wang
2024-07-22 08:40:28 +0300
-
af6d406178
Add section title for conduct graphrag indexing (#11628)
Yuwen Hu
2024-07-22 10:23:26 +0800
-
1eed0635f2
Add lightweight serving and support tgi parameter (#11600)
Wang, Jian4
2024-07-19 13:15:56 +0800
-
d27a8cd08c
Fix Pipeline Parallel dtype (#11623)
Xiangyu Tian
2024-07-19 13:07:40 +0800
-
d020ad6397
add save_low_bit support for DiskEmbedding (#11621)
Yishuo Wang
2024-07-19 10:34:53 +0800
-
380717f50d
fix gemma for 4.41 (#11531)
Guoqiong Song
2024-07-18 15:02:50 -0700
-
5a6211fd56
fix minicpm for transformers>=4.39 (#11533)
Guoqiong Song
2024-07-18 15:01:57 -0700
-
0209427cf4
Add disk_embedding parameter to support put Embedding layer on CPU (#11617)
Yishuo Wang
2024-07-18 17:06:06 +0800
-
2478e2c14b
Add check in iGPU perf workflow for results integrity (#11616)
Yuwen Hu
2024-07-18 14:13:16 +0800
-
4594a3dd6c
LLM: Fix DummyLayer.weight device in Pipeline Parallel (#11612)
Xiangyu Tian
2024-07-18 13:39:34 +0800
-
4da93709b1
update doc/setup to use onednn gemm for cpp (#11598)
Ruonan Wang
2024-07-18 08:04:38 +0300
-
f4077fa905
fix llama3-8b npu long input stuck (#11613)
Yishuo Wang
2024-07-18 11:08:17 +0800
-
e5c0058c0e
fix baichuan (#11606)
Zhao Changmin
2024-07-18 09:43:36 +0800
-
bfcdc35b04
phi-3 on "transformers>=4.37.0,<=4.42.3" (#11534)
Guoqiong Song
2024-07-17 17:19:57 -0700
-
d64711900a
Fix cohere model on transformers>=4.41 (#11575)
Guoqiong Song
2024-07-17 17:18:59 -0700
-
5b6eb85b85
phi model readme (#11595)
Guoqiong Song
2024-07-17 17:18:34 -0700
-
2b17536424
Fix python style check: update python version to 3.11 (#11601)
Shaojun Liu
2024-07-17 15:39:46 +0800
-
9c15abf825
Refactor fastapi-serving and add one card serving(#11581)
Wang, Jian4
2024-07-17 11:12:43 +0800
-
373ccbbb0c
Update README.md (#11592)
Jason Dai
2024-07-16 22:13:43 +0800
-
5837bc0014
fix chatglm3 npu output (#11590)
Yishuo Wang
2024-07-16 18:16:30 +0800
-
06930ab258
Enable ipex-llm optimization for lm head (#11589)
Guancheng Fu
2024-07-16 16:48:44 +0800
-
365adad59f
Support LoRA ChatGLM with Alpaca Dataset (#11580)
Heyang Sun
2024-07-16 15:40:02 +0800
-
99c22745b2
fix qwen 14b fp6 abnormal output (#11583)
Yina Chen
2024-07-16 05:59:00 +0300
-
c279849d27
add disk embedding api (#11585)
Yishuo Wang
2024-07-16 10:43:39 +0800
-
79c742dfd5
LLM: Add XPU Memory Optimizations for Pipeline Parallel (#11567)
Xiangyu Tian
2024-07-16 09:44:50 +0800
-
f06d2f72fb
Add GraphRAG QuickStart (#11582)
Yuwen Hu
2024-07-16 09:27:54 +0800
-
91409ffe8c
Add mtl AOT packages in faq.md (#11577)
Xin Qiu
2024-07-16 08:46:03 +0800
-
50cf563a71
Add example: MiniCPM-V (#11570)
Ch1y0q
2024-07-15 10:55:48 +0800
-
06745e5742
Add npu benchmark all-in-one script (#11571)
Zhao Changmin
2024-07-15 10:42:37 +0800
-
019da6c0ab
use mlp silu_mul fusion in qwen2 to optimize memory usage (#11574)
Yishuo Wang
2024-07-13 16:32:54 +0800
-
13a72dc51d
Test MiniCPM performance on iGPU in a more stable way (#11573)
Xu, Shuo
2024-07-12 17:07:41 +0800
-
0981b72275
Fix /generate_stream api in Pipeline Parallel FastAPI (#11569)
Xiangyu Tian
2024-07-12 13:19:42 +0800
-
a945500a98
fix internlm xcomposser stream chat (#11564)
Yishuo Wang
2024-07-11 18:21:17 +0800
-
b9c66994a5
add npu sdp (#11562)
Zhao Changmin
2024-07-11 16:57:35 +0800
-
2b8ad8731e
Support pipeline parallel for glm-4v (#11545)
binbin Deng
2024-07-11 16:06:06 +0800
-
7f5111a998
LLM: Refine start script for Pipeline Parallel Serving (#11557)
Xiangyu Tian
2024-07-11 15:45:27 +0800
-
1355b2ce06
Add model Qwen-VL-Chat to iGPU-perf (#11558)
Xu, Shuo
2024-07-11 15:39:02 +0800
-
105e124752
optimize phi3-v encoder npu performance and add multimodal example (#11553)
Zhao Changmin
2024-07-11 13:59:14 +0800
-
70ab1a6f1a
LLM: unify memory optimization env variables. (#11549)
Cengguang Zhang
2024-07-11 11:01:28 +0800
-
51f2effb05
Add xpu-tgi manually_build (#11556)
Wang, Jian4
2024-07-11 10:35:40 +0800
-
028ad4f63c
Add model phi-3-vision-128k-instruct to iGPU-perf benchmark (#11554)
Xu, Shuo
2024-07-10 17:26:30 +0800
-
994e49a510
optimize internlm xcomposser performance again (#11551)
Yishuo Wang
2024-07-10 17:08:56 +0800
-
61613b210c
try to improve MIniCPM performance (#11552)
Xu, Shuo
2024-07-10 16:58:23 +0800
-
82f9514303
optimize internlm xcomposer2 performance (#11550)
Yishuo Wang
2024-07-10 15:57:04 +0800
-
3c16c9f725
Optimize baichuan on NPU (#11548)
Zhao Changmin
2024-07-10 13:18:48 +0800
-
8982ab73d5
Add Yi-6B and StableLM to iGPU perf test (#11546)
Yuwen Hu
2024-07-09 18:51:23 +0800
-
7dc6756d86
add disk embedding (#11543)
Yishuo Wang
2024-07-09 17:38:40 +0800
-
76a5802acf
update NPU examples (#11540)
Zhao Changmin
2024-07-09 17:19:42 +0800
-
99b2802d3b
optimize qewn2 memory (#11535)
Yishuo Wang
2024-07-09 17:14:01 +0800
-
2929eb262e
support npu glm4 (#11539)
Yishuo Wang
2024-07-09 15:46:49 +0800
-
a1cede926d
Fix update_kv_cache in Pipeline-Parallel-Serving for glm4-9b model (#11537)
Xiangyu Tian
2024-07-09 14:08:04 +0800
-
fa81dbefd3
LLM: update multi gpu write csv in all-in-one benchmark. (#11538)
Cengguang Zhang
2024-07-09 11:14:17 +0800
-
69701b3ec8
fix typo in python/llm/scripts/README.md (#11536)
Xin Qiu
2024-07-09 09:53:14 +0800
-
099486afb7
Update README.md (#11530)
Jason Dai
2024-07-08 20:18:41 +0800
-
66f6ffe4b2
Update GPU HF-Transformers example structure (#11526)
binbin Deng
2024-07-08 17:58:06 +0800
-
f9a199900d
add model RWKV/v5-Eagle-7B-HF to igpu benchmark (#11528)
Xu, Shuo
2024-07-08 15:50:16 +0800
-
9b37ca6027
remove (#11527)
Shaojun Liu
2024-07-08 15:49:52 +0800
-
c26651f91f
add mistral npu support (#11523)
Yishuo Wang
2024-07-08 13:17:15 +0800
-
5a57e54400
[ADD] add 5 new models for igpu-perf (#11524)
Jun Wang
2024-07-08 11:12:15 +0800
-
64cfed602d
Add new models to benchmark (#11505)
Xu, Shuo
2024-07-08 10:35:55 +0800
-
252426793b
Fix setting of use_quantize_kv_cache on different GPU in pipeline parallel (#11516)
binbin Deng
2024-07-08 09:27:01 +0800
-
7cb09a8eac
optimize qwen2 memory usage again (#11520)
Yishuo Wang
2024-07-05 17:32:34 +0800
-
8f376e5192
Change igpu perf to mainly test int4+fp16 (#11513)
Yuwen Hu
2024-07-05 17:12:33 +0800
-
1efb6ebe93
[ADD] add transformer_int4_fp16_loadlowbit_gpu_win api (#11511)
Jun Wang
2024-07-05 16:38:41 +0800
-
f7e957aaf9
Clean npu dtype branch (#11515)
Zhao Changmin
2024-07-05 15:45:26 +0800
-
14ce058004
add chatglm3 npu support (#11518)
Yishuo Wang
2024-07-05 15:31:27 +0800
-
a31f2cbe13
update minicpm.py (#11517)
Xin Qiu
2024-07-05 15:25:44 +0800
-
24de13fc45
Optimize stablelm on NPU (#11512)
Zhao Changmin
2024-07-05 14:21:57 +0800