Commit graph

  • 762ad49362
    Add RANK_WAIT_TIME into DeepSpeed-AutoTP to avoid CPU memory OOM (#11704) Qiyuan Gong 2024-08-01 18:16:21 +0800
  • 8ef4caaf5d
    add 3k and 4k input of nightly perf test on iGPU (#11701) hxsz1997 2024-08-01 09:17:46 +0300
  • afeca38a47
    Fix import vllm condition (#11682) Guancheng Fu 2024-07-31 13:50:01 +0800
  • 54bf3a23a6
    add fallback for unsupported k-quants (#11691) Ruonan Wang 2024-07-31 06:39:58 +0300
  • 5079ed9e06
    Add Llama3.1 example (#11689) Zijie Li 2024-07-31 10:53:30 +0800
  • 6e3ce28173
    Upgrade glm-4 example transformers version (#11659) Jin, Qiao 2024-07-31 10:24:50 +0800
  • a44ab32153
    Switch to conhost when running on NPU (#11687) Jin, Qiao 2024-07-30 17:08:06 +0800
  • b119825152
    Remove tgi parameter validation (#11688) Wang, Jian4 2024-07-30 16:37:44 +0800
  • 670ad887fc
    Qwen support compress kv (#11680) Yina Chen 2024-07-30 06:16:42 +0300
  • 9b36877897
    disable default quantize_kv of GQA on MTL (#11679) hxsz1997 2024-07-30 04:38:46 +0300
  • c02003925b
    add mlp for gemma2 (#11678) Yishuo Wang 2024-07-29 16:10:23 +0800
  • 1da1f1dd0e
    Combine two versions of run_wikitext.py (#11597) RyuKosei 2024-07-29 00:56:16 -0700
  • 6f999e6e90
    add sdp for gemma2 (#11677) Yishuo Wang 2024-07-29 15:15:47 +0800
  • c11d5301d7
    add sdp fp8 for llama (#11671) Ruonan Wang 2024-07-29 08:46:22 +0300
  • 7f88ce23cd
    add more gemma2 optimization (#11673) Yishuo Wang 2024-07-29 11:13:00 +0800
  • 3e8819734b
    add basic gemma2 optimization (#11672) Yishuo Wang 2024-07-29 10:46:51 +0800
  • 418640e466
    Update install_gpu.md Jason Dai 2024-07-27 08:30:10 +0800
  • 336dfc04b1
    fix 1482 (#11661) Guoqiong Song 2024-07-26 12:39:09 -0700
  • ba01b85c13
    empty cache only for 1st token but rest token to speed up (#11665) Heyang Sun 2024-07-26 16:46:21 +0800
  • fc7f8feb83
    Support compress kv (#11642) Yina Chen 2024-07-26 11:02:00 +0300
  • 6bcdc6cc8f
    fix qwen2 cpu (#11663) Yishuo Wang 2024-07-26 13:41:51 +0800
  • 23681fbf5c
    Support codegeex4-9b for lightweight-serving (#11648) Wang, Jian4 2024-07-26 09:41:03 +0800
  • 86fc0492f4
    Update oneccl used (#11647) Guancheng Fu 2024-07-26 09:38:39 +0800
  • a4d30a8211
    Change logic for detecting if vllm is available (#11657) Guancheng Fu 2024-07-25 15:24:19 +0800
  • 0c6e0b86c0
    Refine continuation get input_str (#11652) Qiyuan Gong 2024-07-25 14:41:19 +0800
  • 2fbd375a94
    update several models for nightly perf test (#11643) RyuKosei 2024-07-24 23:06:08 -0700
  • 4499d25c26
    LLM: Fix ParallelLMHead convert in vLLM cpu (#11654) Xiangyu Tian 2024-07-25 13:07:19 +0800
  • 777e61d8c8
    Fix qwen2 & int4 on NPU (#11646) binbin Deng 2024-07-24 13:14:39 +0800
  • 1b3b46e54d
    fix chatglm new model (#11639) Yishuo Wang 2024-07-23 13:44:56 +0800
  • 7f80db95eb
    Change run.py in benchmark to support phi-3-vision in arc-perf (#11638) Xu, Shuo 2024-07-23 09:51:36 +0800
  • 060792a648
    LLM: Refine Pipeline Parallel FastAPI (#11587) Xiangyu Tian 2024-07-22 15:52:05 +0800
  • 4d56ef5646
    Fix openssf issue (#11632) Shaojun Liu 2024-07-22 14:14:28 +0800
  • ac97b31664
    update cpp quickstart about ONEAPI_DEVICE_SELECTOR (#11630) Ruonan Wang 2024-07-22 08:40:28 +0300
  • af6d406178
    Add section title for conduct graphrag indexing (#11628) Yuwen Hu 2024-07-22 10:23:26 +0800
  • 1eed0635f2
    Add lightweight serving and support tgi parameter (#11600) Wang, Jian4 2024-07-19 13:15:56 +0800
  • d27a8cd08c
    Fix Pipeline Parallel dtype (#11623) Xiangyu Tian 2024-07-19 13:07:40 +0800
  • d020ad6397
    add save_low_bit support for DiskEmbedding (#11621) Yishuo Wang 2024-07-19 10:34:53 +0800
  • 380717f50d
    fix gemma for 4.41 (#11531) Guoqiong Song 2024-07-18 15:02:50 -0700
  • 5a6211fd56
    fix minicpm for transformers>=4.39 (#11533) Guoqiong Song 2024-07-18 15:01:57 -0700
  • 0209427cf4
    Add disk_embedding parameter to support put Embedding layer on CPU (#11617) Yishuo Wang 2024-07-18 17:06:06 +0800
  • 2478e2c14b
    Add check in iGPU perf workflow for results integrity (#11616) Yuwen Hu 2024-07-18 14:13:16 +0800
  • 4594a3dd6c
    LLM: Fix DummyLayer.weight device in Pipeline Parallel (#11612) Xiangyu Tian 2024-07-18 13:39:34 +0800
  • 4da93709b1
    update doc/setup to use onednn gemm for cpp (#11598) Ruonan Wang 2024-07-18 08:04:38 +0300
  • f4077fa905
    fix llama3-8b npu long input stuck (#11613) Yishuo Wang 2024-07-18 11:08:17 +0800
  • e5c0058c0e
    fix baichuan (#11606) Zhao Changmin 2024-07-18 09:43:36 +0800
  • bfcdc35b04
    phi-3 on "transformers>=4.37.0,<=4.42.3" (#11534) Guoqiong Song 2024-07-17 17:19:57 -0700
  • d64711900a
    Fix cohere model on transformers>=4.41 (#11575) Guoqiong Song 2024-07-17 17:18:59 -0700
  • 5b6eb85b85
    phi model readme (#11595) Guoqiong Song 2024-07-17 17:18:34 -0700
  • 2b17536424
    Fix python style check: update python version to 3.11 (#11601) Shaojun Liu 2024-07-17 15:39:46 +0800
  • 9c15abf825
    Refactor fastapi-serving and add one card serving (#11581) Wang, Jian4 2024-07-17 11:12:43 +0800
  • 373ccbbb0c
    Update README.md (#11592) Jason Dai 2024-07-16 22:13:43 +0800
  • 5837bc0014
    fix chatglm3 npu output (#11590) Yishuo Wang 2024-07-16 18:16:30 +0800
  • 06930ab258
    Enable ipex-llm optimization for lm head (#11589) Guancheng Fu 2024-07-16 16:48:44 +0800
  • 365adad59f
    Support LoRA ChatGLM with Alpaca Dataset (#11580) Heyang Sun 2024-07-16 15:40:02 +0800
  • 99c22745b2
    fix qwen 14b fp6 abnormal output (#11583) Yina Chen 2024-07-16 05:59:00 +0300
  • c279849d27
    add disk embedding api (#11585) Yishuo Wang 2024-07-16 10:43:39 +0800
  • 79c742dfd5
    LLM: Add XPU Memory Optimizations for Pipeline Parallel (#11567) Xiangyu Tian 2024-07-16 09:44:50 +0800
  • f06d2f72fb
    Add GraphRAG QuickStart (#11582) Yuwen Hu 2024-07-16 09:27:54 +0800
  • 91409ffe8c
    Add mtl AOT packages in faq.md (#11577) Xin Qiu 2024-07-16 08:46:03 +0800
  • 50cf563a71
    Add example: MiniCPM-V (#11570) Ch1y0q 2024-07-15 10:55:48 +0800
  • 06745e5742
    Add npu benchmark all-in-one script (#11571) Zhao Changmin 2024-07-15 10:42:37 +0800
  • 019da6c0ab
    use mlp silu_mul fusion in qwen2 to optimize memory usage (#11574) Yishuo Wang 2024-07-13 16:32:54 +0800
  • 13a72dc51d
    Test MiniCPM performance on iGPU in a more stable way (#11573) Xu, Shuo 2024-07-12 17:07:41 +0800
  • 0981b72275
    Fix /generate_stream api in Pipeline Parallel FastAPI (#11569) Xiangyu Tian 2024-07-12 13:19:42 +0800
  • a945500a98
    fix internlm xcomposer stream chat (#11564) Yishuo Wang 2024-07-11 18:21:17 +0800
  • b9c66994a5
    add npu sdp (#11562) Zhao Changmin 2024-07-11 16:57:35 +0800
  • 2b8ad8731e
    Support pipeline parallel for glm-4v (#11545) binbin Deng 2024-07-11 16:06:06 +0800
  • 7f5111a998
    LLM: Refine start script for Pipeline Parallel Serving (#11557) Xiangyu Tian 2024-07-11 15:45:27 +0800
  • 1355b2ce06
    Add model Qwen-VL-Chat to iGPU-perf (#11558) Xu, Shuo 2024-07-11 15:39:02 +0800
  • 105e124752
    optimize phi3-v encoder npu performance and add multimodal example (#11553) Zhao Changmin 2024-07-11 13:59:14 +0800
  • 70ab1a6f1a
    LLM: unify memory optimization env variables. (#11549) Cengguang Zhang 2024-07-11 11:01:28 +0800
  • 51f2effb05
    Add xpu-tgi manually_build (#11556) Wang, Jian4 2024-07-11 10:35:40 +0800
  • 028ad4f63c
    Add model phi-3-vision-128k-instruct to iGPU-perf benchmark (#11554) Xu, Shuo 2024-07-10 17:26:30 +0800
  • 994e49a510
    optimize internlm xcomposer performance again (#11551) Yishuo Wang 2024-07-10 17:08:56 +0800
  • 61613b210c
    try to improve MiniCPM performance (#11552) Xu, Shuo 2024-07-10 16:58:23 +0800
  • 82f9514303
    optimize internlm xcomposer2 performance (#11550) Yishuo Wang 2024-07-10 15:57:04 +0800
  • 3c16c9f725
    Optimize baichuan on NPU (#11548) Zhao Changmin 2024-07-10 13:18:48 +0800
  • 8982ab73d5
    Add Yi-6B and StableLM to iGPU perf test (#11546) Yuwen Hu 2024-07-09 18:51:23 +0800
  • 7dc6756d86
    add disk embedding (#11543) Yishuo Wang 2024-07-09 17:38:40 +0800
  • 76a5802acf
    update NPU examples (#11540) Zhao Changmin 2024-07-09 17:19:42 +0800
  • 99b2802d3b
    optimize qwen2 memory (#11535) Yishuo Wang 2024-07-09 17:14:01 +0800
  • 2929eb262e
    support npu glm4 (#11539) Yishuo Wang 2024-07-09 15:46:49 +0800
  • a1cede926d
    Fix update_kv_cache in Pipeline-Parallel-Serving for glm4-9b model (#11537) Xiangyu Tian 2024-07-09 14:08:04 +0800
  • fa81dbefd3
    LLM: update multi gpu write csv in all-in-one benchmark. (#11538) Cengguang Zhang 2024-07-09 11:14:17 +0800
  • 69701b3ec8
    fix typo in python/llm/scripts/README.md (#11536) Xin Qiu 2024-07-09 09:53:14 +0800
  • 099486afb7
    Update README.md (#11530) Jason Dai 2024-07-08 20:18:41 +0800
  • 66f6ffe4b2
    Update GPU HF-Transformers example structure (#11526) binbin Deng 2024-07-08 17:58:06 +0800
  • f9a199900d
    add model RWKV/v5-Eagle-7B-HF to igpu benchmark (#11528) Xu, Shuo 2024-07-08 15:50:16 +0800
  • 9b37ca6027
    remove (#11527) Shaojun Liu 2024-07-08 15:49:52 +0800
  • c26651f91f
    add mistral npu support (#11523) Yishuo Wang 2024-07-08 13:17:15 +0800
  • 5a57e54400
    [ADD] add 5 new models for igpu-perf (#11524) Jun Wang 2024-07-08 11:12:15 +0800
  • 64cfed602d
    Add new models to benchmark (#11505) Xu, Shuo 2024-07-08 10:35:55 +0800
  • 252426793b
    Fix setting of use_quantize_kv_cache on different GPU in pipeline parallel (#11516) binbin Deng 2024-07-08 09:27:01 +0800
  • 7cb09a8eac
    optimize qwen2 memory usage again (#11520) Yishuo Wang 2024-07-05 17:32:34 +0800
  • 8f376e5192
    Change igpu perf to mainly test int4+fp16 (#11513) Yuwen Hu 2024-07-05 17:12:33 +0800
  • 1efb6ebe93
    [ADD] add transformer_int4_fp16_loadlowbit_gpu_win api (#11511) Jun Wang 2024-07-05 16:38:41 +0800
  • f7e957aaf9
    Clean npu dtype branch (#11515) Zhao Changmin 2024-07-05 15:45:26 +0800
  • 14ce058004
    add chatglm3 npu support (#11518) Yishuo Wang 2024-07-05 15:31:27 +0800
  • a31f2cbe13
    update minicpm.py (#11517) Xin Qiu 2024-07-05 15:25:44 +0800
  • 24de13fc45
    Optimize stablelm on NPU (#11512) Zhao Changmin 2024-07-05 14:21:57 +0800