Commit graph

  • 6f3441ba4c
    fix glm4-9b overflow (#12455) Yishuo Wang 2024-11-27 17:39:13 +0800
  • 281c9b0bb9
    [NPU] Add L0 support for NPU C++ (#12454) Ruonan Wang 2024-11-27 01:04:13 -0800
  • ce6fcaa9ba
    update transformers version in glm4 example (#12453) Chu,Youcheng 2024-11-27 15:02:25 +0800
  • effb9bb41c
    Small update to LangChain examples readme (#12452) Yuwen Hu 2024-11-27 14:02:25 +0800
  • acd77d9e87
    Remove env variable BIGDL_LLM_XMX_DISABLED in documentation (#12445) Chu,Youcheng 2024-11-27 11:16:36 +0800
  • f8c2bb2943
    [NPU] optimize qwen2 prefill performance for C++ (#12451) Ruonan Wang 2024-11-26 18:46:18 -0800
  • 8331875f34
    Fix (#12390) Guancheng Fu 2024-11-27 10:41:58 +0800
  • cb7b08948b
    update vllm-docker-quick-start for vllm 0.6.2 (#12392) Jun Wang 2024-11-27 08:47:03 +0800
  • 7b40f9b372
    [NPU] Support GW for NPU C++ (#12450) Ruonan Wang 2024-11-26 01:46:40 -0800
  • c2efa264d9
    Update LangChain examples to use upstream (#12388) Jin, Qiao 2024-11-26 16:43:15 +0800
  • 24b46b2b19
    [NPU] further fix of qwen2 int8 pipeline & C++ (#12449) Ruonan Wang 2024-11-26 00:39:39 -0800
  • 303b104c10
    Fix abnormal output for Qwen2-7B when sym_int8 (#12446) Yuwen Hu 2024-11-26 15:53:04 +0800
  • 71e1f11aa6
    update serving image runtime (#12433) Pepijn de Vos 2024-11-26 07:55:30 +0100
  • 52c17fe104
    Optimize first token of C++ NPU by adding npu_dpu_groups (#12443) Ruonan Wang 2024-11-25 19:41:32 -0800
  • 66bd7abae4
    add sdxl and lora-lcm optimization (#12444) Jinhe 2024-11-26 11:38:09 +0800
  • 0e23bd779f
    Add support of llama3.2 for NPU C++ (#12442) Ruonan Wang 2024-11-25 17:26:55 -0800
  • cdd41f5e4c
    optimize sdxl again (#12441) Yishuo Wang 2024-11-25 17:46:46 +0800
  • b9abb8a285
    Support qwen2.5 3B for NPU & update related examples (#12438) Ruonan Wang 2024-11-25 00:38:31 -0800
  • b633fbf26c
    add Chinese prompt troubleshooting for npu cpp examples (#12437) Jinhe 2024-11-25 15:28:47 +0800
  • 8164aed802
    small change (#12439) Yishuo Wang 2024-11-25 14:35:49 +0800
  • be132c4209
    fix and optimize sd (#12436) Yishuo Wang 2024-11-25 14:09:48 +0800
  • f41405368a
    Support minicpm for NPU C++ (#12434) Ruonan Wang 2024-11-24 18:42:02 -0800
  • 0819fad34e
    support Llama2-7B / Llama3-8B for NPU C++ (#12431) Ruonan Wang 2024-11-22 02:47:19 -0800
  • 4ffa6c752c
    New convert support for C++ NPU (#12430) Ruonan Wang 2024-11-21 22:28:30 -0800
  • c089b6c10d
    Update english prompt to 34k (#12429) Shaojun Liu 2024-11-22 11:20:35 +0800
  • e61ae88c5b
    Upgrade dependency for xpu_lnl and xpu_arl option (#12424) Yuwen Hu 2024-11-21 18:37:15 +0800
  • 2935e97610
    small fix of cpp readme (#12425) Ruonan Wang 2024-11-21 02:21:34 -0800
  • 8fdc36c140
    Optimize with new batch kernel when batch_size=1 on LNL (#12419) Yuwen Hu 2024-11-21 16:21:35 +0800
  • 7e0a840f74
    add optimization to openjourney (#12423) Jinhe 2024-11-21 15:23:51 +0800
  • 145e8b480f
    update batch kernel condition (#12421) Yishuo Wang 2024-11-21 10:12:46 +0800
  • 7288c759ce
    Initial NPU C++ Example (#12417) Ruonan Wang 2024-11-20 18:09:26 -0800
  • d2a37b6ab2
    add Stable diffusion examples (#12418) Jinhe 2024-11-20 17:18:36 +0800
  • 54c62feb74
    [NPU] dump prefill IR for further C++ solution (#12402) Ruonan Wang 2024-11-19 23:20:05 -0800
  • 1bfcbc0640
    Add multimodal benchmark (#12415) Wang, Jian4 2024-11-20 14:21:13 +0800
  • ff3f7cb25f
    Fix speech_paraformer issue with unexpected changes (#12416) SONG Ge 2024-11-18 23:01:20 -0800
  • a9cb70a71c
    Add install_windows_gpu.zh-CN.md and install_linux_gpu.zh-CN.md (#12409) joan726 2024-11-19 14:39:53 +0800
  • d6057f6dd2
    Update benchmark_vllm_throughput.py (#12414) Guancheng Fu 2024-11-19 10:41:43 +0800
  • a69395f31f
    Support performance mode of GLM4 model (#12401) Yuwen Hu 2024-11-18 18:46:52 +0800
  • d2c821d458
    Add missing arguments in pipeline parallel generate method (#12142) Song Fuchang 2024-11-18 13:50:18 +0800
  • 3d5fbf2069
    update batch kernel condition (#12408) Yishuo Wang 2024-11-15 13:47:05 +0800
  • 6c5e8fc70c
    fix again (#12407) Ruonan Wang 2024-11-15 11:57:58 +0800
  • fcc0fa7316
    fix workflow again (#12406) Ruonan Wang 2024-11-15 11:01:35 +0800
  • d1cde7fac4
    Tiny doc fix (#12405) Yuwen Hu 2024-11-15 10:28:38 +0800
  • 548dec5185
    fix npu pipeline workflow (#12404) Ruonan Wang 2024-11-15 10:01:33 +0800
  • d4d949443f
    [NPU] change attention_mask to fp16 (#12400) binbin Deng 2024-11-14 17:20:29 +0800
  • 7e50ff113c
    Add padding_token=eos_token for GPU trl QLora example (#12398) Qiyuan Gong 2024-11-14 10:51:30 +0800
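    The fix this commit title describes is a common pattern for decoder-only models, whose tokenizers ship without a pad token. A minimal illustrative sketch (gpt2 is a stand-in model, not necessarily the one the TRL QLoRA example uses):

      # Hypothetical illustration of the padding fix named above; gpt2 is
      # only a stand-in for the example's actual model.
      from transformers import AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("gpt2")
      if tokenizer.pad_token is None:
          # Decoder-only tokenizers often lack a pad token, which breaks
          # batched training; reusing EOS as padding is the usual fix.
          tokenizer.pad_token = tokenizer.eos_token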
  • d2cbcb060c
    Add initial support for modeling_xlm encoder on NPU (#12393) SONG Ge 2024-11-14 10:50:27 +0800
  • 6726b198fd
    Update readme & doc for the vllm upgrade to v0.6.2 (#12399) Xu, Shuo 2024-11-14 10:28:15 +0800
  • 59b01fa7d2
    small fix (#12397) Yina Chen 2024-11-14 04:03:36 +0200
  • 00fce5c940
    use new q4_0 batch kernel (#12396) Yishuo Wang 2024-11-13 18:37:34 +0800
  • d6d63d6b84
    [NPU] Qwen prefill attn_mask type hotfix (#12395) Yina Chen 2024-11-13 11:51:34 +0200
  • 9220babaab
    qwen prefill attn_mask type fp16 (#12394) Yina Chen 2024-11-13 11:45:26 +0200
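    These two hotfixes both concern the dtype of the prefill attention mask. A hedged sketch of the general idea in plain PyTorch (illustrative only, not ipex-llm's NPU code): build the causal mask, then cast it to fp16 before handing it to the compiled graph.

      # Illustrative only: a causal additive attention mask materialized
      # in fp16, the dtype these commit titles say the prefill path expects.
      import torch

      seq_len = 8
      mask = torch.full((seq_len, seq_len), torch.finfo(torch.float16).min)
      mask = torch.triu(mask, diagonal=1)      # mask out future positions
      attention_mask = mask.to(torch.float16)  # cast to fp16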
  • 1158f91648
    Fix llava with multi-image inputs (#12384) Yuwen Hu 2024-11-13 09:27:50 +0800
  • 27152476e1
    minor fix (#12389) Shaojun Liu 2024-11-12 22:36:43 +0800
  • dd8964ba9c
    changed inference-cpp/Dockerfile (#12386) Xu, Shuo 2024-11-12 20:40:21 +0800
  • 0ee54fc55f
    Upgrade to vllm 0.6.2 (#12338) Guancheng Fu 2024-11-12 20:35:34 +0800
  • 4376fdee62
    Decouple open-webui and ollama in inference-cpp-xpu dockerfile (#12382) Jun Wang 2024-11-12 20:15:23 +0800
  • 6bf5a8c230
    [NPU] Update qwen2 compile config (#12383) Ruonan Wang 2024-11-12 16:59:44 +0800
  • 7a97fbb779
    Support vpm and resampler module of minicpm-v on NPU (#12375) binbin Deng 2024-11-12 15:59:55 +0800
  • 85c9279e6e
    Update llama-cpp docker usage (#12387) Wang, Jian4 2024-11-12 15:30:17 +0800
  • c92d76b997
    Update oneccl-binding.patch (#12377) Shaojun Liu 2024-11-11 22:34:08 +0800
  • e0918934c8
    Add fused_mlp to glm4v models (#12378) Yuwen Hu 2024-11-11 17:10:25 +0800
  • dc34e8c51f
    optimize glm4v vision attention (#12369) Yishuo Wang 2024-11-08 17:01:57 +0800
  • 2dfcc36825
    Fix trl version and padding in trl qlora example (#12368) Qiyuan Gong 2024-11-08 16:05:17 +0800
  • fad15c8ca0
    Update fastchat demo script (#12367) Shaojun Liu 2024-11-08 15:42:17 +0800
  • 51f7f87768
    fix ipex 2.3 bug (#12366) Yishuo Wang 2024-11-08 13:29:15 +0800
  • b2e69a896c
    [NPU] Support Baichuan groupwise & gw code refactor (#12337) Yina Chen 2024-11-08 05:42:42 +0200
  • 812d5cc32e
    [NPU L0] Support llama3.2 in L0 pipeline (#12361) binbin Deng 2024-11-08 10:01:23 +0800
  • 7ef7696956
    update linux installation doc (#12365) Xin Qiu 2024-11-08 09:44:58 +0800
  • 8fe294e01f
    Small fix to all-in-one benchmark (#12362) Yuwen Hu 2024-11-07 18:56:34 +0800
  • 1a6cbc473f
    Add fused mlp optimizations to glm4 models (#12360) Yuwen Hu 2024-11-07 18:52:47 +0800
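    "Fused MLP" here refers to the standard trick of merging an MLP's gate and up projections into a single matmul. A minimal sketch of that idea in plain PyTorch (illustrative; not the glm4 code itself):

      # Illustrative SwiGLU MLP with the gate and up projections fused
      # into one Linear, halving the matmul launches per MLP layer.
      import torch
      import torch.nn.functional as F

      class FusedSwiGLU(torch.nn.Module):
          def __init__(self, hidden: int, intermediate: int):
              super().__init__()
              self.gate_up = torch.nn.Linear(hidden, 2 * intermediate, bias=False)
              self.down = torch.nn.Linear(intermediate, hidden, bias=False)

          def forward(self, x: torch.Tensor) -> torch.Tensor:
              gate, up = self.gate_up(x).chunk(2, dim=-1)
              return self.down(F.silu(gate) * up)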
  • 520af4e9b5
    Update install_linux_gpu.md (#12353) Xin Qiu 2024-11-07 16:08:01 +0800
  • ad68c56573
    small improvement (#12359) Yishuo Wang 2024-11-07 15:57:41 +0800
  • 71ea539351
    Add troubleshootings for ollama and llama.cpp (#12358) Jinhe 2024-11-07 15:49:20 +0800
  • ce0c6ae423
    Update Readme for FastChat docker demo (#12354) Xu, Shuo 2024-11-07 15:22:42 +0800
  • d880e534d2
    [NPU] acclib llama3.2 support groupwise (#12355) Yina Chen 2024-11-07 05:19:55 +0200
  • 79f2877413
    add minicpm-v models to transformers_int4_npu_win api (#12352) Jinhe 2024-11-07 10:05:10 +0800
  • a7b66683f1
    [NPU] Add Optimized Support for Llama3.2-1B/3B on NPU (#12339) SONG Ge 2024-11-06 19:21:40 +0800
  • 872a74481a
    Small optimization to glm4 models (#12351) Yuwen Hu 2024-11-06 19:16:58 +0800
  • c267355b35
    fix three NPU benchmark issues (#12350) Ruonan Wang 2024-11-06 19:01:01 +0800
  • f24352aef9
    llama 3.1/3.2 support compresskv (#12347) Yina Chen 2024-11-06 11:33:43 +0200
  • d984c0672a
    Add MiniCPM-V-2_6 to arc perf test (#12349) Jin, Qiao 2024-11-06 16:32:28 +0800
  • e23ef7d088
    optimize glm4v's vision part (#12346) Yishuo Wang 2024-11-06 15:43:40 +0800
  • c8b7265359
    Add basic glm4v support (#12345) Yishuo Wang 2024-11-06 13:50:10 +0800
  • 69e3a56943
    [NPU] Hot fix of load_low_bit (#12344) binbin Deng 2024-11-06 10:07:00 +0800
  • 899a30331a
    Replace gradio_web_server.patch to adjust webui (#12329) Xu, Shuo 2024-11-06 09:16:32 +0800
  • 7240c283a3
    Add dummy model in iGPU perf (#12341) Jin, Qiao 2024-11-05 17:56:10 +0800
  • 8e9a3a1158
    fix chatglm2 cpu ut (#12336) Zhao Changmin 2024-11-05 16:43:57 +0800
  • d872639395
    [NPU] Llama3, Qwen2 1.5b, MiniCPM 1/2B groupwise support (#12327) Yina Chen 2024-11-05 09:51:31 +0200
  • 82a61b5cf3
    Limit trl version in example (#12332) Jin, Qiao 2024-11-05 14:50:10 +0800
  • 923d696854
    Small fix to LNL performance tests (#12333) Yuwen Hu 2024-11-05 13:24:58 +0800
  • 45b0d371aa
    update benchmark readme (#12323) Zijie Li 2024-11-04 19:19:08 -0500
  • e2adc974fd
    Small fix to LNL performance tests (#12331) Yuwen Hu 2024-11-04 19:22:41 +0800
  • 522cdf8e9d
    Add initial support for LNL nightly performance tests (#12326) Yuwen Hu 2024-11-04 18:53:51 +0800
  • 1b637e4477
    Add chatglm2&3 fuse mlp (#12328) Zhao Changmin 2024-11-04 18:04:41 +0800
  • 94c4ce389f
    [NPU] Add env to disable compile opt (#12330) Yina Chen 2024-11-04 11:46:17 +0200
  • e54af44ed6
    Add transformers_int4_npu_pipeline_win in all-in-one benchmark (#12325) Ch1y0q 2024-11-04 16:00:20 +0800
  • 5ee6f97d6f
    [NPU L0] Add layernorm weight as const / input setting (#12322) binbin Deng 2024-11-04 15:46:24 +0800
  • a01371f90b
    Doc: update harness readme (#12324) Chu,Youcheng 2024-11-04 14:58:54 +0800
  • 4644cb640c
    Perf test further fix regarding trl version (#12321) Yuwen Hu 2024-11-04 11:01:25 +0800