Jinhe
d0c89fb715
updated llama.cpp and ollama quickstart ( #11732 )
...
* updated llama.cpp and ollama quickstart.md
* added qwen2-1.5B sample output
* revision on quickstart updates
* revision on quickstart updates
* revision on qwen2 readme
* added 2 troubleshoots
* troubleshoot revision
2024-08-08 11:04:01 +08:00
Yishuo Wang
54cc9353db
support and optimize minicpm-v-2_6 ( #11738 )
2024-08-07 18:21:16 +08:00
Yina Chen
e956e71fc1
fix conflict with quant kv ( #11737 )
2024-08-07 18:10:30 +08:00
Ruonan Wang
00a5574c8a
Use merge_qkv to replace fused_qkv for llama2 ( #11727 )
...
* update 4.38
* support new versions
* update
* fix style
* fix style
* update rope
* temp test sdpa
* fix style
* fix cpu ut
2024-08-07 18:04:01 +08:00
Yina Chen
d2abc9711b
Fix MTL 4k input qwen2 compresskv error ( #11734 )
...
* fix
* fix style
2024-08-07 16:21:57 +08:00
Yina Chen
a71ae7c22b
Support minicpm compresskv & modify default compresskv config & default enable compresskv on mtl 2.5k~4.5k ( #11726 )
...
* support minicpm & modify default & default enable on mtl 2.5k~4.5k
* fix style
2024-08-07 11:35:39 +08:00
Yishuo Wang
c093f7d980
fix phi3 ( #11729 )
2024-08-07 09:39:46 +08:00
Qiyuan Gong
e32d13d78c
Remove Out of tree Driver from GPU driver installation document ( #11728 )
...
GPU drivers are already upstreamed to Kernel 6.2+. Remove the out-of-tree driver (intel-i915-dkms) for 6.2-6.5. https://dgpu-docs.intel.com/driver/kernel-driver-types.html#gpu-driver-support
* Remove intel-i915-dkms intel-fw-gpu (only for kernel 5.19)
2024-08-07 09:38:19 +08:00
Zijie Li
e7f7141781
Add benchmark util for transformers 4.42 ( #11725 )
...
* add new benchmark_util.py
Add new benchmark_util.py for transformers>=4.43.1. The old one is renamed to benchmark_util_prev.py.
* Small fix to import code
* Update __init__.py
* fix file names
* Update lint-python
Update lint-python to exclude benchmark_util_4_29.py and benchmark_util_4_43.py
* Update benchmark_util_4_43.py
* add benchmark_util for transformers 4.42
2024-08-07 08:48:07 +08:00
Ch1y0q
4676af2054
add gemma2 example ( #11724 )
...
* add `gemma2`
* update `transformers` version
* update `README.md`
2024-08-06 21:17:50 +08:00
SichengStevenLi
985213614b
Removed no longer needed models for Arc nightly perf ( #11722 )
...
* removed LLMs that are no longer needed
Removed:
mistralai/Mistral-7B-v0.1
deepseek-ai/deepseek-coder-6.7b-instruct
* Update arc-perf-test-batch4.yaml
Removed:
deepseek-ai/deepseek-coder-6.7b-instruct
mistralai/Mistral-7B-v0.1
* Update arc-perf-test.yaml
Removed:
deepseek-ai/deepseek-coder-6.7b-instruct
mistralai/Mistral-7B-v0.1
* Create arc-perf-transformers-438.yaml
* Moved arc-perf-transformers-438.yaml location
* Create arc-perf-transformers-438-batch2.yaml
* Create arc-perf-transformers-438-batch4.yaml
* Delete python/llm/test/benchmark/arc-perf-transformers-438-batch2.yaml
* Delete python/llm/test/benchmark/arc-perf-transformers-438-batch4.yaml
* Delete python/llm/test/benchmark/arc-perf-transformers-438.yaml
2024-08-06 16:12:00 +08:00
Yishuo Wang
929675aa6b
support latest phi3 ( #11721 )
2024-08-06 15:52:55 +08:00
Jin, Qiao
11650b6f81
upgrade glm-4v example transformers version ( #11719 )
2024-08-06 14:55:09 +08:00
Yishuo Wang
bbdff6edeb
optimize internvl2 4b performance ( #11720 )
2024-08-06 14:25:08 +08:00
Yishuo Wang
f44b732aa8
support internvl2-4b ( #11718 )
2024-08-06 13:36:32 +08:00
Jin, Qiao
7f241133da
Add MiniCPM-Llama3-V-2_5 GPU example ( #11693 )
...
* Add MiniCPM-Llama3-V-2_5 GPU example
* fix
2024-08-06 10:22:41 +08:00
Jin, Qiao
808d9a7bae
Add MiniCPM-V-2 GPU example ( #11699 )
...
* Add MiniCPM-V-2 GPU example
* add example in README.md
* add example in README.md
2024-08-06 10:22:33 +08:00
Zijie Li
8fb36b9f4a
add new benchmark_util.py ( #11713 )
...
* add new benchmark_util.py
2024-08-05 16:18:48 +08:00
Wang, Jian4
493cbd9a36
Support lightweight-serving with internlm-xcomposer2-vl-7b multimodal input ( #11703 )
...
* init image_list
* enable internlm-xcomposer2 image input
* update style
* add readme
* update model
* update readme
2024-08-05 09:36:04 +08:00
Ruonan Wang
aa98ef96fe
change mixed_precision to q6_k ( #11706 )
2024-08-02 15:55:16 +08:00
Xiangyu Tian
1baa3efe0e
Optimizations for Pipeline Parallel Serving ( #11702 )
...
Optimizations for Pipeline Parallel Serving
2024-08-02 12:06:59 +08:00
Yina Chen
8d1e0bd2f4
add sdp causal support in llama ( #11705 )
2024-08-02 10:27:40 +08:00
Ruonan Wang
736a7ef72e
add sdp_causal for mistral 4.36 ( #11686 )
...
* add sdp_causal for mistral
* fix
* update
2024-08-01 18:57:31 +08:00
Yina Chen
45c730ff39
Chatglm support compresskv ( #11690 )
...
* chatglm4 support compresskv
* fix
* fix style
* support chatglm2
* fix quantkv conflict
* fix style
2024-08-01 18:20:20 +08:00
Qiyuan Gong
762ad49362
Add RANK_WAIT_TIME into DeepSpeed-AutoTP to avoid CPU memory OOM ( #11704 )
...
* DeepSpeed-AutoTP starts multiple processes that load models and convert them in CPU memory. If the model or the number of ranks is large, this can lead to OOM. Add RANK_WAIT_TIME to reduce memory usage by controlling model reading parallelism.
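The idea of limiting load parallelism can be illustrated with a minimal sketch. It assumes that RANK_WAIT_TIME is read as a number of seconds and that each rank delays its model load by `rank * RANK_WAIT_TIME`; that policy and the helper names `stagger_delay` and `wait_for_turn` are hypothetical and are not taken from the actual change.

```python
import os
import time


def stagger_delay(rank: int, wait_seconds: float) -> float:
    """Return how long a rank waits before loading (assumed policy: rank * wait)."""
    if rank < 0 or wait_seconds < 0:
        raise ValueError("rank and wait_seconds must be non-negative")
    return rank * wait_seconds


def wait_for_turn(rank: int, sleep=time.sleep) -> float:
    """Read RANK_WAIT_TIME (assumed to be seconds, default 0) and sleep accordingly."""
    wait_seconds = float(os.environ.get("RANK_WAIT_TIME", "0"))
    delay = stagger_delay(rank, wait_seconds)
    if delay > 0:
        # Later ranks start loading later, so fewer copies of the model
        # are being read and converted in CPU memory at the same moment.
        sleep(delay)
    return delay
```

Under these assumptions, each process would call `wait_for_turn(local_rank)` just before loading the model. Leaving RANK_WAIT_TIME unset gives a delay of zero for every rank, so all ranks load at once.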
2024-08-01 18:16:21 +08:00
hxsz1997
8ef4caaf5d
add 3k and 4k input of nightly perf test on iGPU ( #11701 )
...
* Add 3k&4k input in workflow for iGPU (#11685 )
* add 3k&4k input in workflow
* comment for test
* comment models for accelerate test
* remove OOM models
* modify typo
* change test model (#11696 )
* reverse test models (#11700 )
2024-08-01 14:17:46 +08:00
Guancheng Fu
afeca38a47
Fix import vllm condition ( #11682 )
2024-07-31 13:50:01 +08:00
Ruonan Wang
54bf3a23a6
add fallback for unsupported k-quants ( #11691 )
...
* add fallback
* fix style
* fix
2024-07-31 11:39:58 +08:00
Zijie Li
5079ed9e06
Add Llama3.1 example ( #11689 )
...
* Add Llama3.1 example
Add Llama3.1 example for Linux arc and Windows MTL
* Changes made to adjust compatibilities
transformers changed to 4.43.1
* Update index.rst
* Update README.md
* Update index.rst
* Update index.rst
* Update index.rst
2024-07-31 10:53:30 +08:00
Jin, Qiao
6e3ce28173
Upgrade glm-4 example transformers version ( #11659 )
...
* upgrade glm-4 example transformers version
* move pip install in one line
2024-07-31 10:24:50 +08:00
Jin, Qiao
a44ab32153
Switch to conhost when running on NPU ( #11687 )
2024-07-30 17:08:06 +08:00
Wang, Jian4
b119825152
Remove tgi parameter validation ( #11688 )
...
* remove validation
* add min warm up
* remove no need source
2024-07-30 16:37:44 +08:00
Yina Chen
670ad887fc
Qwen support compress kv ( #11680 )
...
* Qwen support compress kv
* fix style
* fix
2024-07-30 11:16:42 +08:00
hxsz1997
9b36877897
disable default quantize_kv of GQA on MTL ( #11679 )
...
* disable default quantizekv of gqa in mtl
* fix style
* fix style
* fix style
* fix style
* fix style
* fix style
2024-07-30 09:38:46 +08:00
Yishuo Wang
c02003925b
add mlp for gemma2 ( #11678 )
2024-07-29 16:10:23 +08:00
RyuKosei
1da1f1dd0e
Combine two versions of run_wikitext.py ( #11597 )
...
* Combine two versions of run_wikitext.py
* Update run_wikitext.py
* Update run_wikitext.py
* aligned the format
* update error display
* simplified argument parser
---------
Co-authored-by: jenniew <jenniewang123@gmail.com>
2024-07-29 15:56:16 +08:00
Yishuo Wang
6f999e6e90
add sdp for gemma2 ( #11677 )
2024-07-29 15:15:47 +08:00
Ruonan Wang
c11d5301d7
add sdp fp8 for llama ( #11671 )
...
* add sdp fp8 for llama
* fix style
* refactor
2024-07-29 13:46:22 +08:00
Yishuo Wang
7f88ce23cd
add more gemma2 optimization ( #11673 )
2024-07-29 11:13:00 +08:00
Yishuo Wang
3e8819734b
add basic gemma2 optimization ( #11672 )
2024-07-29 10:46:51 +08:00
Jason Dai
418640e466
Update install_gpu.md
2024-07-27 08:30:10 +08:00
Guoqiong Song
336dfc04b1
fix 1482 ( #11661 )
...
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-07-26 12:39:09 -07:00
Heyang Sun
ba01b85c13
empty cache only for the 1st token, not for the remaining tokens, to speed up ( #11665 )
2024-07-26 16:46:21 +08:00
Yina Chen
fc7f8feb83
Support compress kv ( #11642 )
...
* mistral snapkv
* update
* mtl update
* update
* update
* update
* add comments
* style fix
* fix style
* support llama
* llama use compress kv
* support mistral 4.40
* fix style
* support diff transformers versions
* move snapkv util to kv
* fix style
* meet comments & small fix
* revert all in one
* fix indent
---------
Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2024-07-26 16:02:00 +08:00
Yishuo Wang
6bcdc6cc8f
fix qwen2 cpu ( #11663 )
2024-07-26 13:41:51 +08:00
Wang, Jian4
23681fbf5c
Support codegeex4-9b for lightweight-serving ( #11648 )
...
* add options, support prompt and not return end_token
* enable openai parameter
* set do_sample None and update style
2024-07-26 09:41:03 +08:00
Guancheng Fu
86fc0492f4
Update oneccl used ( #11647 )
...
* Add internal oneccl
* fix
* fix
* add oneccl
2024-07-26 09:38:39 +08:00
Guancheng Fu
a4d30a8211
Change logic for detecting if vllm is available ( #11657 )
...
* fix
* fix
2024-07-25 15:24:19 +08:00
Qiyuan Gong
0c6e0b86c0
Refine continuation get input_str ( #11652 )
...
* Remove duplicate code in continuation get input_str.
* Avoid infinite loop in all-in-one due to test_length not in the list.
2024-07-25 14:41:19 +08:00
RyuKosei
2fbd375a94
update several models for nightly perf test ( #11643 )
...
Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-07-25 14:06:08 +08:00