Commit graph

1980 commits

Author SHA1 Message Date
Yuwen Hu
356281cb80
Further all-in-one benchmark update continuation task (#11784)
* Further update prompt for continuation task, and disable lookup candidate update strategy on MTL

* style fix
2024-08-14 14:39:34 +08:00
Ruonan Wang
43cca3be27
fix gemma2 runtime error caused by sliding window (#11788)
* fix runtime error

* revert workflow
2024-08-14 10:43:33 +08:00
Yang Wang
51bcac1229
follow up on experimental support of fused decoder layer for llama2 (#11785)
* clean up and support transpose value cache

* refine

* fix style

* fix style
2024-08-13 18:53:55 -07:00
Yishuo Wang
cb79dcda93
refactor llama convert to fix minicpm-v 2.5 optimization (#11783) 2024-08-14 09:29:57 +08:00
Yina Chen
7cd6ec9723
MiniCPM-V support compresskv (#11779)
* fix check error

* fix other models

* remove print
2024-08-13 19:03:40 +08:00
Qiyuan Gong
3998de14f0
Fix mistral forward_qkv in q4_0 (#11781)
* Fix mistral forward_qkv without self.rotary_emb.base in q4_0.
* Replace apply_rotary_pos_emb_no_cache_xpu with rotary_half_inplaced.
* Revert https://github.com/intel-analytics/ipex-llm/pull/11765
2024-08-13 16:48:19 +08:00
Heyang Sun
70c828b87c
deepspeed zero3 QLoRA finetuning (#11625)
* deepspeed zero3 QLoRA finetuning

* Update convert.py

* Update low_bit_linear.py

* Update utils.py

* Update qlora_finetune_llama2_13b_arch_2_card.sh

* Update low_bit_linear.py

* Update alpaca_qlora_finetuning.py

* Update low_bit_linear.py

* Update utils.py

* Update convert.py

* Update alpaca_qlora_finetuning.py

* Update alpaca_qlora_finetuning.py

* Update low_bit_linear.py

* Update deepspeed_zero3.json

* Update qlora_finetune_llama2_13b_arch_2_card.sh

* Update low_bit_linear.py

* Update low_bit_linear.py

* Update utils.py

* fix style

* fix style

* Update alpaca_qlora_finetuning.py

* Update qlora_finetune_llama2_13b_arch_2_card.sh

* Update convert.py

* Update low_bit_linear.py

* Update model.py

* Update alpaca_qlora_finetuning.py

* Update low_bit_linear.py

* Update low_bit_linear.py

* Update low_bit_linear.py
2024-08-13 16:15:29 +08:00
Yishuo Wang
a184b120c9
fix minicpm-v 2.5 (#11780) 2024-08-13 16:14:00 +08:00
Yuwen Hu
ec184af243
Add gemma-2-2b-it and gemma-2-9b-it to igpu nightly performance test (#11778)
* add yaml and modify `concat_csv.py` for `transformers` 4.43.1 (#11758)

* add yaml and modify `concat_csv.py` for `transformers` 4.43.1

* remove 4.43 for arc; fix;

* remove 4096-512 for 4.43

* comment some models

* Small fix

* uncomment models (#11777)

---------

Co-authored-by: Ch1y0q <qiyue2001@gmail.com>
2024-08-13 15:39:56 +08:00
Qiyuan Gong
a88c132e54
Reduce Mistral softmax memory only in low memory mode (#11775)
* Reduce Mistral softmax memory only in low memory mode
2024-08-13 14:50:54 +08:00
Yishuo Wang
aa861df066
use new fp32 softmax kernel (#11776) 2024-08-13 14:48:11 +08:00
binbin Deng
23d3acdc77
Add experimental support of fused decoder layer for llama2 (#11768) 2024-08-13 14:41:36 +08:00
Jin, Qiao
c28b3389e6
Update npu multimodal example (#11773) 2024-08-13 14:14:59 +08:00
Yuwen Hu
81824ff8c9
Fix stdout in all-in-one benchmark to utf-8 (#11772) 2024-08-13 10:51:08 +08:00
Yishuo Wang
a1eb793f70
optimize minicpm v 2_6 first token perf (#11770) 2024-08-13 09:51:18 +08:00
Yina Chen
841dbcdf3a
Fix compresskv with lookahead issue (#11767)
* fix compresskv + lookahead attn_mask qwen2

* support llama chatglm

* support mistral & chatglm

* address comments

* revert run.py
2024-08-12 18:53:55 +08:00
Yuwen Hu
f97a77ea4e
Update all-in-one benchmark for continuation task input preparation (#11760)
* All use 8192.txt for prompt preparation for now

* Small fix

* Fix text encoding mode to utf-8

* Small update
2024-08-12 17:49:45 +08:00
Xu, Shuo
1b05caba2b
Set mistral fuse rope to false except fp6 & fp16 (#11765)
* set mistral fuse rope to false except fp6 & fp16

* lint

* lint

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-08-12 17:25:07 +08:00
Ruonan Wang
8db34057b4
optimize lookahead init time (#11769) 2024-08-12 17:19:12 +08:00
Jin, Qiao
05989ad0f9
Update npu example and all in one benchmark (#11766) 2024-08-12 16:46:46 +08:00
Yishuo Wang
57d177738d
optimize minicpm-v-2_6 repetition penalty (#11763) 2024-08-12 14:10:10 +08:00
Wang, Jian4
245dba0abc
Fix lightweight-serving codegeex error (#11759) 2024-08-12 10:35:37 +08:00
Ruonan Wang
66fe2ee464
initial support of IPEX_LLM_PERFORMANCE_MODE (#11754)
* add perf mode

* update

* fix style
2024-08-09 19:04:09 +08:00
Yina Chen
4b9c57cc60
Support compress kv with lookahead (#11752)
* support compress kv with lookahead

* enough kv miss param
2024-08-09 17:39:57 +08:00
Yishuo Wang
93455aac09
fix minicpm V 2.6 repeat output (#11753) 2024-08-09 17:39:24 +08:00
Ruonan Wang
7e917d6cfb
fix gptq of llama (#11749)
* fix gptq of llama

* small fix
2024-08-09 16:39:25 +08:00
Yina Chen
dd46c141bd
Phi3 support compresskv (#11733)
* phi3 support compresskv

* fix phi3 mtl error

* fix conflict with quant kv

* fix abnormal on mtl

* fix style

* use slide windows size to compress kv

* support sliding window

* fix style

* fix style

* temp: partial support quant kv

* support quant kv with compress kv, todo: model check

* temp

* fix style

* fix style

* remove prepare

* address comment

* default -> 1.8k
2024-08-09 15:43:43 +08:00
Qiyuan Gong
d8808cc2e3
Mistral apply_rotary_pos_emb_no_cache_xpu use rope_theta from config (#11747)
mistral-7B-instruct-v0.2 and mistral-7B-instruct-v0.1 use different rope_theta (0.2 is 1e, 0.1 is 1e5). Pass self.config.rope_theta to apply_rotary_pos_emb_no_cache_xpu to avoid output difference.
2024-08-09 10:35:51 +08:00
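Why the rope_theta value matters can be illustrated with a minimal sketch of the standard rotary-embedding frequency formula. The function name `rope_inv_freq` and the two base values below are illustrative assumptions, not code or values taken from this commit; the point is only that a different base yields different rotation frequencies, and therefore different outputs.

```python
def rope_inv_freq(head_dim: int, rope_theta: float) -> list[float]:
    """Inverse frequencies used by rotary position embeddings.

    Standard formula: inv_freq[k] = 1 / rope_theta ** (2k / head_dim).
    """
    return [1.0 / (rope_theta ** (i / head_dim)) for i in range(0, head_dim, 2)]


# Two hypothetical base values, chosen only to show the effect.
low = rope_inv_freq(8, 1e4)
high = rope_inv_freq(8, 1e6)

# The first frequency is always 1.0; every later one shrinks as the base grows.
for k, (a, b) in enumerate(zip(low, high)):
    print(k, a, b)
```

Hard-coding one base for every checkpoint would silently rotate queries and keys by the wrong angles on a model trained with another base, which is consistent with the output difference the commit describes.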
Xiangyu Tian
044e486480
Fix vLLM CPU /chat endpoint (#11748) 2024-08-09 10:33:52 +08:00
Jinhe
27b4b104ed
Add qwen2-1.5b-instruct into igpu performance (#11735)
* updated qwen1.5B to all transformer==4.37 yaml

* updated qwen1.5B to all transformer==4.37 yaml
2024-08-08 16:42:18 +08:00
Shaojun Liu
107f7aafd0
enable inference mode for deepspeed tp serving (#11742) 2024-08-08 14:38:30 +08:00
Zijie Li
9e65cf00b3
Add openai-whisper pytorch gpu (#11736)
* Add openai-whisper pytorch gpu

* Update README.md

* Update README.md

* fix typo

* fix names update readme

* Update README.md
2024-08-08 12:32:59 +08:00
Jinhe
d0c89fb715
updated llama.cpp and ollama quickstart (#11732)
* updated llama.cpp and ollama quickstart.md

* added qwen2-1.5B sample output

* revision on quickstart updates

* revision on quickstart updates

* revision on qwen2 readme

* added 2 troubleshoots

* troubleshoot revision
2024-08-08 11:04:01 +08:00
Yishuo Wang
54cc9353db
support and optimize minicpm-v-2_6 (#11738) 2024-08-07 18:21:16 +08:00
Yina Chen
e956e71fc1
fix conflict with quant kv (#11737) 2024-08-07 18:10:30 +08:00
Ruonan Wang
00a5574c8a
Use merge_qkv to replace fused_qkv for llama2 (#11727)
* update 4.38

* support new versions

* update

* fix style

* fix style

* update rope

* temp test sdpa

* fix style

* fix cpu ut
2024-08-07 18:04:01 +08:00
Yina Chen
d2abc9711b
Fix MTL 4k input qwen2 compresskv error (#11734)
* fix

* fix style
2024-08-07 16:21:57 +08:00
Yina Chen
a71ae7c22b
Support minicpm compresskv & modify default compresskv config & default enable compresskv on mtl 2.5k~4.5k (#11726)
* support minicpm & modify default & default enable on mtl 2.5k~4.5k

* fix style
2024-08-07 11:35:39 +08:00
Yishuo Wang
c093f7d980
fix phi3 (#11729) 2024-08-07 09:39:46 +08:00
Zijie Li
e7f7141781
Add benchmark util for transformers 4.42 (#11725)
* add new benchmark_util.py

Add new benchmark_util.py for transformers>=4.43.1. The old one renamed to benchmark_util_prev.py.

* Small fix to import code

* Update __init__.py

* fix file names

* Update lint-python

Update lint-python to exclude benchmark_util_4_29.py
benchmark_util_4_43.py

* Update benchmark_util_4_43.py

* add benchmark_util for transformers 4.42
2024-08-07 08:48:07 +08:00
Ch1y0q
4676af2054
add gemma2 example (#11724)
* add `gemma2`

* update `transformers` version

* update `README.md`
2024-08-06 21:17:50 +08:00
SichengStevenLi
985213614b
Removed no longer needed models for Arc nightly perf (#11722)
* removed LLMs that are no longer needed

Removed: 
mistralai/Mistral-7B-v0.1
deepseek-ai/deepseek-coder-6.7b-instruct

* Update arc-perf-test-batch4.yaml

Removed: 
deepseek-ai/deepseek-coder-6.7b-instruct
mistralai/Mistral-7B-v0.1

* Update arc-perf-test.yaml

Removed: 
deepseek-ai/deepseek-coder-6.7b-instruct
mistralai/Mistral-7B-v0.1

* Create arc-perf-transformers-438.yaml

* Moved arc-perf-transformers-438.yaml location

* Create arc-perf-transformers-438-batch2.yaml

* Create arc-perf-transformers-438-batch4.yaml

* Delete python/llm/test/benchmark/arc-perf-transformers-438-batch2.yaml

* Delete python/llm/test/benchmark/arc-perf-transformers-438-batch4.yaml

* Delete python/llm/test/benchmark/arc-perf-transformers-438.yaml
2024-08-06 16:12:00 +08:00
Yishuo Wang
929675aa6b
support latest phi3 (#11721) 2024-08-06 15:52:55 +08:00
Jin, Qiao
11650b6f81
upgrade glm-4v example transformers version (#11719) 2024-08-06 14:55:09 +08:00
Yishuo Wang
bbdff6edeb
optimize internvl2 4b performance (#11720) 2024-08-06 14:25:08 +08:00
Yishuo Wang
f44b732aa8
support internvl2-4b (#11718) 2024-08-06 13:36:32 +08:00
Jin, Qiao
7f241133da
Add MiniCPM-Llama3-V-2_5 GPU example (#11693)
* Add MiniCPM-Llama3-V-2_5 GPU example

* fix
2024-08-06 10:22:41 +08:00
Jin, Qiao
808d9a7bae
Add MiniCPM-V-2 GPU example (#11699)
* Add MiniCPM-V-2 GPU example

* add example in README.md

* add example in README.md
2024-08-06 10:22:33 +08:00
Zijie Li
8fb36b9f4a
add new benchmark_util.py (#11713)
* add new benchmark_util.py
2024-08-05 16:18:48 +08:00
Wang, Jian4
493cbd9a36
Support lightweight-serving with internlm-xcomposer2-vl-7b multimodal input (#11703)
* init image_list

* enable internlm-xcomposer2 image input

* update style

* add readme

* update model

* update readme
2024-08-05 09:36:04 +08:00
Ruonan Wang
aa98ef96fe
change mixed_precision to q6_k (#11706) 2024-08-02 15:55:16 +08:00
Xiangyu Tian
1baa3efe0e
Optimizations for Pipeline Parallel Serving (#11702)
Optimizations for Pipeline Parallel Serving
2024-08-02 12:06:59 +08:00
Yina Chen
8d1e0bd2f4
add sdp causal support in llama (#11705) 2024-08-02 10:27:40 +08:00
Ruonan Wang
736a7ef72e
add sdp_causal for mistral 4.36 (#11686)
* add sdp_causal for mistral

* fix

* update
2024-08-01 18:57:31 +08:00
Yina Chen
45c730ff39
Chatglm support compresskv (#11690)
* chatglm4 support compresskv

* fix

* fix style

* support chatglm2

* fix quantkv conflict

* fix style
2024-08-01 18:20:20 +08:00
Qiyuan Gong
762ad49362
Add RANK_WAIT_TIME into DeepSpeed-AutoTP to avoid CPU memory OOM (#11704)
* DeepSpeed-AutoTP starts multiple processes that each load and convert the model in CPU memory. If model/rank_num is large, this leads to OOM. Add RANK_WAIT_TIME to reduce memory usage by controlling model reading parallelism.
2024-08-01 18:16:21 +08:00
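One way such a wait time could limit loading parallelism is to stagger the ranks so that they do not all read the model at once. The sketch below is an assumption about the general technique, not the code from this commit: the helper names, the "rank times wait time" rule, and the default of 0 are all hypothetical.

```python
import os
import time


def staggered_delay(rank: int, wait_time: float) -> float:
    """Seconds a rank waits before loading (hypothetical rule: rank * wait_time)."""
    return rank * wait_time


def wait_for_turn(rank: int, env=os.environ, sleep=time.sleep) -> float:
    """Sleep before loading the model so ranks start one after another.

    Reads RANK_WAIT_TIME (seconds) from the environment; 0 or unset means
    no staggering, which is an assumed default.
    """
    wait_time = float(env.get("RANK_WAIT_TIME", "0"))
    delay = staggered_delay(rank, wait_time)
    if delay > 0:
        sleep(delay)
    return delay
```

With a wait time of 10 seconds, rank 0 would start immediately, rank 1 after 10 seconds, and so on, so peak CPU memory stays closer to one model copy at a time while earlier ranks finish converting.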
hxsz1997
8ef4caaf5d
add 3k and 4k input of nightly perf test on iGPU (#11701)
* Add 3k&4k input in workflow for iGPU (#11685)

* add 3k&4k input in workflow

* comment for test

* comment models for accelerate test

* remove OOM models

* modify typo

* change test model (#11696)

* reverse test models (#11700)
2024-08-01 14:17:46 +08:00
Guancheng Fu
afeca38a47
Fix import vllm condition (#11682) 2024-07-31 13:50:01 +08:00
Ruonan Wang
54bf3a23a6
add fallback for unsupported k-quants (#11691)
* add fallback

* fix style

* fix
2024-07-31 11:39:58 +08:00
Zijie Li
5079ed9e06
Add Llama3.1 example (#11689)
* Add Llama3.1 example

Add Llama3.1 example for Linux arc and Windows MTL

* Changes made to adjust compatibilities

transformers changed to 4.43.1

* Update index.rst

* Update README.md

* Update index.rst

* Update index.rst

* Update index.rst
2024-07-31 10:53:30 +08:00
Jin, Qiao
6e3ce28173
Upgrade glm-4 example transformers version (#11659)
* upgrade glm-4 example transformers version

* move pip install in one line
2024-07-31 10:24:50 +08:00
Jin, Qiao
a44ab32153
Switch to conhost when running on NPU (#11687) 2024-07-30 17:08:06 +08:00
Wang, Jian4
b119825152
Remove tgi parameter validation (#11688)
* remove validation

* add min warm up

* remove no need source
2024-07-30 16:37:44 +08:00
Yina Chen
670ad887fc
Qwen support compress kv (#11680)
* Qwen support compress kv

* fix style

* fix
2024-07-30 11:16:42 +08:00
hxsz1997
9b36877897
disable default quantize_kv of GQA on MTL (#11679)
* disable default quantizekv of gqa in mtl

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style
2024-07-30 09:38:46 +08:00
Yishuo Wang
c02003925b
add mlp for gemma2 (#11678) 2024-07-29 16:10:23 +08:00
RyuKosei
1da1f1dd0e
Combine two versions of run_wikitext.py (#11597)
* Combine two versions of run_wikitext.py

* Update run_wikitext.py

* Update run_wikitext.py

* aligned the format

* update error display

* simplified argument parser

---------

Co-authored-by: jenniew <jenniewang123@gmail.com>
2024-07-29 15:56:16 +08:00
Yishuo Wang
6f999e6e90
add sdp for gemma2 (#11677) 2024-07-29 15:15:47 +08:00
Ruonan Wang
c11d5301d7
add sdp fp8 for llama (#11671)
* add sdp fp8 for llama

* fix style

* refactor
2024-07-29 13:46:22 +08:00
Yishuo Wang
7f88ce23cd
add more gemma2 optimization (#11673) 2024-07-29 11:13:00 +08:00
Yishuo Wang
3e8819734b
add basic gemma2 optimization (#11672) 2024-07-29 10:46:51 +08:00
Guoqiong Song
336dfc04b1
fix 1482 (#11661)
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-07-26 12:39:09 -07:00
Heyang Sun
ba01b85c13
empty cache only for 1st token but rest token to speed up (#11665) 2024-07-26 16:46:21 +08:00
Yina Chen
fc7f8feb83
Support compress kv (#11642)
* mistral snapkv

* update

* mtl update

* update

* update

* update

* add comments

* style fix

* fix style

* support llama

* llama use compress kv

* support mistral 4.40

* fix style

* support diff transformers versions

* move snapkv util to kv

* fix style

* meet comments & small fix

* revert all in one

* fix indent

---------

Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2024-07-26 16:02:00 +08:00
Yishuo Wang
6bcdc6cc8f
fix qwen2 cpu (#11663) 2024-07-26 13:41:51 +08:00
Wang, Jian4
23681fbf5c
Support codegeex4-9b for lightweight-serving (#11648)
* add options, support prompt and not return end_token

* enable openai parameter

* set do_sample None and update style
2024-07-26 09:41:03 +08:00
Guancheng Fu
a4d30a8211
Change logic for detecting if vllm is available (#11657)
* fix

* fix
2024-07-25 15:24:19 +08:00
Qiyuan Gong
0c6e0b86c0
Refine continuation get input_str (#11652)
* Remove duplicate code in continuation get input_str.
* Avoid infinite loop in all-in-one due to test_length not in the list.
2024-07-25 14:41:19 +08:00
RyuKosei
2fbd375a94
update several models for nightly perf test (#11643)
Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-07-25 14:06:08 +08:00
Xiangyu Tian
4499d25c26
LLM: Fix ParallelLMHead convert in vLLM cpu (#11654) 2024-07-25 13:07:19 +08:00
binbin Deng
777e61d8c8
Fix qwen2 & int4 on NPU (#11646) 2024-07-24 13:14:39 +08:00
Yishuo Wang
1b3b46e54d
fix chatglm new model (#11639) 2024-07-23 13:44:56 +08:00
Xu, Shuo
7f80db95eb
Change run.py in benchmark to support phi-3-vision in arc-perf (#11638)
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-23 09:51:36 +08:00
Xiangyu Tian
060792a648
LLM: Refine Pipeline Parallel FastAPI (#11587)
Refine Pipeline Parallel FastAPI
2024-07-22 15:52:05 +08:00
Wang, Jian4
1eed0635f2
Add lightweight serving and support tgi parameter (#11600)
* init tgi request

* update openai api

* update for pp

* update and add readme

* add to docker

* add start bash

* update

* update

* update
2024-07-19 13:15:56 +08:00
Xiangyu Tian
d27a8cd08c
Fix Pipeline Parallel dtype (#11623) 2024-07-19 13:07:40 +08:00
Yishuo Wang
d020ad6397
add save_low_bit support for DiskEmbedding (#11621) 2024-07-19 10:34:53 +08:00
Guoqiong Song
380717f50d
fix gemma for 4.41 (#11531)
* fix gemma for 4.41
2024-07-18 15:02:50 -07:00
Guoqiong Song
5a6211fd56
fix minicpm for transformers>=4.39 (#11533)
* fix minicpm for transformers>=4.39
2024-07-18 15:01:57 -07:00
Yishuo Wang
0209427cf4
Add disk_embedding parameter to support put Embedding layer on CPU (#11617) 2024-07-18 17:06:06 +08:00
Yuwen Hu
2478e2c14b
Add check in iGPU perf workflow for results integrity (#11616)
* Add csv check for igpu benchmark workflow (#11610)

* add csv check for igpu benchmark workflow

* ready to test

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>

* Restore the temporarily removed models in iGPU-perf (#11615)

Co-authored-by: ATMxsp01 <shou.xu@intel.com>

---------

Co-authored-by: Xu, Shuo <100334393+ATMxsp01@users.noreply.github.com>
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-18 14:13:16 +08:00
Xiangyu Tian
4594a3dd6c
LLM: Fix DummyLayer.weight device in Pipeline Parallel (#11612) 2024-07-18 13:39:34 +08:00
Ruonan Wang
4da93709b1
update doc/setup to use onednn gemm for cpp (#11598)
* update doc/setup to use onednn gemm

* small fix

* Change TOC of graphrag quickstart back
2024-07-18 13:04:38 +08:00
Yishuo Wang
f4077fa905
fix llama3-8b npu long input stuck (#11613) 2024-07-18 11:08:17 +08:00
Zhao Changmin
e5c0058c0e
fix baichuan (#11606) 2024-07-18 09:43:36 +08:00
Guoqiong Song
bfcdc35b04
phi-3 on "transformers>=4.37.0,<=4.42.3" (#11534) 2024-07-17 17:19:57 -07:00
Guoqiong Song
d64711900a
Fix cohere model on transformers>=4.41 (#11575)
* fix cohere model for 4-41
2024-07-17 17:18:59 -07:00
Guoqiong Song
5b6eb85b85
phi model readme (#11595)
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-07-17 17:18:34 -07:00
Wang, Jian4
9c15abf825
Refactor fastapi-serving and add one card serving (#11581)
* init fastapi-serving one card

* mv api code to source

* update worker

* update for style-check

* add worker

* update bash

* update

* update worker name and add readme

* rename update

* rename to fastapi
2024-07-17 11:12:43 +08:00
Yishuo Wang
5837bc0014
fix chatglm3 npu output (#11590) 2024-07-16 18:16:30 +08:00
Guancheng Fu
06930ab258
Enable ipex-llm optimization for lm head (#11589)
* basic

* Modify convert.py

* fix
2024-07-16 16:48:44 +08:00
Heyang Sun
365adad59f
Support LoRA ChatGLM with Alpaca Dataset (#11580)
* Support LoRA ChatGLM with Alpaca Dataset

* refine

* fix

* add 2-card alpaca
2024-07-16 15:40:02 +08:00
Yina Chen
99c22745b2
fix qwen 14b fp6 abnormal output (#11583) 2024-07-16 10:59:00 +08:00
Yishuo Wang
c279849d27
add disk embedding api (#11585) 2024-07-16 10:43:39 +08:00
Xiangyu Tian
79c742dfd5
LLM: Add XPU Memory Optimizations for Pipeline Parallel (#11567)
Add XPU Memory Optimizations for Pipeline Parallel
2024-07-16 09:44:50 +08:00
Ch1y0q
50cf563a71
Add example: MiniCPM-V (#11570) 2024-07-15 10:55:48 +08:00
Zhao Changmin
06745e5742
Add npu benchmark all-in-one script (#11571)
* npu benchmark
2024-07-15 10:42:37 +08:00
Yishuo Wang
019da6c0ab
use mlp silu_mul fusion in qwen2 to optimize memory usage (#11574) 2024-07-13 16:32:54 +08:00
Xu, Shuo
13a72dc51d
Test MiniCPM performance on iGPU in a more stable way (#11573)
* Test MiniCPM performance on iGPU in a more stable way

* small fix

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-12 17:07:41 +08:00
Xiangyu Tian
0981b72275
Fix /generate_stream api in Pipeline Parallel FastAPI (#11569) 2024-07-12 13:19:42 +08:00
Yishuo Wang
a945500a98
fix internlm xcomposer stream chat (#11564) 2024-07-11 18:21:17 +08:00
Zhao Changmin
b9c66994a5
add npu sdp (#11562) 2024-07-11 16:57:35 +08:00
binbin Deng
2b8ad8731e
Support pipeline parallel for glm-4v (#11545) 2024-07-11 16:06:06 +08:00
Xiangyu Tian
7f5111a998
LLM: Refine start script for Pipeline Parallel Serving (#11557)
Refine start script and readme for Pipeline Parallel Serving
2024-07-11 15:45:27 +08:00
Xu, Shuo
1355b2ce06
Add model Qwen-VL-Chat to iGPU-perf (#11558)
* Add model Qwen-VL-Chat to iGPU-perf

* small fix

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-11 15:39:02 +08:00
Zhao Changmin
105e124752
optimize phi3-v encoder npu performance and add multimodal example (#11553)
* phi3-v

* readme
2024-07-11 13:59:14 +08:00
Cengguang Zhang
70ab1a6f1a
LLM: unify memory optimization env variables. (#11549)
* LLM: unify memory optimization env variables.

* fix comments.
2024-07-11 11:01:28 +08:00
Xu, Shuo
028ad4f63c
Add model phi-3-vision-128k-instruct to iGPU-perf benchmark (#11554)
* try to improve MiniCPM performance

* Add model phi-3-vision-128k-instruct to iGPU-perf benchmark

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-10 17:26:30 +08:00
Yishuo Wang
994e49a510
optimize internlm xcomposer performance again (#11551) 2024-07-10 17:08:56 +08:00
Xu, Shuo
61613b210c
try to improve MiniCPM performance (#11552)
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-10 16:58:23 +08:00
Yishuo Wang
82f9514303
optimize internlm xcomposer2 performance (#11550) 2024-07-10 15:57:04 +08:00
Zhao Changmin
3c16c9f725
Optimize baichuan on NPU (#11548)
* baichuan_npu
2024-07-10 13:18:48 +08:00
Yuwen Hu
8982ab73d5
Add Yi-6B and StableLM to iGPU perf test (#11546)
* Add transformer4.38.2 test to igpu benchmark (#11529)

* add transformer4.38.1 test to igpu benchmark

* use transformers4.38.2 & fix csv name error in 4.38 workflow

* add model Yi-6B-Chat & remove temporarily most models

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>

* filter some errorlevel (#11541)

Co-authored-by: ATMxsp01 <shou.xu@intel.com>

* Restore the temporarily removed models in iGPU-perf (#11544)

* filter some errorlevel

* restore the temporarily removed models in iGPU-perf

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>

---------

Co-authored-by: Xu, Shuo <100334393+ATMxsp01@users.noreply.github.com>
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-09 18:51:23 +08:00
Yishuo Wang
7dc6756d86
add disk embedding (#11543) 2024-07-09 17:38:40 +08:00
Zhao Changmin
76a5802acf
update NPU examples (#11540)
* update NPU examples
2024-07-09 17:19:42 +08:00
Yishuo Wang
99b2802d3b
optimize qwen2 memory (#11535) 2024-07-09 17:14:01 +08:00
Yishuo Wang
2929eb262e
support npu glm4 (#11539) 2024-07-09 15:46:49 +08:00
Xiangyu Tian
a1cede926d
Fix update_kv_cache in Pipeline-Parallel-Serving for glm4-9b model (#11537) 2024-07-09 14:08:04 +08:00
Cengguang Zhang
fa81dbefd3
LLM: update multi gpu write csv in all-in-one benchmark. (#11538) 2024-07-09 11:14:17 +08:00
Xin Qiu
69701b3ec8
fix typo in python/llm/scripts/README.md (#11536) 2024-07-09 09:53:14 +08:00
Jason Dai
099486afb7
Update README.md (#11530) 2024-07-08 20:18:41 +08:00
binbin Deng
66f6ffe4b2
Update GPU HF-Transformers example structure (#11526) 2024-07-08 17:58:06 +08:00
Xu, Shuo
f9a199900d
add model RWKV/v5-Eagle-7B-HF to igpu benchmark (#11528)
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-08 15:50:16 +08:00
Shaojun Liu
9b37ca6027
remove (#11527) 2024-07-08 15:49:52 +08:00
Yishuo Wang
c26651f91f
add mistral npu support (#11523) 2024-07-08 13:17:15 +08:00
Jun Wang
5a57e54400
[ADD] add 5 new models for igpu-perf (#11524) 2024-07-08 11:12:15 +08:00
Xu, Shuo
64cfed602d
Add new models to benchmark (#11505)
* Add new models to benchmark

* remove Qwen/Qwen-VL-Chat to pass the validation

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-08 10:35:55 +08:00
binbin Deng
252426793b
Fix setting of use_quantize_kv_cache on different GPU in pipeline parallel (#11516) 2024-07-08 09:27:01 +08:00
Yishuo Wang
7cb09a8eac
optimize qwen2 memory usage again (#11520) 2024-07-05 17:32:34 +08:00
Yuwen Hu
8f376e5192
Change igpu perf to mainly test int4+fp16 (#11513) 2024-07-05 17:12:33 +08:00
Jun Wang
1efb6ebe93
[ADD] add transformer_int4_fp16_loadlowbit_gpu_win api (#11511)
* [ADD] add transformer_int4_fp16_loadlowbit_gpu_win api

* [UPDATE] add int4_fp16_lowbit config and description

* [FIX] fix run.py mistake

* [FIX] fix run.py mistake

* [FIX] fix indent; change dtype=float16 to model.half()
2024-07-05 16:38:41 +08:00
Zhao Changmin
f7e957aaf9
Clean npu dtype branch (#11515)
* clean branch

* create_npu_kernels
2024-07-05 15:45:26 +08:00
Yishuo Wang
14ce058004
add chatglm3 npu support (#11518) 2024-07-05 15:31:27 +08:00
Xin Qiu
a31f2cbe13
update minicpm.py (#11517)
* update minicpm

* meet code review
2024-07-05 15:25:44 +08:00
Zhao Changmin
24de13fc45
Optimize stablelm on NPU (#11512)
* stablelm_optimize
2024-07-05 14:21:57 +08:00
Xiangyu Tian
7d8bc83415
LLM: Partial Prefilling for Pipeline Parallel Serving (#11457)
LLM: Partial Prefilling for Pipeline Parallel Serving
2024-07-05 13:10:35 +08:00
binbin Deng
60de428b37
Support pipeline parallel for qwen-vl (#11503) 2024-07-04 18:03:57 +08:00
Zhao Changmin
57b8adb189
[WIP] Support npu load_low_bit method (#11502)
* npu_load_low_bit
2024-07-04 17:15:34 +08:00
Jun Wang
f07937945f
[REMOVE] remove all useless repo-id in benchmark/igpu-perf (#11508) 2024-07-04 16:38:34 +08:00
Yishuo Wang
1a8bab172e
add minicpm 1B/2B npu support (#11507) 2024-07-04 16:31:04 +08:00
Yishuo Wang
bb0a84044b
add qwen2 npu support (#11504) 2024-07-04 11:01:25 +08:00
Xin Qiu
f84ca99b9f
optimize gemma2 rmsnorm (#11500) 2024-07-03 15:21:03 +08:00
Wang, Jian4
61c36ba085
Add pp_serving verified models (#11498)
* add verified models

* update

* verify large model

* update commend
2024-07-03 14:57:09 +08:00
binbin Deng
9274282ef7
Support pipeline parallel for glm-4-9b-chat (#11463) 2024-07-03 14:25:28 +08:00
Yishuo Wang
d97c2664ce
use new fuse rope in stablelm family (#11497) 2024-07-03 11:08:26 +08:00
Xu, Shuo
52519e07df
remove models we no longer need in benchmark. (#11492)
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-02 17:20:48 +08:00
Zhao Changmin
6a0134a9b2
support q4_0_rtn (#11477)
* q4_0_rtn
2024-07-02 16:57:02 +08:00
Yishuo Wang
5e967205ac
remove the code converts input to fp16 before calling batch forward kernel (#11489) 2024-07-02 16:23:53 +08:00
Wang, Jian4
4390e7dc49
Fix codegeex2 transformers version (#11487) 2024-07-02 15:09:28 +08:00
Yishuo Wang
ec3a912ab6
optimize npu llama long context performance (#11478) 2024-07-01 16:49:23 +08:00
Heyang Sun
913e750b01
fix non-string deepspeed config path bug (#11476)
* fix non-string deepspeed config path bug

* Update lora_finetune_chatglm.py
2024-07-01 15:53:50 +08:00
binbin Deng
48ad482d3d
Fix import error caused by pydantic on cpu (#11474) 2024-07-01 15:49:49 +08:00
Yishuo Wang
39bcb33a67
add sdp support for stablelm 3b (#11473) 2024-07-01 14:56:15 +08:00
Zhao Changmin
cf8eb7b128
Init NPU quantize method and support q8_0_rtn (#11452)
* q8_0_rtn

* fix float point
2024-07-01 13:45:07 +08:00
Yishuo Wang
319a3b36b2
fix npu llama2 (#11471) 2024-07-01 10:14:11 +08:00
Heyang Sun
07362ffffc
ChatGLM3-6B LoRA Fine-tuning Demo (#11450)
* ChatGLM3-6B LoRA Fine-tuning Demo

* refine

* refine

* add 2-card deepspeed

* refine format

* add mpi4py and deepspeed install
2024-07-01 09:18:39 +08:00
Xiangyu Tian
fd933c92d8
Fix: Correct num_requests in benchmark for Pipeline Parallel Serving (#11462) 2024-06-28 16:10:51 +08:00
SONG Ge
a414e3ff8a
add pipeline parallel support with load_low_bit (#11414) 2024-06-28 10:17:56 +08:00
Cengguang Zhang
d0b801d7bc
LLM: change write mode in all-in-one benchmark. (#11444)
* LLM: change write mode in all-in-one benchmark.

* update output style.
2024-06-27 19:36:38 +08:00
binbin Deng
987017ef47
Update pipeline parallel serving for more model support (#11428) 2024-06-27 18:21:01 +08:00
Yishuo Wang
029ff15d28
optimize npu llama2 first token performance (#11451) 2024-06-27 17:37:33 +08:00
Qiyuan Gong
4e4ecd5095
Control sys.modules ipex duplicate check with BIGDL_CHECK_DUPLICATE_IMPORT (#11453)
* Control sys.modules ipex duplicate check with BIGDL_CHECK_DUPLICATE_IMPORT.
2024-06-27 17:21:45 +08:00
Yishuo Wang
c6e5ad668d
fix internlm xcomposer meta-instruction typo (#11448) 2024-06-27 15:29:43 +08:00
Yishuo Wang
f89ca23748
optimize npu llama2 perf again (#11445) 2024-06-27 15:13:42 +08:00
Yishuo Wang
cf0f5c4322
change npu document (#11446) 2024-06-27 13:59:59 +08:00
binbin Deng
508c364a79
Add precision option in PP inference examples (#11440) 2024-06-27 09:24:27 +08:00
Yishuo Wang
2a0f8087e3
optimize qwen2 gpu memory usage again (#11435) 2024-06-26 16:52:29 +08:00
Shaojun Liu
ab9f7f3ac5
FIX: Qwen1.5-GPTQ-Int4 inference error (#11432)
* merge_qkv if quant_method is 'gptq'

* fix python style checks

* refactor

* update GPU example
2024-06-26 15:36:22 +08:00
Guancheng Fu
99cd16ef9f
Fix error while using pipeline parallelism (#11434) 2024-06-26 15:33:47 +08:00
Jiao Wang
40fa23560e
Fix LLAVA example on CPU (#11271)
* update

* update

* update

* update
2024-06-25 20:04:59 -07:00
Yishuo Wang
ca0e69c3a7
optimize npu llama perf again (#11431) 2024-06-26 10:52:54 +08:00
Yishuo Wang
9f6e5b4fba
optimize llama npu perf (#11426) 2024-06-25 17:43:20 +08:00
binbin Deng
e473b8d946
Add more qwen1.5 and qwen2 support for pipeline parallel inference (#11423) 2024-06-25 15:49:32 +08:00
binbin Deng
aacc1fd8c0
Fix shape error when run qwen1.5-14b using deepspeed autotp (#11420) 2024-06-25 13:48:37 +08:00
Yishuo Wang
3b23de684a
update npu examples (#11422) 2024-06-25 13:32:53 +08:00
Xiangyu Tian
8ddae22cfb
LLM: Refactor Pipeline-Parallel-FastAPI example (#11319)
Initially Refactor for Pipeline-Parallel-FastAPI example
2024-06-25 13:30:36 +08:00
SONG Ge
34c15d3a10
update pp document (#11421) 2024-06-25 10:17:20 +08:00
Xin Qiu
9e4ee61737
rename BIGDL_OPTIMIZE_LM_HEAD to IPEX_LLM_LAST_LM_HEAD and add qwen2 (#11418) 2024-06-24 18:42:37 +08:00
Heyang Sun
c985912ee3
Add Deepspeed LoRA dependencies in document (#11410) 2024-06-24 15:29:59 +08:00
Yishuo Wang
abe53eaa4f
optimize qwen1.5/2 memory usage when running long input with fp16 (#11403) 2024-06-24 13:43:04 +08:00
Guoqiong Song
7507000ef2
Fix 1383 Llama model on transformers=4.41[WIP] (#11280) 2024-06-21 11:24:10 -07:00
SONG Ge
0c67639539
Add more examples for pipeline parallel inference (#11372)
* add more model examples for pipeline parallel inference

* add mixtral and vicuna models

* add yi model and past_kv support for chatglm family

* add docs

* doc update

* add license

* update
2024-06-21 17:55:16 +08:00
Xiangyu Tian
b30bf7648e
Fix vLLM CPU api_server params (#11384) 2024-06-21 13:00:06 +08:00
ivy-lv11
21fc781fce
Add GLM-4V example (#11343)
* add example

* modify

* modify

* add line

* add

* add link and replace with phi-3-vision template

* fix generate options

* fix

* fix

---------

Co-authored-by: jinbridge <2635480475@qq.com>
2024-06-21 12:54:31 +08:00
binbin Deng
4ba82191f2
Support PP inference for chatglm3 (#11375) 2024-06-21 09:59:01 +08:00
Yishuo Wang
f0fdfa081b
Optimize qwen 1.5 14B batch performance (#11370) 2024-06-20 17:23:39 +08:00
Wenjing Margaret Mao
c0e86c523a
Add qwen-moe batch1 to nightly perf (#11369)
* add moe

* reduce 437 models

* rename

* fix syntax

* add moe check result

* add 430 + 437

* all modes

* 4-37-4 exclud

* revert & comment

---------

Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-06-20 14:17:41 +08:00
Yishuo Wang
a5e7d93242
Add initial save/load low bit support for NPU(now only fp16 is supported) (#11359) 2024-06-20 10:49:39 +08:00
RyuKosei
05a8d051f6
Fix run.py run_ipex_fp16_gpu (#11361)
* fix a bug on run.py

* Update run.py

fixed the format problem

---------

Co-authored-by: sgwhat <ge.song@intel.com>
2024-06-20 10:29:32 +08:00
Wenjing Margaret Mao
b2f62a8561
Add batch 4 perf test (#11355)
* copy files to this branch

* add tasks

* comment one model

* change the model to test the 4.36

* only test batch-4

* typo

* typo

* typo

* typo

* typo

* typo

* add 4.37-batch4

* change the file name

* revert yaml file

* no print

* add batch4 task

* revert

---------

Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-06-20 09:48:52 +08:00
Zijie Li
ae452688c2
Add NPU HF example (#11358) 2024-06-19 18:07:28 +08:00
Qiyuan Gong
1eb884a249
IPEX Duplicate importer V2 (#11310)
* Add gguf support.
* Avoid error when import ipex-llm for multiple times.
* Add check to avoid duplicate replace and revert.
* Add calling from check to avoid raising exceptions in the submodule.
* Add BIGDL_CHECK_DUPLICATE_IMPORT for controlling duplicate checker. Default is true.
2024-06-19 16:29:19 +08:00
Yishuo Wang
ae7b662ed2
add fp16 NPU Linear support and fix intel_npu_acceleration_library version 1.0 support (#11352) 2024-06-19 09:14:59 +08:00
Guoqiong Song
c44b1942ed
fix mistral for transformers>=4.39 (#11191)
* fix mistral for transformers>=4.39
2024-06-18 13:39:35 -07:00
Heyang Sun
67a1e05876
Remove zero3 context manager from LoRA (#11346) 2024-06-18 17:24:43 +08:00
Yishuo Wang
83082e5cc7
add initial support for intel npu acceleration library (#11347) 2024-06-18 16:07:16 +08:00
Shaojun Liu
694912698e
Upgrade scikit-learn to 1.5.0 to fix dependabot issue (#11349) 2024-06-18 15:47:25 +08:00
hxsz1997
44f22cba70
add config and default value (#11344)
* add config and default value

* add config in yaml

* remove lookahead and max_matching_ngram_size in config

* remove streaming and use_fp16_torch_dtype in test yaml

* update task in readme

* update commit of task
2024-06-18 15:28:57 +08:00
Heyang Sun
00f322d8ee
Finetune ChatGLM with Deepspeed Zero3 LoRA (#11314)
* Finetune ChatGLM with Deepspeed Zero3 LoRA

* add deepspeed zero3 config

* rename config

* remove offload_param

* add save_checkpoint parameter

* Update lora_deepspeed_zero3_finetune_chatglm3_6b_arc_2_card.sh

* refine
2024-06-18 12:31:26 +08:00
Yina Chen
5dad33e5af
Support fp8_e4m3 scale search (#11339)
* fp8e4m3 switch off

* fix style
2024-06-18 11:47:43 +08:00
binbin Deng
e50c890e1f
Support finishing PP inference once eos_token_id is found (#11336) 2024-06-18 09:55:40 +08:00
Qiyuan Gong
de4bb97b4f
Remove accelerate 0.23.0 install command in readme and docker (#11333)
* ipex-llm's accelerate has been upgraded to 0.23.0. Remove accelerate 0.23.0 install command in README and docker.
2024-06-17 17:52:12 +08:00
SONG Ge
ef4b6519fb
Add phi-3 model support for pipeline parallel inference (#11334)
* add phi-3 model support

* add phi3 example
2024-06-17 17:44:24 +08:00
hxsz1997
99b309928b
Add lookahead in test_api: transformer_int4_fp16_gpu (#11337)
* add lookahead in test_api:transformer_int4_fp16_gpu

* change the short prompt of summarize

* change short prompt to cnn_64

* change short prompt of summarize
2024-06-17 17:41:41 +08:00
Qiyuan Gong
5d7c9bf901
Upgrade accelerate to 0.23.0 (#11331)
* Upgrade accelerate to 0.23.0
2024-06-17 15:03:11 +08:00
Xin Qiu
183e0c6cf5
glm-4v-9b support (#11327)
* chatglm4v support

* fix style check

* update glm4v
2024-06-17 13:52:37 +08:00
Wenjing Margaret Mao
bca5cbd96c
Modify arc nightly perf to fp16 (#11275)
* change api

* move to pr mode and remove the build

* add batch4 yaml and remove the bigcode

* remove batch4

* revert the starcode

* remove the exclude

* revert

---------

Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-06-17 13:47:22 +08:00
binbin Deng
6ea1e71af0
Update PP inference benchmark script (#11323) 2024-06-17 09:59:36 +08:00
SONG Ge
be00380f1a
Fix pipeline parallel inference past_key_value error in Baichuan (#11318)
* fix past_key_value error

* add baichuan2 example

* fix style

* update doc

* add script link in doc

* fix import error

* update
2024-06-17 09:29:32 +08:00
Yina Chen
0af0102e61
Add quantization scale search switch (#11326)
* add scale_search switch

* remove llama3 instruct

* remove print
2024-06-14 18:46:52 +08:00
Ruonan Wang
8a3247ac71
support batch forward for q4_k, q6_k (#11325) 2024-06-14 18:25:50 +08:00
Yishuo Wang
e8dd8e97ef
fix chatglm lookahead on ARC (#11320) 2024-06-14 16:26:11 +08:00
Shaojun Liu
f5ef94046e
exclude dolly-v2-12b for arc perf test (#11315)
* test arc perf

* test

* test

* exclude dolly-v2-12b:2048

* revert changes
2024-06-14 15:35:56 +08:00
Xiangyu Tian
4359ab3172
LLM: Add /generate_stream endpoint for Pipeline-Parallel-FastAPI example (#11187)
Add /generate_stream and OpenAI-formatted endpoint for Pipeline-Parallel-FastAPI example
2024-06-14 15:15:32 +08:00
Jin Qiao
0e7a31a09c
ChatGLM Examples Restructure regarding Installation Steps (#11285)
* merge install step in glm examples

* fix section

* fix section

* fix tiktoken
2024-06-14 12:37:05 +08:00
Yishuo Wang
91965b5d05
add glm_sdpa back to fix chatglm-6b (#11313) 2024-06-14 10:31:43 +08:00
Yishuo Wang
7f65836cb9
fix chatglm2/3-32k/128k fp16 (#11311) 2024-06-14 09:58:07 +08:00
Xin Qiu
1b0c4c8cb8
use new rotary two in chatglm4 (#11312)
* use new rotary two in chatglm4

* remove
2024-06-13 19:02:18 +08:00
Xin Qiu
f1410d6823
refactor chatglm4 (#11301)
* glm4

* remove useless code

* style

* add rope_ratio

* update

* fix fp16

* fix style
2024-06-13 18:06:04 +08:00
Yishuo Wang
5e25766855
fix and optimize chatglm2-32k and chatglm3-128k (#11306) 2024-06-13 17:37:58 +08:00
binbin Deng
60cb1dac7c
Support PP for qwen1.5 (#11300) 2024-06-13 17:35:24 +08:00
binbin Deng
f97cce2642
Fix import error of ds autotp (#11307) 2024-06-13 16:22:52 +08:00
Jin Qiao
3682c6a979
add glm4 and qwen2 to igpu perf (#11304) 2024-06-13 16:16:35 +08:00
Yishuo Wang
a24666b8f3
fix chatglm3-6b-32k (#11303) 2024-06-13 16:01:34 +08:00
Yishuo Wang
01fe0fc1a2
refactor chatglm2/3 (#11290) 2024-06-13 12:22:58 +08:00
Guancheng Fu
57a023aadc
Fix vllm tp (#11297) 2024-06-13 10:47:48 +08:00
Ruonan Wang
986af21896
fix perf test (#11295) 2024-06-13 10:35:48 +08:00
binbin Deng
220151e2a1
Refactor pipeline parallel multi-stage implementation (#11286) 2024-06-13 10:00:23 +08:00
Ruonan Wang
14b1e6b699
Fix gguf_q4k (#11293)
* update embedding parameter

* update benchmark
2024-06-12 20:43:08 +08:00
Yuwen Hu
8edcdeb0e7
Fix bug that torch.ops.torch_ipex.matmul_bias_out cannot work on Linux MTL for short input (#11292) 2024-06-12 19:12:57 +08:00
Wenjing Margaret Mao
b61f6e3ab1
Add update_parent_folder for nightly_perf_test (#11287)
* add update_parent_folder and change the workflow file

* add update_parent_folder and change the workflow file

* move to pr mode and comment the test

* use one model per config

* revert

---------

Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-06-12 17:58:13 +08:00
Xin Qiu
592f7aa61e
Refine glm1-4 sdp (#11276)
* chatglm

* update

* update

* change chatglm

* update sdpa

* update

* fix style

* fix

* fix glm

* update glm2-32k

* update glm2-32k

* fix cpu

* update

* change lower_bound
2024-06-12 17:11:56 +08:00
Yuwen Hu
cffb932f05
Expose timeout for streamer for fastchat worker (#11288)
* Expose timeout for streamer for fastchat worker

* Change to read from env variables
2024-06-12 17:02:40 +08:00
ivy-lv11
e7a4e2296f
Add Stable Diffusion examples on GPU and CPU (#11166)
* add sdxl and lcm-lora

* readme

* modify

* add cpu

* add license

* modify

* add file
2024-06-12 16:33:25 +08:00
Jin Qiao
f224e98297
Add GLM-4 CPU example (#11223)
* Add GLM-4 example

* add tiktoken dependency

* fix

* fix
2024-06-12 15:30:51 +08:00
Zijie Li
40fc8704c4
Add GPU example for GLM-4 (#11267)
* Add GPU example for GLM-4

* Update streamchat.py

* Fix pretrained arguments

Fix pretrained arguments in generate and streamchat.py

* Update Readme

Update install tiktoken required for GLM-4

* Update comments in generate.py
2024-06-12 14:29:50 +08:00
Qiyuan Gong
0d9cc9c106
Remove duplicate check for ipex (#11281)
* Replacing builtin.import is causing lots of unpredicted problems. Remove this function.
2024-06-12 13:52:02 +08:00
Yishuo Wang
10e480ee96
refactor internlm and internlm2 (#11274) 2024-06-11 14:19:19 +08:00
Yuwen Hu
fac49f15e3
Remove manual importing ipex in all-in-one benchmark (#11272) 2024-06-11 09:32:13 +08:00
Wenjing Margaret Mao
70b17c87be
Merge multiple batches (#11264)
* add merge steps

* move to pr mode

* remove build + add merge.py

* add tohtml and change cp

* change test_batch folder path

* change merge_temp path

* change to html folder

* revert

* change place

* revert 437

* revert space

---------

Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-06-07 18:38:45 +08:00
Xiangyu Tian
4b07712fd8
LLM: Fix vLLM CPU model convert mismatch (#11254)
Fix vLLM CPU model convert mismatch.
2024-06-07 15:54:34 +08:00
Yishuo Wang
42fab480ea
support stablelm2 12b (#11265) 2024-06-07 15:46:00 +08:00
Xin Qiu
dbc3c2d72d
glm4 sdp (#11253)
* glm4 sdp

* fix style

* update comment
2024-06-07 15:42:23 +08:00
Xin Qiu
151fcf37bb
check device name in use_flash_attention (#11263) 2024-06-07 15:07:47 +08:00
Yishuo Wang
2623944604
qwen2 sdpa small fix (#11261) 2024-06-07 14:42:18 +08:00
Yishuo Wang
ea0d03fd28
Refactor baichuan1 7B and 13B (#11258) 2024-06-07 14:29:20 +08:00
Qiyuan Gong
1aa9c9597a
Avoid duplicate import in IPEX auto importer (#11227)
* Add custom import to avoid ipex duplicate importing
* Add scope limitation
2024-06-07 14:08:00 +08:00
Wang, Jian4
6f2684e5c9
Update pp llama.py to save memory (#11233) 2024-06-07 13:18:16 +08:00
Yishuo Wang
ef8e9b2ecd
Refactor qwen2 moe (#11244) 2024-06-07 13:14:54 +08:00
Zijie Li
7b753dc8ca
Update sample output for HF Qwen2 GPU and CPU (#11257) 2024-06-07 11:36:22 +08:00
Zhao Changmin
b7948671de
[WIP] Add look up table in 1st token stage (#11193)
* lookuptb
2024-06-07 10:51:05 +08:00
Yuwen Hu
8c36b5bdde
Add qwen2 example (#11252)
* Add GPU example for Qwen2

* Update comments in README

* Update README for Qwen2 GPU example

* Add CPU example for Qwen2

Sample Output under README pending

* Update generate.py and README for CPU Qwen2

* Update GPU example for Qwen2

* Small update

* Small fix

* Add Qwen2 table

* Update README for Qwen2 CPU and GPU

Update sample output under README

---------

Co-authored-by: Zijie Li <michael20001122@gmail.com>
2024-06-07 10:29:33 +08:00
Shaojun Liu
85df5e7699
fix nightly perf test (#11251) 2024-06-07 09:33:14 +08:00
Xin Qiu
2f809116e2
optimize Chatglm4 (#11239)
* chatglm4

* update

* update

* add rms norm

* chatglm4
2024-06-06 18:25:20 +08:00
hxsz1997
b6234eb4e2
Add task in allinone (#11226)
* add task

* update prompt

* modify typos

* add more cases in summarize

* Make the summarize & QA prompt preprocessing a util function
2024-06-06 17:22:40 +08:00
Yishuo Wang
2e4ccd541c
fix qwen2 cpu (#11240) 2024-06-06 16:24:19 +08:00
Yishuo Wang
e738ec38f4
disable quantize kv in specific qwen model (#11238) 2024-06-06 14:08:39 +08:00
Yishuo Wang
c4e5806e01
add latest optimization in starcoder2 (#11236) 2024-06-06 14:02:17 +08:00
Yishuo Wang
ba27e750b1
refactor yuan2 (#11235) 2024-06-06 13:17:54 +08:00
Shaojun Liu
6be24fdd28
OSPDT: add tpp licenses (#11165)
* add tpp licenses

* add licenses

* add licenses

* delete mitchellh-mapstructure license

* delete stb-image public domain license

* add README.md

* remove core-xe related licenses
2024-06-06 10:59:06 +08:00
Guoqiong Song
09c6780d0c
phi-2 transformers 4.37 (#11161)
* phi-2 transformers 4.37
2024-06-05 13:36:41 -07:00
Guoqiong Song
f6d5c6af78
fix issue 1407 (#11171) 2024-06-05 13:35:57 -07:00
Zijie Li
bfa1367149
Add CPU and GPU example for MiniCPM (#11202)
* Change installation address

Change former address: "https://docs.conda.io/en/latest/miniconda.html#" to new address: "https://conda-forge.org/download/" for 63 occurrences under python\llm\example

* Change Prompt

Change "Anaconda Prompt" to "Miniforge Prompt" for 1 occurrence

* Create and update model minicpm

* Update model minicpm

Update model minicpm under GPU/PyTorch-Models

* Update readme and generate.py

change "prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False)" and delete "pip install transformers==4.37.0"

* Update comments for minicpm GPU

Update comments for generate.py at minicpm GPU

* Add CPU example for MiniCPM

* Update minicpm README for CPU

* Update README for MiniCPM and Llama3

* Update Readme for Llama3 CPU Pytorch

* Update and fix comments for MiniCPM
2024-06-05 18:09:53 +08:00
Yuwen Hu
af96579c76
Update installation guide for pipeline parallel inference (#11224)
* Update installation guide for pipeline parallel inference

* Small fix

* further fix

* Small fix

* Small fix

* Update based on comments

* Small fix

* Small fix

* Small fix
2024-06-05 17:54:29 +08:00
Yina Chen
ed67435491
Support Fp6 k in ipex-llm (#11222)
* support fp6_k

* support fp6_k

* remove

* fix style
2024-06-05 17:34:36 +08:00
binbin Deng
a6674f5bce
Fix should_use_fuse_rope error of Qwen1.5-MoE-A2.7B-Chat (#11216) 2024-06-05 15:56:10 +08:00
Wenjing Margaret Mao
231b968aba
Modify the check_results.py to support batch 2&4 (#11133)
* add batch 2&4 and exclude to perf_test

* modify the perf-test&437 yaml

* modify llm_performance_test.yml

* remove batch 4

* modify check_results.py to support batch 2&4

* change the batch_size format

* remove genxir

* add str(batch_size)

* change actual_test_cases in check_results file to support batch_size

* change html highlight

* less models to test html and html_path

* delete the moe model

* split batch html

* split

* use installing from pypi

* use installing from pypi - batch2

* revert cpp

* revert cpp

* merge two jobs into one, test batch_size in one job

* merge two jobs into one, test batch_size in one job

* change file directory in workflow

* try catch deal with odd file without batch_size

* modify pandas version

* change the dir

* organize the code

* organize the code

* remove Qwen-MOE

* modify based on feedback

* modify based on feedback

* modify based on second round of feedback

* modify based on second round of feedback + change run-arc.sh mode

* modify based on second round of feedback + revert config

* modify based on second round of feedback + revert config

* modify based on second round of feedback + remove comments

* modify based on second round of feedback + remove comments

* modify based on second round of feedback + revert arc-perf-test

* modify based on third round of feedback

* change error type

* change error type

* modify check_results.html

* split batch into two folders

* add all models

* move csv_name

* revert pr test

* revert pr test

---------

Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-06-05 15:04:55 +08:00
Xin Qiu
566691c5a3
quantized attention forward for minicpm (#11200)
* quantized minicpm

* fix style check
2024-06-05 09:15:25 +08:00
Jiao Wang
bb83bc23fd
Fix Starcoder issue on CPU on transformers 4.36+ (#11190)
* fix starcoder for sdpa

* update

* style
2024-06-04 10:05:40 -07:00
Kai Huang
f93664147c
Update config.yaml (#11208)
* update config.yaml

* fix

* minor

* style
2024-06-04 19:58:18 +08:00
Xiangyu Tian
ac3d53ff5d
LLM: Fix vLLM CPU version error (#11206)
Fix vLLM CPU version error
2024-06-04 19:10:23 +08:00
Ruonan Wang
1dde204775
update q6k (#11205) 2024-06-04 17:14:33 +08:00
Qiyuan Gong
ce3f08b25a
Fix IPEX auto importer (#11192)
* Fix ipex auto importer with Python builtins.
* Raise errors if the user imports ipex manually before importing ipex_llm. Do nothing if they import ipex after importing ipex_llm.
* Remove import ipex in examples.
2024-06-04 16:57:18 +08:00
Yina Chen
711fa0199e
Fix fp6k phi3 ppl core dump (#11204) 2024-06-04 16:44:27 +08:00
Xiangyu Tian
f02f097002
Fix vLLM version in CPU/vLLM-Serving example README (#11201) 2024-06-04 15:56:55 +08:00
Yishuo Wang
6454655dcc
use sdp in baichuan2 13b (#11198) 2024-06-04 15:39:00 +08:00
Yishuo Wang
d90cd977d0
refactor stablelm (#11195) 2024-06-04 13:14:43 +08:00
Zijie Li
a644e9409b
Miniconda/Anaconda -> Miniforge update in examples (#11194)
* Change installation address

Change former address: "https://docs.conda.io/en/latest/miniconda.html#" to new address: "https://conda-forge.org/download/" for 63 occurrences under python\llm\example

* Change Prompt

Change "Anaconda Prompt" to "Miniforge Prompt" for 1 occurrence
2024-06-04 10:14:02 +08:00
Xin Qiu
5f13700c9f
optimize Minicpm (#11189)
* minicpm optimize

* update
2024-06-03 18:28:29 +08:00
Qiyuan Gong
15a6205790
Fix LoRA tokenizer for Llama and chatglm (#11186)
* Set pad_token to eos_token if it's None. Otherwise, use model config.
2024-06-03 15:35:38 +08:00
Cengguang Zhang
3eb13ccd8c
LLM: fix input length condition in deepspeed all-in-one benchmark. (#11185) 2024-06-03 10:05:43 +08:00
Shaojun Liu
401013a630
Remove chatglm_C Module to Eliminate LGPL Dependency (#11178)
* remove chatglm_C.**.pyd to solve ngsolve weak copyright vuln

* fix style check error

* remove chatglm native int4 from langchain
2024-05-31 17:03:11 +08:00
Ruonan Wang
50b5f4476f
update q4k convert (#11179) 2024-05-31 11:36:53 +08:00
Wang, Jian4
c0f1be6aea
Fix pp logic (#11175)
* only send non-None batches, with rank1-n sending first

* always send first
2024-05-30 16:40:59 +08:00
ZehuaCao
4127b99ed6
Fix null pointer dereferences error. (#11125)
* delete unused function on tgi_server

* update

* update

* fix style
2024-05-30 16:16:10 +08:00
Guancheng Fu
50ee004ac7
Fix vllm condition (#11169)
* add use-vllm

* done

* fix style

* fix done
2024-05-30 15:23:17 +08:00
Jin Qiao
dcbf4d3d0a
Add phi-3-vision example (#11156)
* Add phi-3-vision example (HF-Automodels)

* fix

* fix

* fix

* Add phi-3-vision CPU example (HF-Automodels)

* add in readme

* fix

* fix

* fix

* fix

* use fp8 for gpu example

* remove eval
2024-05-30 10:02:47 +08:00
Jiao Wang
93146b9433
Reconstruct Speculative Decoding example directory (#11136)
* update

* update

* update
2024-05-29 13:15:27 -07:00
Xiangyu Tian
2299698b45
Refine Pipeline Parallel FastAPI example (#11168) 2024-05-29 17:16:50 +08:00
Ruonan Wang
9bfbf78bf4
update api usage of xe_batch & fp16 (#11164)
* update api usage

* update setup.py
2024-05-29 15:15:14 +08:00
Yina Chen
e29e2f1c78
Support new fp8 e4m3 (#11158) 2024-05-29 14:27:14 +08:00
Wang, Jian4
8e25de1126
LLM: Add codegeex2 example (#11143)
* add codegeex example

* update

* update cpu

* add GPU

* add gpu

* update readme
2024-05-29 10:00:26 +08:00
ZehuaCao
751e1a4e29
Fix concurrent issue in autoTP streaming. (#11150)
* add benchmark test

* update
2024-05-29 08:22:38 +08:00
Yishuo Wang
bc5008f0d5
disable sdp_causal in phi-3 to fix overflow (#11157) 2024-05-28 17:25:53 +08:00
SONG Ge
33852bd23e
Refactor pipeline parallel device config (#11149)
* refactor pipeline parallel device config

* meet comments

* update example

* add warnings and update code doc
2024-05-28 16:52:46 +08:00
hxsz1997
62b2d8af6b
Add lookahead in all-in-one (#11142)
* add lookahead in allinone

* delete save to csv in run_transformer_int4_gpu

* change lookup to lookahead

* fix the error of add model.peak_memory

* Set transformer_int4_gpu as the default option

* add comment of transformer_int4_fp16_lookahead_gpu
2024-05-28 15:39:58 +08:00
Xiangyu Tian
b44cf405e2
Refine Pipeline-Parallel-Fastapi example README (#11155) 2024-05-28 15:18:21 +08:00
Yishuo Wang
d307622797
fix first token sdp with batch (#11153) 2024-05-28 15:03:06 +08:00
Yina Chen
3464440839
fix qwen import error (#11154) 2024-05-28 14:50:12 +08:00
Jin Qiao
25b6402315
Add Windows GPU unit test (#11050)
* Add Windows GPU UT

* temporarily remove ut on arc

* retry

* retry

* retry

* fix

* retry

* retry

* fix

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* fix

* retry

* retry

* retry

* retry

* retry

* retry

* merge into single workflow

* retry inference test

* retry

* retrigger

* try to fix inference test

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* check lower_bound

* retry

* retry

* try example test

* try fix example test

* retry

* fix

* separate function into shell script

* remove cygpath

* try remove all cygpath

* retry

* retry

* Revert "try remove all cygpath"

This reverts commit 7ceeff3e48f08429062ecef548c1a3ad3488756f.

* Revert "retry"

This reverts commit 40ea2457843bff6991b8db24316cde5de1d35418.

* Revert "retry"

This reverts commit 817d0db3e5aec3bd449d3deaf4fb01d3ecfdc8a3.

* enable ut

* fix

* retrigger

* retrigger

* update download url

* fix

* fix

* retry

* add comment

* fix
2024-05-28 13:29:47 +08:00
Yina Chen
b6b70d1ba0
Divide core-xe packages (#11131)
* temp

* add batch

* fix style

* update package name

* fix style

* add workflow

* use temp version to run uts

* trigger performance test

* trigger win igpu perf

* revert workflow & setup
2024-05-28 12:00:18 +08:00
binbin Deng
c9168b85b7
Fix error during merging adapter (#11145) 2024-05-27 19:41:42 +08:00
Guancheng Fu
daf7b1cd56
[Docker] Fix image using two cards error (#11144)
* fix all

* done
2024-05-27 16:20:13 +08:00
Xiangyu Tian
5c8ccf0ba9
LLM: Add Pipeline-Parallel-FastAPI example (#10917)
Add multi-stage Pipeline-Parallel-FastAPI example

---------

Co-authored-by: hzjane <a1015616934@qq.com>
2024-05-27 14:46:29 +08:00
Ruonan Wang
d550af957a
fix security issue of eagle (#11140)
* fix security issue of eagle

* small fix
2024-05-27 10:15:28 +08:00
binbin Deng
367de141f2
Fix mixtral-8x7b with transformers=4.37.0 (#11132) 2024-05-27 09:50:54 +08:00
Jean Yu
ab476c7fe2
Eagle Speculative Sampling examples (#11104)
* Eagle Speculative Sampling examples

* rm multi-gpu and ray content

* updated README to include Arc A770
2024-05-24 11:13:43 -07:00
Guancheng Fu
fabc395d0d
add langchain vllm interface (#11121)
* done

* fix

* fix

* add vllm

* add langchain vllm examples

* add docs

* temp
2024-05-24 17:19:27 +08:00
ZehuaCao
63e95698eb
[LLM] Reopen autotp generate_stream (#11120)
* reopen autotp generate_stream

* fix style error

* update
2024-05-24 17:16:14 +08:00
Yishuo Wang
1dc680341b
fix phi-3-vision import (#11129) 2024-05-24 15:57:15 +08:00
Guancheng Fu
7f772c5a4f
Add half precision for fastchat models (#11130) 2024-05-24 15:41:14 +08:00
Zhao Changmin
65f4212f89
Fix qwen 14b run into register attention fwd (#11128)
* fix qwen 14b
2024-05-24 14:45:07 +08:00
Shaojun Liu
373f9e6c79
add ipex-llm-init.bat for Windows (#11082)
* add ipex-llm-init.bat for Windows

* update setup.py
2024-05-24 14:26:25 +08:00
Qiyuan Gong
120a0035ac
Fix type mismatch in eval for Baichuan2 QLora example (#11117)
* During the evaluation stage, Baichuan2 will raise type mismatch when training with bfloat16. Fix this issue by modifying modeling_baichuan.py. Add doc about how to modify this file.
2024-05-24 14:14:30 +08:00
Yishuo Wang
1db9d9a63b
optimize internlm2 xcomposer again (#11124) 2024-05-24 13:44:52 +08:00
Yishuo Wang
9372ce87ce
fix internlm xcomposer2 fp16 (#11123) 2024-05-24 11:03:31 +08:00
Cengguang Zhang
011b9faa5c
LLM: unify baichuan2-13b alibi mask dtype with model dtype. (#11107)
* LLM: unify alibi mask dtype.

* fix comments.
2024-05-24 10:27:53 +08:00
Jiao Wang
0a06a6e1d4
Update tests for transformers 4.36 (#10858)
* update unit test

* update

* update

* update

* update

* update

* fix gpu attention test

* update

* update

* update

* update

* update

* update

* update example test

* replace replit code

* update

* update

* update

* update

* set safe_serialization false

* perf test

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* delete

* update

* update

* update

* update

* update

* update

* revert

* update
2024-05-24 10:26:38 +08:00
Xiangyu Tian
b3f6faa038
LLM: Add CPU vLLM entrypoint (#11083)
Add CPU vLLM entrypoint and update CPU vLLM serving example.
2024-05-24 09:16:59 +08:00
Yishuo Wang
797dbc48b8
fix phi-2 and phi-3 convert (#11116) 2024-05-23 17:37:37 +08:00
Yishuo Wang
37b98a531f
support running internlm xcomposer2 on gpu and add sdp optimization (#11115) 2024-05-23 17:26:24 +08:00
Zhao Changmin
c5e8b90c8d
Add Qwen register attention implementation (#11110)
* qwen_register
2024-05-23 17:17:45 +08:00
Yishuo Wang
0e53f20edb
support running internlm-xcomposer2 on cpu (#11111) 2024-05-23 16:36:09 +08:00
Yuwen Hu
d36b41d59e
Add setuptools limitation for ipex-llm[xpu] (#11102)
* Add setuptools limitation for ipex-llm[xpu]

* llamaindex option update
2024-05-22 18:20:30 +08:00
Yishuo Wang
cd4dff09ee
support phi-3 vision (#11101) 2024-05-22 17:43:50 +08:00
Zhao Changmin
15d906a97b
Update linux igpu run script (#11098)
* update run script
2024-05-22 17:18:07 +08:00
Kai Huang
f63172ef63
Align ppl with llama.cpp (#11055)
* update script

* remove

* add header

* update readme
2024-05-22 16:43:11 +08:00
Qiyuan Gong
f6c9ffe4dc
Add WANDB_MODE and HF_HUB_OFFLINE to XPU finetune README (#11097)
* Add WANDB_MODE=offline to avoid multi-GPUs finetune errors.
* Add HF_HUB_OFFLINE=1 to avoid Hugging Face related errors.
2024-05-22 15:20:53 +08:00
Shaojun Liu
584439e498
update homepage url for ipex-llm (#11094)
* update homepage url

* Update python version to 3.11

* Update long description
2024-05-22 11:10:44 +08:00
Xin Qiu
71bcd18f44
fix qwen vl (#11090) 2024-05-21 18:40:29 +08:00
Yishuo Wang
f00625f9a4
refactor qwen2 (#11087) 2024-05-21 16:53:42 +08:00
Qiyuan Gong
492ed3fd41
Add verified models to GPU finetune README (#11088)
* Add verified models to GPU finetune README
2024-05-21 15:49:15 +08:00
Qiyuan Gong
1210491748
ChatGLM3, Baichuan2 and Qwen1.5 QLoRA example (#11078)
* Add chatglm3, qwen15-7b and baichuan-7b QLoRA alpaca example
* Remove unnecessary tokenization setting.
2024-05-21 15:29:43 +08:00
ZehuaCao
842d6dfc2d
Further Modify CPU example (#11081)
* modify CPU example

* update
2024-05-21 13:55:47 +08:00
Yishuo Wang
d830a63bb7
refactor qwen (#11074) 2024-05-20 18:08:37 +08:00
Wang, Jian4
74950a152a
Fix tgi_api_server error file name (#11075) 2024-05-20 16:48:40 +08:00
Yishuo Wang
4e97047d70
fix baichuan2 13b fp16 (#11071) 2024-05-20 11:21:20 +08:00
binbin Deng
7170dd9192
Update guide for running qwen with AutoTP (#11065) 2024-05-20 10:53:17 +08:00
Wang, Jian4
a2e1578fd9
Merge tgi_api_server to main (#11036)
* init

* fix style

* speculative can not use benchmark

* add tgi server readme
2024-05-20 09:15:03 +08:00
Yishuo Wang
31ce3e0c13
refactor baichuan2-13b (#11064) 2024-05-17 16:25:30 +08:00