Commit graph

2155 commits

Author SHA1 Message Date
Ruonan Wang
7e917d6cfb
fix gptq of llama (#11749)
* fix gptq of llama

* small fix
2024-08-09 16:39:25 +08:00
Yina Chen
dd46c141bd
Phi3 support compresskv (#11733)
* phi3 support compresskv

* fix phi3 mtl error

* fix conflict with quant kv

* fix abnormal on mtl

* fix style

* use slide windows size to compress kv

* support sliding window

* fix style

* fix style

* temp: partial support quant kv

* support quant kv with compress kv, todo: model check

* temp

* fix style

* fix style

* remove prepare

* address comment

* default -> 1.8k
2024-08-09 15:43:43 +08:00
Qiyuan Gong
d8808cc2e3
Mistral apply_rotary_pos_emb_no_cache_xpu use rope_theta from config (#11747)
mistral-7B-instruct-v0.2 and mistral-7B-instruct-v0.1 use different rope_theta values (v0.2 uses 1e6, v0.1 uses 1e4). Pass self.config.rope_theta to apply_rotary_pos_emb_no_cache_xpu to avoid output differences.
2024-08-09 10:35:51 +08:00
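The fix above amounts to deriving the RoPE base frequency from the model config instead of hardcoding one value for both Mistral variants. A minimal sketch of the idea (the helper name `rotary_inv_freq` is hypothetical; the real `apply_rotary_pos_emb_no_cache_xpu` lives in ipex-llm's XPU kernels):

```python
def rotary_inv_freq(head_dim, rope_theta):
    """RoPE inverse frequencies; rope_theta must be read from the model
    config, since mistral v0.1 and v0.2 ship different values."""
    return [1.0 / (rope_theta ** (i / head_dim)) for i in range(0, head_dim, 2)]

# Hardcoding one theta would skew the other model's rotary embeddings.
freq_v01 = rotary_inv_freq(128, 10000.0)    # v0.1-style rope_theta
freq_v02 = rotary_inv_freq(128, 1000000.0)  # v0.2-style rope_theta
```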
Xiangyu Tian
044e486480
Fix vLLM CPU /chat endpoint (#11748) 2024-08-09 10:33:52 +08:00
Jinhe
27b4b104ed
Add qwen2-1.5b-instruct into igpu performance (#11735)
* updated qwen1.5B to all transformer==4.37 yaml

* updated qwen1.5B to all transformer==4.37 yaml
2024-08-08 16:42:18 +08:00
Shaojun Liu
107f7aafd0
enable inference mode for deepspeed tp serving (#11742) 2024-08-08 14:38:30 +08:00
Zijie Li
9e65cf00b3
Add openai-whisper pytorch gpu (#11736)
* Add openai-whisper pytorch gpu

* Update README.md

* Update README.md

* fix typo

* fix names update readme

* Update README.md
2024-08-08 12:32:59 +08:00
Jinhe
d0c89fb715
updated llama.cpp and ollama quickstart (#11732)
* updated llama.cpp and ollama quickstart.md

* added qwen2-1.5B sample output

* revision on quickstart updates

* revision on quickstart updates

* revision on qwen2 readme

* added 2 troubleshoots

* troubleshoot revision
2024-08-08 11:04:01 +08:00
Yishuo Wang
54cc9353db
support and optimize minicpm-v-2_6 (#11738) 2024-08-07 18:21:16 +08:00
Yina Chen
e956e71fc1
fix conflict with quant kv (#11737) 2024-08-07 18:10:30 +08:00
Ruonan Wang
00a5574c8a
Use merge_qkv to replace fused_qkv for llama2 (#11727)
* update 4.38

* support new versions

* update

* fix style

* fix style

* update rope

* temp test sdpa

* fix style

* fix cpu ut
2024-08-07 18:04:01 +08:00
Yina Chen
d2abc9711b
Fix MTL 4k input qwen2 compresskv error (#11734)
* fix

* fix style
2024-08-07 16:21:57 +08:00
Yina Chen
a71ae7c22b
Support minicpm compresskv & modify default compresskv config & default enable compresskv on mtl 2.5k~4.5k (#11726)
* support minicpm & modify default & default enable on mtl 2.5k~4.5k

* fix style
2024-08-07 11:35:39 +08:00
Yishuo Wang
c093f7d980
fix phi3 (#11729) 2024-08-07 09:39:46 +08:00
Zijie Li
e7f7141781
Add benchmark util for transformers 4.42 (#11725)
* add new benchmark_util.py

Add new benchmark_util.py for transformers>=4.43.1. The old one renamed to benchmark_util_prev.py.

* Small fix to import code

* Update __init__.py

* fix file names

* Update lint-python

Update lint-python to exclude benchmark_util_4_29.py
benchmark_util_4_43.py

* Update benchmark_util_4_43.py

* add benchmark_util for transformers 4.42
2024-08-07 08:48:07 +08:00
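The commit above keeps one benchmark util per transformers generation (benchmark_util_4_29.py, benchmark_util_4_43.py, with the old copy renamed to benchmark_util_prev.py). A hedged sketch of the version dispatch — the exact thresholds are assumptions:

```python
def pick_benchmark_util(transformers_version: str) -> str:
    """Return which benchmark util module to use for a transformers version.

    Hypothetical dispatch for the split described above: one util for
    transformers >= 4.43, a 4.29-era copy for older 4.x, and the renamed
    previous util as the fallback.
    """
    major, minor = (int(x) for x in transformers_version.split(".")[:2])
    if (major, minor) >= (4, 43):
        return "benchmark_util_4_43"
    if (major, minor) >= (4, 29):
        return "benchmark_util_4_29"
    return "benchmark_util_prev"
```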
Ch1y0q
4676af2054
add gemma2 example (#11724)
* add `gemma2`

* update `transformers` version

* update `README.md`
2024-08-06 21:17:50 +08:00
SichengStevenLi
985213614b
Removed no longer needed models for Arc nightly perf (#11722)
* removed LLMs that are no longer needed

Removed: 
mistralai/Mistral-7B-v0.1
deepseek-ai/deepseek-coder-6.7b-instruct

* Update arc-perf-test-batch4.yaml

Removed: 
deepseek-ai/deepseek-coder-6.7b-instruct
mistralai/Mistral-7B-v0.1

* Update arc-perf-test.yaml

Removed: 
deepseek-ai/deepseek-coder-6.7b-instruct
mistralai/Mistral-7B-v0.1

* Create arc-perf-transformers-438.yaml

* Moved arc-perf-transformers-438.yaml location

* Create arc-perf-transformers-438-batch2.yaml

* Create arc-perf-transformers-438-batch4.yaml

* Delete python/llm/test/benchmark/arc-perf-transformers-438-batch2.yaml

* Delete python/llm/test/benchmark/arc-perf-transformers-438-batch4.yaml

* Delete python/llm/test/benchmark/arc-perf-transformers-438.yaml
2024-08-06 16:12:00 +08:00
Yishuo Wang
929675aa6b
support latest phi3 (#11721) 2024-08-06 15:52:55 +08:00
Jin, Qiao
11650b6f81
upgrade glm-4v example transformers version (#11719) 2024-08-06 14:55:09 +08:00
Yishuo Wang
bbdff6edeb
optimize internvl2 4b performance (#11720) 2024-08-06 14:25:08 +08:00
Yishuo Wang
f44b732aa8
support internvl2-4b (#11718) 2024-08-06 13:36:32 +08:00
Jin, Qiao
7f241133da
Add MiniCPM-Llama3-V-2_5 GPU example (#11693)
* Add MiniCPM-Llama3-V-2_5 GPU example

* fix
2024-08-06 10:22:41 +08:00
Jin, Qiao
808d9a7bae
Add MiniCPM-V-2 GPU example (#11699)
* Add MiniCPM-V-2 GPU example

* add example in README.md

* add example in README.md
2024-08-06 10:22:33 +08:00
Zijie Li
8fb36b9f4a
add new benchmark_util.py (#11713)
* add new benchmark_util.py
2024-08-05 16:18:48 +08:00
Wang, Jian4
493cbd9a36
Support lightweight-serving with internlm-xcomposer2-vl-7b multimodal input (#11703)
* init image_list

* enable internlm-xcomposer2 image input

* update style

* add readme

* update model

* update readme
2024-08-05 09:36:04 +08:00
Ruonan Wang
aa98ef96fe
change mixed_precision to q6_k (#11706) 2024-08-02 15:55:16 +08:00
Xiangyu Tian
1baa3efe0e
Optimizations for Pipeline Parallel Serving (#11702)
Optimizations for Pipeline Parallel Serving
2024-08-02 12:06:59 +08:00
Yina Chen
8d1e0bd2f4
add sdp causal support in llama (#11705) 2024-08-02 10:27:40 +08:00
Ruonan Wang
736a7ef72e
add sdp_causal for mistral 4.36 (#11686)
* add sdp_causal for mistral

* fix

* update
2024-08-01 18:57:31 +08:00
Yina Chen
45c730ff39
Chatglm support compresskv (#11690)
* chatglm4 support compresskv

* fix

* fix style

* support chatglm2

* fix quantkv conflict

* fix style
2024-08-01 18:20:20 +08:00
Qiyuan Gong
762ad49362
Add RANK_WAIT_TIME into DeepSpeed-AutoTP to avoid CPU memory OOM (#11704)
* DeepSpeed-AutoTP will start multiple processors to load models and convert them in CPU memory. If model/rank_num is large, this will lead to OOM. Add RANK_WAIT_TIME to reduce memory usage by controlling model reading parallelism.
2024-08-01 18:16:21 +08:00
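The RANK_WAIT_TIME mechanism described above reduces peak CPU memory by serializing checkpoint reads across ranks. A minimal sketch of that staggering (the wrapper name is hypothetical; the actual wiring in ipex-llm's DeepSpeed-AutoTP scripts may differ):

```python
import os
import time

def staggered_model_load(local_rank: int, load_fn):
    """Stagger model loading across AutoTP ranks to cap peak CPU memory.

    Each rank sleeps local_rank * RANK_WAIT_TIME seconds before reading the
    checkpoint, so ranks load and convert one after another instead of all
    at once.
    """
    wait = int(os.environ.get("RANK_WAIT_TIME", "0"))
    time.sleep(local_rank * wait)
    return load_fn()
```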
hxsz1997
8ef4caaf5d
add 3k and 4k input of nightly perf test on iGPU (#11701)
* Add 3k&4k input in workflow for iGPU (#11685)

* add 3k&4k input in workflow

* comment for test

* comment models for accelerate test

* remove OOM models

* modify typo

* change test model (#11696)

* reverse test models (#11700)
2024-08-01 14:17:46 +08:00
Guancheng Fu
afeca38a47
Fix import vllm condition (#11682) 2024-07-31 13:50:01 +08:00
Ruonan Wang
54bf3a23a6
add fallback for unsupported k-quants (#11691)
* add fallback

* fix style

* fix
2024-07-31 11:39:58 +08:00
Zijie Li
5079ed9e06
Add Llama3.1 example (#11689)
* Add Llama3.1 example

Add Llama3.1 example for Linux arc and Windows MTL

* Changes made to adjust compatibilities

transformers changed to 4.43.1

* Update index.rst

* Update README.md

* Update index.rst

* Update index.rst

* Update index.rst
2024-07-31 10:53:30 +08:00
Jin, Qiao
6e3ce28173
Upgrade glm-4 example transformers version (#11659)
* upgrade glm-4 example transformers version

* move pip install in one line
2024-07-31 10:24:50 +08:00
Jin, Qiao
a44ab32153
Switch to conhost when running on NPU (#11687) 2024-07-30 17:08:06 +08:00
Wang, Jian4
b119825152
Remove tgi parameter validation (#11688)
* remove validation

* add min warm up

* remove no need source
2024-07-30 16:37:44 +08:00
Yina Chen
670ad887fc
Qwen support compress kv (#11680)
* Qwen support compress kv

* fix style

* fix
2024-07-30 11:16:42 +08:00
hxsz1997
9b36877897
disable default quantize_kv of GQA on MTL (#11679)
* disable default quantize_kv of gqa in mtl

* fix style

* fix style

* fix style

* fix style

* fix style

* fix style
2024-07-30 09:38:46 +08:00
Yishuo Wang
c02003925b
add mlp for gemma2 (#11678) 2024-07-29 16:10:23 +08:00
RyuKosei
1da1f1dd0e
Combine two versions of run_wikitext.py (#11597)
* Combine two versions of run_wikitext.py

* Update run_wikitext.py

* Update run_wikitext.py

* aligned the format

* update error display

* simplified argument parser

---------

Co-authored-by: jenniew <jenniewang123@gmail.com>
2024-07-29 15:56:16 +08:00
Yishuo Wang
6f999e6e90
add sdp for gemma2 (#11677) 2024-07-29 15:15:47 +08:00
Ruonan Wang
c11d5301d7
add sdp fp8 for llama (#11671)
* add sdp fp8 for llama

* fix style

* refactor
2024-07-29 13:46:22 +08:00
Yishuo Wang
7f88ce23cd
add more gemma2 optimization (#11673) 2024-07-29 11:13:00 +08:00
Yishuo Wang
3e8819734b
add basic gemma2 optimization (#11672) 2024-07-29 10:46:51 +08:00
Guoqiong Song
336dfc04b1
fix 1482 (#11661)
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-07-26 12:39:09 -07:00
Heyang Sun
ba01b85c13
empty cache only for 1st token but rest token to speed up (#11665) 2024-07-26 16:46:21 +08:00
Yina Chen
fc7f8feb83
Support compress kv (#11642)
* mistral snapkv

* update

* mtl update

* update

* update

* update

* add comments

* style fix

* fix style

* support llama

* llama use compress kv

* support mistral 4.40

* fix style

* support diff transformers versions

* move snapkv util to kv

* fix style

* meet comments & small fix

* revert all in one

* fix indent

---------

Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2024-07-26 16:02:00 +08:00
Yishuo Wang
6bcdc6cc8f
fix qwen2 cpu (#11663) 2024-07-26 13:41:51 +08:00
Wang, Jian4
23681fbf5c
Support codegeex4-9b for lightweight-serving (#11648)
* add options, support prompt and not return end_token

* enable openai parameter

* set do_sample None and update style
2024-07-26 09:41:03 +08:00
Guancheng Fu
a4d30a8211
Change logic for detecting if vllm is available (#11657)
* fix

* fix
2024-07-25 15:24:19 +08:00
Qiyuan Gong
0c6e0b86c0
Refine continuation get input_str (#11652)
* Remove duplicate code in continuation get input_str.
* Avoid infinite loop in all-in-one due to test_length not in the list.
2024-07-25 14:41:19 +08:00
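The second fix above guards the all-in-one benchmark against a test_length that is absent from the supported prompt-length list. A hypothetical sketch of such a guard (names and the fallback policy are assumptions, not the exact ipex-llm logic):

```python
def nearest_supported_length(test_length, supported=(32, 256, 1024, 2048)):
    """Map a requested prompt length onto a supported one.

    If test_length is not in the list, fall back to the nearest supported
    value instead of searching forever, avoiding the infinite loop the
    commit above describes.
    """
    if test_length in supported:
        return test_length
    return min(supported, key=lambda s: abs(s - test_length))
```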
RyuKosei
2fbd375a94
update several models for nightly perf test (#11643)
Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-07-25 14:06:08 +08:00
Xiangyu Tian
4499d25c26
LLM: Fix ParallelLMHead convert in vLLM cpu (#11654) 2024-07-25 13:07:19 +08:00
binbin Deng
777e61d8c8
Fix qwen2 & int4 on NPU (#11646) 2024-07-24 13:14:39 +08:00
Yishuo Wang
1b3b46e54d
fix chatglm new model (#11639) 2024-07-23 13:44:56 +08:00
Xu, Shuo
7f80db95eb
Change run.py in benchmark to support phi-3-vision in arc-perf (#11638)
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-23 09:51:36 +08:00
Xiangyu Tian
060792a648
LLM: Refine Pipeline Parallel FastAPI (#11587)
Refine Pipeline Parallel FastAPI
2024-07-22 15:52:05 +08:00
Wang, Jian4
1eed0635f2
Add lightweight serving and support tgi parameter (#11600)
* init tgi request

* update openai api

* update for pp

* update and add readme

* add to docker

* add start bash

* update

* update

* update
2024-07-19 13:15:56 +08:00
Xiangyu Tian
d27a8cd08c
Fix Pipeline Parallel dtype (#11623) 2024-07-19 13:07:40 +08:00
Yishuo Wang
d020ad6397
add save_low_bit support for DiskEmbedding (#11621) 2024-07-19 10:34:53 +08:00
Guoqiong Song
380717f50d
fix gemma for 4.41 (#11531)
* fix gemma for 4.41
2024-07-18 15:02:50 -07:00
Guoqiong Song
5a6211fd56
fix minicpm for transformers>=4.39 (#11533)
* fix minicpm for transformers>=4.39
2024-07-18 15:01:57 -07:00
Yishuo Wang
0209427cf4
Add disk_embedding parameter to support put Embedding layer on CPU (#11617) 2024-07-18 17:06:06 +08:00
Yuwen Hu
2478e2c14b
Add check in iGPU perf workflow for results integrity (#11616)
* Add csv check for igpu benchmark workflow (#11610)

* add csv check for igpu benchmark workflow

* ready to test

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>

* Restore the temporarily removed models in iGPU-perf (#11615)

Co-authored-by: ATMxsp01 <shou.xu@intel.com>

---------

Co-authored-by: Xu, Shuo <100334393+ATMxsp01@users.noreply.github.com>
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-18 14:13:16 +08:00
Xiangyu Tian
4594a3dd6c
LLM: Fix DummyLayer.weight device in Pipeline Parallel (#11612) 2024-07-18 13:39:34 +08:00
Ruonan Wang
4da93709b1
update doc/setup to use onednn gemm for cpp (#11598)
* update doc/setup to use onednn gemm

* small fix

* Change TOC of graphrag quickstart back
2024-07-18 13:04:38 +08:00
Yishuo Wang
f4077fa905
fix llama3-8b npu long input stuck (#11613) 2024-07-18 11:08:17 +08:00
Zhao Changmin
e5c0058c0e
fix baichuan (#11606) 2024-07-18 09:43:36 +08:00
Guoqiong Song
bfcdc35b04
phi-3 on "transformers>=4.37.0,<=4.42.3" (#11534) 2024-07-17 17:19:57 -07:00
Guoqiong Song
d64711900a
Fix cohere model on transformers>=4.41 (#11575)
* fix cohere model for 4-41
2024-07-17 17:18:59 -07:00
Guoqiong Song
5b6eb85b85
phi model readme (#11595)
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
2024-07-17 17:18:34 -07:00
Wang, Jian4
9c15abf825
Refactor fastapi-serving and add one card serving (#11581)
* init fastapi-serving one card

* mv api code to source

* update worker

* update for style-check

* add worker

* update bash

* update

* update worker name and add readme

* rename update

* rename to fastapi
2024-07-17 11:12:43 +08:00
Yishuo Wang
5837bc0014
fix chatglm3 npu output (#11590) 2024-07-16 18:16:30 +08:00
Guancheng Fu
06930ab258
Enable ipex-llm optimization for lm head (#11589)
* basic

* Modify convert.py

* fix
2024-07-16 16:48:44 +08:00
Heyang Sun
365adad59f
Support LoRA ChatGLM with Alpaca Dataset (#11580)
* Support LoRA ChatGLM with Alpaca Dataset

* refine

* fix

* add 2-card alpaca
2024-07-16 15:40:02 +08:00
Yina Chen
99c22745b2
fix qwen 14b fp6 abnormal output (#11583) 2024-07-16 10:59:00 +08:00
Yishuo Wang
c279849d27
add disk embedding api (#11585) 2024-07-16 10:43:39 +08:00
Xiangyu Tian
79c742dfd5
LLM: Add XPU Memory Optimizations for Pipeline Parallel (#11567)
Add XPU Memory Optimizations for Pipeline Parallel
2024-07-16 09:44:50 +08:00
Ch1y0q
50cf563a71
Add example: MiniCPM-V (#11570) 2024-07-15 10:55:48 +08:00
Zhao Changmin
06745e5742
Add npu benchmark all-in-one script (#11571)
* npu benchmark
2024-07-15 10:42:37 +08:00
Yishuo Wang
019da6c0ab
use mlp silu_mul fusion in qwen2 to optimize memory usage (#11574) 2024-07-13 16:32:54 +08:00
Xu, Shuo
13a72dc51d
Test MiniCPM performance on iGPU in a more stable way (#11573)
* Test MiniCPM performance on iGPU in a more stable way

* small fix

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-12 17:07:41 +08:00
Xiangyu Tian
0981b72275
Fix /generate_stream api in Pipeline Parallel FastAPI (#11569) 2024-07-12 13:19:42 +08:00
Yishuo Wang
a945500a98
fix internlm xcomposser stream chat (#11564) 2024-07-11 18:21:17 +08:00
Zhao Changmin
b9c66994a5
add npu sdp (#11562) 2024-07-11 16:57:35 +08:00
binbin Deng
2b8ad8731e
Support pipeline parallel for glm-4v (#11545) 2024-07-11 16:06:06 +08:00
Xiangyu Tian
7f5111a998
LLM: Refine start script for Pipeline Parallel Serving (#11557)
Refine start script and readme for Pipeline Parallel Serving
2024-07-11 15:45:27 +08:00
Xu, Shuo
1355b2ce06
Add model Qwen-VL-Chat to iGPU-perf (#11558)
* Add model Qwen-VL-Chat to iGPU-perf

* small fix

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-11 15:39:02 +08:00
Zhao Changmin
105e124752
optimize phi3-v encoder npu performance and add multimodal example (#11553)
* phi3-v

* readme
2024-07-11 13:59:14 +08:00
Cengguang Zhang
70ab1a6f1a
LLM: unify memory optimization env variables. (#11549)
* LLM: unify memory optimization env variables.

* fix comments.
2024-07-11 11:01:28 +08:00
Xu, Shuo
028ad4f63c
Add model phi-3-vision-128k-instruct to iGPU-perf benchmark (#11554)
* try to improve MiniCPM performance

* Add model phi-3-vision-128k-instruct to iGPU-perf benchmark

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-10 17:26:30 +08:00
Yishuo Wang
994e49a510
optimize internlm xcomposser performance again (#11551) 2024-07-10 17:08:56 +08:00
Xu, Shuo
61613b210c
try to improve MiniCPM performance (#11552)
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-10 16:58:23 +08:00
Yishuo Wang
82f9514303
optimize internlm xcomposer2 performance (#11550) 2024-07-10 15:57:04 +08:00
Zhao Changmin
3c16c9f725
Optimize baichuan on NPU (#11548)
* baichuan_npu
2024-07-10 13:18:48 +08:00
Yuwen Hu
8982ab73d5
Add Yi-6B and StableLM to iGPU perf test (#11546)
* Add transformer4.38.2 test to igpu benchmark (#11529)

* add transformer4.38.1 test to igpu benchmark

* use transformers4.38.2 & fix csv name error in 4.38 workflow

* add model Yi-6B-Chat & remove temporarily most models

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>

* filter some errorlevel (#11541)

Co-authored-by: ATMxsp01 <shou.xu@intel.com>

* Restore the temporarily removed models in iGPU-perf (#11544)

* filter some errorlevel

* restore the temporarily removed models in iGPU-perf

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>

---------

Co-authored-by: Xu, Shuo <100334393+ATMxsp01@users.noreply.github.com>
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-09 18:51:23 +08:00
Yishuo Wang
7dc6756d86
add disk embedding (#11543) 2024-07-09 17:38:40 +08:00
Zhao Changmin
76a5802acf
update NPU examples (#11540)
* update NPU examples
2024-07-09 17:19:42 +08:00
Yishuo Wang
99b2802d3b
optimize qwen2 memory (#11535) 2024-07-09 17:14:01 +08:00
Yishuo Wang
2929eb262e
support npu glm4 (#11539) 2024-07-09 15:46:49 +08:00
Xiangyu Tian
a1cede926d
Fix update_kv_cache in Pipeline-Parallel-Serving for glm4-9b model (#11537) 2024-07-09 14:08:04 +08:00
Cengguang Zhang
fa81dbefd3
LLM: update multi gpu write csv in all-in-one benchmark. (#11538) 2024-07-09 11:14:17 +08:00
Xin Qiu
69701b3ec8
fix typo in python/llm/scripts/README.md (#11536) 2024-07-09 09:53:14 +08:00
Jason Dai
099486afb7
Update README.md (#11530) 2024-07-08 20:18:41 +08:00
binbin Deng
66f6ffe4b2
Update GPU HF-Transformers example structure (#11526) 2024-07-08 17:58:06 +08:00
Xu, Shuo
f9a199900d
add model RWKV/v5-Eagle-7B-HF to igpu benchmark (#11528)
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-08 15:50:16 +08:00
Shaojun Liu
9b37ca6027
remove (#11527) 2024-07-08 15:49:52 +08:00
Yishuo Wang
c26651f91f
add mistral npu support (#11523) 2024-07-08 13:17:15 +08:00
Jun Wang
5a57e54400
[ADD] add 5 new models for igpu-perf (#11524) 2024-07-08 11:12:15 +08:00
Xu, Shuo
64cfed602d
Add new models to benchmark (#11505)
* Add new models to benchmark

* remove Qwen/Qwen-VL-Chat to pass the validation

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-08 10:35:55 +08:00
binbin Deng
252426793b
Fix setting of use_quantize_kv_cache on different GPU in pipeline parallel (#11516) 2024-07-08 09:27:01 +08:00
Yishuo Wang
7cb09a8eac
optimize qwen2 memory usage again (#11520) 2024-07-05 17:32:34 +08:00
Yuwen Hu
8f376e5192
Change igpu perf to mainly test int4+fp16 (#11513) 2024-07-05 17:12:33 +08:00
Jun Wang
1efb6ebe93
[ADD] add transformer_int4_fp16_loadlowbit_gpu_win api (#11511)
* [ADD] add transformer_int4_fp16_loadlowbit_gpu_win api

* [UPDATE] add int4_fp16_lowbit config and description

* [FIX] fix run.py mistake

* [FIX] fix run.py mistake

* [FIX] fix indent; change dtype=float16 to model.half()
2024-07-05 16:38:41 +08:00
Zhao Changmin
f7e957aaf9
Clean npu dtype branch (#11515)
* clean branch

* create_npu_kernels
2024-07-05 15:45:26 +08:00
Yishuo Wang
14ce058004
add chatglm3 npu support (#11518) 2024-07-05 15:31:27 +08:00
Xin Qiu
a31f2cbe13
update minicpm.py (#11517)
* update minicpm

* meet code review
2024-07-05 15:25:44 +08:00
Zhao Changmin
24de13fc45
Optimize stablelm on NPU (#11512)
* stablelm_optimize
2024-07-05 14:21:57 +08:00
Xiangyu Tian
7d8bc83415
LLM: Partial Prefilling for Pipeline Parallel Serving (#11457)
LLM: Partial Prefilling for Pipeline Parallel Serving
2024-07-05 13:10:35 +08:00
binbin Deng
60de428b37
Support pipeline parallel for qwen-vl (#11503) 2024-07-04 18:03:57 +08:00
Zhao Changmin
57b8adb189
[WIP] Support npu load_low_bit method (#11502)
* npu_load_low_bit
2024-07-04 17:15:34 +08:00
Jun Wang
f07937945f
[REMOVE] remove all useless repo-id in benchmark/igpu-perf (#11508) 2024-07-04 16:38:34 +08:00
Yishuo Wang
1a8bab172e
add minicpm 1B/2B npu support (#11507) 2024-07-04 16:31:04 +08:00
Yishuo Wang
bb0a84044b
add qwen2 npu support (#11504) 2024-07-04 11:01:25 +08:00
Xin Qiu
f84ca99b9f
optimize gemma2 rmsnorm (#11500) 2024-07-03 15:21:03 +08:00
Wang, Jian4
61c36ba085
Add pp_serving verified models (#11498)
* add verified models

* update

* verify large model

* update commend
2024-07-03 14:57:09 +08:00
binbin Deng
9274282ef7
Support pipeline parallel for glm-4-9b-chat (#11463) 2024-07-03 14:25:28 +08:00
Yishuo Wang
d97c2664ce
use new fuse rope in stablelm family (#11497) 2024-07-03 11:08:26 +08:00
Xu, Shuo
52519e07df
remove models we no longer need in benchmark. (#11492)
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-07-02 17:20:48 +08:00
Zhao Changmin
6a0134a9b2
support q4_0_rtn (#11477)
* q4_0_rtn
2024-07-02 16:57:02 +08:00
Yishuo Wang
5e967205ac
remove the code converts input to fp16 before calling batch forward kernel (#11489) 2024-07-02 16:23:53 +08:00
Wang, Jian4
4390e7dc49
Fix codegeex2 transformers version (#11487) 2024-07-02 15:09:28 +08:00
Yishuo Wang
ec3a912ab6
optimize npu llama long context performance (#11478) 2024-07-01 16:49:23 +08:00
Heyang Sun
913e750b01
fix non-string deepspeed config path bug (#11476)
* fix non-string deepspeed config path bug

* Update lora_finetune_chatglm.py
2024-07-01 15:53:50 +08:00
binbin Deng
48ad482d3d
Fix import error caused by pydantic on cpu (#11474) 2024-07-01 15:49:49 +08:00
Yishuo Wang
39bcb33a67
add sdp support for stablelm 3b (#11473) 2024-07-01 14:56:15 +08:00
Zhao Changmin
cf8eb7b128
Init NPU quantize method and support q8_0_rtn (#11452)
* q8_0_rtn

* fix float point
2024-07-01 13:45:07 +08:00
Yishuo Wang
319a3b36b2
fix npu llama2 (#11471) 2024-07-01 10:14:11 +08:00
Heyang Sun
07362ffffc
ChatGLM3-6B LoRA Fine-tuning Demo (#11450)
* ChatGLM3-6B LoRA Fine-tuning Demo

* refine

* refine

* add 2-card deepspeed

* refine format

* add mpi4py and deepspeed install
2024-07-01 09:18:39 +08:00
Xiangyu Tian
fd933c92d8
Fix: Correct num_requests in benchmark for Pipeline Parallel Serving (#11462) 2024-06-28 16:10:51 +08:00
SONG Ge
a414e3ff8a
add pipeline parallel support with load_low_bit (#11414) 2024-06-28 10:17:56 +08:00
Cengguang Zhang
d0b801d7bc
LLM: change write mode in all-in-one benchmark. (#11444)
* LLM: change write mode in all-in-one benchmark.

* update output style.
2024-06-27 19:36:38 +08:00
binbin Deng
987017ef47
Update pipeline parallel serving for more model support (#11428) 2024-06-27 18:21:01 +08:00
Yishuo Wang
029ff15d28
optimize npu llama2 first token performance (#11451) 2024-06-27 17:37:33 +08:00
Qiyuan Gong
4e4ecd5095
Control sys.modules ipex duplicate check with BIGDL_CHECK_DUPLICATE_IMPORT (#11453)
* Control sys.modules ipex duplicate check with BIGDL_CHECK_DUPLICATE_IMPORT.
2024-06-27 17:21:45 +08:00
Yishuo Wang
c6e5ad668d
fix internlm xcomposser meta-instruction typo (#11448) 2024-06-27 15:29:43 +08:00
Yishuo Wang
f89ca23748
optimize npu llama2 perf again (#11445) 2024-06-27 15:13:42 +08:00
Yishuo Wang
cf0f5c4322
change npu document (#11446) 2024-06-27 13:59:59 +08:00
binbin Deng
508c364a79
Add precision option in PP inference examples (#11440) 2024-06-27 09:24:27 +08:00
Yishuo Wang
2a0f8087e3
optimize qwen2 gpu memory usage again (#11435) 2024-06-26 16:52:29 +08:00
Shaojun Liu
ab9f7f3ac5
FIX: Qwen1.5-GPTQ-Int4 inference error (#11432)
* merge_qkv if quant_method is 'gptq'

* fix python style checks

* refactor

* update GPU example
2024-06-26 15:36:22 +08:00
Guancheng Fu
99cd16ef9f
Fix error while using pipeline parallism (#11434) 2024-06-26 15:33:47 +08:00
Jiao Wang
40fa23560e
Fix LLAVA example on CPU (#11271)
* update

* update

* update

* update
2024-06-25 20:04:59 -07:00
Yishuo Wang
ca0e69c3a7
optimize npu llama perf again (#11431) 2024-06-26 10:52:54 +08:00
Yishuo Wang
9f6e5b4fba
optimize llama npu perf (#11426) 2024-06-25 17:43:20 +08:00
binbin Deng
e473b8d946
Add more qwen1.5 and qwen2 support for pipeline parallel inference (#11423) 2024-06-25 15:49:32 +08:00
binbin Deng
aacc1fd8c0
Fix shape error when run qwen1.5-14b using deepspeed autotp (#11420) 2024-06-25 13:48:37 +08:00
Yishuo Wang
3b23de684a
update npu examples (#11422) 2024-06-25 13:32:53 +08:00
Xiangyu Tian
8ddae22cfb
LLM: Refactor Pipeline-Parallel-FastAPI example (#11319)
Initial refactor of Pipeline-Parallel-FastAPI example
2024-06-25 13:30:36 +08:00
SONG Ge
34c15d3a10
update pp document (#11421) 2024-06-25 10:17:20 +08:00
Xin Qiu
9e4ee61737
rename BIGDL_OPTIMIZE_LM_HEAD to IPEX_LLM_LAST_LM_HEAD and add qwen2 (#11418) 2024-06-24 18:42:37 +08:00
Heyang Sun
c985912ee3
Add Deepspeed LoRA dependencies in document (#11410) 2024-06-24 15:29:59 +08:00
Yishuo Wang
abe53eaa4f
optimize qwen1.5/2 memory usage when running long input with fp16 (#11403) 2024-06-24 13:43:04 +08:00
Guoqiong Song
7507000ef2
Fix 1383 Llama model on transformers=4.41[WIP] (#11280) 2024-06-21 11:24:10 -07:00
SONG Ge
0c67639539
Add more examples for pipeline parallel inference (#11372)
* add more model examples for pipeline parallel inference

* add mixtral and vicuna models

* add yi model and past_kv support for chatglm family

* add docs

* doc update

* add license

* update
2024-06-21 17:55:16 +08:00
Xiangyu Tian
b30bf7648e
Fix vLLM CPU api_server params (#11384) 2024-06-21 13:00:06 +08:00
ivy-lv11
21fc781fce
Add GLM-4V example (#11343)
* add example

* modify

* modify

* add line

* add

* add link and replace with phi-3-vision template

* fix generate options

* fix

* fix

---------

Co-authored-by: jinbridge <2635480475@qq.com>
2024-06-21 12:54:31 +08:00
binbin Deng
4ba82191f2
Support PP inference for chatglm3 (#11375) 2024-06-21 09:59:01 +08:00
Yishuo Wang
f0fdfa081b
Optimize qwen 1.5 14B batch performance (#11370) 2024-06-20 17:23:39 +08:00
Wenjing Margaret Mao
c0e86c523a
Add qwen-moe batch1 to nightly perf (#11369)
* add moe

* reduce 437 models

* rename

* fix syntax

* add moe check result

* add 430 + 437

* all modes

* 4-37-4 exclud

* revert & comment

---------

Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-06-20 14:17:41 +08:00
Yishuo Wang
a5e7d93242
Add initial save/load low bit support for NPU(now only fp16 is supported) (#11359) 2024-06-20 10:49:39 +08:00
RyuKosei
05a8d051f6
Fix run.py run_ipex_fp16_gpu (#11361)
* fix a bug on run.py

* Update run.py

fixed the format problem

---------

Co-authored-by: sgwhat <ge.song@intel.com>
2024-06-20 10:29:32 +08:00
Wenjing Margaret Mao
b2f62a8561
Add batch 4 perf test (#11355)
* copy files to this branch

* add tasks

* comment one model

* change the model to test the 4.36

* only test batch-4

* typo

* typo

* typo

* typo

* typo

* typo

* add 4.37-batch4

* change the file name

* revert yaml file

* no print

* add batch4 task

* revert

---------

Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-06-20 09:48:52 +08:00
Zijie Li
ae452688c2
Add NPU HF example (#11358) 2024-06-19 18:07:28 +08:00
Qiyuan Gong
1eb884a249
IPEX Duplicate importer V2 (#11310)
* Add gguf support.
* Avoid error when import ipex-llm for multiple times.
* Add check to avoid duplicate replace and revert.
* Add calling from check to avoid raising exceptions in the submodule.
* Add BIGDL_CHECK_DUPLICATE_IMPORT for controlling duplicate checker. Default is true.
2024-06-19 16:29:19 +08:00
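The duplicate-importer commit above gates a sys.modules check behind the BIGDL_CHECK_DUPLICATE_IMPORT environment variable (default: enabled). A hedged sketch of that switch — the accepted values and module name are assumptions, not ipex-llm's exact logic:

```python
import os
import sys

def should_check_duplicate_import() -> bool:
    """Read the BIGDL_CHECK_DUPLICATE_IMPORT switch; enabled by default."""
    return os.environ.get("BIGDL_CHECK_DUPLICATE_IMPORT", "1") not in ("0", "false", "False")

def ipex_already_imported() -> bool:
    """Duplicate check: has intel_extension_for_pytorch been loaded already?

    When the switch is on, consulting sys.modules avoids replacing or
    reverting patches twice on repeated ipex-llm imports.
    """
    return should_check_duplicate_import() and "intel_extension_for_pytorch" in sys.modules
```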
Yishuo Wang
ae7b662ed2
add fp16 NPU Linear support and fix intel_npu_acceleration_library version 1.0 support (#11352) 2024-06-19 09:14:59 +08:00
Guoqiong Song
c44b1942ed
fix mistral for transformers>=4.39 (#11191)
* fix mistral for transformers>=4.39
2024-06-18 13:39:35 -07:00
Heyang Sun
67a1e05876
Remove zero3 context manager from LoRA (#11346) 2024-06-18 17:24:43 +08:00
Yishuo Wang
83082e5cc7
add initial support for intel npu acceleration library (#11347) 2024-06-18 16:07:16 +08:00
Shaojun Liu
694912698e
Upgrade scikit-learn to 1.5.0 to fix dependabot issue (#11349) 2024-06-18 15:47:25 +08:00
hxsz1997
44f22cba70
add config and default value (#11344)
* add config and default value

* add config in taml

* remove lookahead and max_matching_ngram_size in config

* remove streaming and use_fp16_torch_dtype in test yaml

* update task in readme

* update commit of task
2024-06-18 15:28:57 +08:00
Heyang Sun
00f322d8ee
Finetune ChatGLM with Deepspeed Zero3 LoRA (#11314)
* Finetune ChatGLM with Deepspeed Zero3 LoRA

* add deepspeed zero3 config

* rename config

* remove offload_param

* add save_checkpoint parameter

* Update lora_deepspeed_zero3_finetune_chatglm3_6b_arc_2_card.sh

* refine
2024-06-18 12:31:26 +08:00
Yina Chen
5dad33e5af
Support fp8_e4m3 scale search (#11339)
* fp8e4m3 switch off

* fix style
2024-06-18 11:47:43 +08:00
binbin Deng
e50c890e1f
Support finishing PP inference once eos_token_id is found (#11336) 2024-06-18 09:55:40 +08:00
Qiyuan Gong
de4bb97b4f
Remove accelerate 0.23.0 install command in readme and docker (#11333)
* ipex-llm's accelerate has been upgraded to 0.23.0. Remove the accelerate 0.23.0 install command in README and docker.
2024-06-17 17:52:12 +08:00
SONG Ge
ef4b6519fb
Add phi-3 model support for pipeline parallel inference (#11334)
* add phi-3 model support

* add phi3 example
2024-06-17 17:44:24 +08:00
hxsz1997
99b309928b
Add lookahead in test_api: transformer_int4_fp16_gpu (#11337)
* add lookahead in test_api:transformer_int4_fp16_gpu

* change the short prompt of summarize

* change short prompt to cnn_64

* change short prompt of summarize
2024-06-17 17:41:41 +08:00
Qiyuan Gong
5d7c9bf901
Upgrade accelerate to 0.23.0 (#11331)
* Upgrade accelerate to 0.23.0
2024-06-17 15:03:11 +08:00
Xin Qiu
183e0c6cf5
glm-4v-9b support (#11327)
* chatglm4v support

* fix style check

* update glm4v
2024-06-17 13:52:37 +08:00
Wenjing Margaret Mao
bca5cbd96c
Modify arc nightly perf to fp16 (#11275)
* change api

* move to pr mode and remove the build

* add batch4 yaml and remove the bigcode

* remove batch4

* revert the starcoder

* remove the exclude

* revert

---------

Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-06-17 13:47:22 +08:00
binbin Deng
6ea1e71af0
Update PP inference benchmark script (#11323) 2024-06-17 09:59:36 +08:00
SONG Ge
be00380f1a
Fix pipeline parallel inference past_key_value error in Baichuan (#11318)
* fix past_key_value error

* add baichuan2 example

* fix style

* update doc

* add script link in doc

* fix import error

* update
2024-06-17 09:29:32 +08:00
Yina Chen
0af0102e61
Add quantization scale search switch (#11326)
* add scale_search switch

* remove llama3 instruct

* remove print
2024-06-14 18:46:52 +08:00
Ruonan Wang
8a3247ac71
support batch forward for q4_k, q6_k (#11325) 2024-06-14 18:25:50 +08:00
Yishuo Wang
e8dd8e97ef
fix chatglm lookahead on ARC (#11320) 2024-06-14 16:26:11 +08:00
Shaojun Liu
f5ef94046e
exclude dolly-v2-12b for arc perf test (#11315)
* test arc perf

* test

* test

* exclude dolly-v2-12b:2048

* revert changes
2024-06-14 15:35:56 +08:00
Xiangyu Tian
4359ab3172
LLM: Add /generate_stream endpoint for Pipeline-Parallel-FastAPI example (#11187)
Add /generate_stream and OpenAI-formatted endpoint for Pipeline-Parallel-FastAPI example
2024-06-14 15:15:32 +08:00
Jin Qiao
0e7a31a09c
ChatGLM Examples Restructure regarding Installation Steps (#11285)
* merge install step in glm examples

* fix section

* fix section

* fix tiktoken
2024-06-14 12:37:05 +08:00
Yishuo Wang
91965b5d05
add glm_sdpa back to fix chatglm-6b (#11313) 2024-06-14 10:31:43 +08:00
Yishuo Wang
7f65836cb9
fix chatglm2/3-32k/128k fp16 (#11311) 2024-06-14 09:58:07 +08:00
Xin Qiu
1b0c4c8cb8
use new rotary two in chatglm4 (#11312)
* use new rotary two in chatglm4

* remove
2024-06-13 19:02:18 +08:00
Xin Qiu
f1410d6823
refactor chatglm4 (#11301)
* glm4

* remove useless code

* style

* add rope_ratio

* update

* fix fp16

* fix style
2024-06-13 18:06:04 +08:00
Yishuo Wang
5e25766855
fix and optimize chatglm2-32k and chatglm3-128k (#11306) 2024-06-13 17:37:58 +08:00
binbin Deng
60cb1dac7c
Support PP for qwen1.5 (#11300) 2024-06-13 17:35:24 +08:00
binbin Deng
f97cce2642
Fix import error of ds autotp (#11307) 2024-06-13 16:22:52 +08:00
Jin Qiao
3682c6a979
add glm4 and qwen2 to igpu perf (#11304) 2024-06-13 16:16:35 +08:00
Yishuo Wang
a24666b8f3
fix chatglm3-6b-32k (#11303) 2024-06-13 16:01:34 +08:00
Yishuo Wang
01fe0fc1a2
refactor chatglm2/3 (#11290) 2024-06-13 12:22:58 +08:00
Guancheng Fu
57a023aadc
Fix vllm tp (#11297) 2024-06-13 10:47:48 +08:00
Ruonan Wang
986af21896
fix perf test(#11295) 2024-06-13 10:35:48 +08:00
binbin Deng
220151e2a1
Refactor pipeline parallel multi-stage implementation (#11286) 2024-06-13 10:00:23 +08:00
Ruonan Wang
14b1e6b699
Fix gguf_q4k (#11293)
* update embedding parameter

* update benchmark
2024-06-12 20:43:08 +08:00
Yuwen Hu
8edcdeb0e7
Fix bug that torch.ops.torch_ipex.matmul_bias_out cannot work on Linux MTL for short input (#11292) 2024-06-12 19:12:57 +08:00
Wenjing Margaret Mao
b61f6e3ab1
Add update_parent_folder for nightly_perf_test (#11287)
* add update_parent_folder and change the workflow file

* add update_parent_folder and change the workflow file

* move to pr mode and comment the test

* use one model per config

* revert

---------

Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-06-12 17:58:13 +08:00
Xin Qiu
592f7aa61e
Refine glm1-4 sdp (#11276)
* chatglm

* update

* update

* change chatglm

* update sdpa

* update

* fix style

* fix

* fix glm

* update glm2-32k

* update glm2-32k

* fix cpu

* update

* change lower_bound
2024-06-12 17:11:56 +08:00
Yuwen Hu
cffb932f05
Expose timeout for streamer for fastchat worker (#11288)
* Expose timeout for streamer for fastchat worker

* Change to read from env variables
2024-06-12 17:02:40 +08:00
ivy-lv11
e7a4e2296f
Add Stable Diffusion examples on GPU and CPU (#11166)
* add sdxl and lcm-lora

* readme

* modify

* add cpu

* add license

* modify

* add file
2024-06-12 16:33:25 +08:00
Jin Qiao
f224e98297
Add GLM-4 CPU example (#11223)
* Add GLM-4 example

* add tiktoken dependency

* fix

* fix
2024-06-12 15:30:51 +08:00
Zijie Li
40fc8704c4
Add GPU example for GLM-4 (#11267)
* Add GPU example for GLM-4

* Update streamchat.py

* Fix pretrained arguments

Fix pretrained arguments in generate and streamchat.py

* Update Readme

Update install tiktoken required for GLM-4

* Update comments in generate.py
2024-06-12 14:29:50 +08:00
Qiyuan Gong
0d9cc9c106
Remove duplicate check for ipex (#11281)
* Replacing builtin.import is causing lots of unpredicted problems. Remove this function.
2024-06-12 13:52:02 +08:00
Yishuo Wang
10e480ee96
refactor internlm and internlm2 (#11274) 2024-06-11 14:19:19 +08:00
Yuwen Hu
fac49f15e3
Remove manual importing ipex in all-in-one benchmark (#11272) 2024-06-11 09:32:13 +08:00
Wenjing Margaret Mao
70b17c87be
Merge multiple batches (#11264)
* add merge steps

* move to pr mode

* remove build + add merge.py

* add tohtml and change cp

* change test_batch folder path

* change merge_temp path

* change to html folder

* revert

* change place

* revert 437

* revert space

---------

Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-06-07 18:38:45 +08:00
Xiangyu Tian
4b07712fd8
LLM: Fix vLLM CPU model convert mismatch (#11254)
Fix vLLM CPU model convert mismatch.
2024-06-07 15:54:34 +08:00
Yishuo Wang
42fab480ea
support stablm2 12b (#11265) 2024-06-07 15:46:00 +08:00
Xin Qiu
dbc3c2d72d
glm4 sdp (#11253)
* glm4 sdp

* fix style

* update comment
2024-06-07 15:42:23 +08:00
Xin Qiu
151fcf37bb
check device name in use_flash_attention (#11263) 2024-06-07 15:07:47 +08:00
Yishuo Wang
2623944604
qwen2 sdpa small fix (#11261) 2024-06-07 14:42:18 +08:00
Yishuo Wang
ea0d03fd28
Refactor baichuan1 7B and 13B (#11258) 2024-06-07 14:29:20 +08:00
Qiyuan Gong
1aa9c9597a
Avoid duplicate import in IPEX auto importer (#11227)
* Add custom import to avoid ipex duplicate importing
* Add scope limitation
2024-06-07 14:08:00 +08:00
Wang, Jian4
6f2684e5c9
Update pp llama.py to save memory (#11233) 2024-06-07 13:18:16 +08:00
Yishuo Wang
ef8e9b2ecd
Refactor qwen2 moe (#11244) 2024-06-07 13:14:54 +08:00
Zijie Li
7b753dc8ca
Update sample output for HF Qwen2 GPU and CPU (#11257) 2024-06-07 11:36:22 +08:00
Zhao Changmin
b7948671de
[WIP] Add look up table in 1st token stage (#11193)
* lookuptb
2024-06-07 10:51:05 +08:00
Yuwen Hu
8c36b5bdde
Add qwen2 example (#11252)
* Add GPU example for Qwen2

* Update comments in README

* Update README for Qwen2 GPU example

* Add CPU example for Qwen2

Sample Output under README pending

* Update generate.py and README for CPU Qwen2

* Update GPU example for Qwen2

* Small update

* Small fix

* Add Qwen2 table

* Update README for Qwen2 CPU and GPU

Update sample output under README

---------

Co-authored-by: Zijie Li <michael20001122@gmail.com>
2024-06-07 10:29:33 +08:00
Shaojun Liu
85df5e7699
fix nightly perf test (#11251) 2024-06-07 09:33:14 +08:00
Xin Qiu
2f809116e2
optimize Chatglm4 (#11239)
* chatglm4

* update

* update

* add rms norm

* chatglm4
2024-06-06 18:25:20 +08:00
hxsz1997
b6234eb4e2
Add task in allinone (#11226)
* add task

* update prompt

* modify typos

* add more cases in summarize

* Make the summarize & QA prompt preprocessing as a util function
2024-06-06 17:22:40 +08:00
Yishuo Wang
2e4ccd541c
fix qwen2 cpu (#11240) 2024-06-06 16:24:19 +08:00
Yishuo Wang
e738ec38f4
disable quantize kv in specific qwen model (#11238) 2024-06-06 14:08:39 +08:00
Yishuo Wang
c4e5806e01
add latest optimization in starcoder2 (#11236) 2024-06-06 14:02:17 +08:00
Yishuo Wang
ba27e750b1
refactor yuan2 (#11235) 2024-06-06 13:17:54 +08:00
Shaojun Liu
6be24fdd28
OSPDT: add tpp licenses (#11165)
* add tpp licenses

* add licenses

* add licenses

* delete mitchellh-mapstructure license

* delete stb-image public domain license

* add README.md

* remove core-xe related licenses
2024-06-06 10:59:06 +08:00
Guoqiong Song
09c6780d0c
phi-2 transformers 4.37 (#11161)
* phi-2 transformers 4.37
2024-06-05 13:36:41 -07:00
Guoqiong Song
f6d5c6af78
fix issue 1407 (#11171) 2024-06-05 13:35:57 -07:00
Zijie Li
bfa1367149
Add CPU and GPU example for MiniCPM (#11202)
* Change installation address

Change former address: "https://docs.conda.io/en/latest/miniconda.html#" to new address: "https://conda-forge.org/download/" for 63 occurrences under python\llm\example

* Change Prompt

Change "Anaconda Prompt" to "Miniforge Prompt" for 1 occurrence

* Create and update model minicpm

* Update model minicpm

Update model minicpm under GPU/PyTorch-Models

* Update readme and generate.py

change "prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False)" and delete "pip install transformers==4.37.0"

* Update comments for minicpm GPU

Update comments for generate.py at minicpm GPU

* Add CPU example for MiniCPM

* Update minicpm README for CPU

* Update README for MiniCPM and Llama3

* Update Readme for Llama3 CPU Pytorch

* Update and fix comments for MiniCPM
2024-06-05 18:09:53 +08:00
Yuwen Hu
af96579c76
Update installation guide for pipeline parallel inference (#11224)
* Update installation guide for pipeline parallel inference

* Small fix

* further fix

* Small fix

* Small fix

* Update based on comments

* Small fix

* Small fix

* Small fix
2024-06-05 17:54:29 +08:00
Yina Chen
ed67435491
Support Fp6 k in ipex-llm (#11222)
* support fp6_k

* support fp6_k

* remove

* fix style
2024-06-05 17:34:36 +08:00
binbin Deng
a6674f5bce
Fix should_use_fuse_rope error of Qwen1.5-MoE-A2.7B-Chat (#11216) 2024-06-05 15:56:10 +08:00
Wenjing Margaret Mao
231b968aba
Modify the check_results.py to support batch 2&4 (#11133)
* add batch 2&4 and exclude to perf_test

* modify the perf-test&437 yaml

* modify llm_performance_test.yml

* remove batch 4

* modify check_results.py to support batch 2&4

* change the batch_size format

* remove genxir

* add str(batch_size)

* change actual_test_casese in check_results file to support batch_size

* change html highlight

* less models to test html and html_path

* delete the moe model

* split batch html

* split

* use installing from pypi

* use installing from pypi - batch2

* revert cpp

* revert cpp

* merge two jobs into one, test batch_size in one job

* merge two jobs into one, test batch_size in one job

* change file directory in workflow

* try catch deal with odd file without batch_size

* modify pandas version

* change the dir

* organize the code

* organize the code

* remove Qwen-MOE

* modify based on feedback

* modify based on feedback

* modify based on second round of feedback

* modify based on second round of feedback + change run-arc.sh mode

* modify based on second round of feedback + revert config

* modify based on second round of feedback + revert config

* modify based on second round of feedback + remove comments

* modify based on second round of feedback + remove comments

* modify based on second round of feedback + revert arc-perf-test

* modify based on third round of feedback

* change error type

* change error type

* modify check_results.html

* split batch into two folders

* add all models

* move csv_name

* revert pr test

* revert pr test

---------

Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-06-05 15:04:55 +08:00
Xin Qiu
566691c5a3
quantized attention forward for minicpm (#11200)
* quantized minicpm

* fix style check
2024-06-05 09:15:25 +08:00
Jiao Wang
bb83bc23fd
Fix Starcoder issue on CPU on transformers 4.36+ (#11190)
* fix starcoder for sdpa

* update

* style
2024-06-04 10:05:40 -07:00
Kai Huang
f93664147c
Update config.yaml (#11208)
* update config.yaml

* fix

* minor

* style
2024-06-04 19:58:18 +08:00
Xiangyu Tian
ac3d53ff5d
LLM: Fix vLLM CPU version error (#11206)
Fix vLLM CPU version error
2024-06-04 19:10:23 +08:00
Ruonan Wang
1dde204775
update q6k (#11205) 2024-06-04 17:14:33 +08:00
Qiyuan Gong
ce3f08b25a
Fix IPEX auto importer (#11192)
* Fix ipex auto importer with Python builtins.
* Raise errors if the user imports ipex manually before importing ipex_llm. Do nothing if they import ipex after importing ipex_llm.
* Remove import ipex in examples.
2024-06-04 16:57:18 +08:00
Yina Chen
711fa0199e
Fix fp6k phi3 ppl core dump (#11204) 2024-06-04 16:44:27 +08:00
Xiangyu Tian
f02f097002
Fix vLLM version in CPU/vLLM-Serving example README (#11201) 2024-06-04 15:56:55 +08:00
Yishuo Wang
6454655dcc
use sdp in baichuan2 13b (#11198) 2024-06-04 15:39:00 +08:00
Yishuo Wang
d90cd977d0
refactor stablelm (#11195) 2024-06-04 13:14:43 +08:00
Zijie Li
a644e9409b
Miniconda/Anaconda -> Miniforge update in examples (#11194)
* Change installation address

Change former address: "https://docs.conda.io/en/latest/miniconda.html#" to new address: "https://conda-forge.org/download/" for 63 occurrences under python\llm\example

* Change Prompt

Change "Anaconda Prompt" to "Miniforge Prompt" for 1 occurrence
2024-06-04 10:14:02 +08:00
Xin Qiu
5f13700c9f
optimize Minicpm (#11189)
* minicpm optimize

* update
2024-06-03 18:28:29 +08:00
Qiyuan Gong
15a6205790
Fix LoRA tokenizer for Llama and chatglm (#11186)
* Set pad_token to eos_token if it's None. Otherwise, use model config.
2024-06-03 15:35:38 +08:00
Cengguang Zhang
3eb13ccd8c
LLM: fix input length condition in deepspeed all-in-one benchmark. (#11185) 2024-06-03 10:05:43 +08:00
Shaojun Liu
401013a630
Remove chatglm_C Module to Eliminate LGPL Dependency (#11178)
* remove chatglm_C.**.pyd to solve ngsolve weak copyright vuln

* fix style check error

* remove chatglm native int4 from langchain
2024-05-31 17:03:11 +08:00
Ruonan Wang
50b5f4476f
update q4k convert (#11179) 2024-05-31 11:36:53 +08:00
Wang, Jian4
c0f1be6aea
Fix pp logic (#11175)
* only send no none batch and rank1-n sending first

* always send first
2024-05-30 16:40:59 +08:00
ZehuaCao
4127b99ed6
Fix null pointer dereferences error. (#11125)
* delete unused function on tgi_server

* update

* update

* fix style
2024-05-30 16:16:10 +08:00
Guancheng Fu
50ee004ac7
Fix vllm condition (#11169)
* add use-vllm

* done

* fix style

* fix done
2024-05-30 15:23:17 +08:00
Jin Qiao
dcbf4d3d0a
Add phi-3-vision example (#11156)
* Add phi-3-vision example (HF-Automodels)

* fix

* fix

* fix

* Add phi-3-vision CPU example (HF-Automodels)

* add in readme

* fix

* fix

* fix

* fix

* use fp8 for gpu example

* remove eval
2024-05-30 10:02:47 +08:00
Jiao Wang
93146b9433
Reconstruct Speculative Decoding example directory (#11136)
* update

* update

* update
2024-05-29 13:15:27 -07:00
Xiangyu Tian
2299698b45
Refine Pipeline Parallel FastAPI example (#11168) 2024-05-29 17:16:50 +08:00
Ruonan Wang
9bfbf78bf4
update api usage of xe_batch & fp16 (#11164)
* update api usage

* update setup.py
2024-05-29 15:15:14 +08:00
Yina Chen
e29e2f1c78
Support new fp8 e4m3 (#11158) 2024-05-29 14:27:14 +08:00
Wang, Jian4
8e25de1126
LLM: Add codegeex2 example (#11143)
* add codegeex example

* update

* update cpu

* add GPU

* add gpu

* update readme
2024-05-29 10:00:26 +08:00
ZehuaCao
751e1a4e29
Fix concurrent issue in autoTP streming. (#11150)
* add benchmark test

* update
2024-05-29 08:22:38 +08:00
Yishuo Wang
bc5008f0d5
disable sdp_causal in phi-3 to fix overflow (#11157) 2024-05-28 17:25:53 +08:00
SONG Ge
33852bd23e
Refactor pipeline parallel device config (#11149)
* refactor pipeline parallel device config

* meet comments

* update example

* add warnings and update code doc
2024-05-28 16:52:46 +08:00
hxsz1997
62b2d8af6b
Add lookahead in all-in-one (#11142)
* add lookahead in allinone

* delete save to csv in run_transformer_int4_gpu

* change lookup to lookahead

* fix the error of add model.peak_memory

* Set transformer_int4_gpu as the default option

* add comment of transformer_int4_fp16_lookahead_gpu
2024-05-28 15:39:58 +08:00
Xiangyu Tian
b44cf405e2
Refine Pipeline-Parallel-Fastapi example README (#11155) 2024-05-28 15:18:21 +08:00
Yishuo Wang
d307622797
fix first token sdp with batch (#11153) 2024-05-28 15:03:06 +08:00
Yina Chen
3464440839
fix qwen import error (#11154) 2024-05-28 14:50:12 +08:00
Jin Qiao
25b6402315
Add Windows GPU unit test (#11050)
* Add Windows GPU UT

* temporarily remove ut on arc

* retry

* retry

* retry

* fix

* retry

* retry

* fix

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* fix

* retry

* retry

* retry

* retry

* retry

* retry

* merge into single workflow

* retry inference test

* retry

* retrigger

* try to fix inference test

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* retry

* check lower_bound

* retry

* retry

* try example test

* try fix example test

* retry

* fix

* separate function into shell script

* remove cygpath

* try remove all cygpath

* retry

* retry

* Revert "try remove all cygpath"

This reverts commit 7ceeff3e48f08429062ecef548c1a3ad3488756f.

* Revert "retry"

This reverts commit 40ea2457843bff6991b8db24316cde5de1d35418.

* Revert "retry"

This reverts commit 817d0db3e5aec3bd449d3deaf4fb01d3ecfdc8a3.

* enable ut

* fix

* retrigger

* retrigger

* update download url

* fix

* fix

* retry

* add comment

* fix
2024-05-28 13:29:47 +08:00
Yina Chen
b6b70d1ba0
Divide core-xe packages (#11131)
* temp

* add batch

* fix style

* update package name

* fix style

* add workflow

* use temp version to run uts

* trigger performance test

* trigger win igpu perf

* revert workflow & setup
2024-05-28 12:00:18 +08:00
binbin Deng
c9168b85b7
Fix error during merging adapter (#11145) 2024-05-27 19:41:42 +08:00
Guancheng Fu
daf7b1cd56
[Docker] Fix image using two cards error (#11144)
* fix all

* done
2024-05-27 16:20:13 +08:00
Xiangyu Tian
5c8ccf0ba9
LLM: Add Pipeline-Parallel-FastAPI example (#10917)
Add multi-stage Pipeline-Parallel-FastAPI example

---------

Co-authored-by: hzjane <a1015616934@qq.com>
2024-05-27 14:46:29 +08:00
Ruonan Wang
d550af957a
fix security issue of eagle (#11140)
* fix security issue of eagle

* small fix
2024-05-27 10:15:28 +08:00
binbin Deng
367de141f2
Fix mixtral-8x7b with transformers=4.37.0 (#11132) 2024-05-27 09:50:54 +08:00
Jean Yu
ab476c7fe2
Eagle Speculative Sampling examples (#11104)
* Eagle Speculative Sampling examples

* rm multi-gpu and ray content

* updated README to include Arc A770
2024-05-24 11:13:43 -07:00
Guancheng Fu
fabc395d0d
add langchain vllm interface (#11121)
* done

* fix

* fix

* add vllm

* add langchain vllm exampels

* add docs

* temp
2024-05-24 17:19:27 +08:00
ZehuaCao
63e95698eb
[LLM]Reopen autotp generate_stream (#11120)
* reopen autotp generate_stream

* fix style error

* update
2024-05-24 17:16:14 +08:00
Yishuo Wang
1dc680341b
fix phi-3-vision import (#11129) 2024-05-24 15:57:15 +08:00
Guancheng Fu
7f772c5a4f
Add half precision for fastchat models (#11130) 2024-05-24 15:41:14 +08:00
Zhao Changmin
65f4212f89
Fix qwen 14b run into register attention fwd (#11128)
* fix qwen 14b
2024-05-24 14:45:07 +08:00
Shaojun Liu
373f9e6c79
add ipex-llm-init.bat for Windows (#11082)
* add ipex-llm-init.bat for Windows

* update setup.py
2024-05-24 14:26:25 +08:00
Qiyuan Gong
120a0035ac
Fix type mismatch in eval for Baichuan2 QLora example (#11117)
* During the evaluation stage, Baichuan2 will raise type mismatch when training with bfloat16. Fix this issue by modifying modeling_baichuan.py. Add doc about how to modify this file.
2024-05-24 14:14:30 +08:00
Yishuo Wang
1db9d9a63b
optimize internlm2 xcomposer agin (#11124) 2024-05-24 13:44:52 +08:00
Yishuo Wang
9372ce87ce
fix internlm xcomposer2 fp16 (#11123) 2024-05-24 11:03:31 +08:00
Cengguang Zhang
011b9faa5c
LLM: unify baichuan2-13b alibi mask dtype with model dtype. (#11107)
* LLM: unify alibi mask dtype.

* fix comments.
2024-05-24 10:27:53 +08:00
Jiao Wang
0a06a6e1d4
Update tests for transformers 4.36 (#10858)
* update unit test

* update

* update

* update

* update

* update

* fix gpu attention test

* update

* update

* update

* update

* update

* update

* update example test

* replace replit code

* update

* update

* update

* update

* set safe_serialization false

* perf test

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* delete

* update

* update

* update

* update

* update

* update

* revert

* update
2024-05-24 10:26:38 +08:00
Xiangyu Tian
b3f6faa038
LLM: Add CPU vLLM entrypoint (#11083)
Add CPU vLLM entrypoint and update CPU vLLM serving example.
2024-05-24 09:16:59 +08:00
Yishuo Wang
797dbc48b8
fix phi-2 and phi-3 convert (#11116) 2024-05-23 17:37:37 +08:00
Yishuo Wang
37b98a531f
support running internlm xcomposer2 on gpu and add sdp optimization (#11115) 2024-05-23 17:26:24 +08:00
Zhao Changmin
c5e8b90c8d
Add Qwen register attention implemention (#11110)
* qwen_register
2024-05-23 17:17:45 +08:00
Yishuo Wang
0e53f20edb
support running internlm-xcomposer2 on cpu (#11111) 2024-05-23 16:36:09 +08:00
Yuwen Hu
d36b41d59e
Add setuptools limitation for ipex-llm[xpu] (#11102)
* Add setuptools limitation for ipex-llm[xpu]

* llamaindex option update
2024-05-22 18:20:30 +08:00
Yishuo Wang
cd4dff09ee
support phi-3 vision (#11101) 2024-05-22 17:43:50 +08:00
Zhao Changmin
15d906a97b
Update linux igpu run script (#11098)
* update run script
2024-05-22 17:18:07 +08:00
Kai Huang
f63172ef63
Align ppl with llama.cpp (#11055)
* update script

* remove

* add header

* update readme
2024-05-22 16:43:11 +08:00
Qiyuan Gong
f6c9ffe4dc
Add WANDB_MODE and HF_HUB_OFFLINE to XPU finetune README (#11097)
* Add WANDB_MODE=offline to avoid multi-GPUs finetune errors.
* Add HF_HUB_OFFLINE=1 to avoid Hugging Face related errors.
2024-05-22 15:20:53 +08:00
Shaojun Liu
584439e498
update homepage url for ipex-llm (#11094)
* update homepage url

* Update python version to 3.11

* Update long description
2024-05-22 11:10:44 +08:00
Xin Qiu
71bcd18f44
fix qwen vl (#11090) 2024-05-21 18:40:29 +08:00
Yishuo Wang
f00625f9a4
refactor qwen2 (#11087) 2024-05-21 16:53:42 +08:00
Qiyuan Gong
492ed3fd41
Add verified models to GPU finetune README (#11088)
* Add verified models to GPU finetune README
2024-05-21 15:49:15 +08:00
Qiyuan Gong
1210491748
ChatGLM3, Baichuan2 and Qwen1.5 QLoRA example (#11078)
* Add chatglm3, qwen15-7b and baichuan-7b QLoRA alpaca example
* Remove unnecessary tokenization setting.
2024-05-21 15:29:43 +08:00
ZehuaCao
842d6dfc2d
Further Modify CPU example (#11081)
* modify CPU example

* update
2024-05-21 13:55:47 +08:00
Yishuo Wang
d830a63bb7
refactor qwen (#11074) 2024-05-20 18:08:37 +08:00
Wang, Jian4
74950a152a
Fix tgi_api_server error file name (#11075) 2024-05-20 16:48:40 +08:00
Yishuo Wang
4e97047d70
fix baichuan2 13b fp16 (#11071) 2024-05-20 11:21:20 +08:00
binbin Deng
7170dd9192
Update guide for running qwen with AutoTP (#11065) 2024-05-20 10:53:17 +08:00
Wang, Jian4
a2e1578fd9
Merge tgi_api_server to main (#11036)
* init

* fix style

* speculative can not use benchmark

* add tgi server readme
2024-05-20 09:15:03 +08:00
Yishuo Wang
31ce3e0c13
refactor baichuan2-13b (#11064) 2024-05-17 16:25:30 +08:00
ZehuaCao
56cb992497
LLM: Modify CPU Installation Command for most examples (#11049)
* init

* refine

* refine

* refine

* modify hf-agent example

* modify all CPU model example

* remove readthedoc modify

* replace powershell with cmd

* fix repo

* fix repo

* update

* remove comment on windows code block

* update

* update

* update

* update

---------

Co-authored-by: xiangyuT <xiangyu.tian@intel.com>
2024-05-17 15:52:20 +08:00
Ruonan Wang
f1156e6b20
support gguf_q4k_m / gguf_q4k_s (#10887)
* initial commit

* UPDATE

* fix style

* fix style

* add gguf_q4k_s

* update comment

* fix
2024-05-17 14:30:09 +08:00
Yishuo Wang
981d668be6
refactor baichuan2-7b (#11062) 2024-05-17 13:01:34 +08:00
Xiangyu Tian
d963e95363
LLM: Modify CPU Installation Command for documentation (#11042)
* init

* refine

* refine

* refine

* refine comments
2024-05-17 10:14:00 +08:00
Ruonan Wang
3a72e5df8c
disable mlp fusion of fp6 on mtl (#11059) 2024-05-17 10:10:16 +08:00
SONG Ge
192ae35012
Add support for llama2 quantize_kv with transformers 4.38.0 (#11054)
* add support for llama2 quantize_kv with transformers 4.38.0

* fix code style

* fix code style
2024-05-16 22:23:39 +08:00
SONG Ge
16b2a418be
hotfix native_sdp ut (#11046)
* hotfix native_sdp

* update
2024-05-16 17:15:37 +08:00
Xin Qiu
6be70283b7
fix chatglm run error (#11045)
* fix chatglm

* update

* fix style
2024-05-16 15:39:18 +08:00
Yishuo Wang
8cae897643
use new rope in phi3 (#11047) 2024-05-16 15:12:35 +08:00
Jin Qiao
9a96af4232
Remove oneAPI pip install command in related examples (#11030)
* Remove pip install command in windows installation guide

* fix chatglm3 installation guide

* Fix gemma cpu example

* Apply on other examples

* fix
2024-05-16 10:46:29 +08:00
Xiangyu Tian
612a365479
LLM: Install CPU version torch with extras [all] (#10868)
Modify setup.py to install CPU version torch with extras [all]
2024-05-16 10:39:55 +08:00
Yishuo Wang
59df750326
Use new sdp again (#11025) 2024-05-16 09:33:34 +08:00
SONG Ge
9942a4ba69
[WIP] Support llama2 with transformers==4.38.0 (#11024)
* support llama2 with transformers==4.38.0

* add supprot for quantize_qkv

* add original support for 4.38.0 now

* code style fix
2024-05-15 18:07:00 +08:00
Yina Chen
686f6038a8
Support fp6 save & load (#11034) 2024-05-15 17:52:02 +08:00
Ruonan Wang
ac384e0f45
add fp6 mlp fusion (#11032)
* add fp6 fusion

* add qkv fusion for fp6

* remove qkv first
2024-05-15 17:42:50 +08:00
Wang, Jian4
2084ebe4ee
Enable fastchat benchmark latency (#11017)
* enable fastchat benchmark

* add readme

* update readme

* update
2024-05-15 14:52:09 +08:00
hxsz1997
93d40ab127
Update lookahead strategy (#11021)
* update lookahead strategy

* remove lines

* fix python style check
2024-05-15 14:48:05 +08:00
Wang, Jian4
d9f71f1f53
Update benchmark util for example using (#11027)
* mv benchmark_util.py to utils/

* remove

* update
2024-05-15 14:16:35 +08:00
binbin Deng
4053a6ef94
Update environment variable setting in AutoTP with arc (#11018) 2024-05-15 10:23:58 +08:00
Yishuo Wang
fad1dbaf60
use sdp fp8 causal kernel (#11023) 2024-05-15 10:22:35 +08:00
Yishuo Wang
ee325e9cc9
fix phi3 (#11022) 2024-05-15 09:32:12 +08:00
Ziteng Zhang
7d3791c819
[LLM] Add llama3 alpaca qlora example (#11011)
* Add llama3 finetune example based on alpaca qlora example
2024-05-15 09:17:32 +08:00
Zhao Changmin
0a732bebe7
Add phi3 cached RotaryEmbedding (#11013)
* phi3cachedrotaryembed

* pep8
2024-05-15 08:16:43 +08:00
Yina Chen
893197434d
Add fp6 support on gpu (#11008)
* add fp6 support

* fix style
2024-05-14 16:31:44 +08:00
Zhao Changmin
b03c859278
Add phi3RMS (#10988)
* phi3RMS
2024-05-14 15:16:27 +08:00
Yishuo Wang
170e3d65e0
use new sdp and fp32 sdp (#11007) 2024-05-14 14:29:18 +08:00
Qiyuan Gong
c957ea3831
Add axolotl main support and axolotl Llama-3-8B QLoRA example (#10984)
* Support axolotl main (796a085).
* Add axolotl Llama-3-8B QLoRA example.
* Change `sequence_len` to 256 for alpaca, and revert `lora_r` value.
* Add example to quick_start.
2024-05-14 13:43:59 +08:00
Yuwen Hu
fb656fbf74
Add requirements for oneAPI pypi packages for windows Intel GPU users (#11009) 2024-05-14 13:40:54 +08:00
Shaojun Liu
7f8c5b410b
Quickstart: Run PyTorch Inference on Intel GPU using Docker (on Linux or WSL) (#10970)
* add entrypoint.sh

* add quickstart

* remove entrypoint

* update

* Install related library of benchmarking

* update

* print out results

* update docs

* minor update

* update

* update quickstart

* update

* update

* update

* update

* update

* update

* add chat & example section

* add more details

* minor update

* rename quickstart

* update

* minor update

* update

* update config.yaml

* update readme

* use --gpu

* add tips

* minor update

* update
2024-05-14 12:58:31 +08:00
Guancheng Fu
a465111cf4
Update README.md (#11003) 2024-05-13 16:44:48 +08:00
Guancheng Fu
74997a3ed1
Adding load_low_bit interface for ipex_llm_worker (#11000)
* initial implementation, need tests

* fix

* fix baichuan issue

* fix typo
2024-05-13 15:30:19 +08:00
Yishuo Wang
1b3c7a6928
remove phi3 empty cache (#10997) 2024-05-13 14:09:55 +08:00
ZehuaCao
99255fe36e
fix ppl (#10996) 2024-05-13 13:57:19 +08:00
Kai Huang
f8dd2e52ad
Fix Langchain upstream ut (#10985)
* Fix Langchain upstream ut

* Small fix

* Install bigdl-llm

* Update run-langchain-upstream-tests.sh

* Update run-langchain-upstream-tests.sh

* Update llm_unit_tests.yml

* Update run-langchain-upstream-tests.sh

* Update llm_unit_tests.yml

* Update run-langchain-upstream-tests.sh

* fix git checkout

* fix

---------

Co-authored-by: Zhangky11 <2321096202@qq.com>
Co-authored-by: Keyan (Kyrie) Zhang <79576162+Zhangky11@users.noreply.github.com>
2024-05-11 14:40:37 +08:00
Yuwen Hu
9f6358e4c2
Deprecate support for pytorch 2.0 on Linux for ipex-llm >= 2.1.0b20240511 (#10986)
* Remove xpu_2.0 option in setup.py

* Disable xpu_2.0 test in UT and nightly

* Update docs for deprecated pytorch 2.0

* Small doc update
2024-05-11 12:33:35 +08:00
Yishuo Wang
ad96f32ce0
optimize phi3 1st token performance (#10981) 2024-05-10 17:33:46 +08:00
Cengguang Zhang
cfed76b2ed
LLM: add long-context support for Qwen1.5-7B/Baichuan2-7B/Mistral-7B. (#10937)
* LLM: add split tensor support for baichuan2-7b and qwen1.5-7b.

* fix style.

* fix style.

* fix style.

* add support for mistral and fix condition threshold.

* fix style.

* fix comments.
2024-05-10 16:40:15 +08:00
binbin Deng
f9615f12d1
Add driver related packages version check in env script (#10977) 2024-05-10 15:02:58 +08:00
Kai Huang
a6342cc068
Empty cache after phi first attention to support 4k input (#10972)
* empty cache

* fix style
2024-05-09 19:50:04 +08:00
Yishuo Wang
e753125880
use fp16_sdp when head_dim=96 (#10976) 2024-05-09 17:02:59 +08:00
Yishuo Wang
697ca79eca
use quantize kv and sdp in phi3-mini (#10973) 2024-05-09 15:16:18 +08:00
Wang, Jian4
f4c615b1ee
Add cohere example (#10954)
* add link first

* add_cpu_example

* add GPU example
2024-05-08 17:19:59 +08:00
Wang, Jian4
3209d6b057
Fix speculative llama3 no stop error (#10963)
* fix normal

* add eos_tokens_id on sp and add list if

* update

* no none
2024-05-08 17:09:47 +08:00
Xiangyu Tian
02870dc385
LLM: Refine README of AutoTP-FastAPI example (#10960) 2024-05-08 16:55:23 +08:00
Yishuo Wang
2ebec0395c
optimize phi-3-mini-128 (#10959) 2024-05-08 16:33:17 +08:00
Xin Qiu
dfa3147278
update (#10944) 2024-05-08 14:28:05 +08:00
Xin Qiu
5973d6c753
make gemma's output better (#10943) 2024-05-08 14:27:51 +08:00
Jin Qiao
15ee3fd542
Update igpu perf internlm (#10958) 2024-05-08 14:16:43 +08:00
Zhao Changmin
0d6e12036f
Disable fast_init_ in load_low_bit (#10945)
* fast_init_ disable
2024-05-08 10:46:19 +08:00
Qiyuan Gong
164e6957af
Refine axolotl quickstart (#10957)
* Add default accelerate config for axolotl quickstart.
* Fix requirement link.
* Upgrade peft to 0.10.0 in requirement.
2024-05-08 09:34:02 +08:00
Yishuo Wang
c801c37bc6
optimize phi3 again: use quantize kv if possible (#10953) 2024-05-07 17:26:19 +08:00
Yishuo Wang
aa2fa9fde1
optimize phi3 again: use sdp if possible (#10951) 2024-05-07 15:53:08 +08:00
Qiyuan Gong
c11170b96f
Upgrade Peft to 0.10.0 in finetune examples and docker (#10930)
* Upgrade Peft to 0.10.0 in finetune examples.
* Upgrade Peft to 0.10.0 in docker.
2024-05-07 15:12:26 +08:00
Qiyuan Gong
d7ca5d935b
Upgrade Peft version to 0.10.0 for LLM finetune (#10886)
* Upgrade Peft version to 0.10.0
* Upgrade Peft version in ARC unit test and HF-Peft example.
2024-05-07 15:09:14 +08:00
Yuwen Hu
0efe26c3b6
Change order of chatglm2-6b and chatglm3-6b in iGPU perf test for more stable performance (#10948) 2024-05-07 13:48:39 +08:00
hxsz1997
245c7348bc
Add codegemma example (#10884)
* add codegemma example in GPU/HF-Transformers-AutoModels/

* add README of codegemma example in GPU/HF-Transformers-AutoModels/

* add codegemma example in GPU/PyTorch-Models/

* add readme of codegemma example in GPU/PyTorch-Models/

* add codegemma example in CPU/HF-Transformers-AutoModels/

* add readme of codegemma example in CPU/HF-Transformers-AutoModels/

* add codegemma example in CPU/PyTorch-Models/

* add readme of codegemma example in CPU/PyTorch-Models/

* fix typos

* fix filename typo

* add codegemma in tables

* add comments of lm_head

* remove comments of use_cache
2024-05-07 13:35:42 +08:00
Shaojun Liu
08ad40b251
improve ipex-llm-init for Linux (#10928)
* refine ipex-llm-init

* install libtcmalloc.so for Max

* update based on comment

* remove unneeded code
2024-05-07 12:55:14 +08:00
Wang, Jian4
191b184341
LLM: Optimize cohere model (#10878)
* use mlp and rms

* optimize kv_cache

* add fuse qkv

* add flash attention and fp16 sdp

* error fp8 sdp

* fix optimized

* fix style

* update

* add for pp
2024-05-07 10:19:50 +08:00
Xiangyu Tian
13a44cdacb
LLM: Refine Deepspped-AutoTP-FastAPI example (#10916) 2024-05-07 09:37:31 +08:00
Wang, Jian4
1de878bee1
LLM: Fix speculative llama3 long input error (#10934) 2024-05-07 09:25:20 +08:00
Guancheng Fu
49ab5a2b0e
Add embeddings (#10931) 2024-05-07 09:07:02 +08:00
Wang, Jian4
0e0bd309e2
LLM: Enable Speculative on Fastchat (#10909)
* init

* enable streamer

* update

* update

* remove deprecated

* update

* update

* add gpu example
2024-05-06 10:06:20 +08:00
Cengguang Zhang
0edef1f94c
LLM: add min_new_tokens to all in one benchmark. (#10911) 2024-05-06 09:32:59 +08:00
Cengguang Zhang
75dbf240ec
LLM: update split tensor conditions. (#10872)
* LLM: update split tensor condition.

* add cond for split tensor.

* update priority of env.

* fix style.

* update env name.
2024-04-30 17:07:21 +08:00
Guancheng Fu
2c64754eb0
Add vLLM to ipex-llm serving image (#10807)
* add vllm

* done

* doc work

* fix done

* temp

* add docs

* format

* add start-fastchat-service.sh

* fix
2024-04-29 17:25:42 +08:00
Jin Qiao
1f876fd837
Add example for phi-3 (#10881)
* Add example for phi-3

* add in readme and index

* fix

* fix

* fix

* fix indent

* fix
2024-04-29 16:43:55 +08:00
Yishuo Wang
d884c62dc4
remove new_layout parameter (#10906) 2024-04-29 10:31:50 +08:00
Guancheng Fu
fbcd7bc737
Fix Loader issue with dtype fp16 (#10907) 2024-04-29 10:16:02 +08:00
Guancheng Fu
c9fac8c26b
Fix sdp logic (#10896)
* fix

* fix
2024-04-28 22:02:14 +08:00
Yina Chen
015d07a58f
Fix lookahead sample error & add update strategy (#10894)
* Fix sample error & add update strategy

* add mtl config

* fix style

* remove print
2024-04-28 17:21:00 +08:00
Yuwen Hu
1a8a93d5e0
Further fix nightly perf (#10901) 2024-04-28 10:18:58 +08:00
Yuwen Hu
ddfdaec137
Fix nightly perf (#10899)
* Fix nightly perf by adding default value in benchmark for use_fp16_torch_dtype

* further fixes
2024-04-28 09:39:29 +08:00
Cengguang Zhang
9752ffe979
LLM: update split qkv native sdp. (#10895)
* LLM: update split qkv native sdp.

* fix typo.
2024-04-26 18:47:35 +08:00
Guancheng Fu
990535b1cf
Add tensor parallel for vLLM (#10879)
* initial

* test initial tp

* initial sup

* fix format

* fix

* fix
2024-04-26 17:10:49 +08:00
binbin Deng
f51bf018eb
Add benchmark script for pipeline parallel inference (#10873) 2024-04-26 15:28:11 +08:00
Yishuo Wang
46ba962168
use new quantize kv (#10888) 2024-04-26 14:42:17 +08:00
Xiangyu Tian
3d4950b0f0
LLM: Enable batch generate (world_size>1) in Deepspeed-AutoTP-FastAPI example (#10876)
Enable batch generate (world_size>1) in Deepspeed-AutoTP-FastAPI example.
2024-04-26 13:24:28 +08:00
Wang, Jian4
3e8ed54270
LLM: Fix bigdl_ipex_int8 warning (#10890) 2024-04-26 11:18:44 +08:00
Jin Qiao
fb3c268d13
Add phi-3 to perf (#10883) 2024-04-25 20:21:56 +08:00
Yina Chen
8811f268ff
Use new fp16 sdp in Qwen and modify the constraint (#10882) 2024-04-25 19:23:37 +08:00
Yuxuan Xia
0213c1c1da
Add phi3 to the nightly test (#10885)
* Add llama3 and phi2 nightly test

* Change llama3-8b to llama3-8b-instruct

* Add phi3 to nightly test

* Add phi3 to nightly test

---------

Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-04-25 17:39:12 +08:00
Yuxuan Xia
ca2479be87
Update scripts readme (#10725)
* Update scripts readme

* Update scripts readme

* Update README

* Update readme

* Update readme

* Update windows env check readme

* Adjust env check readme

* Update windows env check

* Update env check readme

* Adjust the env-check README

* Modify the env-check README
2024-04-25 17:24:37 +08:00
Cengguang Zhang
cd369c2715
LLM: add device id to benchmark utils. (#10877) 2024-04-25 14:01:51 +08:00
Yang Wang
1ce8d7bcd9
Support the desc_act feature in GPTQ model (#10851)
* support act_order

* update versions

* fix style

* fix bug

* clean up
2024-04-24 10:17:13 -07:00
Yina Chen
dc27b3bc35
Use sdp when rest token seq_len > 1 in llama & mistral (for lookup & spec) (#10790)
* update sdp condition

* update

* fix

* update & test llama

* mistral

* fix style

* update

* fix style

* remove pvc constrain

* update ds on arc

* fix style
2024-04-24 17:24:01 +08:00
Yuxuan Xia
844e18b1db
Add llama3 and phi2 nightly test (#10874)
* Add llama3 and phi2 nightly test

* Change llama3-8b to llama3-8b-instruct

---------

Co-authored-by: Yishuo Wang <yishuo.wang@intel.com>
2024-04-24 16:58:56 +08:00
binbin Deng
c9feffff9a
LLM: support Qwen1.5-MoE-A2.7B-Chat pipeline parallel inference (#10864) 2024-04-24 16:02:27 +08:00
Yishuo Wang
2d210817ff
add phi3 optimization (#10871) 2024-04-24 15:17:40 +08:00
Cengguang Zhang
eb39c61607
LLM: add min new token to perf test. (#10869) 2024-04-24 14:32:02 +08:00
Yuwen Hu
fb2a160af3
Add phi-2 to 2048-256 test for fixes (#10867) 2024-04-24 10:00:25 +08:00
binbin Deng
fabf54e052
LLM: make pipeline parallel inference example more common (#10786) 2024-04-24 09:28:52 +08:00
hxsz1997
328b1a1de9
Fix the not stop issue of llama3 examples (#10860)
* fix not stop issue in GPU/HF-Transformers-AutoModels

* fix not stop issue in GPU/PyTorch-Models/Model/llama3

* fix not stop issue in CPU/HF-Transformers-AutoModels/Model/llama3

* fix not stop issue in CPU/PyTorch-Models/Model/llama3

* update the output in readme

* update format

* add reference

* update prompt format

* update output format in readme

* update example output in readme
2024-04-23 19:10:09 +08:00
Yuwen Hu
5c9eb5d0f5
Support llama-index install option for upstreaming purposes (#10866)
* Support llama-index install option for upstreaming purposes

* Small fix

* Small fix
2024-04-23 19:08:29 +08:00
Yuwen Hu
21bb8bd164
Add phi-2 to igpu performance test (#10865) 2024-04-23 18:13:14 +08:00
ZehuaCao
36eb8b2e96
Add llama3 speculative example (#10856)
* Initial llama3 speculative example

* update README

* update README

* update README
2024-04-23 17:03:54 +08:00
Cengguang Zhang
763413b7e1
LLM: support llama split tensor for long context in transformers>=4.36. (#10844)
* LLM: support llama split tensor for long context in transformers>=4.36.

* fix dtype.

* fix style.

* fix style.

* fix style.

* fix style.

* fix dtype.

* fix style.
2024-04-23 16:13:25 +08:00
ZehuaCao
92ea54b512
Fix speculative decoding bug (#10855) 2024-04-23 14:28:31 +08:00
yb-peng
c9dee6cd0e
Update 8192.txt (#10824)
* Update 8192.txt

* Update 8192.txt with original text
2024-04-23 14:02:09 +08:00
Wang, Jian4
18c032652d
LLM: Add mixtral speculative CPU example (#10830)
* init mixtral sp example

* use different prompt_format

* update output

* update
2024-04-23 10:05:51 +08:00
Qiyuan Gong
5494aa55f6
Downgrade datasets in axolotl example (#10849)
* Downgrade datasets to 2.15.0 to address axolotl prepare issue https://github.com/OpenAccess-AI-Collective/axolotl/issues/1544

Tks to @kwaa for providing the solution in https://github.com/intel-analytics/ipex-llm/issues/10821#issuecomment-2068861571
2024-04-23 09:41:58 +08:00
Yishuo Wang
fe5a082b84
add phi-2 optimization (#10843) 2024-04-22 18:56:47 +08:00
Guancheng Fu
47bd5f504c
[vLLM]Remove vllm-v1, refactor v2 (#10842)
* remove vllm-v1

* fix format
2024-04-22 17:51:32 +08:00
Wang, Jian4
23c6a52fb0
LLM: Fix ipex torchscript=True error (#10832)
* remove

* update

* remove torchscript
2024-04-22 15:53:09 +08:00
Heyang Sun
fc33aa3721
fix missing import (#10839) 2024-04-22 14:34:52 +08:00
Yina Chen
3daad242b8
Fix No module named 'transformers.cache_utils' with transformers < 4.36 (#10835)
* update sdp condition

* update

* fix

* fix 431 error

* revert sdp & style fix

* fix

* meet comments
2024-04-22 14:05:50 +08:00
Guancheng Fu
ae3b577537
Update README.md (#10833) 2024-04-22 11:07:10 +08:00
Wang, Jian4
5f95054f97
LLM: Add qwen moe example libs md (#10828) 2024-04-22 10:03:19 +08:00
Guancheng Fu
61c67af386
Fix vLLM-v2 install instructions(#10822) 2024-04-22 09:02:48 +08:00
Guancheng Fu
caf75beef8
Disable sdpa (#10814) 2024-04-19 17:33:18 +08:00
Yishuo Wang
57edf2033c
fix lookahead with transformers >= 4.36 (#10808) 2024-04-19 16:24:56 +08:00
Ovo233
1a885020ee
Updated importing of top_k_top_p_filtering for transformers>=4.39.0 (#10794)
* In transformers>=4.39.0, the top_k_top_p_filtering function has been deprecated and moved to the Hugging Face package trl. Thus, for versions >= 4.39.0, import this function from trl.
2024-04-19 15:34:39 +08:00
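The version guard described in the commit above can be sketched as follows. This is an illustrative helper (the function name and the 4.39 cutoff taken from the commit message are the only inputs; it is not the actual ipex-llm patch), showing how one might decide whether `top_k_top_p_filtering` should be imported from `transformers` or from `trl`:

```python
# Illustrative sketch of the import guard: given an installed transformers
# version string, return the package expected to provide
# top_k_top_p_filtering. Per the commit above, the function moved to `trl`
# as of transformers 4.39.0.
def filtering_source(transformers_version: str) -> str:
    """Return 'trl' for transformers >= 4.39.0, else 'transformers'."""
    major, minor = (int(x) for x in transformers_version.split(".")[:2])
    return "trl" if (major, minor) >= (4, 39) else "transformers"
```

In the real patch this decision would gate a conditional `from trl import top_k_top_p_filtering` versus the legacy `from transformers import top_k_top_p_filtering`.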
Yuwen Hu
07e8b045a9
Add Meta-llama-3-8B-Instruct and Yi-6B-Chat to igpu nightly perf (#10810) 2024-04-19 15:09:58 +08:00
Yishuo Wang
08458b4f74
remove rms norm copy (#10793) 2024-04-19 13:57:48 +08:00
Yang Wang
8153c3008e
Initial llama3 example (#10799)
* Add initial hf huggingface GPU example

* Small fix

* Add llama3 gpu pytorch model example

* Add llama 3 hf transformers CPU example

* Add llama 3 pytorch model CPU example

* Fixes

* Small fix

* Small fixes

* Small fix

* Small fix

* Add links

* update repo id

* change prompt tuning url

* remove system header if there is no system prompt

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
Co-authored-by: Yuwen Hu <54161268+Oscilloscope98@users.noreply.github.com>
2024-04-18 11:01:33 -07:00
Ruonan Wang
754b0ffecf Fix pvc llama (#10798)
* fix

* update
2024-04-18 10:44:57 -07:00
Ruonan Wang
439c834ed3
LLM: add mixed precision for lm_head (#10795)
* add mixed_quantization

* meet code review

* update

* fix style

* meet review
2024-04-18 19:11:31 +08:00
Yina Chen
8796401b08
Support q4k in ipex-llm (#10796)
* support q4k

* update
2024-04-18 18:55:28 +08:00
Ruonan Wang
0e8aac19e3
add q6k precision in ipex-llm (#10792)
* add q6k

* add initial 16k

* update

* fix style
2024-04-18 16:52:09 +08:00
Qiyuan Gong
e90e31719f
axolotl lora example (#10789)
* Add axolotl lora example
* Modify readme
* Add comments in yml
2024-04-18 16:38:32 +08:00
Wang, Jian4
14ca42a048
LLM: Fix moe indexes error on cpu (#10791) 2024-04-18 15:56:52 +08:00
Guancheng Fu
cbe7b5753f
Add vLLM[xpu] related code (#10779)
* Add ipex-llm side change

* add runable offline_inference

* refactor to call vllm2

* Verified async server

* add new v2 example

* add README

* fix

* change dir

* refactor readme.md

* add experimental

* fix
2024-04-18 15:29:20 +08:00
Kai Huang
053ec30737
Transformers ppl evaluation on wikitext (#10784)
* transformers code

* cache
2024-04-18 15:27:18 +08:00
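Perplexity evaluation of the kind added in the commit above conventionally reduces to the exponential of the average per-token negative log-likelihood over the corpus. A minimal aggregation sketch (function and argument names are illustrative, not the benchmark script's API):

```python
import math

def perplexity(nll_sums, n_tokens):
    """Aggregate per-chunk negative log-likelihood sums into perplexity:
    ppl = exp(total_nll / total_tokens)."""
    return math.exp(sum(nll_sums) / n_tokens)
```

In a wikitext harness, each `nll_sums` entry would typically be the summed cross-entropy loss of one sliding-window chunk, cached so overlapping windows are not recomputed.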
Wang, Jian4
209c3501e6
LLM: Optimize qwen1.5 moe model (#10706)
* update moe block

* fix style

* enable optimized MLP

* enable kv_cache

* enable fuse rope

* enable fused qkv

* enable flash_attention

* error sdp quantize

* use old api

* use fuse

* use xetla

* fix python style

* update moe_blocks num

* fix output error

* add cpu sdpa

* update

* update

* update
2024-04-18 14:54:05 +08:00
Ziteng Zhang
ff040c8f01
LISA Finetuning Example (#10743)
* enabling xetla only supports qtype=SYM_INT4 or FP8E5

* LISA Finetuning Example on gpu

* update readme

* add licence

* Explain parameters of lisa & Move backend codes to src dir

* fix style

* fix style

* update readme

* support chatglm

* fix style

* fix style

* update readme

* fix
2024-04-18 13:48:10 +08:00
Heyang Sun
581ebf6104
GaLore Finetuning Example (#10722)
* GaLore Finetuning Example

* Update README.md

* Update README.md

* change data to HuggingFaceH4/helpful_instructions

* Update README.md

* Update README.md

* shrink train size and delete cache before starting training to save memory

* Update README.md

* Update galore_finetuning.py

* change model to llama2 3b

* Update README.md
2024-04-18 13:47:41 +08:00
Yang Wang
952e517db9
use config rope_theta (#10787)
* use config rope_theta

* fix style
2024-04-17 20:39:11 -07:00
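Reading `rope_theta` from the model config, as in the commit above, matters because the rotary-embedding base directly sets the inverse frequencies; two checkpoints with different `rope_theta` produce different position encodings. A hedged sketch of the standard RoPE frequency computation (plain Python for illustration; the real code operates on tensors):

```python
def rope_inv_freq(head_dim: int, rope_theta: float):
    """Standard rotary inverse frequencies: 1 / theta^(2i/d)
    for i in 0 .. d/2 - 1, where d is the attention head dimension."""
    return [1.0 / (rope_theta ** (2 * i / head_dim))
            for i in range(head_dim // 2)]
```

Hardcoding the base (e.g. 10000.0) instead of passing `config.rope_theta` would silently skew outputs for any checkpoint trained with a different value.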
Guancheng Fu
31ea2f9a9f
Fix wrong output for Llama models on CPU (#10742) 2024-04-18 11:07:27 +08:00
Xin Qiu
e764f9b1b1
Disable fast fused rope on UHD (#10780)
* use decoding fast path

* update

* update

* cleanup
2024-04-18 10:03:53 +08:00
Yina Chen
ea5b373a97
Add lookahead GPU example (#10785)
* Add lookahead example

* fix style & attn mask

* fix typo

* address comments
2024-04-17 17:41:55 +08:00
Wang, Jian4
a20271ffe4
LLM: Fix yi-6b fp16 error on pvc (#10781)
* update for yi fp16

* update

* update
2024-04-17 16:49:59 +08:00
ZehuaCao
0646e2c062
Fix short prompt for IPEX_CPU speculative decoding cause no_attr error (#10783) 2024-04-17 16:19:57 +08:00
Cengguang Zhang
7ec82c6042
LLM: add README.md for Long-Context examples. (#10765)
* LLM: add readme to long-context examples.

* add precision.

* update wording.

* add GPU type.

* add Long-Context example to GPU examples.

* fix comments.

* update max input length.

* update max length.

* add output length.

* fix wording.
2024-04-17 15:34:59 +08:00
Yina Chen
766fe45222
Fix spec error caused by lookup pr (#10777)
* Fix spec error

* remove

* fix style
2024-04-17 11:27:35 +08:00
Qiyuan Gong
9e5069437f
Fix gradio version in axolotl example (#10776)
* Change to gradio>=4.19.2
2024-04-17 10:23:43 +08:00
Qiyuan Gong
f2e923b3ca
Axolotl v0.4.0 support (#10773)
* Add Axolotl 0.4.0, remove legacy 0.3.0 support.
* replace is_torch_bf16_gpu_available
* Add HF_HUB_OFFLINE=1
* Move transformers out of requirement
* Refine readme and qlora.yml
2024-04-17 09:49:11 +08:00
Heyang Sun
26cae0a39c
Update FLEX in Deepspeed README (#10774)
* Update FLEX in Deepspeed README

* Update README.md
2024-04-17 09:28:24 +08:00
Wenjing Margaret Mao
c41730e024
edit 'ppl_result does not exist' issue, delete useless code (#10767)
* edit 'ppl_result does not exist' issue, delete useless code

* delete nonzero_min function

---------

Co-authored-by: jenniew <jenniewang123@gmail.com>
2024-04-16 18:11:56 +08:00
Yina Chen
899d392e2f
Support prompt lookup in ipex-llm (#10768)
* lookup init

* add lookup

* fix style

* remove redundant code

* change param name

* fix style
2024-04-16 16:52:38 +08:00
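Prompt lookup, as supported in the commit above, drafts speculative tokens without a draft model: it matches the trailing n-gram of the generated sequence against earlier positions in the prompt and proposes the tokens that followed the match. A self-contained sketch under that description (names and defaults are illustrative, not the ipex-llm API):

```python
def prompt_lookup(tokens, ngram_size=3, num_draft=5):
    """Find the most recent earlier occurrence of the trailing n-gram in
    `tokens` and return up to `num_draft` tokens that followed it, to be
    verified in one forward pass by the target model."""
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan from the most recent candidate backwards, excluding the tail itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = tokens[start + ngram_size:start + ngram_size + num_draft]
            if follow:
                return follow
    return []
```

The drafted tokens are then accepted or rejected against the target model's logits, the same verification step used in model-based speculative decoding.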
Qiyuan Gong
d30b22a81b
Refine axolotl 0.3.0 documents and links (#10764)
* Refine axolotl 0.3 based on comments
* Rename requirements to requirement-xpu
* Add comments for paged_adamw_32bit
* change lora_r from 8 to 16
2024-04-16 14:47:45 +08:00
ZehuaCao
599a88db53
Add Deepspeed-AutoTP-FastAPI serving (#10748)
* add Deepspeed-AutoTP-FastAPI serving

* add readme

* add license

* update

* update

* fix
2024-04-16 14:03:23 +08:00
binbin Deng
0a62933d36
LLM: fix qwen AutoTP (#10766) 2024-04-16 09:56:17 +08:00
Cengguang Zhang
3e2662c87e
LLM: fix get env KV_CACHE_ALLOC_BLOCK_LENGTH type. (#10771) 2024-04-16 09:32:30 +08:00
Jin Qiao
73a67804a4
GPU configuration update for examples (windows pip installer, etc.) (#10762)
* renew chatglm3-6b gpu example readme

fix

fix

fix

* fix for comments

* fix

* fix

* fix

* fix

* fix

* apply on HF-Transformers-AutoModels

* apply on PyTorch-Models

* fix

* fix
2024-04-15 17:42:52 +08:00
yb-peng
b5209d3ec1
Update example/GPU/PyTorch-Models/Model/llava/README.md (#10757)
* Update example/GPU/PyTorch-Models/Model/llava/README.md

* Update README.md

fix path in windows installation
2024-04-15 13:01:37 +08:00
binbin Deng
3d561b60ac
LLM: add enable_xetla parameter for optimize_model API (#10753) 2024-04-15 12:18:25 +08:00
Jiao Wang
a9a6b6b7af
Fix baichuan-13b issue on portable zip under transformers 4.36 (#10746)
* fix baichuan-13b issue

* update

* update
2024-04-12 16:27:01 -07:00
Jiao Wang
9e668a5bf0
fix internlm-chat-7b-8k repo name in examples (#10747) 2024-04-12 10:15:48 -07:00
binbin Deng
c3fc8f4b90
LLM: add bs limitation for llama softmax upcast to fp32 (#10752) 2024-04-12 15:40:25 +08:00
hxsz1997
0d518aab8d
Merge pull request #10697 from MargarettMao/ceval
combine english and chinese, remove nan
2024-04-12 14:37:47 +08:00
jenniew
dd0d2df5af Change fp16.csv mistral-7b-v0.1 into Mistral-7B-v0.1 2024-04-12 14:28:46 +08:00
jenniew
7309f1ddf9 Modify Typos 2024-04-12 14:23:13 +08:00
jenniew
cb594e1fc5 Modify Typos 2024-04-12 14:22:09 +08:00
jenniew
382c18e600 Modify Typos 2024-04-12 14:15:48 +08:00
jenniew
1a360823ce Modify Typos 2024-04-12 14:13:21 +08:00
jenniew
cdbb1de972 Mark Color Modification 2024-04-12 14:00:50 +08:00
jenniew
9bbfcaf736 Mark Color Modification 2024-04-12 13:30:16 +08:00
jenniew
bb34c6e325 Mark Color Modification 2024-04-12 13:26:36 +08:00
Yishuo Wang
8086554d33
use new fp16 sdp in llama and mistral (#10734) 2024-04-12 10:49:02 +08:00
Yang Wang
019293e1b9
Fuse MOE indexes computation (#10716)
* try moe

* use c++ cpu to compute indexes

* fix style
2024-04-11 10:12:55 -07:00
jenniew
b151a9b672 edit csv_to_html to combine en & zh 2024-04-11 17:35:36 +08:00
binbin Deng
70ed9397f9
LLM: fix AttributeError of FP16Linear (#10740) 2024-04-11 17:03:56 +08:00
Keyan (Kyrie) Zhang
1256a2cc4e
Add chatglm3 long input example (#10739)
* Add long context input example for chatglm3

* Small fix

* Small fix

* Small fix
2024-04-11 16:33:43 +08:00
hxsz1997
fd473ddb1b
Merge pull request #10730 from MargarettMao/MargarettMao-parent_folder
Edit ppl update_HTML_parent_folder
2024-04-11 15:45:24 +08:00
Qiyuan Gong
2d64630757
Remove transformers version in axolotl example (#10736)
* Remove transformers version in axolotl requirements.txt
2024-04-11 14:02:31 +08:00
yb-peng
2685c41318
Modify all-in-one benchmark (#10726)
* Update 8192 prompt in all-in-one

* Add cpu_embedding param for linux api

* Update run.py

* Update README.md
2024-04-11 13:38:50 +08:00
Xiangyu Tian
301504aa8d
Fix transformers version warning (#10732) 2024-04-11 13:12:49 +08:00
Wenjing Margaret Mao
9bec233e4d
Delete python/llm/test/benchmark/perplexity/update_html_in_parent_folder.py
Delete due to repetition
2024-04-11 07:21:12 +08:00
Cengguang Zhang
4b024b7aac
LLM: optimize chatglm2 8k input. (#10723)
* LLM: optimize chatglm2 8k input.

* rename.
2024-04-10 16:59:06 +08:00
Yuxuan Xia
cd22cb8257
Update Env check Script (#10709)
* Update env check bash file

* Update env-check
2024-04-10 15:06:00 +08:00
Shaojun Liu
29bf28bd6f
Upgrade python to 3.11 in Docker Image (#10718)
* install python 3.11 for cpu-inference docker image

* update xpu-inference dockerfile

* update cpu-serving image

* update qlora image

* update lora image

* update document
2024-04-10 14:41:27 +08:00
Qiyuan Gong
b727767f00
Add axolotl v0.3.0 with ipex-llm on Intel GPU (#10717)
* Add axolotl v0.3.0 support on Intel GPU.
* Add finetune example on llama-2-7B with Alpaca dataset.
2024-04-10 14:38:29 +08:00
Wang, Jian4
c9e6d42ad1
LLM: Fix chatglm3-6b-32k error (#10719)
* fix chatglm3-6b-32k

* update style
2024-04-10 11:24:06 +08:00
Keyan (Kyrie) Zhang
585c174e92
Read the value of KV_CACHE_ALLOC_BLOCK_LENGTH from the environment variables (#10707)
* Read the value of KV_CACHE_ALLOC_BLOCK_LENGTH from the environment variables.

* Fix style
2024-04-10 10:48:46 +08:00
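The commit above moves `KV_CACHE_ALLOC_BLOCK_LENGTH` from a hardcoded constant to an environment-variable override (a follow-up, #10771, fixed the value's type). A minimal sketch of that pattern — the fallback default of 256 here is illustrative, only the variable name comes from the commit:

```python
import os

def kv_cache_alloc_block_length(default: int = 256) -> int:
    """Read KV_CACHE_ALLOC_BLOCK_LENGTH from the environment, falling back
    to `default`; cast to int so block-count arithmetic works, since
    os.environ values are always strings."""
    return int(os.environ.get("KV_CACHE_ALLOC_BLOCK_LENGTH", default))
```

Forgetting the `int(...)` cast is exactly the kind of type bug the follow-up fix addresses: `os.environ` returns `"512"`, not `512`.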
Jiao Wang
d1eaea509f
update chatglm readme (#10659) 2024-04-09 14:24:46 -07:00
Jiao Wang
878a97077b
Fix llava example to support transformers 4.36 (#10614)
* fix llava example

* update
2024-04-09 13:47:07 -07:00
Jiao Wang
1e817926ba
Fix low memory generation example issue in transformers 4.36 (#10702)
* update cache in low memory generate

* update
2024-04-09 09:56:52 -07:00
Yuwen Hu
97db2492c8
Update setup.py for bigdl-core-xe-esimd-21 on Windows (#10705)
* Support bigdl-core-xe-esimd-21 for windows in setup.py

* Update setup-llm-env accordingly
2024-04-09 18:21:21 +08:00
Zhicun
b4147a97bb
Fix dtype mismatch error (#10609)
* fix llama

* fix

* fix code style

* add torch type in model.py

---------

Co-authored-by: arda <arda@arda-arc19.sh.intel.com>
2024-04-09 17:50:33 +08:00
Shaojun Liu
f37a1f2a81
Upgrade to python 3.11 (#10711)
* create conda env with python 3.11

* recommend to use Python 3.11

* update
2024-04-09 17:41:17 +08:00
Yishuo Wang
8f45e22072
fix llama2 (#10710) 2024-04-09 17:28:37 +08:00
Yishuo Wang
e438f941f2
disable rwkv5 fp16 (#10699) 2024-04-09 16:42:11 +08:00
Cengguang Zhang
6a32216269
LLM: add llama2 8k input example. (#10696)
* LLM: add llama2-32K example.

* refactor name.

* fix comments.

* add IPEX_LLM_LOW_MEM notes and update sample output.
2024-04-09 16:02:37 +08:00
Wenjing Margaret Mao
289cc99cd6
Update README.md (#10700)
Edit "summarize the results"
2024-04-09 16:01:12 +08:00
Wenjing Margaret Mao
d3116de0db
Update README.md (#10701)
edit "summarize the results"
2024-04-09 15:50:25 +08:00
Chen, Zhentao
d59e0cce5c
Migrate harness to ipexllm (#10703)
* migrate to ipexlm

* fix workflow

* fix run_multi

* fix precision map

* rename ipexlm to ipexllm

* rename bigdl to ipex  in comments
2024-04-09 15:48:53 +08:00
Keyan (Kyrie) Zhang
1e27e08322
Modify example from fp32 to fp16 (#10528)
* Modify example from fp32 to fp16

* Remove Falcon from fp16 example for now

* Remove MPT from fp16 example
2024-04-09 15:45:49 +08:00
binbin Deng
44922bb5c2
LLM: support baichuan2-13b using AutoTP (#10691) 2024-04-09 14:06:01 +08:00
Yina Chen
c7422712fc
mistral 4.36 use fp16 sdp (#10704) 2024-04-09 13:50:33 +08:00
Ovo233
dcb2038aad
Enable optimization for sentence_transformers (#10679)
* enable optimization for sentence_transformers

* fix python style check failure
2024-04-09 12:33:46 +08:00
Yang Wang
5a1f446d3c
support fp8 in xetla (#10555)
* support fp8 in xetla

* change name

* adjust model file

* support convert back to cpu

* factor

* fix bug

* fix style
2024-04-08 13:22:09 -07:00
jenniew
591bae092c combine english and chinese, remove nan 2024-04-08 19:37:51 +08:00
Cengguang Zhang
7c43ac0164
LLM: optimize llama native sdp for split qkv tensor (#10693)
* LLM: optimize llama native sdp for split qkv tensor.

* fix block real size.

* fix comment.

* fix style.

* refactor.
2024-04-08 17:48:11 +08:00
Xin Qiu
1274cba79b
stablelm fp8 kv cache (#10672)
* stablelm fp8 kvcache

* update

* fix

* change to fp8 matmul

* fix style

* fix

* fix

* meet code review

* add comment
2024-04-08 15:16:46 +08:00
Yishuo Wang
65127622aa
fix UT threshold (#10689) 2024-04-08 14:58:20 +08:00
Cengguang Zhang
c0cd238e40
LLM: support llama2 8k input with w4a16. (#10677)
* LLM: support llama2 8k input with w4a16.

* fix comment and style.

* fix style.

* fix comments and split tensor to quantized attention forward.

* fix style.

* refactor name.

* fix style.

* fix style.

* fix style.

* refactor checker name.

* refactor native sdp split qkv tensor name.

* fix style.

* fix comment rename variables.

* fix co-exist of intermedia results.
2024-04-08 11:43:15 +08:00
Zhicun
321bc69307
Fix llamaindex ut (#10673)
* fix llamaindex ut

* add GPU ut
2024-04-08 09:47:51 +08:00
yb-peng
2d88bb9b4b
add test api transformer_int4_fp16_gpu (#10627)
* add test api transformer_int4_fp16_gpu

* update config.yaml and README.md in all-in-one

* modify run.py in all-in-one

* re-order test-api

* re-order test-api in config

* modify README.md in all-in-one

* modify README.md in all-in-one

* modify config.yaml

---------

Co-authored-by: pengyb2001 <arda@arda-arc21.sh.intel.com>
Co-authored-by: ivy-lv11 <zhicunlv@gmail.com>
2024-04-07 15:47:17 +08:00
Wang, Jian4
47cabe8fcc
LLM: Fix no return_last_logit running bigdl_ipex chatglm3 (#10678)
* fix no return_last_logits

* update only for chatglm
2024-04-07 15:27:58 +08:00
Wang, Jian4
9ad4b29697
LLM: CPU benchmark using tcmalloc (#10675) 2024-04-07 14:17:01 +08:00
binbin Deng
d9a1153b4e
LLM: upgrade deepspeed in AutoTP on GPU (#10647) 2024-04-07 14:05:19 +08:00
Jin Qiao
56dfcb2ade
Migrate portable zip to ipex-llm (#10617)
* change portable zip prompt to ipex-llm

* fix chat with ui

* add no proxy
2024-04-07 13:58:58 +08:00
Zhicun
9d8ba64c0d
Llamaindex: add tokenizer_id and support chat (#10590)
* add tokenizer_id

* fix

* modify

* add from_model_id and from_mode_id_low_bit

* fix typo and add comment

* fix python code style

---------

Co-authored-by: pengyb2001 <284261055@qq.com>
2024-04-07 13:51:34 +08:00
Jin Qiao
10ee786920
Replace with IPEX-LLM in example comments (#10671)
* Replace with IPEX-LLM in example comments

* More replacement

* revert some changes
2024-04-07 13:29:51 +08:00
Xiangyu Tian
08018a18df
Remove not-imported MistralConfig (#10670) 2024-04-07 10:32:05 +08:00
Cengguang Zhang
1a9b8204a4
LLM: support int4 fp16 chatglm2-6b 8k input. (#10648) 2024-04-07 09:39:21 +08:00
Jiao Wang
69bdbf5806
Fix vllm print error message issue (#10664)
* update chatglm readme

* Add condition to invalidInputError

* update

* update

* style
2024-04-05 15:08:13 -07:00
Jason Dai
29d97e4678
Update readme (#10665) 2024-04-05 18:01:57 +08:00
Xin Qiu
4c3e493b2d
fix stablelm2 1.6b (#10656)
* fix stablelm2 1.6b

* meet code review
2024-04-03 22:15:32 +08:00
Jin Qiao
cc8b3be11c
Add GPU and CPU example for stablelm-zephyr-3b (#10643)
* Add example for StableLM

* fix

* add to readme
2024-04-03 16:28:31 +08:00
Heyang Sun
6000241b10
Add Deepspeed Example of FLEX Mistral (#10640) 2024-04-03 16:04:17 +08:00
Shaojun Liu
d18dbfb097
update spr perf test (#10644) 2024-04-03 15:53:55 +08:00
Yishuo Wang
702e686901
optimize starcoder normal kv cache (#10642) 2024-04-03 15:27:02 +08:00
Xin Qiu
3a9ab8f1ae
fix stablelm logits diff (#10636)
* fix logits diff

* Small fixes

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2024-04-03 15:08:12 +08:00
Zhicun
b827f534d5
Add tokenizer_id in Langchain (#10588)
* fix low-bit

* fix

* fix style

---------

Co-authored-by: arda <arda@arda-arc12.sh.intel.com>
2024-04-03 14:25:35 +08:00
Zhicun
f6fef09933
fix prompt format for llama-2 in langchain (#10637) 2024-04-03 14:17:34 +08:00
Jiao Wang
330d4b4f4b
update readme (#10631) 2024-04-02 23:08:02 -07:00
Kai Huang
c875b3c858
Add seq len check for llama softmax upcast to fp32 (#10629) 2024-04-03 12:05:13 +08:00
Jiao Wang
4431134ec5
update readme (#10632) 2024-04-02 19:54:30 -07:00
Jiao Wang
23e33a0ca1
Fix qwen-vl style (#10633)
* update

* update
2024-04-02 18:41:38 -07:00
binbin Deng
2bbd8a1548
LLM: fix llama2 FP16 & bs>1 & autotp on PVC and ARC (#10611) 2024-04-03 09:28:04 +08:00
Jiao Wang
654dc5ba57
Fix Qwen-VL example problem (#10582)
* update

* update

* update

* update
2024-04-02 12:17:30 -07:00
Yuwen Hu
fd384ddfb8
Optimize StableLM (#10619)
* Initial commit for stablelm optimizations

* Small style fix

* add dependency

* Add mlp optimizations

* Small fix

* add attention forward

* Remove quantize kv for now as head_dim=80

* Add merged qkv

* fix lisence

* Python style fix

---------

Co-authored-by: qiuxin2012 <qiuxin2012cs@gmail.com>
2024-04-02 18:58:38 +08:00
binbin Deng
27be448920
LLM: add cpu_embedding and peak memory record for deepspeed autotp script (#10621) 2024-04-02 17:32:50 +08:00
Yishuo Wang
ba8cc6bd68
optimize starcoder2-3b (#10625) 2024-04-02 17:16:29 +08:00
Shaojun Liu
a10f5a1b8d
add python style check (#10620)
* add python style check

* fix style checks

* update runner

* add ipex-llm-finetune-qlora-cpu-k8s to manually_build workflow

* update tag to 2.1.0-SNAPSHOT
2024-04-02 16:17:56 +08:00