Commit graph

1998 commits

Author SHA1 Message Date
Yina Chen
ec465fbcd7
Add lookup generate in load_low_bit (#12243)
* add lookup generate in load_low_bit

* update comment
2024-10-22 15:51:52 +08:00
Yuwen Hu
b3df47486d
Fix Gemma 2 on LNL (#12240)
* Fix gemma 2 on LNL

* Python style fix
2024-10-21 18:25:53 +08:00
Yuwen Hu
5935b25622
Further update windows gpu perf test regarding results integrity check (#12232) 2024-10-18 18:15:13 +08:00
Yuwen Hu
b88c1df324
Add Llama 3.1 & 3.2 to Arc Performance test (#12225)
* Add llama3.1 and llama3.2 in arc perf (#12202)

* Add llama3.1 and llama3.2 in arc perf

* Uninstall trl after arc test on transformers>=4.40

* Fix arc llama3 perf (#12212)

* Fix pip uninstall

* Uninstall trl after test on transformers==4.43.1

* Fix llama3 arc perf (#12218)

---------

Co-authored-by: Jin, Qiao <89779290+JinBridger@users.noreply.github.com>
2024-10-17 21:12:45 +08:00
Yishuo Wang
9ea694484d
refactor to remove old rope usage (#12224) 2024-10-17 17:06:09 +08:00
Yishuo Wang
324bcb057e
refactor to reduce old rope usage (#12219) 2024-10-17 14:45:09 +08:00
Jiao Wang
667f0db466
Update Eagle example to Eagle2+ipex-llm integration (#11717)
* update to e2 example

* update

* update
2024-10-16 23:16:14 -07:00
Yishuo Wang
a4a758656a
refactor gemma to reduce old fuse rope usage (#12215) 2024-10-16 17:40:28 +08:00
Yishuo Wang
9104a168f6
refactor phi-2 to reduce old fuse rope usage (#12214) 2024-10-16 17:08:14 +08:00
Yishuo Wang
bb247e991b
refactor merge_qkv and attention_softmax (#12213) 2024-10-16 15:58:14 +08:00
Yishuo Wang
e279148aa0
optimize llama3.2 vision again (#12211) 2024-10-16 14:29:48 +08:00
Chu,Youcheng
f17cc4fdee
feat: add llama3.2-11b-vision in all in one (#12207)
* feat: add llama3.2-11b-vision in all in one

* fix: change model

* fix: change name

* fix: add a space

* fix: switch import
2024-10-16 10:32:11 +08:00
Yuwen Hu
c9ac39fc1e
Add Llama 3.2 to iGPU performance test (transformers 4.45) (#12209)
* Add Llama 3.2 to iGPU Perf (#12200)

* Add Llama 3.2 to iGPU Perf

* Downgrade accelerate after step

* Temporarily disable model for test

* Temporarily change ERRORLEVEL check (#12201)

* Restore llama3.2 perf (#12206)

* Revert "Temporarily change ERRORLEVEL check"

This reverts commit 909dbbc930ab4283737161a55bb32006e6ca1991.

* Revert "Temporarily disable model for test"

This reverts commit 95322dc3c6429aa836f21bda0b5ba8d9b48592f8.

---------

Co-authored-by: Jin, Qiao <89779290+JinBridger@users.noreply.github.com>
2024-10-15 17:44:46 +08:00
Yishuo Wang
f6611f9d3a
optimize llama3.2 vision attention again (#12204) 2024-10-15 16:08:20 +08:00
Yishuo Wang
9b81236a2e
optimize qwen2-vl vision (#12203) 2024-10-15 15:54:25 +08:00
Yishuo Wang
d5344587ab
optimize internvl2 vision model's attention (#12198) 2024-10-15 10:51:00 +08:00
Yuwen Hu
f8d1adc573
Fix Llama 3.2 & 3.1 on LNL (#12196) 2024-10-14 17:39:20 +08:00
Yuwen Hu
516b578104
Support cpp release for ARL on Windows (#12189)
* Support cpp Windows release for ARL

* Temp commit for test

* Remove temp commit
2024-10-14 17:20:31 +08:00
Zijie Li
7d80db710e
Add benchmark_util for transformers >= 4.44.0 (#12171)
* Create benchmark_util_4_45.py

* Update __init__.py

* Update lint-python

* Update benchmark_util_4_45.py

* Update benchmark_util_4_45.py

* Create benchmark_util_4_44.py
2024-10-14 15:40:12 +08:00
Jin, Qiao
8e35800abe
Add llama 3.1 in igpu perf (#12194) 2024-10-14 15:14:34 +08:00
Yuwen Hu
ddcdf47539
Support Windows ARL release (#12183)
* Support release for ARL

* Small fix

* Small fix to doc

* Temp for test

* Remove temp commit for test
2024-10-11 18:30:52 +08:00
Jinhe
f983f1a8f4
Add Qwen2-VL gpu example (#12135)
* qwen2-vl readme

* add qwen2-vl example

* fix

* fix

* fix

* add link

* Update regarding modules_to_not_convert and readme

* Further fix

* Small fix

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2024-10-11 18:25:23 +08:00
Ruonan Wang
310f18c8af
update NPU pipeline generate (#12182)
* update

* fix style
2024-10-11 17:39:20 +08:00
Shaojun Liu
724b2ae66d
add npu-level0 pipeline.dll to ipex-llm (#12181)
* add npu-level0 pipeline.dll to ipex-llm

* test

* update runner label

* fix

* update

* fix

* fix
2024-10-11 16:05:20 +08:00
Ruonan Wang
4d93bb81fe
Initial support of NPU level0 Model (#12177)
* first commit to support load dll and init llm pipeline

* add init generate

* fix style

* small updates

* fix style and check tokens number
2024-10-11 09:45:53 +08:00
Yuwen Hu
890662610b
Fix auto importer for LNL release (#12175) 2024-10-10 15:17:43 +08:00
Yishuo Wang
535bee5381
fix qwen2 vl again (#12174) 2024-10-10 13:50:01 +08:00
Yuwen Hu
aef1f671bd
Support LNL Windows release (#12169)
* Release for LNL on Windows

* Temp commit for release test

* Change option name

* Remove temp commit and change option name

* temp commit for test again

* Remove temp commit
2024-10-09 17:41:10 +08:00
Yishuo Wang
78d253165d
optimize qwen2 vl perf again (#12167) 2024-10-09 16:43:48 +08:00
Zijie Li
3d044dbf53
add llama3.2-vision Pytorch example (#12165) 2024-10-09 09:20:42 +08:00
Yishuo Wang
644af2a76e
add basic llama 3.2 vision support (#12163) 2024-10-08 10:46:48 +08:00
Ch1y0q
17c23cd759
add llama3.2 GPU example (#12137)
* add llama3.2 GPU example

* change prompt format reference url

* update

* add Meta-Llama-3.2-1B-Instruct sample output

* update wording
2024-09-29 14:41:54 +08:00
Yuwen Hu
f71b38a994
Update MiniCPM_V_26 GPU example with save & load (#12127) 2024-09-26 17:40:22 +08:00
Yishuo Wang
669ff1a97b
fix sd1.5 (#12129) 2024-09-26 17:15:16 +08:00
Yishuo Wang
a266528719
optimize llama 3.2 rope (#12128) 2024-09-26 16:08:10 +08:00
Yishuo Wang
584c3489e7
add basic support for llama3.2 (#12125) 2024-09-26 15:46:19 +08:00
Yishuo Wang
66f419f8b7
fix qwen2 vl (#12126) 2024-09-26 15:44:02 +08:00
Ch1y0q
2ea13d502f
Add minicpm3 gpu example (#12114)
* add minicpm3 gpu example

* update GPU example

* update

---------

Co-authored-by: Huang, Xinshengzi <xinshengzi.huang@intel.com>
2024-09-26 13:51:37 +08:00
Yishuo Wang
77af9bc5fa
support passing None to low_bit in optimize_model (#12121) 2024-09-26 11:09:35 +08:00
Yishuo Wang
47e0b83cbf
optimize sd 1.5 (#12119) 2024-09-25 15:45:13 +08:00
Jin, Qiao
2bedb17be7
Add Qwen2.5 NPU Example (#12110)
* Add Qwen2.5 NPU Example

* fix

* Merge qwen2.py and qwen2.5.py into qwen.py

* Fix description
2024-09-25 15:20:03 +08:00
Yishuo Wang
5d63aef60b
optimize qwen2 vl again (#12109) 2024-09-23 13:22:01 +08:00
Ruonan Wang
03bd01c99c
optimize npu qwen2 (#12107) 2024-09-20 19:46:16 +08:00
Jinhe
02399021d6
add npu load_low_bit api in all-in-one benchmark (#12103) 2024-09-20 17:56:08 +08:00
Yishuo Wang
9239fd4f12
add basic support and optimization for qwen2-vl (#12104) 2024-09-20 17:23:06 +08:00
Yuwen Hu
828fa01ad3
[NPU] Add mixed_precision for Qwen2 7B (#12098)
* Add mix_precision argument to control whether use INT8 lm_head for Qwen2-7B-Instruct

* Small fix

* Fixed on load low bit with mixed precision

* Small fix

* Update example accordingly

* Update for default prompt

* Update based on comments

* Final fix
2024-09-20 16:36:21 +08:00
Ch1y0q
2269768e71
add internvl2 example (#12102)
* add internvl2 example

* add to README.md

* update

* add link to zh-CN readme
2024-09-20 16:31:54 +08:00
Ruonan Wang
09b8c80d9d
update code for NPU qwen2 (#12094)
* update code

* fix
2024-09-20 15:58:32 +08:00
Jin, Qiao
db7500bfd4
Add Qwen2.5 GPU example (#12101)
* Add Qwen2.5 GPU example

* fix end line

* fix description
2024-09-20 15:55:57 +08:00
Yishuo Wang
54b973c744
fix ipex_llm import in transformers 4.45 (#12099) 2024-09-20 15:24:59 +08:00
Ch1y0q
9650bf616a
add transpose_value_cache for NPU benchmark (#12092)
* add `transpose_value_cache`

* update

* update
2024-09-19 18:45:05 +08:00
Yuwen Hu
f7fb3c896c
Update lm_head optimization for Qwen2 7B (#12090) 2024-09-18 17:02:02 +08:00
Xu, Shuo
ee33b93464
Longbench: NV code to ipex-llm (#11662)
* add nv longbench

* LongBench: NV code to ipex-llm

* amend

* add more models support

* amend

* optimize LongBench's user experience

* amend

* amend

* fix typo

* amend

* remove cuda related information & add a readme

* add license to python scripts & polish the readme

* amend

* amend

---------

Co-authored-by: cyita <yitastudy@gmail.com>
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2024-09-18 15:55:14 +08:00
Wang, Jian4
40e463c66b
Enable vllm load gptq model (#12083)
* enable vllm load gptq model

* update

* update

* update

* update style
2024-09-18 14:41:00 +08:00
Ruonan Wang
081af41def
[NPU] Optimize Qwen2 lm_head to use INT4 (#12072)
* temp save

* update

* fix

* fix

* Split lm_head into 7 parts & remove int8 for lm_head when sym_int4

* Simplify and add condition to code

* Small fix

* refactor some code

* fix style

* fix style

* fix style

* fix

* fix

* temp save

* refactor

* fix style

* further refactor

* simplify code

* meet code review

* fix style

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2024-09-14 15:26:46 +08:00
Ch1y0q
b4b8c3e495
add lowbit_path for generate.py, fix npu_model (#12077)
* add `lowbit_path` for `generate.py`, fix `npu_model`

* update `README.md`
2024-09-13 17:28:05 +08:00
Wang, Jian4
d703e4f127
Enable vllm multimodal minicpm-v-2-6 (#12074)
* enable minicpm-v-2-6

* add image_url readme
2024-09-13 13:28:35 +08:00
Ruonan Wang
48d9092b5a
upgrade OneAPI version for cpp Windows (#12063)
* update version

* update quickstart
2024-09-12 11:12:12 +08:00
Jinhe
e78e45ee01
update NPU readme: run conhost as administrator (#12066) 2024-09-11 17:54:04 +08:00
Jinhe
4ca330da15
Fix NPU load error message and add minicpm npu lowbit feat (#12064)
* fix npu_model raise sym_int4 error

* add load_lowbit

* remove print&perf
2024-09-11 16:56:35 +08:00
Jinhe
32e8362da7
added minicpm cpu examples (#12027)
* minicpm cpu examples

* add link for minicpm-2
2024-09-11 15:51:21 +08:00
Ruonan Wang
a0c73c26d8
clean NPU code (#12060)
* clean code

* remove time.perf_counter()
2024-09-11 15:10:35 +08:00
Wang, Jian4
c75f3dd874
vllm no padding glm4 to avoid nan error (#12062)
* no padding glm4

* add codegeex
2024-09-11 13:44:40 +08:00
Chu,Youcheng
649390c464
fix: textual and env variable adjustment (#12038) 2024-09-11 13:38:01 +08:00
Wang, Jian4
30a8680645
Update for vllm one card padding (#12058) 2024-09-11 10:52:55 +08:00
Zijie Li
c5fdfde1bd
fix npu-model prompt (#12057) 2024-09-11 10:06:45 +08:00
Yishuo Wang
d8c044e79d
optimize minicpm3 kv cache (#12052) 2024-09-10 16:51:21 +08:00
Wang, Jian4
5d3ab16a80
Add vllm glm and baichuan padding (#12053) 2024-09-10 15:57:28 +08:00
Guancheng Fu
69c8d36f16
Switching from vLLM v0.3.3 to vLLM 0.5.4 (#12042)
* Enable single card sync engine

* enable ipex-llm optimizations for vllm

* enable optimizations for lm_head

* Fix chatglm multi-reference problem

* Remove duplicate layer

* LLM: Update vLLM to v0.5.4 (#11746)

* Enable single card sync engine

* enable ipex-llm optimizations for vllm

* enable optimizations for lm_head

* Fix chatglm multi-reference problem

* update 0.5.4 api_server

* add dockerfile

* fix

* fix

* refine

* fix

---------

Co-authored-by: gc-fu <guancheng.fu@intel.com>

* Add vllm-0.5.4 Dockerfile (#11838)

* Update BIGDL_LLM_SDP_IGNORE_MASK in start-vllm-service.sh (#11957)

* Fix vLLM not convert issues (#11817) (#11918)

* Fix not convert issues

* refine

Co-authored-by: Guancheng Fu <110874468+gc-fu@users.noreply.github.com>

* Fix glm4-9b-chat nan error on vllm 0.5.4 (#11969)

* init

* update mlp forward

* fix minicpm error in vllm 0.5.4

* fix dependabot alerts (#12008)

* Update 0.5.4 dockerfile (#12021)

* Add vllm awq loading logic (#11987)

* [ADD] Add vllm awq loading logic

* [FIX] fix the module.linear_method path

* [FIX] fix quant_config path error

* Enable Qwen padding mlp to 256 to support batch_forward (#12030)

* Enable padding mlp

* padding to 256

* update style

* Install 27191 runtime in 0.5.4 docker image (#12040)

* fix rebase error

* fix rebase error

* vLLM: format for 0.5.4 rebase (#12043)

* format

* Update model_convert.py

* Fix serving docker related modifications (#12046)

* Fix undesired modifications (#12048)

* fix

* Refine offline_inference arguments

---------

Co-authored-by: Xiangyu Tian <109123695+xiangyuT@users.noreply.github.com>
Co-authored-by: Jun Wang <thoughts.times@gmail.com>
Co-authored-by: Wang, Jian4 <61138589+hzjane@users.noreply.github.com>
Co-authored-by: liu-shaojun <johnssalyn@outlook.com>
Co-authored-by: Shaojun Liu <61072813+liu-shaojun@users.noreply.github.com>
2024-09-10 15:37:43 +08:00
Ch1y0q
73a4360f3f
update lowbit path for baichuan2, qwen2, generate.py (#12051)
* update lowbit path for baichuan2, qwen2, `generate.py`

* update readme
2024-09-10 15:35:24 +08:00
Ruonan Wang
dc4af02b2a
Fix qwen2 1.5B NPU load error (#12049) 2024-09-10 14:41:18 +08:00
Yishuo Wang
abc370728c
optimize minicpm3 again (#12047) 2024-09-10 14:19:57 +08:00
Ch1y0q
f0061a9916
remove local import os to fix Baichuan NPU load issue (#12044) 2024-09-10 14:13:24 +08:00
Ruonan Wang
640998edea
update inter_pp of qwen2 (#12041) 2024-09-10 10:34:17 +08:00
Yishuo Wang
048b4590aa
add basic minicpm3 optimization (#12039) 2024-09-09 17:25:08 +08:00
Chu,Youcheng
16c658e732
LLM: add known issues to harness evaluation (#12036)
* feat: add known issue in harness

* fix: resolve comments

* fix: small fixes
2024-09-09 14:15:42 +08:00
Yishuo Wang
6cedb601e4
remove some useless code (#12035) 2024-09-06 17:51:08 +08:00
binbin Deng
d2e1b9aaff
Add input padding during prefill for qwen2-7b (#12033) 2024-09-06 16:39:59 +08:00
Yuwen Hu
f61b1785fb
Small update to NPU example readme (#12034)
* Small update to NPU example readme

* Small fix
2024-09-06 15:54:23 +08:00
Ruonan Wang
0d04531ae0
update NPU readme of Qwen2 (#12032)
* update readme

* update broadcast
2024-09-06 15:02:39 +08:00
Yang Wang
58555bd9de
Optimize broadcast for npu llama (#12028) 2024-09-06 13:28:20 +08:00
binbin Deng
5b18bb3c4a
Add recommend version for mtl npu (#12024) 2024-09-05 16:28:53 +08:00
binbin Deng
845e5dc89e
Support lm_head of minicpm-2b on NPU (#12019) 2024-09-05 16:19:22 +08:00
Ch1y0q
820f8a4554
add --lowbit-path option for NPU llama example (#12020)
* add option `--lowbit-path`

* add descriptions in `README.md` and formatting

* Update llama.py
2024-09-05 15:31:01 +08:00
Guoqiong Song
8803242f5c
fix llama on cpu (#12018) 2024-09-04 19:17:54 -07:00
Wang, Jian4
b3b2cd64b4
Support lightweight-serving glm-4v-9b (#11994)
* enable glm-4v-9b serving

* update readme

* update for no image input
2024-09-05 09:25:08 +08:00
Yishuo Wang
b1408a1f1c
fix UT (#12005) 2024-09-04 18:02:49 +08:00
Wang, Jian4
2b993ad479
vllm update for glm-4 model automatic not_convert (#12003) 2024-09-04 13:50:32 +08:00
Ruonan Wang
9eaff5e47d
add save & load support for NPU optimized model (#11999)
* add save & load support

* fix style
2024-09-03 20:53:22 +08:00
Yuwen Hu
6eb55653ba
Performance mode strategy update for input_embeds input (#11997) 2024-09-03 17:46:16 +08:00
Jinhe
164f47adbd
MiniCPM-V-2 & MiniCPM-Llama3-V-2_5 example updates (#11988)
* minicpm example updates

* --stream
2024-09-03 17:02:06 +08:00
Jin, Qiao
2e54f4402b
Rename MiniCPM-V-2_6 CPU example (#11998) 2024-09-03 16:50:42 +08:00
binbin Deng
01099f08ee
Revert prefill logic of qwen2-7b (#11992) 2024-09-03 14:45:01 +08:00
Yuwen Hu
659d15defc
Fix wrong attention mask and garbage output for inputs_embeds inputs during lookup generation (#11989)
* Fix garbage output for input_embeds inputs during lookup generation

* Fix on sliding windows

* Simplify code
2024-09-02 19:09:12 +08:00
binbin Deng
2f3d1bd0ec
hotfix qwen2-7b weight setting (#11991) 2024-09-02 18:11:08 +08:00
binbin Deng
a40ea7038d
Fix AttributeError of qwen2-1.5B (#11990) 2024-09-02 17:55:10 +08:00
Yang Wang
c48817bd43
Support Qwen2-7b MLP in int4 and transpose_value_cache=True (#11968) 2024-09-02 14:37:44 +08:00
Jin, Qiao
65e281bb29
Add MiniCPM-V cpu example (#11975)
* Add MiniCPM-V cpu example

* fix

* fix

* fix

* fix
2024-09-02 10:17:57 +08:00
Ruonan Wang
79978e6f36
update npu multimodal readme (#11979)
* update npu readme of multimodal

* small fix

* meet comment
2024-08-30 19:02:06 +08:00
Ruonan Wang
4811a490ef
small fix (#11978)
* fix

* meet comment
2024-08-30 17:55:15 +08:00
Ruonan Wang
573c20bae6
fix npu lm_head cpu condition (#11976)
* fix

* fix

* fix

* fix style

* fix style

* fix style
2024-08-30 17:11:26 +08:00
Ruonan Wang
60aa1a2c0f
Initial NPU support for MiniCPM-V-2_6 (#11966)
* initial pr

* update npu model

* fix

* fix kv cache type

* fix

* small fix

* fix style

* fix model id

* change inter_pp=4

* address comment

* fix

* fix style

* fix

* rebase
2024-08-30 16:34:35 +08:00
SONG Ge
158289d205
[NPU] Add initial support for minicpm-llama-v2.5 (#11962)
* add initial support for minicpm-llama-v2.5

* update impl

* add minicpm-llama3-v2.5 example
2024-08-30 16:00:33 +08:00
Chu,Youcheng
ae7302a654
add gptq option for ppl test (#11921)
* feat:add gptq for ppl

* fix: add an empty line

* fix: add an empty line

* fix: remove an empty line

* Resolve comments

* Resolve comments

* Resolve comments
2024-08-30 13:43:48 +08:00
binbin Deng
cd077881f1
Disable lm head (#11972) 2024-08-30 11:05:18 +08:00
Wang, Jian4
7d103417b8
Fix glm4-9b-chat nan error on vllm 0.3.3 (#11970)
* fix nan value

* update
2024-08-30 09:50:18 +08:00
Yang Wang
fbf088f61e
remove obsolete npu code (#11967) 2024-08-29 14:16:44 -07:00
Yuwen Hu
a9e485eb1b
Support MiniCPM-V-2_6 multi-modal benchmarking with latency text streamer (#11963)
* Support MiniCPM-V-2_6 multi-modal benchmarking with latency text streamer

* Style fixes
2024-08-29 19:22:09 +08:00
Yuwen Hu
2e49e1f8e9
Further fix for MiniCPM-V-2_6 example (#11965) 2024-08-29 19:14:13 +08:00
Jason Dai
431affd0a0
Update README.md (#11964) 2024-08-29 18:56:35 +08:00
binbin Deng
14b2c8dc32
Update qwen2-7b example script (#11961) 2024-08-29 18:25:17 +08:00
Yuwen Hu
7abe17d6f7
Update MiniCPM-V-2_6 Example (#11958)
* Update example scripts regarding warmup, stream generate, modules to not convert, etc.

* Update readme accordingly

* Fix based on comments

* Small fix

* Remove n_predict
2024-08-29 18:23:48 +08:00
Yina Chen
5f7ff76ea5
update troubleshooting (#11960) 2024-08-29 17:44:22 +08:00
Yina Chen
882f4a5ff7
Add lnl npu driver recommend version and enable cpu_lm_head on llama3 (#11952)
* update lnl npu driver version and enable cpu_lm_head on llama3

* update

* fix style

* typo

* address comments

* update

* add qwen2-7b
2024-08-29 15:01:18 +08:00
binbin Deng
71f03dcc39
Support qwen2-7b with fused decoderlayer optimization on NPU (#11912) 2024-08-29 13:34:20 +08:00
Jiao Wang
63ac5f64bb
Refactor NPU baichuan multiple-process (#11945)
* update

* add baichuan mp

* clean

* refactor

* merge

* style

* update

* update
2024-08-28 11:33:40 -07:00
SONG Ge
5ca7390082
[NPU] Add minicpm-2b support for npu multi-processing (#11949)
* add minicpm-2b support

* update example for minicpm-2b

* add LNL NPU driver requirement in readme
2024-08-28 18:08:49 +08:00
Yishuo Wang
0fbb10259a
use sdp_causal to reduce internvl2-4b memory usage if set environment variable (#11953) 2024-08-28 17:35:05 +08:00
Guancheng Fu
0a7bd274e2
Add vllm awq loading logic (#11950)
* add vllm awq loading logic

* fix

* refine
2024-08-28 16:46:18 +08:00
Yina Chen
b38fb67bec
[NPU] lm head to cpu (#11943)
* lm head to cpu

* qwen2

* mv logic and add param to disable cpu_lm_head

* use env and lm_head opt to mp file

* fix

* update

* remove print
2024-08-28 16:34:07 +08:00
hxsz1997
e23549f63f
Update llamaindex examples (#11940)
* modify rag.py

* update readme of gpu example

* update llamaindex cpu example and readme

* add llamaindex doc

* update note style

* import before instancing IpexLLMEmbedding

* update index in readme

* update links

* update link

* update related links
2024-08-28 14:03:44 +08:00
binbin Deng
bec00e2015
Improve baichuan2 NPU performance (#11942) 2024-08-27 18:37:08 +08:00
Zijie Li
90f692937d
Update npu baichuan2 (#11939) 2024-08-27 16:56:26 +08:00
binbin Deng
7f7f6c89f5
Quick fix benchmark script (#11938) 2024-08-27 15:29:27 +08:00
Jiao Wang
b4b6ddf73c
NPU Baichuan2 Multi-Process example (#11928) 2024-08-27 15:25:49 +08:00
SONG Ge
e211a5b076
update minicpm to meet latest refactor (#11937) 2024-08-27 15:08:01 +08:00
SONG Ge
a81a329a5f
[NPU] Add example for NPU multi-processing minicpm-1b model (#11935)
* add minicpm example
2024-08-27 14:57:46 +08:00
binbin Deng
7c8c9a0670
Update benchmark script for NPU (#11932) 2024-08-27 14:41:14 +08:00
Ch1y0q
730d9ec811
Add Qwen2-audio example (#11835)
* add draft for qwen2-audio

* update example for `Qwen2-Audio`

* update

* update

* add warmup
2024-08-27 13:35:24 +08:00
Shaojun Liu
b11b28e9a9
update CORE_XE_VERSION to 2.6.0 (#11929) 2024-08-27 13:10:13 +08:00
Yina Chen
e246f1e258
update llama3 npu example (#11933) 2024-08-27 13:03:18 +08:00
binbin Deng
14dddfc0d6
Update NPU example readme (#11931) 2024-08-27 12:44:58 +08:00
Zijie Li
6c3eb1e1e8
refactor from_pretrained API for NPU (#11927) 2024-08-27 09:50:30 +08:00
Xiangyu Tian
7ca557aada
LLM: Fix vLLM CPU convert error (#11926) 2024-08-27 09:22:19 +08:00
Yuwen Hu
c1d07bc626
Support streaming for lookup generation (#11922)
* Support streaming for lookup generation

* Small update

* Style fixes

* Add fallback to original generate for batch inference and beam search; support input length threshold judgement for direct input with input_ids

* Fix lookup stream generate with eos token

* Small fixes

* Small fix

* index fix

* Small fix
2024-08-26 19:33:31 +08:00
Yuwen Hu
a0bbd8e28d
All-in-one benchmark update regarding performance mode for input length threshold (#11920)
* All-in-one benchmark update regarding performance mode input length threshold

* typo fix
2024-08-26 18:52:13 +08:00
SONG Ge
019f725d4d
[NPU] Add support for running mp minicpm model on npu (#11909)
* add initial support for npu minicpm mp

* fix minicpm-1b abnormal output error
2024-08-26 17:52:55 +08:00
binbin Deng
dd303776cf
Add troubleshooting about transpose value setting 2024-08-26 16:06:32 +08:00
Yuwen Hu
24c279e0ae
Update IPEX_LLM_PERFORMANCE_MODE with input length threshold (#11908)
* Update IPEX_LLM_PERFORMANCE_MODE with input length threshold

* Update based on comments, and add judgement for inputs_embeds

* Fix for benchmarking purposes

* Update based on comments

* Small fix
2024-08-23 20:49:15 +08:00
binbin Deng
303a090a6b
Add lm_head optimization on NPU (#11903) 2024-08-23 15:51:07 +08:00
Yina Chen
23631cd357
disable lm_head opt for baichuan2-13b (#11905) 2024-08-23 15:39:47 +08:00
hxsz1997
650e6e6ce4
Merge pull request #11891 from hxsz1997/baichuan2-compresskv
Add compress_kv for Baichuan2
2024-08-23 06:09:58 +03:00
Ruonan Wang
4a61f7d20d
update mlp of llama (#11897)
* update mlp of llama

* relax threshold of mlp test

* revert code
2024-08-22 20:34:53 +08:00
Yuwen Hu
420ce7d164
Fix non-stop at eos token problem for lookup generation (#11896)
* Fix non-stop by eos_token_id problem for lookup

* Small fix

* Add judgement when generation_config.eos_token_id is None

* Fix based on comments
2024-08-22 18:55:59 +08:00
Huang, Xinshengzi
4cf03d6212
update baichuan-7b 2024-08-22 18:16:33 +08:00
Zijie Li
794abe2ce8
update npu-readme (#11900) 2024-08-22 17:49:35 +08:00
Guancheng Fu
278b191dc1
Fix optimize lm head error (#11899) 2024-08-22 17:45:26 +08:00
Shaojun Liu
c5b51d41fb
Update pypi tag to 2.2.0.dev0 (#11895) 2024-08-22 16:48:09 +08:00
Jinhe
18662dca1c
change 5 pytorch/huggingface models to fp16 (#11894) 2024-08-22 16:12:09 +08:00
Wang, Jian4
5c4ed00593
Add lightweight-serving whisper asr example (#11847)
* add asr init

* update for pp

* update style

* update readme

* update readme
2024-08-22 15:46:28 +08:00