Commit graph

451 commits

Author SHA1 Message Date
Yuwen Hu
659d15defc
Fix wrong attention mask and garbage output for inputs_embeds inputs during lookup generation (#11989)
* Fix garbage output for input_embeds inputs during lookup generation

* Fix on sliding windows

* Simplify code
2024-09-02 19:09:12 +08:00
binbin Deng
2f3d1bd0ec
hotfix qwen2-7b weight setting (#11991) 2024-09-02 18:11:08 +08:00
binbin Deng
a40ea7038d
Fix AttributeError of qwen2-1.5B (#11990) 2024-09-02 17:55:10 +08:00
Yang Wang
c48817bd43
Support Qwen2-7b MLP in int4 and transpose_value_cache=True (#11968) 2024-09-02 14:37:44 +08:00
Ruonan Wang
573c20bae6
fix npu lm_head cpu condition (#11976)
* fix

* fix

* fix

* fix stype

* fix style

* fix style
2024-08-30 17:11:26 +08:00
Ruonan Wang
60aa1a2c0f
Initial NPU support for MiniCPM-V-2_6 (#11966)
* initial pr

* update npu model

* fix

* fix kv cache type

* fix

* small fix

* fix style

* fix model id

* change inter_pp=4

* address comment

* fix

* fix style

* fix

* rebase
2024-08-30 16:34:35 +08:00
SONG Ge
158289d205
[NPU] Add initial support for minicpm-llama-v2.5 (#11962)
* add initial support for minicpm-llama-v2.5

* update impl

* add minicpm-llama3-v2.5 example
2024-08-30 16:00:33 +08:00
binbin Deng
cd077881f1
Disable lm head (#11972) 2024-08-30 11:05:18 +08:00
Yang Wang
fbf088f61e
remove obselete npu code (#11967) 2024-08-29 14:16:44 -07:00
Yuwen Hu
a9e485eb1b
Support MiniCPM-V-2_6 multi-modal benchmarking with latency text streamer (#11963)
* Support MiniCPM-V-2_6 multi-modal benchmarking with latency text streamer

* Style fixes
2024-08-29 19:22:09 +08:00
Yina Chen
882f4a5ff7
Add lnl npu driver recommend version and enable cpu_lm_head on llama3 (#11952)
* update lnl npu driver version and enable cpu_lm_head on llama3

* update

* fix style

* typo

* address comments

* update

* add qwen2-7b
2024-08-29 15:01:18 +08:00
binbin Deng
71f03dcc39
Support qwen2-7b with fused decoderlayer optimization on NPU (#11912) 2024-08-29 13:34:20 +08:00
Jiao Wang
63ac5f64bb
Refactor NPU baichuan multiple-process (#11945)
* update

* add baichuan mp

* clean

* refactor

* merge

* style

* update

* update
2024-08-28 11:33:40 -07:00
SONG Ge
5ca7390082
[NPU] Add minicpm-2b support for npu multi-processing (#11949)
* add minicpm-2b support

* update example for minicpm-2b

* add LNL NPU driver requirement in readme
2024-08-28 18:08:49 +08:00
Yishuo Wang
0fbb10259a
use sdp_causal to reduce internvl2-4b memory usage if set environment variable (#11953) 2024-08-28 17:35:05 +08:00
Guancheng Fu
0a7bd274e2
Add vllm awq loading logic (#11950)
* add vllm awq loading logic

* fix

* refine
2024-08-28 16:46:18 +08:00
Yina Chen
b38fb67bec
[NPU] lm head to cpu (#11943)
* lm head to cpu

* qwen2

* mv logic and add param to disable cpu_lm_head

* use env and lm_head opt to mp file

* fix

* update

* remove print
2024-08-28 16:34:07 +08:00
binbin Deng
bec00e2015
Improve baichuan2 NPU performance (#11942) 2024-08-27 18:37:08 +08:00
Zijie Li
90f692937d
Update npu baichuan2 (#11939) 2024-08-27 16:56:26 +08:00
Jiao Wang
b4b6ddf73c
NPU Baichuan2 Multi- Process example (#11928) 2024-08-27 15:25:49 +08:00
SONG Ge
e211a5b076
update minicpm to meet latest refactor (#11937) 2024-08-27 15:08:01 +08:00
binbin Deng
7c8c9a0670
Update benchmark script for NPU (#11932) 2024-08-27 14:41:14 +08:00
Zijie Li
6c3eb1e1e8
refactor from_pretrained API for NPU (#11927) 2024-08-27 09:50:30 +08:00
Yuwen Hu
c1d07bc626
Support streaming for lookup generation (#11922)
* Support streaming for lookup generation

* Small update

* Style fixes

* Add origin generate full back for batch inference and beam search; support input length threshold judgement for directly input with input_ids

* Fix lookup stream generate with eos token

* Small fixes

* Small fix

* index fix

* Small fix
2024-08-26 19:33:31 +08:00
SONG Ge
019f725d4d
[NPU] Add support for running mp minicpm model on npu (#11909)
* add initial support for npu minicpm mp

* fix minicpm-1b abnormal output error
2024-08-26 17:52:55 +08:00
Yuwen Hu
24c279e0ae
Update IPEX_LLM_PERFORMANCE_MODE with input length threshold (#11908)
* Update IPEX_LLM_PERFORMANCE_MODE with input length threshold

* Update based on comments. And and judgement for inputs_embeds

* Fix for benchmarking purposes

* Update based on comments

* Small fix
2024-08-23 20:49:15 +08:00
binbin Deng
303a090a6b
Add lm_head optimization on NPU (#11903) 2024-08-23 15:51:07 +08:00
Yina Chen
23631cd357
disable lm_head opt for baichuan2-13b (#11905) 2024-08-23 15:39:47 +08:00
hxsz1997
650e6e6ce4
Merge pull request #11891 from hxsz1997/baichuan2-compresskv
Add compress_kv for Baichuan2
2024-08-23 06:09:58 +03:00
Ruonan Wang
4a61f7d20d
update mlp of llama (#11897)
* update mlp of llama

* relax threshold of  mlp test

* revert code
2024-08-22 20:34:53 +08:00
Yuwen Hu
420ce7d164
Fix non-stop at eos token problem for lookup generation (#11896)
* Fix non-stop by eos_token_id problem for lookup

* Small fix

* Add judgement when generation_config.eos_token_id is None

* Fix based on comments
2024-08-22 18:55:59 +08:00
Huang, Xinshengzi
4cf03d6212 update baichuan-7b 2024-08-22 18:16:33 +08:00
Guancheng Fu
278b191dc1
Fix optimize lm head error (#11899) 2024-08-22 17:45:26 +08:00
Wang, Jian4
5c4ed00593
Add lightweight-serving whisper asr example (#11847)
* add asr init

* update for pp

* update style

* update readme

* update reamde
2024-08-22 15:46:28 +08:00
Huang, Xinshengzi
eb1e65f8a9 add comment 2024-08-22 15:14:47 +08:00
Huang, Xinshengzi
a2be3d7501 add comment of compress kv in attention forward 2024-08-22 15:11:55 +08:00
Huang, Xinshengzi
ce7de77085 add comment of change in model forward 2024-08-22 14:29:27 +08:00
Huang, Xinshengzi
42398a0045 add comment 2024-08-22 13:17:13 +08:00
Huang, Xinshengzi
48a827aa07 fix typos 2024-08-22 11:35:47 +08:00
Huang, Xinshengzi
8a5df93de2 fix typos 2024-08-22 11:33:07 +08:00
Huang, Xinshengzi
01ed397e7a fix typos 2024-08-22 11:31:25 +08:00
Huang, Xinshengzi
c6ed1c412d fix typos 2024-08-22 11:26:49 +08:00
Huang, Xinshengzi
2a0aa9271b fix typos 2024-08-22 11:23:22 +08:00
Huang, Xinshengzi
4adadddbbc fix typos 2024-08-22 11:12:23 +08:00
Huang, Xinshengzi
6a5ca17afc fix typoes 2024-08-22 11:09:58 +08:00
binbin Deng
72a7bf624b
Support qwen2-1.5b with fused decoderlayer optimization on NPU (#11888) 2024-08-22 11:09:12 +08:00
Huang, Xinshengzi
6bb9035788 fix typos 2024-08-22 11:08:48 +08:00
Huang, Xinshengzi
86248b0505 add compress_kv for baichuan2 2024-08-22 10:59:08 +08:00
Yina Chen
cc27321441
support chatglm4 in lookup (#11855) 2024-08-21 15:53:17 +08:00
Yina Chen
0236de3ac2
set IPEX_LLM_LAST_LM_HEAD=1 as default (#11885) 2024-08-21 15:06:12 +08:00
Yang Wang
209d42ab79
Refactor npu mp to make it easier to integrate new models (#11873)
* Refactor npu mp to make it easier to integrate new models

* fix style

* move layer functions to base
2024-08-20 20:58:47 -07:00
Yishuo Wang
bd1e490d62
fix phi3 (#11878) 2024-08-21 10:31:41 +08:00
Yang Wang
bdaeee1d63
Fix run_decoders bug (#11871) 2024-08-20 12:04:59 -07:00
Yina Chen
c3c058373f
Update compresskv model forward type logic (#11868)
* update

* fix
2024-08-20 18:11:37 +08:00
Yishuo Wang
d4ee0a89f3
optimize phi3 memory usage (#11867) 2024-08-20 17:32:51 +08:00
Yishuo Wang
2946420e14
add minicpmv 2.6 load_low_bit workaround (#11856) 2024-08-20 11:16:02 +08:00
Yang Wang
99b05ba1dc
separate prefill into a process (#11787)
* seperate prefill into a process

* using model.share_memory()

* might work

* worked

* use long prompt

* refactor

* cleanup

* fix bug

* clean up

* changable inter and intra process stages

* refactor

* add max output len

* fix npu_model changes that may cause generate down

* fix npu_model generate import error

* fix generare forward error

---------

Co-authored-by: sgwhat <ge.song@intel.com>
2024-08-19 17:53:36 +08:00
Yishuo Wang
9490781aec
optimize phi3 memory usage again (#11848) 2024-08-19 17:26:59 +08:00
Yina Chen
3cd4e87168
Support compress KV with quantize KV (#11812)
* update llama

* support llama 4.41

* fix style

* support minicpm

* support qwen2

* support minicpm & update

* support chatglm4

* support chatglm

* remove print

* add DynamicCompressFp8Cache & support qwen

* support llama

* support minicpm phi3

* update chatglm2/4

* small fix & support qwen 4.42

* remove print
2024-08-19 15:32:32 +08:00
Zhao Changmin
6841a9ac8f
fix load low bit com dtype (#11832) 2024-08-19 13:43:19 +08:00
Yuwen Hu
96796f95cb
Update all-in-one benchmark prompts for continuation task & lookup update for minicpmv (#11827)
* Update all-in-one benchmark prompts for continuation task

* Small fix

* Add pure-text benchmark support for minicpm-v-2_6

* Support lookahead for model.llm generate of minicpmv

* Add prompt reference

* Small update

* Small fix
2024-08-16 17:16:35 +08:00
Yishuo Wang
e966e85df8
force lm_head optimization in any model if set environment variable (#11830) 2024-08-16 16:48:45 +08:00
Yishuo Wang
17a0beb21f
optimize qwen2-audio again (#11825) 2024-08-16 11:11:35 +08:00
Yuwen Hu
9e9086cc2a
Update IPEX_LLM_PERFORMANCE_MODE (#11823) 2024-08-16 09:48:36 +08:00
Guancheng Fu
e70ae0638e
Fix vLLM not convert issues (#11817)
* Fix not convert issues

* refine
2024-08-15 19:04:05 +08:00
Yishuo Wang
750d4ad5dc
fix minicpm-v-2 fp16 (#11819) 2024-08-15 18:34:40 +08:00
Yishuo Wang
828ab16537
fix phi3 and minicpmv cpu (#11818) 2024-08-15 17:43:29 +08:00
Yishuo Wang
4e178f0c5d
rewrite minicpmv optimization (#11816) 2024-08-15 17:27:12 +08:00
Yishuo Wang
07b7f13982
support and optimize qwen2-audio (#11809) 2024-08-15 14:59:04 +08:00
Yishuo Wang
9a93808fc5
fix and optimize minicpm v 2 (#11799) 2024-08-14 17:27:23 +08:00
Yishuo Wang
3d6cfa291d
optimize minicpm v 2.5 (#11793) 2024-08-14 16:07:24 +08:00
Yuwen Hu
356281cb80
Further all-in-one benchmark update continuation task (#11784)
* Further update prompt for continuation task, and disable lookup candidate update strategy on MTL

* style fix
2024-08-14 14:39:34 +08:00
Ruonan Wang
43cca3be27
fix gemma2 runtime error caused by sliding window (#11788)
* fix runtime error

* revert workflow
2024-08-14 10:43:33 +08:00
Yang Wang
51bcac1229
follow up on experimental support of fused decoder layer for llama2 (#11785)
* clean up and support transpose value cache

* refine

* fix style

* fix style
2024-08-13 18:53:55 -07:00
Yishuo Wang
cb79dcda93
refactor llama convert to fix minicpm-v 2.5 optimization (#11783) 2024-08-14 09:29:57 +08:00
Yina Chen
7cd6ec9723
MiniCPM-V support compresskv (#11779)
* fix check error

* fix other models

* remove print
2024-08-13 19:03:40 +08:00
Qiyuan Gong
3998de14f0
Fix mistral forward_qkv in q4_0 (#11781)
* Fix mistral forward_qkv without self.rotary_emb.base in q4_0.
* Replace apply_rotary_pos_emb_no_cache_xpu with rotary_half_inplaced.
* Revert https://github.com/intel-analytics/ipex-llm/pull/11765
2024-08-13 16:48:19 +08:00
Heyang Sun
70c828b87c
deepspeed zero3 QLoRA finetuning (#11625)
* deepspeed zero3 QLoRA finetuning

* Update convert.py

* Update low_bit_linear.py

* Update utils.py

* Update qlora_finetune_llama2_13b_arch_2_card.sh

* Update low_bit_linear.py

* Update alpaca_qlora_finetuning.py

* Update low_bit_linear.py

* Update utils.py

* Update convert.py

* Update alpaca_qlora_finetuning.py

* Update alpaca_qlora_finetuning.py

* Update low_bit_linear.py

* Update deepspeed_zero3.json

* Update qlora_finetune_llama2_13b_arch_2_card.sh

* Update low_bit_linear.py

* Update low_bit_linear.py

* Update utils.py

* fix style

* fix style

* Update alpaca_qlora_finetuning.py

* Update qlora_finetune_llama2_13b_arch_2_card.sh

* Update convert.py

* Update low_bit_linear.py

* Update model.py

* Update alpaca_qlora_finetuning.py

* Update low_bit_linear.py

* Update low_bit_linear.py

* Update low_bit_linear.py
2024-08-13 16:15:29 +08:00
Yishuo Wang
a184b120c9
fix minicpm-v 2.5 (#11780) 2024-08-13 16:14:00 +08:00
Qiyuan Gong
a88c132e54
Reduce Mistral softmax memory only in low memory mode (#11775)
* Reduce Mistral softmax memory only in low memory mode
2024-08-13 14:50:54 +08:00
Yishuo Wang
aa861df066
use new fp32 softmax kernel (#11776) 2024-08-13 14:48:11 +08:00
binbin Deng
23d3acdc77
Add experimental support of fused decoder layer for llama2 (#11768) 2024-08-13 14:41:36 +08:00
Yishuo Wang
a1eb793f70
optimize minicpm v 2_6 firs token perf (#11770) 2024-08-13 09:51:18 +08:00
Yina Chen
841dbcdf3a
Fix compresskv with lookahead issue (#11767)
* fix compresskv + lookahead attn_mask qwen2

* support llama chatglm

* support mistral & chatglm

* address comments

* revert run.py
2024-08-12 18:53:55 +08:00
Xu, Shuo
1b05caba2b
Set mistral fuse rope to false except fp6 & fp16 (#11765)
* set mistral fuse rope to false except fp6 & fp16

* lint

* lint

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-08-12 17:25:07 +08:00
Ruonan Wang
8db34057b4
optimize lookahead init time (#11769) 2024-08-12 17:19:12 +08:00
Yishuo Wang
57d177738d
optimize minicpm-v-2_6 repetition penalty (#11763) 2024-08-12 14:10:10 +08:00
Ruonan Wang
66fe2ee464
initial support of IPEX_LLM_PERFORMANCE_MODE (#11754)
* add perf mode

* update

* fix style
2024-08-09 19:04:09 +08:00
Yina Chen
4b9c57cc60
Support compress kv with lookahead (#11752)
* support compress kv with lookahead

* enough kv miss param
2024-08-09 17:39:57 +08:00
Yishuo Wang
93455aac09
fix minicpm V 2.6 repeat output (#11753) 2024-08-09 17:39:24 +08:00
Ruonan Wang
7e917d6cfb
fix gptq of llama (#11749)
* fix gptq of llama

* small fix
2024-08-09 16:39:25 +08:00
Yina Chen
dd46c141bd
Phi3 support compresskv (#11733)
* phi3 support compresskv

* fix phi3 mtl error

* fix conflict with quant kv

* fix abnormal on mtl

* fix style

* use slide windows size to compress kv

* support sliding window

* fix style

* fix style

* temp: partial support quant kv

* support quant kv with compress kv, todo: model check

* temp

* fix style

* fix style

* remove prepare

* address comment

* default -> 1.8k
2024-08-09 15:43:43 +08:00
Qiyuan Gong
d8808cc2e3
Mistral apply_rotary_pos_emb_no_cache_xpu use rope_theta from config (#11747)
mistral-7B-instruct-v0.2 and mistral-7B-instruct-v0.1 use different rope_theta (0.2 is 1e, 0.1 is 1e5). Pass self.config.rope_theta to apply_rotary_pos_emb_no_cache_xpu to avoid output difference.
2024-08-09 10:35:51 +08:00
Yishuo Wang
54cc9353db
support and optimize minicpm-v-2_6 (#11738) 2024-08-07 18:21:16 +08:00
Yina Chen
e956e71fc1
fix conflict with quant kv (#11737) 2024-08-07 18:10:30 +08:00
Ruonan Wang
00a5574c8a
Use merge_qkv to replace fused_qkv for llama2 (#11727)
* update 4.38

* support new versions

* update

* fix style

* fix style

* update rope

* temp test sdpa

* fix style

* fix cpu ut
2024-08-07 18:04:01 +08:00
Yina Chen
d2abc9711b
Fix MTL 4k input qwen2 compresskv error (#11734)
* fix

* fix style
2024-08-07 16:21:57 +08:00
Yina Chen
a71ae7c22b
Support minicpm compresskv & modify default compresskv config & default enable compresskv on mtl 2.5k~4.5k (#11726)
* support minicpm & modify default & default enable on mtl 2.5k~4.5k

* fix style
2024-08-07 11:35:39 +08:00
Yishuo Wang
c093f7d980
fix phi3 (#11729) 2024-08-07 09:39:46 +08:00
Yishuo Wang
929675aa6b
support latest phi3 (#11721) 2024-08-06 15:52:55 +08:00