Cengguang Zhang
9930351112
LLM: add new qtype woq_int4 to support gemm int4 temporary. ( #12706 )
...
This PR add temporary qtype woq_int4 to avoid affecting other qtype and models.
Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2025-01-15 14:41:33 +08:00
Xu, Shuo
350fae285d
Add Qwen2-VL HF GPU example with ModelScope Support ( #12606 )
...
* Add qwen2-vl example
* complete generate.py & readme
* improve lint style
* update 1-6
* update main readme
* Format and other small fixes
---------
Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2025-01-13 15:42:04 +08:00
Yuwen Hu
a1da7908b9
Fix name device is not found bug ( #12703 )
2025-01-13 10:11:02 +08:00
Yishuo Wang
db9db51e2c
fix lnl perf ( #12700 )
2025-01-10 18:00:58 +08:00
binbin Deng
da8bcb7db1
[NPU ] fix load logic of glm-edge models ( #12698 )
2025-01-10 16:08:37 +08:00
Yishuo Wang
f8dc408888
fix user issue ( #12692 )
2025-01-10 10:18:47 +08:00
Yishuo Wang
68857494a5
refactor to simplify following upgrade 2 ( #12685 )
2025-01-10 09:29:03 +08:00
Yishuo Wang
7234c9b27b
update quantize kv cache condition ( #12681 )
2025-01-09 15:23:04 +08:00
Yishuo Wang
1ec40cd09e
refactor to simplify following upgrade ( #12680 )
2025-01-09 13:34:30 +08:00
Yishuo Wang
5c24276fc4
fix custom kernel registration ( #12674 )
2025-01-08 17:39:17 +08:00
Yishuo Wang
a22a8c21bb
small fix and remove ununsed code about ipex ( #12671 )
2025-01-08 17:39:04 +08:00
Yishuo Wang
c11f5f0fcd
also convert SdpaAttention in optimize_model ( #12673 )
2025-01-08 16:48:03 +08:00
Yishuo Wang
7dd156d292
small fix and add comment ( #12670 )
2025-01-08 10:56:50 +08:00
Yishuo Wang
ccf618ff4a
Remove all ipex usage ( #12666 )
2025-01-08 10:31:18 +08:00
Yishuo Wang
29ad5c449e
refactor codegeex to remove ipex kernel usage ( #12664 )
2025-01-07 16:17:40 +08:00
Yuwen Hu
ebdf19fa7e
[NPU] Further fix saving of generation config ( #12657 )
...
* Further fix saving of generation config
* Fix based on comments
* Small fix
2025-01-07 13:53:54 +08:00
Yishuo Wang
ddc0ef3993
refactor device check and remove cohere/mixtral support ( #12659 )
2025-01-07 11:15:51 +08:00
Yishuo Wang
ea65e4fecc
remove falcon support and related UT ( #12656 )
2025-01-07 09:26:00 +08:00
Yina Chen
fae73eee79
[NPU] Support save npu quantized model without npu dependency ( #12647 )
...
* support save awq
* load quantized model & save npu compiled model
* fix style
* update
* fix dll load issue
* update error message
* fix style
2025-01-06 18:06:22 +08:00
Yishuo Wang
502461d836
remove unnecessary ipex kernel usage ( #12649 )
2025-01-03 16:45:24 +08:00
Yishuo Wang
9f8b134889
add ipex-llm custom kernel registration ( #12648 )
2025-01-03 16:45:04 +08:00
Wang, Jian4
6711a48a36
Enable internvl2-8b on vllm( #12645 )
2025-01-03 14:49:36 +08:00
Zijie Li
8fd2dcba86
Add benchmark_util for transformers >= 4.47.0 ( #12644 )
2025-01-03 10:48:29 +08:00
Yina Chen
8e5328e9b4
add disable opts for awq ( #12641 )
2025-01-02 15:45:22 +08:00
Yishuo Wang
81211fd010
remove unused code ( #12635 )
2025-01-02 13:31:09 +08:00
binbin Deng
534566e290
[NPU] Support minicpm-v with python cpp backend ( #12637 )
2025-01-02 11:13:15 +08:00
Yishuo Wang
f289f68d57
small fix ( #12634 )
2024-12-30 17:14:25 +08:00
Yishuo Wang
2d08155513
remove bmm, which is only required in ipex 2.0 ( #12630 )
2024-12-27 17:28:57 +08:00
binbin Deng
f17ccfa61a
[NPU] Fix save-load usage of minicpm models ( #12628 )
2024-12-27 15:56:46 +08:00
Yishuo Wang
c72a5db757
remove unused code again ( #12624 )
2024-12-27 14:17:11 +08:00
binbin Deng
46eeab4479
[NPU] Fix regression caused by layer_norm change ( #12627 )
2024-12-27 14:08:49 +08:00
Yishuo Wang
34dbdb8ee3
small fix ( #12623 )
2024-12-27 10:19:27 +08:00
Ruonan Wang
bbdbbb0d88
[NPU] Compatible with other third-party models like auto-round ( #12620 )
...
* support third party model
* simplify code
* fix sty;e
* fix sym int4 GW
* code refactor
* fix
2024-12-26 17:25:18 +08:00
Yishuo Wang
a9abde0b5d
support passing attn_scale to sdpa ( #12619 )
2024-12-26 16:58:09 +08:00
Yishuo Wang
1604b4ead8
small fix ( #12616 )
2024-12-26 11:35:12 +08:00
Ruonan Wang
9e895f04ec
[NPU] fix npu save ( #12614 )
...
* fix npu save
* update
2024-12-26 09:21:16 +08:00
Yishuo Wang
6249c1e373
rewrite llama optimization ( #12609 )
2024-12-25 17:04:32 +08:00
Yishuo Wang
5f5ac8a856
fix llama related import ( #12611 )
2024-12-25 16:23:52 +08:00
Yishuo Wang
4e6b9d804f
add compresskv back for mistral ( #12607 )
...
* add compresskv back for mistral
* fix
* fix
2024-12-25 11:06:08 +08:00
Yishuo Wang
4135b895b3
refactor chatglm2, internlm, stablelm and qwen ( #12604 )
2024-12-24 18:18:00 +08:00
Yishuo Wang
073f936c37
refactor mistral and phi3 ( #12605 )
2024-12-24 17:52:32 +08:00
binbin Deng
45f8f72a28
[NPU] Fix minicpm on MTL ( #12599 )
2024-12-24 15:37:56 +08:00
Yishuo Wang
ad2dc965c5
refactor mllama, gpt2 and internvl ( #12602 )
2024-12-24 14:18:31 +08:00
Yishuo Wang
7aaf02f602
refactor baichuan, glm4 and minicpm3 ( #12600 )
2024-12-24 14:16:30 +08:00
Zijie Li
c410d9cf73
[NPU] support asym_int4 for baichuan ( #12576 )
...
* add npu support for baichuan
* Update baichuan_mp.py
* Update baichuan_mp.py
2024-12-24 09:17:50 +08:00
Yishuo Wang
098eb335b2
refactor sd 1.5 and qwen2-vl and fix ( #12590 )
2024-12-20 17:34:55 +08:00
Yishuo Wang
b050368efc
refactor yuan2 and starcoder2 and fix ( #12589 )
2024-12-20 16:41:50 +08:00
Yishuo Wang
6ea8033635
refactor glm edge ( #12588 )
2024-12-20 15:36:57 +08:00
Yishuo Wang
f3b5fad3be
refactor qwen2 and llama3 ( #12587 )
2024-12-20 13:25:25 +08:00
Yishuo Wang
3eeb02f1be
support Megrez-3B-Omni ( #12582 )
2024-12-19 17:23:01 +08:00
binbin Deng
4e7e988f70
[NPU] Fix MTL and ARL support ( #12580 )
2024-12-19 16:55:30 +08:00
Yishuo Wang
80f2fdc37b
optimize new minicpm model ( #12579 )
2024-12-19 14:22:47 +08:00
Yishuo Wang
4540424271
optimize siglip attention again ( #12578 )
2024-12-19 13:40:48 +08:00
Yishuo Wang
e0921f80c1
padding mask on torch side ( #12577 )
2024-12-19 10:53:02 +08:00
Yishuo Wang
e2ae42929a
small fix ( #12573 )
2024-12-18 15:48:22 +08:00
Yishuo Wang
a4eb561f36
optimize siglip attention on arc ( #12569 )
2024-12-18 14:19:43 +08:00
Zijie Li
1a2ab12876
[NPU] support asym_int4 for minicpm ( #12567 )
2024-12-18 10:55:35 +08:00
Zijie Li
fcb474820d
[NPU] support asym_int4 for llama ( #12556 )
...
* add llama-imatrix
* fix bugs in llama.py
* style fix
2024-12-17 14:01:17 +08:00
Yishuo Wang
a608f26cc8
use new fused layer norm ( #12553 )
2024-12-17 13:52:35 +08:00
Yishuo Wang
5ae0006103
remove old rope usage ( #12552 )
2024-12-16 15:59:36 +08:00
binbin Deng
caf15cc5ef
[NPU] Add IPEX_LLM_NPU_MTL to enable support on mtl ( #12543 )
2024-12-13 17:01:13 +08:00
Yishuo Wang
c090d167dc
remove old rope usage ( #12544 )
2024-12-13 16:54:58 +08:00
Yishuo Wang
15219944b8
optimize glm edge again ( #12539 )
2024-12-13 13:52:39 +08:00
binbin Deng
6596c18489
[NPU] Modify IPEX_LLM_NPU_DISABLE_COMPILE_OPT setting for long input ( #12537 )
2024-12-13 13:49:56 +08:00
Ruonan Wang
7cc01fdc86
[NPU] further fix of new_value_states ( #12538 )
2024-12-13 13:42:00 +08:00
binbin Deng
f36c23664f
[NPU] Fix abnormal output with latest driver ( #12530 )
2024-12-12 17:56:30 +08:00
Yishuo Wang
ffce86d69f
add basic glm-edge-v support ( #12533 )
2024-12-12 17:25:48 +08:00
Yishuo Wang
3e0823d2ae
add basic glm-edge support ( #12531 )
2024-12-12 16:02:22 +08:00
Yuwen Hu
dbaf4abcb3
[NPU] Update C++ example with repetition_penalty & update Python code accordingly ( #12528 )
...
* Update c++ npu examples with repetition penalty
* Fit python with updated C++ API
* Style fix
* Small fix
* Small fix
2024-12-12 13:42:55 +08:00
Shaojun Liu
2cce89691a
Enable use_batch_forward Optimization on Battlemage GPU ( #12516 )
...
* Update get_xpu_device_type() to support bmg
* enable use_batch_forward for bmg
* Update low_bit_linear.py
* Update utils.py
* use batch kernel for fp8e5
2024-12-12 12:44:36 +08:00
binbin Deng
509bdb4661
[NPU] Fix minicpm-2B error ( #12527 )
2024-12-11 16:49:32 +08:00
Ruonan Wang
41ef4974ab
[NPU] fix transpose_value = False for NPU optimize_model=True ( #12525 )
2024-12-11 15:51:39 +08:00
Ruonan Wang
588bfa24dc
support hqq ( #12518 )
...
* support
* fix
2024-12-11 15:43:02 +08:00
Yuwen Hu
68f2873bd3
[NPU] Support repetition penalty for simple generate, Python (cpp backend) ( #12522 )
...
* Initial support of repetition penalty on NPU (cpp backend) for simple generate
* Bug fix for generation config and others
* Remove unnecessary print and style fix
* Remove unnecessary print
* Fix based on comments
2024-12-11 14:55:25 +08:00
Yishuo Wang
77404d2a63
support new model ( #12523 )
2024-12-11 13:41:15 +08:00
binbin Deng
ea55235cbd
[NPU] Support glm-edge models ( #12511 )
2024-12-09 14:06:27 +08:00
Yuwen Hu
0918d3baca
[NPU] Fix hf generate with save/load generation config for Python (cpp backend) ( #12509 )
...
* Fix hf generate with save/load generation config
* Small fix
* Fix based on comments
2024-12-05 19:19:58 +08:00
Ruonan Wang
49ab8974fa
[NPU] initial support of asym_int4_rtn ( #12484 )
...
* initiail support of q4_1
* fix
* fix
* update
* update min to Z1
* update
* fix
* update
* fix style
* fix
* support qwen2 optimize_model=True mp version
* temp save
* fix
* fix style
* replace min with zero
* support split linear for q4_1
* fix lm_head with mixed_precision=True
* fix style
* revert test code
* add down proj back for q4_0
* remove print
2024-12-05 17:40:36 +08:00
Yuwen Hu
84f1c4ad57
Small fix for NPU Python cpp simple generate regarding eos tokens ( #12501 )
2024-12-04 18:54:06 +08:00
Kai Huang
d8b14a6305
Update save/load comments ( #12500 )
2024-12-04 18:51:38 +08:00
Kai Huang
b89ea1b0cf
Support save/load model for hf generate ( #12499 )
...
* change dummy model
* style
* meet review
2024-12-04 18:26:39 +08:00
Kai Huang
7d27f134dd
Fix hf generate for llama3.2 ( #12497 )
...
* fix kv condition]
* meet review
2024-12-04 17:54:40 +08:00
Yishuo Wang
a9e3f7f14c
optimize minicpm ( #12496 )
2024-12-04 17:14:16 +08:00
Yishuo Wang
e0bf0054e1
small fix ( #12493 )
2024-12-04 16:37:39 +08:00
Kai Huang
7ff4533b39
Support hf generate ( #12477 )
...
* generate
* style
* update
* remove timing
* style
* style
* combine generate api
* simple in kwargs
2024-12-04 16:31:09 +08:00
Yuwen Hu
ef4028ac2d
[NPU] Support split lm_head for Qwen2 with CPP ( #12491 )
...
* Use split for Qwen2 lm_head instead of slice in optimize_pre
* Support split lm_head in Qwen2 python cpp backend
* Fit with Python acc lib pipeline
* Removed default mixed_precision=True in all-in-one and related examples
* Small fix
* Style fix
* Fix based on comments
* Fix based on comments
* Stype fix
2024-12-04 14:41:08 +08:00
Yishuo Wang
5629fdd518
optimize qwen2_vl multiple image input or video input ( #12487 )
2024-12-04 09:24:38 +08:00
binbin Deng
c59284418c
Hotfix of BCE-Emdedding model ( #12490 )
2024-12-03 18:16:04 +08:00
Yuwen Hu
4ac66db034
[NPU] Support streaming in Python (cpp backend) ( #12488 )
...
* Support streaming in NPU Python (cpp backend)
* Small fix
2024-12-03 17:17:26 +08:00
Jin, Qiao
5fe766788e
Fix MiniCPM-V-2_6 running on NPU ( #12486 )
2024-12-03 16:16:29 +08:00
Ruonan Wang
598603bea6
small fix of imatrix ( #12480 )
2024-12-03 10:46:36 +08:00
binbin Deng
ab01753b1c
[NPU] update save-load API usage ( #12473 )
2024-12-03 09:46:15 +08:00
Yuwen Hu
26adb82ee3
[NPU] Remove hard code ( #12479 )
2024-12-02 18:26:07 +08:00
Jin, Qiao
31c69a8d31
Fix MiniCPM-V models running on NPU ( #12478 )
2024-12-02 16:29:46 +08:00
binbin Deng
54d9a590d4
[NPU]Fix eos_token setting ( #12475 )
2024-12-02 14:18:22 +08:00
Guancheng Fu
59bd4a214f
add vLLM glm4 fix ( #12474 )
2024-12-02 14:05:16 +08:00
Ruonan Wang
4b6c3160be
Support imatrix-guided quantization for NPU CW ( #12468 )
...
* init commit
* remove print
* add interface
* fix
* fix
* fix style
2024-12-02 11:31:26 +08:00
binbin Deng
c911026f03
[NPU C++] Update model support & examples & benchmark ( #12466 )
2024-11-29 13:35:58 +08:00
binbin Deng
14d8d3d8af
Integrate NPU C++ imple into ipex-llm ( #12461 )
2024-11-29 09:25:37 +08:00
Ruonan Wang
490bb0ca53
[NPU] update fused layers for GW ( #12459 )
...
* update fused layers for GW
* fix
* fix llama condition for glm model
* update
2024-11-28 17:14:30 +08:00