Commit graph

2081 commits

Author SHA1 Message Date
Chu,Youcheng
a86487c539
Add GLM-Edge GPU example (#12483)
* feat: initial commit

* generate.py and README updates

* Update link for main readme

* Update based on comments

* Small fix

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2024-12-16 14:39:19 +08:00
Jun Wang
0b953e61ef
[REFINE] graphmode code (#12540) 2024-12-16 09:17:01 +08:00
binbin Deng
caf15cc5ef
[NPU] Add IPEX_LLM_NPU_MTL to enable support on mtl (#12543) 2024-12-13 17:01:13 +08:00
Yishuo Wang
c090d167dc
remove old rope usage (#12544) 2024-12-13 16:54:58 +08:00
binbin Deng
d20a968ce2
[NPU] Fix generate example (#12541) 2024-12-13 14:07:24 +08:00
Yishuo Wang
15219944b8
optimize glm edge again (#12539) 2024-12-13 13:52:39 +08:00
binbin Deng
6596c18489
[NPU] Modify IPEX_LLM_NPU_DISABLE_COMPILE_OPT setting for long input (#12537) 2024-12-13 13:49:56 +08:00
Ruonan Wang
7cc01fdc86
[NPU] further fix of new_value_states (#12538) 2024-12-13 13:42:00 +08:00
Heyang Sun
fa261b8af1
torch 2.3 inference docker (#12517)
* torch 2.3 inference docker

* Update README.md

* add convert code

* rename image

* remove 2.1 and add graph example

* Update README.md
2024-12-13 10:47:04 +08:00
binbin Deng
f36c23664f
[NPU] Fix abnormal output with latest driver (#12530) 2024-12-12 17:56:30 +08:00
Yishuo Wang
ffce86d69f
add basic glm-edge-v support (#12533) 2024-12-12 17:25:48 +08:00
Yishuo Wang
3e0823d2ae
add basic glm-edge support (#12531) 2024-12-12 16:02:22 +08:00
Yuwen Hu
dbaf4abcb3
[NPU] Update C++ example with repetition_penalty & update Python code accordingly (#12528)
* Update c++ npu examples with repetition penalty

* Fit python with updated C++ API

* Style fix

* Small fix

* Small fix
2024-12-12 13:42:55 +08:00
Shaojun Liu
2cce89691a
Enable use_batch_forward Optimization on Battlemage GPU (#12516)
* Update get_xpu_device_type() to support bmg

* enable use_batch_forward for bmg

* Update low_bit_linear.py

* Update utils.py

* use batch kernel for fp8e5
2024-12-12 12:44:36 +08:00
binbin Deng
6fc27da9c1
[NPU] Update glm-edge support in docs (#12529) 2024-12-12 11:14:09 +08:00
binbin Deng
509bdb4661
[NPU] Fix minicpm-2B error (#12527) 2024-12-11 16:49:32 +08:00
Xu, Shuo
fd9cf767ed
All-in-one Benchmark run.py: Ignore error if import BenchmarkWrapper failed. (#12526) 2024-12-11 16:20:55 +08:00
Ruonan Wang
41ef4974ab
[NPU] fix transpose_value = False for NPU optimize_model=True (#12525) 2024-12-11 15:51:39 +08:00
Ruonan Wang
588bfa24dc
support hqq (#12518)
* support

* fix
2024-12-11 15:43:02 +08:00
Yuwen Hu
68f2873bd3
[NPU] Support repetition penalty for simple generate, Python (cpp backend) (#12522)
* Initial support of repetition penalty on NPU (cpp backend) for simple generate

* Bug fix for generation config and others

* Remove unnecessary print and style fix

* Remove unnecessary print

* Fix based on comments
2024-12-11 14:55:25 +08:00
Yishuo Wang
77404d2a63
support new model (#12523) 2024-12-11 13:41:15 +08:00
binbin Deng
ea55235cbd
[NPU] Support glm-edge models (#12511) 2024-12-09 14:06:27 +08:00
binbin Deng
12c78978dd
[NPU C++] Update example with conversation mode support (#12510) 2024-12-06 12:46:37 +08:00
Yuwen Hu
0918d3baca
[NPU] Fix hf generate with save/load generation config for Python (cpp backend) (#12509)
* Fix hf generate with save/load generation config

* Small fix

* Fix based on comments
2024-12-05 19:19:58 +08:00
Ruonan Wang
49ab8974fa
[NPU] initial support of asym_int4_rtn (#12484)
* initiail support of q4_1

* fix

* fix

* update

* update min to Z1

* update

* fix

* update

* fix style

* fix

* support qwen2 optimize_model=True mp version

* temp save

* fix

* fix style

* replace min with zero

* support split linear for q4_1

* fix lm_head with mixed_precision=True

* fix style

* revert test code

* add down proj back for q4_0

* remove print
2024-12-05 17:40:36 +08:00
Jinhe
5e1416c9aa
fix readme for npu cpp examples and llama.cpp (#12505)
* fix cpp readme

* fix cpp readme

* fix cpp readme
2024-12-05 12:32:42 +08:00
binbin Deng
f56a111aa2
[NPU] Fix load-low-bit benchmark script (#12502) 2024-12-05 10:01:32 +08:00
Yuwen Hu
84f1c4ad57
Small fix for NPU Python cpp simple generate regarding eos tokens (#12501) 2024-12-04 18:54:06 +08:00
Kai Huang
d8b14a6305
Update save/load comments (#12500) 2024-12-04 18:51:38 +08:00
Kai Huang
b89ea1b0cf
Support save/load model for hf generate (#12499)
* change dummy model

* style

* meet review
2024-12-04 18:26:39 +08:00
Kai Huang
7d27f134dd
Fix hf generate for llama3.2 (#12497)
* fix kv condition]

* meet review
2024-12-04 17:54:40 +08:00
Chu,Youcheng
ffa9a9e1b3
Update streaming in npu examples (#12495)
* feat: add streaming

* Update readme accordingly

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2024-12-04 17:51:10 +08:00
Yishuo Wang
a9e3f7f14c
optimize minicpm (#12496) 2024-12-04 17:14:16 +08:00
Yishuo Wang
e0bf0054e1
small fix (#12493) 2024-12-04 16:37:39 +08:00
Kai Huang
7ff4533b39
Support hf generate (#12477)
* generate

* style

* update

* remove timing

* style

* style

* combine generate api

* simple in kwargs
2024-12-04 16:31:09 +08:00
Yuwen Hu
ef4028ac2d
[NPU] Support split lm_head for Qwen2 with CPP (#12491)
* Use split for Qwen2 lm_head instead of slice in optimize_pre

* Support split lm_head in Qwen2 python cpp backend

* Fit with Python acc lib pipeline

* Removed default mixed_precision=True in all-in-one and related examples

* Small fix

* Style fix

* Fix based on comments

* Fix based on comments

* Stype fix
2024-12-04 14:41:08 +08:00
Yishuo Wang
5629fdd518
optimize qwen2_vl multiple image input or video input (#12487) 2024-12-04 09:24:38 +08:00
binbin Deng
c59284418c
Hotfix of BCE-Emdedding model (#12490) 2024-12-03 18:16:04 +08:00
Yuwen Hu
4ac66db034
[NPU] Support streaming in Python (cpp backend) (#12488)
* Support streaming in NPU Python (cpp backend)

* Small fix
2024-12-03 17:17:26 +08:00
Jin, Qiao
7082844f3f
Fix NPU LLM example save/load tokenizer (#12485) 2024-12-03 16:30:55 +08:00
Jin, Qiao
5fe766788e
Fix MiniCPM-V-2_6 running on NPU (#12486) 2024-12-03 16:16:29 +08:00
Ruonan Wang
598603bea6
small fix of imatrix (#12480) 2024-12-03 10:46:36 +08:00
binbin Deng
ab01753b1c
[NPU] update save-load API usage (#12473) 2024-12-03 09:46:15 +08:00
Yuwen Hu
26adb82ee3
[NPU] Remove hard code (#12479) 2024-12-02 18:26:07 +08:00
Yuwen Hu
b2e56a2e03
Add release support for option xpu_arc (#12422)
* Add release support for xpu-arc

* Dependency update
2024-12-02 17:16:04 +08:00
Yuwen Hu
aee9acb303
Add NPU QuickStart & update example links (#12470)
* Add initial NPU quickstart (c++ part unfinished)

* Small update

* Update based on comments

* Update main readme

* Remove LLaMA description

* Small fix

* Small fix

* Remove subsection link in main README

* Small fix

* Update based on comments

* Small fix

* TOC update and other small fixes

* Update for Chinese main readme

* Update based on comments and other small fixes

* Change order
2024-12-02 17:03:10 +08:00
Jin, Qiao
31c69a8d31
Fix MiniCPM-V models running on NPU (#12478) 2024-12-02 16:29:46 +08:00
binbin Deng
54d9a590d4
[NPU]Fix eos_token setting (#12475) 2024-12-02 14:18:22 +08:00
Guancheng Fu
59bd4a214f
add vLLM glm4 fix (#12474) 2024-12-02 14:05:16 +08:00
Ruonan Wang
4b6c3160be
Support imatrix-guided quantization for NPU CW (#12468)
* init commit

* remove print

* add interface

* fix

* fix

* fix style
2024-12-02 11:31:26 +08:00