Commit graph

2276 commits

Author SHA1 Message Date
Yuwen Hu
d11f257ee7
Add GPU example for MiniCPM-o-2_6 (#12735)
* Add init example for omni mode

* Small fix

* Small fix

* Add chat example

* Remove legacy link

* Further update link

* Add readme

* Small fix

* Update main readme link

* Update based on comments

* Small fix

* Small fix

* Small fix
2025-01-23 16:10:19 +08:00
Yuwen Hu
dcca522618
Remove sdpa available patch (#12734) 2025-01-22 17:22:28 +08:00
Xiangyu Tian
c9b6c94a59
vLLM: Update vLLM-cpu to v0.6.6-post1 (#12728)
Update vLLM-cpu to v0.6.6-post1
2025-01-22 15:03:01 +08:00
Ruonan Wang
78cca0a68c
[NPU] update llm-npu-cli example (#12729)
* update cli example

* add license

* rename

* update readme sample output
2025-01-22 09:59:27 +08:00
Yishuo Wang
6789e5d92f
small fix (#12727) 2025-01-21 17:27:18 +08:00
Yishuo Wang
085974e307
fix nf4 to cpu (#12722) 2025-01-21 09:23:22 +08:00
Yuwen Hu
9aa4be8ced
Update runtime configuration on MTL (#12720) 2025-01-20 11:06:37 +08:00
Yishuo Wang
bda87c21eb
add support and optimization for minicpmo audio part (#12716) 2025-01-16 16:39:00 +08:00
Yuwen Hu
534e0e6774
Update dependency for PyTorch 2.6 RC support for woq int4 (#12714) 2025-01-16 15:51:57 +08:00
Zhao Changmin
54d6328b3c
woq int4 fwd (#12711) 2025-01-16 15:48:05 +08:00
Yishuo Wang
b62734748f
add support and optimization for minicpmo vision part (#12713) 2025-01-16 14:51:00 +08:00
Yuwen Hu
c52bdff76b
Update Deepseek coder GPU example (#12712)
* Update Deepseek coder GPU example

* Fix based on comment
2025-01-16 14:05:31 +08:00
Yuwen Hu
9d65dcd7ef
Fix deepseek coder with linear rope type support on GPU (#12709)
* Fix deepseek coder with linear rope type

* Style fix

* Move to optimize_pre

* Small fix

* Small fix

* Small fix to not affect other cases

* Style fixes

* Update function name

* Small fix

* Small fix

* Small fix

* Fix for low transformers version first

* Style fix

* Small fix
2025-01-15 21:12:34 +08:00
Cengguang Zhang
9930351112
LLM: add new qtype woq_int4 to support gemm int4 temporarily. (#12706)
This PR adds a temporary qtype woq_int4 to avoid affecting other qtypes and models.

Co-authored-by: leonardozcm <leonardo1997zcm@gmail.com>
2025-01-15 14:41:33 +08:00
Xu, Shuo
350fae285d
Add Qwen2-VL HF GPU example with ModelScope Support (#12606)
* Add qwen2-vl example

* complete generate.py & readme

* improve lint style

* update 1-6

* update main readme

* Format and other small fixes

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2025-01-13 15:42:04 +08:00
Yuwen Hu
a1da7908b9
Fix name device is not found bug (#12703) 2025-01-13 10:11:02 +08:00
Yishuo Wang
db9db51e2c
fix lnl perf (#12700) 2025-01-10 18:00:58 +08:00
binbin Deng
da8bcb7db1
[NPU] fix load logic of glm-edge models (#12698) 2025-01-10 16:08:37 +08:00
Yishuo Wang
f8dc408888
fix user issue (#12692) 2025-01-10 10:18:47 +08:00
Yishuo Wang
68857494a5
refactor to simplify following upgrade 2 (#12685) 2025-01-10 09:29:03 +08:00
Yishuo Wang
7234c9b27b
update quantize kv cache condition (#12681) 2025-01-09 15:23:04 +08:00
Yuwen Hu
5d8081afbc
Remove dummy model from performance tests (#12682) 2025-01-09 14:50:17 +08:00
Yishuo Wang
1ec40cd09e
refactor to simplify following upgrade (#12680) 2025-01-09 13:34:30 +08:00
Yishuo Wang
5c24276fc4
fix custom kernel registration (#12674) 2025-01-08 17:39:17 +08:00
Yishuo Wang
a22a8c21bb
small fix and remove unused code about ipex (#12671) 2025-01-08 17:39:04 +08:00
Yishuo Wang
c11f5f0fcd
also convert SdpaAttention in optimize_model (#12673) 2025-01-08 16:48:03 +08:00
Yishuo Wang
7dd156d292
small fix and add comment (#12670) 2025-01-08 10:56:50 +08:00
Yishuo Wang
ccf618ff4a
Remove all ipex usage (#12666) 2025-01-08 10:31:18 +08:00
Yuwen Hu
5db6f9dcde
Add option with PyTorch 2.6 RC version for testing purposes (#12668)
* Add option with PyTorch 2.6 RC version for testing purposes

* Small update
2025-01-07 18:28:55 +08:00
Yishuo Wang
f9ee7898c8
fix onednn dependency bug (#12665) 2025-01-07 16:26:56 +08:00
Yishuo Wang
29ad5c449e
refactor codegeex to remove ipex kernel usage (#12664) 2025-01-07 16:17:40 +08:00
Yuwen Hu
525b0ee991
[NPU] Tiny fixes on examples (#12661) 2025-01-07 14:30:38 +08:00
Yuwen Hu
ebdf19fa7e
[NPU] Further fix saving of generation config (#12657)
* Further fix saving of generation config

* Fix based on comments

* Small fix
2025-01-07 13:53:54 +08:00
Yuwen Hu
381d448ee2
[NPU] Example & Quickstart updates (#12650)
* Remove model with optimize_model=False in NPU verified models tables, and remove related example

* Remove experimental in run optimized model section title

* Unify model table order & example cmd

* Move embedding example to separate folder & update quickstart example link

* Add Quickstart reference in main NPU readme

* Small fix

* Small fix

* Move save/load examples under NPU/HF-Transformers-AutoModels

* Add low-bit and polish arguments for LLM Python examples

* Small fix

* Add low-bit and polish arguments for Multi-Model examples

* Polish argument for Embedding models

* Polish argument for LLM CPP examples

* Add low-bit and polish argument for Save-Load examples

* Add accuracy tuning tips for examples

* Update NPU quickstart accuracy tuning with low-bit optimizations

* Add save/load section to quickstart

* Update CPP example sample output to EN

* Add installation regarding cmake for CPP examples

* Small fix

* Small fix

* Small fix

* Small fix

* Small fix

* Small fix

* Unify max prompt length to 512

* Change recommended low-bit for Qwen2.5-3B-Instruct to asym_int4

* Update based on comments

* Small fix
2025-01-07 13:52:41 +08:00
Yishuo Wang
ddc0ef3993
refactor device check and remove cohere/mixtral support (#12659) 2025-01-07 11:15:51 +08:00
Yishuo Wang
ea65e4fecc
remove falcon support and related UT (#12656) 2025-01-07 09:26:00 +08:00
Yina Chen
fae73eee79
[NPU] Support save npu quantized model without npu dependency (#12647)
* support save awq

* load quantized model & save npu compiled model

* fix style

* update

* fix dll load issue

* update error message

* fix style
2025-01-06 18:06:22 +08:00
Yishuo Wang
502461d836
remove unnecessary ipex kernel usage (#12649) 2025-01-03 16:45:24 +08:00
Yishuo Wang
9f8b134889
add ipex-llm custom kernel registration (#12648) 2025-01-03 16:45:04 +08:00
binbin Deng
0b377100c5
Add guide for save-load usage (#12498) 2025-01-03 16:30:15 +08:00
Wang, Jian4
6711a48a36
Enable internvl2-8b on vllm (#12645) 2025-01-03 14:49:36 +08:00
Zijie Li
8fd2dcba86
Add benchmark_util for transformers >= 4.47.0 (#12644) 2025-01-03 10:48:29 +08:00
Yina Chen
8e5328e9b4
add disable opts for awq (#12641) 2025-01-02 15:45:22 +08:00
Xu, Shuo
62318964fa
Update llama example information (#12640)
Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2025-01-02 13:48:39 +08:00
Yishuo Wang
81211fd010
remove unused code (#12635) 2025-01-02 13:31:09 +08:00
binbin Deng
534566e290
[NPU] Support minicpm-v with python cpp backend (#12637) 2025-01-02 11:13:15 +08:00
Yishuo Wang
f289f68d57
small fix (#12634) 2024-12-30 17:14:25 +08:00
Yishuo Wang
2d08155513
remove bmm, which is only required in ipex 2.0 (#12630) 2024-12-27 17:28:57 +08:00
binbin Deng
f17ccfa61a
[NPU] Fix save-load usage of minicpm models (#12628) 2024-12-27 15:56:46 +08:00
Yishuo Wang
c72a5db757
remove unused code again (#12624) 2024-12-27 14:17:11 +08:00
binbin Deng
46eeab4479
[NPU] Fix regression caused by layer_norm change (#12627) 2024-12-27 14:08:49 +08:00
Ruonan Wang
90f6709486
remove pipeline examples (#12626) 2024-12-27 13:42:28 +08:00
Zijie Li
5f04ed7254
[NPU] Update prompt format for baichuan2-pipeline (#12625) 2024-12-27 11:30:54 +08:00
Yishuo Wang
34dbdb8ee3
small fix (#12623) 2024-12-27 10:19:27 +08:00
Xu, Shuo
55ce091242
Add GLM4-Edge-V GPU example (#12596)
* Add GLM4-Edge-V examples

* polish readme

* revert wrong changes

* polish readme

* polish readme

* little polish in reference info and indent

* Small fix and sample output updates

* Update main readme

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2024-12-27 09:40:29 +08:00
binbin Deng
796ee571a5
[NPU doc] Update verified platforms (#12621) 2024-12-26 17:39:13 +08:00
Ruonan Wang
bbdbbb0d88
[NPU] Compatible with other third-party models like auto-round (#12620)
* support third party model

* simplify code

* fix style

* fix sym int4 GW

* code refactor

* fix
2024-12-26 17:25:18 +08:00
Yishuo Wang
a9abde0b5d
support passing attn_scale to sdpa (#12619) 2024-12-26 16:58:09 +08:00
Shaojun Liu
40a7d2b4f0
Consolidated C-Eval Benchmark Guide for Single-GPU and Multi-GPU Environments (#12618)
* run c-eval on multi-GPUs

* Update README.md
2024-12-26 15:23:32 +08:00
Zijie Li
ccc4055058
[NPU] Update prompt format for baichuan2 (#12615)
* Update baichuan2.py

* style fix
2024-12-26 11:41:37 +08:00
Yishuo Wang
1604b4ead8
small fix (#12616) 2024-12-26 11:35:12 +08:00
Ruonan Wang
d841e1dc0d
[NPU] update convert script based on latest usage (#12617) 2024-12-26 11:23:04 +08:00
Xu, Shuo
ef585d3360
Polish Readme for ModelScope-related examples (#12603) 2024-12-26 10:52:47 +08:00
Yishuo Wang
a596f1ae5f
remove bigdl-llm test to fix langchain UT (#12613) 2024-12-26 10:17:25 +08:00
Ruonan Wang
9e895f04ec
[NPU] fix npu save (#12614)
* fix npu save

* update
2024-12-26 09:21:16 +08:00
Yishuo Wang
6249c1e373
rewrite llama optimization (#12609) 2024-12-25 17:04:32 +08:00
Yishuo Wang
5f5ac8a856
fix llama related import (#12611) 2024-12-25 16:23:52 +08:00
Yishuo Wang
4e6b9d804f
add compresskv back for mistral (#12607)
* add compresskv back for mistral

* fix

* fix
2024-12-25 11:06:08 +08:00
Yishuo Wang
4135b895b3
refactor chatglm2, internlm, stablelm and qwen (#12604) 2024-12-24 18:18:00 +08:00
Yishuo Wang
073f936c37
refactor mistral and phi3 (#12605) 2024-12-24 17:52:32 +08:00
binbin Deng
45f8f72a28
[NPU] Fix minicpm on MTL (#12599) 2024-12-24 15:37:56 +08:00
Yishuo Wang
ad2dc965c5
refactor mllama, gpt2 and internvl (#12602) 2024-12-24 14:18:31 +08:00
Yishuo Wang
7aaf02f602
refactor baichuan, glm4 and minicpm3 (#12600) 2024-12-24 14:16:30 +08:00
Zijie Li
c410d9cf73
[NPU] support asym_int4 for baichuan (#12576)
* add npu support for baichuan

* Update baichuan_mp.py

* Update baichuan_mp.py
2024-12-24 09:17:50 +08:00
Yishuo Wang
098eb335b2
refactor sd 1.5 and qwen2-vl and fix (#12590) 2024-12-20 17:34:55 +08:00
Yishuo Wang
b050368efc
refactor yuan2 and starcoder2 and fix (#12589) 2024-12-20 16:41:50 +08:00
Yishuo Wang
6ea8033635
refactor glm edge (#12588) 2024-12-20 15:36:57 +08:00
Xu, Shuo
b0338c5529
Add --modelscope option for glm-v4 MiniCPM-V-2_6 glm-edge and internvl2 (#12583)
* Add --modelscope option for glm-v4 and MiniCPM-V-2_6

* glm-edge

* minicpm-v-2_6:don't use model_hub=modelscope when use lowbit; internvl2

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-12-20 13:54:17 +08:00
Yishuo Wang
f3b5fad3be
refactor qwen2 and llama3 (#12587) 2024-12-20 13:25:25 +08:00
Xu, Shuo
47da3c999f
Add --modelscope in GPU examples for minicpm, minicpm3, baichuan2 (#12564)
* Add --modelscope for more models

* minicpm

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-12-19 17:25:46 +08:00
Yishuo Wang
3eeb02f1be
support Megrez-3B-Omni (#12582) 2024-12-19 17:23:01 +08:00
binbin Deng
4e7e988f70
[NPU] Fix MTL and ARL support (#12580) 2024-12-19 16:55:30 +08:00
Yishuo Wang
80f2fdc37b
optimize new minicpm model (#12579) 2024-12-19 14:22:47 +08:00
Yishuo Wang
4540424271
optimize siglip attention again (#12578) 2024-12-19 13:40:48 +08:00
Yishuo Wang
e0921f80c1
padding mask on torch side (#12577) 2024-12-19 10:53:02 +08:00
Xu, Shuo
47e90a362f
Add --modelscope in GPU examples for glm4, codegeex2, qwen2 and qwen2.5 (#12561)
* Add --modelscope for more models

* improve readme

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-12-19 10:00:39 +08:00
Yishuo Wang
e2ae42929a
small fix (#12573) 2024-12-18 15:48:22 +08:00
Yishuo Wang
a4eb561f36
optimize siglip attention on arc (#12569) 2024-12-18 14:19:43 +08:00
Zijie Li
1a2ab12876
[NPU] support asym_int4 for minicpm (#12567) 2024-12-18 10:55:35 +08:00
Yuwen Hu
6278cafc25
Add setuptools as a basic dependency (#12563)
* Add setuptools as a basic dependency

* Remove unnecessary requirements of setuptools in example/unit/nightly tests
2024-12-17 16:56:41 +08:00
Zijie Li
fcb474820d
[NPU] support asym_int4 for llama (#12556)
* add llama-imatrix

* fix bugs in llama.py

* style fix
2024-12-17 14:01:17 +08:00
Yishuo Wang
a608f26cc8
use new fused layer norm (#12553) 2024-12-17 13:52:35 +08:00
binbin Deng
680ea7e4a8
[NPU doc] Update configuration for different platforms (#12554) 2024-12-17 10:15:09 +08:00
Xu, Shuo
ccc18eefb5
Add Modelscope option for chatglm3 on GPU (#12545)
* Add Modelscope option for GPU model chatglm3

* Update readme

* Update readme

* Update readme

* Update readme

* format update

---------

Co-authored-by: ATMxsp01 <shou.xu@intel.com>
2024-12-16 20:00:37 +08:00
Yishuo Wang
5ae0006103
remove old rope usage (#12552) 2024-12-16 15:59:36 +08:00
Chu,Youcheng
a86487c539
Add GLM-Edge GPU example (#12483)
* feat: initial commit

* generate.py and README updates

* Update link for main readme

* Update based on comments

* Small fix

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2024-12-16 14:39:19 +08:00
Jun Wang
0b953e61ef
[REFINE] graphmode code (#12540) 2024-12-16 09:17:01 +08:00
binbin Deng
caf15cc5ef
[NPU] Add IPEX_LLM_NPU_MTL to enable support on mtl (#12543) 2024-12-13 17:01:13 +08:00
Yishuo Wang
c090d167dc
remove old rope usage (#12544) 2024-12-13 16:54:58 +08:00
binbin Deng
d20a968ce2
[NPU] Fix generate example (#12541) 2024-12-13 14:07:24 +08:00
Yishuo Wang
15219944b8
optimize glm edge again (#12539) 2024-12-13 13:52:39 +08:00
binbin Deng
6596c18489
[NPU] Modify IPEX_LLM_NPU_DISABLE_COMPILE_OPT setting for long input (#12537) 2024-12-13 13:49:56 +08:00
Ruonan Wang
7cc01fdc86
[NPU] further fix of new_value_states (#12538) 2024-12-13 13:42:00 +08:00
Heyang Sun
fa261b8af1
torch 2.3 inference docker (#12517)
* torch 2.3 inference docker

* Update README.md

* add convert code

* rename image

* remove 2.1 and add graph example

* Update README.md
2024-12-13 10:47:04 +08:00
binbin Deng
f36c23664f
[NPU] Fix abnormal output with latest driver (#12530) 2024-12-12 17:56:30 +08:00
Yishuo Wang
ffce86d69f
add basic glm-edge-v support (#12533) 2024-12-12 17:25:48 +08:00
Yishuo Wang
3e0823d2ae
add basic glm-edge support (#12531) 2024-12-12 16:02:22 +08:00
Yuwen Hu
dbaf4abcb3
[NPU] Update C++ example with repetition_penalty & update Python code accordingly (#12528)
* Update c++ npu examples with repetition penalty

* Fit python with updated C++ API

* Style fix

* Small fix

* Small fix
2024-12-12 13:42:55 +08:00
Shaojun Liu
2cce89691a
Enable use_batch_forward Optimization on Battlemage GPU (#12516)
* Update get_xpu_device_type() to support bmg

* enable use_batch_forward for bmg

* Update low_bit_linear.py

* Update utils.py

* use batch kernel for fp8e5
2024-12-12 12:44:36 +08:00
binbin Deng
6fc27da9c1
[NPU] Update glm-edge support in docs (#12529) 2024-12-12 11:14:09 +08:00
binbin Deng
509bdb4661
[NPU] Fix minicpm-2B error (#12527) 2024-12-11 16:49:32 +08:00
Xu, Shuo
fd9cf767ed
All-in-one Benchmark run.py: Ignore error if import BenchmarkWrapper failed. (#12526) 2024-12-11 16:20:55 +08:00
Ruonan Wang
41ef4974ab
[NPU] fix transpose_value = False for NPU optimize_model=True (#12525) 2024-12-11 15:51:39 +08:00
Ruonan Wang
588bfa24dc
support hqq (#12518)
* support

* fix
2024-12-11 15:43:02 +08:00
Yuwen Hu
68f2873bd3
[NPU] Support repetition penalty for simple generate, Python (cpp backend) (#12522)
* Initial support of repetition penalty on NPU (cpp backend) for simple generate

* Bug fix for generation config and others

* Remove unnecessary print and style fix

* Remove unnecessary print

* Fix based on comments
2024-12-11 14:55:25 +08:00
Yishuo Wang
77404d2a63
support new model (#12523) 2024-12-11 13:41:15 +08:00
binbin Deng
ea55235cbd
[NPU] Support glm-edge models (#12511) 2024-12-09 14:06:27 +08:00
binbin Deng
12c78978dd
[NPU C++] Update example with conversation mode support (#12510) 2024-12-06 12:46:37 +08:00
Yuwen Hu
0918d3baca
[NPU] Fix hf generate with save/load generation config for Python (cpp backend) (#12509)
* Fix hf generate with save/load generation config

* Small fix

* Fix based on comments
2024-12-05 19:19:58 +08:00
Ruonan Wang
49ab8974fa
[NPU] initial support of asym_int4_rtn (#12484)
* initial support of q4_1

* fix

* fix

* update

* update min to Z1

* update

* fix

* update

* fix style

* fix

* support qwen2 optimize_model=True mp version

* temp save

* fix

* fix style

* replace min with zero

* support split linear for q4_1

* fix lm_head with mixed_precision=True

* fix style

* revert test code

* add down proj back for q4_0

* remove print
2024-12-05 17:40:36 +08:00
Jinhe
5e1416c9aa
fix readme for npu cpp examples and llama.cpp (#12505)
* fix cpp readme

* fix cpp readme

* fix cpp readme
2024-12-05 12:32:42 +08:00
binbin Deng
f56a111aa2
[NPU] Fix load-low-bit benchmark script (#12502) 2024-12-05 10:01:32 +08:00
Yuwen Hu
84f1c4ad57
Small fix for NPU Python cpp simple generate regarding eos tokens (#12501) 2024-12-04 18:54:06 +08:00
Kai Huang
d8b14a6305
Update save/load comments (#12500) 2024-12-04 18:51:38 +08:00
Kai Huang
b89ea1b0cf
Support save/load model for hf generate (#12499)
* change dummy model

* style

* meet review
2024-12-04 18:26:39 +08:00
Kai Huang
7d27f134dd
Fix hf generate for llama3.2 (#12497)
* fix kv condition

* meet review
2024-12-04 17:54:40 +08:00
Chu,Youcheng
ffa9a9e1b3
Update streaming in npu examples (#12495)
* feat: add streaming

* Update readme accordingly

---------

Co-authored-by: Yuwen Hu <yuwen.hu@intel.com>
2024-12-04 17:51:10 +08:00
Yishuo Wang
a9e3f7f14c
optimize minicpm (#12496) 2024-12-04 17:14:16 +08:00
Yishuo Wang
e0bf0054e1
small fix (#12493) 2024-12-04 16:37:39 +08:00
Kai Huang
7ff4533b39
Support hf generate (#12477)
* generate

* style

* update

* remove timing

* style

* style

* combine generate api

* simple in kwargs
2024-12-04 16:31:09 +08:00
Yuwen Hu
ef4028ac2d
[NPU] Support split lm_head for Qwen2 with CPP (#12491)
* Use split for Qwen2 lm_head instead of slice in optimize_pre

* Support split lm_head in Qwen2 python cpp backend

* Fit with Python acc lib pipeline

* Removed default mixed_precision=True in all-in-one and related examples

* Small fix

* Style fix

* Fix based on comments

* Fix based on comments

* Style fix
2024-12-04 14:41:08 +08:00
Yishuo Wang
5629fdd518
optimize qwen2_vl multiple image input or video input (#12487) 2024-12-04 09:24:38 +08:00
binbin Deng
c59284418c
Hotfix of BCE-Embedding model (#12490) 2024-12-03 18:16:04 +08:00
Yuwen Hu
4ac66db034
[NPU] Support streaming in Python (cpp backend) (#12488)
* Support streaming in NPU Python (cpp backend)

* Small fix
2024-12-03 17:17:26 +08:00
Jin, Qiao
7082844f3f
Fix NPU LLM example save/load tokenizer (#12485) 2024-12-03 16:30:55 +08:00
Jin, Qiao
5fe766788e
Fix MiniCPM-V-2_6 running on NPU (#12486) 2024-12-03 16:16:29 +08:00
Ruonan Wang
598603bea6
small fix of imatrix (#12480) 2024-12-03 10:46:36 +08:00
binbin Deng
ab01753b1c
[NPU] update save-load API usage (#12473) 2024-12-03 09:46:15 +08:00
Yuwen Hu
26adb82ee3
[NPU] Remove hard code (#12479) 2024-12-02 18:26:07 +08:00
Yuwen Hu
b2e56a2e03
Add release support for option xpu_arc (#12422)
* Add release support for xpu-arc

* Dependency update
2024-12-02 17:16:04 +08:00
Yuwen Hu
aee9acb303
Add NPU QuickStart & update example links (#12470)
* Add initial NPU quickstart (c++ part unfinished)

* Small update

* Update based on comments

* Update main readme

* Remove LLaMA description

* Small fix

* Small fix

* Remove subsection link in main README

* Small fix

* Update based on comments

* Small fix

* TOC update and other small fixes

* Update for Chinese main readme

* Update based on comments and other small fixes

* Change order
2024-12-02 17:03:10 +08:00
Jin, Qiao
31c69a8d31
Fix MiniCPM-V models running on NPU (#12478) 2024-12-02 16:29:46 +08:00
binbin Deng
54d9a590d4
[NPU] Fix eos_token setting (#12475) 2024-12-02 14:18:22 +08:00
Guancheng Fu
59bd4a214f
add vLLM glm4 fix (#12474) 2024-12-02 14:05:16 +08:00
Ruonan Wang
4b6c3160be
Support imatrix-guided quantization for NPU CW (#12468)
* init commit

* remove print

* add interface

* fix

* fix

* fix style
2024-12-02 11:31:26 +08:00
binbin Deng
f99f188023
Hotfix of benchmark script (#12467) 2024-11-29 14:00:59 +08:00
binbin Deng
c911026f03
[NPU C++] Update model support & examples & benchmark (#12466) 2024-11-29 13:35:58 +08:00
binbin Deng
14d8d3d8af
Integrate NPU C++ implementation into ipex-llm (#12461) 2024-11-29 09:25:37 +08:00
Ruonan Wang
490bb0ca53
[NPU] update fused layers for GW (#12459)
* update fused layers for GW

* fix

* fix llama condition for glm model

* update
2024-11-28 17:14:30 +08:00
Yina Chen
1b533a105c
[NPU] Add env to enable scale search (#12462)
* add env enable scale search

* address comment

* move logic
2024-11-28 17:06:00 +08:00