Commit graph

31 commits

Author SHA1 Message Date
SONG Ge
e211a5b076
update minicpm to meet latest refactor (#11937) 2024-08-27 15:08:01 +08:00
Zijie Li
6c3eb1e1e8
refactor from_pretrained API for NPU (#11927) 2024-08-27 09:50:30 +08:00
SONG Ge
019f725d4d
[NPU] Add support for running mp minicpm model on npu (#11909)
* add initial support for npu minicpm mp

* fix minicpm-1b abnormal output error
2024-08-26 17:52:55 +08:00
binbin Deng
303a090a6b
Add lm_head optimization on NPU (#11903) 2024-08-23 15:51:07 +08:00
binbin Deng
72a7bf624b
Support qwen2-1.5b with fused decoderlayer optimization on NPU (#11888) 2024-08-22 11:09:12 +08:00
Yang Wang
209d42ab79
Refactor npu mp to make it easier to integrate new models (#11873)
* Refactor npu mp to make it easier to integrate new models

* fix style

* move layer functions to base
2024-08-20 20:58:47 -07:00
Yang Wang
bdaeee1d63
Fix run_decoders bug (#11871) 2024-08-20 12:04:59 -07:00
Yang Wang
99b05ba1dc
separate prefill into a process (#11787)
* separate prefill into a process

* using model.share_memory()

* might work

* worked

* use long prompt

* refactor

* cleanup

* fix bug

* clean up

* changeable inter and intra process stages

* refactor

* add max output len

* fix npu_model changes that may cause generate down

* fix npu_model generate import error

* fix generate forward error

---------

Co-authored-by: sgwhat <ge.song@intel.com>
2024-08-19 17:53:36 +08:00
Yang Wang
51bcac1229
follow up on experimental support of fused decoder layer for llama2 (#11785)
* clean up and support transpose value cache

* refine

* fix style

* fix style
2024-08-13 18:53:55 -07:00
binbin Deng
23d3acdc77
Add experimental support of fused decoder layer for llama2 (#11768) 2024-08-13 14:41:36 +08:00
binbin Deng
777e61d8c8
Fix qwen2 & int4 on NPU (#11646) 2024-07-24 13:14:39 +08:00
Yishuo Wang
f4077fa905
fix llama3-8b npu long input stuck (#11613) 2024-07-18 11:08:17 +08:00
Zhao Changmin
e5c0058c0e
fix baichuan (#11606) 2024-07-18 09:43:36 +08:00
Yishuo Wang
5837bc0014
fix chatglm3 npu output (#11590) 2024-07-16 18:16:30 +08:00
Zhao Changmin
b9c66994a5
add npu sdp (#11562) 2024-07-11 16:57:35 +08:00
Zhao Changmin
105e124752
optimize phi3-v encoder npu performance and add multimodal example (#11553)
* phi3-v

* readme
2024-07-11 13:59:14 +08:00
Zhao Changmin
3c16c9f725
Optimize baichuan on NPU (#11548)
* baichuan_npu
2024-07-10 13:18:48 +08:00
Yishuo Wang
2929eb262e
support npu glm4 (#11539) 2024-07-09 15:46:49 +08:00
Yishuo Wang
c26651f91f
add mistral npu support (#11523) 2024-07-08 13:17:15 +08:00
Yishuo Wang
14ce058004
add chatglm3 npu support (#11518) 2024-07-05 15:31:27 +08:00
Zhao Changmin
24de13fc45
Optimize stablelm on NPU (#11512)
* stablelm_optimize
2024-07-05 14:21:57 +08:00
Zhao Changmin
57b8adb189
[WIP] Support npu load_low_bit method (#11502)
* npu_load_low_bit
2024-07-04 17:15:34 +08:00
Yishuo Wang
1a8bab172e
add minicpm 1B/2B npu support (#11507) 2024-07-04 16:31:04 +08:00
Yishuo Wang
bb0a84044b
add qwen2 npu support (#11504) 2024-07-04 11:01:25 +08:00
Yishuo Wang
ec3a912ab6
optimize npu llama long context performance (#11478) 2024-07-01 16:49:23 +08:00
Zhao Changmin
cf8eb7b128
Init NPU quantize method and support q8_0_rtn (#11452)
* q8_0_rtn

* fix float point
2024-07-01 13:45:07 +08:00
Yishuo Wang
319a3b36b2
fix npu llama2 (#11471) 2024-07-01 10:14:11 +08:00
Yishuo Wang
029ff15d28
optimize npu llama2 first token performance (#11451) 2024-06-27 17:37:33 +08:00
Yishuo Wang
f89ca23748
optimize npu llama2 perf again (#11445) 2024-06-27 15:13:42 +08:00
Yishuo Wang
ca0e69c3a7
optimize npu llama perf again (#11431) 2024-06-26 10:52:54 +08:00
Yishuo Wang
9f6e5b4fba
optimize llama npu perf (#11426) 2024-06-25 17:43:20 +08:00