SONG Ge
e211a5b076
update minicpm to meet latest refactor ( #11937 )
2024-08-27 15:08:01 +08:00
Zijie Li
6c3eb1e1e8
refactor from_pretrained API for NPU ( #11927 )
2024-08-27 09:50:30 +08:00
SONG Ge
019f725d4d
[NPU] Add support for running mp minicpm model on npu ( #11909 )
* add initial support for npu minicpm mp
* fix minicpm-1b abnormal output error
2024-08-26 17:52:55 +08:00
binbin Deng
303a090a6b
Add lm_head optimization on NPU ( #11903 )
2024-08-23 15:51:07 +08:00
binbin Deng
72a7bf624b
Support qwen2-1.5b with fused decoderlayer optimization on NPU ( #11888 )
2024-08-22 11:09:12 +08:00
Yang Wang
209d42ab79
Refactor npu mp to make it easier to integrate new models ( #11873 )
* Refactor npu mp to make it easier to integrate new models
* fix style
* move layer functions to base
2024-08-20 20:58:47 -07:00
Yang Wang
bdaeee1d63
Fix run_decoders bug ( #11871 )
2024-08-20 12:04:59 -07:00
Yang Wang
99b05ba1dc
separate prefill into a process ( #11787 )
* separate prefill into a process
* using model.share_memory()
* might work
* worked
* use long prompt
* refactor
* cleanup
* fix bug
* clean up
* changeable inter and intra process stages
* refactor
* add max output len
* fix npu_model changes that may cause generate down
* fix npu_model generate import error
* fix generate forward error
---------
Co-authored-by: sgwhat <ge.song@intel.com>
2024-08-19 17:53:36 +08:00
Yang Wang
51bcac1229
follow up on experimental support of fused decoder layer for llama2 ( #11785 )
* clean up and support transpose value cache
* refine
* fix style
* fix style
2024-08-13 18:53:55 -07:00
binbin Deng
23d3acdc77
Add experimental support of fused decoder layer for llama2 ( #11768 )
2024-08-13 14:41:36 +08:00
binbin Deng
777e61d8c8
Fix qwen2 & int4 on NPU ( #11646 )
2024-07-24 13:14:39 +08:00
Yishuo Wang
f4077fa905
fix llama3-8b npu long input stuck ( #11613 )
2024-07-18 11:08:17 +08:00
Zhao Changmin
e5c0058c0e
fix baichuan ( #11606 )
2024-07-18 09:43:36 +08:00
Yishuo Wang
5837bc0014
fix chatglm3 npu output ( #11590 )
2024-07-16 18:16:30 +08:00
Zhao Changmin
b9c66994a5
add npu sdp ( #11562 )
2024-07-11 16:57:35 +08:00
Zhao Changmin
105e124752
optimize phi3-v encoder npu performance and add multimodal example ( #11553 )
* phi3-v
* readme
2024-07-11 13:59:14 +08:00
Zhao Changmin
3c16c9f725
Optimize baichuan on NPU ( #11548 )
* baichuan_npu
2024-07-10 13:18:48 +08:00
Yishuo Wang
2929eb262e
support npu glm4 ( #11539 )
2024-07-09 15:46:49 +08:00
Yishuo Wang
c26651f91f
add mistral npu support ( #11523 )
2024-07-08 13:17:15 +08:00
Yishuo Wang
14ce058004
add chatglm3 npu support ( #11518 )
2024-07-05 15:31:27 +08:00
Zhao Changmin
24de13fc45
Optimize stablelm on NPU ( #11512 )
* stablelm_optimize
2024-07-05 14:21:57 +08:00
Zhao Changmin
57b8adb189
[WIP] Support npu load_low_bit method ( #11502 )
* npu_load_low_bit
2024-07-04 17:15:34 +08:00
Yishuo Wang
1a8bab172e
add minicpm 1B/2B npu support ( #11507 )
2024-07-04 16:31:04 +08:00
Yishuo Wang
bb0a84044b
add qwen2 npu support ( #11504 )
2024-07-04 11:01:25 +08:00
Yishuo Wang
ec3a912ab6
optimize npu llama long context performance ( #11478 )
2024-07-01 16:49:23 +08:00
Zhao Changmin
cf8eb7b128
Init NPU quantize method and support q8_0_rtn ( #11452 )
* q8_0_rtn
* fix float point
2024-07-01 13:45:07 +08:00
Yishuo Wang
319a3b36b2
fix npu llama2 ( #11471 )
2024-07-01 10:14:11 +08:00
Yishuo Wang
029ff15d28
optimize npu llama2 first token performance ( #11451 )
2024-06-27 17:37:33 +08:00
Yishuo Wang
f89ca23748
optimize npu llama2 perf again ( #11445 )
2024-06-27 15:13:42 +08:00
Yishuo Wang
ca0e69c3a7
optimize npu llama perf again ( #11431 )
2024-06-26 10:52:54 +08:00
Yishuo Wang
9f6e5b4fba
optimize llama npu perf ( #11426 )
2024-06-25 17:43:20 +08:00