ipex-llm/python/llm/src/ipex_llm/transformers/npu_models
Latest commit: 99b05ba1dc by Yang Wang
separate prefill into a process (#11787)
* separate prefill into a process

* using model.share_memory()

* might work

* worked

* use long prompt

* refactor

* cleanup

* fix bug

* clean up

* changeable inter- and intra-process stages

* refactor

* add max output len

* fix npu_model changes that may cause generate to break

* fix npu_model generate import error

* fix generate forward error

---------

Co-authored-by: sgwhat <ge.song@intel.com>
2024-08-19 17:53:36 +08:00
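
The commit notes above outline the technique at a high level: prefill runs in a separate process, the model's weights are placed in shared memory via model.share_memory() so the worker does not copy them, and generation is capped by a max output length. Below is a minimal, hypothetical Python sketch of that pattern built on torch.multiprocessing; prefill_worker, run_generate, and max_output_len are illustrative names, not ipex-llm APIs, and the sketch assumes a batch-1 CPU model that returns legacy tuple-style past_key_values.

    # Minimal sketch of "prefill in a separate process" -- NOT the ipex-llm
    # implementation. Assumes a CPU HuggingFace-style causal LM that accepts
    # use_cache/past_key_values and returns legacy tuple-style KV caches.
    import torch
    import torch.multiprocessing as mp

    def prefill_worker(model, input_ids, queue):
        # Child process: one forward pass over the whole prompt to build the
        # KV cache. Weights are not copied here because share_memory() below
        # moved them into shared memory before the process was spawned.
        with torch.inference_mode():
            out = model(input_ids, use_cache=True)
        queue.put((out.logits[:, -1, :], out.past_key_values))

    def run_generate(model, input_ids, max_output_len=32):
        model.share_memory()              # share weights instead of copying them
        ctx = mp.get_context("spawn")
        queue = ctx.Queue()
        worker = ctx.Process(target=prefill_worker, args=(model, input_ids, queue))
        worker.start()
        logits, past = queue.get()        # blocks until prefill finishes
        worker.join()

        tokens = []
        for _ in range(max_output_len):   # cheap per-token decode stays here
            next_id = logits.argmax(dim=-1, keepdim=True)   # greedy, shape (1, 1)
            tokens.append(next_id.item())
            with torch.inference_mode():
                out = model(next_id, past_key_values=past, use_cache=True)
            logits, past = out.logits[:, -1, :], out.past_key_values
        return tokens

The split makes sense because prefill over a long prompt is the compute-heavy stage, while the parent process only pays for the single-token decode loop; max_output_len bounds that loop, matching the "* add max output len" note in the commit body.
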
File | Last commit | Last updated
__init__.py | optimize llama npu perf (#11426) | 2024-06-25 17:43:20 +08:00
baichuan.py | fix baichuan (#11606) | 2024-07-18 09:43:36 +08:00
chatglm.py | fix chatglm3 npu output (#11590) | 2024-07-16 18:16:30 +08:00
chatglm4.py | support npu glm4 (#11539) | 2024-07-09 15:46:49 +08:00
common.py | optimize npu llama perf again (#11431) | 2024-06-26 10:52:54 +08:00
convert.py | Add experimental support of fused decoder layer for llama2 (#11768) | 2024-08-13 14:41:36 +08:00
convert_mp.py | separate prefill into a process (#11787) | 2024-08-19 17:53:36 +08:00
kv.py | separate prefill into a process (#11787) | 2024-08-19 17:53:36 +08:00
linear.py | fix llama3-8b npu long input stuck (#11613) | 2024-07-18 11:08:17 +08:00
llama.py | Add experimental support of fused decoder layer for llama2 (#11768) | 2024-08-13 14:41:36 +08:00
llama_mp.py | separate prefill into a process (#11787) | 2024-08-19 17:53:36 +08:00
minicpm.py | add minicpm 1B/2B npu support (#11507) | 2024-07-04 16:31:04 +08:00
mistral.py | add mistral npu support (#11523) | 2024-07-08 13:17:15 +08:00
phi3.py | add npu sdp (#11562) | 2024-07-11 16:57:35 +08:00
phi3_v.py | optimize phi3-v encoder npu performance and add multimodal example (#11553) | 2024-07-11 13:59:14 +08:00
pipeline_parallel.py | Add experimental support of fused decoder layer for llama2 (#11768) | 2024-08-13 14:41:36 +08:00
qwen2.py | add qwen2 npu support (#11504) | 2024-07-04 11:01:25 +08:00
stablelm.py | Optimize stablelm on NPU (#11512) | 2024-07-05 14:21:57 +08:00