ipex-llm/python
Guancheng Fu fdf93c9267 Implement selective batching for vLLM (#9659)
* add control to load hf model

* finish initial version of selective_batching

* temp

* finish

* Remove print statement

* fix error

* Apply Yang's optimization

* a version that works

* We need to check the kv_cache passed in; this could be an error. TODO: add a fast decoding path

* format

* Temp solution: do not batch prefill requests

* a version that works for prefill batching

* format

* a solid version: works normally

* a temp version

* Solid version: remove redundant functions

* fix format

* format

* solid: add option to enable selective_batching

* remove logic for using transformer models

* format

* format

* solid: enable argument VLLM_ENABLE_SELECTIVE_BATCHING

* format

* finish

* format
2023-12-22 13:45:46 +08:00
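The log above describes separating prefill and decode requests within a batch, and gating the feature on the `VLLM_ENABLE_SELECTIVE_BATCHING` environment variable. The sketch below illustrates that general idea only; the `Request` class, `split_batch`, and `selective_batching_enabled` are hypothetical names, not the actual ipex-llm/vLLM implementation.

```python
import os

class Request:
    """Hypothetical request record (not the real vLLM class).

    A request that has generated no tokens yet is still in its
    prefill phase; afterwards it is in the decode phase.
    """
    def __init__(self, prompt_tokens, generated=0):
        self.prompt_tokens = prompt_tokens
        self.generated = generated

    @property
    def is_prefill(self):
        return self.generated == 0

def split_batch(requests):
    """Partition a mixed batch into prefill and decode groups.

    Selective batching runs the two phases separately: prefill
    requests carry full (variable-length) prompts, while decode
    requests each contribute one new token and can share a single
    batched forward pass over the kv_cache.
    """
    prefill = [r for r in requests if r.is_prefill]
    decode = [r for r in requests if not r.is_prefill]
    return prefill, decode

def selective_batching_enabled():
    # The commit log gates the feature on this environment variable;
    # the accepted values here are an assumption.
    value = os.environ.get("VLLM_ENABLE_SELECTIVE_BATCHING", "")
    return value.lower() in ("1", "true")
```

With a split like this, a scheduler can run the decode group as one batched step and handle prefill requests either individually or as a separate batched pass, which is the progression the commit messages describe.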
llm/