Serving LLaMA models using vLLM on Intel platforms (experimental support)
This example demonstrates how to serve a llama2-7b model with BigDL-LLM 4-bit optimizations on Intel Xeon CPUs using an adapted vLLM.
The code shown in the following example is ported from vLLM.
Example: Serving llama2-7b using Xeon CPU
In this example, we will run the Llama2-7b model using 48 cores on one socket and provide an OpenAI-compatible interface for users.
1. Install
The original vLLM is designed to run in a CUDA environment. To adapt vLLM to Intel platforms, install the dependencies as follows:
# First create a conda environment
conda create -n bigdl-vllm python==3.9
conda activate bigdl-vllm
# Install dependencies
pip install --pre --upgrade bigdl-llm[all]
pip3 install psutil
pip3 install sentencepiece  # Required for LLaMA tokenizer.
pip3 install numpy
pip3 install "torch==2.0.1"
pip3 install "transformers>=4.33.1"  # Required for Code Llama.
pip3 install "xformers == 0.0.22"
pip3 install fastapi
pip3 install "uvicorn[standard]"
pip3 install "pydantic<2"  # Required for OpenAI server.
2. Configure recommended environment variables
source bigdl-llm-init
3. Offline inference/Service
Offline inference
To run offline inference using vLLM for a quick test, use the following example:
#!/bin/bash
# Please first modify the MODEL_PATH in offline_inference.py
numactl -C 48-95 -m 1 python offline_inference.py
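For reference, a minimal version of offline_inference.py might look like the sketch below. This is only an illustration: it assumes the ported package exposes LLM and SamplingParams entrypoints similar to upstream vLLM, and the import paths and MODEL_PATH are placeholders, so consult the actual offline_inference.py in this directory.
#!/usr/bin/env python
# Minimal sketch only; the import paths below are assumptions and may differ
# from the actual bigdl.llm.vllm package layout.
from bigdl.llm.vllm.entrypoints.llm import LLM                 # hypothetical path
from bigdl.llm.vllm.sampling_params import SamplingParams      # hypothetical path

MODEL_PATH = "/MODEL_PATH/Llama-2-7b-chat-hf-bigdl/"  # change to your local model

prompts = ["San Francisco is a", "The capital of France is"]
sampling_params = SamplingParams(temperature=0, max_tokens=128)

# Mirrors the flags used by the api_server example below (assumed constructor args)
llm = LLM(model=MODEL_PATH, device="cpu", dtype="bfloat16", load_format="auto")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)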
Service
To fully utilize the dynamic batching feature of vLLM, you can send requests to the service using curl or similar tools. Requests sent to the engine are batched at the token level: queries are executed in the same forward step of the LLM and are removed as soon as they finish, instead of waiting for all sequences to finish.
#!/bin/bash
numactl -C 48-95 -m 1 python -m bigdl.llm.vllm.examples.api_server \
        --model /MODEL_PATH/Llama-2-7b-chat-hf-bigdl/ --port 8000  \
        --load-format 'auto' --device cpu --dtype bfloat16
Then you can access the API server as follows:
 curl http://localhost:8000/v1/completions \
         -H "Content-Type: application/json" \
         -d '{
                 "model": "/MODEL_PATH/Llama-2-7b-chat-hf-bigdl/",
                 "prompt": "San Francisco is a",
                 "max_tokens": 128,
                 "temperature": 0
 }' &
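Alternatively, the same endpoint can be queried from Python. The snippet below is a minimal sketch using the requests library and assumes the server started above is listening on port 8000:
import requests  # pip3 install requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "/MODEL_PATH/Llama-2-7b-chat-hf-bigdl/",
        "prompt": "San Francisco is a",
        "max_tokens": 128,
        "temperature": 0,
    },
)
print(resp.json())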
4. (Optional) Add a new model
Currently we only support models with the LLaMA architecture (including llama, vicuna, llama-2, etc.). To use another model, you may need to adapt the code.
4.1 Add model code
Create or clone the PyTorch model code into ./models.
4.2 Rewrite the forward methods
Referring to ./models/bigdl_llama.py, you need to maintain a kv_cache, which is a nested list of dictionaries that maps req_id to a three-dimensional tensor (the structure may vary between models). Before the model's actual forward call, prepare past_key_values for the current req_ids; afterwards, update the kv_cache with output.past_key_values. The cache entries are cleared when the request finishes. A schematic of this pattern is shown below.
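The sketch below illustrates the prepare/update/clear pattern described above. It is purely schematic: it assumes each layer stores one dictionary for keys and one for values, both keyed by req_id, and all helper names (prepare_past_key_values, update_kv_cache, clear_finished) are hypothetical; the real layout in ./models/bigdl_llama.py may differ.
def prepare_past_key_values(kv_cache, req_ids):
    # Hypothetical helper: gather the cached (key, value) tensors for the
    # requests in this batch; new requests simply get None entries.
    past_key_values = []
    for key_dict, value_dict in kv_cache:  # one (keys, values) pair per layer (assumed layout)
        keys = [key_dict.get(r) for r in req_ids]
        values = [value_dict.get(r) for r in req_ids]
        past_key_values.append((keys, values))
    return past_key_values

def update_kv_cache(kv_cache, req_ids, new_past_key_values):
    # Hypothetical helper: write output.past_key_values back into the cache,
    # splitting the batch so each req_id keeps its own tensor.
    for (key_dict, value_dict), (new_keys, new_values) in zip(kv_cache, new_past_key_values):
        for i, r in enumerate(req_ids):
            key_dict[r] = new_keys[i]
            value_dict[r] = new_values[i]

def clear_finished(kv_cache, finished_req_ids):
    # Cache entries are dropped once a request is finished.
    for key_dict, value_dict in kv_cache:
        for r in finished_req_ids:
            key_dict.pop(r, None)
            value_dict.pop(r, None)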
4.3 Register new model
Finally, register your *ForCausalLM class in the _MODEL_REGISTRY in ./models/model_loader.py.
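For illustration, the registration usually amounts to a single extra dictionary entry. In the sketch below, BigDLMyModelForCausalLM and its import are placeholders for your new class; the registry key should match the architecture name reported in the model's config.json.
# In ./models/model_loader.py (illustrative sketch; follow the existing entries)
from .bigdl_llama import BigDLLlamaForCausalLM       # existing LLaMA implementation
from .bigdl_mymodel import BigDLMyModelForCausalLM   # your new model class (placeholder)

_MODEL_REGISTRY = {
    "LlamaForCausalLM": BigDLLlamaForCausalLM,
    "MyModelForCausalLM": BigDLMyModelForCausalLM,   # key matches the config.json architecture name
}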