BigDL-LLM INT4 Inference Using Llama-Cpp-Python Format API
In this example, we show how to run inference on a converted INT4 model using the llama-cpp-python format API.
Note: Currently the model families LLaMA, GPT-NeoX, BLOOM and StarCoder are supported.
Prepare Environment
We suggest using conda to manage the environment:
conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[all]
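To double-check that the installation worked, you can try importing the package before moving on (the module path is assumed from the bigdl-llm package layout):

python -c "import bigdl.llm"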
Convert Models using bigdl-llm
Follow the instructions in the Convert model guide.
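For quick reference, conversion can also be done from Python with bigdl-llm's llm_convert API; the checkpoint and output paths below are placeholders, and the linked guide remains the authoritative reference:

from bigdl.llm import llm_convert

# Convert a Hugging Face checkpoint into a native INT4 model file;
# model_family must be one of llama, gptneox, bloom or starcoder.
converted_model_path = llm_convert(
    model='/path/to/llama-7b-hf',
    outfile='/path/to/output/',
    outtype='int4',
    model_family='llama',
)
print(converted_model_path)  # pass this path as CONVERTED_MODEL_PATH below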
Run the example
python ./int4_inference.py -m CONVERTED_MODEL_PATH -x MODEL_FAMILY -p PROMPT -t THREAD_NUM
arguments info:
-m CONVERTED_MODEL_PATH: required, path to the converted model
-x MODEL_FAMILY: required, the model family of the model specified in -m; available options are llama, gptneox, bloom and starcoder
-p PROMPT: question to ask. Default is 'What is AI?'.
-t THREAD_NUM: specify the number of threads to use for inference. Default is 2.
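For reference, here is a minimal sketch of what int4_inference.py does. It assumes bigdl-llm exposes llama-cpp-python style model classes named Llama, Gptneox, Bloom and Starcoder, and that their __call__ returns a llama-cpp-python style completion dict; verify these names against your installed version:

import argparse

# Class names assumed from bigdl-llm's llama-cpp-python style API.
from bigdl.llm.models import Llama, Gptneox, Bloom, Starcoder

MODEL_CLASSES = {
    'llama': Llama,
    'gptneox': Gptneox,
    'bloom': Bloom,
    'starcoder': Starcoder,
}

parser = argparse.ArgumentParser()
parser.add_argument('-m', '--model-path', required=True,
                    help='path to the converted model')
parser.add_argument('-x', '--model-family', required=True,
                    choices=MODEL_CLASSES, help='model family')
parser.add_argument('-p', '--prompt', default='What is AI?',
                    help='question to ask')
parser.add_argument('-t', '--thread-num', type=int, default=2,
                    help='number of threads to use for inference')
args = parser.parse_args()

# Load the converted INT4 model; n_threads controls CPU parallelism.
llm = MODEL_CLASSES[args.model_family](model_path=args.model_path,
                                       n_threads=args.thread_num)

# __call__ mirrors llama-cpp-python and returns a completion dict.
result = llm(args.prompt, max_tokens=32)
print(result['choices'][0]['text'])

Because the API mirrors llama-cpp-python, the generated text lives under result['choices'][0]['text'], just as it would with the upstream library.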