Update readme for FP8/FP4 inference examples (#9601)
parent a66fbedd7e · commit 06febb5fa7
3 changed files with 5 additions and 5 deletions

@@ -12,6 +12,7 @@
> *It is built on the excellent work of [llama.cpp](https://github.com/ggerganov/llama.cpp), [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [qlora](https://github.com/artidoro/qlora), [gptq](https://github.com/IST-DASLab/gptq), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [awq](https://github.com/mit-han-lab/llm-awq), [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [vLLM](https://github.com/vllm-project/vllm), [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), [gptq_for_llama](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [chatglm.cpp](https://github.com/li-plus/chatglm.cpp), [redpajama.cpp](https://github.com/togethercomputer/redpajama.cpp), [gptneox.cpp](https://github.com/byroneverson/gptneox.cpp), [bloomz.cpp](https://github.com/NouamaneTazi/bloomz.cpp/), etc.*
### Latest update
+ - [2023/12] `bigdl-llm` now supports [FP8 and FP4 inference](python/llm/example/GPU/HF-Transformers-AutoModels/More-Data-Types) on Intel ***GPU***.
- [2023/11] Initial support for directly loading [GGUF](python/llm/example/CPU/GGUF-Models/llama2), [AWQ](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/AWQ) and [GPTQ](python/llm/example/GPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ) models into `bigdl-llm` is available.
- [2023/11] Initial support for [vLLM continuous batching](python/llm/example/CPU/vLLM-Serving) is available on Intel ***CPU***.
- [2023/11] Initial support for [vLLM continuous batching](python/llm/example/GPU/vLLM-Serving) is available on Intel ***GPU***.
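
As a quick illustration of the FP8/FP4 news item above, here is a minimal sketch of how a low-bit load typically looks with `bigdl-llm`'s transformers-style API on an Intel GPU; the model id, prompt, and generation settings are placeholder assumptions, not part of this commit.

```python
# Minimal sketch (assumptions: placeholder model id and prompt; bigdl-llm installed
# with Intel GPU support). 'fp4', 'mixed_fp8', etc. can be passed the same way.
import torch
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model

# Quantize the weights to FP8 while loading the Hugging Face checkpoint.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_low_bit="fp8",
                                             trust_remote_code=True)
model = model.to("xpu")  # run on the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer("Once upon a time,", return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```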

@@ -1,6 +1,6 @@
- # BigDL-LLM Transformers Low-Bit Inference Pipeline for Large Language Model
+ # BigDL-LLM Transformers Low-Bit Inference Pipeline (FP8, FP4, INT4 and more)
- In this example, we show a pipeline to apply BigDL-LLM low-bit optimizations (including FP8/INT8/MixedFP8/FP4/MixedFP4) to any Hugging Face Transformers model, and then run inference on the optimized low-bit model.
+ In this example, we show a pipeline to apply BigDL-LLM low-bit optimizations (including **FP8/INT8/MixedFP8/FP4/INT4/MixedFP4**) to any Hugging Face Transformers model, and then run inference on the optimized low-bit model.
## Prepare Environment
We suggest using conda to manage the environment:

@@ -19,7 +19,7 @@ python ./transformers_low_bit_pipeline.py --repo-id-or-model-path meta-llama/Lla
```
arguments info:
- `--repo-id-or-model-path`: str value, argument defining the huggingface repo id for the large language model to be downloaded, or the path to the huggingface checkpoint folder, the value is `meta-llama/Llama-2-7b-chat-hf` by default.
- - `--low-bit`: str value, options are fp8, sym_int8, fp4, mixed_fp8 or mixed_fp4. Relevant low bit optimizations will be applied to the model.
+ - `--low-bit`: str value, options are fp8, sym_int8, fp4, sym_int4, mixed_fp8 or mixed_fp4. Relevant low bit optimizations will be applied to the model.
- `--save-path`: str value, the path to save the low-bit model. Then you can load the low-bit model directly.
- `--load-path`: optional str value. The path to load low-bit model.
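
The `--save-path` and `--load-path` options above boil down to converting the model once and then reloading the already-quantized weights. A rough sketch of the underlying calls, assuming `bigdl-llm`'s `save_low_bit`/`load_low_bit` helpers; the paths and model id are placeholders.

```python
# Rough sketch (assumed helpers and placeholder paths): quantize once, persist the
# low-bit weights, and reload them later without converting from scratch.
from bigdl.llm.transformers import AutoModelForCausalLM

save_path = "./llama-2-7b-chat-fp8"  # corresponds to --save-path

# First run: convert the original checkpoint and save the low-bit result.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             load_in_low_bit="fp8",
                                             trust_remote_code=True)
model.save_low_bit(save_path)

# Later runs: load the saved low-bit model directly (corresponds to --load-path).
model = AutoModelForCausalLM.load_low_bit(save_path)
```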

@@ -42,4 +42,3 @@ Output log:
Prompt: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun
Output: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. But her parents were always telling her to stay at home and be careful. They were worried about her safety and didn't want her to get hurt
```

@@ -26,7 +26,7 @@ if __name__ == '__main__':
help='The huggingface repo id for the large language model to be downloaded'
', or the path to the huggingface checkpoint folder')
parser.add_argument('--low-bit', type=str, default="fp4",
- choices=['fp8', 'sym_int8', 'fp4', 'mixed_fp8', 'mixed_fp4'],
+ choices=['fp8', 'sym_int8', 'fp4', 'sym_int4', 'mixed_fp8', 'mixed_fp4'],
help='The quantization type the model will convert to.')
parser.add_argument('--save-path', type=str, default=None,
help='The path to save the low-bit model.')
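
To tie the argument parsing above to the output log shown earlier, here is a hedged, self-contained sketch of how the parsed `--low-bit` value would typically drive loading and generation; it is an illustration of the flow, not the literal body of `transformers_low_bit_pipeline.py` (defaults and generation settings are assumptions).

```python
# Illustrative sketch only: parse the low-bit choice, load the model in that
# format with bigdl-llm, and run a short generation on the example prompt.
import argparse
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

parser = argparse.ArgumentParser()
parser.add_argument('--repo-id-or-model-path', type=str,
                    default="meta-llama/Llama-2-7b-chat-hf")
parser.add_argument('--low-bit', type=str, default="fp4",
                    choices=['fp8', 'sym_int8', 'fp4', 'sym_int4', 'mixed_fp8', 'mixed_fp4'])
args = parser.parse_args()

model = AutoModelForCausalLM.from_pretrained(args.repo_id_or_model_path,
                                             load_in_low_bit=args.low_bit,
                                             trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(args.repo_id_or_model_path,
                                          trust_remote_code=True)

prompt = "Once upon a time, there existed a little girl who liked to have adventures."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=32)

print("Prompt:", prompt)
print("Output:", tokenizer.decode(output_ids[0], skip_special_tokens=True))
```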