Add CPU and GPU examples for Yuan2-2B-hf (#9946)

* Add a new CPU example of Yuan2-2B-hf

* Add a new CPU generate.py of Yuan2-2B-hf example

* Add a new GPU example of Yuan2-2B-hf

* Add Yuan2 to README table

* In CPU example:1.Use English as default prompt; 2.Provide modified files in yuan2-2B-instruct

* In GPU example:1.Use English as default prompt;2.Provide modified files

* GPU example:update README

* update Yuan2-2B-hf in README table

* Add CPU example for Yuan2-2B in Pytorch-Models

* Add GPU example for Yuan2-2B in Pytorch-Models

* Add license in generate.py; Modify README

* In GPU Add license in generate.py; Modify README

* In CPU yuan2 modify README

* In GPU yuan2 modify README

* In CPU yuan2 modify README

* In GPU example, updated the readme for Windows GPU supports

* In GPU torch example, updated the readme for Windows GPU supports

* GPU hf example README modified

* GPU example README modified
yb-peng 2024-02-23 14:09:30 +08:00 committed by GitHub
parent f1f4094a09
commit a2c1675546
18 changed files with 5435 additions and 0 deletions

View file

@ -191,6 +191,7 @@ Over 40 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLa
| SpeechT5 | | [link](python/llm/example/GPU/PyTorch-Models/Model/speech-t5) |
| Ziya-Coding-34B-v1.0 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/ziya) | |
| Phi-2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2) |
| Yuan2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/yuan2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2) |
***For more details, please refer to the `bigdl-llm` [Document](https://test-bigdl-llm.readthedocs.io/en/main/doc/LLM/index.html), [Readme](python/llm), [Tutorial](https://github.com/intel-analytics/bigdl-llm-tutorial) and [API Doc](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/index.html).***

View file

@ -83,6 +83,7 @@ Over 20 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLa
| SpeechT5 | | [link](example/GPU/PyTorch-Models/Model/speech-t5) |
| Ziya-Coding-34B-v1.0 | [link](example/CPU/HF-Transformers-AutoModels/Model/ziya) | |
| Phi-2 | [link](example/CPU/HF-Transformers-AutoModels/Model/phi-2) | [link](example/GPU/HF-Transformers-AutoModels/Model/phi-2) |
| Yuan2 | [link](example/CPU/HF-Transformers-AutoModels/Model/yuan2) | [link](example/GPU/HF-Transformers-AutoModels/Model/yuan2) |
### Working with `bigdl-llm`

View file

@ -0,0 +1,65 @@
# Yuan2
In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on Yuan2 models. For illustration purposes, we utilize the [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf) as a reference Yuan2 model.
## 0. Requirements
To run these examples with BigDL-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
In addition, you need to modify some files in the Yuan2-2B-hf folder, since the Flash Attention dependency is for CUDA usage and currently cannot be installed on Intel CPUs. To manually turn it off, please refer to [this issue](https://github.com/IEIT-Yuan/Yuan-2.0/issues/92). We also provide two modified files ([config.json](yuan2-2B-instruct/config.json) and [yuan_hf_model.py](yuan2-2B-instruct/yuan_hf_model.py)), which can be used to replace the original config.json and yuan_hf_model.py.
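
As a rough illustration of the config-side change only, the sketch below sets `use_flash_attention` to `false` in a local `config.json`. The helper name `disable_flash_attention` and its path handling are assumptions made for illustration; the key name is taken from the provided config.json. It does not cover the matching edits to `yuan_hf_model.py`, for which the linked issue remains the reference.

```python
import json
from pathlib import Path


def disable_flash_attention(config_path):
    """Set "use_flash_attention" to false in a Yuan2 config.json.

    Hypothetical helper: it rewrites the file in place and returns the
    updated configuration as a dict.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["use_flash_attention"] = False
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

It would be called with the path to the downloaded checkpoint's `config.json`; all other keys in the file are left unchanged.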
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Yuan2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations.
### 1. Install
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
```bash
conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[all] # install the latest bigdl-llm nightly build with 'all' option
pip install einops # additional package required for Yuan2 to conduct generation
pip install pandas # additional package required for Yuan2 to conduct generation
```
### 2. Run
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the Yuan2 model to be downloaded, or the path to the Hugging Face checkpoint folder. Defaults to `'IEITYuan/Yuan2-2B-hf'`.
- `--prompt PROMPT`: the prompt to run inference on (with integrated prompt format for chat). Defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: the maximum number of tokens to predict. Defaults to `100`.
> **Note**: When loading the model in 4-bit, BigDL-LLM converts linear layers in the model into INT4 format. In theory, an *X*B model saved in 16-bit requires approximately 2*X* GB of memory for loading, and ~0.5*X* GB of memory for further inference.
>
> Please select the appropriate size of the Yuan2 model based on the capabilities of your machine.
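
To make the rule of thumb in the note concrete, here is a small sketch that applies it. The function name is hypothetical, and the figures are only the approximate estimates stated in the note, not measurements.

```python
def estimate_memory_gb(params_billion):
    """Apply the note's rule of thumb for an X-billion-parameter model:
    ~2*X GB to load the 16-bit checkpoint, ~0.5*X GB for INT4 inference."""
    loading_gb = 2.0 * params_billion
    inference_gb = 0.5 * params_billion
    return loading_gb, inference_gb


loading, inference = estimate_memory_gb(2)  # Yuan2-2B
print(f"load: ~{loading} GB, inference: ~{inference} GB")
```

For the 2B reference model this gives roughly 4 GB for loading and 1 GB for inference.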
#### 2.1 Client
On a client Windows machine, it is recommended to run directly with full utilization of all cores:
```powershell
python ./generate.py
```
#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and to run the example with all the physical cores of a single socket.
E.g. on Linux,
```bash
# set BigDL-LLM env variables
source bigdl-llm-init
# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python ./generate.py
```
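
The same pattern can be generated for other core counts. The sketch below builds both lines from a cores-per-socket value. The helper name is hypothetical, and it assumes that socket 0 owns the contiguous core IDs `0..N-1` and NUMA node 0, as in the example above; that layout is not guaranteed on every machine, so check it with `lscpu` or `numactl --hardware` first.

```python
def numactl_command(cores_per_socket, script="./generate.py"):
    """Build the OMP_NUM_THREADS export and the numactl invocation that pin
    the run to socket 0 (assumes cores 0..N-1 belong to NUMA node 0)."""
    if cores_per_socket < 1:
        raise ValueError("cores_per_socket must be positive")
    env_line = f"export OMP_NUM_THREADS={cores_per_socket}"
    run_line = f"numactl -C 0-{cores_per_socket - 1} -m 0 python {script}"
    return env_line, run_line


for line in numactl_command(48):
    print(line)
```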
#### 2.3 Sample Output
#### [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf)
```log
Inference time: xxxx seconds
-------------------- Output --------------------
What is AI?
AI is what we call "Artificial Intelligence."<eod>
```

View file

@ -0,0 +1,67 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import torch, transformers
import sys, os, time
import argparse
from transformers import LlamaTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM
# Refer to https://huggingface.co/IEITYuan/Yuan2-2B-hf#Usage
YUAN2_PROMPT_FORMAT = """
{prompt}
"""
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate text using Yuan2-2B model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="IEITYuan/Yuan2-2B-hf",
                        help='The huggingface repo id for the Yuan2 to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt for the model')
    parser.add_argument('--n-predict', type=int, default=100,
                        help='Number of tokens to generate')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    # Load tokenizer
    print("Creating tokenizer...")
    tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
    tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>',
                          '<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)
    # Load model in 4 bit,
    # which convert the relevant layers in the model into INT4 format
    print("Creating model...")
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cpu", trust_remote_code=True, load_in_4bit=True).eval()
    prompt = YUAN2_PROMPT_FORMAT.format(prompt=args.prompt)
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    # Measure the inference time
    start_time = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_length=args.n_predict)
    end_time = time.time()
    output_str = tokenizer.decode(outputs[0])
    print(f'Inference time: {end_time - start_time} seconds')
    print('-'*20, 'Output', '-'*20)
    print(output_str)

View file

@ -0,0 +1,39 @@
{
"_from_model_config":true,
"architectures": [
"YuanForCausalLM"
],
"auto_map":{
"AutoConfig":"configuration_yuan.YuanConfig",
"AutoModelForCausalLM":"yuan_hf_model.YuanForCausalLM"
},
"tokenizer_class":"YuanTokenizer",
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"max_position_embeddings": 8192,
"model_type": "yuan",
"num_attention_heads": 32,
"num_hidden_layers": 24,
"rms_norm_eps": 1e-06,
"dropout": 0.1,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.30.0.dev0",
"use_cache": true,
"causal_mask": true,
"use_flash_attention": false,
"reset_attention_mask": true,
"reset_position_ids": true,
"use_loss_mask": false,
"eod_token": 77185,
"sep_token": 77187,
"eod_token_id": 77185,
"sep_token_id": 77185,
"pad_token_id": 77185,
"bos_token_id": 77185,
"eos_token_id": 77185,
"mask_token_id": 77185,
"vocab_size": 135040
}

View file

@ -0,0 +1,61 @@
# Yuan2
In this directory, you will find examples on how you could apply BigDL-LLM `optimize_model` API to accelerate Yuan2 models. For illustration purposes, we utilize the [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf) as a reference Yuan2 model.
## 0. Requirements
To run these examples with BigDL-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
In addition, you need to modify some files in the Yuan2-2B-hf folder, since the Flash Attention dependency is for CUDA usage and currently cannot be installed on Intel CPUs. To manually turn it off, please refer to [this issue](https://github.com/IEIT-Yuan/Yuan-2.0/issues/92). We also provide two modified files ([config.json](yuan2-2B-instruct/config.json) and [yuan_hf_model.py](yuan2-2B-instruct/yuan_hf_model.py)), which can be used to replace the original config.json and yuan_hf_model.py.
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Yuan2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations.
### 1. Install
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
```bash
conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[all] # install the latest bigdl-llm nightly build with 'all' option
pip install einops # additional package required for Yuan2 to conduct generation
pip install pandas # additional package required for Yuan2 to conduct generation
```
### 2. Run
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```
Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the Yuan2 model to be downloaded, or the path to the Hugging Face checkpoint folder. Defaults to `'IEITYuan/Yuan2-2B-hf'`.
- `--prompt PROMPT`: the prompt to run inference on (with integrated prompt format for chat). Defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: the maximum number of tokens to predict. Defaults to `100`.
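
One detail worth keeping in mind: `generate.py` passes the value of `--n-predict` to `generate()` as `max_length`, and in Hugging Face Transformers `max_length` counts the prompt tokens as well as the generated ones. The sketch below, with a hypothetical helper name and an assumed prompt length of 8 tokens, shows the resulting budget of newly generated tokens.

```python
def new_token_budget(prompt_tokens, max_length):
    """Upper bound on newly generated tokens when `max_length` caps the
    total sequence length (prompt plus generated tokens)."""
    return max(0, max_length - prompt_tokens)


# Default --n-predict of 100 with an assumed 8-token prompt
print(new_token_budget(prompt_tokens=8, max_length=100))
```

If a fixed number of new tokens is wanted regardless of prompt length, `generate()` also accepts `max_new_tokens`.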
#### 2.1 Client
On a client Windows machine, it is recommended to run directly with full utilization of all cores:
```powershell
python ./generate.py
```
#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and to run the example with all the physical cores of a single socket.
E.g. on Linux,
```bash
# set BigDL-LLM env variables
source bigdl-llm-init
# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python ./generate.py
```
#### 2.3 Sample Output
#### [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf)
```log
Inference time: xxxx seconds
-------------------- Output --------------------
What is AI?
The term "AI" refers to a process that involves creating machines or devices that can perform tasks that typically require human intelligence, such as AI-based decision-making and machine learning. AI is rapidly advancing in the fields of machine learning, computer science, and artificial intelligence, and has been used in various fields to achieve various goals, such as improving accuracy, efficiency, and complexity. However, the
```

View file

@ -0,0 +1,69 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import torch, transformers
import sys, os, time
import argparse
from transformers import LlamaTokenizer, AutoModelForCausalLM
from bigdl.llm import optimize_model
# Refer to https://huggingface.co/IEITYuan/Yuan2-2B-hf#Usage
YUAN2_PROMPT_FORMAT = """
{prompt}
"""
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate text using Yuan2-2B model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="IEITYuan/Yuan2-2B-hf",
                        help='The huggingface repo id for the Yuan2 to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt for the model')
    parser.add_argument('--n-predict', type=int, default=100,
                        help='Number of tokens to generate')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    # Load tokenizer
    print("Creating tokenizer...")
    tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
    tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>',
                          '<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)
    # Load model
    print("Creating model...")
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cpu", trust_remote_code=True, torch_dtype=torch.float16).eval()
    # With only one line to enable BigDL-LLM optimization on model
    model = optimize_model(model)
    prompt = YUAN2_PROMPT_FORMAT.format(prompt=args.prompt)
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    # Measure the inference time
    start_time = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_length=args.n_predict)
    end_time = time.time()
    output_str = tokenizer.decode(outputs[0])
    print(f'Inference time: {end_time - start_time} seconds')
    print('-'*20, 'Output', '-'*20)
    print(output_str)

View file

@ -0,0 +1,39 @@
{
"_from_model_config":true,
"architectures": [
"YuanForCausalLM"
],
"auto_map":{
"AutoConfig":"configuration_yuan.YuanConfig",
"AutoModelForCausalLM":"yuan_hf_model.YuanForCausalLM"
},
"tokenizer_class":"YuanTokenizer",
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"max_position_embeddings": 8192,
"model_type": "yuan",
"num_attention_heads": 32,
"num_hidden_layers": 24,
"rms_norm_eps": 1e-06,
"dropout": 0.1,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.30.0.dev0",
"use_cache": true,
"causal_mask": true,
"use_flash_attention": false,
"reset_attention_mask": true,
"reset_position_ids": true,
"use_loss_mask": false,
"eod_token": 77185,
"sep_token": 77187,
"eod_token_id": 77185,
"sep_token_id": 77185,
"pad_token_id": 77185,
"bos_token_id": 77185,
"eos_token_id": 77185,
"mask_token_id": 77185,
"vocab_size": 135040
}

View file

@ -0,0 +1,119 @@
# Yuan2
In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on Yuan2 models on [Intel GPUs](../README.md). For illustration purposes, we utilize the [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf) as a reference Yuan2 model.
## 0. Requirements
To run these examples with BigDL-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
In addition, you need to modify some files in the Yuan2-2B-hf folder, since the Flash Attention dependency is for CUDA usage and currently cannot be installed on Intel GPUs. To manually turn it off, please refer to [this issue](https://github.com/IEIT-Yuan/Yuan-2.0/issues/92). We also provide two modified files ([config.json](yuan2-2B-instruct/config.json) and [yuan_hf_model.py](yuan2-2B-instruct/yuan_hf_model.py)), which can be used to replace the original config.json and yuan_hf_model.py.
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Yuan2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for Yuan2 to conduct generation
pip install pandas # additional package required for Yuan2 to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for Yuan2 to conduct generation
```
### 2. Configure oneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series, or Pro A60, it may take several minutes to compile.
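
For scripted launches, the per-device settings above can be collected in one table. The sketch below does that in Python. The device labels and the helper name are hypothetical, and the values are copied from sections 3.1 and 3.2. The Max Series entry is left out because it relies on `LD_PRELOAD`, which has to be set before the process starts. Whether setting the remaining variables from inside Python has the same effect as setting them in the shell is an assumption, so the shell commands above remain the documented route.

```python
import os

# Values copied from sections 3.1 and 3.2; the keys are hypothetical labels.
RUNTIME_ENV = {
    "linux-arc-flex": {
        "USE_XETLA": "OFF",
        "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS": "1",
    },
    "windows-igpu": {
        "SYCL_CACHE_PERSISTENT": "1",
        "BIGDL_LLM_XMX_DISABLED": "1",
    },
    "windows-arc-a300-or-pro-a60": {"SYCL_CACHE_PERSISTENT": "1"},
    "windows-other-dgpu": {},
}


def apply_runtime_env(device, environ=os.environ):
    """Copy the recommended variables for `device` into `environ` and
    return the settings that were applied."""
    settings = RUNTIME_ENV[device]
    environ.update(settings)
    return settings
```

Passing a plain dict as `environ` makes it possible to inspect the result without touching the real process environment.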
### 4. Running examples
```bash
python ./generate.py
```
In the example, several arguments can be passed to satisfy your requirements:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the Yuan2 model to be downloaded, or the path to the Hugging Face checkpoint folder. Defaults to `'IEITYuan/Yuan2-2B-hf'`.
- `--prompt PROMPT`: the prompt to run inference on (with integrated prompt format for chat). Defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: the maximum number of tokens to predict. Defaults to `100`.
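
For reference, `generate.py` wraps the prompt in `YUAN2_PROMPT_FORMAT` before tokenizing it. The standalone sketch below reproduces that template as it appears in the script (a newline before and after the prompt) and prints the resulting string. The `format_prompt` wrapper is a hypothetical convenience, not part of the example.

```python
# Template copied from generate.py
YUAN2_PROMPT_FORMAT = """
{prompt}
"""


def format_prompt(prompt):
    """Return the prompt string that would be handed to the tokenizer."""
    return YUAN2_PROMPT_FORMAT.format(prompt=prompt)


print(repr(format_prompt("What is AI?")))
```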
#### Sample Output
#### [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf)
```log
Inference time: xxxx seconds
-------------------- Output --------------------
What is AI?
AI is a field of technology and technologies that is used to analyze and improve human behavior such as language processing, machine learning and artificial intelligence (AI).<eod>
```
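
The reported inference time excludes a warmup pass: the GPU `generate.py` calls `generate()` once untimed because the first call is slow, then times a second call. A minimal sketch of that pattern, independent of any model, is shown below. The helper name is hypothetical, and it uses `time.perf_counter()` where the script uses `time.time()`.

```python
import time


def time_after_warmup(fn, warmup_runs=1):
    """Call `fn` `warmup_runs` times untimed, then time one more call.

    Returns the result of the timed call and the elapsed seconds.
    """
    for _ in range(warmup_runs):
        fn()
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed
```

In the script's terms, `fn` would be a zero-argument callable wrapping the `model.generate(...)` call.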

View file

@ -0,0 +1,78 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import torch, transformers
import sys, os, time
import intel_extension_for_pytorch as ipex
import argparse
from transformers import LlamaTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM
# Refer to https://huggingface.co/IEITYuan/Yuan2-2B-hf#Usage
YUAN2_PROMPT_FORMAT = """
{prompt}
"""
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate text using Yuan2-2B model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="IEITYuan/Yuan2-2B-hf",
                        help='The huggingface repo id for the Yuan2 to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt for the model')
    parser.add_argument('--n-predict', type=int, default=100,
                        help='Number of tokens to generate')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    # Load tokenizer
    print("Creating tokenizer...")
    tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
    tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>',
                          '<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)
    # Load model in 4 bit,
    # which convert the relevant layers in the model into INT4 format
    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
    print("Creating model...")
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True).eval()
    # Convert the model to xpu
    model = model.to('xpu')
    prompt = YUAN2_PROMPT_FORMAT.format(prompt=args.prompt)
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    # Convert the inputs to xpu
    inputs = inputs.to('xpu')
    # Default warmup since the first generate() is slow
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_length=args.n_predict)
    print('Finish warmup')
    # Measure the inference time
    start_time = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_length=args.n_predict)
    end_time = time.time()
    output_str = tokenizer.decode(outputs[0])
    print(f'Inference time: {end_time - start_time} seconds')
    print('-'*20, 'Output', '-'*20)
    print(output_str)

View file

@ -0,0 +1,39 @@
{
"_from_model_config":true,
"architectures": [
"YuanForCausalLM"
],
"auto_map":{
"AutoConfig":"configuration_yuan.YuanConfig",
"AutoModelForCausalLM":"yuan_hf_model.YuanForCausalLM"
},
"tokenizer_class":"YuanTokenizer",
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 8192,
"max_position_embeddings": 8192,
"model_type": "yuan",
"num_attention_heads": 32,
"num_hidden_layers": 24,
"rms_norm_eps": 1e-06,
"dropout": 0.1,
"tie_word_embeddings": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.30.0.dev0",
"use_cache": true,
"causal_mask": true,
"use_flash_attention": false,
"reset_attention_mask": true,
"reset_position_ids": true,
"use_loss_mask": false,
"eod_token": 77185,
"sep_token": 77187,
"eod_token_id": 77185,
"sep_token_id": 77185,
"pad_token_id": 77185,
"bos_token_id": 77185,
"eos_token_id": 77185,
"mask_token_id": 77185,
"vocab_size": 135040
}

View file

@ -0,0 +1,122 @@
# Yuan2
In this directory, you will find examples on how you could apply BigDL-LLM `optimize_model` API to accelerate Yuan2 models on [Intel GPUs](../README.md). For illustration purposes, we utilize the [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf) as a reference Yuan2 model.
## 0. Requirements
To run these examples with BigDL-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
In addition, you need to modify some files in the Yuan2-2B-hf folder, since the Flash Attention dependency is for CUDA usage and currently cannot be installed on Intel GPUs. To manually turn it off, please refer to [this issue](https://github.com/IEIT-Yuan/Yuan-2.0/issues/92). We also provide two modified files ([config.json](yuan2-2B-instruct/config.json) and [yuan_hf_model.py](yuan2-2B-instruct/yuan_hf_model.py)), which can be used to replace the original config.json and yuan_hf_model.py.
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Yuan2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
After installing conda, create a Python environment for BigDL-LLM:
```bash
conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for Yuan2 to conduct generation
pip install pandas # additional package required for Yuan2 to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for Yuan2 to conduct generation
```
### 2. Configure oneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```
#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>
<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
For optimal performance on Arc, it is recommended to set several environment variables.
```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>
<details>
<summary>For Intel Data Center GPU Max Series</summary>
```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>
<summary>For Intel iGPU</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```
</details>
<details>
<summary>For Intel Arc™ A300-Series or Pro A60</summary>
```cmd
set SYCL_CACHE_PERSISTENT=1
```
</details>
<details>
<summary>For other Intel dGPU Series</summary>
There is no need to set further environment variables.
</details>
> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series, or Pro A60, it may take several minutes to compile.
### 4. Running examples
```bash
python ./generate.py
```
In the example, several arguments can be passed to satisfy your requirements:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the Yuan2 model to download, or the path to a local Hugging Face checkpoint folder. Defaults to `'IEITYuan/Yuan2-2B-hf'`.
- `--prompt PROMPT`: the prompt to run inference on (with integrated prompt format for chat). Defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: the maximum number of tokens to predict. Defaults to `100`. The script passes this value to `generate()` as `max_length`, so the limit also counts the prompt tokens.
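
As a rough illustration of how these flags combine, the sketch below rebuilds the same three options with `argparse` and parses a sample command line. It mirrors the options described above but is a standalone sketch rather than the script itself; the `build_parser` name and the sample values are assumptions made for illustration.

```python
import argparse


def build_parser():
    # Same three options and defaults as described in the list above.
    parser = argparse.ArgumentParser(description="Generate text using Yuan2-2B model")
    parser.add_argument("--repo-id-or-model-path", type=str, default="IEITYuan/Yuan2-2B-hf")
    parser.add_argument("--prompt", type=str, default="What is AI?")
    parser.add_argument("--n-predict", type=int, default=100)
    return parser


if __name__ == "__main__":
    # Equivalent to: python ./generate.py --prompt "What is deep learning?" --n-predict 64
    args = build_parser().parse_args(["--prompt", "What is deep learning?", "--n-predict", "64"])
    # argparse turns the dashes in flag names into underscores on the namespace.
    print(args.repo_id_or_model_path, args.prompt, args.n_predict)
```

Any flag that is omitted falls back to its default, so running with no arguments uses `IEITYuan/Yuan2-2B-hf`, the prompt `What is AI?`, and a limit of `100`.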
#### Sample Output
#### [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf)
```log
Inference time: xxxx seconds
-------------------- Output --------------------
What is AI?
AI is the process of creating machines that can interact with humans with their minds and learn and understand them. It enables us to think about ideas and ideas, and then we can analyze them and come up with new ideas. It's not so much that you need to be an AI as an individual, you can be an AI, just as you think.<sep> 人工智能AI是一种计算机程序它可以帮助我们思考和学习从而让我们更好地理解人类的
```
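
The sample output above continues past the `<sep>` token. If only the first segment is wanted, the decoded string can be cut at the first marker. This is a minimal post-processing sketch and not part of `generate.py`: the `trim_at_markers` helper is hypothetical, and treating `<sep>` and `<eod>` as stop markers is an assumption based on the sample output and on the tokenizer setup used in this example.

```python
def trim_at_markers(text, markers=("<sep>", "<eod>")):
    """Cut text at the earliest occurrence of any marker (hypothetical helper)."""
    positions = [text.find(m) for m in markers if m in text]
    return text[: min(positions)].rstrip() if positions else text


if __name__ == "__main__":
    decoded = "What is AI?\nAI is ...<sep> more text<eod>"
    print(trim_at_markers(decoded))
```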


@ -0,0 +1,80 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import torch, transformers
import sys, os, time
import intel_extension_for_pytorch as ipex
import argparse
from transformers import LlamaTokenizer, AutoModelForCausalLM
from bigdl.llm import optimize_model
# Refer to https://huggingface.co/IEITYuan/Yuan2-2B-hf#Usage
YUAN2_PROMPT_FORMAT = """
{prompt}
"""
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate text using Yuan2-2B model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="IEITYuan/Yuan2-2B-hf",
                        help='The huggingface repo id for the Yuan2 to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt for the model')
    parser.add_argument('--n-predict', type=int, default=100,
                        help='Number of tokens to generate')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load tokenizer
    print("Creating tokenizer...")
    tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
    tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>',
                          '<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)

    # Load model
    print("Creating model...")
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype='auto', low_cpu_mem_usage=True).eval()

    # With only one line to enable BigDL-LLM optimization on model
    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
    model = optimize_model(model)

    # Convert the model to xpu
    model = model.to('xpu')

    prompt = YUAN2_PROMPT_FORMAT.format(prompt=args.prompt)
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    # Convert the inputs to xpu
    inputs = inputs.to('xpu')

    # Default warmup since the first generate() is slow
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_length=args.n_predict)
    print('Finish warmup')

    # Measure the inference time
    start_time = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_length=args.n_predict)
    end_time = time.time()

    output_str = tokenizer.decode(outputs[0])
    print(f'Inference time: {end_time - start_time} seconds')
    print('-'*20, 'Output', '-'*20)
    print(output_str)


@ -0,0 +1,39 @@
{
  "_from_model_config":true,
  "architectures": [
    "YuanForCausalLM"
  ],
  "auto_map":{
    "AutoConfig":"configuration_yuan.YuanConfig",
    "AutoModelForCausalLM":"yuan_hf_model.YuanForCausalLM"
  },
  "tokenizer_class":"YuanTokenizer",
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 8192,
  "model_type": "yuan",
  "num_attention_heads": 32,
  "num_hidden_layers": 24,
  "rms_norm_eps": 1e-06,
  "dropout": 0.1,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.30.0.dev0",
  "use_cache": true,
  "causal_mask": true,
  "use_flash_attention": false,
  "reset_attention_mask": true,
  "reset_position_ids": true,
  "use_loss_mask": false,
  "eod_token": 77185,
  "sep_token": 77187,
  "eod_token_id": 77185,
  "sep_token_id": 77185,
  "pad_token_id": 77185,
  "bos_token_id": 77185,
  "eos_token_id": 77185,
  "mask_token_id": 77185,
  "vocab_size": 135040
}