Add CPU and GPU examples for Yuan2-2B-hf (#9946)
* Add a new CPU example of Yuan2-2B-hf
* Add a new CPU generate.py of Yuan2-2B-hf example
* Add a new GPU example of Yuan2-2B-hf
* Add Yuan2 to README table
* In CPU example: 1. Use English as default prompt; 2. Provide modified files in yuan2-2B-instruct
* In GPU example: 1. Use English as default prompt; 2. Provide modified files
* GPU example: update README
* update Yuan2-2B-hf in README table
* Add CPU example for Yuan2-2B in Pytorch-Models
* Add GPU example for Yuan2-2B in Pytorch-Models
* Add license in generate.py; Modify README
* In GPU Add license in generate.py; Modify README
* In CPU yuan2 modify README
* In GPU yuan2 modify README
* In CPU yuan2 modify README
* In GPU example, updated the readme for Windows GPU supports
* In GPU torch example, updated the readme for Windows GPU supports
* GPU hf example README modified
* GPU example README modified
parent f1f4094a09
commit a2c1675546
18 changed files with 5435 additions and 0 deletions
@ -191,6 +191,7 @@ Over 40 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLa
| SpeechT5 | | [link](python/llm/example/GPU/PyTorch-Models/Model/speech-t5) |
| Ziya-Coding-34B-v1.0 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/ziya) | |
| Phi-2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/phi-2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2) |
| Yuan2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/yuan2) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/yuan2) |

***For more details, please refer to the `bigdl-llm` [Document](https://test-bigdl-llm.readthedocs.io/en/main/doc/LLM/index.html), [Readme](python/llm), [Tutorial](https://github.com/intel-analytics/bigdl-llm-tutorial) and [API Doc](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/index.html).***
@ -83,6 +83,7 @@ Over 20 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLa
| SpeechT5 | | [link](example/GPU/PyTorch-Models/Model/speech-t5) |
| Ziya-Coding-34B-v1.0 | [link](example/CPU/HF-Transformers-AutoModels/Model/ziya) | |
| Phi-2 | [link](example/CPU/HF-Transformers-AutoModels/Model/phi-2) | [link](example/GPU/HF-Transformers-AutoModels/Model/phi-2) |
| Yuan2 | [link](example/CPU/HF-Transformers-AutoModels/Model/yuan2) | [link](example/GPU/HF-Transformers-AutoModels/Model/yuan2) |

### Working with `bigdl-llm`
@ -0,0 +1,65 @@
# Yuan2
In this directory, you will find examples of how to apply BigDL-LLM INT4 optimizations to Yuan2 models. For illustration purposes, we use [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf) as a reference Yuan2 model.

## 0. Requirements
To run these examples with BigDL-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.

In addition, you need to modify some files in the Yuan2-2B-hf folder, since the Flash Attention dependency is for CUDA usage and currently cannot be installed on Intel CPUs. To manually turn it off, please refer to [this issue](https://github.com/IEIT-Yuan/Yuan-2.0/issues/92). We also provide two modified files ([config.json](yuan2-2B-instruct/config.json) and [yuan_hf_model.py](yuan2-2B-instruct/yuan_hf_model.py)), which can be used to replace the original config.json and yuan_hf_model.py.
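
As a rough illustration of the configuration half of that change, the sketch below sets `use_flash_attention` to `false` in a local copy of `config.json`, which is the value the modified config.json in this example uses. The helper name `disable_flash_attention` is hypothetical, and the sketch only edits the config file; `yuan_hf_model.py` still has to be adjusted as described in the linked issue (or replaced with the provided file).

```python
import json
import tempfile
from pathlib import Path


def disable_flash_attention(checkpoint_dir: str) -> dict:
    """Set "use_flash_attention" to false in <checkpoint_dir>/config.json.

    Hypothetical helper for illustration only: it rewrites the config file
    and does not touch yuan_hf_model.py.
    """
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["use_flash_attention"] = False
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


if __name__ == "__main__":
    # Demonstrate on a throwaway folder instead of a real checkpoint.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "config.json").write_text('{"use_flash_attention": true}')
        print(disable_flash_attention(tmp))
```

For a real checkpoint, the function would be pointed at the local Yuan2-2B-hf folder that is later passed to `--repo-id-or-model-path`.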

## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Yuan2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations.
### 1. Install
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:
```bash
conda create -n llm python=3.9
conda activate llm

pip install --pre --upgrade bigdl-llm[all] # install the latest bigdl-llm nightly build with 'all' option
pip install einops # additional package required for Yuan2 to conduct generation
pip install pandas # additional package required for Yuan2 to conduct generation
```

### 2. Run
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id for the Yuan2 model to be downloaded, or the path to the Hugging Face checkpoint folder. It defaults to `'IEITYuan/Yuan2-2B-hf'`.
- `--prompt PROMPT`: the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: the max number of tokens to predict. It defaults to `100`.

> **Note**: When loading the model in 4-bit, BigDL-LLM converts linear layers in the model into INT4 format. In theory, an *X*B model saved in 16-bit requires approximately 2*X* GB of memory for loading, and ~0.5*X* GB of memory for further inference.
>
> Please select the appropriate size of the Yuan2 model based on the capabilities of your machine.
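
The rule of thumb in the note can be turned into a quick back-of-the-envelope calculation. The sketch below is only that arithmetic (2 bytes per parameter for 16-bit weights, 0.5 bytes per parameter for INT4 weights); the function name `estimate_memory_gb` is hypothetical, and the estimate ignores activations, the key/value cache, and framework overhead.

```python
def estimate_memory_gb(params_billion: float) -> tuple[float, float]:
    """Return (loading_gb, inference_gb) using the 2X / 0.5X rule of thumb."""
    loading_gb = 2.0 * params_billion    # 16-bit weights: 2 bytes per parameter
    inference_gb = 0.5 * params_billion  # INT4 weights: 0.5 bytes per parameter
    return loading_gb, inference_gb


if __name__ == "__main__":
    load, infer = estimate_memory_gb(2)  # e.g. a 2B-parameter model
    print(f"~{load:.1f} GB to load, ~{infer:.1f} GB for inference")
    # prints "~4.0 GB to load, ~1.0 GB for inference"
```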

#### 2.1 Client
On a client Windows machine, it is recommended to run directly with full utilization of all cores:
```powershell
python ./generate.py
```

#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and run the example with all the physical cores of a single socket.

E.g. on Linux,
```bash
# set BigDL-LLM env variables
source bigdl-llm-init

# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python ./generate.py
```
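
For machines with a different core count, the thread setting and the `numactl` core range follow from the number of physical cores per socket. The sketch below derives both; the helper name `single_socket_command` is hypothetical, and it assumes core IDs are numbered contiguously per socket, which is worth checking with `lscpu` or `numactl -H` before relying on it.

```python
def single_socket_command(cores_per_socket: int, socket: int = 0,
                          script: str = "./generate.py") -> tuple[dict, str]:
    """Build the OMP_NUM_THREADS setting and numactl command for one socket.

    Assumes socket k owns core IDs k*cores_per_socket .. (k+1)*cores_per_socket - 1.
    """
    first = socket * cores_per_socket
    last = first + cores_per_socket - 1
    env = {"OMP_NUM_THREADS": str(cores_per_socket)}
    command = f"numactl -C {first}-{last} -m {socket} python {script}"
    return env, command


if __name__ == "__main__":
    env, command = single_socket_command(48)
    print(env)      # {'OMP_NUM_THREADS': '48'}
    print(command)  # numactl -C 0-47 -m 0 python ./generate.py
```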

#### 2.3 Sample Output
#### [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf)
```log
Inference time: xxxx seconds
-------------------- Output --------------------

What is AI?
AI is what we call "Artificial Intelligence."<eod>
```
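
The decoded text ends with the `<eod>` marker that the example configures as the end-of-sequence token. If the marker is not wanted in downstream text, it can be cut off after decoding; the sketch below is one way to do that with plain string handling, and the helper name `strip_eod` is hypothetical rather than part of the example script.

```python
EOD_TOKEN = "<eod>"


def strip_eod(text: str, eod: str = EOD_TOKEN) -> str:
    """Cut decoded text at the first end-of-document marker, if present."""
    head, _, _ = text.partition(eod)
    return head.rstrip()


if __name__ == "__main__":
    print(strip_eod('AI is what we call "Artificial Intelligence."<eod>'))
```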
@ -0,0 +1,67 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch, transformers
import sys, os, time
import argparse
from transformers import LlamaTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

# Refer to https://huggingface.co/IEITYuan/Yuan2-2B-hf#Usage
YUAN2_PROMPT_FORMAT = """
{prompt}
"""

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate text using Yuan2-2B model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="IEITYuan/Yuan2-2B-hf",
                        help='The huggingface repo id for the Yuan2 to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt for the model')
    parser.add_argument('--n-predict', type=int, default=100,
                        help='Number of tokens to generate')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load tokenizer
    print("Creating tokenizer...")
    tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
    tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>',
                          '<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    print("Creating model...")
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cpu", trust_remote_code=True, load_in_4bit=True).eval()

    prompt = YUAN2_PROMPT_FORMAT.format(prompt=args.prompt)
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]

    # Measure the inference time
    start_time = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
    # `max_new_tokens` counts only generated tokens, matching the meaning of --n-predict
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_new_tokens=args.n_predict)
    end_time = time.time()

    output_str = tokenizer.decode(outputs[0])
    print(f'Inference time: {end_time - start_time} seconds')
    print('-'*20, 'Output', '-'*20)
    print(output_str)
@ -0,0 +1,39 @@
{
  "_from_model_config":true,
  "architectures": [
    "YuanForCausalLM"
  ],
  "auto_map":{
    "AutoConfig":"configuration_yuan.YuanConfig",
    "AutoModelForCausalLM":"yuan_hf_model.YuanForCausalLM"
  },
  "tokenizer_class":"YuanTokenizer",
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 8192,
  "model_type": "yuan",
  "num_attention_heads": 32,
  "num_hidden_layers": 24,
  "rms_norm_eps": 1e-06,
  "dropout": 0.1,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.30.0.dev0",
  "use_cache": true,
  "causal_mask": true,
  "use_flash_attention": false,
  "reset_attention_mask": true,
  "reset_position_ids": true,
  "use_loss_mask": false,
  "eod_token": 77185,
  "sep_token": 77187,
  "eod_token_id": 77185,
  "sep_token_id": 77185,
  "pad_token_id": 77185,
  "bos_token_id": 77185,
  "eos_token_id": 77185,
  "mask_token_id": 77185,
  "vocab_size": 135040
}
File diff suppressed because it is too large
61
python/llm/example/CPU/PyTorch-Models/Model/yuan2/README.md
Normal file
@ -0,0 +1,61 @@
# Yuan2
In this directory, you will find examples of how to apply the BigDL-LLM `optimize_model` API to accelerate Yuan2 models. For illustration purposes, we use [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf) as a reference Yuan2 model.

## 0. Requirements
To run these examples with BigDL-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.

In addition, you need to modify some files in the Yuan2-2B-hf folder, since the Flash Attention dependency is for CUDA usage and currently cannot be installed on Intel CPUs. To manually turn it off, please refer to [this issue](https://github.com/IEIT-Yuan/Yuan-2.0/issues/92). We also provide two modified files ([config.json](yuan2-2B-instruct/config.json) and [yuan_hf_model.py](yuan2-2B-instruct/yuan_hf_model.py)), which can be used to replace the original config.json and yuan_hf_model.py.

## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Yuan2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations.
### 1. Install
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:
```bash
conda create -n llm python=3.9
conda activate llm

pip install --pre --upgrade bigdl-llm[all] # install the latest bigdl-llm nightly build with 'all' option
pip install einops # additional package required for Yuan2 to conduct generation
pip install pandas # additional package required for Yuan2 to conduct generation
```

### 2. Run
```
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
```

Arguments info:
- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id for the Yuan2 model to be downloaded, or the path to the Hugging Face checkpoint folder. It defaults to `'IEITYuan/Yuan2-2B-hf'`.
- `--prompt PROMPT`: the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: the max number of tokens to predict. It defaults to `100`.

#### 2.1 Client
On a client Windows machine, it is recommended to run directly with full utilization of all cores:
```powershell
python ./generate.py
```

#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and run the example with all the physical cores of a single socket.

E.g. on Linux,
```bash
# set BigDL-LLM env variables
source bigdl-llm-init

# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 python ./generate.py
```

#### 2.3 Sample Output
#### [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf)
```log
Inference time: xxxx seconds
-------------------- Output --------------------

What is AI?
The term "AI" refers to a process that involves creating machines or devices that can perform tasks that typically require human intelligence, such as AI-based decision-making and machine learning. AI is rapidly advancing in the fields of machine learning, computer science, and artificial intelligence, and has been used in various fields to achieve various goals, such as improving accuracy, efficiency, and complexity. However, the
```
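
The sample output begins with the prompt because, for a decoder-only model, `generate()` returns the prompt token ids followed by the newly generated ids. To show only the completion, the output can be split at the prompt length before decoding. The sketch below illustrates the split on plain lists of ids; the helper name `split_prompt_and_completion` and the example ids are hypothetical.

```python
def split_prompt_and_completion(output_ids, prompt_length):
    """Split one generated sequence into (prompt ids, newly generated ids)."""
    prompt_ids = list(output_ids[:prompt_length])
    completion_ids = list(output_ids[prompt_length:])
    return prompt_ids, completion_ids


if __name__ == "__main__":
    # Made-up ids: the first three stand for the prompt, the rest for the completion.
    prompt_ids, completion_ids = split_prompt_and_completion([11, 12, 13, 21, 22], 3)
    print(prompt_ids, completion_ids)
```

In the example script, the prompt length would correspond to `inputs.shape[1]` and the sequence to `outputs[0]`.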
@ -0,0 +1,69 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch, transformers
import sys, os, time
import argparse
from transformers import LlamaTokenizer, AutoModelForCausalLM
from bigdl.llm import optimize_model

# Refer to https://huggingface.co/IEITYuan/Yuan2-2B-hf#Usage
YUAN2_PROMPT_FORMAT = """
{prompt}
"""

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate text using Yuan2-2B model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="IEITYuan/Yuan2-2B-hf",
                        help='The huggingface repo id for the Yuan2 to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt for the model')
    parser.add_argument('--n-predict', type=int, default=100,
                        help='Number of tokens to generate')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load tokenizer
    print("Creating tokenizer...")
    tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
    tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>',
                          '<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)

    # Load model
    print("Creating model...")
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cpu", trust_remote_code=True, torch_dtype=torch.float16).eval()

    # With only one line to enable BigDL-LLM optimization on model
    model = optimize_model(model)

    prompt = YUAN2_PROMPT_FORMAT.format(prompt=args.prompt)
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]

    # Measure the inference time
    start_time = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
    # `max_new_tokens` counts only generated tokens, matching the meaning of --n-predict
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_new_tokens=args.n_predict)
    end_time = time.time()

    output_str = tokenizer.decode(outputs[0])
    print(f'Inference time: {end_time - start_time} seconds')
    print('-'*20, 'Output', '-'*20)
    print(output_str)
@ -0,0 +1,39 @@
{
  "_from_model_config":true,
  "architectures": [
    "YuanForCausalLM"
  ],
  "auto_map":{
    "AutoConfig":"configuration_yuan.YuanConfig",
    "AutoModelForCausalLM":"yuan_hf_model.YuanForCausalLM"
  },
  "tokenizer_class":"YuanTokenizer",
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 8192,
  "model_type": "yuan",
  "num_attention_heads": 32,
  "num_hidden_layers": 24,
  "rms_norm_eps": 1e-06,
  "dropout": 0.1,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.30.0.dev0",
  "use_cache": true,
  "causal_mask": true,
  "use_flash_attention": false,
  "reset_attention_mask": true,
  "reset_position_ids": true,
  "use_loss_mask": false,
  "eod_token": 77185,
  "sep_token": 77187,
  "eod_token_id": 77185,
  "sep_token_id": 77185,
  "pad_token_id": 77185,
  "bos_token_id": 77185,
  "eos_token_id": 77185,
  "mask_token_id": 77185,
  "vocab_size": 135040
}
File diff suppressed because it is too large
@ -0,0 +1,119 @@
# Yuan2
In this directory, you will find examples of how to apply BigDL-LLM INT4 optimizations to Yuan2 models on [Intel GPUs](../README.md). For illustration purposes, we use [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf) as a reference Yuan2 model.

## 0. Requirements
To run these examples with BigDL-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.

In addition, you need to modify some files in the Yuan2-2B-hf folder, since the Flash Attention dependency is for CUDA usage and currently cannot be installed on Intel GPUs. To manually turn it off, please refer to [this issue](https://github.com/IEIT-Yuan/Yuan-2.0/issues/92). We also provide two modified files ([config.json](yuan2-2B-instruct/config.json) and [yuan_hf_model.py](yuan2-2B-instruct/yuan_hf_model.py)), which can be used to replace the original config.json and yuan_hf_model.py.

## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Yuan2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for Yuan2 to conduct generation
pip install pandas # additional package required for Yuan2 to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for Yuan2 to conduct generation
```

### 2. Configure oneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```

</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile.
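
Because the first run can include one-off compilation, a timing taken on that run says little about steady-state speed, which is why the example script performs a warm-up `generate()` call before measuring. The pattern can be sketched generically as below; the helper name `time_after_warmup` is hypothetical, and on a GPU it is assumed that the timed callable itself waits for the device to finish before returning.

```python
import time


def time_after_warmup(fn, warmup_runs: int = 1):
    """Call fn() warmup_runs times untimed, then time one more call.

    Returns (result of the timed call, elapsed seconds).
    """
    for _ in range(warmup_runs):
        fn()  # warm-up: absorbs one-off costs such as compilation
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Stand-in workload; in practice fn would wrap the generate() call.
    result, seconds = time_after_warmup(lambda: sum(range(100_000)))
    print(f"Inference time: {seconds} seconds")
```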
### 4. Running examples

```bash
python ./generate.py
```

In the example, several arguments can be passed to satisfy your requirements:

- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id for the Yuan2 model to be downloaded, or the path to the Hugging Face checkpoint folder. It defaults to `'IEITYuan/Yuan2-2B-hf'`.
- `--prompt PROMPT`: the prompt to be inferred (with integrated prompt format for chat). It defaults to `'What is AI?'`.
- `--n-predict N_PREDICT`: the max number of tokens to predict. It defaults to `100`.

#### Sample Output
#### [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf)
```log
Inference time: xxxx seconds
-------------------- Output --------------------

What is AI?
AI is a field of technology and technologies that is used to analyze and improve human behavior such as language processing, machine learning and artificial intelligence (AI).<eod>
```
@ -0,0 +1,78 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch, transformers
import sys, os, time
import intel_extension_for_pytorch as ipex
import argparse
from transformers import LlamaTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

# Refer to https://huggingface.co/IEITYuan/Yuan2-2B-hf#Usage
YUAN2_PROMPT_FORMAT = """
{prompt}
"""

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate text using Yuan2-2B model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="IEITYuan/Yuan2-2B-hf",
                        help='The huggingface repo id for the Yuan2 to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt for the model')
    parser.add_argument('--n-predict', type=int, default=100,
                        help='Number of tokens to generate')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load tokenizer
    print("Creating tokenizer...")
    tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
    tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>',
                          '<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)

    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
    print("Creating model...")
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True).eval()
    # Convert the model to xpu
    model = model.to('xpu')

    prompt = YUAN2_PROMPT_FORMAT.format(prompt=args.prompt)
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    # Convert the inputs to xpu
    inputs = inputs.to('xpu')

    # Default warmup since the first generate() is slow
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_new_tokens=args.n_predict)
    print('Finish warmup')

    # Measure the inference time
    start_time = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
    # `max_new_tokens` counts only generated tokens, matching the meaning of --n-predict
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_new_tokens=args.n_predict)
    # Wait for queued XPU work to finish so the measured time covers the whole generation
    torch.xpu.synchronize()
    end_time = time.time()

    output_str = tokenizer.decode(outputs[0])
    print(f'Inference time: {end_time - start_time} seconds')
    print('-'*20, 'Output', '-'*20)
    print(output_str)
@@ -0,0 +1,39 @@
{
  "_from_model_config":true,
  "architectures": [
    "YuanForCausalLM"
  ],
  "auto_map":{
    "AutoConfig":"configuration_yuan.YuanConfig",
    "AutoModelForCausalLM":"yuan_hf_model.YuanForCausalLM"
  },
  "tokenizer_class":"YuanTokenizer",
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 8192,
  "model_type": "yuan",
  "num_attention_heads": 32,
  "num_hidden_layers": 24,
  "rms_norm_eps": 1e-06,
  "dropout": 0.1,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.30.0.dev0",
  "use_cache": true,
  "causal_mask": true,
  "use_flash_attention": false,
  "reset_attention_mask": true,
  "reset_position_ids": true,
  "use_loss_mask": false,
  "eod_token": 77185,
  "sep_token": 77187,
  "eod_token_id": 77185,
  "sep_token_id": 77185,
  "pad_token_id": 77185,
  "bos_token_id": 77185,
  "eos_token_id": 77185,
  "mask_token_id": 77185,
  "vocab_size": 135040
}
File diff suppressed because it is too large
122
python/llm/example/GPU/PyTorch-Models/Model/yuan2/README.md
Normal file
@@ -0,0 +1,122 @@
# Yuan2
This directory contains examples of applying the BigDL-LLM `optimize_model` API to accelerate Yuan2 models on [Intel GPUs](../README.md). For illustration, we use [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf) as the reference Yuan2 model.

## 0. Requirements
To run these examples with BigDL-LLM, your machine should meet some recommended requirements; see [here](../README.md#recommended-requirements) for details.

In addition, you need to modify some files in the Yuan2-2B-hf folder, because the Flash Attention dependency targets CUDA and currently cannot be installed on Intel GPUs. To turn it off manually, please refer to [this issue](https://github.com/IEIT-Yuan/Yuan-2.0/issues/92). We also provide two modified files ([config.json](yuan2-2B-instruct/config.json) and [yuan_hf_model.py](yuan2-2B-instruct/yuan_hf_model.py)) that can replace the original config.json and yuan_hf_model.py.
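
The config.json half of that change can be sketched in Python. This is a minimal sketch, not the project's own tooling: it assumes the `use_flash_attention` key seen in the provided config.json is the only setting to flip there, the helper name `disable_flash_attention` is hypothetical, and it does not cover the matching edits to yuan_hf_model.py.

```python
import json
from pathlib import Path


def disable_flash_attention(config_path):
    """Set "use_flash_attention" to false in a Yuan2 config.json and return the updated dict.

    Hypothetical helper for illustration; it only touches config.json.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["use_flash_attention"] = False
    path.write_text(json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
    return config


# Example call (the path is a placeholder for your local checkpoint folder):
# disable_flash_attention("/path/to/Yuan2-2B-hf/config.json")
```

Replacing the two files with the provided copies remains the simpler route; the sketch only shows which setting in config.json is involved.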

## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Yuan2 model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations on Intel GPUs.
### 1. Install
#### 1.1 Installation on Linux
We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).

After installing conda, create a Python environment for BigDL-LLM:
```bash
conda create -n llm python=3.9
conda activate llm

pip install --pre --upgrade bigdl-llm[all] # install the latest bigdl-llm nightly build with 'all' option
pip install einops # additional package required for Yuan2 to conduct generation
pip install pandas # additional package required for Yuan2 to conduct generation
```
#### 1.2 Installation on Windows
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9 libuv
conda activate llm
# the command below installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install einops # additional package required for Yuan2 to conduct generation
```

### 2. Configure oneAPI environment variables
#### 2.1 Configurations for Linux
```bash
source /opt/intel/oneapi/setvars.sh
```

#### 2.2 Configurations for Windows
```cmd
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
```
> Note: Please make sure you are using **CMD** (**Anaconda Prompt** if using conda) to run the command, as PowerShell is not supported.
### 3. Runtime Configurations
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
#### 3.1 Configurations for Linux
<details>

<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>

For optimal performance on Arc, it is recommended to set several environment variables.

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
</details>

<details>

<summary>For Intel Data Center GPU Max Series</summary>

```bash
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ENABLE_SDP_FUSION=1
```
> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.
</details>
#### 3.2 Configurations for Windows
<details>

<summary>For Intel iGPU</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
```

</details>

<details>

<summary>For Intel Arc™ A300-Series or Pro A60</summary>

```cmd
set SYCL_CACHE_PERSISTENT=1
```

</details>

<details>

<summary>For other Intel dGPU Series</summary>

There is no need to set further environment variables.

</details>

> Note: The first time each model runs on an Intel iGPU, Intel Arc™ A300-Series, or Pro A60, compilation may take several minutes.
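
If you prefer to launch the example from a Python script rather than typing the variables into a shell, the per-device settings above can be collected into a lookup table. This is only a sketch under stated assumptions: the device keys and the `child_env` helper are names invented here, `LD_PRELOAD` for the Max Series is left out because it depends on your conda prefix, and the oneAPI `setvars` step from section 2 is still expected to have been run in the parent shell.

```python
import os

# Values copied from the runtime configuration sections above.
# The dictionary keys are labels invented for this sketch.
RUNTIME_ENV = {
    "linux-arc-flex": {
        "USE_XETLA": "OFF",
        "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS": "1",
    },
    "linux-max": {
        "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS": "1",
        "ENABLE_SDP_FUSION": "1",
    },
    "windows-igpu": {
        "SYCL_CACHE_PERSISTENT": "1",
        "BIGDL_LLM_XMX_DISABLED": "1",
    },
    "windows-arc-a300-or-pro-a60": {"SYCL_CACHE_PERSISTENT": "1"},
    "windows-other-dgpu": {},
}


def child_env(device, base=None):
    """Return a copy of the base environment with the settings for `device` applied."""
    env = dict(os.environ if base is None else base)
    env.update(RUNTIME_ENV[device])
    return env


# Example launch (uncomment to run generate.py in a child process):
# import subprocess, sys
# subprocess.run([sys.executable, "./generate.py"], env=child_env("linux-arc-flex"), check=True)
```

Passing the variables through `env=` to a child process keeps them out of your interactive shell, which can be convenient when switching between devices.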

### 4. Running examples

```bash
python ./generate.py
```

The example accepts the following arguments:

- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: the Hugging Face repo id of the Yuan2 model to download, or the path to a Hugging Face checkpoint folder. The default is `'IEITYuan/Yuan2-2B-hf'`.
- `--prompt PROMPT`: the prompt to run inference on (the chat prompt format is applied automatically). The default is `'What is AI?'`.
- `--n-predict N_PREDICT`: the maximum number of tokens to predict. The default is `100`.
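
To see how these flags map onto Python attributes, the parser from generate.py can be reproduced on its own. The sketch below copies the three arguments and their defaults from the script; the `build_parser` wrapper and the sample prompt are introduced here purely for illustration.

```python
import argparse


def build_parser():
    """Rebuild the argument parser used by generate.py (wrapper name is hypothetical)."""
    parser = argparse.ArgumentParser(description='Generate text using Yuan2-2B model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="IEITYuan/Yuan2-2B-hf",
                        help='The huggingface repo id for the Yuan2 to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt for the model')
    parser.add_argument('--n-predict', type=int, default=100,
                        help='Number of tokens to generate')
    return parser


# Same effect as: python ./generate.py --prompt "What is deep learning?" --n-predict 32
args = build_parser().parse_args(['--prompt', 'What is deep learning?', '--n-predict', '32'])

# argparse turns dashes in flag names into underscores on the namespace.
print(args.repo_id_or_model_path, args.prompt, args.n_predict)
```

Flags that are not given on the command line keep the defaults listed above.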

#### Sample Output
#### [IEITYuan/Yuan2-2B-hf](https://huggingface.co/IEITYuan/Yuan2-2B-hf)
```log
Inference time: xxxx seconds
-------------------- Output --------------------

What is AI?
AI is the process of creating machines that can interact with humans with their minds and learn and understand them. It enables us to think about ideas and ideas, and then we can analyze them and come up with new ideas. It's not so much that you need to be an AI as an individual, you can be an AI, just as you think.<sep> 人工智能(AI)是一种计算机程序,它可以帮助我们思考和学习,从而让我们更好地理解人类的
```
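
The reported inference time can be turned into a rough throughput figure. For decoder-only models, `generate()` returns the prompt tokens followed by the newly generated ones, so the prompt length has to be subtracted first. The helper below is a sketch with a hypothetical name; in generate.py the two lengths would presumably come from `inputs.shape[1]` and `outputs.shape[1]`.

```python
def tokens_per_second(prompt_tokens, total_tokens, seconds):
    """Throughput over newly generated tokens only (the prompt is excluded)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    new_tokens = total_tokens - prompt_tokens
    return new_tokens / seconds


# Example: a 10-token prompt, 110 tokens returned in total, 4 seconds of wall-clock time.
print(tokens_per_second(10, 110, 4.0))  # 25.0
```

The figure includes sampling and decoding overhead, so treat it as an end-to-end estimate rather than a kernel-level measurement.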

@@ -0,0 +1,80 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch, transformers
import sys, os, time
import intel_extension_for_pytorch as ipex
import argparse
from transformers import LlamaTokenizer, AutoModelForCausalLM
from bigdl.llm import optimize_model

# Refer to https://huggingface.co/IEITYuan/Yuan2-2B-hf#Usage
YUAN2_PROMPT_FORMAT = """
{prompt}
"""

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate text using Yuan2-2B model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="IEITYuan/Yuan2-2B-hf",
                        help='The huggingface repo id for the Yuan2 to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="What is AI?",
                        help='Prompt for the model')
    parser.add_argument('--n-predict', type=int, default=100,
                        help='Number of tokens to generate')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load tokenizer
    print("Creating tokenizer...")
    tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
    tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>',
                          '<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)

    # Load model
    print("Creating model...")
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype='auto', low_cpu_mem_usage=True).eval()

    # With only one line to enable BigDL-LLM optimization on model
    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
    model = optimize_model(model)
    # Convert the model to xpu
    model = model.to('xpu')

    prompt = YUAN2_PROMPT_FORMAT.format(prompt=args.prompt)
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    # Convert the inputs to xpu
    inputs = inputs.to('xpu')

    # Default warmup since the first generate() is slow
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_new_tokens=args.n_predict)
    print('Finish warmup')

    # Measure the inference time
    start_time = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
    outputs = model.generate(inputs, do_sample=True, top_k=5, max_new_tokens=args.n_predict)
    end_time = time.time()

    output_str = tokenizer.decode(outputs[0])
    print(f'Inference time: {end_time - start_time} seconds')
    print('-'*20, 'Output', '-'*20)
    print(output_str)
@@ -0,0 +1,39 @@
{
  "_from_model_config":true,
  "architectures": [
    "YuanForCausalLM"
  ],
  "auto_map":{
    "AutoConfig":"configuration_yuan.YuanConfig",
    "AutoModelForCausalLM":"yuan_hf_model.YuanForCausalLM"
  },
  "tokenizer_class":"YuanTokenizer",
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 8192,
  "model_type": "yuan",
  "num_attention_heads": 32,
  "num_hidden_layers": 24,
  "rms_norm_eps": 1e-06,
  "dropout": 0.1,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.30.0.dev0",
  "use_cache": true,
  "causal_mask": true,
  "use_flash_attention": false,
  "reset_attention_mask": true,
  "reset_position_ids": true,
  "use_loss_mask": false,
  "eod_token": 77185,
  "sep_token": 77187,
  "eod_token_id": 77185,
  "sep_token_id": 77185,
  "pad_token_id": 77185,
  "bos_token_id": 77185,
  "eos_token_id": 77185,
  "mask_token_id": 77185,
  "vocab_size": 135040
}
File diff suppressed because it is too large