Co-authored-by: leonardozcm <leonardozcm@gmail.com>
Zhao Changmin 2024-02-28 11:18:21 +08:00 committed by GitHub
parent 4833067489
commit 937e1f7c74
4 changed files with 546 additions and 0 deletions

README.md
@@ -0,0 +1,90 @@
# LLaMA
In this directory, you will find examples of how to apply BigDL-LLM INT4 optimizations to general PyTorch models, such as Meta's Llama models. **Unlike the [Huggingface LlaMA2](../llama2/) example, this example applies the BigDL-LLM optimizations directly to the official LLaMA implementation, whose code structure is more flexible.** For illustration purposes, we use [Llama2-7b-Chat](https://ai.meta.com/llama/) as the reference LLaMA model.
## Requirements
To run these examples with BigDL-LLM, there are some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Generating text using a pretrained Llama model
In the example [example_chat_completion.py](./example_chat_completion.py), we show a basic use case in which a Llama model holds a conversation with an AI assistant through the `chat_completion` API, with BigDL-LLM INT4 optimizations applied. The process for [example_text_completion.py](./example_text_completion.py) is similar.
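Both scripts follow the same pattern: build the model with the patched `Llama.build` (which accepts a `backend` argument once `cpu.patch` is applied) and then wrap the underlying PyTorch model with `optimize_model`. Below is a minimal sketch distilled from [example_chat_completion.py](./example_chat_completion.py); the checkpoint directory and tokenizer path are placeholders that depend on where you download the weights, and the full script still needs to be launched with `torchrun` (see the commands below) because it initializes `torch.distributed`.
```python
from bigdl.llm.optimize import optimize_model
from llama import Llama, Dialog

# Build the official LLaMA generator; the `backend` argument is added by cpu.patch
generator = Llama.build(
    ckpt_dir="llama-2-7b-chat/",       # placeholder: your downloaded checkpoint folder
    tokenizer_path="tokenizer.model",  # placeholder: your downloaded tokenizer
    max_seq_len=64,
    max_batch_size=1,
    backend="cpu",
)

# Apply BigDL-LLM INT4 optimizations to the underlying PyTorch model
generator.model = optimize_model(generator.model)

dialogs = [[{"role": "user", "content": "what is the recipe of mayonnaise?"}]]
results = generator.chat_completion(dialogs, max_gen_len=64, temperature=0.6, top_p=0.9)
print(results[0]["generation"]["content"])
```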
### 1. Install
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm
# Install meta-llama repository
git clone https://github.com/facebookresearch/llama.git
cd llama/
git apply < ../cpu.patch # apply cpu version patch
pip install -e .
cd -
pip install bigdl-llm[all] # install bigdl-llm with 'all' option
```
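After installation, you can optionally run a quick import check to confirm that both the patched `llama` package and `bigdl-llm` are available (assuming the `llm` conda environment is still activated):
```python
# Quick sanity check: both packages should import without errors
from bigdl.llm.optimize import optimize_model  # provided by bigdl-llm[all]
from llama import Llama, Dialog                # provided by the patched meta-llama repo

print("bigdl-llm and llama imported successfully")
```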
### 2. Run
Follow the instructions [here](https://github.com/facebookresearch/llama#download) to download the model weights and tokenizer.
```bash
torchrun --nproc-per-node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 64 --max_batch_size 1 --backend cpu
```
Arguments info:
- `--ckpt_dir` (str): The directory containing checkpoint files for the pretrained model.
- `--tokenizer_path` (str): The path to the tokenizer model used for text encoding/decoding.
- `--temperature` (float, optional): The temperature value for controlling randomness in generation.
Defaults to 0.6.
- `--top_p` (float, optional): The top-p sampling parameter for controlling diversity in generation.
Defaults to 0.9.
- `--max_seq_len` (int, optional): The maximum sequence length for input prompts. Defaults to 512.
- `--max_gen_len` (int, optional): The maximum length of generated sequences. If None, it will be set to the model's max sequence length. Defaults to None.
- `--max_batch_size` (int, optional): The maximum batch size for generating sequences. Defaults to 8.
- `--backend` (str): The device backend for computing. Defaults to `cpu`.
> Please select the appropriate size of the Llama model based on the capabilities of your machine.
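For reference, the `--backend` flag is consumed by the patched `Llama.build`. The sketch below is a simplified, illustrative view of that dispatch, based on the backend-handling code that appears in [cpu.patch](./cpu.patch); the helper name `init_backend` is ours, and the real logic runs inside `Llama.build` under a `torchrun`-provided distributed environment.
```python
import os
import torch

def init_backend(backend: str = "cpu") -> None:
    # Illustrative sketch only; in the patched repo this logic lives in Llama.build.
    if backend == "cuda":
        torch.distributed.init_process_group("nccl")
        torch.set_default_tensor_type(torch.cuda.HalfTensor)
    else:
        # CPU (and DirectML) runs fall back to the gloo process group
        torch.distributed.init_process_group("gloo")
        if backend == "cpu":
            # Some half-precision ops are not implemented on CPU, so float32 is kept.
            # The thread count can be set via the NUM_THREADS environment variable.
            n_threads = int(os.environ.get("NUM_THREADS", 0))
            if n_threads > 0:
                torch.set_num_threads(n_threads)
```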
#### 2.1 Client
On a client Windows machine, it is recommended to run the example directly with full utilization of all cores:
```powershell
torchrun --nproc-per-node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 64 --max_batch_size 1 --backend cpu
```
#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information) and run the example with all the physical cores of a single socket.
For example, on Linux:
```bash
# set BigDL-Nano env variables
source bigdl-nano-init
# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 torchrun --nproc-per-node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 64 --max_batch_size 1 --backend cpu
```
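Before launching, you can optionally confirm that the thread settings are visible to PyTorch (a small, optional check):
```python
# Optional check: the configured thread count should be picked up by PyTorch
import os
import torch

print("OMP_NUM_THREADS  :", os.environ.get("OMP_NUM_THREADS"))
print("intra-op threads :", torch.get_num_threads())
```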
#### 2.3 Sample Output
#### [Llama2-7b-Chat](https://ai.meta.com/llama/)
```log
2023-10-08 13:49:11,107 - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2023-10-08 13:49:11,108 - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
2023-10-08 13:49:11,130 - INFO - Added key: store_based_barrier_key:2 to store for rank: 0
2023-10-08 13:49:11,130 - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
2023-10-08 13:49:11,131 - INFO - Added key: store_based_barrier_key:3 to store for rank: 0
2023-10-08 13:49:11,131 - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:3 with 1 nodes.
2023-10-08 13:49:11,132 - INFO - Added key: store_based_barrier_key:4 to store for rank: 0
2023-10-08 13:49:11,132 - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:4 with 1 nodes.
2023-10-08 13:49:19,108 - INFO - Reloaded SentencePiece model from /disk1/changmin/Llama-2-7b-chat/tokenizer.model
2023-10-08 13:49:19,108 - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
Loaded in 54.41 seconds
2023-10-08 13:50:09,600 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
User: what is the recipe of mayonnaise?
> Assistant: Mayonnaise is a thick, creamy condiment made from a mixture of egg yolks, oil, and vinegar or lemon juice. Unterscheidung of mayonnaise involves the use of an emuls
==================================
```

cpu.patch
@@ -0,0 +1,279 @@
diff --git a/README.md b/README.md
index 91e1719..1f6f26d 100755
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
# Llama 2
-We are unlocking the power of large language models. Our latest version of Llama is now accessible to individuals, creators, researchers and businesses of all sizes so that they can experiment, innovate and scale their ideas responsibly.
+We are unlocking the power of large language models. Our latest version of Llama is now accessible to individuals, creators, researchers and businesses of all sizes so that they can experiment, innovate and scale their ideas responsibly.
This release includes model weights and starting code for pretrained and fine-tuned Llama language models — ranging from 7B to 70B parameters.
@@ -58,8 +58,6 @@ torchrun --nproc_per_node 1 example_chat_completion.py \
- Adjust the `max_seq_len` and `max_batch_size` parameters as needed.
- This example runs the [example_chat_completion.py](example_chat_completion.py) found in this repository but you can change that to a different .py file.
-It is also possible to test models without CUDA. For example, to run models on CPU, add an extra command line option `--backend cpu` to following examples. Number of threads can be set using the environment variable `NUM_THREADS`.
-
## Inference
Different models require different model-parallel (MP) values:
@@ -116,7 +114,7 @@ See [MODEL_CARD.md](MODEL_CARD.md).
## License
-Our model and weights are licensed for both researchers and commercial entities, upholding the principles of openness. Our mission is to empower individuals, and industry through this opportunity, while fostering an environment of discovery and ethical AI advancements.
+Our model and weights are licensed for both researchers and commercial entities, upholding the principles of openness. Our mission is to empower individuals, and industry through this opportunity, while fostering an environment of discovery and ethical AI advancements.
See the [LICENSE](LICENSE) file, as well as our accompanying [Acceptable Use Policy](USE_POLICY.md)
diff --git a/example_chat_completion.py b/example_chat_completion.py
index acedf44..df4e5d6 100644
--- a/example_chat_completion.py
+++ b/example_chat_completion.py
@@ -7,13 +7,10 @@ import fire
from llama import Llama, Dialog
-from bigdl.llm.optimize import optimize_model
-
def main(
ckpt_dir: str,
tokenizer_path: str,
- backend: str = 'cuda',
temperature: float = 0.6,
top_p: float = 0.9,
max_seq_len: int = 512,
@@ -39,12 +36,9 @@ def main(
ckpt_dir=ckpt_dir,
tokenizer_path=tokenizer_path,
max_seq_len=max_seq_len,
- backend=backend,
max_batch_size=max_batch_size,
)
- generator.model = optimize_model(generator.model)
-
dialogs: List[Dialog] = [
[{"role": "user", "content": "what is the recipe of mayonnaise?"}],
[
diff --git a/example_text_completion.py b/example_text_completion.py
index 1f63bb0..0d60b9c 100755
--- a/example_text_completion.py
+++ b/example_text_completion.py
@@ -6,12 +6,9 @@ import fire
from llama import Llama
from typing import List
-from bigdl.llm.optimize import optimize_model
-
def main(
ckpt_dir: str,
tokenizer_path: str,
- backend: str = 'cuda',
temperature: float = 0.6,
top_p: float = 0.9,
max_seq_len: int = 128,
@@ -36,12 +33,9 @@ def main(
ckpt_dir=ckpt_dir,
tokenizer_path=tokenizer_path,
max_seq_len=max_seq_len,
- backend=backend,
max_batch_size=max_batch_size,
)
- generator.model = optimize_model(generator.model)
-
prompts: List[str] = [
# For these prompts, the expected answer is the natural continuation of the prompt
"I believe the meaning of life is",
@@ -49,11 +43,11 @@ def main(
"""A brief message congratulating the team on the launch:
Hi everyone,
-
+
I just """,
# Few shot prompt (providing a few examples before asking model to complete more);
"""Translate English to French:
-
+
sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe => girafe peluche
diff --git a/llama/generation.py b/llama/generation.py
index df68aca..5f8faf9 100755
--- a/llama/generation.py
+++ b/llama/generation.py
@@ -55,7 +55,6 @@ class Llama:
tokenizer_path: str,
max_seq_len: int,
max_batch_size: int,
- backend: str,
model_parallel_size: Optional[int] = None,
seed: int = 1,
) -> "Llama":
@@ -82,41 +81,22 @@ class Llama:
and loads the pre-trained model and tokenizer.
"""
- if model_parallel_size is None:
- model_parallel_size = int(os.environ.get("WORLD_SIZE", 1))
-
- device = backend
-
- if backend == 'cuda':
- if not torch.distributed.is_initialized():
- torch.distributed.init_process_group("nccl")
- if not model_parallel_is_initialized():
- initialize_model_parallel(model_parallel_size)
- local_rank = int(os.environ.get("LOCAL_RANK", 0))
- torch.cuda.set_device(local_rank)
- if local_rank > 0:
- sys.stdout = open(os.devnull, "w")
- torch.set_default_tensor_type(torch.cuda.HalfTensor)
- else:
- torch.distributed.init_process_group("gloo")
-
+ if not torch.distributed.is_initialized():
+ torch.distributed.init_process_group("nccl")
+ if not model_parallel_is_initialized():
+ if model_parallel_size is None:
+ model_parallel_size = int(os.environ.get("WORLD_SIZE", 1))
initialize_model_parallel(model_parallel_size)
- if backend == 'directml':
- import torch_directml
- torch.set_default_tensor_type(torch_directml.torch.HalfTensor)
- device = torch_directml.device()
- elif backend == 'cpu':
- # Note: some operations such as "addmm_impl_cpu_" are not implemented for 'Half' at present
- # torch.set_default_tensor_type(torch.HalfTensor)
- n_threads = int(os.environ.get("NUM_THREADS", 0))
- if n_threads > 0:
- torch.set_num_threads(n_threads)
- pass
+ local_rank = int(os.environ.get("LOCAL_RANK", 0))
+ torch.cuda.set_device(local_rank)
# seed must be the same in all processes
torch.manual_seed(seed)
+ if local_rank > 0:
+ sys.stdout = open(os.devnull, "w")
+
start_time = time.time()
checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
@@ -129,13 +109,13 @@ class Llama:
params = json.loads(f.read())
model_args: ModelArgs = ModelArgs(
- device=device,
max_seq_len=max_seq_len,
max_batch_size=max_batch_size,
**params,
)
tokenizer = Tokenizer(model_path=tokenizer_path)
model_args.vocab_size = tokenizer.n_words
+ torch.set_default_tensor_type(torch.cuda.HalfTensor)
model = Transformer(model_args)
model.load_state_dict(checkpoint, strict=False)
print(f"Loaded in {time.time() - start_time:.2f} seconds")
@@ -145,7 +125,6 @@ class Llama:
def __init__(self, model: Transformer, tokenizer: Tokenizer):
self.model = model
self.tokenizer = tokenizer
- self.device = model.device
@torch.inference_mode()
def generate(
@@ -186,14 +165,14 @@ class Llama:
total_len = min(params.max_seq_len, max_gen_len + max_prompt_len)
pad_id = self.tokenizer.pad_id
- tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long, device=self.device)
+ tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long, device="cuda")
for k, t in enumerate(prompt_tokens):
- tokens[k, : len(t)] = torch.tensor(t, dtype=torch.long, device=self.device)
+ tokens[k, : len(t)] = torch.tensor(t, dtype=torch.long, device="cuda")
if logprobs:
token_logprobs = torch.zeros_like(tokens, dtype=torch.float)
prev_pos = 0
- eos_reached = torch.tensor([False] * bsz, device=self.device)
+ eos_reached = torch.tensor([False] * bsz, device="cuda")
input_text_mask = tokens != pad_id
if min_prompt_len == total_len:
logits = self.model.forward(tokens, prev_pos)
diff --git a/llama/model.py b/llama/model.py
index 8646d31..770526d 100755
--- a/llama/model.py
+++ b/llama/model.py
@@ -9,28 +9,15 @@ import fairscale.nn.model_parallel.initialize as fs_init
import torch
import torch.nn.functional as F
from fairscale.nn.model_parallel.layers import (
- # ColumnParallelLinear,
+ ColumnParallelLinear,
ParallelEmbedding,
- # RowParallelLinear,
+ RowParallelLinear,
)
from torch import nn
-def ColumnParallelLinear(in_features: int, out_features: int, bias: bool = True, *args, **kwargs):
- return torch.nn.Linear(in_features=in_features,
- out_features=out_features,
- bias=bias)
-
-
-def RowParallelLinear(in_features: int, out_features: int, bias: bool = True, *args, **kwargs):
- return torch.nn.Linear(in_features=in_features,
- out_features=out_features,
- bias=bias)
-
-
@dataclass
class ModelArgs:
- device: object
dim: int = 4096
n_layers: int = 32
n_heads: int = 32
@@ -216,7 +203,6 @@ class Attention(nn.Module):
self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
self.n_rep = self.n_local_heads // self.n_local_kv_heads
self.head_dim = args.dim // args.n_heads
- self.device = args.device
self.wq = ColumnParallelLinear(
args.dim,
@@ -254,7 +240,7 @@ class Attention(nn.Module):
self.n_local_kv_heads,
self.head_dim,
)
- ).to(self.device)
+ ).cuda()
self.cache_v = torch.zeros(
(
args.max_batch_size,
@@ -262,7 +248,7 @@ class Attention(nn.Module):
self.n_local_kv_heads,
self.head_dim,
)
- ).to(self.device)
+ ).cuda()
def forward(
self,
@@ -447,7 +433,6 @@ class Transformer(nn.Module):
self.params = params
self.vocab_size = params.vocab_size
self.n_layers = params.n_layers
- self.device = params.device
self.tok_embeddings = ParallelEmbedding(
params.vocab_size, params.dim, init_method=lambda x: x

example_chat_completion.py
@@ -0,0 +1,85 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This file is adapted from https://github.com/facebookresearch/llama/blob/main/example_chat_completion.py
#####################################################
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.

from typing import List, Optional

import fire

from llama import Llama, Dialog

from bigdl.llm.optimize import optimize_model


def main(
    ckpt_dir: str,
    tokenizer_path: str,
    backend: str = 'cpu',
    temperature: float = 0.6,
    top_p: float = 0.9,
    max_seq_len: int = 512,
    max_batch_size: int = 8,
    max_gen_len: Optional[int] = None,
):
    """
    Entry point of the program for generating text using a pretrained model.

    Args:
        ckpt_dir (str): The directory containing checkpoint files for the pretrained model.
        tokenizer_path (str): The path to the tokenizer model used for text encoding/decoding.
        temperature (float, optional): The temperature value for controlling randomness in generation.
            Defaults to 0.6.
        top_p (float, optional): The top-p sampling parameter for controlling diversity in generation.
            Defaults to 0.9.
        max_seq_len (int, optional): The maximum sequence length for input prompts. Defaults to 512.
        max_batch_size (int, optional): The maximum batch size for generating sequences. Defaults to 8.
        max_gen_len (int, optional): The maximum length of generated sequences. If None, it will be
            set to the model's max sequence length. Defaults to None.
    """
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        backend=backend,
        max_batch_size=max_batch_size,
    )

    generator.model = optimize_model(generator.model)

    dialogs: List[Dialog] = [
        [{"role": "user", "content": "what is the recipe of mayonnaise?"}],
    ]

    results = generator.chat_completion(
        dialogs,  # type: ignore
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )

    for dialog, result in zip(dialogs, results):
        for msg in dialog:
            print(f"{msg['role'].capitalize()}: {msg['content']}\n")
        print(
            f"> {result['generation']['role'].capitalize()}: {result['generation']['content']}"
        )
        print("\n==================================\n")


if __name__ == "__main__":
    fire.Fire(main)

example_text_completion.py
@@ -0,0 +1,92 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This file is adapted from https://github.com/facebookresearch/llama/blob/main/example_text_completion.py
#####################################################
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.

import fire

from llama import Llama
from typing import List

from bigdl.llm.optimize import optimize_model


def main(
    ckpt_dir: str,
    tokenizer_path: str,
    backend: str = 'cpu',
    temperature: float = 0.6,
    top_p: float = 0.9,
    max_seq_len: int = 128,
    max_gen_len: int = 64,
    max_batch_size: int = 4,
):
    """
    Entry point of the program for generating text using a pretrained model.

    Args:
        ckpt_dir (str): The directory containing checkpoint files for the pretrained model.
        tokenizer_path (str): The path to the tokenizer model used for text encoding/decoding.
        temperature (float, optional): The temperature value for controlling randomness in generation.
            Defaults to 0.6.
        top_p (float, optional): The top-p sampling parameter for controlling diversity in generation.
            Defaults to 0.9.
        max_seq_len (int, optional): The maximum sequence length for input prompts. Defaults to 128.
        max_gen_len (int, optional): The maximum length of generated sequences. Defaults to 64.
        max_batch_size (int, optional): The maximum batch size for generating sequences. Defaults to 4.
    """
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        backend=backend,
        max_batch_size=max_batch_size,
    )

    generator.model = optimize_model(generator.model)

    prompts: List[str] = [
        # For these prompts, the expected answer is the natural continuation of the prompt
        "I believe the meaning of life is",
        "Simply put, the theory of relativity states that ",
        """A brief message congratulating the team on the launch:

        Hi everyone,

        I just """,
        # Few shot prompt (providing a few examples before asking model to complete more);
        """Translate English to French:

        sea otter => loutre de mer
        peppermint => menthe poivrée
        plush girafe => girafe peluche
        cheese =>""",
    ]

    results = generator.text_completion(
        prompts,
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )

    for prompt, result in zip(prompts, results):
        print(prompt)
        print(f"> {result['generation']}")
        print("\n==================================\n")


if __name__ == "__main__":
    fire.Fire(main)