Co-authored-by: leonardozcm <leonardozcm@gmail.com>
Zhao Changmin 2024-02-28 11:18:21 +08:00 committed by GitHub
parent 4833067489
commit 937e1f7c74
4 changed files with 546 additions and 0 deletions

README.md
@@ -0,0 +1,90 @@
# LLaMA
In this directory, you will find examples of how to apply BigDL-LLM INT4 optimizations to general PyTorch models, such as Meta's Llama models. **Unlike the [Huggingface LlaMA2](../llama2/) example, this example applies the BigDL-LLM optimizations directly to the official LLaMA implementation, whose code structure is more flexible.** For illustration purposes, we use [Llama2-7b-Chat](https://ai.meta.com/llama/) as the reference LLaMA model.
## Requirements
To run these examples with BigDL-LLM, there are some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
## Example: Generating text using a pretrained Llama model
In the example [example_chat_completion.py](./example_chat_completion.py), we show a basic use case in which a Llama model holds a conversation with an AI assistant through the `chat_completion` API, with BigDL-LLM INT4 optimizations applied. The process for [example_text_completion.py](./example_text_completion.py) is similar.
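Both scripts follow the same pattern: build the model with the patched `Llama.build` (which accepts a `backend` argument once `cpu.patch` is applied) and then wrap the underlying PyTorch model with `optimize_model`. Below is a minimal sketch distilled from [example_chat_completion.py](./example_chat_completion.py); the checkpoint directory and tokenizer path are placeholders that depend on where you download the weights, and the full script still needs to be launched with `torchrun` (see the commands below) because it initializes `torch.distributed`.
```python
from bigdl.llm.optimize import optimize_model
from llama import Llama, Dialog

# Build the official LLaMA generator; the `backend` argument is added by cpu.patch
generator = Llama.build(
    ckpt_dir="llama-2-7b-chat/",       # placeholder: your downloaded checkpoint folder
    tokenizer_path="tokenizer.model",  # placeholder: your downloaded tokenizer
    max_seq_len=64,
    max_batch_size=1,
    backend="cpu",
)

# Apply BigDL-LLM INT4 optimizations to the underlying PyTorch model
generator.model = optimize_model(generator.model)

dialogs = [[{"role": "user", "content": "what is the recipe of mayonnaise?"}]]
results = generator.chat_completion(dialogs, max_gen_len=64, temperature=0.6, top_p=0.9)
print(results[0]["generation"]["content"])
```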
### 1. Install
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm
# Install meta-llama repository
git clone https://github.com/facebookresearch/llama.git
cd llama/
git apply < ../cpu.patch # apply cpu version patch
pip install -e .
cd -
pip install bigdl-llm[all] # install bigdl-llm with 'all' option
```
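After installation, you can optionally run a quick import check to confirm that both the patched `llama` package and `bigdl-llm` are available (assuming the `llm` conda environment is still activated):
```python
# Quick sanity check: both packages should import without errors
from bigdl.llm.optimize import optimize_model  # provided by bigdl-llm[all]
from llama import Llama, Dialog                # provided by the patched meta-llama repo

print("bigdl-llm and llama imported successfully")
```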
### 2. Run
Follow the instructions [here](https://github.com/facebookresearch/llama#download) to download the model weights and tokenizer.
```bash
torchrun --nproc-per-node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 64 --max_batch_size 1 --backend cpu
```
Arguments info:
- `--ckpt_dir` (str): The directory containing checkpoint files for the pretrained model.
- `--tokenizer_path` (str): The path to the tokenizer model used for text encoding/decoding.
- `--temperature` (float, optional): The temperature value for controlling randomness in generation.
Defaults to 0.6.
- `--top_p` (float, optional): The top-p sampling parameter for controlling diversity in generation.
Defaults to 0.9.
- `--max_seq_len` (int, optional): The maximum sequence length for input prompts. Defaults to 512.
- `--max_gen_len` (int, optional): The maximum length of generated sequences. If None, it will be set to the model's max sequence length. Defaults to None.
- `--max_batch_size` (int, optional): The maximum batch size for generating sequences. Defaults to 8.
- `--backend` (str): The device backend for computing. Defaults to `cpu`.
> Please select the appropriate size of the Llama model based on the capabilities of your machine.
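For reference, the `--backend` flag is consumed by the patched `Llama.build`. The sketch below is a simplified, illustrative view of that dispatch, based on the backend-handling code that appears in [cpu.patch](./cpu.patch); the helper name `init_backend` is ours, and the real logic runs inside `Llama.build` under a `torchrun`-provided distributed environment.
```python
import os
import torch

def init_backend(backend: str = "cpu") -> None:
    # Illustrative sketch only; in the patched repo this logic lives in Llama.build.
    if backend == "cuda":
        torch.distributed.init_process_group("nccl")
        torch.set_default_tensor_type(torch.cuda.HalfTensor)
    else:
        # CPU (and DirectML) runs fall back to the gloo process group
        torch.distributed.init_process_group("gloo")
        if backend == "cpu":
            # Some half-precision ops are not implemented on CPU, so float32 is kept.
            # The thread count can be set via the NUM_THREADS environment variable.
            n_threads = int(os.environ.get("NUM_THREADS", 0))
            if n_threads > 0:
                torch.set_num_threads(n_threads)
```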
#### 2.1 Client
On a client Windows machine, it is recommended to run the example directly with full utilization of all cores:
```powershell
torchrun --nproc-per-node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 64 --max_batch_size 1 --backend cpu
```
#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information) and run the example with all the physical cores of a single socket.
For example, on Linux:
```bash
# set BigDL-Nano env variables
source bigdl-nano-init
# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 torchrun --nproc-per-node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 64 --max_batch_size 1 --backend cpu
```
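Before launching, you can optionally confirm that the thread settings are visible to PyTorch (a small, optional check):
```python
# Optional check: the configured thread count should be picked up by PyTorch
import os
import torch

print("OMP_NUM_THREADS  :", os.environ.get("OMP_NUM_THREADS"))
print("intra-op threads :", torch.get_num_threads())
```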
#### 2.3 Sample Output
#### [Llama2-7b-Chat](https://ai.meta.com/llama/)
```log
2023-10-08 13:49:11,107 - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2023-10-08 13:49:11,108 - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
2023-10-08 13:49:11,130 - INFO - Added key: store_based_barrier_key:2 to store for rank: 0
2023-10-08 13:49:11,130 - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
2023-10-08 13:49:11,131 - INFO - Added key: store_based_barrier_key:3 to store for rank: 0
2023-10-08 13:49:11,131 - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:3 with 1 nodes.
2023-10-08 13:49:11,132 - INFO - Added key: store_based_barrier_key:4 to store for rank: 0
2023-10-08 13:49:11,132 - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:4 with 1 nodes.
2023-10-08 13:49:19,108 - INFO - Reloaded SentencePiece model from /disk1/changmin/Llama-2-7b-chat/tokenizer.model
2023-10-08 13:49:19,108 - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
Loaded in 54.41 seconds
2023-10-08 13:50:09,600 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
User: what is the recipe of mayonnaise?
> Assistant: Mayonnaise is a thick, creamy condiment made from a mixture of egg yolks, oil, and vinegar or lemon juice. Unterscheidung of mayonnaise involves the use of an emuls
==================================
```

cpu.patch
@@ -0,0 +1,279 @@
diff --git a/README.md b/README.md
index 91e1719..1f6f26d 100755
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
# Llama 2
-We are unlocking the power of large language models. Our latest version of Llama is now accessible to individuals, creators, researchers and businesses of all sizes so that they can experiment, innovate and scale their ideas responsibly.
+We are unlocking the power of large language models. Our latest version of Llama is now accessible to individuals, creators, researchers and businesses of all sizes so that they can experiment, innovate and scale their ideas responsibly.
This release includes model weights and starting code for pretrained and fine-tuned Llama language models — ranging from 7B to 70B parameters.
@@ -58,8 +58,6 @@ torchrun --nproc_per_node 1 example_chat_completion.py \
- Adjust the `max_seq_len` and `max_batch_size` parameters as needed.
- This example runs the [example_chat_completion.py](example_chat_completion.py) found in this repository but you can change that to a different .py file.
-It is also possible to test models without CUDA. For example, to run models on CPU, add an extra command line option `--backend cpu` to following examples. Number of threads can be set using the environment variable `NUM_THREADS`.
-
## Inference
Different models require different model-parallel (MP) values:
@@ -116,7 +114,7 @@ See [MODEL_CARD.md](MODEL_CARD.md).
## License
-Our model and weights are licensed for both researchers and commercial entities, upholding the principles of openness. Our mission is to empower individuals, and industry through this opportunity, while fostering an environment of discovery and ethical AI advancements.
+Our model and weights are licensed for both researchers and commercial entities, upholding the principles of openness. Our mission is to empower individuals, and industry through this opportunity, while fostering an environment of discovery and ethical AI advancements.
See the [LICENSE](LICENSE) file, as well as our accompanying [Acceptable Use Policy](USE_POLICY.md)
diff --git a/example_chat_completion.py b/example_chat_completion.py
index acedf44..df4e5d6 100644
--- a/example_chat_completion.py
+++ b/example_chat_completion.py
@@ -7,13 +7,10 @@ import fire
from llama import Llama, Dialog
-from bigdl.llm.optimize import optimize_model
-
def main(
ckpt_dir: str,
tokenizer_path: str,
- backend: str = 'cuda',
temperature: float = 0.6,
top_p: float = 0.9,
max_seq_len: int = 512,
@@ -39,12 +36,9 @@ def main(
ckpt_dir=ckpt_dir,
tokenizer_path=tokenizer_path,
max_seq_len=max_seq_len,
- backend=backend,
max_batch_size=max_batch_size,
)
- generator.model = optimize_model(generator.model)
-
dialogs: List[Dialog] = [
[{"role": "user", "content": "what is the recipe of mayonnaise?"}],
[
diff --git a/example_text_completion.py b/example_text_completion.py
index 1f63bb0..0d60b9c 100755
--- a/example_text_completion.py
+++ b/example_text_completion.py
@@ -6,12 +6,9 @@ import fire
from llama import Llama
from typing import List
-from bigdl.llm.optimize import optimize_model
-
def main(
ckpt_dir: str,
tokenizer_path: str,
- backend: str = 'cuda',
temperature: float = 0.6,
top_p: float = 0.9,
max_seq_len: int = 128,
@@ -36,12 +33,9 @@ def main(
ckpt_dir=ckpt_dir,
tokenizer_path=tokenizer_path,
max_seq_len=max_seq_len,
- backend=backend,
max_batch_size=max_batch_size,
)
- generator.model = optimize_model(generator.model)
-
prompts: List[str] = [
# For these prompts, the expected answer is the natural continuation of the prompt
"I believe the meaning of life is",
@@ -49,11 +43,11 @@ def main(
"""A brief message congratulating the team on the launch:
Hi everyone,
-
+
I just """,
# Few shot prompt (providing a few examples before asking model to complete more);
"""Translate English to French:
-
+
sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe => girafe peluche
diff --git a/llama/generation.py b/llama/generation.py
index df68aca..5f8faf9 100755
--- a/llama/generation.py
+++ b/llama/generation.py
@@ -55,7 +55,6 @@ class Llama:
tokenizer_path: str,
max_seq_len: int,
max_batch_size: int,
- backend: str,
model_parallel_size: Optional[int] = None,
seed: int = 1,
) -> "Llama":
@@ -82,41 +81,22 @@ class Llama:
and loads the pre-trained model and tokenizer.
"""
- if model_parallel_size is None:
- model_parallel_size = int(os.environ.get("WORLD_SIZE", 1))
-
- device = backend
-
- if backend == 'cuda':
- if not torch.distributed.is_initialized():
- torch.distributed.init_process_group("nccl")
- if not model_parallel_is_initialized():
- initialize_model_parallel(model_parallel_size)
- local_rank = int(os.environ.get("LOCAL_RANK", 0))
- torch.cuda.set_device(local_rank)
- if local_rank > 0:
- sys.stdout = open(os.devnull, "w")
- torch.set_default_tensor_type(torch.cuda.HalfTensor)
- else:
- torch.distributed.init_process_group("gloo")
-
+ if not torch.distributed.is_initialized():
+ torch.distributed.init_process_group("nccl")
+ if not model_parallel_is_initialized():
+ if model_parallel_size is None:
+ model_parallel_size = int(os.environ.get("WORLD_SIZE", 1))
initialize_model_parallel(model_parallel_size)
- if backend == 'directml':
- import torch_directml
- torch.set_default_tensor_type(torch_directml.torch.HalfTensor)
- device = torch_directml.device()
- elif backend == 'cpu':
- # Note: some operations such as "addmm_impl_cpu_" are not implemented for 'Half' at present
- # torch.set_default_tensor_type(torch.HalfTensor)
- n_threads = int(os.environ.get("NUM_THREADS", 0))
- if n_threads > 0:
- torch.set_num_threads(n_threads)
- pass
+ local_rank = int(os.environ.get("LOCAL_RANK", 0))
+ torch.cuda.set_device(local_rank)
# seed must be the same in all processes
torch.manual_seed(seed)
+ if local_rank > 0:
+ sys.stdout = open(os.devnull, "w")
+
start_time = time.time()
checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
@@ -129,13 +109,13 @@ class Llama:
params = json.loads(f.read())
model_args: ModelArgs = ModelArgs(
- device=device,
max_seq_len=max_seq_len,
max_batch_size=max_batch_size,
**params,
)
tokenizer = Tokenizer(model_path=tokenizer_path)
model_args.vocab_size = tokenizer.n_words
+ torch.set_default_tensor_type(torch.cuda.HalfTensor)
model = Transformer(model_args)
model.load_state_dict(checkpoint, strict=False)
print(f"Loaded in {time.time() - start_time:.2f} seconds")
@@ -145,7 +125,6 @@ class Llama:
def __init__(self, model: Transformer, tokenizer: Tokenizer):
self.model = model
self.tokenizer = tokenizer
- self.device = model.device
@torch.inference_mode()
def generate(
@@ -186,14 +165,14 @@ class Llama:
total_len = min(params.max_seq_len, max_gen_len + max_prompt_len)
pad_id = self.tokenizer.pad_id
- tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long, device=self.device)
+ tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long, device="cuda")
for k, t in enumerate(prompt_tokens):
- tokens[k, : len(t)] = torch.tensor(t, dtype=torch.long, device=self.device)
+ tokens[k, : len(t)] = torch.tensor(t, dtype=torch.long, device="cuda")
if logprobs:
token_logprobs = torch.zeros_like(tokens, dtype=torch.float)
prev_pos = 0
- eos_reached = torch.tensor([False] * bsz, device=self.device)
+ eos_reached = torch.tensor([False] * bsz, device="cuda")
input_text_mask = tokens != pad_id
if min_prompt_len == total_len:
logits = self.model.forward(tokens, prev_pos)
diff --git a/llama/model.py b/llama/model.py
index 8646d31..770526d 100755
--- a/llama/model.py
+++ b/llama/model.py
@@ -9,28 +9,15 @@ import fairscale.nn.model_parallel.initialize as fs_init
import torch
import torch.nn.functional as F
from fairscale.nn.model_parallel.layers import (
- # ColumnParallelLinear,
+ ColumnParallelLinear,
ParallelEmbedding,
- # RowParallelLinear,
+ RowParallelLinear,
)
from torch import nn
-def ColumnParallelLinear(in_features: int, out_features: int, bias: bool = True, *args, **kwargs):
- return torch.nn.Linear(in_features=in_features,
- out_features=out_features,
- bias=bias)
-
-
-def RowParallelLinear(in_features: int, out_features: int, bias: bool = True, *args, **kwargs):
- return torch.nn.Linear(in_features=in_features,
- out_features=out_features,
- bias=bias)
-
-
@dataclass
class ModelArgs:
- device: object
dim: int = 4096
n_layers: int = 32
n_heads: int = 32
@@ -216,7 +203,6 @@ class Attention(nn.Module):
self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
self.n_rep = self.n_local_heads // self.n_local_kv_heads
self.head_dim = args.dim // args.n_heads
- self.device = args.device
self.wq = ColumnParallelLinear(
args.dim,
@@ -254,7 +240,7 @@ class Attention(nn.Module):
self.n_local_kv_heads,
self.head_dim,
)
- ).to(self.device)
+ ).cuda()
self.cache_v = torch.zeros(
(
args.max_batch_size,
@@ -262,7 +248,7 @@ class Attention(nn.Module):
self.n_local_kv_heads,
self.head_dim,
)
- ).to(self.device)
+ ).cuda()
def forward(
self,
@@ -447,7 +433,6 @@ class Transformer(nn.Module):
self.params = params
self.vocab_size = params.vocab_size
self.n_layers = params.n_layers
- self.device = params.device
self.tok_embeddings = ParallelEmbedding(
params.vocab_size, params.dim, init_method=lambda x: x

example_chat_completion.py
@@ -0,0 +1,85 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This file is adapted from https://github.com/facebookresearch/llama/blob/main/example_chat_completion.py
#####################################################
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.

from typing import List, Optional

import fire

from llama import Llama, Dialog

from bigdl.llm.optimize import optimize_model


def main(
    ckpt_dir: str,
    tokenizer_path: str,
    backend: str = 'cpu',
    temperature: float = 0.6,
    top_p: float = 0.9,
    max_seq_len: int = 512,
    max_batch_size: int = 8,
    max_gen_len: Optional[int] = None,
):
    """
    Entry point of the program for generating text using a pretrained model.

    Args:
        ckpt_dir (str): The directory containing checkpoint files for the pretrained model.
        tokenizer_path (str): The path to the tokenizer model used for text encoding/decoding.
        temperature (float, optional): The temperature value for controlling randomness in generation.
            Defaults to 0.6.
        top_p (float, optional): The top-p sampling parameter for controlling diversity in generation.
            Defaults to 0.9.
        max_seq_len (int, optional): The maximum sequence length for input prompts. Defaults to 512.
        max_batch_size (int, optional): The maximum batch size for generating sequences. Defaults to 8.
        max_gen_len (int, optional): The maximum length of generated sequences. If None, it will be
            set to the model's max sequence length. Defaults to None.
    """
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        backend=backend,
        max_batch_size=max_batch_size,
    )

    generator.model = optimize_model(generator.model)

    dialogs: List[Dialog] = [
        [{"role": "user", "content": "what is the recipe of mayonnaise?"}],
    ]

    results = generator.chat_completion(
        dialogs,  # type: ignore
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )

    for dialog, result in zip(dialogs, results):
        for msg in dialog:
            print(f"{msg['role'].capitalize()}: {msg['content']}\n")
        print(
            f"> {result['generation']['role'].capitalize()}: {result['generation']['content']}"
        )
        print("\n==================================\n")


if __name__ == "__main__":
    fire.Fire(main)

example_text_completion.py
@@ -0,0 +1,92 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This file is adapted from https://github.com/facebookresearch/llama/blob/main/example_text_completion.py
#####################################################
# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.

import fire

from llama import Llama
from typing import List

from bigdl.llm.optimize import optimize_model


def main(
    ckpt_dir: str,
    tokenizer_path: str,
    backend: str = 'cpu',
    temperature: float = 0.6,
    top_p: float = 0.9,
    max_seq_len: int = 128,
    max_gen_len: int = 64,
    max_batch_size: int = 4,
):
    """
    Entry point of the program for generating text using a pretrained model.

    Args:
        ckpt_dir (str): The directory containing checkpoint files for the pretrained model.
        tokenizer_path (str): The path to the tokenizer model used for text encoding/decoding.
        temperature (float, optional): The temperature value for controlling randomness in generation.
            Defaults to 0.6.
        top_p (float, optional): The top-p sampling parameter for controlling diversity in generation.
            Defaults to 0.9.
        max_seq_len (int, optional): The maximum sequence length for input prompts. Defaults to 128.
        max_gen_len (int, optional): The maximum length of generated sequences. Defaults to 64.
        max_batch_size (int, optional): The maximum batch size for generating sequences. Defaults to 4.
    """
    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        backend=backend,
        max_batch_size=max_batch_size,
    )

    generator.model = optimize_model(generator.model)

    prompts: List[str] = [
        # For these prompts, the expected answer is the natural continuation of the prompt
        "I believe the meaning of life is",
        "Simply put, the theory of relativity states that ",
        """A brief message congratulating the team on the launch:

        Hi everyone,

        I just """,
        # Few shot prompt (providing a few examples before asking model to complete more);
        """Translate English to French:

        sea otter => loutre de mer
        peppermint => menthe poivrée
        plush girafe => girafe peluche
        cheese =>""",
    ]

    results = generator.text_completion(
        prompts,
        max_gen_len=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )

    for prompt, result in zip(prompts, results):
        print(prompt)
        print(f"> {result['generation']}")
        print("\n==================================\n")


if __name__ == "__main__":
    fire.Fire(main)