# LLaMA
In this directory, you will find examples of how to apply IPEX-LLM INT4 optimizations to general PyTorch models, for example Meta's LLaMA models. **Unlike the [Huggingface LLaMA2](../llama2/) example, this example brings the optimizations of IPEX-LLM directly to the official LLaMA implementation, whose code style is more flexible.** For illustration purposes, we use [Llama2-7b-Chat](https://ai.meta.com/llama/) as a reference LLaMA model.

## Requirements
To run these examples with IPEX-LLM, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.

## Example: Generating text using a pretrained Llama model
In the example [example_chat_completion.py](./example_chat_completion.py), we show a basic use case in which a Llama model engages in a conversation with an AI assistant using the `chat_completion` API, with IPEX-LLM INT4 optimizations. The process for [example_text_completion.py](./example_text_completion.py) is similar.
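Conceptually, the example builds the model with the official LLaMA code and then wraps its underlying PyTorch model with IPEX-LLM's general `optimize_model` API. Below is a minimal sketch of that idea, not the shipped script itself; it assumes the patched meta-llama package from step 1 and ipex-llm are installed, and that it is launched under `torchrun` (as in step 2) so the distributed setup inside `Llama.build` succeeds:

```python
# Minimal sketch: apply IPEX-LLM INT4 optimizations to the official LLaMA
# implementation. The actual example script may wire this differently.
from llama import Llama                  # official meta-llama implementation
from ipex_llm import optimize_model      # IPEX-LLM PyTorch API

generator = Llama.build(
    ckpt_dir="llama-2-7b-chat/",         # local checkpoint directory
    tokenizer_path="tokenizer.model",
    max_seq_len=64,
    max_batch_size=1,
)
# Replace the underlying PyTorch model with an INT4-optimized version
generator.model = optimize_model(generator.model)
```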
### 1. Install
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm

# Install the meta-llama repository
git clone https://github.com/facebookresearch/llama.git
cd llama/
git apply < ../cpu.patch # apply cpu version patch
pip install -e .
cd -

pip install ipex-llm[all] # install ipex-llm with 'all' option
```

### 2. Run
Follow the instructions [here](https://github.com/facebookresearch/llama#download) to download the model weights and tokenizer.
```bash
torchrun --nproc-per-node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 64 --max_batch_size 1 --backend cpu
```

Arguments info:
- `--ckpt_dir` (str): The directory containing checkpoint files for the pretrained model.
- `--tokenizer_path` (str): The path to the tokenizer model used for text encoding/decoding.
- `--temperature` (float, optional): The temperature value for controlling randomness in generation. Defaults to 0.6.
- `--top_p` (float, optional): The top-p sampling parameter for controlling diversity in generation. Defaults to 0.9.
- `--max_seq_len` (int, optional): The maximum sequence length for input prompts. Defaults to 128.
- `--max_gen_len` (int, optional): The maximum length of generated sequences. Defaults to 64.
- `--max_batch_size` (int, optional): The maximum batch size for generating sequences. Defaults to 4.
- `--backend` (str): The device backend for computing. Defaults to `cpu`.

> Please select the appropriate size of the Llama model based on the capabilities of your machine.

#### 2.1 Client
On a client Windows machine, it is recommended to run directly with full utilization of all cores:
```powershell
torchrun --nproc-per-node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 64 --max_batch_size 1 --backend cpu
```

#### 2.2 Server
For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and run the example with all the physical cores of a single socket. E.g. on Linux,
```bash
# set IPEX-Nano env variables
source ipex-nano-init

# e.g. for a server with 48 cores per socket
export OMP_NUM_THREADS=48
numactl -C 0-47 -m 0 torchrun --nproc-per-node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 64 --max_batch_size 1 --backend cpu
```

#### 2.3 Sample Output
#### [Llama2-7b-Chat](https://ai.meta.com/llama/)
```log
2023-10-08 13:49:11,107 - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2023-10-08 13:49:11,108 - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
2023-10-08 13:49:11,130 - INFO - Added key: store_based_barrier_key:2 to store for rank: 0
2023-10-08 13:49:11,130 - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
2023-10-08 13:49:11,131 - INFO - Added key: store_based_barrier_key:3 to store for rank: 0
2023-10-08 13:49:11,131 - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:3 with 1 nodes.
2023-10-08 13:49:11,132 - INFO - Added key: store_based_barrier_key:4 to store for rank: 0
2023-10-08 13:49:11,132 - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:4 with 1 nodes.
2023-10-08 13:49:19,108 - INFO - Reloaded SentencePiece model from /disk1/changmin/Llama-2-7b-chat/tokenizer.model
2023-10-08 13:49:19,108 - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
Loaded in 54.41 seconds
2023-10-08 13:50:09,600 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
User: what is the recipe of mayonnaise?

> Assistant: Mayonnaise is a thick, creamy condiment made from a mixture of egg yolks, oil, and vinegar or lemon juice. Unterscheidung of mayonnaise involves the use of an emuls

==================================
```
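The `User`/`Assistant` exchange above is what the example prints from the return value of `chat_completion`. As a hedged illustration of the dialog format that API consumes, following the meta-llama reference implementation (`generator` is assumed to be a `Llama` instance built and optimized as sketched earlier):

```python
# Hypothetical reconstruction of the call that produces the exchange above;
# the example script's exact prompt handling and printing may differ.
dialogs = [
    [{"role": "user", "content": "what is the recipe of mayonnaise?"}],
]
results = generator.chat_completion(
    dialogs, max_gen_len=64, temperature=0.6, top_p=0.9
)
for dialog, result in zip(dialogs, results):
    print(f"User: {dialog[-1]['content']}\n")
    print(f"> Assistant: {result['generation']['content']}")
    print("\n==================================\n")
```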