Replace ipex with ipex-llm (#10554)
* fix ipex with ipex_llm
* fix ipex with ipex_llm
* update
* update
* update
* update
* update
* update
* update
* update
Parent: 0a2e820c9f
Commit: 52a2135d83

106 changed files with 127 additions and 122 deletions
@@ -62,7 +62,7 @@ After the container is booted, you could get into the container through `docker
 docker exec -it my_container bash
 ```
 
-To run inference using `IPEX-LLM` using cpu, you could refer to this [documentation](https://github.com/intel-analytics/IPEX/tree/main/python/llm#cpu-int4).
+To run inference using `IPEX-LLM` using cpu, you could refer to this [documentation](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm#cpu-int4).
 
 #### Getting started with chat
@@ -1,5 +1,5 @@
 apiVersion: v2
-name: ipex-fintune-service
+name: ipex_llm-fintune-service
 description: A Helm chart for IPEX-LLM Finetune Service on Kubernetes
 type: application
 version: 1.1.27
@@ -30,7 +30,7 @@ sudo docker run -itd \
 After the container is booted, you could get into the container through `docker exec`.
 
-To run model-serving using `IPEX-LLM` as backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/src/ipex/llm/serving).
+To run model-serving using `IPEX-LLM` as backend, you can refer to this [document](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/src/ipex_llm/serving/fastchat).
 
 Also you can set environment variables and start arguments while running a container to get serving started initially. You may need to boot several containers to support. One controller container and at least one worker container are needed. The api server address(host and port) and controller address are set in controller container, and you need to set the same controller address as above, model path on your machine and worker address in worker container.
 
 To start a controller container:
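The hunk is truncated before the actual command, so nothing below comes from the repository. Purely as an illustrative sketch of the controller/worker split described above, starting a controller container might look roughly like this; the image tag and every environment variable name are assumptions, not the project's documented interface:

```bash
# Illustrative sketch only: the image tag and environment variable names are
# assumptions, not taken from this diff; check the linked serving document.
sudo docker run -itd \
    --net=host \
    -e CONTROLLER_HOST=localhost \
    -e CONTROLLER_PORT=21001 \
    -e API_HOST=localhost \
    -e API_PORT=8000 \
    intelanalytics/ipex-llm-serving-cpu:latest
```

A worker container would be started the same way, pointing at the controller address above and at the model path on the host.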
@@ -10,7 +10,7 @@ To deploy IPEX-LLM-serving cpu in Kubernetes environment, please use this image:
 In this document, we will use `vicuna-7b-v1.5` as the deployment model.
 
-After downloading the model, please change name from `vicuna-7b-v1.5` to `vicuna-7b-v1.5-ipex` to use `ipex-llm` as the backend. The `ipex-llm` backend will be used if model path contains `ipex-llm`. Otherwise, the original transformer-backend will be used.
+After downloading the model, please change name from `vicuna-7b-v1.5` to `vicuna-7b-v1.5-ipex-llm` to use `ipex-llm` as the backend. The `ipex-llm` backend will be used if model path contains `ipex-llm`. Otherwise, the original transformer-backend will be used.
 
 You can download the model from [here](https://huggingface.co/lmsys/vicuna-7b-v1.5).
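Because the backend is selected purely by matching `ipex-llm` in the model path, the rename is just a directory move. A minimal sketch, assuming the model was downloaded to `./models` (the path is illustrative):

```bash
# The ./models prefix is an assumed download location, used only for illustration.
mv ./models/vicuna-7b-v1.5 ./models/vicuna-7b-v1.5-ipex-llm
```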
@@ -102,7 +102,7 @@ if __name__ == '__main__':
 # Batch tokenizing
 prompt = args.prompt
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to(f'cpu:{local_rank}')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex-llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict,
 use_cache=True)
@@ -1,8 +1,8 @@
 ## Langchain Examples
 
-This folder contains examples showcasing how to use `langchain` with `ipex`.
+This folder contains examples showcasing how to use `langchain` with `ipex-llm`.
 
-### Install IPEX
+### Install-IPEX LLM
 
 Ensure `ipex-llm` is installed by following the [IPEX-LLM Installation Guide](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm#install).
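For reference, the linked guide installs the package with a single pip command; the exact extras and flags may have changed since this commit, so treat the following as a sketch and defer to the guide:

```bash
# Sketch of the documented install; verify the current command in the installation guide.
pip install --pre --upgrade ipex-llm[all]
```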
@@ -36,7 +36,7 @@ To run the example, execute the following command in the current directory:
 ```bash
 python transformers_int4/rag.py -m <path_to_model> [-q <your_question>] [-i <path_to_input_txt>]
 ```
-> Note: If `-i` is not specified, it will use a short introduction to Big-DL as input by default. if `-q` is not specified, `What is IPEX?` will be used by default.
+> Note: If `-i` is not specified, it will use a short introduction to Big-DL as input by default. if `-q` is not specified, `What is IPEX LLM?` will be used by default.
 
 ### Example: Math
@@ -66,3 +66,8 @@ python transformers_int4/voiceassistant.py -m <path_to_model> [-q <your_question
 - `-x MAX_NEW_TOKENS`: the max new tokens of model tokens input
 - `-l LANGUAGE`: you can specify a language such as "english" or "chinese"
 - `-d True|False`: whether the model path specified in -m is saved low bit model.
+
+### Legacy (Native INT4 examples)
+
+IPEX-LLM also provides langchain integrations using native INT4 mode. Those examples can be foud in [native_int4](./native_int4/) folder. For detailed instructions of settting up and running `native_int4` examples, refer to [Native INT4 Examples README](./README_nativeint4.md).
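Putting the flags above together, an invocation might look like the sketch below; the script name is taken from the hunk header, while the model path and flag values are purely illustrative:

```bash
# Illustrative values: the model path is hypothetical, flags follow the list above.
python transformers_int4/voiceassistant.py \
    -m ./models/Llama-2-7b-chat-hf \
    -x 128 \
    -l english \
    -d False
```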
@@ -54,7 +54,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = MIXTRAL_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('cpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex-llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -28,7 +28,7 @@ Example usage:
 python ./alpaca_qlora_finetuning_cpu.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca"
+    --output_dir "./ipex-llm-qlora-alpaca"
 ```
 
 **Note**: You could also specify `--base_model` to the local path of the huggingface model checkpoint folder and `--data_path` to the local path of the dataset JSON file.
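As a concrete illustration of that note, the same command with local paths substituted might look like this; both paths below are hypothetical placeholders:

```bash
# Both local paths are hypothetical placeholders for illustration.
python ./alpaca_qlora_finetuning_cpu.py \
    --base_model "/path/to/Llama-2-7b-hf" \
    --data_path "/path/to/alpaca_data_cleaned.json" \
    --output_dir "./ipex-llm-qlora-alpaca"
```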
@@ -109,7 +109,7 @@ def generate_and_tokenize_prompt(data_point):
 python ./quotes_qlora_finetuning_cpu.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "./english_quotes" \
-    --output_dir "./ipex-qlora-alpaca" \
+    --output_dir "./ipex-llm-qlora-alpaca" \
     --prompt_template_name "english_quotes"
 ```

@@ -14,5 +14,5 @@ mpirun -n 2 \
     --max_steps -1 \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca"
+    --output_dir "./ipex-llm-qlora-alpaca"

@@ -109,7 +109,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = args.prompt
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to(f'xpu:{local_rank}')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict,
 use_cache=True)
@@ -64,7 +64,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to("xpu")
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
 st = time.time()

@@ -55,7 +55,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = BAICHUAN_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -61,7 +61,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = BAICHUAN_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -55,7 +55,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = BLUELM_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
@@ -58,7 +58,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = CHATGLM_V2_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -54,7 +54,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = args.question
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=32)

@@ -58,7 +58,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -54,7 +54,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = args.question
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=32)
@@ -74,7 +74,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = get_prompt(args.prompt, [], system_prompt=DEFAULT_SYSTEM_PROMPT)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -58,7 +58,7 @@ if __name__ == '__main__':
 prompt = CODELLAMA_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -58,7 +58,7 @@ if __name__ == '__main__':
 prompt = FALCON_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -60,7 +60,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = FLAN_T5_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
@@ -59,7 +59,7 @@ if __name__ == '__main__':
 chat[0]['content'] = args.prompt
 prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -57,7 +57,7 @@ if __name__ == '__main__':
 prompt = GptJ_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -57,7 +57,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = INTERNLM_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -62,7 +62,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = INTERNLM_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
@@ -70,7 +70,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = get_prompt(args.prompt, [], system_prompt=DEFAULT_SYSTEM_PROMPT)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -56,7 +56,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = MISTRAL_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -56,7 +56,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = MIXTRAL_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -58,7 +58,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = MPT_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
@@ -59,7 +59,7 @@ if __name__ == '__main__':
 prompt = PHI1_5_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict,
 generation_config = generation_config)

@@ -60,7 +60,7 @@ if __name__ == '__main__':
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
 model.generation_config.pad_token_id = model.generation_config.eos_token_id
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict,
 generation_config = generation_config)

@@ -61,7 +61,7 @@ if __name__ == '__main__':
 prompt = PHI1_5_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict,
 generation_config = generation_config)

@@ -64,7 +64,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = QWEN_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
@@ -56,7 +56,7 @@ if __name__ == '__main__':
 prompt = RedPajama_PROMPT_FORMAT.format(prompt=args.prompt)
 inputs = tokenizer(prompt, return_tensors='pt').to('xpu')
 
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(**inputs,
 max_new_tokens=args.n_predict,
 do_sample=True,

@@ -57,7 +57,7 @@ if __name__ == '__main__':
 prompt = REPLIT_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -70,7 +70,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = generate_prompt(instruction=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -67,7 +67,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = generate_prompt(instruction=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
@@ -58,7 +58,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = SOLAR_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -57,7 +57,7 @@ if __name__ == '__main__':
 prompt = StarCoder_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -62,7 +62,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = YI_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -66,7 +66,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = LLAMA2_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
@@ -58,8 +58,8 @@ bash lora_finetune_llama2_7b_pvc_1550_4_card.sh
 python ./alpaca_lora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
-    --resume_from_checkpoint "./ipex-qlora-alpaca/checkpoint-1100"
+    --output_dir "./ipex-llm-qlora-alpaca" \
+    --resume_from_checkpoint "./ipex-llm-qlora-alpaca/checkpoint-1100"
 ```
 
 ### 5. Sample Output

@@ -20,6 +20,6 @@ python ./alpaca_lora_finetuning.py \
     --batch_size 128 \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-lora-alpaca" \
+    --output_dir "./ipex-llm-lora-alpaca" \
     --gradient_checkpointing True \
     --lora_target_modules "['k_proj', 'q_proj', 'o_proj', 'v_proj']"

@@ -25,6 +25,6 @@ mpirun -n 4 \
     --batch_size 128 \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-lora-alpaca" \
+    --output_dir "./ipex-llm-lora-alpaca" \
     --gradient_checkpointing True \
     --lora_target_modules "['k_proj', 'q_proj', 'o_proj', 'v_proj', 'up_proj', 'down_proj', 'gate_proj']"

@@ -20,6 +20,6 @@ python ./alpaca_lora_finetuning.py \
     --batch_size 128 \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-lora-alpaca" \
+    --output_dir "./ipex-llm-lora-alpaca" \
     --gradient_checkpointing True \
     --lora_target_modules "['k_proj', 'q_proj', 'o_proj', 'v_proj', 'up_proj', 'down_proj', 'gate_proj']"
@@ -25,6 +25,6 @@ mpirun -n 8 \
     --batch_size 128 \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-lora-alpaca" \
+    --output_dir "./ipex-llm-lora-alpaca" \
     --gradient_checkpointing False \
     --lora_target_modules "['k_proj', 'q_proj', 'o_proj', 'v_proj', 'up_proj', 'down_proj', 'gate_proj']"

@@ -52,8 +52,8 @@ bash qalora_finetune_llama2_7b_pvc_1550_1_tile.sh
 python ./alpaca_qalora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
-    --resume_from_checkpoint "./ipex-qlora-alpaca/checkpoint-1100"
+    --output_dir "./ipex-llm-qlora-alpaca" \
+    --resume_from_checkpoint "./ipex-llm-qlora-alpaca/checkpoint-1100"
 ```
 
 ### 5. Sample Output

@@ -18,7 +18,7 @@
 python ./alpaca_qalora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
+    --output_dir "./ipex-llm-qlora-alpaca" \
     --learning_rate 9e-5 \
     --micro_batch_size 2 \
     --batch_size 128 \

@@ -23,7 +23,7 @@ mpirun -n 2 \
 python -u ./alpaca_qalora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
+    --output_dir "./ipex-llm-qlora-alpaca" \
     --learning_rate 9e-5 \
     --micro_batch_size 2 \
     --batch_size 128 \
@@ -23,7 +23,7 @@ mpirun -n 2 \
 python -u ./alpaca_qalora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
+    --output_dir "./ipex-llm-qlora-alpaca" \
     --learning_rate 9e-5 \
     --micro_batch_size 8 \
     --batch_size 128 \

@@ -19,7 +19,7 @@
 python ./alpaca_qalora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
+    --output_dir "./ipex-llm-qlora-alpaca" \
     --learning_rate 9e-5 \
     --micro_batch_size 8 \
     --batch_size 128 \

@@ -135,8 +135,8 @@ If you fail to complete the whole finetuning process, it is suggested to resume
 python ./alpaca_qlora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
-    --resume_from_checkpoint "./ipex-qlora-alpaca/checkpoint-1100"
+    --output_dir "./ipex-llm-qlora-alpaca" \
+    --resume_from_checkpoint "./ipex-llm-qlora-alpaca/checkpoint-1100"
 ```
 
 ### 5. Sample Output

@@ -23,6 +23,6 @@ mpirun -n 2 \
 python -u ./alpaca_qlora_finetuning.py \
     --base_model "meta-llama/Llama-2-13b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
+    --output_dir "./ipex-llm-qlora-alpaca" \
     --micro_batch_size 8 \
     --batch_size 128 > training.log
@@ -18,6 +18,6 @@
 python ./alpaca_qlora_finetuning.py \
     --base_model "meta-llama/Llama-2-13b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
+    --output_dir "./ipex-llm-qlora-alpaca" \
     --micro_batch_size 8 \
     --batch_size 128

@@ -23,6 +23,6 @@ mpirun -n 8 \
 python -u ./alpaca_qlora_finetuning.py \
     --base_model "meta-llama/Llama-2-13b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
+    --output_dir "./ipex-llm-qlora-alpaca" \
     --micro_batch_size 8 \
     --batch_size 128 > training.log

@@ -27,7 +27,7 @@ mpirun -n 2 \
 python -u ./alpaca_qlora_finetuning.py \
     --base_model "meta-llama/Llama-2-70b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
+    --output_dir "./ipex-llm-qlora-alpaca" \
     --gradient_checkpointing True \
     --micro_batch_size 8 \
     --batch_size 128 \

@@ -27,7 +27,7 @@ mpirun -n 8 \
 python -u ./alpaca_qlora_finetuning.py \
     --base_model "meta-llama/Llama-2-70b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
+    --output_dir "./ipex-llm-qlora-alpaca" \
     --gradient_checkpointing True \
     --micro_batch_size 8 \
     --batch_size 128 \
@@ -18,4 +18,4 @@
 python ./alpaca_qlora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca"
+    --output_dir "./ipex-llm-qlora-alpaca"

@@ -23,4 +23,4 @@ mpirun -n 2 \
 python -u ./alpaca_qlora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" > training.log
+    --output_dir "./ipex-llm-qlora-alpaca" > training.log

@@ -20,4 +20,4 @@ python ./alpaca_qlora_finetuning.py \
     --batch_size 128 \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca"
+    --output_dir "./ipex-llm-qlora-alpaca"

@@ -23,7 +23,7 @@ mpirun -n 3 \
 python -u ./alpaca_qlora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
+    --output_dir "./ipex-llm-qlora-alpaca" \
     --gradient_checkpointing False \
     --micro_batch_size 2 \
     --batch_size 128 > training.log

@@ -20,4 +20,4 @@ python ./alpaca_qlora_finetuning.py \
     --batch_size 128 \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca"
+    --output_dir "./ipex-llm-qlora-alpaca"
@@ -23,6 +23,6 @@ mpirun -n 4 \
 python -u ./alpaca_qlora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
+    --output_dir "./ipex-llm-qlora-alpaca" \
     --micro_batch_size 8 \
     --batch_size 128 > training.log

@@ -23,6 +23,6 @@ mpirun -n 2 \
 python -u ./alpaca_qlora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
+    --output_dir "./ipex-llm-qlora-alpaca" \
     --micro_batch_size 8 \
     --batch_size 128 > training.log

@@ -23,6 +23,6 @@ mpirun -n 8 \
 python -u ./alpaca_qlora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
+    --output_dir "./ipex-llm-qlora-alpaca" \
     --micro_batch_size 8 \
     --batch_size 128 > training.log

@@ -58,8 +58,8 @@ bash relora_finetune_llama2_7b_pvc_1550_4_card.sh
 python ./alpaca_relora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-qlora-alpaca" \
-    --resume_from_checkpoint "./ipex-qlora-alpaca/checkpoint-1100"
+    --output_dir "./ipex-llm-qlora-alpaca" \
+    --resume_from_checkpoint "./ipex-llm-qlora-alpaca/checkpoint-1100"
 ```
 
 ### 5. Sample Output
@@ -18,6 +18,6 @@
 python ./alpaca_relora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-relora-alpaca" \
+    --output_dir "./ipex-llm-relora-alpaca" \
     --relora_steps 300 \
     --relora_warmup_steps 10

@@ -23,6 +23,6 @@ mpirun -n 2 \
 python -u ./alpaca_relora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-relora-alpaca" \
+    --output_dir "./ipex-llm-relora-alpaca" \
     --relora_steps 300 \
     --relora_warmup_steps 10 > training.log

@@ -23,7 +23,7 @@ mpirun -n 2 \
 python -u ./alpaca_relora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-relora-alpaca" \
+    --output_dir "./ipex-llm-relora-alpaca" \
     --micro_batch_size 8 \
     --relora_steps 300 \
     --relora_warmup_steps 10 \

@@ -23,7 +23,7 @@ mpirun -n 8 \
 python -u ./alpaca_relora_finetuning.py \
     --base_model "meta-llama/Llama-2-7b-hf" \
     --data_path "yahma/alpaca-cleaned" \
-    --output_dir "./ipex-relora-alpaca" \
+    --output_dir "./ipex-llm-relora-alpaca" \
     --micro_batch_size 8 \
     --relora_steps 300 \
     --relora_warmup_steps 10 \
@@ -62,7 +62,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = BAICHUAN_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -60,7 +60,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -90,7 +90,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = get_prompt(args.prompt, [], system_prompt=DEFAULT_SYSTEM_PROMPT)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu:0')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
 output = model.generate(input_ids,

@@ -58,7 +58,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = AQUILA2_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
@@ -57,7 +57,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = BAICHUAN_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -59,7 +59,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = BAICHUAN2_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -52,7 +52,7 @@ if __name__ == '__main__':
 inputs = processor(text, voice_preset=voice_preset).to('xpu')
 
 with torch.inference_mode():
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 audio_array = model.generate(**inputs)
 
 st = time.time()

@@ -57,7 +57,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = BLUELM_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
@@ -58,7 +58,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = CHATGLM_V2_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -57,7 +57,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = args.question
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=32)

@@ -58,7 +58,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -57,7 +57,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = args.question
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=32)
@@ -59,7 +59,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = CODELLAMA_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -65,7 +65,7 @@ if __name__ == '__main__':
 prompt = DOLLY_V1_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 end_key_token_id=tokenizer.encode("### End")[0]
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 use_cache=True,
 max_new_tokens=args.n_predict,

@@ -65,7 +65,7 @@ if __name__ == '__main__':
 prompt = DOLLY_V2_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 end_key_token_id=tokenizer.encode("### End")[0]
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict,
 pad_token_id=tokenizer.pad_token_id,

@@ -60,7 +60,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = FLAN_T5_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
@@ -57,7 +57,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = INTERNLM_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -62,7 +62,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = LLAMA2_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -54,7 +54,7 @@ if __name__ == '__main__':
 # Generate predicted tokens
 with torch.inference_mode():
 input_ids = tokenizer.encode(args.prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
 st = time.time()

@@ -58,7 +58,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = MISTRAL_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
@@ -58,7 +58,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = MIXTRAL_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -56,7 +56,7 @@ if __name__ == '__main__':
 prompt = PHI_1_5_V1_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids, do_sample=False, max_new_tokens=args.n_predict, generation_config = generation_config)
 # start inference
 st = time.time()

@@ -57,7 +57,7 @@ if __name__ == '__main__':
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
 model.generation_config.pad_token_id = model.generation_config.eos_token_id
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids, do_sample=False, max_new_tokens=args.n_predict, generation_config = generation_config)
 # start inference
 st = time.time()

@@ -61,7 +61,7 @@ if __name__ == '__main__':
 prompt = PHI1_5_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
 
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict,
 generation_config = generation_config)
@@ -57,7 +57,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = REPLIT_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -89,7 +89,7 @@ if __name__ == '__main__':
 speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0).to('xpu')
 
 with torch.inference_mode():
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
 
 st = time.time()

@@ -57,7 +57,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = STARCODER_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)

@@ -55,7 +55,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = YI_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
@@ -62,7 +62,7 @@ if __name__ == '__main__':
 with torch.inference_mode():
 prompt = LLAMA2_PROMPT_FORMAT.format(prompt=args.prompt)
 input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
-# ipex model needs a warmup, then inference time can be accurate
+# ipex_llm model needs a warmup, then inference time can be accurate
 output = model.generate(input_ids,
 max_new_tokens=args.n_predict)
Some files were not shown because too many files have changed in this diff.