Update Deepseek coder GPU example (#12712)
* Update Deepseek coder GPU example
* Fix based on comment
This commit is contained in:
parent 9d65dcd7ef
commit c52bdff76b

2 changed files with 39 additions and 34 deletions

@@ -1,5 +1,5 @@
 # Deepseek
-In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Deepseek models on [Intel GPUs](../README.md). For illustration purposes, we utilize the [deepseek-ai/deepseek-coder-6.7b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct) as a reference Deepseek model.
+In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Deepseek models on [Intel GPUs](../README.md). For illustration purposes, we utilize the [deepseek-ai/deepseek-coder-6.7b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct) and [deepseek-ai/deepseek-coder-1.3b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-instruct) as reference Deepseek models.
 
 ## Requirements
 To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information.
@@ -97,7 +97,7 @@ python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT
 ```
 
 Arguments info:
-- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Deepseek model to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'deepseek-ai/deepseek-coder-6.7b-instruct'`.
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Deepseek model (e.g. `deepseek-ai/deepseek-coder-6.7b-instruct` or `deepseek-ai/deepseek-coder-1.3b-instruct`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'deepseek-ai/deepseek-coder-6.7b-instruct'`.
 - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'What is AI?'`.
 - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
 
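For reference, the three flags documented above are plain argparse options with the listed defaults. The sketch below is illustrative only (not part of this commit) and shows, without a GPU, how the defaults and an explicitly passed 1.3b model id resolve:

```python
# Illustrative sketch of the documented arguments and their defaults;
# it mirrors the parser in generate.py but runs standalone.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--repo-id-or-model-path', type=str,
                    default="deepseek-ai/deepseek-coder-6.7b-instruct")
parser.add_argument('--prompt', type=str, default="What is AI?")
parser.add_argument('--n-predict', type=int, default=32)

# Equivalent of: python ./generate.py --repo-id-or-model-path deepseek-ai/deepseek-coder-1.3b-instruct
args = parser.parse_args(["--repo-id-or-model-path",
                          "deepseek-ai/deepseek-coder-1.3b-instruct"])
print(args.repo_id_or_model_path)  # deepseek-ai/deepseek-coder-1.3b-instruct
print(args.prompt)                 # What is AI? (default)
print(args.n_predict)              # 32 (default)
```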
@@ -107,15 +107,17 @@ Arguments info:
 ```log
 Inference time: xxxx s
 -------------------- Prompt --------------------
-You are an AI programming assistant, utilizing the DeepSeek Coder model, developed by DeepSeek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.
-### Instruction:
 What is AI?
-### Response:
-
 -------------------- Output --------------------
-You are an AI programming assistant, utilizing the DeepSeek Coder model, developed by DeepSeek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.
-### Instruction:
-What is AI?
-### Response:
-AI, or Artificial Intelligence, refers to the simulation of human intelligence processes by machines, especially computer systems. It involves the creation of algorithms that allow computers
+AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The
 ```
+
+#### [deepseek-ai/deepseek-coder-1.3b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-instruct)
+
+```log
+Inference time: xxxx s
+-------------------- Prompt --------------------
+What is AI?
+-------------------- Output --------------------
+Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term
+```
@@ -21,19 +21,12 @@ import argparse
 from transformers import AutoTokenizer
 from ipex_llm.transformers import AutoModelForCausalLM
 
-# you could tune the prompt based on your own model,
-# here the prompt tuning refers to https://huggingface.co/Deci/DeciLM-7B-instruct#prompt-template
-PROMPT_FORMAT = """
-You are an AI programming assistant, utilizing the DeepSeek Coder model, developed by DeepSeek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.
-### Instruction:
-{prompt}
-### Response:
-"""
 
 if __name__ == '__main__':
-    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for Deepseek-6.7b model')
+    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for Deepseek model')
     parser.add_argument('--repo-id-or-model-path', type=str, default="deepseek-ai/deepseek-coder-6.7b-instruct",
-                        help='The huggingface repo id for the deepseek (e.g. `deepseek-ai/deepseek-coder-6.7b-instruct`) to be downloaded'
+                        help='The huggingface repo id for the Deepseek model (e.g. `deepseek-ai/deepseek-coder-6.7b-instruct` or '
+                             '`deepseek-ai/deepseek-coder-1.3b-instruct`) to be downloaded'
                              ', or the path to the huggingface checkpoint folder')
     parser.add_argument('--prompt', type=str, default="What is AI?",
                         help='Prompt to infer')
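The hard-coded `PROMPT_FORMAT` string removed above is now supplied by the model's own chat template (see the `apply_chat_template` call in the next hunk). As a rough sanity check, the template can be rendered to plain text without tokenizing; this sketch is not part of the commit and assumes a transformers release with chat-template support and network access to fetch the tokenizer:

```python
# Sketch: render DeepSeek Coder's chat template to a string and compare it
# with the removed PROMPT_FORMAT literal. Not part of the committed example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-instruct",
                                          trust_remote_code=True)
messages = [{"role": "user", "content": "What is AI?"}]

# tokenize=False returns the rendered prompt text instead of token ids.
rendered = tokenizer.apply_chat_template(messages,
                                         add_generation_prompt=True,
                                         tokenize=False)
print(rendered)
```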
@@ -47,31 +40,41 @@ if __name__ == '__main__':
     # which convert the relevant layers in the model into INT4 format
     # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function.
     # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
-    model = AutoModelForCausalLM.from_pretrained(
-        model_path,
-        load_in_4bit=True,
-        trust_remote_code=True,
-        cpu_embedding=True,
-    )
+    model = AutoModelForCausalLM.from_pretrained(model_path,
+                                                 load_in_4bit=True,
+                                                 optimize_model=True,
+                                                 trust_remote_code=True,
+                                                 use_cache=True)
 
-    model = model.to('xpu')
+    model = model.half().to("xpu")
 
     # Load tokenizer
-    tokenizer = AutoTokenizer.from_pretrained(model_path)
-    tokenizer.pad_token = tokenizer.eos_token
+    tokenizer = AutoTokenizer.from_pretrained(model_path,
+                                              trust_remote_code=True)
 
     prompt = args.prompt
 
     # Generate predicted tokens
     with torch.inference_mode():
-        prompt = PROMPT_FORMAT.format(prompt=args.prompt)
-        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
+        # The following code for generation is adapted from https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct#chat-model-inference
+        messages=[
+            { "role": "user", "content": prompt}
+        ]
+        input_ids = tokenizer.apply_chat_template(
+            messages,
+            add_generation_prompt=True,
+            return_tensors="pt"
+        ).to("xpu")
 
         st = time.time()
         output = model.generate(input_ids,
                                 max_new_tokens=args.n_predict)
         torch.xpu.synchronize()
         end = time.time()
         output = output.cpu()
-        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
+        output_str = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
         print(f'Inference time: {end-st} s')
         print('-'*20, 'Prompt', '-'*20)
         print(prompt)
         print('-'*20, 'Output', '-'*20)
         print(output_str)
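The switch from decoding `output[0]` to `output[0][len(input_ids[0]):]` is why the updated sample logs show only the completion: `generate()` returns the prompt ids followed by the newly generated ids, and the slice drops the echoed prompt. A minimal CPU-only illustration of the same slicing, using `sshleifer/tiny-gpt2` purely as a small placeholder model (it is not used by this example):

```python
# Toy illustration of stripping the prompt tokens from generate() output.
# 'sshleifer/tiny-gpt2' is only a tiny placeholder model for a quick CPU run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")

input_ids = tok.encode("What is AI?", return_tensors="pt")
with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=8)

full_text = tok.decode(output[0], skip_special_tokens=True)                       # prompt + completion
completion = tok.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)  # completion only
print(full_text)
print(completion)
```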
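The removed `cpu_embedding=True` argument remains relevant for Intel iGPU users, since the comment recommending it is kept in the new code. A hedged sketch of how that option could be combined with the new-style call, assuming the keyword arguments shown in the old and new versions above can be passed together (this commit does not demonstrate the combination):

```python
# Sketch only: new-style 4-bit load that additionally keeps the embedding
# layer on the CPU for iGPU machines, per the comment retained in generate.py.
# Combining these keyword arguments is an assumption, not part of this commit.
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-1.3b-instruct",
                                             load_in_4bit=True,
                                             optimize_model=True,
                                             trust_remote_code=True,
                                             use_cache=True,
                                             cpu_embedding=True)
model = model.half().to("xpu")
```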