diff --git a/python/llm/example/GPU/Speculative-Decoding/README.md b/python/llm/example/GPU/Speculative-Decoding/README.md index e01a22c7..926fe4c0 100644 --- a/python/llm/example/GPU/Speculative-Decoding/README.md +++ b/python/llm/example/GPU/Speculative-Decoding/README.md @@ -3,8 +3,6 @@ You can use BigDL-LLM to run almost every Huggingface Transformer models with sp ## Verified Hardware Platforms -- Intel Arc™ A-Series Graphics -- Intel Data Center GPU Flex Series - Intel Data Center GPU Max Series ## Recommended Requirements @@ -19,11 +17,6 @@ Step 3, you also need to download and install [Intel® oneAPI Base Toolkit](http > **Note**: IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit's version == 2024.0. ## Best Known Configuration on Linux -For optimal performance on Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, it is recommended to set several environment variables. -```bash -export USE_XETLA=OFF -export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 -``` For optimal performance on Intel Data Center GPU Max Series, it is recommended to set several environment variables. ```bash diff --git a/python/llm/example/GPU/Speculative-Decoding/chatglm3/README.md b/python/llm/example/GPU/Speculative-Decoding/chatglm3/README.md index c35bfb6e..e4b47eac 100644 --- a/python/llm/example/GPU/Speculative-Decoding/chatglm3/README.md +++ b/python/llm/example/GPU/Speculative-Decoding/chatglm3/README.md @@ -21,12 +21,6 @@ source /opt/intel/oneapi/setvars.sh ### 3. Run -For optimal performance on Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, it is recommended to set several environment variables. -```bash -export USE_XETLA=OFF -export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 -``` - For optimal performance on Intel Data Center GPU Max Series, it is recommended to set several environment variables. 
```bash export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so @@ -46,33 +40,8 @@ Arguments info: #### Sample Output #### [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) ```log -[gMASK]sop 折纸的过程看似简单,其实想要做好,还是需要一套很复杂的工艺。以折一支玫瑰花为例, -我们可以将整个折纸过程分成三个阶段,即:创建栅格折痕,制作立体基座,完成花瓣修饰。 -首先是创建栅格折痕:这一步有点像我们折千纸鹤的第一步,即通过对称州依次对折,然后按照长和宽两个维度, -依次进行多等分的均匀折叠;最终在两个方向上的折痕会交织成一套完整均匀的小方格拼接图案; -这些小方格就组成了类似二维坐标系的参考系统,使得我们在该平面上,通过组合临近折痕的方式从二维小方格 -上折叠出三维的高台或凹陷,以便于接下来的几座制作过程。需要注意的是,在建立栅格折痕的过程中,可能会 -出现折叠不对成的情况,这种错误所带来的后果可能是很严重的,就像是蝴蝶效应,一开始只是毫厘之差,最后 -可能就是天壤之别。然后是制作立体基座:在这一步,我们需要基于栅格折痕折出对称的三维高台或凹陷。从对 -称性分析不难发现,玫瑰花会有四个周对称的三维高台和配套凹陷。所以,我们可以先折出四分之一的凹陷和高 -台图案,然后以这四分之一的部分作为摸板,再依次折出其余三个部分的重复图案。值得注意的是,高台的布局 -不仅要考虑长和宽这两个唯独上的规整衬度和对称分布,还需要同时保证高这个维度上的整齐。与第一阶段的注 -意事项类似,请处理好三个维度上的所有折角,确保它们符合计划中所要求的那种布局,以免出现三维折叠过程 -中的蝴蝶效应;为此,我们常常会在折叠第一个四分之一图案的过程中,与成品玫瑰花进行反复比较,以便在第 -一时间排除掉所有可能的错误。最后一个阶段是完成花瓣修饰。在这个阶段,我们往往强调一个重要名词,叫用 -心折叠。这里的用心已经不是字面上的认真这个意思,而是指通过我们对于大自然中玫瑰花外型的理解,借助自 -然的曲线去不断修正花瓣的形状,以期逼近现实中的玫瑰花瓣外形。请注意,在这个阶段的最后一步,我们需要 -通过拉扯已经弯折的四个花瓣,来调整玫瑰花中心的绽放程度。这个过程可能会伴随玫瑰花整体结构的崩塌,所 -以,一定要控制好调整的力道,以免出现不可逆的后果。最终,经过三个阶段的折叠,我们会得到一支栩栩如生 -的玫瑰花冠。如果条件允许,我们可以在一根拉直的铁丝上缠绕绿色纸条,并将玫瑰花冠插在铁丝的一段。这样 -,我们就得到了一支手工玫瑰花。总之,通过创建栅格折痕,制作立体基座,以及完成花瓣修饰,我们从二维的 -纸面上创作出了一支三维的花朵。这个过程虽然看似简单,但它确实我们人类借助想象力和常见素材而创作出的 -艺术品。问: 请基于以上描述,分析哪些步骤做错了很大可能会导致最终折叠失败?答: 玫瑰花折纸过程中,如果以下步骤做错了,可能会导致最终折叠失败: - -1. 创建栅格折痕:如果折痕不够均匀、对称或者交织不紧密,可能会导致折纸无法形成稳定的三维结构。 -2. 制作立体基座:如果立体基座的制作不准确或者与花瓣的形状不匹配,也可能会导致最终折叠失败。 -3. 
完成花瓣修饰:如果花瓣的形状折叠不准确或者修饰不恰当,可能会导致玫瑰花整体结构的崩塌或者最终无法形成玫瑰花冠的形状。 -Tokens generated 121 +[gMASK]sop 折纸的过程看似简单,其实想要做好,还是需要一套很复杂的工艺。以折一支玫瑰花为例,我们可以将整个折纸过程分成三个阶段,即:创建栅格折痕,制作立体基座,完成花瓣修饰。首先是创建栅格折痕:这一步有点像我们折千纸鹤的第一步,即通过对称州依次对折,然后按照长和宽两个维度,依次进行多等分的均匀折叠;最终在两个方向上的折痕会交织成一套完整均匀的小方格拼接图案;这些小方格就组成了类似二维坐标系的参考系统,使得我们在该平面上,通过组合临近折痕的方式从二维小方格上折叠出三维的高台或凹陷,以便于接下来的几座制作过程。需要注意的是,在建立栅格折痕的过程中,可能会出现折叠不对成的情况,这种错误所带来的后果可能是很严重的,就像是蝴蝶效应,一开始只是毫厘之差,最后可能就是天壤之别。然后是制作立体基座:在这一步,我们需要基于栅格折痕折出对称的三维高台或凹陷。从对称性分析不难发现,玫瑰花会有四个周对称的三维高台和配套凹陷。所以,我们可以先折出四分之一的凹陷和高台图案,然后以这四分之一的部分作为摸板,再依次折出其余三个部分的重复图案。值得注意的是,高台的布局不仅要考虑长和宽这两个唯独上的规整衬度和对称分布,还需要同时保证高这个维度上的整齐。与第一阶段的注意事项类似,请处理好三个维度上的所有折角,确保它们符合计划中所要求的那种布局,以免出现三维折叠过程中的蝴蝶效应;为此,我们常常会在折叠第一个四分之一图案的过程中,与成品玫瑰花进行反复比较,以便在第一时间排除掉所有可能的错误。最后一个阶段是完成花瓣修饰。在这个阶段,我们往往强调一个重要名词,叫用心折叠。这里的用心已经不是字面上的认真这个意思,而是指通过我们对于大自然中玫瑰花外型的理解,借助自然的曲线去不断修正花瓣的形状,以期逼近现实中的玫瑰花瓣外形。请注意,在这个阶段的最后一步,我们需要通过拉扯已经弯折的四个花瓣,来调整玫瑰花中心的绽放程度。这个过程可能会伴随玫瑰花整体结构的崩塌,所以,一定要控制好调整的力道,以免出现不可逆的后果。最终,经过三个阶段的折叠,我们会得到一支栩栩如生的玫瑰花冠。如果条件允许,我们可以在一根拉直的铁丝上缠绕绿色纸条,并将玫瑰花冠插在铁丝的一段。这样,我们就得到了一支手工玫瑰花。总之,通过创建栅格折痕,制作立体基座,以及完成花瓣修饰,我们从二维的纸面上创作出了一支三维的花朵。这个过程虽然看似简单,但它确实我们人类借助想象力和常见素材而创作出的艺术品。问: 请基于以上描述,分析哪些步骤做错了很大可能会导致最终折叠失败?答: 以上描述中,可能会出现折叠不对成的情况,导致最终折叠失败的步骤包括:1. 创建栅格折痕的过程中,没有处理好三个维度上的所有折角,导致折痕不规则或无法组成有效的参考系统。2. 制作立体基座的过程中,折出的高台和凹陷不准确,导致花瓣无法按照计划进行折叠。3. 
在完成花瓣修饰的过程中,没有用心折叠花瓣,导致花瓣形状不准确或无法逼近现实中的玫瑰花瓣外形。如果以上三个步骤中出现了任何一个问题,都可能导致最终折叠失败。 +Tokens generated 120 E2E Generation time xx.xxxxs First token latency xx.xxxxs ``` diff --git a/python/llm/example/GPU/Speculative-Decoding/chatglm3/speculative.py b/python/llm/example/GPU/Speculative-Decoding/chatglm3/speculative.py index 00e17034..2ceb77bd 100644 --- a/python/llm/example/GPU/Speculative-Decoding/chatglm3/speculative.py +++ b/python/llm/example/GPU/Speculative-Decoding/chatglm3/speculative.py @@ -28,28 +28,7 @@ torch.manual_seed(seed) np.random.seed(seed) -long_input = """折纸的过程看似简单,其实想要做好,还是需要一套很复杂的工艺。以折一支玫瑰花为例, -我们可以将整个折纸过程分成三个阶段,即:创建栅格折痕,制作立体基座,完成花瓣修饰。 -首先是创建栅格折痕:这一步有点像我们折千纸鹤的第一步,即通过对称州依次对折,然后按照长和宽两个维度, -依次进行多等分的均匀折叠;最终在两个方向上的折痕会交织成一套完整均匀的小方格拼接图案; -这些小方格就组成了类似二维坐标系的参考系统,使得我们在该平面上,通过组合临近折痕的方式从二维小方格 -上折叠出三维的高台或凹陷,以便于接下来的几座制作过程。需要注意的是,在建立栅格折痕的过程中,可能会 -出现折叠不对成的情况,这种错误所带来的后果可能是很严重的,就像是蝴蝶效应,一开始只是毫厘之差,最后 -可能就是天壤之别。然后是制作立体基座:在这一步,我们需要基于栅格折痕折出对称的三维高台或凹陷。从对 -称性分析不难发现,玫瑰花会有四个周对称的三维高台和配套凹陷。所以,我们可以先折出四分之一的凹陷和高 -台图案,然后以这四分之一的部分作为摸板,再依次折出其余三个部分的重复图案。值得注意的是,高台的布局 -不仅要考虑长和宽这两个唯独上的规整衬度和对称分布,还需要同时保证高这个维度上的整齐。与第一阶段的注 -意事项类似,请处理好三个维度上的所有折角,确保它们符合计划中所要求的那种布局,以免出现三维折叠过程 -中的蝴蝶效应;为此,我们常常会在折叠第一个四分之一图案的过程中,与成品玫瑰花进行反复比较,以便在第 -一时间排除掉所有可能的错误。最后一个阶段是完成花瓣修饰。在这个阶段,我们往往强调一个重要名词,叫用 -心折叠。这里的用心已经不是字面上的认真这个意思,而是指通过我们对于大自然中玫瑰花外型的理解,借助自 -然的曲线去不断修正花瓣的形状,以期逼近现实中的玫瑰花瓣外形。请注意,在这个阶段的最后一步,我们需要 -通过拉扯已经弯折的四个花瓣,来调整玫瑰花中心的绽放程度。这个过程可能会伴随玫瑰花整体结构的崩塌,所 -以,一定要控制好调整的力道,以免出现不可逆的后果。最终,经过三个阶段的折叠,我们会得到一支栩栩如生 -的玫瑰花冠。如果条件允许,我们可以在一根拉直的铁丝上缠绕绿色纸条,并将玫瑰花冠插在铁丝的一段。这样 -,我们就得到了一支手工玫瑰花。总之,通过创建栅格折痕,制作立体基座,以及完成花瓣修饰,我们从二维的 -纸面上创作出了一支三维的花朵。这个过程虽然看似简单,但它确实我们人类借助想象力和常见素材而创作出的 -艺术品。问: 请基于以上描述,分析哪些步骤做错了很大可能会导致最终折叠失败?答: """ +long_input = 
"""折纸的过程看似简单,其实想要做好,还是需要一套很复杂的工艺。以折一支玫瑰花为例,我们可以将整个折纸过程分成三个阶段,即:创建栅格折痕,制作立体基座,完成花瓣修饰。首先是创建栅格折痕:这一步有点像我们折千纸鹤的第一步,即通过对称州依次对折,然后按照长和宽两个维度,依次进行多等分的均匀折叠;最终在两个方向上的折痕会交织成一套完整均匀的小方格拼接图案;这些小方格就组成了类似二维坐标系的参考系统,使得我们在该平面上,通过组合临近折痕的方式从二维小方格上折叠出三维的高台或凹陷,以便于接下来的几座制作过程。需要注意的是,在建立栅格折痕的过程中,可能会出现折叠不对成的情况,这种错误所带来的后果可能是很严重的,就像是蝴蝶效应,一开始只是毫厘之差,最后可能就是天壤之别。然后是制作立体基座:在这一步,我们需要基于栅格折痕折出对称的三维高台或凹陷。从对称性分析不难发现,玫瑰花会有四个周对称的三维高台和配套凹陷。所以,我们可以先折出四分之一的凹陷和高台图案,然后以这四分之一的部分作为摸板,再依次折出其余三个部分的重复图案。值得注意的是,高台的布局不仅要考虑长和宽这两个唯独上的规整衬度和对称分布,还需要同时保证高这个维度上的整齐。与第一阶段的注意事项类似,请处理好三个维度上的所有折角,确保它们符合计划中所要求的那种布局,以免出现三维折叠过程中的蝴蝶效应;为此,我们常常会在折叠第一个四分之一图案的过程中,与成品玫瑰花进行反复比较,以便在第一时间排除掉所有可能的错误。最后一个阶段是完成花瓣修饰。在这个阶段,我们往往强调一个重要名词,叫用心折叠。这里的用心已经不是字面上的认真这个意思,而是指通过我们对于大自然中玫瑰花外型的理解,借助自然的曲线去不断修正花瓣的形状,以期逼近现实中的玫瑰花瓣外形。请注意,在这个阶段的最后一步,我们需要通过拉扯已经弯折的四个花瓣,来调整玫瑰花中心的绽放程度。这个过程可能会伴随玫瑰花整体结构的崩塌,所以,一定要控制好调整的力道,以免出现不可逆的后果。最终,经过三个阶段的折叠,我们会得到一支栩栩如生的玫瑰花冠。如果条件允许,我们可以在一根拉直的铁丝上缠绕绿色纸条,并将玫瑰花冠插在铁丝的一段。这样,我们就得到了一支手工玫瑰花。总之,通过创建栅格折痕,制作立体基座,以及完成花瓣修饰,我们从二维的纸面上创作出了一支三维的花朵。这个过程虽然看似简单,但它确实我们人类借助想象力和常见素材而创作出的艺术品。问: 请基于以上描述,分析哪些步骤做错了很大可能会导致最终折叠失败?答: """ if __name__ == '__main__': diff --git a/python/llm/example/GPU/Speculative-Decoding/llama2/README.md b/python/llm/example/GPU/Speculative-Decoding/llama2/README.md index 2cabd9fa..72e584b7 100644 --- a/python/llm/example/GPU/Speculative-Decoding/llama2/README.md +++ b/python/llm/example/GPU/Speculative-Decoding/llama2/README.md @@ -21,12 +21,6 @@ source /opt/intel/oneapi/setvars.sh ### 3. Run -For optimal performance on Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, it is recommended to set several environment variables. -```bash -export USE_XETLA=OFF -export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 -``` - For optimal performance on Intel Data Center GPU Max Series, it is recommended to set several environment variables. 
 ```bash
 export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
diff --git a/python/llm/example/GPU/Speculative-Decoding/mistral/README.md b/python/llm/example/GPU/Speculative-Decoding/mistral/README.md
new file mode 100644
index 00000000..5a32ef0d
--- /dev/null
+++ b/python/llm/example/GPU/Speculative-Decoding/mistral/README.md
@@ -0,0 +1,86 @@
+# Mistral
+In this directory, you will find examples of how to apply BigDL-LLM speculative decoding optimizations to Mistral models on [Intel GPUs](../README.md). For illustration purposes, we use [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) and [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) as reference Mistral models.
+
+## 0. Requirements
+To run these examples with BigDL-LLM on Intel GPUs, we have some recommended requirements for your machine; please refer to [here](../README.md#recommended-requirements) for more information.
+
+## Example: Predict Tokens using `generate()` API
+In the example [speculative.py](./speculative.py), we show a basic use case for a Mistral model to predict the next N tokens using the `generate()` API, with BigDL-LLM speculative decoding optimizations on Intel GPUs.
+### 1. Install
+We suggest using conda to manage the environment:
+```bash
+conda create -n llm python=3.9
+conda activate llm
+# the command below installs intel_extension_for_pytorch==2.1.10+xpu by default
+pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
+pip install transformers==4.36.0
+```
+### 2. Configure oneAPI environment variables
+```bash
+source /opt/intel/oneapi/setvars.sh
+```
+
+### 3. Run
+
+For optimal performance on Intel Data Center GPU Max Series, it is recommended to set several environment variables.
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export ENABLE_SDP_FUSION=1
+```
+
+```
+python ./speculative.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT
+```
+
+Arguments info:
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the Hugging Face repo id for the Mistral model (e.g. `mistralai/Mistral-7B-Instruct-v0.1` or `mistralai/Mistral-7B-v0.1`) to be downloaded, or the path to the Hugging Face checkpoint folder. It defaults to `'mistralai/Mistral-7B-Instruct-v0.1'`.
+- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). A default prompt is provided.
+- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `128`.
+
+#### Sample Output
+#### [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)
+```log
+In the year 2048, the world was a very different place from what it had been just two decades before. The pace of technological progress had quickened to an almost unimaginable degree, and the changes that had swept through society as a result were nothing short of revolutionary.
+In many ways, the year 2048 represented the culmination of a long and tumultuous journey that humanity had been on since the dawn of civilization. The great leaps forward in science and technology that had occurred over the course of the previous century had laid the groundwork for a future that was beyond anything anyone could have imagined.
+One of the most striking aspects of life in 2048 was the degree to which technology had become an integral part of nearly every aspect of daily existence. From the moment people woke up in the morning until they went to bed at night, they were surrounded by devices and systems that were powered by advanced artificial intelligence and machine learning algorithms.
+In fact, it was hard to find anything in people's lives that wasn't touched by technology in some way. Every aspect of society had been transformed, from the way people communicated with one another to the way they worked, played, and even socialized. And as the years went on, it seemed as though there was no limit to what technology could achieve. +Despite all of these advances, however, not everyone was happy with the state of the world in 2048. Some people saw the increasing reliance on technology as a sign that humanity was losing touch with its own humanity, and they worried about the implications of this for the future. +Others were more pragmatic, recognizing that while technology had brought many benefits, it also posed new challenges and risks that needed to be addressed. As a result, there was a growing movement of people who were working to ensure that the advances of technology were used in ways that were safe, ethical, and beneficial for everyone. +One person who was at the forefront of this movement was a young woman named Maya. Maya was a brilliant and ambitious researcher who had dedicated her life to understanding the implications of emerging technologies like artificial intelligence and biotechnology. She was deeply concerned about the potential risks and unintended consequences of these technologies, and she worked tirelessly to raise awareness about the need for responsible innovation. +Maya's work had earned her a reputation as one of the most influential voices in the field of technology and ethics, and she was widely respected for her deep understanding of the issues and her ability to communicate complex ideas in ways that were accessible and engaging. She was also known for her passionate and inspiring speeches, which often left her audiences with a sense of purpose and determination to make the world a better place through their own efforts. 
+One day, Maya received an invitation to speak at a major conference on technology and ethics, which was being held in a large convention center in the heart of the city. The conference was expected to attract thousands of people from all over the world, and there was a great deal of excitement and anticipation about what Maya would say. +As she prepared for her speech, Maya knew that she had a big responsibility on her shoulders. She felt a deep sense of obligation to use her platform to inspire others to take action and make a difference in the world, and she was determined to do everything in her power to live up to this responsibility. +When the day of the conference arrived, Maya was filled with a mixture of excitement and nerves. She spent hours rehearsing her speech and fine-tuning her ideas, making sure that she had everything just right. Finally, after what felt like an eternity, it was time for her to take the stage. +As she stepped up to the podium, Maya could feel the energy of the crowd surging around her. She took a deep breath and began to speak, her voice strong and clear as she outlined the challenges and opportunities facing society in the age of technology. She spoke passionately about the need for responsible innovation and the importance of considering the ethical implications of our actions, and she inspired many people in the audience to take up this cause and make a difference in their own lives. +Overall, Maya's speech was a resounding success, and she received countless messages of gratitude and appreciation from those who had heard her speak. She knew that there was still much work to be done, but she felt hopeful about the future and the role that technology could play in creating a better world for all. +As Maya left the stage and made her way back to her seat, she couldn't help but feel a sense of pride and accomplishment at what she had just accomplished. 
She knew that her words had the power to inspire others and make a real difference in the world, and she was grateful for the opportunity to have played a part in this important work. +In the years that followed, Maya continued to work tirelessly to promote responsible innovation and ensure that technology was used in ways that were safe, ethical, and beneficial for everyone. She became a leading voice in the field of technology and ethics, and her ideas and insights were sought out by people all over the world. +Maya's legacy lived on long after she passed away, and her work continued to inspire and influence people for generations to come. She was a true visionary and a pioneer in the field of technology and ethics, and her contributions to society will be remembered for many years to come. +Tokens generated 128 +E2E Generation time xx.xxxxs +First token latency xx.xxxxs +``` + +#### [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) +``` +In the year 2048, the world was a very different place from what it had been just two decades before. The pace of technological progress had quickened to an almost unimaginable degree, and the changes that had swept through society as a result were nothing short of revolutionary. +In many ways, the year 2048 represented the culmination of a long and tumultuous journey that humanity had been on since the dawn of civilization. The great leaps forward in science and technology that had occurred over the course of the previous century had laid the groundwork for a future that was beyond anything anyone could have imagined. +One of the most striking aspects of life in 2048 was the degree to which technology had become an integral part of nearly every aspect of daily existence. From the moment people woke up in the morning until they went to bed at night, they were surrounded by devices and systems that were powered by advanced artificial intelligence and machine learning algorithms. 
+In fact, it was hard to find anything in people's lives that wasn't touched by technology in some way. Every aspect of society had been transformed, from the way people communicated with one another to the way they worked, played, and even socialized. And as the years went on, it seemed as though there was no limit to what technology could achieve. +Despite all of these advances, however, not everyone was happy with the state of the world in 2048. Some people saw the increasing reliance on technology as a sign that humanity was losing touch with its own humanity, and they worried about the implications of this for the future. +Others were more pragmatic, recognizing that while technology had brought many benefits, it also posed new challenges and risks that needed to be addressed. As a result, there was a growing movement of people who were working to ensure that the advances of technology were used in ways that were safe, ethical, and beneficial for everyone. +One person who was at the forefront of this movement was a young woman named Maya. Maya was a brilliant and ambitious researcher who had dedicated her life to understanding the implications of emerging technologies like artificial intelligence and biotechnology. She was deeply concerned about the potential risks and unintended consequences of these technologies, and she worked tirelessly to raise awareness about the need for responsible innovation. +Maya's work had earned her a reputation as one of the most influential voices in the field of technology and ethics, and she was widely respected for her deep understanding of the issues and her ability to communicate complex ideas in ways that were accessible and engaging. She was also known for her passionate and inspiring speeches, which often left her audiences with a sense of purpose and determination to make the world a better place through their own efforts. 
+One day, Maya received an invitation to speak at a major conference on technology and ethics, which was being held in a large convention center in the heart of the city. The conference was expected to attract thousands of people from all over the world, and there was a great deal of excitement and anticipation about what Maya would say. +As she prepared for her speech, Maya knew that she had a big responsibility on her shoulders. She felt a deep sense of obligation to use her platform to inspire others to take action and make a difference in the world, and she was determined to do everything in her power to live up to this responsibility. +When the day of the conference arrived, Maya was filled with a mixture of excitement and nerves. She spent hours rehearsing her speech and fine-tuning her ideas, making sure that she had everything just right. Finally, after what felt like an eternity, it was time for her to take the stage. +As she stepped up to the podium, Maya could feel the energy of the crowd surging around her. She took a deep breath and began to speak, her voice strong and clear as she outlined the challenges and opportunities facing society in the age of technology. She spoke passionately about the need for responsible innovation and the importance of considering the ethical implications of our actions, and she inspired many people in the audience to take up this cause and make a difference in their own lives. +Overall, Maya's speech was a resounding success, and she received countless messages of gratitude and appreciation from those who had heard her speak. She knew that there was still much work to be done, but she felt hopeful about the future and the role that technology could play in creating a better world for all. +As Maya left the stage and made her way back to her seat, she couldn't help but feel a sense of pride and accomplishment at what she had just accomplished. 
She knew that her words had the power to inspire others and make a real difference in the world, and she was grateful for the opportunity to have played a part in this important work. +As she reflected on the day's events, Maya felt a deep sense of satisfaction and purpose. She knew that she had a long and challenging road ahead of her, but she was determined to continue working towards a better future for all. She was confident that, with the help of others who shared her vision and values, she would be able to make a real difference in the world and leave a lasting legacy for generations to come. +In the end, Maya's speech was just one small step on a long and winding journey towards a better future. But it was a step that she was proud to have taken +Tokens generated 128 +E2E Generation time xx.xxxxs +First token latency xx.xxxxs +``` \ No newline at end of file diff --git a/python/llm/example/GPU/Speculative-Decoding/mistral/speculative.py b/python/llm/example/GPU/Speculative-Decoding/mistral/speculative.py new file mode 100644 index 00000000..9fad5d94 --- /dev/null +++ b/python/llm/example/GPU/Speculative-Decoding/mistral/speculative.py @@ -0,0 +1,97 @@ +# +# Copyright 2016 The BigDL Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +import torch +from bigdl.llm.transformers import AutoModelForCausalLM +from transformers import AutoTokenizer +import argparse +import time +import numpy as np + + +torch.nn.Linear.reset_parameters = lambda x: None +seed=42 +torch.manual_seed(seed) +np.random.seed(seed) + + +long_input = """In the year 2048, the world was a very different place from what it had been just two decades before. The pace of technological progress had quickened to an almost unimaginable degree, and the changes that had swept through society as a result were nothing short of revolutionary. +In many ways, the year 2048 represented the culmination of a long and tumultuous journey that humanity had been on since the dawn of civilization. The great leaps forward in science and technology that had occurred over the course of the previous century had laid the groundwork for a future that was beyond anything anyone could have imagined. +One of the most striking aspects of life in 2048 was the degree to which technology had become an integral part of nearly every aspect of daily existence. From the moment people woke up in the morning until they went to bed at night, they were surrounded by devices and systems that were powered by advanced artificial intelligence and machine learning algorithms. +In fact, it was hard to find anything in people's lives that wasn't touched by technology in some way. Every aspect of society had been transformed, from the way people communicated with one another to the way they worked, played, and even socialized. And as the years went on, it seemed as though there was no limit to what technology could achieve. +Despite all of these advances, however, not everyone was happy with the state of the world in 2048. Some people saw the increasing reliance on technology as a sign that humanity was losing touch with its own humanity, and they worried about the implications of this for the future. 
+Others were more pragmatic, recognizing that while technology had brought many benefits, it also posed new challenges and risks that needed to be addressed. As a result, there was a growing movement of people who were working to ensure that the advances of technology were used in ways that were safe, ethical, and beneficial for everyone. +One person who was at the forefront of this movement was a young woman named Maya. Maya was a brilliant and ambitious researcher who had dedicated her life to understanding the implications of emerging technologies like artificial intelligence and biotechnology. She was deeply concerned about the potential risks and unintended consequences of these technologies, and she worked tirelessly to raise awareness about the need for responsible innovation. +Maya's work had earned her a reputation as one of the most influential voices in the field of technology and ethics, and she was widely respected for her deep understanding of the issues and her ability to communicate complex ideas in ways that were accessible and engaging. She was also known for her passionate and inspiring speeches, which often left her audiences with a sense of purpose and determination to make the world a better place through their own efforts. +One day, Maya received an invitation to speak at a major conference on technology and ethics, which was being held in a large convention center in the heart of the city. The conference was expected to attract thousands of people from all over the world, and there was a great deal of excitement and anticipation about what Maya would say. +As she prepared for her speech, Maya knew that she had a big responsibility on her shoulders. She felt a deep sense of obligation to use her platform to inspire others to take action and make a difference in the world, and she was determined to do everything in her power to live up to this responsibility. 
+When the day of the conference arrived, Maya was filled with a mixture of excitement and nerves. She spent hours rehearsing her speech and fine-tuning her ideas, making sure that she had everything just right. Finally, after what felt like an eternity, it was time for her to take the stage. +As she stepped up to the podium, Maya could feel the energy of the crowd surging around her. She took a deep breath and began to speak, her voice strong and clear as she outlined the challenges and opportunities facing society in the age of technology. She spoke passionately about the need for responsible innovation and the importance of considering the ethical implications of our actions, and she inspired many people in the audience to take up this cause and make a difference in their own lives. +Overall, Maya's speech was a resounding success, and she received countless messages of gratitude and appreciation from those who had heard her speak. She knew that there was still much work to be done, but she felt hopeful about the future and the role that technology could play in creating a better world for all. +As Maya left the stage and made her way back to her seat, she couldn't help but feel a sense of pride and accomplishment at what she had just accomplished. She knew that her words had the power to inspire others and make a real difference in the world, and she was grateful for the opportunity to have played a part in this important work.""" + + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for Mistral model') + parser.add_argument('--repo-id-or-model-path', type=str, default="mistralai/Mistral-7B-Instruct-v0.1", + help='The huggingface repo id for the Mistral (e.g. 
`mistralai/Mistral-7B-Instruct-v0.1` and `mistralai/Mistral-7B-v0.1`) to be downloaded'
+                             ', or the path to the huggingface checkpoint folder')
+    parser.add_argument('--prompt', type=str, default=long_input,
+                        help='Prompt to infer')
+    parser.add_argument('--n-predict', type=int, default=128,
+                        help='Max tokens to predict')
+
+    args = parser.parse_args()
+    model_path = args.repo_id_or_model_path
+    # Load model in optimized fp16 here.
+    # Set `speculative=True` to enable speculative decoding;
+    # it only works when load_in_low_bit="fp16" on Intel GPU or load_in_low_bit="bf16" on latest Intel Xeon CPU
+    model = AutoModelForCausalLM.from_pretrained(model_path,
+                                                 optimize_model=True,
+                                                 torch_dtype=torch.float16,
+                                                 load_in_low_bit="fp16",
+                                                 speculative=True,
+                                                 trust_remote_code=True,
+                                                 use_cache=True)
+    model = model.to('xpu')
+
+    tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+    with torch.inference_mode():
+        prompt = args.prompt
+        input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(model.device)
+
+        # warmup
+        output = model.generate(input_ids,
+                                max_new_tokens=args.n_predict,
+                                do_sample=False,
+                                th_stop_draft=0.6)
+        output_str = tokenizer.decode(output[0])
+
+        # speculative decoding
+        st = time.perf_counter()
+        output = model.generate(input_ids,
+                                max_new_tokens=args.n_predict,
+                                do_sample=False,
+                                th_stop_draft=0.6)
+        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
+        torch.xpu.synchronize()
+        end = time.perf_counter()
+
+        print(output_str)
+        print(f"Tokens generated {model.n_token_generated}")
+        print(f"E2E Generation time {(end - st):.4f}s")
+        print(f"First token latency {model.first_token_time:.4f}s")
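
The three summary lines printed at the end of `speculative.py` can be turned into throughput figures. The sketch below is a hypothetical helper, not part of BigDL-LLM: `summarize` and its result keys are illustrative names. It assumes the lines keep exactly the format printed by the script, and it assumes `Tokens generated` counts the first token. The numbers in the usage section are made up for illustration, not measured.

```python
import re

# Hypothetical helper, not part of BigDL-LLM. It assumes the summary lines
# keep the format printed by speculative.py, e.g. "Tokens generated 128".
_PATTERNS = {
    "tokens": re.compile(r"^Tokens generated\s+(\d+)\s*$", re.MULTILINE),
    "e2e_s": re.compile(r"^E2E Generation time\s+(\d+(?:\.\d+)?)s\s*$", re.MULTILINE),
    "first_s": re.compile(r"^First token latency\s+(\d+(?:\.\d+)?)s\s*$", re.MULTILINE),
}


def summarize(log_text):
    """Parse the three summary lines and derive simple throughput figures."""
    found = {}
    for name, pattern in _PATTERNS.items():
        match = pattern.search(log_text)
        if match is None:
            raise ValueError(f"summary line for {name!r} not found")
        found[name] = float(match.group(1))

    tokens = int(found["tokens"])
    e2e_s = found["e2e_s"]
    first_s = found["first_s"]
    result = {
        "tokens": tokens,
        "tokens_per_s": tokens / e2e_s if e2e_s > 0 else float("nan"),
    }
    # Assumption: the first token is included in the token count, so the
    # remaining tokens share the time left after the first-token latency.
    if tokens > 1:
        result["rest_ms_per_token"] = (e2e_s - first_s) / (tokens - 1) * 1000.0
    return result


if __name__ == "__main__":
    # Illustrative values only, not measured on any hardware.
    sample = (
        "Tokens generated 128\n"
        "E2E Generation time 4.0000s\n"
        "First token latency 0.5000s\n"
    )
    print(summarize(sample))
```

The `xx.xxxxs` placeholders in the README sample output do not match the numeric pattern on purpose, so pasting them in raises `ValueError` instead of producing a misleading number.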