diff --git a/README.md b/README.md index 298413a7..dd5a4b9d 100644 --- a/README.md +++ b/README.md @@ -159,7 +159,7 @@ Over 20 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLa | LLaVA | [link](python/llm/example/CPU/PyTorch-Models/Model/llava) | [link](python/llm/example/GPU/PyTorch-Models/Model/llava) | | CodeLlama | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/codellama) | [link](python/llm/example/GPU/HF-Transformers-AutoModels/Model/codellama) | | Skywork | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/skywork) | | - +| InternLM-XComposer | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm-xcomposer) | | ***For more details, please refer to the `bigdl-llm` [Document](https://test-bigdl-llm.readthedocs.io/en/main/doc/LLM/index.html), [Readme](python/llm), [Tutorial](https://github.com/intel-analytics/bigdl-llm-tutorial) and [API Doc](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/index.html).*** diff --git a/python/llm/README.md b/python/llm/README.md index 986a788a..50093970 100644 --- a/python/llm/README.md +++ b/python/llm/README.md @@ -66,7 +66,7 @@ Over 20 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLa | LLaVA | [link](example/CPU/PyTorch-Models/Model/llava) | [link](example/GPU/PyTorch-Models/Model/llava) | | CodeLlama | [link](example/CPU/HF-Transformers-AutoModels/Model/codellama) | [link](example/GPU/HF-Transformers-AutoModels/Model/codellama) | | Skywork | [link](example/CPU/HF-Transformers-AutoModels/Model/skywork) | | - +| InternLM-XComposer | [link](example/CPU/HF-Transformers-AutoModels/Model/internlm-xcomposer) | | ### Working with `bigdl-llm` diff --git a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm-xcomposer/README.md b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm-xcomposer/README.md new file mode 100644 index 00000000..f0783ed2 --- /dev/null +++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm-xcomposer/README.md @@ -0,0 +1,93 @@ +# InternLM_XComposer +In this directory, you will find examples on how you could apply BigDL-LLM INT4 optimizations on InternLM_XComposer models. For illustration purposes, we utilize the [internlm/internlm-xcomposer-vl-7b](https://huggingface.co/internlm/internlm-xcomposer-vl-7b) as a reference InternLM_XComposer model. + +## Requirements +To run these examples with BigDL-LLM, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information. + +## Example: Multi-turn chat centered around an image using `chat()` API +In the example [chat.py](./chat.py), we show a basic use case for an InternLM_XComposer model to start a multi-turn chat centered around an image using `chat()` API, with BigDL-LLM INT4 optimizations. +### 1. Install +We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#). + +After installing conda, create a Python environment for BigDL-LLM: +```bash +conda create -n llm python=3.9 # recommend to use Python 3.9 +conda activate llm + +pip install --pre --upgrade bigdl-llm[all] # install the latest bigdl-llm nightly build with 'all' option + +pip install accelerate timm==0.4.12 sentencepiece==0.1.99 gradio==3.44.4 markdown2==2.4.10 xlsxwriter==3.1.2 einops # additional package required for InternLM_XComposer to conduct generation + +``` + +### 2. 
Download Model and Replace File
+If you select the InternLM_XComposer model ([internlm/internlm-xcomposer-vl-7b](https://huggingface.co/internlm/internlm-xcomposer-vl-7b)), please note that its code (`modeling_InternLM_XComposer.py`) does not support inference on CPU. To address this issue, we have provided an updated file ([internlm-xcomposer-vl-7b/modeling_InternLM_XComposer.py](./internlm-xcomposer-vl-7b/modeling_InternLM_XComposer.py)), which can be used to conduct inference on CPU.
+
+#### 2.1 Download Model
+You could use the following code to download [internlm/internlm-xcomposer-vl-7b](https://huggingface.co/internlm/internlm-xcomposer-vl-7b) with a specific snapshot id. Please note that the `modeling_InternLM_XComposer.py` file that we provide is based on this specific commit.
+
+```python
+from huggingface_hub import snapshot_download
+
+# for internlm/internlm-xcomposer-vl-7b
+model_path = snapshot_download(repo_id='internlm/internlm-xcomposer-vl-7b',
+                               revision="b06eb0c11653fe1568b6c5614b6b7be407ef8660",
+                               cache_dir="dir/path/where/model/files/are/downloaded")
+print(f'internlm/internlm-xcomposer-vl-7b checkpoint is downloaded to {model_path}')
+```
+
+#### 2.2 Replace `modeling_InternLM_XComposer.py`
+For `internlm/internlm-xcomposer-vl-7b`, you should replace the downloaded `modeling_InternLM_XComposer.py` with [internlm-xcomposer-vl-7b/modeling_InternLM_XComposer.py](./internlm-xcomposer-vl-7b/modeling_InternLM_XComposer.py).
+
+
+### 3. Run
+After setting up the Python environment, you could run the example following the steps below.
+
+> **Note**: When loading the model in 4-bit, BigDL-LLM converts linear layers in the model into INT4 format. In theory, a *X*B model saved in 16-bit requires approximately 2*X* GB of memory for loading, and ~0.5*X* GB memory for further inference.
+>
+> Please select the appropriate size of the InternLM_XComposer model based on the capabilities of your machine.
+
+#### 3.1 Client
+On client Windows machines, it is recommended to run directly with full utilization of all cores:
+```powershell
+python ./chat.py --image-path demo.jpg
+```
+More information about arguments can be found in the [Arguments Info](#33-arguments-info) section. The expected output can be found in the [Sample Chat](#34-sample-chat) section.
+
+#### 3.2 Server
+For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and run the example with all the physical cores of a single socket.
+
+E.g. on Linux,
+```bash
+# set BigDL-Nano env variables
+source bigdl-nano-init
+
+# e.g. for a server with 48 cores per socket
+export OMP_NUM_THREADS=48
+numactl -C 0-47 -m 0 python ./chat.py --image-path demo.jpg
+```
+More information about arguments can be found in the [Arguments Info](#33-arguments-info) section. The expected output can be found in the [Sample Chat](#34-sample-chat) section.
+
+#### 3.3 Arguments Info
+In the example, several arguments can be passed to satisfy your requirements. A full invocation that sets all of them explicitly is shown after the list.
+
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the InternLM_XComposer model (e.g. `internlm/internlm-xcomposer-vl-7b`) to be downloaded, or the path to the huggingface checkpoint folder. The default value is `'internlm/internlm-xcomposer-vl-7b'`.
+- `--image-path IMAGE_PATH`: argument defining the input image that the chat will focus on. It is required and should be a local path (not a URL).
+- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. The default value is `512`.
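+
+For reference, the invocation below sets all three arguments explicitly. The values shown are just the defaults and the sample image used in this README; in practice you would typically point `--repo-id-or-model-path` to the local checkpoint folder prepared in Section 2 so that the replaced `modeling_InternLM_XComposer.py` is picked up:
+```bash
+# `path/to/internlm-xcomposer-vl-7b` is a placeholder for the local checkpoint folder
+# (the `model_path` printed by the download snippet in Section 2.1)
+python ./chat.py --repo-id-or-model-path path/to/internlm-xcomposer-vl-7b --image-path demo.jpg --n-predict 512
+```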
+
+
+#### 3.4 Sample Chat
+#### [internlm/internlm-xcomposer-vl-7b](https://huggingface.co/internlm/internlm-xcomposer-vl-7b)
+
+```log
+User: 这是什么?
+Bot: bus
+User: 它可以用来干什么
+Bot: transport people
+```
+
+The sample input image (which is fetched from the [COCO dataset](https://cocodataset.org/#explore?id=178242)) is:
+
+[demo.jpg](https://cocodataset.org/#explore?id=178242)
+
+
\ No newline at end of file
diff --git a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm-xcomposer/chat.py b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm-xcomposer/chat.py
new file mode 100644
index 00000000..dd824043
--- /dev/null
+++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm-xcomposer/chat.py
@@ -0,0 +1,61 @@
+#
+# Copyright 2016 The BigDL Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from bigdl.llm.transformers import AutoModelForCausalLM
+from transformers import AutoTokenizer
+from transformers.generation import GenerationConfig
+import torch
+import time
+import os
+import argparse
+from bigdl.llm import optimize_model
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description='Predict Tokens using `chat()` API for InternLM-XComposer model')
+    parser.add_argument('--repo-id-or-model-path', type=str, default="internlm/internlm-xcomposer-vl-7b",
+                        help='The huggingface repo id for the InternLM-XComposer model to be downloaded'
+                             ', or the path to the huggingface checkpoint folder')
+    parser.add_argument('--image-path', type=str, required=True,
+                        help='Image path for the input image that the chat will focus on')
+    parser.add_argument('--n-predict', type=int, default=512, help='Max tokens to predict')
+
+    args = parser.parse_args()
+    model_path = args.repo_id_or_model_path
+    image = args.image_path
+
+    # Load model
+    # For successful BigDL-LLM optimization on InternLM-XComposer, skip the 'qkv' module during optimization
+    model = AutoModelForCausalLM.from_pretrained(model_path, device='cpu', load_in_4bit=True,
+                                                 trust_remote_code=True, modules_to_not_convert=['qkv'])
+
+    # Load tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+    model.tokenizer = tokenizer
+
+    history = None
+    while True:
+        try:
+            user_input = input("User: ")
+        except EOFError:
+            user_input = ""
+        if not user_input:
+            print("exit...")
+            break
+
+        response, history = model.chat(text=user_input, image=image, history=history)
+        print(f'Bot: {response}')
+        image = None
+
diff --git a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm-xcomposer/internlm-xcomposer-vl-7b/modeling_InternLM_XComposer.py b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm-xcomposer/internlm-xcomposer-vl-7b/modeling_InternLM_XComposer.py
new file mode 100644
index 00000000..3ada3efe
--- /dev/null
+++ 
b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/internlm-xcomposer/internlm-xcomposer-vl-7b/modeling_InternLM_XComposer.py @@ -0,0 +1,284 @@ +# +# Copyright 2016 The BigDL Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# =========================================================================== +# +# This file is adapted from +# https://huggingface.co/internlm/internlm-xcomposer-vl-7b/blob/b06eb0c11653fe1568b6c5614b6b7be407ef8660/modeling_InternLM_XComposer.py +# +# Apache 2.0 license + +# We change the dtype from float16 to float32 to enable inference on CPU. + +import copy +import os +import sys + +dir_path = os.path.dirname(os.path.realpath(__file__)) +sys.path.insert(0, dir_path) + +import contextlib + +import torch.utils.checkpoint +from torch.nn import LayerNorm +from torchvision import transforms +from torchvision.transforms.functional import InterpolationMode +from PIL import Image + +from .modeling_perceive_sampler import BertConfig, BertLMHeadModel +from .modeling_vit import * +from .modeling_InternLM import * +from .modeling_utils import * + +from transformers.utils import logging +logger = logging.get_logger(__name__) + + +class InternLMXComposerForCausalLM(PreTrainedModel): + config_class = InternLMXComposerConfig + _auto_class = "AutoModelForCausalLM" + + gen_config = dict( + num_beams=5, + do_sample=False, + min_length=1, + repetition_penalty=1.5, + length_penalty=1.0, + temperature=1.0, + max_new_tokens=200, + ) + + def __init__(self, config): + super().__init__(config) + + print('Init VIT ... ', end='') + # self.visual_encoder = create_eva_vit_g() + self.visual_encoder = create_eva_vit_g(precision="fp32") + self.ln_vision = LayerNorm(self.visual_encoder.num_features) + print('Done') + + print('Init Perceive Sampler ... ', end='') + with all_logging_disabled(): + self.Qformer, self.query_tokens = self.init_qformer( + config.num_query_token, self.visual_encoder.num_features) + self.Qformer.bert.embeddings.word_embeddings = None + self.Qformer.bert.embeddings.position_embeddings = None + for layer in self.Qformer.bert.encoder.layer: + layer.output = None + layer.intermediate = None + self.Qformer.cls = None + print('Done') + + print('Init InternLM ... 
', end='') + self.flag_image_start = nn.Parameter(torch.zeros([1, 1, 4096])) + self.flag_image_end = nn.Parameter(torch.zeros([1, 1, 4096])) + self.flag_image_start.requires_grad = False + self.flag_image_end.requires_grad = False + + internlm_lora = config.internlm_lora + self.internlm_lora = internlm_lora + setattr(InternLMForCausalLM, 'lora_cfg', internlm_lora) + + if int(torch.__version__[0]) == 1: + # self.internlm_model = InternLMForCausalLM._from_config(config).to( + # torch.float16) + self.internlm_model = InternLMForCausalLM._from_config(config).to( + torch.float32) + else: + assert int(torch.__version__[0]) == 2 + # speed up init llm + with torch.device('meta'): + self.internlm_model = InternLMForCausalLM._from_config(config) + # self.internlm_model.to_empty(device=config.device).to(torch.float16) + self.internlm_model.to_empty(device=config.device).to(torch.float32) + for n, m in self.internlm_model.named_modules(): + if 'lora' in n: + m.float() + + self.internlm_proj = nn.Linear(self.Qformer.config.hidden_size, + self.internlm_model.config.hidden_size) + print('Done') + + self.vis_processor = transforms.Compose([ + transforms.Resize((224, 224), + interpolation=InterpolationMode.BICUBIC), + transforms.ToTensor(), + transforms.Normalize((0.48145466, 0.4578275, 0.40821073), + (0.26862954, 0.26130258, 0.27577711)), + ]) + + self.tokenizer = None + + @property + def eoh(self): + return self.tokenizer.decode(torch.Tensor([103027]), + skip_special_tokens=True) + + @property + def eoa(self): + return self.tokenizer.decode(torch.Tensor([103028]), + skip_special_tokens=True) + + def maybe_autocast(self, dtype=torch.float16): + # if on cpu, don't use autocast + # if on gpu, use autocast with dtype if provided, otherwise use torch.float16 + enable_autocast = self.device != torch.device("cpu") + + if enable_autocast: + return torch.cuda.amp.autocast(dtype=dtype) + else: + return contextlib.nullcontext() + + @classmethod + def init_qformer(cls, + num_query_token, + vision_width, + cross_attention_freq=2, + pretrain=True): + encoder_config = BertConfig() + encoder_config.encoder_width = vision_width + # insert cross-attention layer every other block + encoder_config.add_cross_attention = True + encoder_config.cross_attention_freq = cross_attention_freq + encoder_config.query_length = num_query_token + Qformer = BertLMHeadModel(config=encoder_config) + query_tokens = nn.Parameter( + torch.zeros(1, num_query_token, encoder_config.hidden_size)) + query_tokens.data.normal_(mean=0.0, + std=encoder_config.initializer_range) + return Qformer, query_tokens + + def encode_img(self, image): + if image is None: + return None + if isinstance(image, str): + image = Image.open(image).convert("RGB") + image = self.vis_processor(image).unsqueeze(0).to(self.device) + else: + assert isinstance(image, torch.Tensor) + device = image.device + with self.maybe_autocast(): + image_embeds = self.ln_vision( + self.visual_encoder(image)).to(device) + image_atts = torch.ones(image_embeds.size()[:-1], + dtype=torch.long).to(device) + query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, + -1) + query_output = self.Qformer.bert( + query_embeds=query_tokens, + encoder_hidden_states=image_embeds, + encoder_attention_mask=image_atts, + return_dict=True, + ) + inputs_internlm = self.internlm_proj(query_output.last_hidden_state) + inputs_internlm = torch.cat([ + self.flag_image_start.expand(inputs_internlm.shape[0], -1, -1), + inputs_internlm, + self.flag_image_end.expand(inputs_internlm.shape[0], -1, -1) + ], + 
dim=1) + return inputs_internlm + + def encode_text(self, text, add_special_tokens=False): + text_token_ids = self.tokenizer( + text, + return_tensors='pt', + add_special_tokens=add_special_tokens, + ).input_ids.to(self.device) + text_embeds = self.internlm_model.model.embed_tokens(text_token_ids) + return text_embeds + + def decode_text(self, out_embeds): + out_text = self.tokenizer.batch_decode(out_embeds, + skip_special_tokens=True)[0] + out_text = out_text.split(self.eoa)[0] + return out_text + + def wrap_text(self, user_text, bot_text='', add_special=True): + if add_special: + eoh = self.eoh + else: + eoh = '' + text = f' <|User|>:{user_text} \n{eoh} <|Bot|>:{bot_text}' + return text + + def get_gen_args(self, **kwargs): + new_kargs = copy.deepcopy(self.gen_config) + new_kargs.update(kwargs) + return new_kargs + + def generate(self, text, image=None, **kwargs): + text_embeds = self.encode_text(text) + img_embeds = self.encode_img(image) + prompt_embeds = self.wrap_prompt(text_embeds, img_embeds) + out_embeds = self.internlm_model.generate(inputs_embeds=prompt_embeds, + **self.get_gen_args(**kwargs)) + out_text = self.decode_text(out_embeds) + return out_text + + def chat(self, text, image=None, history=None, **kwargs): + text_embeds = self.encode_text(text) + img_embeds = self.encode_img(image) + prompt_embeds = self.wrap_prompt(text_embeds, + img_embeds, + history=history) + out_embeds = self.internlm_model.generate(inputs_embeds=prompt_embeds, + **self.get_gen_args(**kwargs)) + out_text = self.decode_text(out_embeds) + + # trunc at eoh and eoa + clean_out_text_token_ids = self.tokenizer( + out_text, return_tensors='pt').input_ids.to(self.device) + clean_out_text_embeds = self.internlm_model.model.embed_tokens( + clean_out_text_token_ids) + clean_prompt_embeds = self.wrap_prompt(text_embeds, + img_embeds, + add_special=False) + cur_history = torch.cat([clean_prompt_embeds, clean_out_text_embeds], + dim=1) + if history is None: + history = [] + history.append(cur_history) + return out_text, history + + def wrap_prompt(self, + text_embeds, + img_embeds=None, + history=None, + add_special=True): + if add_special: + prompt_segs = [' <|User|>:', f'\n{self.eoh} <|Bot|>:'] + else: + prompt_segs = [' <|User|>:', ' <|Bot|>:'] # used in wrap history + prompt_seg_embeds = [] + for i, seg in enumerate(prompt_segs): + if history is not None: + add_special_tokens = False + else: + add_special_tokens = i == 0 + seg_embeds = self.encode_text( + seg, add_special_tokens=add_special_tokens) + prompt_seg_embeds.append(seg_embeds) + if img_embeds is None: + img_embeds = text_embeds.new_empty(text_embeds.size(0), 0, + text_embeds.size(-1)) + prompt_seg_embeds = [ + prompt_seg_embeds[0], img_embeds, text_embeds, prompt_seg_embeds[1] + ] + prompt_embeds = torch.cat(prompt_seg_embeds, dim=1) + if history is not None: + prompt_embeds = torch.cat([*history, prompt_embeds], dim=1) + return prompt_embeds diff --git a/python/llm/example/CPU/PyTorch-Models/Model/internlm-xcomposer/README.md b/python/llm/example/CPU/PyTorch-Models/Model/internlm-xcomposer/README.md new file mode 100644 index 00000000..a96db75d --- /dev/null +++ b/python/llm/example/CPU/PyTorch-Models/Model/internlm-xcomposer/README.md @@ -0,0 +1,93 @@ +# InternLM_XComposer +In this directory, you will find examples on how you could use BigDL-LLM `optimize_model` API to accelerate InternLM_XComposer models. 
For illustration purposes, we utilize the [internlm/internlm-xcomposer-vl-7b](https://huggingface.co/internlm/internlm-xcomposer-vl-7b) as a reference InternLM_XComposer model.
+
+## Requirements
+To run these examples with BigDL-LLM, we have some recommended requirements for your machine, please refer to [here](../README.md#recommended-requirements) for more information.
+
+## Example: Multi-turn chat centered around an image using `chat()` API
+In the example [chat.py](./chat.py), we show a basic use case for an InternLM_XComposer model to start a multi-turn chat centered around an image using `chat()` API, with the BigDL-LLM `optimize_model` API.
+### 1. Install
+We suggest using conda to manage the Python environment. For more information about conda installation, please refer to [here](https://docs.conda.io/en/latest/miniconda.html#).
+
+After installing conda, create a Python environment for BigDL-LLM:
+```bash
+conda create -n llm python=3.9 # recommend to use Python 3.9
+conda activate llm
+
+pip install --pre --upgrade bigdl-llm[all] # install the latest bigdl-llm nightly build with 'all' option
+
+pip install accelerate timm==0.4.12 sentencepiece==0.1.99 gradio==3.44.4 markdown2==2.4.10 xlsxwriter==3.1.2 einops # additional package required for InternLM_XComposer to conduct generation
+
+```
+
+### 2. Download Model and Replace File
+If you select the InternLM_XComposer model ([internlm/internlm-xcomposer-vl-7b](https://huggingface.co/internlm/internlm-xcomposer-vl-7b)), please note that its code (`modeling_InternLM_XComposer.py`) does not support inference on CPU. To address this issue, we have provided an updated file ([internlm-xcomposer-vl-7b/modeling_InternLM_XComposer.py](./internlm-xcomposer-vl-7b/modeling_InternLM_XComposer.py)), which can be used to conduct inference on CPU.
+
+#### 2.1 Download Model
+You could use the following code to download [internlm/internlm-xcomposer-vl-7b](https://huggingface.co/internlm/internlm-xcomposer-vl-7b) with a specific snapshot id. Please note that the `modeling_InternLM_XComposer.py` file that we provide is based on this specific commit.
+
+```python
+from huggingface_hub import snapshot_download
+
+# for internlm/internlm-xcomposer-vl-7b
+model_path = snapshot_download(repo_id='internlm/internlm-xcomposer-vl-7b',
+                               revision="b06eb0c11653fe1568b6c5614b6b7be407ef8660",
+                               cache_dir="dir/path/where/model/files/are/downloaded")
+print(f'internlm/internlm-xcomposer-vl-7b checkpoint is downloaded to {model_path}')
+```
+
+#### 2.2 Replace `modeling_InternLM_XComposer.py`
+For `internlm/internlm-xcomposer-vl-7b`, you should replace the downloaded `modeling_InternLM_XComposer.py` with [internlm-xcomposer-vl-7b/modeling_InternLM_XComposer.py](./internlm-xcomposer-vl-7b/modeling_InternLM_XComposer.py), as sketched below.
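+
+One way to perform the replacement is to copy the provided file over the one in the downloaded checkpoint folder. This is only a sketch; `<model_path>` below is a placeholder for the local folder printed by the download snippet in Section 2.1:
+```bash
+# <model_path> is a placeholder for the local checkpoint folder printed by the download snippet above
+cp ./internlm-xcomposer-vl-7b/modeling_InternLM_XComposer.py <model_path>/modeling_InternLM_XComposer.py
+```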
+
+
+### 3. Run
+After setting up the Python environment, you could run the example following the steps below.
+
+> **Note**: When loading the model in 4-bit, BigDL-LLM converts linear layers in the model into INT4 format. In theory, a *X*B model saved in 16-bit requires approximately 2*X* GB of memory for loading, and ~0.5*X* GB memory for further inference.
+>
+> Please select the appropriate size of the InternLM_XComposer model based on the capabilities of your machine.
+
+#### 3.1 Client
+On client Windows machines, it is recommended to run directly with full utilization of all cores:
+```powershell
+python ./chat.py --image-path demo.jpg
+```
+More information about arguments can be found in the [Arguments Info](#33-arguments-info) section. The expected output can be found in the [Sample Chat](#34-sample-chat) section.
+
+#### 3.2 Server
+For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and run the example with all the physical cores of a single socket.
+
+E.g. on Linux,
+```bash
+# set BigDL-Nano env variables
+source bigdl-nano-init
+
+# e.g. for a server with 48 cores per socket
+export OMP_NUM_THREADS=48
+numactl -C 0-47 -m 0 python ./chat.py --image-path demo.jpg
+```
+More information about arguments can be found in the [Arguments Info](#33-arguments-info) section. The expected output can be found in the [Sample Chat](#34-sample-chat) section.
+
+#### 3.3 Arguments Info
+In the example, several arguments can be passed to satisfy your requirements. A full invocation that sets all of them explicitly is shown after the list.
+
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the InternLM_XComposer model (e.g. `internlm/internlm-xcomposer-vl-7b`) to be downloaded, or the path to the huggingface checkpoint folder. The default value is `'internlm/internlm-xcomposer-vl-7b'`.
+- `--image-path IMAGE_PATH`: argument defining the input image that the chat will focus on. It is required and should be a local path (not a URL).
+- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. The default value is `512`.
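+
+For reference, the invocation below sets all three arguments explicitly. The values shown are just the defaults and the sample image used in this README; adjust them to your own setup as needed:
+```bash
+# `path/to/internlm-xcomposer-vl-7b` is a placeholder for the local checkpoint folder
+# (the `model_path` printed by the download snippet in Section 2.1)
+python ./chat.py --repo-id-or-model-path path/to/internlm-xcomposer-vl-7b --image-path demo.jpg --n-predict 512
+```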
+
+
+#### 3.4 Sample Chat
+#### [internlm/internlm-xcomposer-vl-7b](https://huggingface.co/internlm/internlm-xcomposer-vl-7b)
+
+```log
+User: 这是什么?
+Bot: bus
+User: 它可以用来干什么
+Bot: transport people
+```
+
+The sample input image (which is fetched from the [COCO dataset](https://cocodataset.org/#explore?id=178242)) is:
+
+[demo.jpg](https://cocodataset.org/#explore?id=178242)
+
+
\ No newline at end of file
diff --git a/python/llm/example/CPU/PyTorch-Models/Model/internlm-xcomposer/chat.py b/python/llm/example/CPU/PyTorch-Models/Model/internlm-xcomposer/chat.py
new file mode 100644
index 00000000..3463eb3a
--- /dev/null
+++ b/python/llm/example/CPU/PyTorch-Models/Model/internlm-xcomposer/chat.py
@@ -0,0 +1,64 @@
+#
+# Copyright 2016 The BigDL Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from transformers.generation import GenerationConfig
+import torch
+import time
+import os
+import argparse
+from bigdl.llm import optimize_model
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description='Predict Tokens using `chat()` API for InternLM-XComposer model')
+    parser.add_argument('--repo-id-or-model-path', type=str, default="internlm/internlm-xcomposer-vl-7b",
+                        help='The huggingface repo id for the InternLM-XComposer model to be downloaded'
+                             ', or the path to the huggingface checkpoint folder')
+    parser.add_argument('--image-path', type=str, required=True,
+                        help='Image path for the input image that the chat will focus on')
+    parser.add_argument('--n-predict', type=int, default=512, help='Max tokens to predict')
+
+    args = parser.parse_args()
+    model_path = args.repo_id_or_model_path
+    image = args.image_path
+
+    # Load model
+    model = AutoModelForCausalLM.from_pretrained(model_path, device='cpu', trust_remote_code=True)
+
+    # With only one line to enable BigDL-LLM optimization on model
+    # For successful BigDL-LLM optimization on InternLM-XComposer, skip the 'qkv' module during optimization
+    model = optimize_model(model,
+                           low_bit='sym_int4',
+                           modules_to_not_convert=['qkv'])
+
+    # Load tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+    model.tokenizer = tokenizer
+
+    history = None
+    while True:
+        try:
+            user_input = input("User: ")
+        except EOFError:
+            user_input = ""
+        if not user_input:
+            print("exit...")
+            break
+
+        response, history = model.chat(text=user_input, image=image, history=history)
+        print(f'Bot: {response}')
+        image = None
+
diff --git a/python/llm/example/CPU/PyTorch-Models/Model/internlm-xcomposer/internlm-xcomposer-vl-7b/modeling_InternLM_XComposer.py b/python/llm/example/CPU/PyTorch-Models/Model/internlm-xcomposer/internlm-xcomposer-vl-7b/modeling_InternLM_XComposer.py
new file mode 100644
index 00000000..3ada3efe
--- /dev/null
+++ b/python/llm/example/CPU/PyTorch-Models/Model/internlm-xcomposer/internlm-xcomposer-vl-7b/modeling_InternLM_XComposer.py
@@ -0,0 +1,284 @@
+#
+# Copyright 2016 The BigDL Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# ===========================================================================
+#
+# This file is adapted from
+# https://huggingface.co/internlm/internlm-xcomposer-vl-7b/blob/b06eb0c11653fe1568b6c5614b6b7be407ef8660/modeling_InternLM_XComposer.py
+#
+# Apache 2.0 license
+
+# We change the dtype from float16 to float32 to enable inference on CPU.
+ +import copy +import os +import sys + +dir_path = os.path.dirname(os.path.realpath(__file__)) +sys.path.insert(0, dir_path) + +import contextlib + +import torch.utils.checkpoint +from torch.nn import LayerNorm +from torchvision import transforms +from torchvision.transforms.functional import InterpolationMode +from PIL import Image + +from .modeling_perceive_sampler import BertConfig, BertLMHeadModel +from .modeling_vit import * +from .modeling_InternLM import * +from .modeling_utils import * + +from transformers.utils import logging +logger = logging.get_logger(__name__) + + +class InternLMXComposerForCausalLM(PreTrainedModel): + config_class = InternLMXComposerConfig + _auto_class = "AutoModelForCausalLM" + + gen_config = dict( + num_beams=5, + do_sample=False, + min_length=1, + repetition_penalty=1.5, + length_penalty=1.0, + temperature=1.0, + max_new_tokens=200, + ) + + def __init__(self, config): + super().__init__(config) + + print('Init VIT ... ', end='') + # self.visual_encoder = create_eva_vit_g() + self.visual_encoder = create_eva_vit_g(precision="fp32") + self.ln_vision = LayerNorm(self.visual_encoder.num_features) + print('Done') + + print('Init Perceive Sampler ... ', end='') + with all_logging_disabled(): + self.Qformer, self.query_tokens = self.init_qformer( + config.num_query_token, self.visual_encoder.num_features) + self.Qformer.bert.embeddings.word_embeddings = None + self.Qformer.bert.embeddings.position_embeddings = None + for layer in self.Qformer.bert.encoder.layer: + layer.output = None + layer.intermediate = None + self.Qformer.cls = None + print('Done') + + print('Init InternLM ... ', end='') + self.flag_image_start = nn.Parameter(torch.zeros([1, 1, 4096])) + self.flag_image_end = nn.Parameter(torch.zeros([1, 1, 4096])) + self.flag_image_start.requires_grad = False + self.flag_image_end.requires_grad = False + + internlm_lora = config.internlm_lora + self.internlm_lora = internlm_lora + setattr(InternLMForCausalLM, 'lora_cfg', internlm_lora) + + if int(torch.__version__[0]) == 1: + # self.internlm_model = InternLMForCausalLM._from_config(config).to( + # torch.float16) + self.internlm_model = InternLMForCausalLM._from_config(config).to( + torch.float32) + else: + assert int(torch.__version__[0]) == 2 + # speed up init llm + with torch.device('meta'): + self.internlm_model = InternLMForCausalLM._from_config(config) + # self.internlm_model.to_empty(device=config.device).to(torch.float16) + self.internlm_model.to_empty(device=config.device).to(torch.float32) + for n, m in self.internlm_model.named_modules(): + if 'lora' in n: + m.float() + + self.internlm_proj = nn.Linear(self.Qformer.config.hidden_size, + self.internlm_model.config.hidden_size) + print('Done') + + self.vis_processor = transforms.Compose([ + transforms.Resize((224, 224), + interpolation=InterpolationMode.BICUBIC), + transforms.ToTensor(), + transforms.Normalize((0.48145466, 0.4578275, 0.40821073), + (0.26862954, 0.26130258, 0.27577711)), + ]) + + self.tokenizer = None + + @property + def eoh(self): + return self.tokenizer.decode(torch.Tensor([103027]), + skip_special_tokens=True) + + @property + def eoa(self): + return self.tokenizer.decode(torch.Tensor([103028]), + skip_special_tokens=True) + + def maybe_autocast(self, dtype=torch.float16): + # if on cpu, don't use autocast + # if on gpu, use autocast with dtype if provided, otherwise use torch.float16 + enable_autocast = self.device != torch.device("cpu") + + if enable_autocast: + return torch.cuda.amp.autocast(dtype=dtype) + else: + return 
contextlib.nullcontext() + + @classmethod + def init_qformer(cls, + num_query_token, + vision_width, + cross_attention_freq=2, + pretrain=True): + encoder_config = BertConfig() + encoder_config.encoder_width = vision_width + # insert cross-attention layer every other block + encoder_config.add_cross_attention = True + encoder_config.cross_attention_freq = cross_attention_freq + encoder_config.query_length = num_query_token + Qformer = BertLMHeadModel(config=encoder_config) + query_tokens = nn.Parameter( + torch.zeros(1, num_query_token, encoder_config.hidden_size)) + query_tokens.data.normal_(mean=0.0, + std=encoder_config.initializer_range) + return Qformer, query_tokens + + def encode_img(self, image): + if image is None: + return None + if isinstance(image, str): + image = Image.open(image).convert("RGB") + image = self.vis_processor(image).unsqueeze(0).to(self.device) + else: + assert isinstance(image, torch.Tensor) + device = image.device + with self.maybe_autocast(): + image_embeds = self.ln_vision( + self.visual_encoder(image)).to(device) + image_atts = torch.ones(image_embeds.size()[:-1], + dtype=torch.long).to(device) + query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, + -1) + query_output = self.Qformer.bert( + query_embeds=query_tokens, + encoder_hidden_states=image_embeds, + encoder_attention_mask=image_atts, + return_dict=True, + ) + inputs_internlm = self.internlm_proj(query_output.last_hidden_state) + inputs_internlm = torch.cat([ + self.flag_image_start.expand(inputs_internlm.shape[0], -1, -1), + inputs_internlm, + self.flag_image_end.expand(inputs_internlm.shape[0], -1, -1) + ], + dim=1) + return inputs_internlm + + def encode_text(self, text, add_special_tokens=False): + text_token_ids = self.tokenizer( + text, + return_tensors='pt', + add_special_tokens=add_special_tokens, + ).input_ids.to(self.device) + text_embeds = self.internlm_model.model.embed_tokens(text_token_ids) + return text_embeds + + def decode_text(self, out_embeds): + out_text = self.tokenizer.batch_decode(out_embeds, + skip_special_tokens=True)[0] + out_text = out_text.split(self.eoa)[0] + return out_text + + def wrap_text(self, user_text, bot_text='', add_special=True): + if add_special: + eoh = self.eoh + else: + eoh = '' + text = f' <|User|>:{user_text} \n{eoh} <|Bot|>:{bot_text}' + return text + + def get_gen_args(self, **kwargs): + new_kargs = copy.deepcopy(self.gen_config) + new_kargs.update(kwargs) + return new_kargs + + def generate(self, text, image=None, **kwargs): + text_embeds = self.encode_text(text) + img_embeds = self.encode_img(image) + prompt_embeds = self.wrap_prompt(text_embeds, img_embeds) + out_embeds = self.internlm_model.generate(inputs_embeds=prompt_embeds, + **self.get_gen_args(**kwargs)) + out_text = self.decode_text(out_embeds) + return out_text + + def chat(self, text, image=None, history=None, **kwargs): + text_embeds = self.encode_text(text) + img_embeds = self.encode_img(image) + prompt_embeds = self.wrap_prompt(text_embeds, + img_embeds, + history=history) + out_embeds = self.internlm_model.generate(inputs_embeds=prompt_embeds, + **self.get_gen_args(**kwargs)) + out_text = self.decode_text(out_embeds) + + # trunc at eoh and eoa + clean_out_text_token_ids = self.tokenizer( + out_text, return_tensors='pt').input_ids.to(self.device) + clean_out_text_embeds = self.internlm_model.model.embed_tokens( + clean_out_text_token_ids) + clean_prompt_embeds = self.wrap_prompt(text_embeds, + img_embeds, + add_special=False) + cur_history = torch.cat([clean_prompt_embeds, 
clean_out_text_embeds], + dim=1) + if history is None: + history = [] + history.append(cur_history) + return out_text, history + + def wrap_prompt(self, + text_embeds, + img_embeds=None, + history=None, + add_special=True): + if add_special: + prompt_segs = [' <|User|>:', f'\n{self.eoh} <|Bot|>:'] + else: + prompt_segs = [' <|User|>:', ' <|Bot|>:'] # used in wrap history + prompt_seg_embeds = [] + for i, seg in enumerate(prompt_segs): + if history is not None: + add_special_tokens = False + else: + add_special_tokens = i == 0 + seg_embeds = self.encode_text( + seg, add_special_tokens=add_special_tokens) + prompt_seg_embeds.append(seg_embeds) + if img_embeds is None: + img_embeds = text_embeds.new_empty(text_embeds.size(0), 0, + text_embeds.size(-1)) + prompt_seg_embeds = [ + prompt_seg_embeds[0], img_embeds, text_embeds, prompt_seg_embeds[1] + ] + prompt_embeds = torch.cat(prompt_seg_embeds, dim=1) + if history is not None: + prompt_embeds = torch.cat([*history, prompt_embeds], dim=1) + return prompt_embeds