From 40a7d2b4f037869de648aaec7ee3a4d8f548ad7e Mon Sep 17 00:00:00 2001
From: Shaojun Liu <61072813+liu-shaojun@users.noreply.github.com>
Date: Thu, 26 Dec 2024 15:23:32 +0800
Subject: [PATCH] Consolidated C-Eval Benchmark Guide for Single-GPU and Multi-GPU Environments (#12618)

* run c-eval on multi-GPUs

* Update README.md
---
 python/llm/dev/benchmark/ceval/README.md | 138 +++++++++++++++++++++--
 1 file changed, 127 insertions(+), 11 deletions(-)

diff --git a/python/llm/dev/benchmark/ceval/README.md b/python/llm/dev/benchmark/ceval/README.md
index 97771ba5..11863da6 100644
--- a/python/llm/dev/benchmark/ceval/README.md
+++ b/python/llm/dev/benchmark/ceval/README.md
@@ -1,23 +1,30 @@
-## C-Eval Benchmark Test
+## C-Eval Benchmark Test Guide
 
-C-Eval benchmark test allows users to test on [C-Eval](https://cevalbenchmark.com) datasets, which is a multi-level multi-discipline chinese evaluation suite for foundation models. It consists of 13948 multi-choice questions spanning 52 diverse disciplines and four difficulty levels. Please check [paper](https://arxiv.org/abs/2305.08322) and [github repo](https://github.com/hkust-nlp/ceval) for more information.
+This guide provides instructions for running the C-Eval benchmark test in both single-GPU and multi-GPU environments. [C-Eval](https://cevalbenchmark.com) is a comprehensive multi-level, multi-discipline Chinese evaluation suite for foundational models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. For more details, see the [C-Eval paper](https://arxiv.org/abs/2305.08322) and [GitHub repository](https://github.com/hkust-nlp/ceval).
 
-### Download dataset
-Please download and unzip the dataset for evaluation.
-```shell
+---
+
+### Single-GPU Environment
+
+#### 1. Download Dataset
+
+Download and unzip the dataset for evaluation:
+```bash
 wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
 mkdir data
 mv ceval-exam.zip data
 cd data; unzip ceval-exam.zip
 ```
 
-### Run
-You can run evaluation with following command.
-```shell
+#### 2. Run Evaluation
+
+Use the following command to run the evaluation:
+```bash
 bash run.sh
 ```
-+ `run.sh`
-```shell
+
+Contents of `run.sh`:
+```bash
 export IPEX_LLM_LAST_LM_HEAD=0
 python eval.py \
     --model_path "path to model" \
@@ -29,4 +36,113 @@ python eval.py \
 
 > **Note**
 >
-> `eval_type` there is two types of evaluation, first type is `validation`, which runs on validation dataset and output evaluation scores. The second type is `test`, which runs on test dataset and output `submission.json` file for submission on https://cevalbenchmark.com to get the evaluation score.
+> - `eval_type`: There are two types of evaluations:
+>   - `validation`: Runs on the validation dataset and outputs evaluation scores.
+>   - `test`: Runs on the test dataset and outputs a `submission.json` file for submission on [C-Eval](https://cevalbenchmark.com) to get evaluation scores.
+
+---
+
+### Multi-GPU Environment
+
+#### 1. Prepare Environment
+
+1. **Set Docker Image and Container Name**:
+   ```bash
+   export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
+   export CONTAINER_NAME=ceval-benchmark
+   ```
+
+2. **Start Docker Container**:
+   ```bash
+   docker run -td \
+       --privileged \
+       --net=host \
+       --device=/dev/dri \
+       --name=$CONTAINER_NAME \
+       -v /home/intel/LLM:/llm/models/ \
+       -e no_proxy=localhost,127.0.0.1 \
+       -e http_proxy=$HTTP_PROXY \
+       -e https_proxy=$HTTPS_PROXY \
+       --shm-size="16g" \
+       $DOCKER_IMAGE
+   ```
+
+3. **Enter the Container**:
+   ```bash
+   docker exec -it $CONTAINER_NAME bash
+   ```
+
+#### 2. Configure `lm-evaluation-harness`
+
+1. **Clone the Repository**:
+   ```bash
+   git clone https://github.com/EleutherAI/lm-evaluation-harness
+   cd lm-evaluation-harness
+   ```
+
+2. **Update Multi-GPU Support File**:
+   Update `lm_eval/models/vllm_causallms.py` based on the following link:
+   [Update Multi-GPU Support File](https://github.com/EleutherAI/lm-evaluation-harness/compare/main...liu-shaojun:lm-evaluation-harness:multi-arc?expand=1)
+
+3. **Install Dependencies**:
+   ```bash
+   pip install -e .
+   ```
+
+#### 3. Configure Environment Variables
+
+Set environment variables required for multi-GPU execution:
+```bash
+export CCL_WORKER_COUNT=2
+export CCL_ATL_TRANSPORT=ofi
+export CCL_ZE_IPC_EXCHANGE=sockets
+export CCL_ATL_SHM=1
+export CCL_SAME_STREAM=1
+export CCL_BLOCKING_WAIT=0
+
+export SYCL_CACHE_PERSISTENT=1
+export FI_PROVIDER=shm
+export USE_XETLA=OFF
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
+export TORCH_LLM_ALLREDUCE=0
+```
+
+Load Intel OneCCL environment variables:
+```bash
+source /opt/intel/1ccl-wks/setvars.sh
+```
+
+#### 4. Run Evaluation
+
+Use the following command to run the C-Eval benchmark:
+```bash
+lm_eval --model vllm \
+    --model_args pretrained=/llm/models/CodeLlama-34b/,dtype=float16,max_model_len=2048,device=xpu,load_in_low_bit=fp8,tensor_parallel_size=4,distributed_executor_backend="ray",gpu_memory_utilization=0.90,trust_remote_code=True \
+    --tasks ceval-valid \
+    --batch_size 2 \
+    --num_fewshot 0 \
+    --output_path c-eval-result
+```
+
+#### 5. Notes
+
+- **Model and Parameter Adjustments**:
+  - **`pretrained`**: Replace with the desired model path, e.g., `/llm/models/CodeLlama-7b/`.
+  - **`load_in_low_bit`**: Set to `fp8` or other precision options based on hardware and task requirements.
+  - **`tensor_parallel_size`**: Adjust based on the number of GPUs and memory. Recommended to match the GPU count.
+  - **`batch_size`**: Increase to accelerate testing, but ensure it does not cause OOM errors. Recommended values are `2` or `3`.
+  - **`num_fewshot`**: Specify the number of few-shot examples. Default is `0`. Increasing this value can improve model contextual understanding but may significantly increase input length and runtime.
+
+- **Logging**:
+  To log both to the console and a file, use:
+  ```bash
+  lm_eval --model vllm ... | tee c-eval.log
+  ```
+
+- **Container Debugging**:
+  Ensure the paths for the model and tasks are correctly set, e.g., check if `/llm/models/` is properly mounted in the container.
+
+---
+
+By following the above steps, you can successfully run the C-Eval benchmark in both single-GPU and multi-GPU environments.
+
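
Note on the parameter adjustments in section 5: the `--model_args` value is one long comma-separated string, so it is easy to mistype when switching models or GPU counts. The sketch below is not part of the patch. `build_model_args` is a hypothetical helper name, and the fixed values (`dtype`, `max_model_len`, `device`, `gpu_memory_utilization`, and so on) are copied from the command in section 4. The quotes around `ray` are left out because the shell strips them from the original command anyway.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the patch): assembles the --model_args
# string for the lm_eval command shown in section 4 of the README.
# Arguments: model path, tensor_parallel_size, load_in_low_bit (default fp8).
build_model_args() {
    local model_path="$1"
    local tp_size="$2"
    local low_bit="${3:-fp8}"

    # All other key=value pairs are taken verbatim from the README command.
    printf 'pretrained=%s,dtype=float16,max_model_len=2048,device=xpu,load_in_low_bit=%s,tensor_parallel_size=%s,distributed_executor_backend=ray,gpu_memory_utilization=0.90,trust_remote_code=True\n' \
        "$model_path" "$low_bit" "$tp_size"
}

# Print the resulting command instead of running it, so it can be reviewed first.
MODEL_ARGS="$(build_model_args /llm/models/CodeLlama-7b/ 2 fp8)"
echo lm_eval --model vllm --model_args "$MODEL_ARGS" \
    --tasks ceval-valid --batch_size 2 --num_fewshot 0 --output_path c-eval-result
```

Keeping the adjustable values as function arguments means only the model path, the GPU count, and the precision change between runs, while the rest of the string stays identical to the documented command.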