	C-Eval Benchmark Test Guide
This guide provides instructions for running the C-Eval benchmark in both single-GPU and multi-GPU environments. C-Eval is a comprehensive multi-level, multi-discipline Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. For more details, see the C-Eval paper and GitHub repository.
Single-GPU Environment
1. Download Dataset
Download and unzip the dataset for evaluation:
wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
mkdir data
mv ceval-exam.zip data
cd data; unzip ceval-exam.zip
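If the download succeeded, the archive should unpack into dev, val, and test folders of per-subject CSV files (a quick sanity check, assuming the standard C-Eval archive layout):
ls            # expect dev/, val/, and test/
ls val | head # per-subject files such as accountant_val.csv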
2. Run Evaluation
Use the following command to run the evaluation:
bash run.sh
Contents of run.sh:
export IPEX_LLM_LAST_LM_HEAD=0
python eval.py \
    --model_path "path to model" \
    --eval_type validation \
    --device xpu \
    --eval_data_path data \
    --qtype sym_int4
Note
eval_type: There are two types of evaluations:
- validation: Runs on the validation dataset and outputs evaluation scores.
- test: Runs on the test dataset and outputs a submission.json file, which can be submitted to C-Eval to obtain evaluation scores.
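For example, to generate a submission file rather than local scores, keep the same flags and change only the evaluation type (a sketch based on the run.sh above):
export IPEX_LLM_LAST_LM_HEAD=0
python eval.py \
    --model_path "path to model" \
    --eval_type test \
    --device xpu \
    --eval_data_path data \
    --qtype sym_int4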
Multi-GPU Environment
1. Prepare Environment
- Set Docker Image and Container Name:
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
export CONTAINER_NAME=ceval-benchmark
- Start Docker Container:
docker run -td \
  --privileged \
  --net=host \
  --device=/dev/dri \
  --name=$CONTAINER_NAME \
  -v /home/intel/LLM:/llm/models/ \
  -e no_proxy=localhost,127.0.0.1 \
  -e http_proxy=$HTTP_PROXY \
  -e https_proxy=$HTTPS_PROXY \
  --shm-size="16g" \
  $DOCKER_IMAGE
- Enter the Container:
docker exec -it $CONTAINER_NAME bash
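If the exec fails, confirm the container is actually running:
docker ps --filter name=$CONTAINER_NAME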
2. Configure lm-evaluation-harness
- Clone the Repository:
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
- Update Multi-GPU Support File: Update lm_eval/models/vllm_causallms.py based on the following link: Update Multi-GPU Support File
- Install Dependencies:
pip install -e .
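After installation, a quick way to confirm the C-Eval tasks are registered (assuming a harness version that supports task listing):
lm_eval --tasks list | grep ceval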
3. Configure Environment Variables
Set environment variables required for multi-GPU execution:
export CCL_WORKER_COUNT=2
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0
export SYCL_CACHE_PERSISTENT=1
export FI_PROVIDER=shm
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
Load Intel OneCCL environment variables:
source /opt/intel/1ccl-wks/setvars.sh
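Before launching the benchmark, it can help to verify that all GPUs are visible inside the container. sycl-ls is part of the oneAPI runtime, which this image is assumed to include:
sycl-ls   # expect one Level Zero entry per GPU (or per tile)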
4. Run Evaluation
Use the following command to run the C-Eval benchmark:
lm_eval --model vllm \
  --model_args pretrained=/llm/models/CodeLlama-34b/,dtype=float16,max_model_len=2048,device=xpu,load_in_low_bit=fp8,tensor_parallel_size=4,distributed_executor_backend="ray",gpu_memory_utilization=0.90,trust_remote_code=True \
  --tasks ceval-valid \
  --batch_size 2 \
  --num_fewshot 0 \
  --output_path c-eval-result
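lm_eval writes its scores as JSON under the directory given by --output_path; the exact file layout varies by harness version, so locate the file first:
find c-eval-result -name "*.json"
# then pretty-print the reported file, e.g.:
# python -m json.tool c-eval-result/<subdir>/results_<timestamp>.json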
5. Notes
- Model and Parameter Adjustments (an adjusted example invocation follows this list):
  - pretrained: Replace with the desired model path, e.g., /llm/models/CodeLlama-7b/.
  - load_in_low_bit: Set to fp8 or another precision option based on hardware and task requirements.
  - tensor_parallel_size: Adjust based on the number of GPUs and available memory; matching the GPU count is recommended.
  - batch_size: Increase to speed up testing, but keep it small enough to avoid OOM errors; values of 2 or 3 are recommended.
  - num_fewshot: The number of few-shot examples; the default is 0. Increasing it can improve the model's contextual understanding but may significantly increase input length and runtime.
- Logging: To log both to the console and a file, use:
lm_eval --model vllm ... | tee c-eval.log
- Container Debugging: Ensure the paths for the model and tasks are set correctly, e.g., check that /llm/models/ is properly mounted in the container (a quick check appears below).
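For instance, an invocation adjusted per the notes above for a hypothetical two-GPU run of a 7B model (the model path and values here are illustrative assumptions, not recommendations for your hardware):
lm_eval --model vllm \
  --model_args pretrained=/llm/models/CodeLlama-7b/,dtype=float16,max_model_len=2048,device=xpu,load_in_low_bit=fp8,tensor_parallel_size=2,distributed_executor_backend="ray",gpu_memory_utilization=0.90,trust_remote_code=True \
  --tasks ceval-valid \
  --batch_size 3 \
  --num_fewshot 5 \
  --output_path c-eval-result-7b 2>&1 | tee c-eval.log
And to verify the mount mentioned under Container Debugging, from the host:
docker exec $CONTAINER_NAME ls /llm/models/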
By following the above steps, you can successfully run the C-Eval benchmark in both single-GPU and multi-GPU environments.