The main goal of this repository is to evaluate the performance of Intel's 5th Generation Xeon "Emerald Rapids" processors in CPU-based multimodal Retrieval-Augmented Generation (RAG) scenarios. Specifically, the benchmarks focus on the three models that form the multimodal pipeline:
- Embeddings (BAAI/bge-large-en-v1.5): For generating high-quality semantic text representations.
- Large Language Model (Llama-3.2-1B-Instruct): A compact instruction-following LLM.
- Vision Language Model (Phi-3.5-vision-instruct): Handles tasks that combine visual and textual data.
This repository provides scripts and instructions to measure inference times for these models in both CPU and GPU environments.
To use the Llama-3.2-1B-Instruct model, you first need to obtain access via Hugging Face. Follow these steps:
- Request access to the model from meta-llama/Llama-3.2-1B-Instruct.
- Once access is granted, generate a user access token to authorize model downloads. Refer to the Hugging Face documentation for detailed instructions.
Create a .env file in the llm folder with the following content:
HF_TOKEN=REPLACE_TOKEN
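As an illustration of how the token can be consumed, the minimal sketch below loads HF_TOKEN from llm/.env and authenticates with Hugging Face. It assumes the python-dotenv and huggingface_hub packages are available and may differ from how llm/main.py actually reads the token:

```python
# Minimal sketch (assumes python-dotenv and huggingface_hub are installed):
# load HF_TOKEN from llm/.env and authenticate against Hugging Face.
import os

from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv("llm/.env")               # puts HF_TOKEN into the process environment
login(token=os.environ["HF_TOKEN"])   # authorizes downloads of gated models such as Llama-3.2-1B-Instruct
```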
Create a Python 3.11 environment using your preferred environment manager and ensure pip is updated to version 24.2 or later:
python -m pip install --upgrade pip
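A quick, optional sanity check (a hypothetical helper, not part of the repository) that the interpreter and pip versions match the requirements above:

```python
# Hypothetical sanity check: confirm Python 3.11 and pip >= 24.2.
import sys
from importlib import metadata

assert sys.version_info[:2] == (3, 11), f"Expected Python 3.11, got {sys.version.split()[0]}"
major, minor = (int(x) for x in metadata.version("pip").split(".")[:2])
assert (major, minor) >= (24, 2), f"Expected pip >= 24.2, got {metadata.version('pip')}"
print("Environment OK")
```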
Install the required dependencies for CPU-based inference of the embeddings model:
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
pip install -r requirements_embeddings_cpu.txt
Install the required dependencies for GPU-based inference of the embeddings model:
pip install -r requirements_embeddings_gpu.txt
To measure inference time for embeddings, run the following script:
python embeddings/main.py
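For reference, a benchmark of this kind can be as simple as the hypothetical sketch below, which times BAAI/bge-large-en-v1.5 on CPU with sentence-transformers; the sentences, batch size, and output format are illustrative, and embeddings/main.py remains the authoritative implementation:

```python
# Hypothetical sketch: time BAAI/bge-large-en-v1.5 embedding generation on CPU.
import time

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cpu")
sentences = ["Retrieval-augmented generation combines search with LLMs."] * 32  # illustrative input

start = time.perf_counter()
embeddings = model.encode(sentences, batch_size=32, normalize_embeddings=True)
elapsed = time.perf_counter() - start

print(f"Encoded {len(sentences)} sentences in {elapsed:.3f} s "
      f"({1000 * elapsed / len(sentences):.1f} ms/sentence), dim={embeddings.shape[1]}")
```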
Install the required dependencies for CPU-based inference of the LLM and VLM:
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
pip install -r requirements_cpu.txt
Install the required dependencies for GPU-based inference of the LLM and VLM:
pip install -r requirements_gpu.txt
To measure inference time for the large language model, execute the script:
python llm/main.py
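As a rough picture of what is being measured, here is a hypothetical CPU sketch that loads meta-llama/Llama-3.2-1B-Instruct with ipex-llm low-bit weights and times a single generation; the prompt and generation parameters are illustrative, and llm/main.py is the authoritative script:

```python
# Hypothetical sketch: time Llama-3.2-1B-Instruct generation on CPU with ipex-llm INT4 weights.
import time

import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-3.2-1B-Instruct"   # gated model, requires the HF_TOKEN set up above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)

prompt = "Summarize the benefits of retrieval-augmented generation in two sentences."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.inference_mode():
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=128)
    elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"Generated {new_tokens} tokens in {elapsed:.3f} s ({new_tokens / elapsed:.1f} tok/s)")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```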
To measure inference time for the vision language model, execute the script:
python vlm/main.py
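For orientation, the sketch below shows one plausible way to time microsoft/Phi-3.5-vision-instruct on CPU with plain transformers (trust_remote_code is needed for the model's custom code); the image path, prompt, and generation settings are illustrative assumptions, and vlm/main.py remains the authoritative script:

```python
# Hypothetical sketch: time Phi-3.5-vision-instruct on CPU for a single image-plus-text prompt.
import time

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", _attn_implementation="eager"
)

image = Image.open("sample.jpg")  # illustrative path, replace with a real image
messages = [{"role": "user", "content": "<|image_1|>\nDescribe this image in one sentence."}]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt")

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=64, eos_token_id=processor.tokenizer.eos_token_id)
elapsed = time.perf_counter() - start

answer = processor.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(f"Generated a response in {elapsed:.3f} s")
print(answer)
```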
Note: In Linux environments, run the following commands before executing a benchmarking script, where NUM_PROCESSORS is the index of the last CPU core to bind and SCRIPT_PATH is the path to the benchmarking script (for example, llm/main.py):
> source ipex-llm-init
> numactl -C 0-NUM_PROCESSORS -m 0 python SCRIPT_PATH
The benchmarking scripts will output inference time metrics for each model. These metrics can be used to compare CPU and GPU performance under different configurations.
For further assistance, please create an issue in this repository.
This project is licensed under the MIT License. See the LICENSE file for details.