Unlock Powerful Large Language Model Inference at Argonne Leadership Computing Facility (ALCF)
The ALCF Inference Endpoints provide a robust API for running Large Language Model (LLM) inference using Globus Compute on ALCF HPC Clusters.
| Cluster | Endpoint |
|---|---|
| Sophia | https://data-portal-dev.cels.anl.gov/resource_server/sophia |
🔒 Access Note:
- Endpoints are restricted. You must be on Argonne's network (use VPN, Dash, or SSH into an ANL machine).
- You will need to authenticate with Argonne or ALCF SSO (Single Sign On) using your credentials. See Authentication.
- vLLM - https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm
- Infinity - https://data-portal-dev.cels.anl.gov/resource_server/sophia/infinity
Endpoints:
- Chat completions: https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions
- Completions: https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/completions
- Embeddings: https://data-portal-dev.cels.anl.gov/resource_server/sophia/infinity/v1/embeddings
- Batches: https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/batches
📝 Important Notes:
- Currently, embeddings are only supported by the Infinity framework.
- See the usage examples below and/or refer to the OpenAI API docs.
- The default response format for the API is text/plain.
- The Globus backend does not support streaming; set stream: False when integrating with RAG applications.
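For example, a chat request payload with streaming disabled might look like this minimal sketch (the model name is only an illustration):
import json

# Streaming is not supported by the Globus backend, so keep stream set to False.
payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False
}
print(json.dumps(payload))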
Available Models
Models marked with (B) support batch processing (see the batch section below).
Chat Language Models
- Qwen/Qwen2.5-14B-Instruct (B)
- Qwen/Qwen2.5-7B-Instruct (B)
- Qwen/QwQ-32B-Preview (B)
- meta-llama/Meta-Llama-3-70B-Instruct (B)
- meta-llama/Meta-Llama-3-8B-Instruct (B)
- meta-llama/Meta-Llama-3.1-70B-Instruct (B)
- meta-llama/Meta-Llama-3.1-8B-Instruct (B)
- meta-llama/Meta-Llama-3.1-405B-Instruct
- meta-llama/Llama-3.3-70B-Instruct (B)
- mistralai/Mistral-7B-Instruct-v0.3 (B)
- mistralai/Mistral-Large-Instruct-2407 (B)
- mistralai/Mixtral-8x22B-Instruct-v0.1 (B)
- mgoin/Nemotron-4-340B-Instruct-hf
- argonne-private/AuroraGPT-7B (previously called auroragpt/auroragpt-0.1-chkpt-7B-Base)
- argonne-private/AuroraGPT-IT-v4-0125 (previously called auroragpt/auroragpt-0.1-chkpt-7B-IT)
- argonne-private/AuroraGPT-Tulu3-SFT-0125
- argonne-private/AuroraGPT-KTO-1902 (previously called auroragpt/auroragpt-0.1-chkpt-7B-KTO)
- argonne-private/AuroraGPT-DPO-1902 (previously called auroragpt/auroragpt-0.1-chkpt-7B-DPO)
- argonne-private/AuroraGPT-SFT-190
Deepseek Family
- deepseek-ai/DeepSeek-R1 (not supported natively on A100 GPUs; under testing)
- deepseek-ai/DeepSeek-V3 (not supported natively on A100 GPUs; under testing)
Allenai Family
- allenai/Llama-3.1-Tulu-3-405B
Vision Language Models
- Qwen/Qwen2-VL-72B-Instruct (B) (ranked #1 on the vision leaderboard)
- meta-llama/Llama-3.2-90B-Vision-Instruct
Embedding Models
- nvidia/NV-Embed-v2 (ranked #1 on the embedding leaderboard)
📝 Want to add a model? Add the HF-compatible, framework-supported model weights to /eagle/argonne_tpc/model_weights/ and contact Aditya Tanikanti.
When interacting with the inference endpoints, it's crucial to understand the system's operational characteristics:
- Initial Model Loading
  - The first query for a "cold" model takes approximately 10-15 minutes
  - Loading time depends on the specific model's size
  - A node must first be acquired and the model loaded into memory
- Cluster Resource Constraints
  - These endpoints run on a High-Performance Computing (HPC) cluster as PBS jobs
  - The cluster is used for multiple tasks beyond inference
  - During high-demand periods, your job might be queued
  - You may need to wait until computational resources become available
- Job and Model Running Status
  - To view currently running jobs along with the models served on the cluster, run curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/sophia/jobs" -H "Authorization: Bearer ${access_token}". See Authentication for how to obtain access_token.
📝 Note:
- If you’re interested in extended model runtimes, reservations, or private model deployments, please get in touch with us.
The models currently run as part of a 24-hour job on Sophia. Here's how the endpoint activation works:
- The first query by a user dynamically acquires and activates the endpoints (approximately 10-15 minutes).
- Subsequent queries by users will re-use the running job/endpoint.
- Running endpoints that are idle for more than 2 hours will be terminated in order to re-allocate resources to other HPC jobs.
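Given the cold-start and queueing behavior above, it is worth using a long HTTP timeout and a simple retry loop on the client side. A minimal sketch with requests, using the token helper introduced under Authentication below (the timeout and retry values are illustrative assumptions, not service requirements):
import time
import requests
from inference_auth_token import get_access_token

URL = "https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions"
payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
}

# A cold model can take 10-15 minutes to load, so allow a long read
# timeout and retry a few times before giving up.
for attempt in range(3):
    # Refresh the token on each attempt; access tokens eventually expire.
    headers = {"Authorization": f"Bearer {get_access_token()}"}
    try:
        r = requests.post(URL, headers=headers, json=payload, timeout=(10, 1200))
        r.raise_for_status()
        print(r.json())
        break
    except requests.exceptions.RequestException as err:
        print(f"Attempt {attempt + 1} failed: {err}")
        time.sleep(60)  # wait before retrying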
# Create a new Conda environment
conda create -n globus_env python==3.11.9 -y
conda activate globus_env
# Install the Globus SDK (must be at least version 3.46.0)
pip install 'globus_sdk>=3.46.0'
# Install optional package
pip install openai
Authentication
Download the script to manage access tokens:
wget https://raw.githubusercontent.com/argonne-lcf/inference-endpoints/refs/heads/main/inference_auth_token.py
Authenticate with your Globus account:
python inference_auth_token.py authenticate
The above command will generate an access token and a refresh token, and store them in your home directory.
If you need to re-authenticate from scratch, either to 1) change your Globus account, or 2) resolve a "Permission denied from internal policies" error, first log out of your account by visiting https://app.globus.org/logout, then run:
python inference_auth_token.py authenticate --force
View your access token:
python inference_auth_token.py get_access_token
If your current access token is expired, the above command will automatically generate a new token without human intervention.
⏰ Token Validity: All access tokens are valid for 48 hours, but the refresh token will allow you to acquire new access tokens programmatically without needing to re-authenticate. Refresh tokens do not expire unless they are left unused for 6 months or more. However, an internal policy will force users to re-authenticate every 7 days.
List the status of running jobs/endpoints on the cluster
#!/bin/bash
# Get your access token
access_token=$(python inference_auth_token.py get_access_token)
curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/sophia/jobs" \
-H "Authorization: Bearer ${access_token}"
List all available endpoints
#!/bin/bash
# Get your access token
access_token=$(python inference_auth_token.py get_access_token)
curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/list-endpoints" \
-H "Authorization: Bearer ${access_token}"
Chat Completions Curl Example
#!/bin/bash
# Get your access token
access_token=$(python inference_auth_token.py get_access_token)
# Define the base URL
base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions"
# Define the model and parameters
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
temperature=0.2
max_tokens=150
# Define an array of messages
messages=(
"List all proteins that interact with RAD51"
"What are the symptoms of diabetes?"
"How does photosynthesis work?"
)
# Loop through the messages and send a POST request for each
for message in "${messages[@]}"; do
curl -X POST "$base_url" \
-H "Authorization: Bearer ${access_token}" \
-H "Content-Type: application/json" \
-d '{
"model": "'$model'",
"temperature": '$temperature',
"max_tokens": '$max_tokens',
"messages":[{"role": "user", "content": "'"$message"'"}]
}'
done
Completions Curl Example
#!/bin/bash
# Get your access token
access_token=$(python inference_auth_token.py get_access_token)
# Define the base URL
base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/completions"
# Define the model and parameters
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
temperature=0.2
max_tokens=150
# Define an array of prompts
prompts=(
"List all proteins that interact with RAD51"
"What are the symptoms of diabetes?"
"How does photosynthesis work?"
)
# Loop through the prompts and send a POST request for each
for prompt in "${prompts[@]}"; do
echo "'"$prompt"'"
curl -X POST "$base_url" \
-H "Authorization: Bearer ${access_token}" \
-H "Content-Type: application/json" \
-d '{
"model": "'$model'",
"temperature": '$temperature',
"max_tokens": '$max_tokens',
"prompt":"'"$prompt"'"
}'
done
Using Requests
import requests
import json
from inference_auth_token import get_access_token
# Get your access token
access_token = get_access_token()
# Chat Completions Example
def send_chat_request(message):
    url = "https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions"
    headers = {
        'Authorization': f'Bearer {access_token}',
        'Content-Type': 'application/json'
    }
    data = {
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": message}]
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))
    return response.json()
output = send_chat_request("What is the purpose of life?")
print(output)
Using OpenAI Package
from openai import OpenAI
from inference_auth_token import get_access_token
# Get your access token
access_token = get_access_token()
client = OpenAI(
    api_key=access_token,
    base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1"
)
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response)
Using Vision Model
from openai import OpenAI
import base64
from inference_auth_token import get_access_token
# Get your access token
access_token = get_access_token()
# Initialize the client
client = OpenAI(
    api_key=access_token,
    base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1"
)
# Function to encode image to base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
# Prepare the image
image_path = "scientific_diagram.png"
base64_image = encode_image(image_path)
# Create vision model request
response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the key components in this scientific diagram"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
            ]
        }
    ],
    max_tokens=300
)
# Print the model's analysis
print(response.choices[0].message.content)
Using Embedding Model
from openai import OpenAI
from inference_auth_token import get_access_token

# Get your access token
access_token = get_access_token()

# Initialize the client
client = OpenAI(
    api_key=access_token,
    base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/infinity/v1"
)

# Create embeddings
completion = client.embeddings.create(
    model="nvidia/NV-Embed-v2",
    input="The food was delicious and the waiter...",
    encoding_format="float"
)

# Print the embedding response
print(completion)
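The returned vectors can be used directly for similarity search. A small follow-up sketch, assuming the Infinity endpoint accepts a list of inputs as OpenAI-compatible embedding APIs generally do (the two sentences are arbitrary examples, and client is the Infinity client created above):
import math

# Embed two texts in one request and compare them with cosine similarity.
result = client.embeddings.create(
    model="nvidia/NV-Embed-v2",
    input=["The food was delicious.", "The meal tasted great."],
    encoding_format="float"
)
vec_a = result.data[0].embedding
vec_b = result.data[1].embedding

dot = sum(a * b for a, b in zip(vec_a, vec_b))
norm_a = math.sqrt(sum(a * a for a in vec_a))
norm_b = math.sqrt(sum(b * b for b in vec_b))
print("cosine similarity:", dot / (norm_a * norm_b))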
The ALCF Inference Service provides batch processing capabilities for large-scale inference tasks. This service is exclusively available to ALCF users with an allocation and access to our filesystem space. When a batch job is submitted:
- A dedicated vLLM instance is launched specifically for processing your batch requests
- The model serves only your requests from the input file (up to 150,000 requests per file per batch job)
- The service runs for a maximum of 24 hours or until all requests are processed
- Once completed, the model is automatically brought down to free resources
- Results are written either to:
  - Default directory: /eagle/argonne_tpc/inference-service-batch-results/
  - Custom directory: specified via the optional output_folder_path field in the request payload (e.g., /eagle/argonne_tpc/path/to/your/output/folder/)
📝 Important Note:
- The input file and output folder (if provided) must be located within the argonne_tpc project space or within a world readable/writable folder. Otherwise, the ALCF inference service will not have permission to process your batch request.
Currently, the service accommodates only two concurrent batch jobs. Any additional jobs are queued on Globus, and the batch status will accurately reflect the current state of each job.
📝 Important Note:
- Only models marked with (B) in the Available Models section support batch processing. These are the smaller models that fit on a single Sophia node.
Each line in the input file should contain a complete JSON request object in the format of the OpenAI API. For example:
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}}
📝 Important Notes:
- Input files must be located within the argonne_tpc project space or within a world readable/writable folder.
- Each request in the input file should be formatted as a JSON object on a single line (JSON Lines format).
- Each input file must only target one model, since a batch job will only load one model in memory.
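A short sketch for generating a compliant input file from a list of prompts (the file path and prompts are placeholders):
import json

# Hypothetical prompts; each becomes one JSONL request line.
prompts = ["Hello world!", "Explain photosynthesis."]

with open("/eagle/argonne_tpc/path/to/your/input.jsonl", "w") as f:
    for i, prompt in enumerate(prompts, start=1):
        request = {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 1000,
            },
        }
        f.write(json.dumps(request) + "\n")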
Create Batch Request
#!/bin/bash
# Get your access token
access_token=$(python inference_auth_token.py get_access_token)
# Define the base URL
base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/batches"
# Submit batch request
curl -X POST "$base_url" \
-H "Authorization: Bearer ${access_token}" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"input_file": "/eagle/argonne_tpc/path/to/your/input.jsonl"
}'
# Submit batch request with custom output folder
curl -X POST "$base_url" \
-H "Authorization: Bearer ${access_token}" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"input_file": "/eagle/argonne_tpc/path/to/your/input.jsonl",
"output_folder_path": "/eagle/argonne_tpc/path/to/your/output/folder/"
}'
Using Python:
import requests
import json
from inference_auth_token import get_access_token
# Get your access token
access_token = get_access_token()
# Define headers and URL
headers = {
    'Authorization': f'Bearer {access_token}',
    'Content-Type': 'application/json'
}
url = "https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/batches"
# Submit batch request
data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "input_file": "/eagle/argonne_tpc/path/to/your/input.jsonl",
    "output_folder_path": "/eagle/argonne_tpc/path/to/your/output/folder/"
}
response = requests.post(url, headers=headers, json=data)
print(response.json())
Retrieve Batch Metrics
#!/bin/bash
# Get your access token
access_token=$(python inference_auth_token.py get_access_token)
# Get results of specific batch
batch_id="your-batch-id"
curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/v1/batches/${batch_id}/result" \
-H "Authorization: Bearer ${access_token}"
Using Python:
import requests
from inference_auth_token import get_access_token
# Get your access token
access_token = get_access_token()
# Define headers and URL
headers = {
    'Authorization': f'Bearer {access_token}'
}
batch_id = "your-batch-id"
url = f"https://data-portal-dev.cels.anl.gov/resource_server/v1/batches/{batch_id}/result"
# Get batch results
response = requests.get(url, headers=headers)
print(response.json())
Sample output:
{
  "results_file": "/eagle/argonne_tpc/path/to/your/output/folder/<input-file-name>_<model>_<batch-id>/<input-file-name>_<timestamp>.results.jsonl",
  "progress_file": "/eagle/argonne_tpc/path/to/your/output/folder/<input-file-name>_<model>_<batch-id>/<input-file-name>_<timestamp>.progress.json",
  "metrics": {
    "response_time": 27837.440138816833,
    "throughput_tokens_per_second": 3899.833442250346,
    "total_tokens": 108561380,
    "num_responses": 99985,
    "lines_processed": 100000
  }
}
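After a batch completes, the results file can be read back and keyed by custom_id. A minimal sketch, assuming one JSON object per line carrying the custom_id from the input file (the exact per-line schema is not documented here, so treat the field access as an assumption; the path comes from the results_file field above):
import json

# Path returned in the "results_file" field of the batch result query.
results_path = "/eagle/argonne_tpc/path/to/your/output/folder/results.jsonl"

responses = {}
with open(results_path) as f:
    for line in f:
        record = json.loads(line)
        # Assumption: each result line echoes the custom_id of its request.
        responses[record["custom_id"]] = record

print(f"Loaded {len(responses)} responses")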
List All Batches
#!/bin/bash
# Get your access token
access_token=$(python inference_auth_token.py get_access_token)
# List all batches
curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/v1/batches" \
-H "Authorization: Bearer ${access_token}"
# Optionally filter by status (pending, running, completed, or failed)
curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/v1/batches?status=completed" \
-H "Authorization: Bearer ${access_token}"
Using Python:
import requests
from inference_auth_token import get_access_token
# Get your access token
access_token = get_access_token()
# Define headers and URL
headers = {
    'Authorization': f'Bearer {access_token}'
}
url = "https://data-portal-dev.cels.anl.gov/resource_server/v1/batches"
# List all batches
response = requests.get(url, headers=headers)
print(response.json())
# Optionally filter by status (pending, running, completed, or failed)
params = {'status': 'completed'}
response = requests.get(url, headers=headers, params=params)
print(response.json())
Sample Output:
[
  {
    "batch_id": "f8fa8efd-1111-476d-a0a0-111111111111",
    "cluster": "sophia",
    "created_at": "2025-02-20 18:39:58.049584+00:00",
    "framework": "vllm",
    "input_file": "/eagle/argonne_tpc/path/to/your/output/folder/chunk_a.jsonl",
    "status": "pending"
  },
  {
    "batch_id": "4b8a31b8-2222-479f-8c8c-222222222222",
    "cluster": "sophia",
    "created_at": "2025-02-20 18:40:30.882414+00:00",
    "framework": "vllm",
    "input_file": "/eagle/argonne_tpc/path/to/your/output/folder/chunk_b.jsonl",
    "status": "pending"
  }
]
Get Batch Status
#!/bin/bash
# Get your access token
access_token=$(python inference_auth_token.py get_access_token)
# Get status of specific batch
batch_id="your-batch-id"
curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/v1/batches/${batch_id}" \
-H "Authorization: Bearer ${access_token}"
Using Python:
import requests
from inference_auth_token import get_access_token
# Get your access token
access_token = get_access_token()
# Define headers and URL
headers = {
    'Authorization': f'Bearer {access_token}'
}
batch_id = "your-batch-id"
url = f"https://data-portal-dev.cels.anl.gov/resource_server/v1/batches/{batch_id}"
# Get batch status
response = requests.get(url, headers=headers)
print(response.json())
Batch Status Codes:
- pending: The request was submitted, but the job has not started yet.
- running: The job is currently running on a compute node.
- failed: An error occurred; the error message will be displayed when querying the result.
- completed: 🎉
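Putting these status codes together, a client can poll until the batch reaches a terminal state and then fetch the result. A sketch (the batch ID and polling interval are placeholders; the "status" field is assumed to match the list-batches output shown above):
import time
import requests
from inference_auth_token import get_access_token

batch_id = "your-batch-id"
base = "https://data-portal-dev.cels.anl.gov/resource_server/v1/batches"

# Poll the status endpoint until the batch reaches a terminal state.
while True:
    headers = {"Authorization": f"Bearer {get_access_token()}"}
    status = requests.get(f"{base}/{batch_id}", headers=headers).json().get("status")
    print("status:", status)
    if status in ("completed", "failed"):
        break
    time.sleep(300)  # check every 5 minutes

# Fetch the result; for failed batches this also surfaces the error message.
result = requests.get(f"{base}/{batch_id}/result", headers=headers).json()
print(result)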
Cancel Submitted Batch
The inference team is currently developing a mechanism for users to cancel submitted batches. In the meantime, please contact us with your batch_id if you have a batch to cancel.
Troubleshooting
- Connection Timeout?
  - Verify Argonne network access.
  - The model you are requesting may be queued if the cluster has too many pending jobs.
  - Check model status by querying https://data-portal-dev.cels.anl.gov/resource_server/sophia/jobs
- Permission Denied from Internal Policies
  - Error: Permission denied from internal policies. This is likely due to a high-assurance timeout...
  - Log out of your account by visiting https://app.globus.org/logout
  - Regenerate your access token with python inference_auth_token.py authenticate --force
- Permission Error During Batch Execution
  - Error: Batch failed: Error: TaskExecutionFailed:... PermissionError:...
  - Make sure your input file and output folder (if provided) are located within the argonne_tpc project space, or within a world readable/writable folder.