
🤖 ALCF Inference Endpoints

Unlock Powerful Large Language Model Inference at Argonne Leadership Computing Facility (ALCF)

🌐 Overview

The ALCF Inference Endpoints provide a robust API for running Large Language Model (LLM) inference using Globus Compute on ALCF HPC Clusters.

🖥️ Available Clusters

Cluster   Endpoint
Sophia    https://data-portal-dev.cels.anl.gov/resource_server/sophia

🔒 Access Note:

  • Endpoints are restricted. You must be on Argonne's network (use a VPN, Dash, or SSH into an ANL machine).
  • You will need to authenticate with Argonne or ALCF SSO (Single Sign-On) using your credentials. See Authentication.

🧩 Supported Frameworks

  • vLLM (chat and text completions)
  • Infinity (embeddings)

🚀 API Endpoints

Chat Completions

https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions

Completions

https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/completions

Embeddings

https://data-portal-dev.cels.anl.gov/resource_server/sophia/infinity/v1/embeddings

πŸ“ Note Currently embeddings are only supported by the infinity framework. See usage and/or refer to OpenAI API docs for examples

📚 Available Models

💬 Chat Language Models

Qwen Family

  • Qwen/Qwen2.5-14B-Instruct
  • Qwen/Qwen2.5-7B-Instruct
  • Qwen/QwQ-32B-Preview

Meta Llama Family

  • meta-llama/Meta-Llama-3-70B-Instruct
  • meta-llama/Meta-Llama-3-8B-Instruct
  • meta-llama/Meta-Llama-3.1-70B-Instruct
  • meta-llama/Meta-Llama-3.1-8B-Instruct
  • meta-llama/Meta-Llama-3.1-405B-Instruct
  • meta-llama/Llama-3.3-70B-Instruct

Mistral Family

  • mistralai/Mistral-7B-Instruct-v0.3
  • mistralai/Mistral-Large-Instruct-2407
  • mistralai/Mixtral-8x22B-Instruct-v0.1

Nvidia Nemotron Family

  • mgoin/Nemotron-4-340B-Instruct-hf

Aurora GPT Family

  • auroragpt/auroragpt-0.1-chkpt-7B-Base

πŸ‘οΈ Vision Language Models

Qwen Family

Meta Llama Family

  • meta-llama/Llama-3.2-90B-Vision-Instruct

🧲 Embedding Models

Nvidia Family

  • nvidia/NV-Embed-v2

📝 Want to add a model? Add the HF-compatible, framework-supported model weights to /eagle/argonne_tpc/model_weights/ and contact Aditya Tanikanti.

🧩 Inference Execution

Performance and Wait Times

When interacting with the inference endpoints, it's crucial to understand the system's operational characteristics:

  1. Initial Model Loading

    • The first query for a "cold" model takes approximately 10-15 minutes
    • Loading time depends on the specific model's size
    • A node must first be acquired and the model loaded into memory
  2. Cluster Resource Constraints

    • These endpoints run on a High-Performance Computing (HPC) cluster as PBS jobs
    • The cluster is used for multiple tasks beyond inference
    • During high-demand periods, your job might be queued
    • You may need to wait until computational resources become available
  3. Job and model running status

    • To view the currently running jobs and the models served on the cluster, run curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/sophia/jobs" -H "Authorization: Bearer ${access_token}". See Authentication for how to obtain access_token; a Python version is sketched below.
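
The same status check in Python, as a minimal sketch using the requests package and the token file produced in Authentication below:

import requests

# Read the access token generated by generate_auth_token.py
with open('access_token.txt', 'r') as file:
    access_token = file.read().strip()

# Query the list of running jobs and the models they serve
response = requests.get(
    "https://data-portal-dev.cels.anl.gov/resource_server/sophia/jobs",
    headers={'Authorization': f'Bearer {access_token}'}
)
print(response.json())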

🚧 Future Improvements:

  • The team is actively working on implementing a node reservation system to mitigate wait times and improve user experience.
  • If you're interested in extended model runtimes, reservations, or private model deployments, please get in touch with us.

Cluster-Specific Details

Sophia Cluster

The models are currently run as part of a 24-hour job on Sophia. Here's how the endpoint activation works:

  • The first query by an authorized user dynamically acquires and activates the endpoints
  • Subsequent queries by authorized users reuse the running job/endpoint (a warm-up sketch follows below)
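
Since the first query is what acquires a node and loads the model, one practical pattern is to "warm up" a cold model with a cheap request and a generous timeout before sending real work. A minimal sketch (the 1-token request and the 20-minute timeout are illustrative choices, not documented limits):

import requests

with open('access_token.txt', 'r') as file:
    access_token = file.read().strip()

# A tiny 1-token request; its only purpose is to trigger node
# acquisition and model loading, which can take 10-15 minutes.
response = requests.post(
    "https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions",
    headers={'Authorization': f'Bearer {access_token}',
             'Content-Type': 'application/json'},
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "max_tokens": 1,
        "messages": [{"role": "user", "content": "ping"}]
    },
    timeout=1200  # allow up to 20 minutes for a cold start
)
print(response.status_code)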

🛠️ Prerequisites

Python SDK Setup

# Create a new Conda environment
conda create -n globus_env python=3.11.9 -y
conda activate globus_env

# Install required package
pip install globus_sdk

# Install optional package
pip install openai

Authentication

Generate an access token:

wget https://raw.githubusercontent.com/argonne-lcf/inference-endpoints/refs/heads/main/generate_auth_token.py
python3 generate_auth_token.py
access_token=$(cat access_token.txt)

⏰ Token Validity: Active for 48 hours
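
Because tokens expire after 48 hours, a script can check the age of access_token.txt and regenerate it when stale. A minimal sketch (the threshold matches the validity above; it assumes generate_auth_token.py sits in the current directory and writes access_token.txt, as in the commands above):

import os
import subprocess
import time

TOKEN_FILE = 'access_token.txt'
MAX_AGE_SECONDS = 48 * 3600  # tokens are valid for 48 hours

# Regenerate the token if the file is missing or older than 48 hours
if (not os.path.exists(TOKEN_FILE)
        or time.time() - os.path.getmtime(TOKEN_FILE) > MAX_AGE_SECONDS):
    subprocess.run(['python3', 'generate_auth_token.py'], check=True)

with open(TOKEN_FILE, 'r') as file:
    access_token = file.read().strip()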

🔒 Access Note:

  • Endpoints are restricted. You must be on Argonne's network (use a VPN, Dash, or SSH into an ANL machine).
  • You will need to authenticate with Argonne or ALCF SSO (Single Sign-On) using your credentials.

💡 Usage Examples

🌟 Curl Request Examples

List the status of running jobs/endpoints on the cluster

#!/bin/bash

# Define the access token
access_token=$(cat access_token.txt)

curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/sophia/jobs" \
 -H "Authorization: Bearer ${access_token}"

List all available endpoints

#!/bin/bash

# Define the access token
access_token=$(cat access_token.txt)


curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/list-endpoints" \
 -H "Authorization: Bearer ${access_token}"

Chat Completions Curl Example

#!/bin/bash

# Define the access token
access_token=$(cat access_token.txt)

# Define the base URL
base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions"

# Define the model and parameters
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
temperature=0.2
max_tokens=150

# Define an array of messages
messages=(
  "List all proteins that interact with RAD51"
  "What are the symptoms of diabetes?"
  "How does photosynthesis work?"
)

# Loop through the messages and send a POST request for each
for message in "${messages[@]}"; do
  curl -X POST "$base_url" \
       -H "Authorization: Bearer ${access_token}" \
       -H "Content-Type: application/json" \
       -d '{
              "model": "'$model'",
              "temperature": '$temperature',
              "max_tokens": '$max_tokens',
              "messages":[{"role": "user", "content": "'"$message"'"}]
           }'
done

Completions Curl Example

#!/bin/bash

# Define the access token
access_token=$(cat access_token.txt)

# Define the base URL
base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/completions"

# Define the model and parameters
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
temperature=0.2
max_tokens=150

# Define an array of prompts
prompts=(
  "List all proteins that interact with RAD51"
  "What are the symptoms of diabetes?"
  "How does photosynthesis work?"
)

# Loop through the prompts and send a POST request for each
for prompt in "${prompts[@]}"; do
  echo "'"$prompt"'"
  curl -X POST "$base_url" \
       -H "Authorization: Bearer ${access_token}" \
       -H "Content-Type: application/json" \
       -d '{
              "model": "'$model'",
              "temperature": '$temperature',
              "max_tokens": '$max_tokens',
              "prompt":"'"$prompt"'"
           }'
done

🐍 Python Implementations

Using Requests

import requests
import json

# Load access token
with open('access_token.txt', 'r') as file:
    access_token = file.read().strip()

# Chat Completions Example
def send_chat_request(message):
    url = "https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions"
    headers = {
        'Authorization': f'Bearer {access_token}',
        'Content-Type': 'application/json'
    }
    data = {
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": message}]
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))
    return response.json()

output = send_chat_request("What is the purpose of life?")
print(output)
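
To pull out just the generated text rather than the full JSON payload, index into the OpenAI-style response shape (assuming the request succeeded and returned at least one choice):

# The reply text lives in the first choice's message
print(output["choices"][0]["message"]["content"])
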
Using OpenAI Package

from openai import OpenAI

# Load access token
with open('access_token.txt', 'r') as file:
    access_token = file.read().strip()

client = OpenAI(
    api_key=access_token,
    base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

print(response)
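
If the gateway passes vLLM's streaming mode through (an assumption; this README only shows non-streaming calls), the standard OpenAI client can consume tokens as they arrive, reusing the client object from the example above:

# Stream tokens as they are generated instead of waiting for the full reply
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
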
Using a Vision Model

from openai import OpenAI
import base64

# Load access token
with open('access_token.txt', 'r') as file:
    access_token = file.read().strip()
    
# Initialize the client
client = OpenAI(
    api_key=access_token,
    base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1"
)

# Function to encode image to base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Prepare the image
image_path = "scientific_diagram.png"
base64_image = encode_image(image_path)

# Create vision model request
response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the key components in this scientific diagram"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
            ]
        }
    ],
    max_tokens=300
)

# Print the model's analysis
print(response.choices[0].message.content)

Using an Embedding Model

from openai import OpenAI

# Load access token
with open('access_token.txt', 'r') as file:
    access_token = file.read().strip()
 
# Initialize the client
client = OpenAI(
    api_key=access_token,
    base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/infinity/v1"
)

# Create Embeddings
completion = client.embeddings.create(
  model="nvidia/NV-Embed-v2",
  input="The food was delicious and the waiter...",
  encoding_format="float"
)

# Print the embedding response
print(completion)
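
A common follow-up is comparing two embeddings with cosine similarity. A minimal sketch, reusing the client object from the example above (the two input sentences are illustrative):

import math

# Embed two sentences in one call
result = client.embeddings.create(
    model="nvidia/NV-Embed-v2",
    input=["The food was delicious.", "The meal tasted great."],
    encoding_format="float"
)
a = result.data[0].embedding
b = result.data[1].embedding

# Cosine similarity: dot(a, b) / (|a| * |b|)
dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
print(dot / norm)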

🚨 Troubleshooting

  • Connection Timeout?
    • Regenerate your access token
    • Verify Argonne network access
    • Your job may still be queued; during high demand the cluster can have many pending jobs (see the retry sketch below)
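
Because a cold model can take 10-15 minutes to come up, a client that retries with a generous timeout often rides out both cold starts and queue waits. A minimal sketch (the retry count and delays are illustrative, not documented behavior):

import time
import requests

with open('access_token.txt', 'r') as file:
    access_token = file.read().strip()

url = "https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions"
payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
}

for attempt in range(5):
    try:
        response = requests.post(
            url,
            headers={'Authorization': f'Bearer {access_token}',
                     'Content-Type': 'application/json'},
            json=payload,
            timeout=1200,  # allow for a 10-15 minute cold start
        )
        response.raise_for_status()
        print(response.json())
        break
    except requests.RequestException as err:
        # Queued jobs or cold models surface as timeouts/5xx; back off and retry
        print(f"Attempt {attempt + 1} failed: {err}")
        time.sleep(60)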

📞 Contact Us
