
aws-samples/genai-llm-cpu-sagemaker

Large Language Models (LLMs) on CPU as SageMaker Endpoints

This code demonstrates how you can run Large Language Models (LLMs) on CPU-only instances, including AWS Graviton. It uses the llama.cpp project and exposes a SageMaker endpoint API for inference. Models are downloaded from the Hugging Face model hub. The project can be deployed for both ARM64 and x86 architectures.

Project Overview

This project is built using the AWS Cloud Development Kit (AWS CDK) with Python. The cdk.json file tells the CDK Toolkit how to execute your app.

Configuration

The AWS CDK app configuration values are defined in config.yaml:

  • project.name - used as a prefix for AWS resources created with this app, e.g. "cpu-llm"
  • model.hf_name - Hugging Face model name, e.g. "TheBloke/Llama-2-7b-Chat-GGUF"
  • model.full_name - Hugging Face model file full name, e.g. "llama-2-7b-chat.Q4_K_M.gguf"
  • image.platform - platform used to run inference and build the image; allowed values: ["ARM", "AMD"], e.g. "ARM"
  • image.image_tag - tag used for the container image, e.g. "arm-latest"
  • inference.sagemaker_model_name - SageMaker endpoint name for model inference, e.g. "llama-2-7b-chat"
  • inference.instance_type - instance type used for the SageMaker endpoint, e.g. "ml.c7g.8xlarge" for the ARM platform or "ml.g5.xlarge" for the AMD platform

At the moment the only supported options are ARM-based inference on AWS Graviton processors and AMD (x86) based inference on CUDA-capable GPUs (G5 instances are highly recommended). For GPU inference, sharding weights across multiple GPU cards is not supported.
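
For reference, a complete config.yaml populated with the example values above might look like the following sketch (keep the key layout that ships with the repository):

project:
  name: cpu-llm

model:
  hf_name: TheBloke/Llama-2-7b-Chat-GGUF
  full_name: llama-2-7b-chat.Q4_K_M.gguf

image:
  platform: ARM
  image_tag: arm-latest

inference:
  sagemaker_model_name: llama-2-7b-chat
  instance_type: ml.c7g.8xlarge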

Architecture

architecture diagram

The stack can be found in the ./infrastructure directory.

Prerequisites

Before proceeding any further, you need to identify and designate the AWS account in which the solution will be deployed.

Deploying from your local machine

You need to create an AWS account profile in ~/.aws/credentials for the designated AWS account if you don't already have one. The profile needs sufficient permissions to deploy an AWS Cloud Development Kit (AWS CDK) stack. We recommend removing the profile when you're finished testing. For more information about creating an AWS account profile, see Configuring the AWS CLI.

Python 3.11.x or later has to be installed on the machine used to run the CDK code. You will also need to install the AWS CDK CLI as described in the documentation and bootstrap your environment.

Deploying from a Cloud9 instance

If you don't want to install the necessary software locally, you can spin up an AWS Cloud9 instance that already has all the necessary software preinstalled. However, if this is your first CDK deployment in the account and/or Region, you will still need to bootstrap your environment.

CDK deployment

To Create Resources / Deploy Stack

Open the terminal and run the following commands:

# uncomment the line below if you need to bootstrap your environment
# replace ACCOUNT_ID and REGION placeholders with your actual
# AWS account id and region where you deploy the application

# cdk bootstrap aws://ACCOUNT_ID/REGION 

git clone https://github.com/aws-samples/genai-llm-cpu-sagemaker llamacpp
cd llamacpp
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
cdk deploy

To Destroy Resources / Clean-up

Delete the stack from the AWS CloudFormation console.

Model Selection / Change

Changing only the model does not require rebuilding the image and takes approximately 30% less time than redeploying the whole application. Use the following process:

  1. Navigate to https://huggingface.co/TheBloke and choose a GGUF model of your choice, for example https://huggingface.co/TheBloke/llama-2-7B-Arguments-GGUF, then scroll to the provided files. Based on our testing, the Q4_K_M quantization is usually a good compromise, but feel free to experiment yourself.

  2. Update the values of the variables in config.yaml to use the new model (see the example after this list):

    • model.hf_name - set the Hugging Face model name, e.g. "TheBloke/llama-2-7B-Arguments-GGUF"
    • model.full_name - set the Hugging Face model file full name, e.g. "llama-2-7b-chat.Q4_K_M.gguf"
  3. Re-deploy stack by running cdk deploy
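
Following the example above, the updated model section of config.yaml could look like the sketch below; the file name shown is hypothetical, so copy the exact .gguf file name from the files list on the model page:

model:
  hf_name: TheBloke/llama-2-7B-Arguments-GGUF
  full_name: llama-2-7b-arguments.Q4_K_M.gguf   # hypothetical - use the exact file name from the model page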

Platform Selection / Change

  1. Update the values of the variables in config.yaml to use the different platform (see the example after this list):

    • image.platform - set the platform (not case sensitive), e.g. "AMD"
    • inference.instance_type - set an instance type that matches the platform, e.g. "ml.g5.xlarge"
    • image.image_tag - (optional) update the image tag, e.g. "amd-latest"
  2. Re-deploy stack by running cdk deploy
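
Using the configuration keys described in the Configuration section, the corresponding config.yaml changes for an AMD/GPU deployment might look like this sketch:

image:
  platform: AMD
  image_tag: amd-latest

inference:
  instance_type: ml.g5.xlarge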

Multi-Model Deployment

Sometimes you may want to try multiple models from Hugging Face to compare response quality or latency. To do this, specify several models in multimodel_config.yaml and then use the provided Python script to start multiple model deployments in parallel.

python3 multimodel_cdk.py --deploy
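
The exact schema of multimodel_config.yaml is defined by the repository, so the snippet below is only an illustrative assumption of how several model entries could be listed, reusing the per-model keys from config.yaml:

# illustrative layout only - the keys in the repository's multimodel_config.yaml are authoritative
models:
  - sagemaker_model_name: llama-2-7b-chat
    hf_name: TheBloke/Llama-2-7b-Chat-GGUF
    full_name: llama-2-7b-chat.Q4_K_M.gguf
  - sagemaker_model_name: llama-2-7b-arguments
    hf_name: TheBloke/llama-2-7B-Arguments-GGUF
    full_name: llama-2-7b-arguments.Q4_K_M.gguf   # hypothetical file name - copy it from the model page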

Inference

Use notebooks/inference.ipynb as an example. The IAM credentials / IAM role that you use to run the notebook must allow sagemaker:InvokeEndpoint API calls.

If you don't have an existing environment for running Jupyter notebooks, the easiest way to run the notebook is to create a new SageMaker notebook instance with the default settings and let SageMaker create the necessary IAM role with sufficient permissions to interact with the provisioned LLM endpoint.
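
Outside the notebook, a minimal invocation with boto3 might look like the sketch below. The endpoint name comes from inference.sagemaker_model_name in config.yaml; the payload fields shown ("inputs", "parameters") are assumptions, so check notebooks/inference.ipynb for the schema expected by this project's handler.

import json

import boto3

# SageMaker runtime client; use the Region where the stack is deployed
runtime = boto3.client("sagemaker-runtime")

# Endpoint name corresponds to inference.sagemaker_model_name in config.yaml
endpoint_name = "llama-2-7b-chat"

# Hypothetical payload shape - verify the real schema in notebooks/inference.ipynb
payload = {
    "inputs": "Explain what AWS Graviton is in one sentence.",
    "parameters": {"max_tokens": 128, "temperature": 0.1},
}

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)

print(json.loads(response["Body"].read()))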

Limitations

At the moment there is a 25 GB limit on the custom Docker image size. Please make sure the GGUF model file you want to use is below this limit.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
