Merge pull request #158 from Kenza-AI/feature/batch-inference-embeddings
Feature/batch inference embeddings
pm3310 committed Mar 10, 2024
2 parents c1a072b + 6dab3a7 commit 351532e
Showing 4 changed files with 527 additions and 17 deletions.
187 changes: 173 additions & 14 deletions docs/index.md
@@ -99,9 +99,11 @@ You can change the values for ec2 type (-e), aws region and aws profile with you

Once the Stable Diffusion model is deployed, you can use the generated code snippet to query it. Enjoy!

### Backend Platforms
### Restful Inference

#### OpenAI
#### Backend Platforms

##### OpenAI

The following models are offered for chat completions:

@@ -129,7 +131,7 @@ And for embeddings:
All these lists of supported models on Openai can be retrieved by running the command `sagify llm models --all --provider openai`. If you want to focus only on chat completions models, then run `sagify llm models --chat-completions --provider openai`. For image creations and embeddings, `sagify llm models --image-creations --provider openai` and `sagify llm models --embeddings --provider openai`, respectively.


#### Anthropic
##### Anthropic

The following models are offered for chat completions:

@@ -142,7 +144,7 @@ The following models are offered for chat completions:
|claude-3-sonnet|https://docs.anthropic.com/claude/docs/models-overview|


#### Open-Source
##### Open-Source

The following open-source models are offered for chat completions:

@@ -181,7 +183,7 @@ And for embeddings:

All these lists of supported open-source models are supported on AWS Sagemaker and can be retrieved by running the command `sagify llm models --all --provider sagemaker`. If you want to focus only on chat completions models, then run `sagify llm models --chat-completions --provider sagemaker`. For image creations and embeddings, `sagify llm models --image-creations --provider sagemaker` and `sagify llm models --embeddings --provider sagemaker`, respectively.

### Set up OpenAI
#### Set up OpenAI

You need to define the following env variables before you start the LLM Gateway server:

@@ -190,14 +192,14 @@ You need to define the following env variables before you start the LLM Gateway
- `OPENAI_EMBEDDINGS_MODEL`: It should be one of the values listed [here](https://platform.openai.com/docs/models/embeddings).
- `OPENAI_IMAGE_CREATION_MODEL`: It should be one of the values listed [here](https://platform.openai.com/docs/models/dall-e).

### Set up Anthropic
#### Set up Anthropic

You need to define the following env variables before you start the LLM Gateway server:

- `ANTHROPIC_API_KEY`: Your Anthropic API key. Example: `export ANTHROPIC_API_KEY=...`.
- `ANTHROPIC_CHAT_COMPLETIONS_MODEL`: It should be one of the values listed [here](https://docs.anthropic.com/claude/docs/models-overview). Example: `export ANTHROPIC_CHAT_COMPLETIONS_MODEL=claude-2.1`

### Set up open-source LLMs
#### Set up open-source LLMs

The first step is to deploy the LLM model(s). You can choose to deploy all backend services (chat completions, image creations, embeddings) or only some of them.

@@ -229,7 +231,7 @@ It takes 15 to 30 minutes to deploy all the backend services as Sagemaker endpoi

The deployed model names, which are the Sagemaker endpoint names, are printed out and stored in the hidden file `.sagify_llm_infra.json`. You can also access them from the AWS Sagemaker web console.

### Deploy FastAPI LLM Gateway - Docker
#### Deploy FastAPI LLM Gateway - Docker

Once you have set up your backend platform, you can deploy the FastAPI LLM Gateway locally.

@@ -275,7 +277,7 @@ sagify llm gateway --image sagify-llm-gateway:v0.1.0 --start-local

If you want to support both platforms (OpenAI and AWS Sagemaker), then pass all the env variables for both platforms.

### Deploy FastAPI LLM Gateway - AWS Fargate
#### Deploy FastAPI LLM Gateway - AWS Fargate

If you want to deploy the LLM Gateway to AWS Fargate, you can follow these general steps:

@@ -341,11 +343,11 @@ Resources:
- <YOUR_SECURITY_GROUP_ID>
```

### LLM Gateway API
#### LLM Gateway API

Once the LLM Gateway is deployed, you can access it on `HOST_NAME/docs`.

#### Completions
##### Completions

Code samples

@@ -490,7 +492,7 @@ print(response.text)
}
```

#### Embeddings
##### Embeddings

Code samples

@@ -616,7 +618,7 @@ print(response.text)
}
```

#### Image Generations
##### Image Generations

Code samples

@@ -733,14 +735,105 @@ print(response.text)
The above example returns a url to the image. If you want to return a base64 value of the image, then set `response_format` to `base64_json` in the request body params.
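
If you opt for the base64 response, you need to decode the returned string before you can save or display the image. Here's a minimal sketch of that step, assuming you have already pulled the base64 string out of the JSON response (the variable name below is just a placeholder):

```python
import base64

# Placeholder: the base64-encoded image string extracted from the gateway's JSON response.
b64_image = "iVBORw0KGgoAAAANSUhEUgAA..."

# Decode the base64 payload and write it to disk as a PNG file.
with open("generated_image.png", "wb") as f:
    f.write(base64.b64decode(b64_image))
```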


### Upcoming Proprietary & Open-Source LLMs and Cloud Platforms
#### Upcoming Proprietary & Open-Source LLMs and Cloud Platforms

- [Amazon Bedrock](https://aws.amazon.com/bedrock/)
- [Cohere](https://cohere.com/)
- [Mistral](https://docs.mistral.ai/models/)
- [Gemma](https://blog.google/technology/developers/gemma-open-models/)
- [GCP VertexAI](https://cloud.google.com/vertex-ai)

### Batch Inference

In the realm of AI/ML, real-time inference via RESTful APIs is undeniably crucial for many applications. However, another equally important, yet often overlooked, aspect of inference lies in batch processing.

While real-time inference caters to immediate, on-the-fly predictions, batch inference lets you process large volumes of data efficiently and cost-effectively.

#### Embeddings

Generating embeddings offline in batch mode is essential for many real-world applications. These embeddings can then be stored in a vector database to serve recommender, search/ranking, and other ML-powered systems.

You have to use Sagemaker as the backend platform, and only the following open-source models are supported:

| Model Name | URL |
|:------------:|:-----:|
|bge-large-en|https://huggingface.co/BAAI/bge-large-en|
|bge-base-en|https://huggingface.co/BAAI/bge-base-en|
|gte-large|https://huggingface.co/thenlper/gte-large|
|gte-base|https://huggingface.co/thenlper/gte-base|
|e5-large-v2|https://huggingface.co/intfloat/e5-large-v2|
|bge-small-en|https://huggingface.co/BAAI/bge-small-en|
|e5-base-v2|https://huggingface.co/intfloat/e5-base-v2|
|multilingual-e5-large|https://huggingface.co/intfloat/multilingual-e5-large|
|e5-large|https://huggingface.co/intfloat/e5-large|
|gte-small|https://huggingface.co/thenlper/gte-small|
|e5-base|https://huggingface.co/intfloat/e5-base|
|e5-small-v2|https://huggingface.co/intfloat/e5-small-v2|
|multilingual-e5-base|https://huggingface.co/intfloat/multilingual-e5-base|
|all-MiniLM-L6-v2|https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2|

Also, the following ec2 instance types support batch inference:

| Instance Type | Details |
|:------------:|:-----:|
|ml.p3.2xlarge|https://instances.vantage.sh/aws/ec2/p3.2xlarge|
|ml.p3.8xlarge|https://instances.vantage.sh/aws/ec2/p3.8xlarge|
|ml.p3.16xlarge|https://instances.vantage.sh/aws/ec2/p3.16xlarge|
|ml.g4dn.2xlarge|https://instances.vantage.sh/aws/ec2/g4dn.2xlarge|
|ml.g4dn.4xlarge|https://instances.vantage.sh/aws/ec2/g4dn.4xlarge|
|ml.g4dn.8xlarge|https://instances.vantage.sh/aws/ec2/g4dn.8xlarge|
|ml.g4dn.16xlarge|https://instances.vantage.sh/aws/ec2/g4dn.16xlarge|

##### How does it work?

It's quite simple. To begin, prepare the input JSONL file(s). Consider the following example:

```json
{"id":1,"text_inputs":"what is the recipe of mayonnaise?"}
{"id":2,"text_inputs":"what is the recipe of fish and chips?"}
```

Each line contains a unique identifier (id) and the corresponding text input (text_inputs). This identifier is crucial for linking inputs to their respective outputs, as illustrated in the output format below:

```json
{"id": 1, "embedding": [-0.029919596, -0.0011845357, ..., 0.08851079, 0.021398442]}
{"id": 2, "embedding": [-0.041918136, 0.007127975, ..., 0.060178414, 0.031050885]}
```

Keeping the id field consistent between input and output files lets you join each output embedding back to the input it came from.

Once the input JSONL file(s) are saved in an S3 bucket, you can trigger the batch inference programmatically from your Python codebase or via the Sagify CLI.
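
For instance, you can prepare and upload such a file with a few lines of Python. Here's a minimal sketch using boto3 (bucket, key, and file names are placeholders; adjust them to your own setup):

```python
import json

import boto3

# Each line needs a unique "id" and the "text_inputs" to embed.
records = [
    {"id": 1, "text_inputs": "what is the recipe of mayonnaise?"},
    {"id": 2, "text_inputs": "what is the recipe of fish and chips?"},
]

# Write the records as JSON Lines (one JSON object per line).
with open("embeddings_input.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Upload the file under the prefix you will pass as the S3 input location.
s3 = boto3.client("s3")  # assumes AWS credentials are already configured
s3.upload_file(
    "embeddings_input.jsonl",
    "sagify-llm-playground",  # placeholder bucket
    "batch-input-data-example/embeddings/embeddings_input.jsonl",  # placeholder key
)
```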

##### CLI

The following command does all the magic! Here's an example:

```sh
sagify llm batch-inference --model gte-small --s3-input-location s3://sagify-llm-playground/batch-input-data-example/embeddings/ --s3-output-location s3://sagify-llm-playground/batch-output-data-example/embeddings/1/ --aws-profile sagemaker-dev --aws-region us-east-1 --num-instances 1 --ec2-type ml.p3.2xlarge --wait
```

The `--s3-input-location` should be the path where the JSONL file(s) are saved.

##### SDK

Magic can happen with the Sagify SDK, too. Here's a code snippet:

```python
from sagify.api.llm import batch_inference

batch_inference(
    model='gte-small',
    s3_input_location='s3://sagify-llm-playground/batch-input-data-example/embeddings/',
    s3_output_location='s3://sagify-llm-playground/batch-output-data-example/embeddings/1/',
    aws_profile='sagemaker-dev',
    aws_region='us-east-1',
    num_instances=1,
    ec2_type='ml.p3.2xlarge',
    aws_access_key_id='YOUR_AWS_ACCESS_KEY_ID',
    aws_secret_access_key='YOUR_AWS_SECRET_ACCESS_KEY',
    wait=True
)
```
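
Once the job finishes, the predictions are saved under the S3 output location. Here's a minimal sketch of collecting them into an `id -> embedding` dictionary, assuming the output objects are JSON Lines files with the format shown earlier (bucket and prefix are placeholders):

```python
import json

import boto3

s3 = boto3.client("s3")  # assumes AWS credentials are already configured
bucket = "sagify-llm-playground"  # placeholder bucket
prefix = "batch-output-data-example/embeddings/1/"  # placeholder output prefix

embeddings_by_id = {}
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in response.get("Contents", []):
    body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read().decode("utf-8")
    for line in body.splitlines():
        if line.strip():
            record = json.loads(line)
            embeddings_by_id[record["id"]] = record["embedding"]

print(f"Loaded {len(embeddings_by_id)} embeddings")
```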

## Machine Learning

@@ -1957,3 +2050,69 @@ It builds gateway docker image and starts the gateway locally.
`--platform PLATFORM`: Operating system. Platform in the format `os[/arch[/variant]]`.

`--start-local`: Flag to indicate whether to start the gateway locally.


### LLM Batch Inference

#### Name

Command to execute an LLM batch inference job

#### Synopsis
```sh
sagify llm batch-inference --model MODEL --s3-input-location S3_INPUT_LOCATION --s3-output-location S3_OUTPUT_LOCATION --aws-profile AWS_PROFILE --aws-region AWS_REGION --num-instances NUMBER_OF_EC2_INSTANCES --ec2-type EC2_TYPE [--aws-tags TAGS] [--iam-role-arn IAM_ROLE] [--external-id EXTERNAL_ID] [--wait] [--job-name JOB_NAME] [--max-concurrent-transforms MAX_CONCURRENT_TRANSFORMS]
```

#### Description

This command triggers a batch inference job given an LLM model and a batch input.

- The input S3 path should contain a JSONL file or multiple JSONL files. Example of a file:
```json
{"id":1,"text_inputs":"what is the recipe of mayonnaise?"}
{"id":2,"text_inputs":"what is the recipe of fish and chips?"}
```

Each line contains a unique identifier (id) and the corresponding text input (text_inputs). This identifier is crucial for linking inputs to their respective outputs, as illustrated in the output format below:

```json
{"id": 1, "embedding": [-0.029919596, -0.0011845357, ..., 0.08851079, 0.021398442]}
{"id": 2, "embedding": [-0.041918136, 0.007127975, ..., 0.060178414, 0.031050885]}
```

Keeping the id field consistent between input and output files lets you join each prediction back to the input it came from.

#### Required Flags

`--model MODEL`: LLM model name

`--s3-input-location S3_INPUT_LOCATION` or `-i S3_INPUT_LOCATION`: s3 input data location

`--s3-output-location S3_OUTPUT_LOCATION` or `-o S3_OUTPUT_LOCATION`: s3 location to save predictions

`--num-instances NUMBER_OF_EC2_INSTANCES` or `-n NUMBER_OF_EC2_INSTANCES`: Number of ec2 instances

`--ec2-type EC2_TYPE` or `-e EC2_TYPE`: ec2 type. Refer to https://aws.amazon.com/sagemaker/pricing/instance-types/

`--aws-profile AWS_PROFILE`: The AWS profile to use for the batch inference job

`--aws-region AWS_REGION`: The AWS region to use for the batch inference job

#### Optional Flags

`--aws-tags TAGS` or `-a TAGS`: Tags for labeling an inference job of the form `tag1=value1;tag2=value2`. For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html.

`--iam-role-arn IAM_ROLE` or `-r IAM_ROLE`: AWS IAM role to use for the inference job with *SageMaker*

`--external-id EXTERNAL_ID` or `-x EXTERNAL_ID`: Optional external id used when using an IAM role

`--wait`: Optional flag to wait until Batch Inference is finished. (default: don't wait)

`--job-name JOB_NAME`: Optional name for the SageMaker batch inference job

`--max-concurrent-transforms MAX_CONCURRENT_TRANSFORMS`: Optional maximum number of HTTP requests to be made to each individual inference container at one time. Default value: 1

#### Example
```sh
sagify llm batch-inference --model gte-small --s3-input-location s3://sagify-llm-playground/batch-input-data-example/embeddings/ --s3-output-location s3://sagify-llm-playground/batch-output-data-example/embeddings/1/ --aws-profile sagemaker-dev --aws-region us-east-1 --num-instances 1 --ec2-type ml.p3.2xlarge --wait
```
80 changes: 80 additions & 0 deletions sagify/api/llm.py
@@ -0,0 +1,80 @@
from sagify.sagemaker import sagemaker


def batch_inference(
        model,
        s3_input_location,
        s3_output_location,
        aws_profile,
        aws_region,
        num_instances,
        ec2_type,
        aws_role=None,
        external_id=None,
        tags=None,
        wait=True,
        job_name=None,
        model_version='1.*',
        max_concurrent_transforms=None,
        aws_access_key_id=None,
        aws_secret_access_key=None,
):
    """
    Executes a batch inference job given a foundation model on SageMaker

    :param model: [str], model name
    :param s3_input_location: [str], S3 input data location
    :param s3_output_location: [str], S3 location to save predictions
    :param aws_profile: [str], AWS profile name
    :param aws_region: [str], AWS region
    :param num_instances: [int], number of ec2 instances
    :param ec2_type: [str], ec2 instance type. Refer to:
        https://aws.amazon.com/sagemaker/pricing/instance-types/
    :param aws_role: [str, default=None], the AWS role assumed by SageMaker while deploying
    :param external_id: [str, default=None], Optional external id used when using an IAM role
    :param tags: [list[dict], default=None], List of tags for labeling the batch inference
        job. For more, see https://docs.aws.amazon.com/sagemaker/latest/dg/API_Tag.html. Example:
        [
            {
                'Key': 'key_name_1',
                'Value': key_value_1,
            },
            {
                'Key': 'key_name_2',
                'Value': key_value_2,
            },
            ...
        ]
    :param wait: [bool, default=True], wait or not for the batch transform to finish
    :param job_name: [str, default=None], name for the SageMaker batch transform job
    :param model_version: [str, default='1.*'], model version to use
    :param max_concurrent_transforms: [int, default=None], max number of concurrent transforms
    :param aws_access_key_id: [str, default=None], AWS access key id
    :param aws_secret_access_key: [str, default=None], AWS secret access key

    :return: [str], transform job status if wait=True.
        Valid values: 'InProgress'|'Completed'|'Failed'|'Stopping'|'Stopped'
    """
    sage_maker_client = sagemaker.SageMakerClient(
        aws_profile=aws_profile,
        aws_region=aws_region,
        aws_role=aws_role,
        external_id=external_id,
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key
    )

    return sage_maker_client.foundation_model_batch_transform(
        model_id=model,
        s3_input_location=s3_input_location,
        s3_output_location=s3_output_location,
        num_instances=num_instances,
        ec2_type=ec2_type,
        max_concurrent_transforms=max_concurrent_transforms,
        tags=tags,
        wait=wait,
        job_name=job_name,
        model_version=model_version
    )