feat: add support for custom ai models | Pipeline API RFC #368

Closed

Conversation

kallebysantos
Contributor

@kallebysantos kallebysantos commented Jun 18, 2024

CLOSED: Proposal changed to PR [#430]

Git history moved to 368-add-suport-for-custom-ai-models


Since @nyannyacha made a code refactor, the original description may no longer reflect the architecture changes, but you can still read it below:

Original description

What kind of change does this PR introduce?

Feature, Enhancement

What is the current behavior?

As described in the following discussion, edge-runtime currently only supports the bundled gte-small as the feature-extraction model.

What is the new behavior?

This PR introduces a new Rust implementation of a transformers-like API in edge-runtime. It allows installing different ONNX models at the same time, and it contains the base API definitions to support other kinds of pipelines beyond just the feature-extraction task.

This new API introduces the Pipeline class, which is intended to be similar to xenova/transformers but backed by Rust. The Pipeline class aims to be an evolution of the current Session class, which only supports gte-small for feature extraction plus Ollama models. The new class defines APIs that allow:

  • Different kinds of tasks: feature-extraction (implemented), token-classification, sentiment-analysis, and others...
  • Usage of pre-installed models
  • GPU support provided by ort

GPU support was added in the ort-gpu-runtime branch, see also ort - Execution Providers

Docker Image:

You can get a Docker image of this PR from Docker Hub:

# default runtime
docker pull kallebysantos/edge-runtime:latest

# gpu with cuda provider
docker pull kallebysantos/edge-runtime:latest-cuda

Pipeline usage:

Using the new Pipeline class is very simple and looks very similar to Session:

Using the default gte-small, pre-installed by the Supabase team:

// By default `feature-extraction` will map to `gte-small`
const pipe = new Supabase.ai.Pipeline('feature-extraction');

Deno.serve(async (req: Request) => {
  const embeddings = await pipe.run("Hello World", { mean_pool: true, normalize: true });

  return new Response(JSON.stringify(embeddings), { status: 200 });
});

Using a different model:

// Just set the model name, similar to `xenova/transformers` or python `transformers`
const pipe = new Supabase.ai.Pipeline(
	'feature-extraction',
	'paraphrase-multilingual-MiniLM-L12-v2',
);

Deno.serve(async (req: Request) => {
  const embeddings = await pipe.run("Hello World", { mean_pool: true, normalize: true });

  return new Response(JSON.stringify(embeddings), { status: 200 });
});

Note: Custom models must be pre-installed inside the edge-runtime container. The Pipeline class will not download them automatically.

Models folder architecture:

The models folder has changed a little to support more than one model/task installed at the same time. The new structure aims to be more organized and robust, based on the following scheme:

models
├── defaults
│   ├── feature-extraction -> ../gte-small
│   └── token-classification -> ../ner-bert-large-cased-pt-lenerbr-onnx
├── gte-small
│   ├── model.onnx
│   └── tokenizer.json
├── ner-bert-large-cased-pt-lenerbr-onnx
│   └── ...
└── paraphrase-multilingual-MiniLM-L12-v2
    └── ...

Models are installed in the root and then referenced as symbolic links in the defaults folder, which maps each task to its current default model. Symbolic links are used to reduce disk space and file duplication.

The task and model names should match the folder names, to simplify model loading by the runtime.

So, to install a new model, you just need to download it into the models folder.
To change the default model of a specific task, you just need to change the symbolic link.
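
For illustration, a minimal sketch (hypothetical helper, not the PR's actual loader) of how the runtime could resolve a model folder from a task name and an optional model name under this layout:

use std::path::{Path, PathBuf};

// Illustrative only: pick `models/<model>` when a model name is given,
// otherwise follow the `models/defaults/<task>` symlink to the default model.
fn resolve_model_dir(
    models_root: &Path,
    task: &str,
    model: Option<&str>,
) -> std::io::Result<PathBuf> {
    match model {
        Some(name) => Ok(models_root.join(name)),
        None => std::fs::canonicalize(models_root.join("defaults").join(task)),
    }
}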

To simplify model installation, the download_models.sh script has been changed a little and now accepts the model name as an input argument:

./scripts/download_models.sh "Supabase/gte-small" "feature-extraction"

Sets the feature-extraction default to gte-small

./scripts/download_models.sh  "hf-user/model-name"

Custom model, for example Xenova/paraphrase-multilingual-MiniLM-L12-v2

The Pipeline API:

The pipeline module allows developers to create new kinds of tasks. Currently only feature-extraction has been implemented. This architecture defines the building blocks to extend and add more AI support.

To create new pipelines, developers must implement the Pipeline trait, like the following:

Feature Extraction implementation
use anyhow::Error;
use ndarray::{Array1, Axis, Ix3};
use ndarray_linalg::norm::{normalize, NormalizeAxis};
use ort::{inputs, Session};
use std::sync::Arc;
use tokenizers::Tokenizer;
use tokio::sync::mpsc::{self, UnboundedReceiver, UnboundedSender};
use tokio::sync::Mutex;

use crate::tensor_ops::mean_pool;

use super::{Pipeline, PipelineInput, PipelineRequest};

pub type FeatureExtractionResult = Vec<f32>;

#[derive(Debug)]
pub(crate) struct FeatureExtractionPipelineInput {
    pub prompt: String,
    pub mean_pool: bool,
    pub normalize: bool,
}
impl PipelineInput for FeatureExtractionPipelineInput {}

#[derive(Debug)]
pub struct FeatureExtractionPipeline {
    sender:
        UnboundedSender<PipelineRequest<FeatureExtractionPipelineInput, FeatureExtractionResult>>,
    receiver: Arc<
        Mutex<
            UnboundedReceiver<
                PipelineRequest<FeatureExtractionPipelineInput, FeatureExtractionResult>,
            >,
        >,
    >,
}
impl FeatureExtractionPipeline {
    pub fn init() -> Self {
        let (sender, receiver) = mpsc::unbounded_channel::<
            PipelineRequest<FeatureExtractionPipelineInput, FeatureExtractionResult>,
        >();

        Self {
            sender,
            receiver: Arc::new(Mutex::new(receiver)),
        }
    }
}
impl Pipeline<FeatureExtractionPipelineInput, FeatureExtractionResult>
    for FeatureExtractionPipeline
{
    fn get_sender(
        &self,
    ) -> UnboundedSender<PipelineRequest<FeatureExtractionPipelineInput, FeatureExtractionResult>>
    {
        self.sender.to_owned()
    }

    fn get_receiver(
        &self,
    ) -> Arc<
        Mutex<
            UnboundedReceiver<
                PipelineRequest<FeatureExtractionPipelineInput, FeatureExtractionResult>,
            >,
        >,
    > {
        Arc::clone(&self.receiver)
    }

    fn run(
        &self,
        session: &Session,
        tokenizer: &Tokenizer,
        input: &FeatureExtractionPipelineInput,
    ) -> Result<FeatureExtractionResult, Error> {
        let encoded_prompt = tokenizer
            .encode(input.prompt.to_owned(), true)
            .map_err(anyhow::Error::msg)?;

        let input_ids = encoded_prompt
            .get_ids()
            .iter()
            .map(|i| *i as i64)
            .collect::<Vec<_>>();

        let attention_mask = encoded_prompt
            .get_attention_mask()
            .iter()
            .map(|i| *i as i64)
            .collect::<Vec<_>>();

        let token_type_ids = encoded_prompt
            .get_type_ids()
            .iter()
            .map(|i| *i as i64)
            .collect::<Vec<_>>();

        let input_ids_array = Array1::from_iter(input_ids.iter().cloned());
        let input_ids_array = input_ids_array.view().insert_axis(Axis(0));

        let attention_mask_array = Array1::from_iter(attention_mask.iter().cloned());
        let attention_mask_array = attention_mask_array.view().insert_axis(Axis(0));

        let token_type_ids_array = Array1::from_iter(token_type_ids.iter().cloned());
        let token_type_ids_array = token_type_ids_array.view().insert_axis(Axis(0));

        let outputs = session.run(inputs! {
            "input_ids" => input_ids_array,
            "token_type_ids" => token_type_ids_array,
            "attention_mask" => attention_mask_array,
        }?)?;

        let embeddings = outputs["last_hidden_state"].extract_tensor()?;
        let embeddings = embeddings.into_dimensionality::<Ix3>()?;

        let result = if input.mean_pool {
            mean_pool(embeddings, attention_mask_array.insert_axis(Axis(2)))
        } else {
            embeddings.into_owned().remove_axis(Axis(0))
        };

        let result = if input.normalize {
            let (normalized, _) = normalize(result, NormalizeAxis::Row);
            normalized
        } else {
            result
        };

        Ok(result.view().to_slice().unwrap().to_vec())
    }
}

GPU Support:

GPU support allows pipeline inference on specialized hardware and is backed by CUDA. There is no configuration to be done by the end user; just call the Pipeline as described before. But to enable GPU inference, the Dockerfile was changed a little and now has two main build stages (which must be explicitly specified during docker build):

edge-runtime (CPU only):
This stage builds the default edge-runtime, where ort::Sessions are loaded using the CPU.

docker build --target "edge-runtime" .

edge-runtime-cuda (GPU/CPU):
This stage builds the default edge-runtime on an nvidia/cuda base image, which allows loading using the GPU or the CPU (as fallback).

docker build --target "edge-runtime-cuda" .

Each stage needs to install the appropriate ONNX Runtime. For that, install_onnx.sh has been updated with a 4th parameter, the --gpu flag, which will download a CUDA version from the official microsoft/onnxruntime repository.

Using GPU image:

To use the GPU image, the docker-compose file must include the following properties for the functions service:

services:
  functions:
    image: supabase/edge-runtime:latest-cuda
    # Built as described before, or directly by compose
    build:
      context: .
      dockerfile: Dockerfile
      target: edge-runtime-cuda

    # Required to use gpu inside the container
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1 # Change here if more devices are installed
              capabilities: [gpu]

IMPORTANT NOTE: The target infrastructure must be prepared with the NVIDIA Container Toolkit to allow GPU support inside Docker.


Final considerations:

There is a lot of work to do in order to support more kinds of tasks and to improve the Pipeline API to avoid code duplication. But I think this PR brings a standard pattern that improves AI support in edge-runtime, reducing container warm-up costs and giving more flexibility to change and test different models.


Updates and tests:

Inference Performance

Answer by @nyannyacha describing some solutions.

Since I made this PR, I've been testing it with my custom docker image and I found some performance issues.
In my create_session implementation I tried to follow the same structure as the original code from the Supabase team, but I found that sometimes the container crashes 🔥.

My theory is that Session::builder() is creating multiple Session instances instead of reusing them, but I don't know if it was supposed to do that. For the edge environment I think that's OK, since every request runs in a separate cold-start container. But for self-hosting it has a huge performance impact due to CPU utilization issues. I'd appreciate it if someone could help me with that.

GPU Support (check ort-gpu-runtime)

Already added to this PR, check the GPU Support section above.

In the previous section I reported some CPU performance issues, so I started looking into GPU utilization.

GPU inference setup
So I moved on with the following changes to the create_session function:

    let builder = Session::builder()?
        .with_optimization_level(GraphOptimizationLevel::Level3)?
        .with_intra_threads(orm_threads)?;

    let cuda = CUDAExecutionProvider::default();
    println!("CUDA: {:?}", cuda.is_available());

    if let Err(error) = cuda.register(&builder) {
        eprintln!("Failed to register CUDA!: {:?}", error);
        std::process::exit(1); // Forcing CUDA just for testing; in production I recommend using provider fallbacks
    }

    let session = builder.commit_from_file(model_file_path)?;
    Ok(session)

Then I updated the docker image with CUDA support:

Dockerfile code
# Docker image with rust changes for GPU support
FROM kallebysantos/edge-runtime:latest-cuda AS build

RUN apt update && apt install curl wget -y

WORKDIR /etc/sb_ai

# Custom model downloading
RUN curl -L -o ./download_models.sh https://raw.githubusercontent.com/kallebysantos/edge-runtime/main/scripts/download_models.sh
RUN chmod +x ./download_models.sh
RUN ./download_models.sh "Xenova/paraphrase-multilingual-MiniLM-L12-v2" "feature-extraction" --quantized

# Getting onnxruntime with cuda support
RUN wget -qO- 'https://github.com/microsoft/onnxruntime/releases/download/v1.18.1/onnxruntime-linux-x64-gpu-1.18.1.tgz' | tar zxv

FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 AS run

COPY --from=build /etc/sb_ai/models /etc/sb_ai/models
COPY --from=build /usr/local/bin/edge-runtime /usr/local/bin/edge-runtime

COPY --from=build /etc/sb_ai/onnxruntime-linux-x64-gpu-1.18.1 /usr/local/bin/onnxruntime

ENV SB_AI_MODELS_DIR=/etc/sb_ai/models
ENV ORT_DYLIB_PATH=/usr/local/bin/onnxruntime/lib/libonnxruntime.so

ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

ENTRYPOINT ["edge-runtime"]

Embedding from DB

With the GPU container started, I tried to embed all the missing document_sections of my database with the following function:

code
-- Function that can be triggered manually
create or replace function private.apply_embed_document_sections(
  batch_size int default 5,
  timeout_milliseconds int default 5 * 60 * 1000
)
returns void
language plpgsql
security definer
as $$
declare
  batch_count int = ceiling((
      select count(*) from document_sections where (ai).embedding is null
  ) / batch_size::float);
begin
  SET search_path = 'public','extensions';

  -- Loop through each batch and invoke an edge function to handle the embedding generation
  for i in 0 .. (batch_count-1) loop
  perform
    net.http_request(
      method := 'POST',
      url := vault.get('supabase_url') || '/functions/v1/document_sections/embed',
      body := jsonb_build_object(
        'ids', (select json_agg(id) from (
            select id from document_sections
              where (ai).embedding is null
              order by id limit batch_size offset i*batch_size
          ) as ds)
      ),
      headers := jsonb_build_object(
        'Content-Type', 'application/json',
        'Authorization','Bearer ' || vault.get('supabase_anon_key')
      ),
      timeout_milliseconds := timeout_milliseconds
    );
  end loop;
end;
$$;

GPU performance
Then I confirmed that my theory was true: watching nvidia-smi I could see that GPU memory utilization was being overflowed by multiple model instances, and after some time I needed to restart the container.

[nvidia-smi logs screenshot]

GPU overflow note: the model should only use ~200 MB


What kind of change does this PR introduce?

Feature, Enhancement

What is the current behavior?

As described in the following discussion, edge-runtime currently only supports the bundled gte-small as the feature-extraction model.

What is the new behavior?

This PR introduces a new Rust implementation of a transformers-like API in edge-runtime. It allows installing different ONNX models at the same time, and it contains the base API definitions to support other kinds of pipelines beyond just the feature-extraction task.

This new API introduces the Pipeline class, which is intended to be similar to xenova/transformers but backed by Rust. The Pipeline class aims to be an evolution of the current Session class, which only supports gte-small for feature extraction plus Ollama models. The new class defines APIs that allow:

  • Different kinds of tasks like feature-extraction, sentiment-analysis and others...
  • Usage of pre-loaded models (build time).
  • Model fetching and caching (runtime).
  • GPU support provided by ort.
  • Lazy loading of multiple ort sessions at the same time.

👷 It is also a foundation for something even bigger, like a hosting platform for custom models, a.k.a. MLOps.

Test Docker image:

You can get a Docker image of this PR from Docker Hub:

# default runtime
docker pull kallebysantos/edge-runtime:latest

# gpu with cuda provider
docker pull kallebysantos/edge-runtime:latest-cuda

Pipeline usage:

Using the new Pipeline class is very simple and looks very similar to Session:

Using the default gte-small, pre-installed by the Supabase team:

TypeScript support is provided by add type definition for Pipeline API #91

const pipe = new Supabase.ai.Pipeline('gte-small'); // 'gte-small' or 'supabase-gte'
// 'supabase-gte' | 'gte-small' are only backward-compatibility aliases for the 'feature-extraction' default variant.

Deno.serve(async (req: Request) => {
  const embeddings = await pipe.run("Hello World");
	
  return new Response(JSON.stringify(embeddings), { status: 200 });
});

Using a different model:
Custom models can be used as a variant of a task. The pipeline will auto-fetch them at runtime.

But for this to work, the download URLs must be provided as environment variables.

# export SB_AI_MODEL_URL_<task>_<variant_name>="...hf.co/../model.onnx"
# export SB_AI_TOKENIZER_URL_<task>_<variant_name>="...hf.co/../tokenizer.json"
# export SB_AI_CONFIG_URL_<task>_<variant_name>="...hf.co/../config.json"

export SB_AI_MODEL_URL_FEATURE_EXTRACTION_MULTILINGUAL_MINI_LM='https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2/resolve/main/onnx/model_quantized.onnx?download=true'
export SB_AI_TOKENIZER_URL_FEATURE_EXTRACTION_MULTILINGUAL_MINI_LM='https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2/resolve/main/tokenizer.json?download=true'
export SB_AI_CONFIG_URL_FEATURE_EXTRACTION_MULTILINGUAL_MINI_LM='https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2/resolve/main/config.json?download=true'

./scripts/run.sh

Then use it as a task variant:

// Just set the variant name, similar to `xenova/transformers` or python `transformers`
const pipe = new Supabase.ai.Pipeline('feature-extraction',  'multilingual-Mini-LM');

Deno.serve(async (req: Request) => {
  const embeddings = await pipe.run("Hello World");
	
  return new Response(JSON.stringify(embeddings), { status: 200 });
});

Available tasks:
At this point there are 3 kinds of tasks implemented in this PR; many others will come later.

Feature extraction, see details here

The default variant model for this task is Supabase/gte-small.
Also, supabase-gte and gte-small are aliases for feature-extraction with the default variant.

Example 1: Instantiate pipeline using the Pipeline class.

const extractor = new Supabase.ai.Pipeline('feature-extraction');
const output = await extractor('This is a simple test.');

// output: [0.05939, 0.02165, ...]

Example 2: Batch inference, processing multiple inputs in parallel

const extractor = new Supabase.ai.Pipeline('feature-extraction');
const output = await extractor(["I'd use Supabase in all of my projects", "Just a test for embedding"]);

// output: [[0.07399, 0.01462, ...], [-0.08963, 0.01234, ...]]
Text classification, see details here

The default variant model for this task is Xenova/distilbert-base-uncased-finetuned-sst-2-english.
Also, sentiment-analysis is an alias for text-classification.

Example 1: Instantiate pipeline using the Pipeline class.

const classifier = new Supabase.ai.Pipeline('text-classification');
const output = await classifier('I love Supabase!');

// output: {label: 'POSITIVE', score: 1.00}

Example 2: Batch inference, processing multiple inputs in parallel

const classifier = new Supabase.ai.Pipeline('sentiment-analysis');
const output = await classifier(['Cats are fun', 'Java is annoying']);

// output: [{label: 'POSITIVE', score: 0.99 }, {label: 'NEGATIVE', score: 0.97}]
Zero shot classification, see details here

The default variant model for this task is Xenova/distilbert-base-uncased-mnli.

Example 1: Instantiate pipeline using the Pipeline class.

const classifier = new Supabase.ai.Pipeline('zero-shot-classification');
const output = await classifier({
  text: 'one day I will see the world',
  candidate_labels: ['travel', 'cooking', 'exploration']
});

// output: [{label: "travel", score: 0.797}, {label: "exploration", score: 0.199}, {label: "cooking", score: 0.002}]

Example 2: Handling multiple correct labels

const classifier = new Supabase.ai.Pipeline('zero-shot-classification');
const input = {
  text: 'one day I will see the world',
  candidate_labels: ['travel', 'cooking', 'exploration']
};
const output = await classifier(input, { multi_label: true });

// output: [{label: "travel", score: 0.994}, {label: "exploration", score: 0.938}, {label: "cooking", score: 0.001}]

Example 3: Custom hypothesis template

const classifier = new Supabase.ai.Pipeline('zero-shot-classification');
const input = {
  text: 'one day I will see the world',
  candidate_labels: ['travel', 'cooking', 'exploration']
};
const output = await classifier(input, { hypothesis_template: "This example is NOT about {}" });

// output: [{label: "cooking", score: 0.47}, {label: "exploration", score: 0.26}, {label: "travel", score: 0.26}]

The cache folder:

By default all task assets will be cached to ort-sys::internal::dirs::cache_dir() using the following folder structure:

.cache
├── dfbin
│   └── <ort-runtime-lib>
├── models
│   ├── <hash of the model url from environment variable>
│   │   ├── model.onnx
│   │   └── model.onnx.checksum
│   └── ...
├── configs
│   ├── <hash of the config url from environment variable>
│   │   ├── config.json
│   │   └── config.json.checksum
│   └── ...
└── tokenizers
    ├── <hash of the tokenizer url from environment variable>
    │   ├── tokenizer.json
    │   └── tokenizer.json.checksum
    └── ...

Since cache folders are resolved based on the download URLs, if a URL changes the pipeline will re-fetch it.
To preserve cold starts in docker containers, consider mapping this folder to an external volume.
Assets can be pre-loaded during docker build using the preload_defs bin.
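
For illustration, a sketch of the URL-to-folder idea (the hashing scheme here is only an example, not necessarily the one the PR uses):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

// Illustrative only: derive a per-asset cache directory from its download URL,
// so a changed URL maps to a different folder and triggers a re-fetch.
fn model_cache_dir(cache_root: &Path, model_url: &str) -> PathBuf {
    let mut hasher = DefaultHasher::new();
    model_url.hash(&mut hasher);
    cache_root
        .join("models")
        .join(format!("{:016x}", hasher.finish()))
}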

GPU Support:

GPU support allows pipeline inference on specialized hardware and is backed by CUDA. There is no configuration to be done by the end user; just call the Pipeline as described before. But to enable GPU inference, the Dockerfile now has two main build stages (which must be explicitly specified during docker build):

edge-runtime (CPU only):
This stage builds the default edge-runtime, where ort::Sessions are loaded using the CPU.

docker build --target "edge-runtime" .

Resulting image: ~150 MB

edge-runtime-cuda (GPU/CPU):
This stage builds the default edge-runtime on an nvidia/cuda base image, which allows loading using the GPU or the CPU (as fallback).

docker build --target "edge-runtime-cuda" .

Resulting image: ~2.20 GB

Each stage needs to install the appropriate ONNX Runtime. For that, install_onnx.sh has been updated with a 4th parameter, the --gpu flag, which will download a CUDA version from the official microsoft/onnxruntime repository.

Using GPU image:

To use the GPU image, the docker-compose file must include the following properties for the functions service:

services:
  functions:
    # Built as described before
    image: supabase/edge-runtime:latest-cuda
    # or directly by compose
    build:
      context: .
      dockerfile: Dockerfile
      target: edge-runtime-cuda

    # Required to use gpu inside the container
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1 # Change here if more devices are installed
              capabilities: [gpu]

IMPORTANT NOTE: The target infrastructure must be prepared with the NVIDIA Container Toolkit to allow GPU support inside Docker.


Final considerations:

There is a lot of work to do in order to support more kinds of tasks and to improve the Pipeline API. But I think this PR brings a standard pattern that improves AI support in edge-runtime, reducing container warm-up costs and giving more flexibility to change and test different models.

Finally, thanks to @nyannyacha, who helped me a lot 🙏

@kallebysantos kallebysantos changed the title feat: adding support for custom ai models feat: add support for custom ai models Jun 18, 2024
@nyannyacha
Collaborator

cc @laktek

@laktek
Contributor

laktek commented Jun 19, 2024

This is amazing! Thanks for contributing @kallebysantos.

Please allow us couple of days to go through the changes and get back to you on how to proceed.

@kallebysantos
Contributor Author

Hi guys!!
I've pushed a docker image with the current changes: kallebysantos/edge-runtime:latest

@laktek
Contributor

laktek commented Jun 24, 2024

@kallebysantos I started experimenting with your branch locally and have a few questions I'd like to discuss with you (more product-related than code). Can you please drop me an email at lakshan [at] supabase [dot] com?

@kallebysantos
Contributor Author

@kallebysantos I started experimenting with your branch locally and have a few questions I'd like to discuss with you (more product-related than code). Can you please drop me an email at lakshan [at] supabase [dot] com?

Yes, sure 🙏

@laktek
Contributor

laktek commented Jun 27, 2024

@kallebysantos Just wanted to let you know I'll be on leave for 4 weeks starting from next week. We'll make a decision on how to proceed with this after I return. Meanwhile, other contributors will review and test the PR.

@nyannyacha nyannyacha self-assigned this Jul 9, 2024
@nyannyacha
Collaborator

Hello @kallebysantos 😄

Since @laktek is on leave, I'm going to help with the review of this PR.

(If it's okay with you, I hope you will allow me to push changes to PR to speed up reviews. 😋)

@kallebysantos
Contributor Author

Hello @kallebysantos 😄

Since @laktek is on leave, I'm going to help with the review of this PR.

(If it's okay with you, I hope you will allow me to push changes to PR to speed up reviews. 😋)

Hi @nyannyacha, it sounds OK to me, feel free to help 🤗
I reported some "issues/questions" in the Updates and tests section of my PR, could you have a look at that? I know the session instance could maybe use a singleton pattern, which should solve it, but I don't know if it was supposed to work that way, since the Supabase team aims to host it on the edge and I'm not sure about the impact of a singleton instance there. I was thinking of something like a HashSet that handles the (model_name, session) pairs, but I need to study the Deno runtime more to know where global/shared things should be stored. I also need some way to terminate the session after a while, especially when using CudaProvider; on CPU I noticed that workers already terminate the session.

@nyannyacha
Collaborator

nyannyacha commented Jul 10, 2024

@kallebysantos

Since I made this PR, I've been testing it with my custom docker image and I found some performance issues.
In my create_session implementation I tried to follow the same structure as the original code from the Supabase team, but I found that sometimes the container crashes 🔥.

I don't know what error code your container terminated with, but we've seen containers terminate due to SIGSEGV in addition to containers terminating due to system memory or GPU memory exhaustion.

If it also happened to your container, it was probably a problem with get_onnx_env in auto_pipeline.rs.

pub(crate) fn get_onnx_env() -> Lazy<Option<ort::Error>> {
    Lazy::new(|| {
        // Create the ONNX Runtime environment, for all sessions created in this process.
        // TODO: Add CUDA execution provider
        if let Err(err) = ort::init().with_name("SB_AI_ONNX").commit() {
            error!("sb_ai: failed to create environment - {}", err);
            return Some(err);
        }

        None
    })
}

If I remember correctly, ort::init()...commit() assigns ort::Environment to a global static variable (G_ENV), so there shouldn't be any further assignments to that variable after this.

However, since get_onnx_env transfers ownership of the return value to a local variable on the caller's stack rather than assigning it to a static variable, we can expect ort::init()...commit() to be called every time this function is called.
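
For illustration, a minimal sketch of the fix being described (assuming once_cell::sync::Lazy and the same error! macro as the snippet above; the actual commits may differ) is to hold the result in a process-wide static, so ort::init()...commit() runs at most once:

use once_cell::sync::Lazy;

// Sketch: a process-wide static keeps the environment result, so the
// initialization closure runs only on the first access.
static ONNX_ENV_INIT: Lazy<Option<ort::Error>> = Lazy::new(|| {
    if let Err(err) = ort::init().with_name("SB_AI_ONNX").commit() {
        error!("sb_ai: failed to create environment - {}", err);
        return Some(err);
    }
    None
});

pub(crate) fn get_onnx_env() -> &'static Option<ort::Error> {
    &ONNX_ENV_INIT
}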

I know the session instance could maybe use a singleton pattern, which should solve it, but I don't know if it was supposed to work that way, since the Supabase team aims to host it on the edge and I'm not sure about the impact of a singleton instance there.

I've had similar thoughts to yours before, and I've pushed some commits for this to my fork before this review. (This fork branch also contains commits for a few individual things that seem to need improvement in the scope of the sb_ai crate.)

Could you take a look?

https://github.com/nyannyacha/edge-runtime/commits/ai
nyannyacha@bc28819
nyannyacha@4447903

@kallebysantos
Contributor Author

@nyannyacha

I've had similar thoughts to yours before, and I've pushed some commits for this to my fork before this review. (This fork branch also contains commits for a few individual things that seem to need improvement in the scope of the sb_ai crate.)

Could you take a look?

https://github.com/nyannyacha/edge-runtime/commits/ai
nyannyacha@bc28819
nyannyacha@4447903

I replicated your ai branch in my ai branch to do some tests with your changes to create_session. But now I can't figure out why that branch just doesn't compile. I think it's something related to the base crate and this snapshot thing. I assume it is trying to execute the runtime during compilation.

failing workflow

@nyannyacha
Collaborator

@kallebysantos

That failure is related to the last commit I pushed to this PR.

@nyannyacha
Collaborator

The GitHub Test action does not have the ONNX runtime library, but the panic is caused by the ctor macro trying to initialize the ONNX environment.

@kallebysantos
Contributor Author

Hi @nyannyacha, I followed your code changes in order to implement the session optimization for Pipeline.
During my tests (processing 10k embeddings in batches of 10), the combination of my changes with yours (bc28819) solved the memory issues and container crashing.

  • Container crashing: Before your fixes, my container blew up on every first request (I thought it was probably the Deno caching that should run once but was being called twice due to parallel requests at the same time).

  • GPU: I could see the reuse of ort::Session from my changes in create_session, which results in just ~600 MiB of GPU RAM.

Before, my container was crashing a lot; now it just crashes 1-2 times. But I assume that's because the machine is weak and was processing a lot of requests at the same time; looking at htop I could see my CPU at max.

I had to implement a HashMap over the Lazy/static thing, because the Pipelines should handle models in a generic way, so it's impossible to know the model path as a constant value. You can check it in these commits:
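
For illustration only (hypothetical names; the linked commits may look different), the shape of such a map could be:

use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::{Arc, Mutex};

use once_cell::sync::Lazy;
use ort::Session;

// Rough sketch: one shared ort::Session per model path, so concurrent
// requests reuse an existing session instead of building a new one.
static SESSIONS: Lazy<Mutex<HashMap<PathBuf, Arc<Session>>>> =
    Lazy::new(|| Mutex::new(HashMap::new()));

fn get_or_create_session(model_file_path: PathBuf) -> anyhow::Result<Arc<Session>> {
    let mut sessions = SESSIONS.lock().unwrap();

    if let Some(session) = sessions.get(&model_file_path) {
        return Ok(Arc::clone(session));
    }

    let session = Arc::new(Session::builder()?.commit_from_file(&model_file_path)?);
    sessions.insert(model_file_path, Arc::clone(&session));

    Ok(session)
}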

I think the next step is to look for some way to drop unused Sessions after a while.
Maybe some smart pointer could help with that. But for my company's use case, these optimizations should be enough.

Also, I'll need to move further and add more kinds of pipelines like NER/token-classification.
I saw the edge-transformers and rust-bert repositories; maybe we can follow their implementations.

@nyannyacha
Collaborator

Hello @kallebysantos 😉

I followed your code changes in order to implement the session optimization for Pipeline.

When you say the session optimization, do you mean this commit?

the combination of my changes with yours (bc28819) solved the memory issues and container crashing.

Glad to hear that the commit worked for you. 😁

Before, my container was crashing a lot; now it just crashes 1-2 times.

Hmm... this is a little weird, can you find out what error code crashed the container? If it was caused by an error like SIGSEGV, that's serious and we should definitely look for it. Conversely, if it was caused by an OOM (out of memory), that makes sense since you said your machine is weak.

...However, it's hard to think of a case where the edge runtime crashes simply because CPU utilization reaches max. 🧐

I had to implement a HashMap over the Lazy/static thing, because the Pipelines should handle models in a generic way, so it's impossible to know the model path as a constant value. You can check it in these commits:

Yeah, this is a good idea. I looked at the two commits you linked to and they look fine.
I'd like you to add that change to this PR. 😊

I think the next step is to look for some way to drop unused Sessions after a while.
Maybe some smart pointer could help with that. But for my company's use case, these optimizations should be enough.

Since the Session is wrapped in an Arc<T>, we should be able to see a strong_count. This should allow us to write simple LRU or timer-based cleanup logic.
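
A minimal sketch of that idea, assuming the sessions live in a shared map like the one sketched above (names are hypothetical):

use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::Arc;

use ort::Session;

// Sketch: drop cached sessions that only the map still references;
// strong_count == 1 means no pipeline currently holds the session.
fn drop_unused_sessions(sessions: &mut HashMap<PathBuf, Arc<Session>>) {
    sessions.retain(|_, session| Arc::strong_count(session) > 1);
}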

Also, I'll need to move further and add more kinds of pipelines like NER/token-classification.
I saw the edge-transformers and rust-bert repositories; maybe we can follow their implementations.

I don't have any immediate thoughts on this, but I think it would take a lot of time to merge if we add a lot of changes to this PR, so it might not be a bad idea to start this in another PR after this one merges. 😁

@kallebysantos
Contributor Author

Hello @nyannyacha, thanks for your considerations 🙏.
I pushed the optimize-session commit to the main branch.

Hmm... this is a little weird, can you find out what error code crashed the container? If it was caused by an error like SIGSEGV, that's serious and we should definitely look for it. Conversely, if it was caused by an OOM (out of memory), that makes sense since you said your machine is weak.

I can't tell what error it is; the only thing that happens is supabase-edge-functions exited with code 0. I think it's related to the weak hardware: like you said before, the RAM just flies to the moon 🚀, as you can see in my htop logs below.


Memory blowing up 🚀🔥

For a production environment I'm planning to run it on a better machine with Docker Swarm, using replicas and resource limits.

Container start


Start processing pending sections
select count(*) from document_sections
where (ai).embedding is null;
-- Pending sections count: 17,771

select count(*) from net.failed_requests
-- Nothing here

select private.apply_embed_document_sections();
-- Start performing requests in batches of 5
Server handling a lot of requests

As you can see, in this machine the CPU goes up to 100%


We can also see that just 1 session is initialized and then reused. 🤗

Container crash

Then, after a while, the container just exits with code 0

These were the final results after the container exited:

select count(*) from document_sections
where (ai).embedding is null;
-- Pending sections count: 14,578

select count(*) from net.failed_requests
-- Failed requests: 1,298

Since my compose will always restart the container, it keeps processing after the reboot. And I have some pg_cron jobs to retry failed requests. My example is very heavy: I triggered the embedding manually for all pending sections, but in prod it will be called automatically by pg triggers after inserting a document, so the number of embeddings to process will be very small.

@nyannyacha
Collaborator

@kallebysantos

I'm confused that the container has an exit code of 0 😵‍💫

That would indicate that the edge runtime process was gracefully terminated, but your screenshot describes the exact opposite.

Can you try the command below immediately after the container crashes to see if it was OOMKilled?

docker container inspect <edge functions container id>

Example output

...
"State": {
    "Status": "running",
    "Running": true,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": false, // 👈
...

@nyannyacha
Collaborator

nyannyacha commented Jul 15, 2024

Hey @kallebysantos

I added a few more commits.

Previously, the error code was returned as 0 because edge runtime consumed the received unix signal and did not expose it out of process.

Please test it in your environment and let me know how it goes. 😋

@nyannyacha
Collaborator

nyannyacha commented Aug 9, 2024

Note:

This PR includes improvements to some bottlenecks for the inference task.

Load Testing (main vs PR-368)

Hardware Information

Hardware:

    Hardware Overview:

      Model Name: MacBook Air
      Model Identifier: MacBookAir10,1
      Chip: Apple M1
      Total Number of Cores: 8 (4 performance and 4 efficiency)
      Memory: 16 GB

Run # 1 (cold-start)

The RSS metric shown in htop in the screenshots below is the result after malloc_trim(0) is called.

main: [htop screenshot]

PR-368: [htop screenshot]

Run # 2 (warm-up)

k6 only

main

vscode ➜ /workspaces/edge-runtime (main-k6) $ k6 run  --summary-trend-stats "min,avg,med,max,p(95),p(99),p(99.99)" ./k6/dist/specs/gte.js 

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: ./k6/dist/specs/gte.js
        output: -

     scenarios: (100.00%) 1 scenario, 12 max VUs, 3m30s max duration (incl. graceful stop):
              * simple: 12 looping VUs for 3m0s (gracefulStop: 30s)


     ✗ status is 200
      ↳  98% — ✓ 4832 / ✗ 95
     ✗ request cancelled
      ↳  1% — ✓ 95 / ✗ 4832

     checks.........................: 50.00% ✓ 4927      ✗ 4927
     data_received..................: 724 kB 3.8 kB/s
     data_sent......................: 1.7 MB 8.7 kB/s
     http_req_blocked...............: min=416ns   avg=15.44µs  med=2.2µs    max=3.12ms p(95)=6.66µs   p(99)=411.61µs p(99.99)=2.9ms 
     http_req_connecting............: min=0s      avg=10.82µs  med=0s       max=3.04ms p(95)=0s       p(99)=325.58µs p(99.99)=2.84ms
     http_req_duration..............: min=84.68ms avg=439.51ms med=401.96ms max=1.47s  p(95)=869.7ms  p(99)=1.21s    p(99.99)=1.47s 
       { expected_response:true }...: min=84.68ms avg=427.74ms med=397.04ms max=1.47s  p(95)=796.38ms p(99)=1.13s    p(99.99)=1.47s 
     http_req_failed................: 1.92%  ✓ 95        ✗ 4832
     http_req_receiving.............: min=5.66µs  avg=76.7µs   med=50.45µs  max=2.75ms p(95)=238.18µs p(99)=685.38µs p(99.99)=2.69ms
     http_req_sending...............: min=2.37µs  avg=18.9µs   med=14.33µs  max=6.3ms  p(95)=33.61µs  p(99)=76.57µs  p(99.99)=3.78ms
     http_req_tls_handshaking.......: min=0s      avg=0s       med=0s       max=0s     p(95)=0s       p(99)=0s       p(99.99)=0s    
     http_req_waiting...............: min=84.59ms avg=439.42ms med=401.92ms max=1.47s  p(95)=869.66ms p(99)=1.21s    p(99.99)=1.47s 
     http_reqs......................: 4927   25.568987/s
     iteration_duration.............: min=84.87ms avg=441.96ms med=402.21ms max=11.39s p(95)=870.54ms p(99)=1.21s    p(99.99)=6.5s  
     iterations.....................: 4927   25.568987/s
     vus............................: 12     min=0       max=12
     vus_max........................: 12     min=0       max=12


running (3m12.7s), 00/12 VUs, 4927 complete and 0 interrupted iterations
simple ✓ [======================================] 12 VUs  3m0s

PR-368

vscode ➜ /workspaces/edge-runtime (ai) $ k6 run  --summary-trend-stats "min,avg,med,max,p(95),p(99),p(99.99)" ./k6/dist/specs/gte.js 

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: ./k6/dist/specs/gte.js
        output: -

     scenarios: (100.00%) 1 scenario, 12 max VUs, 3m30s max duration (incl. graceful stop):
              * simple: 12 looping VUs for 3m0s (gracefulStop: 30s)


     ✗ status is 200
      ↳  97% — ✓ 8252 / ✗ 180
     ✗ request cancelled
      ↳  2% — ✓ 180 / ✗ 8252

     checks.........................: 50.00% ✓ 8432    ✗ 8432
     data_received..................: 1.2 MB 6.5 kB/s
     data_sent......................: 2.8 MB 15 kB/s
     http_req_blocked...............: min=500ns    avg=12.4µs   med=2.75µs   max=7.95ms   p(95)=7.06µs   p(99)=198.65µs p(99.99)=3.91ms  
     http_req_connecting............: min=0s       avg=7.61µs   med=0s       max=7.85ms   p(95)=0s       p(99)=124.92µs p(99.99)=3.84ms  
     http_req_duration..............: min=109.28ms avg=256ms    med=245.92ms max=596.69ms p(95)=376.07ms p(99)=441.57ms p(99.99)=588.84ms
       { expected_response:true }...: min=109.28ms avg=256.11ms med=245.8ms  max=596.69ms p(95)=376.26ms p(99)=441.66ms p(99.99)=589.01ms
     http_req_failed................: 2.13%  ✓ 180     ✗ 8252
     http_req_receiving.............: min=8.54µs   avg=86.39µs  med=62.56µs  max=10.49ms  p(95)=134.2µs  p(99)=463.11µs p(99.99)=5.89ms  
     http_req_sending...............: min=3.2µs    avg=24.54µs  med=16.5µs   max=5.35ms   p(95)=36.58µs  p(99)=77.53µs  p(99.99)=4.28ms  
     http_req_tls_handshaking.......: min=0s       avg=0s       med=0s       max=0s       p(95)=0s       p(99)=0s       p(99.99)=0s      
     http_req_waiting...............: min=109.21ms avg=255.89ms med=245.83ms max=596.62ms p(95)=375.97ms p(99)=441.43ms p(99.99)=588.74ms
     http_reqs......................: 8432   44.0193/s
     iteration_duration.............: min=109.44ms avg=257.51ms med=246.24ms max=10.78s   p(95)=376.32ms p(99)=441.92ms p(99.99)=2.19s   
     iterations.....................: 8432   44.0193/s
     vus............................: 12     min=0     max=12
     vus_max........................: 12     min=0     max=12


running (3m11.6s), 00/12 VUs, 8432 complete and 0 interrupted iterations
simple ✓ [======================================] 12 VUs  3m0s

kallebysantos added a commit to kallebysantos/functions-js that referenced this pull request Aug 15, 2024