feat: add support for custom ai models | Pipeline API RFC #368
Conversation
cc @laktek |
This is amazing! Thanks for contributing @kallebysantos. Please allow us a couple of days to go through the changes and get back to you on how to proceed. |
Hi guys!! |
@kallebysantos I started experimenting with your branch locally and have a few questions I'd like to discuss with you (more product related than code related). Can you please drop me an email at lakshan [at] supabase [dot] com? |
Yes, sure 🙏 |
@kallebysantos Just wanted to let you know I'll be on leave for 4 weeks starting from next week. We'll make a decision on how to proceed with this after I return. Meanwhile, other contributors will review and test the PR. |
Hello @kallebysantos 😄 Since @laktek is on leave, I'm going to help with the review of this PR. (If it's okay with you, I hope you will allow me to push changes to PR to speed up reviews. 😋) |
Hi @nyannyacha, it sounds ok to me, feel free to help 🤗 |
I don't know what error code your container terminated with, but we've seen containers terminate due to SIGSEGV in addition to containers terminating due to system memory or GPU memory exhaustion. If it also happened to your container, it was probably a problem with:

```rust
pub(crate) fn get_onnx_env() -> Lazy<Option<ort::Error>> {
    Lazy::new(|| {
        // Create the ONNX Runtime environment, for all sessions created in this process.
        // TODO: Add CUDA execution provider
        if let Err(err) = ort::init().with_name("SB_AI_ONNX").commit() {
            error!("sb_ai: failed to create environment - {}", err);
            return Some(err);
        }
        None
    })
}
```

If I remember correctly, `ort::init()...commit()` assigns `ort::Environment` to a global static variable (`G_ENV`), so there shouldn't be any further assignments to that variable after this. However, since `get_onnx_env` transfers ownership of the return value to a local variable on the caller's stack rather than assigning it to a static variable, we can expect `ort::init()...commit()` to be called every time this function is called.
I've had similar thoughts to yours before, and I've pushed some commits for this to my fork before this review. (This fork branch also contains commits for a few individual things within the scope of this PR that seem to need improvement.) Could you take a look? https://github.com/nyannyacha/edge-runtime/commits/ai |
I'd replicate you |
That failure is related to the last commit I pushed to this PR. |
The GitHub test action does not have the ONNX runtime library, so the panic is caused by the `ctor` macro trying to initialize the ONNX environment.
Hi @nyannyacha, I followed your code changes in order to implement the session optimization.
I think that the next step is to look for some way to drop unused sessions. Also, I'll need to move further and add more kinds of pipelines. |
Hello @kallebysantos 😉
When you say the session optimization, do you mean this commit?
Glad to hear that the commit worked for you. 😁
Hmm... this is a little weird, can you find out what error code crashed the container? If it was caused by an error like SIGSEGV, that's serious and we should definitely look for it. Conversely, if it was caused by an OOM (out of memory), that makes sense since you said your machine is weak. ...However, it's hard to think of a case where the edge runtime crashes simply because CPU utilization reaches max. 🧐
Yeah, this is a good idea. I looked at the two commits you linked to and they look fine.
Since the
I don't have any immediate thoughts on this, but I think it would take a lot of time to merge if we add a lot of changes to this PR, so it might not be a bad idea to start this in another PR after this one is merged. 😁 |
Hello @nyannyacha, thanks for your considerations 🙏.
I can't tell what error it is; the only thing that happens is the following.

Start processing pending sections

```sql
select count(*) from document_sections
where (ai).embedding is null;
-- Pending sections count: 17,771

select count(*) from net.failed_requests
-- Nothing here

select private.apply_embed_document_sections();
-- Start performing requests in batches of 5
```

Server handling a lot of requests

We can also see that just 1 session is initialized and then reused. 🤗

Container crash

Then, after a while, the container just dies. These were the final results after it exited:

```sql
select count(*) from document_sections
where (ai).embedding is null;
-- Pending sections count: 14,578

select count(*) from net.failed_requests
-- Failed requests: 1,298
```

Since my compose will always restart the container, it keeps processing after the reboot. And I have some |
I'm confused that the container has an exit code of 0 😵‍💫 This would indicate that the edge runtime process was gracefully terminated, but your screenshot describes the exact opposite. Can you try the command below immediately after the container crashes to see if it was OOM-killed?

```
docker container inspect <edge functions container id>
```

Example output:

```
...
"State": {
    "Status": "running",
    "Running": true,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": false, // 👈
...
```
|
Hey @kallebysantos, I added a few more commits. Previously, the exit code was returned as 0 because the edge runtime consumed the received unix signal and did not expose it outside the process. Please test it in your environment and let me know how it goes. 😋 |
force-pushed from fa35dc8 to 4a99525
Note: This PR includes improvements to some bottlenecks for the inference task.

Load Testing (main vs PR-368)

Hardware Information

```
Hardware:
    Hardware Overview:
      Model Name: MacBook Air
      Model Identifier: MacBookAir10,1
      Chip: Apple M1
      Total Number of Cores: 8 (4 performance and 4 efficiency)
      Memory: 16 GB
```

Run # 1 (cold-start)
Run # 2 (warm-up)
main

```
vscode ➜ /workspaces/edge-runtime (main-k6) $ k6 run --summary-trend-stats "min,avg,med,max,p(95),p(99),p(99.99)" ./k6/dist/specs/gte.js
/\ |‾‾| /‾‾/ /‾‾/
/\ / \ | |/ / / /
/ \/ \ | ( / ‾‾\
/ \ | |\ \ | (‾) |
/ __________ \ |__| \__\ \_____/ .io
execution: local
script: ./k6/dist/specs/gte.js
output: -
scenarios: (100.00%) 1 scenario, 12 max VUs, 3m30s max duration (incl. graceful stop):
* simple: 12 looping VUs for 3m0s (gracefulStop: 30s)
✗ status is 200
↳ 98% — ✓ 4832 / ✗ 95
✗ request cancelled
↳ 1% — ✓ 95 / ✗ 4832
checks.........................: 50.00% ✓ 4927 ✗ 4927
data_received..................: 724 kB 3.8 kB/s
data_sent......................: 1.7 MB 8.7 kB/s
http_req_blocked...............: min=416ns avg=15.44µs med=2.2µs max=3.12ms p(95)=6.66µs p(99)=411.61µs p(99.99)=2.9ms
http_req_connecting............: min=0s avg=10.82µs med=0s max=3.04ms p(95)=0s p(99)=325.58µs p(99.99)=2.84ms
http_req_duration..............: min=84.68ms avg=439.51ms med=401.96ms max=1.47s p(95)=869.7ms p(99)=1.21s p(99.99)=1.47s
{ expected_response:true }...: min=84.68ms avg=427.74ms med=397.04ms max=1.47s p(95)=796.38ms p(99)=1.13s p(99.99)=1.47s
http_req_failed................: 1.92% ✓ 95 ✗ 4832
http_req_receiving.............: min=5.66µs avg=76.7µs med=50.45µs max=2.75ms p(95)=238.18µs p(99)=685.38µs p(99.99)=2.69ms
http_req_sending...............: min=2.37µs avg=18.9µs med=14.33µs max=6.3ms p(95)=33.61µs p(99)=76.57µs p(99.99)=3.78ms
http_req_tls_handshaking.......: min=0s avg=0s med=0s max=0s p(95)=0s p(99)=0s p(99.99)=0s
http_req_waiting...............: min=84.59ms avg=439.42ms med=401.92ms max=1.47s p(95)=869.66ms p(99)=1.21s p(99.99)=1.47s
http_reqs......................: 4927 25.568987/s
iteration_duration.............: min=84.87ms avg=441.96ms med=402.21ms max=11.39s p(95)=870.54ms p(99)=1.21s p(99.99)=6.5s
iterations.....................: 4927 25.568987/s
vus............................: 12 min=0 max=12
vus_max........................: 12 min=0 max=12
running (3m12.7s), 00/12 VUs, 4927 complete and 0 interrupted iterations
simple ✓ [======================================] 12 VUs  3m0s
```

PR-368

```
vscode ➜ /workspaces/edge-runtime (ai) $ k6 run --summary-trend-stats "min,avg,med,max,p(95),p(99),p(99.99)" ./k6/dist/specs/gte.js
/\ |‾‾| /‾‾/ /‾‾/
/\ / \ | |/ / / /
/ \/ \ | ( / ‾‾\
/ \ | |\ \ | (‾) |
/ __________ \ |__| \__\ \_____/ .io
execution: local
script: ./k6/dist/specs/gte.js
output: -
scenarios: (100.00%) 1 scenario, 12 max VUs, 3m30s max duration (incl. graceful stop):
* simple: 12 looping VUs for 3m0s (gracefulStop: 30s)
✗ status is 200
↳ 97% — ✓ 8252 / ✗ 180
✗ request cancelled
↳ 2% — ✓ 180 / ✗ 8252
checks.........................: 50.00% ✓ 8432 ✗ 8432
data_received..................: 1.2 MB 6.5 kB/s
data_sent......................: 2.8 MB 15 kB/s
http_req_blocked...............: min=500ns avg=12.4µs med=2.75µs max=7.95ms p(95)=7.06µs p(99)=198.65µs p(99.99)=3.91ms
http_req_connecting............: min=0s avg=7.61µs med=0s max=7.85ms p(95)=0s p(99)=124.92µs p(99.99)=3.84ms
http_req_duration..............: min=109.28ms avg=256ms med=245.92ms max=596.69ms p(95)=376.07ms p(99)=441.57ms p(99.99)=588.84ms
{ expected_response:true }...: min=109.28ms avg=256.11ms med=245.8ms max=596.69ms p(95)=376.26ms p(99)=441.66ms p(99.99)=589.01ms
http_req_failed................: 2.13% ✓ 180 ✗ 8252
http_req_receiving.............: min=8.54µs avg=86.39µs med=62.56µs max=10.49ms p(95)=134.2µs p(99)=463.11µs p(99.99)=5.89ms
http_req_sending...............: min=3.2µs avg=24.54µs med=16.5µs max=5.35ms p(95)=36.58µs p(99)=77.53µs p(99.99)=4.28ms
http_req_tls_handshaking.......: min=0s avg=0s med=0s max=0s p(95)=0s p(99)=0s p(99.99)=0s
http_req_waiting...............: min=109.21ms avg=255.89ms med=245.83ms max=596.62ms p(95)=375.97ms p(99)=441.43ms p(99.99)=588.74ms
http_reqs......................: 8432 44.0193/s
iteration_duration.............: min=109.44ms avg=257.51ms med=246.24ms max=10.78s p(95)=376.32ms p(99)=441.92ms p(99.99)=2.19s
iterations.....................: 8432 44.0193/s
vus............................: 12 min=0 max=12
vus_max........................: 12 min=0 max=12
running (3m11.6s), 00/12 VUs, 8432 complete and 0 interrupted iterations
simple ✓ [======================================] 12 VUs  3m0s
```
|
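For context, a k6 spec like the `gte.js` used above is typically built from a small TypeScript test such as the sketch below. The endpoint, payload, and check names here are assumptions for illustration, not the actual contents of the project's spec:

```typescript
// Hypothetical sketch of a k6 spec similar to ./k6/dist/specs/gte.js.
import http from 'k6/http';
import { check } from 'k6';
import type { Options } from 'k6/options';

export const options: Options = {
  scenarios: {
    simple: {
      executor: 'constant-vus',
      vus: 12,
      duration: '3m',
      gracefulStop: '30s',
    },
  },
};

export default function () {
  // Assumed local endpoint serving the gte-small embedding function.
  const res = http.post(
    'http://localhost:9000/gte',
    JSON.stringify({ input: 'The quick brown fox jumps over the lazy dog' }),
    { headers: { 'Content-Type': 'application/json' } },
  );

  check(res, {
    'status is 200': (r) => r.status === 200,
    'request cancelled': (r) => r.status === 0,
  });
}
```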
Adding the type defs to integrate with: supabase/edge-runtime#368
force-pushed from 3bb87d9 to 62187be
CLOSED: Proposal changed to PR [#430]
Git history moved to 368-add-suport-for-custom-ai-models
Original description
What kind of change does this PR introduce?
Feature, Enhancement

What is the current behavior?
As described in the following discussion, `edge-runtime` currently only allows the coupled `gte-small` as the `feature-extraction` model.

What is the new behavior?
This PR introduces a new Rust definition of a `transformers`-like API for `edge-runtime`. It allows different `ONNX` models to be installed at the same time, and contains the base API definitions to support other kinds of `pipelines` beyond just the `feature-extraction` task.

This new API introduces the `Pipeline` class, which is intended to be similar to `xenova/transformers` but backed by Rust. The `Pipeline` class aims to be an evolution of the current `Session` class, which only supports `gte-small` for feature extraction and `ollama` models. The new class defines APIs that allow multiple `tasks`: `feature-extraction` (implemented), `token-classification`, `sentiment-analysis` and others, all backed by `ort`.
Docker Image:
You can get a docker image of this from Docker Hub:

Pipeline usage:
Using the new `Pipeline` class is very simple, and it looks very similar to `Session`:

Using the default `gte-small`, pre-installed by the `Supabase` team:

Using a different model:
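As a rough sketch of how this could look in an edge function (the `Supabase.ai.Pipeline` global, the second model argument, and the `run` options below are assumptions modelled on the existing `Session` API, not necessarily the final surface):

```typescript
// Sketch only: the `Pipeline` class on the `Supabase.ai` global is an assumed shape.
declare const Supabase: {
  ai: {
    Pipeline: new (task: string, model?: string) => {
      run(input: string | string[], opts?: Record<string, unknown>): Promise<unknown>;
    };
  };
};

Deno.serve(async (req: Request) => {
  const { input } = await req.json();

  // Default variant, backed by the pre-installed gte-small model:
  const pipe = new Supabase.ai.Pipeline('feature-extraction');

  // A different installed model could be referenced by its folder name, e.g.:
  // const custom = new Supabase.ai.Pipeline('feature-extraction', 'hf-user/model-name');

  // Options mirror the current Session.run() call.
  const embedding = await pipe.run(input, { mean_pool: true, normalize: true });

  return Response.json({ embedding });
});
```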
Models folder architecture:
The models folder has changed a little to support more than one model/task installed at the same time. The new structure aims to be more organized and robust, based on the following scheme:

Models are installed in the `root` and then referenced as `symbolic links` in the `defaults` folder. The `defaults` folder maps the current default model to a specific task; symbolic links are used to reduce disk space and file duplication. The task and model names should match the folder names, to simplify model loading by the runtime.

So, in order to install a new model, you just need to `download` it to the models folder. Then, to change the default model of a specific task, you just need to change the symbolic link.

To simplify model installation, the `download_models.sh` script has been changed a little and now supports the `model name` as an input argument:

```bash
./scripts/download_models.sh "hf-user/model-name"
```
The Pipeline API:
The `pipeline` mod allows developers to create new kinds of tasks. Currently, just `feature-extraction` has been implemented. This architecture defines the base blocks to extend and add more support for `ai`.

In order to create new pipelines, developers must `impl` the `Pipeline` trait, like the following:

Feature Extraction implementation
GPU Support:
The `gpu` support allows `pipeline` inference on specialized hardware and is backed by CUDA. There is no configuration for the final user: just call the `Pipeline` as described before. However, in order to enable `gpu` inference, the `Dockerfile` was changed a little and now has two main `build stages` (which must be explicitly specified during `docker build`):

edge-runtime (CPU only):
This stage builds the default `edge-runtime`, where `ort::Session`s are loaded using the CPU.

edge-runtime-cuda (GPU/CPU):
This stage builds the default `edge-runtime` in an `nvidia/cuda` machine that allows loading using `GPU` or `CPU` (as fallback).

Using GPU image:
In order to use the `gpu` image, the `docker-compose` file must include the following properties for the `functions` service:

Final considerations:
There is a lot of work to do in order to have more kinds of `tasks`, as well as to improve the `Pipeline` API to avoid code duplication. But I think this PR brings a standard pattern to improve `AI` support in `edge-runtime`, reducing the costs of container warm-up and giving more flexibility to change and test different models.

Updates and tests:
Inference Performance
Since I made this PR, I have been testing it with my custom `docker image` and I found some performance issues.

In my `create_session` implementation I tried to follow the same structure as the original code from the Supabase team, but I found that sometimes the container crashes 🔥.

My theory is that `Session::builder()` is creating multiple `Session` instances instead of reusing them, but I don't know if it was supposed to do that. For an edge environment I think that's ok, since every request runs in a separate cold-start container. But for self-hosting it has a huge performance impact due to CPU utilization issues. I would appreciate it if someone could help me with that.

GPU Support (check ort-gpu-runtime)
In the previous section I reported some CPU performance issues, so I started to look into GPU utilization.

GPU inference setup
So I moved on with the following changes in the `create_session` function:

Then I updated the docker image with `CUDA` support:

Dockerfile code

Embedding from DB
With the GPU container started, I tried to embed all the missing `document_sections` of my database with the following function:

code

GPU performance
Then I figured out that my theory was true: watching `nvidia-smi`, I could see that GPU memory utilization was overflowed by multiple model instances, and after some time I needed to restart the container.

logs
What kind of change does this PR introduce?
Feature, Enhancement

What is the current behavior?
As described in the following discussion, `edge-runtime` currently only allows the coupled `gte-small` as the `feature-extraction` model.

What is the new behavior?
This PR introduces a new Rust definition of a `transformers`-like API for `edge-runtime`. It allows different `ONNX` models to be installed at the same time, and contains the base API definitions to support other kinds of `pipelines` beyond just the `feature-extraction` task.

This new API introduces the `Pipeline` class, which is intended to be similar to `xenova/transformers` but backed by Rust. The `Pipeline` class aims to be an evolution of the current `Session` class, which only supports `gte-small` for feature extraction and `ollama` models. The new class defines APIs that allow multiple `tasks` like `feature-extraction`, `sentiment-analysis` and others, all backed by `ort`.

Tester docker image:
You can get a docker image of this PR from Docker Hub:

Pipeline usage:
Using the new `Pipeline` class is very simple, and it looks very similar to `Session`:

Using the default `gte-small`, pre-installed by the `Supabase` team:

Using a different model:
Custom models can be used as a `variant` of some `task`. The pipeline will `auto-fetch` them at runtime.

Then use it as a task variant:
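For illustration only, a task variant might be addressed like this (the variant model name and the way it is passed to the constructor are assumptions):

```typescript
// Sketch: using a custom model as a variant of the feature-extraction task.
// The assumed second constructor argument names the variant; the runtime
// would auto-fetch and cache the model assets on first use.
declare const Supabase: {
  ai: {
    Pipeline: new (task: string, variant?: string) => {
      run(input: string, opts?: Record<string, unknown>): Promise<unknown>;
    };
  };
};

const pipe = new Supabase.ai.Pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

const embedding = await pipe.run('The model is fetched and cached at runtime', {
  mean_pool: true,
  normalize: true,
});

console.log(embedding);
```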
Available tasks:
At this point there are 3 kinds of `tasks` implemented in this PR; many others will come later.

Feature extraction, see details here
The default variant model for this task is Supabase/gte-small.
Also, `supabase-gte` and `gte-small` are aliases for `feature-extraction` in the default variant.

Example 1: Instantiate a pipeline using the `Pipeline` class.
Example 2: Batch inference, processing multiple inputs in parallel.
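A hedged sketch of what these two examples might look like (the `Supabase.ai.Pipeline` shape, the `run` options, and batch calling by passing an array are assumptions):

```typescript
// Sketch of the feature-extraction examples; the API shape is assumed.
declare const Supabase: {
  ai: {
    Pipeline: new (task: string, variant?: string) => {
      run(
        input: string | string[],
        opts?: Record<string, unknown>,
      ): Promise<number[] | number[][]>;
    };
  };
};

// Example 1 (sketch): instantiate the default feature-extraction pipeline.
// 'supabase-gte' or 'gte-small' could be used as aliases for the same default variant.
const pipe = new Supabase.ai.Pipeline('feature-extraction');
const embedding = await pipe.run('Hello world', { mean_pool: true, normalize: true });

// Example 2 (sketch): batch inference, several inputs processed in parallel.
const embeddings = await pipe.run(
  ['The cat sits on the mat', 'The dog plays in the garden'],
  { mean_pool: true, normalize: true },
);

console.log(embedding, embeddings);
```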
Text classification, see details here
The default variant model for this task is Xenova/distilbert-base-uncased-finetuned-sst-2-english.
Also, `sentiment-analysis` is a name alias for `text-classification`.

Example 1: Instantiate a pipeline using the `Pipeline` class.
Example 2: Batch inference, processing multiple inputs in parallel.
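A sketch of the text-classification examples, with an output shape borrowed from transformers.js conventions (both the alias usage and the `{ label, score }` output are assumptions):

```typescript
// Sketch of the text-classification examples; names and output shape are assumed.
declare const Supabase: {
  ai: {
    Pipeline: new (task: string, variant?: string) => {
      run(input: string | string[]): Promise<unknown>;
    };
  };
};

// Example 1 (sketch): 'sentiment-analysis' as an alias of 'text-classification'.
const classifier = new Supabase.ai.Pipeline('sentiment-analysis');
const single = await classifier.run('I love Supabase Edge Functions!');
// e.g. { label: 'POSITIVE', score: 0.99 }

// Example 2 (sketch): batch inference over several inputs in parallel.
const batch = await classifier.run([
  'This release looks great',
  'The container keeps crashing',
]);

console.log(single, batch);
```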
Zero shot classification, see details here
The default variant model for this task is Xenova/distilbert-base-uncased-mnli.

Example 1: Instantiate a pipeline using the `Pipeline` class.
Example 2: Handling multiple correct labels.
Example 3: Custom hypothesis template.
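A sketch of the zero-shot-classification examples (the `candidate_labels`, `multi_label`, and `hypothesis_template` option names follow transformers.js conventions and are assumptions here):

```typescript
// Sketch of the zero-shot-classification examples; option names are assumed.
declare const Supabase: {
  ai: {
    Pipeline: new (task: string, variant?: string) => {
      run(input: string, opts?: Record<string, unknown>): Promise<unknown>;
    };
  };
};

const zeroShot = new Supabase.ai.Pipeline('zero-shot-classification');

// Example 1 (sketch): classify against a set of candidate labels.
const out = await zeroShot.run('I just bought a new GPU for inference', {
  candidate_labels: ['hardware', 'cooking', 'sports'],
});

// Example 2 (sketch): allow multiple correct labels.
const multi = await zeroShot.run('The API is fast and well documented', {
  candidate_labels: ['performance', 'documentation', 'pricing'],
  multi_label: true,
});

// Example 3 (sketch): custom hypothesis template.
const custom = await zeroShot.run('Deploying the cuda image on a self-hosted box', {
  candidate_labels: ['deployment', 'billing'],
  hypothesis_template: 'This text is about {}.',
});

console.log(out, multi, custom);
```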
The cache folder:
By default, all task assets will be cached to `ort-sys::internal::dirs::cache_dir()` using the following folder structure:

GPU Support:
The `gpu` support allows `pipeline` inference on specialized hardware and is backed by CUDA. There is no configuration for the final user: just call the `Pipeline` as described before. However, in order to enable `gpu` inference, the `Dockerfile` now has two main `build stages` (which must be explicitly specified during `docker build`):

edge-runtime (CPU only):
This stage builds the default `edge-runtime`, where `ort::Session`s are loaded using the CPU.

edge-runtime-cuda (GPU/CPU):
This stage builds the default `edge-runtime` in an `nvidia/cuda` machine that allows loading using `GPU` or `CPU` (as fallback).

Each stage needs to install the appropriate `onnx-runtime`. For that, `install_onnx.sh` has been updated with a 4th parameter flag `--gpu`, which will download a `cuda` version from the official `microsoft/onnxruntime` repository.

Using GPU image:
In order to use the `gpu` image, the `docker-compose` file must include the following properties for the `functions` service:

Final considerations:
There is a lot of work to do in order to have more kinds of `tasks`, as well as to improve the `Pipeline` API. But I think this PR brings a standard pattern to improve `AI` support in `edge-runtime`, reducing the costs of container warm-up and giving more flexibility to change and test different models.

Finally, thanks to @nyannyacha, who helped me a lot 🙏