.NET: User Story: PyTorch & HuggingFace Custom Models Migration Story #9793
Comments
Hi @tjwald, thanks for this writeup and feedback. I'm curious, when you mentioned "The need for a tensor type", did you use https://learn.microsoft.com/en-us/dotnet/core/whats-new/dotnet-9/overview#tensort ?
@luisquintanilla I didn't see that the type existed - the tensor type provided in the ONNX package wasn't easy to use. I should also add that dotnet 9 is the first release in which I could actually implement our ML model, and it is a lot more performant (10X!!) than our python implementation!
I have now tried to use the Tensor Type provided in System.Numerics.Tensors and I wasn't able to adapt my POC to use it. These were the issues I ran into:
I expected to be able to SoftMax the tensor so that each row was softmaxed on its own, and then for IndexOfMax / Max to be applied to each row separately, returning a result per row.
Thanks for the detailed response.
In the coming year, integration with libraries in the ecosystem is one of the areas the team will be working on. This includes ONNX, TorchSharp, and ML.NET.
Other than SoftMax, are there other operations you'd be looking for? This is also an area we plan to simplify, so any additional feedback, beyond what you've already provided, would be appreciated.
Great to hear about the integration with libraries! In the example from my previous message I gave the three operations I used, but I am certain there are more.
cc: @tannergooding
name: PyTorch & HuggingFace Custom Models Migration Story
about: Making migration to dotnet easier for projects whose models were trained using the HuggingFace transformers library and PyTorch.
We created a POC using the new AI building blocks of dotnet 9, and want to point out pain points, opportunities to improve performance, and ways to enable easier migrations from python.
Background
My team is trying to cut costs in our production environment, and a third of our cost comes from custom ML servers that we have created.
Each ML model is wrapped in a FastAPI server. The model itself is called using the transformers library created by HuggingFace.
The models are trained and created by our research team, and we are responsible for making them run fast and cost less.
We need to host our own models due to algorithmic complexities surrounding the call to the model itself - for example repeated calls to the model during the same user request, data locality optimizations for combining several models for the same request and more.
To reduce costs and improve performance we migrated to ONNX (while still using python) and saw an improvement, but we still weren't able to fully utilize the GPU, and we feel we have reached the limit of what our python server can handle in terms of concurrent requests.
This requires us to spin up multiple pods for the same service to deal with the load.
As soon as dotnet 9 came out with the new AI infrastructure and building blocks, I created a POC of our simplest model with the new libraries and was able to prove that moving to C# and dotnet can improve our GPU utilization, throughput, and latency.
This was difficult.
There was no clear migration guide for this scenario, which was shocking given the importance of HuggingFace transformers for AI usage.
This POC required me to implement many things provided by the transformers library and 'fight' with the ONNX <-> Tokenizers libraries in dotnet.
Additional Context
We are a python backend team. I have some background in C# and Dotnet, but convincing management to migrate to dotnet is difficult, especially given the complexity of the code required to write an efficient server in C# for ML processing.
I spent a month migrating all of our models to ONNX and to a new architecture to improve performance; this only got us to 24K requests per minute. Using the C# POC I created, I was able to get to 200K requests per minute with substantially lower latency.
Request
Start a project to provide documentation, tools, and library features that make the migration from HuggingFace custom models simple and the end result performant.
Even where some of what I ask for below already exists, it isn't documented well enough for this migration to be easy.
I love dotnet and would love more applications and coding shops to use it.
Value To the Ecosystem of Dotnet
If dotnet wants more users to start using dotnet for AI applications, it needs to supply easy-to-use, performant migration paths from the largest AI ecosystem - HuggingFace transformers - especially for custom models and tokenizers.
This will enable R&D teams to take ML researchers' models and get them to production on a more efficient solution.
The following contains most of the suggestions / issues we encountered in our POC.
Tokenizers Enhancements
Using Custom Tokenizer Options
In the HuggingFace library, loading a custom tokenizer is as simple as a single `AutoTokenizer.from_pretrained("<tokenizer directory>")` call.
In the Microsoft.ML.Tokenizers library, this is more complex, making the migration harder.
There are two reasons why the migration is harder:
The best outcome would be a factory that can load the resources from disk and return a fully functional tokenizer, together with a simple migration guide or extension package that eases the migration from HuggingFace.
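To make the request concrete, here is a purely hypothetical sketch of the shape of API we are asking for - `PretrainedTokenizerFactory` and `FromPretrained` do not exist in Microsoft.ML.Tokenizers today; the idea is to mirror `AutoTokenizer.from_pretrained`:

```csharp
using System;
using Microsoft.ML.Tokenizers;

// Hypothetical API sketch -- this factory does NOT exist today.
// The idea: point it at the directory exported by HuggingFace (tokenizer.json,
// vocab/merges files, special tokens, casing options, ...) and get back a fully
// configured Tokenizer without the caller knowing which concrete type to build.
public static class PretrainedTokenizerFactory
{
    public static Tokenizer FromPretrained(string directory)
    {
        // Would inspect the tokenizer config files and pick the right concrete
        // tokenizer (BPE, WordPiece, SentencePiece, ...) with the right options.
        throw new NotImplementedException("Illustrative only.");
    }
}

// Desired usage, matching the one-liner we have in Python:
// Tokenizer tokenizer = PretrainedTokenizerFactory.FromPretrained("./our-custom-tokenizer");
```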
Token Id Type
We should be able to specify that the output ids should be long instead of int, since we had to cast each int to a long because that is what the model took as input.
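For illustration, this is roughly the conversion we pay for on every call today - a minimal sketch assuming `Tokenizer.EncodeToIds` (which returns int ids) and a model that expects int64 inputs:

```csharp
using Microsoft.ML.Tokenizers;

static long[] EncodeToLongIds(Tokenizer tokenizer, string text)
{
    // EncodeToIds gives us int token ids, but the ONNX model expects int64 (long),
    // so every call pays for an extra allocation and an element-by-element widening copy.
    IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
    var result = new long[ids.Count];
    for (int i = 0; i < ids.Count; i++)
        result[i] = ids[i];
    return result;
}
```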
Batch Tokenization
We optimized our models to use a lot of batch processing - both pre-batched and dynamic batching.
To support this, I had to write a wrapper for the Microsoft.ML.Tokenizers Tokenizer class that performed this batch tokenization.
The current interface of the Tokenizer requires me to allocate an array for each tokenization call.
In addition, I then allocate an array for the batch to hold all of these per-sentence arrays, and then copy them into a 2-dimensional array so the model can process them.
This is a lot of allocation and copying that could be avoided by supporting batching natively.
In addition, adding an overload that lets us pass in the output buffer would help reduce allocations and increase performance by pooling these buffers.
This shows that batch tokenization should be a feature of the tokenizer, not handwritten by the user, and that with minimal changes to the signature it could be more performant.
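A rough sketch of the kind of wrapper described above, assuming `Tokenizer.EncodeToIds`, zero as the padding id, and simple truncation to a fixed `maxLength` (all simplifications):

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

static long[,] EncodeBatch(Tokenizer tokenizer, IReadOnlyList<string> sentences,
                           int maxLength, long padId = 0)
{
    // One allocation per sentence...
    var perSentence = new IReadOnlyList<int>[sentences.Count];
    for (int i = 0; i < sentences.Count; i++)
        perSentence[i] = tokenizer.EncodeToIds(sentences[i]);

    // ...plus a second pass that copies (and widens to long) into the 2-D layout
    // the model expects. Native batch support with a caller-supplied output buffer
    // would let us pool this memory instead of reallocating it per batch.
    var batch = new long[sentences.Count, maxLength];
    for (int i = 0; i < sentences.Count; i++)
    {
        var ids = perSentence[i];
        int length = Math.Min(ids.Count, maxLength);
        for (int j = 0; j < length; j++)
            batch[i, j] = ids[j];
        for (int j = length; j < maxLength; j++)
            batch[i, j] = padId;
    }
    return batch;
}
```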
Context Tokenization
In the HuggingFace library, you can tokenize a sentence together with a given context simply by passing both to the tokenizer, e.g. `tokenizer(sentence, context)`. This is also supported in batch form.
Migrating from HuggingFace to dotnet would require understanding the underpinnings of this tokenization method, and would complicate the project enough to make the transition not "worth it" on the maintenance side.
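A simplified sketch of what "understanding the underpinnings" means in practice - hand-building the pair encoding ourselves. The `[CLS]`/`[SEP]` ids passed in here, and the missing token-type ids and truncation rules, are assumptions that vary per tokenizer, which is exactly the part we would rather not own:

```csharp
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Hand-rolled "sentence pair" encoding: [CLS] sentence [SEP] context [SEP].
static List<long> EncodeWithContext(Tokenizer tokenizer, string sentence, string context,
                                    long clsId, long sepId)
{
    var ids = new List<long> { clsId };
    foreach (int id in tokenizer.EncodeToIds(sentence)) ids.Add(id);
    ids.Add(sepId);
    foreach (int id in tokenizer.EncodeToIds(context)) ids.Add(id);
    ids.Add(sepId);
    return ids;
}
```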
Tensors
The need for a tensor type
We have some models that use 2-dimensional tensors as input, and some that use 3-dimensional tensors.
All of our models return 2d tensors, where the first dimension is the batch size, and the second is the actual result for each item in the batch.
Trying to get this working with arrays / Memory2D from the CommunityToolkit.HighPerformance package helped, but it is cumbersome to use. Also, there is no Memory3D or MemoryND.
In python we have numpy.ndarray, which lets the user specify the shape of the tensor and change the shape as needed.
For example, we can batch tokenize 20 sentences where the model needs a 5x4x512 tensor, representing a batch of 5 items with 4 sentences each and up to 512 tokens per sentence.
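In numpy this is a single allocation-free `reshape((5, 4, 512))`. Below is a rough sketch, assuming plain C# arrays, of the hand-written index arithmetic we carry around instead:

```csharp
// 20 tokenized sentences flattened into one buffer that we want to treat as a
// 5 x 4 x 512 tensor: batch of 5 items, 4 sentences per item, 512 tokens per sentence.
const int Batch = 5, SentencesPerItem = 4, TokensPerSentence = 512;
long[] flat = new long[Batch * SentencesPerItem * TokensPerSentence];

// Without an allocation-free N-D view we either copy into a long[5,4,512] array
// or carry this index arithmetic around everywhere the buffer is used:
long Get(int b, int s, int t) => flat[(b * SentencesPerItem + s) * TokensPerSentence + t];
```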
We should be able to create a view of the underlying data with the new shape without allocations. This is not possible with arrays, and the higher dimensionality isn't easy using Memory.
I am also aware that there are dense and sparse tensors, but we only use dense tensors so I can't give any input here.
This should be considered as part of the design of a Tensor Type.
This tensor type should be compatible with, and easy and efficient to use for, connecting the tokenizer output to the model.
Tensor primitives
All of our models use SoftMax on the output of the model before using the output.
To do this for a batch I used Memory2D for the model output, and then had to loop over each row in the result and call TensorPrimitives.SoftMax to get the result.
I am sure there is a more efficient way to do this that is also simple to use. If there were a tensor type, then calling SoftMax on the tensor should run the equivalent of SoftMax over each "row" of the last dimension (or take a parameter specifying which dimension to use).
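A sketch of the per-row loop described above over a flat `[batchSize, classCount]` buffer, using `TensorPrimitives.SoftMax` and `TensorPrimitives.IndexOfMax` from System.Numerics.Tensors; a batched or axis-aware overload on a tensor type would remove the manual slicing:

```csharp
using System;
using System.Numerics.Tensors;

// logits: model output flattened row-major as [batchSize, classCount].
static int[] Classify(ReadOnlySpan<float> logits, int batchSize, int classCount)
{
    var predictions = new int[batchSize];
    // classCount is assumed small enough for the stack.
    Span<float> probabilities = stackalloc float[classCount];

    for (int row = 0; row < batchSize; row++)
    {
        // Slice out one row and softmax it on its own...
        ReadOnlySpan<float> rowLogits = logits.Slice(row * classCount, classCount);
        TensorPrimitives.SoftMax(rowLogits, probabilities);

        // ...then take the arg-max of that row.
        predictions[row] = TensorPrimitives.IndexOfMax(probabilities);
    }
    return predictions;
}
```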
Putting this all Together
In the transformers library there are simple-to-use pipelines that enable users to solve a specific task.
For example, the TextClassificationPipeline enables users to tokenize and then classify the text to a set of given labels.
The pipeline takes a batch of sentences, runs the tokenization, runs the model on the tokens, and then returns the label for each sentence along with the logits for each input.
There is no simple to use equivalent pipeline in dotnet.
To make it worse, the ONNX library uses a custom Tensor type and OrtValues that aren't easily created and are very confusing to get right, and a RunAsync method that isn't thread safe as far as I can tell.
I wrote my own pipeline for one of the tasks we need, but this makes the transition from python to dotnet very hard, and also very error prone.
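For comparison, this is roughly the shape of the pipeline we had to write ourselves - a hedged sketch using Microsoft.ML.OnnxRuntime and Microsoft.ML.Tokenizers, where the input/output names `input_ids`, `attention_mask`, and `logits` are assumptions that depend on how the model was exported, and batching, pooling, and error handling are omitted:

```csharp
using System;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using Microsoft.ML.Tokenizers;
using System.Numerics.Tensors;

public sealed class TextClassificationPipeline : IDisposable
{
    private readonly Tokenizer _tokenizer;
    private readonly InferenceSession _session;
    private readonly string[] _labels;

    public TextClassificationPipeline(Tokenizer tokenizer, string modelPath, string[] labels)
    {
        _tokenizer = tokenizer;
        _session = new InferenceSession(modelPath);
        _labels = labels;
    }

    public string Classify(string text)
    {
        // 1. Tokenize (int ids widened to the int64 the exported model expects).
        long[] ids = _tokenizer.EncodeToIds(text).Select(i => (long)i).ToArray();
        long[] mask = Enumerable.Repeat(1L, ids.Length).ToArray();
        int[] shape = { 1, ids.Length };

        var inputs = new[]
        {
            NamedOnnxValue.CreateFromTensor("input_ids", new DenseTensor<long>(ids, shape)),
            NamedOnnxValue.CreateFromTensor("attention_mask", new DenseTensor<long>(mask, shape)),
        };

        // 2. Run the model and pull out the logits (the "logits" name is an assumption).
        using var results = _session.Run(inputs);
        float[] logits = results.First(r => r.Name == "logits").AsEnumerable<float>().ToArray();

        // 3. Softmax + arg-max to pick the label.
        var probabilities = new float[logits.Length];
        TensorPrimitives.SoftMax(logits, probabilities);
        return _labels[TensorPrimitives.IndexOfMax(probabilities)];
    }

    public void Dispose() => _session.Dispose();
}
```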
Bonus
Since we can't use a solution like the NVIDIA Triton server or other hosted solutions for AI models, we had to write our own inference orchestration to manage batching and parallel processing of requests within a certain time window. This is very difficult to manage and would be better done by a dedicated solution (for example, we don't monitor memory usage to see whether we can fit more models on the same GPU at the same time).
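To make this concrete, here is a stripped-down sketch of the kind of time-window micro-batching loop we ended up maintaining ourselves, built on System.Threading.Channels; cancellation, error handling, and GPU memory accounting are all omitted, and `runBatch` stands in for whatever actually calls the model:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// One queued item: the input plus a way to hand the result back to the waiting request.
public sealed record PendingRequest(string Text, TaskCompletionSource<string> Completion);

public sealed class MicroBatcher
{
    private readonly Channel<PendingRequest> _queue = Channel.CreateUnbounded<PendingRequest>();
    private readonly Func<IReadOnlyList<string>, Task<IReadOnlyList<string>>> _runBatch;
    private readonly int _maxBatchSize;
    private readonly TimeSpan _window;

    public MicroBatcher(Func<IReadOnlyList<string>, Task<IReadOnlyList<string>>> runBatch,
                        int maxBatchSize, TimeSpan window)
        => (_runBatch, _maxBatchSize, _window) = (runBatch, maxBatchSize, window);

    public async Task<string> EnqueueAsync(string text)
    {
        var pending = new PendingRequest(text, new TaskCompletionSource<string>());
        await _queue.Writer.WriteAsync(pending);
        return await pending.Completion.Task;
    }

    // Background loop: gather whatever arrives within the time window (up to a max
    // batch size), run the model once for the whole batch, then fan the results out.
    public async Task RunAsync()
    {
        while (await _queue.Reader.WaitToReadAsync())
        {
            var batch = new List<PendingRequest>();
            var deadline = Task.Delay(_window);
            while (batch.Count < _maxBatchSize && !deadline.IsCompleted)
            {
                if (_queue.Reader.TryRead(out var item)) batch.Add(item);
                else await Task.WhenAny(deadline, _queue.Reader.WaitToReadAsync().AsTask());
            }

            var results = await _runBatch(batch.ConvertAll(p => p.Text));
            for (int i = 0; i < batch.Count; i++)
                batch[i].Completion.TrySetResult(results[i]);
        }
    }
}
```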