Update docs to v0.0.80 #133

Merged
merged 2 commits on Sep 17, 2024

17 changes: 9 additions & 8 deletions docs/deployment/self-deployment/overview.mdx
@@ -4,13 +4,14 @@ title: Self-deployment
slug: overview
---

-Mistral AI provides ready-to-use Docker images on the Github registry. The weights are distributed separately.
+Mistral AI models can be self-deployed on your own infrastructure through various
+inference engines. We recommend using [vLLM](https://vllm.readthedocs.io/), a
+highly-optimized Python-only serving framework which can expose an OpenAI-compatible
+API.

-To run these images, you need a cloud virtual machine matching the requirements for a given model. These requirements can be found in the [model description](/getting-started/models).
+Other inference engine alternatives include
+[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and
+[TGI](https://huggingface.co/docs/text-generation-inference/index).

-We recommend three different serving frameworks for our models :
-- [vLLM](https://vllm.readthedocs.io/): A python only serving framework which deploys an API matching OpenAI's spec. vLLM provides paged attention kernel to improve serving throughput.
-- NVidias's [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) served with Nvidia's [Triton Inference Server](https://github.com/triton-inference-server) : TensorRT-LLM provides a DSL to build fast inference engines with dedicated kernels for large language models. Triton Inference Server allows efficient serving of these inference engines.
-- [TGI](https://huggingface.co/docs/text-generation-inference/index): A toolkit for deploying LLMs, including OpenAI's spec, grammars, production monitoring, and tools functionality.

-These images can be run locally, or on your favorite cloud provider, using [SkyPilot](https://skypilot.readthedocs.io/en/latest/).
+You can also leverage specific tools to facilitate infrastructure management, such as
+[SkyPilot](https://skypilot.readthedocs.io) or [Cerebrium](https://www.cerebrium.ai).
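The updated overview recommends vLLM because it can expose an OpenAI-compatible API. As a minimal sketch (not part of this docs change), the example below assumes a vLLM server is already running locally on port 8000 and serving a Mistral model; the model name, port, and placeholder API key are illustrative assumptions.

```python
# Minimal sketch: querying a locally running vLLM server through its
# OpenAI-compatible API. Assumes the server was started separately, e.g.:
#   vllm serve mistralai/Mistral-7B-Instruct-v0.3 --port 8000
# Model name, port, and the API key placeholder are illustrative assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # vLLM's OpenAI-compatible endpoint
    api_key="not-needed-for-local-vllm",   # placeholder; a local server typically ignores it
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Summarize what self-deployment means."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because the API is OpenAI-compatible, any OpenAI client can be pointed at the same `base_url`, which is the main practical benefit of the vLLM recommendation.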
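The overview also points to SkyPilot for infrastructure management. Below is a rough sketch of launching such a deployment through SkyPilot's Python API (`sky.Task`, `sky.Resources`, `sky.launch`); the accelerator choice, model, and setup/run commands are assumptions for illustration, not taken from the docs.

```python
# Rough sketch: provisioning a cloud VM with SkyPilot and starting a vLLM
# server on it. Accelerator, model name, and commands are illustrative
# assumptions; adjust them to the requirements of the model you deploy.
import sky

task = sky.Task(
    setup="pip install vllm",
    run=(
        "vllm serve mistralai/Mistral-7B-Instruct-v0.3 "
        "--host 0.0.0.0 --port 8000"
    ),
)
task.set_resources(sky.Resources(accelerators="A100:1"))

# Provisions a VM on the configured cloud, runs the setup step, then starts
# the long-running serve command on the cluster named below.
sky.launch(task, cluster_name="mistral-vllm")
```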