A Terraform module to productionize MLflow on top of AWS.
This Terraform module lets you deploy a cluster of MLflow servers + UI using:
- ECS Fargate as the compute engine
- Amazon Aurora Serverless as the backend store
- S3 as the default artifact root
When designing this module, we've made some decisions about technologies and configuration that might not apply to all use cases. In doing so, we've applied the following principles, in this order:
- High availability and recovery. All components are meant to be highly available and provide backups so that important data can be recovered in case of a failure. Automated database backups are enabled, and versioning is enabled for the S3 bucket (see the sketch after this list).
- Least privilege. We've created dedicated security groups and IAM roles, and restricted traffic/permissions to the minimum necessary to run MLflow.
- Minimal maintenance overhead. We've chosen serverless technologies like Fargate and Aurora Serverless to minimize the cost of owning an MLflow cluster.
- Minimal cost overhead. We've tried to choose technologies that minimize costs, under the assumption that MLflow will be an internal tool, used during working hours and placing a very light load on the database.
- Private by default. As of version 1.9.1, MLflow doesn't provide native authentication/authorization mechanisms. When using the default values, the module will create resources that are not exposed to the Internet. Moreover, the module provides server-side encryption for the S3 bucket and the database through different KMS keys.
- Flexibility. Where possible, we've tried to make this module usable under different circumstances. For instance, you can use it to deploy MLflow to a private VPC and access it over a VPN, or you can leverage the ALB's integration with Cognito/OIDC to let users access MLflow through your SSO solution.
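To make the "high availability" and "private by default" points concrete, the snippet below sketches the kind of configuration the module applies to the artifact bucket: versioning plus server-side encryption with a dedicated KMS key. This is an illustration only (the resource names are made up, not the module's internals), and it uses the standalone versioning/encryption resources from AWS provider v4+; the module itself may configure these differently.

```hcl
# Illustration only: roughly the bucket hardening the module applies by default.
resource "aws_kms_key" "artifacts" {
  description = "Key used to encrypt MLflow artifacts at rest"
}

resource "aws_s3_bucket" "artifacts" {
  bucket = "mlflow-team-x-artifacts"
}

# Versioning lets you recover artifacts that were overwritten or deleted.
resource "aws_s3_bucket_versioning" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Server-side encryption with a dedicated KMS key ("private by default").
resource "aws_s3_bucket_server_side_encryption_configuration" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.artifacts.arn
    }
  }
}
```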
The following diagram illustrates the components the module creates with the default configuration:
To use this module, you can simply:
module "mlflow" {
source = "glovo/mlflow/aws"
version = "1.0.0"
unique_name = "mlflow-team-x"
vpc_id = "my-vpc"
load_balancer_subnet_ids = ["public-subnet-az-1", "public-subnet-az-2", "public-subnet-az-3"]
load_balancer_ingress_cidr_blocks = ["192.0.2.0/24"]
service_subnet_ids = ["private-subnet-az-1", "private-subnet-az-2", "private-subnet-az-3"]
database_subnet_ids = ["db-private-subnet-az-1", "db-private-subnet-az-2", "db-private-subnet-az-3"]
database_password_secret_name = "mlflow-team-x-db-password-arn"
}
You can find a more complete usage example in terratest/examples/main.tf.
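The `database_password_secret_name` input refers to a secret that presumably has to exist before you apply the module. Assuming it lives in AWS Secrets Manager (ECS can inject Secrets Manager and SSM secrets as environment variables), creating it could look roughly like the sketch below; the resource and variable names are placeholders, not part of the module's interface.

```hcl
# Hypothetical example: the secret holding the database password, created outside the module.
variable "mlflow_db_password" {
  type      = string
  sensitive = true
}

resource "aws_secretsmanager_secret" "mlflow_db_password" {
  name = "mlflow-team-x-db-password"
}

resource "aws_secretsmanager_secret_version" "mlflow_db_password" {
  secret_id     = aws_secretsmanager_secret.mlflow_db_password.id
  secret_string = var.mlflow_db_password
}
```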
Note that you may also:
- Add sidecar containers (e.g. a Datadog agent for Fargate)
- Provide your own bucket/path as the default artifact root
- Attach an autoscaling policy to the service (for instance, you may scale down to 0 instances during the night, as in the sketch after this list)
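As an example of the last point, a scheduled scale-down outside working hours could look roughly like the sketch below. This is not part of the module: it assumes the module exposes the ECS cluster and service names as outputs (the output names here are made up; check the module's outputs for the real ones), and the cron expressions are just examples.

```hcl
resource "aws_appautoscaling_target" "mlflow" {
  service_namespace  = "ecs"
  # Hypothetical output names; check the module's documentation for the real ones.
  resource_id        = "service/${module.mlflow.ecs_cluster_name}/${module.mlflow.ecs_service_name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 0
  max_capacity       = 2
}

# Scale the service down to zero every evening...
resource "aws_appautoscaling_scheduled_action" "night_scale_down" {
  name               = "mlflow-night-scale-down"
  service_namespace  = aws_appautoscaling_target.mlflow.service_namespace
  resource_id        = aws_appautoscaling_target.mlflow.resource_id
  scalable_dimension = aws_appautoscaling_target.mlflow.scalable_dimension
  schedule           = "cron(0 20 * * ? *)"

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

# ...and back up in the morning.
resource "aws_appautoscaling_scheduled_action" "morning_scale_up" {
  name               = "mlflow-morning-scale-up"
  service_namespace  = aws_appautoscaling_target.mlflow.service_namespace
  resource_id        = aws_appautoscaling_target.mlflow.resource_id
  scalable_dimension = aws_appautoscaling_target.mlflow.scalable_dimension
  schedule           = "cron(0 7 * * ? *)"

  scalable_target_action {
    min_capacity = 1
    max_capacity = 2
  }
}
```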
A few additional notes:
- This module only supports this Docker image. The reason is that we need to inject the database password as a secret environment variable, which can only be injected into the ECS task definition by overriding the entrypoint and making a number of assumptions about how the base image was built.
- By default, the load balancer is internal. This is because, as of v1.9.1, MLflow doesn't have native authentication or authorization. We recommend exposing MLflow behind a VPN, or using OIDC/Cognito together with the load balancer listener (see the sketch after this list).
- On the roadmap: a PR to MLflow to accept BACKEND_STORE_URI as an environment variable, which would allow selecting a different container image.
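As an illustration of the OIDC/Cognito option, an HTTPS listener that authenticates users before forwarding traffic to MLflow could look roughly like the sketch below. It assumes you manage the listener yourself, that the module exposes the load balancer and target group ARNs as outputs, and that the Cognito user pool, client, domain and ACM certificate are defined elsewhere; all of those names are placeholders, not the module's actual interface.

```hcl
resource "aws_lb_listener" "mlflow_https" {
  # Hypothetical output names; check the module's documentation for the real ones.
  load_balancer_arn = module.mlflow.load_balancer_arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = var.certificate_arn

  # Authenticate against a Cognito user pool before reaching MLflow.
  default_action {
    type  = "authenticate-cognito"
    order = 1

    authenticate_cognito {
      # These Cognito resources are assumed to be defined elsewhere.
      user_pool_arn       = aws_cognito_user_pool.sso.arn
      user_pool_client_id = aws_cognito_user_pool_client.sso.id
      user_pool_domain    = aws_cognito_user_pool_domain.sso.domain
    }
  }

  # Only authenticated requests are forwarded to the MLflow service.
  default_action {
    type             = "forward"
    order            = 2
    target_group_arn = module.mlflow.default_target_group_arn
  }
}
```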
Everybody is welcome to contribute ideas and PRs to this repository. We don't have any strict contribution guidelines; use your best judgment, and bear with us if it takes us a few days to answer.