Standardization of CICD and common/repeatable flow components.
The goal of this project is to serve as a shared code repository for common repeatable workflows for Prefect v2 flows at Pocket.
These things include:
- CI/CD patterns
- Connecting to Cloud Services and Databases
- Enforcing other conventions specific to Pocket Prefect flows
This project is a part of the larger data-flows project and meant to be installed as a dependency in other prefect flow sub-projects.
Once you have cloned the larger data-flows project. You should be working in the common-utils
folder.
This project uses Poetry for dependency management. You will want to follow the installation instructions here:
https://python-poetry.org/docs/#installation
This project also uses dependency groups for managing Python dependencies. You can learn about them here:
https://python-poetry.org/docs/managing-dependencies/#dependency-groups
To install as part of your data-flows project, here are the steps:
-
From the root of your sub-project run
../common-utils/common_utils_build.sh
. This will create adist
folder with the proper wheel file you need to install. -
Finally run
poetry add dist/<wheel file name>.whl
The dist folder is in the .gitignore because the CI/CD job will run the ../common-utils/common_utils_build.sh
for you. This means that your poetry install will work as expected. It also means the build will fail if you have not installed the latest version of the library.
❗ This library currently requires the use of our custom Prefect-AWS package at https://github.com/Pocket/prefect-aws.git@test-prs. We needed the changes outlined in these PRs: PrefectHQ/prefect-aws#236 and PrefectHQ/prefect-aws#233 (released). Once both of these changes are released through the main project, we can revert back to the main library.
To setup your dev environment for developing on the project, simply run Poetry install for the dev dependency group.
# Installs dev dependencies and any cli scripts
poetry install --only main, dev
Tests can be executed via Pytest.
python -m pytest --cov=src/common --cov-report term-missing --cov-fail-under=100
Changes to the code will trigger the build process:
- Black syntax check
- Check for unit test completion and coverage
CI/CD workflow is managed by CircleCI. The workflow code is here.
Below is a high level overview of the current modules available.
This module provides tooling for deploying your project and flow resources to AWS and Prefect respectively.
Per the Prefect v2 docs, you will need to have your PREFECT_API_KEY
and PREFECT_API_URL
environment variables set when using the module for pushing flows to Prefect Cloud.
You will also need to have your AWS credentials set either via Environment Variables, credientials file, or SSO. With SSO, you will most likely need your AWS_DEFAULT_REGION
environment variable set.
Another important environment variable is DEPLOYMENT_TYPE
. This determines what infrastructure flows run on as well as what access it will have.
DEPLOYMENT_TYPE
has to do with Prefect Flow development and underlying AWS infrastructure. Flows merged to main-v2
will be considered a main
deployment type and flows optionally merged to staging-v2
will be considered a staging
deployment type. There will be deployment and infrastructure resources available to support both in the Prefect Environment. There will be the option to push flows to a dev
deployment type powered by the AWS development account. This is no automation for this, so though flow code will be pulled from dev-v2
, a push to the dev-v2
branch will not trigger a flow deployment. However, there will be an option to deploy flows locally as needed to the dev
deployment type.
The main deliverable here is the deploy-cli
. This will be installed in your Poetry Python environment. This provides the commands needed to leverage the deployment tools capabilities.
Finally, this module relies on the existence of a Poetry PyProject path to pull metadata for deployment. Here is a code sample:
PYPROJECT_FILE_PATH = os.path.expanduser(os.path.join(os.getcwd(), "pyproject.toml"))
We default to using current working directory to find the pyproject.toml file. For the deploy-cli
to work, you must execute it from the subproject root.
Here is the help text:
usage: deploy-cli [-h] [--check-version] ...
basic CLI setup for running deployment utils as needed.
options:
-h, --help show this help message and exit
--check-version Do a version check against main-v2 for current project.
subcommands:
these are the subcommands to use for deploying flows and environments as needed
deploy-envs use this command to deploy the docker envs configured in your pyproject.toml
this will also deploy your project's filesystem.
deploy-flows use this command to deploy flows from the flows_folder configured in your pyproject.toml
As you can see, there are 2 subcommands:
deploy-envs
usage: deploy-cli deploy-envs [-h] [--build-only]
options:
-h, --help show this help message and exit
--build-only set this flag to only run Docker build.
deploy-flows
usage: deploy-cli deploy-flows [-h] [--validate-only]
options:
-h, --help show this help message and exit
--validate-only set this flag to only validate flow specs.
There is also a top level flag called --check-version
This will pull down a project pyproject.toml
file and make sure the version is update as needed if there are changes.
We leverage this in our CICD process.
The purpose of deploy-envs
is to accept and create a series of environment configurations that will used as Docker environments for your flow runs.
This is the environment that your flow will run in. It can be defined however you want given it has the proper top level template for deriving from a support Prefect v2 Docker Images.
ARG PYTHON_VERSION
FROM prefecthq/prefect:2-python${PYTHON_VERSION}
RUN apt update && \
apt install -y curl
RUN curl -sSL https://install.python-poetry.org | python3 -
COPY dist /src/repo/dist
COPY pyproject.toml /src/repo/
COPY poetry.lock /src/repo/
WORKDIR /src/repo
RUN $HOME/.local/bin/poetry config virtualenvs.create false
RUN $HOME/.local/bin/poetry install --no-root --only main
As you can see, we want to make sure we are using supported Prefect Docker Images. This is useful even though your own project may install a different version of Prefect.
These environments will be defined in your Poetry pyproject.toml
.
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
[tool.prefect.envs.base]
python_version = "3.10"
dockerfile_path=".docker/main/Dockerfile"
docker_build_context="<code directory>" # this is optional and will default to '.'
These are project level images that can be leveraged by any flow.
The purpose of this deploy-flows
is to:
- Using the
tool.prefect.flows_folder
configuration, find Python files that end in_flow.py
and attempt to deploy them to Prefect. - The deployment of a flow consists of:
- Registering the proper ECS Task Definition.
- Creating an ECS Task Block in Prefect that defines the flow's task definition.
- Registering the flow's deployment configurations with Prefect. A flow can have multiple deployment configurations. Deployment documentation is found here.
We will be leveraging Github for flow storage. This works well because we can map our deployments and agent queues to Github branches.
Branch main-v2
will be used for releasing production ready flow deployments. These will run on AWS production infrastructure using the main
Prefect agent.
Branch staging-v2
will be used for pre-release flow deployments. These will run on AWS production infrastructure using the staging
Prefect agent. The idea here is that these flows should only have access to leverage resources approved for staging. This will be done via a combination of IAM access and loading certain credentials via secrets manager.
The dev-v2
can be used to test deployments locally against a local Prefect instance or Prefect Cloud via the dev
agent. Flow deployments submitted via this agent will run in the AWS development account.
The folder to look for flows is defined in the pyproject.toml.
[tool.prefect]
flows_folder = "tests/test_flows"
Once a flow file is found, we look for a global variable in the file called FLOW_SPEC
.
This variable must be an instantiated FlowSpec
object.
from prefect import flow, task
from prefect.server.schemas.schedules import CronSchedule
from common.deployment import FlowDeployment, FlowEnvar, FlowSpec
@task()
def task_1():
print("hello world")
@flow()
def flow_1():
task_1()
FLOW_SPEC = FlowSpec(
flow=flow_1,
docker_env="base",
secrets=[
FlowEnvar(envar_name="MY_SECRET_JSON", envar_value="/my/secretsmanager/secret")
],
ephemeral_storage_gb=200,
deployments=[
FlowDeployment(
deployment_name="base",
cpu="1024",
memory="4096",
parameters={"param_name": "param_value"},
schedule=CronSchedule(cron="0 0 * * *"),
envars=[
FlowEnvar(envar_name="EXTRA_PIP_PACKAGES", envar_value="pandas")
],
)
], # type: ignore
)
The models are Pydantic models, so the inputs are validated.
FlowSpec
fields as a whole, control how the task definition is created. Any AWS ARN/URI specifics are added for you. The docker_env
must exist in the pyproject.toml.
FlowDeployment
controls how the Prefect Deployments are registered.
Screenshots can be found here.
PRs from the Pocket Team are welcomed given that changes are made with the understanding that this is meant to be shared code Prefect v2 flows.
A subproject example can be found in the data-products project folder at the root of the git repo. This is where the Pocket Data Products Team will author flows.
More links to come...
The code in this project is licensed under Apache-2.0 license.