The following guide will walk you through contributing new sources or changes to existing sources and their demo pipelines and contains a troubleshooting section. Please also read DISTRIBUTION.md to understand how our sources are distributed to the users. Refer to BUILDING-BLOCKS.md to learn about the basic building blocks of a dlt source.
What you can do here:
- Contribute a change to an existing verified source or its demo pipeline: Go to the "Walkthrough: Fix, improve, customize, document an existing source" section.
- Contribute a new verified source: Go to the "Walkthrough: Create and contribute a new source" section.
- Join our Slack to get support from us by following the invitation link.
In this section you will learn how to contribute changes to an existing pipeline.
- Ensure you have followed all steps in the coding prerequisites section and the `format-lint` command works.
- Start changing the code of an existing source. The typical development workflow will look something like this (the code examples assume you are changing the `chess` source):
  - Make changes to the code of your source, for example adding new resources or fixing bugs (see the sketch after this list).
  - Execute the source example pipeline script, for example `python chess_pipeline.py` (from the `sources` folder!), and see if there are any errors and whether the expected data ends up in your destination.
  - Adjust your tests to cover the new features you have added or changes you have made in `./tests/chess/test_chess_source.py` and run the tests against duckdb locally with this command: `pytest tests/chess`.
  - Run the linter and formatter to check for any problems: `make lint-code`.
- Proceed to the pull request section to create a pull request to the main repo.
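To illustrate the "make changes to the code" step above, adding a new resource to an existing source could look roughly like the sketch below. This is a hedged example: the resource name, URL and fields are hypothetical and not taken from the actual chess source.

```python
from typing import Any, Dict, Iterator, List

import dlt
from dlt.sources.helpers import requests  # dlt's requests drop-in with built-in retries


@dlt.resource(write_disposition="replace")
def players_club_memberships(players: List[str]) -> Iterator[Dict[str, Any]]:
    """Hypothetical new resource: yields one record per player from an imaginary endpoint."""
    for player in players:
        response = requests.get(f"https://api.example.com/players/{player}/clubs")
        response.raise_for_status()
        yield response.json()
```

The new resource would then be returned from the existing `@dlt.source` function so it loads together with the other resources.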
In this section you will learn how to contribute a new source including tests and a demo pipeline for that source. It is helpful to also read through the above section to see all the steps that are part of source development.
- Before starting development on a new source, please open a ticket here and let us know what you have planned.
- We will acknowledge your ticket and figure out how to proceed. This mostly has to do with creating a test account for the desired source and providing you with the needed credentials. We will also ensure that no parallel development is happening.
Now you can get to coding. As a starting point we will copy the `chess` source. The `chess` example consists of two very basic sources with a few resources. For an even simpler starting point, you can use the `pokemon` source with the same method. Please also read DISTRIBUTION.md before you start this guide to get an understanding of how your source will be distributed to other users once it is accepted into our repo.
- Ensure you have followed all steps in the coding prerequisites section and the `format-lint` command works.
- We will copy the chess source as a starting point. There is a convenient script that creates a copy of the chess source under a new name. Run it with `python tools/new_source.py my_source`. This will create a new example script and source folder in the `sources` directory and a new test folder in the `tests` directory. You will have to update a few imports to make it work.
- You are now set to start development on your new source.
- You can now implement your verified source. Consult our extensive docs on how to create dlt sources and pipelines at the dlthub create pipeline walkthrough (see the skeleton sketch after this list).
- Read the rest of this document and BUILDING-BLOCKS.md for information on various topics.
- Proceed to the pull request section to create a pull request to the main repo.
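As a rough orientation for the implementation step above, a source usually boils down to a `@dlt.source` function that groups one or more `@dlt.resource` functions. The skeleton below is only a hedged sketch with hypothetical names, not the output of `tools/new_source.py`:

```python
from typing import Any, Dict, Iterator

import dlt


@dlt.resource(write_disposition="append")
def items(api_key: str = dlt.secrets.value) -> Iterator[Dict[str, Any]]:
    """Hypothetical resource: replace the static record with calls to your API."""
    yield {"id": 1, "value": "example"}


@dlt.source
def my_source(api_key: str = dlt.secrets.value):
    """Groups all resources of the hypothetical my_source."""
    return items(api_key)
```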
To start development in the verified sources repository, there are a few steps you need to take to ensure you have a working setup.
- Fork the verified-sources repository on GitHub; alternatively, check out this repository directly if you have the rights to create pull requests on it.
- Clone the forked repository:
  `git clone https://github.com/dlt-hub/verified-sources.git`
- Make a feature branch in the fork:
  `cd verified-sources`
  `git checkout -b <your-branch-name>`
Development on the verified sources repository depends on Python 3.8 or higher and `poetry` being available, as well as the needed dependencies being installed. Make is used to automate tasks.
- Install poetry:
  `make install-poetry`
- Install the python dependencies:
  `make dev`
- Activate the poetry shell:
  `poetry shell`
- To verify that the dependencies are set up correctly, run the linter / formatter:
  `make format-lint`

If this command fails, something is not set up correctly yet.
A `requirements.txt` file must be added to the source folder, including a versioned dependency on `dlt` itself. This specifies which version of dlt the source is developed against and ensures that users are notified to update `dlt` if the source depends on new features or backwards-incompatible changes. The `dlt` dependency should be added in `requirements.txt` with a version range and without extras, for example:
`dlt>=0.3.5,<0.4.0`
If your source requires additional dependencies that are not available in `dlt`, they may be added as follows:
- Use `poetry` to add the dependency to the group with the same name as the source. Example: the chess source uses `python-chess` to decode game moves; the dependency was added with `poetry add -G chess python-chess`.
- Add the dependency to the `requirements.txt` file in the source folder.
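Putting both rules together, the `requirements.txt` of a source with one extra dependency could look like this (the `python-chess` version range is illustrative, not the exact pin used in the repo):

```
python-chess>=1.9,<2.0
dlt>=0.3.5,<0.4.0
```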
Use relative imports. Your code will be imported as source code, and everything under the source folder must be self-contained and isolated. Example (from `google_sheets`):
from .helpers.data_processing import get_spreadsheet_id
from .helpers.api_calls import api_auth
from .helpers import api_calls
This script is distributed by `dlt init` with the other `<name>` source files. It will be the first touch point with your users and will be used by them as a starting point or as a source of code snippets. The ideal content for the script:
- Shows a few usage examples with different source/resource argument combinations that you think are the most common cases for your users.
- If you provide any customizations/transformations, shows how to use them.
- Any code snippet that will speed the user up.
Examples: see `chess_pipeline.py` and the other example scripts in the `sources` folder.
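Purely as a hedged sketch (the module and pipeline names are hypothetical, not copied from an existing script), such a demo script typically boils down to creating a pipeline and running the source:

```python
import dlt

from my_source import my_source  # hypothetical source package created earlier


def load_everything() -> None:
    """Demo: load all resources of the source into a local duckdb database."""
    pipeline = dlt.pipeline(
        pipeline_name="my_source_demo",
        destination="duckdb",
        dataset_name="my_source_data",
    )
    load_info = pipeline.run(my_source())
    print(load_info)


if __name__ == "__main__":
    load_everything()
```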
All source tests and usage/example scripts share the same config and credential files that are present in `sources/.dlt`.
This makes running locally much easier, and the `dlt` configuration is flexible enough to apply to many sources in one folder.
Please add your credentials/secrets under the `sources.<source_name>` section, i.e.:

[sources.github]
access_token="ghp_KZCEQl****"

This will come in handy when you use our GitHub CI or run local tests.
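Inside the source code you normally do not parse this file yourself: dlt injects the value into any argument that defaults to `dlt.secrets.value`. A hedged sketch (the actual github source signature may differ):

```python
import dlt


@dlt.source
def github(access_token: str = dlt.secrets.value):
    """dlt resolves `access_token` from the [sources.github] section of secrets.toml
    (or from equivalent environment variables) when it is not passed explicitly."""
    ...
```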
- Ensure the linter and formatter pass by running: `make lint-code`
- Ensure the example script of the source you have added/changed runs: `python my_source_pipeline.py`
- Add all files and commit them to your feature branch: `git commit -am "my changes"`
- Push to your fork.
- Make a PR from your fork to the master branch of this repository (upstream). You can see the result of our automated testing and linting scripts in the PR on GitHub.
- If you are connecting to a new datasource, we will need instructions on how to connect there or reproduce the test data.
- Wait for our code review/merge.
- Your PR must pass the linting and testing stages on the GitHub pull request. For more information consult the sections about formatting, linting and testing.
- Your code must have Google style docstrings in all relevant sections.
- Your code needs to be typed. Consult the section about typing for more information.
- If you create a new source or make significant changes to an existing source, please add or update tests accordingly.
- The source folder must contain a `requirements.txt` file including `dlt` as a dependency and any additional dependencies needed to run the source.
`python-dlt` uses `mypy` and `flake8` with several plugins for linting, and `black` with default settings to format the code. To lint the code and run the formatter, do `make lint-code`. Do this before you commit so you can be sure that the CI will pass.
Code needs to be typed - `mypy` is able to catch a lot of problems in the code. See the `chess` source for an example.
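A short, hedged illustration of both requirements - type annotations that `mypy` can check and a Google style docstring - using hypothetical names:

```python
from typing import Any, Dict, Iterator

import dlt


@dlt.resource(write_disposition="merge", primary_key="id")
def issues(repository: str, access_token: str = dlt.secrets.value) -> Iterator[Dict[str, Any]]:
    """Yields issues of a single repository.

    Args:
        repository: Full name of the repository, e.g. "dlt-hub/verified-sources".
        access_token: API token, resolved from secrets if not passed explicitly.

    Yields:
        Dictionaries with the issue data.
    """
    ...  # request the API and yield pages of results here
```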
We use `pytest` for testing. Every test runs within a set of fixtures that provide the following environment (see `conftest.py`):
- They load secrets and config from `sources/.dlt` so the same values are used when you run your pipeline from the command line and in tests.
- They set the working directory for each pipeline to the `_storage` folder and make sure it is empty before each test.
- They drop all datasets from the destination after each test.
- They run each test with the original environment variables so you can modify `os.environ`.
Look at `tests/chess/test_chess_source.py` for an example. The line
`@pytest.mark.parametrize('destination_name', ALL_DESTINATIONS)`
makes sure that each test runs against all destinations (as defined in the `ALL_DESTINATIONS` global variable).
The simplest possible test just creates a pipeline with your source and then issues a run on the source. More advanced tests use `sql_client` to check the data and access the schemas to check the table structure.
Please also look at the test helpers that you can use to assert the load infos, get counts of elements in tables, select and assert the data in tables etc.
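A minimal test along those lines might look like the sketch below. The imports assume the helpers live in `tests/utils.py` as in the chess tests; `my_source` and `assert_load_info` are stand-ins for your own source and for whichever load-info helper you pick:

```python
import dlt
import pytest

from sources.my_source import my_source  # hypothetical source
from tests.utils import ALL_DESTINATIONS, assert_load_info


@pytest.mark.parametrize("destination_name", ALL_DESTINATIONS)
def test_load_all_data(destination_name: str) -> None:
    pipeline = dlt.pipeline(
        pipeline_name="my_source_test",
        destination=destination_name,
        dataset_name="my_source_test_data",
        full_refresh=True,  # always use full_refresh in tests, see the rules below
    )
    load_info = pipeline.run(my_source())
    assert_load_info(load_info)
```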
Your tests will be run both locally and on CI. This means that a few instances of your test may be executed in parallel, and they will be sharing resources. A few simple rules make that possible.
- Always use `full_refresh` when creating pipelines in tests. This makes sure that data is loaded into a new schema/dataset. Fixtures in `conftest.py` will drop datasets created during the load.
- When creating any fixtures for your tests, make sure the fixture is unique for your test instance. If you create a database, schema or table, add a random suffix/prefix to its name and use that in your test. If you create an account, i.e. a user whose name is a unique identifier, also add a random suffix/prefix (see the fixture sketch after this list).
- Clean up after your fixtures - delete accounts, drop schemas and databases.
- Add code to `tests/utils.py` only if it is helpful for all tests. Put your specific helpers in your own directory.
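A hedged sketch of the "unique fixture with cleanup" rule, using a random suffix from the standard library (the fixture and dataset names are hypothetical):

```python
import uuid
from typing import Iterator

import pytest


@pytest.fixture
def unique_dataset_name() -> Iterator[str]:
    """Yields a dataset/schema name that is unique for this test instance."""
    name = f"my_source_test_{uuid.uuid4().hex[:8]}"
    yield name
    # cleanup: drop the dataset / delete the account created under this name
```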
Tests in `tests/test_dlt_init.py` are executed as part of the linting stage and must pass. They make sure that sources can be distributed with `dlt init`.
When developing, limit the destinations to local ones, i.e. postgres, by setting the environment variable:
`ALL_DESTINATIONS='["postgres"]' pytest tests/chess`
There's also a `make test-local` command that will run all the tests on `duckdb` and `postgres`.
- linter and init checks will run immediately
- we will review your code
- we will set up secrets/credentials on our side (with your help)
- we will assign the `ci from fork` label that will enable your verified source's tests against duckdb and postgres

Overall, the following checks must pass:
- mypy and linter
- the `dlt init` test, where we make sure you provided all the information needed for your verified source module to be distributed correctly
- tests for your source, which must pass on postgres and duckdb
If you prefer to run your checks on your own CI, do the following:
- Go to the settings of your fork: https://github.com/**account name**/repo name/settings/secrets/actions
- Add a new Repository Secret with the name DLT_SECRETS_TOML
- Paste into it the `toml` fragment with the source credentials that you added to your secrets.toml - remember to include the section name:

[sources.github]
access_token="ghp_KZCEQlC8***"

In essence, DLT_SECRETS_TOML is just your `secrets.toml` file and will be used as such by the CI runner.
Typically we created a common test account for your source before you started coding. This is the ideal situation - we can reuse your tests directly and merge your work quickly.
If you contributed a source and created your own credentials, test accounts or test datasets, please include them in the tests or share them with the `dlt` team so we can configure the CI job. If sharing is not possible, please help us reproduce your test cases so the CI job will pass.
We are happy to add you as a contributor to avoid the hurdles of setting up credentials. This also lets you run your tests on BigQuery/Redshift etc. Just ping the dlt team on Slack.
Use Python 3.8 for development, which is the lowest supported version for `python-dlt`. You'll need `distutils` and `venv`:

sudo apt-get install python3.8
sudo apt-get install python3.8-distutils
sudo apt install python3.8-venv

You may also use `pyenv`, as `poetry` suggests.
Please look at `example.secrets.toml` in the `.dlt` folder for how to configure `postgres`, `redshift` and `bigquery` destination credentials. Those credentials are shared by all sources.
Then you can create your own `secrets.toml` with the credentials you need. The `duckdb` and `postgres` destinations work locally, and we suggest you use them for initial testing.
As explained in the technical docs, both the native form (i.e. a database connection string) and the dictionary representation (a python dict with host, database, password etc.) can be used.
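For example, in `secrets.toml` the postgres credentials could be provided in either form (values are placeholders; use one form, not both):

```toml
# native form (connection string):
# destination.postgres.credentials="postgresql://loader:<password>@localhost:5432/dlt_data"

# or the dictionary / fragment form:
[destination.postgres.credentials]
host="localhost"
port=5432
database="dlt_data"
username="loader"
password="<password>"
```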
There's a compose file with a fully prepared postgres instance here.