Added MongoDB documentation. (#607)
dat-a-man authored Sep 7, 2023
1 parent e92e140 commit 85a5aaf
Showing 2 changed files with 328 additions and 0 deletions.
327 changes: 327 additions & 0 deletions docs/website/docs/dlt-ecosystem/verified-sources/mongodb.md
@@ -0,0 +1,327 @@
---
title: MongoDB
description: dlt verified source for MongoDB
keywords: [mongodb, verified source, mongo database]
---

# MongoDB

:::info Need help deploying these sources, or figuring out how to run them in your data stack?

[Join our Slack community](https://dlthub-community.slack.com/join/shared_invite/zt-1slox199h-HAE7EQoXmstkP_bTqal65g)
or [book a call](https://calendar.app.google/kiLhuMsWKpZUpfho6) with our support engineer Adrian.
:::

[MongoDB](https://www.mongodb.com/what-is-mongodb) is a NoSQL database that stores JSON-like
documents.

This MongoDB `dlt` verified source and
[pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/mongodb_pipeline.py)
load data from MongoDB to the destination of your choice.

Sources and resources that can be loaded using this verified source are:

| Name | Description |
|--------------------|--------------------------------------------|
| mongodb | loads a specific MongoDB database |
| mongodb_collection | loads a collection from a MongoDB database |

## Setup Guide

### Grab credentials

#### Grab `connection_url`

MongoDB can be configured in multiple ways. Typically, the connection URL format is:

```text
connection_url = "mongodb://dbuser:passwd@host.or.ip:27017"
```

For details on connecting to MongoDB and obtaining the connection URL, see
[the documentation.](https://www.mongodb.com/docs/drivers/go/current/fundamentals/connection/)
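
Usernames and passwords that contain reserved characters such as `@`, `:` or `/` need to be percent-encoded before they go into the URL. The following is a minimal sketch using only the Python standard library; the helper name `build_connection_url` and the sample credentials are made up for illustration:

```python
from urllib.parse import quote_plus


def build_connection_url(user: str, password: str, host: str, port: int = 27017) -> str:
    """Build a mongodb:// URL, percent-encoding the credentials."""
    # quote_plus escapes reserved characters such as '@', ':' and '/'
    return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}"


# Hypothetical credentials; replace them with your own.
connection_url = build_connection_url("dbuser", "p@ss:wd", "host.or.ip")
print(connection_url)  # mongodb://dbuser:p%40ss%3Awd@host.or.ip:27017
```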

Here are the typical ways to configure MongoDB and their connection URLs:

| Name | Description | Connection URL Example |
|---------------------|---------------------------------------------------------------------------------------|---------------------------------------------------|
| Local Installation | Install on Windows, macOS, Linux using official packages. | "mongodb://dbuser:passwd@host.or.ip:27017" |
| Docker | Deploy using the MongoDB Docker image. | "mongodb://dbuser:passwd@docker.host:27017" |
| MongoDB Atlas | MongoDB’s managed service on AWS, Azure, and Google Cloud. | "mongodb+srv://dbuser:passwd@cluster.mongodb.net" |
| Managed Cloud | AWS DocumentDB, Azure Cosmos DB, and others offer MongoDB as a managed database. | "mongodb://dbuser:passwd@managed.cloud:27017" |
| Configuration Tools | Use Ansible, Chef, or Puppet for automation of setup and configuration. | "mongodb://dbuser:passwd@config.tool:27017" |
| Replica Set | Set up for high availability with data replication across multiple MongoDB instances. | "mongodb://dbuser:passwd@replica.set:27017" |
| Sharded Cluster | Scalable distribution of datasets across multiple MongoDB instances. | "mongodb://dbuser:passwd@shard.cluster:27017" |
| Kubernetes | Deploy on Kubernetes using Helm charts or operators. | "mongodb://dbuser:passwd@k8s.cluster:27017" |
| Manual Tarball | Install directly from the official MongoDB tarball, typically on Linux. | "mongodb://dbuser:passwd@tarball.host:27017" |

> Note: The provided URLs are example formats; adjust as needed for your specific setup.

#### Grab `database and collections`

1. To look up your database and collection names, you need the MongoDB shell (`mongosh`) installed.
   For installation guidance, refer to the
   [documentation here.](https://www.mongodb.com/docs/mongodb-shell/install/)

1. Modify the example URLs with your credentials (dbuser & passwd) and host details.

1. Connect to MongoDB:

   ```bash
   mongosh "mongodb://dbuser:passwd@your_host:27017"
   ```

1. List all databases:

   ```bash
   show dbs
   ```

1. View the collections in a database:

   1. Switch to the database:

      ```bash
      use your_database_name
      ```

   1. Display its collections:

      ```bash
      show collections
      ```

1. Disconnect:

   ```bash
   exit
   ```
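
The same names can also be collected from Python. The sketch below assumes PyMongo and relies on its `list_database_names()` and `list_collection_names()` methods; the helper `list_collections` is a hypothetical name, and the connection itself is left commented out because it needs a reachable server:

```python
from typing import Any, Dict, List


def list_collections(client: Any) -> Dict[str, List[str]]:
    """Map every database name to the names of its collections.

    `client` is expected to behave like `pymongo.MongoClient`.
    """
    return {
        name: client[name].list_collection_names()
        for name in client.list_database_names()
    }


# Usage against a real server (requires `pip install pymongo`):
# from pymongo import MongoClient
# client = MongoClient("mongodb://dbuser:passwd@your_host:27017")
# print(list_collections(client))
```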

>Note the database and collection names for future source configuration.

### Prepare your data

MongoDB stores data in BSON (Binary JSON) format, which allows for embedded documents and
nested data. It uses a flexible schema, and its key terms include:

`Documents`: Key-value pairs representing data units.

`Collections`: Groups of documents, similar to database tables but without a fixed schema.

`Databases`: Containers for collections; a single MongoDB server can have multiple databases.

`dlt` converts nested data into relational tables, infers data types, and defines parent-child
relationships, creating a schema that adapts to future changes in the data.
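
To make the idea concrete, here is a simplified, hypothetical sketch of such flattening in plain Python. It is not `dlt`'s implementation; the `__` separator and the `_parent_id` column are assumptions loosely modeled on how nested tables are usually named:

```python
from typing import Any, Dict, List


def flatten(doc: Dict[str, Any], table: str, tables: Dict[str, List[dict]],
            parent_id: Any = None) -> None:
    """Split one nested document into rows of a parent table and child tables."""
    row: Dict[str, Any] = {}
    if parent_id is not None:
        row["_parent_id"] = parent_id  # link back to the parent row

    def add_fields(prefix: str, value: Dict[str, Any]) -> None:
        for key, item in value.items():
            name = f"{prefix}{key}"
            if isinstance(item, dict):
                add_fields(f"{name}__", item)  # nested document -> prefixed columns
            elif isinstance(item, list):
                for element in item:  # list -> rows in a child table
                    child = element if isinstance(element, dict) else {"value": element}
                    flatten(child, f"{table}__{name}", tables, parent_id=doc.get("_id"))
            else:
                row[name] = item

    add_fields("", doc)
    tables.setdefault(table, []).append(row)


tables: Dict[str, List[dict]] = {}
movie = {"_id": 1, "title": "Alien", "imdb": {"rating": 8.5}, "cast": ["Weaver", "Skerritt"]}
flatten(movie, "movies", tables)
print(tables)
```

In this sketch the `imdb` sub-document becomes the column `imdb__rating` on `movies`, while the `cast` list becomes rows in a separate `movies__cast` table that point back to the parent through `_parent_id`.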

### Initialize the verified source

To get started with your data pipeline, follow these steps:

1. Enter the following command:

```bash
dlt init mongodb duckdb
```

[This command](../../reference/command-line-interface) will initialize
[the pipeline example](https://github.com/dlt-hub/verified-sources/blob/master/sources/mongodb_pipeline.py)
with MongoDB as the [source](../../general-usage/source) and [duckdb](../destinations/duckdb.md)
as the [destination](../destinations).

1. If you'd like to use a different destination, simply replace `duckdb` with the name of your
preferred [destination](../destinations).
1. After running this command, a new directory will be created with the necessary files and
   configuration settings to get started.

For more information, read the
[Walkthrough: Add a verified source.](../../walkthroughs/add-a-verified-source)

### Add credentials

1. Inside the `.dlt` folder, you'll find a file called `secrets.toml`, which is where you can
securely store your access tokens and other sensitive information. It's important to handle this
file with care and keep it safe. Here's what the file looks like:

```toml
# put your secret values and credentials here
# do not share this file and do not push it to github
[sources.mongodb]
connection_url = "mongodb connection_url" # please set me up!
```

1. Replace the `connection_url` value with the [previously copied one](#grab-connection_url) to ensure
   secure access to your MongoDB sources.

1. Next, follow the [destination documentation](../../dlt-ecosystem/destinations) instructions to
   add credentials for your chosen destination, ensuring proper routing of your data to the final
   destination.

1. Next, store your configuration details in the `.dlt/config.toml`.

   Here's what the `config.toml` looks like:

   ```toml
   [your_pipeline_name] # Set your pipeline name here!
   database = "defaultDB" # Database name (Optional), default database is loaded if not provided.
   collection_names = ["collection_1", "collection_2"] # Collection names (Optional), all collections are loaded if not provided.
   ```

   > Optionally, you can set database and collection names in ".dlt/secrets.toml" under
   > [sources.mongodb] without listing the pipeline name.

1. Replace the values of `database` and `collection_names` with the ones
   [copied above](#grab-database-and-collections).
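
The values do not have to live in TOML files. As a sketch, assuming `dlt`'s convention of reading environment variables named after the upper-cased section and key joined by double underscores, the same settings could be provided like this (the URL and database name are placeholders):

```python
import os

# Equivalent of `connection_url` under [sources.mongodb] in secrets.toml.
# Set this before the pipeline script creates the source.
os.environ["SOURCES__MONGODB__CONNECTION_URL"] = "mongodb://dbuser:passwd@host.or.ip:27017"

# Optional settings follow the same naming scheme.
os.environ["SOURCES__MONGODB__DATABASE"] = "defaultDB"
```
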

## Run the pipeline

1. Before running the pipeline, ensure that you have installed all the necessary dependencies by
   running the command:

   ```bash
   pip install -r requirements.txt
   ```

1. You're now ready to run the pipeline! To get started, run the following command:

   ```bash
   python3 mongodb_pipeline.py
   ```

1. Once the pipeline has finished running, you can verify that everything loaded correctly by using
   the following command:

   ```bash
   dlt pipeline <pipeline_name> show
   ```

   For example, the `pipeline_name` for the above pipeline example is `local_mongo`; you may also
   use any custom name instead.

For more information, read the [Walkthrough: Run a pipeline.](../../walkthroughs/run-a-pipeline)

## Sources and resources

`dlt` works on the principle of [sources](../../general-usage/source) and
[resources](../../general-usage/resource).

### Source `mongodb`

This function loads data from a MongoDB database, yielding one or multiple collections to be retrieved.

```python
@dlt.source
def mongodb(
    connection_url: str = dlt.secrets.value,
    database: Optional[str] = dlt.config.value,
    collection_names: Optional[List[str]] = dlt.config.value,
    incremental: Optional[dlt.sources.incremental] = None,  # type: ignore[type-arg]
    write_disposition: Optional[str] = dlt.config.value,
) -> Iterable[DltResource]:
    ...
```

`connection_url`: MongoDB connection URL.

`database`: Database name; the default database is used if unspecified.

`collection_names`: Names of the collections to load; all collections are loaded if unspecified.

`incremental`: Option for incremental data loading.

`write_disposition`: Writing mode: "replace", "append", or "merge".
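
The three write dispositions differ in how a new load treats rows that are already in the destination table. The following is a conceptual sketch in plain Python, not `dlt` code; the `apply_load` helper and the use of `_id` as the merge key are assumptions for illustration:

```python
from typing import Dict, List


def apply_load(existing: List[Dict], incoming: List[Dict], write_disposition: str,
               key: str = "_id") -> List[Dict]:
    """Return the table contents after loading `incoming` rows."""
    if write_disposition == "replace":
        return list(incoming)  # old rows are dropped
    if write_disposition == "append":
        return existing + incoming  # new rows are added, duplicates are possible
    if write_disposition == "merge":
        merged = {row[key]: row for row in existing}
        merged.update({row[key]: row for row in incoming})  # same key -> row is updated
        return list(merged.values())
    raise ValueError(f"unknown write disposition: {write_disposition}")


table = [{"_id": 1, "title": "Alien"}]
load = [{"_id": 1, "title": "Alien (1979)"}, {"_id": 2, "title": "Heat"}]

print(apply_load(table, load, "replace"))
print(apply_load(table, load, "append"))
print(apply_load(table, load, "merge"))
```
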

### Source `mongodb_collection`

This function fetches a single collection from a MongoDB database using PyMongo.

```python
def mongodb_collection(
    connection_url: str = dlt.secrets.value,
    database: Optional[str] = dlt.config.value,
    collection: str = dlt.config.value,
    incremental: Optional[dlt.sources.incremental] = None,  # type: ignore[type-arg]
    write_disposition: Optional[str] = dlt.config.value,
) -> Any:
    ...
```

`collection`: Name of the collection to load.

### Create your own pipeline

If you wish to create your own pipelines, you can leverage the source and resource methods from this
verified source.

1. Configure the pipeline by specifying the pipeline name, destination, and dataset as follows:

   ```python
   import dlt
   from mongodb import mongodb, mongodb_collection  # assumes the `mongodb` folder created by `dlt init`

   pipeline = dlt.pipeline(
       pipeline_name="mongodb_pipeline",  # Use a custom name if desired
       destination="duckdb",  # Choose the appropriate destination (e.g., duckdb, redshift, post)
       dataset_name="mongodb_data",  # Use a custom name if desired
   )
   ```

1. To load all the collections in a database:

   ```python
   load_data = mongodb()
   load_info = pipeline.run(load_data, write_disposition="replace")
   print(load_info)
   ```

1. To load specific collections from the database:

   ```python
   load_data = mongodb().with_resources("collection_1", "collection_2")
   load_info = pipeline.run(load_data, write_disposition="replace")
   print(load_info)
   ```

1. To load specific collections from the source incrementally:

   ```python
   load_data = mongodb(incremental=dlt.sources.incremental("date")).with_resources("collection_1")
   load_info = pipeline.run(load_data, write_disposition="merge")
   print(load_info)
   ```

   > Data is loaded incrementally based on the "date" field.

1. To load data from a particular collection, say "movies", incrementally:

   ```python
   import pendulum

   load_data = mongodb_collection(
       collection="movies",
       incremental=dlt.sources.incremental(
           "lastupdated", initial_value=pendulum.DateTime(2020, 9, 10, 0, 0, 0)
       ),
   )
   load_info = pipeline.run(load_data, write_disposition="merge")
   print(load_info)
   ```

   > The source function "mongodb_collection" loads data from a single collection, whereas
   > the source "mongodb" can load data from multiple collections.
   > This script configures incremental loading from the "movies" collection based on the
   > "lastupdated" field, starting from midnight on September 10, 2020.

1. To incrementally load a table with an append-only disposition using hints:

   ```python
   # Suitable for tables where new rows are added, but existing rows aren't updated.
   # Load data from the 'listingsAndReviews' collection in MongoDB, using 'last_scraped' for incremental addition.
   airbnb = mongodb().with_resources("listingsAndReviews")
   airbnb.listingsAndReviews.apply_hints(
       incremental=dlt.sources.incremental("last_scraped")
   )
   info = pipeline.run(airbnb, write_disposition="append")
   ```

   > This applies a hint for incremental loading based on the "last_scraped" field, which is ideal
   > for tables with additions but no updates.

1. To load a selected collection and rename it in the destination:

   ```python
   # Create the MongoDB source and select the "collection_1" collection
   source = mongodb().with_resources("collection_1")
   # Apply the hint to rename the table in the destination
   source.resources["collection_1"].apply_hints(table_name="loaded_data_1")
   # Run the pipeline
   info = pipeline.run(source, write_disposition="replace")
   print(info)
   ```
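
Conceptually, incremental loading remembers the highest cursor value seen so far and, on the next run, only picks up documents beyond it. The following is a simplified sketch in plain Python, not how `dlt` implements it internally; `select_new_documents` is a hypothetical helper, and the strict greater-than comparison is an assumption:

```python
from datetime import datetime
from typing import Any, Dict, Iterable, List, Tuple


def select_new_documents(docs: Iterable[Dict[str, Any]], cursor_field: str,
                         last_value: Any) -> Tuple[List[Dict[str, Any]], Any]:
    """Return documents whose cursor value exceeds `last_value`, plus the new maximum."""
    new_docs = [doc for doc in docs if doc[cursor_field] > last_value]
    new_last = max((doc[cursor_field] for doc in new_docs), default=last_value)
    return new_docs, new_last


movies = [
    {"title": "Alien", "lastupdated": datetime(2020, 9, 1)},
    {"title": "Heat", "lastupdated": datetime(2020, 9, 15)},
]
# Start from midnight on September 10, 2020, as in the "movies" example above.
new_docs, last_value = select_new_documents(movies, "lastupdated", datetime(2020, 9, 10))
print([doc["title"] for doc in new_docs], last_value)
```

Only "Heat" is selected here, and its timestamp becomes the starting point for the following run.
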
1 change: 1 addition & 0 deletions docs/website/sidebars.js
@@ -45,6 +45,7 @@ const sidebars = {
'dlt-ecosystem/verified-sources/hubspot',
'dlt-ecosystem/verified-sources/jira',
'dlt-ecosystem/verified-sources/matomo',
'dlt-ecosystem/verified-sources/mongodb',
'dlt-ecosystem/verified-sources/mux',
'dlt-ecosystem/verified-sources/notion',
'dlt-ecosystem/verified-sources/pipedrive',