---
slug: mongo-etl
title: "Dumpster diving for data: The MongoDB experience"
image: /img/data-dumpster.png
authors:
  name: Adrian Brudaru
  title: Open source data engineer
  url: https://github.com/adrianbr
  image_url: https://avatars.githubusercontent.com/u/5762770?v=4
tags: [mongodb, mongo, nosql, bson]
---
:::info
💡 TIL: BSON stands for binary JSON, not for BS Object Notation /s
:::


## What will you learn in this article?

This article looks at the process of extracting data from MongoDB and making it available in a
SQL store. We will focus on the difficulties of ingesting the data into a SQL database, and we will
look at MongoDB only from a source perspective.

The focus is the data in its different states, not the underlying technologies.

## Why the harsh title?

The title may sound harsh, but it accurately reflects the challenges often encountered when dealing
with MongoDB.

Also referred to as `/dev/null`, Mongo offers a flexible schema and document-based storage, which
can lead to data inconsistencies and complexities, resembling a metaphorical "dumpster" where data
might be scattered and difficult to organise.

Analytical consumers of MongoDB will be forced to invest extra effort in data modelling, schema
design, and query optimisation to ensure data quality and retrieval efficiency.

## Is this a common problem?

It's inevitable. An analytical system has multiple requirements which force us to move data from
places such as MongoDB to a SQL store.

Let’s look at those requirements:

- **Business user access**: Most data is used by business users to improve operations. Business
  users access data via dashboards or pivot-table-like interfaces that enable them to do custom
  aggregations. The tooling that exists for this is built for SQL, as the place where relational
  queries can be executed.
- **Ecosystem of integrations and tools**: In analytics, having a diverse ecosystem of integrations
and tools is crucial. SQL databases seamlessly fit into this ecosystem, offering compatibility
with a wide array of data warehousing, data integration, and data governance tools. This
comprehensive ecosystem enhances the analytics infrastructure, ensuring that data can be
efficiently managed, transformed, and accessed by various stakeholders.
- **Standardization for consistency**: Maintaining data consistency is paramount in analytics. SQL's
widespread adoption and standardized query language enable analysts and data professionals to work
with data consistently across different systems and platforms. This standardization ensures that
data is interpreted and manipulated uniformly, reducing the risk of errors and discrepancies in
analytical processes.
- **Data transformation & modelling capabilities**: Effective data transformation and modelling are
prerequisites for meaningful analytics. SQL provides a robust toolkit for these tasks, enabling
data professionals to perform complex operations such as joins, filtering, aggregation, and
intricate calculations. These capabilities are essential for preparing raw data into structured
formats that can be readily used for in-depth analysis, reporting, and decision-making in the
analytics domain.

So, after looking at what is needed for analytics, it becomes clear that going off the beaten path
will lead to some pretty gnarly limitations and outcomes.

## Mongo in particular: BSON vs JSON

How is Mongo different from semi-structured data like JSON, and is MongoDB particularly hard to
ingest from?

### BSON is for performance, JSON is for transmission

The differences stem from the fact that MongoDB uses BSON under the hood, as opposed to JSON. BSON
is a binary object notation optimised for performance, while JSON is a standard interchange format.

Similarly, Mongo supports custom and more complex data types, such as geospatial, dates, and regex,
that JSON does not. Additionally, BSON supports character encodings. All these benefits make
MongoDB a faster and better database, but the cost is additional hurdles that must be crossed
before we can use this data elsewhere.

So how do you solve these issues? Well, hopefully your development team didn't go overboard, and you
can simply convert the BSON to JSON. If you are unlucky, you will need to create your own mappers
that follow whatever your team did.
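
For the straightforward case, a minimal sketch could look like the following, using the
`bson.json_util` helper that ships with `pymongo` (the connection details and collection name are
placeholders):

```python
from pymongo import MongoClient
from bson import json_util  # ships with pymongo

# placeholder connection and collection, for illustration only
client = MongoClient("mongodb://localhost:27017")
collection = client["my_db"]["users"]

# json_util serializes BSON-specific types (ObjectId, datetime, regex, ...)
# into MongoDB Extended JSON; plain json.dumps() would fail on them.
json_docs = [json_util.dumps(doc) for doc in collection.find().limit(10)]
print(json_docs[0])
```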

## From JSON to DB

Once you have converted your Mongo BSON into JSON, you can rely on its wide support to get it
ingested.

JSON enjoys widespread support across various data processing tools and systems, making it a
versatile choice for data ingestion. With your data in JSON, you can seamlessly integrate it into
your database, leveraging its compatibility to efficiently manage and analyze your information.

### Cleaning and typing

Data typing is essential in ensuring data integrity. It involves assigning appropriate data types to
JSON fields, like converting numerical values into integers or floats, representing dates as
datetime types, and labeling text data as string data types. This step guarantees that the database
accurately stores and processes information.
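
As a purely illustrative sketch (the field names are made up), such a cleaning step could coerce
stringly-typed fields into proper Python types before loading:

```python
from datetime import datetime

def clean_record(record: dict) -> dict:
    # illustrative field names; coerce strings into proper types
    return {
        **record,
        "age": int(record["age"]),            # numeric string -> integer
        "score": float(record["score"]),      # numeric string -> float
        "signed_up_at": datetime.fromisoformat(record["signed_up_at"]),  # ISO string -> datetime
    }

cleaned = clean_record(
    {"name": "Alice", "age": "30", "score": "4.5", "signed_up_at": "2023-09-05T12:00:00"}
)
print(cleaned)
```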

### Do we unpack?

The choice between unpacking nested JSON into tables or keeping it as JSON depends on your specific
needs. Unpacking enhances query performance, indexing, and data manipulation within relational
databases. However, native JSON support in some databases can suffice for simpler scenarios,
preserving the original hierarchical structure. Your decision should align with data analysis,
retrieval requirements, and your chosen database's capabilities.

Simply put, if you plan to use the data, you should probably unpack it to benefit from what
relational databases have to offer. But if you simply need to store and retrieve the JSON, do not
convert it.

### Querying unpacked data is cheaper and more robust than maintaining WET `json_extract()` code

Unpacking nested JSON into separate tables within a relational database is essential for robustness
and query efficiency. Relational databases are optimized for tabular data and typed columns, which
makes handling complex nested structures directly both challenging and error-prone.

By breaking down nested JSON into separate tables and establishing relationships through foreign
keys, the data becomes more structured, ensuring robust data management and enhancing query
efficiency. This simplification streamlines data retrieval and manipulation, aligning it with
standard SQL operations for efficient and effective use.
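
To make the contrast concrete, here is a rough sketch using DuckDB (the destination used in the
example further below); the table and column names are hypothetical, and it assumes DuckDB's JSON
extension is available:

```python
import duckdb

con = duckdb.connect()

# a stand-in raw table holding one JSON document per row (hypothetical)
con.execute("""
    CREATE TABLE raw_docs AS
    SELECT '{"name": "Alice", "job": {"title": "Data Scientist"}}' AS doc
""")

# querying the nested document: every consumer repeats the extraction path
nested = con.sql("SELECT json_extract_string(doc, '$.job.title') AS job__title FROM raw_docs")

# versus an unpacked, typed table: plain column access
con.execute("CREATE TABLE json_data AS SELECT 'Alice' AS name, 'Data Scientist' AS job__title")
unpacked = con.sql("SELECT job__title FROM json_data")

print(nested.fetchall(), unpacked.fetchall())
```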

## Start using `dlt` to load Mongo to SQL today

To help with the challenges of loading Mongo data, we created a dlt source that reads your Mongo
collections and throws flat SQL tables out the other side.

The benefit of using dlt is that you get flat tables in your SQL database that adapt to match the
Mongo schema.

Here's a code explanation of how it works under the hood:

1. It grabs data from Mongo and turns it into JSON.

1. From JSON, dlt leverages schema inference and evolution to make sense of the data. Here is an
   example of how this nested data could look:

```python
data = {
    'id': 1,
    'name': 'Alice',
    'job': {
        "company": "ScaleVector",
        "title": "Data Scientist",
    },
    'children': [
        {
            'id': 1,
            'name': 'Eve'
        },
        {
            'id': 2,
            'name': 'Wendy'
        }
    ]
}
```

1. We can load the data to a supported destination declaratively:

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name='from_json',
    destination='duckdb',
    dataset_name='mydata',
    full_refresh=True,
)

# dlt works with lists of dicts, so wrap data in a list
load_info = pipeline.run([data], table_name="json_data")
print(load_info)
```

1. Now we can use the data. These are the two resulting tables:

*json_data*

| index | id | name | job\_\_company | job\_\_title | \_dlt_load_id | \_dlt_id |
| ----- | --- | ----- | -------------- | -------------- | ----------------- | -------------- |
| 0 | 1 | Alice | ScaleVector | Data Scientist | 1693922245.602667 | 0ZbCzK7Ra2tWMQ |

*json_data\_\_children*

| index | id | name | \_dlt_parent_id | \_dlt_list_idx | \_dlt_id |
| ----- | --- | ----- | --------------- | -------------- | -------------- |
| 0 | 1 | Eve | 0ZbCzK7Ra2tWMQ | 0 | TjzpGZ+dwrrQhg |
| 1 | 2 | Wendy | 0ZbCzK7Ra2tWMQ | 1 | RdqpN1luoKxQTA |

Note that the original JSON got unpacked into tables that are now joinable via generated keys:
`child._dlt_parent_id = parent._dlt_id`.
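
For example, with the duckdb destination you could join the child rows back to their parents
roughly like this (a sketch, assuming the default duckdb file named after the pipeline):

```python
import duckdb

# the duckdb destination writes to <pipeline_name>.duckdb by default (assumption)
con = duckdb.connect("from_json.duckdb")

con.sql("""
    SELECT p.name AS parent, c.name AS child
    FROM mydata.json_data AS p
    JOIN mydata.json_data__children AS c
      ON c._dlt_parent_id = p._dlt_id
""").show()
```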

Read more about it here:
[Mongo verified source.](https://dlthub.com/docs/dlt-ecosystem/verified-sources/mongodb.md)
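
If you want a head start, usage of the verified source looks roughly like the sketch below. The
module, source, and parameter names (`mongodb`, `connection_url`, `database`, `collection_names`)
are best-effort assumptions here, so check the linked docs for the exact signature:

```python
import dlt
# after `dlt init mongodb duckdb`, the verified source code lives in your project
from mongodb import mongodb  # assumed module and source name

pipeline = dlt.pipeline(
    pipeline_name="mongo_to_duckdb",
    destination="duckdb",
    dataset_name="mongo_data",
)

# assumed parameter names; credentials normally go into .dlt/secrets.toml
source = mongodb(
    connection_url="mongodb://localhost:27017",
    database="my_db",
    collection_names=["users"],
)

load_info = pipeline.run(source)
print(load_info)
```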

What are you waiting for?

- Dive into our [Getting Started.](https://dlthub.com/docs/getting-started)
- [Join the ⭐Slack Community⭐ for discussion and help!](https://join.slack.com/t/dlthub-community/shared_invite/zt-1slox199h-HAE7EQoXmstkP_bTqal65g)