---
slug: mongo-etl
title: "Dumpster diving for data: The MongoDB experience"
image: /img/data-dumpster.png
authors:
  name: Adrian Brudaru
  title: Open source data engineer
  url: https://github.com/adrianbr
  image_url: https://avatars.githubusercontent.com/u/5762770?v=4
tags: [mongodb, mongo, nosql, bson]
---

:::info
💡 TIL: BSON stands for binary JSON, not for BS Object Notation /s
:::

## What will you learn in this article?

This article looks at the process of extracting data from MongoDB and making it available in a SQL store. We will focus on the difficulties of ingesting the data into a SQL database, and look at MongoDB only from a source perspective.

The focus is on the data in its different states, not the underlying technologies.

## Why the harsh title?

The title may sound harsh, but it accurately reflects the challenges often encountered when dealing with MongoDB.

Also referred to as `/dev/null`, Mongo offers a flexible schema and document-based storage, which can lead to data inconsistencies and complexities, resembling a metaphorical "dumpster" where data might be scattered and difficult to organise.

Analytical consumers of MongoDB will be forced to invest extra effort in data modelling, schema design, and query optimisation to ensure data quality and retrieval efficiency.

## Is this a common problem?

It's inevitable. An analytical system has multiple requirements which force us to move data from places such as MongoDB to a SQL store.

Let’s look at those requirements:

- **Business user access**: Most data is used by business users to improve operations. Business users access data via dashboards, or pivot-table-like interfaces that enable them to do custom aggregations. The tooling that exists to do this is created for SQL, as a place where relational queries can be executed.
- **Ecosystem of integrations and tools**: In analytics, having a diverse ecosystem of integrations and tools is crucial. SQL databases seamlessly fit into this ecosystem, offering compatibility with a wide array of data warehousing, data integration, and data governance tools. This comprehensive ecosystem enhances the analytics infrastructure, ensuring that data can be efficiently managed, transformed, and accessed by various stakeholders.
- **Standardization for consistency**: Maintaining data consistency is paramount in analytics. SQL's widespread adoption and standardized query language enable analysts and data professionals to work with data consistently across different systems and platforms. This standardization ensures that data is interpreted and manipulated uniformly, reducing the risk of errors and discrepancies in analytical processes.
- **Data transformation & modelling capabilities**: Effective data transformation and modelling are prerequisites for meaningful analytics. SQL provides a robust toolkit for these tasks, enabling data professionals to perform complex operations such as joins, filtering, aggregation, and intricate calculations. These capabilities are essential for preparing raw data into structured formats that can be readily used for in-depth analysis, reporting, and decision-making in the analytics domain.

So, after looking at what is needed for analytics, it becomes clear that going off the beaten path will lead to some pretty gnarly limitations and outcomes.

## Mongo in particular: BSON vs JSON

How is Mongo different from semi-structured data like JSON, and is MongoDB particularly hard to ingest from?

### BSON is for performance, JSON is for transmission

The differences stem from the fact that MongoDB uses BSON under the hood, as opposed to JSON. BSON is a binary object notation optimised for performance, while JSON is a standard interchange format.

Mongo also supports custom and more complex data types, such as geospatial, dates, regex, etc., that JSON does not. Additionally, BSON supports character encodings. All these benefits enable MongoDB to be a faster and better database, but the cost is additional hurdles that must be crossed before we can use this data elsewhere.

So how do you solve these issues? Well, hopefully your development team didn't go overboard, and you can simply convert the BSON to JSON. If you are unlucky, you will need to create your own mappers that follow whatever your team did.

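If your collections stick to standard BSON types, the `bson` package that ships with `pymongo` already knows how to serialise them. A minimal sketch of that happy path (the document and field names here are invented for illustration):

```python
from datetime import datetime

from bson import ObjectId, json_util

# A document as it might come back from pymongo, containing BSON-specific types
doc = {
    "_id": ObjectId("64f1f3b2e4b0a1a2b3c4d5e6"),
    "name": "Alice",
    "signed_up_at": datetime(2023, 9, 1, 12, 30),
}

# json_util renders BSON types (ObjectId, datetime, ...) as JSON-safe representations
json_str = json_util.dumps(doc)
print(json_str)
```
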
## From JSON to DB

Once you have converted your Mongo BSON into JSON, you are able to use its wide support to have it ingested.

JSON enjoys widespread support across various data processing tools and systems, making it a versatile choice for data ingestion. With your data in JSON, you can seamlessly integrate it into your database, leveraging its compatibility to efficiently manage and analyze your information.

### Cleaning and typing

Data typing is essential in ensuring data integrity. It involves assigning appropriate data types to JSON fields, like converting numerical values into integers or floats, representing dates as datetime types, and labeling text data as string data types. This step guarantees that the database accurately stores and processes information.

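As a toy illustration of what typing means in practice (the field names and values are made up), this is the kind of coercion that has to happen between raw JSON and a typed table:

```python
from datetime import datetime

# A record as it arrives from JSON: everything is a loosely typed string
raw = {"id": "1", "signed_up_at": "2023-09-01", "revenue": "129.90"}

# Coerce each field into the type the database column should have
typed = {
    "id": int(raw["id"]),
    "signed_up_at": datetime.fromisoformat(raw["signed_up_at"]),
    "revenue": float(raw["revenue"]),
}
print(typed)
```

With `dlt`, this inference happens for you, but the principle is the same.
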
### Do we unpack?

The choice between unpacking nested JSON into tables or keeping it as JSON depends on your specific needs. Unpacking enhances query performance, indexing, and data manipulation within relational databases. However, native JSON support in some databases can suffice for simpler scenarios, preserving the original hierarchical structure. Your decision should align with your data analysis and retrieval requirements, and your chosen database's capabilities.

Simply put, if you plan to use the data, you should probably unpack it to benefit from what relational databases have to offer. But if you simply need to store and retrieve the JSON, do not convert it.

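If you fall into the "just store and retrieve it" camp, `dlt` can be told not to unpack at all: the `max_table_nesting` argument on a source caps how deep nested objects are turned into child tables. A minimal sketch (the source, resource, and sample document are invented):

```python
import dlt

# max_table_nesting=0 keeps nested objects and lists as JSON in a single column
# instead of generating child tables for them.
@dlt.source(max_table_nesting=0)
def mongo_docs():
    @dlt.resource
    def documents():
        yield {"id": 1, "job": {"company": "ScaleVector", "title": "Data Scientist"}}

    return documents
```
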
### Querying unpacked data is cheaper and more robust than maintaining WET `json_extract()` code

Unpacking nested JSON into separate tables within a relational database is essential for robustness and query efficiency. Relational databases are optimized for tabular data and typed columns, making it challenging and error-prone to handle complex nested structures directly.

By breaking down nested JSON into separate tables and establishing relationships through foreign keys, the data becomes more structured, ensuring robust data management and enhancing query efficiency. This simplification streamlines data retrieval and manipulation, aligning it with standard SQL operations for efficient and effective use.

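To make the contrast concrete, here is a small DuckDB sketch (the tables and values are invented, and it assumes a recent DuckDB where the `json` extension loads automatically): every query against the raw document repeats the path-extraction logic, while the unpacked table exposes a plain, typed column.

```python
import duckdb

con = duckdb.connect()

# The whole document kept as a single JSON string in one column...
con.execute("CREATE TABLE raw_docs (doc VARCHAR)")
con.execute("""INSERT INTO raw_docs VALUES ('{"name": "Alice", "job": {"company": "ScaleVector"}}')""")

# ...versus the same data unpacked into typed columns
con.execute("CREATE TABLE people (name VARCHAR, job__company VARCHAR)")
con.execute("INSERT INTO people VALUES ('Alice', 'ScaleVector')")

# Every question asked of the JSON column repeats the path-digging logic
print(con.execute(
    "SELECT json_extract_string(doc, '$.job.company') FROM raw_docs"
).fetchall())

# Against unpacked columns the same question is a plain column reference
print(con.execute("SELECT job__company FROM people").fetchall())
```
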
## Start using `dlt` to load Mongo to SQL today

To help with the challenges of loading Mongo data, we created a `dlt` source that reads your Mongo collections and throws flat SQL tables on the other side.

The benefit of using `dlt` is that you get flat tables in your SQL database that adapt to match the Mongo schema.

Here's a code-level explanation of how it works under the hood:

1. It grabs data from Mongo and turns it into JSON.

1. From JSON, `dlt` leverages schema inference and evolution to make sense of the data. Here is an example of how this nested data could look:

```python
data = {
    'id': 1,
    'name': 'Alice',
    'job': {
        "company": "ScaleVector",
        "title": "Data Scientist",
    },
    'children': [
        {
            'id': 1,
            'name': 'Eve'
        },
        {
            'id': 2,
            'name': 'Wendy'
        }
    ]
}
```

1. We can load the data to a supported destination declaratively:

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name='from_json',
    destination='duckdb',
    dataset_name='mydata',
    full_refresh=True,
)
# dlt works with lists of dicts, so wrap data into a list
load_info = pipeline.run([data], table_name="json_data")
print(load_info)
```

1. Now we can use the data. These are the two resulting tables:

*json_data*

| index | id | name | job\_\_company | job\_\_title | \_dlt_load_id | \_dlt_id |
| ----- | --- | ----- | -------------- | -------------- | ----------------- | -------------- |
| 0 | 1 | Alice | ScaleVector | Data Scientist | 1693922245.602667 | 0ZbCzK7Ra2tWMQ |

*json_data\_\_children*

| index | id | name | \_dlt_parent_id | \_dlt_list_idx | \_dlt_id |
| ----- | --- | ----- | --------------- | -------------- | -------------- |
| 0 | 1 | Eve | 0ZbCzK7Ra2tWMQ | 0 | TjzpGZ+dwrrQhg |
| 1 | 2 | Wendy | 0ZbCzK7Ra2tWMQ | 1 | RdqpN1luoKxQTA |

Note that the original JSON got unpacked into tables that are now joinable via generated keys `child._dlt_parent_id = parent._dlt_id`.

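If you want to check that join from Python, `dlt` exposes a SQL client bound to the pipeline's dataset. A small sketch, assuming the `pipeline` object from the snippet above is still in scope:

```python
# Query the loaded tables through the pipeline's SQL client;
# it connects to the same duckdb destination and dataset we loaded into.
with pipeline.sql_client() as client:
    rows = client.execute_sql(
        """
        SELECT p.name AS parent, c.name AS child
        FROM json_data AS p
        JOIN json_data__children AS c
          ON c._dlt_parent_id = p._dlt_id
        """
    )
    print(rows)
```
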
Read more about it here:
[Mongo verified source.](https://dlthub.com/docs/dlt-ecosystem/verified-sources/mongodb.md)

What are you waiting for?

- Dive into our [Getting Started.](https://dlthub.com/docs/getting-started)
- [Join the ⭐Slack Community⭐ for discussion and help!](https://join.slack.com/t/dlthub-community/shared_invite/zt-1slox199h-HAE7EQoXmstkP_bTqal65g)