From 0aa05c473e04d1cc028296a6dec83c27189aa4ba Mon Sep 17 00:00:00 2001 From: Adrian Date: Tue, 24 Oct 2023 11:57:55 +0200 Subject: [PATCH 01/14] blog dlt deepnote draft --- docs/website/blog/2023-10-25-dlt-deepnote.md | 131 +++---------------- 1 file changed, 20 insertions(+), 111 deletions(-) diff --git a/docs/website/blog/2023-10-25-dlt-deepnote.md b/docs/website/blog/2023-10-25-dlt-deepnote.md index c0909c0685..2b0af7db99 100644 --- a/docs/website/blog/2023-10-25-dlt-deepnote.md +++ b/docs/website/blog/2023-10-25-dlt-deepnote.md @@ -21,7 +21,7 @@ What’s in this article: 1. [⌛The Problem; The bulk of time spent in a data science project is on the transformation of data itself.](#⌛The-Problem;-The-bulk-of-time-spent-in-a-data-science-project-is-on-the-transformation-of-data-itself.) 1. [The usual flow of data for data science projects](#-The-usual-flow-of-data-for-data-science-projects) - 2. [A peek into the datasets 👀](#A-peek-into-the-datasets-👀) + 2. [A peak into the datasets 👀](#A-peak-into-the-datasets-👀) 2. [⚰️The Classical Solution; using pandas to model complicated data for your analytics workflows isn’t the fastest way out.](#⚰️The-Classical-Solution;-using-pandas-to-model-complicated-data-for-your-analytics-workflows-isn’t-the-fastest-way-out.) 3. [💫The Revised Solution; Revisualizing the flow of data with dlt & Deepnote](#💫The-Revised-Solution;-Revisualizing-the-flow-of-data-with-dlt-&-Deepnote) 1. [Introducing dlt; the data cleaner I wish I had](#Introducing-dlt-the-data-cleaner-I-wish-I-had) @@ -51,7 +51,7 @@ like, let’s list down the steps we usually undergo. ### The usual flow of data for data science projects -![usual flow](/img/blog_deepnote_usual_flow.png) +![usual flow](/img/blog_deepnote_usual_flow.gif) We sign up for our jobs because we enjoy the last two activities the most. These parts have all the pretty charts, the flashy animations, and, if the stars align, include watching your @@ -59,19 +59,11 @@ hunches turn out to be statistically significant! However, the journey to reach these stages is stretched much longer due to the time spent on data formats and pipelines. It would be such a load off my mind if they would get sorted themselves and we could skip to the good part. Sure, ipython notebooks with `pandas` and `numpy` help us in getting along, but what if there was something even simpler? Let’s explore different solutions. -### A peek into the datasets 👀 +### A peak into the datasets 👀 The two datasets that we are using are nested json files, with further lists of dictionaries, and are survey results with wellness indicators for women. Here’s what the first element of one dataset looks like: -
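The interactive Deepnote embed that showed this record cannot render here; a minimal sketch of that first peek, with `violence_data.json` standing in for the actual file name (which the post does not give), would be:

```python
import json

# Load one of the two nested survey files (file name is a stand-in).
with open("violence_data.json") as f:
    violence_data = json.load(f)

# The first record: a dictionary that itself holds lists of dictionaries.
print(json.dumps(violence_data[0], indent=2))
```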
Looks like it is a nested json, nested further with more lists of dictionaries.

@@ -79,38 +71,17 @@ Looks like it is a nested json, nested further with more lists of dictionaries.

# ⚰️The Classical Solution; using pandas to model complicated data for your analytics workflows isn’t the fastest way out.

Usually, `json_normalize` can be used to unnest a json file while loading it into pandas. However, the nested lists inside dictionaries do not unravel quite well. Nonetheless, let’s see how the pandas normalizer works on our dataset.
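The notebook cell behind the embed below isn’t reproduced here; roughly, it would have been a one-liner along these lines, run on the records loaded above:

```python
import pandas as pd

# Flatten the records: nested dicts become dot-separated columns,
# but any lists inside them survive untouched, as list objects in single cells.
df = pd.json_normalize(violence_data)
df.head()
```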
+https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/c4409a7a7440435fa1bd16bcebcd8c9b?height=537.3999938964844

Conclusion from looking at the data: pandas successfully flattened the dictionaries but did not unnest the lists. Perhaps that is because unpacking these lists would require creating new tables, essentially designing an entire data model, and that is something pandas does not do for us. So, to keep using pandas, let’s flatten the data further into arrays and tables ourselves, paying particular attention to the amount of code this task requires.

To start off, using the `pandas` `explode` function might be a good way to flatten these lists:
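Something like the sketch below; `marriage_related` is one of the nested columns mentioned later in this post, and the exact set of exploded columns is an assumption:

```python
# One row per list element instead of one list per row;
# each element is still a dictionary at this point.
exploded = df.explode("marriage_related").reset_index(drop=True)
exploded.head()
```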
---

And now, putting one of the nested variables into a pandas data frame:
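Again a sketch rather than the notebook’s exact code: the dictionaries from the exploded column get normalized into a tidy frame of their own.

```python
# Drop empty entries, then turn the dicts into their own table.
records = exploded["marriage_related"].dropna().tolist()
marriage_related = pd.json_normalize(records)
marriage_related.head()
```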
And this little exercise needs to be repeated for each of the columns that we had to “explode” in the first place.

Our next step could be using a visualization package like `matplotlib`, and other `pandas` and `numpy` based functions, to conduct a thorough exploratory analysis on the data. However, if we use the code above and plot two variables against each other on a scatter plot, for example, `marriage_related` and `work_related`, then joining this data wouldn’t be simple. We would have to be wary of the list indices (or something that can be used as foreign keys) that match rows together across the different tables. Otherwise, we would end up with mismatched data points on the scatter plot. We’ll get more into this in the [Know your data model](https://www.notion.so/DLT-Deepnote-in-women-s-wellness-and-violence-trends-A-Visual-Analysis-07de2cab78f84a23a46e03cddf885320?pvs=21) section.

# 💫The Revised Solution; Revisualizing the flow of data with dlt & Deepnote

We can reimagine the flow of data with dlt and Deepnote in the following way:

![revised flow](/img/blog_deepnote_revised_flow.png)

We leave the loading of the raw data to dlt, and the data exploration and visualization to the Deepnote interface.

## Introducing dlt; the data cleaner I wish I had

Imagine this: you initialize a data pipeline in one line of code, and pass complicated raw data in another to be modelled, unnested and formatted. Now, watch that come to reality:
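The embedded cell boils down to two calls from dlt’s public API; the pipeline and dataset names below are our own stand-ins, and `wellness_data` is assumed to have been loaded the same way as `violence_data` above:

```python
import dlt

# One line to initialize a pipeline with duckdb as the destination ...
pipeline = dlt.pipeline(destination="duckdb", dataset_name="women_survey")

# ... and one per raw dataset: dlt infers the schema, unnests the lists
# of dictionaries into child tables, and loads everything.
pipeline.run(violence_data, table_name="violence")
pipeline.run(wellness_data, table_name="wellness")
```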
And that’s pretty much it. Notice the difference in the effort you had to put in?

-The data has been loaded into a pipeline with `duckdb` as its destination.
-`duckdb` was chosen as it is an OLAP database, perfect for usage in our analytics workflow.
-The data has been unnested and formatted. To explore what exactly was stored in that destination,
-a `duckdb` connector (`conn`) is set up, and the `SHOW ALL TABLES` command is executed.
+The data has been loaded into a pipeline with `duckdb` as its destination. `duckdb` was chosen as it is an OLAP database, perfect for usage in our analytics workflow. The data has been unnested and formatted. To explore what exactly was stored in that destination, a `duckdb` connector (`conn`) is set up, and the `SHOW ALL TABLES` command is executed.
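A sketch of that exploration step, assuming the duckdb destination wrote its usual local database file named after the pipeline:

```python
import duckdb

# Connect to the database file the pipeline just wrote.
conn = duckdb.connect(f"{pipeline.pipeline_name}.duckdb")

# List every table dlt created, including the unnested child tables.
print(conn.sql("SHOW ALL TABLES").df())
```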
At first glance, we understand that both the datasets, `violence` and `wellness`, have their own base tables. One of the child tables is shown below:
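For example, peeking at one of them, with the schema name matching the `dataset_name` we assumed above:

```python
conn.sql("SELECT * FROM women_survey.violence__value LIMIT 5").df()
```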
### Know your data model; connect the unnested tables using dlt’s pre-assigned primary and foreign keys:

The child tables, like `violence__value` or `wellness__age_related`, are the unnested lists of dictionaries from the original json files. The `_dlt_id` column, as shown in the table above, serves as a **primary key**. This will help us connect the child tables with ease. The parent-id column in the child tables serves as a **foreign key** to the base tables. If more than one child table needs to be joined together, we make use of the `_dlt_list_idx` column;
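As a sketch of such a join (in current dlt versions the child-side key is named `_dlt_parent_id`; the non-key columns here are made up for illustration):

```python
# Join a child table back to its base table, and a second child table
# of the same parent on both the parent id and the list position.
joined = conn.sql("""
    SELECT v.value, g.gender
    FROM women_survey.violence__value AS v
    JOIN women_survey.violence AS p
      ON v._dlt_parent_id = p._dlt_id
    JOIN women_survey.violence__gender AS g
      ON  g._dlt_parent_id = v._dlt_parent_id
      AND g._dlt_list_idx  = v._dlt_list_idx
""").df()
```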
## Deepnote - the iPython Notebook turned Dashboarding tool

Take your average Notebook experience, combine it with the powers of a collaborative and interactive dashboarding tool, and you get Deepnote. Now that we move to the analytics portion of this article, let’s check out how Deepnote helps along the way.

### One step visualizations

At this point, we would probably move towards a `plt.plot` or `plt.bar` function. However, with Deepnote, the little Visualize button on top of any data frame will help us jump straight to an easy figure. Clicking on the Visualize button takes you to a new cell block, where you can choose your parameters, types of charts, and customization settings in the sidebar. The following chart is built from the `joined` data frame we defined above.

And a stacked bar chart came into existence! A little note about the query results; the **value** column corresponds to how much (in %) a person justifies violence against women. An interesting yet disturbing insight from the above plot: in many countries, women condone violence against women as often as, if not more often than, men do!

The next figure slices the data further by gender and demographic. The normalized bar chart is sliced by two parameters, gender and demographic. The two colors represent genders, the different widths of the rectangles represent the different demographics, and the different heights represent that demographic’s justification of violence in %. The taller the rectangle, the greater the % average. It tells us that most women think that violence against them is justified for the reasons mentioned, as shown by the fact that the blue rectangles make up more than 50% of respondents who say ‘yes’ to each reason shown on the x-axis. If you hover over the blocks, you will see the gender and demographic represented in each differently sized rectangle, alongside that subset’s percentage of justification of violence. ~~The plot shows you that women who are uneducated or have lower levels of education & employment have higher levels (averages) of justifications of violence.~~

Let’s examine the differences in women’s responses for two demographic types: employment vs education levels. ~~To understand, hover over the blocks from the top of the graph, and see the difference in averages between women who are employed for cash vs employed for kind. Furthermore, look at the difference between women who have received at least a secondary or higher education and compare that to those who have received no education.~~ We can see that the blue rectangles for “employed for cash” vs “employed for kind” don’t really vary in size. However, when we select “higher” vs “no education”, we see that the former is merely a speck when compared to the rectangles for the latter. This comparison demonstrates that education plays a much larger role than employment in influencing women’s levels of violence justification.
Let’s look at one last plot created by Deepnote for the other dataset with wellness indicators. The upward trend shows us that women are much less likely to have a final say on their health if they are less educated.
# 🌍 Clustering countries based on their wellness indicators

Lastly, based on these indicators of wellness and violence concerning women, let’s use KMeans to cluster these countries and see which countries the algorithm groups together. The intersection of the ‘countries’ columns in both datasets results in the availability of data for 45 countries. The columns used in this model indicate, per country:

- the average level of justification of violence,
- the average years of education received,
- the percentage of women who have a final say over their health, and
- the percentage of women who have control over their finances.

The color bar shows us which color is associated with which cluster; cluster 1, for instance, is purple, and cluster 2 is blue.

To understand briefly what each cluster represents, let’s look at the averages for each indicator across all clusters;
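The clustering cell is only embedded in the original; under the description above (four indicators per country, four clusters) it would look something like this sketch, with the table and column names assumed:

```python
from sklearn.cluster import KMeans

# One row per country with the four indicators described above.
features = conn.sql("""
    SELECT country, avg_violence_justification, avg_years_of_education,
           pct_final_say_health, pct_control_finances
    FROM country_indicators
""").df().set_index("country")

kmeans = KMeans(n_clusters=4, random_state=42, n_init="auto")
features["cluster"] = kmeans.fit_predict(features) + 1  # label clusters 1-4

# Averages for each indicator across all clusters.
print(features.groupby("cluster").mean())
```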
This tells us that, according to these datasets, cluster 2 (highlighted blue) is the cluster performing best in terms of the wellness of women. It has the lowest levels of justifications of violence, the highest average years of education, and almost the highest percentage of women who have control over their health and finances. This is followed by clusters 3, 1, and 4 respectively; countries like the Philippines, Peru, Mozambique, Indonesia and Bolivia are comparatively better than countries like South Africa, Egypt, Zambia, Guatemala & all South Asian countries with regard to how they treat women.

From 026caf69fa620cff681b0a474ab5f0ce7636659d Mon Sep 17 00:00:00 2001
From: Adrian
Date: Tue, 24 Oct 2023 18:02:24 +0200
Subject: [PATCH 02/14] fix typo

--- docs/website/blog/2023-10-25-dlt-deepnote.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/website/blog/2023-10-25-dlt-deepnote.md b/docs/website/blog/2023-10-25-dlt-deepnote.md
index 2b0af7db99..f24abacf0b 100644
--- a/docs/website/blog/2023-10-25-dlt-deepnote.md
+++ b/docs/website/blog/2023-10-25-dlt-deepnote.md
@@ -51,7 +51,7 @@ like, let’s list down the steps we usually undergo.

### The usual flow of data for data science projects

-![usual flow](/img/blog_deepnote_usual_flow.gif)
+![usual flow](/img/blog_deepnote_usual_flow.png)

We sign up for our jobs because we enjoy the last two activities the most.

From 53e44c51a50c12724b37be5569cbf12cca209adb Mon Sep 17 00:00:00 2001
From: Adrian
Date: Tue, 24 Oct 2023 18:08:01 +0200
Subject: [PATCH 03/14] format iframe

--- docs/website/blog/2023-10-25-dlt-deepnote.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/website/blog/2023-10-25-dlt-deepnote.md b/docs/website/blog/2023-10-25-dlt-deepnote.md
index f24abacf0b..852a01695b 100644
--- a/docs/website/blog/2023-10-25-dlt-deepnote.md
+++ b/docs/website/blog/2023-10-25-dlt-deepnote.md
@@ -63,7 +63,7 @@ However, the journey to reach these stages is stretched much longer due to the t

The two datasets that we are using are nested json files, with further lists of dictionaries, and are survey results with wellness indicators for women. Here’s what the first element of one dataset looks like:

Looks like it is a nested json, nested further with more lists of dictionaries.

From 709c0099dd924779d7ff868b0922a410adc4cc0c Mon Sep 17 00:00:00 2001
From: Adrian
Date: Wed, 25 Oct 2023 12:31:02 +0200
Subject: [PATCH 06/14] format iframe

--- docs/website/blog/2023-10-25-dlt-deepnote.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/website/blog/2023-10-25-dlt-deepnote.md b/docs/website/blog/2023-10-25-dlt-deepnote.md
index b0593799af..1bb355bfe9 100644
--- a/docs/website/blog/2023-10-25-dlt-deepnote.md
+++ b/docs/website/blog/2023-10-25-dlt-deepnote.md
@@ -63,7 +63,8 @@ However, the journey to reach these stages is stretched much longer due to the t

The two datasets that we are using are nested json files, with further lists of dictionaries, and are survey results with wellness indicators for women. Here’s what the first element of one dataset looks like:

Looks like it is a nested json, nested further with more lists of dictionaries.
From f1d0d9940c2535fbf9f0f015824536cdbde99f3e Mon Sep 17 00:00:00 2001
From: Adrian
Date: Wed, 25 Oct 2023 14:05:53 +0200
Subject: [PATCH 07/14] format iframe

--- docs/website/blog/2023-10-25-dlt-deepnote.md | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/docs/website/blog/2023-10-25-dlt-deepnote.md b/docs/website/blog/2023-10-25-dlt-deepnote.md
index 1bb355bfe9..9e69fa2c6d 100644
--- a/docs/website/blog/2023-10-25-dlt-deepnote.md
+++ b/docs/website/blog/2023-10-25-dlt-deepnote.md
@@ -63,7 +63,14 @@ However, the journey to reach these stages is stretched much longer due to the t

The two datasets that we are using are nested json files, with further lists of dictionaries, and are survey results with wellness indicators for women. Here’s what the first element of one dataset looks like:
Looks like it is a nested json, nested further with more lists of dictionaries.

@@ -72,7 +79,13 @@ Looks like it is a nested json, nested further with more lists of dictionaries.

Usually, `json_normalize` can be used to unnest a json file while loading it into pandas. However, the nested lists inside dictionaries do not unravel quite well. Nonetheless, let’s see how the pandas normalizer works on our dataset.

-https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/c4409a7a7440435fa1bd16bcebcd8c9b?height=537.3999938964844
+ Conclusion from looking at the data: pandas successfully flattened dictionaries but did not unnest lists. Perhaps because in order to unpack these lists, one might need to create new tables, essentially create a data model entirely. But, that is something pandas does not do for us. So, to be able to use it, let’s flatten the data further into arrays and tables. Particularly, let’s pay attention to the amount of code required to achieve this task. From d10a88108c877a896ea49b4c59003020f2cbb0a6 Mon Sep 17 00:00:00 2001 From: hibajamal Date: Wed, 25 Oct 2023 14:35:34 +0200 Subject: [PATCH 08/14] Update 2023-10-25-dlt-deepnote.md i think we just needed to remove two sentences. I did that --- docs/website/blog/2023-10-25-dlt-deepnote.md | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/docs/website/blog/2023-10-25-dlt-deepnote.md b/docs/website/blog/2023-10-25-dlt-deepnote.md index 9e69fa2c6d..5a0816b9f5 100644 --- a/docs/website/blog/2023-10-25-dlt-deepnote.md +++ b/docs/website/blog/2023-10-25-dlt-deepnote.md @@ -146,13 +146,9 @@ At this point, we would probably move towards a `plt.plot` or `plt.bar` function And a stacked bar chart came into existence! A little note about the query results; the **value** column corresponds to how much (in %) a person justifies violence against women. An interesting yet disturbing insight from the above plot: in many countries, women condone violence against women as often if not more often than men do! -The next figure slices the data further by gender and demographic. The normalized bar chart is sliced by 2 parameters, gender and demographic. The two colors represent genders. While different widths of the rectangles represent the different demographics, and the different heights represent that demographic’s justification of violence in %. The taller the rectangle, the greater the % average. It tells us that most women think that violence on them is justified for the reasons mentioned, as shown by the fact that the blue rectangles make up more than 50% of respondents who say ‘yes’ to each reason shown on the x-axis. If you hover over the blocks, you will see the gender and demographic represented in each differently sized rectangle, alongside that subset’s percentage of justification of violence. ~~The plot shows you that women who are uneducated or have lower levels of education & employment have higher levels (averages) of justifications of violence.~~ +The next figure slices the data further by gender and demographic. The normalized bar chart is sliced by 2 parameters, gender and demographic. The two colors represent genders. While different widths of the rectangles represent the different demographics, and the different heights represent that demographic’s justification of violence in %. The taller the rectangle, the greater the % average. It tells us that most women think that violence on them is justified for the reasons mentioned, as shown by the fact that the blue rectangles make up more than 50% of respondents who say ‘yes’ to each reason shown on the x-axis. If you hover over the blocks, you will see the gender and demographic represented in each differently sized rectangle, alongside that subset’s percentage of justification of violence. -Let’s examine the differences in women’s responses for two demographic types: employment vs education levels. - -~~To understand, hover over the blocks from the top of the graph, and see the difference in averages between women who are employed for cash vs employed for kind. 
Furthermore, look at the difference between women who have received at least a secondary or higher education and compare that to those who have received no education.~~ - -We can see that the blue rectangles for “employed for cash” vs “employed for kind” don’t really vary in size. However, when we select “higher” vs “no education”, we see that the former is merely a speck when compared to the rectangles for the latter. This comparison between employment and education differences demonstrates that education plays a much larger role in likelihood to influence women’s levels of violence justification. +Let’s examine the differences in women’s responses for two demographic types: employment vs education levels. We can see that the blue rectangles for “employed for cash” vs “employed for kind” don’t really vary in size. However, when we select “higher” vs “no education”, we see that the former is merely a speck when compared to the rectangles for the latter. This comparison between employment and education differences demonstrates that education plays a much larger role in likelihood to influence women’s levels of violence justification. From 969db4f6620ecdce1ead1533973b15699c0fdbd3 Mon Sep 17 00:00:00 2001 From: hibajamal Date: Mon, 30 Oct 2023 12:36:03 +0100 Subject: [PATCH 09/14] added links and iframe padding/responsiveness --- .../examples/incremental_loading/__init__.py | 0 .../incremental_loading/code/.dlt/config.toml | 0 .../code/.dlt/secrets.toml | 4 + .../incremental_loading/code/zendesk.py | 126 +++++++++++++++++ .../docs/examples/transformers/__init__.py | 0 .../transformers/code/.dlt/config.toml | 16 +++ .../examples/transformers/code/pokemon.py | 61 ++++++++ docs/website/blog/2023-10-25-dlt-deepnote.md | 133 ++++++++++++------ docs/website/package.json | 5 +- 9 files changed, 298 insertions(+), 47 deletions(-) create mode 100644 docs/examples/docs/examples/incremental_loading/__init__.py create mode 100644 docs/examples/docs/examples/incremental_loadingdocs/examples/incremental_loading/code/.dlt/config.toml create mode 100644 docs/examples/docs/examples/incremental_loadingdocs/examples/incremental_loading/code/.dlt/secrets.toml create mode 100644 docs/examples/docs/examples/incremental_loadingdocs/examples/incremental_loading/code/zendesk.py create mode 100644 docs/examples/docs/examples/transformers/__init__.py create mode 100644 docs/examples/docs/examples/transformersdocs/examples/transformers/code/.dlt/config.toml create mode 100644 docs/examples/docs/examples/transformersdocs/examples/transformers/code/pokemon.py diff --git a/docs/examples/docs/examples/incremental_loading/__init__.py b/docs/examples/docs/examples/incremental_loading/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/docs/examples/docs/examples/incremental_loadingdocs/examples/incremental_loading/code/.dlt/config.toml b/docs/examples/docs/examples/incremental_loadingdocs/examples/incremental_loading/code/.dlt/config.toml new file mode 100644 index 0000000000..e69de29bb2 diff --git a/docs/examples/docs/examples/incremental_loadingdocs/examples/incremental_loading/code/.dlt/secrets.toml b/docs/examples/docs/examples/incremental_loadingdocs/examples/incremental_loading/code/.dlt/secrets.toml new file mode 100644 index 0000000000..4dec919c06 --- /dev/null +++ b/docs/examples/docs/examples/incremental_loadingdocs/examples/incremental_loading/code/.dlt/secrets.toml @@ -0,0 +1,4 @@ +[sources.zendesk.credentials] +password = "" +subdomain = "" +email = "" \ No newline at end of file diff --git 
a/docs/examples/docs/examples/incremental_loadingdocs/examples/incremental_loading/code/zendesk.py b/docs/examples/docs/examples/incremental_loadingdocs/examples/incremental_loading/code/zendesk.py new file mode 100644 index 0000000000..6370f29811 --- /dev/null +++ b/docs/examples/docs/examples/incremental_loadingdocs/examples/incremental_loading/code/zendesk.py @@ -0,0 +1,126 @@ +from typing import Iterator, Optional, Dict, Any, Tuple + +import dlt +from dlt.common import pendulum +from dlt.common.time import ensure_pendulum_datetime +from dlt.common.typing import TDataItem, TDataItems, TAnyDateTime +from dlt.extract.source import DltResource +from dlt.sources.helpers.requests import client + + +@dlt.source(max_table_nesting=2) +def zendesk_support( + credentials: Dict[str, str]=dlt.secrets.value, + start_date: Optional[TAnyDateTime] = pendulum.datetime(year=2000, month=1, day=1), # noqa: B008 + end_date: Optional[TAnyDateTime] = None, +): + """ + Retrieves data from Zendesk Support for tickets events. + + Args: + credentials: Zendesk credentials (default: dlt.secrets.value) + start_date: Start date for data extraction (default: 2000-01-01) + end_date: End date for data extraction (default: None). + If end time is not provided, the incremental loading will be + enabled, and after the initial run, only new data will be retrieved. + + Returns: + DltResource. + """ + # Convert start_date and end_date to Pendulum datetime objects + start_date_obj = ensure_pendulum_datetime(start_date) + end_date_obj = ensure_pendulum_datetime(end_date) if end_date else None + + # Convert Pendulum datetime objects to Unix timestamps + start_date_ts = start_date_obj.int_timestamp + end_date_ts: Optional[int] = None + if end_date_obj: + end_date_ts = end_date_obj.int_timestamp + + # Extract credentials from secrets dictionary + auth = (credentials["email"], credentials["password"]) + subdomain = credentials["subdomain"] + url = f"https://{subdomain}.zendesk.com" + + # we use `append` write disposition, because objects in ticket_events endpoint are never updated + # so we do not need to merge + # we set primary_key so allow deduplication of events by the `incremental` below in the rare case + # when two events have the same timestamp + @dlt.resource(primary_key="id", write_disposition="append") + def ticket_events( + timestamp: dlt.sources.incremental[int] = dlt.sources.incremental( + "timestamp", + initial_value=start_date_ts, + end_value=end_date_ts, + allow_external_schedulers=True, + ), + ): + # URL For ticket events + # 'https://d3v-dlthub.zendesk.com/api/v2/incremental/ticket_events.json?start_time=946684800' + event_pages = get_pages( + url=url, + endpoint="/api/v2/incremental/ticket_events.json", + auth=auth, + data_point_name="ticket_events", + params={"start_time": timestamp.last_value}, + ) + for page in event_pages: + yield page + # stop loading when using end_value and end is reached. + # unfortunately, Zendesk API does not have the "end_time" parameter, so we stop iterating ourselves + if timestamp.end_out_of_range: + return + + return ticket_events + + +def get_pages( + url: str, + endpoint: str, + auth: Tuple[str, str], + data_point_name: str, + params: Optional[Dict[str, Any]] = None, +): + """ + Makes a request to a paginated endpoint and returns a generator of data items per page. + + Args: + url: The base URL. + endpoint: The url to the endpoint, e.g. /api/v2/calls + auth: Credentials for authentication. + data_point_name: The key which data items are nested under in the response object (e.g. 
calls) + params: Optional dict of query params to include in the request. + + Returns: + Generator of pages, each page is a list of dict data items. + """ + # update the page size to enable cursor pagination + params = params or {} + params["per_page"] = 1000 + headers = None + + # make request and keep looping until there is no next page + get_url = f"{url}{endpoint}" + while get_url: + response = client.get( + get_url, headers=headers, auth=auth, params=params + ) + response.raise_for_status() + response_json = response.json() + result = response_json[data_point_name] + yield result + + get_url = None + # See https://developer.zendesk.com/api-reference/ticketing/ticket-management/incremental_exports/#json-format + if not response_json["end_of_stream"]: + get_url = response_json["next_page"] + + +if __name__ == "__main__": + # create dlt pipeline + pipeline = dlt.pipeline( + pipeline_name="zendesk", destination="duckdb", dataset_name="zendesk_data" + ) + + load_info = pipeline.run(zendesk_support()) + print(load_info) \ No newline at end of file diff --git a/docs/examples/docs/examples/transformers/__init__.py b/docs/examples/docs/examples/transformers/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/docs/examples/docs/examples/transformersdocs/examples/transformers/code/.dlt/config.toml b/docs/examples/docs/examples/transformersdocs/examples/transformers/code/.dlt/config.toml new file mode 100644 index 0000000000..a366f34edf --- /dev/null +++ b/docs/examples/docs/examples/transformersdocs/examples/transformers/code/.dlt/config.toml @@ -0,0 +1,16 @@ +[runtime] +log_level="WARNING" + +[extract] +# use 2 workers to extract sources in parallel +worker=2 +# allow 10 async items to be processed in parallel +max_parallel_items=10 + +[normalize] +# use 3 worker processes to process 3 files in parallel +workers=3 + +[load] +# have 50 concurrent load jobs +workers=50 \ No newline at end of file diff --git a/docs/examples/docs/examples/transformersdocs/examples/transformers/code/pokemon.py b/docs/examples/docs/examples/transformersdocs/examples/transformers/code/pokemon.py new file mode 100644 index 0000000000..ce8cc0142c --- /dev/null +++ b/docs/examples/docs/examples/transformersdocs/examples/transformers/code/pokemon.py @@ -0,0 +1,61 @@ +import dlt +from dlt.sources.helpers import requests + + +@dlt.source(max_table_nesting=2) +def source(pokemon_api_url: str): + """""" + + # note that we deselect `pokemon_list` - we do not want it to be loaded + @dlt.resource(write_disposition="replace", selected=False) + def pokemon_list(): + """Retrieve a first page of Pokemons and yield it. 
We do not retrieve all the pages in this example""" + yield requests.get(pokemon_api_url).json()["results"] + + # transformer that retrieves a list of objects in parallel + @dlt.transformer + def pokemon(pokemons): + """Yields details for a list of `pokemons`""" + + # @dlt.defer marks a function to be executed in parallel + # in a thread pool + @dlt.defer + def _get_pokemon(_pokemon): + return requests.get(_pokemon["url"]).json() + + # call and yield the function result normally, the @dlt.defer takes care of parallelism + for _pokemon in pokemons: + yield _get_pokemon(_pokemon) + + # a special case where just one item is retrieved in transformer + # a whole transformer may be marked for parallel execution + @dlt.transformer + @dlt.defer + def species(pokemon_details): + """Yields species details for a pokemon""" + species_data = requests.get(pokemon_details["species"]["url"]).json() + # link back to pokemon so we have a relation in loaded data + species_data["pokemon_id"] = pokemon_details["id"] + # just return the results, if you yield, + # generator will be evaluated in main thread + return species_data + + # create two simple pipelines with | operator + # 1. send list of pokemons into `pokemon` transformer to get pokemon details + # 2. send pokemon details into `species` transformer to get species details + # NOTE: dlt is smart enough to get data from pokemon_list and pokemon details once + + return ( + pokemon_list | pokemon, + pokemon_list | pokemon | species + ) + +if __name__ == "__main__": + # build duck db pipeline + pipeline = dlt.pipeline( + pipeline_name="pokemon", destination="duckdb", dataset_name="pokemon_data" + ) + + # the pokemon_list resource does not need to be loaded + load_info = pipeline.run(source("https://pokeapi.co/api/v2/pokemon")) + print(load_info) \ No newline at end of file diff --git a/docs/website/blog/2023-10-25-dlt-deepnote.md b/docs/website/blog/2023-10-25-dlt-deepnote.md index 5a0816b9f5..5ee17eb5d5 100644 --- a/docs/website/blog/2023-10-25-dlt-deepnote.md +++ b/docs/website/blog/2023-10-25-dlt-deepnote.md @@ -19,20 +19,18 @@ tags: [dbt runner, dbt cloud runner, dbt core runner] What’s in this article: -1. [⌛The Problem; The bulk of time spent in a data science project is on the transformation of data itself.](#⌛The-Problem;-The-bulk-of-time-spent-in-a-data-science-project-is-on-the-transformation-of-data-itself.) - 1. [The usual flow of data for data science projects](#-The-usual-flow-of-data-for-data-science-projects) - 2. [A peak into the datasets 👀](#A-peak-into-the-datasets-👀) -2. [⚰️The Classical Solution; using pandas to model complicated data for your analytics workflows isn’t the fastest way out.](#⚰️The-Classical-Solution;-using-pandas-to-model-complicated-data-for-your-analytics-workflows-isn’t-the-fastest-way-out.) -3. [💫The Revised Solution; Revisualizing the flow of data with dlt & Deepnote](#💫The-Revised-Solution;-Revisualizing-the-flow-of-data-with-dlt-&-Deepnote) - 1. [Introducing dlt; the data cleaner I wish I had](#Introducing-dlt-the-data-cleaner-I-wish-I-had) - 1. [Know your data model; connect the unnested tables using dlt’s pre-assigned primary and foreign keys:](#Know-your-data-model-connect-the-unnested-tables-using-dlt-s-pre-assigned-primary-and-foreign-keys) +1. [⌛The Problem; The bulk of time spent in a data science project is on the transformation of data itself.](#data-trans1) + 1. [The usual flow of data for data science projects](#ds-project-usual-flow) + 2. [A peak into the datasets 👀](#dataset-peak) +2. 
[⚰️The Classical Solution; using pandas to model complicated data for your analytics workflows isn’t the fastest way out.](#classical-solution) +3. [💫The Revised Solution; Revisualizing the flow of data with dlt & Deepnote](#revised-solution) + 1. [Introducing dlt; the data cleaner I wish I had](#introducing-dlt) 2. [Deepnote - the iPython Notebook turned Dashboarding tool](#Deepnote-the-iPython-Notebook-turned-Dashboarding-tool) - 1. [One step visualizations](#One-step-visualizations) 4. [🌍Clustering countries based on their wellness indicators](#Clustering-countries-based-on-their-wellness-indicators) -5. [🔧Technical Conclusion; dlt & Deepnote are the data science dream team](#Technical-Conclusion-dlt-Deepnote-are-the-data-science-dream-team) -6. [🎆Analytical Conclusion; Leave women in dangerous situations for extended periods of time and they’ll begin to justify the violence committed against themselves!](#Analytical-Conclusion-Leave-women-in-dangerous-situations-for-extended-periods-of-time-and-they-ll-begin-to-justify-the-violence-committed-against-themselves!) +5. [🔧Technical Conclusion; dlt & Deepnote are the data science dream team](#technical-conclusion) +6. [🎆Analytical Conclusion; Leave women in dangerous situations for extended periods of time and they’ll begin to justify the violence committed against themselves!](#analytical-conclusion) -# ⌛The Problem; The bulk of time spent in a data science project is on the transformation of data itself. +# ⌛The Problem; The bulk of time spent in a data science project is on the transformation of data itself. If you are a data analyst, data scientist or a machine learning engineer, then more likely than not, you spend more time fixing data pipelines or data formats then you do @@ -49,7 +47,7 @@ Unfortunately, before we get to writing this `select` statement, we need to go t some very important but time consuming first steps. To describe what this journey looks like, let’s list down the steps we usually undergo. -### The usual flow of data for data science projects +### The usual flow of data for data science projects ![usual flow](/img/blog_deepnote_usual_flow.png) @@ -59,49 +57,55 @@ hunches turn out to be statistically significant! However, the journey to reach these stages is stretched much longer due to the time spent on data formats and pipelines. It would be such a load off my mind if they would get sorted themselves and we could skip to the good part. Sure, ipython notebooks with `pandas` and `numpy` help us in getting along, but what if there was something even simpler? Let’s explore different solutions. -### A peak into the datasets 👀 +### A peak into the datasets 👀 The two datasets that we are using are nested json files, with further lists of dictionaries, and are survey results with wellness indicators for women. Here’s what the first element of one dataset looks like: -

Looks like it is a nested json, nested further with more lists of dictionaries.

- - -Looks like it is a nested json, nested further with more lists of dictionaries. - -# ⚰️The Classical Solution; using pandas to model complicated data for your analytics workflows isn’t the fastest way out. +# ⚰️The Classical Solution; using pandas to model complicated data for your analytics workflows isn’t the fastest way out. Usually, `json_normalize` can be used to unnest a json file while loading it into pandas. However, the nested lists inside dictionaries do not unravel quite well. Nonetheless, let’s see how the pandas normalizer works on our dataset. -
- -Conclusion from looking at the data: pandas successfully flattened dictionaries but did not unnest lists. Perhaps because in order to unpack these lists, one might need to create new tables, essentially create a data model entirely. But, that is something pandas does not do for us. So, to be able to use it, let’s flatten the data further into arrays and tables. Particularly, let’s pay attention to the amount of code required to achieve this task. +

Conclusion from looking at the data: pandas successfully flattened the dictionaries but did not unnest the lists. Perhaps that is because unpacking these lists would require creating new tables, essentially designing an entire data model, and that is something pandas does not do for us. So, to keep using pandas, let’s flatten the data further into arrays and tables ourselves, paying particular attention to the amount of code this task requires.

To start off, using the `pandas` `explode` function might be a good way to flatten these lists:

And now, putting one of the nested variables into a pandas data frame:


And this little exercise needs to be repeated for each of the columns that we had to “explode” in the first place.

Our next step could be using a visualization package like `matplotlib`, and other `pandas` and `numpy` based functions, to conduct a thorough exploratory analysis on the data. However, if we use the code above and plot two variables against each other on a scatter plot, for example, `marriage_related` and `work_related`, then joining this data wouldn’t be simple. We would have to be wary of the list indices (or something that can be used as foreign keys) that match rows together across the different tables. Otherwise, we would end up with mismatched data points on the scatter plot. We’ll get more into this in the [Know your data model](#know-your-data-model) section.

# 💫The Revised Solution; Revisualizing the flow of data with dlt & Deepnote

We can reimagine the flow of data with dlt and Deepnote in the following way:

![revised flow](/img/blog_deepnote_revised_flow.png)

We leave the loading of the raw data to dlt, and the data exploration and visualization to the Deepnote interface.

## Introducing dlt; the data cleaner I wish I had

Imagine this: you initialize a data pipeline in one line of code, and pass complicated raw data in another to be modelled, unnested and formatted. Now, watch that come to reality:
And that’s pretty much it. Notice the difference in the effort you had to put in? The data has been loaded into a pipeline with `duckdb` as its destination. `duckdb` was chosen as it is an OLAP database, perfect for usage in our analytics workflow. The data has been unnested and formatted. To explore what exactly was stored in that destination, a `duckdb` connector (`conn`) is set up, and the `SHOW ALL TABLES` command is executed. - +
In a first look, we understand that both the datasets `violence` and `wellness` have their own base tables. One of the child tables is shown below: - +
### Know your data model; connect the unnested tables using dlt’s pre-assigned primary and foreign keys:

The child tables, like `violence__value` or `wellness__age_related`, are the unnested lists of dictionaries from the original json files. The `_dlt_id` column, as shown in the table above, serves as a **primary key**. This will help us connect the child tables with ease. The parent-id column in the child tables serves as a **foreign key** to the base tables. If more than one child table needs to be joined together, we make use of the `_dlt_list_idx` column;
## Deepnote - the iPython Notebook turned Dashboarding tool

Take your average Notebook experience, combine it with the powers of a collaborative and interactive dashboarding tool, and you get Deepnote. Now that we move to the analytics portion of this article, let’s check out how Deepnote helps along the way.

### One step visualizations

At this point, we would probably move towards a `plt.plot` or `plt.bar` function. However, with Deepnote, the little Visualize button on top of any data frame will help us jump straight to an easy figure. Clicking on the Visualize button takes you to a new cell block, where you can choose your parameters, types of charts, and customization settings in the sidebar. The following chart is built from the `joined` data frame we defined above.

And a stacked bar chart came into existence! A little note about the query results; the **value** column corresponds to how much (in %) a person justifies violence against women. An interesting yet disturbing insight from the above plot: in many countries, women condone violence against women as often as, if not more often than, men do!

The next figure slices the data further by gender and demographic. The normalized bar chart is sliced by two parameters, gender and demographic. The two colors represent genders, the different widths of the rectangles represent the different demographics, and the different heights represent that demographic’s justification of violence in %. The taller the rectangle, the greater the % average. It tells us that most women think that violence against them is justified for the reasons mentioned, as shown by the fact that the blue rectangles make up more than 50% of respondents who say ‘yes’ to each reason shown on the x-axis. If you hover over the blocks, you will see the gender and demographic represented in each differently sized rectangle, alongside that subset’s percentage of justification of violence.

Let’s examine the differences in women’s responses for two demographic types: employment vs education levels. We can see that the blue rectangles for “employed for cash” vs “employed for kind” don’t really vary in size. However, when we select “higher” vs “no education”, we see that the former is merely a speck when compared to the rectangles for the latter. This comparison demonstrates that education plays a much larger role than employment in influencing women’s levels of violence justification.
Let’s look at one last plot created by Deepnote for the other dataset with wellness indicators. The upward moving trend shows us that women are much less likely to have a final say on their health if they are less educated. - +
# 🌍 Clustering countries based on their wellness indicators

Lastly, based on these indicators of wellness and violence concerning women, let’s use KMeans to cluster these countries and see which countries the algorithm groups together. The intersection of the ‘countries’ columns in both datasets results in the availability of data for 45 countries. The columns used in this model indicate, per country:

- the average level of justification of violence,
- the average years of education received,
- the percentage of women who have a final say over their health, and
- the percentage of women who have control over their finances.

The color bar shows us which color is associated with which cluster; cluster 1, for instance, is purple, and cluster 2 is blue.

To understand briefly what each cluster represents, let’s look at the averages for each indicator across all clusters;
This tells us that according to these datasets, cluster 2 (highlighted blue) is the cluster that is performing the best in terms of wellness of women. It has the lowest levels of justifications of violence, highest average years of education, and almost the highest percentage of women who have control over their health and finances. This is followed by clusters 3, 1, and 4 respectively; countries like the Philippines, Peru, Mozambique, Indonesia and Bolivia are comparatively better than countries like South Africa, Egypt, Zambia, Guatemala & all South Asian countries, in regards to how they treat women. -## 🔧Technical Conclusion; dlt & Deepnote are the data science dream team +## 🔧Technical Conclusion; dlt & Deepnote are the data science dream team It is safe to say that dlt is a dream come true for all data scientists who do not want to 1. W**ait for a data engineer to fix data pipeline issues** and model discrepancies, or 2. **Spend time studying the format of a dataset** and find ways to structure and unnest it. The library supports many different [sources](https://dlthub.com/docs/dlt-ecosystem/verified-sources/) and can pick up the dreadful data cleaning tasks you don’t want to do. @@ -186,7 +229,7 @@ Next, let’s talk about the coding tool of choice for this article—Deepnote. Using both of these tools together made the critical tasks of data loading and data exploration much easier for a data scientist or analyst by automating much of the upfront data preparation steps! -## 🎆Analytical Conclusion; Leave women in dangerous situations for extended periods of time and they’ll begin to justify the violence committed against themselves! +## 🎆Analytical Conclusion; Leave women in dangerous situations for extended periods of time and they’ll begin to justify the violence committed against themselves! The data we explored in the plots above demonstrated that women often justify violent acts committed against themselves almost as equally as men do. Particularly, women who are less educated are more likely to fall into the shackles of these beliefs when compared to their more educated counterparts. diff --git a/docs/website/package.json b/docs/website/package.json index 77ecbcc9dd..0e9248f177 100644 --- a/docs/website/package.json +++ b/docs/website/package.json @@ -3,10 +3,11 @@ "version": "0.0.0", "private": true, "scripts": { + "docusaurus": "docusaurus", - "start": "PYTHONPATH=. poetry run pydoc-markdown && node tools/update_snippets.js && docusaurus start", + "start": "poetry run pydoc-markdown && node tools/update_snippets.js && docusaurus start", "watch-snippets": "node tools/update_snippets.js --watch", - "build": "PYTHONPATH=. poetry run pydoc-markdown && node tools/update_snippets.js && docusaurus build", + "build": "set PYTHONPATH='C:\\Python312\\'. poetry run pydoc-markdown && node tools/update_snippets.js && docusaurus build", "build:netlify": "PYTHONPATH=. 
pydoc-markdown && node tools/update_snippets.js && docusaurus build --out-dir build/docs", "swizzle": "docusaurus swizzle", "clear": "docusaurus clear", From e2fc2222b42755aa39a584dfd808f8e22e23ab5d Mon Sep 17 00:00:00 2001 From: Adrian Date: Wed, 25 Oct 2023 16:46:27 +0200 Subject: [PATCH 10/14] small format improvement --- docs/website/blog/2023-10-25-dlt-deepnote.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/website/blog/2023-10-25-dlt-deepnote.md b/docs/website/blog/2023-10-25-dlt-deepnote.md index 5ee17eb5d5..864353a36d 100644 --- a/docs/website/blog/2023-10-25-dlt-deepnote.md +++ b/docs/website/blog/2023-10-25-dlt-deepnote.md @@ -135,7 +135,7 @@ And that’s pretty much it. Notice the difference in the effort you had to put The data has been loaded into a pipeline with `duckdb` as its destination. `duckdb` was chosen as it is an OLAP database, perfect for usage in our analytics workflow. The data has been unnested and formatted. To explore what exactly was stored in that destination, a `duckdb` connector (`conn`) is set up, and the `SHOW ALL TABLES` command is executed. -