format iframes
Adrian authored and Adrian committed Oct 25, 2023
1 parent cceb8a0 commit 1378382
Showing 1 changed file with 95 additions and 14 deletions.
109 changes: 95 additions & 14 deletions docs/website/blog/2023-10-25-dlt-deepnote.md
@@ -63,7 +63,7 @@ However, the journey to reach these stages is stretched much longer due to the t

The two datasets that we are using are nested json files, with further lists of dictionaries, and are survey results with wellness indicators for women. Here’s what the first element of one dataset looks like:

<div style={{ position: 'relative', paddingBottom: '80%' }}>
<div style={{ position: 'relative', paddingBottom: '50%' }}>
<iframe
src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/3a517be3788b446bb1380cd0e7df274e"
style={{ position: 'absolute', width: '100%', height: '100%' }}
@@ -79,7 +79,7 @@ Looks like it is a nested json, nested further with more lists of dictionaries.

Usually, `json_normalize` can be used to unnest a json file while loading it into pandas. However, it does not unravel the lists nested inside dictionaries. Nonetheless, let’s see how the pandas normalizer works on our dataset.
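
As a rough illustration of that limitation, here is a minimal sketch on a made-up record (the field names `country`, `survey` and `value` are hypothetical, not taken from the real datasets). The nested dictionary becomes dotted columns, while the list of dictionaries stays packed inside a single cell:

```python
import pandas as pd

# Hypothetical record: one nested dict plus one list of dicts.
records = [
    {
        "country": "A",
        "survey": {"year": 2015, "source": "demo"},
        "value": [
            {"gender": "F", "pct": 30.0},
            {"gender": "M", "pct": 25.0},
        ],
    }
]

# Nested dicts are flattened into "survey.year", "survey.source", ...
flat = pd.json_normalize(records)
print(flat.columns.tolist())

# ...but the list of dicts is left untouched as a Python list in one cell.
print(type(flat.loc[0, "value"]))
```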

<div style={{ position: 'relative', paddingBottom: '80%' }}>
<div style={{ position: 'relative', paddingBottom: '60%' }}>
<iframe
src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/c4409a7a7440435fa1bd16bcebcd8c9b"
style={{ position: 'absolute', width: '100%', height: '100%' }}
@@ -91,11 +91,26 @@ Conclusion from looking at the data: pandas successfully flattened dictionaries

To start off, using the `pandas` `explode` function might be a good way to flatten these lists:

<iframe src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/c4409a7a7440435fa1bd16bcebcd8c9b?height=537.3999938964844" height="537.4"></iframe>

<div style={{ position: 'relative', paddingBottom: '60%' }}>
<iframe
src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/ad8635a80e784717844308f44a41e703"
style={{ position: 'absolute', width: '100%', height: '100%' }}
></iframe>
</div>

---
And now, putting one of the nested variables into a pandas data frame:

<iframe src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/84726ac7a1464f27b6374a8af85cfe65?height=807.3999938964844" height="807.4"></iframe>

<div style={{ position: 'relative', paddingBottom: '120%' }}>
<iframe
src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/84726ac7a1464f27b6374a8af85cfe65"
style={{ position: 'absolute', width: '100%', height: '100%' }}
></iframe>
</div>

And this little exercise needs to be repeated for each of the columns that we had to “explode” in the first place.
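
To give a feel for that manual work, here is a small sketch on hypothetical data (the column names `country`, `value`, `gender` and `pct` are assumptions for illustration). Each nested column needs its own explode, normalize and re-join step:

```python
import pandas as pd

# Hypothetical records with one nested list column.
records = [
    {"country": "A", "value": [{"gender": "F", "pct": 30.0}, {"gender": "M", "pct": 25.0}]},
    {"country": "B", "value": [{"gender": "F", "pct": 10.0}]},
]

df = pd.json_normalize(records)

# explode() gives each list element its own row; the cells still hold dicts.
exploded = df.explode("value").reset_index(drop=True)

# Expand those dicts into real columns, then stitch them back onto the parent columns.
details = pd.json_normalize(exploded["value"].tolist())
result = pd.concat([exploded.drop(columns="value"), details], axis=1)

print(result)
```

With several nested columns, this block would have to be written once per column, which is the repetition described above.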

@@ -113,26 +128,70 @@ We leave the loading of the raw data to dlt, while we leave the data exploration

Imagine this: you initialize a data pipeline in one line of code, and in another you pass in complicated raw data to be modelled, unnested and formatted. Now, watch that become reality:

<iframe src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/4afdf1ecf4164b219614bd87c7b21df0?height=191" height="191"></iframe>

<div style={{ position: 'relative', paddingBottom: '30%' }}>
<iframe
src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/4afdf1ecf4164b219614bd87c7b21df0"
style={{ position: 'absolute', width: '100%', height: '100%' }}
></iframe>
</div>
<div style={{ position: 'relative', paddingBottom: '30%' }}>
<iframe
src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/0f80dc1a5917406abe87ce59b46cc2e7"
style={{ position: 'absolute', width: '100%', height: '100%' }}
></iframe>
</div>




<iframe src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/0f80dc1a5917406abe87ce59b46cc2e7?height=169.98749923706055" height="169.99"></iframe>

And that’s pretty much it. Notice the difference in the effort you had to put in?

The data has been loaded into a pipeline with `duckdb` as its destination.
`duckdb` was chosen as it is an OLAP database, perfect for usage in our analytics workflow.
The data has been unnested and formatted. To explore what exactly was stored in that destination,
a `duckdb` connector (`conn`) is set up, and the `SHOW ALL TABLES` command is executed.


<div style={{ position: 'relative', paddingBottom: '80%' }}>
<iframe
src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/5400d02a3ccd4973ae25e3d3b76a5ead"
style={{ position: 'absolute', width: '100%', height: '100%' }}
></iframe>
</div>



<iframe src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/5400d02a3ccd4973ae25e3d3b76a5ead?height=574.3875122070312" height="574.39"></iframe>

At first glance, we can see that both datasets, `violence` and `wellness`, have their own base tables. One of the child tables is shown below:

<iframe src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/a4a1702a0582492f8f78a3fa753c4d57?height=502.6000061035156" height="502.6"></iframe>

<div style={{ position: 'relative', paddingBottom: '50%' }}>
<iframe
src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/a4a1702a0582492f8f78a3fa753c4d57"
style={{ position: 'absolute', width: '100%', height: '100%' }}
></iframe>
</div>


### Know your data model; connect the unnested tables using dlt’s pre-assigned primary and foreign keys:

The child tables, like `violence__value` or `wellness__age_related`, are the unnested lists of dictionaries from the original json files. The `_dlt_id` column, as shown in the table above, serves as a **primary key**. This will help us connect the child tables with ease. The `_dlt_parent_id` column in the child tables serves as a **foreign key** to the base tables. If more than one child table needs to be joined together, we make use of the `_dlt_list_idx` column:

<iframe src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/e46c971e6265418382aa690dae0abc23?height=610.6000061035156" height="610.6"></iframe>

<div style={{ position: 'relative', paddingBottom: '60%' }}>
<iframe
src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/e46c971e6265418382aa690dae0abc23"
style={{ position: 'absolute', width: '100%', height: '100%' }}
></iframe>
</div>
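
The key relationships can be sketched without dlt at all. The snippet below is only an illustration of the idea, not dlt's implementation: it builds a parent and a child table by hand from made-up data, uses Python's built-in `sqlite3` as a stand-in for `duckdb`, and assumes the parent-key column is called `_dlt_parent_id`. Table and column contents are hypothetical.

```python
import sqlite3
import uuid

# Hypothetical nested records; only the key layout mirrors what is described above.
raw = [
    {"country": "A", "value": [{"gender": "F", "pct": 30.0}, {"gender": "M", "pct": 25.0}]},
    {"country": "B", "value": [{"gender": "F", "pct": 10.0}]},
]

parents, children = [], []
for record in raw:
    parent_id = uuid.uuid4().hex  # stand-in for a generated row key
    parents.append((parent_id, record["country"]))
    # Each list element becomes a child row that remembers its parent and its position.
    for idx, item in enumerate(record["value"]):
        children.append((uuid.uuid4().hex, parent_id, idx, item["gender"], item["pct"]))

conn = sqlite3.connect(":memory:")  # sqlite3 used here instead of duckdb
conn.execute("CREATE TABLE violence (_dlt_id TEXT PRIMARY KEY, country TEXT)")
conn.execute(
    "CREATE TABLE violence__value ("
    "_dlt_id TEXT PRIMARY KEY, _dlt_parent_id TEXT, _dlt_list_idx INTEGER, "
    "gender TEXT, pct REAL)"
)
conn.executemany("INSERT INTO violence VALUES (?, ?)", parents)
conn.executemany("INSERT INTO violence__value VALUES (?, ?, ?, ?, ?)", children)

# Join child rows back to their base-table row through the foreign key.
rows = conn.execute(
    """
    SELECT p.country, c._dlt_list_idx, c.gender, c.pct
    FROM violence AS p
    JOIN violence__value AS c ON c._dlt_parent_id = p._dlt_id
    ORDER BY p.country, c._dlt_list_idx
    """
).fetchall()

for row in rows:
    print(row)
```

The list index is what keeps the original ordering of the nested items, which is why it is useful when several child tables have to be lined up against each other.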


## Deepnote - the iPython Notebook turned Dashboarding tool

@@ -146,15 +205,29 @@ At this point, we would probably move towards a `plt.plot` or `plt.bar` function

And a stacked bar chart came into existence! A little note about the query results; the **value** column corresponds to how much (in %) a person justifies violence against women. An interesting yet disturbing insight from the above plot: in many countries, women condone violence against women as often if not more often than men do!

The next figure is a normalized bar chart that slices the data further by two parameters, gender and demographic. The two colors represent genders, the widths of the rectangles represent the different demographics, and the heights represent each demographic’s justification of violence in %. The taller the rectangle, the greater the average %. The blue rectangles make up more than 50% of the respondents who say ‘yes’ to each reason shown on the x-axis, which tells us that women account for the larger share of those who think that violence against them is justified for the reasons mentioned. If you hover over the blocks, you will see the gender and demographic represented in each differently sized rectangle, alongside that subset’s percentage of justification of violence.

Let’s examine the differences in women’s responses for two demographic types: employment vs education levels. We can see that the blue rectangles for “employed for cash” vs “employed for kind” don’t really vary in size. However, when we select “higher” vs “no education”, we see that the former is merely a speck when compared to the rectangles for the latter. This comparison suggests that education plays a much larger role than employment in shaping women’s levels of violence justification.

<iframe src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/71a6385d51284d85a0c62474d5e430dc?height=547" height="547"></iframe>
<div style={{ position: 'relative', paddingBottom: '80%' }}>
<iframe
src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/71a6385d51284d85a0c62474d5e430dc"
style={{ position: 'absolute', width: '100%', height: '100%' }}
></iframe>
</div>



Let’s look at one last plot created by Deepnote for the other dataset with wellness indicators. The upward moving trend shows us that women are much less likely to have a final say on their health if they are less educated.

<iframe src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/ca6e638b94e448a1ade186a558984b78?height=591" height="591"></iframe>
<div style={{ position: 'relative', paddingBottom: '80%' }}>
<iframe
src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/ca6e638b94e448a1ade186a558984b78"
style={{ position: 'absolute', width: '100%', height: '100%' }}
></iframe>
</div>

# 🌍 Clustering countries based on their wellness indicators

@@ -174,7 +247,15 @@ The color bar shows us which color is associated to which cluster. Namely; 1: pu

To understand briefly what each cluster represents, let’s look at the averages for each indicator across all clusters:

<iframe src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/8e1b72a8f89c432994068666792e1a18?height=366.4" height="366.4"></iframe>
<div style={{ position: 'relative', paddingBottom: '30%' }}>
<iframe
src="https://embed.deepnote.com/5fc0e511-cc64-4c44-a71c-a36c8c18ef62/48645544ae4740ce8e49fb6e0c1db925/8e1b72a8f89c432994068666792e1a18"
style={{ position: 'absolute', width: '100%', height: '100%' }}
></iframe>
</div>
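
A per-cluster summary like this can be produced with a plain `groupby`. The sketch below uses invented countries, cluster labels and indicator names (`violence_justified_pct` and `years_of_education` are assumptions, not the notebook's actual columns):

```python
import pandas as pd

# Hypothetical indicator values and cluster labels, for illustration only.
countries = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],
        "cluster": [1, 1, 2, 2],
        "violence_justified_pct": [40.0, 30.0, 10.0, 20.0],
        "years_of_education": [4.0, 6.0, 10.0, 8.0],
    }
)

# Average each indicator within every cluster.
indicator_columns = ["violence_justified_pct", "years_of_education"]
cluster_means = countries.groupby("cluster")[indicator_columns].mean()

print(cluster_means)
```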



This tells us that, according to these datasets, cluster 2 (highlighted blue) is the cluster that is performing the best in terms of wellness of women. It has the lowest levels of justifications of violence, highest average years of education, and almost the highest percentage of women who have control over their health and finances. This is followed by clusters 3, 1, and 4 respectively; countries like the Philippines, Peru, Mozambique, Indonesia and Bolivia are comparatively better than countries like South Africa, Egypt, Zambia, Guatemala and all South Asian countries with regard to how they treat women.
