feat: update gapminder.json and add source information #580

Merged: 6 commits, Jul 16, 2024
31 changes: 30 additions & 1 deletion SOURCES.md
@@ -69,7 +69,36 @@ Transformed using `/scripts/flights.js`. Arrow file generated with [json2arrow](

Football match outcomes across multiple divisions from 2013 to 2017. This dataset is a subset of a larger dataset from https://github.com/openfootball/football.json. The subset was made such that there are records for all five chosen divisions over the time period.

## `gapminder-health-income.csv`, `gapminder.json`
## `gapminder.json`
### Source
- **Original Data**: [Gapminder Foundation](https://www.gapminder.org/)
- **URLs**:
  - Life Expectancy (v14): [Data](https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676) | [Reference](https://www.gapminder.org/data/documentation/gd004/)
  - Population (v7): [Data](https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676) | [Reference](https://www.gapminder.org/data/documentation/gd003/)
  - Fertility (v14): [Data](https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676) | [Reference](https://www.gapminder.org/data/documentation/gd008/)
  - Data Geographies (v2): [Data](https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158) | [Reference](https://www.gapminder.org/data/geo/)

- **Date Accessed**: July 11, 2024
- **License**: Creative Commons Attribution 4.0 International (CC BY 4.0) | [Reference](https://www.gapminder.org/free-material/)

### Description
This dataset combines key demographic indicators (life expectancy at birth, population, and fertility rate measured as babies per woman) for various countries from 1955 to 2005 at 5-year intervals. It also includes a 'cluster' column, a categorical variable grouping countries. Gapminder's [data documentation](https://www.gapminder.org/data/documentation/) notes that its philosophy is to fill data gaps with estimates and use current geographic boundaries for historical data. Gapminder states that it aims to "show people the big picture" rather than support detailed numeric analysis.

#### Columns:
1. `year` (type: integer): Years from 1955 to 2005 at 5-year intervals
2. `country` (type: string): Name of the country
3. `cluster` (type: integer): A categorical variable (values 0-5) grouping countries. See Revision Notes for details.
4. `pop` (type: integer): Population of the country
5. `life_expect` (type: float): Life expectancy in years
6. `fertility` (type: float): Fertility rate (average number of children per woman)
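
The column listing above can be turned into a small row-level check. The following is a minimal sketch, assuming rows shaped like the records in `gapminder.json`; `EXPECTED_TYPES`, `VALID_YEARS`, `check_record`, and the sample row (with made-up values) are hypothetical illustrations and are not part of this repository.

```python
# Hypothetical schema check; names and sample values are illustrative only.
EXPECTED_TYPES = {
    "year": int,
    "country": str,
    "cluster": int,
    "pop": int,
    "life_expect": (int, float),  # whole-number floats may load as int from JSON
    "fertility": (int, float),
}

# 1955, 1960, ..., 2005
VALID_YEARS = set(range(1955, 2006, 5))


def check_record(record):
    """Return a list of problems found in one row (empty list if none)."""
    problems = []
    for column, expected in EXPECTED_TYPES.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif isinstance(record[column], bool) or not isinstance(record[column], expected):
            problems.append(f"unexpected type for {column}")
    if record.get("year") not in VALID_YEARS:
        problems.append("year outside 1955-2005 at 5-year steps")
    cluster = record.get("cluster")
    if not (isinstance(cluster, int) and 0 <= cluster <= 5):
        problems.append("cluster outside 0-5")
    return problems


# Made-up row, not taken from the dataset.
sample = {"year": 1955, "country": "Exampleland", "cluster": 1,
          "pop": 1000, "life_expect": 50.0, "fertility": 4.5}
print(check_record(sample))
```

Running `check_record` over every row of a loaded file would flag rows that drift from the documented types or year range.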

### Revision Notes
1. Country Selection: The set of countries in this file matches the version of this dataset originally added to this collection in 2015. The specific criteria for country selection in that version are not known. Data for Aruba are no longer available in the new version. Hong Kong has been revised to Hong Kong, China in the new version.
2. Data Precision: The precision of float values may have changed from the original version. These changes reflect the most recent source data used for each indicator.
3. Regional Groupings: The 'cluster' column represents a regional mapping of countries corresponding to the 'six_regions' schema in Gapminder's Data Geographies dataset. To preserve continuity with previous versions of this dataset, we have retained the column name 'cluster' instead of renaming it to 'six_regions'. The six regions represented are:
`0: south_asia, 1: europe_central_asia, 2: sub_saharan_africa, 3: america, 4: east_asia_pacific, 5: middle_east_north_africa`.
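
For readers who want region names rather than integer codes, the mapping in the Revision Notes can be inverted into a lookup. This is a minimal sketch under stated assumptions: the `CLUSTER_TO_REGION` table is copied from the note above, while the `add_region` helper, the `six_regions` output key, and the sample row are hypothetical and not part of the dataset or its scripts.

```python
# Mapping copied from the Revision Notes; helper and sample row are illustrative.
CLUSTER_TO_REGION = {
    0: "south_asia",
    1: "europe_central_asia",
    2: "sub_saharan_africa",
    3: "america",
    4: "east_asia_pacific",
    5: "middle_east_north_africa",
}


def add_region(records):
    """Return copies of the rows with a 'six_regions' key derived from 'cluster'."""
    return [
        {**row, "six_regions": CLUSTER_TO_REGION[row["cluster"]]}
        for row in records
    ]


rows = [{"country": "Exampleland", "cluster": 3}]  # made-up row
print(add_region(rows))
```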

## `gapminder-health-income.csv`

## `github.csv`

2 changes: 1 addition & 1 deletion data/gapminder.json

Large diffs are not rendered by default.

117 changes: 117 additions & 0 deletions scripts/update_gapminder.py
@@ -0,0 +1,117 @@
"""
Gapminder Dataset Updater

This script updates the gapminder.json file in the vega-datasets repository
in a manner consistent with a minor release. It fetches current data from the source,
processes it, and then filters the results to match the countries and years in the
existing dataset. To ensure reproducibility and data consistency, the script fetches
the existing gapminder dataset from a specific commit.

The generated dataset is used in the following PR:
https://github.com/vega/vega-datasets/pull/580

Data sources:
- Google Sheets: Multiple sheets containing updated Gapminder data
- Vega-Datasets: Raw GitHub URL (commit: 05fcb7c07b1d76206856e75129fc1e79dc61735c)

"""

import pandas as pd
import json
import re

def google_sheet_to_pandas(sheet_url):
    key_match = re.search(r'/d/([a-zA-Z0-9-_]+)', sheet_url)
    gid_match = re.search(r'gid=(\d+)', sheet_url)
    sheet_key, gid = key_match.group(1), gid_match.group(1)
    csv_export_url = f"https://docs.google.com/spreadsheets/d/{sheet_key}/export?format=csv&gid={gid}"
    return pd.read_csv(csv_export_url)


# Gapminder datasets (as of July 2024) are stored in individual Google Sheets files.
# These files are linked from dataset-specific reference pages:
# - Life Expectancy: https://www.gapminder.org/data/documentation/gd004/
# - Population: https://www.gapminder.org/data/documentation/gd003/
# - Fertility: https://www.gapminder.org/data/documentation/gd008/
# - Data Geographies: https://www.gapminder.org/data/geo/

urls = [
    "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676",  # life expectancy v14 (retrieved July 11, 2024)
    "https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676",  # population v7 (retrieved July 11, 2024)
    "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676",  # fertility v14 (retrieved July 11, 2024)
    "https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158"  # data geographies v2 (retrieved July 11, 2024)
]

# Load dataframes from Google Sheets
df_life, df_pop, df_fert, df_region = [google_sheet_to_pandas(url) for url in urls]

# Load gapminder dataset directly from the raw GitHub URL
gapminder_url = "https://raw.githubusercontent.com/vega/vega-datasets/05fcb7c07b1d76206856e75129fc1e79dc61735c/data/gapminder.json"
df_gapminder = pd.read_json(gapminder_url)

# Prepare main dataframe
df_main = df_pop[['name', 'time', 'Population']].rename(columns={'Population': 'pop'})

# Merge other dataframes
df_main = df_main.merge(df_life[['name', 'time', 'Life expectancy ']], on=['name', 'time'])
df_main = df_main.merge(df_fert[['name', 'time', 'Babies per woman']], on=['name', 'time'])
df_main = df_main.merge(df_region[['name', 'six_regions']], on='name')

# Rename columns
df_main = df_main.rename(columns={
    'name': 'country',
    'time': 'year',
    'Life expectancy ': 'life_expect',
    'Babies per woman': 'fertility',
    'six_regions': 'region'
})

# Reorder columns
df_main = df_main[['year', 'country', 'region', 'pop', 'life_expect', 'fertility']]

# Convert year to int and filter years from 1955 to 2005 in increments of 5
df_main['year'] = df_main['year'].astype(int)
df_main = df_main[df_main['year'].between(1955, 2005) & (df_main['year'] % 5 == 0)]

# Sort the dataframe
df_main = df_main.sort_values(['country', 'year'])

# Create the cluster mapping
cluster_map = {
    'south_asia': 0,
    'europe_central_asia': 1,
    'sub_saharan_africa': 2,
    'america': 3,
    'east_asia_pacific': 4,
    'middle_east_north_africa': 5
}

# Add cluster column and drop region column
df_main['cluster'] = df_main['region'].map(cluster_map)
df_main = df_main.drop('region', axis=1)

# Reorder columns
column_order = ['year', 'country', 'cluster', 'pop', 'life_expect', 'fertility']
df_main = df_main[column_order]

# Rename Hong Kong to Hong Kong, China in df_gapminder
df_gapminder.loc[df_gapminder['country'] == 'Hong Kong', 'country'] = 'Hong Kong, China'

# Get the list of countries in df_gapminder
gapminder_countries = set(df_gapminder['country'])

# Keep only rows in df_main that have a country in gapminder_countries
df_main = df_main[df_main['country'].isin(gapminder_countries)]

# Convert population to integer to match data type of original version of the dataset (and handle potential errors)
df_main['pop'] = df_main['pop'].astype(int, errors='ignore')

# Convert DataFrame to list of dictionaries
data_list = df_main.to_dict(orient='records')

# Convert the list of dictionaries to JSON
json_data = json.dumps(data_list)

print(json_data)
with open('gapminder.json', 'w') as f:
    json.dump(data_list, f)
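
The `google_sheet_to_pandas` helper in the script rewrites a Google Sheets edit URL into a CSV export URL before handing it to `pd.read_csv`. The sketch below isolates that URL rewrite so it can be checked without network access. It is an illustration, not the script's code: the function name `sheet_csv_export_url` and the `ValueError` on a missing sheet key or `gid` are assumptions added here.

```python
import re


def sheet_csv_export_url(sheet_url):
    """Build a CSV export URL from a Google Sheets edit URL.

    Hypothetical variant of the URL handling in google_sheet_to_pandas that
    raises ValueError when the sheet key or gid cannot be found.
    """
    key_match = re.search(r"/d/([a-zA-Z0-9_-]+)", sheet_url)
    gid_match = re.search(r"gid=(\d+)", sheet_url)
    if key_match is None or gid_match is None:
        raise ValueError(f"no sheet key or gid found in: {sheet_url}")
    return (
        "https://docs.google.com/spreadsheets/d/"
        f"{key_match.group(1)}/export?format=csv&gid={gid_match.group(1)}"
    )


# Data Geographies URL as listed in the script above.
url = (
    "https://docs.google.com/spreadsheets/d/"
    "1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158"
)
print(sheet_csv_export_url(url))
```

Failing early with a clear message may be easier to diagnose than the `AttributeError` that `key_match.group(1)` would raise on a `None` match.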