feat: update gapminder.json and add source information #580

Merged: 6 commits merged into vega:main on Jul 16, 2024

Conversation

dsmedia
Contributor

@dsmedia dsmedia commented Jul 11, 2024

Summary

  1. Updates the gapminder.json dataset from the source
  2. Adds detailed source information for gapminder.json to SOURCES.md
  3. Creates a script file demonstrating the dataset update process

Related Issue

Resolves #577

@dsmedia
Contributor Author

dsmedia commented Jul 11, 2024

For reference, here is the code used to generate the dataset. It pulls the data from the source spreadsheets (linked from gapminder.org) and then retains only the countries included in the then-current vega-datasets version of gapminder.

import pandas as pd
import json
import re
from vega_datasets import data

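def google_sheet_to_pandas(sheet_url):
    """Download one tab of a Google Sheet as a DataFrame via its CSV export URL."""
    key_match = re.search(r'/d/([a-zA-Z0-9_-]+)', sheet_url)
    gid_match = re.search(r'gid=(\d+)', sheet_url)
    # Fail with a clear message instead of an AttributeError on None
    if key_match is None or gid_match is None:
        raise ValueError(f"Could not find sheet key and gid in URL: {sheet_url}")
    sheet_key, gid = key_match.group(1), gid_match.group(1)
    csv_export_url = f"https://docs.google.com/spreadsheets/d/{sheet_key}/export?format=csv&gid={gid}"
    return pd.read_csv(csv_export_url)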

urls = [
    "https://docs.google.com/spreadsheets/d/1RehxZjXd7_rG8v2pJYV6aY0J3LAsgUPDQnbY4dRdiSs/edit?gid=176703676#gid=176703676", #life expectancy v14
    "https://docs.google.com/spreadsheets/d/1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676", #population v7
    "https://docs.google.com/spreadsheets/d/1aLtIpAWvDGGa9k2XXEz6hZugWn0wCd5nmzaRPPjbYNA/edit?gid=176703676#gid=176703676", #fertility v14
    "https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/edit?gid=1597424158#gid=1597424158" #data geographies v2
]

# Load dataframes
df_life, df_pop, df_fert, df_region = [google_sheet_to_pandas(url) for url in urls]

# Prepare main dataframe
df_main = df_pop[['name', 'time', 'Population']].rename(columns={'Population': 'pop'})

# Merge other dataframes
df_main = df_main.merge(df_life[['name', 'time', 'Life expectancy ']], on=['name', 'time'])
df_main = df_main.merge(df_fert[['name', 'time', 'Babies per woman']], on=['name', 'time'])
df_main = df_main.merge(df_region[['name', 'six_regions']], on='name')

# Rename columns
df_main = df_main.rename(columns={
    'name': 'country',
    'time': 'year',
    'Life expectancy ': 'life_expect',
    'Babies per woman': 'fertility',
    'six_regions': 'region'
})

# Reorder columns
df_main = df_main[['year', 'country', 'region', 'pop', 'life_expect', 'fertility']]

# Convert year to int and filter years from 1955 to 2005 in increments of 5
df_main['year'] = df_main['year'].astype(int)
df_main = df_main[df_main['year'].between(1955, 2005) & (df_main['year'] % 5 == 0)]

# Sort the dataframe
df_main = df_main.sort_values(['country', 'year'])

# Create the cluster mapping
cluster_map = {
    'south_asia': 0,
    'europe_central_asia': 1,
    'sub_saharan_africa': 2,
    'america': 3,
    'east_asia_pacific': 4,
    'middle_east_north_africa': 5
}

# Add cluster column and drop region column
df_main['cluster'] = df_main['region'].map(cluster_map)
df_main = df_main.drop('region', axis=1)

# Reorder columns
column_order = ['year', 'country', 'cluster', 'pop', 'life_expect', 'fertility']
df_main = df_main[column_order]

# Load gapminder dataset
df_gapminder = data.gapminder()

# Rename Hong Kong to Hong Kong, China in df_gapminder
df_gapminder.loc[df_gapminder['country'] == 'Hong Kong', 'country'] = 'Hong Kong, China'

# Get the list of countries in df_gapminder
gapminder_countries = set(df_gapminder['country'])

# Keep only rows in df_main that have a country in gapminder_countries
# (.copy() so the later column assignment does not act on a slice of the original frame)
df_main = df_main[df_main['country'].isin(gapminder_countries)].copy()

# Convert population to integer to match data type of original version of the dataset (and handle potential errors)
df_main['pop'] = df_main['pop'].astype(int, errors='ignore')

# Convert DataFrame to list of dictionaries
data_list = df_main.to_dict(orient='records')

# Convert the list of dictionaries to JSON
json_data = json.dumps(data_list)

print(json_data)
with open('gapminder.json', 'w') as f:
    json.dump(data_list, f)
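
The URL handling in google_sheet_to_pandas can be checked without network access. The sketch below splits the URL-building step into a hypothetical helper, `sheet_csv_export_url`, which is not part of the script above, and adds an explicit error for URLs that lack a sheet key or gid. It is only an illustration of the string transformation, not the code that was committed.

```python
import re


def sheet_csv_export_url(sheet_url: str) -> str:
    """Build the CSV export URL for a Google Sheets edit URL (hypothetical helper)."""
    key_match = re.search(r"/d/([a-zA-Z0-9_-]+)", sheet_url)
    gid_match = re.search(r"gid=(\d+)", sheet_url)
    if key_match is None or gid_match is None:
        raise ValueError(f"Could not find sheet key and gid in URL: {sheet_url}")
    return (
        "https://docs.google.com/spreadsheets/d/"
        f"{key_match.group(1)}/export?format=csv&gid={gid_match.group(1)}"
    )


# Population sheet URL copied from the urls list in the script above
population_url = (
    "https://docs.google.com/spreadsheets/d/"
    "1c1luQNdpH90tNbMIeU7jD__59wQ0bdIGRFpbMm8ZBTk/edit?gid=176703676#gid=176703676"
)
print(sheet_csv_export_url(population_url))
```

Keeping the URL construction separate from pd.read_csv means a malformed link fails early with a readable message rather than partway through the download.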

@domoritz
Member

Thank you. I think the dataset update and the addition to SOURCES.md can be one pull request, since the source information is for the updated dataset, no?

@domoritz
Member

@dsmedia let's add the code to the repo. We have the https://github.com/vega/vega-datasets/tree/main/scripts folder.

This commit introduces a Python script that updates the gapminder.json file
in the vega-datasets repository. The script:

- Fetches current data from Gapminder's Google Sheets
- Processes and combines data for life expectancy, population, fertility, and regions
- Filters results to match countries in the existing dataset
- Produces an update consistent with a minor release

The script maintains data consistency by referencing a specific version
of the existing dataset. This update allows for refreshed Gapminder data
while preserving the column structure and scope of countries/years expected by dependent visualizations.

Related: vega#580
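
One way to check that a regenerated file still has the column structure and the scope of countries and years described above is a small validation pass over the JSON records. The sketch below is an assumption about how such a check could look, not something included in this PR: `check_records` and the sample rows are hypothetical, while the column names and the 1955 to 2005 five-year grid come from the script posted earlier in this conversation.

```python
import json

# Column names and year range are taken from the script in this PR.
EXPECTED_COLUMNS = {"year", "country", "cluster", "pop", "life_expect", "fertility"}
EXPECTED_YEARS = set(range(1955, 2006, 5))  # 1955, 1960, ..., 2005


def check_records(records, known_countries):
    """Return a list of problems; an empty list means the records look consistent."""
    problems = []
    for i, row in enumerate(records):
        if set(row) != EXPECTED_COLUMNS:
            problems.append(f"row {i}: unexpected columns {sorted(row)}")
            continue
        if row["year"] not in EXPECTED_YEARS:
            problems.append(f"row {i}: unexpected year {row['year']}")
        if row["country"] not in known_countries:
            problems.append(f"row {i}: unknown country {row['country']!r}")
    return problems


# Placeholder rows for illustration only; these are not real Gapminder values.
sample_json = json.dumps([
    {"year": 1955, "country": "A", "cluster": 0, "pop": 1, "life_expect": 1.0, "fertility": 1.0},
    {"year": 1957, "country": "B", "cluster": 1, "pop": 2, "life_expect": 2.0, "fertility": 2.0},
])
for problem in check_records(json.loads(sample_json), known_countries={"A"}):
    print(problem)
```

In practice, known_countries would be the set of country names from the previous version of gapminder.json, and records would be loaded from the newly written file.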
@dsmedia changed the title from "feat: update gapminder.json dataset from source" to "feat: update gapminder.json and add source information" on Jul 12, 2024
@dsmedia
Contributor Author

dsmedia commented Jul 12, 2024

Merged all related changes into this PR and added the updater to the script folder.

@domoritz domoritz merged commit 76feaab into vega:main Jul 16, 2024
2 checks passed
@domoritz
Member

thank you

Development

Successfully merging this pull request may close these issues.

Address data inconsistencies and absence of versioning or sourcing in gapminder data
2 participants