G5055_Practicum_Project2 | Fall 2021

Team Members (in alphabetical order)

Core Team: Qinyue Hao, Jasmine Hwang, Dan Li, Peishan Li, Rina Shin, Connie Xu, Hanyu Zhang
Supporting Team: Zhiwen Huang, Cara Latinazo, Xingchen Li, Soobin Oh, Lizabeth Singh, Mengying Xu, Tianqing Zhou

Project Description

This project was developed by graduate students in the Social Sciences program of Columbia University in collaboration with the UN SDG Fund. The objective of this project was to develop models that could define and quantitatively measure the networks among the 17 Sustainable Development Goals (SDGs). To do this, we built two models: a text model based on indicator descriptions from the UN SDG Indicator Metadata, and coefficient networks based on coefficients calculated from the UN SDG Indicator Database. The team also looked at the similarity between the two models, and at the generalizability of the network model from the two example countries.

The importance of this project lies in the fact that the different domains of the SDGs are interconnected and cannot be effectively addressed unless they are treated as interdependent. Although these networks should be both theoretical and evidence-based, few studies have been conducted to validate their empirical grounding.

Additional information about the project can be found on the project slides here

Team member contact information can also be found on the slides.

Scoping and Methodologies:

Scoping:

For the coefficient model, we selected two specific countries: Indonesia and Guatemala.

The two countries were chosen based on the UN Joint SDG Fund's countries of interest, differences in geographical distribution, similarities in factors such as population density and political stability, and relative data availability.

For the coefficient model, the team used data from 2012 to 2020.

Model Methodologies Used:

Text Model: Network Model based on TF-IDF and cosine similarities between indicator descriptions from SDG metadata
Coefficient Social Network Model: Whole Network, Positive and Negative Network Models based on coefficients for year-to-year changes in SDG indicator measures
- Whole Network: An undirected, weighted network, with all available indicators as nodes, statistically significant (p < 0.05) relationships as ties, and the corresponding correlation coefficients as the weights of ties.
- Positive Network: A subgraph of the whole network, with only the positive linkages and the indicators they connect.
- Negative Network: A subgraph of the whole network, with only the negative linkages and the indicators they connect.
QAP Procedure and Network Logistic Regression: The QAP (Quadratic Assignment Procedure) is a way to handle the non-independence problem by permuting rows and columns in the matrix while maintaining the underlying relationship. To focus on predicting the existence rather than the strength of ties, we made the positive and negative network models binary by recoding all the coefficients to 1 before running Network Logistic Regression between them to test the predictive strength of one network on another.
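
The whole, positive, and negative networks described above can be sketched as follows. This is a minimal illustration, not the code in this repository: the function names `build_networks` and `binarize`, the indicator codes, and the coefficient and p-value numbers are all hypothetical, and the sketch tracks only ties (isolated indicators would still need to be added as nodes for the whole network).

```python
def build_networks(coef, pval, alpha=0.05):
    """Split significant correlations into whole, positive and negative tie sets.

    coef and pval map an unordered indicator pair (a, b) to its correlation
    coefficient and p-value. Each returned network maps pair -> tie weight.
    """
    # Whole network: keep only statistically significant relationships.
    whole = {pair: w for pair, w in coef.items() if pval[pair] < alpha}
    # Positive and negative networks are subgraphs of the whole network.
    positive = {pair: w for pair, w in whole.items() if w > 0}
    negative = {pair: w for pair, w in whole.items() if w < 0}
    return whole, positive, negative


def binarize(network):
    """Recode every tie weight to 1 so that only tie existence is kept."""
    return {pair: 1 for pair in network}


# Hypothetical indicator codes and values, for illustration only.
coef = {("1.1.1", "2.1.1"): 0.82, ("1.1.1", "3.2.1"): -0.64, ("2.1.1", "3.2.1"): 0.10}
pval = {("1.1.1", "2.1.1"): 0.01, ("1.1.1", "3.2.1"): 0.03, ("2.1.1", "3.2.1"): 0.60}

whole, positive, negative = build_networks(coef, pval)
binary_negative = binarize(negative)
```

Storing each tie once under an unordered pair keeps the network undirected; the binarized versions are the form that the tie-existence analysis needs.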

Final Deliverables

Blog
Research Paper
Interactive Visualizations
See key findings and other visualizations on the final presentation slides.

Repository Directory Contents:

├── Codes
	├── Data Accessing and Preprocessing 
	├── Text Model
	├── Coefficient Network Model 
		├── Composite Method
		├── Regression Models
	├── Representative Method ^^
		├── Correlation Analysis
		├── Pick Central Variable
	├── Data Visualizations
		├── coefficient network 
		├── text network 
		├── data missingness and disaggregation 
	
├── Data  
	├── Centrality_representative_results (1) ^^ 
		├── centrality_scores(after removing disaggregation)
		├── indicator_picked(before removing disaggregation)
		├── measure_picked(before removing disaggregation)
	├── Guatemala & Indonesia Correlation among Indicators (1) ^^
	├── Guatemala & Indonesia Correlation among Targets (1) ^^
	├── Guatemala & Indonesia Correlation among measurements (1) ^^ 
	├── Guatemala & Indonesia Correlation among measurements-WITHOUT disaggregation ^^ (1) 
	Guatemala & Indonesia Correlation among Targets Ungrouped.csv ^^
	Indonesia.csv & Guatemala.csv ^^
	Guatemala & Indonesia data after selecting one measurement for each indicator.csv ^^
	Guatmala & Indonesia Data Without Disaggregation.csv ^^
	├── variable_types (1) ^^
	├── variables_picked (1) ^^
	├── List of indicators (1) 
	├── Data_preprocessed_for_PCA (1) 
	├── PCA_results (1) 
	├── coefficient_network (1) ^^
	├── Text_Model_Data (2) 

├── Visualizations 
	├── Disaggregated_Data (1a) 
	├── Missing_Data (1a) 
	├── Model_Viz (1c) 	
	├── Interactive_Plots 
	├── Text_Model_Viz (2) 
	├── goal_hexcodes_edge.csv 
	├── goal_hexcodes.csv

^^ - item is likely deprecated 
(1), (2), (1 & 2) - refer to the model that the folder corresponds to.

Codes

Text Model

three_models.ipynb shows the use of three text representation models (TF-IDF, Word2vec and Doc2vec) and their results.
25sample_indicator.ipynb samples the three models' results to decide on the final model.
tf_idf_final.ipynb represents the final result using TF-IDF.
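
As a rough illustration of the TF-IDF and cosine similarity idea behind the text model, the sketch below uses only the Python standard library. It is not taken from the notebooks: the function names, the whitespace tokenizer, the smoothed idf variant, and the three placeholder descriptions are assumptions made for the example, and the notebooks may well use a dedicated library instead.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (term frequency * smoothed idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Placeholder descriptions, not actual SDG metadata text.
docs = [
    "proportion of population living below the poverty line",
    "proportion of population with access to electricity",
    "mortality rate attributed to air pollution",
]
vectors = tfidf_vectors(docs)
# Pairwise similarities; each entry could serve as a tie weight in the text network.
similarity = [[cosine(u, v) for v in vectors] for u in vectors]
```

Descriptions that share no terms get a similarity of exactly zero, so in a network built this way they would have no tie.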

Coefficient Social Network Model

Correlation_among_indicators_Indicator_model.ipynb This Jupyter notebook calculates Pearson correlations using PCA vectors. Outputs include simple matrices showing the Pearson correlation coefficients and their statistical significance.

QAP Analysis

QAP_regression_sig.Rmd This Rmd shows the process of regressions between different networks with OLS Network models and Network Logistic Models.
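
To make the permutation idea behind QAP concrete, here is a small Python sketch of a QAP correlation test between two adjacency matrices. It is a simplified stand-in, not the regression models fitted in the Rmd: the function names, the example matrix, and the choice of a two-sided test on the Pearson correlation are assumptions for illustration.

```python
import random


def offdiag(matrix):
    """Flatten a square matrix, skipping the diagonal (self-ties)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(n) if i != j]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5 if var_x and var_y else 0.0


def qap_correlation(a, b, n_perm=1000, seed=0):
    """Return the observed correlation and a two-sided permutation p-value."""
    rng = random.Random(seed)
    n = len(a)
    observed = pearson(offdiag(a), offdiag(b))
    hits = 0
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        # Relabel nodes: the same permutation is applied to rows and columns,
        # which preserves the structure of the network.
        permuted = [[a[order[i]][order[j]] for j in range(n)] for i in range(n)]
        stat = pearson(offdiag(permuted), offdiag(b))
        if abs(stat) >= abs(observed) - 1e-12:
            hits += 1
    return observed, hits / n_perm


# Hypothetical binary network over four indicators.
net = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
observed, p_value = qap_correlation(net, net, n_perm=200)
```

Because rows and columns are shuffled together, each permuted matrix is the same network with relabeled nodes, which is what lets the test respect the dependence among ties that share a node.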

Data Visualization

final_viz_weighted_net.R Visualization of static indicator network built with correlation coefficients.

Data pre-processing:

Text Model

fill_definition_incomplete.ipynb shows the process of scraping the 246 indicators' definitions from the metadata PDF files.

Coefficient Social Network Model

UN_SDG_2_Functions.py This Python module includes a function called preprocess. After importing UN_SDG_2_Functions and reading an SDG CSV file (2012-2021) from the API, users can call this function to pivot the data directly into indicator metric / time (year) format. It is also available in more detail as an .ipynb file.
For_PCA_Data_Preprocessing.ipynb This Jupyter notebook uses PCA to preprocess data for building coefficients at the indicator level.
Missingness_Imputation.ipynb This Jupyter notebook includes code to impute missing data for UN countries using linear regression slope fitting over time.
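
The imputation idea can be illustrated with a short sketch that fits a straight line over time to the observed values of one series and fills the gaps from that line. This is an assumption-laden example rather than the notebook's code: the function name `impute_linear_trend`, the use of `None` for missing values, and the sample series are hypothetical.

```python
def impute_linear_trend(years, values):
    """Fill None entries using a least-squares line fitted to the observed points."""
    observed = [(y, v) for y, v in zip(years, values) if v is not None]
    if len(observed) < 2:
        return list(values)  # not enough points to estimate a slope
    n = len(observed)
    mean_year = sum(y for y, _ in observed) / n
    mean_value = sum(v for _, v in observed) / n
    sxx = sum((y - mean_year) ** 2 for y, _ in observed)
    if sxx == 0:
        return list(values)  # all observations share one year; slope undefined
    slope = sum((y - mean_year) * (v - mean_value) for y, v in observed) / sxx
    intercept = mean_value - slope * mean_year
    # Keep observed values as they are; replace only the missing ones.
    return [v if v is not None else intercept + slope * y
            for y, v in zip(years, values)]


# Hypothetical indicator series with two missing years.
years = [2012, 2013, 2014, 2015, 2016]
values = [10.0, None, 14.0, None, 18.0]
filled = impute_linear_trend(years, values)
```

Series with fewer than two observed points are returned unchanged, since a trend cannot be estimated from them.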

Please feel free to contact our team with questions, issues, and concerns. Thank you!
