gdelt-demo: Notes on datasets

Over time and through lack of planning, I have created various subsets of the version 1 GDELT dataset, both locally and on AWS (in MySQL and as flat files on S3). Small and medium subsets are useful (they can serve as dev and test sets, respectively), but poor coordination and haphazard proliferation are not.

Datasets to keep

FULL: Full GDELT on s3://gdelt-open-data/ -- tab-delimited as *.csv -- obviously not mine.
SMALL: Local tab-delimited files -- too big to put in this repo. Intended to validate analysis code. For many dates I used a util script to pull 1/10 of the rows.
MICRO: Flat files in this git repo - See data_related/sample_data. Anyone cloning the repo can get going quickly and get output with this set.
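
The util script used to build SMALL is not included in these notes. As a rough sketch of how pulling 1/10 of the rows could work, the hypothetical function below keeps every tenth line of a tab-delimited file. The function name, the parameters, and the choice of systematic (every Nth row) rather than random sampling are assumptions, as is the absence of a header row.

```python
import itertools


def sample_rows(src_path, dst_path, keep_every=10):
    """Copy every `keep_every`-th line of a text file; return the rows written.

    Hypothetical helper: assumes one record per line and no header row.
    """
    written = 0
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # islice with a step streams the file, so large inputs are never
        # loaded into memory all at once.
        for line in itertools.islice(src, 0, None, keep_every):
            dst.write(line)
            written += 1
    return written
```

Sampling every Nth row is deterministic, so rerunning the script on the same input reproduces the same subset, which helps when several copies are supposed to match.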

Note: The summary country data under data_related/features and data_related/external also fits in this repo and should accompany every dataset. The former resulted from a Hadoop query summarizing the entire dataset.

Datasets to harmonize

Mini tab-delimited on my s3 -- useful for validating HiveQL scripts. Should match SMALL.
MySQL on Amazon Relational Database Service (RDS) -- should match SMALL.
MySQL on my local server -- lost in a computer crash. Should be rebuilt to match SMALL.
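
One possible way to check that a flat-file copy matches SMALL is to compare line counts and checksums. The sketch below is an assumption rather than an existing script in this repo, and the function names are hypothetical. It is a strict byte-level comparison, so a MySQL copy would first need to be exported to tab-delimited text with the same row order and formatting.

```python
import hashlib


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (line_count, sha256_hex) for a file, read in binary chunks."""
    digest = hashlib.sha256()
    lines = 0
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            lines += chunk.count(b"\n")
    return lines, digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files have identical bytes (and hence line counts)."""
    return file_fingerprint(path_a) == file_fingerprint(path_b)
```

When the fingerprints differ, the line counts give a first hint about whether rows are missing or the content merely differs in ordering or formatting.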

Datasets to consider eliminating

Mini version - I forget what this refers to; it may be something stray on S3. I should check what it is, but my strong prior is that it no longer adds any value.

Version 2 (future)

As I get into really understanding the Global Knowledge Graph, I may avail myself of the services offered in its native habitat on Google BigQuery.
