This repository has been archived by the owner on Apr 13, 2021. It is now read-only.

COVID-19 datasets are constructed entirely from primary (government and public agency) sources


yahoo/covid-19-data


Yahoo Knowledge Graph COVID-19 Datasets


Background

The Yahoo Knowledge Graph team at Verizon Media is responsible for providing critical COVID-19 data that feeds into Yahoo properties like Yahoo News, Yahoo Finance, and Yahoo Weather. The COVID-19 datasets include country, state, and county level information updated on a rolling basis, with updates occurring approximately hourly.

The COVID-19 datasets are constructed entirely from primary (government and public agency) sources with a clear attribution of the primary sources used for each geographical region. While other aggregations of COVID-19 data are already available, we believe ours to be the only open source COVID-19 dataset that is constructed entirely from primary sources with clear attribution back to those sources. Our hope is that additional transparency will enable more accurate analysis, aiding researchers who seek to understand and prevent further spread of the disease.

Released together with the COVID-19 dataset are two other open source projects:

Datasets

The data is logically organized by region and time. Time is further organized into a snapshot of the latest updates received for all regions and the updates reported by regions for a given date. As the COVID-19 pandemic develops and local governments and agencies improve their ability to collect and present their data to the public, the schema will evolve. Please check back as sources frequently evolve.

We welcome data feeds or links to web pages that you would like us to crawl, extract, and merge into the overall stats. Feel free to submit an issue.

region-metadata

Provides general information about the regions covered in the dataset, such as geographical location and links to other public data sources.

| Field | Type | Description |
| --- | --- | --- |
| id | xsd:string | a unique identifier for the region |
| type | list of xsd:string | a list of type classifications for the region, for example: Country, StateAdminArea, CountyAdminArea, etc. |
| woeId | xsd:string | WhereOnEarth unique identifier for the region |
| wikiId | xsd:string | the main Wikipedia page name of the country, can be used as a unique key |
| countryCode | xsd:string | 2-letter country abbreviation code (ISO 3166) |
| stateCode | xsd:string | 2-letter state abbreviation code (FIPS 5-2) |
| countyCode | xsd:string | US county code (FIPS 6-4) |
| label | xsd:string | the English name of the region |
| latitude | xsd:float | latitude in decimal number format |
| longitude | xsd:float | longitude in decimal number format |
| population | xsd:integer | the population residing in the region |
| parentId | list of xsd:string | a list of parent geopolitical regions for the region; this represents only direct parents as they exist in the dataset, not the full possible hierarchy |
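As a rough illustration of how the `type` and `parentId` list fields might be used, here is a minimal Python sketch. The rows, ids, and helper names (`regions_of_type`, `ancestors`) are hypothetical placeholders, not values or tooling from the dataset, and the sketch assumes the metadata has already been loaded into dictionaries with the list fields parsed.

```python
from typing import Dict, List

# Hypothetical rows shaped like region-metadata entries. The ids and labels
# are placeholders for illustration only, not real dataset values.
regions: List[Dict] = [
    {"id": "country-a", "type": ["Country"], "label": "Country A", "parentId": []},
    {"id": "state-a1", "type": ["StateAdminArea"], "label": "State A1",
     "parentId": ["country-a"]},
    {"id": "county-a1x", "type": ["CountyAdminArea"], "label": "County A1X",
     "parentId": ["state-a1"]},
]


def regions_of_type(rows: List[Dict], type_name: str) -> List[Dict]:
    """Return the rows whose `type` list contains type_name."""
    return [r for r in rows if type_name in r["type"]]


def ancestors(rows: List[Dict], region_id: str) -> List[str]:
    """Follow parentId links upward and collect each ancestor id once.

    parentId holds only direct parents, so the full chain has to be rebuilt
    by walking the links. Parents missing from `rows` are skipped.
    """
    by_id = {r["id"]: r for r in rows}
    seen: List[str] = []
    stack = list(by_id[region_id]["parentId"])
    while stack:
        pid = stack.pop()
        if pid in seen or pid not in by_id:
            continue
        seen.append(pid)
        stack.extend(by_id[pid]["parentId"])
    return seen


print([r["id"] for r in regions_of_type(regions, "Country")])
print(ancestors(regions, "county-a1x"))
```

Because `parentId` lists only direct parents, any query over the full hierarchy needs a walk like the one in `ancestors`.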

by-region-[DATE]

Provides detailed case counts of COVID-19 in each region on [DATE] in local time for that region. Each entry (row) in the daily file represents a single region.

Please be aware that different sources release data at different and often unpredictable frequencies. The by-region-[DATE] numbers will be updated as sources release data for the given date for their region. In some cases, data for a given region is not released until many days after that calendar date has elapsed everywhere in the world. As a result, the same by-region-[DATE] file may show different aggregate statistics for the same date depending on when the by-region-[DATE] is accessed. Generally speaking, by-region-[DATE] data more than one week old is stable.
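A consumer might encode the one-week rule of thumb above as a simple check before trusting a daily file. The following is only a sketch: the function name and the fixed 7-day threshold are assumptions drawn from the guidance above, not part of the dataset or its tooling.

```python
from datetime import date, timedelta
from typing import Optional

# Threshold taken from the "more than one week old" guidance above.
STABLE_AFTER = timedelta(days=7)


def is_probably_stable(file_date: date, today: Optional[date] = None) -> bool:
    """Return True when the daily file's date is more than one week in the past."""
    today = today or date.today()
    return today - file_date > STABLE_AFTER


# A file dated 2020-04-01, checked on 2020-04-09, is 8 days old.
print(is_probably_stable(date(2020, 4, 1), today=date(2020, 4, 9)))
```

This is a heuristic only: a source may still revise older figures, so the check indicates likely stability rather than guaranteeing it.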

| Field | Type | Description |
| --- | --- | --- |
| regionId | xsd:string | see id above |
| label | xsd:string | see above |
| totalConfirmed | xsd:integer | the total number of confirmed cases of COVID-19 in the region up to the given date (aggregate) |
| totalDeaths | xsd:integer | the total number of fatalities from COVID-19 in the region |
| totalRecoveredCases | xsd:integer | the total number of people recovered from COVID-19 in the region (aggregate) |
| totalTestedCases | xsd:integer | the total number of people tested for COVID-19 in the region (aggregate) |
| numPositiveTests | xsd:integer | the daily count of people who tested positive for COVID-19 |
| numDeaths | xsd:integer | the daily count of fatalities as a result of COVID-19 |
| numRecoveredCases | xsd:integer | the daily count of people recovered from COVID-19 |
| diffNumPositiveTests | xsd:integer | the difference in the number of positive cases found between 2 consecutive days |
| diffNumDeaths | xsd:integer | the difference in the number of deaths between 2 consecutive days |
| avgWeeklyConfirmedCases | xsd:float | 7-day moving average of daily new confirmed cases |
| avgWeeklyDeaths | xsd:float | 7-day moving average of daily new deaths |
| referenceDate | xsd:date | the date associated with the COVID-19 data according to the local timezone of the region |
| lastUpdatedDate | xsd:datetime | last update time of the entry |
| dataSource | xsd:anyURI | the source attribution for the COVID-19 data in the current entry |
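To make the relationship between the cumulative, daily, and moving-average fields concrete, here is one possible way to derive daily counts and a 7-day average from a cumulative series. This is a sketch under assumptions: the README does not state how the published values are computed (for example, whether the window is trailing or how short histories are handled), and the input numbers are synthetic.

```python
from typing import List


def daily_from_totals(totals: List[int]) -> List[int]:
    """Daily new counts from a cumulative series.

    The first day has no predecessor, so the result is one element shorter.
    """
    return [later - earlier for earlier, later in zip(totals, totals[1:])]


def trailing_mean(values: List[float], window: int = 7) -> List[float]:
    """Trailing moving average, emitting one value per complete window."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Synthetic cumulative totals for nine consecutive days (not real data).
totals = [0, 7, 14, 21, 28, 35, 42, 49, 63]
daily = daily_from_totals(totals)
print(daily)                 # [7, 7, 7, 7, 7, 7, 7, 14]
print(trailing_mean(daily))  # [7.0, 8.0]
```

A trailing window is assumed here; a centered window or one that tolerates missing days would give different values at the edges of the series.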

by-region-latest

Provides the latest figures for each region.

The schema for the latest file is similar to the by-region-[DATE] above. There are 2 main differences:

  • All daily diff, moving average, and daily numbers are removed. Daily numbers in the latest file can be misleading because they depend on the time of day at which the data was collected.
  • referenceDate - In the daily files, referenceDate always matches the filename, and represents the date in local time for the relevant data reported by the source for that region when that source was last consulted. In the latest file, referenceDate will differ across regions, representing the latest date on which the source for a given region was consulted.

Note that because different regions report at different and often unpredictable frequencies, the latest figures for one region may be many days older than the latest figures for another region. For this reason, stable by-region-[DATE] numbers are required for an accurate comparison of growth rates in different regions. Generally speaking, by-region-[DATE] data more than one week old is stable.
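One way to see how uneven the latest file can be is to measure how far each region's `referenceDate` lags behind the newest one. The sketch below uses hypothetical rows and a hypothetical helper name, and it assumes `referenceDate` is available as a plain `YYYY-MM-DD` string.

```python
from datetime import date
from typing import Dict, List

# Hypothetical rows shaped like by-region-latest entries (not real data).
latest_rows: List[Dict[str, str]] = [
    {"regionId": "region-a", "referenceDate": "2020-04-10"},
    {"regionId": "region-b", "referenceDate": "2020-04-07"},
]


def reference_lag_days(rows: List[Dict[str, str]]) -> Dict[str, int]:
    """Days each region's referenceDate trails the newest referenceDate seen."""
    parsed = {r["regionId"]: date.fromisoformat(r["referenceDate"]) for r in rows}
    newest = max(parsed.values())
    return {region_id: (newest - d).days for region_id, d in parsed.items()}


print(reference_lag_days(latest_rows))  # {'region-a': 0, 'region-b': 3}
```

Regions with a large lag are poor candidates for side-by-side comparison from the latest file; the stable daily files are the better basis for that.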

| Field | Type | Description |
| --- | --- | --- |
| regionId | xsd:string | see id above |
| label | xsd:string | see above |
| totalConfirmed | xsd:integer | the total number of confirmed cases of COVID-19 in the region up to the given date (aggregate) |
| totalDeaths | xsd:integer | the total number of fatalities from COVID-19 in the region |
| totalRecoveredCases | xsd:integer | the total number of people recovered from COVID-19 in the region (aggregate) |
| totalTestedCases | xsd:integer | the total number of people tested for COVID-19 in the region (aggregate) |
| referenceDate | xsd:date | the date associated with the COVID-19 data according to the local timezone of the region |
| lastUpdatedDate | xsd:datetime | last update time of the entry |
| dataSource | xsd:anyURI | the source attribution for the COVID-19 data in the current entry |

Maintainers

Please contact [email protected] with any questions.

Contributors

Thank you to everyone who contributed to this project!

License

The Yahoo Knowledge Graph COVID-19 Dataset is made available under a Creative Commons CC-BY-NC 4.0 license. No express permission from Verizon Media is required for noncommercial uses. Noncommercial uses require only compliance with the CC-BY-NC 4.0 license, including attribution.

Verizon Media may consider granting royalty-free commercial licenses upon request. If you are interested in making commercial use of the Yahoo COVID-19 Dataset, please submit a request.
