COVID-19 case numbers in Germany by state, over time 😷

COVID-19 case numbers for Germany, for Bundesländer (states) and Landkreise (counties), with time series.

This dataset is provided through comma-separated value (CSV) files. In addition, this project offers an HTTP (JSON) API.

Unboxing: what's in it? :-)

  • JSON endpoint /now: Germany's total case count (updated in real time, always fresh, for the sensationalists)
  • RKI data (most credible view into the past): time series data provided by the Robert Koch-Institut (updated daily)
  • Crowdsourcing data (fresh view into the last 1-2 days): Risklayer crowdsource effort (see "Attribution" below)
  • ags.json: a map for translating "amtlicher Gemeindeschlüssel" (AGS) to Landkreis/Bundesland details, including latitude and longitude (see the sketch after this list).
  • data.csv: historical data, from a mixed data source based on RKI/ZEIT ONLINE; this powers the per-Bundesland time series exposed by the HTTP JSON API.
  • JSON endpoints for per-Bundesland time series (based on data.csv), for example for Bayern: /timeseries/DE-BY/cases. Endpoints for other states are linked from the landing page: https://covid19-germany.appspot.com
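
For a quick impression of how ags.json can be consumed, here is a minimal Python sketch. The exact key format and field layout are assumptions on my part (inspect the file before relying on them):

import json

# Load the AGS -> Landkreis/Bundesland mapping.
with open("ags.json", encoding="utf-8") as f:
    ags_map = json.load(f)

# Hypothetical example key: "09162" is the AGS of München.
entry = ags_map.get("09162")
if entry is not None:
    # Expected: a dict with Landkreis/Bundesland details, latitude, longitude.
    print(entry)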

There is also a website showing a plot (not updated daily): https://covid19-germany.appspot.com

How is this dataset different from others?

  • It includes historical data for individual Bundesländer and Landkreise (states and counties).
  • Its time series data is re-written as the data gets better over time. This is based on official RKI-provided time series data, which receives daily updates even for days that lie weeks in the past (accounting for reporting delay).
  • The HTTP endpoint /now consults multiple sources (and has changed its sources over time) to be as fresh and credible as possible while maintaining a stable interface.

Contact, questions, contributions

You probably have many questions, just as I did (and still do). Your feedback and questions are highly appreciated! Please use the GitHub issue tracker (preferred) or contact me via mail at [email protected].

CSV file details

  • The column names use the ISO 3166-2 code for individual states (e.g., DE-BY for Bayern).
  • The points in time are encoded using localized ISO 8601 time string notation.
  • I did not incorporate the numbers on recovered cases so far because individual Gesundheitsämter do not yet have the capacity to track this metric carefully (the reported values are rather meaningless).
  • As a differentiator from other datasets, the sample timestamps contain the time of day, so that consumers can at least get a vague impression of whether a sample represents the state in the morning or in the evening (a common confusion about RKI-derived datasets). A morning sample most likely represents the data of the previous day; an evening sample is more likely to represent the state of that same day.

Code example: parsing and plotting

This example assumes experience with established tools from the Python ecosystem. Create a file called plot.py:

import sys

import pandas as pd
import matplotlib.pyplot as plt

plt.style.use("ggplot")

# Read the CSV file given as the first command line argument, parsing the
# time_iso8601 column into a timezone-aware (UTC) DatetimeIndex.
# Note: date_parser is deprecated in newer pandas; alternatively, read the
# CSV without parse_dates and use pd.to_datetime(df.index, utc=True).
df = pd.read_csv(
    sys.argv[1],
    index_col=["time_iso8601"],
    parse_dates=["time_iso8601"],
    date_parser=lambda col: pd.to_datetime(col, utc=True),
)
df.index.name = "time"

# Plot the confirmed-case time series for Baden-Württemberg (DE-BW).
df["DE-BW"].plot(
    title="DE-BW confirmed cases (RKI data)", marker="x", grid=True, figsize=[12, 9]
)
plt.tight_layout()
plt.savefig("bw_cases_over_time.png", dpi=70)

Run it, providing cases-rki-by-state.csv as the argument:

python plot.py cases-rki-by-state.csv

This creates a file bw_cases_over_time.png showing the DE-BW case count over time.

Quality data sources published by Bundesländer

I tried to discover these step by step; they are possibly underrated:

Further resources:

HTTP API details

Some of the motivations behind the HTTP API are convenience (easy to consume with the tooling of your choice!), interface stability, and availability.

  • The HTTP API is served under https://covid19-germany.appspot.com
  • It is served by Google App Engine from a European data center
  • The code behind this can be found in the gae directory in this repository.

How to get historical data for a specific German state/Bundesland:

Construct the URL based on this pattern:

https://covid19-germany.appspot.com/timeseries/<state>/<metric>

For <state> use the ISO 3166-2 code (e.g., DE-BY), for <metric> use cases or deaths.

For example, to fetch the time evolution of the number of confirmed COVID-19 cases for Bayern (Bavaria):

$ curl -s https://covid19-germany.appspot.com/timeseries/DE-BY/cases | jq
{
  "data": [
    {
      "2020-03-10T12:00:00+01:00": "314"
    },
[...]

The points in time are encoded using localized ISO 8601 time string notation. Any decent datetime library can parse that into timezone-aware native timestamp representations.
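
For example, a minimal Python sketch (3.7+, standard library only; the response layout is as shown above):

import json
import urllib.request
from datetime import datetime

url = "https://covid19-germany.appspot.com/timeseries/DE-BY/cases"
with urllib.request.urlopen(url) as resp:
    payload = json.load(resp)

# "data" is a list of single-entry {timestamp: count} objects (see above).
for sample in payload["data"]:
    for ts, count in sample.items():
        t = datetime.fromisoformat(ts)  # timezone-aware, keeps the +01:00 offset
        print(t, int(count))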

How to get the current snapshot for all of Germany (no time series):

$ curl -s https://covid19-germany.appspot.com/now | jq
{
  "current_totals": {
    "cases": 12223,
    "deaths": 31,
    "recovered": 99,
    "tested": "unknown"
  },
  "meta": {
    "contact": "Dr. Jan-Philip Gehrcke, [email protected]",
    "source": "ZEIT ONLINE (aggregated data from individual ministries of health in Germany)",
    "time_source_last_consulted_iso8601": "2020-03-19T03:47:01+00:00",
    "time_source_last_updated_iso8601": "2020-03-18T22:11:00+01:00"
  }
}

Notably, the Berliner Morgenpost also seems to do a great job at quickly aggregating the state-level data. This API endpoint chooses either that source or ZEIT ONLINE, whichever reports the higher case count.
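
A consumer that cares about freshness can inspect the meta timestamps. A minimal Python sketch (standard library only; field names as shown in the example response above):

import json
import urllib.request
from datetime import datetime, timezone

with urllib.request.urlopen("https://covid19-germany.appspot.com/now") as resp:
    now_doc = json.load(resp)

print("cases:", now_doc["current_totals"]["cases"])

# How stale is the underlying source?
updated = datetime.fromisoformat(now_doc["meta"]["time_source_last_updated_iso8601"])
age = datetime.now(timezone.utc) - updated
print(f"source last updated {age.total_seconds() / 3600:.1f} hours ago")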

What you should know before reading these numbers

Please question the conclusiveness of these numbers. Some directions along which you may want to think:

  • Germany seems to perform a large number of tests. But think about how much insight you actually have into how the testing rate (and its spatial distribution) evolves over time. In my opinion, one absolutely should know a whole lot about the testing effort itself before drawing conclusions from the time evolution of case count numbers.
  • Each confirmed case is implicitly associated with a reporting date. We do not know for sure how that reporting date relates to the date of taking the sample.
  • We believe that each "confirmed case" actually corresponds to a polymerase chain reaction (PCR) test for the SARS-CoV-2 virus with a positive outcome. Well, I think that's true; we can have that much trust in the system.
  • We seem to believe that the change of the number of confirmed COVID-19 cases over time is somewhat expressive: but what does it shed light on, exactly? The amount of testing performed, and its spatial coverage? The efficiency with which the virus spreads through the population ("basic reproduction number")? The actual, absolute number of people infected? The virus' potential to exhibit COVID-19 in an infected human body?

If you keep these (and more) ambiguities and questions in mind, then I think you are ready to look at these numbers and their time evolution :-) 😷

Changelog: data source

In Germany, every step along the chain of reporting (Meldekette) introduces a noticeable delay. This is not necessary, but it is sadly the current state of affairs. The Robert Koch-Institut (RKI) seems to be working on a more modern reporting system that might mitigate some of these delays along the Meldekette in the future. Until then, it is fair to assume that case numbers published by the RKI lag 1-2 days behind the case numbers published by the Landkreise, which themselves have an unknown lag relative to the physical tests. In some cases, the Meldekette might even be entirely disrupted, as discussed in this SPIEGEL article (German). Also see this discussion.

Wishlist: every case should be tracked with its own timeline, and transparently change state over time. The individual cases (and their timelines) should be aggregated at the country level, anonymously, and published in almost real time, through an official, structured data source, free for everyone to consume.

As discussed, the actual data flow situation is far from this ideal. Nevertheless, the primary concern of this dataset is to maximize data credibility while also trying to maximize data freshness; a challenging trade-off in this initial phase of pandemic growth in Germany. That is, the goal is to provide you with the least shitty numbers from a set of generally pretty shitty numbers. To that end, I took the liberty of iterating on the data source behind this dataset, as indicated below.

/now (current state):

  • Since (incl) March 26: Meldekette step 2: reports by the individual counties (Landkreise), curated by Tagesspiegel and Risklayer for the current case count, curated by ZEIT ONLINE for deaths.
  • Since (incl) March 24: Meldekette step 2: reports by the individual counties (Landkreise), curated by ZEIT ONLINE.
  • Since (incl) March 19: Meldekette step 3: reports by the individual states (Bundesländer), curated by ZEIT ONLINE, and Berliner Morgenpost.

/timeseries/... (historical data):

Update (evening of March 29): in the near future I am considering re-writing the history exposed by these endpoints (data.csv) using RKI data, accounting for long reporting delays.

  • Since (incl) March 24: Meldekette step 2: reports by the individual counties (Landkreise), curated by ZEIT ONLINE.
  • Since (incl) March 18: Meldekette step 3: reports by the individual states (Bundesländer), curated by ZEIT ONLINE.
  • Before March 18: Meldekette step 4: RKI "situation reports" (PDF documents).

Note:

  • The source identifier in the CSV file changes correspondingly over time.
  • A mix of sources in a time series is of course far from ideal. However, given the boundary conditions, I think switching to better sources as they come up is fair and useful. We might also change (read: rewrite) time series data in hindsight, towards enhancing overall credibility. That has not happened yet, but it may as we learn more about the Germany-internal data flow and about the credibility of individual data sources.

Attribution

Shout-out to ZEIT ONLINE for continuously collecting and publishing the state-level data with little delay.

Edit: Notably, by now the Berliner Morgenpost seems to do an equally good job of quickly aggregating the state-level data. We are using that here, too. Thanks!

Edit March 26: Risklayer is coordinating a crowd-sourcing effort to process verified Landkreis data as quickly as possible. Tagesspiegel is verifying this effort and using it in their overview page. As far as I can tell this is so far the most transparent data flow, and also the fastest, getting us the freshest case count numbers. Great work!

Fast aggregation & communication is important during the phase of exponential growth.

Random notes

  • The MDC Berlin has published this visualization and this article, but they seemingly decided not to publish the time series data. I got my hopes up here at first!
