COVID-19 case numbers for Germany, its Bundesländer (federal states), and Landkreise (counties). With time series.
This dataset is provided through comma-separated value (CSV) files. In addition, this project offers an HTTP (JSON) API.
- JSON endpoint /now: Germany's total case count (updated in real time, always fresh, for the sensationalists)
- RKI data (most credible view into the past): time series data provided by the Robert Koch-Institut (updated daily)
- cases-rki-by-ags.csv and deaths-rki-by-ags.csv: per-Landkreis time series
- cases-rki-by-state.csv and deaths-rki-by-state.csv: per-Bundesland time series
- This is the only data source that properly accounts for Meldeverzug (reporting delay). The historical evolution of data points in these files is updated daily based on a (less accessible) RKI ArcGIS system.
- Crowdsourcing data (fresh view into the last 1-2 days): Risklayer crowdsource effort (see "Attribution" below)
- cases-rl-crowdsource-by-ags.csv: per-Landkreis time series
- cases-rl-crowdsource-by-state.csv: per-Bundesland time series
- For the last ~48 hours these case count numbers (crowdsourced from Gesundheitsämter) are a little higher than what the RKI data set shows.
- ags.json: a map for translating "amtlicher Gemeindeschlüssel" (AGS) to Landkreis/Bundesland details, including latitude and longitude.
- data.csv: history, a mixed data source based on RKI/ZEIT ONLINE. This powers the per-Bundesland time series exposed by the HTTP JSON API.
- JSON endpoints for per-Bundesland time series, example for Bayern: /timeseries/DE-BY/cases, based on data.csv; endpoints for other states are linked from this landing page: https://covid19-germany.appspot.com
There also is a website showing a plot (not updated daily): https://covid19-germany.appspot.com
- It includes historical data for individual Bundesländer and Landkreise (states and counties).
- Its time series data is being re-written as data gets better over time. This is based on official RKI-provided time series data, which receives daily updates even for days and weeks in the past (accounting for delay in reporting).
- The HTTP endpoint /now consults multiple sources (and has changed its sources over time) to be as fresh and credible as possible while maintaining a stable interface.
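To make the CSV layout concrete, here is a hedged sketch for loading either per-state flavor with pandas. It assumes both flavors share a time_iso8601 index column and per-state ISO 3166 columns such as DE-BY (which matches the plotting example further below); the inlined sample rows are made-up values for illustration only.

```python
import io

import pandas as pd


def read_timeseries(path_or_buffer):
    # Both CSV flavors (RKI and crowdsourced) are assumed to share a
    # time_iso8601 index column and per-state ISO 3166 columns (DE-BY, ...).
    return pd.read_csv(
        path_or_buffer,
        index_col=["time_iso8601"],
        parse_dates=["time_iso8601"],
    )


# Inlined two-row sample with illustrative (made-up) values; in practice,
# pass e.g. "cases-rki-by-state.csv" or "cases-rl-crowdsource-by-state.csv".
sample = io.StringIO(
    "time_iso8601,DE-BY,DE-BW\n"
    "2020-03-18T21:00:00+01:00,1067,1641\n"
    "2020-03-19T21:00:00+01:00,1433,2054\n"
)
df = read_timeseries(sample)
print(df["DE-BY"].iloc[-1])  # most recent Bayern value in the sample: 1433
```

Comparing the tail of the RKI file against the tail of the crowdsourced file loaded this way makes the ~48-hour freshness gap described above directly visible.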
You probably have many questions, just as I did (and still do). Your feedback and questions are highly appreciated! Please use the GitHub issue tracker (preferred) or contact me via mail at [email protected].
- The column names use the ISO 3166 code for individual states.
- The points in time are encoded using localized ISO 8601 time string notation.
- I did not incorporate the numbers on recovered cases so far because individual Gesundheitsämter do not yet have the capacity to carefully track this metric (it is rather meaningless).
- As a differentiator from other datasets, the sample timestamps contain the time of day so that consumers can at least have a vague impression of whether a sample represents the state in the morning or evening (a common confusion about RKI-derived datasets). If it is from the morning, it is likely to actually be data of the day before. If it is from the evening, it is more likely to represent the state of that day.
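The hour can be inspected directly with Python's standard library; a minimal sketch:

```python
from datetime import datetime

# A sample timestamp as it appears in the datasets (time zone included).
ts = datetime.fromisoformat("2020-03-19T21:00:00+01:00")

# 21:00 local time: an evening sample, so it more likely represents
# the state of that same day rather than the day before.
print(ts.hour)  # 21
```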
This example assumes experience with established tools from the Python ecosystem.
Create a file called `plot.py`:
```python
import sys

import pandas as pd
import matplotlib.pyplot as plt

plt.style.use("ggplot")

# Read the CSV file given as first command line argument, using the
# time_iso8601 column as a timezone-aware datetime index.
df = pd.read_csv(
    sys.argv[1],
    index_col=["time_iso8601"],
    parse_dates=["time_iso8601"],
    date_parser=lambda col: pd.to_datetime(col, utc=True),
)
df.index.name = "time"

# Plot the Baden-Württemberg (DE-BW) time series.
df["DE-BW"].plot(
    title="DE-BW confirmed cases (RKI data)", marker="x", grid=True, figsize=[12, 9]
)
plt.tight_layout()
plt.savefig("bw_cases_over_time.png", dpi=70)
```
Run it, providing `cases-rki-by-state.csv` as an argument:

```shell
python plot.py cases-rki-by-state.csv
```
This creates a file `bw_cases_over_time.png` which may look like the following:
I tried to discover these step by step; they are possibly underrated:
- Bayern: case numbers, map, LK table
- Berlin: case numbers, map, intensive care numbers
- Baden-Württemberg:
- Brandenburg: press releases
- Bremen: press releases
- Hamburg: case numbers, press releases
- Hessen: press releases
- NRW: case numbers, LK table
- Mecklenburg-Vorpommern: press releases
- Niedersachsen (pretty well done!):
- case numbers, map, LK table
- CSV / GeoJSON
- so close, but no historical data :-(
- Rheinland-Pfalz: case numbers, LK table
- Saarland: case numbers
- Sachsen: case numbers, LK table, intensive care numbers
- Sachsen-Anhalt: case numbers, LK table, intensive care numbers
- Schleswig-Holstein: case numbers, LK table
- Thüringen: case numbers, LK table, intensive care numbers
- In this blog post (German) I try to shed light on why — as of the time of writing (March 18) — the numbers reported in the RKI and WHO situation reports lag behind by 1-3 days.
- Blog post Covid-19 HTTP API: German case numbers
- Blog post Covid-19 HTTP API: case numbers as time series, for individual German states
For the HTTP API, some of the motivations are convenience (easy to consume with the tooling of your choice!), interface stability, and availability.
- The HTTP API is served under https://covid19-germany.appspot.com
- It is served by Google App Engine from a European data center
- The code behind this can be found in the `gae` directory in this repository.
How to get historical data for a specific German state/Bundesland:
Construct the URL based on this pattern: `https://covid19-germany.appspot.com/timeseries/<state>/<metric>`

For `<state>` use the ISO 3166 code, for `<metric>` use `cases` or `deaths`.
For example, to fetch the time evolution of the number of confirmed COVID-19 cases for Bayern (Bavaria):
```shell
$ curl -s https://covid19-germany.appspot.com/timeseries/DE-BY/cases | jq
{
  "data": [
    {
      "2020-03-10T12:00:00+01:00": "314"
    },
[...]
```
The points in time are encoded using localized ISO 8601 time string notation. Any decent datetime library can parse that into timezone-aware native timestamp representations.
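For example, with Python's standard library (a sketch; any timezone-aware datetime library works equally well):

```python
from datetime import datetime

# Parse a localized ISO 8601 string as returned by the API.
ts = datetime.fromisoformat("2020-03-10T12:00:00+01:00")
print(ts.utcoffset())  # 1:00:00 (the +01:00 offset is retained)
```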
How to get the current snapshot for all of Germany (no time series):
```shell
$ curl -s https://covid19-germany.appspot.com/now | jq
{
  "current_totals": {
    "cases": 12223,
    "deaths": 31,
    "recovered": 99,
    "tested": "unknown"
  },
  "meta": {
    "contact": "Dr. Jan-Philip Gehrcke, [email protected]",
    "source": "ZEIT ONLINE (aggregated data from individual ministries of health in Germany)",
    "time_source_last_consulted_iso8601": "2020-03-19T03:47:01+00:00",
    "time_source_last_updated_iso8601": "2020-03-18T22:11:00+01:00"
  }
}
```
Notably, the Berliner Morgenpost seems to also do a great job at quickly aggregating the state-level data. This API endpoint chooses either that source or ZEIT ONLINE depending on the higher case count.
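That selection can be sketched as follows (a simplification with a hypothetical function name; the actual endpoint code lives in the `gae` directory of this repository):

```python
def pick_source(zeit_online_count, morgenpost_count):
    # Prefer whichever aggregator currently reports the higher total,
    # assuming the higher count reflects the fresher aggregation state.
    if morgenpost_count > zeit_online_count:
        return ("Berliner Morgenpost", morgenpost_count)
    return ("ZEIT ONLINE", zeit_online_count)


print(pick_source(12223, 12150))
```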
Please question the conclusiveness of these numbers. Some directions along which you may want to think:
- Germany seems to perform a large number of tests. But think about how much insight you actually have into how the testing rate (and its spatial distribution) evolves over time. In my opinion, one absolutely should know a whole lot about the testing effort itself before drawing conclusions from the time evolution of case count numbers.
- Each confirmed case is implicitly associated with a reporting date. We do not know for sure how that reporting date relates to the date of taking the sample.
- We believe that each "confirmed case" actually corresponds to a polymerase chain reaction (PCR) test for the SARS-CoV-2 virus with a positive outcome. Well, I think that's true; we can have that much trust in the system.
- We seem to believe that the change of the number of confirmed COVID-19 cases over time is somewhat expressive: but what does it shed light on, exactly? The amount of testing performed, and its spatial coverage? The efficiency with which the virus spreads through the population ("basic reproduction number")? The actual, absolute number of people infected? The virus' potential to exhibit COVID-19 in an infected human body?
If you keep these (and more) ambiguities and questions in mind then I think you are ready to look at these numbers and their time evolution :-) 😷.
In Germany, every step along the chain of reporting (Meldekette) introduces a noticeable delay. This is not necessary, but sadly the current state of affairs. The Robert Koch-Institut (RKI) seems to be working on a more modern reporting system that might mitigate some of these delays along the Meldekette in the future. Until then, it is fair to assume that case numbers published by RKI have 1-2 days delay over the case numbers published by Landkreise, which themselves have an unknown lag relative to the physical tests. In some cases, the Meldekette might even be entirely disrupted, as discussed in this SPIEGEL article (German). Also see this discussion.
Wishlist: every case should be tracked with its own time line, and transparently change state over time. The individual cases (and their time lines) should be aggregated on a country-wide level, anonymously, and get published in almost real time, through an official, structured data source, free to consume for everyone.
As discussed, the actual data flow situation is far from this ideal. Nevertheless, the primary concern of this dataset here is to maximize data credibility while also trying to maximize data freshness; a challenging trade-off in this initial phase of pandemic growth in Germany. That is, the goal is to provide you with the least shitty numbers from a set of generally pretty shitty numbers. To that end, I took the liberty to iterate on the data source behind this dataset, as indicated below.
- Since (incl) March 26: Meldekette step 2: reports by the individual counties (Landkreise), curated by Tagesspiegel and Risklayer for the current case count, curated by ZEIT ONLINE for `deaths`.
- Since (incl) March 24: Meldekette step 2: reports by the individual counties (Landkreise), curated by ZEIT ONLINE.
- Since (incl) March 19: Meldekette step 3: reports by the individual states (Bundesländer), curated by ZEIT ONLINE, and Berliner Morgenpost.
Update (evening of March 29): in the near future I am considering re-writing the history exposed by these endpoints (data.csv) using RKI data, accounting for long reporting delays.
- Since (incl) March 24: Meldekette step 2: reports by the individual counties (Landkreise), curated by ZEIT ONLINE.
- Since (incl) March 18: Meldekette step 3: reports by the individual states (Bundesländer), curated by ZEIT ONLINE.
- Before March 18: Meldekette step 4: RKI "situation reports" (PDF documents).
Note:
- The `source` identifier in the CSV file changes correspondingly over time.
- A mix of sources in a time series is of course far from ideal. However, given the boundary conditions, I think switching to better sources as they come up is fair and useful. We might also change (read: rewrite) time series data in hindsight, towards enhancing overall credibility. That has not happened yet, but that can change as we learn more about the Germany-internal data flow and about the credibility of individual data sources.
Shout-out to ZEIT ONLINE for continuously collecting and publishing the state-level data with little delay.
Edit: Notably, by now the Berliner Morgenpost seems to do an equally good job of quickly aggregating the state-level data. We are using that here, too. Thanks!
Edit March 26: Risklayer is coordinating a crowd-sourcing effort to process verified Landkreis data as quickly as possible. Tagesspiegel is verifying this effort and using it in their overview page. As far as I can tell this is so far the most transparent data flow, and also the fastest, getting us the freshest case count numbers. Great work!
Fast aggregation & communication is important during the phase of exponential growth.
- The MDC Berlin has published this visualization and this article, but they seemingly decided to not publish the time series data. I got my hopes up here at first!