title: "Short discussion of use of API for reproducibility"
3
3
author: "Lars Vilhuber"
4
4
date: "`r Sys.Date()`"
5
-
output: html_document
5
+
output:
6
+
html_document:
7
+
keep_md: true
8
+
6
9
---
7
10
8
11
```{r setup, include=FALSE}
9
12
knitr::opts_chunk$set(echo = TRUE)
13
+
NOTE <- "README ::::"
14
+
options(scipen = 999)
10
15
```
11
16
12
17
## Introduction
13
18
14
-
This document demonstrates the importance of using vintage dates when pulling API data from FRED.
15
19
16
-
## Load Required Libraries
20
+
Application programming interfaces (APIs) are popular, and often a very convenient way to get just the data one needs.
21
+
22
+
However, they pose reproducibility challenges:
23
+
24
+
- the data might change, in ways unknown to the researcher
25
+
- the API might change, breaking the researcher's code
26
+
27
+
This brief tutorial discusses some safeguards that researchers can implement at relatively low cost to improve reproducibility.
28
+
29
+
## The setting
30
+
31
+
We will use the St. Louis Fed's FRED data service, frequently used by economists. In this example, we will use a single time series, `GNPCA` (Real Gross National Product), but the example can easily be extended to a whole set of series.
32
+
33
+
Often, researchers will use the default landing page for a time-series, in this case, [https://fred.stlouisfed.org/series/GNPCA](https://fred.stlouisfed.org/series/GNPCA), and then download a CSV or Excel file.
34
+
35
+

36
+
37
+
However, they could instead use the API that is offered. This tutorial uses R and the [`fredr`](https://cran.r-project.org/web/packages/fredr/vignettes/fredr.html) package to describe the use of that API.[^other]
38
+
39
+
[^other]: For a version using Stata, see [this other document](stata-fred.md). Other interfaces include MATLAB and of course Python.
40
+
41
+
## Using the API
42
+
43
+
The FRED API, like most other APIs, requires an API key, a kind of password. One typical technique is to store the API key as an environment variable. The alternative, hard-coding it in the code, is less secure and makes the code hard to share. For this tutorial, obtain an API key from the [FRED profile page](https://fredaccount.stlouisfed.org/apikeys) (a login is required), then store it in the file `.Renviron` as
44
+
45
+
```
46
+
# This is an example, not a valid API key!
47
+
FRED_API_KEY="78862231cc0bd7a7f3b84eb9e19d4b7e"
48
+
```
49
+
50
+
Save it, and restart R. The environment variable will now be available to the R package `fredr`.
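A quick way to confirm that the key was picked up after the restart is to query the environment variable directly. The following is a minimal sketch. The helper name `has_fred_key()` is hypothetical and not part of `fredr`.

```r
# Hypothetical helper (not part of fredr):
# TRUE if FRED_API_KEY is set and non-empty in the current session.
has_fred_key <- function() {
  nzchar(Sys.getenv("FRED_API_KEY"))
}

if (has_fred_key()) {
  message("FRED_API_KEY found in the environment.")
} else {
  message("FRED_API_KEY is not set; check .Renviron and restart R.")
}
```

The check only tests that the variable is present, not that the key is valid. An invalid key would only surface as an error on the first API request.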
So the value of GNPCA, as of `r today` when we ran this, is:
85
+
86
+
> GNPCA(`r DEFAULT_DATE`) = `r current_value`, as of `r today`
87
+
88
+
## Two observations
41
89
42
-
By default, we pull data without specifying a vintage:
90
+
Two things are of note:
43
91
44
-
```{r pull-default}
45
-
data_default <- fredr(
92
+
93
+
### Clipping the time series
94
+
95
+
We pulled the entire time series, even though we were only interested in a subset. While the start date (for GNPCA, `r min_date`) will never change, the length of the time series depends on when it is pulled: a pull in the future will return more observations, and a pull in the past would have returned fewer. This might affect the rest of the code, so a good practice is to define precise start and end dates.
Thus, if we are primarily interested in the time series between, say, `r DATE_START` and `r DATE_END`, whenever we pull the data in the future, the length of the time series will be the same: `r nrow(data_clipped)` rows. Note that this should not change the value obtained for `r DEFAULT_DATE`:
55
116
117
+
> GNPCA(`r DEFAULT_DATE`) = `r clipped_value`, when clipped between `r DATE_START` and `r DATE_END`
56
118
57
-
Now we do the same thing, but precisely defining the vintages:
119
+
but it would affect other (naively computed) values such as the mean of the time-series, or a computed linear trend.
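To illustrate with made-up numbers (these are not actual GNPCA values), the sketch below compares the mean of a short series with the mean after one more observation has been appended, as would happen when pulling the unclipped series a year later. The value at any given date is unchanged, but the naive mean is not, unless the later pull is clipped to the same end date.

```r
# Synthetic data for illustration only; these are NOT actual GNPCA values.
pull_today <- data.frame(
  date  = as.Date(c("2014-01-01", "2015-01-01", "2016-01-01")),
  value = c(100, 110, 120)
)

# The same series pulled a year later has one more row.
pull_later <- rbind(
  pull_today,
  data.frame(date = as.Date("2017-01-01"), value = 130)
)

# The value at a fixed date is identical in both pulls ...
same_point <- pull_today$value[pull_today$date == as.Date("2015-01-01")] ==
  pull_later$value[pull_later$date == as.Date("2015-01-01")]

# ... but the naive mean over the whole series differs,
mean_today <- mean(pull_today$value)
mean_later <- mean(pull_later$value)

# unless the later pull is clipped to the same end date.
clipped <- pull_later[pull_later$date <= as.Date("2016-01-01"), ]
mean_clipped <- mean(clipped$value)
```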
58
120
59
-
```{r pull-vintaged}
60
-
data_vintaged <- fredr(
121
+
### Revisions
122
+
The second issue is that the measures of GNP, both specific data points and occasionally the entire time series, are revised by the Bureau of Economic Analysis. At the time of writing in 2025, `GNPCA` was expressed in "Billions of Chained 2017 Dollars". Obviously, prior to 2017, it would have been expressed differently. Such changes of the base year also lead to revisions of historical values.
123
+
124
+
The FRED API (unlike many other APIs) allows data to be extracted with an *as-of* date, which FRED calls a vintage. Thus, suppose we had pulled the time series as of a specific date:
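As a rough sketch of what such a pull might look like with `fredr`, the wrapper below passes a single date through the `vintage_dates` argument of `fredr()`. The wrapper name and the dates in the commented example call are illustrative assumptions. The call requires the `fredr` package, a valid API key, and network access, so the function is only defined here, not executed.

```r
# Illustrative wrapper (hypothetical name).
# Requires the fredr package, a valid FRED_API_KEY, and network access.
pull_gnpca_vintage <- function(vintage, start, end) {
  if (!requireNamespace("fredr", quietly = TRUE)) {
    stop("Package 'fredr' is required for this sketch.")
  }
  fredr::fredr(
    series_id         = "GNPCA",
    observation_start = as.Date(start),
    observation_end   = as.Date(end),
    vintage_dates     = as.Date(vintage)  # date as of which the data are retrieved
  )
}

# Example call (not run here; dates chosen only for illustration):
# data_vintaged <- pull_gnpca_vintage("2017-06-01", "2000-01-01", "2016-01-01")
```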
So what does the value for `GNPCA` look like as of `r VINTAGE`?
71
141
72
-
As of `r CURRENT_DATE`, the two values are:
73
142
74
-
-`r default_value`, when not specifying a vintage
143
+
As of `r today`, the two values for GNPCA that we have obtained are:
144
+
145
+
- `r current_value`, when not specifying a vintage, as of `r today`
75
146
- `r vintaged_value`, when specifying vintage `r VINTAGE`
76
147
77
-
**Expected result:** As of the current date, the two values may differ because:
78
148
79
-
- The default pulls the most recent revision
80
-
- The vintaged data pulls the value as it was known on 2021-12-31
149
+
That is a substantial difference in absolute value! In part, this is due to the rebasing of the time series, and in part to revisions of the annual value as additional data become available.
81
150
82
151
## Pulling Multiple Vintages for Comparison
83
152
84
153
85
-
Now let's see why this matters - let's pull down a few more vintages:
154
+
Now let's see how long this can matter, by obtaining a number of additional vintages.
## Comparison of GNPCA values across different vintages"
114
-
115
-
```{r print_vintages}
116
-
print(vintage_wide)
180
+
filter(date == DEFAULT_DATE)
117
181
```
118
182
119
183
120
-
*NOTE: These values should NEVER change once recorded. The value for 2007-01-01 as of 2008-09-15 will always be what it was on that date.
121
-
184
+
## Comparison of GNPCA values across different vintages
122
185
123
-
## Key Lessons
186
+
Focussing on the value for `r DEFAULT_DATE`, we see how the value has changed over time:
124
187
125
-
**LESSON:**
188
+
```{r print_vintages,results='asis',echo=FALSE}
189
+
knitr::kable(vintage_wide)
190
+
```
126
191
127
-
1. Always use a fixed vintage date to query the API for reproducibility
128
-
2. Some series change as time progresses, even for historical values
129
-
3. Without vintage specification, you get the latest revision
192
+
The effect on the time series is shown in the following graph (focussing only on the period 2012-2016 for clarity):
193
+
194
+
```{r plot_vintages, fig.width=4, fig.height=4}
195
+
library(ggplot2)
196
+
vintage_comparison %>%
197
+
filter(date > PLOT_DATE) %>%
198
+
ggplot(aes(x = date, y = value, color = vintage)) +
199
+
geom_line() +
200
+
labs(title = "GNPCA over time for different vintages",
201
+
x = "Date",
202
+
y = "GNPCA (Billions of Chained 2017 Dollars)",
203
+
color = "Vintage Date") +
204
+
theme_minimal()
205
+
```
130
206
131
207
## Data Persistence Strategy
132
208
133
-
**SUPPLEMENTARY LESSON:** Save the data pulled through the API as an intermediate dataset and if permissible by the license (check!), redistribute it in case that the API is deprecated and won't work in the future.
209
+
APIs in general have one additional "feature": at some point, they may break, because the hosting institution makes decisions that affect their availability. While the above sections show how to obtain data that precisely reflect the intended range and as-of date, they cannot compensate for the disappearance of, or breaking changes to, an API.
210
+
211
+
The solution is to save the data pulled through the API as an intermediate dataset ("cache") upon first use, and to use the cached data from then on. If redistribution is permitted by the license (check!), this also makes it possible to provide future users with the same data, in case the API is deprecated and no longer works.
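The caching logic can be sketched generically as follows. The helper `read_or_fetch()` and the stand-in fetch function are hypothetical. In practice, the fetch function would wrap the actual API call, and the path would point to a file inside the project rather than a temporary directory.

```r
# Hypothetical helper: return cached data if present, otherwise fetch and cache it.
read_or_fetch <- function(path, fetch) {
  if (file.exists(path)) {
    message("Re-using existing file: ", path)
    return(readRDS(path))
  }
  message("No cached file found; fetching data.")
  data <- fetch()
  # Create the target directory if needed, then write the cache.
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(data, path)
  data
}

# Stand-in for an API call, returning made-up values (not actual GNPCA data).
fake_fetch <- function() {
  data.frame(date = as.Date("2016-01-01"), value = 1)
}

cache_file <- file.path(tempdir(), "fred", "example_cache.rds")
first  <- read_or_fetch(cache_file, fake_fetch)  # fetches and writes the cache
second <- read_or_fetch(cache_file, fake_fetch)  # reads from the cache
```

Passing the fetch step as a function keeps the API call out of the cached path entirely: once the file exists, no request is made, so the code keeps working even if the API later changes or disappears.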
message(NOTE, "Re-using existing file with vintage =", as.character(VINTAGE_READ),"\n")
146
226
} else {
147
227
# Code if the file does not exist
148
228
# You could do the full API pull
149
229
# conditional on the intermediate
150
230
# file NOT being there
151
-
cat(NOTE, "Reading in data from FRED API with vintage =", as.character(VINTAGE), "\n")
231
+
message(NOTE, "Reading in data from FRED API with vintage =", as.character(VINTAGE), "\n")
152
232
fred_data <- fredr(
153
233
series_id = "GNPCA",
154
234
observation_start = DATE_START,
@@ -159,18 +239,14 @@ if (file.exists("data/fred/fred_gnpca.rds")) {
159
239
}
160
240
```
161
241
162
-
### Final Dataset
242
+
When first run, the code will output
163
243
164
-
```{r data-persistence-print}
165
-
print(fred_data)
244
+
```
245
+
README ::::Reading in data from FRED API with vintage =2017-06-01
166
246
```
167
247
168
-
### Analysis Complete
169
-
170
-
Intermediate data saved to: `data/fred/fred_gnpca.rds`
171
-
172
-
This ensures reproducibility even if the API changes or becomes unavailable.
173
-
174
-
## Summary
248
+
but subsequent runs will show the output
175
249
176
-
This analysis demonstrates the critical importance of specifying vintage dates when working with FRED API data to ensure reproducibility and understand how economic data revisions affect historical values.
250
+
```
251
+
README ::::Re-using existing file with vintage =2017-06-01
```