Commit b99e243

committed
Updating code
1 parent 5b01404 commit b99e243

4 files changed

+400
-163
lines changed

data/fred/fred_gnpca.rds

162 Bytes
Binary file not shown.

main.Rmd

Lines changed: 141 additions & 65 deletions
@@ -1,90 +1,161 @@
11
---
2-
title: "FRED API Data Vintage Example"
2+
title: "Short discussion of use of API for reproducibility"
33
author: "Lars Vilhuber"
44
date: "`r Sys.Date()`"
5-
output: html_document
5+
output:
6+
html_document:
7+
keep_md: true
8+
69
---
710

811
```{r setup, include=FALSE}
912
knitr::opts_chunk$set(echo = TRUE)
13+
NOTE <- "README ::::"
14+
options(scipen = 999)
1015
```
1116

1217
## Introduction
1318

14-
This document demonstrates the importance of using vintage dates when pulling API data from FRED.
1519

16-
## Load Required Libraries
20+
Application programming interfaces (APIs) are popular, and often a very convenient way to get just the data one needs.
21+
22+
However, they pose reproducibility challenges:
23+
24+
- the data might change, in ways unknown to the researcher
25+
- the API might change, breaking the researcher's code
26+
27+
This brief tutorial discusses some safeguards that the researcher can implement at relatively low cost to improve reproducibility.
28+
29+
## The setting
30+
31+
We will use the St. Louis Fed's FRED data service, frequently used by economists. In this example, we will use a single time-series -- `GNPCA` -- Real Gross National Product -- but the example can be easily extended to a whole set of series.
32+
33+
Often, researchers will use the default landing page for a time-series, in this case, [https://fred.stlouisfed.org/series/GNPCA](https://fred.stlouisfed.org/series/GNPCA), and then download a CSV or Excel file.
34+
35+
![GNPCA landing page](images/gnpca-fred.png)
36+
37+
However, they could instead use the API that is offered. This tutorial uses R and the [`fredr`](https://cran.r-project.org/web/packages/fredr/vignettes/fredr.html) package to describe the use of that API.[^other]
38+
39+
[^other]: For a version using Stata, see [this other document](stata-fred.md). Other interfaces include MATLAB and of course Python.
40+
41+
## Using the API
42+
43+
The FRED API, like most other APIs, requires an API key, a kind of password. A typical technique is to store the API key as an environment variable; the alternative, hard-coding it in the code, is less secure and makes the code hard to share. For this tutorial, obtain an API key from the [FRED profile page](https://fredaccount.stlouisfed.org/apikeys) (a login is required), then store it in the file `.Renviron` as
44+
45+
```
46+
# This is an example, not a valid API key!
47+
FRED_API_KEY="78862231cc0bd7a7f3b84eb9e19d4b7e"
48+
```
49+
50+
Save it, and restart R. The environment variable will now be available to the R package `fredr`.
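
Before calling the API, it can help to fail early with a clear message if the environment variable is missing. The following is a minimal sketch using only base R; the helper `require_env_key()` and the variable name `DEMO_API_KEY` are hypothetical and not part of `fredr`, and the value shown is a placeholder, not a valid key.

```r
# Hypothetical helper (not part of fredr): stop with a clear message
# if the named environment variable is unset or empty.
require_env_key <- function(name = "FRED_API_KEY") {
  key <- Sys.getenv(name, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", name,
         " is not set; add it to .Renviron and restart R.")
  }
  invisible(key)
}

# Demonstration with a placeholder variable, so that a real key is not touched
Sys.setenv(DEMO_API_KEY = "0123456789abcdef0123456789abcdef")
key <- require_env_key("DEMO_API_KEY")
nchar(key)
```

In a real script, one would presumably call `require_env_key()` with its default argument right before the first API request.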
51+
52+
To use the API, we load the necessary libraries.
1753

1854
```{r load-libraries, message=FALSE, warning=FALSE}
1955
if (!require("fredr")) install.packages("fredr")
2056
if (!require("dplyr")) install.packages("dplyr")
57+
if (!require("knitr")) install.packages("knitr")
58+
if (!require("ggplot2")) install.packages("ggplot2")
2159
library(fredr)
2260
library(dplyr)
61+
library(knitr)
62+
63+
DEFAULT_DATE=as.Date("2016-01-01")
64+
PLOT_DATE=as.Date("2012-01-01")
2365
```
2466

25-
## Set Parameters
67+
## Simple usage
2668

27-
```{r parameters}
28-
# Set some parameters we want to re-use
29-
# - the date range we want
30-
DATE_START <- as.Date("2007-01-01")
31-
DATE_END <- as.Date("2007-01-01")
32-
# - the as-of date we want
33-
CURRENT_DATE <- Sys.Date()
34-
VINTAGE <- as.Date("2021-12-31")
35-
# - for testing, other as-of dates
36-
ALTVINTAGES <- as.Date(c("2008-09-15", "2015-09-15"))
37-
NOTE <- "README ::::"
69+
We can obtain the entire series for `GNPCA` as follows, focussing on the value for `r DEFAULT_DATE`:
70+
71+
```{r pull-current}
72+
data_current <- fredr(
73+
series_id = "GNPCA"
74+
)
75+
names(data_current)
76+
nrow(data_current)
77+
print(head(data_current))
78+
current_value <- round(data_current$value[data_current$date == DEFAULT_DATE],0)
79+
max_current_date <- max(data_current$date)
80+
min_date <- min(data_current$date)
81+
today <- Sys.Date()
3882
```
3983

40-
## Pulling Data WITHOUT Vintage Specification
84+
So the value of GNPCA, as of `r today` when we ran this, is:
85+
86+
> GNPCA(`r DEFAULT_DATE`) = `r current_value`, as of `r today`
87+
88+
## Two observations
4189

42-
By default, we pull data without specifying a vintage:
90+
Two things are of note:
4391

44-
```{r pull-default}
45-
data_default <- fredr(
92+
93+
### Clipping the time series
94+
95+
We pulled the entire time series, even though we only were interested in a subset. While the start date (for GNPCA, `r min_date`) will never change, the length of the time series will change if we pull in the future, or if we had pulled in the past. This might affect the rest of the code. So a good practice is to define precise start and end dates.
96+
97+
98+
```{r define-start-end}
99+
# Set some parameters we want to re-use
100+
# - the date range we want
101+
DATE_START <- as.Date("2000-01-01")
102+
DATE_END <- as.Date("2016-01-01")
103+
104+
data_clipped <- fredr(
46105
series_id = "GNPCA",
47106
observation_start = DATE_START,
48107
observation_end = DATE_END
49108
)
50-
print(data_default)
51-
default_value <- mean(data_default$value, na.rm = TRUE)
109+
nrow(data_clipped)
110+
print(head(data_clipped))
111+
clipped_value <- round(data_clipped$value[data_clipped$date == DEFAULT_DATE],0)
112+
max_clipped_date <- max(data_clipped$date)
52113
```
53114

54-
## Pulling Data WITH Vintage Specification
115+
Thus, if we are primarily interested in the time series between, say, `r DATE_START` and `r DATE_END`, then whenever we pull the data in the future, the length of the time series will be the same: `r nrow(data_clipped)` rows. Note that this should not change the value obtained for `r DEFAULT_DATE`:
55116

117+
> GNPCA(`r DEFAULT_DATE`) = `r clipped_value`, when clipped between `r DATE_START` and `r DATE_END`
56118
57-
Now we do the same thing, but precisely defining the vintages:
119+
but it would affect other (naively computed) values such as the mean of the time-series, or a computed linear trend.
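
To make the point concrete, here is a small sketch with made-up annual values (not actual GNPCA data) and base R only. It compares the mean of an unclipped series pulled at two different times with the mean of the same series clipped to a fixed end date; the object names are illustrative.

```r
# Made-up annual values, for illustration only (not actual GNPCA data)
toy <- data.frame(
  date  = as.Date(paste0(2012:2018, "-01-01")),
  value = c(100, 102, 105, 108, 110, 113, 117)
)
end_date <- as.Date("2016-01-01")

# What an unclipped pull would have returned at two points in time
pull_2017 <- toy[toy$date <= as.Date("2017-01-01"), ]
pull_2018 <- toy

# Unclipped: the mean depends on when the data were pulled
mean_pull_2017 <- mean(pull_2017$value)
mean_pull_2018 <- mean(pull_2018$value)

# Clipped to a fixed end date: the mean no longer depends on the pull time
clip <- function(df, end) df[df$date <= end, ]
mean_clipped_2017 <- mean(clip(pull_2017, end_date)$value)
mean_clipped_2018 <- mean(clip(pull_2018, end_date)$value)

c(unclipped_2017 = mean_pull_2017, unclipped_2018 = mean_pull_2018)
c(clipped_2017 = mean_clipped_2017, clipped_2018 = mean_clipped_2018)
```

This sketch only addresses the length of the series; it assumes the values themselves are not revised between pulls.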
58120

59-
```{r pull-vintaged}
60-
data_vintaged <- fredr(
121+
### Revisions
122+
The second issue is that the measures of GNP, both specific data points and occasionally the entire time series, are revised by the Bureau of Economic Analysis. At the time of writing in 2025, `GNPCA` was expressed in "Billions of Chained 2017 Dollars". Obviously, prior to 2017, it would have been expressed differently. This also leads to revisions of historical values.
123+
124+
The FRED API (although not all other APIs) allows one to extract data with an *as-of* date, which it calls a vintage. Thus, suppose we had pulled the time series as of a specific date:
125+
126+
```{r define-dates2}
127+
VINTAGE <- as.Date("2017-06-01")
128+
129+
data_vintage <- fredr(
61130
series_id = "GNPCA",
62131
observation_start = DATE_START,
63132
observation_end = DATE_END,
64133
vintage_dates = VINTAGE
65134
)
66-
print(data_vintaged)
67-
vintaged_value <- mean(data_vintaged$value, na.rm = TRUE)
135+
nrow(data_vintage)
136+
print(head(data_vintage))
137+
vintaged_value <- round(data_vintage$value[data_vintage$date == DEFAULT_DATE],0)
68138
```
69139

70-
## Compare the Results
140+
So what does the value for `GNPCA` look like as of `r VINTAGE`?
71141

72-
As of `r CURRENT_DATE`, the two values are:
73142

74-
- `r default_value`, when not specifying a vintage
143+
As of `r today`, the two values for GNPCA that we have obtained are:
144+
145+
- `r current_value`, when not specifying a vintage, as of `r today`
75146
- `r vintaged_value`, when specifying vintage `r VINTAGE`
76147

77-
**Expected result:** As of the current date, the two values may differ because:
78148

79-
- The default pulls the most recent revision
80-
- The vintaged data pulls the value as it was known on 2021-12-31
149+
That is a substantial difference in absolute value! In part, this is due to the rebasing of the time series; in part, it is due to revisions of the annual value as additional data become available.
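
One way to think about separating the two effects: a pure change of reference year rescales all levels by a constant factor, which leaves growth rates untouched, whereas genuine data revisions generally change growth rates as well. The sketch below illustrates only the first part, using made-up numbers and a hypothetical rescaling factor; treating rebasing as a constant rescaling is a simplifying assumption.

```r
# Made-up levels in "old reference year" dollars (not actual GNPCA values)
old_base <- c(`2014` = 100, `2015` = 103, `2016` = 105)

# Hypothetical rebasing: every level is multiplied by the same factor
rebase_factor <- 1.12
new_base <- old_base * rebase_factor

# The levels differ between the two versions ...
new_base - old_base

# ... but year-over-year growth rates are unchanged by pure rescaling
growth <- function(x) x[-1] / x[-length(x)] - 1
all.equal(growth(old_base), growth(new_base))
```

Comparing growth rates across two vintages is therefore a rough way to check whether a difference goes beyond rescaling.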
81150

82151
## Pulling Multiple Vintages for Comparison
83152

84153

85-
Now let's see why this matters - let's pull down a few more vintages:
154+
Now let's see how long this can matter, by obtaining a number of additional vintages.
86155

87156
```{r multiple-vintages}
157+
# - for testing, other as-of dates
158+
ALTVINTAGES <- as.Date(c("2018-01-01", "2019-01-01", "2020-01-01", "2021-01-01", "2022-01-01", "2023-01-01"))
88159
# Combine all vintage dates
89160
all_vintages <- c(VINTAGE, ALTVINTAGES)
90161
@@ -106,33 +177,40 @@ vintage_comparison <- do.call(rbind, vintage_data_list)
106177
# Reshape for comparison
107178
vintage_wide <- vintage_comparison %>%
108179
select(date, value, vintage) %>%
109-
tidyr::pivot_wider(names_from = vintage, values_from = value, names_prefix = "vintage_")
110-
```
111-
112-
113-
## Comparison of GNPCA values across different vintages"
114-
115-
```{r print_vintages}
116-
print(vintage_wide)
180+
filter(date == DEFAULT_DATE)
117181
```
118182

119183

120-
*NOTE: These values should NEVER change once recorded. The value for 2007-01-01 as of 2008-09-15 will always be what it was on that date.
121-
184+
## Comparison of GNPCA values across different vintages
122185

123-
## Key Lessons
186+
Focussing on the value for `r DEFAULT_DATE`, we see how the value has changed over time:
124187

125-
**LESSON:**
188+
```{r print_vintages,results='asis',echo=FALSE}
189+
kable(vintage_wide)
190+
```
126191

127-
1. Always use a fixed vintage date to query the API for reproducibility
128-
2. Some series change as time progresses, even for historical values
129-
3. Without vintage specification, you get the latest revision
192+
The effect on the time series is shown in the following graph (focussing only on the period 2012-2016 for clarity):
193+
194+
```{r plot_vintages, fig.width=4, fig.height=4}
195+
library(ggplot2)
196+
vintage_comparison %>%
197+
filter(date > PLOT_DATE) %>%
198+
ggplot(aes(x = date, y = value, color = vintage)) +
199+
geom_line() +
200+
labs(title = "GNPCA over time for different vintages",
201+
x = "Date",
202+
y = "GNPCA (Billions of Chained 2017 Dollars)",
203+
color = "Vintage Date") +
204+
theme_minimal()
205+
```
130206

131207
## Data Persistence Strategy
132208

133-
**SUPPLEMENTARY LESSON:** Save the data pulled through the API as an intermediate dataset and if permissible by the license (check!), redistribute it in case that the API is deprecated and won't work in the future.
209+
APIs in general have one additional "feature": at some point, they may break, because the hosting institution makes decisions that affect their availability. While the above sections show how to obtain data that precisely reflect the intended range and as-of date, they cannot compensate for the disappearance of, or breaking changes to, an API.
210+
211+
The solution is to save the data pulled through the API as an intermediate dataset ("cache") upon first use, and to use the cached data from then on. If the license permits redistribution (check!), this also makes it possible to provide future users with the same data in case the API is deprecated and no longer works.
134212

135-
### Data Persistence Strategy
213+
### Example implementation of caching
136214

137215
```{r data-persistence-code}
138216
# Create directories if they don't exist
@@ -141,14 +219,16 @@ dir.create("data/fred", showWarnings = FALSE)
141219
142220
# Check if file exists
143221
if (file.exists("data/fred/fred_gnpca.rds")) {
144-
cat(NOTE, "Re-using existing file\n")
145222
fred_data <- readRDS("data/fred/fred_gnpca.rds")
223+
# get the vintage id from the file
224+
VINTAGE_READ <- max(as.Date(fred_data$realtime_start))
225+
message(NOTE, "Re-using existing file with vintage =", as.character(VINTAGE_READ),"\n")
146226
} else {
147227
# Code if the file does not exist
148228
# You could do the full API pull
149229
# conditional on the intermediate
150230
# file NOT being there
151-
cat(NOTE, "Reading in data from FRED API with vintage =", as.character(VINTAGE), "\n")
231+
message(NOTE, "Reading in data from FRED API with vintage =", as.character(VINTAGE), "\n")
152232
fred_data <- fredr(
153233
series_id = "GNPCA",
154234
observation_start = DATE_START,
@@ -159,18 +239,14 @@ if (file.exists("data/fred/fred_gnpca.rds")) {
159239
}
160240
```
161241

162-
### Final Dataset
242+
When first run, the code will output
163243

164-
```{r data-persistence-print}
165-
print(fred_data)
244+
```
245+
README ::::Reading in data from FRED API with vintage =2017-06-01
166246
```
167247

168-
### Analysis Complete
169-
170-
Intermediate data saved to: `data/fred/fred_gnpca.rds`
171-
172-
This ensures reproducibility even if the API changes or becomes unavailable.
173-
174-
## Summary
248+
but subsequent runs will show the output
175249

176-
This analysis demonstrates the critical importance of specifying vintage dates when working with FRED API data to ensure reproducibility and understand how economic data revisions affect historical values.
250+
```
251+
README ::::Re-using existing file with vintage =2017-06-01
252+
```
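
The same pattern can be wrapped in a small reusable function. The following is a sketch in base R, not the approach used in this commit; `cached_pull()` and `fake_pull()` are hypothetical names, and `fake_pull()` stands in for an API call so that the example runs without network access or an API key.

```r
# Hypothetical helper: read from the cache if present, otherwise pull and save.
cached_pull <- function(path, pull_fun) {
  if (file.exists(path)) {
    message("Re-using cached file: ", path)
    return(readRDS(path))
  }
  message("No cache found, pulling data and saving to: ", path)
  data <- pull_fun()
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(data, path)
  data
}

# Stand-in for an API call; counts how often it is actually invoked
n_pulls <- 0
fake_pull <- function() {
  n_pulls <<- n_pulls + 1
  data.frame(date = as.Date("2016-01-01"), value = 1)
}

# Use a temporary location and start from a clean state
cache_file <- file.path(tempdir(), "demo_cache", "series.rds")
unlink(cache_file)

first  <- cached_pull(cache_file, fake_pull)  # pulls and writes the cache
second <- cached_pull(cache_file, fake_pull)  # reads the cache, no pull
n_pulls
```

With a real API, `pull_fun` could be an anonymous function wrapping the `fredr()` call with fixed `observation_start`, `observation_end`, and `vintage_dates`.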

main.html

Lines changed: 259 additions & 98 deletions
Large diffs are not rendered by default.
34.3 KB
