Merge pull request #275 from PSLmodels/pr-improve-target-documentation
PR improve target documentation
donboyd5 authored Nov 1, 2024
2 parents 3b39b9c + 96bde52 commit 04e5039
Showing 8 changed files with 100 additions and 80 deletions.
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
tmd
@@ -194,6 +194,7 @@ rm(data, data2, cdnums)
```{r}
#| label: create-save-soi-cddata-long
#| eval: true
#| output: false
cdwide <- read_csv(fs::path(CDINTERMEDIATE, "cddata_wide_clean.csv"))
doc <- read_csv(fs::path(CDINTERMEDIATE, "variable_documentation.csv"))
@@ -132,21 +132,3 @@ rm(vdoc)
```

## Issues and notes

Notes below are primarily intended for the project team but could be of interest to users.

### Number of returns and number of individuals

Note that we do not have number of exemptions, but we do have:

- N1 -- number of returns

- N2 -- number of individuals

Footnote 5 in the IRS documentation file (21incddocguide.docx), pertaining to N2, says:

> Beginning in 2018, personal exemption deductions were suspended for the primary, secondary, and dependent taxpayers. However, the data used to create the “Number of individuals”—filing status, dependent status indicator, and identifying dependent information—are still available on the Form 1040. This field is based on these data.


@@ -8,52 +8,6 @@ editor_options:

This section creates one long file that is a superset of what we need for individual 117th Congressional District target files. This long file has everything needed to extract and save a target file for any CD. It also has additional convenience variables that are not written to individual CD target files, such as variable descriptions, human-friendly AGI-range labels, state FIPS codes, and a sort code for ordering records within a CD.

## Documentation for target files for individual CDs

### Target file name

Congressional District target files follow the naming convention **xxxx_targets.csv**, where **xxxx** is a 4 character CD identifier.

- The first two characters are the state postal abbreviation or, in the case of the District of Columbia, "DC". (DC does not have a voting representative but does have a non-voting member. The SOI data have information for DC and so it is in the data. Thus, we have data for 435 voting districts, plus data for DC.)

- The next 2 characters identify the Congressional District within the state, with a leading zero. For states that have more than one district, these range from 01 to the number of districts (for example, 53 in the case of California). For the 7 states and DC that have only one CD, these 2 characters are 00, following the SOI convention.

- Thus, the filename for California's 3rd Congressional District would be CA03_targets.csv and allowable file names would range from CA01_targets.csv to CA53_targets.csv. There is no CA00_targets.csv. For any of the 7 states (or DC) that have only one CD, the district code is 00, so the filename for Wyoming, for example, would be WY00_targets.csv.

### Target file variables

Each target file will have the following variables:

- **varname**: This is a PUF-based variable name, as used in Tax-Calculator. Thus, examples of allowable names are XTOT (but see below), e00200 (wages), c00100 (AGI, calculated), and e00900.

- **count**: Indicates whether the target is a count or a dollar amount. Allowable values are 0 for dollar amount and 1 for count.

- **scope**: Indicates which kinds of records the target applies to. Allowable values are 0 for all records, 1 for tax filers, and 2 for nonfilers.

- **agilo**, **agihi**: Lower and upper bounds for the AGI range. The interval is of the form \[agilo, agihi) -- that is, it includes all values \>= agilo and \< agihi.

- **fstatus**: Filing status, following the PUF MARS definition. Allowable values are integers 0-5, where 0 = all records, 1 = single, 2 = married filing joint, 3 = married filing separately, 4 = head of household, and 5 = surviving spouse. **\[?? VERIFY WITH MARTIN\]**

- **target**: The SOI value (or other target, if the user overrides the SOI value) for this variable, scope, AGI range, and filing status. Counts and dollar amounts are "raw" values - neither is scaled to be in thousands or millions, for example. (Because SOI reported dollar values usually are in \$ thousands, we have multiplied them by 1,000 so that they are unscaled.)

### The special first data row of a CD target file

The area targeting software needs a value for total population in the area. It uses this to scale initial weights prior to optimization so that they sum to the area population. To assist in this, the target file must contain in its first data row a value for the total area population. This special row must have the following values:

- **varname**: XTOT
- **count**: 0
- **scope**: 0
- **agilo**: must be \< -8e99
- **agihi**: must be \> 8e99
- **fstatus**: 0
- **target**: area population

For example, here are the header row and the first data row for an area that has a population of 33 million:

varname,count,scope,agilo,agihi,fstatus,target

XTOT, 0, 0,-9e99, 9e99, 0, 33e6

## Setup

```{r}
@@ -94,6 +48,7 @@ cdlong <- read_csv(fs::path(CDINTERMEDIATE, "cddata_long_clean.csv"))

```{r}
#| label: drop-records-and-variables
#| output: false
cdlong1 <- cdlong |>
filter(rectype %in% c("cd", "cdstate", "DC"))
@@ -130,6 +85,7 @@ We are going to create new rows for N00100 (Number of returns with AGI (estimate

```{r}
#| label: address-agi
#| output: false
nagi <- cdlong2 |>
filter(vname == "N1") |>
@@ -156,6 +112,7 @@ rm(nagi, check)

```{r}
#| label: fstatus-misc
#| output: false
cdlong4 <- cdlong3 |>
mutate(
@@ -176,11 +133,11 @@ cdlong4 <- cdlong3 |>
value)
)
summary(cdlong4)
# summary(cdlong4)
# skim(cdlong4)
count(cdlong4, fstatus)
count(cdlong4, count, vtype)
count(cdlong4, scope)
# count(cdlong4, fstatus)
# count(cdlong4, count, vtype)
# count(cdlong4, scope)
```

@@ -190,6 +147,7 @@ Prepare the Census population data.

```{r}
#| label: prepare-census-pop
#| output: false
# - **varname**: XTOT
# - **count**: 0
@@ -239,6 +197,7 @@ rm(soistubs, fmatch)

```{r}
#| label: create-cdbasefile
#| output: false
cdbasefile <- bind_rows(cdlong4 |>
rename(target=value) |>
@@ -254,9 +213,9 @@ cdbasefile <- bind_rows(cdlong4 |>
vname, description, agirange) |>
arrange(statecd, src, scope, fstatus, basevname, count, agistub)
glimpse(cdbasefile)
summary(cdbasefile)
skim(cdbasefile)
# glimpse(cdbasefile)
# summary(cdbasefile)
# skim(cdbasefile)
cdbasefile |> count(basevname)
@@ -57,6 +57,7 @@ This chunk gets previously saved Congressional District population data, does mi

```{r}
#| label: cdpop-clean
#| output: false
cdpop1year <- read_csv(fs::path(CDRAW, "cdpop1year_acs.csv"))
cdpop1year |> summarise(estimate=sum(estimate)) # 335157329
15 changes: 14 additions & 1 deletion tmd/areas/targets/prepare/cd_issues_and_TODOs.qmd
@@ -10,10 +10,23 @@ editor_options:

- Per IRS documentation: "Income and tax items with less than 20 returns for a particular AGI class were combined with another AGI class within the same congressional district. Collapsed AGI classes are identified with a double asterisk (dropped) in the Excel files." **This will require attention soon.**

## Defining number of returns with AGI
## Defining number of returns, and number of returns with AGI

- We have an issue with AGI: for Congressional Districts, the IRS does NOT report the number of returns with AGI. It reports two variables that should be close in concept: N1 (Number of returns) and N02650 (Number of returns with total income). Across all CDs in the U.S., N1 was 157,375,370 in 2021 and N02650 was 155,283,590. Since N1 is larger and is probably a better indicator of the total number of filers, we'll use it as the number-of-returns counterpart to AGI (c00100). We address this late in the process because, for most of the data development, we try to keep the data faithful to what the IRS reports, and because the solution chosen here may be suboptimal and we may want to change it later.
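
As a quick check on how far apart the two candidate counts are, the 2021 totals quoted above can be compared directly. This is only an illustrative sketch in base R; the names `n1_total`, `n02650_total`, `gap`, and `gap_share` are made up for the example and do not appear in the project code.

```r
# 2021 U.S. totals across all Congressional Districts, as quoted above
n1_total <- 157375370      # N1: number of returns
n02650_total <- 155283590  # N02650: number of returns with total income

# Absolute and relative gap between the two counts
gap <- n1_total - n02650_total
gap_share <- gap / n1_total

cat("gap:", format(gap, big.mark = ","), "\n")
cat("share of N1:", sprintf("%.2f%%", 100 * gap_share), "\n")
```

The gap is about 2.1 million returns, or roughly 1.3% of N1.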

Note that we do not have number of exemptions, but we do have:

- N1 -- number of returns

- N2 -- number of individuals

Footnote 5 in the IRS documentation file (21incddocguide.docx), pertaining to N2, says:

> Beginning in 2018, personal exemption deductions were suspended for the primary, secondary, and dependent taxpayers. However, the data used to create the “Number of individuals”—filing status, dependent status indicator, and identifying dependent information—are still available on the Form 1040. This field is based on these data.



## Census population

Census population is used to create the row 0 "XTOT" (population) target, which provides an initial scaling ratio: `initial_weights_scale = row.target / national_population`.
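
The scaling ratio can be sketched in a few lines of base R. The function name mirrors the expression in the text, but the function itself is hypothetical and is not the targeting software's code. The national total used here is the figure noted in a code comment in the population-cleaning chunk; whether the targeting software uses that exact figure is an assumption.

```r
# Hypothetical helper: share of the national population living in one area
initial_weights_scale <- function(area_population, national_population) {
  stopifnot(area_population > 0, national_population > 0)
  area_population / national_population
}

area_population <- 33e6           # XTOT target from the area's first data row
national_population <- 335157329  # illustrative national total (assumption)

weights_scale <- initial_weights_scale(area_population, national_population)
print(weights_scale)
```

Multiplying national weights by this ratio gives initial area weights that sum to approximately the area population, which is the starting point the text describes.
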
@@ -6,7 +6,6 @@ editor_options:

# Map tax calculator vars to soi vars and extract targets


```{r}
#| label: setup
@@ -22,6 +21,7 @@ source(here::here("R", "constants.R"))

```{r}
#| label: get-cdbasefile
#| output: false
cd117 <- read_csv(fs::path(CDINTERMEDIATE, "cdbasefile_117.csv"))
cd118 <- read_csv(fs::path(CDINTERMEDIATE, "cdbasefile_118.csv"))
@@ -53,9 +53,10 @@ write_csv(stack, fs::path(CDINTERMEDIATE, "cdbasefile_sessions.csv"))

```{r}
#| label: tc-soi-variablemap
#| output: false
soivars <- count(stack, basevname)
soivars$basevname
# soivars$basevname
# the MARS mappings let us get counts by filing status by agi range
vmap <- read_csv(file="
@@ -75,18 +76,19 @@ e26270, v26270

```{r}
#| label: mapped-file
#| output: false
mapped <- stack |>
filter(basevname %in% vmap$soivar) |>
mutate(varname=factor(basevname, levels=vmap$soivar, labels=vmap$tcvar))
count(mapped, varname, vname)
# count(mapped, varname, vname)
```


```{r}
#| label: extracts
#| output: false
# varname,count,scope,agilo,agihi,fstatus,target
# XTOT, 0, 0,-9e99, 9e99, 0, 33e6
@@ -136,5 +138,60 @@ targets |>
```

## Documentation for target files for individual CDs

### Target file name

Congressional District target files follow the naming convention **xxxx_targets.csv**, where **xxxx** is a 4 character CD identifier.

- The first two characters are the state postal abbreviation or, in the case of the District of Columbia, "DC". (DC does not have a voting representative but does have a non-voting member. The SOI data have information for DC and so it is in the data. Thus, we have data for 435 voting districts, plus data for DC.)

- The next 2 characters identify the Congressional District within the state, with a leading zero. For states that have more than one district, these range from 01 to the number of districts (for example, 53 in the case of California). For the 7 states and DC that have only one CD, these 2 characters are 00, following the SOI convention.

- Thus, the filename for California's 3rd Congressional District would be CA03_targets.csv and allowable file names would range from CA01_targets.csv to CA53_targets.csv. There is no CA00_targets.csv. For any of the 7 states (or DC) that have only one CD, the district code is 00, so the filename for Wyoming, for example, would be WY00_targets.csv.
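
The naming convention above can be sketched as a small base R helper. The function `cd_target_filename()` is hypothetical (it is not part of the project code); it simply pads the district number to two digits and appends the `_targets.csv` suffix.

```r
# Hypothetical helper: build a target file name from a state postal
# abbreviation and a district number (0 for single-district areas)
cd_target_filename <- function(state, district) {
  stopifnot(nchar(state) == 2, district >= 0, district <= 99)
  sprintf("%s%02d_targets.csv", toupper(state), as.integer(district))
}

cd_target_filename("CA", 3)  # California's 3rd district
cd_target_filename("WY", 0)  # single-district state
cd_target_filename("DC", 0)  # District of Columbia
```

The helper does not check that the state abbreviation or district number actually exists; that would need a lookup table of districts per state.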

### Target file variables

### The special first data row of a CD target file

The area targeting software needs a value for total population in the area. It uses this to scale initial weights prior to optimization so that they sum to the area population. To assist in this, the target file must contain in its first data row a value for the total area population. This special row must have the following values:

- **varname**: XTOT
- **count**: 0
- **scope**: 0
- **agilo**: must be \< -8e99
- **agihi**: must be \> 8e99
- **fstatus**: 0
- **target**: area population

For example, here are the header row and the first data row for an area that has a population of 33 million:

varname,count,scope,agilo,agihi,fstatus,target

XTOT, 0, 0,-9e99, 9e99, 0, 33e6
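
The requirements for this special row can be expressed as a simple check. The following is a minimal base R sketch, not the targeting software's own validation; `xtot_row` and `is_valid_xtot_row()` are hypothetical names introduced for illustration.

```r
# The special first data row, using the example values from the text
xtot_row <- data.frame(
  varname = "XTOT", count = 0, scope = 0,
  agilo = -9e99, agihi = 9e99, fstatus = 0, target = 33e6,
  stringsAsFactors = FALSE
)

# Hypothetical check that a one-row data frame satisfies the listed rules
is_valid_xtot_row <- function(row) {
  nrow(row) == 1 &&
    row$varname == "XTOT" &&
    row$count == 0 && row$scope == 0 && row$fstatus == 0 &&
    row$agilo < -8e99 && row$agihi > 8e99
}

is_valid_xtot_row(xtot_row)
```

A row whose `agilo` is, say, 0 fails the check because the bound is not below -8e99.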

For up-to-date documentation of target files, see the associated [README](https://github.com/PSLmodels/tax-microdata-benchmarking/blob/master/tmd/areas/targets/README.md). The following is from the version that was current as of 2024-11-01:

> An areas targets file is a CSV-formatted file with its first row containing column names and its second row containing the area population target. Each subsequent row contains another target. Rows after the first two that start with a `#` character are considered comments and are skipped.
>
> Here are the column names and their valid values:
>
> 1. **`varname`**: any Tax-Calculator input variable name plus any Tax-Calculator calculated variable in the list of cached variables in the `tmd/storage/__init__.py` file
> 2. **`count`**: integer in \[0,4\] range:
> - count==0 implies dollar total of varname is tabulated
> - count==1 implies number of tax units with **any** value of varname is tabulated
> - count==2 implies number of tax units with a **nonzero** value of varname is tabulated
> - count==3 implies number of tax units with a **positive** value of varname is tabulated
> - count==4 implies number of tax units with a **negative** value of varname is tabulated
> 3. **`scope`**: integer in \[0,2\] range:
> - scope==0 implies all tax units are tabulated
> - scope==1 implies only PUF-derived filing units are tabulated
> - scope==2 implies only CPS-derived filing units are tabulated
> 4. **`agilo`**: float representing lower bound of the AGI range (which is included in the range) that is tabulated.
> 5. **`agihi`**: float representing upper bound of the AGI range (which is excluded from the range) that is tabulated.
> 6. **`fstatus`**: integer in \[0,5\] range:
> - fstatus=0 implies all filing statuses are tabulated
> - other fstatus values imply just the tax units with the Tax-Calculator `MARS` variable equal to fstatus are included in the tabulation
> 7. **`target`**: target amount:
> - dollars if count==0
> - number of tax units if count\>0
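
The column rules quoted above lend themselves to a quick sanity check before a target file is written. The sketch below is in base R and is only illustrative: `validate_targets()` and `in_agi_range()` are hypothetical helpers, the example values are made up, the requirement that `agilo` be strictly below `agihi` is an assumption, and comment rows starting with `#` are not handled.

```r
# Hypothetical validator for a targets data frame; returns a character
# vector of problems (empty when every check passes)
validate_targets <- function(targets) {
  expected <- c("varname", "count", "scope", "agilo", "agihi", "fstatus", "target")
  missing_cols <- setdiff(expected, names(targets))
  if (length(missing_cols) > 0) return(paste("missing column:", missing_cols))
  if (nrow(targets) == 0) return("no data rows")
  problems <- character(0)
  if (!all(targets$count %in% 0:4)) problems <- c(problems, "count outside [0, 4]")
  if (!all(targets$scope %in% 0:2)) problems <- c(problems, "scope outside [0, 2]")
  if (!all(targets$fstatus %in% 0:5)) problems <- c(problems, "fstatus outside [0, 5]")
  # Assumption: a usable AGI range needs agilo strictly below agihi
  if (!all(targets$agilo < targets$agihi)) problems <- c(problems, "agilo not below agihi")
  # The first data row must hold the area population target
  if (targets$varname[1] != "XTOT") problems <- c(problems, "first data row is not XTOT")
  problems
}

# AGI ranges include agilo and exclude agihi
in_agi_range <- function(agi, agilo, agihi) agi >= agilo & agi < agihi

# Made-up example values, for illustration only
example_targets <- data.frame(
  varname = c("XTOT", "e00200", "c00100"),
  count   = c(0, 0, 1),
  scope   = c(0, 1, 1),
  agilo   = c(-9e99, -9e99, 50000),
  agihi   = c(9e99, 9e99, 75000),
  fstatus = c(0, 0, 2),
  target  = c(33e6, 1.2e9, 40000),
  stringsAsFactors = FALSE
)

validate_targets(example_targets)
in_agi_range(c(49999, 50000, 75000), 50000, 75000)
```

Returning a vector of messages rather than stopping at the first failure makes it easier to see every problem in a file at once.
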
14 changes: 10 additions & 4 deletions tmd/areas/targets/prepare/usage.qmd
@@ -11,22 +11,28 @@ editor_options:
- Recent version of [R](https://www.r-project.org/). This project was created with R version 4.4.1.
- Recent release of [RStudio](https://posit.co/products/open-source/rstudio/). Other IDEs may work well, but RStudio has been used in this project (RStudio 2024.09.0 Build 375).
- Recent pre-release version of [quarto](https://quarto.org/docs/download/prerelease.html), 1.6 or higher. This project was created with quarto version 1.6.24
- Be sure that ".../targets/prepare/cds/raw_data/" exists and has the files shown below. The GitHub repo includes these files:
- Be sure that ".../targets/prepare/cds/raw_data/" exists and has the files shown below, which should have been downloaded when you cloned the GitHub repo:

- 21incddocguide.docx
- cd_documentation_extracted_from_21incddocguide.docx.xlsx
- cdpop1year_acs.csv
- congressional2021.zip
- geocorr2022_2428906586.csv
- After checking the above and starting RStudio, in the console run `renv::restore()`. `renv` is, in essence, a package manager for R designed to set up a private environment that is the same across multiple machines. It should ensure that your environment includes appropriate versions of R packages used in this project (generally loaded in ".../targets/prepare/R/libraries.R").

## Setting up the environment

- After checking the above and starting RStudio, in the console run `renv::restore()` and answer y when asked whether to proceed installing packages. This may take a while the first time you set your environment up.

`renv` is an environment manager for R that can set up a project-specific environment that is the same across multiple machines. It should ensure that your environment includes the versions of R packages used in this project. (Most packages are loaded in ".../targets/prepare/R/libraries.R".)

## To create target files and build the web page

- Open a terminal in the "prepare" folder.
- Enter "quarto render"

The first time the project is rendered, it will create needed intermediates files and put them in the "../cds/intermediate" folder.
The first time the project is rendered, it will create needed intermediate files and put them in the "../cds/intermediate" folder.

Note that the \_quarto.yml file sets the `freeze` execution option to `false`, which means .qmd files will be rerendered even if they have not changed (except that quarto will not re-render chunks with the option `eval: false`). For incremental re-rendering of changed files only, set `freeze: auto`. This should be used cautiously to avoid unintended consequences.
Note that the \_quarto.yml file sets the `freeze` execution option to `false`, which means .qmd files will be re-rendered even if they have not changed (except that quarto will not re-render chunks with the option `eval: false`), and intermediate data files will be recreated. For incremental re-rendering of changed .qmd files only, set `freeze: auto`, which will avoid recreating intermediate files. This should be used cautiously to avoid unintended consequences.

At present the code prepares target files with targets we believe are useful and practical. Users who want different targets will have to modify the code. However, as described in the overall repo documentation, users can comment out individual targets.
