Commit

Add links to more useful resources in the Features Used section of the README
jeancochrane committed Jan 13, 2025
1 parent dc1ce6e commit 551e49b
Showing 2 changed files with 26 additions and 5 deletions.
8 changes: 6 additions & 2 deletions README.Rmd
@@ -233,8 +233,6 @@ Model accuracy for each parameter combination is measured on a validation set us

The residential model uses a variety of individual and aggregate features to determine a property's assessed value. We've tested a long list of possible features over time, including [walk score](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_walkscore.html), [crime rate](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/chicago_crimerate.html), [school districts](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_school_boundaries_mean_encoded.html), and many others. The features in the table below are the ones that made the cut. They're the right combination of easy to understand and impute, powerfully predictive, and well-behaved.

For a machine-readable version of this data dictionary, see [`docs/data-dict.csv`](./docs/data-dict.csv).

```{r feature_guide, message=FALSE, results='asis', echo=FALSE}
library(dplyr)
library(readr)
@@ -366,6 +364,12 @@ param_tbl_fmt %>%
knitr::kable(format = "markdown")
```

We maintain a few useful resources for working with these features:

- Once you've [pulled the input data](#getting-data), you can inner join the data to the CSV version of the data dictionary ([`docs/data-dict.csv`](./docs/data-dict.csv)) to filter for only the features that we use in the model.
- You can browse our [data catalog](https://ccao-data.github.io/data-architecture/#!/overview) to see more details about these features, specifically the [residential model input view](https://ccao-data.github.io/data-architecture/#!/model/model.ccao_data_athena.model.vw_card_res_input), which is the source of our training data.
- You can use the [`ccao` R package](https://ccao-data.github.io/ccao/) or its [Python equivalent](https://ccao-data.github.io/ccao/python/) to programmatically convert variable names to their human-readable versions ([`ccao::vars_rename()`](https://ccao-data.github.io/ccao/reference/vars_rename.html)) or convert numerically-encoded variables to human-readable values ([`ccao::vars_recode()`](https://ccao-data.github.io/ccao/reference/vars_recode.html)).

#### Data Sources

We rely on numerous third-party sources to add new features to our data. These features are used in the primary valuation model and thus need to be high-quality and error-free. A non-exhaustive list of features and their respective sources includes:
23 changes: 20 additions & 3 deletions README.md
@@ -369,9 +369,6 @@ and many others. The features in the table below are the ones that made
the cut. They’re the right combination of easy to understand and impute,
powerfully predictive, and well-behaved.

For a machine-readable version of this data dictionary, see
[`docs/data-dict.csv`](./docs/data-dict.csv).

| Feature Name | Variable Name | Description | Category | Possible Values |
|:----------------------------------------------------------------------------|:------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------|:---------------------------------------------------------------------|
| Percent Population Age, Under 19 Years Old | acs5_percent_age_children | Percent of the people 17 years or younger | ACS5 | |
@@ -475,6 +472,26 @@ For a machine-readable version of this data dictionary, see
| Sale Day of Week | time_sale_day_of_week | Numeric encoding of day of week (1 - 7) | Time | |
| Sale After COVID-19 | time_sale_post_covid | Indicator for whether sale occurred after COVID-19 was widely publicized (around March 15, 2020) | Time | |

We maintain a few useful resources for working with these features:

- Once you’ve [pulled the input data](#getting-data), you can inner join
the data to the CSV version of the data dictionary
([`docs/data-dict.csv`](./docs/data-dict.csv)) to filter for only the
features that we use in the model.
- You can browse our [data
catalog](https://ccao-data.github.io/data-architecture/#!/overview) to
see more details about these features, specifically the [residential
model input
view](https://ccao-data.github.io/data-architecture/#!/model/model.ccao_data_athena.model.vw_card_res_input),
which is the source of our training data.
- You can use the [`ccao` R package](https://ccao-data.github.io/ccao/)
or its [Python equivalent](https://ccao-data.github.io/ccao/python/)
to programmatically convert variable names to their human-readable
versions
([`ccao::vars_rename()`](https://ccao-data.github.io/ccao/reference/vars_rename.html))
or convert numerically-encoded variables to human-readable values
([`ccao::vars_recode()`](https://ccao-data.github.io/ccao/reference/vars_recode.html)).
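
As a rough sketch of the first resource above, one way to read the inner
join is as a join between the column names of the input data and the
rows of the data dictionary. The example below uses small in-memory
stand-ins rather than the real files. The dictionary column name
`variable_name` and the extra column `not_a_feature` are assumptions
made for illustration, so check the header of `docs/data-dict.csv`
before adapting it.

```r
library(dplyr)

# Toy stand-in for docs/data-dict.csv. The column name `variable_name`
# is an assumption, not confirmed against the real file.
data_dict <- data.frame(
  variable_name = c("time_sale_day_of_week", "time_sale_post_covid"),
  category = c("Time", "Time")
)

# Toy stand-in for the pulled input data. `not_a_feature` is a
# hypothetical column that does not appear in the dictionary.
input_data <- data.frame(
  time_sale_day_of_week = c(1, 5),
  time_sale_post_covid = c(FALSE, TRUE),
  not_a_feature = c("a", "b")
)

# Inner join the input column names to the dictionary so that only
# names present in both survive
model_features <- data.frame(variable_name = names(input_data)) %>%
  inner_join(data_dict, by = "variable_name") %>%
  pull(variable_name)

# Keep only the columns that the dictionary lists as model features
filtered_data <- input_data %>%
  select(all_of(model_features))
```

With the real files, `data_dict` would instead come from something like
`readr::read_csv("docs/data-dict.csv")`, and `input_data` from whichever
input file you pulled.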

#### Data Sources

We rely on numerous third-party sources to add new features to our data.
