diff --git a/README.Rmd b/README.Rmd
index 0825cd00..6868bde6 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -233,8 +233,6 @@ Model accuracy for each parameter combination is measured on a validation set us
 
 The residential model uses a variety of individual and aggregate features to determine a property's assessed value. We've tested a long list of possible features over time, including [walk score](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_walkscore.html), [crime rate](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/chicago_crimerate.html), [school districts](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_school_boundaries_mean_encoded.html), and many others. The features in the table below are the ones that made the cut. They're the right combination of easy to understand and impute, powerfully predictive, and well-behaved.
 
-For a machine-readable version of this data dictionary, see [`docs/data-dict.csv`](./docs/data-dict.csv).
-
 ```{r feature_guide, message=FALSE, results='asis', echo=FALSE}
 library(dplyr)
 library(readr)
@@ -366,6 +364,12 @@ param_tbl_fmt %>%
   knitr::kable(format = "markdown")
 ```
 
+We maintain a few useful resources for working with these features:
+
+- Once you've [pulled the input data](#getting-data), you can inner join the data to the CSV version of the data dictionary ([`docs/data-dict.csv`](./docs/data-dict.csv)) to filter for only the features that we use in the model
+- You can browse our [data catalog](https://ccao-data.github.io/data-architecture/#!/overview) to see more details about these features, specifically the [residential model input view](https://ccao-data.github.io/data-architecture/#!/model/model.ccao_data_athena.model.vw_card_res_input) which is the source of our training data
+- You can use the [`ccao` R package](https://ccao-data.github.io/ccao/) or its [Python equivalent](https://ccao-data.github.io/ccao/python/) to programmatically convert variable names to their human-readable versions ([`ccao::vars_rename()`](https://ccao-data.github.io/ccao/reference/vars_rename.html)) or convert numerically-encoded variables to human-readable values ([`ccao::vars_recode()`](https://ccao-data.github.io/ccao/reference/vars_recode.html)).
+
 #### Data Sources
 
 We rely on numerous third-party sources to add new features to our data. These features are used in the primary valuation model and thus need to be high-quality and error-free. A non-exhaustive list of features and their respective sources includes:
diff --git a/README.md b/README.md
index 162ddddf..e4837f7b 100644
--- a/README.md
+++ b/README.md
@@ -369,9 +369,6 @@ and many others. The features in the table below are the ones that made
 the cut. They’re the right combination of easy to understand and impute,
 powerfully predictive, and well-behaved.
 
-For a machine-readable version of this data dictionary, see
-[`docs/data-dict.csv`](./docs/data-dict.csv).
-
 | Feature Name | Variable Name | Description | Category | Possible Values |
 |:----------------------------------------------------------------------------|:------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------|:---------------------------------------------------------------------|
 | Percent Population Age, Under 19 Years Old | acs5_percent_age_children | Percent of the people 17 years or younger | ACS5 | |
@@ -475,6 +472,26 @@ For a machine-readable version of this data dictionary, see
 | Sale Day of Week | time_sale_day_of_week | Numeric encoding of day of week (1 - 7) | Time | |
 | Sale After COVID-19 | time_sale_post_covid | Indicator for whether sale occurred after COVID-19 was widely publicized (around March 15, 2020) | Time | |
 
+We maintain a few useful resources for working with these features:
+
+- Once you’ve [pulled the input data](#getting-data), you can inner join
+  the data to the CSV version of the data dictionary
+  ([`docs/data-dict.csv`](./docs/data-dict.csv)) to filter for only the
+  features that we use in the model
+- You can browse our [data
+  catalog](https://ccao-data.github.io/data-architecture/#!/overview) to
+  see more details about these features, specifically the [residential
+  model input
+  view](https://ccao-data.github.io/data-architecture/#!/model/model.ccao_data_athena.model.vw_card_res_input)
+  which is the source of our training data
+- You can use the [`ccao` R package](https://ccao-data.github.io/ccao/)
+  or its [Python equivalent](https://ccao-data.github.io/ccao/python/)
+  to programmatically convert variable names to their human-readable
+  versions
+  ([`ccao::vars_rename()`](https://ccao-data.github.io/ccao/reference/vars_rename.html))
+  or convert numerically-encoded variables to human-readable values
+  ([`ccao::vars_recode()`](https://ccao-data.github.io/ccao/reference/vars_recode.html)).
+
 #### Data Sources
 
 We rely on numerous third-party sources to add new features to our data.
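
The first bullet added in this diff describes filtering the pulled input data down to the model's features by joining against `docs/data-dict.csv`. A minimal sketch of that idea follows, in Python using only the standard library. It rests on several assumptions that this diff does not confirm: that the dictionary lists feature names in a column called `variable_name`, that the input data is wide (one column per feature), and that the "join" can be treated as matching input column names against dictionary entries. Both CSV snippets and the `model_feature_columns` helper are made up for illustration.

```python
import csv
import io

# Hypothetical stand-in for docs/data-dict.csv. The real file's column
# names may differ; "variable_name" is an assumption.
DATA_DICT_CSV = """variable_name,description
acs5_percent_age_children,Percent of the people 17 years or younger
time_sale_day_of_week,Numeric encoding of day of week (1 - 7)
"""

# Hypothetical stand-in for the pulled input data (wide format). The
# "meta_pin" and "some_unused_column" columns are illustrative only.
INPUT_CSV = """meta_pin,acs5_percent_age_children,time_sale_day_of_week,some_unused_column
01011000010000,0.21,3,foo
01011000020000,0.18,5,bar
"""


def model_feature_columns(input_csv, data_dict_csv, key="variable_name"):
    """Keep only the input columns whose names appear in the dictionary."""
    # Collect the set of feature names listed in the data dictionary.
    features = {row[key] for row in csv.DictReader(io.StringIO(data_dict_csv))}

    # Match input column names against that set, preserving input order.
    reader = csv.DictReader(io.StringIO(input_csv))
    kept = [name for name in reader.fieldnames if name in features]

    # Project every input row onto the matched columns.
    rows = [{name: row[name] for name in kept} for row in reader]
    return kept, rows


kept, rows = model_feature_columns(INPUT_CSV, DATA_DICT_CSV)
print(kept)
```

With real files, the two string constants would be replaced by the contents of the pulled input data and of `docs/data-dict.csv`, and `key` adjusted to whatever the dictionary's name column is actually called.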