From c251c88a82d8c6b3da42fb0f40a294137cbacf5c Mon Sep 17 00:00:00 2001 From: Michael Wagner Date: Thu, 20 Feb 2025 17:16:42 +0000 Subject: [PATCH 1/4] Initial commit w multi-card explanation --- README.md | 86 ++++++++++++++++++++++++++++++++----------------------- 1 file changed, 50 insertions(+), 36 deletions(-) diff --git a/README.md b/README.md index dc6810ea..6f0c1d69 100644 --- a/README.md +++ b/README.md @@ -17,6 +17,8 @@ Table of Contents - [`assessment-year-2022`](#assessment-year-2022) - [`assessment-year-2023`](#assessment-year-2023) - [`assessment-year-2024`](#assessment-year-2024) +- [Special Groups](#special-groups) + - [Multi-Card Parcels](#multi-card-parcels) - [Ongoing Issues](#ongoing-issues) - [Data Quality and Integrity](#data-quality-and-integrity) - [Heterogeneity and Extremes](#heterogeneity-and-extremes) @@ -138,23 +140,23 @@ stand-alone script) or as part of the overall pipeline (with > including dependencies, outputs, parameters, and more, see > [dvc.yaml](./dvc.yaml) -0. **Ingest**: Pull prepared data from the CCAO’s Athena database. This +0. **Ingest**: Pull prepared data from the CCAO’s Athena database. This data is divided into [2 primary datasets](#data-used), one for training and one for assessment. NOTE: This stage is only run as-needed, since the input data does not change for each model run. -1. **Train**: Train the model using sales data. This involves splitting +1. **Train**: Train the model using sales data. This involves splitting the input data into train/test sets and performing cross-validation to determine the optimal set of hyperparameters. The primary output of this stage is a trained model object. -2. **Assess**: Use the trained model to estimate values for all +2. **Assess**: Use the trained model to estimate values for all residential properties. Values are [adjusted if necessary](#post-modeling) and then aggregated to the PIN level. The primary output of this stage is a data frame of PIN-level assessed values. -3. **Evaluate**: Measure the performance of the model using the +3. **Evaluate**: Measure the performance of the model using the held-out test set and an assessor-specific ratio study method. Performance statistics include standard machine learning metrics (RMSE, MAE, MAPE) as well as assessor-specific metrics (COD, PRD, @@ -163,7 +165,7 @@ stand-alone script) or as part of the overall pipeline (with output of this stage is a data frame of aggregate performance statistics. -4. **Interpret**: Calculate three major explanatory outputs: +4. **Interpret**: Calculate three major explanatory outputs: - SHAP values for all the estimated values from the assess stage. These are the *per feature* contribution to the predicted value @@ -174,7 +176,7 @@ stand-alone script) or as part of the overall pipeline (with the method described [in this vignette](https://ccao-data.github.io/lightsnip/articles/finding-comps.html) -5. **Finalize**: Save run timings and metadata. Render the following +5. **Finalize**: Save run timings and metadata. Render the following Quarto documents: - An overall model report detailing model performance, effects, and @@ -182,13 +184,13 @@ stand-alone script) or as part of the overall pipeline (with - For PINs of interest, individual PIN-level reports detailing the characteristics, SHAP values, and results for a given PIN -6. **Upload**: Upload all output objects to AWS (S3). All model outputs +6. **Upload**: Upload all output objects to AWS (S3). All model outputs for every model run are stored in perpetuity in S3. 
Each run’s performance can be visualized using the CCAO’s internal
    Tableau dashboards. NOTE: This stage is only run internally, since
    it requires access to the CCAO Data AWS account.

-7. **Export**: Export assessed values to Desk Review spreadsheets for
    Valuations, as well as a delimited text format for upload to the
    system of record (iasWorld). NOTE: This stage is only run when a
    final model is selected. It is not run automatically or as part of
@@ -566,10 +568,10 @@ different model](https://github.com/ccao-data/model-condo-avm).

Models need data in order to be trained and measured for accuracy.
Modern predictive modeling typically uses three data sets:

-1. A training set, used to train the parameters of the model itself.
-2. A validation set, used to choose a hyperparameter combination that
    optimizes model accuracy.
-3. A test set, used to measure the performance of the trained, tuned
    model on unseen data.

`training_data` is used to create these data sets. It is subdivided
@@ -678,12 +680,12 @@ regression with the following specification:

There are a few caveats with this approach and with balance testing in
general:

-1. There could be statistically significant omitted variables that
    differentiate sold from unsold. Things like `recently_painted` or
    `full_kitchen_renovation` are good examples. We don’t collect these
    data points, so it could be the case that sold properties are more
    “sale-ready” in these unknown terms.
-2. There could be significant variation by geography in the
    representativeness of the sales. In other words, certain areas could
    have non-representative sales whose predictive effect on
    `sold_in_last_2_years` is washed out due to mis- or under-specified
    geography.
@@ -697,7 +699,7 @@ from the `assess` stage. These adjustments are internally called
“post-modeling,” and are responsible for correcting minor deficiencies
in the initial predictions. Specifically, post-modeling will:

-1. Aggregate values for multi-card properties to the PIN level, then
    disaggregate them back to the card level (see the sketch after this
    list). A check is used to ensure that the PIN-level assessed value
    is not significantly greater than the prior year’s value. This is
    needed because back buildings (ADUs, sheds, etc.) often add more
    assessed value to the PIN than they are actually worth (since they
    are not differentiated as ADUs by the model).

-2. Ensure that nearly identical properties are identically valued. For
    some property classes, such as 210 and 295s, we manually adjust
    values such that all identical properties in the same complex
    receive the same predicted value. This is accomplished by replacing
    individual predicted values with the average predicted value for the
    complex.

-3. Round PIN-level values (typically to the nearest \$1,000). This is
    done to indicate that model values are *estimates*, not precise
    values.
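The following is a minimal sketch (in R, not code from this repository) of the
aggregation and rounding steps above. The column names `meta_pin`,
`meta_card_num`, and `pred_card_fmv` are assumed for illustration only.

```r
# Illustrative sketch of post-modeling steps 1 and 3: sum hypothetical
# card-level predictions up to the PIN level, then round to the nearest $1,000.
library(dplyr)

card_preds <- data.frame(
  meta_pin      = c("10-01-100-001", "10-01-100-001", "10-01-100-002"),
  meta_card_num = c(1, 2, 1),
  pred_card_fmv = c(210480, 64230, 187910)
)

pin_preds <- card_preds %>%
  group_by(meta_pin) %>%
  summarize(pred_pin_fmv = sum(pred_card_fmv), .groups = "drop") %>%
  # Rounding signals that these values are estimates, not precise figures
  mutate(pred_pin_fmv_rounded = round(pred_pin_fmv / 1000) * 1000)

pin_preds
```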
@@ -773,10 +775,10 @@ the following major changes to the residential modeling codebase:
    was moved to [pipeline/07-export.R](pipeline/07-export.R).
  - Added [DVC](https://dvc.org/) support/integration. This repository
    uses DVC in 2 ways:
-  1. All input data in [`input/`](input/) is versioned, tracked, and
       stored using DVC. Previous input data sets are stored in
       perpetuity on S3.
-  2. [DVC
       pipelines](https://dvc.org/doc/user-guide/project-structure/pipelines-files)
       are used to sequentially run R pipeline scripts and track/cache
       inputs and outputs.
@@ -854,6 +856,18 @@ the following major changes to the residential modeling codebase:
    using [renv profiles](#profiles-and-lockfiles) to increase
    replicability.

+# Special Groups
+
+### Multi-Card Parcels
+
+Multi-card parcels are properties (PINs) that include more than one building,
+each represented by its own "card". Because our model is trained at the building
+level, these parcels can be challenging to value accurately.
+
+We consolidate the building square footage from all cards into a single “main” card.
+We then predict once for that combined building, ensuring location data is only
+factored in once while still capturing the total building area.
+
 # Ongoing Issues

The CCAO faces a number of ongoing issues which make modeling difficult.
@@ -900,10 +914,10 @@ property condition.
 The property characteristics we track can sometimes be incorrect or
 outdated. The two major sources of characteristic errors are:

-1. Data entry or processing errors. Records collected by our office
    often need to be digitized, and mistakes happen. Fortunately, these
    types of errors are relatively rare.
-2. Characteristic update errors. There are a variety of systems that
    update the characteristics of properties in our system. Some of them
    can be slow to detect changes or are otherwise unreliable.
@@ -1130,19 +1144,19 @@ of the model.

### Installation

-1. Clone this repository using git, or simply download it using the
    button at the top of the page.
-2. Set your working directory to the local folder containing this
    repository’s files, either using R’s `setwd()` command or
    (preferably) using RStudio’s
    [projects](https://support.posit.co/hc/en-us/articles/200526207-Using-Projects).
-3. Install `renv`, R’s package manager, by running
    `install.packages("renv")`.
-4. Install all R package dependencies using `renv` by running
    `renv::restore()`. This step may take a while. Linux users will
    likely need to install dependencies (via apt, yum, etc.) to build
    from source.
-5. (Optional) The `finalize` step of the model pipeline requires some
    additional dependencies for generating a model performance report.
    Install these additional dependencies by running
    `renv::restore(lockfile = "renv/profiles/reporting/renv.lock")`.
@@ -1426,14 +1440,14 @@ installed automatically when you run `renv::restore()`.

We use multiple renv lockfiles to manage R dependencies:

-1. 
**`renv.lock`** is the canonical list of dependencies that are used
    by the **core model pipeline**. Any dependencies that are required
    to run the model itself should be defined in this lockfile.
-2. **`renv/profiles/reporting/renv.lock`** is the canonical list of
    dependencies that are used to **generate model reports** in the
    `finalize` step of the pipeline. Any dependencies that are required
    to generate reports should be defined in this lockfile.
-3. **`renv/profiles/dev/renv.lock`** is the canonical list of
    dependencies that are used **for local development**, running the
    `ingest`, `export`, and `api` steps of the pipeline, and building
    the README. These dependencies are required only by CCAO staff and
@@ -1467,10 +1481,10 @@ of it (run `renv::restore(lockfile = "renv/profiles/dev/renv.lock")`).

The process for **updating core model pipeline dependencies** is
straightforward:

-1. Add the dependency to the list of explicit dependencies under the
    `Depends:` key of the `DESCRIPTION` file
-2. Run `renv::install("<dependency_name>")`
-3. Run `renv::snapshot()` to update the core lockfile (the root
    `renv.lock`)

The process for updating **dependencies for other lockfiles** is more
@@ -1479,15 +1493,15 @@ renv commands.
Determine the name of the profile you’d like to update
(`<profile_name>` in the code that follows) and run the following
commands:

-1. Run `renv::activate(profile = "<profile_name>")` to set the renv
    profile to `<profile_name>`
-2. Make sure that the dependency is defined in the `DESCRIPTION` file
    under the `Config/renv/profiles/<profile_name>/dependencies` key
-3. Run `renv::install("<dependency_name>")` to add or update the
    dependency as necessary
-4. Run `renv::snapshot()` to update the profile’s lockfile with the
    dependencies defined in the `DESCRIPTION` file
-5. Run `renv::activate(profile = "default")` if you would like to
    switch back to the default renv profile

## Troubleshooting

From ff617d410f773d728f953422b84a2f10a673e340 Mon Sep 17 00:00:00 2001
From: Michael Wagner
Date: Thu, 20 Feb 2025 17:58:36 +0000
Subject: [PATCH 2/4] Add proration

---
 README.md | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 6f0c1d69..24ef579c 100644
--- a/README.md
+++ b/README.md
@@ -17,8 +17,9 @@ Table of Contents
 - [`assessment-year-2022`](#assessment-year-2022)
 - [`assessment-year-2023`](#assessment-year-2023)
 - [`assessment-year-2024`](#assessment-year-2024)
-- [Special Groups](#special-groups)
+- [Complex-Parcel-Configurations](#complex-parcel-configurations)
   - [Multi-Card Parcels](#multi-card-parcels)
+  - [Proration](#proration)
 - [Ongoing Issues](#ongoing-issues)
 - [Data Quality and Integrity](#data-quality-and-integrity)
 - [Heterogeneity and Extremes](#heterogeneity-and-extremes)
@@ -856,7 +857,7 @@ the following major changes to the residential modeling codebase:
    using [renv profiles](#profiles-and-lockfiles) to increase
    replicability.
-# Special Groups
+# Complex Parcel Configurations
 
 ### Multi-Card Parcels
 
@@ -868,6 +869,21 @@ We consolidate the building square footage from all cards into a single “main
 We then predict once for that combined building, ensuring location data is only
 factored in once while still capturing the total building area.
 
+### Proration
+
+Some buildings span multiple parcel boundaries. When this happens, each parcel (PIN) contains only part of the building. To value those parcels, we must “prorate” the building value so that each PIN gets its fair share.
+
+1. **Calculate Building-Only Value:**
+   - To isolate the building’s contribution to total value, we subtract the land value from the model’s predicted total value for each PIN. For parcels in a prorated building, we average these “building-only” values across all of its PINs. This average becomes the building-wide figure we’ll distribute among the PINs.
+
+2. **Prorate the Building:**
+   - Multiply the building-wide figure by each PIN’s proration rate (e.g., if a PIN has 30% of the building, it gets 30% of that amount).
+   - Round fractional dollars so the total matches exactly.
+
+3. **Add Back Land:**
+   - Each PIN has its own land value, which is not shared.
+   - Final PIN value = prorated building value + that PIN’s land value.
+
 # Ongoing Issues
 
 The CCAO faces a number of ongoing issues which make modeling difficult.

From 122800e333ec72ffdd391b3455799856869389de Mon Sep 17 00:00:00 2001
From: Michael Wagner
Date: Thu, 20 Feb 2025 18:00:00 +0000
Subject: [PATCH 3/4] Adjust header

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 24ef579c..bdf1a895 100644
--- a/README.md
+++ b/README.md
@@ -17,7 +17,7 @@ Table of Contents
 - [`assessment-year-2022`](#assessment-year-2022)
 - [`assessment-year-2023`](#assessment-year-2023)
 - [`assessment-year-2024`](#assessment-year-2024)
-- [Complex-Parcel-Configurations](#complex-parcel-configurations)
+- [Complex-Parcels](#complex-parcels)
   - [Multi-Card Parcels](#multi-card-parcels)
   - [Proration](#proration)
 - [Ongoing Issues](#ongoing-issues)
@@ -857,7 +857,7 @@ the following major changes to the residential modeling codebase:
    using [renv profiles](#profiles-and-lockfiles) to increase
    replicability.
 
-# Complex Parcel Configurations
+# Complex Parcels
 
 ### Multi-Card Parcels
 

From cde37e1f88dbdb603a485a74d233719f8bd3d95b Mon Sep 17 00:00:00 2001
From: Michael Wagner
Date: Thu, 20 Feb 2025 18:08:10 +0000
Subject: [PATCH 4/4] Remove detail

---
 README.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/README.md b/README.md
index bdf1a895..97eae7bd 100644
--- a/README.md
+++ b/README.md
@@ -878,7 +878,6 @@ Some buildings span multiple parcel boundaries. When this happens, each parcel (
 
 2. **Prorate the Building:**
    - Multiply the building-wide figure by each PIN’s proration rate (e.g., if a PIN has 30% of the building, it gets 30% of that amount).
-   - Round fractional dollars so the total matches exactly.
 
 3. **Add Back Land:**
    - Each PIN has its own land value, which is not shared.
    - Final PIN value = prorated building value + that PIN’s land value.
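To make the proration arithmetic described in the Proration section above
concrete, here is a minimal sketch in R. It is illustrative only, not code from
this repository; the column names (`meta_pin`, `pred_pin_fmv`, `land_value`,
`proration_rate`) and the 70%/30% split are assumed for the example.

```r
# Illustrative sketch of the proration steps: one hypothetical building
# spanning two PINs, with an assumed 70%/30% proration split.
library(dplyr)

prorated_pins <- data.frame(
  meta_pin       = c("20-02-200-001", "20-02-200-002"),
  pred_pin_fmv   = c(500000, 430000), # model's predicted total value per PIN
  land_value     = c(120000, 90000),  # land value is PIN-specific, never shared
  proration_rate = c(0.70, 0.30)
)

prorated_pins %>%
  # Step 1: building-only value = predicted total minus land, averaged across PINs
  mutate(
    building_only = pred_pin_fmv - land_value,
    building_wide = mean(building_only)
  ) %>%
  # Steps 2 and 3: distribute the building-wide figure by proration rate,
  # then add each PIN's own land value back
  mutate(
    prorated_building = round(building_wide * proration_rate),
    final_pin_fmv     = prorated_building + land_value
  )
```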