HHS/acf-ohsepr-dart-analytics

DIDDMS_analytics: Task 3

DRAFT - Full Report

DART Task 3: Entity Reconciliation Using USA Spending Grants Data

Objective

Task 3 ("Unique Grant Recipient Approach Proof-of-Concept") is a test project designed to improve how we identify and assist the organizations and individuals who receive grants. By trying out new tools and methods, we aim to make recipient identification faster, more accurate, and better tailored to specific needs, so that funding can be matched to recipients efficiently. The Raft Team (the team) aims to provide a Proof-of-Concept (PoC) approach for uniquely identifying Administration for Children and Families (ACF) grantees.

For the proof of concept, the team worked with grant data across three ACF Program Offices:

  • Office of Head Start (OHS)
  • Office of Family Assistance (OFA)
  • Office of Family Violence Prevention and Services (OFVPS)

Summary of DART Task 3 Sub-Objectives

Sub-Objective | Description
SO1           | Establish primary dataset for ETL and updating of USA Spending Grants Data.
SO2           | Identify entity evolution over time by mapping old and new versions of entities.
SO3           | Enhance data quality by improving completeness and accuracy of records.
SO4           | Provide PoC for entity reconciliation using USA Spending grants data.
SO5           | Offer recommendations for resolving unmatched entities.

Summary of Challenges Encountered and Corresponding Solutions

Challenge 1: Ambiguous entity evolution, unlabeled entity sets, and lack of bridge data and events

Problem: Many entities change over time (e.g., name refinements, Unique Entity Identifier [UEI] updates) without clear documentation, which makes them difficult to track.

Solution: Use auxiliary data sources (e.g., facility data and Bureau of Indian Affairs [BIA] name change notices) to create bridge event data that maps old versions of an entity to new ones.

Pros:

  • Leverages available data sources.
  • Improves historical tracking.
  • Builds scalable solutions for future program offices.

Cons:

  • Limited data coverage across offices.
  • Scalability challenges requiring continuous updates.
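
One way to picture bridge event data is as a list of (old identifier, new identifier) pairs that can be followed to an entity's most recent identifier. The sketch below is illustrative only: it uses Python rather than the repository's R, and the function name `resolve_current_ids` and the sample identifiers are hypothetical, not taken from the project's code.

```python
def resolve_current_ids(bridge_events):
    """Map every identifier seen in bridge_events to its most recent successor.

    bridge_events is an iterable of (old_id, new_id) pairs (a hypothetical layout).
    """
    successor = dict(bridge_events)  # old_id -> new_id

    def latest(entity_id):
        seen = set()
        # Follow old -> new links until no successor exists.
        # The `seen` set stops the loop if the events contain a cycle.
        while entity_id in successor and entity_id not in seen:
            seen.add(entity_id)
            entity_id = successor[entity_id]
        return entity_id

    all_ids = set(successor) | set(successor.values())
    return {entity_id: latest(entity_id) for entity_id in all_ids}


# Made-up identifiers: UEI_A became UEI_B, which later became UEI_C.
events = [("UEI_A", "UEI_B"), ("UEI_B", "UEI_C"), ("UEI_X", "UEI_Y")]
current = resolve_current_ids(events)
```

Grouping grant records by the resolved identifier would then place an entity's older and newer records under one key, provided the bridge events themselves are reliable.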

Challenge 2: Data Quality Issues

Problem: USA Spending Grants Data has inconsistent and incomplete data documentation.

Solution: Deploy data validation scripts to identify and correct quality issues.

Pros:

  • Enhances data reliability and systematic issue identification.

Cons:

  • Does not address root causes of data inconsistencies.
  • Requires ongoing maintenance.
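
As a rough illustration of what a validation script might check, the sketch below flags records with missing required fields and identifiers that appear under more than one recipient name. It uses Python rather than the repository's R, and the column names (`recipient_uei`, `recipient_name`, `award_id`) and the function `find_quality_issues` are assumptions made for the example.

```python
from collections import defaultdict

# Assumed column names; adjust to the actual dataset schema.
REQUIRED_FIELDS = ("recipient_uei", "recipient_name", "award_id")


def find_quality_issues(records):
    """Return (location, message) tuples describing basic quality problems."""
    issues = []
    names_by_uei = defaultdict(set)
    for index, record in enumerate(records):
        # Completeness check: every required field must be non-empty.
        for field in REQUIRED_FIELDS:
            if not (record.get(field) or "").strip():
                issues.append((index, f"missing {field}"))
        uei = (record.get("recipient_uei") or "").strip()
        name = (record.get("recipient_name") or "").strip()
        if uei and name:
            names_by_uei[uei].add(name.upper())
    # Consistency check: one identifier should not carry several names.
    for uei, names in sorted(names_by_uei.items()):
        if len(names) > 1:
            issues.append((uei, "multiple recipient names: " + ", ".join(sorted(names))))
    return issues


# Made-up records for illustration.
records = [
    {"recipient_uei": "UEI_1", "recipient_name": "Example Org", "award_id": "A-1"},
    {"recipient_uei": "UEI_1", "recipient_name": "", "award_id": "A-2"},
    {"recipient_uei": "UEI_1", "recipient_name": "Example Organization", "award_id": "A-3"},
]
issues = find_quality_issues(records)
```

A check like this only reports problems; deciding which name is correct still needs a rule or a human reviewer.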

Challenge 3: No Validation Dataset

Problem: The lack of a validation dataset meant that entity evolution over time had to be validated manually.

Solution: Develop a validation dataset to evaluate reconciliation methods.

Pros:

  • Improves accuracy and enables scalable reconciliation.
  • Provides iterative improvement and enhances confidence in data reliability.

Cons:

  • Requires significant resources for development.
  • Partial solutions still require manual effort.
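
Once a validation dataset exists, one common way to score a reconciliation method is pairwise precision and recall: of the record pairs the method groups together, how many does the validation set also group together, and vice versa. The sketch below assumes both the method's output and the validation set are dictionaries from record ID to entity label. It is a Python illustration with hypothetical names and made-up data, not the project's evaluation code.

```python
from itertools import combinations


def same_entity_pairs(assignment):
    """Return the set of record pairs that share an entity label."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(assignment), 2)
        if assignment[a] == assignment[b]
    }


def pairwise_precision_recall(predicted, validated):
    """Compare predicted groupings to validated ones over shared records."""
    shared = set(predicted) & set(validated)
    predicted_pairs = same_entity_pairs({r: predicted[r] for r in shared})
    true_pairs = same_entity_pairs({r: validated[r] for r in shared})
    correct = len(predicted_pairs & true_pairs)
    precision = correct / len(predicted_pairs) if predicted_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Made-up example: records r1-r3 are truly one entity, r4 is another.
validated = {"r1": "E1", "r2": "E1", "r3": "E1", "r4": "E2"}
predicted = {"r1": "A", "r2": "A", "r3": "B", "r4": "B"}
precision, recall = pairwise_precision_recall(predicted, validated)
```

In this example the method makes two groupings, of which one is correct (precision 0.5), and recovers one of the three true pairs (recall of about 0.33).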

Recommendations

By-Office Workshops to Build Out Custom Validation Datasets

  • Facilitate knowledge-sharing workshops with ACF stakeholders.
  • Capture institutional knowledge and validate entity reconciliation results.

Collecting By-Office Facility Data

  • Extend facility data collection to other offices beyond OHS.
  • Address confidentiality concerns in data collection.

USA Spending Data Discussion

  • USAspending.gov provides critical data for entity reconciliation.
  • Data cleaning, normalization, and validation improve grant recipient identification.
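
To make the cleaning and normalization step concrete, the sketch below shows one simple way recipient names could be normalized before comparison. It is a Python illustration, not the repository's R code; the function name and the list of legal suffixes are assumptions chosen for the example.

```python
import re

# Illustrative list only; a real list would need review by the program offices.
LEGAL_SUFFIXES = {"INC", "INCORPORATED", "LLC", "CORP", "CORPORATION", "CO"}


def normalize_recipient_name(name):
    """Uppercase, strip punctuation, collapse spaces, and drop legal suffixes."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", name.upper())
    tokens = [token for token in cleaned.split() if token not in LEGAL_SUFFIXES]
    return " ".join(tokens)


# Made-up names: both variants normalize to the same string.
first = normalize_recipient_name("Acme Child Care, Inc.")
second = normalize_recipient_name("ACME  CHILD CARE INC")
```

Normalization of this kind reduces spurious mismatches, but it can also merge names that are genuinely different, so normalized matches are best treated as candidates for validation.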

How to Run Task 3

  1. Source the 00-source.R file.
  2. Run ETL scripts for USA Spending, ACF Program Mapping, and Head Start Facilities data.
  3. Apply entity reconciliation functions.

Data Flow

Data Sources

  • USA Spending Grants Data
  • ACF Program Mapping Data
  • Head Start Facility Locations Data

Key Data Preprocessing Steps

Data Source                        | Input Format | Output Format
USA Spending Grants Data           | Zip URL      | Feather
ACF Program Mapping Data           | XLSX         | Feather
Head Start Facility Locations Data | CSV URL      | Feather
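
As an illustration of the first half of the zip-based preprocessing path, the sketch below reads the CSV members of a zip archive into rows using only the Python standard library. It assumes the archive contains CSV files, which the table does not state, and it is not the repository's R implementation. The download itself and the Feather output are left out, since writing Feather requires a third-party library.

```python
import csv
import io
import zipfile


def read_csv_rows_from_zip(zip_bytes):
    """Yield one dict per row from every .csv member of a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(".csv"):
                continue
            with archive.open(member) as raw:
                # utf-8-sig drops a byte-order mark if the file has one.
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                yield from csv.DictReader(text)


# Build a tiny archive in memory to stand in for a downloaded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("grants.csv", "recipient_name,amount\nExample Org,100\n")

rows = list(read_csv_rows_from_zip(buffer.getvalue()))
```

All values come back as strings, so numeric and date columns would still need explicit type conversion before being written to a columnar format.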

Additional Tasks and Data Collection Efforts to Close the Gap

Additional Facilities Data Collection by Office

  • Expand facility data collection beyond OHS.
  • Evaluate feasibility based on confidentiality and data regulations.

ACF Stakeholders Workshops

  • Conduct workshops to validate entity reconciliation methods.
  • Create an office-approved validation dataset.

Deploying Clustering Algorithms, After By-Office Human Verification

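  • Utilize clustering methods to group records that likely belong to the same entity:
    • DBSCAN for density-based clustering.
    • K-means for partitioning records into a fixed number of clusters.
    • Agglomerative Clustering for nested (hierarchical) groupings.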
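
To show the general idea behind the agglomerative option, the sketch below clusters recipient names by repeatedly merging the two most similar clusters until no pair is similar enough. It is a minimal Python illustration using the standard library's `difflib` for string similarity. The single-linkage rule, the 0.85 threshold, and the sample names are assumptions for the example, not choices documented by the project.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


def agglomerative_clusters(names, threshold=0.85):
    """Single-linkage agglomerative clustering with a similarity cutoff."""
    clusters = [[name] for name in names]  # start with one cluster per name
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: clusters are as similar as their closest members.
                score = max(similarity(a, b) for a in clusters[i] for b in clusters[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break  # no remaining pair is similar enough to merge
        i, j = best_pair
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


# Made-up names for illustration.
names = [
    "ACME HEAD START",
    "ACME HEADSTART",
    "RIVERSIDE FAMILY SERVICES",
    "RIVERSIDE FAMILY SERVICE",
]
clusters = agglomerative_clusters(names)
```

The threshold controls how aggressively names are merged, which is one reason human verification by each office is useful before any clusters are accepted.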

Appendix

Summary of Script Operations by Data Source

Data Source                        | Type            | Function                   | Description & Notes
USA Spending Grants Data           | ETL             | get_grants_data()          | Pulls most recent USA Spending data.
USA Spending Grants Data           | Data Validation | assert_grant_identifiers() | Summarizes grant consistency over time.
ACF Program Mapping Data           | ETL             | rename_acf_program_map()   | Standardizes column names.
ACF Program Mapping Data           | Data Validation | acf_program_info()         | Summarizes program distributions.
Head Start Facility Locations Data | ETL             | expand_abbreviations()     | Standardizes address formats.
Head Start Facility Locations Data | Data Validation | assert_head_start_diff()   | Summarizes expired grants and discrepancies.
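
To illustrate what standardizing address formats can involve, the sketch below expands a few common street abbreviations. It is a hypothetical Python analogue and does not reproduce the repository's `expand_abbreviations()` function, whose actual rules are not described here; the abbreviation table is an assumption for the example.

```python
import re

# Illustrative mapping only. Note that "ST" can also mean "SAINT",
# so a real rule set would need to consider token position.
ABBREVIATIONS = {
    "ST": "STREET",
    "AVE": "AVENUE",
    "RD": "ROAD",
    "BLVD": "BOULEVARD",
    "STE": "SUITE",
}


def expand_address_abbreviations(address):
    """Uppercase an address, drop periods and commas, and expand abbreviations."""
    tokens = re.sub(r"[.,]", " ", address.upper()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


# Made-up address for illustration.
standardized = expand_address_abbreviations("123 Main St., Ste. 4")
```

Standardized addresses make it easier to compare facility locations across records that were entered with different conventions.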

This document provides a structured and scalable approach to entity reconciliation using USA Spending grants data, facilitating improved grant tracking and administration for ACF Program Offices.