SoT: Row and PDF Hashing #4746

jadudm opened this issue Mar 5, 2025 · 0 comments

Problem

Currently, neither the internal tables nor the dissemination tables carry any indicator of their composition. For example, consider the audit for the City of Sandwich:

```json
{
    "agencies_with_prior_findings": "00",
    "audit_period_covered": "annual",
    "audit_type": "single-audit",
    "audit_year": "2024",
    ...
    "fac_accepted_date": "2025-03-05",
    ...
    "total_amount_expended": 4230531,
    "type_audit_code": "UG"
}
```

There are 60 fields in that record. Once published, it never changes. However, neither we (the FAC) nor users of the FAC have any way of knowing that nothing has changed.

A traditional solution

A traditional solution would be to hash the data in a deterministic, repeatable way and publish that hash. (For example: sort the keys alphabetically, then hash the concatenation of all keys and values.) We would then add this hash to our data:

```json
{
    "agencies_with_prior_findings": "00",
    "audit_period_covered": "annual",
    "audit_type": "single-audit",
    "audit_year": "2024",
    ...
    "fac_accepted_date": "2025-03-05",
    ...
    "total_amount_expended": 4230531,
    "type_audit_code": "UG",
    "data_hash": "2bc0d4ae220eaa8e4d468eedee790f6eb9d9d440"
}
```
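The approach above can be sketched as follows. This is a minimal illustration, assuming a Python implementation, SHA-1 as the digest, and canonicalization via sorted-key JSON serialization; none of these choices are mandated by this issue:

```python
import hashlib
import json

def hash_record(record: dict) -> str:
    """Deterministically hash a record: serialize it with sorted keys
    and fixed separators, then hash the canonical string."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

# The same logical record hashes identically regardless of key order.
row = {"audit_year": "2024", "type_audit_code": "UG", "total_amount_expended": 4230531}
same_row = {"total_amount_expended": 4230531, "audit_year": "2024", "type_audit_code": "UG"}
assert hash_record(row) == hash_record(same_row)
```

Whatever canonicalization is chosen (key ordering, separators, string encoding, number formatting), it must be fixed and documented, since any variation changes the resulting hash.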

How did we discover this problem?

We realized that we have no way of knowing if our data at rest remains constant over time. Users of the FAC, similarly, have no idea if data they fetch on one day is the same as data fetched on the next. This is a long-standing problem for users of the FAC (agencies, oversight): their lived experience was that data would change, but they would not know how or why.

As we look at migrating data from our existing internal tables to new designs, we need a way of knowing that the data has not changed from one representation to the next. One way to achieve this is to hash the data for both representations in equivalent ways, and compare these values as part of the migration process.

As we engage in curation work, hashing would help us identify records that should have been updated (and were), and records that should not have changed (and were not). Or, stated as a problem: when we curate data, we do not (currently) have a way of asserting conclusively that records that should not have changed absolutely have not changed.

Similarly, PDFs are not currently hashed, meaning we have no way of asserting that audit reports have not changed over time.
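For PDFs, a hash over the raw file bytes would serve the same purpose. A minimal sketch, assuming Python and SHA-256 (neither is specified in this issue):

```python
import hashlib

def hash_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file's bytes in chunks so that large audit reports
    are not read into memory all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Publishing this digest alongside each report would let any user re-download a PDF and confirm it is byte-for-byte identical to what was originally disseminated.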

Job Story(s)

When I [situation], I want to [motivation] so I can [outcome/benefit].

What are we planning to do about it?

We believe hashing of the data is appropriate. If there is another appropriate solution/path that solves the above problems, we should consider that.

What are we not planning to do about it?

We need confidence in our migration and data over time. This should happen as part of the "source of truth" work, as it sets us up for resubmission and curation work.

How will we measure success?

When we have a way of repeatably hashing all data in our current model and in the new model, and can confirm that migrated data hashes identically in both (perhaps verified through the API?), we will know we succeeded.
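One way that confirmation could look, as a sketch: records from both models are keyed by a hypothetical `report_id` and reduced to the same canonical shape, and any key whose hashes disagree is flagged. Both the key name and the canonicalization are assumptions for illustration:

```python
import hashlib
import json

def canonical_hash(record: dict) -> str:
    """Hash the canonical (sorted-key) JSON serialization of a record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

def verify_migration(old_records: dict, new_records: dict) -> list:
    """Return the report_ids whose hashes differ between the two models.
    Both arguments map a report_id to that report's canonical record;
    a record missing from the new model is also flagged."""
    return [
        report_id
        for report_id, old in old_records.items()
        if canonical_hash(old) != canonical_hash(new_records.get(report_id, {}))
    ]
```

An empty result from `verify_migration` would be the success signal: every record that should not have changed provably has not.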

Security Considerations

Required per CM-4.

This does not introduce security concerns.

It allays several concerns. While our data is encrypted at rest, it is not hashed, so we cannot currently tell when or whether data has changed. This work addresses that concern.


Process checklist
  • Has a clear story statement
  • Can reasonably be done in a few days (otherwise, split this up!)
  • Shepherds have been identified
  • UX youexes all the things
  • Design designs all the things
  • Engineering engineers all the things
  • Meets acceptance criteria
  • Meets QASP conditions
  • Presented in a review
  • Includes screenshots or references to artifacts
  • Tagged with the sprint where it was finished
  • Archived

If there's UI...

  • Screen reader - Listen to the experience with a screen reader extension; ensure the information is presented in order
  • Keyboard navigation - Run through acceptance criteria with keyboard tabs, ensure it works.
  • Text scaling - Adjust viewport to 1280 pixels wide and zoom to 200%, ensure everything renders as expected. Document 400% zoom issues with USWDS if appropriate.
@github-project-automation github-project-automation bot moved this to Triage in FAC Mar 5, 2025
@jadudm jadudm mentioned this issue Mar 6, 2025
@jadudm jadudm added the data label Mar 12, 2025
@jadudm jadudm moved this from Triage to Backlog in FAC Mar 12, 2025