SoT: Row and PDF Hashing #4746

jadudm opened this issue Mar 5, 2025 · 0 comments

Problem

Currently, neither the internal tables nor the dissemination tables carry any indicator of their composition. For example, consider the audit for the City of Sandwich:

```json
{
    "agencies_with_prior_findings": "00",
    "audit_period_covered": "annual",
    "audit_type": "single-audit",
    "audit_year": "2024",
    ...
    "fac_accepted_date": "2025-03-05",
    ...
    "total_amount_expended": 4230531,
    "type_audit_code": "UG"
}
```

There are 60 fields in that record. Once published, it never changes. However, neither we (the FAC) nor users of the FAC have any way of knowing that nothing has changed.

A traditional solution

A traditional solution would be to hash the data in a deterministic, repeatable way and publish that hash. (For example: sort the keys alphabetically, then hash the concatenation of all keys and values.) We would then add this hash to our data:

```json
{
    "agencies_with_prior_findings": "00",
    "audit_period_covered": "annual",
    "audit_type": "single-audit",
    "audit_year": "2024",
    ...
    "fac_accepted_date": "2025-03-05",
    ...
    "total_amount_expended": 4230531,
    "type_audit_code": "UG",
    "data_hash": "2bc0d4ae220eaa8e4d468eedee790f6eb9d9d440"
}
```
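The approach above can be sketched as follows. This is a minimal illustration, assuming a Python implementation, SHA-1 as the digest, and canonicalization via sorted-key JSON serialization; none of these choices are mandated by this issue:

```python
import hashlib
import json

def hash_record(record: dict) -> str:
    """Deterministically hash a record: serialize it with sorted keys
    and fixed separators, then hash the canonical string."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

# The same logical record hashes identically regardless of key order.
row = {"audit_year": "2024", "type_audit_code": "UG", "total_amount_expended": 4230531}
same_row = {"total_amount_expended": 4230531, "audit_year": "2024", "type_audit_code": "UG"}
assert hash_record(row) == hash_record(same_row)
```

Whatever canonicalization is chosen (key ordering, separators, string encoding, number formatting), it must be fixed and documented, since any variation changes the resulting hash.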

How did we discover this problem?

We realized that we have no way of knowing if our data at rest remains constant over time. Users of the FAC, similarly, have no idea if data they fetch on one day is the same as data fetched on the next. This is a long-standing problem for users of the FAC (agencies, oversight): their lived experience was that data would change, but they would not know how or why.

As we look at migrating data from our existing internal tables to new designs, we need a way of knowing that the data has not changed from one representation to the next. One way to achieve this is to hash the data for both representations in equivalent ways, and compare these values as part of the migration process.

As we engage in curation work, hashing would help us identify records that should have been updated (and were), and records that should not have changed (and were not). Or, stated as a problem: when we curate data, we do not (currently) have a way of asserting conclusively that records that should not have changed absolutely have not changed.

Similarly, PDFs are not currently hashed, meaning we have no way of asserting that audit reports have not changed over time.
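For PDFs, a hash over the raw file bytes would serve the same purpose. A minimal sketch, assuming Python and SHA-256 (neither is specified in this issue):

```python
import hashlib

def hash_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file's bytes in chunks so that large audit reports
    are not read into memory all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Publishing this digest alongside each report would let any user re-download a PDF and confirm it is byte-for-byte identical to what was originally disseminated.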

Job Story(s)

When I [situation], I want to [motivation] so I can [outcome/benefit].

What are we planning to do about it?

We believe hashing of the data is appropriate. If there is another appropriate solution/path that solves the above problems, we should consider that.

What are we not planning to do about it?

We need confidence in our migration and data over time. This should happen as part of the "source of truth" work, as it sets us up for resubmission and curation work.

How will we measure success?

When we have a way of repeatably hashing all data in our current model and in the new model, and can confirm that migrated data hashes identically in both (perhaps verified through the API?), we will know we succeeded.
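One way that confirmation could look, as a sketch: records from both models are keyed by a hypothetical `report_id` and reduced to the same canonical shape, and any key whose hashes disagree is flagged. Both the key name and the canonicalization are assumptions for illustration:

```python
import hashlib
import json

def canonical_hash(record: dict) -> str:
    """Hash the canonical (sorted-key) JSON serialization of a record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

def verify_migration(old_records: dict, new_records: dict) -> list:
    """Return the report_ids whose hashes differ between the two models.
    Both arguments map a report_id to that report's canonical record;
    a record missing from the new model is also flagged."""
    return [
        report_id
        for report_id, old in old_records.items()
        if canonical_hash(old) != canonical_hash(new_records.get(report_id, {}))
    ]
```

An empty result from `verify_migration` would be the success signal: every record that should not have changed provably has not.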

Security Considerations

Required per CM-4.

This does not introduce security concerns.

It allays several concerns. While our data is encrypted at rest, it is not hashed, so we cannot currently tell when or whether data has changed. This work addresses that concern.


Process checklist
  • Has a clear story statement
  • Can reasonably be done in a few days (otherwise, split this up!)
  • Shepherds have been identified
  • UX youexes all the things
  • Design designs all the things
  • Engineering engineers all the things
  • Meets acceptance criteria
  • Meets QASP conditions
  • Presented in a review
  • Includes screenshots or references to artifacts
  • Tagged with the sprint where it was finished
  • Archived

If there's UI...

  • Screen reader - Listen to the experience with a screen reader extension; ensure the information is presented in order
  • Keyboard navigation - Run through acceptance criteria with keyboard tabs, ensure it works.
  • Text scaling - Adjust viewport to 1280 pixels wide and zoom to 200%, ensure everything renders as expected. Document 400% zoom issues with USWDS if appropriate.
@github-project-automation github-project-automation bot moved this to Triage in FAC Mar 5, 2025
@jadudm jadudm mentioned this issue Mar 6, 2025
@jadudm jadudm added the data label Mar 12, 2025
@jadudm jadudm moved this from Triage to Backlog in FAC Mar 12, 2025