Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADR - content hashing caching strategy for business rules. #650

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
53 changes: 53 additions & 0 deletions docs/source/adr/0016-content-checksums-for-tracked-models.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
.. _16-content-checksums-for-tracked-models

16 Content checksums for Tracked Models
=======================================


Status
------

Proposed


Context
-------

After running the business rules it is useful to store the results of checks against models, and have a mechanism to check if the result is still valid to avoid re-running the check in future.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about what can invalidate a check after it has completed. We know that after a check has been run, then newly published tariff data can cause some types of check to become invalid - an ME32 check against a Measure, for instance. So hashing doesn't help avoid the need to run a check again in those circumstances.
It would be worth pointing out that caveat here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this is really important - I've updated the PR description to cover this, and will update the ADR to reflect this (date validity checkers will be their own adr)


This ADR adds a mechanims to checksum the user-editable data to check if to versions of a Tracked Model are equivilent.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm uncertain whether there are fields that can be mutated but are not user editable. If there is a difference, and editable fields are a subset of mutable fields, then would mutable be a preferable term here?
Tiny typo "... check if to versions".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--> mechanism
--> equivalent


In the simplest sense this means hashing fields that exclude the PK.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's meant by "fields that exclude the PK" ?



TAP already has a mechanism to get the user editable fields, the attribute `.copyable_fields`, which provides a starting point.

Unlike pythons default hashing, it is nessacary to distinguish between two different types that produce the same hash - so the string that gets passed the hashing function ("hashable_string"), will include the module and class name.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to give an example where two different types produce the same hash? It's probably trivial, but I have Monday brain

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--> necessary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it worth adding a link to docs on python's default hashing?



The hash generated in this ADR will not currently be attached to TrackedModel - thought it may be desirable in future.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attached == cached?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saved to it's own field I guess.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--> though

By limiting the scope to business rule checks this gives time for the code to bed in, and makes it easier to change things if there are issues.


Decision
--------

The fields to be hashed are contained in `copyable_fields` but since this doesn't communicate the intent used here, this will be proxied as `mutable_fields`, which is explicitly about providing which fields can be hashed to check equivilence.


TrackedModel will gain an API named get_content_hash() that returns a sha256 hash.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it only matters that it's a sufficiently unique hash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to be more specfic than "API"? Would it be a class_method? As I read it currently, it sounds a bit like the hash will be attached to the tracked model, which contradicts what's written on L28. I know that's not the intent, but just be good to clarify

Copy link
Contributor Author

@stuaxo stuaxo Aug 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I guess I meant a method - e.g. .content_hash(), vs something we save forever in a field.


get_content_hash() -

Iterates the fields in `mutable_fields` and building a dict keyed by field name, the value contains the location __module__ and __name__ of the field.

If the field is not TrackedModel then the value includes the sha256 of their content, for foreign keys that are TrackedModels the value is obtained by calling get_content_hash()


Consequences
------------

This provides a mechanism to store the results of business rule checks, only becomming invalid when the TrackedModels they reference change.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will part of this work involve creating a list of rules to which this kind of caching will be applicable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, though it doesn't need to be exhaustive - as it can be extended if needed later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--> becoming


When a workbasket is published, the validity of all other unpublished workbaskets must be verified - with this system, most of the business rule checks would remain valid and the only the checksums need to be verified.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... only the checksums need to be verified.

May be clearer to say "... only the content hash needs to be verified."?


2 changes: 2 additions & 0 deletions pii-ner-exclude.txt
Original file line number Diff line number Diff line change
Expand Up @@ -1168,3 +1168,5 @@ measure_qs)\n
MeasureComponent
Components
TP2000-452
Tracked Model
sha256