Simplify and enhance the datastore #122
Merged
- On unhandled errors, only print `anyhow` backtraces when using `-vv` or higher; otherwise print a more compact single-line message
- Make datastore-related error messages less verbose
The integration tests can now be run on other binaries, such as release binaries or Docker images, using the new `NP_TEST_PROGRAM` environment variable. For example: `NP_TEST_PROGRAM="$PWD"/release/bin/noseyparker cargo test --test test_noseyparker`
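As a rough illustration of how a test harness might honor `NP_TEST_PROGRAM`, here is a minimal sketch. The helper `resolve_test_program` and the fallback path `target/debug/noseyparker` are assumptions made for this example and are not taken from the PR; the actual test code may resolve the binary differently.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Hypothetical helper: pick the binary the integration tests should run.
/// `override_value` stands in for the value of `NP_TEST_PROGRAM`;
/// `default_program` is an assumed fallback, not a path named in the PR.
fn resolve_test_program(override_value: Option<&OsStr>, default_program: &str) -> PathBuf {
    match override_value {
        // A non-empty override wins.
        Some(v) if !v.is_empty() => PathBuf::from(v),
        // Unset or empty: fall back to the default binary.
        _ => PathBuf::from(default_program),
    }
}

fn main() {
    let from_env = env::var_os("NP_TEST_PROGRAM");
    let program = resolve_test_program(from_env.as_deref(), "target/debug/noseyparker");
    println!("running integration tests against {}", program.display());
}
```

Taking the override as a parameter, rather than reading the environment inside the helper, keeps the lookup logic testable without mutating process-wide state.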
* Needs refinement though :)
…into datastore-overhaul
Lots of changes in this PR — a heart and lung transplant for Nosey Parker.
The minimum supported Rust version has been changed from 1.70 to 1.76.
The data model and datastore have been significantly overhauled:
The rules used during scanning are now explicitly recorded in the datastore.
Each rule is additionally accompanied by a content-based identifier that uniquely identifies the rule based on its pattern.
Each match is now associated with the rule that produced it, rather than just the rule's name (which can change as rules are modified).
Each match is now assigned a unique content-based identifier.
Findings (i.e., groups of matches with the same capture groups, produced by the same rule) are now represented explicitly in the datastore.
Each finding is assigned a unique content-based identifier.
Now, each time a rule matches, a single match object is produced.
Each match in the datastore is now associated with an array of capture groups.
Previously, a rule whose pattern had multiple capture groups would produce one match object for each group, with each one being associated with a single capture group.
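The relationships described so far (content-based identifiers for rules, matches, and findings; one match object carrying all capture groups; findings as groups of matches that share a rule and capture groups) can be sketched roughly as follows. This is an illustration under stated assumptions, not the datastore's actual schema: the struct fields, the function names, and the use of FNV-1a as the hash are all invented here to keep the example dependency-free.

```rust
use std::collections::BTreeMap;

/// FNV-1a over length-prefixed parts. A stand-in hash chosen for this
/// sketch only; the hash actually used by the datastore is not stated here.
fn fnv1a(parts: &[&[u8]]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for part in parts {
        // Length prefix keeps ["ab", "c"] distinct from ["a", "bc"].
        let len = (part.len() as u64).to_le_bytes();
        for b in len.iter().chain(part.iter()) {
            h ^= u64::from(*b);
            h = h.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }
    h
}

/// One match object per rule match, holding every capture group.
#[derive(Debug, Clone, PartialEq)]
struct Match {
    rule_id: u64,         // content-based rule id, not the rule's name
    blob_id: String,      // hypothetical blob identifier
    start: usize,         // hypothetical byte offset of the match
    groups: Vec<Vec<u8>>, // all capture groups, as raw bytes
}

/// Rule id derived from the pattern, so renaming a rule does not change it.
fn rule_id(pattern: &str) -> u64 {
    fnv1a(&[pattern.as_bytes()])
}

/// Match id derived from the rule, the location, and the captured content.
fn match_id(m: &Match) -> u64 {
    let rid = m.rule_id.to_le_bytes();
    let start = (m.start as u64).to_le_bytes();
    let mut parts: Vec<&[u8]> = vec![&rid[..], m.blob_id.as_bytes(), &start[..]];
    parts.extend(m.groups.iter().map(|g| g.as_slice()));
    fnv1a(&parts)
}

/// Finding id: same rule and same capture groups, regardless of location.
fn finding_id(rule_id: u64, groups: &[Vec<u8>]) -> u64 {
    let rid = rule_id.to_le_bytes();
    let mut parts: Vec<&[u8]> = vec![&rid[..]];
    parts.extend(groups.iter().map(|g| g.as_slice()));
    fnv1a(&parts)
}

fn group_into_findings(matches: &[Match]) -> BTreeMap<u64, Vec<&Match>> {
    let mut findings: BTreeMap<u64, Vec<&Match>> = BTreeMap::new();
    for m in matches {
        findings.entry(finding_id(m.rule_id, &m.groups)).or_default().push(m);
    }
    findings
}

fn main() {
    let rid = rule_id(r"user=(\w+);pass=(\w+)");
    let mk = |blob: &str, start: usize, user: &str, pass: &str| Match {
        rule_id: rid,
        blob_id: blob.to_string(),
        start,
        groups: vec![user.as_bytes().to_vec(), pass.as_bytes().to_vec()],
    };
    let matches = vec![
        mk("blob-a", 10, "bob", "hunter2"),
        mk("blob-b", 42, "bob", "hunter2"),
        mk("blob-a", 90, "eve", "letmein"),
    ];
    for m in &matches {
        println!("match {:016x} in {}", match_id(m), m.blob_id);
    }
    for (id, members) in group_into_findings(&matches) {
        println!("finding {id:016x}: {} match(es)", members.len());
    }
}
```

In this sketch the two `bob` matches land in one finding because they share a rule and capture groups, while each keeps its own match identifier because its blob and offset differ.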
Provenance metadata for blobs is recorded in a much simpler way than before.
The new representation explicitly records file and git-based provenance, but also adds explicit support for extensible provenance.
This change will make it possible in the future to have Nosey Parker scan and usefully report blobs produced by custom input data enumerators (e.g., a Python script that lists files from the Common Crawl WARC files).
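One way to picture a provenance representation with file, git, and extensible variants is a small tagged union. The sketch below is illustrative only: the variant and field names are assumptions, and the extensible payload is shown as an opaque string rather than whatever structured form the datastore actually stores.

```rust
/// Hypothetical provenance model; names are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
enum Provenance {
    /// Blob found as a plain file on disk.
    File { path: String },
    /// Blob found in a git repository, optionally tied to a commit.
    Git { repo_path: String, commit: Option<String> },
    /// Blob reported by a custom enumerator; the payload is opaque here.
    Extended { payload: String },
}

/// Render a one-line description of where a blob came from.
fn describe(p: &Provenance) -> String {
    match p {
        Provenance::File { path } => format!("file {path}"),
        Provenance::Git { repo_path, commit: Some(c) } => {
            format!("git repo {repo_path} at commit {c}")
        }
        Provenance::Git { repo_path, commit: None } => format!("git repo {repo_path}"),
        Provenance::Extended { payload } => format!("extended {payload}"),
    }
}

fn main() {
    let entries = vec![
        Provenance::File { path: "src/config.toml".to_string() },
        Provenance::Git { repo_path: "repo.git".to_string(), commit: Some("abc123".to_string()) },
        // Made-up payload such as a custom WARC enumerator might emit.
        Provenance::Extended { payload: r#"{"warc":"example.warc.gz","offset":1234}"#.to_string() },
    ];
    for e in &entries {
        println!("{}", describe(e));
    }
}
```

Keeping the extensible variant opaque means a reporting layer can pass custom metadata through unchanged without having to understand it.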
Scores are now associated with matches instead of findings.
Comments can now be associated with both matches and findings, instead of just findings.
The JSON and JSONL report formats have changed.
These will stabilize in a future release (#101).
The `matching_input` field for matches has been removed and replaced with a new `groups` field, which contains an array of base64-encoded bytestrings.
Each match now includes additional `rule_text_id`, `rule_structural_id`, and `structural_id` fields.
The `provenance` field of each match is now slightly different.
Schema migration of older Nosey Parker datastores is no longer performed.
Previously, this would automatically and silently be done when opening a datastore from an older version.
Explicit support for datastore migration may be added back in a future release.
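For consumers of the JSON and JSONL reports, the practical effect of the `groups` change described above is that each capture group must be base64-decoded before use, since captured bytes are not guaranteed to be valid UTF-8. The sketch below assumes the standard RFC 4648 alphabet and uses a small, lenient hand-written decoder to stay dependency-free; the function names and sample values are made up for illustration, and a real consumer would more likely use an established base64 library.

```rust
/// Lenient decoder for the standard base64 alphabet (RFC 4648).
/// Stops at the first `=`; returns None on any other unexpected byte.
/// It does not validate padding length.
fn base64_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() / 4 * 3);
    let mut buf: u32 = 0; // bit accumulator
    let mut bits: u32 = 0; // number of pending bits in `buf`
    for c in input.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' => 62,
            b'/' => 63,
            b'=' => break,
            _ => return None,
        };
        buf = (buf << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            // Emit the top 8 pending bits and keep the remainder.
            bits -= 8;
            out.push((buf >> bits) as u8);
            buf &= (1 << bits) - 1;
        }
    }
    Some(out)
}

/// Decode every element of a match's `groups` array.
/// Returns None if any element is not valid base64.
fn decode_groups(groups: &[&str]) -> Option<Vec<Vec<u8>>> {
    groups.iter().map(|g| base64_decode(g)).collect()
}

fn main() {
    // Made-up `groups` value as it might appear in a report.
    let groups = ["Ym9i", "aHVudGVyMg=="];
    match decode_groups(&groups) {
        Some(decoded) => {
            for (i, g) in decoded.iter().enumerate() {
                // Lossy conversion, because groups may hold arbitrary bytes.
                println!("group {i}: {}", String::from_utf8_lossy(g));
            }
        }
        None => eprintln!("invalid base64 in groups"),
    }
}
```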