Simplify and enhance the datastore #122

Merged
merged 46 commits into from
Feb 17, 2024

@bradlarsen bradlarsen commented Jan 29, 2024

This PR contains many changes: a heart and lung transplant for Nosey Parker.

  • The minimum supported Rust version has been changed from 1.70 to 1.76.

  • The data model and datastore have been significantly overhauled:

    • The rules used during scanning are now explicitly recorded in the datastore.
      Each rule is additionally accompanied by a content-based identifier that uniquely identifies the rule based on its pattern.

    • Each match is now associated with the rule that produced it, rather than just the rule's name (which can change as rules are modified).

    • Each match is now assigned a unique content-based identifier.

    • Findings (i.e., groups of matches with the same capture groups, produced by the same rule) are now represented explicitly in the datastore.
      Each finding is assigned a unique content-based identifier.

    • Now, each time a rule matches, a single match object is produced.
      Each match in the datastore is now associated with an array of capture groups.
      Previously, a rule whose pattern had multiple capture groups would produce one match object for each group, with each one being associated with a single capture group.

    • Provenance metadata for blobs is recorded in a much simpler way than before.
      The new representation explicitly records file and git-based provenance, but also adds explicit support for extensible provenance.
      This change will make it possible in the future to have Nosey Parker scan and usefully report blobs produced by custom input data enumerators (e.g., a Python script that lists files from the Common Crawl WARC files).

    • Scores are now associated with matches instead of findings.

    • Comments can now be associated with both matches and findings, instead of just findings.

  • The JSON and JSONL report formats have changed.
    These will stabilize in a future release (#101).

    • The matching_input field for matches has been removed and replaced with a new groups field, which contains an array of base64-encoded bytestrings.

    • Each match now includes additional rule_text_id, rule_structural_id, and structural_id fields.

    • The provenance field of each match is now slightly different.

  • Schema migration of older Nosey Parker datastores is no longer performed.
    Previously, migration was performed automatically and silently when opening a datastore created by an older version.
    Explicit support for datastore migration may be added back in a future release.
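
As an illustration of the report-format change described above, here is a minimal Python sketch of a rule with two capture groups producing a single match record whose `groups` field is an array of base64-encoded bytestrings, and of a report consumer decoding it. The pattern, the sample blob, and the helper names `match_to_record` and `decode_groups` are hypothetical. Only the `groups` field name comes from this PR, and a real report record carries more fields than shown here.

```python
import base64
import json
import re

# Hypothetical pattern with two capture groups (not a real Nosey Parker rule).
PATTERN = re.compile(rb"user=(\w+);token=(\w+)")


def match_to_record(m):
    """Build ONE record per regex match, carrying every capture group."""
    return {
        # base64 lets arbitrary bytes survive a round trip through JSON.
        "groups": [base64.b64encode(g).decode("ascii") for g in m.groups()],
    }


def decode_groups(record):
    """Recover the raw bytes of each capture group from a report record."""
    return [base64.b64decode(g) for g in record["groups"]]


# Made-up input blob for illustration.
blob = b"config: user=alice;token=s3cr3t\n"

# One match object per rule match, not one per capture group.
records = [match_to_record(m) for m in PATTERN.finditer(blob)]

# Simulate writing and re-reading one JSONL line.
line = json.dumps(records[0])
decoded = decode_groups(json.loads(line))
print(len(records), decoded)
```

A consumer that previously read `matching_input` would instead iterate over the decoded `groups` array in a similar way.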

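The content-based identifiers described above can be sketched roughly as follows. This is only an illustration of the idea, not the scheme Nosey Parker actually uses: the choice of SHA-1, the byte encoding, and the function names `rule_id` and `finding_id` are all assumptions, and the rule shown is made up.

```python
import hashlib


def rule_id(pattern):
    """Identify a rule by its pattern only, so renaming the rule keeps the ID."""
    return hashlib.sha1(pattern.encode("utf-8")).hexdigest()


def finding_id(rule, groups):
    """Identify a finding by its rule ID plus the capture group contents."""
    h = hashlib.sha1()
    h.update(rule.encode("ascii"))
    for g in groups:
        # Length-prefix each group so [b"ab", b"c"] and [b"a", b"bc"] differ.
        h.update(len(g).to_bytes(8, "big"))
        h.update(g)
    return h.hexdigest()


# Hypothetical rule, before and after a rename; the pattern is unchanged.
old = {"name": "Example Token", "pattern": r"token=(\w+)"}
renamed = {"name": "Example API Token", "pattern": r"token=(\w+)"}
same_rule = rule_id(old["pattern"]) == rule_id(renamed["pattern"])

# Matches with identical groups from the same rule fall into one finding.
rid = rule_id(old["pattern"])
f1 = finding_id(rid, [b"s3cr3t"])
f2 = finding_id(rid, [b"s3cr3t"])
f3 = finding_id(rid, [b"other"])
print(same_rule, f1 == f2, f1 != f3)
```

Because the identifier depends only on content, a match stays linked to its rule even if the rule's name is later edited.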
@bradlarsen bradlarsen self-assigned this Jan 29, 2024
- On unhandled errors, only print `anyhow` backtraces when using `-vv`
  or higher, otherwise print a more compact single-line message

- Make datastore-related error messages less verbose

The integration tests can now be run on other binaries, such as release
binaries or Docker images, using the new `NP_TEST_PROGRAM` environment
variable. For example:

    NP_TEST_PROGRAM="$PWD"/release/bin/noseyparker cargo test --test test_noseyparker

* Needs refinement though :)
@bradlarsen bradlarsen marked this pull request as ready for review February 16, 2024 23:37
@bradlarsen bradlarsen merged commit b86fefe into main Feb 17, 2024
8 checks passed
@bradlarsen bradlarsen deleted the datastore-overhaul branch February 17, 2024 16:43