Simplify and enhance the datastore #122

bradlarsen · 2024-01-29T22:04:43Z

Lots of changes in this PR — a heart and lung transplant for Nosey Parker.

The minimum supported Rust version has been changed from 1.70 to 1.76.

The data model and datastore have been significantly overhauled:
- The rules used during scanning are now explicitly recorded in the datastore.
  Each rule is additionally accompanied by a content-based identifier that uniquely identifies the rule based on its pattern.
- Each match is now associated with the rule that produced it, rather than just the rule's name (which can change as rules are modified).
- Each match is now assigned a unique content-based identifier.
- Findings (i.e., groups of matches with the same capture groups, produced by the same rule) are now represented explicitly in the datastore.
  Each finding is assigned a unique content-based identifier.
- Now, each time a rule matches, a single match object is produced.
  Each match in the datastore is now associated with an array of capture groups.
  Previously, a rule whose pattern had multiple capture groups would produce one match object for each group, with each one being associated with a single capture group.
- Provenance metadata for blobs is recorded in a much simpler way than before.
  The new representation explicitly records file and git-based provenance, but also adds explicit support for extensible provenance.
  This change will make it possible in the future to have Nosey Parker scan and usefully report blobs produced by custom input data enumerators (e.g., a Python script that lists files from the Common Crawl WARC files).
- Scores are now associated with matches instead of findings.
- Comments can now be associated with both matches and findings, instead of just findings.
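
The relationships described in the list above can be sketched roughly as follows. This is an illustrative model only, not the actual Nosey Parker schema: the struct layouts, the `blob_id`, `start`, and `end` fields, the `content_id` helper, and the choice of FNV-1a as the hash are all assumptions made for the example. Only the names `rule_structural_id` and `structural_id` come from the PR description.

```rust
use std::collections::BTreeMap;

/// Stand-in content hash (FNV-1a, 64-bit).
/// Assumption: the PR does not say which hash Nosey Parker uses.
fn content_id(parts: &[&[u8]]) -> String {
    let mut h: u64 = 0xcbf29ce484222325;
    for part in parts {
        // Mix in each part's length so ["ab", "c"] and ["a", "bc"] hash differently.
        let len = (part.len() as u64).to_le_bytes();
        for b in len.iter().chain(part.iter()) {
            h ^= u64::from(*b);
            h = h.wrapping_mul(0x100000001b3);
        }
    }
    format!("{h:016x}")
}

#[allow(dead_code)]
#[derive(Debug, Clone)]
struct Rule {
    name: String,    // may change as rules are edited
    pattern: String, // the identifier is derived from this
}

impl Rule {
    /// Content-based identifier: depends on the pattern, not the name.
    fn structural_id(&self) -> String {
        content_id(&[self.pattern.as_bytes()])
    }
}

#[allow(dead_code)]
#[derive(Debug, Clone)]
struct Match {
    rule_structural_id: String,
    groups: Vec<Vec<u8>>, // one match object carries all capture groups
    blob_id: String,      // hypothetical location fields
    start: usize,
    end: usize,
}

impl Match {
    /// Identifies this particular match (rule + location).
    fn structural_id(&self) -> String {
        let span = format!("{}-{}", self.start, self.end);
        content_id(&[
            self.rule_structural_id.as_bytes(),
            self.blob_id.as_bytes(),
            span.as_bytes(),
        ])
    }

    /// Identifies the finding this match belongs to (rule + capture groups).
    fn finding_id(&self) -> String {
        let mut parts: Vec<&[u8]> = vec![self.rule_structural_id.as_bytes()];
        parts.extend(self.groups.iter().map(|g| g.as_slice()));
        content_id(&parts)
    }
}

/// Group matches into findings: same rule and same capture groups.
fn group_into_findings(matches: &[Match]) -> BTreeMap<String, Vec<&Match>> {
    let mut findings: BTreeMap<String, Vec<&Match>> = BTreeMap::new();
    for m in matches {
        findings.entry(m.finding_id()).or_default().push(m);
    }
    findings
}
```

In this sketch, renaming a rule leaves its identifier unchanged, and two matches of the same secret in different blobs get distinct match identifiers but land in the same finding.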

The JSON and JSONL report formats have changed.
These will stabilize in a future release (#101).
- The `matching_input` field for matches has been removed and replaced with a new `groups` field, which contains an array of base64-encoded bytestrings.
- Each match now includes additional `rule_text_id`, `rule_structural_id`, and `structural_id` fields.
- The `provenance` field of each match is now slightly different.
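
For consumers of the report, the new `groups` field means each capture group has to be base64-decoded before use. The following is a minimal sketch of that step using only the standard library. It assumes the standard base64 alphabet from RFC 4648, which the PR description does not state explicitly, and `base64_decode` and `decode_groups` are hypothetical helper names rather than part of Nosey Parker.

```rust
/// Decode standard-alphabet base64 (RFC 4648), with or without `=` padding.
/// Returns `None` if the input contains a character outside the alphabet.
fn base64_decode(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len() / 4 * 3);
    let mut buf: u32 = 0; // bit accumulator
    let mut bits = 0; // number of not-yet-emitted bits in `buf`
    for c in input.bytes() {
        let v = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' => 62,
            b'/' => 63,
            b'=' => break, // padding marks the end of the data
            _ => return None,
        };
        buf = (buf << 6) | u32::from(v);
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((buf >> bits) as u8);
        }
    }
    Some(out)
}

/// Decode every entry of a match's `groups` array.
/// Capture groups are bytestrings, so the result is raw bytes, not `String`.
fn decode_groups(groups: &[&str]) -> Option<Vec<Vec<u8>>> {
    groups.iter().map(|g| base64_decode(g)).collect()
}
```

Because the decoded groups are arbitrary bytes, something like `String::from_utf8_lossy` is a reasonable choice when they only need to be displayed.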

Schema migration of older Nosey Parker datastores is no longer performed.
Previously, this would automatically and silently be done when opening a datastore from an older version.
Explicit support for datastore migration may be added back in a future release.

The integration tests can now be run on other binaries, such as release binaries or Docker images, using the new `NP_TEST_PROGRAM` environment variable. For example: `NP_TEST_PROGRAM="$PWD"/release/bin/noseyparker cargo test --test test_noseyparker`
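
One way a test harness could honor such an override is sketched below. This is not the code from the PR: the function names and the idea of passing in a fallback path are assumptions for illustration. Only the `NP_TEST_PROGRAM` variable name comes from the description above.

```rust
use std::env;
use std::path::PathBuf;

/// Pick the program under test: the override if present and non-empty,
/// otherwise the fallback (e.g. the binary Cargo built for this test run).
fn program_under_test(override_value: Option<String>, fallback: &str) -> PathBuf {
    match override_value {
        Some(p) if !p.is_empty() => PathBuf::from(p),
        _ => PathBuf::from(fallback),
    }
}

/// Same as above, reading the override from the `NP_TEST_PROGRAM` variable.
#[allow(dead_code)]
fn program_from_env(fallback: &str) -> PathBuf {
    program_under_test(env::var("NP_TEST_PROGRAM").ok(), fallback)
}
```

Keeping the environment lookup separate from the selection logic makes the selection easy to test without mutating process-wide state.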

Checkpoint new datastore schema

151c253

bradlarsen self-assigned this Jan 29, 2024

bradlarsen added 16 commits January 30, 2024 12:38

CI: update upload-artifact version

e0b9e72

Update dependencies with cargo update

0c5e096

CI: update dependencies to avoid node.js deprecation warnings

d991524

Improve error messages

7cbbd8f

- On unhandled errors, only print `anyhow` backtraces when using `-vv` or higher; otherwise print a more compact single-line message
- Make datastore-related error messages less verbose

Update gix from 0.56 to 0.58

778d7dd

Update strum from 0.25 to 0.26

f3d587a

Delete cruft

636c5bf

CI: run integration tests on built releases

41a42d1

CI: set NP_GITHUB_TOKEN

23ea140

Add notes and assertions

9d4a018

Include more information in a debug message

4c02d8c

Add a more realistic integration test

e912cab

Reduce false positives from the JWT rule

addff6c

Give Saullo credit

d20a003

Checkpoint

217e2f3

github-advanced-security bot found potential problems Feb 8, 2024

crates/noseyparker/src/datastore.rs (fixed)

bradlarsen added 3 commits February 8, 2024 18:31

Record new blob provenance format while scanning

978a31c

Refine schema

75ae509

More datastore-related fixes

8b4a486

github-advanced-security bot found potential problems Feb 9, 2024

crates/noseyparker/src/datastore.rs (fixed)

Checkpoint. report works again! (*)

31bdbf4

* Needs refinement though :)

github-advanced-security bot found potential problems Feb 13, 2024

bradlarsen added 5 commits February 13, 2024 11:29

Fix test build

4821472

Fix deserialization of BStringLossyUtf8

f188a13

Make a datastore error message more informative

44e3556

Fix an error message in database conversion for groups

7141c0e

More correct provenance support

a09a776

bradlarsen added 6 commits February 13, 2024 13:57

Add some round-trip property tests for custom serialization

25ec4f3

Fix some tests

8e388ce

Simplify schema

fe9fac2

Put a version number in schema name

3b10d20

Fix a test

9034e8d

Make reporting more deterministic

59acf61

github-advanced-security bot found potential problems Feb 13, 2024

bradlarsen added 14 commits February 15, 2024 20:38

More fixes

87d73b0

Add another FIXME

35dee8e

Add rule name and id to JSON report output

3b8e666

Update dependencies with cargo update

43a9323

Merge branch 'main' of https://github.com/praetorian-inc/noseyparker into datastore-overhaul

bdb4b2a

Fix clippy nits

91359a4

Bump minimal Rust version from 1.70 to 1.76

dea3b11

Include match structural id in JSON output

2c0d8f8

Update CHANGELOG

9e03f2c

Retrain tests

5eff1a0

Eliminate unused imports

db63229

Add a type annotation to try to fix Linux builds

bc5ab9b

Update CHANGELOG

c039bf7

Update CHANGELOG

3679148

bradlarsen marked this pull request as ready for review February 16, 2024 23:37

bradlarsen mentioned this pull request Feb 16, 2024

Stabilize the JSON format #101

Closed

bradlarsen merged commit b86fefe into main Feb 17, 2024
8 checks passed

bradlarsen deleted the datastore-overhaul branch February 17, 2024 16:43
