Cache scan results across runs #146
bradlarsen started this conversation in Ideas
Replies: 1 comment
-
Some measurement would need to be done to determine the effectiveness of this caching. If the SHA1 digest of each file needs to be computed to determine whether it has already been scanned, that might be nearly as expensive as just going ahead and scanning the file anyway! This seems tricky to get right, and I'm not sure there would be a large performance benefit.
-
Nosey Parker should keep track of which inputs have already been scanned, and avoid rescanning them if possible on future scanning runs.
Currently,
noseyparker scan -d DATASTORE INPUT
will completely enumerate and scan INPUT from scratch. Nosey Parker is fast, but for large repositories (like the Linux kernel, with 100+ GB of blobs), a scan still takes a couple of minutes. However, simply enumerating contents is quick, especially in Git repositories (e.g., the Linux kernel repo can be enumerated in 13-25 seconds, depending on filesystem cache). If Nosey Parker kept track of which blobs it had scanned, and with which set of rules, it could avoid re-scanning them.
Caching is tricky to get right. The information about which inputs have already been scanned should probably be persisted in the datastore's SQLite database. An entry would be a (blob id, ruleset id, Nosey Parker version id) tuple, indicating that a particular blob had been scanned with a particular set of rules and version of Nosey Parker.
If the context size for reported findings in Nosey Parker becomes runtime-configurable, that parameter would also need to be taken into account for caching.
The cache could be a fixed-size LRU cache (for example, 128 MB of entries), loaded into a fast in-memory structure at enumeration time and then updated in bulk at the end of scanning. (Some implementation like this may be necessary to avoid tanking Nosey Parker's speed.)
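The load-once, check-in-memory, write-back-in-bulk pattern described above can be sketched as follows. This is a simplified illustration (a plain set rather than a bounded LRU, and made-up blob and ruleset ids), not the actual implementation:

```python
import sqlite3

# Hypothetical cache table with one pre-existing entry.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scanned_blob (blob_id TEXT, ruleset_id TEXT)")
conn.execute("INSERT INTO scanned_blob VALUES ('blob-a', 'rs-1')")

# 1. Load the cache into memory once, at enumeration time.
seen = set(conn.execute("SELECT blob_id, ruleset_id FROM scanned_blob"))

# 2. During scanning, check membership in memory: no per-blob SQL round trip.
newly_scanned = []
for blob_id in ["blob-a", "blob-b", "blob-c"]:
    if (blob_id, "rs-1") in seen:
        continue  # cache hit: skip re-scanning this blob
    # ... scan the blob here ...
    newly_scanned.append((blob_id, "rs-1"))

# 3. Persist all new entries in a single bulk transaction at the end.
with conn:
    conn.executemany("INSERT INTO scanned_blob VALUES (?, ?)", newly_scanned)
```

Batching the writes into one transaction keeps the per-blob overhead of the cache down to an in-memory set lookup during the hot scanning loop.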
One complication: a Nosey Parker datastore's SQLite database currently has a very simple, totally denormalized schema with a single table. There is also currently no such thing as a ruleset id; that notion would need to be added (perhaps the SHA-512 digest of all the loaded rules).
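Deriving a ruleset id by digesting the loaded rules might look like the sketch below. The serialization scheme and sample rule patterns are assumptions for illustration; the key points are that the digest covers every loaded rule and is insensitive to load order:

```python
import hashlib

def ruleset_id(rules: list) -> str:
    """Derive a stable id for a set of rules: the SHA-512 digest of a
    canonical serialization. Sorting first makes the id independent of
    the order in which rules were loaded."""
    h = hashlib.sha512()
    for rule in sorted(rules):
        h.update(rule.encode("utf-8"))
        h.update(b"\x00")  # separator, so rule boundaries are unambiguous
    return h.hexdigest()

# Made-up example rule patterns, for illustration only.
rules = ["github_pat: ghp_[0-9A-Za-z]{36}", "aws_key: AKIA[0-9A-Z]{16}"]
assert ruleset_id(rules) == ruleset_id(list(reversed(rules)))
```

Any change to any rule then produces a new ruleset id, which automatically invalidates cached scan results for the old rules.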