Cache scan results across runs #146
bradlarsen started this conversation in Ideas
Replies: 1 comment
-
Some measurement would need to be done to determine the effectiveness of this caching. If the SHA1 digest of each file needs to be computed to determine whether it has already been scanned, that might be nearly as expensive as just going ahead and scanning the file anyway! This seems tricky to get right, and I'm not sure there would be a large performance benefit.
-
Nosey Parker should keep track of which inputs have already been scanned, and avoid rescanning them if possible on future scanning runs.
Currently,
noseyparker scan -d DATASTORE INPUT
will completely enumerate and scan INPUT from scratch. Nosey Parker is fast, but for large repositories (like the Linux kernel, with 100+ GB of blobs), a scan still takes a couple of minutes. However, simply enumerating contents is quick, especially in Git repositories (e.g., the Linux kernel repo can be enumerated in 13-25 seconds, depending on filesystem cache). If Nosey Parker kept track of which blobs it had scanned, and with which set of rules, it could avoid re-scanning them.
Caching is tricky to get right. The information about which inputs have already been scanned should probably be persisted in the datastore's SQLite database. An entry would be a (blob id, ruleset id, Nosey Parker version id) tuple, indicating that a particular blob had been scanned with a particular set of rules and version of Nosey Parker.
If the context size for reported findings in Nosey Parker becomes runtime-configurable, that parameter would also need to be taken into account for caching.
The cache could be a fixed-size LRU cache (for example, 128 MB of entries), loaded into a fast in-memory structure at enumeration time and then updated in bulk at the end of scanning. (Some implementation like this may be necessary to avoid tanking Nosey Parker's speed.)
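The load-once, check-in-memory, write-back-in-bulk pattern described above can be sketched as follows. This is a simplified illustration (a plain set rather than a bounded LRU, and made-up blob and ruleset ids), not the actual implementation:

```python
import sqlite3

# Hypothetical cache table with one pre-existing entry.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scanned_blob (blob_id TEXT, ruleset_id TEXT)")
conn.execute("INSERT INTO scanned_blob VALUES ('blob-a', 'rs-1')")

# 1. Load the cache into memory once, at enumeration time.
seen = set(conn.execute("SELECT blob_id, ruleset_id FROM scanned_blob"))

# 2. During scanning, check membership in memory: no per-blob SQL round trip.
newly_scanned = []
for blob_id in ["blob-a", "blob-b", "blob-c"]:
    if (blob_id, "rs-1") in seen:
        continue  # cache hit: skip re-scanning this blob
    # ... scan the blob here ...
    newly_scanned.append((blob_id, "rs-1"))

# 3. Persist all new entries in a single bulk transaction at the end.
with conn:
    conn.executemany("INSERT INTO scanned_blob VALUES (?, ?)", newly_scanned)
```

Batching the writes into one transaction keeps the per-blob overhead of the cache down to an in-memory set lookup during the hot scanning loop.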
One complication: a Nosey Parker datastore's SQLite database currently has a very simple, totally denormalized schema with a single table. There is also currently no such thing as a ruleset id; that notion would need to be added (perhaps the SHA-512 digest of all the loaded rules).
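Deriving a ruleset id by digesting the loaded rules might look like the sketch below. The serialization scheme and sample rule patterns are assumptions for illustration; the key points are that the digest covers every loaded rule and is insensitive to load order:

```python
import hashlib

def ruleset_id(rules: list) -> str:
    """Derive a stable id for a set of rules: the SHA-512 digest of a
    canonical serialization. Sorting first makes the id independent of
    the order in which rules were loaded."""
    h = hashlib.sha512()
    for rule in sorted(rules):
        h.update(rule.encode("utf-8"))
        h.update(b"\x00")  # separator, so rule boundaries are unambiguous
    return h.hexdigest()

# Made-up example rule patterns, for illustration only.
rules = ["github_pat: ghp_[0-9A-Za-z]{36}", "aws_key: AKIA[0-9A-Z]{16}"]
assert ruleset_id(rules) == ruleset_id(list(reversed(rules)))
```

Any change to any rule then produces a new ruleset id, which automatically invalidates cached scan results for the old rules.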