Entropy scanning producing extra (incorrect) outputs #315
Comments
Well, I thought I'd share some analysis of this issue as we search for the best response to it. Let's start with the underlying assumption that's rarely laid out anywhere.

The secrets used by serious systems are generated using high-quality cryptographic algorithms that generate some sort of encoded or hashed output. The better the algorithm is, the more the output looks like random garbage. Because humans don't deal well with strings of random bytes, common practice is to encode the raw data as printable text. There are lots of ways to do this, but a couple of the most common are hexadecimal representation (two characters per byte) and base64 (more efficient, using four characters to represent three bytes).

Instead of laboriously checking the entire content of a file for entropy, truffleHog and tartufo checked entropy only for strings that were hexadecimal or "probably" base64 encodings. (This could include strings that weren't actually legal encodings but still used only the base64 alphabet.) This is much more efficient, but ignores anything that doesn't happen to be using one of those two encoding methods.

#177 requested support for base64url encoding, a variant that uses The approach incorporated in 3.0.0 was to relax the base64 scanner slightly so it would recognize both base64 and base64url (as well as things that were illegal combinations of both) using a single regular expression. This satisfied #177 -- base64url encodings would be detected and entropy-checked -- and had negligible performance impact because we did not materially alter the algorithm for scanning input strings. It also didn't turn up any problems in our initial testing.

The gotcha, as has been quickly revealed following 3.0.0's release, is that it is extremely common for long file and path names (among other things) to contain The question is, "what do we do about it?"
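To make the mechanics above concrete, here is a minimal sketch of entropy-checking only the strings that look like encodings. It is not tartufo's actual code: the regular expressions, the 20-character minimum, and the helper names `shannon_entropy` and `candidates` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Assumed patterns and minimum length, for illustration only.
# Strict base64 alphabet, with optional "=" padding.
BASE64_RE = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")
# Relaxed alphabet that also accepts the base64url characters "-" and "_",
# so a single expression covers both encodings (and illegal mixtures of them).
BASE64_RELAXED_RE = re.compile(r"[A-Za-z0-9+/_-]{20,}={0,2}")


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def candidates(line: str, pattern: re.Pattern) -> list:
    """Return the substrings of line that look like encodings under pattern."""
    return pattern.findall(line)


# A made-up path, not taken from any real report.
path = "docs/user-guide_for-new_contributors-2021"
strict_hits = candidates(path, BASE64_RE)           # "-" and "_" split it into short pieces
relaxed_hits = candidates(path, BASE64_RELAXED_RE)  # the whole path is one long candidate
```

With the strict alphabet, the `-` and `_` characters break the path into pieces too short to be considered at all. With the relaxed alphabet the whole path becomes a single candidate, which is then entropy-checked like any other string.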
Obviously we'd like tartufo to magically ignore all of the "obviously nonrandom" strings while correctly reporting all of the actually random strings (and do it as fast or faster than before). It is not clear this is possible. I wanted to outline the approaches we have considered and why all of them suck.

First off, there is the issue of "obviously nonrandom". Humans are pretty good at looking at something like The approach taken by #319 is to observe that this string obviously can't be a real encoding, because it contains Of course, it's not quite that easy. Many alphanumeric strings (that don't contain The price we pay for this is an extra pass over the data (to look for base64url encodings, distinct from the base64 pass) and the added deduplication recordkeeping -- hopefully largely offset by the elimination of a pass over the data (to split lines on whitespace before checking for encodings) and regex changes to short-circuit production of too-small strings that would be immediately discarded anyway.

However, in real life we have already found examples of filenames that still exceed the entropy threshold and generate issues. Therefore, from a strict compatibility viewpoint, #319 sucks. It just seems to suck less than the alternatives.

What are the alternatives?

We could drop base64url support, but that sucks. We shouldn't ignore possibly sensitive data (using industry-standard encoding, even!) and we know at least one project actually uses these encodings and wants to check them.

We could add a switch to enable or disable base64 support, but that sucks, too. Maybe a repository owner doesn't realize that there is base64url-encoded content present. Maybe they get a bunch of "spurious" issues driven by filenames and turn off the check, allowing legitimately problematic content to pass unscanned. It feels like too big a hammer.
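The separate base64 and base64url passes with deduplication, as described for #319 above, might look roughly like the sketch below. This is a guess at the general shape, not the code from that PR: the patterns, the 20-character minimum, and the `find_candidates` helper are hypothetical.

```python
import re

# Assumed patterns, for illustration only: each alphabet is matched separately,
# so a string that mixes "/" with "-" or "_" is not one candidate in either pass.
BASE64_RE = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")
BASE64URL_RE = re.compile(r"[A-Za-z0-9_-]{20,}={0,2}")


def find_candidates(line: str) -> list:
    """Collect candidate encodings from both passes, reporting each string once."""
    seen = set()
    found = []
    for pattern in (BASE64_RE, BASE64URL_RE):
        for match in pattern.findall(line):
            # Purely alphanumeric strings match both patterns; remembering what
            # we have already seen keeps them from being reported twice.
            if match not in seen:
                seen.add(match)
                found.append(match)
    return found


# Matches both patterns, but is reported only once.
alnum_hits = find_candidates("abcdefghijklmnopqrstuvwx")
# A made-up path mixing "/" with "-" and "_": every piece is too short in either alphabet.
mixed_hits = find_candidates("src/main-app/test_data/x")
```

A long path that happens to use only one of the two alphabets would still come through as a candidate, which fits the observation that some filenames continue to exceed the entropy threshold.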
Several observers have suggested changing the default However, we have examples of real-life filename fragments that have entropy equivalent to real-life AWS_SECRET_ACCESS_KEY strings. That is, if we set the bar high enough to avoid false positives on these filenames, we get false negatives on sensitive strings we definitely want to know about (and which existing tartufo versions will report). From a security perspective, this cure is worse than the disease, and that sucks.

Another option we investigated was to build on the concept of "that filename obviously looks nonrandom". Could we make these obvious to tartufo also? I'd love to hear from somebody with a workable approach, but the things we've considered suck. The crux of the problem is that "that filename" is still a valid encoding with high entropy. From a statistical viewpoint, we played with looking at sequencing -- Alternatively, we could say "well, it's a filename" -- but realistically we don't really KNOW it's a filename; we just have human intuition. The purported name doesn't necessarily exist in the repository, so it's not like the scanner can go look and say "yay, I found it so I guess it's probably a reference and not some encoding that just looks like a filename!" Maybe it's something that exists in a different repository or is present in the runtime environment where the repository's generated artifacts will be deployed. So, in the absence of any specifics, this approach sucks too.

That brings us, reluctantly, to the point of saying "this is a file, it's okay" and "this is another file, it's okay too" and... wait -- those are exclusions. Tartufo already knows how to do that. But can it do it well enough? We already have efficiency concerns without adding even more exclusions, and that's an ongoing cost piled on top of the one-time cost of adding them. But that's a story for another issue or PR.
🐛 Bug Report
Values like below are being flagged as high entropy findings when they shouldn't.
Expected Behavior
Values should not be flagged as high entropy findings.
Environment
tartufo v3.0.0