
Entropy scanning fixups #319

Closed
wants to merge 2 commits

Conversation

rbailey-godaddy
Contributor

To help us get this pull request reviewed and merged quickly, please be sure to include the following items:

  • Tests (if applicable)
  • Documentation (if applicable)
  • Changelog entry
  • A full explanation here in the PR description of the work done

PR Type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update (formatting, local variables)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Tests
  • Other

Backward Compatibility

Is this change backward compatible with the most recently released version? Does it introduce changes which might change the user experience in any way? Does it alter the API in any way?

  • Yes (backward compatible)
  • No (breaking changes)

"Almost" -- see description below for corner cases.

Issue Linking

closes #315

What's new?

The loosey-goosey "base64-anything" matching introduced in 3.0.0 turned out to be problematic: we found several examples of strings that looked like base64 encodings but weren't, and these generated new issues thanks to the extended combined alphabet. Simple tweaks looked problematic (read "nondeterministic") going forward.

This fix is a little bit more invasive.

  • We scan for BASE64 and BASE64URL encodings separately (using the same flaky expressions as pre-3.0.0 did) so they cannot bleed into each other. The cost is that we now make three (vs. two) regex scans of the input text. A number of unit tests required revision to account for the additional calls.
  • To offset this cost, we further adjust the regular expressions to match a minimum of 20 characters, offloading much of the work from util.find_strings_by_regex(), which previously spent most of its time throwing away short strings. Nothing in the tartufo codebase allows users to supply a different minimum length; I added a warning for folks who might be building other code on top of tartufo.
  • Additionally, scanner.scan_entropy() used to split lines into words and scan the words individually. Since the supplied regexes do this by themselves anyway, we eliminate the line.split() step and save a pass over the input text. This required fixes to a number of unit tests, since we now generally make fewer calls on longer targets.
  • But wait! There's more! Many strings are both base64 and base64url, and we don't want to generate duplicate (or overlapping) issues. That resulted in the following two changes...
  • scanner.evaluate_entropy_string() now returns an optional issue instead of being a generator (that yielded either one issue or none). This allows its only caller to inspect the return value before passing it back to higher-level callers.
  • scanner.scan_entropy() now jumps through deduplication hoops:
    • results from both base64 and base64url scans are combined in a set to eliminate strict duplicates; as a side effect, it is possible that lines with multiple issues (in the same line) may return the issues in a different order.
    • target strings are evaluated from longest to shortest, and we no longer report an issue for a substring if we already reported an issue on a longer string containing it. We do not verify that the substring actually occurred inside the longer string on the line, so a line containing "bad-string even-more-bad-string" would not generate an issue for the initial "bad-string" (it is a substring of "even-more-bad-string", which does get an issue) even though it appears first.
    • We also test for duplicate hex encoding issues (because a hex string is also a base64 string) -- this could not happen with default sensitivity (backstory omitted) but is nice to close off.
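The separate-alphabet scan and the deduplication steps above can be sketched roughly as follows. The regex patterns and helper names here are illustrative, not tartufo's actual implementation; only the 20-character minimum, the set-based duplicate elimination, and the longest-first substring dedup follow the description above.

```python
import re

# Illustrative patterns only -- tartufo's real expressions differ in detail.
# Each alphabet is scanned separately so matches cannot bleed into each
# other, and the {20,} quantifier enforces the minimum length in the regex
# itself instead of discarding short strings afterwards.
BASE64_REGEX = re.compile(r"[A-Za-z0-9+/=]{20,}")
BASE64URL_REGEX = re.compile(r"[A-Za-z0-9_=-]{20,}")


def find_candidate_strings(line):
    """Scan one line with both alphabets; return candidates longest first."""
    # A set eliminates strict duplicates (many strings match both
    # alphabets); as a side effect, the original ordering is lost.
    candidates = set(BASE64_REGEX.findall(line)) | set(BASE64URL_REGEX.findall(line))
    return sorted(candidates, key=len, reverse=True)


def dedupe_substrings(candidates):
    """Drop any candidate that is a substring of one already kept.

    Because candidates arrive longest first, a shorter string contained
    in an already-kept string is treated as a duplicate -- without
    checking where on the line it actually occurred.
    """
    kept = []
    for text in candidates:
        if not any(text in longer for longer in kept):
            kept.append(text)
    return kept
```

For example, `dedupe_substrings(["even-more-bad-string", "bad-string"])` keeps only the longer string, mirroring the substring corner case described above.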

My testing shows variations in the noise level when tartufo scans itself, but it seems like the extra overhead should be approximately balanced by the efficiency gains and we'll see a wash. Time will tell.

Note it is still possible to get new issues (not reported by 2.x) if a string begins or ends with - or _ and the old string didn't generate an issue but the new longer string does.
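This corner case can be seen directly with the two alphabets (again using illustrative patterns, not tartufo's exact expressions): the base64url character class includes "-" and "_", so a match can now extend across a character where the 2.x base64 scan stopped.

```python
import re

# Illustrative patterns; '-' and '_' appear only in the base64url alphabet.
OLD_BASE64 = re.compile(r"[A-Za-z0-9+/=]{20,}")
NEW_BASE64URL = re.compile(r"[A-Za-z0-9_=-]{20,}")

sample = "-QWxhZGRpbjpvcGVuIHNlc2FtZQ"

# The old scan stops at the leading '-'; the new scan includes it,
# producing a longer (and therefore potentially new) candidate string.
print(OLD_BASE64.findall(sample))     # ['QWxhZGRpbjpvcGVuIHNlc2FtZQ']
print(NEW_BASE64URL.findall(sample))  # ['-QWxhZGRpbjpvcGVuIHNlc2FtZQ']
```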

@sonarqubecloud
Kudos, SonarCloud Quality Gate passed!

  • Bugs: 0 (A)
  • Vulnerabilities: 0 (A)
  • Security Hotspots: 0 (A)
  • Code Smells: 4 (A)
  • Coverage: no coverage information
  • Duplication: 0.0%

@rbailey-godaddy
Contributor Author

I'm closing this PR; it's been sitting without action for over a year, and clearly:

  • users apparently aren't too upset about the existing behavior
  • we really don't need another (gratuitous) incompatible finding change

Successfully merging this pull request may close these issues.

Entropy scanning producing extra (incorrect) outputs
2 participants