Introduce indexed embedded CPE dictionary #1897

luhring · 2023-06-27T15:57:28Z

This PR tries out a new approach to CPE generation (that was discussed briefly in Slack) in a limited capacity.

In this PR, we leverage the information in NVD's CPE Dictionary to understand which CPEs are actually defined (and thus usable when querying NVD's CVE data), and then attempt to relate CPE dictionary entries to identifiable ecosystem packages that Syft's SCA logic detects.

We capture this knowledge in a concise JSON file that we embed into Syft at build time. (This embedded file is currently ~95 KB.)

Finally, when Syft is adding CPEs to each package it discovered, it first looks for entries in the embedded dictionary index, and uses the CPE from the dictionary if available. If no entry can be found, Syft falls back to today's existing CPE generation logic. The net effect is that Syft's existing generation logic is still used the majority of the time, but Syft's CPE quality greatly improves when a hit is found in the embedded dictionary.
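The lookup-then-fallback flow described above can be sketched roughly as follows. This is a minimal illustration and not Syft's actual implementation: the index layout (ecosystem, then package name, then CPE), the `loadIndex`, `cpesFor`, and `fallbackCPEs` helpers, and the inline JSON standing in for the embedded file are all assumptions made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// indexJSON stands in for the file that gets embedded at build time.
// The layout (ecosystem -> package name -> CPE) is hypothetical.
const indexJSON = `{
  "npm": {
    "ansi-regex": "cpe:2.3:a:ansi-regex_project:ansi-regex:*:*:*:*:*:node.js:*:*"
  }
}`

type cpeIndex map[string]map[string]string

// loadIndex parses the JSON index into a nested map.
func loadIndex(raw string) (cpeIndex, error) {
	var idx cpeIndex
	if err := json.Unmarshal([]byte(raw), &idx); err != nil {
		return nil, err
	}
	return idx, nil
}

// setVersion fills in the version field (the sixth of the 13
// colon-separated fields) of a CPE 2.3 formatted string.
// Assumption: no escaped colons appear inside the fields.
func setVersion(cpe, version string) string {
	fields := strings.Split(cpe, ":")
	if len(fields) != 13 {
		return cpe
	}
	fields[5] = version
	return strings.Join(fields, ":")
}

// fallbackCPEs is a hypothetical stand-in for the existing generation
// logic: it makes a single guess using the package name as vendor and product.
func fallbackCPEs(name, version string) []string {
	return []string{fmt.Sprintf("cpe:2.3:a:%s:%s:%s:*:*:*:*:*:*:*", name, name, version)}
}

// cpesFor prefers the dictionary entry and only guesses when there is none.
func cpesFor(idx cpeIndex, ecosystem, name, version string) []string {
	if cpe, ok := idx[ecosystem][name]; ok {
		return []string{setVersion(cpe, version)}
	}
	return fallbackCPEs(name, version)
}

func main() {
	idx, err := loadIndex(indexJSON)
	if err != nil {
		panic(err)
	}
	fmt.Println(cpesFor(idx, "npm", "ansi-regex", "5.0.1")) // dictionary hit
	fmt.Println(cpesFor(idx, "npm", "left-pad", "1.3.0"))   // no entry, falls back to a guess
}
```

The point of the design is that a dictionary hit replaces the guesses entirely, while packages without an entry behave exactly as they do today.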

Example

Using Chainguard's Node.js image, here's what Syft does before this change:

$ syft cgr.dev/chainguard/node -qo json | jq '.artifacts[] | select(.name == "ansi-regex").cpes'
[
  "cpe:2.3:a:ansi-regex:ansi-regex:5.0.1:*:*:*:*:*:*:*",
  "cpe:2.3:a:ansi-regex:ansi_regex:5.0.1:*:*:*:*:*:*:*",
  "cpe:2.3:a:ansi_regex:ansi-regex:5.0.1:*:*:*:*:*:*:*",
  "cpe:2.3:a:ansi_regex:ansi_regex:5.0.1:*:*:*:*:*:*:*",
  "cpe:2.3:a:ansi:ansi-regex:5.0.1:*:*:*:*:*:*:*",
  "cpe:2.3:a:ansi:ansi_regex:5.0.1:*:*:*:*:*:*:*"
]

... and after this change:

$ go run ./cmd/syft cgr.dev/chainguard/node -qo json | jq '.artifacts[] | select(.name == "ansi-regex").cpes'
[
  "cpe:2.3:a:ansi-regex_project:ansi-regex:5.0.1:*:*:*:*:node.js:*:*"
]

It's important to note that in the "before" state, Syft makes several guesses at the CPE for the package, but none of them is correct. With the new approach, Syft gets the CPE right on the first try. This can be confirmed by setting the CPE version to * and querying NVD's CVE data.
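For anyone who wants to repeat that check, here is a small sketch of the "set the version to *" step. The `wildcardVersion` helper is hypothetical, and it assumes a well-formed 13-field CPE 2.3 string with no escaped colons; the resulting string is what you would then use to query NVD.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// versionField is the zero-based position of the version in a CPE 2.3
// formatted string: cpe:2.3:part:vendor:product:version:...
const versionField = 5

// wildcardVersion replaces the version field with "*".
// Assumption: fields contain no escaped colons, so a plain split is enough.
func wildcardVersion(cpe string) (string, error) {
	fields := strings.Split(cpe, ":")
	if len(fields) != 13 || fields[0] != "cpe" || fields[1] != "2.3" {
		return "", errors.New("not a 13-field CPE 2.3 formatted string")
	}
	fields[versionField] = "*"
	return strings.Join(fields, ":"), nil
}

func main() {
	out, err := wildcardVersion("cpe:2.3:a:ansi-regex_project:ansi-regex:5.0.1:*:*:*:*:node.js:*:*")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```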

luhring · 2023-07-09T14:47:27Z

This is odd... 🤔 (link)

Building snapshot artifacts
# create a config with the dist dir overridden
echo "dist: ./snapshot" > ./.tmp/goreleaser.yaml
cat .goreleaser.yaml >> ./.tmp/goreleaser.yaml
# build release snapshots
./.tmp/goreleaser release --clean --skip-publish --skip-sign --snapshot --config ./.tmp/goreleaser.yaml
make: ./.tmp/goreleaser: Command not found
make: *** [Makefile:329: snapshot] Error 127
Error: Process completed with exit code 2.

luhring · 2023-07-09T23:13:51Z

From a suggestion in the community Slack, I tried to run the quality gate locally to make sure this change wouldn't cause any regression in matching quality. I'm about 60% sure I did this correctly. 😆

I used this in .yardstick.yaml in Grype:

tools:

  - name: syft
    # note: we want to use a fixed version of syft for capturing all results (NOT "latest")
    version: v0.74.1
    produces: SBOM
    refresh: false

  - name: syft
    # note: we want to use a fixed version of syft for capturing all results (NOT "latest")
    version: cpe-dict
    produces: SBOM-new
    refresh: false

  - name: grype
    version: latest
    takes: SBOM

  - name: grype
    version: latest
    takes: SBOM-new

And I ran make capture and then make validate. I didn't notice any differences in the output, and got:

Quality gate passed!

...kg/cataloger/common/cpe/dictionary/index-generator/testdata/official-cpe-dictionary_v2.3.xml

syft/pkg/cataloger/common/cpe/dictionary/index-generator/generate.go

wagoodman

Overall looks great! Looking forward to more accurate CPEs being generated -- left one comment about adding unit tests for specific expectations during indexing.

wagoodman · 2023-07-20T15:57:38Z

Makefile

@@ -338,7 +343,7 @@ release:
 	@.github/scripts/trigger-release.sh

 .PHONY: ci-release
-ci-release: ci-check clean-dist $(CHANGELOG)
+ci-release: ci-check clean-dist $(CHANGELOG) cpe-index


Actually, the release step isn't the right place to generate new data. Though it answers the question of how this gets refreshed, it introduces potentially breaking changes right at release time, bypassing testing.

luhring · 2023-07-20

Yeah, I think we can just take that step out, lean on the index that's checked into the repo, and periodically regenerate what's in the repo. I've seen other projects take a similar approach with generated/embedded data. How does that sound to you?

wagoodman

I want to get this change in, but I'm going to take one more pass at validating the downstream impact with grype. I also added one more comment, about how the data gets refreshed, that will need to be addressed. I don't have a specific suggestion yet.

syft/pkg/cataloger/common/cpe/cpe-index.json

wagoodman · 2023-07-21T12:59:19Z

It looks like there wasn't sufficient sample coverage in the labeled data we use in the grype quality gate to demonstrate performance improvement. But! the gate run did show that from the existing samples, there was no degradation in performance (which is half the battle to prove).

I did an ad-hoc analysis with:

python packages: 22
npm packages: 50
gem packages: 31

where each package was selected because it exists in the generated CPE index, there are CVEs that map to its CPE, and affected versions exist in the package repository to use for cataloging... so the chances of matching against these packages were high. Here's what I found:

Additional TPs found due to cpe fix: 14
Additional FNs due to cpe fix: 0
FPs eliminated due to cpe fix: 0
FPs introduced due to cpe fix: 0

Which is a clear win 🎉 ! I deferred looking specifically at the jenkins plugins and rust crates until we can incorporate this analysis approach into the quality gate. This is slightly different than our typical analysis since we want to ignore unmatched labels (partial evaluation) which is flagged as invalid by the current gates. In the future we'll allow for images under test with both modes of analysis.

luhring · 2023-07-21T13:24:23Z

@wagoodman Thanks for the analysis. Glad it's looking all positive!

In case it helps with future coverage testing, one bucket of false positives that this PR solves (and I'm really looking forward to) is matching Ruby's bundled openssl gem with OpenSSL itself. The presence of this CPE in the index — "cpe:2.3:a:ruby-lang:openssl:*:*:*:*:*:*:*:*" — saves Grype from incorrectly matching the gem to loads of OpenSSL vulns that don't at all affect Ruby's wrapper.
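To illustrate why the vendor field matters here, the sketch below compares only the vendor and product fields of two CPEs. It is a deliberate simplification and not how Grype actually matches: the `sameVendorProduct` helper, the name-based guess for the gem, and the `openssl:openssl` CPE standing in for OpenSSL's own vulnerability records are all assumptions for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// vendorProduct extracts the vendor and product fields from a
// CPE 2.3 formatted string (assumes 13 fields, no escaped colons).
func vendorProduct(cpe string) (vendor, product string, ok bool) {
	fields := strings.Split(cpe, ":")
	if len(fields) != 13 {
		return "", "", false
	}
	return fields[3], fields[4], true
}

// sameVendorProduct reports whether two CPEs name the same vendor and product.
func sameVendorProduct(a, b string) bool {
	av, ap, okA := vendorProduct(a)
	bv, bp, okB := vendorProduct(b)
	return okA && okB && av == bv && ap == bp
}

func main() {
	// Assumed CPE for OpenSSL itself in vulnerability data.
	const opensslProper = "cpe:2.3:a:openssl:openssl:*:*:*:*:*:*:*:*"
	// Hypothetical name-based guess for a gem called "openssl".
	const guessed = "cpe:2.3:a:openssl:openssl:*:*:*:*:*:*:*:*"
	// The entry from the index quoted above.
	const indexed = "cpe:2.3:a:ruby-lang:openssl:*:*:*:*:*:*:*:*"

	fmt.Println(sameVendorProduct(guessed, opensslProper)) // guess collides with OpenSSL
	fmt.Println(sameVendorProduct(indexed, opensslProper)) // indexed CPE does not
}
```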

What's the next step here?

wagoodman · 2023-07-21T13:32:58Z

> one bucket of false positives that this PR solves is matching Ruby's bundled openssl gem with OpenSSL itself

nice! The packages I selected were random samples, so I just happened not to run across that case (openssl wasn't in there for certain). That should be an obvious win here 🙌.

Additionally, I just updated the branch to account for how this data gets generated: right now, a weekly workflow regenerates it, and the result shows up as a PR.

~~Let me try out that workflow as a sanity check then I think~~ [edit: I forgot this is a fork, I'll test it on the first run this Monday] this is ready to be merged.

wagoodman

great work 🚀 I added one more test to check that the generated data is wired to the function that uses it.

luhring · 2023-07-21T14:26:14Z

Thanks a million 🙇 ❤️ — very excited about this one

* Introduce indexed embedded CPE dictionary
  Signed-off-by: Dan Luhring <[email protected]>
* Don't generate cpe-index on make snapshot
  Signed-off-by: Dan Luhring <[email protected]>
* Add unit tests for individual addEntry funcs
  Signed-off-by: Dan Luhring <[email protected]>
* migrate CPE index build to go generate and add periodic workflow
  Signed-off-by: Alex Goodman <[email protected]>
* add test to ensure generated cpe index is wired up to function that uses it
  Signed-off-by: Alex Goodman <[email protected]>

---------

Signed-off-by: Dan Luhring <[email protected]>
Signed-off-by: Alex Goodman <[email protected]>
Co-authored-by: Alex Goodman <[email protected]>

luhring force-pushed the cpe-dict branch from 8fae3cb to a8a2b6c Compare June 27, 2023 18:51

luhring force-pushed the cpe-dict branch 3 times, most recently from 810f128 to ff95fbb Compare July 9, 2023 01:19

luhring mentioned this pull request Jul 9, 2023

make snapshot fails locally on main branch #1923

Open

luhring marked this pull request as ready for review July 9, 2023 23:14

rawlingsj mentioned this pull request Jul 10, 2023

add kube-fluentd-operator chainguard-images/images#1086

Merged

wagoodman reviewed Jul 17, 2023

...kg/cataloger/common/cpe/dictionary/index-generator/testdata/official-cpe-dictionary_v2.3.xml

wagoodman reviewed Jul 17, 2023

syft/pkg/cataloger/common/cpe/dictionary/index-generator/generate.go

wagoodman reviewed Jul 17, 2023

luhring added 3 commits July 18, 2023 13:08

Introduce indexed embedded CPE dictionary

9987e6a

Signed-off-by: Dan Luhring <[email protected]>

Don't generate cpe-index on make snapshot

d7ab39d

Signed-off-by: Dan Luhring <[email protected]>

Add unit tests for individual addEntry funcs

1a2b4ae

Signed-off-by: Dan Luhring <[email protected]>

luhring force-pushed the cpe-dict branch from 6fcc0a9 to 1a2b4ae Compare July 18, 2023 17:09

wagoodman force-pushed the main branch from c67c76e to 88b3d1e Compare July 18, 2023 18:10

wagoodman approved these changes Jul 19, 2023

wagoodman reviewed Jul 20, 2023

wagoodman requested changes Jul 20, 2023

wagoodman reviewed Jul 20, 2023

syft/pkg/cataloger/common/cpe/cpe-index.json

wagoodman self-assigned this Jul 20, 2023

migrate CPE index build to go generate and add periodic workflow

8b0ed81

Signed-off-by: Alex Goodman <[email protected]>

add test to ensure generated cpe index is wired up to function that uses it

5d2dfd5

Signed-off-by: Alex Goodman <[email protected]>

wagoodman approved these changes Jul 21, 2023

wagoodman enabled auto-merge (squash) July 21, 2023 13:50

wagoodman merged commit 99d172f into anchore:main Jul 21, 2023

luhring deleted the cpe-dict branch July 21, 2023 14:25

kzantow added the enhancement New feature or request label Jul 31, 2023

wagoodman mentioned this pull request Nov 2, 2023

Annotate where each CPE on a package is sourced from #2282

Closed
