Some "all.zip" files do not contain all files #2575

Closed
martin-bpw opened this issue Sep 7, 2024 · 5 comments

martin-bpw commented Sep 7, 2024

Describe the bug
We recently noticed that the all.zip files are missing JSON files that are present in the corresponding folder, sometimes a few and sometimes quite a lot. The folders are listed at https://storage.googleapis.com/osv-vulnerabilities/index.html.
This looks like a bug, as the spec says:

https://google.github.io/osv.dev/data/#data-dumps

...This bucket contains individual entries of the format gs://osv-vulnerabilities//.json as well as a zip containing all vulnerabilities for each ecosystem at gs://osv-vulnerabilities//all.zip....

To Reproduce
Compare the number of files in each all.zip with the number of files in the corresponding folder.
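
A minimal sketch of that comparison, assuming the zip has already been downloaded into memory and the folder listing is available as a set of file names. The function name `compare_zip_to_listing` and the `EX-*` record names are hypothetical and only serve the illustration:

```python
import io
import zipfile


def compare_zip_to_listing(zip_bytes: bytes, folder_names: set[str]) -> dict[str, list[str]]:
    """Compare the .json members of an all.zip with a folder listing."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        zipped = {name for name in archive.namelist() if name.endswith('.json')}
    # Ignore non-record entries such as all.zip itself.
    in_folder = {name for name in folder_names if name.endswith('.json')}
    return {
        'only_in_folder': sorted(in_folder - zipped),
        'only_in_zip': sorted(zipped - in_folder),
    }


def _demo_zip(names):
    """Build a small in-memory zip standing in for a downloaded all.zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            archive.writestr(name, '{}')
    return buffer.getvalue()


report = compare_zip_to_listing(
    _demo_zip(['EX-1.json', 'EX-2.json']),
    {'EX-1.json', 'EX-2.json', 'EX-3.json', 'all.zip'},
)
print(report)
```

Reporting the differing names rather than only the counts makes it easier to see which records cause the mismatch.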

Expected behaviour
The number of files in the zip should be the same as in the folder.

Additional context
I created a short report on the current state (7.9.2024):

AlmaLinux/: ZIP: 3086, FOLDER: 3087
AlmaLinux:8/: ZIP: 2353, FOLDER: 2355
AlmaLinux:9/: ZIP: 733, FOLDER: 735
Alpine/: ZIP: 3432, FOLDER: 3531
Alpine:v3.10/: ZIP: 1487, FOLDER: 1490
Alpine:v3.11/: ZIP: 1596, FOLDER: 1605
Alpine:v3.12/: ZIP: 1706, FOLDER: 1748
Alpine:v3.13/: ZIP: 1796, FOLDER: 1839
Alpine:v3.14/: ZIP: 1917, FOLDER: 1966
Alpine:v3.15/: ZIP: 2034, FOLDER: 2084
Alpine:v3.16/: ZIP: 2126, FOLDER: 2179
Alpine:v3.17/: ZIP: 2238, FOLDER: 2325
Alpine:v3.18/: ZIP: 2242, FOLDER: 2339
Alpine:v3.19/: ZIP: 2292, FOLDER: 2302
Alpine:v3.2/: ZIP: 301, FOLDER: 305
Alpine:v3.20/: ZIP: 2277, FOLDER: 2287
Alpine:v3.3/: ZIP: 464, FOLDER: 470
Alpine:v3.4/: ZIP: 659, FOLDER: 663
Alpine:v3.5/: ZIP: 805, FOLDER: 809
Alpine:v3.6/: ZIP: 881, FOLDER: 887
Alpine:v3.7/: ZIP: 1034, FOLDER: 1039
Alpine:v3.8/: ZIP: 1188, FOLDER: 1195
Alpine:v3.9/: ZIP: 1319, FOLDER: 1322
Android/: ZIP: 2120, FOLDER: 2476
Bitnami/: ZIP: 4406, FOLDER: 7711
CRAN/: ZIP: 10, FOLDER: 10
Chainguard/: ZIP: 13193, FOLDER: 13193
DWF/: ZIP: 0, FOLDER: 30
Debian/: ZIP: 17194, FOLDER: 18171
Debian:10/: ZIP: 1830, FOLDER: 8712
Debian:11/: ZIP: 7223, FOLDER: 7236
Debian:12/: ZIP: 6518, FOLDER: 6537
Debian:13/: ZIP: 6056, FOLDER: 6164
Debian:3.0/: ZIP: 727, FOLDER: 773
Debian:3.1/: ZIP: 649, FOLDER: 653
Debian:4.0/: ZIP: 669, FOLDER: 670
Debian:5.0/: ZIP: 733, FOLDER: 736
Debian:6.0/: ZIP: 1152, FOLDER: 1152
Debian:7/: ZIP: 1796, FOLDER: 1796
Debian:8/: ZIP: 1826, FOLDER: 1826
Debian:9/: ZIP: 1568, FOLDER: 1568
GIT/: ZIP: 31694, FOLDER: 57517
GSD/: ZIP: 7, FOLDER: 37
GitHub Actions/: ZIP: 19, FOLDER: 20
Go/: ZIP: 3472, FOLDER: 3473
Hackage/: ZIP: 19, FOLDER: 19
Hex/: ZIP: 30, FOLDER: 30
JavaScript/: ZIP: 1, FOLDER: 1
Linux/: ZIP: 15909, FOLDER: 15910
Maven/: ZIP: 5075, FOLDER: 5076
NuGet/: ZIP: 1367, FOLDER: 1373
OSS-Fuzz/: ZIP: 3588, FOLDER: 3588
Packagist/: ZIP: 4046, FOLDER: 4047
Pub/: ZIP: 10, FOLDER: 13
PyPI/: ZIP: 13982, FOLDER: 13985
Rocky Linux/: ZIP: 1333, FOLDER: 1333
Rocky Linux:8/: ZIP: 1008, FOLDER: 1008
Rocky Linux:9/: ZIP: 327, FOLDER: 327
ecosystems.txt: ZIP: 0, FOLDER: 1
index.html: ZIP: 0, FOLDER: 1
RubyGems/: ZIP: 1653, FOLDER: 1653
SwiftURL/: ZIP: 35, FOLDER: 35
UVI/: ZIP: 1, FOLDER: 1
Ubuntu/: ZIP: 5446, FOLDER: 39883
Ubuntu:14.04:LTS/: ZIP: 1593, FOLDER: 10370
Ubuntu:16.04:LTS/: ZIP: 1483, FOLDER: 11363
Ubuntu:18.04:LTS/: ZIP: 1700, FOLDER: 3411
Ubuntu:20.04:LTS/: ZIP: 1763, FOLDER: 9928
Ubuntu:22.04:LTS/: ZIP: 1015, FOLDER: 8177
Ubuntu:23.10/: ZIP: 274, FOLDER: 274
Ubuntu:24.04:LTS/: ZIP: 133, FOLDER: 6081
Ubuntu:Pro:14.04:LTS/: ZIP: 554, FOLDER: 4826
Ubuntu:Pro:16.04:LTS/: ZIP: 972, FOLDER: 20630
Ubuntu:Pro:18.04:LTS/: ZIP: 517, FOLDER: 15030
Ubuntu:Pro:20.04:LTS/: ZIP: 134, FOLDER: 2011
Ubuntu:Pro:22.04:LTS/: ZIP: 89, FOLDER: 1402
Ubuntu:Pro:24.04:LTS/: ZIP: 6, FOLDER: 771
Wolfi/: ZIP: 8224, FOLDER: 8224
crates.io/: ZIP: 1461, FOLDER: 1461
icons/: ZIP: 0, FOLDER: 4
npm/: ZIP: 19047, FOLDER: 19052

It might be related to timestamps, as a certain pattern can be spotted (two screenshots were attached to the original issue).
@andrewpollock
Contributor

In the short term, this discrepancy needs to be documented in the FAQ; in the longer term, it needs to be fixed in our exporter (#2329 touches on this a little as well).

Essentially, the all.zip files are canonical. The individual records in GCS are not. An individual record may have existed and been exported at some point in the past but no longer exist, and such leftover files do not (currently) get cleaned up.
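
For consumers, treating all.zip as the canonical source could look like the sketch below. It assumes the zip is already in memory and that each member is one JSON record with an `id` field, as in the OSV schema. The helper name `load_records` and the demo records are hypothetical:

```python
import io
import json
import zipfile


def load_records(zip_bytes: bytes) -> dict[str, dict]:
    """Parse every .json member of an all.zip, keyed by record id."""
    records = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith('.json'):
                continue
            record = json.loads(archive.read(name))
            # Fall back to the file name if a record lacks an id.
            records[record.get('id', name[:-len('.json')])] = record
    return records


# Build a tiny in-memory zip standing in for a downloaded all.zip.
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, 'w', zipfile.ZIP_DEFLATED) as _archive:
    _archive.writestr('EX-1.json', json.dumps({'id': 'EX-1', 'summary': 'demo'}))
    _archive.writestr('EX-2.json', json.dumps({'id': 'EX-2', 'summary': 'demo'}))

records = load_records(_buffer.getvalue())
print(sorted(records))
```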

    def _export_ecosystem_to_bucket(self, ecosystem: str, tmp_dir: str):
      """Export the vulnerabilities in an ecosystem to GCS.

      Args:
        ecosystem: the ecosystem name
        tmp_dir: temporary directory for scratch

      This simultaneously exports every Bug for the given ecosystem to individual
      files in the scratch filesystem, and a zip file in the scratch filesystem.

      At the conclusion of this export, all of the files in the scratch filesystem
      (including the zip file) are uploaded to the GCS bucket.
      """
      logging.info('Exporting vulnerabilities for ecosystem %s', ecosystem)
      storage_client = storage.Client()
      bucket = storage_client.get_bucket(self._export_bucket)

      zip_path = os.path.join(tmp_dir, 'all.zip')
      with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zip_file:
        files_to_zip = []

        @ndb.tasklet
        def _export_to_file_and_zipfile(bug):
          """Write out a bug record to both a single file and the zip file."""
          if not bug.public or bug.status == osv.BugStatus.UNPROCESSED:
            return

          file_path = os.path.join(tmp_dir, bug.id() + '.json')
          vulnerability = yield bug.to_vulnerability_async(include_source=True)
          osv.write_vulnerability(vulnerability, file_path)

          files_to_zip.append(file_path)

        # This *should* pause here until
        # all the exports have been written to disk.
        osv.Bug.query(
            osv.Bug.ecosystem == ecosystem).map(_export_to_file_and_zipfile)

        files_to_zip.sort()
        for file_path in files_to_zip:
          zip_file.write(file_path, os.path.basename(file_path))

      with concurrent.futures.ThreadPoolExecutor(
          max_workers=_EXPORT_WORKERS) as executor:
        # Note: all.zip is included here
        for filename in os.listdir(tmp_dir):
          executor.submit(self.upload_single, bucket,
                          os.path.join(tmp_dir, filename),
                          f'{ecosystem}/{filename}')

The above is the relevant code. One possible solution is to add a deletion run at the end, or some other reverse check.
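
A rough sketch of what such a deletion run might look like; this is an illustration, not the exporter's actual fix. The names `find_stale_names` and `delete_stale_objects` are hypothetical. The `bucket` argument is assumed to behave like a google-cloud-storage `Bucket` (`list_blobs(prefix=...)` returning objects with `.name` and `.delete()`), and the demo uses a small fake in its place:

```python
def find_stale_names(existing_names, exported_filenames, ecosystem):
    """Return object names under '<ecosystem>/' that were not just exported."""
    prefix = f'{ecosystem}/'
    keep = {prefix + filename for filename in exported_filenames}
    return sorted(
        name for name in existing_names
        if name.startswith(prefix) and name not in keep)


def delete_stale_objects(bucket, ecosystem, exported_filenames):
    """Delete leftover objects for one ecosystem and return their names."""
    # The trailing slash keeps 'Debian/' from matching 'Debian:10/...'.
    existing = {blob.name: blob
                for blob in bucket.list_blobs(prefix=f'{ecosystem}/')}
    stale = find_stale_names(existing, exported_filenames, ecosystem)
    for name in stale:
        existing[name].delete()
    return stale


class _FakeBlob:
    """Stand-in for a storage blob, for the demo only."""

    def __init__(self, name, store):
        self.name = name
        self._store = store

    def delete(self):
        self._store.discard(self.name)


class _FakeBucket:
    """Stand-in for a storage bucket, for the demo only."""

    def __init__(self, names):
        self.names = set(names)

    def list_blobs(self, prefix=None):
        return [_FakeBlob(name, self.names)
                for name in sorted(self.names)
                if name.startswith(prefix or '')]


demo_bucket = _FakeBucket([
    'Debian/A.json', 'Debian/B.json', 'Debian/all.zip', 'Debian:10/B.json'])
# In the exporter, the exported file names would be os.listdir(tmp_dir).
removed = delete_stale_objects(demo_bucket, 'Debian', ['A.json', 'all.zip'])
print(removed)
```

Running the deletion only after the uploads have finished would avoid a window in which a still-valid record is missing from the bucket.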

There's some conceptual similarity with code added to the importer in #2030

@andrewpollock andrewpollock added the documentation Improvements or additions to documentation label Sep 9, 2024
@andrewpollock andrewpollock self-assigned this Sep 9, 2024
@martin-bpw
Author

Thank you very much for the quick feedback. It already helps to know that all.zip is the preferred source.

@andrewpollock
Contributor

I think that with the recent work @hogo6002 did to adjust how our exporting works, we may be able to almost call this "done".

I think a review and refresh of what is stated at https://google.github.io/osv.dev/data/#data-dumps is all that is necessary.

@andrewpollock
Contributor

Actually @hogo6002 already made the necessary documentation changes in #2784 so I think we can call this done.

@martin-bpw
Author

Thank you, nice!
