Wrong dump data for Debian packages #1

Open
armijnhemel opened this issue Oct 31, 2022 · 4 comments

@armijnhemel

Not sure if this should go here or another repository, so feel free to move.

I just looked at deb-purls-aa.json.zst and saw this line:

{"purl":"pkg:deb/0ad@0.0.23.1-4","download_url":"http://ftp.debian.org/debian/pool/main/0/0ad/0ad_0.0.23.1.orig.tar.xz"}

The package version and the referenced source file do not match: the file in download_url is the original upstream tarball, which is the same for multiple Debian revisions. The version only becomes -4 after the Debian-specific patches are applied, so these should probably be included as well. The patches for -4 are no longer available via the Debian FTP, but those for -5 are.

The .dsc file for -5 says:

Files:
 4fa111410ea55de7a013406ac1013668 31922812 0ad_0.0.23.1.orig.tar.xz
 43a5bf77192a8eebdbe763cdd1d72fa3 73620 0ad_0.0.23.1-5.debian.tar.xz

So possibly you should not have this as a single download URL, but as a list of download URLs.
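
As a rough sketch of what "a list of download URLs" could look like, the snippet below parses the Files stanza quoted above and joins each file name onto the pool directory from the original download_url. The names `SourceFile` and `parse_dsc_files` are hypothetical, and it is an assumption that every listed file lives in that same pool directory.

```python
from dataclasses import dataclass
from typing import List

# Pool directory taken from the download_url quoted in this issue.
POOL_URL = "http://ftp.debian.org/debian/pool/main/0/0ad/"

# The "Files:" stanza quoted from the -5 .dsc file.
DSC_FILES = """\
Files:
 4fa111410ea55de7a013406ac1013668 31922812 0ad_0.0.23.1.orig.tar.xz
 43a5bf77192a8eebdbe763cdd1d72fa3 73620 0ad_0.0.23.1-5.debian.tar.xz
"""


@dataclass(frozen=True)
class SourceFile:
    """One entry of a Files stanza (hypothetical helper type)."""
    md5: str
    size: int
    url: str


def parse_dsc_files(stanza: str, pool_url: str) -> List[SourceFile]:
    """Turn the indented lines of a Files stanza into download entries."""
    files = []
    for line in stanza.splitlines():
        # Only the indented continuation lines carry file entries.
        if not line.startswith(" "):
            continue
        md5, size, name = line.split()
        # Assumption: the file sits directly in the given pool directory.
        files.append(SourceFile(md5, int(size), pool_url + name))
    return files


download_urls = [f.url for f in parse_dsc_files(DSC_FILES, POOL_URL)]
```

With this shape, the record for -5 would carry both the upstream tarball and the Debian patch archive instead of the tarball alone.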

Also, with Debian these URLs tend to get moved (granted, after many years) to their archive. It might be good to take a closer look at aboutcode-org/fetchcode#82

@pombredanne
Member

@armijnhemel you have eagle eyes! Thanks for the report.
I do not yet have a good, mostly universal solution for cases where multiple download URLs exist for a single package, like the one you found, where patches and sources are combined into a binary.

@pombredanne
Member

The point is that, for now, the model is one download URL == one record in the purldb.
We can, however, track multiple purls for the related source packages, though we do not have the proper DB models and relationships yet.

@armijnhemel
Author

> The point is that for now the model is to have one download URL == one record in the purldb
> We can however track multiple purls for the related source packages though we do not have the proper DB models and relationship yet

Having thought a bit about this there are some other issues as well, which can possibly interfere (not in this particular case, but in general).

First of all, there is the situation where there are multiple files/download URLs that point to the same package. For example, let's look at GNU binutils: https://ftp.gnu.org/gnu/binutils/

For 2.30 there are four distinct downloads: a .tar.bz2, a .tar.gz, a .tar.lz and a .tar.xz. These are all equivalent and should map to the same package URL and possibly back as well.
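
A minimal sketch of how such equivalent downloads could be grouped under one key: strip the known archive suffix and collect the URLs per release. The file names of the form `binutils-2.30.<suffix>` and the helper names are assumptions made for illustration; how the release key is then turned into a package URL is left open.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# The four archive suffixes mentioned above.
ARCHIVE_SUFFIXES = (".tar.bz2", ".tar.gz", ".tar.lz", ".tar.xz")
BASE_URL = "https://ftp.gnu.org/gnu/binutils/"


def release_key(filename: str) -> Optional[str]:
    """Return the file name without its archive suffix, or None if unknown."""
    for suffix in ARCHIVE_SUFFIXES:
        if filename.endswith(suffix):
            return filename[: -len(suffix)]
    return None


def group_by_release(filenames: List[str]) -> Dict[str, List[str]]:
    """Map each release key to all download URLs that carry the same content."""
    groups = defaultdict(list)
    for name in filenames:
        key = release_key(name)
        if key is not None:
            groups[key].append(BASE_URL + name)
    return dict(groups)


# Assumed file names for the 2.30 release.
groups = group_by_release([
    "binutils-2.30.tar.bz2",
    "binutils-2.30.tar.gz",
    "binutils-2.30.tar.lz",
    "binutils-2.30.tar.xz",
])
```

Here `groups` has a single key, `binutils-2.30`, with four URLs, which is the many-to-one mapping described above; going back from the package to its downloads is just a lookup in the same dictionary.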

Then there is the situation where multiple components/sources are used in a certain configuration (like in the Debian example). So what I could envision is that download_url for a version would be something like this:

download_url = [
    [url1, patch1, patch2],
    [url2, patch1, patch2]
]

Or something like that.
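
To make the nested structure a bit more concrete, here is one possible typed version of it. This is only a sketch: `SourceSet`, `PackageDownloads` and the placeholder purl are hypothetical and not part of the purldb models.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class SourceSet:
    """One way to obtain the source: an upstream archive plus ordered patches."""
    upstream_url: str
    patch_urls: Tuple[str, ...] = ()


@dataclass
class PackageDownloads:
    """All equivalent source sets for a single package version."""
    purl: str
    alternatives: List[SourceSet] = field(default_factory=list)

    def as_nested_list(self) -> List[List[str]]:
        """Render the alternatives in the nested-list shape proposed above."""
        return [[s.upstream_url, *s.patch_urls] for s in self.alternatives]


# Placeholder values mirroring url1/url2 and patch1/patch2 from the comment.
record = PackageDownloads(
    purl="pkg:example/name@version",
    alternatives=[
        SourceSet("url1", ("patch1", "patch2")),
        SourceSet("url2", ("patch1", "patch2")),
    ],
)
nested = record.as_nested_list()
```

Each inner list is one complete recipe (archive first, then patches in order), so equivalent archives such as the .tar.gz and .tar.xz variants become separate alternatives that share the same patches.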

@armijnhemel
Author

armijnhemel commented Nov 1, 2022

Some more thoughts: Debian typically renames the original files (to something like foo_bar-1.0.orig.tar.gz if the original is called foo_bar-1.0.tar.gz). It also lowercases the files and replaces - with _.
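
For recognising such renamed files, a small sketch: it assumes the usual Debian naming of `<source>_<upstream-version>.orig.tar.<ext>` and splits a file name into the source package name and upstream version. The function name is hypothetical, and whether the result is then mapped to the upstream or the Debian package is a separate decision.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: <source>_<upstream-version>.orig.tar.<ext>
ORIG_RE = re.compile(
    r"^(?P<name>[^_]+)_(?P<version>.+)\.orig\.tar\.(?P<ext>[a-z0-9]+)$"
)


def parse_orig_name(filename: str) -> Optional[Tuple[str, str]]:
    """Return (source name, upstream version) for an .orig tarball, else None."""
    match = ORIG_RE.match(filename)
    if match is None:
        return None
    return match.group("name"), match.group("version")


# File names taken from the .dsc excerpt earlier in this issue.
orig = parse_orig_name("0ad_0.0.23.1.orig.tar.xz")
patches = parse_orig_name("0ad_0.0.23.1-5.debian.tar.xz")
```

Here `orig` is `("0ad", "0.0.23.1")`, while the `.debian.tar.xz` file does not match and yields `None`, so a standalone .orig tarball can at least be told apart from the Debian-specific parts.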

A question: when encountering these (without patches or other files, just standalone), should they be mapped to the original package or to the Debian package? There is something to be said for both.
