Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: make it clear we're working with PURLs #468

Merged
merged 6 commits into from
Jul 1, 2024

Conversation

ctron
Copy link
Contributor

@ctron ctron commented Jun 27, 2024

As it feels the confusion happens quite often, even to me, I think we should rename things.

We have "packages", which have some meaning according to our glossary: https://github.com/trustification/trustify/blob/main/docs/glossary.md#package

A package is an atomic artifact or component. Packages may be addressed using pURLs. A package may be described by an SBOM describing how it is created and its contents. A package may certainly contain other packages (e.g. shading one Java jar into another). A package may also be the sole member of a Product (UBI-8.0.13-x86.oci may be the singular package within the "UBI 8.0.13-x86" product). A package is one step more abstract than an artifact.

However, in the next section that definition actually calls out PURLs:

Package URLs (pURLs) are possibly ambiguous names applied to packages. […]

So there are "packages", and "PURLs", which can be used to give "packages" a name.

In our current API and database structure, we have (because history) several entries, endpoints, and services which are named "packages", but actually deal with PURLs only.

This PR tries to clean up the user facing API (REST endpoints) by changing the base URL from /api/v1/package to /api/v1/package/by-url, when it comes to PURL based calls.

Next steps

It would also be reasonable to rename functions, services, and database structures towards PURLs. But naming is hard, so let's start small.

Alternatives

One alternative could be to use /api/v1/purl. However, those API still deal with packages, just by using PURLs as name. So keeping "packages" make sense to me.

@JimFuller-RedHat
Copy link
Collaborator

random aside - in corgi land we sidestepped the overloading of the term 'package' by just using the equally imprecise term of 'component' which always has a 1:1 relationship with a single PURL. That gave us some 'wiggle' room if ever we had to assign a purl to something like a single script or something less then a package.

@bobmcwhirter
Copy link
Contributor

I think in my mind, a package is something that can be referenced by a pURL.

A pURL is just a name for a pURL-addressible thing.

The ambiguity I think I was talking about in the glossary is more around... a pURL doesn't describe the binary bytes produced. Those would be artifacts.

eg, a specific build of pkg:maven/org.apache/[email protected] with a given hash of sha256:<blah> would be a non-ambiguous artifact, based on the package referenced by pkg:maven/org.apache/[email protected].

"Bob McWhirter" is a name, somewhat ambiguous (there's at least one Canadian and one Arizonan sharing my name). This particular Bob who is typing now is the the artifact with my particular sha256. I think?

All that being said, I'm not opposed to by-url, but that makes me wonder, is there any other non-url way to discuss a package, beyond our internal uuid? Or do we consider that sufficient to bifurcate?

I think similarly, while other systems may store "URLs", they are ultimately speaking about the thing behind the URL, e.g., the endpoint, the webpage, the podcast you can download.

Naming is hard. Naming things that name things is even harder.

Do we work with names? Do we work with things, and just use names to refer to them?

I dunno.

@bobmcwhirter
Copy link
Contributor

I guess I may fall into the camp of "a package is anything that can be described by a pURL" which might as well just be a pURL. Maybe.

@ctron
Copy link
Contributor Author

ctron commented Jun 27, 2024

All that being said, I'm not opposed to by-url, but that makes me wonder, is there any other non-url way to discuss a package, beyond our internal uuid? Or do we consider that sufficient to bifurcate?

Right now, all that data comes from PURLs. We also do have packages, but in the context of an SBOM. Going in the direction of what @JimFuller-RedHat mentioned before I guess.

We also have CPEs. And we also have file-hashes in the context of an SBOM. All of those reference "packages of an SBOM".

I think we are on the same page, PURLs are names for packages. But one package can have multiple PURLs assigned, and additionally other names of other types (like CPEs, digests, and whatever the future will bring).

If we'd add functions/endpoints to search for CPEs or hashes. Where would they end up in the HTTP endpoint namespace?

@JimFuller-RedHat
Copy link
Collaborator

when you say 'But one package can have multiple PURLs assigned,'

Do you mean a package can contain other packages which have purls or do you mean a single package can have multiple purls ?

If the later then eg. a single unique id assigned to a single package creates a contract ... put another way if we do have other purls that need to map to a package I would suggest pointing to canonical purl (ultimately this is 'our purl ') rather then directly map to package.

This is a subtle difference but we have already seen differences in who mints a purl ... sometimes we are the first person to do so, then sometime later upstream mints a purl. Also we should expect wholesale refactoring of purls a real possibility as that proto spec matures and upstream work more with it ... it becomes burdensome to have to update all associations in database, etc. when this happens.

From a REST API (conceptual model) pov we can make the association transparent ... eg the above applies more to the logical internal model to handle unknown change later on .

Lastly, I could be easily convinced I am overthinking all this....

@bobmcwhirter
Copy link
Contributor

I guess I was thinking CPEs are one possible name/identifier for products (the stuff Dejan is working on).

@ctron
Copy link
Contributor Author

ctron commented Jun 27, 2024

In SPDX-land, a single package can have a list of external references, which can by of type purl: https://spdx.github.io/spdx-spec/v2.3/package-information/#721-external-reference-field … which leads to each package having a list of cpes or purls attached, if I get that right.

And I do recall a conversation where that might actually make sense. Publishing the same artifact to multiple maven repositories.

@bobmcwhirter
Copy link
Contributor

I know we're going into the weeds here, but I also think pkg:maven/com.foo/[email protected] implies ?repository_url=m2.maven.org or whatever maven central's repo is.

So the same jar published elsewhere would have to include ?repository_url=somewhere.else

@JimFuller-RedHat
Copy link
Collaborator

+1 to external reference ... and maybe all this kind of stuff be represented as labels/tags ? maybe that is overloading generic thing too much and worth representing in core model as 'external-reference' ... that makes it clearly different then the canonical purl.

@ctron
Copy link
Contributor Author

ctron commented Jun 27, 2024

I think the model that we have is actually quite good. We have base PURLs, versioned PURLs, qualified PURLs. Referencing those from SBOM packages/components. Also towards CPEs and hashes.

It's only the names that feel confusing.

@carlosthe19916
Copy link
Member

just to be sure. The current way the UI is fetching packages is:

  • First step: get the list of packages from GET /api/v1/package. Each item from the list obtained here has a field uuid. Then
  • Second step: get a single package using GET /api/v1/package/{uuid}

Are those step still going to be valid? I would expect "yes" right?

@jcrossley3
Copy link
Contributor

As it feels the confusion happens quite often, even to me, I think we should rename things.

Can you describe the confusion that happens quite often? What problems are we solving?

This PR tries to clean up the user facing API (REST endpoints) by changing the base URL from /api/v1/package to /api/v1/package/by-url, when it comes to PURL based calls.

I don't see how this clarifies things. To me, /api/v1/package/{purl} is pretty clear. I'm asking for the package referenced by a purl.

I also don't understand the terms "base", "qualified" and "versioned". Will the consumers of our API understand them?

@carlosthe19916
Copy link
Member

Also, a general thought:

  • There must be a single way of fetching an Entity (single entity). I mean, having /api/v1/package/by-purl/{key} and /api/v1/package/{key} seems illogical to me.
  • I have no issues with multiple "search" endpoints (list of entities) /api/v1/package?q=abc and /api/v1/package/by-purl?q=abc where these endpoints return a paginated result.

@ctron
Copy link
Contributor Author

ctron commented Jun 27, 2024

As it feels the confusion happens quite often, even to me, I think we should rename things.

Can you describe the confusion that happens quite often? What problems are we solving?

I also did stumble over this myself when navigating the code. You see something that's named "package", but indeed it's a PURL.

This PR tries to clean up the user facing API (REST endpoints) by changing the base URL from /api/v1/package to /api/v1/package/by-url, when it comes to PURL based calls.

I don't see how this clarifies things. To me, /api/v1/package/{purl} is pretty clear. I'm asking for the package referenced by a purl.

How would you ask then for a package by CPE?

I also don't understand the terms "base", "qualified" and "versioned". Will the consumers of our API understand them?

I don't know, but they already existed. And, they only make sense on the context of a PURL. Not in the context of a "package". So it might be that you are too confused by the naming.

@ctron
Copy link
Contributor Author

ctron commented Jun 27, 2024

just to be sure. The current way the UI is fetching packages is:

* First step: get the list of packages from `GET /api/v1/package`. Each item from the list obtained here has a field `uuid`. Then

* Second step: get a single package using `GET /api/v1/package/{uuid}`

Are those step still going to be valid? I would expect "yes" right?

That depends on what you mean by "package". Package of an SBOM no. Package referenced by a CPE, no. Package referenced by a PURLs, yes.

@ctron
Copy link
Contributor Author

ctron commented Jun 27, 2024

Also, a general thought:

* There must be a single way of fetching an Entity (single entity). I mean, having `/api/v1/package/by-purl/{key}` and `/api/v1/package/{key}` seems illogical to me.

Correct. But indeed what you get back is information coming from the PURL. Not from the package itself. Basically it splits the PURL into a lot of database entities and returns that.

* I have no issues with multiple "search" endpoints (list of entities) `/api/v1/package?q=abc` and `/api/v1/package/by-purl?q=abc` where these endpoints return a paginated result.

That's why one idea was to just go for /api/v1/purl. Because basically you're only searching for PURLs. And they are references packages inside of SBOMs. And so they reference SBOMs. But you could also find an SBOM package by CPE or hash (not today, but hopefully in the future).

@jcrossley3
Copy link
Contributor

That's why one idea was to just go for /api/v1/purl. Because basically you're only searching for PURLs.

What is the use case for that? To me, a purl is not a resource -- it's just the name of a resource, in this case a Package.

Do we have a need to return a paginated list of names?

@jcrossley3
Copy link
Contributor

Could some of the confusion be coming from the existence of both a package and an sbom_package table? I would expect an sbom record to relate to a package record somehow.

@ctron
Copy link
Contributor Author

ctron commented Jun 28, 2024

That's why one idea was to just go for /api/v1/purl. Because basically you're only searching for PURLs.

What is the use case for that? To me, a purl is not a resource -- it's just the name of a resource, in this case a Package.

Do we have a need to return a paginated list of names?

Looks like that. If you take a look at e.g. PackageService::package_by_uuid, which returns PackageDetails, that's simply a decomposed PURL, stored in tables. And it's absolutely not about "packages".

Could some of the confusion be coming from the existence of both a package and an sbom_package table? I would expect an sbom record to relate to a package record somehow.

Right, the package table was there before. But it only stores PURLs, not packages. And sbom record relates to an sbom_package, which relates to zero or more packages (which indeed are PURLs). It also relates to one or more cpes, which is correct.

That's why I think it makes sense to rename services and entities too. But before doing all the work, I think we should agree on that.

@jcrossley3
Copy link
Contributor

Right, the package table was there before.

Before what? If you mean before we implemented this, then I might question why the existing package table wasn't incorporated into that design.

But it only stores PURLs, not packages.

This sentence is confusing. A PURL is just a name. It's a logical concatenation of attributes of a package (the concept, not the table). A package will have those attributes whether a particular SBOM says it has a PURL/CPE/identifier or not, right?

From the above doc, "SBOM packages are an entity of their own, but may have zero or more PURLs or CPEs (or other identifiers)". That sentence betrays a lack of shared understanding, I think. In an abstract sense, a package will always have a PURL, because it can be constructed from the package's attributes. From a literal reading of the spec, a package may not have an identifier, but is that what we should be modeling in the database?

My intuition is that our database designers -- Jens and Bob -- need to agree on whether we're modeling entities in the abstract or according to a spec.

@ctron
Copy link
Contributor Author

ctron commented Jun 28, 2024

Right, the package table was there before.

Before what? If you mean before we implemented this, then I might question why the existing package table wasn't incorporated into that design.

Because it didn't fit.

But it only stores PURLs, not packages.

This sentence is confusing. A PURL is just a name. It's a logical concatenation of attributes of a package (the concept, not the table). A package will have those attributes whether a particular SBOM says it has a PURL/CPE/identifier or not, right?

I don't think it's the sentence that is confusing, but the structure in the database. The package, versioned_package, qualified_package table only store decomposed parts of PURLs. I see the reason for that. It's the naming that is confusing!

Actually the package does not. It comes from the context of the SBOM. The same package can be named by different SBOM by different names.

In an abstract sense, a package will always have a PURL, because it can be constructed from the package's attributes. From a literal reading of the spec, a package may not have an identifier, but is that what we should be modeling in the database?

Again, in the database we model PURLs (with tables mentioned above). Not packages.

I think the database structure works just fine. We have (actual) SBOM packages, named by PURLs and CPEs. (any maybe others in the future). It simply is the naming.

@bobmcwhirter
Copy link
Contributor

While I think a package is "anything that can be address by a Package URL (pURL)" I don't hold it strongly enough, nor do I have enough about SBOM packages in my head to disagree with @ctron right now.

My vote would be to proceed (after we winnow down other outstanding package-related PRs.......) and see how we feel with the purl-centric naming.

),
)]
#[get("/v1/package/base/{uuid}")]
#[get("/v1/package/by-purl/base/{id}")]
pub async fn get(
service: web::Data<PackageService>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to rename the service to PurlService?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do, I'd argue the api should become #[get("/v1/purl/base/{id}")]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not the service, but the functions and types that deal with PURLs. The main namespace/module still is package, which is ok.

@ctron
Copy link
Contributor Author

ctron commented Jul 1, 2024

I updated the PR, renaming the package tables to "purl". Renaming the fields as well.

@ctron ctron requested a review from bobmcwhirter July 1, 2024 09:18
@ctron ctron force-pushed the feature/rename_to_purl_1 branch from 10d7537 to 4183a2f Compare July 1, 2024 09:28
@ctron ctron force-pushed the feature/rename_to_purl_1 branch from 4183a2f to 3cd6e71 Compare July 1, 2024 09:49
@ctron ctron added this pull request to the merge queue Jul 1, 2024
Merged via the queue into trustification:main with commit 947a954 Jul 1, 2024
1 check passed
@ctron ctron deleted the feature/rename_to_purl_1 branch July 1, 2024 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants