refactor: make it clear we're working with PURLs #468

ctron · 2024-06-27T12:34:25Z

As it feels the confusion happens quite often, even to me, I think we should rename things.

We have "packages", which have some meaning according to our glossary: https://github.com/trustification/trustify/blob/main/docs/glossary.md#package

> A package is an atomic artifact or component. Packages may be addressed using pURLs. A package may be described by an SBOM describing how it is created and its contents. A package may certainly contain other packages (e.g. shading one Java jar into another). A package may also be the sole member of a Product (UBI-8.0.13-x86.oci may be the singular package within the "UBI 8.0.13-x86" product). A package is one step more abstract than an artifact.

However, in the next section that definition actually calls out PURLs:

> Package URLs (pURLs) are possibly ambiguous names applied to packages. […]

So there are "packages", and "PURLs", which can be used to give "packages" a name.

In our current API and database structure, we have (because history) several entries, endpoints, and services which are named "packages", but actually deal with PURLs only.

This PR tries to clean up the user facing API (REST endpoints) by changing the base URL from /api/v1/package to /api/v1/package/by-url, when it comes to PURL based calls.

Next steps

It would also be reasonable to rename functions, services, and database structures towards PURLs. But naming is hard, so let's start small.

Alternatives

One alternative could be to use /api/v1/purl. However, those APIs still deal with packages, just using PURLs as names. So keeping "packages" makes sense to me.

JimFuller-RedHat · 2024-06-27T13:09:56Z

random aside - in corgi land we sidestepped the overloading of the term 'package' by just using the equally imprecise term 'component', which always has a 1:1 relationship with a single PURL. That gave us some 'wiggle' room if ever we had to assign a purl to something like a single script or something less than a package.

bobmcwhirter · 2024-06-27T13:14:52Z

I think in my mind, a package is something that can be referenced by a pURL.

A pURL is just a name for a pURL-addressable thing.

The ambiguity I think I was talking about in the glossary is more around... a pURL doesn't describe the binary bytes produced. Those would be artifacts.

eg, a specific build of pkg:maven/org.apache/[email protected] with a given hash of sha256:<blah> would be a non-ambiguous artifact, based on the package referenced by pkg:maven/org.apache/[email protected].
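
The name/artifact distinction can be sketched in a few lines of Rust. The `Artifact` type, its fields, and the values used with it are hypothetical and not taken from trustify; the sketch only illustrates that two builds can share one pURL and still be different artifacts.

```rust
/// A built artifact: the package name (a pURL) plus the digest of the bytes.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Artifact {
    /// Name of the package this artifact was built from.
    pub purl: String,
    /// Hex-encoded SHA-256 of the produced bytes.
    pub sha256: String,
}

impl Artifact {
    pub fn new(purl: &str, sha256: &str) -> Self {
        Self {
            purl: purl.to_string(),
            sha256: sha256.to_string(),
        }
    }

    /// Two artifacts belong to the same package when their pURLs match,
    /// even if the bytes (and therefore the digests) differ.
    pub fn same_package(&self, other: &Artifact) -> bool {
        self.purl == other.purl
    }
}
```

Two `Artifact` values built from the same pURL but with different digests compare as unequal, while `same_package` still returns `true` for them.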

"Bob McWhirter" is a name, somewhat ambiguous (there's at least one Canadian and one Arizonan sharing my name). This particular Bob who is typing now is the artifact with my particular sha256. I think?

All that being said, I'm not opposed to by-url, but that makes me wonder, is there any other non-url way to discuss a package, beyond our internal uuid? Or do we consider that sufficient to bifurcate?

I think similarly, while other systems may store "URLs", they are ultimately speaking about the thing behind the URL, e.g., the endpoint, the webpage, the podcast you can download.

Naming is hard. Naming things that name things is even harder.

Do we work with names? Do we work with things, and just use names to refer to them?

I dunno.

bobmcwhirter · 2024-06-27T13:16:54Z

I guess I may fall into the camp of "a package is anything that can be described by a pURL" which might as well just be a pURL. Maybe.

ctron · 2024-06-27T13:43:33Z

> All that being said, I'm not opposed to by-url, but that makes me wonder, is there any other non-url way to discuss a package, beyond our internal uuid? Or do we consider that sufficient to bifurcate?

Right now, all that data comes from PURLs. We also do have packages, but in the context of an SBOM. Going in the direction of what @JimFuller-RedHat mentioned before I guess.

We also have CPEs. And we also have file-hashes in the context of an SBOM. All of those reference "packages of an SBOM".

I think we are on the same page, PURLs are names for packages. But one package can have multiple PURLs assigned, and additionally other names of other types (like CPEs, digests, and whatever the future will bring).
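
As a rough illustration of "one package, many names", here is a minimal Rust sketch. `Identifier` and `SbomPackage` are hypothetical names chosen for this example and do not mirror trustify's actual entities; the identifier values are made up.

```rust
/// One of the names an SBOM can attach to a package (illustrative only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Identifier {
    Purl(String),
    Cpe(String),
    Sha256(String),
}

/// A package as described by an SBOM: an entity of its own that carries
/// zero or more identifiers of different kinds.
#[derive(Debug, Clone, Default)]
pub struct SbomPackage {
    pub name: String,
    pub identifiers: Vec<Identifier>,
}

impl SbomPackage {
    /// All PURLs attached to this package, in insertion order.
    pub fn purls(&self) -> Vec<&str> {
        self.identifiers
            .iter()
            .filter_map(|id| match id {
                Identifier::Purl(p) => Some(p.as_str()),
                _ => None,
            })
            .collect()
    }
}
```

Adding a new kind of name later (for example another digest algorithm) would only mean adding an enum variant, which is one way to read "whatever the future will bring".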

If we added functions/endpoints to search by CPE or hash, where would they end up in the HTTP endpoint namespace?

JimFuller-RedHat · 2024-06-27T14:10:42Z

when you say 'But one package can have multiple PURLs assigned,'

Do you mean a package can contain other packages which have purls, or do you mean a single package can have multiple purls?

If the latter, then e.g. a single unique id assigned to a single package creates a contract ... put another way, if we do have other purls that need to map to a package, I would suggest pointing them to a canonical purl (ultimately this is 'our purl') rather than mapping them directly to the package.

This is a subtle difference but we have already seen differences in who mints a purl ... sometimes we are the first to do so, then sometime later upstream mints a purl. Also, we should treat wholesale refactoring of purls as a real possibility as that proto spec matures and upstream works more with it ... it becomes burdensome to have to update all associations in the database, etc. when this happens.

From a REST API (conceptual model) pov we can make the association transparent ... e.g. the above applies more to the logical internal model, to handle unknown change later on.
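
The canonical-purl idea can be sketched as a two-level lookup. This is a minimal sketch under assumptions: `PurlRegistry`, its methods, and the numeric package id are hypothetical and only show why a newly minted upstream purl would not require touching existing package associations.

```rust
use std::collections::HashMap;

/// Hypothetical registry: every known PURL points at one canonical PURL,
/// and only the canonical PURL is linked to the internal package id.
#[derive(Debug, Default)]
pub struct PurlRegistry {
    /// alias PURL -> canonical PURL
    aliases: HashMap<String, String>,
    /// canonical PURL -> internal package id
    packages: HashMap<String, u64>,
}

impl PurlRegistry {
    /// Register the canonical PURL for a package.
    pub fn add_canonical(&mut self, canonical: &str, package_id: u64) {
        self.packages.insert(canonical.to_string(), package_id);
    }

    /// Record that `alias` (e.g. a PURL minted later by upstream) names the
    /// same package as `canonical`. No package association is touched.
    pub fn add_alias(&mut self, alias: &str, canonical: &str) {
        self.aliases.insert(alias.to_string(), canonical.to_string());
    }

    /// Resolve any known PURL, canonical or alias, to the package id.
    pub fn resolve(&self, purl: &str) -> Option<u64> {
        let canonical = self.aliases.get(purl).map(String::as_str).unwrap_or(purl);
        self.packages.get(canonical).copied()
    }
}
```

With this shape, a wholesale change of upstream purls only rewrites the alias map; the link between the canonical purl and the package stays stable.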

Lastly, I could be easily convinced I am overthinking all this....

bobmcwhirter · 2024-06-27T14:11:33Z

I guess I was thinking CPEs are one possible name/identifier for products (the stuff Dejan is working on).

ctron · 2024-06-27T14:14:46Z

In SPDX-land, a single package can have a list of external references, which can be of type purl: https://spdx.github.io/spdx-spec/v2.3/package-information/#721-external-reference-field … which leads to each package having a list of cpes or purls attached, if I get that right.

And I do recall a conversation where that might actually make sense. Publishing the same artifact to multiple maven repositories.

bobmcwhirter · 2024-06-27T14:16:51Z

I know we're going into the weeds here, but I also think pkg:maven/com.foo/[email protected] implies ?repository_url=m2.maven.org or whatever maven central's repo is.

So the same jar published elsewhere would have to include ?repository_url=somewhere.else
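
To illustrate how a `repository_url` qualifier tells two otherwise identical PURLs apart, here is a small Rust sketch. The `qualifier` helper is hypothetical and deliberately simplified (no percent-decoding, no validation of the rest of the PURL), and the PURLs and host names used with it are placeholders.

```rust
/// Return the value of one qualifier from a PURL string, if present.
/// Simplified: no percent-decoding and no validation of the rest of the PURL.
pub fn qualifier<'a>(purl: &'a str, key: &str) -> Option<&'a str> {
    // Qualifiers sit between '?' and an optional '#subpath'.
    let (_, rest) = purl.split_once('?')?;
    let query = rest.split_once('#').map_or(rest, |(q, _)| q);
    query
        .split('&')
        .filter_map(|pair| pair.split_once('='))
        .find(|(k, _)| *k == key)
        .map(|(_, v)| v)
}
```

For a PURL without qualifiers the function returns `None`, which a caller could interpret as "the default repository for this package type".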

JimFuller-RedHat · 2024-06-27T14:17:24Z

+1 to external reference ... and maybe all this kind of stuff could be represented as labels/tags? Maybe that is overloading a generic thing too much, and it is worth representing in the core model as 'external-reference' ... that makes it clearly different from the canonical purl.

ctron · 2024-06-27T14:36:33Z

I think the model that we have is actually quite good. We have base PURLs, versioned PURLs, qualified PURLs. Referencing those from SBOM packages/components. Also towards CPEs and hashes.

It's only the names that feel confusing.
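
The three levels mentioned above (base, versioned, qualified) can be pictured as a decomposition of one PURL string. The following Rust sketch is an assumption-laden illustration: the struct names and the `decompose` function are hypothetical, are not trustify's entities, and the parser is simplified (it requires a version, ignores `#subpath`, and does no percent-decoding).

```rust
use std::collections::BTreeMap;

/// `pkg:type/namespace/name`: the package without version or qualifiers.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BasePurl {
    pub ty: String,
    pub namespace: Option<String>,
    pub name: String,
}

/// A base PURL plus a version.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VersionedPurl {
    pub base: BasePurl,
    pub version: String,
}

/// A versioned PURL plus its qualifiers (e.g. `arch`, `repository_url`).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct QualifiedPurl {
    pub versioned: VersionedPurl,
    pub qualifiers: BTreeMap<String, String>,
}

/// Decompose a PURL string into the three levels.
/// Simplified: requires a version, ignores `#subpath`, no percent-decoding.
pub fn decompose(purl: &str) -> Option<QualifiedPurl> {
    let rest = purl.strip_prefix("pkg:")?;
    // Drop an optional subpath, then split off the qualifiers.
    let rest = rest.split_once('#').map_or(rest, |(r, _)| r);
    let (rest, query) = rest.split_once('?').unwrap_or((rest, ""));
    // The version follows the last '@'.
    let (path, version) = rest.rsplit_once('@')?;
    let (ty, full_name) = path.split_once('/')?;
    // Everything before the last '/' is the namespace, if any.
    let (namespace, name) = match full_name.rsplit_once('/') {
        Some((ns, n)) => (Some(ns.to_string()), n),
        None => (None, full_name),
    };
    if ty.is_empty() || name.is_empty() || version.is_empty() {
        return None;
    }
    let qualifiers = query
        .split('&')
        .filter_map(|pair| pair.split_once('='))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect();
    Some(QualifiedPurl {
        versioned: VersionedPurl {
            base: BasePurl {
                ty: ty.to_string(),
                namespace,
                name: name.to_string(),
            },
            version: version.to_string(),
        },
        qualifiers,
    })
}
```

Nesting the types this way means many qualified PURLs can share one versioned PURL, and many versioned PURLs can share one base PURL, which matches the idea that these terms describe parts of a PURL rather than parts of a package.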

carlosthe19916 · 2024-06-27T14:46:52Z

just to be sure. The current way the UI is fetching packages is:

First step: get the list of packages from GET /api/v1/package. Each item from the list obtained here has a field uuid. Then
Second step: get a single package using GET /api/v1/package/{uuid}

Are those steps still going to be valid? I would expect "yes", right?

jcrossley3 · 2024-06-27T14:47:59Z

> As it feels the confusion happens quite often, even to me, I think we should rename things.

Can you describe the confusion that happens quite often? What problems are we solving?

> This PR tries to clean up the user facing API (REST endpoints) by changing the base URL from /api/v1/package to /api/v1/package/by-url, when it comes to PURL based calls.

I don't see how this clarifies things. To me, /api/v1/package/{purl} is pretty clear. I'm asking for the package referenced by a purl.

I also don't understand the terms "base", "qualified" and "versioned". Will the consumers of our API understand them?

carlosthe19916 · 2024-06-27T15:00:31Z

Also, a general thought:

There must be a single way of fetching an Entity (single entity). I mean, having /api/v1/package/by-purl/{key} and /api/v1/package/{key} seems illogical to me.
I have no issues with multiple "search" endpoints (list of entities) /api/v1/package?q=abc and /api/v1/package/by-purl?q=abc where these endpoints return a paginated result.

ctron · 2024-06-27T15:01:20Z

> > As it feels the confusion happens quite often, even to me, I think we should rename things.

> Can you describe the confusion that happens quite often? What problems are we solving?

I also did stumble over this myself when navigating the code. You see something that's named "package", but indeed it's a PURL.

> > This PR tries to clean up the user facing API (REST endpoints) by changing the base URL from /api/v1/package to /api/v1/package/by-url, when it comes to PURL based calls.

> I don't see how this clarifies things. To me, /api/v1/package/{purl} is pretty clear. I'm asking for the package referenced by a purl.

How would you ask then for a package by CPE?

> I also don't understand the terms "base", "qualified" and "versioned". Will the consumers of our API understand them?

I don't know, but they already existed. And they only make sense in the context of a PURL, not in the context of a "package". So it might be that you are confused by the naming too.

ctron · 2024-06-27T15:03:11Z

> just to be sure. The current way the UI is fetching packages is:
> * First step: get the list of packages from `GET /api/v1/package`. Each item from the list obtained here has a field `uuid`. Then
> * Second step: get a single package using `GET /api/v1/package/{uuid}`
>
> Are those steps still going to be valid? I would expect "yes", right?

That depends on what you mean by "package". Package of an SBOM, no. Package referenced by a CPE, no. Package referenced by a PURL, yes.

ctron · 2024-06-27T15:07:48Z

> Also, a general thought:
>
> * There must be a single way of fetching an Entity (single entity). I mean, having `/api/v1/package/by-purl/{key}` and `/api/v1/package/{key}` seems illogical to me.

Correct. But indeed what you get back is information coming from the PURL. Not from the package itself. Basically it splits the PURL into a lot of database entities and returns that.

> * I have no issues with multiple "search" endpoints (list of entities) `/api/v1/package?q=abc` and `/api/v1/package/by-purl?q=abc` where these endpoints return a paginated result.

That's why one idea was to just go for /api/v1/purl. Because basically you're only searching for PURLs. And they reference packages inside of SBOMs. And so they reference SBOMs. But you could also find an SBOM package by CPE or hash (not today, but hopefully in the future).

jcrossley3 · 2024-06-27T15:14:15Z

> That's why one idea was to just go for /api/v1/purl. Because basically you're only searching for PURLs.

What is the use case for that? To me, a purl is not a resource -- it's just the name of a resource, in this case a Package.

Do we have a need to return a paginated list of names?

jcrossley3 · 2024-06-27T15:29:52Z

Could some of the confusion be coming from the existence of both a package and an sbom_package table? I would expect an sbom record to relate to a package record somehow.

ctron · 2024-06-28T06:56:44Z

> > That's why one idea was to just go for /api/v1/purl. Because basically you're only searching for PURLs.

> What is the use case for that? To me, a purl is not a resource -- it's just the name of a resource, in this case a Package.
>
> Do we have a need to return a paginated list of names?

Looks like that. If you take a look at e.g. PackageService::package_by_uuid, which returns PackageDetails, that's simply a decomposed PURL, stored in tables. And it's absolutely not about "packages".

> Could some of the confusion be coming from the existence of both a package and an sbom_package table? I would expect an sbom record to relate to a package record somehow.

Right, the package table was there before. But it only stores PURLs, not packages. And an sbom record relates to an sbom_package, which relates to zero or more packages (which indeed are PURLs). It also relates to one or more cpes, which is correct.

That's why I think it makes sense to rename services and entities too. But before doing all the work, I think we should agree on that.

jcrossley3 · 2024-06-28T13:17:48Z

> Right, the package table was there before.

Before what? If you mean before we implemented this, then I might question why the existing package table wasn't incorporated into that design.

> But it only stores PURLs, not packages.

This sentence is confusing. A PURL is just a name. It's a logical concatenation of attributes of a package (the concept, not the table). A package will have those attributes whether a particular SBOM says it has a PURL/CPE/identifier or not, right?

From the above doc, "SBOM packages are an entity of their own, but may have zero or more PURLs or CPEs (or other identifiers)". That sentence betrays a lack of shared understanding, I think. In an abstract sense, a package will always have a PURL, because it can be constructed from the package's attributes. From a literal reading of the spec, a package may not have an identifier, but is that what we should be modeling in the database?

My intuition is that our database designers -- Jens and Bob -- need to agree on whether we're modeling entities in the abstract or according to a spec.

ctron · 2024-06-28T13:56:44Z

> > Right, the package table was there before.

> Before what? If you mean before we implemented this, then I might question why the existing package table wasn't incorporated into that design.

Because it didn't fit.

> > But it only stores PURLs, not packages.

> This sentence is confusing. A PURL is just a name. It's a logical concatenation of attributes of a package (the concept, not the table). A package will have those attributes whether a particular SBOM says it has a PURL/CPE/identifier or not, right?

I don't think it's the sentence that is confusing, but the structure in the database. The package, versioned_package, and qualified_package tables only store decomposed parts of PURLs. I see the reason for that. It's the naming that is confusing!

Actually the package does not. It comes from the context of the SBOM. The same package can be given different names by different SBOMs.

> In an abstract sense, a package will always have a PURL, because it can be constructed from the package's attributes. From a literal reading of the spec, a package may not have an identifier, but is that what we should be modeling in the database?

Again, in the database we model PURLs (with tables mentioned above). Not packages.

I think the database structure works just fine. We have (actual) SBOM packages, named by PURLs and CPEs (and maybe others in the future). It simply is the naming.

bobmcwhirter · 2024-06-28T14:02:58Z

While I think a package is "anything that can be addressed by a Package URL (pURL)", I don't hold that view strongly enough, nor do I have enough about SBOM packages in my head, to disagree with @ctron right now.

My vote would be to proceed (after we winnow down other outstanding package-related PRs.......) and see how we feel with the purl-centric naming.

bobmcwhirter · 2024-06-28T14:20:53Z

modules/fundamental/src/package/endpoints/base.rs

    ),
 )]
-#[get("/v1/package/base/{uuid}")]
+#[get("/v1/package/by-purl/base/{id}")]
 pub async fn get(
    service: web::Data<PackageService>,


Do we want to rename the service to PurlService?

jcrossley3 · 2024-06-28

If we do, I'd argue the api should become #[get("/v1/purl/base/{id}")]

ctron · 2024-06-28

Maybe not the service, but the functions and types that deal with PURLs. The main namespace/module still is package, which is ok.

ctron · 2024-07-01T09:18:45Z

I updated the PR, renaming the package tables to "purl". Renaming the fields as well.

ctron requested review from jcrossley3, bobmcwhirter, dejanb and carlosthe19916 June 27, 2024 12:34

bobmcwhirter approved these changes Jun 28, 2024

ctron requested a review from bobmcwhirter July 1, 2024 09:18

refactor: make it clear we're working with PURLs

abd5f70

ctron added 4 commits July 1, 2024 11:28

docs: add some docs to the fields

b656cb8

fix: this file actually belongs to migration 230

11c6ad8

refactor: rename table containing purls to reflect that in their name

7c5bb13

refactor: rename entity modules to align with database

1ea2a03

ctron force-pushed the feature/rename_to_purl_1 branch from 10d7537 to 4183a2f Compare July 1, 2024 09:28

refactor: rename entity fields to reflect its about PURLs

3cd6e71

ctron force-pushed the feature/rename_to_purl_1 branch from 4183a2f to 3cd6e71 Compare July 1, 2024 09:49

bobmcwhirter approved these changes Jul 1, 2024

ctron added this pull request to the merge queue Jul 1, 2024

Merged via the queue into trustification:main with commit 947a954 Jul 1, 2024
1 check passed

ctron deleted the feature/rename_to_purl_1 branch July 1, 2024 14:22

carlosthe19916 mentioned this pull request Jul 2, 2024

Update package url path changes trustification/trustify-ui#91

Merged
