Spike a /distributions endpoint in spec to reduce /version size #441

eldeal · 2023-11-03T15:29:59Z

What

Fundamentally, we face an issue with the complexity of returning dcat:dataset and dcat:distribution details as one joined response while meeting API user expectations. The issue is the nested table_schema within the distribution within the dataset. Having 3 layers of nested document on /editions/ creates a very messy API, and suggests that at least one of those nested resources is a sub resource of another. As dcat:distribution is a known resource type, this is an attempt to see if the API could mirror that distinction while still providing utility for users.

This means /versions/ has no top-level distribution or table_schema object; these are instead embedded like this:

"_embedded": {
    "dimensions": [
      {
        "code_list": "string",
        "identifier": "string",
        "label": "string",
        "name": "string"
      }
    ],
    "distributions": [
      {
        "distributions": [
          {
            "checksum": "string",
            "@id": "string",
            "byte_size": "string",
            "media_type": "string",
            "download_url": "string"
          }
        ]
      }
    ]
  },

The dimensions list is there for census reasons, relocated from its previous top-level field.
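As a rough illustration of how a client might read download links out of this shape, here is a minimal Python sketch. It assumes the doubly nested `distributions` lists in the sample above are intentional; the function name `embedded_download_urls` and the example URL are made up for illustration.

```python
from typing import Any


def embedded_download_urls(version: dict[str, Any]) -> list[str]:
    """Collect download_url values from a /versions/ document's _embedded block."""
    urls: list[str] = []
    # Outer list: wrapper objects under _embedded.distributions.
    for wrapper in version.get("_embedded", {}).get("distributions", []):
        # Inner list: the distribution summaries themselves.
        for dist in wrapper.get("distributions", []):
            url = dist.get("download_url")
            if url:
                urls.append(url)
    return urls


# Hypothetical document following the sample shape above.
example_version = {
    "_embedded": {
        "dimensions": [],
        "distributions": [
            {
                "distributions": [
                    {
                        "@id": "a",
                        "media_type": "text/csv",
                        "download_url": "https://example.org/a.csv",
                    }
                ]
            }
        ],
    }
}
print(embedded_download_urls(example_version))
```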

A more complete view of distributions, including table_schema for CSV distros, is available on a /distributions endpoint with this response type:

{
  "_links": {
    "self": {
      "href": "string"
    },
    "next": {
      "href": "string"
    },
    "prev": {
      "href": "string"
    }
  },
  "count": 0,
  "limit": 0,
  "offset": 0,
  "total_count": 0,
  "@context": "string",
  "items": [
    {
      "checksum": "string",
      "described_by": "string",
      "table_schema": {
        "about_url": "string",
        "column": [
          {
            "component_type": "string",
            "datatype": "string",
            "name": "string",
            "title": "string"
          }
        ]
      },
      "@id": "string",
      "byte_size": "string",
      "media_type": "string",
      "download_url": "string",
      "etag": "string",
      "@type": "dcat:Distribution",
      "_links": {
        "self": {
          "href": "string"
        },
        "version": {
          "href": "string",
          "id": "string"
        }
      }
    }
  ]
}
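To show how the paging fields in this response might be consumed, here is a small Python sketch that follows `_links.next.href` until it runs out. It assumes the last page simply omits `next`, which the spec sample does not confirm; `iter_distributions`, the injected `fetch` callable, and the `offset` query strings are all hypothetical stand-ins rather than part of the API.

```python
from typing import Any, Callable, Iterator, Optional

Page = dict[str, Any]


def iter_distributions(first_url: str, fetch: Callable[[str], Page]) -> Iterator[Page]:
    """Yield every item from a paged /distributions response."""
    url: Optional[str] = first_url
    seen: set[str] = set()
    # Stop when there is no next link, or if a link loops back on itself.
    while url and url not in seen:
        seen.add(url)
        page = fetch(url)
        yield from page.get("items", [])
        url = page.get("_links", {}).get("next", {}).get("href")


# Fake pages standing in for real HTTP responses.
fake_pages = {
    "/distributions?offset=0": {
        "_links": {"next": {"href": "/distributions?offset=1"}},
        "items": [{"@id": "d1"}],
    },
    "/distributions?offset=1": {"_links": {}, "items": [{"@id": "d2"}]},
}
ids = [d["@id"] for d in iter_distributions("/distributions?offset=0", fake_pages.__getitem__)]
print(ids)
```

Passing `fetch` in keeps the sketch independent of any particular HTTP client.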

How to review

There are 3 main issues with this approach that I think I need feedback on:

1. LD users rely on information in the _embedded fields to get the full picture. I don't see this as too big an issue right now, since we accepted that this was something we would trial and get feedback on, but we're now leveraging it more, so I'm flagging it.
2. These _embedded fields change between the /editions/ and /versions/ endpoints, despite both representing the same dcat:dataset resource (/editions/ embeds versions and distributions, /versions/ embeds dimensions and distributions, based on my best guess of what would be useful). On the one hand, we said the HAL fields _embedded and _links are generated, so it's acceptable for them to change between endpoints; but if they form part of the dcat resource, is that still the case?
3. Because we decided that _embedded fields should never be returned in list responses, calls to /editions and /versions will never include download links unless we revise this principle.
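To make the last point concrete: under the current principle, a client that starts from a list call has to make one further request per item to reach the download links. A minimal Python sketch of that pattern follows. It assumes each list item carries a `_links.self.href`; `download_urls_for_list`, `fake_fetch`, and the `/versions/1` style paths are hypothetical.

```python
from typing import Any, Callable

Doc = dict[str, Any]


def download_urls_for_list(listing: Doc, fetch: Callable[[str], Doc]) -> dict[str, list[str]]:
    """Map each list item's self link to the download URLs found in its full document."""
    result: dict[str, list[str]] = {}
    for item in listing.get("items", []):
        href = item["_links"]["self"]["href"]
        full = fetch(href)  # one extra request per list item
        result[href] = [
            dist["download_url"]
            for wrapper in full.get("_embedded", {}).get("distributions", [])
            for dist in wrapper.get("distributions", [])
            if "download_url" in dist
        ]
    return result


calls: list[str] = []


def fake_fetch(href: str) -> Doc:
    """Stand-in for an HTTP GET; records each call so the request count is visible."""
    calls.append(href)
    return {"_embedded": {"distributions": [{"distributions": [{"download_url": href + ".csv"}]}]}}


listing = {
    "items": [
        {"_links": {"self": {"href": "/versions/1"}}},
        {"_links": {"self": {"href": "/versions/2"}}},
    ]
}
links = download_urls_for_list(listing, fake_fetch)
print(links, len(calls))
```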

Who can review

@janderson2 @rossbowen

rossbowen · 2023-11-03T17:24:18Z

So on your points:

1. I'm still quite uncomfortable with the use of _embedded, but I'll leave that at the door for a moment! Proceeding with including it for now, I think we can tell the JSON-LD context to "skip" over it (using @nest) and still use the distributions field from within it.
2. Agreed, those are the ones which would be useful, and I don't see an issue with them changing between editions and versions.
3. In the example below I've included the download_url field. I'd expect to see it as one of the "minimal" fields to include, so maybe make an exception here, or at least keep it in using the linked data property.

Also I don't see an issue with also having a /distributions endpoint.
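For reference, JSON-LD 1.1 lets a context alias a key to `@nest`, so that properties inside that key are treated as belonging to the enclosing node. Below is a sketch of what such a context might look like for `_embedded`, written as a Python dict. The mapping of `distributions` to `dcat:distribution` is an assumption, and `lift_nested` is only a toy imitation of the effect on expansion, not a JSON-LD processor.

```python
from typing import Any

# Hypothetical context fragment: _embedded is declared as a nesting key.
CONTEXT: dict[str, Any] = {
    "@version": 1.1,
    "dcat": "http://www.w3.org/ns/dcat#",
    "_embedded": "@nest",
    # The term-level @nest tells compaction where to place this property.
    "distributions": {"@id": "dcat:distribution", "@nest": "_embedded"},
}


def lift_nested(doc: dict[str, Any], context: dict[str, Any]) -> dict[str, Any]:
    """Toy illustration: hoist mapped terms out of @nest keys onto the parent node."""
    nest_keys = {key for key, value in context.items() if value == "@nest"}
    out = {key: value for key, value in doc.items() if key not in nest_keys}
    for nest_key in nest_keys:
        for term, value in doc.get(nest_key, {}).items():
            # Terms with no mapping in the context are dropped here,
            # mirroring what expansion does when no @vocab is set.
            if term in context:
                out[term] = value
    return out


doc = {"@id": "v1", "_embedded": {"dimensions": [], "distributions": [{"@id": "d1"}]}}
print(lift_nested(doc, CONTEXT))
```

In this sketch `dimensions` disappears only because the hypothetical context gives it no mapping; adding a term for it would carry it through in the same way.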

"Having 3 layers of nested document on /editions/ creates a very messy API, and suggests that at least one of those nested resources is a sub resource of another."

So while I get that the response is long, I don't think it's too messy. At most we get code which looks like:

edition.distributions[0].table_schema.columns

// Or including embedded
edition._embedded.distributions[0].table_schema.columns

So there's a bit of nesting there, but I'd say that feels perfectly pleasant to use and navigate with. Personally I'd be a bit more annoyed with having to make another call. What we have shapes up into the sort of thing Google are looking for as part of their structured data SEO... they even go as far as throwing in the observations.
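The same navigation might look like this in Python, as a sketch that copes with either placement of `distributions` (top level or under `_embedded`). The helper name `csv_columns` and the sample documents are made up for illustration.

```python
from typing import Any


def csv_columns(edition: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the columns of the first distribution that carries a table_schema."""
    # Prefer the top-level field, falling back to the _embedded placement.
    distributions = edition.get("distributions") or edition.get("_embedded", {}).get(
        "distributions", []
    )
    for dist in distributions:
        schema = dist.get("table_schema")
        if schema:
            return schema.get("columns", [])
    return []


top_level = {"distributions": [{"table_schema": {"columns": [{"name": "area"}]}}]}
embedded = {"_embedded": {"distributions": [{"table_schema": {"columns": [{"name": "sex"}]}}]}}
print(csv_columns(top_level), csv_columns(embedded))
```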

Another thought is, many users won't really appreciate the difference between a dataset and its distributions. Many analysts will say things like "a dataset has columns" which isn't how DCAT models it. So I don't think including a distribution along with its edition/version is a bad thing.

Assuming a structure roughly like this:

curl https://data.ons.gov.uk/datasets/cpih/editions/2022-01

{
  "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01",
  "@type": "dcat:Dataset",
  // some additional stuff here...
  "distributions": [
    {
      "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.csv",
      "@type": "dcat:Distribution",
      "byte_size": 123456,
      "download_url": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.csv",
      "media_type": "text/csv",
      "table_schema": {
        "columns": [
          {
            "name": "area",
            "titles": "area",
            "datatype": "string",
            "label": "Area",
            "description": "The area of an observation."
          },
          {
            "name": "period",
            "titles": "period",
            "datatype": "string",
            "label": "Period",
            "description": "The period of an observation."
          },
          {
            "name": "sex",
            "titles": "sex",
            "datatype": "string",
            "label": "Sex",
            "description": "Biological sex of observed individuals."
          },
          {
            "name": "life_expectancy",
            "titles": "life_expectancy",
            "datatype": "decimal",
            "label": "Average life expectancy",
            "description": "Mean life expectancy of observed individuals."
          }
        ]
      }
    },
    {
      "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.xlsx",
      "@type": "dcat:Distribution",
      "byte_size": 123456,
      "download_url": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.xlsx",
      "media_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
    }
  ],
  "versions": [
    {
      "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01/versions/1",
      "issued": "2022-01-01T00:00:00Z",
      "modified": "2022-01-01T00:00:00Z"
    },
    {
      "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01/versions/2",
      "issued": "2022-01-02T00:00:00Z",
      "modified": "2022-01-02T00:00:00Z"
    }
  ]
}
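Given a response shaped like the one above, picking out a particular format and its column metadata could be sketched as follows. The `find_distribution` helper is hypothetical, and the `edition` dict is a trimmed copy of the example rather than real API output.

```python
from typing import Any, Optional


def find_distribution(edition: dict[str, Any], media_type: str) -> Optional[dict[str, Any]]:
    """Return the first distribution with the given media_type, or None."""
    return next(
        (d for d in edition.get("distributions", []) if d.get("media_type") == media_type),
        None,
    )


# Trimmed copy of the example edition above.
edition = {
    "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01",
    "distributions": [
        {
            "download_url": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.csv",
            "media_type": "text/csv",
            "table_schema": {
                "columns": [
                    {"name": "area", "datatype": "string"},
                    {"name": "life_expectancy", "datatype": "decimal"},
                ]
            },
        }
    ],
}

csv_dist = find_distribution(edition, "text/csv")
if csv_dist is not None:
    # Pair each column name with its declared datatype.
    column_types = {c["name"]: c["datatype"] for c in csv_dist["table_schema"]["columns"]}
    print(csv_dist["download_url"], column_types)
```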

Spike a /distributions endpoint in spec to reduce /version size

035610b
