Spike a /distributions endpoint in spec to reduce /version size #441

Draft
wants to merge 1 commit into
base: v2-spec

eldeal commented Nov 3, 2023

What

Fundamentally, we face an issue with the complexity of returning dcat:dataset and dcat:distribution details as one joined response while meeting API user expectations. The issue is the nested table_schema within the distribution within the dataset. Having 3 layers of nested document on /editions/ creates a very messy API, and suggests that at least one of those nested resources is a sub resource of another. As dcat:distribution is a known resource type, this is an attempt to see if the API could mirror that distinction while still providing utility for users.

This means /versions/ has no top-level distribution or table_schema object; distributions are instead embedded like this:

"_embedded": {
    "dimensions": [
      {
        "code_list": "string",
        "identifier": "string",
        "label": "string",
        "name": "string"
      }
    ],
    "distributions": [
      {
        "distributions": [
          {
            "checksum": "string",
            "@id": "string",
            "byte_size": "string",
            "media_type": "string",
            "download_url": "string"
          }
        ]
      }
    ]
  },

The dimensions list is there for census reasons, and has been relocated from its previous top-level field.

The more complete view of distributions, including table_schema for CSV distros, is available on a /distributions endpoint with this response type:

{
  "_links": {
    "self": {
      "href": "string"
    },
    "next": {
      "href": "string"
    },
    "prev": {
      "href": "string"
    }
  },
  "count": 0,
  "limit": 0,
  "offset": 0,
  "total_count": 0,
  "@context": "string",
  "items": [
    {
      "checksum": "string",
      "described_by": "string",
      "table_schema": {
        "about_url": "string",
        "column": [
          {
            "component_type": "string",
            "datatype": "string",
            "name": "string",
            "title": "string"
          }
        ]
      },
      "@id": "string",
      "byte_size": "string",
      "media_type": "string",
      "download_url": "string",
      "etag": "string",
      "@type": "dcat:distribution",
      "_links": {
        "self": {
          "href": "string"
        },
        "version": {
          "href": "string",
          "id": "string"
        }
      }
    }
  ]
}

How to review

There are 3 main issues with this approach that I think I need feedback on:

  1. LD users rely on information in the _embedded fields to get the full picture. I don't see this as too big an issue right now, since we accepted that this was something we would trial and get feedback on. We're leveraging it more heavily here, though, so I'm flagging it.
  2. These _embedded fields are changing between the /editions/ and /versions/ endpoints, despite being part of the same dcat:dataset resource being represented (/editions/ embeds versions and distributions, /versions/ embeds dimensions and distributions - just based on my best guess of what would be useful). On the one hand we said the HAL fields _embedded and _links are generated and so it's acceptable for these to change between endpoints, but if they're forming part of the dcat resource is that still the case?
  3. Because we decided that _embedded fields should never be returned in list responses, calls to /editions and /versions will never include download links unless we revise this principle.
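
To make issue 3 concrete, here is a rough sketch of what a client would have to do to collect download links for every version: one extra request per item in the list. It assumes the list response has an `items` array whose entries carry `_links.self.href` (borrowed from the /distributions shape above, not confirmed for /versions), and `fetch_json` is a hypothetical caller-supplied function. The fake URLs are placeholders.

```python
from typing import Callable, Dict, List


def download_urls(version_doc: dict) -> List[str]:
    """Collect download_url values from a single version's _embedded block."""
    urls = []
    for entry in version_doc.get("_embedded", {}).get("distributions", []):
        # The example in this PR wraps the entries in an inner "distributions"
        # list; fall back to the entry itself if that wrapper is absent.
        for dist in entry.get("distributions", [entry]):
            if "download_url" in dist:
                urls.append(dist["download_url"])
    return urls


def collect_download_urls(
    version_list: dict, fetch_json: Callable[[str], dict]
) -> Dict[str, List[str]]:
    """Fetch each listed version individually, since lists omit _embedded."""
    result = {}
    for item in version_list.get("items", []):
        href = item["_links"]["self"]["href"]
        result[href] = download_urls(fetch_json(href))
    return result


# Fake responses standing in for real HTTP calls (placeholder URLs).
_fake_list = {
    "items": [
        {"_links": {"self": {"href": "/v/1"}}},
        {"_links": {"self": {"href": "/v/2"}}},
    ]
}
_fake_docs = {
    "/v/1": {
        "_embedded": {
            "distributions": [
                {"distributions": [{"download_url": "https://example.org/v1.csv"}]}
            ]
        }
    },
    "/v/2": {
        "_embedded": {
            "distributions": [{"download_url": "https://example.org/v2.csv"}]
        }
    },
}

urls_by_version = collect_download_urls(_fake_list, _fake_docs.__getitem__)
```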

Who can review

@janderson2 @rossbowen

rossbowen commented Nov 3, 2023

So on your points:

  1. I'm still quite uncomfortable with the use of _embedded, but I'll leave that at the door for a moment! Proceeding with it for now, I think we can tell the JSON-LD context to "skip" over it (using @nest) and still use the distributions field from within it.

  2. Agree those are the ones which would be useful, and I don't see an issue with them changing between editions and versions.

  3. In the example below I've included the download_url field. I'd expect to see it as one of the "minimal" fields to include, so maybe make an exception here, or at least keep it in using the linked data property.
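
As a rough illustration of the @nest idea in point 1, a JSON-LD 1.1 context fragment might look like the dict below. The term names mirror this PR; the mapping of `distributions` to `dcat:distribution` and `download_url` to `dcat:downloadURL` is my assumption, not something agreed in the spec. The `lift_nested` helper is only a plain-Python approximation of the effect, and a real JSON-LD processor does far more.

```python
# Hypothetical context fragment, written as a Python dict for illustration.
context = {
    "@version": 1.1,
    "dcat": "http://www.w3.org/ns/dcat#",
    # Alias `_embedded` to @nest so a processor looks through it.
    "_embedded": "@nest",
    # Assumed mappings onto DCAT terms.
    "distributions": {"@id": "dcat:distribution", "@nest": "_embedded"},
    "download_url": {"@id": "dcat:downloadURL", "@type": "@id"},
}


def lift_nested(doc: dict, ctx: dict) -> dict:
    """Approximate @nest: hoist properties under a nest alias to the parent."""
    nest_aliases = {key for key, value in ctx.items() if value == "@nest"}
    lifted = {}
    for key, value in doc.items():
        if key in nest_aliases and isinstance(value, dict):
            lifted.update(value)  # properties now belong to the parent node
        else:
            lifted[key] = value
    return lifted


version = {
    "@id": "https://example.org/versions/1",  # placeholder identifier
    "_embedded": {
        "distributions": [{"download_url": "https://example.org/v1.csv"}]
    },
}
flattened = lift_nested(version, context)
```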

Also I don't see an issue with also having a /distributions endpoint.


> Having 3 layers of nested document on /editions/ creates a very messy API, and suggests that at least one of those nested resources is a sub resource of another.

So while I get that the response is long, I don't think it's too messy. At most we get code which looks like:

edition.distributions[0].table_schema.columns

// Or including embedded
edition._embedded.distributions[0].table_schema.columns

So there's a bit of nesting there, but I'd say that feels perfectly pleasant to use and navigate. Personally I'd be more annoyed at having to make another call. What we have shapes up into the sort of thing Google are looking for as part of their structured data SEO; they even go as far as throwing in the observations.
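
If the location of distributions is still undecided, a small accessor could hide the difference from callers. This is only a sketch: `table_columns` is a hypothetical helper name, and it assumes the field is called `columns` as in the snippet above (the /distributions schema in the PR description uses `column`).

```python
def table_columns(edition: dict, index: int = 0) -> list:
    """Return the columns of one distribution, wherever distributions live."""
    # Prefer a top-level list, then fall back to the HAL _embedded block.
    distributions = edition.get("distributions") or edition.get(
        "_embedded", {}
    ).get("distributions", [])
    if index >= len(distributions):
        return []
    return distributions[index].get("table_schema", {}).get("columns", [])


top_level = {"distributions": [{"table_schema": {"columns": [{"name": "area"}]}}]}
embedded = {"_embedded": top_level}

names_top = [c["name"] for c in table_columns(top_level)]
names_embedded = [c["name"] for c in table_columns(embedded)]
```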

Another thought is, many users won't really appreciate the difference between a dataset and its distributions. Many analysts will say things like "a dataset has columns" which isn't how DCAT models it. So I don't think including a distribution along with its edition/version is a bad thing.

Assuming a structure roughly like this:

curl https://data.ons.gov.uk/datasets/cpih/editions/2022-01
{
  "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01",
  "@type": "dcat:Dataset",
  // some additional stuff here...
  "distributions": [
    {
      "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.csv",
      "@type": "dcat:Distribution",
      "byte_size": 123456,
      "download_url": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.csv",
      "media_type": "text/csv",
      "table_schema": {
        "columns": [
          {
            "name": "area",
            "titles": "area",
            "datatype": "string",
            "label": "Area",
            "description": "The area of an observation."
          },
          {
            "name": "period",
            "titles": "period",
            "datatype": "string",
            "label": "Period",
            "description": "The period of an observation."
          },
          {
            "name": "sex",
            "titles": "sex",
            "datatype": "string",
            "label": "Sex",
            "description": "Biological sex of observed individuals."
          },
          {
            "name": "life_expectancy",
            "titles": "life_expectancy",
            "datatype": "decimal",
            "label": "Average life expectancy",
            "description": "Mean life expectancy of observed individuals."
          }
        ]
      }
    },
    {
      "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.xlsx",
      "@type": "dcat:Distribution",
      "byte_size": 123456,
      "download_url": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.xlsx",
      "media_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
    }
  ],
  "versions": [
    {
      "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01/versions/1",
      "issued": "2022-01-01T00:00:00Z",
      "modified": "2022-01-01T00:00:00Z"
    },
    {
      "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01/versions/2",
      "issued": "2022-01-02T00:00:00Z",
      "modified": "2022-01-02T00:00:00Z"
    }
  ]
}
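
Against a response of roughly that shape, typical client code stays short. The sketch below uses a trimmed copy of the example (two columns only), and `distribution_for` and `latest_version` are hypothetical helper names rather than anything in the spec.

```python
from datetime import datetime
from typing import Optional

# Trimmed copy of the example response above.
edition = {
    "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01",
    "distributions": [
        {
            "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.csv",
            "download_url": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.csv",
            "media_type": "text/csv",
            "table_schema": {
                "columns": [
                    {"name": "area", "datatype": "string"},
                    {"name": "life_expectancy", "datatype": "decimal"},
                ]
            },
        },
        {
            "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01.xlsx",
            "media_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        },
    ],
    "versions": [
        {
            "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01/versions/1",
            "modified": "2022-01-01T00:00:00Z",
        },
        {
            "@id": "https://data.ons.gov.uk/datasets/cpih/editions/2022-01/versions/2",
            "modified": "2022-01-02T00:00:00Z",
        },
    ],
}


def distribution_for(doc: dict, media_type: str) -> Optional[dict]:
    """Return the first distribution with the given media type, if any."""
    return next(
        (d for d in doc.get("distributions", []) if d.get("media_type") == media_type),
        None,
    )


def latest_version(doc: dict) -> Optional[dict]:
    """Return the version with the most recent `modified` timestamp."""
    return max(
        doc.get("versions", []),
        # Swap the trailing "Z" for an explicit offset so fromisoformat
        # accepts the value on older Python versions too.
        key=lambda v: datetime.fromisoformat(v["modified"].replace("Z", "+00:00")),
        default=None,
    )


csv_dist = distribution_for(edition, "text/csv")
column_names = [c["name"] for c in csv_dist["table_schema"]["columns"]]
newest = latest_version(edition)
```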
