
How to convert a v1 to a v2 package #264

Open
20 of 46 tasks
peterdesmet opened this issue Aug 30, 2024 · 3 comments


peterdesmet commented Aug 30, 2024

I wanted to assess the complexity of converting a v1 to a v2 Data Package. Below are the steps that need to be taken. For version detection, see #262. @khusmann could you review these? There are a couple of items I'm unsure about.

Package

Add package.$schema, remove package.profile

Use package.profile, then remove it.

  • NULL => https://datapackage.org/profiles/2.0/datapackage.json
  • data-package (registered id) => https://datapackage.org/profiles/2.0/datapackage.json
  • tabular-data-package (registered id) => https://datapackage.org/profiles/2.0/datapackage.json. This also removes the deprecated tabular-data-package profile
  • fiscal-data-package (registered id) => Unsure, should we use the 1.0 URL for fiscal-data-package?
  • A URL => Unsure, the referenced schema will likely point to Data Package v1, making it a v1
  • Any other value => Unsure, not allowed by https://specs.frictionlessdata.io/profiles/
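The mapping above can be sketched as follows. This is a hypothetical Python sketch (the frictionless package itself is R, and function names here are illustrative); the URL, fiscal-data-package and "any other value" cases are left untouched because they are still undecided above.

```python
# Hypothetical sketch: map a v1 package "profile" to a v2 "$schema".
V2_PACKAGE_PROFILE = "https://datapackage.org/profiles/2.0/datapackage.json"

def upgrade_package_profile(package: dict) -> dict:
    """Replace package["profile"] with package["$schema"] where decided."""
    package = dict(package)  # shallow copy, don't mutate the input
    profile = package.pop("profile", None)
    if profile in (None, "data-package", "tabular-data-package"):
        package["$schema"] = V2_PACKAGE_PROFILE
    elif profile is not None:
        # Undecided cases (URL, fiscal-data-package, other): keep as v1
        package["profile"] = profile
    return package
```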

Add package.contributors.roles

  • For each contributor set roles (array) based on role (string). Remove role
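The role-to-roles step is mechanical; a hypothetical Python sketch (again, the real implementation would be R):

```python
# Hypothetical sketch: move each contributor's string "role" into
# an array "roles", as required by Data Package v2.
def upgrade_contributors(contributors: list) -> list:
    upgraded = []
    for contributor in contributors:
        contributor = dict(contributor)  # don't mutate the input
        role = contributor.pop("role", None)
        if role is not None:
            contributor["roles"] = [role]
        upgraded.append(contributor)
    return upgraded
```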

Other changes

Each resource

Add resource.$schema, remove resource.profile

Use resource.profile, then remove it

  • NULL => https://datapackage.org/profiles/2.0/dataresource.json
  • data-resource (registered id) => https://datapackage.org/profiles/2.0/dataresource.json
  • tabular-data-resource (registered id) => https://datapackage.org/profiles/2.0/dataresource.json (but see resource.type)
  • A URL => Unsure, the referenced schema will likely point to Data Package v1, making it a v1
  • Any other value => Unsure, not allowed by https://specs.frictionlessdata.io/profiles/
  • There is also the edge case where $schema is already present (i.e. a v1 package with a v2 resource). => Unsure, should the present resource.$schema be left as is then?

Add resource.type

Use resource.profile:

  • NULL => don't set
  • tabular-data-resource => table
  • Any other value or URL => don't set
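Since both resource.$schema and resource.type derive from resource.profile, the two steps can be combined. A hypothetical Python sketch, including the edge case where $schema is already present (left as is, which is one of the open questions above):

```python
# Hypothetical sketch: derive resource "$schema" and "type" from the
# v1 "profile". URL and other values are undecided and kept as v1.
V2_RESOURCE_PROFILE = "https://datapackage.org/profiles/2.0/dataresource.json"

def upgrade_resource_profile(resource: dict) -> dict:
    resource = dict(resource)
    if "$schema" in resource:
        return resource  # edge case: v1 package with a v2 resource
    profile = resource.pop("profile", None)
    if profile in (None, "data-resource", "tabular-data-resource"):
        resource["$schema"] = V2_RESOURCE_PROFILE
        if profile == "tabular-data-resource":
            resource["type"] = "table"
    else:
        resource["profile"] = profile  # URL or other value: undecided
    return resource
```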

Other changes

  • resource.sources: no change required
  • resource.name: rules are relaxed, existing names can remain as is
  • resource.path: dot-paths are now forbidden. In the edge case where such a path is provided, we should not convert it, because it is impossible to know what the correct path would be. These types of paths will be flagged when reading a resource.
  • resource.encoding: allows more, no action required

For each dialect

Note that upconverting a dialect requires a remote one to be downloaded and verbosely included.

Add dialect.$schema

  • dialect.caseSensitiveHeader is present => https://datapackage.org/profiles/1.0/tabledialect.json
  • dialect.csvddfVersion is present => https://datapackage.org/profiles/1.0/tabledialect.json
  • Otherwise this can safely be set to https://datapackage.org/profiles/2.0/tabledialect.json

Unsure about this though. For example, if a dialect was absent (very often the case), one will be added with just the $schema property. The alternative is to leave all dialects as v1 (assuming a $schema that defaults to https://datapackage.org/profiles/1.0/tabledialect.json). That would also mean that remote dialects can stay remote.
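The decision rule for which profile a dialect gets can be sketched like this (hypothetical Python, illustrative function name):

```python
# Hypothetical sketch: pick the profile a dialect should get when
# upgrading. caseSensitiveHeader and csvddfVersion were removed in v2,
# so dialects using them must stay on the v1 profile.
DIALECT_V1 = "https://datapackage.org/profiles/1.0/tabledialect.json"
DIALECT_V2 = "https://datapackage.org/profiles/2.0/tabledialect.json"

def dialect_schema(dialect: dict) -> str:
    if "caseSensitiveHeader" in dialect or "csvddfVersion" in dialect:
        return DIALECT_V1
    return DIALECT_V2
```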

Other changes

For each schema

Note that upconverting a schema requires a remote one to be downloaded and verbosely included.

Add schema.$schema

  • Set to https://datapackage.org/profiles/2.0/tableschema.json because we will update the schema to that version.

Update schema.primaryKey

  • Convert from string to array.

Update schema.foreignKeys

  • Convert schema.foreignKeys.fields from string to array
  • Convert schema.foreignKeys.reference[x].fields from string to array
  • If schema.foreignKeys.reference[x].resource = resource name => remove property
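The three schema steps above (set $schema, wrap primaryKey, wrap foreignKey fields and drop self-references) can be sketched together. A hypothetical Python sketch; the real R implementation may differ:

```python
# Hypothetical sketch: upgrade a v1 Table Schema in place.
V2_TABLE_SCHEMA = "https://datapackage.org/profiles/2.0/tableschema.json"

def upgrade_table_schema(schema: dict, resource_name: str) -> dict:
    schema = dict(schema)
    # primaryKey: string => array
    if isinstance(schema.get("primaryKey"), str):
        schema["primaryKey"] = [schema["primaryKey"]]
    # foreignKeys: wrap fields in arrays, drop self-referencing resource
    fks = []
    for fk in schema.get("foreignKeys", []):
        fk = dict(fk)
        if isinstance(fk.get("fields"), str):
            fk["fields"] = [fk["fields"]]
        ref = dict(fk.get("reference", {}))
        if isinstance(ref.get("fields"), str):
            ref["fields"] = [ref["fields"]]
        if ref.get("resource") == resource_name:
            ref.pop("resource")  # self-reference: remove property
        fk["reference"] = ref
        fks.append(fk)
    if fks:
        schema["foreignKeys"] = fks
    schema["$schema"] = V2_TABLE_SCHEMA
    return schema
```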

No action required

For each field

Other changes

@khusmann (Contributor)

Oof, yeah, the remote references really put a wrench in things. I suppose downloading and verbosely including is the way to go, in the same way that if we add a new field to a remote schema it must be downloaded and verbosely included.

I see now how preserving remote references without modification makes a good argument for mixing v1 and v2 descriptors within the same package, but I'm still very concerned about the complexity it creates.

In general remote references work ok for read-only datapackages, but behavior gets complicated / murky when any mutability AND round-tripping is desired. I think we'd both really like the ability to round-trip descriptors (incl. references to remote descriptors) when no modifications occur, but when we add remote references to the picture it means we need to track the modified state of our descriptors. If a property inside a remotely referenced descriptor is changed, the descriptor needs to decide whether that modification should trigger the descriptor being downloaded & modified. I think properly managing this state will become a big headache pretty quickly... so that brings us back to just downloading & verbosely including when upgrading.

Either that or we need to sit down and figure out an architecture that helps us track / manage remote descriptor modification state & handle roundtripping without it exploding in our faces. I think the design I describe in #252 can potentially help with this. We can have different classes for local vs remote descriptors which can have different serialization behaviors via get_descriptor(). Remote descriptors would serialize as a link to the remote descriptor, but when modified via set_prop() they would be converted into a local descriptor. When we load a remote v1 resource and update it to v2, we could still register it as remote resource descriptor, so that serializing it will still output the original v1 descriptor link (unless it was modified at any point by the user, whereupon it'd be converted into a local descriptor). Does that make sense? ...actually, this doesn't sound that bad the more I'm thinking about it.
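The local/remote descriptor split described above could look roughly like this. A minimal Python sketch with invented class and method names (set_prop(), get_descriptor()), purely to illustrate the serialization behavior, not an existing API:

```python
# Hypothetical sketch: remote descriptors round-trip as their URL
# until modified, at which point they become local descriptors and
# serialize verbosely.
class LocalDescriptor:
    def __init__(self, props: dict):
        self.props = dict(props)

    def set_prop(self, key, value):
        self.props[key] = value
        return self

    def get_descriptor(self):
        return dict(self.props)  # serializes verbosely

class RemoteDescriptor:
    def __init__(self, url: str, props: dict):
        self.url = url
        self.props = dict(props)  # the downloaded content

    def set_prop(self, key, value):
        # Any modification converts the descriptor into a local one
        return LocalDescriptor(self.props).set_prop(key, value)

    def get_descriptor(self):
        return self.url  # unmodified: round-trips as the original link
```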

The only thing I have a confident answer on is field.categories. I do not think fields with enum should be converted to categories because we can't assume every enum field is categorical, as you say. field.categories is a new feature for v2, so if someone wants to convert their enums to categories when upgrading v1 to v2 they can make this transition themselves.

@peterdesmet (Member, Author)

Thanks for thinking along @khusmann! I've been contemplating this a bit more and I came up with this:

flowchart TB
    classDef function color:#fff,fill:#5E8CD8,stroke:#5E8CD8,stroke-width:2px;
    classDef error color:#D86C5D,fill:#fff,stroke:#D86C5D,stroke-width:2px;
    
    v1_a["v1 datapackage.json"]
    v1_b["v1"]
    v1_c["v1"]
    v1_d["v1 datapackage.json"]
    v1_read_package("read_package()"):::function
    v1_create_package("create_package(v1)"):::function
    v1_add_resource("add_resource()"):::function
    v1_write_package("write_package()"):::function
    v1_schema["v1 schema"]
    v2_a["v2 datapackage.json"]
    v2_b["v2"]
    v2_c["v2"]
    v2_d["v2 datapackage.json"]
    v2_read_package("read_package()"):::function
    v2_create_package("create_package(v2)"):::function
    v2_add_resource("add_resource()"):::function
    v2_write_package("write_package()"):::function
    v2_schema["v2 schema"]
    
    b_upgrade_package("upgrade_package()"):::function
    c_upgrade_package("upgrade_package()"):::function
    
    subgraph v2_workflow
    v2_create_package --> v2_b
    v2_a ==> v2_read_package ==> v2_b ==> v2_add_resource ==> v2_c ==> v2_write_package ==> v2_d
    v2_schema -. optionally provided by user .-> v2_add_resource
    v2_add_resource -- user provided v1 schema --> v2_error["error, use upgrade_schema()"]:::error
    end

    v1_b --> b_upgrade_package --> v2_b
    v1_c --> c_upgrade_package --> v2_c
    %% v1_workflow --> b_upgrade_package --> v2_workflow
    
    subgraph v1_workflow
    v1_create_package --> v1_b
    v1_a ==> v1_read_package ==> v1_b ==> v1_add_resource ==> v1_c ==> v1_write_package ==> v1_d
    v1_schema -. optionally provided by user .-> v1_add_resource
    v1_add_resource -- user provided v2 schema --> v1_error["error, use upgrade_package()"]:::error
    end
  1. When reading or creating a v1 package, it stays v1. There is no automatic conversion. Adding resources doesn't change the version (in contrast with what I proposed in Detect Data Package version with version() function #262 (comment)) and neither does writing. This means that any current (v1) workflow is not affected. It also means we don't have to bump frictionless to a new major version (we can reserve that for an entire API change like API Brainstorm Thread #252)
  2. The same applies for reading or creating a v2 package. Users can decide to adopt a v2 workflow when they want. Once in v2, they stay in v2.
  3. Upgrading from v1 to v2 is an active user choice and is facilitated with upgrade_package(). It performs all the steps described above in the body of this issue. So it upgrades the package, all resources, schemas and dialects. Any remote schema will be downloaded, but the user is informed of that. Since everything gets upgraded, users can upgrade their v1 package at any point in their workflow. They could even do 1. write_package() + git commit. 2. upgrade_package() %>% write_package() + inspect git diff (I know I will).
  4. We don't mix versions (e.g. a v2 package with v1 resource). We have one manipulation function add_resource() where there could be a discrepancy: when a user provides a schema to add_resource() with a different version. In that case we can return an error and suggest to upgrade.
  5. It will likely be useful to create some internal functions to separate the flows (e.g. add_resource_v1(), add_resource_v2(), ...), but I don't think we need a frictionless2 package. In fact, it could complicate things (i.e. does it still support v1, where should the upgrade_package() function live, etc.)
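The version check in point 4 could look roughly like this. A hypothetical Python sketch of the control flow only (the real functions are R, and detect_version() is an invented helper along the lines of #262):

```python
# Hypothetical sketch: add_resource() refusing a resource whose
# version differs from the package's, suggesting an explicit upgrade.
def detect_version(descriptor: dict) -> int:
    # Invented heuristic: v2 descriptors carry a 2.0 profile in $schema
    return 2 if "/2.0/" in descriptor.get("$schema", "") else 1

def add_resource(package: dict, resource: dict) -> dict:
    if detect_version(resource) != detect_version(package):
        raise ValueError("Version mismatch: use upgrade_package() first")
    package = dict(package)
    package["resources"] = package.get("resources", []) + [resource]
    return package
```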


khusmann commented Sep 2, 2024

You bet!

This current design is looking good! It looks almost identical to the flows I was imagining with a frictionless2 package, but combined into a single package. I also like the explicit upgrades via upgrade_package() when we're putting everything in the same package rather than splitting off to frictionless2.

It will likely be useful to create some internal functions to separate the flows (e.g. add_resource_v1(), add_resource_v2(), ...)

I think this will be key -- my main concern with a single package will be needing to think about v1 stability every time I want to add a v2 feature -- it's one of those cases where duplicating some code can actually be beneficial because it provides isolation between the versions. Internal functions separating the flows can accomplish this as well.

but I don't think we need a frictionless2 package. In fact, it could complicate things (i.e. does it still support v1, where should the upgrade_package() function live, etc.)

I agree, a frictionless2 package isn't strictly needed; it just gives us a different set of trade-offs. I support whichever direction you go, I'm just thinking through the pros/cons.

I think there are clear answers to the concerns you list here though -- upgrade_package() would live as an internal function in frictionless2, because the frictionless v1 package would not support v2 packages, whereas frictionless2 would support v1 packages only via upgrading them.

In other words, a frictionless2 package would isolate these flows like this:

Note that upgrading is still an active user choice: If you don't want to upgrade to v2 and remain in v1-land, you use library(frictionless). If you want to upgrade to v2, you use library(frictionless2)

We don't mix versions (e.g. a v2 package with v1 resource).

What about when we encounter mixes in the wild when reading packages? For example a resource like this:

{
  ...v2 resource def
  "schema": {
    ...v1 schema def
  }
}

or the remote case?

{
  ...v2 resource def,
  "schema": "http://example.com/v1_schema.json"
}

Right now we don't say anything in the v2 spec that explicitly disallows this, right? ...maybe we should explicitly disallow this in the spec?

The downside to disallowing this is that I can see it as potentially a valid use case... It might occur when someone is using v2 features in their datapackage, but wants to reference a descriptor that has only been published as v1, for example.

In any case, this is where the sort of "always upgrade, but only write upgrades when modified" behavior I described above would potentially be useful -- if you were working in v2-land, it would load the remote v1 resources as v2 ones, but preserve their URL when serializing as long as they weren't modified.
