
How to convert a v1 to a v2 package #264

Open
20 of 46 tasks
peterdesmet opened this issue Aug 30, 2024 · 3 comments


peterdesmet commented Aug 30, 2024

I wanted to assess the complexity of converting a v1 to a v2 Data Package. Below are the steps that need to be taken. For version detection, see #262. @khusmann could you review these? There are a couple of items I'm unsure about.

Package

Add package.$schema, remove package.profile

Use package.profile, then remove it.

  • NULL => https://datapackage.org/profiles/2.0/datapackage.json
  • data-package (registered id) => https://datapackage.org/profiles/2.0/datapackage.json
  • tabular-data-package (registered id) => https://datapackage.org/profiles/2.0/datapackage.json. This also removes the deprecated tabular-data-package profile
  • fiscal-data-package (registered id) => Unsure, should we use the 1.0 URL for fiscal-data-package?
  • A URL => Unsure, the referenced schema will likely point to Data Package v1, making it a v1
  • Any other value => Unsure, not allowed by https://specs.frictionlessdata.io/profiles/
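The mapping above can be sketched as follows. This is a hypothetical Python sketch (the frictionless package itself is R, and function names here are illustrative); the URL, fiscal-data-package and "any other value" cases are left untouched because they are still undecided above.

```python
# Hypothetical sketch: map a v1 package "profile" to a v2 "$schema".
V2_PACKAGE_PROFILE = "https://datapackage.org/profiles/2.0/datapackage.json"

def upgrade_package_profile(package: dict) -> dict:
    """Replace package["profile"] with package["$schema"] where decided."""
    package = dict(package)  # shallow copy, don't mutate the input
    profile = package.pop("profile", None)
    if profile in (None, "data-package", "tabular-data-package"):
        package["$schema"] = V2_PACKAGE_PROFILE
    elif profile is not None:
        # Undecided cases (URL, fiscal-data-package, other): keep as v1
        package["profile"] = profile
    return package
```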

Add package.contributors.roles

  • For each contributor set roles (array) based on role (string). Remove role
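The role-to-roles step is mechanical; a hypothetical Python sketch (again, the real implementation would be R):

```python
# Hypothetical sketch: move each contributor's string "role" into
# an array "roles", as required by Data Package v2.
def upgrade_contributors(contributors: list) -> list:
    upgraded = []
    for contributor in contributors:
        contributor = dict(contributor)  # don't mutate the input
        role = contributor.pop("role", None)
        if role is not None:
            contributor["roles"] = [role]
        upgraded.append(contributor)
    return upgraded
```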

Other changes

Each resource

Add resource.$schema, remove resource.profile

Use resource.profile, then remove it

  • NULL => https://datapackage.org/profiles/2.0/dataresource.json
  • data-resource (registered id) => https://datapackage.org/profiles/2.0/dataresource.json
  • tabular-data-resource (registered id) => https://datapackage.org/profiles/2.0/dataresource.json (but see resource.type)
  • A URL => Unsure, the referenced schema will likely point to Data Package v1, making it a v1
  • Any other value => Unsure, not allowed by https://specs.frictionlessdata.io/profiles/
  • There is also the edge case where $schema is already present (i.e. a v1 package with a v2 resource). => Unsure, should the present resource.$schema be left as is then?

Add resource.type

Use resource.profile:

  • NULL => don't set
  • tabular-data-resource => table
  • Any other value or URL => don't set
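Since both resource.$schema and resource.type derive from resource.profile, the two steps can be combined. A hypothetical Python sketch, including the edge case where $schema is already present (left as is, which is one of the open questions above):

```python
# Hypothetical sketch: derive resource "$schema" and "type" from the
# v1 "profile". URL and other values are undecided and kept as v1.
V2_RESOURCE_PROFILE = "https://datapackage.org/profiles/2.0/dataresource.json"

def upgrade_resource_profile(resource: dict) -> dict:
    resource = dict(resource)
    if "$schema" in resource:
        return resource  # edge case: v1 package with a v2 resource
    profile = resource.pop("profile", None)
    if profile in (None, "data-resource", "tabular-data-resource"):
        resource["$schema"] = V2_RESOURCE_PROFILE
        if profile == "tabular-data-resource":
            resource["type"] = "table"
    else:
        resource["profile"] = profile  # URL or other value: undecided
    return resource
```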

Other changes

  • resource.sources: no change required
  • resource.name: rules are relaxed, existing names can remain as is
  • resource.path: dot-paths are now forbidden. In the edge case where such a path is provided, we should not convert it, because it is impossible to know what the correct path would be. These types of paths will be flagged when reading a resource.
  • resource.encoding: allows more, no action required

For each dialect

Note that upconverting a dialect requires a remote one to be downloaded and verbosely included.

Add dialect.$schema

  • dialect.caseSensitiveHeader is present => https://datapackage.org/profiles/1.0/tabledialect.json
  • dialect.csvddfVersion is present => https://datapackage.org/profiles/1.0/tabledialect.json
  • Otherwise this can safely be set to https://datapackage.org/profiles/2.0/tabledialect.json

Unsure about this though. For example, if a dialect was absent (very often the case), one will be added with just the $schema property. The alternative is to leave all dialects as v1 (assuming a $schema that defaults to https://datapackage.org/profiles/1.0/tabledialect.json). That would also mean that remote dialects can stay remote.
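The decision rule for which profile a dialect gets can be sketched like this (hypothetical Python, illustrative function name):

```python
# Hypothetical sketch: pick the profile a dialect should get when
# upgrading. caseSensitiveHeader and csvddfVersion were removed in v2,
# so dialects using them must stay on the v1 profile.
DIALECT_V1 = "https://datapackage.org/profiles/1.0/tabledialect.json"
DIALECT_V2 = "https://datapackage.org/profiles/2.0/tabledialect.json"

def dialect_schema(dialect: dict) -> str:
    if "caseSensitiveHeader" in dialect or "csvddfVersion" in dialect:
        return DIALECT_V1
    return DIALECT_V2
```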

Other changes

For each schema

Note that upconverting a schema requires a remote one to be downloaded and verbosely included.

Add schema.$schema

  • Set to https://datapackage.org/profiles/2.0/tableschema.json because we will update the schema to that version.

Update schema.primaryKey

  • Convert from string to array.

Update schema.foreignKeys

  • Convert schema.foreignKeys.fields from string to array
  • Convert schema.foreignKeys.reference[x].fields from string to array
  • If schema.foreignKeys.reference[x].resource = resource name => remove property
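The three schema steps above (set $schema, wrap primaryKey, wrap foreignKey fields and drop self-references) can be sketched together. A hypothetical Python sketch; the real R implementation may differ:

```python
# Hypothetical sketch: upgrade a v1 Table Schema in place.
V2_TABLE_SCHEMA = "https://datapackage.org/profiles/2.0/tableschema.json"

def upgrade_table_schema(schema: dict, resource_name: str) -> dict:
    schema = dict(schema)
    # primaryKey: string => array
    if isinstance(schema.get("primaryKey"), str):
        schema["primaryKey"] = [schema["primaryKey"]]
    # foreignKeys: wrap fields in arrays, drop self-referencing resource
    fks = []
    for fk in schema.get("foreignKeys", []):
        fk = dict(fk)
        if isinstance(fk.get("fields"), str):
            fk["fields"] = [fk["fields"]]
        ref = dict(fk.get("reference", {}))
        if isinstance(ref.get("fields"), str):
            ref["fields"] = [ref["fields"]]
        if ref.get("resource") == resource_name:
            ref.pop("resource")  # self-reference: remove property
        fk["reference"] = ref
        fks.append(fk)
    if fks:
        schema["foreignKeys"] = fks
    schema["$schema"] = V2_TABLE_SCHEMA
    return schema
```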

No action required

For each field

Other changes

@khusmann (Contributor)

Oof, yeah, the remote references really put a wrench in things. I suppose downloading and verbosely including is the way to go, in the same way that if we add a new field to a remote schema it must be downloaded and verbosely included.

I see now how preserving remote references without modification makes a good argument for mixing v1 and v2 descriptors within the same package, but I'm still very concerned about the complexity it creates.

In general remote references work ok for read-only datapackages, but behavior gets complicated / murky when any mutability AND round-tripping is desired. I think we'd both really like the ability to round-trip descriptors (incl. references to remote descriptors) when no modifications occur, but when we add remote references to the picture it means we need to track the modified state of our descriptors. If a property inside a remotely referenced descriptor is changed, the descriptor needs to decide whether that modification should trigger the descriptor being downloaded & modified. I think properly managing this state will become a big headache pretty quickly... so that brings us back to just downloading & verbosely including when upgrading.

Either that or we need to sit down and figure out an architecture that helps us track / manage remote descriptor modification state & handle roundtripping without it exploding in our faces. I think the design I describe in #252 can potentially help with this. We can have different classes for local vs remote descriptors which can have different serialization behaviors via get_descriptor(). Remote descriptors would serialize as a link to the remote descriptor, but when modified via set_prop() they would be converted into a local descriptor. When we load a remote v1 resource and update it to v2, we could still register it as remote resource descriptor, so that serializing it will still output the original v1 descriptor link (unless it was modified at any point by the user, whereupon it'd be converted into a local descriptor). Does that make sense? ...actually, this doesn't sound that bad the more I'm thinking about it.
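The local/remote descriptor split described above could look roughly like this. A minimal Python sketch with invented class and method names (set_prop(), get_descriptor()), purely to illustrate the serialization behavior, not an existing API:

```python
# Hypothetical sketch: remote descriptors round-trip as their URL
# until modified, at which point they become local descriptors and
# serialize verbosely.
class LocalDescriptor:
    def __init__(self, props: dict):
        self.props = dict(props)

    def set_prop(self, key, value):
        self.props[key] = value
        return self

    def get_descriptor(self):
        return dict(self.props)  # serializes verbosely

class RemoteDescriptor:
    def __init__(self, url: str, props: dict):
        self.url = url
        self.props = dict(props)  # the downloaded content

    def set_prop(self, key, value):
        # Any modification converts the descriptor into a local one
        return LocalDescriptor(self.props).set_prop(key, value)

    def get_descriptor(self):
        return self.url  # unmodified: round-trips as the original link
```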

The only thing I have a confident answer on is field.categories. I do not think fields with enum should be converted to categories because we can't assume every enum field is categorical, as you say. field.categories is a new feature for v2, so if someone wants to convert their enums to categories when upgrading v1 to v2 they can make this transition themselves.

@peterdesmet (Member, Author)

Thanks for thinking along @khusmann! I've been contemplating this a bit more and I came up with this:

flowchart TB
    classDef function color:#fff,fill:#5E8CD8,stroke:#5E8CD8,stroke-width:2px;
    classDef error color:#D86C5D,fill:#fff,stroke:#D86C5D,stroke-width:2px;
    
    v1_a["v1 datapackage.json"]
    v1_b["v1"]
    v1_c["v1"]
    v1_d["v1 datapackage.json"]
    v1_read_package("read_package()"):::function
    v1_create_package("create_package(v1)"):::function
    v1_add_resource("add_resource()"):::function
    v1_write_package("write_package()"):::function
    v1_schema["v1 schema"]
    v2_a["v2 datapackage.json"]
    v2_b["v2"]
    v2_c["v2"]
    v2_d["v2 datapackage.json"]
    v2_read_package("read_package()"):::function
    v2_create_package("create_package(v2)"):::function
    v2_add_resource("add_resource()"):::function
    v2_write_package("write_package()"):::function
    v2_schema["v2 schema"]
    
    b_upgrade_package("upgrade_package()"):::function
    c_upgrade_package("upgrade_package()"):::function
    
    subgraph v2_workflow
    v2_create_package --> v2_b
    v2_a ==> v2_read_package ==> v2_b ==> v2_add_resource ==> v2_c ==> v2_write_package ==> v2_d
    v2_schema -. optionally provided by user .-> v2_add_resource
    v2_add_resource -- user provided v1 schema --> v2_error["error, use upgrade_schema()"]:::error
    end

    v1_b --> b_upgrade_package --> v2_b
    v1_c --> c_upgrade_package --> v2_c
    %% v1_workflow --> b_upgrade_package --> v2_workflow
    
    subgraph v1_workflow
    v1_create_package --> v1_b
    v1_a ==> v1_read_package ==> v1_b ==> v1_add_resource ==> v1_c ==> v1_write_package ==> v1_d
    v1_schema -. optionally provided by user .-> v1_add_resource
    v1_add_resource -- user provided v2 schema --> v1_error["error, use upgrade_package()"]:::error
    end
  1. When reading or creating a v1 package, it stays v1. There is no automatic conversion. Adding resources doesn't change the version (in contrast with what I proposed in Detect Data Package version with version() function #262 (comment)) and neither does writing. This means that any current (v1) workflow is not affected. It also means we don't have to bump frictionless to a new major version (we can reserve that for an entire API change like API Brainstorm Thread #252)
  2. The same applies for reading or creating a v2 package. Users can decide to adopt a v2 workflow when they want. Once in v2, they stay in v2.
  3. Upgrading from v1 to v2 is an active user choice and is facilitated with upgrade_package(). It performs all the steps described above in the body of this issue. So it upgrades the package, all resources, schemas and dialects. Any remote schema will be downloaded, but the user is informed of that. Since everything gets upgraded, users can upgrade their v1 package at any point in their workflow. They could even do 1. write_package() + git commit. 2. upgrade_package() %>% write_package() + inspect git diff (I know I will).
  4. We don't mix versions (e.g. a v2 package with v1 resource). We have one manipulation function add_resource() where there could be a discrepancy: when a user provides a schema to add_resource() with a different version. In that case we can return an error and suggest to upgrade.
  5. It will likely be useful to create some internal functions to separate the flows (e.g. add_resource_v1(), add_resource_v2(), ...), but I don't think we need a frictionless2 package. In fact, it could complicate things (i.e. does it still support v1, where should the upgrade_package() function live, etc.)
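The version check in point 4 could look roughly like this. A hypothetical Python sketch of the control flow only (the real functions are R, and detect_version() is an invented helper along the lines of #262):

```python
# Hypothetical sketch: add_resource() refusing a resource whose
# version differs from the package's, suggesting an explicit upgrade.
def detect_version(descriptor: dict) -> int:
    # Invented heuristic: v2 descriptors carry a 2.0 profile in $schema
    return 2 if "/2.0/" in descriptor.get("$schema", "") else 1

def add_resource(package: dict, resource: dict) -> dict:
    if detect_version(resource) != detect_version(package):
        raise ValueError("Version mismatch: use upgrade_package() first")
    package = dict(package)
    package["resources"] = package.get("resources", []) + [resource]
    return package
```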


khusmann commented Sep 2, 2024

You bet!

This current design is looking good! It looks almost identical to the flows I was imagining with a frictionless2 package, but combined into a single package. I also like the explicit upgrades via upgrade_package() when we're putting everything in the same package rather than splitting off to frictionless2.

It will likely be useful to create some internal functions to separate the flows (e.g. add_resource_v1(), add_resource_v2(), ...)

I think this will be key -- my main concern with a single package will be needing to think about v1 stability every time I want to add a v2 feature -- it's one of those cases where duplicating some code can actually be beneficial because it provides isolation between the versions. Internal functions separating the flows can accomplish this as well.

but I don't think we need a frictionless2 package. In fact, it could complicate things (i.e. does it still support v1, where should the upgrade_package() function live, etc.)

I agree, a frictionless2 package isn't strictly needed; it just gives us a different set of trade-offs. I support whichever direction you go, I'm just thinking through the pros/cons.

I think there are clear answers to the concerns you list here though -- upgrade_package() would live as an internal function in frictionless2, because the frictionless v1 package would not support v2 packages, whereas frictionless2 would support v1 packages only via upgrading them.

In other words, a frictionless2 package would isolate these flows like this:

Note that upgrading is still an active user choice: If you don't want to upgrade to v2 and remain in v1-land, you use library(frictionless). If you want to upgrade to v2, you use library(frictionless2)

We don't mix versions (e.g. a v2 package with v1 resource).

What about when we encounter mixes in the wild when reading packages? For example a resource like this:

{
  ...v2 resource def
  "schema": {
    ...v1 schema def
  }
}

or the remote case?

{
  ...v2 resource def,
  "schema": "http://example.com/v1_schema.json"
}

Right now we don't say anything in the v2 spec that explicitly disallows this, right? ...maybe we should explicitly disallow this in the spec?

The downside to disallowing this is that I can see it as potentially a valid use case... It might occur when someone is using v2 features in their datapackage, but wants to reference a descriptor that has only been published as v1, for example.

In any case, this is where the sort of "always upgrade, but only write upgrades when modified" behavior I described above would potentially be useful -- if you were working in v2-land, it would load the remote v1 resources as v2 ones, but preserve their URL when serializing as long as they weren't modified.
