Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrroc schema #31

Open
wants to merge 18 commits into
base: main
Choose a base branch
from
Open

Wrroc schema #31

wants to merge 18 commits into from

Conversation

SalihuDickson
Copy link

@SalihuDickson SalihuDickson commented Oct 9, 2024

This PR aims to detail a conversion between the GA4GH WES profile to the Workflow Run Crate (WRC) extension of Ro-Crates.

This is a rewrite of the schema that is currently being used, to ensure conformity with the official RO-Crate specification.

Summary by Sourcery

Add a new schema document to facilitate the conversion of GA4GH WES profiles to Workflow Run Crate extensions, ensuring conformity with the RO-Crate specification.

New Features:

  • Introduce a new schema for converting GA4GH WES profiles to Workflow Run Crate (WRC) extensions of RO-Crates, detailing the necessary fields and their requirements.

Documentation:

  • Add comprehensive documentation in 'schema.md' to guide the conversion process between GA4GH WES profiles and WRC extensions, including field definitions and requirements.

Copy link

sourcery-ai bot commented Oct 9, 2024

Reviewer's Guide by Sourcery

This pull request introduces a new schema for converting between the GA4GH WES (Workflow Execution Service) profile and the Workflow Run Crate (WRC) extension of RO-Crates. The schema is defined in a new file, schema.md, which provides a detailed mapping of fields and structures between the two formats.

No diagrams generated as the changes look simple and do not need a visual representation.

File-Level Changes

Change Details Files
Introduction of a new schema for WES to WRC conversion
  • Define base object structure for RO-Crate
  • Specify metadata entity fields and their mappings
  • Detail root data entity structure and required fields
  • Define license object structure
  • Specify mainEntity data entity structure with WES field mappings
  • Include run_log, stdout, and stderr entity definitions
  • Define input and workflow parameter structures
  • Specify workflow type and task logs URL entities
schema.md
Comprehensive field mapping between WES and WRC
  • Map WES run_id to WRC identifier
  • Map WES state to WRC creativeWorkStatus
  • Map WES workflow_url to WRC url
  • Map WES tags to WRC keywords
  • Map WES workflow_engine to WRC runtimePlatform
schema.md
Inclusion of RO-Crate and WRC specific requirements
  • Define constant fields required by RO-Crate specification
  • Specify required and recommended fields for WRC compliance
  • Include JSON-LD context definitions
schema.md

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time. You can also use
    this command to specify where the summary should be inserted.

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @SalihuDickson - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟡 Documentation: 6 issues found

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.


## WORKFLOW RUN CRATE RO-CRATE FIELDS

For a Json object to conform with the Ro-Crate specification there are certain minimal requirements. Further still the WRC estension has it's own requirements. Considering that, the following definitions aim to differentiate the diffrent types of fields and how they should be considered in each implementation.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (documentation): Fix typos: 'estension' and 'diffrent'

Please correct 'estension' to 'extension' and 'diffrent' to 'different'.

schema.md Outdated
Comment on lines 11 to 14
1. constant: These fields are required and for our purposes the value of this field should always be the default value. In other words their values are constant.
2. required: These fields are required but their values are not constant.
3. recommended: These fields are not required but they are recommended to give a comprehensive definition for your RO-crate object.
4. optional: These fields are optional.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (documentation): Inconsistent capitalization in Terminology section

Consider making the capitalization consistent across all items in the Terminology section.

Suggested change
1. constant: These fields are required and for our purposes the value of this field should always be the default value. In other words their values are constant.
2. required: These fields are required but their values are not constant.
3. recommended: These fields are not required but they are recommended to give a comprehensive definition for your RO-crate object.
4. optional: These fields are optional.
1. Constant: These fields are required and for our purposes the value of this field should always be the default value. In other words their values are constant.
2. Required: These fields are required but their values are not constant.
3. Recommended: These fields are not required but they are recommended to give a comprehensive definition for your RO-crate object.
4. Optional: These fields are optional.

- **@context**(constant):
- **WES field**: N/A
- **type**: `string` | `[string]`
- **description**: URI pointing to the official context page of the version of Ro-Crate and any extensions being implemented.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (documentation): Clarify @context description

The description mentions 'any extensions', but the default value only includes one extension. Consider clarifying if multiple extensions are expected or adjusting the description.


- **WES field**: N/A
- **type**: `string`
- **description**: identifier for the dataset object. Must match the metadata.@id feild
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (documentation): Fix typo: 'feild'

Please correct 'feild' to 'field' in the description.

Comment on lines +89 to +92
reference:

1. [https://www.researchobject.org/ro-crate/specification/1.1/metadata.html](https://www.researchobject.org/ro-crate/specification/1.1/root-data-entity.html)
2. https://www.researchobject.org/workflow-run-crate/profiles/workflow_run_crate/
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (documentation): Inconsistent formatting of reference links

Consider making the formatting of reference links consistent. The first link is formatted as a proper link, while the second is plain text.

Suggested change
reference:
1. [https://www.researchobject.org/ro-crate/specification/1.1/metadata.html](https://www.researchobject.org/ro-crate/specification/1.1/root-data-entity.html)
2. https://www.researchobject.org/workflow-run-crate/profiles/workflow_run_crate/
reference:
1. [RO-Crate Specification 1.1: Root Data Entity](https://www.researchobject.org/ro-crate/specification/1.1/root-data-entity.html)
2. [Workflow Run Crate Profile](https://www.researchobject.org/workflow-run-crate/profiles/workflow_run_crate/)

- **identifier**(recommended):
- **WES field**: N/A
- **type**: `string`
- **description**: A reference to a relevant and comprehensive description of the license. May be a URI to the official webpage describing the license. Should match the `license.@id` field. Should be added as some algorithms may look for this instead.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (documentation): Clarify usage of 'identifier' field

The description mentions that 'some algorithms may look for this instead.' Consider providing more specific information about when and why this field might be used.

Suggested change
- **description**: A reference to a relevant and comprehensive description of the license. May be a URI to the official webpage describing the license. Should match the `license.@id` field. Should be added as some algorithms may look for this instead.
- **description**: A reference to a relevant and comprehensive description of the license. Typically a URI to the official webpage describing the license. Should match the `license.@id` field. This field is important for interoperability, as some algorithms and tools specifically search for an 'identifier' field when processing license information, rather than using `license.@id`.

@SalihuDickson
Copy link
Author

@uniqueg, this PR contains my proposed schema to convert GA4GH WES profiles to Ro-Crates. It has not been completed, but all required fields have been added, can you please take a look.

The document is pretty long though 😅

Copy link
Member

@uniqueg uniqueg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good work, thanks @SalihuDickson - I think it would be good to have a concluding overview/summary table that links the different WES properties to the different WRROC properties as a quick reference.

And also one that goes in the other direction and lists all required and recommended WRROC properties and checks if we cover them somehow just from info from a WES run.

Also, maybe for the WES properties that are not yet accounted for, could you indicate for each what problems you are facing?

schema.md Outdated
Comment on lines 49 to 58
- **@type**(required):

- **WES field**: N/A
- **type**: `string`
- **description**: The type of the content stored in the Ro-Crate.
- **default**:

```json
"CreativeWork"
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I read the docs, @type MUST be CreativeWork - so I guess this is also "constant" (as per your definition), not "required".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yh, I'll fix that.

schema.md Outdated
"CreativeWork"
```

- **conformsTo**(constant):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unindent

schema.md Outdated
"CreativeWork"
```

- **conformsTo**(constant):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's only partly "constant": the URL to the RO-Crate specification is always there, but possibly there may be a different number of extensions.

Also, it's of note that the URLs are versioned so may change over time. I suppose for a given version of our library, they will still be constant - just wanted to point out that it may need updates over time (so perhaps worth noting).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I can add this to the description. I just thought adding caveats like this would make the document much longer.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing is that the versions stated here are only meant to conform to the profile detailed in this documentation. So if the documentation is updated to reflect a new version then the default value for this should be updated as well.

This is the base Ro-Crate object.

- **@context**(constant):
- **WES field**: N/A
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you could distinguish between "N/A" meaning "not applicable" (i.e., we don't need WES here, it doesn't apply - so no need to think about the mapping; which is the case here) and "not available" (i.e., we would need this value from somewhere, according to whether it's required, recommended, or optional - but WES doesn't supply it).

These "WES field: not available" properties would be the most tricky ones!

Copy link
Author

@SalihuDickson SalihuDickson Oct 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess a clarification could be beneficial, but what N/A currently means throughout this document is "not applicable". Because all the required fields that WES does not provide usually fall into 1 of 3 categories; either they're constant, their values are descriptive, i.e a name or description, or the values can easily be generated like an ID. So I didn't really see a need to distinguish or which of these 3 scenarios would fall into the "not available" category.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps after we better distinguish which of these scenarios best fits the "not applicable" description then I can adjust the documentation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that for every value, we should try to come up with a possible source (WES, TES or otherwise), or multiple (see below) - and that should be our focus. Of course, within those, required properties MUST, recommended SHOULD and optional MAY have a source (or more).

So I guess what I'm trying to say is that maybe we don't need to think too much about what it means with respect to WES, but rather what the best source will be. Sometimes it's obvious (that could be the constants you are talking about, or a really good field from WES, or something that can be generated on the fly), sometimes it's not so obvious. In some cases, it might be a compound, or something that may or may not be present, depending on the context (e.g., what exactly the workflow run was like). For these, we should list all the possible sources that could help, and possibly define a priority of sources (e.g., if we have this, then that'd be the best, but if we don't, maybe this other thing is the next best option).

In the end we really need this summary table that outlines exactly that - an actionable mapping of sources for each WRROC property, i.e., precise enough that it can be implemented.

Copy link
Member

@uniqueg uniqueg Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically what we already have, but with more details (we did a much better job for TES).

Also: Not sure if you are on the official RO-Crate Slack. There were some issues discussed that other people had with mapping WES to WRROC a few months ago - and I don't think we ever picked them up or solved them. I'll share a link when I find the time next week.

schema.md Outdated
Comment on lines 11 to 14
1. constant: These fields are required and for our purposes the value of this field should always be the default value. In other words their values are constant.
2. required: These fields are required but their values are not constant.
3. recommended: These fields are not required but they are recommended to give a comprehensive definition for your RO-crate object.
4. optional: These fields are optional.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that these are technically two different classifications that are, at least in principle, independent of one another:

  • constant or not constant (bool)
  • required, recommended or optional

Perhaps in reality, all "constants" are required, but perhaps it need not be the case. So for clarity, perhaps it's best to explicitly state both classifications for each property, e.g., @context(required, constant).

schema.md Outdated

- **WES field**: N/A
- **type**: `{"@id": string}`
- **description**: contains an `@id` field pointing to the dataset entity of the RO-Crate. Must match the dataset.@id field.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarification: Must match the @id field of the root dataset.


## WORKFLOW RUN CRATE RO-CRATE FIELDS

For a Json object to conform with the Ro-Crate specification there are certain minimal requirements. Further still the WRC estension has it's own requirements. Considering that, the following definitions aim to differentiate the diffrent types of fields and how they should be considered in each implementation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should only focus on what's required for WRROC, not for RO-Crates in general. In other words, the library would take an existing valid RO-Crate and add the entitites to it that result from the workflow (or tool) run - but NOT generate an RO-Crate from scratch. Because if we support the latter, it's easy to lose focus and we will end up implementing an entire RO-Crate generating library. We'd need to allow users to specify that, and then that requires some frontend etc. That will be done elsewhere. Rather, we want to be able to build the WRROC-specific parts just from the WES or TES logs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand completely, does this mean the CrateGen tool won't be stand alone? And it'll need to be one part of a library that generates the RO-Crate fields?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say I would like to avoid for this tool to do things that require user intervention. But anything that isn't either, as you say, a constant, or can be fetched from TES or some other API call, likely needs to be provided by the user. And some of the stuff that you described that relates to RO-Crates in general (i.e., not the WRROC-specific stuff) may fall within that category.

Sure, we could avoid a CLI with options for the user to fill in RO-Crate descriptions and the like by instead defining some dummy defaults so that we can create self-contained RO-Crates - with the expectation that users would change these later on. But that's a bit risky - it might lead to bad quality RO-Crates! Also, that's not the core business of this tool, so shouldn't be high priority.

The more likely (and safer) use case in the real world is that there's already an RO-Crate in the system, maybe containing project, author, and funding information. So for now I think we should just focus on the the use case where the client passes an existing RO-Crate, and we put the WRROC-related entities inside and fill up from the generic stuff what absolutely needs filling up.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, that makes sense, i guess I can isolate those fields so we can fill only those.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can add both options, allow the user to insert the descriptive fields (or whatever other fields they would like to fill in as well), then provide the partially created RO-Crate to the CLI tool to fill in the remaining fields. Like this they can also fill in fields that can be generated then we can ask which fields they would like to overwrite or keep as is.

We can also add the option to generate the RO-Crate and then fill in the descriptive fields later. Personally, I feel this would provide a better user experience as the outstanding fields will pretty much already be underlined for them. Also I imagine this tool will not likely be used among people with no technical knowledge, as people without some technical knowledge they are more likely to stick to web based tools with interactive UIs. So we can probably expect the core user base to understand that they need to fill in the required fields after the CLI tool generates the RO-Crate.

Copy link
Author

@SalihuDickson SalihuDickson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the table, sure I'lll get on that. And I'll detail the reason for each outstanding field.

@uniqueg
Copy link
Member

uniqueg commented Oct 14, 2024

On the table, sure I'lll get on that. And I'll detail the reason for each outstanding field.

Great! I guess it's best to just extend the existing one.

@SalihuDickson
Copy link
Author

SalihuDickson commented Oct 15, 2024

On the table, sure I'lll get on that. And I'll detail the reason for each outstanding field.

Great! I guess it's best to just extend the existing one.

I've written the table in the schema document. Due to the differences btw both documents, I can't extend the existing one with this, as they don't match.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

2 participants