Use specific data formats #36

Open
bernt-matthias opened this issue Aug 31, 2022 · 4 comments
Comments

@bernt-matthias
Contributor

I just started to explore the qiime2 Galaxy tools. Starting, obviously, with the import tool, I noticed that the unspecific format="data" is often used, e.g.

<param name="data" type="data" format="data" help="This data should be formatted as a FastqGzFormat. See the documentation below for more information."/>

This should be avoided, in particular if there are corresponding datatypes in Galaxy. In this specific example format="fastq.gz" seems appropriate, but there are also fastqsanger.gz or fastqillumina.gz if a specific phred encoding is required.

@ebolyen
Member

ebolyen commented Aug 31, 2022

To accomplish this, we would need some kind of mapping of our formats to Galaxy formats. And since both of these frameworks allow extension, we'll probably always need "data" as an escape hatch. That said, there may be room to use EDAM to figure out mappings where they exist. I had imagined something along those lines a long time ago, but we've never really gotten around to it.
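
As a rough illustration of the mapping idea, here is a minimal sketch that assumes a hand-maintained lookup table. The names `QIIME2_TO_GALAXY`, `galaxy_format_for`, and `render_param` are hypothetical and not part of either framework; the only entry is the FastqGzFormat / fastq.gz pairing suggested earlier in this thread, and anything unmapped falls back to "data".

```python
# Hypothetical lookup table from QIIME 2 format names to Galaxy datatype
# extensions. Only the pairing mentioned in this thread is filled in.
QIIME2_TO_GALAXY = {
    "FastqGzFormat": "fastq.gz",
}

# Escape hatch for formats without a known Galaxy counterpart.
FALLBACK = "data"


def galaxy_format_for(qiime2_format: str) -> str:
    """Return the mapped Galaxy format, or "data" if no mapping is known."""
    return QIIME2_TO_GALAXY.get(qiime2_format, FALLBACK)


def render_param(name: str, qiime2_format: str) -> str:
    """Render a <param> element using the most specific known format."""
    fmt = galaxy_format_for(qiime2_format)
    return f'<param name="{name}" type="data" format="{fmt}"/>'
```

An EDAM-based approach could presumably fill the same table automatically wherever both sides carry matching annotations, leaving the fallback for everything else.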

From there it ought to be possible to observe that a Galaxy collection which contains entirely a certain type would be compatible with our file collection and thus constrain the collections available to import. (Do Galaxy collections have an observable format?)

@bernt-matthias
Contributor Author

Seems that the EDAM annotation is present for the Galaxy data types. Is there a list of qiime datatypes somewhere, maybe with EDAM annotations?

In general, collections can contain datasets of different types. On the tool side, the format attribute of param can also be used for data_collection inputs, but I'm not sure whether this checks all elements or only the first. We could check this and work on a fix if necessary; an additional validator might also be an option. Or one simply documents that users are required to use only uniform collections.
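
To make the "all or only the first element" question concrete, here is a small sketch of what a uniformity check could look like. It is plain Python over a list of extension strings, not Galaxy's validator API; `is_uniform` and `first_mismatch` are hypothetical helpers, and the comparison is an exact string match that ignores any subtype relationships between datatypes.

```python
def is_uniform(extensions, accepted):
    """Return True if every element's extension is in the accepted set.

    Note: an empty collection counts as uniform here.
    """
    return all(ext in accepted for ext in extensions)


def first_mismatch(extensions, accepted):
    """Return (index, extension) of the first offending element, or None."""
    for index, ext in enumerate(extensions):
        if ext not in accepted:
            return index, ext
    return None


# A mixed collection whose first element is fine: checking only the first
# element would accept it, while checking all elements rejects it.
mixed = ["fastq.gz", "fastq.gz", "tabular"]
print(is_uniform(mixed, {"fastq.gz"}))
print(first_mismatch(mixed, {"fastq.gz"}))
```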

I find the discussion on automatically generated tool wrappers quite enlightening, because it often sheds light on shortcomings of Galaxy (or its tool framework).

As a further comment on collections: they are a nice way to generate parallelism.

@bernt-matthias
Contributor Author

ping @ebolyen .. seems that we had the same ideas already earlier :)

As just discussed: I will try to produce a figure (or a hierarchical data format like yaml/json) of Galaxy's datatype hierarchy annotated with edam_format and edam_data entries .. then we can think of a mapping .. maybe with some help of @matuskalas

@bernt-matthias
Contributor Author

Created a little script over here. A few datatypes derive from more than one class, so I used only the first in the MRO.

Result can be found here.

If needed you can probably run it using

export PYTHONPATH=$(pwd)/lib/
python hierarchy.py

Some additional Python modules are probably needed.
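
For readers without the script at hand, the "first in the MRO" idea can be sketched roughly as follows. This is not the actual hierarchy.py: the classes are toy stand-ins rather than Galaxy's datatype classes, the class layout and `edam_format` values are only illustrative, and `first_parent` and `build_tree` are hypothetical names.

```python
import json


class Data:
    edam_format = "format_1915"


class Text(Data):
    edam_format = "format_2330"


class Sequence(Data):
    pass  # inherits edam_format from Data


class Fastq(Sequence):
    edam_format = "format_1930"


class TextSequence(Text, Sequence):
    pass  # two bases: only Text (first in the MRO) is used as parent


def first_parent(cls):
    """Return the first class after cls itself in the MRO."""
    return cls.__mro__[1]


def build_tree(root, classes):
    """Nest classes under their first MRO parent, starting at root."""
    children = sorted(
        (c for c in classes if c is not root and first_parent(c) is root),
        key=lambda c: c.__name__,
    )
    return {
        "name": root.__name__,
        "edam_format": getattr(root, "edam_format", None),
        "children": [build_tree(child, classes) for child in children],
    }


tree = build_tree(Data, [Data, Text, Sequence, Fastq, TextSequence])
print(json.dumps(tree, indent=2))
```

Taking only the first parent turns the class graph into a strict tree, at the cost of hiding the second base (here, `TextSequence` no longer appears under `Sequence`).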
