Add support for different file formats to spec #49
Sounds good! To be more flexible, what about something like this?
|
Makes sense, I will update the spec later and create an example project with it. |
@tischi I had another look at your proposal and it's not valid JSON:

"imageData": {
    "<NAME>": {"format": "bdv.xml", "storage": "fileSystem", "relativePath": "..xml"},
    "<NAME>": {"format": "ome.zarr", "storage": "s3", "https://s3...."}
}

i.e. the question is a bit how we deal with these names then. I think we should not restrict them, but then we somehow need to list the available storage options in MoBIE. So do we need another field in "project.json" for this, so that you know this when initialising the viewer without going through all sources? |
Good question... In the code it looks like this:
Which, I think we discussed before, could be made more general:
The question now is where the user specifies which of the image data stores (keys) they want to use. I don't know if we have to add this to the project.json, because one can also figure out all the options when parsing the specific dataset.json. Frankly, I don't know exactly, maybe we need to discuss in person. |
I proposed to put it in the project.json, because it is the first entry point for the MoBIE viewer. But if you would rather have it in the dataset.json, that's also fine with me. I see, though, that this changes the logic of opening MoBIE projects a bit, because now we don't know a priori which data sources are available, but need to parse some metadata for it, so the user cannot select this anymore without loading some metadata. Maybe we can find 5 minutes in the meeting later to discuss this. |
It's good to have it in the project.json, but then we have to maintain the same information in two places, which is not ideal... |
Alternatively, we could also keep the current logic of using "storage" as key and only allowing a fixed set of keys (currently "fileSystem" and "s3Store"). Then the only change would be that this maps to a map with the format and the source (= either relative path or address, we can still discuss how to actually encode this):

"imageData": {
    "fileSystem": {"format": "bdv.xml", "source": "some/file.xml"},
    "s3Store": {"format": "ome.zarr", "source": "https://s3.something.com"}
}

Then we don't need to duplicate the information and know beforehand what options are available (for now "fileSystem" or "s3Store"). |
I still find the whole logic a bit suboptimal:
Maybe we should also unify what we do for images and tables in terms of specifying the storage location and filetype. |
New idea:

"imageData": {
    "fileSystem": {"format": "bdv.xml", "source": "..."},
    "s3Store": {"..."}
},
"tableData": {
    "fileSystem": {"..."},
    "s3Store": {"..."},
    "github": {"..."}
} |
Ok, to summarise the changes discussed with @tischi:

"imageData": {
    "fileSystem": {"format": "bdv.n5", "source": "images/my_image_fs.xml"},
    "gitHub": {"format": "bdv.n5.s3", "source": "images/my_image_gh.xml"},
    "s3Store": {"format": "bdv.n5.s3", "source": "https://s3.com/my_image.xml"}
},
"tableData": {
    "fileSystem": {"format": "tsv", "source": "tables/some_tables"},
    "gitHub": {"format": "tsv", "source": "tables/some_tables"},
    "s3Store": {"format": "tsv", "source": "https://s3.com/tables/some_tables"}
}

This means: for "fileSystem" and "gitHub", the "source" is relative to the project directory / repository (which is derived from the entry point that the user provides, i.e. either the github repo or the path on the filesystem). For "s3Store", the "source" points to the s3 address. Note that the key only specifies where the 'image entry point' is located, which is different from the actual image data in the case of bdv.xml (it will be the same for ome.zarr though).

The case for tables is very similar; for now we support the format "tsv". In the future, we may support other formats like databases, see #23. For the current format, "source" always gives the location of the top directory that contains all the table files.

@tischi I will create a branch with an example for this and let you know once it's there.

Edit: changed "github" to "gitHub". |
@tischi I added an example for the updated spec here: |
Thanks!
Somehow I am still confused as to whether we are getting it right...
Maybe we could rethink this again, forgetting for now that we currently need two different xml, assuming that images are stored as ZARR. And then, maybe just in the java code do something special to navigate to the correct xml, but not explicitly specify both locations in the dataset.json?! |
We could have just two xml next to each other, one ending on .s3.xml. |
This will not work for the following case: read project data from github and access the image data on s3:
I don't really mind either way, but I like right now that the file format names are the same as used in the bdv xml file,
But this is the most consistent solution. It specifies where the image data entry point is located.
I don't like introducing implicit assumptions about the folder structure in the spec again. |
That's true, that could mean that when reading from github one needs to provide a root path to the image data, typically pointing to S3. I think that would make intuitive sense, because when reading from github, the image data cannot be hosted there, so it feels natural that one needs to specify the location. We could put the available object store locations into the project.json.
I think that could be OK, because it is a temporary (legacy) thing just because we need different xml right now. I think this will go away once we move to ZARR.

It also may not be ideal that in each and every xml we need to specify the S3 location. Specifying this once at "project startup" would be more lean and also allow us to copy the whole project onto a different S3 store without having to modify all xml. Should I have a look into this from the Java side whether I could make this work?

Overall I am trying to see whether we could find a solution such that really the same folder structure works on the fileSystem, on gitHub and on S3 and can be copied around without the need to change anything inside the folder structure. In the spirit of OME.ZARR: "Save once, read everywhere". I would hope this may be possible without having to explicitly specify different image and table locations for fileSystem, gitHub and S3, hoping that everything could be relative. I think what stops us currently is:
Essentially, I am thinking that if those are the only two issues that prevent us from a much simpler and more portable solution, I would rather try to address those two issues with specific solutions than adapt the whole storage location specification to accommodate them. What do you think? (I am also thinking in terms of writing the publication; I think it would be a selling point if everything worked with relative paths "everywhere".) |
I agree, that makes a lot of sense. (With the exception that we cannot store image data on github, but I think having the s3 store(s) in the project metadata is a good solution for this).
I still don't like introducing hard-coded folder structures for this. I have two alternative suggestions:
I agree, we can just put the available object store location(s) into the project metadata to cover this. There is one corner case that we don't cover this way: loading image data from different object stores in the same MoBIE instance. A use-case for this is a published project (where all the data is inside of a publicly available s3), where one then wants to add some other data that should be kept private (by putting it inside a protected s3), e.g. for a follow-up publication that is in progress. |
I agree 100% and will have a look into the Java code, probably I'll have some time tomorrow.
Great!
Something like this sounds good. I think also supporting absolute S3 paths (pointing to wherever) is very important, because it would allow us to create projects linking already hosted image data without the need to duplicate it into our folder structure. |
Ok, I will create a sample project for what I think makes sense and send it to you later and we can iterate based on that. |
So the new layout I would suggest:

project.json: specify the s3 root for accessing the image data. This is a list, so that multiple stores could be used, and each entry contains the relevant data for one store. By default, the first entry is used (selecting other roots would only be possible in some advanced mode):

"s3Root": [{"endpoint": "https://s3.com", "bucket": "my_bucket", "region": "us-west"}]

dataset.json: each source gives the relative path w.r.t. the dataset root, and optionally one can also give "s3Store", which overrides the default s3 address if the project is loaded from s3. Same for the table data:

imageData: {
    "format": "bdv.n5",
    "relativePath": "images/my_image.xml",
    "s3Store": "https://s3.com/my_image.xml"  # this is optional and overrides the default s3 address, which would be "s3Root" + "dataset" + "relativePath"
},
tableData: {
    "format": "tsv",
    "relativePath": "tables/my_tables",
    "s3Store": "..."
}

In addition, we modify the xml for bdv.n5 to allow pulling the information for the s3 access inside. I will add the suggestions for this once I worked on it. |
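As a concrete, made-up illustration of that default resolution: with the "s3Root" entry above and a dataset called "my_dataset" (a placeholder name), the image with "relativePath": "images/my_image.xml" would by default be looked up at roughly

https://s3.com/my_bucket/my_dataset/images/my_image.xml

i.e. endpoint + bucket + dataset + relative path.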
I am not sure there should be any s3 specific information inside of the xml, because ideally we want the xml to be portable. As said, I will have a look at the Java side of things, hopefully tomorrow. |
But we need to have s3 specific information inside of the xml, otherwise we don't know where to find the image data. |
Could you give an example for this? I was not aware this is the case. |
Sure, for example in the platybrowser case, the local paths are different for the local and remote image data: (We did not enforce any rules on this before, so I didn't always follow a completely fixed schema in the upload, esp. because uploading doesn't always just work with a single script call for the more complicated datasets.) |
And within one project are the paths within S3 consistent? Because if that's the case maybe using this as an s3 root would do? |
(including the absolute part in front, of course) |
No they are not necessarily consistent. For example for platybrowser due to the versioning scheme, that corresponds to files in different folders: |
Ah, I see this was because we were pointing to previous versions if the image existed already, is it? |
Yes, exactly. |
But then would specifying the "s3Store" be enough to handle those cases? We can also have a zoom tomorrow morning about this. |
Yes, that would also be an option, but a slightly messy one, because we need to upload additional xml files to the s3, which is quite some work compared to just changing things on github.
I would not specify any absolute s3 information in the xml, but just specify the relative path inside the bucket. My idea would be to just add one more field to the xml, something like an additional relative path for the s3 data. I would prefer the second option of slightly modifying the xml instead of slightly misusing the s3Store, because we would still have the idea of having one root s3 location and then only providing the relative paths (which are read from the xml). |
Maybe allowing for two relative paths in the xml? One for file system and one for s3? |
Yes, that's the idea |
I am still confused 🤔 When loading the project from github, I don't think the code should ever look at the xml on github, but at the xml on s3, because that's where the images are and the xml are part of the image data. |
Ah ok, now I see what you mean. I assumed we can still load xmls from github and then use the s3 information in there to load the actual image data. If we want to enforce putting the xmls on s3, we will also need to re-upload xmls to the s3 buckets. Also we would still need different relative paths in the xml, because of the different relative paths of the n5 files w.r.t. local and s3. |
Yes, but having different xml locally and on s3 would solve this without the need to do anything special.
Yes, but we also want to support opening the project only from s3; then the xml need to be there anyway, don't they? |
But then the xml has to be changed to be uploaded to s3 and changed back to be valid for the local project, and we cannot have both on the same git branch -> that's a huge mess for managing the data, esp. since we need this feature for the platy project, where the data is potentially still changing.
Yes, but I didn't really plan to make all the projects we have right now available through s3 only, because we have github as an entrypoint for them already. Nevertheless, supporting opening projects from s3 only is important, but the use case for this (at least from my side) is rather finished projects where nothing is expected to change, which can be uploaded to s3 in bulk. I think that this will be much easier for most users, who then don't have to deal with both github and s3. |
I have an example for the new proposed data spec now: https://github.com/mobie/arabidopsis-root-lm-datasets/tree/new-data-spec2. It's following the layout proposed here: #49 (comment).

I am still assuming that we can load the xmls from github right now; if we want to change that, the xmls (here) would need to be uploaded to the s3 under the right path as well. (But as I said, I would be very happy if we still allow reading the xmls from github, because otherwise making any changes to the projects will be much more work for me.) There is also a local version of the project on the EMBL group share: /g/kreshuk/pape/Work/mobie/arabidopsis-root-lm-datasets.

@tischi maybe it's best if we have a short zoom on this tomorrow, potentially after you had a quick look at the new layout. |
I think your current proposal would work. I understand it like this:
Right? If we stick to this logic, we could in fact switch to ZARR without needing the voxel size, because we could make an xml with a ZARR image loader, containing the voxel size. If we now decide to read the xml from GitHub, then in terms of Java code it would be easier for me, and stay consistent with the current logic of how the xml work, to have two xml next to each other (one ending on .s3.xml), as also proposed at some point. |
I am thinking now we could in fact also more or less leave it as it was, maybe just cleaning up the xmls, in the sense that we do not have two folders "local" and "remote", but just save them next to each other, with different endings. And then also keep the paths in the s3.xml absolute as we had them, which fixes all sorts of issues. The Java code would just pick the correct xml, based on some user choice.

And if we, for now, also use xml to open the ome.zarr, we don't need anything else, right? In fact, I just saw that I seem to have the code for opening ome.zarr via xml already, which as mentioned above would also fix the voxel size issue.

And, since we do not urgently need it, we postpone supporting relative s3 paths to a later edition? |
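To illustrate the cleaned-up layout (file names are placeholders; this is only a sketch of the idea above, not an agreed convention):

images/
    my_image.xml       # bdv xml for local access, with the relative path to the n5/hdf5
    my_image.s3.xml    # bdv xml for remote access, with the absolute s3 address (as before)

The Java code would then pick my_image.xml or my_image.s3.xml based on the user's choice.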
Cool, then we can also
@tischi the project with the new data spec is there:
I converted a couple more projects, we now have:
|
I have updated these repos now to have the new data spec on the
|
I could not find the other issue, so I am posting here. Does that look good for the auto-detection of the storage?
|
@tischi overall looks good, but the |
Thanks! Done via mobie/mobie-viewer-fiji@7abaf3b |
@tischi |
done |
Given that the ome.zarr format is progressing (ome/ngff#46) and that we may eventually want to support OpenOrganelle data (cc @bogovicj), I think we should add support for data formats that don't have a bdv.xml representation to the spec.
Currently, we store the data location in the spec like this:
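Roughly, an entry looks like the following sketch (the image name and paths are illustrative placeholders; the fixed storage keys currently in use are "fileSystem" and "s3Store"):

"storage": {
    "fileSystem": "images/local/my_image.xml",
    "s3Store": "images/remote/my_image.xml"
}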
Here, we assume that the values are (relative) file paths that point to a bdv.xml file that in turn contains either the path to a local n5/hdf5 file or the address for the s3 storage.
However, for ome.zarr we don't need an xml file, because all relevant metadata is stored in the zarr attributes. In addition, for data that is stored on s3 (or some other object store), we also don't need to have any corresponding file in the project, but instead need to store the address.
So I would propose to add a new field fileFormat to support this. Initially we would support bdv.xml and we can probably add ome.zarr very soon. For example:
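A sketch of what this could look like (the image names, paths, and address are illustrative placeholders, and the exact field layout is still up for discussion):

"image_a": {"fileFormat": "bdv.xml", "source": "images/local/image_a.xml"},
"image_b": {"fileFormat": "ome.zarr", "source": "https://some-s3-endpoint.com/bucket/image_b.ome.zarr"}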
or (eventually) for OpenOrganelle:
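For instance (again only a sketch; the format name and address are placeholders), OpenOrganelle publishes n5 volumes on a public s3 bucket, so the entry would point directly at the container address:

"some_organelle_volume": {"fileFormat": "n5.s3", "source": "https://some-s3-endpoint.org/open-organelle-dataset/em/raw.n5"}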