Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve usability of Directory datatype #17614

Draft
wants to merge 15 commits into
base: dev
Choose a base branch
from

Conversation

wm75
Copy link
Contributor

@wm75 wm75 commented Mar 6, 2024

This adds functionality to the Directory datatype class, which can now be displayed and downloaded as an archive.
It also adds a new archive_to_directory converter that generalizes the existing tar_to_directory one to work with tar and zip archives. Also updates the older converter's requirement to an existing version of the galaxy-util package. Previously the exact requirement wasn't installable via conda.

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. upload an archive as either zip or tar
    2. convert the datatype to directory
    3. explore

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@github-actions github-actions bot added this to the 24.1 milestone Mar 6, 2024
@wm75
Copy link
Contributor Author

wm75 commented Mar 6, 2024

@davelopez step 1 for zarr datatype integration

@wm75
Copy link
Contributor Author

wm75 commented Mar 6, 2024

@astrovsky01 could maybe be an interesting alternative to your colabfold tar archive?

Copy link
Contributor

@bernt-matthias bernt-matthias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to have some unit tests for setting metadata. Maybe here https://github.com/galaxyproject/galaxy/blob/dev/test/unit/data/datatypes/test_data.py.

There should be a bit of inspiration for such tests in the folder.

lib/galaxy/config/sample/datatypes_conf.xml.sample Outdated Show resolved Hide resolved
lib/galaxy/datatypes/data.py Outdated Show resolved Hide resolved
lib/galaxy/datatypes/data.py Show resolved Hide resolved
Comment on lines 1306 to 1309
if to_ext:
# Download the directory structure as an archive
trans.log_event(f"Download directory for dataset id: {str(dataset.id)}")
return self._archive_composite_dataset(trans, dataset, headers, do_action=kwd.get("do_action", "zip"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvdbeek this is the code that gets repeated to manage the downloading as an archive :)

The rest is for making the eye icon work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, not following, what do you mean by making the eye icon work ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The index stuff doesn't belong here, directories don't have indexes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that's the part that will be useful in subclasses (like zarr). That's why the candidate index files list for the parent class is empty. Could be implemented in just relevant subclasses of course.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, not following, what do you mean by making the eye icon work ?

well, displaying the root folder or the index file. With the current version of directory, when you click the eye icon, you'll just an empty file downloaded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be implemented in just relevant subclasses of course.

But a directory is a relevant, self-contained, out-there-in-the-wild datatype. Let's not add data where none is needed. subclasses should implement this if required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current version of directory, when you click the eye icon, you'll just an empty file downloaded.

Feel free to disable the eye icon on the client or implement a directory browser, but this does not belong into the datatype.

index_file = dataset.metadata.index_file
if not filename or filename == 'index':
if root_folder is not None and index_file:
# display the index file in lieu of the empty primary dataset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not the right layer to do this, that's a client concern.

@wm75
Copy link
Contributor Author

wm75 commented Mar 6, 2024

Ok, your opinion about the display_data part is a bit diappointing for me. I've spent a considerable amount of time looking at the existing code for the Data datatype class and actually thought that I had implemented the display part in the spirit of that parent class' code, but seems you don't think so.

Unfortunately, I don't understand, in particular, your distinction between the datatype's and the client's concern:
Data.display_data itself implements a poor man's directory display for the case that a link like datasets/ddaca2bad6847b13/display/dataset_22960784-8945-48d8-86fb-46d3c00a8b3e points to a filename in extra_files_path, which is actually a directory. In the case that the link leads to a file the same method will serve that file. So depending on its input params that method is making lots of decisions and can return very different things.
My proposed subclass method leaves almost all of this untouched, but special-cases one more situation, which is links like datasets/ddaca2bad6847b13/preview/. For this particular case, it will use the parent class' already existing directory display to show the contents of the root folder, which I don't think is a very far-fetched preview for a directory datatype.

Now for potential subclasses of Directory that may define index files, the display_data method will display that index file's content as a preview instead.

Data.display_data, in its docstring, also has the warning: Datatypes should be very careful if overriding this method and this interface between datatypes and Galaxy will likely change. so I thought it'd be better to implement enough flexibility in Directory.display_data so that subclasses won't immediately have to override the method again, but instead there's one hopefully carefully reviewed place where the magic happens. That's why I handle the index file case in the Directory class even though I agree with you that it's of no relevance for that class itself, but it's also logical to think that a number of directory subclasses will have something to use as an index file.

In general, there is no urgency here, and I do not intend to get into any heated argument over this. I'm willing to adjust the code and learn about your opinion and the reason behind it. Above are my reasons for implementing this first version like I did, and all I can say is that I gave this quite some thought, but never expected my first attempt to be perfect nor even close to it.

@mvdbeek
Copy link
Member

mvdbeek commented Mar 7, 2024

I should say that our datatype code is quite sub-optimal in so many places, in part due to us not being very strict with reviewing them (for a good reason, we want to collect all those domain-specific datatypes), and having had to resort to server-side templating for the longest time in Galaxy's history. That in turn means that it's not always appropriate to just model new code on existing code, especially if you're working on important datatypes that we have to build on going forward. Now I don't mind altering stuff in display_data, we can always remove that later, but I don't think we should add more data to the database that we're not using.

itself implements a poor man's directory display for the case

All of these should eventually go away, just like the "bam-to-sam-to-tabular" display. You can produce a listing of extra files via the API, which is what a directory browser visualization should use to implement a browsable interface. We shouldn't have to stick that into the database. Making the visualizations first-class is a priority on the roadmap and we're not far away from getting there.

Datatypes should be very careful if overriding this method and this interface between datatypes and Galaxy will likely change.

I would take this comment quite verbatim to mean that the interface can change, in terms of function signature or expected return values. That comment likely comes from an era when we still had datatypes on the tool shed. I don't think that the number of subclasses matters, and in my previous comment I suggested that you can implement a directory-style subclass and a concrete implementation that uses the additional data you want to store in the database, which is my number one issue with adding unused metadata elements. And then we'll also see how all that is actually used.

What's here is great, could you not just break away the extra metadata elements and add them as a different parent class for your zarr datatype ?

@mvdbeek mvdbeek removed this from the 24.1 milestone May 14, 2024
@mvdbeek mvdbeek self-requested a review May 14, 2024 14:29
@bgruening
Copy link
Member

The Biohackathon is coming up soon again. What is the status here? My understanding is that we need this for the upcoming Zarr datatype and Zarr Visualisation?

@davelopez davelopez marked this pull request as draft October 9, 2024 08:14
@davelopez davelopez force-pushed the archive-to-dir-converter branch 2 times, most recently from 687d71e to aebb357 Compare October 9, 2024 13:10
@davelopez davelopez marked this pull request as ready for review October 9, 2024 13:18
@github-actions github-actions bot added this to the 24.2 milestone Oct 9, 2024
@davelopez davelopez mentioned this pull request Oct 9, 2024
4 tasks
@davelopez davelopez force-pushed the archive-to-dir-converter branch 2 times, most recently from 0d5123b to af45cd1 Compare October 10, 2024 10:49
@davelopez davelopez marked this pull request as draft October 11, 2024 08:34
@davelopez davelopez marked this pull request as ready for review October 11, 2024 14:37
@@ -1212,6 +1212,18 @@ def regex_line_dataprovider(
class Directory(Data):
"""Class representing a directory of files."""

file_ext = "directory"

def _archive_main_file(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for roundtripping a directory via the API ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I can try!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a test, but I'm not sure why I'm getting ModuleNotFoundError: No module named 'galaxy' when running the converter tool... using the UI seems to work fine 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the API tests are setting up dependency resolution. I am mostly interested in verifying that the structure of the tar archive is the same pre-and-post upload. In fact I think even the checksum should match if we're not compressing the archive. In either case if you upload a tar file and download it again that should be sufficient. The converter is tested in the test framework.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I see, so you mean something like this fc7e959#diff-a6ab1700bcef9e1585a2bb0f84e8888470a770fb81c3e0337930e7cad573093fR662

I'll do that 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried this:

    def test_fetch_directory(self, history_id):
        testdir = TestDataResolver().get_filename("testdir.tar")
        with open(testdir, "rb") as fh:
            details = self._upload_and_get_details(
                fh, api="fetch", history_id=history_id, ext="directory", assert_ok=True
            )
        assert details["file_ext"] == "directory"
        assert details["file_size"] == 3584
        content = self.dataset_populator.get_history_dataset_content(
            history_id, dataset=details, to_ext="directory", type="bytes"
        )
        dir_path = decompress_bytes_to_directory(cast(bytes, content))
        assert dir_path.endswith("testdir")
        for path, entry_class in EXPECTED_CONTENTS.items():
            path = os.path.join(dir_path, os.path.pardir, path)
            if entry_class == "Directory":
                assert os.path.isdir(path)
            else:
                assert os.path.isfile(path)

But if I don't run the converter manually instead of the to_ext="directory" the extra_files_path is empty, I guess that is why you have more changes persisting extra files in the object store in your referenced branch dev...mvdbeek:galaxy:directory_datatype_improvements#diff-8640d91ef47bca302b00039012979f4b1b79f5dbffbe2431bc9a05f19fb4c7d0R132

Should we merge your branch instead? Is something still missing in your branch or should that be how to do it?

Sorry, I'm a bit lost 😅

@davelopez davelopez marked this pull request as draft October 14, 2024 11:59
Compressed (Upload) -> Directory (Unpack) -> Compressed (Download)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants