Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend image metadata #18951

Open
wants to merge 25 commits into
base: dev
Choose a base branch
from

Conversation

kostrykin
Copy link
Contributor

@kostrykin kostrykin commented Oct 8, 2024

This PR adds a series of basic metadata elements for image data, including:

This is useful to define validators for input data when working with images. Some examples of when this will be useful:

  • Require that an image is a binary image by validating that num_unique_values is 1 or 2.
  • Validate dtype: Some tools might not support float or int image data.
  • Validate that channels is 0 or 1: Restrict input data to single-channel images.
  • Validate axes, depth, channels, frames: Require that an image has one or more z-slices / channels / time steps.

TIFF files are read using the tifffile library, other image types are tried to be read using Pillow. The new metadata is defined as optional, because Pillow might not be installed, or it might not be possible to read an image using Pillow (e.g., due to an image format that Pillow does not support).

For multi-page TIFF files, the metadata is determined for each page individually, and then joined into a ,-separated string (with the order corresponding to the order of the pages in the series).

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@kostrykin

This comment was marked as resolved.

@kostrykin kostrykin marked this pull request as draft October 8, 2024 20:01
@bernt-matthias
Copy link
Contributor

bernt-matthias commented Oct 8, 2024

The problem with black will be solved here #18955

@kostrykin kostrykin marked this pull request as ready for review October 8, 2024 21:54
@kostrykin kostrykin marked this pull request as draft October 9, 2024 05:49
@bgruening
Copy link
Member

make format might help you

@bernt-matthias
Copy link
Contributor

One thing that I was thinking about is resource usage (time and memory) for setting the metadata. Is this limited in the current implementation?

For many other data types we restrict the amount of data that we process (often we just read 1MB prefix) -- but I guess this is not useful here.

@kostrykin kostrykin marked this pull request as ready for review October 9, 2024 11:48
@kostrykin
Copy link
Contributor Author

kostrykin commented Oct 9, 2024

Another thing to consider is that, if a tool requires certain metadata to be there, and it's not there because the data had been uploaded into Galaxy too long ago. This issue probably arises whenever new metadata is added in Galaxy? Are there any established procedures to cope with that?

If not, two possible solutions come to my mind. Either the tool should also accept a dataset for which the metadata is missing. Or Galaxy should automatically recognize that the metadata of a dataset is outdated and rerun the respective metadata extraction methods.

The former seems to be more easily feasible. It essentially means that a validator of the form <validator type="dataset_metadata_equal" metadata_name="key" value="val" /> should not only be successful if the metadata element key is set to val, but also if key is set to "". This means that we would have to add an optional attribute to all validators of type dataset_metadata_*, that is false by default for backwards compatibility, but should actually always be set to true in future when "new" metadata is validated. Makes sense?

@bernt-matthias
Copy link
Contributor

I think re-running metadata extraction is the way to go (users can trigger that -- if I'm not wrong).

With the metadata validator you can check if a specific metadata is set and add a message asking users to re-trigger setting metadata.

Adding an optional attribute to metadata validators might also be an option.

@kostrykin
Copy link
Contributor Author

kostrykin commented Oct 9, 2024

One thing that I was thinking about is resource usage (time and memory) for setting the metadata. Is this limited in the current implementation?

For many other data types we restrict the amount of data that we process (often we just read 1MB prefix) -- but I guess this is not useful here.

Definitely something to consider.

Given these considerations, I have made several changes in 1e6701f:

  • Pillow should now avoid loading the full image. Instead, most of the metadata is inferred directly from the PIL.Image object. I expect that .histogram() will still iterate over all pixel values, however, this could be more efficient than loading the full image data into memory (the documentation isn't specific on this).
  • With tifffile, the width, height, depth, frames, channels are now inferred without loading the full image data.

@kostrykin
Copy link
Contributor Author

kostrykin commented Oct 10, 2024

There is something strange going on with the tests, they hang kind of randomly.

Most of the time, both running locally using run_tests.sh -framework -id validation_image_metadata and when running in CI, it hangs on Test 3: https://github.com/galaxyproject/galaxy/actions/runs/11256985091/job/31300182332?pr=18951#step:9:2930

However, after removing the first two tests, Test 3 passes (this is then the first test), and it hangs on Test 4 instead (this is then the second test).

@bernt-matthias
Copy link
Contributor

Does it work locally with run_tests.sh? Have you tried planemo test?

@kostrykin kostrykin marked this pull request as draft October 10, 2024 08:53
@kostrykin
Copy link
Contributor Author

kostrykin commented Oct 10, 2024

Does it work locally with run_tests.sh? Have you tried planemo test?

Nope, same behavior using run_tests.sh -framework -id validation_image_metadata locally. If I remove all tests except the last one, it hangs on the last test. More specifically, the execution reaches this line but doesn't go beyond:

with tifffile.TiffFile(dataset.get_file_name()) as tif:

(note that this line was already there before this PR)

Haven't tried planemo test yet.

@kostrykin
Copy link
Contributor Author

kostrykin commented Oct 10, 2024

Does it work locally with run_tests.sh? Have you tried planemo test?

Nope, same behavior using run_tests.sh -framework -id validation_image_metadata locally. If I remove all tests except the last one, it hangs on the last test. More specifically, the execution reaches this line but doesn't go beyond:

with tifffile.TiffFile(dataset.get_file_name()) as tif:

The issue with the last test is fixed in f3e20d9. Actually this should be totally unrelated to the other tests, but somehow this also fixes their hanging (running both locally and in CI). Maybe a bug in the test execution?

@kostrykin kostrykin marked this pull request as ready for review October 10, 2024 09:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants