Reintroduce S3FileSystemStore #82
Conversation
The dependency was removed in the first instance since it was breaking the openmicroscopy build.
With the build on ome/openmicroscopy#6379 now green, hopefully this PR should be ready for some integration testing. As far as next steps, there are 2 options for testing: using bfoptions to load directly from S3, or using the new static method and bypassing the reader initialisation and loading of the memo file. To load directly from S3, the path should be added to the bfoptions file using the new option. This will also require the memo files to be re-generated. Note that in testing the memo files were not detected as being invalid, so I had to delete them first. An example of how this would look in the bfoptions file:
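For illustration only, a minimal sketch of what such an entry could look like (the option key and the endpoint/path below are assumptions on my part, not copied from this PR):

omezarr.alt_store=https://<s3-endpoint>/<bucket>/<prefix>/plate.ome.zarr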
Testing the new static method will require changes to how pixel data is retrieved. Currently this is specific to the ZarrReader, and any non-Zarr data will not be able to use this approach. Instead of initialising the reader via the Memoizer and calling openBytes, you can simply call the new method, pointing to the location of a zarray that you wish to open. A buffer also needs to be declared first.
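For context, the snippet below is not the PR's new static method (its exact signature is not quoted here); it is only a jzarr-level sketch of the same idea, assuming a local path and a 5D uint16 array: declare a buffer first, point at the zarray location, and read a plane without going through Memoizer/openBytes.

import com.bc.zarr.ZarrArray;

public class DirectZarrReadSketch {
  public static void main(String[] args) throws Exception {
    // hypothetical path to the resolution-0 zarray of a local OME-Zarr
    String zarrayPath = "/data/6001240.zarr/0";
    ZarrArray array = ZarrArray.open(zarrayPath);
    // buffer for a single 2D plane, declared up front (sizes are illustrative)
    short[] buffer = new short[512 * 512];
    int[] shape = {1, 1, 1, 512, 512};   // t, c, z, y, x extent to read
    int[] offset = {0, 0, 0, 0, 0};      // start of the plane
    array.read(buffer, shape, offset);
    System.out.println("Read " + buffer.length + " pixels");
  }
}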
Removed the ...
Conflicting PR. Removed from build BIOFORMATS-push#4. See the console output for more details.
--conflicts
Conflicting PR. Removed from build BIOFORMATS-push#808. See the console output for more details.
--conflicts
Conflicting PR. Removed from build BIOFORMATS-push#5. See the console output for more details.
--conflicts
Conflicting PR. Removed from build BIOFORMATS-push#809. See the console output for more details.
Testing today's build... Merged in https://merge-ci.openmicroscopy.org/jenkins/job/BIOFORMATS-push/6/console
restart server... Tested render_image performance on idr0010 - 1st plate using the existing memo file (same behaviour as without this PR). Then deleted the memo file and added the bfoption...
View image to regenerate memo... 13:45 -> approx 10 minutes... Timed render_image_region (whole plane) on idr-testing:omeroreadwrite (no microservices) via localhost:1080, selecting a Well each time, with the Preview panel open. Average approx 2 seconds with the goofys access (green, if we ignore the outliers > 3.5 secs) and average about 1 second with this PR (blue):
Repeat the testing above on idr0090 - first plate:
memo started 15:06...
It seems that the idr0090 plate above is still not viewable. Grepping the logs for Fileset path finds a lot of repetition... OMERO.server/var/log/Blitz-0.log
Going to try on a different study - idr0016, testing the memo file regeneration with and without the new bfoption.
Delete the memo files for the first 2 plates from idr0016...
Update the bfoptions of the first plate only...
Then view images to trigger memo file regeneration... After ~40 mins I noticed lots of ... Restarted the server... ~10:50. Compared time to ...
Comparing the log files from idr0090 with the regeneration of the idr0010 memo file above, idr0010 looks like this at the time that "memo saved" appears (there was no "memo saved" seen yet in the idr0090 logs above): idr0010:
The line with ...
As expected, once we have a memo file generated with the ...
Updating ZarrReader on all omeroreadonly servers on idr-testing...
Then restarted...
I have added the ...
Started memo file generation for all NGFF filesets on idr-testing, using ...
EDIT: 27th March 15:12 - getting DatabaseBusyExceptions.
Cancel screen, "Restart all servers":
21:14: restarted...
9:00 on 28th - current status:
10:30 pm on 29th March:
10:00 pm on 1st April:
It feels like memo file generation is taking longer than before, but I don't have comparable numbers... Since we have some timings for idr0090 memo generation at IDR/idr-metadata#686 (comment), and it would be useful to have the memo files for testing idr0090, let's cancel the run above and run idr0090 only... Status before cancelling:
Restarted on idr0090 only...
After 7 hours:
These times (avg 4249 if we omit the 11.96 where the memo already exists) look comparable to the times at IDR/idr-metadata#686 (comment). Still running...
Probably the biggest point of unclarity on my side is to understand how the reader will behave in the different backend scenarios. Based on my reading of the code I am expecting the following:
1- OME-Zarr on the local filesystem: JZarrServiceImpl.s3fs will be null and all underlying calls will use ZarrArray.open(String) or ZarrGroup.open(String)
2- OME-Zarr on a public AWS S3 bucket: JZarrServiceImpl.s3fs will initialize a S3FileSystemStore using a client with anonymous credentials, and all underlying calls will use ZarrArray.open(Store) and ZarrGroup.open(Store)
3- OME-Zarr on an S3-compatible object store backend (like EBI Embassy 4): as per the endpoint protocol check, JZarrServiceImpl.s3fs should be null and we fall back to scenario 1. How does the opening logic work in the context of the IDR testing in that case? Are we effectively making direct HTTPS calls using JZarr?
A separate question is related to the new openPlane API. Is that something that has been exposed / tested anywhere?
client = AmazonS3ClientBuilder.standard()
  .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(endpoint, "auto"))
  .withPathStyleAccessEnabled(true)
  .withCredentials(new AWSStaticCredentialsProvider(new AnonymousAWSCredentials())).build();
Probably something worth documenting in the readme limitations, i.e. that the S3 functionality will work exclusively unauthenticated.
Yeah, this is certainly a very limited solution that is tailored for the IDR use case. I would not particularly recommend this workflow to standard users, so the limitations would need to be clearly documented.
  zarrArray = ZarrArray.open(file);
}
else {
  LOGGER.warn("S3 access currently not supported");
  s3fs.updateRoot(getZarrRoot(s3fs.getRoot()) + stripZarrRoot(file));
The repetition of this call for all commands makes me wonder whether it should be part of the constructor?
The updateRoot call is really setting the location of the particular zarr array to be opened in each case, so unfortunately it is needed in multiple locations.
The array opening in lines 97-103 / 125-131, and the group opening in lines 113-119 / 137-143 / 149-155, do appear to be duplicate code though. It might be worth thinking about refactoring those into methods that can just open a ZarrArray and ZarrGroup.
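To make the suggestion concrete, a rough sketch of what such helpers could look like (the helper names and the store-based open call are illustrative, based on the quoted snippets rather than this PR's actual code):

private ZarrArray openZarrArray(String file) throws IOException {
  if (s3fs == null) {
    // local filesystem / https path
    return ZarrArray.open(file);
  }
  // S3 access via the file system store
  s3fs.updateRoot(getZarrRoot(s3fs.getRoot()) + stripZarrRoot(file));
  return ZarrArray.open(s3fs);
}

private ZarrGroup openZarrGroup(String file) throws IOException {
  if (s3fs == null) {
    return ZarrGroup.open(file);
  }
  s3fs.updateRoot(getZarrRoot(s3fs.getRoot()) + stripZarrRoot(file));
  return ZarrGroup.open(s3fs);
}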
Co-authored-by: Sébastien Besson <[email protected]>
I tried testing whether the goofys mount on ... Testing only on ...
Then browsed all NGFF data at http://localhost:1080/webclient/
So, this confirms that the ...
Before I start reviewing in depth, what's the best way to test this outside of IDR? Is it sufficient to just add the reader jar to the ...?
Yeah that should be sufficient to test, or using the 'with dependencies' jar from the build might be easiest.
I think I'm maybe misunderstanding how to evaluate these changes then. With ...
which suggests that ...
Using some simple test code:
then running ...
Running the same test code on an S3 dataset referenced above:
If I then update the code to set the ...
and retry:
That looks like a bfoptions file must exist in S3 no matter what? Or am I misunderstanding how to initialize the reader?
The showinf UnknownFormatException seems like the classpath has likely not been picked up correctly. I normally test bftools by modifying bf.sh to detect the jar, but I would have expected the classpath to work.
For this PR (and general usage), I would expect the ID used for setId to be a location on a local filesystem. If setting through ImageReader rather than ZarrReader, then it must be one of the files in the .zarr folder rather than the directory itself.
The real goal of the PR was to find a way to provide access to open S3 data from Embassy without having to reimport. It is definitely a specific solution for this unique use case rather than a general purpose solution. So for now the S3 access is very specific for use with the new alt store option with a public https endpoint such as: ...
If we want to add support for more general S3 handling, then one option would be using an existing S3 filesystem with JZarr such as https://jzarr.readthedocs.io/en/latest/amazonS3.html, though I would suggest that would be a follow-up.
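As a concrete illustration of the setId point above, a minimal sketch using the standard Bio-Formats API (the local path is hypothetical; note the ID points at a file inside the .zarr folder, not the folder itself):

import loci.formats.ImageReader;

public class LocalZarrSketch {
  public static void main(String[] args) throws Exception {
    ImageReader reader = new ImageReader();
    // point setId at a file inside the .zarr folder, e.g. .zattrs, not the directory
    reader.setId("/data/6001240.zarr/.zattrs");
    byte[] plane = reader.openBytes(0);  // first plane of the first series
    System.out.println("Series: " + reader.getSeriesCount() + ", plane bytes: " + plane.length);
    reader.close();
  }
}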
Tried to construct a minimal test scenario by downloading a public OME-Zarr IDR dataset 6001240.zarr locally i.e.
With this PR included, running the following command
works as expected, reading the metadata and all planes from the local OME-Zarr dataset. Constructed a secondary OME-Zarr where all the chunks (named ...
also works as I expect, reading the metadata but filling the missing chunks with zeros when navigating through the planes. Trying to combine this metadata-only OME-Zarr with the alternative store option as introduced in this PR via the command-line
still gives me black planes and the initialization includes the following warning
Is that the expectation?
Sorry @sbesson, the test you did is spot on but the result was not as expected. That was a mistake on my part in the last commit; it should be corrected now. Rerunning your last test should display the data as before, except being read from Embassy.
With the last commit
still returns black planes as per the missing chunks but
now loads the planes from the Embassy object store as expected
Test outlined in #82 (comment) does appear to work for me as well. If it's possible to capture that in an automated test, that might be worth doing here or in a follow-up PR.
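A very rough sketch of what such a test could look like (the option key, the use of DynamicMetadataOptions to pass it, and the fixture paths are all assumptions that would need checking against this PR):

import static org.junit.Assert.assertTrue;

import loci.formats.in.DynamicMetadataOptions;
import loci.formats.in.ZarrReader;
import org.junit.Test;

public class AltStoreTest {

  @Test
  public void planesAreReadFromAltStore() throws Exception {
    ZarrReader reader = new ZarrReader();
    DynamicMetadataOptions options = new DynamicMetadataOptions();
    // hypothetical option key and alternative store location
    options.set("omezarr.alt_store", "https://<s3-endpoint>/<bucket>/6001240.zarr");
    reader.setMetadataOptions(options);
    // metadata-only local copy (chunks removed), hypothetical path
    reader.setId("/tmp/6001240-metadata-only.zarr/.zattrs");
    byte[] plane = reader.openBytes(0);
    reader.close();
    // if chunks are fetched from the alt store, the plane should not be all zeros
    boolean nonZero = false;
    for (byte b : plane) {
      if (b != 0) {
        nonZero = true;
        break;
      }
    }
    assertTrue("expected non-zero pixel data from the alternative store", nonZero);
  }
}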
  return keys;
}

public ArrayList<String> getFiles() throws IOException {
This method appears to be nearly identical to getKeysFor(String) above, and might be worth refactoring.
It seems the getFiles method is no longer being used anywhere, so I have now removed it.
Functionally retested as described in #82 (comment) i.e. local OME-Zarr with chunks, local OME-Zarr without chunks and local OME-Zarr without chunks and with options file pointing to the S3 endpoint. Everything worked as expected.
The latest cleanup commits are useful to identify the changes introduced by this PR, i.e. the new S3 file system store, the new reader option and its implementation in ZarrReader, and the underlying changes to JZarrServiceImpl to read the alternate storage location. No objections to getting this merged in its current form. Eventually the ongoing IDR upgrade work will be the validation of this effort in large-scale deployments. Usability/performance problems arising as part of the testing can be captured and tackled as follow-up issues.
Re-adding the S3FileSystemStore from #38 in order to enable reading directly from Embassy S3. This will also add the awssdk dependency back into the project. Some modifications from the original have been made, and a new option has been added to allow an alternative S3 location to be set in the bfoptions file.