Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support CSV inputs for uploader, link to multiple objects #19

Merged
merged 4 commits into from
Aug 26, 2024

Conversation

DavidStirling
Copy link
Member

@DavidStirling DavidStirling commented Jul 30, 2024

  • Adds support for chunked uploading of CSVs from a file on disk.
  • Adds parameter to supply extra object targets to link to
  • Adds support for ROIs as the linking target. Show a warning if ROI selected as the primary target (not visible in OMERO.web)

omero2pandas.upload_table now supports either a pandas dataframe or CSV path as an input. With a CSV we need to establish the max length of any string columns, so we need to scan the file before uploading. Strategy is as follows:

  • Grab the head/first <chunk_size> rows of the CSV file. Determine OMERO column types and identify which, if any, contain strings.
  • Iterate through the entire file in chunks, loading just the string columns, so as to determine their max length. While we're at it we'll capture the row count too. Exclusively loading string columns gives us a significant performance benefit, probably since pandas isn't coercing all the numeric columns for every chunk.
  • Generate the OMERO.table, iterate through the file once more in chunks, uploading as we go.

Copy link
Member

@chris-allan chris-allan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this a lot. Some small comments are inline.

omero2pandas/upload.py Outdated Show resolved Hide resolved
omero2pandas/upload.py Outdated Show resolved Hide resolved
annotation_id = link_obj.child.id.val
LOGGER.info(f"Uploaded as FileAnnotation {annotation_id}")
extra_link_objs = []
unloaded_ann = omero.model.FileAnnotationI(annotation_id, False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we are not importing FileAnnotationI into the package scope?

else:
raise NotImplementedError(f"Column type "
f"{column_type} not supported")
if col_class == omero.grid.StringColumn:
Copy link
Member

@chris-allan chris-allan Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As below, import this? There are also features in omero.columns (which has the concrete implementations of the omero.grid column types) that you may find useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use a lot of classes from omero.grid and omero.model, so I'd just left it as a full import.

I wasn't aware that omero.columns existed, it's not mentioned in the main Python docs at all. Probably worth discussing whether there's any benefit to be had from using those instead of grid, and if so it could be best as a dedicated PR.

Copy link
Member

@chris-allan chris-allan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with #20 going to approve this so it can go in a release candidate. It's likely durable enough to ship on its own.

This PR will be for @erindiel, @sbesson and @emilroz to get integrated and shipping as required for other work next week.

Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few additional comments & suggestions from a strict code review perspective, some of them are mostly to clarify how the new feature is expected to be invoked.

I'll schedule some functional testing over the next few days

omero2pandas/upload.py Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested through a workflow creating an OMERO.table from a CSV and linking it both to an ROI and the parent Image using the new API.

Everything worked as expected using a CSV with 100k rows and 338 columns as the input. The table was successfully created and linked and could be used as previously using the API. The chunk size is correctly computed internally and the upload logic worked without issue creating the table.

I have not tested all the validation conditions for incorrect target objects but there are still discussions on #19 (comment). Assuming these need additional scoping time (and might be better captured as a separate issue), the only outstanding question from my side is #19 (comment) and whether the README should suggest to let the code auto-compute the chunk size by default (and mention chunk_size) for advanced usage.

@DavidStirling
Copy link
Member Author

Made revisions per @sbesson's suggestions

  • extra_links renamed to links
  • parent_type and parent_id are now optional arguments which get added to the links list before execution. This should make the new system backwards-compatible.
  • All targets in links are now treated equally.
  • We now check that all link targets are in the same OMERO group
  • In keeping with this, server call sequence is now create table -> get OriginalFile -> create FileAnnotation -> link FileAnnotation to targets. Previously it was create table -> get OriginalFile -> create FileAnnotation and first link -> link FileAnnotation to other targets
  • Updated readme for clarity

Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @DavidStirling, the latest API changes make sense to me and simplify the workflow when linking to multiple parent object at once. Successfully rested in the context described previously of a table linked to both a ROI and an Image.

I will run a final round of testing on Monday in particular to check some of the error conditions but overall, I think we should be in a good position to get this included and released as 0.3.0 and address bugs as we start consuming it.

@DavidStirling DavidStirling merged commit b52085b into glencoesoftware:main Aug 26, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Automate chunk size optimisation Support linking to extra objects Enhancement: Upload table from CSV file
3 participants