D3.1.4 Build aggregation tools #32

Open
tms-epcc opened this issue Mar 28, 2023 · 9 comments

Comments

@tms-epcc
Contributor

(D3.1.4, May 2024) Build appropriate aggregation tools, providing consensus from classifications provided by multiple volunteers

@tms-epcc
Contributor Author

tms-epcc commented Aug 3, 2023

03/AUG/23

  • Chris reported that the necessary aggregation tools have been created.
  • Next step is to produce an accompanying report
  • Expect to do this report in early 2024

@tms-epcc
Contributor Author

tms-epcc commented Aug 30, 2023

25/AUG/23
@chrislintott reported

  • ongoing discussions with the EPO team regarding the extent to which this deliverable may already have been done

@tms-epcc
Contributor Author

tms-epcc commented Jan 19, 2024

19/JAN/24
@chrislintott reported in https://docs.google.com/document/d/13mgVp2T9EWWeuTVkvrf_WEW-xG721zaSsmiCrAdWHwo/edit
that FY24 plan includes

  1. Batch aggregation (D3.1.4) - provide API endpoints and tools for more sophisticated handling of aggregation, so that it can run on particular subjects or subject sets, and on a programmatic or scheduled basis. Aggregation could also be triggered via the Python client or a notebook so that data can be returned to RSP. Development underway: target date end Feb.

@tms-epcc
Contributor Author

tms-epcc commented Jan 26, 2024

26/JAN/24

@tms-epcc
Contributor Author

27/MAR/24
@chrislintott reported that good progress is being made on this and currently sees no issues with the 31/MAY/24 due date.

@tms-epcc
Contributor Author

tms-epcc commented Apr 24, 2024

24/APR/24
@chrislintott reported

  • Some components have already been implemented and are being reviewed
  • still expect delivery to meet due date as planned

@tms-epcc
Contributor Author

29/MAY/24

  • as reported in FY24 Q2 QU

The existing Aggregations code (zooniverse/aggregation-for-caesar repo) offers an offline / local solution for processing classifications and producing subject-specific summary results. This greatly simplifies the post-processing required by Zooniverse project teams, but it can only be run once data has been downloaded. Instead, we want the code to run in response to a button push or API call, over a batch of recent classifications, facilitating the transfer of aggregated data back to the science platform. This required a new Zooniverse-hosted application endpoint to accept and process requests for aggregation of classification batches, which executes data ingest, extraction, reduction, and output data bundle creation. For now, the example implementation will be for binary workflows, but this could quickly be expanded. Progress has been good, with job management implemented (zooniverse/aggregation-for-caesar#783) and the individual pieces of functionality ready to add to Panoptes, the Zooniverse backend (zooniverse/panoptes#4303). Integrating these components will be the first task next quarter. We also completed testing of the project copier functionality, including small bug fixes (e.g. zooniverse/panoptes#4270).
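
For illustration only, the extraction and reduction steps for a binary workflow might look roughly like the sketch below. This is not the aggregation-for-caesar code: the record fields (`subject_id`, `answer`) and the functions `extract`, `reduce_subject` and `aggregate` are hypothetical names chosen for the example, and the reduction shown is a simple vote count per subject.

```python
from collections import Counter, defaultdict

# Hypothetical classification records; the field names are assumptions
# made for this sketch, not the Panoptes export format.
classifications = [
    {"subject_id": 1, "user": "a", "answer": "yes"},
    {"subject_id": 1, "user": "b", "answer": "yes"},
    {"subject_id": 1, "user": "c", "answer": "no"},
    {"subject_id": 2, "user": "a", "answer": "no"},
    {"subject_id": 2, "user": "b", "answer": "no"},
]


def extract(classification):
    """Extraction: pull the subject id and binary answer from one record."""
    return classification["subject_id"], classification["answer"]


def reduce_subject(answers):
    """Reduction: count the votes for one subject and compute the 'yes' fraction."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {
        "yes": counts["yes"],
        "no": counts["no"],
        "n": total,
        "yes_fraction": counts["yes"] / total,
    }


def aggregate(records):
    """Group extracted answers by subject, then reduce each group."""
    by_subject = defaultdict(list)
    for record in records:
        subject_id, answer = extract(record)
        by_subject[subject_id].append(answer)
    return {sid: reduce_subject(answers) for sid, answers in by_subject.items()}


result = aggregate(classifications)
```

Here `result` maps each subject id to its vote counts, which is the kind of subject-specific summary the paragraph above describes.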

@tms-epcc
Contributor Author

tms-epcc commented Nov 8, 2024

From draft FY24 AE

The major piece of technical work during the year has been the delivery of more advanced handling of data produced by Zooniverse citizen science projects, with the goal of facilitating the return of data to the Rubin Science Platform. We anticipate wanting to provide users both with raw data, consisting of individual classifications (‘User X saw subject Y in task Z and provided the following annotations…’), and aggregated data (‘Subject p has score X’).

The existing Zooniverse backend assumed that requests for data required everything from a project. For long-lived projects, this can produce very large files which are hard to handle; for example, Planet Hunters:TESS currently has about 50 million classifications in its database. As a result, API requests for data often failed silently because of the file size, and tasks such as updating an aggregation were slow because they had to be rerun from scratch rather than just including newer classifications.
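
As a rough sketch of the alternative to rerunning from scratch, per-subject vote counts can be updated with only the newer classifications. The function `update_counts` and the record fields below are hypothetical and are not taken from Panoptes or aggregation-for-caesar.

```python
from collections import Counter


def update_counts(existing, new_classifications):
    """Return updated per-subject vote counts without reprocessing old records.

    `existing` maps subject id -> Counter of answers seen so far (assumed shape).
    The input mapping is copied, so the caller's counts are left unchanged.
    """
    updated = {sid: Counter(counts) for sid, counts in existing.items()}
    for record in new_classifications:
        updated.setdefault(record["subject_id"], Counter())[record["answer"]] += 1
    return updated


# Counts from an earlier aggregation run (illustrative values).
existing = {1: Counter({"yes": 2, "no": 1})}

# Only the classifications received since that run.
newer = [
    {"subject_id": 1, "answer": "yes"},
    {"subject_id": 2, "answer": "no"},
]

updated = update_counts(existing, newer)
```

The cost of the update then scales with the number of new classifications rather than with the full history of the project.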

The batch aggregation project involved updates to Panoptes, the Zooniverse backend, to enable requests and processing to run only on some subset of a project’s classifications, either identified by subject set (e.g. a batch of images or light curves which were uploaded together) or by date. The result can be used both via the API, which will handle requests for data from the RSP, and by internal aggregation tools which update scores for use in task allocation or machine learning.
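
A minimal sketch of the kind of subset selection described above, assuming each classification record carries a `subject_set_id` and a timezone-aware `created_at` timestamp. Those field names and the `select_batch` helper are assumptions for illustration, not the Panoptes API.

```python
from datetime import datetime, timezone


def select_batch(classifications, subject_set_id=None, since=None):
    """Keep classifications in the given subject set and/or created at or after `since`."""
    selected = []
    for record in classifications:
        if subject_set_id is not None and record["subject_set_id"] != subject_set_id:
            continue
        if since is not None and record["created_at"] < since:
            continue
        selected.append(record)
    return selected


# Hypothetical records for the example.
classifications = [
    {"id": 1, "subject_set_id": 10, "created_at": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"id": 2, "subject_set_id": 10, "created_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
    {"id": 3, "subject_set_id": 20, "created_at": datetime(2024, 3, 2, tzinfo=timezone.utc)},
]

cutoff = datetime(2024, 2, 1, tzinfo=timezone.utc)
by_set = select_batch(classifications, subject_set_id=10)
recent = select_batch(classifications, since=cutoff)
both = select_batch(classifications, subject_set_id=10, since=cutoff)
```

Either filter can be used alone or the two can be combined, so a request can cover one upload batch, one date range, or the overlap of both.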

@tms-epcc
Contributor Author

tms-epcc commented Dec 2, 2024

02/DEC/24

  • deliverable submitted for review
