add process.py for SR project #75

Closed
keighrim opened this issue Dec 8, 2023 · 4 comments · Fixed by #83

@keighrim
Member

keighrim commented Dec 8, 2023

using the new gold column naming convention.

@clams-bot clams-bot added this to infra Dec 8, 2023
@github-project-automation github-project-automation bot moved this to Todo in infra Dec 8, 2023
@keighrim
Member Author

Related to clamsproject/app-swt-detection#41, I had a brief discussion with @marcverhagen, and we need to decide what the format of the gold files for SR annotations should be. Concretely, the first thing to decide is whether the gold is time (interval)-based, image-based, or both.

In case we want to keep both representations in the gold format: we've been using csv files with start and end columns in past SR-like projects (slates, chyrons), and I can't think of an easy way to keep that csv format (so that other eval.py files can be reused) while also storing image-level annotations in it as additional columns. This repo is also designed to allow only one format for golds, so if we can't find a single format that holds both levels of representation and have to generate two formats, we might need to reconsider that decision as well.

@keighrim keighrim added this to the eval-v1 milestone Jan 29, 2024
@keighrim
Member Author

keighrim commented Jun 5, 2024

Given the way we restructured the SWT app to keep image annotations (TimePoint annotations), I think we can keep only an image-based "gold" set for the SR project.

So the output format can be one csv file per cpb-.... ID:

# cpb-xxx-yyyyy
timepoint,label 
t1,B
t2,SH
...

This would cover all the "seen" timepoints in the raw data.

@keighrim
Member Author

keighrim commented Jun 5, 2024

On second look, since the "raw" portion of the annotation data is already organized by GUID, we probably don't need to introduce a new format for gold and can instead just copy the raw files into the gold dir.

@keighrim
Member Author

keighrim commented Jun 6, 2024

Looking at the files a third time, it looks like we can actually benefit from altering the columns a bit. Specifically, given this "raw" format

filename	seen	type label	subtype label	modifier	transcript	note
cpb-aacip-0acac5e9db7_01824989_00000000.jpg	true	B		false		
...
cpb-aacip-0acac5e9db7_01824989_00082015.jpg	true	S	H	false		
...
  1. rename the first column to timepoint or timestamps, and take only the last part of the jpg file name (I believe the number is in milliseconds, so we might need to re-format it based on our time unit convention for gold data: https://github.com/clamsproject/aapb-annotations/blob/main/repository_level_conventions.md)
  2. keep the "total" duration (second piece in the jpg file name) as a separate column
  3. remove rows that are not "seen"
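
The three steps above could be sketched roughly as follows. This is only an illustration, not the process.py that was eventually merged: the function names `ms_to_timestamp` and `raw_to_gold` are hypothetical, the new `total` column name is made up, and the `HH:MM:SS.mmm` output format is an assumption that should be checked against the conventions document linked above. It also assumes the raw files are tab-separated and that file names follow the `<GUID>_<total>_<timepoint>.jpg` pattern shown in the sample.

```python
import csv
import io


def ms_to_timestamp(ms: int) -> str:
    """Format milliseconds as HH:MM:SS.mmm (assumed target format)."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def raw_to_gold(raw_tsv: str) -> str:
    """Convert one raw tab-separated annotation file into gold csv text."""
    reader = csv.DictReader(io.StringIO(raw_tsv), delimiter="\t")
    # Replace `filename` with `timepoint` and `total`; keep other columns as-is.
    kept = [name for name in reader.fieldnames if name != "filename"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["timepoint", "total"] + kept,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Step 3: drop rows that annotators never looked at.
        if (row["seen"] or "").strip().lower() != "true":
            continue
        # File name pattern (assumed): <GUID>_<total ms>_<timepoint ms>.jpg
        stem = row["filename"].rsplit(".", 1)[0]
        _guid, total_ms, point_ms = stem.rsplit("_", 2)
        gold_row = {name: row[name] for name in kept}
        gold_row["timepoint"] = ms_to_timestamp(int(point_ms))  # step 1
        gold_row["total"] = ms_to_timestamp(int(total_ms))      # step 2
        writer.writerow(gold_row)
    return out.getvalue()
```

Since the raw data is already organized by GUID, a function like this could be applied to each raw file in turn, writing the result under the same file name in the gold dir.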

@jyoune jyoune linked a pull request Jun 7, 2024 that will close this issue
@github-project-automation github-project-automation bot moved this from Todo to Done in infra Jun 12, 2024