Feature/round two prep #100

Open
laurejt wants to merge 5 commits into develop
Conversation

laurejt
Contributor

@laurejt laurejt commented Oct 3, 2024

This includes my data prep scripts and additional recipe changes for round two.

@laurejt laurejt requested a review from rlskoeser October 3, 2024 00:31
@laurejt laurejt self-assigned this Oct 3, 2024
@laurejt laurejt changed the base branch from main to develop October 3, 2024 00:31
Collaborator

@rlskoeser rlskoeser left a comment
This all looks fine. The unit tests for the path utils are good.

I made some comments about possible improvements but none of those changes need to be made before merging.

I have some concerns about the continuing proliferation of scripts, but much prefer them to be in the repo than not.

The selection of random pages by volume is a clever solution. Right now it's generating an output file that can be used as input to the filter script. I'm wondering what it might look like to connect the methods so they didn't have to be separate calls. I'm not proposing to add more functionality to the already complicated filter script; I'm more thinking about whether we could make the internal logic reusable elsewhere.

Thanks for making notes about your steps to generate the dataset for this round. Once the dataset and revised recipe are deployed, it would be great for us to take some time to make sure all the details are clearly documented and think about what we want to improve for the next round. It still seems pretty hodgepodge at this point.

    return work_id.rsplit("-p", 1)[0]


def get_image_relpath(work_id, page_num):
Collaborator

Thanks for moving this method here; this seems like a good location.
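To illustrate the rsplit logic quoted above, a minimal sketch; the `-p<num>` page-id suffix format and the `get_work_id` name are my assumptions, not code from the PR:

```python
def get_work_id(page_id):
    # Strip the trailing "-p<page>" suffix from a page-level id to
    # recover the work id; mirrors the rsplit call quoted above.
    # Using rsplit with maxsplit=1 means only the last "-p" is removed,
    # so ids without the suffix pass through unchanged.
    return page_id.rsplit("-p", 1)[0]

# A hypothetical PPA-style page id:
# get_work_id("CW0123456789-p0042") -> "CW0123456789"
```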

vol_dir = get_vol_dir(vol_id)
source = get_ppa_source(vol_id)
if source == "Gale":
    image_name = f"{vol_id}_{page_num:04d}0.TIF"
Collaborator

TIF is the original extension; do we need a way to customize the extension when we call this method, or should it be handled somewhere downstream?
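One possible shape for that customization, as a hedged sketch; the `gale_image_name` helper and its `ext` keyword are hypothetical, only the zero-padded pattern comes from the quoted line:

```python
def gale_image_name(vol_id, page_num, ext="TIF"):
    # Same "<vol>_<page padded to 4 digits>0.<ext>" pattern as the
    # quoted code, with the extension lifted into a parameter so a
    # caller (or a downstream step) can vary it.
    return f"{vol_id}_{page_num:04d}0.{ext}"

# gale_image_name("CW0123456789", 5)            -> "CW0123456789_00050.TIF"
# gale_image_name("CW0123456789", 5, ext="tif") -> "CW0123456789_00050.tif"
```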

Comment on lines +7 to +11
The input CSV file must have the following fields:
* work_id: PPA work id
* page_start: Starting index for page range being considered for this work
* page_end: Ending index for page range being considered for this work
* poery_pages: Comma separated list of page numbers containing poetry
Collaborator

I don't know if my comment belongs here or on the google doc you shared with your steps for constructing the dataset, but how did you construct the input CSV?

Comment on lines +47 to +48
start_idx = int(row["page_start"])
end_idx = int(row["page_end"]) + 1
Collaborator

FWIW, PPA supports non-sequential page ranges in some cases. It probably doesn't matter here, but it would be good in future to use the same intspan Python library for consistency and simplicity.
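For reference, a minimal pure-Python sketch of the kind of non-sequential range parsing intspan provides (the library itself handles more edge cases; this toy stand-in is just to show the idea):

```python
def parse_pagespan(spec):
    # Parse a span string like "1-3,7,10-12" into a sorted list of ints,
    # allowing gaps between ranges, instead of a single start/end pair.
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)

# parse_pagespan("1-3,7,10-12") -> [1, 2, 3, 7, 10, 11, 12]
```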

Comment on lines +64 to +75
# Select remaining pages randomly
while page_counter < k:
    # Select work
    work_id = random.choice(list(page_pool.keys()))
    # Select page
    try:
        pg_id = random.choice(list(page_pool[work_id].keys()))
    except IndexError:
        # Encountered empty list, remove work entry and continue
        del page_pool[work_id]
        continue
    yield page_pool[work_id].pop(pg_id)
Collaborator

Seems like you could simplify this by using random.choices or random.sample
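A sketch of the `random.sample` route, assuming the `{work_id: {pg_id: page}}` pool shape from the snippet above. One caveat worth noting: this samples uniformly over pages, whereas the loop above picks a work first and then a page, so the two distributions differ when works have unequal pool sizes.

```python
import random

def sample_pages(page_pool, k):
    # Flatten {work_id: {pg_id: page}} into one list and draw k distinct
    # pages in a single call, replacing the pick-and-retry loop.
    all_pages = [page for pages in page_pool.values() for page in pages.values()]
    return random.sample(all_pages, k)
```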

Comment on lines +17 to +19
The resulting output CSV file has the following fields:
* work_id: PPA work id
* page_num: Digital page number
Collaborator

oh, I see, nice - this is the exact format we now support in the filter script
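For illustration, writing that two-column format with the stdlib csv module; the field names come from the docstring above, but the rows here are made up:

```python
import csv
import io

# Hypothetical selection results in the work_id/page_num shape.
rows = [
    {"work_id": "example-work-1", "page_num": 12},
    {"work_id": "example-work-2", "page_num": 3},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["work_id", "page_num"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```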
