-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/round two prep #100
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks fine. The unit tests for the path utils are good.
I made some comments about possible improvements but none of those changes need to be made before merging.
I have some concerns about the continuing proliferation of scripts, but much prefer them to be in the repo than not.
The selection of random pages by volume is a clever solution. Right now it's generating an output file that can be used as input to the filter script. I'm wondering what it might look like to connect the methods so it didn't have to be separate calls. Not proposing to add more functionality to the already complicated filter script, more thinking about whether we could make the internal logic reusable elsewhere.
Thanks for making notes about your steps to generate the dataset for this round. Once the dataset and revised recipe are deployed, it would be great for us to take some time to make sure all the details are clearly documented and think about what we want to improve for the next round. It still seems pretty hodgepodge at this point.
return work_id.rsplit("-p", 1)[0] | ||
|
||
|
||
def get_image_relpath(work_id, page_num): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for moving this method here, this seems like a good location
vol_dir = get_vol_dir(vol_id) | ||
source = get_ppa_source(vol_id) | ||
if source == "Gale": | ||
image_name = f"{vol_id}_{page_num:04d}0.TIF" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIF is the original extension, do we need a way to customize extension when we call this method or should it be handled somewhere downstream ?
The input CSV file must have the following fields: | ||
* work_id: PPA work id | ||
* page_start: Starting index for page range being considered for this work | ||
* page_end: Ending index for page range being considered for this work | ||
* poery_pages: Comma separated list of page numbers containing poetry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if my comment belongs here or on the google doc you shared with your steps for constructing the dataset, but how did you construct the input CSV?
start_idx = int(row["page_start"]) | ||
end_idx = int(row["page_end"]) + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, some PPA supports non-sequential page ranges. It probably doesn't matter in this case, but would be good in future to use the same intspan
python library for consistency and simplicity.
# Select remaining pages randomly | ||
while page_counter < k: | ||
# Select work | ||
work_id = random.choice(list(page_pool.keys())) | ||
# Select page | ||
try: | ||
pg_id = random.choice(list(page_pool[work_id].keys())) | ||
except IndexError: | ||
# Encountered empty list, remove work entry and continue | ||
del page_pool[work_id] | ||
continue | ||
yield page_pool[work_id].pop(pg_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like you could simplify this by using random.choices
or random.sample
The resulting output CSV file has the following fields: | ||
* work_id: PPA work id | ||
* page_num: Digital page number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I see, nice - this is the exact format we now support in the filter script
This includes my data prep scripts and additional recipe changes for round two.