Feature/round two prep #100

laurejt · 2024-10-03T00:31:35Z

This includes my data prep scripts and additional recipe changes for round two.

rlskoeser

This all looks fine. The unit tests for the path utils are good.

I made some comments about possible improvements but none of those changes need to be made before merging.

I have some concerns about the continuing proliferation of scripts, but much prefer them to be in the repo than not.

The selection of random pages by volume is a clever solution. Right now it's generating an output file that can be used as input to the filter script. I'm wondering what it might look like to connect the methods so it didn't have to be separate calls. Not proposing to add more functionality to the already complicated filter script, more thinking about whether we could make the internal logic reusable elsewhere.

Thanks for making notes about your steps to generate the dataset for this round. Once the dataset and revised recipe are deployed, it would be great for us to take some time to make sure all the details are clearly documented and think about what we want to improve for the next round. It still seems pretty hodgepodge at this point.

rlskoeser · 2024-10-04T18:13:01Z

src/corppa/utils/path_utils.py

+    return work_id.rsplit("-p", 1)[0]
+
+
+def get_image_relpath(work_id, page_num):


thanks for moving this method here, this seems like a good location

rlskoeser · 2024-10-04T18:13:48Z

src/corppa/utils/path_utils.py

+    vol_dir = get_vol_dir(vol_id)
+    source = get_ppa_source(vol_id)
+    if source == "Gale":
+        image_name = f"{vol_id}_{page_num:04d}0.TIF"


TIF is the original extension. Do we need a way to customize the extension when we call this method, or should that be handled somewhere downstream?
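One way the downstream option could look, as a minimal sketch: leave `get_image_relpath` returning the original `.TIF` name and swap the suffix in a separate helper. The helper name `with_image_ext` and the example path are hypothetical, not part of this PR.

```python
from pathlib import PurePosixPath


def with_image_ext(relpath, ext=None):
    """Return relpath unchanged, or with its suffix swapped for ext.

    Hypothetical helper: it only rewrites the file suffix and does not
    check that an image with the new extension actually exists.
    """
    path = PurePosixPath(relpath)
    if ext is None:
        return path
    # Accept both "jpg" and ".jpg"
    if not ext.startswith("."):
        ext = f".{ext}"
    return path.with_suffix(ext)


# Example with a made-up relative path
jpg_path = with_image_ext("some_vol_dir/some_vol_00010.TIF", "jpg")
```

Keeping the swap outside the path utility means callers that need the original file names are unaffected.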

rlskoeser · 2024-10-04T18:17:37Z

src/corppa/utils/generate_page_set.py

+The input CSV file must have the following fields:
+    * work_id: PPA work id
+    * page_start: Starting index for page range being considered for this work
+    * page_end: Ending index for page range being considered for this work
+    * poery_pages: Comma separated list of page numbers containing poetry


I don't know if my comment belongs here or on the google doc you shared with your steps for constructing the dataset, but how did you construct the input CSV?

rlskoeser · 2024-10-04T18:20:00Z

src/corppa/utils/generate_page_set.py

+            start_idx = int(row["page_start"])
+            end_idx = int(row["page_end"]) + 1


FWIW, PPA supports non-sequential page ranges in some cases. It probably doesn't matter here, but in future it would be good to use the same intspan Python library for consistency and simplicity.
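To illustrate what a non-sequential range implies for the `page_start`/`page_end` logic, here is a stdlib-only sketch that expands a span string into page numbers. The notation `"1-3,7,10-12"` and the function name `parse_page_span` are assumptions for illustration; the intspan library mentioned above would replace this helper in practice.

```python
def parse_page_span(span):
    """Expand a string such as "1-3,7,10-12" into a sorted list of ints.

    Assumed notation: comma-separated items, each either a single page
    number or an inclusive "start-end" range.
    """
    pages = set()
    for part in span.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            # Ranges are inclusive on both ends
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


page_numbers = parse_page_span("1-3,7,10-12")
```

A single `page_start`/`page_end` pair is then just the special case of one range.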

rlskoeser · 2024-10-04T18:20:42Z

src/corppa/utils/generate_page_set.py

+    # Select remaining pages randomly
+    while page_counter < k:
+        # Select work
+        work_id = random.choice(list(page_pool.keys()))
+        # Select page
+        try:
+            pg_id = random.choice(list(page_pool[work_id].keys()))
+        except IndexError:
+            # Encountered empty list, remove work entry and continue
+            del page_pool[work_id]
+            continue
+        yield page_pool[work_id].pop(pg_id)


Seems like you could simplify this by using random.choices or random.sample
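A rough sketch of one possible simplification, assuming `page_pool` maps each work id to a list of page records (the PR's actual structure appears to be a dict of dicts, so this is not a drop-in replacement). It shuffles each work's pages once with `random.sample` and then pops from the shuffled lists, which keeps the work-first selection and avoids the `try`/`except`. The name `sample_pages` and the `rng` parameter are hypothetical.

```python
import random


def sample_pages(page_pool, k, rng=random):
    """Yield up to k pages, choosing a work uniformly and then one of its pages."""
    # Shuffle each work's pages once; random.sample with the full length
    # returns a new shuffled list and leaves the input untouched.
    shuffled = {
        work_id: rng.sample(list(pages), len(pages))
        for work_id, pages in page_pool.items()
        if pages
    }
    selected = 0
    while selected < k and shuffled:
        # Pick a work uniformly among works that still have pages
        work_id = rng.choice(list(shuffled))
        yield shuffled[work_id].pop()
        selected += 1
        # Drop exhausted works so they are never chosen again
        if not shuffled[work_id]:
            del shuffled[work_id]
```

Note that calling `random.sample` on a single flattened list of all pages would be shorter, but it weights works by their page count rather than selecting by volume first.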

rlskoeser · 2024-10-04T18:21:22Z

src/corppa/utils/generate_page_set.py

+The resulting output CSV file has the following fields:
+    * work_id: PPA work id
+    * page_num: Digital page number


oh, I see, nice - this is the exact format we now support in the filter script
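For reference, a minimal sketch of writing that two-column format with the standard library. The function name `write_page_set` and the sample work id are made up for illustration and are not taken from the script.

```python
import csv
import io


def write_page_set(pages, out_file):
    """Write (work_id, page_num) pairs as a CSV with a header row."""
    writer = csv.DictWriter(out_file, fieldnames=["work_id", "page_num"])
    writer.writeheader()
    for work_id, page_num in pages:
        writer.writerow({"work_id": work_id, "page_num": page_num})


# Example with an in-memory buffer and a made-up work id
buffer = io.StringIO()
write_page_set([("example-work", 12), ("example-work", 40)], buffer)
```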

laurejt added 5 commits October 2, 2024 16:03

Add script for generating PPA page subset.

9cec2ac

Added path utilities for determining image paths

3d1eb8b

Added script for adding image paths to text corpus

0fa7880

Removed image path logic from add_metadata script

04533de

Removing task routing configs from recipe

bd3a908

laurejt requested a review from rlskoeser October 3, 2024 00:31

laurejt self-assigned this Oct 3, 2024

laurejt changed the base branch from main to develop October 3, 2024 00:31

rlskoeser approved these changes Oct 4, 2024
