SFR-2280: Limit Hathi Records Ingested #414

mitri-slory · 2024-10-24T17:10:12Z

This PR reduces the number of Hathi records ingested from 1000 a day to 200 a day for now, to address current issues with the un-FRBRized and unclustered records in QA. I also modified the readHathiFile method to match its equivalent in the seed_local_data process, because the language used there makes it clearer what the method does. This is step one of limiting the number of Hathi records ingested daily. The next step will be to address the backfill of Hathi files in QA, which take priority over more recent records from other sources in the FRBRization and clustering process.

processes/ingest/hathi_trust.py

mitri-slory · 2024-10-28T21:02:34Z

My most recent commit implements the feedback you both gave me. The book[14] element is the "date of last update" field of the Hathi data rows, which is described here: https://www.hathitrust.org/member-libraries/resources-for-librarians/data-resources/hathifiles/hathifiles-description/#:~:text=of%20Reason%20Codes.-,Date%20of%20last%20update,-rights_timestamp

I will be fixing the formatting and cleaning up the code in a future commit.

processes/ingest/hathi_trust.py

kylevillegas93 · 2024-10-30T17:12:47Z

processes/ingest/hathi_trust.py

+    def importRemoteRecords(self, start_date_time=None, full_or_partial=False):
+        self.importFromHathiTrustDataFile(start_date_time, full_dump=full_or_partial)


Also this is just a pass through to another function in this class. Can we just call importFromHathiTrustDataFile directly and remove this function?

I agree that the method here is unnecessary so I'll remove it and call importFromHathiTrustDataFile directly in the runProcess method.

kylevillegas93 · 2024-10-30T17:22:21Z

processes/ingest/hathi_trust.py

+                if hathi_date_modified > start_date_time:
+                    self.importFromHathiFile(hathi_file.get('url'), start_date_time)
+                    break       
+            if hathi_file.get('full') == full_dump and hathi_file.get('full') == False:


Hmm looks like this is the same if statement as above? Is that correct?

if hathi_file.get('full') == full_dump and hathi_file.get('full') == False:

Should this be elif hathi_file.get('full') and full_dump?

Oh thanks for catching that. That conditional should have a True boolean value for the second half of the and statement.

elif hathi_file.get('full') == full_dump and hathi_file.get('full') == True:

I'll add an else statement after that one to continue the for loop to catch the edge case where a hathi file doesn't have a boolean value or doesn't exist.

kylevillegas93 · 2024-10-30T17:23:45Z

processes/ingest/hathi_trust.py

-            if hathiFile['full'] == fullDump:
-                self.importFromHathiFile(hathiFile['url'])
+        for hathi_file in file_json:
+            if hathi_file.get('full') == full_dump and hathi_file.get('full') == False:


Can you simplify this to not hathi_file.get('full') and not full_dump?

Oh I get what you mean. That's a cleaner way to express the same logic.
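For reference, a minimal sketch comparing the two forms of the partial-file condition discussed above. The function names and sample dicts are hypothetical; only hathi_file.get('full') and full_dump come from the diff. The two forms agree whenever 'full' is a real boolean, but they differ when the key is missing, which is the edge case raised earlier in this thread.

```python
def original_partial(hathi_file, full_dump):
    # Condition as written in the diff under review.
    return hathi_file.get('full') == full_dump and hathi_file.get('full') == False


def simplified_partial(hathi_file, full_dump):
    # Suggested simplification: relies on truthiness instead of equality.
    return not hathi_file.get('full') and not full_dump


# Hypothetical file entries; real entries also carry a 'url' key.
boolean_cases = [
    ({'full': False}, False),
    ({'full': True}, False),
    ({'full': False}, True),
    ({'full': True}, True),
]

for hathi_file, full_dump in boolean_cases:
    # Both forms give the same answer when 'full' is a real boolean.
    assert original_partial(hathi_file, full_dump) == simplified_partial(hathi_file, full_dump)

# With no 'full' key, get() returns None: None == False is False,
# but `not None` is True, so the two forms disagree.
missing_key = {}
print(original_partial(missing_key, False), simplified_partial(missing_key, False))
```

If entries without a 'full' key should be skipped rather than treated as partial files, the simplified form would need an explicit check for None.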

kylevillegas93 · 2024-10-30T17:25:15Z

processes/ingest/hathi_trust.py

+                    self.importFromHathiFile(hathi_file.get('url'), start_date_time)
+                    break       
+            if hathi_file.get('full') == full_dump and hathi_file.get('full') == False:
+                self.importFromHathiFile(hathi_file.get('url'), start_date_time)


If this is the full ingest case, should we pass start_date_time here?

You're right. We don't need to pass it, since the importFromHathiFile method already defaults start_date_time to None when no argument is passed in by the caller.

kylevillegas93 · 2024-10-30T17:27:33Z

processes/ingest/hathi_trust.py

+
+            if len(book) >= 2:
+                book_right = book[2]
+                if len(book) >= 15:


If start_date_time is None, should we continue if we aren't able to get the book_date_updated?

The else: continue branch for the book_date_updated variable is unnecessary, since we should only check whether the variable exists before we set the hathi_date_modified variable. I'll remove the nested else branch and add a new if conditional for the hathi_date_modified variable like this:

if book_date_updated:
    hathi_date_modified = datetime.strptime(book_date_updated, '%Y-%m-%d %H:%M:%S').replace(tzinfo=None)
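A minimal sketch of that parsing step as a standalone helper. The helper name, the constant name, and the choice to return None for a malformed value are assumptions for illustration and are not in the PR; only the format string comes from the snippet above. Since strptime with this format already returns a naive datetime, the .replace(tzinfo=None) call does not change the result.

```python
from datetime import datetime

# Format string taken from the snippet in this thread.
HATHI_DATE_FORMAT = '%Y-%m-%d %H:%M:%S'


def parse_date_updated(book_date_updated):
    """Return a naive datetime, or None if the field is empty or malformed."""
    if not book_date_updated:
        return None
    try:
        return datetime.strptime(book_date_updated, HATHI_DATE_FORMAT)
    except ValueError:
        # Assumption: treat an unparseable date the same as a missing one.
        return None


# Hypothetical sample value in the format above.
parsed = parse_date_updated('2024-10-23 04:30:12')
print(parsed, parsed.tzinfo)
```

Returning None instead of leaving hathi_date_modified unset means a later comparison against start_date_time has a defined value to check.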

kylevillegas93 · 2024-10-30T17:31:04Z

processes/ingest/hathi_trust.py

+            if len(book) >= 2:
+                book_right = book[2]
+                if len(book) >= 15:
+                    book_date_updated = book[14]
+                else:
+                    continue


You can simplify this to:

book_right = (len(book) > 2 and book[2]) or None
book_date_updated = (len(book) > 14 and book[14]) or None

That's a good way of simplifying this code.
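A small sketch of how that expression behaves, wrapped in a hypothetical helper (field_or_none is not in the PR). The sample rows are made up. Note that an empty string in the column also comes back as None, because the and/or idiom treats any falsy value as missing.

```python
def field_or_none(row, index):
    # Same idiom as the suggestion above: a bounds check, then a falsy-to-None fallback.
    return (len(row) > index and row[index]) or None


short_row = ['id', 'allow']           # length 2: index 2 is out of range
full_row = ['id', 'allow', 'pd']      # index 2 present
empty_field = ['id', 'allow', '']     # index 2 present but empty

print(field_or_none(short_row, 2))    # no IndexError on the short row
print(field_or_none(full_row, 2))
print(field_or_none(empty_field, 2))
```

The strict > comparison is what avoids the off-by-one: a row of length 2 never reaches row[2].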

kylevillegas93 · 2024-10-30T17:31:54Z

processes/ingest/hathi_trust.py

-                logger.warning('Unable to read TSV row')
-                logger.debug(e)
+
+            if len(book) >= 2:


There's a bug here if the length of a book array is 2 but we try to access book[2]

Thanks for catching the typo. It should be len(book) >= 3

kylevillegas93 · 2024-10-30T17:36:19Z

processes/ingest/hathi_trust.py

+            if start_date_time:
+                if book_right and book_right not in self.HATHI_RIGHTS_SKIPS and hathi_date_modified > start_date_time:
+                    self.parseHathiDataRow(book)
+            else:
+                if book_right and book_right not in self.HATHI_RIGHTS_SKIPS:
+                    self.parseHathiDataRow(book)


I think since there are two common conditions with the book rights, you can simplify to:

if book_right and book_right not in self.HATHI_RIGHTS_SKIPS:
    if not start_date_time or hathi_date_modified > start_date_time:
        self.parseHathiDataRow(book)

Does that look right? Recommend reducing duplicate code.

I will remove the duplicate code. Good way to simplify the conditionals here. Definitely makes it more clear to read.
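A minimal sketch of the combined check as a pure function, so it can be exercised without the rest of the class. The function name and the placeholder skip values are hypothetical; the real list lives in self.HATHI_RIGHTS_SKIPS. The extra None check is an assumption added here: if start_date_time is set but a row has no parsed date, comparing None to a datetime would raise a TypeError.

```python
from datetime import datetime

# Placeholder values for illustration only; not the real skip list.
EXAMPLE_RIGHTS_SKIPS = {'ic'}


def should_parse(book_right, hathi_date_modified, start_date_time, skips=EXAMPLE_RIGHTS_SKIPS):
    """Return True when a row passes the rights check and, if set, the date check."""
    if not book_right or book_right in skips:
        return False
    if not start_date_time:
        # No cutoff supplied, so the rights check alone decides.
        return True
    # Assumption: a row without a parsed date is skipped when a cutoff is set.
    return hathi_date_modified is not None and hathi_date_modified > start_date_time


cutoff = datetime(2024, 10, 28)
print(should_parse('pd', datetime(2024, 10, 29), cutoff))
print(should_parse('pd', datetime(2024, 10, 27), cutoff))
print(should_parse('pd', None, cutoff))
```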

kylevillegas93 · 2024-10-31T13:51:12Z

processes/ingest/hathi_trust.py

+                    break       
+            elif hathi_file.get('full') and full_dump:
+                self.importFromHathiFile(hathi_file.get('url'))
                break


Do we want to break on lines 78 and 80 or do we want to import the file and keep iterating through the file_json list?

I kept the break because the original code here had a break after the call to importFromHathiFile. I'm only assuming it was put there so that only one Hathi file would be imported out of all the Hathi files, since they can be massive. I think it's fine to remove the breaks, since we will only ingest Hathi data rows that were updated in the past day, so fewer Hathi rows will be ingested in general. What are your thoughts?

if hathiFile['full'] == fullDump:
    self.importFromHathiFile(hathiFile['url'])
    break

SFR-2280: Limit Hathi Records Ingested

d162bfb

mitri-slory requested review from Apophenia and kylevillegas93 October 24, 2024 17:10

Apophenia reviewed Oct 24, 2024

Modified code to only daily ingest records updated in past day

6aa78b1

kylevillegas93 reviewed Oct 29, 2024
