SFR-2280: Limit Hathi Records Ingested #414

Merged
merged 8 commits into from
Oct 31, 2024

mitri-slory
Contributor

@mitri-slory mitri-slory commented Oct 24, 2024

This PR limits the number of Hathi records ingested to 200 a day, down from 1000 a day, for now to address current issues with the unfrbrized and unclustered records in QA. I also modified the readHathiFile method to match its equivalent in the seed_local_data process, because the language used there makes it clearer what is occurring in the method. This is step one of limiting the number of Hathi records ingested daily. The next step will be to address the backfill of Hathi files in QA, which take priority over more recent records from other sources in the frbrization and cluster process.

processes/ingest/hathi_trust.py (outdated, resolved)
@mitri-slory
Contributor Author

mitri-slory commented Oct 28, 2024

My most recent commit implements the feedback you both gave me. The book[14] element is the date of last update field of the Hathi data rows, which is described here: https://www.hathitrust.org/member-libraries/resources-for-librarians/data-resources/hathifiles/hathifiles-description/#:~:text=of%20Reason%20Codes.-,Date%20of%20last%20update,-rights_timestamp

I will be fixing the formatting and cleaning up the code in a future commit.

Comment on lines 40 to 41
def importRemoteRecords(self, start_date_time=None, full_or_partial=False):
    self.importFromHathiTrustDataFile(start_date_time, full_dump=full_or_partial)
Contributor

@kylevillegas93 kylevillegas93 Oct 30, 2024

Also this is just a pass through to another function in this class. Can we just call importFromHathiTrustDataFile directly and remove this function?

Contributor Author

I agree that the method here is unnecessary so I'll remove it and call importFromHathiTrustDataFile directly in the runProcess method.

if hathi_date_modified > start_date_time:
    self.importFromHathiFile(hathi_file.get('url'), start_date_time)
    break
if hathi_file.get('full') == full_dump and hathi_file.get('full') == False:
Contributor

Hmm looks like this is the same if statement as above? Is that correct?

if hathi_file.get('full') == full_dump and hathi_file.get('full') == False:

Contributor

Should this be elif hathi_file.get('full') and full_dump?

Contributor Author

Oh, thanks for catching that. That conditional should have a True boolean value for the second half of the and statement.

elif hathi_file.get('full') == full_dump and hathi_file.get('full') == True:

I'll add an else statement after that one to continue the for loop to catch the edge case where a hathi file doesn't have a boolean value or doesn't exist.

if hathiFile['full'] == fullDump:
    self.importFromHathiFile(hathiFile['url'])
for hathi_file in file_json:
    if hathi_file.get('full') == full_dump and hathi_file.get('full') == False:
Contributor

@kylevillegas93 kylevillegas93 Oct 30, 2024

Can you simplify this to not hathi_file.get('full') and not full_dump?

Contributor Author

Oh I get what you mean. That's a cleaner way to explain this same logic.
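
One caveat when swapping the two forms, sketched here with made-up manifest entries: they agree whenever `full` is an actual boolean, but they differ when the key is missing, because `None == False` is false while `not None` is true. The function names and URLs are hypothetical and exist only for this comparison.

```python
def is_partial_original(hathi_file, full_dump):
    # Condition as originally written in this PR.
    return hathi_file.get('full') == full_dump and hathi_file.get('full') == False


def is_partial_simplified(hathi_file, full_dump):
    # Suggested simplification from the review comment.
    return not hathi_file.get('full') and not full_dump


# Hypothetical manifest entries, not real HathiTrust data.
partial_entry = {'url': 'https://example.org/update.txt.gz', 'full': False}
full_entry = {'url': 'https://example.org/full.txt.gz', 'full': True}
missing_entry = {'url': 'https://example.org/unknown.txt.gz'}  # no 'full' key

for entry in (partial_entry, full_entry, missing_entry):
    print(entry.get('full'),
          is_partial_original(entry, False),
          is_partial_simplified(entry, False))
```

If entries without a `full` value should be skipped rather than treated as partial files, an explicit check for the missing key would be needed alongside the simplified form.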

    self.importFromHathiFile(hathi_file.get('url'), start_date_time)
    break
if hathi_file.get('full') == full_dump and hathi_file.get('full') == False:
    self.importFromHathiFile(hathi_file.get('url'), start_date_time)
Contributor

If this is the full ingest case, should we pass start_date_time here?

Contributor Author

You're right. The importFromHathiFile method already defaults start_date_time to None when no argument is passed in by the caller, so we don't need to pass it here.


if len(book) >= 2:
    book_right = book[2]
if len(book) >= 15:
Contributor

If start_date_time is None, should we continue if we aren't able to get the book_date_updated?

Contributor Author

The else continue conditional for the book_date_updated variable is unnecessary, since we should only check whether the variable exists before we set the hathi_date_modified variable. I'll remove the nested else conditional and make a new if conditional for the hathi_date_modified variable like this:

if book_date_updated:
    hathi_date_modified = datetime.strptime(book_date_updated, '%Y-%m-%d %H:%M:%S').replace(tzinfo=None)
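
As a rough, self-contained sketch of the parsing step described above (not the code that was merged), the snippet below reads index 14 of a row and compares it to a cutoff. The helper name `parse_date_updated`, the sample row, and the cutoff are all assumptions made for illustration. One thing it shows: with this format string, `datetime.strptime` already returns a naive datetime, so `.replace(tzinfo=None)` leaves the value unchanged.

```python
from datetime import datetime, timedelta

HATHI_DATE_FORMAT = '%Y-%m-%d %H:%M:%S'


def parse_date_updated(book):
    """Return the row's last-update datetime, or None if column 14 is absent or empty."""
    book_date_updated = book[14] if len(book) > 14 else None
    if not book_date_updated:
        return None
    return datetime.strptime(book_date_updated, HATHI_DATE_FORMAT)


# Made-up 15-column row; only index 14 matters for this sketch.
row = [''] * 14 + ['2024-10-27 08:15:00']

# Hypothetical cutoff: one day before an arbitrary reference date.
start_date_time = datetime(2024, 10, 27) - timedelta(days=1)

hathi_date_modified = parse_date_updated(row)
if hathi_date_modified and hathi_date_modified > start_date_time:
    print('row is newer than the cutoff')
```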

Comment on lines 113 to 118
if len(book) >= 2:
    book_right = book[2]
if len(book) >= 15:
    book_date_updated = book[14]
else:
    continue
Contributor

You can simplify this to:

book_right = (len(book) > 2 and book[2]) or None
book_date_updated = (len(book) > 14 and book[14]) or None

Contributor Author

That's a good way of simplifying this code.
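
A small sketch of how the suggested one-liner behaves, using a hypothetical helper `column_or_none` and made-up rows. It also shows why a guard of `len(book) >= 2` is not enough before reading `book[2]`, and that the `and ... or None` idiom turns an empty string into None as well.

```python
def column_or_none(book, index):
    # Same shape as the suggested one-liner: a missing column or a falsy
    # value (such as '') both come back as None.
    return (len(book) > index and book[index]) or None


# Made-up rows for illustration only.
full_row = ['id', 'access', 'pd']
short_row = ['id', 'access']
blank_row = ['id', 'access', '']

print(column_or_none(full_row, 2))
print(column_or_none(short_row, 2))
print(column_or_none(blank_row, 2))

# A two-element row passes a `>= 2` length guard but has no index 2.
try:
    if len(short_row) >= 2:
        short_row[2]
    old_guard_raised = False
except IndexError:
    old_guard_raised = True
```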

logger.warning('Unable to read TSV row')
logger.debug(e)

if len(book) >= 2:
Contributor

There's a bug here if the length of a book array is 2 but we try to access book[2]

Contributor Author

Thanks for catching the typo. It should be len(book) >= 3

Comment on lines 124 to 129
if start_date_time:
    if book_right and book_right not in self.HATHI_RIGHTS_SKIPS and hathi_date_modified > start_date_time:
        self.parseHathiDataRow(book)
else:
    if book_right and book_right not in self.HATHI_RIGHTS_SKIPS:
        self.parseHathiDataRow(book)
Contributor

I think since there are two common conditions with the book rights, you can simplify to:

if book_right and book_right not in self.HATHI_RIGHTS_SKIPS:
    if not start_date_time or hathi_date_modified > start_date_time:
        self.parseHathiDataRow(book)

Does that look right? Recommend reducing duplicate code.

Contributor Author

I will remove the duplicate code. Good way to simplify the conditionals here. Definitely makes it more clear to read.
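
To answer "does that look right?" with a quick check, the sketch below compares the two forms over a small grid of inputs. The function names are hypothetical, and the contents of `HATHI_RIGHTS_SKIPS` are a placeholder, since the real values are not shown in this thread.

```python
from datetime import datetime
from itertools import product

# Placeholder value; the real HATHI_RIGHTS_SKIPS is not shown in this thread.
HATHI_RIGHTS_SKIPS = {'skip-me'}


def should_parse_original(book_right, hathi_date_modified, start_date_time):
    # Nested form with the rights check repeated in both branches.
    if start_date_time:
        if book_right and book_right not in HATHI_RIGHTS_SKIPS and hathi_date_modified > start_date_time:
            return True
    else:
        if book_right and book_right not in HATHI_RIGHTS_SKIPS:
            return True
    return False


def should_parse_simplified(book_right, hathi_date_modified, start_date_time):
    # Suggested form: rights check once, then the optional date check.
    if book_right and book_right not in HATHI_RIGHTS_SKIPS:
        if not start_date_time or hathi_date_modified > start_date_time:
            return True
    return False


cutoff = datetime(2024, 10, 30)
rights = [None, 'skip-me', 'pd']
modified = [datetime(2024, 10, 29), datetime(2024, 10, 31)]
starts = [None, cutoff]

cases = list(product(rights, modified, starts))
mismatches = [c for c in cases
              if should_parse_original(*c) != should_parse_simplified(*c)]
print(len(cases), 'cases checked,', len(mismatches), 'mismatches')
```

The grid only covers rows where `hathi_date_modified` is set; a row with no parsed date and a non-empty `start_date_time` would need its own handling in either form.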

Comment on lines 78 to 81
    break
elif hathi_file.get('full') and full_dump:
    self.importFromHathiFile(hathi_file.get('url'))
    break
Contributor

@kylevillegas93 kylevillegas93 Oct 31, 2024

Do we want to break on lines 78 and 80 or do we want to import the file and keep iterating through the file_json list?

Contributor Author

I kept the break because the original code here had a break after the call to importFromHathiFile. I'm assuming it was put there so that only one hathifile would be imported out of all the hathifiles, since they can be massive. I think it's fine to remove them, since we will only ingest hathi data rows that were updated in the past day, so there won't be as many hathi rows ingested in general. What are your thoughts?

if hathiFile['full'] == fullDump:
    self.importFromHathiFile(hathiFile['url'])
    break
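
The trade-off in this exchange can be sketched with a toy manifest. This is a simplified stand-in that only collects URLs and leaves out the date check; `select_files`, the `stop_after_first` flag, and the manifest entries are all hypothetical.

```python
def select_files(file_json, full_dump, stop_after_first):
    """Return the URLs that would be imported from a manifest list."""
    selected = []
    for hathi_file in file_json:
        if bool(hathi_file.get('full')) == full_dump:
            selected.append(hathi_file.get('url'))
            if stop_after_first:
                # Mirrors the original behavior: import one file, then stop.
                break
    return selected


# Made-up manifest: two partial update files and one full dump.
manifest = [
    {'url': 'https://example.org/upd_1.txt.gz', 'full': False},
    {'url': 'https://example.org/upd_2.txt.gz', 'full': False},
    {'url': 'https://example.org/full.txt.gz', 'full': True},
]

print(select_files(manifest, full_dump=False, stop_after_first=True))
print(select_files(manifest, full_dump=False, stop_after_first=False))
```

With the break, only the first matching file is imported; without it, every matching file in the list is visited, so the per-row date filter is what keeps the volume down.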

@mitri-slory mitri-slory merged commit ee85835 into main Oct 31, 2024
1 check passed
@mitri-slory mitri-slory deleted the SFR-2280_Limit-Hathi-Records-Ingested branch October 31, 2024 15:17
3 participants