-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SFR-2280: Limit Hathi Records Ingested #414
Conversation
The most recent commit I made implemented the feedback you both gave me and I updated the code to reflect that. The book[14] element represents the date of last update field of the Hathi data rows which is described here: https://www.hathitrust.org/member-libraries/resources-for-librarians/data-resources/hathifiles/hathifiles-description/#:~:text=of%20Reason%20Codes.-,Date%20of%20last%20update,-rights_timestamp I will be fixing the formatting and cleaning up the code in a future commit. |
processes/ingest/hathi_trust.py
Outdated
def importRemoteRecords(self, start_date_time=None, full_or_partial=False): | ||
self.importFromHathiTrustDataFile(start_date_time, full_dump=full_or_partial) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this is just a pass through to another function in this class. Can we just call importFromHathiTrustDataFile directly and remove this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the method here is unnecessary so I'll remove it and call importFromHathiTrustDataFile
directly in the runProcess
method.
processes/ingest/hathi_trust.py
Outdated
if hathi_date_modified > start_date_time: | ||
self.importFromHathiFile(hathi_file.get('url'), start_date_time) | ||
break | ||
if hathi_file.get('full') == full_dump and hathi_file.get('full') == False: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm looks like this is the same if statement as above? Is that correct?
if hathi_file.get('full') == full_dump and hathi_file.get('full') == False:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be elif hathi_file.get('full') and full_dump
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh thanks for cataching that. That conditional should have a True boolean value for second half of the and statement.
elif hathi_file.get('full') == full_dump and hathi_file.get('full') == True:
I'll add an else statement after that one to continue the for loop to catch the edge case where a hathi file doesn't have a boolean value or doesn't exist.
processes/ingest/hathi_trust.py
Outdated
if hathiFile['full'] == fullDump: | ||
self.importFromHathiFile(hathiFile['url']) | ||
for hathi_file in file_json: | ||
if hathi_file.get('full') == full_dump and hathi_file.get('full') == False: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you simplify this tonot hathi_file.get('full') and not full_dump
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I get what you mean. That's a cleaner way to explain this same logic.
processes/ingest/hathi_trust.py
Outdated
self.importFromHathiFile(hathi_file.get('url'), start_date_time) | ||
break | ||
if hathi_file.get('full') == full_dump and hathi_file.get('full') == False: | ||
self.importFromHathiFile(hathi_file.get('url'), start_date_time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the full ingest case, should we pass start_date_time here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right since the importFromHathiFile method already sets start_date_time to None if no parameter is inherited from another method.
processes/ingest/hathi_trust.py
Outdated
|
||
if len(book) >= 2: | ||
book_right = book[2] | ||
if len(book) >= 15: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if start_date_time is None should we continue if we aren't able to get the book_date_updated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The else continue conditional for the book_date_updated
variable is unnecessary since we we should only check if the variable exists before we set the hathi_date_modified
variable. I'll remove the nested else conditional and make a new if conditional for the hathi_date_modified
variable like this:
if book_date_updated:
hathi_date_modified = datetime.strptime(book_date_updated, '%Y-%m-%d %H:%M:%S').replace(tzinfo=None)
processes/ingest/hathi_trust.py
Outdated
if len(book) >= 2: | ||
book_right = book[2] | ||
if len(book) >= 15: | ||
book_date_updated = book[14] | ||
else: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can simplify this to:
book_right = (len(book) > 2 and book[2]) or None
book_date_updated = (len(book) > 14 and book[14]) or None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good way of simplifying this code.
processes/ingest/hathi_trust.py
Outdated
logger.warning('Unable to read TSV row') | ||
logger.debug(e) | ||
|
||
if len(book) >= 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a bug here if the length of a book array is 2 but we try to access book[2]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching the typo. It should be len(book) >= 3
processes/ingest/hathi_trust.py
Outdated
if start_date_time: | ||
if book_right and book_right not in self.HATHI_RIGHTS_SKIPS and hathi_date_modified > start_date_time: | ||
self.parseHathiDataRow(book) | ||
else: | ||
if book_right and book_right not in self.HATHI_RIGHTS_SKIPS: | ||
self.parseHathiDataRow(book) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think since there are two common conditions with the book rights, you can simplify to:
if book_right and book_right not in self.HATHI_RIGHTS_SKIPS:
if not start_date_time or hathi_date_modified > start_date_time:
self.parseHathiDataRow(book)
Does that look right? Recommend reducing duplicate code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove the duplicate code. Good way to simplify the conditionals here. Definitely makes it more clear to read.
processes/ingest/hathi_trust.py
Outdated
break | ||
elif hathi_file.get('full') and full_dump: | ||
self.importFromHathiFile(hathi_file.get('url')) | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to break on lines 78 and 80 or do we want to import the file and keep iterating through the file_json list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept the break because the original code here had a break after we called importFromHathiFile which I'm only assuming was put here so only one hathifile would be imported from all the hathifiles since they could be so massive. I think it's fine to remove them since we will only ingest hathi data rows that were updated in the past day so there won't be as much hathi rows ingested in general. What are your thoughts?
if hathiFile['full'] == fullDump:
self.importFromHathiFile(hathiFile['url'])
break
This PR focuses on limiting the number of Hathi records ingested to 200 a day for now from 1000 a day to address current issues with the unfrbrized and unclustered records in QA. I also modified the
readHathiFile
method to match the method equivalent in theseed_local_data
process because the language used there is more clear as to what is occurring in the method. This is step one of attempting to limit the amount of Hathi records ingested daily and the next step will be to address the backfill of Hathi files in QA that take priority over more recent records from other sources in the frbrization and cluster process.