Merge branch 'main' of https://github.com/NYPL/drb-etl-pipeline into …
…kyle/SFR-1980
Showing 13 changed files with 401 additions and 116 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -32,22 +32,24 @@ Locally these services can be run in two modes: | |
#### Dependencies and Installation | ||
|
||
Local development requires that the following services be available. They do not need to be running locally, but for development purposes this is probably easiest. These should be installed by whatever means is easiest (on macOS this is generally `brew`, or your package manager of choice). These dependencies are: | ||
- PostgreSQL@10 | ||
- [email protected]> | ||
- PostgreSQL@10 | ||
- Note that v10 is deprecated. | ||
- [email protected] | ||
- Note you may need to follow the [macOS Homebrew install guide](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/brew.html#brew). | ||
- RabbitMQ | ||
- Redis | ||
- Xcode Command Line Tools | ||
|
||
This is a Python application and requires Python >= 3.6. It is recommended that a virtual environment be set up for the application (again use the virtual environment tool of your choice). | ||
This is a Python application and requires Python >= 3.6. It is recommended that a virtual environment be set up for the application (again use the virtual environment tool of your choice). There are several options, but most developers use [venv](https://docs.python.org/3/library/venv.html) or [virtualenv](https://virtualenv.pypa.io/en/latest/installation.html#). | ||
|
||
The steps to install the application are: | ||
|
||
1. Install dependencies, including Python >= 3.6, if not already installed | ||
2. Set up virtual environment | ||
3. Clone this repository | ||
3. Run `pip install -r requirements.txt` from the root directory | ||
4. Configure environment variables per instructions below | ||
5. Run `DevelopmentSetupProcess` per instructions below | ||
4. Run `pip install -r requirements.txt` from the root directory. If you run into the error ```pip: command not found``` while installing the dependencies, you may need to alias python3 and pip3 to python and pip, respectively. | ||
5. Configure environment variables per instructions below | ||
6. Run `DevelopmentSetupProcess` per instructions below | ||
|
||
#### Running services on host machine | ||
|
||
|
@@ -78,7 +80,7 @@ The docker compose file uses the sample-compose.yaml file in the `config` direct | |
|
||
To run the processes individually the command should be in this format: `python main.py --process APIProcess`. | ||
|
||
The currently available processes are: | ||
The currently available processes (with the exception of the UofSC and ChicagoISAC processes) are: | ||
|
||
- `DevelopmentSetupProcess` Initialize a testing/development database | ||
- `APIProcess` run the DRB API | ||
|
@@ -90,9 +92,20 @@ The currently available processes are: | |
- `NYPLProcess` Fetch files from the NYPL catalog (specifically Bib records) and import them | ||
- `GutenbergProcess` Fetch updated files from Project Gutenberg and import them | ||
- `MUSEProcess` Fetch open access books from Project MUSE and import them | ||
- `DOABProcess` Fetch open access books from the Directory of Open Access Books | ||
- `METProcess` Fetch open access books from The MET Watson Digital Collections and import them | ||
- `DOABProcess` Fetch open access books from the Directory of Open Access Books and import them | ||
- `LOCProcess` Fetch open access and digitized books from the Library of Congress and import them | ||
- `UofMProcess` Fetch open access books from the University of Michigan and import them | ||
- `CoverProcess` Fetch covers for edition records | ||
|
||
#### Appendix Link Flags (All flags are booleans) | ||
- `reader` Added to 'application/webpub+json' links to indicate if a book will have a Read Online function on the frontend | ||
- `embed` Indicates if a book will use a third-party web reader, such as HathiTrust's web reader, on the frontend | ||
- `download` Added to pdf/epub links to indicate if a book is downloadable on the frontend | ||
- `catalog` Indicates if a book is a part of a catalog which may not be readable online, but can be accessed with other means like requesting online | ||
- `nypl_login` Indicates if a book is a requestable book on the frontend for NYPL patrons | ||
- `fulfill_limited_access` Indicates if a Limited Access book has been encrypted and can be read by NYPL patrons | ||
|
||
#### Building and running a process in Docker | ||
|
||
To run these processes as a containerized process you must have Docker Desktop installed. | ||
|
@@ -156,3 +169,4 @@ And you're done! | |
- ~~Unit tests for all components~~ | ||
- Functional tests for each process | ||
- Integration tests for the full cluster | ||
- Update dependencies |
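
The link flags described in the appendix of the README diff above can be pictured as a plain dictionary of booleans. The following is a minimal sketch under that assumption, not code from the repository: `KNOWN_FLAGS` and `validate_flags` are hypothetical names introduced only for illustration.

```python
# Hypothetical illustration of the boolean link flags listed in the README.
# The flag names come from the README; the helper itself is an assumption.
KNOWN_FLAGS = {
    'reader', 'embed', 'download',
    'catalog', 'nypl_login', 'fulfill_limited_access',
}


def validate_flags(flags):
    """Return flags unchanged if every key is known and every value is a bool."""
    unknown = set(flags) - KNOWN_FLAGS
    if unknown:
        raise ValueError(f'Unknown link flags: {sorted(unknown)}')

    non_bool = [name for name, value in flags.items() if not isinstance(value, bool)]
    if non_bool:
        raise TypeError(f'Flags must be booleans: {sorted(non_bool)}')

    return flags


# A webpub manifest link for a Limited Access book that is not yet fulfilled
example_flags = validate_flags({
    'reader': True,
    'download': False,
    'fulfill_limited_access': False,
})
```

A check like this would catch a misspelled flag name or a string value such as `'true'` before it reaches the database.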
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,151 @@ | ||
import json | ||
import os | ||
import copy | ||
import logging | ||
from botocore.exceptions import ClientError | ||
|
||
from .core import CoreProcess | ||
from datetime import datetime, timedelta, timezone | ||
from model import Link | ||
from logger import createLog | ||
|
||
logger = createLog(__name__) | ||
|
||
class FulfillProcess(CoreProcess): | ||
|
||
def __init__(self, *args): | ||
super(FulfillProcess, self).__init__(*args[:4]) | ||
|
||
self.fullImport = self.process == 'complete' | ||
self.startTimestamp = None | ||
|
||
# Connect to database | ||
self.generateEngine() | ||
self.createSession() | ||
|
||
# S3 Configuration | ||
self.s3Bucket = os.environ['FILE_BUCKET'] | ||
self.host = os.environ['DRB_API_HOST'] | ||
self.prefix = 'manifests/UofM/' | ||
self.createS3Client() | ||
|
||
def runProcess(self): | ||
if self.process == 'daily': | ||
startTimeStamp = datetime.now(timezone.utc) - timedelta(days=1) | ||
self.getManifests(startTimeStamp) | ||
elif self.process == 'complete': | ||
self.getManifests() | ||
elif self.process == 'custom': | ||
timeStamp = self.ingestPeriod | ||
startTimeStamp = datetime.strptime(timeStamp, '%Y-%m-%dT%H:%M:%S') | ||
self.getManifests(startTimeStamp) | ||
|
||
def getManifests(self, startTimeStamp=None): | ||
|
||
'''Load batch of LA works based on startTimeStamp''' | ||
|
||
batches = self.load_batches(self.prefix, self.s3Bucket) | ||
if startTimeStamp: | ||
filtered_batches = batches.search(f"Contents[?to_string(LastModified) > '\"{startTimeStamp}\"'].Key") | ||
# Assuming load_batches returns a boto3 PageIterator, search() yields the matching keys directly | ||
for currKey in filtered_batches: | ||
metadataObject = self.getObjectFromBucket(currKey, self.s3Bucket) | ||
self.update_manifest(metadataObject, self.s3Bucket, currKey) | ||
else: | ||
for batch in batches: | ||
for c in batch['Contents']: | ||
currKey = c['Key'] | ||
metadataObject = self.getObjectFromBucket(currKey, self.s3Bucket) | ||
self.update_manifest(metadataObject, self.s3Bucket, currKey) | ||
|
||
def update_manifest(self, metadataObject, bucketName, currKey): | ||
|
||
metadataJSON = json.loads(metadataObject['Body'].read().decode("utf-8")) | ||
metadataJSONCopy = copy.deepcopy(metadataJSON) | ||
|
||
counter = 0 | ||
|
||
metadataJSON, counter = self.linkFulfill(metadataJSON, counter) | ||
metadataJSON, counter = self.readingOrderFulfill(metadataJSON, counter) | ||
metadataJSON, counter = self.resourceFulfill(metadataJSON, counter) | ||
metadataJSON, counter = self.tocFulfill(metadataJSON, counter) | ||
|
||
if counter >= 4: | ||
for link in metadataJSON['links']: | ||
self.fulfillFlagUpdate(link) | ||
|
||
self.closeConnection() | ||
|
||
if metadataJSON != metadataJSONCopy: | ||
try: | ||
fulfillManifest = json.dumps(metadataJSON, ensure_ascii = False) | ||
return self.putObjectInBucket( | ||
fulfillManifest, currKey, bucketName | ||
) | ||
except ClientError as e: | ||
logger.error(e) | ||
|
||
def linkFulfill(self, metadataJSON, counter): | ||
for link in metadataJSON['links']: | ||
fulfillLink, counter = self.fulfillReplace(link, counter) | ||
link['href'] = fulfillLink | ||
|
||
return (metadataJSON, counter) | ||
|
||
def readingOrderFulfill(self, metadataJSON, counter): | ||
for readOrder in metadataJSON['readingOrder']: | ||
fulfillLink, counter = self.fulfillReplace(readOrder, counter) | ||
readOrder['href'] = fulfillLink | ||
|
||
return (metadataJSON, counter) | ||
|
||
def resourceFulfill(self, metadataJSON, counter): | ||
for resource in metadataJSON['resources']: | ||
fulfillLink, counter = self.fulfillReplace(resource, counter) | ||
resource['href'] = fulfillLink | ||
|
||
return (metadataJSON, counter) | ||
|
||
def tocFulfill(self, metadataJSON, counter): | ||
|
||
''' | ||
The toc dictionary has no "type" key like the previous dictionaries | ||
therefore the 'href' key is evaluated instead | ||
''' | ||
|
||
for toc in metadataJSON['toc']: | ||
if 'pdf' in toc['href'] \ | ||
or 'epub' in toc['href']: | ||
for link in self.session.query(Link) \ | ||
.filter(Link.url == toc['href'].replace('https://', '')): | ||
counter += 1 | ||
toc['href'] = f'https://{self.host}/fulfill/{link.id}' | ||
|
||
return (metadataJSON, counter) | ||
|
||
def fulfillReplace(self, metadata, counter): | ||
if metadata['type'] == 'application/pdf' or metadata['type'] == 'application/epub+zip' \ | ||
or metadata['type'] == 'application/epub+xml': | ||
for link in self.session.query(Link) \ | ||
.filter(Link.url == metadata['href'].replace('https://', '')): | ||
counter += 1 | ||
metadata['href'] = f'https://{self.host}/fulfill/{link.id}' | ||
|
||
return (metadata['href'], counter) | ||
|
||
def fulfillFlagUpdate(self, metadata): | ||
if metadata['type'] == 'application/webpub+json': | ||
for link in self.session.query(Link) \ | ||
.filter(Link.url == metadata['href'].replace('https://', '')): | ||
print(link) | ||
print(link.flags) | ||
if 'fulfill_limited_access' in link.flags.keys(): | ||
if link.flags['fulfill_limited_access'] == False: | ||
newLinkFlag = dict(link.flags) | ||
newLinkFlag['fulfill_limited_access'] = True | ||
link.flags = newLinkFlag | ||
self.commitChanges() | ||
|
||
class FulfillError(Exception): | ||
pass |
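
The href rewriting done by `fulfillReplace` and its callers can be sketched without a database or S3. The example below is an illustration under stated assumptions rather than the project's implementation: the `link_ids` dictionary stands in for the `Link` table query, and `fulfill_replace`, `fulfill_section`, and the `example.org` host are hypothetical. It shows the counter being passed in and returned at every step, which is how `update_manifest` calls these methods.

```python
# Media types whose links are rewritten, as in fulfillReplace
FULFILL_TYPES = {'application/pdf', 'application/epub+zip', 'application/epub+xml'}


def fulfill_replace(entry, counter, link_ids, host):
    """Rewrite entry['href'] to a fulfill URL when the entry is a PDF or EPUB.

    link_ids maps a scheme-less URL to a link id; it is a stand-in
    (an assumption) for querying the Link table.
    Returns the possibly rewritten href and the updated counter.
    """
    if entry.get('type') in FULFILL_TYPES:
        link_id = link_ids.get(entry['href'].replace('https://', ''))
        if link_id is not None:
            counter += 1
            entry['href'] = f'https://{host}/fulfill/{link_id}'
    return entry['href'], counter


def fulfill_section(manifest, section, counter, link_ids, host):
    """Apply fulfill_replace to every entry in one manifest section."""
    for entry in manifest.get(section, []):
        entry['href'], counter = fulfill_replace(entry, counter, link_ids, host)
    return manifest, counter


manifest = {
    'links': [
        {'href': 'https://example.com/book.pdf', 'type': 'application/pdf'},
        {'href': 'https://example.com/manifest.json', 'type': 'application/webpub+json'},
    ],
}
link_ids = {'example.com/book.pdf': 12}

manifest, counter = fulfill_section(manifest, 'links', 0, link_ids, 'example.org')
```

Only the PDF link is rewritten here; the `application/webpub+json` link keeps its original href, and the counter records one replacement.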