
Commit

Merge branch 'main' of https://github.com/NYPL/drb-etl-pipeline into kyle/SFR-1980
kylevillegas93 committed Jun 18, 2024
2 parents d67f37a + de7874b commit 28a4465
Showing 13 changed files with 401 additions and 116 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -4,8 +4,12 @@
## Added
- New script to parse download requests from S3 log files for UMP books
- New script to update current UofM manifests with fulfill endpoints to replace pdf/epub urls
- Updated README with appendix and additions to available processes
- New process to add fulfill urls to Limited Access manifests and update fulfill_limited_access flags to True
- Updated README and added more information to installation steps
- Replaced the deprecated `datetime.utcnow()` method
## Fixed
- Resolved the format of fulfill endpoints in UofM manifests

## 2024-03-21 -- v0.13.0
## Added
30 changes: 22 additions & 8 deletions README.md
@@ -32,22 +32,24 @@ Locally these services can be run in two modes:
#### Dependencies and Installation

Local development requires that the following services be available. They do not need to be running locally, but for development purposes this is probably easiest. These should be installed by whatever means is easiest (on macOS this is generally `brew`, or your package manager of choice). These dependencies are:
- PostgreSQL@10
  - Note that v10 is deprecated.
- [email protected]
  - Note that you may need to follow the [macOS Homebrew install guide](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/brew.html#brew).
- RabbitMQ
- Redis
- XCode Command Line Tools
- XCode Command Line Tools

This is a Python application and requires Python >= 3.6. It is recommended that a virtual environment be set up for the application (again use the virtual environment tool of your choice). There are several options, but most developers use [venv](https://docs.python.org/3/library/venv.html) or [virtualenv](https://virtualenv.pypa.io/en/latest/installation.html#).

The steps to install the application are:

1. Install dependencies, including Python >= 3.6, if not already installed
2. Set up a virtual environment
3. Clone this repository
4. Run `pip install -r requirements.txt` from the root directory. If you run into the error `pip: command not found` while installing the dependencies, you may need to alias `python3` and `pip3` to `python` and `pip`, respectively.
5. Configure environment variables per the instructions below
6. Run `DevelopmentSetupProcess` per the instructions below
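Steps 2 and 4 can be sketched as follows, assuming `python3` is on your PATH (the repository clone and `requirements.txt` install are omitted so the sketch stands alone):

```shell
# Create a virtual environment with the standard library venv module
python3 -m venv .venv

# Activate it (equivalent to `source .venv/bin/activate` in bash)
. .venv/bin/activate

# pip is available inside the venv even if pip3 was not aliased globally
python -m pip --version
```

With the environment active, running `pip install -r requirements.txt` from the repository root installs the application's dependencies.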

#### Running services on host machine

@@ -78,7 +80,7 @@ The docker compose file uses the sample-compose.yaml file in the `config` directory

To run the processes individually the command should be in this format: `python main.py --process APIProcess`.

The currently available processes (with the exception of the UofSC and ChicagoISAC processes) are:

- `DevelopmentSetupProcess` Initialize a testing/development database
- `APIProcess` run the DRB API
@@ -90,9 +92,20 @@ The currently available processes are:
- `NYPLProcess` Fetch files from the NYPL catalog (specifically Bib records) and import them
- `GutenbergProcess` Fetch updated files from Project Gutenberg and import them
- `MUSEProcess` Fetch open access books from Project MUSE and import them
- `METProcess` Fetch open access books from The MET Watson Digital Collections and import them
- `DOABProcess` Fetch open access books from the Directory of Open Access Books and import them
- `LOCProcess` Fetch open access and digitized books from the Library of Congress and import them
- `UofMProcess` Fetch open access books from the University of Michigan and import them
- `CoverProcess` Fetch covers for edition records

#### Appendix Link Flags (All flags are booleans)
- `reader` Added to 'application/webpub+json' links to indicate if a book will have a Read Online function on the frontend
- `embed` Indicates if a book will use a third-party web reader, such as HathiTrust's, on the frontend
- `download` Added to pdf/epub links to indicate if a book is downloadable on the frontend
- `catalog` Indicates if a book is part of a catalog; it may not be readable online but can be accessed through other means, such as an online request
- `nypl_login` Indicates if a book is requestable on the frontend for NYPL patrons
- `fulfill_limited_access` Indicates if a Limited Access book has been encrypted and can be read by NYPL patrons
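As an illustration, these flags travel as the last field of the pipe-delimited `has_part` link strings built by processes such as `processes/UofM.py`. The values below are placeholders, not a real manifest:

```python
import json

# Hypothetical link string in the format: position|url|source|media type|flags
link_string = '|'.join([
    '1',
    'example.org/manifests/UofM/0001.json',   # placeholder URL
    'UofM',
    'application/webpub+json',
    '{"catalog": false, "download": false, "reader": true, "embed": false, "fulfill_limited_access": false}',
])

# Consumers split the string and parse the flags field as JSON
item_no, uri, source, media_type, flag_json = link_string.split('|')
flags = json.loads(flag_json)
print(flags['reader'])  # webpub+json links carry the reader flag
```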

#### Building and running a process in Docker

To run these processes as a containerized process you must have Docker Desktop installed.
@@ -156,3 +169,4 @@ And you're done!
- ~~Unit tests for all components~~
- Functional tests for each process
- Integration tests for the full cluster
- Update dependencies
4 changes: 4 additions & 0 deletions config/production.yaml
@@ -57,6 +57,10 @@ NYPL_LOCATIONS_BY_CODE: https://nypl-core-objects-mapping-qa.s3.amazonaws.com/by
# API_CLIENT_ID and API_CLIENT_SECRET must be configured in secrets file
NYPL_API_CLIENT_TOKEN_URL: https://isso.nypl.org/oauth/token

# DRB API Credentials
DRB_API_HOST: 'drb-api-production.nypl.org'
DRB_API_PORT: '80'

# GITHUB API Credentials
# GITHUB_API_KEY must be configured in secrets file
GITHUB_API_ROOT: https://api.github.com/graphql
4 changes: 4 additions & 0 deletions config/qa.yaml
@@ -57,6 +57,10 @@ NYPL_LOCATIONS_BY_CODE: https://nypl-core-objects-mapping-qa.s3.amazonaws.com/by
# API_CLIENT_ID and API_CLIENT_SECRET must be configured in secrets file
NYPL_API_CLIENT_TOKEN_URL: https://isso.nypl.org/oauth/token

# DRB API Credentials
DRB_API_HOST: 'drb-api-qa.nypl.org'
DRB_API_PORT: '80'

# GITHUB API Credentials
# GITHUB_API_KEY must be configured in secrets file
GITHUB_API_ROOT: https://api.github.com/graphql
5 changes: 5 additions & 0 deletions config/sample-compose.yaml
@@ -62,6 +62,11 @@ NYPL_API_CLIENT_PUBLIC_KEY: |
EQIDAQAB
-----END PUBLIC KEY-----
# DRB API Credentials
DRB_API_HOST: drb_local_webapp
DRB_API_PORT: '5050'

# Bardo CCE API URL
BARDO_CCE_API: http://sfr-bardo-copyright-development.us-east-1.elasticbeanstalk.com/search

8 changes: 8 additions & 0 deletions managers/s3.py
@@ -94,6 +94,14 @@ def getObjectFromBucket(self, objKey, bucket, md5Hash=None):
)
except ClientError:
raise S3Error('Unable to get object from s3')

def load_batches(self, objKey, bucket):
'''Load batches of objects using a paginator until there are no more pages'''
paginator = self.s3Client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket=bucket, Prefix=objKey)
return page_iterator
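`load_batches` returns a lazy boto3 page iterator rather than a list; each page is a dict whose `Contents` key holds the object summaries. A minimal sketch of consuming it, with the pages faked so no AWS call is needed:

```python
# Two fake pages shaped like list_objects_v2 responses; real pages come
# lazily from the paginator returned by load_batches
pages = [
    {'Contents': [{'Key': 'manifests/UofM/0001.json'},
                  {'Key': 'manifests/UofM/0002.json'}]},
    {'Contents': [{'Key': 'manifests/UofM/0003.json'}]},
]

# Flatten the pages into a list of object keys
keys = [obj['Key'] for page in pages for obj in page['Contents']]
print(keys)
```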

@staticmethod
def getmd5HashOfObject(obj):
7 changes: 3 additions & 4 deletions processes/UofM.py
@@ -45,7 +45,7 @@ def processUofMRecord(self, record):
UofMRec = UofMMapping(record)
UofMRec.applyMapping()
self.addHasPartMapping(record, UofMRec.record)
self.storePDFManifest(UofMRec.record)
self.addDCDWToUpdateList(UofMRec)

except (MappingError, HTTPError, ConnectionError, IndexError, TypeError) as e:
@@ -90,7 +90,7 @@ def addHasPartMapping(self, resultsRecord, record):
urlPDFObject,
'UofM',
'application/pdf',
'{"catalog": false, "download": true, "reader": false, "embed": false, "nypl_login": true}'
])
record.has_part.append(linkString)

@@ -101,7 +101,6 @@ def addHasPartMapping(self, resultsRecord, record):
logger.info(UofMError("Object doesn't exist"))



def storePDFManifest(self, record):
for link in record.has_part:
itemNo, uri, source, mediaType, flags = link.split('|')
@@ -124,7 +123,7 @@ def storePDFManifest(self, record):
manifestURI,
source,
'application/webpub+json',
'{"catalog": false, "download": false, "reader": true, "embed": false, "fulfill_limited_access": false}'
])

record.has_part.insert(0, linkString)
1 change: 1 addition & 0 deletions processes/__init__.py
@@ -18,3 +18,4 @@
from .UofSC import UofSCProcess
from .loc import LOCProcess
from .UofM import UofMProcess
from .fulfillURLManifest import FulfillProcess
151 changes: 151 additions & 0 deletions processes/fulfillURLManifest.py
@@ -0,0 +1,151 @@
import json
import os
import copy
import logging
from botocore.exceptions import ClientError

from .core import CoreProcess
from datetime import datetime, timedelta, timezone
from model import Link
from logger import createLog

logger = createLog(__name__)

class FulfillProcess(CoreProcess):

def __init__(self, *args):
super(FulfillProcess, self).__init__(*args[:4])

self.fullImport = self.process == 'complete'
self.startTimestamp = None

# Connect to database
self.generateEngine()
self.createSession()

# S3 Configuration
self.s3Bucket = os.environ['FILE_BUCKET']
self.host = os.environ['DRB_API_HOST']
self.prefix = 'manifests/UofM/'
self.createS3Client()

def runProcess(self):
if self.process == 'daily':
startTimeStamp = datetime.now(timezone.utc) - timedelta(days=1)
self.getManifests(startTimeStamp)
elif self.process == 'complete':
self.getManifests()
elif self.process == 'custom':
timeStamp = self.ingestPeriod
startTimeStamp = datetime.strptime(timeStamp, '%Y-%m-%dT%H:%M:%S')
self.getManifests(startTimeStamp)

def getManifests(self, startTimeStamp=None):

'''Load batches of Limited Access (LA) works based on startTimeStamp'''

batches = self.load_batches(self.prefix, self.s3Bucket)
if startTimeStamp:
# The JMESPath filter yields the matching object dicts directly, not pages
filtered_contents = batches.search(f"Contents[?to_string(LastModified) > '\"{startTimeStamp}\"']")
for content in filtered_contents:
currKey = content['Key']
metadataObject = self.getObjectFromBucket(currKey, self.s3Bucket)
self.update_manifest(metadataObject, self.s3Bucket, currKey)
else:
for batch in batches:
for c in batch['Contents']:
currKey = c['Key']
metadataObject = self.getObjectFromBucket(currKey, self.s3Bucket)
self.update_manifest(metadataObject, self.s3Bucket, currKey)
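The daily time window applied above can be illustrated in plain Python with mocked object metadata (boto3 returns `LastModified` as a timezone-aware datetime):

```python
from datetime import datetime, timedelta, timezone

# Mocked S3 object summaries; the key names are placeholders
contents = [
    {'Key': 'manifests/UofM/old.json',
     'LastModified': datetime.now(timezone.utc) - timedelta(days=3)},
    {'Key': 'manifests/UofM/new.json',
     'LastModified': datetime.now(timezone.utc) - timedelta(hours=2)},
]

# Same window as the 'daily' process: everything modified in the last day
startTimeStamp = datetime.now(timezone.utc) - timedelta(days=1)
recent_keys = [c['Key'] for c in contents if c['LastModified'] > startTimeStamp]
print(recent_keys)  # only the object modified within the last day survives
```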

def update_manifest(self, metadataObject, bucketName, currKey):

metadataJSON = json.loads(metadataObject['Body'].read().decode("utf-8"))
metadataJSONCopy = copy.deepcopy(metadataJSON)

counter = 0

metadataJSON, counter = self.linkFulfill(metadataJSON, counter)
metadataJSON, counter = self.readingOrderFulfill(metadataJSON, counter)
metadataJSON, counter = self.resourceFulfill(metadataJSON, counter)
metadataJSON, counter = self.tocFulfill(metadataJSON, counter)

if counter >= 4:
for link in metadataJSON['links']:
self.fulfillFlagUpdate(link)

self.closeConnection()

if metadataJSON != metadataJSONCopy:
try:
fulfillManifest = json.dumps(metadataJSON, ensure_ascii = False)
return self.putObjectInBucket(
fulfillManifest, currKey, bucketName
)
except ClientError as e:
logging.error(e)

def linkFulfill(self, metadataJSON, counter):
for link in metadataJSON['links']:
fulfillLink, counter = self.fulfillReplace(link, counter)
link['href'] = fulfillLink

return (metadataJSON, counter)

def readingOrderFulfill(self, metadataJSON, counter):
for readOrder in metadataJSON['readingOrder']:
fulfillLink, counter = self.fulfillReplace(readOrder, counter)
readOrder['href'] = fulfillLink

return (metadataJSON, counter)

def resourceFulfill(self, metadataJSON, counter):
for resource in metadataJSON['resources']:
fulfillLink, counter = self.fulfillReplace(resource, counter)
resource['href'] = fulfillLink

return (metadataJSON, counter)

def tocFulfill(self, metadataJSON, counter):

'''
Unlike the previous dictionaries, the toc dictionary has no "type" key,
so the 'href' key is evaluated instead
'''

for toc in metadataJSON['toc']:
if 'pdf' in toc['href'] \
or 'epub' in toc['href']:
for link in self.session.query(Link) \
.filter(Link.url == toc['href'].replace('https://', '')):
counter += 1
toc['href'] = f'https://{self.host}/fulfill/{link.id}'

return (metadataJSON, counter)

def fulfillReplace(self, metadata, counter):
if metadata['type'] == 'application/pdf' or metadata['type'] == 'application/epub+zip' \
or metadata['type'] == 'application/epub+xml':
for link in self.session.query(Link) \
.filter(Link.url == metadata['href'].replace('https://', '')):
counter += 1
metadata['href'] = f'https://{self.host}/fulfill/{link.id}'

return (metadata['href'], counter)
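The rewrite performed here can be sketched with the database lookup stubbed out. `FakeLink` is a hypothetical stand-in for a `model.Link` row, and the host value is the `DRB_API_HOST` from `config/qa.yaml`:

```python
host = 'drb-api-qa.nypl.org'   # DRB_API_HOST value from config/qa.yaml

# Hypothetical stand-in for the model.Link row the session query would return
class FakeLink:
    id = 42
    url = 'example.org/content/book.pdf'   # stored without the scheme

metadata = {'type': 'application/pdf',
            'href': 'https://example.org/content/book.pdf'}

counter = 0
link = FakeLink()
# Same matching logic as fulfillReplace: pdf/epub links whose URL matches a
# Link row are rewritten to the DRB fulfill endpoint
if metadata['type'] in ('application/pdf',
                        'application/epub+zip',
                        'application/epub+xml') \
        and link.url == metadata['href'].replace('https://', ''):
    counter += 1
    metadata['href'] = f'https://{host}/fulfill/{link.id}'

print(metadata['href'])  # https://drb-api-qa.nypl.org/fulfill/42
```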

def fulfillFlagUpdate(self, metadata):
if metadata['type'] == 'application/webpub+json':
for link in self.session.query(Link) \
.filter(Link.url == metadata['href'].replace('https://', '')):
logger.debug(link)
logger.debug(link.flags)
if 'fulfill_limited_access' in link.flags.keys():
if link.flags['fulfill_limited_access'] == False:
newLinkFlag = dict(link.flags)
newLinkFlag['fulfill_limited_access'] = True
link.flags = newLinkFlag
self.commitChanges()
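`fulfillFlagUpdate` copies the flags dict before changing it; with a SQLAlchemy JSON column, assigning a fresh dict is what marks the attribute as changed so `commitChanges` persists it. A minimal sketch with a plain dict standing in for the column:

```python
# Plain dict standing in for the JSON flags column on a model.Link row
link_flags = {'reader': True, 'fulfill_limited_access': False}

if link_flags.get('fulfill_limited_access') is False:
    new_flags = dict(link_flags)              # copy instead of mutating in place
    new_flags['fulfill_limited_access'] = True
    link_flags = new_flags                    # reassignment flags the change

print(link_flags['fulfill_limited_access'])  # True
```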

class FulfillError(Exception):
pass
2 changes: 1 addition & 1 deletion scripts/__init__.py
@@ -15,4 +15,4 @@
from .nyplLoginFlags import main as nyplFlags
from .deleteUMPManifestLinks import main as deleteUMPManifests
from .parseDownloadRequests import main as parseDownloads
from .addFulfillManifest import main as fulfillManifest