
Commit

Merge branch 'main' of https://github.com/NYPL/drb-etl-pipeline into kyle/SFR-1980
kylevillegas93 committed Jun 18, 2024
2 parents d67f37a + de7874b commit 28a4465
Showing 13 changed files with 401 additions and 116 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -4,8 +4,12 @@
## Added
- New script to parse download requests from S3 log files for UMP books
- New script to update current UofM manifests with fulfill endpoints to replace pdf/epub urls
- Updated README with appendix and additions to available processes
- New process to add fulfill urls to Limited Access manifests and update fulfill_limited_access flags to True
- Updated README and added more information to installation steps
- Replaced the deprecated `datetime.utcnow()` method
## Fixed
- Resolved the format of fulfill endpoints in UofM manifests

## 2024-03-21 -- v0.13.0
## Added
30 changes: 22 additions & 8 deletions README.md
@@ -32,22 +32,24 @@ Locally these services can be run in two modes:
#### Dependencies and Installation

Local development requires that the following services be available. They do not need to be running locally, but for development purposes this is probably easiest. These should be installed by whatever means is easiest (on macOS this is generally `brew`, or your package manager of choice). These dependencies are:
- PostgreSQL@10
  - Note that v10 is deprecated.
- [email protected]
  - Note that you may need to follow the [macOS Homebrew install guide](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/brew.html#brew).
- RabbitMQ
- Redis
- XCode Command Line Tools
- XCode Command Line Tools

This is a Python application and requires Python >= 3.6. It is recommended that a virtual environment be set up for the application (again use the virtual environment tool of your choice). There are several options, but most developers use [venv](https://docs.python.org/3/library/venv.html) or [virtualenv](https://virtualenv.pypa.io/en/latest/installation.html#).

The steps to install the application are:

1. Install dependencies, including Python >= 3.6, if not already installed
2. Set up a virtual environment
3. Clone this repository
4. Run `pip install -r requirements.txt` from the root directory. If you run into the error `pip: command not found` while installing the dependencies, you may need to alias `python3` and `pip3` to `python` and `pip`, respectively.
5. Configure environment variables per the instructions below
6. Run `DevelopmentSetupProcess` per the instructions below
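Steps 2 and 4 can be sketched as follows, assuming `python3` is on your PATH (the repository clone and `requirements.txt` install are omitted so the sketch stands alone):

```shell
# Create a virtual environment with the standard library venv module
python3 -m venv .venv

# Activate it (equivalent to `source .venv/bin/activate` in bash)
. .venv/bin/activate

# pip is available inside the venv even if pip3 was not aliased globally
python -m pip --version
```

With the environment active, running `pip install -r requirements.txt` from the repository root installs the application's dependencies.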

#### Running services on host machine

@@ -78,7 +80,7 @@ The docker compose file uses the sample-compose.yaml file in the `config` directory

To run the processes individually the command should be in this format: `python main.py --process APIProcess`.

The currently available processes (with the exception of the UofSC and ChicagoISAC processes) are:

- `DevelopmentSetupProcess` Initialize a testing/development database
- `APIProcess` run the DRB API
@@ -90,9 +92,20 @@ The currently available processes are:
- `NYPLProcess` Fetch files from the NYPL catalog (specifically Bib records) and import them
- `GutenbergProcess` Fetch updated files from Project Gutenberg and import them
- `MUSEProcess` Fetch open access books from Project MUSE and import them
- `METProcess` Fetch open access books from The MET Watson Digital Collections and import them
- `DOABProcess` Fetch open access books from the Directory of Open Access Books and import them
- `LOCProcess` Fetch open access and digitized books from the Library of Congress and import them
- `UofMProcess` Fetch open access books from the University of Michigan and import them
- `CoverProcess` Fetch covers for edition records

#### Appendix Link Flags (All flags are booleans)
- `reader` Added to 'application/webpub+json' links to indicate if a book will have a Read Online function on the frontend
- `embed` Indicates if a book will use a third-party web reader, such as HathiTrust's, on the frontend
- `download` Added to pdf/epub links to indicate if a book is downloadable on the frontend
- `catalog` Indicates if a book is part of a catalog; it may not be readable online but can be accessed through other means, such as an online request
- `nypl_login` Indicates if a book is requestable on the frontend for NYPL patrons
- `fulfill_limited_access` Indicates if a Limited Access book has been encrypted and can be read by NYPL patrons
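As an illustration, these flags travel as the last field of the pipe-delimited `has_part` link strings built by processes such as `processes/UofM.py`. The values below are placeholders, not a real manifest:

```python
import json

# Hypothetical link string in the format: position|url|source|media type|flags
link_string = '|'.join([
    '1',
    'example.org/manifests/UofM/0001.json',   # placeholder URL
    'UofM',
    'application/webpub+json',
    '{"catalog": false, "download": false, "reader": true, "embed": false, "fulfill_limited_access": false}',
])

# Consumers split the string and parse the flags field as JSON
item_no, uri, source, media_type, flag_json = link_string.split('|')
flags = json.loads(flag_json)
print(flags['reader'])  # webpub+json links carry the reader flag
```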

#### Building and running a process in Docker

To run these processes as a containerized process you must have Docker Desktop installed.
@@ -156,3 +169,4 @@ And you're done!
- ~~Unit tests for all components~~
- Functional tests for each process
- Integration tests for the full cluster
- Update dependencies
4 changes: 4 additions & 0 deletions config/production.yaml
@@ -57,6 +57,10 @@ NYPL_LOCATIONS_BY_CODE: https://nypl-core-objects-mapping-qa.s3.amazonaws.com/by
# API_CLIENT_ID and API_CLIENT_SECRET must be configured in secrets file
NYPL_API_CLIENT_TOKEN_URL: https://isso.nypl.org/oauth/token

# DRB API Credentials
DRB_API_HOST: 'drb-api-production.nypl.org'
DRB_API_PORT: '80'

# GITHUB API Credentials
# GITHUB_API_KEY must be configured in secrets file
GITHUB_API_ROOT: https://api.github.com/graphql
4 changes: 4 additions & 0 deletions config/qa.yaml
@@ -57,6 +57,10 @@ NYPL_LOCATIONS_BY_CODE: https://nypl-core-objects-mapping-qa.s3.amazonaws.com/by
# API_CLIENT_ID and API_CLIENT_SECRET must be configured in secrets file
NYPL_API_CLIENT_TOKEN_URL: https://isso.nypl.org/oauth/token

# DRB API Credentials
DRB_API_HOST: 'drb-api-qa.nypl.org'
DRB_API_PORT: '80'

# GITHUB API Credentials
# GITHUB_API_KEY must be configured in secrets file
GITHUB_API_ROOT: https://api.github.com/graphql
5 changes: 5 additions & 0 deletions config/sample-compose.yaml
@@ -62,6 +62,11 @@ NYPL_API_CLIENT_PUBLIC_KEY: |
EQIDAQAB
-----END PUBLIC KEY-----
# DRB API Credentials
DRB_API_HOST: drb_local_webapp
DRB_API_PORT: '5050'

# Bardo CCE API URL
BARDO_CCE_API: http://sfr-bardo-copyright-development.us-east-1.elasticbeanstalk.com/search

8 changes: 8 additions & 0 deletions managers/s3.py
@@ -94,6 +94,14 @@ def getObjectFromBucket(self, objKey, bucket, md5Hash=None):
)
except ClientError:
raise S3Error('Unable to get object from s3')

def load_batches(self, objKey, bucket):
'''Load batches of objects using a paginator until there are no more pages'''
paginator = self.s3Client.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket=bucket, Prefix=objKey)
return page_iterator
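`load_batches` returns a lazy boto3 page iterator rather than a list; each page is a dict whose `Contents` key holds the object summaries. A minimal sketch of consuming it, with the pages faked so no AWS call is needed:

```python
# Two fake pages shaped like list_objects_v2 responses; real pages come
# lazily from the paginator returned by load_batches
pages = [
    {'Contents': [{'Key': 'manifests/UofM/0001.json'},
                  {'Key': 'manifests/UofM/0002.json'}]},
    {'Contents': [{'Key': 'manifests/UofM/0003.json'}]},
]

# Flatten the pages into a list of object keys
keys = [obj['Key'] for page in pages for obj in page['Contents']]
print(keys)
```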

@staticmethod
def getmd5HashOfObject(obj):
7 changes: 3 additions & 4 deletions processes/UofM.py
@@ -45,7 +45,7 @@ def processUofMRecord(self, record):
UofMRec = UofMMapping(record)
UofMRec.applyMapping()
self.addHasPartMapping(record, UofMRec.record)
self.storePDFManifest(UofMRec.record)
self.addDCDWToUpdateList(UofMRec)

except (MappingError, HTTPError, ConnectionError, IndexError, TypeError) as e:
@@ -90,7 +90,7 @@ def addHasPartMapping(self, resultsRecord, record):
urlPDFObject,
'UofM',
'application/pdf',
'{"catalog": false, "download": true, "reader": false, "embed": false, "nypl_login": true}'
])
record.has_part.append(linkString)

@@ -101,7 +101,6 @@ def addHasPartMapping(self, resultsRecord, record):
logger.info(UofMError("Object doesn't exist"))



def storePDFManifest(self, record):
for link in record.has_part:
itemNo, uri, source, mediaType, flags = link.split('|')
@@ -124,7 +123,7 @@ def storePDFManifest(self, record):
manifestURI,
source,
'application/webpub+json',
'{"catalog": false, "download": false, "reader": true, "embed": false, "fulfill_limited_access": false}'
])

record.has_part.insert(0, linkString)
1 change: 1 addition & 0 deletions processes/__init__.py
@@ -18,3 +18,4 @@
from .UofSC import UofSCProcess
from .loc import LOCProcess
from .UofM import UofMProcess
from .fulfillURLManifest import FulfillProcess
151 changes: 151 additions & 0 deletions processes/fulfillURLManifest.py
@@ -0,0 +1,151 @@
import json
import os
import copy
import logging
from botocore.exceptions import ClientError

from .core import CoreProcess
from datetime import datetime, timedelta, timezone
from model import Link
from logger import createLog

logger = createLog(__name__)

class FulfillProcess(CoreProcess):

def __init__(self, *args):
super(FulfillProcess, self).__init__(*args[:4])

self.fullImport = self.process == 'complete'
self.startTimestamp = None

# Connect to database
self.generateEngine()
self.createSession()

# S3 Configuration
self.s3Bucket = os.environ['FILE_BUCKET']
self.host = os.environ['DRB_API_HOST']
self.prefix = 'manifests/UofM/'
self.createS3Client()

def runProcess(self):
if self.process == 'daily':
startTimeStamp = datetime.now(timezone.utc) - timedelta(days=1)
self.getManifests(startTimeStamp)
elif self.process == 'complete':
self.getManifests()
elif self.process == 'custom':
timeStamp = self.ingestPeriod
startTimeStamp = datetime.strptime(timeStamp, '%Y-%m-%dT%H:%M:%S')
self.getManifests(startTimeStamp)

def getManifests(self, startTimeStamp=None):

'''Load batches of Limited Access (LA) works based on startTimeStamp'''

batches = self.load_batches(self.prefix, self.s3Bucket)
if startTimeStamp:
# The JMESPath filter yields the matching object dicts directly, not pages
filtered_contents = batches.search(f"Contents[?to_string(LastModified) > '\"{startTimeStamp}\"']")
for content in filtered_contents:
currKey = content['Key']
metadataObject = self.getObjectFromBucket(currKey, self.s3Bucket)
self.update_manifest(metadataObject, self.s3Bucket, currKey)
else:
for batch in batches:
for c in batch['Contents']:
currKey = c['Key']
metadataObject = self.getObjectFromBucket(currKey, self.s3Bucket)
self.update_manifest(metadataObject, self.s3Bucket, currKey)
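The daily time window applied above can be illustrated in plain Python with mocked object metadata (boto3 returns `LastModified` as a timezone-aware datetime):

```python
from datetime import datetime, timedelta, timezone

# Mocked S3 object summaries; the key names are placeholders
contents = [
    {'Key': 'manifests/UofM/old.json',
     'LastModified': datetime.now(timezone.utc) - timedelta(days=3)},
    {'Key': 'manifests/UofM/new.json',
     'LastModified': datetime.now(timezone.utc) - timedelta(hours=2)},
]

# Same window as the 'daily' process: everything modified in the last day
startTimeStamp = datetime.now(timezone.utc) - timedelta(days=1)
recent_keys = [c['Key'] for c in contents if c['LastModified'] > startTimeStamp]
print(recent_keys)  # only the object modified within the last day survives
```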

def update_manifest(self, metadataObject, bucketName, currKey):

metadataJSON = json.loads(metadataObject['Body'].read().decode("utf-8"))
metadataJSONCopy = copy.deepcopy(metadataJSON)

counter = 0

metadataJSON, counter = self.linkFulfill(metadataJSON, counter)
metadataJSON, counter = self.readingOrderFulfill(metadataJSON, counter)
metadataJSON, counter = self.resourceFulfill(metadataJSON, counter)
metadataJSON, counter = self.tocFulfill(metadataJSON, counter)

if counter >= 4:
for link in metadataJSON['links']:
self.fulfillFlagUpdate(link)

self.closeConnection()

if metadataJSON != metadataJSONCopy:
try:
fulfillManifest = json.dumps(metadataJSON, ensure_ascii = False)
return self.putObjectInBucket(
fulfillManifest, currKey, bucketName
)
except ClientError as e:
logging.error(e)

def linkFulfill(self, metadataJSON, counter):
for link in metadataJSON['links']:
fulfillLink, counter = self.fulfillReplace(link, counter)
link['href'] = fulfillLink

return (metadataJSON, counter)

def readingOrderFulfill(self, metadataJSON, counter):
for readOrder in metadataJSON['readingOrder']:
fulfillLink, counter = self.fulfillReplace(readOrder, counter)
readOrder['href'] = fulfillLink

return (metadataJSON, counter)

def resourceFulfill(self, metadataJSON, counter):
for resource in metadataJSON['resources']:
fulfillLink, counter = self.fulfillReplace(resource, counter)
resource['href'] = fulfillLink

return (metadataJSON, counter)

def tocFulfill(self, metadataJSON, counter):

'''
Unlike the previous dictionaries, the toc dictionary has no "type" key,
so the 'href' key is evaluated instead
'''

for toc in metadataJSON['toc']:
if 'pdf' in toc['href'] \
or 'epub' in toc['href']:
for link in self.session.query(Link) \
.filter(Link.url == toc['href'].replace('https://', '')):
counter += 1
toc['href'] = f'https://{self.host}/fulfill/{link.id}'

return (metadataJSON, counter)

def fulfillReplace(self, metadata, counter):
if metadata['type'] == 'application/pdf' or metadata['type'] == 'application/epub+zip' \
or metadata['type'] == 'application/epub+xml':
for link in self.session.query(Link) \
.filter(Link.url == metadata['href'].replace('https://', '')):
counter += 1
metadata['href'] = f'https://{self.host}/fulfill/{link.id}'

return (metadata['href'], counter)
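The rewrite performed here can be sketched with the database lookup stubbed out. `FakeLink` is a hypothetical stand-in for a `model.Link` row, and the host value is the `DRB_API_HOST` from `config/qa.yaml`:

```python
host = 'drb-api-qa.nypl.org'   # DRB_API_HOST value from config/qa.yaml

# Hypothetical stand-in for the model.Link row the session query would return
class FakeLink:
    id = 42
    url = 'example.org/content/book.pdf'   # stored without the scheme

metadata = {'type': 'application/pdf',
            'href': 'https://example.org/content/book.pdf'}

counter = 0
link = FakeLink()
# Same matching logic as fulfillReplace: pdf/epub links whose URL matches a
# Link row are rewritten to the DRB fulfill endpoint
if metadata['type'] in ('application/pdf',
                        'application/epub+zip',
                        'application/epub+xml') \
        and link.url == metadata['href'].replace('https://', ''):
    counter += 1
    metadata['href'] = f'https://{host}/fulfill/{link.id}'

print(metadata['href'])  # https://drb-api-qa.nypl.org/fulfill/42
```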

def fulfillFlagUpdate(self, metadata):
if metadata['type'] == 'application/webpub+json':
for link in self.session.query(Link) \
.filter(Link.url == metadata['href'].replace('https://', '')):
logger.debug(link)
logger.debug(link.flags)
if 'fulfill_limited_access' in link.flags.keys():
if link.flags['fulfill_limited_access'] == False:
newLinkFlag = dict(link.flags)
newLinkFlag['fulfill_limited_access'] = True
link.flags = newLinkFlag
self.commitChanges()
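`fulfillFlagUpdate` copies the flags dict before changing it; with a SQLAlchemy JSON column, assigning a fresh dict is what marks the attribute as changed so `commitChanges` persists it. A minimal sketch with a plain dict standing in for the column:

```python
# Plain dict standing in for the JSON flags column on a model.Link row
link_flags = {'reader': True, 'fulfill_limited_access': False}

if link_flags.get('fulfill_limited_access') is False:
    new_flags = dict(link_flags)              # copy instead of mutating in place
    new_flags['fulfill_limited_access'] = True
    link_flags = new_flags                    # reassignment flags the change

print(link_flags['fulfill_limited_access'])  # True
```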

class FulfillError(Exception):
pass
2 changes: 1 addition & 1 deletion scripts/__init__.py
@@ -15,4 +15,4 @@
from .nyplLoginFlags import main as nyplFlags
from .deleteUMPManifestLinks import main as deleteUMPManifests
from .parseDownloadRequests import main as parseDownloads
from .addFulfillManifest import main as fulfillManifest