Merge pull request #263 from KPMP/develop
merge for Release 3.0
rlreamy authored Mar 27, 2023
2 parents 11022c0 + 3ff7a2e commit 4aaf719
Showing 58 changed files with 1,167 additions and 1,275 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -20,3 +20,6 @@ out
tokens
**/node_modules
globus_tokens
__pycache__/
.vscode
.DS_Store
19 changes: 0 additions & 19 deletions README.md
@@ -7,25 +7,6 @@
## Documentation
Visit [kpmp.github.io/dlu](http://kpmp.github.io/dlu)

### Rezipping files for the DLU (RegenerateZipFiles)
To regenerate zip files:
1. Connect to Mongo by opening an ssh session to prod-upload
   - `ssh <username>@172.20.66.165 -L 27017:localhost:27017`
2. Update the package record in Mongo with the desired change
3. Set `regenerateZip` to `true` on the package (see the sketch after this list)
4. Log in to prod-upload and switch to kpmp-appuser
   - `ssh prod-upload`
   - `sudo su - kpmp-appuser`
5. Open a screen session to prevent issues with long-running zips
   - `screen -S rezipping` (use `-r` to reattach)
6. Log in to the orion-spring container
   - `docker exec -it orion-spring bash`
7. Run the RegenerateZipFiles script
   - `java -cp build/libs/orion-data.jar -Dloader.main=org.kpmp.RegenerateZipFiles org.springframework.boot.loader.PropertiesLauncher`
8. Navigate to the clearCache URL to clear the old cache
   - `https://upload.kpmp.org/api/v1/clearCache`
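
Steps 2-3 can be done through the Mongo tunnel opened in step 1; here is a minimal sketch using pymongo. The database name, collection name, and `_id` lookup are assumptions, not documented above; only the `regenerateZip` flag is.

```python
# Hypothetical sketch: database/collection names and the _id filter are assumptions;
# only the regenerateZip flag comes from the steps above.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # via the ssh tunnel from step 1
packages = client["dataLake"]["packages"]          # assumed database and collection names

packages.update_one(
    {"_id": "<packageId>"},                        # assumed package identifier field
    {"$set": {"regenerateZip": True}},
)
```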


## Removing packages
1. Connect to Mongo by opening an ssh session to prod-upload
- `ssh <username>@172.20.66.165 -L 27017:localhost:27017`
1 change: 0 additions & 1 deletion build.gradle
@@ -30,7 +30,6 @@ dependencies {
implementation 'org.springframework.boot:spring-boot-starter-web'
implementation 'commons-io:commons-io:2.6'
implementation 'mysql:mysql-connector-java:6.0.5'
testImplementation 'org.springframework:spring-test:5.0.5.RELEASE'
implementation 'org.springframework.boot:spring-boot-starter-data-mongodb'
implementation 'org.apache.commons:commons-compress:1.17'
implementation 'org.apache.commons:commons-text:1.7'
36 changes: 34 additions & 2 deletions scripts/dataPromotion/README.md
@@ -2,8 +2,40 @@
pip install -r requirements.txt

# Moving files from DL to S3
1. Add `packageID,filename` pairs to files_to_s3.txt, one per line (see the example below)
2. Execute `python filesToS3.py`
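
For example, files_to_s3.txt might look like the following (the package IDs and filenames here are placeholders):

```
a1b2c3d4-e5f6-7890-abcd-ef1234567890,expression_matrix.csv
0f9e8d7c-b6a5-4321-9abc-0987654321fe,sample_metadata.json
```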

About:
This script reads the Datalake database to find files that, for a given release version of the data, have not yet been copied into the S3 bucket. Because package requirements differ, the script follows partially distinct code paths depending on the `metadata_type_id`; for example, some packages are zipped, with a few files excluded, before they are uploaded.
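
A rough sketch of that flow, for orientation only: the column names, the join, the KE database name, and the S3 key layout are assumptions; the environment variables, the `file` table, the `moved_files` table, and the release-version argument are the ones documented in this README.

```python
# Illustrative sketch only -- not the real filesToS3.py. Column names, the KE
# database name, and the S3 key layout are assumptions.
import os

import boto3
import pymysql
from dotenv import load_dotenv

load_dotenv()
db = pymysql.connect(host="localhost",
                     user=os.environ["mysql_user"],
                     password=os.environ["mysql_pwd"],
                     database="knowledge_environment")  # assumed database name
s3 = boto3.client("s3")

def move_release_files(release_version: int):
    with db.cursor() as cursor:
        # Files in this release with no moved_files record yet (assumed schema).
        cursor.execute(
            """SELECT f.file_id, f.file_name, f.package_id
               FROM file f
               LEFT JOIN moved_files m ON m.file_id = f.file_id
               WHERE f.release_ver = %s AND m.file_id IS NULL""",
            (release_version,),
        )
        for file_id, file_name, package_id in cursor.fetchall():
            # Some metadata types would be zipped (minus excluded files) first;
            # only the plain copy path is shown here.
            local_path = os.path.join(os.environ["datalake_dir"], package_id, file_name)
            s3.upload_file(local_path, os.environ["destination_bucket"],
                           package_id + "/" + file_name)
            cursor.execute("INSERT INTO moved_files (file_id) VALUES (%s)", (file_id,))
    db.commit()
```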

Setup and running:

1. Set up the .env file - filesToS3.py requires the following .env variables to be set (see the example .env after these steps):

`destination_bucket` - the name of the S3 bucket to which the files will be moved

`datalake_dir` - the directory in the datalake where the files are located

`source_bucket` - another location to search if the datalake directory does not contain the files

`mysql_user` - MySQL user used to access the `file` table inside the knowledge environment database

`mysql_pwd` - password for that MySQL user


2. The script also requires an argument indicating the release version of the files to move. You'll need to look up this number before you run the script.

`-v` or `--release_version`

3. Make sure you have a tunnel open to the KE database, e.g.

`$ ssh atlas-ke -i ~/.ssh/um-kpmp.pem -L 3306:localhost:3306`

4. Execute the Datalake to S3 move script

`$ python filesToS3.py --release-version 0`
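
A .env for this script might look like the following (all values are placeholders):

```
destination_bucket=example-destination-bucket
datalake_dir=/path/to/datalake
source_bucket=example-source-bucket
mysql_user=ke_user
mysql_pwd=example-password
```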

Issues that may arise:

If you find yourself needing to re-run the script with the same data, first delete the existing file from the S3 bucket and delete the existing record from the `moved_files` table.
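
For a single file, that cleanup might look like the sketch below; the object key, the `moved_files` columns, and the connection details are assumptions.

```python
# Hypothetical cleanup sketch -- object key and moved_files columns are assumptions.
import boto3
import pymysql

s3 = boto3.client("s3")
s3.delete_object(Bucket="example-destination-bucket", Key="<package_id>/<file_name>")

db = pymysql.connect(host="localhost", user="<mysql_user>",
                     password="<mysql_pwd>", database="knowledge_environment")
with db.cursor() as cursor:
    cursor.execute("DELETE FROM moved_files WHERE file_id = %s", ("<file_id>",))
db.commit()
```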

# Move the files from the `file_pending` table to the `file` table in the Staging DB
1. Requires a connection to the DLU Mongo and the Staging DB MySQL (e.g. through tunnels)
Expand Down
102 changes: 0 additions & 102 deletions scripts/dataPromotion/filesToS3.py

This file was deleted.

4 changes: 0 additions & 4 deletions scripts/dataPromotion/loadClinical/.env.example

This file was deleted.

55 changes: 0 additions & 55 deletions scripts/dataPromotion/loadClinical/addPilotClinicalDataFile.py

This file was deleted.

59 changes: 0 additions & 59 deletions scripts/dataPromotion/loadClinical/clinicalToKE.py

This file was deleted.

3 changes: 0 additions & 3 deletions scripts/dataPromotion/loadClinical/requirements.txt

This file was deleted.

24 changes: 24 additions & 0 deletions scripts/dataPromotion/package_zipper/packageZipper.py
@@ -0,0 +1,24 @@
from zipfile import ZipFile
import os

EXCLUDED_TYPES = ['.jpg', 'metadata.json', '.DS_Store']

def is_not_excluded_type(filename: str, excludedTypes: list):
    # Return True unless the filename ends with one of the excluded suffixes.
    for excludedType in excludedTypes:
        if filename.lower().endswith(excludedType.lower()):
            return False
    return True

def zip_package_data(zipName: str, folderToZip: str, packageId: str):
    # Walk the package folder (including subdirectories) and add every
    # non-excluded file to the zip, flattened under a 'package_<packageId>/' prefix.
    with ZipFile(zipName, 'w') as zippedPackage:
        for root, dirs, files in os.walk(folderToZip):
            for filename in files:
                if is_not_excluded_type(filename, EXCLUDED_TYPES):
                    zippedPackage.write(os.path.join(root, filename), 'package_' + packageId + '/' + filename)

def zip_package_cleanup(zipName: str):
    os.remove(zipName)

if __name__ == "__main__":
    zip_package_data('packageid_lipidomics.zip', 'folder-to-zip/', 'packageid')
54 changes: 54 additions & 0 deletions scripts/dataPromotion/package_zipper/test_packageZipper.py
@@ -0,0 +1,54 @@
import os
import unittest
import packageZipper
import shutil
from os.path import exists

class TestZipperMethods(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Build a small nested folder tree containing two empty files to zip.
        if os.path.isdir('test-folder-to-zip/'):
            shutil.rmtree('test-folder-to-zip/')
        os.mkdir('test-folder-to-zip/')
        os.mkdir('test-folder-to-zip/zip-me')
        os.mkdir('test-folder-to-zip/zip-me/zip-me-2')

        with open('test-folder-to-zip/zip-me/zip-me-2/bar.txt', 'a'):
            os.utime('test-folder-to-zip/zip-me/zip-me-2/bar.txt', None)

        with open('test-folder-to-zip/zip-me/foo.txt', 'a'):
            os.utime('test-folder-to-zip/zip-me/foo.txt', None)

    @classmethod
    def tearDownClass(cls):
        shutil.rmtree('test-folder-to-zip/')
        os.remove('packageid_lipidomics.zip')

    def test_valid_file(self):
        filename = 'abc.test'
        EXCLUDED_TYPES = ['.jpg', 'metadata.json']
        self.assertEqual(packageZipper.is_not_excluded_type(filename, EXCLUDED_TYPES), True)

    def test_valid_file_with_no_excluded(self):
        filename = 'abc.test'
        EXCLUDED_TYPES = []
        self.assertEqual(packageZipper.is_not_excluded_type(filename, EXCLUDED_TYPES), True)

    def test_invalid_jpg(self):
        filename = 'abc.jpg'
        EXCLUDED_TYPES = ['.jpg', 'metadata.json']
        self.assertEqual(packageZipper.is_not_excluded_type(filename, EXCLUDED_TYPES), False)

    def test_invalid_metadata(self):
        filename = 'metadata.json'
        EXCLUDED_TYPES = ['.jpg', 'metadata.json']
        self.assertEqual(packageZipper.is_not_excluded_type(filename, EXCLUDED_TYPES), False)

    def test_zip_package_data(self):
        packageZipper.zip_package_data('packageid_lipidomics.zip', 'test-folder-to-zip/', 'packageid')
        self.assertEqual(exists('packageid_lipidomics.zip'), True)

if __name__ == '__main__':
    unittest.main()
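
# To run this suite locally (a suggestion, not documented in this repo; it assumes
# the working directory is scripts/dataPromotion/package_zipper/):
#   python -m unittest test_packageZipper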