The ODS Transfer Service
- Eureka (not required)
- CockroachDB (required)
- RabbitMQ (not required)
- Pmeter (not required)
- InfluxDB (not required)
To enable or disable any of these, look in the application.properties file.
Let's begin by understanding what this service does. Strictly speaking, the Transfer-Service moves data using many threads. Before we get into the internals, let's look at an API example.
{
"ownerId": "[email protected]", <- the users ods email of the user that is submitting the request
"source": {
"type": "vfs", <- dropbox, gdrive, sftp, ftp, box, s3, http, vfs, scp
"vfsSourceCredential": {
},
"parentInfo": {
"path": "/Users/jacobgoldverg/UBPINTOS", <- the folder to find all the files in the infoList
"id": "/Users/jacobgoldverg/UBPINTOS" <- the folder to find all the files in the infoList
},
"infoList": [
//each json element here is a separate file
{
"path": "absolute path of the file",
"size": 11326976, <- in bytes
"chunkSize": 11326976, <- the chunk size to use for this file
"id" : "apache-maven-3.6.3-bin.tar" <- the name of the file
},
{
"path": "go1.16beta1.darwin-amd64.tar",
"size": 411944960,
"chunkSize": 411944960,
"id" : "go1.16beta1.darwin-amd64.tar"
}
]
},
"destination": {
"type": "scp", <- dropbox, gdrive, sftp, ftp, box, s3, gftp, http, vfs, scp
"vfsDestCredential": {
"username": "ubuntu",
"secret": "", <- this should contain the password or the pem file
"uri": "ipaddr:22", <- the ip address: port of the destination
"encryptedSecret": "" <- do not use this ever
},
"parentInfo": {
"path": "destination path to put data/",
"id": "destination path to put data/"
}
},
"options": {
"concurrencyThreadCount": 3,
"parallelThreadCount":4,
"pipeSize": 1,
"compress": false,
"retry" : 1
}
}
Let's begin with some verbiage clarification:
- vfsDestCredential, vfsSourceCredential, oauthDestCredential, oauthSourceCredential: the credentials used to access the source and the destination respectively. An endpoint never carries both a vfs and an oauth credential; it is strictly one or the other (XOR) for the source and for the destination, depending on what kind of remote endpoint you are attempting to access.
- Options: these are fully described in the ODS-Starter documentation, so please reference that.
- Source: the endpoint we are downloading data from.
- Destination: the location we are uploading data to.
- OwnerId: a required field that corresponds to the user email of the onedatashare.org account.
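The XOR rule for credentials can be checked before a request is ever sent. The following is a minimal sketch, assuming the source or destination block has already been parsed into a `Map`; the class `CredentialCheck` and its method are hypothetical names for illustration and are not part of the Transfer-Service code.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of the Transfer-Service code base.
final class CredentialCheck {

    // Returns true when exactly one of the two credential keys is present
    // on an endpoint (source or destination) that was parsed into a Map.
    static boolean hasExactlyOneCredential(Map<String, Object> endpoint,
                                           String vfsKey, String oauthKey) {
        boolean hasVfs = endpoint.get(vfsKey) != null;
        boolean hasOauth = endpoint.get(oauthKey) != null;
        return hasVfs ^ hasOauth; // XOR: one or the other, never both, never neither
    }

    public static void main(String[] args) {
        // Mirrors the shape of the "source" block in the example request.
        Map<String, Object> source = Map.of(
                "type", "vfs",
                "vfsSourceCredential", Map.of());
        System.out.println(hasExactlyOneCredential(
                source, "vfsSourceCredential", "oauthSourceCredential"));
    }
}
```

The same check applies to the destination with the keys `vfsDestCredential` and `oauthDestCredential`.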
Things to install:
- Install Java 11
- Install Maven
- Get the boot.sh and certs files from Jacob.
Please read the Language Domain section of the Spring Batch docs first. Using the above JSON, you can run a transfer as long as you replace the values appropriately.
General things to know:
- 1 step = 1 file.
- ODS parallelism = strapping a thread pool to a step.
- ODS concurrency = splitting steps across a thread pool.
- Connections are important. For us that entails pooling the clients, which honestly might not be the best idea for the cloud provider clients, but it works and performs comparably to or better than RClone.
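To make the difference between ODS concurrency and ODS parallelism concrete, here is a rough model written with plain `java.util.concurrent` rather than Spring Batch. It is only a sketch of the threading layout described above, not the actual service code: the class `TransferModel`, its methods, and the 16 MiB chunk size used in `main` are all assumptions for illustration, and no bytes are really moved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of the threading layout; not the Transfer-Service code.
final class TransferModel {

    // Number of chunks needed to cover a file of `size` bytes (ceiling division).
    static long chunkCount(long size, long chunkSize) {
        return (size + chunkSize - 1) / chunkSize;
    }

    // Parallelism: one step (one file) spreads its chunks over its own thread pool.
    static long runStep(long size, long chunkSize, int parallelThreadCount) {
        ExecutorService chunkPool = Executors.newFixedThreadPool(parallelThreadCount);
        AtomicLong bytesMoved = new AtomicLong();
        try {
            List<Future<?>> pending = new ArrayList<>();
            long count = chunkCount(size, chunkSize);
            for (long i = 0; i < count; i++) {
                long length = Math.min(chunkSize, size - i * chunkSize);
                // A real step would read and write bytes; here we only count them.
                Runnable task = () -> bytesMoved.addAndGet(length);
                pending.add(chunkPool.submit(task));
            }
            for (Future<?> future : pending) {
                await(future);
            }
        } finally {
            chunkPool.shutdown();
        }
        return bytesMoved.get();
    }

    // Concurrency: the steps themselves are split across a second thread pool.
    static long runJob(long[] fileSizes, long chunkSize,
                       int concurrencyThreadCount, int parallelThreadCount) {
        ExecutorService stepPool = Executors.newFixedThreadPool(concurrencyThreadCount);
        long total = 0;
        try {
            List<Future<Long>> steps = new ArrayList<>();
            for (long size : fileSizes) {
                Callable<Long> step = () -> runStep(size, chunkSize, parallelThreadCount);
                steps.add(stepPool.submit(step));
            }
            for (Future<Long> future : steps) {
                total += await(future);
            }
        } finally {
            stepPool.shutdown();
        }
        return total;
    }

    // Waits for a task and converts checked exceptions into an unchecked one.
    private static <T> T await(Future<T> future) {
        try {
            return future.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("transfer task interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("transfer task failed", e);
        }
    }

    public static void main(String[] args) {
        // File sizes from the example request; the chunk size is an assumed value.
        long[] sizes = {11326976L, 411944960L};
        long moved = runJob(sizes, 16L * 1024 * 1024, 3, 4);
        System.out.println("bytes accounted for: " + moved);
    }
}
```

In this model `concurrencyThreadCount` bounds how many files are in flight at once, while `parallelThreadCount` bounds how many chunks of a single file are in flight at once.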
To start, this service receives a message, either as a request through the controller or through the RabbitMQ consumer. Once we get a request, we process it, which means running the code in JobControl.java. All that code really does is set up the Job object with its various steps. This is where we apply concurrency, parallelism, and pipelining (the commit count, i.e. the number of read calls per write) by splitting the execution of many steps across a thread pool. Once we are done defining a Job, we launch it with the JobLauncher. Once the job starts, Spring Batch keeps track of the read, skip, write, and other counts in CockroachDB, which means we can run many Transfer-Services that use the same Job table.
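The pipelining idea (several read calls per one write call) can be sketched in plain Java without any Spring Batch types. This is an illustrative model only: `PipelineSketch` and `readThenWrite` are hypothetical names, and the real service relies on Spring Batch's own chunk handling rather than a loop like this.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration of pipeSize; not the Transfer-Service code.
final class PipelineSketch {

    // Reads items one at a time and hands them to the writer in batches of
    // `pipeSize`, so there are up to `pipeSize` read calls per write call.
    // Returns the number of write calls that were made.
    static <T> int readThenWrite(Iterator<T> reader, int pipeSize, Consumer<List<T>> writer) {
        if (pipeSize < 1) {
            throw new IllegalArgumentException("pipeSize must be at least 1");
        }
        int writeCalls = 0;
        List<T> buffer = new ArrayList<>(pipeSize);
        while (reader.hasNext()) {
            buffer.add(reader.next());
            if (buffer.size() == pipeSize) {
                writer.accept(new ArrayList<>(buffer)); // copy so the writer owns its batch
                buffer.clear();
                writeCalls++;
            }
        }
        if (!buffer.isEmpty()) {
            writer.accept(new ArrayList<>(buffer)); // flush the final partial batch
            writeCalls++;
        }
        return writeCalls;
    }

    public static void main(String[] args) {
        List<String> chunks = List.of("c0", "c1", "c2", "c3", "c4");
        int writes = readThenWrite(chunks.iterator(), 2,
                batch -> System.out.println("write " + batch));
        System.out.println("write calls: " + writes);
    }
}
```

With `pipeSize` set to 1, as in the example request, every read is followed immediately by its own write.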
Once we have created
-
Spring Batch Docs Sections:
- [Language Domain](https://docs.spring.io/spring-batch/docs/current/reference/html/domain.html#domainLanguageOfBatch)
- [Multi Threaded Steps](https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html#multithreadedStep)
- [Parallel Steps](https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html#scalabilityParallelSteps)
- A few others as well, but there is no point in listing them all.
AWS Java Docs Sections
- Only the S3 material is worth reading, nothing else.
Jsch Examples
Here is the thing about Jsch: THERE ARE NO DOCS. I know, I hate it too, sorry. So we have to go off examples and Stack Overflow, but it's actually a damn good library because it works similarly to how you would expect the SSH protocol to work.
Sections
- Only read the material about Scp and Sftp, unless you are doing a remote execution, in which case I would probably use the Shell or Exec examples.
Spring JPA
I would only look this over if you plan on working with the data; otherwise it is not very necessary.