By default, JARVICE stores all system and job-related objects in its local database (or an external database, if so configured). Database-stored objects are scalable and in theory have no limit other than underlying storage space; however, JARVICE truncates stored job output to the last 10,000 lines per job (approximately 800KB, assuming full 80-column lines). The internal object store is indexed and provides adequate performance for storing and retrieving job output.
Alternatively, JARVICE can be configured to store job output objects in an S3-compatible object storage service.
NOTE: JARVICE does not provide an S3 service as part of its deployment.
In some cases, such as the following, it may be advantageous to move job output to external object storage:

- When using this mechanism there is no limit to the size of the objects, and the entire "standard output" of each job is captured without truncation. This may be beneficial for batch solves of very large jobs.
- Some S3-compatible services offer advanced policies for deleting old objects or moving them to lower-tier storage (a minimal lifecycle sketch follows this list). Note that JARVICE treats missing job output as an error, but this only applies if a user or administrator examines the specific job the output pertains to. Pruning older objects may be an acceptable way to reduce storage costs if users are expected to retrieve output within a certain period of time. Retention and lifecycle management policies are external to JARVICE and are service provider-specific.
- Some service providers may charge less for object storage (especially lower tiers) than for the block storage that usually underpins the internal JARVICE database. Keeping very large blobs out of the database may result in significant cost savings as more jobs complete and produce output over time. Further, the loss of one or more job output objects affects only the user(s) inspecting the respective job(s) after they execute; the loss of database storage results in complete system failure (until it is restored from backup). Because of this, it may be beneficial to use different storage tiers for database storage and job output object storage, respectively.
- Storing fewer "binary blobs" in the internal database should also result in smaller and more efficient database backups.
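For example, on AWS S3 (or a service supporting the same API call), a lifecycle rule can expire job output objects after a retention window. The sketch below is illustrative only; the bucket name, 90-day window, and `output/` prefix (the default object prefix, described below) are assumptions and are not part of JARVICE itself.

```shell
# Illustrative only: expire job output objects after 90 days.
# The bucket name, prefix, and retention period are assumptions;
# lifecycle support and syntax vary by S3-compatible provider.
aws s3api put-bucket-lifecycle-configuration \
    --bucket jarvice-job-output \
    --lifecycle-configuration '{
      "Rules": [
        {
          "ID": "expire-jarvice-job-output",
          "Status": "Enabled",
          "Filter": { "Prefix": "output/" },
          "Expiration": { "Days": 90 }
        }
      ]
    }'
```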
Enabling this capability requires setting all of the following values correctly:
- `jarvice.JARVICE_S3_BUCKET` - set to the S3 bucket to store job output objects in
- `jarvice.JARVICE_S3_ACCESSKEY` - set to the S3 access key for the bucket
- `jarvice.JARVICE_S3_SECRETKEY` - set to the S3 secret access key for the bucket
Additionally, the following optional values may be set:
- `jarvice.JARVICE_S3_ENDPOINTURL` - set to the S3 endpoint URL, if not using AWS (default is AWS)
- `jarvice.JARVICE_S3_REGION` - set to the S3 region to use, if applicable
- `jarvice.JARVICE_S3_PREFIX` - set to the object name prefix; the default is `output/`, so that all objects in the bucket appear to be in an `output` folder; some S3 service providers offer folder views of the object store in their respective console(s), and using this prefix can be an effective organizational/"namespace" method if storing other types of objects in the same bucket
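As an example, for a Helm-based deployment these values can be supplied as overrides at install or upgrade time. The release name, chart path, bucket, endpoint, and region below are placeholders; substitute values for your environment, and prefer your normal secrets-handling practice for the keys rather than plain command-line flags.

```shell
# Hypothetical Helm override; the release name, chart path, and all
# values shown are placeholders for illustration only.
helm upgrade jarvice ./jarvice-helm --reuse-values \
    --set jarvice.JARVICE_S3_BUCKET=jarvice-job-output \
    --set jarvice.JARVICE_S3_ACCESSKEY=<access-key> \
    --set jarvice.JARVICE_S3_SECRETKEY=<secret-key> \
    --set jarvice.JARVICE_S3_ENDPOINTURL=https://s3.example.com \
    --set jarvice.JARVICE_S3_REGION=us-east-1
```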
The best practice is to configure both the upstream and all downstream cluster(s) to point to the same object store. However, it is possible to configure S3 output storage for only select downstream cluster(s) in a topology. If a downstream cluster is not configured to talk to S3, it will return truncated job output (the last 10,000 lines) as part of job completion, and JARVICE will store it in the database instead. JARVICE will automatically find objects either in the upstream cluster's configured S3 endpoint or in its database (see below). It may be advantageous not to configure certain downstream clusters with an S3 endpoint in order to reduce data transfer fees. Note that a configuration that uses S3 only for the control plane, while all compute endpoints run without it, does not make logical sense.
The following upstream/downstream combinations are supported:

Upstream | Downstream(s) |
---|---|
Database | Database (all) |
S3 | S3 (same as upstream) |
S3 | Database (one or more) |

The following upstream/downstream combinations are not supported:

Upstream | Downstream(s) |
---|---|
Database | S3 (any) |
S3 | S3 (different from upstream) |
- If jobs have already run and completed before the system is configured to use an external S3 service, JARVICE will automatically fetch their output from the database. This is true for any job's output that is not found in the S3 storage.
- If the external S3 service is disabled after jobs have run and stored their output there, that output will no longer be available (a copy of it is not stored in the database). Subsequent jobs, however, will store their output in the database.
- If an external S3 service is used with retention/lifecycle policies configured to prune objects, and a user tries to examine output pertaining to an object that was pruned, they will receive an error in the portal or via the API. This also applies to system administrators using the job detail function in Administration->Jobs.
JARVICE can support any service that implements the S3 API, provided it offers an HTTPS-based endpoint URL and responds similarly to Amazon S3. For the most up-to-date list of S3 service providers JARVICE has been tested against, please see the External S3-compatible Object Storage Service Compatibility section in the Release Notes.
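As a sanity check before configuring JARVICE, the endpoint and credentials can be exercised with any standard S3 client; the endpoint, bucket, prefix, and credentials below are placeholders.

```shell
# Illustrative only: confirm the service answers standard S3 API calls
# over HTTPS using placeholder credentials, endpoint, and bucket.
export AWS_ACCESS_KEY_ID=<access-key>
export AWS_SECRET_ACCESS_KEY=<secret-key>
aws s3 ls s3://jarvice-job-output/output/ \
    --endpoint-url https://s3.example.com \
    --region us-east-1
```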