Laia Codó edited this page Oct 30, 2024 · 25 revisions

openvre


What is openVRE

openVRE (open Virtual Research Environment) is an open-source platform specifically designed to facilitate the creation, management, and customization of Virtual Research Environments (VREs). It provides the tools, frameworks, and compute infrastructure necessary to rapidly build a cloud-based working environment.

[Figure: openVRE homepage]

Key Features
  • Customizable front-end design: openVRE offers a modular architecture that allows institutions or research groups to build tailored VREs that meet their specific research needs. Users can add or remove tools, customize the interface, and adapt the environment to different research workflows.

  • Data Management and Integration: openVRE is designed to handle complex data management, supporting diverse data types and formats. It integrates data storage and metadata handling, making it easier for researchers to upload, import, organize, and manage their datasets.

  • Integrated Analytical and Computational Tools: openVRE enables the modular integration of virtually any analytical tool. It handles execution on a local cloud-based compute back-end, or it can connect to high-performance computing (HPC) resources and cloud services to support computationally intensive tasks.

  • Interoperability with External Resources: Designed with interoperability in mind, openVRE can link to external databases, repositories, and other research resources. This helps researchers streamline their workflow by bringing external data and resources into their VRE.

  • Data Sharing and Publication Support: openVRE makes it easy to publish and share research data and results. Researchers can use openVRE to push their private datasets to long-term associated repositories like B2SHARE.

Benefits of openVRE
  • Scalability and Flexibility: Institutions can deploy openVRE for projects of various sizes and can scale resources up or down based on project needs.
  • Cost-Effectiveness: Being open-source, openVRE reduces costs associated with proprietary VREs.
  • Enhanced Collaboration: With built-in collaborative tools, openVRE encourages knowledge sharing and teamwork, regardless of team location.
  • Improved Data Governance: openVRE’s data management tools support quality control, versioning, and reproducibility, crucial for modern research.

Architecture

The main repository contains the source code of openVRE, including a customizable web front-end and core code that manages data, metadata, processing units (analysis and visualization tools), and computational resources. The following diagram illustrates a possible layout for a computational infrastructure based on openVRE:

[Diagram: possible layout of an openVRE-based computational infrastructure]

Different components work together to provide a computational environment for the analysis or visualization tools initiated by users via the VRE web interface. User requests are processed by one of the supported scheduling systems. These launchers trigger the tools in batch mode within the local or remote compute back-end, either as software containers (e.g., Docker, Singularity) or as entire virtual machines (VMs). When VMs are enacted, resources are dynamically allocated on elastic virtual clusters that scale horizontally according to job requirements. As a result, the analysis tool being triggered is encapsulated in a transient and isolated computational environment while retaining access to the user's data volumes.

Once a job request reaches the allocated VM, tool execution begins. The integrated VRE tools are independent software components developed by third parties (e.g., bioinformaticians, researchers, statisticians) and are deployed on the on-premises VMs of the compute platform. All tools adhere to the common structure defined by the openvre-tool-api, allowing the openVRE core to trigger them uniformly.

Components

  • Virtual Research Environment: Digital workspace designed to support the entire lifecycle of a research project. It provides researchers with the tools, services, and resources needed to conduct collaborative research online.
  • Job Schedulers:
    • Queue systems: openVRE relies on well-known workload managers to handle job scheduling and resource allocation. By default, support is provided for SGE (Oracle Grid Engine, formerly Sun Grid Engine) and SLURM (Simple Linux Utility for Resource Management). These serve as the core server responsible for launching and managing jobs in a virtual cluster-based infrastructure. In this context, the openVRE core acts as a submitter node, dispatching jobs to VMs that host the tool code (the workers). Each tool has a dedicated queue, with several VM replicas associated with each queue. The availability of these replicas is managed by OneFlow, an OpenNebula service that dynamically provisions VMs based on configurable system metrics, such as VM load. Consequently, the number of queue workers can automatically adjust to meet demand, ensuring that there is always a VM available to handle job requests.
    • Message Brokers: openVRE is also able to send remote job petitions via message brokers. The integration of RabbitMQ is in progress; at present, PMES (Programming Model Enactment Service) is integrated. This service remotely controls job execution on the SGE core server in the underlying cloud platform via the Open Cloud Computing Interface (OCCI), independent of the cloud middleware used (e.g., OpenNebula, OpenStack). PMES oversees VM creation, contextualization, application execution, and VM termination, facilitating the efficient execution of jobs submitted to the SGE.
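As a rough illustration of the submitter role described above, the openVRE core ultimately has to produce a batch-submission command for a tool-specific queue. The sketch below is an assumption about that step, not actual openVRE code; the queue and script names are hypothetical.

```python
# Sketch: building the batch-submission command a submitter node might run.
# Queue name and job script are hypothetical examples.

def build_submit_command(scheduler: str, queue: str, job_script: str) -> list:
    """Return the argv for submitting job_script to a tool-specific queue."""
    if scheduler == "sge":
        # SGE: qsub -q <queue> <script>
        return ["qsub", "-q", queue, job_script]
    if scheduler == "slurm":
        # SLURM: sbatch --partition=<queue> <script>
        return ["sbatch", "--partition=" + queue, job_script]
    raise ValueError("unsupported scheduler: " + scheduler)

# Example: dispatch a job to the hypothetical 'fastqc_queue' worker pool
cmd = build_submit_command("slurm", "fastqc_queue", "run_tool.sh")
```

In a real deployment the command would be executed on the submitter node (e.g., via `subprocess.run(cmd)`), with one such queue per integrated tool.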

[Diagram: openVRE job scheduling components]
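The remote job petitions mentioned above are, in essence, structured messages describing what to run and for whom. The payload below is purely illustrative of what such a petition could carry once a broker like RabbitMQ is integrated; all field names are assumptions, not the actual openVRE message schema.

```python
import json

# Sketch: a hypothetical job petition payload that a message broker could
# carry from the openVRE core to a remote execution service.
# All field names here are illustrative assumptions.
petition = {
    "tool_id": "sample_tool",             # tool registered in openVRE
    "user_id": "researcher42",            # submitting user
    "input_files": ["/data/in/reads.fq"], # data volumes visible to the tool
    "arguments": {"threads": 4},          # tool parameters set by the user
    "callback_url": "https://vre.example.org/api/jobs/1234/status",
}

# Brokers transport opaque bytes, so the petition is serialized before publishing
message_body = json.dumps(petition)
```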

  • Tools Database (MongoDB): The backbone of the openVRE architecture, MongoDB is a NoSQL database that stores all data, including user information, job metadata, and tool configurations. Its flexible schema allows for dynamic data modeling, making it ideal for the varying requirements of computational tasks. MongoDB ensures efficient data retrieval and management, supporting the scalability needed for large-scale analyses and visualizations.
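To make the role of the tools database concrete, the snippet below sketches a hypothetical tool-configuration document as it might be stored in MongoDB. The schema shown is an assumption for illustration; MongoDB's flexible schema means each tool can carry different fields.

```python
import json

# Sketch: a hypothetical tool-configuration document for the tools database.
# Field names and structure are illustrative assumptions, not the real schema.
tool_doc = {
    "_id": "sample_tool",
    "name": "Sample Tool",
    "version": "1.0.0",
    "infrastructure": {"executor": "SGE", "queue": "sample_tool_queue"},
    "input_files": [{"name": "reads", "required": True, "allow_multiple": False}],
    "arguments": [{"name": "threads", "type": "integer", "default": 4}],
}

# With pymongo, such a document would be stored with e.g.:
#   db.tools.insert_one(tool_doc)
serialized = json.dumps(tool_doc)
```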

  • Authentication and Authorization Server (Keycloak): Keycloak serves as the authorization and authentication component for openVRE. It can be installed locally or configured to refer to a remote domain. Keycloak enables secure user authentication and role-based access control, ensuring that only authorized users can access specific functionalities and data within the openVRE environment.
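As an illustration of how a client authenticates against Keycloak, the sketch below builds a request for the standard OpenID Connect token endpoint (Resource Owner Password flow, shown only for brevity). The realm, client, and credentials are hypothetical; the `/realms/<realm>/protocol/openid-connect/token` path is the standard layout in recent Keycloak versions (older releases prefix it with `/auth`).

```python
from urllib.parse import urlencode

# Sketch: building a token request for Keycloak's OpenID Connect endpoint.
# Realm, client_id, and credentials below are hypothetical assumptions.
def token_request(base_url, realm):
    url = base_url + "/realms/" + realm + "/protocol/openid-connect/token"
    body = urlencode({
        "grant_type": "password",
        "client_id": "openvre-portal",   # hypothetical client registered in Keycloak
        "username": "researcher42",
        "password": "s3cret",
    }).encode()
    return url, body

url, body = token_request("https://auth.example.org", "openvre")
# POSTing body to url would return a JSON payload containing access_token
```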

  • Credentials Vault (HashiCorp): HashiCorp Vault is used to securely store keys and credentials that users need to launch jobs or access computational environments and databases. Vault provides a robust security model for managing sensitive data, enabling dynamic secrets, and controlling access to various resources.
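The sketch below shows how a stored credential could be read from Vault's KV version 2 secrets engine over its HTTP API. The mount point, secret path, and token are hypothetical; the `/v1/<mount>/data/<path>` URL layout and the `X-Vault-Token` header are the KV v2 conventions.

```python
import urllib.request

# Sketch: building a read request for Vault's KV v2 secrets engine.
# Address, mount, path, and token are hypothetical assumptions.
def build_vault_request(addr, mount, path, token):
    req = urllib.request.Request(addr + "/v1/" + mount + "/data/" + path)
    req.add_header("X-Vault-Token", token)  # Vault authentication header
    return req

req = build_vault_request("https://vault.example.org:8200",
                          "secret", "openvre/db-credentials", "s.xxxx")
# urllib.request.urlopen(req) would return JSON with the secret under data.data
```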

  • Analysis Tools (openvre-tool-api based): analysis tools or workflows must conform to a particular skeleton to ensure interoperability with the openVRE back-end. This skeleton is given by the openvre-tool-api, a set of parsing and modeling Python classes that facilitate batch and asynchronous execution. It provides a common command-line client for openVRE tools, acting as an adapter and uniform entry point between the openVRE core and the application code.

Docker-based architecture

The Dockerized branch of the main repository provides a containerized version of openVRE along with the essential components needed to deploy a minimal yet functional computational cloud infrastructure. While the underlying components remain the same, they are packaged in a Docker environment for ease of deployment and scalability. The following diagram outlines these components:

[Diagram: Docker-based openVRE components]

Installation

openVRE provides a central manager and a user interface to the on-premises compute platform. Other components (see the following section) are also required to build an operational infrastructure.

Manual installation

Use the step-by-step installation guide from the openVRE source code repository. It provides instructions on how to build and start the PHP-based application service.

Docker-based deployment

To simplify the installation process, the following repository contains the code for deploying an operational openVRE-based computational platform out of the box. It bundles the minimal dependent components (e.g., an authentication service, a local SGE queue system) along with the openVRE service.

Containerized Deployment Install.md
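As a rough sketch of what such a composition looks like, the fragment below outlines a hypothetical Docker Compose layout wiring the components described in this page together. It is illustrative only, not the actual compose file from the repository; all service names and images are assumptions.

```yaml
# Illustrative sketch only -- not the actual compose file from the repository.
# Service names and images are assumptions based on the components listed above.
services:
  openvre:
    build: .                    # PHP-based openVRE application
    ports:
      - "8080:80"
    depends_on:
      - mongodb
      - keycloak
  mongodb:
    image: mongo:6              # tools and metadata database
  keycloak:
    image: quay.io/keycloak/keycloak:24.0
    command: start-dev          # development mode; not for production
  sge:
    build: ./sge                # minimal local SGE queue system (hypothetical)
```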

Pluggable resources

openVRE supports three distinct types of pluggable software components that can be integrated into a functional VRE in a modular fashion by a VRE administrator. Depending on the implementation method, different tutorials and guidelines are available for integrating these resources.

Tools

What are openVRE Tools?

openVRE Tools are modular computational units in the openVRE analysis platform, enabling diverse research functions. Developed by third-party Tool Developers, these tools are parameterized by researchers to fit specific project needs. Once configured, they are scheduled and executed by the platform’s compute back-end. Implemented within software containers (e.g., Docker, Singularity), openVRE Tools ensure compatibility, portability, and isolation, making it straightforward to integrate various analytical tools or pipelines seamlessly into the research platform.

[Figure: parts of an openVRE Tool]


$TOOL_EXECUTABLE                      \
    --config job_configuration.json   \
    --in_metadata input_metadata.json \
    --out_metadata manifest_output.json > tool.log
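The command-line contract above can be honored by a thin adapter around the actual analysis code. The sketch below is an illustrative stand-in for such an entry point, not the real openvre-tool-api classes; the manifest fields it writes are assumptions.

```python
import argparse
import json

# Sketch of a tool entry point honoring the CLI contract shown above.
# This is an illustrative stand-in, not the actual openvre-tool-api code.
def main(argv=None):
    parser = argparse.ArgumentParser(description="openVRE tool adapter sketch")
    parser.add_argument("--config", required=True)        # job_configuration.json
    parser.add_argument("--in_metadata", required=True)   # input_metadata.json
    parser.add_argument("--out_metadata", required=True)  # manifest to write
    args = parser.parse_args(argv)

    with open(args.config) as fh:
        config = json.load(fh)

    # ... run the actual analysis here, using config and input metadata ...

    # Write the manifest describing produced outputs (fields are illustrative)
    manifest = {"output_files": [], "tool": config.get("tool_id", "unknown")}
    with open(args.out_metadata, "w") as fh:
        json.dump(manifest, fh)
    return 0
```

The openVRE core can then invoke every tool uniformly with the three flags shown above, redirecting stdout to a log file.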

Types of Tools

There are two primary categories of tools within the openVRE platform:

  • Non-Interactive Tools: These operate in batch mode, processing data without user intervention, making them ideal for automated tasks and large datasets.

  • Interactive Tools: These provide a user-friendly, web-based interface that allows researchers to engage directly with the tool, facilitating real-time data analysis and customization.

How to bring your own tool?

This installation guide provides instructions for implementing tools in openVRE. Depending on your preference and requirements, you can choose between two different methods: Manual Installation and Dockerized Deployment. Below are the details for each approach. You can find the step-by-step instructions on how to integrate a tool into the platform here, and here for the Dockerized version.

Repository Interfaces

A collection of already implemented data repository interfaces is available, ready to be integrated into any openVRE. Here is the list:

  • OpenStack Swift Object Storage: Swift is an OpenStack service for scalable, distributed storage, commonly used for archiving large datasets in research and industry. It enables researchers to store, retrieve, and share unstructured data securely and efficiently across distributed systems.

  • via WebDAV:

    • Nextcloud: Nextcloud is an open-source platform for file sharing and collaboration, allowing researchers to manage, sync, and securely share data through WebDAV support, which enables remote access and integration with other services.
  • via HTTP APIs:

    • XNAT (eXtensible Neuroimaging Archive Toolkit) is an open-source imaging informatics platform for storing, managing, and sharing neuroimaging and biomedical research data, frequently used in medical imaging studies.

    • ArrayExpress: A functional genomics data repository managed by EMBL-EBI, ArrayExpress archives and provides access to gene expression data from high-throughput technologies, facilitating data reuse and integrative analysis.

    • BigNASim: BigNASim is a specialized repository and data management platform for nucleic acid simulations, offering structured data resources for large-scale bioinformatics and computational chemistry projects involving DNA and RNA structures.

  • European Genome-Phenome Archive (EGA): The EGA provides secure data storage and controlled access for genomic and phenotypic data, supporting personalized medicine research by sharing sensitive human data responsibly within the European research community.

  • B2SHARE: Developed by EUDAT, B2SHARE is an open data repository service for researchers to store, publish, and share research data while supporting metadata standards for findability and interoperability.

  • Rclone (in progress): Rclone is a command-line tool designed for managing, syncing, and transferring data across various cloud storage services. It supports encrypted transfers and integrates with over 40 storage backends, making it popular in research environments that require secure data movement and replication across platforms.
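As a small illustration of the WebDAV interfaces listed above, pushing a result file to a Nextcloud instance amounts to an HTTP PUT against its WebDAV endpoint. The URL, user, and file below are hypothetical; the `/remote.php/dav/files/<user>/` path is Nextcloud's WebDAV convention.

```python
import urllib.request

# Sketch: building a WebDAV upload request for a Nextcloud instance.
# Base URL, user, path, and payload are hypothetical assumptions.
def build_webdav_put(base_url, user, remote_path, data):
    req = urllib.request.Request(
        base_url + "/remote.php/dav/files/" + user + "/" + remote_path,
        data=data,
        method="PUT",   # WebDAV uploads are plain HTTP PUTs
    )
    req.add_header("Content-Type", "application/octet-stream")
    return req

req = build_webdav_put("https://cloud.example.org", "researcher42",
                       "results/run1.csv", b"a,b\n1,2\n")
# urllib.request.urlopen(req) would perform the upload (auth headers omitted)
```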