Merge pull request #20 from CBBIO/fantasia
Fantasia
frapercan authored Dec 19, 2024
2 parents bbe6bbd + 197615a commit 3276442
Showing 12 changed files with 199 additions and 121 deletions.
10 changes: 5 additions & 5 deletions docs/source/configuration/fantasia/parameters.rst
Original file line number Diff line number Diff line change
@@ -31,21 +31,21 @@ Default system configuration should be OK for any Unix system, for the current s
- `limit_execution`: Maximum number of items to process (for debugging). Default: `100`.

5. LookUp table from CSV
---------------------
-------------------------

- `load_accesion_csv`: Path to the CSV file containing accessions. Default: `~/fantasia/input/genes.csv`.
- `load_accesion_column`: Column name in the CSV file. Default: `id`.
- `tag`: Identifier tag for accession loading. Default: `'GOA2022'`.
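A minimal sketch of how these keys might appear in the configuration file (the exact nesting inside `config.yaml` is an assumption, not taken from the repository):

.. code-block:: yaml

   # Hypothetical excerpt; key placement may differ in your config.yaml.
   load_accesion_csv: ~/fantasia/input/genes.csv
   load_accesion_column: id
   tag: 'GOA2022'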

5b. LookUp table from UniProt (By default)
----------------------
------------------------------------------

- `search_criteria`: Query string for UniProt search. Default: `'(reviewed:true)'`.
- `limit`: Number of results to fetch from UniProt. Default: `200`.
- `tag`: Identifier tag for accession loading. Default: `'GOA2022'`. (This is the same tag defined in section 5.)
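For illustration, `search_criteria` and `limit` map naturally onto a UniProtKB REST search request; the sketch below only builds such a URL (the endpoint shown is UniProt's public search API, and the helper name is hypothetical, not part of this codebase):

```python
from urllib.parse import urlencode

def build_uniprot_query(search_criteria="(reviewed:true)", limit=200):
    # Compose a UniProtKB search URL from the two configuration values.
    base = "https://rest.uniprot.org/uniprotkb/search"
    return f"{base}?{urlencode({'query': search_criteria, 'size': limit})}"

print(build_uniprot_query())
```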

6. Evidence Codes
-----------------
-----------------
- `allowed_evidences`: This parameter specifies the list of UniProt evidence codes used to filter annotations. The default selection includes codes that indicate experimental evidence:

- `EXP`: Inferred from Experiment.
@@ -62,7 +62,7 @@ Default system configuration should be OK for any Unix system, for the current s
However, you can include any evidence codes from the full set defined by the Gene Ontology Consortium. For more details on all available evidence codes, visit the `Guide to GO Evidence Codes <https://geneontology.org/docs/guide-go-evidence-codes/>`_.
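As a sketch, restricting the filter to experimental evidence only might look like this (the key nesting is an assumption; the codes themselves are standard GO evidence codes):

.. code-block:: yaml

   allowed_evidences:
     - EXP   # Inferred from Experiment
     - IDA   # Inferred from Direct Assay
     - IPI   # Inferred from Physical Interaction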

7. FANTASIA Settings
---------------------
--------------------

- `fantasia_input_fasta`: Path to the input FASTA file. Default: `~/fantasia/input/input.fasta`.
- `fantasia_output_h5`: Directory to save embeddings in HDF5 format. Default: `~/fantasia/embeddings/`.
@@ -75,7 +75,7 @@ Default system configuration should be OK for any Unix system, for the current s
- `topgo`: Enable or disable TopGO analysis. Default: `True`.
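A hedged sketch of these keys in YAML (the paths shown are the documented defaults; the nesting is an assumption):

.. code-block:: yaml

   fantasia_input_fasta: ~/fantasia/input/input.fasta
   fantasia_output_h5: ~/fantasia/embeddings/
   topgo: True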

8. Embedding Settings
----------------------
---------------------

- `embedding.types`: List of embedding types to compute (Comment with # at the beginning to de/activate):

35 changes: 18 additions & 17 deletions docs/source/index.rst
@@ -12,21 +12,6 @@ BioInfo design (Plan)
- **BioInfo - API**: Application Programming Interface that allows integration and access to system data and functionalities from other systems.
- **BioInfo - Portal**: Indicates an access platform or user interface where the project's tools and data are deployed for research exploitation.

Pipelines
===============
.. toctree::
:maxdepth: 3
:caption: Pipelines
:hidden:


pipelines/fantasia/


- `Fantasia <pipelines/fantasia.html>`_: Functional ANnoTAtion based on embedding space SImilArity

- Massive search of metamorphisms and multifunctionality (WIP)


Task Types
===============
@@ -44,6 +29,22 @@ Task Types
- `GPU Tasks <tasks/gpu.html>`_: Handle computationally intensive tasks using GPU.


Pipelines
===============
.. toctree::
:maxdepth: 3
:caption: Pipelines
:hidden:


pipelines/fantasia/


- `Fantasia <pipelines/fantasia.html>`_: Functional ANnoTAtion based on embedding space SImilArity

- Massive search of metamorphisms and multifunctionality (WIP)


Extraction
===============
.. toctree::
@@ -65,9 +66,9 @@ Embedding
:caption: Embedding
:hidden:

operation/embedding/sequence_embeddings
operation/embedding/sequence_embedding

- `Sequence Embedding <operation/embedding/sequence_embeddings.html>`_: Facilitates encoding of protein and nucleotide sequences into dense vectors.
- `Sequence Embedding <operation/embedding/sequence_embedding.html>`_: Facilitates encoding of protein and nucleotide sequences into dense vectors.


FAQ
@@ -1,6 +1,8 @@
Embedding Modulefds
================
Sequence Embeddings Module
==========================


.. automodule:: protein_metamorphisms_is.operation.embedding.sequence_embedding
:members:
:show-inheritance:
:show-inheritance:

8 changes: 5 additions & 3 deletions protein_metamorphisms_is/main.py
@@ -5,7 +5,6 @@
AccessionManager,
PDBExtractor,
UniProtExtractor,
SequenceEmbeddingManager,
Structure3DiManager,
SequenceClustering,
StructuralSubClustering,
@@ -14,22 +13,25 @@
GoMultifunctionalityMetrics,
)


def main(config_path='config/config.yaml'):
conf = read_yaml_config(config_path)
AccessionManager(conf).fetch_accessions_from_api()
AccessionManager(conf).load_accessions_from_csv()
UniProtExtractor(conf).start()
PDBExtractor(conf).start()
SequenceEmbeddingManager(conf).start()
Structure3DiManager(conf).start()
SequenceClustering(conf).start()
StructuralSubClustering(conf).start()
StructuralAlignmentManager(conf).start()
SequenceGOAnnotation(conf).start()
GoMultifunctionalityMetrics(conf).start()


if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Run the pipeline with a specified configuration file.")
parser.add_argument("--config", type=str, required=False, help="Path to the configuration YAML file.")
args = parser.parse_args()
main(config_path=args.config)

config_path = args.config if args.config else 'config/config.yaml'
main(config_path=config_path)
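The same fallback can be expressed by handing the default to argparse directly, which removes the manual `if`; a minimal sketch, not the committed code:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Run the pipeline with a specified configuration file."
)
# argparse supplies the default itself, so no post-parse fallback is needed.
parser.add_argument("--config", type=str, default="config/config.yaml",
                    help="Path to the configuration YAML file.")

args = parser.parse_args([])  # empty argv for demonstration
print(args.config)  # prints config/config.yaml
```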
@@ -9,6 +9,7 @@ def load_model(model_name):
def load_tokenizer(model_name):
return AutoTokenizer.from_pretrained(model_name)


def embedding_task(sequences, model, tokenizer, embedding_type_id=None):
"""
Processes sequences to generate embeddings.
82 changes: 45 additions & 37 deletions protein_metamorphisms_is/operation/embedding/sequence_embedding.py
@@ -5,6 +5,7 @@
SequenceEmbeddingType,
SequenceEmbedding,
)

from protein_metamorphisms_is.sql.model.entities.sequence.sequence import Sequence
from protein_metamorphisms_is.tasks.gpu import GPUTaskInitializer

@@ -17,20 +18,27 @@ class SequenceEmbeddingManager(GPUTaskInitializer):
for embedding generation.
Attributes:
reference_attribute (str): Reference attribute name used for embedding.
model_instances (dict): A dictionary storing loaded models.
tokenizer_instances (dict): A dictionary storing loaded tokenizers.
base_module_path (str): Base module path for dynamic model imports.
batch_size (int): Batch size for processing sequences, defaults to 40.
types (dict): Dictionary containing model and task configurations.
reference_attribute (str): Name of the attribute used as the reference for embedding (default: 'sequence').
model_instances (dict): Dictionary of loaded models keyed by embedding type ID.
tokenizer_instances (dict): Dictionary of loaded tokenizers keyed by embedding type ID.
base_module_path (str): Base module path for dynamic imports of embedding tasks.
batch_size (int): Number of sequences processed per batch. Defaults to 40.
types (dict): Configuration dictionary for embedding types.
"""

def __init__(self, conf):
"""
Initializes the SequenceEmbeddingManager.
Args:
conf (dict): Configuration dictionary containing embedding parameters.
:param conf: Configuration dictionary containing embedding parameters.
:type conf: dict
Example:
>>> conf = {
...     "embedding": {"batch_size": 50, "types": [1, 2]},
...     "limit_execution": 100
... }
>>> manager = SequenceEmbeddingManager(conf)
"""
super().__init__(conf)
self.reference_attribute = 'sequence'
@@ -42,10 +50,9 @@ def __init__(self, conf):

def fetch_models_info(self):
"""
Retrieves and initializes embedding models based on configuration.
Retrieves and initializes embedding models based on the database configuration.
Queries the `SequenceEmbeddingType` table to fetch available embedding models.
Modules are dynamically imported and stored in the `types` attribute.
:raises sqlalchemy.exc.SQLAlchemyError: If there's an error querying the database.
"""
self.session_init()
embedding_types = self.session.query(SequenceEmbeddingType).all()
@@ -68,11 +75,13 @@ def enqueue(self):
"""
Enqueues tasks for sequence embedding processing.
Fetches all sequences from the database, batches them, and checks for existing embeddings.
If no existing embeddings are found, tasks are prepared and published for processing.
Splits all sequences into batches, checks for existing embeddings in the database,
and prepares tasks for missing embeddings.
Raises:
Exception: If an error occurs during the enqueue process.
:raises Exception: If an error occurs during the enqueue process.
Example:
>>> manager.enqueue()
"""
try:
self.logger.info("Starting embedding enqueue process.")
@@ -118,28 +127,27 @@ def enqueue(self):

def process(self, batch_data):
"""
Processes a batch of task data for embedding generation.
Args:
batch_data (list[dict]): List of task data dictionaries, each containing:
- sequence (str): Input sequence.
- sequence_id (int): Sequence identifier.
- embedding_type_id (int): Embedding type identifier.
Processes a batch of sequences to generate embeddings.
Returns:
list[dict]: List of records with embedding results.
:param batch_data: List of dictionaries, each containing sequence data.
:type batch_data: list[dict]
:return: List of dictionaries with embedding results.
:rtype: list[dict]
:raises Exception: If there's an error during embedding generation.
Raises:
Exception: If an error occurs during embedding processing.
Example:
>>> batch_data = [{"sequence": "ATCG", "sequence_id": 1, "embedding_type_id": 2}]
>>> results = manager.process(batch_data)
"""
try:
embedding_type_id = batch_data[0]['embedding_type_id']
model = self.model_instances[embedding_type_id]
tokenizer = self.tokenizer_instances[embedding_type_id]
module = self.types[embedding_type_id]['module']

# Pass embedding_type_id explicitly to the embedding_task function
embedding_records = module.embedding_task(batch_data, model, tokenizer, embedding_type_id=embedding_type_id)
embedding_records = module.embedding_task(
batch_data, model, tokenizer, embedding_type_id=embedding_type_id
)

return embedding_records

@@ -149,17 +157,17 @@ def process(self, batch_data):

def store_entry(self, records):
"""
Stores embedding records in the database.
Stores embedding results in the database.
Args:
records (list[dict]): List of embedding records, each containing:
- sequence_id (int): Sequence identifier.
- embedding_type_id (int): Embedding type identifier.
- embedding: The embedding result.
- shape: Shape of the embedding.
:param records: List of embedding result dictionaries.
:type records: list[dict]
:raises RuntimeError: If an error occurs during database storage.
Raises:
RuntimeError: If an error occurs during database storage.
Example:
>>> records = [
...     {"sequence_id": 1, "embedding_type_id": 2, "embedding": [0.1, 0.2], "shape": [2]}
... ]
>>> manager.store_entry(records)
"""
session = self.session
try:
80 changes: 73 additions & 7 deletions protein_metamorphisms_is/pipelines/fantasia/README.md
@@ -1,18 +1,84 @@
# FANTASIA: Functional Annotation System and Task Orchestration

![FANTASIA Logo](img/FANTASIA_logo.png)

FANTASIA (Functional ANnoTAtion based on embedding space SImilArity) is a pipeline for annotating Gene Ontology (GO) terms for protein sequences using advanced protein language models like **ProtT5**, **ProstT5**, and **ESM2**. This system automates complex workflows, from sequence processing to functional annotation, providing a scalable and efficient solution for protein structure and functionality analysis.

---

## Key Features

- **Redundancy Filtering**: Removes identical sequences with **CD-HIT** and optionally excludes sequences based on length constraints.
- **Embedding Generation**: Utilizes state-of-the-art models for protein sequence embeddings.
- **GO Term Lookup**: Matches embeddings with a vector database to retrieve associated GO terms.
- **Results**: Outputs annotations in timestamped CSV files for reproducibility.

---

## Installation

To install FANTASIA, ensure you have Python 3.8+ installed and run the following command:

```bash
pip install fantasia-pipeline
```

For more details, visit the [project documentation](https://protein-metamorphisms-is.readthedocs.io/en/latest/pipelines/fantasia.html).

---

## Quick Start

### Prerequisites

Ensure the **Information System** is properly configured before running FANTASIA. Detailed instructions are available in the [project documentation](../../README.md).

### Running the Pipeline

Execute the following command, specifying the path to the configuration file:

```bash
python main.py --config <path_to_config.yaml>
```

### Pipeline Overview

1. **Redundancy Filtering**: Removes identical sequences and optionally filters sequences based on length.
2. **Embedding Generation**: Computes embeddings for sequences using supported models and stores them in HDF5 format.
3. **GO Term Lookup**: Queries a vector database to find and annotate similar proteins.
4. **Output**: Saves annotations in a structured CSV file.
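Step 3 is at heart a nearest-neighbour search in embedding space. A self-contained sketch of that idea with NumPy (the real pipeline queries a vector database; all names and data below are illustrative only):

```python
import numpy as np

def cosine_similarity(query, matrix):
    """Cosine similarity between one query vector and each row of a matrix."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

# Toy reference set: embeddings whose GO annotations are already known.
reference = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
go_terms = ["GO:0008150", "GO:0003674", "GO:0005575"]

query = np.array([0.9, 0.1])
scores = cosine_similarity(query, reference)
best = int(np.argmax(scores))
print(go_terms[best])  # the nearest neighbour's GO term
```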

FANTASIA is a system designed for functional annotation and large-scale task orchestration, providing an efficient, automated solution for complex workflows in structural and functional protein analysis.

---

## Documentation

For complete details on pipeline configuration, parameters, and deployment, visit the [FANTASIA Documentation](https://protein-metamorphisms-is.readthedocs.io/en/latest/pipelines/fantasia.html).

---

## Citation

If you use FANTASIA in your work, please cite the following:

1. Martínez-Redondo, G. I., Barrios, I., Vázquez-Valls, M., Rojas, A. M., & Fernández, R. (2024). Illuminating the functional landscape of the dark proteome across the Animal Tree of Life.
https://doi.org/10.1101/2024.02.28.582465.

2. Barrios-Núñez, I., Martínez-Redondo, G. I., Medina-Burgos, P., Cases, I., Fernández, R. & Rojas, A.M. (2024). Decoding proteome functional information in model organisms using protein language models.
https://doi.org/10.1101/2024.02.14.580341.

---

## Contact Information

- Francisco Miguel Pérez Canales: [email protected]
- Gemma I. Martínez-Redondo: [email protected]
- Ana M. Rojas: [email protected]
- Rosa Fernández: [email protected]

---
