Merge pull request #20 from CBBIO/fantasia
Fantasia
frapercan authored Dec 19, 2024
2 parents bbe6bbd + 197615a commit 3276442
Showing 12 changed files with 199 additions and 121 deletions.
10 changes: 5 additions & 5 deletions docs/source/configuration/fantasia/parameters.rst
Original file line number Diff line number Diff line change
@@ -31,21 +31,21 @@ Default system configuration should be OK for any Unix system, for the current s
- `limit_execution`: Maximum number of items to process (for debugging). Default: `100`.

5. LookUp table from CSV
---------------------
-------------------------

- `load_accesion_csv`: Path to the CSV file containing accessions. Default: `~/fantasia/input/genes.csv`.
- `load_accesion_column`: Column name in the CSV file. Default: `id`.
- `tag`: Identifier tag for accession loading. Default: `'GOA2022'`.
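A minimal sketch of how these keys might appear in the configuration file (the exact nesting inside `config.yaml` is an assumption, not taken from the repository):

.. code-block:: yaml

   # Hypothetical excerpt; key placement may differ in your config.yaml.
   load_accesion_csv: ~/fantasia/input/genes.csv
   load_accesion_column: id
   tag: 'GOA2022'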

5b. LookUp table from UniProt (By default)
----------------------
------------------------------------------

- `search_criteria`: Query string for UniProt search. Default: `'(reviewed:true)'`.
- `limit`: Number of results to fetch from UniProt. Default: `200`.
- `tag`: Identifier tag for accession loading. Default: `'GOA2022'`. (This is the same tag defined in section 5.)
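For illustration, `search_criteria` and `limit` map naturally onto a UniProtKB REST search request; the sketch below only builds such a URL (the endpoint shown is UniProt's public search API, and the helper name is hypothetical, not part of this codebase):

```python
from urllib.parse import urlencode

def build_uniprot_query(search_criteria="(reviewed:true)", limit=200):
    # Compose a UniProtKB search URL from the two configuration values.
    base = "https://rest.uniprot.org/uniprotkb/search"
    return f"{base}?{urlencode({'query': search_criteria, 'size': limit})}"

print(build_uniprot_query())
```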

6. Evidence Codes
-----------------
-----------------
- `allowed_evidences`: This parameter specifies the list of UniProt evidence codes used to filter annotations. The default selection includes codes that indicate experimental evidence:

- `EXP`: Inferred from Experiment.
@@ -62,7 +62,7 @@ Default system configuration should be OK for any Unix system, for the current s
However, you can include any evidence codes from the full set defined by the Gene Ontology Consortium. For more details on all available evidence codes, visit the `Guide to GO Evidence Codes <https://geneontology.org/docs/guide-go-evidence-codes/>`_.
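As a sketch, restricting the filter to experimental evidence only might look like this (the key nesting is an assumption; the codes themselves are standard GO evidence codes):

.. code-block:: yaml

   allowed_evidences:
     - EXP   # Inferred from Experiment
     - IDA   # Inferred from Direct Assay
     - IPI   # Inferred from Physical Interaction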

7. FANTASIA Settings
---------------------
--------------------

- `fantasia_input_fasta`: Path to the input FASTA file. Default: `~/fantasia/input/input.fasta`.
- `fantasia_output_h5`: Directory to save embeddings in HDF5 format. Default: `~/fantasia/embeddings/`.
@@ -75,7 +75,7 @@ Default system configuration should be OK for any Unix system, for the current s
- `topgo`: Enable or disable TopGO analysis. Default: `True`.
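A hedged sketch of these keys in YAML (the paths shown are the documented defaults; the nesting is an assumption):

.. code-block:: yaml

   fantasia_input_fasta: ~/fantasia/input/input.fasta
   fantasia_output_h5: ~/fantasia/embeddings/
   topgo: True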

8. Embedding Settings
----------------------
---------------------

- `embedding.types`: List of embedding types to compute (Comment with # at the beginning to de/activate):

35 changes: 18 additions & 17 deletions docs/source/index.rst
@@ -12,21 +12,6 @@ BioInfo design (Plan)
- **BioInfo - API**: Application Programming Interface that allows integration and access to system data and functionalities from other systems.
- **BioInfo - Portal**: Indicates an access platform or user interface where the project's tools and data are deployed for research exploitation.

Pipelines
===============
.. toctree::
:maxdepth: 3
:caption: Pipelines
:hidden:


pipelines/fantasia/


- `Fantasia <pipelines/fantasia.html>`_: Functional ANnoTAtion based on embedding space SImilArity

- Massive search of metamorphisms and multifunctionality (WIP)


Task Types
===============
@@ -44,6 +29,22 @@ Task Types
- `GPU Tasks <tasks/gpu.html>`_: Handle computationally intensive tasks using GPU.


Pipelines
===============
.. toctree::
:maxdepth: 3
:caption: Pipelines
:hidden:


pipelines/fantasia/


- `Fantasia <pipelines/fantasia.html>`_: Functional ANnoTAtion based on embedding space SImilArity

- Massive search of metamorphisms and multifunctionality (WIP)


Extraction
===============
.. toctree::
@@ -65,9 +66,9 @@ Embedding
:caption: Embedding
:hidden:

operation/embedding/sequence_embeddings
operation/embedding/sequence_embedding

- `Sequence Embedding <operation/embedding/sequence_embeddings.html>`_: Facilitates encoding of protein and nucleotide sequences into dense vectors.
- `Sequence Embedding <operation/embedding/sequence_embedding.html>`_: Facilitates encoding of protein and nucleotide sequences into dense vectors.


FAQ
@@ -1,6 +1,8 @@
Embedding Modulefds
================
Sequence Embeddings Module
==========================


.. automodule:: protein_metamorphisms_is.operation.embedding.sequence_embedding
:members:
:show-inheritance:
:show-inheritance:

8 changes: 5 additions & 3 deletions protein_metamorphisms_is/main.py
@@ -5,7 +5,6 @@
AccessionManager,
PDBExtractor,
UniProtExtractor,
SequenceEmbeddingManager,
Structure3DiManager,
SequenceClustering,
StructuralSubClustering,
@@ -14,22 +13,25 @@
GoMultifunctionalityMetrics,
)


def main(config_path='config/config.yaml'):
conf = read_yaml_config(config_path)
AccessionManager(conf).fetch_accessions_from_api()
AccessionManager(conf).load_accessions_from_csv()
UniProtExtractor(conf).start()
PDBExtractor(conf).start()
SequenceEmbeddingManager(conf).start()
Structure3DiManager(conf).start()
SequenceClustering(conf).start()
StructuralSubClustering(conf).start()
StructuralAlignmentManager(conf).start()
SequenceGOAnnotation(conf).start()
GoMultifunctionalityMetrics(conf).start()


if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Run the pipeline with a specified configuration file.")
parser.add_argument("--config", type=str, required=False, help="Path to the configuration YAML file.")
args = parser.parse_args()
main(config_path=args.config)

config_path = args.config if args.config else 'config/config.yaml'
main(config_path=config_path)
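The same fallback can be expressed by handing the default to argparse directly, which removes the manual `if`; a minimal sketch, not the committed code:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Run the pipeline with a specified configuration file."
)
# argparse supplies the default itself, so no post-parse fallback is needed.
parser.add_argument("--config", type=str, default="config/config.yaml",
                    help="Path to the configuration YAML file.")

args = parser.parse_args([])  # empty argv for demonstration
print(args.config)  # prints config/config.yaml
```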
@@ -9,6 +9,7 @@ def load_model(model_name):
def load_tokenizer(model_name):
return AutoTokenizer.from_pretrained(model_name)


def embedding_task(sequences, model, tokenizer, embedding_type_id=None):
"""
Processes sequences to generate embeddings.
82 changes: 45 additions & 37 deletions protein_metamorphisms_is/operation/embedding/sequence_embedding.py
@@ -5,6 +5,7 @@
SequenceEmbeddingType,
SequenceEmbedding,
)

from protein_metamorphisms_is.sql.model.entities.sequence.sequence import Sequence
from protein_metamorphisms_is.tasks.gpu import GPUTaskInitializer

@@ -17,20 +18,27 @@ class SequenceEmbeddingManager(GPUTaskInitializer):
for embedding generation.
Attributes:
reference_attribute (str): Reference attribute name used for embedding.
model_instances (dict): A dictionary storing loaded models.
tokenizer_instances (dict): A dictionary storing loaded tokenizers.
base_module_path (str): Base module path for dynamic model imports.
batch_size (int): Batch size for processing sequences, defaults to 40.
types (dict): Dictionary containing model and task configurations.
reference_attribute (str): Name of the attribute used as the reference for embedding (default: 'sequence').
model_instances (dict): Dictionary of loaded models keyed by embedding type ID.
tokenizer_instances (dict): Dictionary of loaded tokenizers keyed by embedding type ID.
base_module_path (str): Base module path for dynamic imports of embedding tasks.
batch_size (int): Number of sequences processed per batch. Defaults to 40.
types (dict): Configuration dictionary for embedding types.
"""

def __init__(self, conf):
"""
Initializes the SequenceEmbeddingManager.
Args:
conf (dict): Configuration dictionary containing embedding parameters.
:param conf: Configuration dictionary containing embedding parameters.
:type conf: dict
Example:
>>> conf = {
...     "embedding": {"batch_size": 50, "types": [1, 2]},
...     "limit_execution": 100
... }
>>> manager = SequenceEmbeddingManager(conf)
"""
super().__init__(conf)
self.reference_attribute = 'sequence'
@@ -42,10 +50,9 @@ def __init__(self, conf):

def fetch_models_info(self):
"""
Retrieves and initializes embedding models based on configuration.
Retrieves and initializes embedding models based on the database configuration.
Queries the `SequenceEmbeddingType` table to fetch available embedding models.
Modules are dynamically imported and stored in the `types` attribute.
:raises sqlalchemy.exc.SQLAlchemyError: If there's an error querying the database.
"""
self.session_init()
embedding_types = self.session.query(SequenceEmbeddingType).all()
@@ -68,11 +75,13 @@ def enqueue(self):
"""
Enqueues tasks for sequence embedding processing.
Fetches all sequences from the database, batches them, and checks for existing embeddings.
If no existing embeddings are found, tasks are prepared and published for processing.
Splits all sequences into batches, checks for existing embeddings in the database,
and prepares tasks for missing embeddings.
Raises:
Exception: If an error occurs during the enqueue process.
:raises Exception: If an error occurs during the enqueue process.
Example:
>>> manager.enqueue()
"""
try:
self.logger.info("Starting embedding enqueue process.")
@@ -118,28 +127,27 @@ def enqueue(self):

def process(self, batch_data):
"""
Processes a batch of task data for embedding generation.
Args:
batch_data (list[dict]): List of task data dictionaries, each containing:
- sequence (str): Input sequence.
- sequence_id (int): Sequence identifier.
- embedding_type_id (int): Embedding type identifier.
Processes a batch of sequences to generate embeddings.
Returns:
list[dict]: List of records with embedding results.
:param batch_data: List of dictionaries, each containing sequence data.
:type batch_data: list[dict]
:return: List of dictionaries with embedding results.
:rtype: list[dict]
:raises Exception: If there's an error during embedding generation.
Raises:
Exception: If an error occurs during embedding processing.
Example:
>>> batch_data = [{"sequence": "ATCG", "sequence_id": 1, "embedding_type_id": 2}]
>>> results = manager.process(batch_data)
"""
try:
embedding_type_id = batch_data[0]['embedding_type_id']
model = self.model_instances[embedding_type_id]
tokenizer = self.tokenizer_instances[embedding_type_id]
module = self.types[embedding_type_id]['module']

# Pass embedding_type_id explicitly to the embedding_task function
embedding_records = module.embedding_task(batch_data, model, tokenizer, embedding_type_id=embedding_type_id)
embedding_records = module.embedding_task(
batch_data, model, tokenizer, embedding_type_id=embedding_type_id
)

return embedding_records

@@ -149,17 +157,17 @@ def process(self, batch_data):

def store_entry(self, records):
"""
Stores embedding records in the database.
Stores embedding results in the database.
Args:
records (list[dict]): List of embedding records, each containing:
- sequence_id (int): Sequence identifier.
- embedding_type_id (int): Embedding type identifier.
- embedding: The embedding result.
- shape: Shape of the embedding.
:param records: List of embedding result dictionaries.
:type records: list[dict]
:raises RuntimeError: If an error occurs during database storage.
Raises:
RuntimeError: If an error occurs during database storage.
Example:
>>> records = [
...     {"sequence_id": 1, "embedding_type_id": 2, "embedding": [0.1, 0.2], "shape": [2]}
... ]
>>> manager.store_entry(records)
"""
session = self.session
try:
80 changes: 73 additions & 7 deletions protein_metamorphisms_is/pipelines/fantasia/README.md
@@ -1,18 +1,84 @@
# FANTASIA: Functional Annotation System and Task Orchestration

![FANTASIA Logo](img/FANTASIA_logo.png)

FANTASIA (Functional ANnoTAtion based on embedding space SImilArity) is a pipeline for annotating Gene Ontology (GO) terms for protein sequences using advanced protein language models like **ProtT5**, **ProstT5**, and **ESM2**. This system automates complex workflows, from sequence processing to functional annotation, providing a scalable and efficient solution for protein structure and functionality analysis.

---

## Key Features

- **Redundancy Filtering**: Removes identical sequences with **CD-HIT** and optionally excludes sequences based on length constraints.
- **Embedding Generation**: Utilizes state-of-the-art models for protein sequence embeddings.
- **GO Term Lookup**: Matches embeddings with a vector database to retrieve associated GO terms.
- **Results**: Outputs annotations in timestamped CSV files for reproducibility.

---

## Installation

To install FANTASIA, ensure you have Python 3.8+ installed and run the following command:

```bash
pip install fantasia-pipeline
```

For more details, visit the [project documentation](https://protein-metamorphisms-is.readthedocs.io/en/latest/pipelines/fantasia.html).

---

## Quick Start

### Prerequisites

Ensure the **Information System** is properly configured before running FANTASIA. Detailed instructions are available in the [project documentation](../../README.md).

### Running the Pipeline

Execute the following command, specifying the path to the configuration file:

```bash
python main.py --config <path_to_config.yaml>
```

### Pipeline Overview

1. **Redundancy Filtering**: Removes identical sequences and optionally filters sequences based on length.
2. **Embedding Generation**: Computes embeddings for sequences using supported models and stores them in HDF5 format.
3. **GO Term Lookup**: Queries a vector database to find and annotate similar proteins.
4. **Output**: Saves annotations in a structured CSV file.
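Step 3 is at heart a nearest-neighbour search in embedding space. A self-contained sketch of that idea with NumPy (the real pipeline queries a vector database; all names and data below are illustrative only):

```python
import numpy as np

def cosine_similarity(query, matrix):
    """Cosine similarity between one query vector and each row of a matrix."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

# Toy reference set: embeddings whose GO annotations are already known.
reference = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
go_terms = ["GO:0008150", "GO:0003674", "GO:0005575"]

query = np.array([0.9, 0.1])
scores = cosine_similarity(query, reference)
best = int(np.argmax(scores))
print(go_terms[best])  # the nearest neighbour's GO term
```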

FANTASIA is a system designed for functional annotation and large-scale task orchestration, providing an efficient, automated solution for complex workflows in structural and functional protein analysis.

---

## Documentation

For complete details on pipeline configuration, parameters, and deployment, visit the [FANTASIA Documentation](https://protein-metamorphisms-is.readthedocs.io/en/latest/pipelines/fantasia.html).

---

## Citation

If you use FANTASIA in your work, please cite the following:

1. Martínez-Redondo, G. I., Barrios, I., Vázquez-Valls, M., Rojas, A. M., & Fernández, R. (2024). Illuminating the functional landscape of the dark proteome across the Animal Tree of Life.
https://doi.org/10.1101/2024.02.28.582465.

2. Barrios-Núñez, I., Martínez-Redondo, G. I., Medina-Burgos, P., Cases, I., Fernández, R. & Rojas, A.M. (2024). Decoding proteome functional information in model organisms using protein language models.
https://doi.org/10.1101/2024.02.14.580341.

---

## Contact Information

- Francisco Miguel Pérez Canales: [email protected]
- Gemma I. Martínez-Redondo: [email protected]
- Ana M. Rojas: [email protected]
- Rosa Fernández: [email protected]

---
