Update docs #40

Merged 6 commits on Jan 22, 2025
2 changes: 1 addition & 1 deletion .github/workflows/docs_build.yml
@@ -31,5 +31,5 @@ jobs:
python3 -m pip install --upgrade pip && python3 -m pip install poetry
poetry env use '3.10'
source $(poetry env info --path)/bin/activate
-poetry install --with docs,test
+poetry install --with docs,test,dev,peft
cd docs && rm -rf source/reference/api && make html
60 changes: 30 additions & 30 deletions README.md
@@ -1,23 +1,30 @@
# mmlearn

[![code checks](https://github.com/VectorInstitute/mmlearn/actions/workflows/code_checks.yml/badge.svg)](https://github.com/VectorInstitute/mmlearn/actions/workflows/code_checks.yml)
[![integration tests](https://github.com/VectorInstitute/mmlearn/actions/workflows/integration_tests.yml/badge.svg)](https://github.com/VectorInstitute/mmlearn/actions/workflows/integration_tests.yml)
[![license](https://img.shields.io/github/license/VectorInstitute/mmlearn.svg)](https://github.com/VectorInstitute/mmlearn/blob/main/LICENSE)

-This project aims at enabling the evaluation of existing multimodal representation learning methods, as well as facilitating
+*mmlearn* aims at enabling the evaluation of existing multimodal representation learning methods, as well as facilitating
experimentation and research for new techniques.

## Quick Start

### Installation

#### Prerequisites

The library requires Python 3.10 or later. We recommend using a virtual environment to manage dependencies. You can create
a virtual environment using the following command:

```bash
python3 -m venv /path/to/new/virtual/environment
source /path/to/new/virtual/environment/bin/activate
```

#### Installing binaries

To install the pre-built binaries, run:

```bash
python3 -m pip install mmlearn
```
@@ -73,13 +80,15 @@ Uses the <a href=https://huggingface.co/docs/peft/index>PEFT</a> library to enab
</table>

For example, to install the library with the `vision` and `audio` extras, run:

```bash
python3 -m pip install mmlearn[vision,audio]
```

</details>

#### Building from source

To install the library from source, run:

```bash
@@ -89,6 +98,7 @@ python3 -m pip install -e .
```

### Running Experiments

We use [Hydra](https://hydra.cc/docs/intro/) and [hydra-zen](https://mit-ll-responsible-ai.github.io/hydra-zen/) to manage configurations
in the library.

@@ -97,9 +107,11 @@ have an `__init__.py` file to make it a Python package and an `experiment` folder
This format allows the use of `.yaml` configuration files as well as Python modules (using [structured configs](https://hydra.cc/docs/tutorials/structured_config/intro/) or [hydra-zen](https://mit-ll-responsible-ai.github.io/hydra-zen/)) to define the experiment configurations.
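
As an illustration only, a hypothetical configuration package (here called `my_project.configs`) might register an experiment config with hydra-zen roughly as follows; the package name and config fields are placeholders rather than part of mmlearn, and a real experiment would compose the library's task, dataset, and module configs:

```python
# Hypothetical my_project/configs/__init__.py -- a minimal sketch of registering an
# experiment config with hydra-zen. All names and fields here are placeholders.
from hydra_zen import make_config, store

# Build a simple structured config (a dynamically generated dataclass).
MyExperimentConf = make_config(seed=42, max_epochs=10)

# Register it under the "experiment" group so that `+experiment=my_experiment` can
# resolve it once `pkg://my_project.configs` is added to `hydra.searchpath`.
store(MyExperimentConf, group="experiment", name="my_experiment")
store.add_to_hydra_store()
```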

To run an experiment, use the following command:

```bash
mmlearn_run 'hydra.searchpath=[pkg://path.to.config.directory]' +experiment=<name_of_experiment_yaml_file> experiment=your_experiment_name
```

Hydra will compose the experiment configuration from all the configurations in the specified directory as well as all the
configurations in the `mmlearn` package. *Note the dot-separated path to the directory containing the experiment configuration
files.*
@@ -109,23 +121,38 @@ One can add a path to `hydra.searchpath` either as a package (`pkg://path.to.con
Hence, please refrain from using the `file://` notation.

Hydra also allows for overriding configuration parameters from the command line. To see the available options and other information, run:

```bash
mmlearn_run 'hydra.searchpath=[pkg://path.to.config.directory]' +experiment=<name_of_experiment_yaml_file> --help
```

By default, the `mmlearn_run` command will run the experiment locally. To run the experiment on a SLURM cluster, we use
the [submitit launcher](https://hydra.cc/docs/plugins/submitit_launcher/) plugin built into Hydra. The following is an example
of how to run an experiment on a SLURM cluster:

```bash
-mmlearn_run --multirun hydra.launcher.mem_gb=32 hydra.launcher.qos=your_qos hydra.launcher.partition=your_partition hydra.launcher.gres=gpu:4 hydra.launcher.cpus_per_task=8 hydra.launcher.tasks_per_node=4 hydra.launcher.nodes=1 hydra.launcher.stderr_to_stdout=true hydra.launcher.timeout_min=60 '+hydra.launcher.additional_parameters={export: ALL}' 'hydra.searchpath=[pkg://path.to.config.directory]' +experiment=<name_of_experiment_yaml_file> experiment=your_experiment_name
+mmlearn_run --multirun \
+hydra.launcher.mem_per_cpu=5G \
+hydra.launcher.qos=your_qos \
+hydra.launcher.partition=your_partition \
+hydra.launcher.gres=gpu:4 \
+hydra.launcher.cpus_per_task=8 \
+hydra.launcher.tasks_per_node=4 \
+hydra.launcher.nodes=1 \
+hydra.launcher.stderr_to_stdout=true \
+hydra.launcher.timeout_min=720 \
+'hydra.searchpath=[pkg://path.to.my_project.configs]' \
++experiment=my_experiment \
+experiment_name=my_experiment_name
```

This will submit a job to the SLURM cluster with the specified resources.

**Note**: After the job is submitted, it is okay to cancel the program with `Ctrl+C`. The job will continue running on
the cluster. You can also add `&` at the end of the command to run it in the background.


## Summary of Implemented Methods

<table>
<tr>
<th style="text-align: left; width: 250px"> Pretraining Methods </th>
@@ -181,33 +208,6 @@ Binary and multi-class classification tasks are supported.
</tr>
</table>

## Components
### Datasets
Every dataset object must return an instance of `Example` with one or more keys/attributes corresponding to a modality name
as specified in the `Modalities` registry. The `Example` object must also include an `example_index` attribute/key, which
is used, in addition to the dataset index, to uniquely identify the example.
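
The following is a minimal, illustrative sketch of a map-style dataset that returns `Example` objects; the import path for `Example` is an assumption and may differ from the actual package layout:

```python
# A minimal sketch only; the import path for `Example` is assumed, not verified.
import torch
from torch.utils.data import Dataset

from mmlearn.datasets.core import Example  # assumed import path


class ToyTextDataset(Dataset):
    """Toy dataset whose items carry a single, already-tokenized 'text' modality."""

    def __init__(self, token_ids: torch.Tensor) -> None:
        self.token_ids = token_ids  # shape: (num_examples, seq_len)

    def __len__(self) -> int:
        return self.token_ids.size(0)

    def __getitem__(self, idx: int) -> Example:
        # "text" should match a modality name registered in the `Modalities` registry;
        # `example_index` uniquely identifies the example within the dataset.
        return Example({"text": self.token_ids[idx], "example_index": idx})
```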

<details>
<summary><b>CombinedDataset</b></summary>

The `CombinedDataset` object is used to combine multiple datasets into one. It accepts an iterable of `torch.utils.data.Dataset`
and/or `torch.utils.data.IterableDataset` objects and returns an `Example` object from one of the datasets, given an index.
Conceptually, the `CombinedDataset` object is a concatenation of the datasets in the input iterable, so the given index
can be mapped to a specific dataset based on the size of the datasets. As iterable-style datasets do not support random access,
the examples from these datasets are returned in order as they are iterated over.

The `CombinedDataset` object also adds a `dataset_index` attribute to the `Example` object, corresponding to the index of
the dataset in the input iterable. Every example returned by the `CombinedDataset` also has an `example_ids` attribute,
an instance of `Example` containing the same keys/attributes as the original example (excluding `example_index` and
`dataset_index`), where each value is a tensor built from the `dataset_index` and `example_index`.
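
Reusing the `ToyTextDataset` sketch from the Datasets section above, a hedged usage example (the `CombinedDataset` import path is an assumption) might look like:

```python
import torch

from mmlearn.datasets.core import CombinedDataset  # assumed import path

ds_a = ToyTextDataset(torch.randint(0, 100, (4, 8)))  # 4 examples
ds_b = ToyTextDataset(torch.randint(0, 100, (2, 8)))  # 2 examples
combined = CombinedDataset([ds_a, ds_b])

example = combined[5]          # indices 0-3 map to ds_a, 4-5 map to ds_b
print(example.dataset_index)   # 1: index of the source dataset in the input iterable
print(example.example_ids)     # Example holding tensors of dataset_index/example_index
```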
</details>

### Dataloading
When dealing with multiple datasets with different modalities, the default `collate_fn` of `torch.utils.data.DataLoader`
may not work, as it assumes that all examples have the same keys/attributes. In that case, the `collate_example_list`
function can be used as the `collate_fn` argument of `torch.utils.data.DataLoader`. This function takes a list of `Example`
objects and returns a dictionary of tensors, with all the keys/attributes of the `Example` objects.
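
A hedged sketch of plugging `collate_example_list` into a `DataLoader` (the import path is an assumption, and `combined` is the `CombinedDataset` from the sketch above):

```python
from torch.utils.data import DataLoader

from mmlearn.datasets.core import collate_example_list  # assumed import path

loader = DataLoader(combined, batch_size=3, collate_fn=collate_example_list)

batch = next(iter(loader))
# The batch gathers the keys/attributes of the individual Example objects.
print(batch.keys())
```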

## Contributing

If you are interested in contributing to the library, please see [CONTRIBUTING.MD](CONTRIBUTING.MD). This file contains
20 changes: 11 additions & 9 deletions docs/source/conf.py
@@ -31,8 +31,14 @@
"sphinx_copybutton",
"sphinx_design",
"sphinxcontrib.apidoc",
"myst_parser",
]
add_module_names = False
apidoc_module_dir = "../../mmlearn"
apidoc_output_dir = "reference/api"
apidoc_excluded_paths = ["tests"]
apidoc_separate_modules = True
apidoc_module_first = True
autoclass_content = "class"
autodoc_default_options = {
"members": True,
@@ -47,13 +53,6 @@
autosummary_generate = True
copybutton_prompt_text = r">>> |\.\.\. "
copybutton_prompt_is_regexp = True
napoleon_google_docstring = False
napoleon_numpy_docstring = True
napoleon_include_init_with_doc = True
napoleon_attr_annotations = True
set_type_checking_flag = True


intersphinx_mapping = {
"python": ("https://docs.python.org/3.10/", None),
"numpy": ("http://docs.scipy.org/doc/numpy/", None),
@@ -67,9 +66,12 @@
"torchmetrics": ("https://lightning.ai/docs/torchmetrics/stable/", None),
"Pillow": ("https://pillow.readthedocs.io/en/latest/", None),
"transformers": ("https://huggingface.co/docs/transformers/en/", None),
"peft": ("https://huggingface.co/docs/peft/en/", None),
}

napoleon_google_docstring = False
napoleon_numpy_docstring = True
napoleon_include_init_with_doc = True
napoleon_attr_annotations = True
set_type_checking_flag = True
templates_path = ["_templates"]

# -- Options for HTML output -------------------------------------------------
2 changes: 2 additions & 0 deletions docs/source/contributing.rst
@@ -0,0 +1,2 @@
.. include:: ../../CONTRIBUTING.md
:parser: myst_parser.sphinx_
3 changes: 2 additions & 1 deletion docs/source/index.rst
@@ -12,5 +12,6 @@ Contents
:maxdepth: 2

installation
getting_started
user_guide
contributing
api