
Documentation Overhaul for meds-tab #96

Merged
merged 5 commits into from
Oct 22, 2024
Changes from 3 commits
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
Original file line number Diff line number Diff line change
@@ -3,7 +3,7 @@ default_language_version:

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
-   rev: v4.4.0
+   rev: v5.0.0
    hooks:
      # list of supported hooks: https://pre-commit.com/hooks.html
      - id: trailing-whitespace
567 changes: 24 additions & 543 deletions README.md

Large diffs are not rendered by default.

20 changes: 0 additions & 20 deletions docs/Makefile

This file was deleted.

35 changes: 35 additions & 0 deletions docs/README.MD
@@ -0,0 +1,35 @@
# Maintainer Guide

## Testing

To run tests, use the following commands.

Run all the fast tests (which don't use a GPU) with:

```bash
pytest -k "not slow"
```

Run all tests with:

```bash
pytest
```

______________________________________________________________________

## Local Documentation Development

This section explains how to edit documentation files in the `docs` directory.

First, install the documentation dependencies:

```bash
pip install -e .[docs]
```

Then run:

```bash
mkdocs serve
```
Binary file removed docs/assets/dark_purple_meds_tab.png
Binary file not shown.
File renamed without changes
Binary file removed docs/assets/light_purple_meds_tab.png
Binary file not shown.
83 changes: 83 additions & 0 deletions docs/gen_ref_pages.py
@@ -0,0 +1,83 @@
"""Generate the code reference pages and navigation."""

from pathlib import Path

import mkdocs_gen_files

api_nav = mkdocs_gen_files.Nav()
config_nav = mkdocs_gen_files.Nav()

root = Path(__file__).parent.parent
src = root / "src" / "MEDS_tabular_automl"


def process_python_files():
    for path in sorted(src.rglob("*.py")):
        # Skip the configs directory for API Reference
        if "configs" in path.parts:
            continue

        module_path = path.relative_to(src).with_suffix("")
        doc_path = path.relative_to(src).with_suffix(".md")
        full_doc_path = Path("reference/api") / doc_path

        parts = tuple(module_path.parts)

        if parts[-1] == "__main__":
            continue

        md_file_lines = []

        if parts[-1] == "__init__":
            parts = parts[:-1]
            doc_path = doc_path.with_name("index.md")
            full_doc_path = full_doc_path.with_name("index.md")

            readme_path = path.parent / "README.md"
            if readme_path.exists():
                md_file_lines.append(f'--8<-- "{readme_path.relative_to(root)}"')

        if parts:  # Only add to navigation if parts is not empty
            api_nav[parts] = doc_path.as_posix()

        ident = "MEDS_tabular_automl"
        if parts:
            ident += "." + ".".join(parts)
        md_file_lines.append(f"::: {ident}")

        with mkdocs_gen_files.open(full_doc_path, "w") as fd:
            fd.write("\n".join(md_file_lines))

        mkdocs_gen_files.set_edit_path(full_doc_path, path.relative_to(root))


def process_yaml_files():
    config_dir = src / "configs"
    for path in sorted(config_dir.rglob("*.yaml")) + sorted(config_dir.rglob("*.yml")):
🛠️ Refactor suggestion

Simplify YAML file discovery

At line 56, you're combining two sorted lists to find YAML files. Consider using a single rglob with a pattern that matches both .yaml and .yml extensions to simplify the code.

You can modify the code as follows:

- for path in sorted(config_dir.rglob("*.yaml")) + sorted(config_dir.rglob("*.yml")):
+ for path in sorted(config_dir.rglob("*.y*ml")):

Committable suggestion was skipped due to low confidence.

        rel_path = path.relative_to(config_dir)
        doc_path = rel_path.with_suffix(".md")
        full_doc_path = Path("reference/config") / doc_path

        parts = tuple(rel_path.parts)

        config_nav[parts] = doc_path.as_posix()

        with mkdocs_gen_files.open(full_doc_path, "w") as fd:
            fd.write(f"# {path.stem}\n\n")
            fd.write("```yaml\n")
            fd.write(path.read_text())

⚠️ Potential issue

Handle potential file read errors when processing YAML files

When reading YAML files at line 68, consider adding error handling to manage situations where a file might not be read successfully due to permissions or other I/O issues.

You can modify the code to include a try-except block:

 fd.write("```yaml\n")
- fd.write(path.read_text())
+ try:
+     fd.write(path.read_text())
+ except Exception as e:
+     fd.write(f"# Error reading file {path.name}: {e}")
 fd.write("\n```\n")

            fd.write("\n```\n")

        mkdocs_gen_files.set_edit_path(full_doc_path, path.relative_to(root))


process_python_files()
process_yaml_files()

with mkdocs_gen_files.open("reference/api/SUMMARY.md", "w") as nav_file:
    nav_file.write("# API Reference\n\n")
    nav_file.writelines(api_nav.build_literate_nav())

with mkdocs_gen_files.open("reference/config/SUMMARY.md", "w") as nav_file:
    nav_file.write("# Config Reference\n\n")
    nav_file.writelines(config_nav.build_literate_nav())
9 changes: 0 additions & 9 deletions docs/generate.sh

This file was deleted.

8 changes: 4 additions & 4 deletions docs/source/implementation.md → docs/implementation.md
@@ -29,12 +29,12 @@ This initial stage processes a pre-sharded dataset. We expect a structure as fo
...
```

We then compute and store feature frequencies, crucial for determining which features are relevant for further analysis.
We then compute and store feature counts, crucial for determining which features are relevant for further analysis.

**Detailed Workflow:**

- **Data Loading and Sharding**: We iterate through shards to compute feature frequencies for each shard.
- **Frequency Aggregation**: After computing frequencies across shards, we aggregate them to get a final count of each feature across the entire dataset training dataset, which allows us to filter out infrequent features in the tabularization stage or when tuning XGBoost.
- **Count Aggregation**: After computing feature counts across shards, we aggregate them to get a final count of each feature across the entire training dataset, which allows us to filter out infrequent features in the tabularization stage or when tuning XGBoost.
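The count-then-aggregate workflow above can be sketched in miniature as follows. This is an illustrative toy, not the package's actual implementation: the in-memory shard representation and the `min_code_inclusion_count` parameter name are assumptions.

```python
from collections import Counter
from functools import reduce


def count_codes_in_shard(shard_rows):
    """Count occurrences of each code within a single shard."""
    return Counter(row["code"] for row in shard_rows)


def aggregate_counts(shards, min_code_inclusion_count=2):
    """Sum per-shard counts, then drop codes below the inclusion threshold."""
    total = reduce(
        lambda a, b: a + b,
        (count_codes_in_shard(s) for s in shards),
        Counter(),
    )
    return {code: n for code, n in total.items() if n >= min_code_inclusion_count}


# Toy shards standing in for the sharded training split:
shards = [
    [{"code": "HR"}, {"code": "HR"}, {"code": "RARE_LAB"}],
    [{"code": "HR"}, {"code": "BP"}, {"code": "BP"}],
]
kept = aggregate_counts(shards, min_code_inclusion_count=2)  # RARE_LAB is filtered out
```

In the real pipeline each shard is processed independently (and so can be parallelized), and only the final aggregated counts are needed to decide which features survive tabularization.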

## 2. Tabularization of Time-Series Data

@@ -92,7 +92,7 @@ Now that we have generated tabular features for all the events in our dataset, w
- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.
- **Use of Sparse Matrices for Efficient Storage**: Sparse matrices are again employed here to store the selected data efficiently, ensuring that only non-zero data points are kept in memory, thus optimizing both storage and retrieval times.

The file structure for the cached data mirrors that of the tabular data, also consisting of `.npz` files, where users must specify the directory that stores labels. Labels follow the same shard filestructure as the input meds data from step (1), and the label parquets need `subject_id`, `timestamp`, and `label` columns.
The file structure for the cached data mirrors that of the tabular data, also consisting of `.npz` files, where users must specify the directory that stores labels. Labels must follow the [MEDS label-schema](https://github.com/Medical-Event-Data-Standard/meds?tab=readme-ov-file#the-label-schema), specifically including the `subject_id`, `prediction_time`, and `boolean_value` columns which are necessary for binary classification tasks.
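As a rough illustration of the row-selection idea (schematic only — the actual caching code, index layout, and file names in `MEDS_tabular_automl` differ), matching labeled `(subject_id, prediction_time)` pairs against rows of a sparse feature matrix might look like:

```python
import numpy as np
from scipy import sparse

# Tabularized features for one shard; each row corresponds to a
# (subject_id, prediction_time) event key.
row_index = [(1, "2020-01-01"), (1, "2020-06-01"), (2, "2020-03-01")]
X = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]))

# Label rows in (roughly) the MEDS label schema.
labels = [
    {"subject_id": 1, "prediction_time": "2020-06-01", "boolean_value": True},
    {"subject_id": 2, "prediction_time": "2020-03-01", "boolean_value": False},
]

# Keep only the feature rows that have a matching label.
positions = {key: i for i, key in enumerate(row_index)}
keep = [positions[(lab["subject_id"], lab["prediction_time"])] for lab in labels]
X_task = X[keep]  # still sparse; only task-relevant rows are retained
y = np.array([lab["boolean_value"] for lab in labels])
```

The selected matrix would then be written out with `scipy.sparse.save_npz`, mirroring the `.npz` layout of the tabularized data.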

## 4. XGBoost Training

@@ -102,4 +102,4 @@ The final stage uses the processed and cached data to train an XGBoost model. Th

- **Iterator for Data Loading**: Custom iterators are designed to load sparse matrices efficiently into the XGBoost training process, which can handle sparse inputs natively, thus maintaining high computational efficiency.
- **Training and Validation**: The model is trained using the tabular data, with evaluation steps that include early stopping to prevent overfitting and tuning of hyperparameters based on validation performance.
- **Hyperaparameter Tuning**: We use [optuna](https://optuna.org/) to tune over XGBoost model pramters, aggregations, window sizes, and the minimimum code inclusion frequency.
- **Hyperparameter Tuning**: We use [optuna](https://optuna.org/) to tune over XGBoost model parameters, aggregations, window sizes, and the minimum code inclusion count.

⚠️ Potential issue

Fix typos in hyperparameter tuning description.

There are several spelling errors in this line:

  • "pramters" should be "parameters"
  • "minimimum" should be "minimum"

Apply this correction:

- We use [optuna](https://optuna.org/) to tune over XGBoost model pramters, aggregations, window sizes, and the minimimum code inclusion count.
+ We use [optuna](https://optuna.org/) to tune over XGBoost model parameters, aggregations, window sizes, and the minimum code inclusion count.
1 change: 1 addition & 0 deletions docs/index.md
@@ -0,0 +1 @@
--8<-- "README.md"
19 changes: 19 additions & 0 deletions docs/javascripts/mathjax.js
@@ -0,0 +1,19 @@
window.MathJax = {
  tex: {
    inlineMath: [["\\(", "\\)"]],
    displayMath: [["\\[", "\\]"]],
    processEscapes: true,
    processEnvironments: true
  },
  options: {
    ignoreHtmlClass: ".*|",
    processHtmlClass: "arithmatex"
  }
};

document$.subscribe(() => {
  MathJax.startup.output.clearCache()
  MathJax.typesetClear()
  MathJax.texReset()
  MathJax.typesetPromise()
})
35 changes: 0 additions & 35 deletions docs/make.bat

This file was deleted.

5 changes: 5 additions & 0 deletions docs/overrides/main.html
@@ -0,0 +1,5 @@
{% extends "base.html" %}

{% block announce %}
This is a community-supported tool. If you'd like to contribute, check out our GitHub repository. Your contributions are welcome!
{% endblock %}