
Documentation Overhaul for meds-tab #96

Merged
merged 5 commits into from
Oct 22, 2024
Changes from 3 commits
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
Original file line number Diff line number Diff line change
@@ -3,7 +3,7 @@ default_language_version:

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
-   rev: v4.4.0
+   rev: v5.0.0
    hooks:
      # list of supported hooks: https://pre-commit.com/hooks.html
      - id: trailing-whitespace
567 changes: 24 additions & 543 deletions README.md

Large diffs are not rendered by default.

20 changes: 0 additions & 20 deletions docs/Makefile

This file was deleted.

35 changes: 35 additions & 0 deletions docs/README.MD
@@ -0,0 +1,35 @@
# Maintainer Guide

## Testing

To run tests, use the following commands.

Run all the fast tests (which don't use a GPU) with:

```bash
pytest -k "not slow"
```

Run all tests with:

```bash
pytest
```

______________________________________________________________________

## Local Documentation Development

This section explains how to edit documentation files in the `docs` directory.

First, install the documentation dependencies:

```bash
pip install -e .[docs]
```

Then run:

```bash
mkdocs serve
```
Binary file removed docs/assets/dark_purple_meds_tab.png
Binary file not shown.
File renamed without changes
Binary file removed docs/assets/light_purple_meds_tab.png
Binary file not shown.
83 changes: 83 additions & 0 deletions docs/gen_ref_pages.py
@@ -0,0 +1,83 @@
"""Generate the code reference pages and navigation."""

from pathlib import Path

import mkdocs_gen_files

api_nav = mkdocs_gen_files.Nav()
config_nav = mkdocs_gen_files.Nav()

root = Path(__file__).parent.parent
src = root / "src" / "MEDS_tabular_automl"


def process_python_files():
    for path in sorted(src.rglob("*.py")):
        # Skip the configs directory for API Reference
        if "configs" in path.parts:
            continue

        module_path = path.relative_to(src).with_suffix("")
        doc_path = path.relative_to(src).with_suffix(".md")
        full_doc_path = Path("reference/api") / doc_path

        parts = tuple(module_path.parts)

        if parts[-1] == "__main__":
            continue

        md_file_lines = []

        if parts[-1] == "__init__":
            parts = parts[:-1]
            doc_path = doc_path.with_name("index.md")
            full_doc_path = full_doc_path.with_name("index.md")

            readme_path = path.parent / "README.md"
            if readme_path.exists():
                md_file_lines.append(f'--8<-- "{readme_path.relative_to(root)}"')

        if parts:  # Only add to navigation if parts is not empty
            api_nav[parts] = doc_path.as_posix()

        ident = "MEDS_tabular_automl"
        if parts:
            ident += "." + ".".join(parts)
        md_file_lines.append(f"::: {ident}")

        with mkdocs_gen_files.open(full_doc_path, "w") as fd:
            fd.write("\n".join(md_file_lines))

        mkdocs_gen_files.set_edit_path(full_doc_path, path.relative_to(root))


def process_yaml_files():
    config_dir = src / "configs"
    for path in sorted(config_dir.rglob("*.yaml")) + sorted(config_dir.rglob("*.yml")):
🛠️ Refactor suggestion

Simplify YAML file discovery

At line 56, you're combining two sorted lists to find YAML files. Consider using a single rglob with a pattern that matches both .yaml and .yml extensions to simplify the code.

You can modify the code as follows:

- for path in sorted(config_dir.rglob("*.yaml")) + sorted(config_dir.rglob("*.yml")):
+ for path in sorted(config_dir.rglob("*.y*ml")):

Committable suggestion was skipped due to low confidence.

        rel_path = path.relative_to(config_dir)
        doc_path = rel_path.with_suffix(".md")
        full_doc_path = Path("reference/config") / doc_path

        parts = tuple(rel_path.parts)

        config_nav[parts] = doc_path.as_posix()

        with mkdocs_gen_files.open(full_doc_path, "w") as fd:
            fd.write(f"# {path.stem}\n\n")
            fd.write("```yaml\n")
            fd.write(path.read_text())

⚠️ Potential issue

Handle potential file read errors when processing YAML files

When reading YAML files at line 68, consider adding error handling to manage situations where a file might not be read successfully due to permissions or other I/O issues.

You can modify the code to include a try-except block:

 fd.write("```yaml\n")
- fd.write(path.read_text())
+ try:
+     fd.write(path.read_text())
+ except Exception as e:
+     fd.write(f"# Error reading file {path.name}: {e}")
 fd.write("\n```\n")

            fd.write("\n```\n")

        mkdocs_gen_files.set_edit_path(full_doc_path, path.relative_to(root))


process_python_files()
process_yaml_files()

with mkdocs_gen_files.open("reference/api/SUMMARY.md", "w") as nav_file:
    nav_file.write("# API Reference\n\n")
    nav_file.writelines(api_nav.build_literate_nav())

with mkdocs_gen_files.open("reference/config/SUMMARY.md", "w") as nav_file:
    nav_file.write("# Config Reference\n\n")
    nav_file.writelines(config_nav.build_literate_nav())
9 changes: 0 additions & 9 deletions docs/generate.sh

This file was deleted.

8 changes: 4 additions & 4 deletions docs/source/implementation.md → docs/implementation.md
@@ -29,12 +29,12 @@ This initial stage processes a pre-sharded dataset. We expect a structure as fo
...
```

We then compute and store feature frequencies, crucial for determining which features are relevant for further analysis.
We then compute and store feature counts, crucial for determining which features are relevant for further analysis.

**Detailed Workflow:**

- **Data Loading and Sharding**: We iterate through shards to compute feature frequencies for each shard.
- **Frequency Aggregation**: After computing frequencies across shards, we aggregate them to get a final count of each feature across the entire dataset training dataset, which allows us to filter out infrequent features in the tabularization stage or when tuning XGBoost.
- **Count Aggregation**: After computing feature counts across shards, we aggregate them to get a final count of each feature across the entire training dataset, which allows us to filter out infrequent features in the tabularization stage or when tuning XGBoost.
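The count-then-aggregate workflow above can be sketched in miniature as follows. This is an illustrative toy, not the package's actual implementation: the in-memory shard representation and the `min_code_inclusion_count` parameter name are assumptions.

```python
from collections import Counter
from functools import reduce


def count_codes_in_shard(shard_rows):
    """Count occurrences of each code within a single shard."""
    return Counter(row["code"] for row in shard_rows)


def aggregate_counts(shards, min_code_inclusion_count=2):
    """Sum per-shard counts, then drop codes below the inclusion threshold."""
    total = reduce(
        lambda a, b: a + b,
        (count_codes_in_shard(s) for s in shards),
        Counter(),
    )
    return {code: n for code, n in total.items() if n >= min_code_inclusion_count}


# Toy shards standing in for the sharded training split:
shards = [
    [{"code": "HR"}, {"code": "HR"}, {"code": "RARE_LAB"}],
    [{"code": "HR"}, {"code": "BP"}, {"code": "BP"}],
]
kept = aggregate_counts(shards, min_code_inclusion_count=2)  # RARE_LAB is filtered out
```

In the real pipeline each shard is processed independently (and so can be parallelized), and only the final aggregated counts are needed to decide which features survive tabularization.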

## 2. Tabularization of Time-Series Data

@@ -92,7 +92,7 @@ Now that we have generated tabular features for all the events in our dataset, w
- **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.
- **Use of Sparse Matrices for Efficient Storage**: Sparse matrices are again employed here to store the selected data efficiently, ensuring that only non-zero data points are kept in memory, thus optimizing both storage and retrieval times.

The file structure for the cached data mirrors that of the tabular data, also consisting of `.npz` files, where users must specify the directory that stores labels. Labels follow the same shard filestructure as the input meds data from step (1), and the label parquets need `subject_id`, `timestamp`, and `label` columns.
The file structure for the cached data mirrors that of the tabular data, also consisting of `.npz` files, where users must specify the directory that stores labels. Labels must follow the [MEDS label-schema](https://github.com/Medical-Event-Data-Standard/meds?tab=readme-ov-file#the-label-schema), specifically including the `subject_id`, `prediction_time`, and `boolean_value` columns which are necessary for binary classification tasks.
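As a rough illustration of the row-selection idea (schematic only — the actual caching code, index layout, and file names in `MEDS_tabular_automl` differ), matching labeled `(subject_id, prediction_time)` pairs against rows of a sparse feature matrix might look like:

```python
import numpy as np
from scipy import sparse

# Tabularized features for one shard; each row corresponds to a
# (subject_id, prediction_time) event key.
row_index = [(1, "2020-01-01"), (1, "2020-06-01"), (2, "2020-03-01")]
X = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]))

# Label rows in (roughly) the MEDS label schema.
labels = [
    {"subject_id": 1, "prediction_time": "2020-06-01", "boolean_value": True},
    {"subject_id": 2, "prediction_time": "2020-03-01", "boolean_value": False},
]

# Keep only the feature rows that have a matching label.
positions = {key: i for i, key in enumerate(row_index)}
keep = [positions[(lab["subject_id"], lab["prediction_time"])] for lab in labels]
X_task = X[keep]  # still sparse; only task-relevant rows are retained
y = np.array([lab["boolean_value"] for lab in labels])
```

The selected matrix would then be written out with `scipy.sparse.save_npz`, mirroring the `.npz` layout of the tabularized data.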

## 4. XGBoost Training

@@ -102,4 +102,4 @@ The final stage uses the processed and cached data to train an XGBoost model. Th

- **Iterator for Data Loading**: Custom iterators are designed to load sparse matrices efficiently into the XGBoost training process, which can handle sparse inputs natively, thus maintaining high computational efficiency.
- **Training and Validation**: The model is trained using the tabular data, with evaluation steps that include early stopping to prevent overfitting and tuning of hyperparameters based on validation performance.
- **Hyperaparameter Tuning**: We use [optuna](https://optuna.org/) to tune over XGBoost model pramters, aggregations, window sizes, and the minimimum code inclusion frequency.
- **Hyperparameter Tuning**: We use [optuna](https://optuna.org/) to tune over XGBoost model parameters, aggregations, window sizes, and the minimum code inclusion count.

⚠️ Potential issue

Fix typos in hyperparameter tuning description.

There are several spelling errors in this line:

  • "pramters" should be "parameters"
  • "minimimum" should be "minimum"

Apply this correction:

- We use [optuna](https://optuna.org/) to tune over XGBoost model pramters, aggregations, window sizes, and the minimimum code inclusion count.
+ We use [optuna](https://optuna.org/) to tune over XGBoost model parameters, aggregations, window sizes, and the minimum code inclusion count.
1 change: 1 addition & 0 deletions docs/index.md
@@ -0,0 +1 @@
--8<-- "README.md"
19 changes: 19 additions & 0 deletions docs/javascripts/mathjax.js
@@ -0,0 +1,19 @@
window.MathJax = {
  tex: {
    inlineMath: [["\\(", "\\)"]],
    displayMath: [["\\[", "\\]"]],
    processEscapes: true,
    processEnvironments: true
  },
  options: {
    ignoreHtmlClass: ".*|",
    processHtmlClass: "arithmatex"
  }
};

document$.subscribe(() => {
  MathJax.startup.output.clearCache()
  MathJax.typesetClear()
  MathJax.texReset()
  MathJax.typesetPromise()
})
35 changes: 0 additions & 35 deletions docs/make.bat

This file was deleted.

5 changes: 5 additions & 0 deletions docs/overrides/main.html
@@ -0,0 +1,5 @@
{% extends "base.html" %}

{% block announce %}
This is a community-supported tool. If you'd like to contribute, check out our GitHub repository. Your contributions are welcome!
{% endblock %}