Documentation Overhaul for meds-tab #96
New file (+35 lines):

# Maintainer Guide

## Testing

Run all the fast tests (those that don't use a GPU) with:

```bash
pytest -k "not slow"
```

Run all tests with:

```bash
pytest
```

______________________________________________________________________

## Local Documentation Development

This section explains how to edit documentation files in the `docs` directory.

First, install the documentation dependencies:

```bash
pip install -e .[docs]
```

Then serve the docs locally (by default at http://127.0.0.1:8000) with:

```bash
mkdocs serve
```
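Note that `-k "not slow"` filters on pytest keywords, which include test names and marker names, so it only skips tests that are named or marked accordingly. A minimal sketch with hypothetical test names (not ones from this repository):

```python
# Hypothetical tests illustrating how `pytest -k "not slow"` selects.
def test_fast_feature_counts():
    assert 1 + 1 == 2  # selected: "slow" does not appear in its keywords


def test_slow_full_pipeline():
    ...  # deselected: "slow" appears in the test name
```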
New file (+83 lines):

````python
"""Generate the code reference pages and navigation."""

from pathlib import Path

import mkdocs_gen_files

api_nav = mkdocs_gen_files.Nav()
config_nav = mkdocs_gen_files.Nav()

root = Path(__file__).parent.parent
src = root / "src" / "MEDS_tabular_automl"


def process_python_files():
    for path in sorted(src.rglob("*.py")):
        # Skip the configs directory for API Reference
        if "configs" in path.parts:
            continue

        module_path = path.relative_to(src).with_suffix("")
        doc_path = path.relative_to(src).with_suffix(".md")
        full_doc_path = Path("reference/api") / doc_path

        parts = tuple(module_path.parts)

        if parts[-1] == "__main__":
            continue

        md_file_lines = []

        if parts[-1] == "__init__":
            parts = parts[:-1]
            doc_path = doc_path.with_name("index.md")
            full_doc_path = full_doc_path.with_name("index.md")

            readme_path = path.parent / "README.md"
            if readme_path.exists():
                md_file_lines.append(f'--8<-- "{readme_path.relative_to(root)}"')

        if parts:  # Only add to navigation if parts is not empty
            api_nav[parts] = doc_path.as_posix()

        ident = "MEDS_tabular_automl"
        if parts:
            ident += "." + ".".join(parts)
        md_file_lines.append(f"::: {ident}")

        with mkdocs_gen_files.open(full_doc_path, "w") as fd:
            fd.write("\n".join(md_file_lines))

        mkdocs_gen_files.set_edit_path(full_doc_path, path.relative_to(root))


def process_yaml_files():
    config_dir = src / "configs"
    for path in sorted(config_dir.rglob("*.yaml")) + sorted(config_dir.rglob("*.yml")):
        rel_path = path.relative_to(config_dir)
        doc_path = rel_path.with_suffix(".md")
        full_doc_path = Path("reference/config") / doc_path

        parts = tuple(rel_path.parts)

        config_nav[parts] = doc_path.as_posix()

        with mkdocs_gen_files.open(full_doc_path, "w") as fd:
            fd.write(f"# {path.stem}\n\n")
            fd.write("```yaml\n")
            fd.write(path.read_text())
            fd.write("\n```\n")

        mkdocs_gen_files.set_edit_path(full_doc_path, path.relative_to(root))


process_python_files()
process_yaml_files()

with mkdocs_gen_files.open("reference/api/SUMMARY.md", "w") as nav_file:
    nav_file.write("# API Reference\n\n")
    nav_file.writelines(api_nav.build_literate_nav())

with mkdocs_gen_files.open("reference/config/SUMMARY.md", "w") as nav_file:
    nav_file.write("# Config Reference\n\n")
    nav_file.writelines(config_nav.build_literate_nav())
````

**Review comment:** Handle potential file read errors when processing YAML files. When reading YAML files at line 68, consider adding error handling to manage situations where a file might not be read successfully due to permissions or other I/O issues. You can modify the code to include a try-except block:

````diff
             fd.write("```yaml\n")
-            fd.write(path.read_text())
+            try:
+                fd.write(path.read_text())
+            except Exception as e:
+                fd.write(f"# Error reading file {path.name}: {e}")
             fd.write("\n```\n")
````
Modified file:
````diff
@@ -29,12 +29,12 @@ This initial stage processes a pre-sharded dataset. We expect a structure as follows:
 ...
 ```

-We then compute and store feature frequencies, crucial for determining which features are relevant for further analysis.
+We then compute and store feature counts, crucial for determining which features are relevant for further analysis.

 **Detailed Workflow:**

 - **Data Loading and Sharding**: We iterate through shards to compute feature frequencies for each shard.
-- **Frequency Aggregation**: After computing frequencies across shards, we aggregate them to get a final count of each feature across the entire training dataset, which allows us to filter out infrequent features in the tabularization stage or when tuning XGBoost.
+- **Count Aggregation**: After computing feature counts across shards, we aggregate them to get a final count of each feature across the entire training dataset, which allows us to filter out infrequent features in the tabularization stage or when tuning XGBoost.

 ## 2. Tabularization of Time-Series Data
````
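The renamed **Count Aggregation** step boils down to summing one counter per shard. A minimal sketch of the idea, assuming each shard is a Parquet file with a `code` column (the paths, column name, and threshold here are illustrative, not the package's actual interface):

```python
from collections import Counter
from pathlib import Path

import polars as pl


def shard_feature_counts(shard_path: Path) -> Counter:
    """Count how often each code appears in one shard."""
    codes = pl.read_parquet(shard_path).get_column("code")
    return Counter(codes.to_list())


# Aggregate counts across all training shards, then filter rare features.
total_counts: Counter = Counter()
for shard in sorted(Path("data/train").glob("*.parquet")):  # hypothetical layout
    total_counts += shard_feature_counts(shard)

min_code_inclusion_count = 10  # hypothetical threshold
kept_features = {c for c, n in total_counts.items() if n >= min_code_inclusion_count}
```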
```diff
@@ -92,7 +92,7 @@ Now that we have generated tabular features for all the events in our dataset, we
 - **Row Selection Based on Tasks**: Only the data rows that are relevant to the specific tasks are selected and cached. This reduces the memory footprint and speeds up the training process.
 - **Use of Sparse Matrices for Efficient Storage**: Sparse matrices are again employed here to store the selected data efficiently, ensuring that only non-zero data points are kept in memory, thus optimizing both storage and retrieval times.

-The file structure for the cached data mirrors that of the tabular data, also consisting of `.npz` files, where users must specify the directory that stores labels. Labels follow the same shard filestructure as the input meds data from step (1), and the label parquets need `subject_id`, `timestamp`, and `label` columns.
+The file structure for the cached data mirrors that of the tabular data, also consisting of `.npz` files, where users must specify the directory that stores labels. Labels must follow the [MEDS label-schema](https://github.com/Medical-Event-Data-Standard/meds?tab=readme-ov-file#the-label-schema), specifically including the `subject_id`, `prediction_time`, and `boolean_value` columns, which are necessary for binary classification tasks.

 ## 4. XGBoost Training
```
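The new label requirement is easy to sanity-check up front. A minimal sketch, assuming labels sit in Parquet shards under a hypothetical `labels/` directory (the path is illustrative; the column names come from the MEDS label schema cited above):

```python
from pathlib import Path

import polars as pl

REQUIRED_COLUMNS = {"subject_id", "prediction_time", "boolean_value"}


def check_label_shard(path: Path) -> None:
    """Raise if a label shard is missing the MEDS label-schema columns."""
    missing = REQUIRED_COLUMNS - set(pl.read_parquet(path).columns)
    if missing:
        raise ValueError(f"{path} is missing label columns: {sorted(missing)}")


for shard in sorted(Path("labels/train").glob("*.parquet")):  # hypothetical layout
    check_label_shard(shard)
```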
```diff
@@ -102,4 +102,4 @@ The final stage uses the processed and cached data to train an XGBoost model.

 - **Iterator for Data Loading**: Custom iterators are designed to load sparse matrices efficiently into the XGBoost training process, which can handle sparse inputs natively, thus maintaining high computational efficiency.
 - **Training and Validation**: The model is trained using the tabular data, with evaluation steps that include early stopping to prevent overfitting and tuning of hyperparameters based on validation performance.
-- **Hyperaparameter Tuning**: We use [optuna](https://optuna.org/) to tune over XGBoost model pramters, aggregations, window sizes, and the minimimum code inclusion frequency.
+- **Hyperaparameter Tuning**: We use [optuna](https://optuna.org/) to tune over XGBoost model pramters, aggregations, window sizes, and the minimimum code inclusion count.
```
**Review comment:** Fix typos in the hyperparameter tuning description. There are several spelling errors in this line: "Hyperaparameter" should be "Hyperparameter", "pramters" should be "parameters", and "minimimum" should be "minimum". Apply this correction:

```diff
-We use [optuna](https://optuna.org/) to tune over XGBoost model pramters, aggregations, window sizes, and the minimimum code inclusion count.
+We use [optuna](https://optuna.org/) to tune over XGBoost model parameters, aggregations, window sizes, and the minimum code inclusion count.
```
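To make the iterator and tuning bullets above concrete: XGBoost consumes `scipy.sparse` matrices directly, and an Optuna objective just wraps one training run with early stopping. A rough sketch under assumed file names (the `.npz` and label paths are hypothetical, not this package's real layout):

```python
import numpy as np
import optuna
import scipy.sparse as sp
import xgboost as xgb

# Hypothetical paths; the real pipeline shards these files.
X_train = sp.load_npz("tabularized/train/0.npz")  # sparse feature matrix
y_train = np.load("labels/train/0.npy")
X_val = sp.load_npz("tabularized/tuning/0.npz")
y_val = np.load("labels/tuning/0.npy")

# XGBoost handles scipy sparse inputs natively -- no densification needed.
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)


def objective(trial: optuna.Trial) -> float:
    params = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "eta": trial.suggest_float("eta", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
    }
    booster = xgb.train(
        params,
        dtrain,
        num_boost_round=500,
        evals=[(dval, "val")],
        early_stopping_rounds=25,  # stop when validation AUC plateaus
        verbose_eval=False,
    )
    return booster.best_score  # validation AUC at the best iteration


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
```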
New file (+1 line):

```markdown
--8<-- "README.md"
```
New file (+19 lines):

```js
window.MathJax = {
  tex: {
    inlineMath: [["\\(", "\\)"]],
    displayMath: [["\\[", "\\]"]],
    processEscapes: true,
    processEnvironments: true
  },
  options: {
    ignoreHtmlClass: ".*|",
    processHtmlClass: "arithmatex"
  }
};

document$.subscribe(() => {
  MathJax.startup.output.clearCache()
  MathJax.typesetClear()
  MathJax.texReset()
  MathJax.typesetPromise()
})
```
New file (+5 lines):

```html
{% extends "base.html" %}

{% block announce %}
This is a community-supported tool. If you'd like to contribute, check out our GitHub repository. Your contributions are welcome!
{% endblock %}
```
**Review comment (🛠️ Refactor suggestion):** Simplify YAML file discovery. At line 56, you're combining two sorted lists to find YAML files. Consider using a single `rglob` pass that matches both `.yaml` and `.yml` extensions to simplify the code. You can modify the code as follows:
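A sketch of one way to implement that suggestion (`rglob` has no brace-expansion syntax, so filtering on `Path.suffix` in a single pass is a reasonable reading of the advice):

```python
# Single traversal, keeping both extensions and preserving sorted order.
yaml_paths = sorted(
    p for p in config_dir.rglob("*") if p.suffix in {".yaml", ".yml"}
)
for path in yaml_paths:
    ...  # unchanged body of process_yaml_files()
```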