Merge branch 'main' into clean
Oufattole authored Jun 5, 2024
2 parents a19ad3e + 6240c8a commit ec73910
Showing 14 changed files with 267 additions and 24 deletions.
13 changes: 13 additions & 0 deletions .readthedocs.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
version: "2"

build:
  os: "ubuntu-22.04"
  tools:
    python: "3.12"

python:
  install:
    - requirements: docs/requirements.txt

sphinx:
  configuration: docs/source/conf.py
31 changes: 7 additions & 24 deletions README.md
@@ -71,13 +71,20 @@ bash hf_cohort/aces_task.sh # generates labels (step 5)
bash xgboost.sh # trains xgboost (step 6)
```


## Feature Construction, Storage, and Loading

Tabularization of a (raw) MEDS dataset is done by running the `scripts/data/tabularize.py` script. This script
first performs a base level of preprocessing over the MEDS data, then constructs a sharded tabular
representation that respects the overall sharding of the raw data. This script uses [Hydra](https://hydra.cc/)
to manage configuration, and the configuration file is located at `configs/tabularize.yaml`.

Tabularization takes as input a MEDS dataset in a directory we'll denote `$MEDS_cohort_dir` and writes a collection of tabularization files to disk in subdirectories of this cohort directory. In particular, for a given shard prefix in the raw MEDS cohort (e.g., `train/0`, `held_out/1`, etc.):

1. In `$MEDS_cohort_dir/tabularized/static/$SHARD_PREFIX.parquet` will be tabularized, wide-format representations of code / value occurrences with null timestamps. In the case that sub-sharding is needed, sub-shards will instead be written as sub-directories of this base directory: `$MEDS_cohort_dir/tabularized/static/$SHARD_PREFIX/$SUB_SHARD.parquet`. This sub-sharding pattern holds for all files and will not be mentioned again below.
2. In `$MEDS_cohort_dir/tabularized/at_observation/$SHARD_PREFIX.parquet` will be tabularized, wide-format representations of code / value observations for all observations of patient data with a non-null timestamp.
3. In `$MEDS_cohort_dir/tabularized/over_window/$WINDOW_SIZE/$SHARD_PREFIX.parquet` will be tabularized, wide-format summarization of the code / value occurrences over a window of size `$WINDOW_SIZE` as of the index date at the row's timestamp.
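
The layout above can be sketched as a small path helper. This is only an illustration of the directory convention described in the list, not code from the repository: the function name `tabularized_paths`, the example cohort directory, and the window-size strings `1d` and `30d` are all assumptions.

```python
from pathlib import Path


def tabularized_paths(meds_cohort_dir, shard_prefix, window_sizes, sub_shard=None):
    """Return the output files expected for one shard, keyed by representation."""
    root = Path(meds_cohort_dir) / "tabularized"
    # With sub-sharding, the shard prefix becomes a directory holding one file per sub-shard.
    if sub_shard is not None:
        leaf = f"{shard_prefix}/{sub_shard}.parquet"
    else:
        leaf = f"{shard_prefix}.parquet"
    paths = {
        "static": root / "static" / leaf,
        "at_observation": root / "at_observation" / leaf,
    }
    # One directory per window size (the size strings here are hypothetical).
    for window_size in window_sizes:
        paths[f"over_window/{window_size}"] = root / "over_window" / window_size / leaf
    return paths


if __name__ == "__main__":
    for name, path in tabularized_paths("/data/meds_cohort", "train/0", ["1d", "30d"]).items():
        print(f"{name}: {path.as_posix()}")
```

Passing `sub_shard="3"` would switch every entry to the `$SHARD_PREFIX/$SUB_SHARD.parquet` form.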

## AutoML Pipelines

# TODOs
@@ -93,27 +100,3 @@ to manage configuration, and the configuration file is located at `configs/tabul
5. Investigate the feasibility of using TemporAI for this task.
6. Consider splitting the feature construction and AutoML pipeline parts of this repository into separate
repositories.

# YAML Configuration File

- `MEDS_cohort_dir`: directory of MEDS format dataset that is ingested.
- `tabularized_data_dir`: output directory of tabularized data.
- `min_code_inclusion_frequency`: The base feature inclusion frequency that should be used to dictate
what features can be included in the flat representation. It can either be a float, in which
case it applies across all measurements, or `None`, in which case no filtering is applied, or
a dictionary from measurement type to a float dictating a per-measurement-type inclusion
cutoff.
- `window_sizes`: Beyond writing out a raw, per-event flattened representation, the dataset also has
the capability to summarize these flattened representations over the historical windows
specified in this argument. These are strings specifying time deltas, using this syntax:
`link`\_. Each window size will be summarized to a separate directory, and will share the same
subject file split as is used in the raw representation files.
- `codes`: A list of codes to include in the flat representation. If `None`, all codes will be included
in the flat representation.
- `aggs`: A list of aggregations to apply to the raw representation. Must have length greater than 0.
- `n_patients_per_sub_shard`: The number of subjects that should be included in each output file.
Lowering this number increases the number of files written, making the process of creating and
leveraging these files slower but more memory efficient.
- `do_overwrite`: If `True`, this function will overwrite the data already stored in the target save
directory.
- `seed`: The seed to use for random number generation.
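
As a rough illustration of the options listed above, the sketch below writes them out as a plain Python dictionary and checks the constraints the descriptions state. Every value is hypothetical (the paths, the window-size strings, the aggregation names), and `check_config` is an invented helper, not part of the repository or of Hydra.

```python
# Hypothetical values only; the keys mirror the options documented above.
example_config = {
    "MEDS_cohort_dir": "/data/meds_cohort",
    "tabularized_data_dir": "/data/meds_cohort/tabularized",
    "min_code_inclusion_frequency": 0.01,  # float, None, or {measurement type: float}
    "window_sizes": ["1d", "30d"],  # assumed time-delta strings
    "codes": None,  # None keeps every code
    "aggs": ["code/count", "value/sum"],  # assumed aggregation names
    "n_patients_per_sub_shard": 1000,
    "do_overwrite": False,
    "seed": 1,
}


def check_config(cfg):
    """Return a list of violations of the constraints stated in the option descriptions."""
    problems = []
    if len(cfg["aggs"]) == 0:
        problems.append("aggs must have length greater than 0")
    freq = cfg["min_code_inclusion_frequency"]
    if freq is not None and not isinstance(freq, (float, dict)):
        problems.append("min_code_inclusion_frequency must be a float, None, or a dict")
    if isinstance(freq, dict) and not all(isinstance(v, float) for v in freq.values()):
        problems.append("per-measurement-type cutoffs must be floats")
    return problems


if __name__ == "__main__":
    print(check_config(example_config))
```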
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
35 changes: 35 additions & 0 deletions docs/make.bat
@@ -0,0 +1,35 @@
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.https://www.sphinx-doc.org/
	exit /b 1
)

if "%1" == "" goto help

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd
15 changes: 15 additions & 0 deletions docs/requirements.txt
@@ -0,0 +1,15 @@
sphinx==7.1.2
sphinx-rtd-theme==1.3.0rc1
sphinx-collections
recommonmark
piccolo_theme
sphinx_subfigure
nbsphinx
myst_parser
pypandoc
linkify-it-py
ipykernel
omegaconf
ipywidgets
ipykernel
ipython
8 changes: 8 additions & 0 deletions docs/source/api.rst
@@ -0,0 +1,8 @@
API
====

.. autosummary::
   :toctree: generated
   :recursive:

   src
91 changes: 91 additions & 0 deletions docs/source/conf.py
@@ -0,0 +1,91 @@
import os
import sys

# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = "MEDS-TAB"
copyright = "2024, Matthew McDermott, Nassim Oufattole, Teya Bergamaschi"
author = "Matthew McDermott, Nassim Oufattole, Teya Bergamaschi"
release = "0.1.0"
version = "0.1.0"

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

sys.path.insert(0, os.path.abspath("../.."))
extensions = [
    "sphinx.ext.duration",
    "sphinx.ext.doctest",
    "sphinx.ext.autodoc",
    "sphinx.ext.autosummary",
    "sphinx.ext.intersphinx",
    "sphinx.ext.napoleon",
    "sphinx_rtd_theme",
    "recommonmark",
    # "sphinx_immaterial"
]

source_suffix = {
    ".rst": "restructuredtext",
    ".txt": "markdown",
    ".md": "markdown",
}

intersphinx_mapping = {
    "python": ("https://docs.python.org/3/", None),
    "sphinx": ("https://www.sphinx-doc.org/en/master/", None),
}
intersphinx_disabled_domains = ["std"]

templates_path = ["_templates"]
exclude_patterns = []

autosummary_generate = True

pygments_style = "tango"


# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

# html_theme = "sphinx_rtd_theme"
html_theme = "piccolo_theme"
# html_theme = "sphinx_immaterial"
html_static_path = ["_static"]


html_title = f"MEDS-TAB v{version} Documentation"
html_short_title = "MEDS-TAB Documentation"

# html_logo = "query-512.png"
# html_favicon = "query-16.ico"

# html_sidebars = {"**": ["logo-text.html", "globaltoc.html", "localtoc.html", "searchbox.html"]}

html_theme_options = {
    "dark_mode_code_blocks": False,
    # "nav_title": "MEDS-TAB",
    # "palette": {"primary": "green", "accent": "green"},
    # "repo_url": "https://github.com/mmcdermott/MEDS_Tabular_AutoML",
    # "repo_name": "MEDS_Tabular_AutoML",
    # # Visible levels of the global TOC; -1 means unlimited
    # "globaltoc_depth": 3,
    # If False, expand all TOC entries
    "globaltoc_collapse": True,
    # If True, show hidden TOC entries
    "globaltoc_includehidden": False,
}


html_show_copyright = True
htmlhelp_basename = "meds-tab-doc"


# -- Options for EPUB output
epub_show_urls = "footnote"
4 changes: 4 additions & 0 deletions docs/source/generated/src.MEDS_tabular_automl.rst
@@ -0,0 +1,4 @@
src.MEDS\_tabular\_automl
=========================

.. automodule:: src.MEDS_tabular_automl
30 changes: 30 additions & 0 deletions docs/source/generated/src.rst
@@ -0,0 +1,30 @@
src
===

.. automodule:: src



















.. rubric:: Modules

.. autosummary::
   :toctree:
   :recursive:

   src.MEDS_tabular_automl
20 changes: 20 additions & 0 deletions docs/source/index.rst
@@ -0,0 +1,20 @@
.. MEDS-TAB documentation master file, created by
   sphinx-quickstart on Mon Jun 3 20:41:52 2024.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to MEDS-TAB's documentation!
====================================
.. image:: https://readthedocs.org/projects/meds-tabular-automl/badge/?version=latest
   :target: https://meds-tabular-automl.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status

.. toctree::
   :maxdepth: 1
   :caption: Contents:

   overview
   installation
   usage
   api
   license
7 changes: 7 additions & 0 deletions docs/source/installation.rst
@@ -0,0 +1,7 @@
Installation
============

.. include:: ../../README.md
   :parser: markdown
   :start-after: Installation
   :end-before: Usage
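
The effect of `:start-after:` and `:end-before:` can be approximated in a few lines of Python. This is a simplified sketch of the behaviour the docutils `include` directive documents (slice after the first occurrence of one marker, then before the first occurrence of the other); `include_slice` is a hypothetical helper, and the sample README text is made up.

```python
def include_slice(text, start_after=None, end_before=None):
    """Approximate the text an include directive keeps for the given markers."""
    if start_after is not None:
        index = text.find(start_after)
        if index < 0:
            raise ValueError(f"start-after text not found: {start_after!r}")
        # Keep only what follows the first occurrence of the marker.
        text = text[index + len(start_after):]
    if end_before is not None:
        index = text.find(end_before)
        if index < 0:
            raise ValueError(f"end-before text not found: {end_before!r}")
        # Keep only what precedes the first remaining occurrence of the marker.
        text = text[:index]
    return text


if __name__ == "__main__":
    readme = "# Demo\n\n## Installation\n\npip install demo\n\n## Usage\n\nRun demo.\n"
    print(repr(include_slice(readme, "Installation", "Usage")))
```

Because the markers are matched as plain text, a bare word such as `Installation` matches its first occurrence anywhere in the file, and in this sketch the heading prefix `## ` that precedes `Usage` stays in the included slice. A more specific marker string would avoid both effects.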
4 changes: 4 additions & 0 deletions docs/source/license.rst
@@ -0,0 +1,4 @@
License
========

.. include:: ../../LICENSE
6 changes: 6 additions & 0 deletions docs/source/overview.rst
@@ -0,0 +1,6 @@
Overview
========

.. include:: ../../README.md
   :parser: markdown
   :end-before: Installation
7 changes: 7 additions & 0 deletions docs/source/usage.rst
@@ -0,0 +1,7 @@
Usage
======

.. include:: ../../README.md
   :parser: markdown
   :start-after: Usage
   :end-before: TODOs
