Support Loading Tabularized Features for Statistical Analysis #101

Oufattole · 2024-11-14T20:01:53Z

Overview

Enable users to load and analyze sparse representations of patient data across different temporal aggregation windows in a notebook environment. This will facilitate statistical analysis of tabularized features.

Requirements

Feature Discovery Function

from pathlib import Path

def get_available_windows_and_aggs(tabularized_data_dir: Path) -> tuple[list[str], list[str]]:
    """
    Scan the filesystem to discover available window sizes and aggregation types.
    
    Args:
        tabularized_data_dir (Path): Directory containing tabularized data files
        
    Returns:
        tuple[list[str], list[str]]: Available window sizes and aggregation types
    """
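
One way the discovery step could work is sketched below. The on-disk layout is an assumption that this issue does not specify: files are taken to live at `<tabularized_data_dir>/.../<window>/<agg>.npz`. The function name `scan_windows_and_aggs_sketch` is hypothetical as well.

```python
from pathlib import Path


def scan_windows_and_aggs_sketch(
    tabularized_data_dir: Path,
) -> tuple[list[str], list[str]]:
    """Collect window sizes and aggregation names from file paths.

    ASSUMPTION: files are stored as <dir>/.../<window>/<agg>.npz.
    This layout is illustrative, not confirmed by the project.
    """
    windows: set[str] = set()
    aggs: set[str] = set()
    for fp in Path(tabularized_data_dir).rglob("*.npz"):
        windows.add(fp.parent.name)  # parent directory name is read as the window
        aggs.add(fp.stem)  # file name without suffix is read as the aggregation
    # Sort so the result is deterministic across filesystems.
    return sorted(windows), sorted(aggs)
```

If the real layout nests aggregations differently, only the two lines that read `fp.parent.name` and `fp.stem` would need to change.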

Data Loading Function

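from pathlib import Path
from typing import Union

import pandas as pd
import scipy as sp
import scipy.sparse  # ensures sp.sparse is available on older SciPy versions

def load_data(
    tabularized_data_dir: Path,
    windows: list[str],
    aggs: list[str],
    metadata_fp: Path,
) -> Union[pd.DataFrame, tuple[sp.sparse.csr_matrix, list[str], pd.DataFrame]]: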
    """
    Load sparse patient representation for specified windows and aggregations.
    
    Args:
        tabularized_data_dir (Path): Directory containing tabularized data files
        windows (list[str]): List of window sizes to include
        aggs (list[str]): List of aggregation types to include
        metadata_fp (Path): Path to patient metadata file
        
    Returns:
        Either:
        - pd.DataFrame: Sparse DataFrame with features as columns, (patient_id, time) as index
        - tuple[sp.sparse.csr_matrix, list[str], pd.DataFrame]: 
            - Sparse matrix containing feature values
            - List of feature names (code + window + agg)
            - DataFrame with patient_id and time information
    """
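
A minimal sketch of the sparse-matrix part of this loader follows. It assumes one `.npz` file per (window, agg) pair at `<dir>/<window>/<agg>.npz`, written with `scipy.sparse.save_npz`, and that all files share the same row order. Those are assumptions, as is the name `load_sparse_blocks_sketch`. Metadata handling and per-code feature names are left out.

```python
from pathlib import Path

import scipy.sparse as sps


def load_sparse_blocks_sketch(
    tabularized_data_dir: Path, windows: list[str], aggs: list[str]
) -> tuple[sps.csr_matrix, list[str]]:
    """Stack one sparse block per (window, agg) pair column-wise.

    ASSUMPTION: each block lives at <dir>/<window>/<agg>.npz and every
    block has the same number of rows in the same order.
    """
    if not windows or not aggs:
        raise ValueError("At least one window and one aggregation are required")
    blocks, block_names = [], []
    for window in windows:
        for agg in aggs:
            fp = Path(tabularized_data_dir) / window / f"{agg}.npz"
            if not fp.is_file():
                raise FileNotFoundError(f"Missing tabularized file: {fp}")
            blocks.append(sps.load_npz(fp))
            block_names.append(f"{window}_{agg}")
    # hstack concatenates the blocks along the column axis.
    return sps.hstack(blocks, format="csr"), block_names
```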

Data Structure

Rows: Aligned with patient and time information
Columns: Feature names in format {code}_{window}_{agg}
Values: Sparse representation of feature values
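
The column naming scheme above can be illustrated with a small helper. The function name and the column ordering (window, then aggregation, then code) are assumptions made for the example only.

```python
from itertools import product


def build_feature_names_sketch(
    codes: list[str], windows: list[str], aggs: list[str]
) -> list[str]:
    """Return names in the {code}_{window}_{agg} format.

    ASSUMPTION: columns are ordered by window, then aggregation, then code.
    """
    return [
        f"{code}_{window}_{agg}"
        for window, agg, code in product(windows, aggs, codes)
    ]


names = build_feature_names_sketch(["HR"], ["1h"], ["mean", "max"])

# Codes may themselves contain underscores, so split from the right
# to recover the three parts of a feature name.
code, window, agg = "LAB_GLUCOSE_24h_mean".rsplit("_", 2)
```

Splitting from the right only works if window and aggregation names contain no underscores, which may be worth validating.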

Output Options

Option A: Sparse pandas DataFrame
- Index: MultiIndex with (patient_id, time)
- Columns: Feature names
Option B: Tuple of three elements
- Sparse matrix (CSR format) containing the data
- List of column names (features)
- DataFrame with patient_id and time information
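
The two output options carry the same information, so one can be derived from the other. The sketch below builds Option A from Option B with `pd.DataFrame.sparse.from_spmatrix` and converts back with `DataFrame.sparse.to_coo`. The toy values and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sps

# Option B: matrix, feature names, and row information (toy data).
matrix = sps.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]))
feature_names = ["HR_1h_mean", "HR_24h_mean"]
row_info = pd.DataFrame(
    {
        "patient_id": [1, 1, 2],
        "time": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
    }
)

# Option A: sparse DataFrame indexed by (patient_id, time).
df = pd.DataFrame.sparse.from_spmatrix(
    matrix,
    index=pd.MultiIndex.from_frame(row_info),
    columns=feature_names,
)

# Back to a CSR matrix when a library expects SciPy input.
round_trip = sps.csr_matrix(df.sparse.to_coo())
```

Returning Option B and letting callers build the DataFrame on demand keeps the loader independent of pandas sparse dtypes, though either default is reasonable.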

Implementation Notes

- Use an efficient sparse matrix format (such as CSR) to handle large, sparse feature sets
- Handle missing or corrupt files with clear, actionable error messages
- Consider validating requested window sizes and aggregation types against those available on disk
- Show progress indicators (for example, tqdm.auto.tqdm) for long-running operations, since this will be used in notebooks
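
A rough sketch of the validation and progress-bar notes follows. The helper name `validate_selection_sketch` is hypothetical, and the fallback for a missing `tqdm` install is an added convenience rather than something this issue asks for.

```python
from collections.abc import Iterable

try:
    # tqdm.auto picks a notebook widget or a console bar as appropriate.
    from tqdm.auto import tqdm
except ImportError:  # fall back to a no-op wrapper if tqdm is not installed

    def tqdm(iterable, **kwargs):
        return iterable


def validate_selection_sketch(
    requested: Iterable[str], available: Iterable[str], kind: str
) -> None:
    """Raise ValueError if any requested value is not available."""
    available_set = set(available)
    unknown = sorted(set(requested) - available_set)
    if unknown:
        raise ValueError(
            f"Unknown {kind}: {unknown}. Available: {sorted(available_set)}"
        )


# Validate first, then iterate with a progress bar.
validate_selection_sketch(["1h", "24h"], ["1h", "4h", "24h"], "windows")
visited = [w for w in tqdm(["1h", "24h"], desc="Loading windows")]
```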

Examples

# Get available options
windows, aggs = get_available_windows_and_aggs(data_dir)
print(f"Available windows: {windows}")
print(f"Available aggregations: {aggs}")

# Load data (DataFrame option)
df = load_data(
    data_dir,
    windows=["1h", "4h", "24h"],
    aggs=["mean", "max", "count"],
    metadata_fp=metada

teyaberg self-assigned this Nov 14, 2024
