Support Loading Tabularized Features for Statistical Analysis #101

Open
Oufattole opened this issue Nov 14, 2024 · 0 comments

Overview

Enable users to load and analyze sparse representations of patient data across different temporal aggregation windows in a notebook environment. This will facilitate statistical analysis of tabularized features.

Requirements

Feature Discovery Function

from pathlib import Path


def get_available_windows_and_aggs(tabularized_data_dir: Path) -> tuple[list[str], list[str]]:
    """
    Scan the filesystem to discover available window sizes and aggregation types.
    
    Args:
        tabularized_data_dir (Path): Directory containing tabularized data files
        
    Returns:
        tuple[list[str], list[str]]: Available window sizes and aggregation types
    """
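
One way the discovery step could look is sketched below. The issue does not specify the on-disk layout, so this assumes a hypothetical one in which every file sits at `<...>/<window>/<agg>.npz`: the parent directory names the window and the file stem names the aggregation. Adjust the parsing if the real layout differs.

```python
from pathlib import Path


def get_available_windows_and_aggs(tabularized_data_dir: Path) -> tuple[list[str], list[str]]:
    """Collect window sizes and aggregation types found under a directory.

    ASSUMPTION (not from the issue): files are stored as
    ``<any prefix>/<window>/<agg>.npz``.
    """
    root = Path(tabularized_data_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"Not a directory: {root}")

    windows: set[str] = set()
    aggs: set[str] = set()
    for fp in root.rglob("*.npz"):
        windows.add(fp.parent.name)  # hypothetical: parent folder is the window
        aggs.add(fp.stem)            # hypothetical: file stem is the aggregation

    # Plain string sort, so "24h" comes before "4h"; sort by duration if needed.
    return sorted(windows), sorted(aggs)
```

Returning sorted lists keeps the output stable across runs, which is convenient when the result is printed in a notebook.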

Data Loading Function

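from pathlib import Path
from typing import Union

import pandas as pd
import scipy as sp
import scipy.sparse


def load_data(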
    tabularized_data_dir: Path,
    windows: list[str],
    aggs: list[str],
    metadata_fp: Path
) -> Union[pd.DataFrame, tuple[sp.sparse.csr_matrix, list[str], pd.DataFrame]]:
    """
    Load sparse patient representation for specified windows and aggregations.
    
    Args:
        tabularized_data_dir (Path): Directory containing tabularized data files
        windows (list[str]): List of window sizes to include
        aggs (list[str]): List of aggregation types to include
        metadata_fp (Path): Path to patient metadata file
        
    Returns:
        Either:
        - pd.DataFrame: Sparse DataFrame with features as columns, (patient_id, time) as index
        - tuple[sp.sparse.csr_matrix, list[str], pd.DataFrame]: 
            - Sparse matrix containing feature values
            - List of feature names (code + window + agg)
            - DataFrame with patient_id and time information
    """

Data Structure

  • Rows: Aligned with patient and time information
  • Columns: Feature names in format {code}_{window}_{agg}
  • Values: Sparse representation of feature values
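
To make the column naming concrete, here is a minimal sketch of stacking one sparse block per (window, agg) pair side by side and generating names in the `{code}_{window}_{agg}` format. The helper name `combine_feature_blocks`, the `blocks` dictionary, and the example codes are hypothetical; the sketch also assumes every block has the same rows in the same order and one column per code.

```python
from scipy import sparse


def combine_feature_blocks(blocks, codes):
    """Horizontally stack per-(window, agg) blocks and name the columns.

    blocks: dict mapping (window, agg) -> sparse matrix of shape (n_rows, len(codes))
    codes:  list of code names, in the column order used by every block
    """
    if not blocks:
        raise ValueError("No blocks to combine")

    matrices = []
    names = []
    for (window, agg), block in blocks.items():
        if block.shape[1] != len(codes):
            raise ValueError(
                f"Block ({window}, {agg}) has {block.shape[1]} columns, "
                f"expected {len(codes)}"
            )
        matrices.append(block)
        names.extend(f"{code}_{window}_{agg}" for code in codes)

    # hstack keeps the data sparse; CSR suits row-wise (per patient/time) access.
    combined = sparse.hstack(matrices, format="csr")
    return combined, names
```

Because the names are generated in the same loop that collects the blocks, column `i` of the combined matrix always corresponds to `names[i]`.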

Output Options

  1. Option A: Sparse pandas DataFrame

    • Index: MultiIndex with (patient_id, time)
    • Columns: Feature names
  2. Option B: Tuple of three elements

    • Sparse matrix (CSR format) containing the data
    • List of column names (features)
    • DataFrame with patient_id and time information
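
The two options are interchangeable, which may simplify the choice: Option B can be turned into Option A with pandas' `DataFrame.sparse.from_spmatrix`. The sketch below assumes the row DataFrame has columns named `patient_id` and `time`, as described above; the helper name `to_sparse_frame` is hypothetical.

```python
import pandas as pd
from scipy import sparse


def to_sparse_frame(matrix, feature_names, row_info):
    """Build a sparse DataFrame (Option A) from the Option B tuple."""
    if matrix.shape != (len(row_info), len(feature_names)):
        raise ValueError("Matrix shape does not match row info and feature names")

    # (patient_id, time) becomes the MultiIndex of the resulting frame.
    index = pd.MultiIndex.from_frame(row_info[["patient_id", "time"]])
    return pd.DataFrame.sparse.from_spmatrix(matrix, index=index, columns=feature_names)
```

Going the other way, `df.sparse.to_coo()` returns a SciPy COO matrix that can be converted back to CSR with `.tocsr()`.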

Implementation Notes

  • Use an efficient sparse matrix format (such as CSR) to handle large, sparse feature sets
  • Implement robust error handling for missing or corrupt files
  • Consider adding validation for window sizes and aggregation types
  • Include progress indicators (for example, tqdm.auto.tqdm) for long-running operations, since this will be used in a notebook setting
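
The validation, error-handling, and progress notes could be combined roughly as follows. This is a sketch under assumptions: the helper names `validate_selection` and `load_blocks` are hypothetical, and it assumes the blocks are stored as SciPy `.npz` files readable by `scipy.sparse.load_npz`, which the issue does not state.

```python
import zipfile
from pathlib import Path

from scipy import sparse
from tqdm.auto import tqdm


def validate_selection(requested, available, kind):
    """Raise a clear error if any requested window/aggregation is unavailable."""
    unknown = sorted(set(requested) - set(available))
    if unknown:
        raise ValueError(f"Unknown {kind}(s): {unknown}. Available: {sorted(available)}")


def load_blocks(paths):
    """Load sparse blocks with a progress bar, reporting which file failed."""
    loaded = {}
    # tqdm.auto picks a notebook widget when available, else a text bar.
    for fp in tqdm(list(paths), desc="Loading sparse blocks"):
        try:
            loaded[fp] = sparse.load_npz(fp)
        except (OSError, ValueError, EOFError, zipfile.BadZipFile) as err:
            # Missing files raise FileNotFoundError (an OSError); the other
            # exception types cover unreadable or malformed archives.
            raise RuntimeError(f"Could not load sparse block from {fp}") from err
    return loaded
```

Re-raising with the file path attached makes it easier to spot which shard is missing or corrupt when many files are loaded in one call.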

Examples

# Get available options
windows, aggs = get_available_windows_and_aggs(data_dir)
print(f"Available windows: {windows}")
print(f"Available aggregations: {aggs}")

# Load data (DataFrame option)
df = load_data(
    data_dir,
    windows=["1h", "4h", "24h"],
    aggs=["mean", "max", "count"],
    metadata_fp=metada
@teyaberg teyaberg self-assigned this Nov 14, 2024

2 participants