Add filtering functionality to bids2table #6

Closed
wants to merge 4 commits into from
40 changes: 36 additions & 4 deletions bids2table/_bids2table.py
@@ -1,13 +1,14 @@
import logging
from pathlib import Path
-from typing import Optional
+from typing import Any, Dict, Optional

import pandas as pd
from elbow.builders import build_parquet, build_table
from elbow.sources.filesystem import Crawler
from elbow.typing import StrOrPath
from elbow.utils import setup_logging

+from bids2table import exceptions
from bids2table.extractors.bids import extract_bids_subdir
from bids2table.helpers import flat_to_multi_columns

@@ -25,6 +26,7 @@ def bids2table(
worker_id: Optional[int] = None,
max_failures: Optional[int] = 0,
return_df: bool = True,
+filters: Optional[Dict[str, Any]] = None,
) -> Optional[pd.DataFrame]:
"""
Index a BIDS dataset directory and load as a pandas DataFrame.
@@ -44,6 +46,8 @@ def bids2table(
overwrite.
max_failures: number of extract failures to tolerate.
return_df: whether to return the dataframe or just build the persistent index.
+filters: optional dictionary of filters to apply to the index. Keys are
+column names and values are values or lists of values to keep.

Returns:
A DataFrame containing the BIDS Index.
@@ -75,7 +79,7 @@ def bids2table(
else:
logging.info("Found cached index %s; nothing to do", output)
df = None
-return df
+return _filter(df, filters)

if not persistent:
logging.info("Building index in memory")
@@ -85,7 +89,7 @@ def bids2table(
max_failures=max_failures,
)
df = flat_to_multi_columns(df)
-return df
+return _filter(df, filters)

logging.info("Building persistent Parquet index")
build_parquet(
@@ -99,7 +103,7 @@ def bids2table(
max_failures=max_failures,
)
df = load_index(output) if return_df else None
-return df
+return _filter(df, filters)


def load_index(
@@ -112,3 +116,31 @@ def load_index(
if split_columns:
df = flat_to_multi_columns(df, sep=sep)
return df


+def _filter(df: pd.DataFrame, filters: Optional[Dict[str, Any]]) -> pd.DataFrame:
+    """
+    Filter a pandas DataFrame based on a dictionary of filters.
+
+    Args:
+        df: The bids2table DataFrame to filter.
+        filters: A dictionary of filters to apply to the DataFrame. Format must be
+            either a single value or a list of values. If None, does not filter.
+
+    Returns:
+        pd.DataFrame: The filtered DataFrame.
+    """
+    if filters is None:
+        return df
+
+    for key, value in filters.items():
+        if not isinstance(value, list):
+            value = [value]
+        try:
+            df = df[df["entities"][key].isin(value)]
+        except KeyError as exc_info:
+            raise exceptions.InvalidFilterError(
Collaborator

I don't think I would catch and re-raise a custom exception. I would just let the KeyError be raised. It's just as informative, I think, and less code.

Contributor Author

@ReinderVosDeWael, Jul 11, 2023

For a developer like you or me, you're right. However, my idea here is to raise an error that is informative to someone who has never opened the source code. Consider someone who calls bids2table(..., filters={...}). If they get a KeyError at line 140, at df = df[df["entities"][key].isin(value)], they have to go through the stack trace and work out that this key variable refers to one of their filters. That's a commitment beyond many end users. If we're lucky, they open an issue for it; if we're unlucky, they move to a different package. If they get an InvalidFilterError instead, they barely have to glance at the error description to know that it's a user error.

Collaborator

Fair point. The default KeyError message is very minimal and would probably be confusing. How about we re-raise KeyError with a more informative message? In general, I prefer not to use custom exceptions unless there's no good fit among the built-in exceptions or I need to handle the exception specially.

+                f"Invalid filter: {key} is not a valid column."
+            ) from exc_info
+
+    return df
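
The collaborator's suggestion in the thread above, re-raising KeyError with a more informative message, could look roughly like the sketch below. It is not code from this PR: the name filter_entities, the None pass-through (added because bids2table can return no DataFrame when return_df is False), and the toy DataFrame are assumptions made for illustration.

```python
from typing import Any, Dict, Optional

import pandas as pd


def filter_entities(
    df: Optional[pd.DataFrame], filters: Optional[Dict[str, Any]]
) -> Optional[pd.DataFrame]:
    """Keep rows whose entity columns match the given values (hypothetical helper)."""
    # Assumption: pass None through untouched instead of failing on it.
    if df is None or filters is None:
        return df

    for key, value in filters.items():
        # Normalise a single value to a one-element list for isin().
        if not isinstance(value, list):
            value = [value]
        try:
            df = df[df["entities"][key].isin(value)]
        except KeyError as exc:
            # Same built-in exception type, but the message names the bad filter.
            raise KeyError(
                f"Invalid filter: {key!r} is not a valid entity column."
            ) from exc
    return df


# Toy table with two-level columns, standing in for a real index.
toy = pd.DataFrame(
    {
        ("entities", "sub"): ["01", "01", "02"],
        ("entities", "task"): ["rest", "nback", "rest"],
        ("file", "path"): ["a.nii.gz", "b.nii.gz", "c.nii.gz"],
    }
)

rest_only = filter_entities(toy, {"task": "rest"})  # single value
subset = filter_entities(toy, {"sub": ["01"], "task": ["rest", "nback"]})  # lists
```

One caveat with this approach: Python displays a KeyError using the repr of its argument, so the message appears wrapped in quotes in the traceback. The plain text is still available as the first element of the exception's args.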
3 changes: 3 additions & 0 deletions bids2table/exceptions.py
@@ -0,0 +1,3 @@
+class InvalidFilterError(Exception):
+    """Raised when a filter is invalid."""
+    pass
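
A possible middle ground between the two positions in the review thread, not proposed by either participant, is to derive the custom exception from KeyError. End users would see a descriptive exception name, while callers could still catch the built-in type. The sketch below is hypothetical: check_filter_keys and the __str__ override are illustrative assumptions, not part of this PR.

```python
from typing import Any, Dict, Iterable


class InvalidFilterError(KeyError):
    """Raised when a filter key does not name a known column."""

    def __str__(self) -> str:
        # KeyError shows repr() of its argument by default; return the plain message.
        return str(self.args[0]) if self.args else ""


def check_filter_keys(filters: Dict[str, Any], valid_columns: Iterable[str]) -> None:
    """Hypothetical helper: reject filter keys that are not known columns."""
    valid = set(valid_columns)
    for key in filters:
        if key not in valid:
            raise InvalidFilterError(
                f"Invalid filter: {key!r} is not a valid column."
            )


# Callers that only know about built-in exceptions can still catch it.
try:
    check_filter_keys({"session": "1"}, ["sub", "task"])
except KeyError as err:
    message = str(err)
```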
2 changes: 1 addition & 1 deletion example/bids-examples
Submodule bids-examples updated 188 files