New feature to filter datasets by column and its corresponding values #52
What existing problem does the pull request solve?
The pull request introduces a filtering feature to the compile function of the Smart Drift object. Previously, users experienced cluttered visualizations due to the inclusion of all data points. This update allows users to apply filters to the dataset to refine the visualizations and focus on the most relevant data, leading to clearer insights and a more streamlined analysis process.
Test Plan
An example of compiling a dataset with filters is shown below. It reduces a dataset of 300+ countries to 6 countries.
sd.compile(
    full_validation=True,  # Optional: to save time, leave the default False value. If True, analyze consistency on modalities between columns.
    date_compile_auc="01/01/2022",  # Optional: useful when computing the drift for a time that is not now
    datadrift_file="datadrift_auc.csv",  # Optional: name of the csv file that contains the performance history of data drift
    filter_column="name",  # Optional: name of the column you wish to filter
    filter_values=["France", "Ottomans", "Austria", "Poland", "Brandenburg", "Bohemia"],  # Optional: values from the column chosen above that you wish to filter
)
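The effect of filter_column and filter_values can be illustrated with plain pandas. This is a minimal sketch, not the code from this pull request: it assumes the filter keeps only the rows whose value in the chosen column appears in the given list, and that omitting either argument leaves the data untouched. The helper name `filter_by_column` and the sample data are hypothetical.

```python
from typing import Optional, Sequence

import pandas as pd


def filter_by_column(
    df: pd.DataFrame,
    filter_column: Optional[str] = None,
    filter_values: Optional[Sequence] = None,
) -> pd.DataFrame:
    """Hypothetical helper: keep rows whose `filter_column` value is in `filter_values`."""
    # Assumption: with no filter configured, the dataset is returned unchanged.
    if filter_column is None or filter_values is None:
        return df
    if filter_column not in df.columns:
        raise KeyError(f"Column {filter_column!r} not found in dataset")
    # Boolean mask: True for rows whose value is one of the requested values.
    mask = df[filter_column].isin(list(filter_values))
    return df[mask].reset_index(drop=True)


# Made-up sample data, for illustration only.
df_sample = pd.DataFrame(
    {
        "name": ["France", "Poland", "Spain", "Austria"],
        "value": [1, 2, 3, 4],
    }
)
filtered = filter_by_column(df_sample, "name", ["France", "Austria"])
```

Under these assumptions, the same call would be applied to both datasets being compared, so that the drift analysis and the visualizations only see the selected values.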
Description
This change addresses cluttered visualizations (Issue #51) by letting users focus on specific column values in the dataset.
Type of Change
New feature (non-breaking change which adds functionality)
How Has This Been Tested?
The new feature was tested with several datasets of varying sizes and complexities. Filters were applied to exclude specific ranges, outliers, and categories. The filtered data produced clearer visualizations that matched expected outcomes.
Test Configuration:
OS: Windows
Python version: [e.g., 3.9]