improved autofix strategy #148

Open
wants to merge 34 commits into
base: main

Conversation

@aditya1503 aditya1503 commented Nov 16, 2023

Skeleton code for improved Auto-Fix strategies

import os

import pandas as pd
from cleanlab_studio import Studio

API_KEY = os.environ['CLEANLAB_API_KEY']
studio = Studio(API_KEY)
df = pd.DataFrame(...)
dataset_id = studio.upload_dataset(df)
project_id = studio.create_project(dataset_id=dataset_id, ...)
cleanset_id = studio.get_latest_cleanset_id(project_id)


# Beginner user:
new_df = studio.autofix_dataset(df, cleanset_id)  # new_df is a deep copy of df


# Advanced user pattern:
# hyperparam_dict holds integer values: the number of data points to fix/exclude for each issue type
hyperparam_dict = get_autofix_defaults(cleanset_id)
# A user who wants to edit less data manually lowers the integers in hyperparam_dict
new_df = studio.autofix_dataset(df, cleanset_id, params=hyperparam_dict)
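
To make the advanced pattern concrete, here is a minimal sketch of how a user might lower the counts before passing the dictionary back in. The helper `scale_autofix_counts` and the dictionary keys and values are illustrative assumptions, not part of this PR; real keys and values would come from `get_autofix_defaults`.

```python
from typing import Dict


def scale_autofix_counts(params: Dict[str, int], factor: float) -> Dict[str, int]:
    """Return a copy of params with every count scaled by factor (0 <= factor <= 1)."""
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must be between 0 and 1")
    # int() truncates, so a scaled count never exceeds the original count.
    return {issue: int(count * factor) for issue, count in params.items()}


# Hypothetical defaults standing in for the output of get_autofix_defaults.
hyperparam_dict = {"drop_ambiguous": 9, "drop_outlier": 40, "relabel_label_issue": 25}

# Edit roughly half as much data as the defaults would.
conservative = scale_autofix_counts(hyperparam_dict, 0.5)
print(conservative)
```

The scaled dictionary would then be passed as `params=conservative`; the original dictionary is left untouched so the defaults can still be inspected.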

Link to Notion: https://www.notion.so/cleanlab/Improve-ML-accuracy-with-Studio-via-better-Autofix-99434fa92a164131b3860093d85e5350?pvs=4

Note: this is only for text/tabular datasets, not image.

@jwmueller jwmueller marked this pull request as draft November 16, 2023 22:26
@jwmueller
Member

request my review when this is ready

@jwmueller
Member

Add a short script, as a PR comment, showing how the user will use this.

cleanset_df: pd.DataFrame, name_col: str, num_rows: int, asc=True
) -> List[str]:
"""
Extracts the top specified number of rows based on a specified score column from a DataFrame.
Member

This will only return the IDs of datapoints to drop for a given setting of the num_rows to drop during autofix

Parameters:
- cleanset_df (pd.DataFrame): The input DataFrame containing the cleanset.
- name_col (str): The name of the column indicating the category for which the top rows should be extracted.
- num_rows (int): The number of rows to be extracted.
Member

In autofix, we can get this count by multiplying the cleanset's default fraction for the issue type by the number of datapoints.

Author

Right. When we spoke originally, we wanted this call to be similar to the Studio web interface call, so I rewrote it this way; it took a floating-point percentage before.
The function _get_autofix_defaults does the multiplication by the number of datapoints.
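
The two steps discussed in this thread (turning a fraction into a row count, then picking that many IDs by score) can be sketched as follows. This is an illustration under stated assumptions, not the code in the PR: plain dicts stand in for rows of `cleanset_df` so the sketch has no dependencies, and the function names, the `id` key, and the `outlier_score` column are hypothetical. The thread does not say whether the multiplier is the full dataset size or the number of rows flagged for an issue, so the sketch takes a per-issue count and either reading fits.

```python
from typing import Dict, List


def fractions_to_counts(fractions: Dict[str, float], num_datapoints: Dict[str, int]) -> Dict[str, int]:
    """Convert per-issue fractions into integer row counts (truncating)."""
    return {issue: int(frac * num_datapoints[issue]) for issue, frac in fractions.items()}


def top_ids(rows: List[dict], score_col: str, num_rows: int, asc: bool = True) -> List[str]:
    """Return only the IDs of the num_rows rows ranked first by score_col."""
    ranked = sorted(rows, key=lambda row: row[score_col], reverse=not asc)
    return [row["id"] for row in ranked[:num_rows]]


# Hypothetical cleanset rows.
rows = [
    {"id": "a", "outlier_score": 0.9},
    {"id": "b", "outlier_score": 0.1},
    {"id": "c", "outlier_score": 0.5},
    {"id": "d", "outlier_score": 0.3},
]

counts = fractions_to_counts({"outlier": 0.5}, {"outlier": len(rows)})
ids_to_drop = top_ids(rows, "outlier_score", counts["outlier"], asc=True)
print(counts, ids_to_drop)
```

Keeping the conversion separate from the selection means the public call can accept integers, as in the web interface, while the defaults stay expressed as fractions.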

}


def _get_autofix_defaults(cleanset_df: pd.DataFrame) -> dict:
Member

TODO: Studio team should move this function to backend of the app so it happens on server (eventually should be used in web app too)

return default_values


def _get_top_fraction_ids(
Member

TODO: Studio team should move this function to backend of the app so it happens on server (eventually should be used in web app too)

Comment on lines 413 to 416
cleanset_df = self.download_cleanlab_columns(cleanset_id)
if params is None:
params = _get_autofix_defaults(cleanset_df)
print("Using autofix parameters:", params)
Member

todo: replace this code once an analogous method exists in Studio backend

@jwmueller
Member

from anish: Would you want this to be:

studio.autofix_dataset(cleanset_id)
new_df = studio.apply_corrections(df, cleanset_id)

"label_issue": 0.5,
"near_duplicate": 0.2,
"outlier": 0.5,
"confidence_threshold": 0.95,
Member

change to: "relabel_confidence_threshold"

@@ -63,3 +64,131 @@ def check_none(x: Any) -> bool:

def check_not_none(x: Any) -> bool:
    return not check_none(x)


def _get_autofix_default_params() -> dict: # Studio team port to backend
Member

@jwmueller jwmueller Nov 28, 2023

allow string options (rethink names):

"optimized_training_data"

"drop_all_issues"

"suggested_actions"

Member

@jwmueller jwmueller Nov 28, 2023

The drop_all_issues strategy should just drop all datapoints with issues from the DataFrame (no relabeling).

Member

The suggested_actions strategy should relabel all label issues and drop outliers and extra copies of near duplicates. Nothing is done to ambiguous examples or other issues.
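
The behavior requested for these two strategies can be sketched as below. This is a rough illustration, not the PR's implementation: plain dicts stand in for DataFrame rows, and the column names (`is_label_issue`, `suggested_label`, `is_outlier`, `is_near_duplicate`, `near_duplicate_cluster_id`, `is_ambiguous`) are assumptions about what the cleanset provides. "Extra copies" is read here as every row after the first one seen in a near-duplicate cluster.

```python
from typing import List

# Assumed issue-flag columns; the real cleanset may name or scope these differently.
ISSUE_FLAGS = ("is_label_issue", "is_outlier", "is_near_duplicate", "is_ambiguous")


def drop_all_issues(rows: List[dict]) -> List[dict]:
    """Drop every row flagged with any issue; nothing is relabeled."""
    return [dict(row) for row in rows if not any(row.get(flag) for flag in ISSUE_FLAGS)]


def suggested_actions(rows: List[dict]) -> List[dict]:
    """Relabel label issues, drop outliers, drop extra near-duplicate copies."""
    seen_clusters = set()
    fixed = []
    for row in rows:
        if row.get("is_outlier"):
            continue  # outliers are dropped
        if row.get("is_near_duplicate"):
            cluster = row["near_duplicate_cluster_id"]
            if cluster in seen_clusters:
                continue  # extra copy of an already kept near duplicate
            seen_clusters.add(cluster)
        row = dict(row)  # copy so the input rows are never mutated
        if row.get("is_label_issue"):
            row["label"] = row["suggested_label"]
        fixed.append(row)  # ambiguous rows and other issues pass through unchanged
    return fixed


rows = [
    {"id": 1, "label": "cat"},
    {"id": 2, "label": "cat", "is_label_issue": True, "suggested_label": "dog"},
    {"id": 3, "label": "dog", "is_outlier": True},
    {"id": 4, "label": "dog", "is_near_duplicate": True, "near_duplicate_cluster_id": 0},
    {"id": 5, "label": "dog", "is_near_duplicate": True, "near_duplicate_cluster_id": 0},
    {"id": 6, "label": "cat", "is_ambiguous": True},
]

print([row["id"] for row in drop_all_issues(rows)])
print([(row["id"], row["label"]) for row in suggested_actions(rows)])
```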

@@ -383,3 +386,32 @@ def poll_cleanset_status(self, cleanset_id: str, timeout: Optional[int] = None)

except (TimeoutError, CleansetError):
    return False

def autofix_dataset(
Member

Allow string options to be passed straight through into _get_autofix_default_params().

Author

Should be added now, clarified in the docs:

params (dict, optional): Default parameter dictionary containing confidence threshold for auto-relabelling, and
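
One way the pass-through requested above could look is sketched here: `params` may be omitted, a strategy name, or an explicit dictionary. The function name `resolve_autofix_params`, the choice of `optimized_training_data` as the fallback, and every value except the first entry (which copies the defaults quoted earlier in this review, with the key renamed as requested) are assumptions made for illustration.

```python
from typing import Union

# Strategy names come from the review; all values besides the first entry are hypothetical.
AUTOFIX_DEFAULTS = {
    "optimized_training_data": {
        "label_issue": 0.5,
        "near_duplicate": 0.2,
        "outlier": 0.5,
        "relabel_confidence_threshold": 0.95,
    },
    "drop_all_issues": {
        "label_issue": 1.0,
        "near_duplicate": 1.0,
        "outlier": 1.0,
        "relabel_confidence_threshold": 1.0,
    },
    "suggested_actions": {
        "label_issue": 1.0,
        "near_duplicate": 1.0,
        "outlier": 1.0,
        "relabel_confidence_threshold": 0.0,
    },
}


def resolve_autofix_params(params: Union[None, str, dict] = None) -> dict:
    """Turn None, a strategy name, or an explicit dict into a params dict."""
    if params is None:
        params = "optimized_training_data"  # assumed fallback strategy
    if isinstance(params, str):
        if params not in AUTOFIX_DEFAULTS:
            raise ValueError(
                f"Unknown autofix strategy {params!r}; expected one of {sorted(AUTOFIX_DEFAULTS)}"
            )
        return dict(AUTOFIX_DEFAULTS[params])  # copy so callers cannot edit the defaults
    return dict(params)


print(resolve_autofix_params("drop_all_issues"))
```

Resolving the argument in one place keeps `autofix_dataset` itself free of type checks and gives a single function to move to the backend later.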

self, original_df: pd.DataFrame, cleanset_id: str, params: dict = None
) -> pd.DataFrame:
"""
This method returns the auto-fixed dataset.
Member

@jwmueller jwmueller Nov 28, 2023

Docstring should clarify that Dataset must be a DataFrame (text or tabular dataset only)


def _get_autofix_default_params() -> dict:  # Studio team port to backend
    """returns default percentage-wise params of autofix"""
    return {
Member

can choose more specific key names here


Example:
{
'drop_ambiguous': 9,
Member

change to fractions

@sanjanag sanjanag marked this pull request as ready for review December 6, 2023 12:57
Comment on lines 225 to 227
def get_autofix_defaults_for_strategy(strategy):
    return AUTOFIX_DEFAULTS[strategy]

Member

make everything that should be ported to backend a private method

@@ -198,6 +221,130 @@ def check_not_none(x: Any) -> bool:
    return not check_none(x)


# Studio team port to backend
def get_autofix_defaults_for_strategy(strategy):
Member

Suggested change
- def get_autofix_defaults_for_strategy(strategy):
+ def _get_autofix_defaults_for_strategy(strategy):

dataset, cl_cols, id_col, label_col, keep_excluded
)
return corrected_ds
return snowflake_corrected_ds
Author

Could this be from the Studio team?
In that case we need to merge their main branch first.
@aditya1503 will have a look.

Member

Yes, we need to rebase everything here against the latest main branch.

3 participants