Currently, the data masking example at https://dlthub.com/docs/general-usage/customising-pipelines/pseudonymizing_columns is not very realistic, as it's unlikely anyone would want to use a function with hardcoded column names. Instead, a function that takes `columns` as a parameter could be shown. Such a function is more complicated than the dummy example, but it would help people who want to implement masking by showing a real-world example. This implementation also requires a closure, a software engineering concept that many non-professional software engineers might be unfamiliar with (so it could be hard for them to come up with the solution even with the help of a chatbot).
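For readers new to closures, the idea can be sketched in a few lines: an outer function receives the configuration and returns an inner function that remembers it. The names `make_masker` and `mask_row` are made up for this illustration and only handle plain dict rows; they are not part of dlt.

```python
from typing import Any, Callable

Row = dict[str, Any]


def make_masker(columns: list[str], mask: str = "******") -> Callable[[Row], Row]:
    """Return a one-argument function that masks `columns` in a dict row."""
    targets = set(columns)  # captured by the inner function below

    def mask_row(row: Row) -> Row:
        # `targets` and `mask` come from the enclosing call, not from arguments,
        # so the returned function keeps the one-argument shape a per-item hook expects.
        return {key: (mask if key in targets else value) for key, value in row.items()}

    return mask_row


mask_contact = make_masker(["email", "phone"])
masked = mask_contact({"id": 1, "email": "a@example.com", "phone": "555-0100"})
print(masked)
```

Each call to `make_masker` produces an independent function with its own column list, which is what makes the parameterized version possible without hardcoding names.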
Here's the function I wrote (note that it's only tested with `sql_database` and `sql_table` sources):
from collections.abc import Callable
from enum import StrEnum
from typing import Any

import pandas as pd
import pyarrow as pa


class MaskingMethod(StrEnum):
    MASK = "mask"
    NULLIFY = "nullify"


def mask_sql_db_columns(
    columns: list[str],
    method: MaskingMethod | None = None,
    mask: str = "******",
) -> Callable[
    [pa.Table | pd.DataFrame | dict[str, Any]],
    pa.Table | pd.DataFrame | dict[str, Any],
]:
    """Build a function that masks the specified columns in a SQL Database table.

    All backends supported by the sql_database source, as of version 1.4.1, are supported.
    See https://github.com/dlt-hub/dlt/blob/devel/dlt/sources/sql_database/helpers.py#L50

    Args:
        columns (list[str]): The list of columns to mask.
        method (MaskingMethod | None): How to mask. Defaults to MaskingMethod.MASK.
        mask (str): The replacement value used by MaskingMethod.MASK.

    Returns:
        Callable: A function that takes a table or row and returns it
            with the specified columns masked.
    """
    if method is None:
        method = MaskingMethod.MASK

    def mask_columns(
        table_or_row: pa.Table | pd.DataFrame | dict[str, Any],
    ) -> pa.Table | pd.DataFrame | dict[str, Any]:
        # Handle `pyarrow` and `connectorx` backends.
        if isinstance(table_or_row, pa.Table):
            table = table_or_row
            for col in table.schema.names:
                if col in columns:
                    if method == MaskingMethod.MASK:
                        replace_with = pa.array([mask] * table.num_rows)
                    elif method == MaskingMethod.NULLIFY:
                        replace_with = pa.nulls(
                            table.num_rows, type=table.schema.field(col).type
                        )
                    table = table.set_column(
                        table.schema.get_field_index(col),
                        col,
                        replace_with,
                    )
            return table

        # Handle `pandas` backend.
        if isinstance(table_or_row, pd.DataFrame):
            table = table_or_row
            for col in table.columns:
                if col in columns:
                    if method == MaskingMethod.MASK:
                        table[col] = mask
                    elif method == MaskingMethod.NULLIFY:
                        table[col] = None
            return table

        # Handle `sqlalchemy` backend.
        if isinstance(table_or_row, dict):
            row = table_or_row
            for col in row:
                if col in columns:
                    if method == MaskingMethod.MASK:
                        row[col] = mask
                    elif method == MaskingMethod.NULLIFY:
                        row[col] = None
            return row

        # Handle unsupported table types.
        msg = f"Unsupported table type: {type(table_or_row)}. Supported backends: (pyarrow, connectorx, pandas, sqlalchemy)."
        raise NotImplementedError(msg)

    # Return the closure so it can be applied to each table or row.
    return mask_columns
Then, it can be used like so:
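A minimal sketch of the usage, under stated assumptions: dlt resources expose `add_map`, which applies a one-argument callable to every item, so the callable returned by `mask_sql_db_columns` could be attached there. The block below imitates that step without a database, using Python's built-in `map` and a dict-only stand-in (`mask_dict_rows`, a hypothetical name) for the `sqlalchemy` backend case. The commented-out lines show the assumed dlt call and are not executed.

```python
from enum import Enum
from typing import Any, Callable

Row = dict[str, Any]


class MaskingMethod(str, Enum):  # stand-in for the StrEnum in the function above
    MASK = "mask"
    NULLIFY = "nullify"


def mask_dict_rows(
    columns: list[str],
    method: MaskingMethod = MaskingMethod.MASK,
    mask: str = "******",
) -> Callable[[Row], Row]:
    """Hypothetical dict-only stand-in for `mask_sql_db_columns`."""
    replacement = mask if method == MaskingMethod.MASK else None

    def mask_columns(row: Row) -> Row:
        return {k: (replacement if k in columns else v) for k, v in row.items()}

    return mask_columns


# Rows shaped like the ones a `sqlalchemy`-backed resource would yield.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]

# Plain-Python imitation of applying the callable to each item.
masked_rows = list(map(mask_dict_rows(["email"], method=MaskingMethod.NULLIFY), rows))
print(masked_rows)

# Assumed dlt equivalent (table name and config are placeholders):
#   from dlt.sources.sql_database import sql_table
#   users = sql_table(table="users", backend="sqlalchemy")
#   users.add_map(mask_sql_db_columns(["email"], method=MaskingMethod.NULLIFY))
```

The factory is called once with the configuration, and only its return value is handed to the per-item hook.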
Are you a dlt user?
Yes, I'm already a dlt user.