
[FEATURE]: Add context info to the output #55

Open
1 task
mwojtyczka opened this issue Sep 17, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@mwojtyczka
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

Currently we only report warnings and errors. There is no information about when a check was performed or in which pipeline it ran.

Proposed Solution

Add a _context column to the output containing the timestamp and the workflow it was run in.

Additional Context

No response

@mwojtyczka mwojtyczka added the enhancement New feature or request label Sep 17, 2024
@pierre-monnet
Contributor

pierre-monnet commented Jan 29, 2025

@mwojtyczka
What do you think about replacing the current structure of _warning and _error?
FROM

{
  "col_col1_is_null": "Column col1 is null",
  "col_col2_is_null": "Column col2 is null"
}

TO

{
  "col_col1_is_null": {
    "rule": "is_not_null",
    "col": "col1",
    "filter": "col2<3",
    "message": "col1 is not null",
    "execution_date": "2025-01-29 13:37:00",
    "workflow_name": "my_workflow"
  },
  "col_col2_is_null": {
    "rule": "is_not_null",
    "col": "col2",
    "message": "col2 is not null",
    "execution_date": "2025-01-29 13:37:00",
    "workflow_name": "my_workflow"
  }
}
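The entries in the structure above could be assembled as plain dicts. A minimal sketch, assuming a hypothetical helper (make_check_result is not part of DQX; the field names follow the proposal above):

```python
from datetime import datetime

def make_check_result(rule, col, message, workflow_name, filter_expr=None):
    """Build one entry of the proposed _errors/_warnings structure.

    Hypothetical helper for illustration only; field names mirror the
    structure proposed above.
    """
    entry = {
        "rule": rule,
        "col": col,
        "message": message,
        # Captured at check time; formatted like the example above.
        "execution_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "workflow_name": workflow_name,
    }
    if filter_expr is not None:
        entry["filter"] = filter_expr
    return entry

result = {
    "col_col1_is_null": make_check_result(
        "is_not_null", "col1", "col1 is not null", "my_workflow",
        filter_expr="col2<3",
    ),
}
print(result["col_col1_is_null"]["rule"])  # is_not_null
```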

@mwojtyczka
Contributor Author

mwojtyczka commented Jan 29, 2025

Thanks for the proposal. It looks good with minor changes. I agree, reusing the existing reporting columns is more elegant. The only disadvantage is that the execution time will be the same for every check on a row, so the info is potentially repeated.

I would drop the workflow name and run_id, as this info will be available in the Unity Catalog lineage when the dataframe is saved, along with info about the input datasets, so we can skip these. Including more contextual info would also require more knowledge about the environment. It could potentially be retrieved, but would rather belong in a separate table.

For the time we should use ISO format. We can use a shorter name, like "run_time", to save some bytes.

{
  "col_col1_is_null": {
    "rule": "is_not_null",
    "col": "col1",
    "filter": "col2<3",
    "message": "col1 is not null",
    "run_time": "2025-01-29T13:37:00"
  },
  "col_col2_is_null": {
    "rule": "is_not_null",
    "col": "col2",
    "message": "col2 is not null",
    "run_time": "2025-01-29T13:37:00"
  }
}
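A short sketch of producing the ISO-formatted run_time: capturing the timestamp once per run and serializing it with the stdlib gives exactly the format shown above, and reusing the single value keeps all entries of one invocation consistent.

```python
from datetime import datetime

# Capture once per run so every check result shares the same run_time.
run_time = datetime(2025, 1, 29, 13, 37, 0)  # in practice: datetime.now()

# Naive datetime -> ISO-8601 without a UTC offset, as in the example above.
iso = run_time.isoformat()
print(iso)  # 2025-01-29T13:37:00
```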

@gergo-databricks

Question: How would we create this "filter" string? str(DQRule.check)?

I recommend using a timestamp for the time value. Most users of this library will store this data in a table either in this structure or even flatten it, so it's better to store it properly. This would mean converting the current Map to a Struct type.

I propose also giving some flexibility here and creating an option for users to add their metadata.
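The flattening and user-metadata ideas above can be sketched together. A minimal stdlib-only illustration, assuming the proposed per-check structure (the flatten helper and the "team" metadata key are hypothetical, not part of DQX):

```python
# The per-check results in the structure proposed earlier in the thread.
errors = {
    "col_col1_is_null": {"rule": "is_not_null", "col": "col1",
                         "message": "col1 is not null",
                         "run_time": "2025-01-29T13:37:00"},
    "col_col2_is_null": {"rule": "is_not_null", "col": "col2",
                         "message": "col2 is not null",
                         "run_time": "2025-01-29T13:37:00"},
}

def flatten(check_results, extra_metadata=None):
    """One flat dict per check, ready to store as table rows.

    extra_metadata lets users merge in their own fields, per the
    flexibility suggested above. Hypothetical helper for illustration.
    """
    rows = []
    for name, fields in check_results.items():
        rows.append({"check_name": name, **fields, **(extra_metadata or {})})
    return rows

rows = flatten(errors, extra_metadata={"team": "data-eng"})
```

Storing the results as a struct (rather than a map of strings) is what makes this kind of flattening and typed storage straightforward downstream.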
