
[FEATURE]: Add context info to the output #55

Open
1 task
mwojtyczka opened this issue Sep 17, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@mwojtyczka
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

Currently we only report warnings and errors. There is no information about when a check was performed or in which pipeline it ran.

Proposed Solution

Add a _context column to the output containing the timestamp and the workflow it was run in.

Additional Context

No response

@mwojtyczka mwojtyczka added the enhancement New feature or request label Sep 17, 2024
@pierre-monnet
Contributor

pierre-monnet commented Jan 29, 2025

@mwojtyczka
What do you think about replacing the current structure of _warning and _error?
FROM

{
  "col_col1_is_null": "Column col1 is null",
  "col_col2_is_null": "Column col2 is null"
}

TO

{
  "col_col1_is_null": {
    "rule": "is_not_null",
    "col": "col1",
    "filter": "col2<3",
    "message": "col1 is not null",
    "execution_date": "2025-01-29 13:37:00",
    "workflow_name": "my_workflow"
  },
  "col_col2_is_null": {
    "rule": "is_not_null",
    "col": "col2",
    "message": "col2 is not null",
    "execution_date": "2025-01-29 13:37:00",
    "workflow_name": "my_workflow"
  }
}
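The entries in the structure above could be assembled as plain dicts. A minimal sketch, assuming a hypothetical helper (make_check_result is not part of DQX; the field names follow the proposal above):

```python
from datetime import datetime

def make_check_result(rule, col, message, workflow_name, filter_expr=None):
    """Build one entry of the proposed _errors/_warnings structure.

    Hypothetical helper for illustration only; field names mirror the
    structure proposed above.
    """
    entry = {
        "rule": rule,
        "col": col,
        "message": message,
        # Captured at check time; formatted like the example above.
        "execution_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "workflow_name": workflow_name,
    }
    if filter_expr is not None:
        entry["filter"] = filter_expr
    return entry

result = {
    "col_col1_is_null": make_check_result(
        "is_not_null", "col1", "col1 is not null", "my_workflow",
        filter_expr="col2<3",
    ),
}
print(result["col_col1_is_null"]["rule"])  # is_not_null
```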

@mwojtyczka
Contributor Author

mwojtyczka commented Jan 29, 2025

Thanks for the proposal. It looks good with minor changes. I agree, reusing the existing reporting columns is more elegant. The only disadvantage is that the execution time will be the same for every check on a row, so the info is potentially repeated.

I would drop the workflow name and run_id, as this info will be available in the Unity Catalog lineage when the dataframe is saved, along with info about the input datasets, so we can skip these. Including more contextual info would also require more knowledge about the environment. It could potentially be retrieved, but would rather belong in a separate table.

For the time we should use ISO format. We can use a shorter name, like "run_time", to save some bytes.

{
  "col_col1_is_null": {
    "rule": "is_not_null",
    "col": "col1",
    "filter": "col2<3",
    "message": "col1 is not null",
    "run_time": "2025-01-29T13:37:00"
  },
  "col_col2_is_null": {
    "rule": "is_not_null",
    "col": "col2",
    "message": "col2 is not null",
    "run_time": "2025-01-29T13:37:00"
  }
}
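A short sketch of producing the ISO-formatted run_time: capturing the timestamp once per run and serializing it with the stdlib gives exactly the format shown above, and reusing the single value keeps all entries of one invocation consistent.

```python
from datetime import datetime

# Capture once per run so every check result shares the same run_time.
run_time = datetime(2025, 1, 29, 13, 37, 0)  # in practice: datetime.now()

# Naive datetime -> ISO-8601 without a UTC offset, as in the example above.
iso = run_time.isoformat()
print(iso)  # 2025-01-29T13:37:00
```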

@gergo-databricks

Question: How would we create this "filter" string? str(DQRule.check)?

I recommend using a timestamp for the time value. Most users of this library will store this data in a table either in this structure or even flatten it, so it's better to store it properly. This would mean converting the current Map to a Struct type.

I propose also giving some flexibility here and creating an option for users to add their metadata.
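The flattening and user-metadata ideas above can be sketched together. A minimal stdlib-only illustration, assuming the proposed per-check structure (the flatten helper and the "team" metadata key are hypothetical, not part of DQX):

```python
# The per-check results in the structure proposed earlier in the thread.
errors = {
    "col_col1_is_null": {"rule": "is_not_null", "col": "col1",
                         "message": "col1 is not null",
                         "run_time": "2025-01-29T13:37:00"},
    "col_col2_is_null": {"rule": "is_not_null", "col": "col2",
                         "message": "col2 is not null",
                         "run_time": "2025-01-29T13:37:00"},
}

def flatten(check_results, extra_metadata=None):
    """One flat dict per check, ready to store as table rows.

    extra_metadata lets users merge in their own fields, per the
    flexibility suggested above. Hypothetical helper for illustration.
    """
    rows = []
    for name, fields in check_results.items():
        rows.append({"check_name": name, **fields, **(extra_metadata or {})})
    return rows

rows = flatten(errors, extra_metadata={"team": "data-eng"})
```

Storing the results as a struct (rather than a map of strings) is what makes this kind of flattening and typed storage straightforward downstream.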
