[FEATURE]: Add context info to the output #55
Comments
@mwojtyczka
Thanks for the proposal. It looks good, with minor changes. I agree that reusing the existing reporting columns is more elegant. The only disadvantage is that the execution time will be the same for the same row, so the info is potentially repeated. I would drop the workflow name or run_id, since this info will be available in the Unity Catalog lineage once the DataFrame is saved, along with info about the input datasets, so we can skip these. Using more contextual info would also require more knowledge about the environment. This could potentially be retrieved, but it would rather require a separate table. For the time, we should use ISO format. We can use a shorter name, like "run_time", to save some bytes.
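To make the repetition concern concrete, here is a minimal sketch in plain Python. It assumes a hypothetical entry shape for the reporting columns (a dict with `name` and `message`), which is not taken from the library; `stamp_entries` is an illustrative helper, not an existing function.

```python
from datetime import datetime, timezone


def stamp_entries(entries, run_time):
    """Return copies of the reporting entries with an ISO 8601 `run_time` added.

    `entries` is a hypothetical list of dicts standing in for one row's
    reporting column; the real structure may differ.
    """
    iso = run_time.astimezone(timezone.utc).isoformat()
    # Every entry gets the same value, which is the repetition noted above.
    return [{**entry, "run_time": iso} for entry in entries]


run_time = datetime(2024, 1, 15, 10, 30, 0, tzinfo=timezone.utc)
errors = [
    {"name": "col_a_is_null", "message": "Column col_a is null"},
    {"name": "col_b_is_empty", "message": "Column col_b is empty"},
]
stamped = stamp_entries(errors, run_time)
```

Each entry in `stamped` carries the identical `run_time` string, so the cost is one short repeated field per reported check.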
Question: how would we create this "filter" string? str(DQRule.check)? I recommend using a timestamp for the time value. Most users of this library will store this data in a table, either in this structure or flattened, so it's better to store it with a proper type. This would mean converting the current Map to a Struct type. I also propose giving some flexibility here by adding an option for users to attach their own metadata.
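As a rough illustration of the Struct idea, the sketch below models one result entry as a typed record in plain Python rather than a string-to-string map. All names here (`CheckResult`, `make_result`, `user_metadata`) are assumptions for illustration, not the library's API, and the `filter` value is simply passed in by the caller, since how to derive it is still an open question.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass(frozen=True)
class CheckResult:
    """Hypothetical struct-like result entry with typed fields."""
    name: str
    message: str
    filter: Optional[str]          # supplied by the caller; derivation is undecided
    run_time: datetime             # a real timestamp, not a formatted string
    user_metadata: Dict[str, str] = field(default_factory=dict)


def make_result(name, message, run_time, filter_expr=None, user_metadata=None):
    """Build a CheckResult, rejecting naive timestamps to avoid ambiguity."""
    if run_time.tzinfo is None:
        raise ValueError("run_time must be timezone-aware")
    return CheckResult(
        name=name,
        message=message,
        filter=filter_expr,
        run_time=run_time,
        user_metadata=dict(user_metadata or {}),  # copy so callers can't mutate it
    )


result = make_result(
    "col_a_is_null",
    "Column col_a is null",
    datetime(2024, 1, 15, 10, 30, 0, tzinfo=timezone.utc),
    user_metadata={"team": "data-platform"},
)
```

Keeping `run_time` as a timestamp means it can be filtered and sorted natively once stored in a table, while `user_metadata` stays a free-form map for whatever users want to attach.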
Is there an existing issue for this?
Problem statement
Currently, we only report warnings and errors. The output does not record when a check was performed or in which pipeline.
Proposed Solution
Add a _context column to the output containing the timestamp and the workflow the check was run on.
Additional Context
No response