Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enforce schema of ProductionRuns DataFrame. #35

Merged
merged 11 commits into from
Feb 1, 2024
Merged

Conversation

daniel-k
Copy link
Member

@daniel-k daniel-k commented Jan 24, 2024

The ProductionRun dataclass contains some fields that in turn are dataclasses themselves. When such fields are turned into DataFrame columns, they are flattened such that every field contained in the sub-objects becomes a column in the final DataFrame. Since some of these fields are actually optional, also the field name itself will turn into a column which in practice always is None. This PR removes such useless columns from the DataFrame.

Moreover, if any of the optional dataclass fields is empty for all production runs, then their sub fields would not become part of the DataFrame. This can be annoying when building services on top of the SDK, because you cannot assume that a column would be part of the DataFrame, thus requiring checks everywhere. Therefore, we now enforce the schema of the returned DataFrame. This means, that irrespective of the underlying data, all nested fields will be always be present as flattened columns in the DataFrame.

The `ProductionRun` dataclass contains some fields that in turn are dataclasses
themselves. When such fields are turned into DataFrame columns, they are
flattened such that every field contained in the sub-objects becomes a column
in the final DataFrame. Since some of these fields are actually optional, also
the field name itself will turn into a column which in practice always is
`None`. This PR removes such useless columns from the DataFrame.
@denizs
Copy link
Member

denizs commented Jan 25, 2024

CI is failing, but I'm wondering whether this is an improvement. With this patch, we are introducing polymorphic responses for one appliance depending on the timeframe being queried. If you wanted to build an application on top of the SDK, the client / application would need to perform response shape validation on every response. We should rather keep the response schema and thus the data frame schema static.

@daniel-k
Copy link
Member Author

For reference: we have found this issue:

which makes clear that we need a stable DataFrame schema. I'll turn the PR into draft mode and will tackle it as well.

@daniel-k daniel-k marked this pull request as draft January 25, 2024 13:09
Copy link

github-actions bot commented Jan 25, 2024

Coverage results

Update on 2024-01-31 14:40:51.938011184 +0000

This is the coverage report for commit dd1d066

Name                                                                                Stmts   Miss  Cover   Missing
-----------------------------------------------------------------------------------------------------------------
.tox/py/lib/python3.12/site-packages/enlyze/__init__.py                                 4      0   100%
.tox/py/lib/python3.12/site-packages/enlyze/api_clients/base.py                        65      0   100%
.tox/py/lib/python3.12/site-packages/enlyze/api_clients/production_runs/client.py      20      0   100%
.tox/py/lib/python3.12/site-packages/enlyze/api_clients/production_runs/models.py      51      0   100%
.tox/py/lib/python3.12/site-packages/enlyze/api_clients/timeseries/client.py           19      0   100%
.tox/py/lib/python3.12/site-packages/enlyze/api_clients/timeseries/models.py           34      0   100%
.tox/py/lib/python3.12/site-packages/enlyze/auth.py                                    13      0   100%
.tox/py/lib/python3.12/site-packages/enlyze/client.py                                  86      0   100%
.tox/py/lib/python3.12/site-packages/enlyze/constants.py                                6      0   100%
.tox/py/lib/python3.12/site-packages/enlyze/errors.py                                   3      0   100%
.tox/py/lib/python3.12/site-packages/enlyze/models.py                                 110      0   100%
.tox/py/lib/python3.12/site-packages/enlyze/schema.py                                  26      0   100%
.tox/py/lib/python3.12/site-packages/enlyze/validators.py                              38      0   100%
-----------------------------------------------------------------------------------------------------------------
TOTAL                                                                                 475      0   100%

4 empty files skipped.

@daniel-k
Copy link
Member Author

@denizs Please have a look at a03b686. I haven't updated tests nor polished it, this is just a PoC how we can enforce the schema of our data frames. It's based on typing.get_type_hints(MyDataClass) and assembles a flat schema similar similar to pandas.json_normalize() but since it uses type information, it doesn't depend on every field permutation being present in the data.

@denizs
Copy link
Member

denizs commented Jan 30, 2024

@denizs Please have a look at a03b686. I haven't updated tests nor polished it, this is just a PoC how we can enforce the schema of our data frames. It's based on typing.get_type_hints(MyDataClass) and assembles a flat schema similar similar to pandas.json_normalize() but since it uses type information, it doesn't depend on every field permutation being present in the data.

Looks good! Just a nit: When I renamed level to path in my head, the code got easier to follow.

@daniel-k
Copy link
Member Author

When I renamed level to path in my head, the code got easier to follow.

Makes sense, will do. I only wanted to get early validation for the idea because it's quite some code / complexity we have to add. I'll polish everything and request review from you once it's done. Thanks :)

@daniel-k daniel-k changed the title Exclude always-NA columns in ProductionRuns DataFrame. Enfore schema of ProductionRuns DataFrame. Jan 31, 2024
@daniel-k daniel-k changed the title Enfore schema of ProductionRuns DataFrame. Enforce schema of ProductionRuns DataFrame. Jan 31, 2024
@daniel-k daniel-k marked this pull request as ready for review January 31, 2024 14:54
@daniel-k daniel-k requested review from denizs and clehensen January 31, 2024 14:54
Copy link
Contributor

@clehensen clehensen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks for picking it up so quickly.

When I renamed level to path in my head, the code got easier to follow.

Coming from the pandas side of things, level was actually more intuitive for me than path but both works 👍

src/enlyze/models.py Show resolved Hide resolved
@daniel-k daniel-k requested a review from clehensen February 1, 2024 09:28
Copy link
Contributor

@clehensen clehensen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

@daniel-k daniel-k merged commit 617ba2d into master Feb 1, 2024
9 checks passed
@daniel-k daniel-k deleted the fix/empty-column-in-df branch February 1, 2024 14:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants