Eia176 wide table #3590
base: main
Conversation
src/pudl/transform/eia176.py
Outdated
"""Compare reported and calculated totals for different geographical aggregates, report any differences.""" | ||
|
||
|
||
def _compare_totals( |
This was helpful for validating raw inputs before tackling how to work with them, but I think I want to adapt this to just validate the output of the transposition function above. Planning to do that and roll it into the validate_totals function above for US-, state-, and other level aggregations.
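A minimal sketch of what that rolled-up aggregate validation could look like, assuming wide company-level and aggregate tables keyed on report_year and area (the function and frame names here are hypothetical, not from the PR):

import numpy as np
import pandas as pd

def compare_aggregates(wide_company: pd.DataFrame, wide_aggregate: pd.DataFrame) -> pd.DataFrame:
    """Sum company-level values per report_year/area and return rows whose sums disagree with reported aggregates."""
    summed = wide_company.drop(columns=["id"]).groupby(["report_year", "area"]).sum(min_count=1)
    reported = wide_aggregate.set_index(["report_year", "area"])
    # align on shared rows/columns, then compare with a float tolerance, treating NaN == NaN as a match
    summed, reported = summed.align(reported, join="inner")
    ok = np.isclose(summed, reported, equal_nan=True)
    return summed[~ok.all(axis=1)]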
return reset_calculated.compare(reset_reported)

# TODO: Reasonable boundaries -- in a script/notebook in the 'validate' directory? How are those executed?
Other ideas on validations to cover here as an asset check? I also see reasonable boundaries in validate/gens_eia860_test.py and could pursue something similar for eia176.
I don't really know much about the actual semantics of the gas data - I think a reasonable thing to do is graph some of the different variables over time and see if anything jumps out as "suspicious", then bring that up and we can try to research that together.
Happy to just get min, max, and enumerated values based on data so far to validate against, maybe in a follow-up.
This sounds like a perfect follow-up PR!
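As a rough illustration of that idea, a bounds check could look something like this once min/max values are profiled from the data (the bounds below are made-up placeholders, not real EIA-176 limits):

import pandas as pd

# Placeholder bounds; real ones would be derived from profiling the historical data.
VALUE_BOUNDS = {
    "1010_VL": (0, 5e8),
    "1020_VL": (0, 5e8),
}

def check_bounds(wide_table: pd.DataFrame) -> pd.DataFrame:
    """Return rows with any value outside its expected per-column range."""
    out_of_bounds = pd.Series(False, index=wide_table.index)
    for column, (lower, upper) in VALUE_BOUNDS.items():
        if column in wide_table:
            out_of_bounds |= wide_table[column].notna() & ~wide_table[column].between(lower, upper)
    return wide_table[out_of_bounds]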
Looks like it's mostly headed in a good direction, thanks for pushing on this & sorry it took so long to review!
I made a suggestion for the performance stuff - I think that what I did is still logically sound but let me know if that's not what you were trying to do.
There's also the question of whether we should be using item as the variable name (#3501).
Separately, is this meant to close #3555 or to prepare for the work of closing #3555? If we're just preparing, then this seems fine. But if we're trying to close that issue we should also have more discussions about whether this table schema with 80 columns really makes a ton of sense as a "tidy" table. Maybe we need to split this into multiple tidy tables - I don't know! But we should probably talk about it.
Personally I feel like we should take things one step at a time, make this big wide table, then figure out the exact design later - love a series of small iterative changes 😄
I think next steps are:
- get a handle on performance
- roll the totals into an asset check
- make some graphs of the data and see if anything "weird" pops up - or, just post a bunch of graphs that "seem normal" and see if people who know more about gas act surprised when they see the graphs.
- handle whatever table design questions seem relevant
src/pudl/etl/__init__.py
Outdated
@@ -62,6 +62,9 @@
core_module_groups = {
    "_core_eia176": [pudl.transform.eia176],
    "_core_eia860": [pudl.transform.eia860],
I'm surprised you had to add the pudl.transform.eia860 and pudl.transform.eia923 modules here, since they're already included under the core_eia860 and core_eia923 keys below... was there a specific issue you saw?
If there was confusion about "dagster asset groups" vs. "groups of Python modules to crawl for assets" that's super understandable and we should add a comment explaining the difference.
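For example, the comment could be something along these lines (wording is just a suggestion):

import pudl.transform.eia176
import pudl.transform.eia860

# NOTE: "dagster asset groups" and "groups of Python modules to crawl for assets" are
# different things; this dict defines the latter. A module only needs to appear once
# here for all of its assets to be discovered.
core_module_groups = {
    "_core_eia176": [pudl.transform.eia176],
    "_core_eia860": [pudl.transform.eia860],
}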
src/pudl/transform/eia176.py
Outdated
logger = get_logger(__name__)

# Asset Checks are still Experimental, silence the warning since we use them
FYI: They've graduated from "experimental" since you opened this draft PR, we can take this out! Benefits of waiting a month to review your PR 😅
src/pudl/transform/eia176.py
Outdated
granular_data = raw_eia176__data[
    (raw_eia176__data["company"] != " Total of All Companies")
]
for report_year, area, id_ in granular_data.groupby(primary_key).count().index:
It looks to me like you're trying to manually do an unstack() operation (take the variable names in the "variable_name" column and turn them into a bunch of columns). There's 51k PK groups here, and doing iterrows() across each group and recombining is pretty expensive in pandas.
I was able to get a similarly shaped result across all years in ~5s by leveraging the built-in unstack functionality and some inelegant cleanup of unstack artifacts.
primary_key = ["report_year", "area", "id"]
raw_eia176__data["variable_name"] = (
    raw_eia176__data["line"] + "_" + raw_eia176__data["atype"]
)
# TODO should probably sanitize this company name somewhere beforehand
granular = raw_eia176__data.loc[
    raw_eia176__data.company.str.strip().str.lower() != "total of all companies"
]
unstacked = (
    granular
    .drop(columns=["itemsort", "item", "atype", "line", "company"])
    .set_index(primary_key + ["variable_name"])
    .unstack(level="variable_name")
)
# columns is a weird multi-index with ("value", "actual column name") - clean that up
unstacked.columns = unstacked.columns.droplevel(0)
unstacked.columns.name = None  # gets rid of "variable_name" name of columns index
# TODO instead of "first NA value we see in each column" applied willy-nilly, we could check to see if there are any conflicting non-null values using .count() first.
condensed = unstacked.groupby(level=primary_key).first().reset_index()
return condensed
One sort of weird thing that I'm curious about - in #3501 it looks like we wanted to use item to identify variables - should we be doing that instead of the line + atype thing we have going on here?
Outputs for comparison/posterity:
`unstack`, all years
report_year area id 1010_CS 1010_CT 1010_VL 101T_VL 1020_CS 1020_CT 1020_VL 102T_VL 1030_CS 1030_CT 1030_VL 103T_VL 1040_CS 1040_CT 1040_VL 104T_VL 1050_CS 1050_CT 1050_VL 105T_VL 1060_CS 1060_CT 1060_VL 106T_VL 110_VL 1110_CS 1110_CT 1110_VL 1120_CS 1120_CT 1120_VL 1130_CS 1130_CT 1130_VL 1140_CS 1140_CT 1140_VL 1150_CS 1150_CT 1150_VL 1160_CT 1160_VL 120_VL 1210_VL 1220_VL 1230_VL 1240_VL 1250_VL 1260_VL 1310_VL 1320_VL 1330_VL 1400_VL 1500_VL 1600_VL 1700_VL 1810_VL 1820_VL 1830_VL 1840_VL 1900_VL 2000_VL 210_VL 220_VL 230_VL 300_VL 3014_CT 3014_VL 3_VL 400_VL 410_CS 410_VL 420_VL 500_VL 5_CT 600_VL 6_CT 700_VL 800_VL 810_VL 810_YA 820_VL 820_YA 900_VL 9_VL
0 1997 Alabama 17600048AL 251371.0 625.0 29469.0 29469.0 69956.0 50.0 8358.0 8358.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 37827.0 -9795.0 NaN NaN NaN 28032.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 28032.0 NaN NaN NaN NaN NaN 1100.0 NaN
1 1997 Alabama 17600049AL 221993.0 501.0 28016.0 28016.0 27135.0 22.0 4606.0 4606.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 32622.0 1417.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 34039.0 NaN NaN NaN 34039.0 NaN NaN NaN NaN NaN 1018.0 NaN
2 1997 Alabama 17600139AL 246105034.0 423130.0 29308019.0 29308019.0 72340233.0 35652.0 9951817.0 20985483.0 16947052.0 1498.0 2724147.0 51277620.0 NaN NaN NaN 7028200.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 87.0 11033666.0 0.0 194.0 48553473.0 0.0 3.0 7028200.0 NaN NaN NaN NaN NaN NaN 179045.0 NaN NaN NaN NaN NaN NaN 783604.0 NaN NaN NaN NaN NaN NaN NaN 7800144.0 NaN 117362115.0 49682.0 NaN 404450.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 117007347.0 NaN NaN NaN 117411797.0 NaN NaN NaN NaN NaN 1017.0 NaN
3 1997 Alabama 17600141AL NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 14356530.0 NaN NaN NaN 1598402.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 9.0 14356530.0 0.0 1.0 1598402.0 NaN NaN NaN NaN NaN NaN 52657.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 18275351.0 NaN NaN NaN 34282940.0 80555.0 NaN NaN NaN 10785270.0 NaN NaN NaN NaN NaN NaN NaN 23578225.0 NaN NaN NaN 34363495.0 NaN NaN NaN NaN NaN NaN NaN
4 1997 Alabama 17600162AL 1344834.0 3488.0 204256.0 204256.0 842826.0 523.0 135980.0 135980.0 2427378.0 7.0 703608.0 703608.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1043844.0 165.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1044009.0 NaN NaN NaN 1044009.0 NaN NaN NaN NaN NaN 1037.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
51365 2022 Wyoming 17678821WY NaN NaN NaN NaN NaN NaN NaN NaN 1926358.0 6.0 228238.0 228238.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 721281.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 13711595.0 NaN NaN 14661114.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 14661114.0 NaN NaN NaN 14661114.0 NaN NaN NaN NaN NaN 1070.0 NaN
51366 2022 Wyoming 17695156WY NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1851326.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 76757.0 1.0 1851326.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1522.0 460891.0 NaN NaN NaN 2313739.0 -1522.0 NaN NaN NaN 2312217.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2312217.0 NaN NaN NaN NaN NaN 1000.0 NaN
51367 2022 Wyoming 17695458WY 1588901.0 1550.0 134273.0 134273.0 1426548.0 277.0 125426.0 125426.0 NaN NaN NaN NaN NaN NaN NaN NaN 19892.0 1.0 1819.0 1819.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1530.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 215.0 NaN NaN NaN NaN 263263.0 2823.0 NaN NaN NaN NaN NaN 2.0 NaN NaN NaN 266086.0 NaN NaN NaN NaN NaN 266086.0 NaN NaN NaN NaN NaN 1095.0 NaN
51368 2022 Wyoming 17696301WY NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5450009.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 234285.0 2.0 5450009.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5450009.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5450009.0 NaN NaN NaN 5450009.0 NaN NaN NaN NaN NaN 1050.0 NaN
51369 2022 Wyoming 17699910WY NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 120000000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[51370 rows x 88 columns]
existing code, all years
report_year area id 1010_CS 1010_CT 1010_VL 101T_VL 1020_CS 1020_CT 1020_VL 102T_VL 1030_CS 1030_CT 1030_VL 103T_VL 1040_CS 1040_CT 1040_VL 104T_VL 1050_CS 1050_CT 1050_VL 105T_VL 1060_CS 1060_CT 1060_VL 106T_VL 110_VL 1110_CS 1110_CT 1110_VL 1120_CS 1120_CT 1120_VL 1130_CS 1130_CT 1130_VL 1140_CS 1140_CT 1140_VL 1150_CS 1150_CT 1150_VL 1160_CT 1160_VL 120_VL 1210_VL 1220_VL 1230_VL 1240_VL 1250_VL 1260_VL 1310_VL 1320_VL 1330_VL 1400_VL 1500_VL 1600_VL 1700_VL 1810_VL 1820_VL 1830_VL 1840_VL 1900_VL 2000_VL 210_VL 220_VL 230_VL 300_VL 3014_CT 3014_VL 3_VL 400_VL 410_CS 410_VL 420_VL 500_VL 5_CT 600_VL 6_CT 700_VL 800_VL 810_VL 810_YA 820_VL 820_YA 900_VL 9_VL
0 1997 Alabama 17600048AL 251371.0 625.0 29469.0 29469.0 69956.0 50.0 8358.0 8358.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 37827.0 -9795.0 NaN NaN NaN 28032.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 28032.0 NaN NaN NaN NaN NaN 1100.0 NaN
1 1997 Alabama 17600049AL 221993.0 501.0 28016.0 28016.0 27135.0 22.0 4606.0 4606.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 32622.0 1417.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 34039.0 NaN NaN NaN 34039.0 NaN NaN NaN NaN NaN 1018.0 NaN
2 1997 Alabama 17600139AL 246105034.0 423130.0 29308019.0 29308019.0 72340233.0 35652.0 9951817.0 20985483.0 16947052.0 1498.0 2724147.0 51277620.0 NaN NaN NaN 7028200.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 87.0 11033666.0 0.0 194.0 48553473.0 0.0 3.0 7028200.0 NaN NaN NaN NaN NaN NaN 179045.0 NaN NaN NaN NaN NaN NaN 783604.0 NaN NaN NaN NaN NaN NaN NaN 7800144.0 NaN 117362115.0 49682.0 NaN 404450.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 117007347.0 NaN NaN NaN 117411797.0 NaN NaN NaN NaN NaN 1017.0 NaN
3 1997 Alabama 17600141AL NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 14356530.0 NaN NaN NaN 1598402.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 9.0 14356530.0 0.0 1.0 1598402.0 NaN NaN NaN NaN NaN NaN 52657.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 18275351.0 NaN NaN NaN 34282940.0 80555.0 NaN NaN NaN 10785270.0 NaN NaN NaN NaN NaN NaN NaN 23578225.0 NaN NaN NaN 34363495.0 NaN NaN NaN NaN NaN NaN NaN
4 1997 Alabama 17600162AL 1344834.0 3488.0 204256.0 204256.0 842826.0 523.0 135980.0 135980.0 2427378.0 7.0 703608.0 703608.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1043844.0 165.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1044009.0 NaN NaN NaN 1044009.0 NaN NaN NaN NaN NaN 1037.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
51365 2022 Wyoming 17678821WY NaN NaN NaN NaN NaN NaN NaN NaN 1926358.0 6.0 228238.0 228238.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 721281.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 13711595.0 NaN NaN 14661114.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 14661114.0 NaN NaN NaN 14661114.0 NaN NaN NaN NaN NaN 1070.0 NaN
51366 2022 Wyoming 17695156WY NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1851326.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 76757.0 1.0 1851326.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1522.0 460891.0 NaN NaN NaN 2313739.0 -1522.0 NaN NaN NaN 2312217.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2312217.0 NaN NaN NaN NaN NaN 1000.0 NaN
51367 2022 Wyoming 17695458WY 1588901.0 1550.0 134273.0 134273.0 1426548.0 277.0 125426.0 125426.0 NaN NaN NaN NaN NaN NaN NaN NaN 19892.0 1.0 1819.0 1819.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1530.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 215.0 NaN NaN NaN NaN 263263.0 2823.0 NaN NaN NaN NaN NaN 2.0 NaN NaN NaN 266086.0 NaN NaN NaN NaN NaN 266086.0 NaN NaN NaN NaN NaN 1095.0 NaN
51368 2022 Wyoming 17696301WY NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5450009.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 234285.0 2.0 5450009.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5450009.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5450009.0 NaN NaN NaN 5450009.0 NaN NaN NaN NaN NaN 1050.0 NaN
51369 2022 Wyoming 17699910WY NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 120000000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[51370 rows x 88 columns]
They seem to be the same if you df.compare() the two as well.
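A tiny self-contained illustration of that df.compare() check (df_a and df_b standing in for the two results above):

import pandas as pd

df_a = pd.DataFrame({"x": [1.0, 2.0]})
df_b = pd.DataFrame({"x": [1.0, 2.0]})
# .compare() returns only the differing cells, so an empty result means the two frames match
assert df_a.compare(df_b).empty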
Thanks. I was looking for a built-in function and it wasn't a simple transpose. I've worked this in but need to get the pre-commit hooks passing. Will chip away at this over the next few days.
test/unit/transform/eia176_test.py
Outdated
).empty

# TODO: Implement, if we can even unit-test a function annotated as an asset check
Should be able to! In my mind, asset checks are just assets, which can be called directly...
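One way to keep that testable, sketched under the assumption that the check logic lives in a plain helper the asset check wraps (the asset key, function names, and logic here are hypothetical):

import pandas as pd
from dagster import AssetCheckResult, asset_check


def _totals_are_consistent(wide_table: pd.DataFrame) -> bool:
    """Plain function holding the check logic so a unit test can call it directly."""
    # placeholder logic; a real check would compare reported vs. calculated totals
    return not wide_table.empty


@asset_check(asset="_core_eia176__yearly_company_data")
def totals_check(_core_eia176__yearly_company_data: pd.DataFrame) -> AssetCheckResult:
    return AssetCheckResult(passed=_totals_are_consistent(_core_eia176__yearly_company_data))


def test_totals_are_consistent():
    assert _totals_are_consistent(pd.DataFrame({"1010_VL": [1.0]}))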
src/pudl/transform/eia176.py
Outdated
reset_calculated = (
    calculated_totals.sort_values(by=groupby_cols)
    .reset_index()[groupby_cols + ["value"]]
    .round(2)
You can use numpy.isclose() for comparing floats without rounding off to 2 decimal points everywhere!
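A minimal sketch of that tolerance-based comparison:

import numpy as np
import pandas as pd

calculated = pd.Series([1.0000001, 2.5])
reported = pd.Series([1.0, 2.5])
# element-wise comparison within floating-point tolerance, no .round(2) needed
assert np.isclose(calculated, reported).all()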
@davidmudrauskas Just a heads up that I'm back on review duty here! Just nudge or ping me when you're ready for a re-review.
unstacked.columns = unstacked.columns.droplevel(0).fillna(0)
unstacked.columns.name = None  # gets rid of "variable_name" name of columns index

# TODO instead of "first NA value we see in each column" applied willy-nilly, we could check to see if there are any conflicting non-null values using .count() first.
I didn't exactly discern what you meant here @jdangerx, if you want to expand. I can probably figure out the intent with a closer look.
Ah - yes, I meant first non-NA value here, which must have been confusing. Check out this example:
>>> df
a b
0 1 10.0
1 1 NaN
2 2 20.0
3 2 21.0
>>> df.groupby("a").first()
b
a
1 10.0
2 20.0
.first() grabs the first non-null value regardless of if there are other non-null values. So for group a=1 we get the unambiguously right outcome of b=10, but for group a=2... is b=20 or b=21 the right outcome?
If we can tell what the right outcome should be, maybe by sorting by some field, then we should sort by that field before calling .first():
>>> df.sort_values("b", ascending=False)
a b
3 2 21.0
2 2 20.0
0 1 10.0
1 1 NaN
>>> df.sort_values("b", ascending=False).groupby("a").first()
b
a
1 10.0
2 21.0
If there's no simple rule, we should check to see that we're not running into this ambiguous behavior by checking that there are no groups with multiple non-NA values before going forward with the .first() call:
>>> df.groupby("a").count()
b
a
1 1
2 2
>>> (df.groupby("a").count() == 1).all()
b False
dtype: bool
Hopefully that clears things up!
We're guaranteed to have one value per group by the time we call first, because we're using unstack, which doesn't tolerate competing values:
>>> df
company_id variable value
0 1 var1 11
1 1 var2 12
2 2 var2 22
3 2 var2 0
>>> df.set_index(["company_id", "variable"]).unstack(level="variable")
...
ValueError: Index contains duplicate entries, cannot reshape
In fact I don't think we need first at all. Seems like a holdover from when we were working with the idea of a sparse matrix.
>>> df
company_id variable value
0 1 var1 11
1 1 var2 12
2 2 var2 22
>>> df.set_index(["company_id", "variable"]).unstack(level="variable")
value
variable var1 var2
company_id
1 11.0 12.0
2 NaN 22.0
I'll get the adjustment in my next commit.
@jdangerx I think this is pretty much the substance of the work. Let me know if you have any other loose ends, and I'll work this into a proper PR with description, merged with main, etc.
Finally got around to this! It looks pretty good structurally - just a few small changes / elaborating on that "willy-nilly" comment I had / a question about how you set up your test data.
One table with data for each year and company, one with state- and US-level aggregates per year.
"""
raw_eia176__data["report_year"].astype(int)
These astype calls don't mutate the original object:
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1]})
>>> df.dtypes
a int64
dtype: object
>>> df["a"].astype("float")
0 1.0
Name: a, dtype: float64
>>> df.dtypes
a int64
dtype: object
So I think you'll have to do some reassignment.
I also think it would be nice to do all that reassignment in one call - which you can do by calling astype on the DataFrame instead of its constituent Series:
>>> df = pd.DataFrame({"a": [1], "b": ["1"], "c": 1.0})
>>> df
a b c
0 1 1 1.0
>>> df.dtypes
a int64
b object
c float64
dtype: object
>>> df = df.astype({"a": "float", "b": "float"})
>>> df
a b c
0 1.0 1.0 1.0
>>> df.dtypes
a float64
b float64
c float64
dtype: object
def get_wide_table(long_table: pd.DataFrame, primary_key: list[str]) -> pd.DataFrame:
    """Take a 'long' or entity-attribute-value table and return a wide table with one column per attribute/variable."""
    unstacked = (
        # we must drop 'id' here and cannot use as primary key because its arbitrary/duplicate in aggregate records
What a fun quirk you've found! Hooray!
These comments seem like they belong before line 44, where you're actually doing the ID dropping.
long_company = raw_eia176__data.loc[
    raw_eia176__data.company != "total of all companies"
]
aggregate_primary_key = ["report_year", "area"]
super-nit: combined with the comment below about why the primary keys should differ, I think it would be slightly more readable if you assigned the two different primary keys names - instead of only giving the aggregate one a name while the company one just randomly has ["id"] tacked on.
# your comment about how the primary keys should differ
aggregate_primary_key = [...]
company_primary_key = aggregate_primary_key + ["id"]
from pudl.transform.eia176 import _core_eia176__data, get_wide_table, validate_totals

COLUMN_NAMES = [
Food for thought: it sort of seems like you'd have a nicer time making a single dataframe that looks like the raw EIA 176 data, registering it as a @pytest.fixture, and then selecting the subsets of it you want to test with in the individual tests. What made you go this direction instead?
I'll give it a shot. I think a fixture could save a number of lines and data-building, particularly in checking internal consistency of aggregations. I think I'll still declare a good number of constants to use more semantically across the test file, instead of expecting maintainers to glean the nature of an arbitrary number (e.g., 532842.0) in limited contexts. I'm also inclined to keep the fixture minimal and somewhat contrived, if informed by actual raw data, rather than a fuller sample of raw data; I don't want more data points than needed to illustrate the tested behaviors, since that could make tests and code harder to comprehend.
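A minimal sketch of that fixture idea, with a contrived frame shaped like the raw EIA-176 data (values, company names, and the test itself are made up for illustration):

import pandas as pd
import pytest


@pytest.fixture
def raw_eia176_data() -> pd.DataFrame:
    """Small, contrived frame shaped like the raw EIA-176 data."""
    return pd.DataFrame(
        {
            "report_year": [2022, 2022, 2022],
            "area": ["new mexico", "new mexico", "new mexico"],
            "id": ["C1", "C1", "total"],
            "company": ["some company llc", "some company llc", "total of all companies"],
            "line": ["1010", "1020", "1010"],
            "atype": ["VL", "VL", "VL"],
            "value": [10.0, 20.0, 10.0],
        }
    )


def test_company_rows_only(raw_eia176_data):
    # each test selects just the subset it cares about
    granular = raw_eia176_data[raw_eia176_data.company != "total of all companies"]
    assert len(granular) == 2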
Ah sweet! That is the best outcome :)
In retrospect, yes I have seen that error message countless times when dealing with unstack. Just didn't put two and two together.
Overview
Closes #XXXX.
What problem does this address?
What did you change?
Testing
How did you make sure this worked? How can a reviewer verify this?
To-do list