Eia176 wide table #3590
base: main
Conversation
src/pudl/transform/eia176.py
Outdated
"""Compare reported and calculated totals for different geographical aggregates, report any differences.""" | ||
|
||
|
||
def _compare_totals( |
This was helpful for validating raw inputs before tackling how to work with them, but I think I want to adapt this to just validate the output of the transposition function above. Planning to do that and roll it into the validate_totals function above for US-, state-, and other level aggregations.
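A minimal sketch of what that rolled-up aggregate validation could look like, assuming wide company-level and aggregate tables keyed on report_year and area (the function and frame names here are hypothetical, not from the PR):

import numpy as np
import pandas as pd

def compare_aggregates(wide_company: pd.DataFrame, wide_aggregate: pd.DataFrame) -> pd.DataFrame:
    """Sum company-level values per report_year/area and return rows whose sums disagree with reported aggregates."""
    summed = wide_company.drop(columns=["id"]).groupby(["report_year", "area"]).sum(min_count=1)
    reported = wide_aggregate.set_index(["report_year", "area"])
    # align on shared rows/columns, then compare with a float tolerance, treating NaN == NaN as a match
    summed, reported = summed.align(reported, join="inner")
    ok = np.isclose(summed, reported, equal_nan=True)
    return summed[~ok.all(axis=1)]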
return reset_calculated.compare(reset_reported)

# TODO: Reasonable boundaries -- in a script/notebook in the 'validate' directory? How are those executed?
Other ideas on validations to cover here as an asset check? I also see reasonable boundaries in validate/gens_eia860_test.py and could pursue something similar for eia176.
I don't really know much about the actual semantics of the gas data - I think a reasonable thing to do is graph some of the different variables over time and see if anything jumps out as "suspicious", then bring that up and we can try to research that together.
Happy to just get min, max, and enumerated values based on data so far to validate against, maybe in a follow-up.
This sounds like a perfect follow-up PR!
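As a rough illustration of that idea, a bounds check could look something like this once min/max values are profiled from the data (the bounds below are made-up placeholders, not real EIA-176 limits):

import pandas as pd

# Placeholder bounds; real ones would be derived from profiling the historical data.
VALUE_BOUNDS = {
    "1010_VL": (0, 5e8),
    "1020_VL": (0, 5e8),
}

def check_bounds(wide_table: pd.DataFrame) -> pd.DataFrame:
    """Return rows with any value outside its expected per-column range."""
    out_of_bounds = pd.Series(False, index=wide_table.index)
    for column, (lower, upper) in VALUE_BOUNDS.items():
        if column in wide_table:
            out_of_bounds |= wide_table[column].notna() & ~wide_table[column].between(lower, upper)
    return wide_table[out_of_bounds]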
Looks like it's mostly headed in a good direction, thanks for pushing on this & sorry it took so long to review!
I made a suggestion for the performance stuff - I think that what I did is still logically sound but let me know if that's not what you were trying to do.
There's also the question of whether we should be using item as the variable name (#3501).
Separately, is this meant to close #3555 or to prepare for the work of closing #3555? If we're just preparing, then this seems fine. But if we're trying to close that issue we should also have more discussions about whether this table schema with 80 columns really makes a ton of sense as a "tidy" table. Maybe we need to split this into multiple tidy tables - I don't know! But we should probably talk about it.
Personally I feel like we should take things one step at a time, make this big wide table, then figure out the exact design later - love a series of small iterative changes 😄
I think next steps are:
- get a handle on performance
- roll the totals into an asset check
- make some graphs of the data and see if anything "weird" pops up - or, just post a bunch of graphs that "seem normal" and see if people who know more about gas act surprised when they see the graphs.
- handle whatever table design questions seem relevant
src/pudl/etl/__init__.py
Outdated
@@ -62,6 +62,9 @@
core_module_groups = {
    "_core_eia176": [pudl.transform.eia176],
    "_core_eia860": [pudl.transform.eia860],
I'm surprised you had to add the pudl.transform.eia860 and pudl.transform.eia923 modules here, since they're already included under the core_eia860 and core_eia923 keys below... was there a specific issue you saw?
If there was confusion about "dagster asset groups" vs. "groups of Python modules to crawl for assets" that's super understandable and we should add a comment explaining the difference.
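For example, the comment could be something along these lines (wording is just a suggestion):

import pudl.transform.eia176
import pudl.transform.eia860

# NOTE: "dagster asset groups" and "groups of Python modules to crawl for assets" are
# different things; this dict defines the latter. A module only needs to appear once
# here for all of its assets to be discovered.
core_module_groups = {
    "_core_eia176": [pudl.transform.eia176],
    "_core_eia860": [pudl.transform.eia860],
}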
src/pudl/transform/eia176.py
Outdated
logger = get_logger(__name__)

# Asset Checks are still Experimental, silence the warning since we use them
FYI: They've graduated from "experimental" since you opened this draft PR, we can take this out! Benefits of waiting a month to review your PR 😅
src/pudl/transform/eia176.py
Outdated
granular_data = raw_eia176__data[
    (raw_eia176__data["company"] != " Total of All Companies")
]
for report_year, area, id_ in granular_data.groupby(primary_key).count().index:
It looks to me like you're trying to manually do an unstack() operation (take the variable names in the "variable_name" column and turn them into a bunch of columns). There's 51k PK groups here, and doing iterrows() across each group and recombining is pretty expensive in pandas.
I was able to get a similarly shaped result across all years in ~5s by leveraging the built-in unstack functionality and some inelegant cleanup of unstack artifacts.
primary_key = ["report_year", "area", "id"]
raw_eia176__data["variable_name"] = (
    raw_eia176__data["line"] + "_" + raw_eia176__data["atype"]
)
# TODO should probably sanitize this company name somewhere beforehand
granular = raw_eia176__data.loc[
    raw_eia176__data.company.str.strip().str.lower() != "total of all companies"
]
unstacked = (
    granular
    .drop(columns=["itemsort", "item", "atype", "line", "company"])
    .set_index(primary_key + ["variable_name"])
    .unstack(level="variable_name")
)
# columns is a weird multi-index with ("value", "actual column name") - clean that up
unstacked.columns = unstacked.columns.droplevel(0)
unstacked.columns.name = None  # gets rid of "variable_name" name of columns index
# TODO instead of "first NA value we see in each column" applied willy-nilly, we could check to see if there are any conflicting non-null values using .count() first.
condensed = unstacked.groupby(level=primary_key).first().reset_index()
return condensed
One sort of weird thing that I'm curious about - in #3501 it looks like we wanted to use item to identify variables - should we be doing that instead of the line + atype thing we have going on here?
Outputs for comparison/posterity:
`unstack`, all years
report_year area id 1010_CS 1010_CT 1010_VL 101T_VL 1020_CS 1020_CT 1020_VL 102T_VL 1030_CS 1030_CT 1030_VL 103T_VL 1040_CS 1040_CT 1040_VL 104T_VL 1050_CS 1050_CT 1050_VL 105T_VL 1060_CS 1060_CT 1060_VL 106T_VL 110_VL 1110_CS 1110_CT 1110_VL 1120_CS 1120_CT 1120_VL 1130_CS 1130_CT 1130_VL 1140_CS 1140_CT 1140_VL 1150_CS 1150_CT 1150_VL 1160_CT 1160_VL 120_VL 1210_VL 1220_VL 1230_VL 1240_VL 1250_VL 1260_VL 1310_VL 1320_VL 1330_VL 1400_VL 1500_VL 1600_VL 1700_VL 1810_VL 1820_VL 1830_VL 1840_VL 1900_VL 2000_VL 210_VL 220_VL 230_VL 300_VL 3014_CT 3014_VL 3_VL 400_VL 410_CS 410_VL 420_VL 500_VL 5_CT 600_VL 6_CT 700_VL 800_VL 810_VL 810_YA 820_VL 820_YA 900_VL 9_VL
0 1997 Alabama 17600048AL 251371.0 625.0 29469.0 29469.0 69956.0 50.0 8358.0 8358.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 37827.0 -9795.0 NaN NaN NaN 28032.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 28032.0 NaN NaN NaN NaN NaN 1100.0 NaN
1 1997 Alabama 17600049AL 221993.0 501.0 28016.0 28016.0 27135.0 22.0 4606.0 4606.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 32622.0 1417.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 34039.0 NaN NaN NaN 34039.0 NaN NaN NaN NaN NaN 1018.0 NaN
2 1997 Alabama 17600139AL 246105034.0 423130.0 29308019.0 29308019.0 72340233.0 35652.0 9951817.0 20985483.0 16947052.0 1498.0 2724147.0 51277620.0 NaN NaN NaN 7028200.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 87.0 11033666.0 0.0 194.0 48553473.0 0.0 3.0 7028200.0 NaN NaN NaN NaN NaN NaN 179045.0 NaN NaN NaN NaN NaN NaN 783604.0 NaN NaN NaN NaN NaN NaN NaN 7800144.0 NaN 117362115.0 49682.0 NaN 404450.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 117007347.0 NaN NaN NaN 117411797.0 NaN NaN NaN NaN NaN 1017.0 NaN
3 1997 Alabama 17600141AL NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 14356530.0 NaN NaN NaN 1598402.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 9.0 14356530.0 0.0 1.0 1598402.0 NaN NaN NaN NaN NaN NaN 52657.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 18275351.0 NaN NaN NaN 34282940.0 80555.0 NaN NaN NaN 10785270.0 NaN NaN NaN NaN NaN NaN NaN 23578225.0 NaN NaN NaN 34363495.0 NaN NaN NaN NaN NaN NaN NaN
4 1997 Alabama 17600162AL 1344834.0 3488.0 204256.0 204256.0 842826.0 523.0 135980.0 135980.0 2427378.0 7.0 703608.0 703608.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1043844.0 165.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1044009.0 NaN NaN NaN 1044009.0 NaN NaN NaN NaN NaN 1037.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
51365 2022 Wyoming 17678821WY NaN NaN NaN NaN NaN NaN NaN NaN 1926358.0 6.0 228238.0 228238.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 721281.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 13711595.0 NaN NaN 14661114.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 14661114.0 NaN NaN NaN 14661114.0 NaN NaN NaN NaN NaN 1070.0 NaN
51366 2022 Wyoming 17695156WY NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1851326.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 76757.0 1.0 1851326.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1522.0 460891.0 NaN NaN NaN 2313739.0 -1522.0 NaN NaN NaN 2312217.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2312217.0 NaN NaN NaN NaN NaN 1000.0 NaN
51367 2022 Wyoming 17695458WY 1588901.0 1550.0 134273.0 134273.0 1426548.0 277.0 125426.0 125426.0 NaN NaN NaN NaN NaN NaN NaN NaN 19892.0 1.0 1819.0 1819.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1530.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 215.0 NaN NaN NaN NaN 263263.0 2823.0 NaN NaN NaN NaN NaN 2.0 NaN NaN NaN 266086.0 NaN NaN NaN NaN NaN 266086.0 NaN NaN NaN NaN NaN 1095.0 NaN
51368 2022 Wyoming 17696301WY NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5450009.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 234285.0 2.0 5450009.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5450009.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5450009.0 NaN NaN NaN 5450009.0 NaN NaN NaN NaN NaN 1050.0 NaN
51369 2022 Wyoming 17699910WY NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 120000000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[51370 rows x 88 columns]
existing code, all years
report_year area id 1010_CS 1010_CT 1010_VL 101T_VL 1020_CS 1020_CT 1020_VL 102T_VL 1030_CS 1030_CT 1030_VL 103T_VL 1040_CS 1040_CT 1040_VL 104T_VL 1050_CS 1050_CT 1050_VL 105T_VL 1060_CS 1060_CT 1060_VL 106T_VL 110_VL 1110_CS 1110_CT 1110_VL 1120_CS 1120_CT 1120_VL 1130_CS 1130_CT 1130_VL 1140_CS 1140_CT 1140_VL 1150_CS 1150_CT 1150_VL 1160_CT 1160_VL 120_VL 1210_VL 1220_VL 1230_VL 1240_VL 1250_VL 1260_VL 1310_VL 1320_VL 1330_VL 1400_VL 1500_VL 1600_VL 1700_VL 1810_VL 1820_VL 1830_VL 1840_VL 1900_VL 2000_VL 210_VL 220_VL 230_VL 300_VL 3014_CT 3014_VL 3_VL 400_VL 410_CS 410_VL 420_VL 500_VL 5_CT 600_VL 6_CT 700_VL 800_VL 810_VL 810_YA 820_VL 820_YA 900_VL 9_VL
0 1997 Alabama 17600048AL 251371.0 625.0 29469.0 29469.0 69956.0 50.0 8358.0 8358.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 37827.0 -9795.0 NaN NaN NaN 28032.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 28032.0 NaN NaN NaN NaN NaN 1100.0 NaN
1 1997 Alabama 17600049AL 221993.0 501.0 28016.0 28016.0 27135.0 22.0 4606.0 4606.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 32622.0 1417.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 34039.0 NaN NaN NaN 34039.0 NaN NaN NaN NaN NaN 1018.0 NaN
2 1997 Alabama 17600139AL 246105034.0 423130.0 29308019.0 29308019.0 72340233.0 35652.0 9951817.0 20985483.0 16947052.0 1498.0 2724147.0 51277620.0 NaN NaN NaN 7028200.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 87.0 11033666.0 0.0 194.0 48553473.0 0.0 3.0 7028200.0 NaN NaN NaN NaN NaN NaN 179045.0 NaN NaN NaN NaN NaN NaN 783604.0 NaN NaN NaN NaN NaN NaN NaN 7800144.0 NaN 117362115.0 49682.0 NaN 404450.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 117007347.0 NaN NaN NaN 117411797.0 NaN NaN NaN NaN NaN 1017.0 NaN
3 1997 Alabama 17600141AL NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 14356530.0 NaN NaN NaN 1598402.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 9.0 14356530.0 0.0 1.0 1598402.0 NaN NaN NaN NaN NaN NaN 52657.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 18275351.0 NaN NaN NaN 34282940.0 80555.0 NaN NaN NaN 10785270.0 NaN NaN NaN NaN NaN NaN NaN 23578225.0 NaN NaN NaN 34363495.0 NaN NaN NaN NaN NaN NaN NaN
4 1997 Alabama 17600162AL 1344834.0 3488.0 204256.0 204256.0 842826.0 523.0 135980.0 135980.0 2427378.0 7.0 703608.0 703608.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1043844.0 165.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1044009.0 NaN NaN NaN 1044009.0 NaN NaN NaN NaN NaN 1037.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
51365 2022 Wyoming 17678821WY NaN NaN NaN NaN NaN NaN NaN NaN 1926358.0 6.0 228238.0 228238.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 721281.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 13711595.0 NaN NaN 14661114.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 14661114.0 NaN NaN NaN 14661114.0 NaN NaN NaN NaN NaN 1070.0 NaN
51366 2022 Wyoming 17695156WY NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1851326.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 76757.0 1.0 1851326.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1522.0 460891.0 NaN NaN NaN 2313739.0 -1522.0 NaN NaN NaN 2312217.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2312217.0 NaN NaN NaN NaN NaN 1000.0 NaN
51367 2022 Wyoming 17695458WY 1588901.0 1550.0 134273.0 134273.0 1426548.0 277.0 125426.0 125426.0 NaN NaN NaN NaN NaN NaN NaN NaN 19892.0 1.0 1819.0 1819.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1530.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 215.0 NaN NaN NaN NaN 263263.0 2823.0 NaN NaN NaN NaN NaN 2.0 NaN NaN NaN 266086.0 NaN NaN NaN NaN NaN 266086.0 NaN NaN NaN NaN NaN 1095.0 NaN
51368 2022 Wyoming 17696301WY NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5450009.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 234285.0 2.0 5450009.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5450009.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5450009.0 NaN NaN NaN 5450009.0 NaN NaN NaN NaN NaN 1050.0 NaN
51369 2022 Wyoming 17699910WY NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 120000000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[51370 rows x 88 columns]
They seem to be the same if you df.compare() the two as well.
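A tiny self-contained illustration of that df.compare() check (df_a and df_b standing in for the two results above):

import pandas as pd

df_a = pd.DataFrame({"x": [1.0, 2.0]})
df_b = pd.DataFrame({"x": [1.0, 2.0]})
# .compare() returns only the differing cells, so an empty result means the two frames match
assert df_a.compare(df_b).empty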
Thanks. I was looking for a built-in function and it wasn't a simple transpose. I've worked this in but need to get the pre-commit hooks passing. Will chip away at this over the next few days.
test/unit/transform/eia176_test.py
Outdated
).empty

# TODO: Implement, if we can even unit-test a function annotated as an asset check
Should be able to! In my mind, asset checks are just assets, which can be called directly...
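One way to keep that testable, sketched under the assumption that the check logic lives in a plain helper the asset check wraps (the asset key, function names, and logic here are hypothetical):

import pandas as pd
from dagster import AssetCheckResult, asset_check


def _totals_are_consistent(wide_table: pd.DataFrame) -> bool:
    """Plain function holding the check logic so a unit test can call it directly."""
    # placeholder logic; a real check would compare reported vs. calculated totals
    return not wide_table.empty


@asset_check(asset="_core_eia176__yearly_company_data")
def totals_check(_core_eia176__yearly_company_data: pd.DataFrame) -> AssetCheckResult:
    return AssetCheckResult(passed=_totals_are_consistent(_core_eia176__yearly_company_data))


def test_totals_are_consistent():
    assert _totals_are_consistent(pd.DataFrame({"1010_VL": [1.0]}))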
src/pudl/transform/eia176.py
Outdated
reset_calculated = (
    calculated_totals.sort_values(by=groupby_cols)
    .reset_index()[groupby_cols + ["value"]]
    .round(2)
You can use numpy.isclose() for comparing floats without rounding off to 2 decimal points everywhere!
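A minimal sketch of that tolerance-based comparison:

import numpy as np
import pandas as pd

calculated = pd.Series([1.0000001, 2.5])
reported = pd.Series([1.0, 2.5])
# element-wise comparison within floating-point tolerance, no .round(2) needed
assert np.isclose(calculated, reported).all()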
@davidmudrauskas Just a heads up that I'm back on review duty here! Just nudge or ping me when you're ready for a re-review.
unstacked.columns = unstacked.columns.droplevel(0).fillna(0)
unstacked.columns.name = None  # gets rid of "variable_name" name of columns index

# TODO instead of "first NA value we see in each column" applied willy-nilly, we could check to see if there are any conflicting non-null values using .count() first.
I didn't exactly discern what you meant here @jdangerx, if you want to expand. I can probably figure out the intent with a closer look.
Ah - yes, I meant first non-NA value here, which must have been confusing. Check out this example:
>>> df
a b
0 1 10.0
1 1 NaN
2 2 20.0
3 2 21.0
>>> df.groupby("a").first()
b
a
1 10.0
2 20.0
.first() grabs the first non-null value regardless of if there are other non-null values. So for group a=1 we get the unambiguously right outcome of b=10, but for group a=2... is b=20 or b=21 the right outcome?
If we can tell what the right outcome should be, maybe by sorting by some field, then we should sort by that field before calling .first():
>>> df.sort_values("b", ascending=False)
a b
3 2 21.0
2 2 20.0
0 1 10.0
1 1 NaN
>>> df.sort_values("b", ascending=False).groupby("a").first()
b
a
1 10.0
2 21.0
If there's no simple rule, we should check to see that we're not running into this ambiguous behavior by checking that there are no groups with multiple non-NA values before going forward with the .first() call:
>>> df.groupby("a").count()
b
a
1 1
2 2
>>> (df.groupby("a").count() == 1).all()
b False
dtype: bool
Hopefully that clears things up!
We're guaranteed to have one value per group by the time we call first, because we're using unstack, which doesn't tolerate competing values:
>>> df
company_id variable value
0 1 var1 11
1 1 var2 12
2 2 var2 22
3 2 var2 0
>>> df.set_index(["company_id", "variable"]).unstack(level="variable")
...
ValueError: Index contains duplicate entries, cannot reshape
In fact I don't think we need first at all. Seems like a holdover from when we were working with the idea of a sparse matrix.
>>> df
company_id variable value
0 1 var1 11
1 1 var2 12
2 2 var2 22
>>> df.set_index(["company_id", "variable"]).unstack(level="variable")
value
variable var1 var2
company_id
1 11.0 12.0
2 NaN 22.0
I'll get the adjustment in my next commit.
@jdangerx I think this is pretty much the substance of the work. Let me know if you have any other loose ends, and I'll work this into a proper PR with description, merged with main, etc.
Finally got around to this! It looks pretty good structurally - just a few small changes / elaborating on that "willy-nilly" comment I had / a question about how you set up your test data.
One table with data for each year and company, one with state- and US-level aggregates per year.
"""
raw_eia176__data["report_year"].astype(int)
These astype calls don't mutate the original object:
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1]})
>>> df.dtypes
a int64
dtype: object
>>> df["a"].astype("float")
0 1.0
Name: a, dtype: float64
>>> df.dtypes
a int64
dtype: object
So I think you'll have to do some reassignment.
I also think it would be nice to do all that reassignment in one call - which you can do by calling astype on the DataFrame instead of its constituent Series:
>>> df = pd.DataFrame({"a": [1], "b": ["1"], "c": 1.0})
>>> df
a b c
0 1 1 1.0
>>> df.dtypes
a int64
b object
c float64
dtype: object
>>> df = df.astype({"a": "float", "b": "float"})
>>> df
a b c
0 1.0 1.0 1.0
>>> df.dtypes
a float64
b float64
c float64
dtype: object
def get_wide_table(long_table: pd.DataFrame, primary_key: list[str]) -> pd.DataFrame:
    """Take a 'long' or entity-attribute-value table and return a wide table with one column per attribute/variable."""
    unstacked = (
        # we must drop 'id' here and cannot use as primary key because its arbitrary/duplicate in aggregate records
What a fun quirk you've found! Hooray!
These comments seem like they belong before line 44, where you're actually doing the ID dropping.
long_company = raw_eia176__data.loc[
    raw_eia176__data.company != "total of all companies"
]
aggregate_primary_key = ["report_year", "area"]
super-nit: combined with the comment below about why the primary keys should differ, I think it would be slightly more readable if you assigned the two different primary keys names - instead of only giving the aggregate one a name while the company one just randomly has ["id"] tacked on.
# your comment about how the primary keys should differ
aggregate_primary_key = [...]
company_primary_key = aggregate_primary_key + ["id"]
from pudl.transform.eia176 import _core_eia176__data, get_wide_table, validate_totals

COLUMN_NAMES = [
Food for thought: it sort of seems like you'd have a nicer time making a single dataframe that looks like the raw EIA 176 data, registering it as a @pytest.fixture, and then selecting the subsets of it you want to test with in the individual tests. What made you go this direction instead?
I'll give it a shot. I think a fixture could save a number of lines and data-building, particularly in checking internal consistency of aggregations. I think I'll still declare a good number of constants to use more semantically across the test file, instead of expecting maintainers to glean the nature of an arbitrary number (e.g., 532842.0) in limited contexts. I'm also inclined to keep the fixture minimal and somewhat contrived, if informed by actual raw data, rather than a fuller sample of raw data; I don't want more data points than needed to illustrate the tested behaviors, since that could make tests and code harder to comprehend.
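A minimal sketch of that fixture idea, with a contrived frame shaped like the raw EIA-176 data (values, company names, and the test itself are made up for illustration):

import pandas as pd
import pytest


@pytest.fixture
def raw_eia176_data() -> pd.DataFrame:
    """Small, contrived frame shaped like the raw EIA-176 data."""
    return pd.DataFrame(
        {
            "report_year": [2022, 2022, 2022],
            "area": ["new mexico", "new mexico", "new mexico"],
            "id": ["C1", "C1", "total"],
            "company": ["some company llc", "some company llc", "total of all companies"],
            "line": ["1010", "1020", "1010"],
            "atype": ["VL", "VL", "VL"],
            "value": [10.0, 20.0, 10.0],
        }
    )


def test_company_rows_only(raw_eia176_data):
    # each test selects just the subset it cares about
    granular = raw_eia176_data[raw_eia176_data.company != "total of all companies"]
    assert len(granular) == 2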
Ah sweet! That is the best outcome :)
In retrospect, yes I have seen that error message countless times when dealing with unstack. Just didn't put two and two together.
Overview
Closes #XXXX.
What problem does this address?
What did you change?
Testing
How did you make sure this worked? How can a reviewer verify this?
To-do list