Extract raw PHMSA distribution and start of transmission data (Table A-D, H, I) #2932
Conversation
It seems to all just work, which is très fun but makes sense after looking at it.
update the 860m doi
Doc updates can be viewed here: https://catalystcoop-pudl--2932.org.readthedocs.build/en/2932/data_sources/phmsagas.html
src/pudl/extract/phmsagas.py (outdated)
data. This means one page may get split into multiple tables by column. To prevent each table from containing all columns from these older years, filter by the list of columns specified for the page, with a warning.
This last sentence is still a bit confusing, but I think I'm starting to understand better. Is the idea that the content from one page (tab) gets split into multiple tables because the content in later years is spread across several pages (tabs)? And we don't want to retain duplicate columns across tables, where one would contain data and the other would be empty?
I tried to clarify again.
Looks good! Just a few minor, non-blocking comments. Still feel like it would be better as just `phmsa` instead of `phmsagas`, but it might be more work than it's worth right now.
…tive/pudl into phmsa-extractor
All looks pretty reasonable! Based on my potentially flawed understanding, I think there are some refactoring opportunities, but nothing critical...
Here's my understanding of things, let me know what I got wrong :)
When we want to get the data for a specific (page, year), we need to:
- download the resource based on the partition(s): we're using the `raw_df_factory`, which assumes yearly partitions; since we need other partitions to ID the resource to download, we use `page_part_map.csv`
- unzip the resource and look in the filename determined in `file_map.csv`
- within that filename, look on the sheet in `page_map.csv`
- within that sheet, rename columns based on `column_maps/page_name.csv` and the year (a rough sketch of this chain follows the list).
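To make that lookup chain concrete, here's a minimal sketch, assuming simplified CSV layouts and a column-map orientation (standardized name indexed, raw name per year column) that may not match the real `Metadata`/`GenericExtractor` machinery in `pudl.extract.excel`:

```python
# Hypothetical sketch of the (page, year) lookup chain; CSV layouts,
# paths, and the column-map orientation are assumptions.
import pandas as pd

def read_raw_page(page: str, year: int, settings_dir: str, data_dir: str) -> pd.DataFrame:
    # file_map.csv: which file inside the archive holds each page, per year.
    file_map = pd.read_csv(f"{settings_dir}/file_map.csv", index_col=0)
    filename = file_map.loc[page, str(year)]

    # page_map.csv: which Excel sheet within that file holds the page.
    page_map = pd.read_csv(f"{settings_dir}/page_map.csv", index_col=0)
    sheet = page_map.loc[page, str(year)]

    # column_maps/<page>.csv: standardized column name -> raw name, per year.
    col_map = pd.read_csv(f"{settings_dir}/column_maps/{page}.csv", index_col=0)
    raw_to_std = {raw: std for std, raw in col_map[str(year)].dropna().items()}

    # Read the sheet and rename raw columns to their standardized names.
    df = pd.read_excel(f"{data_dir}/{filename}", sheet_name=sheet)
    return df.rename(columns=raw_to_std)
```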
And there are actually two sets of partitions (examples below):
- ZIP resource partitions: we need to know these when we're downloading the zipfile. This includes year + form.
- asset partitions: no single table comes from both distribution AND transmission, so the only meaningful partition here is by year.
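For concreteness, the two notions might look like this (the dict values are illustrative, not taken from the actual settings):

```python
# ZIP resource partition: identifies one archived zipfile in the datastore.
zip_resource_partition = {"year": 2021, "form": "distribution"}

# Asset partition: identifies one slice of an output raw asset; only the
# year matters, since no table mixes distribution and transmission data.
asset_partition = {"year": 2021}
```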
Then the Excel extractor basically:
- takes a dataset name
- looks for all the years in that dataset's settings
- makes an extractor filtering for each of those years
- combines those into one dataframe at the end (see the sketch after this list)
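In other words, something like this hypothetical fan-out/fan-in (the real `raw_df_factory` wires this up as Dagster ops; `extractor.extract(year=...)` here stands in for the actual extraction call and is assumed to return `{page: DataFrame}`):

```python
# Sketch: extract each year separately, then concatenate per page.
import pandas as pd

def extract_all_years(extractor, years: list[int]) -> dict[str, pd.DataFrame]:
    pages: dict[str, list[pd.DataFrame]] = {}
    for year in years:
        # Assumed interface: one extraction call per yearly partition.
        for page, df in extractor.extract(year=year).items():
            pages.setdefault(page, []).append(df)
    # Combine each page's per-year frames into one dataframe at the end.
    return {page: pd.concat(dfs, ignore_index=True) for page, dfs in pages.items()}
```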
In the extractor, we assume that the resource partitions are always the same as the asset partitions plus the extra bits defined in `page_part_map.csv`, which is why we need to add those bits in the `self.zipfile_resource_partitions()` method.
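If that's right, the default presumably boils down to a dict merge along these lines (a sketch only; `self._metadata` appears in the code below, but the `get_page_part_map` accessor name is my assumption):

```python
def zipfile_resource_partitions(self, page, **partition) -> dict:
    # Assumed accessor: extra partition bits for this page from
    # page_part_map.csv, e.g. {"form": "distribution"}.
    extras = self._metadata.get_page_part_map(page)
    # Output (asset) partitions plus the extras identify the zipfile.
    return {**partition, **extras}
```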
If all that is correct, it might be nice to explicitly call out the "input partitions/zipfile resource partitions" and "output partitions/just straight-up years" as two separate important concepts in the `GenericExtractor` docs. But if I'm the only one who finds that framing enlightening, I'm happy to keep it filed away in my own brain 😅
"""Final processing stage applied to a page DataFrame.""" | ||
return df | ||
|
||
def zipfile_resource_partitions(self, page, **partition) -> dict: |
This returns the "input data" partition that tells us where to find the source data for a certain output partition, right?
And if we want to do weird custom logic here in a dataset-specific subclass of `GenericExtractor`, we could just override this method?
If so, it might be clearer if we rename the method + update the docstring to establish: "if you are looking to do some fancy mapping from input partitions to output partitions, for example if you output quarterly data but read in yearly files, override this method. The default is to take the output partitions + add any additional partitions defined in `page_part_map.csv`."
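A dataset-specific override along those lines might look like this (hypothetical quarterly dataset; `GenericExtractor` lives in `pudl.extract.excel`, but the `get_page_part_map` accessor is assumed):

```python
from pudl.extract.excel import GenericExtractor

class QuarterlyExtractor(GenericExtractor):
    """Hypothetical dataset with quarterly assets but yearly zipfiles."""

    def zipfile_resource_partitions(self, page, **partition) -> dict:
        # Map an output partition like {"quarter": "2021q3"} onto the
        # yearly zipfile containing it, keeping any page_part_map extras.
        year = int(partition.pop("quarter")[:4])
        return {"year": year, **partition, **self._metadata.get_page_part_map(page)}
```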
```python
def process_final_page(self, df, page):
    """Drop columns that get mapped to other assets.

    Older years of PHMSA data have one Excel tab in the raw data, while newer
    data has multiple tabs. To extract data into tables that follow the newer
    data format without duplicating the older data, we need to split older
    pages into multiple tables by column. To prevent each table from
    containing all columns from these older years, filter by the list of
    columns specified for the page, with a warning.
    """
    to_drop = [
        c
        for c in df.columns
        if c not in self._metadata.get_all_columns(page)
        and c not in self.cols_added
    ]
    if to_drop:
        logger.warning(
            f"Dropping columns {to_drop} that are not mapped to this asset."
        )
    df = df.drop(columns=to_drop, errors="ignore")
    return df
```
This doesn't seem like a bad idea overall, but it will result in the extractor needing to read the old Excel files for the same chunk of data to extract each of the new parts, right?
We could just read the old data as one big squished-together table and sort it out during the transform step. Although I do like the idea of not needing to re-org quite as much during the transform step.
The parts of the form still exist in the earlier years; they're just all included in one giant Excel sheet instead of being split into separate Excel sheets as in the later years. My preference is to form the raw table that we want with the columns that belong to it, with that design being informed by the most recent years of data, and to use the transform step to do cleaning. I'm subclassing this method since I think this is a stricter check than we want for the other data sources, but I think it is relatively consistent with how we've handled column movements across tabs in other datasets over time.
In this PR, adapt the generic Excel extractor to work on PHMSA datasets. The goal of this PR is to extract raw PHMSA distribution and selected transmission (tables A-D, H, I) data and create corresponding raw assets in PUDL.
This includes:
- `years` partitions in `pudl_datastore` (in concert with PR #167 in the `pudl-archiver` repo)
- `process_final_page` (drops columns that get mapped to other assets)
- `docs.data_sources.phmsagas`
Still to do:
- use `form`, not `dataset`, to avoid variable clashes
- add listed `years` partitions in PHMSA archiver: add multi-year years partition, add new dataset (pudl-archiver#200)
- `data_sources` docs
- README
Out of scope:
- Columns that get mapped to other assets are dropped in `process_final_page`, and are logged as a warning.

PR Checklist
- Merge the most recent version of the branch you are merging into (probably `dev`).