The first line of work for the Mozilla AI for Environmental Justice grant to perform record linkage between EIA utility data and SEC utility ownership (proposal here). This particular epic involves accessing and archiving the SEC Ex. 21 PDFs and integrating that work into PUDL.
Motivation
Utilities are often subsidiaries of a larger utility holding company. These ownership relationships reveal important power dynamics underlying utility behaviors: for example, which electric utilities are intimately linked to natural gas companies through ownership by the same parent company? Existing analyses of plant ownership (e.g. ClimateTrace’s global analysis, Little Sis's PowerLines project, the Energy Democracy Project) do not publish their source code, focus on a small subset of plants, or rely on entirely manual parsing of owner-subsidiary relationships.
Owner-to-subsidiary relationships are reported in Exhibit 21 (Ex. 21) of publicly traded companies’ 10-K filings with the Securities and Exchange Commission (SEC). While 10-K filings are reported in XBRL, Ex. 21 is instead distributed as an unstructured attachment to the form. This attachment lacks a standard layout and is often a PDF or text file, inhibiting analyses at larger geographic or temporal scales. While other open-source tools have previously been created to extract Ex. 21 data, to our knowledge none are complete or still maintained. A popular example, CorpWatch’s dataset, has a codebase that has not been maintained since 2010 and is missing crucial data fields.
We propose to address this gap by extracting Ex. 21 data using automated unstructured data extraction models. Next, we will use entity resolution models to connect Ex. 21 data to EIA utilities data, as these datasets refer to the same utility entities, but lack a join key. Building on prior record linkages between EIA, FERC and EPA data in PUDL, this project will connect parent companies to a wealth of power system data, such as annual hourly emissions data and detailed breakdowns of utility investments.
Scope
How do we know when we are done?
We have a database of Ex. 21 PDFs that are ready to be OCR'ed
We have metadata about which companies we have Ex. 21s for, and about our overall coverage of all Ex. 21 filings
Regular archiving is performed
The archiver is integrated into PUDL
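To make the coverage criterion above concrete, here is a minimal sketch of how per-year coverage could be computed from archive metadata. The `FilingRecord` fields and the `coverage_by_year` helper are hypothetical names chosen for illustration; they are not an existing PUDL or archiver schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FilingRecord:
    """Hypothetical metadata row for one 10-K filing."""

    company_id: str  # hypothetical identifier for the filer
    year: int  # filing year
    has_ex21: bool  # whether an Ex. 21 attachment was archived


def coverage_by_year(records):
    """Return {year: fraction of 10-K filings with an archived Ex. 21}."""
    totals = defaultdict(int)
    archived = defaultdict(int)
    for rec in records:
        totals[rec.year] += 1
        if rec.has_ex21:
            archived[rec.year] += 1
    # Years appear only if at least one filing was seen, so no division by zero.
    return {year: archived[year] / totals[year] for year in sorted(totals)}
```

A table like this, keyed by year (or by company), would be one way to report which filings are still missing from the archive.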
What is out of scope?
Models that extract data from the PDFs into machine-readable formats, including OCR models
Record linkage to EIA
Anything else?
Is there future work described in a google doc or epic?
See Mozilla EJ for AI folder in drive
Anything special about this epic? Super high priority? Things that might not work? Parts that could balloon?
We definitely anticipate problems with IP-based request limits. Start by looking at options for bulk downloads, since those would avoid the issue.
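If bulk downloads turn out not to be available and the archiver has to fetch filings one at a time, spacing out requests on the client side is one way to reduce the risk of being blocked. The sketch below is a generic throttle and not an existing archiver component. The class name `RequestThrottle` and the rate of 5 requests per second are assumptions for illustration, and the actual limit should be taken from the SEC's current EDGAR access guidance.

```python
import time
from typing import Callable, Optional


class RequestThrottle:
    """Space out calls so that at most `max_per_second` start per second."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        # Earliest time at which the next request may start.
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay


# Hypothetical usage: call throttle.wait() before each download request.
throttle = RequestThrottle(max_per_second=5)  # placeholder rate, an assumption
```

The clock and sleep functions are injectable so the throttle can be tested without real delays or network access.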