Merge branch 'main' into fix_op_factory

zschira committed Jan 17, 2024
2 parents f90dd25 + c7010e5 commit 80f7b29
Showing 39 changed files with 2,774 additions and 2,273 deletions.
1 change: 1 addition & 0 deletions .github/workflows/pytest.yml
@@ -172,6 +172,7 @@ jobs:
      - name: Run integration tests, trying to use GCS cache if possible
        run: |
          pip install --no-deps --editable .
+         pudl_datastore --dataset epacems --partition year_quarter=2022q1
          make pytest-integration
      - name: Upload coverage
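The new CI step above can also be reproduced locally to warm the datastore cache before
running the integration tests. A minimal sketch, assuming the PUDL package and its
pudl_datastore console script are installed:

    import subprocess

    # Pre-fetch one quarter of EPA CEMS data so the integration tests find it in
    # the local datastore cache instead of downloading it mid-run.
    subprocess.run(
        ["pudl_datastore", "--dataset", "epacems", "--partition", "year_quarter=2022q1"],
        check=True,  # raise CalledProcessError if the fetch fails
    )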
2 changes: 1 addition & 1 deletion .github/workflows/release.yml
@@ -104,10 +104,10 @@ jobs:
        git push
  notify-slack:
-   if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v20')
    runs-on: ubuntu-latest
    needs:
      - publish-github
+   if: ${{ always() }}
    steps:
      - name: Inform the Codemonkeys
        uses: 8398a7/action-slack@v3
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -29,7 +29,7 @@ repos:
  # Formatters: hooks that re-write Python & documentation files
  ####################################################################################
  - repo: https://github.com/astral-sh/ruff-pre-commit
-   rev: v0.1.9
+   rev: v0.1.13
    hooks:
      - id: ruff
        args: [--fix, --exit-non-zero-on-fix]
90 changes: 48 additions & 42 deletions README.rst
@@ -42,48 +42,54 @@ What is PUDL?
The `PUDL <https://catalyst.coop/pudl/>`__ Project is an open source data processing
pipeline that makes US energy data easier to access and use programmatically.

-Hundreds of gigabytes of valuable data are published by US government agencies, but
-it's often difficult to work with. PUDL takes the original spreadsheets, CSV files,
-and databases and turns them into a unified resource. This allows users to spend more
-time on novel analysis and less time on data preparation.
+Hundreds of gigabytes of valuable data are published by US government agencies, but it's
+often difficult to work with. PUDL takes the original spreadsheets, CSV files, and
+databases and turns them into a unified resource. This allows users to spend more time
+on novel analysis and less time on data preparation.

The project is focused on serving researchers, activists, journalists, policy makers,
-and small businesses that might not otherwise be able to afford access to this data
-from commercial sources and who may not have the time or expertise to do all the
-data processing themselves from scratch.
+and small businesses that might not otherwise be able to afford access to this data from
+commercial sources and who may not have the time or expertise to do all the data
+processing themselves from scratch.

We want to make this data accessible and easy to work with for as wide an audience as
-possible: anyone from a grassroots youth climate organizers working with Google
-sheets to university researchers with access to scalable cloud computing
-resources and everyone in between!
+possible: anyone from grassroots youth climate organizers working with Google Sheets
+to university researchers with access to scalable cloud computing resources and everyone
+in between!

PUDL comprises three core components:

-- **Raw Data Archives**
-
-  - PUDL `archives <https://github.com/catalyst-cooperative/pudl-archiver>`__
-    all the raw data inputs on `Zenodo <https://zenodo.org/communities/catalyst-cooperative/?page=1&size=20>`__
-    to ensure perminant, versioned access to the data. In the event that an agency
-    changes how they publish data or deletes old files, the ETL will still have access
-    to the original inputs. Each of the data inputs may have several different versions
-    archived, and all are assigned a unique DOI and made available through the REST API.
-    You can read more about the Raw Data Archives in the
-    `docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/#raw-data-archives>`__.
-- **ETL Pipeline**
-
-  - The ETL pipeline (this repo) ingests the raw archives, cleans them,
-    integrates them, and outputs them to a series of tables stored in SQLite Databases,
-    Parquet files, and pickle files (the Data Warehouse). Each release of the PUDL
-    Python package is embedded with a set of of DOIs to indicate which version of the
-    raw inputs it is meant to process. This process helps ensure that the ETL and it's
-    outputs are replicable. You can read more about the ETL in the
-    `docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/#the-etl-process>`__.
-- **Data Warehouse**
-
-  - The outputs from the ETL, sometimes called "PUDL outputs",
-    are stored in a data warehouse as a collection of SQLite and Parquet files so that
-    users can access the data without having to run any code. Learn more about how to
-    access the data `here <https://catalystcoop-pudl.readthedocs.io/en/nightly/data_access.html>`__.
+Raw Data Archives
+^^^^^^^^^^^^^^^^^
+PUDL `archives <https://github.com/catalyst-cooperative/pudl-archiver>`__ all our raw
+inputs on `Zenodo
+<https://zenodo.org/communities/catalyst-cooperative/?page=1&size=20>`__ to ensure
+permanent, versioned access to the data. In the event that an agency changes how they
+publish data or deletes old files, the data processing pipeline will still have access
+to the original inputs. Each of the data inputs may have several different versions
+archived, and all are assigned a unique DOI (digital object identifier) and made
+available through Zenodo's REST API. You can read more about the Raw Data Archives in
+the `docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/#raw-data-archives>`__.
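As a concrete illustration of what "available through Zenodo's REST API" means, a
minimal sketch that lists a few archived records; the endpoint, query parameters, and
response field names are assumptions based on Zenodo's public search API, not anything
PUDL-specific:

    import requests

    # Search the Catalyst Cooperative community on Zenodo for archived raw inputs.
    resp = requests.get(
        "https://zenodo.org/api/records",
        params={"communities": "catalyst-cooperative", "size": 5},
        timeout=30,
    )
    resp.raise_for_status()
    for hit in resp.json()["hits"]["hits"]:
        # Each archived input carries its own versioned DOI.
        print(hit["doi"], "-", hit["metadata"]["title"])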

+Data Pipeline
+^^^^^^^^^^^^^
+The data pipeline (this repo) ingests raw data from the archives, cleans and integrates
+it, and writes the resulting tables to `SQLite <https://sqlite.org>`__ and `Apache
+Parquet <https://parquet.apache.org/>`__ files, with some accompanying metadata stored
+as JSON. Each release of the PUDL software contains a set of DOIs indicating which
+versions of the raw inputs it processes. This helps ensure that the outputs are
+replicable. You can read more about our ETL (extract, transform, load) process in the
+`PUDL documentation <https://catalystcoop-pudl.readthedocs.io/en/nightly/#the-etl-process>`__.

+Data Warehouse
+^^^^^^^^^^^^^^
+The SQLite, Parquet, and JSON outputs from the data pipeline, sometimes called "PUDL
+outputs", are updated each night by an automated build process, and periodically
+archived so that users can access the data without having to install and run our data
+processing system. These outputs contain hundreds of tables and comprise a small
+file-based data warehouse that can be used for a variety of energy system analyses.
+Learn more about `how to access the PUDL data
+<https://catalystcoop-pudl.readthedocs.io/en/nightly/data_access.html>`__.
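To give a sense of how the warehouse can be explored without any PUDL code, a minimal
sketch that lists the tables in a downloaded copy of the main database; the
``pudl.sqlite`` file name is an assumption based on the data access docs:

    import sqlite3

    # Point this at wherever the decompressed database was saved.
    conn = sqlite3.connect("pudl.sqlite")
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    print(f"{len(tables)} tables available, for example:")
    for (name,) in tables[:10]:
        print(" ", name)
    conn.close()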

What data is available?
-----------------------
@@ -98,23 +104,23 @@ PUDL currently integrates data from:
* **EIA Form 861**: 2001-2022
- `Source Docs <https://www.eia.gov/electricity/data/eia861/>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/data_sources/eia861.html>`__
-* **EIA Form 923**: 2001-2022
+* **EIA Form 923**: 2001-2023
- `Source Docs <https://www.eia.gov/electricity/data/eia923/>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/data_sources/eia923.html>`__
-* **EPA Continuous Emissions Monitoring System (CEMS)**: 1995-2022
+* **EPA Continuous Emissions Monitoring System (CEMS)**: 1995Q1-2023Q3
- `Source Docs <https://campd.epa.gov/>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/data_sources/epacems.html>`__
-* **FERC Form 1**: 1994-2021
+* **FERC Form 1**: 1994-2022
- `Source Docs <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/data_sources/ferc1.html>`__
-* **FERC Form 714**: 2006-2020
+* **FERC Form 714**: 2006-2022 (mostly raw)
- `Source Docs <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-no-714-annual-electric/data>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/data_sources/ferc714.html>`__
-* **FERC Form 2**: 2021 (raw only)
+* **FERC Form 2**: 1996-2022 (raw only)
- `Source Docs <https://www.ferc.gov/industries-data/natural-gas/industry-forms/form-2-2a-3-q-gas-historical-vfp-data>`__
-* **FERC Form 6**: 2021 (raw only)
+* **FERC Form 6**: 2000-2022 (raw only)
- `Source Docs <https://www.ferc.gov/general-information-1/oil-industry-forms/form-6-6q-historical-vfp-data>`__
-* **FERC Form 60**: 2021 (raw only)
+* **FERC Form 60**: 2006-2022 (raw only)
- `Source Docs <https://www.ferc.gov/form-60-annual-report-centralized-service-companies>`__
* **US Census Demographic Profile 1 Geodatabase**: 2010
- `Source Docs <https://www.census.gov/geographies/mapping-files/2010/geo/tiger-data.html>`__
20 changes: 17 additions & 3 deletions devtools/datasette/publish.py
@@ -110,8 +110,19 @@ def metadata(pudl_out: Path) -> str:
    flag_value="metadata",
    help="Generate the Datasette metadata.yml in current directory, but do not deploy.",
)
-def deploy_datasette(deploy: str) -> int:
-    """Generate deployment files and run the deploy."""
+@click.argument(
+    "fly_args",
+    required=False,
+    nargs=-1,
+)
+def deploy_datasette(deploy: str, fly_args: tuple[str]) -> int:
+    """Generate deployment files and deploy Datasette either locally or to fly.io.
+
+    Any additional arguments after -- will be passed through to flyctl if deploying to
+    fly.io. E.g. the following would build outputs for fly.io, but not actually deploy:
+
+    python publish.py --fly -- --build-only
+    """
    pudl_out = PudlPaths().pudl_output
    metadata_yml = metadata(pudl_out)
    # Order the databases to highlight PUDL
@@ -148,7 +159,10 @@ def deploy_datasette(deploy: str) -> int:
)

logging.info("Running fly deploy...")
-    check_call(["/usr/bin/env", "flyctl", "deploy"], cwd=fly_dir)  # noqa: S603
+    cmd = ["/usr/bin/env", "flyctl", "deploy"]
+    if fly_args:
+        cmd = cmd + list(fly_args)
+    check_call(cmd, cwd=fly_dir)  # noqa: S603
logging.info("Deploy finished!")

elif deploy == "local":
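The pass-through mechanism added above is a standard Click pattern: a trailing variadic
argument collects every token after ``--`` so it can be appended to a subprocess
command. A self-contained sketch of the same idea, wrapping a hypothetical echo command
rather than the real publish script (assumes a Unix-like system):

    import subprocess

    import click


    @click.command()
    @click.argument("extra_args", required=False, nargs=-1)
    def wrapper(extra_args: tuple[str, ...]) -> None:
        """Run echo, passing through any arguments given after --."""
        cmd = ["echo", "hello"]
        if extra_args:
            cmd = cmd + list(extra_args)
        subprocess.run(cmd, check=True)


    if __name__ == "__main__":
        wrapper()

Invoked as ``python wrapper.py -- --world``, Click stops option parsing at ``--`` and
hands ``--world`` through to the subprocess, just as publish.py now does with flyctl.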
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -137,7 +137,7 @@ def data_dictionary_metadata_to_rst(app):
"""Export data dictionary metadata to RST for inclusion in the documentation."""
# Create an RST Data Dictionary for the PUDL DB:
print("Exporting PUDL DB data dictionary metadata to RST.")
skip_names = ["datasets", "accumulated_depreciation_ferc1", "entity_types_eia"]
skip_names = ["datasets", "accumulated_depreciation_ferc1"]
names = [name for name in RESOURCE_METADATA if name not in skip_names]
package = Package.from_resource_ids(resource_ids=tuple(sorted(names)))
# Sort fields within each resource by name:
16 changes: 8 additions & 8 deletions docs/data_access.rst
@@ -144,18 +144,18 @@ HTTPS using the following links:
* `FERC-714 Datapackage (JSON) describing SQLite derived from XBRL <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/ferc714_xbrl_datapackage.json>`__
* `FERC-714 XBRL Taxonomy Metadata as JSON (2021-2022) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/ferc714_xbrl_taxonomy_metadata.json>`__

-.. note::
-
-   To reduce network transfer times, we ``gzip`` the SQLite database files, which can
-   be quite large when uncompressed. To decompress them locally, at the command line
-   on Linux, MacOS, or Windows you can use the ``gunzip`` command.
+To reduce network transfer times, we compress the SQLite databases using ``gzip``. To
+decompress them locally, at the command line on Linux, MacOS, or Windows you can use the
+``gunzip`` command. (Git for Windows installs ``gzip`` / ``gunzip`` by default, and it
+can also be installed using the conda package manager).

-   .. code-block:: console
+.. code-block:: console

-      $ gunzip *.sqlite.gz
+   $ gunzip *.sqlite.gz

-On Windows you can also use a 3rd party tool like
-`7zip <https://www.7-zip.org/download.html>`__.
+If you're not familiar with using Unix command line tools in Windows you can also use a
+3rd party tool like `7zip <https://www.7-zip.org/download.html>`__.
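The same decompression can also be done from Python with only the standard library; a
minimal sketch, assuming the downloaded ``.sqlite.gz`` files sit in the current
directory:

    import gzip
    import shutil
    from pathlib import Path

    # Equivalent to `gunzip *.sqlite.gz`, except the compressed files are kept.
    for gz_path in Path(".").glob("*.sqlite.gz"):
        with gzip.open(gz_path, "rb") as src:
            with open(gz_path.with_suffix(""), "wb") as dst:
                shutil.copyfileobj(src, dst)  # stream; avoids loading the whole DB into RAM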

.. _access-zenodo:

10 changes: 10 additions & 0 deletions docs/release_notes.rst
@@ -24,13 +24,23 @@ v2024.01.XX
protected and automatically updated. Build outputs are now written to
``gs://builds.catalyst.coop`` and retained for 30 days. See issues :issue:`3140,3179`
and PRs :pr:`3195,3206,3212`
+* The :mod:`pudl.analysis.record_linkage.eia_ferc1_record_linkage` module has been
+  refactored to use PUDL record linkage infrastructure and include extra cleaning
+  steps. This resulted in around 500 (about 2%) of the matches changing.

Data Coverage
^^^^^^^^^^^^^
* Updated :doc:`data_sources/epacems` to switch to pulling the quarterly updates of
CEMS instead of the annual files. Integrates CEMS through 2023q3. See issue
:issue:`2973` & PR :pr:`3096`.

+Data Cleaning
+^^^^^^^^^^^^^
+
+* Filled in null annual balances with fourth-quarter quarterly balances in
+  :ref:`core_ferc1__yearly_balance_sheet_liabilities_sched110`. :issue:`3233` and
+  :pr:`3234`.

---------------------------------------------------------------------------------------
v2023.12.01
---------------------------------------------------------------------------------------
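To illustrate the balance backfill described in the Data Cleaning note above, a minimal
pandas sketch with hypothetical column names (this is not the actual PUDL transform):

    import pandas as pd

    df = pd.DataFrame(
        {
            "utility_id": [1, 1, 2],
            "report_year": [2021, 2022, 2022],
            "annual_ending_balance": [100.0, None, None],  # annual filing, sometimes null
            "q4_ending_balance": [100.0, 250.0, 75.0],  # fourth-quarter filing
        }
    )
    # Where the annual balance is missing, fall back to the Q4 quarterly balance.
    df["annual_ending_balance"] = df["annual_ending_balance"].fillna(
        df["q4_ending_balance"]
    )
    print(df)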