Merge branch 'main' into fix_op_factory

zschira committed Jan 17, 2024
2 parents f90dd25 + c7010e5 commit 80f7b29
Showing 39 changed files with 2,774 additions and 2,273 deletions.
1 change: 1 addition & 0 deletions .github/workflows/pytest.yml
@@ -172,6 +172,7 @@ jobs:
      - name: Run integration tests, trying to use GCS cache if possible
        run: |
          pip install --no-deps --editable .
+         pudl_datastore --dataset epacems --partition year_quarter=2022q1
          make pytest-integration
      - name: Upload coverage
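The new CI step above can also be reproduced locally to warm the datastore cache before
running the integration tests. A minimal sketch, assuming the PUDL package and its
pudl_datastore console script are installed:

    import subprocess

    # Pre-fetch one quarter of EPA CEMS data so the integration tests find it in
    # the local datastore cache instead of downloading it mid-run.
    subprocess.run(
        ["pudl_datastore", "--dataset", "epacems", "--partition", "year_quarter=2022q1"],
        check=True,  # raise CalledProcessError if the fetch fails
    )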
2 changes: 1 addition & 1 deletion .github/workflows/release.yml
@@ -104,10 +104,10 @@ jobs:
        git push
  notify-slack:
-   if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v20')
    runs-on: ubuntu-latest
    needs:
      - publish-github
+   if: ${{ always() }}
    steps:
      - name: Inform the Codemonkeys
        uses: 8398a7/action-slack@v3
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -29,7 +29,7 @@ repos:
  # Formatters: hooks that re-write Python & documentation files
  ####################################################################################
  - repo: https://github.com/astral-sh/ruff-pre-commit
-   rev: v0.1.9
+   rev: v0.1.13
    hooks:
      - id: ruff
        args: [--fix, --exit-non-zero-on-fix]
90 changes: 48 additions & 42 deletions README.rst
@@ -42,48 +42,54 @@ What is PUDL?
The `PUDL <https://catalyst.coop/pudl/>`__ Project is an open source data processing
pipeline that makes US energy data easier to access and use programmatically.

-Hundreds of gigabytes of valuable data are published by US government agencies, but
-it's often difficult to work with. PUDL takes the original spreadsheets, CSV files,
-and databases and turns them into a unified resource. This allows users to spend more
-time on novel analysis and less time on data preparation.
+Hundreds of gigabytes of valuable data are published by US government agencies, but it's
+often difficult to work with. PUDL takes the original spreadsheets, CSV files, and
+databases and turns them into a unified resource. This allows users to spend more time
+on novel analysis and less time on data preparation.

The project is focused on serving researchers, activists, journalists, policy makers,
-and small businesses that might not otherwise be able to afford access to this data
-from commercial sources and who may not have the time or expertise to do all the
-data processing themselves from scratch.
+and small businesses that might not otherwise be able to afford access to this data from
+commercial sources and who may not have the time or expertise to do all the data
+processing themselves from scratch.

We want to make this data accessible and easy to work with for as wide an audience as
-possible: anyone from a grassroots youth climate organizers working with Google
-sheets to university researchers with access to scalable cloud computing
-resources and everyone in between!
+possible: anyone from grassroots youth climate organizers working with Google Sheets
+to university researchers with access to scalable cloud computing resources and everyone
+in between!

PUDL comprises three core components:

-- **Raw Data Archives**
-
-  - PUDL `archives <https://github.com/catalyst-cooperative/pudl-archiver>`__
-    all the raw data inputs on `Zenodo <https://zenodo.org/communities/catalyst-cooperative/?page=1&size=20>`__
-    to ensure perminant, versioned access to the data. In the event that an agency
-    changes how they publish data or deletes old files, the ETL will still have access
-    to the original inputs. Each of the data inputs may have several different versions
-    archived, and all are assigned a unique DOI and made available through the REST API.
-    You can read more about the Raw Data Archives in the
-    `docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/#raw-data-archives>`__.
-- **ETL Pipeline**
-
-  - The ETL pipeline (this repo) ingests the raw archives, cleans them,
-    integrates them, and outputs them to a series of tables stored in SQLite Databases,
-    Parquet files, and pickle files (the Data Warehouse). Each release of the PUDL
-    Python package is embedded with a set of of DOIs to indicate which version of the
-    raw inputs it is meant to process. This process helps ensure that the ETL and it's
-    outputs are replicable. You can read more about the ETL in the
-    `docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/#the-etl-process>`__.
-- **Data Warehouse**
-
-  - The outputs from the ETL, sometimes called "PUDL outputs",
-    are stored in a data warehouse as a collection of SQLite and Parquet files so that
-    users can access the data without having to run any code. Learn more about how to
-    access the data `here <https://catalystcoop-pudl.readthedocs.io/en/nightly/data_access.html>`__.
+Raw Data Archives
+^^^^^^^^^^^^^^^^^
+PUDL `archives <https://github.com/catalyst-cooperative/pudl-archiver>`__ all our raw
+inputs on `Zenodo
+<https://zenodo.org/communities/catalyst-cooperative/?page=1&size=20>`__ to ensure
+permanent, versioned access to the data. In the event that an agency changes how they
+publish data or deletes old files, the data processing pipeline will still have access
+to the original inputs. Each of the data inputs may have several different versions
+archived, and all are assigned a unique DOI (digital object identifier) and made
+available through Zenodo's REST API. You can read more about the Raw Data Archives in
+the `docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/#raw-data-archives>`__.
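As a concrete illustration of what "available through Zenodo's REST API" means, a
minimal sketch that lists a few archived records; the endpoint, query parameters, and
response field names are assumptions based on Zenodo's public search API, not anything
PUDL-specific:

    import requests

    # Search the Catalyst Cooperative community on Zenodo for archived raw inputs.
    resp = requests.get(
        "https://zenodo.org/api/records",
        params={"communities": "catalyst-cooperative", "size": 5},
        timeout=30,
    )
    resp.raise_for_status()
    for hit in resp.json()["hits"]["hits"]:
        # Each archived input carries its own versioned DOI.
        print(hit["doi"], "-", hit["metadata"]["title"])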

+Data Pipeline
+^^^^^^^^^^^^^
+The data pipeline (this repo) ingests raw data from the archives, cleans and integrates
+it, and writes the resulting tables to `SQLite <https://sqlite.org>`__ and `Apache
+Parquet <https://parquet.apache.org/>`__ files, with some accompanying metadata stored
+as JSON. Each release of the PUDL software contains a set of DOIs indicating which
+versions of the raw inputs it processes. This helps ensure that the outputs are
+replicable. You can read more about our ETL (extract, transform, load) process in the
+`PUDL documentation <https://catalystcoop-pudl.readthedocs.io/en/nightly/#the-etl-process>`__.

+Data Warehouse
+^^^^^^^^^^^^^^
+The SQLite, Parquet, and JSON outputs from the data pipeline, sometimes called "PUDL
+outputs", are updated each night by an automated build process, and periodically
+archived so that users can access the data without having to install and run our data
+processing system. These outputs contain hundreds of tables and comprise a small
+file-based data warehouse that can be used for a variety of energy system analyses.
+Learn more about `how to access the PUDL data
+<https://catalystcoop-pudl.readthedocs.io/en/nightly/data_access.html>`__.
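To give a sense of how the warehouse can be explored without any PUDL code, a minimal
sketch that lists the tables in a downloaded copy of the main database; the
``pudl.sqlite`` file name is an assumption based on the data access docs:

    import sqlite3

    # Point this at wherever the decompressed database was saved.
    conn = sqlite3.connect("pudl.sqlite")
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    print(f"{len(tables)} tables available, for example:")
    for (name,) in tables[:10]:
        print(" ", name)
    conn.close()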

What data is available?
-----------------------
@@ -98,23 +104,23 @@ PUDL currently integrates data from:
* **EIA Form 861**: 2001-2022
- `Source Docs <https://www.eia.gov/electricity/data/eia861/>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/data_sources/eia861.html>`__
-* **EIA Form 923**: 2001-2022
+* **EIA Form 923**: 2001-2023
- `Source Docs <https://www.eia.gov/electricity/data/eia923/>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/data_sources/eia923.html>`__
-* **EPA Continuous Emissions Monitoring System (CEMS)**: 1995-2022
+* **EPA Continuous Emissions Monitoring System (CEMS)**: 1995Q1-2023Q3
- `Source Docs <https://campd.epa.gov/>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/data_sources/epacems.html>`__
-* **FERC Form 1**: 1994-2021
+* **FERC Form 1**: 1994-2022
- `Source Docs <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-1-electric-utility-annual>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/data_sources/ferc1.html>`__
-* **FERC Form 714**: 2006-2020
+* **FERC Form 714**: 2006-2022 (mostly raw)
- `Source Docs <https://www.ferc.gov/industries-data/electric/general-information/electric-industry-forms/form-no-714-annual-electric/data>`__
- `PUDL Docs <https://catalystcoop-pudl.readthedocs.io/en/nightly/data_sources/ferc714.html>`__
-* **FERC Form 2**: 2021 (raw only)
+* **FERC Form 2**: 1996-2022 (raw only)
- `Source Docs <https://www.ferc.gov/industries-data/natural-gas/industry-forms/form-2-2a-3-q-gas-historical-vfp-data>`__
-* **FERC Form 6**: 2021 (raw only)
+* **FERC Form 6**: 2000-2022 (raw only)
- `Source Docs <https://www.ferc.gov/general-information-1/oil-industry-forms/form-6-6q-historical-vfp-data>`__
-* **FERC Form 60**: 2021 (raw only)
+* **FERC Form 60**: 2006-2022 (raw only)
- `Source Docs <https://www.ferc.gov/form-60-annual-report-centralized-service-companies>`__
* **US Census Demographic Profile 1 Geodatabase**: 2010
- `Source Docs <https://www.census.gov/geographies/mapping-files/2010/geo/tiger-data.html>`__
20 changes: 17 additions & 3 deletions devtools/datasette/publish.py
@@ -110,8 +110,19 @@ def metadata(pudl_out: Path) -> str:
    flag_value="metadata",
    help="Generate the Datasette metadata.yml in current directory, but do not deploy.",
)
-def deploy_datasette(deploy: str) -> int:
-    """Generate deployment files and run the deploy."""
+@click.argument(
+    "fly_args",
+    required=False,
+    nargs=-1,
+)
+def deploy_datasette(deploy: str, fly_args: tuple[str]) -> int:
+    """Generate deployment files and deploy Datasette either locally or to fly.io.
+
+    Any additional arguments after -- will be passed through to flyctl if deploying to
+    fly.io. E.g. the following would build outputs for fly.io, but not actually deploy:
+
+    python publish.py --fly -- --build-only
+    """
    pudl_out = PudlPaths().pudl_output
    metadata_yml = metadata(pudl_out)
    # Order the databases to highlight PUDL
@@ -148,7 +159,10 @@ def deploy_datasette(deploy: str) -> int:
)

logging.info("Running fly deploy...")
-    check_call(["/usr/bin/env", "flyctl", "deploy"], cwd=fly_dir)  # noqa: S603
+    cmd = ["/usr/bin/env", "flyctl", "deploy"]
+    if fly_args:
+        cmd = cmd + list(fly_args)
+    check_call(cmd, cwd=fly_dir)  # noqa: S603
logging.info("Deploy finished!")

elif deploy == "local":
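The pass-through mechanism added above is a standard Click pattern: a trailing variadic
argument collects every token after ``--`` so it can be appended to a subprocess
command. A self-contained sketch of the same idea, wrapping a hypothetical echo command
rather than the real publish script (assumes a Unix-like system):

    import subprocess

    import click


    @click.command()
    @click.argument("extra_args", required=False, nargs=-1)
    def wrapper(extra_args: tuple[str, ...]) -> None:
        """Run echo, passing through any arguments given after --."""
        cmd = ["echo", "hello"]
        if extra_args:
            cmd = cmd + list(extra_args)
        subprocess.run(cmd, check=True)


    if __name__ == "__main__":
        wrapper()

Invoked as ``python wrapper.py -- --world``, Click stops option parsing at ``--`` and
hands ``--world`` through to the subprocess, just as publish.py now does with flyctl.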
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -137,7 +137,7 @@ def data_dictionary_metadata_to_rst(app):
"""Export data dictionary metadata to RST for inclusion in the documentation."""
# Create an RST Data Dictionary for the PUDL DB:
print("Exporting PUDL DB data dictionary metadata to RST.")
skip_names = ["datasets", "accumulated_depreciation_ferc1", "entity_types_eia"]
skip_names = ["datasets", "accumulated_depreciation_ferc1"]
names = [name for name in RESOURCE_METADATA if name not in skip_names]
package = Package.from_resource_ids(resource_ids=tuple(sorted(names)))
# Sort fields within each resource by name:
16 changes: 8 additions & 8 deletions docs/data_access.rst
@@ -144,18 +144,18 @@ HTTPS using the following links:
* `FERC-714 Datapackage (JSON) describing SQLite derived from XBRL <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/ferc714_xbrl_datapackage.json>`__
* `FERC-714 XBRL Taxonomy Metadata as JSON (2021-2022) <https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/ferc714_xbrl_taxonomy_metadata.json>`__

-.. note::
-
-   To reduce network transfer times, we ``gzip`` the SQLite database files, which can
-   be quite large when uncompressed. To decompress them locally, at the command line
-   on Linux, MacOS, or Windows you can use the ``gunzip`` command.
+To reduce network transfer times, we compress the SQLite databases using ``gzip``. To
+decompress them locally, at the command line on Linux, MacOS, or Windows you can use the
+``gunzip`` command. (Git for Windows installs ``gzip`` / ``gunzip`` by default, and it
+can also be installed using the conda package manager).

-   .. code-block:: console
+.. code-block:: console

-      $ gunzip *.sqlite.gz
+   $ gunzip *.sqlite.gz

-On Windows you can also use a 3rd party tool like
-`7zip <https://www.7-zip.org/download.html>`__.
+If you're not familiar with using Unix command line tools in Windows you can also use a
+3rd party tool like `7zip <https://www.7-zip.org/download.html>`__.
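The same decompression can also be done from Python with only the standard library; a
minimal sketch, assuming the downloaded ``.sqlite.gz`` files sit in the current
directory:

    import gzip
    import shutil
    from pathlib import Path

    # Equivalent to `gunzip *.sqlite.gz`, except the compressed files are kept.
    for gz_path in Path(".").glob("*.sqlite.gz"):
        with gzip.open(gz_path, "rb") as src:
            with open(gz_path.with_suffix(""), "wb") as dst:
                shutil.copyfileobj(src, dst)  # stream; avoids loading the whole DB into RAM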

.. _access-zenodo:

10 changes: 10 additions & 0 deletions docs/release_notes.rst
@@ -24,13 +24,23 @@ v2024.01.XX
protected and automatically updated. Build outputs are now written to
``gs://builds.catalyst.coop`` and retained for 30 days. See issues :issue:`3140,3179`
and PRs :pr:`3195,3206,3212`
+* The :mod:`pudl.analysis.record_linkage.eia_ferc1_record_linkage` module has been
+  refactored to use PUDL record linkage infrastructure and include extra cleaning
+  steps. This resulted in around 500 (about 2%) of the matches changing.

Data Coverage
^^^^^^^^^^^^^
* Updated :doc:`data_sources/epacems` to switch to pulling the quarterly updates of
CEMS instead of the annual files. Integrates CEMS through 2023q3. See issue
:issue:`2973` & PR :pr:`3096`.

+Data Cleaning
+^^^^^^^^^^^^^
+
+* Filled in null annual balances with fourth-quarter quarterly balances in
+  :ref:`core_ferc1__yearly_balance_sheet_liabilities_sched110`. :issue:`3233` and
+  :pr:`3234`.

---------------------------------------------------------------------------------------
v2023.12.01
---------------------------------------------------------------------------------------
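To illustrate the balance backfill described in the Data Cleaning note above, a minimal
pandas sketch with hypothetical column names (this is not the actual PUDL transform):

    import pandas as pd

    df = pd.DataFrame(
        {
            "utility_id": [1, 1, 2],
            "report_year": [2021, 2022, 2022],
            "annual_ending_balance": [100.0, None, None],  # annual filing, sometimes null
            "q4_ending_balance": [100.0, 250.0, 75.0],  # fourth-quarter filing
        }
    )
    # Where the annual balance is missing, fall back to the Q4 quarterly balance.
    df["annual_ending_balance"] = df["annual_ending_balance"].fillna(
        df["q4_ending_balance"]
    )
    print(df)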