Update docs for 2009-2010 EIA 860 integration.
zaneselvans committed Feb 17, 2020
1 parent e07aa8c commit df20192
Showing 2 changed files with 52 additions and 69 deletions.
docs/data_catalog.rst (19 additions & 20 deletions)

@@ -25,24 +25,20 @@ EIA Form 860
 =================== ===========================================================
 Source URL          https://www.eia.gov/electricity/data/eia860/
 Source Format       Microsoft Excel (.xls/.xlsx)
-Source Years        2001-2017
+Source Years        2001-2018
 Size (Download)     127 MB
 Size (Uncompressed) 247 MB
 PUDL Code           ``eia860``
-Years Liberated     2011-2018
-Records Liberated   ~500,000
-Issues              `open issues labeled epacems <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aeia860>`__
+Years Liberated     2009-2018
+Records Liberated   ~600,000
+Issues              `open EIA 860 issues <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aeia860>`__
 =================== ===========================================================
 
 Nearly all of the data reported to the EIA on Form 860 is being pulled into the
-PUDL database for the years 2011-2018.
-
-We are working on integrating the 2009-2010 EIA 860 data, which has a similar
-format. This will give us the same coverage in both EIA 860 and EIA 923, which
-is good since the two datasets are tightly integrated.
-
-Currently we are extending the 2011 EIA 860 data back to 2009 as needed to
-integrate it with EIA 923.
+PUDL database for the years 2009-2018. This data is tightly integrated with
+the EIA 923 data, for which we integrate the same set of years. We do not
+anticipate integrating EIA 860 data from before 2009 at this time, but if
+you need that data, let us know.
 
 .. _data-eia923:
 
@@ -57,15 +53,18 @@ Size (Download)     196 MB
 Size (Uncompressed) 299 MB
 PUDL Code           ``eia923``
 Years Liberated     2009-2018
-Records Liberated   ~2 million
-Issues              `open issues labeled epacems <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aeia923>`__
+Records Liberated   ~3.2 million
+Issues              `open EIA 923 issues <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aeia923>`__
 =================== ===========================================================
 
 Nearly all of EIA Form 923 is being pulled into the PUDL database, for years
-2009-2017. Earlier data is available from EIA, but the reporting format for
+2009-2018. Earlier data is available from EIA, but the reporting format for
 earlier years is substantially different from the present day, and will require
-more work to integrate. Monthly year to date releases are not yet being
-integrated.
+more work to integrate. Monthly year-to-date releases are not yet being
+integrated, and only larger utilities are required to make monthly reports.
+
+We have not yet integrated tables reporting fuel stocks on hand, data from
+Puerto Rico, or EIA 923 schedules 6, 7, and 8.
 
 .. _data-epacems:
 

@@ -81,7 +80,7 @@ Size (Uncompressed) ~100 GB
 PUDL Code           ``epacems``
 Years Liberated     1995-2018
 Records Liberated   ~1 billion
-Issues              `open issues labeled epacems <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aepacems>`__
+Issues              `open EPA CEMS issues <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aepacems>`__
 =================== ===========================================================
 
 All of the EPA's hourly Continuous Emissions Monitoring System (CEMS) data is
@@ -112,7 +111,7 @@ Size (Uncompressed) 14 MB
 PUDL Code           ``epaipm``
 Years Liberated     N/A
 Records Liberated   ~650,000
-Issues              `open issues labeled epacems <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aepaipm>`__
+Issues              `open EPA IPM issues <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aepaipm>`__
 =================== ===========================================================
 
 .. todo::
@@ -130,11 +129,11 @@ Source URL          https://www.ferc.gov/docs-filing/forms/form-1/data.asp
 Source Format       FoxPro Database (.DBC/.DBF)
 Source Years        1994-2018
 Size (Download)     1.4 GB
-Size (Uncompressed) 2.5 GB
+Size (Uncompressed) 10 GB
 PUDL Code           ``ferc1``
 Years Liberated     1994-2018
 Records Liberated   ~12 million (116 raw tables), ~280,000 (7 clean tables)
-Issues              `open issues labeled <https://github.com/catalyst-cooperative/pudl/issues?q=is%3Aissue+is%3Aopen+label%3Aferc1>`__
+Issues              `open FERC Form 1 issues <https://github.com/catalyst-cooperative/pudl/issues?q=is%3Aissue+is%3Aopen+label%3Aferc1>`__
 =================== ===========================================================
 
 The FERC Form 1 database consists of 116 data tables containing ~8GB of data,
docs/datapackages.rst (33 additions & 49 deletions)

@@ -18,51 +18,35 @@ We hope this will allow the data to reach the widest possible audience.
 specifications, a project of
 `the Open Knowledge Foundation <https://okfn.org>`__
 
 -------------------------------------------------------------------------------
 Downloading Data Packages
 -------------------------------------------------------------------------------
 
-.. note::
-
-    Release v0.3.0 of the ``catalystcoop.pudl`` package will be used to
-    generate tabular datapackages for distribution. You will be able to find
-    them listed on the `Catalyst Cooperative Community page on Zenodo <https://zenodo.org/communities/catalyst-cooperative/>`__
-
-Our intent is to automate the creation of a standard bundle of data packages
-containing all of the currently integrated data. Users who aren't working with
-Python, or who don't want to set up and run the data processing pipeline
-themselves will be able to just download and use the data packages directly.
-Each data release will be issued a DOI, and archived at Zenodo, and may be
-made available in other ways as well.
-
-Zenodo
-^^^^^^
-
-Every PUDL software release is
-automatically `archived and issued a digital object id (DOI) <https://guides.github.com/activities/citable-code/>`__ by
-`Zenodo <https://zenodo.org/>`__ through an integration with
-`Github <https://github.com>`__. The overarching DOI for the entire PUDL
-project is `10.5281/zenodo.3404014 <https://doi.org/10.5281/zenodo.3404014>`__,
-and each release will get its own (versioned) DOI.
-
-On a quarterly basis, we will also upload a standard set of data packages to
-Zenodo alongside the PUDL release that was used to generate them, and the
-packages will also be issued citeable DOIs so they can be easily referenced in
-research and other publications. Our goal is to make replication of any
-analyses that depend on the released code and published data as easy to
-replicate as possible.
-
-Other Sites?
-^^^^^^^^^^^^
-
-Are there other data archiving and access platforms that you'd like to see the
-pudl data packages published to? If so feel free to
-`create an issue on Github <https://github.com/catalyst-cooperative/pudl/issues>`__
-to let us know about it, and explain what it would add to the project. Other
-sites we've thought about include:
-
-* `Open EI <https://openei.org/wiki/Main_Page>`__
-* `data.world <https://data.world/>`__
+We intend to publish tabular data packages on a quarterly basis, containing
+the full outputs from the PUDL ETL pipeline, via `Zenodo <https://zenodo.org>`__,
+an open data archiving service provided by CERN. The most recent release can
+always be found through this concept DOI:
+`10.5281/zenodo.3653158 <https://doi.org/10.5281/zenodo.3653158>`__. Each
+individual version of the data releases will be assigned its own unique DOI.
+
+Users who aren't working with Python, or who don't want to set up and run the
+data processing pipeline themselves, can download and use the data packages
+directly. We provide scripts alongside the published data packages that will
+load them into a local SQLite database or, for the larger datasets, convert
+them into `Apache Parquet <https://parquet.apache.org/>`__ datasets on disk,
+which can be read directly into
+`Pandas <https://pandas.pydata.org>`__,
+`Dask <https://dask.org>`__, or
+`R dataframes <https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame>`__.
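
As an editorial aside, the SQLite loading path described above can be
sketched with Python's standard library. The table name and columns below are
hypothetical stand-ins for the real PUDL schema, and the in-memory database
stands in for the ``pudl.sqlite`` file the loading scripts would produce:

```python
import sqlite3

# Build a tiny in-memory stand-in for the PUDL SQLite database; the
# table name and columns are illustrative, not the actual PUDL schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plants_eia860 "
    "(plant_id INTEGER, plant_name TEXT, report_year INTEGER)"
)
conn.executemany(
    "INSERT INTO plants_eia860 VALUES (?, ?, ?)",
    [(1, "Example Station", 2009), (1, "Example Station", 2018)],
)

# Against a real database you would instead connect to the file, e.g.
# conn = sqlite3.connect("pudl.sqlite"), and query it the same way:
rows = conn.execute(
    "SELECT plant_id, report_year FROM plants_eia860 ORDER BY report_year"
).fetchall()
print(rows)  # [(1, 2009), (1, 2018)]
conn.close()
```

The same query pattern works from pandas via ``pd.read_sql`` once the
database file exists.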

+We archive the original input data and provide a one-line script which should
+allow users to replicate the entire ETL process, generating byte-for-byte
+identical outputs. See the documentation published with the data releases for
+details on how to load or reproduce the data packages.
+
+We also curate the
+`Catalyst Cooperative Community on Zenodo <https://zenodo.org/communities/catalyst-cooperative/>`__,
+which lists all of the archived products generated by our projects.
+
+These archives and the DOIs associated with them should be permanently
+accessible, and are suitable for use as references in academic and other
+publications.

 -------------------------------------------------------------------------------
 Using Data Packages

@@ -90,7 +74,6 @@ DB, you could run the following command from within your PUDL workspace:
         -o datapkg/pudl-example/pudl-merged \
         datapkg/pudl-example/ferc1-example/datapackage.json \
         datapkg/pudl-example/eia-example/datapackage.json \
-        datapkg/pudl-example/epaipm-example/datapackage.json
 
 The path after the ``-o`` flag tells the script where to put the merged
 data package, and the subsequent paths to the various ``datapackage.json``
@@ -110,7 +93,7 @@ especially if you have a powerful system with multiple cores, a solid state
 disk, and plenty of memory.
 
 If you have generated an EPA CEMS data package, you can use the
-``epacems_to_parquet`` script to convert the hourly emisssions table like this:
+``epacems_to_parquet`` script to convert the hourly emissions table like this:
 
 .. code-block:: console
@@ -130,7 +113,8 @@ tools:
 .. todo::
 
     Document process for pulling data packages or datapackage bundles into
-    Microsoft Access / Excel
+    Microsoft Access / Excel. If you've gotten this to work and would like to
+    contribute an example, please let us know!
 
 Other Platforms
 ^^^^^^^^^^^^^^^
@@ -140,7 +124,7 @@ well normalized relational database tables, pulling them directly into e.g.
 Pandas or R dataframes for interactive use probably isn't the most useful
 thing to do. In the future we intend to generate and publish data packages
 containing denormalized tables including values derived from analysis of the
-original data, post-ETL. These packages would be suitable for direct
+original data, post-ETL. These packages would be more suitable for direct
 interactive use.
 
 Want to submit another example? Check out :doc:`the documentation on