Update docs for 2009-2010 EIA 860 integration.
zaneselvans committed Feb 17, 2020
1 parent e07aa8c commit df20192
Showing 2 changed files with 52 additions and 69 deletions.
docs/data_catalog.rst (19 additions & 20 deletions)

@@ -25,24 +25,20 @@ EIA Form 860
 =================== ===========================================================
 Source URL          https://www.eia.gov/electricity/data/eia860/
 Source Format       Microsoft Excel (.xls/.xlsx)
-Source Years        2001-2017
+Source Years        2001-2018
 Size (Download)     127 MB
 Size (Uncompressed) 247 MB
 PUDL Code           ``eia860``
-Years Liberated     2011-2018
-Records Liberated   ~500,000
-Issues              `open issues labeled epacems <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aeia860>`__
+Years Liberated     2009-2018
+Records Liberated   ~600,000
+Issues              `open EIA 860 issues <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aeia860>`__
 =================== ===========================================================
 
 Nearly all of the data reported to the EIA on Form 860 is being pulled into the
-PUDL database for the years 2011-2018.
-
-We are working on integrating the 2009-2010 EIA 860 data, which has a similar
-format. This will give us the same coverage in both EIA 860 and EIA 923, which
-is good since the two datasets are tightly integrated.
-
-Currently we are extending the 2011 EIA 860 data back to 2009 as needed to
-integrate it with EIA 923.
+PUDL database for the years 2009-2018. This data is tightly integrated with
+the EIA 923 data, for which we integrate the same set of years. We do not
+anticipate integrating EIA 860 data from before 2009 at this time, but if
+you need that data, let us know.
 
 .. _data-eia923:
 
@@ -57,15 +53,18 @@ Size (Download)     196 MB
 Size (Uncompressed) 299 MB
 PUDL Code           ``eia923``
 Years Liberated     2009-2018
-Records Liberated   ~2 million
-Issues              `open issues labeled epacems <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aeia923>`__
+Records Liberated   ~3.2 million
+Issues              `open EIA 923 issues <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aeia923>`__
 =================== ===========================================================
 
 Nearly all of EIA Form 923 is being pulled into the PUDL database, for years
-2009-2017. Earlier data is available from EIA, but the reporting format for
+2009-2018. Earlier data is available from EIA, but the reporting format for
 earlier years is substantially different from the present day, and will require
-more work to integrate. Monthly year to date releases are not yet being
-integrated.
+more work to integrate. Monthly year-to-date releases are not yet being
+integrated, and only larger utilities are required to make monthly reports.
+
+We have not yet integrated tables reporting fuel stocks on hand, data from
+Puerto Rico, or EIA 923 schedules 6, 7, and 8.
 
 .. _data-epacems:
 

@@ -81,7 +80,7 @@ Size (Uncompressed) ~100 GB
 PUDL Code           ``epacems``
 Years Liberated     1995-2018
 Records Liberated   ~1 billion
-Issues              `open issues labeled epacems <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aepacems>`__
+Issues              `open EPA CEMS issues <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aepacems>`__
 =================== ===========================================================
 
 All of the EPA's hourly Continuous Emissions Monitoring System (CEMS) data is
@@ -112,7 +111,7 @@ Size (Uncompressed) 14 MB
 PUDL Code           ``epaipm``
 Years Liberated     N/A
 Records Liberated   ~650,000
-Issues              `open issues labeled epacems <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aepaipm>`__
+Issues              `open EPA IPM issues <https://github.com/catalyst-cooperative/pudl/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+label%3Aepaipm>`__
 =================== ===========================================================
 
 .. todo::
@@ -130,11 +129,11 @@ Source URL          https://www.ferc.gov/docs-filing/forms/form-1/data.asp
 Source Format       FoxPro Database (.DBC/.DBF)
 Source Years        1994-2018
 Size (Download)     1.4 GB
-Size (Uncompressed) 2.5 GB
+Size (Uncompressed) 10 GB
 PUDL Code           ``ferc1``
 Years Liberated     1994-2018
 Records Liberated   ~12 million (116 raw tables), ~280,000 (7 clean tables)
-Issues              `open issues labeled <https://github.com/catalyst-cooperative/pudl/issues?q=is%3Aissue+is%3Aopen+label%3Aferc1>`__
+Issues              `open FERC Form 1 issues <https://github.com/catalyst-cooperative/pudl/issues?q=is%3Aissue+is%3Aopen+label%3Aferc1>`__
 =================== ===========================================================
 
 The FERC Form 1 database consists of 116 data tables containing ~8GB of data,
docs/datapackages.rst (33 additions & 49 deletions)

@@ -18,51 +18,35 @@ We hope this will allow the data to reach the widest possible audience.
 specifications, a project of
 `the Open Knowledge Foundation <https://okfn.org>`__
 
 -------------------------------------------------------------------------------
 Downloading Data Packages
 -------------------------------------------------------------------------------
 
-.. note::
-
-    Release v0.3.0 of the ``catalystcoop.pudl`` package will be used to
-    generate tabular datapackages for distribution. You will be able to find
-    them listed on the `Catalyst Cooperative Community page on Zenodo <https://zenodo.org/communities/catalyst-cooperative/>`__
-
-Our intent is to automate the creation of a standard bundle of data packages
-containing all of the currently integrated data. Users who aren't working with
-Python, or who don't want to set up and run the data processing pipeline
-themselves will be able to just download and use the data packages directly.
-Each data release will be issued a DOI, and archived at Zenodo, and may be
-made available in other ways as well.
-
-Zenodo
-^^^^^^
-
-Every PUDL software release is
-automatically `archived and issued a digital object id (DOI) <https://guides.github.com/activities/citable-code/>`__ by
-`Zenodo <https://zenodo.org/>`__ through an integration with
-`Github <https://github.com>`__. The overarching DOI for the entire PUDL
-project is `10.5281/zenodo.3404014 <https://doi.org/10.5281/zenodo.3404014>`__,
-and each release will get its own (versioned) DOI.
-
-On a quarterly basis, we will also upload a standard set of data packages to
-Zenodo alongside the PUDL release that was used to generate them, and the
-packages will also be issued citeable DOIs so they can be easily referenced in
-research and other publications. Our goal is to make replication of any
-analyses that depend on the released code and published data as easy to
-replicate as possible.
-
-Other Sites?
-^^^^^^^^^^^^
-
-Are there other data archiving and access platforms that you'd like to see the
-pudl data packages published to? If so feel free to
-`create an issue on Github <https://github.com/catalyst-cooperative/pudl/issues>`__
-to let us know about it, and explain what it would add to the project. Other
-sites we've thought about include:
-
-* `Open EI <https://openei.org/wiki/Main_Page>`__
-* `data.world <https://data.world/>`__
+We intend to publish tabular data packages on a quarterly basis, containing
+the full outputs from the PUDL ETL pipeline, via `Zenodo <https://zenodo.org>`__,
+an open data archiving service provided by CERN. The most recent release can
+always be found through this concept DOI:
+`10.5281/zenodo.3653158 <https://doi.org/10.5281/zenodo.3653158>`__. Each
+individual version of the data releases will be assigned its own unique DOI.
+
+Users who aren't working with Python, or who don't want to set up and run the
+data processing pipeline themselves, can download and use the data packages
+directly. We provide scripts alongside the published data packages that will
+load them into a local SQLite database or, for the larger datasets, convert
+them into `Apache Parquet <https://parquet.apache.org/>`__ datasets on disk,
+which can be read directly into
+`Pandas <https://pandas.pydata.org>`__,
+`Dask <https://dask.org>`__, or
+`R dataframes <https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame>`__.
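
As an editorial aside, the SQLite loading path described above can be
sketched with Python's standard library. The table name and columns below are
hypothetical stand-ins for the real PUDL schema, and the in-memory database
stands in for the ``pudl.sqlite`` file the loading scripts would produce:

```python
import sqlite3

# Build a tiny in-memory stand-in for the PUDL SQLite database; the
# table name and columns are illustrative, not the actual PUDL schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plants_eia860 "
    "(plant_id INTEGER, plant_name TEXT, report_year INTEGER)"
)
conn.executemany(
    "INSERT INTO plants_eia860 VALUES (?, ?, ?)",
    [(1, "Example Station", 2009), (1, "Example Station", 2018)],
)

# Against a real database you would instead connect to the file, e.g.
# conn = sqlite3.connect("pudl.sqlite"), and query it the same way:
rows = conn.execute(
    "SELECT plant_id, report_year FROM plants_eia860 ORDER BY report_year"
).fetchall()
print(rows)  # [(1, 2009), (1, 2018)]
conn.close()
```

The same query pattern works from pandas via ``pd.read_sql`` once the
database file exists.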

+We archive the original input data and provide a one-line script which should
+allow users to replicate the entire ETL process, generating byte-for-byte
+identical outputs. See the documentation published with the data releases for
+details on how to load or reproduce the data packages.
+
+We also curate the
+`Catalyst Cooperative Community on Zenodo <https://zenodo.org/communities/catalyst-cooperative/>`__,
+which lists all of the archived products generated by our projects.
+
+These archives and the DOIs associated with them should be permanently
+accessible, and are suitable for use as references in academic and other
+publications.

 -------------------------------------------------------------------------------
 Using Data Packages

@@ -90,7 +74,6 @@ DB, you could run the following command from within your PUDL workspace:
         -o datapkg/pudl-example/pudl-merged \
         datapkg/pudl-example/ferc1-example/datapackage.json \
         datapkg/pudl-example/eia-example/datapackage.json \
-        datapkg/pudl-example/epaipm-example/datapackage.json
 
 The path after the ``-o`` flag tells the script where to put the merged
 data package, and the subsequent paths to the various ``datapackage.json``
@@ -110,7 +93,7 @@ especially if you have a powerful system with multiple cores, a solid state
 disk, and plenty of memory.
 
 If you have generated an EPA CEMS data package, you can use the
-``epacems_to_parquet`` script to convert the hourly emisssions table like this:
+``epacems_to_parquet`` script to convert the hourly emissions table like this:
 
 .. code-block:: console
@@ -130,7 +113,8 @@ tools:
 .. todo::
 
     Document process for pulling data packages or datapackage bundles into
-    Microsoft Access / Excel
+    Microsoft Access / Excel. If you've gotten this to work and would like to
+    contribute an example, please let us know!
 
 Other Platforms
 ^^^^^^^^^^^^^^^
@@ -140,7 +124,7 @@ well normalized relational database tables, pulling them directly into e.g.
 Pandas or R dataframes for interactive use probably isn't the most useful
 thing to do. In the future we intend to generate and publish data packages
 containing denormalized tables including values derived from analysis of the
-original data, post-ETL. These packages would be suitable for direct
+original data, post-ETL. These packages would be more suitable for direct
 interactive use.
 
 Want to submit another example? Check out :doc:`the documentation on