Add documentation for datasets

PolicyEngine · Oct 13, 2022 · 5291104 · 5291104
1 parent 959d41a
commit 5291104
Showing 9 changed files with 218 additions and 132 deletions.
diff --git a/docs/_toc.yml b/docs/_toc.yml
@@ -10,6 +10,7 @@ parts:
     - file: usage/country
     - file: usage/cli
     - file: usage/parameters
+    - file: usage/datasets
   - caption: Python API
     chapters:
     - file: python_api/commons

diff --git a/docs/usage/datasets.ipynb b/docs/usage/datasets.ipynb
@@ -0,0 +1,103 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Writing datasets\n",
+    "\n",
+    "A common use case of PolicyEngine Core country models is not just simulating a few households, but running a microsimulation over thousands of households from survey data. This technique can be used to simulate the impact of a policy on a population, or to compare the impact of different policies on the same population. To do this, we need to load data into PolicyEngine Core, which we do in a standardised format using the `Dataset` class.\n",
+    "\n",
+    "## Example\n",
+    "\n",
+    "Here's the Country Template's default example for a dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from policyengine_core.country_template.constants import COUNTRY_DIR\n",
+    "from policyengine_core.data import Dataset\n",
+    "from policyengine_core.periods import ETERNITY, MONTH, period\n",
+    "\n",
+    "\n",
+    "class CountryTemplateDataset(Dataset):\n",
+    "    # Specify metadata used to describe and store the dataset.\n",
+    "    name = \"country_template_dataset\"\n",
+    "    label = \"Country template dataset\"\n",
+    "    folder_path = COUNTRY_DIR / \"data\" / \"storage\"\n",
+    "    data_format = Dataset.TIME_PERIOD_ARRAYS\n",
+    "\n",
+    "    # The generation function is the most important part: it defines\n",
+    "    # how the dataset is generated from the raw data for a given year.\n",
+    "    def generate(self, year: int) -> None:\n",
+    "        person_id = [0, 1, 2]\n",
+    "        household_id = [0, 1]\n",
+    "        person_household_id = [0, 0, 1]\n",
+    "        person_household_role = [\"parent\", \"child\", \"parent\"]\n",
+    "        salary = [100, 0, 200]\n",
+    "        salary_time_period = period(\"2022-01\")\n",
+    "        weight = [1e6, 1.2e6]\n",
+    "        weight_time_period = period(\"2022\")\n",
+    "        data = {\n",
+    "            \"person_id\": {ETERNITY: person_id},\n",
+    "            \"household_id\": {ETERNITY: household_id},\n",
+    "            \"person_household_id\": {ETERNITY: person_household_id},\n",
+    "            \"person_household_role\": {ETERNITY: person_household_role},\n",
+    "            \"salary\": {salary_time_period: salary},\n",
+    "            \"household_weight\": {weight_time_period: weight},\n",
+    "        }\n",
+    "        self.save_variable_values(year, data)\n",
+    "\n",
+    "# Important: we must instantiate datasets. This tests their validity and adds dynamic logic.\n",
+    "CountryTemplateDataset = (\n",
+    "    CountryTemplateDataset()\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Dataset API\n",
+    "\n",
+    "PolicyEngine Core also includes two subclasses of `Dataset`:\n",
+    "\n",
+    "* `PublicDataset` - a dataset that is publicly available, and can be downloaded from a URL. Includes a `download` method to download the dataset.\n",
+    "* `PrivateDataset` - a dataset that is not publicly available, and must be downloaded from a private URL (specifically, Google Cloud buckets). Includes a `download` method to download the dataset, and an `upload` method to upload the dataset.\n",
+    "\n",
+    "See {doc}`/python_api/data` for the API reference."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3.9.12 ('base')",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.12"
+  },
+  "orig_nbformat": 4,
+  "vscode": {
+   "interpreter": {
+    "hash": "40d3a090f54c6569ab1632332b64b2c03c39dcf918b08424e98f38b5ae0af88f"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/docs/usage/parameters.ipynb b/docs/usage/parameters.ipynb
diff --git a/docs/usage/parameters.md b/docs/usage/parameters.md
@@ -0,0 +1,99 @@
+# Writing parameters
+
+**Parameters** are values that can change over time (but are the same for an entire country package, i.e. they can't differ between entities). The `parameters/` folder of a country package contains the **parameter tree**: a tree of parameter nodes, each of which holds parameters or further subnodes as children. A parameter node can be represented as either a folder or a YAML file. Parameters themselves are stored in parameter files. For example, a parameter tree might have the following structure in a country repo:
+
+```{eval-rst}
+.. code:: none
+
+   parameters/
+   ├── child_benefit/
+   │   ├── basic.yaml
+   │   ├── child.yaml
+   │   ├── family.yaml
+   │   └── index.yaml # Contains metadata for the child_benefit node.
+   ├── income_tax/
+   │   ├── basic.yaml
+   │   ├── personal.yaml
+   │   └── thresholds.yaml
+   ├── national_insurance/
+   │   ├── basic.yaml
+   │   ├── employee.yaml
+   │   └── thresholds.yaml
+   └── universal_credit/
+       ├── basic.yaml
+       ├── child.yaml
+       └── family.yaml
+```
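
As a rough illustration of how the tree above maps to dotted names such as `child_benefit.basic.amount`, the sketch below converts a file path under `parameters/` into a node name. It is not PolicyEngine Core's loader: the helper `node_name` is hypothetical, and it assumes that each folder and each YAML file contributes one level, with `index.yaml` describing its parent folder.

```python
from pathlib import PurePosixPath


def node_name(relative_path: str) -> str:
    """Convert a path under parameters/ into a dotted node name.

    Assumption for illustration only: each folder and each YAML file
    contributes one level, and index.yaml describes its parent folder.
    """
    path = PurePosixPath(relative_path)
    parts = list(path.parent.parts)
    if path.stem != "index":
        # An ordinary file adds its own level below its folder.
        parts.append(path.stem)
    return ".".join(parts)


print(node_name("child_benefit/basic.yaml"))  # child_benefit.basic
print(node_name("child_benefit/index.yaml"))  # child_benefit
```

Under these assumptions, a parameter named `amount` defined inside `child_benefit/basic.yaml` would be addressed as `child_benefit.basic.amount`.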
+
+## Parameter values
+
+Parameters are defined as a function of time, and are evaluated at a given instant. To achieve this, parameters must be defined with a **value history**: a set of dated values, where the value at a given instant is the most recently set value on or before that instant. For example, the `child_benefit.basic.amount` parameter might have the following value history:
+
+```yaml
+values:
+   2019-04-01: 20.00
+   2019-04-02: 21.00
+   2019-04-03: 22.00
+   2019-04-04: 23.00
+   2019-04-05: 24.00
+
+```
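
The lookup rule for a value history can be sketched in plain Python. This is a minimal illustration rather than the library's implementation: `value_at` is a hypothetical helper, and it assumes that a value takes effect on the date it is listed and that instants earlier than the first entry have no value.

```python
from bisect import bisect_right
from datetime import date

# Value history copied from the YAML example above.
VALUE_HISTORY = {
    date(2019, 4, 1): 20.00,
    date(2019, 4, 2): 21.00,
    date(2019, 4, 3): 22.00,
    date(2019, 4, 4): 23.00,
    date(2019, 4, 5): 24.00,
}


def value_at(history, instant):
    """Return the most recently set value on or before `instant`.

    Returns None when `instant` precedes every entry (an assumption
    made for this sketch).
    """
    dates = sorted(history)
    # Number of entries dated on or before the instant.
    index = bisect_right(dates, instant)
    if index == 0:
        return None
    return history[dates[index - 1]]


print(value_at(VALUE_HISTORY, date(2019, 4, 3)))   # 22.0
print(value_at(VALUE_HISTORY, date(2020, 1, 1)))   # 24.0
print(value_at(VALUE_HISTORY, date(2019, 3, 31)))  # None
```

The last entry stays in force indefinitely, which is why the 2020 lookup still returns the value set on 2019-04-05.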
+
+## Metadata
+
+Each parameter (or parameter node) can also set **metadata**: data that describes the parameter, such as its name, description, and units. While the metadata for each parameter and node is freeform, using the schemas defined in `policyengine_core.data_structures` will ensure consistency between country packages, and better maintainability.
+
+Here's an example metadata specification for the `child_benefit.basic.amount` parameter:
+
+```yaml
+metadata:
+    name: child_benefit
+    label: Child Benefit
+    description: The amount of Child Benefit paid per child.
+    unit: currency-GBP
+    reference: 
+        - label: GOV.UK | Child Benefit
+          href: https://www.gov.uk/government/publications/child-benefit-rates-and-thresholds/child-benefit-rates-and-thresholds
+```
+
+### Metadata for parameters
+
+```{eval-rst}
+.. autoclass:: policyengine_core.data_structures.ParameterMetadata
+    :members:
+```
+
+### Metadata for parameter nodes
+
+```{eval-rst}
+.. autoclass:: policyengine_core.data_structures.ParameterNodeMetadata
+    :members:
+```
+
+## Specifying references
+
+```{eval-rst}
+.. autoclass:: policyengine_core.data_structures.Reference
+    :members:
+```
+
+## Units
+
+```{eval-rst}
+.. autoclass:: policyengine_core.data_structures.Unit
+    :members:
+```
+
+## Other specifications
+
+```{eval-rst}
+.. automodule:: policyengine_core.data_structures.parameter_metadata
+    :members:
+    :show-inheritance:
+    :exclude-members: ParameterMetadata, Reference, Unit
+
+.. automodule:: policyengine_core.data_structures.parameter_node_metadata
+    :members:
+    :show-inheritance:
+    :exclude-members: ParameterNodeMetadata
+```
diff --git a/policyengine_core/country_template/data/datasets/country_template_dataset.py b/policyengine_core/country_template/data/datasets/country_template_dataset.py
@@ -4,11 +4,14 @@
 
 
 class CountryTemplateDataset(Dataset):
+    # Specify metadata used to describe and store the dataset.
     name = "country_template_dataset"
     label = "Country template dataset"
     folder_path = COUNTRY_DIR / "data" / "storage"
     data_format = Dataset.TIME_PERIOD_ARRAYS
 
+    # The generation function is the most important part: it defines
+    # how the dataset is generated from the raw data for a given year.
     def generate(self, year: int) -> None:
         person_id = [0, 1, 2]
         household_id = [0, 1]
@@ -28,7 +31,7 @@ def generate(self, year: int) -> None:
         }
         self.save_variable_values(year, data)
 
-
+# Important: we must instantiate datasets. This tests their validity and adds dynamic logic.
 CountryTemplateDataset = (
     CountryTemplateDataset()
-)  # Important: must be instantiated
+)
diff --git a/policyengine_core/data/dataset.py b/policyengine_core/data/dataset.py
@@ -15,10 +15,15 @@ class Dataset:
     like cloud storage, metadata and loading."""
 
     name: str = None
+    """The name of the dataset. This is used to generate filenames and is used as the key in the `datasets` dictionary."""
     label: str = None
-    is_openfisca_compatible: bool = True  # For example, a raw relational database that we'd still like to keep with the OpenFisca dataset code.
+    """The label of the dataset. This is used for logging and is used as the key in the `datasets` dictionary."""
+    is_openfisca_compatible: bool = True
+    """Whether the dataset is compatible with OpenFisca. If True, the dataset will be stored as a collection of arrays. If False, the dataset will be stored as a collection of tables."""
     data_format: str = None
+    """The format of the dataset. This can be either `Dataset.ARRAYS`, `Dataset.TIME_PERIOD_ARRAYS` or `Dataset.TABLES`. If `Dataset.ARRAYS`, the dataset is stored as a collection of arrays. If `Dataset.TIME_PERIOD_ARRAYS`, the dataset is stored as a collection of arrays, with one array per time period. If `Dataset.TABLES`, the dataset is stored as a collection of tables (DataFrames)."""
     folder_path: str = None
+    """The path to the folder where the dataset is stored (in .h5 files)."""
 
     # Data formats
     TABLES = "tables"
@@ -261,6 +266,7 @@ def remove_all(self):
 
     @property
     def years(self):
+        """Returns the years for which the dataset has been generated."""
         pattern = re.compile(f"\n{self.name}_([0-9]+).h5")
         return list(
             map(