Add documentation for datasets
nikhilwoodruff committed Oct 13, 2022
1 parent 959d41a commit 5291104
Showing 9 changed files with 218 additions and 132 deletions.
1 change: 1 addition & 0 deletions docs/_toc.yml
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ parts:
- file: usage/country
- file: usage/cli
- file: usage/parameters
- file: usage/datasets
- caption: Python API
chapters:
- file: python_api/commons
Expand Down
103 changes: 103 additions & 0 deletions docs/usage/datasets.ipynb
@@ -0,0 +1,103 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Writing datasets\n",
"\n",
"Country models built on PolicyEngine Core are often used not just to simulate a few households, but to run microsimulations over thousands of households in survey data. This technique can estimate the impact of a policy on a population, or compare the impacts of different policies on the same population. To do this, we need to load data into PolicyEngine Core in a standardised format, which the `Dataset` class provides.\n",
"\n",
"## Example\n",
"\n",
"Here's the Country Template's default example for a dataset."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from policyengine_core.country_template.constants import COUNTRY_DIR\n",
"from policyengine_core.data import Dataset\n",
"from policyengine_core.periods import ETERNITY, MONTH, period\n",
"\n",
"\n",
"class CountryTemplateDataset(Dataset):\n",
"    # Specify metadata used to describe and store the dataset.\n",
"    name = \"country_template_dataset\"\n",
"    label = \"Country template dataset\"\n",
"    folder_path = COUNTRY_DIR / \"data\" / \"storage\"\n",
"    data_format = Dataset.TIME_PERIOD_ARRAYS\n",
"\n",
"    # The generation function is the most important part: it defines\n",
"    # how the dataset is generated from the raw data for a given year.\n",
"    def generate(self, year: int) -> None:\n",
"        person_id = [0, 1, 2]\n",
"        household_id = [0, 1]\n",
"        person_household_id = [0, 0, 1]\n",
"        person_household_role = [\"parent\", \"child\", \"parent\"]\n",
"        salary = [100, 0, 200]\n",
"        salary_time_period = period(\"2022-01\")\n",
"        weight = [1e6, 1.2e6]\n",
"        weight_time_period = period(\"2022\")\n",
"        data = {\n",
"            \"person_id\": {ETERNITY: person_id},\n",
"            \"household_id\": {ETERNITY: household_id},\n",
"            \"person_household_id\": {ETERNITY: person_household_id},\n",
"            \"person_household_role\": {ETERNITY: person_household_role},\n",
"            \"salary\": {salary_time_period: salary},\n",
"            \"household_weight\": {weight_time_period: weight},\n",
"        }\n",
"        self.save_variable_values(year, data)\n",
"\n",
"# Important: we must instantiate datasets. This tests their validity and adds dynamic logic.\n",
"CountryTemplateDataset = (\n",
"    CountryTemplateDataset()\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataset API\n",
"\n",
"PolicyEngine Core also includes two subclasses of `Dataset`:\n",
"\n",
"* `PublicDataset` - a dataset that is publicly available and can be downloaded from a URL. Includes a `download` method to download the dataset.\n",
"* `PrivateDataset` - a dataset that is not publicly available and must be downloaded from a private URL (specifically, Google Cloud buckets). Includes a `download` method to download the dataset, and an `upload` method to upload the dataset.\n",
"\n",
"See {doc}`/python_api/data` for the API reference."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.12 ('base')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "40d3a090f54c6569ab1632332b64b2c03c39dcf918b08424e98f38b5ae0af88f"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
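
The `generate` method in the notebook above builds a nested mapping of variable name, then time period, then one value per entity. As a rough sketch of that shape, the snippet below rebuilds the same mapping with plain Python objects and checks that every array has one value per person or per household. Everything here is illustrative: plain strings stand in for the `period` objects, and `ENTITY_OF` and `length_problems` are hypothetical names that are not part of PolicyEngine Core.

```python
# Plain strings stand in for policyengine_core period objects (an assumption
# made to keep this sketch free of third-party imports).
ETERNITY = "eternity"

# Same values as the Country Template example dataset.
data = {
    "person_id": {ETERNITY: [0, 1, 2]},
    "household_id": {ETERNITY: [0, 1]},
    "person_household_id": {ETERNITY: [0, 0, 1]},
    "person_household_role": {ETERNITY: ["parent", "child", "parent"]},
    "salary": {"2022-01": [100, 0, 200]},
    "household_weight": {"2022": [1e6, 1.2e6]},
}

# Hypothetical mapping from each variable to the entity it describes.
ENTITY_OF = {
    "person_id": "person",
    "household_id": "household",
    "person_household_id": "person",
    "person_household_role": "person",
    "salary": "person",
    "household_weight": "household",
}


def length_problems(data, entity_of):
    """List arrays whose length differs from their entity's ID array."""
    counts = {
        "person": len(data["person_id"][ETERNITY]),
        "household": len(data["household_id"][ETERNITY]),
    }
    problems = []
    for variable, values_by_period in data.items():
        expected = counts[entity_of[variable]]
        for time_period, values in values_by_period.items():
            if len(values) != expected:
                problems.append(
                    f"{variable}[{time_period}] has {len(values)} values, "
                    f"expected {expected}"
                )
    return problems


print(length_problems(data, ENTITY_OF))  # []
```

A check like this is cheap to run before saving, since a person-level array with the wrong length would otherwise only surface later in a simulation.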
128 changes: 0 additions & 128 deletions docs/usage/parameters.ipynb

This file was deleted.

99 changes: 99 additions & 0 deletions docs/usage/parameters.md
@@ -0,0 +1,99 @@
# Writing parameters

**Parameters** are values that can change over time but apply to an entire country package: they cannot differ between entities. The `parameters/` folder of a country package contains the **parameter tree**, a tree of parameter nodes, each of which holds parameters or further nodes as children. A parameter node can be represented either as a folder or as a YAML file. Parameters are stored in parameter files. For example, a parameter tree might have the following structure in a country repo:

```{eval-rst}
.. code:: none

   parameters/
   ├── child_benefit/
   │   ├── basic.yaml
   │   ├── child.yaml
   │   ├── family.yaml
   │   └── index.yaml # Contains metadata for the child_benefit node.
   ├── income_tax/
   │   ├── basic.yaml
   │   ├── personal.yaml
   │   └── thresholds.yaml
   ├── national_insurance/
   │   ├── basic.yaml
   │   ├── employee.yaml
   │   └── thresholds.yaml
   └── universal_credit/
       ├── basic.yaml
       ├── child.yaml
       └── family.yaml
```
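
As a rough illustration of how files in this tree relate to parameter names such as `child_benefit.basic.amount`, the sketch below maps a file path to the dotted name of the node it defines. The `node_name` helper is hypothetical (it is not part of PolicyEngine Core), and the naming rule is an assumption inferred from the example names on this page.

```python
from pathlib import PurePosixPath


def node_name(path: str, root: str = "parameters") -> str:
    """Return the dotted node name for a folder or YAML file under `root`.

    Assumed convention: each folder and file name becomes one segment, and
    `index.yaml` describes its containing folder rather than a child node.
    """
    relative = PurePosixPath(path).relative_to(root)
    parts = list(relative.with_suffix("").parts)
    if parts and parts[-1] == "index":
        parts.pop()  # index.yaml holds metadata for the folder itself
    return ".".join(parts)


print(node_name("parameters/child_benefit/basic.yaml"))  # child_benefit.basic
print(node_name("parameters/child_benefit/index.yaml"))  # child_benefit
print(node_name("parameters/income_tax"))  # income_tax
```

Under this reading, a parameter called `amount` defined inside `child_benefit/basic.yaml` would be addressed as `child_benefit.basic.amount`.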

## Parameter values

Parameters are defined as a function of time, and are evaluated at a given instant. To achieve this, parameters must be defined with a **value history**: a set of dated values, where the value at a given instant is the most recent value set at or before that instant. For example, the `child_benefit.basic.amount` parameter might have the following value history:

```yaml
values:
  2019-04-01: 20.00
  2019-04-02: 21.00
  2019-04-03: 22.00
  2019-04-04: 23.00
  2019-04-05: 24.00
```
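
The lookup rule can be sketched in a few lines of plain Python. This is a minimal illustration, not PolicyEngine Core's implementation: `value_at` is a hypothetical helper, and it assumes that a value takes effect on its own date and that nothing is defined before the first entry.

```python
from bisect import bisect_right
from datetime import date

# Value history copied from the YAML example above.
VALUE_HISTORY = {
    date(2019, 4, 1): 20.00,
    date(2019, 4, 2): 21.00,
    date(2019, 4, 3): 22.00,
    date(2019, 4, 4): 23.00,
    date(2019, 4, 5): 24.00,
}


def value_at(history, instant):
    """Return the most recent value set at or before `instant`.

    Returns None when `instant` precedes the first dated value.
    """
    dates = sorted(history)
    # Number of dates that are <= instant.
    index = bisect_right(dates, instant)
    if index == 0:
        return None
    return history[dates[index - 1]]


print(value_at(VALUE_HISTORY, date(2019, 4, 3)))  # 22.0
print(value_at(VALUE_HISTORY, date(2020, 1, 1)))  # 24.0
print(value_at(VALUE_HISTORY, date(2019, 1, 1)))  # None
```

The last dated value stays in force indefinitely, which is why an instant in 2020 still returns 24.0.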

## Metadata

Each parameter (or parameter node) can also set **metadata**: data that describes the parameter, such as its name, description, and units. While the metadata for each parameter and node is freeform, using the schemas defined in `policyengine_core.data_structures` will ensure consistency between country packages, and better maintainability.

Here's an example metadata specification for the `child_benefit.basic.amount` parameter:

```yaml
metadata:
  name: child_benefit
  label: Child Benefit
  description: The amount of Child Benefit paid per child.
  unit: currency-GBP
  reference:
    - label: GOV.UK | Child Benefit
      href: https://www.gov.uk/government/publications/child-benefit-rates-and-thresholds/child-benefit-rates-and-thresholds
```
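
Because metadata is freeform, a typo in a key name passes silently. The sketch below shows one way to catch that, using the example above written as the dict a YAML loader would produce. `ALLOWED_KEYS` and `metadata_problems` are hypothetical names chosen for illustration; they mirror only the keys used in this example, not the actual schemas in `policyengine_core.data_structures`.

```python
from typing import Any, Dict, List

# The example metadata, as the dict a YAML loader would produce.
metadata = {
    "name": "child_benefit",
    "label": "Child Benefit",
    "description": "The amount of Child Benefit paid per child.",
    "unit": "currency-GBP",
    "reference": [
        {
            "label": "GOV.UK | Child Benefit",
            "href": "https://www.gov.uk/government/publications/child-benefit-rates-and-thresholds/child-benefit-rates-and-thresholds",
        },
    ],
}

# Assumption: only the keys seen in the example above are accepted.
ALLOWED_KEYS = {"name", "label", "description", "unit", "reference"}


def metadata_problems(metadata: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a metadata dict."""
    problems = []
    for key in sorted(set(metadata) - ALLOWED_KEYS):
        problems.append(f"unexpected key: {key}")
    for index, reference in enumerate(metadata.get("reference", [])):
        for field in ("label", "href"):
            if field not in reference:
                problems.append(f"reference {index} is missing '{field}'")
    return problems


print(metadata_problems(metadata))  # []
```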
### Metadata for parameters
```{eval-rst}
.. autoclass:: policyengine_core.data_structures.ParameterMetadata
   :members:
```
### Metadata for parameter nodes
```{eval-rst}
.. autoclass:: policyengine_core.data_structures.ParameterNodeMetadata
   :members:
```
## Specifying references
```{eval-rst}
.. autoclass:: policyengine_core.data_structures.Reference
   :members:
```
## Units
```{eval-rst}
.. autoclass:: policyengine_core.data_structures.Unit
   :members:
```
## Other specifications
```{eval-rst}
.. automodule:: policyengine_core.data_structures.parameter_metadata
   :members:
   :show-inheritance:
   :exclude-members: ParameterMetadata, Reference, Unit

.. automodule:: policyengine_core.data_structures.parameter_node_metadata
   :members:
   :show-inheritance:
   :exclude-members: ParameterNodeMetadata
```
Expand Up @@ -4,11 +4,14 @@


 class CountryTemplateDataset(Dataset):
+    # Specify metadata used to describe and store the dataset.
     name = "country_template_dataset"
     label = "Country template dataset"
     folder_path = COUNTRY_DIR / "data" / "storage"
     data_format = Dataset.TIME_PERIOD_ARRAYS

+    # The generation function is the most important part: it defines
+    # how the dataset is generated from the raw data for a given year.
     def generate(self, year: int) -> None:
         person_id = [0, 1, 2]
         household_id = [0, 1]
Expand All @@ -28,7 +31,7 @@ def generate(self, year: int) -> None:
         }
         self.save_variable_values(year, data)

-
+# Important: we must instantiate datasets. This tests their validity and adds dynamic logic.
 CountryTemplateDataset = (
     CountryTemplateDataset()
-)  # Important: must be instantiated
+)
8 changes: 7 additions & 1 deletion policyengine_core/data/dataset.py
Expand Up @@ -15,10 +15,15 @@ class Dataset:
     like cloud storage, metadata and loading."""

     name: str = None
+    """The name of the dataset. This is used to generate filenames and is used as the key in the `datasets` dictionary."""
     label: str = None
-    is_openfisca_compatible: bool = True  # For example, a raw relational database that we'd still like to keep with the OpenFisca dataset code.
+    """The label of the dataset. This is used for logging and is used as the key in the `datasets` dictionary."""
+    is_openfisca_compatible: bool = True
+    """Whether the dataset is compatible with OpenFisca. If True, the dataset will be stored as a collection of arrays. If False, the dataset will be stored as a collection of tables."""
     data_format: str = None
+    """The format of the dataset. This can be either `Dataset.ARRAYS`, `Dataset.TIME_PERIOD_ARRAYS` or `Dataset.TABLES`. If `Dataset.ARRAYS`, the dataset is stored as a collection of arrays. If `Dataset.TIME_PERIOD_ARRAYS`, the dataset is stored as a collection of arrays, with one array per time period. If `Dataset.TABLES`, the dataset is stored as a collection of tables (DataFrames)."""
     folder_path: str = None
+    """The path to the folder where the dataset is stored (in .h5 files)."""

     # Data formats
     TABLES = "tables"
Expand Down Expand Up @@ -261,6 +266,7 @@ def remove_all(self):

     @property
     def years(self):
+        """Returns the years for which the dataset has been generated."""
         pattern = re.compile(f"\n{self.name}_([0-9]+).h5")
         return list(
             map(
Expand Down