Skip to content

Multitable API

James Griffiths edited this page Oct 4, 2017 · 2 revisions

Multitables

This document is a draft proposal for supporting multitables in Scrunch. The Crunch API Documentation for multitables can be found at: http://docs.crunch.io/#multitables.

Accessing the multitables of a Dataset

The multitable entities within a Dataset object can be accessed through the dict-like multitables property, using the name of the multitable as the key:

ds = get_dataset('My dataset')
benchmark_mt = ds.multitables['Benchmark']

Each of the items returned by the multitables property is a Multitable object (described in more detail later in this document).

As is expected, using an invalid name on the multitables property raises a KeyError exception.

Creating new multitables

Dataset objects provide a create_multitable method that allows creating new multitable entities within a crunch dataset. Note that numeric variables can be included, either by their alias alone (in which case unique values from the data will be categorized into their own columns) or they can be automatically grouped into bins using the bin function as is shown. Here are some examples:

new_mt = ds.create_multitable(
    name='new multitable',
    template=['bin(age)', 'race', 'gender'],
    is_public=True
)
new_mt = ds.create_multitable(
    name='new multitable',
    template=[
        'bin(age)',
        'race',
        {
            'query': 'gender',
            'transform': {
                'categories': [
                    {'id': 3, 'name': 'Undisclosed'},
                    {'id': 2, 'name': 'Male'},
                    {'id': 1, 'name': 'Female'}
                ]
            }
        }
    ],
    is_public=True
)

The arguments to the create_multitable method are:

  • name: the name for the new multitable. If the name is invalid, a ValueError exception is raised.
  • template: an iterable of one or more template queries, which define the conditioning variables for the new multitable (see the Template queries section in this document for more details).
  • is_public: a boolean that determines if the new multitable belongs to the dataset (True) or if it is a personal multitable (False). The default value is False.

Upon success, the create_multitable method returns a Multitable object; otherwise a MultitableCreateError is raised with further details on the failure.

The Multitable class

The Multitable class provides high-level, convenient helpers for accessing and manipulating a multitable entity from the Crunch API.

For starters, the class provides the following read-only attributes:

  • name: the name of the multitable, which is unique within a given Dataset.
  • id: the crunch-generated entity id.
  • is_public: whether the multitable is dataset-owned (True) or personal (False).
  • owner: an instance of the User object to who owns the multitable.
  • template: the current set of template queries that provide the multitable's conditioning variables.
>>> mt = ds.multitables['Benchmark']
>>> print(mt.id)
abcde1234

>>> print(mt.is_public)
True

>>> print(mt.template)
[
    {'query': 'bin(age)'},
    {'query': 'race'},
    {
        'query': 'gender',
        'transform': {
            'categories': [
                {'id': 3, 'name': 'Undisclosed'},
                {'id': 2, 'name': 'Male'},
                {'id': 1, 'name': 'Female'}
            ]
        }
    }
]

Editing a Multitable object

The edit method in the Multitable class is the means of manipulating multitables in scrunch.

The arguments to the edit method are:

  • name: a new name for the multitable. If the name is invalid, a ValueError exception is thrown.
  • template: an iterable of one or more template queries, which will overwrite the current ones for the existing multitable entity (see the Template queries section in this document for more details).
  • is_public: a boolean that determines if the multitable belongs to the dataset (True) or if it is a personal multitable (False).

Upon successful edition, edit returns the current Multitable object. Else, a MultitableEditError is thrown with further details of the edit failure.

An example of using the edit method would be:

mt = ds.multitables['Benchmark']
mt.edit(name='My_Benchmark', template=['bin(age)'], is_public=True)

Deleting a Multitable object

There are two ways of deleting a Multitable object:

  1. Using its delete method or its alias remove():
mt = ds.multitables['Benchmark']
mt.delete()

Tab books

Tab books contain a set of variables in the dataset crosstabbed with a given multitable.

In this regard, the Multitable class provides the export method, which receives a path in the filesystem where the exported tab book for the current multitable will be saved. The format argument can be either 'xlsx' (default) or 'json'.

Parameters: http://docs.crunch.io/#post-body-parameters. :path: Local Filesystem path to save the file to :filter: Name of Filter instance of a Dataset's filter :where: list of variables to include; ['varA', 'varB'] :options: Display options as python dictionary :weight: Name of the weight_variable existing in the Dataset

Examples of usage:

mt = ds.multitables['Benchmark']
mt.export(path='/home/User/Benchmark.xlsx')

Or a more detailed export:

resp = mt.export(path='/home/User/Benchmark.xlsx', format='xlsx',
    filter=ds.filters['männlich'], where=['languagepreference', 'religion'], 
    weight='sample_weight')

NOTE: json format endpoint is not working in crunch at the moment, ignore this for now

mt = ds.multitables['Benchmark']
mt.export(path='/home/User/Benchmark.json', format='json')

By default, the export method selects all variables in the dataset to be crosstabbed with the current multitable. Nevertheless, the select optional argument can be used to make an explicit selection of the variables to use. The select argument is expected to be an iterable of the names of the variables to use in the tab book export:

mt = ds.multitables['Benchmark']
mt.export(path='/home/User/Benchmark.xlsx', where=['age', 'gender', 'race'])

Multitable as a Pandas DataFrame

NOT IMPLEMENTED

The Multitable class provides the dataframe method, which returns the multitable as a Pandas DataFrame object:

df = ds.multitables['Benchmark'].dataframe()

By default, the dataframe method selects all variables in the dataset to be included in the pandas DataFrame object. The select optional argument can be used to make an explicit selection of the variables to use. The select argument is expected to be an iterable of the names of the variables to use in the construction of the DataFrame object:

df = ds.multitables['Benchmark'].dataframe(select=('age', 'gender', 'race'))

Template queries

Both the Dataset.create_multitable and the Multitable.edit methods take a template parameter which is expected to be an iterable of one or more template queries.

In scrunch, template queries are represented as dictionary objects with a query element that defines a single conditioning variable expression for the multitable. The query expression might be a simple variable reference (e.g. {'query': 'race'}) or a function over a variable (e.g. {'query': 'bin(age)'}).

Template queries may also contain an optional transform element that provides a declarative definition of what the conditioning variable's dimension should look like after computation. For example:

new_mt = ds.create_multitable(
    name='new multitable',
    template=[
        {'query': 'bin(age)'},
        {'query': 'race'},
        {
            'query': 'gender',
            'transform': {
                'categories': [
                    {'id': 3, 'name': 'Undisclosed'},
                    {'id': 2, 'name': 'Male'},
                    {'id': 1, 'name': 'Female'}
                ]
            }
        }
    ]
)

The format of the transform element is described in http://docs.crunch.io/#transforming-analyses-for-presentation.

For template queries that don't have a transform element, scrunch supports using simple str objects to define them, in order to simplify things. With this in mind, the last example can be rewritten as:

new_mt = ds.create_multitable(
    name='new multitable',
    template=[
        'bin(age)',
        'race',
        {
            'query': 'gender',
            'transform': {
                'categories': [
                    {'id': 3, 'name': 'Undisclosed'},
                    {'id': 2, 'name': 'Male'},
                    {'id': 1, 'name': 'Female'}
                ]
            }
        }
    ]
)

More information on the underlying usage and format of template queries in the crunch API can be found at http://docs.crunch.io/#template-query.