Templates for derived variables #98
Comments
YAML representations
Good idea from @apdjustino to write specs from the perspective of what the yaml representation of templates would look like as well. Here's what case 1 from above would look like, more or less. (ModelManager currently saves parameters in alphabetical order, which is an easy way to make the yaml files play nicely with git diffs. It might be better to customize the ordering for each template, but we haven't implemented that yet.)
modelmanager_version: 0.2.dev3
saved_object:
    autorun: true  # register column when yaml is loaded
    cache: true  # for orca
    cache_scope: iteration  # for orca
    column_name: pct_low_income
    expression: low_income*100/population
    name: pct_low_income  # name of saved object, we can provide good defaults
    table: blocks
    tags:
        - estimation
        - demographics
    template: ColumnFromExpression
    template_version: 0.2.dev2
This format is not really optimized for users to create yaml directly; the objective is more for it to be human-readable and editable while storing more metadata than Orca currently does. But if folks have ideas for improving the format, we should definitely explore them!
Multiple objects in a single yaml file
Adding a link to issue #104, where we're discussing what kind of super-structure to create for storing multiple columns and other associated info.
Implementation using settings objects
I'm part-way through building these templates, and I think it would be a good idea to implement them using the "settings objects" sketched out in issue #54. Here's what it would look like:
CoreTemplateSettings, for all templates
ExpressionSettings, for the ColumnFromExpression template
BroadcastSettings, for the ColumnFromBroadcast template
AggregationSettings, for the ColumnFromAggregation template
OutputColumnSettings, used by any template that generates or modifies a column
This way, the signature of a template can be much simpler -- no need to duplicate core/output parameters across many templates:
class ColumnFromBroadcast():
    """
    Parameters
    ----------
    meta : CoreTemplateSettings
    data : BroadcastSettings
    output : OutputColumnSettings
    """
Usage would look like this:
c = ColumnFromBroadcast()
c.data.tables = ['households', 'zones']
c.data.expression = 'residential_vacancy_rate * 100'
c.output.column_name = 'residential_vacancy_pct'
And the yaml file would remain similar, but group settings into three dicts representing the component settings objects.
Pros and cons
This will substantially reduce the amount of boilerplate code that's copied from template to template to support repeated parameters. I think it will also make it easier for users: for example, every template that creates a new column will accept the same settings and interpret them the same way. More shared code makes it easier to implement and test new templates, too. The main thing to worry about is that this does add another layer of abstraction to the code, which can make things more confusing and fragile. But I think the advantages outweigh this.
Implementation
I'll create the settings objects in a new PR, first adding them to the ColumnFromExpression template, which is already finished. Then I'll use them to build the rest of the column templates.
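To make the settings-object idea above concrete, here is a minimal sketch using only the standard library. The field names inside each settings class are assumptions borrowed from the yaml example and the usage snippet, and `to_dict()` is a hypothetical helper illustrating the "three dicts" grouping; none of this is the actual urbansim_templates implementation.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class CoreTemplateSettings:
    # Settings shared by every template (field names are assumptions)
    name: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    autorun: bool = True


@dataclass
class BroadcastSettings:
    # Settings specific to the ColumnFromBroadcast template
    tables: List[str] = field(default_factory=list)
    expression: Optional[str] = None


@dataclass
class OutputColumnSettings:
    # Settings shared by any template that creates or modifies a column
    column_name: Optional[str] = None
    cache: bool = True
    cache_scope: str = "iteration"


class ColumnFromBroadcast:
    def __init__(self, meta=None, data=None, output=None):
        self.meta = meta or CoreTemplateSettings()
        self.data = data or BroadcastSettings()
        self.output = output or OutputColumnSettings()

    def to_dict(self):
        # Hypothetical helper: one dict per component settings object
        return {
            "template": type(self).__name__,
            "meta": asdict(self.meta),
            "data": asdict(self.data),
            "output": asdict(self.output),
        }


c = ColumnFromBroadcast()
c.data.tables = ["households", "zones"]
c.data.expression = "residential_vacancy_rate * 100"
c.output.column_name = "residential_vacancy_pct"
settings = c.to_dict()
```

Because the core and output settings live in their own classes, a new template would only need to define its own data settings class.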
This issue is to plan out a set of templates for derived variables.
Overview
Each of these templates would generate an indexed pd.Series() associated with an Orca table: not a local column that's part of the wrapped DataFrame, but a stand-alone column that can be evaluated lazily but still accessed as part of the table.
Open questions:
Some resources:
variable_generators project: https://github.com/UDST/variable_generators
column_builder project: https://github.com/urbansim/column_builder
1. urbansim_templates.data.ColumnFromExpression()
Creates a column from a string expression, accepting anything that can be passed to df.eval(). This could be math, an existing column to duplicate, or something more complicated. Cannot involve columns from other tables, though.
Params:
column_name (column to be created)
destination_table
expression
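As a rough illustration of the expression template, the sketch below evaluates a string with `DataFrame.eval()` and returns a named, indexed Series. The function `column_from_expression` and the sample `blocks` data are hypothetical, and registering the result with Orca is left out.

```python
import pandas as pd

# Sample stand-in for the destination table (values are made up)
blocks = pd.DataFrame(
    {"low_income": [10, 25, 0], "population": [100, 50, 40]},
    index=pd.Index([101, 102, 103], name="block_id"),
)


def column_from_expression(df, column_name, expression):
    # DataFrame.eval() returns a Series for a non-assignment expression,
    # indexed like the source table
    series = df.eval(expression)
    return series.rename(column_name)


pct_low_income = column_from_expression(
    blocks, "pct_low_income", "low_income*100/population"
)
```

Only columns of the one table are visible to the expression, which matches the restriction that it cannot involve columns from other tables.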
2. urbansim_templates.data.ColumnFromBroadcast()
Creates a column by "broadcasting" coarse-grained data to another table, taking advantage of join key relationships. For example: adding the census tract id to the households table, or adding the mean zonal building height to the buildings table.
I think our implementation of this should not require Orca broadcasts, because of the limitations discussed in Issue #78. Instead, we can use overlapping column names as implicit join keys.
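One possible reading of "implicit join keys" can be sketched in plain pandas, without Orca broadcasts. The assumption here is that the source table's index name also appears as a column in the destination table; the function `broadcast_column` and the sample tables are hypothetical.

```python
import pandas as pd

# Coarse-grained source table, indexed by its own id (values are made up)
zones = pd.DataFrame(
    {"residential_vacancy_rate": [0.05, 0.10]},
    index=pd.Index([1, 2], name="zone_id"),
)
# Finer-grained destination table that carries the zone_id join key
households = pd.DataFrame(
    {"zone_id": [1, 2, 2]},
    index=pd.Index([10, 11, 12], name="household_id"),
)


def broadcast_column(source, destination, source_column, column_name):
    # Implicit join key (assumption): the source index name must also be
    # a column in the destination table
    key = source.index.name
    if key not in destination.columns:
        raise ValueError(f"'{key}' not found in destination table")
    # Look up the source value for each destination row's key
    values = destination[key].map(source[source_column])
    return values.rename(column_name)


vacancy = broadcast_column(
    zones, households, "residential_vacancy_rate", "residential_vacancy_rate"
)
```

The result is indexed like the destination table, so it could be attached there as a stand-alone column.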
Params:
column_name (column to be created)
destination_table
source_table (allow a chain, or not?)
source_column (column name or expression)
3. urbansim_templates.data.ColumnFromAggregation()
Creates a column by grouping and transforming finer-grained data, taking advantage of join key relationships among tables. For example: number of households per tract, mean home price by zone, etc.
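The aggregation case could look roughly like the following pandas sketch, which groups the source table and aligns the result to the destination table's index. The function `aggregate_column` and the sample data are hypothetical, and the sketch assumes the group-by key is the destination table's index.

```python
import pandas as pd

# Finer-grained source table (values are made up)
households = pd.DataFrame(
    {"tract_id": [1, 1, 2], "income": [40000, 60000, 50000]},
    index=pd.Index([10, 11, 12], name="household_id"),
)
# Destination table; tract 3 has no households
tracts = pd.DataFrame(index=pd.Index([1, 2, 3], name="tract_id"))


def aggregate_column(source, destination, source_column, group_by,
                     aggregation, column_name):
    # Group the source rows by the join key and apply the aggregation
    grouped = source.groupby(group_by)[source_column].agg(aggregation)
    # Align to the destination index; keys with no source rows become NaN
    return grouped.reindex(destination.index).rename(column_name)


mean_income = aggregate_column(
    households, tracts, "income", "tract_id", "mean", "mean_income"
)
```

How to fill destination rows with no matching source rows (NaN here) is a design choice a real template would need to settle.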
Params:
column_name (column to be created)
destination_table
source_table
source_column (column name or expression)
filters (?)
group_by (column name, must appear in source table and destination table)
aggregation (min, max, mean, count, sum, stdev, etc.)
4. urbansim_templates.data.ColumnFromNetwork()
Creates a column from a Pandana network aggregation. (Params will need to be fine-tuned a bit.)
Params:
column_name (column to be created)
network
radius
aggregation (min, max, mean, count, sum, etc.)
decay