---
title: "Feature Extraction Usage"
author: "Sam Borgeson"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Feature Extraction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
#TL;DR: all of these are valid ways to run feature extraction
```{r eval=FALSE}
library(visdom)
# No frills basic feature run
DATA_SOURCE = TestData(n=100) # 100 fake customers with random data
run_results = iterator.iterateMeters( DATA_SOURCE$getIds(),
custFn = visdom::basicFeatures,
as_df=T )
head(run_results)
# Feature function that runs a piecewise regression against outside temperature, using the tout1CP model
dailyCP = DescriptorGenerator( name='tout1CP',
genImpl=toutDailyCPGenerator,
subset=list(all="TRUE"),
cvReps=8) # 1 CP
regressionFeaturesfn = function(cust,ctx,...) { # returns regression model parameters as features
result = dailyCP$run(cust, as.daily.df(cust))
out = as.list(t(result$other))
names(out) = fixNames(rownames(result$other), prefix='dailyCP_')
return( out )
}
# you can run more than one feature function at a time if they are passed as a list
ctx = new.env( )
fnList = c(visdom::basicFeatures, regressionFeaturesfn)
run_results3 = iterator.iterateMeters( DATA_SOURCE$getIds(),
fnList,
ctx=ctx,
as_df=T )
# you can also add the function list to the execution context that VISDOM uses and tell it
# to look them up there under ctx$fnVector (you have to use that name)
ctx = new.env( )
ctx$fnVector=fnList
run_results2 = iterator.iterateMeters( DATA_SOURCE$getIds(),
iterator.callAllFromCtx,
ctx=ctx,
as_df=T )
# you can also run customers by zip code and only look up the related weather data once to improve speed:
iterator.iterateZip( DATA_SOURCE$getGeocodes(),
fnList,
ctx=ctx,
as_df=T )
# Or just run all meters in a single zip code
visdom::iterator.runZip('94503',
fnList,
ctx=ctx, as_df=T )
# you can even run a single meter, using either its id or its meter data class object:
iterator.runMeter( DATA_SOURCE$getIds()[1],
fnList,
ctx=ctx )
mdc = DATA_SOURCE$getMeterDataClass(DATA_SOURCE$getIds()[1])
iterator.runMeter( mdc,
fnList,
ctx=ctx )
```
#Details on running feature extraction using VISDOM
##Load the package
Load VISDOM (which loads its dependencies) and any supporting libraries your custom feature extraction code will require.
```{r}
library(visdom)
library(plyr)
library(acs)
```
##Load your custom data source
To use VISDOM, you must have a DataSource implementation that maps your source data to the VISDOM canonical data formats for meter data, meter ids, weather data, etc. You configure VISDOM to use your DataSource by assigning an instance of it to the global variable `DATA_SOURCE`. For this example we are using the `TestData()` data source, defined in `testDataSource.R`, which conforms to the data source interface requirements but generates loosely structured synthetic data (with coarse diurnal and seasonal changes).
```{r}
DATA_SOURCE = TestData(n=100) # Use random/fake test data for analysis
```
## Implement feature extraction methods
Next you must provide implementations of your feature extraction algorithms. The required output format of any feature function is a named list of values. First, we assign the VISDOM internal function `basicFeatures()` to the name `basicFeaturesfn`, just to demonstrate that functions can be referenced and called through variable names. Basic features include max/min/mean, range, variance, hour-of-day, peak timing, and other basic statistical extracts from the passed meter data object `meterData`. Note that `basicFeatures()` relies on the presence of weather data in the meter data object.
```{r}
basicFeaturesfn = visdom::basicFeatures
```
Here are some other examples of meter data feature functions. The first simply returns the zip code of the meter. The second calculates the mean daily peak hour of day: it takes the `kwMat` matrix of meter observations, which has one row per day and 24 (or 96) columns of readings, finds the column (i.e. the hour) of the maximum reading for each day, and averages that value across all days.
```{r}
custZipfn = function(meterData,ctx,...) {
return( list(zip5=meterData$zipcode) )
}
peakHOD = function(meterData,ctx,...) {
peakHOD = max.col( as.matrix(meterData$kwMat) )
return( list( meanPeakHOD = mean(peakHOD)) )
}
```
###Regression support
Next we do some configuration for regression-based features. First we instantiate a model descriptor with a call to a descriptor generator function that dynamically generates a feature descriptor for a single thermal change point model. The key is the `toutDailyCPGenerator` function found in `R/util-regression.R`. The implementation of DescriptorGenerator is also found in the same file.
```{r}
dailyCP = DescriptorGenerator( name='tout1CP',
genImpl=toutDailyCPGenerator,
subset=list(all="TRUE"),
cvReps=8) # 1 CP
```
The actual feature function calls `run()` on the generated descriptor, here called `dailyCP`, to run the specified regression model, with optional cross-validation, etc. as specified in the call to `DescriptorGenerator`. The model parameters of interest are stored under `other` in the object returned by `run()`. We return these with the prefix `dailyCP_` to distinguish them from the outputs of any other models in the same feature run.
```{r}
regressionFeaturesfn = function(cust,ctx,...) { # returns regression model parameters as features
result = dailyCP$run(cust, as.daily.df(cust))
out = as.list(t(result$other))
names(out) = fixNames(rownames(result$other), prefix='dailyCP_')
return( out )
}
```
Weather features can be calculated from the `WeatherClass` data.
```{r}
weatherFeaturesfn = function(cust,ctx,...) {
if( is.null( ctx$weatherFeatures ) ) {
ctx$weatherFeatures = list()
}
wf = ctx$weatherFeatures[[cust$weather$geocode]]
if( is.null( wf ) ) {
wf = weatherFeatures(cust$weather)
wf$zip5 <- NULL # remove the zip code, which we already have
ctx$weatherFeatures[[cust$weather$geocode]] = wf
print('computed weather features')
} else {
print('weather features found in ctx')
}
return(wf)
}
```
##Configure runtime context
Now we turn our attention to configuring the context object that will control the feature run and store shared data for it. The key parameters are:
1. `fnVector`, a list of all feature functions that will be invoked on each meter. Note that these reference the functions just defined above.
2. `dateFilter`, whose `start.date` and `end.date` (and optionally `DOW`, days of week) entries can be used by the underlying data source to restrict meter data to prescribed time periods.
3. Arbitrary parameters of your own devising (for example a value stored as `ctx$a`) can also be added. Each feature function call is passed the context as well, so you can make use of anything found within the context (i.e. placed there by you during this configuration stage) in your own functions.
Note that the context, here called `ctx`, is implemented as a new environment via `new.env()`, which allows values to be set into the context dynamically during code execution. In technical terms, standard R lists are not mutable - a new copy is created for every modification. In other words, they are passed around by value, so changes a function makes to a list are not visible to code that holds the original. Environments, on the other hand, are passed by reference, so any changes made to a referenced environment are visible to all code that references it.
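As a quick illustration of this difference (a minimal sketch, independent of VISDOM):
```{r}
# Lists are copied when modified inside a function, so the caller's copy is unchanged:
addToList = function(l) { l$x = 1; l }
myList = list()
tmp = addToList(myList)
names(myList)  # NULL - the original list was not modified

# Environments are passed by reference, so modifications persist for the caller:
addToEnv = function(e) { e$x = 1 }
myEnv = new.env()
addToEnv(myEnv)
ls(myEnv)      # "x" - the original environment was modified
```
With that in mind, the context for this example run is configured as follows: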
```{r}
ctx=new.env()
ctx$fnVector = c(custZipfn, basicFeatures, regressionFeaturesfn, weatherFeaturesfn, peakHOD)
# note that the dateFilter is only applied if the meter data is available via ctx$RAW_DATA, rather than via a direct lookup
ctx$dateFilter = list(DOW = 2:6,
start.date = as.Date('2013-05-15'),
end.date = as.Date('2013-10-15') )
```
##Running the feature extraction itself
The business end of feature extraction is the call to an iterator method that is passed meter ids and rules for calling feature functions on meter data objects created from the passed data. This single line runs features on every customer in the list of ids by using the ids to instantiate `CustomerData` objects via the `DATA_SOURCE`. It can be called during feature development and testing with just a handful of ids (as shown here), or with hundreds of thousands of ids for feature runs on rich data sets. Note that as a practical matter, runs on large samples of customers call for better support for parallel processing and error failover than the simple `iterator.iterateMeters` function provides. The call returns a list of lists of features: the outer list is indexed by customer id, and each entry is a named list of features for that customer.
```{r eval=FALSE}
aRunOut = iterator.iterateMeters( DATA_SOURCE$getIds()[1:10], # just 10 for speed
iterator.callAllFromCtx,
ctx=ctx,
extra='somedata')
```
A list of lists is returned because it is maximally flexible. Some feature functions may opt to return diagnostics, errors, residuals, models, etc. that are not scalar features, but have value in diagnosing problems, testing hypotheses, etc. To boil the list of lists down to a data frame of the scalar feature values, we call the utility function `iterator.todf()`.
```{r eval=FALSE}
runDF = iterator.todf( aRunOut )
```
With the feature data in hand and in a data frame, it can be saved as RData, incorporated into figures, merged with other customer- or meter-specific data or with results from other feature runs, etc.
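For example (a minimal sketch; the file name and `otherDF` are hypothetical placeholders):
```{r eval=FALSE}
# save the scalar feature data frame for later sessions
save(runDF, file='feature_run.RData')

# merge with another customer- or meter-level data frame that shares an 'id' column
combinedDF = merge(runDF, otherDF, by='id')
```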
#Advanced topics
##Exporting your data
Once your features are computed, you may logically want to save or export them. In `util-export.R`, there are several functions that are designed to support exports of your feature data to various useful formats.
###data for VISDOM-web
Feature data for VISDOM-web is exported with specific fields and naming requirements. The rules are:
1. You must include an `id` column and a `zip5` column, both of which should be text data, for example:
id | zip5 | all | other | features
-- | ----- | ----|------------|------
1 | 94610 | all | your other | feature data
2. The names of your exported features can only contain letters, numbers, and underscores (`_`). There is a utility function in `util-export.R` called `fixNames()` that will automatically convert all other punctuation to underscores; it is called on a data frame `featuredf` as follows: `names(featuredf) = fixNames(featuredf)`.
3. Your categorical data must be converted to character strings. There is a utility function `cleanFeatureDF()` that fixes the column names and converts the categorical data (see the sketch after this list).
4. There are several utility functions in `util-export.R` that can be used to cleanly export files.
5. Once you have exported data, see the VISDOM-web project documentation on data sources for information on how to configure the system to import and properly display your features.
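Putting rules 1-3 together, a minimal sketch of preparing a feature data frame for export might look like the following (it assumes `cleanFeatureDF()` takes and returns a data frame, and that `runDF` already has `id` and `zip5` columns):
```{r eval=FALSE}
exportDF = runDF
exportDF$id   = as.character(exportDF$id)    # rule 1: id as text
exportDF$zip5 = as.character(exportDF$zip5)  # rule 1: zip5 as text
exportDF = cleanFeatureDF(exportDF)          # rules 2 and 3: fix names, convert categorical data
```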
###csv export
The simplest export format compatible with VISDOM-web is csv. If you only have a single data frame of features and are not using more complicated data, like load shape encodings, it is a great way to get up and running quickly. Picking up on the feature run example, we can call the `util-export.R` function `exportData()`, which automatically cleans column names, converts factors to regular strings, and saves the results to a csv without row names (which corrupt data when imported into VISDOM-web).
```{r eval=F}
# optionally merge census data into the features by zip code (not used in the export below)
runPlusCensusDF = mergeCensus(runDF, zipCol='zip5')
# csv file written to getwd() location. csv extension automatically added
exportData(df=runDF, name='myFeatures', format='csv')
```
###database export
You can also save feature data to a database using a database connection with CREATE TABLE privileges. If `overwrite=TRUE`, the table will be re-written. Otherwise, the data will be appended to the table. See `visdom::writeDatabaseData` for options.
```{r eval=F}
library('RMySQL')
db_cfg_path = "db_connect.conf"
conn = conf.dbCon(dbCfg(db_cfg_path))
run_cfg_path = file.path(system.file(package="visdom"),
"feature_set_run_conf", "example_feature_set.conf")
# note datesToEpoch ensures that dates are stored as integer seconds since the epoch
exportData(df=datesToEpoch(runDF), name=NULL, format="database", conn=conn, overwrite=TRUE, runConfig=run_cfg_path)
```
##parallel processing
Official way to do it:
There is a function in `iterator.R` called `iterator.runMeter` that takes a meter id (one of whatever your `DATA_SOURCE$getIds()` returns), the feature extraction function, and a context object with additional config. The idea is that this function can be called as the target of `alply`, where you configure a `foreach` backend first and then use `.parallel = TRUE`.
However, `parallel`, `foreach`, etc. aren't explicit package dependencies of VISDOM. They are only suggested in the package DESCRIPTION, so you have to install them yourself.
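For example (the exact set of packages depends on the parallel backend you choose; `foreach` and `doParallel` are the ones used below):
```{r eval=FALSE}
install.packages(c('plyr', 'foreach', 'doParallel'))
```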
The previous example used (approximately) this one-liner to run features for all meters:
```{r eval=F}
aRunOut = iterator.iterateMeters(DATA_SOURCE$getIds(), iterator.callAllFromCtx, ctx=ctx)
```
The equivalent with alply is:
```{r eval=F}
aRunOut = alply( DATA_SOURCE$getIds(),
.margins = 1,
.fun = iterator.runMeter,
iterator.callAllFromCtx, ctx)
```
where you can add the `.parallel=T` option. Note that parallel backend configuration (the same configuration you set up for `foreach` support) is platform specific, is better supported by Revolution R (the open version is called RRO) than by standard R distributions, and requires decisions about multi-core vs. multi-machine parallelization. There are ample resources online addressing these issues, especially the [Getting Started with doParallel and foreach pdf](https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf). In general you will likely set up your foreach configuration by registering the doParallel backend, and ideally alply, etc. will handle the rest:
```{r eval=F}
library(doParallel)
nCores = parallel::detectCores()
registerDoParallel(cores=nCores) # setup parallel processing on multiple cores
aRunOut = plyr::alply( .data = DATA_SOURCE$getIds(),
.margins = 1,
.fun = iterator.runMeter,
iterator.callAllFromCtx, ctx,
.parallel=T,
.progress='text' )
```
However, this call has been shown to run serially for some users (i.e. no benefit from parallelization). An alternative using `foreach` directly is currently the recommended approach. Here we run all the meter data for each geocode (i.e. zip code in this example) within each parallel process:
```{r eval=F}
library(doParallel)
nCores = parallel::detectCores()
registerDoParallel(cores=nCores) # setup parallel processing on multiple cores
zips = DATA_SOURCE$getGeocodes(useCache=T)
zipRunOutParallel = foreach(i = seq_along(zips)) %dopar% {
zipRunOut = iterator.iterateZip( zipList = zips[i],
custFn = iterator.callAllFromCtx,
cacheResults = T,
ctx=ctx )
zipRunOut
}
# concat results
aRunOut = do.call(c, zipRunOutParallel)
```
###Alternate 1 (eliminating redundant data access with or without parallelization):
Note that one performance optimization that is pretty common when processing large numbers of meters is to load the weather data for one location and process all meters from that location with the weather data and all relevant meter data cached in the ctx. This can be a bigger performance boost than naively running each customer in parallel, where data access happens redundantly over and over. In this case, you might run alply across zip codes, calling a function that loads the weather data and all customer data for a zip code, places both into the ctx, and then iterates over all the customer ids for that zip code (the underlying code looks in the ctx for weather and customer data). No such function ships with VISDOM. The "official" support for processing customers by zip code is `iterator.iterateZip`, which is implemented using standard for loops, but it should be pretty clear from that what to do if you want a parallelizable version, as sketched below.
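A minimal sketch of one parallelizable variant, reusing the existing `iterator.iterateZip` (which looks up weather data once per zip code) inside each worker rather than the per-zip caching function described above (it assumes a doParallel backend has already been registered as shown earlier):
```{r eval=FALSE}
# each parallel worker processes all meters for one zip code, so weather data is
# loaded once per zip rather than once per meter
runOneZip = function(zip, ctx) {
  iterator.iterateZip( zipList = zip,
                       custFn = iterator.callAllFromCtx,
                       ctx = ctx )
}
zipResults = plyr::alply( DATA_SOURCE$getGeocodes(),
                          .margins = 1,
                          .fun = runOneZip,
                          ctx,
                          .parallel = TRUE )
aRunOut = do.call(c, zipResults)
```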
###Alternate 2 (subsetting meters using command line args):
In practice, users have often written a wrapper script that implements a form of parallelization by segmenting meter ids (or zip codes) into N even blocks and selecting one block based on command line arguments passed to the script. They can then invoke the script N times (even from N different machines) to cover all meters. The modest amount of manual effort to get these running can be encapsulated in a shell script and is small in the context of a multi-day run time. Note that on certain cluster resources, users can be restricted to processes that run for less than 24 hours (or some other fixed threshold). In this case, even with parallelization, you may need to subset your meters to ensure completion within the available time, and such scripts are a good way to achieve tunable run times.
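A minimal sketch of such a wrapper script (the file name, argument scheme, and output file names are all hypothetical; it assumes the `DATA_SOURCE` and `ctx` setup shown earlier):
```{r eval=FALSE}
# run_block.R - invoke as, e.g., `Rscript run_block.R 3 10` to run block 3 of 10
args     = commandArgs(trailingOnly = TRUE)
blockIdx = as.integer(args[1])   # which block to run (1..nBlocks)
nBlocks  = as.integer(args[2])   # total number of blocks

ids    = DATA_SOURCE$getIds()
blocks = split(ids, cut(seq_along(ids), nBlocks, labels = FALSE))
myIds  = blocks[[blockIdx]]

blockOut = iterator.iterateMeters(myIds, iterator.callAllFromCtx, ctx = ctx)
saveRDS(blockOut, file = sprintf('features_block_%02d.rds', blockIdx))
```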
###Caveats:
Note that as with all parallel operations, error handling is a bit tricky. `iterator.runMeter` traps errors and returns NAs after printing the error message so it can keep running. This allows the rest of the data to be processed. When running in parallel, where print statements go is poorly defined, so users may never see the printed error message. Thus, you have to be extra careful to investigate any NA values returned for any customers. In the future, we hope to build infrastructure that stores and returns error diagnostic information for each customer that has an error within the list of lists data structure, but for now parallel processing and related error handling are a partially unsupported "advanced" feature.
Also note that other than the error trapping, the official VISDOM code doesn't handle mid-stream failures very gracefully. Ideally users would have the option of incrementally saving results so they can recover from fatal errors that would otherwise lose valid features built up in memory (or so they can fill in NAs caused by errors without re-running all the good results). Parallel jobs tend to be all or nothing in their return values and can toss out viable computed values due to a data frame column name incompatibility or some other minor issue encountered while compiling results. In the past, VISDOM users have written scripts that save each zip code's worth of features as they are computed, and that save lists of meters eliminated by validation rules, so later failures do not force re-processing of good data or re-validation of meters. We may provide official versions of these capabilities via the `iterator.R` functions in the future, but advanced users should currently plan to write their own iterators and wrappers if graceful failure is desired.
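Until then, a minimal sketch of one such hand-rolled approach, saving each zip code's results as they complete so a later failure does not discard earlier work (directory and file names are arbitrary):
```{r eval=FALSE}
dir.create('feature_cache', showWarnings = FALSE)
for( zip in DATA_SOURCE$getGeocodes() ) {
  outFile = file.path('feature_cache', paste0('features_', zip, '.rds'))
  if( file.exists(outFile) ) next   # skip zip codes already completed in a previous run
  zipOut = try( iterator.iterateZip( zipList = zip,
                                     custFn = iterator.callAllFromCtx,
                                     ctx = ctx ) )
  if( ! inherits(zipOut, 'try-error') ) saveRDS(zipOut, outFile)
}
# later: reload and combine whatever completed successfully
files   = list.files('feature_cache', pattern = '\\.rds$', full.names = TRUE)
aRunOut = do.call(c, lapply(files, readRDS))
```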