Skip to content

Constrained changepoint detection

Toby Dylan Hocking edited this page Mar 14, 2018 · 9 revisions

Background

There are many R packages for changepoint detection, which is an important problem in many application domains (genomics, finance, etc). A new changepoint model with up-down constraints has shown state-of-the-art peak detection accuracy in genomic data (Hocking et al 2015). However, the Constrained Dynamic Programming Algorithm (CDPA) proposed in that paper is quadratic time (too slow for big data) and heuristic (not guaranteed to recover the optimal solution).

The goal of this project is to implement an optimal log-linear time algorithm for such constrained models, using the functional pruning technique (Maidstone et al 2016) that has been recently shown to work for constrained models (Hocking et al 2017). In particular we are interested in implementing several loss functions (Poisson, Normal, Binomial), and several constraint functions between adjacent segment means (no constraint, non-decreasing, non-increasing), for the optimal partitioning model (compute the best single segmentation model for a given penalty parameter).

Related work

The table below summarizes related R packages. The top half of the table is for algorithms that solve the segment neighborhood problem, which computes all models from 1 to K segments. The bottom half of the table is for algorithms that solve the optimal partitioning problem, which computes the best single model with K segments that corresponds to a given penalty parameter (and it does not compute the models from 1 to K-1 segments). Time complexity is for a sequence of N data.

function models time optimal constraints loss
Segmentor K O(K N log N) yes none several
cDPA K O(K N^2) no up-down Poisson
PeakSegPDPA K O(K N log N) yes up-down Poisson
Fpop 1 O(N log N) yes none Normal
PeakSegFPOP 1 O(N log N) yes up-down Poisson
this project 1 O(N log N) yes several several

The table above shows that previous work has implemented several loss functions for the unconstrained model (Segmentor), and the Poisson loss function for the up-down constrained model (PeakSegFPOP). The goal of this project is to implement a solver for several loss and constraint functions.

Coding project: solver for various loss and constraint functions

Implement a C++ function which implements the Generalized Functional Pruning Optimal Partitioning (GFPOP) algorithm described in the last section (B.5) of our arXiv paper. There should be an R interface for solving changepoint problems with several kinds of loss and constraint functions. The main R solver function should be the GFPOP function:

GFPOP(model, data.vec, weight.vec, penalty)

data.vec is a numeric vector of data to segment (with corresponding weights in weight.vec). The penalty is a non-negative numeric scalar which is added to the cost for each changepoint. The model specifies the loss and constraint functions. The constraints can be described in terms of a finite number of states and possible changes. For example the unconstrained model (same as fpop package) could be specified as one state with one possible unconstrained change:

library(data.table)
state <- function(state.name){
  structure(data.table(state.name), type="state")
}
change <- function(from, to, constraint, penalty=1){
  structure(data.table(from, to, penalty, constraint), type="change")
}
loss <- function(loss.name){
  structure(data.table(loss.name), type="loss")
}
unconstrained.Normal <- list(
  state("anything"),
  change("anything", "anything", "no.constraint"),
  loss("Normal"))

The reduced isotonic regression problem also has one state, but constrains all changes to be non-decreasing:

isotonic.Normal <- list(
  state("anything"),
  change("anything", "anything", "non.decreasing"),
  loss("Normal"))

The PeakSegFPOP solver can be described using two states, peak (after an up change) and background (after a down change). The model is also constrained to start and end in the background state:

start <- function(...){
  structure(data.table(state.name=c(...)), type="start")
}
end <- function(...){
  structure(data.table(state.name=c(...)), type="end")
}
PeakSeg.Poisson <- list(
  loss("Poisson"),
  state("peak"),
  state("background"),
  change("background", "peak", "non.decreasing"),
  change("peak", "background", "non.increasing", penalty=0),
  start("background"),
  end("background"))

The idea is that any of these models (or others) could be used as the model argument to GFPOP. This R code that describes the model should be converted into a format that can be passed to C++ code, along with a data set and penalty e.g.

class Change {//represents an edge in the constraint graph
  int from;
  int to;
  //EITHER (easy to implement based on current code)
  int constraint;//1=non-decreasing 2=non-decreasing 3=unconstrained
  //OR (harder to implement, more general)
  double a,b,c;//a*m_t + b*m_{t+1} + c \leq 0.
  double penalty;
};
void GFPOP(
  int n_data,
  double *data_vec,   //array[n_data]
  double *weight_vec, //array[n_data]
  double penalty, 
  int loss,  //1=Normal 2=Poisson 3=Binomial, etc
  int n_states, //number of vertices in constraint graph
  int n_changes, //number of edges in constraint graph
  Change *change_vec, //array[n_changes]
  int n_starts,              //number of states possible at first data point
  int *start_state_vec,      //array[n_starts] possible states at first data point
  int n_ends,                //number of states possible at last data point
  int *end_state_vec,        //array[n_ends] possible states at last data point
  //output below:
  int *optimal_state_vec,     //vector of optimal state/vertex for each segment
  int *optimal_change_vec,    //vector of optimal changes/edges
  double *optimal_mean_vec,     //vector of optimal segment means
  double *optimal_cost_vec,     //maybe just the last point?
  int *optimal_segment_end_vec, //last data point of each segment
  int *intervals_mat //number of intervals or cost function pieces
);

The ideal student will also write tests, documentation, vignettes and maybe even a blog.

Expected impact

This GSOC project will provide log-linear time code for solving a large class of constrained change-point problems. Because of the log-linear time, this code will be useful for large data sets. Because of the different constraints and loss functions, it will be useful in many different application domains.

Mentors

Please get in touch with Toby Dylan Hocking <[email protected]> and Guillem Rigaill <[email protected]> after completing at least one of the tests below.

FAQ

in the FPOP algorithm should we record all C(μ) for every possible μ?

Yes, C(μ) should be stored for every possible μ, using a list of coefficients and intervals representing a real-valued piecewise convex function, see the FPOP paper for details.

  • for the square loss store a list of a,b,c such that C(μ) = a μ^2 + b μ + c.
  • for the Poisson loss store a list of a,b,c such that C(μ) = a log(μ) + b μ + c (in the code these are a=Log, b=Linear, c=Constant in the PoissonLossPieceLog class).

the Medium test tells me to compare with `Segmentor3IsBack::Segmentor(model=1)’. But how can I specify penalty in Segmentor3IsBack?

The penalty can not be directly specified, but you can compute the solution to the penalized (optimal partitioning) problem based on the result of `Segmentor3IsBack::Segmentor`, which computes the solution to the Segment Neighborhood problem (best model in 1, …, K segments). The optimal partitioning problem is defined for any penalty λ\geq 0 as

  • optimization in terms of segment mean m: minm ∈ R^ni=1^n L(x_i, m_i) + λ * ∑i=1n-1 I(m_i != m_i+1)

which is equivalent to

  • optimization in terms of number of segments k, given optimal segment means m_i^k (computed via Segmentor): mink ∈ 1, …, ni=1^n L(x_i, m_i^k) + λ * k

where

  • x_i is the data
  • L is the loss function, square loss L(x, m)=(x-m)^2; Poisson loss L(x, m)= m - x log(m).

Tests

NOTE: this is a very difficult GSOC project, and so the ideal student really should do all tests below. If we can’t find a student that can do all these tests, we should probably not fund this GSOC project.

Easy: write two R functions checkModel and plotModel which input one of the models described above. checkModel should stop with an informative error if the model is invalid (e.g. if change/start/end reference an undefined state, or if an unrecognized loss/constraint function is used). Write a package and use testthat to write some test cases that make sure checkModel is working correctly, both for positive examples (valid models should not cause errors) and negative examples (invalid models should cause errors). plotModel should draw a node-and-edge graph that describes the model states and changes.

Medium: the Segmentor3IsBack package implements the segment neighborhood model (best models in 1,…,K segments) for the Poisson loss and no constraints, but there is no implementation available for the optimal partitioning model (best K-segment model for a single penalty parameter, without computing the models with 1,…,K-1 segments). The goal of this test is to modify the code in the coseg package, in order to implement a solver for the optimal partitioning problem with Poisson loss and no constraints. Begin by studying PeakSegFPOPLog.cpp which implements the optimal partitioning model for the Poisson loss and the up-down constraints. There are two states in this model, up and down. Since the up-down constrained solver has two states, there are N x 2 optimal cost functions to compute (cost_model_mat is of dimension data_count*2). The cost of being in the up state at data_i is cost_model_mat[data_i] and the cost of being in the down state is cost_model_mat[data_i+data_count]. The min_prev_cost.set_to_min_less_of(down_cost_prev) method enforces a non-decreasing constraint between adjacent segment means, for the state change down->up. Analogously, the PiecewisePoissonLossLog::set_to_min_more_of method enforces a non-increasing constraint for the state change up->down. To implement the unconstrained solver, you just need to implement a new PiecewisePoissonLossLog::set_to_unconstrained_min_of method that computes the minimum constant cost function (one PoissonLossPieceLog object), and then uses that to compute the N x 1 array of optimal cost functions (cost_model_mat). Read about the FPOP algorithm in Maidstone et al 2016 for more info. When you are done with your implementation, check your work by comparing with the output of Segmentor3IsBack::Segmentor(model=1). Perform model selection yourself for a range of penalty parameters. Using testthat, write some test cases which make sure your function gives the same exact model as the corresponding selected Segmentor model.

Hard: there is not yet an regularized isotonic regression solver for the normal loss (issue), and your goal in this test is to implement one. Like the unconstrained model described the Medium test, the regularized isotonic regression model also has only one state. So you can start by modifying the Medium test code, which should have a cost_model_mat which is N x 1. However the isotonic regression constraint means that all changes are non-decreasing, so you should use set_to_min_less_of instead of set_to_unconstrained_min_of. Now the difficult part: the existing code in the coseg package implements the Poisson loss via class PoissonLossPieceLog, but you need to implement another class for the Normal loss, NormalLossPiece. This class will need to declare different coefficients Constant, Linear, Quadratic for a function f(mean)=Constant + Linear*mean + Quadratic*mean^2. You will need to provide implementations for get_smaller_root and get_larger_root by using the sqrt function in #include <math.h>. It will be judged even better if you can get PoissonLossPieceLog and NormalLossPiece to inherit from the same base class with shared methods (that is the approach that the Segmentor package uses to implement several loss functions, and that is the approach that will be recommended for this GSOC project). Check your work by writing a testthat unit test to make sure that the model returned by your function with penalty=0 is the same as the model returned by the isoreg function (PAVA algorithm). Write another test that checks that the output model is the same as Fpop (when all changes are non-decreasing).

Solutions of tests

  • Students, please post a link to your test results here.
Clone this wiki locally