Associate Editor: MP Etienne
Reviewer: Emilie Lebarbier chose to lift her anonymity
Reviewing history
Paper submitted June 2023
Reviewer invited June 27th, 2023
Review 1 received November 12th, 2023
Paper revised February 26th, 2024
Reviewer invited March 14th, 2024
Review 2 received May 6th, 2024
Paper conditionally accepted June 3rd, 2024
First Round
The development of efficient algorithms is a real challenge in segmentation, in particular in high-dimensional contexts (long series and/or many series). Since the univariate case has already been addressed by the two pruning algorithms considered here, their extension to the case of several series is relevant. Indeed, in many applications it is common to have a set of series and to assume that the series (or a large fraction of them) are affected by the same changes. This work therefore makes an interesting and important contribution to the computational aspects of segmentation problems.
The article tackles the computational problem of detecting change-points in a set of independent time series when the change-points are assumed to be located at the same positions for all the series (which is equivalent to the computational problem of detecting change-points in a multivariate series). The objective is to provide an algorithm that solves the optimization problem of searching for the change-points (the minimization of a penalized contrast) in an exact and computationally efficient manner. To do this, the authors propose to extend the two algorithms PELT \cite{killick_pelt} and FPOP \cite{Maidstone2016} to this multivariate case. These algorithms are recent accelerated versions based on two different pruning techniques, and they make it possible to cope with the large dimension of the data. While the extension of PELT turns out to be simple, this is not the case for FPOP. The authors thus propose a geometrical extension of the update rule of the algorithm, and several variants of this update rule have been considered. A substantial simulation study was carried out to assess the computational efficiency of the proposed extensions and their limitations, as well as to compare them with PELT. Finally, the authors have developed an R package named GeomFPOP, written in R/C, which is available on GitHub. This package implements several algorithms: the extensions of OP and PELT, and the proposed algorithm GeomFPOP.
However, I have some comments/questions/remarks, and I have also noted many typos.
Comments on the paper
The authors must choose between the two following formulations of the problem throughout the paper: (i) segmentation of several series, or (ii) segmentation of a multivariate series. The two problems are completely equivalent except for one (important) detail: the multiple series of (i) must be affected by the same change-points. This case is sometimes referred to in the literature as 'simultaneous segmentation'. If the authors choose formulation (i), this assumption must be stated each time (in the abstract, the introduction, the model, etc.). And if the authors choose formulation (ii) (which is generally the case in the paper), this equivalence remark could be stated.
For example, the sentence 'We consider the problem of change point detection in multiple time series of length and dimension ' (first line of Section 1.1) is not clear, since it mixes (i) and (ii).
Both algorithms require conditions on the cost to be optimized: (i) segment-additivity for PELT and point-additivity for FPOP, and (ii) the cost is a one-parameter function, convex with respect to its parameter. The extensions must not escape these conditions (as we can see in equation (1) of the paper). It would be good to understand what the limitations of these extensions are, or rather which distributions and cost/penalty functions are allowed. I thus think these conditions must be clearly stated.
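These two conditions lend themselves to a small illustration. Below is a minimal Python sketch (my own, with hypothetical names; not the authors' implementation) of the plain OP recursion with a Gaussian cost that satisfies both: the criterion is segment-additive (a sum of segment costs plus one penalty per segment), and the multivariate segment cost is point-additive across dimensions (a sum over coordinates).

```python
import numpy as np

def gauss_segment_cost(cumsum, cumsum2, s, t):
    """Cost of segment y_{s+1..t}: within-segment sum of squares,
    summed over the p dimensions (point-additivity across coordinates)."""
    n = t - s
    seg_sum = cumsum[t] - cumsum[s]       # shape (p,)
    seg_sum2 = cumsum2[t] - cumsum2[s]
    return float(np.sum(seg_sum2 - seg_sum ** 2 / n))

def optimal_partitioning(y, beta):
    """Exact OP recursion F(t) = min_s F(s) + C(y_{s+1..t}) + beta.
    y has shape (n, p); beta is the per-segment penalty."""
    n, p = y.shape
    cumsum = np.vstack([np.zeros(p), np.cumsum(y, axis=0)])
    cumsum2 = np.vstack([np.zeros(p), np.cumsum(y ** 2, axis=0)])
    F = [0.0] + [np.inf] * n
    last = [0] * (n + 1)
    for t in range(1, n + 1):
        for s in range(t):
            cand = F[s] + gauss_segment_cost(cumsum, cumsum2, s, t) + beta
            if cand < F[t]:
                F[t], last[t] = cand, s
    cps, t = [], n                # backtrack the optimal change-points
    while t > 0:
        t = last[t]
        if t > 0:
            cps.append(t)
    return sorted(cps)
```

A cost violating (i), for example one coupling distinct segments, would break the recursion F(t) = min_s F(s) + C(y_{s+1:t}) + beta that PELT and FPOP both accelerate; a non-convex or multi-parameter per-point cost would break the functional pruning of FPOP.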
The authors use the term 'cost function', or simply 'cost', for different quantities: the cost of a segment, the cost of a point, the cost of a segmentation. To be clearer, you could use a different word for each cost, or specify each time which cost is meant. For example, in the last paragraph of Subsection 1.1 on the penalty, the authors say 'We consider a penalized version of the cost by a penalty …'. For me, this cost referred to the global one, but I realized just afterwards that it is the cost of a segment and that the authors use a (global) penalty.
At the end of the paragraph of Subsection 1.1, the authors mention penalties that are not proportional to the number of segments (Zhang and Siegmund 2007; Lebarbier 2005; Verzelen et al. 2020). But these penalties cannot be considered here: the penalties allowed here are proportional to the number of segments, as we can see from equation (2). This point needs to be clearly stated and discussed.
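The reason is worth one line: with a penalty beta per segment, the penalized criterion of equation (2) takes (in my notation, which may differ slightly from the paper's) the form

```latex
Q_n \;=\; \min_{K \ge 1}\; \min_{0=\tau_0 < \tau_1 < \dots < \tau_K = n}\;
\sum_{k=1}^{K} \Big( C\big(y_{\tau_{k-1}+1:\tau_k}\big) + \beta \Big)
```

and it is exactly this additive per-segment structure that allows the dynamic-programming recursion $F(t) = \min_s \{F(s) + C(y_{s+1:t}) + \beta\}$; a penalty depending non-linearly on the number of segments $K$, as in the three cited references, does not decompose this way.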
In the presentation of the structure of the paper, the authors state that in Section 1 'We then review the existing pruned dynamic programming methods for solving this problem', but they present only FPOP. I think the authors could first recall that FPOP is a version of OP which uses the functional pruning of PDPA. Then a review of PELT, even a brief one, could be given, since the authors use its extension in the simulations.
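Such a brief review of PELT could fit in a few lines. As a hedged illustration (my own sketch with a univariate Gaussian cost and pruning constant K = 0, not the authors' code), inequality-based pruning simply discards a candidate last change-point s from the OP minimization as soon as F(s) + C(y_{s+1:t}) exceeds F(t), since such an s can then never be optimal at any later time point:

```python
import numpy as np

def pelt_1d(y, beta):
    """Optimal partitioning with PELT's inequality-based pruning.
    Univariate Gaussian cost, pruning constant K = 0. The pruning is
    exact: only candidates provably suboptimal for every future time
    point are removed."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    cs = np.concatenate([[0.0], np.cumsum(y)])
    cs2 = np.concatenate([[0.0], np.cumsum(y ** 2)])

    def cost(s, t):  # within-segment sum of squares of y_{s+1..t}
        return cs2[t] - cs2[s] - (cs[t] - cs[s]) ** 2 / (t - s)

    F = [-beta] + [np.inf] * n   # Killick et al.'s convention F(0) = -beta
    last = [0] * (n + 1)
    cands = [0]                  # surviving candidate last change-points
    for t in range(1, n + 1):
        vals = [F[s] + cost(s, t) + beta for s in cands]
        i = int(np.argmin(vals))
        F[t], last[t] = vals[i], cands[i]
        # prune s whenever F(s) + C(y_{s+1:t}) > F(t): it can never win later
        cands = [s for s, v in zip(cands, vals) if v - beta <= F[t]]
        cands.append(t)
    cps, t = [], n               # backtrack the optimal change-points
    while last[t] > 0:
        cps.append(last[t])
        t = last[t]
    return sorted(cps)
```

The multivariate extension used in the paper only replaces this univariate cost by the multivariate one; the pruning inequality itself is unchanged, which is why the extension of PELT is straightforward.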
In Figures 2, 3 and 4: I do not understand why the x-axis and the y-axis are and instead of and , respectively.
At the beginning of Section 2.1 (first sentence), the authors say that 'the sets ' are not easy to update. Could the authors clarify why in the section before? Or, at least, move the sentence 'From Proposition 1.1 we have that starting from , the set is obtained by successively applying two types of operations: intersection with an S-type set S ( ) or subtraction of an S-type set S ( )' to the previous section.
The authors consider for the simulations the penalty (based on the one proposed by Yao (1984)). Even if this model selection issue is not the purpose of this paper, it is delicate in segmentation, and I have two questions. First, the penalty of Yao (1984) includes the variance term in the segmentation of a Gaussian series. Why is it not included in this penalty? Then, one could wonder whether is really the number of observations (in which case we have to take ) or whether we have to account for the total number of observations, in which case the logarithm term becomes . Did the authors test other penalties? Do the authors have an opinion on this?
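To fix ideas on these two questions (in my notation, under the Gaussian model with p series of length n, a common known variance sigma^2, and all coordinates affected by each change), a Yao/Schwarz-type penalty per change-point would read either

```latex
\beta_1 \;=\; 2\sigma^2\, p \log n
\qquad\text{or}\qquad
\beta_2 \;=\; 2\sigma^2\, p \log(np),
```

depending on whether the logarithm counts the series length or the total number of observations; the two choices differ by the additive term $2\sigma^2 p \log p$, which is no longer negligible when $p$ is large.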
The random strategy drastically reduces the run times. But it is not clear to me whether, in this case, we still retrieve the exact solution.
For the simulation study discussed in paragraph 4.5, do the authors have an explanation for the behaviour of the curves? More precisely, for the fact that the curves suddenly increase from a large number of segments onwards?
In Section 5.1, the authors consider three different models. For the negative binomial distribution, they assume that the change-points affect only one of the two parameters, the probability, while the overdispersion parameter stays constant and known. Do the authors have any idea how to estimate it or choose it beforehand?
In the Gaussian case, the variance of each series is assumed to be the same, which is not necessarily true in practice. Is it possible to extend to the case where the variance is series-specific?
Miscellaneous
page 1, line -3 of the abstract: 'small dimension (2,3,4)': the numbers are not necessary; specify that 'small dimension' means a small set of series, or write 'for very few series'.
page 3, line 2: this is rather 'the log-likelihood'.
page 3, lines 6-7: rather 'the number of non-negligible change-points is comparable to '.
page 3, line 10: what does 'multi-parametric models' mean?
page 3, line -6: 'Each segment is generated by independent random variables …'. It is a little inadequate to say that a segment is generated. This is rather: 'The data of the segment are generated …'.
page 3, line -1: 'the segmentation of data points between positions 1 to t': this is the segmentation of data points between positions 1 and t in segments.
page 4, title of Section 1.2: is this FPDPA or FPOP?
page 4, line 3: ‘calculated at the’ instead of ‘linked to’
page 4, line 7: 'the independence hypothesis between dimension': to be changed to 'the independence hypothesis between series' (depending on the chosen formulation of the problem, but independence between dimensions is not statistically relevant).
page 4, last sentence of Subsection 1.1: the optimal segmentation is denoted , as in the sequel. Moreover, if is meant to be the penalized cost of the best segmentation, a minimum over is missing. Otherwise, this is the penalized cost of the best segmentation in segments.
page 4, line -9: . Is this not rather ? Same for .
page 4, in the definition of the recursion, a ')' is missing.
page 4, line -4: 'as for each integer': remove 'as'.
page 4, in equation (3): why do the authors use the notation ? If I understand well, .
page 4: there is a problem with the reference 'Zhang and David 2007'. This is 'Zhang and Siegmund 2007'.
page 5, in the definition of and line -2: 'i-1': an optimal break cannot be 0.
page 5, line -13: 'generation': to be changed to 'distribution'.
page 5, line -3 and in the sequel: 'bivariate' instead of 'bi-variate'.
page 6, line -4: 'are also described by relation' to be changed to 'can thus be written as', and put .
page 6, line -3: is this or , as in Figure 2?
page 6, line -1: in the definition of the past set, is this not for from to ? And in the future set, is it not from to ?
page 7, in Figure 2: , as in the other examples, instead of .
page 7, Proposition 1.1: the functional cost , as for the functional cost . Change the first sentence to 'At iteration , the sets are defined for a functional cost '. Give the equation number of the definition of the sets , and use large \bigcap symbols.
page 8, line 1: in order to understand this paragraph, we need the definition of the inequality-based pruning of PELT (i.e. a brief review of PELT; see the remarks on Section 1.2).
page 8, line 2: is this or ?
page 8, Figure 3: the example could be more detailed, and the set Zit drawn in another color, so that this example is informative.
page 10, line 1: 'affect', not 'affects'.
page 13, line 2 of paragraph 4.1: change 'signals' to 'series', as done until now.
page 13, line 1 of paragraph 4.1: 'with $10^4$ data points (without change, i.e. i.i.d. $\mathcal{N}_p(0,I_p)$)'. To be clearer: ' independent Gaussian series of length each'. Same remark page 15, line 7: '(without change, i.e. i.i.d. ….)'.
page 14, caption of Figure 5: say only that this corresponds to the 'average over 100 datasets'. Here, that means that only are generated times, but it is more, since this is times.
page 15, line 6: 'length ranging'. Say that this is $n$.
page 15, first line of paragraph 4.4: specify that $\alpha$ is the slope of the curves as a function of the length of the series?
page 15, line -5: 'with data points': specify that this is .
page 16, line 1: remove 'even'.
Some general typos:
Do not capitalize words in the middle of the text, as for example in 'Parametric Multivariate Models'.
Change 'opposite log-likelihood' to 'negative log-likelihood'.
Add commas for better reading. For example, change 'In what follows we will' to 'In what follows, we will'.
Comments on the R package
When I tested the package (all the algorithms, with the different types and options) on the example given on GitHub, with dimension equal to 2 and one change-point, I got a warning:
'Warning message:
In getChanges[5] <- getChangePoints(data = tsOneChange, penalty = Penality, :
number of items to replace is not a multiple of replacement length'
This is because a pair of brackets is missing: getChanges is a list, so 'getChanges[5]' should be replaced by 'getChanges[[5]]'.
The authors could also return the cost of the optimal segmentation among the outputs of the getChangePoints function.
The package does not handle missing data, which is common in practice. Would it be complicated to adapt the code to missing data?