
On study for change point regression problems using a difference-based regression model

Jong Suk Park^a, Chun Gun Park^{1,b}, Kyeong Eun Lee^a

aDepartment of Statistics, Kyungpook National University, Korea;
bDepartment of Mathematics, Kyonggi University, Korea
Correspondence to: 1Department of Mathematics, Kyonggi University, 154-42, Gwanggyosan-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do 16227, Korea. E-mail: cgpark@kgu.ac.kr
Received May 9, 2019; Revised September 26, 2019; Accepted October 11, 2019.
Abstract
This paper derives a method for solving change point regression problems using the properties of a difference-based intercept estimator, first introduced by Park and Kim (Communications in Statistics - Theory and Methods, 2019) for outlier detection in multiple linear regression models. We describe the statistical properties of the difference-based regression model in a piecewise simple linear regression model and then propose an efficient algorithm for change point detection. We illustrate the merits of the proposed method through comparisons with several existing methods in simulation studies and real data analysis. The methodology is valuable regardless of the form of the regression lines and the number of change points.
Keywords : change point, difference-based intercept estimator, difference-based regression model, piecewise linear regression
1. Introduction

A piecewise linear regression model is a special type of relationship between a dependent variable and one (or more) explanatory variables that consists of piecewise lines. In this model, the matter of interest is the set of boundary points of the adjacent lines, called break points, change points, or joinpoints (Muggeo, 2008). Change point problems occur in fields such as molecular biology, machine learning, and econometrics. The results of statistical inference can be misleading if change points exist; consequently, it is important to ascertain the threshold values at which the effect of the independent variable changes (Ulm, 1991; Betts et al., 2007; Muggeo, 2008).

A significant literature on change point regression problems has developed in many fields since the 1950s (Quandt, 1958). Early studies focused on testing for change points rather than estimating them (Kim and Siegmund, 1989; Andrews, 1993; Andrews and Ploberger, 1994), whereas later work focused on estimation (Loader, 1996; Bai, 1997; Julious, 2001; Muggeo, 2003; Zhou and Liang, 2008). Software for detecting change points has been developed accordingly. R packages for these problems include strucchange (Zeileis et al., 2002), segmented (Muggeo, 2008), and chngpt (Fong et al., 2017), each of which works only under certain conditions: a continuous or discontinuous type, a single covariate or multiple covariates, and no change point, one change point, or multiple change points. For example, the segmented package (Muggeo, 2003, 2008) supports the continuous type of mean function, which is a connected line at each change point, and allows multiple change points. The chngpt package (Fong et al., 2017) supports both continuous and discontinuous types in logistic as well as linear regression models, but is limited to two-phase regression models. Most R packages for change point detection support only one covariate; the exception is segmented, which handles several covariates under strong conditions (https://cran.r-project.org/web/packages/segmented/segmented.pdf).

Park and Kim (2019) first introduced the difference-based regression model (DBRM), which is useful for outlier detection in multiple linear regression models. Their outlier detection approach uses a difference-based intercept estimator that is sensitive to anomalous data. Based on the DBRM, we can apply the properties of this estimator to the piecewise regression model. In particular, we propose an efficient algorithm for change point detection in a piecewise simple linear regression model (PSLR). Compared to the methods mentioned above, the proposed method has the advantage that it can be applied in various circumstances: continuous or discontinuous types, a single covariate or multiple covariates, and no change point, one change point, or multiple change points.

The remainder of this paper is organized as follows. In Section 2, we briefly describe piecewise linear regression models. In Section 3, we utilize the process of the DBRM (Park and Kim, 2019) and derive the statistical properties of the difference-based coefficient estimator in the PSLR. An algorithm to detect change points will then be given in Section 4. In Section 5 and Section 6, we illustrate the merits of our proposed method in comparison with several existing methods by simulation studies and real data analysis. The article concludes with a discussion in Section 7.

2. Piecewise linear regression model

In this section, we consider a piecewise regression model, using the multiple change point regression model described by Chen et al. (2011). Assume that the number of regimes is r + 1 with r change points, the response variable is y = (y1, y2, …, yN)′, and the number of regressors is p, where the change points are defined in terms of only one of the regressors, x. The other regressors are denoted z1, …, zp−1. The general piecewise linear regression model is then:

$$y_i = \begin{cases}
\alpha_{1,1} + \beta_{1,1} x_i + \sum_{l=2}^{p} \beta_l z_{i,l-1} + \epsilon_i, & \text{if } i \le n_1, \\
\alpha_{1,2} + \beta_{1,2} x_i + \sum_{l=2}^{p} \beta_l z_{i,l-1} + \epsilon_i, & \text{if } n_1 < i \le n_1 + n_2, \\
\quad\vdots \\
\alpha_{1,m} + \beta_{1,m} x_i + \sum_{l=2}^{p} \beta_l z_{i,l-1} + \epsilon_i, & \text{if } \sum_{l=1}^{m-1} n_l < i \le \sum_{l=1}^{m} n_l, \\
\quad\vdots \\
\alpha_{1,r+1} + \beta_{1,r+1} x_i + \sum_{l=2}^{p} \beta_l z_{i,l-1} + \epsilon_i, & \text{if } \sum_{l=1}^{r} n_l < i \le N.
\end{cases} \tag{2.1}$$

Here, N is the total number of observations, and δ = (δ1, δ2, …, δr)′ is the r × 1 vector of change point parameters, which satisfy δ1 < δ2 < · · · < δr. The count n1 is the number of observations in the first regime (xi ≤ δ1), nm is the number of observations in the mth regime (δm−1 < xi ≤ δm) for m = 2, …, r, and $n_{r+1} = N - \sum_{m=1}^{r} n_m$.

Hawkins (1980) classified piecewise linear regression models as continuous or discontinuous. Continuous means that the regression function is a connected line at each change point δm, that is, α1,m + β1,mδm = α1,m+1 + β1,m+1δm for m = 1, 2, …, r. If this condition is not satisfied, the model is discontinuous.

Below, we use Equation (2.1) with r = 1 and p = 1, assuming that the change point location δ is unknown and that the model has two regimes. Without loss of generality, assume that n = #(xi ≤ δ) (i.e., x1 ≤ · · · ≤ xn < δ ≤ xn+1 ≤ · · · ≤ xN). The model can then be expressed as:

$$y_i = \begin{cases} \alpha_1 + \beta_1 x_i + \epsilon_i, & \text{if } i \le n, \\ \alpha_2 + \beta_2 x_i + \epsilon_i, & \text{if } i > n, \end{cases} \tag{2.2}$$

where α1, α2, β1, and β2 are unknown parameters and $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. For such two-phase linear models, Fong et al. (2017) distinguished four types: step, hinge, segmented, and stegmented. We address change point problems using Model (2.2) and use these four types for illustration.
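To make the setup concrete, data from Model (2.2) can be generated as in the following Python sketch. This is our illustration, not the paper's code; the parameter values are chosen for illustration, and α2 is set so the two lines meet at the change point δ (the continuous, "segmented" configuration).

```python
import numpy as np

# Illustrative sketch of Model (2.2), continuous ("segmented") type.
rng = np.random.default_rng(1)
N, n = 50, 25
x = np.sort(rng.uniform(0, 10, size=N))    # ordered covariate
delta = x[n - 1]                           # change point location (x_n, 1-based)

alpha1, beta1, beta2 = 5.0, -4.0, 2.0
alpha2 = alpha1 + (beta1 - beta2) * delta  # forces the two lines to meet at delta

eps = rng.normal(0.0, 1.0, size=N)
y = np.where(np.arange(N) < n,
             alpha1 + beta1 * x,           # first regime: i <= n
             alpha2 + beta2 * x) + eps     # second regime: i > n
```

Setting α2 = α1 + (β1 − β2)δ enforces the continuity condition of Section 2; dropping it yields a discontinuous (step or stegmented) configuration.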

3. Difference-based regression model for PSLR

### 3.1. Fitting PSLR using difference-based regression model

In this section, we introduce the DBRM (Park and Kim, 2019) and explain how to apply its process to the PSLR. Park and Kim (2019) proposed an outlier detection approach using the properties of an intercept estimator in the DBRM. The method uses only the estimator of the intercept; it does not require estimating the other parameters of the DBRM. Here, we use the same process to detect change points.

First, we describe the DBRM without change points. The simple linear regression model is $y_i = \alpha + \beta x_i + \epsilon_i$, i = 1, …, N. Let $y_{(k)i}$ denote the difference between yi and yk, and $x_{(k)i}$ the difference between xi and xk, for k = u, …, N. Here, u is the minimum number of observations needed to estimate the parameters, and we set u = 5. We also restrict i = 1, …, k − 1 because of the effect of the change point. The DBRM for the kth observation is written as

$$y_{(k)i} \mid \epsilon_k = -\epsilon_k + \beta x_{(k)i} + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \tag{3.1}$$

where β is the slope of the regression line.

Second, we describe the DBRM with a change point using Equation (2.2). The model is divided into two parts according to the value of k relative to the location of the change point. For the first part (k ≤ n), the DBRM for the kth observation is

$$y_{(k)i} \mid \epsilon_k = -\epsilon_k + \beta_1 x_{(k)i} + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \tag{3.2}$$

where β1 is the slope of the first regression line. For the second part (k > n), the DBRM for the kth observation is

$$y_{(k)i} \mid \epsilon_k = -\epsilon_k + \alpha_1 - \alpha_2 + (\beta_1 - \beta_2) x_k + \beta_1 x_{(k)i} + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \tag{3.3}$$

where α1, α2, β1, and β2 are the coefficients of the regression lines in Equation (2.2). For k = u, …, N, Models (3.1), (3.2), and (3.3) can all be expressed as a simple linear regression model:

$$y_{(k)i} \mid \epsilon_k = \alpha_{(k)} + \beta_{(k)} x_{(k)i} + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \tag{3.4}$$

where i = 1, …, k − 1, and α(k) and β(k) are regression coefficients of these models for the kth observation. In simple notation, the matrix form is as follows.

$$\begin{pmatrix} y_1 - y_k \\ y_2 - y_k \\ \vdots \\ y_{k-1} - y_k \end{pmatrix} = \begin{pmatrix} 1 & (x_1 - x_k) \\ 1 & (x_2 - x_k) \\ \vdots & \vdots \\ 1 & (x_{k-1} - x_k) \end{pmatrix} \begin{pmatrix} \alpha_{(k)} \\ \beta_{(k)} \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{k-1} \end{pmatrix}.$$

We then estimate the kth coefficients α(k) and β(k) in the DBRM (3.4) by least squares. For the DBRM without a change point, this yields the intercept estimator $\hat{\alpha}_{(k)} \mid \epsilon_k$ and slope estimator $\hat{\beta}_{(k)} \mid \epsilon_k$:

$$\hat{\alpha}_{(k)} \mid \epsilon_k = -(y_k - \bar{y}_{(k)}) + (x_k - \bar{x}_{(k)}) (\hat{\beta}_{(k)} \mid \epsilon_k), \qquad \hat{\beta}_{(k)} \mid \epsilon_k = \frac{\sum_{j=1}^{k-1} (x_j - \bar{x}_{(k)})(y_j - \bar{y}_{(k)})}{\sum_{j=1}^{k-1} (x_j - \bar{x}_{(k)})^2}, \tag{3.5}$$

where $\bar{x}_{(k)} = \sum_{j=1}^{k-1} x_j / (k-1)$ and $\bar{y}_{(k)} = \sum_{j=1}^{k-1} y_j / (k-1)$. For the DBRM with a change point, the result is divided into two parts according to where k is located. When k = n (the number of observations in the first regime), $\hat{\alpha}_{(n)} \mid \epsilon_n$ and $\hat{\beta}_{(n)} \mid \epsilon_n$ are expressed as

$$\hat{\alpha}_{(n)} \mid \epsilon_n = -(y_n - \bar{y}_{(n)}) + (x_n - \bar{x}_{(n)}) (\hat{\beta}_{(n)} \mid \epsilon_n), \qquad \hat{\beta}_{(n)} \mid \epsilon_n = \frac{\sum_{j=1}^{n-1} (x_j - \bar{x}_{(n)})(y_j - \bar{y}_{(n)})}{\sum_{j=1}^{n-1} (x_j - \bar{x}_{(n)})^2}, \tag{3.6}$$

where $\bar{x}_{(n)} = \sum_{j=1}^{n-1} x_j / (n-1)$ and $\bar{y}_{(n)} = \sum_{j=1}^{n-1} y_j / (n-1)$. For k = n + 1, $\hat{\alpha}_{(n+1)} \mid \epsilon_{n+1}$ and $\hat{\beta}_{(n+1)} \mid \epsilon_{n+1}$ can be written in terms of Equation (3.6) as

$$\hat{\alpha}_{(n+1)} \mid \epsilon_{n+1} = (n-1) R_{n+1} (\hat{\alpha}_{(n)} \mid \epsilon_n) - (y_{n+1} - y_n) + (x_{n+1} - x_n)(\hat{\beta}_{(n)} \mid \epsilon_n), \qquad \hat{\beta}_{(n+1)} \mid \epsilon_{n+1} = P_{n+1} (\hat{\beta}_{(n)} \mid \epsilon_n) + Q_{n+1} (y_n - \bar{y}_{(n)}), \tag{3.7}$$

where

$$R_{n+1} = \frac{\sum_{j=1}^{n-1} (x_j - \bar{x}_{(n)})^2 - (x_{n+1} - x_n)(x_n - \bar{x}_{(n)})}{n \sum_{j=1}^{n-1} (x_j - \bar{x}_{(n)})^2 + (n-1)(x_n - \bar{x}_{(n)})^2}, \qquad
P_{n+1} = \frac{n \sum_{j=1}^{n-1} (x_j - \bar{x}_{(n)})^2}{n \sum_{j=1}^{n-1} (x_j - \bar{x}_{(n)})^2 + (n-1)(x_n - \bar{x}_{(n)})^2}, \qquad
Q_{n+1} = \frac{(n-1)(x_n - \bar{x}_{(n)})}{n \sum_{j=1}^{n-1} (x_j - \bar{x}_{(n)})^2 + (n-1)(x_n - \bar{x}_{(n)})^2}.$$
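As a sanity check, the closed-form estimators in (3.5) are simply ordinary least squares applied to the differenced data, so they can be verified numerically. The Python sketch below (ours, for illustration; simulated data, not from the paper) compares the closed form against a direct least squares fit.

```python
import numpy as np

def dbrm_estimates(x, y, k):
    """Closed-form difference-based estimates for the k-th observation
    (k is 1-based as in the paper): regress y_j - y_k on x_j - x_k,
    j = 1, ..., k-1, with an intercept."""
    xk, yk = x[k - 1], y[k - 1]
    xj, yj = x[:k - 1], y[:k - 1]
    xbar, ybar = xj.mean(), yj.mean()
    beta_k = np.sum((xj - xbar) * (yj - ybar)) / np.sum((xj - xbar) ** 2)
    alpha_k = -(yk - ybar) + (xk - xbar) * beta_k
    return alpha_k, beta_k

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 30))
y = 5 - 4 * x + rng.normal(0, 1.5, 30)

k = 20
a_hat, b_hat = dbrm_estimates(x, y, k)
# Direct OLS fit on the differenced data gives the same answer.
b_ols, a_ols = np.polyfit(x[:k - 1] - x[k - 1], y[:k - 1] - y[k - 1], 1)
```

The agreement follows because centering $x_j - x_k$ by its mean recovers $x_j - \bar{x}_{(k)}$, so the differenced-data OLS slope and intercept coincide with (3.5).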

### 3.2. Statistical properties of coefficient estimators

For k = u, …, N, we provide the mean and variance of the kth coefficient estimators, $\hat{\alpha}_{(k)} \mid \epsilon_k$ and $\hat{\beta}_{(k)} \mid \epsilon_k$, for two cases: no change point and one change point.

• No change point: For Model (3.1) without a change point, the mean and variance of the coefficient estimators $\hat{\alpha}_{(k)} \mid \epsilon_k$ and $\hat{\beta}_{(k)} \mid \epsilon_k$ for the kth observation are

$$E(\hat{\alpha}_{(k)}) = E(E(\hat{\alpha}_{(k)} \mid \epsilon_k)) = E(-\epsilon_k) = 0,$$
$$\mathrm{Var}(\hat{\alpha}_{(k)}) = E(V(\hat{\alpha}_{(k)} \mid \epsilon_k)) + V(E(\hat{\alpha}_{(k)} \mid \epsilon_k)) = \frac{\sum_{j=1}^{k-1} (x_j - x_k)^2}{(k-1) \sum_{j=1}^{k-1} (x_j - \bar{x}_{(k)})^2} \sigma^2 + \sigma^2,$$
$$E(\hat{\beta}_{(k)}) = E(E(\hat{\beta}_{(k)} \mid \epsilon_k)) = \beta,$$
$$\mathrm{Var}(\hat{\beta}_{(k)}) = E(V(\hat{\beta}_{(k)} \mid \epsilon_k)) + V(E(\hat{\beta}_{(k)} \mid \epsilon_k)) = \frac{\sigma^2}{\sum_{j=1}^{k-1} (x_j - \bar{x}_{(k)})^2}, \tag{3.8}$$

where $\bar{x}_{(k)} = \sum_{j=1}^{k-1} x_j / (k-1)$ and β is the coefficient of the regression line.

• One change point: For k = n in Model (3.2) and k = n + 1 in Model (3.3), the means and variances are as follows. If k = n,

$$E(\hat{\alpha}_{(n)}) = E(E(\hat{\alpha}_{(n)} \mid \epsilon_n)) = E(-\epsilon_n) = 0,$$
$$\mathrm{Var}(\hat{\alpha}_{(n)}) = E(V(\hat{\alpha}_{(n)} \mid \epsilon_n)) + V(E(\hat{\alpha}_{(n)} \mid \epsilon_n)) = \frac{\sum_{j=1}^{n-1} (x_j - x_n)^2}{(n-1) \sum_{j=1}^{n-1} (x_j - \bar{x}_{(n)})^2} \sigma^2 + \sigma^2,$$
$$E(\hat{\beta}_{(n)}) = E(E(\hat{\beta}_{(n)} \mid \epsilon_n)) = \beta_1,$$
$$\mathrm{Var}(\hat{\beta}_{(n)}) = E(V(\hat{\beta}_{(n)} \mid \epsilon_n)) + V(E(\hat{\beta}_{(n)} \mid \epsilon_n)) = \frac{\sigma^2}{\sum_{j=1}^{n-1} (x_j - \bar{x}_{(n)})^2}, \tag{3.9}$$

where $\bar{x}_{(n)} = \sum_{j=1}^{n-1} x_j / (n-1)$ and β1 is the coefficient of the first regression line in Equation (3.2). If k = n + 1,

$$E(\hat{\alpha}_{(n+1)}) = E(E(\hat{\alpha}_{(n+1)} \mid \epsilon_{n+1})) = E(\lambda_{n+1} - \epsilon_{n+1}) = \lambda_{n+1},$$
$$\mathrm{Var}(\hat{\alpha}_{(n+1)}) = E(V(\hat{\alpha}_{(n+1)} \mid \epsilon_{n+1})) + V(E(\hat{\alpha}_{(n+1)} \mid \epsilon_{n+1})) = \frac{\sum_{j=1}^{n} (x_j - x_{n+1})^2}{n \sum_{j=1}^{n} (x_j - \bar{x}_{(n+1)})^2} \sigma^2 + \sigma^2,$$
$$E(\hat{\beta}_{(n+1)}) = E(E(\hat{\beta}_{(n+1)} \mid \epsilon_{n+1})) = \beta_1,$$
$$\mathrm{Var}(\hat{\beta}_{(n+1)}) = E(V(\hat{\beta}_{(n+1)} \mid \epsilon_{n+1})) + V(E(\hat{\beta}_{(n+1)} \mid \epsilon_{n+1})) = \frac{\sigma^2}{\sum_{j=1}^{n} (x_j - \bar{x}_{(n+1)})^2}, \tag{3.10}$$

where λn+1 = α1 − α2 + (β1 − β2)xn+1. For k = n + 2, n + 3, …, N, the differenced observations $(x_{(k)i}, y_{(k)i})$, i = 1, 2, …, k − 1, no longer have a linear relationship with the first regime, (xj, yj), j = 1, 2, …, n, so $E(\hat{\alpha}_{(n+2)} \mid \epsilon_{n+2}) \neq \lambda_{n+2} - \epsilon_{n+2}$.

By Equation (3.8), the mean of the intercept estimator is zero; for k ≤ n, the result (3.9) is similar. However, by Equation (3.10), the mean of the intercept estimator is the non-zero value λn+1. In short, prior to the change point the mean of the intercept estimator is zero, but beyond the change point it is non-zero. The estimated intercept is thus highly influenced by which observation is removed, and in particular by whether the removed observation is the change point. Hence, we use the difference-based intercept estimator for change point detection.
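The zero/non-zero pattern can be seen exactly in the noiseless limit: with σ = 0 in Model (2.2), the intercept estimate is identically zero for k ≤ n and equals λ_{n+1} at k = n + 1. A small Python sketch (ours, with illustrative parameter values):

```python
import numpy as np

def dbrm_intercept(x, y, k):
    """Intercept estimate (3.5) for the k-th observation (1-based)."""
    xj, yj = x[:k - 1], y[:k - 1]
    xbar, ybar = xj.mean(), yj.mean()
    b = np.sum((xj - xbar) * (yj - ybar)) / np.sum((xj - xbar) ** 2)
    return -(y[k - 1] - ybar) + (x[k - 1] - xbar) * b

# Noiseless two-regime data (Model (2.2) with sigma = 0).
alpha1, beta1, alpha2, beta2, n, N = 10.0, -4.0, 2.0, 2.0, 25, 50
x = np.linspace(0.1, 10.0, N)
y = np.where(np.arange(N) < n, alpha1 + beta1 * x, alpha2 + beta2 * x)

a_n  = dbrm_intercept(x, y, n)      # k = n: exactly zero without noise
a_n1 = dbrm_intercept(x, y, n + 1)  # k = n + 1: equals lambda_{n+1}
lam  = alpha1 - alpha2 + (beta1 - beta2) * x[n]  # x_{n+1} in 0-based indexing
```

For k = n all differenced pairs lie in the first regime, so the fitted intercept vanishes; at k = n + 1 the removed observation comes from the second regime, which shifts the intercept to exactly λ_{n+1}.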

4. Hypothesis testing for change point detection

In this section, we explain the characteristics of the intercept estimator $\hat{\alpha}_{(k)} \mid \epsilon_k$, k = u, …, N, and propose a testing method for change point detection based on its properties. To locate the change points, we further develop an algorithm using the properties of $\hat{\alpha}_{(k)} \mid \epsilon_k$.

### 4.1. Characteristics of intercept estimator

The difference-based intercept estimator is highly affected by the change point. We now explain the characteristics of the intercept estimator in three cases: no change point, one change point, and two change points.

• No change point: Assume a simple linear regression model (y = −4x + 5) without a change point. Figure 1(b) displays the scatter plot of the kth observation against the intercept estimate computed without the kth observation, k = u, …, N. The estimates $\hat{\alpha}_{(k)} \mid \epsilon_k$ vary randomly around zero.

• One change point: Assume a piecewise linear regression model with one change point; the model may be continuous or discontinuous. If the model is continuous, the $\hat{\alpha}_{(k)} \mid \epsilon_k$ vary randomly around zero prior to the true change point, but beyond it they share the same sign and tend to increase or decrease (Figure 2(b)). If the model is discontinuous, the estimated intercepts reach their highest or lowest point just beyond the true change point (Figure 3(b)).

• Two change points: Assume a continuous piecewise linear regression model with two change points. As before, the $\hat{\alpha}_{(k)} \mid \epsilon_k$ vary randomly around zero prior to the first change point, but after passing each change point the intercept estimates show different patterns (Figure 4(b)).

### 4.2. A testing method for change point detection

In most cases a simple graphical analysis can detect a change point, but in other cases a hypothesis test is required. As shown in Section 3.2, if there is no change point, the intercept estimators $\hat{\alpha}_{(k)} \mid \epsilon_k$ vary randomly around zero, whereas if there are change points, the intercept estimators show patterns beyond the change point. We detect change points using these characteristics. To do so, we use the rank test of Bartels (1982), which performed better than an ordinary runs test in our simulations.

Bartels (1982) considered the rank version of von Neumann's ratio statistic and obtained critical values under the randomness hypothesis. Let Rt be the rank of the tth observation among T observations. The null hypothesis H0 of randomness is rejected for large absolute values of the test statistic

$$\mathrm{Bar}^* = \frac{\mathrm{Bar} - E[\mathrm{Bar}]}{D[\mathrm{Bar}]} = \frac{\mathrm{Bar} - 2}{2\sqrt{5/(5T+7)}},$$

where $\mathrm{Bar} = \sum_{t=1}^{T-1} (R_t - R_{t+1})^2 / \sum_{t=1}^{T} (R_t - \bar{R})^2$. Under H0, Bar* is asymptotically standard normal. In this paper, the test is applied to the sequence $\{\hat{\alpha}_{(u)}, \hat{\alpha}_{(u+1)}, \ldots, \hat{\alpha}_{(s)}\}$ with s = u + v − 1, …, N. Here, v is the minimum number of observations for the test, and we set v = 5. We use the R package randtests (Mateus and Caeiro, 2015) for this test.
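For readers who prefer a standalone version, the statistic and its normal approximation can be sketched in a few lines of Python. This is our minimal sketch, not the randtests implementation: it assumes no ties among the observations and uses the asymptotic normal distribution rather than Bartels' exact critical values.

```python
import math
import numpy as np

def bartels_rank_test(seq):
    """Rank version of von Neumann's ratio with the normal approximation
    Bar* = (Bar - 2) / (2 * sqrt(5 / (5T + 7))).  Assumes no ties."""
    seq = np.asarray(seq, dtype=float)
    T = len(seq)
    ranks = np.empty(T)
    ranks[np.argsort(seq)] = np.arange(1, T + 1)
    bar = np.sum(np.diff(ranks) ** 2) / np.sum((ranks - ranks.mean()) ** 2)
    z = (bar - 2.0) / (2.0 * math.sqrt(5.0 / (5.0 * T + 7.0)))
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return bar, z, p

# A strictly monotone sequence has adjacent ranks differing by 1, so Bar
# falls far below its null mean of 2 and randomness is rejected.
bar, z, p = bartels_rank_test(np.arange(20.0))
```

Under H0 the expected value of Bar is 2; trending sequences (like the intercept estimates beyond a change point) push Bar toward 0 and produce a large negative Bar*.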

Let us revisit the simulated data described in Section 4.1 and apply the randomness test of Bartels (1982). The indices for the ascending and descending data are the same as the indices s of the algorithm in Section 4.3. Figures 5–8 show the corresponding plots for this rank test. Figures 5(a) and (b) show that all points are random when there is no change point. However, when there are change points, the test indicates that the points are not all random, regardless of the type of piecewise regression (Figures 6–8).

### 4.3. Algorithm

As shown in Section 4.2, the result of the randomness test is highly affected by whether an observation is a change point. We therefore propose a computing algorithm consisting of the following steps to detect change points.

• Step 1. For k = u, …, N, fit the Model (3.4):

$$y_{(k)i} \mid \epsilon_k = \alpha_{(k)} + \beta_{(k)} x_{(k)i} + \epsilon_i, \qquad i = 1, \ldots, k-1;$$

• Step 2. Estimate the difference-based intercept estimators:

• - Do k = u : N{

* Estimate the intercept $\hat{\alpha}_{(k)} \mid \epsilon_k$;

* Estimate the intercept $\hat{\alpha}_{(N-k+1)} \mid \epsilon_{N-k+1}$; }

• - Let $\hat{\alpha}_{\mathrm{asc}} = (\hat{\alpha}_{\mathrm{asc}(u)}, \hat{\alpha}_{\mathrm{asc}(u+1)}, \ldots, \hat{\alpha}_{\mathrm{asc}(N)})$ and $\hat{\alpha}_{\mathrm{des}} = (\hat{\alpha}_{\mathrm{des}(N-u+1)}, \ldots, \hat{\alpha}_{\mathrm{des}(2)}, \hat{\alpha}_{\mathrm{des}(1)})$;

• Step 3. Perform the rank test (Bartels, 1982):

• - Do s = (u + v − 1) : N{

* Put $\hat{\alpha}_{\mathrm{asc},s} = (\hat{\alpha}_{\mathrm{asc}(u)}, \hat{\alpha}_{\mathrm{asc}(u+1)}, \ldots, \hat{\alpha}_{\mathrm{asc}(s)})$ and $\hat{\alpha}_{\mathrm{des},s} = (\hat{\alpha}_{\mathrm{des}(N-u+1)}, \ldots, \hat{\alpha}_{\mathrm{des}(N-s+1)})$;

* Perform the randomness test for the sequences $\hat{\alpha}_{\mathrm{asc},s}$ and $\hat{\alpha}_{\mathrm{des},s}$;

* Compute the p-value of each test and denote them pasc,s and pdes,s; }

• - Rewrite pasc = (pasc,u+v−1, …, pasc,N) and pdes = (pdes,Nuv+2, …, pdes,2, pdes,1);

• Step 4. Decide if change points exist:

• - Using the results of Step 3, test H0: there is no change point; H0 is rejected if pasc,N < 0.05 and pdes,1 < 0.05;

• - If H0 is rejected, i.e., there are change points, then the number of change points is determined by the following procedures:

* Let maxasc be arg min{s : pasc,s < 0.05, s = (u+v−1), …, N} and mindes be arg max{s : pdes,s < 0.05, s = 1, …, (Nuv + 2)};

* If maxasc ≥ mindes, there is one change point;

* Otherwise, there are two or more change points;

• Step 5. Detect the location of the first change point:

• - If there is one change point between maxasc and mindes, take the number of observations in the first regime, $\hat{n}$, to be the median of maxasc and mindes;

• - If there are two or more change points, repeat the following loop until maxasc ≥ mindes {

* Put N = mindes −1;

* Perform Steps 1–4; }

• - Then the first change point lies between maxasc and mindes, and the number of observations in the first regime, $\hat{n}$, is taken to be the median of maxasc and mindes.

The other change points can be detected by applying the algorithm above to the remaining data, excluding the first regime of $\hat{n}$ observations.
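A condensed sketch of the ascending scan (Steps 1–3 and the rejection check in one direction) might look as follows in Python. This is a simplified illustration written for this note, not the authors' implementation: it uses the normal approximation for the Bartels p-value instead of the randtests package, assumes no ties among the intercept estimates, and returns the first index s at which randomness is rejected.

```python
import math
import numpy as np

def bartels_p(seq):
    """Two-sided p-value from the normal approximation to Bartels' statistic."""
    seq = np.asarray(seq, dtype=float)
    T = len(seq)
    r = np.empty(T)
    r[np.argsort(seq)] = np.arange(1, T + 1)      # ranks (no ties assumed)
    bar = np.sum(np.diff(r) ** 2) / np.sum((r - r.mean()) ** 2)
    z = (bar - 2.0) / (2.0 * math.sqrt(5.0 / (5.0 * T + 7.0)))
    return math.erfc(abs(z) / math.sqrt(2.0))

def dbrm_intercept(x, y, k):
    """Difference-based intercept estimate (3.5), k 1-based."""
    xj, yj = x[:k - 1], y[:k - 1]
    xbar, ybar = xj.mean(), yj.mean()
    b = np.sum((xj - xbar) * (yj - ybar)) / np.sum((xj - xbar) ** 2)
    return -(y[k - 1] - ybar) + (x[k - 1] - xbar) * b

def ascending_scan(x, y, u=5, v=5, level=0.05):
    """Step 1: fit the DBRM for k = u, ..., N; Step 2: collect the ascending
    intercepts; Step 3: sequential Bartels tests on the growing sequence.
    Returns the first s with p < level, or None if randomness is never rejected."""
    N = len(x)
    a_asc = np.array([dbrm_intercept(x, y, k) for k in range(u, N + 1)])
    for s in range(u + v - 1, N + 1):
        if bartels_p(a_asc[:s - u + 1]) < level:
            return s
    return None
```

The descending scan is the same computation on the reversed data, and Steps 4–5 compare the two stopping indices (maxasc and mindes) to decide how many change points exist and where the first one lies.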

5. Simulation studies

We conduct simulations to compare the performance of our approach with existing approaches: the R packages chngpt (Fong et al., 2017; Fong and He, 2018) and segmented (Muggeo, 2008, 2017). First, chngpt provides both estimation and hypothesis testing for four types of piecewise regression models. Second, the segmented package estimates the parameters of a GLM with segmented relationships; it supports continuous models.

### 5.1. Simulation setting

To assess and compare the performance of the proposed difference-based intercept estimators, simulations were conducted under different conditions: sample size (N = 50, 100), number of change points (r = 0, 1, 2), type of piecewise regression, and number of independent variables. We generated 100 data sets for each model. The models are as follows.

• Model 1. The comparison model is yi = 5 − 4xi + εi, i = 1, …, N, where $\epsilon_i \overset{iid}{\sim} N(0, 1.5)$ and N = 50. The covariate xi is generated from U(0, 10).

• Model 2. Consider Model (2.2) with one change point. We divide this model into the four types defined by Fong et al. (2017). The covariate x is generated as in Model 1 and $\epsilon_i \overset{iid}{\sim} N(0, 1)$, with n = 25 and N = 50. We set the four types according to the values of the parameters α1, α2, β1, and β2 as follows.

• - hinge: This model has zero slope prior to the change point and is continuous at the change point. We set β1 = 0, α1 = 2, β2 = 5, and α2 = α1 + (β1 − β2)δ with change point location δ = xn.

• - segmented: This model generalizes the hinge model by allowing a non-zero slope prior to the change point. Accordingly, we set β1 = −4, α1 = 5, β2 = 2, and α2 = α1 + (β1 − β2)δ with change point location δ = xn.

• - step: In this model, both regression lines have a slope of 0, and are discontinuous at the change point. We set parameters β1 = 0, α1 = 12, β2 = 0, and α2 = 3.

• - stegmented: According to Fong et al. (2017), the stegmented model is the fusion of the segmented and step models: the two regression lines have different slopes and intercepts. We set β1 = −4, α1 = 10, β2 = 2, and α2 = 2.

• Model 3. Consider the model with two change points. We assume that the model is continuous at the change points:

$$y_i = \begin{cases} \alpha_1 + \beta_1 x_i + \epsilon_i, & \text{if } i \le n, \\ \alpha_2 + \beta_2 x_i + \epsilon_i, & \text{if } n < i \le n + n_2, \\ \alpha_3 + \beta_3 x_i + \epsilon_i, & \text{if } i > n + n_2, \end{cases}$$

where $\epsilon_i \overset{iid}{\sim} N(0, 1)$ and N = 100. We set β1 = −2, α1 = 5, β2 = 1, α2 = α1 + (β1 − β2)δ1, β3 = −2, and α3 = α2 + (β2 − β3)δ2, with change point locations δ1 = xn and δ2 = xn+n2, where n = 25 and n + n2 = 60. The covariate xi is generated from U(0, 20).

• Model 4. We assume a multiple linear regression model with two explanatory variables and one change point:

$$y_i = \begin{cases} \alpha_1 + \beta_1 x_i + \gamma_1 z_i + \epsilon_i, & \text{if } i \le n, \\ \alpha_2 + \beta_2 x_i + \gamma_2 z_i + \epsilon_i, & \text{if } i > n, \end{cases}$$

where $\epsilon_i \overset{iid}{\sim} N(0, 1)$ and N = 50. The change point is defined in terms of only one of the regressors, x, and its location δ is x25. The covariates xi and zi are each generated from U(0, 1). We set two types of models according to the parameter values as follows.

• - Continuous type: The regression lines connect at the change point. We set θ1 = (α1, β1, γ1) = (5, −3, 2) and θ2 = (α2, β2, γ2) = (α1 + (β1 − β2)δ, 2, 2).

• - Discontinuous type: The regression lines do not connect at the change point. We set θ1 = (α1, β1, γ1) = (5, −3, 2) and θ2 = (α2, β2, γ2) = (0, 2, 2).
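For concreteness, the two-change-point Model 3 above can be generated as in the following Python sketch (ours; the indexing convention δ1 = x_n, δ2 = x_{n+n2} from the text is mapped to 0-based arrays). The choices of α2 and α3 make the mean function continuous at both change points.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, n2 = 100, 25, 35
x = np.sort(rng.uniform(0, 20, N))
delta1, delta2 = x[n - 1], x[n + n2 - 1]     # x_n and x_{n+n2}, 1-based

beta1, beta2, beta3, alpha1 = -2.0, 1.0, -2.0, 5.0
alpha2 = alpha1 + (beta1 - beta2) * delta1   # continuity at delta1
alpha3 = alpha2 + (beta2 - beta3) * delta2   # continuity at delta2

i = np.arange(N)
mean = np.select([i < n, i < n + n2],
                 [alpha1 + beta1 * x, alpha2 + beta2 * x],
                 default=alpha3 + beta3 * x)
y = mean + rng.normal(0.0, 1.0, N)
```

The same pattern with a single regime boundary and an extra covariate z generates Model 4.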

Here, u = 5 and v = 5 for estimation and testing with the difference-based intercept estimator. To compare our difference-based change point detection method with the R packages segmented and chngpt, we use the following settings. In segmented, we use the functions "davies.test" and "pscore.test" for testing and the "segmented" function for change point detection. In chngpt, we use the testing function "chngpt.test" for each of the four types. The significance level for the rank test of Bartels (1982) is 0.05.

### 5.2. Simulation results

In this section, the performance of the proposed procedure is evaluated over 100 replicates of each model. The results are summarized in Table 1, which reports the proportions of false detections among the 100 replications.

We first evaluate the ratio of falsely detecting a change point when there is none (Model 1). Our method (DCD) attains a false detection ratio of 0.01 among 100 replications, which is remarkably accurate and implies that our approach can identify the existence of change points under various circumstances. In SEG the ratio is 0.03, while in CHN it is close to one: CHN tends to find the most plausible change point even when the model has none.

We next evaluate the ratio of failing to detect change points when they exist (Models 2–4). First, for Model 2, our method (DCD) and CHN miss change points in a negligible percentage of replications for all types of model. SEG, however, shows a high miss ratio in the discontinuous model, though not in the continuous model, because the package is designed for continuous models. Second, for Model 3, all three methods give accurate results. Finally, for Model 4, our method is accurate regardless of the type of model, whereas SEG and CHN perform worse.

Next, we evaluate the accuracy of change point estimation when change points exist. The performance of the change point estimator is evaluated through its mean and standard deviation (SD), and the three methods are also compared via the absolute relative bias (ARB) (Chen et al., 2011), the percentage error of the estimate $\hat{n}$ relative to the true value n: ARB = $|(\hat{n} - n)/n| \times 100$. Table 2, Figure 9, and Figure 10 show the results.
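The ARB is straightforward to compute; a one-line Python helper (ours, for illustration):

```python
def arb(n_hat, n):
    """Absolute relative bias in percent: |(n_hat - n) / n| * 100."""
    return abs((n_hat - n) / n) * 100.0

# e.g., estimating 24 observations when the true first regime has 25
# gives an ARB of 4 percent.
```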

We first discuss Model 2, with one change point. Our method and CHN estimate the location of the change point with reasonable accuracy (Table 2). SEG estimates the location accurately in the continuous model but not in the discontinuous model. In the simulations of Models 3 and 4, our method (DCD) works better than the other two methods, and the results in Figures 9 and 10 confirm this. Our method is thus superior for one or two change points in the simple linear regression model, and for one change point in the multiple regression model.

Figure 11 and Figure 12 display scatter plots of the intercept estimates and the results of the rank test for Model 4. The intercept estimate is influenced by the change point effect: the difference-based intercept estimators corresponding to observations affected by the change point take large values.

6. Real data analysis

In this section, we apply our difference-based change point detection method to Down syndrome (DS) data (Davison and Hinkley, 1997). This data set was used by Muggeo (2008) to evaluate his method. There are three variables: the number of babies with DS (cases), the number of total births (births), and the mother's mean age (age). DS risk generally increases with the mother's age, but evaluation is needed to show at what point the risk changes.

We apply our method (DCD) to these data; the estimated intercepts, shown in Figure 13, display a distinct trend between the 13th and 17th observations. Even though a clear stopping rule is not provided, we can check for the existence of change points from the trend of the difference-based intercept estimates. Randomness tests are performed on the intercepts added one by one; we also show the scatter plot and the estimated $\hat{n}$ for each method. The $\hat{n}$ values of the three methods differ slightly (Figure 14). For our method, the number of observations in the first regime, determined as the median of the interval between maxasc and mindes, is 16. For SEG, under a null left-slope constraint, it is 14 (Muggeo, 2008). For CHN, under the hinge type, it is 15.

7. Discussion

In this paper, we derived a method for solving change point regression problems using the properties of a difference-based intercept estimator (Park and Kim, 2019), described the statistical properties of the DBRM in a PSLR, and proposed an algorithm for change point detection.

We compared the proposed methodology with other methods available in the recent literature, SEG and CHN. The simulation results indicate that the proposed method performs well in various circumstances. First, it successfully identifies the existence of change points under continuous and discontinuous types, single and multiple covariates, and one or multiple change points; even without a clear stopping rule, the trend of the difference-based intercept estimators reveals change points. The proposed algorithm also provides both an interval and a point estimate of the change point location. Second, our method is more accurate than SEG and CHN: it is superior for one or two change points in the simple linear regression model and for one change point in the multiple regression model. Our method is, however, affected by the number of samples in each regime and requires at least 15 samples per regime. Future work should therefore make the estimates robust to the sample size in each regime and develop an algorithm that searches for change points automatically.

Acknowledgements

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2018R1D1A1B070).

Figures

Fig. 1. Intercept estimates in the DBRM without change point: (a) scatter plot of x and y; (b) estimated intercepts in the DBRM without the kth observation, k = 5, …, N. DBRM = difference-based regression model.

Fig. 2. Intercept estimates in the DBRM with one change point (continuous type) with true number of observations in the first regime (n) = 25: (a) scatter plot of x and y; (b) estimated intercepts in the DBRM without the kth observation, k = 5, …, N.

Fig. 3. Intercept estimates in the DBRM with one change point (discontinuous type) with true number of observations in the first regime (n) = 25: (a) scatter plot of x and y; (b) estimated intercepts in the DBRM without the kth observation, k = 5, …, N.

Fig. 4. Intercept estimates in the DBRM with two change points with true number of observations in the first regime (n) = 25 and in the second regime (n2) = 35: (a) scatter plot of x and y; (b) estimated intercepts in the DBRM without the kth observation, k = 5, …, N.

Fig. 5. Bartels test of intercept estimators without change point.

Fig. 6. Bartels test of intercept estimators with one change point (continuous type), true n = 25 (estimated $\hat{n}$ is 24).

Fig. 7. Bartels test of intercept estimators with one change point (discontinuous type), true n = 25 (estimated $\hat{n}$ is 25).

Fig. 8. Bartels test of intercept estimators with two change points, true n = 25 and n2 = 35 (estimated $\hat{n}$ is 28).

Fig. 9. Boxplots of n estimated by each method on simulated data for Model 2 with N = 50 and true n = 25.

Fig. 10. Boxplots of the first n estimated by each method on simulated data for Model 3 and Model 4: (a) Model 3: N = 100, true n = 25, n2 = 35; (b), (c) Model 4: N = 50, true n = 25.

Fig. 11. Intercept estimators and result of Bartels test on simulated data for Model 4 (continuous type), true n = 25 (estimated $\hat{n}$ is 25).

Fig. 12. Intercept estimators and result of Bartels test on simulated data for Model 4 (discontinuous type), true n = 25 (estimated $\hat{n}$ is 25).

Fig. 13. Estimated intercepts in the DBRM without the kth observation for the example data.

Fig. 14. Scatter plot and regimes obtained by each method for the example data. DCD = our difference-based change point detection method; SEG = R package segmented (Muggeo, 2008, 2017); CHN = R package chngpt (Fong et al., 2017; Fong and He, 2018).

TABLES

### Table 1

Comparison among three methods: DCD, SEG, and CHN

| Model | Type | r | DCD | SEG | CHN |
|---|---|---|---|---|---|
| Model 1 | | No | 0.02 | 0.03 | 1.00 |
| Model 2 | hinge | One | 0.00 | 0.00 | 0.00 |
| | segmented | | 0.00 | 0.00 | 0.00 |
| | step | | 0.00 | 0.86 | 0.00 |
| | stegmented | | 0.00 | 0.00 | 0.00 |
| Model 3 | | One | 0.00 | 0.00 | 0.00 |
| Model 4 | Connected | Two | 0.00 | 0.00 | 0.03 |
| | Disconnected | | 0.00 | 0.06 | 0.04 |

DCD = our difference-based change point detection method; SEG = R package segmented proposed by Muggeo (2008, 2017); CHN = R package chngpt proposed by Fong et al. (2017), Fong and He (2018); r = the number of true change points.

### Table 2

Performance of each method (DCD, SEG, and CHN): mean, SD, and ARB of estimated n̂ over 100 replications; true number of observations in the first regime (n) = 25

| Model | Type | DCD Mean | DCD SD | DCD ARB | SEG Mean | SEG SD | SEG ARB | CHN Mean | CHN SD | CHN ARB |
|---|---|---|---|---|---|---|---|---|---|---|
| Model 1 | | - | - | - | - | - | - | 6.01 | 0.00 | 75.96 |
| Model 2 | hinge | 24.78 | 1.84 | 0.88 | 24.59 | 4.64 | 1.64 | 24.99 | 0.10 | 0.04 |
| | segmented | 24.24 | 2.73 | 3.04 | 24.26 | 1.17 | 2.96 | 24.76 | 0.83 | 0.96 |
| | step | 24.87 | 3.37 | 0.52 | 3.63 | 1.09 | 85.48 | 26.00 | 0.92 | 4.00 |
| | stegmented | 24.44 | 2.79 | 2.24 | 17.41 | 9.95 | 30.36 | 26.00 | 0.00 | 4.00 |
| Model 3 | | 24.80 | 3.29 | 0.80 | 67.24 | 1.66 | 168.96 | 18.02 | 0.00 | 27.92 |
| Model 4 | Connected | 24.91 | 1.49 | 0.36 | 24.39 | 3.34 | 2.44 | 24.77 | 1.52 | 0.92 |
| | Disconnected | 25.75 | 1.74 | 3.00 | 31.46 | 8.14 | 25.84 | 34.21 | 3.67 | 36.84 |

DCD = our difference-based change point detection method; SEG = R package segmented proposed by Muggeo (2008, 2017); CHN = R package chngpt proposed by Fong et al. (2017), Fong and He (2018); SD = standard deviation; ARB = absolute relative bias.
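The ARB values in Table 2 are consistent with the usual definition of absolute relative bias as a percentage, ARB = |mean(n̂) − n| / n × 100. A one-line sketch (the function name is ours) reproduces the tabulated entries:

```python
def arb(mean_est, true_n):
    """Absolute relative bias (%) of the mean estimate of n."""
    return abs(mean_est - true_n) / true_n * 100.0

# Reproducing two Table 2 entries (true n = 25):
arb_chn_model1 = arb(6.01, 25)   # 75.96, the CHN entry for Model 1
arb_dcd_hinge = arb(24.78, 25)   # 0.88, the DCD entry for Model 2, hinge
```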

References
1. Andrews DWK (1993). Tests for parameter instability and structural change with unknown change point, Econometrica, 61, 821-856.
2. Andrews DWK and Ploberger W (1994). Optimal tests when a nuisance parameter is present only under the alternative, Econometrica, 62, 1383-1414.
3. Bai J (1997). Estimation of a change point in multiple regression models, Review of Economics and Statistics, 79, 551-563.
4. Bartels R (1982). The rank version of von Neumann’s ratio test for randomness, Journal of the American Statistical Association, 77, 40-46.
5. Betts M, Forbes G, and Diamond A (2007). Thresholds in songbird occurrence in relation to landscape structure, Conservation Biology, 21, 1046-1058.
6. Chen CWS, Chan JSK, Gerlach R, and Hsieh WYL (2011). A comparison of estimators for regression models with change points, Statistics and Computing, 21, 395-414.
7. Davison A and Hinkley D (1997). Bootstrap Methods and Their Application, Cambridge University Press.
8. Fong Y and He Z (2018). Package ‘chngpt’: Estimation and Hypothesis Testing for Threshold Regression, R package version 2018.7-25.
9. Fong Y, Huang Y, Gilbert PB, and Permar SR (2017). chngpt: threshold regression model estimation and inference, BMC Bioinformatics, 18.
10. Hawkins DM (1980). A note on continuous and discontinuous segmented regressions, Technometrics, 22, 443-444.
11. Julious SA (2001). Inference and estimation in a changepoint regression problem, The Statistician, 50, 51-61.
12. Kim HJ and Siegmund D (1989). The likelihood ratio test for a changepoint in simple linear regression, Biometrika, 76, 409-423.
13. Loader CR (1996). Change point estimation using nonparametric regression, The Annals of Statistics, 24, 1667-1678.
14. Mateus A and Caeiro F. Package ‘randtests’: Testing randomness in R, R package version 1.0.
15. Muggeo VMR (2003). Estimating regression models with unknown break-points, Statistics in Medicine, 22, 3055-3071.
16. Muggeo VMR (2008). Segmented: An R package to fit regression models with broken-line relationships, R News, 8(1), 20-25.
17. Muggeo VMR (2017). Package ‘segmented’: Regression Models with Break-Points/Change-Points Estimation, R package version 0.5-3.0.
18. Park CG and Kim I (2019). Robust difference-based outlier detection, Communications in Statistics - Theory and Methods, Published online.
19. Quandt RE (1958). The estimation of the parameters of a linear regression system obeying two separate regimes, Journal of the American Statistical Association, 53, 873-880.
20. Ulm K (1991). A statistical method for assessing a threshold in epidemiological studies, Statistics in Medicine, 10, 341-349.
21. Zeileis A, Kleiber C, Krämer W, and Hornik K (2002). Testing and dating structural changes in exchange rate regimes, Computational Statistics and Data Analysis, 54, 1696-1706.
22. Zhou HL and Liang KY (2008). On estimating the change point in generalized linear models, IMS Collections Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen, 1, 305-320.