A piecewise linear regression model describes the relationship between a dependent variable and one or more explanatory variables by a sequence of line segments. In this model, the quantities of interest are the boundary points between adjacent segments, called break points, change points, or joinpoints (Muggeo, 2008). Change point problems occur in fields such as molecular biology, machine learning, and econometrics. Statistical inference that ignores existing change points can be misleading; consequently, it is important to ascertain the threshold values at which the effect of the independent variable changes (Ulm, 1991; Betts
A substantial literature on change point regression problems has developed in many fields since the 1950s (Quandt, 1958). Many early studies focused on testing for change points rather than on estimating them (Kim and Siegmund, 1989; Andrews, 1993; Andrews and Ploberger, 1994), whereas later work has focused on estimation (Loader, 1996; Bai, 1997; Julious, 2001; Muggeo, 2003; Zhou and Liang, 2008). Subsequently, software for detecting change points has been developed. Some R packages related to these problems include
Park and Kim (2019) first introduced a difference-based regression model (DBRM), which is useful for outlier detection in multiple linear regression models. Their outlier detection approach uses a difference-based intercept estimator that is sensitive to anomalous data. Building on the DBRM, we apply the properties of this estimator to the piecewise regression model. In particular, we propose an efficient algorithm for change point detection in a piecewise simple linear regression model (PSLR). Compared with the methods mentioned above, the proposed method has the advantage that it can be applied in a range of settings: continuous or discontinuous models, a single covariate or multiple covariates, and no change point, one change point, or multiple change points.
The remainder of this paper is organized as follows. In Section 2, we briefly describe piecewise linear regression models. In Section 3, we apply the DBRM procedure (Park and Kim, 2019) and derive the statistical properties of the difference-based coefficient estimator in the PSLR. An algorithm to detect change points is given in Section 4. In Section 5 and Section 6, we illustrate the merits of the proposed method in comparison with several existing methods through simulation studies and a real data analysis, respectively. The article concludes with a discussion in Section 7.
In this section, we consider a piecewise regression model. To do this, we use a multiple change point regression model described by Chen
Here,
Hawkins (1980) classifies piecewise linear regression models as continuous or discontinuous. Continuous type here means that the regression function is a connected line at the change point,
Below, we use Equation (2.1) with
where
In this section, we first introduce the DBRM (Park and Kim, 2019) and then explain how to apply it to the PSLR. Park and Kim (2019) proposed an outlier detection approach based on the properties of the intercept estimator in the DBRM. The method uses only the intercept estimator; the other parameters of the DBRM need not be estimated. In this paper, we use the DBRM procedure to detect change points.
First, we describe the DBRM without change points. In this case, the simple linear regression model can be expressed as:
where
Second, we describe the DBRM with a change point using Equation (2.2). This model is divided into two parts according to the value of
where
where
where
Then we estimate the
where
where
where
For
where
where
where
In accordance with Equation (3.8), the mean of the intercept estimator is zero. If
In this section, we explain characteristics of an intercept estimator,
The difference-based intercept estimator is highly affected by the change point. We now explain the characteristics of the intercept estimator in three cases: no change point, one change point, and two change points.
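The sensitivity of a difference-based intercept to a change point can be illustrated with a small sketch. The construction below is an assumption made for illustration, not the estimator of Equation (3.8): it differences every observation against one reference observation and fits ordinary least squares with an intercept to the differences. The function names and the toy data are hypothetical.

```python
def ols_intercept(x, y):
    """Intercept of the simple least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x


def differenced_intercept(x, y, ref):
    """Assumed illustration: difference all observations against
    observation `ref`, then fit a line with intercept to the differences."""
    dx = [xj - x[ref] for j, xj in enumerate(x) if j != ref]
    dy = [yj - y[ref] for j, yj in enumerate(y) if j != ref]
    return ols_intercept(dx, dy)


x = [float(i) for i in range(10)]
y_line = [1.0 + 2.0 * xi for xi in x]        # no change point, no noise
y_hinge = [max(xi - 5.0, 0.0) for xi in x]   # slope changes at x = 5

no_cp = differenced_intercept(x, y_line, ref=0)     # intercept vanishes
with_cp = differenced_intercept(x, y_hinge, ref=0)  # pulled away from zero
```

For the exact line the differenced intercept is zero, while for the hinge data it is clearly nonzero, which is the kind of departure the three cases discussed here examine.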
In most cases, a simple graphical analysis is sufficient to detect a change point, but in other cases a hypothesis test is required. In accordance with Section 3.2, if there is no change point, the intercept estimators,
Bartels (1982) considered the rank version of von Neumann's ratio statistic and obtained the critical values of this statistic under the randomness hypothesis. Suppose
where
Let us revisit the simulation data described in Section 4.1 and apply the randomness test proposed by Bartels (1982). Here, the indices for ascending and descending data are the same as the indices and
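The rank version of von Neumann's ratio can be computed directly from the ranks of a sequence. The following is a minimal sketch, assuming the standard form RVN = Σ(R_i − R_{i+1})² / Σ(R_i − R̄)² and the large-sample normal approximation with mean 2 and variance 4/n. The function names are hypothetical, and for short sequences the exact critical values tabulated by Bartels (1982) would be preferable to this approximation.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def bartels_rank_test(values):
    """Return (RVN, z, two-sided p) using the asymptotic N(2, 4/n) law."""
    n = len(values)
    r = average_ranks(values)
    mean_r = sum(r) / n
    denominator = sum((ri - mean_r) ** 2 for ri in r)
    if denominator == 0:
        raise ValueError("all values are tied; the statistic is undefined")
    numerator = sum((r[i] - r[i + 1]) ** 2 for i in range(n - 1))
    rvn = numerator / denominator
    z = (rvn - 2.0) / math.sqrt(4.0 / n)
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return rvn, z, 2.0 * min(phi, 1.0 - phi)


# A strictly increasing sequence is strongly non-random: RVN is far below 2.
rvn, z, p_value = bartels_rank_test(list(range(1, 21)))
```

A small RVN with a small p-value points to a trend in the sequence, which is how a run of intercept estimates affected by a change point would show up.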
As shown in Section 4.2, the result of the randomness test depends strongly on whether the observations considered include a change point. We therefore propose an algorithm consisting of the following five steps to detect change points.
Step 1. For
Step 2. Estimate the difference-based intercept estimators:
- Do
* Estimate intercepts,
* Estimate intercepts
- Let
Step 3. Perform the rank test (Bartels, 1982):
- Do
* Put
* Perform the randomness test for sequence
* Compute the
- Rewrite
Step 4. Decide if change points exist:
- Using the results of Step 3, the hypothesis,
- If
* Let max_{asc} be arg min{
* If max_{asc} ≥ min_{des}, there is one change point;
* Otherwise, there are two or more change points;
Step 5. Detect the location of the first change point:
- If there is one change point between max_{asc} and min_{des}, assume that the number of observations in the first regime,
- If there are two or more change points, repeat the following loop until max_{asc} ≥ min_{des} {
* Put
* Perform Steps 1–4; }
- Then the first change point lies between max_{asc} and min_{des}. Here, assume that the number of observations in the first regime,
The remaining change points can be detected from the data that remain after removing the first regime corresponding to
We conduct simulations to evaluate the performance of our approach compared to other existing approaches: R packages
In order to assess and compare the performance of the proposed difference-based intercept estimators, simulations have been conducted under different conditions: sample sizes (
Model 1. This model for comparison is
Model 2. Consider Model (2.2) with one change point. We divide this model into the four types distinguished by Fong
- hinge: This model has zero slope before the change point and is continuous at the change point. We set parameters
- segmented: This model generalizes the hinge model by allowing a non-zero slope before the change point. Accordingly, we set the parameters of the model to be
- step: In this model, both regression lines have a slope of 0, and are discontinuous at the change point. We set parameters
- stegmented: According to Fong
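The four shapes can be written as mean functions of a single covariate. This is a sketch under stated assumptions: the intercept is omitted, all parameter names and values are hypothetical and are not the simulation settings of this paper, and the stegmented form (a jump combined with a slope change) follows the usual convention for this model type rather than the description above.

```python
def hinge(x, cp, slope_after):
    """Zero slope before the change point, continuous at it."""
    return slope_after * max(x - cp, 0.0)


def segmented(x, cp, slope_before, slope_change):
    """Hinge generalized to a non-zero slope before the change point."""
    return slope_before * x + slope_change * max(x - cp, 0.0)


def step(x, cp, jump):
    """Both slopes are zero; the mean jumps after the change point."""
    return jump if x > cp else 0.0


def stegmented(x, cp, slope_before, jump, slope_change):
    """Assumed form: a jump plus a slope change after the change point."""
    jump_part = jump if x > cp else 0.0
    return slope_before * x + jump_part + slope_change * max(x - cp, 0.0)


# Hypothetical values, chosen only to show the shapes around cp = 25.
cp = 25.0
before, after = 20.0, 30.0
shapes = {
    "hinge": (hinge(before, cp, 1.0), hinge(after, cp, 1.0)),
    "segmented": (segmented(before, cp, 0.5, 1.0), segmented(after, cp, 0.5, 1.0)),
    "step": (step(before, cp, 2.0), step(after, cp, 2.0)),
    "stegmented": (stegmented(before, cp, 0.5, 2.0, 1.0),
                   stegmented(after, cp, 0.5, 2.0, 1.0)),
}
```

Hinge and segmented are continuous at the change point, whereas step and stegmented are not, which matters when comparing methods designed for continuous models.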
Model 3. Consider the model with two change points. We assume that the model is continuous at the change points:
where
Model 4. We assume a multiple linear regression model with two explanatory variables and one change point:
where
- Continuous type: This model consists of a connected line at the change point. We set parameters
- Discontinuous type: This model consists of a disconnected line at the change point. We set parameters
Here,
In this section, we evaluate the performance of the proposed procedure using 100 replications for each model. The results are summarized in Table 1, which reports the proportion of false detections among the 100 replications.
We first evaluate the proportion of replications in which a change point is falsely detected when there is no change point (Model 1). For our method (DCD), this proportion is 0.02 (Table 1), which indicates that the method rarely reports a change point when none exists. For SEG the proportion is 0.03, whereas for CHN it is close to one, because CHN tends to return the best-fitting change point for the model even when no change point is present.
Next, we evaluate the proportion of replications in which change points are not detected when they exist (Models 2–4). First, for Model 2, our method (DCD) and CHN produce negligible proportions of missed change points for all model types. SEG, however, shows a high proportion for the discontinuous step model (0.86) but not for the continuous models, because this package is designed for continuous models. Second, for Model 3, all three methods give accurate results. Finally, for Model 4, our method gives accurate results regardless of the model type, whereas SEG and CHN perform worse than our method.
In the following, we evaluate the accuracy of change point estimation when there are change points. The change point estimator is assessed through its mean and standard deviation (SD). The performance of the three methods is also evaluated via the absolute relative bias (ARB) (Chen
We first discuss Model 2, which has a single change point. Our method and CHN estimate the location of the change point with reasonable accuracy (Table 2). SEG estimates the location accurately in the continuous models, but not in the discontinuous models. For Model 3 and Model 4, our method (DCD) performs better than the other two methods, as Figure 9 and Figure 10 also indicate. Overall, our method compares favorably in the case of one change point and in the case of two change points in the simple linear regression model, as well as in the case of one change point in the multiple regression model.
Figure 11 and Figure 12 display scatter plots of the intercept estimates and the results of the rank test for Model 4. The intercept estimate is influenced by the change point effect: the difference-based intercept estimates that correspond to observations affected by the change point take large values.
In this section, we apply our difference-based change point detection method to the Down syndrome (DS) data (Davison and Hinkley, 1997). This data set was used by Muggeo (2008) to evaluate his method. It contains three variables: the number of babies with DS (cases), the total number of births (births), and the mother's mean age (age). DS risk generally increases with the mother's age, but an evaluation is needed to determine the age at which the risk changes.
We apply our method (DCD) to these data and show the estimated intercepts in Figure 13. These estimates show a distinct trend between the 13
In this paper, we propose an approach to change point regression problems that exploits the properties of a difference-based intercept estimator (Park and Kim, 2019), and we describe the statistical properties of the DBRM in a PSLR. We also propose an algorithm for change point detection.
We compare the proposed method with two methods available in the recent literature, SEG and CHN. The simulation results indicate that the proposed method performs well in a range of settings. First, our method successfully identifies the existence of change points for continuous and discontinuous models, for single and multiple covariates, and for one and multiple change points. The existence of change points can be checked from the trend of the difference-based intercept estimators, although the method does not provide a clear stopping rule. The proposed algorithm also gives both an interval estimate and a point estimate of the change point location. In the simulations, our method is more accurate than SEG and CHN in the case of one change point and in the case of two change points in the simple linear regression model, and in the case of one change point in the multiple regression model. Our method is affected by the number of observations in each regime and requires at least 15 observations per regime. Future work should therefore make the estimates robust to the sample size in each regime and develop an algorithm that searches for change points automatically.
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2018R1D1A1B070).
Intercept estimates in the DBRM without change point: (a) scatter plot of
Intercept estimates in the DBRM with one change point (continuous type) with true number of observations in the first regime (
Intercept estimates in the DBRM with one change point (discontinuous type) with true number of observations in the first regime (
Intercept estimates in the DBRM with two change points with true number of observations in the first regime (
Bartels test of intercept estimators without change point.
Bartels test of intercept estimators with one change point (continuous type) with true number of observations in the first regime (
Bartels test of intercept estimators with one change point (discontinuous type) with true number of observations in the first regime (
Bartels test of intercept estimators with two change points with true number of observations in the first regime (
Boxplots of
Boxplots of the first
Intercept estimators and result of Bartels test on simulated data for Model 4 (continuous type), true number of observations in the first regime (
Intercept estimators and result of Bartels test on simulated data for Model 4 (discontinuous type), true number of observations in the first regime (
Estimated intercepts in the DBRM without the
Scatter plot and regimes obtained by each method in our example data. DCD = our difference-based change point detection method; SEG = R package
Comparison among three methods: DCD, SEG, and CHN
Model | Type | Change points | DCD | SEG | CHN
---|---|---|---|---|---
Model 1 | | No | 0.02 | 0.03 | 1.00
Model 2 | hinge | One | 0.00 | 0.00 | 0.00
Model 2 | segmented | One | 0.00 | 0.00 | 0.00
Model 2 | step | One | 0.00 | 0.86 | 0.00
Model 2 | stegmented | One | 0.00 | 0.00 | 0.00
Model 3 | | Two | 0.00 | 0.00 | 0.00
Model 4 | Connected | One | 0.00 | 0.00 | 0.03
Model 4 | Disconnected | One | 0.00 | 0.06 | 0.04
DCD = our difference-based change point detection method; SEG = R package
Performances of DCD: mean, SD, and ARB of estimated
Model | Type | DCD Mean | DCD SD | DCD ARB | SEG Mean | SEG SD | SEG ARB | CHN Mean | CHN SD | CHN ARB
---|---|---|---|---|---|---|---|---|---|---
Model 1 | | - | - | - | - | - | - | 6.01 | 0.00 | 75.96
Model 2 | hinge | 24.78 | 1.84 | 0.88 | 24.59 | 4.64 | 1.64 | 24.99 | 0.10 | 0.04
Model 2 | segmented | 24.24 | 2.73 | 3.04 | 24.26 | 1.17 | 2.96 | 24.76 | 0.83 | 0.96
Model 2 | step | 24.87 | 3.37 | 0.52 | 3.63 | 1.09 | 85.48 | 26.00 | 0.92 | 4.00
Model 2 | stegmented | 24.44 | 2.79 | 2.24 | 17.41 | 9.95 | 30.36 | 26.00 | 0.00 | 4.00
Model 3 | | 24.80 | 3.29 | 0.80 | 67.24 | 1.66 | 168.96 | 18.02 | 0.00 | 27.92
Model 4 | Connected | 24.91 | 1.49 | 0.36 | 24.39 | 3.34 | 2.44 | 24.77 | 1.52 | 0.92
Model 4 | Disconnected | 25.75 | 1.74 | 3.00 | 31.46 | 8.14 | 25.84 | 34.21 | 3.67 | 36.84
DCD = our difference-based change point detection method; SEG = R package
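The ARB column of Table 2 can be reproduced from the Mean column. The sketch below assumes that ARB is the percentage 100 × |mean − true| / true and that the true number of observations in the first regime is 25. That true value is an inference, not a statement from the text: it is used here because it reproduces the tabulated ARB values. The function and variable names are hypothetical.

```python
def absolute_relative_bias(mean_estimate, true_value):
    """ARB in percent: 100 * |mean estimate - true value| / true value."""
    return 100.0 * abs(mean_estimate - true_value) / true_value


TRUE_N1 = 25  # assumed true size of the first regime (inferred from Table 2)

# DCD means for Model 2, copied from Table 2.
dcd_means = {"hinge": 24.78, "segmented": 24.24, "step": 24.87, "stegmented": 24.44}

dcd_arb = {
    model_type: round(absolute_relative_bias(mean, TRUE_N1), 2)
    for model_type, mean in dcd_means.items()
}
```

Under these assumptions the computed values match the DCD ARB entries for Model 2 (0.88, 3.04, 0.52, and 2.24).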