The segmentation of a binary sequence has been studied under several different names depending on the context: where the binary sequence comes from and what the purpose of the segmentation is. In signal processing, it is often called the de-noising of a noisy signal or the compression of a binary signal (Gray, 1984; Lelewer and Hirschberg, 1987; Donoho
In all the above, the hidden Markov model (HMM) is the most popular method to segment the binary sequence and achieve these goals (Zucchini
In this paper, as an alternative to the HMM, we propose a procedure for the segmentation of a binary sequence that is purely data-driven and does not make any distributional or model assumptions. We suggest finding a sequence that minimizes the least square distance to the observations, which we call the least square error in this paper, under the assumption that the number of segments is given a priori. Here, for a binary sequence, the least square distance equals the Hamming distance up to a constant, and the number of segments of a binary sequence equals its total variation. We develop a polynomial time algorithm to solve the problem (a formal definition of the problem is given in (
This article is organized as follows. In Section 2, we propose an algorithm to solve the problem, the minimization of the least square error under total variation regularization. We illustrate it with a toy example to aid understanding of the algorithm and show its computational complexity. In addition, we review and discuss the GAP statistic and its variants for choosing the optimal number of segments. In Section 3, we apply the algorithm to two real data examples: the Gemini boat race data between Oxford and Cambridge University (https://www.theboatrace.org) and the Tiger Woods tournament data between September 1999 and June 2001.
Suppose we observe a sequence of binary observations
The problem we propose to solve is, for a given constant
where
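For concreteness, one way to write the problem is as follows. This is a hedged sketch: the equality form of the constraint and the convention that the total variation of a binary sequence counts its segments are our assumptions.

```latex
% y = (y_1, ..., y_n): observed binary sequence; k: given number of segments
\[
\hat{x} \;\in\; \operatorname*{arg\,min}_{\substack{x \in \{0,1\}^n \\ \mathrm{TV}(x) \,=\, k}}
\;\sum_{i=1}^{n} (y_i - x_i)^2,
\qquad
\mathrm{TV}(x) \;=\; 1 + \sum_{i=2}^{n} \lvert x_i - x_{i-1} \rvert .
\]
```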
We start with some notation and preliminary results related to the optimal solution
Suppose, for some ,
i.e.,
while
This contradicts
We use the following notation and terminology in explaining our algorithm.
(T1) ∀
(T2) Suppose
where
(T3) We define the act of flipping, or a flip, as changing all the elements of a given chunk from 0 to 1 or from 1 to 0.
(T4) We define the cost vector
(T5) In the iteration of the algorithm, we call the ℰ(
We make three remarks as preliminaries for the algorithm. First, a sequence
Now let us explain our algorithm to solve (
(S1) Change the encoding of the binary sequence
(S2) Divide the problem into 4 sub-problems according to the first move, named ‘0 to 0 (I-00)’, ‘0 to 1 (I-01)’, ‘1 to 0 (I-10)’, and ‘1 to 1 (I-11)’. Here, ‘0 to 0 (I-00)’ means the first move makes both the first and last chunk’s signals 0, ‘0 to 1 (I-01)’ means it makes the first chunk’s signal 0 and the last chunk’s signal 1, ‘1 to 0 (I-10)’ means it makes the first chunk’s signal 1 and the last chunk’s signal 0, and ‘1 to 1 (I-11)’ means it makes both the first and last chunk’s signals 1.
(S3) For each sub-problem (I-00, I-01, I-10, I-11), flip the chunk having the smallest cost, keeping the first and the last chunks fixed as set in (S2). Update the corresponding RLC, cost, and loss. Repeat this until the required
(S4) Among the solutions to sub-problems (I-00, I-01, I-10, I-11), find the solution with the smallest loss.
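To make the steps concrete, here is a minimal Python sketch of (S1)–(S4). The chunk bookkeeping (each chunk stores its value, its length, and how many of its positions currently disagree with the data, so that a flip costs length minus twice the mismatch count) is our reading of (T4); the authors’ R implementation is available at the GitHub link in the acknowledgements.

```python
from itertools import groupby

def rle(y):
    """(S1) Run-length coding: chunks as [value, length, mismatches],
    where 'mismatches' counts positions disagreeing with the data y
    (zero at the start, since every chunk initially copies the data)."""
    return [[v, len(list(g)), 0] for v, g in groupby(y)]

def flip(chunks, i):
    """(T3) Flip chunk i and merge it with its now equal-valued
    neighbours; return the resulting increase in the least square loss."""
    v, n, m = chunks[i]
    cost = n - 2 * m                 # (T4) mismatches gained minus removed
    chunks[i] = [1 - v, n, n - m]
    if i + 1 < len(chunks):          # absorb the right neighbour
        _, n2, m2 = chunks.pop(i + 1)
        chunks[i][1] += n2; chunks[i][2] += m2
    if i > 0:                        # absorb the left neighbour
        _, n0, m0 = chunks.pop(i - 1)
        chunks[i - 1][1] += n0; chunks[i - 1][2] += m0
    return cost

def segment(y, k):
    """Greedy search for a k-chunk binary sequence close to y in least
    squares; returns (loss, [(value, length), ...]) or None."""
    best = None
    for first in (0, 1):             # (S2) the four sub-problems I-00 .. I-11
        for last in (0, 1):
            chunks, loss = rle(y), 0
            if chunks[0][0] != first:
                loss += flip(chunks, 0)
            if chunks[-1][0] != last:
                loss += flip(chunks, len(chunks) - 1)
            while len(chunks) > k:   # (S3) flip the cheapest interior chunk
                interior = range(1, len(chunks) - 1)
                if not interior:
                    break
                i = min(interior, key=lambda j: chunks[j][1] - 2 * chunks[j][2])
                loss += flip(chunks, i)
            if len(chunks) == k and (best is None or loss < best[0]):
                best = (loss, [(v, n) for v, n, _ in chunks])
    return best                      # (S4) best over the four sub-problems
```

For instance, on the hypothetical input `[1, 0, 1, 1, 0, 0, 1]` with `k = 3`, `segment` returns `(1, [(1, 4), (0, 2), (1, 1)])`: the greedy search merges five chunks into three by changing a single bit.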
To better understand the algorithm above, let us consider a simple toy example. Suppose we observe a sequence of length
We aim to segment it into three chunks (
In (S1), the observation is encoded as RLC with
and the corresponding cost vector is represented as
In this step, the cost vector and the RLC code are the same because flipping a chunk of length
In (S2), we consider four initial moves. First, let us consider the (I-00) case. Since our original data
The last element of the RLC code is computed by
In (S3), we flip the second chunk because it has the smallest value in
In this iteration,
For sub-problem (I-01), (S2) is skipped since our data
and the loss becomes 3. The chunk with the smallest cost is the second, so flipping it yields
and the loss increases to 7. Now that the total number of chunks is less than our target, we stop here.
We repeat the same procedure for sub-problems (I-10) and (I-11), and both end up with
respectively.
Finally, in (S4), we choose both (I-00) and (I-11), which yielded the smallest loss, 5. Thus our algorithm finds two optimal solutions
We make two remarks on our proposed algorithm. First, the solution found by the algorithm is, by its greedy nature, only guaranteed to be locally optimal. The algorithm searches among all possibilities for the flip of the current segmentation that makes the smallest increase in the cost. Thus, the solution has the smallest cost among the neighboring segmentations obtained by a single flip of the current segmentation. Second, the proposed algorithm is a polynomial time algorithm. This is because, in (S3), each step decreases the quantity
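To make the polynomial-time claim concrete, here is a hedged accounting, assuming the cheapest chunk is found by a linear scan at each iteration and writing $c$ for the initial number of chunks:

```latex
% each interior flip merges a chunk with both neighbours: c drops by 2
\[
\underbrace{4}_{\text{sub-problems}} \;\times\;
\underbrace{\frac{c-k}{2}}_{\text{flips per sub-problem}} \;\times\;
\underbrace{O(c)}_{\text{search for the cheapest chunk}}
\;=\; O(c^{2}) \;\le\; O(n^{2}).
\]
```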
The parameter
In this paper, we read our segmentation problem as a clustering of binary sequences on a line graph and adapt the original GAP statistic by (Tibshirani
A brief introduction to the original GAP is as follows. Suppose the data are segmented into
where
where |
where
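For reference, the key quantities of the original GAP statistic in Tibshirani et al. (2001) are, in standard notation:

```latex
% C_r is cluster r with n_r points, d_{ii'} a pairwise distance
\[
W_k \;=\; \sum_{r=1}^{k} \frac{1}{2 n_r} \, D_r,
\qquad
D_r \;=\; \sum_{i,\,i' \in C_r} d_{i i'},
\qquad
\mathrm{Gap}_n(k) \;=\; \mathbb{E}^{*}_{n}\!\bigl[\log W_k\bigr] \;-\; \log W_k,
\]
% choose the smallest k with Gap(k) >= Gap(k+1) - s_{k+1},
% where s_{k+1} accounts for the simulation error of the reference draws.
```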
To be specific, let
and find the optimal
where
and
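As a rough illustration of how such a criterion can be computed, the sketch below takes $W_k$ to be the least square loss of the $k$-segment fit and draws reference sequences uniformly at random; both choices are our assumptions rather than the exact recipe above, and `segment()` refers to the sketch in Section 2.

```python
import numpy as np

def gap_curve(y, k_max, n_ref=20, seed=0):
    """Hedged GAP-type criterion for choosing the number of segments:
    compare the log-loss on the data with its average over uniform
    random reference sequences (one of several possible references)."""
    rng = np.random.default_rng(seed)
    gaps = {}
    for k in range(2, k_max + 1):
        fit = segment(list(y), k)        # sketch from Section 2
        if fit is None:                  # k not reachable for this y
            continue
        ref_losses = []
        for _ in range(n_ref):
            r = segment(list(rng.integers(0, 2, size=len(y))), k)
            if r is not None:
                ref_losses.append(r[0])
        if ref_losses:
            # log1p guards against a zero loss on either side
            gaps[k] = float(np.mean(np.log1p(ref_losses)) - np.log1p(fit[0]))
    return gaps                          # larger values favor that k
```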
In this section, we apply our algorithm (named HTV) to the Gemini boat race data from https://theboatrace.org/results. It records the results of the boat race between Oxford and Cambridge University from 1829 to the present. The data form a 167×5 table, with columns denoting race, year, winner, winning distance, and winning time. In our analysis, we only used the winner column, where we arbitrarily assigned 1 if Oxford won and 0 if Cambridge won.
We tried
As our second data example, we analyzed the tournament data of Tiger Woods, which include the results of 112 tournaments from September 1999 to June 2001. The data are reported in Yang (2004), where the author proposed a Bayesian binary segmentation procedure, inherently similar to the HMM, to understand the phenomenon of ‘streakiness’.
Figure 4 displays the results of the three methods considered here: HTV, HMM, and the Bayesian method by Yang (2004). Our HTV identified eight distinct change points following the procedure outlined in Section 2.4, with the GAP statistics and losses plotted in Figure 3. On the other hand, the HMM and the Bayesian method by Yang (2004) pinpointed only a single change point, which provides much coarser segments. The clustered data elucidate the career trajectory of Tiger Woods during the period. We claim our HTV traces the moves of his career in more detail than the other two methods. The eight detected change points are the 17th, 19th, 71st, 79th, 86th, 94th, 105th, and 111th, and their possibly relevant events are reported in Table 1.
In this paper, we propose minimizing the least square error to segment a binary sequence as an alternative to HMM-based methods. We develop a polynomial time algorithm to find the solution for a given number of segments,
The results of this paper address the segmentation problem for the simplest form of data, but we conjecture they could be extended to more complex data. For example, our HTV can be directly applied to the segmentation of binomial sequences, a notable instance being baseball hit data. To be specific, the HTV for binomial sequences is as follows. First, we initialize chunks based on their predominant binary outcomes, breaking ties at random. Specifically, a chunk is assigned a value of 1 if the frequency of 1s exceeds that of 0s, and 0 otherwise. This step results in an RLC form. Second, we calculate the cost vector of the chunks, with extra care since our initial chunks can contain both 0s and 1s. Once the RLC code and the cost vector are obtained, we can proceed as specified in Section 2; a sketch is given below. Other examples of interest are sequences with a base higher than two, and sequences of two- or multi-dimensional binary data. In both extensions, the form of the total variation regularization becomes more complex, posing a challenge in how to effectively control it during the iterative procedure. We leave these for future work.
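A hedged sketch of this initialization follows, where `counts[i]` records the number of 1s among `m` trials at position `i` (a hypothetical data format); it produces chunks in the same (value, length, mismatches) form used by the sketch in Section 2, so `flip()` and the greedy loop apply unchanged.

```python
import random

def rle_binomial(counts, m):
    """Initialize chunks for a binomial sequence: each position gets its
    majority outcome (ties broken at random), then adjacent equal values
    are merged into chunks carrying [value, total trials, mismatches]."""
    chunks = []
    for c in counts:
        v = 1 if 2 * c > m else 0 if 2 * c < m else random.randint(0, 1)
        miss = m - c if v == 1 else c    # trials disagreeing with v
        if chunks and chunks[-1][0] == v:
            chunks[-1][1] += m
            chunks[-1][2] += miss
        else:
            chunks.append([v, m, miss])
    return chunks
```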
The authors are grateful to the associate editor and two reviewers for several valuable comments. The R code of this paper is available from https://github.com/z0o0/bseg. This work was supported by the National Research Foundation of Korea (No. NRF-2021R1A2C1010786).
The detected change points and their relevant events
Change point | Tournament name | Relevant event
---|---|---
17 | Masters Tournament, 1997 | First major championship win of Woods’ career
19 | MasterCard Colonial, 1997 | Woods began his first swing change under the guidance of Butch Harmon
71 | WGC-NEC Invitational, 1999 | Woods underwent laser eye surgery to correct his myopia
86 | Memorial, 2000 | Woods fully adapted to his changed swing
94 | National Car Rental Golf Classic Disney, 2000 | Woods signed a blockbuster deal with NIKE
111 | US Open, 2001 | Woods claimed the historic achievement since known as the ‘Tiger Slam’ (winning all four men’s major championships)