Comparison of time series clustering methods and application to power consumption pattern clustering

Jaehwi Kim^a, Jaehee Kim^b

aKorea Rural Economic Institute, Korea;
bDepartment of Statistics, Duksung Women’s University, Korea
Correspondence to: Department of Statistics, Duksung Women’s University, Samyang-ro 144-gil 33, Dobong-gu, Seoul 01369, Korea. E-mail: jaehee@duksung.ac.kr
Received July 14, 2020; Revised September 8, 2020; Accepted September 26, 2020.
Abstract

The development of smart grids has enabled the easy collection of a large amount of power data. Because common patterns recur across customers, clustering power consumption patterns is useful when analyzing power big data. In this paper, clustering analysis is based on distance functions for time series and clustering algorithms to discover patterns in power consumption data. In clustering, we use 10 distance measures to find clusters that reflect the characteristics of time series data. A simulation study is done to compare the distance measures for clustering. Cluster validity measures such as error rate, similarity index, Dunn index, and silhouette values are also calculated and compared. Real power consumption data are clustered with the five distance measures that perform better than the others in the simulation.

Keywords : complexity distance, model-free distance, model-based distance, power consumption, time series clustering, silhouette
1. Introduction

Smart grids have many advantages over conventional power grids because they enhance the way electricity is generated, distributed, and consumed by using advanced sensing devices and controllers that depend on power consumption profiles. With the development of smart grid technology, electrical data are accumulating into big data in real time. According to Haben et al. (2015), the clustering of electrical data is shifting from large customer groups or the energy data of high/medium voltage customers to low voltage clustering such as the household level. Clustering algorithms are used to profile individuals’ power consumption and analyze their patterns (Al-Jarrah et al., 2017; Tsekouras et al., 2007). Such analysis is what energy suppliers and distribution network operators need in order to understand how customers use energy and its impact on voltage networks. Clustering issues are at the heart of many knowledge discovery and data mining tasks and are also useful in identifying data patterns (Xiong and Yeung, 2004). Clustering is an unsupervised process that groups data patterns into clusters, so that patterns within a cluster are similar to each other but different from those in other clusters. The goal of clustering is to identify structure in an unlabeled data set by objectively organizing the data into homogeneous clusters in which the within-cluster similarity is maximized and the between-cluster similarity is minimized. Time series clustering is a research area applicable to a wide range of fields (Liao, 2005), and interest in it has increased as part of temporal data mining research. Note that time series clustering may cause some multivariate clustering algorithms to fail, because the notion of similarity becomes obscure in high dimensions, that is, when the time series is very long (Wang et al., 2006).
Therefore, it is necessary to consider clustering using a suitable distance measure between time series incorporating their dependency.

There are various types of distances used in clustering, such as model-free, model-based, and complexity-based methods (Montero and Vilar, 2014). A model-free method measures the proximity between two time series based on the closeness of their values at specific points in time. One approach is to compare the autocorrelation functions (ACF) of the two time series (Bohte et al., 1980; Galeano and Peña, 2000; Caiado et al., 2006; D’Urso and Maharaj, 2009). The Fréchet (1906) distance is computed by considering all order-preserving alignments of the time points of the two series.

Model-based approaches assume that each time series is generated by some model or by a mixture of underlying probability distributions. Time series are considered similar when the model parameters characterizing the individual series, or the residuals remaining after fitting the model, are similar. For example, the distance based on the correlation of two time series (Golay et al., 1998) can be used: the greater the correlation between two series, the smaller the distance. Chouakria and Nagabhushan (2007) propose a distance that covers both the proximity of observations and the proximity of behavior between two time series. Caiado et al. (2006) compare the periodograms of two time series and express the difference in Euclidean distance form. A typical model-based method assumes that each time series follows an ARIMA model. Piccolo (1990) calculates the distance between two time series as the Euclidean distance between the parameters of fitted AR models. Maharaj (1996, 2000) compares two time series using a chi-square test statistic built from the parameters of the AR models. Kalpakis et al. (2001) calculate the linear predictive coding (LPC) cepstrum and the Euclidean distance between LPC cepstra.

Complexity-based approaches compare the levels of complexity of time series. The similarity of two time series does not rely on specific serial features or on knowledge of underlying models, but on measuring the level of information shared by both time series. The mutual information between two series can be formally established using the Kolmogorov complexity concept; however, this measure cannot be computed in practice and must be approximated. There are two ways to account for complexity: calculate the complexity of each time series and compare them with each other (Li et al., 2004; Keogh et al., 2004; Keogh et al., 2007), or set a weighting function for complexity (Batista et al., 2011).

In this paper we compare clustering methods for time series and apply them to power consumption time series as power consumption profiling. This practical application is important for load forecasting, bad data correction, and optimal energy resource scheduling. The paper is organized as follows. Section 2 introduces the 10 distance measures considered, and Section 3 provides the hierarchical and K-means clustering methods and clustering comparison measures. Section 4 shows the clustering results on simulated data and the application to electricity consumption data. Discussions are in Section 5. The clustering analysis is implemented with the R program and the TSclust package.

2. Dissimilarity measure

Let X = (X1, . . . , XT)′ and Y = (Y1, . . . , YT)′ denote realizations of length T from two real-valued processes {Xt, t ∈ ℤ} and {Yt, t ∈ ℤ}, respectively. Here X and Y are both time series, and the data set contains N time series of length T.

### 2.1. Autocorrelation-based distance

The autocorrelation function is used as a dissimilarity measure and some authors have studied this type of measure (Bohte et al., 1980; Galeano and Peña, 2000; Caiado et al., 2006; D’Urso and Maharaj, 2009). Galeano and Peña (2000) define a distance between X and Y with the estimated autocorrelations vectors as

$d_{ACF}(X,Y)=\sqrt{(\hat{\rho}_X-\hat{\rho}_Y)'\,\Omega\,(\hat{\rho}_X-\hat{\rho}_Y)},$

where $\hat{\rho}_X=(\hat\rho_{1,X},\ldots,\hat\rho_{L,X})'$ and $\hat{\rho}_Y=(\hat\rho_{1,Y},\ldots,\hat\rho_{L,Y})'$ are the estimated autocorrelation vectors of X and Y, respectively, and Ω is a weight matrix. The truncation lag L is chosen so that $\hat\rho_{i,X}\approx 0$ and $\hat\rho_{i,Y}\approx 0$ for i > L.

If Ω = I with the uniform weights,

$d_{ACFU}(X,Y)=\sqrt{\sum_{i=1}^{L}\left(\hat\rho_{i,X}-\hat\rho_{i,Y}\right)^{2}}.$

If the geometric weights are decaying according to the autocorrelation lag,

$d_{ACFG}(X,Y)=\sqrt{\sum_{i=1}^{L}p(1-p)^{i}\left(\hat\rho_{i,X}-\hat\rho_{i,Y}\right)^{2}},\quad \text{with } 0<p<1.$

Likewise this measure evaluates the dissimilarity between the corresponding spectral representations of the series.
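As an illustrative sketch (not the TSclust implementation; the function names, the sample-ACF estimator, and the default values of L and p are our choices), the two ACF-based distances can be computed as follows:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelations of x at lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    return np.array([np.sum(x[:-k] * x[k:]) / denom for k in range(1, max_lag + 1)])

def d_acfu(x, y, max_lag=10):
    """Uniform-weight ACF distance d_ACFU (Omega = I)."""
    return np.sqrt(np.sum((acf(x, max_lag) - acf(y, max_lag)) ** 2))

def d_acfg(x, y, max_lag=10, p=0.5):
    """Geometric-weight ACF distance d_ACFG with weights p(1-p)^i, 0 < p < 1."""
    w = p * (1 - p) ** np.arange(1, max_lag + 1)
    return np.sqrt(np.sum(w * (acf(x, max_lag) - acf(y, max_lag)) ** 2))
```

With p close to 1 the geometric version concentrates almost all weight on the first few lags.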

### 2.2. Correlation-based distance

Correlation measures the similarity of two series. Golay et al. (1998) propose correlation-based distances such as

$d_{Cor}(X,Y)=\sqrt{2\left(1-\mathrm{Cor}(X,Y)\right)},$

based on the Pearson correlation between X and Y

$\mathrm{Cor}(X,Y)=\frac{\sum_{t=1}^{T}(X_t-\bar{X})(Y_t-\bar{Y})}{\sqrt{\sum_{t=1}^{T}(X_t-\bar{X})^{2}}\sqrt{\sum_{t=1}^{T}(Y_t-\bar{Y})^{2}}},$

where $\bar{X}$ and $\bar{Y}$ are the averages of the time series X and Y, respectively.

### 2.3. An adaptive dissimilarity index covering both proximity on value and on behavior

Chouakria and Nagabhushan (2007) propose a dissimilarity measure that covers both existing measures of the proximity on observations and temporal correlations for the behavior proximity estimation. The proximity between the behaviors of the series is evaluated by the first-order temporal correlation coefficient as follows:

$\mathrm{CorT}(X,Y)=\frac{\sum_{t=1}^{T-1}(X_{t+1}-X_t)(Y_{t+1}-Y_t)}{\sqrt{\sum_{t=1}^{T-1}(X_{t+1}-X_t)^{2}}\sqrt{\sum_{t=1}^{T-1}(Y_{t+1}-Y_t)^{2}}}.$

CorT(X, Y) takes values from −1 to 1. The value CorT(X, Y) = 1 means that both series have similar temporal behavior, with the same direction and instantaneous growth rate. If CorT(X, Y) equals 0, there is no monotonic association between X and Y and their growth rates are stochastically linearly independent (different behaviors). CorT(X, Y) = −1 means that the growth rates are similar in magnitude but opposite in direction (opposite behaviors). The dissimilarity measure as a function of CorT is

$d_{CorT}(X,Y)=\varphi_k\left[\mathrm{CorT}(X,Y)\right]\cdot d(X,Y),$

where φk(·) is an adaptive tuning function to automatically regulate an existing raw-data distance d(X, Y) according to the temporal correlation with φk(u) = 2/(1+exp(ku)), k ≥ 0. The value CorT(X, Y) = 0 implies dCorT(X, Y) = d(X, Y).
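The adaptive index can be sketched directly from these two formulas (a Python illustration; the raw-data distance d is taken as Euclidean here, and the function names are ours):

```python
import numpy as np

def cort(x, y):
    """First-order temporal correlation coefficient CORT(X, Y)."""
    dx = np.diff(np.asarray(x, dtype=float))
    dy = np.diff(np.asarray(y, dtype=float))
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

def d_cort(x, y, k=2):
    """Adaptive dissimilarity: phi_k(CORT) modulates a raw-data distance (Euclidean here)."""
    phi = 2.0 / (1.0 + np.exp(k * cort(x, y)))  # tuning function phi_k(u)
    return phi * np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
```

With k = 0 the tuning function is identically 1 and dCorT reduces to the raw Euclidean distance.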

### 2.4. Periodogram-based distance

Caiado et al. (2006) propose the Euclidean distance between the periodogram ordinates as

$d_{Per}(X,Y)=\sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(I_X(\lambda_k)-I_Y(\lambda_k)\right)^{2}},$

where $I_X(\lambda_k)=T^{-1}\left|\sum_{t=1}^{T}X_t e^{-i\lambda_k t}\right|^{2}$ and $I_Y(\lambda_k)=T^{-1}\left|\sum_{t=1}^{T}Y_t e^{-i\lambda_k t}\right|^{2}$ are the periodograms of X and Y at frequencies $\lambda_k=2\pi k/T$, k = 1, . . . , n, with n = [(T − 1)/2].
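A sketch of this distance using the FFT (the function names are ours; the magnitude of the FFT ordinates equals that of the sum above up to an indexing phase shift):

```python
import numpy as np

def periodogram(x):
    """Periodogram ordinates I(lambda_k) at Fourier frequencies k = 1..[(T-1)/2]."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    n = (T - 1) // 2
    return (np.abs(np.fft.fft(x)) ** 2 / T)[1:n + 1]

def d_per(x, y):
    """Euclidean distance between periodogram ordinates (series of equal length)."""
    ix, iy = periodogram(x), periodogram(y)
    return np.sqrt(np.mean((ix - iy) ** 2))
```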

### 2.5. Fréchet distance

Fréchet (1906) proposed a method for measuring the proximity between continuous curves, and the Fréchet distance is widely used in the discrete case (Eiter and Mannila, 1994) and in the time series framework. Let M be the set of all sequences r of m pairs that preserve the observation order,

$r=\left((X_{a_1},Y_{b_1}),\ldots,(X_{a_m},Y_{b_m})\right),$

where $a_i,b_i\in\{1,\ldots,T\}$ with $a_1=b_1=1$, $a_m=b_m=T$, and $a_{i+1}\in\{a_i,a_i+1\}$, $b_{i+1}\in\{b_i,b_i+1\}$ for $i\in\{1,\ldots,m-1\}$.

The Fréchet distance is defined by

$d_F(X,Y)=\min_{r\in M}\left(\max_{i=1,\ldots,m}\left|X_{a_i}-Y_{b_i}\right|\right).$

The Fréchet distance not only treats the series as two point sets but also takes the order of the observations into account. Note that dF(X, Y) can also be computed for series of different lengths.
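The discrete Fréchet distance can be computed with standard dynamic programming over the alignment grid (an illustrative sketch; the function name is ours):

```python
import numpy as np

def d_frechet(x, y):
    """Discrete Frechet distance via dynamic programming, O(len(x) * len(y))."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    ca = np.empty((n, m))
    ca[0, 0] = abs(x[0] - y[0])
    for i in range(1, n):                 # first column: only "advance in x" moves
        ca[i, 0] = max(ca[i - 1, 0], abs(x[i] - y[0]))
    for j in range(1, m):                 # first row: only "advance in y" moves
        ca[0, j] = max(ca[0, j - 1], abs(x[0] - y[j]))
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i, j - 1], ca[i - 1, j - 1]),
                           abs(x[i] - y[j]))
    return ca[n - 1, m - 1]
```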

### 2.6. Piccolo distance

Piccolo (1990) argues that autoregressive expansions convey all the useful information about the stochastic structure of the processes except for initial values. If the series are non-stationary, differencing is carried out to make them stationary; if the series have seasonality, it is removed before further analysis. The series are then fitted with truncated AR(∞) models of orders k1 and k2 approximating the generating processes of X and Y, respectively, with the orders chosen by criteria such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC). This approach avoids the problem of obtaining an ad hoc ARMA approximation for each series subjected to clustering.

The Piccolo’s distance with k = max (k1, k2) takes the following form as

$d_{Pic}(X,Y)=\sqrt{\sum_{j=1}^{k}\left(\hat\pi^{*}_{j,X}-\hat\pi^{*}_{j,Y}\right)^{2}},$

where Π̂X = (π̂1, X, . . . , π̂k1, X)′ and Π̂Y = (π̂1, Y, . . . , π̂k2, Y)′ denote the vectors of AR(k1) and AR(k2) parameter estimations for X and Y, respectively.

Here $\hat\pi^{*}_{j,X}=\hat\pi_{j,X}$ if $j\le k_1$ and $\hat\pi^{*}_{j,X}=0$ otherwise, and analogously $\hat\pi^{*}_{j,Y}=\hat\pi_{j,Y}$ if $j\le k_2$ and $\hat\pi^{*}_{j,Y}=0$ otherwise. In addition to satisfying the distance properties of non-negativity, symmetry, and the triangle inequality, $\sum\pi_j$, $\sum|\pi_j|$, and $\sum\pi_j^{2}$ are well-defined quantities, so dPic always exists for all invertible ARIMA processes.
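A simplified sketch of Piccolo's distance (our simplifications: a common, pre-fixed AR order k instead of AIC/BIC-selected orders with zero-padding, and a plain Yule-Walker estimator):

```python
import numpy as np

def ar_coefs_yw(x, k):
    """AR(k) coefficients via the Yule-Walker equations (illustrative estimator)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    T = len(x)
    # biased sample autocovariances at lags 0..k
    r = np.array([np.sum(x[:T - h] * x[h:]) / T for h in range(k + 1)])
    R = np.array([[r[abs(i - j)] for j in range(k)] for i in range(k)])
    return np.linalg.solve(R, r[1:])

def d_pic(x, y, k=5):
    """Piccolo distance with a common AR order k (a simplification of the original)."""
    return np.linalg.norm(ar_coefs_yw(x, k) - ar_coefs_yw(y, k))
```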

### 2.7. Maharaj distance

A major feature of Maharaj's method is the introduction of a hypothesis test of whether two time series are significantly different. Maharaj (1996, 2000) uses hypothesis testing to determine whether two time series have significantly different generating processes within the class of invertible and stationary ARIMA processes.

The test statistic is

$d_{Mah}(X,Y)=T\left(\hat\Pi^{*}_X-\hat\Pi^{*}_Y\right)'\hat{V}^{-1}\left(\hat\Pi^{*}_X-\hat\Pi^{*}_Y\right),$

where $\hat\Pi^{*}_X$ and $\hat\Pi^{*}_Y$ are the AR(k) parameter estimates of X and Y, respectively, with k selected as in Piccolo's distance. $\hat{V}$ is an estimator of $V=\sigma^{2}_X R^{-1}_X(k)+\sigma^{2}_Y R^{-1}_Y(k)$, where $\sigma^{2}_X$ and $\sigma^{2}_Y$ are the variances of the white noise processes associated with X and Y, and $R_X$ and $R_Y$ are the sample covariance matrices of the two series. dMah is asymptotically χ² distributed under the null hypothesis ΠX = ΠY.

Therefore, the dissimilarity measure between Π̂X and Π̂Y through associated p-value is

$d_{Mah,p}(X,Y)=P\left(\chi^{2}_{k}>d_{Mah}(X,Y)\right).$

If a hierarchical algorithm is started from the pairwise matrix of p-values dMah,p, a clustering homogeneity criterion is obtained by pre-specifying a threshold significance level α, such as 5% or 1%. Series with associated p-values greater than α are grouped together, which implies that only series whose dynamic structures are not significantly different at level α are placed in the same group. The test statistic dMah and the associated p-value dMah,p satisfy non-negativity and symmetry, so they can be used as dissimilarities between X and Y. Like Piccolo's distance dPic, dMah evaluates dissimilarity by comparing autoregressive approximations of the two series; unlike dPic, however, dMah can be affected by the scale of measurement because it uses the variances of the white noise processes.

### 2.8. Cepstral-based distance

The linear predictive coding (LPC) cepstrum was proposed by Kalpakis et al. (2001) for clustering ARIMA time series, and it has good properties for distinguishing between ARIMA time series. The cepstrum is defined as the inverse Fourier transform of the logarithm of a signal's spectrum. The LPC cepstrum is so named because it is constructed from the autoregression coefficients of the linear predictive coding of the signal.

Cepstral-based distance is calculated as the Euclidean distance between the LPC cepstral coefficients of X and Y,

$d_{LPC.Cep}(X,Y)=\sqrt{\sum_{i=1}^{T}\left(\psi_{i,X}-\psi_{i,Y}\right)^{2}},$

where the time series X follows an AR(p) structure, $X_t=\sum_{r=1}^{p}\phi_r X_{t-r}+\varepsilon_t$, with autoregression coefficients $\phi_r$ and a white noise process $\varepsilon_t\sim(0,\sigma^2)$. The LPC cepstral coefficients are defined as

$\psi_h=\begin{cases}\phi_1, & \text{if } h=1,\\[2pt] \phi_h+\sum_{r=1}^{h-1}\left(1-\frac{r}{h}\right)\phi_r\,\psi_{h-r}, & \text{if } 1<h\le p,\\[2pt] \sum_{r=1}^{p}\left(1-\frac{r}{h}\right)\phi_r\,\psi_{h-r}, & \text{if } p<h.\end{cases}$
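The recursion can be sketched as follows (a Python illustration operating on a given AR coefficient vector; the function names and the default number of coefficients are ours). As a sanity check, for an AR(1) model with coefficient φ the recursion yields the known cepstrum $\psi_h=\phi^h/h$:

```python
import numpy as np

def lpc_cepstrum(phi, n_coef):
    """LPC cepstral coefficients psi_1..psi_{n_coef} from AR coefficients phi_1..phi_p."""
    p = len(phi)
    psi = np.zeros(n_coef)
    psi[0] = phi[0]
    for h in range(2, n_coef + 1):
        if h <= p:
            psi[h - 1] = phi[h - 1] + sum((1 - m / h) * phi[m - 1] * psi[h - m - 1]
                                          for m in range(1, h))
        else:
            psi[h - 1] = sum((1 - m / h) * phi[m - 1] * psi[h - m - 1]
                             for m in range(1, p + 1))
    return psi

def d_lpc_cep(phi_x, phi_y, n_coef=10):
    """Euclidean distance between LPC cepstral coefficient vectors."""
    return np.linalg.norm(lpc_cepstrum(phi_x, n_coef) - lpc_cepstrum(phi_y, n_coef))
```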

### 2.9. Compression-based dissimilarity measure

The Kolmogorov complexity of an object x, K(x), is the length of the shortest program that can produce x on a universal computer such as a Turing machine; it is the minimum amount of information needed to generate x by an algorithm, so the larger K(x), the greater the complexity. Similarly, given two objects x and y, the conditional Kolmogorov complexity K(x|y) is the length of the shortest program that produces x when y is given as auxiliary input. Therefore, K(x) − K(x|y) is the amount of information that y provides about x.

Based on this theory, Li et al. (2004) present the normalized information distance (NID) as:

$d_{NID}(x,y)=\frac{\max\left\{K(x\,|\,y),\,K(y\,|\,x)\right\}}{\max\left\{K(x),\,K(y)\right\}}.$

Here dNID is a metric taking values in [0, 1]. The biggest problem with dNID is that the Kolmogorov complexity is uncomputable, so K(·) is approximated by the length of the compressed object obtained from data compressors such as gzip and bzip2.

Let C(X) and C(Y) be the compressed sizes of X and Y, respectively. The denominator of dNID is easily approximated by max {C(X), C(Y)}, but the numerator is difficult to approximate because it contains conditional Kolmogorov complexities. Li et al. (2004) solve this problem using the fact that K(x|y) is roughly equal to K(xy) − K(y), where K(xy) is the minimum program length needed to compute the concatenation of x and y. The normalized compression distance (NCD) approximating the NID is expressed as

$d_{NCD}(X,Y)=\frac{C(XY)-\min\left\{C(X),C(Y)\right\}}{\max\left\{C(X),C(Y)\right\}}.$

The metric dNCD takes values from 0 to 1 + ε, where ε is the error due to imperfections in the compression technique.

Keogh et al. (2004) (see also Keogh et al. (2007)) propose a simplified version of the NCD called a compression-based dissimilarity measure (CDM) as

$d_{CDM}(X,Y)=\frac{C(XY)}{C(X)+C(Y)}.$

This measure dCDM ranges from 1/2 to 1, where 1/2 indicates pure identity and 1 indicates maximum discrepancy.
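Both compression-based measures can be sketched with an off-the-shelf compressor (a Python illustration using zlib; the symbolization of a numeric series into bytes is our naive stand-in, whereas the CDM papers discretize series, e.g. via SAX):

```python
import zlib

def _c(b):
    """Compressed size in bytes; zlib stands in for the uncomputable K(.)."""
    return len(zlib.compress(b, 9))

def d_ncd(x, y):
    """Normalized compression distance between two byte strings."""
    cx, cy = _c(x), _c(y)
    return (_c(x + y) - min(cx, cy)) / max(cx, cy)

def d_cdm(x, y):
    """Compression-based dissimilarity measure: C(xy) / (C(x) + C(y))."""
    return _c(x + y) / (_c(x) + _c(y))

def series_to_bytes(series, digits=3):
    """Naive symbolization of a numeric series as a text byte string (an assumption)."""
    return ",".join(f"{v:.{digits}f}" for v in series).encode()
```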

### 2.10. Complexity-invariant dissimilarity measure

Batista et al. (2011) argue that pairs of highly complex time series often tend to be further apart than pairs of simple series, which means that complex series can be incorrectly assigned to a class of lower complexity. To reduce this effect, Batista et al. (2011) use the difference in complexity between the two series as a correction factor for conventional dissimilarity measures. The complexity-invariant dissimilarity measure (CID) is defined as

$d_{CID}(X,Y)=CF(X,Y)\cdot d(X,Y),$

where d(X, Y) is an existing raw-data distance and CF(X, Y) is a complexity correction factor with

$CF(X,Y)=\frac{\max\left\{CE(X),CE(Y)\right\}}{\min\left\{CE(X),CE(Y)\right\}},\qquad CE(X)=\sqrt{\sum_{t=1}^{T-1}\left(X_t-X_{t+1}\right)^{2}}.$

Here CE(·) is a complexity estimator of a series. If the two series have the same complexity, then dCID(X, Y) = d(X, Y).
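CID is only a few lines on top of a base distance (a Python sketch with Euclidean d; the function names are ours):

```python
import numpy as np

def ce(x):
    """Complexity estimate CE(X): length of the line through successive differences."""
    return np.sqrt(np.sum(np.diff(np.asarray(x, dtype=float)) ** 2))

def d_cid(x, y):
    """Complexity-invariant distance: Euclidean distance scaled by CF(X, Y) >= 1."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cx, cy = ce(x), ce(y)
    return (max(cx, cy) / min(cx, cy)) * np.linalg.norm(x - y)
```

When the two series are equally complex the correction factor is 1 and the plain Euclidean distance is recovered.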

3. Clustering methods

### 3.1. Hierarchical clustering

Hierarchical clustering works by sequentially and hierarchically merging similar objects or groups using a tree model. It uses a dendrogram, a tree-like structure that shows the order in which objects are joined, so the number of clusters does not have to be specified in advance. After creating the dendrogram, the tree can be cut at an appropriate level to divide the data into several clusters. A distance or similarity between objects is required to perform hierarchical clustering: the closest objects are merged in sequence, and the distance between the newly formed cluster and each remaining object (or cluster) is updated. Several linkage methods can be used; in this paper we use the complete linkage method, which takes the largest distance over all pairs of objects in two clusters as the distance between those clusters.

### 3.2. K-means clustering

K-means clustering forms clusters by gathering individuals close to the center of each cluster. Unlike hierarchical clustering, it requires the number of clusters to be specified in advance, since the cluster centers must first be set. Let X = C1 ∪ C2 ∪ · · · ∪ CK with $C_i\cap C_j=\emptyset$ for i ≠ j. The clusters are determined by

$\underset{C}{\arg\min}\sum_{i=1}^{K}\sum_{x_j\in C_i}\left\|x_j-c_i\right\|^{2},$

where ci is the center of cluster Ci. The operation of K-means clustering is similar to the EM algorithm. Initially, each object is assigned to a cluster based on randomly chosen cluster centers. The center of each cluster is then recalculated, and the objects are reassigned according to the new centers. The procedure stops when the centers converge or a predetermined number of iterations is reached.
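The assign-update loop just described can be sketched in a few lines (a plain numpy illustration with Euclidean distance between whole series; initialization, seed, and empty-cluster handling are our choices):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means on the rows of X (one time series per row), Euclidean distance."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random data points as centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assignment step: each series goes to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: recompute centers; keep the old center if a cluster empties
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```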

### 3.3. Clustering validity

To evaluate clustering validity we consider four measures. The error rate (ER) and the similarity (Sim) index compare the actual clusters with the estimated clusters. The Dunn index and the silhouette are computed to assess within-cluster connectedness.

• (i) Error rate (ER)

Let H be a clustering map defined as

$H(f,g)=\begin{cases}1, & \text{if } f \text{ and } g \text{ are in the same cluster},\\ 0, & \text{otherwise}.\end{cases}$

Regarding the estimation error, the clustering estimation error rate η(K) is defined by Serban and Wasserman (2005) as

$\eta(K)=\frac{1}{\binom{N}{2}}\sum_{r<s}\left|H(f_r,f_s)-H(\hat{f}_r,\hat{f}_s)\right|,$

where C = {f1, . . . , fN} denotes the true curves and Ĉ = {f̂1, . . . , f̂N} denotes the estimated curves.
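This pairwise-disagreement error rate has a direct implementation (a Python sketch; the label-vector representation of the two clusterings is ours):

```python
from itertools import combinations

def error_rate(true_labels, est_labels):
    """Fraction of object pairs whose co-membership disagrees between two clusterings."""
    pairs = list(combinations(range(len(true_labels)), 2))
    disagree = sum((true_labels[r] == true_labels[s]) != (est_labels[r] == est_labels[s])
                   for r, s in pairs)
    return disagree / len(pairs)
```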

• (ii) Similarity index (Sim index)

$\mathrm{Sim}(G,C)=\frac{1}{K}\sum_{i=1}^{K}\max_{1\le j\le K}\mathrm{Similarity}(G_i,C_j),\quad \text{where } \mathrm{Similarity}(G_i,C_j)=\frac{2\,[G_i\cap C_j]}{[G_i]+[C_j]},$

where G = {G1, . . . , GK} is the true partition and C = {C1, . . . , CK} is the partition obtained by the clustering algorithm. [·] denotes the cardinality of the set.

• (iii) Dunn index

Dunn (1974) proposed an internal measure of clustering as

$D(C)=\frac{\min_{C_k,C_l\in C,\,C_k\ne C_l}\left(\min_{i\in C_k,\,j\in C_l}\mathrm{dist}(i,j)\right)}{\max_{C_m\in C}\mathrm{diam}(C_m)},$

where diam(Cm) is the maximum distance between observations in cluster Cm and dist(i, j) is the distance between data points xi and xj. A larger value indicates compact, well-separated clusters.

• (iv) Silhouette

Rousseeuw (1987) proposed the silhouette width, the average of the silhouette values of all observations, where the silhouette value of observation i is

$s(i)=\frac{b(i)-a(i)}{\max\left\{a(i),b(i)\right\}},$

where

$a(i)=\frac{1}{[C_i]-1}\sum_{j\in C_i,\,j\ne i}d(i,j)\quad\text{and}\quad b(i)=\min_{k:\,i\notin C_k}\frac{1}{[C_k]}\sum_{j\in C_k}d(i,j).$

The silhouette value ranges from −1 to 1 and measures, as an internal criterion, how tightly grouped the observations are; a larger value indicates tighter clusters.
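The silhouette width can be computed from any precomputed pairwise distance matrix, which is convenient with the time series distances of Section 2 (a Python sketch; the function name and the singleton-cluster convention a(i) = 0 are ours):

```python
import numpy as np

def silhouette_width(D, labels):
    """Average silhouette over all observations, given a pairwise distance matrix D."""
    labels = np.asarray(labels)
    s = np.empty(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself from a(i)
        a = D[i, same].mean() if same.any() else 0.0
        b = min(D[i, labels == k].mean()     # nearest "neighboring" cluster
                for k in set(labels.tolist()) if k != labels[i])
        s[i] = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return s.mean()
```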

4. Data analysis

### 4.1. Simulation

We design a set of 100 time series in five clusters; each cluster has 20 time series, and each time series has 96 time points. The errors are generated independently from N(0, 1). The description of each cluster is:

• Cluster 1. Step up and down

$X_t\sim\begin{cases}N(1,1), & t=1,\ldots,20,\\ N(8,2), & t=21,\ldots,70,\\ N(1,1), & t=71,\ldots,96.\end{cases}$

• Cluster 2. Combined AR model

$X_t=\begin{cases}1.5+0.8X_{t-1}+\varepsilon_t, & t=1,\ldots,48,\\ -1+0.6X_{t-1}+\varepsilon_t, & t=49,\ldots,96,\end{cases}\quad \text{where } \varepsilon_t\sim N(0,1).$

• Cluster 3. ARMA model

$X_t=4+0.5X_{t-1}+\varepsilon_t+\varepsilon_{t-1},\quad t=1,\ldots,96,\ \text{where } \varepsilon_t\sim N(0,1).$

• Cluster 4. Periodic model

$X_t=3\sin\left(\frac{20(t-1)}{95}\right)+5+\varepsilon_t,\quad t=1,\ldots,96,\ \text{where } \varepsilon_t\sim N(0,1).$

• Cluster 5. Complex periodic model

$X_t=2\left[\sin\left(\frac{20(t-1)}{95}\right)+2\left|\sin\left(\frac{20(t-1)}{4\times 95}\right)\right|\right]+4+\varepsilon_t,\quad t=1,\ldots,96,\ \text{where } \varepsilon_t\sim N(0,1).$
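One repetition of this design can be generated as follows (a Python sketch under stated assumptions: the second argument of N(·,·) is read as a variance, the random seed is arbitrary, and the start-up values for the recursive models are our simplifications):

```python
import numpy as np

rng = np.random.default_rng(2020)   # arbitrary seed
T, n_per = 96, 20
t = np.arange(1, T + 1)

def cluster1():
    # step up and down; N(8, 2) read with 2 as the variance (an assumption)
    return np.concatenate([rng.normal(1, 1, 20),
                           rng.normal(8, np.sqrt(2), 50),
                           rng.normal(1, 1, 26)])

def cluster2():
    # combined AR(1); pre-sample value X_0 = 0 is a start-up assumption
    x, prev = np.empty(T), 0.0
    for i in range(T):
        c, a = (1.5, 0.8) if i < 48 else (-1.0, 0.6)
        prev = c + a * prev + rng.normal()
        x[i] = prev
    return x

def cluster3():
    # ARMA(1,1); started near the stationary mean 4 / (1 - 0.5) = 8
    x, prev_x, prev_e = np.empty(T), 8.0, 0.0
    for i in range(T):
        e = rng.normal()
        prev_x = 4 + 0.5 * prev_x + e + prev_e
        x[i], prev_e = prev_x, e
    return x

def cluster4():
    # periodic model
    return 3 * np.sin(20 * (t - 1) / 95) + 5 + rng.normal(size=T)

def cluster5():
    # complex periodic model
    u = 20 * (t - 1) / 95
    return 2 * (np.sin(u) + 2 * np.abs(np.sin(u / 4))) + 4 + rng.normal(size=T)

data = np.vstack([f() for f in (cluster1, cluster2, cluster3, cluster4, cluster5)
                  for _ in range(n_per)])        # 100 series, 96 time points each
true_labels = np.repeat(np.arange(1, 6), n_per)
```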

Figure 1 shows one data set from the simulation (a) and the mean lines of each cluster (b). The simulation results are obtained from 100 repetitions. We used the 10 distance measures with the hierarchical complete linkage and K-means clustering algorithms, respectively; Table 1 and Table 2 show the results. The hierarchical and K-means clustering methods with each distance measure give broadly similar cluster validity values. According to the ER, the best performance is achieved with dCID in both hierarchical and K-means clustering. Clustering with dPer has the highest silhouette values under both algorithms. Based on the Sim index, dCorT is the best in hierarchical clustering and dCID is the best in K-means clustering. The methods are substantially different; however, they all lead to acceptable results in the sense of the cluster validity measures.

In practice, attention should be given to the choice of the distance and the clustering algorithm, and the properties of each should be considered to form proper clusters. A range of feature-, model-, and complexity-based dissimilarities is covered in this paper. For instance, the K-means algorithm moves each series to the cluster whose centroid is closest, recalculates the cluster centroids, and repeats the procedure until no more series are reassigned. Once the clustering objectives are made clear and the characteristics of the time series are considered, the range of appropriate methods becomes limited and careful implementation is required.

### 4.2. Electricity consumption data analysis

For a clustering application, we use a power consumption data set from a total of 90 devices and buildings in 16 companies located in South Korea on June 19, 2018 (Table 3). Power consumption was measured every 15 minutes, giving a time series with 96 time points per day. The power consumption patterns vary widely but can be divided into three types: continuous consumption, on-off repetition, and turning on and off once a day. For example, machine A may repeat on-off every hour from 9 am, while machine B may repeat on-off every 2 to 3 hours even if it also starts at 9 am. Various continuous patterns are also possible according to working periods and workloads.

Comparing the silhouette values for three to ten clusters based on dCID, four clusters gave the highest silhouette value; we therefore chose four clusters using dCID, which was the best measure in the simulation. The distances used for the cluster analysis are dAR.Mah, dCID, dCor, dCorT, and dPer. Missing values in the data are estimated using Stineman (1980) interpolation, which creates a curve with no more inflection points than clearly required by the given set of points; we applied it via the na.interpolation function of the imputeTS R package (Moritz and Bartz-Beielstein, 2017).

Table 4 provides the clustering validity measures for this real power consumption data. Figure 2 shows the clusters and their average patterns using dPer with hierarchical complete linkage and K-means clustering; the patterns in each cluster are shown in the far-left panel with the average lines. Figure 3 shows the cluster members obtained by dPer in K-means. Cluster 1 shows a continuous pattern, Cluster 2 contains many devices with an on-off pattern, Cluster 3 contains some main devices, and the chiller devices are in Cluster 4. Interpretation at the device level is limited since exact device information is not available.

5. Discussion

The development of smart grids has enabled the easy collection of a vast amount of power data. To analyze such data efficiently, it is very useful to quickly identify and cluster power consumption patterns; understanding these patterns is an advantage in analyses such as prediction. We compared 10 distance measures using hierarchical clustering and K-means clustering. The simulation shows that no single clustering method is best, and the time series structure should be reflected in the clustering analysis. More measures for evaluating time series clustering, beyond the Dunn index and silhouette width, are also needed. This work could be extended to meet the requirements of real-time data processing applications, such as clustering the power consumption of appliances and controlling usage patterns, and possibly detecting anomalous appliances that indicate faulty or compromised devices. We hope that this research will serve others interested in advancing time series clustering research. For power consumption big data, clustering remains a challenging problem that must consider local and global profiles, computational burden, and effective complexity modelling.

Acknowledgement
This research was supported by Korea Electric Power Corporation (Grant number: R18XA01). It was also supported by the Korea Research Foundation (No. 2018R1A2B26001664).
Figures
Fig. 1. One simulation data set of 100 cases (a) and average lines of each cluster (b).
Fig. 2. A real data clustering result by using dPer. (a) Hierarchical, (b) K-means.
Fig. 3. Cluster members obtained by dPer in K-means.
TABLES

### Table 1

Hierarchical clustering performance with simulation data

| Distance | ER | Sim index | Dunn index | Silhouette |
|---|---|---|---|---|
| ACF | 0.2991 | 0.5388 | 0.4173 | 0.1006 |
| Cor | 0.1995 | 0.6547 | 0.5747 | 0.2477 |
| CorT | 0.1189 | 0.8002 | 0.3547 | 0.3022 |
| Per | 0.2108 | 0.6954 | 0.3188 | 0.4807 |
| Frechet | 0.1834 | 0.7124 | 0.2723 | 0.2212 |
| AR.Pic | 0.2852 | 0.5771 | 0.1312 | 0.2802 |
| AR.Mah | 0.1797 | 0.7213 | 0.0464 | 0.4702 |
| AR.LPC.CEP | 0.1764 | 0.7117 | 0.1880 | 0.2902 |
| CDM | 0.2545 | 0.5547 | 0.9776 | 0.0009 |
| CID | 0.1055 | 0.8200 | 0.4502 | 0.3623 |

### Table 2

K-means clustering performance with simulation data

| Distance | ER | Sim index | Dunn index | Silhouette |
|---|---|---|---|---|
| ACF | 0.2991 | 0.5388 | 0.4173 | 0.1006 |
| Cor | 0.0886 | 0.8472 | 0.4630 | 0.2282 |
| CorT | 0.0865 | 0.8556 | 0.2889 | 0.2949 |
| Per | 0.1087 | 0.8138 | 0.1562 | 0.4991 |
| Frechet | 0.1059 | 0.8261 | 0.3016 | 0.2649 |
| AR.Pic | 0.1840 | 0.6780 | 0.0696 | 0.2612 |
| AR.Mah | 0.1730 | 0.7314 | 0.0114 | 0.4436 |
| AR.LPC.CEP | 0.1423 | 0.7475 | 0.1119 | 0.2784 |
| CDM | 0.1530 | 0.7089 | 0.9762 | 0.0014 |
| CID | 0.0778 | 0.8708 | 0.3361 | 0.3278 |

### Table 3

List of 90 devices and buildings in 16 companies located in South Korea

| Company | Type |
|---|---|
| D1 | LV2; LV5; Chiller No.4; Chiller No.6; Chiller No.7; Chiller No.8; Water treatment |
| D2 | L-1M public; L-2 main; L-14 main; LP-CAR parking tower |
| M1 | Main device; Main office; Grooving machine |
| G1 | Public; Main; Shopping area |
| N1 | 250KVA_Main; 300KVA_Main; Lathe1; Lathe2; New material; Grinder; Urethane1; Urethane2 |
| D3 | Rooftop |
| S1 | L-A; L-B; LP-A1; LP-M; PA-2; PA-3; PB-1 |
| S2 | E_V; Water supply |
| S3 | 1F; 2,3,4F; 5F; Public; Main |
| B1 | F-B3; L-F; L-O; P-CAR; P-E-1; P-EHP |
| K1 | Main circuit breaker; Circuit breaker1; Circuit breaker4 |
| T1 | A-04; B-01; CO2 welding machine; Main press; Welding Line Distribution Board |
| P1 | Unit1_1; Unit1_2; Unit1_3; Unit1_4; Unit1_5; Unit1_coiler; Unit2_1; Unit2_2; Unit2_3; Unit2_4; Unit2_coiler; Unit3_1; Unit3_2; Unit3_3; Unit3_incidental equipment; Unit3_coiler; Unit4_1; Unit4_2; Unit4_3; Unit4_4; Unit4_incidental equipment; Unit4_coiler |
| H1 | L-1A; L-M; Public |
| H1 | LC-1 office main; Main MCCB; Sub1 light; Main1 main |

### Table 4

Hierarchical & K-means clustering with electricity consumption data with 4 clusters

| Method | Measure | Cor | CorT | Per | AR.Mah | CID |
|---|---|---|---|---|---|---|
| Hierarchical | Dunn index | 0.6077 | 0.2632 | 0.5383 | 0.0059 | 0.0235 |
| | Silhouette | 0.1972 | 0.6823 | 0.7799 | 0.3408 | 0.6142 |
| K-means | Dunn index | 0.4076 | 0.0066 | 0.1415 | 0.0004 | 0.0008 |
| | Silhouette | 0.1002 | 0.4646 | 0.6568 | 0.2940 | 0.2793 |

References
1. Al-Jarrah OY, Al-Hammadi Y, and Muhaidat S (2017). Multi-layered clustering for power consumption profiling in smart grids. IEEE Access. doi:10.1109/ACCESS.2017.2712258
2. Batista GE, Wang X, and Keogh EJ (2011). A complexity-invariant distance measure for time series. Proceedings of the 2011 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics. , 699-710.
3. Bohte Z, Čepar D, and Košmelj K (1980). Clustering of time series, Barritt MM and Wishart D (Eds). Compstat 1980: Proceeding in Computational Statistics, (pp. 587-593), Heidelberg, Physica-Verlag.
4. Caiado J, Crato N, and Peña D (2006). A periodogram-based metric for time series classification. Computational Statistics & Data Analysis, 50, 2668-2684.
5. Chouakria AD and Nagabhushan PN (2007). Adaptive dissimilarity index for measuring time series proximity. Advances in Data Analysis and Classification, 1, 5-21.
6. Dunn J (1974). Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4, 95-104.
7. D’Urso P and Maharaj EA (2009). Autocorrelation-based fuzzy clustering of time series. Fuzzy Sets and Systems, 160, 3565-3589.
8. Eiter T and Mannila H (1994). Computing discrete frechet distance (Technical Report CD-TR 94/64), Vienna, Austria, Information Systems Department, Technical University of Vienna.
9. Fréchet MM (1906). Sur quelques points du calcul fonctionnel. Rendiconti del Circolo Matematico di Palermo (1884–1940), 22, 1-72.
10. Galeano P and Peña D (2000). Multivariate analysis in vector time series. Departamento de Estadística y Econometría, Universidad Carlos III de Madrid.
11. Golay X, Kollias S, Stoll G, Meier D, Valavanis A, and Boesiger P (1998). A new correlation-based fuzzy logic clustering algorithm for fMRI. Magnetic Resonance in Medicine, 40, 249-260.
12. Haben S, Singleton C, and Grindrod P (2015). Analysis and clustering of residential customers energy behavioral demand using smart meter data. IEEE Transactions on Smart Grid, 7, 136-144.
13. Kalpakis K, Gada D, and Puttagunta V (2001). Distance measures for effective clustering of ARIMA time-series. Proceedings 2001 IEEE International Conference on Data Mining. , 273-280.
14. Keogh E, Lonardi S, and Ratanamahatana CA (2004). Towards parameter-free data mining. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. , 206-215.
15. Keogh E, Lonardi S, Ratanamahatana CA, Wei L, Lee SH, and Handley J (2007). Compression-based data mining of sequential data. Data Mining and Knowledge Discovery, 14, 99-129.
16. Li M, Chen X, Li X, Ma B, and Vitányi PM (2004). The similarity metric. IEEE Transactions on Information Theory, 50, 3250-3264.
17. Liao TW (2005). Clustering of time series data—a survey. Pattern Recognition, 38, 1857-1874.
18. Maharaj EA (1996). A significance test for classifying ARMA models. Journal of Statistical Computation and Simulation, 54, 305-331.
19. Maharaj EA (2000). Clusters of time series. Journal of Classification, 17, 297-314.
20. Montero P and Vilar JA (2014). TSclust: An R package for time series clustering. Journal of Statistical Software, 62, 1-43.
21. Moritz S and Bartz-Beielstein T (2017). imputeTS: time series missing value imputation in R. The R Journal, 9, 207-218.
22. Piccolo D (1990). A distance measure for classifying ARIMA models. Journal of Time Series Analysis, 11, 153-164.
23. Rousseeuw PJ (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65.
24. Serban N and Wasserman L (2005). CATS: clustering after transformation and smoothing. Journal of the American Statistical Association, 471, 990-999.
25. Stineman RW (1980). A consistently well-behaved method for interpolation. Creative Computing, 6, 54-57.
26. Tsekouras GJ, Hatziargyriou ND, and Dialynas EN (2007). Two-stage pattern recognition of load curves for classification of electricity customers. IEEE Transactions on Power Systems, 22, 1120-1128.
27. Wang X, Smith K, and Hyndman R (2006). Characteristic-based clustering for time series data. Data Mining and Knowledge Discovery, 13, 335-364.
28. Xiong Y and Yeung DY (2004). Time series clustering with ARMA mixtures. Pattern Recognition, 37, 1675-1689.