Time Series Clustering Based on the K-Means Algorithm

A time series is a form of data representation that is used in many studies: it is convenient, simple, and informative. Clustering is a common data processing task, so methods for clustering time series are currently of particular relevance. Time series clustering aims to create clusters with high similarity within each cluster and low similarity between clusters. This work is devoted to clustering time series. Various methods of time series clustering are considered, and examples are given for real data.


Introduction
Primary data is the basis for understanding and predicting the processes under study. Data processing and analysis therefore underlie any research (Matarneh, Maksymova, Lyashenko & Belova, 2017; Lyashenko et al., 2016). The amount of such data can be very large, which makes it necessary to use various methods for the analysis and interpretation of primary data (Khan, Joshi, Ahmad & Lyashenko, 2015; Baranova, Sergienko, Stepurina & Lyashenko, 2020; Kang, 2019). Among these methods, data clustering deserves particular attention. It splits the general data set into separate groups, where the members of each group share some common characteristics.
Thus, clustering is a way of preprocessing data for more convenient subsequent analysis. Once the necessary groups and their centroids have been obtained, you can continue to work with specific representatives rather than with the entire data set. This reduces both the processing time and the time needed to obtain results. The approach also gives a better understanding of the data and makes it possible to compress the data when it is redundant. It should also be noted that raw data can be presented in different ways, and a time series is one of them. A time series is a time-ordered sequence of data on a subject area of interest; it is a way of presenting statistics. Time series data is used in various spheres of human activity (Baranova, Sergienko, Stepurina & Lyashenko, 2020). Therefore, this form of data presentation is of particular interest. Some issues of processing such data are considered in our work.

Some Features of Time Series Processing
When processing a time series, you can encounter typical difficulties: high dimensionality of the input data, noise, and missing data. When clustering time series, one should also take into account that the series can contain different numbers of samples. Irwin's criterion is used to detect anomalous values, and methods such as the moving average and exponential smoothing are used to smooth the data (Zou et al., 2019; Walker, Curtis & Goldacre, 2019). All this must be taken into account when clustering data that is presented in the form of time series.
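The two smoothing methods mentioned above can be illustrated with a minimal Python sketch. The function names `moving_average` and `exponential_smoothing`, the window size, and the smoothing factor `alpha` are assumptions made for this illustration; the source does not specify how smoothing was implemented.

```python
def moving_average(series, window):
    """Simple moving average; returns len(series) - window + 1 values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s[0] = x[0], s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical noisy series used only for illustration.
noisy = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
print(moving_average(noisy, window=3))          # [2.0, 3.0, 3.0, 4.0]
print(exponential_smoothing(noisy, alpha=0.5))  # [1.0, 2.0, 2.0, 3.0, 3.0, 4.0]
```

A larger window or a smaller `alpha` gives a smoother curve at the cost of a slower reaction to real changes in the signal.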

K-Means Based Time Series Clustering Methods
Let's consider the most common time series clustering methods that use the k-means algorithm. These methods include Euclidean k-means, DBA k-means and Soft-DTW k-means. One common method for clustering time series is the k-means approach, where the Euclidean distance is used as a measure of proximity (Steinley, 2006; Khachumov, 2012):
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},$$
where $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ are two time series of length $n$.
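As a small illustration of this proximity measure, the following Python sketch computes the Euclidean distance between two series of equal length. The function name `euclidean_distance` and the sample series are hypothetical and serve only as an example.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two time series of the same length."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


print(euclidean_distance([1, 2, 3], [1, 2, 7]))  # 4.0

# The same peak shifted by one sample already gives a non-zero distance.
peak = [0, 0, 1, 0, 0]
shifted_peak = [0, 1, 0, 0, 0]
print(euclidean_distance(peak, shifted_peak))    # sqrt(2), about 1.414
```

The second call shows that the measure compares samples position by position, so two series with the same shape but a time shift are treated as different.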
The k-means algorithm first selects arbitrary initial centers. The elements to be divided into classes are then grouped around these centers, each element being assigned to its nearest center. At the next step, new centers are calculated for the resulting clusters, and the elements are reassigned so that the squared Euclidean distance from each element to the centroid of its own cluster is less than its distance to the centroids of the remaining clusters. These two steps are repeated until the assignment of elements to clusters stops changing.
At the same time, the algorithm places the centers of the clusters (centroids) so that the average values of the elements within the constructed clusters differ as much as possible. Thus, the Euclidean k-means method divides $N$ time series of length $n$ into $k$ groups (clusters). This separation occurs by minimizing the total squared deviation of cluster points from the centroids of these clusters:
$$V = \sum_{j=1}^{k} \sum_{x^{(i)} \in S_j} \left\| x^{(i)} - \mu_j \right\|^2,$$
where $x^{(i)} \in S_j$ are the time series assigned to cluster $S_j$, $j = 1, \dots, k$, and $\mu_j$ is the centroid of cluster $S_j$.
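The procedure described above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in this work: the function names, the random choice of initial centers from the data, the `seed` parameter, and the toy data are all assumptions.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length series."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean_series(group):
    """Pointwise mean of a non-empty group of equal-length series."""
    return [sum(values) / len(group) for values in zip(*group)]


def kmeans(series, k, max_iter=100, seed=0):
    """Euclidean k-means; max_iter must be at least 1."""
    rng = random.Random(seed)
    # Arbitrary initial centers: k series drawn at random from the data.
    centroids = [list(s) for s in rng.sample(series, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each series goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(s, centroids[j]))
            for s in series
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [s for s, lab in zip(series, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_series(members)
    # Total squared deviation of the series from their cluster centroids.
    total = sum(squared_distance(s, centroids[lab])
                for s, lab in zip(series, labels))
    return labels, centroids, total


# Toy data: two low series and two high series.
data = [[0, 0, 0, 0], [1, 1, 1, 1], [10, 10, 10, 10], [11, 11, 11, 11]]
labels, centroids, total = kmeans(data, k=2)
print(labels, total)
```

For this toy data the two low series and the two high series end up in separate clusters, and the returned total corresponds to the minimized sum of squared deviations.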
The Euclidean k-means method has several disadvantages. The number of resulting clusters must be determined in advance, which may not always be advisable. The method is also sensitive to the choice of the initial cluster centers: this increases the probability of error and means that the results can differ from one another when the algorithm is restarted.
There are also cases when an object can belong to different clusters. Despite these shortcomings, Euclidean k-means is a simple algorithm that is well suited for understanding the general clustering process and is a good basis for building extended algorithms.
When clustering time series, it is essential to take into account that some series can be almost the same but shifted relative to each other along the time axis. Therefore, it is advisable to use the metric implemented in the dynamic time warping (DTW) algorithm. Consider two time series $x = (x_1, \dots, x_n)$ of length $n$ and $y = (y_1, \dots, y_m)$ of length $m$. Then the implementation of the DTW method can be described in the following steps (Kate, 2016; Hu, Mashtalir, Tyshchenko & Stolbovyi, 2018).
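At the first step, we construct the distance matrix $d = \{d_{i,j}\}$ between the samples of the two series, for example $d_{i,j} = |x_i - y_j|$ for a series $x$ of length $n$ and a series $y$ of length $m$, with $i = 1, \dots, n$ and $j = 1, \dots, m$.
At the second step, we fill in the transformation matrix $D = \{D_{i,j}\}$, each element of which accumulates the local distance and the smallest of the neighboring accumulated values: $D_{i,j} = d_{i,j} + \min(D_{i-1,j}, D_{i-1,j-1}, D_{i,j-1})$.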
After filling in the transformation matrix, we move on to the final stage, which consists in building the optimal transformation path and calculating the DTW distance. The transformation path is a set of contiguous elements of the matrix that matches the two time series to each other and minimizes the total distance between them.
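The transformation path between the series $x$ and $y$ is determined by the formula (Kate, 2016):
$$W = (w_1, w_2, \dots, w_K),$$
where each $w_k = (i_k, j_k)$ is an element of the matrix that matches the sample $x_{i_k}$ to the sample $y_{j_k}$, and $K$ is the path length.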
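The steps above (distance matrix, transformation matrix, optimal path) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the local distance is taken as the absolute difference of samples, no window constraint is applied, and the function name `dtw` is hypothetical.

```python
def dtw(x, y):
    """Return (accumulated cost, path); path is a list of 0-based (i, j) pairs."""
    n, m = len(x), len(y)
    if n == 0 or m == 0:
        raise ValueError("series must not be empty")
    # Step 1: local distance matrix (assumed here: absolute difference).
    d = [[abs(xi - yj) for yj in y] for xi in x]
    # Step 2: transformation (accumulated cost) matrix with an extra border.
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = d[i - 1][j - 1] + min(
                D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    # Step 3: backtrack from the last cell to the first one,
    # always moving to the cheapest neighboring cell.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (D[i - 1][j - 1], i - 1, j - 1),
            (D[i - 1][j], i - 1, j),
            (D[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return D[n][m], path


# The same shape, with one extra leading sample in the second series.
x = [0, 1, 2, 1, 0]
y = [0, 0, 1, 2, 1, 0]
cost, path = dtw(x, y)
print(cost)  # 0.0
print(path)  # [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
```

In this example the first sample of `x` is matched to the first two samples of `y`, so the time shift is absorbed by the path and the accumulated cost is zero, even though the series have different lengths.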
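Then the DTW distance between two time series $x$ and $y$ is determined by the formula (Kate, 2016; Hu, Mashtalir, Tyshchenko & Stolbovyi, 2018):
$$DTW(x, y) = \min_{W} \sum_{k=1}^{K} d(w_k),$$
where $d(w_k)$ is the element of the distance matrix at the $k$-th point of the path $W = (w_1, \dots, w_K)$, and the minimum is taken over all admissible transformation paths.
The soft-DTW k-means algorithm uses a modification of the DTW method in which the distance is determined as (Montgomery, Jennings & Kulahci, 2015):
$$\text{soft-DTW}^{\gamma}(x, y) = {\min_{W}}^{\gamma} \sum_{k=1}^{K} d(w_k), \qquad {\min}^{\gamma}(a_1, \dots, a_r) = -\gamma \log \sum_{i=1}^{r} e^{-a_i / \gamma},$$
for different values of the smoothing parameter $\gamma > 0$. As $\gamma$ tends to zero, the smoothed minimum tends to the ordinary minimum and soft-DTW tends to the DTW distance.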
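The effect of the smoothing parameter can be illustrated with a short Python sketch of soft-DTW computed by dynamic programming. The sketch is an assumption-based illustration: the local distance is the absolute difference of samples (the squared difference is also commonly used), and the names `softmin` and `soft_dtw` are hypothetical.

```python
import math


def softmin(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma)))."""
    lowest = min(values)
    if lowest == float("inf"):
        return lowest
    # Subtracting the lowest value keeps the exponentials numerically stable.
    total = sum(math.exp(-(v - lowest) / gamma) for v in values)
    return lowest - gamma * math.log(total)


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value for two series; gamma > 0 is the smoothing parameter."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    n, m = len(x), len(y)
    inf = float("inf")
    R = [[inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])  # assumed local distance
            R[i][j] = cost + softmin(
                [R[i - 1][j - 1], R[i - 1][j], R[i][j - 1]], gamma)
    return R[n][m]


x = [0, 1, 2, 1, 0]
y = [0, 0, 1, 2, 1, 0]
for gamma in (0.01, 0.1, 1.0):
    print(gamma, soft_dtw(x, y, gamma))
```

For a very small `gamma` the result is practically the ordinary DTW value (zero for this pair of series), while larger values of `gamma` give a smoother quantity that never exceeds the ordinary DTW value and can be negative.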
In the k-means method we can also use the DTW distance to estimate the "centers of weight" (barycenters) of each group of time series (Okawa, 2019):
$$\mu_j = \arg\min_{\mu} \sum_{x^{(i)} \in S_j} DTW\left(\mu, x^{(i)}\right)^2,$$
where $S_j$ is the $j$-th cluster and $x^{(i)}$ are the time series assigned to it. The corresponding method for clustering time series is called DBA k-means (DTW Barycenter Averaging). Let's conduct a comparative analysis of clustering time series using the methods that we discussed above.
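The idea of barycenter averaging can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: the squared difference is used as the local distance inside DTW, the first series serves as the initial barycenter, the number of iterations is fixed, and the names `dtw_path`, `dba_update` and `dba` are hypothetical.

```python
def dtw_path(x, y):
    """Optimal DTW path between x and y as 0-based (i, j) pairs."""
    n, m = len(x), len(y)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2  # assumed local distance
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (D[i - 1][j - 1], i - 1, j - 1),
            (D[i - 1][j], i - 1, j),
            (D[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    return path[::-1]


def dba_update(barycenter, series_set):
    """One averaging step: align every series to the barycenter, then average."""
    associated = [[] for _ in barycenter]
    for s in series_set:
        for i, j in dtw_path(barycenter, s):
            associated[i].append(s[j])
    # Every barycenter sample is matched at least once per series.
    return [sum(values) / len(values) for values in associated]


def dba(series_set, n_iter=10):
    """DTW barycenter averaging, started from the first series."""
    barycenter = list(series_set[0])
    for _ in range(n_iter):
        barycenter = dba_update(barycenter, series_set)
    return barycenter


# The same peak at three different positions.
shifted = [
    [0, 0, 1, 2, 1, 0, 0],
    [0, 1, 2, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 1, 0],
]
pointwise_mean = [sum(values) / len(values) for values in zip(*shifted)]
print(max(pointwise_mean))  # below 2: the peak is smeared by the shifts
print(dba(shifted))         # the peak of height 2 is preserved
```

The comparison with the pointwise mean shows why a DTW-based barycenter is preferable as a cluster center for shifted series: the ordinary mean flattens the common shape, while the barycenter keeps it.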

Results and Discussion
For the analysis, we will look at time series that represent medical data, namely electrocardiogram (ECG) recordings of heartbeats. The time series correspond to the shapes of the ECG heartbeat signal for the normal case and for cases affected by various arrhythmias and myocardial infarction. These signals were preprocessed and segmented, with each segment corresponding to one heartbeat. An example of such time series is shown in Figure 1. These time series are included in the database that is used for fundamental research and is described in (Moody & Mark, 2001).
The main characteristics of the time series used for clustering (Figure 1) are: the number of series is 87554; the number of values in each series is at least 187; the sampling rate is 125 Hz; the number of classes considered is 4. The Python environment was chosen to implement the methods discussed above and to carry out the clustering procedure.