Edwin Lughofer, Moamar Sayed-Mouchaweh,
"Autonomous Data Stream Clustering Implementing Split-and-Merge Techniques - Towards a Plug-and-Play Approach"
, in Information Sciences, Vol. 204, Elsevier, Seite(n) 54-79, 2015
Autonomous Data Stream Clustering Implementing Split-and-Merge Techniques - Towards a Plug-and-Play Approach
Sprache des Titels:
We propose a new clustering method, which is dynamic in the sense that it updates its structure (cluster partition) permanently based on new incoming data samples. As it basically operates in single-pass and sample-wise manner without using time-intensive re-training phases, it is applicable for extracting clusters from on-line data streams. The approach builds upon an extended, dynamic version of evolving vector quantization, generalizing it to prototype clusters with convex (ellipsoidal) shapes in arbitrary positions. It includes incremental split-and-merge techniques in order to resolve cluster fusion and cluster delamination effects. The merging process is guided by geometrical criteria indicating overlapping clusters with joint homogeneity areas, and implements a fast fusion of two clusters fulfilling the criteria. The splitting process relies on concepts of seeking for heterogeneity (disjoint density areas) within already extracted clusters. This is based (1) on components (mixture models) identification while measuring their reliability (a) with the Bayesian information criterion (BIC) and (b) in terms of a proper convergence of the EM algorithm and (2) on an extended form of the Welch test for verifying whether the identified components are statistically independent - the extension concerns the integration of their mixing proportions. Finally, split-and-merge operations are able to automatically compensate inappropriate settings of the method's learning parameter, thus still being able to extract the desired cluster structure. Thus, our approach presents an attempt towards a parameter-free evolving clustering algorithm. This is an essential property as several trials with different settings are often not desired (in case of huge streams) or not possible at all (in case of on-line streams). Based on high-dimensional real-world clustering streams, our approach will be compared with alternative incremental as well as batch prototype-based clustering methods.