Bimonthly Since 1986
ISSN 1004-9037
Publication Details
Edited by: Editorial Board of Journal of Data Acquisition and Processing
P.O. Box 2704, Beijing 100190, P.R. China
Sponsored by: Institute of Computing Technology, CAS & China Computer Federation
Undertaken by: Institute of Computing Technology, CAS
Published by: SCIENCE PRESS, BEIJING, CHINA
Distributed by:
China: All Local Post Offices
Abstract
The silhouette method is a well-known statistical method for choosing the number of clusters and for handling outliers in the sample space. An outlier is a data object that deviates significantly from the rest of the objects. The silhouette coefficient of a sample is a clear indicator of whether that sample is an outlier in the data set, and it can also be used to measure the compactness of the clusters formed. This study aims to improve cluster quality by detecting and removing outliers using different clustering methods. In partition methods, the clustering results are very sensitive to the chosen number of clusters. The performance of the silhouette method is analysed with different data sets from the UCI data repository. We propose two methods to detect and remove outliers. The first uses the silhouette value of each sample; the second measures the distance from each sample to all cluster centroids and marks the sample as an outlier based on a threshold distance. We have implemented both methods in Python and checked the results using different data sets from UCI and large public data sets. Cluster quality is evaluated using cluster evaluation indexes such as the Silhouette, Dunn, Davies-Bouldin (DB) and C indexes. Removing outliers improves the quality and compactness of the newly formed clusters. We also analyse how removing outliers from the sample space affects performance and clustering efficiency.
Keywords
Cluster Compactness, Data Mining, Dunn Index, K-means, Outlier, Partition Algorithm, Silhouette.
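
The two outlier-detection approaches described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the Euclidean distance, the silhouette cut-off of 0.0, and the reading of the second method as "the sample's distance to its nearest centroid exceeds the threshold" are all assumptions made for illustration, and the demo data are invented.

```python
import math
from collections import defaultdict


def group_by_label(labels):
    """Map each cluster label to the list of sample indices it contains."""
    groups = defaultdict(list)
    for i, lab in enumerate(labels):
        groups[lab].append(i)
    return groups


def silhouette_values(points, labels):
    """Per-sample silhouette s = (b - a) / max(a, b); needs >= 2 clusters."""
    groups = group_by_label(labels)
    scores = []
    for i, p in enumerate(points):
        own = groups[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for a singleton cluster
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in groups.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def cluster_centroids(points, labels):
    """Centroid (coordinate-wise mean) of every cluster."""
    return {
        lab: tuple(sum(c) / len(members) for c in zip(*(points[i] for i in members)))
        for lab, members in group_by_label(labels).items()
    }


def outliers_by_silhouette(points, labels, min_score=0.0):
    """Assumed rule: a sample is an outlier if its silhouette is below min_score."""
    scores = silhouette_values(points, labels)
    return [i for i, s in enumerate(scores) if s < min_score]


def outliers_by_centroid_distance(points, labels, max_dist):
    """Assumed rule: outlier if even the nearest centroid is farther than max_dist."""
    cents = cluster_centroids(points, labels).values()
    return [
        i for i, p in enumerate(points)
        if min(math.dist(p, c) for c in cents) > max_dist
    ]


# Invented demo data: two tight clusters plus one stray point assigned to cluster 0.
demo_points = [(0, 0), (0, 1), (1, 0), (1, 1),
               (10, 10), (10, 11), (11, 10), (11, 11), (6, 6)]
demo_labels = [0, 0, 0, 0, 1, 1, 1, 1, 0]

silhouette_outliers = outliers_by_silhouette(demo_points, demo_labels)
distance_outliers = outliers_by_centroid_distance(demo_points, demo_labels, max_dist=3.0)
```

In practice the labels would come from a partition algorithm such as K-means, and both cut-offs would need tuning per data set. After removing the flagged samples, the clustering can be rerun and the evaluation indexes recomputed to compare cluster quality before and after.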