
A Novel Clustering Techniques Evaluation using Monte Carlo Simulation

Md. Siraj-Ud-Doulah, Md. Abdul Hamid, Md. Nazmul Islam, Mosfaka Aktar

Abstract


Clustering is one of the most significant methods in machine learning. Data are now abundant from many sources, but extracting meaningful information from them is a tedious task. Clustering algorithms group the data so that this information can be decoded, and the approach has gained much popularity in recent years. This paper evaluates the performance of frequently used clustering techniques, namely single linkage, complete linkage, average linkage, centroid, and Ward’s method, under six proximity measures: Euclidean, Minkowski, Manhattan, maximum, correlation-based, and Canberra distance. Other commonly used techniques, namely self-organizing maps (SOM), Fuzzy C-means, Partitioning Around Medoids (PAM), model-based clustering, K-means, Kernel K-means, and Robust K-means, as well as our newly proposed technique K-HMs (K-Harmonic Means), are also applied to determine the most suitable method for identifying homogeneous items. Cluster stability is tested with the performance measures recall (sensitivity), precision, accuracy, and F-score, and performance is further checked with ROC curves. We simulated two types of data sets. Evaluations using Monte Carlo simulation show that the proposed K-HMs method is almost always reliable for clustering homogeneous items on both types of data sets. The other methods, by contrast, correctly identify homogeneous items when the variances are equal but not when they are unequal.
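The evaluation loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' simulation design, which the abstract does not specify. The group means, standard deviations, sample size, and replicate count below are assumptions chosen for illustration, and a plain two-cluster K-means stands in for the methods under comparison (K-HMs itself is not implemented here). The function names `two_means`, `scores`, and `simulate` are hypothetical.

```python
import numpy as np

def two_means(X, n_iter=50):
    """Plain Lloyd's K-means with K=2 (a stand-in method, not K-HMs)."""
    s = X.sum(axis=1)
    # Deterministic start: the two points with extreme coordinate sums.
    centers = X[[s.argmin(), s.argmax()]].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):  # keep the old center if a cluster empties
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def scores(y_true, y_pred):
    """Recall, precision, accuracy and F-score, treating group 1 as positive."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(y_true)
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f_score": f_score}

def simulate(sd0, sd1, n=100, reps=200, seed=0):
    """Monte Carlo average of the scores over `reps` simulated data sets."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n)  # true group membership
    results = []
    for _ in range(reps):
        # Assumed design: two bivariate normal groups centred at 0 and 4.
        X = np.vstack([rng.normal(0.0, sd0, (n, 2)),
                       rng.normal(4.0, sd1, (n, 2))])
        pred = two_means(X)
        # Cluster ids are arbitrary; keep the labelling that agrees best.
        if np.mean(pred == y) < 0.5:
            pred = 1 - pred
        results.append(scores(y, pred))
    return {k: float(np.mean([r[k] for r in results])) for k in results[0]}

equal = simulate(1.0, 1.0)    # equal variances
unequal = simulate(1.0, 3.0)  # unequal variances
print("equal:  ", equal)
print("unequal:", unequal)
```

Comparing the two printed dictionaries shows how a method's averaged scores change between the equal-variance and unequal-variance settings. Replacing `two_means` with another clustering routine would allow the same comparison across methods.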


Keywords


Clustering, Kernel, Robust, K-HM, Performance Measures, ROC Curve, Monte Carlo Simulation



