Research Article

A Comparative Application on Clustering of Mixed-type Data Sets with kamila, k-means, k-medoids and k-prototypes Algorithms

Year 2019, Volume: 20 Issue: 2, 48 - 70, 30.11.2019

Abstract

Cluster analysis is a crucial tool used in many areas of scientific research, and many algorithms exist for performing it. Currently, the two main debates about these algorithms concern which one to use for mixed-type data sets and how to select the best number of clusters. In this study, the ambitiously designed KAMILA algorithm is applied alongside algorithms that were in use before it, namely k-means, k-medoids and k-prototypes, to cluster the values of variables measured on different scales. To this end, a data set from a grocery store chain in Istanbul is analyzed. The company has stores in different districts of Istanbul, and its customers have different demographic characteristics and purchasing behaviors. The data set, provided for 999 customers, records whether each customer purchases from the product categories that are crucial to the company's profitability and the total price of the items purchased. These data were subjected to cluster analysis for customer segmentation. The results show that the KAMILA algorithm can successfully identify the customers in what can be called the gold segment.
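
To illustrate what "mixed-type" means for a data set like the one described (yes/no purchase indicators plus a total spend), the following is a minimal sketch of Gower's general dissimilarity coefficient in its simplest form (equal weights, no missing values). It is an assumption made for illustration only: the abstract does not state which dissimilarity measure the study used, and the function name, variable names, and customer values are hypothetical.

```python
from typing import Optional, Sequence, Union

Value = Union[float, str]


def gower_dissimilarity(
    a: Sequence[Value],
    b: Sequence[Value],
    numeric_ranges: Sequence[Optional[float]],
) -> float:
    """Average per-variable dissimilarity between two mixed-type records.

    numeric_ranges[i] is the observed range (max - min) of variable i if it
    is numeric, or None if variable i is categorical.
    """
    total = 0.0
    for x, y, value_range in zip(a, b, numeric_ranges):
        if value_range is None:
            # Categorical variable: simple matching (0 if equal, 1 otherwise).
            total += 0.0 if x == y else 1.0
        elif value_range > 0:
            # Numeric variable: absolute difference scaled by the range.
            total += abs(float(x) - float(y)) / value_range
        # A numeric variable with zero range contributes nothing.
    return total / len(numeric_ranges)


# Hypothetical customers: (buys_category_A, buys_category_B, total_spend)
customers = [
    ("yes", "no", 120.0),
    ("yes", "yes", 920.0),
    ("no", "no", 20.0),
]
spend = [c[2] for c in customers]
ranges = [None, None, max(spend) - min(spend)]

d01 = gower_dissimilarity(customers[0], customers[1], ranges)
d02 = gower_dissimilarity(customers[0], customers[2], ranges)
```

A matrix of such pairwise values could be passed to a medoid-based method, whereas k-prototypes and KAMILA handle the continuous and categorical parts differently rather than through a single precomputed distance.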

CLUSTERING MIXED-TYPE DATA WITH THE KAMILA, K-MEANS, K-MEDOIDS AND K-PROTOTYPES ALGORITHMS: A COMPARATIVE APPLICATION

Abstract

Cluster analysis is an important tool widely used in many fields, from the social sciences to the natural sciences. Many algorithms are available for performing cluster analysis. Today, the two most debated issues regarding these algorithms are arguably which clustering algorithm should be used for mixed-type data sets and how the best number of clusters can be determined. In this study, a data set containing the values of mixed-type variables measured on different scales is analyzed with the KAMILA algorithm, which was newly and very ambitiously developed for this type of data. The data set is then also partitioned into clusters with the k-means, k-medoids and k-prototypes algorithms, which had been used for mixed-type data before KAMILA. To this end, shopping transaction data obtained from a local supermarket chain operating in Istanbul were analyzed using the R programming language. The customers of this company, whose stores are located in different districts of Istanbul, have different demographic characteristics and purchasing behaviors. The data set, provided for 999 customers for ease of processing, records whether the customers shop from the product categories that are important for the company's profitability and the total price of the purchased products. These data were subjected to cluster analysis for customer segmentation. As a result, it was observed that the KAMILA algorithm can successfully identify the customers in the segment that can be called the gold segment.

Details

Primary Language: Turkish
Section: Articles
Authors

Emrah Bilgiç 0000-0002-9875-2299

Publication Date: 30 November 2019
Submission Date: 2 January 2019
Published in Issue: Year 2019, Volume: 20, Issue: 2

Cite

APA Bilgiç, E. (2019). KARMA TİPTEKİ VERİLERİ KAMILA, K-ORTALAMALAR, KORTAYLAR ve K-PROTOTİPLER ALGORİTMALARIYLA KÜMELEME: KARŞILAŞTIRMALI BİR UYGULAMA. Cumhuriyet Üniversitesi İktisadi Ve İdari Bilimler Dergisi, 20(2), 48-70.

Cumhuriyet Üniversitesi İktisadi ve İdari Bilimler Dergisi is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY NC).