Research Article

COMPARISON OF CLASSIFICATION RESULTS OF SMO AND J48 ALGORITHMS ON DIFFERENT DATA SETS

Year 2018, Volume: 6 Issue: 3, 199 - 213, 26.12.2018
https://doi.org/10.22139/jobs.487388

Abstract

Institutional data sources, social media posts, website articles, and forms
generate large amounts of data. Processing such volumes with traditional
methods and turning them into information for decision processes is very
difficult.

In this context, data mining, with the advanced techniques it offers, can
produce the needed information from the available data.

Databases are rich in hidden information that can support rational
decision-making. Classification and prediction are two important data analysis
techniques used to describe important data classes or to predict future data
trends. Such analyses help in better understanding large amounts of data.
Today, institutions produce large amounts of data but have difficulty
extracting meaningful and useful information from them. Large data sets are
not easy to analyze with traditional statistical methods, so special methods
are required to process and analyze them. Data mining methods emerged to meet
this need.

The aim of this study is to compare the performance of the SMO and J48
algorithms, which are used for classification in data mining. For this
purpose, both algorithms were applied to three different student data sets.

Data mining is an analysis method that summarizes data in novel ways that are
both useful and understandable, and that exposes hidden relationships. It is
one step of the knowledge discovery in databases process, which explores
scientific and technical data to reveal previously unknown patterns.
Classification is a process used frequently in daily life. Through
classification, objects are divided so that each one is assigned to one of a
number of mutually exclusive and exhaustive categories, called classes. Many
practical decision-making problems can be formulated as classification
problems; for example, a person or object may belong to one of many
categories. Classification is thus the process of assigning elements to
classes. These classes may be defined by business rules, class boundaries, or
mathematical functions. A classifier can be built from examples whose
attributes and class values are already known. This type of classification is
called “supervised learning”. If no examples with known classes are available,
the classification is unsupervised. The most common unsupervised
classification approach is clustering. The most common applications of
clustering are retail basket analysis and fraud detection.

The idea of supervised learning in data mining is to learn a classification
function, or to build a classification model, from data whose classes are
already known. This function or model maps records in the database to a target
attribute, so it can be used to predict the class of new data. Depending on
the types of data to be mined or the specific data mining application, a data
mining system draws on areas such as spatial data analysis, information
retrieval, pattern recognition, image analysis, signal processing, computer
graphics, web technology, economics, business, bioinformatics, and psychology.

SMO (Sequential Minimal Optimization) is a simple algorithm that can quickly
solve the support vector machine (SVM) quadratic programming (QP) problem
without any extra matrix storage and without using numerical QP optimization
steps. SMO solves the smallest possible optimization problem at every step.
For the standard SVM QP problem, the smallest possible optimization problem
involves two Lagrange multipliers, because the multipliers must satisfy a
linear equality constraint. At each step, SMO selects two Lagrange multipliers
to optimize jointly, finds the optimal values for these multipliers, and
updates the SVM to reflect the new values. The advantage of SMO is that the
problem for two Lagrange multipliers can be solved analytically, so numerical
QP optimization is avoided entirely. Although more optimization sub-problems
are solved in the course of the algorithm, each sub-problem is so fast that
the overall QP problem is solved quickly. Furthermore, SMO does not require
any additional matrix storage, so very large SVM training problems can fit
into the memory of an ordinary personal computer or workstation. Because no
matrix algorithms are used, SMO is also less susceptible to numerical
precision problems.
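
The analytic two-multiplier step described above can be sketched in Python as
follows. This is a minimal illustration based on the commonly published form of
the SMO update, not the WEKA implementation; the function name, the argument
names, and the simplified handling of a non-positive curvature term are
assumptions made for this example.

```python
def smo_pair_update(a1, a2, y1, y2, e1, e2, k11, k12, k22, c):
    """One analytic SMO step for a pair of Lagrange multipliers.

    a1, a2 : current multipliers; y1, y2 : labels (+1 or -1)
    e1, e2 : prediction errors f(x_i) - y_i
    k11, k12, k22 : kernel values; c : box constraint (0 <= alpha <= c)
    """
    # Bounds that keep both multipliers in [0, c] while preserving
    # the linear equality constraint y1*a1 + y2*a2 = constant.
    if y1 != y2:
        low, high = max(0.0, a2 - a1), min(c, c + a2 - a1)
    else:
        low, high = max(0.0, a1 + a2 - c), min(c, a1 + a2)

    # Curvature of the objective along the constraint line.
    eta = k11 + k22 - 2.0 * k12
    if eta <= 0.0 or low >= high:
        # Simplification (assumption): skip the pair instead of
        # evaluating the objective at both ends of the segment.
        return a1, a2

    # Unconstrained optimum for the second multiplier, then clip it.
    a2_new = a2 + y2 * (e1 - e2) / eta
    a2_new = min(high, max(low, a2_new))

    # The first multiplier follows from the equality constraint.
    a1_new = a1 + y1 * y2 * (a2 - a2_new)
    return a1_new, a2_new


# Hypothetical inputs: two orthogonal unit vectors with opposite labels.
print(smo_pair_update(0.0, 0.0, 1, -1, -1.0, 1.0, 1.0, 0.0, 1.0, 1.0))
```

Only three kernel values are needed per step, which is why no kernel matrix
has to be held in memory.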

J48 is a decision tree algorithm based on the very popular C4.5 algorithm
developed by J. Ross Quinlan. Decision trees are a classic way of representing
the information learned by a machine learning algorithm and provide a powerful
and fast way to express structure in data. The algorithm classifies the data
recursively. This maximizes accuracy on the training data, but it can produce
excessive rules that describe only the particular characteristics of that
data. Based on information gain, J48 can automatically select the relevant
attributes. It is an iterative algorithm that splits the samples at the point
where information gain is highest. Tree construction starts by selecting the
best root variable and splitting the samples on it, and then proceeds from top
to bottom. J48 can also prune effectively, cutting weak branches that are not
meaningful. One reason for pruning is that the purpose of decision trees is
not to explore the data but to build a simple classification model from it.
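
The split criterion mentioned above can be illustrated with a short Python
sketch that computes entropy and information gain for a nominal attribute. It
is an illustration only, not the J48 code; the function names and the toy
records are made up for this example, and C4.5 in practice also normalizes the
gain (gain ratio), which is omitted here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in entropy obtained by splitting on a nominal attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the subsets produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: four students with a two-level class label.
labels = ["high", "high", "low", "low"]
attribute_a = ["x", "x", "y", "y"]   # separates the classes perfectly
attribute_b = ["x", "y", "x", "y"]   # carries no information about the class

print(information_gain(attribute_a, labels))
print(information_gain(attribute_b, labels))
```

A tree builder would compute this value for every candidate attribute and
split on the one with the highest gain, then repeat on each subset.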

In this study, three different data sets of university students were used. The
data were put into the required form using Excel macros, and data warehouses
were prepared. After the necessary conversions, the data were written to the
text files “iibf1.arff”, “iibf2.arff” and “myo.arff”. The analysis was carried
out with WEKA (Waikato Environment for Knowledge Analysis) version 3.7.2,
developed by the University of Waikato. For each data set, the student's
gender, province, family income level, number of siblings, number of siblings
in education, and entrance score were used as attributes. The level of the
entrance score was used to define the classes.
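
For readers unfamiliar with the ARFF text format that WEKA reads, the following
Python sketch shows how such a file could be assembled. It is not the Excel
macro procedure used in the study; the function name, the attribute names, the
nominal values, and the sample rows are all hypothetical and chosen only to
mirror the kinds of attributes listed above.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from attribute definitions and data rows.

    attributes: list of (name, kind) pairs, where kind is either a type
    keyword such as "numeric" or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            # Nominal attribute: allowed values listed in braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
        else:
            lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical attribute set and records (not the study's data).
attributes = [
    ("gender", ["F", "M"]),
    ("income_level", ["low", "medium", "high"]),
    ("siblings", "numeric"),
    ("score_class", ["low", "medium", "high"]),  # class attribute, placed last
]
rows = [("F", "medium", 2, "high"), ("M", "low", 4, "medium")]

print(to_arff("students_example", attributes, rows))
```

The resulting text can be saved with an .arff extension and opened in WEKA,
where the last attribute is typically treated as the class.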

According to the results, the SMO algorithm achieved a higher classification
success rate than the J48 algorithm, which makes it the more reliable of the
two on these data sets.


FARKLI VERİ SETLERİ ÜZERİNDE SMO VE J48 ALGORİTMALARININ SINIFLANDIRMA SONUÇLARININ KARŞILAŞTIRILMASI

Abstract

Purpose: Data mining is an interdisciplinary field that is continually developing and whose areas of application are expanding. By applying various techniques and algorithms, it helps to ensure the reliability of data. Classification is an important data mining technique because it is widely used by researchers.

Method: In this study, the classification results of the SMO and J48 algorithms were compared on three different student data sets. The performance of the J48 and SMO algorithms in terms of classification accuracy was evaluated on the three data sets using various accuracy measures such as TP rate, FP rate, precision, recall, F-measure, and ROC analysis.

Findings and Conclusion: The tests showed that the classification performance of the SMO algorithm was better on all three data sets.
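
The accuracy measures named in the method statement (TP rate, FP rate, precision, recall, F-measure) can be computed from confusion-matrix counts as in the Python sketch below. The function names and the counts are hypothetical and are not results from the study; WEKA reports these values per class, and ROC analysis is left out because it needs ranked scores rather than counts.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, fn, tn):
    """Common accuracy measures for one class treated as 'positive'."""
    tp_rate = ratio(tp, tp + fn)      # TP rate, also called recall
    fp_rate = ratio(fp, fp + tn)      # share of negatives flagged as positive
    precision = ratio(tp, tp + fp)    # share of positive predictions that are right
    # F-measure: harmonic mean of precision and recall.
    f_measure = ratio(2 * precision * tp_rate, precision + tp_rate)
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": tp_rate,
        "f_measure": f_measure,
    }


# Hypothetical counts for illustration only.
for name, value in binary_metrics(tp=8, fp=2, fn=2, tn=8).items():
    print(f"{name}: {value:.3f}")
```

Comparing two classifiers on the same data set then amounts to computing these measures for each and setting them side by side.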



Details

Primary Language Turkish
Journal Section Original Articles
Authors

Mehmet Alan 0000-0001-8562-547X

Cavit Yeşilyurt 0000-0001-9814-4085

Publication Date December 26, 2018
Submission Date November 25, 2018
Acceptance Date December 24, 2018
Published in Issue Year 2018 Volume: 6 Issue: 3

Cite

APA Alan, M., & Yeşilyurt, C. (2018). FARKLI VERİ SETLERİ ÜZERİNDE SMO VE J48 ALGORİTMALARININ SINIFLANDIRMA SONUÇLARININ KARŞILAŞTIRILMASI. İşletme Bilimi Dergisi, 6(3), 199-213. https://doi.org/10.22139/jobs.487388