Document Type: Original Research Paper

Authors

1 Department of Computer Engineering, Urmia Branch, Islamic Azad University, Urmia, Iran.

2 Department of Computer Engineering, Urmia Branch, Islamic Azad University, Urmia, Iran.

Abstract

With the rapid growth in the number of documents, Text Document Classification (TDC) methods have become crucial. This paper presents a hybrid model of Invasive Weed Optimization (IWO) and the Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS), in order to reduce the large feature space in TDC. TDC comprises several steps: text preprocessing, feature extraction, feature vector construction, and final classification. In the presented model, a feature vector is formed for each document by weighting the features used by IWO. The NB classifier is then trained on the documents, and in the test phase similar documents are classified together. FS increases accuracy and decreases computation time. IWO-NB was evaluated on the Reuters-21578, WebKB, and Cade12 datasets. To demonstrate the superiority of the proposed model in FS, Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) were used as comparison models. The results show that, for FS, the proposed model achieves higher accuracy than NB and the other models. In addition, comparing the proposed model with and without FS shows that FS reduces the error rate.
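The wrapper idea summarized above (an IWO search over feature subsets, scored by an NB classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the binary bit-flip encoding, the use of raw term frequencies as feature weights, the small size penalty in the fitness, the toy data, and all function and parameter names (`train_nb`, `predict_nb`, `fitness`, `iwo_select`, `flip_start`, and so on) are assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict

def train_nb(docs, labels, mask):
    """Fit multinomial Naive Bayes on the features where mask[j] == 1."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        for j, tf in enumerate(doc):
            if mask[j]:
                term_counts[y][j] += tf
    return class_counts, term_counts

def predict_nb(model, doc, mask):
    """Return the class with the highest Laplace-smoothed log-posterior."""
    class_counts, term_counts = model
    n_docs = sum(class_counts.values())
    n_feat = max(1, sum(mask))
    best, best_score = None, -math.inf
    for y, n_y in class_counts.items():
        total = sum(term_counts[y].values())
        score = math.log(n_y / n_docs)
        for j, tf in enumerate(doc):
            if mask[j] and tf:
                score += tf * math.log((term_counts[y][j] + 1) / (total + n_feat))
        if score > best_score:
            best, best_score = y, score
    return best

def fitness(mask, train, valid):
    """Validation accuracy minus a small penalty on subset size (assumed form)."""
    if not any(mask):
        return -1.0  # an empty feature subset is never acceptable
    model = train_nb(train[0], train[1], mask)
    hits = sum(predict_nb(model, d, mask) == y for d, y in zip(*valid))
    return hits / len(valid[1]) - 0.01 * sum(mask) / len(mask)

def iwo_select(train, valid, n_feat, iters=20, pop_init=5, pop_max=10,
               seeds_min=1, seeds_max=4, flip_start=0.3, flip_end=0.02, seed=0):
    """Binary IWO-style search; returns the best feature mask found."""
    rng = random.Random(seed)
    # Start from the full feature set plus random weeds, so the result is
    # never worse (by this fitness) than using all features.
    pop = [[1] * n_feat]
    pop += [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(pop_init - 1)]
    for it in range(iters):
        # Flip probability shrinks over time, mimicking IWO's shrinking dispersal.
        p_flip = flip_end + (flip_start - flip_end) * ((iters - it) / iters) ** 2
        scored = [(fitness(m, train, valid), m) for m in pop]
        lo = min(f for f, _ in scored)
        hi = max(f for f, _ in scored)
        offspring = []
        for f, m in scored:
            # Fitter weeds produce more seeds (linear between min and max).
            ratio = (f - lo) / (hi - lo) if hi > lo else 1.0
            n_seeds = seeds_min + round(ratio * (seeds_max - seeds_min))
            for _ in range(n_seeds):
                offspring.append([b ^ (rng.random() < p_flip) for b in m])
        # Competitive exclusion: keep only the pop_max fittest weeds.
        pool = scored + [(fitness(c, train, valid), c) for c in offspring]
        pool.sort(key=lambda t: t[0], reverse=True)
        pop = [m for _, m in pool[:pop_max]]
    return pop[0]

# Toy term-frequency vectors: features 0-1 lean to class "a", 2-3 to "b", 4-5 are noise.
train = ([[3, 2, 0, 0, 1, 0], [2, 3, 0, 1, 0, 1],
          [0, 0, 3, 2, 1, 1], [0, 1, 2, 3, 0, 1]], ["a", "a", "b", "b"])
valid = ([[2, 1, 0, 0, 0, 1], [0, 0, 1, 2, 1, 0]], ["a", "b"])
best = iwo_select(train, valid, n_feat=6)
print("selected mask:", best, "fitness:", fitness(best, train, valid))
```

Because parents compete with their offspring in every iteration, the returned mask scores at least as well as the full feature set under this fitness. A realistic experiment would replace the toy vectors with weighted term vectors from a corpus such as Reuters-21578 and evaluate on a held-out test split.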
