Text Mining - A Comparative Review of Twitter Sentiments Analysis

Article ID: e260723219115 Pages: 17

Abstract

Background: Text mining derives information and patterns from textual data. Online social media platforms, which have recently attracted great interest, generate vast amounts of text about human behavior through user interactions. This data is generally ambiguous and unstructured, and it contains typing and grammatical errors that introduce lexical, syntactic, and semantic uncertainty, which in turn leads to incorrect pattern detection and analysis. Researchers are employing various text mining techniques that can aid in topic modeling, the detection of trending topics, the identification of hate speech, and the growth of communities in online social media networks.

Objective: This review paper compares the performance of ten machine learning classification techniques on a Twitter data set for analyzing users' sentiments on posts related to airline usage.

Methods: The review comparatively analyzes Gaussian Naive Bayes, Random Forest, Multinomial Naive Bayes, Multinomial Naive Bayes with Bagging, Adaptive Boosting (AdaBoost), Optimized AdaBoost, Support Vector Machine (SVM), Optimized SVM, Logistic Regression, and Long Short-Term Memory (LSTM) for sentiment analysis.

Results: The experimental study showed that the Optimized SVM outperformed the other classifiers, with a training accuracy of 99.73% and a testing accuracy of 89.74%.
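The training and testing accuracies above differ by roughly ten percentage points (99.73 - 89.74 = 9.99). As a minimal sketch of how such a pair of figures is computed, the following uses hypothetical toy labels; the function name `accuracy` and the label values are assumptions for illustration and are not taken from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical toy labels, NOT the study's data set.
train_true = ["neg", "pos", "pos", "neg"]
train_pred = ["neg", "pos", "pos", "neg"]
test_true = ["neg", "pos", "pos", "neg"]
test_pred = ["neg", "pos", "neg", "neg"]

train_acc = accuracy(train_true, train_pred)
test_acc = accuracy(test_true, test_pred)

# A large positive gap suggests the model fits the training data
# more closely than it generalizes to unseen data.
gap = train_acc - test_acc
print(f"train={train_acc:.2%} test={test_acc:.2%} gap={gap:.2%}")
```

Reporting both figures, as the abstract does, lets the reader judge how well a classifier generalizes rather than only how well it fits the data it was trained on.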

Conclusion: Optimized SVM uses the RBF kernel function and nonlinear decision boundaries to split the dataset into classes, correctly classifying the data by polarity. This, together with feature engineering using Forward Trigrams and Weighted TF-IDF, improved the classifier's performance, yielding train and test accuracies of 99.73% and 89.74%, respectively. Compared with Random Forest, this is a marginal improvement of 0.09% in train accuracy and 1.73% in test accuracy; compared with LSTM, the improvement is 1.29% (train accuracy) and 3.63% (test accuracy). Likewise, Optimized SVM improved train accuracy by more than 10% over Gaussian Naive Bayes, Multinomial Naive Bayes, Multinomial Naive Bayes with Bagging, and Logistic Regression, and a similar improvement was observed over the ensemble models AdaBoost and Optimized AdaBoost. Optimized SVM also outperformed all the other classification models in terms of AUC-ROC train and test scores.
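To illustrate the two ingredients named in the conclusion, the sketch below builds TF-IDF weights over word trigrams and compares documents with an RBF kernel. It rests on several assumptions: the abstract does not define "Forward Trigrams" or "Weighted TF-IDF", so plain consecutive word trigrams and a standard smoothed TF-IDF are swapped in here; the example tweets, the function names, and `gamma=1.0` are hypothetical; and no SVM training is performed.

```python
import math
from collections import Counter

def word_trigrams(text):
    """Return the consecutive three-word sequences in a text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def tfidf_vectors(docs):
    """Map each document to a sparse dict of trigram -> TF-IDF weight."""
    grams = [Counter(word_trigrams(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each trigram.
    doc_freq = Counter(g for counts in grams for g in counts)
    vectors = []
    for counts in grams:
        total = sum(counts.values()) or 1
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            g: (c / total) * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1.0)
            for g, c in counts.items()
        })
    return vectors

def rbf_kernel(u, v, gamma=1.0):
    """RBF kernel exp(-gamma * ||u - v||^2) on sparse dict vectors."""
    keys = set(u) | set(v)
    sq_dist = sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys)
    return math.exp(-gamma * sq_dist)

# Hypothetical example tweets, NOT drawn from the study's data set.
tweets = [
    "the flight was delayed again today",
    "the flight was delayed for hours",
    "great crew and a smooth landing",
]
vecs = tfidf_vectors(tweets)
sim_related = rbf_kernel(vecs[0], vecs[1])    # share two trigrams
sim_unrelated = rbf_kernel(vecs[0], vecs[2])  # share no trigrams
print(sim_related, sim_unrelated)
```

An SVM with this kernel scores a new document by a weighted sum of such similarities to its support vectors, which is what allows the decision boundary to be nonlinear in the original feature space.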
