Recent Advances in Computer Science and Communications

Author(s): Kousik Barik, Sanjay Misra*, Karabi Konar, Manju Kaushik and Ravin Ahuja

DOI: 10.2174/2666255816666220601113550

A Comparative Study on the Application of Text Mining in Cybersecurity

Article ID: e010622205492 Pages: 14

Abstract

Aims: This paper conducts a Systematic Literature Review (SLR) of the applications of text mining in cybersecurity.

Objectives: The growing amount of data generated worldwide has changed many activities associated with cybersecurity and demands a high level of automation.

Methods: In the cybersecurity domain, text mining is an alternative for improving the usefulness of various activities that entail unstructured data. This study searched databases and retrieved 516 papers published from 2015 to 2021, of which 75 were selected for analysis. The selected studies are evaluated in detail with respect to the sources, techniques, and information extraction they apply to cybersecurity applications.

Results: This study identifies gaps for future studies, such as text processing, availability of datasets, innovative methods, and intelligent text mining.

Conclusion: This study concludes with interesting findings on employing text mining in cybersecurity applications. Researchers need to exploit all related text mining techniques and algorithms to detect cybersecurity threats and protect organizations from them.

Keywords: Cybersecurity, text mining, natural language processing, data mining, ML techniques.
