Recent Advances in Computer Science and Communications

Author(s): Pradeep S.* and Jagadish S. Kallimani

DOI: 10.2174/2213275912666190417150421

Machine Learning Based Predictive Action on Categorical Non-Sequential Data

Page: [1020 - 1030] Pages: 11


Abstract

Background: With the advent of data analysis and machine learning, there is a growing impetus to analyze historical data and build models from it. Data comes in many forms and shapes, each with its own challenges. The most sought-after form for analysis is numerical data, which the wide range of available algorithms and tools makes quite manageable. Another form is categorical data, which is subdivided into ordinal (categories with a meaningful order) and nominal (named categories with no inherent order). Such data can further be broadly classified as sequential or non-sequential. Sequential data is easier to preprocess using existing algorithms.
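
The ordinal/nominal distinction can be illustrated with a minimal sketch. The feature values (`small`, `medium`, `large`, and the colour names) and both helper functions are hypothetical and are not taken from the paper's dataset; the sketch only shows that integer codes carry real information for an ordinal feature, whereas for a nominal feature they are arbitrary labels.

```python
# Hypothetical feature values, chosen only for illustration.
SIZE_ORDER = ["small", "medium", "large"]  # ordinal: the order is meaningful


def encode_ordinal(value, order=SIZE_ORDER):
    """Map an ordinal category to its rank in the known order."""
    return order.index(value)


def encode_nominal_as_labels(values):
    """Assign integer labels to nominal categories in alphabetical order.

    The resulting numbers are arbitrary: they do not express any ranking.
    """
    categories = sorted(set(values))
    return {category: index for index, category in enumerate(categories)}


sizes = ["large", "small", "medium"]
size_codes = [encode_ordinal(s) for s in sizes]

colors = ["red", "green", "blue", "green"]
color_labels = encode_nominal_as_labels(colors)
```

Here the size codes preserve the ranking small < medium < large, while the colour labels would suggest an ordering among colours that does not exist, which is one reason nominal features usually need a different encoding.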

Objective: This paper addresses the challenge of applying machine learning algorithms to categorical data of non-sequential nature.

Methods: Applying several data analysis algorithms directly to such data yields biased results, which makes it impossible to build a reliable predictive model. This paper addresses the problem by walking through a handful of techniques that, during our research, helped us deal with a large categorical dataset of non-sequential nature. Subsequent sections discuss the implementable solutions and the shortfalls of these techniques.

Results: The methods are applied to sample datasets available in the public domain, and the results with respect to classification accuracy are satisfactory.

Conclusion: The best pre-processing technique we observed in our research is one-hot encoding, which breaks the categorical features down into binary features that are then fed into an algorithm to predict the outcome. The example we used is not abstract; it is a real-time production services dataset with many complex variations of categorical features. Our future work includes creating a robust model on such data and deploying it in industry-standard applications.
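
As a rough illustration of one-hot encoding, the following plain-Python sketch expands each categorical column into one binary indicator per category. The column values (`http`, `ftp`, `eu`, `us`) and the helper names `fit_categories` and `one_hot` are assumptions made for this example; they do not come from the paper's production dataset, and this is not the authors' implementation.

```python
def fit_categories(rows):
    """Collect the sorted distinct values seen in each column."""
    n_columns = len(rows[0])
    return [sorted({row[i] for row in rows}) for i in range(n_columns)]


def one_hot(rows, categories):
    """Expand every categorical column into one 0/1 indicator per category."""
    encoded = []
    for row in rows:
        vector = []
        for value, column_categories in zip(row, categories):
            # Exactly one indicator is 1 for a known value; an unseen
            # value produces all zeros for that column.
            vector.extend(1 if value == c else 0 for c in column_categories)
        encoded.append(vector)
    return encoded


# Hypothetical records: (protocol, region).
rows = [("http", "eu"), ("ftp", "us"), ("http", "us")]
categories = fit_categories(rows)
encoded = one_hot(rows, categories)
```

Each output vector has one position per (column, category) pair, so no category is numerically larger than another. In practice a library encoder would typically be used instead of hand-written helpers like these.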

Keywords: Machine learning, predictive analysis, algorithms, data analysis, categorical, non-sequential data.

