Abstract
Background: With the explosion of communication technologies and the accompanying
pervasive use of social media, there has been a remarkable proliferation of posts, reviews, comments,
and other forms of expression in different languages. This content has attracted researchers from different
fields: economics, political science, social sciences, psychology and, in particular, language
processing. One prominent subject is the discrimination between similar languages and dialects
using natural language processing and machine learning techniques. The problem is usually
addressed by formulating identification as a classification task.
Methods: The proposed approach uses machine learning classification methods to discriminate between
Modern Standard Arabic (MSA) and four regional Arabic dialects: Egyptian, Levantine, Gulf,
and North African. Several models were trained to discriminate between the studied dialects on
large corpora mined from online Arabic newspapers and manually annotated.
Results: Experimental results showed that n-gram features could substantially improve performance.
Logistic regression trained on character and word n-gram features represented as count vectors identified
the handled dialects with an overall accuracy of 95%. The best results were achieved by a linear
support vector classifier using TF-IDF vectors built from character-based unigrams, bigrams, and trigrams
and word-based unigrams and bigrams, with an overall accuracy of 95.1%.
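As an illustration of the feature set named in the Results, the following minimal sketch extracts character n-grams of length 1 to 3 and word n-grams of length 1 to 2 from a text and counts them. It is not the authors' implementation: the function names (`char_ngrams`, `word_ngrams`, `count_features`) are hypothetical, tokenization by whitespace is an assumption, and whether character n-grams should span spaces is also an assumption made here for simplicity.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max (spaces included)."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, splitting on whitespace."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def count_features(text):
    """Build a sparse count vector (as a Counter) over both n-gram families.

    Keys are tagged with "char" or "word" so that a character n-gram and a
    word n-gram with the same surface form remain separate features.
    """
    feats = Counter()
    feats.update(("char", g) for g in char_ngrams(text))
    feats.update(("word", g) for g in word_ngrams(text))
    return feats


# Toy input (hypothetical, not taken from the corpora described in the abstract).
features = count_features("ab ab")
```

In practice, count dictionaries like these would be mapped to fixed-length vectors, optionally reweighted with TF-IDF, and passed to a classifier such as logistic regression or a linear support vector classifier.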
Conclusion: N-gram features can substantially improve dialect identification performance. Additionally,
the choice of data representation can provide a significant performance
boost over a simple representation.
Keywords:
Computational linguistics, dialect identification, social media, machine learning, Arabic, logistic regression.