Efficient Discrimination between Arabic Dialects

Sadik Bessou; Racha Sari

Abstract

Background: With the explosion of communication technologies and the accompanying pervasive use of social media, there has been a remarkable proliferation of posts, reviews, comments, and other forms of expression in different languages. This content has attracted researchers from different fields: economics, political science, social sciences, psychology, and particularly language processing. One prominent subject is the discrimination between similar languages and dialects using natural language processing and machine learning techniques. The problem is usually addressed by formulating identification as a classification task.

Methods: The approach uses machine learning classification methods to discriminate between Modern Standard Arabic (MSA) and four regional Arabic dialects: Egyptian, Levantine, Gulf, and North African. Several models were trained to discriminate between the studied dialects on large, manually annotated corpora mined from online Arabic newspapers.

Results: Experimental results showed that n-gram features could substantially improve performance. Logistic regression over character and word n-gram features represented as count vectors identified the studied dialects with an overall accuracy of 95%. The best results were achieved by a linear support vector classifier using TF-IDF vectors built from character unigrams, bigrams, and trigrams together with word unigrams and bigrams, with an overall accuracy of 95.1%.
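
The feature set named in the results (character n-grams of length 1 to 3 and word n-grams of length 1 to 2, stored as count vectors) can be illustrated with a minimal sketch. This is not the authors' code: the function names, whitespace tokenization, and the sparse dictionary layout are assumptions chosen for illustration, and only the Python standard library is used.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return every character n-gram of length n_min..n_max, in order."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def word_ngrams(text, n_min=1, n_max=2):
    """Return every word n-gram of length n_min..n_max.

    Tokens are split on whitespace, which is an assumption of this sketch.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def count_vector(text):
    """Combine character and word n-gram counts into one sparse vector.

    Keys are (kind, ngram) pairs so that a character n-gram and a word
    n-gram with the same spelling stay separate features.
    """
    features = Counter()
    features.update(("char", g) for g in char_ngrams(text))
    features.update(("word", g) for g in word_ngrams(text))
    return features


if __name__ == "__main__":
    # Hypothetical input sentence; prints the five most frequent features.
    for feature, count in count_vector("ab ab").most_common(5):
        print(feature, count)
```

A classifier such as logistic regression would then be trained on these sparse count vectors, one vector per sentence.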

Conclusion: The results showed that n-gram features could substantially improve performance. Additionally, we noticed that the choice of data representation could provide a significant performance boost compared to a simple representation.
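
To make the difference between data representations concrete, the sketch below reweights raw feature counts with TF-IDF. It is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, and the weighting tf * (ln(N / df) + 1) is only one common TF-IDF variant; the abstract does not say which variant was used.

```python
import math
from collections import Counter


def tfidf_vectors(docs_features):
    """Turn per-document feature lists into sparse TF-IDF vectors.

    docs_features: list of documents, each a list of features
    (for example n-grams). Weight = tf * (ln(N / df) + 1), an assumed
    variant of TF-IDF chosen for this sketch.
    """
    n_docs = len(docs_features)

    # Document frequency: number of documents containing each feature.
    df = Counter()
    for features in docs_features:
        df.update(set(features))

    vectors = []
    for features in docs_features:
        tf = Counter(features)  # raw term frequency (the count vector)
        vectors.append({
            f: count * (math.log(n_docs / df[f]) + 1.0)
            for f, count in tf.items()
        })
    return vectors


if __name__ == "__main__":
    # Hypothetical toy documents: "a" occurs in both, "b" and "c" in one each.
    docs = [["a", "b"], ["a", "c"]]
    for vec in tfidf_vectors(docs):
        print(vec)
```

Features shared by every document keep a weight equal to their raw count, while features confined to fewer documents are weighted up, which is the property that lets dialect-specific n-grams stand out.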

Keywords: Computational linguistics, dialect identification, social media, machine learning, Arabic, logistic regression.

Cite As

Recent Advances in Computer Science and Communications
