Arabic NLP & ML in Social Media
Abstract: Arabic is spoken widely around the world, and it has distinctive characteristics that make it hard for computers to handle. Social media has recently come to be regarded as one of the richest sources of knowledge sharing and information gathering on the Internet. Arabic Natural Language Processing (ANLP) tools play a major role in understanding the content of Arabic textual data such as social media posts: they help clean noisy data, stem words, and so on, and they assist in understanding semantic or sentiment content. Arabic Machine Learning (AML), covering classification and clustering, is applied to social media to discover the polarity or opinion expressed in the content. Many classifiers and clustering algorithms, such as SVM and K-Means, are used for social media content detection. In this paper we review the literature on popular ANLP tools and AML software applied to social media content, with the aim of identifying the best tools in these domains.

Keywords: Machine Learning; Arabic Social Media; Natural Language Processing

I. INTRODUCTION

A. Arabic Language Overview

Arabic is at once a difficult and an enjoyable language. It is one of the Semitic languages and is spoken by nearly 380 million people around the world as their first official language [1]. The Arab people display strong linguistic and educational continuity. Arabic is the official language of countries from North Africa to the Arabian Gulf.

Arabic exists in several varieties. Modern Standard Arabic (MSA) is used in formal transactions and in the spoken language of media and news [2]. Classical Arabic (CA) is the language of the Holy Quran, literary texts, and poetry; it has been spoken by Arabs for more than fourteen centuries [3]. Colloquial dialects are a third variety, and they vary by region [4].

The Arabic language faces many challenges. It consists of 28 different characters written from right to left. A major difficulty with the Arabic alphabet is that letters change their form depending on their position in the word. For example, the letter Seen (س) looks like (سـ) at the beginning of a word, (ـسـ) in the middle, and (ـس) at the end. Another difficulty comes from word derivation: 85% of Arabic words are derived from roots.

B. Arabic Social Media

The growing use of social media in recent years has given users the ability to interact and to share their opinions, information, and knowledge through comments and posts on live social platforms. Social networking sites such as Facebook and Twitter are among the most widely used social media in the world. Social networking makes it easier for users to talk to friends and family.

The paper "The Impact of Google Apps at Work: Higher Educational Perspective" showed that social media combined with Google Apps made it easier for users to learn, collaborate, and share ideas with each other [5]. Moreover, social media is involved in many forms of learning, such as e-learning and distance learning. On social media, users tend to write in their own informal language, which does not adhere to standard grammar, spelling, or vocabulary.

C. Arabic Natural Language Processing

Natural Language Processing (NLP), or Computational Linguistics, is a part of computer science and a branch of artificial intelligence. NLP tools analyze written texts automatically, without human intervention [6]. The ultimate aim of NLP is to enable machines to understand human language. There are many techniques for Arabic natural language processing:

1) Normalization

In Arabic, there are 8 characters that can be used as extra characters, which are other forms of primary
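As an illustration of what normalization can look like in practice, the following minimal Python sketch folds positional letter forms back to their base letters, strips diacritics and the tatweel (elongation) character, and unifies a few letter variants. The particular variant mapping is an assumption made for this example and is not the character list of this paper; the function name `normalize_arabic` is hypothetical.

```python
import re
import unicodedata

# Assumed normalization choices for illustration only; real systems differ on
# which variants to fold (for example, whether to map teh marbuta to heh).
DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan through sukun
TATWEEL = "\u0640"                          # elongation character
VARIANTS = {
    "\u0622": "\u0627",  # alef with madda above -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064A",  # alef maksura -> yeh
}


def normalize_arabic(text: str) -> str:
    """Return a normalized copy of an Arabic string."""
    # NFKC folds the Unicode presentation forms (initial, medial, final,
    # isolated) back to the base letter.
    text = unicodedata.normalize("NFKC", text)
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return "".join(VARIANTS.get(ch, ch) for ch in text)


# The four presentation forms of Seen all reduce to the same base letter.
seen_forms = [
    unicodedata.lookup(f"ARABIC LETTER SEEN {form} FORM")
    for form in ("INITIAL", "MEDIAL", "FINAL", "ISOLATED")
]
normalized_forms = {normalize_arabic(ch) for ch in seen_forms}
```

Folding variants in this way reduces the vocabulary a classifier has to learn, at the cost of occasionally merging words that differ only in the folded characters.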
2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT)
Testing data: data judged manually by human domain experts, used to test the accuracy of the trained classifier.

As the amount of training data increases, the accuracy improves [11].
There are several types of classification algorithms:
Support Vector Machine (SVM)
Naive Bayes (NB)
Decision Tree (DT)
K-Nearest Neighbor (KNN)
Logistic Regression

methods of supervised machine learning for sentiment analysis of Arabic topics on social media written in MSA or the Jordanian dialect. They compared two classifiers, SVM and NB, using different features and different preprocessing strategies. The authors used several N-grams (unigrams, bigrams, and trigrams) with different weighting schemes (TF and TF-IDF) and applied alternative stemming options: no stemmer, a stemmer, and a light stemmer. The best-performing scenario was SVM with the stemmer and TF-IDF over bigrams, which outperformed the scenarios using the NB classifier. The SVM classifier achieved an accuracy of 88.72% and an F-score of 88.27% [12,13].
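The N-gram features and weighting schemes mentioned above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in [12]: the helper names (`word_ngrams`, `tf`, `tf_idf`), the toy sentences, and the unsmoothed formula idf = log(N / df) are all assumptions chosen for clarity.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Return the word n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf(doc_ngrams):
    """Term frequency: count divided by the number of n-grams in the document."""
    counts = Counter(doc_ngrams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def tf_idf(corpus_ngrams):
    """TF-IDF weights per document, using idf = log(N / df)."""
    n_docs = len(corpus_ngrams)
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(gram for doc in corpus_ngrams for gram in set(doc))
    return [
        {gram: weight * math.log(n_docs / df[gram]) for gram, weight in tf(doc).items()}
        for doc in corpus_ngrams
    ]


# Hypothetical two-document corpus, whitespace-tokenized.
docs = [
    "الخدمة ممتازة جدا".split(),
    "الخدمة سيئة جدا".split(),
]
unigram_weights = tf_idf([word_ngrams(d, 1) for d in docs])
bigram_weights = tf_idf([word_ngrams(d, 2) for d in docs])
```

In this toy corpus every unigram shared by both documents receives a TF-IDF weight of zero, while the bigrams, which occur in only one document each, keep a positive weight. The resulting weight dictionaries would then serve as feature vectors for a classifier such as SVM or NB.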
corpora in order to verify quality and accuracy. A corpus is defined as a set of texts drawn from a variety of sources, often carrying descriptive tags such as labels and Part-of-Speech (POS) tags. Corpora are used in predicting movie sales, question answering, and other applications. They are also used in sentiment analysis to assign polarity to posts (positive, negative, or neutral). In their proposed approach, the authors created a corpus from Facebook to deal with Dialectal Arabic, the non-standard form of the language. The corpus consists of 1,000 posts collected from the "Al Arabiya" News Facebook page and 1,000 posts collected from "The Voice" Facebook page. Natural language processing was used in constructing the corpus, including tokenizers, POS taggers, a stemmer, and vocalizers. They used manual tagging followed by Inter-Annotator Agreement (IAA), and then applied classifiers such as Naïve Bayes (NB), Decision Tree (DT), Support Vector Machines (SVM), and K-Nearest Neighbors (KNN) to categorize the polarity of the content [17,18,19].

Hossam et al. studied how the spread of social media has raised interest in sentiment analysis. Opinions expressed through social media have become a kind of virtual currency for companies that aim to market their products. They presented a sentence-level approach to Arabic sentiment analysis. The authors also adopted a lexicon of Arabic idioms and saying phrases to improve the detection of sentiment polarity in Arabic sentences, and they used syntactic features to improve the classification accuracy of conflicting sentences. The objective of opinion mining and sentiment analysis is to determine the opinion and position of the writer and to classify documents by context as positive, negative, or neutral. The authors used semi-supervised machine learning for sentiment analysis, with an Arabic sentiment-word lexicon that is expanded automatically and a Support Vector Machine (SVM) classifier. They built a corpus of Arabic sentiment statements that includes 10,000 MSA tweets, Arabic dialect tweets, 10,000 comments, and 1,000 microblogs such as hotel reservation comments, product reviews, and TV program and movie comments. They collected the corpus contents from June 2011 through Twitter API 2 and from various microblogs and forum websites such as http://www.booking.com/, http://forums.fatakat.com, and http://ejabat.google.com. For Arabic language processing, they used terms and their frequencies, Part-of-Speech (POS) tags, opinion words and phrases, syntactic dependency, and negation. The results showed that the SVM classifier performed well, with accuracy of up to 95% [20,21,22].

Haifa et al. studied the increasing prevalence of Twitter, which allows users to express their opinions within 140 characters. Saudi Arabia is one of the countries where Twitter is most widely used, so Twitter can serve as a source for sentiment analysis. The authors proposed addressing the challenges posed by Saudi tweets with a hybrid approach that combines semantic orientation and machine learning techniques. They used a lexical-based classifier to label the training data and an SVM classifier trained with the Sequential Minimal Optimization (SMO) algorithm. The proposed approach consists of several steps: unlabeled tweets, pre-processing, lexical classifier, feature extraction, SVM classification, and evaluation. Their sentiment lexicon contains 1,500 sentiment words: 1,000 negative and 500 positive. The lexicon was able to classify 812 of 1,103 tweets. They handled Arabic text by applying tweet cleaning, normalization, stop-word removal, elimination of speech effects, and stemming with the Khoja and Garside stemmer. The results showed that the proposed solution improved the lexical classifier by 5.76% and the accuracy by 16.41% [23].

In "An Arabic Twitter Corpus for Subjectivity and Sentiment Analysis", the researchers presented a gold-standard annotated corpus to support subjectivity and sentiment analysis (SSA) of Arabic Twitter. The dataset consists of 8,868 tweets, divided into 7,503 tweets of development data, collected between 20 January and 21 February 2014, and 1,365 tweets of test data. They annotated the corpus with a variety of feature sets that have a positive impact. The authors focused on issues posed by Twitter as a genre, such as mixed language and topic shifts. They used online semi-supervised learning. To process the Arabic text of the tweets, they applied syntactic features and word tokens in the form of word-based n-grams. They also applied tokenization, diacritization, morphological disambiguation, Part-of-Speech (POS) tagging, stemming, and lemmatization using MADA+TOKAN (v 3.2) [24,25].

The researchers in [26] described sentiment analysis as a procedure for determining the classification of a specific text. They built a dataset of 2,591 tweets and comments, collected and labelled through crowdsourcing. They applied Arabic natural language processing techniques such as tokenization, stemming, and stop-word removal using RapidMiner. The main reason for selecting RapidMiner is that its text processing package can handle Arabic. The Naïve Bayes, KNN, and SVM classifiers were applied to detect the polarity of a given review. 10-fold cross-validation was used to divide the data into training and testing sets. The results showed that the highest precision was achieved by SVM (75.25) and the highest recall by KNN with K=10 (69.04) [26].

In "Sentiment Analysis for Dialectal Arabic", the authors discussed sentiment analysis of Arabic tweets in the presence of dialectal words. The main steps of this research are data collection and annotation, tweet preprocessing, classification, and results analysis. They used the Twitter API to collect 22,550 tweets and annotated the data with a crowdsourcing tool. To make better use of this data, they applied tweet preprocessing techniques: each tweet is tokenized into words called "tokens", and the Khoja stemmer is then applied to the tokens. Finally, they used the WEKA tool to apply supervised machine learning algorithms, namely Support Vector Machine (SVM) and Naive Bayes (NB) classifiers, to classify the dataset [27,28,29].
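Several of the studies reviewed here evaluate their classifiers with 10-fold cross-validation and report precision and recall. The sketch below shows one simple way to produce the folds and compute the two measures in plain Python. It is an illustrative assumption, not the procedure of any cited tool: the function names are hypothetical, the folds are assigned round-robin, and the data is assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    # Round-robin assignment: sample i goes to fold i % k.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def precision_recall(y_true, y_pred, positive="positive"):
    """Precision and recall for one class; 0.0 when a denominator is empty."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical usage: 25 labelled tweets split into 10 folds.
splits = list(k_fold_indices(25, k=10))
p, r = precision_recall(
    ["positive", "positive", "negative", "negative"],
    ["positive", "negative", "positive", "negative"],
)
```

In a full experiment, a classifier would be trained on each `train` list and scored on the matching `test` list, and the per-fold precision and recall would be averaged.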
they used WEKA to implement three classifiers, Naïve Bayes, Support Vector Machine, and k-nearest neighbor, with an F-measure reaching 91% [39].

(6,691). They applied both lemmatization and POS tagging using MADAMIRA v2.1 to extract features for the SVM classifier [45].
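Naïve Bayes recurs throughout the reviewed studies as a baseline classifier. To make the idea concrete, the following is a minimal multinomial Naïve Bayes with add-one (Laplace) smoothing over tokenized text. It is a teaching sketch under stated assumptions, not the WEKA or RapidMiner implementation: the class name, the toy training sentences, and the choice to ignore words unseen in training are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(count / n_docs)  # log prior of the class
            for w in tokens:
                if w in self.vocab:  # words never seen in training are skipped
                    likelihood = (self.word_counts[label][w] + 1) / (total + len(self.vocab))
                    score += math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set of tokenized posts.
train_docs = [["خدمة", "ممتازة"], ["منتج", "رائع"], ["خدمة", "سيئة"], ["منتج", "رديء"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["خدمة", "رائع"])
```

Working in log space avoids numeric underflow when posts contain many tokens, and the smoothing term keeps a single unseen word-class pair from driving the probability to zero.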
[12] Alomari, K. M., ElSherif, H. M., & Shaalan, K. (2017, June). Arabic Tweets Sentimental Analysis Using Machine Learning. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems (pp. 602-610). Springer, Cham.
[13] Hawashin, B., Fotouhi, F., & Truta, T. M. (2011, March). A privacy preserving efficient protocol for semantic similarity join using long string attributes. In Proceedings of the 4th International Workshop on Privacy and Anonymity in the Information Society (p. 6). ACM.
[14] Duwairi, R. M., Ahmed, N. A., & Al-Rifai, S. Y. (2015). Detecting sentiment embedded in Arabic social media–a lexicon-based approach. Journal of Intelligent & Fuzzy Systems, 29(1), 107-117.
[15] Mansour, A. M., Obaidat, M. A., & Hawashin, B. (2014). Elderly people health monitoring system using fuzzy rule based approach. International Journal of Advanced Computer Research, 4(4), 904.
[16] El-Beltagy, S. R., Khalil, T., Halaby, A., & Hammad, M. (2016, April). Combining lexical features and a supervised learning approach for Arabic sentiment analysis. In International Conference on Intelligent Text Processing and Computational Linguistics (pp. 307-319). Springer, Cham.
[17] Itani, M., Roast, C., & Al-Khayatt, S. (2017, April). Corpora for sentiment analysis of Arabic text in social media. In Information and Communication Systems (ICICS), 2017 8th International Conference on (pp. 64-69). IEEE.
[18] AlZu'bi, S., Hawashin, B., Elbes, M., & Al-Ayyoub, M. (2018, October). A Novel Recommender System Based on Apriori Algorithm for Requirements Engineering. In 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS) (pp. 323-327). IEEE.
[19] AlZubi, S. (2011). 3D multiresolution statistical approaches for accelerated medical image and volume segmentation (Doctoral dissertation, Brunel University School of Engineering and Design PhD Theses).
[20] Ibrahim, H. S., Abdou, S. M., & Gheith, M. (2015). Sentiment analysis for modern standard Arabic and colloquial. arXiv preprint arXiv:1505.03105.
[21] Al-Zu'bi, S., Al-Ayyoub, M., Jararweh, Y., & Shehab, M. A. (2017). Enhanced 3D segmentation techniques for reconstructed 3D medical volumes: Robust and Accurate Intelligent System. Procedia Computer Science, 113, 531-538.
[22] Jararweh, Y., Alzubi, S., & Hariri, S. (2011, December). An optimal multi-processor allocation algorithm for high performance GPU accelerators. In Applied Electrical Engineering and Computing Technologies (AEECT), 2011 IEEE Jordan Conference on (pp. 1-6). IEEE.
[23] Aldayel, H. K., & Azmi, A. M. (2016). Arabic tweets sentiment analysis–a hybrid scheme. Journal of Information Science, 42(6), 782-797.
[24] Refaee, E., & Rieser, V. (2014, May). An Arabic Twitter Corpus for Subjectivity and Sentiment Analysis. In LREC (pp. 2268-2273).
[25] Elbes, M., & Al-Fuqaha, A. (2013). Design of a social collaboration and precise localization services for the blind and visually impaired. Procedia Computer Science, 21, 282-291.
[26] Duwairi, R. M., & Qarqaz, I. (2014, August). Arabic sentiment analysis using supervised classification. In Future Internet of Things and Cloud (FiCloud), 2014 International Conference on (pp. 579-583). IEEE.
[27] Duwairi, R. M. (2015, April). Sentiment analysis for dialectical Arabic. In Information and Communication Systems (ICICS), 2015 6th International Conference on (pp. 166-170). IEEE.
[28] Al-Fuqaha, A., Elbes, M., & Rayes, A. (2013). An intelligent data fusion technique based on the particle filter to perform precise outdoor localization. International Journal of Pervasive Computing and Communications, 9(2), 163-183.
[29] Al-Fuqaha, A., Kountanis, D., Cooke, S., Elbes, M., & Zhang, J. (2010, December). A genetic approach for trajectory planning in non-autonomous Mobile Ad-Hoc Networks with QoS requirements. In GLOBECOM Workshops (GC Wkshps), 2010 IEEE (pp. 1097-1102). IEEE.
[30] Duwairi, R. M., Marji, R., Sha'ban, N., & Rushaidat, S. (2014, April). Sentiment analysis in Arabic tweets. In Information and Communication Systems (ICICS), 2014 5th International Conference on (pp. 1-6). IEEE.
[31] Al-Rubaiee, H., Qiu, R., Alomar, K., & Li, D. (2016). Sentiment analysis of Arabic tweets in e-learning. Journal of Computer Science.
[32] Alhoori, H., Ray Choudhury, S., Kanan, T., Fox, E., Furuta, R., & Giles, C. L. (2015). On the relationship between open access and altmetrics. iConference 2015 Proceedings.
[33] Al-Horaibi, L., & Khan, M. B. (2016, July). Sentiment analysis of Arabic tweets using text mining techniques. In First International Workshop on Pattern Recognition (Vol. 10011, p. 100111F). International Society for Optics and Photonics.
[34] Abdulla, N. A., Ahmed, N. A., Shehab, M. A., & Al-Ayyoub, M. (2013, December). Arabic sentiment analysis: Lexicon-based and corpus-based. In Applied Electrical Engineering and Computing Technologies (AEECT), 2013 IEEE Jordan Conference on (pp. 1-6). IEEE.
[35] Lee, S., Farag, M., Kanan, T., & Fox, E. A. (2015, June). Read between the lines: A Machine Learning Approach for Disambiguating the Geo-location of Tweets. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 273-274). ACM.
[36] Alwakid, G., Osman, T., & Hughes-Roberts, T. (2017). Challenges in Sentiment Analysis for Arabic Social Networks. Procedia Computer Science, 117, 89-100.
[37] Kanan, T., Zhang, X., Magdy, M., & Fox, E. (2015, June). Big data text summarization for events: A problem based learning course. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 87-90). ACM.
[38] Akaichi, J. (2013, September). Social networks' Facebook' statutes updates mining for sentiment classification. In Social Computing (SocialCom), 2013 International Conference on (pp. 886-891). IEEE.
[39] Alhazmi, M., & Salim, N. (2015). Arabic opinion target extraction from tweets. ARPN Journal of Engineering and Applied Sciences, 10(3), 1023-1026.
[40] Rabie, O., & Sturm, C. (2014). Feel the heat: Emotion detection in Arabic social media content. In The International Conference on Data Mining, Internet Computing, and Big Data (BigData2014) (pp. 37-49). The Society of Digital Information and Wireless Communication.
[41] Salamah, J. B., & Elkhlifi, A. (2014). Microblogging opinion mining approach for Kuwaiti dialect. In The International Conference on Computing Technology and Information Management (ICCTIM2014) (pp. 388-396). The Society of Digital Information and Wireless Communication.
[42] Kanan, T., Kanaan, R., Al-Dabbas, O., Kanaan, G., Al-Dahoud, A., & Fox, E. (2016). Extracting Named Entities Using Named Entity Recognizer for Arabic News Articles. International Journal of Advanced Studies in Computers, Science and Engineering, 5(11), 78-84.
[43] Al-Rubaiee, H., Qiu, R., & Li, D. (2016, March). Identifying Mubasher software products through sentiment analysis of Arabic tweets. In Industrial Informatics and Computer Systems (CIICS), 2016 International Conference on (pp. 1-6). IEEE.
[44] Yang, S., Kanan, T., & Fox, E. (2010, September). Digital library educational module development strategies and sustainable enhancement by the community. In International Conference on Theory and Practice of Digital Libraries (pp. 514-517). Springer, Berlin, Heidelberg.
[45] Baly, R., Badaro, G., El-Khoury, G., Moukalled, R., Aoun, R., Hajj, H., & Shaban, K. (2017). A characterization study of Arabic Twitter data with a benchmarking for state-of-the-art opinion mining models. In Proceedings of the Third Arabic Natural Language Processing Workshop (pp. 110-118).