Arabic NLP & ML in Social Media

The document reviews the challenges and tools associated with Arabic Natural Language Processing (ANLP) and Arabic Machine Learning (AML) in analyzing social media content. It highlights the complexities of the Arabic language, including its unique character forms and dialects, and discusses various techniques such as tokenization, stemming, and sentiment analysis using machine learning classifiers. The paper also examines the effectiveness of different algorithms and frameworks for sentiment analysis in Arabic social media, emphasizing the importance of preprocessing strategies for accurate results.

2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT)

A Review of Natural Language Processing and Machine Learning Tools Used to Analyze Arabic Social Media

Tarek Kanan, Department of Computer Science, Al-Zaytoonah University of Jordan, Amman, Jordan, tarek.kanan@zuj.edu.jo
Odai Sadaqa, Department of Computer Science, Al-Zaytoonah University of Jordan, Amman, Jordan, sadaqa.odai@yahoo.com
Amal Aldajeh, Department of Computer Science, Al-Zaytoonah University of Jordan, Amman, Jordan, amalmohammad555@yahoo.com
Hanadi Alshwabka, Department of Computer Science, Al-Zaytoonah University of Jordan, Amman, Jordan, h.alshawabkh@zuj.edu.jo
Wassan AL-dolime, Department of Computer Science, Al-Zaytoonah University of Jordan, Amman, Jordan, sonasami.h@gmail.com
Shadi AlZu'bi, Department of Computer Science, Al-Zaytoonah University of Jordan, Amman, Jordan, smalzoubi@zuj.edu.jo
Mohammed Elbes, Department of Computer Science, Al-Zaytoonah University of Jordan, Amman, Jordan, m.elbes@zuj.edu.jo
Bilal Hawashin, Department of Computer Information Systems, Al-Zaytoonah University of Jordan, Amman, Jordan, b.hawashin@zuj.edu.jo
Mohammad A. Alia, Department of Computer Information Systems, Al-Zaytoonah University of Jordan, Amman, Jordan, dr.m.alia@zuj.edu.jo

Abstract: Arabic is spoken widely around the world, and it has special characteristics that make it hard for computers to handle. Social media has recently become one of the richest sources for knowledge sharing and information gathering on the internet. Arabic Natural Language Processing (ANLP) tools play a major role in understanding the content of Arabic textual data such as social media: they help clean noisy data, stem words, and so on, and they assist in understanding semantic or sentiment content. We use Arabic Machine Learning (AML), covering classification and clustering, with social media to discover the polarity or opinion in the content. Many classifiers and clustering algorithms are used for social media content detection, such as SVM and K-Means. In this paper we review the literature on popular ANLP tools and AML software applied to social media content, toward identifying the best tools in these domains.

Keywords: Machine Learning; Arabic Social Media; Natural Language Processing

I. INTRODUCTION

A. Arabic Language Overview

Arabic is a difficult and enjoyable language at the same time. It is one of the Semitic languages, spoken by nearly 380 million people around the world as their first official language [1]. The Arab people display powerful linguistic and educational continuity. Arabic is the formal language of countries from North Africa to the Arabian Gulf.

There are several forms of Arabic. Modern Standard Arabic (MSA) is used in formal transactions and in spoken media and news [2]. Classical Arabic (CA) is the language of the Holy Quran, literary texts, and poetry; it has been spoken by Arab people for more than fourteen centuries [3]. Colloquial dialects are a third form, and they vary depending on where the speaker lives [4].

The Arabic language poses many challenges. It consists of 28 different characters written from right to left. A major difficulty with the Arabic alphabet is that letters change their form depending on their position in the word. For example, the letter Seen (س) looks like (سـ) at the beginning of a word, (ـسـ) in the middle, and (ـس) at the end. Another difficulty comes from word derivation: 85% of Arabic words are derived from roots.

B. Arabic Social Media

The increasing use of social media in recent times has given users the ability to interact and share their opinions, information, and knowledge through comments and posts on live social platforms. Social networking sites such as Facebook and Twitter are among the most widely used social media in the world. Social networking makes it easy for users to talk to friends and family.

In the paper "The Impact of Google Apps at Work: Higher Educational Perspective", the results showed that social media with Google Apps made it easier for users to learn, collaborate, and share ideas with each other [5]. Moreover, social media intersects with many forms of learning, such as e-learning and distance learning. On social media, users tend to write in their own informal language, which does not adhere to standard grammar, spelling, or common usage.

C. Arabic Natural Language Processing

Natural Language Processing (NLP), or Computational Linguistics, is a part of computer science and a branch of artificial intelligence. NLP tools analyze written texts automatically without human intervention [6]. The ultimate aim of NLP is to enable machines to understand human language. There are many techniques for Arabic natural language processing:

1) Normalization

In Arabic, there are 8 characters that can be used as extra characters, which are other forms of primary

978-1-5386-7942-5/19/$31.00 ©2019 IEEE

characters according to their position in the word. For example, the letter Alef has many shapes (ا, أ, آ, إ). Alef maksoura "ى" is a form of Alef, but it is regularly confused in writing with the letter ya "ي". Taa marbota "ة" is regularly confused with ha "ه". Several forms of Hamza, "ئ", "ؤ", and "ء", are used interchangeably depending on the part of a word in the sentence [7].

2) Tokenization

Tokenization is the process of splitting text so that each word stands alone, using the first space to mark where the next word begins; each resulting unit is called a token [8]. Figure 1 shows an example of Arabic tokenization according to MADAMIRA.

Figure 1: Example of Arabic Tokenization

3) Named Entity Recognition (NER)

NER is the process of identifying person names, organizations, locations, dates and times, and phone numbers [6]. Figure 2 shows an example of Arabic NER. NER helps locate pieces of the text toward extracting knowledge and information from the text.

Figure 2: Example of NER

4) Part of Speech Tagging (POS)

POS tagging is the process of labeling each word in the text based on its form and position in the expression, with categories such as noun, verb, adverb, and adjective [9]. Figure 3 shows an example of POS tagging in a sentence.

Arabic Sentence: كان القائد أكثرهم شجاعة وذكاء

Figure 3: Example of Part of speech

5) Stemming

Stemming aims to remove the suffixes and prefixes from a word and return the word to its root. There are four types of affixes that can be attached to words: antefixes, prefixes, suffixes, and postfixes [10]. Figure 4 shows an example of stemming.

Arabic term: ليجدونهم
Antefix: ل
Prefix: ي
Root: وجد
Suffix: ن
Postfix: هم

Figure 4: Example of Arabic stemming

6) Arabic Stop Word Removal

Stop words are words that need to be filtered out before processing the text. They should be removed because they may mislead the results, so ignoring them improves the search process [8]. Table 1 shows examples of Arabic stop words.

Table 1: Example of Arabic stop word removal

Reason (السبب) | Word (الكلمة)
The five nouns (الأسماء الخمسة) | أب - أخ - حم - فو - ذو
Conjunctions (حروف العطف) | أو - ثم
Pronouns (الضمائر) | أنا - نحن - انت
Month names (أسماء الأشهر) | يناير - فبراير - تشرين الاول
Currency names (أسماء العملات) | دينار - ريال - دولار
Numbers and numerals (الأرقام والأعداد) | واحد - اثنان - مئه - ألف
Demonstratives (أسماء الإشارة) | أولئك - تلك - ذاك - ذلك

D. Arabic Machine Learning

Machine learning is a branch of artificial intelligence (AI) that enables systems to learn and improve automatically without human intervention or assistance. Learning is achieved by giving the machine training data and applying an algorithm to understand that data. The training data helps uncover patterns in the data and make better decisions in the future. Figure 5 shows the types of machine learning.

Figure 5: Types of Machine learning

1) Supervised Machine Learning

Supervised learning is categorized into classification and regression. Classification is the process of distinguishing objects from each other and placing related objects in one or more groups for easier access.

The data is divided into two main parts:
• Training data: judged manually by human domain experts and used to train the classifier.
• Testing data: judged manually by human domain experts and used to test the accuracy of the trained classifier.

Accuracy generally improves each time the training data is increased [11].

There are several types of classification algorithms:
• Support Vector Machine (SVM)
• Naive Bayes (NB)
• Decision Tree (DT)
• K-Nearest Neighbor (KNN)
• Logistic Regression

2) Unsupervised Machine Learning (Clustering)

Clustering is the process of grouping similar data (documents) together based on their relevant features, so that the most similar documents are placed in the same group. Clustering can efficiently enhance the search operation in any retrieval system. Clustering can be divided into two main types:
1. Hard clustering: each element must be in one group (cluster) only.
2. Soft clustering: each element can be in multiple groups (clusters).

There are several types of clustering algorithms:
• K-Means Clustering
• Mean-Shift Clustering
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Expectation-Maximization (EM)
• Agglomerative Hierarchical Clustering

3) Regression

Regression is a statistical method for finding relationships between variables. Its main objective is prediction: known information is used to predict an unknown value. For example, knowing the area and location of an apartment can help predict its price.

There are several types of regression:
• Logistic Regression
• Ordinal Logistic Regression
• Ordinary Least Squares
• Count Data Regression
• Regularization Techniques

II. LITERATURE REVIEW

In this section, we divide the reviewed papers into two categories: Sentiment Analysis and Opinion Mining.

A. Sentiment Analysis

In the paper (Arabic Tweets Sentimental Analysis Using Machine Learning), the authors studied a collection of Jordanian Arabic tweets divided into positive and negative tweets. The collection totals 1,800 tweets, written in MSA or the Jordanian dialect. To investigate supervised machine learning methods for sentiment analysis of Arabic social media topics written in MSA or the Jordanian dialect, they compared two classifiers, SVM and NB, using different features and different preprocessing strategies. The authors used several N-grams (unigrams, bigrams, and trigrams) with different weighting schemes (TF, TF-IDF) and applied alternative stemming techniques: no stemmer, stemmer, and light stemmer. The best-performing scenario was the SVM classifier using the stemmer with TF-IDF over bigrams, compared with the scenario that used the NB classifier. The SVM classifier achieved an accuracy of 88.72% and an F-score of 88.27% [12,13].

(Duwairi et al.) discussed a new framework for sentiment analysis of a group of Arabic tweets, based on a sentiment lexicon. They built the dictionary by translating SentiStrength expressions from English to Arabic and then expanded the list using an Arabic dictionary. The dictionary contains 2,376 entries divided into 1,777 negative entries and 600 positive entries. The proposed framework uses unsupervised learning to build the lexicon, which is then used to determine the polarity of tweets. To handle the complexity of Arabic, natural language processing is used: the authors applied a tokenizer, stemming, normalization, and stop-word removal as NLP tools [14,15].

Samhaa R. El-Beltagy and others discussed the importance of sentiment analysis of Arabic media during the past two years, driven by the increasing number of social media users. They found it difficult to deal with texts written on social media, which are often in slang, in addition to the great difficulty of dealing with the various Arabic dialects. They combined a set of features with machine-learning-based sentiment analysis, for which they chose Complement Naïve Bayes. They applied it to Modern Standard Arabic, Eastern, Saudi, and Egyptian social media datasets. They used the Arabic sentiment lexicon NileULex, which contains 5,953 positive and negative entries consisting of sentiment words and phrases from both Modern Standard Arabic (MSA) and Egyptian. They used natural language processing to reduce the difficulties of Arabic: normalization removes diacritics, replaces the letters "آ" and "أ" with "ا", and replaces "ة" with "ه". They also removed elongation, which writers use to give a word more emphasis and make it eye-catching, and replaced any token beginning with @ with the word "MENTION", which is known as mention normalization. In the absence of POS tags, named entity tagging is also necessary, and omitting it affects the accuracy of sentiment analysis. To match the lexicon entries, they used lemmatization. They applied stemming using a "torso force" approach, in which any taa marbota (E), ha (e), or redundant ya is removed. According to the results, the proposed model is the best and most recent sentiment analysis model for the Arabic language [16].

(Maher Itani et al.) studied applications of natural language processing such as text categorization, machine translation, and sentiment analysis, which require annotated
corpora in order to verify quality and accuracy. A corpus is defined as a set of texts created from a variety of sources, often containing descriptive tags such as labels and Part of Speech (POS) tags. Corpora are used in predicting movie sales, question answering, and other applications. They are also used in sentiment analysis to assign a polarity (positive, negative, or neutral) to posts. In their proposed approach the authors created a corpus from Facebook to deal with Dialectal Arabic. The corpus consists of 1,000 posts collected from the "Al Arabiya" News Facebook page and 1,000 posts collected from "The Voice" Facebook page. Natural language processing was used in constructing the corpus: tokenizers, POS taggers, a stemmer, and vocalizers. They performed manual tagging, measured Inter-Annotator Agreement (IAA), and then used classifiers such as Naïve Bayes (NB), Decision Tree (DT), Support Vector Machines (SVM), and K-Nearest Neighbors (KNN) to categorize the polarity of the content [17,18,19].

(Hossam et al.) studied how the spread of social media has raised interest in sentiment analysis. In addition, opinions shared through social media have turned into a kind of virtual currency for companies that aim to market their products. They presented a sentence-level approach to Arabic sentiment analysis. The authors also adopted an Arabic idioms/saying phrases lexicon to improve the detection of sentiment polarity in Arabic sentences, and they used syntactic features to improve classification accuracy for opposing sentences. The objective of opinion mining and sentiment analysis is to determine the opinion and position of the writer and to classify documents by contextual polarity: positive, negative, or neutral. The authors used semi-supervised machine learning for sentiment analysis with an Arabic sentiment words lexicon that is expanded automatically, together with a support vector machine (SVM). They built a corpus of Arabic sentiment statements that includes 10,000 tweets in MSA and Arabic dialects, 10,000 comments, and 1,000 microblogs such as hotel reservation comments, product reviews, and TV program and movie comments. They collected the corpus contents from June 2011 via the Twitter API, various microblogs, and forum websites such as http://www.booking.com/, http://forums.fatakat.com, and http://ejabat.google.com. For Arabic language processing they used terms and their frequencies, Part-Of-Speech (POS) tags, opinion words and phrases, syntactic dependency, and negation. Results showed that the SVM classifier performed well, with accuracy of up to 95% [20,21,22].

(Haifa et al.) studied the increasing prevalence of Twitter, which allows users to express their opinions within 140 characters. Saudi Arabia is among the countries with the heaviest Twitter use, so Twitter can be used there for sentiment analysis. The authors proposed to overcome the challenges posed by Saudi tweets with a hybrid approach that combines semantic orientation and machine learning techniques. They used a lexical-based classifier on the training dataset and an SVM classifier based on the Sequential Minimal Optimization (SMO) algorithm. The proposed approach consists of several steps: unlabeled tweets, pre-processing, lexical classifier, feature extraction, SVM classification, and evaluation. Their sentiment lexicon contains 1,500 sentiment words: 1,000 negative and 500 positive. The dictionary was able to classify 812 tweets out of 1,103 tweets. To deal with Arabic they applied tweet cleaning, normalization, stop word removal, elimination of speech effect, and stemming with the Khoja and Garside stemmer. The results showed that the proposed solution improved the lexical classifier by 5.76% and the accuracy by 16.41% [23].

The researchers in (An Arabic Twitter Corpus for Subjectivity and Sentiment Analysis) studied a gold-standard annotated corpus to support subjectivity and sentiment analysis (SSA) of Arabic Twitter. They collected a dataset of 8,868 tweets, divided into 7,503 tweets as development data, collected from January 20 to February 21, 2014, and 1,365 tweets as test data. They annotated the corpus using a variety of feature sets that have a positive impact. The authors focused on issues posed by Twitter as a genre, such as mixed language and topic shifts. They used online semi-supervised learning. To deal with Arabic on Twitter, they applied syntactic features and word tokens based on word n-grams, and they applied tokenization, diacritization, morphological disambiguation, Part-of-Speech (POS) tagging, stemming, and lemmatization using MADA+TOKAN (v 3.2) [24,25].

The researchers in [26] discussed sentiment analysis as a procedure that determines the classification of a specific text. They built a dataset of 2,591 tweets/comments, collected and labelled using crowdsourcing. They applied Arabic natural language processing techniques such as tokenization, stemming, and stop word removal through the RapidMiner program. The main reason for selecting RapidMiner is that its text processing package can handle the Arabic language. The Naïve Bayes, KNN, and SVM classifiers were applied to detect the polarity of a given review. 10-fold cross-validation was used to divide the data into training and testing sets. Results showed that the best precision was achieved by SVM (75.25) and the best recall by KNN with K=10 (69.04) [26].

In the study (Sentiment Analysis for Dialectal Arabic), the authors discussed sentiment analysis of Arabic tweets in the presence of controversial words. The main steps of this research are: data collection and annotation, tweet preprocessing, classification, and results analysis. They used the Twitter API to collect 22,550 tweets and annotated the data using a crowdsourcing tool. To make better use of the data, they applied tweet preprocessing techniques: each tweet is tokenized into words called tokens, and the Khoja stemmer is then applied to stem the tokens. Finally, they used the WEKA tool to apply supervised machine learning algorithms, classifying the dataset with Support Vector Machine (SVM) and Naive Bayes (NB) classifiers [27,28,29].
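
Several of the studies above share the same basic pipeline: tokenize each tweet, drop stop words, train a supervised classifier such as Naive Bayes, and report precision and recall. The following is a minimal sketch of that pipeline in plain Python. It is not the implementation used in any of the reviewed papers, which relied on tools such as RapidMiner and WEKA. The stop-word list, the toy training tweets, and all function names are illustrative assumptions, and stemming is omitted for brevity.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, tiny stop-word list used only for illustration.
STOP_WORDS = {"في", "من", "على", "هذا", "أنا", "ثم", "أو"}

def tokenize(text):
    """Split on whitespace and drop stop words."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def train_naive_bayes(labeled_texts):
    """Count documents per class and tokens per class; collect the vocabulary."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_texts:
        class_counts[label] += 1
        for tok in tokenize(text):
            token_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, token_counts, vocab

def predict(model, text):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def precision_recall(gold, predicted, positive="pos"):
    """Precision and recall for the class named by `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labeled tweets (invented for this sketch, not taken from any corpus).
TRAIN = [
    ("هذا الفيلم رائع جدا", "pos"),
    ("خدمة ممتازة و رائع", "pos"),
    ("هذا الفيلم سيء جدا", "neg"),
    ("خدمة سيئة و تجربة سيء", "neg"),
]
model = train_naive_bayes(TRAIN)
```

Calling predict(model, text) on a new tweet returns "pos" or "neg", and precision_recall compares a list of gold labels with a list of predictions. In a 10-fold cross-validation setup like the one in [26], the same two functions would be applied once per fold, training on nine folds and scoring on the held-out one.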

R. M. Duwairi and others discussed sentiment analysis on Twitter. They gathered around 350,000 tweets, and crowdsourcing was applied to label more than 25,000 of them. Tweets were filtered as follows: every tweet had to be at least 100 letters long; any tweet with more than four hashtags was removed; any tweet containing links or mentions was removed; and repeated tweets and retweets were discarded. Each tweet was given a label of Positive, Negative, or Neutral. The researchers applied Arabic natural language processing techniques such as stop word removal through the RapidMiner program, and then applied stemming and light stemming to the dataset. Three built-in RapidMiner classifiers were used to assess their framework: Naïve Bayes (NB), k-Nearest Neighbors (K-NN), and Support Vector Machines (SVM) [30].

In the work (Sentiment Analysis of Arabic Tweets in e-Learning), the authors presented the design and application of Arabic text categorization with regard to the opinions of King Abdul-Aziz University students. Their method contains five basic stages: data collection, data filtering, data pre-processing, classification, and evaluation. They prepared a dataset of 2,000 tweets collected through the Twitter API. They then applied Arabic natural language processing techniques such as tokenization, stop word removal, and light stemming using the RapidMiner program. For supervised machine learning they used Support Vector Machine (SVM) and Naive Bayes (NB) classifiers [31,32].

In [33] the authors aimed to develop an initial model that performs reasonably well at measuring Arabic Twitter sentiment. The proposed approach to Arabic tweet sentiment analysis contains three stages: data collection, data pre-processing, and classification and evaluation. They used the Twitter Stream API to collect tweets, with the Tweepy library in their Python script. They ran this script twice, first on July 15, 2015 and again on September 12, collecting 14,984 tweets. They used Python for tokenization and Arabic stop word removal (for which they generated a list of more than 162 Arabic words), then applied the Information Science Research Institute (ISRI) stemmer and the light stemmer to get the root of each token in the tweet. Finally, they applied Naïve Bayes (NB) and Decision Tree (DT) classifiers through the NLTK tool; the NB classifier achieved an accuracy of 64.85% and the DT classifier an accuracy of 53.75%.

(Nawaf A. Abdulla et al.) collected a dataset from Twitter using a tweet crawler. The corpus consists of 2,000 tweets, categorized into 1,000 positive and 1,000 negative tweets. They applied Arabic natural language processing techniques such as stop word removal and stemming using the Khoja stemmer, and they used the MS Word lexicon as a reference for correcting misspellings, automatically choosing the first word it proposed. Finally, they used the RapidMiner program to apply classifiers such as SVM, DT, KNN, and NB [34,35].

In the paper (Challenges in Sentiment Analysis for Arabic Social Networks, 2017), the authors studied the growing volume of Arabic posts on social media, which increases the importance of sentiment analysis and opinion mining and the interest of researchers, especially since research on the Arabic language is scarce. The difficulty and complexity of Arabic, together with the lack of tools for extracting Arabic sentiment from text, is an obstacle to researchers, so natural language processing is used to facilitate dealing with Arabic text. They presented challenges in sentiment analysis and opinion mining of informal Arabic, which are addressed through natural language processing and supervised machine learning with SVM and NB algorithms. They proposed an approach to sentiment analysis of Arabic tweets based on lexical normalization of the raw tweet text and supervised machine learning to determine the polarity of tweet sentiment. They collected 9,096 tweets using Twitter's API through a Python script. They presented the most complex challenges in processing the language: the great difficulty of dealing with the formatting of Twitter microblogging, in addition to the difficulties of Arabic dialects. The natural language processing techniques used in the proposed approach include root stemming and light stemming, which strip affixes (prefixes, suffixes, and infixes) and reduce words toward their root form. The results showed that the machine learning techniques yielded satisfactory results; the best sentiment classification was recorded by the Support Vector Machines (SVM) classifier with a Bag of Words (BOW) representation [36,37].

B. Opinion Mining

(Jalel Akaichi et al., 2013) studied the impact of the growing spread of social media such as Facebook and Twitter, which has led to increased interest in text mining and sentiment analysis. The authors focused on Tunisian users' Facebook status posts during the "Arab Spring" period. The purpose was to understand the behaviors and sentiments of users during that sensitive period. They proposed a technique based on Support Vector Machine and Naïve Bayes, and they constructed a sentiment lexicon from the extracted status updates. They collected about 260 statuses posted on Facebook by Tunisian users. They also applied natural language processing: stemming, stop-word removal, part of speech tagging, and feature extraction. After comparing the SVM and NB algorithms, the results showed high accuracy in classifying the Facebook statuses [38].

In [39] the authors proposed an approach for extracting opinions from Arabic-language tweets. The work is carried out in three stages: tweet preprocessing, feature extraction, and classification and results. To evaluate their techniques they used the Twitter API to gather 500 Arabic tweets on general topics, from 26 April to 1 June 2014. They then applied Arabic Natural Language Processing techniques, using the Stanford Arabic part-of-speech tagger and the LingPipe tool to extract named entities from Arabic tweets. Finally,
they used WEKA to implement three classifiers: Naïve Bayes, support vector machine, and k-nearest neighbor, with an F-measure score reaching 91% [39].

In [40] the researchers tackled the automatic detection of sentiment in the textual parts of posts on the Twitter social network. They collected 1,776 tweets by more than 200 users, from January 25, 2011 to February 11, 2011. They filtered out non-Arabic tweets, retweets, and tweets that included photos or videos. After filtering the data, they applied Arabic Natural Language Processing through stop word removal and stemming using the Shereen Khoja Arabic stemmer. Lastly, they used the WEKA program to apply classification algorithms, including SMO, SVM, and NB classifiers [40].

In the article (Microblogging Opinion Mining Approach for Kuwaiti dialect), the researchers discussed a method to extract and classify opinions in microblogs written in the Kuwaiti dialect. Their approach contains four major steps: tweet collection, segmentation, feature generation, and opinion classification. The dataset contains 340,000 tweets collected through the Twitter API; the tweets were then segmented using the Stanford Arabic Tokenizer. After that, they used supervised machine learning in WEKA to classify tweets with Support Vector Machine (SVM) and Decision Tree (DT) algorithms [41,42].

Hamed AL-Rubaiee and others studied the richness of social media for mining user sentiments about a leading stock analysis software supplier in the Gulf area. The work is divided into four steps: data collection, data preprocessing, classification, and evaluation. They gathered a dataset of Arabic tweets with a small desktop application (Twitter Data Grabber); it contains 2,051 tweets collected from July 15 to September 12, 2015, divided into three classes: Positive, Negative, and Neutral. They then used the RapidMiner program to apply Arabic Natural Language Processing (ANLP) techniques such as tokenization, stop word removal, and LightStem for stemming. Lastly, they used Support Vector Machine (SVM) and Naive Bayes (NB) classifiers for supervised machine learning [43,44].

(Ramy Baly et al.) looked at the difficulty of opinion mining in Arabic caused by the rich Arabic morphology. Things become more difficult on Twitter because of increased noise in tweets, the multiplicity of Arabic dialects, the use of Arabizi, and the existence of text

(6,691). They applied both lemmatization and POS tagging using MADAMIRA v2.1 to extract features for the SVM classifier [45].

III. CONCLUSION

Arabic social media plays a major role in sentiment analysis and the expression of personal opinion. This is done by collecting and then analyzing the content of social media platforms (e.g. Twitter and Facebook). In this paper, we investigated previous work with respect to three main aspects:
• The datasets collected from Arabic social media content
• The Arabic Natural Language Processing (ANLP) tools that were used
• Arabic Machine Learning (classification and clustering)
The review of previous work is meant to show the strengths of the tools used in the selected papers.

IV. ACKNOWLEDGMENT

This research paper was supported by the Al-Zaytoonah University of Jordan fund through the project titled "A novel approach to extract illegal and terrorist contents from Arabic social media posts using machine learning techniques", Resolution number 2018-2017/28/10.

V. REFERENCES

[1] Farghaly, A., & Shaalan, K. (2009). Arabic natural language processing: Challenges and solutions. ACM Transactions on Asian Language Information Processing (TALIP), 8(4), 14.
[2] Zaidan, O. F., & Callison-Burch, C. (2014). Arabic dialect identification. Computational Linguistics, 40(1), 171-202.
[3] Glybovets, A. (2015). Arabic natural language processing.
[4] Kanan, T., Ayoub, S., Saif, E., Kanaan, G., Chandrasekarar, P., & Fox, E. A. (2015). Extracting Named Entities Using Named Entity Recognizer and Generating Topics Using Latent Dirichlet Allocation Algorithm for Arabic News Articles. Department of Computer Science, Virginia Polytechnic Institute & State University.
[5] Al-Emran, M., & Malik, S. I. (2016). The Impact of Google Apps at Work: Higher Educational Perspective. International Journal of Interactive Mobile Technologies (iJIM), 10(4), 85-88.
[6] Ghosh, S. (2009). Application of Natural Language Processing (NLP) Techniques in E-Governance. In E-Government Development and Diffusion: Inhibitors and Facilitators of Digital Democracy (pp. 122-132). IGI Global.
[7] Darwish, K., Magdy, W., & Mourad, A. (2012, October). Language processing for Arabic microblog retrieval. In Proceedings of the 21st ACM international conference on Information and knowledge management (pp. 2427-2430). ACM.
[8] Alhanjouri, M. (2017). Pre Processing Techniques for Arabic Documents Clustering. International Journal of Engineering and
objects of non-text objects like images and URLs to express Management Research (IJEMR), 7(2), 70-79.
opinion. They conducted a study to monitor the different [9] Habash, N., & Rambow, O. (2005, June). Arabic tokenization, part-
of-speech tagging and morphological disambiguation in one fell
linguistic phenomena in different Arab regions. They also swoop. In Proceedings of the 43rd Annual Meeting on Association for
created a typology of Arab tweets in order to better Computational Linguistics (pp. 573-580). Association for
understand them and to promote future research. The Computational Linguistics.
authors used machine learning on Arabic Twitter, through [10] Froud, H., Benslimane, R., Lachkar, A., & Ouatik, S. A. (2010,
used the feature engineering approach and the deep learning September). Stemming and similarity measures for Arabic Documents
Clustering. In I/V Communications and Mobile Network (ISVC),
approach. They collected datasets through Egyptian tweets 2010 5th International Symposium on (pp. 1-4). IEEE.
where they consist of 10,006 tweets by used Twitter4J API, [11] Chowriappa, P., Dua, S., & Todorov, Y. (2014). Introduction to
they collected data sets through Egyptian tweets, which machine learning in healthcare informatics. In Machine Learning in
consisted of 10,006 tweets, divided into several sections: Healthcare Informatics (pp. 1-23). Springer, Berlin, Heidelberg.
positive (799), negative (1,684), neutral (832) and objective


[12] Alomari, K. M., ElSherif, H. M., & Shaalan, K. (2017, June). Arabic Tweets Sentimental Analysis Using Machine Learning. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems (pp. 602-610). Springer, Cham.
[13] Hawashin, B., Fotouhi, F., & Truta, T. M. (2011, March). A privacy preserving efficient protocol for semantic similarity join using long string attributes. In Proceedings of the 4th International Workshop on Privacy and Anonymity in the Information Society (p. 6). ACM.
[14] Duwairi, R. M., Ahmed, N. A., & Al-Rifai, S. Y. (2015). Detecting sentiment embedded in Arabic social media–a lexicon-based approach. Journal of Intelligent & Fuzzy Systems, 29(1), 107-117.
[15] Mansour, A. M., Obaidat, M. A., & Hawashin, B. (2014). Elderly people health monitoring system using fuzzy rule based approach. International Journal of Advanced Computer Research, 4(4), 904.
[16] El-Beltagy, S. R., Khalil, T., Halaby, A., & Hammad, M. (2016, April). Combining lexical features and a supervised learning approach for Arabic sentiment analysis. In International Conference on Intelligent Text Processing and Computational Linguistics (pp. 307-319). Springer, Cham.
[17] Itani, M., Roast, C., & Al-Khayatt, S. (2017, April). Corpora for sentiment analysis of Arabic text in social media. In Information and Communication Systems (ICICS), 2017 8th International Conference on (pp. 64-69). IEEE.
[18] AlZu'bi, S., Hawashin, B., ElBes, M., & Al-Ayyoub, M. (2018, October). A Novel Recommender System Based on Apriori Algorithm for Requirements Engineering. In 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS) (pp. 323-327). IEEE.
[19] AlZubi, S. (2011). 3D multiresolution statistical approaches for accelerated medical image and volume segmentation (Doctoral dissertation, Brunel University School of Engineering and Design PhD Theses).
[20] Ibrahim, H. S., Abdou, S. M., & Gheith, M. (2015). Sentiment analysis for modern standard Arabic and colloquial. arXiv preprint arXiv:1505.03105.
[21] Al-Zu’bi, S., Al-Ayyoub, M., Jararweh, Y., & Shehab, M. A. (2017). Enhanced 3D segmentation techniques for reconstructed 3D medical volumes: Robust and Accurate Intelligent System. Procedia Computer Science, 113, 531-538.
[22] Jararweh, Y., Alzubi, S., & Hariri, S. (2011, December). An optimal multi-processor allocation algorithm for high performance GPU accelerators. In Applied Electrical Engineering and Computing Technologies (AEECT), 2011 IEEE Jordan Conference on (pp. 1-6). IEEE.
[23] Aldayel, H. K., & Azmi, A. M. (2016). Arabic tweets sentiment analysis–a hybrid scheme. Journal of Information Science, 42(6), 782-797.
[24] Refaee, E., & Rieser, V. (2014, May). An Arabic Twitter Corpus for Subjectivity and Sentiment Analysis. In LREC (pp. 2268-2273).
[25] Elbes, M., & Al-Fuqaha, A. (2013). Design of a social collaboration and precise localization services for the blind and visually impaired. Procedia Computer Science, 21, 282-291.
[26] Duwairi, R. M., & Qarqaz, I. (2014, August). Arabic sentiment analysis using supervised classification. In Future Internet of Things and Cloud (FiCloud), 2014 International Conference on (pp. 579-583). IEEE.
[27] Duwairi, R. M. (2015, April). Sentiment analysis for dialectical Arabic. In Information and Communication Systems (ICICS), 2015 6th International Conference on (pp. 166-170). IEEE.
[28] Al-Fuqaha, A., Elbes, M., & Rayes, A. (2013). An intelligent data fusion technique based on the particle filter to perform precise outdoor localization. International Journal of Pervasive Computing and Communications, 9(2), 163-183.
[29] Al-Fuqaha, A., Kountanis, D., Cooke, S., Elbes, M., & Zhang, J. (2010, December). A genetic approach for trajectory planning in non-autonomous Mobile Ad-Hoc Networks with QoS requirements. In GLOBECOM Workshops (GC Wkshps), 2010 IEEE (pp. 1097-1102). IEEE.
[30] Duwairi, R. M., Marji, R., Sha'ban, N., & Rushaidat, S. (2014, April). Sentiment analysis in Arabic tweets. In Information and Communication Systems (ICICS), 2014 5th International Conference on (pp. 1-6). IEEE.
[31] Al-Rubaiee, H., Qiu, R., Alomar, K., & Li, D. (2016). Sentiment analysis of Arabic tweets in e-learning. Journal of Computer Science.
[32] Alhoori, H., Ray Choudhury, S., Kanan, T., Fox, E., Furuta, R., & Giles, C. L. (2015). On the relationship between open access and altmetrics. iConference 2015 Proceedings.
[33] Al-Horaibi, L., & Khan, M. B. (2016, July). Sentiment analysis of Arabic tweets using text mining techniques. In First International Workshop on Pattern Recognition (Vol. 10011, p. 100111F). International Society for Optics and Photonics.
[34] Abdulla, N. A., Ahmed, N. A., Shehab, M. A., & Al-Ayyoub, M. (2013, December). Arabic sentiment analysis: Lexicon-based and corpus-based. In Applied Electrical Engineering and Computing Technologies (AEECT), 2013 IEEE Jordan Conference on (pp. 1-6). IEEE.
[35] Lee, S., Farag, M., Kanan, T., & Fox, E. A. (2015, June). Read between the lines: A Machine Learning Approach for Disambiguating the Geo-location of Tweets. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 273-274). ACM.
[36] Alwakid, G., Osman, T., & Hughes-Roberts, T. (2017). Challenges in Sentiment Analysis for Arabic Social Networks. Procedia Computer Science, 117, 89-100.
[37] Kanan, T., Zhang, X., Magdy, M., & Fox, E. (2015, June). Big data text summarization for events: A problem based learning course. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 87-90). ACM.
[38] Akaichi, J. (2013, September). Social networks' Facebook' statutes updates mining for sentiment classification. In Social Computing (SocialCom), 2013 International Conference on (pp. 886-891). IEEE.
[39] Alhazmi, M., & Salim, N. (2015). Arabic opinion target extraction from tweets. ARPN Journal of Engineering and Applied Sciences, 10(3), 1023-1026.
[40] Rabie, O., & Sturm, C. (2014). Feel the heat: Emotion detection in Arabic social media content. In The International Conference on Data Mining, Internet Computing, and Big Data (BigData2014) (pp. 37-49). The Society of Digital Information and Wireless Communication.
[41] Salamah, J. B., & Elkhlifi, A. (2014). Microblogging opinion mining approach for Kuwaiti dialect. In The International Conference on Computing Technology and Information Management (ICCTIM2014) (pp. 388-396). The Society of Digital Information and Wireless Communication.
[42] Kanan, T., Kanaan, R., Al-Dabbas, O., Kanaan, G., Al-Dahoud, A., & Fox, E. (2016). Extracting Named Entities Using Named Entity Recognizer for Arabic News Articles. International Journal of Advanced Studies in Computers, Science and Engineering, 5(11), 78-84.
[43] Al-Rubaiee, H., Qiu, R., & Li, D. (2016, March). Identifying Mubasher software products through sentiment analysis of Arabic tweets. In Industrial Informatics and Computer Systems (CIICS), 2016 International Conference on (pp. 1-6). IEEE.
[44] Yang, S., Kanan, T., & Fox, E. (2010, September). Digital library educational module development strategies and sustainable enhancement by the community. In International Conference on Theory and Practice of Digital Libraries (pp. 514-517). Springer, Berlin, Heidelberg.
[45] Baly, R., Badaro, G., El-Khoury, G., Moukalled, R., Aoun, R., Hajj, H., & Shaban, K. (2017). A characterization study of Arabic Twitter data with a benchmarking for state-of-the-art opinion mining models. In Proceedings of the Third Arabic Natural Language Processing Workshop (pp. 110-118).
