
International Journal of Computing and Digital Systems
ISSN (2210-142X)
Int. J. Com. Dig. Sys., No. (Mon-20..)
http://dx.doi.org/10.12785/ijcds/XXXXXX

Overcoming the Challenge of Cyberbullying Detection in Images: A Deep Learning Approach with Image Captioning and OCR Integration

Subbaraju Pericherla1 and E. Ilavarasan2
1,2 Department of Computer Science Engineering, Puducherry Technological University, Puducherry, India
E-mail: raju.pericherla74@gmail.com, eilavarasan@pec.edu

Abstract: Cyberbullying is a serious concern in today's digital age. The rapid increase in the use of social media platforms has made cyberbullying even more prevalent, and its form has evolved with time. In the era of Web 1.0, cyberbullying was limited to text-based data, but with the advent of Web 2.0 and 3.0 it has expanded to images and multi-modal data. Detecting cyberbullying in text-based data is relatively easy, as various natural language processing (NLP) techniques can be used to identify offensive language and sentiment. Detecting cyberbullying in image-based data, however, is a major challenge, as images do not have a clear textual representation. Hence, bullies often try to bypass existing cyberbullying detection techniques by using images and multi-modal data. We propose a deep learning technique named Combinational Network for Bullying Detection (CNBD), which is a combination of two networks: the Bidirectional Encoder for Image Transformers (BEiT) and a Multi-Layer Perceptron (MLP) network. To improve the performance of the CNBD, we supply two additional input factors using Image Captioning (IC) and Optical Character Recognition (OCR), the latter to extract text overlayed on the images. The experimental results show that the two additional factors give the CNBD technique an advantage in terms of accuracy, precision, and recall.

Keywords: Cyberbullying, Social media, Multi-Layer Perceptron, Deep learning, OCR, Image Captioning.

1. INTRODUCTION

Cyberbullying is a growing concern in many countries, including India. Social media platforms like Facebook and Twitter are increasingly being used as avenues for harassment, with a significant portion of such incidents involving young people. Cyberbullying can have serious mental health effects, with some cases even leading to suicide. Given the severity of the problem, there is a need to develop effective approaches to identify cyberbullying in social media messages. According to a study by the Cyber and Information Security Division of India's Ministry of Electronics and Information Technology (MEITY), 14% of all harassment incidents in India occur on Facebook and Twitter, with youngsters being particularly vulnerable. The study also found that women and children are the most frequent victims of cyberbullying in India.

The mental health effects of cyberbullying are well-documented in the literature. Several studies have found that cyberbullying is associated with anxiety, depression, and other mental health issues, with some victims even resorting to suicide. For instance, a study by Hinduja and Patchin [1] found that cyberbullying victims are more likely to experience depression, anxiety, and suicidal ideation than non-victims. Similarly, a study by Kowalski et al. [2] found that cyberbullying is associated with increased suicidal ideation, attempts, and completions. Given the prevalence and severity of cyberbullying, there is a need for effective approaches to identify and intervene in such incidents. One promising approach is the use of natural language processing (NLP) techniques to automatically identify cyberbullying in social media messages. Several studies have explored the use of NLP techniques for this purpose, with promising results. Similarly, a study by Mishna et al. [3] emphasizes the importance of early detection and intervention, as well as providing emotional support and counseling to cyberbullying victims.

In conclusion, cyberbullying is a serious problem in India and other countries, with social media platforms like Facebook and Twitter being significant sources of harassment. The mental health effects of cyberbullying can be severe, with some victims even resorting to suicide. Effective approaches are needed to identify and intervene in cyberbullying incidents, and NLP techniques offer promising possibilities for automating this process.

To the best of our knowledge, cyberbullying detection in images with vision transformer networks has not been applied so far. To achieve this task, the following contributions were made in this work.

• Proposed a CNBD technique for cyberbullying (CB) detection in images.

• Fine-tuned the transformer-based network BEiT to our downstream task using a cyberbullying image dataset.

• Fine-tuned VGG16 and LSTM (Long Short-Term Memory) architectures with the MS-COCO dataset to generate image captions.

• Built a Multi-Layer Perceptron network to improve the performance of the model.

The rest of the paper is organized as follows. Related works on cyberbullying in images are discussed in Section 2. The main contribution of this research, the proposed CNBD technique, is presented in Section 3. The results and discussion are illustrated in Section 4. Finally, conclusions and possible future enhancements of this work are given in Section 5.

2. Related Works

In this section, we present past research works related to cyberbullying on image data. P. K. Roy and Mali [4] proposed a transfer learning-based automated model for cyberbullying detection in images from social media networks. The proposed model extracts hidden features from cyberbullying images. The experiments were carried out with two different datasets of 1000 images and 3000 images. They consider three deep learning models for cyberbullying detection in images: 2-dimensional CNN, Visual Geometry Group 16 (VGG16), and InceptionV3. Among the three models, InceptionV3 performs best in terms of precision (87%).

Homa Hosseinmardi et al. [5] present a novel approach to detecting cyberbullying incidents on the Instagram social network. The authors propose a system that utilizes machine learning algorithms to automatically classify Instagram posts as either cyberbullying or non-cyberbullying. The system uses a combination of text and image features extracted from the posts to train a classification model. The authors evaluate the system's performance on a dataset of 10,000 Instagram posts manually labeled as cyberbullying or non-cyberbullying. The results show that the system achieves a high accuracy (92%) in detecting cyberbullying incidents on Instagram. The paper also provides a detailed analysis of the features that contribute most to the classification performance of the system. The authors believe that their approach can be used to develop effective tools for combating cyberbullying on social media platforms.

Haoti Zhong et al. [6] proposed a content-based approach for detecting cyberbullying on the Instagram social network. The authors developed a system that uses NLP and ML techniques to analyze the textual content of Instagram posts and comments. The system uses a set of features such as sentiment, emotion, and content similarity to identify posts and comments that contain cyberbullying. The authors evaluate the performance of the system on a dataset of 22,899 Instagram posts and comments, manually annotated as cyberbullying or non-cyberbullying. The obtained results suggest the proposed model has a superior accuracy rate of 91.4% in detecting cyberbullying incidents on Instagram.

Elmezain Mahmoud [7] proposed a hybrid classification model based on transformers and SVM to predict whether bullying takes place or not. Using the proposed combined models with the SVM classifier, the authors claim to have achieved an accuracy rate of 96.05%. Furthermore, the proposed model has a 99% classification accuracy for the bullying class and a 93% accuracy for the non-bullying class. The study highlights the negative impact of bullying on students' academic performance and the importance of taking appropriate action against bullying and raising community awareness of the problem. The authors suggested that future work will focus on using Twitter texts with Google Forms questionnaires for classifying cyberbullying and how to stop it.

Rui Cao et al. [8] proposed a model, PromptHate, that uses pre-trained RoBERTa language models and constructs simple prompts to prompt the model for hateful meme classification. To make use of the latent knowledge in the pre-trained language models, the authors present real-world examples. The model's performance is measured against state-of-the-art baselines, and the findings demonstrate that it excels with an AUC of 90.96 on two publicly available datasets.

Agarwal et al. [9] presented two approaches to identifying hate memes using deep learning techniques. The first method incorporates features from several modalities, while the second employs sentiment analysis based on image captioning and text placed on the meme itself. These methods use a trifecta of deep learning algorithms: GloVe, encoder-decoder, and OCR using the Adamax optimizer. The methods were tested on the Facebook Hateful Memes Challenge dataset, which includes over 8,500 meme images, and both show promise on the validation dataset.

K. R. Prajwal et al. [10] proposed a novel method to capture the image content of social media images. They implemented a two-stage approach for image captioning: in stage 1, emotional representations are captured using transfer learning, and in stage 2, facial emotions are extracted using encoders derived from stage 1.

T. Tiwary et al. [11] introduced an Automatic Image Captioning (AIC) technique to help visually impaired consumers identify products in online grocery stores. To solve this problem, they proposed an ECANN (Extended Convolutional Atom Neural Network). For caption extraction from e-commerce image data, the ECANN model combines the LSTM architecture and CNN. On the Grocery Store Dataset, the proposed ECANN model achieved an accuracy of 99.46%, and on the Freiburg Groceries dataset it achieved an accuracy of 99.32%.

Al-Malla et al. [12] proposed an attention-based encoder and decoder for an

image captioning model which is a combination of a CNN and an object detection module (YOLOv4). The proposed model was evaluated on the MS-COCO and Flickr30k datasets.

Efrat Blaier et al. [13] proposed the Caption Enriched Samples (CES) technique and applied it to BERT and RoBERTa models using image captions generated from the images. The authors noticed that CES improves performance by 8.6% and 10%, respectively, on test data. Yi Zhou et al. [14] have propounded a novel method for hateful memes detection using image captioning, OCR, and object detection. They applied a Triplet relation network to the extracted features.

Pengyuan Lyu et al. [15] presented a technique called MaskOCR for text recognition in images. The architecture contains two parts, encoder and decoder transformers, to recognize text representations in images. They evaluated the proposed MaskOCR technique over benchmark datasets of different languages. To deal with texts that can be presented in any direction, J. Chen et al. [16] devised a Transformer-Based Super-Resolution Network (TBSRN) equipped with a Self-Attention Module for sequential information extraction. The authors introduced a Position-Aware Module to highlight the location of each character and a Content-Aware Module to highlight the content of each character, extracting information down to the character level. Deli et al. [17] created a unique framework for precise scene text recognition named the semantic reasoning network (SRN). Within this framework, a global semantic reasoning module (GSRM) is provided to capture global semantic context through multi-way parallel transmission.

Nishant Vishwamitra et al. [18] presented a comprehensive study on the nature of bullying images. They found that cyberbullying in images can be characterized by five visual factors: facial emotion, objects, gestures, body pose, and social factors. They presented a novel methodology to collect cyberbullying images. Initially, they crawled 117,112 images from different online sources using keywords related to cyberbullying. Finally, 19,300 valid images were annotated for experimentation. They proposed four classifiers, namely a Baseline model, a Factors-only model, a Fine-tuned pre-trained model, and a Multi-modal model. Among the four classifiers, the Multi-modal classifier achieves the best accuracy of 93.36%. The major contributions on cyberbullying detection in images from the above literature survey are summarized in Table 1.

We have noticed some limitations in past research work on image-based cyberbullying detection. Most of the deep learning models are trained with datasets of small sizes. The manual annotation of images is a challenging task. The majority of the works consider only image factors such as hand gestures, facial expressions, and objects in the images. We observed that two additional factors can improve the performance of the classification task: we introduce image captioning to capture image content from the input images, and OCR to extract text which is overlayed on the images.

3. Methodology

This section explains the working of the proposed CNBD technique. The proposed Combinational Network for Bullying Detection (CNBD) technique consists of two phases: Phase-1, feature extraction from the input images, and Phase-2, a Multi-Layer Perceptron network for classification of the input image. Figure 1 shows the proposed CNBD technique. In Phase-1, three categories of input features are extracted from the input images: 1) image features extracted directly from the input images, 2) text features obtained using image captioning, and 3) text features from text embedded on the input images.

Figure 1. Architecture of CNBD technique.

A. Phase-1: (Feature Extraction)

At this stage, the pre-processed image data is fed into the Bidirectional Encoder for Image Transformers (BEiT) [19] feature extractor. BEiT is a self-supervised vision model based on masked image modeling (MIM) functionality (illustrated in Figure 2). Initially, an image is divided into a grid of patches, which are mapped to visual tokens. Blocks of these patches are masked randomly and flattened into vectors. Patch embeddings and positional embeddings are learned for the patches and passed through the BEiT encoder. Finally, the image can be reconstructed from the tokens. Once fine-tuned with a cyberbullying image dataset, BEiT can be used for bullying detection in images.

Masked Image Modeling (MIM) is a computer vision technique that involves predicting the values of masked regions in an image based on the surrounding visible regions. The approach follows a self-supervised learning framework, where the deep neural network is trained to predict the masked values given the visible ones. In the training phase, the masked regions are randomly selected from the images (marked as [M] in Figure 2), and the corresponding visible regions are used as inputs to the network.

TABLE I. Summary of major contributions of cyberbullying detection in images.

Authors | Algorithms/Techniques/Models | Limitations | Datasets and Evaluation Metrics
Vijaya Kumar et al. | CNN, ReLU activation function | Custom CNN used for feature extraction | Datasets: NSFW and SFW; Metrics: 82% accuracy
Haoti Zhong et al. | Latent Dirichlet Allocation, pre-trained CNN, SVM classifier | Feature extraction using image captioning | 3000 images collected from Instagram; Metrics: 68.55% accuracy
P. K. Roy and F. U. Mali | 2D-CNN, transfer learning using VGG16, InceptionV3 | Unable to detect textual bullying in images | Two datasets of 1000 and 3000 images; Metrics: 87% f1-score
Homa Hosseinmardi et al. | N-gram, SVM classifier | Confined to Instagram images only | 998 media sessions; Metrics: 87% accuracy, 88% precision, 87% recall
Mahmoud Elmezain et al. | Hybrid model, SVM classifier | Unable to detect textual bullying in images | 1200-image dataset; Metrics: 96.05% accuracy
Nishant Vishwamitra et al. | Multimodal classifier | Low-level image features, manual feature selection | 19,300-image dataset; Metrics: 93.46% accuracy, 94.27% precision, 96.93% recall

Figure 2. Overview of BEiT pre-training architecture [19].

The network is trained to minimize a reconstruction loss between the predicted and ground-truth masked regions. Bidirectional Encoder for Image Transformers (BEiT) is a transformer-based architecture that can be thought of as a modified version of the transformer architecture used in natural language processing tasks, adapted for image recognition. BEiT uses a bidirectional encoder that processes both the image patches and their contextual relationships, producing a sequence of hidden representations that capture both local and global features of the image.
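As an illustration of this feature-extraction step, the following is a minimal sketch of how a 768-dimensional BEiT feature vector can be obtained with the HuggingFace transformers library. The checkpoint name and the mean-pooling of patch embeddings are our assumptions, not details given in the paper.

```python
# Minimal sketch: extracting a 768-d feature vector from an image with BEiT.
# Checkpoint name and mean-pooling are illustrative assumptions.
from PIL import Image
import torch
from transformers import BeitImageProcessor, BeitModel

processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
model = BeitModel.from_pretrained("microsoft/beit-base-patch16-224")
model.eval()

image = Image.open("post.jpg").convert("RGB")          # hypothetical input image
inputs = processor(images=image, return_tensors="pt")  # resizes/normalizes to 224x224

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the patch embeddings from the last (12th) encoder layer
# to obtain a single 768-dimensional image feature vector.
features = outputs.last_hidden_state.mean(dim=1)       # shape: (1, 768)
print(features.shape)
```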
B. Image captioning

Figure 3. Block diagram of VGG16+LSTM for Image Captioning.

To supply additional features from the input images to the neural network, we use image captioning on the input images. We employed the VGG16 [20] and LSTM [21] networks for image captioning; Figure 3 shows the block diagram. The VGG16 model is a deep convolutional neural network that is highly effective in identifying objects within images due to its ability to capture important image features through multiple layers of convolutions and pooling. The model consists of 16 layers, with 13 convolutional layers and 3 fully connected layers, and includes dropout layers to prevent overfitting during training. The model produces a feature vector representation of the image which is then fed into an LSTM layer for sequence modeling. This combination of VGG16 and LSTM layers is commonly used in image captioning tasks, where the model generates natural language descriptions

of the contents of an image. A RoBERTa [22] architecture was employed to generate text features from the image caption.
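The following is a minimal Keras sketch of such a VGG16+LSTM captioning network. The vocabulary size, caption length, projection width, and "merge" wiring are illustrative assumptions, since the paper does not list these hyperparameters.

```python
# Minimal sketch of a VGG16 + LSTM captioning model in Keras.
# VOCAB_SIZE, MAX_LEN, and the merge wiring are hypothetical values.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

VOCAB_SIZE, MAX_LEN = 10000, 34   # hypothetical values

# Encoder branch: VGG16 fc2 features (4096-d), projected to 256-d.
img_in = Input(shape=(4096,))
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Decoder branch: embed the partial caption and run it through an LSTM.
txt_in = Input(shape=(MAX_LEN,))
txt_vec = LSTM(256)(Embedding(VOCAB_SIZE, 256, mask_zero=True)(txt_in))

# Merge image and text states, then predict the next word.
merged = Dense(256, activation="relu")(add([img_vec, txt_vec]))
out = Dense(VOCAB_SIZE, activation="softmax")(merged)

caption_model = Model(inputs=[img_in, txt_in], outputs=out)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")

# VGG16 itself is used offline to produce the 4096-d fc2 features:
vgg = VGG16(weights="imagenet")
feature_extractor = Model(vgg.input, vgg.layers[-2].output)
```

The generated caption string would then be tokenized and encoded with RoBERTa to obtain the caption text features described above.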
C. Text extraction from Images

After that, we employed the Tesseract API [23] to extract text overlayed on images. The extracted text is passed to RoBERTa to generate text features of the extracted text.

Figure 4. Text extraction using OCR.

Figure 4 shows text retrieved from an image using OCR.
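A minimal sketch of this step combines the pytesseract binding for the Tesseract API with a RoBERTa encoder; taking the final hidden state of the start token as the 768-dimensional sentence feature is our assumption.

```python
# Minimal sketch: OCR with Tesseract, then RoBERTa text features.
# Using the <s> token's hidden state as a sentence vector is an assumption.
import pytesseract
from PIL import Image
import torch
from transformers import RobertaTokenizer, RobertaModel

text = pytesseract.image_to_string(Image.open("meme.jpg"))  # text overlayed on the image

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base")
roberta.eval()

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    text_features = roberta(**inputs).last_hidden_state[:, 0, :]  # shape: (1, 768)
```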


D. Phase-2 (Multi-Layer Perceptron)

In Phase-2, we combine all the features extracted from the three architectures (BEiT, IC, and OCR) using late fusion [24], and the fused vector is fed into the MLP network.

Figure 5. Multi-Layer Perceptron Network.

Figure 5 shows the MLP network used in the proposed technique. The MLP network starts with 2034 neurons in the input layer, which are the features generated from BEiT, IC, and OCR. We added hidden dense layers of 500, 100, 50, and 10 neurons. The network starts with random initialization of the weights; we applied Xavier Glorot's [25] initialization for better convergence of the weights. Finally, the output layer has 2 neurons for the classification of bullying and non-bullying images. We used the Leaky Rectified Linear Unit (Leaky-ReLU) as the activation function in the hidden layers; it avoids the dying-ReLU problem in the training process of the neural network. The formula for the Leaky-ReLU activation function is shown in Equation 1:

f(y) = max(0.01·y, y)    (1)

The function returns y if it receives a positive input value, and a very small value, 0.01 times y, when y is negative. At the last layer of the MLP network, we chose the Sigmoid activation function, as this is a binary classification task. The Sigmoid activation function is computed as shown in Equation 2:

σ(y) = 1 / (1 + e^(−y))    (2)

Here y is the input to the sigmoid function and e is Euler's number (approximately 2.718).
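A minimal Keras sketch of this Phase-2 network follows, using the stated layer sizes (2034 → 500 → 100 → 50 → 10 → 2), Glorot initialization, Leaky-ReLU hidden activations, and a sigmoid output. How the three feature vectors are sized so that they sum to 2034 inputs is our assumption.

```python
# Minimal sketch of the late-fusion MLP in Keras. The 768/768/498 split of
# the fused 2034-d input across BEiT, caption, and OCR features is hypothetical.
from tensorflow.keras.layers import Input, Dense, LeakyReLU, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.initializers import GlorotUniform

beit_in = Input(shape=(768,), name="beit_features")
ic_in   = Input(shape=(768,), name="caption_features")
ocr_in  = Input(shape=(498,), name="ocr_features")     # hypothetical split

x = concatenate([beit_in, ic_in, ocr_in])              # late fusion: 2034-d input layer
for units in (500, 100, 50, 10):                       # hidden dense layers
    x = Dense(units, kernel_initializer=GlorotUniform())(x)  # Xavier/Glorot init
    x = LeakyReLU(alpha=0.01)(x)                       # f(y) = max(0.01*y, y)

out = Dense(2, activation="sigmoid")(x)                # bullying vs. non-bullying

mlp = Model([beit_in, ic_in, ocr_in], out)
mlp.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```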
4. Results and Discussion

A. Environment Specifications

Training and fine-tuning deep learning models and pre-trained architectures require very high-end computing power for the parallel processing of tasks. In this regard, we used Google Colab Pro to run the programs, which is a cloud-based platform with Tensor Processing Units (TPUs) and Graphical Processing Units (GPUs). For the local environment, we used the Windows 11 operating system with 16 GB of Random Access Memory. TensorFlow and PyTorch libraries were used with Python programming.

Figure 6. Class-wise data distribution.

Figure 7. Sample of Cyberbullying images

Figure 8. Sample of Non-cyberbullying images

B. Dataset collection

We considered 19,300 images for the experiments. The dataset was prepared by Nishant Vishwamitra et al. [18]. These images were collected from various social media platforms like Facebook, Instagram, and Twitter. In total, 14,581 images were annotated as non-bullying and labelled '0', and 4,719 images were annotated as bullying and labelled '1'. Figure 6 shows the class-wise data distribution. The images contain facial expressions of persons, hand gestures, and objects expressing bullying. Figures 7 and 8 show examples of cyberbullying and non-cyberbullying images in the dataset.

C. Performance evaluation metrics

We considered accuracy, precision, recall, and f1-score to evaluate the proposed CNBD technique.

Accuracy: This is the primary and most widely used metric for classification algorithms. It is the ratio between the number of accurately predicted images and the total number of predictions, as given in Eq. 3:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3)

where TP (True Positive) means the model predicts the image is bullying when the actual image is bullying; TN (True Negative) means the model predicts the image is non-bullying when the actual image is non-bullying; FP (False Positive) means the model predicts the image is bullying when the actual image is non-bullying; and FN (False Negative) means the model predicts the image is non-bullying when the actual image is bullying.

Precision: Precision is defined as the ratio of the number of accurately identified instances of bullying to the total number of identified instances of bullying. It is represented by Eq. 4:

Precision = TP / (TP + FP)    (4)

Recall: Recall measures the number of bullying images retrieved from the entire set of bullying images and is calculated using Eq. 5:

Recall = TP / (TP + FN)    (5)

F1-Score: The F1-score is the harmonic mean of


precision and recall, and is given as follows in Eq. 6:

F1-score = 2 × (precision × recall) / (precision + recall)    (6)
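The four metrics of Eqs. 3-6 can be computed directly with scikit-learn; the labels below are toy values for illustration only.

```python
# Minimal sketch: computing Eqs. 3-6 with scikit-learn on toy predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = bullying, 0 = non-bullying (toy labels)
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```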
First, all the input images are sent to the data preprocessing stage, since the images had different sizes and formats. In the data preprocessing stage we applied normalization, resizing, and rescaling. Finally, all the input images are set to a height and width of 224*224 pixels with RGB channels (Red, Green, and Blue).

Figure 9. Original image is converted into patches.
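A minimal sketch of this preprocessing step follows; the per-channel normalization statistics (ImageNet means and standard deviations) are our assumption, as the paper does not specify them.

```python
# Minimal sketch: resize to 224x224 RGB, rescale to [0, 1], then normalize.
# The ImageNet statistics are an illustrative assumption.
import numpy as np
from PIL import Image

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD  = np.array([0.229, 0.224, 0.225])

def preprocess(path: str) -> np.ndarray:
    img = Image.open(path).convert("RGB").resize((224, 224))
    arr = np.asarray(img, dtype=np.float32) / 255.0   # rescale to [0, 1]
    return (arr - IMAGENET_MEAN) / IMAGENET_STD       # normalize per channel

img_array = preprocess("sample.jpg")                  # shape: (224, 224, 3)
```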
In a similar way, the input image is passed to VGG16 and LSTM for image captioning; Figure 10 shows the captions generated by the network for the given input images. Similarly, the input image is passed to OCR for text extraction; Figure 11 shows the text extracted from the input images using OCR.

Figure 11. Example of text extraction from input images using OCR.

Figure 12. Comparison of f1-score of CNBD technique with existing methods.

Figure 10. Example of caption generated from input images.

A 224*224*3 image is the input to the BEiT feature extractor. The BEiT feature extractor works based on masked image modeling, analogous to the BERT masked language model. Each image is converted into 16*16 patches as shown in Figure 9. Some of the patches are masked randomly, and each patch is flattened into a vector. These patches are passed to the BEiT encoder, and the image is finally reconstructed using tokens. BEiT is designed as a 12-layer neural network for reconstructing the original image. The 12th layer generates a 768-dimensional feature vector which is input into the MLP network.

Figure 13. Accuracy and loss of proposed CNBD with IC+OCR.

The image dataset is divided into three parts: train set (70%), test set (20%), and validation set (10%). We trained the CNBD network with various hyperparameters of the BEiT architecture, such as a learning rate of 2e-5, 30 epochs, and a weight decay of 0.001. The proposed CNBD technique outperforms existing techniques in terms of accuracy, precision, and recall, as shown in Table 2. Figure 12 shows the f1-score of the proposed technique compared with existing models.
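A minimal sketch of the stated split and fine-tuning configuration follows; the use of AdamW, stratified splitting, and the random seed are our assumptions.

```python
# Minimal sketch: 70/20/10 split and the stated hyperparameters
# (learning rate 2e-5, 30 epochs, weight decay 0.001).
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real image paths/labels (14,581 non-bullying, 4,719 bullying).
paths = [f"img_{i}.jpg" for i in range(19300)]
labels = [0] * 14581 + [1] * 4719

train_p, rest_p, train_y, rest_y = train_test_split(
    paths, labels, test_size=0.30, stratify=labels, random_state=42)
test_p, val_p, test_y, val_y = train_test_split(
    rest_p, rest_y, test_size=1/3, stratify=rest_y, random_state=42)  # 20% test, 10% val

model = nn.Linear(768, 2)  # placeholder for the fine-tuned BEiT+MLP network
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.001)
NUM_EPOCHS = 30
```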

TABLE II. Comparison of CNBD technique with existing methods

Classifier Model Accuracy Precision Recall


Baseline Model [18] 77.25% 63.00% 29.68%
Factors only model [18] 82.96% 79.34% 80.84%
Fine-tuned Pre-trained model [18] 88.82% 81.40% 73.70%
Multi-modal [18] 93.96% 94.27% 96.93%
CNBD technique (Proposed) 96.30% 96.16% 96.30%
CNBD + IC 97.50% 97.12% 97.05%
CNBD+IC+OCR 98.23% 98.05% 98.05%

The accuracy and loss curves over the epochs are shown in Figure 13. It can be observed that the loss on the training data is lower than the loss on the test data, as expected, since the test data is not seen during the training phase. It can also be observed that the accuracy shows a positive trend as the training epochs increase. This demonstrates the adaptive nature of the network to our downstream task.

5. Conclusion and Future Enhancements

As the usage of social media platforms continues to grow, so too does the prevalence of negative online behaviors like cyberbullying, online hate speech, and trolling. Consequently, there is a growing need to explore effective ways to detect and address these harmful activities. One important aspect of this is the detection of cyberbullying on social media, which presents a particular challenge due to the diverse forms in which it can manifest, including text, images, and multimedia content. We proposed a technique named CNBD for cyberbullying detection in images. The proposed technique was evaluated using the metrics of accuracy, precision, and recall. The experimental results show that the proposed method with image caption features and OCR text features improves on existing techniques, with an accuracy of 98.23%, a precision of 98.05%, and a recall of 98.05%. In future work, we will consider cyberbullying detection for multimedia data such as text with images and videos, and for regional languages such as Telugu, Tamil, and Hindi.

References

[1] S. Hinduja and J. W. Patchin, "Bullying, cyberbullying, and suicide," Archives of Suicide Research, vol. 14, pp. 206-221, 2010.

[2] R. M. Kowalski, G. W. Giumetti, A. N. Schroeder, and M. R. Lattanner, "Bullying in the digital age: A critical review and meta-analysis of cyberbullying research among youth," Psychological Bulletin, vol. 140, pp. 1073-1137, 2014.

[3] F. Mishna, M. Khoury-Kassabri, and J. Daciuk, "Risk factors for involvement in cyberbullying: Victims, bullies, and bully-victims," Children and Youth Services Review, vol. 70, pp. 274-282, 2016.

[4] P. K. Roy and F. U. Mali, "Cyberbullying detection using deep transfer learning," Complex & Intelligent Systems, vol. 8, pp. 5449-5467, 2022.

[5] H. Hosseinmardi, S. A. Mattson, R. I. Rafiq, R. Han, Q. Lv, and S. Mishra, "Detection of cyberbullying incidents on the Instagram social network," Association for the Advancement of Artificial Intelligence, 2015.

[6] H. Zhong, H. Li, A. Squicciarini, S. Rajtmajer, C. Griffin, D. Miller, and C. Caragea, "Content-driven detection of cyberbullying on the Instagram social network," in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI'16), pp. 3952-3958, 2016.

[7] M. Elmezain, A. Malki, I. Gad, and E.-S. Atlam, "Hybrid deep learning model-based prediction of images related to cyberbullying," International Journal of Applied Mathematics and Computer Science, vol. 32, pp. 324-333, 2022.

[8] R. Cao, R. K.-W. Lee, W.-H. Chong, and J. Jiang, "PromptHate: Prompt-based hateful meme classification with pre-trained language models," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 321-332, 2022.

[9] Aggarwal, T. Sharma, Yadav, Agrawal, Singh, Mishra, and Gritli, "Two-way feature extraction using sequential and multimodal approach for hateful meme classification," IEEE Access, vol. 9, pp. 121962-121973, 2021.

[10] K. R. Prajwal, C. V. Jawahar, and P. Kumaraguru, "Towards increased accessibility of meme images with the help of rich face emotion captions," in Proceedings of the 27th ACM International Conference on Multimedia, pp. 202-210, 2019.

[11] T. Tiwary and R. P. Mahapatra, "An accurate generation of image captions for blind people using extended convolutional atom neural network," Multimedia Tools and Applications, vol. 82, pp. 3801-3830, 2022.

[12] M. Al-Malla, A. Jafar, and N. Ghneim, "Image captioning model using attention and object features to mimic human image understanding," Journal of Big Data, vol. 9, 2022.

[13] E. Blaier, I. Malkiel, and L. Wolf, "Caption enriched samples for improving hateful memes detection," in Conference on Empirical Methods in Natural Language Processing, 2021.

[14] Y. Zhou and Z. Chen, "Multimodal learning for hateful memes


detection,” IEEE International Conference on Multimedia & Expo [25] X. Glorot and Y. Bengio, “Understanding the difficulty of training
Workshops (ICMEW), vol. 8, pp. 1–6, 2020. deep feedforward neural networks,” Proceedings of the thirteenth
international conference on artificial intelligence and statistics,
[15] P. Lyu, C. Zhang, S. Liu, M. Qiao, Y. Xu, L. Wu, K. Yao, J. Han, vol. 8, pp. 249–256, 2010.
E. Ding, and J. Wang, “Maskocr: Text recognition with masked
encoder-decoder pretraining,” ArXiv, vol. abs/2206.00311, 2022.

[16] J. Chen, B. Li, and X. Xue, “Scene text telescope: Text-focused


scene image super-resolution,” in 2021 IEEE/CVF Conference on Subbaraju Pericherla received the B.tech.
Computer Vision and Pattern Recognition (CVPR), vol. 8, pp. and M.Tech. degrees, from Andhra Univ.
12 021–12 030, 2021. in 2006 and 2011, respectively. Currently
working as a assistant professor in the Dept.
[17] Deli, Xuan, C. Zhang, Tao, Junyu, Jingtuo, and Errui, “Towards of Information Technology, SRKREC Bhi-
accurate scene text recognition with semantic reasoning networks.” mavaram., and His research interest includes
IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 12 110–12 119, 2020. Text Processing, Machine and Deep Learn-
ing. He is a member of AICTE, ISTE, CSI,
[18] V. Nishant, H. Hu, F. Luo, and L. Cheng, “Towards understand- IEEE in India.
ing and detecting cyberbullying in real-world images.” 19th IEEE
International Conference on Machine Learning and Applications
(ICMLA), 2021.

[19] Bao, S. LiDong, and FuruWei, “Beit: Bert pre-training of image


transformers,” ICLR, 2022.
E Ilavarasan received the post graduate
[20] K. Simonyan and A. Zisserman, “Very deep convolutional networks degree MTech in Computer Science and
for large-scale image recognition,” 2015 IEEE International Confer- Engineering from Pondicherry University,
ence on Computer Vision and Pattern Recognition (CVPR), vol. 8, Puducherry, India, in 1997 and the PhD
pp. 1–14, 2015. in Computer Science and Engineering from
the same University, in 2008. He is cur-
[21] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”,” rently working as a Professor in the De-
Neural Computation, vol. 9, pp. 1735–1780, 1997.
partment of Computer Science and Engi-
[22] Z. Liu, W. Lin, Y. Shi, and J. Zhao, “Roberta: A robustly optimized
neering at Pondicherry Engineering College,
bert pretraining approach,” ICLR, 2020. Puducherry, India. He is instrumental in set-
ting up the Microprocessor laboratory and the Embedded System
[23] D. Doermann and K. Ntirogiannis, “Document image analysis Laboratory funded by MODROBSAICTE. Presently he is super-
and recognition: Advances and trends,” Proceedings of the 11th vising five research scholars. His research interests include parallel
International Conference on Document Analysis and Recognition and distributed systems, operating systems security, web services
(ICDAR),, pp. 8–12, 2020.
computing and embedded systems. He has organized National and
International conferences with faculty members working in the
[24] F. Guo, C. Huang, X. Qian, and J. Liu, “A late-fusion deep
learning framework for pedestrian detection,” Proceedings of the Pondicherry Engineering College. He has published more than 50
IEEE Conference on Computer Vision and Pattern Recognition research papers in the international journals and conferences. He
Workshops (CVPRW), vol. 8, pp. 14–22, 2016. had more than 25 years of experience in teaching.

