Cyberbullying Detection Using Sequential Models
Lakshmi Amrutha Valli P1, G Neha Pranavi2, Prathap Adimoolam3, Chinta Venkata Murali Krishna4, Chaitanya Jannu5, Veeraswamy Parisae6, Yalamanchili Arpitha7
1,5,6,7Department of ECE, 2,4Department of CSE-DS, 3Department of CSE-AIML, NRI Institute of Technology, Agiripalli, India.
1lakshmiamruthavallipamidi@gmail.com, 2neha.gajjala@gmail.com, 3adimoolam.prathap@gmail.com, 4muralikrishna_chinta2007@yahoo.co.in, 5pvspj3@gmail.com, 6veera2u@gmail.com
https://scholar.google.com/citations?user=q4wbCqsAAAAJ&hl=en&oi=sra
Abstract
Bullying refers to unwanted behaviour that harms another person physically, mentally, or socially. Cyberbullying, also known as online bullying, includes textual or visual bullying. There is a pressing need to detect cyberbullying in today's world, as its prevalence is growing and leading to mental health issues. Cyberbullying has previously been detected using traditional machine learning algorithms. However, recent studies show that deep learning outperforms traditional machine learning methods in detecting cyberbullying for several reasons, such as handling large amounts of data, effectively categorising text and images, and automatically extracting features by means of hidden layers. This study examines the surveys that have already been conducted and points out gaps in the research. We propose a deep learning-based hybrid architecture, LSTM-BiLSTM-GRU with an attention mechanism, for cyberbullying detection and compare it with other models, covering various deep learning-based frameworks and data representation strategies. Finally, the method was assessed using popular performance metrics such as F1-score, recall, accuracy, and precision. Compared with the state of the art, the suggested method achieved superior performance with an accuracy of 93.69%, which is encouraging. Current DL-based methods for detecting cyberbullying have been critically examined, and their noteworthy contributions and suggested avenues for further research have been noted.
Keywords: Cyber Bullying, Recurrent Neural Network, Long Short-Term Memory, Gated
Recurrent Unit.
1. Introduction
Cyberbullying, often known as cyber harassment, is bullying that takes place online [1]. These days, we see several types of cyberbullying; writing offensive language and disseminating offensive images, such as memes, are two examples. Social networking sites like Facebook, Instagram, Twitter, and others have made it simpler for us to connect with people, generate content, and communicate. Bullying on many social media platforms, however, can result from the unfiltered exchange of message content and a lack of privacy protection [2]. As of April 2024, data on monthly active users show the most popular social networks worldwide. These channels have become essential for daily communication, particularly among younger generations [3]. However, their broad use has resulted in more incidents of cyberbullying, and these data demonstrate its prevalence across multiple platforms. Furthermore, there have been worrisome reports of cyberbullying contributing to teenage suicides, emphasising the critical need for better detection and prevention techniques. Cyberbullying can take many different forms, such as flaming, making hateful statements, sending inappropriate emails, publishing degrading images, making cruel comments, and pestering people through blogs and social media. Bullies can cause serious problems like depression, which can even lead to suicide [4,5].
It is critical to identify cyberbullying to prevent this dangerous issue. Unlike approaches that depend on the position of salient features within a phrase, an RNN can automatically learn features from the input text. In Eq. (1), Xi ∈ Rk denotes the word vector in k dimensions corresponding to the i-th word within a sentence of length n [26].
The sigmoid and hyperbolic tangent (tanh) functions generate feature maps from the input sequence:
Cj = f(W · Xj:j+h−1 + BC)                                                                     (2)
In Eq. (2), Cj represents the j-th feature, Xj:j+h−1 is the window of h consecutive word vectors, BC is the bias term, and f is the activation function.
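To make the notation concrete, the toy NumPy sketch below builds a random sentence matrix of n word vectors in Rk (Eq. (1)) and computes one feature per sliding window of h words (Eq. (2)); the dimensions, tanh activation, and random weights are illustrative assumptions, not settings used in this study.

```python
import numpy as np

n, k, h = 6, 5, 3                        # sentence length, embedding size, window size (assumed)
rng = np.random.default_rng(42)
X = rng.standard_normal((n, k))          # X_i in R^k for each of the n words, Eq. (1)
W = rng.standard_normal(h * k)           # filter weights applied to each word window
B_C = 0.0                                # bias term

# One feature C_j per window X_{j:j+h-1}, Eq. (2), with tanh as the activation f
C = np.array([np.tanh(W @ X[j:j + h].ravel() + B_C) for j in range(n - h + 1)])
print(C.shape)                           # (n - h + 1,) feature values
```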
LSTMs solve the vanishing gradient problem of traditional RNNs. They are well suited to applications like text classification and predictive modelling because of their large memory capacity. Such a network selectively determines which information must be passed on to subsequent neurons and which may be forgotten or omitted. These networks are trained with backpropagation through a gated mechanism. The following equations describe the input gate (IGt), output gate (OGt), and forget gate (FGt) that are fundamental parts of an LSTM network [27].
The forget gate controls how cell-state information is filtered. This stage removes any information that is unnecessary or less critical for the LSTM, which is essential for optimising the network's output. Ht−1 refers to the hidden state produced by the preceding cell (the last cell's output), and Xt is the input at the current time step. The inputs are multiplied by weight matrices and a bias is added. The result is passed through the sigmoid function, producing a vector with one value, between zero and one, for each element of the cell state. A value of '0' means the forget gate discards that piece of information from the cell state, whereas a '1' means the forget gate retains it entirely. Finally, this vector is multiplied element-wise with the cell state. Bidirectional LSTM (Bi-LSTM) is a reliable extension in which processing moves both forward and backward: a Bi-LSTM processes inputs in order and in reverse. Architecturally, it combines two LSTMs running in opposite directions. This enables the network to carry information from the past to the future through the forward layer and from the future to the past through the backward LSTM layer.
FGt = σ(WFG · [Ht−1, Xt] + BFG)                                                               (3)
IGt = σ(WIG · [Ht−1, Xt] + BIG)                                                               (4)
CGt = tanh(WCG · [Ht−1, Xt] + BCG)                                                            (5)
CSt = FGt * CSt−1 + IGt * CGt                                                                 (6)
OGt = σ(WOG · [Ht−1, Xt] + BOG)                                                               (7)
Ht = OGt * tanh(CSt)                                                                          (8)
where FG is the forget gate, IG the input gate, CG the control gate, and OG the output gate.
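As a worked illustration of Eqs. (3)-(8), the minimal NumPy sketch below performs a single LSTM time step; the hidden and embedding sizes, random weights, and zero initial states are assumptions for demonstration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, cs_prev, W, B):
    """One LSTM time step following Eqs. (3)-(8)."""
    z = np.concatenate([h_prev, x_t])        # [H_{t-1}, X_t]
    fg = sigmoid(W["FG"] @ z + B["FG"])      # forget gate, Eq. (3)
    ig = sigmoid(W["IG"] @ z + B["IG"])      # input gate, Eq. (4)
    cg = np.tanh(W["CG"] @ z + B["CG"])      # control (candidate) gate, Eq. (5)
    cs = fg * cs_prev + ig * cg              # new cell state, Eq. (6)
    og = sigmoid(W["OG"] @ z + B["OG"])      # output gate, Eq. (7)
    h = og * np.tanh(cs)                     # new hidden state, Eq. (8)
    return h, cs

k, d = 8, 4                                  # embedding size and hidden size (assumed)
rng = np.random.default_rng(0)
W = {g: rng.standard_normal((d, d + k)) for g in ("FG", "IG", "CG", "OG")}
B = {g: np.zeros(d) for g in ("FG", "IG", "CG", "OG")}
h, cs = lstm_step(rng.standard_normal(k), np.zeros(d), np.zeros(d), W, B)
print(h.shape, cs.shape)                     # (4,) (4,)
```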
The forward layer (h→) computes over the given input sequence, whereas the backward layer (h←) computes over the reversed sequence. The output of this model is given by:
Yt = yt−n, ..., yt−1, yt, yt+1, ..., yt+n−1, yt+n                                             (9)
where yt = σ(h→, h←) and σ is the concatenation operator.
                                    Fig. 3: Bi-LSTM Architecture [28]
Gated Recurrent Units (GRUs) are another type of RNN that uses a gated approach to deal with the vanishing and exploding gradient problems. They outperform typical RNNs in terms of test accuracy due to their capacity for retaining long-term dependencies. GRUs are a streamlined variant of LSTM networks that can update or reset their memory cells. The update gate combines the roles of the input and forget gates seen in LSTMs, and a reset gate refreshes the memory contents. GRUs are lightweight and require fewer parameters than LSTMs. For an input vector Xt at time t, the update gate, reset gate, hidden state, and candidate hidden state are given by:
UGt = σ(WUG · [Ht−1, Xt])                                                                     (10)
RGt = σ(WRG · [Ht−1, Xt])                                                                     (11)
Ht = (1 − UGt) * Ht−1 + UGt * H′t                                                             (12)
H′t = tanh(WH · [RGt * Ht−1, Xt])                                                             (13)
In the domain of text-based cyberbullying categorisation, a hybrid approach that combines the benefits of LSTM, Bi-LSTM, and GRU can address their individual shortcomings while improving performance. This hybrid technique uses LSTM's pattern recognition capabilities, Bi-LSTM's comprehension of long-range dependencies, and GRU's memory efficiency to offer a comprehensive solution for effectively detecting possible cyberbullying in text. The GRU modules analyse the input data at each time step, generating hidden states that encode the input pattern. These hidden states are then passed to a fully connected layer, which produces predictions based on the learnt weights. The predicted values are compared with the true target labels, and any errors are backpropagated to update the weights, increasing accuracy over time.
Fig. 4: Proposed Hybrid Framework
The Bi-LSTM layer, composed of forward- and backward-looking LSTM layers,
analyses the RNN-generated feature vector sequences, capturing the input data's long-term
relationships. A Bi-LSTM, unlike a conventional LSTM, analyses the input sequence both
forwards and backwards, allowing it to gather information from previous and future time
steps. The Bi-LSTM's hidden states are then passed through a fully connected layer, which generates final predictions based on the learnt weights. The predictions are compared with the true target labels, and any errors are backpropagated to update the weights, increasing accuracy over time. In this work, we suggest combining the Bi-LSTM network with a stacked attention model. As the name suggests, the attention model [29] focusses on words that are more important in the document. Figure 5 shows the proposed design, which involves processing the input through the Bi-LSTM network, passing it through an attention layer with numerous neurons, and finally to the GRU layers. Understanding the context and improving the final output allows the system to encode only selectively valuable information. This enables the model to work properly with suitably large input texts. We use the multi-head attention method introduced in [29]. The model assigns non-zero weights to every input item, and we use the scaled dot product as the similarity function. To calculate the attention score for a query, the key-value pairs (K) are compared with the query to determine their similarities, as expressed in Eq. (15). To determine the final attention, given by Eq. (16), the weights are normalised using a SoftMax function, with dm serving as the key dimension.
Attention(Input, Set of Keys) = Σj=1..lx Similarity(Input, Keyj) × Valuej                     (14)
Similarity(Input, Keyj) = (Input · Keyj) / √dm                                                (15)
Attention(I, K, V) = SoftMax(I K^T / √dm) V                                                   (16)
Where, Input = The relevant information required by the element.
Set of Keys = The complete set of keys and values.
Key = Elements in the complete set that are compared to the input.
Value = The information linked with each key contributes to the result.
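A minimal Keras sketch of the hybrid stack described above (embedding → LSTM → Bi-LSTM → attention → GRU → dense classifier), assuming the TensorFlow/Keras toolchain listed in the experimental setup. The layer sizes, number of attention heads, vocabulary size, and sequence length are assumptions rather than the exact hyper-parameters used here; Keras's MultiHeadAttention layer internally applies the scaled dot-product attention of Eq. (16).

```python
from tensorflow.keras import layers, Model

VOCAB_SIZE, MAX_LEN, EMB_DIM = 20000, 100, 100             # assumed values

inputs = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)           # pre-trained GloVe weights could be supplied here
x = layers.LSTM(64, return_sequences=True)(x)               # long-term dependencies
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)   # past and future context
x = layers.MultiHeadAttention(num_heads=4, key_dim=32)(query=x, value=x, key=x)  # Eq. (16)
x = layers.GRU(32)(x)                                       # lightweight gated summary of the attended sequence
outputs = layers.Dense(1, activation="sigmoid")(x)          # bullying vs. non-bullying
model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```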
4. Experimental Setup
For effective model training and evaluation, this work makes use of Google Colab and Jupyter Notebook, utilising Google's virtual GPU. The implementation used Python 3.12.4 on a Windows 11 computer. The hardware used for the experiments had a Core i7 processor, 10–15 GB of available storage, and 32 GB of RAM. A stable internet connection was necessary for API queries to run well, especially when accessing external data. A variety of Python frameworks and libraries were used in the project: Scikit-learn for machine learning tools, Pandas and NumPy for data manipulation, Matplotlib and Seaborn for data visualisation, and TensorFlow and Keras for model construction. GloVe embeddings were employed to improve the semantic comprehension of the text data, while NLTK supported natural language processing tasks.
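The sketch below illustrates one way this preprocessing stack could fit together (NLTK stop-word removal, Keras tokenisation and padding, and a GloVe embedding matrix); the example comments, the GloVe file name, and the vocabulary and sequence limits are assumptions.

```python
import numpy as np
import nltk
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

nltk.download("stopwords", quiet=True)
stops = set(stopwords.words("english"))

comments = ["You are awesome", "Nobody likes you, just leave"]      # placeholder comments
cleaned = [" ".join(w for w in c.lower().split() if w not in stops) for c in comments]

tokenizer = Tokenizer(num_words=20000)                              # vocabulary limit (assumed)
tokenizer.fit_on_texts(cleaned)
X = pad_sequences(tokenizer.texts_to_sequences(cleaned), maxlen=100)

# Build an embedding matrix from pre-trained GloVe vectors (file name and dimension assumed)
emb_dim, emb_index = 100, {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        emb_index[parts[0]] = np.asarray(parts[1:], dtype="float32")

emb_matrix = np.zeros((len(tokenizer.word_index) + 1, emb_dim))
for word, i in tokenizer.word_index.items():
    vec = emb_index.get(word)
    if vec is not None:
        emb_matrix[i] = vec
```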
4.1. Classification Metrics
Accuracy: It evaluates the validity of the framework's predictions and is the most basic metric. It is calculated as the number of correct predictions divided by the total number of predictions.
Precision: Determines the proportion of all positive predictions that are true positives. "How many of all the events that were predicted as positive were actually positive?" is the question it addresses.
Recall: Recall (sensitivity) measures the percentage of actual positive cases that were identified. It answers the question, "How many of the total actual positive items were correctly predicted?"
F1-Score: The F1-Score is a balanced metric that considers both false positives and false negatives. It is calculated as the harmonic mean of precision and recall.
AUC-ROC: AUC-ROC is the area under the receiver operating characteristic curve. It represents the model's ability to distinguish between classes; a higher AUC-ROC indicates better performance.
Confusion Matrix: A confusion matrix is an array of numbers that contrasts the model's predictions with the actual labels, showing the counts of true positives, true negatives, false positives, and false negatives. Compared to accuracy alone, it offers a more thorough understanding of the model's performance.
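These metrics map directly onto scikit-learn's helpers, as the short sketch below shows; the labels and predicted probabilities are placeholder values.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                      # actual labels (1 = bullying)
y_prob = [0.9, 0.2, 0.7, 0.6, 0.4, 0.8, 0.1, 0.3]      # predicted probabilities
y_pred = [int(p >= 0.5) for p in y_prob]               # thresholded predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```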
4.2. Dataset Description
The dataset used for this study consists of 18,148 comments collected from various internet venues. Based on sentiment analysis, 11,661 comments were labelled as negative, while 6,487 comments were classified as positive.
The dataset was collected from two main sources. YouTube web-scraped metadata: comments were retrieved through the YouTube Data API v3 with a valid API key. Additional data was integrated from a freely accessible Kaggle dataset labelled "Cyberbullying Classification" [31]. The dataset includes comments from Facebook, Twitter, and YouTube. The sentiment classification, which made it possible to separate comments into binary categories, eased the evaluation of online interactions and behaviour, especially in the context of cyberbullying and online aggression. Applying sentiment analysis to the web-scraped data, the comments were classified as either positive or negative.
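A hedged sketch of how the two sources might be merged into a single binary-labelled frame with Pandas; the file names and column names here are hypothetical, not the actual files used in this study.

```python
import pandas as pd

# File and column names below are hypothetical placeholders.
youtube = pd.read_csv("youtube_comments.csv")                 # scraped via the YouTube Data API v3
kaggle = pd.read_csv("cyberbullying_classification.csv")      # Kaggle dataset [31]

df = pd.concat([youtube[["comment", "sentiment"]],
                kaggle[["comment", "sentiment"]]], ignore_index=True)
df["label"] = (df["sentiment"] == "negative").astype(int)     # 1 = negative/bullying, 0 = positive
print(df["label"].value_counts())                             # the study reports 11,661 negative and 6,487 positive
```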
5. Results and Discussion
This study offered techniques for spotting online cyberbullying. The proposed pipeline comprises several stages: effective text preparation to clean the comments, feature extraction to convert the text into numerical data, and the application of several techniques to classify and detect cyberbullying.
The model's overall performance shows good generalisation ability and efficient learning. Both the training and validation accuracy curves show a steady increasing trend during the training process, suggesting that the model is gradually picking up the underlying patterns in the data. The accuracy stabilises at about 93.69% after 15 epochs, indicating that the framework has converged.
Crucially, neither overfitting nor underfitting is clearly visible. Overfitting usually occurs when training accuracy keeps rising while validation accuracy falls; in this instance, both measures rise together and stay very similar. This balance illustrates how well the model generalises to new data. The consistently good performance on both sets also confirms the absence of underfitting, showing that the framework is suitably complex and well optimised to capture the required characteristics from the input data.
[Figure: Classification metrics (accuracy, F1-score, recall, and precision) for RNN, LSTM, GRU, Bi-LSTM, traditional NN, logistic regression, support vector machine, naïve Bayes, random forest, decision tree, K-nearest neighbours (KNN), and the proposed model.]
           Fig. 7: Training and Validation Accuracy & Loss curve for LSTM Model
This finding is further supported by the training and validation loss curves. A well-converged model is characterised by a progressive flattening after a quick reduction in the early epochs, as seen in both curves. The validation accuracy and loss show slight variations at later epochs (e.g., epochs 7–10), but both are within expected ranges and are most likely the result of typical variation from mini-batch updates during training.
These data lead to the conclusion that the ideal training period for this model is 15 epochs.
Beyond this point, continuing training could result in diminishing returns and raise the
possibility of overfitting without appreciable performance improvement. The model's
robustness, stability, and suitability are confirmed by the training behaviour, as reflected in the accuracy and loss measures.
     Fig. 8: Training and Validation Accuracy & Loss curve for GRU Model
Fig. 9: Training and Validation Accuracy & Loss curve for Bi-LSTM Model
Fig. 10: Training and Validation Accuracy & Loss curve for Traditional NN Model
  Fig. 11: Training and Validation Accuracy & Loss curve for proposed model
Fig. 12: Confusion matrix for Logistic Regression and Support Vector Machine
               Fig. 13: Confusion matrix for Naïve Bayes and Random Forest
Fig. 14: Confusion matrix for Decision Tree and K-Nearest Neighbours (KNN)

Table 1. Comparative Analysis of the contemporary methods and the proposed approach

Authors                      Method                                   Accuracy (%)   F1-Score
Balakrishnan et al. [17]     Machine learning techniques              91.88          N/A
Al-Khasawneh et al. [25]     Multi-modal approach                     92.1           86.4
F. Razi and N. Ejaz [24]     m-BERT and MuRIL                         N/A            92
Proposed                     LSTM-BiLSTM-GRU-Attention mechanism      93.69          95.09
6. Conclusion
With the aim of improving the effectiveness of cyberbullying detection systems, we introduced a hybrid deep learning architecture in this paper that combines LSTM, Bi-LSTM, and GRU with an attention mechanism. We were able to create a strong and efficient model by utilising the advantages of each element: the attention mechanism's focus on pertinent characteristics, the computational efficiency of GRU, the bidirectional context awareness of Bi-LSTM, and the LSTM's capacity to capture long-term dependencies. A more
sophisticated comprehension of the patterns of speech and contextual clues commonly present
in cyberbullying content is made possible by the merging of these layers. When compared to
standalone or conventional models, experimental results show that using this hybrid
architecture greatly increases detection accuracy. This method helps create safer and more
welcoming online environments in addition to advancing the area of automated online
harassment identification. For wider applicability, future research may examine further
optimization strategies, cross-lingual abilities, and real-time deployment.
Data Availability:
        The dataset used in this study is available at Kaggle Community [31]. The description
of the dataset is also mentioned in the paper.
Conflicts of Interest:
        The authors declare that there is no conflict of interest regarding the publication of
this paper.
References
[1] Feinberg, T.; Robey, N. Cyberbullying. Educ. Dig. 2009, 74, 26.
[2] Nikolaou, D. Does cyberbullying impact youth suicidal behaviors? J. Health Econ. 2017, 56, 30–46.
[3] Statista. Most popular social networks worldwide as of January 2025, ranked by number of monthly active
users, 2025.
[4] Brailovskaia, J.; Teismann, T.; Margraf, J. Cyberbullying, positive mental health and suicide
ideation/behavior. Psychiatry Res.2018, 267, 240–242.
[5] Lu, N.; Wu, G.; Zhang, Z.; Zheng, Y.; Ren, Y.; Choo, K.K.R. Cyberbullying detection in social media text
based on character-level convolutional neural network with shortcuts. Concurr. Comput. Pract. Exp. 2020, 32,
e5627.
[6] Arif, Muhammad. "A systematic review of machine learning algorithms in cyberbullying detection: future
directions and challenges." Journal of Information Security and Cybercrimes Research 4.1 (2021): 01-26.
[7] Hasan, Md Tarek, et al. "A review on deep-learning-based cyberbullying detection." Future Internet 15.5
(2023): 179.
[8] Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning
phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014,
arXiv:1406.1078.
[9] Fang, Y., Yang, S., Zhao, B., & Huang, C. (2021). Cyberbullying detection in social networks using bi-gru
with self-attention mechanism. Information, 12(4), 171.
[10] Gada, Mihir, Kaustubh Damania, and Smita Sankhe. "Cyberbullying Detection using LSTM-CNN
architecture and its applications." 2021 International Conference on Computer Communication and Informatics
(ICCCI). IEEE, 2021.
[11] Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv 2015,
arXiv:1508.01991.
[12] Chen, Hsin-Yu, and Cheng-Te Li. "HENIN: Learning heterogeneous neural interaction networks for
explainable cyberbullying detection on social media." arXiv preprint arXiv:2010.04576 (2020).
[13] Caroppo, A.; Leone, A.; Siciliano, P. Comparison between deep learning models and traditional machine
learning approaches for facial expression recognition in ageing adults. J. Comput. Sci. Technol. 2020, 35, 1127–
1146.
[14] Yilmaz, A.; Demircali, A.A.; Kocaman, S.; Uvet, H. Comparison of Deep Learning and Traditional
Machine Learning Techniques for Classification of Pap Smear Images. arXiv 2020, arXiv:2009.06366.
[15] Finizola, J.S.; Targino, J.M.; Teodoro, F.G.S.; Moraes Lima, C.A.d. A comparative study between deep
learning and traditional machine learning techniques for facial biometric recognition. In Proceedings of the
Ibero-American Conference on Artificial Intelligence, Trujillo, Peru, 13–16 November 2018; Springer:
Berlin/Heidelberg, Germany, 2018; pp. 217–228.
[16] Picon, A.; Alvarez-Gila, A.; Irusta, U.; Echazarra, J. Why deep learning performs better than classical
machine learning? Dyna Ing.Ind. 2020, 95, 119–122.
[17] Balakrishnan, Vimala, Shahzaib Khan, and Hamid R. Arabnia. "Improving cyberbullying detection using
Twitter users’ psychological features and machine learning." Computers & Security 90 (2020): 101710.
[18] Perera, Andrea, and Pumudu Fernando. "Cyberbullying detection system on social media using supervised
machine learning." Procedia Computer Science 239 (2024): 506-516.
[19] Chatzakou, D., Leontiadis, I., Blackburn, J., Cristofaro, E. D., Stringhini, G., Vakali, A., & Kourtellis, N.
(2019). Detecting cyberbullying and cyberaggression in social media. ACM Transactions on the Web
(TWEB), 13(3), 1-51.
[20] Fahim, Kaji Mehedi Hasan, et al. Deep learning approaches for Bengali cyberbullying detection on social
media: a comparative study of BiLSTM, BiGRU and BERT models. Diss. Brac University, 2023.
[21] Batani, John, et al. "A review of deep learning models for detecting cyberbullying on social media
networks." Computer Science On-line Conference. Cham: Springer International Publishing, 2022.
[22] López-Vizcaíno, Manuel F., et al. "Early detection of cyberbullying on social media networks." Future
Generation Computer Systems 118 (2021): 219-229.
[23] Xingyi, Guo, and H. Adnan. "Potential cyberbullying detection in social media platforms based on a multi-
task learning framework." International Journal of Data and Network Science 8.1 (2024): 25-34.
[24] F. Razi and N. Ejaz, "Multilingual Detection of Cyberbullying in Mixed Urdu, Roman Urdu, and English
Social Media Conversations," in IEEE Access, vol. 12, pp. 105201-105210, 2024, doi:
10.1109/ACCESS.2024.3432908.
[25] Al-Khasawneh, Mahmoud Ahmad, et al. "Towards Multi-Modal Approach for Identification and Detection
of Cyberbullying in Social Networks." IEEE Access (2024).
[26] Ombabi, A. H., Ouarda, W., & Alimi, A. M. (2020). Deep learning CNN–LSTM framework for Arabic
sentiment analysis using textual information shared in social networks. Social Network Analysis and Mining, 10,
1-13.
[27] Waqas, Muhammad, and Usa Wannasingha Humphries. "A critical review of RNN and LSTM variants in
hydrological time series predictions." MethodsX (2024): 102946.
[28] https://dagshub.com/blog/rnn-lstm-bidirectional-lstm/
[29] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I.
Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems
(NIPS 2017), Long Beach, CA, USA,4–9 December 2017.
[30] Rizwan, Hammad, Muhammad Haroon Shakeel, and Asim Karim. "Hate-speech and offensive language
detection in roman Urdu." Proceedings of the 2020 conference on empirical methods in natural language
processing (EMNLP). 2020.
[31] https://www.kaggle.com/datasets/shauryapanpalia/cyberbullying-classification