
International Journal of All Research Education and Scientific Methods (IJARESM),

ISSN: 2455-6211, Volume 12, Issue 4, April-2024, Available online at: www.ijaresm.com

Cyberbullying Detection on Social Media using Machine Learning
Roshni Kumari S¹, Sindhu B², Muskan³, Anjali Jadhav⁴, Venkatesh G⁵
¹,²,³,⁴ UG Student, Information Science and Engineering, MVJ College of Engineering, Bengaluru, India
⁵ Assistant Professor, Information Science and Engineering, MVJ College of Engineering, Bengaluru, India

------------------------------------------------------------------****************-----------------------------------------------------------

ABSTRACT

Cyberbullying is a major problem on the internet that affects teenagers as well as adults. It has led to harmful
outcomes such as depression and suicide. Social media is a platform where many young people are bullied, and as
social networking sites grow, cyberbullying increases day by day. By identifying word similarities in the tweets
made by bullies, machine learning can be used to build a model that automatically detects bullying on social
media. Many social media bullying detection techniques have been implemented, but most of them are text based.
The goal of this work is to show the implementation of software that detects bullying tweets and similar content.
A machine learning model is proposed to detect and prevent bullying. Two classifiers, SVM (Support Vector Machine)
and Random Forest Classifier, are used for training and testing on social media bullying content. SVM and Random
Forest Classifier were able to detect true positives with 91.06% and 76.70% accuracy respectively. SVM outperforms
the Random Forest Classifier on the dataset. For tweet data the model provides accuracies above 90%, and for
Wikipedia data it gives accuracies above 80%.

Keywords: Cyberbullying, Hate speech, Personal attack, Machine learning, Twitter, Wikipedia

INTRODUCTION

There is a possibility of unintentional cyberbullying, as thoughtless actions without awareness of the consequences can hurt
the people involved. The abuser often fails to see these reactions and is unaware of the scale of the actions. Given the
consequences of cyberbullying on its victims, it is imperative to find the appropriate actions to detect and prevent it.
Machine learning is one of the successful approaches that learns from data and creates a model that automatically classifies
appropriate actions.

Machine learning can be useful to detect language patterns of bullies and thus can generate a pattern to detect acts of
cyberbullying. A recent study conducted by Microsoft Corporation to understand the spread of cyberbullying globally
showed that India ranks 3rd in terms of cyberbullying after China and Singapore. According to recent studies, 52% of
young people in India have been victims of cyberbullying in the past and around 38% of them have been bullied.
Cyberbullying is basically of two types, one that contains abusive language and one that embarrasses the intended target but
does not use swear words. Posts containing abusive content or profanity are more likely to be labelled as “articles”.

For today's younger generation, "Gay", "Bitch" and "Slag" are reportedly the most commonly used abusive terms in
schools. India has many cases of bullying. 79% of Indians are aware of and concerned about cyberbullying, compared to
54% globally. 53% of Indians have been bullied, compared to the world average of 37%. On top of that, 50% of Indians
have at some point engaged in cyberbullying, while globally only 24% of the population has been involved in similar
incidents. In contrast, 63% of Indians are educated and 76% of organizations have a formal policy on cyberbullying,
compared with global averages of 23% and 37%.

The purpose of this article is to analyze and evaluate communication between two specific individuals or with an
anonymous person. A machine learning model is proposed to detect and prevent harassment on chat interfaces. Two
classifiers, SVM (Support Vector Machine) and Random Forest Classifier, are used to train and test on social media
bullying content. Random Forest Classifier and SVM were able to detect true positives with 70% and 50% accuracy,
respectively.


PROBLEM STATEMENT

A. Existing System

Cyberbullying is a widespread issue where technology is used to harass, intimidate, or humiliate others. These online
attacks can sometimes lead to real-life threats and even suicide. Researchers are developing methods to identify
cyberbullies. One approach focuses on the idea that anonymous bullies (trolls) likely have hidden profiles they use to
monitor their fake personas. This method uses machine learning to analyse connections between profiles and identify the
author behind a fake account. Another technique involves collaboration between multiple detection systems. These systems,
even if they use different algorithms or data, can share information to improve overall detection accuracy. Other
researchers have explored techniques like B-LSTMs with a focus on attention mechanisms and K-Nearest Neighbors (KNN) with
new embedding methods. These approaches have shown promising results, with some achieving accuracy rates as high as 93%.

Current detection methods sometimes struggle with catching all cyberbullying incidents. This means they might miss
some instances, leaving victims vulnerable. Existing systems often focus on identifying patterns already present in the data.
This can be a weakness because cyberbullies can change their tactics. If the system relies solely on past examples, it might
not be able to detect new forms of harassment. Many detection techniques require a significant amount of human effort.
This can be time-consuming and limit the ability to scale these solutions to handle a large volume of data.

B. Proposed System

This work proposes a machine learning approach to detect two prominent forms of cyberbullying: hate speech on Twitter
and personal attacks on Wikipedia. The system uses Support Vector Machines (SVM) for Twitter data, due to their
effectiveness in separating data points in high-dimensional spaces. Random Forest Classifiers are employed for Wikipedia
analysis due to their ensemble approach, which offers robust and accurate classification.

This work aims to improve on existing methods in several ways. First, SVM achieves 96% accuracy in detecting
cyberbullying on Twitter, outperforming previous systems. Second, the system moves beyond simply identifying past
patterns: it uses existing data to predict future cyberbullying instances, enabling a proactive approach. Finally, the
proposed system offers higher precision than existing techniques, minimizing false positives so that users are
protected from genuine cyberbullying threats.

RESEARCH METHODOLOGY

This project tackles cyberbullying detection as a binary classification task. It focuses on identifying two main types: hate
speech on Twitter and personal attacks on Wikipedia. The project then classifies the content as either containing
cyberbullying or not. This methodology will be applied to datasets from both platforms.

A. Data Collection

Public data from social media platforms (e.g., Twitter) containing labeled examples of cyberbullying and normal
interactions is collected. This labeled data is crucial for training the machine learning model.

B. Data Preprocessing

The collected data is cleaned and formatted for machine learning. This may involve removing irrelevant information,
correcting typos, and converting text into numerical features the model can understand.

C. Model Training

Various machine learning algorithms, such as Support Vector Machines (SVM) or Neural Networks, are trained on the
labeled data. The model learns to identify the features that distinguish cyberbullying from normal online behavior.

D. Refinement

Based on the evaluation results, the model can be further refined by adjusting parameters, trying different algorithms, or
incorporating new features. The goal is to achieve the highest accuracy and robustness in cyberbullying detection.
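
The evaluation that drives this refinement step can be sketched as follows. This is a minimal illustration, assuming binary labels where 1 marks cyberbullying; the function names `confusion_counts` and `evaluate` are hypothetical and are not taken from the paper's implementation.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Precision falls as false positives rise; guard against empty denominators.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels purely for illustration (1 = cyberbullying, 0 = normal).
print(evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Reporting precision and recall alongside accuracy is useful here because bullying posts are usually a minority class, so accuracy alone can look high even when many incidents are missed.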


Fig (a): Methodology

Data Preprocessing
The data preprocessing pipeline employed for both datasets begins by converting all text data to lowercase. Next,
contractions such as "what’s" or "can’t" are transformed to their expanded forms, "what is" or "can not". All
punctuation marks are then removed using the string library. Following this, Natural Language Processing techniques
provided by the Natural Language Toolkit (NLTK) are applied.
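
The normalization steps described above can be sketched with the Python standard library. This is an illustration only: the `CONTRACTIONS` table and the `normalize` function are hypothetical, since the paper does not list the exact mappings it uses.

```python
import string

# Hypothetical contraction table; the paper does not give its full list.
CONTRACTIONS = {
    "what's": "what is",
    "can't": "can not",
    "won't": "will not",
    "i'm": "i am",
}


def normalize(text: str) -> str:
    """Lowercase, expand known contractions, then strip punctuation."""
    # Unify curly and straight apostrophes so the table lookups match.
    text = text.lower().replace("’", "'")
    for short, expanded in CONTRACTIONS.items():
        text = text.replace(short, expanded)
    # str.translate with a deletion table removes every ASCII punctuation mark.
    return text.translate(str.maketrans("", "", string.punctuation))


print(normalize("What’s wrong? You CAN'T post that!"))
# what is wrong you can not post that
```

Contractions are expanded before punctuation is removed; in the opposite order the apostrophe would already be gone and "can't" would become "cant".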

A. Tokenization
Tokenization involves breaking down raw text into meaningful units or tokens. For instance, the text "we will do it" can be
tokenized into individual words such as 'we', 'will', 'do', and 'it'. This process can also tokenize text into sentences, known as
sentence tokenization. While tokenization can have various methods, in our project, we utilize Regex Tokenizer. In Regex
Tokenizer, tokens are determined based on rules specified by regular expressions. For example, using the regular
expression '\w+', all alphanumeric tokens are extracted.
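
As a rough sketch of regex-based tokenization, the standard `re` module can apply the same `\w+` pattern. The paper uses NLTK's tokenizer; `regex_tokenize` below is a hypothetical stand-in that avoids the NLTK dependency.

```python
import re


def regex_tokenize(text: str, pattern: str = r"\w+") -> list[str]:
    """Return every non-overlapping match of the pattern as a token."""
    return re.findall(pattern, text)


print(regex_tokenize("we will do it"))
# ['we', 'will', 'do', 'it']
print(regex_tokenize("don't stop!"))
# ['don', 't', 'stop']
```

The second call shows a side effect of `\w+`: an apostrophe splits a word in two, which is one reason to expand contractions before tokenizing.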

B. Stemming
Stemming involves the transformation of a word into its root form or stem. For instance, for words like 'eating,' 'eats,' and
'eaten,' the stem would be 'eat.' This process is applied to recognize variations of the same root word as similar. NLTK
provides several stemmers including Porter Stemmer, Lancaster Stemmer, Snowball Stemmer, and Regexp Stemmer. In
this project, the Porter Stemmer is utilized for stemming purposes.
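
To make the idea of suffix stripping concrete, here is a deliberately simplified toy stemmer. It is not the Porter algorithm used in the project: the `SUFFIXES` rule set and the `toy_stem` function are hypothetical, and a real Porter stemmer applies a much larger, ordered rule set and may not reduce every word form in the same way.

```python
# Hypothetical toy rule set, checked in this order.
SUFFIXES = ("ing", "en", "s")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


print([toy_stem(w) for w in ["eating", "eats", "eaten", "eat", "is"]])
# ['eat', 'eat', 'eat', 'eat', 'is']
```

The minimum stem length keeps short words such as "is" intact, a guard that rule-based stemmers need in some form to avoid over-stemming.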

C. Stop word Removal


Stop words are words that contribute little to the overall meaning of a sentence. Common stop words in the English
language include "what," "is," "at," and "a." These words are considered irrelevant and can be omitted. NLTK provides
a predefined list of English stop words that can be used to filter them out of tweets. It is common practice to
exclude stop words from text data when training deep learning and machine learning models, since these words offer
little information to the model and removing them helps improve performance.
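
Stop word filtering amounts to a set lookup per token. The sketch below uses a small hypothetical `STOP_WORDS` set for illustration; the project uses NLTK's predefined English list, which is considerably longer.

```python
# Hypothetical, abbreviated stop word list for illustration only.
STOP_WORDS = {"what", "is", "at", "a", "the", "it", "to", "and"}


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


print(remove_stop_words(["what", "is", "wrong", "with", "you"]))
# ['wrong', 'with', 'you']
```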

Fig (b): System Design

Feature Extraction
Feature extraction plays a crucial role in Natural Language Processing. Since classifiers cannot directly process text data, it
is essential to convert them into numerical representations. Each document, such as a tweet or comment, can be represented
as a vector, and these vectors can then be utilized for classification purposes. In this project, three different methods of
feature extraction are examined.

A. Bag of Words model


The Bag of Words (BoW) model is a fundamental technique for extracting features from documents based on word
occurrences. This model comprises two essential components: a vocabulary of words (tokens) derived from all documents,
and a method for measuring these words as features within each document. The term "bag" signifies that the model
disregards the order of words and focuses solely on their presence within the document.

The BoW model follows a specific procedure: first, a vocabulary is constructed from all documents, which may include all
words or only the top frequency tokens (e.g., the top 10 features with the maximum occurrences in the corpus).
Additionally, features can be extracted in various forms based on the number of words used per feature. For instance, the
Unigram model considers single words as features, while the Bigram model pairs two words together as features.

After designing the vocabulary, the next step is to transform all documents based on this vocabulary using a method to
measure features. Typically, two types of measurements are used: binary, where features are represented as 1 or 0 based on
their presence in a document, and frequency-based, where the occurrences of features in a document are considered. While
BoW is a straightforward yet effective method, particularly in sentiment analysis, it has its limitations. It does not account
for word context or order, which can be crucial in certain cases. Moreover, designing the vocabulary becomes challenging
in large datasets due to the increase in the number of features.

For example, the meaning of "Is it interesting" differs from "It is interesting," highlighting the importance of context and
word ordering.
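
The procedure can be sketched in plain Python, with vocabulary construction, unigram or bigram features, and binary or frequency measurement. The function names `ngrams`, `build_vocabulary` and `vectorize` are hypothetical and are not taken from the paper's code.

```python
from collections import Counter


def ngrams(tokens, n=1):
    """Join each run of n consecutive tokens into one feature string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs, n=1, top_k=None):
    """Collect n-gram features from all documents, most frequent first."""
    counts = Counter(gram for doc in docs for gram in ngrams(doc, n))
    # Sort by descending count, then alphabetically, so the order is deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top_k] if top_k else ranked


def vectorize(doc, vocabulary, n=1, binary=False):
    """Measure each vocabulary feature in one document."""
    counts = Counter(ngrams(doc, n))
    return [int(counts[g] > 0) if binary else counts[g] for g in vocabulary]


docs = [["is", "it", "interesting"], ["it", "is", "interesting"]]

unigram_vocab = build_vocabulary(docs)
print(unigram_vocab)                      # ['interesting', 'is', 'it']
print(vectorize(docs[0], unigram_vocab))  # [1, 1, 1]
print(vectorize(docs[1], unigram_vocab))  # [1, 1, 1]

bigram_vocab = build_vocabulary(docs, n=2)
print(vectorize(docs[0], bigram_vocab, n=2))  # [0, 1, 1, 0]
print(vectorize(docs[1], bigram_vocab, n=2))  # [1, 0, 0, 1]
```

The two sentences get identical unigram vectors, which is the word-order limitation noted above, while the bigram features tell them apart at the cost of a larger vocabulary.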

B. TF-IDF Model
The TF-IDF (Term Frequency-Inverse Document Frequency) method is similar to the bag of words model in that it also
constructs a vocabulary to derive its features. However, TF-IDF addresses a critical issue: words that are not seen
extensively across the corpus can still be significant for feature extraction. The TF-IDF value increases with the
frequency of a word within a document and decreases as the word appears in more documents across the corpus. This
method consists of two components:

1. Term frequency (TF) calculates the frequency of a word in a document, representing the likelihood of finding a specific
word within that document. It is computed by dividing the number of times a word Wi appears in a document Rj by the total
number of words in document Rj:

$$\mathrm{TF}(W_i, R_j) = \frac{\text{number of times } W_i \text{ appears in } R_j}{\text{total number of words in } R_j}$$

2. Inverse Document Frequency (IDF) indicates how rare or frequent a word is throughout the corpus. It helps identify rare
words in the corpus by computing the logarithm of the ratio of the total number of documents in the corpus to the number of
documents containing the term t:

$$\mathrm{IDF}(t) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$

In the equation above, |D| represents the total number of documents in the corpus, and the denominator denotes the number
of documents containing the term t. Sometimes, 1 is added to the denominator to avoid division by zero.

The TF-IDF score is then calculated by multiplying the TF and IDF values. A high TF-IDF indicates that the word is
frequent in a document but rare in the corpus, making it more useful as a feature. Conversely, a low or close to 0 TF-IDF
suggests that the word occurs in most documents, making it less useful as a feature. TF-IDF effectively resolves some of
the key issues present in the bag of words model, thereby enhancing its efficiency.
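
The TF, IDF and TF-IDF definitions can be checked on a toy corpus with the sketch below. It assumes the natural logarithm (the paper does not state the base) and non-empty documents; the function names and the example corpus are hypothetical.

```python
import math
from collections import Counter


def term_frequency(doc):
    """TF: occurrences of each word divided by the document length."""
    counts = Counter(doc)
    return {word: count / len(doc) for word, count in counts.items()}


def inverse_document_frequency(docs, smooth=False):
    """IDF: log of (number of documents / documents containing the word)."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    # smooth=True adds 1 to the denominator, the variant mentioned in the text.
    return {
        word: math.log(n_docs / (df + 1 if smooth else df))
        for word, df in doc_freq.items()
    }


def tf_idf(docs):
    """Multiply TF by IDF for every word of every document."""
    idf = inverse_document_frequency(docs)
    return [
        {word: tf * idf[word] for word, tf in term_frequency(doc).items()}
        for doc in docs
    ]


# Made-up corpus of three tokenized documents.
docs = [["you", "are", "stupid"], ["you", "are", "kind"], ["you", "rock"]]
scores = tf_idf(docs)
print(scores[0])
```

In this corpus "you" appears in every document, so its IDF is log(3/3) = 0 and its TF-IDF score is 0, while "stupid" appears in only one document and receives the highest score in the first document.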

C. Word2Vec
Word2Vec is a feature extraction method introduced by Google in 2013 that uses word embeddings to represent
words in vector form. These embeddings enable the measurement of similarity between words: similar words have
vectors with a small angle between them, that is, a cosine similarity close to 1. Word2Vec employs a neural network
to train the model and generate word embeddings. There are two primary methods for constructing word embeddings:

1. Continuous Bag of Words (CBOW) model: This model takes the context words as input and predicts the target word.
The input can consist of one or multiple words, and a softmax function is applied at the output. CBOW is trained with
a negative log likelihood loss and has a more probabilistic nature compared to deterministic approaches.

2. Skip-gram model: In contrast to CBOW, the skip-gram model predicts multiple context words from a single input word.
The network outputs a prediction for each of the context words. While CBOW calculates the mean of the input context
words, the skip-gram model can capture multiple semantics for a single word. For example, it can predict two vectors
for "Apple," representing the company and the fruit.

Both methods employ forward and backpropagation techniques to train neural networks and determine optimal parameters.


Subsequently, for each document, a feature vector can be generated by concatenating or combining all word vectors within
that document. The choice between summing or averaging the word vectors depends on the data and specific requirements.
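
The following sketch shows how word vectors can be compared with cosine similarity and averaged into a document vector. The three-dimensional `EMBEDDINGS` are made-up values for illustration; real Word2Vec vectors are learned from data and typically have far more dimensions.

```python
import math

# Hypothetical embeddings, hand-picked so that the two insults point the same way.
EMBEDDINGS = {
    "stupid": [0.9, 0.1, 0.0],
    "idiot": [0.8, 0.2, 0.1],
    "friend": [0.0, 0.9, 0.4],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def document_vector(tokens, embeddings, dim=3):
    """Average the vectors of all known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]


print(cosine_similarity(EMBEDDINGS["stupid"], EMBEDDINGS["idiot"]))   # close to 1
print(cosine_similarity(EMBEDDINGS["stupid"], EMBEDDINGS["friend"]))  # close to 0
print(document_vector(["stupid", "idiot", "unknown"], EMBEDDINGS))
```

Averaging keeps the document vector at the same scale regardless of document length, whereas summing lets longer documents produce larger vectors; which behaves better depends on the data.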

Classification
Once the feature vectors for the training data are obtained by applying the feature extraction methods mentioned above, the
testing data is transformed using the same approach. It is important to note that this transformation is done without
re-fitting the vectorizers or re-training the Word2Vec model on the testing data. Following this process, classifiers are
trained on the training data and tested on the testing data.
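
The rule of fitting on the training data only and then transforming the test data can be illustrated with a minimal count vectorizer. `CountVectorizerSketch` is a hypothetical class written for this example, not the vectorizer used in the project.

```python
class CountVectorizerSketch:
    """Hypothetical minimal vectorizer with separate fit and transform steps."""

    def fit(self, docs):
        # The vocabulary is learned from the documents passed to fit, and only those.
        self.vocabulary_ = sorted({token for doc in docs for token in doc})
        return self

    def transform(self, docs):
        # Count each vocabulary term; tokens outside the vocabulary are ignored.
        return [[doc.count(term) for term in self.vocabulary_] for doc in docs]


train_docs = [["you", "are", "stupid"], ["have", "a", "nice", "day"]]
test_docs = [["you", "are", "a", "loser"]]

vectorizer = CountVectorizerSketch().fit(train_docs)  # fit on training data only
X_train = vectorizer.transform(train_docs)
X_test = vectorizer.transform(test_docs)              # no re-fitting on test data

print(vectorizer.vocabulary_)
print(X_test)
```

The word "loser" never appears in the training data, so it has no column in the test vector. Fitting again on the test set would change the feature columns and leak information about the test data into the model.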

A. Support Vector Machine (SVM)


SVM is primarily used to define a hyperplane that serves as a boundary between data points in an N-dimensional
space, effectively separating them by class. To maximize the margin, the hinge loss is often regarded as one of the most
effective loss functions. Linear SVM (Support Vector Machine) is employed in scenarios where the data is linearly
separable. When there is no misclassification (i.e., the model correctly predicts the class of a data point), only
the regularization term contributes to the gradient update.

However, when there is a misclassification, where the model wrongly predicts the class of a data point, the loss term
is included in the gradient update along with the regularization term.
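
The two update cases can be sketched as a linear SVM trained with subgradient descent on the hinge loss. This is an illustration under stated assumptions: labels are +1 and -1, the objective is lam * ||w||^2 plus the hinge loss, and the learning rate, `lam` and toy data are made-up values, not the paper's settings.

```python
def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Train weights w and bias b; labels in y must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin >= 1:
                # Correct side with enough margin: only the regularization gradient.
                w = [wj - lr * (2 * lam * wj) for wj in w]
            else:
                # Margin violated: regularization gradient plus hinge-loss gradient.
                w = [wj - lr * (2 * lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Classify by the side of the hyperplane the point falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Made-up, linearly separable 2-D points.
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]

w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

The condition `margin >= 1` is slightly stricter than plain correctness: a correctly classified point that lies inside the margin still triggers the second branch, which is what pushes the hyperplane away from the closest points.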

B. Random Forest
A random forest comprises numerous individual decision trees, each of which predicts a class for given query points. The
final result is determined by the class with the highest number of votes among the individual trees. Decision trees serve as
the foundation for random forests, offering predictions based on decision rules learned from feature vectors. By aggregating
an ensemble of these uncorrelated trees, a random forest can provide more accurate decisions for classification or
regression tasks.
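
The voting mechanism can be sketched with a forest of one-level decision trees (stumps), each trained on a bootstrap sample and a randomly chosen feature. This is a simplified illustration: full random forests grow deeper trees and sample features at every split, and the toy features below are hypothetical.

```python
import random
from collections import Counter


def majority_vote(labels):
    """Return the most common label."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, feature):
    """Find the threshold on one feature with the fewest training errors."""
    best = None
    for threshold in sorted({row[feature] for row in X}):
        left = [label for row, label in zip(X, y) if row[feature] <= threshold]
        right = [label for row, label in zip(X, y) if row[feature] > threshold]
        left_label = majority_vote(left)
        right_label = majority_vote(right) if right else left_label
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if best is None or errors < best[0]:
            best = (errors, (feature, threshold, left_label, right_label))
    return best[1]


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # sample rows with replacement
        feature = rng.randrange(len(X[0]))        # random feature for this tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict_forest(forest, x):
    """Each stump votes; the class with the most votes wins."""
    votes = [left if x[f] <= t else right for f, t, left, right in forest]
    return majority_vote(votes)


# Made-up features, e.g. counts of two kinds of abusive terms (1 = bullying).
X = [[0, 1], [1, 0], [0, 0], [5, 6], [6, 5], [6, 6]]
y = [0, 0, 0, 1, 1, 1]

forest = fit_forest(X, y)
print(predict_forest(forest, [0, 0]), predict_forest(forest, [6, 6]))
```

Because every stump sees a different resample and feature, individual trees can be wrong on a given point, but their errors are only weakly correlated and the majority vote tends to cancel them out.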

CONCLUSION

Cyberbullying on the internet is dangerous and leads to harmful outcomes such as depression and suicide, so there is a
need to control its spread. Cyberbullying detection is therefore vital on social media platforms. With the availability
of more data and better-classified user information for various other forms of cyber attacks, cyberbullying detection
can be used on social media websites to ban users who try to take part in such activity. In this paper we proposed an
architecture for the detection of cyberbullying to combat the situation. We discussed the architecture for two types of
data: hate speech data on Twitter and personal attacks on Wikipedia. For hate speech, Natural Language Processing
techniques proved effective, with accuracies of over 90 percent using basic machine learning algorithms, because tweets
containing hate speech consisted of profanity, which made them easily detectable. For this reason, BoW and TF-IDF
models give better results than Word2Vec models on this data. Personal attacks, however, were difficult to detect
through the same model because the comments generally did not share any common sentiment that could be learned; the
three feature selection methods performed similarly on this data. Word2Vec models, which use the context of features,
proved effective on both datasets, giving similar results with comparatively fewer features when combined with
Multilayer Perceptrons.
