rayaneghilene/MLM_Pretraining


[MASK]ED - Language Modeling for Explainable Classification and Disentangling of Socially Unacceptable Discourse.

This repository contains the code for Token Importance Assessment, Masked Language Modeling (MLM) Pretraining and Finetuning, Label Noise Removal, and Annotated Corpus relabeling. An inference script is provided to test the tuned models.

[Figure: Pipeline]

Table of Contents

  1. Installation
  2. Usage
  3. Token Importance Assessment
  4. Masked Language Modelling
  5. Downstream Performance Evaluation
  6. Acknowledgements
  7. Contributing
  8. Contact

Installation

Clone the repo using the following command:

git clone https://github.com/rayaneghilene/MLM_Pretraining.git
cd MLM_Pretraining

We recommend creating a virtual environment (optional):

python3 -m venv myenv
source myenv/bin/activate 

To install the requirements run the following command:

pip install -r requirements.txt

Usage

All experiments should be run through the main.py file. The arguments are as follows:

  • --experiment_name (required): either 'train' for MLM training or 'finetune_nli'.
  • --model_name: one of 'roberta', 'bert', or 'electra'.
  • --GPU: the GPU device number to use. If unset, training defaults to the CPU. Leave this option unset if you don't have a GPU or prefer not to use one.
  • --pretrained_model_path: path to the pretrained model.
  • --dataset_path: path to your dataset.
  • --masking_strategy: either 'PMI' or 'BERTopic' (default: 'PMI').
  • --loss_strategy: controls optimisation of the loss (with PMI, LDA, etc.); either 'weighted' or 'none' for no optimisation (default: 'weighted').
  • --nli_dataset_name: one of 'mnli', 'qnli', or 'snli' (default: 'mnli').
  • --save_path: path where the pretrained model and tokenizer are saved (default: './Trained_models/').

Masked Language Modelling

Here's an example command to train a model for masked language modelling:

nohup python main.py \
  --experiment_name 'train' \
  --GPU '1' \
  --model_name 'roberta' \
  --dataset_path Path_to_the_dataset \
  --save_path Path_to_save_the_trained_model \
  --masking_strategy 'BERTopic' \
  > Pretraining_logs.log 2>&1 &

Fine-tune a model for Supervised Classification

Here's an example command to fine-tune a pretrained model in a supervised fashion:

nohup python main.py \
  --pretrained_model_path path_to_your_pretrained_model \
  --GPU '1' \
  --data_path Path_to_your_data \
  --save_path Path_to_save_the_finetuned_model \
  > Supervised_logs.log 2>&1 &

You can follow the fine-tuning progress from the terminal with the following command:

tail -f Supervised_logs.log

For supervised classification, we run each experiment with the following seeds: [42, 123, 4567, 8910, 13579, 24680, 98765, 54321, 11111, 99999]. We then report the average and standard deviation of the F1 scores across seeds.
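The aggregation over seeds can be sketched as follows. This is a minimal illustration, not the repository's code: `aggregate_f1` is a hypothetical helper, the scores are placeholders rather than reported results, and the use of the sample standard deviation is an assumption (the README does not say which estimator is used).

```python
from statistics import mean, stdev

# Seeds listed in this README.
SEEDS = [42, 123, 4567, 8910, 13579, 24680, 98765, 54321, 11111, 99999]


def aggregate_f1(f1_by_seed):
    """Return (mean, sample standard deviation) of F1 scores keyed by seed."""
    scores = [f1_by_seed[seed] for seed in SEEDS]
    return mean(scores), stdev(scores)


# Placeholder scores for illustration only; these are NOT results from the paper.
dummy_scores = {seed: 0.5 + 0.01 * i for i, seed in enumerate(SEEDS)}
avg, std = aggregate_f1(dummy_scores)
print(f"F1 = {avg:.3f} +/- {std:.3f}")
```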

For Inference testing of a model

Run the following command:

nohup python utils/Inference_test.py \
  --model_path path_to_your_pretrained_model \
  --data_path Path_to_your_data \
  > Inference_logs.log 2>&1 &

You can follow the inference progress from the terminal with the following command:

tail -f Inference_logs.log

I. Token Importance Assessment

We compute the Pointwise Mutual Information (PMI) score for each token based on its co-occurrence with a specific class label in a professionally annotated corpus of approximately 470K Tweets. The dataset contains annotations for categories related to Socially Unacceptable Discourse, such as hateful, offensive, and toxic content.

$\text{PMI}(x, y) = \log \frac{P(x, y)}{P(x)P(y)}$

Where:

  • P(x, y) is the probability of events x and y occurring together.
  • P(x) and P(y) are the probabilities of the individual events x and y.

Here, x is a token and y is a class label.

A higher PMI score indicates a stronger association between the token and the specific class. To obtain a final importance score for each token, we compute its PMI score for all class labels and take the average across them. This approach ensures that tokens frequently associated with socially unacceptable discourse receive higher importance scores, guiding our token selection process during masked language model pre-training.
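As a rough sketch of this scoring step, the snippet below computes PMI per token and class, then averages over classes. It is not the repository's implementation: `average_pmi` is a hypothetical name, the corpus is a toy example, probabilities are estimated at the document level, and token-label pairs that never co-occur are skipped because log(0) is undefined. All of these are assumptions.

```python
import math
from collections import Counter


def average_pmi(docs, labels):
    """Average PMI(token, label) over labels, per token.

    docs: list of token lists; labels: one class label per document.
    Probabilities are document-level estimates (an assumption), e.g.
    P(x) = share of documents that contain token x.
    """
    n_docs = len(docs)
    token_count = Counter()
    label_count = Counter(labels)
    joint_count = Counter()
    for tokens, label in zip(docs, labels):
        for token in set(tokens):
            token_count[token] += 1
            joint_count[(token, label)] += 1

    scores = {}
    for token, n_token in token_count.items():
        pmis = []
        for label, n_label in label_count.items():
            n_joint = joint_count[(token, label)]
            if n_joint == 0:
                continue  # assumption: skip pairs that never co-occur
            # PMI = log(P(x, y) / (P(x) P(y))), with counts / n_docs as probabilities.
            pmis.append(math.log(n_joint * n_docs / (n_token * n_label)))
        scores[token] = sum(pmis) / len(pmis)
    return scores


# Toy corpus for illustration only.
toy_docs = [["you", "idiot"], ["nice", "day"], ["idiot", "again"], ["nice", "you"]]
toy_labels = ["offensive", "neutral", "offensive", "neutral"]
toy_pmi = average_pmi(toy_docs, toy_labels)
```

In this toy corpus, "idiot" only appears in offensive documents and gets a positive score, while "you" is spread evenly across both classes and scores zero.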

For a detailed mathematical breakdown of PMI and its role in importance assessment, refer to this link.

Example Tokens with high PMI scores in a tweet from the corpus:

[Figure: tweet]

II. Masked Language Modelling

A. Token Masking

The process involves randomly masking a certain percentage of the tokens in a tokenized sentence (usually 15%) and training a model to predict the original tokens from the surrounding context. The masked tokens are replaced with a [MASK] token for BERT (<mask> for RoBERTa), and the original tokens are stored as targets for prediction.

We employ a static token masking strategy: the masked tokens are selected once during data preprocessing and remain the same across all training epochs, which ensures consistency.
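A minimal sketch of static masking is shown below. It assumes that positions are chosen by ranking tokens on their importance score and masking the top 15%; the actual selection rule in this repository may differ. `static_mask` and the toy scores are hypothetical.

```python
def static_mask(tokens, importance, mask_token="[MASK]", ratio=0.15):
    """Mask the highest-scoring tokens once; return (masked_tokens, positions)."""
    n_mask = max(1, round(len(tokens) * ratio))
    # Rank positions by importance score (tokens without a score count as 0.0).
    # Ties are broken by position so the result is deterministic.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: (-importance.get(tokens[i], 0.0), i))
    positions = sorted(ranked[:n_mask])
    masked = list(tokens)
    for i in positions:
        masked[i] = mask_token
    return masked, positions


# Masking runs once at preprocessing time; every epoch reuses the same result.
tokens = ["you", "are", "such", "an", "idiot", "today", "my", "friend"]
toy_scores = {"idiot": 0.69, "you": 0.0}  # illustrative values only
masked_tokens, masked_positions = static_mask(tokens, toy_scores)
```

Because the function has no random component and is called once per sentence, the same positions stay masked in every epoch.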

B. Training Dataset preparation

In the dataset, the label tensor holds the ground-truth token IDs at the masked positions. All other positions are set to -100, which nn.CrossEntropyLoss() ignores by default (its default ignore_index is -100), as illustrated:

[Figure: MLM_dataset]

During preprocessing, labels are initialized to -100 for all tokens, indicating they should be ignored during loss calculation. For positions where tokens were masked, their corresponding token IDs are assigned as labels. The dataset is split into training and test sets, and the masked text, along with the labels, is prepared for model training.
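The label construction described above can be sketched in plain Python. `build_mlm_example` is a hypothetical helper and the token IDs are illustrative; 103 is assumed as the mask ID, as in bert-base-uncased.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_mlm_example(input_ids, masked_positions, mask_token_id):
    """Return (masked_input_ids, labels) for one tokenized sentence."""
    masked_ids = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)  # ignore every position by default
    for pos in masked_positions:
        labels[pos] = input_ids[pos]      # ground-truth ID becomes the target
        masked_ids[pos] = mask_token_id   # the model only sees the mask token
    return masked_ids, labels


# Illustrative token IDs; position 3 is the masked one.
ids = [101, 2017, 2024, 10041, 102]
masked_ids, labels = build_mlm_example(ids, masked_positions=[3], mask_token_id=103)
```

Only position 3 carries a real target, so only that position contributes to the loss.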

C. Training Loss Optimisation

During training, the model is optimized by minimizing the loss between its predictions and the original tokens. The importance scores of the masked tokens are incorporated into the loss function to emphasize learning from socially unacceptable discourse tokens. Specifically, tokens with higher scores are weighted more heavily in the loss calculation, which encourages the model to focus on the contextual relationships involving these tokens.
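One way to read this weighting is as a per-token weighted cross-entropy. The sketch below is written in plain Python so that it runs without dependencies. `weighted_mlm_loss`, the weight lookup by token ID, and the normalisation by total weight are all assumptions, not the repository's exact formulation.

```python
import math

IGNORE_INDEX = -100


def weighted_mlm_loss(logits, labels, token_weights, default_weight=1.0):
    """Importance-weighted cross-entropy over masked positions.

    logits: one list of vocabulary scores per position.
    labels: target token ID per position, or IGNORE_INDEX to skip it.
    token_weights: dict mapping token ID -> importance weight (assumed > 0).
    """
    total, weight_sum = 0.0, 0.0
    for scores, target in zip(logits, labels):
        if target == IGNORE_INDEX:
            continue  # unmasked positions do not contribute
        # Numerically stable negative log-likelihood of the target token.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        nll = log_norm - scores[target]
        weight = token_weights.get(target, default_weight)
        total += weight * nll
        weight_sum += weight
    # Normalise by the total weight so the loss scale stays comparable.
    return total / weight_sum if weight_sum else 0.0


# Toy example: vocabulary of 3 tokens, token 0 weighted twice as heavily.
toy_logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_targets = [IGNORE_INDEX, 0, 2]
loss = weighted_mlm_loss(toy_logits, toy_targets, {0: 2.0})
```

In PyTorch, the unweighted per-token losses could be obtained with nn.CrossEntropyLoss(reduction="none") and then multiplied by a weight tensor before averaging.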

For a detailed mathematical breakdown of weighted loss optimisation, refer to this link.

III. Downstream Performance Evaluation

To assess the impact of our pretraining approach, we fine-tune the trained models on downstream tasks. We evaluate their performance on supervised classification using benchmark datasets for Socially Unacceptable Discourse (SUD) detection.

1. Supervised Socially Unacceptable Discourse Classification

We fine-tune the models on a collection of datasets focused on detecting hateful, offensive, toxic, and other forms of socially unacceptable discourse. Each dataset contains professionally annotated samples, ensuring robust and reliable evaluation.

  • Task: Given a text input, classify it into predefined categories such as hateful, offensive, or neutral.
  • Objective: Measure whether pretraining with importance-weighted masking improves classification accuracy compared to baseline models trained with standard MLM.
  • Metrics: We report macro-F1, accuracy, and precision-recall curves to capture overall performance and class-specific behavior.
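For reference, macro-F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The sketch below is a dependency-free illustration with toy labels. `macro_f1` is a hypothetical helper, not the evaluation code used in this repository.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels for illustration only.
toy_true = ["hateful", "offensive", "neutral", "neutral"]
toy_pred = ["hateful", "neutral", "neutral", "neutral"]
toy_macro_f1 = macro_f1(toy_true, toy_pred)
```

The missed "offensive" example pulls the macro average down sharply even though three of the four predictions are correct, which is why macro-F1 is reported alongside accuracy.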

Acknowledgements

ARENAS Project EU

This work was conducted as part of the European ARENAS project, funded by Horizon Europe. Its objective is to characterize, measure, and understand the role of extremist narratives in discourses that have an impact not only on political and social spheres but importantly on the stakeholders themselves. Leading an innovative and ambitious research program, ARENAS will significantly contribute to filling the gap in contemporary research, make recommendations to policymakers, media, lawyers, social inclusion professionals, and educational institutions, and propose solutions for countering extreme narratives for developing more inclusive and respectful European societies.

Contributing

We welcome contributions from the community to enhance this work. If you have ideas for features, improvements, or bug fixes, please submit a pull request or open an issue on GitHub.

Contact

Feel free to reach out with any questions or suggestions at rayane.ghilene@ensea.fr

About

Accepted at EMNLP 2025: [MASK]ED - Language Modeling for Explainable Classification and Disentangling of Socially Unacceptable Discourse.
