This repository contains code for our paper accepted at EMNLP 2023.
The dataset developed in this paper is available in this repository and also on HuggingFace as `iamshnoo/WEATHub`. Refer to the HuggingFace README for more details on the dataset format used on the hub.
Clone the repository and create a virtual environment (Python >= 3.6) with the following libraries from PyPI to run all the files with full functionality.
```
numpy
pandas
matplotlib
seaborn
tqdm
fasttext
transformers
torch
openai
scikit-learn
scipy
```

Refer to `src/hf_demo.py` for a minimal example of how to use the dataset from HuggingFace:
```python
from datasets import load_dataset

from weat import WEAT
from encoding_utils import encode_words

dataset = load_dataset("iamshnoo/WEATHub")
example = dataset["original_weat"][0]

target_set_1 = example["targ1.examples"]
target_set_2 = example["targ2.examples"]
attribute_set_1 = example["attr1.examples"]
attribute_set_2 = example["attr2.examples"]

# method M5 from main paper, using DistilmBERT embeddings
args = {
    "lang": example["language"],
    "embedding_type": "contextual",
    "encoding_method": "4",
    "phrase_strategy": "average",
    "subword_strategy": "average",
}

weat = WEAT(
    encode_function=encode_words,
    target_set_1=target_set_1,
    target_set_2=target_set_2,
    attribute_set_1=attribute_set_1,
    attribute_set_2=attribute_set_2,
    num_partitions=100000,
    normalize_test_statistic=True,
    encode_args=args,
)

print("Effect size : ", weat.effect_size)
print("p value : ", weat.p_value)
```
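The same test can be run over every category in a split by looping over the examples. A minimal sketch reusing the demo's settings (only the `original_weat` split and the field names shown above are assumed; other splits may exist on the hub):

```python
from datasets import load_dataset

from weat import WEAT
from encoding_utils import encode_words

dataset = load_dataset("iamshnoo/WEATHub")

# Run the WEAT test for every category in the original_weat split,
# reusing the DistilmBERT (method M5) arguments from the demo above.
for example in dataset["original_weat"]:
    args = {
        "lang": example["language"],
        "embedding_type": "contextual",
        "encoding_method": "4",
        "phrase_strategy": "average",
        "subword_strategy": "average",
    }
    weat = WEAT(
        encode_function=encode_words,
        target_set_1=example["targ1.examples"],
        target_set_2=example["targ2.examples"],
        attribute_set_1=example["attr1.examples"],
        attribute_set_2=example["attr2.examples"],
        num_partitions=100000,
        normalize_test_statistic=True,
        encode_args=args,
    )
    print(example["language"], weat.effect_size, weat.p_value)
```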
The code is contained in the `src` directory.

- `load_annotations.py` loads data from the `annotations` folder and processes it to remove spaces and other issues before saving it to JSON files in the `data` folder.
- `weat.py` defines a class for the WEAT test and includes an example of how to use the class (a standalone sketch of the statistic it computes appears after this list).
- `encoding_utils.py` defines different types of encoding methods. This assumes that fasttext is installed for downloading and using fastText models, transformers for downloading and using BERT models, and openai for the paid Ada API. Note that to use the Ada option, you need an OpenAI API key stored in a `secrets.txt` file in the `src` folder.
- `run_weat.py` provides an efficient way to call the WEAT class with the corresponding encoding utils for a given language and save the results to a CSV. It includes an example usage and can be run as `python run_weat.py`. This is the main file to run to reproduce the results.
- `compare_embeddings.py` is the file where we perform the bias sensitivity analysis mentioned in our paper.
- `load_valence.py` creates the valence experiments mentioned by 2 out of 3 reviewers, and `valence_weat.py` runs them. Results are found in `final_results/valence`.
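For reference, the statistic behind the WEAT test follows Caliskan et al. (2017). Below is a standalone numpy sketch of the effect size and a permutation-based p value; it is independent of the repository's `WEAT` class (word vectors are assumed to be given as numpy arrays, and the actual implementation in `weat.py`, including its `num_partitions` handling, may differ):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B):
    """s(w, A, B): mean similarity of w to attribute set A minus set B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Difference in mean association between the two target sets,
    normalized by the std dev over all target words (sample std dev here;
    conventions vary across implementations)."""
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)

def weat_p_value(X, Y, A, B, num_partitions=10000, rng=None):
    """One-sided p value via random permutations of the pooled target sets."""
    rng = rng or np.random.default_rng(0)
    pooled = list(X) + list(Y)
    observed = sum(association(x, A, B) for x in X) - \
               sum(association(y, A, B) for y in Y)
    count = 0
    for _ in range(num_partitions):
        perm = rng.permutation(len(pooled))
        Xi = [pooled[i] for i in perm[:len(X)]]
        Yi = [pooled[i] for i in perm[len(X):]]
        stat = sum(association(x, A, B) for x in Xi) - \
               sum(association(y, A, B) for y in Yi)
        if stat > observed:
            count += 1
    return count / num_partitions
```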
Results for all experiments referred to in the paper are given in the `final_results` folder. It contains CSV files organized into subfolders, along with auto-generated LaTeX table versions of those CSV files.
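The LaTeX tables can be regenerated from the CSVs with pandas; a minimal sketch (the file paths below are hypothetical examples, substitute any CSV under `final_results`):

```python
import pandas as pd

# Hypothetical example path; substitute any CSV under final_results/.
df = pd.read_csv("final_results/valence/results.csv")

# Write a LaTeX table version of the dataframe alongside the CSV.
with open("final_results/valence/results.tex", "w") as f:
    f.write(df.to_latex(index=False))
```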
The main structure of the repository is as follows:

```
.
├── __init__.py
├── annotations
│   ├── ...
├── data
│   ├── ar_all
│   │   ├── ...
│   ├── ar_gt
│   │   ├── ...
│   ├── ar_human
│   │   ├── ...
│   ├── ar_new
│   │   ├── ...
│   ...
│   ├── zh_all
│   │   ├── ...
│   ├── zh_gt
│   │   ├── ...
│   ├── zh_human
│   │   ├── ...
│   └── zh_new
│       ├── ...
├── ft_embeddings
│   ├── cc.en.300.bin
│   ├── ...
├── *.egg-info
├── results
│   ├── ar
│   │   ├── ...
│   ├── consolidated
│   │   ├── ...
│   ...
│   └── zh
│       ├── ...
├── setup.py
└── src
    ├── __init__.py
    ├── compare_embeddings.py
    ├── encoding_utils.py
    ├── hf_demo.py
    ├── load_annotations.py
    ├── run_weat.py
    ├── secret.txt
    └── weat.py
```