Adam Montgomerie

Kaggle Journey to Competitions Master

2022-02-10T00:00:00+00:00

UPDATE: I did an interview with Sanyam Bhutani about my Kaggle journey on his Chai Time Podcast on Weights & Biases here

First Attempts at Competitions

I started entering Kaggle competitions near the start of 2021 (about 11 months ago as of the time of writing). I had previously been working on a few machine learning side projects (which I’ve written about on this blog before), but since starting to work full-time as an ML engineer I found that I didn’t really have the time or energy to devote to working on a full machine learning project lifecycle, in addition to doing the same at my job. Despite this, I still wanted to work on some more NLP projects since my work was mostly related to dealing with tabular data and recommender systems at the time.

I thought that Kaggle would be a good way to try out some of the new transformer architectures I’d been hearing about, without having to collect and label my own data, or having to deploy and maintain a whole system. I initially tried a couple of the monthly Tabular Playground series of competitions, which got me used to the format of a Kaggle competition. The first NLP competition I tried was the Coleridge Initiative competition, which turned out to be quite a difficult task with a strange leaderboard: it was possible to get a high public leaderboard score with a simple string matching function, but this didn’t translate to the private leaderboard at all. I gave up on this competition after working on it for a few weeks from lack of motivation.

CommonLit Readability Prize

Instead, I returned to working on my own projects. I built a CEFR classifier for predicting the reading complexity of texts for people learning English as a second language. Near the end of this project, I came across the CommonLit Readability Prize on Kaggle. This really got my attention, as it was almost the same task as I was working on by myself! Conceptually the only difference was that, while my project was aimed at people learning English as a second language, the CommonLit competition was focused on predicting the readability in terms of American grade school reading levels. Another difference was that the task was framed as a regression problem: we had to predict a real value for each text. My project had been a classification task, where I tried to map each text to a discrete reading level label.

I thought I might have better luck with CommonLit than I did with the Coleridge Initiative competition, so I started trying to translate some of my previous work into something that could be submitted to the competition. I quickly found that my hand-crafted features weren’t very useful, and contrary to my results in my own project, BERT-style transformers were definitely the way to go in the competition.

I managed to get pretty high up the leaderboard early on, and as the competition reached its final month I started getting some invites to merge teams. At first, I didn’t really want to, as I thought I might be able to get a competition medal by myself. However as the competition got closer to the end, I found my position on the leaderboard falling as merged teams started to overtake me. (Note: I didn’t realise at the time but I was actually kind of overfitting the public leaderboard at this point, so even if I had stayed near the top of the public leaderboard, I certainly would’ve lost a lot of places in a shake-up at the end).

Fortunately, I got another invite to join a team, which I accepted. Amazingly, I had stumbled into a team with three high ranked Kagglers who were all very experienced in competitions. When they asked me to share my work so far, I was embarrassed as my experiment tracking was a complete mess. I basically just had a big spreadsheet with a bunch of missing parameters, making it very difficult to reproduce what I’d done. On top of this, my naive attempts at ensembling were full of leaks.

I tried to quickly get myself organised and learn as much from my teammates as possible. In the end we finished in 50th place with a silver medal, which I was pretty happy with. After that, I was hooked on taking part in more competitions. I realised I only needed one more medal to reach Competitions Expert rank, and then if I could somehow get a gold medal I wasn’t even that far off Competitions Master.

Chaii - Hindi and Tamil Question Answering

The second competition I worked on was the chaii - Hindi and Tamil Question Answering competition. Although I don’t speak any Hindi or Tamil, I’m interested in multi-lingual NLP, as it seems like there’s a lot of potential to use it to build systems to help people learn foreign languages. I had previously worked on a question generation project, but had never done extractive question answering, despite it being a standard NLP task. I tried to take everything I’d learned from the previous competition to organise myself better, which paid off as I was able to finish in 14th place and get another silver medal.

In the chaii competition, I had narrowly missed out on a gold medal by selecting the wrong submissions at the end of the competition. In most Kaggle competitions, you’re allowed to select two submissions at the end, and your final rank is whichever of the two performs best on the private leaderboard which is revealed at the end of the competition. Despite having a submission with a high cross-validation score, I hadn’t chosen it as one of my two final submissions, because it hadn’t performed so well on the public leaderboard, which at the time I was myopically focused on. This shook me out of my public leaderboard obsession. I realised that the old Kaggle mantra of “Trust Your CV” does hold important advice, so I resolved that in the next competition I would trust my CV no matter what. (Note: CV stands for Cross-Validation in this context).

Jigsaw Rate Severity of Toxic Comments

The next competition turned out to be a real test of faith in this regard. I entered the Jigsaw Rate Severity of Toxic Comments competition. There were several strange things about this competition. The first strange thing was that there was no training data: instead we were expected to use data from previous Jigsaw competitions, or other public data we could find. The second strange thing was that the public leaderboard was only five percent of the total test data (making it potentially not a very representative sample of the overall set). The only other tool we were given was a validation set that was about three times the size of the public leaderboard, or fifteen percent of the size of the test data. The task in this competition was to match annotators’ rankings of pairs of comments in terms of which comment was considered “more toxic”. This leads to the third strange thing, which was that, since each pair of comments was shown to multiple annotators, and since duplicate annotations weren’t aggregated in any way, the test and validation sets contained conflicting labels, making it impossible even in theory to perfectly match all the annotators’ labels with predictions.

After making a few submissions, it quickly became clear that the public leaderboard and validation sets didn’t agree very well. Simply encoding some texts with TF-IDF and fitting a linear regression to predict their target values performed surprisingly well on the public leaderboard, even better than state of the art transformer networks. This trend didn’t translate to the validation set though. Here the results came out as I originally expected: transformers like RoBERTa outperformed any other type of model you could throw at the problem. Since I had already vowed to trust my CV, it was a fairly easy choice initially to ignore the leaderboard and stick to my local validation scores. My faith was increasingly tested however, as many other competitors got increasingly high on the leaderboard, dropping me down to one thousand five hundred and somethingth place.

The discussion forums were full of people speculating about why ridge regression might be better than BERT on this particular problem, which seemed to me to be the wrong question. This question already assumed that ridge regression was better, without investigating the lack of correlation between leaderboard and validation. The real question from my perspective was why did submissions which only got about sixty seven percent agreement with annotators on the validation set get ninety percent or more on the leaderboard.

As you might expect, there was a big shake-up at the end of the competition, where I was lucky enough to jump up to fourteenth place (again), barely getting a gold medal this time. Now with a gold and two silvers, I had reached the rank of Competitions Master.

Lessons Learned

I think I’ve learned a lot from taking part in Kaggle Competitions over the past year. I’ve seen people make the claim that Kaggle Competitions aren’t good preparation for “real-life machine learning” because you don’t have to collect, clean, or label data, or deploy or monitor the resulting system. While it’s true that you don’t have to do these things, there’s still a huge amount you can learn from Kaggle Competitions.

Experiment Tracking

When I started out with my first competitions, I didn’t realise I was going to be running hundreds of experiments over a period of several months. I found that without a solid system in place, it quickly becomes an unreproducible mess of metrics, data, and model weights. Small things like a consistent naming convention and file structure are important.

I’ve also learned the value of experiment tracking software like Weights & Biases, which allows you to log hyperparameters and metrics from your training runs, and automatically generates plots. Compared to manually entering all the hyperparameters for every run, this is a significant time-saver. Not having to write data visualisation code to plot your training and validation metrics is really nice too.

Model Validation

Beyond simply learning to Trust My CV, I’ve learned the importance of building a CV scheme that is worth trusting. People often get tripped up by incorrectly calculating a metric, or by splitting their data in a way that leaks between folds, leading to a CV that should not be trusted.

I’ve also learned that not only public leaderboards, but also validation folds can be overfit: for example if you evaluate every epoch or n steps with early stopping on each fold, you’ll end up with an unrealistic CV score. Say your early stopping condition causes fold 0 to stop on the second epoch, and fold 1 to stop on the fifth epoch, then which one is best? We can’t compare the score between folds, and we don’t know if, in general, our model should be trained for more or fewer epochs on this dataset. A more robust method is to calculate the CV across folds at each epoch, and then take the checkpoint for all folds at the epoch which performs the best on average. This advice comes originally from Ahmet Erdem and several other Kaggle Grand Masters.
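To make the difference concrete, here is a minimal sketch comparing the two ways of scoring a set of folds. The per-epoch validation scores are invented purely for illustration, and the function names are my own rather than anything from a library.

```python
# Hypothetical validation scores: one row per fold, one column per epoch.
# These numbers are made up for illustration only.
fold_scores = [
    [0.60, 0.70, 0.68, 0.69, 0.67],  # fold 0 peaks at epoch index 1
    [0.58, 0.62, 0.66, 0.68, 0.71],  # fold 1 peaks at epoch index 4
    [0.61, 0.66, 0.69, 0.68, 0.66],  # fold 2 peaks at epoch index 2
]


def foldwise_early_stopping_cv(scores):
    """Optimistic estimate: every fold gets to pick its own best epoch."""
    return sum(max(fold) for fold in scores) / len(scores)


def epochwise_cv(scores):
    """Average across folds at each epoch, then pick one epoch for all folds."""
    n_epochs = len(scores[0])
    mean_per_epoch = [
        sum(fold[epoch] for fold in scores) / len(scores)
        for epoch in range(n_epochs)
    ]
    best_epoch = max(range(n_epochs), key=lambda epoch: mean_per_epoch[epoch])
    return best_epoch, mean_per_epoch[best_epoch]


optimistic = foldwise_early_stopping_cv(fold_scores)
best_epoch, realistic = epochwise_cv(fold_scores)
print(f"fold-wise early stopping CV: {optimistic:.4f}")
print(f"epoch-wise CV: {realistic:.4f} (epoch index {best_epoch})")
```

With these made-up numbers the fold-wise estimate comes out higher than the epoch-wise one, because each fold is allowed to peek at its own validation score when choosing when to stop.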

Ensembling

Unlike the previous two points, this is something that I haven’t yet used outside of Kaggle (at least in the context of deep learning models). In a Kaggle competition it makes sense to blend or stack as many models as you can as long as performance on a key metric continues to improve, and as long as the final inference run time is within the competition’s maximum runtime allowance. In industrial machine learning applications, performance on a specific metric is not the only consideration, and is often not the most important one. We also have to consider other factors like inference speed, and server and GPU running costs. Despite this, I think it’s still worth mentioning here because it’s quite important in Kaggle Competitions, and because it could have occasional use in real-world scenarios.

The simplest ensembling technique is just taking the mean of the outputs of several models to get a more accurate output. This technique is surprisingly consistent at improving scores. In cases where a mean doesn’t make sense, for example classification tasks, we can use another technique like majority voting instead. More advanced ensembling techniques include weighting each model in the ensemble differently when calculating the mean, and stacking models by taking the outputs of a set of models, and using them as input to another model.
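As a rough illustration of the simpler techniques, here is a sketch of a plain mean, a weighted mean, and majority voting. The predictions and weights are invented toy values and the helper names are hypothetical.

```python
from collections import Counter


def mean_ensemble(predictions):
    """Average the outputs of several models, example by example."""
    return [sum(values) / len(values) for values in zip(*predictions)]


def weighted_mean_ensemble(predictions, weights):
    """Same as the mean, but some models count for more than others."""
    total = sum(weights)
    return [
        sum(w * v for w, v in zip(weights, values)) / total
        for values in zip(*predictions)
    ]


def majority_vote(predictions):
    """Pick the most common label per example (ties go to the label seen first)."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


# Toy regression outputs from two hypothetical models on three examples.
model_a = [0.2, 0.8, 0.5]
model_b = [0.4, 0.6, 0.9]
blended = mean_ensemble([model_a, model_b])
weighted = weighted_mean_ensemble([model_a, model_b], weights=[3, 1])

# Toy class labels from three hypothetical classifiers.
votes = majority_vote([
    ["A1", "B1", "C1"],
    ["A1", "B2", "C1"],
    ["A2", "B2", "C1"],
])
print(blended, weighted, votes)
```

Stacking is not shown here, since it needs a second model trained on held-out predictions.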

You have to be careful when evaluating ensembles, as you can’t use the same data that you used to train the models. For stacking, this means you have to set aside a subset of the data that you don’t include in your CV folds. In other cases, you can get round this by generating a set of OOF (Out Of Fold) predictions for each k-fold set of models. For each k-fold set of models, each model only generates predictions on the subset of data that was used to evaluate it, and which it didn’t see during training. This allows you to generate one prediction per example in the dataset. These OOF predictions can then be combined in different ways and averaged to optimise your ensemble’s overall score.
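Here is a small sketch of how OOF predictions are assembled. The "model" is a toy stand-in that always predicts the mean of its training targets, and the round-robin fold assignment is an assumption for brevity. In practice you would plug in your real training loop and fold scheme.

```python
def make_folds(n_examples, n_folds):
    """Assign each example index to a fold in round-robin order."""
    return [i % n_folds for i in range(n_examples)]


def fit_mean_model(train_targets):
    """Toy stand-in for a real model: always predicts the training mean."""
    mean = sum(train_targets) / len(train_targets)
    return lambda example: mean


def out_of_fold_predictions(examples, targets, n_folds):
    folds = make_folds(len(examples), n_folds)
    oof = [None] * len(examples)
    for fold in range(n_folds):
        train_idx = [i for i, f in enumerate(folds) if f != fold]
        valid_idx = [i for i, f in enumerate(folds) if f == fold]
        model = fit_mean_model([targets[i] for i in train_idx])
        # Each model only predicts on the examples it never saw in training.
        for i in valid_idx:
            oof[i] = model(examples[i])
    return oof


examples = ["a", "b", "c", "d", "e", "f"]
targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
oof = out_of_fold_predictions(examples, targets, n_folds=3)
print(oof)
```

Every example ends up with exactly one prediction, and that prediction always comes from a model that did not train on it.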

Recently, I’ve started using Optuna for finding model weights in an ensemble. I think this was the key factor that pushed my final score up from a silver to a gold in the Jigsaw competition.

Conclusion

I think most of the ideas here are things that many Kagglers or Data Scientists would say that they knew already, but I also think there’s a difference between knowing something in theory and being able to apply it in practice. If you don’t try it out, and mess it up a few times, you won’t be able to apply it properly when it’s really needed. Kaggle is a safe space to make mistakes: it’s better to have a data leak and broken model validation in a Kaggle competition than when deploying a model to production that’s going to serve predictions to thousands of paying customers.

I already knew what model validation, experiment tracking, and ensembling were before I entered any Kaggle Competitions, but I still made a dog’s dinner of it the first time I tried to put these ideas into practice by myself.

Jigsaw Rate Severity of Toxic Comments 14th Place Solution

2022-02-08T00:00:00+00:00

UPDATE: I did an interview with Sanyam Bhutani about my solution to this competition on his Chai Time Podcast on Weights & Biases here

This blog is mostly a repost of a thread I posted on Kaggle about this competition. You can find the original thread here. The source code can be found here.

Competition Overview

The goal of this competition was to build a system which can predict how toxic online comments are. What separated this from other similar sentiment analysis tasks was that the comments were divided randomly into pairs and then the comments in each pair were ranked by annotators. For example, if an annotator receives two comments: A) “I hate you.”, and B) “That’s nice.”, they have to choose which of the two comments is “more toxic” and which is “less toxic”. The data contained some duplicate pairs which had been ranked by different annotators. This means that there were many cases of annotator disagreement, which led to inconsistent labels.

The target metric for the competition was Average Agreement with Annotators. Given a pair of texts, the system had to generate a score for each text, and if it generated a higher score for the text labelled by the annotator as “more toxic” then it would receive 1 point, otherwise 0. The final score was then the total number of points divided by the total number of text pairs. Since the dataset contained contradictory pairs, it was impossible to get a score of 1.0 (100% agreement with all annotators).
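The metric described above can be sketched in a few lines. The pair format (less toxic text first, more toxic text second), the toy scores, and the function name are assumptions for illustration rather than the competition's official implementation.

```python
def average_agreement(pairs, score_fn):
    """pairs: list of (less_toxic_text, more_toxic_text) annotations."""
    points = sum(1 for less, more in pairs if score_fn(more) > score_fn(less))
    return points / len(pairs)


# Toy model scores for two comments (hypothetical values).
toy_scores = {"I hate you.": 0.9, "That's nice.": 0.1}

# The same pair annotated three times; the third annotator disagrees.
pairs = [
    ("That's nice.", "I hate you."),
    ("That's nice.", "I hate you."),
    ("I hate you.", "That's nice."),
]

score = average_agreement(pairs, toy_scores.get)
print(f"{score:.3f}")
```

Because the third annotation contradicts the first two, no scoring function can get more than two of these three pairs right, which is why a perfect score was out of reach.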

The test set was 200,000 pairs of comments, and the public leaderboard which is visible throughout the competition only contained 5%, or about 10,000 pairs, of the total test data. The small size of the public leaderboard meant that it was not a very reliable metric, which led lots of teams to overfit.

Another complicating factor was that we didn’t receive any training data for the competition. Instead we got a validation set, and some links to other similar toxicity-rating tasks to potentially use as extra data.

In the end I came in 14th place, just barely getting a gold medal!

Solution Overview

The public leaderboard didn’t seem very useful so my strategy was to just maximise my validation score. My final submission is a weighted mean of 6 (5-fold) transformers. It was both my highest CV (Cross Validation score) and highest private LB score, so I’m glad I trusted my CV this time.

Data

I tried to include as many different datasets as I could. For training I used:

the validation set
jigsaw 1: the data from the first jigsaw competition
jigsaw 2: the data from the second jigsaw competition
ruddit: the data from Ruddit: Norms of Offensiveness for English Reddit Comments
offenseval2020: the data from OffensEval 2020: Multilingual Offensive Language Identification in Social Media. This dataset was uploaded to Kaggle by @vaby667 here

CV strategy

I used the Union-Find method by @columbia2131 to generate folds that didn’t have any texts leaked across folds. I didn’t use majority voting or any other method of removing disagreements from the data, as I thought this would just make the validation set artificially easier and less similar to the test data.
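The general idea can be sketched as follows: any two texts that appear in the same pair are joined into one group with union-find, and whole groups are then assigned to folds. This is my own simplified illustration, not the code from @columbia2131's notebook, and the greedy balancing step is an assumption.

```python
def find(parent, x):
    """Return the root of x, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def assign_folds(pairs, n_folds):
    """Return one fold index per pair, keeping connected texts together."""
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        union(parent, a, b)

    # Collect the pair indices belonging to each connected group of texts.
    groups = {}
    for idx, (a, _) in enumerate(pairs):
        groups.setdefault(find(parent, a), []).append(idx)

    # Greedy balancing: biggest groups first, each into the smallest fold.
    fold_sizes = [0] * n_folds
    fold_of_pair = [None] * len(pairs)
    for members in sorted(groups.values(), key=len, reverse=True):
        target = fold_sizes.index(min(fold_sizes))
        for idx in members:
            fold_of_pair[idx] = target
        fold_sizes[target] += len(members)
    return fold_of_pair


pairs = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "g"), ("e", "h")]
folds = assign_folds(pairs, n_folds=2)
print(folds)
```

Since texts "a", "b", and "c" are linked through shared pairs, their pairs always land in the same fold, so no text can show up in both the training and validation split.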

Training

Loss

I used Margin Ranking Loss to train with the validation data, and tried both Margin Ranking and MSE (Mean Squared Error) with the other datasets. It was fairly easy to create large amounts of ranked paired data from the extra datasets, but I didn’t find that this improved performance over just training directly on the labels with MSE loss, and the paired approach also required smaller batch sizes.
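For reference, here is a plain-Python sketch of the two losses. The margin of 0.5 and all of the scores are assumptions for illustration; a real training loop would use the framework's built-in loss functions on tensors.

```python
def margin_ranking_loss(more_toxic_scores, less_toxic_scores, margin=0.5):
    """Penalise pairs where the 'more toxic' score does not beat the
    'less toxic' score by at least `margin` (margin value is an assumption)."""
    losses = [
        max(0.0, margin - (more - less))
        for more, less in zip(more_toxic_scores, less_toxic_scores)
    ]
    return sum(losses) / len(losses)


def mse_loss(predictions, targets):
    """Mean squared error against real-valued labels."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Three hypothetical pairs: well separated, too close, and wrongly ordered.
ranking = margin_ranking_loss([2.0, 0.3, 1.0], [1.0, 0.1, 1.5])
regression = mse_loss([0.2, 0.7], [0.0, 1.0])
print(f"{ranking:.4f} {regression:.4f}")
```

The ranking loss needs two forward passes per training example (one per text in the pair), which is one plausible reason for the lower batch sizes.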

Evaluation

When fine-tuning on the extra datasets, I computed average agreement with annotators on the validation set at each evaluation, and used early stopping. I trained for multiple epochs on the small datasets (jigsaw 1 and ruddit), evaluating once an epoch. For the large datasets (offenseval and jigsaw 2) I usually only trained for 1 epoch, evaluating every 10% of the epoch’s steps.

When training on the validation data, I trained 5 models, using 4/5 folds for training and the remaining fold as validation for each one. I computed the CV at each epoch and used the model weights from the epoch that had the highest CV.

I had originally started out using fold-wise early stopping, but I discovered that this leads to overly optimistic CV scores.

Multi-stage fine-tuning

I found that taking a model I had already fine-tuned, and fine-tuning it on another dataset improved performance. This worked across multiple fine-tuning stages. The order that worked best was to start by fine-tuning on the larger, and lower scoring, datasets first, and then on the smaller ones after.

For example:

1. fine-tune a pretrained model on offenseval (validation: ~0.68)
2. use #1 to fine-tune on jigsaw 2 (validation: ~0.695)
3. fine-tune #2 on jigsaw 1 (validation: ~0.7)
4. fine-tune #3 on ruddit (validation: ~0.705)
5. fine-tune 5 folds on the validation data, using #4 (CV: 0.71+)

Hyperparameters

I started out training most of the base-sized models with a learning rate of 1e-5 on earlier fine-tuning stages and reduced it to 1e-6 on the later ones. I used 1e-6 and then 5e-7 on the larger models. At each stage I also used warmup of 5% and linear LR decay. Even with mixed precision training, I could only fit batch size 8 on the GPU with the large models, so I used gradient accumulation to simulate batch sizes of 64.
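A minimal sketch of that schedule, assuming the warmup is linear from zero and the decay runs linearly down to zero at the last step. The function name and the step counts are hypothetical.

```python
def linear_warmup_decay_lr(step, total_steps, peak_lr, warmup_fraction=0.05):
    """Learning rate at a given optimiser step: linear warmup, then linear decay."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Gradient accumulation: 8 examples fit on the GPU, 64 is the target batch size,
# so the optimiser steps once every 8 forward/backward passes.
per_device_batch_size = 8
target_batch_size = 64
accumulation_steps = target_batch_size // per_device_batch_size

total_steps = 1000  # hypothetical number of optimiser steps
for step in (0, 25, 50, 525, 1000):
    lr = linear_warmup_decay_lr(step, total_steps, peak_lr=1e-5)
    print(step, f"{lr:.2e}")
```

The learning rate climbs to its peak over the first 5% of steps and then falls steadily for the rest of training.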

Models

Here’s a table of model results.

base model	folds	CV	final submission
deberta-v3-base	5	0.715	yes
distilroberta-base	5	0.714	yes
deberta-v3-large	5	0.714	yes
deberta-large	10	0.714	yes
deberta-large	5	0.713	yes
rembert	5	0.713	yes
roberta-base	5	0.711	no
roberta-large	5	0.708	no

Notes:

I mostly stuck to using roberta and deberta variants because they always perform well. If I had had more time I would’ve tried some others, but I spent most of the time trying out different combinations of datasets.
The reason I tried rembert was because I wanted to make use of the multilingual jigsaw 3 data. I wasn’t able to get any improvement from including the extra data, but I was still able to get reasonably good performance out of rembert.
Deberta-large (v1) is in there twice because I did an experiment with a 10 fold model which turned out quite well. I didn’t want to keep training 10 fold models though because it took too long.
I think all of the large models are slightly under-trained. Training on large datasets like offenseval and jigsaw 2 took over 24 hours so my colab instances timed-out.

Ensembling

My final submission was a weighted mean of 6 models which were selected from a pool of models by trying to add each of them to the ensemble one at a time, tuning the weights with Optuna for each combination of models, and greedily selecting whichever model increased the OOF score the most. This was repeated until the OOF score stopped improving. My best score was 0.7196.
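The greedy selection loop can be sketched roughly like this. To keep the example dependency-free, it swaps Optuna for a coarse grid search over a single blend weight between the current ensemble and each candidate, which is a simplification of tuning all the weights for each combination. The OOF scores, model names, and pairs are toy values.

```python
def agreement(scores, pairs):
    """pairs: (less_toxic_index, more_toxic_index) into the score list."""
    return sum(scores[more] > scores[less] for less, more in pairs) / len(pairs)


def blend(current, candidate, alpha):
    return [(1 - alpha) * c + alpha * n for c, n in zip(current, candidate)]


def greedy_ensemble(oof_by_model, pairs, grid_size=10):
    """Start from the best single model, then keep adding whichever model
    (at its best blend weight) improves the OOF score the most."""
    alphas = [i / grid_size for i in range(1, grid_size)]
    first = max(oof_by_model, key=lambda name: agreement(oof_by_model[name], pairs))
    selected, current = [first], oof_by_model[first]
    best_score = agreement(current, pairs)
    remaining = set(oof_by_model) - {first}
    while remaining:
        best_trial = None
        for name in sorted(remaining):
            for alpha in alphas:
                trial = blend(current, oof_by_model[name], alpha)
                trial_score = agreement(trial, pairs)
                if trial_score > best_score:
                    best_score, best_trial = trial_score, (name, trial)
        if best_trial is None:
            break  # no candidate improves the OOF score any further
        name, current = best_trial
        selected.append(name)
        remaining.remove(name)
    return selected, best_score


# Toy OOF scores for four comments from three hypothetical models.
oof_by_model = {
    "A": [0.1, 0.9, 0.6, 0.5],
    "B": [0.5, 0.4, 0.1, 0.9],
    "C": [0.9, 0.1, 0.8, 0.2],
}
pairs = [(0, 1), (2, 3), (0, 3), (2, 1)]
selected, best_score = greedy_ensemble(oof_by_model, pairs)
print(selected, best_score)
```

In this toy setup models A and B each get a different pair wrong, so blending them fixes both mistakes, while model C never helps and is left out.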

Interestingly, the highest weighted models ended up being deberta-large and rembert, despite those having lower CV scores.

Things which didn’t work

The measuring hate speech dataset

The Measuring Hate Speech dataset by ucberkeley-dlab seemed like it was going to be useful, but the labels didn’t seem to match the annotations in the validation set for this competition very well. I was unable to get more than 0.656 with this data.

Binary labels

I wanted to try to make use of the binary labelled data too (jigsaw 3 and toxic tweets). I tried fine-tuning on these datasets, and used the model’s predicted probability of the positive class as an output for inference and evaluation. I was able to get 0.68 on the validation set with this method, but I found that it didn’t chain together with my multi-stage fine-tuning approach as I had to modify the last layer of the model to switch between regression and classification tasks.

TF-IDF with linear models

This method was used by a large number of competitors in this competition, but it seems to have mostly been used to overfit the small public LB. I experimented with it a little bit, but wasn’t able to get anything over 0.7 on the validation set so I gave up on it.

Word vectors with linear models

I tried encoding each comment as the average of spaCy word vectors, and using this as an input into various linear models. It did about as well as TF-IDF.

Improvements

As I’ve already mentioned, I think my large models are under-trained due to GPU limitations. I was surprised by how much rembert helped the ensemble, so I think I could’ve made a stronger ensemble by choosing some more diverse model architectures instead of focusing on deberta and roberta so much.

Overall, I’m quite happy with how this competition came out. I feel very lucky to have got a gold medal this time!

Predicting the CEFR Level of English Texts

2021-03-14T00:00:00+00:00

Attempting to Predict the CEFR Level of English Texts

To try out the final model, check out the Streamlit app. The code is available on Github.

I previously wrote a blog about automatic reading comprehension question generation. This one is somewhat related, in that it’s another project about English reading comprehension. This time I wanted to see if I could predict the CEFR level of a given text. This kind of system is a useful tool for teachers or self-studying students as it helps them find reading material of an appropriate difficulty level.

There are several tools like this that already exist, so this is mostly just an exercise in trying to reproduce their behaviour. For example Duolingo CEFR checker (UPDATE: Duolingo seems to have taken this down now) which predicts CEFR at a word level, and then gives an overall score, and Text Inspector, which predicts an overall score based on a number of metrics. There are also a number of metrics which aim to estimate the difficulty level of a text, like the Flesch-Kincaid readability test, the Gunning Fog index, and the Coleman-Liau index.

Dataset

The majority of CEFR levelled reading texts are not freely available, but some free samples can be found. I started by collecting all the freely available labelled sample texts that I could get hold of. The resulting dataset was fairly small, so to increase the size of the dataset, I used the existing CEFR levelling tools to label additional data.

The final dataset contains 1500 example texts split over the 6 CEFR levels. The texts are a mixture of dialogues, stories, articles, and other formats. The dataset can be found here.

The dataset was then split into 80% training and 20% test.

Text Complexity Metrics as a Baseline

The textstat library contains a variety of functions for calculating text readability and complexity metrics, including all the previously mentioned ones. To set a baseline performance, each metric was computed for every example in the test set, and the results were scaled and rounded to fit in the range of labels for classification. Of these metrics, scaled Smog Index performed the best with 41.8% accuracy on the test set. Most seemed to have some predictive power with regards to CEFR levels, except for Flesch Reading Ease which got less than 13% (below the accuracy of a system which generates a random number between 0 and 5).

Text Complexity Metric	Accuracy
Smog Index	41.8%
Dale Chall Readability Score	37.8%
Automated Readability Index	35.1%
Text Standard	34.8%
Flesch Kincaid Grade	34.1%
Linsear Write Formula	33.8%
Gunning Fog	32.8%
Coleman Liau Index	31.8%
Difficult Words	27.1%
Baseline Random	16.7%
Flesch Reading Ease	12.7%

Feature Engineering

In order to try fitting some classifiers, I needed to generate some features. Since the text complexity metrics individually displayed some level of predictive power, I decided to use them. In addition, I generated some features such as the mean parse tree depth and the mean number of each part-of-speech tag using spaCy. Features using the mean were preferred over absolute counts to prevent the level predictions from being directly tied to text length. Higher level texts tend to be longer, but the length of a text by itself is not a good indicator of the text’s difficulty. A short text containing complex sentences filled with obscure terminology is more difficult to read than a longer text of simple short sentences.

Training

I tried training SVC, Decision Tree, Random Forest, and XGBoost. At almost 71% accuracy on the test set, XGBoost slightly outperformed the others.

I also fine-tuned a couple of transformer models, which I initially assumed would be stronger at this kind of language understanding classification task. Pretrained BERT-base and DeBERTa-base were fine-tuned for sequence classification. The raw text was tokenised, encoded, and used as inputs into the models. But even after experimenting with various hyperparameters, neither transformer managed to outperform XGBoost. These transformer-based solutions are also significantly more resource intensive than XGBoost, and slower at inference without a GPU.

The training code and model artifacts for the sklearn classifiers can be found here. A Colab notebook for finetuning BERT on the same data can be found here. The table below shows the best test set accuracy I was able to get with each model.

Model	Accuracy
XGBoost	70.9%
DeBERTa-base (pretrained)	70.2%
BERT-base-cased (pretrained)	68.6%
Random Forest	68.2%
Logistic Regression	67.9%
SVC	67.9%

Given these results I went with XGBoost as my final model.

The Problem of Vague Boundaries

A maximum of 71% accuracy on this 6 class problem isn’t a particularly impressive result. One possible limiting factor is that the data was collected from various sources without a set of consistent rules for labelling.

Another likely reason is that the criteria for levelling texts are fairly vague, so the boundaries between each class are not clearly defined. The criteria seem to be a set of “can-do” statements for each level, such as “can understand texts that consist mainly of high frequency everyday or job-related language” (B1). It’s not clear exactly which vocabulary is included in “high frequency everyday or job-related language”, or how much of a text must consist of this to be considered “mainly”.

Confusion Matrix

label	A1	A2	B1	B2	C1	C2
A1	52	5	1	0	0	0
A2	13	40	1	1	0	0
B1	0	5	23	12	1	0
B2	0	2	9	32	13	1
C1	0	0	1	9	34	4
C2	0	0	0	0	9	31

The confusion matrix above confirms that the majority of the model’s incorrect predictions are off-by-one misclassifications. The model frequently confuses A1 and A2 for example, but rarely confuses a label with anything other than its immediate neighbour. This seems to confirm the idea that the boundaries are not easy to distinguish. For top-2 accuracy the model scored 95%.
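The claim about neighbouring levels can be checked directly from the matrix. The sketch below copies the counts from the table above and counts how many errors fall exactly one level away from the diagonal. It doesn't matter here which axis holds the true labels, since both quantities are unchanged by transposing the matrix.

```python
labels = ["A1", "A2", "B1", "B2", "C1", "C2"]
confusion = [
    [52, 5, 1, 0, 0, 0],
    [13, 40, 1, 1, 0, 0],
    [0, 5, 23, 12, 1, 0],
    [0, 2, 9, 32, 13, 1],
    [0, 0, 1, 9, 34, 4],
    [0, 0, 0, 0, 9, 31],
]

n = len(labels)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n))
# Errors where the two labels are immediate neighbours (e.g. A1 vs A2).
adjacent = sum(
    confusion[i][j] for i in range(n) for j in range(n) if abs(i - j) == 1
)
errors = total - correct

print(f"accuracy: {correct / total:.1%}")
print(f"errors that are one level off: {adjacent} of {errors}")
print(f"within one level: {(correct + adjacent) / total:.1%}")
```

The accuracy works out to 212 correct out of 299, or about 70.9%, and 80 of the 87 errors are between neighbouring levels.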

Using Probabilities to Distinguish Between Labels

In cases where a text seems to lie somewhere between B1 and B2, we can call the text B1+ to indicate that it’s somewhere in the middle. Text Inspector seems to take the same approach in these cases. However, the + labels are not present in the dataset which I collected, so to avoid having to completely relabel the data, the model’s predicted probabilities for each label are used.

Predictions where the maximum probability is below a certain threshold (0.7) are counted as instances of the model being uncertain. In these cases the predicted label is the average of the max and the second strongest prediction. For example, in the case that the model predicts somewhere between 0 (A1) and 1 (A2), the returned value will now be 0.5 (A1+). This indicates that the text could belong in either A1 or A2, and might be appropriate for advanced A1 level readers, or A2 level readers.
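A short sketch of this rule, assuming the model returns one probability per level in order from A1 to C2. The function names are hypothetical and the probabilities are invented.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def predict_level(probabilities, threshold=0.7):
    """Return the top class index, or the average of the top two class
    indices when the model's maximum probability is below the threshold."""
    ranked = sorted(
        range(len(probabilities)), key=lambda i: probabilities[i], reverse=True
    )
    best, second = ranked[0], ranked[1]
    if probabilities[best] >= threshold:
        return float(best)
    return (best + second) / 2


def level_name(value):
    """Map 0.0 to 'A1', 0.5 to 'A1+', 1.0 to 'A2', and so on."""
    lower = int(value)
    return LEVELS[lower] + ("+" if value != lower else "")


confident = predict_level([0.90, 0.10, 0.00, 0.00, 0.00, 0.00])
uncertain = predict_level([0.55, 0.40, 0.05, 0.00, 0.00, 0.00])
print(level_name(confident), level_name(uncertain))
```

Note that if the two strongest predictions are not neighbouring levels, the average as described can land on a whole level in between, so this sketch only produces sensible + labels when the top two classes are adjacent.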

Improvements

I think the most significant way to improve this model would be to collect a new dataset with a more precise set of rules for labelling. This could be done by taking all the labelled texts from a single source. For example the British Council site has a set of texts which presumably follow a consistent set of rules for labelling. However the number of texts is fairly small, and I don’t know of any publicly available source of labelled texts which would be large enough for the task.

Token Classification With Subword Tokenizers for Bulgarian

2020-09-02T00:00:00+00:00

All the code for this project can be found at: https://github.com/AMontgomerie/bulgarian-nlp.

I can’t really speak Bulgarian, but I’d like to be able to. Sometimes when I receive an instant message in Bulgarian that I can’t understand, I’m forced to just copy-paste it into Google Translate, which helps me find the overall meaning of the sentence most of the time, but doesn’t provide much in the way of lexical or grammatical information which I could learn something from.

The number of tools for Bulgarian language learners seems pretty limited, so I thought I’d try building my own. I wanted to know what the individual words in the sentences I was Google-translating were doing, so I decided to train a part-of-speech (POS) tagger. While I was at it I also trained a model for named-entity recognition (NER).

I have some experience fine-tuning models using the Huggingface Transformers library, but hadn’t done much training of transformer networks from scratch. There didn’t seem to be any pretrained checkpoints available for Bulgarian, so I was forced to pretrain my own model, before fine-tuning it on my chosen downstream tasks.

Tokenization

POS tagging and NER are both token classification tasks in that they both require the model to make predictions about the roles of individual words in a sentence. Here we are using “word” and “token” interchangeably, which is not necessarily always the case as text can be tokenized in many different ways.

Tokenization is the process of breaking an input text into a series of meaningful chunks. These chunks, or tokens, can then be encoded and used for a variety of tasks. But along which lines should we split the text? An obvious answer is to split on a word level. We can simply split the text by whitespace and encode each word as a separate token. This allows us to preserve the meaning of each word, but will create problems whenever we encounter words that are not in our vocabulary.

Another strategy is to tokenize on a character level. This solves the problem of encountering out-of-vocabulary words, because we can construct any word (within the character-set of the languages we’re using) from its component characters. However, by reducing words to series of characters, we seem to be discarding the meaning that languages contain at a word level.
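
To make the trade-off concrete, here is a small sketch of both strategies. The toy vocabulary and the [UNK] placeholder are assumptions made up for illustration; they are not taken from the project.

```python
# Toy vocabulary, for illustration only.
vocab = {"the", "dog", "is", "hungry"}

def word_tokenize(text, vocab, unk="[UNK]"):
    """Split on whitespace; words missing from the vocabulary become unk."""
    return [word if word in vocab else unk for word in text.split()]

def char_tokenize(text):
    """Split into individual characters, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]

word_tokens = word_tokenize("the dinosaur is hungry", vocab)
char_tokens = char_tokenize("dinosaur")

print(word_tokens)  # ['the', '[UNK]', 'is', 'hungry']
print(char_tokens)  # ['d', 'i', 'n', 'o', 's', 'a', 'u', 'r']
```

The word-level tokenizer loses dinosaur entirely, while the character-level one can always represent it, at the cost of a much longer sequence of tokens that carry little meaning on their own.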

Subword Tokenization

A third strategy, and one that has become standard in a lot of modern NLP architectures, is subword tokenization. This is a kind of middle ground between word-level and character-level tokenization, where we split words into chunks based on common patterns. For example, lots of negative words start with the prefix dis-, such as disorganised or dishonest. We can split this prefix from the rest of the word and use it as a token. This helps us preserve the meaning of part of the word while still allowing us to build new unseen words using in-vocabulary components. Even if our system has never encountered the word disagreeable during its training, it can still represent it using the tokens dis, ##agree, and ##able (the ## here indicates that the token continues the same word as the previous token).
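
As a rough sketch of how a word can be split once a subword vocabulary exists, the function below applies greedy longest-match-first splitting, which is the approach WordPiece tokenizers use at inference time. The function name and the tiny vocabulary are hypothetical, chosen only to reproduce the example above.

```python
def split_into_subwords(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining piece first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of the same word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no in-vocabulary piece fits at this position
        tokens.append(match)
        start = end
    return tokens

# Hypothetical vocabulary for illustration.
toy_vocab = {"dis", "##agree", "##able", "##honest", "agree"}

subwords = split_into_subwords("disagreeable", toy_vocab)
print(subwords)  # ['dis', '##agree', '##able']
```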

Two of the most common subword tokenization methods are WordPiece and Byte-Pair Encoding (BPE). WordPiece builds its vocabulary by merging the combinations of characters which increase the likelihood of the training data the most. In contrast, BPE repeatedly merges the most frequent pair of adjacent symbols (characters or bytes) in the data. For this project, BPE tokenization was used.
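
To illustrate the BPE idea, here is a minimal sketch of a single merge step on a toy corpus. It works on characters rather than bytes, and the corpus, frequencies, and function names are all made up for the example; a real tokenizer repeats this step until the vocabulary reaches a target size.

```python
from collections import Counter

def most_frequent_pair(words):
    """Find the most frequent adjacent symbol pair in the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word as a tuple of characters, mapped to its frequency.
corpus = {
    tuple("hug"): 10,
    tuple("pug"): 5,
    tuple("pun"): 12,
    tuple("bun"): 4,
    tuple("hugs"): 5,
}

best = most_frequent_pair(corpus)
corpus = merge_pair(corpus, best)
print(best)  # ('u', 'g')
```

After this step, ug is a single symbol in hug, pug, and hugs, and the next merge would be chosen from the updated pair counts.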

Token Classification

POS tagging involves predicting which part of speech a word represents. For example, big is an adjective and dinosaur is a noun. NER is the task of picking named entities from a text. Here, entity means a string which represents a person, place, product, or other type of named thing. “Adam Montgomerie” is a named entity, but “potato” is not.

Datasets for POS tagging and NER are usually labelled at a word level, which means that, when using a word-level tokenizer, there is a one-to-one correspondence between input tokens and labels, which makes calculating training loss and test accuracy easy. But if we are using a subword tokenizer, each word will be potentially split into multiple tokens. How do we resolve this mismatch between inputs and labels?

Token Mapping

A popular, but slightly unintuitive, approach is to ignore all but one token from each word when generating predictions. The original BERT paper uses this strategy, choosing the first token from each word. Let’s use disagreeable as an example again: we split the word into dis, ##agree, and ##able, then just generate predictions based on dis.

This implementation of a POS tagger using BERT suggests that choosing the last token from each word yields superior results. This would mean choosing ##able as the token to generate predictions from. Intuitively this makes sense: in English, important morphological information which hints at the part of speech of a word is often contained at the end of that word. For example manage is a verb, manager is a noun, and managerial is an adjective. We need to inspect the ends of these words to determine which part of speech they correspond to.

This also holds in Bulgarian: учи (learn) is a verb, and училище (school) is a noun. There are several other words with various parts of speech but the same учи or уче prefix.

To implement this, we can use the offset_mapping from Huggingface’s tokenizers. If we set return_offsets_mapping=True (which requires one of the library’s fast tokenizers), the tokenizer will also return a list of tuples indicating the span of each token.

# list ix:  0123456789012345678
sentence = 'Кучето ми е гладно.'
# tokenizer is assumed to be a fast tokenizer that has already been loaded
encoded_sentence = tokenizer(
    sentence,
    add_special_tokens=True,
    return_offsets_mapping=True,
)
print(encoded_sentence['offset_mapping'])

gives us the following output:

[(0, 0), (0, 2), (2, 6), (7, 9), (10, 11), (12, 15), (15, 18), (18, 19), (0, 0)]

Each tuple in the list represents one of the tokens that the input sequence was split into. The first value of the tuple indicates the start of the token’s span in the original sentence, and the second value indicates the end of the span. Because we set add_special_tokens=True we also got special start-of-sequence and end-of-sequence tokens. We can tell which tokens are special tokens by checking the width of their span. The first token goes from zero to zero, because it is a start-of-sequence token and wasn’t included in the input sentence at all!

We can also see which tokens are the starts and ends of words by checking whether their spans touch. (0, 2) and (2, 6) meet at 2, so they must be part of the same word. This means that Кучето was split into two tokens. Depending on our label matching strategy, we can either take the first or the second one of these tokens. In contrast, (7, 9) and (10, 11) are separated by a gap and therefore represent separate words: ми and е respectively. For single-token words the label matching strategy is irrelevant.

Implementation

In the training data for both POS and NER there is a single label per word, but when we tokenize our input sentence we end up with a sequence that is longer than the list of labels we have. To resolve this we can pad the label list to be the same length as the tokenized input_ids sequence. We can set all tokens that we aren’t going to map to a label to -100 so that they will be ignored when calculating loss. From the transformers documentation:

Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, …, config.vocab_size].
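
As a rough illustration of this padding combined with the first-token or last-token choice, here is a dependency-free sketch. The function names and the label ids are hypothetical, and the offsets are the ones from the example sentence with the final full stop left out. It assumes that subword tokens of the same word have spans that touch, and it is not the code from my notebooks.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def is_special(span):
    """Special tokens have zero-width spans such as (0, 0)."""
    return span[0] == span[1]

def align_labels(offsets, word_labels, use_last=True):
    """Spread one label per word over a longer subword token sequence."""
    labels = [IGNORE_INDEX] * len(offsets)
    word_index = -1
    for i, span in enumerate(offsets):
        if is_special(span):
            continue
        prev = offsets[i - 1] if i > 0 else None
        nxt = offsets[i + 1] if i + 1 < len(offsets) else None
        starts_word = prev is None or is_special(prev) or prev[1] != span[0]
        ends_word = nxt is None or is_special(nxt) or nxt[0] != span[1]
        if starts_word:
            word_index += 1
        if ends_word if use_last else starts_word:
            labels[i] = word_labels[word_index]
    return labels

# Offsets for 'Кучето ми е гладно' (four words, six subword tokens).
offsets = [(0, 0), (0, 2), (2, 6), (7, 9), (10, 11), (12, 15), (15, 18), (0, 0)]
word_labels = [10, 11, 12, 13]  # made-up label ids, one per word

last = align_labels(offsets, word_labels, use_last=True)
first = align_labels(offsets, word_labels, use_last=False)
print(last)   # [-100, -100, 10, 11, 12, -100, 13, -100]
print(first)  # [-100, 10, -100, 11, 12, 13, -100, -100]
```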

Then we can match either all of the first or last tokens from each word to that word’s label. Here’s a Huggingface guide which includes using offset_mapping to map tokens to labels. Alternatively, see the encode_tags_first() and encode_tags_last() methods in my POS tagging fine-tuning notebook. For inference, we won’t have any labels to compare with, but we still need to determine which of the input tokens to classify and which to ignore. This can also be done using offset_mapping:

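import numpy as np

def get_relevant_labels(offset_mapping):
    # 1 marks a token to use for prediction, 0 marks a token to ignore
    relevant_labels = np.zeros(len(offset_mapping), dtype=int)

    # skip the first and last positions, which hold special tokens
    for i in range(1, len(offset_mapping) - 1):
        if is_last_token(offset_mapping, i) and not ignore_mapping(offset_mapping[i]):
            relevant_labels[i] = 1

    return relevant_labels

def is_last_token(offset_mapping, i):
    # a word ends here if the next token doesn't start where this one ends
    return offset_mapping[i][1] != offset_mapping[i+1][0]

def ignore_mapping(mapping):
    # special tokens have zero-width spans
    return mapping[0] == mapping[1]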

This function takes an offset_mapping generated by a tokenizer and checks each token to see if it’s the last token in a word. It then returns an array of the same length as the input_ids list containing only 0s and 1s, where 1 means that the token at this position should be used for prediction and 0 means that it should be ignored. Each token is compared to the next one to see if their spans touch. We can skip the first and last tokens in the sequence because they are always SOS and EOS and will be ignored anyway. is_last_token checks if a specified token is at the end of a word, and ignore_mapping just checks if the start and end of the span are the same; if so, the token is a special token and can be ignored.
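
As a quick check, here is a dependency-free restatement of the same logic (the function name is hypothetical) run on the offsets from the example sentence above.

```python
def select_last_tokens(offsets):
    """Return 1 for the last token of each word and 0 elsewhere."""
    selected = [0] * len(offsets)
    # Skip the first and last positions, which hold special tokens.
    for i in range(1, len(offsets) - 1):
        start, end = offsets[i]
        is_special = start == end
        is_last = end != offsets[i + 1][0]
        if is_last and not is_special:
            selected[i] = 1
    return selected

# Offsets for 'Кучето ми е гладно.' as printed earlier.
offsets = [(0, 0), (0, 2), (2, 6), (7, 9), (10, 11),
           (12, 15), (15, 18), (18, 19), (0, 0)]

selected = select_last_tokens(offsets)
print(selected)  # [0, 0, 1, 1, 1, 0, 0, 1, 0]
```

One thing this makes visible: the span of the full stop, (18, 19), starts exactly where гладно ends, so the check treats the two as a single word and selects the full stop rather than the last subword of гладно. Whether that matters depends on how punctuation is labelled in the dataset being used.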

Training

Architecture

Following this tutorial on pretraining transformer models, I used RoBERTa, a model that was originally introduced in RoBERTa: A Robustly Optimized BERT Pretraining Approach. It’s essentially BERT, but with some changes to improve performance. For pretraining, this means that the next sentence prediction objective that was used in the original BERT has been removed. The masked-language modeling objective has also been modified so that masked tokens are chosen dynamically each time a sequence is fed to the model during training, rather than being fixed once during data preprocessing. This is known as dynamic masking.
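
The difference between the two masking schemes can be sketched in a few lines. This is a simplified illustration, not the pretraining code: every selected token is replaced with a mask token, whereas real BERT-style masking also sometimes keeps the chosen token or swaps in a random one. The token names, the function name, and the 15% masking probability used as a default are assumptions for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return a copy of tokens with a random subset replaced by MASK."""
    return [MASK if rng.random() < mask_prob else token for token in tokens]

tokens = [f"tok{i}" for i in range(20)]  # placeholder tokens
rng = random.Random(0)

# Static masking: mask once during preprocessing, reuse it every epoch.
static_version = mask_tokens(tokens, rng)
static_epochs = [static_version for _ in range(3)]

# Dynamic masking: draw a fresh mask each time the sequence is used.
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(3)]
```

With static masking the model sees the same blanks in a given sequence every epoch, while dynamic masking gives it a new set of blanks to fill each time.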

I initially pretrained a model using RoBERTa-base, which has 12 layers, a hidden size of 768, and 12 attention heads. However, I also tried training a smaller version with only 6 layers and found that performance didn’t suffer at all, so I went with that for the final version.

Training Setup

For pretraining data, I used data from the Leipzig Corpora Collection and OSCAR. The model was trained for about 1.5 million steps (CONFIRM?) on the pretraining data with a batch size of 8 using the masked language modeling objective. A Colab notebook containing the pretraining routine can be found here.

For fine-tuning as a part-of-speech tagger, the Bulgarian dataset from Universal Dependencies was used. I was able to easily parse the CONLL-U data using this parser and then extract the POS tags. I used this dataset for named-entity recognition. The data seems to be taken from the BSNLP 2019 shared task. For both tasks, fine-tuning was performed over 5 epochs on the relevant dataset with a learning rate of 1e-4. The relevant Colab notebooks are available for both POS tagging and NER.

Results

Below is a comparison of model accuracy with various configurations on the POS tagging and NER test sets. The Token Mapping column shows whether the labels were mapped to the first or last token of each word in the input sequence.

Part-Of-Speech Tagging

RoBERTa-small

Model	Token Mapping	Accuracy
roberta-small-pretrained	first	97.75%
roberta-small-pretrained	last	98.10%
roberta-small-no-pretraining	first	92.45%
roberta-small-no-pretraining	last	93.13%

RoBERTa-base

Model	Token Mapping	Accuracy
roberta-base-pretrained	first	97.40%
roberta-base-pretrained	last	97.65%
roberta-base-no-pretraining	first	91.67%
roberta-base-no-pretraining	last	92.93%

Named-Entity Recognition

RoBERTa-small

Model	Token Mapping	Accuracy
roberta-small-pretrained	first	98.52%
roberta-small-pretrained	last	98.52%
roberta-small-no-pretraining	first	95.75%
roberta-small-no-pretraining	last	95.69%

RoBERTa-base

Model	Token Mapping	Accuracy
roberta-base-pretrained	first	98.61%
roberta-base-pretrained	last	98.56%
roberta-base-no-pretraining	first	95.66%
roberta-base-no-pretraining	last	95.62%

Observations

Interestingly, mapping the label to the last token of each input word produced very slightly better results in all POS tagging tests. However, it didn’t provide any benefit for NER, and even performed slightly worse in some cases. This makes sense since, while key morphological information is often contained at the end of a word (which helps us with POS tagging), there is no specific part of a name which identifies it as such: hence the near-identical performance of training on the first token and last token for NER.

Unsurprisingly, pretrained models outperform randomly initialised models. I also found that the randomly initialised models didn’t benefit from training for more epochs, so the benefit of pretraining here was more than just faster convergence on downstream tasks.

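More surprisingly, base models didn’t clearly outperform small models, despite having twice as many layers: they scored lower in every POS tagging configuration and stayed within about 0.1 percentage points of the small models on NER. They also appeared to converge more slowly during training.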

One possible explanation for the lack of improvement from the larger roberta-base model is the small size of the fine-tuning datasets. Perhaps the larger model would have benefitted from more data. Unfortunately I wasn’t able to find any more data, but I could perhaps have used synonym replacement as a data augmentation strategy in the case of POS tagging. In theory, we could replace words with their synonyms as long as the synonym is the same part of speech as the original word. Other augmentation strategies like back-translation probably wouldn’t work, as the new sentence wouldn’t be guaranteed to have the same correct labels as the original.

Generating Questions Using Transformers

2020-07-30T00:00:00+00:00

Generating Questions Using Transformers

As someone who has both taught English as a foreign language and tried learning languages as a student, I know that it’s important to find interesting things to read when practicing reading comprehension. The internet is of course a great source of material. However, one difficulty when attempting to study using material you find online is that it’s not always easy to test your understanding. In order to get some feedback, you either have to find a teacher who will quiz you, or instead use a textbook which has some pre-written questions and answers. But a teacher is not always on hand, and using textbooks significantly limits the range of reading material you can use.

The original goal of this project was to create a system to allow independent learners to test themselves on a set of questions about any text that they choose to read. This means that a learner would be able to pick texts that are about topics they find interesting, which will motivate them to study more. In order to achieve this, I decided to train a neural network to generate questions. Ideally, I would like to have done this in one of my target languages (Japanese or Bulgarian), but I decided it would be simplest and most effective to use English to begin with due to the availability of large datasets in English, and because it would be easiest for me to evaluate the quality of outputs in my native language.

Question-Generation (QG) is an area of Natural Language Processing (NLP) which involves language generation. This distinguishes it from language comprehension tasks like named entity recognition, sentiment analysis, or extractive question answering. At a basic level, QG is a type of language modeling, which means assigning conditional probabilities to a sequence of words or tokens. This means that QG is similar to other NLP tasks like abstractive summarisation or sentence completion.

Some research has been done into QG, but it appears to be less popular than some other areas such as Question Answering (QA). We can easily see this by comparing the number of QG papers on paperswithcode.com with the number of QA papers. Because of this, there aren’t many resources such as public datasets or benchmarks specifically for QG. However, if we think of QG as a reversed QA task, then we can simply use QA datasets with the input fields and target fields reversed. This is how some previous research into QG has been done.

Gathering a Dataset

In order to train a QG model, I needed to get hold of some question and answer data. Luckily, there are a large number of public QA datasets. In the end, I decided to use data from SQuAD, RACE, CoQA, and MSMARCO.

SQuAD is a dataset containing reading comprehension questions and answers relating to Wikipedia articles. The questions are exactly in the style that I wanted my model to generate. I used SQuAD 2.0, which contains some unanswerable questions. I didn’t want my model to generate unanswerable questions so I filtered those out.
RACE is a dataset collected from English exams for Chinese middle school and high school students. The questions all relate to a passage of text, making it suitable for my project. Unfortunately some of the questions are cloze-style (fill-the-blank) rather than actual question sentences, so I filtered those out. The questions also have multiple-choice answers, so I dropped the incorrect answers and just kept the correct answer corresponding to each question.
CoQA is a more conversational-style QA dataset. It contains sets of questions and answers generated by people having conversations about a text. This dataset contains lots of good questions to learn from, but due to the conversational-style of the questions, some of them were not usable in my case. This is because some questions contain references to things previously said in the conversation. This leads to questions which don’t contain enough context for a reading comprehension format. For example “who was he?”: we can’t answer this question unless we know who “he” refers to. Another example is “and what else?”: this question makes no sense if we haven’t previously started listing things.
MSMARCO is a dataset containing questions from Bing searches and corresponding answers with supporting texts. This dataset also contains lots of good questions, but many of them are not in full grammatical question sentences. This is because people typing questions into a search engine only really need to type some key words rather than a full grammatical question. In order to resolve this, I filtered the dataset to only include examples which start with a list of possible question words; for example “who…?” or “does…?” Even some of these turned out to be not real question sentences though. People often type phrases like “how to bake a cake” into a search engine. In this case, I replaced the “how to” with “how do you” creating the grammatical question sentence “How do you bake a cake?”
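
The MSMARCO filtering described above can be sketched roughly as follows. The list of question words and the function name are assumptions for illustration (only “who” and “does” are named above), so this is not the exact filter I used.

```python
# Assumed list of question words; only "who" and "does" come from the text above.
QUESTION_WORDS = ("who", "what", "when", "where", "why", "which", "how",
                  "does", "do", "did", "is", "are", "can")

def clean_query(query):
    """Return a question sentence, or None if the query should be dropped."""
    text = query.strip().rstrip("?")
    words = text.lower().split()
    if not words or words[0] not in QUESTION_WORDS:
        return None
    # Turn "how to ..." search phrases into grammatical questions.
    if words[:2] == ["how", "to"]:
        text = "how do you " + " ".join(text.split()[2:])
    return text[0].upper() + text[1:] + "?"

print(clean_query("how to bake a cake"))    # How do you bake a cake?
print(clean_query("cheap flights london"))  # None
```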

After filtering the datasets, I concatenated the answer and context fields into the format of answer_token <answer> context_token <context>. Once concatenated the data could then be encoded and fed into a neural network. The question field was kept as a label for calculating loss during training. The final dataset contained about 250,000 examples taken from the 4 datasets mentioned.
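
A minimal sketch of this formatting step is shown below. The marker strings and the function name are assumptions chosen for the example, as is the sample text.

```python
# Assumed marker strings, for illustration only.
ANSWER_TOKEN = "<answer>"
CONTEXT_TOKEN = "<context>"

def build_example(answer, context, question):
    """Pair the concatenated answer and context with the target question."""
    source = f"{ANSWER_TOKEN} {answer} {CONTEXT_TOKEN} {context}"
    return {"source": source, "target": question}

example = build_example(
    answer="in 1889",
    context="The Eiffel Tower was completed in 1889.",
    question="When was the Eiffel Tower completed?",
)
print(example["source"])
# <answer> in 1889 <context> The Eiffel Tower was completed in 1889.
```

The source string is what gets encoded and fed to the model, and the target string is what the model is trained to produce.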

Training a Model

The vast majority of modern NLP systems are based on the Transformer architecture introduced in Attention Is All You Need. These days there is a large variety of different architectures. After reading about several recent architectures, I settled on Google’s T5 model, which was introduced in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. The basic idea behind T5 is reframing all NLP tasks as sequence-to-sequence tasks. For example, for summarisation, the model takes the text to be summarised as an input sequence, and outputs the summary as a sequence. For sentiment analysis, the model takes the text to be analysed as an input sequence, and outputs a sequence which states the sentiment of the text. This is useful because although the model wasn’t designed or pretrained with the goal of QG in mind, it can be easily repurposed for QG: we can simply use the answer and context as an input, and train the model to give us a question as the output sequence.

The HuggingFace Transformers library allows us to use a wide range of state-of-the-art transformer models, even allowing us to load pretrained weights. This made it easy to load a pretrained T5-base model and set it up for training with my QG dataset. We can easily load a pretrained model and tokenizer with 3 lines of code like this:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

Once we have the model and tokenizer, we can easily encode inputs, pass them into the model, and generate outputs:

input_text = "..."  # concatenated answer and context here
encoded_input = tokenizer(input_text, return_tensors="pt")
outputs = model(
  input_ids=encoded_input['input_ids'],
  attention_mask=encoded_input['attention_mask'],
  labels=masked_labels)  # this argument was called lm_labels in older versions

masked_labels here refers to our encoded target (question) sequence with any padding replaced with the value -100. This indicates to T5 that it should ignore that part of the target when calculating loss. From the documentation:

All labels set to -100 are ignored (masked), the loss is only computed for labels in [0, …, config.vocab_size]

If we don’t do this then the loss values will be incredibly low as any matching padding will count as a correct prediction! I actually made this mistake at first, and found that the model always generated one-word outputs followed by 511 pad tokens (the maximum sequence length is 512). Correctly masking the padding in the label sequence solves this issue. I generated the label mask like this:

def mask_label_padding(labels):
    MASK_ID = -100
    labels[labels==tokenizer.pad_token_id] = MASK_ID
    return labels
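
To see why unmasked padding drags the loss down, here is a small arithmetic sketch in plain Python. The token ids, the pad id, and the per-token loss values are all invented; padding positions are given a loss of zero to mimic a model that has learned to predict padding perfectly.

```python
IGNORE_INDEX = -100
PAD_ID = 0  # hypothetical pad token id for this example

def mean_loss(token_losses, labels):
    """Average per-token losses, skipping positions labelled IGNORE_INDEX."""
    kept = [loss for loss, label in zip(token_losses, labels)
            if label != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Four real target tokens followed by six padding tokens.
labels = [71, 19, 8, 58] + [PAD_ID] * 6
token_losses = [2.0, 2.0, 2.0, 2.0] + [0.0] * 6
masked_labels = [IGNORE_INDEX if label == PAD_ID else label for label in labels]

unmasked = mean_loss(token_losses, labels)
masked = mean_loss(token_losses, masked_labels)
print(unmasked, masked)  # 0.8 2.0
```

With 4 real tokens out of 512, the effect would be far more extreme than in this toy case, which is consistent with the very low loss values I saw.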

When we feed an input into this model, the loss is calculated automatically too! So after calling model(), we get the loss and the model’s predictions:

loss, prediction_scores = outputs[:2]

I split the training data into 85% training set and 15% validation set. I trained the model for 20 epochs over the dataset using a learning rate of 0.001 (which was the learning rate used for fine-tuning in the T5 paper). Because T5-base is quite a large model, and because I was working with limited GPU memory, I was only able to use a batch size of 4. This meant that training took about a week! In addition, because I was training on Google Colab, the session timed out every 24 hours meaning I had to regularly save and reload.

The code from the training notebook can be found on my GitHub.

Evaluating Questions

I was initially worried that the model might not consistently create grammatical question sentences. This would be a problem since the system was intended for language learners, who need correct examples to learn from. Any incorrect sentences might cause confusion or reinforce bad habits. However, when I tested the model on some sample texts, I found that the grammar was mostly consistent.

On the other hand, I noticed that the model would sometimes generate questions with either no relevance to the answer, or no relevance to the context. The latter were particularly common. An example of this is a question generated from an article about some news relating to Hong Kong and big tech companies. Instead of asking about what happened in the story, the model simply generated the question “what is Facebook?” While this question is grammatically correct and answerable, it is not a reading comprehension question relating to the text, because the text did not contain an explanation of what Facebook is.

Another issue was that the model generated some questions which were tautological or contained the answer within the question. For example, from a text about some events happening in the US, the model generated:

Q: Where is Georgia?

A: Georgia

This is both irrelevant, because the article didn’t explain where Georgia is, and obviously redundant, because the answer doesn’t add anything that we didn’t already know from the question.

To deal with these issues, I decided to train another model which would evaluate the generated questions and answers. I decided to use a pretrained version of BERT for this task. BERT was introduced in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. BERT is a transformer model pretrained using a cloze-style task called masked language modeling, which is basically filling in the blanks in sentences. Using this as a pretraining objective has the advantage of forcing the model to learn bidirectional representations, since it must consider what comes before and after the blank to make an accurate prediction. This is in comparison to traditional language modeling objectives, which require the model to predict the next word in a sequence, only learning context from one direction.

Bidirectional representations mean that BERT is good for language comprehension tasks, such as evaluating questions and answers! I also chose BERT because of one of its other pretraining objectives, called Next Sentence Prediction (NSP). NSP involves taking two sentences and predicting whether or not the second sentence follows the first.

For my project, I repurposed the NSP objective by setting the first sentence as a question and the second sentence as the answer to the question. I used the same [CLS] and [SEP] tokens as were used in pretraining.

# BERT pretraining:
"[CLS] <first sentence> [SEP] <second sentence> [SEP]"

# QA evaluator fine-tuning:
"[CLS] <question> [SEP] <answer> [SEP]"

To fine-tune the model, I reused the dataset from the question generator, but removed the context. During training, 50% of the time the model would be given the correct QA pair, but the other 50% of the time the answer would be corrupted. I defined two corruption operations: the first one was to replace the answer with another random irrelevant answer from the dataset, and the second was to take a named entity from the question and copy it into the answer. The training objective was then to predict whether the answer had been corrupted or not.
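
Here is a minimal sketch of how such training pairs could be produced. The function names are hypothetical, the named entities are assumed to have been extracted from the question beforehand, and the even split between the two corruption operations is an assumption rather than something taken from my training code.

```python
import random

def corrupt_random(answer, all_answers, rng):
    """Swap the answer for a different one drawn from the dataset."""
    candidates = [a for a in all_answers if a != answer]
    return rng.choice(candidates)

def corrupt_with_entity(question_entities, rng):
    """Use a named entity copied from the question as the answer."""
    return rng.choice(question_entities)

def make_training_pair(question, answer, all_answers, question_entities, rng):
    """Return (question, answer, label); label 1 means the answer is intact."""
    if rng.random() < 0.5:
        return question, answer, 1
    # Assumed: pick between the two corruption operations with equal chance.
    if question_entities and rng.random() < 0.5:
        return question, corrupt_with_entity(question_entities, rng), 0
    return question, corrupt_random(answer, all_answers, rng), 0

rng = random.Random(0)
all_answers = ["Tokyo", "in 1889", "a matchmaker"]
pairs = [
    make_training_pair(
        "Which city in Japan has the largest population?",
        "Tokyo", all_answers, ["Japan"], rng,
    )
    for _ in range(100)
]
```

The second operation is what teaches the evaluator to reject pairs like the “Where is Georgia?” example above, where the answer only repeats something from the question.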

Before fine-tuning on this objective, the pretrained BERT model was only able to achieve 55% accuracy on the validation set, which isn’t much better than a random guess. But after training it was able to get over 90% which, while not perfect, I decided was good enough to filter out some of the bad QA pairs.

The code for the QA evaluator training can be found here.

The Final System Pipeline

Now we’ve discussed a system containing two models: the first of which takes answers and generates questions, and the second of which evaluates whether or not those QA pairs are valid. An earlier version of the system also included a third model which summarised the text in order to extract the best sentences to use as answers to feed into the QG model. But I found this to be too much filtering overall: both filtering sentences before question generation, and filtering QA pairs after generation. The result was that the model was only able to output a very small number of questions about each article.

As a result, I decided to cut the summarisation model. This enabled me to feed a larger number of candidate answers into the QG model, giving the evaluator more QA pairs to sift through.

The final system splits the text into sentences to be used as candidate answers. Each candidate answer is then concatenated with the text, encoded, and passed into the QG model. The outputted question is then concatenated with its corresponding answer and passed to the QA evaluator model. The evaluator outputs a score predicting how likely it is that the QA pair is valid. The QA pairs are then ordered by their evaluation score, and only the top N pairs are presented to the end-user.

The code for this is available here.
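
In outline, the generate-then-rank flow described above looks something like the sketch below. The two models are replaced by stand-in functions, the sentence splitting is deliberately naive, and all of the names are hypothetical, so this only shows the control flow and not the real implementation.

```python
def generate_ranked_questions(text, generate_question, score_pair, top_n=10):
    """Generate one question per sentence and keep the best-scoring pairs."""
    # Naive sentence splitting, for illustration only.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    scored = []
    for answer in sentences:
        question = generate_question(answer, text)
        scored.append((score_pair(question, answer), question, answer))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(question, answer) for _, question, answer in scored[:top_n]]

# Stand-ins for the QG model and the QA evaluator.
def fake_generate(answer, context):
    return f"What is said about {answer.split()[0]}?"

def fake_score(question, answer):
    return len(answer)

text = "Cats sleep a lot. Dogs bark. Birds sing in the morning."
top_pairs = generate_ranked_questions(text, fake_generate, fake_score, top_n=2)
print(top_pairs[0])  # ('What is said about Birds?', 'Birds sing in the morning')
```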

Multiple-Choice Questions

One addition to this system is multiple-choice questions. Multiple choice questions are great for quick tests or for lowering the difficulty of a test, since the student only needs to pick an answer from a predetermined set of answers. Naively, given a question and answer, we could just add random alternative phrases from the text to serve as options. This usually results in incredibly easy questions though, because only the correct answer has any relevance to the question being asked.

In order to make multiple-choice answers more difficult to distinguish between, we can use Named Entity Recognition (NER). In my system, this was done using spaCy’s built-in NER. The entities are extracted from the text and used as candidate answers in the QG model. The alternative answers are then selected from answers of the same entity type. For example, given the following QA pair:

Q: Which city has the largest population in the world?

A: Tokyo

We can identify “Tokyo” as an entity of type GPE (for Geo-Political Entity), and then search the text for others of the same type. The final question will then present the user with 4 geopolitical entities (e.g. other cities, or countries), rather than 1 city and 3 completely random phrases. This is of course only possible if there are 3 other locations mentioned in the text! If there are only two other GPE entities in the text, the empty slot will be filled by another random entity. For example:

Q: Which city has the largest population in the world?

A: 1. Kumamoto
   2. Shinzo Abe
   3. Tokyo
   4. Japan
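
The option selection can be sketched as follows. The entity list is written by hand rather than produced by spaCy, the function name is hypothetical, and the sketch assumes that duplicate entities have already been removed.

```python
import random

def build_options(answer, entities, rng, num_options=4):
    """Pick distractors of the answer's entity type, then fill with others."""
    labels = dict(entities)
    answer_label = labels[answer]
    same_type = [text for text, label in entities
                 if label == answer_label and text != answer]
    other = [text for text, label in entities if label != answer_label]
    rng.shuffle(same_type)
    rng.shuffle(other)
    # Prefer same-type entities; fall back to other types for empty slots.
    distractors = (same_type + other)[:num_options - 1]
    options = distractors + [answer]
    rng.shuffle(options)
    return options

# Hand-written (text, label) pairs standing in for NER output.
entities = [
    ("Tokyo", "GPE"),
    ("Kumamoto", "GPE"),
    ("Japan", "GPE"),
    ("Shinzo Abe", "PERSON"),
    ("Toyota", "ORG"),
]

options = build_options("Tokyo", entities, random.Random(0))
```

Here there are only two other GPE entities, so the fourth option has to come from a different entity type, as in the example above.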

My final system allows the user to choose between full-sentence answers, multiple-choice answers, or a mix of both. I’ve found that the full-sentence QA pairs tend to be of better quality. This is likely because the training data mostly consisted of full-sentence answers. The QA evaluator model agrees with me, and so when a mix of both question styles is selected, the output tends to include mostly full-sentence QA pairs (as they were ranked higher than the multiple-choice ones).

Example

A full example notebook can be found here. It should be possible to run this notebook in Google Colab and generate questions from any text file you like.

Here’s an example of some generated questions from a BBC article about a new Netflix show about arranged marriages in India. We can instantiate the QuestionGenerator and use it like this:

from questiongenerator import QuestionGenerator, print_qa

qg = QuestionGenerator()

# read in the article text from a source file ("article.txt" is a placeholder path)
with open("article.txt", encoding="utf-8") as f:
    article = f.read()

qa_list = qg.generate(
    article,
    num_questions=10,
    answer_style='all'
)

print_qa(qa_list)

Initialising the Question Generator will automatically initialise the QA Evaluator too, and questions will be automatically ranked unless use_qa_eval=False. This is the output:

Generating questions...

Evaluating QA pairs...

1) Q: What would have been offended if Sima Aunty spoke about?
   A: In fact, I would have been offended if Sima Aunty was woke and spoke about choice, body positivity and clean energy during matchmaking. 

2) Q: What does she think of Indian Matchmaking?
   A: " Ms Vetticad describes Indian Matchmaking as "occasionally insightful" and says "parts of it are hilarious because Ms Taparia's clients are such characters and she herself is so unaware of her own regressive mindset". 

3) Q: What do parents do to find a suitable match?
   A: Parents also trawl through matrimonial columns in newspapers to find a suitable match for their children. 

4) Q: In what country does Sima taparia try to find suitable matches for her wealthy clients?
   A: 1. Sima Aunty 
      2. US (correct)
      3. Delhi 
      4. Netflix 

5) Q: What is the reason why she is being called out?
   A: No wonder, then, that critics have called her out on social media for promoting sexism, and memes and jokes have been shared about "Sima aunty" and her "picky" clients. 

6) Q: who describes Indian Matchmaking as "occasionally insightful"?
   A: 1. Kiran Lamba Jha 
      2. Sima Taparia 
      3. Anna MM Vetticad 
      4. Ms Taparia's (correct)

7) Q: In what country does Sima taparia try to find suitable matches?
   A: 1. Netflix 
      2. Delhi 
      3. US 
      4. India (correct)

8) Q: What is the story's true merit?
   A: And, as writer Devaiah Bopanna points out in an Instagram post, that is where its true merit lies. 

9) Q: What does Ms Vetticad think of Indian Matchmaking?
   A: But an absence of caveats, she says, makes it "problematic". 

10) Q: Who is the role of matchmaker?
    A: Traditionally, matchmaking has been the job of family priests, relatives and neighbourhood aunties. 

Most of the questions are reasonable, but there are a few awkward examples here. The first question doesn’t really make sense, and should say something like “What would have been offensive for Sima Aunty to speak about?” Question #6 also shows a problem with the multiple-choice answers. The multiple-choice answer system does filter out duplicate entities, but not variations of the same name. “Ms Taparia’s” and “Sima Taparia” are considered two separate PERSON entities even though they refer to the same person. Question #8 gives us an example of a valid, if vague, question. Unfortunately the example answer here doesn’t actually answer the question being asked. The QA Evaluator doesn’t seem to have picked up on this either.

We could solve the issue from question #6 by improving the system’s ability to recognise variations of the same entity name. We could also filter out question #8 for being irrelevant by training the QA Evaluator more robustly. But I don’t think it’s clear how we could solve the problem of question #1 without training a better QG model.

Applications of the System

As stated, the original goal of this project was to make a system for independent language learners to generate questions to test themselves with. But I think there are some other possible applications of this system too. Tests like this are also performed in other types of classes to test students’ reading memorisation abilities. Teachers could potentially use a system like this to generate some questions about an excerpt from a book, a poem, or some other piece of text for their class. Another potential application is in generating QA data for training or evaluating models on QA tasks. One could potentially use this kind of system for data-augmentation, or perhaps generating a whole dataset from scratch.

Unexplored Directions

One challenge I haven’t tried to tackle is automatic evaluation of user inputs in the case of full-sentence answers. This is an issue because the user could potentially type a variety of answers, all with the same meaning and truth-value, but with different words and syntax. One simple way to deal with this would be to ask the user to select a sentence from the text to use as an answer rather than typing the answer themselves. A much cooler solution would be to include some kind of machine learning system which evaluated whether the user’s input is semantically equivalent to the correct answer or not.

Another unexplored idea is question difficulty. The model is capable of asking very simple questions which only require a quick scan of the text to find a name, date, or location. But it’s also capable of asking more complex questions about people’s opinions or the causes of events. A nice feature would be something that can assign a difficulty value to a question. This would allow us to filter questions by difficulty level depending on the user.

Finally, it would be cool to implement the same kind of QG system for other languages. I’d like to have something like this for Japanese, because I’m sick of all of the textbooks that I have.

Another Question Generation Project

When I was in the process of uploading my code and writing this blog, I came across this GitHub repo by Suraj Patil, which also uses T5 for question generation! They also appear to have fine-tuned using data from SQuAD. One interesting difference from this project is their use of T5 for multiple tasks: in particular, for answer extraction from the target text, and for QA as well as QG. They also go into more detail about how the models perform on various metrics like BLEU and ROUGE.

Using Spotipy to Collect Track Data

2020-07-30T00:00:00+00:00

For a recent project on classifying music genres, I needed to collect a large dataset of labelled tracks. The Spotify API is ideal for this because, along with a variety of tabular track data, you can download 30-second track samples for the majority of tracks. An easy way to use the Spotify API in Python is through Spotipy.

In this post I’ll show how to use Spotipy to get track data, and how to download track samples. We’ll also discuss a couple of genre labelling strategies and their issues.

Using Spotipy

The first thing you need to do is register a Spotify Developer account. From the Dashboard, click “create an app”, choose a name, write a description and agree to the terms. This will give you a Client ID and a Client Secret which you can use to gain access.

Using Spotipy, we can now log in like this:

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

def spotify_login(cid, secret):
    client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret) 
    return spotipy.Spotify(client_credentials_manager=client_credentials_manager)

cid = "<Your Client ID>"
secret = "<Your Client Secret>"

sp = spotify_login(cid, secret)

Getting data

We can now use sp’s methods to query Spotify. For example, if I want to find Radiohead, I can call sp.search() like this:

result = sp.search("Radiohead", type='artist')

result is a dictionary of all the search results. result['artists']['items'] contains a list of artists, of which the Radiohead we are looking for is the first element. We can extract Radiohead’s Artist ID and then use it to find other Radiohead-related data. To get a list of their albums we can query Spotify again, this time calling sp.artist_albums() and passing Radiohead’s Artist ID as an argument:

artist_id = result['artists']['items'][0]['id']

albums = sp.artist_albums(artist_id)

for album in albums['items']:
    print(album['name'])

Which prints the following:

OK Computer OKNOTOK 1997 2017
A Moon Shaped Pool
TKOL RMX 1234567
The King Of Limbs
In Rainbows
In Rainbows (Disk 2)
Hail To the Thief
Amnesiac
I Might Be Wrong
Kid A
OK Computer
The Bends
Pablo Honey
Ill Wind
Supercollider / The Butcher
Harry Patch (In Memory Of)
Spectre
Daydreaming
Burn the Witch
The Daily Mail / Staircase

Downloading Track Samples

We can also access 30-second track samples from Spotify using each track’s preview_url attribute. Note that not all tracks have this enabled. Let’s say we want to take the first Radiohead album from the list, which is 'OK Computer OKNOTOK 1997 2017', and download track samples from it. We can do this by getting the album’s Album ID and then calling sp.album_tracks(). We can then make a list of URLs like this:

album_id = albums['items'][0]['id']
album_tracks = sp.album_tracks(album_id)
preview_urls = [track['preview_url'] for track in album_tracks['items']]

Now we have the preview urls, we can download them using urlretrieve:

from urllib.request import urlretrieve

directory = "directory/to/save/in"

# Number the tracks from 1 and skip any track that has no preview available
for i, url in enumerate(preview_urls, start=1):
    if url is None:
        continue
    urlretrieve(url, "{}/track{}.mp3".format(directory, i))

This will download the numbered tracks into the directory specified.

Genre Labelling

For my project I needed track samples labelled by genre. Unfortunately Spotify tracks don’t have a genre attribute so I needed to find a more creative way to label them. I tried two methods for labelling tracks: using playlists as labels, and using artist genres as labels.

Playlists as Labels

Users often make playlists with a coherent theme, and this theme is sometimes a particular genre. We can use these themed playlists as collections of labelled tracks. Using this method we have a pretty simple data collection strategy: we just need to define a list of genres to search for, search for playlists in each genre, and label any tracks we find with the corresponding genre label.

The following code prints the first 10 rock playlists and the number of tracks they contain. Note that in sp.search() the maximum limit value is 50. If you want more results, you have to use offset to request further pages.

def get_playlists_by_genre(genre, limit):
    results = sp.search(genre, limit=limit, type='playlist')
    return results['playlists']['items']

def print_playlist_info(playlists):
    for playlist in playlists:
        print('{}: {}'.format(
            playlist['name'], 
            '{} tracks'.format(playlist['tracks']['total']))
        )

playlists = get_playlists_by_genre('rock', 10)
print_playlist_info(playlists)

Which prints:

Rock Classics: 150 tracks
Rock This: 50 tracks
Rock Hard: 100 tracks
Rock en Español: 60 tracks
Rock Drive: 100 tracks
Rock & Roll Summer: 71 tracks
Rock Party: 50 tracks
Rock Ballads: 75 tracks
Rock En Espanol 80s 90s 2000s: 85 tracks
Rock Covers: 70 tracks

We could replace print_playlist_info() with some code containing urlretrieve if we want to download the track data instead. We can also store data about the tracks in a Pandas DataFrame and save it to a CSV file.
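To go past the 50-result limit mentioned earlier, the offset parameter can be used to page through results. The following is a minimal sketch rather than the code from my project: the function name and the page_size and total parameters are made up for this example, and sp is assumed to be the logged-in Spotipy client.

```python
def get_all_playlists_by_genre(sp, genre, total, page_size=50):
    """Collect up to `total` playlists for a genre by paging through search results."""
    playlists = []
    offset = 0
    while len(playlists) < total:
        # Never ask for more than we still need, or more than one page allows
        limit = min(page_size, total - len(playlists))
        results = sp.search(genre, limit=limit, offset=offset, type='playlist')
        items = results['playlists']['items']
        if not items:
            break  # no more results available
        playlists.extend(items)
        offset += len(items)
    return playlists
```

Calling get_all_playlists_by_genre(sp, 'rock', 120) would then issue three searches with offsets 0, 50 and 100.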

Issues

An issue with this approach is that we can’t guarantee the coherence of the tracks in the playlist. Most Spotify playlists are user-generated, and there is no obligation for tracks to be a single genre, even if the playlist is named in such a way as to indicate that they are.

Artist Genres as Labels

While individual tracks don’t come with genre tags on Spotify, artists do! So instead of relying on the people making playlists to label our data for us, we can just look for artists that have a given genre tag and collect data about their tracks. Unfortunately we can’t just do sp.search('metal', type='artist') because this will return a list of bands who have metal in their band name, not a list of bands that have metal as a genre tag. (Having said this, there does seem to be a bunch of metal bands whose name includes the word “metal”!)

If we can’t search for artists by genre, how do we find them? One solution is to search by playlist again, and then check each artist in the playlist to see if they have the genre tag we’re looking for. If they do then we can collect their music and label it accordingly.

In addition to this, we can use Spotify’s Related Artists feature to find more artists with the same tags. We can just keep iterating recursively through related artists until we can’t find any more that have the tags we’re looking for. One difficulty with this approach is that related artists are often quite tightly interconnected, meaning that we could end up going round in circles through the same artists endlessly. To solve this, we can build a list of artists as we go: if we already have an artist in our list we can ignore them and stop exploring in that direction, but if we haven’t seen them yet we can check their genre tags and delve deeper into their related artists.

The code for this can be found here. The recursive part of the code is the following:

def add_related_artists(artist_id, genre_artists, existing_artists):
    if not artist_id:
        return
    related_artists = sp.artist_related_artists(artist_id)['artists']
    for related_artist in related_artists:
        if last_added < search_limit and add_artist(related_artist['id'], genre_artists, existing_artists):
            add_related_artists(related_artist['id'], genre_artists, existing_artists)

Given an Artist ID, we query Spotify for their related artists. Then we iterate over the related artists and try adding them to our list. add_artist() returns True if we are seeing the artist for the first time, in which case it will be added to the list, and False otherwise. As long as artists are being added, we keep calling add_related_artists() recursively to find more.
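To show the "ignore artists we have already seen" idea in isolation, here is a self-contained sketch that runs on a toy graph instead of the Spotify API. It is not the code from the repository: the function name and the get_related and has_genre callbacks are hypothetical stand-ins, and it uses an explicit stack rather than recursion so that a large graph cannot hit Python's recursion limit.

```python
def collect_genre_artists(seed_ids, get_related, has_genre, search_limit=1000):
    """Walk the related-artists graph, keeping artists that carry the genre tag."""
    seen = set()          # every artist we have already looked at
    genre_artists = []    # artists that matched the genre tag
    stack = list(seed_ids)
    while stack and len(genre_artists) < search_limit:
        artist_id = stack.pop()
        if artist_id in seen:
            continue      # already visited: stop exploring in this direction
        seen.add(artist_id)
        if not has_genre(artist_id):
            continue      # wrong genre: do not follow its related artists
        genre_artists.append(artist_id)
        stack.extend(get_related(artist_id))
    return genre_artists

# Toy graph with a cycle (a -> b -> c -> a) standing in for related artists
related = {'a': ['b', 'd'], 'b': ['c'], 'c': ['a'], 'd': ['e'], 'e': []}
metal = {'a', 'b', 'c', 'e'}
found = collect_genre_artists(['a'], lambda i: related[i], lambda i: i in metal)
print(sorted(found))
```

Despite the cycle, each artist is visited once. Artist 'e' is never reached because the only path to it goes through 'd', which does not have the tag.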

Issues

However, this approach also has issues. Each artist has a variable length list of genre tags, so it’s not clear which is the “correct” label. Artists sometimes make music which crosses genre boundaries, or make several different styles of music over the course of their careers. Since the tags are related to the artist and not to individual tracks, it’s difficult to determine exactly which tracks should have which labels.

One solution would be to allow for multiple labels. We could simply add the whole list of labels to each track. But if we want to train a model to classify the tracks, we’ll need to build a more complex model which is capable of handling multiple correct labels.

Another solution is to just enforce a one-label-per-artist policy. We can predefine a list of genres that we are looking for, and if we find an artist with that tag, we assign all their music that label. This is a less precise solution for the reasons previously mentioned, and there’s also a chance that an artist will fit into several of our chosen classes, so we’d need to be careful to remove any artists that appear multiple times under different genres.
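A minimal sketch of that clean-up step, assuming the collected data is held as a dictionary from genre to artist names (the data structure, function name, and artist names here are invented for illustration):

```python
from collections import defaultdict

def single_genre_artists(genre_to_artists):
    """Keep only artists that were collected under exactly one genre."""
    artist_genres = defaultdict(set)
    for genre, artists in genre_to_artists.items():
        for artist in artists:
            artist_genres[artist].add(genre)
    # Artists with more than one genre are ambiguous, so they are dropped
    return {
        artist: next(iter(genres))
        for artist, genres in artist_genres.items()
        if len(genres) == 1
    }

collected = {
    'death metal': ['artist_1', 'artist_2'],
    'doom metal': ['artist_2', 'artist_3'],
}
labels = single_genre_artists(collected)
print(labels)
```

Here artist_2 appears under both genres and is removed, leaving one label each for artist_1 and artist_3.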

Conclusion

Using Spotipy, it’s fairly simple to collect and save both tabular data and mp3 track samples. It’s not so easy to label this data for a genre classification task. In the end, for my project, I decided to go with the simpler solution of enforcing a one-label-per-artist policy. This enabled me to build a multi-class classifier and train it on data which has one correct label per example. The full project repository can be found here.

Classifying Heavy Metal Subgenres with Mel-spectrograms

2020-07-30T00:00:00+00:00

Distinguishing between broad music genres like rock, classical, or hip-hop is usually not very challenging for human listeners. However, being able to tell apart the subgenres of these broad categories is not always so simple. Someone who is not already a big fan of house music probably won’t be able to distinguish deep house from tech house, for example. The subtle differences between subgenres only become apparent when you become a more experienced listener. Training a neural network to classify music genres is not a new idea, but I thought it would be interesting to see if one could be trained to classify subgenres on a more precise level.

I got the original idea from reading a blog post by Priya Dwivedi. In the blog, music samples are converted to mel-spectrograms, and then fed into neural networks with both convolutional and recurrent layers in order to generate a prediction. I decided to try a similar method, but with the goal of classifying subgenres instead of top-level genres. In order to achieve this, I needed to build a dataset.

Data Collection

In the last post I discussed collecting track data and mp3 samples from Spotify to generate a dataset. I collected track data for 100,000 songs using Spotipy and downloaded a 30-second track sample for each one. The data collection strategy was the one referred to as Artist Genres as Labels in my previous post.

I decided to focus mostly on a single parent genre in order to limit the number of possible subgenres to prevent the number of classes getting out of hand. I chose metal since I’m a fan of the genre and feel confident classifying metal subgenres. I made a list of 20 subgenres, mostly from metal but also including subgenres from punk, rock, and hardcore where those genres border on metal.

For comparison I also collected another dataset of top level genres like classical, rock, and folk. This dataset is smaller, at only about forty thousand examples, and contains only ten classes.

After collecting track data from artists for all of the subgenres and downloading track samples for them, I converted the samples from mp3 files to mel-spectrogram pngs. A spectrogram is a visual representation of sound frequencies over time. Usually the Y-axis is frequency and the X-axis is time, with the brightness of each point showing how loud that frequency is at that moment (often measured in decibels). Mel-spectrograms are a type of spectrogram where the Y-axis uses the Mel scale. This means that the frequency axis has been rescaled to more accurately reflect the way that humans hear sounds.
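To give a feel for what the Mel scale does to the frequency axis, here is a small sketch using one commonly quoted conversion formula. Other variants exist, and this is not necessarily the one used by the tools that generated my spectrograms.

```python
import math

def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to mels (one common formula; other variants exist)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps in Hz shrink in mels as the frequency gets higher
low_step = hz_to_mel(200) - hz_to_mel(100)
high_step = hz_to_mel(10100) - hz_to_mel(10000)
print(low_step > high_step)
```

The same 100 Hz step covers far more of the Mel axis at low frequencies than at high ones, which mirrors the fact that we are better at hearing pitch differences between low notes than between very high ones.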

Here are some examples of processed mel-spectrograms which I generated:

Techno Classical Jazz

The vertical axis is frequency, and the horizontal axis is time (thirty seconds of each track). I didn’t label or colourise these images because they were originally meant for a neural network rather than a human audience. At a glance, we can see some differences. Techno is very uniform, whereas jazz is quite irregular. Classical seems to softly change over time, whereas jazz appears to more suddenly change.

Architecture

The model is based on Convolutional Recurrent Neural Networks for Music Classification by Keunwoo Choi et al. The model includes both convolutional layers and recurrent layers, allowing us to take advantage of the benefits of both.

CNNs

Convolutional Neural Networks (CNNs) can be used to capture both low and high-level features of images, and are a standard part of image-recognition systems. In convolutional layers, filters are passed over inputs to generate feature maps. A feature map is like the encoded input image, but can be smaller in size while retaining important information and spatial relationships from the original image. A more detailed description of CNNs can be found in this blog.

Convolutional layers can be stacked to capture information from an image with varying levels of abstraction. Earlier layers might capture low level features like edges or corners, and later layers might instead detect discrete objects like faces or cars. This is useful in the case of classifying spectrograms as, by stacking some convolutional layers, we can capture both the low level relationships between adjacent pixels representing sound at various frequencies, and the high level structure of the spectrogram representing the piece of music as a whole.

RNNs

Recurrent Neural Networks (RNNs) are used to capture sequential relationships in input data, for example the relationships between words in text or between data points in a time series. This is achieved by maintaining a hidden state inside each layer which acts as a “memory” of previous inputs. The hidden state is combined with the new input to produce an output and a new hidden state. This process can be repeated as many times as necessary until the whole sequence has been processed. In theory, this means that the final output should include information from not just the final element of the sequence but all the previous elements too.

In practice, however, standard RNNs’ ability to “memorise” previous inputs is fairly limited, and they are ineffective at processing long sequences. Long Short-Term Memory networks (LSTMs) attempt to resolve this issue by introducing a cell state and a more complex system for determining what information is kept and what is discarded. This allows the network to learn from more long-term dependencies. Here’s a great blog which discusses RNNs and LSTMs in more detail. In the Convolutional Recurrent Neural Network (CRNN) model, a Gated Recurrent Unit (GRU) is used instead of an LSTM. A GRU is a slightly more lightweight alternative which retains most of the advantages.

The reason for introducing an RNN component into the network is that spectrograms are a representation of time series data: in this case sound at various frequencies over time. By feeding the spectrogram from left to right into a GRU, we can hopefully capture some of the sequential nature of a piece of music.

CRNN

The CRNN model is simply a CNN stacked on an RNN. The intuition for using these things together is that spectrograms can be viewed both as images and as sequences. Using a CNN allows us to process the spectrogram as an image, and using an RNN allows us to process it as a time sequence.

To generate a prediction, an input image (or batch of images) is encoded and passed into the network. The image is first passed into the CNN sub-network, which is made up of 4 blocks. Each block contains a convolutional layer, batch normalisation, a ReLU activation function, and finally a max-pooling layer. Each block gradually reduces the dimensions of the input image until the output of the final block is a feature map of only one pixel in height and twenty in width (fifteen in the original paper, but my images were wider to begin with). This feature map is then used as an input sequence into the RNN sub-network, where each pixel is an element of the sequence. The sequence is fed into a two-layer GRU, followed by a dense layer. The final prediction is generated by passing the RNN output through a dense layer of width equal to the number of genre classes we have.

Image taken from *Convolutional Recurrent Neural Networks for Music Classification by Keunwoo Choi et al.*

The image shows the input spectrogram on the left. We can see that it is reduced in size gradually by being passed through four convolutional layers. N represents the number of feature maps. We can see that the frequency dimension is reduced to one, so we are just left with a sequence over the time dimension, which is passed to the RNN section. The circles on the right show the possible labels that can be selected from.

My PyTorch implementation of this model can be found here.

Training & Results

The datasets were split into 80% training and 20% test set. The model was trained over 10 epochs with a learning rate of 0.0017 (after using fast.ai’s learning rate finder) on a GPU using Google Colab.

I first tried training on the smaller dataset of high-level genres, and was able to get 80% accuracy on the test set, which is pretty good. After that, I tried training the same model on the larger metal subgenre dataset, where the test accuracy dropped to below 50%. This is understandable as, despite being two and a half times larger, the dataset has twice as many classes, and the classes are much more similar to each other.

Including Tabular Data

Although the main data source in this project was the 30-second track samples, I was also able to collect some other track data from the Spotify API. This tabular data includes features like track length, mode, key, number of sections, and some Spotify-generated metrics like valence, danceability, and acousticness.

I guessed that these values also have some predictive power, so I tried including them in the model to see if it improved predictions: I found that it did!

How it works is, in addition to the architecture described in the previous section, tabular data relating to a given track sample is encoded and fed into a dense layer. It then goes through a ReLU and another dense layer. The output of this small feed-forward network is concatenated with the output of the RNN and used to make the final prediction.

I retrained the model with this modification and extra source of data, and found that this significantly improved the accuracy from about 50% to 62% top-one accuracy.

An implementation of this version of the model can be found here.

Confusion Matrix

I generated a confusion matrix from each of the trained models to see which genres were most commonly mixed up. I hypothesized that closely related subgenres would be much more likely to be confused. For example, the metal subgenre dataset has “death metal”, “deathcore”, and “melodic death metal” as separate classes. These are distinct subgenres, but musically do borrow a lot from each other.

Here’s the confusion matrix:

Here’s a key for the labels:

{'black metal': 0,
 'death metal': 1,
 'deathcore': 2,
 'doom metal': 3,
 'folk metal': 4,
 'glam metal': 5,
 'grindcore': 6,
 'hard rock': 7,
 'hardcore': 8,
 'industrial metal': 9,
 'melodic death metal': 10,
 'metalcore': 11,
 'nu metal': 12,
 'power metal': 13,
 'progressive metal': 14,
 'punk': 15,
 'screamo': 16,
 'stoner rock': 17,
 'symphonic metal': 18,
 'thrash metal': 19}
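For reference, a confusion matrix and its most-confused pairs can be computed along these lines. This is a plain-Python sketch with made-up labels for three classes, not the code or data I used for the matrix above, and the function names are only for illustration.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def most_confused(matrix, top_n=3):
    """Return the off-diagonal (true, predicted, count) cells with the highest counts."""
    cells = [
        (t, p, count)
        for t, row in enumerate(matrix)
        for p, count in enumerate(row)
        if t != p and count > 0
    ]
    return sorted(cells, key=lambda cell: cell[2], reverse=True)[:top_n]

# Made-up labels for three classes, purely for illustration
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(most_confused(cm))
```

Each tuple reads as "true class, predicted class, number of times it happened", so the diagonal (correct predictions) is deliberately left out.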

Completely contrary to my expectations, the most confused subgenres were:

hard rock with industrial metal
grindcore with progressive metal
stoner rock with industrial metal

I would’ve imagined that perhaps hard rock and stoner rock would get mis-labelled as each other, but instead the model confused them both with industrial metal! This is perhaps partly a dataset problem. The tag “industrial” is thrown around quite casually and sometimes assigned to bands that don’t really fit, so it’s possible that some bands were mis-labelled in Spotify’s database.

But what’s even more surprising is the confusion between grindcore and progressive metal. Grindcore is very fast but usually pretty straightforward music, with songs that rarely last more than two minutes. Progressive metal on the other hand often has songs that last ten minutes or more, and features all sorts of tempo and mood changes as well as other things.

That these two genres were confused shows that the features the model is picking out when classifying are not the same kind of things that a human picks out, because even someone who has never listened to grindcore or progressive metal before would be able to tell the difference. Here’s an example of grindcore and here’s some prog metal in case you want to try comparing them for yourself.

Conclusions

I would consider this project a modest success as the model was clearly able to find some patterns in both datasets and make accurate predictions of high-level genres and subgenres with 80% and 60% accuracy respectively. My assumption that subgenres would be more difficult to classify than broad genres was also confirmed. On the other hand, the confusion matrix shows that the things that I thought would be difficult to classify didn’t necessarily turn out to be, and the model struggled with some things that would seem obvious to a human listener (at least if the given human is a metalhead).

Multi-Class Versus Multi-Label Classification

In this project, the classification task was framed as a multi-class problem; i.e. there is a set of possible classes from which one correct answer must be chosen. In hindsight, perhaps a multi-label classifier would have been better. Multi-label here means that the model can select more than one correct answer from the set of classes.

The reason that this would be preferable is that it would allow us to deal with subgenre edge cases more easily. By this I mean cases where a track or artist only partially fits within a genre category. It’s not uncommon for music to cross genre boundaries and mix several styles together. In cases like these, the classifier I trained is forced to pick whichever label seems most appropriate. But if we allow it to pick more than one label it will be able to list all the genres which feature in the sample.
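The difference at prediction time can be sketched in a few lines. The logits below are made up, and the 0.5 threshold is just a common default rather than something I tuned. A real multi-label model would also need to be trained with a per-label loss, which this sketch doesn't cover.

```python
import math

def softmax(logits):
    """Turn logits into probabilities that sum to one (multi-class)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Squash a single logit into an independent probability (multi-label)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_multi_class(logits, classes):
    """Pick the single most probable class."""
    probs = softmax(logits)
    return classes[probs.index(max(probs))]

def predict_multi_label(logits, classes, threshold=0.5):
    """Pick every class whose own probability clears the threshold."""
    return [c for c, x in zip(classes, logits) if sigmoid(x) >= threshold]

classes = ['death metal', 'melodic death metal', 'punk']
logits = [1.2, 2.0, -3.0]   # made-up model outputs for one track
print(predict_multi_class(logits, classes))
print(predict_multi_label(logits, classes))
```

The multi-class version has to choose between the two death metal labels, while the multi-label version is free to return both.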

This issue is exacerbated by some bad subgenre choices on my part. As previously mentioned, I included both “death metal” and “melodic death metal” as classes as I think of them as distinct subgenres. However, as the name suggests, “melodic death metal” is basically a type of death metal. This means that any time we label something as melodic death metal, we are also implicitly labelling it as death metal, as that is the parent genre.

When we take into account top-five accuracy instead of just top-one, the metal subgenre classifier’s accuracy shoots up to 92%. This shows that in most cases it is recognising the specified “correct” label even when it gets it wrong. There are many tracks in the dataset that could have multiple correct labels. I would probably have the same problem if I were asked to classify some pieces of music and told that I’m only allowed to pick one label. There are many cases where we need at least two or three labels to fully describe what’s going on. If I were to do this again, or develop this project further, I’d keep a set of labels of variable length for each track sample, and allow the model to pick more than one class when making predictions.
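For clarity, this is what I mean by top-k accuracy, written as a small sketch with invented scores rather than my model's real outputs:

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, true_labels):
        # Class indices sorted from highest to lowest score
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

# Invented scores for four tracks and three classes
scores = [
    [0.1, 0.6, 0.3],
    [0.5, 0.2, 0.3],
    [0.2, 0.3, 0.5],
    [0.4, 0.35, 0.25],
]
labels = [2, 0, 1, 2]
print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```

With these numbers only one of the four tracks is right at top-one, but three of the four have the true label somewhere in their top two.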
