Háskóli Íslands
Spring 2024
TÖL025M
Introduction to language technology
4. homework assignment (12%)
General instructions
● Please read the entire assignment before you start (number of pages: 4)
● There are two parts to the assignment:
○ Programming part
○ Written part
● What should you turn in through Canvas?
○ Your code + answers to questions in part 4
○ A text file containing your project proposal
● The due date is March 13th before midnight. If you turn in your assignment late,
but within 24 hours of the due date, you will receive at most 9 points out of 12 for
the assignment. Assignments that are turned in later than that will not be accepted.
● Preferably, turn in your code as a pdf version of a Jupyter notebook, Google Colab
notebook or something similar that shows both your code and your output. You could
also turn in a pdf version of a doc file where you copy-paste your code and screenshot
your output (if you do, make sure that the code itself is in text format, i.e. not an
image). Plain Python scripts will also be accepted, however.
● The written questions can be answered either in English or Icelandic.
Part 1 (programming) – Levenshtein distance [2 points]
In this part of the assignment, you should write your own Levenshtein function. You
can use whatever you want for inspiration (like the pseudo-code on Wikipedia) but you
need to write the function yourself (in other words, you cannot do “import
levenshtein-distance”). You can use libraries like NumPy (or something else) if you
think that helps. Test your function by having it calculate the Levenshtein distance
between a few strings of your choice.
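For orientation, the dynamic-programming idea behind the Levenshtein distance could be sketched roughly as follows. This is only one possible approach and it assumes unit costs for insertion, deletion and substitution. The function name `levenshtein` and the test strings are chosen here for illustration and are not prescribed by the assignment.

```python
def levenshtein(source, target):
    """Return the edit distance between two strings (unit costs assumed)."""
    # previous[j] is the distance between the part of `source` handled so far
    # and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to the empty string
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # A few illustrative string pairs (not part of the assignment text).
    for a, b in [("kitten", "sitting"), ("flaw", "lawn"), ("hestur", "hestar")]:
        print(a, b, levenshtein(a, b))
```

Keeping only two rows of the table at a time is a design choice that saves memory; a full matrix (for example with NumPy) works just as well and can be easier to inspect.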
Part 2 (programming) – Dictionary lookup [2 points]
In this part of the assignment, you should create a dictionary lookup that detects a
misspelled word based on whether or not it can be found in your vocabulary.
Start by creating a vocabulary based on your data sets (for instance, by making a list
of all unique words in a text file). The input of the function should be a full sentence.
Compare each word of the input to the vocabulary and flag any word that is not found
there. Feel free to be creative with data structures: something like a trie or a hash
table might be a good idea (although it is not required). Test the functionality by
running at least one sentence through it that has no misspelled word and one sentence
that contains at least one misspelled word.
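A minimal sketch of such a lookup, assuming a hash-based `set` as the vocabulary and a simple regular-expression tokenizer, might look like this. The names `build_vocabulary` and `flag_unknown_words` and the one-sentence corpus are hypothetical stand-ins; a real vocabulary would be built from a much larger text file.

```python
import re

# \w+ matches runs of letters, digits and underscores (including non-ASCII letters).
# Note that it splits contractions such as "it's" into two tokens.
WORD_PATTERN = re.compile(r"\w+")


def build_vocabulary(text):
    """Collect all unique lowercased words in the text into a set."""
    return {word.lower() for word in WORD_PATTERN.findall(text)}


def flag_unknown_words(sentence, vocabulary):
    """Return the words of the sentence that are not in the vocabulary."""
    return [word for word in WORD_PATTERN.findall(sentence)
            if word.lower() not in vocabulary]


# Hypothetical stand-in for the contents of a real text file.
corpus = "The girl and the boy were playing on the swings in the park."
vocabulary = build_vocabulary(corpus)

if __name__ == "__main__":
    print(flag_unknown_words("The boy and the girl were playing.", vocabulary))
    print(flag_unknown_words("The boy and the girl were playing on the swngs.", vocabulary))
```

A set gives constant-time membership checks on average; a trie would additionally allow prefix queries, which this sketch does not need.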
Part 3 (programming) – Masked language models for spelling correction [2 points]
In this part of the assignment, you should try to use a masked language model (like
BERT or one of its friends; just make sure it is for the correct language, as it is not
feasible to use IceBERT to correct English, for instance) to find and correct spelling
errors. Do this in the following way:
a) Take a sentence that contains at least one misspelled word as an input. Run it
through the dictionary lookup from part 2. That should flag and return the
misspelled word.
b) Replace the misspelled word with <mask> (note that some BERT models want the
format to be <mask>, others want [MASK] etc.; do what is right for the model
you are using). Then send the sentence through the masked language model and
retrieve the 10 words the model considers the most likely to replace the mask
token.
c) Use your Levenshtein function on the 10 words the model suggested (in other
words: what is the Levenshtein distance between the misspelled word and
each of those 10 words?). Which one has the lowest distance? Is it a valid
correction of the misspelled word (if multiple words share the lowest distance,
is the correct word among them)?
Note that this is not guaranteed to work. If you don’t get the correct word at all, just
try another (simpler) sentence for fun.
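Steps a) to c) could be wired together roughly as in the sketch below. It assumes the Hugging Face `transformers` library and its `fill-mask` pipeline, with `distilroberta-base` as an example English model that expects `<mask>`. The helper names, the example sentence and the hard-coded candidate list are hypothetical; the candidate list only stands in for real model output so that the ranking step can be tried without downloading a model.

```python
import re


def levenshtein(source, target):
    """Compact edit distance with unit costs (same idea as in part 1)."""
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (s != t)))
        previous = current
    return previous[-1]


def mask_word(sentence, misspelled, mask_token="<mask>"):
    """Replace the first whole-word occurrence of the misspelled word."""
    pattern = r"\b" + re.escape(misspelled) + r"\b"
    return re.sub(pattern, lambda _: mask_token, sentence, count=1)


def suggest_with_model(masked_sentence, model_name="distilroberta-base", top_k=10):
    """Ask a masked language model for its top_k fillers (downloads the model).

    The mask token in masked_sentence must match the one the model expects.
    """
    from transformers import pipeline  # imported lazily; needs `pip install transformers`
    fill_mask = pipeline("fill-mask", model=model_name)
    predictions = fill_mask(masked_sentence, top_k=top_k)
    return [prediction["token_str"].strip() for prediction in predictions]


def closest_candidates(misspelled, candidates):
    """Return the lowest distance and every candidate that reaches it."""
    distances = {c: levenshtein(misspelled, c) for c in candidates}
    best = min(distances.values())
    return best, [c for c in candidates if distances[c] == best]


if __name__ == "__main__":
    masked = mask_word("They were playing on the swngs.", "swngs")
    print(masked)
    # Hypothetical candidates standing in for suggest_with_model(masked).
    candidates = ["grass", "swings", "beach", "swing", "slide"]
    print(closest_candidates("swngs", candidates))
```

Returning all tied candidates rather than a single word makes it easy to answer the question about ties in step c).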
Part 4 (programming and written) – Question-answering [2 points]
Go to HuggingFace and find a question-answering model (that has already been
fine-tuned for the task). This could for example be a DistilBERT that has been
fine-tuned on the SQuAD dataset (but it can be any question-answering model).
Follow the appropriate instructions to prepare the model to take in a context example
(a small text that contains some information that you can write questions about and
have the model answer them). Ask the model at least 5 questions. Does the model
answer the way that you would have expected? What happens if you ask it something
that has nothing to do with the context? Can you find any biases in the answers (you
might or might not find any; either is fine)?
Example of a question related to the context:
context = “Sharks are a group of elasmobranch fish characterized by a cartilaginous
skeleton, five to seven gill slits on the sides of the head, and pectoral fins that are not fused
to the head. Modern sharks are classified within the clade Selachimorpha (or Selachii) and
are the sister group to the rays. However, the term "shark" is also used to refer to extinct
shark-like members of the subclass Elasmobranchii, such as hybodonts, that lie outside the
modern group.”
question="How many gills do sharks have?"
Answer: 'five to seven', score: 0.7716, start: 84, end: 97
A real example of a question not related to the context:
context = “Once upon a time, a girl called Katy and a boy named James
were playing on the swings.”
question="Who is the CEO?"
Answer: 'James', score: 0.313, start: 54, end: 59
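Output of the kind shown above could be produced along the lines of the following sketch. It assumes the Hugging Face `transformers` `question-answering` pipeline with `distilbert-base-cased-distilled-squad` as one example of a fine-tuned model; any other question-answering model could be substituted. The helper names are hypothetical, and the result dictionary in the demo is built by hand to mimic the pipeline's output format, so it is not real model output.

```python
def ask(question, context, model_name="distilbert-base-cased-distilled-squad"):
    """Run a question-answering model on a context (downloads the model)."""
    from transformers import pipeline  # imported lazily; needs `pip install transformers`
    question_answerer = pipeline("question-answering", model=model_name)
    # The pipeline returns a dict with the keys "answer", "score", "start", "end".
    return question_answerer(question=question, context=context)


def span_matches_answer(context, result):
    """Check that the start/end offsets point at the answer text in the context."""
    return context[result["start"]:result["end"]] == result["answer"]


def format_result(result):
    """Format a result in the style used in the examples of this assignment."""
    return (f"Answer: {result['answer']!r}, score: {round(result['score'], 4)}, "
            f"start: {result['start']}, end: {result['end']}")


if __name__ == "__main__":
    context = ("Once upon a time, a girl called Katy and a boy named James "
               "were playing on the swings.")
    # Hand-built stand-in for ask("Who is the CEO?", context); the offsets are
    # computed from this exact string and may differ from the example above.
    start = context.index("James")
    result = {"answer": "James", "score": 0.313, "start": start, "end": start + len("James")}
    print(format_result(result))
    print(span_matches_answer(context, result))
```

Looking at the score alongside the answer is useful for the unrelated-question case: extractive models of this kind pick some span from the context, so a low score is often the only hint that the question did not fit.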
Part 5 (written) – Final assignment project proposal [4 points]
In this part of the assignment, describe the idea that you have for your final project.
You need to explain what type of problem you want to solve (for example: you want
to fine-tune a question-answering model on data related to cats), whether or not this
problem has been solved before by other programmers (to your knowledge), whether
you are localizing a known model/dataset/research material to another language, and
so on. Explain anything that might help me understand what your project is about.
Why does this project interest you?
Briefly describe how you intend to solve this problem. This does not have to be a
detailed project plan (that is due with homework assignment 5). Just briefly: how do
you see yourself solving the task? Which methods will you use? Which libraries do you
need? Which data do you need (or do you intend to collect your own data, and if so,
how)? Is your project an annotation of a corpus that already exists (see for instance
the difference between MÍM-GOLD and MÍM-GOLD-EL on Clarin)? How do you intend
to package your project (will it be a command line tool, a web tool, an API…)?
Note that these questions are ideas to get you started; you only need to answer what
applies to your project. The main thing is to describe your idea and plan in such a way
that I can see it clearly (and can therefore give you better feedback on it).