Homework 4-1

The document outlines the 4th homework assignment for the course 'Introduction to language technology' at Háskóli Íslands, due on March 13th. It consists of programming tasks involving Levenshtein distance, dictionary lookup, masked language models for spelling correction, and question-answering models, along with a written project proposal. Students must submit their code, answers, and a project proposal through Canvas, with specific formatting and submission guidelines provided.

Uploaded by

Dimmi Woah

Háskóli Íslands

Spring 2024

TÖL025M
Introduction to language technology
4. homework assignment (12%)

General instructions
● Please read the entire page before you start (number of pages: 4)
● There are two parts to the assignment:
○ Programming part
○ Written part
● What should you turn in through Canvas?
○ Your code + answers to questions in part 4
○ A text file containing your project proposal
● The due date is March 13th before midnight. Be aware that if you turn in your
assignment late, but within 24 hours from the due date, you will at most receive 9
points out of 12 for the assignment. Assignments that are turned in later than that will
not be accepted.
● It is preferable to turn in your code as a PDF export of a Jupyter notebook,
Google Colab notebook or something similar that shows both your code and your
output. You could also turn in a PDF version of a doc file where you copy-paste
your code and screenshot your output (if you do, make sure that the code itself
is in text format, i.e. not an image). Traditional Python scripts will also be
accepted, however.
● The written questions can be answered either in English or Icelandic.
Part 1 (programming) – Levenshtein-distance [2 points]

In this part of the assignment, you should write your own Levenshtein function. You
can use whatever you want for inspiration (like the pseudo-code on Wikipedia) but you
need to write the function yourselves (in other words, you cannot do “import
levenshtein-distance”). You can use libraries like Numpy (or something else) if you
think that helps. Test your function by having it calculate the Levenshtein-distance
between a few strings of your choice.
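As a point of reference for the kind of function described above, here is a minimal sketch of the standard dynamic-programming approach, keeping only two rows of the table. The function name `levenshtein` and the test strings are illustrative choices, not something the assignment prescribes, and this is one possible formulation among several.

```python
def levenshtein(source: str, target: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn source into target."""
    # previous[j] holds the distance between the first i-1 characters
    # of source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to the empty string
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # A few strings of our own choosing, as the task suggests.
    for a, b in [("kitten", "sitting"), ("flaw", "lawn"), ("", "abc")]:
        print(f"{a!r} -> {b!r}: {levenshtein(a, b)}")
```

Keeping two rows instead of the full matrix is only a memory optimisation; a full table (for instance a Numpy array) gives the same distances and is easier to inspect while debugging.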

Part 2 (programming) – Dictionary lookup [2 points]

In this part of the assignment, you should create a dictionary lookup that is able to
detect a misspelled word based on whether or not it can be found in your vocabulary.
Start by creating a vocabulary based on your data sets (for instance, making a list of all
unique words in the text file). The input of the function should be a full sentence.
Then compare the input to the vocabulary and flag any word that is not found there.
Feel free to be creative with data structures, something like a trie or a hash table
might be a good idea (although not required). Test the functionality by running at
least one sentence through it that has no misspelled word and one sentence that
contains at least one misspelled word.
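The lookup described above could be sketched roughly as follows, using a Python `set` as the hash-based vocabulary. The helper names (`tokenize`, `build_vocabulary`, `find_misspelled`) and the tiny in-memory corpus are assumptions made for illustration; a real solution would build the vocabulary from the course data files.

```python
import re

# Runs of Unicode letters, optionally with one inner apostrophe ("don't").
# Lowercasing everything is a simplifying assumption of this sketch.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation and digits."""
    return [word.lower() for word in WORD_PATTERN.findall(text)]


def build_vocabulary(text):
    """Collect the unique words of a corpus into a set (a hash table)."""
    return set(tokenize(text))


def find_misspelled(sentence, vocabulary):
    """Return the words of the sentence that are not in the vocabulary."""
    return [word for word in tokenize(sentence) if word not in vocabulary]


# Stand-in corpus; in practice this would be read from a text file.
corpus = "The quick brown fox jumps over the lazy dog. The dog sleeps."
vocabulary = build_vocabulary(corpus)

print(find_misspelled("The lazy dog sleeps.", vocabulary))
print(find_misspelled("The lazi dog sleeps.", vocabulary))
```

Note that this flags any out-of-vocabulary word, so a correctly spelled word that never occurs in the corpus is also reported; a larger vocabulary reduces such false alarms.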

Part 3 (programming) – Masked language models for spelling correction [2 points]

In this part of the assignment, you should try to use a masked language model (like
BERT or some of his friends, just make sure it’s for the correct language as it’s not
feasible to use IceBERT to correct English for instance) to find and correct spelling
errors. Do this in the following way:

a) Take a sentence that contains at least one misspelled word as an input. Run it
through the dictionary lookup from part 2. That should flag and return the
misspelled word.
b) Replace the misspelled word by <mask> (note that some BERT models want the
format to be <mask>, others want [MASK] etc., do what’s right for the model
you’re using). Then send the sentence through the masked language model and
retrieve the 10 words the model decides are the most likely to replace the mask
token.
c) Use your Levenshtein-function on the 10 words the model suggested (in other
words: what’s the Levenshtein distance between the misspelled word and
those 10 words). Which one has the lowest distance? Is it a valid correction of
the intended word (if multiple words have the same distance, is the correct
word among the choices)?

Note that this is not guaranteed to work. If you don’t get the correct word at all, just
try another (simpler) sentence for fun.
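Steps b) and c) might be wired together along these lines. This is a sketch under assumptions: it uses the Hugging Face `transformers` fill-mask pipeline, the model name `distilroberta-base` is just an example English model, and the helper names are hypothetical. The model is only loaded when `suggest_with_mlm` is called, so the ranking step can be tried on a hand-written candidate list first.

```python
def levenshtein(a, b):
    """Compact edit distance, repeated here so the sketch is self-contained."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def rank_candidates(misspelled, candidates):
    """Sort candidate words by edit distance to the misspelled word."""
    return sorted(candidates, key=lambda word: levenshtein(misspelled, word))


def suggest_with_mlm(sentence, misspelled, model_name="distilroberta-base", top_k=10):
    """Mask the misspelled word, ask the model for top_k fillers, rank them.

    Assumes the transformers library and a backend such as PyTorch are
    installed; model_name is an example, not a recommendation.
    """
    from transformers import pipeline  # imported lazily: only needed here

    fill_mask = pipeline("fill-mask", model=model_name)
    mask = fill_mask.tokenizer.mask_token  # "<mask>" or "[MASK]", per model
    masked_sentence = sentence.replace(misspelled, mask, 1)
    predictions = fill_mask(masked_sentence, top_k=top_k)
    candidates = [p["token_str"].strip() for p in predictions]
    return rank_candidates(misspelled, candidates)


# Ranking step on a hand-written candidate list (no model download needed).
print(rank_candidates("recieve", ["send", "receive", "get"])[0])
```

Reading the mask token from the tokenizer avoids hard-coding `<mask>` or `[MASK]`. Because `sorted` is stable, candidates with equal distance keep the order in which the model ranked them.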

Part 4 (programming and written) – Question-answering [2 points]

Go to HuggingFace and find a question-answering model (that has already been
fine-tuned for the task). This could for example be a DistilBERT that has been
fine-tuned on the SQuAD dataset (but it can be any question-answering model).
Follow the appropriate instructions to prepare the model to take in a context example
(a small text that contains some information that you can write questions about and
have the model answer them). Ask the model at least 5 questions. Does the model
answer the way that you would have expected? What happens if you ask it something
that has nothing to do with the context? Can you find any biases in the answers (you
might, but you might also not find any; either is fine)?

Example of a question related to the context:

context = “Sharks are a group of elasmobranch fish characterized by a cartilaginous
skeleton, five to seven gill slits on the sides of the head, and pectoral fins that are not fused
to the head. Modern sharks are classified within the clade Selachimorpha (or Selachii) and
are the sister group to the rays. However, the term "shark" is also used to refer to extinct
shark-like members of the subclass Elasmobranchii, such as hybodonts, that lie outside the
modern group.”

question="How many gills do sharks have?"

Answer: 'five to seven', score: 0.7716, start: 84, end: 97

A real example of a question not related to the context:

context = “Once upon a time, a girl called Katy and a boy named James
were playing on the swings.”

question="Who is the CEO?"

Answer: 'James', score: 0.313, start: 54, end: 59
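Output in the shape shown above could be produced with something like the following sketch. It assumes the Hugging Face `transformers` question-answering pipeline; the model name `distilbert-base-cased-distilled-squad` is one example of a DistilBERT fine-tuned on SQuAD, and the helpers `ask` and `format_answer` are hypothetical names introduced here.

```python
def format_answer(result):
    """Render one pipeline result in the 'Answer: ..., score: ...' style."""
    return (f"Answer: {result['answer']!r}, score: {round(result['score'], 4)}, "
            f"start: {result['start']}, end: {result['end']}")


def ask(context, questions, model_name="distilbert-base-cased-distilled-squad"):
    """Run each question against the context and return the raw results.

    Assumes transformers and a backend such as PyTorch are installed;
    each result is a dict with 'answer', 'score', 'start' and 'end'.
    """
    from transformers import pipeline  # imported lazily: only needed here

    qa = pipeline("question-answering", model=model_name)
    return [qa(question=question, context=context) for question in questions]


if __name__ == "__main__":
    context = ("Once upon a time, a girl called Katy and a boy named James "
               "were playing on the swings.")
    questions = ["Who was playing on the swings?", "Who is the CEO?"]
    for question, result in zip(questions, ask(context, questions)):
        print(question, "->", format_answer(result))
```

An extractive model of this kind always selects a span from the context, which is why an unrelated question still gets an answer; the score is the main signal that the answer may not be meaningful.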

Part 5 (written) – Final assignment project proposal [4 points]

In this part of the assignment, describe the idea that you have for your final project.
You need to explain what type of problem you want to solve (for example: you want
to fine-tune a question-answering model on data related to cats), whether or not this
problem has been solved before by other programmers (to your knowledge), if
you’re localizing a known model/dataset/research material to another language and
so on. Explain anything that might help me understand what your project is about.
Why does this project interest you?

Briefly describe how you intend to solve this problem. This does not have to be a
detailed project plan (that’s due with homework assignment 5); just briefly, how do
you see yourself solving the task? Which methods will you use? Which libraries do you
need? Which data do you need (or do you intend to collect your own data, and
how)? Is your project an annotation of a corpus that already exists (see for instance
the difference between MÍM-GOLD and MÍM-GOLD-EL on Clarin)? How do you intend
to package your project (will it be a command line tool, a web tool, an API…)?

Note that these questions are ideas to get you started, you only answer what applies
to your project. The main thing is to describe your idea and plan in such a way that I
can see it clearly (and therefore, give you better feedback on it).
