Advanced Data Engineering & Analysis: Course Introduction
6 March 2024
Adam Jatowt
adam.jatowt@uibk.ac.at
Examples of recent NLP/IR research
• Question answering in news article collections
• Automatic hint generation
• Multi-modal document summarization
• Multi-timeline summarization of news articles
• Epidemic information extraction
• Search and recommendation models in news collections
• Sentence temporal validity (a.k.a. information expiry date) estimation
• Text readability and comprehensibility estimation
Anyone interested in doing master’s thesis research on any of these
or related NLP/IR topics can contact me for discussion.
Some open topics are listed on the DS website (more to come).
About Course
What is Natural Language Processing?
• The amount of digital textual data generated every day is huge (e.g., the Web, social media, medical
records, digitized books)
• So is the need to understand, analyze, organize, translate, and process this flood of words and documents
• Natural language processing (NLP) deals with designing methods and algorithms that take as input, or
produce as output, unstructured natural language data
• Natural language processing is focused on the design & analysis of computational algorithms and
representations for processing natural human language
What is Natural Language Processing?
• Human language is our main general-purpose communication tool
• Natural Language Processing
• Large field: processing natural language text involves many different syntactic, semantic, and
pragmatic tasks, in addition to other problems
General things about this course
• Not a linguistics course, but rather a course that includes aspects of language processing,
machine learning and quantitative methods
• We will explore statistical and neural network (NN) techniques for the automatic analysis of natural (human)
language data
• The dominant modeling paradigm is corpus-driven statistical/deep learning, with both supervised
and unsupervised methods
• NLP is a huge field!
• We focus mainly on core ideas, tasks, and methods needed for or behind fundamental technologies,
and eventually for NLP applications
Course goals
• Survey and study fundamental tasks in NLP
• Learn some classic and state-of-the-art techniques
• Acquire some research ideas, interest and experience :-)
Housekeeping notes
Schedule, Grading, etc.
Overall effort
• This is a 5 ECTS VU course
• 1 ECTS credit = 25 hours of work [1]
• 125h/course → about 8h/week
• 8h/week minus 3h/week in class leaves roughly 5h/week of work at home (including exam preparation)
[1] https://www.uibk.ac.at/studium/organisation/anerkennung-und-ects-zuteilung/index.html.en
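The workload estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the semester length of 15 weeks is an assumption (the slides do not state it), and the 3h/week is read here as in-class time. The function name `weekly_workload` is made up for illustration.

```python
# Figures taken from the slide
ECTS_CREDITS = 5
HOURS_PER_ECTS = 25

# Assumptions, not stated on the slide
SEMESTER_WEEKS = 15        # assumed semester length
CLASS_HOURS_PER_WEEK = 3   # the 3h/week on the slide, read as in-class time


def weekly_workload(credits, hours_per_credit, weeks, class_hours):
    """Return (total hours, hours per week, hours per week at home)."""
    total = credits * hours_per_credit
    per_week = total / weeks
    at_home = per_week - class_hours
    return total, per_week, at_home


total, per_week, at_home = weekly_workload(
    ECTS_CREDITS, HOURS_PER_ECTS, SEMESTER_WEEKS, CLASS_HOURS_PER_WEEK
)
print(f"total: {total} h, per week: {per_week:.1f} h, at home: {at_home:.1f} h")
```

With a different assumed number of weeks the per-week figure shifts slightly, which is why the slide only says "about" and "roughly".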
Tentative schedule
• Week 1: Course overview
• Week 2: NLP introduction; basic text pre-processing
• Week 3: N-grams, language models; spelling error correction
• Week 4-5: Word relations, senses & WordNet; text classification, logistic regression
• Week 6: Basics of neural networks for NLP; vector semantics, word embeddings
• Week 7 (May 8): Mid-term Exam
• Week 8-9: Sentiment analysis & lexicons; sequence labeling; POS tagging, named entity recognition
• Week 10: Document summarization; information and relation extraction; event extraction
• Week 11: Question answering, commonsense knowledge extraction; chatbots
• Week 12-13: Tutorial presentations
• Week 14 (June 26): Final Exam
(The schedule is subject to change.
The exam and presentation dates, however, are fixed.)
Evaluation
• VU: Continuous assessment course
• Grading is decided based on attendance (5%), class participation (20%), tutorial
presentations (25%), and two exams (50%).
• Participation in the exams is necessary for a positive grade
Points          Grade
Less than 50    Not enough
50 - 63         Enough
64 - 77         Satisfactory
78 - 89         Good
90 - 100        Very good
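As a rough illustration of how the weights and the grade table combine, here is a minimal sketch. It assumes that each component is scored on a 0-100 scale and that the components are combined linearly; neither detail is stated on the slides. It also treats each band's lower bound as the threshold, so a fractional total such as 63.5 falls into "Enough". The names `total_points` and `grade` are made up for this example.

```python
# Weights in percent, as listed on the Evaluation slide
WEIGHTS = {"attendance": 5, "participation": 20, "tutorial": 25, "exams": 50}

# Lower bound of each band from the grade table, highest first
GRADE_BANDS = [(90, "Very good"), (78, "Good"), (64, "Satisfactory"), (50, "Enough")]


def total_points(scores):
    """Weighted total, assuming each component score is on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def grade(scores, took_both_exams=True):
    """Map component scores to a grade label from the table."""
    # Participation in the exams is necessary for a positive grade
    if not took_both_exams:
        return "Not enough"
    points = total_points(scores)
    for lower_bound, label in GRADE_BANDS:
        if points >= lower_bound:
            return label
    return "Not enough"


example = {"attendance": 100, "participation": 80, "tutorial": 90, "exams": 70}
print(total_points(example), grade(example))
```

For the example scores the weighted total is 5 + 16 + 22.5 + 35 = 78.5 points, which lies in the 78 - 89 band.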
Attendance (5%)
• Attendance is advised but not mandatory
• However, you have to be present at the mid-term and final examinations (week 7 and week 14) and
give a tutorial presentation.
• You may cancel your registration in the course before March 21 (after the third class) without any
consequences. After this date you may receive a grade
• Please send me an email if you withdraw from the course
Class Participation (20%)
• Weekly paper presentations and occasional homework assignments
• Paper presentation time: about 15 min presentation + Q&A time
• Q&A: in particular, active participation in the discussions during the final tutorial
presentations and paper presentations
Tutorial Presentations (25%)
• Presentations of about 25 min in groups of 2-3 students
• Focused on recent NLP methods, mainly Deep Learning and LLMs
• Done in a tutorial style: give an overview of the method, demonstrate key functions and code to implement it, and
showcase its results on any chosen dataset
• Slides to be submitted to OLAT at least one day before the presentation
• Presentation details and topics will be given later
Exams (50%)
• Mid-term exam (25%) on May 8 and final exam (25%) on June 26
• Closed book exams
• Multiple choice questions
• Focused on understanding rather than rote memorization of details
• The exam dates will not be changed, so you should not take this course if you anticipate a
schedule conflict
Academic Honesty & Integrity
• We expect you to do your own work unless it is specifically assigned as a group
assignment/project
• Whenever you use someone else’s idea, software library, etc., it should be clearly
documented
• You are not allowed to collaborate on answering exam questions
• It is an honor code violation to discuss exam questions with other students
Slides
• Slides will be uploaded to OLAT before each lecture
• Some slides are borrowed from Julia Hockenmaier, Alex Lascarides, Nathan Schneider, Dan Jurafsky, Chris Manning,
David Bamman, Ray Mooney, Yulia Tsvetkov, Taylor Berg-Kirkpatrick, Dan Klein, Diyi Yang, Jannik Strötgen, Vinay Setty,
Anubhav Jangra, ...
Books
• H. Lane, C. Howard, H. Max Hapke, Natural Language Processing in Action, Manning, 2019
• Jurafsky and Martin, Speech and Language Processing, 2nd or 3rd Edition
• Manning and Schuetze, Foundations of Statistical Natural Language Processing, MIT
Press
• Goldberg, Neural Network Methods for Natural Language Processing, Synthesis Lectures on
Human Language Technologies
• Etc.
Other Relevant Books
• NLP with Python, The NLTK book, Bird, Klein & Loper.
• https://www.nltk.org/book
• Natural Language Processing, Eisenstein.
• https://tinyurl.com/eisenstein-nlp
• Linguistic Fundamentals of NLP, Bender.
• http://tinyurl.com/bender-nlp
Useful NLP Resources
• https://www.nltk.org/
• https://spacy.io/
• https://openai.com/
• https://allennlp.org/
• An open-source NLP research library, built on PyTorch
• https://huggingface.co/transformers/
• https://towardsdatascience.com/
• http://nlpprogress.com/
• Repository tracking the progress in NLP, including the datasets and the current state-of-the-art for most common NLP tasks
• https://github.com/flairNLP/flair
Relevant Scientific Conferences (ordered by importance)
• Association for Computational Linguistics (ACL)
• Empirical Methods in Natural Language Processing (EMNLP)
• North American Chapter of the Association for Computational Linguistics (NAACL)
• International Conference on Computational Linguistics (COLING)
• European Chapter of the Association for Computational Linguistics (EACL)
• Conference on Computational Natural Language Learning (CoNLL)
Other related Scientific Conferences
• Related ones:
• CIKM
• WSDM
• WWW
• SIGIR
• ECIR
• AAAI, IJCAI
Example NLP Workshops associated with Relevant Conferences
• NLP for Building Educational Applications (BEA)
• Fact Extraction and VERification (FEVER)
• Figurative Language Processing (FLP)
• NLP for Conversational AI (NLP4ConvAI)
• Narrative Understanding, Storylines, and Events (NUSE)
• Representation Learning for NLP (RepL4NLP)
• Natural Language Processing for Social Media (SocialNLP)
• Neural Generation and Translation (WNGT)
Workshops recently organized by DS group
• The 6th Int. Workshop on Narrative Extraction from Texts at ECIR 2023
(Text2Story2023) https://text2story23.inesctec.pt/
• The 1st Int. Workshop on Implicit Author Characterization from Texts for Search and Retrieval
(IACT’23) at SIGIR 2023 https://en.sce.ac.il/news/iact23
• The 2nd Int. Workshop on Computational Approaches to Historical Language Change
2021 at ACL 2021 https://languagechange.org/events/2021-acl-lchange/
Consultation Hours
• Fridays: 16:00 - 17:30
• Digital Science Center (DiSC), Innrain 15, A-6020 Innsbruck (room 01-09)
• Please schedule a meeting by email in advance
Disabilities Support Office
• The University of Innsbruck offers support through the disabilities office:
• Mag. Bettina Jeschke
Bettina.Jeschke@uibk.ac.at
• +43 512 507 8887
• You do not need a diagnosis to contact the office
• For students with suspected attention deficit, autism, or learning disabilities, additional support is
provided by the S-AAL project:
• https://www.uibk.ac.at/de/projects/s-aal/
Thank you!