NATURAL LANGUAGE PROCESSING
UNIT 1
FINDING THE STRUCTURE OF WORDS
Introduction
The study of word structure, known as morphology, is a fundamental aspect of Natural
Language Processing (NLP). This discipline is essential for understanding human
language, which is inherently complex, enabling us to express thoughts and infer meaning
from various levels of detail. Morphology is crucial for processing human language,
including tasks like semantic and syntactic analysis, and is particularly vital in multilingual
settings. The discovery of word structure is termed morphological parsing.
Words and their Components
Explain Words and their Components.
Words are considered the smallest linguistic units capable of conveying meaning through
utterance. However, the concept of a "word" can vary significantly across languages. The
following are the various fundamental components of words:
•Tokens: In many languages, such as English, words are delimited by whitespace and
punctuation, forming tokens. Yet, this is not a universal rule; languages like Japanese,
Chinese, and Thai utilise character strings without whitespace for word delimitation. Other
languages, like Arabic or Hebrew, concatenate certain tokens, where word forms change
depending on preceding or following elements.
•Morphemes: These are the minimal parts of words that convey meaning. Morphemes
constitute the fundamental morphological units and contribute to the overall meaning of a
word.
•Lexemes: A lexeme is a linguistic form that expresses a concept, independent of its
various inflectional categories. The citation form of a lexeme is known as its lemma. When
a word form is converted into its lemma, this process is called lemmatisation.
•Allomorphs: Morphemes can exhibit variations in their sound (phonemes) or spelling
(graphemes), which are termed allomorphs. These variations are due to phonological or
orthographic constraints. Examples include the differing forms of morphemes in Korean and
the non-concatenative morphology of Arabic, where word structure is determined by stems,
roots, and patterns.
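The lemmatisation step mentioned above can be sketched as a toy suffix-stripping function. The rules below are illustrative assumptions, not a complete English morphology; real systems use full lexicons and resources such as WordNet-based lemmatizers.

```python
# A minimal lemmatization sketch: strip common English inflectional
# suffixes to recover a candidate lemma. The rule list is illustrative.
SUFFIX_RULES = [
    ("ies", "y"),   # "studies" -> "study"
    ("es", ""),     # "boxes"   -> "box"
    ("s", ""),      # "cats"    -> "cat"
    ("ing", ""),    # "walking" -> "walk"
    ("ed", ""),     # "walked"  -> "walk"
]

def lemmatize(word: str) -> str:
    """Return a candidate lemma by applying the first matching rule."""
    for suffix, replacement in SUFFIX_RULES:
        # Length guard keeps short words like "is" or "was" untouched.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

print(lemmatize("studies"))  # study
print(lemmatize("cats"))     # cat
```

A rule-list approach like this fails on irregular forms ("went", "mice"), which is exactly why practical lemmatizers combine rules with an exception dictionary.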
Morphological Typology
Morphological typology classifies languages based on the number of morphemes per word
and the degree of fusion between them. The types of languages are shown below:
1) Isolating Languages: These languages typically have one or relatively few
morphemes per word, with minimal inflectional changes. Examples include Chinese,
Vietnamese, and Thai.
2) Agglutinative Languages: Characterised by a high number of morphemes per
word, which are often easily separable and combine to form long words. Korean,
Japanese, Finnish, and Turkish are examples.
3) Synthetic Languages: Morphemes in these languages tend to combine and fuse,
packing more morphemes per word than isolating languages.
4) Fusional Languages: A subset of synthetic languages, these often express multiple
grammatical features (e.g., gender, number, case) with a single morpheme. Arabic,
Czech, Latin, and Sanskrit are examples of fusional languages.
*****
Issues and Challenges
Explain issues and challenges. (Essay Question)
Understanding the structure of words in human language presents several significant
issues and challenges for morphological analysis, primarily due to the inherent complexity,
variability, and dynamic nature of language itself. These challenges are particularly evident
in morphological parsing, which aims to transform varied word forms into well-defined
linguistic units with explicit lexical and morphological properties.
The following are the key issues and challenges:
1. Variability of Word Forms and Morphological Parsing
Human language is incredibly complex, exhibiting structure at multiple levels of detail.
Words often do not maintain a constant form and can change significantly based on
syntactic and semantic contexts, as well as specific sensitivities and restrictions.
Morphological parsing is designed to address this variability by converting diverse word
forms into higher-level linguistic units whose lexical and morphological properties are
clearly defined. For instance, in languages like Arabic, an underlying root form (e.g., 'ktb' for
'write') can generate numerous surface forms (e.g., 'kataba', 'yaktubu', 'maktab') through
processes like concatenation, infixation, and vowel changes. The challenge lies in
accurately mapping these surface forms back to their underlying morphemes and features.
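The root-and-pattern process described above can be sketched by interdigitating root consonants with a vowel template. The pattern strings below are simplified illustrations of Arabic templates, not a complete account of Arabic morphology.

```python
# Sketch of non-concatenative (root-and-pattern) word formation:
# fill the C slots of a template with the consonants of a root.
def interdigitate(root: str, pattern: str) -> str:
    """Interleave root consonants into the C positions of a pattern."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

print(interdigitate("ktb", "CaCaCa"))   # kataba  ("he wrote")
print(interdigitate("ktb", "yaCCuCu"))  # yaktubu ("he writes")
print(interdigitate("ktb", "maCCaC"))   # maktab  ("office")
```

Analysis runs the mapping in reverse: given a surface form, the parser must recover which root and which pattern produced it, which is where the ambiguity arises.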
2. Irregularity
A major challenge is the existence of irregular word forms and structures that do not
conform to typical linguistic patterns or rules. While some irregularities can be managed by
refining existing models, many others are lexically dependent and cannot be easily
generalised. This means that the morphological model must accommodate these
exceptions, often requiring detailed, specific descriptions rather than broad rules.
•Korean Morphology: For example, Korean exhibits complex morphological alternation
and phonologically dependent choices of form. It has numerous allomorphs (variant forms
of a morpheme) whose usage depends on the preceding verb stem. Additionally, Korean
irregular verbs present specific challenges in inflection.
•Arabic Morphology: The deep study of morphological processes, even for irregular words
in Arabic, is crucial for mastering its entire morphological and phonological system.
3. Ambiguity
Morphological ambiguity arises when a single word form can be interpreted in multiple
ways, leading to different meanings or functions depending on the context. This includes
homonyms, where words share the same form but have distinct meanings.
•Interpretation Ambiguity: Ambiguity complicates the interpretation of linguistic
expressions, often requiring the disambiguation of words within their context to avoid
restricting the valid interpretations.
•Arabic Disambiguation: Arabic, with its rich morphology, presents a particularly high
degree of morphological ambiguity. This is exacerbated because its script often omits short
vowels and other diacritical marks essential for precise phonological representation. The
problem of disambiguation in Arabic extends beyond resolving structural components and
morphosyntactic properties to include challenges in tokenisation and normalisation.
•Contextual Changes: When inflected words combine in an utterance, additional
phonological and orthographic changes (like "external sandhi" in Sanskrit) can occur,
making segmentation and disambiguation non-deterministic and multi-solutional.
4. Productivity
Productivity refers to the language's capacity to generate an infinite set of utterances and
new words from a finite set of structural devices, such as recursion, iteration, or
compounding. This inherent creativity means that new words or senses are constantly
being coined.
•Dynamic Vocabulary: Linguistic corpora, while useful, represent only a finite snapshot of
a language's vocabulary. The "80/20 rule" (Zipf's law) suggests that a small number of
words are very frequent, while a large number are rare. As linguistic data expands, new and
unexpected words continually emerge.
•Unknown Word Problem: Morphological models struggle with "unknown words"—words
that are meaningful but not yet licensed by the lexicon of the morphological system. This
problem is severe in speech and writing that deviates from the expected domain, such as
when using specialised terms, foreign names, or mixing multiple languages or dialects. The
term "googol" is an example of such a creative word formation. For instance, a
morphological analyser might fail to parse 'googol' if it's not in its pre-existing lexicon.
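The unknown-word problem can be sketched as a lexicon lookup with a suffix-based fallback guesser. The lexicon entries and suffix table below are illustrative assumptions, not a real tagset or dictionary.

```python
# Sketch: analyse a word via the lexicon first, then fall back to a
# suffix-based guess; otherwise report it as unknown (e.g. "googol").
LEXICON = {"cat": "NOUN", "walk": "VERB", "quick": "ADJ"}
SUFFIX_GUESSES = [("ness", "NOUN"), ("ly", "ADV"), ("ize", "VERB")]

def analyse(word):
    if word in LEXICON:
        return LEXICON[word]           # licensed by the lexicon
    for suffix, tag in SUFFIX_GUESSES:
        if word.endswith(suffix):
            return tag + "?"           # guessed, marked as uncertain
    return "UNKNOWN"                   # a coinage the model cannot parse

print(analyse("cat"))        # NOUN
print(analyse("happiness"))  # NOUN?
print(analyse("googol"))     # UNKNOWN
```

Real systems use much richer guessers (character n-grams, capitalisation, context), but the structure is the same: productivity guarantees the lexicon is never complete, so a fallback path is mandatory.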
5. Computational and Design Challenges for Morphological Models
Developing and implementing robust morphological models involves several computational
and design considerations.
•Resource Limitations: Historically, the development of sophisticated morphological
models has been hampered by limited computational resources relative to the complexity of
the tasks involved.
•Runtime Performance and Efficiency: Modern models must also address concerns
about runtime performance and efficiency, ensuring they can process large volumes of
linguistic data quickly.
•Model Design: The choice of programming methods and design style significantly impacts
whether a model is intuitive, adequate, complete, reusable, and elegant.
•Domain-Specific vs. General-Purpose: Some approaches use domain-specific
programming languages for easier development of morphological grammars, while others
adopt general-purpose languages, each with its own trade-offs regarding abstraction,
efficiency, and reusability. The aim is to achieve better abstraction for grammar
development and reduce redundant information.
Thus, these interconnected challenges underscore the difficulty and complexity in designing
comprehensive morphological models that can accurately capture and process the intricate
structure of human language.
What are various issues and challenges in Morphological Processing? (Short answer
question).
Morphological parsing, the process of identifying and analysing word structures, faces
several significant challenges.
1. Irregularity: Many languages exhibit irregular word forms that do not follow general
rules. These irregularities are particularly pronounced in languages with rich morphology,
such as Arabic or Korean, and can affect both derivation and inflection.
2. Ambiguity: Word forms can have multiple possible interpretations, leading to ambiguity.
◦Homonymy: Occurs when words share the same form (spelling and/or pronunciation)
but have distinct meanings or grammatical functions. Korean provides systematic
examples of homonyms.
◦Syncretism: A form of morphological ambiguity where different grammatical categories
or meanings are indistinguishable in their surface form.
◦Neutralization: This occurs when morphological distinctions are not explicitly reflected
in the syntactic structure, meaning a single form might represent several underlying
morphological variants.
◦Unknown Word Problem: NLP systems must be able to process words not present in
their lexicon, requiring robust morphological analysis to infer their structure and
meaning.
3. Productivity: The ability of a language to form new words continually poses a
challenge for maintaining comprehensive lexicons.
MORPHOLOGICAL MODELS
Explain various Morphological models. (Essay Question)
Morphological models are computational linguistic approaches designed to understand and
represent the complex structure of words across human languages. They are crucial for
addressing various problems in natural language processing (NLP), ranging from basic
word segmentation to more advanced semantic and syntactic analysis. Due to the inherent
complexity of human language, linguistic expressions are structured at multiple levels of
detail, making these models essential for processing.
The following are various morphological models:
1. Dictionary Lookup:
Dictionary lookup is a fundamental process in morphological analysis where word forms are
associated with their corresponding linguistic descriptions. This method relies on
precomputed data structures like lists, dictionaries, or databases, which are kept
synchronised with sophisticated morphological models.
Data Structure: Linguistic data is typically understood as a data structure that
directly enables efficient lookup operations.
Efficiency: Lookup operations can be optimised using data structures such as
binary search trees, tries, hash tables, and so on.
Limitations: The set of associations between word forms and their
desired descriptions is finite, so the generative potential of the language
is not fully exploited. Building and maintaining such resources by hand can be
tedious, error-prone, and inefficient for large linguistic resources, although
enumerative models are often sufficient for general purposes. The approach is also
less suitable for complex morphology, as seen in Korean, where dictionary-based
approaches can depend on a large dictionary of all possible combinations of
allomorphs and morphological alternations.
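Among the data structures mentioned for efficient lookup, a trie is a natural fit, since shared prefixes are stored only once. The entries below are illustrative.

```python
# Minimal trie-based dictionary lookup sketch: each stored word form
# maps to a linguistic description.
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.entry = None    # description, if a word ends here

def insert(root, word, description):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.entry = description

def lookup(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return None      # word form not in the dictionary
    return node.entry

root = TrieNode()
insert(root, "cat", "cat (singular noun)")
insert(root, "cats", "cat + s (plural noun)")
print(lookup(root, "cats"))  # cat + s (plural noun)
print(lookup(root, "dog"))   # None
```

The `None` result for "dog" illustrates the finiteness limitation: anything not explicitly enumerated is simply absent.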
2. Finite-State Morphology (FSM):
Finite-State Morphology is a widely adopted computational linguistic approach that employs
finite-state transducers (FSTs) to model and analyse word structure.
Mechanism: FSTs are directly compiled from specifications written by human
programmers. They represent the relationship between the surface form of words (how they
appear) and their underlying lexical or morphological descriptions (their internal structure
and features). An FST is based on finite-state automata, consisting of a finite set of nodes
(states) connected by directed edges. These edges are labelled with pairs of input and
output symbols, translating a sequence of input symbols into a corresponding sequence of
output symbols.
Functionality: FSTs can compute and compare regular relations, defining the relationship
between an input (surface string) and an output (lexical string, including morphemes and
features). FSM is well-suited for analysing morphological processes in various languages,
including isolating and agglutinative types. They can construct full-fledged morphological
analysers (parsing words into morphemes), morphological generators (producing word
forms from morphemes), and tokenizers.
Advantages: FSTs are flexible, efficient, and robust. They offer a general-purpose
approach for pattern matching and substitution, allowing for the building of complex
morphological analysers and generators.
Limitations: A theoretical limitation of FSTs is that they can only describe regular
languages. This is challenging for natural language phenomena that exhibit non-regular
patterns, such as certain types of reduplication.
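The transducer mechanism described above can be sketched as a tiny state-transition table mapping a surface form to a lexical string. The network below analyses only "cat"/"cats" and is purely illustrative; production systems compile large lexicons into such networks.

```python
# Toy FST sketch: state -> {input symbol: (output string, next state)}.
# The "" key models a final epsilon transition (the singular reading).
TRANSITIONS = {
    0: {"c": ("c", 1)},
    1: {"a": ("a", 2)},
    2: {"t": ("t", 3)},
    3: {"s": ("+N+PL", 4), "": ("+N+SG", 4)},
}
FINAL = {4}

def transduce(word):
    """Map a surface string to a lexical string, or None on failure."""
    state, output = 0, ""
    for ch in word:
        if state not in TRANSITIONS or ch not in TRANSITIONS[state]:
            return None
        out_sym, state = TRANSITIONS[state][ch]
        output += out_sym
    # Take a final epsilon edge if the input ended in a non-final state.
    if state not in FINAL and "" in TRANSITIONS.get(state, {}):
        out_sym, state = TRANSITIONS[state][""]
        output += out_sym
    return output if state in FINAL else None

print(transduce("cats"))  # cat+N+PL
print(transduce("cat"))   # cat+N+SG
```

Reversing the pairs on each edge turns the same network from an analyser into a generator, which is the symmetry that makes FSTs attractive.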
3. Unification-Based Morphology:
Unification-Based Morphology is a declarative approach inspired by various formal linguistic
grammars, particularly head-driven phrase structure grammar (HPSG).
Core Concept: It relies on the concept of feature structures to represent linguistic
information. These feature structures are viewed as directed acyclic graphs.
Logic Programming: The methods and concepts of unification-based formalism are
closely connected to logic programming.
Functionality: This model can manage complex and recursively nested linguistic
information, expressed by atomic symbols or more appropriate data structures.
Unification, as the key operation, merges informative feature structures, making it highly
versatile for representing intricate linguistic details.
Advantages: These models are typically formulated as logic programs and use unification
to solve constraint systems. This offers advantages such as better abstraction possibilities
for developing morphological grammars and eliminating redundant information. Unification-
based models can be implemented for various languages, including Russian, Czech,
Slovenian, Persian, Hebrew, and Arabic.
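The key operation, unification, can be sketched over flat feature structures represented as dicts. Real unification-based systems operate on recursive feature graphs; this simplified version only illustrates the core behaviour: merging compatible information and failing on conflict.

```python
# Sketch of unification over flat feature structures (dicts).
def unify(fs1, fs2):
    """Merge two feature structures; return None if they conflict."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature in result and result[feature] != value:
            return None          # conflicting values: unification fails
        result[feature] = value
    return result

noun = {"cat": "N", "num": "pl"}
agr = {"num": "pl", "case": "nom"}
print(unify(noun, agr))            # merged structure
print(unify(noun, {"num": "sg"}))  # None (number clash)
```

Constraint solving in these models is repeated unification: each morpheme contributes a partial structure, and a word form is well-formed only if all contributions unify.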
4. Functional Morphology:
Functional Morphology is a model that defines morphological operations using principles of
functional programming and type theory.
Approach: It treats morphological operations as pure mathematical functions, organising
linguistic elements as abstract models of distinct types and value classes.
Compatibility: Functional morphology definitions can be compiled into finite-state
transducers or evaluated directly in an interpreted mode.
Advantages: This approach offers greater freedom for developers to define their own
lexical constructions, leading to domain-specific embedded languages for morphological
analysis. It supports full-featured, real-world applications and promotes reusability of
linguistic data.
Applicability: It is particularly useful for fusional languages and is influenced by functional
programming frameworks like Haskell. ElixirFM, for instance, implements Arabic
morphology using this framework.
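The idea of inflection as a pure function can be sketched as a function from a stem to a paradigm (a table of forms). The Latin-style first-declension endings below are a simplified illustration of the approach, not a faithful ElixirFM fragment.

```python
# Functional-morphology sketch: a paradigm is a pure function from a
# stem to a dictionary of inflected forms. Endings are illustrative.
ENDINGS = {"nom.sg": "a", "acc.sg": "am", "nom.pl": "ae", "acc.pl": "as"}

def paradigm(stem):
    """Pure function: stem -> table of inflected word forms."""
    return {cell: stem + ending for cell, ending in ENDINGS.items()}

forms = paradigm("puell")
print(forms["acc.sg"])  # puellam
print(forms["nom.pl"])  # puellae
```

Because the function has no side effects, the same definition can be interpreted directly or compiled into a transducer, which is the compatibility property noted above.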
5. Morphology Induction:
Morphology Induction focuses on discovering and inferring word structure, moving beyond
pre-existing linguistic knowledge.
Motivation: This approach is especially valuable for languages where linguistic expertise is
limited or unavailable or for situations where an unsupervised or semi-supervised learning
method is preferred.
Process: It aims at the automated acquisition of morphological and lexical
information. Even if imperfect, this information can be used to bootstrap and
enhance classical morphological models.
Research Focus: Studies in unsupervised learning of morphology, as seen in the
works of Hammarström and Goldsmith, involve categorising approaches, comparing
and clustering words based on similarity, and identifying prominent features of word
forms.
Key Problem: Most published approaches frame morphology induction as the
problem of word boundary and morpheme boundary detection. This also includes
tasks like morphological tagging, tokenization, and normalization.
Challenges: Deducing word structure from forms and context presents several
challenges, including dealing with ambiguity and irregularity in morphology, as well
as orthographic and phonological alterations and non-linear morphological
processes.
Advancements: To improve statistical inference, methods like parallel learning of
morphologies for multiple languages have been proposed by Snyder and Barzilay.
Discriminative log-linear models, such as those by Poon, Cherry, and Toutanova,
enhance generalization by employing overlapping contextual features for
segmentation decisions.
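The boundary-detection idea can be sketched by counting candidate suffixes across a small corpus. The frequency-counting heuristic below is a deliberately simplified illustration of how recurring endings become evidence for morpheme boundaries.

```python
# Sketch of unsupervised suffix induction: count candidate suffixes
# (up to max_len characters) across a toy corpus. High-frequency
# candidates like "ed" and "s" suggest morpheme boundaries.
from collections import Counter

def induce_suffixes(words, max_len=3):
    suffixes = Counter()
    for w in words:
        for i in range(1, max_len + 1):
            if len(w) > i + 2:       # require a plausible stem remainder
                suffixes[w[-i:]] += 1
    return suffixes

corpus = ["walks", "walked", "talks", "talked", "jumps", "jumped"]
counts = induce_suffixes(corpus)
print(counts["ed"])  # 3
print(counts["s"])   # 3
```

Real induction systems (e.g. in the Morfessor family) replace raw counts with probabilistic objectives, but the evidence they exploit is the same recurrence of substrings across word forms.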
These models, while distinct, complement each other, offering various tools and
perspectives for addressing the complex task of finding and representing the structure of
words across the diverse range of human languages. The choice of model often depends
on the specific language being analysed and the desired application.
***
Short answer questions
1. What is a morpheme?
In Natural Language Processing (NLP), a morpheme is defined as the minimal part of a
word that conveys meaning. Morphemes are considered the fundamental morphological
units. They contribute to various aspects of a word's meaning and are essentially the
structural components of word forms.
2. What is Morphology?
Morphology is the study of word structure and formation. It examines how words are constructed
from smaller meaningful units called morphemes and how these units combine to form complex
words. The discovery of word structure is specifically referred to as morphological parsing.
Morphological analysis is considered an essential part of language processing, as it helps convert
diverse word forms into well-defined linguistic units with explicit lexical and morphological properties.
Understanding word structure involves identifying distinct types of units in human languages and
how their internal structure connects with grammatical properties and lexical concepts.
3. Define Morphological parsing in Natural Language Processing (NLP).
Morphological parsing in Natural Language Processing (NLP) refers to the discovery of
word structure. It is the process of identifying and analysing the constituent morphemes
within a word to understand its meaning and grammatical function.
This process is a fundamental aspect of understanding human language, which is
inherently complex and organised across multiple levels of detail. Morphology, the study of
word structure, is an essential part of language processing and is particularly significant in
multilingual settings. Morphological parsing is crucial for various NLP tasks, including
semantic and syntactic analysis.
4. What is word segmentation?
In Natural Language Processing, word segmentation is a fundamental step in
morphological analysis. It is also known as tokenization. This process is crucial and
serves as a prerequisite for most language processing applications, particularly in
languages where words are not explicitly delimited by whitespace or punctuation. For
instance, in languages like Japanese, Chinese, and Thai, words are character strings
without whitespace, and word segmentation is essential to identify the individual words.
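A classic baseline for segmenting such text is greedy maximum matching against a dictionary. The tiny lexicon below is illustrative; real segmenters for Chinese, Japanese, or Thai use large dictionaries plus statistical or neural disambiguation.

```python
# Greedy maximum-matching segmentation sketch for text written
# without whitespace: always take the longest dictionary word that
# matches at the current position (single character as a fallback).
def max_match(text, lexicon):
    words = []
    while text:
        for end in range(len(text), 0, -1):   # longest match first
            if text[:end] in lexicon or end == 1:
                words.append(text[:end])
                text = text[end:]
                break
    return words

lexicon = {"the", "cat", "sat", "on", "mat"}
print(max_match("thecatsat", lexicon))  # ['the', 'cat', 'sat']
```

Greedy matching fails when the longest match is the wrong one, which is why segmentation is treated as a disambiguation problem rather than pure lookup.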
5. How are words delimited?
The delimitation of words, often referred to as tokenization or word segmentation, varies
significantly across different languages. The following are the methods used for how words
are delimited in various linguistic contexts:
•Whitespace and Punctuation In many languages, such as English, words are primarily
delimited by whitespace and punctuation. This means that spaces and common
punctuation marks serve as explicit boundaries between individual words.
•Absence of Whitespace Delimitation In other languages, like Japanese, Chinese, and
Thai, whitespace is not used to separate words. Instead, the writing systems of these
languages present words as character strings without clear word-level delimiters. In such
cases, units that are graphically delimited are typically larger structures like sentences or
clauses.
•Concatenation and Form Changes Languages such as Arabic and Hebrew often
concatenate certain tokens with preceding or following elements. This concatenation can
lead to changes in the word forms themselves, causing the underlying lexical or syntactic
units to appear as a single, compact string of letters rather than distinct words. These
concatenated units are sometimes referred to as clitics.
6. How are words structured?
Words are the smallest linguistic units that can form a complete utterance by themselves.
Their internal structure can be modelled in relation to their grammatical properties and the
lexical concepts they represent. The discovery of this word structure is known as
morphological parsing.
The structure of words is built upon morphemes, which are defined as the minimal parts of
a word that convey meaning. These are also referred to as segments or morphs and are
considered the fundamental morphological units.
Human languages employ various methods to combine these morphs and morphemes into
complete word forms.
•The simplest method is concatenation, where morphemes are joined sequentially, such
as in "dis-agree-ment-s". In this example, "agree" is a free lexical morpheme, while
"dis-", "-ment-", and "-s" are bound grammatical morphemes that contribute partial meaning.
•In more complex systems, morphs can interact with each other, leading to
morphophonemic changes where their forms undergo additional phonological and
orthographic modifications. Different forms of the same morpheme are called allomorphs.
•Word structure is frequently described by how stems combine with root and pattern
morphemes, along with other elements that may be attached to either side.
It is important to note that some properties or features of a word may not be explicitly visible
in its morphological structure. The structural components can be associated with, and
dependent on, multiple functions concurrently, without necessarily having a singular
grammatical interpretation within their lexical meaning.
Ultimately, the way word structure is described can depend on the specific language being
analysed and the morphological theory being applied. Deducing word structure can be
challenging due to factors such as ambiguity, irregularity, and variations in orthography and
phonology.
What are the foundational concepts and methodologies for understanding word
structure across languages?
Understanding word structure across languages involves several foundational concepts and
methodologies that aim to decipher how words are built and what meanings and functions
their components convey.
Foundational Concepts of Word Structure
1.Words as Basic Linguistic Units Words are considered the smallest linguistic units
capable of forming a complete utterance by themselves. Their internal structure can be
modeled in relation to their grammatical properties and the lexical concepts they represent.
2.Morphemes The structure of words is fundamentally built upon morphemes, which are
defined as the minimal parts of a word that convey meaning. They are also referred to as
segments or morphs and are considered the elementary morphological units.
3.Combining Morphemes Human languages employ various methods to combine
morphemes into complete word forms:
◦Concatenation The simplest method is sequential joining, as seen in words like "dis-
agree-ment-s". In this example, "agree" is a free lexical morpheme (can stand alone), while
"dis-", "-ment-", and "-s" are bound grammatical morphemes (cannot stand alone) that
contribute partial meaning.
◦Morphophonemic Changes In more complex systems, morphs can interact, leading to
morphophonemic changes where their forms undergo additional phonological and
orthographic modifications. Different forms of the same morpheme are called allomorphs.
◦Stems, Roots, and Patterns Word structure is frequently described by how stems
combine with root and pattern morphemes, along with other elements that may be
attached to either side.
◦Implicit Properties It's important to note that some properties or features of a word may
not be explicitly visible in its morphological structure. Word structure components can be
associated with and dependent on multiple functions concurrently, without necessarily
having a singular grammatical interpretation within their lexical meaning.
Morphological Typologies Languages can be categorized based on how they structure
words:
◦Isolating Languages These languages (e.g., Chinese, Vietnamese, Thai) typically
have one morpheme per word.
◦Synthetic Languages These languages combine more morphemes per word than
isolating languages.
◦Agglutinative Languages A type of synthetic language (e.g., Korean, Japanese,
Finnish, Tamil), where morphemes often combine with one function at a time.
◦Fusional Languages These languages (e.g., Arabic, Czech, Latin, Sanskrit, German)
often have a feature-per-morpheme ratio higher than one, meaning a single morpheme
can convey multiple grammatical features.
◦Concatenative Languages These languages link morphs and morphemes one after
another.
◦Non-concatenative Languages These involve changing consonantal or vocalic
templates, common in Arabic.
Methodologies for Understanding Word Structure
The discovery of word structure is broadly known as morphological parsing. This process
is crucial for various Natural Language Processing (NLP) tasks, including semantic and
syntactic analysis.
1. Word Segmentation (Tokenization) This is a fundamental and prerequisite step for
most language processing applications. It involves identifying individual words within a
text.
◦Delimitation by Whitespace and Punctuation In languages like English, words are
primarily delimited by whitespace and punctuation.
◦Absence of Whitespace Delimitation In languages such as Japanese, Chinese, and
Thai, words are character strings without explicit whitespace delimiters. In these cases,
graphically delimited units are usually larger structures like sentences or clauses.
◦Concatenation Languages like Arabic and Hebrew often concatenate certain tokens with
preceding or following elements, leading to changes in word forms and appearing as a
single, compact string of letters. These are sometimes called clitics.
◦Speech/Cognitive Units In Korean, character strings are grouped into units called "eojeol"
("word segment"), which are typically larger than individual words but smaller than clauses.
2.Finite-State Morphology (FSM) FSM is a prominent computational linguistic approach
that employs finite-state transducers (FSTs) to model and analyse word structure.
◦Mechanism FSTs represent the relationship between surface word forms (how words
appear) and their underlying lexical or morphological descriptions (their internal structure
and features). They function by mapping input symbols to output symbols. An FST is based
on finite-state automata, where a finite set of nodes (states) are connected by directed
edges labeled with pairs of input and output symbols. This network translates a sequence
of input symbols into a sequence of corresponding output symbols.
◦Functionality FSTs are capable of computing and comparing regular relations. They
define the relationship between the input (surface string) and the output (lexical string,
which includes morphemes and their features). FSM is particularly well-suited for analysing
morphological processes in both isolating and agglutinative languages. It can be used to
build full-fledged morphological analysers, which identify morphemes within a word, or
generators, which produce word forms from given morphemes. It is also valuable for
constructing tokenizers.
◦Theoretical Basis Some morphological models, such as Functional Morphology, can be
compiled into finite-state transducers.
◦Limitations A theoretical limitation of FSTs is that they primarily generate regular
languages. However, some aspects of natural language, such as certain types of
reduplication, might exhibit non-regular patterns.
3.Other Morphological Models
◦Dictionary Lookup This is a process where word forms are associated with their
corresponding linguistic descriptions.
◦Unification-Based Morphology These models use feature structures to represent
linguistic information and can be based on logic programming.
◦Functional Morphology This approach defines morphological operations using principles
of functional programming and type theory, and it can be compiled into finite-state
transducers.
Issues and Challenges
Deducing word structure can be challenging due to several factors:
•Ambiguity Word forms can be understood in multiple ways or have the same form but
distinct functions or meanings (homonyms). Morphological parsing deals with disambiguating
words in their context.
•Irregularity Some word forms may not follow regular patterns and may not be explicitly
listed in a lexicon.
•Variations in Orthography and Phonology Morphophonemic changes, orthographic
collapsing, and phonological contraction can complicate the analysis of word forms.
•Complexity of Natural Language Human language is inherently complex, with structure
at multiple levels of detail; linguistic expressions are far from unorganized.
What are Allomorphs?
Allomorphs are the alternative forms of a morpheme. They represent variations of a single
morpheme that are chosen based on phonological context or other linguistic rules.