Introduction to natural language processing
R. Kibble
CO3354
2013
Undergraduate study in
Computing and related programmes
 This is an extract from a subject guide for an undergraduate course offered as part of the
 University of London International Programmes in Computing. Materials for these programmes
 are developed by academics at Goldsmiths.
 For more information, see: www.londoninternational.ac.uk
This guide was prepared for the University of London International Programmes by:
R. Kibble
This is one of a series of subject guides published by the University. We regret that due to pressure of work the author is
unable to enter into any correspondence relating to, or arising from, the guide. If you have any comments on this subject
guide, favourable or unfavourable, please use the form at the back of this guide.
The University of London and Goldsmiths assert copyright over all material in this subject guide except where otherwise
indicated. All rights reserved. No part of this work may be reproduced in any form, or by any means, without permission in
writing from the publisher. We make every effort to respect copyright. If you think we have inadvertently used your copyright
material, please let us know.
Contents

Preface
   About this half unit
   Assessment
   The subject guide and other learning resources
   Suggested study time
   Acknowledgement
...
   4.5.3 Transformation-based tagging
   4.5.4 Evaluation and performance
   Activity: Trained taggers
4.6 Summary
4.7 Sample examination question
...
A Bibliography
B Glossary
Preface
     This half unit course combines a critical introduction to key topics in theoretical and
     computational linguistics with hands-on practical experience of using existing
     software tools and developing applications to process texts and access linguistic
     resources. The aims of the course and learning outcomes are listed in Chapter 1.
     This course has no specific prerequisites. There will be some programming involved
     and you will need to acquire some familiarity with the Python language, but you will
     not be expected to develop substantial original code or to encode specialised
     algorithms. The course involves some statistical techniques, but the only
     mathematical knowledge assumed is an understanding of elementary probability
     and familiarity with the concept of logarithms.
     Before the advent of the world wide web, most machine-readable information was
     stored in structured databases and accessed via specialised query languages such as
     Structured Query Language (SQL). Nowadays the situation is reversed: most
     information is found in unstructured or semi-structured natural language documents
     and there is increasing demand for techniques to ‘unlock’ this data. Computing
     graduates with knowledge of natural language processing techniques are finding
     employment in areas such as text analytics, sentiment analysis, topic detection and
     information extraction.
Assessment
     You will be expected to provide electronic copies of your coursework for plagiarism
     checking purposes. It is very important that any material that is not original to you
     should be properly attributed and placed in quotation marks, with a full list of
     references at the end of your submission. You should follow the style used in this
     subject guide for citing references, for example:
Segaran (2007, pp.117–118) discusses some problems with rule-based spam filters.
     Answers which consist entirely or mostly of quoted material are unlikely to get many
     marks even if properly attributed, as simply reproducing an answer in someone
     else’s words does not demonstrate that you have fully understood the material.
In order to give you some practice in problem-solving and writing short essays, there
     are a number of Activities throughout this subject guide. The Appendix includes a
     section ‘Answers to selected activities’; these will not always provide complete
     answers to the questions, but are intended to indicate how particular types of
     question should be approached. Sample examination questions are provided at
    the end of each chapter. Some, but not all, of these are included in the sample
    examination paper with suggested answers at the end of the guide.
    This subject guide is not intended as a self-contained textbook but sets out specific
    topics for study in the CO3354 half unit. There is a recommended textbook and a
    number of other readings are listed at appropriate places. There are also links to
    websites providing useful resources such as software tools and access to online
    linguistic data. The learning outcomes listed in the next chapter assume that you are
    working through the recommended readings, activities and sample examination
    questions. It will not be possible to pass this half unit by reading only the subject
    guide. Please refer to the Computing VLE for other resources, which should be used
    as an aid to your learning.
Suggested study time

     The Student Handbook states that ‘To be able to gain the most benefit from the
    programme, it is likely that you will have to spend at least 300 hours studying for
    each full unit, though you are likely to benefit from spending up to twice this time’.
    Note that this subject is a half unit.
Acknowledgement
Bird, S., E. Klein and E. Loper, Natural Language Processing with Python. (O’Reilly
Media 2009) [ISBN 9780596516499; http://nltk.org/book].
You will be expected to draw on it in your studies and to use the accompanying
software package, the Natural Language Toolkit, which requires the Python
language. Natural Language Processing with Python has been made available under
the terms of the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 US
License: http://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode (last
visited 13th April 2013).
Chapter 1
     The idea of computers being able to understand ordinary languages and hold
     conversations with human beings has been a staple of science fiction since the first
     half of the twentieth century and was envisaged in a classic paper by Alan Turing
     (1950) as a hallmark of computational intelligence. Since the start of the
     twenty-first century this vision has been starting to look more plausible: artificial
     intelligence techniques allied with the scientific study of language have emerged
     from universities and research laboratories to inform a variety of industrial and
     commercial applications. Many websites now offer automatic translation; mobile
     phones can appear to understand spoken questions and commands; search engines
     like Google use basic linguistic techniques for automatically completing or
     ‘correcting’ your queries and for finding relevant results that are closely matched to
     your search terms. We are still some way from full machine understanding of natural
     language, however. Automated translations still need to be reviewed and edited by
     skilled human translators while no computer system has yet come close to passing
     the ‘Turing Test’ of convincingly simulating human conversation. Indeed it has been
     argued that the Turing Test is a blind alley and that research should focus on
     producing effective applications for specific requirements without seeking to
     generate an illusion that users are interacting with a human rather than a machine
     (Hayes and Ford, 1995). Hopefully, by the time you finish this course you will have
     come to appreciate some of the challenges posed by full understanding of natural
     language as well as the very real achievements that have resulted from focusing on a
     range of specific, well-defined tasks.
     By the end of the course, and having completed the Essential readings and
     activities, you should be able to:

      1. utilise and explain the function of software tools such as corpus readers,
        stemmers, taggers and parsers
     2. explain the difference between regular and context-free grammars and define
        formal grammars for fragments of a natural language
     3. critically appraise existing Natural Language Processing (NLP) applications such
        as chatbots and translation systems
     4. describe some applications of statistical techniques to natural language analysis,
        such as classification and probabilistic parsing.
    Each main chapter contains a list of learning outcomes specific to that chapter at the
    beginning, as well as a summary at the end of the chapter.
Reading list and other learning resources

     This is a list of textbooks and other resources which will be useful for all or most
    parts of the course. Additional readings will be given at the start of each chapter. See
    the bibliography for a full list of books and articles referred to, including all ISBNs.
    In some cases several different books will be listed: you are not expected to read all
    of them, rather the intention is to give you some alternatives in case particular texts
    are hard to obtain.
    Essential reading
     Bird, Klein, and Loper (2009): Natural Language Processing with Python. The full
        text including diagrams is freely available online at http://nltk.org/book (last
        visited 13th April 2013). The main textbook for this course, Natural Language
         Processing with Python, is the outcome of a project extending over several years
        to develop the Natural Language Toolkit (NLTK), which is a set of tools and
        resources for teaching computational linguistics. The NLTK comprises a suite of
        software modules written in Python and a collection of corpora and other
        resources. See section 1.5 below for advice on installing the NLTK and other
        software packages.
          In the course of working through this text you will gain some experience and
          familiarity with the Python language, though you will not be expected to
          produce substantial original code as part of the learning outcomes of the course.
    Recommended reading
     Pinker (2007). The Language Instinct. This book is aimed at non-specialists and
        deals with many psychological and cultural aspects of language. Chapter 4 is
        particularly relevant to this course as it provides a clear and accessible
        presentation of two standard techniques for modelling linguistic structure:
        finite-state machines and context-free grammars (though Pinker does not in fact
        use these terms, as we will see in Chapter 2 of the subject guide).
 Jurafsky and Martin (2009): Speech and Language Processing, second edition.
    Currently the definitive introductory textbook in this field, covering the major
    topics in a way which combines theoretical issues with presentations of key
    technologies, formalisms and mathematical techniques. Much of this book goes
    beyond what you will need to pass this course, but it is always worth turning to
    if you’re looking for a more in-depth discussion of any particular topics.
 Perkins (2010): Python Text Processing with NLTK 2.0 Cookbook. This book will be
    suitable for students who want to get more practice in applying Python
    programming to natural language processing. Perkins explains several
    techniques and algorithms in more technical detail than Bird et al. (2009) and
    provides a variety of worked examples and code snippets.
 Segaran (2007) Programming Collective Intelligence. This highly readable and
   informative text includes tutorial material on machine learning techniques using
   the Python language.
Additional reading
 Russell and Norvig (2010) Artificial Intelligence: a modern approach, third edition.
   This book is currently regarded as the definitive textbook in Artificial
   Intelligence, and includes useful material on natural language processing as well
   as on machine learning, which has many applications in NLP.
 Mitkov (2003) The Oxford Handbook of Computational Linguistics. Edited by Ruslan
   Mitkov. A collection of short articles on major topics in the field, contributed by
   acknowledged experts in their respective disciplines.
 Partee et al. (1990) Mathematical Methods in Linguistics. A classic text, whose
    contents indicate how much the field has changed since its publication. A book
    with such a title nowadays would be expected to include substantial coverage of
    statistics, probability and information theory, but this text is devoted exclusively
    to discrete mathematics including set theory, formal logic, algebra and automata.
    These topics are particularly applicable to the content of Chapters 2 and 6.
Websites
Introductory/Reference The Internet Grammar of English is a clear and informative
    introductory guide to English grammar which also serves as a tutorial in
    grammatical terminology and concepts. The site is hosted by the Survey of
    English Usage at University College London
    (http://www.ucl.ac.uk/internet-grammar/home.htm, last visited 27th May
    2013).
Hands-on corpus analysis
 BNCWeb is a web-based interface to the British National Corpus hosted at Lancaster
   University which supports a variety of online queries for corpus analysis
   (http://bncweb.info/; last visited 27th May 2013).
 The Bank of English forms part of the Collins Corpus, developed by Collins
   Dictionaries and the University of Birmingham. It is used as a basis for the Collins
   Advanced Learner’s Dictionary, grammars and various tutorial materials for
   learners of English. Limited online access is available at
   http://www.collinslanguage.com/wordbanks (last visited 27th May 2013).
Journals and conferences
 Computational Linguistics is the leading journal in this field and is freely available at
   http://www.mitpressjournals.org/loi/coli (last visited 27th May 2013).
 Conference Proceedings are often freely downloadable and many of these are
   hosted by the ACL Anthology at http://aclweb.org/anthology-new/ (last visited
   27th May 2013).
    This course assumes you have access to the Natural Language Toolkit (NLTK) either
    on your own computer or at your institution. The NLTK can be freely downloaded
    and it is strongly recommended that you install it on your own machine: Windows,
    Mac OSX and Linux distributions are available from http://nltk.org (last visited
    April 10th 2013) and some distributions of Linux have it in their package/software
    managers. Full instructions are available at the cited website along with details of
    associated packages which should also be installed, including Python itself which is
    also freely available. Once you have installed the software you should also download
    the required datasets as explained in the textbook (Bird et al., 2009, p. 3).
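     A minimal check that the installation has worked (a sketch: nltk.download opens
     the NLTK data downloader, and ‘book’ names the collection of data used in the
     textbook):

          import nltk
          print(nltk.__version__)   # confirms that the NLTK itself can be imported
          nltk.download('book')     # fetches the corpora and data used in Bird et al. (2009)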
    You should check the NLTK website to determine what versions of Python are
    supported. Current stable releases of NLTK are compatible with Python 2.6 and 2.7.
    A version supporting Python 3 is under development and may be available for
    testing by the time you read this guide (as of April 2013).
     This section gives a brief summary of each chapter. Learning outcomes specific to
     each chapter are listed at its beginning, and assume that you have worked through
     the recommended readings and activities for that chapter.
    This chapter looks at different approaches to analysing texts, ranging from ‘shallow’
    techniques that focus on individual words and phrases to ‘deeper’ methods that
    produce a full representation of the grammatical structure of a sentence as a
    hierarchical tree diagram. The chapter introduces two important formalisms:
    regular expressions, which will play an important part throughout the course, and
    context-free grammars which we return to in Chapter 6 of the subject guide.
    This chapter looks at the different kinds of data resources that can be used for
    developing tools to harvest information that has been published as machine-readable
    documents. In particular, we introduce the notion of a ‘corpus’ (plural corpora) – for
    the purposes of this course, a computer-readable collection of text or speech. The
    NLTK includes a selection of excerpts from several well-known corpora and we
    provide brief descriptions of the most important of these and of the different formats
    in which corpora are stored.
The previous chapter introduced some relatively superficial techniques for language
analysis such as concordancing and collocations. This chapter covers some
fundamental operations in text analysis, including tokenisation, stemming and
part-of-speech tagging.
This chapter resumes the discussion of natural language syntax that was introduced
in Chapter 2, concentrating on context-free grammar formalisms and various ways
they need to be modified and extended beyond the model that was presented in that
chapter. Formal grammars do not encode any kind of processing strategy but simply
provide a declarative specification of the well-formed sentences in a language.
Parsers are computer programs that use grammar rules to analyse sentences, and
this chapter introduces some fundamental approaches to syntactic parsing.
1.6.6 Appendices

The appendices include answers to selected activities, a sample examination paper
with suggested answers, a bibliography and a glossary.
What the course does not cover

     A number of more advanced topics are beyond the scope of this course; it should,
     however, provide you with a basis for investigating some of these
     areas for your final year project. Some of these topics are dealt with in the later
     chapters of Bird et al. (2009) and most of them are touched on by Jurafsky and
     Martin (2009).
Chapter 2
Essential reading

     Pinker (2007) The Language Instinct, Chapter 4.
Recommended reading
     Jurafsky and Martin (2009), Speech and Language Processing second edition,
     Chapters/Sections 2 ‘Regular Expressions and Automata’, 5.1 ‘(Mostly) English Word
     Classes’, 12.1 ‘Constituency’, 12.2 ‘Context-Free Grammars’, 12.3 ‘Some Grammar
     Rules for English’.
Additional reading
     By the end of this chapter, and having completed the Essential reading and activities,
     you should be able to:
           explain the concept of finite state machines (FSMs) and their connections with
           regular expressions; work through simple FSMs
           write regular expressions for well-defined patterns of symbols
           analyse sentences in terms of parts of speech (POS) and constituent structure,
           including the use of tree diagrams
           write regular and context-free grammars for small fragments of natural language
           explain the concept of stemming and specify word-formation rules for given
           examples.
2.2 Introduction
     By text we mean words that are written or printed on a flat surface (paper, card,
     street signs and so on) or displayed on a screen or electronic device in order to be
     read by their intended recipient (or by whoever happens to be passing by).
     This course will focus on text in this sense: we will be concerned with various
     ways in which computer systems can analyse and interpret texts, and we will assume
     for convenience that these texts are presented in an electronic format. This is of
     course quite a reasonable assumption, given the huge amount of text we can access
     via the World Wide Web and the increasing availability of electronic versions of
     newspapers, novels, textbooks and indeed subject guides. This chapter introduces
     some essential concepts, techniques and terminology that will be applied in the rest
     of the course. Some material in this chapter is a little technical but no programming
     is involved at this stage.
     We will begin in section 2.3 by considering texts as strings of characters which can
     be broken up into sub-strings, and introduce some techniques for informally
     describing patterns of various kinds that occur in texts. Subsequently in section 2.4
     we will begin to motivate the analysis of texts in terms of hierarchical structures in
     which elements of various kinds can be embedded within each other, in a
     comparable way to the elements that make up an HTML web document. This section
     introduces some technical machinery: finite-state machines (FSMs), regular
     expressions, regular grammars and context-free grammars.
     One of the more basic operations that can be applied to a text is tokenising:
     breaking up a stream of characters into words, punctuation marks, numbers and
     other discrete items. A character string such as an ordinary English sentence can
     thus be turned into a sequence of tokens, conventionally displayed with each token
     enclosed in single quotation marks.
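     For example, using the NLTK’s tokeniser (a minimal sketch: the sample sentence is
     invented, and the tokenisation models must already have been downloaded):

          import nltk

          tokens = nltk.word_tokenize("The cat sat on the mat, didn't it?")
          print(tokens)
          # ['The', 'cat', 'sat', 'on', 'the', 'mat', ',', 'did', "n't", 'it', '?']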
     At this level, words have not been classified into grammatical categories and we
     have very little indication of syntactic structure. Still, a fair amount of information
     may be obtained from relatively shallow analysis of tokenised text. For example,
     suppose we want to develop a procedure for finding all personal names in a given
     text. We know that personal names always start with capital letters, but that is not
     enough to distinguish them from names of countries, cities, companies, racehorses
and so on, or from capitalisation at the start of a sentence. Some additional ways to
identify personal names include: a preceding title such as ‘Mr’ or ‘Mrs’; a following
comma and a number giving the person’s age; or a following reporting verb such as
‘said’. We can express these more concisely as follows, where | is the disjunction symbol,
Word stands for a capitalised word and Int is an integer:

     (Mr | Mrs) Word
     Word , Int
     Word (said | reported | claimed)
Learning activity
 1. Write down your own examples of names that match each of the above patterns.
 2. Pick a newspaper article or webpage that provides a variety of examples of people’s names. Do they
    match the patterns we have encoded above? If not, see if you can devise additional rules for
    recognising names and write them out in a similar format.
     Bird et al. (2009, pp. 184–5) make the standard distinction that nouns ‘generally
     refer to people, places, things or concepts’ while verbs ‘describe events or actions’.
     This may be helpful when one is starting to learn grammatical terminology but is
     something of an over-simplification. One can easily find or construct examples
     where the same concept can be expressed by a noun or a verb, or by an adjective or
     an adverb. And on the other hand, there are many words that can take different
     parts of speech depending on what they do in a sentence: walk, for example, can be
     a verb (they walk to work) or a noun (a brisk walk).
     Additionally, some types of verbs do not correspond to any particular action but
     serve a purely grammatical function: these include the auxiliary verbs such as did,
     shall and so on. So in summary, we can often only assign a part of speech to a word
     depending on its function in context rather than how it relates to real things or
     events in the world.
Learning activity
     You will have noticed several recurring patterns in the above examples: Det Noun,
     Prep Det Noun and so on. You may also have noticed that some types of phrase can
     occur in similar contexts: (John | the cat) sat, a Proper Noun or a sequence Det Noun
     can come before a Verb. Some of these possibilities can be captured using the
     pattern-matching notation introduced above, for example:

          ( (John | Jack | Susan) | ((a | the) (cat | dog)) ) (barked | slept)

     This will match any sequence which ends in a verb barked or slept, preceded either
     by a Determiner a or the followed by a Noun cat or dog, or by a proper name John,
     Jack or Susan.
     Patterns that have similar distributions (meaning that they can occur in similar
     contexts) are standardly identified by phrasal categories such as Noun Phrase or Verb
Phrase. Possible structures for these categories can be specified by production rules
such as:
Noun Phrase → Noun Phrase, Conj, Noun Phrase (Examples: Jack and Jill, the owl and the pussycat)
Verb Phrase → Verb, Preposition, Noun Phrase (Examples: went up the hill, sat on the mat)
Learning activity
Read through the recommended sections of the UCL ‘Internet Grammar of English’. Write production rules
that cover some of the examples in these sections.
This section aims to motivate the idea that texts can be analysed as hierarchical
structures rather than ‘flat’ sequences whose elements are organised in various
patterns. The Essential reading for this chapter by Steven Pinker gives a concise and
accessible introduction to some fundamental distinctions we will make in this
section, from the point of view of Chomskyan linguistics (compare Chomsky,
1957/2002). Chomsky and his followers argue that some components of our
knowledge of language are innate, and Pinker (2007, chapter 4) sketches some
arguments in support of this claim. This position is considered to be contentious by
many linguists and we will not address it in this course. However, Pinker’s chapter
provides a useful introduction to syntactic analysis and clearly distinguishes between
two formal techniques for modelling grammatical knowledge, which underlie
regular and context-free grammars respectively (these terms will be explained as
we go along).
Learning activity
If you have access to it, read through the recommended chapter by Pinker and make notes, and have it to
hand while working through the remainder of this section.
Pinker notes that language makes ‘infinite use of finite means’, in Humboldt’s
phrase.1 That is, there is no principled upper limit to the length of a grammatical
sentence: we can always add another phrase, even if it’s a banal one like ‘one could
say that’, ‘and that’s a fact’ or ‘and you can tell that to the Marines’. A large
proportion, perhaps most, of the sentences we read, hear or speak every day may be
entirely novel, at least to us. Consequently, knowledge of a language seems to
consist in knowing rules that specify what sentences belong to the language, rather
than memorising long lists of sentences to be produced on appropriate occasions.

   1 ‘Sie [die Sprache] muss daher von endlichen Mitteln einen unendlichen Gebrauch machen’ (von Humboldt, 1836).
     Note that Pinker deliberately uses the more descriptive expression ‘wordchain’ as he
     is concerned to avoid the use of forbidding technical terminology. In what follows
     we will stick to the standard term finite-state machine which you are more likely to
     find in textbooks. You may also encounter the terms finite-state automaton or just
     finite automaton.
Word lists
      1. The, a, one
      2. Cat, dog, fish
      3. Barked, slept, swam
Rules

 1. Pick one word from list 1.
 2. Pick one word from list 2.
 3. Pick one word from list 3.
     It is important to keep in mind that FSMs are neutral between accepting and
     generating strings. That is to say, one way to operate a FSM is to read a string, one
     symbol at a time, and determine whether the symbol is found in the list at the
     current state of the machine. If it is, we advance to the next state and read the next
     symbol. Alternatively, this FSM could be used to generate strings by picking one
     word from each list in sequence. Some possible matching strings are: ‘the cat
     barked’, ‘a dog slept’, ‘one fish swam’. A more elaborate word-chain device, with
     proper names and optional loops, can be written out as follows:
      1. John/Mary/Fred OR
         1a. the/a/one
         1b. cat/dog/fish
 2. (optional): and/or GO TO 1
 3. slept/barked/swam OR
    3a. sat/walked
    3b. on
    3c. a/the
    3d. mat/hill
 4. (optional) and/or GO TO 3
 5. (optional) and/or/but GO TO 1.
This formulation involves the basic operations of sequence, selection and iteration as
follows:
SEQUENCE
Working through the numbered steps in a fixed order: a name, or a determiner
followed by a noun, and then a verb.
SELECTION
Choosing an item from a list, for example cat, dog or fish; choosing between lists.
ITERATION
Repeating part of the sequence, as in the optional GO TO steps, so that strings of
indefinite length can be generated.
Learning activity
     A finite-state machine can be specified by a state-transition table, in which each row
     gives a current state, an input word and the state that results from reading that
     word. For example (q1 is the start state and q4 the final state):
      q1       john        q2
      q1       mary        q2
      q1       the         q1a
      q1       a           q1a
      q1a      cat         q2
      q1a      dog         q2
      q2       slept       q3
      q2       barked      q3
      q2       swam        q3
      q3       and         q1
      q3       or          q1
      q3       .           q4
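     This table can be run directly as a small program (a minimal sketch; the dictionary
     encoding and the accepts function are my own):

          # transition table as a dictionary: (state, word) -> next state
          transitions = {
              ('q1', 'john'): 'q2', ('q1', 'mary'): 'q2',
              ('q1', 'the'): 'q1a', ('q1', 'a'): 'q1a',
              ('q1a', 'cat'): 'q2', ('q1a', 'dog'): 'q2',
              ('q2', 'slept'): 'q3', ('q2', 'barked'): 'q3', ('q2', 'swam'): 'q3',
              ('q3', 'and'): 'q1', ('q3', 'or'): 'q1', ('q3', '.'): 'q4',
          }

          def accepts(tokens, start='q1', final='q4'):
              """Return True if the FSM reaches the final state on this input."""
              state = start
              for token in tokens:
                  state = transitions.get((state, token))
                  if state is None:
                      return False        # no transition for this word: reject
              return state == final

          print(accepts(['the', 'cat', 'barked', '.']))                  # True
          print(accepts(['mary', 'swam', 'and', 'john', 'slept', '.']))  # True
          print(accepts(['cat', 'the', 'barked', '.']))                  # False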
     The same three operations underlie regular expressions (REs), a standard notation
     for specifying patterns of symbols:

     sequence – to do with the order in which items occur: may include a wildcard
        character which is written as the period or full stop ‘.’ and may be replaced by
        any character.
     selection – specifying a choice between alternative items or sequences, indicated by
         the ‘|’ operator
     iteration – repetition of items or sequences, indicated by the ‘*’ operator, meaning
         zero or more occurrences of whatever precedes the star.
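     A few minimal examples using Python’s re module (the patterns and strings are
     invented; re.fullmatch requires Python 3):

          import re

          print(bool(re.fullmatch(r"ab*", "abbb")))     # sequence and iteration: 'a' then zero or more 'b's
          print(bool(re.fullmatch(r"(a|b)*", "abba")))  # selection and iteration: any string of a's and b's
          print(bool(re.fullmatch(r"a.b", "axb")))      # the wildcard '.' matches any single character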
Programming languages such as Java, Perl and Python implement extensions of REs
with operators which are mostly redundant in that they can be reduced to
combinations of the above operations, but can make programs much more compact
and readable, including: ‘+’ (one or more occurrences), ‘?’ (zero or one occurrence),
character classes such as [A-Z] (any capital letter) and bounded repetition such as
{2,4} (between two and four occurrences).
See also Bird et al. (2009, Table 3.3) and the other recommended readings on this
topic.
Here are some examples of our suggested ways of recognising personal names coded
as regular expressions. These are intended to be applied to tokenised text and every
sequence enclosed by angled brackets < . . . > stands for an individual token. In
Examples 1, 3 and 4 below, the material within parentheses represents the target
string and sequences outside parentheses provide the context.
 1. <Mrs?>(<.+>) matches ‘Mr’ or ‘Mrs’ followed by any string. The first token
    consists of the sequence Mr followed optionally by the character s. The second
    consists of a sequence of one or more characters: any character may occur in
    place of the wildcard ‘.’.
 2. <[A-Z][a-z]+>+ matches any sequence of one or more capitalised words.
 3. (<[A-Z][a-z]+>+)<,><[0-9]+> matches capitalised word(s) followed by a
    comma and a number (age).
 4. (<[A-Z][a-z]+>+)<said|reported|claimed> matches capitalised word(s)
    followed by a reporting verb.
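These token-bracket patterns can be tried out with the NLTK’s Text.findall method
(a sketch; the sample sentence is invented):

     import nltk
     from nltk.text import Text

     text = Text(nltk.word_tokenize("Mr Jones said that Mrs Smith, 61, resigned."))
     text.findall(r"<Mrs?> (<.+>)")                             # prints: Jones; Smith
     text.findall(r"(<[A-Z][a-z]+>+) <said|reported|claimed>")  # prints: Mr Jones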
Learning activity
 1. Write a regular expression for all strings consisting of an odd number of a’s followed by an even
    number of b’s.
 2. Write a regular expression for all sequences of a’s and b’s of length 3.
 3. Write a regular expression for all strings that contain abba somewhere within them.
     Returning to the word-chain examples, the whole language can be written as a
     single regular expression, which matches strings such as those listed below:

          ( (John|Mary|Fred) | ((the|a) (cat|dog|fish)) )
          (barked | slept | swam)
          ((and | or) (barked | slept | swam))*
 1. John slept
 2. The cat barked or swam
 3. Mary swam and barked or slept
4. . . .
     It can be seen that even conceptually simple REs can rapidly become almost
     unreadable. A more manageable formalism is a regular grammar, made up of
     production rules or rewrite rules of the kind you have seen in the previous section:
S → John | Mary | Fred VP
S → the | a S1
S1 → cat | dog | fish VP
VP → barked | slept | swam VP1
VP1 → and | or VP
VP1 → ε
     The sentence the cat barked or swam is then assigned the following right-branching
     structure, shown here in bracketed notation with ε marking the empty VP1:

          (S the
            (S1 cat
              (VP barked
                (VP1 or
                  (VP swam
                    (VP1 ε))))))
     Symbols which only occur on the right-hand side of rules, and so can only appear as
     leaf-nodes in a tree, are known as terminal symbols. Regular grammars have the
     restriction that when non-terminal symbols appear on the RHS they must either
     always be the rightmost symbol, or always the leftmost. These classes of grammars
     are known as right-linear and left-linear respectively. A right-linear grammar will
     always result in a right-branching tree as in the above example.
Learning activity
Draw tree diagrams according to the above grammar for the sentences:
     Either a happy girl eats ice cream or the boy eats hot dogs.
     *Either a happy girl eats ice cream then one dog eats candy.
     If a girl eats ice cream then the boy eats candy.
     *If a girl eats ice cream or the boy eats candy.
Learning activity
 1. Write a regular grammar that is equivalent to the FSM in Pinker (2007, p. 86).
 2. Convince yourself that it allows you to draw well-formed trees for the ungrammatical examples above.
 3. What characteristic of the grammar prevents it from ruling out the ill-formed examples?
In order to handle these kinds of sentences correctly we need to add new kinds of
rewrite rules, going beyond the class of right- or left-linear grammars:
 1. To match pairs of words like if . . . then, either . . . or, we need rules where a
    non-terminal symbol on the RHS can have additional material on both sides as
    in the first two rules below.
 2. In order to allow for indefinite nesting – if either John will come if Mary does, or
    . . . we need rules where the same symbol can occur on both sides of the arrow.
    This is known as self-embedding or centre-embedding.
S → Either S or S
S → If S then S
S → NP VP
NP → Det N
Det → a | the | ε
N → cat | mat | boy | girl | cream | burgers
V → sat | likes | eats
VP → V NP
VP → V PP
PP → P NP
P → on
     The above grammar handles these cases correctly as well as simple sentences like
     The cat sat on the mat:
Figure 2.2: Tree diagram for The cat sat on the mat.
     (S
       (NP (Det the) (N cat))
       (VP (V sat)
         (PP (P on)
           (NP (Det the) (N mat)))))
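     The grammar can also be tried out mechanically with the NLTK’s chart parser (a
     sketch: the rules above are transcribed into NLTK’s notation, with bare nouns
     handled by NP → N rather than an empty Det):

          import nltk

          grammar = nltk.CFG.fromstring("""
          S -> 'If' S 'then' S | 'Either' S 'or' S | NP VP
          NP -> Det N | N
          VP -> V NP | V PP
          PP -> P NP
          Det -> 'a' | 'the'
          N -> 'cat' | 'mat' | 'boy' | 'cream' | 'burgers'
          V -> 'sat' | 'likes' | 'eats'
          P -> 'on'
          """)
          parser = nltk.ChartParser(grammar)
          for tree in parser.parse('the cat sat on the mat'.split()):
              print(tree)   # prints the tree shown in Figure 2.2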
Figure 2.3: Tree diagram for If the cat likes cream then the boy eats burgers, shown
here in bracketed notation (ε marks the empty Det):

     (S If
       (S (NP (Det the) (N cat))
          (VP (V likes) (NP (Det ε) (N cream))))
      then
       (S (NP (Det the) (N boy))
          (VP (V eats) (NP (Det ε) (N burgers)))))
Learning activity
 1. Trace through the grammar rules and satisfy yourself that Figure 2.3 represents the structure of the
    sentence If the cat likes cream then the boy eats burgers according to the grammar.
 2. What is the shortest sentence generated by the above grammar?
 3. Using the above grammar, draw complete tree diagrams for:
        (a) If the girl likes cake then either the boy eats burgers or the boy eats candy.
     (b) If either the boy likes cake or the girl likes burgers then the dog eats candy.
 4. Think of ways to modify the grammar to generate more natural-sounding sentences.
     Words combine in different orders to form sentences and phrases; they also have
     internal structure. Nouns in English may have different endings according to
     whether they are singular (a box) or plural (some boxes) while in some languages
     this information may be expressed at the start of the word, for example Swahili ziwa
     (‘lake’) vs maziwa (‘lakes’). In English, endings of verbs can indicate person, number,
     tense and mood, while other languages may make different distinctions. Nouns and
     verbs are sometimes classified as regular or irregular according to whether their
     inflected forms can be derived by following simple rules. Table 2.2 shows examples
     of some common past tense forms in English.
      Present                Past
      become                 became
      come                   came
      mistake                mistook
      misunderstand          misunderstood
      ring                   rang
      sell                   sold
      shake                  shook
      sing                   sang
      sink                   sank
      stand                  stood
      take                   took
      tell                   told
      travel                 travelled
      understand             understood
      withstand              withstood
Some of these rules could be made more general: we could combine the -ing and
-ink rules into a single rule, -in → -an. On the other hand, some rules which work for
these particular examples would fail if applied to a wider range of data: we have
come → came, become → became but not welcome → *welcame. This is an example of
rules overfitting the data.
Learning activity
Modify the above rules so that they will account for the past tense forms in Table 2.3 as well as in Table 2.2.
 Present          Past
 bake             baked
 command          commanded
 bring            brought
 sling            slung
 smell            smelt
 think            thought
 wake             woke
The process of removing affixes from words to derive the basic form is called
stemming. We will look at some tools for doing this in Chapter 4, and you will also
have the opportunity to encode your own rules as regular expressions.
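As a foretaste, the past-tense rules above can be encoded as regular expression
substitutions (a minimal sketch; the rule set is deliberately tiny and will overfit, as
discussed above):

     import re

     # (pattern, replacement) rewrite rules for forming past tenses
     rules = [
         (r"ing$", "ang"),      # ring -> rang, sing -> sang
         (r"ink$", "ank"),      # sink -> sank
         (r"ake$", "ook"),      # take -> took, shake -> shook
         (r"stand$", "stood"),  # understand -> understood
     ]

     def past_tense(verb):
         for pattern, replacement in rules:
             if re.search(pattern, verb):
                 return re.sub(pattern, replacement, verb)
         return verb + "ed"     # default rule for regular verbs

     print(past_tense("sing"))  # sang
     print(past_tense("take"))  # took
     print(past_tense("walk"))  # walked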
2.6   A brief history of natural language processing

     The field of natural language processing has developed mainly out of theoretical
     linguistics and computer science, but with some input from mathematical
     logic and psychology.
     The notions of a finite-state machine and context-free grammar (CFG) were first
     introduced to linguistics by Chomsky (1957; see Pullum (2011) for a somewhat
     critical reappraisal). Chomsky argued that both formalisms were inadequate for
     modelling natural language and proposed an additional operation of
     transformations, which could essentially permute the output string of a CFG in
     various ways. Chomsky’s work introduced a methodology which was to dominate
     theoretical linguistics for the next couple of decades: linguists concentrated on
     postulating formal rules of grammar which were tested against their own intuitions
     or those of native speakers of other languages, rather than seeking to induce rules
     from large collections of data. Part of the rationale for this was that native speakers
     of a language are able to recognise whether a sequence of words makes up an
     acceptable sentence in their language, even if they have never encountered those
     words in that particular order before. Prior to what was to become known as the
     Chomskyan revolution, corpus-based approaches had been the norm in general
     linguistics. This tradition was overshadowed for a time by so-called ‘generative’
     linguistics, but corpus-based research continued in some quarters until its resurgence
     in the 1980s, including the development of the first machine-readable corpus by the
     Jesuit priest Fr Robert Busa. Busa developed a 10 million-word corpus of medieval
     philosophy on punch-cards, with the support of Thomas Watson of IBM (McEnery
     and Wilson, 2001, pp. 20–21).
This resurgence was supported by the development of large machine-readable
resources such as the Penn Treebank or the British National Corpus, and events
such as the Message Understanding Conferences, which were initially sponsored
by the US Department of Defense to measure and foster progress in extracting
information from unstructured text.
Much work since around the year 2000 has involved the use of machine learning
techniques such as Bayesian models and maximum entropy (see Chapter 5). This
has involved using annotated corpora to train systems to segment and annotate texts
according to various morphological, syntactic or semantic criteria. These techniques
have been systematically applied to particular tasks such as parsing, word sense
disambiguation, question answering and summarisation.
2.7    Summary
 1. This chapter has characterised the subject matter of the course as being
    concerned with various ways of using computer programs to analyse text, by
    which we mean words, numbers, punctuation and other meaningful symbols
    that are printed on paper or some other flat surface, or displayed on a screen.
 2. Some fundamental operations in text analysis include tokenisation, which
    involves extracting these meaningful elements from a stream of electronic
    characters and discarding white space, line feed characters and other material
    which has no explicit meaning for a human reader, and using regular
    expressions to identify patterns in a text.
 3. Regular expressions are composed of the three basic operations of sequence,
    selection and iteration, and have many applications in computational linguistics
    and computer science at large. A finite-state machine is a process whose
    operations can be specified by means of regular expressions. A regular grammar
    is a set of production rules or rewrite rules that defines the sentences that make
    up a language, and any language defined by a regular grammar can be
    processed by a finite state machine or described using a regular expression.
 4. A complete syntactic analysis of natural language sentences is generally held to
    require the additional operation of centre-embedded recursion, which is beyond
    the power of finite-state methods. Recursion is formally encoded in context-free
    grammars.
 5. Not only do words combine in various patterns and structures to form sentences;
    they also have internal structure which can be described to an extent using rules
    for regular and irregular forms.
 6. The current state of NLP or computational linguistics builds on research results
    and concepts from many different fields, and we have sketched some of the
    highlights in a very short history of the discipline.
2.8   Sample examination questions
     1. S → NP VP
         NP → Det N
         NP → PN
         VP → V
         VP → V NP
         VP → V NP PP
         VP → VP Adv
         PP → P NP
         Det → the | a
         N → waiter | chairs | tables | hotel | manager
         PN → Oscar | Paris
         V → died | put | saw | called
         Adv → suddenly | quickly | slowly
         P → in | on
         (a) Using the grammar rules above, draw syntax trees for:
                i. Oscar died suddenly.
               ii. The waiter put the chairs on the tables.
              iii. Oscar called the waiter.
         (b) Modify the grammar so that it generates the unstarred sentences below as
             well as (i–iii) above but not the starred ones. Explain the reasons for your
             modifications.
                i.   Oscar died in Paris.
               ii.   Oscar died in a hotel in Paris.
             iii.    The waiter came to the table when Oscar called him.
              iv.    When Oscar called him the waiter came to the table.
                v.   * Oscar put
              vi.    * The waiter saw on the tables
             vii.    * The waiter put in the chairs
            viii.    * The waiter put the chairs
              ix.    * Oscar died the table
               x.    * When Oscar called him when the waiter came to the table.
     2. Write a regular expression that will identify male and female names in context,
        in an English-language text. Discuss ways in which this might over- or
        under-generate.
     3. Explain the difference between regular and context-free grammars and discuss
        the claim that natural language grammars need at least context-free power.
     4. (a) Write a regular grammar which generates the following sentences:
                i. This is the kid that my father bought.
               ii. This is the cat that killed the kid that my father bought.
              iii. This is the dog that worried the cat that killed the kid that my father
                   bought.
               iv. This is the stick that beat the dog that worried the cat that killed the kid
                   that my father bought.
               (Brewer’s Dictionary of Phrase and Fable, 1896)
         (b) Write out three more sentences generated by your grammar.
Chapter 3
Essential reading
     Bird et al. (2009) Natural Language Processing with Python Chapters 1 and 2
     particularly: 1.1–1.3, 2.1–2.2, 2.5.
Recommended reading
Additional reading
     McEnery and Hardie (2011) Corpus Linguistics: Method, Theory and Practice. Chapter
     3 addresses the important topic of research ethics for corpus linguistics, which is
     often neglected in technical or academic presentations of the subject.
     By the end of this chapter, and having completed the Essential reading and activities,
     you should be able to:
     As you work through the textbook you should try the ‘Your Turn’ practical
     exercises that appear throughout it. You may also find it useful to attempt some of
     the exercises provided at the end of each chapter.
     From this chapter onwards you will be running Python sessions and using the NLTK.
     You should get into the habit of starting sessions with the following commands:
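     Following Bird et al. (2009), for example:

          >>> import nltk, re, pprint
          >>> from nltk.book import *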
     One of the features that makes the Python language suitable for natural language
     applications is the very flexible treatment of data structures such as lists, strings and
     sequences. You should be familiar with these structures from previous programming
     courses, but should ensure you understand the way they are handled in Python. For
     this chapter, only lists are relevant and you should study Bird et al. (2009, section
     1.2) before trying any of the learning activities in this chapter.
     For example, Bird et al. (2009, pp. 407–412) refer to the TIMIT corpus, an annotated
     speech corpus developed by Texas Instruments and MIT. To ensure
     representativeness, it was designed to include a wide coverage of dialect variations.
     Corpus builders need to exercise expert judgment in deciding on the sampling frame,
or ‘the entire population of texts from which we will take our samples’ (McEnery and
Wilson, 2001, p. 78) and calculating the size of the corpus that is necessary for it to
be maximally representative. The sampling frame may, for example, be bibliographic,
based on some comprehensive index or the holdings of a particular library, or
demographic, selecting informants on the basis of various social categories as is often
done in public opinion research.
Corpora have tended to be of a finite size and to remain fixed once they have been
compiled. There are also what are known as monitor corpora, which are continually
updated with new material. This is particularly useful for compilers of dictionaries,
who need to be able to track new words entering the language and the changing or
declining use of old ones. An example of a monitor corpus is the COBUILD Bank of
English™, which had around 300 million words when it was referred to by McEnery
(2003) and has since more than doubled in size to 650 million words.
A further distinction is between corpora consisting solely of the original or ‘raw’ text
and those that have been marked up with various annotations. One common
technique is standoff annotation where the mark-up is stored in a different file from
the original text (McEnery and Wilson, 2001, p.38); (Bird et al., 2009, p.415).
Some examples of corpora, which will be described in more detail later in the
chapter, are: the Brown corpus, the British National Corpus, the Penn Treebank, the
NLTK’s Project Gutenberg selection and the Reuters corpus.
McEnery and Wilson (2001, Chapter 4) discuss a variety of uses for corpora, some of
which are briefly discussed below.
3.4.1 Lexicography
     Modern dictionaries such as Chambers, Collins and Longmans now rely heavily on
     corpus data in order to classify and inventorise the various ways words can be used
     in contemporary English as well as any ways these uses may have changed. For
     example, a lexicographer who wishes to determine whether the words scapegoat,
     thermostat or leverage can be used as verbs can enter the appropriate search string
     and be presented with attested examples retrieved from the BNC.
     McEnery and Wilson (2001, p. 107) discuss a case where they claim that two
     well-known dictionaries had ‘got it wrong’ by listing quake as a solely intransitive
     verb, while examples in a transitive construction can in fact be found through a
     corpus search.
3.4.2 Grammar

     Large-scale grammars for pedagogic and reference use such as the Comprehensive
     Grammar of the English Language (Quirk et al., 1985) or the Cambridge Grammar of
     the English Language (Huddleston and Pullum, 2002) use corpora among their
     sources of information along with results of linguistic research and the compilers’
     own subjective intuitions as competent speakers of the language, although this has
     tended to involve qualitative rather than quantitative analysis. Recent advances in
     computational power and developments in parsed corpora and tools to analyse them
     have made it possible for researchers to carry out quantitative studies of various
     kinds of grammatical frequency, such as the relative frequency of different clause
     types in English. Other studies use corpora to test predictions made by formal
     grammars that have been developed within the generative school of linguistics. The
     COBUILD project which provided the resources for Collins English dictionaries has
     also resulted in a series of small handbooks covering various kinds of grammatical
     construction, which are useful both for advanced language learners and for linguists
     in search of examples.
3.4.3 Stylistics and authorship identification

     The notion of style in communication is that people generally have a choice between
     different ways of expressing themselves and not only do individuals tend to make
similar choices each time they communicate, but their particular choices may be
more characteristic of particular genres (romantic fiction, financial news, court
reports and so on), time periods and channels of communication. By channels we
mean distinctions such as written text compared with spoken discourse, both of
which can be further subdivided: people will make different choices when
composing emails, sending text messages or (rarely) writing a letter by hand. We
probably also adopt different styles when talking face-to-face and on the telephone.
Literary scholars as well as law enforcement and intelligence agencies may have an
interest in identifying the author of a document from internal evidence. There have
been various famous and less well-known instances of controversial attribution of
authorship: for example, various figures have been put forward as the authors of
Shakespeare’s plays.
3.4.4 Training and evaluating NLP systems

In addition to the applications listed above, corpora are routinely used in linguistic
research for training and testing machine-learning systems for specific tasks in text
analytics such as document classification, part-of-speech tagging and text
segmentation.
For example, the Brown corpus and the WSJ corpus are typically used for evaluating
text segmentation among other text processing tasks (Mikheev, 2003, p. 203).
These topics will be covered in more detail in Chapter 5 of this subject guide, where
you will be introduced to various machine-learning techniques. These will all be
types of supervised learning, where a system is trained on ‘corpora containing the
correct label for each input’ (Bird et al., 2009, p. 222), as opposed to unsupervised
learning where the system is designed to detect patterns in the input without any
feedback. This normally means that the corpus has been marked up by human
annotators. Standard practice is to divide a corpus into training and test sets; the
test set is considered a gold standard for comparing the accuracy of trained learning
systems with that of the annotators. Of course humans are fallible, and it is good
practice to use multiple annotators for at least a sample of the corpus and report the
level of inter-annotator agreement that was achieved. This will set an upper bound
for the performance that can be expected from the system, as it seems meaningless
to say that a computer program can achieve 100 per cent accuracy on tasks where
human annotators disagree (see Jurafsky and Martin, 2009, p. 189).
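The mechanics of a training/test split are straightforward (a minimal sketch,
training a simple NLTK tagger on the Brown corpus; assumes the corpus data has
been downloaded):

     import nltk
     from nltk.corpus import brown

     tagged = brown.tagged_sents(categories='news')
     cut = int(len(tagged) * 0.9)          # 90 per cent for training
     train, test = tagged[:cut], tagged[cut:]

     tagger = nltk.UnigramTagger(train)
     print(tagger.evaluate(test))          # accuracy against the gold-standard test set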
3.5 Corpora
This section provides brief descriptions of various corpora, some of which are
distributed in full or in part with the NLTK and others can be accessed online.
3.5.1 The Brown corpus

     This was one of the first ‘large-scale’ machine-readable corpora, though it may seem
     rather small by today’s standards, consisting of one million words. It was developed
     at Brown University from the early 1960s and took around two decades to complete.
     It was intended as a ‘standard corpus of present-day edited American English’ and is
     categorised by genre under headings such as news, editorial, reviews, religion,
     learned, fiction and humour.
     The Brown corpus is provided with the NLTK in tagged and untagged versions and
     can be accessed using the various methods listed by Bird et al. (2009, Table 2.3,
     p. 50).
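     For example (a sketch; assumes the corpus data has been downloaded via
     nltk.download):

          from nltk.corpus import brown

          print(brown.categories())                          # the genre categories
          print(brown.words(categories='news')[:10])         # the untagged version
          print(brown.tagged_words(categories='news')[:5])   # the tagged version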
3.5.2 The British National Corpus

     The BNC is created and managed by the BNC consortium, which includes Oxford
     and Lancaster universities, dictionary publishers OUP, Longmans and Chambers, and
     the British Library. It was developed between 1991 and 1994 and consists of 100
     million words: 90 per cent written and 10 per cent transcriptions of speech. This
     was one of the first corpora to include spontaneous spoken English. The corpus was
     marked up using an automated part-of-speech tagger which resulted in a significant
     saving of time and expense compared with manual annotation by competent
     speakers of the language, but means that there is inevitably a degree of error – as
     you may discover in the course of the exercise given later in this chapter.
You can access this corpus online and perform various kinds of analysis using the
Simple Query language. Registration is required, but there is currently no charge.
The Penn Treebank, with its various offshoots, is one of the most widely used
linguistic resources among empirical researchers.
The Penn Treebank project . . . has produced treebanks from the Brown, Switchboard, ATIS and Wall Street
Journal corpora of English, as well as treebanks in Arabic and Chinese.
The project began at the University of Pennsylvania in the 1990s and the results
have been used as a basis for further annotation efforts involving semantics and
rhetorical structure. The NLTK includes a selection from the Wall Street Journal
(WSJ) component of the Treebank, which can be accessed in raw, tagged, parsed
and combined formats, and additionally with a simplified POS tagset (Bird et al.,
2009, Table 5-1, p. 183). Here is an excerpt showing the four different formats:
Raw text
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Tagged
[ Pierre/NNP Vinken/NNP ]
,/,
[ 61/CD years/NNS ]
old/JJ ,/, will/MD join/VB
[ the/DT board/NN ]
as/IN
[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ]
./.
Parsed
( (S
     (NP-SBJ
       (NP Pierre Vinken)
       ,
       (ADJP
         (NP 61 years)
         old)
       ,)
     (VP will
       (VP join
         (NP the board)
         (PP-CLR as
           (NP a nonexecutive director))
         (NP-TMP Nov. 29)))
     .))
Combined
     ( (S
              (NP-SBJ
               (NP (NNP Pierre) (NNP Vinken) )
               (, ,)
               (ADJP
                 (NP (CD 61) (NNS years) )
                 (JJ old) )
               (, ,) )
             (VP (MD will)
               (VP (VB join)
                 (NP (DT the) (NN board) )
                 (PP-CLR (IN as)
                   (NP (DT a) (JJ nonexecutive) (NN director) ))
                 (NP-TMP (NNP Nov.) (CD 29) )))
             (. .) ))
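As a brief illustration, the NLTK sample exposes these formats through corpus
methods; wsj_0001.mrg, the first file in the sample, contains the Pierre Vinken
sentence shown above.

>>> from nltk.corpus import treebank
>>> treebank.words('wsj_0001.mrg')[:9]               # raw word list
>>> treebank.tagged_words('wsj_0001.mrg')[:5]        # word/tag pairs
>>> print treebank.parsed_sents('wsj_0001.mrg')[0]   # the parse tree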
The NLTK includes a small selection of out-of-copyright literary texts from Project
Gutenberg, an archive of free electronic books hosted at http://www.gutenberg.org/.
Some of the texts included are Jane Austen’s Emma, Lewis Carroll’s Alice’s
Adventures in Wonderland, Herman Melville’s Moby Dick and the King James Bible.
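The full selection can be listed programmatically (output abridged here):

>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', ...]
>>> len(gutenberg.words('carroll-alice.txt'))   # word count for a single text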
         The Reuters corpus distributed with the NLTK contains 10,788 news documents
         totalling 1.3m words, partitioned into ‘training’ and ‘test’ sets. This split is for
         training and testing machine learning algorithms: we return to this topic in
         Chapter 5 of this subject guide.
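The split is encoded in the fileids themselves, as a quick check shows:

>>> from nltk.corpus import reuters
>>> len(reuters.fileids())
10788
>>> reuters.fileids()[0]    # 'test/...' and 'training/...' prefixes mark the split
'test/14826'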
Learning activity
 1. Pick two or three of the corpora mentioned above and research them online, focusing on questions
    such as:
          how large is the corpus?
          what language(s) and genre(s) does it represent?
          when was it constructed and what is its intended use?
          what is the sampling frame?
          what level of interannotator agreement was achieved, if reported?
 2. Log on to the BNC Web (free registration needed) or another online corpus. Study the documentation
    provided and search for data to answer the following questions:
          What syntactic categories can the following words have? total, pancake, requisition, acquisition.
          The guide to Simple Query Syntax provided with the BNC warns that ‘part-of-speech tags have
          been assigned by an automatic software tool and are not always correct’. Have your answers to
          the previous question shown up any examples of incorrect classification, in your view?
          What prepositions can follow the verb talk? Give an example in each case.
3.5.7 WordNet
The NLTK also includes English WordNet, with 155,287 words and 117,659
synonym sets or synsets. A synset consists of a set of synonymous words paired with
a definition and linked to words with more general or more specific meanings. For
example, table has a number of distinct senses.
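A minimal sketch of how these senses can be explored; the exact inventory of
synsets depends on the WordNet version bundled with your NLTK installation.

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('table')                    # output abridged
[Synset('table.n.01'), Synset('table.n.02'), ...]
>>> wn.synset('table.n.02').hypernyms()    # link to a more general meaning
[Synset('furniture.n.01')]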
3.6 Some basic corpus analysis
     This chapter describes some relatively simple techniques, extracting various kinds of
     data in suitable formats for human interpretation of the results. Chapters 4 and 5 of
     the subject guide will look at ways the analysis and interpretation itself can be
     automated to varying degrees.
Learning activity
 1. Repeat the above process for other categories such as romance, news and religion. How do the
    frequency distributions and sentence counts enable you to compare the literary styles of these
    genres? Explain any assumptions you make.
 2. Having read through Bird et al. (2009, section 2.1), with particular attention to Table 2-3 on page 50,
    answer the following questions:
        (a) Summarise the README file from the Reuters corpus.
        (b) Create a variable soysents containing all sentences from reports concerning soy products.
        (c) Display the first ten sentences in soysents.
        (d) Create a variable metalwords containing all words from reports concerning metals.
        (e) What are the most frequently mentioned metals in this collection? Caution: why might this result
            be less than 100 per cent reliable?
NLTK’s plain text corpus reader can be used to build a ‘corpus’ from a collection of
text files. The resulting corpus will be formatted for access as raw text, lists of words
or lists of sentences and can be re-formatted for other functions such as
concordancing and finding collocations.
The first example involves a one-text ‘corpus’ of the recent report from the UK
Equality and Human Rights Commission: How fair is Britain?
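The construction step itself is roughly as follows (the directory name here is a
placeholder: put howfair.txt in a directory of your choosing):

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = 'texts'   # hypothetical directory containing howfair.txt
>>> mycorpus = PlaintextCorpusReader(corpus_root, r'.*\.txt')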
     We can now use the raw, words and sents methods to display the content in
     different formats:
>>> mycorpus.fileids()
['howfair.txt']
>>> mycorpus.words('howfair.txt')
['Equality', 'and', 'Human', 'Rights', 'Commission', ...]
>>> hf_raw = mycorpus.raw('howfair.txt')
>>> hf_raw[:100]
'Equality and Human Rights Commission\r\nTriennial
Review 2010\r\nExecutive Summary\r\nHow fair\r\nis Britain'
>>> mycorpus.sents('howfair.txt')
[['Equality', 'and', 'Human', 'Rights', 'Commission', 'Triennial',
'Review', '2010', 'Executive', 'Summary', 'How', 'fair', 'is',
'Britain', '?'], ...
The nltk.Text constructor formats the content in a way which can be accessed by
the concordance and collocations methods. Note that concordancing will always
return fixed-length strings which include your target text as a substring, and so
these may be cut off in the middle of a word.
>>> fair = nltk.Text(mycorpus.words('howfair.txt'))
>>> fair.concordance('equal')
     Building index...
     Displaying 3 of 3 matches:
     has narrowed considerably since the equal Pay Act 1970 came into force in 1975
     sonal circumstances , should have an equal opportunity to have a say in decisio
     that every individual should have an equal chance to make the most of their tal
     >>> fair.collocations()
     Building collocations list
     Human Rights; Rights Commission; Significant findings; Headline data;
     Executive Summary; less likely; ethnic minority; life expectancy;
     0845 604; domestic violence; hate crime; labour market; disabled people;
     mental health; Black Caribbean; different groups; religiously motivated;
     sexual assault; minority groups; formal childcare
Recall that a frequency distribution is a set of ordered pairs ⟨event, count⟩
     where count is the number of times event occurs. In our context event is a word-type
     and count is the number of tokens of that type in a text. A conditional frequency
     distribution is a collection of frequency distributions, each one for a different
     condition.
     For this example we add a second document to the corpus, extracted from a PDF
     ‘Guide to data protection’.
Step 1 Create a single variable text_word consisting of pairs of each word-token
    with the fileid of the document it occurs in.
Step 2 Create a conditional frequency distribution, which will tell you the frequency
    of each word in both texts.
Step 3 Pick a sample of words which are likely to occur in both documents, and
    tabulate their comparative frequency.
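A minimal sketch of these three steps (the sample words are hypothetical:
substitute terms that actually occur in your documents):

>>> import nltk
>>> text_word = [(fileid, word)
...              for fileid in mycorpus.fileids()
...              for word in mycorpus.words(fileid)]        # Step 1
>>> cfd = nltk.ConditionalFreqDist(text_word)               # Step 2
>>> cfd.tabulate(samples=['fair', 'data', 'protection'])    # Step 3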
Learning activity
Find some suitable electronic documents and follow the above techniques to construct a ‘mini-corpus’. The
documents in these examples were sourced from UK government websites: you may find similar
documents on the website of your own country’s government, or of transnational organisations like the
European Union or the United Nations. Think of some terms which are likely to occur in several of these
documents and compare them using a conditional frequency distribution. If you can find a lengthy report
which is issued along with a shorter summary, it is an interesting exercise to pick some key terms and see if
their comparative frequency is the same or similar in the original document and the summary.
3.7     Summary
 1. A corpus is a collection of linguistic data which was not originally produced for
    the purposes of linguistic analysis. Properly speaking it should be constructed
    so as to be balanced and representative. If a corpus includes any kind of . . .
3.8 Sample examination question
     a) Explain what is meant by the following types of corpus, and describe an example
     in each category that you have encountered during this course:
           isolated
           categorised
           overlapping
           temporal.
     b) What applications would a tagged and parsed corpus be suitable for? What are
     some advantages and disadvantages of using an automated tagger to build such a
     corpus?
     c) Suppose the following lists show the number of sentences and the most commonly
     occurring part-of-speech tags in three different categories of text in a corpus, with
     their frequency of occurrence in brackets. What can you say about the styles of these
     documents from studying these results? Discuss any assumptions you make.
Appendix C
Answers to selected activities
        Jack (Proper Noun) and (conjunction) Jill (Proper Noun) went (verb) up
        (preposition) the (determiner) hill (noun).
        The (determiner) owl (noun) and (conjunction) the (determiner) pussycat (noun)
        went (verb) to (preposition) sea (noun).
     1. John swam.
     2. (a) John and Mary walked on the hill.
        (b) The cat sat on the mat and slept.
        (c) John or a fish walked on a hill and barked.
        (d) . . .
     1. a(aa)*(bb)*
     2. (aaa)|(aab)|(abb)|(bbb)|(bba)|(baa)|(aba)|(bab)
S → either | if S1
S1 → the | a | one S2
S2 → happy S2
S2 → boy | girl | dog S3
S3 → or | then S2
S3 → ε
This is a slightly idealised rendering of Pinker’s state diagram, which appears to
have no halting state.
     The problem with this grammar can be clearly seen: the rule S3 → or | then S2 has
     no connection with the rule that introduces either or if and so there is no way to
     ensure that the appropriate conjunction is used.
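The overgeneration is easy to demonstrate with NLTK’s grammar tools (this uses
the NLTK 3 interface; the happy-loop and the empty production are omitted here to
keep the generated set finite):

>>> from nltk import CFG
>>> from nltk.parse.generate import generate
>>> grammar = CFG.fromstring("""
... S -> PRE NP CONJ
... PRE -> 'either' | 'if'
... NP -> DET N
... DET -> 'the' | 'a' | 'one'
... N -> 'boy' | 'girl' | 'dog'
... CONJ -> 'or' | 'then'
... """)
>>> for sent in generate(grammar):
...     print(' '.join(sent))

Among the 36 strings generated are mismatches such as either the boy then and
if one dog or, confirming that nothing ties the opening word to the closing
conjunction.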
1. ‘Stylistic’ analysis
Science Fiction (948 sentences) N (2232) V (1473) DET (1345) PRO (1268) P
    (1182) ‘.’ (1078) ADJ (793) ‘,’ (791) CNJ (685) ADV (644) VD (531) NP (467)
    ...
News (4623 sentences) N (22236) P (10845) DET (10648) NP (8336) V (7346)
   ADJ (5435) ‘,’ (5188) ‘.’ (4472) CNJ (4227) PRO (3408) ADV (2770) VD (2531)
   ...
Religion (1716 sentences) N (7304) P (4364) DET (4239) V (3863) ADJ (2620)
    CNJ (2324) PRO (2243) ‘,’ (1913) ‘.’ (1892) ADV (1485) NP (1224) VN (952)
    ...
Some comments:
 1. Reference and description: both SF and RE use pronouns more than proper
    names; News has more proper names. This is hard to interpret without further
    analysis: if an SF work includes a lot of dialogue, for example, it might be more
    natural for characters to refer to each other as I, we, you and so on rather than
    by name, while news stories tend to be about named individuals.
 2. Syntactic complexity: we cannot be very precise with this data but it looks as if
    the SF genre has the least syntactic complexity and the RE genre the highest,
    judging from the numbers of commas and conjunctions per sentence. Of course
    we cannot tell whether these tokens are connecting clauses or other types of
    phrases.
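For instance, normalising by sentence count gives roughly 0.72 conjunctions per
sentence for Science Fiction (685/948), 0.91 for News (4227/4623) and 1.35 for
Religion (2324/1716); the corresponding comma rates are 0.83, 1.12 and 1.11.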
In an examination or coursework question, you would get credit for discussing these
and other characteristics in the light of your impressionistic understanding of the
typical styles for these genres.
2. Reuters
         Create a variable soysents containing all sentences from reports concerning soy
         products.
          from nltk.corpus import reuters
          reuters.categories()
         Pick out categories relating to soy:
          soysents = reuters.sents(categories=['soy-meal', ...])
         Display the first ten sentences in soysents.
         print soysents[:10]
         Create a variable metalwords containing all words from reports concerning
         metals.
          metalwords = reuters.words(categories=['alum', 'copper', 'gold', ...])
         (Note that inspection of texts in the alum category confirms that they are about
         aluminium.)
         What are the most frequently mentioned metals in this collection? Caution: why
         might this result be less than 100 per cent reliable?
          from nltk import FreqDist
          freqmetal = FreqDist(metalwords)
          freqmetalkeys = freqmetal.keys()   # samples in decreasing order of frequency
          freqmetalkeys[:100]                # (in NLTK 3 use freqmetal.most_common(100))
         Remember that the contents of a frequency distribution are listed in the order of
         their frequency of occurrence. By scanning the output you should see that the
          first three metals listed are gold, copper and steel. However, caution is in order:
          Reuters is an overlapping corpus, so we may be double-counting some
          occurrences. These metals may also be mentioned under the category
         strategic-metal, or some reports may mention more than one kind of metal
         and so come under multiple categories.