TEXT OPERATIONS
By Abdo Ababor
                 2021
Statistical Properties of Text
Text operations
5 main operations for selecting index terms:
• Lexical analysis/Tokenization of the text - handling digits,
  hyphens, punctuation marks, and the case of letters
• Elimination of stop words - filter out words which are
  not useful in the retrieval process
• Stemming words - remove affixes (prefixes and suffixes)
• Construction of term categorization structures such
  as a thesaurus, to capture relationships among terms and allow
  the expansion of the original query with related terms
• Term weighting – taking into account term and document
  frequency
       Statistical Properties of Text
How is the frequency of different words distributed?
How fast does vocabulary size grow with the size of a corpus?
 Such factors affect the performance of an IR system and can be used to
   select suitable term weights and other aspects of the system.
    A few words are very common.
      The 2 most frequent words (e.g. “the”, “of”) can account for
       about 10% of word occurrences.
    Most words are very rare.
      Half the words in a corpus appear only once; these are the
       “read only once” words.
     Called a “heavy tailed” distribution, since most of the
      probability mass is in the “tail”
Sample Word Frequency Data
                Zipf’s distributions
           Rank-Frequency Distribution
 For all the words in a collection of documents, for each word w:
 • f : the frequency with which w appears
 • r : the rank of w in order of frequency (the most commonly
    occurring word has rank 1, etc.)
[Figure: distribution of sorted word frequencies according to Zipf’s law,
 plotting frequency f against rank r; the most frequent words sit at the
 left of the curve and the “read only once” words form the long tail.]
     Word distribution: Zipf's Law
Zipf's Law - named after the Harvard linguistics professor
 George Kingsley Zipf (1902-1950) -
    attempts to capture the distribution of the frequencies
      (i.e., number of occurrences) of the words within a text.
 Zipf's Law states that when the distinct words in a text
    are arranged in decreasing order of their frequency of
    occurrence (most frequent words first), the occurrence
    characteristics of the vocabulary can be characterized by the
    constant rank-frequency law of Zipf:
                Frequency * Rank = constant
    The frequencies roughly fit the relation: f * r = c
    Different collections have different constants c
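To make the rank-frequency relation concrete, here is a minimal Python sketch (my own illustration, not from the slides) that counts word frequencies in a corpus file and prints the product f * r for the top-ranked words; the file name sample.txt is an assumed placeholder.

# Minimal sketch: rank words of an assumed corpus file "sample.txt" by
# frequency and check how stable the Zipf product f * r stays across ranks.
from collections import Counter
import re

with open("sample.txt", encoding="utf-8") as fh:
    words = re.findall(r"[a-z]+", fh.read().lower())

freqs = Counter(words)              # frequency f of every distinct word
ranked = freqs.most_common()        # sorted by decreasing frequency

for rank, (word, f) in enumerate(ranked[:10], start=1):
    print(f"{rank:>3}  {word:<12}  f = {f:<8}  f * r = {f * rank}")

If the product stays roughly stable over the first ranks, the collection follows Zipf’s law with that constant c.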
                  Example: Zipf's Law
    The table shows the most frequently occurring words from a
     336,310-document collection containing 125,720,891 total
     words, of which 508,209 are unique words (table not reproduced here).
             More Example: Zipf’s Law
    Illustration of the Rank-Frequency Law. Let the total number of
     word occurrences in the sample be N = 1,000,000
       Rank (R)      Term          Frequency (F)   R × (F/N)
       1             the           69,971          0.070
       2             of            36,411          0.073
       3             and           28,852          0.086
       4             to            26,149          0.104
       5             a             23,237          0.116
       6             in            21,341          0.128
       7             that          10,595          0.074
       8             is            10,099          0.081
       9             was            9,816          0.088
       10            he             9,543          0.095
      Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words (upper cut-off).
      • Used by almost all systems.
• Significant words: Take words in between the most frequent
  (upper cut-off) and least frequent words (lower cut-off).
• Term weighting: Give differing weights to terms based on
  their frequency, with the most frequent words weighted less.
      • Used by almost all ranking methods.
            Explanations for Zipf’s Law
 The law has been explained by the “principle of least effort”, which
   makes it easier for a speaker or writer of a language to repeat
   certain words instead of coining new and different words.
    “principle of least effort”- balance between speaker’s desire for
     a small vocabulary and hearer’s desire for a large one.
 Zipf’s Law Impact on IR
    Good News: Stopwords will account for a large fraction of text
     so eliminating them greatly reduces inverted-index storage
     costs.
    Bad News: For most words, gathering sufficient data for
     meaningful statistical analysis (e.g. for correlation analysis for
     query expansion) is difficult since they are extremely rare.
          Word significance: Luhn’s Ideas
   Luhn Idea (1958): the frequency of word occurrence in a text gives a
    useful measurement of word significance.
   Luhn suggested that both extremely common and extremely
    uncommon words were not very useful for indexing.
       For this, Luhn specifies two cutoff points: an upper and a lower
        cutoff, based on which non-significant words are excluded
        The words exceeding the upper cutoff are considered too common
        The words below the lower cutoff are considered too rare
        Neither group contributes significantly to the content of the text
        The ability of words to discriminate content reaches a peak at a
         rank-order position halfway between the two cutoffs
     Let f be the frequency of occurrence of words in a text,
    and r their rank in decreasing order of word frequency;
      then a plot relating f and r yields the following curve
Luhn (1958) suggested that both extremely common and
extremely uncommon words were not very useful for document
representation & indexing.
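Luhn’s cutoff idea can be sketched in a few lines of Python; the cutoff ranks used below are illustrative assumptions, since the slides do not give concrete thresholds.

# Minimal sketch of Luhn-style selection: keep only words whose frequency
# rank lies between an upper cutoff (too common) and a lower cutoff (too rare).
from collections import Counter

def significant_words(tokens, upper_cutoff=10, lower_cutoff=1000):
    ranked = Counter(tokens).most_common()
    return [word for rank, (word, freq) in enumerate(ranked, start=1)
            if upper_cutoff < rank <= lower_cutoff]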
        Vocabulary size : Heaps’ Law
   Dictionaries
      600,000+ words
   But they do not include names of people, locations, products
    etc.
    Heaps’ law estimates the vocabulary size (number of distinct
     words) of a given corpus
   How does the size of the overall vocabulary (number of
    unique words) grow with the size of the corpus?
       This determines how the size of the inverted index will
        scale with the size of the corpus.
         Vocabulary Growth: Heaps’ Law
    Heaps’ law estimates the vocabulary size of a given corpus
      The vocabulary size grows as O(n^β), where β is a constant
       between 0 and 1.
      If V is the size of the vocabulary and n is the length of the corpus
       in words, Heaps’ law provides the following equation:
                                          V = K · n^β
     where the constants are typically:
        K ≈ 10–100
        β ≈ 0.4–0.6 (approx. square-root growth)
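As a quick sketch of the formula, the snippet below evaluates V = K · n^β; the constants K = 50 and β = 0.5 are assumed example values taken from the typical ranges above.

# Heaps' law V = K * n**beta with assumed example constants.
def heaps_vocabulary(n, K=50, beta=0.5):
    return K * n ** beta

for n in (10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}  estimated vocabulary ≈ {heaps_vocabulary(n):,.0f}")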
              Heap’s distributions
• Distribution of the size of the vocabulary: on a log-log plot there is a
  linear relationship between vocabulary size and number of tokens
  (the raw relationship V = K·n^β is sublinear)
 Example: from 1,000,000,000 documents, there may be
 1,000,000 distinct words. Can you agree?
                       Example
    We want to estimate the size of the vocabulary for a corpus of
     1,000,000 words. However, we only know statistics computed
     on smaller corpora:
      For 100,000 words, there are 50,000 unique words
      For 500,000 words, there are 150,000 unique words
      Estimate the vocabulary size for the 1,000,000 words
       corpus?
      How about for a corpus of 1,000,000,000 words?
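One way to work this exercise is to fit K and β from the two known points and then extrapolate. The sketch below is my own illustration and assumes both points follow Heaps’ law exactly.

# Fit V = K * n**beta to the two (corpus size, vocabulary size) points above,
# then extrapolate to larger corpora.
import math

n1, v1 = 100_000, 50_000
n2, v2 = 500_000, 150_000

beta = math.log(v2 / v1) / math.log(n2 / n1)   # taking the ratio eliminates K
K = v1 / n1 ** beta

for n in (1_000_000, 1_000_000_000):
    print(f"n = {n:>13,}  estimated V ≈ {K * n ** beta:,.0f}")

With these two points, β = ln 3 / ln 5 ≈ 0.68 and K ≈ 19, giving roughly 240,000 distinct words for the 1,000,000-word corpus and about 27 million for 1,000,000,000 words.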
                            Text Operations
   Not all words in a document are equally significant to represent the
    contents/meanings of a document
       Some words carry more meaning than others
       Noun words are often the most representative of a document’s content
   Therefore, need to preprocess the text of a document in a collection to be used as
    index terms
    Using the set of all words in a collection to index documents creates too much
     noise for the retrieval task
       Reducing noise means reducing the number of words that can be used to
         refer to the document
   Preprocessing is the process of controlling the size of the vocabulary or the
    number of distinct words used as index terms
      Preprocessing will lead to an improvement in the information retrieval
        performance
   However, some search engines on the Web omit preprocessing
      Every word in the document is an index term
                       Text Operations
      Text operations are the process of transforming text into logical
     representations
   5 main operations for selecting index terms, i.e. to choose
    words/stems (or groups of words) to be used as indexing terms:
       Lexical analysis/Tokenization of the text - handling digits, hyphens,
        punctuation marks, and the case of letters
      Elimination of stop words - filter out words which are not useful
       in the retrieval process
      Stemming words - remove affixes (prefixes and suffixes)
       Construction of term categorization structures such as a thesaurus,
        to capture relationships among terms and allow the expansion of the
        original query with related terms
       Term weighting – taking into account term and document
        frequency
    Generating Document Representatives
   Text Processing System
       Input text – full text, abstract or title
       Output – a document representative adequate for use in an
        automatic retrieval system
    The document representative consists of a list of class names,
     each name representing a class of words occurring in the total
     input text. A document will be indexed by a name if one of its
    significant words occurs as a member of that class.
documents → Tokenization → stop word removal → stemming → Thesaurus → Index terms
Lexical Analysis/Tokenization of Text
   Change text of the documents into words to be adopted as
index terms
  Objective - identify words in the text
   Digits, hyphens, punctuation marks, case of letters
    Numbers alone are usually not good index terms (like 1910, 1999); but
     combinations such as 510 B.C. can be unique and useful
    Hyphens – break up hyphenated words (e.g. state-of-the-art →
     state of the art); but some words, e.g. gilt-edged, B-49, are
     unique words which require hyphens
    Punctuation marks – remove totally unless significant,
     e.g. in program code: x.exe and xexe differ
    Case of letters – usually not important; all letters can be converted
     to upper or lower case
                           Tokenization
   Analyze text into a sequence of discrete tokens (words).
   Input: “Friends, Romans and Countrymen”
   Output: Tokens (an instance of a sequence of characters that are grouped
    together as a useful semantic unit for processing)
      Friends
      Romans
      and
      Countrymen
   Each such token is now a candidate for an index entry, after
    further processing
   But what are valid tokens to use?
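A minimal tokenizer for the example above might look like the following Python sketch; it simply splits on non-word characters and sidesteps the harder issues discussed on the next slides.

# Minimal tokenizer sketch: split on anything that is not a letter, digit,
# underscore or hyphen.  Real tokenizers need language-specific rules.
import re

def tokenize(text):
    return [tok for tok in re.split(r"[^\w-]+", text) if tok]

print(tokenize("Friends, Romans and Countrymen"))
# -> ['Friends', 'Romans', 'and', 'Countrymen']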
              Issues in Tokenization
   One word or multiple: How do you decide it is one token or
    two or more?
        Hewlett-Packard → Hewlett and Packard as two tokens?
           state-of-the-art: break up hyphenated sequence.
           San Francisco, Los Angeles
        lowercase, lower-case, lower case ?
           data base, database, data-base
        Numbers:
          dates (3/12/91 vs. Mar. 12, 1991);
          phone numbers,
          IP addresses (100.2.86.144)
                Issues in Tokenization
How to handle special cases involving apostrophes, hyphens
etc? C++, C#, URLs, e-mails, …
      Sometimes punctuation (e-mail), numbers (1999), and case
       (Republican vs. republican) can be a meaningful part of a token.
      However, frequently they are not.
 Simplest approach is to ignore all numbers and punctuation and use
   only case-insensitive unbroken strings of alphabetic characters as
   tokens.
      Generally, systems do not index numbers as text, but numbers are often
       very useful for search. Systems will often index “meta-data”, including
       creation date, format, etc., separately
      Issues of tokenization are language specific
       They require the language to be known
       What works for one language does not necessarily work for another.
           Elimination of STOPWORD
Stopwords are extremely common words across document
collections that have no discriminatory power
   They may occur in 80% of the documents in a collection.
    They would appear to be of little value in helping select
     documents matching a user need and need to be filtered out
     as potential index terms
              Elimination of STOPWORD
      Examples of stopwords are articles, pronouns, prepositions, conjunctions, etc.:
       articles (a, an, the);
       pronouns: (I, he, she, it, their, his)
       Some prepositions (on, of, in, about, besides, against),
       conjunctions/connectors (and, but, for, nor, or, so, yet),
       verbs (is, are, was, were),
       adverbs (here, there, out, because, soon, after),
       adjectives (all, any, each, every, few, many, some) can also be
        treated as stopwords.
       Stopwords are language dependent.
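Stopword elimination itself is a simple filter; the stoplist below is a tiny illustrative sample rather than a real, language-complete list.

# Minimal stopword filtering sketch with an assumed toy stoplist.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "are",
             "he", "she", "it"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "house", "of", "the", "rising", "sun"]))
# -> ['house', 'rising', 'sun']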
                             Stopwords
   Intuition:
         Stopwords have little semantic content; It is typical to remove
          such high-frequency words
          Stopwords take up about 50% of the text; hence, removing them
           reduces document size by 30-50%
       Smaller indices for information retrieval
        Good compression techniques for indices:
        The 30 most common words account for 30% of the tokens
         in written text
    Better approximation of importance for classification,
    summarization, etc.
    How to determine a list of stopwords?
   One method: Sort terms (in decreasing order) by collection
    frequency and take the most frequent ones
        In a collection about insurance practices, “insurance” would be a
         stop word
    Another method: Build a stop word list that contains a set
    of articles, pronouns, etc.
         Why do we need stop lists: with a stop list, we can exclude
           the commonest words from the index terms entirely.
         With the removal of stopwords, we can measure better
          approximation of importance for classification,
          summarization, etc.
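The first method (take the most frequent terms by collection frequency) can be sketched as follows; the choice of k = 50 terms is an assumption for illustration.

# Build a stop list from collection frequency: the k most frequent terms.
from collections import Counter

def build_stoplist(documents, k=50):
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return {term for term, _ in counts.most_common(k)}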
                               Stop words
   Stop word elimination used to be standard in older IR systems.
   But the trend is getting away from doing this. Most web search
    engines index stop words:
      Good query optimization techniques mean you pay little at
       query time for including stop words.
      You need stopwords for:
            Phrase queries: “King of Denmark”
            Various song titles, etc.: “Let it be”, “To be or not to be”
            “Relational” queries: “flights to London”
        Elimination of stopwords might reduce recall (e.g. “To be or
         not to be” – all words eliminated except “be” – will produce no
         or only irrelevant retrieval)
                         Normalization
    Normalization is canonicalizing (reducing to a standard form) tokens so
     that matches occur despite superficial differences in the
     character sequences of the tokens
       Need to “normalize” terms in indexed text as well as query terms
        into the same form
       Example: We want to match U.S.A. and USA, by deleting
        periods in a term
   Case Folding: Often best to lower case everything, since users
    will use lowercase regardless of ‘correct’ capitalization…
       Republican vs. republican
       Anti-discriminatory vs. antidiscriminatory
       Car vs. automobile?
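A minimal normalization sketch covering the two cases above (deleting periods and case folding); real systems apply many more language-specific rules.

# Minimal token normalization: case folding plus removing periods,
# so that U.S.A. and USA map to the same index term.
def normalize(token):
    token = token.lower()              # case folding: Republican -> republican
    token = token.replace(".", "")     # U.S.A. -> usa
    return token

print(normalize("U.S.A."), normalize("USA"), normalize("Republican"))
# -> usa usa republican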
                     Normalization issues
   Case folding is good for
      Allowing instances of Automobile at the beginning of a
       sentence to match with a query of automobile
      Helping a search engine when most users type ferrari when
       they are interested in a Ferrari car
   Bad for
      Proper names vs. common nouns
            E.g. General Motors, Associated Press, …
       Solution:
        lowercase only words at the beginning of the sentence
       In IR, lowercasing is most practical because of the way users
        issue their queries
         Stemming/Morphological analysis
     Stemming reduces tokens to their “root” form in order to
      recognize morphological variants.
     The process involves removal of affixes (i.e. prefixes and
        suffixes) with the aim of reducing variants to the same stem
     Often removes inflectional and derivational morphology of a
        word
        Inflectional morphology: vary the form of words in order to
         express grammatical features, such as singular/plural or
         past/present tense. E.g. Boy → boys, cut → cutting.
        Derivational morphology: makes new words from old ones.
         E.g. creation is formed from create , but they are two separate
         words. And also, destruction → destroy
   Stemming is language dependent
     Correct stemming is language specific and can be complex.
                           Stemming
   The final output from a conflation(reducing to the same token)
    algorithm is a set of classes, one for each stem detected.
      A Stem: the portion of a word which is left after the removal
       of its affixes (i.e., prefixes and/or suffixes).
       Example: ‘connect’ is the stem for {connected, connecting,
        connection, connections}
       [automate, automatic, automation] all reduce to the stem
        automat
      A document representative then becomes a list of class
       names, which are often referred as the documents index
       terms/keywords.
   Queries : Queries are handled in the same way.
           Ways to implement stemming
There are basically two ways to implement stemming.
1. 1st approach is to create a big dictionary that maps words to
 their stems.
    The advantage of this approach is that it works perfectly (insofar as the
     stem of a word can be defined perfectly); the disadvantages are the space
     required by the dictionary and the investment required to maintain the
     dictionary as new words appear.
2. 2nd approach is to use a set of rules that extract stems from
 words.
    The advantages of this approach are that the code is typically small, and it
     can gracefully handle new words; the disadvantage is that it occasionally
     makes mistakes. But, since stemming is imperfectly defined, anyway,
     occasional mistakes are tolerable, and the rule-based approach is the one
     that is generally chosen.
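The first (dictionary) approach can be sketched as a plain lookup table; the entries shown are a tiny illustrative sample, not a real stemming dictionary.

# Dictionary-based stemming sketch: explicit word -> stem mapping.
STEM_DICT = {
    "connected": "connect", "connecting": "connect",
    "connection": "connect", "connections": "connect",
}

def dictionary_stem(word):
    # unknown words pass through unchanged (lower-cased)
    return STEM_DICT.get(word.lower(), word.lower())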
                    Porter Stemmer
Stemming is the operation of stripping the suffixes from a word,
leaving its stem.
     Google, for instance, uses stemming to search for web pages
      containing the words connected, connecting, connection
      and connections when users ask for a web page that contains
      the word connect.
 In 1979, Martin Porter developed a stemming algorithm that
   uses a set of rules to extract stems from words, and though it
   makes some mistakes, most common words seem to work out
   right.
                        Porter stemmer
   Most common algorithm for stemming English words to their
    common grammatical root
    It is a simple procedure for removing known affixes in English
     without using a dictionary. To get rid of plurals, the following
     rules are used:
        SSES → SS               caresses → caress
        IES → I                 ponies → poni
        SS → SS                 caress → caress
        S → (removed)           cats → cat
        EMENT → (removed)       (delete the final “ement” if what remains is
         longer than 1 character)
            replacement → replac
            cement → cement
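A minimal sketch of just the plural rules listed above (roughly Porter’s step 1a); the full Porter stemmer contains many more rule groups.

# Plural-stripping rules from the slide, applied in order of specificity.
def step_1a(word):
    if word.endswith("sses"):
        return word[:-2]            # caresses -> caress
    if word.endswith("ies"):
        return word[:-3] + "i"      # ponies -> poni
    if word.endswith("ss"):
        return word                 # caress -> caress
    if word.endswith("s"):
        return word[:-1]            # cats -> cat
    return word

for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", step_1a(w))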
                   Porter stemmer
   While step 1a gets rid of plurals, step 1b removes -ed
    or -ing.
       e.g.
         ;; agreed -> agree      ;; disabled -> disable
         ;; matting -> mat       ;; mating -> mate
         ;; meeting -> meet      ;; milling -> mill
         ;; messing -> mess      ;; meetings -> meet
         ;; feed -> feed
                 Stemming: challenges
   May produce unusual stems that are not English words:
     Removing ‘UAL’ from FACTUAL and EQUAL
   May conflate (reduce to the same token) words that are actually
    distinct.
     “computer”, “computational”, “computation” all reduced to
       same token “comput”
    May not recognize all morphological derivations.
                             Thesauri
    Full-text searching alone often cannot be accurate, since different
     authors may select different words to represent the same
     concept
       Problem: The same meaning can be expressed using
        different terms that are synonyms, homonyms, and related
        terms
        How can we ensure that, for the same meaning, identical
         terms are used in the index and in the query?
       Thesaurus: The vocabulary of a controlled indexing
        language, formally organized so that a priori relationships
        between concepts (for example as "broader" and “related")
        are made explicit.
                         Thesauri …
•   A thesaurus contains terms and relationships between
    terms
     –  IR thesauri typically rely upon symbols such as
        USE/UF (UF = used for), BT (broader term), and RT (related term)
        to express inter-term relationships.
    – e.g., car = automobile, truck, bus, taxi, motor vehicle
            -color = colour, paint
                    Aim of Thesaurus
    Thesaurus tries to control the use of the vocabulary by
     showing a set of related words to handle synonyms and
     homonyms
    The aim of thesaurus is therefore:
      to provide a standard vocabulary for indexing and searching
         A thesaurus rewrites terms into equivalence classes, and we index
           such equivalence classes
        When the document contains automobile, index it under
          car as well (usually, also vice-versa)
      to assist users with locating terms for proper query
       formulation: When the query contains automobile, look under
       car as well for expanding query
      to provide classified hierarchies that allow the broadening and
       narrowing of the current request according to user needs
          Thesaurus Construction
Example: thesaurus built to assist IR for searching cars and
  vehicles :
Term: Motor vehicles
   UF :      Automobiles
             Cars
             Trucks
   BT:              Vehicles
   RT:              Road Engineering
                    Road Transport
                     More Example
Example: thesaurus built to assist IR in the field of computer
  science:
TERM: natural languages
    UF natural language processing (UF=used for NLP)
    BT languages (BT=broader term is languages)
    TT languages (TT = (top term) is languages)
    RT artificial intelligence (RT=related term/s)
           computational linguistics
          formal languages
          query languages
          speech recognition
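The relationships above can be held in a small data structure and used for query expansion; the sketch below mirrors the example entry, but the structure itself is my own illustration.

# Thesaurus entry with UF/BT/RT relations, used to expand a query term.
THESAURUS = {
    "natural languages": {
        "UF": ["natural language processing"],
        "BT": ["languages"],
        "RT": ["artificial intelligence", "computational linguistics",
               "formal languages", "query languages", "speech recognition"],
    },
}

def expand_query(term):
    entry = THESAURUS.get(term, {})
    return [term] + entry.get("UF", []) + entry.get("RT", [])

print(expand_query("natural languages"))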
         Language-specificity
   Many of the above features embody transformations that
    are
      Language-specific and
      Often, application-specific
   These are “plug-in” addenda to the indexing process
   Both open source and commercial plug-ins are available for
    handling these issues
               INDEX TERM SELECTION
   Index language is the language used to describe documents
    and requests
   Elements of the index language are index terms which may be
    derived from the text of the document to be described, or may
    be arrived at independently.
        If a full text representation of the text is adopted, then all words in
         the text are used as index terms = full text indexing
         Otherwise, we need to select the words to be used as index terms,
          reducing the size of the index file, which is essential for designing
          an efficient IR search system