01.08.
2023
Introduction to Information Retrieval                                      Introduction to Information Retrieval
   Documents                                                                  Parsing a document
                                                                                 We need to deal with format and language of each document.
      Last lecture: Simple Boolean retrieval system
                                                                                 What format is it in? pdf, word, excel, html etc.
      Our assumptions were:
                                                                                 What language is it in?
            We know what a document is.
                                                                                 What character set is in use?
            We can “machine-read” each document.
                                                                                 Each of these is a classification problem, which we will study
      This can be complex in reality.
                                                                                  later in this course.
                                                                                 Alternative: use heuristics
                                                                       1                                                                           2
Introduction to Information Retrieval                                      Introduction to Information Retrieval
   Format/Language: Complications                                             Normalization
      A single index usually contains terms of several languages.               Need to “normalize” terms in indexed text as well as query
            Sometimes a document or its components contain multiple              terms into the same form.
             languages/formats.                                                  Example: We want to match U.S.A. and USA
            French email with Spanish pdf attachment                            We most commonly implicitly define equivalence classes of
      What is the document unit for indexing?                                    terms.
      A file?                                                                   Alternatively: do asymmetric expansion
                                                                                       window → window, windows
      An email?                                                                       windows → Windows, windows
      An email with 5 attachments?                                                    Windows (no expansion)
      A group of files (ppt or latex in HTML)?                                  More powerful, but less efficient
      Upshot: Answering the question “what is a document?” is not
       trivial and requires some design decisions.
      Also: XML
                                                                       3                                                                           4
Introduction to Information Retrieval                                      Introduction to Information Retrieval
   Normalization: Other languages                                             Recall: Inverted index construction
                                                                                 Input:
          Normalization and language detection interact.
          PETER WILL NICHT MIT. → MIT = mit
                                                                                 Output:
          He got his PhD from MIT. → MIT ≠ mit
                                                                                 Each token is a candidate for a postings entry.
                                                                                 What are valid tokens to emit?
                                                                       5                                                                           6
                                                                                                                                                       1
                                                                                                                                                   01.08.2023
Introduction to Information Retrieval                                          Introduction to Information Retrieval
                                                                                  Tokenization problems: One word or two? (or
   Exercises                                                                      several)
                                                                                       Hewlett-Packard
                                                                                       State-of-the-art
     In June, the dog likes to chase the cat in the barn. – How many                   co-education
     word tokens?                                                                      the hold-him-back-and-drag-him-away maneuver
                                                                                       data base
     –Tokenize: Mr. O’Neill thinks that the boys’                                      San Francisco
     stories about Chile’s capital aren’t amusing.                                     Los Angeles-based company
                                                                                       cheap San Francisco-Los Angeles fares
                                                                                       York University vs. New York University
                                                                           7                                                                              8
Introduction to Information Retrieval                                          Introduction to Information Retrieval
   Numbers                                                                        Chinese: No whitespace
        3/20/91
        20/3/91
        Mar 20, 1991
        B-52
        100.2.86.144
        (800) 234-2333
        800.234.2333
        Older IR systems may not index numbers . . .
        . . . but generally it’s a useful feature.
                                                                           9                                                                              10
Introduction to Information Retrieval                                          Introduction to Information Retrieval
   Ambiguous segmentation in Chinese                                              Other cases of “no whitespace”
                                                                                       Compounds in Dutch, German, Swedish
                                                                                       Computerlinguistik → Computer + Linguistik
                                                                                       Lebensversicherungsgesellschaftsangestellter
                                                                                       → leben + versicherung + gesellschaft + angestellter
                                                                                       Inuit: tusaatsiarunnanngittualuujunga (I can’t hear very well.)
         The two characters can be treated as one word meaning                         Many other languages with segmentation difficulties: Finnish,
         ‘monk’ or as a sequence of two words meaning ‘and’ and ‘still’.                Urdu, . . .
                                                                       11                                                                                 12
                                                                                                                                                               2
                                                                                                                                                              01.08.2023
Introduction to Information Retrieval                                                  Introduction to Information Retrieval
   Japanese                                                                               Arabic script
         4 different “alphabets”: Chinese characters, hiragana syllabary
         for inflectional endings and functional words, katakana
         syllabary for transcription of foreign words and other uses,
         and latin. No spaces (as in Chinese). End user can express
         query entirely in hiragana!
                                                                                 13                                                                               14
Introduction to Information Retrieval                                                  Introduction to Information Retrieval
   Arabic script: Bidirectionality                                                        Accents and diacritics
                                                                                             Accents: résumé vs. resume (simple omission of accent)
                                                                                             Umlauts: Universität vs. Universitaet (substitution with special
                              ← →       ← →                ← START                            letter sequence “ae”)
   ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’         Most important criterion: How are users likely to write their
                                                                                              queries for these words?
                                                                                             Even in languages that standardly have accents, users often do
                                                                                              not type them. (Polish?)
                                                                                 15                                                                               16
Introduction to Information Retrieval                                                  Introduction to Information Retrieval
   Case folding                                                                           Stop words
                                                                                                stop words = extremely common words which would appear
           Reduce all letters to lower case                                                     to be of little value in helping select documents matching a
           Possible exceptions: capitalized words in mid-sentence                               user need
           MIT vs. mit                                                                         Examples: a, an, and, are, as, at, be, by, for, from, has, he, in,
           Fed vs. fed                                                                          is, it, its, of, on, that, the, to, was, were, will, with
           It’s often best to lowercase everything since users will use                        Stop word elimination used to be standard in older IR
            lowercase regardless of correct capitalization.                                      systems.
                                                                                                But you need stop words for phrase queries, e.g. “King of
                                                                                                 Denmark”
                                                                                                Most web search engines index stop words.
                                                                                 17                                                                               18
                                                                                                                                                                       3
                                                                                                                                                    01.08.2023
Introduction to Information Retrieval                                             Introduction to Information Retrieval
   Lemmatization                                                                     Stemming
           Reduce inflectional/variant forms to base form
                                                                                             Definition of stemming: Crude heuristic process that chops
           Example: am, are, is → be                                                         off the ends of words in the hope of achieving what
           Example: car, cars, car’s, cars’ → car                                            “principled” lemmatization attempts to do with a lot of
           Example: the boy’s cars are different colors → the boy car be                     linguistic knowledge.
            different color                                                                  Language dependent
           Lemmatization implies doing “proper” reduction to                                Often inflectional and derivational
            dictionary headword form (the lemma).                                            Example for derivational: automate, automatic, automation
           Inflectional morphology (cutting → cut) vs. derivational                          all reduce to automat
            morphology (destruction → destroy)
                                                                             19                                                                         20
Introduction to Information Retrieval                                             Introduction to Information Retrieval
   Porter algorithm                                                                  Porter stemmer: A few rules
           Most common algorithm for stemming English
           Results suggest that it is at least as good as other stemming
            options                                                                           Rule                        Example
           Conventions + 5 phases of reductions                                              SSES → SS                   caresses → caress
           Phases are applied sequentially                                                   IES → I                     ponies → poni
           Each phase consists of a set of commands.                                                                     caress → caress
                                                                                              SS → SS
                Sample command: Delete final ement if what remains is longer                                             cats → cat
                 than 1 character                                                             S→
                replacement → replac
                cement → cement
           Sample convention: Of the rules in a compound command,
            select the one that applies to the longest suffix.
                                                                             21                                                                         22
Introduction to Information Retrieval                                             Introduction to Information Retrieval
   Three stemmers: A comparison                                                      Does stemming improve effectiveness?
    Sample text:    Such an analysis can reveal features that are not easil
                    visible from the variations in the individual genes and                  In general, stemming increases effectiveness for some
                    can lead to a picture of expression that is more                          queries, and decreases effectiveness for others.
                    biologically transparent and accessible to interpretation
    Porter stemmer: such an analysi can reveal featur that ar not easili visibl
                                                                                             Queries where stemming is likely to help: [tartan sweaters],
                    from the variat in the individu gene and can lead to                      [sightseeing tour san francisco] (equivalence classes:
                    pictur of express that is more biolog transpar and                        {sweater,sweaters}, {tour,tours})
                    access to interpret                                                      Porter Stemmer equivalence class oper contains all of
    Lovins stemmer: such an analys can reve featur that ar not eas vis from                   operate operating operates operation operative operatives
                    th vari in th individu gen and can lead to a pictur of                    operational.
                    expres that is mor biolog transpar and acces to
                    interpres                                                                Queries where stemming hurts: [operational AND research],
    Paice stemmer: such an analys can rev feat that are not easy vis from                     [operating AND system], [operative AND dentistry]
                    the vary in the individ gen and can lead to a pict of
                   express that is mor biolog transp and access to interpret
                                                                             23                                                                         24
                                                                                                                                                             4
                                             01.08.2023
Introduction to Information Retrieval
   Exercise: What does Google do?
             Stop words
             Normalization
             Tokenization
             Lowercasing
             Stemming
             Non-latin alphabets
             Umlauts
             Compounds
             Numbers
                                        25