CI-6226
Lecture 1. Introduction
    Information Retrieval and Analysis
                                         Vasily Sidorov
Today’s Outline
• Course information
  —Why learn IR?
• Walkthrough of the components in a modern IR
  system
• Search engine evolution
• Overview of Google search engine (1998)
  —Architecture
  —PageRank
Course Instructor (me)
• Dr. Vasily Sidorov
• SCSE, NTU
• Questions, feedback, anything
  —vsidorov@ntu.edu.sg
Textbook
           • Introduction to Information
             Retrieval
             —Christopher D. Manning
             —Prabhakar Raghavan
             —Hinrich Schütze
           • Available online at:
             —http://nlp.stanford.edu/IR-book/
Reference Books
• Modern Information Retrieval: The Concepts and
  Technology behind Search (2nd Edition)
• Information Retrieval: Implementing and Evaluating
  Search Engines
• Managing Gigabytes: Compressing and Indexing
  Documents and Images (2nd Edition)
• Search Engines: Information Retrieval in Practice
Pre-requisites
• Mathematics
  —Linear algebra: vectors, matrices, matrix inverse
  —Probability theory basics
    ◦ e.g., P(AB) = P(A)P(B) for independent events A and B
• Hands-on Attitude
  —Embrace new technology and ideas
• Programming
  —Any programming language will do
    ◦ Python, Java, C#, C/C++, JS …
  —Work with files
  —Text encodings (ASCII, UTF, …)
  —Algorithms
  —Data structures
After taking this course, you’ll know
• How to build your own search engine, or customize
  an existing text search engine
• How to enhance applications using IR, e.g.,
  —Cluster text-like information such as microarray data
  —Find similar actions / data / objects
  —Parse/analyze text/dialogues (e.g., Facebook posts,
   Twitter, comments)
• How to build your own NextGen IR killer-app
  —e.g., matching people based on their preferences
  —limited only by your imagination!
This course does NOT cover
• Non-text data Retrieval
  —Image
  —Video
• XML Retrieval and NoSQL databases
• Natural Language Processing
  —Ontologies, e.g., WORDNET, HOWNET
  —Part of Speech Tagging, Grammar, Parsing, …
  —GPT-3 and other generative text models
• Structured Data Retrieval
  —SQL
Let’s start
• What is IR?
  —IR = Integrated Resort? Internet Relay?
  —IR = Information Retrieval
• What to retrieve?
  —bookmarks like del.icio.us
  —people, like LinkedIn, Facebook
  —books (in library or on Amazon)
  —text (web pages, medical reports, assignment
   reports)
  —images (photos, Flickr)
  —videos (movies, YouTube)
• IR vs. Text Mining
What is Text Mining?
• “The objective of Text Mining is to exploit information
  contained in textual documents in various ways,
  including …discovery of patterns and trends in data,
  associations among entities, predictive rules, etc.”
                                   — Grobelnik et al., 2001
• “Another way to view text data mining is as a process of
  exploratory data analysis that leads to heretofore
  unknown information, or to answers for questions for
  which the answer is not currently known”
                                            — Hearst, 1999
Text Mining Challenges
• Data collection is “free text”
  —Not well-organized; Semi-structured
   or unstructured
• Natural language text contains
  ambiguities on many levels
  —Lexical, syntactic, semantic, and
   pragmatic, e.g.,
          Time flies like an arrow.
          Fruit flies like a banana.
• Learning techniques for processing text typically
  need annotated training examples
Text Mining Research Areas
• Information Retrieval (IR)
   —Search Engines
   —Classification
   —Recommendation
• Information Extraction (IE)
   —Product Information (e.g. price) scraping
   —Named entity recognition
• Information Understanding
   —Natural Language Processing (NLP)
   —Question Answering
   —Concept Extraction from Newsgroup
   —Visualization, Summarization
• Cross-Lingual Text Mining
• Trend Detection
   —Outlier Detection
   —Event Detection
Why Learn IR?
• Understand limitations of state-of-the-art IR
  —Learn what is possible in IR, tell Fiction from Fact
  —Learn how to fool IR systems?
• Organize your personal information
  —Master/create IR software to manage personal
   information
• How to use Search Engines better
• Design next generation IR system!
  —Be the next Google (not necessarily a search engine)
  —Yahoo, Google, [Your Company]
How to Retrieve Information
• Example
  —Scan through every book in library/store bookshelf
  —View every image/video
• To speed up IR:
  —Scan every piece of information once, before any
   query is answered
    ◦ Google/Bing/etc. try to download the entire Web
  —Indexing = scan everything once and remember
   where each piece of information is located
    ◦ “1984” located at Level 2 Shelf 34 of National Library
    ◦ List of documents containing “1984” stored on disk C:\
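The “list of documents containing a term” is an inverted index. A minimal sketch in Python (the documents here are made-up examples):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Toy collection: doc id -> text
docs = {
    1: "nineteen eighty four",
    2: "animal farm",
    3: "nineteen eighty four essays",
}
index = build_inverted_index(docs)
print(index["nineteen"])  # [1, 3]
```

At query time the engine only looks up the term’s posting list instead of scanning every document.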
History
• 300 BC, Great Library of Alexandria, Egypt
  —Most books were stored in armaria (closed, labelled
   cupboards). Armaria were used for book storage till
   medieval times.
Libraries Before Computers
• Cataloging
  —A process of describing a document (both physical
   attributes & subject contents)
  —Catalog = key to a collection
• Bibliographic record
  —A surrogate record produced by catalogers according
   to defined standards (e.g., Machine Readable
   Cataloging record)
• Subject Classification
  —Allocating a classification number
Classical Indexing
• Indexing
  —Human librarians construct document surrogates by
   assigning identifiers to text items.
• Includes
  —Keyword Indexing
     ◦ Similar to today's Search Engine Index
  —Subject Indexing
     ◦ Similar to today's Classification Engine
Subject Indexing - Classification
• Hierarchical structure
  —Similar subjects at the same level
    ◦ e.g., Furniture → Chairs, Tables
• Goals of Classification
  —Collocate subjects
    ◦ group all documents of same subject together on
      shelves & put them next to related subjects
  —Define & Assign code (Call Number) to document
    ◦ to facilitate identification from the catalogue and
      shelf location
Dewey Decimal Classification (DDC)
• Most widely used
  —Used by > 135 countries
• Translated into more than 30 languages
  —Arabic, Chinese, French, Greek, Hebrew, Icelandic,
   Russian, Spanish
• Universe of knowledge divided into 10 main classes
  —Each class divided into 10 main divisions, …
  —until all disciplines, subjects and concepts are
   defined.
• Currently: 23rd edition (2011)
DDC Example
000 Computer science, information & general works
100 Philosophy & psychology
200 Religion
300 Social sciences
400 Language
500 Science
600 Technology
700 Arts & recreation
800 Literature
900 History & geography

600 Technology (applied sciences)
610 Medical sciences
620 Engineering and allied operations
630 Agriculture and related technologies
640 Home economics and family living
650 Management and auxiliary services
660 Chemical engineering and related technologies
670 Manufactures
680 Manufacture of products for specific uses
690 Buildings

620 Engineering & allied operations
621 Applied physics
622 Mining and related operations
623 Military and nautical engineering
624 Civil engineering
625 Engineering of railroads and roads
626 [not used]
627 Hydraulic engineering
628 Sanitary and municipal engineering
629 Other branches of engineering

Another example:
500 Natural sciences and mathematics
  510 Mathematics
    516 Geometry
      516.3 Analytic geometries
        516.37 Metric differential geometries
          516.375 Finsler geometry
DDC is not ideal…
• DDC Classification Guidelines
  —Determine the subject of a work
  —Determine the disciplinary focus of a work
  —Refer to the schedules
• Rules to handle a document in multiple classes
  —First-of-two Rule: When two subjects receive equal
   treatment, classify the work with the subject whose
   number comes first in the schedules
  —Rule of Application: Classify a work dealing with
   interrelated subjects with the subject that is acted
   upon
Classical Indexing
The Natural Language problem:
• Low consistency:
  —People use different words to refer to same things
  —People use same words to refer to different things
• Objective in IR:
  —Search & retrieval of documents (or records) require
   some level of intellectual control over the item and
   its contents, at the same time, recognizing the need
   for flexibility
 Classical Indexing
• Keyword indexing (Google)
  —Index entries generated
   from the title and/or
   keywords from the text.
  —No intellectual process of
   text analysis or abstraction
• Subject indexing (Yahoo)
  —Involves analysis of the subject by humans /
   computers
Classical Indexing Problems
• Effectiveness of indexing depends on:
  —Indexing Exhaustiveness
    ◦ extent to which the subject matter of a given
      document has been reflected through the index entries
  —Term Specificity
    ◦ how broad/specific are the terms/keywords
Vocabulary Control: Controlled vs Natural
language indexing
• Controlled language
  —Use of vocabulary control tool in indexing
  —Semantic Web
  —Dublin Core
  —XML Ontologies
• Natural language (free text)
  —Any term in the document may be an index term.
   No mechanism controls the indexing process
  —Modern Search Engine
Who wins?
A Modern IR System (A Search Engine)
• Crawler
• Indexer
• Searcher
Basic Concepts: Tokenization
• Assign unique id to each word & keep in a lexicon
• HTML tags can be
  —Ignored or
  —Used to assign more weight to important items such
   as <title>
• Remove Stop/Noise words before/after tokenization
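A minimal tokenizer along these lines assigns each distinct word a unique id and records it in the lexicon (a toy sketch; real tokenizers handle punctuation, markup, and encodings far more carefully):

```python
def tokenize(text, lexicon):
    """Split text into lowercase word tokens, assign each distinct word
    a unique id, and record new words in the lexicon (word -> id)."""
    token_ids = []
    for word in text.lower().split():
        word = word.strip(".,!?")          # crude punctuation stripping
        if not word:
            continue
        if word not in lexicon:
            lexicon[word] = len(lexicon)   # next unused id
        token_ids.append(lexicon[word])
    return token_ids

lexicon = {}
ids = tokenize("To be, or not to be!", lexicon)
print(lexicon)  # {'to': 0, 'be': 1, 'or': 2, 'not': 3}
print(ids)      # [0, 1, 2, 3, 0, 1]
```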
Basic concepts: Stop/Noise Words Removal
High Frequency Words that carry little information:
a about all also among an and are as at be been
between both but by click did do during each either
for found from further get had …
(and the same in other languages)
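Removal is then a simple filter against the stop list (the list below is a toy subset for illustration):

```python
# A small illustrative stop-word filter (the stop list is a toy subset)
STOP_WORDS = {"a", "about", "an", "and", "are", "as", "at",
              "be", "by", "for", "from", "the", "to", "of"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = "the quick brown fox and the lazy dog".split()
print(remove_stop_words(tokens))  # ['quick', 'brown', 'fox', 'lazy', 'dog']
```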
Basic concepts: Stemming (Roman
Languages)
• Useful for
  —Reducing # Words (Dimensionality)
  —Machine Translation
    ◦ Morphology
    ◦ e.g., working, works, worked → work
• Performance Improvement
  —No: Harman (1991), Krovetz (1993)
  —Possibly: Krovetz (1995)
    ◦ Depends on type of text, and the assumption is that
      once one moves beyond English, the difference will
      prove significant
• Stemmer
  —Porter Stemmer, k-stemmer
Porter Stemmer
• Porter Stemmer (Porter 1980)
  —http://www.tartarus.org/~martin/PorterStemmer/
  —http://snowball.tartarus.org/
  —Simple algorithms to determine which affixes to
   strip in which order and when to apply repair
   strategies
           Input   Stripped “ed” affix                 Repair
        hoped      hop                   hope (add e if word is short)
        hopped     hopp                  hop (delete one if doubled)
• Samples of the algorithm accessible on the Internet
• Easy to understand and program
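The “ed”-stripping step with the two repair rules from the table above can be sketched in a few lines (a toy fragment only, not the full Porter algorithm; the length thresholds are illustrative):

```python
VOWELS = set("aeiou")

def strip_ed(word):
    """Strip an 'ed' suffix, then repair: undouble a doubled final
    consonant, or add 'e' to a short stem. A toy fragment of one
    Porter step, not the full algorithm."""
    if not word.endswith("ed") or len(word) <= 3:
        return word
    stem = word[:-2]
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        return stem[:-1]            # hopped -> hopp -> hop
    if len(stem) <= 3:
        return stem + "e"           # hoped -> hop -> hope
    return stem

print(strip_ed("hoped"))   # hope
print(strip_ed("hopped"))  # hop
```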
Porter Stemmer
Sample output:
    consigned     → consign       knack      → knack
    consignment   → consign       knackeries → knackeri
    consolation   → consol        knaves     → knave
    consolatory   → consolatori   knavish    → knavish
    consolidate   → consolid      knif       → knif
    consolidating → consolid      knife      → knife
    consoling     → consol        knew       → knew
Errors
• Conflation:
  —reply, rep. → rep
• Overstemming:
  —wander → wand
  —news → new
• Mis-stemming:
  —relativity → relative
• Understemming:
  —knavish → knavish (knave)
Stemmer vs. Dictionary
• Stemming Rules more efficient than a dictionary
  —Algorithmic stemmers can be fast (and lean): 1
   Million words in 6 seconds on a 500 MHz PC
• No maintenance even if things change
• Better to ignore irregular forms (exceptions) than to
  complicate the algorithm
  —not much lost in practice
  —80/20 Rule
Basic Concepts: Phrase Detection
• Important for English
  —New York City Police Department
  —Bill Gates spoke on the benefits of Windows
• Essential for CJK (Chinese, Japanese, Korean)
  —新加坡是个美丽的城市
   [Singapore is a beautiful city]
• Approaches
  —Dictionary Based
    ◦ Most Accurate; Needs maintenance (by humans)
  —Learnt/Extracted from Corpus
    ◦ Hidden Markov Model; N-Grams; Statistical Analysis
    ◦ Suffix Tree Phrase Detection (via statistical counting)
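The corpus-based approach can be sketched as simple bigram counting: adjacent word pairs that recur often enough are treated as candidate phrases (the corpus and threshold below are illustrative):

```python
from collections import Counter

def extract_phrases(texts, min_count=2):
    """Count adjacent word pairs (bigrams) across a corpus and keep
    those frequent enough to look like phrases."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(zip(words, words[1:]))
    return [" ".join(bg) for bg, c in counts.items() if c >= min_count]

corpus = [
    "new york city police department",
    "the new york subway",
    "police department of new york city",
]
print(extract_phrases(corpus))  # ['new york', 'york city', 'police department']
```

Real systems weight the counts statistically (e.g., mutual information) rather than using a raw threshold.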
Phrases Extracted from News Dataset
• south: south africa, south africa internet, south africa po,
  south african, south african government, south african
  intelligence, south african libertarian, south america,
  south atlantic, south dakota, south dakota writes, south
  georgia, south georgia island, south pacific, south pacific
  island
• san: san diego, san diego ca, san francisco, san francisco
  bay, san francisco chronicle, san francisco giants, san
  francisco police, san francisco police inspector, san
  francisco police inspector ron, san francisco police
  intelligence, san francisco police intelligence unit, san
  francisco police officer, san jose, san jose ca, san jose
  mercury
• united: united kingdom, united nations, united nations
  quite, united states, united states attempt, united states
  code, united states government, united states holocaust,
  united states officially, united states senate
• high: high density, high end, high enough, high frequency,
  high hockey, high just, high level, high performance, high
  power, high quality, high ranking, high ranking crime,
  high ranking initiate, high resolution, high school, high
  school students, high speed, high speed collision, high
  sticking, high tech, high voltage, highend, higher
Vector Space Text Representation
• Bag of Words (BOW) Model
  —Order/Position of word/term unimportant
• Term Frequency (TF)
  —Terms that appear more frequently in a document are
   considered more important in representing that
   document
Basic Concepts: Weighing the Terms
• Which of these tells you more about a doc?
  —10 occurrences of pizza?
  —10 occurrences of the?
• Look at
  —Collection frequency (CF)
  —Document frequency (DF), which is better:
                   Word        CF      DF
             TRY          10,422   8,760
             INSURANCE    10,440   3,997
    ◦ DF is more useful: the two words have similar CF,
      but “insurance” is concentrated in far fewer
      documents
• CF/DF weighting possible only in a known collection
Basic Concepts: TFIDF
• TF x IDF measure combines:
  —Term frequency (TF): measure of term density in a doc
  —Inverse document frequency (IDF): measure of
   informativeness of term, rarity across whole corpus
  —Simplest version: the inverse of the raw count of
   documents containing the term:

                     IDF𝑖 = 1 / DF𝑖

• By far the most commonly used version is:

                     IDF𝑖 = log(𝑁 / DF𝑖)
        See Kishore Papineni, NAACL 2, 2002 for theoretical justification
Basic Concepts: TFIDF
• Assign a TFIDF weight to each term 𝑖 in each
  document 𝑑
              𝑤𝑖,𝑑 = TF𝑖,𝑑 × log(𝑁 / DF𝑖)
• Word is more important if
  —The word appears more often in the document
  —The word does not appear that often in the whole
   corpus
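The weighting can be sketched directly from the formula (a toy corpus; raw counts and natural log are one common choice among several variants):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute w_{i,d} = TF_{i,d} * log(N / DF_i) for a list of
    tokenized documents. Sketch: raw term counts, natural log."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # each doc counts once per term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [
    "the pizza shop sells pizza".split(),
    "the shop is closed".split(),
    "the fresh pizza delivered".split(),
]
w = tfidf_weights(docs)
print(w[0]["the"])              # 0.0 -- appears in every document
print(round(w[0]["pizza"], 3))  # 0.811 -- frequent here, rare elsewhere
```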
Documents as Vectors
• Each document 𝑑 can now be viewed as a vector of
  TF × IDF values, one component for each term
• So we have a vector space
  —terms are axes
  —docs live in this space
  —even with stemming, may have 100,000+
   dimensions
• The corpus of documents gives us a matrix, which
  we could also view as a vector space in which words
  live
Why turn documents into vectors
• First application: Query-by-example
  —Given a doc/query 𝑑, find others “like” it (or most
   similar to it)
• Now that 𝑑 is a vector, find vectors (docs) “near” it.
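“Near” is usually measured with cosine similarity between the vectors; a minimal sketch over sparse term-weight dictionaries (the weights below are illustrative, not computed from a real corpus):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Illustrative tf-idf vectors
query = {"pizza": 1.0, "delivery": 0.5}
docs = {
    "d1": {"pizza": 0.8, "shop": 0.3},
    "d2": {"weather": 0.9, "forecast": 0.7},
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # d1 -- shares the term "pizza" with the query
```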
What is not IR
• Why not a SQL query?
  —IR ≠ DB query
  —IR ≠ XML
• Data → Query
  —IR = unstructured
  —XML = semi-structured
  —DB = structured
Search Engine Evolution
• 1st generation (use only “on page” data)
  —Text data, word frequency, language
  —1995–1997: AltaVista, Excite, Lycos, …
• 2nd generation (use off-page, web-specific data)
  —Link (or connectivity) analysis
  —Click-through data (what people click)
  —Anchor-text (how people refer to this page)
  —1998–present: made popular by Google, used by
   everyone now
• 3rd generation (answer “the need behind the query”)
  —Semantic analysis — what is this about?
  —Focus on user need, rather than on the query
  —Context determination
  —Present day: Siri, Alexa, Cortana, Google Assistant,
   IBM Watson
1st Generation Search Engine
• Extended Boolean model
  —Matches: exact, prefix, phrase, …
  —Operators: AND, OR, NOT, NEAR, …
  —Fields: TITLE:, URL:, HOST:, …
  —AND is somewhat easier to implement, maybe
   preferable as default for short queries
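Boolean AND over an inverted index reduces to intersecting sorted posting lists with a two-pointer walk (the postings below are made up for illustration):

```python
def intersect(p1, p2):
    """Intersect two sorted posting lists with a two-pointer walk."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Toy postings: doc ids containing each term
postings = {"new": [1, 2, 4, 11], "york": [2, 3, 4, 10]}
print(intersect(postings["new"], postings["york"]))  # [2, 4]
```

The walk is linear in the combined list lengths, which is why AND queries are cheap to evaluate.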
• Ranking
  —TF-like factors: TF, explicit keywords, words in title,
   explicit emphasis (headers), etc.
  —IDF factors: IDF, total word count in corpus,
   frequency in query log, frequency in language
2nd Generation Search Engine
• Ranking — use off-page, web-specific data
  —Link (or connectivity) analysis
  —Click-through data (what results people click on)
  —Anchor-text (how people refer to this page)
• Link Analysis
  —Idea: mine hyperlink information on the Internet
  —Assumptions:
     ◦ Links often connect related pages
     ◦ A link between pages is a recommendation
       ▪ “people vote with their links”
3rd Generation Search Engine
• Query language determination
  —e.g., if the query is in Japanese, do not return English pages
• Different ranking
• Hard & soft matches
  —Personalities (triggered on names)
  —Cities (travel info, maps)
  —Medical info (triggered on names and/or results)
  —Stock quotes, news (triggered on stock symbol)
  —Company info, …
• Integration of Search and Text Analysis
3rd Generation Search Engine (cont’d)
• Context determination
  —where: spatial (user location/target location)
  —when: query stream (previous queries)
  —who: personalization (user profile)
  —explicit (family friendly)
  —implicit (use google.com.sg or google.fr)
• Context use
  —Result restriction
  —Ranking modulation
History: Google Architecture (circa 1998)
Implemented in C/C++ on Linux and off-the-shelf PCs
For Comparison, Google Today
Google Architecture (c. 1998): Crawler
Google Architecture (c. 1998): Indexer
Google Architecture (c. 1998): Searcher
Google’s Algorithm
• Imagine a browser randomly walking on the Internet:
  —Start at a random page
  —At each step, go out of the current page along one
   of the links on that page, equiprobably
    ◦ e.g., each of 3 out-links is followed with probability 1/3
  —“In the steady state” each page has a long-term visit
   rate - use this as the page’s score
• BUT: The web is full of dead-ends
  —Random walk can get stuck in dead-ends
  —No sense to talk about long-term visit rates
Google Teleporting
• At each visit,
  —with probability 10%, jump to a random web page
  —with remaining probability (90%), go out on a
   random link
  —If no out-link, stay put in this case
• Now cannot get stuck locally
  —There is a long-term rate at which any page is visited
• Motivation:
  —Pages cited from many places are IMPORTANT
  —Pages cited from an important place are
   IMPORTANT.
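The long-term visit rates described above are the PageRank scores; a power-iteration sketch on a tiny made-up graph (0.9 damping matches the 90% follow-a-link probability; this is an illustration, not Google’s actual implementation):

```python
def pagerank(links, damping=0.9, iters=100):
    """Power iteration for random-walk-with-teleport visit rates.
    links[p] is the list of pages p links to; on the link-following
    branch, a page with no out-links 'stays put', as above."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}  # teleport share
        for p in pages:
            out = links[p] if links[p] else [p]        # no out-link: stay put
            share = damping * rank[p] / len(out)
            for q in out:
                new[q] += share
        rank = new
    return rank

# Toy web: C is cited from both A and B, so it ends up most important
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # C
```

The scores sum to 1 and converge after a few dozen iterations on a graph this small.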
Anchor Text
• Associate anchor text of a link to the page it points
  to
• Advantages:
  —Links provide more accurate description
  —Can index documents that text-based search engines
   cannot (e.g. Google Image Search)
Key Google Optimization Techniques
• Each crawler maintains its local DNS lookup cache
• Parallelization of indexing phase
• In-memory lexicon
• Compression of repository
• Compact encoding of hitlists accounting for major space
  savings
• Indexer is optimized so it is just faster than the crawler
  (bottleneck)
• Document index updated in bulk
• Critical data structures placed on local disk
• Overall architecture designed to avoid disk seeks
  wherever possible
Google Storage Requirements (c. 1998)
• Total disk space: approx. 106 GB
  —standard PC hard disk: approx. 10 GB in 1998
                                             Price of 1 GB
                                             1981 $ 300 000
                                             1987 $ 50 000
                                             1990 $ 10 000
                                             1994 $ 1 000
                                             1997 $ 100
                                             2000 $ 10
                                             2004 $ 1
                                             2012 $ 0.1
                                             2017 $ 0.03