CI-6226
Lecture 1. Introduction
    Information Retrieval and Analysis
                                         Vasily Sidorov
Today’s Outline
• Course information
  —Why learn IR?
• Walkthrough of the components in a modern IR
  system
• Search engine evolution
• Overview of Google search engine (1998)
  —Architecture
  —PageRank
Course Instructor (me)
• Dr. Vasily Sidorov
• SCSE, NTU
• Questions, feedback, anything
  —vsidorov@ntu.edu.sg
Textbook
           • Introduction to Information
             Retrieval
             —Christopher D. Manning
             —Prabhakar Raghavan
             —Hinrich Schütze
           • Available online at:
             —http://nlp.stanford.edu/IR-book/
Reference Books
• Modern Information Retrieval: The Concepts and
  Technology behind Search (2nd Edition)
• Information Retrieval: Implementing and Evaluating
  Search Engines
• Managing Gigabytes: Compressing and Indexing
  Documents and Images (2nd Edition)
• Search Engines: Information Retrieval in Practice
Pre-requisites
• Mathematics
  —Linear algebra: vectors, matrices, matrix inverse
  —Probability theory basics
    ◦ e.g., P(AB) = P(A)P(B) for independent events A and B
• Hands-on Attitude
  —Embrace new technology and ideas
• Programming
  —Any programming language will do
    ◦ Python, Java, C#, C/C++, JS …
  —Work with files
  —Text encodings (ASCII, UTF, …)
  —Algorithms
  —Data structures
After taking this course, you’ll know
• How to build your own search engine, or customize
  an existing text search engine
• How to enhance applications using IR, e.g.,
  —Cluster text-like information such as microarray data
  —Find similar actions / data / objects
  —Parse/analyze text/dialogues (e.g., Facebook posts,
   Twitter, comments)
• How to build your own NextGen IR killer-app
  —e.g., matching people based on their preferences
  —limited only by your imagination!
This course does NOT cover
• Non-text data Retrieval
  —Image
  —Video
• XML Retrieval and NoSQL databases
• Natural Language Processing
  —Ontologies, e.g., WORDNET, HOWNET
  —Part of Speech Tagging, Grammar, Parsing, …
  —GPT-3 and other generative text models
• Structured Data Retrieval
  —SQL
Let’s start
• What is IR?
  —IR = Integrated Resort? Internet Relay?
  —IR = Information Retrieval
• What to retrieve?
  —bookmarks like del.icio.us
  —people, like LinkedIn, Facebook
  —books (in library or on Amazon)
  —text (web pages, medical reports, assignment
   reports)
  —images (photos, Flickr)
  —videos (movies, YouTube)
• IR vs. Text Mining
What is Text Mining?
• “The objective of Text Mining is to exploit information
  contained in textual documents in various ways,
  including …discovery of patterns and trends in data,
  associations among entities, predictive rules, etc.”
                                   — Grobelnik et al., 2001
• “Another way to view text data mining is as a process of
  exploratory data analysis that leads to heretofore
  unknown information, or to answers for questions for
  which the answer is not currently known”
                                            — Hearst, 1999
Text Mining Challenges
• Data collection is “free text”
  —Not well-organized; Semi-structured
   or unstructured
• Natural language text contains
  ambiguities on many levels
  —Lexical, syntactic, semantic, and
   pragmatic, e.g.,
          Time flies like an arrow.
          Fruit flies like a banana.
• Learning techniques for processing text typically
  need annotated training examples
Text Mining Research Areas
• Information Retrieval (IR)
   —Search Engines
   —Classification
   —Recommendation
• Information Extraction (IE)
   —Product Information (e.g. price) scraping
   —Named entity recognition
• Information Understanding
   —Natural Language Processing (NLP)
   —Question Answering
   —Concept Extraction from Newsgroup
   —Visualization, Summarization
• Cross-Lingual Text Mining
• Trend Detection
   —Outlier Detection
   —Event Detection
Why Learn IR?
• Understand limitations of state-of-the-art IR
  —Learn what is possible in IR, tell Fiction from Fact
  —Learn how to fool IR systems?
• Organize your personal information
  —Master/create IR software to manage personal
   information
• How to use Search Engines better
• Design next generation IR system!
  —Be the next Google (not necessarily a search engine)
  —Yahoo, Google, [Your Company]
How to Retrieve Information
• Example
  —Scan through every book in library/store bookshelf
  —View every image/video
• To speed up IR:
  —Scan every piece of information once, before any
   query is answered
    ◦ Google/Bing/etc. try to download the entire Web
  —Indexing = scan everything once and remember
   where each piece of information is located
    ◦ “1984” located at Level 2 Shelf 34 of National Library
    ◦ List of documents containing “1984” stored on disk C:\
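The “list of documents containing a term” is an inverted index. A minimal sketch in Python (the documents here are made-up examples):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Toy collection: doc id -> text
docs = {
    1: "nineteen eighty four",
    2: "animal farm",
    3: "nineteen eighty four essays",
}
index = build_inverted_index(docs)
print(index["nineteen"])  # [1, 3]
```

At query time the engine only looks up the term’s posting list instead of scanning every document.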
History
• 300 BC, Great Library of Alexandria, Egypt
  —Most books were stored in armaria (closed, labelled
   cupboards). Armaria were used for book storage till
   medieval times.
Libraries Before Computers
• Cataloging
  —A process of describing a document (both physical
   attributes & subject contents)
  —Catalog = key to a collection
• Bibliographic record
  —A surrogate record produced by catalogers according
   to defined standards (e.g., Machine Readable
   Cataloging record)
• Subject Classification
  —Allocating a classification number
Classical Indexing
• Indexing
  —Human librarians construct document surrogates by
   assigning identifiers to text items.
• Includes
  —Keyword Indexing
     ◦ Similar to today's Search Engine Index
  —Subject Indexing
     ◦ Similar to today's Classification Engine
Subject Indexing - Classification
• Hierarchical structure
  —Similar subjects at the same level
    ◦ e.g., Furniture → Chairs, Tables
• Goals of Classification
  —Collocate subjects
    ◦ group all documents of same subject together on
      shelves & put them next to related subjects
  —Define & Assign code (Call Number) to document
    ◦ to facilitate identification from the catalogue and
      shelf location
Dewey Decimal Classification (DDC)
• Most widely used
  —Used by > 135 countries
• Translated into more than 30 languages
  —Arabic, Chinese, French, Greek, Hebrew, Icelandic,
   Russian, Spanish
• Universe of knowledge divided into 10 main classes
  —Each class divided into 10 main divisions, …
  —until all disciplines, subjects and concepts are
   defined.
• Currently: 23rd edition (2011)
DDC Example
000 Computer science, information & general works
100 Philosophy & psychology
200 Religion
300 Social sciences
400 Language
500 Science
600 Technology
700 Arts & recreation
800 Literature
900 History & geography

600 Technology (applied sciences)
610 Medical sciences
620 Engineering and allied operations
630 Agriculture and related technologies
640 Home economics and family living
650 Management and auxiliary services
660 Chemical engineering and related technologies
670 Manufactures
680 Manufacture of products for specific uses
690 Buildings

620 Engineering & allied operations
621 Applied physics
622 Mining and related operations
623 Military and nautical engineering
624 Civil engineering
625 Engineering of railroads and roads
626 [not used]
627 Hydraulic engineering
628 Sanitary and municipal engineering
629 Other branches of engineering

Another example:
500 Natural sciences and mathematics
  510 Mathematics
    516 Geometry
      516.3 Analytic geometries
        516.37 Metric differential geometries
          516.375 Finsler geometry
DDC is not ideal…
• DDC Classification Guidelines
  —Determine the subject of a work
  —Determine the disciplinary focus of a work
  —Refer to the schedules
• Rules to handle a document in multiple classes
  —First-of-two Rule: When two subjects receive equal
   treatment, classify the work with the subject whose
   number comes first in the schedules
  —Rule of Application: Classify a work dealing with
   interrelated subjects with the subject that is acted
   upon
Classical Indexing
The Natural Language problem:
• Low consistency:
  —People use different words to refer to same things
  —People use same words to refer to different things
• Objective in IR:
  —Search & retrieval of documents (or records) require
   some level of intellectual control over the item and
   its contents, at the same time, recognizing the need
   for flexibility
 Classical Indexing
• Keyword indexing (Google)
  —Index entries generated
   from the title and/or
   keywords from the text.
  —No intellectual process of
   text analysis or abstraction
• Subject indexing (Yahoo)
  —Involves analysis of the subject by humans /
   computers
Classical Indexing Problems
• Effectiveness of indexing depends on:
  —Indexing Exhaustiveness
    ◦ extent to which the subject matter of a given
      document has been reflected through the index entries
  —Term Specificity
    ◦ how broad/specific are the terms/keywords
Vocabulary Control: Controlled vs Natural
language indexing
• Controlled language
  —Use of vocabulary control tool in indexing
  —Semantic Web
  —Dublin Core
  —XML Ontologies
• Natural language (free text)
  —Any term in the document may be an index term.
   No mechanism controls the indexing process
  —Modern Search Engine
Who wins?
A Modern IR System (A Search Engine)
• Crawler
• Indexer
• Searcher
Basic Concepts: Tokenization
• Assign unique id to each word & keep in a lexicon
• HTML tags can be
  —Ignored or
  —Used to assign more weight to important items such
   as <title>
• Remove Stop/Noise words before/after tokenization
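A minimal tokenizer along these lines assigns each distinct word a unique id and records it in the lexicon (a toy sketch; real tokenizers handle punctuation, markup, and encodings far more carefully):

```python
def tokenize(text, lexicon):
    """Split text into lowercase word tokens, assign each distinct word
    a unique id, and record new words in the lexicon (word -> id)."""
    token_ids = []
    for word in text.lower().split():
        word = word.strip(".,!?")          # crude punctuation stripping
        if not word:
            continue
        if word not in lexicon:
            lexicon[word] = len(lexicon)   # next unused id
        token_ids.append(lexicon[word])
    return token_ids

lexicon = {}
ids = tokenize("To be, or not to be!", lexicon)
print(lexicon)  # {'to': 0, 'be': 1, 'or': 2, 'not': 3}
print(ids)      # [0, 1, 2, 3, 0, 1]
```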
Basic concepts: Stop/Noise Words Removal
High Frequency Words that carry little information:
a about all also among an and are as at be been
between both but by click did do during each either
for found from further get had …
(and the same in other languages)
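Removal is then a simple filter against the stop list (the list below is a toy subset for illustration):

```python
# A small illustrative stop-word filter (the stop list is a toy subset)
STOP_WORDS = {"a", "about", "an", "and", "are", "as", "at",
              "be", "by", "for", "from", "the", "to", "of"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = "the quick brown fox and the lazy dog".split()
print(remove_stop_words(tokens))  # ['quick', 'brown', 'fox', 'lazy', 'dog']
```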
Basic concepts: Stemming (Roman
Languages)
• Useful for
  —Reducing # Words (Dimensionality)
  —Machine Translation
    ◦ Morphology
    ◦ e.g., working, works, worked → work
• Performance Improvement
  —No: Harman (1991), Krovetz (1993)
  —Possibly: Krovetz (1995)
    ◦ Depends on type of text, and the assumption is that
      once one moves beyond English, the difference will
      prove significant
• Stemmer
  —Porter Stemmer, k-stemmer
Porter Stemmer
• Porter Stemmer (Porter 1980)
  —http://www.tartarus.org/~martin/PorterStemmer/
  —http://snowball.tartarus.org/
  —Simple algorithms to determine which affixes to
   strip in which order and when to apply repair
   strategies
           Input   Stripped “ed” affix                 Repair
        hoped      hop                   hope (add e if word is short)
        hopped     hopp                  hop (delete one if doubled)
• Samples of the algorithm accessible on the Internet
• Easy to understand and program
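The “ed”-stripping step with the two repair rules from the table above can be sketched in a few lines (a toy fragment only, not the full Porter algorithm; the length thresholds are illustrative):

```python
VOWELS = set("aeiou")

def strip_ed(word):
    """Strip an 'ed' suffix, then repair: undouble a doubled final
    consonant, or add 'e' to a short stem. A toy fragment of one
    Porter step, not the full algorithm."""
    if not word.endswith("ed") or len(word) <= 3:
        return word
    stem = word[:-2]
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        return stem[:-1]            # hopped -> hopp -> hop
    if len(stem) <= 3:
        return stem + "e"           # hoped -> hop -> hope
    return stem

print(strip_ed("hoped"))   # hope
print(strip_ed("hopped"))  # hop
```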
Porter Stemmer
Sample output:
    consigned     → consign       knack      → knack
    consignment   → consign       knackeries → knackeri
    consolation   → consol        knaves     → knave
    consolatory   → consolatori   knavish    → knavish
    consolidate   → consolid      knif       → knif
    consolidating → consolid      knife      → knife
    consoling     → consol        knew       → knew
Errors
• Conflation:
  —reply, rep. → rep
• Overstemming:
  —wander → wand
  —news → new
• Mis-stemming:
  —relativity → relative
• Understemming:
  —knavish → knavish (knave)
Stemmer vs. Dictionary
• Stemming Rules more efficient than a dictionary
  —Algorithmic stemmers can be fast (and lean): 1
   Million words in 6 seconds on a 500 MHz PC
• No maintenance even if things change
• Better to ignore irregular forms (exceptions) than to
  complicate the algorithm
  —not much lost in practice
  —80/20 Rule
Basic Concepts: Phrase Detection
• Important for English
  —New York City Police Department
  —Bill Gates spoke on the benefits of Windows
• Essential for CJK (Chinese, Japanese, Korean)
  —新加坡是个美丽的城市
   [Singapore is a beautiful city]
• Approaches
  —Dictionary Based
    ◦ Most Accurate; Needs maintenance (by humans)
  —Learnt/Extracted from Corpus
    ◦ Hidden Markov Model; N-Grams; Statistical Analysis
    ◦ Suffix Tree Phrase Detection (via statistical counting)
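The corpus-based approach can be sketched as simple bigram counting: adjacent word pairs that recur often enough are treated as candidate phrases (the corpus and threshold below are illustrative):

```python
from collections import Counter

def extract_phrases(texts, min_count=2):
    """Count adjacent word pairs (bigrams) across a corpus and keep
    those frequent enough to look like phrases."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(zip(words, words[1:]))
    return [" ".join(bg) for bg, c in counts.items() if c >= min_count]

corpus = [
    "new york city police department",
    "the new york subway",
    "police department of new york city",
]
print(extract_phrases(corpus))  # ['new york', 'york city', 'police department']
```

Real systems weight the counts statistically (e.g., mutual information) rather than using a raw threshold.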
Phrases Extracted from News Dataset
• south: south africa, south africa internet, south africa po,
  south african, south african government, south african
  intelligence, south african libertarian, south america,
  south atlantic, south dakota, south dakota writes, south
  georgia, south georgia island, south pacific, south pacific
  island
• san: san diego, san diego ca, san francisco, san francisco
  bay, san francisco chronicle, san francisco giants, san
  francisco police, san francisco police inspector, san
  francisco police inspector ron, san francisco police
  intelligence, san francisco police intelligence unit, san
  francisco police officer, san jose, san jose ca, san jose
  mercury
• united: united kingdom, united nations, united nations
  quite, united states, united states attempt, united states
  code, united states government, united states holocaust,
  united states officially, united states senate
• high: high density, high end, high enough, high frequency,
  high hockey, high just, high level, high performance, high
  power, high quality, high ranking, high ranking crime,
  high ranking initiate, high resolution, high school, high
  school students, high speed, high speed collision, high
  sticking, high tech, high voltage, highend, higher
Vector Space Text Representation
• Bag of Words (BOW) Model
  —Order/Position of word/term unimportant
• Term Frequency (TF)
  —Terms that appear more frequently in a document are
   considered more important in representing that
   document
Basic Concepts: Weighing the Terms
• Which of these tells you more about a doc?
  —10 occurrences of pizza?
  —10 occurrences of the?
• Look at
  —Collection frequency (CF)
  —Document frequency (DF), which is better:
                   Word        CF      DF
             TRY          10,422   8,760
             INSURANCE    10,440   3,997
    ◦ DF is more useful: the two words have similar CF,
      but “insurance” is concentrated in far fewer
      documents
• CF/DF weighting possible only in a known collection
Basic Concepts: TFIDF
• TF x IDF measure combines:
  —Term frequency (TF): measure of term density in a doc
  —Inverse document frequency (IDF): measure of
   informativeness of term, rarity across whole corpus
  —Simplest version: the inverse of the raw count of
   documents containing the term:

                     IDF𝑖 = 1 / DF𝑖

• By far the most commonly used version is:

                     IDF𝑖 = log(𝑁 / DF𝑖)
        See Kishore Papineni, NAACL 2, 2002 for theoretical justification
Basic Concepts: TFIDF
• Assign a TFIDF weight to each term 𝑖 in each
  document 𝑑
              𝑤𝑖,𝑑 = TF𝑖,𝑑 × log(𝑁 / DF𝑖)
• Word is more important if
  —The word appears more often in the document
  —The word does not appear that often in the whole
   corpus
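The weighting can be sketched directly from the formula (a toy corpus; raw counts and natural log are one common choice among several variants):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute w_{i,d} = TF_{i,d} * log(N / DF_i) for a list of
    tokenized documents. Sketch: raw term counts, natural log."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # each doc counts once per term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [
    "the pizza shop sells pizza".split(),
    "the shop is closed".split(),
    "the fresh pizza delivered".split(),
]
w = tfidf_weights(docs)
print(w[0]["the"])              # 0.0 -- appears in every document
print(round(w[0]["pizza"], 3))  # 0.811 -- frequent here, rare elsewhere
```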
Documents as Vectors
• Each document 𝑑 can now be viewed as a vector of
  TF × IDF values, one component for each term
• So we have a vector space
  —terms are axes
  —docs live in this space
  —even with stemming, may have 100,000+
   dimensions
• The corpus of documents gives us a matrix, which
  we could also view as a vector space in which words
  live
Why turn documents into vectors
• First application: Query-by-example
  —Given a doc/query 𝑑, find others “like” it (or most
   similar to it)
• Now that 𝑑 is a vector, find vectors (docs) “near” it.
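“Near” is usually measured with cosine similarity between the vectors; a minimal sketch over sparse term-weight dictionaries (the weights below are illustrative, not computed from a real corpus):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Illustrative tf-idf vectors
query = {"pizza": 1.0, "delivery": 0.5}
docs = {
    "d1": {"pizza": 0.8, "shop": 0.3},
    "d2": {"weather": 0.9, "forecast": 0.7},
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # d1 -- shares the term "pizza" with the query
```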
What is not IR
• Why not a SQL query?
  —IR ≠ DB query
  —IR ≠ XML
• Data → Query
  —IR = unstructured
  —XML = semi-structured
  —DB = structured
Search Engine Evolution
• 1st generation (use only “on page” data)
  —Text data, word frequency, language
  —1995–1997: AltaVista, Excite, Lycos, …
• 2nd generation (use off-page, web-specific data)
  —Link (or connectivity) analysis
  —Click-through data (what people click)
  —Anchor-text (how people refer to this page)
  —1998–present: made popular by Google, used by
   everyone now
• 3rd generation (answer “the need behind the query”)
  —Semantic analysis — what is this about?
  —Focus on user need, rather than on the query
  —Context determination
  —Present day: Siri, Alexa, Cortana, Google Assistant,
   IBM Watson
1st Generation Search Engine
• Extended Boolean model
  —Matches: exact, prefix, phrase, …
  —Operators: AND, OR, NOT, NEAR, …
  —Fields: TITLE:, URL:, HOST:, …
  —AND is somewhat easier to implement, maybe
   preferable as default for short queries
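Boolean AND over an inverted index reduces to intersecting sorted posting lists with a two-pointer walk (the postings below are made up for illustration):

```python
def intersect(p1, p2):
    """Intersect two sorted posting lists with a two-pointer walk."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Toy postings: doc ids containing each term
postings = {"new": [1, 2, 4, 11], "york": [2, 3, 4, 10]}
print(intersect(postings["new"], postings["york"]))  # [2, 4]
```

The walk is linear in the combined list lengths, which is why AND queries are cheap to evaluate.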
• Ranking
  —TF-like factors: TF, explicit keywords, words in title,
   explicit emphasis (headers), etc.
  —IDF factors: IDF, total word count in corpus,
   frequency in query log, frequency in language
2nd Generation Search Engine
• Ranking — use off-page, web-specific data
  —Link (or connectivity) analysis
  —Click-through data (what results people click on)
  —Anchor-text (how people refer to this page)
• Link Analysis
  —Idea: mine hyperlink information on the Internet
  —Assumptions:
     ◦ Links often connect related pages
     ◦ A link between pages is a recommendation
       ▪ “people vote with their links”
3rd Generation Search Engine
• Query language determination
  —e.g., if the query is in Japanese, do not return English pages
• Different ranking
• Hard & soft matches
  —Personalities (triggered on names)
  —Cities (travel info, maps)
  —Medical info (triggered on names and/or results)
  —Stock quotes, news (triggered on stock symbol)
  —Company info, …
• Integration of Search and Text Analysis
3rd Generation Search Engine (cont’d)
• Context determination
  —where: spatial (user location/target location)
  —when: query stream (previous queries)
  —who: personalization (user profile)
  —explicit (family friendly)
  —implicit (use google.com.sg or google.fr)
• Context use
  —Result restriction
  —Ranking modulation
History: Google Architecture (circa 1998)
Implemented in C/C++ on Linux and off-the-shelf PCs
For Comparison, Google Today
Google Architecture (c. 1998): Crawler
Google Architecture (c. 1998): Indexer
Google Architecture (c. 1998): Searcher
Google’s Algorithm
• Imagine a browser randomly walking on the Internet:
  —Start at a random page
  —At each step, go out of the current page along one
   of the links on that page, equiprobably
    ◦ e.g., each of 3 out-links is followed with probability 1/3
  —“In the steady state” each page has a long-term visit
   rate - use this as the page’s score
• BUT: The web is full of dead-ends
  —Random walk can get stuck in dead-ends
  —No sense to talk about long-term visit rates
Google Teleporting
• At each visit,
  —with probability 10%, jump to a random web page
  —with remaining probability (90%), go out on a
   random link
  —If no out-link, stay put in this case
• Now cannot get stuck locally
  —There is a long-term rate at which any page is visited
• Motivation:
  —Pages cited from many places are IMPORTANT
  —Pages cited from an important place are
   IMPORTANT.
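The long-term visit rates described above are the PageRank scores; a power-iteration sketch on a tiny made-up graph (0.9 damping matches the 90% follow-a-link probability; this is an illustration, not Google’s actual implementation):

```python
def pagerank(links, damping=0.9, iters=100):
    """Power iteration for random-walk-with-teleport visit rates.
    links[p] is the list of pages p links to; on the link-following
    branch, a page with no out-links 'stays put', as above."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}  # teleport share
        for p in pages:
            out = links[p] if links[p] else [p]        # no out-link: stay put
            share = damping * rank[p] / len(out)
            for q in out:
                new[q] += share
        rank = new
    return rank

# Toy web: C is cited from both A and B, so it ends up most important
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # C
```

The scores sum to 1 and converge after a few dozen iterations on a graph this small.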
Anchor Text
• Associate anchor text of a link to the page it points
  to
• Advantages:
  —Links provide more accurate description
  —Can index documents that text-based search engines
   cannot (e.g. Google Image Search)
Key Google Optimization Techniques
• Each crawler maintains its local DNS lookup cache
• Parallelization of indexing phase
• In-memory lexicon
• Compression of repository
• Compact encoding of hitlists accounting for major space
  savings
• Indexer is optimized so it is just faster than the crawler
  (bottleneck)
• Document index updated in bulk
• Critical data structures placed on local disk
• Overall architecture designed to avoid disk seeks
  wherever possible
Google Storage Requirements (c. 1998)
• Total disk space: approx. 106 GB
  —standard PC hard disk: approx. 10 GB in 1998
                                             Price of 1 GB
                                             1981 $ 300 000
                                             1987 $ 50 000
                                             1990 $ 10 000
                                             1994 $ 1 000
                                             1997 $ 100
                                             2000 $ 10
                                             2004 $ 1
                                             2012 $ 0.1
                                             2017 $ 0.03