The document discusses Information Retrieval (IR) and Information Extraction (IE) systems, detailing their components, processes, and purposes. IR focuses on retrieving unstructured information from large collections, while IE automates the extraction of specific information from text. Both systems play crucial roles in organizing and providing relevant data to users based on their queries.


CS8691 Artificial Intelligence

5.1 INFORMATION RETRIEVAL

Information retrieval (IR) is finding material (usually documents) of an unstructured
nature (usually text) that satisfies an information need from within large collections (usually
stored on computers). Generically, "collections" (less frequently, "corpora") are searched,
and "documents", namely web pages, PDFs, PowerPoint slides, paragraphs, etc., are retrieved.
An information retrieval system consists of a software program that helps a user find
the information the user needs.
The term "information retrieval" was coined by Calvin Mooers in 1952. Early
information retrieval systems were, strictly speaking, document retrieval systems, since they
were designed to retrieve documents containing the information rather than the information
itself. Information retrieval is a
process of searching some collection of documents, using the term document in its widest
sense, in order to identify those documents which deal with a particular subject. Any system
that is designed to facilitate this literature searching may legitimately be called an
information retrieval system.
Information retrieval systems originally meant text retrieval systems, since they
dealt with textual documents. Modern information retrieval systems deal with multimedia
information comprising text, audio, images, and video. While many features of conventional
text retrieval systems are equally applicable to multimedia information retrieval, the specific
nature of audio, image, and video information has called for the development of many new
tools and techniques for information retrieval.
Modern information retrieval deals with the storage and organization of, and access to, text as
well as multimedia information resources. The concept of information retrieval presupposes
that there are some documents or records containing information that have been organized in
an order suitable for easy retrieval. The documents or records we are concerned with contain
bibliographic information, which is quite different from other kinds of information or data.
Take a simple example. A database of information pertaining to an office or a supermarket
holds only different kinds of records and related facts: names of employees, their positions,
salaries, and so on for the office, or names of items, prices, quantities, and so on for the
supermarket. The main objective of a bibliographic information retrieval system, however, is
to retrieve the information, either the actual information or the documents containing it,
that fully or partially matches the user's query. The database may contain abstracts or full
texts of documents, such as newspaper articles, handbooks, dictionaries, encyclopedias, legal
documents, and statistics, as well as audio, images, and video information.

Rohini College of engineering and technology Page 1
An information retrieval system thus has three major components: the document
subsystem, the user subsystem, and the searching/retrieval subsystem. These divisions are
quite broad, and each one is designed to serve one or more functions, such as:
• Analysis of documents and organization of information (creation of a document
database)
• Analysis of users' queries and preparation of a strategy to search the database
• Actual searching or matching of users' queries with the database, and finally
• Retrieval of items that fully or partially match the search statement.
IR is a three-step process:
• Asking a question (how do we use the language to get what we want?)
• Building an answer from known data (how do we refer to a given text?)
• Assessing the answer (does it contain the information we are seeking?)

Fig: The Information Retrieval Cycle

5.1.1 IR System Components

• Text Operations form index words (tokens):
  - Stop word removal
  - Stemming
• Indexing constructs an inverted index of word-to-document pointers.
• Searching retrieves documents that contain a given query token from the inverted
index.
• Ranking scores all retrieved documents according to a relevance metric.
• User Interface manages interaction with the user:
  - Query input and document output
  - Relevance feedback
  - Visualization of results
• Query Operations transform the query to improve retrieval:
  - Query expansion using a thesaurus
  - Query transformation using relevance feedback
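The components listed above can be illustrated with a minimal sketch. It assumes whitespace tokenization, a tiny illustrative stop-word list, and raw term frequency as the relevance metric. Stemming is left out for brevity. The names `tokenize`, `build_inverted_index`, and `search` are hypothetical and do not come from any particular IR library.

```python
from collections import Counter, defaultdict

# Illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "to"}


def tokenize(text):
    """Text operations: lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_inverted_index(docs):
    """Indexing: map each token to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token, count in Counter(tokenize(text)).items():
            index[token][doc_id] = count
    return index


def search(index, query):
    """Searching and ranking: look up each query token in the index and
    score documents by summed term frequency (ties broken by doc id)."""
    scores = Counter()
    for token in tokenize(query):
        for doc_id, tf in index.get(token, {}).items():
            scores[doc_id] += tf
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked]


docs = {
    1: "the cat sat in the hat",
    2: "a cat and a dog",
    3: "the dog chased the cat and the other cat",
}
index = build_inverted_index(docs)
results = search(index, "cat dog")  # document ids, best match first
```

Real systems usually replace raw term frequency with a weighting scheme that also accounts for how rare a term is across the collection, but the division into text operations, indexing, searching, and ranking stays the same.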
5.1.2 Purpose of Information Retrieval System
An information retrieval system is designed to retrieve the documents or information
required by the user community. It should make the right information available to the right
user. Thus, an information retrieval system aims at collecting and organizing information in
one or more subject areas in order to provide it to the user as soon as it is asked for. Belkin
presents the following situation, which clearly reflects the purpose of information retrieval
systems:
• A writer presents a set of ideas in a document using a set of concepts.
• Somewhere there will be some users who require the ideas but may not be able to
identify them. In other words, there will be some persons who lack the ideas put
forward by the author in his/her work.
• Information retrieval systems serve to match the writer's ideas expressed in the
document with the users' requirements or demands for them.
• Thus, an information retrieval system serves as a bridge between the world of
creators or generators of information and the users of that information.
Some terminology
• An IR system looks for data that match criteria defined by users in their queries.
• The language used to ask a question is called the query language.
• These queries use keywords (atomic items characterizing some data).
• The basic unit of data is a document (can be a file, an article, a paragraph, etc.).
• A document corresponds to free text (may be unstructured).
• All the documents are gathered into a collection (or corpus).
Example:
1 million documents, each containing about 1000 words.
If each word is encoded using 6 bytes, the collection occupies
$10^6 \times 1000 \times 6 = 6 \times 10^9$ bytes $\approx 6$ GB.
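The arithmetic in this example can be checked with a short sketch. The function name `corpus_size_bytes` is hypothetical. The figures (1 million documents, 1000 words each, 6 bytes per word) are taken from the example itself.

```python
def corpus_size_bytes(num_docs, words_per_doc, bytes_per_word):
    """Raw text size of a collection, ignoring any index or metadata."""
    return num_docs * words_per_doc * bytes_per_word


size = corpus_size_bytes(1_000_000, 1000, 6)
size_gb = size / 10**9     # decimal gigabytes (10^9 bytes)
size_gib = size / 1024**3  # binary gibibytes (2^30 bytes)
print(f"{size} bytes = {size_gb:.1f} GB = {size_gib:.2f} GiB")
```

The two units differ slightly: the collection is exactly 6 GB in decimal units and a little under 5.6 GiB in binary units, which is why "about 6 GB" is only an approximation when powers of 1024 are used.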
5.1.3 Components of Information Retrieval
In an information retrieval system there are the documents or sources of information on
one side and the users' queries on the other. These two sides are linked through a
series of tasks. Lancaster mentions that an information retrieval system comprises six major
subsystems:
• The document subsystem
• The indexing subsystem
• The vocabulary subsystem
• The searching subsystem
• The user-system interface, and
• The matching subsystem

Three major components of an IR system

1) Document subsystem
a) Acquisition
b) Representation
c) File organization
2) User subsystem
a) Problem
b) Representation
c) Query
3) Searching/retrieval subsystem
a) Matching
b) Retrieved objects
5.1.4 Kinds of Information Retrieval Systems
Two broad categories of information retrieval systems can be identified: in-house and
online.
In-house information retrieval systems are set up by a particular library or information
center to serve mainly the users within the organization. One particular type of in-house
database is the library catalogue. Online public access catalogues (OPACs) provide facilities
for library users to carry out online catalogue searches and then to check the availability of
the item required. Online IR retrieves data from websites, web pages, and
servers, which may include databases, images, text, tables, and other types of content.

5.1.5 Functions of an information retrieval system

An information retrieval system deals with various sources of information on the one
hand and users' requirements on the other. It must:
• Analyze the contents of the sources of information as well as the users' queries, and
then
• Match these to retrieve those items that are relevant
The major functions of an information retrieval system can be listed as follows:
• To identify the information (sources) relevant to the areas of interest of the target
user community
• To analyze the contents of the sources (documents)
• To represent the contents of the analyzed sources in a way that will be suitable for
matching users' queries
• To analyze users' queries and to represent them in a form that will be suitable for
matching with the database
• To match the search statement with the stored database
• To retrieve the information that is relevant, and
• To make necessary adjustments in the system based on feedback from the users.

5.1.6 Features of an information retrieval system

An effective information retrieval system must have provisions for:
• Prompt dissemination of information
• Filtering of information
• The right amount of information at the right time
• Active switching of information
• Receiving information in an economical way
• Browsing
• Getting information in an economical way
• Current literature
• Access to other information systems
• Interpersonal communications, and
• Personalized help.

5.1.7 Indexing usually consists of several phases

• First, after word segmentation, stop words are removed.
• These common words, such as articles or prepositions, carry little meaning by
themselves and are ignored in the document representation.
• Second, word forms are transformed into their basic form, the stem.
• During the stemming phase, e.g., "houses" would be transformed into "house".
• For the document representation, different word forms are usually not necessary.
• The importance of a word for a document can differ.
• Some words describe the content of a document better than others.
• A word's weight is determined by the frequency of its stem within the text of a document.
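The phases above can be sketched in a few lines. This is a toy illustration under stated assumptions: the stop-word list is a small hand-picked set, and `naive_stem` is a hypothetical plural stripper rather than a real stemming algorithm such as Porter's, so it handles "houses" but will mis-stem many other words.

```python
import re
from collections import Counter

# Small hand-picked stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "are", "is"}


def naive_stem(word):
    """Toy stemmer: strip a trailing plural 's' (not a real stemming algorithm)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def index_terms(text):
    """Return stem -> weight, where the weight is the stem's frequency."""
    words = re.findall(r"[a-z]+", text.lower())          # word segmentation
    content = [w for w in words if w not in STOP_WORDS]  # stop-word removal
    stems = [naive_stem(w) for w in content]             # stemming
    return Counter(stems)                                # frequency weighting


weights = index_terms("The houses on the hill are old. A house is a home.")
```

In this example "houses" and "house" collapse into the single stem "house", which therefore receives a higher weight than stems that occur only once.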
In multimedia retrieval, the context is essential for the selection of a form of query and
document representation. Different media representations may be matched against each other
or transformations may become necessary (e.g. to match terms against pictures or spoken
language utterances against documents in written text).
As information retrieval needs to deal with vague knowledge, exact processing
methods are not appropriate.
• Vague retrieval models, like the probabilistic model, are more suitable.
• Within these models, terms are given weights corresponding to their
importance for a document.
• These weights mirror different levels of relevance.
The results of current information retrieval systems are usually sorted lists of documents
in which the top results are more likely to be relevant according to the system.
• In some approaches, users can judge the documents returned to them and tell the
system which ones are relevant.
• The system then re-sorts the result set.
• Documents that contain many of the words present in the relevant documents are
ranked higher.
• This relevance feedback process is known to greatly improve retrieval performance.
• Relevance feedback is also an interesting application for machine learning.
• Based on human decisions, the optimization step can be modeled with several
approaches, e.g. with rough sets.
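The relevance feedback loop described above can be sketched as a simple re-ranking step. This is a deliberately simplified word-overlap heuristic, not a specific published feedback algorithm. The function `rerank_with_feedback` and the sample documents are hypothetical.

```python
def rerank_with_feedback(docs, ranked_ids, relevant_ids):
    """Re-sort a result list so that documents sharing many words with the
    user-marked relevant documents move towards the top."""
    feedback_terms = set()
    for doc_id in relevant_ids:
        feedback_terms.update(docs[doc_id].lower().split())

    def overlap(doc_id):
        # Number of distinct feedback words that this document contains.
        return len(feedback_terms & set(docs[doc_id].lower().split()))

    # sorted() is stable, so documents with equal overlap keep their old order.
    return sorted(ranked_ids, key=overlap, reverse=True)


docs = {
    "d1": "jaguar car dealer prices",
    "d2": "jaguar big cat habitat",
    "d3": "big cat conservation in the wild",
    "d4": "used car prices",
}
initial_ranking = ["d1", "d2", "d3", "d4"]  # e.g. results for the query "jaguar"
new_ranking = rerank_with_feedback(docs, initial_ranking, relevant_ids=["d2"])
```

Here the user marks only "d2" as relevant, and "d3" moves above "d1" even though it does not contain the original query word, because it shares more words with the relevant document.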

5.2 INFORMATION EXTRACTION

Information extraction (IE) is the automated retrieval of specific information related
to a selected topic from a body or bodies of text. Information extraction is the process of
extracting specific (pre-specified) information from textual sources. One of the simplest
examples is when your email client extracts only the data you need from a message so that you
can add it to your calendar.
Other free-flowing textual sources from which information extraction can distill
structured information are legal acts, medical records, social media interactions and streams,
online news, government documents, corporate reports, and more.
Information extraction tools make it possible to pull information from text
documents, databases, websites, or multiple sources. IE may extract information
from unstructured, semi-structured, or structured machine-readable text. Usually, however, IE
is used in natural language processing (NLP) to extract structured information from
unstructured text.
Information extraction depends on:
• Named entity recognition (NER), a sub-tool used to find targeted information to
extract.
• NER first recognizes entities as belonging to one of several categories, such as locations
(LOC), persons (PER), or organizations (ORG).
• Once the information category is recognized, an information extraction utility
extracts the named entity's related information and constructs a machine-readable
document from it, which algorithms can further process to extract meaning.
• IE finds meaning by way of other subtasks, including co-reference resolution,
relationship extraction, language and vocabulary analysis, and sometimes audio
extraction.
• Current efforts in multimedia document processing, such as automatic
annotation and content recognition and extraction from images and video, can be
seen as IE as well.
• Because of the complexity of language, high-quality IE is a challenging task for
artificial intelligence (AI) systems.

Typically, for structured information to be extracted from unstructured texts, the
following main subtasks are involved:
• Pre-processing of text – the text is prepared for processing with the help of computational
linguistics tools such as tokenization, sentence splitting, morphological analysis, etc.
• Finding and classifying concepts – mentions of people, things, locations,
events, and other pre-specified types of concepts are detected and classified.
• Connecting the concepts – relationships between the extracted concepts are identified.
• Unifying – the extracted data is presented in a standard form.
• Getting rid of the noise – duplicate data is eliminated.
• Enriching your knowledge base – the extracted knowledge is ingested into your
database for further use.
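The unifying and noise-removal subtasks can be sketched as follows. The record layout, a (name, category) pair, and the normalization rules (collapse whitespace, title-case names, upper-case category labels) are assumptions made for this illustration and are not prescribed by the text.

```python
def unify(record):
    """Unifying: bring an extracted (name, category) pair into a standard form."""
    name, category = record
    return (" ".join(name.split()).title(), category.upper())


def remove_duplicates(records):
    """Getting rid of the noise: drop records that are identical after
    unification, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for record in map(unify, records):
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


raw = [
    ("barack  obama", "per"),
    ("Barack Obama", "PER"),
    ("new york", "loc"),
]
knowledge_base = remove_duplicates(raw)  # ready to be stored for further use
```

Unifying before deduplicating matters: the first two raw records differ only in spacing and case, so they are recognized as duplicates only once both are in the standard form.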
5.2.1 Information Extraction Architecture
The figure below shows the architecture for a simple information extraction system.
First, the raw text of the document is split into sentences using a sentence segmenter, and
each sentence is further subdivided into words using a tokenizer. Next, each sentence is
tagged with part-of-speech tags, which will prove very helpful in the next step, named entity
detection. In this step, we search for mentions of potentially interesting entities in each
sentence. Finally, we use relation detection to search for likely relations between different
entities in the text.

Figure: Simple Pipeline Architecture for an Information Extraction System.

This system takes the raw text of a document as its input, and generates a list of (entity,
relation, entity) tuples as its output.
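The pipeline can be approximated with a small self-contained sketch. It makes several simplifying assumptions: sentences are split on end punctuation, entity detection is a lookup in a hypothetical hand-made gazetteer (`GAZETTEER`) instead of a trained tagger, part-of-speech tagging is skipped entirely, and the only relation detected is an organization followed by the word "in" and a location. All function names are illustrative.

```python
import re

# Hypothetical gazetteer standing in for a trained named entity recognizer.
GAZETTEER = {
    "Acme Corp": "ORG",
    "Globex": "ORG",
    "Paris": "LOC",
    "Berlin": "LOC",
}


def split_sentences(text):
    """Sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Tokenization: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def find_entities(sentence):
    """Entity detection: (start offset, name, label) triples in text order."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", sentence):
            found.append((match.start(), name, label))
    return sorted(found)


def find_relations(sentence):
    """Relation detection: ORG ... 'in' ... LOC between adjacent mentions."""
    entities = find_entities(sentence)
    triples = []
    for (s1, n1, l1), (s2, n2, l2) in zip(entities, entities[1:]):
        between = sentence[s1 + len(n1):s2]
        if l1 == "ORG" and l2 == "LOC" and re.search(r"\bin\b", between):
            triples.append((n1, "in", n2))
    return triples


def extract(text):
    """Run the pipeline and return a list of (entity, relation, entity) tuples."""
    triples = []
    for sentence in split_sentences(text):
        triples.extend(find_relations(sentence))
    return triples


text = "Acme Corp opened an office in Paris. Globex is based in Berlin!"
tuples = extract(text)
tokens = tokenize("Globex is based in Berlin!")
```

A production system would typically replace the gazetteer lookup with a statistical tagger and chunker, but the flow from raw text to (entity, relation, entity) tuples follows the same stages.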

5.2.2 Applications of IE

• Enterprise
  - News tracking
  - Customer care
  - Data cleaning
• Personal information management
• Scientific applications
• Web-oriented applications
  - Citation databases
  - Opinion databases
  - Community websites
  - Comparison shopping
  - Ad placement on webpages
  - Structured web searches
