Q1. Write a short note on the final level of binding.

Answer :

The final level of binding comes as the search is applied to a specific database. This binding is based on the statistics of the processing tokens in the database and the semantics used in the database. This is especially true in statistical and concept indexing systems. Some of the statistics used in weighting are based upon the current indexing systems. Some examples are document frequency and total frequency, which are used as the basis for indexing and are determined by applying a statistical algorithm against a representative sample of the database. Natural language indexing techniques tend to use the most corpora-independent algorithms.
Q2. Write about similarity measures and ranking.

Answer :

Searching in general is concerned with calculating the similarity between a user's search statement and the items in the database. Restricting the similarity measure to passages gains significant precision with minimal impact on recall. Once items are identified as possibly relevant to the user's query, it is best to present the most likely relevant items first; this is ranking. A rank is a scalar number that represents the potential degree of relevance of an item to the query.
Q3. Define HMM.

Answer :

We present a new method for information retrieval using hidden Markov models (HMM). We develop a general framework for incorporating multiple word generation mechanisms.

Warning : Xerox/Photocopying of this book is a CRIMINAL Act. Anyone found guilty is LIABLE to face LEGAL proceedings.
Q4. Write a short note on ranking algorithms.

Answer :

With the growing number of web pages and users on the web, the number of queries submitted to the search engines is also growing rapidly day by day. Therefore, the search engines need to be more efficient in their processing and output. Web mining techniques are employed by the search engines to extract relevant documents from the web database and provide the necessary and required information to the users. The search engines became very successful largely because of the PageRank algorithm. PageRank algorithms are used by the search engines to present the search results by considering the relevance, importance and content score, and web mining techniques order them according to the user's interest. Some ranking algorithms depend only on the link structure of the documents, i.e., their popularity scores, whereas others look at the actual content of the documents.
Q5. Write a short note on the Distance Rank algorithm.

Answer :

In this ranking algorithm, the rank is computed on the basis of the shortest logarithmic distance between two pages, and pages are ranked accordingly, so that a page with a smaller distance is assigned a higher rank. The advantage of this algorithm is that, being less sensitive, it can find pages faster, with higher quality and more quickly, by using a distance-based solution as compared to other algorithms. Although the algorithm provides quality output, it has certain limitations. The limitation of this algorithm is that the crawler must perform a large calculation to recompute the distance vector if a new page is inserted between two pages. The Distance Rank algorithm adopts the PageRank property, i.e., the rank of each page is computed as the weighted sum of the ranks of all pages with incoming links to that particular page. Thus, a page has a high rank value if it has more incoming links.
Q6. Write a note on probabilistic relevance feedback.

Answer :

Probabilistic relevance feedback is based on the user's feedback on document relevance. After a search, the user can give relevance feedback marking certain documents as relevant or non-relevant. We then indicate this with a Boolean indicator variable 'R' for document relevance. The probability that a term t appears in a document, conditioned on relevance, is estimated as

P(x_t = 1 | R = 1) = |VR_t| / |VR|
P(x_t = 1 | R = 0) = (df_t - |VR_t|) / (N - |VR|)

where N is the number of documents, df_t of which contain the term t; VR is the known relevant document set and VR_t is its subset of documents containing t. The assumption that the set of relevant documents is a small subset of the set of all documents makes the estimation of P(x_t = 1 | R = 0) possible.
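The two estimates above can be sketched directly; the counts fed in at the bottom are hypothetical, purely for illustration.

```python
# Probabilistic relevance-feedback estimates from judged documents.
def feedback_estimates(N, df_t, VR, VR_t):
    p_t = VR_t / VR                 # P(x_t = 1 | R = 1)
    u_t = (df_t - VR_t) / (N - VR)  # P(x_t = 1 | R = 0)
    return p_t, u_t

# N = 1000 docs, term t occurs in 50; 10 judged relevant, 8 of them contain t
p_t, u_t = feedback_estimates(N=1000, df_t=50, VR=10, VR_t=8)
```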
Q7. Write about relevance feedback on the web.

Answer :

Relevance feedback on the web is not provided by users very often, because people use search interfaces to complete their search in a single interaction, to save time. The lack of uptake also probably reflects that RF (relevance feedback) is hard to explain to average users, and that RF is mainly recall-enhancing, while web users rarely care about achieving sufficient recall. Of all web search users, only about 4% of query sessions use an RF option, and about 70% of users only look at the first page of results and do not pursue things any further.

Nowadays a thread of work is the use of clickstream data to provide indirect RF: when a user clicks on a result link, it automatically redirects to another page or window. This is the most commonly used thread for indirect RF.
Q8. Write about Indirect Relevance Feedback (RF).

Answer :

Indirect RF is also called implicit feedback, in contrast with the explicit feedback approach. Implicit feedback is easy to collect from users, whereas it is difficult to obtain explicit relevance feedback in a web search.

This approach works by globally collecting click rates on pages from the users' clicks. When users looking for relevant information click on particular links more frequently, those links are assumed to indicate that the page was likely relevant to the query. This approach was introduced by the DirectHit search engine. An example of this indirect relevance feedback is clickstream analysis, which is a form of the general area of clickstream mining.
Q9. Explain in brief about relevance feedback.

Answer :

Relevance feedback is the most preferred utility that enables the user to retrieve the desired information in several passes. At every pass, the user enters a query and obtains a result set containing the documents that match the query. The user then selects the documents that are relevant to the generated query. Depending upon this selection, the initial query is upgraded by adding additional terms for better results. Moreover, the process of upgrading the query can also include assigning new weights to the existing terms depending upon the feedback of the user. In this way, at the end of every pass the query gets revised, which ultimately leads to better search results.
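One common way to realize the query upgrading described above is the Rocchio formula; this is an assumption for illustration, as the text does not name a specific method, and the coefficient values below are conventional defaults rather than values from the text.

```python
# Rocchio-style query upgrade: move the query vector toward the centroid
# of selected relevant documents and away from non-relevant ones.
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    q_new = [alpha * w for w in query]
    for docs, coef in ((relevant, beta), (nonrelevant, -gamma)):
        if docs:
            for k in range(len(query)):
                centroid = sum(d[k] for d in docs) / len(docs)
                q_new[k] += coef * centroid
    return [max(w, 0.0) for w in q_new]   # clip negative term weights

q2 = rocchio([1, 0, 1], relevant=[[1, 1, 0]], nonrelevant=[[0, 1, 0]])
```

After the pass, the second term has gained weight from the relevant document, realizing the reweighting the text describes.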
Q10. Write a note on searching the Internet and hypertext.

Answer :

The Internet has multiple different mechanisms that are the basis for searching items. The primary techniques are associated with servers on the Internet that create indexes of items on the Internet and allow search of them. Some of the most commonly used nodes are YAHOO, Alta Vista and Lycos. In all of these systems there are active processes that visit a large number of Internet sites and retrieve textual data; they differ in the data they retrieve and their general philosophy on user access. Lycos (http://www.lycos.com) and Alta Vista (http://www.altavista.digital.com) automatically go out to other Internet sites and return the text at the sites for automatic indexing.
Q11. Write a short note on cognition and perception.

Answer :

The user-machine interface has primarily focused on a paradigm of a typewriter.

As computer displays became ubiquitous, man-machine interfaces focused on treating the display as an extension of paper, with the focus on consistency of operations.

The advent of WIMP interfaces and the evolution of the interface focused on how to represent to the user what is taking place in the computer environment.

Extending the HCI to improve the information flow reduces wasted user overhead in locating needed information.

Although the major focus is on enhanced visualization of information, other senses are also being looked at for future interfaces.

The audio sense has always been part of simple alerts in computers. The sounds are now being replaced by speech in both input and output interfaces.

The tactile (touch) sense is being addressed in experiments using Virtual Reality (VR).
Q12. List the functions of the Fast Data Finder.

Answer :

1. Proximity searching
2. Character alternatives
3. Term counting
4. Fuzzy term matching
Essay Questions with Answers
4.1 SEARCH STATEMENTS AND BINDING

Q1. Explain search statements and binding.

Answer :

Search statements are the statements of an information need generated by users to specify the concepts they are trying to locate in items.

The search statement uses traditional Boolean logic and/or natural language. In generating the search statement, the user may have the ability to weight (assign an importance to) different concepts in the statement.

The first level of binding is to the vocabulary and past experiences of the user. The next level of binding comes when the search statement is parsed for use by a specific search system. The final level of binding comes as the search is applied to a specific database.
Input : "Find me information on the impact of the oil spills in Alaska on the price of oil"
Binding : User search statement using vocabulary of user.

Input : impact, oil (petroleum), spills (accidents), Alaska, price (cost, value)
Binding : Statistical system binding extracts processing tokens.

Input : impact (.308), oil (.606), petroleum (.65), spills (.12), accidents (.23), Alaska (.45), price (.16), cost (.025), value (.10)
Binding : Weights assigned to search terms based upon inverse document frequency algorithm and database.

Figure : Examples of Query Binding
The length of search statements directly affects the ability of information retrieval systems to find relevant items. The longer the search query, the easier it is for the system to find items. Profiles in Selective Dissemination of Information systems are usually very long, typically 75 to 100 terms.
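The last binding step above can be sketched roughly as deriving term weights from document frequencies sampled from the specific database; the collection size and token counts below are invented purely for illustration.

```python
# Binding search terms to a database via inverse document frequency:
# tokens that are rare in this particular database get higher weight.
import math

def idf_weight(df, num_docs):
    return math.log10(num_docs / df)

dfs = {"impact": 400, "oil": 120, "alaska": 25}   # hypothetical counts
weights = {term: idf_weight(df, 1000) for term, df in dfs.items()}
```

The same query bound to a different database would produce different weights, which is exactly why this binding happens last.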
4.2 SIMILARITY MEASURES AND RANKING

Q2. Explain in detail about similarity measures.

Answer :

A variety of different similarity measures can be used to calculate the similarity between the item and the search statement. A characteristic of a similarity formula is that the results of the formula increase as the items become more similar. The value is zero if the items are totally dissimilar. A formula to determine the similarity between two documents for clustering purposes is:

SIM(Item_i, Item_j) = SUM_k (Term_ik)(Term_jk)

Weighting the terms gives:

SIM(DOC_i, QUERY_j) = SUM_k ((c + IDF_k) * f_ik), where f_ik = K + (1 - K) * TF_ik / maxfreq_i
The similarity formula used by Salton in the SMART system determines the "weight" an item has with respect to the search statement. The cosine formula is used to measure the cosine of the angle between the vector for the item and the vector for the query:

SIM(DOC_i, QUERY_j) = SUM_k (DOC_ik * QTERM_jk) / sqrt( SUM_k (DOC_ik)^2 * SUM_k (QTERM_jk)^2 )
The Jaccard formula:

Unlike the cosine measure, the denominator depends upon the number of terms in common. As the common elements increase, the similarity value quickly decreases, staying in the range -1 to +1:

SIM(DOC_i, QUERY_j) = SUM_k (DOC_ik * QTERM_jk) / ( SUM_k (DOC_ik)^2 + SUM_k (QTERM_jk)^2 - SUM_k (DOC_ik * QTERM_jk) )
The Dice formula:

The Dice measure simplifies the denominator from the Jaccard measure and introduces a factor of 2 in the numerator. The normalization in the Dice formula is also invariant to the number of terms in common:

SIM(DOC_i, QUERY_j) = 2 * SUM_k (DOC_ik * QTERM_jk) / ( SUM_k (DOC_ik)^2 + SUM_k (QTERM_jk)^2 )
Use of a similarity algorithm by itself would return the complete database as search results, since many of the items have a similarity close or equal to zero. Thresholds are therefore usually associated with the search process. The threshold defines the items in the resultant Hit file from the query. Thresholds are either a value that the similarity measure must equal or exceed, or a number that limits the number of items in the Hit file. Because of the normalizing denominator, results vary with the commonality of terms.
QUERY = (2, 2, 0, 0, 4)
DOC1 = (0, 2, 6, 4, 0)
DOC2 = (2, 6, 0, 0, 4)

          Cosine    Jaccard    Dice
DOC1      0.11      0.05       0.10
DOC2      0.87      0.67       0.80

Using the cosine values: if the threshold is 0.4, only DOC2 is selected; if the threshold is 0.1, all are selected.
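The three measures can be recomputed for these vectors with a short sketch:

```python
# Cosine, Jaccard and Dice similarity between weighted term vectors.
import math

def dot(d, q):
    return sum(x * y for x, y in zip(d, q))

def cosine(d, q):
    return dot(d, q) / math.sqrt(dot(d, d) * dot(q, q))

def jaccard(d, q):
    return dot(d, q) / (dot(d, d) + dot(q, q) - dot(d, q))

def dice(d, q):
    return 2 * dot(d, q) / (dot(d, d) + dot(q, q))

QUERY = [2, 2, 0, 0, 4]
DOC1 = [0, 2, 6, 4, 0]
DOC2 = [2, 6, 0, 0, 4]
```

All three measures agree that DOC2 is far more similar to the query than DOC1, so any reasonable threshold separates them the same way.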
Vector : (American, geography, lake, Mexico, painter, oil, reserve, subject)
DOC1 : "Geography of Mexico suggests oil reserves are available"; vector (0, 1, 0, 2, 0, 3, 1, 0)
DOC2 : "American geography has lakes available everywhere"; vector (1, 3, 2, 0, 0, 0, 0, 0)
DOC3 : "Painters suggest Mexico lakes as subjects"; vector (0, 0, 1, 3, 3, 0, 0, 2)
QUERY : "Oil reserves in Mexico"; vector (0, 0, 0, 1, 0, 1, 1, 0)

SIM(Q, DOC1) = 6, SIM(Q, DOC2) = 0, SIM(Q, DOC3) = 3

Figure : Query Threshold Process
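The similarity values above can be verified with the simple sum-of-products measure SIM(Item_i, Item_j) = SUM_k Term_ik * Term_jk:

```python
# Sum-of-products similarity over the eight-term vocabulary
# (American, geography, lake, Mexico, painter, oil, reserve, subject).
def sim(d, q):
    return sum(x * y for x, y in zip(d, q))

DOC1 = [0, 1, 0, 2, 0, 3, 1, 0]
DOC2 = [1, 3, 2, 0, 0, 0, 0, 0]
DOC3 = [0, 0, 1, 3, 3, 0, 0, 2]
QUERY = [0, 0, 0, 1, 0, 1, 1, 0]
```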
The items are stored in clusters (top-down): A, B, C are centroids and K to N, P, Q, R are items.

Figure : Item Cluster Hierarchy

The query is compared to the centroids "A" and "B". If the result of the similarity measure is above the threshold, the query is then applied to that node's children.

The filled circle represents the query and the filled boxes represent the centroids for the three clusters.

Figure : Centroid Comparisons
Q3. Explain about HMMs.

Answer :

Hidden Markov models have been applied successfully over the last two decades in a wide variety of speech and language related recognition problems, including speech recognition, named entity finding, optical character recognition, and topic identification. In the present work, we describe an application of this technology to the problem of ad hoc information retrieval.

In all HMM applications, the observed data is modeled as being the output produced by passing some unknown key through a noisy channel. In the ad hoc retrieval problem, we take the observed data to be the query Q, and the unknown key to be a desired relevant document D. The noisy channel is the mind of the user, who is imagined to have some notion of which documents he wants, and who transforms that notion into the text of the query. Thus, we compute for each document the probability that D was the relevant document in the user's mind, given that Q was the query produced, i.e., P(D is R | Q), and rank the documents based on this measure.

Using probability models for information retrieval has a history almost four decades long, beginning with the work of Maron and Kuhns, and first seeing real application in the "standard probability model" pioneered by Robertson and Sparck-Jones. More recently, however, the introduction of ad hoc constants and non-linear smoothing functions has improved performance steadily at the cost of drifting further and further from the probabilistic framework.
Q4. Explain the concept of the probability model with the help of an HMM.

Answer :

Given a user-generated query and a set of documents, we wish to rank the documents according to the probability that D is relevant, conditioned on the fact that the user produced Q, i.e., P(D is R | Q). Applying Bayes' rule, we decompose this into quantities that may be more easily estimated:

P(D is R | Q) = P(Q | D is R) * P(D is R) / P(Q)          ... (1)

where P(Q | D is R) is the probability of the query being posed, under the hypothesis that the document is relevant; P(D is R) is the prior probability that document D is relevant; and P(Q) is the prior probability of query Q being posed.

Since P(Q) will be identical for all documents D, we can safely disregard it for the purposes of sorting documents. We propose to model the generation of a query by a user as a discrete hidden Markov process dependent on the document the user has in mind. A discrete hidden Markov model is defined by a set of output symbols, a set of states, a set of probabilities for transitions between the states, and a probability distribution on output symbols for each state. An observed sampling of the process is produced by starting from some initial state, transitioning from it to another state, sampling from the output distribution at that state, and then repeating these latter two steps. The transitioning and the sampling are non-deterministic, and are governed by the probabilities that define the process. The term "hidden" refers to the fact that an observer sees only the output symbols, but does not know the underlying sequence of states that generated them.

In the present work we take the union of all words appearing in the corpus as the set of output symbols, and posit a separate state for each of several mechanisms of query word generation. There is a separate process for each individual document, which generates the words of the query by traversing the states and sampling a word according to the output distribution of each state. Knowing the query that was produced, we can easily compute the probability of its being produced by each of the documents in the corpus. This is the P(Q | D is R) term that appears in Equation 1.
While one can employ as many HMM states as one can imagine, we will restrict our discussion here to the simple but powerful two-state HMM shown in Figure 1. The first state, labelled "Document", represents choosing a word directly from the document.

Figure 1 : A simple two-state HMM
Figure 2 : An expanded multi-state HMM

The second state, labelled "General English", represents choosing a word that is unrelated to the document, but that occurs commonly in natural language queries. More generally, the model can be easily extended to accommodate a broad variety of word generation mechanisms involving synonyms, topic lexicons, or proper names (see Figure 2).
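A minimal sketch of scoring with this two-state model: each query word is emitted either by the Document state or by the General English state, and the two per-word emission probabilities are mixed with transition weights a0 and a1. The value a1 = 0.3 and the simple count-based estimates for the two output distributions are assumptions for illustration; the text does not fix them.

```python
# Two-state HMM query scoring: score is proportional to P(Q | D is R).
def hmm_score(query_words, doc_words, corpus_words, a1=0.3):
    a0 = 1.0 - a1
    score = 1.0
    for q in query_words:
        p_doc = doc_words.count(q) / len(doc_words)       # P(q | Document)
        p_ge = corpus_words.count(q) / len(corpus_words)  # P(q | General English)
        score *= a0 * p_ge + a1 * p_doc                   # mixture emission
    return score

doc = ["oil", "reserves", "in", "mexico"]
corpus = doc + ["american", "geography", "has", "lakes"]
```

The General English state smooths the score, so a document is not assigned zero probability merely because one query word is absent from it.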
Q5. Explain about ranking algorithms.

Answer :

A by-product of the use of similarity measures for selecting Hit items is a value that can be used in ranking the output. Ranking the output implies ordering the output from the items most likely to satisfy the query to the least likely relevant items. The original Boolean systems returned items ordered by date of entry into the system, rather than by likelihood of relevance to the user's search statement.

With the inclusion of statistical similarity techniques in commercial systems and the large number of hits that originate from searching diverse corpora, such as the Internet, ranking has become a common feature of modern systems. In most of the commercial systems, heuristic rules are used to assist in the ranking of items. Generally, systems do not want to use factors that require knowledge across the corpus as a basis for their similarity or ranking functions, because it is too difficult to maintain current values as the database changes, and the added complexity has not been shown to significantly improve the overall weighting process. A good example of how a commercial product integrates efficiency with theoretical concepts is the RetrievalWare system's approach to queries and ranking. RetrievalWare first uses indexes to identify potentially relevant items. It then applies coarse grain and fine grain ranking. The coarse grain ranking is based on the presence of query terms within items. In the fine grain ranking, the exact rank of the item is calculated. The coarse grain ranking is a weighted formula that can be adjusted based on completeness, contextual evidence or variety, and semantic distance. Completeness is the proportion of the number of query terms found in the item versus the number in the query. It sets an upper limit on the rank value for the item. If weights are assigned to query terms, the weights are factored into the value. Contextual evidence occurs when related words from the semantic network are also in the item. Thus, if the user has indicated that the query term "charge" has the context of "paying for an object", then finding words such as "buy", "purchase", "debt" suggests that the term "charge" in the item has the meaning the user desires and that more weight should be placed in ranking the item. Synonyms add additional weight; antonyms decrease weight. The coarse grain process provides an initial rank to the item based upon the existence of words within the item. Since physical proximity is not considered in coarse grain ranking, the ranking value can be easily calculated.
Q6. Briefly explain about the PageRank algorithm.

Answer :

The PageRank algorithm is used by the famous search engine Google. This algorithm is the most commonly used algorithm for ranking the various pages. The working of the PageRank algorithm depends upon the link structure of the web pages. The PageRank algorithm is based on the concept that if a page contains important links towards it, then its links towards other pages are also to be considered important. PageRank considers the back links in deciding the rank score. If the sum of the ranks of all the back links of a page is large, then the page is given a large rank. Thus PageRank weights back links rather than simply counting the number of pages that link to a page: if a backlink comes from an important page, then that backlink is given a higher weighting than backlinks coming from non-important pages. In a simple way, a link from one page to another page may be considered as a vote. However, not only the number of votes a page receives is considered important, but also the importance or the relevance of the pages that cast these votes. We assume page A has pages T_1, ..., T_n which point to it, i.e., are its back links. The variable d is a damping factor, whose value can be set between 0 and 1; we usually set the value of d to 0.85. PR(T_i) is the rank of a page linking to A and C(T_i) is the number of outgoing links from page T_i. The PageRank of a page A is given by (1):

PR(A) = (1 - d) + d (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))          ... (1)

The damping factor is used to stop the other pages having too much influence; this total vote is damped down by multiplying it by 0.85. One important thing to note is that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one, and the damping factor d is the probability at each page that the random surfer will get bored and request another random page.
Let us take an example of the hyperlink structure of four pages A, B, C and D as shown in the figure below. The PageRank for pages A, B, C and D can be calculated by using (1).

Figure : Hyperlink structure of four pages

Another simplified version of PageRank is given by

PR(N) = SUM_M PR(M)/L(M)          ... (2)

where the sum runs over the pages M that link to N, and L(M) is the number of outgoing links of page M.
Let us assume the initial PageRank to be 1 and perform the calculation, with the value of the damping factor d set to 0.85.

PR(A) = (1 - d) + d (PR(B)/C(B) + PR(C)/C(C) + PR(D)/C(D))
      = (1 - 0.85) + 0.85 (1/3 + 1/3 + 1/1)
      = 1.5666667          ... (3)

PR(B) = (1 - d) + d (PR(A)/C(A) + PR(C)/C(C))
      = (1 - 0.85) + 0.85 (1.5666667/2 + 1/3)
      = 1.0991667          ... (4)

PR(C) = (1 - d) + d (PR(A)/C(A) + PR(B)/C(B))
      = (1 - 0.85) + 0.85 (1.5666667/2 + 1.0991667/3)
      = 1.127264          ... (5)

PR(D) = (1 - d) + d (PR(B)/C(B) + PR(C)/C(C))
      = (1 - 0.85) + 0.85 (1.0991667/3 + 1.127264/3)
      = 0.7808221          ... (6)

For the second iteration we take the above PageRank values from (3), (4), (5) and (6). The second iteration PageRank values are as follows:

PR(A) = 0.15 + 0.85 ((1.0991667/3) + (1.127264/3) + (0.7808221/1))
      = 1.4445208          ... (7)

PR(B) = 0.15 + 0.85 ((1.4445208/2) + (1.127264/3))
      = 1.0833128          ... (8)

PR(C) = 0.15 + 0.85 ((1.4445208/2) + (1.0833128/3))
      = 1.07086          ... (9)

PR(D) = 0.15 + 0.85 ((1.0833128/3) + (1.07086/3))
      = 0.760349          ... (10)

As the iterations proceed the values converge, and the average PageRank of all web pages is 1. Some of the PageRank values are shown in Table 1. The table with the graph is shown in the simulation results section.
Table 1 : Iterative calculation for PageRank

Iteration        A            B            C            D
1                1            1            1            1
2                1.5666667    1.0991667    1.127264     0.7808221
3                1.4445208    1.0833128    1.07086      0.760349
17               1.3141432    0.9886763    0.9886358    0.7102384
18               1.313941     0.9885384    0.98851085   0.71016395
19               1.3138034    0.98844457   0.98842573   0.7101132
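The iteration above can be reproduced with a short sketch. It updates the ranks in place in the order A, B, C, D, so each newly computed value is used immediately within the same pass, matching the hand calculation.

```python
# One pass of the PageRank iteration for the four-page example, d = 0.85.
def pagerank_pass(links, pr, d=0.85):
    for page in pr:
        backlinks = [m for m in links if page in links[m]]
        pr[page] = (1 - d) + d * sum(pr[m] / len(links[m]) for m in backlinks)
    return pr

# A -> B, C;  B -> A, C, D;  C -> A, B, D;  D -> A
links = {"A": ["B", "C"], "B": ["A", "C", "D"],
         "C": ["A", "B", "D"], "D": ["A"]}
pr = pagerank_pass(links, {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0})
# first pass: A = 1.5666667, B = 1.0991667, C = 1.127264, D = 0.7808221
```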
One thing to be noted here is that the rank of a page is divided evenly among its out-links to contribute to the ranks of the pages it points to. Starting from the original PageRank value of 1, the computation iterates until all pages repeat the same PageRank values individually. The PageRank can be calculated using a simple iterative method and corresponds to the principal eigenvector of the normalized link matrix of the web. The PageRank algorithm needs a few hours to calculate the ranks of millions of pages and provides efficient output for millions of pages.
Q7. Explain about the Weighted PageRank algorithm.

Answer :

The Weighted PageRank algorithm (WPR), proposed by Wenpu Xing and Ali Ghorbani, is a modification of the original PageRank algorithm. WPR decides the rank score based on the popularity of the pages by taking into consideration the importance of both the inlinks and outlinks of the pages. This algorithm assigns a higher rank value to the more popular pages and does not divide the rank of a page equally among its out-link pages. Every out-link page is given a rank value based on its popularity, which is decided by observing its number of inlinks and outlinks. As suggested, the performance of WPR is to be tested on different websites, and future work includes calculating the rank score by utilizing more than one level of reference pages and classifying the web pages by their relevance to the user. The importance is assigned in terms of weight values to the incoming and outgoing links, denoted as W^in_(m,n) and W^out_(m,n) respectively. W^in_(m,n), as shown in equation (11), is the weight of link (m, n), computed from the number of incoming links of page n and the number of incoming links of all reference pages of page m:

W^in_(m,n) = I_n / SUM_{p in R(m)} I_p          ... (11)

W^out_(m,n) = O_n / SUM_{p in R(m)} O_p          ... (12)

where I_n and I_p denote the number of incoming links of page n and page p, and R(m) represents the reference-pages list of page m (the pages m links to). Similarly, W^out_(m,n), as shown in equation (12), is the weight of link (m, n) computed from the number of outgoing links of page n and the number of outgoing links of all the reference pages of m, where O_n and O_p are the numbers of outgoing links of pages n and p. The formula proposed for WPR, a modification of the PageRank formula, is shown in equation (13):

WPR(n) = (1 - d) + d SUM_{m in B(n)} WPR(m) W^in_(m,n) W^out_(m,n)          ... (13)

where B(n) is the set of pages that point to n.
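Equations (11), (12) and (13) can be sketched as follows. Note this follows the formulas literally, summing the in-link and out-link counts over all reference pages of m, so its weights need not match every denominator used in the hand-worked example below.

```python
# WPR weights and one application of the WPR update to every page.
# `links` maps each page to the list of pages it links to.
def wpr_weights(links, m, n):
    # number of incoming links of every page
    inlinks = {p: sum(p in outs for outs in links.values()) for p in links}
    ref = links[m]                                           # reference pages of m
    w_in = inlinks[n] / sum(inlinks[p] for p in ref)         # equation (11)
    w_out = len(links[n]) / sum(len(links[p]) for p in ref)  # equation (12)
    return w_in, w_out

def wpr_step(links, wpr, d=0.85):
    new = {}
    for n in links:
        total = 0.0
        for m in links:              # pages m with a link to n
            if n in links[m]:
                w_in, w_out = wpr_weights(links, m, n)
                total += wpr[m] * w_in * w_out
        new[n] = (1 - d) + d * total # equation (13)
    return new

links = {"A": ["B", "C"], "B": ["A", "C", "D"],
         "C": ["A", "B", "D"], "D": ["A"]}
wpr1 = wpr_step(links, {p: 1.0 for p in links})
```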
The WPR calculation is performed for the same hyperlink structure of four pages shown earlier. The WPR equation for page A is as follows:

WPR(A) = (1 - d) + d (WPR(B) W^in_(B,A) W^out_(B,A) + WPR(C) W^in_(C,A) W^out_(C,A) + WPR(D) W^in_(D,A) W^out_(D,A))          ... (14)
To get the value of WPR(A), we first calculate the incoming-link and outgoing-link weights as below:

W^in_(B,A) = I_A / (I_A + I_C)
           = 3/(3 + 2)
           = 3/5          ... (15)

W^out_(B,A) = O_A / (O_A + O_C + O_D)
            = 2/(2 + 3 + 1)
            = 2/6 = 1/3          ... (16)

W^in_(C,A) = I_A / (I_A + I_B)
           = 3/(3 + 2)
           = 3/5          ... (17)

W^out_(C,A) = O_A / (O_A + O_B + O_D)
            = 2/(2 + 3 + 1)
            = 2/6 = 1/3          ... (18)

W^in_(D,A) = I_A / (I_B + I_C)
           = 3/(2 + 2)
           = 3/4          ... (19)

W^out_(D,A) = O_A / O_A
            = 2/2
            = 1          ... (20)
Now these inlink and outlink weights, equations (15) to (20), are put into equation (14) to calculate the weighted ranks of the nodes A, B, C and D. The WPR equations for the remaining pages are:

WPR(B) = (1 - d) + d (WPR(A) W^in_(A,B) W^out_(A,B) + WPR(C) W^in_(C,B) W^out_(C,B))          ... (21)

WPR(C) = (1 - d) + d (WPR(A) W^in_(A,C) W^out_(A,C) + WPR(B) W^in_(B,C) W^out_(B,C))          ... (22)

WPR(D) = (1 - d) + d (WPR(B) W^in_(B,D) W^out_(B,D) + WPR(C) W^in_(C,D) W^out_(C,D))          ... (23)

For the WPR(A) calculation the value of d is set to 0.85 and the initial values of WPR(B), WPR(C) and WPR(D) are taken as 1, so the calculation for the first iteration is as follows:

WPR(A) = (1 - 0.85) + 0.85 (1 * 3/5 * 1/3 + 1 * 3/5 * 1/3 + 1 * 3/4 * 1)
       = 1.1275          ... (24)
W^in_(A,B) = I_B / (I_B + I_C + I_D)
           = 2/(2 + 2 + 2)
           = 2/6 = 1/3          ... (25)

W^out_(A,B) = O_B / (O_B + O_C)
            = 3/(3 + 3)
            = 3/6 = 1/2          ... (26)

W^in_(C,B) = I_B / (I_A + I_B)
           = 2/(3 + 2)
           = 2/5          ... (27)

W^out_(C,B) = O_B / (O_A + O_B + O_D)
            = 3/(2 + 3 + 1)
            = 3/6 = 1/2          ... (28)

Now for the calculation of WPR(B), these equations (25) to (28) are put into equation (21), with the initial value of WPR(C) set to 1:

WPR(B) = (1 - 0.85) + 0.85 (1.127 * 1/3 * 1/2 + 1 * 2/5 * 1/2)
       = (0.15) + 0.85 (1.127 * 0.33 * 0.50 + 1 * 0.40 * 0.50)
       = 0.4989          ... (29)
W^in_(A,C) = I_C / (I_B + I_C + I_D)
           = 2/(2 + 2 + 2)
           = 2/6 = 1/3          ... (30)

W^out_(A,C) = O_C / (O_B + O_C)
            = 3/(3 + 3)
            = 3/6 = 1/2          ... (31)

W^in_(B,C) = I_C / (I_A + I_C)
           = 2/(3 + 2)
           = 2/5          ... (32)

W^out_(B,C) = O_C / (O_A + O_C + O_D)
            = 3/(2 + 3 + 1)
            = 3/6 = 1/2          ... (33)

By substituting the values of equations (24), (29), (30), (31), (32) and (33) into equation (22), we get the WPR of page C by taking d as 0.85:

WPR(C) = (1 - 0.85) + 0.85 ((1.127 * 1/3 * 1/2) + (0.499 * 2/5 * 1/2))
       = (0.15) + 0.85 ((1.127 * 0.33 * 0.50) + (0.499 * 0.40 * 0.50))
       = 0.392          ... (34)
W^in_(B,D) = I_D / (I_B + I_D)
           = 2/(2 + 2)
           = 2/4 = 1/2          ... (35)

W^out_(B,D) = O_A / O_A
            = 2/2
            = 1          ... (36)

W^in_(C,D) = I_D / (I_D + I_A)
           = 2/(2 + 3)
           = 2/5          ... (37)

W^out_(C,D) = O_A / (O_A + O_C + O_D)
            = 2/(2 + 3 + 1)
            = 2/6 = 1/3          ... (38)

Again, by substituting the values of equations (29), (34), (35), (36), (37) and (38) into equation (23), we get WPR(D) by taking d as 0.85:

WPR(D) = (1 - 0.85) + 0.85 ((0.499 * 1/2 * 1) + (0.392 * 2/5 * 1/3))
       = (0.15) + 0.85 ((0.499 * 0.50 * 1) + (0.392 * 0.40 * 0.33))
       = 0.406          ... (39)
‘The values of WPR(A), WPR(B), WPR(C), and WPR(D) are shown in equations (24), (29), (34) and (33)
respectively. The relation between these are WPR(A) > WPR(B) > WPR(D) > WPR(C).
This result shows that the weighted PageRank order is different from the PageRank order. For the same example above, the iterative computation of the weighted PageRank algorithm is carried out. The resulting weighted PageRank values are shown in Table 2. The table values, with the chart, are shown in the simulation results section.
So we can easily differentiate WPR from PageRank. The resultant pages of a query can be categorized into four categories based on their relevancy to the given query. They are:
Very Relevant Pages (VR): The very relevant pages contain very important information about the given query.
Relevant Pages (R): The relevant pages contain relevant, but not very important, information about the given query.
Weak Relevant Pages (WR): The weak relevant pages do not have the relevant information but may have the query keywords.
Irrelevant Pages (IR): The irrelevant pages have neither relevant information nor query keywords.
Both the PageRank and WPR algorithms provide pages in sorted order, according to their ranks, to users for a given query. So the order of relevant pages and their numbering are very important for users in the resultant list.
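The WPR iteration described above can be sketched in code. The link structure of the book's four-page example is not fully recoverable from the scan, so the graph below is a hypothetical three-page example; `weighted_pagerank` and the graph encoding are illustrative names, not part of the text.

```python
def weighted_pagerank(links, d=0.85, iterations=50):
    """Iterate WPR(u) = (1-d) + d * sum over v in B(u) of
    WPR(v) * W_in(v,u) * W_out(v,u), where W_in (W_out) is u's share of
    in-links (out-links) among all pages referenced by v.
    `links` maps each page to the list of pages it links to; every link
    target must also appear as a key."""
    pages = sorted(links)
    in_links = {p: sum(1 for v in pages if p in links[v]) for p in pages}
    out_links = {p: len(links[p]) for p in pages}
    wpr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {}
        for u in pages:
            total = 0.0
            for v in pages:
                if u in links[v]:
                    refs = links[v]  # reference pages of v
                    w_in = in_links[u] / max(sum(in_links[p] for p in refs), 1)
                    w_out = out_links[u] / max(sum(out_links[p] for p in refs), 1)
                    total += wpr[v] * w_in * w_out
            new[u] = (1 - d) + d * total
        wpr = new
    return wpr
```

As in the worked example, every score stays at or above (1 - d) = 0.15, and pages whose in-links come from heavily linked neighbours accumulate larger shares.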
Table 2: Iterative calculation values for weighted PageRank
Iteration |    A     |    B     |    C     |    D
    1     |    1     |    1     |    1     |    1
    2     | 1.1275   | 0.47972  | 0.3912   | 0.19935
    3     | 0.425162 | 0.27674  | 0.25727  | 0.18026
    4     | 0.355701 | 0.244128 | 0.24189  | 0.17541
    5     | 0.34580  | 0.247110 | 0.239808 | 0.17719
    6     | 0.34454  | 0.247110 | 0.23953  | 0.17714
    7     | 0.34438  | 0.23450  | 0.23450  | 0.17714
    8     | 0.34436  | 0.23950  | 0.23449  | 0.17714
Q8. Give the explanation about the HITS algorithm?
Answer:
The HITS algorithm was proposed by Kleinberg in 1998. HITS identifies two different forms of web pages, called hubs and authorities. Authorities are pages having important contents. Hubs are pages that act as resource lists, guiding users to authorities. Thus, a good hub page for a subject points to many authoritative pages on that content, and a good authority page is pointed to by many good hub pages on the same subject. Hubs and authorities are shown in figure (a). A page may be a good hub and a good authority at the same time. This circular relationship leads to the definition of an iterative algorithm called HITS (Hyperlink-Induced Topic Search).
The HITS algorithm ranks web pages using the in-links and out-links of the web pages. A web page is named an authority if it is pointed to by many hyperlinks, and a web page is named a hub if it points to various hyperlinks. An illustration of hubs and authorities is shown in figure (a). HITS is, technically, a link-based algorithm. In the HITS algorithm, an initial set of web pages is gathered by analyzing their textual contents against a given query. After collection of the web pages, the HITS algorithm concentrates on the structure of the web only, neglecting their textual contents. The original HITS algorithm has some problems, which are given below:
(i) A high rank value is given to some popular websites that are not highly relevant to the given query.
(ii) Topic drift occurs when a hub has multiple topics, as equivalent weights are given to all the out-links of a hub page.
(iii) Inefficiency: graph construction must be performed on-line.
(iv) Irrelevant links: advertisements and automatically generated links.
(v) Mutually reinforcing relationships between hosts: on one site, multiple documents point to a document D at another site, raising their hub scores and the authority score of D.
Because of these problems, the HITS algorithm is not preferred in the Google search engine. Instead, the PageRank algorithm is used in the Google search engine because of its performance and efficiency.
[Figure (a) content: hub pages on the left pointing to authority pages on the right]
Figure (a): Illustration of Hubs and Authorities
In the HITS algorithm, the hub and authority scores are calculated using the following algorithm.
HITS Algorithm
1. Initialize all weights to 1.
2. Repeat until the weights converge:
3.     For every hub p ∈ H
4.         H_p = Σ A_q (over all authorities q that p points to)
5.     For every authority p ∈ A
6.         A_p = Σ H_q (over all hubs q that point to p)
7.     Normalize the weights.
The HITS algorithm treats the WWW as a directed graph G(V, E), where V is a set of vertices representing pages and E is a set of edges that correspond to links. There are two main steps in the HITS algorithm. The first step is the sampling step and the second step is the iterative step. In the sampling step, a set of relevant pages for the given query is collected, i.e., a sub-graph S of G which is rich in authority pages is retrieved. The algorithm starts with a root set R, from which the set S is obtained, keeping in mind that S is relatively small, rich in relevant pages about the query, and contains most of the good authorities. The second step, the iterative step, finds hubs and authorities using the output of the sampling step and equations (40) and (41). (See all equations from Q7.)
A_p = Σ H_q, summed over q ∈ B(P)                    ... (40)
H_p = Σ A_q, summed over q ∈ I(P)                    ... (41)
where H_p represents the hub weight, A_p represents the authority weight, and I(P) and B(P) denote the sets of reference and referrer pages of page P respectively. The weight of an authority page is proportional to the summation of the weights of the hub pages that link to that authority page; similarly, the hub weight of a page is
proportional to the summation of the weights of the authority pages that the hub links to. Figure (b) shows an example calculation of authority and hub scores.
[Figure (b) content: pages Q1, Q2 and Q3 point to page P, and P points to pages R1 and R2]
Figure (b): Calculation of hubs and authorities
From equations (40) and (41) above, the hub and authority scores for figure (b) are calculated as:
A_p = H_Q1 + H_Q2 + H_Q3                             ... (42)
H_p = A_R1 + A_R2                                    ... (43)
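The iteration above can be written as a short, runnable sketch. The function and graph names are illustrative, and the Euclidean normalization used here is a common choice that the text does not specify:

```python
def hits(graph, iterations=50):
    """HITS sketch: authority(p) = sum of hub scores of pages linking to p;
    hub(p) = sum of authority scores of pages p links to; each pass the
    score vectors are normalized to unit length."""
    pages = sorted(graph)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {p: a / norm for p, a in auth.items()}
        hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
        norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        hub = {p: h / norm for p, h in hub.items()}
    return hub, auth
```

On a star graph where three pages all point to one page, that page receives all the authority weight and the three pointers become equally good hubs, matching the hub/authority intuition above.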
Q9. Explain about the EigenRumor algorithm?
Answer:
This algorithm enables a higher score to be assigned to a blog entry submitted by a good blogger, even if the entry is not yet linked to by any other blogs, based on the acceptance of the blogger's prior work. In the current scenario, the number of blogging sites is increasing day by day, so there is a challenge for internet service providers to provide good blogs to users. PageRank and HITS are very promising in providing rank values to blogs, but some issues arise if these two algorithms are applied directly to blogs. These issues are:
1. The number of links to a blog entry is generally very small. As a result, the scores of blog entries calculated by PageRank, for example, are generally too small to permit blog entries to be ranked by importance.
2. Generally, some time is needed to develop a number of in-links and thus obtain a higher PageRank score. Since blogs are considered to be a communication tool for discussing new topics, it is desirable to assign a higher score to an entry submitted by a blogger who has received a lot of attention in the past, even if the entry itself has attracted few links so far.
The rank scores of blog entries as decided by the PageRank algorithm are often very low, so PageRank cannot rank blog entries according to their importance. To resolve these issues, the EigenRumor algorithm is proposed for ranking blogs. The EigenRumor algorithm has similarities to PageRank and HITS in that all are based on an eigenvector calculation of the adjacency matrix of the links. However, in the EigenRumor model, the adjacency matrix is constructed from agent-to-object links, not page (object) to page (object) links. One important thing to note is that "agent" is used to represent an aspect of a human being, such as a blogger, and "object" represents an item such as a blog entry. Using the EigenRumor algorithm, hub and authority scores are calculated as attributes of agents (bloggers), and the score of a blog entry that does not yet have any in-links can be computed from the scores of the blogger who entered it.
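The full EigenRumor model also involves evaluation links and mixing parameters that the text does not detail; the sketch below only illustrates the core idea stated above, namely a power-iteration (eigenvector-style) computation over an agent-to-object "provisioning" matrix instead of a page-to-page matrix. All names and the data encoding are hypothetical.

```python
def eigenrumor_scores(provisions, iterations=50):
    """Power-iteration sketch over agent-to-object links.
    `provisions` maps each agent (blogger) to the objects (entries) it
    provided.  Object scores come from the agents that provided them;
    agent scores come from the objects they provided - the bipartite
    analogue of the hub/authority iteration."""
    agents = sorted(provisions)
    objects = sorted({o for objs in provisions.values() for o in objs})
    agent_score = {a: 1.0 for a in agents}
    obj_score = {o: 1.0 for o in objects}
    for _ in range(iterations):
        obj_score = {o: sum(agent_score[a] for a in agents
                            if o in provisions[a]) for o in objects}
        n = sum(v * v for v in obj_score.values()) ** 0.5 or 1.0
        obj_score = {o: v / n for o, v in obj_score.items()}
        agent_score = {a: sum(obj_score[o] for o in provisions[a])
                       for a in agents}
        n = sum(v * v for v in agent_score.values()) ** 0.5 or 1.0
        agent_score = {a: v / n for a, v in agent_score.items()}
    return agent_score, obj_score
```

A brand-new entry with no in-links can then inherit a provisional score from its author's agent score, which is exactly the property the text highlights.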
Q10. Write the comparison of various page ranking algorithms.
Answer:
Based on the analysis, a comparison of some of the various web page ranking algorithms is shown in the table below. The comparison is done on the basis of some parameters, such as the main technique used, methodology, input parameters, relevancy, quality of results, importance and limitations. Among all the algorithms, PageRank and HITS are the most important. PageRank is the only algorithm implemented in the Google search engine, and HITS is used in the IBM prototype search engine called Clever. A similar algorithm is used in the Teoma search engine, which was later used by Ask.com.
Table: Comparison between ranking algorithms

Criteria          | PageRank            | Weighted PageRank   | HITS                  | Distance Rank      | EigenRumor
Mining techniques | WSM                 | WSM                 | WSM and WCM           | WSM                | WCM
Working process   | Computes values at  | Computes values at  | Computes hub and      | Calculates the     | Uses the adjacency
                  | index time; results | index time; results | authority scores of   | minimum average    | matrix constructed
                  | are sorted on the   | are sorted on the   | highly relevant pages | distance between   | from agent-to-object
                  | priority of pages.  | basis of page       | on the fly.           | two pages.         | links, not page-to-
                  |                     | importance.         |                       |                    | page links.
Input parameters  | In-bound links      | In-bound links and  | In-bound links,       | In-bound links     | Agent/Object
                  |                     | out-bound links     | out-bound links and   |                    |
                  |                     |                     | content               |                    |
Complexity        | O(log N)            |                     |                       |                    |
Q12. When does relevance feedback work?
Answer:
The efficiency of RF (relevance feedback) depends on the following:
1. User Knowledge
The user of an RF system needs to have enough knowledge to form an initial query which is at least minimally related to the documents within the IR system that they desire to retrieve, for IR to be successful. Even when the user makes a query related to the documents, there are some problems that RF alone is unable to solve. Those problems are:
Misspellings
If a user query contains spellings of terms different from those contained in the document collection, then RF does not work, or does not retrieve the desired documents as effectively as expected by the user.
Cross-Language IR
Documents in the same language cluster more closely together, which means documents in another language are not nearby in a vector space that is based on term distribution.
Mismatch of Vocabulary
Mismatches between the vocabulary of the query terms and that of the collection lead to failure of a user search, and RF is then ineffective. So, if the vocabulary of the query terms matches the document collection of the IR system, the search results improve effectively.
2. Similarity of Relevant Documents
To cluster the documents for easy retrieval, the relevant documents need to be similar to each other. RF separates the relevant documents from the non-relevant documents through the similarity of term distribution in all relevant documents marked by the users. But this cluster-centered approach will not work if the relevant documents form a multimodal class, i.e., they consist of several clusters of documents in the vector space.
Q13. Explain the evaluation of relevance feedback strategies?
Answer:
The role of iterative RF in the evaluation of relevance feedback gives a very substantial gain with the iterations, and the judgment of the relevant documents improves the satisfaction of the user's information need. One evaluation of RF computes the precision-recall graph for the initial query q_0 and then for the query q_m, which is modified by the user feedback on q_0. This strategy is completely based on the user RF of the document collection. But it is not a proper evaluation of RF.
Another, more realistic, evaluation strategy is to use the documents in the residual collection: the collection minus the documents the user has already judged. With it we can compare the relative performance of variant relevance feedback methods, but it is difficult to validly compare performance with and without RF, because the collection size and the number of relevant documents change from before the feedback to after it.
Yet another strategy for the evaluation of RF uses two collections: the first is used for the initial query q_0 and the relevance judgments that produce q_m, and the second collection is used to compare the valid performance of both q_0 and q_m. User-based RF combined with user studies makes IR system evaluations fair and close to the user's need.
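The residual-collection idea can be made concrete with a small sketch. The helper name is hypothetical, and precision at rank k is used here as the metric, although the text speaks more generally of precision-recall graphs:

```python
def residual_precision(ranking, judged_relevant, seen, k=10):
    """Residual-collection evaluation: drop every document the user already
    judged during feedback, then score only the remaining ranking.  This
    avoids crediting the system for re-retrieving documents the user has
    already seen."""
    residual = [d for d in ranking if d not in seen]
    top = residual[:k]
    hits = sum(1 for d in top if d in judged_relevant)
    return hits / max(len(top), 1)
```

Because the judged documents are removed, absolute scores are not comparable with a no-feedback run on the full collection; the measure is meant for comparing feedback variants against each other, as noted above.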
Q14. Write about the pseudo relevance feedback?
Answer:
Pseudo RF, also known as blind RF, provides a method for automatic local analysis.
PRF (pseudo relevance feedback) methods assume the top k retrieved documents to be relevant and refine the query using these feedback documents. The basis behind PRF is that documents which are similar to the user's initial query will lead to more relevant terms which, when added to the query, will lead to an improvement in performance.
Query Drift
Query drift is caused as a result of adding terms which have no association with the topic of relevance of the query. This happens when there are only a few, or no, relevant documents among the top k documents, which makes PRF sensitive to the quality of the top-ranked documents. PRF only improves the performance of queries which have good or reasonable initial retrieval performance. It cannot be used to improve the performance of bad or failed queries which do not retrieve anything relevant. Hence, it is not robust to the quality of the initial retrieval.
Several approaches have been proposed to improve the robustness of PRF. The main strategies can be categorized as follows:
1. Refine the feedback document set so that, instead of using all the top k documents, we choose a subset of them which is likely to be "highly relevant".
2. Instead of using all the terms obtained through feedback for query refinement, only use a subset of important terms, to avoid introducing query drift.
3. Dynamically decide when to apply PRF, instead of using it for all queries.
4. Vary the importance of each feedback document.
5. Use a large external collection, like Wikipedia or the web, as a source of expansion terms besides those obtained through PRF. The intuition behind this approach is that if the query does not have many relevant documents in the collection, then any improvement in the modeling of PRF is bound to perform poorly due to query drift.
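A minimal PRF sketch along the lines described above. It assumes documents are whitespace-tokenized strings and uses raw term frequency in the feedback set as the term-scoring heuristic; the text does not fix a particular weighting, so both choices are illustrative:

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, ranked_docs, k=3, n_expansion=5):
    """Blind RF sketch: assume the top-k retrieved documents are relevant,
    count the terms they contain (excluding the original query terms), and
    append the most frequent ones to the query."""
    feedback = Counter()
    for doc in ranked_docs[:k]:
        for term in doc.split():
            if term not in query_terms:
                feedback[term] += 1
    expansion = [t for t, _ in feedback.most_common(n_expansion)]
    return list(query_terms) + expansion
```

Strategies 1 and 2 above correspond to shrinking `k` or filtering `ranked_docs` before counting, and to lowering `n_expansion` or re-weighting the counted terms, respectively.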
4.4 SELECTIVE DISSEMINATION OF INFORMATION SEARCH
Q15. Explain in detail about selective dissemination of information search?
Answer:
Selective dissemination of information systems, frequently called dissemination systems, are becoming more prevalent with the growth of the internet.
A dissemination system is sometimes labeled a "push" system, while a search system is called a "pull" system.
In a dissemination system, the user defines a profile, and as new information is added to the system it is automatically compared to the user's profile.
If it is considered a match, it is asynchronously sent to the user's "mail" file.
The difference between the two functions lies in the dynamic nature of the profiling process, the size and diversity of the search statements, and the number of simultaneous searches per item.
In the search system, an existing database exists. As such, corpora statistics exist on term frequency within and between items. These can be used as weighting factors in the indexing process and the similarity comparison (e.g., inverse document frequency algorithms).
Profiles are relatively static search statements that cover a diversity of topics. Rather than specifying a particular information need, they usually generalize all of the potential information needs of a user.
One of the first commercial search techniques for dissemination was the Logicon Message Dissemination System (LMDS). The system originated from a system created by Chase, Rosen and Wallace (CRW Inc.), and was designed for the need to support the search of thousands of profiles with items arriving every 20 seconds.
It demonstrated one approach to the problem, where the profiles were treated as the static database and the new item acted like the query. It uses the terms in the item to search the profile structure to identify those profiles whose logic could be satisfied by the item.
The system uses a least frequently occurring trigraph (three-character) algorithm that quickly identifies which profiles cannot be satisfied by the item. The potential profiles are then analyzed in detail to confirm whether the item is a hit.
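The screening pass can be sketched as follows. It assumes AND-style profiles, so that a single absent trigraph rules a profile out; the data structures and names are hypothetical illustrations, not LMDS internals:

```python
def trigraphs(text):
    """All overlapping three-character substrings of a string."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def candidate_profiles(item_text, profiles, trigraph_freq):
    """Fast screening: for each profile, take the least frequently occurring
    trigraph among its terms (per corpus statistics in `trigraph_freq`); if
    that trigraph is absent from the item, no term containing it can occur,
    so the profile's logic cannot be satisfied and it is skipped.  Surviving
    profiles would then be analyzed in detail to confirm a hit."""
    item_tg = trigraphs(item_text)
    survivors = []
    for name, terms in profiles.items():
        required = min((tg for t in terms for tg in trigraphs(t)),
                       key=lambda tg: trigraph_freq.get(tg, 0))
        if required in item_tg:
            survivors.append(name)
    return survivors
```

Checking one rare trigraph per profile is far cheaper than evaluating full profile logic, which is the point of the least-frequently-occurring heuristic.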
Another example of a dissemination approach is the Personal Library Software (PLS) system. It received items into the database and periodically ran users' profiles against the database. This makes maximum use of the retrospective search software, but loses near-real-time delivery of items.
Another approach to dissemination uses a statistical classification technique and explicit error minimization to determine the decision criteria for selecting items for a particular profile.
Schutze et al. used two approaches to reduce the dimensionality: selecting a set of existing features to use, or creating a new, much smaller set of features that the original features are mapped into.
A chi-square (χ²) measure was used to determine the most important features. The test was applied to a 2x2 table containing the number of relevant and non-relevant items in which a term occurs, plus the number of relevant and non-relevant items in which the term does not occur. Writing a and b for the relevant and non-relevant items containing the term, c and d for the relevant and non-relevant items not containing it, and N = a + b + c + d, the formula used was:
χ² = N(ad - bc)² / ((a + b)(c + d)(a + c)(b + d))
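The statistic is easy to compute directly; the argument names below map to the four contingency counts just described (this is the standard 2x2 chi-square form, reconstructed because the scan lost part of the printed formula):

```python
def chi_square(n_rel_t, n_nonrel_t, n_rel_not, n_nonrel_not):
    """Chi-square score for one term from a 2x2 contingency table:
    (relevant / non-relevant) x (term present / term absent).  Larger values
    mark features whose presence separates relevant from non-relevant items."""
    n = n_rel_t + n_nonrel_t + n_rel_not + n_nonrel_not
    num = n * (n_rel_t * n_nonrel_not - n_rel_not * n_nonrel_t) ** 2
    den = ((n_rel_t + n_nonrel_t) * (n_rel_not + n_nonrel_not)
           * (n_rel_t + n_rel_not) * (n_nonrel_t + n_nonrel_not))
    return num / den if den else 0.0
```

A term that occurs in every relevant item and no non-relevant item scores N (the maximum), while a term spread evenly across both classes scores 0, which is why the highest-scoring terms are kept as features.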
4.5 WEIGHTED SEARCHES OF BOOLEAN SYSTEMS
Q16. Explain about the weighted search of Boolean systems?
Answer:
The two major approaches to generating queries are Boolean and natural language. Natural language queries are easily represented within statistical models and are usable by the similarity measures discussed. Issues arise when Boolean queries are associated with weighted index systems. Some of the issues are associated with how the logic (AND, OR, NOT) operators function with weighted values, and how weights are associated with the query terms. If the operators are given their normal interpretation, they act too restrictive or too general. To integrate the Boolean and weighted systems models, Fox and Sharat proposed a fuzzy set approach. Fuzzy sets introduce the concept of degree of membership. The AND and OR operations are defined as:
DEG_AND = min(DEG_A, DEG_B)
DEG_OR = max(DEG_A, DEG_B)
where A and B are terms in an item and DEG is the degree of membership. The Mixed Min and Max (MMM) model considers the similarity between query and document to be a linear combination of the minimum and maximum item weights. Fox proposed the following similarity formulas:
SIM(QUERY_or, DOC) = C_or1 * max(DOC_1, DOC_2, ..., DOC_n) + C_or2 * min(DOC_1, DOC_2, ..., DOC_n)
SIM(QUERY_and, DOC) = C_and1 * min(DOC_1, DOC_2, ..., DOC_n) + C_and2 * max(DOC_1, DOC_2, ..., DOC_n)
where C_or1 and C_or2 are weighting coefficients for the OR operation, and C_and1 and C_and2 are the weighting coefficients for the AND operation. The best results are typically obtained when C_or1 is between 0.5 and 0.8 and C_and1 is greater than 0.2.
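The MMM similarity reduces to the fuzzy-set min/max operators when one coefficient is 1 and the other 0. A minimal sketch (the coefficient defaults are illustrative values inside the ranges mentioned above, not values fixed by the text):

```python
def mmm_similarity(doc_weights, query_type, c1=0.6, c2=0.4):
    """Mixed Min and Max model: similarity is a linear mix of the max and
    min of the document's weights for the query terms.  For OR queries c1
    multiplies the max; for AND queries c1 multiplies the min."""
    lo, hi = min(doc_weights), max(doc_weights)
    if query_type == 'OR':
        return c1 * hi + c2 * lo
    return c1 * lo + c2 * hi  # AND
```

With c1 = 1 and c2 = 0 the OR case returns exactly max(DOC_i), i.e., the strict fuzzy-set interpretation, which is why MMM is described as a softened version of it.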
The MMM technique was expanded by Paice by considering all item weights, versus the maximum/minimum approach. The similarity measure is calculated as:
SIM(QUERY, DOC) = Σ_{i=1..n} r^(i-1) * d_i / Σ_{i=1..n} r^(i-1)
where the d_i are inspected in ascending order for AND queries (and descending order for OR queries), and r is a weighting coefficient.
Another alternative approach is the P-norm model, which allows terms within the query to have weights in addition to the weights on the terms in the items. Similar to the cosine similarity technique, it considers the membership values (d_1, ..., d_n) to be coordinates in an "n"-dimensional space. For an OR query, the origin is the worst possibility. For an AND query, the ideal point is the unit vector where all the d_i values equal 1.
The generalized queries are:
Q_or = (A_1, a_1) OR (A_2, a_2) OR ... OR (A_n, a_n)
Q_and = (A_1, a_1) AND (A_2, a_2) AND ... AND (A_n, a_n)
If we assign the strictness value to a parameter labeled "S", then the similarity formulas between queries and items are:
SIM(Q_or, DOC) = [(a_1^S * d_1^S + ... + a_n^S * d_n^S) / (a_1^S + a_2^S + ... + a_n^S)]^(1/S)
SIM(Q_and, DOC) = 1 - [(a_1^S * (1-d_1)^S + ... + a_n^S * (1-d_n)^S) / (a_1^S + a_2^S + ... + a_n^S)]^(1/S)
SIM(Q_not, DOC) = 1 - SIM(Q, DOC)
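The formulas above transcribe directly into code. The names are illustrative: `weights` are the query-term weights a_i and `memberships` the document membership values d_i:

```python
def pnorm_similarity(weights, memberships, s, query_type):
    """P-norm model: the OR similarity is a weighted s-norm distance from
    the origin (the worst case for OR); the AND similarity is one minus the
    distance from the ideal all-ones point.  s = 1 behaves like a weighted
    average; large s approaches strict Boolean min/max behaviour."""
    denom = sum(a ** s for a in weights)
    if query_type == 'OR':
        num = sum((a ** s) * (d ** s) for a, d in zip(weights, memberships))
        return (num / denom) ** (1.0 / s)
    num = sum((a ** s) * ((1 - d) ** s) for a, d in zip(weights, memberships))
    return 1 - (num / denom) ** (1.0 / s)
```

The strictness parameter S is the design knob: it lets one query mix soft, cosine-like behaviour with nearly strict Boolean semantics.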
Another approach for merging Boolean queries and weighted query terms works under the assumption that there are no weights available in the indexes. The normal Boolean operations produce the following results:
"A OR B" retrieves those items that contain the term A or the term B or both.
"A AND B" retrieves those items that contain both terms A and B.
"A NOT B" retrieves those items that contain term A and do not contain term B.
If weights between the values 0.0 and 1.0 are then assigned to the terms, they may be interpreted as the significance that users are placing on each term. The value 1.0 is assumed to be the strict interpretation of a Boolean query. The value 0.0 is interpreted to mean that the user places little value on the term. Under these assumptions, a term assigned a value of 0.0 should have no effect on the retrieved set. Thus:
"A_1.0 OR B_0.0" should return the set of items that contain A as a term.
"A_1.0 AND B_0.0" will also return the set of items that contain term A.
"A_1.0 NOT B_0.0" will also return set A.
This suggests that as the weight for term B goes from 0.0 to 1.0, the resultant set changes from the set of all items that contain term A to the set normally generated by the Boolean operation. The process can be visualized by use of the Venn diagrams shown in the figure below, which would include all items that are in all the areas in the Venn diagram.
Figure: Venn Diagram
4.6 SEARCHING THE INTERNET AND HYPERTEXT
Q17. Write the six key characteristics of intelligent agents.
Answer:
There are six key characteristics of intelligent agents:
1. Autonomy: The search agent must be able to operate without interaction with a human agent. It must have control over its own internal states and make independent decisions. This implies a search capability to traverse information sites based upon pre-established criteria, collecting potentially relevant information.
2. Communications Ability: The agent must be able to communicate with the information sites as it traverses them. This implies a universally accepted language defining the external interfaces.
3. Capacity for Cooperation: This concept suggests that intelligent agents need to cooperate to perform mutually beneficial tasks.
4. Capacity for Reasoning: There are three types of reasoning scenarios:
Rule-based: where the user has defined a set of conditions and actions to be taken.
Knowledge-based: where the intelligent agents have stored previous conditions and actions taken, which are used to deduce future actions.
Artificial evolution based: where intelligent agents spawn new agents with higher logic capability to perform their objectives.
5. Adaptive Behavior: Closely tied to characteristics 1 and 4, adaptive behavior permits the intelligent agent to assess its current state and make decisions on the actions it should take.
6. Trustworthiness: The user must trust that the intelligent agent will act on the user's behalf to locate information that the user has access to and that is relevant to the user.
4.7 INTRODUCTION TO INFORMATION VISUALIZATION
Q18. Explain about the information visualization’
Answer:
The theory of visualization began over 2,400 years ago. The philosopher Plato discerned that we perceive objects through the senses, using the mind. Our perception of the real world is a translation from physical energy from our environment into encoded neural signals. The mind is continually interpreting and categorizing our perception of our surroundings. Use of a computer is another source of input to the mind's processing functions. Text-only interfaces reduce the complexity of the interface, but also restrict use of the more powerful information processing functions the mind has developed since birth.
Information visualization is a relatively new discipline growing out of the debates in the 1970s on the way the brain processes and uses mental images. It required significant advancements in technology and information retrieval techniques to become a possibility. One of the earliest researchers in information visualization was Doyle, who in 1962 discussed the concept of "semantic road maps" that could provide a user a view of the whole database (Doyle-62). The road maps show the items related to a specific semantic theme. The user could use this view to focus his query on a specific portion of the database.
4.8 COGNITION AND PERCEPTION
R18, August-2022, Q7(b)
R18, February-2022, Q7(b)
Q19. Explain about cognition and perception?
Answer:
One of the first-level cognitive processes is preattention, that is, taking the significant visual information from the photoreceptors and forming primitives.
An example of using the conscious processing capabilities of the brain is the detection of the differently shaped objects and the border between them, shown between the left side and the middle of the figure.
The reader can likely detect the difference in the time it takes to visualize the two different groupings.
This suggests that if information semantics are placed in orientations, the mind's clustering aggregate function enables detection of groupings more easily than using different objects.
Another visual factor is the optical illusion that makes a light object on a dark background appear larger than if the item is dark and the background is light. Making use of this factor suggests that a visual display of small objects should use bright colors.
Colors have many attributes that can be modified, such as hue, saturation and lightness.
Depth, like color, is frequently used for representing visual information. Classified as monocular cues, changes in shading, blurring (proportional to distance), perspective, motion, stereoscopic vision, occlusion and texture all depict depth.
Another higher-level processing technique is the use of configural aspects of a display.
A configural effect occurs when arrangements of objects are presented to the user, allowing for easy recognition of a high-level abstract condition.
Another visual cue that can be used is spatial frequency. Spatial frequency is an acuity measure relative to regular light-dark changes that are in the visual field or similar channels.
In using cognitive engineering to design information visualization techniques, there is a hidden risk that "understanding is in the eye of the beholder".
Q20. Describe the background?
Answer :
A significant portion of the brain is devoted to vision and supports the maximum information transfer function from the environment to a human being.
Visualization is the transformation of information into a visual form which enables the user to observe and understand the information.
The Gestalt psychologists postulate that the mind follows a set of rules to combine the input stimuli into a mental representation that differs from the sum of the individual inputs:
Proximity: Nearby figures are grouped together.
Similarity: Similar figures are grouped together.
Continuity: Figures are interpreted as smooth continuous patterns rather than discontinuous concatenations of shapes (e.g., a circle with its diameter drawn is perceived as two continuous shapes, a circle and a line, versus two half circles concatenated together).
Closure: Gaps within a figure are filled in to create a whole (e.g., using dashed lines to represent a square does not prevent understanding it as a square).
Connectedness: Uniform and linked spots, lines or areas are perceived as a single unit.
Shifting the information processing load from slower cognitive processes to faster perceptual systems significantly improves the information-carrying interfaces between humans and computers.
‘Act. Anyone found guilty is LIABLE to face LEGAL proceedingsInformation Retrieval Systems
INFORMATION VISUALIZATION TECHNO-LOGIES
Q21. Explain about the information visualization technologies with neat diagrams?
Answer :
The efforts focused on information retrieval systems are investigating how best to display the results of searches, structured data from DBMSs, and the results of link analysis correlating data.
The goals for displaying the results from searches fall into two major classes: document clustering and search statement analysis.
Visualization tools in this area attempt to display the clusters, with an indication of their size and topic, as a basis for users to navigate to items of interest.
Others display the total set of terms, including additional terms from relevance feedback or thesaurus expansion, along with the documents retrieved, and indicate the importance of each term to the retrieval and ranking process.
One way of organizing information is hierarchically. A tree structure is useful in representing information that ranges over time (e.g., genealogical trees), constituents of a larger unit (e.g., organization structures, mechanical device definitions), and aggregates from the higher to the lower level (e.g., hierarchical clustering of documents).
A two-dimensional representation becomes difficult for a user to understand as the hierarchy becomes large.
One of the earliest experiments in information visualization was the Information Visualizer developed at Xerox PARC. It incorporates various visualization formats such as the DataMap, InfoGrid, ConeTree, and the Perspective Wall.
The Cone-Tree is a 3-dimensional representation of data, where one node of the tree is represented at the apex and all the information subordinate to it is arranged in a circular structure at its base. Any child node may also be the parent of another cone. Selecting a particular node rotates it to the front of the display.
Figure: Perspective Wall
Treemaps
This technique makes maximum use of the display screen space by using rectangular boxes that are recursively subdivided based upon the parent-child relationships between the data.
The CPU, OS, Memory, and Network Management articles are all related to a general category of computer operating systems, versus the computer applications which are shown in the rest of the figure.
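The recursive subdivision can be sketched with the classic slice-and-dice scheme, one of several treemap layouts; the tree encoding and names here are illustrative, not from the text:

```python
def treemap(node, x, y, w, h, horizontal=True, out=None):
    """Slice-and-dice treemap sketch: split the current rectangle among the
    children in proportion to their sizes, alternating the split direction
    at each level.  A leaf is (name, size); an internal node is
    (name, [children]).  Returns {name: (x, y, width, height)}."""
    if out is None:
        out = {}
    name, payload = node
    out[name] = (x, y, w, h)
    if isinstance(payload, list):
        total = sum(size(c) for c in payload)
        offset = 0.0
        for child in payload:
            frac = size(child) / total
            if horizontal:
                treemap(child, x + offset * w, y, w * frac, h, False, out)
            else:
                treemap(child, x, y + offset * h, w, h * frac, True, out)
            offset += frac
    return out

def size(node):
    """Total size of a subtree (sum of its leaf sizes)."""
    name, payload = node
    return sum(size(c) for c in payload) if isinstance(payload, list) else payload
```

Because every child rectangle sits inside its parent and areas are proportional to sizes, the whole screen is used, which is the property the text emphasizes.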
[Figure content: rectangles for CPU articles, OS articles, Network Management articles and Memory articles on one side; Word Processing, Spreadsheet, PowerPoint and Drawing software articles on the other]
Figure: Tree Map
The Envision system not only displays the relevance rank and estimated relevance of each item found by a query, but also simultaneously presents other query information.
The design is intentionally graphical and simple, using two-dimensional visualization. This allows a larger variety of user computer platforms to have access to the system.
Envision uses three interactive windows to display search results: the Query window, the Graphic View window, and the Item Summary window.
Figure: Example of DCARS
The Document Content Analysis and Retrieval System (DCARS) is being developed by the Calspan Advanced Technology Center. Their system is designed to augment the RetrievalWare search product.
They display the query results as a histogram, with the items as rows and each term's contribution to the selection indicated by the width of a tile bar on the row.
DCARS provides a friendly user interface that indicates why a particular item was found, but it is much harder to use the information in determining how to modify search statements to improve them.
Figure: Example of DCARS Query Histogram
Another representation that is widely used for both hierarchical and network-related information is the "cityscape", which uses the metaphor of movement within a city. In lieu of using hills, as in the terrain approach, skyscrapers represent the theme (concept) areas.
Internal Assessment
MULTIPLE CHOICE QUESTIONS
1. The search statement uses traditional Boolean logic and/or ______ language.
(a) Natural  (b) Logical  (c) Statistical  (d) Index
2. A variety of different similarity measures can be used to calculate the similarity between the item and the ______ statement.
(a) Logical  (b) And/or  (c) Search  (d) Natural
3. To determine the similarity between documents for clustering purposes, ______ is used.
(a) SIM(Item_i, Item_j) = Π(Term_ik)(Term_jk)
(b) SIM(Item_i, Item_j) = Σ(Term_ik)(Term_jk)
(c) SIM(Item_i) = Σ(Term_ik)
(d) SIM(Item_i, Item_j) = Σ(Term_ik)
4. SIM(DOC_i, QUERY_j) = ______
(a) Σ(C + IDF)  (b) Σ((C + IDF) * F_ij)  (c) Π(C + IDF)  (d) Σ((C + IDF) * R_ij)
5. The length of search statements directly affects the ability of ______ to find relevant items.
(a) Information  (b) Document  (c) System  (d) Information retrieval systems
6. The binding is to the ______ and past experiences of the user.
(a) Vocabulary  (b) Specific  (c) Search  (d) User
7. The final level of binding comes as the search is applied to a specific ______.
(a) Search  (b) Document  (c) Database  (d) Information
8. Use of a similarity ______ returns the complete database as search results.
(a) Algorithm  (b) Results  (c) Thresholds  (d) Measure
9. Thresholds are usually associated with the ______ process.
(a) System  (b) Index  (c) Search  (d) Database
10. The ______ defines the items in the resultant hit file from the query.
(a) HMM  (b) Measures  (c) Search  (d) Threshold
FILL IN THE BLANKS
1. SIM(Item_i, Item_j) = ______
2. Use of HMM for searching textual corpora has introduced a new paradigm for ______.
3. P(D is R | Q) = ______
4. The development of an HMM approach begins with applying ______ rule.
5. HMM stands for ______.
6. PARC stands for ______.
7. The unnormalized similarity formula is ______.
8. The one-dimensional space defined by ______ should cause the group means to be well separated.
9. g(x) = ______
10. The two major approaches to generating queries are ______ and natural language.
MULTIPLE CHOICE QUESTIONS
1. (a)  2. (c)  3. (b)  4. (a)  5. (d)
6. (a)  7. (c)  8. (a)  9. (c)  10. (d)
FILL IN THE BLANKS
1. Σ(Term_ik)(Term_jk)
2. Search
3. P(Q | D is R) * P(D is R) / P(Q)
4. Bayes
5. Hidden Markov Model
6. Palo Alto Research Center
7. SIM(Q, DOC_i) = Σ (TERM_j * TERM_ij)
8. y = Aᵀ * x
9. log(n_i / (1 - n_i))
10. Boolean