IRSunit 4

The document discusses various concepts related to information retrieval systems, including binding, similarity measures, ranking algorithms, and relevance feedback. It explains how search statements are generated and processed, the importance of statistical algorithms in indexing, and the use of hidden Markov models in information retrieval. Additionally, it covers the challenges of user feedback in web searches and the significance of algorithms like PageRank and distance rank in improving search efficiency.
Q1. Write a short note on binding?
Answer: The final level of binding comes as the search is applied to a specific database. This binding is based on the statistics of the processing tokens in the database and the semantics used in the database. This is especially true in statistical and concept indexing systems. Some of the statistics used in weighting, for example document frequency and total frequency, are determined by applying a statistical algorithm against a representative sample of the database. Natural language indexing techniques tend to use the most corpora-independent algorithms.

Q2. Write about the similarity measures and ranking?
Answer: Searching in general is concerned with calculating the similarity between a user's search statement and the items in the database. Restricting the similarity measure to passages gains significant precision with minimal impact on recall. Once items are identified as possibly relevant to the user's query, it is best to present the most likely relevant items first (ranking). A rank is a scalar number that represents the estimated relevance of an item to the query.

Q3. Define HMM?
Answer: A hidden Markov model (HMM) underlies a method for information retrieval that provides a general framework for incorporating multiple word-generation mechanisms into the ranking model.

Q4. Write a short note on ranking algorithms?
Answer: With the growing number of web pages and users on the web, the number of queries submitted to search engines is also growing rapidly day by day. Therefore, search engines need to be more efficient in their processing and output. Web mining techniques are employed by the search engines to extract relevant documents from the web database and provide the necessary and required information to the users. Search engines such as Google became very successful because of the PageRank algorithm.
PageRank-style algorithms are used by search engines to present search results by considering relevance, importance, and content score, with web mining techniques ordering them according to user interest. Some ranking algorithms depend only on the link structure of the documents, i.e., their popularity scores, whereas others look at the actual content of the documents.

Q5. Write a short note on the Distance Rank algorithm?
Answer: In this ranking algorithm, the rank is computed on the basis of the shortest logarithmic distance between two pages, and pages are ranked accordingly, so that a page with a smaller distance is assigned a higher rank. The advantage of this algorithm is that, being less sensitive, it can find pages faster, with higher accuracy, using a distance-based solution as compared to other algorithms that provide quality output but have certain limitations. The limitation of this algorithm is that the crawler must perform a large calculation to recompute the distance vector whenever a new page is inserted between two pages. The Distance Rank algorithm adopts the PageRank property that the rank of each page is computed as the weighted sum of the ranks of all pages with incoming links to that particular page; a page therefore has a high rank value if it has more incoming links.

Q6. Write a note on probabilistic relevance feedback?
Answer: Probabilistic relevance feedback is based on user feedback about document relevance. The user marks certain documents as relevant or non-relevant, which we can indicate with a Boolean indicator variable R for a document's relevance.
The probability that a term t appears in a relevant document is estimated as:

P(x_t = 1 | R = 1) = |VR_t| / |VR|
P(x_t = 1 | R = 0) = (df_t - |VR_t|) / (N - |VR|)

where N is the number of documents in the collection, df_t is the number of documents containing the term t, VR is the known set of relevant documents, and VR_t is the subset of VR containing t. The assumption that the set of relevant documents is a small subset of the set of all documents makes the second estimate possible.

Q7. Write about relevance feedback on the web?
Answer: Users rarely provide relevance feedback on the web, because people using web search interfaces want to complete their search in a single interaction to save time. The lack of uptake also probably reflects that relevance feedback (RF) is hard to explain to the average user, and that RF is mainly recall-enhancing while web users rarely care about achieving sufficient recall. Of all web search users, only about 4% of query sessions use an RF option, and about 70% of users look only at the first page of results and do not pursue things any further. Nowadays a thread of work uses clickstream data to provide indirect RF: when a user clicks on a result link, the click is recorded as implicit evidence of relevance. This is the most commonly used form of indirect RF.

Q8. Write about indirect relevance feedback (RF)?
Answer: Indirect RF is also called implicit feedback, in contrast to the explicit feedback approach. Implicit feedback is easy to collect from users in settings where explicit relevance feedback is difficult to obtain, such as web search. The approach works by globally collecting click rates on pages from users' clicks: when users want relevant information they click on particular links to look at more, so links that are clicked frequently are assumed to indicate that the page was likely relevant to the query. This approach was introduced by the Direct Hit search engine.
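The incidence estimates from the probabilistic relevance feedback discussion above can be sketched as follows (a minimal sketch; the function name and argument names are my own, not from the text):

```python
def rf_term_probs(df_t, n_docs, vr_t, vr):
    """Relevance-feedback incidence estimates for a term t.

    df_t:   document frequency of t in the whole collection of n_docs items
    vr:     number of documents the user judged relevant (|VR|)
    vr_t:   how many of those judged-relevant documents contain t (|VR_t|)
    """
    p_rel = vr_t / vr                         # P(x_t = 1 | R = 1)
    p_nonrel = (df_t - vr_t) / (n_docs - vr)  # P(x_t = 1 | R = 0)
    return p_rel, p_nonrel

# 8 of 10 judged-relevant documents contain t; t occurs in 100 of 1000 docs.
p_rel, p_nonrel = rf_term_probs(df_t=100, n_docs=1000, vr_t=8, vr=10)
```

The second estimate relies on the assumption stated above: because VR is a small subset of the collection, the non-relevant set can be approximated by the whole collection minus VR.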
An example of this indirect feedback of relevance is clickstream analysis, which is a form of the general area of clickstream mining.

Q9. Explain in brief about relevance feedback?
Answer: Relevance feedback is a widely preferred utility that enables the user to retrieve the desired information in several passes. At every pass, the user enters a query and obtains a result set containing the documents that match the query. The user then selects the documents that are relevant to the generated query. Depending upon this selection, the initial query is upgraded by adding additional terms for better results. Moreover, the process of upgrading the query can also include assigning new weights to the existing terms depending upon the feedback of the user. In this way, at the end of every pass the query gets revised, which ultimately leads to better search results.

Q10. Write a note on searching the Internet and hypertext?
Answer: The Internet has multiple different mechanisms that are the basis for searching items. The primary techniques are associated with servers on the Internet that create indexes of items on the Internet and allow search of them. Some of the most commonly used nodes are YAHOO, Alta Vista and Lycos. In all of these systems there are active processes that visit a large number of Internet sites and retrieve textual data; the systems differ in the data they retrieve and their general philosophy on user access. LYCOS (http://www.lycos.com) and Alta Vista automatically go out to other Internet sites and return the text at the sites for automatic indexing (http://www.altavista.digital.com).

Q11. Write a short note on cognition and perception?
Answer: The user-machine interface at first focused primarily on a paradigm of a typewriter. As computer displays became ubiquitous, man-machine interfaces focused on treating the display as
an extension of paper, with the focus on consistency of operations. The advent of WIMP interfaces and the evolution of the interface focused on how to represent to the user what is taking place in the computer environment. Extending the HCI to improve the information flow reduces wasted user overhead in locating needed information. Although the major focus is on enhanced visualization of information, other senses are also being looked at for future interfaces. The audio sense has always been part of simple alerts in computers; sounds are now being replaced by speech in both input and output interfaces. The tactile (touch) sense is being addressed in experiments using Virtual Reality (VR).

Q12. List the functions of the Fast Data Finder?
Answer:
(a) Proximity searching
(b) Character alternatives
(c) Term counting
(d) Fuzzy term matching

Essay Questions with Answers

4.1 SEARCH STATEMENTS AND BINDING

Q1. Explain the search statements and binding?
Answer: Search statements are the statements of an information need generated by users to specify the concepts they are trying to locate in items. The search statement uses traditional Boolean logic and/or natural language. In generating the search statement, the user may have the ability to weight (assign an importance to) different concepts in the statement. The first level of binding is to the vocabulary and past experiences of the user. The next level of binding comes when the search statement is parsed for use by a specific search system. The final level of binding comes as the search is applied to a specific database.

Figure: Examples of Query Binding
Input: "Find me information on the impact of the oil spills in Alaska on the price of oil" (the user's search statement, using the vocabulary of the user)
Tokens: impact, oil (petroleum), spills (accidents), Alaska, price (cost, value) (statistical system binding extracts processing tokens)
Weights: impact (.308), oil (.606), petroleum (.65), spills (.12), accidents (.23), Alaska (.45), price (.16), cost (.025), value (.10) (weights assigned to search terms based upon an inverse document frequency algorithm and the database)

The length of search statements directly affects the ability of information retrieval systems to find relevant items: the longer the search query, the easier it is for the system to find items. Search statements in Selective Dissemination of Information systems are usually very long, typically 75 to 100 terms.

4.2 SIMILARITY MEASURES AND RANKING

Q2. Explain in detail about similarity measures?
Answer: A variety of different similarity measures can be used to calculate the similarity between an item and a search statement. A characteristic of a similarity formula is that its value increases as the items become more similar; the value is zero if the items are totally dissimilar. A formula to determine the similarity between two documents for clustering purposes is:

SIM(Item_i, Item_j) = Σ_k (Term_ik)(Term_jk)

The similarity formula used by Salton in the SMART system is:

SIM(Doc_i, Query_j) = Σ_k (IDF_k × f_ik), where f_ik = K + (1 - K) × TF_ik / maxfreq_i

To determine the "weight" an item has with respect to the search statement, the cosine formula is used, which measures the cosine of the angle between the vector for the item and the vector for the query:

SIM(Doc_i, Query_j) = Σ_k (Doc_ik × QTerm_jk) / sqrt( Σ_k (Doc_ik)² × Σ_k (QTerm_jk)² )

In the Jaccard formula, the denominator depends upon the number of terms in common: as the common elements increase, the similarity value quickly decreases, staying in the range -1 to +1.
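The cosine measure above, together with the Jaccard and Dice variants used in this section, can be sketched as follows (a minimal sketch over toy term-weight vectors):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(doc, query):
    # dot product normalized by the product of the vector lengths
    return dot(doc, query) / math.sqrt(dot(doc, doc) * dot(query, query))

def jaccard(doc, query):
    # denominator shrinks as the terms in common grow
    return dot(doc, query) / (dot(doc, doc) + dot(query, query) - dot(doc, query))

def dice(doc, query):
    # factor of 2 in the numerator, simplified denominator
    return 2 * dot(doc, query) / (dot(doc, doc) + dot(query, query))

query = (2, 2, 0, 0, 4)
doc2 = (2, 6, 0, 0, 4)
# cosine(doc2, query) ≈ 0.87, jaccard(doc2, query) ≈ 0.67, dice(doc2, query) = 0.8
```

All three share the same numerator and differ only in normalization, which is why they rank strongly overlapping vectors similarly but spread the absolute values differently.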
The Jaccard formula:

SIM(Doc_i, Query_j) = Σ_k (Doc_ik × QTerm_jk) / ( Σ_k Doc_ik² + Σ_k QTerm_jk² - Σ_k (Doc_ik × QTerm_jk) )

The Dice measure simplifies the denominator from the Jaccard measure and introduces a factor of 2 in the numerator. The normalization in the Dice formula is also invariant to the number of terms in common:

SIM(Doc_i, Query_j) = 2 × Σ_k (Doc_ik × QTerm_jk) / ( Σ_k Doc_ik² + Σ_k QTerm_jk² )

Use of a similarity algorithm returns the complete database as search results; many of the items have a similarity close or equal to zero. Thresholds are therefore usually associated with the search process. The threshold defines the items placed in the resultant hit file from the query. Thresholds are either a value that the similarity measure must equal or exceed, or a number that limits the number of items in the hit file. Normalizing by the denominator makes the results vary with the commonality of terms. For example:

QUERY = (2, 2, 0, 0, 4)
DOC1 = (0, 2, 6, 4, 0)
DOC2 = (2, 6, 0, 0, 4)

       | Cosine | Jaccard | Dice
DOC1   |  0.11  |  0.05   | 0.10
DOC2   |  0.87  |  0.67   | 0.80

Using the unnormalized summation (dot product) similarity, SIM(QUERY, DOC1) = 4 and SIM(QUERY, DOC2) = 32: with a threshold of 5 only DOC2 is selected, while with a threshold of 4 all items are selected.

As a second example, consider the vector vocabulary: American, geography, lake, Mexico, painter, oil, reserve, subject.

DOC1: "Geography of Mexico suggests oil reserves are available" with vector (0, 1, 0, 2, 0, 3, 1, 0)
DOC2: "American geography has lakes available everywhere" with vector (1, 3, 2, 0, 0, 0, 0, 0)
DOC3: "Painters suggest Mexico lakes as subjects" with vector (0, 0, 1, 3, 3, 0, 0, 2)
QUERY: "Oil reserves in Mexico" with vector (0, 0, 0, 1, 0, 1, 1, 0)

SIM(Q, DOC1) = 6, SIM(Q, DOC2) = 0, SIM(Q, DOC3) = 3

Figure: Query Threshold Process

The items are stored in clusters (top-down): A, B and C are centroids, and K to N, P, Q and R are items.

Figure: Item Cluster Hierarchy

A query is compared to the centroids "A" and "B". If the result of the similarity measure is above the threshold, the query is then applied to that node's children. In the following figure, the filled circle represents the query and the filled boxes represent the centroids for the three clusters represented by the trees.

Figure: Centroid Comparisons

Q3. Explain about HMMs?
Answer: Hidden Markov models have been applied successfully over the last two decades in a wide variety of speech and language related recognition problems, including speech recognition, named entity finding, optical character recognition, and topic identification. In the present work, we describe an application of this technology to the problem of ad hoc information retrieval. In all HMM applications, the observed data is modeled as being the output produced by passing some unknown key through a noisy channel. In the ad hoc retrieval problem, we take the observed data to be the query Q, and the unknown key to be a desired relevant document D. The noisy channel is the mind of a user, who is imagined to have some notion of which documents he wants, and who transforms that notion into the text of the query. Thus, we compute for each document the probability that D was the relevant document in the user's mind, given that Q was the query produced, i.e., P(D is R | Q), and rank the documents based on this measure. Using probability models for information retrieval has a history almost four decades long, beginning with the work of Maron and Kuhns, and first seeing real application in the "standard probability model" pioneered by Robertson and Sparck-Jones. More recently, however, the introduction of ad hoc constants and non-linear smoothing functions has improved performance steadily, at the cost of drifting further and further from the probabilistic framework.

Q4. Explain about the concept of the probability model with the help of an HMM?
Answer: Given a user-generated query and a set of documents, we wish to rank the documents according to the probability that D is relevant, conditioned on the fact that the user produced Q, i.e., P(D is R | Q). Applying Bayes' rule, we decompose this into quantities that may be more easily estimated:

P(D is R | Q) = P(Q | D is R) × P(D is R) / P(Q)   ... (1)

where P(Q | D is R) is the probability of the query being posed, under the hypothesis that the document is relevant; P(D is R) is the prior probability that document D is relevant; and P(Q) is the prior probability of query Q being posed. Since P(Q) will be identical for all documents D, we can safely disregard it for the purposes of sorting documents. We propose to model the generation of a query by a user as a discrete hidden Markov process dependent on the document the user has in mind. A discrete hidden Markov model is defined by a set of output symbols, a set of states, a set of probabilities for transitions between the states, and a probability distribution on output symbols for each state. An observed sampling of the process is produced by starting from some initial state, transitioning from it to another state, sampling from the output distribution at that state, and then repeating these latter two steps. The transitioning and the sampling are non-deterministic, and are governed by the probabilities that define the process. The term "hidden" refers to the fact that an observer sees only the output symbols, but does not know the underlying sequence of states that generated them. In the present work we take the union of all words appearing in the corpus as the set of output symbols, and posit a separate state for each of several mechanisms of query word generation. There is a separate process for each individual document, which generates the words of the query by traversing the states and sampling a word according to the output distribution of each state.
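The query-generation model just described can be sketched with a simple two-state mixture (a minimal sketch; the transition weight `a_doc` and the maximum-likelihood unigram estimates are my assumptions, not values from the text):

```python
from collections import Counter

def hmm_query_likelihood(query, doc, corpus, a_doc=0.7):
    """P(Q | D is R) under a two-state HMM: each query word is emitted
    either by the 'Document' state (probability a_doc) or by a
    'General English' state estimated from the whole corpus."""
    doc_counts, corpus_counts = Counter(doc), Counter(corpus)
    score = 1.0
    for q in query:
        p_doc = doc_counts[q] / len(doc)       # output dist. of the Document state
        p_ge = corpus_counts[q] / len(corpus)  # output dist. of General English
        score *= (1 - a_doc) * p_ge + a_doc * p_doc
    return score

corpus = ["oil", "spill", "alaska", "price", "oil", "geography"]
# For the single-word query: 0.3 * (2/6) + 0.7 * (1/2) = 0.45
score = hmm_query_likelihood(["oil"], ["oil", "spill"], corpus)
```

Scoring every document this way and sorting by the result implements the ranking by P(Q | D is R) from Equation (1), since P(Q) is constant across documents.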
Knowing the query that was produced, we can easily compute the probability of its being produced by each of the documents in the corpus. This is the P(Q | D is R) term that appears in Equation (1). While one can employ as many HMM states as one can imagine, we restrict our discussion here to the simple but powerful two-state HMM shown in Figure 1. The first state, labelled "Document", represents choosing a word directly from the document.

Figure 1: A simple two-state HMM

Figure 2: An expanded multi-state HMM

The second state, labelled "General English", represents choosing a word that is unrelated to the document but that occurs commonly in natural language queries. More generally, the model can be easily extended to accommodate a broad variety of word generation mechanisms involving synonyms, topic lexicons, or proper names (see Figure 2).

Q5. Explain about ranking algorithms?
Answer: A by-product of the use of similarity measures for selecting hit items is a value that can be used in ranking the output. Ranking the output implies ordering the output from the most likely items that satisfy the query to the least likely
In most of the commercial systems, heuristic rules are used to assist in the ranking of items Generally, systems do not want to use factors that require knowledge across the corpus as a basis for their similar cr ranking functions because itis oo difficult fo maintain current values as the database changes and the added complexity has not been shown to significantly improve the overall weighting process. A good example of how} *9eRe ‘commercial product integrates efficiency with theoretical concepts is the Retrieval Ware system's approach | queries and ranking, RetrievalWare first uses indexes to identify potential relevant items, Itthen applies coarse gr and fine grain ranking. The coarse grain ranking is based on the presence of query terms within items. In the fre sg7ain ranking, the exact rank of the item is calculated, The coarse grain ranking is a weighted forinula that cant? adjusted based on completeness, contextual evidence or variely, and semantic distance. Completenessis the proporta} of the number of query terms found in the item versus the number in the query. It sets an upper limit on the ra value for the item. If weights are assigned to query terms, the weights are factored into the value. Context evidence occurs when related words from the semantic network are also in the item. Thus ifthe user has indicat that the query term “charge” has the context of “paying for an object” then finding words such as “buy,” “purchas “debt” suggests that the term “charge” in the item has the meaning the user desires and that more weight should placed in ranking the item. Synonyms add additional weight; antonyms decrease weight. The coarse grain proce} provides an initial rank to the item based upon existence of words within the item. Since physical proximity is 14] considered in coarse grain ranking, the ranking value can be easily calculated. ‘Warning : Xerox/Photocopying of this book ls a CRIMINAL Act. 
Anyone found gullty ls LIABLE to face LEGAL proceeded Via 498 @ Information Retrieval Systems Q6._ Briefly explain about the pageRank Algorithm? Answer PageRank: algorithm is used by the famous. engine that is Google. This algorithm is the most commonly used thm for ranking the various pages. Working of the pageRank algorithm depends upon link structure of the web 3s. The PageRank algorithm is based on the concepts that if Page contains important links towards the other g areal to be considered as important pages. The pageRank considers the back linkin deciding the rank score. Ifthe addition of the all the ranks of the back links is large then the page then itis provided a large rank. Therefore, than simply counting the number of pages that are linking to it. If backlink comes from an important page, then that backlink is given a higher weighting than those backlink comes from non - important pages. In a simple way, link from one page to another page may be considered as avote. However, not only the number of votes a page receivesis considered important, but the importance or the relevance of the ones that cast these votes as well. We assume page Ahaspages T,,,...T, which point to it .e., are links. The variable d is a damping factor, which value canbe set between O and 1, We usually set the value of d to 0.85 PR(T,) is the incoming link to page A and C(T,) isthe outgoing link from page T,. The pagerank of a page A is given by the following (1): PR (A) = (Id) + d(PR(T,) /C(T,) +. PRT, /' oe (1) The damping factor is used to'stop the other pages having too much influence; this total vote is damped ‘down by mukiplying it by 0.85, one important thing is noted that the page ranks form a probability distribution over web pages, so the sum of all web pages page ranks will be one and the d damping factor is the probability at each page the random suffer will get bored and request another random link of M & Q or the back links of O. 
ZN [-n N= hoe tc ins oes Let us take an example of hyperlink structure of four pages A, B, C and D as shown in below figure. The PageRank for Pages A, B, Cand D can be calculated by using (1). Another simplified version of pageRank given by PR(N) = PR(M)/L(M)L(2) (2) ik structure of four pages ” ct. Anyone found guilty is LIABLE to face LEGAL proceedings Figure: Hyperlln ‘Neting:XeroyPhotocopying ofthis book is @ CRIMINAL A Information Retrieval Systems Let us assume the initial pageRank as 1 and do the PR (A) = (1-d) + d(PR(B)/C(B) + PR (C)/C(C) + PR(D)/C(D)) = (1-0.85) + 0.85 (1/3 + 1/3 +1/1) = 15666667 PRIB) = (1d) + d (PR(A)/C(A) + (PR(C)/ C(C)) = (1-0.85) + 0.85 (1.5666667/2 + 1/3) = 1,0991667 PR(C) = (1-d) +d (PRIA) / CIA) + (PR(B) / C(B)) = (1 -0.85) + 0.85 (1.5666667/2 + 1.0991667/3) = 1.127264 PR(D) = (-a) + d{(PR(B) /C(B) + (PR(C)/CIC)) = (1-0.85) + 0.85 ( 1.0991666/3 + 1.127264/3) = 0,7808221 For the second iteration by taking the above pageRank values from (3), (4), (5) and (6). The second iterata pageRank values are as following: PR(A) = 0.15 + 0.85 ((1.0991667/3) + (1.127264/3) + (0.7808221/1)) = 1.4445208 PR(B) = 0.15 + 0.85 ((1.4445208) + (1.127264/3) = 1.0833128 PR(C) = 0.15 + 0.85 (( 1.4445208/2) + (1.0833128/3)) = 1.07086 PR(D) = 0.15 +0.85((1.0833128/3) + (1.07086/3) = 0.760349 During the computation of 34° iteration, the average of the all web pages is 1. Some of the pageRank v calculation. The value of damping factor disputiog =| - (3) ww (4) + (5) ~~ (6) et) -» (8) (9) (10) are shown in table 1. The table with the graph is shown in the simulation results section. ‘Table 4: Iterative calculation for PageRank Iteration A B Cc D 1 1 1 1 1 2 1.5666667 | 1.0991667 1.127264 0.7808221 3 1.4445208 | 10833128 1.07086 0.760349 v7 1.3141432 | 0,9886763 | 0.9886358 | 0.7102384 18 1.313941 | 0.985384 | 0,98851085 | 0.71016395 19 13138034 | 0.98844457 | 0.98842573 | 0.7101132 B44) ‘Warning : XeroyPhotocopying of this book is a CRIMINAL Act. 
One thing to note here is that the rank of a page is divided evenly among its out-links to contribute to the ranks of the pages it points to. Starting from the original PageRank value of 1, the computation runs through all iterations until all pages' PageRank values repeat (converge), and their average is then found. PageRank can be calculated using a simple iterative method and corresponds to the principal eigenvector of the normalized link matrix of the web. The PageRank algorithm needs a few hours to calculate the ranks of millions of pages and provides efficient output for millions of pages.

Q7. Explain about the weighted PageRank algorithm?
Answer: The weighted PageRank algorithm (WPR), proposed by Wenpu Xing and Ali Ghorbani, is a modification of the original PageRank algorithm. WPR decides the rank score based on the popularity of the pages, taking into consideration the importance of both the in-links and out-links of the pages. This algorithm provides a higher rank value to more popular pages and does not divide the rank of a page equally among its out-link pages: every out-link page is given a rank value based on its popularity. Popularity of a page is decided by observing its number of in-links and out-links. As suggested by the authors, the performance of WPR is to be tested on different websites, and future work includes calculating the rank score by utilizing more than one level of reference pages and classifying the web pages. The importance is assigned in terms of weight values to the incoming and outgoing links, denoted W_in(m,n) and W_out(m,n) respectively. W_in(m,n), as shown in equation (11), is the weight of link (m,n), computed from the number of incoming links of page n and the number of incoming links of all reference pages of page m.
W_in(m,n) = I_n / Σ_{p ∈ R(m)} I_p   ... (11)
W_out(m,n) = O_n / Σ_{p ∈ R(m)} O_p   ... (12)

where I_n and I_p denote the numbers of incoming links of page n and page p, and R(m) represents the reference page list of page m (the pages m links to). Similarly, W_out(m,n), as shown in equation (12), is the weight of link (m,n), computed from the number of outgoing links of page n and the number of outgoing links of all the reference pages of m, where O_n and O_p are the numbers of outgoing links of pages n and p. The formula proposed for WPR, a modification of the PageRank formula, is shown in equation (13):

WPR(n) = (1 - d) + d Σ_{m ∈ B(n)} WPR(m) W_in(m,n) W_out(m,n)   ... (13)

The WPR calculation is performed for the same hyperlink structure shown in the figure above. The WPR equation for page A is as follows:

WPR(A) = (1-d) + d(WPR(B) W_in(B,A) W_out(B,A) + WPR(C) W_in(C,A) W_out(C,A) + WPR(D) W_in(D,A) W_out(D,A))   ... (14)

To get the value of WPR(A), we first calculate the incoming-link and outgoing-link weights as below:

W_in(B,A) = 3/(3+2) = 3/5   ... (15)
W_out(B,A) = 2/(2+3+1) = 1/3   ... (16)
W_in(C,A) = 3/(3+2) = 3/5   ... (17)
W_out(C,A) = 2/(2+3+1) = 1/3   ... (18)
W_in(D,A) = 3/(2+2) = 3/4   ... (19)
W_out(D,A) = 2/2 = 1   ... (20)

Now these in-link and out-link weights, equations (15) to (20), are put into equation (14) to calculate the weighted rank of the nodes A, B, C and D as follows.
WPR(B) = (1-d) + d(WPR(A) W_in(A,B) W_out(A,B) + WPR(C) W_in(C,B) W_out(C,B))   ... (21)
WPR(C) = (1-d) + d(WPR(A) W_in(A,C) W_out(A,C) + WPR(B) W_in(B,C) W_out(B,C))   ... (22)
WPR(D) = (1-d) + d(WPR(B) W_in(B,D) W_out(B,D) + WPR(C) W_in(C,D) W_out(C,D))   ... (23)

For the WPR(A) calculation, the value of d is set to 0.85 and the initial values of WPR(B), WPR(C) and WPR(D) are taken as 1, so the calculation for the first iteration is as follows:

WPR(A) = (1-0.85) + 0.85(1 × 3/5 × 1/3 + 1 × 3/5 × 1/3 + 1 × 3/4 × 1) = 1.1275   ... (24)

W_in(A,B) = 2/(2+2+2) = 1/3   ... (25)
W_out(A,B) = 3/(3+3) = 1/2   ... (26)
W_in(C,B) = 2/(2+3) = 2/5   ... (27)
W_out(C,B) = 3/(2+3+1) = 1/2   ... (28)

Now these weights, equations (25) to (28), are put into equation (21), with the initial value of WPR(C) set to 1:

WPR(B) = (1-0.85) + 0.85(1.1275 × 1/3 × 1/2 + 1 × 2/5 × 1/2) ≈ 0.4797   ... (29)

W_in(A,C) = 2/(2+2+2) = 1/3   ... (30)
W_out(A,C) = 3/(3+3) = 1/2   ... (31)
W_in(B,C) = 2/(3+2) = 2/5   ... (32)
W_out(B,C) = 3/(2+3+1) = 1/2   ... (33)

By substituting the values of equations (24), (29), (30), (31), (32) and (33) into equation (22), taking d as 0.85, we get the WPR of page C:

WPR(C) = (1-0.85) + 0.85(1.1275 × 1/3 × 1/2 + 0.4797 × 2/5 × 1/2) ≈ 0.3912   ... (34)

W_in(B,D) = 2/(2+2) = 1/2   ... (35)
W_out(B,D) = 2/2 = 1   ... (36)
W_in(C,D) = 2/(2+3) = 2/5   ... (37)
W_out(C,D) = 2/(2+3+1) = 1/3   ... (38)

Again, by substituting the values of equations (29), (34), (35), (36), (37) and (38) into equation (23), taking d as 0.85, we get WPR(D):

WPR(D) = (1-0.85) + 0.85(0.4797 × 1/2 × 1 + 0.3912 × 2/5 × 1/3)
= (0.15) + 0.85(0.4797 × 0.50 × 1 + 0.3912 × 0.40 × 0.33) ≈ 0.398   ... (39)

The values of WPR(A), WPR(B), WPR(C) and WPR(D) are shown in equations (24), (29), (34) and (39) respectively. The relation between them is WPR(A) > WPR(B) > WPR(D) > WPR(C). This result shows that the weighted PageRank order is different from that of PageRank. For the same example above, the iterative computation of the weighted PageRank algorithm is performed; some weighted PageRank values are shown in Table 2. The table values with the chart are shown in the simulation results section. To differentiate WPR from PageRank, the resultant pages of a query can be categorized into four categories based on their relevancy to the given query:

(i) Very Relevant Pages (VR): pages that contain very important information related to the given query.
(ii) Relevant Pages (R): pages that are relevant but do not contain important information about the given query.
(iii) Weakly Relevant Pages (WR): pages that do not have relevant information but may contain the query keywords.
(iv) Irrelevant Pages (IR): pages that have neither relevant information nor the query keywords.

Both the PageRank and WPR algorithms provide pages in sorted order according to their ranks for a given query, so the order of relevant pages and their numbering are very important for users in the result list.

Table 2: Iterative calculation values for weighted PageRank
Iteration |    A     |    B     |    C     |    D
    1     |    1     |    1     |    1     |    1
    2     | 1.1275   | 0.47972  | 0.3912   | 0.19935
    3     | 0.425162 | 0.27674  | 0.25727  | 0.18026
    4     | 0.355701 | 0.244128 | 0.24189  | 0.17541
    5     | 0.34580  | 0.247110 | 0.239808 | 0.17719
    6     | 0.34454  | 0.247110 | 0.23953  | 0.17714
    7     | 0.34438  | 0.23450  | 0.23450  | 0.17714
    8     | 0.34436  | 0.23950  | 0.23449  | 0.17714

Q8. Give an explanation of the HITS algorithm?
Answer: The HITS algorithm was proposed by Kleinberg in 1998.
HITS identifies two different forms of web pages called hubs and authorities. Authorities are pages having important content; hubs are pages that act as resource lists, guiding users to authorities. Thus a good hub page for a subject points to many authoritative pages on that content, and a good authority page is pointed to by many good hub pages on the same subject. A page may be a good hub and a good authority at the same time. This circular relationship leads to the definition of an iterative algorithm called HITS (Hyperlink-Induced Topic Search).

HITS ranks web pages by using the in-links and out-links of the pages: a page is named an authority if it is pointed to by many hyperlinks, and a hub if it points to various hyperlinks. An illustration of hubs and authorities is shown in figure (a). HITS is, technically, a link-based algorithm. In HITS, a set of pages is first selected by analyzing their textual contents against a given query; after the collection of the web pages, the algorithm concentrates on the structure of the web only, neglecting their textual contents. The original HITS algorithm has some problems, which are given below:

(i) A high rank value is given to some popular websites that are not highly relevant to the given query.

(ii) Topic drift occurs when a hub has multiple topics, as equivalent weights are given to all the out-links of a hub page.

(iii) Inefficiency: graph construction must be performed on-line, at query time.

(iv) Irrelevant links: advertisements and automatically generated links.

(v) Mutually reinforcing relationships between hosts: on one site, multiple documents point to a document D at another site, raising their hub scores and the authority score of D.

Because of these problems, the HITS algorithm is not preferred in the Google search engine; the PageRank algorithm is used in Google because of its efficiency.
Figure (a): Illustration of Hubs and Authorities

In the HITS algorithm, the hub and authority scores are calculated using the following algorithm.

HITS Algorithm
1. Initialize all weights to 1.
2. Repeat until the weights converge:
3.     For every hub p ∈ H: H_p = Σ_{q ∈ I(p)} A_q
4.     For every authority p ∈ A: A_p = Σ_{q ∈ B(p)} H_q
5.     Normalize.

The HITS algorithm treats the WWW as a directed graph G(V, E), where V is a set of vertices representing pages and E is a set of edges that correspond to links. There are two main steps in the HITS algorithm. The first step is the sampling step and the second step is the iterative step. In the sampling step, a set of relevant pages for the given query is collected, i.e., a sub-graph S of G which is rich in authority pages is retrieved. The algorithm starts with a root set R, from which a set S is obtained, keeping in mind that S is relatively small, rich in relevant pages about the query, and contains most of the good authorities. The second, iterative step finds hubs and authorities using the output of the sampling step and equations (40) and (41):

H_p = Σ_{q ∈ I(p)} A_q ..... (40)

A_p = Σ_{q ∈ B(p)} H_q ..... (41)

where H_p represents the hub weight, A_p represents the authority weight, and I(p) and B(p) denote the sets of reference (pointed-to) and referrer (pointing-in) pages of page p.

The weight of an authority page is proportional to the summation of the weights of the hub pages that link to it; likewise, the hub weight of a page is proportional to the summation of the weights of the authority pages that the hub links to. Figure (b) shows an example calculation of authority and hub scores.
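The iterative step can also be sketched in code. The minimal Python illustration below is not from the text; it reuses the link structure of figure (b) — pages Q1, Q2 and Q3 pointing to P, which points to R1 and R2 — with a standard normalization added so the weights converge.

```python
# Sketch of the HITS hub/authority iteration (eqs. 40 and 41).
# Graph taken from the figure (b) example: Q1, Q2, Q3 -> P -> R1, R2.
graph = {            # page -> pages it points to
    "Q1": ["P"], "Q2": ["P"], "Q3": ["P"],
    "P":  ["R1", "R2"],
    "R1": [], "R2": [],
}

def hits(graph, iterations=20):
    hub = {p: 1.0 for p in graph}
    auth = {p: 1.0 for p in graph}
    for _ in range(iterations):
        # authority weight: sum of hub weights of pages pointing to p (eq. 41)
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in graph}
        # hub weight: sum of authority weights of pages p points to (eq. 40)
        hub = {p: sum(auth[q] for q in graph[p]) for p in graph}
        # normalize so the weights converge instead of growing without bound
        na = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        nh = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        auth = {p: a / na for p, a in auth.items()}
        hub = {p: h / nh for p, h in hub.items()}
    return hub, auth

hub, auth = hits(graph)
```

Running this, P collects the hub weights of Q1–Q3 and comes out as the top authority, while Q1–Q3 and P act as hubs, exactly as equations (42) and (43) below work out by hand.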
Figure (b): Calculation of hubs and authorities (pages Q1, Q2 and Q3 point to page P, which in turn points to pages R1 and R2)

From equations (40) and (41), the hub and authority scores of P are calculated as:

A_P = H_Q1 + H_Q2 + H_Q3 ..... (42)

H_P = A_R1 + A_R2 ..... (43)

Q9. Explain about the EigenRumor algorithm.

Answer:

This algorithm enables a higher score to be assigned to a blog entry submitted by a good blogger, even if it is not yet linked by any other blogs, based on the acceptance of the blogger's prior work. In the current scenario, the number of blogging sites is increasing day by day, so there is a challenge for internet service providers to provide good blogs to users. PageRank and HITS are very promising in providing rank values to blogs, but some issues arise if these two algorithms are applied directly to blogs. These issues are:

1. The number of links to a blog entry is generally very small. As a result, the scores of blog entries calculated by PageRank, for example, are generally too small to permit blog entries to be ranked by importance.

2. Generally, some time is needed to accumulate a number of in-links and thus obtain a higher PageRank score. Since blogs are considered a communication tool for discussing new topics, it is desirable to assign a higher score to an entry submitted by a blogger who has received a lot of attention in the past, even if the entry itself has not yet received many in-links.

The rank scores of blog entries as decided by the PageRank algorithm are often very low, so PageRank cannot rank blog entries according to their importance. To resolve these issues, the EigenRumor algorithm is proposed for ranking blogs. The EigenRumor algorithm has similarities to PageRank and HITS in that all are based on an eigenvector calculation of the adjacency matrix of the links. However, in the EigenRumor model, the adjacency matrix is constructed from agent-to-object links, not page (object) to page (object) links.
One important thing to note is that "agent" is used to represent an aspect of a human being, such as a blogger, and "object" an item such as a blog entry. Using the EigenRumor algorithm, the hub and authority scores are calculated as attributes of agents (bloggers), and the score of a blog entry that does not yet have any in-links can be computed from the standing of the blogger who entered it.

Q10. Write the comparison of various page ranking algorithms.

Answer:

Based on the analysis, a comparison of various web page ranking algorithms is shown in the table below. The comparison is done on the basis of parameters such as the main technique used, methodology, input parameters, relevancy, quality of results, importance and limitations. Among all the algorithms, PageRank and HITS are the most important. PageRank is the only algorithm implemented in the Google search engine, and HITS is used in the IBM prototype search engine called Clever. A similar algorithm was used in the Teoma search engine and later by Ask.com.

Table: Comparison between ranking algorithms ("—" marks cells not legible in the source)

Criteria          | PageRank                  | Weighted PageRank           | HITS                        | DistanceRank               | EigenRumor
Mining technique  | WSM                       | WSM                         | WSM and WCM                 | WSM                        | WCM
Working process   | Computes values at index  | Computes values at index    | Computes hub and authority  | Calculates the minimum     | Uses the adjacency matrix,
                  | time; results are sorted  | time; results are sorted    | scores of highly relevant   | average distance between   | which is constructed from
                  | on the basis of page      | on the basis of page        | pages on the fly            | two pages                  | agent-to-object links, not
                  | importance                | importance                  |                             |                            | page to page
Input parameters  | In-bound links            | In-bound and out-bound      | In-bound links, out-bound   | In-bound links             | Agent/object
                  |                           | links                       | links and content           |                            |
Complexity        | O(log N)                  | —                           | —                           | —                          | —

Q12. When does relevance feedback work?
Answer:

The efficiency of an RF (relevance feedback) system depends on the following:

1. User Knowledge

The user of an RF system needs sufficient knowledge to formulate an initial query that is at least minimally related to the documents within the IR system and to what they desire to retrieve, for successful IR. Even when the user makes a query related to the documents, there are some problems RF alone is unable to solve:

Misspellings

If a user query contains spellings of terms different from those contained in the document collection, then RF does not work, i.e., it does not effectively retrieve the documents expected by the user.

Cross-Language IR

Documents in the same language cluster closely together, which means documents in another language are not nearby in a vector space based on term distribution.

Mismatch of Vocabulary

Mismatches between the vocabulary of the query terms and that of the documents lead to failure of a user search, and RF is ineffective. So the vocabulary of the query terms should match the document collection of the IR system; this improves the search results effectively.

2. Similarity of Relevant Documents

To cluster documents for easy retrieval, the relevant documents need to be similar to each other. Relevant documents are distinguished from non-relevant documents by the similarity of term distribution across all relevant documents marked by the users. But this cluster-centered approach will not work if the relevant documents form a multimodal class, i.e., they consist of several clusters of documents in the vector space.

Q13. Explain the evaluation of relevance feedback strategies.

Answer:

Iterative RF plays a central role in the evaluation of relevance feedback: it gives very substantial gains over the iterations, as the judgment of relevant documents sharpens the representation of the user's information need.
One evaluation of RF computes the precision-recall graph for the modified query q_m, obtained from the user's feedback on the initial query q_0, and compares it against that of q_0. This strategy is completely based on the user's relevance feedback on the document collection, but it is not a proper evaluation of RF.

Another, more realistic evaluation strategy is to use the documents in the residual collection. In this strategy the user judges the documents, some of which are relevant. We can compare the relative performance of different relevance feedback methods, but it is difficult to validly compare performance with and without RF, because the collection size and the number of relevant documents change from before the feedback to after it.

Another evaluation strategy for RF uses two collections: the first is used for the initial query q_0 and the relevance judgments, and the second is used for comparative evaluation of the performance of both q_0 and q_m. User-based evaluation of RF through user studies makes the assessment of the IR system fair and close to the user's need.

Q14. Write about pseudo relevance feedback.

Answer:

Pseudo relevance feedback (PRF), also known as blind RF, provides a method for automatic local analysis. PRF methods assume the top k retrieved documents to be relevant, and the query is refined using these feedback documents. The premise behind PRF is that documents similar to the user's initial query will yield more relevant terms which, when added to the query, lead to an improvement in performance.

Query Drift

Query drift is caused by adding terms which have no association with the topic of the query. This happens when there are only a few or no relevant documents among the top k documents. Because of this sensitivity to the quality of the top k documents, PRF only improves the performance of queries which have good or reasonable initial retrieval performance.
It cannot be used to improve the performance of bad or failed queries which do not retrieve anything relevant; hence, it is not robust to the quality of the initial retrieval. Several approaches have been proposed to improve the robustness of PRF. The main strategies can be categorized as follows:

1. Refine the feedback document set so that, instead of using all the top k documents, a subset likely to be "highly relevant" is chosen.

2. Instead of using all the terms obtained through feedback for query refinement, use only a subset of important terms to avoid introducing query drift.

3. Dynamically decide when to apply PRF instead of using it for all queries.

4. Vary the importance of each feedback document.

5. Use a large external collection, such as Wikipedia or the web, as a source of expansion terms besides those obtained through PRF.

The intuition behind this approach is that if the query does not have many relevant documents in the collection, then any improvement in the modeling of PRF is bound to perform poorly due to query drift.

4.4 Selective Dissemination of Information Search

Q15. Explain in detail about selective dissemination of information search.

Answer:

Selective dissemination of information systems, frequently called dissemination systems, are becoming more prevalent with the growth of the internet. A dissemination system is sometimes labeled a "push" system, while a search system is called a "pull" system. In a dissemination system, the user defines a profile, and as new information is added to the system it is automatically compared to the user's profile.
If it is considered a match, it is asynchronously sent to the user's "mail" file. The difference between the two functions lies in the dynamic nature of the profiling process, the size and diversity of the search statements, and the number of simultaneous searches per item.

In a search system, a database already exists. As such, corpora statistics exist on term frequency within and between items. These can be used for weighting factors in the indexing process and the similarity comparison (e.g., inverse document frequency algorithms). Profiles are relatively static search statements that cover a diversity of topics. Rather than specifying a particular information need, they usually generalize all of the potential information needs of a user.

One of the first commercial search techniques for dissemination was the Logicon Message Dissemination System (LMDS). The system originated from a system created by Chase, Rosen and Wallace (CRW Inc.) and was designed to support the search of thousands of profiles with items arriving every 20 seconds. It demonstrated one approach to the problem, in which the profiles were treated as the static database and the new item acted like the query. It uses the terms in the item to search the profile structure to identify those profiles whose logic could be satisfied by the item. The system uses a least-frequently-occurring trigraph (three-character) algorithm that quickly identifies which profiles are not satisfied by the item. The potential profiles are then analyzed in detail to confirm whether the item is a hit.

Another example of a dissemination approach is the Personal Library Software (PLS) system. It received items into the database and periodically ran the users' profiles against the database. This makes maximum use of the retrospective search software but loses near-real-time delivery of items.
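As a toy illustration of the "profiles as the static database, new item as the query" idea above (this is not the actual LMDS or PLS implementation; the user names, profile terms and matching logic are all invented for the sketch):

```python
# Hedged sketch of profile-based dissemination: each incoming item is
# matched against every stored profile, and matches are delivered to the
# corresponding user's "mail" file. Profile logic here is a simple
# keyword-overlap test, not LMDS's trigraph screening.
profiles = {
    "user1": {"retrieval", "ranking"},
    "user2": {"visualization"},
}

def disseminate(item_text, profiles):
    item_terms = set(item_text.lower().split())
    hits = []
    for user, terms in profiles.items():
        # quick screen: keep only profiles sharing at least one term
        if terms & item_terms:
            hits.append(user)   # deliver item to this user's "mail" file
    return hits

print(disseminate("New ranking methods for information retrieval", profiles))
```

Note the inversion relative to a search system: the profiles are the static side and every new item plays the role of the query, which is exactly the LMDS design point described above.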
Another approach to dissemination uses a statistical classification technique and explicit error minimization to determine the decision criteria for selecting items for a particular profile.

Schütze et al. used two approaches to reduce the dimensionality: selecting a set of existing features to use, or creating a new, much smaller set of features that the original features are mapped into. A χ² measure was used to determine the most important features. The test was applied to a table that contained the number of relevant (N_r+) and non-relevant (N_n+) items in which a term occurs, plus the number of relevant (N_r−) and non-relevant (N_n−) items in which the term does not occur. The formula used was the standard χ² statistic for such a 2 × 2 table:

χ² = N(N_r+ N_n− − N_n+ N_r−)² / ((N_r+ + N_n+)(N_r− + N_n−)(N_r+ + N_r−)(N_n+ + N_n−))

Weighted Searches of Boolean Systems

Q16. Explain about the weighted search of Boolean systems.

Answer:

The two major approaches to generating queries are Boolean and natural language. Natural language queries are easily represented within statistical models and are usable by the similarity measures discussed. Issues arise when Boolean queries are associated with weighted index systems. Some of the issues are associated with how the logic operators (AND, OR, NOT) function with weighted values and how weights are associated with the query terms. If the operators are given their normal interpretation, they act either too restrictively or too generally.

To integrate the Boolean and weighted system models, Fox and Sharat proposed a fuzzy set approach. Fuzzy sets introduce the concept of degree of membership; the AND and OR operations are defined as:

DEG_(A AND B) = min(DEG_A, DEG_B)

DEG_(A OR B) = max(DEG_A, DEG_B)

where A and B are terms in an item and DEG is the degree of membership. The mixed min and max (MMM) model considers the similarity between query and document to be a linear combination of the minimum and maximum item weights.
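The fuzzy operations and the MMM combination just described can be sketched directly. The coefficient values below are illustrative only, chosen inside the ranges the text recommends for the MMM coefficients:

```python
# Sketch of the fuzzy-set AND/OR membership operations and the MMM
# (mixed min and max) similarity. Coefficients are illustrative defaults.

def deg_and(deg_a, deg_b):
    # degree of membership for "A AND B"
    return min(deg_a, deg_b)

def deg_or(deg_a, deg_b):
    # degree of membership for "A OR B"
    return max(deg_a, deg_b)

def sim_mmm_or(doc_weights, c_or1=0.6, c_or2=0.4):
    # OR query: linear combination dominated by the maximum item weight
    return c_or1 * max(doc_weights) + c_or2 * min(doc_weights)

def sim_mmm_and(doc_weights, c_and1=0.7, c_and2=0.3):
    # AND query: linear combination dominated by the minimum item weight
    return c_and1 * min(doc_weights) + c_and2 * max(doc_weights)

weights = [0.2, 0.9]        # membership weights of the query terms in one item
print(deg_and(0.2, 0.9))    # strict fuzzy AND -> 0.2
print(sim_mmm_and(weights)) # softened AND: 0.7*0.2 + 0.3*0.9
```

The point of MMM is visible in the last two lines: a strict fuzzy AND is pinned to the weakest term, while the MMM combination lets a strong second term partially compensate.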
Fox proposed the following similarity formulas:

SIM(QUERY_or, DOC) = C_or1 × max(DOC_1, DOC_2, ..., DOC_n) + C_or2 × min(DOC_1, DOC_2, ..., DOC_n)

SIM(QUERY_and, DOC) = C_and1 × min(DOC_1, DOC_2, ..., DOC_n) + C_and2 × max(DOC_1, DOC_2, ..., DOC_n)

where C_or1 and C_or2 are the weighting coefficients for the OR operation and C_and1 and C_and2 are the weighting coefficients for the AND operation. The best results occur when C_and1 is between 0.5 and 0.8 and C_or1 is greater than 0.2.

The MMM technique was expanded by Paice by considering all item weights versus the maximum/minimum approach. The similarity measure is calculated as:

SIM(QUERY, DOC) = Σ_{i=1}^{n} r^(i−1) d_i / Σ_{i=1}^{n} r^(i−1)

where the d_i are inspected in ascending order for AND queries and in descending order for OR queries. The r^(i−1) terms are weighting coefficients.

Another alternative approach is the P-norm model, which allows terms within the query to have weights in addition to the terms in the items. Similar to the cosine similarity technique, it considers the membership values (d_1, ..., d_n) to be coordinates in an "n"-dimensional space. For an OR query, the origin is the worst possibility; for an AND query, the ideal point is the unit vector where all the d_i values equal 1. The generalized queries are:

Q_or = (A_1, a_1) OR (A_2, a_2) OR ... OR (A_n, a_n)

Q_and = (A_1, a_1) AND (A_2, a_2) AND ... AND (A_n, a_n)

If we assign the strictness value to a parameter labeled "S", then the similarity formulas between queries and items are:

SIM(Q_or, DOC) = [(a_1^S d_1^S + ... + a_n^S d_n^S) / (a_1^S + ... + a_n^S)]^(1/S)

SIM(Q_and, DOC) = 1 − [(a_1^S (1 − d_1)^S + ... + a_n^S (1 − d_n)^S) / (a_1^S + ... + a_n^S)]^(1/S)

SIM(Q_not, DOC) = 1 − SIM(Q, DOC)

Another approach merges Boolean queries and weighted query terms under the assumption that no weights are available in the indexes. The normal Boolean operations produce the following results:

"A OR B" retrieves those items that contain the term A or the term B or both.
"A AND B" retrieves those items that contain both terms A and B.

"A NOT B" retrieves those items that contain term A and do not contain term B.

If weights between the values 0.0 and 1.0 are then assigned to the terms, they may be interpreted as the significance that users are placing on each term. The value 1.0 is assumed to be the strict interpretation of a Boolean query. The value 0.0 is interpreted to mean that the user places little value on the term. Under these assumptions, a term assigned a value of 0.0 should have no effect on the retrieved set. Thus:

"A OR B_0.0" should return the set of items that contain A as a term.

"A AND B_0.0" will also return the set of items that contain term A.

"A NOT B_0.0" also returns the set of items that contain term A.

This suggests that as the weight for term B goes from 0.0 to 1.0, the resultant set changes from the set of all items that contain term A to the set normally generated by the Boolean operation. The process can be visualized by use of the Venn diagram shown in the figure below, which includes all items that are in all the areas in the Venn diagram.

Figure: Venn Diagram

4.6 Searching the Internet and Hypertext

Q17. Write the six key characteristics of intelligent agents.

Answer:

There are six key characteristics of intelligent agents:

1. Autonomy: The search agent must be able to operate without interaction with a human agent. It must have control over its own internal states and make independent decisions. This implies a search capability to traverse information sites based upon pre-established criteria, collecting potentially relevant information.

2. Communications Ability: The agent must be able to communicate with the information sites as it traverses them.
This implies a universally accepted language defining the external interfaces.

3. Capacity for Cooperation: This concept suggests that intelligent agents need to cooperate to perform mutually beneficial tasks.

4. Capacity for Reasoning: There are three types of reasoning scenarios:

Rule-based: where the user has defined a set of conditions and actions to be taken.

Knowledge-based: where the intelligent agents have stored previous conditions and actions taken, which are used to deduce future actions.

Artificial evolution based: where intelligent agents spawn new agents with higher logic capability to perform their objectives.

5. Adaptive Behavior: Closely tied to 1 and 4, adaptive behavior permits the intelligent agent to assess its current state and make decisions on the actions it should take.

6. Trustworthiness: The user must trust that the intelligent agent will act on the user's behalf to locate information that the user has access to and that is relevant to the user.

4.7 Introduction to Information Visualization

Q18. Explain about information visualization.

Answer:

The beginnings of the theory of visualization go back over 2,400 years. The philosopher Plato discerned that we perceive objects through the senses, using the mind. Our perception of the real world is a translation from physical energy from our environment into encoded neural signals. The mind is continually interpreting and categorizing our perception of our surroundings. Use of a computer is another source of input to the mind's processing functions. Text-only interfaces reduce the complexity of the interface but also restrict use of the more powerful information-processing functions the mind has developed since birth.

Information visualization is a relatively new discipline growing out of the debates in the 1970s on the way the brain processes and uses mental images.
It required significant advancements in technology and information retrieval techniques to become a possibility. One of the earliest researchers in information visualization was Doyle, who in 1962 discussed the concept of "semantic road maps" (Doyle-62). The road maps show the items related to a specific semantic theme and could provide a user a view of the whole database; the user could use this view to focus his query on a specific portion of the database.

Cognition and Perception

R18, August-2022, Q7(b)
R18, February-2022, Q7(b)

Q19. Explain about cognition and perception.

Answer:

One of the first-level cognitive processes is preattention, that is, taking the significant visual information from the photoreceptors and forming primitives. An example of using the conscious processing capabilities of the brain is the detection of the differently shaped objects and the border between them, shown between the left side and middle of the figure. The reader can likely detect the difference in the time it takes to visualize the two different orientations. This suggests that if information semantics are placed in different orientations, the mind's clustering/aggregation function enables detection of groupings more easily than using different objects.

Another visual factor is the optical illusion that makes a light object on a dark background appear larger than if the item is dark and the background is light. Making use of this factor suggests that a visual display of small objects should use bright colors. Colors have many attributes that can be modified, such as hue, saturation and lightness. Depth, like color, is frequently used for representing visual information. Classified as monocular cues, changes in shading, blurring (proportional to distance), perspective, motion, stereoscopic vision, occlusion and texture depict depth and size. Another higher-level processing technique is the use of configural aspects of a display.
A configural effect occurs when arrangements of objects are presented to the user, allowing for easy recognition of a high-level abstract condition.

Another visual cue that can be used is spatial frequency. Spatial frequency is an acuity measure relative to regular light-dark changes in the visual field or similar channels.

In using cognitive engineering to design information visualization techniques, a hidden risk is that "understanding is in the eye of the beholder."

Q20. Describe the background.

Answer:

A significant portion of the brain is devoted to vision and supports the maximum information-transfer function from the environment to a human being. Visualization is the transformation of information into a visual form which enables the user to observe and understand the information. The Gestalt psychologists postulate that the mind follows a set of rules to combine the input stimuli into a mental representation that differs from the sum of the individual inputs:

Proximity: nearby figures are grouped together.

Similarity: similar figures are grouped together.

Continuity: figures are interpreted as smooth continuous patterns rather than discontinuous concatenations of shapes (e.g., a circle with its diameter drawn is perceived as two continuous shapes, a circle and a line, versus two half circles concatenated together).

Closure: gaps within a figure are filled in to create a whole (e.g., using dashed lines to represent a square does not prevent understanding it as a square).

Connectedness: uniform and linked spots, lines or areas are perceived as a single unit.

Shifting the information-processing load from slower cognitive processes to faster perceptual systems significantly improves the information-carrying interfaces between humans and computers.
Information Visualization Technologies

Q21. Explain about information visualization technologies, with neat diagrams.

Answer:

The efforts focused on information retrieval systems are investigating how best to display the results of searches, structured data from DBMSs, and the results of link analysis correlating data.

The goals for displaying the results from searches fall into two major classes: document clustering and search statement analysis. Visualization tools in this area attempt to display the clusters, with an indication of their size and topic, as a basis for users to navigate to items of interest. Others display the total set of terms, including additional terms from relevance feedback or thesaurus expansion, along with the documents retrieved, and indicate the importance of each term to the retrieval and ranking process.

One way of organizing information is hierarchically. A tree structure is useful in representing information that ranges over time (e.g., genealogical information), constituents of a larger unit (e.g., organization structures, mechanical device definitions), and aggregates from a higher to a lower level (e.g., hierarchical clustering of documents). A two-dimensional representation becomes difficult for a user to understand as the hierarchy becomes large.

One of the earliest experiments in information visualization was the Information Visualizer developed at XEROX PARC. It incorporates various visualization formats such as the DataMap, InfoGrid, ConeTree, and the Perspective Wall.

The Cone-Tree is a three-dimensional representation of data, where one node of the tree is represented at the apex and all the information subordinate to it is arranged in a circular structure at its base. Any child node may also be the parent of another cone. Selecting a particular node rotates it to the front of the display.
Selecting a particular node, rotates it to the front of the display, m Information Retrieval Systems 425 a Report = Ej Wiavet Tole [iiwnace se] oases [Letter | fid Manufacture |__ rl [Article nea || ff What Frome [Announcement q Hid Water Coote Figure: Perspective Wall @ Treemaps. Thistechnique makes maximum use of the display screen space by using rectangular boxes that are recursively o subdivided based upon parent-child relationships between the data. TheCPU, OS, Memory, and Network management, articles are all related to a general category of computer operating systems versus computer applications which are shown in the rest of the figure. CPU articles os, Network Management articles Memory articles Word Processing software articles Spreadsheet software articles ‘PowerPoint software articles Drawing software articles Figure Tree Map ¢ TheEnvision system not only displays the relevance rankand estimated relevance of each item found by a query but also simultaneously presents other query information. ¢ © The design is intentionally graphical ‘and simple using two dimensional visualization. * — Thisallowsa larger variety of user computer platforms to have access to their system. + Envision’s three interactive windows to display search results: Query window, Graphic View window, and Item Summary window. cai isa CRIMINAL Act Anyone found gully is HABLE nfs LEGAL procedings | Warming xo lex of this book is 2 Information Retrieval Systems @ Figure: Example of DCARS Document Content Analysis and Retrieval System (DCARS) being developed by Calspan Advanced Technolo Center. ‘@ Their system is designed to augment the RetrievalWare search product. | . ‘They display the query results as a histogram with the items as rows and each term's contribution to te| selection indicated by the width of a tile bar on the row. 
@ —- DCARS provides a friendly user interface that indicates why a particular item was found, but it is muct harder to use the information in determining how to modify search statements to improve them. Figure; Example of DCARS Query Histogram % Another representation that is widely used for both hierarchical and network related information & “cityscape” which uses the metaphor of movement within a city, ¢ —_Inliew of using hills, asin the terrain approach, skyscrapers represent the theme (concept) area as. ‘Warning :XeroyPhotocopying ofthis book is a CRIMINAL Act. Anyone found guilty s LIABLE to face LEGAL proces SG... i RQ —_ | = | EEA WK . = Information Retrieval Systems ™@ Internal Assessment Murtiete CHoIce QuESTIONS 1, The Search statement uses traditional Boolean logic and /or. language. a) Natural ¥) Logical ©) Statistical @) Index 2. Avariety of different similarity measures can be used to calculate the similarity between the item and the ___ statement. a) Logical b) And/or ©) Search ) Natural 3. Todetermine the similarity between documents for clustering purpose is. a) SIM (lteri , Item) DiTerm,,,)(Term,,) b) SIM (Item ,, Item) = })(Term,,)(Term,,) ©) SIM (Item) = (Term) ¢) SIM (Item , item) = S(Term,,) 4. SIM (DOC, , QUERY, = 2) SIC+IDF) fi b) D(C+IDF)*F,) ) DIC+IDF) @ S(C+DF)R,) 5. The length of search statements directly affect the ability of ______to find relevant items. a) Information ) Document ©) System d) Information retrieval systems 6. The binding is to the and past experiences of the user. 2) Vocabulary b) Specific ©) Search d) User 7. The final level of binding comes as the search is applied toa specific a) Search b) Document ) Database 4) Information 8. Useofasimilarity, returns the complete database as search results. a) Algorithm b) Results ©) Thresholds d) Measure 10. 10. Thresholds ate usually associated with 4, —_ process. a) System b) Index ©) Search 4) Database The ___defines the items in the resulta hitfile from the query. 
a) HMM  b) Measures  c) Search  d) Threshold

Fill in the Blanks

1. SIM(Item_i, Item_j) = ______

2. Use of HMM for searching textual corpora has introduced a new paradigm for ______.

3. P(D is R | Q) = ______

4. The development of an HMM approach begins with applying ______ rule.

5. HMM stands for ______.

6. PARC stands for ______.

7. The unnormalized similarity formula is ______.

8. The one-dimensional space defined by ______ should cause the group means to be well separated.

9. g(x) = ______

10. The two major approaches to generating queries are ______ and natural language.

Multiple Choice Answers

1. (a)  2. (c)  3. (b)  4. (a)  5. (a)  6. (a)  7. (c)  8. (a)  9. (c)  10. (c)

Fill in the Blanks Answers

1. Σ(Term_i,k)(Term_j,k)
2. Search
3. P(Q | D is R) × P(D is R) / P(Q)
4. Bayes'
5. Hidden Markov Models
6. Palo Alto Research Center
7. SIM(Q, DOC_i) = Σ TERM_q,j × TERM_i,j
8. Y = A × x
9. log(n_i / (1 − n_i))
10. Boolean
