
UNIT-2

Hypertext Transfer Protocol (HTTP)


2.1 Hypertext Transfer Protocol

Web browsers interact with web servers using a simple application-level protocol called HTTP (Hypertext Transfer Protocol), which runs on top of TCP/IP network connections. HTTP is a client-server protocol that defines how messages are formatted and transmitted, and what actions web servers and browsers should take in response to various commands. For example, when the user enters a URL in the browser, the browser sends an HTTP command to the web server directing it to fetch and transmit the requested web page. Some of the fundamental characteristics of the HTTP protocol are:

Fig: A typical Web paradigm using request/response HTTP.

• The HTTP protocol uses the request/response paradigm, in which an HTTP client program sends an HTTP request message to an HTTP server, which returns an HTTP response message.
• HTTP is a pull protocol; the client pulls information from the server (instead of the server pushing information down to the client).
• HTTP is a stateless protocol, that is, each request-response exchange is treated independently. Clients and servers are not required to retain any state. An HTTP transaction consists of a single request from a client to a server, followed by a single response from the server back to the client. The server does not maintain any information about the transaction. (Some transactions, however, require state to be maintained.)
• HTTP is media independent: any type of data can be sent by HTTP as long as both the client and the server know how to handle the data content. Both the client and the server are required to specify the content type using an appropriate MIME type.
2.2 Hypertext Transfer Protocol Version

HTTP uses a <major>.<minor> numbering scheme to indicate versions of the protocol. The version of an HTTP message is indicated by an HTTP-Version field in the first line. Here is the general syntax for specifying the HTTP version number:

HTTP-Version = "HTTP" "/" 1*DIGIT "." 1*DIGIT


The initial version of HTTP was referred to as HTTP/0.9, which was a simple protocol for raw
data transfer across the Internet. HTTP/1.0, as defined by RFC (Request for Comments) 1945,
improved the protocol. In 1997, HTTP/1.1 was formally defined, and is currently an Internet
Draft Standard [RFC-2616]. Essentially all operational browsers and servers support HTTP/1.1.

2.3 Hypertext Transfer Protocol Connections

How a client will communicate with the server depends on the type of connection
established between the two machines. Thus, an HTTP connection can either be persistent or non-persistent. Non-persistent HTTP was used by HTTP/1.0. HTTP/1.1 uses a persistent connection, also known as a keep-alive connection, in which multiple messages or objects are sent over a single TCP connection between client and server.
2.3.1 Non-Persistent Hypertext Transfer Protocol
HTTP/1.0 used a non-persistent connection in which only one object can be sent over a TCP connection. Transmitting a file from one machine to another required two Round Trip Times (RTT), where an RTT is the time taken for a small packet to travel from client to server and back:
• One RTT to initiate the TCP connection
• A second RTT for the HTTP request and the first few bytes of the HTTP response to return
• The rest of the time is taken in transmitting the file

Fig: RTT in a non-persistent HTTP.

While using non-persistent HTTP, the operating system incurs extra overhead for maintaining each TCP connection; as a result, many browsers open parallel TCP connections to fetch referenced objects. The steps involved in setting up a connection with non-persistent HTTP are:

1. Client (Browser) initiates a TCP connection to www.anyCollege.edu (Server): Handshake.
2. Server at host www.anyCollege.edu accepts the connection and acknowledges.
3. Client sends an HTTP request for file /someDir/file.html.
4. Server receives the request, finds and sends the file in an HTTP response.
5. Client receives the response, terminates the connection, and parses the object.
6. Steps 1–5 are repeated for each embedded object.

2.3.2 Persistent Hypertext Transfer Protocol

To overcome the issues of HTTP/1.0, HTTP/1.1 came with persistent connections, through which multiple objects can be sent over a single TCP connection between the client and server. The server leaves the connection open after sending the response, so subsequent HTTP messages between the same client/server are sent over the open connection. Persistent connections also overcome the problem of slow start: in non-persistent HTTP each object transfer suffers from slow start, and the overall number of RTTs required for persistent connections is much less than for non-persistent ones (fig).
The steps involved in setting up the connection with persistent HTTP are:

1. Client (Browser) initiates a TCP connection to www.sfgc.ac.in (Server): Handshake.
2. Server at host www.sfgc.ac.in accepts the connection and acknowledges.
3. Client sends an HTTP request for file /someDir/file.html.
4. Server receives the request, finds and sends the object in an HTTP response.
5. Client receives the response and parses the object; the connection remains open.
6. Steps 3–5 are repeated for each embedded object.

Fig: RTT in a persistent HTTP.


Thus, the overhead of HTTP/1.0 is 1 RTT for each connection setup plus 1 RTT for each request/response; that is, if there are 10 objects, then the Total Transmission Time (TTT) is as follows:

TTT = [10 * 1 TCP RTT] + [10 * 1 REQ/RESP RTT] = 20 RTT

Whereas for HTTP/1.1, persistent connections are very helpful with multi-object requests, as the server keeps the TCP connection open by default:

TTT = [1 * 1 TCP RTT] + [10 * 1 REQ/RESP RTT] = 11 RTT
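As a quick check of this arithmetic, the following Python sketch (illustrative only; it counts RTTs and ignores slow start and the file transmission time) reproduces the two totals:

def total_rtt(num_objects, persistent):
    # Each request/response costs 1 RTT; each TCP connection setup costs 1 RTT.
    tcp_setups = 1 if persistent else num_objects
    return tcp_setups + num_objects

print(total_rtt(10, persistent=False))   # 20 RTT (HTTP/1.0, non-persistent)
print(total_rtt(10, persistent=True))    # 11 RTT (HTTP/1.1, persistent)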

2.4 Hypertext Transfer Protocol Communication

In its simplest form, the communication model of HTTP involves an HTTP client, usually a
web browser on a client machine, and an HTTP server, more commonly known as a web
server. The basic HTTP communication model has four steps:

• Handshaking: Opening a TCP connection to the web server.


• Client request: After a TCP connection is created, the HTTP client sends a request
message formatted according to the rules of the HTTP standard—an HTTP Request. This
message specifies the resource that the client wishes to retrieve or includes information to
be provided to the server.
• Server response: The server reads and interprets the request. It takes action relevant to
the request and creates an HTTP response message, which it sends back to the client.
The response message indicates whether the request was successful, and it may also
contain the content of the resource that the client requested, if appropriate.
• Closing: Closing the connection (optional).
Handshaking:
For opening a TCP connection, the user on the client side inputs the URL containing the name of the web server in the web browser. Then, the web browser asks the DNS (Domain Name System) server for the IP address of the given URL. If the DNS fails to find the IP address of the URL, it shows an error (for example, “Netscape (Browser) is unable to locate the server”) on the client’s screen. If the DNS finds the IP address of the URL, then the client browser opens a TCP connection to port 80 (the default port of HTTP, although one can specify another port number explicitly in the URL) of the machine whose IP address has been found.
Request Message
After handshaking in the first step, the client (browser) requests an object (file) from the server. This is done with a human-readable message. Every HTTP request message has the same basic structure:

Start Line:
Request method: It indicates the type of request a client wants to send; these are also called methods.
Method = GET | HEAD | POST | PUT | DELETE | TRACE | OPTIONS | CONNECT | COPY | MOVE
GET: Requests the server to return the resource specified by the Request-URI as the body of a response.
HEAD: Requests the server to return the same HTTP header fields that would be returned if a GET method were used, but not the message body that would be returned to a GET.
POST: The most common use of the POST method is to submit an HTML form to the server. Since the information is included in the body, large chunks of data such as an entire file can be sent to the server.
PUT: It is used to upload a new resource or replace an existing document. The actual document is specified in the body part.

DELETE: Requests the server to respond to future HTTP request messages that contain the specified Request-URI with a response indicating that there is no resource associated with this Request-URI.

TRACE: Requests the server to return a copy of the complete HTTP request message, including start line, header fields, and body, as received by the server.
MOVE: It is similar to the COPY method except that it deletes the source file.
CONNECT: It is used to convert a request connection into a transparent TCP/IP tunnel.
COPY: The HTTP protocol may be used to copy a file from one location to another.
Headers:
The HTTP protocol specification makes a clear distinction between general headers, request headers, response headers, and entity headers. General headers appear in both request and response messages but have no relation to the data eventually transmitted in the body. The headers are separated from the request or response body by an empty line. The format of the headers in a request message is shown in the following table:

General Header
Request Header
Entity Header

A header consists of a single line or multiple lines. Each line is a single header of the following
form:

Header-name: Header-value
General Headers
General headers do not describe the body of the message. They provide information about the message itself rather than about the content it carries.
• Connection: Close
This header indicates whether the client or server, which generated the message,
intends to keep the connection open.
• Warning: Danger. This site may be hacked!
This header stores text for human consumption, something that would be useful
when tracing a problem.
• Cache-Control: no-cache
This header shows whether the caching should be used.

Request Header:

It allows the client to pass additional information about itself and about the request, such as the data format that the client expects.

• User-Agent: Mozilla/4.75
Identifies the software (e.g., a web browser) responsible for making the request.
• Host: www.netsurf.com
This header was introduced to support virtual hosting, a feature that allows a web
server to service more than one domain.
• Referer: http://wwwdtu.ac.in/∼akshi/index.html
This header provides the server with context information about the request. If the
request came about because a user clicked on a link found on a web page, this header
contains the URL of that referring page.
• Accept: text/plain
This header specifies the format of the media that the client can accept.
Entity Header:

• Content-Type: mime-type/mime-subtype
This header specifies the MIME-type of the content of the message body.
• Content-Length: 546
This optional header provides the length of the message body. Although it is optional, it is useful for clients such as web browsers that wish to impart information about the progress of a request.
• Last-Modified: Sun, 1 Sept 2016 13:28:31 GMT
This header provides the last modification date of the content that is transmitted in the body of the message. It is critical for the proper functioning of caching mechanisms.
• Allow: GET, HEAD, POST
This header specifies the list of the valid methods that can be applied on a URL.
Message Body:
The message body part is optional for an HTTP message but, if it is present, it is used to carry the entity body associated with the request. If an entity body is present, then usually the Content-Type and Content-Length header lines specify the nature of the associated body.
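To tie the pieces of the request message together (start line, headers, empty line, optional body), here is a minimal Python sketch that sends such a request over a raw TCP socket; the host and path are placeholders, and a real browser would add more headers:

import socket

# Assemble a minimal HTTP/1.1 GET request: start line, headers, blank line, no body.
request = (
    "GET /someDir/file.html HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "User-Agent: Mozilla/4.75\r\n"
    "Accept: text/plain\r\n"
    "Connection: close\r\n"
    "\r\n"
)

with socket.create_connection(("www.example.com", 80)) as sock:
    sock.sendall(request.encode("ascii"))     # client request
    reply = sock.recv(4096)                   # first bytes of the server response
    print(reply.decode("ascii", errors="replace"))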
Response Message: Similar to an HTTP request message, an HTTP response message consists of a status line, header fields, and the body of the response, in the following format:
Fig: A sample HTTP request message.

Fig: HTTP Response Message


Status Line:
Status line consists of three parts: HTTP version, Status code, and Status phrase. Two
consecutive parts are separated by a space.

HTTP version Status Code Status phrase

• HTTP version: This field specifies the version of the HTTP protocol being used by the
server. The current version is HTTP/1.1.
• Status code: It is a three-digit code that indicates the status of the response. The status
codes are classified with respect to their functionality into five groups as follows:
• 1xx series (Informational)—This class of status codes represents provisional
responses.
• 2xx series (Success)—This class of status codes indicates that the client’s request was received, understood, and accepted successfully.
• 3xx series (Re-directional)—These status codes indicate that additional actions must be taken by the client to complete the request.
• 4xx series (Client error)—These status codes are used to indicate that the client request had an error and therefore cannot be fulfilled.
• 5xx series (Server error)—This set of status codes indicates that the server encountered some problem and hence the request cannot be satisfied at this time. The reason for the failure is embedded in the message body. It is also indicated whether the failure is temporary or permanent. The user agent should accordingly display a message on the screen to make the user aware of the server failure.
Status phrase: It is also known as the Reason-phrase and is intended to give a short textual description of the status code.
Example:
404 Not Found: The requested resource could not be found but may be available in the future. Subsequent requests by the client are permissible.
408 Request Timeout: The server timed out waiting for the request.


Headers:
Headers in an HTTP response message are similar to those in a request message except for one aspect: in place of a request header, it contains a response header.

General Header
Response Header
Entity Header
• Response Header
Response headers help the server to pass additional information about the response that cannot be inferred from the status code alone, such as information about the server and the data being sent.
• Location: http://www.mywebsite.com/relocatedPage.html
This header specifies a URL towards which the client should redirect its original
request.
It always accompanies the “301” and “302” status codes that direct clients to try a
new location.
• WWW-Authenticate: Basic
This header accompanies the “401” status code that indicates an authorization challenge. It specifies the authentication scheme which should be used to access the requested entity.
• Server: Apache/1.2.5
This header is not tied to a particular status code. It is an optional header that identifies the server software.
• Age:22
This header specifies the age of the resource in the proxy cache in seconds.
Message Body
Similar to HTTP request messages, the message body in an HTTP response message is also optional. The message body carries the actual HTTP response data from the server (including files, images, and so on) to the client.

Fig: A sample HTTP response message.
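As an illustrative sketch (not part of the text above), Python's standard http.client module exposes each part of the response message described here; the host and path are placeholders:

import http.client

conn = http.client.HTTPConnection("www.example.com", 80)
conn.request("GET", "/index.html")
resp = conn.getresponse()

print(resp.version)               # 11 means the server answered with HTTP/1.1
print(resp.status, resp.reason)   # status code and reason phrase, e.g., 200 OK
print(resp.getheaders())          # response headers (general, response, and entity headers)
body = resp.read()                # the optional message body
conn.close()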

2.5 Hypertext Transfer Protocol Secure


HTTPS is a protocol for secure communication over the Internet. It was developed by Netscape. It is not a new protocol in itself, but rather the combination of HTTP with the SSL/TLS (Secure Sockets Layer/Transport Layer Security) protocol. It is also called secure HTTP, as it sends and receives everything in encrypted form, adding an element of safety. HTTPS is often used to protect highly confidential online transactions like online banking and online shopping order forms. The use of HTTPS protects against eavesdropping and man-in-the-middle attacks. While using HTTPS, servers and clients still speak exactly the same HTTP to each other, but over a secure SSL/TLS connection that encrypts and decrypts their requests and responses. The SSL layer has two main purposes:

• Verifying that you are talking directly to the server that you think you are talking to.
• Ensuring that only the server can read what you send it, and only you can read what it
sends back.
2.6 Hypertext Transfer Protocol State Retention: Cookies
HTTP is a stateless protocol. Cookies are an application-based solution to provide state
retention over a stateless protocol. They are small pieces of information that are sent in
response from the web server to the client. Cookies are the simplest technique used for storing
client state. A cookie is also known as HTTP cookie, web cookie, or browser cookie. Cookies
are not software; they cannot be programmed, cannot carry viruses, and cannot install
malware on the host computer. However, they can be used by spyware to track a user’s
browsing activities. Cookies are stored on a client’s computer. They have a lifespan and are
destroyed by the client browser at the end of that lifespan.

Fig:HTTP Cookie
Creating Cookies
When receiving an HTTP request, a server can send a Set-Cookie header with the response.
The cookie is usually stored by the browser and, afterwards, the cookie value is sent along with
every request made to the same server as the content of a Cookie HTTP header.
A simple cookie can be set like this:

Set-Cookie: <cookie-name>=<cookie-value>
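As a small illustrative sketch, Python's standard http.cookies module can build and parse these headers; the cookie name, value, and lifetime below are made-up examples:

import http.cookies

# Server side: build a Set-Cookie header for the response.
cookie = http.cookies.SimpleCookie()
cookie["session_id"] = "abc123"
cookie["session_id"]["max-age"] = 3600    # keep the cookie for one hour
print(cookie.output())                    # Set-Cookie: session_id=abc123; Max-Age=3600

# Client side: the browser stores the cookie and sends it back in a Cookie header.
incoming = http.cookies.SimpleCookie()
incoming.load("session_id=abc123")
print(incoming["session_id"].value)       # abc123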
There are various kinds of cookies that are used for different scenarios depending on the need. These different types of cookies are given with a brief description below.
Types of Cookies

• Session cookie: A session cookie lasts only for the duration of the user's visit to the website. The web browser normally deletes session cookies when it quits.
• Persistent cookie (tracking cookie): A persistent cookie outlasts user sessions. If a persistent cookie has its max-age set to 1 year, then, within that year, the initial value set in that cookie would be sent back to the server every time the user visited the server.
• Secure cookie: A secure cookie is used when a browser is visiting a server via HTTPS, ensuring that the cookie is always encrypted when transmitted from client to server.
• Zombie cookie: A zombie cookie is any cookie that is automatically recreated after the user has deleted it.

• Persistence: One of the most powerful aspects of cookies is their persistence. When a
cookie is set on the client’s browser, it can persist for days, months, or even years. This
makes it easy to save user preferences and visit information and to keep this
information available every time the user returns to a website. Moreover, as cookies
are stored on the client’s hard disk they are still available even if the server crashes.
• Transparent: Cookies work transparently, without the user being aware that information needs to be stored.
• They lighten the load on the server's memory.
2.7 Hypertext Transfer Protocol Cache

Caching is the term for storing reusable responses in order to make subsequent
requests faster. The caching of web pages is an important technique to improve the Quality
of Service (QoS) of the web servers. Caching can reduce network latency experienced by
clients. For example, web pages can be loaded more quickly in the browser. Caching can
also conserve bandwidth on the network, thus increasing the scalability of the network
with the help of an HTTP proxy cache (also known as web cache). Caching also increases the
availability of web pages.
• Browser cache: Web browsers themselves maintain a small cache. Typically, the browser
sets a policy that dictates the most important items to cache. This may be user-specific
content or content deemed expensive to download and likely to be requested again.
• Intermediary caching proxies (Web proxy): Any server in between the client and your
infrastructure can cache certain content as desired. These caches may be maintained by
ISPs or other independent parties.
• Reverse cache: Your server infrastructure can implement its own cache for backend
services. This way, content can be served from the point-of-contact instead of hitting
backend servers on each request.
Cache Consistency

Cache consistency mechanisms ensure that cached copies of web pages are eventually
updated to reflect changes in the original web pages. There are basically two cache consistency mechanisms currently in use for HTTP proxies:

• Pull method: In this mechanism, each cached web page is assigned a time-to-serve field, which indicates the time at which the web page was stored in the cache. An expiration time of one or two days is also maintained. If the time has expired, a fresh copy is obtained when the user requests the page (see the revalidation sketch after this list).

• Push method: In this mechanism, the web server is assigned the responsibility
of making all cached copies consistent with the server copy.
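One common way the pull method's revalidation is realized in practice (not detailed in the text above) is a conditional GET: the cache asks for the page only if it has changed since the stored copy, and the server answers 304 Not Modified otherwise. A minimal Python sketch, with placeholder host, path, and date:

import http.client

conn = http.client.HTTPConnection("www.example.com", 80)
# Revalidate an expired cached copy against the origin server.
conn.request("GET", "/index.html", headers={
    "If-Modified-Since": "Sun, 01 Sep 2016 13:28:31 GMT",
})
resp = conn.getresponse()

if resp.status == 304:
    print("Cached copy is still valid; serve it from the cache.")
else:
    print("Page has changed; store the new copy.", resp.status, resp.reason)
conn.close()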
2.8 Evolution of Web
Web 1.0:
The first generation of the web, Web 1.0, was introduced by Tim Berners-Lee in late 1990 as a technology-based solution for businesses to broadcast their information to people. The core elements of the web were HTTP, HTML, and URL.

The Web 1.0, as an unlimited source of information with users from every cross-section of society seeking to find information to satisfy their information needs, required an effective and efficient mechanism to access it. This read-only Web was accessible using an information retrieval system, popularly known as a web search engine, or simply search engine.
Web 2.0:
Web 2.0 is the term used to describe the second generation of the World Wide Web that emerged in the early 2000s. Unlike Web 1.0, which was primarily focused on the one-way dissemination of information, Web 2.0 is characterized by a more collaborative and interactive approach to web content and user engagement.

Web 2.0 Technologies:


Web 2.0 encourages a wider range of expression, facilitates more collaborative ways of working, enables community creation, fosters dialogue and knowledge sharing, and creates a setting for learners with various tools and technologies, such as blogging, social networking sites, podcasts, wikis, and micro-blogging.
 Weblog or Blog:
A Weblog, or “blog,” is a personal journal or newsletter on the Web. Some blogs are highly influential and have enormous readership, while others are mainly intended for a close circle of family and friends. The power of Weblogs is that they allow millions of people to easily publish their ideas and opinions on the Web.

 Social Networking Sites:


Social networking sites, with Facebook being the best-known, allow users to set up a personal
profile page where they can post regular status updates, maintain links to contacts known as
“friends” through a variety of interactive channels, and assemble and display their interests in
the form of texts, photos, videos, group memberships, and so on.
 Podcasts:
A Podcast is basically just an audio (or video) file. A podcast is different from
other types of audio on the Internet because a podcast can be subscribed to by
the listeners, so that when new podcasts are released, they are automatically
delivered, or fed, to a subscriber’s computer or mobile device.
 Wikis
A single page in a wiki website is referred to as a wiki page.
The entire collection of wiki pages, which are usually interconnected with hyperlinks, is “the
wiki.” A wiki is essentially a database for creating, browsing, and searching through
information
 Micro-blogging
Micro-blogging is the practice of posting small pieces of digital content—which could be
text, pictures, links, short videos, or other media—on the Internet. Micro-blogging
enables users to write brief messages, usually limited to fewer than 200 characters, and publish them via web browser-based services, email, or mobile phones.
Web 3.0:
Web 3.0, also known as the Semantic Web, is the next generation of the World Wide Web that aims to create a more intelligent, interconnected, and contextualized web experience. While Web 2.0 focused on user-generated content and social interaction, Web 3.0 aims to bring a more automated and personalized experience to the web. Web 3.0 is based on a specific set of principles, technical parameters, and values that distinguish it from earlier iterations of the World Wide Web: Web 2.0 and Web 1.0. Web 3.0 envisions a world without centralized companies, where people are in control of their own data and transactions are transparently recorded on blockchains, or databases searchable by anyone.
Features of Web 3.0

Semantic Web
Semantic means “relating to meaning in language or logic.” The Semantic Web improves the
abilities of web technologies to generate, share, and connect content through search and analysis
by understanding the meaning of language beyond simple keywords.
Artificial intelligence
Web 3.0 leans on artificial intelligence (AI) to develop computers that can understand the meaning
or context of user requests and answer complex requests more quickly. The artificial intelligence of
the Web 3.0 era goes beyond the interactivity of Web 2.0 and creates experiences for people that feel
curated, seamless, and intuitive — a central aim behind the development of the metaverse.
Decentralization
Web 3.0 envisions a truly decentralized internet, where connectivity is based completely on peer-
to-peer network connections. This decentralized web will rely on blockchain to store data and
maintain digital assets without being tracked.
Ubiquity
Ubiquity means appearing everywhere or being very common. The definition of ubiquity in terms of
Web 3.0 refers to the idea that the internet should be accessible from anywhere, through any
platform, on any device. Along with digital ubiquity comes the idea of equality. If Web 3.0 is
ubiquitous, it means it is not limited. Web 3.0 is not meant for the few, it is meant for the many.

Comparison of Web 1.0, Web 2.0, and Web 3.0:

Web 1.0:
• Despite only providing limited information and little to no user interaction, it was the first and most reliable internet in the 1990s.
• Before, there was no such thing as user pages or just commenting on articles.
• Consumers struggled to locate valuable information in Web 1.0 since there were no algorithms to scan through websites.

Web 2.0:
• Because of developments in web technologies such as Javascript, HTML5, CSS3, etc., Web 2.0 made the internet a lot more interactive.
• Social networks and user-generated content production have flourished because data can now be distributed and shared.
• Many web inventors, including the above-mentioned Jeffrey Zeldman, pioneered the set of technologies used in this internet era.

Web 3.0:
• Web 3.0 is the next break in the evolution of the Internet, allowing it to understand data in a human-like manner.
• It will use AI technology, Machine Learning, and Blockchain to provide users with smart applications.
• This will enable the intelligent creation and distribution of highly tailored content to every internet user.
2.9 Big Data:

Big Data is a trending set of techniques that demands new ways of consolidating various methods to uncover hidden information from massive and complex supplies of raw data. User-generated content on the Web has been established as a type of Big Data, and thus a discussion of Big Data is inevitable in any description of the evolution and growth of the Web. The following are the types of Big Data that have been identified across the literature:
Social Networks (human-sourced information): Human-sourced information is now almost
entirely digitized and stored everywhere from personal computers to social networks. Data are
loosely structured and often ungoverned.
Big Data Characteristics
o Volume
o Veracity
o Variety
o Value
o Velocity

Volume

The name Big Data itself is related to enormous size. Big Data refers to the vast 'volumes' of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.

Variety

Big Data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in an array of forms, such as PDFs, emails, audio, social media posts, photos, videos, etc.

The data is categorized as below:

a. Structured data: Structured data has a well-defined schema with all the required columns. It is in tabular form and is stored in a relational database management system.

b. Semi-structured data: In semi-structured data, the schema is not appropriately defined, e.g., JSON, XML, CSV, TSV, and email. OLTP (Online Transaction Processing) systems are built to work with semi-structured data. It is stored in relations, i.e., tables.

c. Unstructured data: All the unstructured files, log files, audio files, and image files are included in unstructured data. Some organizations have a great deal of data available, but they do not know how to derive value from it since the data is raw.

d. Quasi-structured data: This format contains textual data with inconsistent formats that can be formatted with effort, time, and the help of some tools.

Veracity

Veracity refers to how reliable the data is. There are many ways to filter or translate the data. Veracity is the process of being able to handle and manage data efficiently. Big Data veracity is also essential in business development.

Value

Value is an essential characteristic of Big Data. It is not just any data that we process or store; it is valuable and reliable data that we store, process, and analyze.

Velocity

Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which data is created in real time. It encompasses the speed of incoming data sets, the rate of change, and activity bursts. The primary aspect of Big Data is to provide demanded data rapidly.

Big Data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.

2.10 Web IR: Information Retrieval on the Web


Web Information Retrieval Tools
These are automated methods for retrieving information on the Web and can be broadly
classified as search tools or search services.
Search Tools
A search tool provides a user interface (UI) where the user can specify queries and browse the
results.
Class 1 search tools: General purpose search tools completely hide the organization and content
of the index from the user.
Class 2 search tools: Subject directories feature a hierarchically organized subject catalog or
directory of the Web, which is visible to users as they browse and search.

Search Services
Search services provide a layer of abstraction over several search tools and databases,
aiming to simplify web search. Search services broadcast user queries to several search engines
and various other information sources simultaneously.
2.11 Web Information Retrieval Architecture (Search Engine Architecture)
A search engine is an online answering machine that is used to search, understand, and organize content in its database based on the search query (keywords) entered by end-users (internet users). To display search results, all search engines first find the valuable results in their database, sort them to make an ordered list based on the search algorithm, and display them to the end-user. The list of organized results is commonly known as a Search Engine Results Page (SERP).

There are the following four basic components of a Search Engine -

1. Web Crawler

Web Crawler is also known as a search engine bot, web robot, or web spider. It plays an essential role in search engine optimization (SEO) strategy. It is mainly a software component that traverses the web, downloading and collecting all the information over the Internet.

There are the following web crawler features that can affect the search results –

o Included Pages
o Excluded Pages
o Document Types
o Frequency of Crawling

2. Database

The search engine database is a type of non-relational database. It is the place where all the web information is stored. It holds a large number of web resources. Some of the most popular search engine databases are Amazon Elasticsearch Service and Splunk.

There are the following two database variable features that can affect the search results:

o Size of the database


o The freshness of the database

3. Search Interfaces

Search Interface is one of the most important components of Search Engine. It is an interface
between the user and the database. It basically helps users to search for queries using the database.

There are the following Search Interface features that affect the search results -

o Operators
o Phrase Searching
o Truncation

4. Ranking Algorithms

The ranking algorithm is used by a search engine such as Google to rank web pages according to its search algorithm.

There are the following ranking features that affect the search results -

o Location and frequency


o Link Analysis
o Clickthrough measurement

How do search engines work

The following tasks are done by every search engine -

1. Crawling

Crawling is the first stage in which a search engine uses web crawlers to find, visit, and download
the web pages on the WWW (World Wide Web). Crawling is performed by software robots, known
as "spiders" or "crawlers." These robots are used to review the website content.

2. Indexing

Indexing builds an online library of websites, which is used to sort, store, and organize the content found during crawling. Once a page is indexed, it can appear as a result for the most relevant queries.

3. Ranking and Retrieval

The ranking is the last stage of the search engine. It is used to provide the piece of content that will be the best answer to the user's query. It displays the best content at the top of the results.

Search Engine Processing

There are the following two major Search Engine processing functions -

1. Indexing process

Indexing is the process of building a structure that enables searching.

Indexing process contains the following three blocks -

i. Text acquisition

It is used to identify and store documents for indexing.

ii. Text transformation

It is the process of transforming documents into index terms or features.

iii. Index creation

Index creation takes the output from text transformation and creates the indexes or data structures that enable fast searching.

2. Query process

The query process produces the list of documents based on a user's search query.

There are the following three tasks of the Query process -

i. User interaction
User interaction provides an interface between the users who search the content and the search
engine.

ii. Ranking

The ranking is the core component of the search engine. It takes query data from the user interaction
and generates a ranked list of data based on the retrieval model.

iii. Evaluation

Evaluation is used to measure and monitor the effectiveness and efficiency. The evaluation result
helps us to improve the ranking of the search engine.
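To make the indexing and query processes above concrete, here is a small illustrative Python sketch (the three documents and the simple Boolean AND retrieval are assumptions for illustration, not a description of any particular search engine):

from collections import defaultdict

# Text acquisition: a tiny document collection (document id -> text).
docs = {
    1: "web search engines crawl and index the web",
    2: "an inverted index enables fast searching",
    3: "ranking orders the retrieved documents",
}

# Text transformation + index creation: map each term to the documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Query process: return the documents that contain every query term (Boolean AND).
def search(query):
    postings = [index[t] for t in query.lower().split() if t in index]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

print(search("index web"))    # -> [1], since only document 1 contains both terms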

2.12 Web Information Retrieval Performance Metrics

As in the information retrieval community, system evaluation in Web IR (search engines)
also revolves around the notion of relevant and not relevant documents. In a binary decision
problem, a classifier labels examples as either positive or negative. The decision made by
the classifier can be represented in a structure
known as a confusion matrix or contingency table. The confusion matrix has four
categories: True positives (TP) are examples correctly labeled as positives. False
positives (FP) refer to negative examples incorrectly labeled as positive— they form
Type-I errors. True negatives (TN) correspond to negatives correctly labeled as
negative. And false negatives (FN) refer to positive examples incorrectly labeled as
negative—they form Type-II errors.
Fig: confusion matrix

• Precision: This is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search.

• Recall: This is also known as the true positive rate, sensitivity, or hit rate. It is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents.

• F-measure (in information retrieval): This can be used as a single measure of performance. The F-measure is the harmonic mean of precision and recall. It is a weighted average of the true positive rate (recall) and precision. Consider, for example, the following confusion matrix for a binary classifier (n = 165 examples):
n = 165         Predicted: NO   Predicted: YES   Total
Actual: NO      TN = 50         FP = 10          60
Actual: YES     FN = 5          TP = 100         105
Total           55              110

The performance measures are thus computed from the confusion matrix for a
binary classifier as follows:

• Accuracy: Overall, how often is the classifier correct?


(TP + TN)/total = (100 + 50)/165 = 0.91 implies 91% accuracy
• Misclassification rate or the error rate: Overall, how often is it wrong?
(FP + FN)/total = (10 + 5)/165 = 0.09 implies 9% error rate (equivalent to 1
minus Accuracy)
• Recall or true positive rate or sensitivity: When it’s actually yes, how often does
it predict yes?
TP/actual yes = 100/105 = 0.95 implies 95% recall
• False positive rate or fall-out: When it’s actually no, how often does it
predict yes?
FP/actual no = 10/60 = 0.17
• Specificity: When it’s actually no, how often does it predict no?
TN/actual no = 50/60 = 0.83 (equivalent to 1 minus false positive rate)
• Precision: When it predicts yes, how often is it correct?
TP/predicted yes = 100/110 = 0.91 implies 91% precision
• F-measure: 2 * (precision * recall) / (precision + recall)
= 2 * (0.91 * 0.95) / (0.91 + 0.95) = 1.729/1.86 = 0.92956989 ~ 0.93
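The following short Python sketch reproduces these computations directly from the confusion matrix values used above (TP = 100, FP = 10, TN = 50, FN = 5):

TP, FP, TN, FN = 100, 10, 50, 5
total = TP + FP + TN + FN                                    # 165

accuracy = (TP + TN) / total                                 # 0.91
error_rate = (FP + FN) / total                               # 0.09
recall = TP / (TP + FN)                                      # 0.95
fallout = FP / (FP + TN)                                     # 0.17  (false positive rate)
specificity = TN / (TN + FP)                                 # 0.83
precision = TP / (TP + FP)                                   # 0.91
f_measure = 2 * precision * recall / (precision + recall)    # ~0.93

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f_measure, 2))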

2.13 Web Information Retrieval Models


Standard Boolean Model:
The standard Boolean model is based on Boolean logic and classical set theory where
both the documents to be searched and the user’s query are conceived as sets of terms.
Retrieval is based on whether the documents contain the query terms. A query is
represented as a Boolean expression of terms in which terms are combined with the
logical operators AND, OR, and NOT.
Algebraic Model: Documents are represented as vectors, matrices, or tuples. Using algebraic operations, these are transformed into a one-dimensional similarity measure. Implementations include the vector space model, the generalized vector space model, the (enhanced) topic-based vector space model, and latent semantic indexing.
• Vector Space Model (VSM): The VSM is an algebraic model used for information retrieval where documents are represented through the words that they contain. It represents natural language documents in a formal manner by the use of vectors in a multi-dimensional space.
Model:
– Each document is broken down into a word frequency table. The tables are called vectors and can be stored as arrays.
– A vocabulary is built from all the words in all the documents in the system.
– Each document and each user query is represented as a vector against the vocabulary.
– The similarity measure between each document vector and the query vector is calculated.
– The documents are ranked for relevance, as illustrated in the sketch below.
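A minimal illustrative sketch of these steps in Python (the toy documents, the query, and the use of raw term frequencies instead of tf-idf weights are simplifying assumptions):

import math
from collections import Counter

docs = ["web information retrieval", "retrieval of web documents", "cooking recipes"]
query = "web retrieval"

# Build the vocabulary from all words in all documents.
vocab = sorted({w for d in docs for w in d.split()})

def to_vector(text):
    # Word frequency table for the text, aligned to the vocabulary.
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = to_vector(query)
scores = {d: cosine(to_vector(d), q) for d in docs}
for d, s in sorted(scores.items(), key=lambda x: x[1], reverse=True):
    print(round(s, 2), d)          # documents ranked by similarity to the query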
Extended Boolean Model:
The idea of the extended model is to make use of partial matching and term
weights as in the vector space model. It combines the characteristics of the vector space
model with the properties of Boolean algebra and ranks the similarity between queries
and documents. Documents are returned by ranking them on the basis of frequency of
query terms (ranked Boolean). The concept of term weights was introduced to reflect the
(estimated) importance of each term.
Probabilistic models: Probability theory seems to be the most natural way to quantify
uncertainty. A document’s relevance is interpreted as a probability. Document and query
similarities are computed as probabilities for a given query. The probabilistic model
takes these term dependencies and relationships into account and, in fact, specifies major
parameters, such as the weights of the query terms and the form of the query-document
similarity. Common models are the basic probabilistic model, Bayesian inference
networks, and language models.

Hyperlink-Induced Topic Search (HITS)


This is an algorithm developed by Kleinberg in 1998. It defines authorities as
pages that are recognized as providing significant, trustworthy, and useful information
on a topic. In-degree (the number of pointers to a page) is one simple measure of
authority. However, in-degree treats all links as equal. Hubs are index pages that
provide lots of useful links to relevant content pages (topic authorities). It attempts to
computationally determine hubs and authorities on a particular topic through analysis
of a relevant sub-graph of the Web. This is based on mutually recursive facts that hubs
point to lots of authorities and authorities are pointed to by lots of hubs. Together, they
tend to form a bipartite graph.
Algorithm:
 Computes hubs and authorities for a particular topic specified by a normal query.
 First determines a set of relevant pages for the query, called the base set S.
 Analyzes the link structure of the web sub-graph defined by S to find authority and hub pages in this set.
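A small illustrative Python sketch of these mutually recursive updates on a toy link graph (the four-page graph and the fixed number of iterations are assumptions made only for illustration):

# Toy web sub-graph S: page -> set of pages it links to.
links = {
    "p1": {"p3", "p4"},
    "p2": {"p3", "p4"},
    "p3": {"p4"},
    "p4": set(),
}

hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(20):
    # Authority score: sum of the hub scores of the pages pointing to the page.
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    # Hub score: sum of the authority scores of the pages the page points to.
    hub = {p: sum(auth[q] for q in links[p]) for p in links}
    # Normalize so the scores do not grow without bound.
    a_norm = sum(v * v for v in auth.values()) ** 0.5
    h_norm = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print({p: round(v, 2) for p, v in auth.items()})   # p4 ends up with the highest authority
print({p: round(v, 2) for p, v in hub.items()})    # p1 and p2 are the strongest hubs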
Page Rank (Google): An alternative link-analysis method is used by Google,
known as the PageRank given by Brin and Page in 1998. It does not attempt to capture
the distinction between hubs and authorities. It ranks pages just by authority and is
applied to the entire web rather than a local neighborhood of pages surrounding the
results of a query.
2.14 Google PageRank

Google used PageRank to determine the ranking of pages in its search results. As Google became the dominant search engine, it sparked a massive demand for backlinks. In the original paper on PageRank, the concept was defined as "a method for computing a ranking for every web page based on the graph of the web. PageRank is an attempt to see how good an approximation to importance can be obtained just from the link structure."
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1; we usually set d to 0.85. There are more details about d below. Also, C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

• PR(A): PageRank of page A.


• PR(Tn): PageRank of pages Tn, which link to page A. Each page has a notion of
its own self-importance. That’s PR(T1) for the first page in the Web all the way
up to PR(Tn) for the last page.
• C(Tn): Number of outbound links on page Tn. Each page spreads its vote out evenly among all of its outgoing links. The count, or number, of outgoing links for page 1 is C(T1), C(Tn) for page n, and so on for all pages.
• PR(Tn)/C(Tn): If page A has a backlink from page Tn, the share of the vote page A will get is PR(Tn)/C(Tn).
• All these fractions of votes are added together, but to stop the other pages having too
much influence, this total vote is “damped down” by multiplying it by 0.85 (the factor
“d”).
• (1 – d): The PageRanks form a probability distribution over web pages so the sum of
PageRanks of all web pages will be one.

The (1 – d) bit at the beginning is a bit of probability math magic so the sum of all
web pages’ PageRanks will be one, it adds in the bit lost by the d. It also means that if a
page has no links to it (no back links) even then it will still get a small PR of 0.15 (i.e., 1
–0.85).

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one. This formula calculates the PageRank for a page by
summing a percentage of the PageRank value of all pages that link to it. Therefore, backlinks
from pages with greater PageRank have more value. In addition, pages with more outbound
links pass a smaller fraction of their PageRank to each linked web page.

According to this formula, three primary factors that impact a page's PageRank are:

 The number of pages that backlink to it


 The PageRank of the pages that backlink to it
 The number of outbound links on each of the pages that backlink to it

In the example above, web page A links to web page B and web page C. Web page B links to web page C, and web page C has no outbound links. Based upon this, we already know that A will have the lowest PageRank and C will have the greatest PageRank. Here are the PageRank formulas and results for the first iteration, assuming d = 0.85:

 Page A: (1 - 0.85) = 0.15


 Page B: (1 - 0.85) + (0.85) * (0.15 / 2) = 0.21375
 Page C: (1 - 0.85) + (0.85) * (0.15 / 2) + (0.85) * (0.21375 / 1) = 0.3954375
This is just the first iteration of the calculation. To get the final PageRank of each page, the calculation must be repeated until the PageRank values converge (so that the average PageRank across all pages is 1.0).
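A short illustrative Python sketch of this iteration for the three-page example (the link structure A -> B, A -> C, B -> C and d = 0.85 are taken from the example; the in-place update order and starting values are assumptions):

# Link structure from the example: A links to B and C, B links to C.
pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["C"], "C": []}
d = 0.85
pr = {p: 0.0 for p in pages}      # starting values

for _ in range(50):
    for page in pages:
        # Sum the shares PR(T)/C(T) contributed by every page T that links to this page.
        incoming = sum(pr[q] / len(links[q]) for q in pages if page in links[q])
        pr[page] = (1 - d) + d * incoming

print({p: round(v, 5) for p, v in pr.items()})
# For this small acyclic graph the values settle after the first pass:
# A = 0.15, B = 0.21375, C = 0.3954375, matching the worked example above.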
