IT Unit 2
Web browsers interact with web servers using a simple application-level protocol called
HTTP (Hypertext Transfer Protocol), which runs on top of TCP/IP network connections.
HTTP is a client-server protocol that defines how messages are formatted and
transmitted, and what action web servers and browsers should take in response to
various commands. For example, when the user enters a URL in the browser, the browser
sends an HTTP command to the web server directing it to fetch and transmit the requested
web page. Some of the fundamental characteristics of the HTTP protocol are:
• HTTP uses the request/response paradigm, in which an HTTP client program sends an
HTTP request message to an HTTP server, which returns an HTTP response message.
• HTTP is a pull protocol; the client pulls information from the server
(instead of the server pushing information down to the client).
• HTTP is a stateless protocol, that is, each request-response exchange is treated
independently. Clients and servers are not required to retain any state. An HTTP
transaction consists of a single request from a client to a server, followed by a single
response from the server back to the client. The server does not maintain any
information about the transaction, although some transactions do require state to be
maintained (see Section 2.6 on cookies).
• HTTP is media independent: any type of data can be sent over HTTP as long as both the
client and the server know how to handle the data content. Both the client and the
server are required to specify the content type using an appropriate MIME type.
2.2 Hypertext Transfer Protocol Version
HTTP uses a <major>.<minor> numbering scheme to indicate versions of the protocol. The
version of an HTTP message is indicated by an HTTP-Version field in the first line. The
general syntax for specifying the HTTP version number is:
HTTP-Version = "HTTP" "/" 1*DIGIT "." 1*DIGIT
For example, HTTP/1.0 and HTTP/1.1.
How a client communicates with the server depends on the type of connection
established between the two machines. An HTTP connection can either be persistent
or non-persistent. Non-persistent HTTP was used by HTTP/1.0. HTTP/1.1 uses the
persistent type of connection, also known as a keep-alive connection, with
multiple messages or objects being sent over a single TCP connection between client and
server.
2.3.1 Non-Persistent Hypertext Transfer Protocol
HTTP/1.0 used a non-persistent connection, in which only one object can be sent over
a TCP connection. Transmitting a file from one machine to another requires two Round
Trip Times (RTTs), where an RTT is the time taken for a small packet to travel from
client to server and back (a rough estimate of the total response time follows the list below):
• One RTT to initiate TCP connection
• Second for HTTP request and first few bytes of HTTP response to return
• Rest of the time is taken in transmitting the file
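Ignoring server processing time, the total response time for fetching one object over a
non-persistent connection can therefore be estimated roughly as:

    Total response time ≈ 2 x RTT + file transmission time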
While using non-persistent HTTP, the operating system incurs extra overhead for
maintaining each TCP connection; as a result, many browsers often open parallel TCP
connections to fetch referenced objects. The steps involved in setting up a connection
with non-persistent HTTP are:
1. The client opens a TCP connection to the server.
2. The client sends an HTTP request for one object over that connection.
3. The server sends the HTTP response carrying the object and closes the connection.
4. The above steps are repeated for every referenced object.
For HTTP/1.1, on the other hand, persistent connections are very helpful with multi-object
requests, as the server keeps the TCP connection open by default.
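As a minimal sketch of how a single persistent connection can be reused for several requests,
the following Python snippet uses the standard-library http.client module (the host name and
paths are only example values):

    import http.client

    # One TCP connection is opened and reused for both requests
    # (HTTP/1.1 connections are persistent by default).
    conn = http.client.HTTPConnection("example.com", 80)

    conn.request("GET", "/index.html")
    resp1 = conn.getresponse()
    body1 = resp1.read()        # the response must be read fully before reusing the connection
    print(resp1.status, len(body1))

    conn.request("GET", "/style.css")   # second request over the same TCP connection
    resp2 = conn.getresponse()
    print(resp2.status, len(resp2.read()))

    conn.close()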
In its simplest form, the communication model of HTTP involves an HTTP client, usually a
web browser on a client machine, and an HTTP server, more commonly known as a web
server. The basic HTTP communication model has four steps: the client opens a TCP
connection to the server, the client sends a request message, the server sends back a
response message, and the connection is closed.
Request Message: With a request message, the client asks for a resource from the server.
This is done with a human-readable message. Every HTTP request message has the same
basic structure:
Start Line:
Request method: It indicates the type of request the client wants to send; the different
request types are also called methods.
Method = GET | HEAD | POST | PUT | DELETE | TRACE | OPTIONS | CONNECT | COPY | MOVE
GET: Request server to return the resource specified by the Request-URI as the body of a
response.
HEAD: Requests the server to return the same HTTP header fields that would be returned if a GET
method were used, but not the message body that a GET would return.
POST: The most common use of the POST method is to submit an HTML form to the server. Since
the information is included in the body, large chunks of data, such as an entire file, can be sent to
the server.
PUT: It is used to upload a new resource or replace an existing document. The actual document
is specified in the body part.
DELETE: Request server to respond to future HTTP request messages that contain the
specified Request-URI with a response indicating that there is no resource associated with this
Request-URI.
TRACE: Request server to return a copy of the complete HTTP request message, including
start line, header fields, and body, received by the server.
COPY: The HTTP protocol may be used to copy a file from one location to another.
MOVE: It is similar to the COPY method except that it deletes the source file.
CONNECT: It is used to convert the request connection into a transparent TCP/IP tunnel.
Headers:
The HTTP protocol specification makes a clear distinction between general headers,
request headers, response headers, and entity headers. Both request and response messages
carry general headers, which have no relation to the data eventually transmitted in the body.
The headers are separated from the request or response body by an empty line. The categories
of headers that a request message may carry are shown below:
General Header | Request Header | Entity Header
A header consists of a single line or multiple lines. Each line is a single header of the following
form:
Header-name: Header-value
General Headers
General headers do not describe the body of the message. They provide information
about the message itself rather than about the content it carries.
• Connection: Close
This header indicates whether the client or server that generated the message
intends to keep the connection open.
• Warning: Danger. This site may be hacked!
This header stores text for human consumption, something that would be useful
when tracing a problem.
• Cache-Control: no-cache
This header shows whether the caching should be used.
Request Header:
Request headers allow the client to pass additional information about itself and about the
request, such as the data format that the client expects.
• User-Agent: Mozilla/4.75
Identifies the software (e.g., a web browser) responsible for making the request.
• Host: www.netsurf.com
This header was introduced to support virtual hosting, a feature that allows a web
server to service more than one domain.
• Referer: http://wwwdtu.ac.in/∼akshi/index.html
This header provides the server with context information about the request. If the
request came about because a user clicked on a link found on a web page, this header
contains the URL of that referring page.
• Accept: text/plain
This header specifies the format of the media that the client can accept.
Entity Header:
• Content-Type: mime-type/mime-subtype
This header specifies the MIME-type of the content of the message body.
• Content-Length: 546
This optional header provides the length of the message body. Although it is optional,
it is useful for clients such as web browsers that wish to report the progress of a request.
• Last-Modified: Thu, 01 Sep 2016 13:28:31 GMT
This header provides the last modification date of the content that is transmitted in the
body of the message. It is critical for the proper functioning of caching mechanisms.
• Allow: GET, HEAD, POST
This header specifies the list of the valid methods that can be applied on a URL.
Message Body:
The message body part is optional for an HTTP message but, if it is available, it is used to
carry the entity body associated with the request. If an entity body is present, the
Content-Type and Content-Length header lines usually specify the nature of the associated body.
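For illustration, a complete request message that submits a small HTML form with POST might
look like the following (the host, header values, and form fields are made-up examples):

    POST /cgi-bin/feedback HTTP/1.1
    Host: www.example.com
    User-Agent: Mozilla/4.75
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 28

    name=alice&comment=nice+site

The start line names the method, the Request-URI, and the HTTP version; the header lines
follow; an empty line separates the headers from the message body.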
Fig: A sample HTTP request message
Response Message: Similar to an HTTP request message, an HTTP response message
consists of a status line, header fields, and the body of the response, in the following
format:
• HTTP version: This field specifies the version of the HTTP protocol being used by the
server. The current version is HTTP/1.1.
• Status code: It is a three-digit code that indicates the status of the response. The status
codes are classified with respect to their functionality into five groups as follows:
• 1xx series (Informational): This class of status codes represents provisional
responses.
• 2xx series (Success): This class of status codes indicates that the client's request
was received, understood, and accepted successfully.
• 3xx series (Re-directional): These status codes indicate that additional actions must
be taken by the client to complete the request.
• 4xx series (Client error): These status codes indicate that the client request had an
error and therefore cannot be fulfilled.
• 5xx series (Server error): This set of status codes indicates that the server
encountered some problem and hence the request cannot be satisfied at this time.
The reason for the failure is embedded in the message body, which also indicates whether
the failure is temporary or permanent. The user agent should accordingly display a message
on the screen to make the user aware of the server failure.
Status phrase: It is also known as the Reason-Phrase and is intended to give a short textual
description of the status code. For example:
403 Forbidden: The request was valid, but the server is refusing to act on it.
404 Not Found: The requested resource could not be found but may be available in the
future. Subsequent requests by the client are permissible.
A response message may likewise carry the following categories of headers:
General Header | Response Header | Entity Header
Response Header:
Response headers help the server pass additional information about the response that
cannot be inferred from the status code alone, such as information about the server and
the data being sent.
• Location: http://www.mywebsite.com/relocatedPage.html
This header specifies a URL towards which the client should redirect its original
request.
It always accompanies the “301” and “302” status codes that direct clients to try a
new location.
• WWW-Authenticate: Basic
This header accompanies the "401" status code that indicates an authorization
challenge. It specifies the authentication scheme that should be used to access the
requested entity.
• Server: Apache/1.2.5
This header is not tied to a particular status code. It is an optional header that
identifies the server software.
• Age:22
This header specifies the age of the resource in the proxy cache in seconds.
Message Body
Similar to HTTP request messages, the message body in an HTTP response message is
also optional. The message body carries the actual HTTP response data from the server
(including files, images, and so on) to the client.
Figure: A sample HTTP response message.
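For illustration, a minimal response message might look like the following (all header values
are examples): a status line, header fields, an empty line, and then the body.

    HTTP/1.1 200 OK
    Date: Mon, 05 Sep 2016 10:15:30 GMT
    Server: Apache/1.2.5
    Content-Type: text/html
    Content-Length: 54

    <html><body><h1>Welcome to the site</h1></body></html>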
HTTPS (HTTP running over a secure transport layer such as SSL/TLS) addresses two security goals:
• Verifying that you are talking directly to the server that you think you are talking to.
• Ensuring that only the server can read what you send it, and only you can read what it
sends back.
2.6 Hypertext Transfer Protocol State Retention: Cookies
HTTP is a stateless protocol. Cookies are an application-based solution to provide state
retention over a stateless protocol. They are small pieces of information that are sent in
response from the web server to the client. Cookies are the simplest technique used for storing
client state. A cookie is also known as HTTP cookie, web cookie, or browser cookie. Cookies
are not software; they cannot be programmed, cannot carry viruses, and cannot install
malware on the host computer. However, they can be used by spyware to track a user’s
browsing activities. Cookies are stored on a client’s computer. They have a lifespan and are
destroyed by the client browser at the end of that lifespan.
Fig:HTTP Cookie
Creating Cookies
When receiving an HTTP request, a server can send a Set-Cookie header with the response.
The cookie is usually stored by the browser and, afterwards, the cookie value is sent along with
every request made to the same server as the content of a Cookie HTTP header.
A simple cookie can be set like this:
Set-Cookie: <cookie-name>=<cookie-value>
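As a small sketch of how a server-side program might build such a header, Python's
standard-library http.cookies module can be used (the cookie name, value, and attributes
below are only examples):

    from http.cookies import SimpleCookie

    # Build a cookie named "sessionid" and mark it as a persistent, secure cookie.
    cookie = SimpleCookie()
    cookie["sessionid"] = "abc123"
    cookie["sessionid"]["max-age"] = 3600   # lifespan in seconds
    cookie["sessionid"]["secure"] = True    # only send the cookie over HTTPS
    cookie["sessionid"]["httponly"] = True  # hide the cookie from client-side scripts

    # output() produces the header line the server adds to its response, for example:
    # Set-Cookie: sessionid=abc123; HttpOnly; Max-Age=3600; Secure
    print(cookie.output())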
There are various kinds of cookies that are used for different scenarios depending on the need.
These different types of cookies are briefly described below.
Types of Cookies
Session cookie: A session cookie lasts only for the duration of the user's visit to the website.
The web browser normally deletes session cookies when it quits.
Persistent cookie (tracking cookie): A persistent cookie outlasts user sessions. If a persistent
cookie has its max-age set to 1 year, then, within that year, the initial value set in that
cookie would be sent back to the server every time the user visited the server.
Secure cookie: A secure cookie is used when a browser is visiting a server via HTTPS, ensuring
that the cookie is always encrypted when transmitted from client to server.
Zombie cookie: A zombie cookie is any cookie that is automatically recreated after the user has
deleted it.
• Persistence: One of the most powerful aspects of cookies is their persistence. When a
cookie is set on the client’s browser, it can persist for days, months, or even years. This
makes it easy to save user preferences and visit information and to keep this
information available every time the user returns to a website. Moreover, as cookies
are stored on the client’s hard disk they are still available even if the server crashes.
• Transparent: Cookies work transparently without the user being aware that
information needs to be stored.
• They lighten the load on the server's memory.
2.7 Hypertext Transfer Protocol Cache
Caching is the term for storing reusable responses in order to make subsequent
requests faster. The caching of web pages is an important technique to improve the Quality
of Service (QoS) of the web servers. Caching can reduce network latency experienced by
clients. For example, web pages can be loaded more quickly in the browser. Caching can
also conserve bandwidth on the network, thus increasing the scalability of the network
with the help of an HTTP proxy cache (also known as web cache). Caching also increases the
availability of web pages.
• Browser cache: Web browsers themselves maintain a small cache. Typically, the browser
sets a policy that dictates the most important items to cache. This may be user-specific
content or content deemed expensive to download and likely to be requested again.
• Intermediary caching proxies (Web proxy): Any server in between the client and your
infrastructure can cache certain content as desired. These caches may be maintained by
ISPs or other independent parties.
• Reverse cache: Your server infrastructure can implement its own cache for backend
services. This way, content can be served from the point-of-contact instead of hitting
backend servers on each request.
Cache Consistency
Cache consistency mechanisms ensure that cached copies of web pages are eventually
updated to reflect changes in the original web pages. There are basically two cache
consistency mechanisms currently in use for HTTP proxies:
• Pull method: In this mechanism, each cached web page is assigned a time-to-serve
field, which indicates the time at which the web page was stored in the cache. An expiration
time of one or two days is also maintained. If that time has expired, a fresh copy is obtained
when the user requests the page, typically with a conditional request as sketched after this list.
• Push method: In this mechanism, the web server is assigned the responsibility
of making all cached copies consistent with the server copy.
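A cache following the pull method typically revalidates an expired copy with a conditional
request. A minimal sketch in Python using the standard-library http.client module (the host,
path, and date are example values):

    import http.client

    conn = http.client.HTTPConnection("example.com", 80)

    # Ask for the page only if it changed after the cached copy's Last-Modified date.
    headers = {"If-Modified-Since": "Thu, 01 Sep 2016 13:28:31 GMT"}
    conn.request("GET", "/index.html", headers=headers)
    resp = conn.getresponse()

    if resp.status == 304:
        # 304 Not Modified: the cached copy is still fresh and no body is sent.
        print("Serve the cached copy")
    else:
        # 200 OK: store the fresh copy (and its new Last-Modified value) in the cache.
        print("Update the cache with the new copy", resp.status)
    resp.read()
    conn.close()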
2.8 Evolution of Web
Web 1.0:
The first generation of the Web, Web 1.0, was introduced by Tim Berners-Lee in late 1990 as a
technology-based solution for businesses to broadcast their information to people. The core
elements of the Web were HTTP, HTML, and URL.
Web 1.0, as an unlimited source of information with users from all cross-sections of society
seeking to satisfy their information needs, required an effective and efficient mechanism to
access it. This read-only Web was accessible using an information retrieval system, popularly
known as a web search engine, or simply a search engine.
Web 2.0:
Web 2.0 is the term used to describe the second generation of the World Wide Web that emerged in
the early 2000s. Unlike Web 1.0, which was primarily focused on the one-way dissemination of
information, Web 2.0 is characterized by a more collaborative and interactive approach to web
content and user engagement.
Web 3.0:
The third generation of the Web, Web 3.0, is characterized by the following features.
Semantic Web
Semantic means “relating to meaning in language or logic.” The Semantic Web improves the
abilities of web technologies to generate, share, and connect content through search and analysis
by understanding the meaning of language beyond simple keywords.
Artificial intelligence
Web 3.0 leans on artificial intelligence (AI) to develop computers that can understand the meaning
or context of user requests and answer complex requests more quickly. The artificial intelligence of
the Web 3.0 era goes beyond the interactivity of Web 2.0 and creates experiences for people that feel
curated, seamless, and intuitive — a central aim behind the development of the metaverse.
Decentralization
Web 3.0 envisions a truly decentralized internet, where connectivity is based completely on peer-
to-peer network connections. This decentralized web will rely on blockchain to store data and
maintain digital assets without being tracked.
Ubiquity
Ubiquity means appearing everywhere or being very common. The definition of ubiquity in terms of
Web 3.0 refers to the idea that the internet should be accessible from anywhere, through any
platform, on any device. Along with digital ubiquity comes the idea of equality. If Web 3.0 is
ubiquitous, it means it is not limited. Web 3.0 is not meant for the few, it is meant for the many.
Web 1.0: Consumers struggled to locate valuable information online since there were no
algorithms to scan through websites.
Web 2.0: Many web inventors, including the above-mentioned Jeffrey Zeldman, pioneered the set
of technologies used in this internet era.
Web 3.0: This will enable the intelligent creation and distribution of highly tailored content
to every internet user.
2.9 Big Data:
Volume
The name Big Data itself refers to enormous size. Big Data is a vast volume of data
generated from many sources daily, such as business processes, machines, social media platforms,
networks, human interactions, and many more.
Variety
Big Data can be structured, unstructured, or semi-structured, collected from many different
sources. In the past, data was collected only from databases and spreadsheets, but these days
data comes in an array of forms, such as PDFs, emails, audio, social media posts, photos, videos, etc.
a. Structured data: Structured data follows a fixed schema with all the required columns and is
in a tabular form. Structured data is stored in a relational database management system.
b. Semi-structured data: Semi-structured data does not reside in a relational database but still
has some organizational properties, for example XML and JSON files.
c. Unstructured data: All the unstructured files, such as log files, audio files, and image files,
are included in the unstructured data. Some organizations have a great deal of data available,
but they do not know how to derive value from it since the data is raw.
d. Quasi-structured data: The data format contains textual data with inconsistent formats that
can be formatted with effort, time, and some tools.
Veracity
Veracity refers to how reliable the data is. There are many ways to filter or translate the data.
Veracity is the process of being able to handle and manage data efficiently. Big Data is also
essential in business development.
Value
Value is an essential characteristic of Big Data. It is not just any data that we process or
store; it is valuable and reliable data that we store, process, and also analyze.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the
speed at which data is created in real time. It covers the speed of incoming data streams, the
rate of change, and bursts of activity. The primary aspect of Big Data is to provide demanding
data rapidly.
Big Data velocity deals with the speed at which data flows from sources such as application logs,
business processes, networks, social media sites, sensors, mobile devices, etc.
Search Services
Search services provide a layer of abstraction over several search tools and databases,
aiming to simplify web search. Search services broadcast user queries to several search engines
and various other information sources simultaneously.
2.11 Web Information Retrieval Architecture (Search Engine Architecture)
A search engine is an online answering machine, which is used to search, understand, and
organize content in its database based on the search query (keywords) entered by end-users
(internet users). To display search results, all search engines first find the valuable results
in their database, sort them into an ordered list based on the search algorithm, and display
them in front of the end-users. The page that presents this ordered list of results is commonly
known as a Search Engine Results Page (SERP).
1. Web Crawler
Web Crawler is also known as a search engine bot, web robot, or web spider. It plays an essential
role in search engine optimization (SEO) strategy. It is mainly a software component that traverses
the web, then downloads and collects all the information over the Internet.
The following web crawler features can affect the search results:
o Included Pages
o Excluded Pages
o Document Types
o Frequency of Crawling
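A minimal crawler sketch in Python is shown below (standard library only; the start URL is
hypothetical, and a real crawler would also respect robots.txt, handle errors, and limit its
crawl rate):

    from urllib.request import urlopen
    from urllib.parse import urljoin
    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        """Collects the href values of <a> tags found on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=10):
        to_visit, seen, pages = [start_url], set(), {}
        while to_visit and len(pages) < max_pages:
            url = to_visit.pop(0)
            if url in seen:
                continue
            seen.add(url)
            html = urlopen(url).read().decode("utf-8", errors="replace")
            pages[url] = html                    # store the downloaded content
            parser = LinkParser()
            parser.feed(html)
            to_visit.extend(urljoin(url, link) for link in parser.links)
        return pages

    # Example (hypothetical site): crawl("http://example.com/", max_pages=5)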
2. Database
The search engine database is a type of non-relational database. It is the place where all the web
information is stored. It holds a large number of web resources. Some of the most popular search
engine databases are Amazon Elastic Search Service and Splunk.
The following two database features can affect the search results:
3. Search Interfaces
Search Interface is one of the most important components of a search engine. It is an interface
between the user and the database. It basically helps users to search for queries using the database.
The following search interface features affect the search results:
o Operators
o Phrase Searching
o Truncation
4. Ranking Algorithms
The ranking algorithm is used by Google to rank web pages according to the Google search
algorithm; the ranking features it uses affect the search results.
A search engine processes content in three stages: crawling, indexing, and ranking.
1. Crawling
Crawling is the first stage in which a search engine uses web crawlers to find, visit, and download
the web pages on the WWW (World Wide Web). Crawling is performed by software robots, known
as "spiders" or "crawlers." These robots are used to review the website content.
2. Indexing
Indexing is an online library of websites, which is used to sort, store, and organize the content
found during crawling. Once a page is indexed, it can appear in the results for the queries to
which it is most relevant and valuable.
3. Ranking
Ranking is the last stage of the search engine. It is used to provide the piece of content that
will be the best answer to the user's query. It displays the best content at the top of the
search results.
Search Engine Processing
1. Indexing process
i. Text acquisition: Text acquisition identifies and acquires the documents that will be indexed.
ii. Text transformation: Text transformation converts the acquired documents into index terms or features.
iii. Index creation: Index creation takes the output from text transformation and creates the
indexes or data structures that enable fast searching.
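The data structure most commonly used for this purpose is an inverted index, which maps each
term to the documents containing it. A minimal sketch in Python (the two documents are invented
examples):

    from collections import defaultdict

    def build_index(documents):
        """Map each term to the set of document ids that contain it."""
        index = defaultdict(set)
        for doc_id, text in enumerate(documents):
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    docs = ["Web crawlers download pages",
            "Search engines index downloaded pages"]
    index = build_index(docs)
    print(sorted(index["pages"]))   # -> [0, 1]: both documents contain the term "pages"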
2. Query process
The query process produces the list of documents based on a user's search query.
i. User interaction
User interaction provides an interface between the users who search the content and the search
engine.
ii. Ranking
The ranking is the core component of the search engine. It takes query data from the user interaction
and generates a ranked list of data based on the retrieval model.
iii. Evaluation
Evaluation is used to measure and monitor the search engine's effectiveness and efficiency.
The evaluation results help us to improve the ranking of the search engine.
Like in the information retrieval community, system evaluation in Web IR (search engines)
also revolves around the notion of relevant and not relevant documents. In a binary decision
problem, a classifier labels examples as either positive or negative. The decision made by
the classifier can be represented in a structure
known as a confusion matrix or contingency table. The confusion matrix has four
categories: True positives (TP) are examples correctly labeled as positives. False
positives (FP) refer to negative examples incorrectly labeled as positive— they form
Type-I errors. True negatives (TN) correspond to negatives correctly labeled as
negative. And false negatives (FN) refer to positive examples incorrectly labeled as
negative—they form Type-II errors.
Fig: confusion matrix
Recall: This is also known as true positive rate or sensitivity or hit rate.
It is defined as the number of relevant documents retrieved by a search divided by the
total number of existing relevant documents.
The performance measures are thus computed from the confusion matrix for a
binary classifier as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
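A small Python sketch of these computations (the four counts are made-up example values):

    def evaluate(tp, fp, tn, fn):
        """Compute the measures above from the four confusion-matrix counts."""
        precision = tp / (tp + fp)    # fraction of retrieved documents that are relevant
        recall = tp / (tp + fn)       # fraction of relevant documents that are retrieved
        accuracy = (tp + tn) / (tp + fp + tn + fn)
        return precision, recall, accuracy

    # Example counts: 40 relevant documents retrieved, 10 false alarms,
    # 30 correctly rejected, and 20 relevant documents missed.
    print(evaluate(tp=40, fp=10, tn=30, fn=20))   # -> precision 0.8, recall ~0.67, accuracy 0.7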
Vector Space Model: Documents and user queries are represented as term vectors and ranked by
their similarity; the basic steps are the following (a minimal sketch in code follows this list):
– Each document is broken down into a word frequency table. The tables are called
vectors and can be stored as arrays.
– A vocabulary is built from all the words in all the documents in the system.
– Each document and user query is represented as a vector based against the
vocabulary.
– Calculating similarity measure.
– Ranking the documents for relevance.
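A minimal sketch of these steps in Python (the two toy documents and the query are invented;
a real system would also apply stemming, stop-word removal, and tf-idf weighting):

    import math
    from collections import Counter

    documents = ["web search engines rank web pages",
                 "cookies store client state for the web"]
    query = "web search"

    # 1. Break each document (and the query) into a word-frequency table (a vector).
    doc_vectors = [Counter(d.split()) for d in documents]
    query_vector = Counter(query.split())

    # 2. Build the vocabulary from all words in all documents.
    vocabulary = sorted(set(word for vec in doc_vectors for word in vec))

    # 3.-4. Represent vectors against the vocabulary and compute cosine similarity.
    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in vocabulary)
        norm_a = math.sqrt(sum(a[w] ** 2 for w in vocabulary))
        norm_b = math.sqrt(sum(b[w] ** 2 for w in vocabulary))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # 5. Rank the documents by similarity to the query (most relevant first).
    ranking = sorted(range(len(documents)),
                     key=lambda i: cosine(doc_vectors[i], query_vector),
                     reverse=True)
    print(ranking)   # document indices in decreasing order of relevance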
Extended Boolean Model:
The idea of the extended model is to make use of partial matching and term
weights as in the vector space model. It combines the characteristics of the vector space
model with the properties of Boolean algebra and ranks the similarity between queries
and documents. Documents are returned by ranking them on the basis of frequency of
query terms (ranked Boolean). The concept of term weights was introduced to reflect the
(estimated) importance of each term.
Probabilistic models: Probability theory seems to be the most natural way to quantify
uncertainty. A document’s relevance is interpreted as a probability. Document and query
similarities are computed as probabilities for a given query. The probabilistic model
takes these term dependencies and relationships into account and, in fact, specifies major
parameters, such as the weights of the query terms and the form of the query-document
similarity. Common models are the basic probabilistic model, Bayesian inference
networks, and language models.
Google used PageRank to determine the ranking of pages in its search results. As
Google became the dominant search engine, it sparked a massive demand for backlinks. In
the original paper on PageRank, the concept was defined as "a method for computing a
ranking for every web page based on the graph of the web. PageRank is an attempt to see
how good an approximation to importance can be obtained just from the link structure."
We assume page A has pages T1...Tn which point to it (i.e., are citations). The
parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85.
There are more details about d in the next section. Also, C(A) is defined as the number of
links going out of page A. The PageRank of page A is given as follows:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
The (1 - d) term at the beginning is a bit of probability math so that the sum of all
web pages' PageRanks will be one; it adds back the portion removed by d. It also means that a
page with no links to it (no backlinks) will still get a small PR of 0.15 (i.e., 1 - 0.85).
Note that the PageRanks form a probability distribution over web pages, so the sum of all
web pages' PageRanks will be one. This formula calculates the PageRank for a page by
summing a fraction of the PageRank value of every page that links to it. Therefore, backlinks
from pages with greater PageRank have more value. In addition, pages with more outbound
links pass a smaller fraction of their PageRank to each linked web page.
According to this formula, the three primary factors that impact a page's PageRank are:
• The number of backlinks the page receives.
• The PageRank of the pages providing those backlinks.
• The number of outbound links on each page that provides a backlink.
In the example above, web page A has backlinks that point to web page B and to web page C.
Web page B has a backlink that points to web page C, and web page C has no outbound links.
Based upon this, we already know that A will have the lowest PageRank and C will have the
greatest PageRank. Here are the PageRank formulas and results for the first iteration,
assuming d = 0.85:
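A sketch of one way to carry out the computation in Python, assuming every page starts with a
PageRank of 1 and the values are updated repeatedly (the simple three-page graph above: A links
to B and C, B links to C, C links to nothing):

    links = {            # page -> list of pages it links to
        "A": ["B", "C"],
        "B": ["C"],
        "C": [],
    }
    d = 0.85
    pr = {page: 1.0 for page in links}       # assumed initial PageRank of every page

    for i in range(20):                      # a handful of iterations is enough here
        new_pr = {}
        for page in links:
            # Sum PR(T)/C(T) over every page T that links to this page.
            incoming = sum(pr[src] / len(outs)
                           for src, outs in links.items() if page in outs)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
        if i == 0:
            print("after first iteration:", pr)   # A = 0.15, B = 0.575, C = 1.425

    print("after convergence:", pr)   # A stays lowest; C ends up with the largest value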