DM Clustering UNIT4
UNIT-IV
CLUSTERING
Clustering
Clustering is the process of partitioning a set of data into meaningful similar subclasses, each of which is called a cluster.
[Or]
Clustering is the grouping of a set of objects in such a way that objects in the same group are similar to one another. In other words, in cluster analysis we first partition the data into groups based on similarity.
Examples of clustering application:
Marketing
Land use
Insurance
City-planning
1. Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors.
Typical methods: k-means, k-medoids, CLARANS
2. Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion.
Typical methods: Agglomerative, Divisive clustering
BIRCH
ROCK
CHAMELEON
3. Density-based approach:
Based on connectivity and density functions.
Typical methods:
DENCLUE [DENsity-based CLUstEring]
DBSCAN [density-based clustering based on connected regions with sufficiently high density]
OPTICS [Ordering Points To Identify the Clustering Structure]
4. Grid-based methods:
Quantize the object space into a finite number of cells that form a grid structure, and perform the clustering operations on the grid.
Typical methods: STING, WaveCluster
5. Model-based methods:
A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model.
Typical methods: COBWEB, EM [Expectation-Maximization]
6. Constraint-based methods:
Clustering that takes user-specified or application-specific constraints into account.
Typical methods: COD (clustering with obstacles), constrained clustering.
Partitioning methods:
1. k-means algorithm:
Step 1: Choose k mean values (randomly).
Step 2: Assign each object to the cluster with the nearest mean.
Step 3: Recompute the means and repeat Steps 2 and 3 until the means no longer change.
Step 4: The k-means method typically uses the square-error criterion function.
EX: O = {2, 3, 4, 10, 11, 12, 20, 25, 30}; K = 2
Initial means: M1 = 4, M2 = 12
C1 = {2, 3, 4}                 C2 = {10, 11, 12, 20, 25, 30}
M1 = 9/3 = 3                   M2 = 108/6 = 18
C1 = {2, 3, 4, 10}             C2 = {11, 12, 20, 25, 30}
M1 = 19/4 = 4.75 ≈ 5           M2 = 98/5 = 19.6 ≈ 20
C1 = {2, 3, 4, 10, 11, 12}     C2 = {20, 25, 30}
M1 = 42/6 = 7                  M2 = 75/3 = 25
C1 = {2, 3, 4, 10, 11, 12}     C2 = {20, 25, 30}
M1 = 7                         M2 = 25   → the means are unchanged, so the algorithm stops.
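The iteration above can be reproduced with a short script. The following is a minimal sketch of 1-D k-means on the same data; the helper name kmeans_1d and the use of plain Python lists are illustrative choices, not part of the notes.

# Minimal 1-D k-means sketch for the worked example above.
def kmeans_1d(points, means, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of the nearest mean.
        clusters = [[] for _ in means]
        for x in points:
            idx = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[idx].append(x)
        # Update step: recompute each mean from its cluster members.
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        if new_means == means:          # converged: means did not change
            return clusters, new_means
        means = new_means
    return clusters, means

points = [2, 3, 4, 10, 11, 12, 20, 25, 30]
clusters, means = kmeans_1d(points, [4, 12])   # initial means M1=4, M2=12
print(clusters)   # [[2, 3, 4, 10, 11, 12], [20, 25, 30]]
print(means)      # [7.0, 25.0]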
2. K-Medoids algorithm:
The k-means algorithm is sensitive to outliers, because an object with an extremely large value may substantially distort the distribution of the data.
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object (medoid) per cluster. Each remaining object is assigned to the representative object to which it is most similar. The algorithm then repeatedly tries to replace a representative object Oj with a randomly selected non-representative object Orandom, keeping the swap only if it reduces the total cost. After a swap, each object p is reassigned according to one of four cases (a small cost-comparison sketch follows the four cases).
Case 1: p currently belongs to representative object Oj. If Oj is replaced by Orandom and p is closest to one of the other representative objects Oi (i ≠ j), then p is reassigned to Oi.
Legend: • → data object; + → cluster center; – → before swapping; --- → after swapping.
Case 2: p currently belongs to representative object Oj. If Oj is replaced by Orandom and p is closest to Orandom, then p is reassigned to Orandom.
Case 3: p currently belongs to representative object Oi (i ≠ j). If Oj is replaced by Orandom and p is still closest to Oi, then the assignment does not change.
Case 4: p currently belongs to representative object Oi (i ≠ j). If Oj is replaced by Orandom and p is closest to Orandom, then p is reassigned to Orandom.
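The swap test that these four cases support can be sketched in a few lines. In the sketch below the absolute-error cost function, the data values, and the chosen candidate medoid are illustrative assumptions.

# Sketch of the k-medoids (PAM) swap test: compare total cost before and after
# replacing a medoid Oj with a randomly chosen non-medoid Orandom.
def total_cost(points, medoids):
    # Each point contributes its distance to the nearest medoid (absolute error).
    return sum(min(abs(p - m) for m in medoids) for p in points)

points  = [2, 3, 4, 10, 11, 12, 20, 25, 30]   # illustrative data
medoids = [4, 20]                              # current representative objects
candidate = 11                                 # Orandom, a non-medoid object

for j, oj in enumerate(medoids):
    trial = medoids[:j] + [candidate] + medoids[j + 1:]
    delta = total_cost(points, trial) - total_cost(points, medoids)
    # A negative delta means the swap reduces total cost, so it would be accepted.
    print(f"swap {oj} -> {candidate}: cost change = {delta}")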
Hierarchical methods:
Disadvantages:
1. Once a merge (or split) step is done, it cannot be undone.
2. To overcome this problem and to improve the quality of hierarchical methods, they are integrated with other clustering techniques, e.g.:
1. BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies.
2. ROCK - RObust Clustering using linKs.
3. CHAMELEON - hierarchical clustering using dynamic modeling.
CF-tree: A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering (a small sketch of a clustering feature follows the list below). The size of a clustering-feature tree depends on two factors:
1. Branching factor: the maximum number of children allowed for a non-leaf node.
2. Threshold: the maximum diameter allowed for the sub-clusters stored at the leaf nodes.
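In BIRCH, a clustering feature is the triple CF = (N, LS, SS): the number of points, their linear sum, and their square sum. CFs are additive, which is what lets a CF-tree node summarize its children. A small sketch with illustrative 1-D data:

# Sketch of a BIRCH clustering feature CF = (N, LS, SS) for 1-D points.
# CFs are additive, so a CF-tree node can summarize its children without raw points.
def make_cf(points):
    return (len(points), sum(points), sum(p * p for p in points))

def merge_cf(cf1, cf2):
    return tuple(a + b for a, b in zip(cf1, cf2))

def centroid(cf):
    n, ls, _ = cf
    return ls / n

cf_a = make_cf([2, 3, 4])        # sub-cluster A
cf_b = make_cf([10, 11, 12])     # sub-cluster B
cf_ab = merge_cf(cf_a, cf_b)     # CF of the merged cluster
print(cf_ab)                     # (6, 42, 394)
print(centroid(cf_ab))           # 7.0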
CHAMELEON:
Measures the similarity based on a dynamic model: two clusters are merged only if the interconnectivity and closeness between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters.
→ CURE ignores information about the interconnectivity of objects.
→ ROCK ignores information about the closeness of two clusters.
The algorithm has two phases:
Phase 1: Use a graph-partitioning algorithm to divide the objects into a large number of relatively small sub-clusters.
Phase 2: Use an agglomerative hierarchical clustering algorithm that repeatedly merges these sub-clusters, based on their relative interconnectivity and relative closeness, to find the genuine clusters.
Density-based Clustering
The Density-based Clustering tool works by detecting areas where points are concentrated and
where they are separated by areas that are empty or sparse. Points that are not part of a cluster are
labeled as noise.
This tool uses unsupervised machine learning clustering algorithms which automatically detect
patterns based purely on spatial location and the distance to a specified number of neighbors.
These algorithms are considered unsupervised because they do not require any training on what
it means to be a cluster.
Clustering Methods
The Density-based Clustering tool provides three different Clustering Methods with which to
find clusters in your point data:
Defined distance (DBSCAN)—uses a specified distance to separate dense clusters from
sparser noise. The DBSCAN algorithm is the fastest of the clustering methods, but is only
appropriate if there is a very clear Search Distance to use, and that works well for all
potential clusters. This requires that all meaningful clusters have similar densities.
Multi-scale (OPTICS)—uses the distance between neighboring features to create a
reachability plot which is then used to separate clusters of varying densities from noise.
The OPTICS algorithm offers the most flexibility in fine-tuning the clusters that are
detected, though it is computationally intensive, particularly with a large Search
Distance.
For Defined distance (DBSCAN), if the Minimum Features per Cluster cannot be
found within the Search Distance from a particular point, then that point will be marked
as noise. In other words, if the core-distance (the distance required to reach the minimum
number of features) for a feature is greater than the Search Distance, the point is marked
as noise. The Search Distance, when using Defined distance (DBSCAN), is treated as a
search cut-off.
Multi-scale (OPTICS) will search all neighbor distances within the specified Search Distance,
comparing each of them to the core-distance. If any distance is smaller than the core-distance,
then that feature is assigned that core-distance as its reachability distance. If all of the distances
are larger than the core-distance, then the smallest of those distances is assigned as the
reachability distance.
While only Multi-scale (OPTICS) uses the reachability plot to detect clusters, the plot can be used to help explain, conceptually, how these methods differ from each other. A reachability plot reveals clusters of varying densities and separation distances.
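To make the two methods concrete, here is a minimal sketch that runs both on a small 2-D point set using scikit-learn; the library choice, the toy data, and the eps/min_samples/max_eps values are assumptions made for illustration, not part of the notes.

# Minimal sketch: DBSCAN with a fixed search distance vs. OPTICS
# for clusters of varying density. Requires numpy and scikit-learn.
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS

# Two dense groups plus one far-away point that should come out as noise.
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
              [8, 8], [8, 9], [9, 8], [9, 9],
              [25, 25]], dtype=float)

db = DBSCAN(eps=2.0, min_samples=3).fit(X)       # eps acts as the search cut-off
print("DBSCAN labels:", db.labels_)              # label -1 marks noise points

op = OPTICS(min_samples=3, max_eps=10.0).fit(X)  # max_eps bounds the search distance
print("OPTICS labels:", op.labels_)
print("reachability:", np.round(op.reachability_, 2))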
Grid-based methods: STING (Statistical Information Grid)
→ Statistical information for each cell is calculated and stored beforehand and is used to answer queries.
→ Parameters of higher-level cells can easily be calculated from the parameters of the lower-level cells.
→ Remove the irrelevant cells from further consideration.
→ When finished examining the current layer, proceed to the next lower level.
→ Repeat this process until the bottom layer is reached.
Advantages:
1. Query-independent, easy to parallelize, incremental update
2. O(k), where k is the number of grid cells at the lowest level
Disadvantages: All the cluster boundaries are either horizontal or vertical, and no
diagonal boundary is detected.
WaveCluster: clustering by wavelet analysis
A multi-resolution clustering approach that applies a wavelet transform to the feature space. How it finds clusters:
Summarize the data by imposing a multi-dimensional grid structure onto the data space.
The multi-dimensional spatial data objects are represented in an n-dimensional feature space.
Apply a wavelet transform on the feature space to find the dense regions in the feature space.
Model-based clustering methods attempt to optimize the fit between the given data and some
mathematical model. Such methods are often based on the assumption that the data are generated
by a mixture of underlying probability distributions.
Expectation-Maximization (EM):
In practice, each cluster can be represented mathematically by a parametric probability
distribution. The entire data is a mixture of these distributions, where each individual distribution
is typically referred to as a component distribution.
The EM (Expectation-Maximization) algorithm is a popular iterative refinement algorithm that
can be used for finding the parameter estimates. It can be viewed as an extension of the k-means
paradigm, which assigns an object to the cluster with which it is most similar, based on the
cluster mean.
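As a minimal sketch of EM-based clustering, the example below fits a two-component Gaussian mixture with scikit-learn; the library, the 1-D toy data, and the component count are illustrative assumptions.

# Sketch: EM clustering with a mixture of Gaussians.
# Each cluster is one component distribution; EM estimates its parameters.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[2], [3], [4], [10], [11], [12], [20], [25], [30]], dtype=float)

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("means:", gm.means_.ravel())          # estimated component means
print("hard labels:", gm.predict(X))        # most likely component per object
print("soft labels:\n", np.round(gm.predict_proba(X), 2))  # membership probabilities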
Conceptual Clustering
Conceptual clustering is a form of clustering in machine learning that, given a set of unlabeled
objects, produces a classification scheme over the objects. Unlike conventional clustering, which
primarily identifies groups of like objects, conceptual clustering goes one step further by also
finding characteristic descriptions for each group, where each group represents a concept or
class. Hence, conceptual clustering is a two-step process: clustering is performed first, followed
by characterization.
COBWEB is a popular and simple method of incremental conceptual clustering. Its input objects
are described by categorical attribute-value pairs. COBWEB creates a hierarchical clustering in
the form of a classification tree.
Outlier Analysis
“What is an outlier?” Very often, there exist data objects that do not comply with the general
behavior or model of the data. Such data objects, which are grossly different from or inconsistent
with the remaining set of data, are called outliers. Outliers can be caused by measurement or
execution error.
Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This, however, could result in the loss of important hidden information because one
person’s noise could be another person’s signal. In other words, the outliers may be of particular
interest, such as in the case of fraud detection, where outliers may indicate fraudulent activity.
Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier
mining.
Outlier mining has wide applications. As mentioned previously, it can be used in fraud detection,
for example, by detecting unusual usage of credit cards or telecommunication services. In
addition, it is useful in customized marketing for identifying the spending behavior of customers
with extremely low or extremely high incomes, or in medical analysis for finding unusual
responses to various medical treatments.
Over the last few years, the World Wide Web has become a significant source of information
and simultaneously a popular platform for business. Web mining can be defined as the method of utilizing data mining techniques and algorithms to extract useful information directly from the web, such as Web documents and services, hyperlinks, Web content, and server logs. The World Wide Web contains a large amount of data that provides a rich source for data mining. The
objective of Web mining is to look for patterns in Web data by collecting and examining data in
order to gain insights.
Web mining can widely be seen as the application of adapted data mining techniques to the web,
whereas data mining is defined as the application of the algorithm to discover patterns on mostly
structured data embedded into a knowledge discovery process. A distinctive property of Web mining is that it deals with a variety of data types. The web has multiple aspects that yield different approaches for the mining process: web pages consist of text, web pages are linked via hyperlinks, and user activity can be monitored via web server logs. These three features lead to the differentiation between three areas: web content mining, web structure mining, and web usage mining.
In data mining, web terminology pertains to concepts relevant to analyzing web-related data.
This involves web crawling, scraping, URL tokenization, content mining, link analysis, usage
mining, structure mining, and more. Data mining techniques enable understanding user behavior,
preferences, and patterns on websites. Techniques include personalization, anomaly detection,
text mining, and utilizing semantic web concepts. These web-focused approaches allow for the
extraction of valuable insights from the vast amount of data available on the internet.
Web mining has numerous applications in various fields, including business, marketing, e-
commerce, education, healthcare, and more. Some common applications of web mining include -
Fraud Detection -
Web mining is used to detect fraudulent activities, such as credit card fraud, identity
theft, and online scams. This includes analyzing user behavior patterns, detecting
anomalies, and identifying potential security threats.
Social Network Analysis -
Web mining is used to analyze social media data and identify social networks,
communities, and influencers. This information can be used to understand social
dynamics, sentiment analysis, and targeted advertising.
The web mining process itself typically involves the following steps:
Data collection -
Web data is collected from various sources, including web pages, databases, and APIs.
Data pre-processing -
The collected data is pre-processed to remove irrelevant information, such as
advertisements and duplicate content.
Data integration -
The pre-processed data is integrated and transformed into a structured format for
analysis.
Pattern discovery -
Web mining techniques are applied to identify patterns, trends, and relationships.
Evaluation -
The discovered patterns are evaluated to determine their significance and usefulness.
Visualization -
The analysis results are visualized through graphs, charts, and other visualizations.
Web Content Mining is one of the three different types of techniques in Web Mining. In this article, we will purely discuss Web Content Mining. Web content mining is the mining, extraction, and integration of useful data, information, and knowledge from Web page content.
It describes the discovery of useful information from web content. In simple words, it is the application of web mining that extracts relevant or useful content from the Web. Web content mining is related to, but different from, other mining techniques such as data mining and text mining. Due to the heterogeneity and lack of structure of web data, automated discovery of new knowledge patterns can be challenging to some extent.
Web data are generally semi-structured and/or unstructured, while data mining is primarily concerned with structured data. Web content mining scans and mines text, images, and groups of web pages according to the content of the input and displays the resulting list in search engines.
For Example: if the user is searching for a particular song then the search engine will display
or provide suggestions relevant to it.
Web content mining deals with different kinds of data such as text, audio, video, image, etc.
Unstructured Web Data Mining
Unstructured data includes data such as audio, video, etc. We convert this unstructured data into structured data, i.e., into useful or structured information (this is what web content mining does). The conversion process is as follows:
2. Database Approaches:
Used for transforming unstructured data into a more structured, higher-level collection of resources, such as relational databases, and then using standard database querying mechanisms and data mining techniques to access and analyze this information.
Multilevel Databases:
Lowest level - semi-structured information is kept.
Higher levels - generalizations from the lower levels, organized into relations and objects.
Web Query Systems:
Web query systems are developed using query languages such as SQL together with natural language processing for extracting data.
Web content mining has the following problems or challenges also with their solutions, such as:
o Data Extraction: Extraction of structured data from Web pages, such as products and
search results. Extracting such data allows one to provide services. Two main types of
techniques, machine learning and automatic extraction, are used to solve this problem.
o Web Information Integration and Schema Matching: Although the Web contains a
huge amount of data, each website (or even page) represents similar information
differently. Identifying or matching semantically similar data is an important problem
with many practical applications.
o Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs, and chat rooms. Mining opinions is of great importance for marketing intelligence and product benchmarking.
o Knowledge synthesis: Concept hierarchies or ontology are useful in many applications.
However, generating them manually is very time-consuming. The main application is to
synthesize and organize the pieces of information on the web to give the user a coherent
picture of the topic domain. A few existing methods that explore the web's information
redundancy will be presented.
o Segmenting Web pages and detecting noise: In many Web applications, one only wants
the main content of the Web page without advertisements, navigation links, copyright
notices. Automatically segmenting Web pages to extract the pages' main content is an
interesting problem.
The challenge for Web structure mining is to deal with the structure of the hyperlinks within the
web itself. Link analysis is an old area of research. However, with the growing interest in Web
mining, the research of structure analysis has increased. These efforts resulted in a newly
emerging research area called Link Mining, which is located at the intersection of the work in
link analysis, hypertext, web mining, relational learning, inductive logic programming, and graph
mining.
Web structure mining uses graph theory to analyze a website's node and connection structure.
According to the type of web structural data, web structure mining can be divided into two kinds:
The web contains a variety of objects with almost no unifying structure, with differences in the
authoring style and content much greater than in traditional collections of text documents. The
objects in the WWW are web pages, and links are in, out, and co-citation (two pages linked to by
the same page). Attributes include HTML tags, word appearances, and anchor texts. Web
structure mining includes the following terminology, such as:
An example of a technique of web structure mining is the PageRank algorithm used by Google
to rank search results. A page's rank is decided by the number and quality of links pointing to the
target node.
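Below is a minimal power-iteration sketch of the PageRank idea on a toy four-page link graph; the graph, the damping factor of 0.85, and the helper name pagerank are assumptions made for illustration, not Google's actual implementation.

# Sketch: PageRank by power iteration on a tiny link graph.
# links[p] is the list of pages that page p points to (no dangling pages here).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outgoing in links.items():
            share = damping * rank[p] / len(outgoing)   # rank flows along out-links
            for q in outgoing:
                new_rank[q] += share
        rank = new_rank
    return rank

for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))   # C ranks highest: it has the most in-links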
Link mining has produced some new variations of traditional data mining tasks. Below we summarize some of these link mining tasks that are applicable in Web structure mining:
1. Link-based Classification: The most recent upgrade of a classic data mining task to linked domains. The task is to predict the category of a web page based on the words that occur on the page, the links between pages, anchor text, HTML tags, and other possible attributes found on the web page.
2. Link-based Cluster Analysis: The data is segmented into groups, where similar objects
are grouped together, and dissimilar objects are grouped into different groups. Unlike the
previous task, link-based cluster analysis is unsupervised and can be used to discover
hidden patterns from data.
3. Link Type: There is a wide range of tasks concerning predicting the existence of links,
such as predicting the type of link between two entities or predicting the purpose of a
link.
4. Link Strength: Links could be associated with weights.
5. Link Cardinality: The main task is to predict the number of links between objects.
Web structure mining and page categorization are used for:
o Finding related pages.
o Finding duplicated websites and determining the similarity between them.
Web Usage Mining focuses on techniques that can predict the behavior of users while they are interacting with the WWW. Web usage mining discovers user navigation patterns from web data and tries to extract useful information from the secondary data derived from users' interactions while surfing the web. Web usage mining collects data from Weblog records to discover user access patterns of web pages. Several research projects and commercial tools analyze those patterns for different purposes. The resulting knowledge can be utilized in personalization, system improvement, site modification, business intelligence, and usage characterization.
The only information left behind by many users visiting a Web site is the path through the pages they have accessed. Most Web information retrieval tools use only textual information and ignore the link information, which can be very valuable. In general, four kinds of data mining techniques are applied in the web usage mining domain to discover user navigation patterns:
1. Association Rules
Association rules are the most basic data mining method and are used more than any other method in web usage mining. This method enables a website to organize its content more efficiently or to provide recommendations for effective cross-selling of products.
These rules are statements of the form X => Y, where X and Y are sets of items occurring in a series of transactions. The rule X => Y states that transactions that contain the items in X are likely to also contain the items in Y. In web usage mining, association rules are used to find relationships between pages that frequently appear together in user sessions.
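As a small illustration, the sketch below computes the support and confidence of a rule X => Y over a handful of toy user sessions; the session data and helper names are assumptions made for the example.

# Sketch: support and confidence of the rule {X} => {Y} over user sessions.
sessions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "blog"},
    {"products", "cart"},
]

def support(itemset, sessions):
    return sum(itemset <= s for s in sessions) / len(sessions)

def confidence(x, y, sessions):
    return support(x | y, sessions) / support(x, sessions)

x, y = {"products"}, {"cart"}
print("support({products, cart}) =", support(x | y, sessions))      # 0.5
print("confidence(products => cart) =", confidence(x, y, sessions)) # 2/3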
2. Sequential Patterns
Sequential patterns are used to discover subsequences in a large volume of sequential data. In web usage mining, sequential patterns are used to find user navigation patterns that frequently appear in sessions. Sequential patterns may look like association rules, but they also include time: the order in which the events occurred is part of the pattern. Algorithms used to extract association rules can also be adapted to generate sequential patterns. Two types of algorithms are used for mining sequential patterns.
o The first type of algorithm is based on association rule mining. Many common sequential pattern mining algorithms are modified association rule mining algorithms. For example, GSP and AprioriAll are two variants of the Apriori algorithm that are used to extract sequential patterns. However, some researchers believe that association rule mining algorithms do not perform well enough when mining long sequential patterns.
o In the second type of sequential pattern mining algorithm, tree structures and Markov chains are used to represent navigation patterns. For example, in one of these algorithms, called WAP-mine, a tree structure called the WAP-tree is used to explore web access patterns. Evaluation results show that its performance is higher than that of algorithms such as GSP. (The ordering property that distinguishes sequential patterns is sketched below.)
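A minimal sketch of this ordering requirement follows: it counts how many sessions contain a given page sequence in order. The toy sessions and the helper contains_in_order are illustrative assumptions.

# Sketch: a sequential pattern must occur in order, unlike an association rule.
def contains_in_order(session, pattern):
    it = iter(session)
    # Every element of the pattern must appear, in order, somewhere in the session.
    return all(page in it for page in pattern)

sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "blog", "products"],
    ["products", "home", "cart"],
]

pattern = ["home", "products", "cart"]
count = sum(contains_in_order(s, pattern) for s in sessions)
print(count)   # 1: only the first session visits the pages in this order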
3. Clustering
Clustering techniques identify groups of similar items among high volumes of data. This is done using distance functions that measure the degree of similarity between different items. Clustering in web usage mining is used for grouping similar sessions. What is important in this type of analysis is the relation between individual users and user groups. Two types of interesting clustering can be found in this area: user clustering and page clustering.
Clustering of user records is usually used in web mining and web analytics tasks. The knowledge derived from clustering is used, for example, to partition the market in e-commerce. Different methods and techniques are used for clustering, including:
o Using a similarity graph and the amount of time spent viewing a page to estimate the similarity of sessions.
In other clustering methods, repetitive patterns are first extracted from the user sessions using association rules. These patterns are then used to construct a graph whose nodes are the visited pages. The edges of the graph connect two or more pages; if those pages occur together in an extracted pattern, a weight is assigned to the edge to show the strength of the relationship between the nodes. For clustering, this graph is then recursively partitioned until groups of user behavior are detected.
4. Classification Mining
Discovering classification rules allows one to develop a profile of items belonging to a particular group according to their common attributes. This profile can be used to classify new data items added to the database. In Web mining, classification techniques allow one to develop a profile for clients who access particular server files, based on demographic information available about those clients or on their navigation patterns.
Advantages
Web usage mining has many advantages, making this technology attractive to corporations,
including government agencies.
o There are also elements unique to web usage mining that show the technology's benefits.
These include the way semantic knowledge is applied when interpreting, analyzing and
reasoning about usage patterns during the mining phase.
Disadvantages
Web usage mining by itself does not create issues, but when used on data of personal nature, this
technology might cause concerns.
o The most criticized ethical issue involving web usage mining is the invasion of privacy.
Privacy is considered lost when information concerning an individual is obtained, used,
or disseminated, especially if this occurs without the individual's knowledge or consent.
The obtained data will be analyzed, made anonymous, and then clustered to form
anonymous profiles.
o These applications de-individualize users by judging them by their mouse clicks rather
than by identifying information. De-individualization, in general, can be defined as a
tendency to judge and treat people based on group characteristics instead of on their
characteristics and merits.
o The companies collecting the data for a specific purpose might use the data for totally
different purposes, violating the user's interests.
The main objective of web usage mining is to collect data about users' navigation patterns. This information can be used to improve websites from the user's point of view. There are three main applications of this mining:
1. Personalization
Web usage mining techniques can be used for the personalization of web users. For example, a user's behavior can be predicted by comparing their current browsing patterns with those extracted from the log files. Recommendation systems are a real application in this area: they suggest links that direct the user to their favorite pages. Some sites also organize their product catalogs based on the predicted interests of a specific user and present them accordingly.
2. Pre-fetching and caching (system improvement)
The results of web usage mining can be used to improve the performance of Web servers and Web-based applications. Web usage mining can inform pre-fetching and caching strategies and thus reduce the response time of Web servers.
3. Site modification
Usability is one of the most important issues in designing and implementing websites. The results of web usage mining can help produce an appropriate design for websites. Adaptive websites are an application of this type of mining: their content and structure are dynamically reorganized based on the data derived from user behavior on the site.
Google is the most commonly used internet search engine. Google search takes place in the
following three stages:
1. Crawling. Crawlers discover what pages exist on the web. A search engine
constantly looks for new and updated pages to add to its list of known pages. This is
referred to as URL discovery. Once a page is discovered, the crawler examines its
content. The search engine uses an algorithm to choose which pages to crawl and how often (a toy crawl loop is sketched after this list).
2. Indexing. After a page is crawled, the textual content is processed, analyzed and
tagged with attributes and metadata that help the search engine understand what the
content is about. This also enables the search engine to weed out duplicate pages and
collect signals about the content, such as the country or region the page is local to and
the usability of the page.
3. Searching and ranking. When a user enters a query, the search engine searches the
index for matching pages and returns the results that appear the most relevant on the
search engine results page (SERP). The engine ranks content on a number of factors,
such as the authoritativeness of a page, back links to the page and keywords a page
contains.
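A highly simplified sketch of the crawl stage (URL discovery plus link extraction), using only the Python standard library, is shown below; the seed URL, the page limit, and the helper names are assumptions for illustration, and a real crawler is far more elaborate.

# Sketch: a toy crawler that discovers URLs and extracts links from fetched pages.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":                      # hyperlinks drive URL discovery
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=5):
    frontier, seen = [seed], set()
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except (OSError, ValueError):
            continue                        # unreachable or malformed URLs are skipped
        parser = LinkExtractor()
        parser.feed(html)
        frontier.extend(urljoin(url, link) for link in parser.links)
    return seen

print(crawl("https://example.com/"))        # pages discovered from the seed URL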
Specialized content search engines are more selective about the parts of the web they crawl and
index. For example, Creative Commons Search is a search engine for content shared explicitly
for reuse under Creative Commons license. This search engine only looks for that specific type
of content.
Country-specific search engines may prioritize websites presented in the native language of the
country over English websites. Individual websites, such as large corporate sites, may use a
search engine to index and retrieve only content from that company's site. Some of the major
search engine companies license or sell their search engines for use on individual sites.
Search engines crawl, index and rank content across the internet, using algorithms to decide placement on results pages.
Not every search engine ranks content the same way, but some have similar ranking algorithms.
Google search and other search engines like it rank relevant results based on the following
criteria:
Query meaning. The search engine looks at user queries to establish searcher intent,
which is the specific type of information the user is looking for. Search engines use
language models to do this. Language models are algorithms that read user input,
understand what it means and determine the type of information that a user is looking
for.
Usability. Search engines evaluate the accessibility and general user experience of
content and reward content with better page experience. One example of page
usability is mobile-friendliness, which is a measure of how easy a webpage is to use
on a mobile device.
User data. A user's past search history, search settings and location data are a few of
the data types search engines use to determine the content rankings they choose.
Search engines might use other website performance metrics, such as bounce rate and time spent
on page, to determine where websites rank on a results page. Search engines might return
different results for the same term searched as text-based content versus an image or video
search.
Search engines often provide links to videos on their search engine results pages.
Content creators use search engine optimization (SEO) to take advantage of the above processes.
Optimizing the content on a page for search engines increases its visibility to searchers and its
ranking on the SERP. For example, a content creator could insert keywords relevant to a given
search query to improve results for that query. If the content creator wants people searching for
dogs to land on their page, they might add the keywords bone, leash and hound. They might also
include links to pages that Google deems authoritative.
The primary goal of a search engine is to help people search for and find information. Search
engines are designed to provide people with the right information based on a set of criteria, such
as quality and relevance.
Webpage and website providers use search engines to make money and to collect data, such
as clickstream data, about searchers. These are secondary goals that require users to trust that the
content they are getting on a SERP is enough to engage with it. Users must see the information
they're getting is the right information.
Organic results. Unpaid organic results are seen as more trustworthy than paid, ad-
based results.
Search engines return both organic and paid results; the two differ in several ways.
How do search engines make money?
User data. Search engines also make money from the user data that they collect.
Examples include search history and location data. This data is used to create a digital
profile for a given searcher, which search engine providers can use to serve targeted
ads to that user.
Contextual ads. Search engines also capitalize on serving up contextual ads that are
directly related to the user's current search. If a search engine includes a shopping
feature on the platform, it might display contextual ads for products related to the
user's search in the sidebar of a website where advertisements are displayed. For
example, if the online store sells books, an ad may appear in the corner of the page
for reading glasses.
Donations. Some search engines are designed to help nonprofits solicit donations.
Affiliate links. Some engines include affiliate links, where the search engine has a
partnership in which the partner pays the search engine when a user clicks the
partner's link.
Search engines personalize results based on digital searcher profiles created from user data. User
data is collected from the application or device a user accesses the search engine with. User data
collected includes the following:
search history
location information
audio data
user ID
device identification
IP address
contact lists
purchase history
Cookies are used to track browsing history and other data. They are small text files sent from the
websites a user visits to their web browser. Search engines use cookies to track user preferences
and personalize results and ads. They are able to remember settings, such as passwords, language
preferences, content filters, how many results per page and session information.
Using private browsing settings or incognito browsing protects users from tracking but only at
the device level. Search history and other information accumulated during search is not saved
and is deleted after the search session. However, internet service providers, employers and the
domain owners of the websites visited are able to track digital information left behind during a
search.
Characteristics
The main purpose of a Web Search Engine is to provide website listings that are being sought by the user. To do this, the search engine usually collects words from the user that it then matches with websites to produce results. However, this process of collecting words and matching is not a simple exercise, because the engine has to know the 'stress' factor on each word. So different search engine technologies use different word resolution methods. In the Zapaat Search Engine, for example, some characteristics are:
1. Context : Words that define the type of websites that the user is interested in.
2. Keywords : Words to look for in particular websites that match Context.
3. Layering: Define a context within a context to narrow the result set.
4. Connected Words: Define adjacent words as keywords.
Crawling
The crawler, or web spider, is a vital software component of the search engine. It essentially
sorts through the Internet to find website addresses and the contents of a website for storage in
the search engine database. Crawling can scan brand new information on the Internet or it can
locate older data. Crawlers have the ability to search a wide range of websites at the same time
and collect large amounts of information simultaneously. This allows the search engine to find
current content on an hourly basis. The web spider crawls until it cannot find any more
information within a site, such as further hyperlinks to internal or external pages.
Indexing
Once the search engine has crawled the contents of the Internet, it indexes that content based on
the occurrence of keyword phrases in each individual website. This allows a particular search
query and subject to be found easily. Keyword phrases are the particular group of words used by
an individual to search a particular topic.
The indexing function of a search engine first excludes any unnecessary and common articles
such as "the," "a" and "an." After eliminating common text, it stores the content in an organized
way for quick and easy access. Search engine designers develop algorithms for searching the
web according to specific keywords and keyword phrases. Those algorithms match user-
generated keywords and keyword phrases to content found within a particular website, using the
index.
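A minimal sketch of this indexing step follows: stop-word removal and an inverted index that maps keywords to the documents containing them. The document snippets and the stop-word list are illustrative assumptions.

# Sketch: build an inverted index after removing common stop words,
# then answer a keyword query by intersecting posting lists.
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "for"}

documents = {
    "page1": "the history of the search engine",
    "page2": "an engine for searching the web",
    "page3": "the web and the internet",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        if word not in STOP_WORDS:
            index[word].add(doc_id)          # posting list: word -> documents

def search(query):
    words = [w for w in query.lower().split() if w not in STOP_WORDS]
    postings = [index.get(w, set()) for w in words]
    return set.intersection(*postings) if postings else set()

print(search("the web"))          # {'page2', 'page3'}
print(search("search engine"))    # {'page1'}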
Storage
Storing web content within the database of the search engine is essential for fast and easy
searching. The amount of content available to the user is dependent on the amount of storage
space available. Larger search engines like Google and Yahoo are able to store amounts of data
ranging in the terabytes, offering a larger source of information available for the user.
Results
Results are the hyperlinks to websites that show up in the search engine page when a certain
keyword or phrase is queried. When you type in a search term, the crawler runs through the
index and matches what you typed with other keywords. Algorithms created by the search engine
designers are used to provide the most relevant data first. Each search engine has its own set of
algorithms and therefore returns different results.
Ranking Algorithms
The ranking algorithm is used by Google to rank web pages according to the Google search algorithm. Ranking features such as the criteria described above (query meaning, usability, and user data) affect the search results.
Enterprise Search
Popular search engines like Google and Bing are so enmeshed in our everyday lives that they have become synonymous with search in most of our minds. However, though web search and enterprise search are broadly comparable, they work in quite different ways and serve distinct purposes.
Enterprise search tools are for use by employees. They retrieve information from all types of data
that an organization stores, including both structured data, which is found in databases, and
unstructured data that takes the form of documents like PDFs and media.
The term “enterprise search” describes the software used to search for information inside a
corporate organization. The technology identifies and enables the indexing, searching and
display of specific content to authorized users across the enterprise.
IT industry analysts have shared that enterprise search is growing into something new. In 2017,
for instance, Gartner created a new enterprise search category called “Insight Engines.” These
solutions help businesses synthesize information interactively, or even proactively, by ingesting,
organizing and analyzing data. Forrester, another prominent analyst firm, defines this new
category as “Cognitive Search.”
How does enterprise search work?
Content is the raw material for enterprise search
More and more data to analyze, structure and classify
Data becomes more pervasive within a business as the organization grows. There can be a huge
proliferation of product information, process information, marketing content and so forth.
Individual teams create content, which then inevitably spreads across the enterprise.
Diversity of data
The information found inside large organizations tends to be highly diverse and fragmented. It’s
invariably hosted on a broad range of repositories and enterprise applications. These include
Content Management Systems (CMS’s), Enterprise Resource Planning solutions (ERP),
Customer Relationship Management (CRM), Relational Database Management Systems
(RDBMS’s), file systems, archives, data lakes, email systems, websites, intranets and social
networks as well as both private and public cloud platforms.
The data comes from a variety of sources. Structured and unstructured data are kept in different
“containers.”
Exploration – Here, the enterprise search engine software crawls all data sources,
gathering information from across the organization and its internal and external data
sources.
Indexing – After the data has been recovered, the enterprise search platform performs
analysis and enrichment of the data by tracking relationships inside the data—and then
storing the results so as to facilitate accurate, quick information retrieval.
Search – On the front end, employees request information in their native languages. The enterprise search platform then offers answers, in the form of content and pieces of content, that appear to be the most relevant to the query. The query response also factors in the employee's work context: different people may get different answers that relate to their work and search histories.
Techniques like Natural Language Processing (NLP) and Machine Learning are often involved
in determining relevant answers to queries.
Machine Learning
Machine learning applies AI to give systems the ability to learn and improve from experience,
automatically, without the need to be programmed explicitly. It focuses on creating computer
software that can access data and then make use of it for learning purposes.
“The knowledge worker spends about 2.5 hours per day, or roughly 30% of the workday,
searching for information.” – IDC
“The research found that on average, workers in both the U.K. and U.S. spent up to 25
minutes looking for a single document in over a third of searches conducted.”
– SearchYourCloud
“The average digital worker spends an estimated 28 percent of the workweek searching
e-mail and nearly 20 percent looking for internal information or tracking down colleagues
who can help with specific tasks.” – McKinsey & Company
Enterprise search software reduces the time employees require to find the necessary information.
As a result, it opens up work schedules for more high-value tasks. This improvement is
particularly important given the current emphasis on getting optimal performance out of teams in
lean, digital, agile organizations.
Customer service – Giving customer service representatives the ability to quickly and
easily find the information they need to deliver excellent customer service.
Contact experts – Letting employees search for experts and filter results according to
expertise and knowledge.
Talent search – Matching candidates with job descriptions from a database of potential
candidates.
Intranet search – helping intranet users locate information they need from shared drives
and databases.
Insight engines – Leveraging AI to detect relationships between people, content and data
as well as connections between user interests and current and past search queries.
What are the main criteria to select an enterprise search software?
Connectors
How many data connectors will an enterprise search engine need for the data sources it has to
index? The best practice is to include the sources that are likely to be indexed in the future in
addition to what is planned for current indexing. If a company plans to decommission a data
source in a year or so, however, it may want to exclude it from the connection and indexing
processes. This is particularly true if the data is going to be migrated to a new source.
The following enterprise search platform characteristics and features help make sure that information and documents are only accessible to users with the right permissions (a toy security-filtering sketch follows this list):
Protecting content from malicious actors using built-in encryption in the indexing
pipeline
Controlling access on a per-user basis and using security filters for indexed content
Using multilayer security across the cloud, on-premises data centers, intranets and
operations
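A toy sketch of per-user security trimming of search results follows; the documents, groups, and helper functions are all illustrative assumptions rather than features of any particular product.

# Sketch: security trimming - a query only returns documents the user may see.
documents = {
    "hr-salaries": {"text": "salary bands for 2024", "allowed": {"hr_team"}},
    "benefits-faq": {"text": "employee benefits overview", "allowed": {"all_staff"}},
    "q3-roadmap": {"text": "product roadmap for q3", "allowed": {"product", "exec"}},
}

user_groups = {
    "alice": {"all_staff", "hr_team"},
    "bob": {"all_staff", "product"},
}

def search(user, keyword):
    groups = user_groups.get(user, set())
    # A document is returned only if it matches the keyword AND the user's
    # groups overlap with the document's access control list.
    return [doc_id for doc_id, doc in documents.items()
            if keyword in doc["text"] and groups & doc["allowed"]]

print(search("alice", "salary"))   # ['hr-salaries']
print(search("bob", "salary"))     # []  - filtered out by permissions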
Intelligent search or Predictive AI
Predictive AI is seen as the future of enterprise search engines. With self-learning algorithms
embedded in enterprise search tools, it is possible to innovate by learning from users and
improving results based on their usage patterns. Furthermore, by using custom APIs that are
designed to make search tools work optimally for a given audience, it is possible to deliver fine-
tuned results that improve over time.
A digital workplace solution to improve productivity
The exploding data and hours of time that employees waste looking for what they need has other
ramifications, as well. Without a reliable way to search through and contextualize all the
structured and unstructured data that exists, insights are routinely missed, and the value of the
data is lost.
Employees often need to ask colleagues for help finding the information they need, wasting
additional time and resources and slowing progress. And in the end, the digital workplace that
was meant to facilitate more creative, nonroutine work actually ends up producing the inverse.
An enterprise search solution can solve this information crisis. It can search and retrieve data
regardless of format, type, language, and location. But more than that, it can use AI to
understand the context of each piece and match it to the search intent. And the more data it is
fed, the more it learns, returning better results with each query.
To the end user, it is a simple and familiar experience that delivers powerful results. For
businesses overall, it’s a key building block in their digital transformation.