Unit 2


Outlier:

An outlier is a data point that significantly deviates from the rest of the data. It can be either
much higher or much lower than the other data points, and its presence can have a
significant impact on the results of machine learning algorithms. They can be caused by
measurement or execution errors. The analysis of outlier data is referred to as outlier
analysis or outlier mining.
Types of Outliers
There are two main types of outliers:
 Global outliers: Global outliers are isolated data points that are far away from the main
body of the data. They are often easy to identify and remove.
 Contextual outliers: Contextual outliers are data points that are unusual in a specific
context but may not be outliers in a different context. They are often more difficult to
identify and may require additional information or domain knowledge to determine
their significance.

Algorithm
1. Calculate the mean of each cluster
2. Initialize the Threshold value
3. Calculate the distance of the test data from each cluster mean
4. Find the nearest cluster to the test data
5. If (Distance > Threshold) then, Outlier
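
A minimal sketch of this algorithm in Python with NumPy, on a hypothetical two-cluster dataset (the cluster values and the threshold of 3.0 are illustrative choices, not values from the text):

import numpy as np

# Toy data: two clusters plus one test point far away from both (hypothetical values).
cluster_a = np.array([[1.0, 1.2], [0.9, 1.1], [1.1, 0.8]])
cluster_b = np.array([[8.0, 8.1], [7.9, 8.3], [8.2, 7.8]])
test_point = np.array([20.0, 20.0])

# Step 1: mean of each cluster.
means = [cluster_a.mean(axis=0), cluster_b.mean(axis=0)]

# Step 2: initialize a threshold (hypothetical value chosen for this toy data).
threshold = 3.0

# Step 3: distance of the test point from each cluster mean.
distances = [np.linalg.norm(test_point - m) for m in means]

# Step 4: nearest cluster.
nearest = int(np.argmin(distances))

# Step 5: flag as outlier if the distance to the nearest cluster exceeds the threshold.
is_outlier = distances[nearest] > threshold
print(f"nearest cluster: {nearest}, distance: {distances[nearest]:.2f}, outlier: {is_outlier}")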

Outlier Detection Methods in Machine Learning


Outlier detection plays a crucial role in ensuring the quality and accuracy of machine
learning models. By identifying and removing or handling outliers effectively, we can
prevent them from biasing the model, reducing its performance, and hindering its
interpretability. Here’s an overview of various outlier detection methods:
1. Statistical Methods:
 Z-Score: This method measures how many standard deviations a data point lies from the mean and identifies outliers as those with Z-scores exceeding a certain threshold (typically 3 or -3).
 Interquartile Range (IQR): IQR identifies outliers as data points falling outside the
range defined by Q1-k*(Q3-Q1) and Q3+k*(Q3-Q1), where Q1 and Q3 are the first and
third quartiles, and k is a factor (typically 1.5).
2. Distance-Based Methods:
 K-Nearest Neighbors (KNN): KNN identifies outliers as data points whose K nearest
neighbors are far away from them.
 Local Outlier Factor (LOF): This method calculates the local density of data points
and identifies outliers as those with significantly lower density compared to their
neighbors.

3. Clustering-Based Methods:
 Density-Based Spatial Clustering of Applications with Noise (DBSCAN): DBSCAN clusters data points based on their density and identifies outliers as points not belonging to any cluster.
 Hierarchical clustering: Hierarchical clustering involves building a hierarchy of
clusters by iteratively merging or splitting clusters based on their similarity. Outliers
can be identified as clusters containing only a single data point or clusters significantly
smaller than others.
4. Other Methods:
 Isolation Forest: Isolation forest randomly isolates data points by splitting features and
identifies outliers as those isolated quickly and easily.
 One-class Support Vector Machines (OCSVM): One-Class SVM learns a boundary
around the normal data and identifies outliers as points falling outside the boundary.
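
As a rough illustration of how some of the methods above can be applied, the sketch below flags outliers in a hypothetical one-dimensional sample using the Z-score rule (threshold 3), the IQR rule (factor 1.5), and scikit-learn's LOF and Isolation Forest implementations (scikit-learn is assumed to be installed):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

# Hypothetical 1-D sample: 50 "normal" points around 10 plus one extreme value.
rng = np.random.default_rng(0)
x = np.append(rng.normal(loc=10, scale=1, size=50), 50.0)

# Z-score: flag points more than 3 standard deviations from the mean.
z = (x - x.mean()) / x.std()
print("Z-score outliers:", x[np.abs(z) > 3])

# IQR: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print("IQR outliers:", x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])

# Density and ensemble methods: LOF and Isolation Forest label outliers as -1.
X = x.reshape(-1, 1)
print("LOF outliers:", X[LocalOutlierFactor().fit_predict(X) == -1].ravel())
print("Isolation Forest outliers:", X[IsolationForest(random_state=0).fit_predict(X) == -1].ravel())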
Techniques for Handling Outliers in Machine Learning
Outliers, data points that significantly deviate from the majority, can have detrimental
effects on machine learning models. To address this, several techniques can be employed to
handle outliers effectively:
1. Removal:
 This involves identifying and removing outliers from the dataset before training the
model. Common methods include:
 Thresholding: Outliers are identified as data points exceeding a certain
threshold (e.g., Z-score > 3).
 Distance-based methods: Outliers are identified based on their distance
from their nearest neighbors.
 Clustering: Outliers are identified as points not belonging to any cluster or
belonging to very small clusters.
2. Transformation:
 This involves transforming the data to reduce the influence of outliers. Common
methods include:
 Scaling: Standardizing or normalizing the data to have a mean of zero and a
standard deviation of one.
 Winsorization: Replacing outlier values with the nearest non-outlier value.
 Log transformation: Applying a logarithmic transformation to compress the
data and reduce the impact of extreme values.
3. Robust Estimation:
 This involves using algorithms that are less sensitive to outliers. Some examples
include:
 Robust regression: Algorithms like L1-regularized regression or Huber
regression are less influenced by outliers than least squares regression.
 M-estimators: These algorithms estimate the model parameters based on a robust objective function that down-weights the influence of outliers.
 Outlier-insensitive clustering algorithms: Algorithms like DBSCAN are
less susceptible to the presence of outliers than K-means clustering.
4. Modeling Outliers:
 This involves explicitly modeling the outliers as a separate group. This can be done by:
 Adding a separate feature: Create a new feature indicating whether a data
point is an outlier or not.
 Using a mixture model: Train a model that assumes the data comes from a
mixture of multiple distributions, where one distribution represents the
outliers.
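
A brief sketch of techniques 1, 2 and 4 on hypothetical data using pandas (the Z-score threshold of 3 and the 5th/95th-percentile winsorization limits are illustrative choices):

import numpy as np
import pandas as pd

# Hypothetical data: 100 "normal" values around 50 plus two extreme values.
rng = np.random.default_rng(0)
s = pd.Series(np.append(rng.normal(loc=50, scale=5, size=100), [250.0, 300.0]))

# 1. Removal: drop points whose absolute Z-score exceeds 3.
z = (s - s.mean()) / s.std()
removed = s[z.abs() <= 3]

# 2. Transformation: winsorization (clip to the 5th/95th percentiles) and log transform.
winsorized = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))
logged = np.log(s)

# 4. Modeling outliers: keep every point but add a flag feature instead of discarding.
flagged = pd.DataFrame({"value": s, "is_outlier": z.abs() > 3})

print("points kept after removal:", len(removed), "of", len(s))
print("largest value after winsorization:", round(winsorized.max(), 1))
print("flagged outliers:", flagged.loc[flagged.is_outlier, "value"].tolist())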

Importance of outlier detection in machine learning


Outlier detection is important in machine learning for several reasons:
1. Biased models: Outliers can bias a machine learning model towards the outlier
values, leading to poor performance on the rest of the data. This can be particularly
problematic for algorithms that are sensitive to outliers, such as linear regression.
2. Reduced accuracy: Outliers can introduce noise into the data, making it difficult for a
machine learning model to learn the true underlying patterns. This can lead to reduced
accuracy and performance.
3. Increased variance: Outliers can increase the variance of a machine learning
model, making it more sensitive to small changes in the data. This can make it difficult
to train a stable and reliable model.
4. Reduced interpretability: Outliers can make it difficult to understand what a machine
learning model has learned from the data. This can make it difficult to trust the model’s
predictions and can hamper efforts to improve its performance.

Semantic Analysis

Semantic Analysis is a subfield of Natural Language Processing (NLP) that attempts to understand the meaning of Natural Language. Understanding Natural Language might seem
a straightforward process to us as humans. However, due to the vast complexity and
subjectivity involved in human language, interpreting it is quite a complicated task for
machines. Semantic Analysis of Natural Language captures the meaning of the given text
while taking into account context, logical structuring of sentences and grammar roles.
Parts of Semantic Analysis
Semantic Analysis of Natural Language can be classified into two broad parts:
1. Lexical Semantic Analysis: Lexical Semantic Analysis involves understanding
the meaning of each word of the text individually. It basically refers to fetching the
dictionary meaning that a word in the text is intended to carry.
2. Compositional Semantics Analysis: Although knowing the meaning of each
word of the text is essential, it is not sufficient to completely understand the meaning of the
text.
For example, consider the following two sentences:
 Sentence 1: Students love GeeksforGeeks.
 Sentence 2: GeeksforGeeks loves Students.
Both sentences use exactly the same words, yet they convey different meanings; compositional semantics therefore considers how the words are combined and structured, not just their individual dictionary meanings.

Tasks involved in Semantic Analysis


In order to understand the meaning of a sentence, the following are the major processes
involved in Semantic Analysis:
1. Word Sense Disambiguation
2. Relationship Extraction
Word Sense Disambiguation:
In Natural Language, the meaning of a word may vary as per its usage in sentences
and the context of the text. Word Sense Disambiguation involves interpreting the meaning
of a word based upon the context of its occurrence in a text.
For example, the word ‘Bark’ may mean ‘the sound made by a dog’ or ‘the outermost layer
of a tree.’
Likewise, the word ‘rock’ may mean ‘a stone‘ or ‘a genre of music‘ – hence, the
accurate meaning of the word is highly dependent upon its context and usage in the text.
Thus, the ability of a machine to overcome the ambiguity involved in identifying the
meaning of a word based on its usage and context is called Word Sense Disambiguation.
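
As a small illustration, the Lesk algorithm available in NLTK chooses a WordNet sense for an ambiguous word by comparing its context with the dictionary glosses of the candidate senses. A sketch on the 'bark' example (assumes the WordNet corpus has been downloaded; the sentences are illustrative):

from nltk.wsd import lesk

# One-time download may be required: import nltk; nltk.download('wordnet')
for sentence in ["The dog let out a loud bark at the stranger",
                 "The bark of the old oak tree was rough and cracked"]:
    sense = lesk(sentence.split(), "bark")
    print(sentence)
    print("  ->", sense, "-", sense.definition() if sense else "no sense found")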
Relationship Extraction:
Another important task involved in Semantic Analysis is Relationship Extraction. It involves first identifying the various entities present in the sentence and then extracting the relationships between those entities.

(Figure: an example sentence annotated with its entities and the relationships between them.)

Elements of Semantic Analysis


Some of the critical elements of Semantic Analysis that must be scrutinized and taken into
account while processing Natural Language are:
 Hyponymy: Hyponymy refers to a term that is an instance of a generic term. The relation can be understood by taking class-object as an analogy. For example: ‘Color‘ is a hypernym while ‘grey‘, ‘blue‘, ‘red‘, etc., are its hyponyms.
 Homonymy: Homonymy refers to two or more lexical terms with the same spelling but completely distinct meanings. For example: ‘Rose‘ might mean ‘the past form of rise‘ or ‘a flower‘ – same spelling but different meanings; hence, ‘rose‘ is a homonym.
 Synonymy: When two or more lexical terms that might be spelt distinctly have the same or similar meaning, they are called synonyms. For example: (Job, Occupation), (Large, Big), (Stop, Halt).
 Antonymy: Antonymy refers to a pair of lexical terms that have contrasting meanings – they are symmetric about a semantic axis. For example: (Day, Night), (Hot, Cold), (Large, Small).
 Polysemy: Polysemy refers to lexical terms that have the same spelling but multiple closely related meanings. It differs from homonymy because in homonymy the meanings need not be closely related. For example: ‘man‘ may mean ‘the human species‘, ‘a male human‘ or ‘an adult male human‘ – since all these meanings bear a close association, the lexical term ‘man‘ is polysemous.
 Meronymy: Meronymy refers to a relationship wherein one lexical term is a constituent part of some larger entity. For example: ‘Wheel‘ is a meronym of ‘Automobile‘
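
These lexical relations can also be explored programmatically. The sketch below uses WordNet through NLTK to list hyponyms, synonyms, antonyms and meronyms; the particular synset identifiers (such as 'color.n.01') are assumptions about how WordNet names these senses:

from nltk.corpus import wordnet as wn

# One-time download may be required: import nltk; nltk.download('wordnet')

color = wn.synset('color.n.01')
print("Some hyponyms of 'color':", [s.name() for s in color.hyponyms()][:5])

print("Synonyms of 'large':", wn.synset('large.a.01').lemma_names())
print("Antonyms of 'hot':", [a.name() for a in wn.synset('hot.a.01').lemmas()[0].antonyms()])

car = wn.synset('car.n.01')
print("Some meronyms (parts) of 'car':", [s.name() for s in car.part_meronyms()][:5])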
Meaning Representation
While, as humans, it is pretty simple for us to understand the meaning of textual
information, it is not so in the case of machines. Thus, machines tend to represent the text
in specific formats in order to interpret its meaning. This formal structure that is used to
understand the meaning of a text is called meaning representation.
Basic Units of Semantic System:
In order to accomplish Meaning Representation in Semantic Analysis, it is vital to
understand the building units of such representations. The basic units of semantic systems
are explained below:
1. Entity: An entity refers to a particular unit or individual, such as a person or a location.
2. Concept: A Concept may be understood as a generalization of entities. It refers to a
broad class of individual units. For example Learning Portals, City, Students.
3. Relations: Relations help establish relationships between various entities and concepts.
For example: ‘GeeksforGeeks is a Learning Portal’, ‘Delhi is a City.’, etc.
4. Predicate: Predicates represent the verb structures of the sentences.
In Meaning Representation, we employ these basic units to represent textual information.
Approaches to Meaning Representations:
Now that we are familiar with the basic understanding of Meaning Representations, here
are some of the most popular approaches to meaning representation:
1. First-order predicate logic (FOPL)
2. Semantic Nets
3. Frames
4. Conceptual dependency (CD)
5. Rule-based architecture
6. Case Grammar
7. Conceptual Graphs
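
As a small illustration of the first approach, the example relations given above could be written in first-order predicate logic roughly as follows (the predicate names are illustrative, not a standard notation):

IsA(GeeksforGeeks, LearningPortal)
IsA(Delhi, City)
∀x (Student(x) → Loves(x, GeeksforGeeks))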
Semantic Analysis Techniques
Based upon the end goal one is trying to accomplish, Semantic Analysis can be used in
various ways. Two of the most common Semantic Analysis techniques are:
Text Classification
In Text Classification, our aim is to label the text according to the insights we intend to gain from the textual data.
For example:
 In Sentiment Analysis, we try to label the text with the prominent emotion it conveys. It is highly beneficial when analyzing customer reviews for improvement.
 In Topic Classification, we try to categorize our text into some predefined categories. For example: identifying whether a research paper belongs to Physics, Chemistry or Maths.
 In Intent Classification, we try to determine the intent behind a text message. For
example: Identifying whether an e-mail received at customer care service is a query,
complaint or request.
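
A minimal sketch of intent classification with scikit-learn, in which short texts are converted to TF-IDF features and classified with a Naive Bayes model (the training texts, labels and choice of classifier are all illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled examples for intent classification at a customer care service.
texts = ["How do I reset my password?",
         "My order arrived damaged and I want a refund",
         "Please upgrade my plan to premium",
         "What are your working hours?"]
labels = ["query", "complaint", "request", "query"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["The product I received is broken"]))  # expected to lean towards 'complaint'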
Text Extraction
In Text Extraction, we aim to obtain specific information from our text.
For Example,
 In Keyword Extraction, we try to obtain the essential words that define the entire
document.
 In Entity Extraction, we try to obtain all the entities involved in a document.
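
A rough sketch of keyword extraction using TF-IDF scores with scikit-learn (the mini-corpus is illustrative; approaches such as RAKE or TextRank are also commonly used):

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; we extract keywords for the first document.
docs = ["Outliers are data points that deviate significantly from the rest of the data",
        "Semantic analysis captures the meaning of text while considering context",
        "Network diagrams show interconnections between a set of entities"]

vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)

scores = tfidf[0].toarray().ravel()
terms = vec.get_feature_names_out()
top = scores.argsort()[::-1][:5]
print([terms[i] for i in top])  # highest-scoring terms for document 0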

Visual analytics

Visual analytics is essentially the marriage of data analytics and visualizations. This
approach to solving problems is concerned with integrating interactive visual
representations with underlying analytical processes to effectively facilitate high-level,
complex activities, such as reasoning and data-driven decision making.

Data analytics visualization specifically focuses on: analytical reasoning techniques that enable users to gain greater insights that will directly support decision making and planning; visual representations and interaction techniques that exploit the human eye’s perceptual processes; and data representations and transformations that format data to support visualization and analytics.

Visual analytics is particularly useful in business analytics applications that involve large
amounts of complex data sets and analytical processes that require a great deal of
interaction and monitoring. Increasing demand for the integration of visual analytics
software is driven by the generation of more and more data of high volume, complexity,
and velocity.

Big data analytics visualization tools help transform cryptic, tedious big data into a visually
colorful, interactive data visualization from which users can track trends, patterns, and
anomalies, and make better, data-driven decisions.
In order to successfully analyze and understand a single big data problem, visual analytics
systems are often used in tandem with multiple analysis approaches, such as machine
learning algorithms and intelligence value estimation algorithms.
Some effective big data analytics visualization strategies and approaches include good
semantic mapping, abstraction, aggregation, incremental approximate database queries, and
the transformation of data into a functional or procedural model.

The visual analytics process typically follows the same steps:


 data transformation
 data mapping
 contribution selecting
 ranking
 interaction
 model visualization
 knowledge processing.

What is Data Visualization in Business Analytics?
Business Analytics is the process by which businesses use statistical methods and
technologies for analyzing historical data in order to gain new insight and improve strategic
decision-making. Data visualization is a core component in a typical business analytics
dashboard, providing visual representations such as charts and graphs for easy and quick
data analysis.
Visualizing data aids in finding correlations between business operations and long-term
outcomes. Visual analytics applications must, like any business intelligence or business
analytics initiative, adopt an effective data management strategy in order to integrate and
standardize data from disparate source systems.

Benefits of Visual Analytics
Businesses are implementing data analytics and visualization tools with increasing frequency in order to improve their business performance and their decision-making process. Some key benefits of visualization in data analytics include:
 Improved data exploration and data analysis, and minimized overall cost
 Faster and better understanding of data for faster, better decision-making
 Consumption of greater volumes of data in less time, which improves operational efficiency
 Early detection of otherwise overlooked trends, outliers, and correlations between data sets, which may result in a competitive edge
 Instant feedback and real-time updates, which keep data current and accurate

Network diagrams
Network diagrams (also called graphs) show interconnections between a set of entities. Each entity is represented by a node (or vertex). Connections between nodes are represented through links (or edges).
Here is an example showing the co-author network of Vincent Ranwez, a researcher who was my previous supervisor. Basically, people having published at least one research paper with him are represented by a node. If two people have been listed on the same publication at least once, they are connected by a link.

Four types of input


Four main types of network diagram exist, according to the features of data inputs.

Undirected and Unweighted


Tom, Cherelle and Melanie live in the same house. They are connected, but the connections have no direction and no weight.

Undirected and Weighted


In the previous co-author network, people are connected if they published a scientific paper together. The weight is the number of times it happened.

Directed and Unweighted


Tom follows Shirley on Twitter, but the opposite is not necessarily true. The connection is unweighted: two people are just connected or not.

Directed and Weighted


People migrate from one country to another: the weight is the number of people, and the direction indicates the destination.
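
A minimal sketch of these four input types using the networkx library (the names and weights echo the examples above; the co-author and migration counts are hypothetical):

import networkx as nx

# Undirected and unweighted: people living in the same house.
g1 = nx.Graph()
g1.add_edges_from([("Tom", "Cherelle"), ("Tom", "Melanie"), ("Cherelle", "Melanie")])

# Undirected and weighted: co-authors, weight = number of shared papers (hypothetical counts).
g2 = nx.Graph()
g2.add_edge("Vincent", "Alice", weight=3)
g2.add_edge("Vincent", "Bob", weight=1)

# Directed and unweighted: Tom follows Shirley, but not necessarily the reverse.
g3 = nx.DiGraph()
g3.add_edge("Tom", "Shirley")

# Directed and weighted: migration flows, weight = number of people (hypothetical).
g4 = nx.DiGraph()
g4.add_edge("Country A", "Country B", weight=12000)

print(list(g1.edges()))
print(list(g2.edges(data=True)))
print(list(g3.edges()))
print(list(g4.edges(data=True)))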

Variation
Many customizations are available for network diagrams. Here are a few features you can
work on to improve your graphic:
 Adding information to the node: you can add more insight to the graphic by customizing the color, the shape or the size of each node according to other variables.
 Different layout algorithm: finding the optimal position for each node is a tricky exercise that highly impacts the output. Several algorithms have been developed, and choosing the right one for your data is a crucial step.

Spatial data

Spatial data can be referred to as geographic data or geospatial data. Spatial data provides the
information that identifies the location of features and boundaries on Earth. Spatial data can
be processed and analysed using Geographical Information Systems (GIS) or Image
Processing packages.
Types of data

There are different types of spatial data which can be split into two categories:
 Feature data (vector data model) = an entity of the real world, e.g. a road, a tree or a building; these can be represented as a point, line or polygon in space.
 Coverage data (raster data model) = a mapping of continuous data in space expressed as a range of values, e.g. a satellite image, an aerial photograph, a Digital Surface Model (DSM) or Digital Terrain Model (DTM), or a text file with daily precipitation values. Coverage data can be represented as a grid or a triangulated irregular network.

Geographic data include road maps, land-usage maps, topographic elevation maps, political maps showing boundaries, land-ownership maps, and so on. Geographic information systems are special-purpose databases for storing geographic data. Geographic data differ from design data in certain ways. Maps and satellite images are typical examples of geographic data. Maps may provide not only location information but also attributes associated with locations, such as elevation, soil type, land use and annual rainfall.
Types of geographical data :
 Raster data
 Vector data
1. Raster data: Raster data consist of pixels, also known as grid cells, in two or more dimensions. For example, satellite images, digital pictures, and scanned maps.
2.Vector data: Vector data consist of triangles, lines, and various geometrical objects in
two dimensions and cylinders, cuboids, and other polyhedrons in three dimensions. For
example, building boundaries and roads.
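
A small sketch of the vector (feature) data model using the shapely library (the coordinates are illustrative; raster/coverage data would instead be handled as grids, for example with NumPy arrays):

from shapely.geometry import Point, LineString, Polygon

tree = Point(77.215, 28.605)                          # a point feature, e.g. a tree
road = LineString([(77.20, 28.60), (77.23, 28.62)])  # a line feature, e.g. a road
building = Polygon([(77.21, 28.60), (77.22, 28.60),
                    (77.22, 28.61), (77.21, 28.61)])  # a polygon feature, e.g. a building footprint

print(road.length, building.area, building.contains(tree))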

Applications of Spatial databases in DBMS :


 Microsoft SQL Server: Microsoft SQL Server has supported spatial data since the 2008 version.
 CouchDB: a document-based database in which spatial data is enabled by a plugin called GeoCouch.
 Neo4j: a graph database that also provides spatial capabilities.

Time series plots:

Time series charts present a series of data points collected over a specified reporting period.
The x-axis plots time and the y-axis plots data points.

The following time intervals affect the content of a time series chart:

Reporting period
The period during which data points are collected for presentation in a chart. For
example, a time series chart might present aggregated data points collected over a
24-hour period.
Granularity
The period during which data is considered for generating a single, aggregated data
point.
For example, suppose you want to find the average bps for a network device within
a reporting period of one day. You could calculate a single data point consisting of
the average of all data points collected within that 24-hour period. Or you could
apply a finer granularity by calculating the average of data points collected within a
smaller time period, such as 1 hour, over the course of the 24-hour period. If you
apply a granularity of one hour for a reporting period of one day, the chart would
contain 24 aggregated data points.
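
A brief sketch of how granularity turns raw samples into aggregated data points, using pandas (the simulated per-minute bps readings and the one-hour granularity are illustrative):

import numpy as np
import pandas as pd

# Hypothetical raw readings: one bps sample per minute over a 24-hour reporting period.
idx = pd.date_range("2024-01-01", periods=24 * 60, freq="min")
raw = pd.Series(np.random.default_rng(0).normal(5000, 500, size=len(idx)), index=idx)

# Granularity of one hour: 24 aggregated data points (the hourly averages).
hourly = raw.resample("h").mean()

# Granularity equal to the whole reporting period: a single aggregated data point.
daily = raw.resample("D").mean()

print(len(hourly), len(daily))  # 24 and 1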
Types of time series charts
Time series charts include:

 Resource Time Series (RTS). A collection of data points for a single resource over a
given reporting period. Data aggregations in a Resource Time Series chart are time
aggregations.
 Group Time Series (GTS). A collection of data points for a group of resources over
a given reporting period. Data aggregations in a Group Time Series chart are spatial
aggregations.
 Advanced Group Time Series (AGTS). A data set consisting of both time and spatial
aggregations.
 Trending and Forecasting Time Series. A collection of data points collected over a
trending and forecasting period rather than over a reporting period.
 Multi-Resource Time Series. A collection of data for multiple resources (maximum
of 100) and for multiple periods.

 The effect of granularity and polling period on aggregation


The effective granularity for an aggregation is the granularity period that is used in the
aggregation. In some cases, the granularity that is specified for an aggregation is not
the same as the effective granularity.
 Changing granularity and granularity precedence
You can change the granularity of the data displayed in GTS and RTS line charts.
Some methods of changing granularity take precedence over others.
 Resource Time Series
Resource Time Series (RTS) reports contain raw or aggregated data for a single
resource over a particular reporting period.
 Group Time Series
Group Time Series (GTS) reports contain raw or aggregated data for a group of
resources over a particular reporting period.
 Advanced Group Time Series
The Advanced Group Time Series (AGTS) report contains both time and spatial
aggregations. AGTS reports allow you to aggregate a set of aggregated data points.
 Trending and Forecasting Time Series Charts
The Trending and Forecasting Time Series (TFTS) chart provides information about
the trend of a resource towards a predicted upgrade target.

Reinforcement learning

Reinforcement learning is an autonomous, self-teaching system that essentially learns by trial and error. It performs actions with the aim of maximizing rewards, or in other words, it is learning by doing in order to achieve the best outcomes.

Main points in Reinforcement learning –

 Input: The input should be an initial state from which the model will start
 Output: There are many possible outputs as there are a variety of solutions to a
particular problem
 Training: The training is based upon the input; the model will return a state, and the user will decide whether to reward or punish the model based on its output.
 The model continues to learn.
 The best solution is decided based on the maximum reward.

Types of Reinforcement:

There are two types of Reinforcement:


1. Positive: Positive reinforcement occurs when an event, occurring due to a particular behavior, increases the strength and the frequency of that behavior. In other words, it has a positive effect on the behavior.
Advantages of positive reinforcement:
 Maximizes performance
 Sustains change for a long period of time
Disadvantage of positive reinforcement:
 Too much reinforcement can lead to an overload of states, which can diminish the results
2. Negative: Negative reinforcement is defined as the strengthening of a behavior because a negative condition is stopped or avoided.
Advantages of negative reinforcement:
 Increases behavior
 Provides defiance to a minimum standard of performance
Disadvantage of negative reinforcement:
 It only provides enough to meet the minimum behavior
Elements of Reinforcement Learning

Reinforcement learning elements are as follows:


1. Policy
2. Reward function
3. Value function
4. Model of the environment
Policy: A policy defines the learning agent's way of behaving at a given time. It is a mapping from perceived states of the environment to actions to be taken when in those states.
Reward function: The reward function defines the goal in a reinforcement learning problem. It provides a numerical score based on the state of the environment.
Value function: Value functions specify what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
Model of the environment: A model mimics the behavior of the environment and is used for planning.
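
A toy sketch tying these elements together: tabular Q-learning on a hypothetical five-state chain in which the agent is rewarded for reaching the right-most state. The environment, the reward of +1 and all hyperparameters are illustrative choices, not a standard benchmark:

import numpy as np

# Environment: states 0..4 in a chain; action 0 = left, 1 = right.
# Reaching state 4 gives reward +1 (the reward function); other steps give 0.
n_states, n_actions = 5, 2
rng = np.random.default_rng(0)

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

# Value estimates for each (state, action) pair; the greedy policy reads from this table.
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy policy: mostly exploit the value estimates, sometimes explore.
        action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
        next_state, reward, done = step(state, action)
        # Q-learning update toward reward + discounted value of the next state.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print("Learned greedy policy (0=left, 1=right):", Q.argmax(axis=1))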

Advantages of Reinforcement learning


1. Reinforcement learning can be used to solve very complex problems that cannot be
solved by conventional techniques.
2. The model can correct the errors that occurred during the training process.
3. In RL, training data is obtained via the direct interaction of the agent with the
environment
4. Reinforcement learning can handle environments that are non-deterministic, meaning
that the outcomes of actions are not always predictable. This is useful in real-world
applications where the environment may change over time or is uncertain.
Disadvantages of Reinforcement learning
1. Reinforcement learning is not preferable for solving simple problems.
2. Reinforcement learning needs a lot of data and a lot of computation.
3. Reinforcement learning is highly dependent on the quality of the reward function. If the
reward function is poorly designed, the agent may not learn the desired behavior.

A/B testing

A/B testing, also called split testing or bucket testing, compares the performance of two versions of content to see which one appeals more to visitors/viewers. It tests a control (A) version against a variant (B) version to measure which one is most successful based on your key metrics. Common forms include:
 Website A/B testing (copy, images, colors, designs, calls to action), which splits traffic between two versions, A and B, to measure which one generates more conversions or visitors who perform the desired action.
 Email marketing A/B testing (subject line, images, calls to action), which splits recipients into two segments to determine which version generates a higher open rate.
 Content A/B testing, which compares content selected by editors with content selected by an algorithm based on user behavior, to see which one results in more engagement.
In addition to A/B tests, there are also A/B/N tests, where the "N" stands for "unknown". An
A/B/N test is a type with more than two variations.
When and why you should A/B test

A/B testing provides the most benefits when it operates continuously. A regular flow of tests
can deliver a stream of recommendations on how to fine-tune performance. And continuous
testing is possible because the available options for testing are nearly unlimited.

A/B testing can be used to evaluate just about any digital marketing asset including:
 Emails

 newsletters
 advertisements
 text messages
 website pages
 components on web pages
 mobile apps

A/B testing plays an important role in campaign management since it helps determine what is
and isn’t working. It shows what your audience is interested in and responds to. A/B testing
can help you see which element of your marketing strategy has the biggest impact, which one
needs improvement, and which one needs to be dropped altogether.

 A/B testing can be used to isolate the performance problem and drive performance higher.
 Proactive use of A/B testing will allow you to compare and contrast the performance of
two different approaches to identify the better one.

Benefits of running A/B tests :

A/B testing provides a great way to quantitatively determine the tactics that work best with visitors to your website. Even a test that does not produce a winner has an upside, because you won’t stick with something that isn’t working.

By testing widely used website components/sections, you can make determinations that improve not only the test page but other similar pages as well.

Performing an A/B test:

A/B testing isn’t difficult, but it requires marketers to follow a well-defined process. Here are the nine basic steps:

The fundamental steps to planning and executing an A/B test


1. Measure and review the performance baseline
2. Determine the testing goal using the performance baseline
3. Develop a hypothesis on how your test will boost performance
4. Identify test targets or locations
5. Create the A and B versions to test
6. Utilize a QA tool to validate the setup
7. Execute the test
8. Track and evaluate results using web and testing analytics
9. Apply learnings to improve the customer experience
Tests will provide data and empirical evidence to help you refine and enhance performance.
Using what you’ve learned from A/B testing will help make a bigger impact, design a more
engaging customer experience (CX), write more compelling copy, and create more
captivating visuals. As you continuously optimize, your marketing strategies will become
more effective, increasing ROI and driving more revenue.
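
As a sketch of the "track and evaluate results" step, one common way to check whether the difference in conversion rates between version A and version B is statistically meaningful is a chi-square test on the conversion counts (the visitor and conversion numbers below are hypothetical):

from scipy.stats import chi2_contingency

# Hypothetical results: version A: 120 conversions out of 2400 visitors,
#                       version B: 165 conversions out of 2500 visitors.
table = [[120, 2400 - 120],
         [165, 2500 - 165]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"conversion rate A = {120/2400:.2%}, B = {165/2500:.2%}, p-value = {p_value:.4f}")
# A small p-value (e.g. below 0.05) suggests the difference is unlikely to be due to chance.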

Correlation

The Pearson correlation coefficient is the most often used metric of correlation. It expresses the linear relationship between two variables in numerical terms. For paired observations (xi, yi) with means x̄ and ȳ, the Pearson correlation coefficient, written as “r”, is:

r = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² · Σ(yi − ȳ)² ]
The correlation coefficient, denoted by “r”, ranges between -1 and 1.
r = -1 indicates a perfect negative correlation.
r = 0 indicates no linear correlation between the variables.
r = 1 indicates a perfect positive correlation.
Types of Correlation
There are three types of correlation:

1. Positive Correlation: Positive correlation indicates that two variables have a direct
relationship. As one variable increases, the other variable also increases. For example,
there is a positive correlation between height and weight. As people get taller, they also
tend to weigh more.
2. Negative Correlation: Negative correlation indicates that two variables have an inverse
relationship. As one variable increases, the other variable decreases. For example, there
is a negative correlation between price and demand. As the price of a product increases,
the demand for that product decreases.
3. Zero Correlation: Zero correlation indicates that there is no relationship between two
variables. The changes in one variable do not affect the other variable. For example,
there is zero correlation between shoe size and intelligence.
A positive correlation indicates that the two variables move in the same direction, while a
negative correlation indicates that the two variables move in opposite directions.
The strength of the correlation is measured by a correlation coefficient, which can range
from -1 to 1. A correlation coefficient of 0 indicates no correlation, while a correlation
coefficient of 1 or -1 indicates a perfect correlation.

How to Conduct Correlation Analysis


To conduct a correlation analysis, you will need to follow these steps:
1. Identify variables: Identify the two variables that you want to correlate. The variables should be quantitative, meaning that they can be represented by numbers.
2. Collect data : Collect data on the two variables. We can collect data from a variety of
sources, such as surveys, experiments, or existing records.
3. Choose the appropriate correlation coefficient. The Pearson correlation coefficient is
the most commonly used correlation coefficient, but there are other correlation
coefficients that may be more appropriate for certain types of data.
4. Calculate the correlation coefficient. We can use a statistical software package to calculate the correlation coefficient, or we can apply the formula directly.
5. Interpret the correlation coefficient. The correlation coefficient can be interpreted as
a measure of the strength and direction of the linear relationship between the two
variables.
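
A short sketch of steps 4 and 5, calculating and interpreting the Pearson correlation coefficient with SciPy (the height and weight values are hypothetical):

import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired observations of two quantitative variables.
height_cm = np.array([150, 160, 165, 170, 175, 180, 185])
weight_kg = np.array([50, 56, 61, 66, 70, 77, 82])

r, p_value = pearsonr(height_cm, weight_kg)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")
# r close to +1 indicates a strong positive linear relationship.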
The various fields in which it can be used are:
 Economics and Finance : Help in analyzing the economic trends by understanding the
relations between supply and demand.
 Business Analytics : Helps in making better decisions for the company and provides
valuable insights.
 Market Research and Promotions : Helps in creating better marketing strategies by
analyzing the relation between recent market trends and customer behavior.
 Medical Research: Correlation can be employed in healthcare to better understand the relation between different symptoms of diseases and to understand genetic diseases better.
Advantages of Correlation Analysis
 Correlation analysis helps us understand how two variables affect each other or are
related to each other.
 They are simple and very easy to interpret.
 Aids in the decision-making process in business, healthcare, marketing, etc.
 Helps in feature selection in machine learning.
 Gives a measure of the relation between two variables.
Disadvantages of Correlation Analysis
 Correlation does not imply causation, which means a variable may not be the cause for
the other variable even though they are correlated.
 If outliers are not dealt with well they may cause errors.
 It works well only on bivariate relations and may not produce accurate results for
multivariate relations.
 Complex relations can not be analyzed accurately.

Regression

Regression analysis is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables. It can be utilized to
assess the strength of the relationship between variables and for modeling the future
relationship between them.

Regression analysis includes several variations, such as linear, multiple linear, and nonlinear.
The most common models are simple linear and multiple linear. Nonlinear regression
analysis is commonly used for more complicated data sets in which the dependent and
independent variables show a nonlinear relationship.

Regression analysis offers numerous applications in various disciplines, including finance.

Regression Analysis – Linear Model Assumptions

Linear regression analysis is based on six fundamental assumptions:

1. The dependent and independent variables show a linear relationship.
2. The independent variable is not random.
3. The expected value (mean) of the residual (error) is zero.
4. The variance of the residual (error) is constant across all observations (homoscedasticity).
5. The residual (error) values are not correlated across observations.
6. The residual (error) values follow the normal distribution.

Simple Linear Regression

Simple linear regression is a model that assesses the relationship between a dependent
variable and an independent variable. The simple linear model is expressed using the
following equation:

Y = a + bX + ϵ

Where:

 Y – Dependent variable
 X – Independent (explanatory) variable
 a – Intercept
 b – Slope
 ϵ – Residual (error)
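
A minimal sketch of fitting the simple linear model Y = a + bX + ϵ with scikit-learn (the data points are hypothetical):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical observations of one explanatory variable X and a dependent variable Y.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
Y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

model = LinearRegression().fit(X, Y)
print("intercept a =", round(model.intercept_, 2), "slope b =", round(model.coef_[0], 2))
print("prediction for X = 6:", round(model.predict([[6.0]])[0], 2))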

Multiple Linear Regression

Multiple linear regression analysis is essentially similar to the simple linear model, with the
exception that multiple independent variables are used in the model. The mathematical
representation of multiple linear regression is:

Y = a + bX1 + cX2 + dX3 + ϵ

Where:

 Y – Dependent variable
 X1, X2, X3 – Independent (explanatory) variables
 a – Intercept
 b, c, d – Slopes
 ϵ – Residual (error)

Multiple linear regression follows the same conditions as the simple linear model. However,
since there are several independent variables in multiple linear analysis, there is another
mandatory condition for the model:

 Non-collinearity: Independent variables should show minimal correlation with each other. If the independent variables are highly correlated with each other, it will be difficult to assess the true relationships between the dependent and independent variables.
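
A quick sketch of checking the non-collinearity condition by inspecting the correlation matrix of the independent variables before fitting a multiple linear regression (the data are simulated so that X3 is nearly collinear with X1):

import numpy as np

# Hypothetical independent variables X1, X2, X3.
rng = np.random.default_rng(0)
X1 = rng.normal(size=100)
X2 = rng.normal(size=100)                       # roughly independent of X1
X3 = 0.9 * X1 + 0.1 * rng.normal(size=100)      # nearly collinear with X1

corr = np.corrcoef(np.column_stack([X1, X2, X3]), rowvar=False)
print(np.round(corr, 2))
# An off-diagonal value close to +1 or -1 (here between X1 and X3) signals collinearity.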

Polynomial Regression
o Polynomial Regression is a regression algorithm that models the relationship between
a dependent(y) and independent variable(x) as nth degree polynomial. The
Polynomial Regression equation is given below:

y = b0 + b1x1 + b2x1² + b3x1³ + … + bnx1ⁿ


o It is also called a special case of Multiple Linear Regression in ML, because we add some polynomial terms to the Multiple Linear Regression equation to convert it into Polynomial Regression.
o It is a linear model with some modification in order to increase the accuracy.
o The dataset used in Polynomial regression for training is of non-linear nature.
o It makes use of a linear regression model to fit the complicated and non-linear
functions and datasets.
o Hence, "In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modeled using a linear
model."

Need for Polynomial Regression:

The need of Polynomial Regression in ML can be understood in the below points:

o If we apply a linear model to a linear dataset, it provides a good result, as we have seen in Simple Linear Regression. But if we apply the same model, without any modification, to a non-linear dataset, it produces a drastically worse fit: the loss function increases, the error rate is high, and the accuracy decreases.
o So for such cases, where data points are arranged in a non-linear fashion, we need the Polynomial Regression model. We can understand this better by comparing how a linear model and a polynomial model fit a non-linear dataset, as in the sketch below.
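
A short sketch of polynomial regression with scikit-learn: the original feature is expanded into polynomial features of degree 2 and then fitted with a linear model (the roughly quadratic data and the chosen degree are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear (roughly quadratic) data.
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 1.5 * x.ravel() ** 2 - 2.0 * x.ravel() + 1.0 + np.random.default_rng(0).normal(0, 1, 30)

# Convert the original feature into polynomial features of degree 2, then fit a linear model.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)

print("R^2 on the training data:", round(model.score(x, y), 3))
print("prediction at x = 2:", round(model.predict([[2.0]])[0], 2))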
