Unit 2


Outlier:

An outlier is a data point that significantly deviates from the rest of the data. It can be either
much higher or much lower than the other data points, and its presence can have a
significant impact on the results of machine learning algorithms. They can be caused by
measurement or execution errors. The analysis of outlier data is referred to as outlier
analysis or outlier mining.
Types of Outliers
There are two main types of outliers:
 Global outliers: Global outliers are isolated data points that are far away from the main
body of the data. They are often easy to identify and remove.
 Contextual outliers: Contextual outliers are data points that are unusual in a specific
context but may not be outliers in a different context. They are often more difficult to
identify and may require additional information or domain knowledge to determine
their significance.

Algorithm
1. Calculate the mean of each cluster
2. Initialize the Threshold value
3. Calculate the distance of the test data from each cluster mean
4. Find the nearest cluster to the test data
5. If (Distance > Threshold) then, Outlier
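
A minimal sketch of this algorithm in Python with NumPy, on a hypothetical two-cluster dataset (the cluster values and the threshold of 3.0 are illustrative choices, not values from the text):

import numpy as np

# Toy data: two clusters plus one test point far away from both (hypothetical values).
cluster_a = np.array([[1.0, 1.2], [0.9, 1.1], [1.1, 0.8]])
cluster_b = np.array([[8.0, 8.1], [7.9, 8.3], [8.2, 7.8]])
test_point = np.array([20.0, 20.0])

# Step 1: mean of each cluster.
means = [cluster_a.mean(axis=0), cluster_b.mean(axis=0)]

# Step 2: initialize a threshold (hypothetical value chosen for this toy data).
threshold = 3.0

# Step 3: distance of the test point from each cluster mean.
distances = [np.linalg.norm(test_point - m) for m in means]

# Step 4: nearest cluster.
nearest = int(np.argmin(distances))

# Step 5: flag as outlier if the distance to the nearest cluster exceeds the threshold.
is_outlier = distances[nearest] > threshold
print(f"nearest cluster: {nearest}, distance: {distances[nearest]:.2f}, outlier: {is_outlier}")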

Outlier Detection Methods in Machine Learning


Outlier detection plays a crucial role in ensuring the quality and accuracy of machine
learning models. By identifying and removing or handling outliers effectively, we can
prevent them from biasing the model, reducing its performance, and hindering its
interpretability. Here’s an overview of various outlier detection methods:
1. Statistical Methods:
 Z-Score: This method measures how many standard deviations a data point lies from the mean and identifies outliers as those with Z-scores exceeding a certain threshold (typically 3 or -3).
 Interquartile Range (IQR): IQR identifies outliers as data points falling outside the
range defined by Q1-k*(Q3-Q1) and Q3+k*(Q3-Q1), where Q1 and Q3 are the first and
third quartiles, and k is a factor (typically 1.5).
2. Distance-Based Methods:
 K-Nearest Neighbors (KNN): KNN identifies outliers as data points whose K nearest
neighbors are far away from them.
 Local Outlier Factor (LOF): This method calculates the local density of data points
and identifies outliers as those with significantly lower density compared to their
neighbors.

3. Clustering-Based Methods:
 Density-Based Spatial Clustering of Applications with Noise (DBSCAN): DBSCAN clusters data points based on their density and identifies outliers as points not belonging to any cluster.
 Hierarchical clustering: Hierarchical clustering involves building a hierarchy of
clusters by iteratively merging or splitting clusters based on their similarity. Outliers
can be identified as clusters containing only a single data point or clusters significantly
smaller than others.
4. Other Methods:
 Isolation Forest: Isolation forest randomly isolates data points by splitting features and
identifies outliers as those isolated quickly and easily.
 One-class Support Vector Machines (OCSVM): One-Class SVM learns a boundary
around the normal data and identifies outliers as points falling outside the boundary.
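
As a rough illustration of how some of the methods above can be applied, the sketch below flags outliers in a hypothetical one-dimensional sample using the Z-score rule (threshold 3), the IQR rule (factor 1.5), and scikit-learn's LOF and Isolation Forest implementations (scikit-learn is assumed to be installed):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

# Hypothetical 1-D sample: 50 "normal" points around 10 plus one extreme value.
rng = np.random.default_rng(0)
x = np.append(rng.normal(loc=10, scale=1, size=50), 50.0)

# Z-score: flag points more than 3 standard deviations from the mean.
z = (x - x.mean()) / x.std()
print("Z-score outliers:", x[np.abs(z) > 3])

# IQR: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print("IQR outliers:", x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])

# Density and ensemble methods: LOF and Isolation Forest label outliers as -1.
X = x.reshape(-1, 1)
print("LOF outliers:", X[LocalOutlierFactor().fit_predict(X) == -1].ravel())
print("Isolation Forest outliers:", X[IsolationForest(random_state=0).fit_predict(X) == -1].ravel())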
Techniques for Handling Outliers in Machine Learning
Outliers, data points that significantly deviate from the majority, can have detrimental
effects on machine learning models. To address this, several techniques can be employed to
handle outliers effectively:
1. Removal:
 This involves identifying and removing outliers from the dataset before training the
model. Common methods include:
 Thresholding: Outliers are identified as data points exceeding a certain
threshold (e.g., Z-score > 3).
 Distance-based methods: Outliers are identified based on their distance
from their nearest neighbors.
 Clustering: Outliers are identified as points not belonging to any cluster or
belonging to very small clusters.
2. Transformation:
 This involves transforming the data to reduce the influence of outliers. Common
methods include:
 Scaling: Standardizing or normalizing the data to have a mean of zero and a
standard deviation of one.
 Winsorization: Replacing outlier values with the nearest non-outlier value.
 Log transformation: Applying a logarithmic transformation to compress the
data and reduce the impact of extreme values.
3. Robust Estimation:
 This involves using algorithms that are less sensitive to outliers. Some examples
include:
 Robust regression: Algorithms like L1-regularized regression or Huber
regression are less influenced by outliers than least squares regression.
 M-estimators: These algorithms estimate the model parameters based on a robust objective function that down-weights the influence of outliers.
 Outlier-insensitive clustering algorithms: Algorithms like DBSCAN are
less susceptible to the presence of outliers than K-means clustering.
4. Modeling Outliers:
 This involves explicitly modeling the outliers as a separate group. This can be done by:
 Adding a separate feature: Create a new feature indicating whether a data
point is an outlier or not.
 Using a mixture model: Train a model that assumes the data comes from a
mixture of multiple distributions, where one distribution represents the
outliers.
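
A brief sketch of techniques 1, 2 and 4 on hypothetical data using pandas (the Z-score threshold of 3 and the 5th/95th-percentile winsorization limits are illustrative choices):

import numpy as np
import pandas as pd

# Hypothetical data: 100 "normal" values around 50 plus two extreme values.
rng = np.random.default_rng(0)
s = pd.Series(np.append(rng.normal(loc=50, scale=5, size=100), [250.0, 300.0]))

# 1. Removal: drop points whose absolute Z-score exceeds 3.
z = (s - s.mean()) / s.std()
removed = s[z.abs() <= 3]

# 2. Transformation: winsorization (clip to the 5th/95th percentiles) and log transform.
winsorized = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))
logged = np.log(s)

# 4. Modeling outliers: keep every point but add a flag feature instead of discarding.
flagged = pd.DataFrame({"value": s, "is_outlier": z.abs() > 3})

print("points kept after removal:", len(removed), "of", len(s))
print("largest value after winsorization:", round(winsorized.max(), 1))
print("flagged outliers:", flagged.loc[flagged.is_outlier, "value"].tolist())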

Importance of outlier detection in machine learning


Outlier detection is important in machine learning for several reasons:
1. Biased models: Outliers can bias a machine learning model towards the outlier
values, leading to poor performance on the rest of the data. This can be particularly
problematic for algorithms that are sensitive to outliers, such as linear regression.
2. Reduced accuracy: Outliers can introduce noise into the data, making it difficult for a
machine learning model to learn the true underlying patterns. This can lead to reduced
accuracy and performance.
3. Increased variance: Outliers can increase the variance of a machine learning
model, making it more sensitive to small changes in the data. This can make it difficult
to train a stable and reliable model.
4. Reduced interpretability: Outliers can make it difficult to understand what a machine
learning model has learned from the data. This can make it difficult to trust the model’s
predictions and can hamper efforts to improve its performance.

Semantic Analysis

Semantic Analysis is a subfield of Natural Language Processing (NLP) that attempts to understand the meaning of Natural Language. Understanding Natural Language might seem
a straightforward process to us as humans. However, due to the vast complexity and
subjectivity involved in human language, interpreting it is quite a complicated task for
machines. Semantic Analysis of Natural Language captures the meaning of the given text
while taking into account context, logical structuring of sentences and grammar roles.
Parts of Semantic Analysis
Semantic Analysis of Natural Language can be classified into two broad parts:
1. Lexical Semantic Analysis: Lexical Semantic Analysis involves understanding
the meaning of each word of the text individually. It basically refers to fetching the
dictionary meaning that a word in the text is intended to carry.
2. Compositional Semantics Analysis: Although knowing the meaning of each
word of the text is essential, it is not sufficient to completely understand the meaning of the
text.
For example, consider the following two sentences:
 Sentence 1: Students love GeeksforGeeks.
 Sentence 2: GeeksforGeeks loves Students.
Both sentences use exactly the same words, yet they convey different meanings; compositional semantics therefore considers how the words are combined and structured, not just their individual dictionary meanings.

Tasks involved in Semantic Analysis


In order to understand the meaning of a sentence, the following are the major processes
involved in Semantic Analysis:
1. Word Sense Disambiguation
2. Relationship Extraction
Word Sense Disambiguation:
In Natural Language, the meaning of a word may vary as per its usage in sentences
and the context of the text. Word Sense Disambiguation involves interpreting the meaning
of a word based upon the context of its occurrence in a text.
For example, the word ‘Bark’ may mean ‘the sound made by a dog’ or ‘the outermost layer
of a tree.’
Likewise, the word ‘rock’ may mean ‘a stone‘ or ‘a genre of music‘ – hence, the
accurate meaning of the word is highly dependent upon its context and usage in the text.
Thus, the ability of a machine to overcome the ambiguity involved in identifying the
meaning of a word based on its usage and context is called Word Sense Disambiguation.
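
As a small illustration, the Lesk algorithm available in NLTK chooses a WordNet sense for an ambiguous word by comparing its context with the dictionary glosses of the candidate senses. A sketch on the 'bark' example (assumes the WordNet corpus has been downloaded; the sentences are illustrative):

from nltk.wsd import lesk

# One-time download may be required: import nltk; nltk.download('wordnet')
for sentence in ["The dog let out a loud bark at the stranger",
                 "The bark of the old oak tree was rough and cracked"]:
    sense = lesk(sentence.split(), "bark")
    print(sentence)
    print("  ->", sense, "-", sense.definition() if sense else "no sense found")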
Relationship Extraction:
Another important task involved in Semantic Analysis is Relationship Extraction. It involves first identifying the various entities present in the sentence and then extracting the relationships between those entities.

(Figure: an example sentence annotated with its entities and the relationships between them.)

Elements of Semantic Analysis


Some of the critical elements of Semantic Analysis that must be scrutinized and taken into
account while processing Natural Language are:
 Hyponymy: Hyponymy refers to a term that is an instance of a generic term. The relation can be understood by taking class-object as an analogy. For example: ‘Color‘ is a hypernym while ‘grey‘, ‘blue‘, ‘red‘, etc., are its hyponyms.
 Homonymy: Homonymy refers to two or more lexical terms with the same spelling but completely distinct meanings. For example: ‘Rose‘ might mean ‘the past form of rise‘ or ‘a flower‘ – same spelling but different meanings; hence, ‘rose‘ is a homonym.
 Synonymy: When two or more lexical terms that might be spelt distinctly have the same or similar meaning, they are called synonyms. For example: (Job, Occupation), (Large, Big), (Stop, Halt).
 Antonymy: Antonymy refers to a pair of lexical terms that have contrasting meanings – they are symmetric about a semantic axis. For example: (Day, Night), (Hot, Cold), (Large, Small).
 Polysemy: Polysemy refers to lexical terms that have the same spelling but multiple closely related meanings. It differs from homonymy because in homonymy the meanings need not be closely related. For example: ‘man‘ may mean ‘the human species‘, ‘a male human‘ or ‘an adult male human‘ – since all these meanings bear a close association, the lexical term ‘man‘ is polysemous.
 Meronymy: Meronymy refers to a relationship wherein one lexical term is a constituent part of some larger entity. For example: ‘Wheel‘ is a meronym of ‘Automobile‘
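
These lexical relations can also be explored programmatically. The sketch below uses WordNet through NLTK to list hyponyms, synonyms, antonyms and meronyms; the particular synset identifiers (such as 'color.n.01') are assumptions about how WordNet names these senses:

from nltk.corpus import wordnet as wn

# One-time download may be required: import nltk; nltk.download('wordnet')

color = wn.synset('color.n.01')
print("Some hyponyms of 'color':", [s.name() for s in color.hyponyms()][:5])

print("Synonyms of 'large':", wn.synset('large.a.01').lemma_names())
print("Antonyms of 'hot':", [a.name() for a in wn.synset('hot.a.01').lemmas()[0].antonyms()])

car = wn.synset('car.n.01')
print("Some meronyms (parts) of 'car':", [s.name() for s in car.part_meronyms()][:5])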
Meaning Representation
While, as humans, it is pretty simple for us to understand the meaning of textual
information, it is not so in the case of machines. Thus, machines tend to represent the text
in specific formats in order to interpret its meaning. This formal structure that is used to
understand the meaning of a text is called meaning representation.
Basic Units of Semantic System:
In order to accomplish Meaning Representation in Semantic Analysis, it is vital to
understand the building units of such representations. The basic units of semantic systems
are explained below:
1. Entity: An entity refers to a particular unit or individual, such as a person or a location.
2. Concept: A Concept may be understood as a generalization of entities. It refers to a
broad class of individual units. For example Learning Portals, City, Students.
3. Relations: Relations help establish relationships between various entities and concepts.
For example: ‘GeeksforGeeks is a Learning Portal’, ‘Delhi is a City.’, etc.
4. Predicate: Predicates represent the verb structures of the sentences.
In Meaning Representation, we employ these basic units to represent textual information.
Approaches to Meaning Representations:
Now that we are familiar with the basic understanding of Meaning Representations, here
are some of the most popular approaches to meaning representation:
1. First-order predicate logic (FOPL)
2. Semantic Nets
3. Frames
4. Conceptual dependency (CD)
5. Rule-based architecture
6. Case Grammar
7. Conceptual Graphs
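
As a small illustration of the first approach, the example relations given above could be written in first-order predicate logic roughly as follows (the predicate names are illustrative, not a standard notation):

IsA(GeeksforGeeks, LearningPortal)
IsA(Delhi, City)
∀x (Student(x) → Loves(x, GeeksforGeeks))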
Semantic Analysis Techniques
Based upon the end goal one is trying to accomplish, Semantic Analysis can be used in
various ways. Two of the most common Semantic Analysis techniques are:
Text Classification
In Text Classification, our aim is to label the text according to the insights we intend to gain from the textual data.
For example:
 In Sentiment Analysis, we try to label the text with the prominent emotion it conveys. It is highly beneficial when analyzing customer reviews for improvement.
 In Topic Classification, we try to categorize our text into some predefined categories. For example: identifying whether a research paper belongs to Physics, Chemistry or Maths.
 In Intent Classification, we try to determine the intent behind a text message. For
example: Identifying whether an e-mail received at customer care service is a query,
complaint or request.
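
A minimal sketch of intent classification with scikit-learn, in which short texts are converted to TF-IDF features and classified with a Naive Bayes model (the training texts, labels and choice of classifier are all illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled examples for intent classification at a customer care service.
texts = ["How do I reset my password?",
         "My order arrived damaged and I want a refund",
         "Please upgrade my plan to premium",
         "What are your working hours?"]
labels = ["query", "complaint", "request", "query"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["The product I received is broken"]))  # expected to lean towards 'complaint'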
Text Extraction
In Text Extraction, we aim to obtain specific information from our text.
For Example,
 In Keyword Extraction, we try to obtain the essential words that define the entire
document.
 In Entity Extraction, we try to obtain all the entities involved in a document.
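
A rough sketch of keyword extraction using TF-IDF scores with scikit-learn (the mini-corpus is illustrative; approaches such as RAKE or TextRank are also commonly used):

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; we extract keywords for the first document.
docs = ["Outliers are data points that deviate significantly from the rest of the data",
        "Semantic analysis captures the meaning of text while considering context",
        "Network diagrams show interconnections between a set of entities"]

vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)

scores = tfidf[0].toarray().ravel()
terms = vec.get_feature_names_out()
top = scores.argsort()[::-1][:5]
print([terms[i] for i in top])  # highest-scoring terms for document 0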

Visual analytics

Visual analytics is essentially the marriage of data analytics and visualizations. This
approach to solving problems is concerned with integrating interactive visual
representations with underlying analytical processes to effectively facilitate high-level,
complex activities, such as reasoning and data-driven decision making.

Data analytics visualization specifically focuses on: analytical reasoning techniques that enable users to gain greater insights that will directly support decision making and planning; visual representations and interaction techniques that exploit the human eye’s perceptual processes; and data representations and transformations that format data to support visualization and analytics.

Visual analytics is particularly useful in business analytics applications that involve large
amounts of complex data sets and analytical processes that require a great deal of
interaction and monitoring. Increasing demand for the integration of visual analytics
software is driven by the generation of more and more data of high volume, complexity,
and velocity.

Big data analytics visualization tools help transform cryptic, tedious big data into a visually
colorful, interactive data visualization from which users can track trends, patterns, and
anomalies, and make better, data-driven decisions.
In order to successfully analyze and understand a single big data problem, visual analytics
systems are often used in tandem with multiple analysis approaches, such as machine
learning algorithms and intelligence value estimation algorithms.
Some effective big data analytics visualization strategies and approaches include good
semantic mapping, abstraction, aggregation, incremental approximate database queries, and
the transformation of data into a functional or procedural model.

The visual analytics process typically follows the same steps:


 data transformation
 data mapping
 contribution selecting
 ranking
 interaction
 model visualization
 knowledge processing.

What is Data Visualization in Business Analytics?
Business Analytics is the process by which businesses use statistical methods and
technologies for analyzing historical data in order to gain new insight and improve strategic
decision-making. Data visualization is a core component in a typical business analytics
dashboard, providing visual representations such as charts and graphs for easy and quick
data analysis.
Visualizing data aids in finding correlations between business operations and long-term
outcomes. Visual analytics applications must, like any business intelligence or business
analytics initiative, adopt an effective data management strategy in order to integrate and
standardize data from disparate source systems.

Benefits of Visual Analytics
Businesses are implementing data analytics and visualization tools with increasing frequency in order to improve their business performance and their decision-making process. Some key benefits of visualization in data analytics include:
 Improved data exploration and data analysis, and minimized overall cost
 Faster and better understanding of data for faster, better decision-making
 Consumption of greater volumes of data in less time, which improves operational efficiency
 Early detection of otherwise overlooked trends, outliers, and correlations between data sets, which may result in a competitive edge
 Instant feedback and real-time updates, which keep data current and accurate

Network diagrams
Network diagrams (also called graphs) show interconnections between a set of entities. Each entity is represented by a node (or vertex). Connections between nodes are represented through links (or edges).
Here is an example showing the co-author network of Vincent Ranwez, a researcher who was my previous supervisor. Basically, people having published at least one research paper with him are represented by a node. If two people have been listed on the same publication at least once, they are connected by a link.

Four types of input


Four main types of network diagram exist, according to the features of data inputs.

Undirected and Unweighted


Tom, Cherelle and Melanie live in the same house. They are connected, but the connections have no direction and no weight.

Undirected and Weighted


In the previous co-author network, people are connected if they published a scientific paper together. The weight is the number of times it happened.

Directed and Unweighted


Tom follows Shirley on Twitter, but the opposite is not necessarily true. The connection is unweighted: two people are just connected or not.

Directed and Weighted


People migrate from one country to another: the weight is the number of people, and the direction indicates the destination.
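
A minimal sketch of these four input types using the networkx library (the names and weights echo the examples above; the co-author and migration counts are hypothetical):

import networkx as nx

# Undirected and unweighted: people living in the same house.
g1 = nx.Graph()
g1.add_edges_from([("Tom", "Cherelle"), ("Tom", "Melanie"), ("Cherelle", "Melanie")])

# Undirected and weighted: co-authors, weight = number of shared papers (hypothetical counts).
g2 = nx.Graph()
g2.add_edge("Vincent", "Alice", weight=3)
g2.add_edge("Vincent", "Bob", weight=1)

# Directed and unweighted: Tom follows Shirley, but not necessarily the reverse.
g3 = nx.DiGraph()
g3.add_edge("Tom", "Shirley")

# Directed and weighted: migration flows, weight = number of people (hypothetical).
g4 = nx.DiGraph()
g4.add_edge("Country A", "Country B", weight=12000)

print(list(g1.edges()))
print(list(g2.edges(data=True)))
print(list(g3.edges()))
print(list(g4.edges(data=True)))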

Variation
Many customizations are available for network diagrams. Here are a few features you can
work on to improve your graphic:
 Adding information to the node: you can add more insight to the graphic by customizing the color, the shape or the size of each node according to other variables.
 Different layout algorithm: finding the optimal position for each node is a tricky exercise that highly impacts the output. Several algorithms have been developed, and choosing the right one for your data is a crucial step.

Spatial data

Spatial data can be referred to as geographic data or geospatial data. Spatial data provides the
information that identifies the location of features and boundaries on Earth. Spatial data can
be processed and analysed using Geographical Information Systems (GIS) or Image
Processing packages.
Types of data

There are different types of spatial data which can be split into two categories:
 Feature data (vector data model) = an entity of the real world, e.g. a road, a tree or a building; these can be represented as a point, line or polygon in space.
 Coverage data (raster data model) = a mapping of continuous data in space expressed as a range of values, e.g. a satellite image, an aerial photograph, a Digital Surface Model (DSM) or Digital Terrain Model (DTM), or a text file with daily precipitation values. Coverage data can be represented as a grid or a triangulated irregular network.

Geographic data include road maps, land-usage maps, topographic elevation maps, political maps showing boundaries, land-ownership maps, and so on. Geographic information systems are special-purpose databases for storing geographic data. Geographic data differ from design data in certain ways. Maps and satellite images are typical examples of geographic data. Maps may provide not only location information but also attributes associated with locations, such as elevation, soil type, land use and annual rainfall.
Types of geographical data :
 Raster data
 Vector data
1. Raster data: Raster data consist of pixels, also known as grid cells, in two or more dimensions. For example, satellite images, digital pictures, and scanned maps.
2.Vector data: Vector data consist of triangles, lines, and various geometrical objects in
two dimensions and cylinders, cuboids, and other polyhedrons in three dimensions. For
example, building boundaries and roads.
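
A small sketch of the vector (feature) data model using the shapely library (the coordinates are illustrative; raster/coverage data would instead be handled as grids, for example with NumPy arrays):

from shapely.geometry import Point, LineString, Polygon

tree = Point(77.215, 28.605)                          # a point feature, e.g. a tree
road = LineString([(77.20, 28.60), (77.23, 28.62)])  # a line feature, e.g. a road
building = Polygon([(77.21, 28.60), (77.22, 28.60),
                    (77.22, 28.61), (77.21, 28.61)])  # a polygon feature, e.g. a building footprint

print(road.length, building.area, building.contains(tree))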

Applications of Spatial databases in DBMS :


 Microsoft SQL Server: Microsoft SQL Server has supported spatial data since the 2008 version.
 CouchDB: a document-based database in which spatial data is enabled by a plugin called GeoCouch.
 Neo4j: a graph database that also provides spatial capabilities.

Time series plots:

Time series charts present a series of data points collected over a specified reporting period.
The x-axis plots time and the y-axis plots data points.

The following time intervals affect the content of a time series chart:

Reporting period
The period during which data points are collected for presentation in a chart. For
example, a time series chart might present aggregated data points collected over a
24-hour period.
Granularity
The period during which data is considered for generating a single, aggregated data
point.
For example, suppose you want to find the average bps for a network device within
a reporting period of one day. You could calculate a single data point consisting of
the average of all data points collected within that 24-hour period. Or you could
apply a finer granularity by calculating the average of data points collected within a
smaller time period, such as 1 hour, over the course of the 24-hour period. If you
apply a granularity of one hour for a reporting period of one day, the chart would
contain 24 aggregated data points.
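
A brief sketch of how granularity turns raw samples into aggregated data points, using pandas (the simulated per-minute bps readings and the one-hour granularity are illustrative):

import numpy as np
import pandas as pd

# Hypothetical raw readings: one bps sample per minute over a 24-hour reporting period.
idx = pd.date_range("2024-01-01", periods=24 * 60, freq="min")
raw = pd.Series(np.random.default_rng(0).normal(5000, 500, size=len(idx)), index=idx)

# Granularity of one hour: 24 aggregated data points (the hourly averages).
hourly = raw.resample("h").mean()

# Granularity equal to the whole reporting period: a single aggregated data point.
daily = raw.resample("D").mean()

print(len(hourly), len(daily))  # 24 and 1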
Types of time series charts
Time series charts include:

 Resource Time Series (RTS). A collection of data points for a single resource over a
given reporting period. Data aggregations in a Resource Time Series chart are time
aggregations.
 Group Time Series (GTS). A collection of data points for a group of resources over
a given reporting period. Data aggregations in a Group Time Series chart are spatial
aggregations.
 Advanced Group Time Series (AGTS). A data set consisting of both time and spatial
aggregations.
 Trending and Forecasting Time Series. A collection of data points collected over a
trending and forecasting period rather than over a reporting period.
 Multi-Resource Time Series. A collection of data for multiple resources (maximum
of 100) and for multiple periods.

 The effect of granularity and polling period on aggregation


The effective granularity for an aggregation is the granularity period that is used in the
aggregation. In some cases, the granularity that is specified for an aggregation is not
the same as the effective granularity.
 Changing granularity and granularity precedence
You can change the granularity of the data displayed in GTS and RTS line charts.
Some methods of changing granularity take precedence over others.
 Resource Time Series
Resource Time Series (RTS) reports contain raw or aggregated data for a single
resource over a particular reporting period.
 Group Time Series
Group Time Series (GTS) reports contain raw or aggregated data for a group of
resources over a particular reporting period.
 Advanced Group Time Series
The Advanced Group Time Series (AGTS) report contains both time and spatial
aggregations. AGTS reports allow you to aggregate a set of aggregated data points.
 Trending and Forecasting Time Series Charts
The Trending and Forecasting Time Series (TFTS) chart provides information about
the trend of a resource towards a predicted upgrade target.

Reinforcement learning

Reinforcement learning is an autonomous, self-teaching system that essentially learns by trial and error. It performs actions with the aim of maximizing rewards, or in other words, it is learning by doing in order to achieve the best outcomes.

Main points in Reinforcement learning –

 Input: The input should be an initial state from which the model will start
 Output: There are many possible outputs as there are a variety of solutions to a
particular problem
 Training: The training is based upon the input; the model will return a state, and the user will decide whether to reward or punish the model based on its output.
 The model continues to learn.
 The best solution is decided based on the maximum reward.

Types of Reinforcement:

There are two types of Reinforcement:


1. Positive: Positive reinforcement occurs when an event, occurring due to a particular behavior, increases the strength and the frequency of that behavior. In other words, it has a positive effect on the behavior.
Advantages of positive reinforcement:
 Maximizes performance
 Sustains change for a long period of time
Disadvantage of positive reinforcement:
 Too much reinforcement can lead to an overload of states, which can diminish the results
2. Negative: Negative reinforcement is defined as the strengthening of a behavior because a negative condition is stopped or avoided.
Advantages of negative reinforcement:
 Increases behavior
 Provides defiance to a minimum standard of performance
Disadvantage of negative reinforcement:
 It only provides enough to meet the minimum behavior
Elements of Reinforcement Learning

Reinforcement learning elements are as follows:


1. Policy
2. Reward function
3. Value function
4. Model of the environment
Policy: A policy defines the learning agent's way of behaving at a given time. It is a mapping from perceived states of the environment to actions to be taken when in those states.
Reward function: The reward function defines the goal in a reinforcement learning problem. It provides a numerical score based on the state of the environment.
Value function: Value functions specify what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
Model of the environment: A model mimics the behavior of the environment and is used for planning.
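
A toy sketch tying these elements together: tabular Q-learning on a hypothetical five-state chain in which the agent is rewarded for reaching the right-most state. The environment, the reward of +1 and all hyperparameters are illustrative choices, not a standard benchmark:

import numpy as np

# Environment: states 0..4 in a chain; action 0 = left, 1 = right.
# Reaching state 4 gives reward +1 (the reward function); other steps give 0.
n_states, n_actions = 5, 2
rng = np.random.default_rng(0)

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

# Value estimates for each (state, action) pair; the greedy policy reads from this table.
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy policy: mostly exploit the value estimates, sometimes explore.
        action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
        next_state, reward, done = step(state, action)
        # Q-learning update toward reward + discounted value of the next state.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print("Learned greedy policy (0=left, 1=right):", Q.argmax(axis=1))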

Advantages of Reinforcement learning


1. Reinforcement learning can be used to solve very complex problems that cannot be
solved by conventional techniques.
2. The model can correct the errors that occurred during the training process.
3. In RL, training data is obtained via the direct interaction of the agent with the
environment
4. Reinforcement learning can handle environments that are non-deterministic, meaning
that the outcomes of actions are not always predictable. This is useful in real-world
applications where the environment may change over time or is uncertain.
Disadvantages of Reinforcement learning
1. Reinforcement learning is not preferable for solving simple problems.
2. Reinforcement learning needs a lot of data and a lot of computation.
3. Reinforcement learning is highly dependent on the quality of the reward function. If the
reward function is poorly designed, the agent may not learn the desired behavior.

A/B testing

A/B testing, also called split testing or bucket testing, compares the performance of two versions of content to see which one appeals more to visitors/viewers. It tests a control (A) version against a variant (B) version to measure which one is most successful based on your key metrics. Common forms include:
 Website A/B testing (copy, images, colors, designs, calls to action), which splits traffic between two versions, A and B, to measure which one generates more conversions or visitors who perform the desired action.
 Email marketing A/B testing (subject line, images, calls to action), which splits recipients into two segments to determine which version generates a higher open rate.
 Content A/B testing, which compares content selected by editors with content selected by an algorithm based on user behavior, to see which one results in more engagement.
In addition to A/B tests, there are also A/B/N tests, where the "N" stands for "unknown". An
A/B/N test is a type with more than two variations.
When and why you should A/B test

A/B testing provides the most benefits when it operates continuously. A regular flow of tests
can deliver a stream of recommendations on how to fine-tune performance. And continuous
testing is possible because the available options for testing are nearly unlimited.

A/B testing can be used to evaluate just about any digital marketing asset including:
 Emails

 newsletters
 advertisements
 text messages
 website pages
 components on web pages
 mobile apps

A/B testing plays an important role in campaign management since it helps determine what is
and isn’t working. It shows what your audience is interested in and responds to. A/B testing
can help you see which element of your marketing strategy has the biggest impact, which one
needs improvement, and which one needs to be dropped altogether.

 A/B testing can be used to isolate the performance problem and drive performance higher.
 Proactive use of A/B testing will allow you to compare and contrast the performance of
two different approaches to identify the better one.

Benefits of running A/B tests :

A/B testing provides a great way to quantitatively determine the tactics that work best with visitors to your website. Even a test that does not produce a winner has an upside, because you won’t stick with something that isn’t working.

By testing widely used website components/sections, you can make determinations that improve not only the test page but other similar pages as well.

Performing an A/B test:

A/B testing isn’t difficult, but it requires marketers to follow a well-defined process. Here are the nine basic steps:

The fundamental steps to planning and executing an A/B test


1. Measure and review the performance baseline
2. Determine the testing goal using the performance baseline
3. Develop a hypothesis on how your test will boost performance
4. Identify test targets or locations
5. Create the A and B versions to test
6. Utilize a QA tool to validate the setup
7. Execute the test
8. Track and evaluate results using web and testing analytics
9. Apply learnings to improve the customer experience
Tests will provide data and empirical evidence to help you refine and enhance performance.
Using what you’ve learned from A/B testing will help make a bigger impact, design a more
engaging customer experience (CX), write more compelling copy, and create more
captivating visuals. As you continuously optimize, your marketing strategies will become
more effective, increasing ROI and driving more revenue.
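
As a sketch of the "track and evaluate results" step, one common way to check whether the difference in conversion rates between version A and version B is statistically meaningful is a chi-square test on the conversion counts (the visitor and conversion numbers below are hypothetical):

from scipy.stats import chi2_contingency

# Hypothetical results: version A: 120 conversions out of 2400 visitors,
#                       version B: 165 conversions out of 2500 visitors.
table = [[120, 2400 - 120],
         [165, 2500 - 165]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"conversion rate A = {120/2400:.2%}, B = {165/2500:.2%}, p-value = {p_value:.4f}")
# A small p-value (e.g. below 0.05) suggests the difference is unlikely to be due to chance.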

Correlation

The Pearson correlation coefficient is the most often used metric of correlation. It expresses the linear relationship between two variables in numerical terms. For paired observations (xi, yi) with means x̄ and ȳ, the Pearson correlation coefficient, written as “r”, is:

r = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² · Σ(yi − ȳ)² ]
The correlation coefficient, denoted by “r”, ranges between -1 and 1.
r = -1 indicates a perfect negative correlation.
r = 0 indicates no linear correlation between the variables.
r = 1 indicates a perfect positive correlation.
Types of Correlation
There are three types of correlation:

1. Positive Correlation: Positive correlation indicates that two variables have a direct
relationship. As one variable increases, the other variable also increases. For example,
there is a positive correlation between height and weight. As people get taller, they also
tend to weigh more.
2. Negative Correlation: Negative correlation indicates that two variables have an inverse
relationship. As one variable increases, the other variable decreases. For example, there
is a negative correlation between price and demand. As the price of a product increases,
the demand for that product decreases.
3. Zero Correlation: Zero correlation indicates that there is no relationship between two
variables. The changes in one variable do not affect the other variable. For example,
there is zero correlation between shoe size and intelligence.
A positive correlation indicates that the two variables move in the same direction, while a
negative correlation indicates that the two variables move in opposite directions.
The strength of the correlation is measured by a correlation coefficient, which can range
from -1 to 1. A correlation coefficient of 0 indicates no correlation, while a correlation
coefficient of 1 or -1 indicates a perfect correlation.

How to Conduct Correlation Analysis


To conduct a correlation analysis, you will need to follow these steps:
1. Identify variables: Identify the two variables that you want to correlate. The variables should be quantitative, meaning that they can be represented by numbers.
2. Collect data : Collect data on the two variables. We can collect data from a variety of
sources, such as surveys, experiments, or existing records.
3. Choose the appropriate correlation coefficient. The Pearson correlation coefficient is
the most commonly used correlation coefficient, but there are other correlation
coefficients that may be more appropriate for certain types of data.
4. Calculate the correlation coefficient. We can use a statistical software package to calculate the correlation coefficient, or we can apply the formula directly.
5. Interpret the correlation coefficient. The correlation coefficient can be interpreted as
a measure of the strength and direction of the linear relationship between the two
variables.
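
A short sketch of steps 4 and 5, calculating and interpreting the Pearson correlation coefficient with SciPy (the height and weight values are hypothetical):

import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired observations of two quantitative variables.
height_cm = np.array([150, 160, 165, 170, 175, 180, 185])
weight_kg = np.array([50, 56, 61, 66, 70, 77, 82])

r, p_value = pearsonr(height_cm, weight_kg)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")
# r close to +1 indicates a strong positive linear relationship.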
The various fields in which it can be used are:
 Economics and Finance : Help in analyzing the economic trends by understanding the
relations between supply and demand.
 Business Analytics : Helps in making better decisions for the company and provides
valuable insights.
 Market Research and Promotions : Helps in creating better marketing strategies by
analyzing the relation between recent market trends and customer behavior.
 Medical Research: Correlation can be employed in healthcare to better understand the relation between different symptoms of diseases and to understand genetic diseases better.
Advantages of Correlation Analysis
 Correlation analysis helps us understand how two variables affect each other or are
related to each other.
 They are simple and very easy to interpret.
 Aids in the decision-making process in business, healthcare, marketing, etc.
 Helps in feature selection in machine learning.
 Gives a measure of the relation between two variables.
Disadvantages of Correlation Analysis
 Correlation does not imply causation, which means a variable may not be the cause for
the other variable even though they are correlated.
 If outliers are not dealt with well they may cause errors.
 It works well only on bivariate relations and may not produce accurate results for
multivariate relations.
 Complex relations can not be analyzed accurately.

Regression

Regression analysis is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables. It can be utilized to
assess the strength of the relationship between variables and for modeling the future
relationship between them.

Regression analysis includes several variations, such as linear, multiple linear, and nonlinear.
The most common models are simple linear and multiple linear. Nonlinear regression
analysis is commonly used for more complicated data sets in which the dependent and
independent variables show a nonlinear relationship.

Regression analysis offers numerous applications in various disciplines, including finance.

Regression Analysis – Linear Model Assumptions

Linear regression analysis is based on six fundamental assumptions:

1. The dependent and independent variables show a linear relationship.
2. The independent variable is not random.
3. The expected value (mean) of the residual (error) is zero.
4. The variance of the residual (error) is constant across all observations (homoscedasticity).
5. The residual (error) values are not correlated across observations.
6. The residual (error) values follow the normal distribution.

Simple Linear Regression

Simple linear regression is a model that assesses the relationship between a dependent
variable and an independent variable. The simple linear model is expressed using the
following equation:

Y = a + bX + ϵ

Where:

 Y – Dependent variable
 X – Independent (explanatory) variable
 a – Intercept
 b – Slope
 ϵ – Residual (error)
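
A minimal sketch of fitting the simple linear model Y = a + bX + ϵ with scikit-learn (the data points are hypothetical):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical observations of one explanatory variable X and a dependent variable Y.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
Y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

model = LinearRegression().fit(X, Y)
print("intercept a =", round(model.intercept_, 2), "slope b =", round(model.coef_[0], 2))
print("prediction for X = 6:", round(model.predict([[6.0]])[0], 2))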

Multiple Linear Regression

Multiple linear regression analysis is essentially similar to the simple linear model, with the
exception that multiple independent variables are used in the model. The mathematical
representation of multiple linear regression is:

Y = a + bX1 + cX2 + dX3 + ϵ

Where:

 Y – Dependent variable
 X1, X2, X3 – Independent (explanatory) variables
 a – Intercept
 b, c, d – Slopes
 ϵ – Residual (error)

Multiple linear regression follows the same conditions as the simple linear model. However,
since there are several independent variables in multiple linear analysis, there is another
mandatory condition for the model:

 Non-collinearity: Independent variables should show minimal correlation with each other. If the independent variables are highly correlated with each other, it will be difficult to assess the true relationships between the dependent and independent variables.
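
A quick sketch of checking the non-collinearity condition by inspecting the correlation matrix of the independent variables before fitting a multiple linear regression (the data are simulated so that X3 is nearly collinear with X1):

import numpy as np

# Hypothetical independent variables X1, X2, X3.
rng = np.random.default_rng(0)
X1 = rng.normal(size=100)
X2 = rng.normal(size=100)                       # roughly independent of X1
X3 = 0.9 * X1 + 0.1 * rng.normal(size=100)      # nearly collinear with X1

corr = np.corrcoef(np.column_stack([X1, X2, X3]), rowvar=False)
print(np.round(corr, 2))
# An off-diagonal value close to +1 or -1 (here between X1 and X3) signals collinearity.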

Polynomial Regression
o Polynomial Regression is a regression algorithm that models the relationship between
a dependent(y) and independent variable(x) as nth degree polynomial. The
Polynomial Regression equation is given below:

y = b0 + b1x1 + b2x1² + b3x1³ + … + bnx1ⁿ


o It is also called a special case of Multiple Linear Regression in ML, because we add some polynomial terms to the Multiple Linear Regression equation to convert it into Polynomial Regression.
o It is a linear model with some modification in order to increase the accuracy.
o The dataset used in Polynomial regression for training is of non-linear nature.
o It makes use of a linear regression model to fit the complicated and non-linear
functions and datasets.
o Hence, "In Polynomial regression, the original features are converted into
Polynomial features of required degree (2,3,..,n) and then modeled using a linear
model."

Need for Polynomial Regression:

The need of Polynomial Regression in ML can be understood in the below points:

o If we apply a linear model to a linear dataset, it provides a good result, as we have seen in Simple Linear Regression. But if we apply the same model, without any modification, to a non-linear dataset, it produces a drastically worse fit: the loss function increases, the error rate is high, and the accuracy decreases.
o So for such cases, where data points are arranged in a non-linear fashion, we need the Polynomial Regression model. We can understand this better by comparing how a linear model and a polynomial model fit a non-linear dataset, as in the sketch below.
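
A short sketch of polynomial regression with scikit-learn: the original feature is expanded into polynomial features of degree 2 and then fitted with a linear model (the roughly quadratic data and the chosen degree are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear (roughly quadratic) data.
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 1.5 * x.ravel() ** 2 - 2.0 * x.ravel() + 1.0 + np.random.default_rng(0).normal(0, 1, 30)

# Convert the original feature into polynomial features of degree 2, then fit a linear model.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)

print("R^2 on the training data:", round(model.score(x, y), 3))
print("prediction at x = 2:", round(model.predict([[2.0]])[0], 2))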
