
UNIT-IV

Introduction to Software Defined Network (SDN)

• All traditional networking devices, such as routers and switches, use a distributed
control plane. The newer model of networking, Software-Defined
Networking (SDN), uses a centralized control plane.
• A distributed control plane means that the control plane of each networking device
lies within the device itself.
• Each device has its own control plane to control its data plane.
• In a centralized control plane system, there is one device that contains the control
plane of all devices.
• This device controls the activities of the data planes of all networking devices
simultaneously.
• This device is called the Controller or SDN controller.

The following figure shows a model of controller-based networking.

Figure: Controller-based network model

1. Southbound Interface:
In SDN, all networking devices must be connected to the controller so that it can
regulate the data planes of all devices. When drawing the architecture of a network,
the network architect usually places the networking devices below the controller.
According to map conventions, the interfaces between the controller and the networking
devices therefore lie to the south of the controller. Hence, these interfaces are
called the Southbound Interface.

The southbound interface is an interface between a program on the controller and a
program on a networking device. Note that the interfaces we are discussing are
software interfaces, not physical ones.

2. Northbound Interface:
The controller needs a great deal of information about the network so that it can
control the data planes of the networking devices. All this information is provided by
the network programmer, who supplies it to the controller through various software
or scripts describing what functions the controller has to perform.
This software is placed above the controller in the network
architecture, which puts the interfaces between the controller and the software
to the north, according to map conventions.
Hence, the interfaces between the controller and the software are called the Northbound
Interface. These interfaces enable the programmability of the network.

3. All the interfaces discussed above are program-based interfaces. In a broader
sense, these interfaces are called Application Program Interfaces (APIs).
An API is an interface through which two programs can exchange data.
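The programmability enabled by the northbound interface can be illustrated with a small sketch: a Python script querying an SDN controller's northbound REST API using the `requests` library. The endpoint path follows Floodlight's documented pattern; the controller address, port, and response field names are assumptions and vary by controller.

```python
# A minimal sketch of a northbound API call: a management script asking the
# controller which switches it currently manages. Host, port, and endpoint
# are assumptions; adjust to the controller actually in use.
import requests

CONTROLLER = "http://127.0.0.1:8080"  # assumed controller address and port

resp = requests.get(f"{CONTROLLER}/wm/core/controller/switches/json")
resp.raise_for_status()

for switch in resp.json():
    # Field name varies by controller/version; 'switchDPID' is Floodlight's.
    print("Connected switch:", switch.get("switchDPID"))
```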
In order to understand software defined networks, we need to understand the
various planes involved in networking.

Data plane:
All the activities involving, as well as resulting from, data packets sent by the end
user belong to this plane. This includes:
• Forwarding of packets
• Segmentation and reassembly of data
• Replication of packets for multicasting
Control plane:
All activities that are necessary to perform data plane activities but do not involve
end-user data packets belong to this plane. In other words, this is the brain of the
network. The activities of the control plane include:
• Making routing tables
• Setting packet handling policies
In a traditional network, each switch has its own data plane as well as its own control
plane. The control planes of the various switches exchange topology information and
hence construct forwarding tables that decide where an incoming data packet
has to be forwarded via the data plane.

Advantages of SDN:
• The network is programmable and hence can easily be modified via the controller
rather than via individual switches.
• Switch hardware becomes cheaper, since each switch only needs a data plane.
• Hardware is abstracted; hence applications can be written on top of the controller
independent of the switch vendor.
• Provides better security, since the controller can monitor traffic and deploy
security policies. For example, if the controller detects suspicious activity in
network traffic, it can reroute or drop the packets.
Disadvantages of SDN:
The central dependency of the network means a single point of failure: if the
controller gets corrupted, the entire network will be affected.

SDN Architecture

A typical SDN architecture consists of three layers.

• Application layer:
It contains typical network applications such as intrusion detection, firewalls,
and load balancing.
• Control layer:
It consists of the SDN controller, which acts as the brain of the network. It
also provides hardware abstraction to the applications written on top of it.
• Infrastructure layer:
This consists of the physical switches, which form the data plane and carry out
the actual movement of data packets.

The layers communicate via a set of interfaces called the northbound
APIs (between the application and control layers) and southbound APIs (between the
control and infrastructure layers).

Challenges

✓ Rule placement

✓ Controller placement

Rule placement

✓ Switches forward traffic based on a rule – a 'Flow-Rule' – defined by the
centralized controller.

▪ Traditionally, there is a Routing Table in every switch (L3 switch/router).
SDN instead maintains a Flow Table at every switch.

▪ Flow-Rules reside in the Flow Table.

✓ Each rule has a specific format, which is also defined by a protocol (e.g.,
OpenFlow).

✓ The size of ternary content-addressable memory (TCAM) is limited at the
switches.

▪ Only a limited number of rules can be inserted.

✓ Fast processing is done using TCAM at the switches.

✓ TCAM is very expensive.

✓ On receiving a request for which no flow-rule is present in the switch, the
switch sends a PACKET-IN message to the controller.

✓ The controller decides a suitable flow-rule for the request.

✓ The flow-rule is then inserted at the switch.

✓ Typically, a delay of 3-5 ms is involved in placing a new rule.

✓ Open questions: How to define/place the rules at switches while considering the
available TCAM? How to define rules so that fewer PACKET-IN
messages are sent to the controller? (A minimal controller sketch follows below.)
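To make the PACKET-IN/flow-rule cycle concrete, here is a minimal sketch of a controller application written with the Ryu framework (a Python-based controller framework; the choice of framework here is an assumption for illustration). It is a toy: it matches only on the ingress port and floods, whereas real rules would match on IP addresses, ports, and so on.

```python
# A minimal Ryu app (OpenFlow 1.3): on PACKET-IN, install a flow-rule so
# that later packets of the same flow are matched in the switch's TCAM
# without contacting the controller again.
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class SimpleRulePlacer(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPPacketIn, MAIN_DISPATCHER)
    def packet_in_handler(self, ev):
        msg = ev.msg                       # the PACKET-IN message
        dp = msg.datapath                  # the switch that sent it
        ofp, parser = dp.ofproto, dp.ofproto_parser

        # Toy match: ingress port only; flood as the action.
        match = parser.OFPMatch(in_port=msg.match['in_port'])
        actions = [parser.OFPActionOutput(ofp.OFPP_FLOOD)]
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS,
                                             actions)]

        # Insert the flow-rule at the switch.
        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=1,
                                      match=match, instructions=inst))
```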

OpenFlow Protocol

✓ Only one protocol is available for rule placement – OpenFlow.

✓ It has different versions – 1.0, 1.1, 1.2, 1.3, etc. – which support different
numbers of match-fields.

✓ Different match-fields

▪ Source IP

▪ Destination IP

▪ Source Port

▪ Priority

▪ etc.

How long is a flow-rule to be kept at the switch?

✓ Hard timeout

▪ All rules are deleted from the switch at the hard timeout.

▪ This can be used to reset the switch.

✓ Soft timeout

▪ If NO flow associated with a rule is received for a particular time,
the rule is deleted.

▪ This is used to free the rule-space by deleting unused rules.
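The two timeouts map directly onto fields of the OpenFlow FlowMod message. A short sketch, reusing `dp`, `match`, and `inst` from the Ryu example earlier (the timeout values are arbitrary):

```python
# Soft (idle) and hard timeouts on a flow-rule, OpenFlow 1.3 via Ryu.
flow_mod = dp.ofproto_parser.OFPFlowMod(
    datapath=dp,
    priority=1,
    match=match,
    instructions=inst,
    idle_timeout=30,    # soft timeout: deleted if unused for 30 s
    hard_timeout=300,   # hard timeout: deleted 300 s after insertion
)
dp.send_msg(flow_mod)
```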

✓ SDN is NOT OpenFlow.

▪ SDN is a technology/concept.

▪ OpenFlow is a protocol used to communicate between the data plane
and the control plane.

▪ We may have other protocols for this purpose; however, OpenFlow
is the only protocol present today.

OpenFlow Switch Software

• Indigo: Open source; it runs on Mac OS X.

• LINC: Open source; it runs on Linux, Solaris, Windows, MacOS, and
FreeBSD.

• Pantou: Turns a commercial wireless router/access point into an OpenFlow-enabled
switch. OpenFlow runs on OpenWRT.

• Of13softswitch: User-space software switch based on the Ericsson
TrafficLab 1.1 softswitch.

• Open vSwitch: Open source; it is the MOST popular one present today.

Controller Placement

✓ Controllers define flow-rules according to application-specific
requirements.

✓ The controllers must be able to handle all incoming requests from the
switches.

✓ Rules should be placed without incurring much delay.

✓ Typically, a controller can handle 200 requests per second (through a
single thread).

✓ The controllers are logically connected to the switches at one-hop
distance.

✓ Physically, they are connected to the switches over multi-hop distances.

✓ If we have a very small number of controllers for a large network, the
network might become congested with control packets (i.e., PACKET-IN
messages).

✓ Looking at controller placement, there are different architectures.

Flat Architecture

The basic architecture is called the flat architecture. Here the switch and the
controller are logically one hop away. The switch sends a PACKET-IN message to the
controller if it does not already have a flow rule for the particular flow it has
received. The controller then sends back the corresponding flow rule, i.e., the
instruction describing how the switch should treat that flow. The underlying
assumption of SDN technology is that the controller knows how the different flows
and packets are to be handled.
Hierarchical (tree) Architecture

This is the hierarchical, or tree, architecture, and it is quite straightforward:
the controllers are placed hierarchically and connected to the different switches
in a tree-like fashion, with a PACKET-IN message and the corresponding flow rule
travelling along each of these connections.

Ring Architecture

In the ring architecture we have a similar kind of arrangement, but the controllers
are placed in a ring-like fashion. We may have multiple controllers placed in the
ring, but a particular switch is connected to only one controller in this version.
When a PACKET-IN request has to be sent, it is sent to that single controller only,
not to any of the other controllers in the ring, and the flow rule is returned to
the particular switch that requested it.

Then we have the mesh architecture. Mesh, as we know, increases reliability: for
instance, two different switches can be connected to a single controller, and if
one connection goes down there is another that can take over, and so on.

Control Mechanisms
✓ Distributed

▪ The control decisions can be taken in a distributed manner.
▪ Ex: each subnetwork is controlled by a different controller.
✓ Centralized
▪ The control decisions are taken in a centralized manner.
▪ Ex: A network is controlled by a single controller.

Backup Controller

✓ If a controller is down, what will happen?

▪ A backup controller is introduced.

▪ A replica of the main controller is created.

▪ If the main controller is down, the backup controller controls the
network, providing uninterrupted network management.

With SDN one can have an enhanced level of security in the network; in this
particular case, with the help of a firewall, an HTTP proxy, and an IDS,
security can be improved using this technology.

Very briefly (improving security with SDN is not discussed here in much detail),
a paper published in SIGCOMM in 2013 proposed a simplifying protocol for policy
enforcement.

What does it do? Consider the figure: it shows an example of a potential data-plane
ambiguity when implementing the policy chain firewall-IDS-proxy in a particular
topology. The sequence of flow is as follows: when an HTTP request comes in, it is
sent from one switch to another; from that switch it goes to the IDS, comes back,
and then goes to the proxy, the forwarding element, and the firewall; finally it
reaches the last switch and leaves the network. This is how security is implemented
and enhanced using SDN. The point here is only to show that security can indeed be
improved with the help of SDN; the issue is not discussed further.

Experimenting with SDN

✓ Simulator/Emulator

▪ Infrastructure deployment – MUST be supported with OpenFlow.

▪ Controller placement – MUST support OpenFlow.

▪ Remote controller – can be situated at a remote place and communicated
with using an IP address and port number.

▪ Local controller.

Switch Deployment

✓ Mininet

▪ Used to create a virtual network with OpenFlow-enabled switches.

▪ Based on the Python language.

▪ Supports remote and local controllers (see the sketch below).

✓ There is also controller configuration software, for example POX, NOX,
Floodlight, OpenDaylight, and ONOS; OpenDaylight and ONOS in particular
are the most popular ones used for controller configuration.
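A minimal Mininet sketch of the setup described above: two hosts, one OpenFlow switch, and a remote controller assumed to be listening at 127.0.0.1:6653 (e.g., a Ryu, POX, or ONOS instance). Run with root privileges on a machine with Mininet installed.

```python
# Build a tiny virtual network and test connectivity; each ping that finds
# no matching flow-rule triggers a PACKET-IN at the remote controller.
from mininet.net import Mininet
from mininet.node import RemoteController
from mininet.log import setLogLevel

setLogLevel('info')

net = Mininet(controller=None)
net.addController('c0', controller=RemoteController,
                  ip='127.0.0.1', port=6653)   # assumed controller address

s1 = net.addSwitch('s1')
h1, h2 = net.addHost('h1'), net.addHost('h2')
net.addLink(h1, s1)
net.addLink(h2, s1)

net.start()
net.pingAll()   # h1 <-> h2 reachability test
net.stop()
```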

SDN for IoT

Benefits of Integrating SDN in IoT

✓ Intelligent routing decisions can be deployed using SDN.

✓ Simplification of information collection, analysis, and decision making.

✓ Visibility of network resources – network management is simplified based
on user, device, and application-specific requirements.

✓ Intelligent traffic-pattern analysis and coordinated decisions.

SDN for IoT-I

Looking at this figure, we have different IoT devices, possibly in different
subnetworks. Through mobile-access or fixed-access channels, the data from these
devices can be acquired and transmitted to the data aggregator, where the data
received from the different IoT devices is aggregated. It then passes through a
transport network to the different gateways, where packet segregation is done.

This is a simplified view of an IoT network. When we want to integrate SDN, we use
an SDN controller. The SDN controller controls each of these different aspects,
improves the orchestration between the different devices and the different protocols
running in the network, and overall improves the service logic behind it.

With the implementation of SDN, centralized control of the IoT end devices –
sensors, actuators, RFID tags, and any other IoT devices – is made possible. At the
access devices, rule placement can be implemented while considering issues such as
mobility and the heterogeneity of the end devices. Rule placement and traffic
engineering for the backbone network are taken care of at the transport network,
while flow classification and enhanced security are taken care of at the
data-center networks.

Data Handling and Analytics

Data handling is the process of ensuring that research data is stored, archived or
disposed of in a safe and secure manner during and after the conclusion of a
research project. This includes the development of policies and procedures to
manage data handled electronically as well as through non-electronic means.

Considerations/issues in data handling

Issues that should be considered in ensuring the integrity of handled data include
the following:

• The type of data handled and its impact on the environment (especially if it is on
a toxic medium).
• The type of media containing the data and its storage capacity, handling and storage
requirements, reliability, longevity (in the case of a degradable medium),
retrieval effectiveness, and ease of upgrade to newer media.
• Data handling responsibilities/privileges, that is, who can handle which
portion of the data, at what point during the project, for what purpose, etc.
• Data handling procedures that describe how long the data should be kept,
and when, how, and by whom it should be handled for storage, sharing, archival,
retrieval and disposal purposes.

In recent days, most data concerns Big Data, due to:

✓ the heavy traffic generated by IoT devices, and

✓ the huge amount of data generated by the deployed sensors.

What is Big Data

• A collection of data sets so large and complex that it becomes difficult to
process them using on-hand database management tools or traditional data
processing applications.
• "Big Data" is data whose scale, diversity, and complexity require new
architectures, techniques, algorithms, and analytics to manage it and extract
value and hidden knowledge from it.
• 'Big Data' is similar to 'small data', but bigger in size.
• It aims to solve new problems, or old problems in a better way.
• Big Data generates value from the storage and processing of very large
quantities of digital information that cannot be analyzed with traditional
computing techniques.

Types of Data:

There are two types of data:

✓ Structured data
▪ Data that can be easily organized.
▪ Usually stored in relational databases.
▪ Structured Query Language (SQL) manages structured data in databases (see the sketch below).
▪ It accounts for only 20% of the total data available in the world today.
✓ Unstructured data
▪ Information that does not possess any pre-defined model.
▪ Traditional RDBMSs are unable to process unstructured data.
▪ Processing it enhances the ability to extract better insight from huge datasets.
▪ It accounts for 80% of the total data available in the world today.
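As a tiny illustration of structured data, the sketch below uses Python's built-in sqlite3 module; the table and values are made up for the example.

```python
# Structured data: rows and columns with a fixed schema, queried with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE readings (sensor_id TEXT, temperature REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("s1", 21.5), ("s2", 22.1), ("s1", 21.9)])

# SQL manages the structured data: aggregate per sensor.
for row in conn.execute("SELECT sensor_id, AVG(temperature) "
                        "FROM readings GROUP BY sensor_id"):
    print(row)
```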

Characteristics of Big Data:

✓ Big Data is characterized by 7 Vs –

✓ Volume

✓ Velocity

✓ Variety

✓ Variability

✓ Veracity (Accuracy)

✓ Visualization

✓ Value

❖ Volume

o Quantity of data that is generated.

o Sources of data are added continuously.

o Examples of volume –

▪ 30 TB of images will be generated every night by the Large
Synoptic Survey Telescope (LSST).

▪ 72 hours of video are uploaded to YouTube every minute.

❖ Velocity

o Refers to the speed of generation of data.

o Data processing times are decreasing day by day in order to provide
real-time services.

o Older batch-processing technology is unable to handle such high velocity
of data.

o Examples of velocity –

▪ 140 million tweets per day on average (according to a survey
conducted in 2011).

▪ The New York Stock Exchange captures 1 TB of trade information
during each trading session.

❖ Variety

o Refers to the category to which the data belongs.

o No restriction over the input data formats.

o Data is mostly unstructured or semi-structured.

o Examples of variety –

▪ Pure text, images, audio, video, web, GPS data, sensor data,
SMS, documents, PDFs, flash, etc.

❖ Variability

o Refers to data whose meaning is constantly changing.

o The meaning of the data depends on the context.

o Data can appear as an indecipherable mass without structure.

o Examples:

▪ Language processing, hashtags, geo-spatial data, multimedia,
sensor events.

❖ Veracity

o Veracity refers to the biases, noise and abnormality in data.

o It is important in programs that involve automated decision-making, or
that feed the data into an unsupervised machine-learning algorithm.

o Veracity isn't just about data quality; it's about data understandability.

❖ Value

o It means extracting useful business information from scattered data.

o It includes a large volume and variety of data.

o It is easy to access and delivers quality analytics that enable informed
decisions.

Data Handling Technologies:

❖ Cloud computing

o Essential characteristics according to NIST (the National Institute of
Standards and Technology):

▪ On-demand self-service

▪ Broad network access

▪ Resource pooling

▪ Rapid elasticity

▪ Measured service

o Basic service models provided by cloud computing:

▪ Infrastructure-as-a-Service (IaaS)

▪ Platform-as-a-Service (PaaS)

▪ Software-as-a-Service (SaaS)

❖ Internet of Things (IoT)

o According to Techopedia, IoT "describes a future where every day
physical objects will be connected to the internet and will be able to
identify themselves to other devices."

o Sensors are embedded into various devices and machines and deployed
into fields.

o Sensors transmit sensed data to remote servers via the Internet.

o Continuous data acquisition from mobile equipment, transportation
facilities, public facilities, and home appliances.

❖ Data handling at data centers

o Stores, manages, and organizes data.

o Estimates and provides the necessary processing capacity.

o Provides sufficient network infrastructure.

o Effectively manages energy consumption.

o Replicates data to keep backups.

o Develops business-oriented strategic solutions from big data.

o Helps business personnel to analyze existing data.

o Discovers problems in business operations.

Flow of Data

Data Sources:

✓ Enterprise data

▪ Online trading and analysis data.

▪ Production and inventory data.

▪ Sales and other financial data.

✓ IoT data

▪ Data from industry, agriculture, traffic, and transportation.

▪ Medical-care data.

▪ Data from public departments and families.

✓ Bio-medical data

▪ Masses of data generated by gene sequencing.

▪ Data from medical clinics and medical R&D.

✓ Other fields

▪ Fields such as computational biology, astronomy, nuclear research, etc.

Data Acquisition:

✓ Data collection

▪ Log files or record files are automatically generated by data sources
to record activities for further analysis.

▪ Sensory data, such as sound waves, voice, vibration, automobile,
chemical, current, weather, pressure, and temperature data.

▪ Complex and varied data is collected through mobile devices, e.g.,
geographical location, 2D barcodes, pictures, and videos.

✓ Data transmission

▪ After collection, the data is transferred to a storage system for
further processing and analysis.

▪ Data transmission can be categorized as inter-DCN transmission
and intra-DCN transmission.

✓ Data pre-processing

▪ Collected datasets suffer from noise, redundancy, inconsistency, etc.;
thus, pre-processing of the data is necessary.

▪ Pre-processing of relational data mainly involves integration,
cleaning, and redundancy mitigation (see the sketch below).

▪ Integration combines data from various sources and provides users
with a uniform view of the data.

▪ Cleaning identifies inaccurate, incomplete, or unreasonable data,
and then modifies or deletes such data.

▪ Redundancy mitigation eliminates data repetition through the
detection, filtering and compression of data, to avoid unnecessary
transmission.
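The three pre-processing steps can be sketched with pandas; the column names, value ranges, and data below are illustrative assumptions.

```python
# Integration, cleaning, and redundancy mitigation on toy sensor data.
import pandas as pd

# Integration: combine data from two sources into one uniform view.
a = pd.DataFrame({"device": ["d1", "d2"], "temp": [21.0, None]})
b = pd.DataFrame({"device": ["d2", "d2", "d3"], "temp": [22.5, 22.5, 150.0]})
data = pd.concat([a, b], ignore_index=True)

# Cleaning: drop incomplete rows and remove unreasonable values.
data = data.dropna()
data = data[data["temp"].between(-40, 60)]   # assumed plausible sensor range

# Redundancy mitigation: drop repeated records before storage/transmission.
data = data.drop_duplicates()
print(data)
```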

Data Storage

✓ File system

▪ Distributed file systems store massive data and ensure the
consistency, availability, and fault tolerance of the data.

▪ GFS is a notable example of a distributed file system that supports
large-scale storage, though its performance is limited in the case of
small files.

▪ The Hadoop Distributed File System (HDFS) and Kosmosfs are other
notable file systems, derived from the open-source code of GFS.

✓ Databases

▪ Non-traditional relational databases (NoSQL) have emerged in order
to deal with the characteristics that big data possesses.

▪ There are three main families of NoSQL databases – key-value databases,
column-oriented databases, and document-oriented databases.

Data Handling Using Hadoop:

✓ Hadoop is a software framework for the distributed processing of large datasets
across large clusters of computers.

✓ Hadoop is an open-source implementation of Google's GFS and MapReduce.
Apache Hadoop's MapReduce and Hadoop Distributed File System (HDFS)
components were originally derived from Google's MapReduce and
Google File System (GFS), respectively.
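The classic introductory MapReduce job is word count. The sketch below shows the map and reduce phases as plain Python functions in the style of Hadoop Streaming (where each phase would normally be its own script reading stdin and writing stdout, with Hadoop sorting the mapper output by key in between):

```python
# Word count: map emits (word, 1); reduce sums counts per word.
import sys
from itertools import groupby


def mapper(lines):
    # Map phase: one (key, value) pair per word.
    for line in lines:
        for word in line.split():
            yield word, 1


def reducer(pairs):
    # Reduce phase: input grouped by key (here via an explicit sort).
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```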

Building Blocks of Hadoop:

✓ Hadoop Common

▪ A module containing the utilities that support the other Hadoop
components.

✓ Hadoop Distributed File System (HDFS)

▪ Provides reliable data storage and access across the nodes.

✓ MapReduce

▪ A framework for applications that process large amounts of data in
parallel.

✓ Yet Another Resource Negotiator (YARN)

▪ Next-generation MapReduce, which assigns CPU, memory and
storage to applications running on a Hadoop cluster.

Hadoop Distributed File System (HDFS):

✓ Centralized node

▪ Namenode: maintains metadata information about the files.

✓ Distributed node

▪ Datanode: stores the actual data.

✓ Files are divided into blocks.

✓ Each block is replicated.

Name and Data Nodes

✓ Namenode

▪ Stores the filesystem metadata.

▪ Maintains two in-memory tables that map the datanodes to the blocks,
and vice versa.

✓ Datanode

▪ Stores the actual data.

▪ Datanodes can talk to each other to rebalance and replicate data.

▪ Datanodes update the namenode with block information periodically.

▪ Before updating, datanodes verify the checksums.

Job and Task Trackers

✓ Job Tracker –

▪ Runs with the Namenode.

▪ Receives the user's job.

▪ Decides how many tasks will run (the number of mappers).

▪ Decides where to run each mapper (the concept of locality).

✓ Task Tracker –

▪ Runs on each datanode.

▪ Receives tasks from the Job Tracker.

Hadoop Master / Slave Architecture:

✓ Master-slave, shared-nothing architecture.

✓ Master

▪ Executes operations like opening, closing, and renaming files and
directories.

▪ Determines the mapping of blocks to Datanodes.

✓ Slave

▪ Serves read and write requests from the file system's clients.

▪ Performs block creation, deletion, and replication as instructed by the
Namenode.

What is Data Analytics

✓ "Data analytics (DA) is the process of examining data sets in order to draw
conclusions about the information they contain, increasingly with the aid of
specialized systems and software. Data analytics technologies and
techniques are widely used in commercial industries to enable
organizations to make more-informed business decisions and
researchers to verify or disprove scientific models, theories and
hypotheses."

[An admin's guide to AWS data management]

Types of Data Analysis

✓ Two types of analysis

✓ Qualitative Analysis

▪ Deals with the analysis of data that is categorical in nature.

✓ Quantitative Analysis

▪ Refers to the process by which numerical data is analyzed.

Qualitative Analysis

✓ Data is not described through numerical values.

✓ It is described by some sort of descriptive context, such as text.

✓ Data can be gathered by many methods, such as interviews, video and audio
recordings, and field notes.

✓ Data needs to be interpreted, i.e., grouped into identifiable themes.

✓ Qualitative analysis can be summarized by three activities:

▪ Notice things

▪ Collect things

▪ Think about things

Quantitative Analysis

✓ Quantitative analysis refers to the process by which numerical data is
analyzed.

✓ It involves descriptive statistics such as the mean, median, and standard
deviation.

✓ The following are often involved in quantitative analysis:

▪ Statistical models

▪ Analysis of variance

▪ Data dispersion

▪ Analysis of relationships between variables

▪ Contingency and correlation

▪ Regression analysis

▪ Statistical significance

▪ Precision

▪ Error limits

Comparison:

Qualitative Data                     | Quantitative Data
-------------------------------------|---------------------------------
Data is observed                     | Data is measured
Involves descriptions                | Involves numbers
Emphasis is on quality               | Emphasis is on quantity
Examples: color, smell, taste, etc.  | Examples: volume, weight, etc.

Advantages

✓ Allows for the identification of important (and often mission-critical) trends.

✓ Helps businesses identify performance problems that require some sort of
action.

✓ Results can be viewed in a visual manner, which leads to faster and better
decisions.

✓ Provides better awareness of the habits of potential customers.

✓ It can provide a company with an edge over its competitors.

Statistical models

✓ A statistical model is defined as a mathematical equation formulated in the
form of relationships between variables.

✓ A statistical model illustrates how a set of random variables is related to
another set of random variables.

✓ A statistical model is represented as the ordered pair (X, P)

▪ X denotes the set of all possible observations

▪ P refers to the set of probability distributions on X

Statistical models are broadly categorized as:

✓ Complete models

▪ A complete model has the number of variables equal to the number of
equations.

✓ Incomplete models

▪ An incomplete model does not have the same number of variables as
the number of equations.

✓ In order to build a statistical model:

▪ Data gathering

▪ Descriptive methods

▪ Thinking about predictors

▪ Building the model

▪ Interpreting the results (a small sketch of these steps follows below)
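A small sketch of these steps using ordinary least squares from statsmodels; the data is made up purely for illustration.

```python
# Build and interpret a simple statistical model: y = a + b*x + noise.
import numpy as np
import statsmodels.api as sm

# 1-2. Gather data and inspect it descriptively.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print("mean of y:", y.mean())

# 3-4. Choose a predictor and build the model.
X = sm.add_constant(x)          # adds the intercept column
model = sm.OLS(y, X).fit()

# 5. Interpret the results.
print(model.params)             # estimated intercept and slope
print(model.rsquared)           # goodness of fit
```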

Analysis of variance

✓ Analysis of Variance (ANOVA) is a parametric statistical technique used to
compare datasets.

✓ ANOVA is best applied where more than 2 populations or samples are to be
compared.

✓ To perform an ANOVA, we must have a continuous response variable and at
least one categorical factor (e.g., age, gender) with two or more levels
(e.g., Locations 1 and 2).

✓ ANOVA requires data from approximately normally distributed populations.

✓ Properties required to perform an ANOVA –

▪ Independence of cases: the sample should be selected randomly, and
there should not be any pattern in the selection of the sample.

▪ Normality: the distribution of each group should be normal.

▪ Homogeneity: the variance between the groups should be the same
(e.g., one should not compare data from cities with data from slums).

✓ Analysis of variance (ANOVA) has three types:

▪ One-way analysis: one fixed factor (levels set by the investigator);
factors can be age, gender, etc.

▪ Two-way analysis: there are two factor variables.

▪ K-way analysis: there are k factor variables.

✓ Total sum of squares

▪ In statistical data analysis, the total sum of squares (TSS or SST) is a
quantity that appears as part of a standard way of presenting results of
such analyses. It is defined as the sum, over all observations, of
the squared differences of each observation from the overall mean:

▪ Total SS = Σ (Yi − Ȳ)², where Ȳ is the mean of Y.

✓ F-ratio

▪ Helps to understand the ratio of variance between two data sets.

▪ The F-ratio is approximately 1.0 when the null hypothesis is true, and
greater than 1.0 when the null hypothesis is false.

▪ F = MS_between / MS_within

✓ Degrees of freedom

▪ Factors which have no effect on the variance.

▪ The number of degrees of freedom is the number of values in the final
calculation of a statistic that are free to vary.
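A one-way ANOVA can be sketched with scipy; the three groups below are made-up samples of one categorical factor with three levels.

```python
# One-way ANOVA: F-ratio (MS_between / MS_within) and its p-value.
from scipy import stats

group1 = [23, 25, 21, 24, 26]
group2 = [30, 31, 29, 32, 28]
group3 = [24, 26, 25, 23, 27]

f_ratio, p_value = stats.f_oneway(group1, group2, group3)
print(f"F = {f_ratio:.2f}, p = {p_value:.4f}")
# An F much greater than 1.0 with a small p suggests the group means differ.
```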

Data dispersion

✓ A measure of statistical dispersion is a nonnegative real number that is zero
if all the data are the same and increases as the data become more diverse.

✓ Examples of dispersion measures:

▪ Range

▪ Average absolute deviation

▪ Variance and standard deviation

✓ Range

▪ The range is calculated by simply taking the difference between the
maximum and minimum values in the data set.

✓ Average absolute deviation

▪ The average absolute deviation (or mean absolute deviation) of a data
set is the average of the absolute deviations from the mean.

✓ Variance

▪ Variance is the expectation of the squared deviation of a random
variable from its mean.

✓ Standard deviation

▪ Standard deviation (SD) is a measure used to quantify the
amount of variation or dispersion of a set of data values.
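The four dispersion measures can be computed with Python's standard library; the sample data is made up.

```python
import statistics

data = [4, 8, 6, 5, 3, 7]
mean = statistics.mean(data)

data_range = max(data) - min(data)                    # range
mad = sum(abs(x - mean) for x in data) / len(data)    # average absolute deviation
variance = statistics.pvariance(data)                 # population variance
std_dev = statistics.pstdev(data)                     # standard deviation

print(data_range, mad, variance, std_dev)
```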

Contingency and correlation

✓ In statistics, a contingency table (also known as a cross tabulation or
crosstab) is a type of table in a matrix format that displays the (multivariate)
frequency distribution of the variables.

✓ It provides a basic picture of the interrelation between two variables.

✓ A crucial problem of multivariate statistics is finding the (direct-)dependence
structure underlying the variables contained in high-dimensional
contingency tables.

✓ Correlation is a technique for investigating the relationship between two
quantitative, continuous variables.

✓ Pearson's correlation coefficient (r) is a measure of the strength of the
association between the two variables.

✓ Correlations are useful because they can indicate a predictive relationship
that can be exploited in practice.
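Both ideas can be sketched with pandas and scipy; all data below is made up for illustration.

```python
import pandas as pd
from scipy import stats

# Contingency table: frequency distribution of two categorical variables.
df = pd.DataFrame({"gender": ["M", "F", "F", "M", "F"],
                   "smoker": ["yes", "no", "yes", "no", "no"]})
print(pd.crosstab(df["gender"], df["smoker"]))

# Pearson's r for two quantitative, continuous variables.
height = [150, 160, 165, 170, 180]
weight = [50, 58, 63, 66, 77]
r, p = stats.pearsonr(height, weight)
print(f"r = {r:.2f}, p = {p:.4f}")
```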

Regression analysis:

✓ In statistical modeling, regression analysis is a statistical process for
estimating the relationships among variables.

✓ It focuses on the relationship between a dependent variable and one or more
independent variables.

✓ Regression analysis estimates the conditional expectation of the dependent
variable given the independent variables.

✓ The estimation target is a function of the independent variables called the
regression function.

✓ It also characterizes the variation of the dependent variable around the
regression function, which can be described by a probability distribution.

✓ Regression analysis is widely used for prediction and forecasting, where its
use has substantial overlap with the field of machine learning.

✓ Regression analysis is also used to understand which of the independent
variables are related to the dependent variable.
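A minimal sketch of simple linear regression with numpy: estimate the regression function of a dependent variable y on an independent variable x, then use it for prediction (data made up).

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.2, 4.1, 5.9, 8.2, 9.9])

# Least-squares fit of the regression function y ≈ slope*x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"y = {slope:.2f}*x + {intercept:.2f}")

# Prediction: the conditional expectation of y at a new x.
print("prediction at x = 6:", slope * 6 + intercept)
```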

Statistical significance

✓ Statistical significance is the likelihood that the difference in conversion
rates between a given variation and the baseline is not due to random
chance.

✓ The statistical significance level reflects the risk tolerance and confidence
level.

✓ There are two key variables that go into determining statistical significance:

▪ Sample size

▪ Effect size

✓ Sample size refers to the sample size of the experiment. The larger your
sample size, the more confident you can be in the result of the experiment
(assuming that it is a randomized sample).

✓ The effect size is the standardized mean difference between the two groups.

✓ If a particular experiment is replicated, the different effect-size estimates
from each study can easily be combined to give an overall best estimate of the
effect size.
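A sketch tying the two variables together: a two-sample t-test for significance with scipy, plus the effect size as Cohen's d (the standardized mean difference). The conversion-rate samples are made up.

```python
import numpy as np
from scipy import stats

baseline = np.array([0.10, 0.12, 0.11, 0.13, 0.12, 0.11])
variation = np.array([0.14, 0.15, 0.13, 0.16, 0.15, 0.14])

# Significance: a small p means the difference is unlikely to be random chance.
t_stat, p_value = stats.ttest_ind(variation, baseline)
print(f"p = {p_value:.4f}")

# Effect size (Cohen's d): difference of means over the pooled SD.
pooled_sd = np.sqrt((baseline.var(ddof=1) + variation.var(ddof=1)) / 2)
d = (variation.mean() - baseline.mean()) / pooled_sd
print(f"effect size d = {d:.2f}")
```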

Precision and Error limits:

✓ Precision refers to how close estimates from different samples are to each
other.

✓ The standard error is a measure of precision.

✓ When the standard error is small, estimates from different samples will be
close in value and vice versa.

✓ Precision is inversely related to standard error.

✓ The limits of error are the maximum overestimate and the maximum
underestimate from the combination of the sampling and the non-sampling
errors.

✓ The margin of error is defined as –

▪ Limit of error = Critical value × Standard deviation of the statistic

▪ Critical value: determines the tolerance level of error.

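The margin-of-error formula can be sketched for a sample mean at 95% confidence; the data and the choice of critical value (z = 1.96) are illustrative. For the sample-mean statistic, its standard deviation is the standard error.

```python
import math
import statistics

sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 12.0]

critical_value = 1.96                                    # z for 95% confidence
std_error = statistics.stdev(sample) / math.sqrt(len(sample))

margin_of_error = critical_value * std_error             # limit of error
print(f"{statistics.mean(sample):.2f} ± {margin_of_error:.2f}")
```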
