20CS610   Machine Learning
BE Sixth Semester 20CS610
    Course Outcomes
    ◻   After completing this course, students should be able to:
    ◻   CO1: Understand the basic concepts of Machine
        Learning.
    ◻   CO2: Formulate machine learning problems
        corresponding to different applications.
    ◻   CO3: Understand the unsupervised learning techniques.
    ◻   CO4: Understand the pre-processing activities done
        for Machine Learning algorithms.
    ◻   CO5: Understand neural networks implemented for
        various applications.
    Text Books
    ◻   Pattern Recognition and Image Analysis, Earl Gose,
        Richard Johnsonbaugh, Steve Jost, Pearson, 2015
    ◻   Pattern Classification, Richard O. Duda, Peter E.
        Hart, David G. Stork, Second edition, Wiley
        publication.
    ◻   Ethem Alpaydin (2014). Introduction to
        Machine Learning, Third Edition, MIT Press.
           The textbook website is https://www.cmpe.boun.edu.tr/~ethem/i2ml3e/
    Reference Books
    ◻   1. Machine Learning, Tom M. Mitchell, McGraw-Hill
        Publishers, 1997.
    ◻   2. Pattern Recognition and Machine Learning,
        Christopher M. Bishop, Springer Publishers, 2011.
    ◻   3. Kevin Murphy, Machine Learning: A Probabilistic
        Perspective, MIT Press, 2012.
    ◻   4. Understanding Machine Learning, Shai Shalev-
        Shwartz and Shai Ben-David, Cambridge University
        Press, 2017.
    Assessment                                    Weightage in Marks
    Class Test I                                  10
    Quiz/Mini Projects/Assignments/Seminars       10
    Class Test II                                 10
    Quiz/Mini Projects/Assignments/Seminars       10
    Class Test III                                10
    Total                                         50
    Question Paper Pattern
    ◻   Semester End Examination (SEE)
    ◻   Semester End Examination (SEE) is a written examination of three
        hours duration of 100 marks with 50% weightage.
    ◻   Note:
    ◻   •  The question paper consists of TWO parts: PART-A and PART-B.
    ◻   •  PART-A consists of Questions 1-5, which are compulsory
           (ONE question from each unit).
    ◻   •  PART-B consists of Questions 6-15, which have internal
           choice (TWO questions from each unit).
    ◻   •  Each question carries 10 marks and may consist of sub-
           questions.
    ◻   •  Answer 10 full questions of 10 marks each.
    UNIT – 1 Introduction & Bayesian Decision
    Theory:
    ◻   Introduction: What Is Machine Learning?
        Applications of Machine Learning, Types of
        Machine Learning, Statistical Decision Theory and
        Analysis
    ◻   Probability: Introduction, Basics of Probability,
        Combination, Permutation, Union, Intersection,
        Complement, Conditional Probability, Random
        Variables, Binomial Distribution, Normal Distribution,
        Joint Distributions and Densities, Moments of
        Random Variables
MACHINE LEARNING – UNIT 1
Introduction
◻   Artificial Intelligence (AI)
◻   Machine Learning (ML)
◻   Deep Learning (DL)
◻   Data Science
Artificial Intelligence
◻   Artificial intelligence is intelligence demonstrated
    by machines, as opposed to natural intelligence
    displayed by animals including humans.
Machine Learning
◻   Machine Learning – Statistical Tool to explore the data.
Machine learning (ML) is a type of artificial intelligence (AI) that allows
software applications to “self-learn” from training data and improve
over time, to become more accurate at predicting outcomes without being
explicitly programmed to do so.
Machine learning algorithms use historical data as input to predict new
output values.
Machine learning algorithms are able to detect patterns in data and
learn from them, in order to make their own predictions
For example, if you search for an item on Amazon, similar items will be
suggested the next time without your asking.
Deep Learning
◻   It is the subset of ML which mimics the human brain.
◻   Three popular Deep Learning techniques are:
      ANN – Artificial Neural Network
      CNN – Convolutional Neural Network
      RNN – Recurrent Neural Network
Summary:
What is Machine Learning?
◻   Learning : is any process by which a system improves
    performance from experience.
◻   Example: we speak of an experienced doctor, an experienced
    teacher, an experienced driver.
◻   Machine Learning is the study of algorithms that
    • improve their performance P
     • at some task T
     • with experience E.
    A well-defined learning task is given by <P,T,E>
    Defining the Learning Task
Improve on task T, with respect to performance metric P, based
on experience E
T: Recognizing hand-written words
P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words
T: Driving on four-lane highways using vision sensors
P: Average distance traveled before a human-judged error
  E: A sequence of images and steering commands recorded
 while observing a human driver.
◻   A Machine Learning system learns from historical data,
    builds the prediction models, and whenever it
    receives new data, predicts the output for it.
◻   The accuracy of the predicted output depends upon the
    amount of data: a huge amount of data helps to
    build a better model which predicts the output more
    accurately.
Types of Machine Learning:
  Supervised Learning
  Unsupervised Learning
  Semi-supervised Learning
  Reinforcement Learning
•   Classification is the task of assigning a class label to an input
    pattern. The class label indicates one of a given set of classes. The
    classification is carried out with the help of a model obtained
    using a learning procedure. According to the type of learning
    used, there are two categories of classification: supervised
    learning and unsupervised learning.
•   Supervised learning makes use of a set of examples which
    already have the class labels assigned to them.
•   Unsupervised learning attempts to find inherent structures in
    the data.
•   Semi-supervised learning makes use of a small number of
    labeled data and a large number of unlabeled data to learn the
    classifier.
          1. Supervised Learning
• Supervised learning: classification is seen as supervised
  learning from examples.
 – Supervision: The data (observations, measurements, etc.) are
    labeled with pre-defined classes, as if a “teacher” gives the
    classes (supervision).
 – It is called supervised learning because the process of an
    algorithm learning from the training dataset can be thought
    of as a teacher supervising the learning process.
 – Test data are classified into these classes too.
 – Example: Classify mails as spam or non-spam based on
    predecided parameters.
    Supervised Learning
◻   The aim of a supervised learning algorithm
    is to find a mapping function to map the
    input variable (x) to the output variable (y).
◻   In the real world, supervised learning can be
    used for risk assessment, image
    classification, fraud detection, spam
    filtering, etc.
◻   Types of supervised Machine learning
    Algorithms:
Supervised learning can be further divided into two types of problems:
    Regression
◻   Regression algorithms are used if there is a relationship between the
    input variable and the output variable. It is used for the prediction of
    continuous variables, such as Weather forecasting, Market Trends, etc.
Below are some popular Regression algorithms which come under
supervised learning:
•  Linear Regression
•  Regression Trees
•  Non-Linear Regression
•  Bayesian Linear Regression
•  Polynomial Regression
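As a quick illustration of regression (a minimal sketch added here, not from the slides; scikit-learn and the toy house-size data are assumptions):

    # Minimal linear regression sketch: predict a continuous target (price)
    # from a single input variable (house size). All values are made up.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[600], [800], [1000], [1200], [1500]])  # size in sq. ft
    y = np.array([150, 190, 240, 280, 350])               # price (toy units)

    model = LinearRegression().fit(X, y)   # learn the mapping y = f(x)
    print(model.predict([[1100]]))         # predict price of an unseen house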
    Classification
◻   Classification algorithms are used when the output variable is
    categorical, which means there are two classes such as Yes-
    No, Male-Female, True-false, etc.
Some of the Classification algorithms which come under
supervised learning are:
• Random Forest
• Decision Trees
• Logistic Regression
• Support vector Machines
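As a quick illustration of classification (a minimal sketch, not from the slides; the two per-mail features and all values are made-up placeholders):

    # Minimal classification sketch: predict a categorical label (spam/ham).
    from sklearn.tree import DecisionTreeClassifier

    X = [[0, 1], [1, 0], [7, 9], [8, 6], [0, 0], [9, 8]]  # [links, CAPS words]
    y = ["ham", "ham", "spam", "spam", "ham", "spam"]     # class labels

    clf = DecisionTreeClassifier().fit(X, y)
    print(clf.predict([[6, 7]]))   # classify an unseen mail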
Regression vs classification
◻   Regression models are used to predict a continuous
    value.
◻   Predicting prices of a house given the features of
    house like size, price etc is one of the common
    examples of Regression. It is a supervised technique.
◻   Classification is applied on discrete values.
◻   Examples of continuous prediction targets: age vs. height,
    temperature of a city, house price.
          2. Unsupervised learning
• Unsupervised learning (clustering)
     – Class labels of the data are unknown.
     – Given a set of data, the task is to establish the existence
          of classes or clusters in the data.
      ◻   Unsupervised learning is a type of machine
          learning in which models are trained on an
          unlabeled dataset and are allowed to act on that
          data without any supervision.
Types of unsupervised Algorithm
•   Clustering: Clustering is a method of grouping objects
    into clusters such that objects with the most similarities
    remain in one group and have few or no similarities with
    the objects of another group.
•   Association: An association rule is an unsupervised
    learning method used for finding relationships
    between variables in a large database. It determines the
    sets of items that occur together in the dataset.
    Association rules make marketing strategies more effective:
    for example, people who buy item X (say, bread) also
    tend to purchase item Y (say, butter or jam). A typical
    example of association rules is Market Basket Analysis.
    Unsupervised Learning algorithms:
◻   Below is the list of some popular unsupervised learning
    algorithms:
•   K-means clustering
•   KNN (k-nearest neighbors)
•   Hierarchical clustering
•   Anomaly detection
•   Neural Networks
•   Principal Component Analysis
•   Independent Component Analysis
•   Apriori algorithm
•   Singular value decomposition
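As a quick illustration of one of these (a minimal K-means sketch, not from the slides; scikit-learn and the toy points are assumptions):

    # Minimal K-means sketch: group unlabeled points into 2 clusters.
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 2], [1, 4], [1, 0],       # no class labels given
                  [10, 2], [10, 4], [10, 0]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)            # cluster index assigned to each point
    print(kmeans.cluster_centers_)   # learned cluster centers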
Supervised Learning vs. Unsupervised Learning
•  Supervised learning algorithms are trained using labeled data;
   unsupervised learning algorithms are trained using unlabeled data.
•  A supervised learning model takes direct feedback to check if it is
   predicting the correct output; an unsupervised learning model does not
   take any feedback.
•  A supervised learning model predicts the output; an unsupervised
   learning model finds the hidden patterns in data.
•  In supervised learning, input data is provided to the model along with
   the output; in unsupervised learning, only input data is provided.
•  The goal of supervised learning is to train the model so that it can
   predict the output when given new data; the goal of unsupervised
   learning is to find the hidden patterns and useful insights in an
   unknown dataset.
•  Supervised learning needs supervision to train the model; unsupervised
   learning does not need any supervision.
•  Supervised learning can be categorized into Classification and
   Regression problems; unsupervised learning can be classified into
   Clustering and Association problems.
Supervised vs. unsupervised learning
3. Semi supervised learning
It makes use of a small number of labeled data and a large number
of unlabeled data to learn.
The model uses labeled data as an input to make inferences
about the unlabeled data.
     4. Reinforcement Learning
     ◻   Reinforcement learning is a machine learning
         training method based on rewarding desired
          behaviors and/or punishing undesired ones.
     ◻   In general, a reinforcement learning agent is able
         to perceive and interpret its environment, take
         actions and learn through trial and error.
     ◻   Since there is no training data, machines learn
         from their own mistakes and choose the
         actions that lead to the best solution or
         maximum reward.
     Reinforcement – Learning from the environment
Life cycle of Machine learning
             Learning
     •   The classifier to be designed is built using input samples
         which are a mixture of all the classes.
     •   The classifier learns how to discriminate between samples of
         different classes.
     •   If the learning is offline, i.e. a supervised method, then the
         classifier is first given a set of training samples, the
         optimal decision boundary is found, and then the classification
         is done.
     •   If the learning is online then there is no teacher and no
         training samples (unsupervised). The input samples are the
         test samples themselves. The classifier learns and classifies at the
         same time.
    Training and Testing data
◻   Two types of data set in supervised classifier.
      Training set : 70 to 80% of the available data will be used for
      training the system.
       In Supervised classification Training data is the data you use to
      train an algorithm or machine learning model to predict the
      outcome you design your model to predict.
      Testing set : around 20-30% will be used for testing the system. Test
      data is used to measure the performance, such as accuracy or
      efficiency, of the algorithm you are using to train the machine.
      Testing is the measure of quality of your algorithm.
      Often, even after training on 80% of the data, failures can be seen
      during testing; the reason is that the test data is not well
      represented in the training set.
◻   Unsupervised classifier does not use training data
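A minimal sketch of such a split (not from the slides; scikit-learn, the iris dataset, and the k-NN classifier are assumptions):

    # 70/30 train-test split and accuracy measurement on the held-out data.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)   # 70% train, 30% test

    clf = KNeighborsClassifier().fit(X_train, y_train)
    print(accuracy_score(y_test, clf.predict(X_test)))  # quality on unseen data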
      Model Selection?
◻   It is the process of selecting the optimal model from the set of candidate
    models, for a given data.
◻   Data used in learning:
      Training Set (Usually 60%)
      Validation set – Cross Validation (20%) : Validated on different models on the
      same training set
      Test data (20%) : Unseen data
◻   Model Selection is finding the optimal model which minimizes both bias and
    variance.
◻   Bias is the error during training and Variance is the error during testing
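A minimal model-selection sketch (an illustration under assumptions: scikit-learn, the iris dataset, and k in k-NN standing in for the candidate models):

    # Score each candidate model with 5-fold cross-validation on the
    # training data, then pick the best-scoring one.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    for k in (1, 3, 5, 7):   # candidate models
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X, y, cv=5)
        print(k, scores.mean())  # choose the k with the best mean score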
    Features and classes
◻   Properties or attributes used to classify the objects are
    called features.
◻   A collection of “similar” (not necessarily same) objects are
    grouped together as one “class”.
◻   For example: various shapes and fonts of the letter T (as shown on
    the slide) are all classified as the character T.
◻   Classes are identified by a label.
◻   Most of the pattern recognition tasks are first done by
    humans and automated later.
Samples or patterns
◻   The individual items or objects or situations to be
    classified will be referred as samples or patterns
    or data.
◻   The set of data is called “Data Set”.
• A pattern is anything which has a regular sequence of occurrence. A pattern
   can either be observed through visualization or be derived mathematically by
   applying algorithms.
• Patterns include repeated trends in various forms of data
• Examples : Speech pattern, print on clothes, design of outfits, jewelry pattern,
   Sound wave, tree species, fingerprint, face, barcode, QR-code, handwriting,
   or character image etc.
• Pattern recognition is the process of recognizing patterns by using a machine
   learning algorithm. Pattern recognition can be defined as the classification of
   data based on knowledge already gained or on statistical information
   extracted from patterns and/or their representation. One of the important
   aspects of pattern recognition is its application potential.
• Examples: Speech recognition, speaker identification, multimedia document
   recognition (MDR), automatic medical diagnosis.
The data inputs for pattern recognition can be words or texts, images, or audio
files. Hence, pattern recognition is broader than computer vision, which
focuses on image recognition.
In a typical pattern recognition application, the raw data is processed and
converted into a form that is convenient for a machine to use. Pattern
recognition involves the classification and clustering of patterns.
    Definition of Pattern Recognition
•     Pattern recognition is defined as the study of how machines can observe the
      environment, learn to distinguish various patterns of interest from their
      background, and make logical decisions about the categories of the patterns.
      During recognition, the given objects are assigned to a specific category.
•     In general, pattern recognition can be described as an information reduction,
      information mapping, or information labeling process.
•     In computer science, pattern recognition refers to the process of matching
      information already stored in a database with incoming data based on
      their attributes or features.
Input Data and Output Response for Various Applications
Task of Classification         Input Data                   Output Response
Character Recognition          Optical Signals or Strokes   Name of the character
Speech Recognition             Acoustic Waveforms           Name of the word
Speaker Recognition            Voice                        Name of the speaker
Weather Prediction             Weather Maps                 Weather Forecasts
Medical Diagnosis              Symptoms                     Disease
Stock Market Prediction        Financial News and Charts    Predicted Market ups and Downs
     Image Processing Example
•     Sorting Fish: incoming fish are sorted
      according to species using optical sensing
      (sea bass or salmon?)
•     Problem Analysis:
    ▪ set up a camera and take some sample
       images to extract features
    ▪ Consider features such as length, lightness,
       width, number and shape of fins, position of
       mouth, etc.
                 Preprocessing
A critical step for reliable feature extraction!
Examples:
• Noise removal
• Image enhancement
• Separate touching or occluding fish
• Extract boundary of each fish
                        Feature Extraction
• How to choose a good set of features?
 – Discriminative features
 – Invariant features (e.g., invariant to geometric
    transformations such as translation, rotation and scale)
• Are there ways to automatically learn which features are best?
                      Feature Extraction (cont’d)
[Figure: histogram of the “length” feature for the two classes, with
decision threshold l*.]
•   Even though sea bass is longer than salmon on average, there are
    many examples of fish where this observation does not hold.
Add Another Feature
  Lightness is a better feature than length because it reduces the
  misclassification error.
  Can we combine features in such a way that we improve performance?
  (Hint: correlation)
                            Multiple Features
•     To improve recognition accuracy, we might need to use more than one
      feature.
    – Single features might not yield the best performance.
    – Using combinations of features might yield better performance.
• If the feature space cannot be perfectly separated by a
  straight line, a more complex boundary might be used.
  (non-linear)
• Alternatively, a simple decision boundary such as a straight
  line might be used even if it does not perfectly separate
  the classes, provided that the error rates are
  acceptably low.
Decision region and Decision Boundary
•   Our goal of Machine learning is to reach an optimal decision rule to
    categorize the incoming data into their respective categories
•   The decision boundary separates points belonging to one class from
    points of other
•    The decision boundary partitions the feature space into decision
    regions.
•   The nature of the decision boundary is decided by the discriminant
    function which is used for decision. It is a function of the feature vector.
    Hyper planes and Hyper surfaces
    For the two-category case, a positive value of the discriminant function
    decides class 1 and a negative value decides the other.
•      If the number of dimensions is three, then the decision boundary will be a
       plane or a 3-D surface. The decision regions become semi-infinite volumes.
•     If the number of dimensions increases to more than three, then the decision
      boundary becomes a hyper-plane or a hyper-surface. The decision regions
      become semi-infinite hyperspaces.
    Decision Theory
•   Can we do better than a linear classifier?
EXAMPLES FOR MACHINE
LEARNING APPLICATIONS
Handwriting Recognition
License Plate Recognition
Biometric Recognition
Face Detection/Recognition
 Detection
             Matching
                             Recognition
Fingerprint Classification
  Important step for speeding up identification
Autonomous Systems
  Obstacle detection and avoidance
       Object recognition
Medical Applications
Skin Cancer Detection Breast Cancer Detection
Land Cover Classification
(using aerial or satellite images)
     Many applications including “precision” agriculture.
Statistical Decision Theory
◻   Decision theory, in statistics, is a set of
    quantitative methods for reaching optimal
    decisions.
              Example for Statistical Decision Theory
•   Consider a hypothetical Basketball Association:
•   The prediction could be based on the difference between the home
    team’s average number of points per game (apg) and the visiting
    team’s apg for previous games.
•   The training set consists of scores of previously played games, with
    each home team classified as winner or loser.
•   Now the prediction problem is: given a game to be played, predict
    whether the home team will win or lose using the feature ‘dapg’,
•   where dapg = Home team apg – Visiting team apg.
Data set of games showing outcomes, differences between average numbers of points scored and differences
between winning percentages for the participating teams in previous games
•  The figure shown in the previous slide lists 30 games, gives the value of
   dapg for each game, and tells whether the home team won or lost.
• Notice that in this data set the team with the higher apg usually wins.
• For example, in the 9th game the home team, on average, scored 10.8 fewer
  points in previous games than the visiting team, and the home team lost.
• When the teams have about the same apgs, the outcome is less certain. For
  example, in the 10th game, the home team on average scored 0.4 fewer points
  than the visiting team, yet the home team won the match.
• Similarly, in the 12th game, the home team had an apg 1.1 less than the
  visiting team on average, and the team lost.
                 Histogram of dapg
• Histogram is a convenient way to describe the data.
• To form a histogram, the data from a single class are
  grouped into intervals.
• Over each interval a rectangle is drawn, with height
  proportional to the number of data points falling in that
  interval. In the example the interval is chosen to have a width
  of two units.
• The general observation is that the prediction is not accurate
  with the single feature ‘dapg’.
[Histogram of dapg for the ‘Lost’ and ‘Won’ classes.]
                                       Prediction
•   To predict, normally a threshold value T is used:
•   ‘dapg’ > T is predicted as a win,
•   ‘dapg’ < T is predicted as a loss.
•    T is called the decision boundary or threshold.
•    If T = -1, four samples in the original data are misclassified.
    – Here 3 winners are called losers and one loser is called a winner.
•    If T = 0.8, no samples from the loser class are misclassified as winners, but 5
     samples from the winner class are misclassified as losers.
•    If T = -6.5, no samples from the winner class are misclassified as losers, but 7
     samples from the loser class are misclassified as winners.
•    By inspection, we see that when a decision boundary is used to classify the
     samples, the minimum number of samples misclassified is four, attained at T = -1.
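A small sketch of this threshold rule (the full 30-game table lives in the slide’s figure; the dapg values below mix the three quoted in the text with made-up stand-ins):

    # Count misclassifications of the rule "dapg > T => predict win"
    # for candidate thresholds T.
    games = [(-10.8, "lost"), (-0.4, "won"), (-1.1, "lost"),
             (5.2, "won"), (3.1, "won"), (-6.0, "lost")]   # (dapg, outcome)

    def errors(T):
        return sum((dapg > T) != (outcome == "won")
                   for dapg, outcome in games)

    for T in (-6.5, -1.0, 0.8):
        print(T, errors(T))   # compare candidate decision boundaries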
                 Using Additional Feature dwp
•   To make it more accurate let us consider two features.
•   Additional features often increases the accuracy of
    classification.
•   Along with ‘dapg’ another feature ‘dwp’ is considered.
•   wp= winning percentage of a team in previous games
•   dwp = difference in winning percentage between teams
•   dwp = Home team wp – visiting team wp
Data set of games showing outcomes, differences between average number of points scored and differences between
                        winning percentages for the participating teams in previous games
•   Now observe the results on a scatterplot
•   Each sample has a corresponding feature vector (dapg,dwp), which determines its position in the plot.
•   Note that the feature space can be divided into two decision regions by a straight line, called a
    linear decision boundary (refer to the line equation). Finding such a separating line is what a
    method like logistic regression does.
•   If the sample lies above the decision boundary, the home team is classified as the winner; if it
    lies below the decision boundary, it is classified as the loser.
         Prediction with two parameters.
• Consider the following: Springfield (home team)
• dapg = home team apg – visiting team apg = 98.3 – 102.9 = -4.6
• dwp = home team wp – visiting team wp = 21.4 – 58.1 = -36.7
• Since the point (dapg, dwp) = (-4.6,-36.7) lies below the decision
  boundary, we predict that the home team will lose the game.
◻   If the feature space cannot be perfectly separated by a
    straight line, a more complex boundary might be used. (non-
    linear)
◻   Alternatively, a simple decision boundary such as a straight line
    might be used even if it does not perfectly separate the classes,
    provided that the error rates are acceptably low.
◻   Having the model shown in the previous slide, we can
    use it for any type of recognition and classification.
◻    It can be
      speaker recognition
      Speech recognition
      Image classification
      Video recognition and so on…
◻   It is now very important to learn:
      Different techniques to extract the features
      Then in the second stage, different methods to
      recognize the pattern and classify
     ■   Some of them use a statistical approach
     ■   A few use probabilistic models with mean, variance, etc.
     ■   Other methods are - neural network, deep neural networks
     ■   Hyper box classifier
     ■   Fuzzy measure
     ■   And mixture of some of the above
    When do we use Machine Learning?
◻    ML is used when:
    • Human expertise does not exist (navigating on Mars)
    • Humans can’t explain their expertise (speech
      recognition)
    • Models must be customized (personalized medicine)
    • Models are based on huge amounts of data
      (genomics)
PROBABILITY:
INTRODUCTION TO PROBABILITY
PROBABILITIES OF EVENTS
What is covered?
◻   Basics of Probability
◻   Combination
◻   Permutation
◻   Examples for the above
◻   Union
◻   Intersection
◻   Complement
     What is a probability
◻   Probability is the branch of mathematics concerning numerical
    descriptions of how likely an event is to occur.
◻   The probability of an event is a number between 0
    and 1, where, roughly speaking, 0 indicates that
    the event will not happen and 1 indicates that the
    event happens all the time.
     Experiment
◻   The term experiment is used in probability theory
    to describe a process for which the outcome is not
    known with certainty.
    Example of experiments are:
        Rolling a fair six-sided die.
       Randomly choosing 5 apples from a lot of 100 apples.
Event
◻   An event is an outcome of an experiment. It is
    denoted by a capital letter, say E1, E2, … or A, B, …
    and so on.
◻   For example, in tossing a coin, H and T are two events.
◻   The event consisting of all possible outcomes of a
    statistical experiment is called the “Sample Space”.
    Ex: { E1,E2…}
Examples
  Sample Space of Tossing a coin = {H,T}
  Tossing 2 Coins = {HH,HT,TH,TT}
Random phenomena
 –  We are unable to predict the outcomes, but in the long run the
    outcomes exhibit statistical regularity.
 Examples
1.   Tossing a coin – outcomes S = {Head, Tail}
 We are unable to predict on each toss whether it is Head or Tail.
 In the long run we can predict that 50% of the time heads will occur
 and 50% of the time tails will occur.
2.   Rolling a die – outcomes
          S = {1, 2, 3, 4, 5, 6}
 We are unable to predict the outcome, but in the long run one can
 determine that each outcome will occur 1/6 of the time.
 Use symmetry: each side is the same, so one side should not
 occur more frequently than another side in the long run. If the
 die is not balanced this may not be true.
Example
  ◻   The die toss:
  ◻   Simple events: E1, E2, E3, E4, E5, E6 (rolling 1 through 6)
  ◻   Sample space: S = {E1, E2, E3, E4, E5, E6}
     Frequency of an event
     ◻   Frequency of occurrence is measured by the relative frequency
         nA/n, where nA is the number of times event A occurred in n
         trials (after the event has occurred).
     •   If we let n get infinitely large, P(A) = lim (n→∞) nA/n.
    The Probability of an Event
◻   The probability of an event A measures “how
    often” A will occur. We write P(A).
◻   P(A) must be between 0 and 1.
      If event A can never occur, P(A) = 0. If event A always
      occurs when the experiment is performed, P(A) =1.
      Then P(A) + P(not A) = 1.
      So P(not A) = 1-P(A)
◻   The sum of the probabilities for all simple events in S
    equals 1.
Example 1
    Toss a fair coin twice. What is the probability
     of observing at least one head?
    The simple events are HH, HT, TH, TT, each with probability 1/4.
    P(at least 1 head) = P(HH) + P(HT) + P(TH)
                       = 1/4 + 1/4 + 1/4 = 3/4
Example 2
 A bowl contains three M&Ms®, one red, one
 blue and one green. A child selects two M&Ms at
 random. What is the probability that at least one
 is red?
 The ordered outcomes are RB, RG, BR, BG, GB, GR,
 each with probability 1/6.
 P(at least 1 red) = P(RB) + P(BR) + P(RG) + P(GR)
                   = 4/6 = 2/3
Example 3
The sample space of throwing a pair of dice consists of the 36 equally
likely ordered pairs (i, j), with i, j = 1, …, 6.
Example 3
Event            Simple events        Probability
Dice add to 3    (1,2),(2,1)          2/36
Dice add to 6    (1,5),(2,4),(3,3),   5/36
                 (4,2),(5,1)
Red die show 1   (1,1),(1,2),(1,3),   6/36
                 (1,4),(1,5),(1,6)
Green die        (1,1),(2,1),(3,1),   6/36
show 1           (4,1),(5,1),(6,1)
Permutations
  ◻     The number of ways you can arrange n distinct
        objects, taking them r at a time, is
        nPr = n!/(n − r)!
        Example: How many 3-digit lock
        combinations can we make from the
        numbers 1, 2, 3, and 4?
  The order of the choice is important!
  P(4,3) = 4!/(4 − 3)! = 24
    Examples
Example: A lock consists of five parts and can be
assembled in any order. A quality control engineer wants
to test each order for efficiency of assembly. How many
orders are there?
        The order of the choice is important!
        P(5,5) = 5!/0! = 120
 It is required to seat 5 men and 4 women in a row so that the women occupy the
 even places. How many such arrangements are possible?
 Solution:
 Given: 5 men and 4 women
 Total number of people = 9
 The women occupy even places, which means they will be sitting in the 2nd, 4th, 6th
 and 8th places, whereas the men will be sitting in the 1st, 3rd, 5th, 7th and 9th
 places.
 The number of arrangements in which 4 women can sit in 4 places = 4P4 = 4!/(4 – 4)!
 = 4!/0! = 24/1 = 24
 5 men can occupy 5 seats in 5P5 ways.
 That means the number of ways they can be seated = 5P5 = 5!/(5 – 5)! = 5!/0! =
 120/1 = 120
 Therefore, the total numbers of possible sitting arrangements = 24 × 120 = 2880
Combinations
◻    The number of distinct combinations of n
     distinct objects that can be formed, taking
     them r at a time, is
     nCr = n!/(r!(n − r)!)
     Example: Three members of a 5-person committee
     must be chosen to form a subcommittee. How many
     different subcommittees could be formed?
    The order of the choice is not important!
    C(5,3) = 5!/(3! 2!) = 10
                            Example
•   A box contains six M&Ms®, four red
    and two green. A child selects two M&Ms at random. What is the
    probability that exactly one is red?
    The order of the choice is not important!
    There are C(6,2) = 15 ways to choose two of the six M&Ms, and
    4 × 2 = 8 ways to choose 1 red and 1 green M&M.
    P(exactly one red) = 8/15
  A team of four has to be selected from 6 boys and 4 girls. How many different ways a
  team can be selected if at least one boy must be there in the team?
  Solution:
  Combination of a four-member team with at least one boy are:
  {(BGGG), (BBGG), (BBBG), (BBBB)}
  Number of ways one boy and three girls can be selected = 6 C 1 × 4 C 3 = 6 × 4 = 24
  Number of ways two boys and two girls can be selected = 6 C 2 × 4 C 2 = 15 × 6 = 90
  Number of ways three boys and one girl can be selected = 6 C 3 × 4 C 1 = 20 × 4 = 80
  Number of ways four boys can be selected = 6 C 4 = 15
  Total number of ways to form such a team = 24 + 90 + 80 + 15 = 209.
Is it combination or permutation?
◻   Having 6 dots in a braille cell, how many different characters can
    be made?
◻   It is a problem of combinations:
◻   C(6,0) + C(6,1) + C(6,2) + C(6,3) + C(6,4) + C(6,5) + C(6,6)
    = 1 + 6 + 15 + 20 + 15 + 6 + 1 = 64
◻   (Why are combinations used and not permutations? Because each dot is
    of the same nature.)
◻   64 different characters can be made.
◻   (It is the summation of the combinations C(6, N), where N runs from 0 to 6.)
Having 4 characters, how many 2-character words can be formed?
Permutation: P(4,2) = 12
Combination: C(4,2) = 6
Remember that the number of permutations is larger than the number of combinations.
Summary:
◻   Formula for Permutations (order is relevant): nPr = n!/(n − r)!
◻   Formula for Combinations (order is not relevant): nCr = n!/(r!(n − r)!)
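These formulas and the worked examples above can be checked directly in Python (a sketch; math.perm and math.comb require Python 3.8+):

    # Re-deriving the worked examples with Python's math module.
    from math import perm, comb

    print(perm(4, 3))                  # 3-digit lock from 4 digits: 24
    print(perm(4, 4) * perm(5, 5))     # women x men seating: 24 * 120 = 2880
    print(comb(5, 3))                  # subcommittees of 3 from 5: 10
    print(sum(comb(6, k) for k in range(7)))  # braille characters: 64
    print(perm(4, 2), comb(4, 2))      # 12 vs 6: permutations exceed combinations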
EVENT RELATIONS
                       An Event, E
The event, E, is any subset of the sample space, S, i.e. any set
of outcomes (not necessarily all outcomes) of the random
phenomena. (Venn diagram: E is drawn as a region inside S.)
The event, E, is said to have occurred if, after the
outcome has been observed, the outcome lies in E.
Examples
1.   Rolling a die – outcomes
         S = {1, 2, 3, 4, 5, 6}
       E = the event that an even number is rolled
         = {2, 4, 6}
 Special Events
The null event, also called the empty event, is
  represented by φ:
φ = { } = the event that contains no
 outcomes
  The Entire Event, The Sample Space - S
   S = the event that contains all outcomes
3 Basic Event relations
1. Union if you see the word or,
2. Intersection if you see the word and,
3. Complement if you see the word not.
Union
Let A and B be two events; then the
union of A and B is the event (denoted
by A∪B) defined by:
 A ∪ B = {e | e belongs to A or e belongs
 to B}
The event A ∪ B occurs if the event A
occurs, or the event B occurs, or both occur.
Intersection
Let A and B be two events; then the
intersection of A and B is the event
(denoted by A∩B) defined by:
  A ∩ B = {e | e belongs to A and e
  belongs to B}
The event A ∩ B occurs if the event A
occurs and the event B occurs.
Complement
Let A be any event; then the
complement of A (denoted by Ā) is
defined by:
    Ā = {e | e does not belong to A}
The event Ā occurs if the event A does
not occur.
Mutually Exclusive
Two events A and B are called
mutually exclusive if A ∩ B = φ.
If two events A and B are mutually
exclusive then:
1. They have no outcomes in common.
   They can’t occur at the same time: the outcome of
   the random experiment cannot belong to both A
   and B.
RULES OF PROBABILITY
ADDITIVE RULE
RULE FOR COMPLEMENTS
Additive rule (General case)
 P[A ∪ B] = P[A] + P[B] – P[A ∩ B]
       or
 P[A or B] = P[A] + P[B] – P[A and
 B]
The additive rule (mutually exclusive events): if A ∩ B = φ,
            P[A ∪ B] = P[A] + P[B]
            i.e.
            P[A or B] = P[A] + P[B]
         if A ∩ B = φ
         (A and B mutually exclusive)
When P[A] is added to P[B], the outcomes in A ∩ B are counted
twice; hence
  P[A ∪ B] = P[A] + P[B] – P[A ∩ B]
Example:
Bangalore and Mohali are two of the cities competing for the National
university games. (There are also many others).
The organizers are narrowing the competition to the final 5 cities.
There is a 20% chance that Bangalore will be amongst the final 5.
There is a 35% chance that Mohali will be amongst the final 5 and
 an 8% chance that both Bangalore and Mohali will be amongst the
final 5.
What is the probability that Bangalore or Mohali will be amongst the
final 5?
Solution:
Let A = the event that Bangalore is amongst the
final 5.
Let B = the event that Mohali is amongst the final 5.
Given P[A] = 0.20, P[B] = 0.35, and P[A ∩ B] = 0.08.
What is P[A ∪ B]?
P[A ∪ B] = P[A] + P[B] – P[A ∩ B] = 0.20 + 0.35 – 0.08 = 0.47
Note: “and” ≡ ∩, “or” ≡ ∪ .
 Find the probability of drawing an ace or a spade
              from a deck of cards.
There are 52 cards in a deck; 13 are spades, 4 are aces.
Probability of a single card being a spade: 13/52 = 1/4.
Probability of drawing an ace: 4/52 = 1/13.
Probability of a single card being both a spade and an ace = 1/52.
Let A = the event of drawing a spade.
Let B = the event of drawing an ace.
Given P[A] = 1/4, P[B] = 1/13, and P[A ∩ B] = 1/52,
P[A ∪ B] = 1/4 + 1/13 – 1/52 = 16/52 = 4/13.
     Rule for complements
P[Ā] = 1 – P[A]
Complement: let A be any event; then the complement of A
(denoted by Ā) is defined by:
      Ā = {e | e does not belong to A}
The event Ā occurs if the event A does not occur.
Logic: A and Ā are mutually exclusive and A ∪ Ā = S, so
P[A] + P[Ā] = 1, i.e. P[Ā] = 1 – P[A].
What Is Conditional Probability?
◻   Conditional probability is defined as the likelihood
    of an event or outcome occurring, based on the
    occurrence of a previous event or outcome.
◻   Conditional probability is calculated by multiplying
    the probability of the preceding event by the
    updated probability of the succeeding, or
    conditional, event.
◻   Bayes' theorem is a mathematical formula used
    in calculating conditional probability.
Definition
Suppose that we are interested in computing the probability of
event A and we have been told event B has occurred.
Then the conditional probability of A given B is defined to be:
    P[A|B] = P[A ∩ B] / P[B]
    Similarly,   P[B|A] = P[A ∩ B] / P[A]
•     From the previous two expressions,
    P[A ∩ B] = P[B]·P[A|B]
    and      P[A ∩ B] = P[A]·P[B|A];
    either can be used to calculate P[A ∩ B].
    The Multiplication Rule
•    In many cases, P(A) may not depend on whether B has occurred. We say
     that the event A is independent of B if P(A) = P(A|B). An important
     consequence of the definition of independence is the multiplication rule, which
     is obtained by substituting P(A) for P(A|B) in the above expressions:
• P[A ∩ B] = P[A].P[B] whenever A is independent of B
Rationale:
If we’re told that event B has occurred, then the sample space is restricted to B.
The probability within B has to be normalized; this is achieved by dividing by P[B].
The event A can now only occur if the outcome is in A ∩ B. Hence the new probability
of A is:
    P[A|B] = P[A ∩ B] / P[B]
An Example
The academy awards is soon to be shown.
For a specific married couple the probability that the
husband watches the show is 80%, the probability that
his wife watches the show is 65%, while the
probability that they both watch the show is 60%.
If the husband is watching the show, what is the
probability that his wife is also watching the show?
Solution:
Let B = the event that the husband watches the show:
P[B] = 0.80
Let A = the event that his wife watches the show:
P[A] = 0.65 and P[A ∩ B] = 0.60
P[A|B] = P[A ∩ B] / P[B] = 0.60 / 0.80 = 0.75
Another example
◻   There are 100 students in a class.
◻   40 students like Apple.
        Consider this event as A, so the probability of occurrence of A is 40/100 = 0.4.
◻   30 students like Orange.
        Consider this event as B, so the probability of occurrence of B is 30/100 = 0.3.
◻   The remaining students like neither Apple nor Orange.
◻   20 students like both Apple and Orange, so the probability of both A and B occurring
    is P(A ∩ B) = 20/100 = 0.2.
◻   What is the probability of A given B, i.e. the probability that A occurs given that B
    has occurred?
        P(A|B) = P(A ∩ B) / P(B) = 0.2/0.3 ≈ 0.67
        P(A|B) indicates A occurring within the sample space of B:
        here we are not considering the entire sample space of 100
        students, but only the 30 students in B.
Example: Calculating the conditional probability of rain given that the
barometric pressure is high.
Weather records show that high barometric pressure (defined as being over
760 mm of mercury) occurred on 160 of the 200 days in a data set, and it
rained on 20 of the 160 days with high barometric pressure. If we let R denote
the event “rain occurred” and H the event “high barometric pressure occurred”
and use the frequentist approach to define probabilities, then
     P(H) = 160/200 = 0.8
and P(R and H) = 20/200 = 0.10.
We can obtain the probability of rain given high pressure directly from the
data:
     P(R|H) = 20/160 = 0.125
Using conditional probability
      P(R|H) = P(R and H)/P(H) = 0.10/0.8 = 0.125.
 Example : In my town, it's rainy one third of the days. Given that it is rainy,
 there will be heavy traffic with probability 1/2, and given that it is not rainy,
 there will be heavy traffic with probability 1/4. If it's rainy and there is heavy
 traffic, I arrive late for work with probability 1/2. On the other hand, the
 probability of being late is reduced to 1/8 if it is not rainy and there is no
 heavy traffic. In other situations (rainy and no traffic, not rainy and traffic) the
 probability of being late is 0.25. You pick a random day.
• What is the probability that it's not raining and there is heavy traffic and I am
   not late?
• What is the probability that I am late?
• Given that I arrived late at work, what is the probability that it rained that
   day?
 Let R be the event that it's rainy, T be the event that there is heavy traffic,
 and L be the event that I am late for work. As it is seen from the problem
 statement, we are given conditional probabilities in a chain format. Thus, it is
 useful to draw a tree diagram for this problem. In this figure, each leaf in the
 tree corresponds to a single outcome in the sample space. We can calculate the
 probabilities of each outcome in the sample space by multiplying the
 probabilities on the edges of the tree that lead to the corresponding outcome
a. The probability that it's not raining and there is heavy traffic and I am not
    late can be found using the tree diagram which is in fact applying the chain
    rule:
      P(Rc∩T∩Lc) =P(Rc)P(T|Rc)P(Lc|Rc∩T)
               =2/3⋅1/4⋅3/4
               =1/8.
b. The probability that I am late can be found from the tree. All we need to do
    is sum the probabilities of the outcomes that correspond to me being late. In
    fact, we are using the law of total probability here.
      P(L) =P(R and T and L)+P(R and Tc and L) + P(Rc and T and L) + P(Rc and
  Tc and L)
            =1/12+1/24+1/24+1/16
            =11/48.
  c. We can find P(R|L) using
      P(R|L)=P(R∩L)/P(L)
      We have already found P(L)=11/48 and we can find P(R∩L) similarly by
  adding the probabilities of the outcomes that belong to R∩L.
In particular,
    P(R∩L) =P(R,T,L)+P(R,Tc,L)
         =1/12+1/24
         =1/8
Thus we obtain
     P(R|L) =P(R∩L)/P(L)
     =(1/8)/(11/48)
     =6/11.
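The same tree computation can be scripted; a sketch using exact fractions (the probabilities are the ones given in the problem statement):

    # Chain rule over the rain/traffic/late tree from the example.
    from fractions import Fraction as F

    P_R = F(1, 3)                              # P(rain)
    P_T = {True: F(1, 2), False: F(1, 4)}      # P(traffic | rain?)
    P_L = {(True, True): F(1, 2), (True, False): F(1, 4),
           (False, True): F(1, 4), (False, False): F(1, 8)}  # P(late | rain, traffic)

    p_late = F(0)
    p_rain_and_late = F(0)
    for rain in (True, False):
        for traffic in (True, False):
            # probability of one leaf = product of edge probabilities
            p = ((P_R if rain else 1 - P_R)
                 * (P_T[rain] if traffic else 1 - P_T[rain])
                 * P_L[(rain, traffic)])
            p_late += p
            if rain:
                p_rain_and_late += p

    print(p_late)                     # 11/48
    print(p_rain_and_late / p_late)   # P(rain | late) = 6/11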
Random Variables
A random variable takes a random value, which is real and can be finite
or infinite, and it is generated by a random experiment.
The random value is generated by a function.
Example: Let us consider an experiment of tossing two coins.
Then sample space is S= { HH, HT, TH, TT}
Given X as random variable with condition: number of heads.
X(HH) =2
X(HT) =1
X(TH) =1
X(TT) = 0
• The distribution function for the number of heads from two flips of a coin.
• The random variable k is defined to be the total number of heads that occur
   when a fair coin is flipped two times.
• This random variable can have only 3 values 0, 1, 2, so it is discrete.
• The sample space is (T, T), (T, H), (H, T), (H, H), and the distribution
   function is:
                     k      P(k)
                     0      1/4
                     1      2/4
                     2      1/4
•   In general, a probability distribution function takes the following form.
•   The table shows the pmf of a dataset. Areas under pmf graphs correspond to probability
•   For example:
    Pr(X = 2)
    = shaded rectangle
    = height × base
• Two types of random variables
    – Discrete random variables (countable set of possible outcomes)
    – Continuous random variable (unbroken chain of possible
       outcomes)
     A discrete random variable is described by its distribution
     function which lists for each outcome x the probability P(x) of x.
•    Discrete random variables are understood in terms of their probability mass function (pmf)
•    pmf ≡ a mathematical function that assigns probabilities to all possible outcomes for a discrete
     random variable.
Discrete random variables
◻   If the variable value is finite or infinite but
    countable, then it is called discrete random variable.
◻   Example of tossing two coins and to get the count
    of number of heads is an example for discrete
    random variable.
◻   Sample space of real values is fixed.
Example: Random variable
        Probability of customers purchasing more than 3 ice creams:
•   Consider P(X > 3).
•   Then it is the sum P(X=4) + P(X=5) + P(X=6) =
    0.04 + 0.04 + 0.02 = 0.10.
•   10 percent of the customers will purchase more than 3 ice creams.
•   Note that the pmf must satisfy the summation over i = 1 to n of P(Xi) = 1.
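A sketch of this pmf computation (the full table is in the slide’s figure; the probabilities for X = 0..3 below are made-up placeholders chosen so the pmf sums to 1):

    # P(X > 3) for the ice-cream pmf; only X = 4, 5, 6 come from the slide.
    pmf = {0: 0.20, 1: 0.35, 2: 0.20, 3: 0.15,   # assumed values
           4: 0.04, 5: 0.04, 6: 0.02}            # values quoted above

    print(sum(pmf.values()))                        # ~1.0 (pmf must sum to 1)
    print(sum(p for x, p in pmf.items() if x > 3))  # P(X > 3) = 0.10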
    Continuous Random Variable
◻   If the random variable’s values lie between two fixed numbers, then it
    is called a continuous random variable. The result can be finite or infinite.
◻   The sample space of real values is not fixed, but lies within a range.
◻   If X is the random variable and its values lie between a and b, then
         it is represented by: a <= X <= b
Examples: temperature, age, weight, height, etc. range over a specific
interval.
              Here the values in the sample space are infinite.
Probability distribution
◻   A frequency distribution is a listing of the observed
    frequencies of all the outputs of an experiment that
    actually occurred when the experiment was done,
◻   whereas a probability distribution is a listing of
    the probabilities of all possible outcomes that could
    result if the experiment were done (a distribution
    of expectations).
Broad classification of Probability
distribution
◻   Discrete probability distribution
      Binomial distribution
      Poisson distribution
◻   Continuous Probability distribution
      Normal distribution
Discrete Probability Distribution:
Binomial Distribution
◻   A binomial distribution can be thought of as
    simply the probability of a SUCCESS or FAILURE
    outcome in an experiment or survey that is
    repeated multiple times. (When we have only two
    possible outcomes)
◻   Example, a coin toss has only two possible outcomes:
    heads or tails and taking a test could have two
    possible outcomes: pass or fail.
    Assumptions of the Binomial distribution
    (It is also called the Bernoulli distribution)
◻   Assumptions:
      The random experiment is performed repeatedly with a fixed and
      finite number of trials. The number is denoted by ‘n’.
      There are two mutually exclusive possible outcomes on each trial,
      known as “success” and “failure”. Success is denoted by ‘p’
      and failure by ‘q’, and p + q = 1, i.e. q = 1 - p.
      The outcome of any given trial does not affect the outcomes of the
      subsequent trials. That means all trials are independent.
      The probability of success and failure (p & q) remains constant for all
      trials. If it does not remain constant then it is not a binomial distribution.
      Examples: tossing a coin, or drawing a red ball from a pool of
      colored balls where every time the ball taken out is replaced
      back into the pool.
      With these assumptions, let us see the formula.
Formula for the Binomial Distribution
    P(X = r) = nCr · p^r · q^(n-r)
where p is the probability of success and
      q = 1 - p is the probability of failure.
Binomial distribution, generally
Note the general pattern emerging: if you have only two possible outcomes
(call them 1/0 or yes/no or success/failure) in n independent trials, the total
number of successes X obtained in the trials is called a binomial random
variable, and the probability of exactly X “successes” is
            P(X) = nCX · p^X · (1-p)^(n-X)
where n = number of trials, X = number of successes out of n trials,
p = probability of success, and 1-p = probability of failure.
            Binomial Probability Distribution
■   A fixed number of observations (trials), n
    ■   e.g., 15 tosses of a coin; 20 patients; 1000 people surveyed
■   A binary outcome
    ■   e.g., head or tail in each toss of a coin; disease or no disease
    ■   Generally called “success” and “failure”
    ■   Probability of success is p, probability of failure is 1 – p
■   Constant probability for each observation
    ■   e.g., Probability of getting a tail is the same each time we toss the coin
◻   Consider a pen manufacturing company
◻   10% of the pens are defective
◻   (i)Find the probability that at least 2 pens are defective
    in a box of 12
◻   So n = 12,
◻   p = 10% = 10/100 = 1/10
◻   q = (1 - p) = 90/100 = 9/10
◻   X >= 2
◻   P(X >= 2) = 1 - [P(X < 2)]
◻             = 1 - [P(X=0) + P(X=1)] ≈ 0.341
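The same answer can be checked numerically (a sketch; scipy.stats is an assumption, and any binomial pmf/cdf routine works equally well):

    # Defective-pen problem: X ~ Binomial(n=12, p=0.1).
    from scipy.stats import binom

    n, p = 12, 0.1
    print(1 - (binom.pmf(0, n, p) + binom.pmf(1, n, p)))  # P(X >= 2) ~ 0.341
    print(1 - binom.cdf(1, n, p))                          # same via the CDF
    print(binom.pmf(2, n, p))                              # P(X = 2) ~ 0.230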
Binomial Distribution: Illustration with example
◻   Consider a pen manufacturing company
◻   10% of the pens are defective
◻   (i)Find the probability that exactly 2 pens are
    defective in a box of 12
◻   So n = 12,
◻   p = 10% = 10/100 = 1/10
◻   q = (1 - p) = 90/100 = 9/10
◻   X = 2
Binomial distribution: Another example
◻   If I toss a coin 20 times, what’s the
    probability of getting exactly 10
    heads?
       Binomial distribution: example
• If I toss a coin 20 times, what’s the probability of
  getting 2 or fewer heads?
The Binomial Distribution: another example
 ◻   Say 40% of the class is
     female.
 ◻   What is the probability
     that 6 of the first 10
     students walking in will
     be female?
        Continuous Probability Distributions
◻   When the random variable of interest can take any value in an interval,
    it is called continuous random variable.
        Every continuous random variable has an infinite, uncountable
        number of possible values (i.e., any value in an interval).
•  Examples: temperature on a given day, length, height, intensity of light
   falling on a given region.
◻   The length of time it takes a truck driver to go from New York City to Miami.
◻   The depth of drilling to find oil.
◻   The weight of a truck in a truck-weighing station.
◻   The amount of water in a 12-ounce bottle.
For each of these, if the variable is X, then x > 0 and x is less than some maximum
possible value, but X can take on any value within this range.
◻   A continuous random variable differs from a discrete random
    variable: discrete random variables can take on only a finite
    number of values or at most a countable infinity of values.
◻   A continuous random variable is described by Probability
    density function. This function is used to obtain the probability
    that the value of a continuous random variable is in the given
    interval.
Continuous Uniform Distribution
◻   For Uniform distribution, f(x) is constant over the
    possible value of x.
◻   Area looks like a rectangle.
◻   For the area in continuous distribution we need to
    do integration of the function.
◻   However in this case it is the area of rectangle.
◻   Example: the time taken to wash clothes in a washing machine
    (under standard conditions).
NORMAL DISTRIBUTION
◻   The most often used continuous probability
    distribution is the normal distribution; it is also known
    as Gaussian distribution.
◻   Its graph called the normal curve is the bell-shaped
    curve.
◻   Such a curve approximately describes many
    phenomena that occur in nature, industry and research.
      Physical   measurement     in    areas    such    as
      meteorological experiments, rainfall studies and
      measurement of manufacturing parts are often more
      than adequately explained with normal distribution.
 NORMAL DISTRIBUTION
 The normal (or Gaussian) distribution, is a very commonly used
 (occurring) function in the fields of probability theory, and has wide
 applications in the fields of:
- Pattern Recognition;
- Machine Learning;
- Artificial Neural Networks and Soft computing;
- Digital Signal (image, sound , video etc.) processing
- Vibrations, Graphics etc.
The probability distribution of the normal variable depends upon the two parameters 𝜇 and 𝜎
–   The parameter μ is called the mean or expectation of the
    distribution.
–   The parameter σ is the standard deviation; the variance is
    thus σ².
–   A few terms:
    •   Mode: the most frequently occurring value.
    •   Median: the middle value (if there are 9 data points, the 5th one is the median).
    •   Mean: the average of all the data points.
    •   SD (standard deviation): indicates how much the data deviate
        from the mean.
        –   A low SD indicates that the data points lie close to the mean.
        –   A high SD indicates that the data points are spread out and are not close by.
    •   The sample SD is given by  s = √( Σ (xᵢ − x̄)² / (n − 1) ),  where x̄ is the sample mean.
    •   The population SD, represented by σ, is  σ = √( Σ (xᵢ − μ)² / N )  (see the sketch below).
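A minimal sketch of the sample/population distinction with NumPy: ddof=1 divides by n − 1 (sample SD s), ddof=0 divides by N (population SD σ). The data values are illustrative:

    # Sketch: sample vs population standard deviation with NumPy.
    import numpy as np

    data = np.array([4, 5, 5, 6, 6, 6, 7, 7, 8])  # illustrative values
    print(np.std(data, ddof=1))  # sample SD s, divides by n - 1
    print(np.std(data, ddof=0))  # population SD sigma, divides by N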
• The density of the normal variable 𝑥 with mean 𝜇 and variance 𝜎² is

      f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²)),   −∞ < x < ∞

  where 𝜋 = 3.14159… and 𝑒 = 2.71828…, the Naperian constant.
The Normal distribution (mean μ, standard deviation σ):

      f(x) = (1 / (√(2π)σ)) · e^(−(x − μ)² / (2σ²))

[Figure: a plot of the normal distribution (bell-shaped curve) where each
band has a width of 1 standard deviation – see also the 68–95–99.7 rule.]
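A minimal sketch evaluating the density formula above directly, with a crude numerical check that the total area under the curve is (about) 1:

    # Sketch: the normal density from the formula above, plus a crude
    # Riemann-sum check that it integrates to approximately 1.
    import math

    def normal_pdf(x, mu, sigma):
        return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

    mu, sigma, dx = 0.0, 1.0, 0.001
    # sum the density over mu ± 6*sigma in steps of dx
    total = sum(normal_pdf(mu - 6*sigma + i*dx, mu, sigma) * dx for i in range(12001))
    print(total)  # ≈ 1.0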
Standard Normal Distribution: in the above equation, f(x) gives the
density at a particular value of x. To obtain the probability over a
range, the density must be integrated.
     For the standard normal distribution, z = (x − μ)/σ, so that Z
     has mean 0 and standard deviation 1.
• The area under the curve over a given range is then
  P(a <= X <= b) = Φ((b − μ)/σ) − Φ((a − μ)/σ),
  where Φ is the standard normal cumulative distribution function.
          Problem: Normal distribution
• Consider an electrical circuit in which the voltage is
    normally distributed with mean 120 and standard
    deviation of 3. What is the probability that the next
    reading will be between 119 and 121 volts?
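A worked sketch of this problem using the standard normal CDF, computed here via math.erf (one standard way to get Φ without extra libraries):

    # X ~ N(mu=120, sigma=3); P(119 <= X <= 121) = Phi(1/3) - Phi(-1/3),
    # where Phi is the standard normal CDF, computed via math.erf.
    import math

    def phi(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    mu, sigma = 120.0, 3.0
    p = phi((121 - mu) / sigma) - phi((119 - mu) / sigma)
    print(p)  # ≈ 0.2611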
Difference between PDF and PMF
◻   A probability mass function (PMF) applies to a discrete random
    variable and gives the probability P(X = x) of each individual value.
◻   A probability density function (PDF) applies to a continuous random
    variable; probabilities are obtained by integrating it over an interval.
        Joint Distributions and Densities
• The joint random variable (x, y) signifies that,
  simultaneously, the first feature has the value x and
  the second feature has the value y.
• If the random variables x and y are discrete, the joint
  distribution function of the joint random variable (x, y)
  is the probability P(x, y) that both x and y occur.
                   Joint distribution in continuous random variables
•   If x and y are continuous, a joint probability density function
    p(x, y) is used over the region R in which (x, y) lies.
•   It is given by:
        P((x, y) ∈ R) = ∬_R p(x, y) dx dy
•   where the integral is taken over the region R. This integral represents a
    volume in the (x, y, p)-space, as the sketch below approximates.
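A minimal sketch approximating this double integral on a grid. The density p(x, y) = x + y on [0, 1] × [0, 1] is an illustrative choice (it integrates to 1), not from the slides:

    # Sketch: approximate P((x, y) in R) = ∬_R p(x, y) dx dy on a grid.
    import numpy as np

    def joint_prob(p, x_lo, x_hi, y_lo, y_hi, n=1000):
        xs = np.linspace(x_lo, x_hi, n)
        ys = np.linspace(y_lo, y_hi, n)
        X, Y = np.meshgrid(xs, ys)
        dx = (x_hi - x_lo) / (n - 1)
        dy = (y_hi - y_lo) / (n - 1)
        return np.sum(p(X, Y)) * dx * dy  # volume under the density over R

    p = lambda x, y: x + y               # illustrative density on [0,1]^2
    print(joint_prob(p, 0.0, 0.5, 0.0, 0.5))  # P(x<=0.5, y<=0.5) ≈ 0.125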
Probability distributions can be used to describe the
population, just as we described samples.
–    Shape: symmetric, skewed, mound-shaped…
–    Outliers: unusual or unlikely measurements.
–    Center and spread: mean and standard deviation. A population mean is
     called μ and a population standard deviation is called σ.
    Let x be a discrete random variable with probability distribution p(x). Then
    the mean, variance and standard deviation of x are given by
        μ = Σ x·p(x),    σ² = Σ (x − μ)²·p(x),    σ = √σ²
            Variance, continuous
Discrete case:     σ² = Σ (x − μ)²·p(x)
Continuous case:   σ² = ∫ (x − μ)²·f(x) dx
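A minimal sketch of the discrete case, using the pmf p(1) = 3/4, p(2) = 1/4 that also appears in the worked examples later in this unit:

    # Sketch: mean, variance and SD of a discrete random variable from
    # its pmf, following the formulas above.
    xs = [1, 2]          # support
    ps = [0.75, 0.25]    # pmf values (3/4 and 1/4)
    mean = sum(x * p for x, p in zip(xs, ps))
    var = sum((x - mean)**2 * p for x, p in zip(xs, ps))
    sd = var ** 0.5
    print(mean, var, sd)  # 1.25, 0.1875, ≈ 0.433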
    Moments of Random Variables
 Moments are very useful in statistics because they tell us much about our data.
• In mathematics, the moments of a function are quantitative measures related to the
  shape of the function's graph.
• The “moments” of a random variable (or of its distribution) are expected values of
  powers or related functions of the random variable.
•   If the function represents mass, then the first moment is the center of the mass, and
    the second moment is the rotational inertia. The mathematical concept is closely
    related to the concept of moment in physics.
•   If the function is a probability distribution, then there are four commonly used
    moments in statistics
         The first moment is the expected value - measure of center of the data
         The second central moment is the variance - spread of our data about the mean
         The third standardized moment is the skewness - the shape of the distribution
         The fourth standardized moment is the kurtosis - measures the peakedness or
              flatness of the distribution.
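A minimal sketch computing all four of these moments for a sample with NumPy (scipy.stats.skew/kurtosis would also work; this version keeps everything explicit). The data values are illustrative:

    # Sketch: the four commonly used moments of a sample.
    import numpy as np

    data = np.array([5, 5, 5, 6, 6, 7, 8, 9, 10], dtype=float)
    mean = data.mean()                           # 1st moment: center
    var = ((data - mean) ** 2).mean()            # 2nd central moment: spread
    sd = var ** 0.5
    skew = ((data - mean) ** 3).mean() / sd**3   # 3rd standardized: asymmetry
    kurt = ((data - mean) ** 4).mean() / sd**4   # 4th standardized: peakedness
    print(mean, var, skew, kurt)                 # skew > 0: positively skewed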
Computing Moments for a population: the kth raw moment is E(X^k) and
the kth central moment is E[(X − μ)^k]; both are developed below.
Moment 3: To know the Skewness
[Figure: skewed distributions. In positive skewness, mean > median and
median > mode; the ordering is reversed in the case of negative skewness.]
Moment 4 : To know the Kurtosis
Normal Distribution
◻   Consider an example of x values:
◻   4, 5, 5, 6, 6, 6, 7, 7, 8
◻   Mode, median and mean will all be equal:
◻   Mode is 6
◻   Median is 6
◻   Mean is also 6
Positive Skew
◻   Consider an example of x values:
◻   5, 5, 5, 6, 6, 7, 8, 9, 10
◻   (This is an example of a positively skewed distribution.)
◻   Mode is 5
◻   Median is 6
◻   Mean is ≈ 6.8
◻   Since mean > median > mode, the data are positively skewed.
[Figure: example curves of +ve skew and -ve skew distributions.]
The “moments” of a random variable (or of its distribution)
are expected values of powers or related functions of the
random variable.
In particular, the first moment is the mean, µX = E(X).
The kth moment of X:

    μ_k = E(X^k) = Σ_x x^k·p(x)                 if X is discrete
    μ_k = E(X^k) = ∫_{−∞}^{∞} x^k·f(x) dx       if X is continuous

The mean is a measure of the “center” or “location” of a distribution.
A central moment is a moment of a probability distribution of a
random variable about the random variable's mean (expected value).
The kth central moment of X is

    E[(X − μ)^k] = Σ_x (x − μ)^k·p(x)                 if X is discrete
    E[(X − μ)^k] = ∫_{−∞}^{∞} (x − μ)^k·f(x) dx       if X is continuous
       Expected Values of Discrete Random
                   Variables
•   The variance of a discrete random variable x is
        σ² = E[(x − μ)²] = Σ (x − μ)²·p(x)
•   The standard deviation of a discrete random variable x is
        σ = √σ²
Let X be a discrete random variable having support Rx = {1, 2}
and pmf p(1) = 3/4, p(2) = 1/4 (the same example used below).
                          Using this, compute the mean (first-order moment).
The first-order moment is the mean.
Solution:  E(X) = 1·(3/4) + 2·(1/4) = 3/4 + 2/4 = 5/4
•   Example: Let X be a discrete random variable having support Rx = {1, 2, 3}
    and a given pmf.
•   The third moment is then computed as E(X³) = Σ x³·p(x), summing over the
    support.
Example: computation of the 3rd-order moment
◻   The third central moment of X can be computed as follows:
◻   Here X takes the values 1 and 2 with probabilities 3/4 and 1/4
    respectively, and the mean is 5/4. Then
        E[(X − 5/4)³] = (1 − 5/4)³·(3/4) + (2 − 5/4)³·(1/4)
                      = (−1/4)³·(3/4) + (3/4)³·(1/4)
                      = −3/256 + 27/256 = 24/256 = 3/32
Estimation of Parameters from Samples
◻   There are 3 kinds of estimates for these
    parameters:
      Method of moments estimates
      Maximum likelihood estimates
      Unbiased estimates.
Estimation of Parameters from samples: Method of moments
To estimate parameters using the method of moments, n independent samples or
patterns x1, x2, x3, …, xn are collected from the random variable x, which may
be continuous or discrete.
Randomly choose one of these samples from the data set and
let its value be a new discrete random variable x′, called the empirical random
variable, which takes on the values x1, x2, x3, …, xn,
each with probability 1/n.
Compute the sample mean and sample variance using the formulas
    x̄ = (1/n) Σ xᵢ,        s² = (1/n) Σ (xᵢ − x̄)²
The method of moments can also be used to compute covariance:
covariance is a measure of the relationship between two random variables.
The metric evaluates how much – to what extent – the variables change together.
In other words, it is essentially a measure of the variance between two variables:
    cov(x, y) = (1/n) Σ (xᵢ − x̄)(yᵢ − ȳ)
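A minimal sketch of these method-of-moments estimates: each sample is treated as equally likely (probability 1/n), so the empirical moments are plain averages. The sample values are illustrative:

    # Sketch: method-of-moments estimates from n samples.
    import numpy as np

    x = np.array([2.1, 2.9, 3.4, 4.0, 4.6])  # illustrative samples
    y = np.array([1.0, 1.8, 2.2, 3.1, 3.9])

    mean_x = x.mean()                                 # sample mean
    var_x = ((x - mean_x) ** 2).mean()                # sample variance (1/n)
    cov_xy = ((x - mean_x) * (y - y.mean())).mean()   # sample covariance (1/n)
    print(mean_x, var_x, cov_xy)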
 Maximum Likelihood Estimates:
• Maximum likelihood estimation is a method that determines
  values for the parameters of a model. The parameter values are
  found such that they maximise the likelihood that the process
  described by the model produced the data that were actually
  observed.
• To compute a maximum likelihood estimate, choose the parameter (or
  set of parameters) that maximises the joint distribution function
  or multivariate density function for the entire data set when it is
  evaluated at the sample points x1, x2, x3, …, xn.
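A minimal sketch for one concrete model choice: for a normal model, the parameters maximising the likelihood of the observed samples have the well-known closed forms below. The model choice and data values are illustrative, not from the slides:

    # Sketch: maximum likelihood estimation for a normal model.
    import numpy as np

    x = np.array([119.2, 120.5, 118.9, 121.3, 120.1])  # observed samples
    mu_hat = x.mean()                        # MLE of the mean
    sigma2_hat = ((x - mu_hat) ** 2).mean()  # MLE of the variance (1/n, not 1/(n-1))
    print(mu_hat, sigma2_hat)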
END OF UNIT 1