STATISTICS (Week 1-3)

Basic Statistical Concepts

What is Statistics?
   - It is the science of collecting, organizing,
     presenting, analyzing, and interpreting
     data to assist in making more effective
     decisions.

Two Branches of Statistics
1. Descriptive Statistics - using the data
gathered on a group to describe or reach
conclusions about that same group
(e.g. class average, range of scores in an exam)
2. Inferential Statistics - a researcher gathers
data from a sample and uses the statistics
generated to reach conclusions about the
population from which the sample was drawn.

Two Types of Variable

Variable - a characteristic of interest about an
object under investigation

Independent
   - Manipulated
   - Causes the change
Dependent
   - the variable that the investigator
     measures to determine the effect of the
     independent variable

DATA
   - the set of values collected from the
     variables

Quantitative - deals with numbers and things you
can measure.
   1. Discrete
      - Countable
      - Data are obtained by counting
        (Example: the number of children in a
        family)
   2. Continuous
      - Can assume an infinite number of
        values in an interval between any
        two specific values
        (e.g. temperature)

Qualitative - deals with characteristics and
descriptors that can't be easily measured, but
can be observed subjectively

Four Levels of Data Measurement
1. Nominal - lowest level of data measurement
           - for identification and classification
2. Ordinal - used to reflect some rank or order of
individuals or objects
3. Interval - zero is arbitrary
(e.g. temperature in degrees Celsius)
4. Ratio - highest level of data measurement
         - zero is absolute (e.g. height)

Statistic vs Parameter
   - Parameter measures the characteristic
     of a population.
   - Statistic measures the characteristic of a
     sample.

Sampling - the process of selecting certain
members or a subset of the population to make
statistical inferences from them and to estimate
characteristics of the whole population.

Reasons for sampling
   1. The sample can save money
   2. The sample can save time
   3. For a given resource, the sample can
      broaden the scope of the study.
   4. Since the research process is sometimes
      destructive, sampling can save the
      product.
   5. If getting the population is impossible,
      sampling is the only option.

1. Probability Sampling Techniques - every
member of the population has a known, nonzero
chance of being selected (an equal chance under
simple random sampling). (S3C)

a. Simple Random Sampling
- most basic random technique
- the basis for other random sampling
techniques
- every item of a population has the same chance
of being selected
- every sample of a fixed size has the same
chance of selection as every other sample of
that size.

b. Systematic Random Sampling

c. Cluster Sampling

d. Stratified Sampling
- more efficient than simple random sampling
- you are confident that there is a representation
of items across the entire population

2. Non-probability Sampling Techniques - the
researcher selects samples based on
subjective judgement. (ConJuSQuo)

a. Convenience Sampling
- convenient for the researcher since the
samples are easy to obtain.
- inexpensive

b. Judgmental or Purposive Sampling
- the opinions of the preselected experts are
essential
- you cannot generalize the results of their
opinion

c. Snowball Sampling

d. Quota Sampling

Survey Error
   1. Coverage error - occurs if there
      are groups of items excluded from
      the sampling frame, and they have
      no chance to be selected.
      Coverage error results in a
      selection bias.
   2. Nonresponse error -
      Nonresponse to sample surveys is
      one of the most serious problems
      that occur in practical applications
      of sampling methodology. It arises
      from the failure to collect data on
      all items in the sample and results
      in nonresponse bias. Follow-up is
      required for nonresponses after a
      specific period because not
      everyone will respond to your
      surveys.
   3. Sampling error - happens when
      there are variations or chance
      differences from sample to
      sample.
   4. Measurement error - three
      sources of measurement error are
      ambiguous wording of questions,
      the Hawthorne effect, and
      respondent error.

CENTRAL TENDENCIES

Median
   - the measure of the location or centrality
     of the observations
   - rank your data from smallest to largest
     and look for the middle value
   - not affected by outliers

Mean
   - the average of the observations
   - the most common measure of central
     tendency
   - X-bar represents it
   - computed by summing all the values of the
     data or observations and dividing by the
     number of observations
   a) Sample Mean - the sample mean
      is the sum of the values in a
      sample divided by the number of
      data points in the sample, n.
      X-bar = (sum of all X) / n
   b) Population Mean - the sum of the values
      in the population divided by
      the population size, N.
      μ = (sum of all X) / N

Mode
   - the value that occurs most frequently
   - Like the median and unlike the mean,
     extreme values do not affect the mode
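
As a minimal sketch of the three measures above, the following uses Python's standard statistics module on a made-up list of exam scores. The data and variable names are assumptions for illustration only. It also shows that one extreme value moves the mean but leaves the median and the mode unchanged.

```python
import statistics

# Hypothetical exam scores (illustrative data, not from the notes).
scores = [70, 75, 75, 80, 85]

mean_score = statistics.mean(scores)      # sum of values / number of values
median_score = statistics.median(scores)  # middle value after sorting
mode_score = statistics.mode(scores)      # most frequent value

# Replace the largest score with an extreme value (an outlier).
with_outlier = [70, 75, 75, 80, 200]

mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)
mode_outlier = statistics.mode(with_outlier)

print("without outlier:", mean_score, median_score, mode_score)
print("with outlier:   ", mean_outlier, median_outlier, mode_outlier)
```

Comparing the two printed lines shows why the median is often preferred when a data set contains outliers.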
Range
   - the difference between the largest
     observation and the smallest observation
   - sets the boundaries

Quartile
   - divides the number of data points into
     four equal parts, or quarters
   - Quartile 1 is the middle number
     between the smallest number and the
     median of the data set.
   - Quartile 2 is the median
   - Quartile 3 is the middle number between
     the median and the largest number

Interquartile range
   - more resistant to outliers than the range
   - it contains information only about the
     difference between the upper and lower
     quartiles
   - can be computed by using the formula
     Q3 - Q1.

Variance
   - the average squared deviation or
     difference of the data points from their
     mean
   a) Sample Variance - the sample
      variance is the sum of the squared
      differences around the sample mean
      divided by the sample size minus
      1.
      S^2 = sum of (X - X-bar)^2 / (n - 1)
   b) Population Variance - the
      population variance is the sum of
      the squared differences around
      the population mean divided by
      the population size, N
      σ^2 = sum of (X - μ)^2 / N

Difference:
1. Sample variance is computed from your sample,
while population variance is computed from the
whole population.
2. They also differ in the denominator. For the
sample variance and the sample standard
deviation, we use n-1 in the denominator
instead of n (the sample mean still uses n).
The reason is that using n in the denominator
of the sample formulas results in a statistic
that tends to underestimate the population
variance.

Standard Deviation
   - measures the spread of your data set
     from the mean
   - it is the square root of the variance
   - If the data points are far from the mean,
     the data are more spread out, and the
     standard deviation is high.
   - If the SD is lower, it means the data points
     are close to their average value.
   a) Sample Standard Deviation - measures
      the spread or dispersion of the sample
      data set. It is represented by (S)
   b) Population Standard Deviation -
      measures the spread or dispersion of the
      population data set. It is represented by
      (σ)

The Coefficient of Variation (CV)
   - measures the variation as a percentage
     rather than in terms of the units of the data.
   - measures the spread of the data relative
     to the mean by computing the relative
     variability
   - the ratio of the standard deviation to the
     mean (average), expressed as a percentage:
     CV = (S / X-bar) x 100%

Shapes
   - measures of shape are tools that can
     describe the shape of a distribution of
     data.
   a) Skewness can be seen in the tail
      of your curve. It could be
      right-skewed or left-skewed
   b) Kurtosis represents the peak of a
      distribution
        i. Leptokurtic distributions - if
           the peak of the distribution is
           high and thin.
       ii. Platykurtic distributions - if
           the peak of the distribution is
           flat and spread out.
      iii. Mesokurtic distributions -
           these have the shape of the
           normal distribution.

Covariance
   - measures the strength of the
     linear relationship between two
     numerical variables (X and Y).

Coefficient of Correlation
   - indicates the relative strength of
     the linear relationship between
     two numerical variables. This
     correlation means that as one
     variable changes in value, the
     other variable also changes, either
     increasing or decreasing.
   1. The sample coefficient of correlation
      is represented by r. When you use
      sample data, the coefficient of
      correlation is unlikely to be
      exactly +1, 0, or -1.

Strength: depends on how far the correlation
coefficient is from zero (its absolute value).
   - A perfectly linear relationship,
     at either extreme, +1 or -1,
     has the greatest strength. In
     actual practice, though, you rarely see
     this type of perfect relationship in
     a data set.
   - A coefficient of zero indicates no
     linear relationship.
   - Relationships that are not perfectly
     linear are those whose correlation
     coefficient is between 0 and +1 or
     between 0 and -1.
     There is a relationship, and the
     strength of the relationship
     depends on how closely the data
     points fit the line.
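
To make the covariance and the coefficient of correlation concrete, here is a small Python sketch that computes both from the usual sample formulas: the covariance divides the sum of the products of deviations by n - 1, and r divides the covariance by the product of the two sample standard deviations. The function names and the data are assumptions for illustration only.

```python
import math

def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """r = cov(X, Y) / (S_X * S_Y); always between -1 and +1."""
    s_x = math.sqrt(sample_covariance(x, x))  # cov(X, X) is the variance of X
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# Hypothetical data: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 77]

cov_xy = sample_covariance(hours, score)
r = sample_correlation(hours, score)
print("covariance:", round(cov_xy, 2), "r:", round(r, 3))

# A perfectly linear relationship gives r = +1 (up to rounding error).
r_perfect = sample_correlation(hours, [2 * h + 1 for h in hours])
print("perfectly linear r:", round(r_perfect, 6))
```

Unlike the covariance, whose size depends on the units of X and Y, r is unit-free, which is why it is the one used to judge strength.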
Direction: You can determine the
direction of your graph based on the sign
of the correlation coefficient.
    - With a positive coefficient, as the
        value of one variable increases,
        the value of the other variable also
        increases, and there is an upward
        slope of the graph.
    - With a negative coefficient, as the
        value of one variable increases,
        the value of the other variable
        decreases, and there is a
        downward slope of the graph.
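
The link between the sign of the coefficient and the slope of the graph can be sketched as follows. The helper names (slope_and_r, direction) and the price, supply, and demand figures are hypothetical, chosen only to show one upward and one downward relationship. The slope here is the ordinary least-squares slope of the line fitted through the points.

```python
def slope_and_r(x, y):
    """Return the least-squares slope and the correlation coefficient r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sxx, sxy / (sxx * syy) ** 0.5

def direction(r):
    """Describe the direction implied by the sign of r."""
    if r > 0:
        return "positive (upward slope)"
    if r < 0:
        return "negative (downward slope)"
    return "no linear relationship"

# Hypothetical data sets, chosen only to illustrate the two directions.
price = [10, 12, 14, 16, 18]
supply = [20, 25, 27, 33, 35]   # tends to rise as price rises
demand = [50, 44, 41, 33, 30]   # tends to fall as price rises

slope_up, r_up = slope_and_r(price, supply)
slope_down, r_down = slope_and_r(price, demand)

print(direction(r_up), "slope:", round(slope_up, 2))
print(direction(r_down), "slope:", round(slope_down, 2))
```

Both the slope and r share the same numerator (the sum of the products of deviations), so they always have the same sign.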