STS Notes
BSMA3103 (Lecture/Lab)
STATISTICAL ANALYSIS
kath

NATURE OF STATISTICS

WHAT IS STATISTICS?
Statistics is the science of collecting, organizing, summarizing, and analyzing information to draw conclusions or answer questions. In addition, statistics is about providing a measure of confidence in any conclusions.

FIELDS OF STATISTICS
A. Mathematical statistics - the study and development of statistical theory and methods in the abstract.
B. Applied statistics - the application of statistical methods to solve real problems involving randomly generated data, and the development of new statistical methodology motivated by real problems.

LIMITATIONS OF STATISTICS
1. It is not suited to the study of qualitative phenomena.
2. Statistics does not study individuals.
3. Statistical laws are not exact.
4. Statistical tables may be misused.
5. Statistics is only one of the methods of studying a problem.

BASIC TERMINOLOGY IN STATISTICS
The universe is the set of all entities under study.
A population is the total or entire group of individuals or observations from which information is desired by a researcher. Apart from persons, a population may consist of mosquitoes, villages, institutions, etc.
A sample is a subset of the population. It is a smaller, manageable subset of a larger population that is selected to represent the whole group.
An individual is a person or object that is a member of the population being studied.
A statistic is a numerical summary of a sample.
A parameter is a numerical summary of a population.

PARAMETER
- The average age of all jeepney drivers in the Philippines is 50 years.
- The average weight of all PHS students is 60 kg.
- The proportion of Filipino teenagers who smoke is 30%.

STATISTIC
- The average age of 3,000 selected jeepney drivers across the country is 48 years.
- The average weight of 150 randomly selected students is 57 kg.
- The proportion of Filipino teenagers who smoke is 33%, based on the responses of 500 teenagers.

TYPES OF STATISTICS
1. Descriptive statistics consists of organizing and summarizing data. Descriptive statistics describes data through numerical summaries, tables, and graphs.
2. Inferential statistics uses methods that take a result from a sample, extend it to the population, and measure the reliability of the result.

DATA
- In statistics, data refers to a collection of facts, figures, or information gathered for analysis and interpretation.
- Data are measurements or observations that are gathered for an event under study.

SOURCES OF DATA
Primary sources - provide a first-hand account of an event or time period and are considered to be authoritative. They represent original thinking, report on discoveries or events, or share new information.
Secondary sources - offer an analysis, interpretation, or restatement of primary sources and are considered to be persuasive. They often involve generalization, synthesis, interpretation, commentary, or evaluation in an attempt to convince the reader of the creator's argument. They often attempt to describe or explain primary sources.
Primary data - are data documented by the primary source. The data collectors documented the data themselves.
Secondary data - are data documented by a secondary source. The data collectors had the data documented by other sources.

DATA COLLECTION
Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to
answer stated research questions, test hypotheses, and evaluate outcomes.

STEPS IN DATA GATHERING
1. Set the objectives for collecting data.
2. Determine the data needed based on the set objectives.
3. Determine the method to be used in data gathering and define the comprehensive data collection points.
4. Design the data gathering forms to be used.
5. Collect the data.

FIVE METHODS TO COLLECT PRIMARY DATA
1. Direct personal interview
2. Indirect/Questionnaire Method
3. Focus Group
4. Experiment
5. Observation

OPEN-ENDED & CLOSED-ENDED
Open-Ended
- More detailed answers
- Could reveal additional insights
- Difficult to encode, tabulate, and analyze
- Low response rate
- Respondent has to be articulate
- Respondent could feel threatened
- Responses could have different levels of detail
Closed-Ended
- Easy to encode, tabulate, and analyze
- Easy to understand
- Enables inter-study comparisons
- Saves time and money
- High response rate
- Could frustrate respondents
- Potentially biased response sets
- Difficult or impossible to detect if respondent truly understood the questions

VARIABLES
Variables are the characteristics of the individuals within the population.

VARIABLES CAN BE CLASSIFIED INTO TWO GROUPS
1. A qualitative (categorical) variable is a variable that yields a categorical response. It is a word or a code that represents a class or category.
2. A quantitative (numeric) variable takes numerical values representing an amount or quantity.

QUANTITATIVE VARIABLES MAY BE FURTHER CLASSIFIED INTO:
1. A discrete variable is a quantitative variable that has either a finite number of possible values or a countable number of possible values. If you count to get the value of a quantitative variable, it is discrete.
2. A continuous variable is a quantitative variable that has an infinite number of possible values that are not countable. If you measure to get the value of a quantitative variable, it is continuous.

LEVEL OF MEASUREMENT
[Figure: levels of measurement. Quantitative (numerical) data: ratio, interval. Qualitative (categorical) data: ordinal, nominal.]
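As a rough illustration of the classification under VARIABLES, the sketch below treats one qualitative, one discrete, and one continuous variable differently. The records and field names (sex, siblings, height_cm) are hypothetical and invented only for this example.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey records, invented for illustration only.
records = [
    {"sex": "F", "siblings": 2, "height_cm": 158.4},
    {"sex": "M", "siblings": 0, "height_cm": 171.0},
    {"sex": "F", "siblings": 3, "height_cm": 162.5},
    {"sex": "M", "siblings": 1, "height_cm": 168.1},
]

# Qualitative (categorical): values are labels, so we count categories.
sex_counts = Counter(r["sex"] for r in records)

# Quantitative, discrete: obtained by counting, so raw values are whole numbers.
mean_siblings = mean(r["siblings"] for r in records)

# Quantitative, continuous: obtained by measuring, so any value in a range can occur.
mean_height = mean(r["height_cm"] for r in records)

print(sex_counts)
print(mean_siblings)
print(mean_height)
```

Averaging the category labels themselves would be meaningless, which is why the qualitative variable is summarized by counts while the quantitative ones are summarized by a mean.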
... the quantity. Arithmetic operations such as addition and subtraction can be performed on values of the variable.
Examples:
- temperature on a Fahrenheit/Celsius thermometer
- trait anxiety (e.g., high anxious vs. low anxious)
- IQ (e.g., high IQ vs. average IQ vs. low IQ)
Population Standard Deviation (σ) - the square root of the variance.
Population Proportion (P) - proportion of elements in the population with a certain characteristic.
Population Size (N) - total number of elements in the population.

Sample refers to a subset of the population. Values calculated from a sample are called statistics.
Sample Mean (x̄) - the average of all values in the sample.
Sample Variance (s²) - variance computed from the sample.
Sample Standard Deviation (s) - square root of the sample variance.
Sample Proportion (p̂) - proportion of sample elements with a certain characteristic.
Sample Size (n) - total number of elements in the sample.

Concept              Population (Parameter)   Sample (Statistic)
Mean (average)       μ (mu)                   x̄ (x-bar)
Variance             σ² (sigma squared)       s²
Standard Deviation   σ (sigma)                s
Proportion           P                        p̂ (p-hat)
Size                 N                        n

Key Difference:
Greek letters (μ, σ, σ²) and capital letters (P, N) are commonly used for population parameters.
Roman letters (x̄, s, s², p̂, n) are commonly used for sample statistics.

SAMPLE SIZE
In statistics, sample size refers to the number of individual participants, observations, or data points included in a study or research project. The sample is a subset of a larger population that is selected to be representative of that population for analysis and drawing conclusions. The sample size is crucial because it impacts the reliability and generalizability of research findings.

The sample size is typically denoted by "n" and it is always a positive integer. No exact sample size can be prescribed here, since it varies across research settings. However, all else being equal, a larger sample leads to increased precision in estimates of various properties of the population.

Choosing a sample size depends on non-statistical and statistical considerations.
• Non-statistical considerations - These may include availability of resources, manpower, budget, ethics, and sampling frame.
• Statistical considerations - These include the desired precision of the estimate.

THREE CRITERIA NEED TO BE SPECIFIED TO DETERMINE THE APPROPRIATE SAMPLE SIZE:
A. Level of Precision
Also called sampling error, the level of precision is the range in which the true value of the population is estimated to be.
B. Confidence Level
It is a statistical measure of the number of times out of 100 that results can be expected to be within a specified range. For example, a confidence level of 90% means that results of an action will probably meet expectations 90% of the time.

Desired Confidence Level   Z-Score
80%                        1.28
85%                        1.44
90%                        1.65
95%                        1.96
99%                        2.58

C. Degree of Variability
Depending upon the target population and attributes under consideration, the degree of variability varies considerably. The more heterogeneous a population is, the larger the sample size required to get an optimum level of precision.

METHODS IN DETERMINING THE SAMPLE SIZE
A. Estimating the Mean or Average
The sample size required to estimate the population mean with a given level of confidence and a specified margin of error e is given by:

$$n = \left(\frac{Z\sigma}{e}\right)^2$$

where:
Z is the z-score corresponding to the level of confidence
σ is the population standard deviation
e is the level of precision

Example:
A soft drink machine is regulated so that the amount of drink dispensed is approximately normally distributed with a standard deviation equal to 0.5 ounce. Determine the sample size needed if we wish to be 95% confident that the sample mean will be within 0.03 ounce of the true mean.

B. Estimating the Proportion

$$n = \frac{Z^2 P(1 - P)}{e^2}$$

where:
Z is the z-score corresponding to the level of confidence
e is the level of precision
P is the population proportion

There is a dilemma in this formula: it depends on P, which we know only after we have taken the sample.

C. Slovin's Formula
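The sample-size methods in this section can be sketched in Python using only the standard library. The function names are hypothetical. The mean and proportion functions assume the standard forms n = (Zσ/e)² and n = Z²P(1−P)/e². Slovin's formula is not written out in these notes, so the sketch assumes its commonly cited form n = N / (1 + Ne²). Taking P = 0.5 is one common way around the dilemma above, because it gives the largest possible value of P(1−P).

```python
import math
from statistics import NormalDist


def z_for_confidence(level):
    """Two-sided z-score for a confidence level such as 0.95."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)


def sample_size_mean(z, sigma, e):
    """n = (Z * sigma / e)^2, rounded up to a whole observation."""
    return math.ceil((z * sigma / e) ** 2)


def sample_size_proportion(z, p, e):
    """n = Z^2 * P * (1 - P) / e^2, rounded up."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)


def slovin(population_size, e):
    """Assumed form of Slovin's formula: n = N / (1 + N * e^2), rounded up."""
    return math.ceil(population_size / (1 + population_size * e ** 2))


# Soft drink machine example: sigma = 0.5 oz, e = 0.03 oz, 95% confidence.
n_machine = sample_size_mean(z=1.96, sigma=0.5, e=0.03)   # 1068

# Hypothetical inputs: conservative P = 0.5, margin of error 5%.
n_prop = sample_size_proportion(z=1.96, p=0.5, e=0.05)    # 385

# Hypothetical inputs: population of 1,000, margin of error 5%.
n_slovin = slovin(population_size=1000, e=0.05)           # 286

print(n_machine, n_prop, n_slovin)
print(round(z_for_confidence(0.95), 2))                   # 1.96
```

Results are rounded up because a fraction of an observation cannot be sampled. The helper z_for_confidence reproduces the z-score table to two decimals, except that the exact 90% value (about 1.645) is often rounded to 1.65.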
SAMPLING TECHNIQUES
Sampling is the process of selecting a subset (sample) from a larger group (population) in order to study and draw conclusions about the entire population. Since studying the whole population is often time-consuming, costly, or impractical, researchers use sampling to gather information more efficiently. The goal is to make sure the sample is as representative as possible of the population.
Sampling techniques are the methods or strategies used to select samples from a population. They determine how participants or elements are chosen.
Good sampling techniques reduce bias and increase the accuracy of results.

TYPES OF SAMPLING TECHNIQUES
1. Probability Sampling is a sampling technique in which every member of the population has a known, non-zero chance of being selected. It relies on randomization, making it more representative and less prone to bias.
2. Non-Probability Sampling is a sampling technique where not all members of the population have a chance of being included. Selection is based on convenience, judgment, or voluntary participation, making it less representative but often easier and cheaper to conduct.

Aspect                Probability Sampling                     Non-Probability Sampling
Chance of Selection   Known, non-zero chance for all members   Unequal; not all members have a chance
Bias                  Less prone to bias                       More prone to bias
Representativeness    More representative of the population    Less representative
Cost & Time           More costly and time-consuming           Cheaper and faster

1. PROBABILITY SAMPLING

A. Simple Random Sampling
... for the sample. The selection process is random, meaning it's not influenced by any specific characteristics or biases of the researcher or the population itself. By ensuring random selection, a simple random sample aims to create a sample that accurately reflects the characteristics of the larger population.

B. Systematic Sampling (SS)
Systematic sampling is a probability sampling method where the sample is selected from a larger population by choosing every kth element from a list or sequence. It's a straightforward technique often used when a population is ordered in a predictable way.

$$k = \frac{N}{n}$$

where,
k = sampling interval
N = population size
n = sample size

C. Stratified Sampling (STS)
Stratified sampling is a probability sampling method in which a population is divided into subgroups (strata) based on shared characteristics, and then samples are randomly selected from each stratum. This ensures representation from all subgroups in the final sample.
Through stratification, the population is divided into smaller, non-overlapping groups called strata. These strata are formed based on specific characteristics relevant to the research, such as age, gender, income, or education level.
Once the strata are defined, a random sampling technique (like simple random sampling) is used within each stratum to select individuals. This method ensures that each stratum is adequately represented in the overall sample.
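The three probability sampling methods described above can be sketched with Python's standard library. This is a minimal sketch, not a prescribed procedure. The function names are hypothetical, the systematic sampler assumes the interval is rounded down and the start is chosen at random within the first interval, and the stratified sampler assumes proportional allocation, which these notes do not specify.

```python
import random
from collections import defaultdict


def simple_random_sample(population, n, seed=None):
    """Draw n distinct units, each subset of size n being equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(population), n)


def systematic_sample(population, n, seed=None):
    """Take every kth unit after a random start, with k = N // n."""
    items = list(population)
    k = len(items) // n  # sampling interval k = N / n, rounded down
    if k < 1:
        raise ValueError("sample size cannot exceed population size")
    start = random.Random(seed).randrange(k)  # random start in the first interval
    return [items[start + i * k] for i in range(n)]


def stratified_sample(population, stratum_of, n, seed=None):
    """Randomly sample within each stratum (assumed proportional allocation)."""
    rng = random.Random(seed)
    items = list(population)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        # Share of the sample proportional to the stratum's share of the population.
        size = max(1, round(n * len(members) / len(items)))
        sample.extend(rng.sample(members, min(size, len(members))))
    return sample


# Hypothetical population of 100 numbered units, 60 "low" and 40 "high".
population = list(range(1, 101))
srs = simple_random_sample(population, 10, seed=1)
sys_sample = systematic_sample(population, 10, seed=1)
strat = stratified_sample(population, lambda x: "low" if x <= 60 else "high", 10, seed=1)
print(srs, sys_sample, strat, sep="\n")
```

With proportional allocation the per-stratum sizes are rounded, so the total can differ slightly from n when the strata do not divide evenly.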
... it's more practical to sample entire groups rather than individuals.

2. NON-PROBABILITY SAMPLING

A. Convenience or Haphazard Sampling
Units are selected in an arbitrary manner with little or no planning involved. Haphazard sampling assumes that the population units are all alike, so any unit may be chosen for the sample. An example of haphazard sampling is the vox pop survey where the interviewer selects any person who happens to walk by. Unfortunately, unless the population units are truly similar, selection is subject to the biases of the interviewer and whoever happened to walk by at the time of sampling.

... However, some individuals or subgroups may have no chance of being sampled. In order to be able to generalize the conclusion to the whole population, some assumptions, which are usually not met, are required.

E. Volunteer Sampling
The respondents are only volunteers in this method. Generally, volunteers must be screened so as to get a set of characteristics suitable for the purposes of the survey (e.g. individuals with a particular disease). This method can be subject to large selection biases, but is sometimes necessary. For example, for ethical reasons, volunteers with particular medical conditions may have to be solicited for some medical experiments.
- Error that results from taking one sample instead of examining the whole population
- Error that results from using sampling to estimate information regarding a population

DATA VISUALIZATION
The presentation of information or data in a visual format is known as data visualization. The purpose of data visualization is to transmit information or data to readers in a way that is understandable and useful to them. Charts, infographics, diagrams, and maps are the most common ways that data can be represented graphically.
Data visualization is a form of communication that portrays dense and complex information in graphical form. The resulting visuals are designed to make it easy to compare data and use it to tell a story, both of which can help users in decision making. Data visualization can express data of varying types and sizes: from a few data points to large multivariate datasets.
The fields of art and data science come together in the discipline of data visualization. Although a data visualization has the potential to be artistic and aesthetically beautiful, it must not lose sight of the fact that its primary purpose is to effectively communicate the facts it depicts.
Drawing conclusions from collected, processed, and modeled data requires that the data be visualized, so data visualization is one of the steps in the data science workflow. The discipline known as data presentation architecture (DPA) seeks to identify, locate, process, format, and convey data in the most effective manner possible. Data visualization is one component of the larger DPA field.

Uses of data visualization
1. Present information in a way that is both interesting and simple to understand.
Large amounts of numbers can frequently cause us to experience double vision. Finding the meaning of data that is presented in rows can be challenging. The use of pictures, charts, descriptive language, and an engaging design all contribute to data visualization, which enables us to reframe the data in a new light. Visualization also enables us to collect and organize data based on categories and topics, which can make it easier to break it down into manageable chunks. This can be a significant benefit.
2. Locating patterns and anomalies within a given data collection.
If you were to manually sort through raw data, it could take you a very long time to identify patterns, trends, or anything that is out of the ordinary. However, you can sort through a large amount of data in a short amount of time by using data visualization tools such as charts. Even better, charts make it simpler and faster to identify patterns than combing through numerical data.
3. Tell a story that can be found inside the data.
The mere presentation of numbers does not typically elicit an emotional response. However, data visualization allows for the telling of a story that provides context for the data. Designers utilize methods such as color theory, images, design style, and visual cues to appeal to the emotions of readers, put faces to numbers, and introduce a narrative to the data.
4. Putting more weight to a claim or viewpoint.
When it comes to persuading people that your viewpoint is correct, showing them the evidence is often necessary for them to believe you. Your argument can be strengthened while also highlighting your creative potential if you use a good infographic or chart. You can use a comparison infographic, for instance, to compare the various points of view in an argument, various ideas, product or service options, advantages and disadvantages, and more.
5. Bringing attention to the most relevant aspects of a data set.
We sometimes use data visualizations so that it is simpler for readers to investigate the data and draw their own conclusions. On the other hand, we frequently employ data visualizations in order to convey a story, present a specific argument, or urge readers to arrive at a particular conclusion. Visual cues are employed by designers to guide the viewer's attention to specific locations on a page. Visual cues are elements such as forms, symbols, and colors that either direct the viewer's attention to a particular
portion of the data visualization or highlight a certain portion of the data.

Types of data visualization
TABLE
When performing an analysis of comparative data on categorical objects, a data table or a spreadsheet can be an effective format to use. The things being compared are typically arranged in columns, and the classified objects are placed in the rows of the table. The numerical value is then placed in what is known as the cell, which is located at the junction of the row and the column.

INFOGRAPHIC
An infographic is a compilation of images, charts, and relatively little text that provides a concise summary of a subject in an easy-to-understand format.

Infographics come in many types, depending on the purpose and the way information is presented. Here are the different types of infographics commonly used:
6. Geographic infographic (map infographic) - uses maps and location data to show trends or patterns.
Example: population density by region, global internet usage, tourism hotspots.
7. Hierarchical infographic - presents information in levels of importance or ranking (like a pyramid or flow structure).
Example: Maslow's hierarchy of needs, organizational structure, food pyramid.
8. List infographic - uses a list format to summarize points clearly and attractively.
Example: top 10 tips for productivity, safety rules, best practices.
9. Flowchart infographic - helps readers make decisions by guiding them through different options or paths.
Example: should you buy or rent?, troubleshooting guides.
10. Interactive infographic - digital version that allows users to interact, click, or explore additional data.
Example: online data dashboards, clickable maps, animated infographics.
(bar, grouped bar, bubble, parallel coordinate, multi-line, bullet)

Ranking
Show an item's position in an ordered list. Use cases include:
- election results
- performance statistics
(ordered bar, ordered column, parallel coordinates)

Part-to-whole
Show how partial elements add up to a total. Use cases include:
- consolidated revenue of product categories
- budgets
(stacked bar, pie, donut, stacked area, treemap, sunburst)

Correlation
Show correlation between two or more variables. Use cases include:
- income and life expectancy
(scatterplot, bubble, column/line, heatmap)

Distribution
Show how often each value occurs in a dataset. Use cases include:
- population distribution
- income distribution
(histogram, box plot, violin, density)

Flow
Show movement of data between multiple states. Use cases include:
- fund transfers
- vote counts and election results
(sankey, gantt, chord, network)

Relationship
Show how multiple items relate to one another. Use cases include:
- social networks
- word charts
(network, venn diagram, chord, sunburst)

DIAGRAM
A diagram is a graphical depiction of information, comparable to a chart in its function. Both two-dimensional and three-dimensional representations of diagrams are possible. Diagrams can be used to plan out projects, assist in decision making, map out processes, determine root causes, connect concepts, and identify connections.

MAPS
A land mass is depicted pictorially on a map in order to facilitate easier comprehension. The geographic characteristics of the land, such as its regions, landscapes, cities, and roadways, as well as its bodies of water, are depicted on maps.

STYLE
Data visualizations use custom styles and shapes to make data easier to understand at a glance, in ways that suit the user's needs and context.
Charts can benefit from customizing the following:
- Graphical elements
- Typography
- Iconography
- Axes and labels
- Legends and annotations

Styling different types of data
Visual encoding is the process of translating data into visual form. Unique graphical attributes can be applied to both quantitative data (such as temperature, price, or speed) and qualitative data (such as categories, flavors, or expressions). These attributes include:

Shape
Charts can use shapes to display data in a range of ways. A shape can be styled as playful and curvilinear, or precise and high-fidelity, among other ways in between.

Level of shape detail
Charts can represent data at varying levels of precision. Data intended for close exploration should be represented by shapes that are suitable for interaction (in terms of touch target size and related affordances), whereas data that's intended to express a general idea or trend can use shapes with less detail.

Color
Can be used to differentiate chart data in four primary ways:
- distinguishing categories from one another
- representing quantity
- highlighting specific data or an area of focus
- expressing meaning

Size
Area
Volume
Length
Angle
Position
Direction
Density

Accessibility
To accommodate users who don't see color differences, you can use other methods to accentuate data, such as high-contrast shading, shape, or texture.

Line
Chart lines can express qualities about data, such as hierarchy, highlights, and comparisons. Lines can be styled in different ways, such as using dashes or varied opacities.
Lines can be applied to specific elements, including:
- Annotations
- Forecasting elements
- Comparative tools
- Confidence intervals
- Anomalies

... content in the hierarchy. However, these treatments should be used sparingly, with a limited number of typographic styles.

Iconography
Iconography can represent different types of data in a chart and improve a chart's overall usability.
Iconography can be used for:
- Categorical data, to differentiate groups or categories
- UI controls and actions, such as filter, zoom, save, and download
- States, such as errors, no data, completed states, and danger

When placing icons in a chart, it's recommended to use universally recognizable symbols, particularly when representing actions or states, such as: save, download, completed, error, and danger.

Labelled axis
- A labelled axis, or multiple axes, indicates the scale and scope of the data displayed. For example, line charts display a range of values along both horizontal and vertical labelled axes.
- Bar charts should always start at a baseline value of zero.

Bar chart baseline
Bar charts should start at a baseline (the starting value on the y-axis) of zero. Starting at a baseline that isn't zero can cause the data to be perceived incorrectly.
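To see why a non-zero baseline misleads, the sketch below compares the ratio of two drawn bar lengths under different baselines. The function name and the two values are hypothetical and chosen only for illustration.

```python
def drawn_length_ratio(value_a, value_b, baseline=0.0):
    """Ratio of the drawn bar lengths (b to a) on an axis that starts at `baseline`."""
    if min(value_a, value_b) <= baseline:
        raise ValueError("both values must lie above the baseline")
    return (value_b - baseline) / (value_a - baseline)


# Hypothetical values: b is only 10% larger than a.
value_a, value_b = 50, 55

honest = drawn_length_ratio(value_a, value_b)                    # 1.1
truncated = drawn_length_ratio(value_a, value_b, baseline=45.0)  # 2.0

print(honest, truncated)
```

With the axis starting at 45, the second bar is drawn twice as long as the first even though the underlying value is only 10% larger.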
- Be rotated
- Stacked vertically

Small displays
Charts displayed on wearables (or other small screens) should be a simplified version of the mobile or desktop chart.