Cheat Sheet

Stab22

Uploaded by Sahala

Scatterplots & Association. Scatterplot: A graphical display of two quantitative variables to show their relationship.

Types of Association: Positive Association: As one variable increases, the other also increases. Negative Association: As one variable increases, the other decreases. No Association: No discernible pattern between the two variables.

Correlation Coefficient. r is between -1 and 1. Closer to 1 means strong positive correlation; closer to -1 means strong negative correlation; 0 means no linear correlation. Properties: Symmetric: r is the same regardless of which variable is on the x-axis or y-axis. Not Affected by Shifting or Rescaling: Shifting or rescaling the variables does not affect r. Sensitive to Outliers: Outliers can drastically change the value of r.

Interpretation of Correlation Coefficient. r > 0: Positive relationship. r < 0: Negative relationship. Magnitude of r: The closer r is to ±1, the stronger the linear relationship.

Linear Regression. Purpose: Linear regression models the relationship between two
quantitative variables by fitting a straight line to the data.

Assumptions in Linear Regression. Linearity: The relationship between x and y should be linear. How to check: Look for patterns in the residual plot. A curved pattern suggests a violation of the linearity assumption. Constant Variance (Homoscedasticity): The variance of the residuals should be constant across all values of x. How to check: Look for funnel-shaped patterns in the residual plot. A funnel shape suggests heteroscedasticity. Normality of Residuals: The residuals should be normally distributed. How to check: Use Q-Q plots or histograms of the residuals. Independence of Residuals: Residuals should be independent of each other (especially in time series data). How to check: Use the Durbin-Watson test. A value close to 2 indicates no autocorrelation.

Key Regression Diagnostics. Residual Plot. Purpose: Used to check for linearity, constant variance, and independence. What to look for: Randomly scattered points around 0 suggest the model is appropriate. Problems: Curved pattern → non-linearity. Funnel shape → heteroscedasticity. Patterns over time → autocorrelation.
Q-Q Plot. Purpose: Used to check the normality of residuals. What to look for: Residuals should lie along a straight diagonal line. Deviations indicate non-normality.

Regression Diagnostics Summary. Linearity: Check the residual plot for random scatter; curves suggest non-linearity. Constant Variance: Look for a funnel shape in the residual plot to detect heteroscedasticity. Normality: Use Q-Q plots or histograms to check if the residuals follow a normal distribution. Independence: Use the Durbin-Watson test to check for autocorrelation in time series data.

Common Problems and Solutions. Multicollinearity: Issue: Independent variables are highly correlated. Solution: Remove one of the correlated variables or use Ridge regression. Heteroscedasticity: Issue: Non-constant variance in residuals. Solution: Use transformations (e.g., log), weighted least squares, or robust standard errors. Non-linearity: Issue: Curved relationships between variables. Solution: Use polynomial regression or transform variables. Outliers: Issue: Large residuals that may skew the model. Solution: Identify and remove outliers or use robust regression techniques.

R²: Note that R² reflects only the proportion of variance in the dependent variable explained by the model.

Re-expressing Data (Transformations): When the
assumptions of linear regression (such as linearity or constant variance of residuals) are violated, transforming the data can often help. Common transformations: Square (y²): Useful when the data is left-skewed. Square root (√y): Often applied to count data (such as the number of items or occurrences). Logarithmic (log(y)): Works well for right-skewed data or data that grows exponentially (such as interest rates or population growth). Inverse (1/y): Useful for ratios and rates (e.g., converting speed from km/h to h/km).

Outliers: These are points that deviate significantly from the rest of the data and can distort the regression line. High Leverage Points: These occur when an x-value is far from the mean of the other x-values. While high leverage points don't necessarily distort the regression line, they have the potential to do so if they are influential. Influential Points: These are points that have a significant effect on the slope of the regression line. Removing these points would greatly alter the fit of the line.

Linear Transformation. A linear transformation modifies a dataset by shifting or rescaling all values. Formula: Y = aX + b, where a is the scaling factor (multiplies each value) and b is the shifting factor (adds/subtracts a constant from each value). Effect of Shifting (Adding/Subtracting a Constant): Adds b to the mean and median. No effect on spread (standard deviation, IQR). Effect of Rescaling (Multiplying/Dividing by a Constant): Multiplies the mean and median by a, and the spread (standard deviation, IQR) by |a|.
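These effects can be checked directly; a minimal sketch with made-up data and the assumed values a = 2, b = 10:

```python
# Sketch: effect of the linear transformation Y = aX + b on summary
# statistics and z-scores (made-up data; a and b chosen for illustration).
from statistics import mean, stdev

x = [2, 4, 4, 6, 9]
a, b = 2, 10
y = [a * xi + b for xi in x]

# Shifting adds b to the center; rescaling multiplies center and spread by a.
print(mean(x), round(stdev(x), 3))  # mean 5, spread ~2.646
print(mean(y), round(stdev(y), 3))  # mean a*5 + b = 20, spread ~5.292

# Z-scores are unchanged: the shift and the scale factor both cancel.
zx = [(xi - mean(x)) / stdev(x) for xi in x]
zy = [(yi - mean(y)) / stdev(y) for yi in y]
print(all(abs(u - v) < 1e-9 for u, v in zip(zx, zy)))  # True
```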
Effect of Linear Transformations on Z-Scores: Shifting (adding/subtracting b): No effect on z-scores, because both the data point and the mean shift by the same amount. Rescaling (multiplying/dividing by a): No effect on z-scores, because both the data point and the standard deviation are scaled by the same factor, which cancels out.

Normal Distribution (Bell Curve). Characteristics: Symmetric and unimodal (one peak at the mean). Defined by two parameters: Mean (μ): The center of the curve. Standard deviation (σ): Controls the spread of the curve. 68-95-99.7 Rule: 68% of data falls within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations. Why Use the Normal Distribution? Many real-world datasets (heights, weights, IQ scores) follow a Normal distribution, and z-scores are especially useful for interpreting data within one.

General Density Curve: A smooth curve that describes the shape of the data distribution. Can be skewed, bimodal, or uniform. The area under the curve equals 1 (100% of the data). Normal Density Curve (Bell Curve): A specific type of density curve for normally distributed data. Symmetric and bell-shaped. Defined by mean (μ) and standard deviation (σ). When to Use Each: Normal Density Curve: When the data is symmetric, unimodal, and follows the 68-95-99.7 rule. General Density Curve: When the data is skewed, bimodal, or doesn't follow the Normal distribution.

Proportions and Z-Scores. You can
use z-scores to find what proportion of data falls above or below a specific value in a Normal distribution.

Simple Random Sampling (SRS). Definition: Every individual in the population has an equal chance of being selected. Example: Selecting 20 students randomly from a class of 575 using R (sample(1:575, 20)). Advantage: Minimizes bias; truly random selection. Disadvantage: Not always practical for large populations.

Sampling Frame. Definition: A list of individuals from which a sample is drawn. It should closely match the population to avoid bias. Example: If you use the Yellow Pages to sample businesses, but not all businesses are listed, this could introduce bias.

Stratified Sampling. Definition: The population is divided into strata (groups of similar individuals), and an SRS is taken within each stratum. Example: Dividing a student population by gender (40% male, 60% female) and ensuring the sample reflects this ratio. Advantage: Ensures all groups are represented, reducing bias. Use Case: When there are distinct subgroups in the population that might have different characteristics.

Cluster Sampling. Definition: Instead of sampling individuals directly, entire clusters (e.g., neighborhoods) are randomly selected, and then individuals within those clusters are surveyed. Advantage: Practical for large, dispersed populations. Disadvantage: May introduce bias if clusters are not representative of the whole population.

Multistage Sampling. Definition: A complex sampling method where sampling occurs in stages (e.g., first sampling cities, then neighborhoods, then households). Example: National surveys where different regions are sampled at various levels. Use Case: For large populations with hierarchical structures.

Systematic Sampling. Definition: Selects every i-th individual from a population after a random starting point. Example: Surveying every 5th customer in a store after randomly selecting a starting point. Advantage: Easy to implement when you have a list of the population.

Bias in Sampling. Undercoverage Bias.
Definition: Occurs when some groups in the population are left out or underrepresented. Example: A phone survey that only contacts landline users underrepresents younger people who primarily use mobile phones. Impact: Results may not reflect the true population characteristics.

Response Bias. Definition: Occurs when the way questions are asked influences respondents to answer in a particular way. Types of Response Bias: Leading Questions: A question like "How much do you agree that smoking is harmful?" nudges respondents toward agreement. Social Desirability Bias: People may answer in a socially acceptable way rather than truthfully (e.g., overstating recycling habits). Sensitive Topics: Respondents may underreport behaviors such as drug use due to fear of judgment. Recall Bias: Asking participants to remember past events (e.g., "How many times did you exercise last year?") can lead to inaccurate answers.

Population. Definition: The entire group you're studying (e.g., all Canadians, all students at a college). Example: All college students in a drug-use survey.

Sample. Definition: A smaller group selected from the population to represent the whole. Example: 100 students in your dorm surveyed about drug use.

Parameter. Definition: A numerical value describing a characteristic of the population. It is often unknown. Example: The true proportion of all students at a college who use drugs (e.g., 30%).

Statistic. Definition: A numerical value describing a characteristic of a sample. It is used to estimate the population parameter. Example: The proportion of students in your dorm who use drugs (e.g., 15%).

Sampling Variability. Definition: Results can vary depending on the sample selected from the population. Key Point: Larger samples tend to have less variability and provide more reliable estimates of the population parameter.

Sample Size. Effect on Accuracy: A larger sample size reduces sampling variability and improves the accuracy of estimates. Note: The sample size, not the population size, is the key factor in reducing variability, except when the sample is more than 10% of the population.

Voluntary Response Bias. Definition:
Only individuals with strong opinions (often negative) are likely to respond, skewing the results. Example: A survey asking people to rate their dissatisfaction with a product might only attract angry customers, leading to a biased conclusion.

Convenience Sampling. Definition: Including individuals who are easy to reach rather than selecting a representative sample. Example: Surveying students who are nearby in a common area rather than randomly selecting from the entire student body.

Undercoverage. Definition: Failing to sample certain groups within the population adequately. Example: Missing rural voters in a political survey conducted primarily in urban areas.

Experimental Designs. Completely Randomized Design (CRD): All experimental units are randomly assigned to treatments. Randomized Block Design (RBD): Random assignment is done within blocks of similar subjects.

Blocking. Blocking is used to ensure that groups with specific characteristics (e.g., age) are not unevenly distributed across treatment groups, which could bias the results.

Experiments. Placebo: A fake treatment used to prevent knowledge of the treatment from affecting the response. The control group may receive a placebo or the standard treatment. Blinding: Single-blind: Either the subjects or the evaluators don't know the treatment assignment. Double-blind: Both subjects and evaluators are unaware of treatment assignments, to avoid bias.

Confounding Variable: A variable that affects both the factor and the response, making it difficult to tell the true cause of the response.

Observational Studies vs. Experiments. Observational Studies: No control over the subjects' behavior; simply observing and measuring variables. Experiments: Different treatments are imposed to compare effects.

Data Collection Methods. Sample Surveys: Directly ask a sample for information. Observational Studies: Observe and record data without manipulating variables. Retrospective: Look at past data. Prospective: Collect data as events happen. Experiments: Assign different treatments to measure causal effects.

Principles of Experimental Design. Control: Make conditions similar for all groups except for the treatment. Randomize: Distribute unknown effects evenly across groups. Replicate: Take multiple measurements to ensure results aren't due to chance. Blocking: Group subjects by known factors to control variability.

Blinding in Experiments. Placebo-Controlled: Ensures the placebo effect doesn't interfere with results. Double-Blind: Both researchers and subjects are blinded to treatment assignments to prevent bias.
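The random assignment at the heart of a completely randomized design can be sketched as follows (hypothetical subject IDs and group sizes):

```python
# Sketch of a completely randomized design: randomly split 20 hypothetical
# subjects into two equal groups (treatment vs. control).
import random

random.seed(1)                  # fixed seed only for a reproducible illustration
subjects = list(range(1, 21))   # subject IDs 1..20
random.shuffle(subjects)        # random order spreads unknown effects evenly

treatment = subjects[:10]       # first half receives the treatment
control = subjects[10:]         # second half receives the placebo/control

print(sorted(treatment))
print(sorted(control))
```

For a randomized block design, the same shuffle-and-split step would instead be applied separately within each block of similar subjects.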
