DADM NOTES, KAHOOT AND IMP FORMULAS
Notes:
The bar graph is the graphical representation of categorical data.
A histogram is the graphical representation of quantitative data.
Categorical Data (qualitative):
1. bar chart (cannot be skewed)
2. pie graph
Numerical Data (Quantitative):
1. Histogram - 1 quantitative variable
2. Boxplot - numerical and categorical
3. Time Series - 2 quantitative variables
4. Scatterplot - 2 quantitative variables
Cross-Sectional: The data is collected at one point in time.
Time Series (longitudinal): Data collected over a period of time
Cross-sectional time series: Data is collected over time; for every point of time there are multiple
observations
Symmetric (no skew): Mean ~ Median
Right Skewed: Mean > Median
Left Skewed: Median > Mean
Mean is sensitive to outliers
Median is not sensitive to outliers
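A minimal sketch of the mean vs. median point above, using made-up salary figures (the numbers are an assumption, not course data). Adding one extreme value pulls the mean far to the right while the median barely moves, which is also why Mean > Median signals right skew.

```python
from statistics import mean, median

# Hypothetical salaries in thousands; values are made up for illustration.
salaries = [40, 42, 45, 47, 50]
with_outlier = salaries + [500]  # add one extreme value

mean_before, median_before = mean(salaries), median(salaries)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("without outlier: mean", mean_before, "median", median_before)
print("with outlier:    mean", round(mean_after, 2), "median", median_after)
```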
Pivot tables are also known as cross-tabulations or contingency tables.
Covariance:
If x and y move in the same direction: positive covariance
If x and y move in different directions: negative covariance
Covariance does not tell us about the strength of the linear association
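A small sketch of the covariance notes above, assuming the usual sample covariance formula (divide by n - 1) and made-up data. The sign shows the direction, but the size depends on the units of x and y, which is why it says nothing about strength.

```python
def covariance(xs, ys):
    """Sample covariance: sum((x - mean_x) * (y - mean_y)) / (n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # moves in the same direction as x
y_down = [10, 8, 6, 4, 2]  # moves in the opposite direction

cov_up = covariance(x, y_up)      # positive
cov_down = covariance(x, y_down)  # negative
# Same relationship, but y expressed in a unit 100 times smaller.
cov_rescaled = covariance(x, [100 * v for v in y_up])

print(cov_up, cov_down, cov_rescaled)
```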
Correlation:
r can be positive or negative
-1<=r<=1
the closer r is to the extremes (-1 or 1), the stronger the LINEAR relationship between x and y
if r is 0 or close to 0, there is no (or only a very weak) linear relationship
Drawbacks:
Correlation is very sensitive to outliers
one outlier can substantially change the value of r
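A sketch of the correlation notes above, with made-up data and a hand-written Pearson r (the helper name pearson_r is just for illustration). It shows three things: r = 1 for a perfect line, one outlier flipping the sign of r, and r = 0 for a curved relationship where Y still depends fully on X.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 12]           # perfectly linear
r_clean = pearson_r(x, y)

# Add a single extreme point far away from the line.
r_outlier = pearson_r(x + [20], y + [0])

# y = x^2 on a symmetric range: clearly related, but not linearly.
x_sym = [-2, -1, 0, 1, 2]
r_curved = pearson_r(x_sym, [v * v for v in x_sym])

print(round(r_clean, 3), round(r_outlier, 3), round(r_curved, 3))
```

The last case is why correlation = 0 does not mean independence: it only rules out a linear relationship.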
Causation:
Causal relationship - a change in one variable causes a change in the other variable
Correlation and covariance do not imply causation
In spurious relations, two variables are wrongly assumed to be related to each other
In spurious relations, there is typically a 3rd lurking variable that drives both variables.
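A sketch of a lurking variable, using an invented ice cream / drownings example (the data, variable names and coefficients are all assumptions made up for illustration). Both variables are built from temperature only, never from each other, yet they end up strongly correlated.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: temperature is the lurking (3rd) variable.
temperature = [10, 15, 20, 25, 30, 35]
noise_a = [1, -1, 1, -1, 1, -1]
noise_b = [1, 1, -1, -1, 0, 0]

# Each variable depends on temperature plus its own noise, not on the other.
ice_cream_sales = [2 * t + e for t, e in zip(temperature, noise_a)]
drownings = [0.5 * t + e for t, e in zip(temperature, noise_b)]

r_spurious = pearson_r(ice_cream_sales, drownings)
# With the temperature-driven part taken out, only the noise is left.
r_leftover = pearson_r(noise_a, noise_b)

print(round(r_spurious, 3), round(r_leftover, 3))
```

Here the temperature effect can be removed exactly only because the data was constructed that way; with real data the lurking variable is usually not known in advance.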
1. Simple linear regression = linear regression with only one predictor
2. Multiple linear regression = linear regression with more than one (quantitative or
categorical) predictors
3. Logistic regression = nonlinear regression with a categorical dependent variable
Simple Linear Regression:
Univariate regression (simple regression): 2 variables, i.e. 1 predictor and 1 dependent variable
Multivariate regression (multiple regression): 2 or more predictors
R Square:
The coefficient of determination
0 ≤ r² ≤ +1
r² shows how much of the variability in Y is explained by the variability in X (i.e., how much is
explained by the linear model).
r² shows how good the linear model is.
The closer to +1, the better the model
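A sketch of how r² can be computed for a simple linear regression, assuming the standard least-squares line and the definition r² = 1 - SSE/SST (the data is made up). In simple regression this equals the square of the correlation r, which the last line checks.

```python
from math import sqrt

def simple_regression(xs, ys):
    """Least-squares line y = intercept + slope * x, plus r-squared and r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sst = sum((y - mean_y) ** 2 for y in ys)  # total variability in Y
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variability in Y that the line does NOT explain.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * sst)
    return intercept, slope, 1 - sse / sst, r

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up values close to a straight line

intercept, slope, r2, r = simple_regression(x, y)
print(round(intercept, 3), round(slope, 3), round(r2, 4))
print(abs(r2 - r * r) < 1e-9)  # r-squared equals r squared here
```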
Multiple Regression:
In multiple regression, R-squared is always > adjusted R-squared.
R-squared never goes down when you add a predictor, even a bad one (in practice it goes up). However,
adjusted R-squared may go down if you add a bad predictor, because it penalizes the number of predictors.
This penalty is why R-squared always exceeds adjusted R-squared.
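A sketch of the R-squared vs. adjusted R-squared comparison, assuming the usual textbook formula adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. The formula is not written out in these notes, and the R² values below are hypothetical, chosen to mimic adding a weak predictor.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 30  # hypothetical sample size

# Hypothetical model with 2 predictors.
r2_before = 0.800
adj_before = adjusted_r2(r2_before, n, k=2)

# Add a weak 3rd predictor: R-squared creeps up only slightly.
r2_after = 0.801
adj_after = adjusted_r2(r2_after, n, k=3)

print(round(adj_before, 4), round(adj_after, 4))
```

R-squared went up (0.800 to 0.801) while adjusted R-squared went down, and in both models adjusted R-squared is below R-squared.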
Regression with categorical variables:
The coefficient of an interaction variable captures the difference between slopes
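A sketch of the interaction point above, using a hypothetical model salary = B0 + B1*experience + B2*is_manager + B3*(experience*is_manager). The variable names and coefficient values are made up for illustration. The slope for the group coded 0 is B1, the slope for the group coded 1 is B1 + B3, so B3 is the difference between the two slopes.

```python
# Hypothetical fitted coefficients (made up, not from any real data set).
B0, B1, B2, B3 = 30.0, 2.0, 5.0, 1.5

def predict(experience, is_manager):
    """Predicted salary; is_manager is a 0/1 dummy variable."""
    return B0 + B1 * experience + B2 * is_manager + B3 * experience * is_manager

def slope(is_manager):
    """Change in predicted salary for one extra year of experience."""
    return predict(11, is_manager) - predict(10, is_manager)

slope_non_manager = slope(0)  # equals B1
slope_manager = slope(1)      # equals B1 + B3
slope_difference = slope_manager - slope_non_manager  # equals B3

print(slope_non_manager, slope_manager, slope_difference)
```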
Linear regression with categorical predictors: nonlinear effects
a) Regressions with a quadratic term
For categorical ordinal variables only (not nominal); also works for numerical discrete or
continuous variables
b) Regressions with multiple dummy variables
For categorical ordinal & categorical nominal variables
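A sketch of option b), turning one categorical variable into multiple 0/1 dummy variables. It assumes the common convention of leaving one category out as the baseline (so 3 categories give 2 dummies); that convention, the helper name make_dummies and the data are assumptions for illustration.

```python
def make_dummies(values, reference):
    """Return one 0/1 column per category, except the reference category."""
    categories = sorted(set(values) - {reference})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Made-up education levels for five people.
education = ["HS", "BA", "MA", "BA", "HS"]

# "HS" is the baseline: a row with all dummies equal to 0 means "HS".
dummies = make_dummies(education, reference="HS")

for name, column in dummies.items():
    print(name, column)
```

Each dummy gets its own coefficient in the regression, so the effect of each category does not have to follow a straight line across categories.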
Logistic regression:
Logistic regression is used to model situations when the dependent variable (Y) is categorical
and may take only two values.
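A sketch of why logistic regression is nonlinear in Y but linear in the logit, assuming the standard form logit(p) = ln(p / (1 - p)) = B0 + B1*x with p = P(Y = 1). The coefficients are hypothetical. Each extra unit of x adds the same amount (B1) to the logit, but the change in probability is different at different values of x.

```python
from math import exp, log

# Hypothetical coefficients (made up for illustration).
B0, B1 = -4.0, 0.8

def probability(x):
    """P(Y = 1) for a given x under the logistic model."""
    return 1 / (1 + exp(-(B0 + B1 * x)))

def logit(p):
    """Log-odds of a probability p."""
    return log(p / (1 - p))

for x in [3, 5, 7, 9]:
    p = probability(x)
    print(x, round(p, 3), round(logit(p), 3))

# Equal steps in x: equal steps in the logit, unequal steps in the probability.
logit_step = logit(probability(6)) - logit(probability(5))
prob_step_middle = probability(6) - probability(5)
prob_step_high = probability(10) - probability(9)
```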
Kahoot:
1. Pivot tables can be used for categorical and quantitative variables: (2 answers)
Ans: TRUE
IT DEPENDS
2. Pivot chart can only be bar graph:
Ans: False
3. Which Excel command do we use to merge 2 datasets?
Ans: =VLOOKUP(), XLOOKUP()
4. Data Validation tool in Excel can be used to..
Ans: make sure that the data is in correct format
detect data entry errors
5. Correct way to show Excel's Date command? Date(__/__/__)
Ans: year, month, day
6. Index command finds the value in a specified location
Ans: True
7. Match command finds the value in a specified location
Ans: False
8. Independence means correlation = 0
Ans: True
9. Correlation = 0 means independence
Ans: False
10. When correlation = 0, it always means that X and Y are not related to each other
Ans: False
11. When correlation = 0, it means there is no linear relationship between X and Y
Ans: True
12. Which of the following statements is NOT TRUE about correlation?
Ans: Correlation implies causation
13. Cheese consumption is positively correlated with # deaths by being tangled in bedsheets
because?
Ans: it's a spurious relationship
14. Spurious correlation occurs when
Ans: 2 variables are wrongly assumed to be related
15. Multiple linear regression means there are several linear regression equations
Ans: False
16. In multiple regression, R-squared is always
Ans: > R-squared adjusted
17. In a multiple regression, when we add a new X variable, R-squared ALWAYS goes up.
Ans: True
18. In a multiple regression, when we add new X variable, R-squared adjusted ALWAYS
goes up.
Ans: False
19. Which of the following cannot be modeled using logistic regression?
Ans: LN(Starting Salary)
Starting Salary
[basically, any numerical value]
20. In logistic regression, Y (1=Like,0=Dislike) is linearly dependent on explanatory
variables
Ans: False
21. Logit is linearly dependent on the explanatory variables
Ans: True
IMP Formulas: