--------------DATA ANALYSIS-------------- MEAN:
- Arithmetic: the normal
To block the cell: $G$6 - Geometric: exponential data
!
Minimum Maximum Count Range = √𝑥1 ∗ 𝑥2 … .∗ 𝑥𝑛
=min(range) =max(range) =count(range) =max - min - Harmonic: inversely proportional (speed, time)
( ( (
= n/ )( + )* … )+
Sturges’ rule: (on the formula sheet)
I= int
Mean for data grouped into classes
- Number of classes: k = I [log2(n)] +1
=int(log(n;2)) +1 = sum c_i*f_i
∆ c_i =center of the classes
- Width of the classes: h= I ( ) + 1
" c_i = (lower + upper)/2
=int((max-min)/k) +1
=sumproduct( select column c_i; select column f_i)
Count if = sum( x_i *n_i) / total
=countif(range; criteria) = sumproduct (select column x_i; select column n_i) / total
=countif(range; “ ><=” & cell)
(raw) Median (𝒙
+)
Count ifs
- numerical data.
=countifs( range 1; criteria 1; range 2; criteria 2)
=median(range)
Sum
=sum(num1; num2; ….) Median for data grouped into classes
Discover the median class (see raw median and related with the
Absolute Frequencies (n_i) interval it’s in)
Number of elements in each category Median class: xa =[k;v]
-how many are under x? à use count if / count ifs ( Fa – Fa-1 ) / ( k – v ) = ( 0,5 – Fa-1 ) / (x – k )
=COUNTIFS(range;">="&lowerlimit;range;"<"&upperlimit)
(raw) Mode (mo)
Cumulative Absolute Frequencies (N_i) the most frequent observation in a data set.
number of observations smaller than or equal to a given value of the =mode(range) à only looks to 1 mode (the smallest)
variable =mode.mult(range) à you can find the 2 modes
Previous (N_i) + current (n_i)
Mode for data grouped into classes
Relative Frequencies (f_i) n_i / h_i = density or f_i/h_i
percentage of elements in each category = n_i / (upper – lower)
= (n_i) / sum n_i (total) à spot the highest density à find the modal class
Cumulative Relative Frequencies (F_i) Range
percentage of observations smaller than or equal to a given value of the Max- min // upper limit – lower limit
variable
= previous (F_i) + current (f_i) Quantiles
The quantiles are non-central tendency location measures that indicate
Histogram with area 1 à the position of an observation within the sorted data set.
𝒇𝒊 // 𝒏_𝒊
relative densities: di = q0.25/ q0.75
𝒉𝒊
d = (f_i) / number of interval (width) - F_i class where 0,25 is.
- HISTOGRAM- bar chart: Class (X) ---------------------------> F_i (X) – F_i (X-1)
Select the di ; insert ;Column chart; Menu on the right: reduce the gap between Class (lower X , x ) --------------> 0.25/75 – F_i (X-1)
the columns for 0%; Changing the label of the horizontal axis: right-click; select =quartile.inc(range;1)
data; select classes interval
IQR= q3-q1 (interquartile range)
Frequency polygon à imaginary classes: di = 0
Adding the imaginary classes: (raw) Variance (𝒔𝟐 )
- insert a new row before the first one (the sme range but the is the classical measure of spread of the observations around the center.
interval before) Is always > 0
- insert a new row after the last one (the same range but the interval If it’s = 0 à all the observations are zero
after) VAR.P à “large” samples ( total/n >30) = var.p(range)
NOTE: zeros on the di VAR.S à “small” samples (total/n <=30) =var.s(range)
HISTOGRAM & FREQUENCY POLYGON TOGETHER
- after building the histogram and the imaginary classes: Variance for data grouped into classes
arranging the range of the horizontal axis: click on the histogram and expand .𝑠 * = sum 𝑥𝑖 * fi - 𝑥̅ *
the selection of the di and the range Add column (c_i )^2
1. right click over the chart
= c_i * c_i
2. select data
3. add series 2 = sumproduct( f_i ; c_i^2) – mean grouped^2
4. add values: select the columns with the di with the zeros 𝑠 * = sum 𝑥𝑖 * ni - 𝑥̅ *
5. right click on one of the columns; change chart type à line Add column (x_i)^2
x_i * x_i
Ogive (with the F_I) =sumproduct( n_i ; n_i^2) – mean grouped^2
select the (F_i) // insert line chart //change the labels on the horizontal
axis
Standard deviation (s) à raiz da variance
###
(raw) Mean (𝒙)
=sqrt(variance)
=average(range) =stdev.p(range)
Outlier Lorenz curve
An outlier is an observation that is unusually large or unusually small when • insert a new row at the beginning with pi=qi=0
compared to the rest of the data. • select both (at the same time pi and qi) à put them side by side or
Moderate lower = q1 - 1,5 * IQR hit the “command” key at the same time you are selecting them
Moderate upper = q3 + 1,5 * IQR • insert – charts – scatter (the 4th to appear)
Severe lower = q1 - 3 * IQR • add the line of equal distribution:
Severe upper = q3 +3 * IQR add a table with pi and qi in the poits 0,0 and 1,1
add new series in the chart
Nº severe=
=countif(range; “<”& severe lower) + countif(range; “>” & severe upper)
Nº moderate= location measures: mean & mode & median & quantiles
=countif(range; “<”& moderate lower) + countif(range; “>” & moderate spread measures: range and iqr & varience & standard deviacion &
upper) – nº severe coeficient of variation & outliers and boxplot:
shape measures: skewness & kurtosis
SYMMETRY & DISTRIBUTION concentration measures: lorenz curve & gini index
association measures: covariance & correlation
g=0 àmode = median = mean à equal distribution (symmetrical
graphic) à symmetric - type
g > 0 à mode < median < mean à distribution/ center of gravity moved Categorical: groups or categories. (blood type; nationality; hobby)
to the right à positively skewed (biggest more to the left) Numerical: expressed by numbers. (age; height; nº of sons)
g < 0 à mode > median > mean à distribution/center of gravity moved Discrete: finite values (shoe size; nº of siblings)
to the left à negatively skewed (biggest more to the right) Continuous: any value inside a real interval
SKEWNESS - nature + scale of measurement
(g) measures are based on the relationship between central tendency Qualitative: no meaning for the difference between values
measures. NOMINAL: does not imply ranking of responses. (gender; eye color)
KURTOSIS ORDINAL: indicates the rank of items. ( social class- low, medium,
is a measure of whether the data is heavy- tailed or light-tailed relatively avarege; 1-5 stars )
to a normal distribution. Quantitative: includes interval and ratio measurement levels.
§ k = 0 àNormal kurtosis: mesokurtic distribution INTERVAL: indicates rank and distance from an arbitrary 0. (temperature,
§ k > 0 à Low kurtosis: leptokurtic distribution IQ)
RATIO: indicates rank and distance from a natural zero. (age; salary;
§ k < 0 à High kurtosis: platykurtic distribution
weight)
(raw) Skewness --------------------------------------------------------------------------------------------------------------------------
SKEW.P à “large” samples ( total/n >30) Regression line: y = b0 + b1x (formulae test)
SWEW à “small” samples (total/n <=30) b1 = Sxy/S^2x à slope
=skew.p(range) b0 = y – b1x à intercept
=skew(range) - the sign of the slope (b1) is determined by the
covariance. (covariance > 0 à increasing; covariance < 0
(raw) Kurtosis à decreasing)
=kurt(range) - b0 predicted average value of y when x is zero
K= [( 1/n * sum(c_i – mean)^4 ) / s^4 ] - 3 - b1 predicted average variation of y when x increases 1
unit
Kurtosis for data grouped into classes
Add column (c_i- mean)^4
K=sumproduct( (c-mean)^4); f_i ) / (deviation^4) – 3 r = correl( xi; yi) à coeficiente of linear correlation between X and Y.
-
b1 (slope) = slope(yi; xi)
Coefficient of variation (CV) à SPREAD
= standard deviation / mean b0 (intercept) =intercept (yi; xi)
= (s. dev / mean)
equation of the regression line = b0 – b1x
Boxplot
1. mean & variance & standard deviation: square route of
à Correlation ¹ causation
variance =. sqrt(variance)
2. coefficient of variation (CV) = standard deviation / mean
Correlation does not imply causation: it can be coincidence. It’s an
= (s. dev / mean) assumption in forecast.
3. interquartiles:
• add f_i to the table residual: difference between an observation and the
• add F_i to the table corresponding value estimated by the regression line.
• q0.25
4. q0.75 interquartiles range = q0.75 – q0.25 coefficient of determination: (0- 1)
5. Raw data all in the same column (it’s how excel can read the R^2 = SSR/ SST = 1 – SSE/SST
information)
- the larger the value of R^2 the better the fit of the line to the
6. Select the data à insert à chart (box and whisker)
data
=RSQ(y,x)
Gini index à CONCENTRATION (cumulative)
à how to interpret the coefficient of determination?
G = 1 – (sum qi / sum pi)
R^2= 0.92 (92% of the variation in Y is explained by the model &
• F_i= previous (F_i) + current (f_i) à = pi the remaining are explained by the residuals)
• c_i * n_i (add sum/ total in the last cell)
• % attribute = c_i*n_i/ total c_i*n_i In a linear regression, the coefficient of determination equals the
• qi= previous (qi) + current (%attribute) square of the correlation coefficient: R^2 = rxy^2
=1 – sum(qi (except last one: nº1)) / sum(pi (except last one: nº1))
R- correlation coefficient: analizar como as variáveis estão ligadas uma 2. the multiplicative model
com a outra ---- mais próximo de 1/-1 melhor Yt = Tt * St * Et
- is appropriate if the magnitude of the seasonal
R^2- determination coefficient: analizar quão bem a linha da li near
fluctuations increase or decrease proportionally to the
regression aproxima os valores de x e y
level of the series
NONLINEAR REGRESSIONS: Þ ESTIMATING THE SEASONALITY:
It’s necessary to de-trend the series: is made after estimation of
X and Y have an exponential relationship
Y= b0 * b1^x the trend, by the least squares method or by the MAp.
ln y = (ln b0) + (x.ln b1) 1. Additive model
y* = b0* + xb1* where (y* = ln y) Yt = Tt + St + Et à de-trend: Yt - Tt
y ln(X) à exp. Y is the independent variable 2. Multiplicative model
x ln(Y) à exp. X is the independent variable Yt = Tt * St * Et à de-trend: Yt/Tt
X and Y have a power relationship Seasonal indexes (SI): the average of the de-trended observations
y = b0 * x ^b1 corresponding to the same seasonal period. (average of S of each
ln y = (ln b0) + (b1. lnx) period)
y* = b0* + xb1* where (y* = ln y) and (x* = ln x) Additive model:
ln(X)*ln(Y) 𝑆−𝑎: 𝑆𝐼−𝑥̅(SI) , so that 𝑥̅ (SI – a) = 0
Multiplicative model:
power regression: SI / 𝑥̅(SI) , so that 𝑥̅ (SI – a) = 1
-click on one of the points ; add chart element; trendline; more trending
lines; power; add both displays
Þ ESTIMATING THE RANDOM COMPONENT
After estimating the trend and the seasonal component;
Exponential regression:
Is useful for validating the model
-click on one of the points; add chart element; trendline; exponential;
1. Additive model : Et = Yt – Tt – St
add both displays
2. Multiplicative model: Et = Yt / (Tt * St)
Which one fits better? Compare de R^2 of both graphs, the one with
Þ ESTIMATING THE TREND: (2 options)
the higher R^2 is the best option for the data set.
1. least squares method:
Note: R^2 à coefficient of determination
Tt= b0 + b1t
b1= sty/st^2 = y – b1t
2. moving averages method:
flexible method to determine the trend: allows to soften or even
scatter plot with the regression line: eliminate fluctuations in the series by a process of successive
-select the data (xi and yi)
averaging.
-Insert – scatter
-add regression line: click on the points; add chart element; trendline; linear Þ MAp averaging p observations at a time where
Note: double click on the line; menu; display equation on chart (eq. p = 2m + 1 and you lose 2m observations.
regression line) & display R-squared value on chart (R^2: coefficient of
determination) The higher the order of moving averages, the greater the
“smoothing” effect obtained, but also the greater the number of
“lost” observations.
TIME SERIES: set of measurements, ordered over time, on a particular The choice of order averages must be made according to seasonal
quantity of interest movements:
- are usually graphically represented by a timeline;
- the main goal of studying time series is to forecast;
Addictive model
Trend component (Tt) ORDINARY LEAST SQUARES METHOD- OLS: (2 options)
Represents the general long-term movement of the series 1- slope + intercept
Can be: increasing, decreasing, constant find b1 & b0
2- timeline + trendline
Seasonality component (St) à si adjusted timeline: select both data, insert, graphic, 4 option, click on one point, add chart
element, trendline, linear, add equation display
Represents the influence of seasonal factors on the behavior of
the series; it reflects the rhythm imposed by the seasons.
Trent: Tt = slope(b1) * t + intercept(b0)
De-trent: Yt – Tt (add this column)
Cyclical component (Ct)
Includes mid-term variations in the series, but with imprecise (not
Seasonal index (SI)
fixed) timing; is contained in the trend.
3 observations per year:
= average( Yt-Tt ). 1st: 1,4,7. 2nd: 2,5,8. 3rd: 3,6,9
Random/Residual component (Et)
=average(1st, 2nd,3rd) àaverage SI
Includes variations of low intensity and short duration of random
If its negative it’s bellowed the annual average and if it’s positive it’s
nature, which are not possible to explain, and are not predictable;
above the annual average (this case annual)
this is the component that makes the model not exact.
SI adjusted: =(SI - averageSI)
DECOMPOSE MODELS: PROBABILITYS
Average Si-adjusted = 0àADDITIVE MODEL
1. the additive model Random experiment: can have 2 or more outcomes, in which is not possible to
Yt = Tt + St + Et know which outcome is going to occur
Forecast
- is appropriate if the magnitude of seasonal fluctuations Goal = 700
does not vary with the level of the series. 25% above = 875 = 700 * 1,25
forecast= (annual Tt + 3rd SI) à if its bellow the 25% it will not be achieved.
Multiplicative model
MOVING AVERAGE METHOD- MAP
MA3:
Average all the groups of 3 consecutive observations
We are going to lose 2 observations
- 1,2,3 // 2,3,4 // 3,4,5// 4,5,6// 5,6,7 // 6,7,8 // 7,8,9
-
MA4:
-average of the 4 first: start in the 2 line and end 1 line before the last
one. Total probability theorem:
- average 2 and 2: start in the 3 line and end 2 lines before the last one. P(B) = P(A and B) + P(B\A)
P(A\B) =0,2 ------> P(Ã\B) =0,8
P(B) = P(B and A) + P(B and Ã)
De- trent: Yt / Tt (add this column)
----------------------------------------------------------------------------------------------------------
Seasonal index (SI):
3 observations per year: random variable (r.v.): a function that associates a real number 𝑥
= average( Yt/Tt). 1st: 1,4,7. 2nd: 2,5,8 3rd: 3,6,9 to each result of the event space Ω.
=average(1 , 2 ,3 ) à average SI
st nd rd
CLASSIFICATION OF RANDOM VARIABLES:
SI adjusted: =(SI/averageSI) -discrete: if it can assume values inside a finite or countable
Average Si-adjusted = 1àMULTIPLICATIVE MODEL infinite set.
-continuous: if it takes values in a non countable infinite set
Forecast
The annual number is going to be > than 3000? expected value or mean of a discrete random variable 𝑋 is an
- Add month 10, 11, 12 (t)
indicator of the center of the random variable. We can define it as
- Slope(b1)
- Intercept(b0)
the center of mass of the variable.
- Trent for 10, 11, 12 E(x) = ∑ [x * f(x)]
- Forecast: Tt 10 * 1stSI
- Final= sum (forecast10; forecast11; forecast12)
MENOR = -----------------cumulative(true)
---------------------------------------------------------------------------------------------------------- MAIOR --------------------- 1 – (MENOR =)---------- 1 - comulative
MAIOR = ------------------1 – ( MENOR = ¯)
= --------------------------false
BERNOULI --------- X Ç (p)
- one single random experiment
p= probability of success
q= probability of failure
q=1-p
E(x) = p
Var (x) = p(1-p)
F(x)= p^x * (1-p)^1-x
BINOMIAL -------- X Ç bin(n, p)
number of success in n experiments (generalization of Bernoulli)
• E(x) = n *p
• Var (x)= n* p (1-p)
• F(x) = nCx p^x * (1-p)^n-x
=binom.dist(x;n;p;true/false)
X Ç bin(50; 0,4) (P<= 2) ------------------- binom.dist(2; 50; 0,4; true)
NEGATIVE BINOMIAL -------- X Ç BN(p, r)
count of the number of independent Bernoulli experiments
required to find r success (generalization of Bernoulli)
• E(x) =r/p
• Var (x) = r/ p^2
• F(x) = (x-1) C (n-1)* p^r* (1-p)^x-r
=negbinom.dist(x-r; r; p ; true/false)
GEOMETRIC DISTRIBUTION-------- X Ç Geo(p)
particular case when r= 1 in negative binomial distribution
• E(x) =1/p
• Var (x) = 1-p/ p^2
• F(x) = p (1-p)^x-1
=negbinom.dist(x-1; 1; p ; true/false)
POISSON DISTRIBUTION-------- X Ç Po(l)
Number of occurances of a given event per unit of Às vezes o b0 não é quando o x é zero mas sim o anterior (exemplo data
time/area/volume de jan 2015- dezembro 2020: b0 corresponde a dezembro 2014)
• E(x) =Var (x) = l Se SI for 1,54 quer dizer que nesse período/mês este 54% superior ao
• F(x) = (e^-l * l^x) / x! normal/anual
=poisson.dist(x; mean ; true/false)
Values da RANDOM COMPONENT (Et) of a time series with the
LAW OF RARE EVENTS: X Ç bin(n, p) à X segue Po(l=np) multiplicative model are randomly distributed around the number 1
(podem ser negativos)
A TREND não é efetivamente a função: atenção escolhas múltiplas, tem de
Probability density function: (fx ³ 0) dizer average, predicted.
-
P(a£ X £b) = ∫. 𝑓𝑥
CORRELATION COEFFICIENT (CORREL): negative or positive à same sign as
the covariance; if its close to 1 or -1 it’s strong.
Cumulative distribution function: P(X£ x)
-
=∫/ 𝑓𝑥 k= min value of x in fx SEASONAL FLUCTUATIONS don’t change over time à ADDICTIVE MODEL
• E(x) = µ Para FORECAST: trend value * or + SI adjusted
• Var(x) =s^2 = standard deviation ^2 = E(x^2) - µ^2
R^2= 0.92 (92% of the variation in Y is explained by the model & the
remaining are explained by the residuals)
EXPONENTIAL DISTRIBUTION-------- X Ç Exp(l) ST à SI adjusted
• E(x) = µ = 1/l
• Var (x) =s^2 = 1/l^2 DISCRETE RANDOM VARIABLE: somatório de f(X) = 1
• f(x) = le^-xl
CONCENTRATION MEASURES: only if they are cumulative in nature.
• F(x) = 1 – e^-xl
=exp.dist(x; l ; true/false) GRAFICOS: continuos é comulativo
à não pode descer (declive nunca é negativo)
à nunca tem valores > 1
NORMAL DISTRIBUTION-------- X Ç N(µ, s^2)
P(a£ X £b) = P (X £b) – P(£ a) MOVING AVERAGE TIME SERIES:
- YEAR: 12 point MA
P(X < -a) = 1- P(X < a)
This method is used to estimate the trend by successively averaging
=exp.dist(x; mean; standard deviation ; true/false) groups of the same size of consecutive observations.
With this method, the fluctuations of the series are smoothed.
Averages correspond to a period of 1 full year thus eliminating seasonal
-------------------------------------------------------- fluctuations. The graph will be flatter than the original because the
seasonal component has been eliminated.
--------------------------------------------------------
In a time series the SEASONAL COMPONENT may not exist, depending on
the variable and on the timeframe. When it does not exist trend should be
estimated using the LEAST SQUARES METHOD or the MOVING AVERAGES
NOTES method. The CYCLICAL COMPONENT refers to fluctuations without a fixed
time interval, occurring in long periods. The RANDOM COMPONENT is
STURGES’ RULE: if the grouping is not the same as the proposed in the used to validate the decomposition model. In the case of the additive
exercise doesn’t mean it’s incorrect. Sturges’ rule is merely indicative. model its value should be randomly distributed around 0.
RAW vs GROUPED DATA: if the data is discrete (just 1 number) the mean is
the same. If it’s not discrete its normal to be different since that when we Chart represents the RANDOM COMPONENT
group the data we lose information. à values around 0: additive model
à values around 1: multiplicative model
If all the data increase/decrease % the median will change but remain in
the same central position (median * 1,increase). The mean and standard
deviation are going to change in the same scale the central deviation (CV) RANDOM COMPONENT of a time series should show a random behavior,
is going to be the same. different from one decomposition model to another.
CHANGES OF ORIGIN:
- mean YES NORMAL DISTRIBUTION:
- variance NO - the sum of 2 normal random variables is necessary a normal
- correlation coefficient NO random variable
- coefficient variation YES - the kurtosis coefficient is negative
CHANGES OF SCALE:
- mean YES
- variance YES
- correlation coefficient NO
- coefficient variation NO
VARIANCE: the difference between the center value and the ones next to
it.
More GINI INDEX à more concentration
The value of the TREND can be very different than the real one because
the prediction is made based on the assumptions that the conditions that
affect the series in the past are the same that will affect in the future.