BIG DATA ANALYSIS Flashcards
Descriptive analysis
What has happened?
- Access and manipulate past data
- Inform decision-making
Predictive analytics
What could happen in the future?
- Use historical data to make future decisions
- Estimation of variables
Prescriptive analytics
What should we do?
- Optimization and simulation to provide advice
- Explore several possible actions and suggest a course of action.
- Build models
5 Vs of data
Volume, Velocity, Variety, Veracity, Value
3 statistical data types:
Cross-sectional data, time-series data, panel data
Cross-sectional data
Data on multiple entities of a given type at a single point in time
Time-series data
Data on a single entity over multiple periods of time
Panel data
Data on multiple entities over multiple periods of time
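A small pandas sketch contrasting the three layouts; the firms, years, and sales figures are made-up illustrations:

```python
import pandas as pd

# Cross-sectional: many entities, one period
cross = pd.DataFrame({"firm": ["A", "B", "C"], "sales": [10, 12, 9]})

# Time series: one entity, many periods
ts = pd.DataFrame({"year": [2020, 2021, 2022], "sales": [10, 11, 13]})

# Panel: many entities over many periods (entity x period index)
panel = pd.DataFrame({
    "firm": ["A", "A", "B", "B"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 11, 12, 12],
}).set_index(["firm", "year"])
print(panel)
```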
3 types of variables
Numerical, categorical, dummy
Numerical variables
Data that represent quantities or measurements.
Ex: age
Categorical variables
Data that represent distinct categories or groups. A number may be assigned to each category.
Dummy variables
Data that represent categorical data as a set of binary values (0 and 1), i.e. the indicator function in mathematics.
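A minimal Python sketch of this encoding, assuming pandas is available; the color column is a made-up example:

```python
import pandas as pd

# Hypothetical categorical data: a single column with three categories
df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# Each category becomes a 0/1 indicator column (the indicator
# function applied category by category)
dummies = pd.get_dummies(df["color"], dtype=int)
print(dummies)
#    blue  green  red
# 0     0      0    1
# 1     1      0    0
# 2     0      0    1
# 3     0      1    0
```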
The Linear Regression Model: Definition
A modeling method that postulates a relationship between a dependent variable (𝒚) and one or more independent variables (𝑥1, 𝑥2, … , 𝑥𝑘):
y depends on 𝑥1, 𝑥2, … , 𝑥𝑘.
simple linear regression model
x1 is the only independent variable; y is the dependent variable.
𝑦 = β0 + β1 𝑥1 + ϵ
Beta 0
The intercept: the value of y when x = 0.
Beta 1
The unknown slope coefficient.
β0 + β1 𝑥1
deterministic component of the linear model
Positive / negative / no linear relationship
Positive: upward slope, β1 > 0.
Negative: downward slope, β1 < 0.
No linear relationship: flat line, β1 = 0.
For a multiple linear regression model
what is ∀𝑖 ∈ [1; 𝑘] β𝑖
The unknown population parameter associated with variable xi.
The multiple linear regression model:
Given 𝑦 the dependent variable and { x𝑖 | 𝑖 ∈ {1, 2, …, k} } the independent variables:
𝑦 = β0 + ∑ᵢ₌₁ᵏ β𝑖𝑥𝑖 + ϵ = β0 + β1x1 + β2x2 + ⋯ + βkxk + ϵ
residual error
e = y − ŷ
Ordinary least squares (OLS) formula:
SSE = ∑ eᵢ²
Definition: OLS
A method that finds the parameter estimates minimizing the sum of squared errors between the observed data points and the values predicted by the linear model.
Given 𝑌1, 𝑌2, … , 𝑌𝑁 the N observations of the dependent variable, in matrix form:
𝑌1 = β0 + β1𝑋1 + ϵ1
𝑌2 = β0 + β1𝑋2 + ϵ2
⋮
𝑌𝑁 = β0 + β1𝑋𝑁 + ϵ𝑁
What is the value of Y?
Y = (𝑌1, 𝑌2, … , 𝑌𝑁)ᵀ, the column vector of observations, so the system is equivalent to Y = Xβ + ϵ
SSE = ?
SSE = ‖Y − Xβ‖²
The Linear Regression Model - Derivation
β̂ = (XᵀX)⁻¹ XᵀY
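A short numpy sketch of this closed-form solution on simulated data; the true coefficients β0 = 2 and β1 = 3 are arbitrary choices for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1, n)   # true model: beta0 = 2, beta1 = 3

# Design matrix X with a column of ones for the intercept
X = np.column_stack([np.ones(n), x])

# Normal equations: beta_hat = (X'X)^(-1) X'Y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)                        # close to [2, 3]
```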
Definition: The variance of the estimator β̂:
Measures the uncertainty of the estimated parameter.
The smaller the variance, the more confident we are in the estimated parameter.
Definition: The covariance of two estimators:
Measures how the two variables move together.
A positive covariance indicates that the two variables are positively correlated (they increase together), and vice versa.
Definition: The estimator σ̂²:
Represents the residual variance (the variance of the errors).
Measures the dispersion of the observations around the regression model. (We do not know the exact σ² because ϵ is unobserved.)
𝑉(β̂)
σ² (XᵀX)⁻¹
𝑉(β̂) Matrix:
( Var(β̂0)       Cov(β̂0, β̂1)
  Cov(β̂1, β̂0)   Var(β̂1) )
𝑉𝑎𝑟(β0̂)
σ² · ∑ xᵢ² / ( n · ∑(xᵢ − x̄)² )
𝑉𝑎𝑟(𝛽1̂)
σ² / ∑(xᵢ − x̄)²
σ̂²
σ̂² = (1 / (n − k − 1)) · ∑ eᵢ²
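A numpy sketch tying these formulas together on the simulated simple regression from above (k = 1): σ̂² = SSE / (n − k − 1) and V(β̂) = σ̂²(XᵀX)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 1
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Residuals and estimated error variance: sigma2_hat = SSE / (n - k - 1)
e = y - X @ beta_hat
sigma2_hat = (e @ e) / (n - k - 1)

# V(beta_hat) = sigma2_hat * (X'X)^(-1); the diagonal holds
# Var(beta0_hat) and Var(beta1_hat)
V = sigma2_hat * np.linalg.inv(X.T @ X)
print(np.sqrt(np.diag(V)))   # standard errors se(beta0_hat), se(beta1_hat)
```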
se
standard error of the estimate
R^2
coefficient of determination
Difference between R^2 and adjusted R^2
- R^2 (the coefficient of determination) measures the proportion of the variance of the dependent variable explained by the model, but it always increases when explanatory variables are added, even useless ones.
- Adjusted R^2 corrects this limitation by penalizing the addition of irrelevant variables, taking into account the number of predictors and the sample size (it can decrease if an added variable brings nothing to the model).
Definition: The sample variance se^2
Measures the average squared deviation between the observed and predicted values.
standard error of the estimate se
The standard deviation of the estimation error.
Formula for se
se = √( SSE / (n − k − 1) )
Definition: R^2
Quantifies the sample variation in the dependent variable that is explained by the sample regression equation.
The ratio of the explained variation of the dependent variable to its total variation.
Between 0 and 1.
SST formula
SST = ∑(yᵢ − ȳ)²
SST
SSR + SSE
SSR formula
SSR = ∑(ŷᵢ − ȳ)²
R^2
SSR/SST = 1 − (SSE/SST)
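A numpy sketch of this decomposition on simulated data, checking that SST = SSR + SSE and that both expressions for R^2 agree:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 2, n)
X = np.column_stack([np.ones(n), x])
y_hat = X @ (np.linalg.inv(X.T @ X) @ X.T @ y)  # fitted values

sst = np.sum((y - y.mean()) ** 2)      # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation

print(np.isclose(sst, ssr + sse))      # True: SST = SSR + SSE
print(ssr / sst, 1 - sse / sst)        # both equal R^2
```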
Definition: Adjusted R^2:
- We cannot use R^2 for model comparison when the competing models do not include the same number of independent variables (even though the dependent variable is the same).
- R^2 never decreases as we add more variables; adjusted R^2 accounts for the sample size n and the number of independent variables k.
- It imposes a penalty for any additional independent variable.
- The higher the adjusted R^2, the better the model. When comparing models with the same dependent variable, the model with the higher adjusted R^2 is preferred.
- Note that as n increases, adjusted R^2 gets closer to R^2, i.e. lim n→+∞ adjusted R^2 = R^2.
Adjusted R^2 =
1 − (1 − R^2) · (n − 1) / (n − k − 1)
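A small Python sketch of the formula; the numbers are hypothetical and chosen to show that a tiny R^2 gain from an extra variable can still lower the adjusted R^2:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Adding a near-useless variable (k: 2 -> 3) barely raises R^2
# but lowers adjusted R^2:
print(adjusted_r2(0.800, n=50, k=2))   # ~0.7915
print(adjusted_r2(0.802, n=50, k=3))   # ~0.7891, lower despite higher R^2
```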
Test of joint significance
Assesses whether several coefficients (or all of them) are jointly significant.
Generally uses an F test of the hypothesis H0: β1 = β2 = ⋯ = βk = 0.
Answers the question: does the set of explanatory variables significantly improve the model?
F test formula
F = (SSR / k) / (SSE / (n − k − 1)) = MSR / MSE = (R^2 / k) / ((1 − R^2) / (n − k − 1))
p-value for an F test
If the p-value is less than the significance level α, reject the null hypothesis.
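A scipy sketch of the joint test; the SSR, SSE, n, and k values are hypothetical summary numbers used only to illustrate the computation:

```python
from scipy import stats

# Hypothetical regression summary numbers
n, k = 50, 3
ssr, sse = 90.0, 120.0

# F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))
f_stat = (ssr / k) / (sse / (n - k - 1))

# p-value from the F distribution with (k, n - k - 1) degrees of freedom
p_value = stats.f.sf(f_stat, k, n - k - 1)
print(f_stat, p_value)   # reject H0 if p_value < alpha (e.g. 0.05)
```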
individual test of significance
-> Assesses whether a single regression coefficient (βi) is significant.
-> Answers the question: does this variable have a significant effect on the dependent variable?
H0 and H1 for an individual test of significance (two-tailed test)
H0: βi = 0
H1: βi ≠ 0
test statistic for a test of individual significance
t = (bj − βj0) / se(bj)
p value for a test of individual significance
If the p-value is less than α, reject the null hypothesis.
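A statsmodels sketch of the individual tests on simulated data, where x2 is deliberately unrelated to y so its coefficient should come out insignificant:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)              # unrelated to y by construction
y = 1 + 2 * x1 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

# t statistic (b_j - 0) / se(b_j) and two-tailed p-value per coefficient
print(res.tvalues)   # large |t| for const and x1, small for x2
print(res.pvalues)   # p < alpha -> reject H0: beta_j = 0
```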
CAPM equation
𝑅 − 𝑅𝑓 = α + β(𝑅𝑀− 𝑅𝑓) + ϵ
- R : The rate of return on a stock or portfolio
- RM : market return
- Rf : risk-free interest rate
- β: measures how sensitive the stock's return is to changes in the market.
β = 1: a change in the market implies the same change in the stock.
β > 1: the stock is more aggressive, or riskier, than the market.
β < 1: the stock is conservative, or less risky, than the market.
- α: predicted to be zero, so nonzero values indicate abnormal returns.
Called the stock's alpha.
𝛼 > 0 : Positive abnormal returns
𝛼 < 0 : Negative abnormal returns
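A statsmodels sketch of estimating the CAPM by regressing excess stock returns on excess market returns; the return series are simulated (α = 0.002 and β = 1.3 are arbitrary), not real market data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
market_excess = rng.normal(0.01, 0.04, 120)   # R_M - R_f, 120 months
stock_excess = (0.002 + 1.3 * market_excess
                + rng.normal(0, 0.02, 120))   # R - R_f

res = sm.OLS(stock_excess, sm.add_constant(market_excess)).fit()
alpha, beta = res.params
print(alpha, beta)   # beta > 1: more aggressive than the market;
                     # test alpha = 0 to check for abnormal returns
```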
t value
The coefficient divided by the standard error of the coefficient.
MSR (mean square regression)
SSR / df, where df = k
MSE (mean square error)
SSE / df, where df = n − k − 1
MST (mean square total)
SST / df, where df = n − 1
3 assumptions of OLS regression
- Linearity between the independent variables and the dependent variable (the X/Y relationship must be linear).
- Homoscedasticity: the variance of the residuals must be constant at all levels of X (the magnitude of the errors, i.e. the size of the residuals, is roughly the same for all X).
- No perfect multicollinearity: the explanatory variables must not be perfectly correlated.
Steps to determine whether a variable is significant
1) Compute the t statistic (β̂1 / se(β̂1)).
2) Look up the t table at the intersection of the df row (n − k − 1) and the column for the chosen risk level (usually 5%, hence 0.025 because the test is two-tailed).
3) If |t stat| > the critical value from the table, reject H0: the variable is significant.
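A scipy sketch of the three steps, with hypothetical values for the slope estimate, its standard error, and the sample size:

```python
from scipy import stats

b1, se_b1 = 0.45, 0.18   # hypothetical estimate and standard error
n, k = 40, 2             # hypothetical sample size and number of regressors

t_stat = b1 / se_b1                  # step 1: t statistic
df = n - k - 1
t_crit = stats.t.ppf(1 - 0.025, df)  # step 2: critical value, 5% two-tailed
print(abs(t_stat) > t_crit)          # step 3: True -> reject H0, significant
```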
In an ANOVA table
F stat
p-value
t-stat
*F stat: determines whether the independent variables collectively explain the variation in the dependent variable.
*p-value: gives the individual significance of each predictor.
*t-stat: gives the significance of one particular coefficient.