BIG DATA ANALYSIS Flashcards

1
Q

Descriptive analysis

A

What has happened?
- Access and manipulate past data
- Inform decision-making

2
Q

Predictive analytics

A

What could happen in the future?
- Use historical data to make future decisions
- Estimate values of variables

3
Q

Prescriptive analytics

A

What should we do?
- Use optimization and simulation to provide advice
- Explore several possible actions and suggest a course of action.
- Build models

4
Q

5 Vs of data

A

Volume, Velocity, Variety, Veracity, Value

5
Q

3 statistical data types:

A

Cross-sectional data, time-series data, panel data

6
Q

Cross-sectional data

A

Data on multiple entities of a given type, observed at a single period of time

7
Q

Time-series data

A

Data on a single entity observed over multiple periods of time

8
Q

Panel data

A

Data on multiple entities observed over multiple periods of time
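
For illustration (not from the course), a minimal pandas sketch with hypothetical firm/year/revenue columns: the full table is panel data, one year of it is a cross-section, one firm of it is a time series.

```python
import pandas as pd

# Hypothetical panel data: multiple entities (firms) over multiple years.
panel = pd.DataFrame({
    "firm":    ["A", "A", "B", "B"],
    "year":    [2022, 2023, 2022, 2023],
    "revenue": [10.0, 12.5, 7.3, 8.1],
})

cross_section = panel[panel["year"] == 2022]  # several entities, one period
time_series   = panel[panel["firm"] == "A"]   # one entity, several periods
print(panel, cross_section, time_series, sep="\n\n")
```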

9
Q

3 types of variables

A

Numerical, categorical, dummy

10
Q

numerical variables

A

Data that represent quantities or measurements.
Example: age

11
Q

categorical variables

A

Data that represent distinct categories or groups; each category is assigned a numeric label.

12
Q

dummy variables

A

Data that represent categorical data as a set of binary indicators (0 and 1), like the indicator function in mathematics.
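
A minimal sketch (not from the course) of creating dummy variables with pandas; the column values are made up:

```python
import pandas as pd

# Hypothetical categorical variable with three categories.
color = pd.Series(["red", "blue", "red", "green"], name="color")

# One 0/1 indicator (dummy) column per category.
dummies = pd.get_dummies(color).astype(int)
print(dummies)
```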

13
Q

The Linear Regression Model: Definition

A

A modeling method that postulates a relationship between a dependent variable (y) and one or more independent variables (x1, x2, …, xk).

y depends on x1, x2, …, xk.

14
Q

simple linear regression model

A

x1 is the only independent variable, y the dependent variable:
y = β0 + β1 x1 + ϵ

15
Q

Beta 0

A

The intercept (the value at the origin): the value of y when x = 0.

16
Q

Beta 1

A

The unknown slope coefficient

17
Q

β0 + β1 𝑥1

A

deterministic component of the linear model

18
Q

Positive
negative
no linear relationship

A

Positive: upward slope, β1 > 0. Negative: downward slope, β1 < 0.
No linear relationship: flat (horizontal) line, β1 = 0.

19
Q

For a multiple linear regression model
what is ∀𝑖 ∈ [1; 𝑘] β𝑖

A

The unknown population parameter (coefficient) associated with the variable xi

19
Q

The multiple linear regression model:

A

Given y the dependent variable and { xi | i ∈ {1, 2, 3, …, k} } the independent variables:
y = β0 + ∑(i=1 to k) βi xi + ϵ = β0 + β1x1 + β2x2 + ⋯ + βkxk + ϵ

20
Q

residual error

A

e = y − ŷ

21
Q

Ordinary least squares (OLS) formula:

A

SSE = ∑ ei²

22
Q

def OLS

A

Method that finds the coefficient values minimizing the sum of squared errors between the observed data points and the values predicted by the linear model.

23
Q

Given N observations Y1, Y2, …, YN of the dependent variable, with
{ Y1 = β0 + β1 X1 + ϵ1
  Y2 = β0 + β1 X2 + ϵ2
  ⋮
  YN = β0 + β1 XN + ϵN

How is Y written in matrix form?

A

Y = (Y1, Y2, …, YN)ᵀ, the column vector of the observations, and the system is equivalent to Y = Xβ + ϵ

24
Q

SSE = ?

A

SSE = ‖Y − Xβ‖²

25
Q

The Linear Regression Model - Derivation

A

β̂ = (XᵀX)⁻¹ XᵀY
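
A minimal NumPy sketch (simulated data, invented coefficients) checking that β̂ = (XᵀX)⁻¹XᵀY recovers the intercept and slopes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)  # true beta = (1, 2, -3)

X = np.column_stack([np.ones(n), x1, x2])    # design matrix with intercept column
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y  # OLS estimator (X'X)^-1 X'Y
print(beta_hat)                              # close to [1, 2, -3]
```

In practice np.linalg.lstsq (or a linear solver) is numerically safer than forming the explicit inverse.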

26
Q

Definition: The variance of the estimator 𝛃:

A

Measures the uncertainty of the estimated parameter.
The smaller the variance, the more confident we are in the estimated parameter.

27
Q

Definition: The covariance:

A

Measures how two variables vary together.
A positive covariance indicates that the two variables are positively related: they tend to increase together, and vice versa.

28
Q

Definition: The σ̂²:

A

Represents the residual variance (the variance of the errors).
Measures the dispersion of the observations around the regression model. (We do not know the exact σ² because ϵ is unobserved, so it must be estimated.)

29
Q

𝑉(β̂)

A

V(β̂) = σ² (XᵀX)⁻¹

30
Q

𝑉(β̂) Matrix:

A

V(β̂) = [ Var(β̂0)       Cov(β̂0, β̂1)
         Cov(β̂1, β̂0)   Var(β̂1) ]

31
Q

𝑉𝑎𝑟(β0̂)

A

Var(β̂0) = σ² ∑ xi² / ( n ∑ (xi − x̄)² )

32
Q

𝑉𝑎𝑟(𝛽1̂)

A

Var(β̂1) = σ² / ∑ (xi − x̄)²

33
Q

σ̂² formula

A

σ̂² = ( 1 / (n − k − 1) ) ∑ ei² = SSE / (n − k − 1)
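
A sketch on simulated data (all numbers invented) combining this estimate with V(β̂) = σ̂²(XᵀX)⁻¹ from the earlier card:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 1                                 # k = number of slope coefficients
x = rng.normal(size=n)
y = 0.5 + 1.5 * x + rng.normal(scale=2.0, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - k - 1)      # sigma^2 hat = SSE / (n - k - 1)
V = sigma2_hat * np.linalg.inv(X.T @ X)       # variance-covariance matrix of the estimators
print(sigma2_hat)
print(V)                                      # diagonal: Var(beta0_hat), Var(beta1_hat)
```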

34
Q

se

A

standard error of the estimate

35
Q

R²

A

coefficient of determination

36
Q

Difference between R² and adjusted R²

A

- R² (coefficient of determination) measures the proportion of the variance of the dependent variable explained by the model, but it always increases when explanatory variables are added, even useless ones.
- Adjusted R² corrects this limitation by penalizing the addition of irrelevant variables, taking into account the number of predictors and the sample size (it can decrease if an added variable is unrelated to the others).
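
A small simulated sketch (invented data) of this contrast: adding a pure-noise regressor cannot lower R², but adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1) typically drops:

```python
import numpy as np

def r2_and_adjusted(X, y):
    n, p = X.shape                              # p = intercept + k slopes, so n - p = n - k - 1
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return r2, 1 - (1 - r2) * (n - 1) / (n - p)

rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x])           # relevant regressor only
X2 = np.column_stack([X1, rng.normal(size=n)])  # plus a useless noise regressor
print(r2_and_adjusted(X1, y))
print(r2_and_adjusted(X2, y))                   # R² slightly up, adjusted R² usually down
```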

37
Q

Definition: The sample variance se²

A

Measures the average squared deviation between the observed and predicted values.

38
Q

standard error of the estimate se

A

The standard deviation of the errors of the estimation.

39
Q

se formula

A

se = √( SSE / (n − k − 1) )

40
Q

R² definition

A

quantifies the sample variation in the dependent variable that is explained by the sample regression equation.

It is the ratio of the explained variation of the dependent variable to its total variation.

It ranges between 0 and 1.

41
Q

SST: definition and formula

A

SST = ∑ (yi − ȳ)²  (total sum of squares)

42
Q

SST

A

Total sum of squares: the total variation of the dependent variable, with SST = SSR + SSE.

43
Q

SSR: definition and formula

A

SSR = ∑ (ŷi − ȳ)²  (regression / explained sum of squares)

44
Q

R^2

A

R² = SSR/SST = 1 − SSE/SST
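
A worked numeric check on simulated data that SST = SSR + SSE and that both expressions for R² agree:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=80)
y = 3 + 0.8 * x + rng.normal(size=80)

X = np.column_stack([np.ones_like(x), x])
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression (explained) sum of squares
sse = np.sum((y - y_hat) ** 2)         # residual sum of squares

print(np.isclose(sst, ssr + sse))      # True: SST = SSR + SSE
print(ssr / sst, 1 - sse / sst)        # both equal R²
```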

45
Q

Definition: Adjusted R^2:

A

  • We cannot use R² for model comparison when the competing models do not include the same number of independent variables (although the dependent variable is the same).
  • R² never decreases as we add more variables; adjusted R² accounts for the sample size n and the number of independent variables k.
  • It imposes a penalty for any additional independent variable.
  • The higher the adjusted R², the better the model. When comparing models with the same dependent variable, the model with the higher adjusted R² is preferred.
  • Note that as n increases, adjusted R² gets closer to R²,
    i.e. lim (n→+∞) adjusted R² = R².
46
Q

R^2 adjusted=

A

Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − k − 1)

47
Q

Test of joint significance

A

Assesses whether several coefficients (or all of them) are jointly significant.
Generally uses an F test of the hypothesis H0: β1 = β2 = ⋯ = βk = 0.
Answers the question: does the set of explanatory variables significantly improve the model?

48
Q

F test formula

A

F = (SSR/k) / ( SSE/(n − k − 1) )
= MSR / MSE
= (R²/k) / ( (1 − R²)/(n − k − 1) )
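
A sketch of the F test with scipy; the R², k and n below are made-up numbers, not from the course:

```python
from scipy import stats

r2, k, n = 0.40, 3, 60                   # hypothetical R², number of regressors, sample size
F = (r2 / k) / ((1 - r2) / (n - k - 1))  # = MSR / MSE
p_value = stats.f.sf(F, k, n - k - 1)    # upper-tail probability of the F(k, n-k-1) distribution
print(F, p_value)                        # small p-value => reject H0: beta1 = ... = betak = 0
```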

49
Q

p-value for an F test

A

If the p-value is less than the significance level α, reject the null hypothesis.

50
Q

individual test of significance

A

-> Assesses whether a single regression coefficient (βi) is significant.
-> Answers the question: does this variable have a significant effect on the dependent variable?

51
Q

H0 and H1 for an individual test of significance (two-tailed test)

A

H0: βi = 0
H1: βi ≠ 0

52
Q

test statistic for a test of individual significance

A

t = (bj − βj0) / se(bj)

53
Q

p value for a test of individual significance

A

If the p-value is less than α, reject the null hypothesis.
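
A sketch of the individual t test with scipy; the coefficient, standard error and sample size are made-up numbers:

```python
from scipy import stats

b_j, se_bj, n, k = 0.82, 0.31, 60, 3              # hypothetical estimate, s.e., observations, regressors
t_stat = (b_j - 0) / se_bj                        # test statistic under H0: beta_j = 0
p_value = 2 * stats.t.sf(abs(t_stat), n - k - 1)  # two-tailed p-value
print(t_stat, p_value)                            # p-value < alpha => beta_j is significant
```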

54
Q

CAPM equation

A

𝑅 − 𝑅𝑓 = α + β(𝑅𝑀− 𝑅𝑓) + ϵ

  • R : The rate of return on a stock or portfolio
  • RM : market return
  • Rf : risk-free interest rate
  • β value: measures how sensitive the stock’s return is to changes in the market.
    𝛽 =1 : change in the market = same change in the stock.
    𝛽 >1 : stock is more aggressive or riskier than the market.
    𝛽 <1 : stock is conservative or less risky than the market
  • α value: predicted to be zero, so nonzero values indicate abnormal returns; α is called the stock’s alpha.
    𝛼 > 0 : Positive abnormal returns
    𝛼 < 0 : Negative abnormal returns
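
A minimal NumPy sketch of estimating α and β from excess returns; the return series are simulated, not real market data:

```python
import numpy as np

rng = np.random.default_rng(4)
rf = 0.001                                    # hypothetical per-period risk-free rate
rm = rf + rng.normal(0.005, 0.04, size=250)   # simulated market returns
r = rf + 1.3 * (rm - rf) + rng.normal(0.0, 0.02, size=250)  # stock with true beta about 1.3

# Regress the stock's excess return on the market's excess return.
beta, alpha = np.polyfit(rm - rf, r - rf, 1)  # slope first, then intercept
print(alpha, beta)                            # alpha near 0, beta near 1.3
```
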
55
Q

t value

A

t = coefficient / standard error of the coefficient

56
Q

MSR (mean square regression)

A

MSR = SSR / k

57
Q

MSE (mean square error)

A

MSE = SSE / (n − k − 1)

58
Q

MST (mean square total)

A

MST = SST / (n − 1)

59
Q

3 assumptions of OLS regression

A

Linearity between the independent variables and the dependent variable (the relationship between X and Y must be linear).

Homoscedasticity: the variance of the residuals must be constant at all levels of X (the magnitude of the errors, i.e. the size of the residuals, is roughly the same for all X).

No perfect multicollinearity: the explanatory variables must not be perfectly correlated.

59
Q

Steps to determine whether a variable is significant

A

1) Compute the t stat (β̂1 / se(β̂1)).

2) Look in the t table at the intersection of the degrees-of-freedom row (n − k − 1) and the significance-level column (usually 5%, hence 0.025 per tail, since the test is two-tailed).

3) If |t stat| > the critical value from the table, reject H0: the variable is significant.
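
The same three steps with scipy's t distribution; the numbers are illustrative only:

```python
from scipy import stats

b1, se_b1, n, k = 2.1, 0.9, 40, 2              # hypothetical estimate, s.e., sample size, regressors
t_stat = b1 / se_b1                            # step 1: t statistic
t_crit = stats.t.ppf(1 - 0.05 / 2, n - k - 1)  # step 2: two-tailed 5% critical value (df = n - k - 1)
print(abs(t_stat) > t_crit)                    # step 3: True => reject H0, the variable is significant
```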

60
Q

In an ANOVA table
F stat
p-value
t-stat

A

* F stat: determines whether the independent variables collectively explain the variation in the dependent variable.
* p-value: indicates the individual significance of the predictors.
* t-stat: indicates the significance of one particular coefficient.