BIG DATA ANALYSIS Flashcards
Descriptive analysis
What has happened?
- Access and manipulate past data
- Inform decision-making
Predictive analytics
What could happen in the future?
- Use historical data to make future decisions
- Estimation of variables
Prescriptive analytics
What should we do?
- Optimization and simulation to provide advice
- Explore several possible actions and suggest a course of action.
- Build models
5 Vs of data
Volume, Velocity, Variety, Veracity, Value
3 statistical data types:
Cross-sectional data, time-series data, panel data
Cross-sectional data
Data on multiple entities of a given type at a single point in time
Time-series data
Data on a single entity over multiple periods of time
Panel data
Data on multiple entities over multiple periods of time
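A small pandas sketch contrasting the three layouts; the firms, years, and sales figures are made-up illustrations:

```python
import pandas as pd

# Cross-sectional: many entities, one period
cross = pd.DataFrame({"firm": ["A", "B", "C"], "sales": [10, 12, 9]})

# Time series: one entity, many periods
ts = pd.DataFrame({"year": [2020, 2021, 2022], "sales": [10, 11, 13]})

# Panel: many entities over many periods (entity x period index)
panel = pd.DataFrame({
    "firm": ["A", "A", "B", "B"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 11, 12, 12],
}).set_index(["firm", "year"])
print(panel)
```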
3 types of variables
Numerical, categorical, dummy
Numerical variables
Data that represent quantities or measurements.
Ex: age
Categorical variables
Data that represent distinct categories or groups. A number may be assigned to each category.
Dummy variables
Data that represent categorical data as a set of binary values (0 and 1), i.e. the indicator function in mathematics.
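A minimal Python sketch of this encoding, assuming pandas is available; the color column is a made-up example:

```python
import pandas as pd

# Hypothetical categorical data: a single column with three categories
df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# Each category becomes a 0/1 indicator column (the indicator
# function applied category by category)
dummies = pd.get_dummies(df["color"], dtype=int)
print(dummies)
#    blue  green  red
# 0     0      0    1
# 1     1      0    0
# 2     0      0    1
# 3     0      1    0
```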
The Linear Regression Model: Definition
A modeling method that postulates a relationship between a dependent variable (𝒚) and one or more independent variables (𝑥1, 𝑥2, … , 𝑥𝑘):
y depends on 𝑥1, 𝑥2, … , 𝑥𝑘.
simple linear regression model
x1 is the only independent variable; y is the dependent variable.
𝑦 = β0 + β1 𝑥1 + ϵ
Beta 0
The intercept: the value of y when x = 0.
Beta 1
The unknown slope coefficient.
β0 + β1 𝑥1
deterministic component of the linear model
Positive / negative / no linear relationship
Positive: upward slope, β1 > 0.
Negative: downward slope, β1 < 0.
No linear relationship: flat line, β1 = 0.
For a multiple linear regression model
what is ∀𝑖 ∈ [1; 𝑘] β𝑖
The unknown population parameter associated with variable xi.
The multiple linear regression model:
Given 𝑦 the dependent variable and { x𝑖 | 𝑖 ∈ {1, 2, …, k} } the independent variables:
𝑦 = β0 + ∑ᵢ₌₁ᵏ β𝑖𝑥𝑖 + ϵ = β0 + β1x1 + β2x2 + ⋯ + βkxk + ϵ
residual error
e = y − ŷ
Ordinary least squares (OLS) formula:
SSE = ∑ eᵢ²
Definition: OLS
A method that finds the parameter estimates minimizing the sum of squared errors between the observed data points and the values predicted by the linear model.
Given 𝑌1, 𝑌2, … , 𝑌𝑁 the N observations of the dependent variable, in matrix form:
𝑌1 = β0 + β1𝑋1 + ϵ1
𝑌2 = β0 + β1𝑋2 + ϵ2
⋮
𝑌𝑁 = β0 + β1𝑋𝑁 + ϵ𝑁
What is the value of Y?
Y = (𝑌1, 𝑌2, … , 𝑌𝑁)ᵀ, the column vector of observations, so the system is equivalent to Y = Xβ + ϵ
SSE = ?
SSE = ‖Y − Xβ‖²
The Linear Regression Model - Derivation
β̂ = (XᵀX)⁻¹ XᵀY
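A short numpy sketch of this closed-form solution on simulated data; the true coefficients β0 = 2 and β1 = 3 are arbitrary choices for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1, n)   # true model: beta0 = 2, beta1 = 3

# Design matrix X with a column of ones for the intercept
X = np.column_stack([np.ones(n), x])

# Normal equations: beta_hat = (X'X)^(-1) X'Y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)                        # close to [2, 3]
```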
Definition: The variance of the estimator β̂:
Measures the uncertainty of the estimated parameter.
The smaller the variance, the more confident we are in the estimated parameter.
Definition: The covariance of two estimators:
Measures how the two variables move together.
A positive covariance indicates that the two variables are positively correlated (they increase together), and vice versa.
Definition: The estimator σ̂²:
Represents the residual variance (the variance of the errors).
Measures the dispersion of the observations around the regression model. (We do not know the exact σ² because ϵ is unobserved.)
𝑉(β̂)
σ² (XᵀX)⁻¹
𝑉(β̂) Matrix:
( Var(β̂0)       Cov(β̂0, β̂1)
  Cov(β̂1, β̂0)   Var(β̂1) )
𝑉𝑎𝑟(β0̂)
σ² · ∑ xᵢ² / ( n · ∑(xᵢ − x̄)² )
𝑉𝑎𝑟(𝛽1̂)
σ² / ∑(xᵢ − x̄)²
σ̂²
σ̂² = (1 / (n − k − 1)) · ∑ eᵢ²
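A numpy sketch tying these formulas together on the simulated simple regression from above (k = 1): σ̂² = SSE / (n − k − 1) and V(β̂) = σ̂²(XᵀX)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 1
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Residuals and estimated error variance: sigma2_hat = SSE / (n - k - 1)
e = y - X @ beta_hat
sigma2_hat = (e @ e) / (n - k - 1)

# V(beta_hat) = sigma2_hat * (X'X)^(-1); the diagonal holds
# Var(beta0_hat) and Var(beta1_hat)
V = sigma2_hat * np.linalg.inv(X.T @ X)
print(np.sqrt(np.diag(V)))   # standard errors se(beta0_hat), se(beta1_hat)
```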
se
standard error of the estimate
R^2
coefficient of determination
Difference between R^2 and adjusted R^2
- R^2 (the coefficient of determination) measures the proportion of the variance of the dependent variable explained by the model, but it always increases when explanatory variables are added, even useless ones.
- Adjusted R^2 corrects this limitation by penalizing the addition of irrelevant variables, taking into account the number of predictors and the sample size (it can decrease if an added variable brings nothing to the model).
Definition: The sample variance se^2
Measures the average squared deviation between the observed and predicted values.
standard error of the estimate se
The standard deviation of the estimation error.
Formula for se
se = √( SSE / (n − k − 1) )
Definition: R^2
Quantifies the sample variation in the dependent variable that is explained by the sample regression equation.
The ratio of the explained variation of the dependent variable to its total variation.
Between 0 and 1.
SST formula
SST = ∑(yᵢ − ȳ)²
SST
SSR + SSE
SSR formula
SSR = ∑(ŷᵢ − ȳ)²
R^2
SSR/SST = 1 − (SSE/SST)
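A numpy sketch of this decomposition on simulated data, checking that SST = SSR + SSE and that both expressions for R^2 agree:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 2, n)
X = np.column_stack([np.ones(n), x])
y_hat = X @ (np.linalg.inv(X.T @ X) @ X.T @ y)  # fitted values

sst = np.sum((y - y.mean()) ** 2)      # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation

print(np.isclose(sst, ssr + sse))      # True: SST = SSR + SSE
print(ssr / sst, 1 - sse / sst)        # both equal R^2
```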
Definition: Adjusted R^2:
- We cannot use R^2 for model comparison when the competing models do not include the same number of independent variables (even though the dependent variable is the same).
- R^2 never decreases as we add more variables; adjusted R^2 accounts for the sample size n and the number of independent variables k.
- It imposes a penalty for any additional independent variable.
- The higher the adjusted R^2, the better the model. When comparing models with the same dependent variable, the model with the higher adjusted R^2 is preferred.
- Note that as n increases, adjusted R^2 gets closer to R^2, i.e. lim n→+∞ adjusted R^2 = R^2.
Adjusted R^2 =
1 − (1 − R^2) · (n − 1) / (n − k − 1)
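A small Python sketch of the formula; the numbers are hypothetical and chosen to show that a tiny R^2 gain from an extra variable can still lower the adjusted R^2:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Adding a near-useless variable (k: 2 -> 3) barely raises R^2
# but lowers adjusted R^2:
print(adjusted_r2(0.800, n=50, k=2))   # ~0.7915
print(adjusted_r2(0.802, n=50, k=3))   # ~0.7891, lower despite higher R^2
```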
Test of joint significance
Assesses whether several coefficients (or all of them) are jointly significant.
Generally uses an F test of the hypothesis H0: β1 = β2 = ⋯ = βk = 0.
Answers the question: does the set of explanatory variables significantly improve the model?
F test formula
F = (SSR / k) / (SSE / (n − k − 1)) = MSR / MSE = (R^2 / k) / ((1 − R^2) / (n − k − 1))
p-value for an F test
If the p-value is less than the significance level α, reject the null hypothesis.
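A scipy sketch of the joint test; the SSR, SSE, n, and k values are hypothetical summary numbers used only to illustrate the computation:

```python
from scipy import stats

# Hypothetical regression summary numbers
n, k = 50, 3
ssr, sse = 90.0, 120.0

# F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))
f_stat = (ssr / k) / (sse / (n - k - 1))

# p-value from the F distribution with (k, n - k - 1) degrees of freedom
p_value = stats.f.sf(f_stat, k, n - k - 1)
print(f_stat, p_value)   # reject H0 if p_value < alpha (e.g. 0.05)
```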
individual test of significance
-> Assesses whether a single regression coefficient (βi) is significant.
-> Answers the question: does this variable have a significant effect on the dependent variable?
H0 and H1 for an individual test of significance (two-tailed test)
H0: βi = 0
H1: βi ≠ 0
test statistic for a test of individual significance
t = (bj − βj0) / se(bj)
p value for a test of individual significance
If the p-value is less than α, reject the null hypothesis.
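A statsmodels sketch of the individual tests on simulated data, where x2 is deliberately unrelated to y so its coefficient should come out insignificant:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)              # unrelated to y by construction
y = 1 + 2 * x1 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

# t statistic (b_j - 0) / se(b_j) and two-tailed p-value per coefficient
print(res.tvalues)   # large |t| for const and x1, small for x2
print(res.pvalues)   # p < alpha -> reject H0: beta_j = 0
```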
CAPM equation
𝑅 − 𝑅𝑓 = α + β(𝑅𝑀− 𝑅𝑓) + ϵ
- R : The rate of return on a stock or portfolio
- RM : market return
- Rf : risk-free interest rate
- β: measures how sensitive the stock's return is to changes in the market.
β = 1: a change in the market implies the same change in the stock.
β > 1: the stock is more aggressive, or riskier, than the market.
β < 1: the stock is conservative, or less risky, than the market.
- α: predicted to be zero, so nonzero values indicate abnormal returns.
Called the stock's alpha.
𝛼 > 0 : Positive abnormal returns
𝛼 < 0 : Negative abnormal returns
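A statsmodels sketch of estimating the CAPM by regressing excess stock returns on excess market returns; the return series are simulated (α = 0.002 and β = 1.3 are arbitrary), not real market data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
market_excess = rng.normal(0.01, 0.04, 120)   # R_M - R_f, 120 months
stock_excess = (0.002 + 1.3 * market_excess
                + rng.normal(0, 0.02, 120))   # R - R_f

res = sm.OLS(stock_excess, sm.add_constant(market_excess)).fit()
alpha, beta = res.params
print(alpha, beta)   # beta > 1: more aggressive than the market;
                     # test alpha = 0 to check for abnormal returns
```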
t value
The coefficient divided by the standard error of the coefficient.
MSR (mean square regression)
SSR / df, where df = k
MSE (mean square error)
SSE / df, where df = n − k − 1
MST (mean square total)
SST / df, where df = n − 1
3 assumptions of OLS regression
- Linearity between the independent variables and the dependent variable (the X/Y relationship must be linear).
- Homoscedasticity: the variance of the residuals must be constant at all levels of X (the magnitude of the errors, i.e. the size of the residuals, is roughly the same for all X).
- No perfect multicollinearity: the explanatory variables must not be perfectly correlated.
Steps to determine whether a variable is significant
1) Compute the t statistic (β̂1 / se(β̂1)).
2) Look up the t table at the intersection of the df row (n − k − 1) and the column for the chosen risk level (usually 5%, hence 0.025 because the test is two-tailed).
3) If |t stat| > the critical value from the table, reject H0: the variable is significant.
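A scipy sketch of the three steps, with hypothetical values for the slope estimate, its standard error, and the sample size:

```python
from scipy import stats

b1, se_b1 = 0.45, 0.18   # hypothetical estimate and standard error
n, k = 40, 2             # hypothetical sample size and number of regressors

t_stat = b1 / se_b1                  # step 1: t statistic
df = n - k - 1
t_crit = stats.t.ppf(1 - 0.025, df)  # step 2: critical value, 5% two-tailed
print(abs(t_stat) > t_crit)          # step 3: True -> reject H0, significant
```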
In an ANOVA table
F stat
p-value
t-stat
*F stat: determines whether the independent variables collectively explain the variation in the dependent variable.
*p-value: gives the individual significance of each predictor.
*t-stat: gives the significance of one particular coefficient.