Notions Flashcards
actionable insight
an operational insight that can be implemented
bad data
garbage in - garbage out
re-create analysis
Reproducing an analysis is quite difficult but important
bias
factors that can influence a decision in the wrong way
analytics workflow
modularity - which tools and approaches are used at each step
Directed acyclic graph
A directed graph with no directed cycles. It consists of vertices and edges, with each edge directed from one vertex to another, such that following those directions never forms a closed loop.
Airflow
A workflow manager used instead of crontab to schedule and orchestrate pipelines
Sum of squares total / regression / error
SST/TSS (total sum of squares) - sum for i=1..n of (y_i - mean)^2: squared differences between each real value and the mean
SSR/ESS (regression / explained sum of squares) - sum for i=1..n of (yhat_i - mean)^2: squared differences between each predicted value and the mean
SSE/RSS (error / residual sum of squares) - sum for i=1..n of (y_i - yhat_i)^2: squared differences between each real value and its prediction
SST = SSR + SSE
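The decomposition can be checked numerically. A minimal sketch with made-up data (the x/y values are purely illustrative); for an OLS fit with an intercept, SST = SSR + SSE holds exactly:

```python
import numpy as np

# Illustrative data: x is the feature, y the target
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Fit an OLS line: polyfit returns [slope, intercept] for degree 1
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained (regression) sum of squares
sse = np.sum((y - y_hat) ** 2)         # residual (error) sum of squares

r_squared = ssr / sst                  # share of variability explained
```

This also gives R-squared directly as SSR/SST.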
Depndent variable
The one we are trying to predict
OLS
Ordinary Least Squares - fits the regression line by minimizing SSE
R-squared
R^2 = SSR/SST - measures how much of the total variability is explained by the model; 1 is best, 0 is worst
adjusted R-squared
Measures how well the model fits the data, but penalizes the use of variables that add nothing to the regression.
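The penalty comes from the standard formula adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), with n observations and k predictors. A small sketch (the R^2, n, and k values are made up):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: penalizes predictors that do not improve the fit.

    n: number of observations, k: number of predictors.
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Adding a useless predictor (same R-squared, larger k) lowers the adjusted value
with_2_features = adjusted_r2(0.80, 50, 2)  # ~0.79
with_3_features = adjusted_r2(0.80, 50, 3)  # slightly lower
```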
F-statistic
Tests the overall significance of the regression: the null hypothesis is that all coefficients are zero. A low p-value for the F-statistic means the model as a whole explains something.
Linearity
A regression assumption: the relationship between features and target is linear, so a linear function fits the data well.
No endogeneity
A regression assumption: the error term is not correlated with the independent variables (e.g. no omitted-variable bias).
Feature/target in ML
independent variable(feature), used to predict dependent variable(target)
Regression intercept
A point where regression line crosses y-axis
Regression coefficient
the coefficient by which the feature is multiplied (the slope for that feature)
p-value of the feature
The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis (means that feature can be used)
F-regression
Fits a separate simple regression of the target on each feature (in case we have many of them) and reports an F-statistic and p-value per feature
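A sketch of this with scikit-learn's f_regression; the random data is illustrative, and only the first feature actually drives the target:

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Target depends only on the first feature; the other two are pure noise
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# One univariate regression per feature -> one F-statistic and p-value each
f_stats, p_values = f_regression(X, y)
```

The informative feature gets a large F-statistic and a tiny p-value, which is how the per-feature p-value rule from the card above is applied in practice.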
median vs mean/average
the median is the value at the 50th percentile; the mean (average) is the sum divided by the number of elements
standardization
Find the mean and std deviation, then compute (value - mean) / deviation
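A minimal sketch of the formula (the values are made up); after standardization the data has mean 0 and standard deviation 1:

```python
import numpy as np

values = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# z-score: (value - mean) / standard deviation
z = (values - values.mean()) / values.std()
```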
underfitting/overfitting
underfitting - low accuracy (the model doesn't capture the underlying logic); overfitting - too high accuracy on training data (the model captures the noise). Detected by splitting into train (75%) and test (25%) datasets.
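The 75/25 split can be sketched with scikit-learn's train_test_split (the X/y arrays here are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder dataset: 100 samples, one feature
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# 75% for training, 25% held out for testing, as in the card
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```

A model that scores well on the train set but poorly on the test set is overfitting; one that scores poorly on both is underfitting.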
Multicollinearity
Two or more features are strongly correlated with each other, which makes the coefficient estimates unstable. Typically fixed by dropping or combining the correlated features.
Dummy variables
For categories (BMW, Audi, Opel) we create n-1 dummy columns (1 if BMW, 0 if not). No column is needed for Opel: if it is not BMW or Audi, it must be Opel.
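A sketch with pandas' get_dummies; note that drop_first=True drops the alphabetically first category (Audi here) rather than Opel, but the idea is the same: the dropped brand is implied when all dummies are 0.

```python
import pandas as pd

brands = pd.Series(["BMW", "Audi", "Opel", "BMW"])

# drop_first=True keeps n-1 dummy columns to avoid redundancy
dummies = pd.get_dummies(brands, drop_first=True)
```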
Data cleaning
Remove outliers (e.g. by quantile/IQR) and handle missing values
Models: linear, quadratic, exponential, logistic
Logistic: regression for categorical (e.g. binary) outcomes
MLE
maximum likelihood estimation
Clusters
Maximize the similarity in cluster and dissimilarity between clusters
Cluster analysis
unsupervised learning, since we don't know the outcomes. Classification, by contrast, deals with known outcomes and can be trained on labeled data.
Centroid
Center of mass of all data points in cluster analysis
K-means
1) Choose K.
2) Specify seeds (initial centroids)
3) Assign each point to the closest centroid
4) Adjust each centroid based on its assigned points
5) Repeat from 3 until the centroids stop moving
WCSS
Within Cluster Sum of Squares; used with the elbow method to determine the number of clusters. In scikit-learn it is exposed as kmeans.inertia_
Cluster seeds
We need to choose the points from which to build our clusters. The k-means++ method does this and is already integrated into KMeans.
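The seeding, WCSS, and elbow method come together in scikit-learn's KMeans; a sketch on two made-up, well-separated blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs of 50 points each (illustrative data)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])

# init='k-means++' is the default seeding; inertia_ is the WCSS of the fit
wcss = []
for k in range(1, 5):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

# WCSS always shrinks as K grows; the elbow is where the drop flattens
# (here the big drop is from K=1 to K=2, matching the two blobs)
```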
Cluster analysis pros and cons
Pros: simple to understand; fast to cluster; widely available; easy to implement
Cons: need to pick K (elbow method); sensitive to initialization (k-means++); sensitive to outliers (remove them first); produces spherical clusters (since Euclidean distance from the centroid is used); requires standardization
Class of clusters
Flat (k-means); hierarchical (e.g. a taxonomy of species)
IQR
Interquartile Range: the difference between the 75th percentile (Q3) and the 25th percentile (Q1)
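The IQR is what powers the outlier-removal step from the data cleaning card; a sketch using the common 1.5*IQR rule (the data is made up, with 100 as the obvious outlier):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # 100 is an outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1  # interquartile range: Q3 - Q1

# Common rule: keep only points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
mask = (data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)
cleaned = data[mask]
```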