Notions Flashcards

1
Q

actionable insight

A

an insight concrete enough to be acted on and implemented in operations

2
Q

bad data

A

garbage in, garbage out: analysis built on bad data produces bad conclusions, no matter how good the model

3
Q

re-create analysis

A

Re-creating (reproducing) an analysis is quite difficult but important

4
Q

bias

A

factors that can influence a decision in the wrong way

5
Q

analytics workflow

A

modularity: which tools and approaches are used at each step of the workflow

6
Q

Directed acyclic graph

A

A directed graph with no directed cycles: it consists of vertices and edges, with each edge directed from one vertex to another, such that following those directions never forms a closed loop.

7
Q

airflow

A

a workflow manager used instead of crontab; workflows are defined as DAGs of tasks
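
A minimal sketch, assuming a recent Airflow 2.x; the dag_id, schedule, and bash commands are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Hypothetical two-step pipeline expressed as a DAG instead of
    # two crontab entries separated by a guessed time gap.
    with DAG(
        dag_id="example_etl",              # placeholder name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        transform = BashOperator(task_id="transform", bash_command="echo transform")

        extract >> transform  # transform runs only after extract succeeds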

8
Q

Sum of squares total / regression / error

A

SST/TSS (total sum of squares): sum over i = 1..n of (y_i - ȳ)^2, the squared differences between the real values and the mean
SSR/ESS (regression / explained sum of squares): sum of (ŷ_i - ȳ)^2, the squared differences between the predicted values and the mean
SSE/RSS (error / residual sum of squares): sum of (y_i - ŷ_i)^2, the squared differences between the real and predicted values

SST = SSR + SSE
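
A quick NumPy check of these definitions on toy data (line fitted with np.polyfit):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    slope, intercept = np.polyfit(x, y, deg=1)   # OLS fit of a line
    y_hat = slope * x + intercept
    y_bar = y.mean()

    sst = np.sum((y - y_bar) ** 2)      # total sum of squares
    ssr = np.sum((y_hat - y_bar) ** 2)  # explained (regression) sum of squares
    sse = np.sum((y - y_hat) ** 2)      # residual (error) sum of squares

    print(np.isclose(sst, ssr + sse))   # True: SST = SSR + SSE
    print("R^2 =", ssr / sst)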

9
Q

Dependent variable

A

The one we are trying to predict

10
Q

OLS

A

Ordinary Least Squares: finds the coefficients that minimize SSE
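
A sketch of one-feature OLS using the closed-form slope and intercept (toy data):

    import numpy as np

    def ols_line(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
        """Simple one-feature OLS: minimizes SSE = sum((y - (a*x + b))**2)."""
        x_bar, y_bar = x.mean(), y.mean()
        slope = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
        intercept = y_bar - slope * x_bar
        return slope, intercept

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([3.0, 5.0, 7.0, 9.0])
    print(ols_line(x, y))  # -> (2.0, 1.0): y = 2x + 1 fits exactly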

11
Q

R-squared

A

R^2 = SSR/SST = 1 - SSE/SST; 1 is best, 0 is worst

R-squared measures how much of the total variability is explained by the model

12
Q

adjusted R-squared

A

Measures how well your model fits the data, but penalizes the use of variables that are meaningless for the regression: adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1) for n observations and k features.
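
A quick sketch of the formula; the n and k values are made up for illustration:

    def adjusted_r2(r2: float, n: int, k: int) -> float:
        """Adjusted R^2: penalizes features that add no explanatory power."""
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print(adjusted_r2(0.90, n=100, k=3))   # ~0.8969
    print(adjusted_r2(0.90, n=100, k=30))  # ~0.8565: same fit, bigger penalty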

13
Q

F-statistic

A

Tests the overall significance of the regression. Null hypothesis: all coefficients (except the intercept) are zero. F = (SSR/k) / (SSE/(n - k - 1)); a low p-value means the model as a whole is significant.

14
Q

Linearity

A

An OLS assumption: is the relationship between the features and the target linear, and does the fitted line describe the data well?

15
Q

No endegeneity

A

An OLS assumption: the errors are not correlated with the features. A typical violation is omitted variable bias, when a relevant variable is left out of the model.

16
Q

Feature/target in ML

A

the independent variable (feature) is used to predict the dependent variable (target)

17
Q

Regression intercept

A

The point where the regression line crosses the y-axis: the predicted value when all features are zero

18
Q

Regression coefficient

A

the coefficient by which the feature is multiplied: how much the target changes per unit change in the feature

19
Q

p-value of the feature

A

The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis, i.e., the feature is useful in the model.

20
Q

F-regression

A

Fits a separate simple regression of the target on each feature (useful when we have many features) and tests its significance
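
A sketch using scikit-learn's f_regression on synthetic data where only the first feature is informative:

    import numpy as np
    from sklearn.feature_selection import f_regression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))             # three candidate features
    y = 2.0 * X[:, 0] + rng.normal(size=100)  # only feature 0 matters

    f_stats, p_values = f_regression(X, y)    # one univariate test per feature
    print(p_values)  # feature 0 gets a tiny p-value; the others do not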

21
Q

mean vs median

A

the median is the value at the 50% point (the middle value); the mean (average) is the sum divided by the number of elements

22
Q

standardization

A

Find the mean and standard deviation, then compute (value - mean) / standard deviation
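
A minimal NumPy sketch (scikit-learn's StandardScaler does the same per column):

    import numpy as np

    def standardize(values: np.ndarray) -> np.ndarray:
        """Z-score: subtract the mean, divide by the standard deviation."""
        return (values - values.mean()) / values.std()

    x = np.array([10.0, 20.0, 30.0])
    z = standardize(x)
    print(z.mean(), z.std())  # ~0.0 and 1.0 after standardization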

23
Q

underfitting/overfitting

A

Underfitting: low accuracy (the model doesn't capture the underlying logic). Overfitting: deceptively high accuracy (the model captures the noise as well). Detected by splitting the data into train (75%) and test (25%) sets and comparing performance.
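
A sketch of the split with scikit-learn on synthetic data; the 75/25 ratio matches the card:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(100).reshape(-1, 1).astype(float)
    y = 3.0 * X.ravel() + np.random.default_rng(1).normal(size=100)

    # A large gap between train and test accuracy hints at overfitting.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )
    print(len(X_train), len(X_test))  # 75 25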

24
Q

Multicollinearity

A

Two or more features are strongly correlated with one another, so the regression cannot separate their effects and the coefficient estimates become unstable. Remedy: drop or combine the correlated features.

25
Q

Dummy variables

A

For categories (BMW, Audi, Opel) we create n - 1 columns with dummy variables (e.g., 1 if BMW, 0 if not). No column is needed for Opel: if a car is neither a BMW nor an Audi, it must be an Opel.
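
A sketch with pandas; the brand column is a made-up example:

    import pandas as pd

    cars = pd.DataFrame({"brand": ["BMW", "Audi", "Opel", "BMW"]})

    # drop_first=True keeps n-1 columns; the dropped category (Audi, first
    # alphabetically here) is implied when all dummies are 0.
    dummies = pd.get_dummies(cars["brand"], drop_first=True)
    print(dummies)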

26
Q

Data cleaning

A

Remove outliers (e.g., cut at an extreme quantile) and remove or impute missing values

27
Q

Models: linear, quadratic, exponential, logistic

A

Linear: straight line, y = b0 + b1*x. Quadratic: adds an x^2 term for curvature. Exponential: the target changes by a constant factor. Logistic: regression for categorical outcomes (predicts a probability between 0 and 1).

28
Q

MLE

A

maximum likelihood estimation

29
Q

Clusters

A

Maximize the similarity within each cluster and the dissimilarity between clusters

30
Q

Cluster analysis

A

Unsupervised learning: we don't know the outcomes in advance. Classification, by contrast, deals with known outcomes and can be trained on labeled training data.

31
Q

Centroid

A

The center of mass of all the data points in a cluster

32
Q

K-means

A

1) Choose K, the number of clusters.
2) Specify the seeds (initial centroids).
3) Assign each point to the closest centroid.
4) Adjust each centroid based on its assigned points.
5) Repeat from step 3 until the centroids stop moving.
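
A minimal NumPy sketch of these steps (random seeding rather than k-means++; toy data):

    import numpy as np

    def k_means(points, k, n_iter=10):
        """Minimal k-means following the steps on the card."""
        rng = np.random.default_rng(0)
        # Step 2: seed the centroids with k distinct random points.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):  # Step 5: repeat steps 3-4.
            # Step 3: assign each point to the closest centroid.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 4: move each centroid to the mean of its assigned points
            # (keep the old centroid if a cluster ends up empty).
            centroids = np.array([
                points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
        return labels, centroids

    # Two obvious blobs; k=2 separates them.
    pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
    print(k_means(pts, k=2)[0])  # e.g. [0 0 1 1] (label order may differ)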

33
Q

WCSS

A

Within Cluster Sum of Squares: used with the elbow method to determine the number of clusters. In scikit-learn, WCSS is available as kmeans.inertia_.
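
A sketch of the elbow method on synthetic blobs; inertia_ is scikit-learn's name for WCSS:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Three synthetic blobs around 0, 3, and 6.
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 3, 6)])

    # Elbow method: WCSS drops sharply until the true K, then flattens.
    for k in range(1, 7):
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(kmeans.inertia_, 1))  # kmeans.inertia_ is the WCSS
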
34
Q

Cluster seeds

A

We need to choose the points from which to build the clusters. The k-means++ method does this and is already integrated into KMeans (it is scikit-learn's default initialization).

35
Q

Cluster analysis pros and cons

A

Pros:
Simple to understand
Fast to cluster
Widely available
Easy to implement

Cons:
We need to pick K (remedy: the elbow method)
Sensitive to initialization (remedy: k-means++)
Sensitive to outliers (remedy: remove them)
Produces spherical solutions, since it uses Euclidean distance from the centroid
Distance-based, so the features need standardization first

36
Q

Class of clusters

A

Flat (e.g., k-means) and hierarchical (e.g., a taxonomy of species)

37
Q

IQR

A

Interquartile Range: the difference between the third and first quartiles (Q3 - Q1), often used to detect outliers
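
A quick NumPy sketch with the common 1.5 x IQR outlier rule (made-up data):

    import numpy as np

    data = np.array([1, 2, 4, 4, 5, 5, 6, 7, 8, 40])  # 40 is an outlier
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1

    # Rule of thumb: values beyond 1.5 * IQR from the quartiles are outliers.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    print(data[(data < low) | (data > high)])  # -> [40]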