Notions Flashcards
actionable insight
an operational insight that can be implemented
bad data
garbage in - garbage out
re-create analysis
Reproducing an analysis is quite difficult but important
bias
factors that can influence a decision in the wrong way
analytics workflow
modularity - which tools and approaches are used at each step
Directed acyclic graph
A directed graph with no directed cycles. It consists of vertices and edges, with each edge directed from one vertex to another, such that following those directions never forms a closed loop.
Airflow
A workflow manager used instead of crontab to schedule and orchestrate pipelines
Sum of squares total / regression / error
SST/TSS (total sum of squares) - sum for i=1..n of (y_i - mean)^2: squared differences between each real value and the mean
SSR/ESS (regression / explained sum of squares) - sum for i=1..n of (yhat_i - mean)^2: squared differences between each predicted value and the mean
SSE/RSS (error / residual sum of squares) - sum for i=1..n of (y_i - yhat_i)^2: squared differences between each real value and its prediction
SST = SSR + SSE
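The decomposition can be checked numerically. A minimal sketch with made-up data (the x/y values are purely illustrative); for an OLS fit with an intercept, SST = SSR + SSE holds exactly:

```python
import numpy as np

# Illustrative data: x is the feature, y the target
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Fit an OLS line: polyfit returns [slope, intercept] for degree 1
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained (regression) sum of squares
sse = np.sum((y - y_hat) ** 2)         # residual (error) sum of squares

r_squared = ssr / sst                  # share of variability explained
```

This also gives R-squared directly as SSR/SST.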
Depndent variable
The one we are trying to predict
OLS
Ordinary Least Squares - fits the regression line by minimizing SSE
R-squared
R^2 = SSR/SST - measures how much of the total variability is explained by the model; 1 is best, 0 is worst
adjusted R-squared
Measures how well the model fits the data, but penalizes the use of variables that add nothing to the regression.
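The penalty comes from the standard formula adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), with n observations and k predictors. A small sketch (the R^2, n, and k values are made up):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: penalizes predictors that do not improve the fit.

    n: number of observations, k: number of predictors.
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Adding a useless predictor (same R-squared, larger k) lowers the adjusted value
with_2_features = adjusted_r2(0.80, 50, 2)  # ~0.79
with_3_features = adjusted_r2(0.80, 50, 3)  # slightly lower
```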
F-statistic
Tests the overall significance of the regression: the null hypothesis is that all coefficients are zero. A low p-value for the F-statistic means the model as a whole explains something.
Linearity
A regression assumption: the relationship between features and target is linear, so a linear function fits the data well.
No endogeneity
A regression assumption: the error term is not correlated with the independent variables (e.g. no omitted-variable bias).
Feature/target in ML
independent variable(feature), used to predict dependent variable(target)
Regression intercept
A point where regression line crosses y-axis
Regression coefficient
the coefficient by which the feature is multiplied (the slope for that feature)
p-value of the feature
The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis (means that feature can be used)
F-regression
Fits a separate simple regression of the target on each feature (in case we have many of them) and reports an F-statistic and p-value per feature
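A sketch of this with scikit-learn's f_regression; the random data is illustrative, and only the first feature actually drives the target:

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Target depends only on the first feature; the other two are pure noise
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# One univariate regression per feature -> one F-statistic and p-value each
f_stats, p_values = f_regression(X, y)
```

The informative feature gets a large F-statistic and a tiny p-value, which is how the per-feature p-value rule from the card above is applied in practice.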
median vs mean/average
the median is the value at the 50th percentile; the mean (average) is the sum divided by the number of elements
standardization
Find the mean and std deviation, then compute (value - mean) / deviation
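A minimal sketch of the formula (the values are made up); after standardization the data has mean 0 and standard deviation 1:

```python
import numpy as np

values = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# z-score: (value - mean) / standard deviation
z = (values - values.mean()) / values.std()
```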
underfitting/overfitting
underfitting - low accuracy (the model doesn't capture the underlying logic); overfitting - too high accuracy on training data (the model captures the noise). Detected by splitting into train (75%) and test (25%) datasets.
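The 75/25 split can be sketched with scikit-learn's train_test_split (the X/y arrays here are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder dataset: 100 samples, one feature
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# 75% for training, 25% held out for testing, as in the card
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```

A model that scores well on the train set but poorly on the test set is overfitting; one that scores poorly on both is underfitting.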
Multicollinearity
Two or more features are strongly correlated with each other, which makes the coefficient estimates unstable. Typically fixed by dropping or combining the correlated features.
Dummy variables
For categories (BMW, Audi, Opel) we create n-1 dummy columns (1 if BMW, 0 if not). No column is needed for Opel: if it is not BMW or Audi, it must be Opel.
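A sketch with pandas' get_dummies; note that drop_first=True drops the alphabetically first category (Audi here) rather than Opel, but the idea is the same: the dropped brand is implied when all dummies are 0.

```python
import pandas as pd

brands = pd.Series(["BMW", "Audi", "Opel", "BMW"])

# drop_first=True keeps n-1 dummy columns to avoid redundancy
dummies = pd.get_dummies(brands, drop_first=True)
```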
Data cleaning
Remove outliers (e.g. by quantile/IQR) and handle missing values
Models: linear, quadratic, exponential, logistic
Logistic: regression for categorical (e.g. binary) outcomes
MLE
maximum likelihood estimation
Clusters
Maximize the similarity in cluster and dissimilarity between clusters
Cluster analysis
unsupervised learning, since we don't know the outcomes. Classification, by contrast, deals with known outcomes and can be trained on labeled data.
Centroid
Center of mass of all data points in cluster analysis
K-means
1) Choose K.
2) Specify seeds (initial centroids)
3) Assign each point to the closest centroid
4) Adjust each centroid based on its assigned points
5) Repeat from 3 until the centroids stop moving
WCSS
Within Cluster Sum of Squares; used with the elbow method to determine the number of clusters. In scikit-learn it is exposed as kmeans.inertia_
Cluster seeds
We need to choose the points from which to build our clusters. The k-means++ method does this and is already integrated into KMeans.
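The seeding, WCSS, and elbow method come together in scikit-learn's KMeans; a sketch on two made-up, well-separated blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs of 50 points each (illustrative data)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])

# init='k-means++' is the default seeding; inertia_ is the WCSS of the fit
wcss = []
for k in range(1, 5):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

# WCSS always shrinks as K grows; the elbow is where the drop flattens
# (here the big drop is from K=1 to K=2, matching the two blobs)
```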
Cluster analysis pros and cons
Pros: simple to understand; fast to cluster; widely available; easy to implement
Cons: need to pick K (elbow method); sensitive to initialization (k-means++); sensitive to outliers (remove them first); produces spherical clusters (since Euclidean distance from the centroid is used); requires standardization
Class of clusters
Flat (k-means); hierarchical (e.g. a taxonomy of species)
IQR
Interquartile Range: the difference between the 75th percentile (Q3) and the 25th percentile (Q1)
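The IQR is what powers the outlier-removal step from the data cleaning card; a sketch using the common 1.5*IQR rule (the data is made up, with 100 as the obvious outlier):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # 100 is an outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1  # interquartile range: Q3 - Q1

# Common rule: keep only points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
mask = (data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)
cleaned = data[mask]
```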