Lecture 1 Flashcards
Key aspects of Data Mining
What is data mining? (see last card)
Trade-off between processing time and memory
Computers as a tool for dealing with growing data
From unstructured data to structured knowledge
What is a large amount of data (big data)?
volume
variety
velocity
3 characteristics of volume (big data)
too big for manual analysis
too big to store in RAM
too big to store on disk
3 characteristics of variety (big data)
variance
outliers, confounders, noise
different data types
2 characteristics of velocity (big data)
results before data changes
streaming data
What makes predictions possible?
associations between features/target
numerical: correlation coefficient
categorical: mutual information
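A minimal sketch of how the association between two categorical variables could be measured with mutual information, estimated from simple counts. The function name `mutual_information` and the toy label lists are made up for illustration; Pearson's r for numerical features is covered further down.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two categorical sequences.

    Probabilities are estimated from raw counts, which is an assumption
    that only makes sense when there are enough examples per category.
    """
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        ratio = c * n / (count_x[x] * count_y[y])
        mi += (c / n) * math.log2(ratio)
    return mi


# Hypothetical toy data: a feature that fully determines the target ...
dependent = mutual_information(["a", "a", "b", "b"], ["spam", "spam", "ham", "ham"])
# ... and a feature that tells us nothing about the target.
independent = mutual_information(["a", "a", "b", "b"], ["spam", "ham", "spam", "ham"])
```

A value of 0 means the feature carries no information about the target; larger values mean a stronger association.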
Supervised learning (2 types)
regression (predictors)
classification (classifiers)
Unsupervised learning (2 types)
clustering
dimensionality reduction
Learning
A program is said to learn from experience E with respect to task T and performance measure P if its performance at task T, as measured by P, improves with experience E.
Suppose your email program watches which emails you do or do not mark as spam and based on that learns how to better filter spam. What is E, T and P?
E = Watching you label emails
T = Classifying emails as spam/ham
P = The number (or fraction) of emails correctly classified as spam/ham
characteristics supervised learning
trying to predict a specific quantity (like tomorrow's Dow Jones, or whether an e-mail is spam or ham)
have training examples with labels
can measure accuracy directly
characteristics unsupervised learning
not looking for something specific, you want to ‘understand the data’
looking for structured (or unstructured) patterns
does not require labeled data
evaluation usually indirect or qualitative
description supervised learning
We label the data manually, and these labels are what we want to predict as well as possible. The algorithm is given supervision: examples of what you want to see come out of it.
workflow supervised learning
collect data
label the data manually (target variable)
choose representation
train the model to learn
evaluate
what is meant by ‘representation’ (workflow SL)
feature selection
(possibly) convert to feature vector
Split the set in …. and …. for ‘train model’ (workflow SL)
train set for learning
validation set for hyperparameter tuning
what is meant by ‘evaluation’ (workflow SL) (2)
check performance of tuned model (/validated model) on test set
estimate how well model will do in the real world
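A minimal sketch of how the train, validation and test sets from the workflow cards could be produced from one labelled data set. The function name, the 60/20/20 proportions and the fixed seed are assumptions for illustration, not something the lecture prescribes.

```python
import random


def train_val_test_split(examples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the examples and cut them into train / validation / test parts."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                 # only touched for the final evaluation
    val = shuffled[n_test:n_test + n_val]    # used for hyperparameter tuning
    train = shuffled[n_test + n_val:]        # used for learning
    return train, val, test


train, val, test = train_val_test_split(range(10))
```

The test set is kept apart until the very end so that it still gives an honest estimate of real-world performance.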
parameter or model tuning;
what is it?
for each value of hyperparameters you… (3)?
1. learning algorithms typically have settings (aka hyperparameters)
2a. apply algorithm to training set to learn
2b. check performance on validation set
2c. find/choose best-performing setting (aka hyperparameter value)
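Steps 2a to 2c can be sketched as a simple loop. The learner here is a tiny one-dimensional k-nearest-neighbour classifier with k as its hyperparameter; that choice, the function names and the toy data are all assumptions made for illustration (for k-NN, "learning" is just storing the training set).

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbours = sorted(train, key=lambda example: abs(example[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(train, evaluation, k):
    """Fraction of evaluation examples that are classified correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in evaluation)
    return correct / len(evaluation)


def tune_k(train, validation, candidate_ks):
    """2a/2b: score every candidate k on the validation set. 2c: keep the best."""
    scores = {k: accuracy(train, validation, k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)  # first candidate wins ties
    return best_k, scores


# Hypothetical toy data: (feature value, label); 2.5 is a noisy "spam" point.
train_set = [(1.0, "ham"), (2.0, "ham"), (3.0, "ham"), (2.5, "spam"),
             (8.0, "spam"), (9.0, "spam"), (10.0, "spam")]
validation_set = [(2.4, "ham"), (9.5, "spam")]

best_k, scores = tune_k(train_set, validation_set, candidate_ks=[1, 3])
```

With k = 1 the noisy point misleads the classifier on the validation set; k = 3 outvotes it, so the loop selects k = 3.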
label examples (3)
annotation guidelines
measure inter-annotator agreement
crowdsourcing
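One common way to measure inter-annotator agreement between two annotators is Cohen's kappa, which corrects the raw agreement for agreement expected by chance. The card does not name a specific metric, so using kappa here is an assumption, as are the function name and the toy labels.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Undefined (division by zero) when chance agreement is exactly 1,
    e.g. when both annotators always use the same single label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


annotator_1 = ["spam", "spam", "ham", "ham"]
full_agreement = cohens_kappa(annotator_1, ["spam", "spam", "ham", "ham"])
chance_agreement = cohens_kappa(annotator_1, ["spam", "ham", "spam", "ham"])
```

A kappa of 1 means perfect agreement; a kappa around 0 means the annotators agree no more often than chance would predict.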
Pearson’s r = correlation coefficient
See below for the difference between covariance and correlation
measures the strength of a LINEAR relationship, and of a linear relationship only (dependency)
correlation never implies causation; discovering a correlation can only suggest a causal relationship
what do the values of Pearson’s r mean?
Note
In statistics, when we talk about dependency, we are referring to any statistical relationship between two random variables or two sets of data.
1 = A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases.
between 0 and 1 = positive linear correlation (higher values of x tend to go with higher values of y)
0 = A value of 0 implies that there is no linear correlation between the variables.
NOTE that if X,Y are independent the correlation coefficient between X and Y is zero (X,Y uncorrelated). BUT, if the correlation coefficient between X and Y is zero (X, Y uncorrelated), that does not mean that X and Y are independent.
E.g. suppose Y = X^2, with X spread symmetrically around 0. Then Y is completely determined by X, so X and Y are perfectly dependent: there is a statistical relationship, just not a linear one.
between -1 and 0 = negative linear correlation (higher values of x tend to go with lower values of y)
-1 = A value of −1 implies that all data points lie on a line for which Y decreases as X increases
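The extreme values, and the Y = X^2 note, can be checked numerically. This is a small sketch; the helper name `pearson_r` and the data points are chosen for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from the deviations from the two means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


xs = [-2, -1, 0, 1, 2]                       # symmetric around 0
r_up = pearson_r(xs, [2 * x + 1 for x in xs])    # points on a rising line
r_down = pearson_r(xs, [-3 * x for x in xs])     # points on a falling line
r_squared = pearson_r(xs, [x ** 2 for x in xs])  # Y = X^2: dependent, not linear
```

Here `r_up` is 1 and `r_down` is -1 regardless of the slope, while `r_squared` is 0 even though Y is fully determined by X.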
pearson’s r visually
-1 or 1 = all points on a line pointing up (1) or down (-1); it does not matter how steep (|r| = 1 either way), as long as the line is not horizontal.
values between -1 and 0, or between 0 and 1 = some cloud of dots through which you can draw a line
0 = a round cloud of dots, points scattered really far apart, or some shape with no linear trend at all
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
formula pearson’s r
Pearson’s correlation coefficient is the covariance of the two variables divided by the product of their standard deviations.
covariance / product of standard deviations
The s.d. is the square root of the variance (not the same thing as the variance). Dividing the covariance by the product of the standard deviations makes correlation independent of units (i.e. not sensitive to scaling).
http://www.datasciencemadesimple.com/pearson-function-in-excel/
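A minimal sketch of the formula on this card: covariance divided by the product of the standard deviations, plus a check that rescaling a variable does not change r. The heights and weights are invented numbers, and the helper names are assumptions; `statistics.mean` and `statistics.stdev` come from the Python standard library (`stdev` is the sample standard deviation, dividing by N - 1).

```python
import statistics


def sample_covariance(xs, ys):
    """Sample covariance: summed products of deviations, divided by N - 1."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


def pearson_r(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return sample_covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


# Made-up example data, only to show the effect of changing units.
heights_m = [1.60, 1.70, 1.80, 1.90]
weights_kg = [55.0, 68.0, 72.0, 90.0]
heights_cm = [100 * h for h in heights_m]

cov_m = sample_covariance(heights_m, weights_kg)
cov_cm = sample_covariance(heights_cm, weights_kg)   # 100 times larger
r_m = pearson_r(heights_m, weights_kg)
r_cm = pearson_r(heights_cm, weights_kg)             # unchanged
```

The covariance grows by a factor of 100 when metres become centimetres, but r stays the same, which is why r is easier to compare across data sets.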
covariance;
meaning
formula (if sample)
measure of joint variability of two variables.
Sum( (x_i - mean of X) * (y_i - mean of Y) ) / (N - 1)
variance
average squared distance of the data points from their mean.
To calculate the (sample) variance: take each data point’s deviation from the mean, square it, sum all the squared deviations, and divide by N - 1
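For reference, the sample formulas from the last few cards written out in one consistent notation. The symbols (x_i, y_i for the data points, bars for means, s for sample statistics) are a notational choice made here, not taken from the lecture.

```latex
\begin{align*}
\bar{x} &= \frac{1}{N}\sum_{i=1}^{N} x_i
  && \text{(mean)} \\
s_X^2 &= \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2
  && \text{(sample variance)} \\
s_X &= \sqrt{s_X^2}
  && \text{(sample standard deviation)} \\
s_{XY} &= \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})
  && \text{(sample covariance)} \\
r &= \frac{s_{XY}}{s_X \, s_Y}
  && \text{(Pearson's } r\text{)}
\end{align*}
```

Note that the covariance of a variable with itself is its variance, and that the factors 1/(N-1) cancel in r.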