Lecture 1 Flashcards

1
Q

Key aspects of data mining

(What is data mining? See the last card.)

A

Trade-off between processing time and memory

Computers as a tool to handle growing amounts of data

From unstructured data to structured knowledge

2
Q

What makes data ‘big data’? (3)

A

volume
variety
velocity

3
Q

3 characteristics of volume (big data)

A

too big for manual analysis
too big to store in RAM
too big to store on disk

4
Q

3 characteristics of variety (big data)

A

variance
outliers, confounders, noise
different data types

5
Q

2 characteristics of velocity (big data)

A

results before data changes

streaming data

6
Q

What makes predictions possible?

A

associations between features/target

numerical: correlation coefficient
categorical: mutual information
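
Example (a minimal Python sketch, not from the lecture): mutual information between two categorical variables, computed from co-occurrence counts. The toy lists `weather`, `played` and `coin` are made up for illustration. A feature that fully determines the target gives 1 bit here, an unrelated feature gives 0 bits.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information (in bits) between two categorical sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) simplifies to c * n / (count_x * count_y)
        mi += p_xy * log2(c * n / (count_x[x] * count_y[y]))
    return mi

# Hypothetical toy data
weather = ["sun", "sun", "rain", "rain"]
played = ["yes", "yes", "no", "no"]   # fully determined by weather
coin = ["h", "t", "h", "t"]           # unrelated to weather

mi_informative = mutual_information(weather, played)
mi_uninformative = mutual_information(weather, coin)
print(mi_informative, mi_uninformative)
```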

7
Q

Supervised learning (2 types)

A

regression (predictor)

classification (classifiers)

8
Q

Unsupervised learning (2 types)

A

clustering

dimensionality reduction

9
Q

Learning

A

A program is said to learn from experience E with respect to a task T and a performance measure P if its performance at task T, as measured by P, improves with E.

Suppose your email program watches which emails you do or do not mark as spam and based on that learns how to better filter spam. What are E, T and P?

E = Watching you label emails
T = Classifying emails as spam/ham
P = The number (or fraction) of emails correctly classified as spam/ham
10
Q

characteristics supervised learning

A

trying to predict a specific quantity (like tomorrow’s Dow Jones, or whether an e-mail is spam or ham)

have training examples with labels

can measure accuracy directly

11
Q

characteristics unsupervised learning

A

not looking for something specific, you want to ‘understand the data’

looking for structure or patterns in the data

does not require labeled data

evaluation usually indirect or qualitative

12
Q

description supervised learning

A

We label the data manually, and these labels are what we want to predict as well as possible. The algorithm is given supervision: examples of the output you want it to produce.

13
Q

workflow supervised learning

A

collect data

label the data manually (target variable)

choose representation

train the model to learn

evaluate

14
Q

what is meant by ‘representation’ (workflow SL)

A

feature selection

(possibly) convert to feature vector

15
Q

Split the set into …. and …. for ‘train model’ (workflow SL)

A

train set for learning

validation set for hyperparameter tuning

16
Q

what is meant by ‘evaluation’ (workflow SL) (2)

A

check performance of tuned model (/validated model) on test set

estimate how well model will do in the real world

17
Q

parameter or model tuning;

what is it?

for each value of hyperparameters you… (3)?

A
1. learning algorithms typically have settings (aka hyperparameters)

2a. apply algorithm to training set to learn
2b. check performance on validation set
2c. find/choose best-performing setting (aka hyperparameter value)
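
Example (a minimal Python sketch under stated assumptions, not the lecture's code): steps 2a to 2c with a 1-D k-nearest-neighbour classifier, where k is the hyperparameter. The classifier choice, the data points and the candidate values (1, 3, 5) are all hypothetical. The test set is used only once, after k has been chosen.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, data, k):
    hits = sum(knn_predict(train, x, k) == label for x, label in data)
    return hits / len(data)

# Hypothetical (feature, label) pairs; 2.2 and 3.8 are noisy points.
train_set = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (2.2, "b"),
             (4.0, "b"), (4.5, "b"), (5.0, "b"), (3.8, "a")]
validation_set = [(1.2, "a"), (2.3, "a"), (3.7, "b"), (4.8, "b")]
test_set = [(1.8, "a"), (4.4, "b")]

# 2a/2b: for each hyperparameter value, learn from the training set
# and check performance on the validation set.
scores = {k: accuracy(train_set, validation_set, k) for k in (1, 3, 5)}

# 2c: keep the best-performing setting (first one on ties).
best_k = max(scores, key=scores.get)

# Evaluation: estimate real-world performance on the untouched test set.
test_accuracy = accuracy(train_set, test_set, best_k)
print(scores, best_k, test_accuracy)
```

With k = 1 the noisy training points are copied directly into the predictions, which is why the validation score is lower for that setting.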

18
Q

label examples (3)

A

annotation guidelines

measure inter-annotator agreement

crowdsourcing
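
Example (a minimal Python sketch, not from the lecture): one common way to measure inter-annotator agreement is Cohen's kappa, which corrects the observed agreement for agreement expected by chance. The choice of kappa and the two annotation lists are assumptions for illustration. The sketch does not handle the edge case where chance agreement equals 1.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of the same 10 emails by two annotators.
annotator_1 = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "ham", "spam", "ham"]
annotator_2 = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```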

19
Q

Pearson’s r = correlation coefficient

See below for the difference between covariance and correlation

A

measures the strength of a LINEAR relationship, and of a LINEAR relationship only (dependency)

correlation never implies causation; discovering a correlation can only suggest a possible causal relationship

20
Q

what do the values of Pearson’s r mean?

Note
In statistics, when we talk about dependency, we are referring to any statistical relationship between two random variables or two sets of data.

A

1 = A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases.

between 0 and 1 = positive linear correlation (higher values of x tend to go with higher values of y)

0 = A value of 0 implies that there is no linear correlation between the variables.
NOTE that if X and Y are independent, the correlation coefficient between X and Y is zero (X, Y uncorrelated). BUT if the correlation coefficient between X and Y is zero (X, Y uncorrelated), that does not mean that X and Y are independent.
E.g. suppose Y = X^2, with X distributed symmetrically around 0. Then Y is completely determined by X, so X and Y are perfectly dependent: there is a statistical relationship, just not a linear one.

between −1 and 0 = negative linear correlation (higher values of x tend to go with lower values of y)

−1 = A value of −1 implies that all data points lie on a line for which Y decreases as X increases
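
Example (a minimal Python sketch, not from the lecture): the three landmark values of r on made-up data. The helper `pearson_r` is written out here for illustration; the x values are chosen symmetric around 0 so that Y = X^2 comes out uncorrelated with X.

```python
from statistics import mean

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    # The 1 / (N - 1) factors of covariance and variance cancel, so plain sums suffice.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

xs = [-3, -2, -1, 0, 1, 2, 3]                        # symmetric around 0
r_up = pearson_r(xs, [2 * x + 1 for x in xs])        # perfect increasing line
r_down = pearson_r(xs, [-0.5 * x for x in xs])       # perfect decreasing line
r_square = pearson_r(xs, [x ** 2 for x in xs])       # fully dependent, but not linearly
print(r_up, r_down, r_square)
```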
21
Q

Pearson’s r visually

A

−1 or 1 = all points on a straight line pointing down or up; it does not matter how steep the line is (|r| = 1 either way), as long as it is not horizontal.

values between −1 and 0 or between 0 and 1 = some cloud of dots through which you can draw a line

0 = a round cloud of dots, points scattered really far apart, or some figure without any linear trend

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

22
Q

formula Pearson’s r

A

Pearson’s correlation coefficient is the covariance of the two variables divided by the product of their standard deviations:

r = covariance(X, Y) / (standard deviation of X * standard deviation of Y)

The standard deviation is the square root of the variance. Dividing by the standard deviations makes the correlation independent of units (aka not sensitive to scaling).
http://www.datasciencemadesimple.com/pearson-function-in-excel/
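
Example (a minimal Python sketch, not from the lecture): the formula applied to made-up height and weight measurements. Changing the unit of height from metres to centimetres multiplies the covariance by 100 but leaves r unchanged. The data and the helper names are hypothetical.

```python
from statistics import mean, stdev

def sample_covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson_r(xs, ys):
    # covariance divided by the product of the (sample) standard deviations
    return sample_covariance(xs, ys) / (stdev(xs) * stdev(ys))

# Hypothetical measurements
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 68.0, 70.0, 80.0, 88.0]
height_cm = [100 * h for h in height_m]   # same data, different unit

cov_m = sample_covariance(height_m, weight_kg)
cov_cm = sample_covariance(height_cm, weight_kg)
r_m = pearson_r(height_m, weight_kg)
r_cm = pearson_r(height_cm, weight_kg)
print(cov_m, cov_cm)   # covariance depends on the unit
print(r_m, r_cm)       # correlation does not
```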

23
Q

covariance;

meaning
formula (if sample)

A

measure of the joint variability of two variables.

Sum((x_i − mean of X) * (y_i − mean of Y)) / (N − 1)

24
Q

variance

A

measure of how far the data points are spread out from their mean.

To calculate the (sample) variance, square each data point’s deviation from the mean, sum the squares and divide by N − 1:
Sum((x_i − mean of X)^2) / (N − 1)
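
Example (a minimal Python sketch, not from the lecture): the sample variance computed step by step on a made-up sample and compared with `statistics.variance` from the standard library, which also divides by N − 1.

```python
from statistics import mean, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]                         # hypothetical sample
m = mean(data)                                          # mean of the sample
squared_deviations = [(x - m) ** 2 for x in data]       # squared distance of each point from the mean
sample_variance = sum(squared_deviations) / (len(data) - 1)
standard_deviation = sample_variance ** 0.5             # back in the original unit

print(sample_variance, variance(data), standard_deviation)
```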

25
Q

Anscombe’s quartet

A

You would expect 4 datasets with the same summary statistics (mean, variance, correlation coefficient and best-fit line) to look the same visually.
But this is not the case. The effect of curvature and outliers can be huge.

Statistical descriptors are incomplete descriptors of the underlying data!
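
Example (a minimal Python sketch, not from the lecture): the summary statistics of the four datasets, using the values of the quartet as commonly published. The helper `pearson_r` is written out for illustration. The four rows come out practically identical, although a scatter plot of each dataset looks completely different.

```python
from statistics import mean, variance

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by datasets I to III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Per dataset: (mean of x, variance of x, mean of y, correlation), rounded for display.
summary = {
    name: (mean(xs), variance(xs), round(mean(ys), 2), round(pearson_r(xs, ys), 2))
    for name, (xs, ys) in quartet.items()
}
for name, stats in summary.items():
    print(name, stats)
```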

26
Q

regression models can be used to (2)

A

describe the relationship between random variables

predict the value of one variable based on another variable

27
Q

dependent variable

A

outcome, response, target (on y-axis)

28
Q

independent variable

A

input, predictors, features (on x-axis)

29
Q

a regression model models the relationship as a…..

A

parametrized function y = f(x)

x –> f() –> f(x)
you put x into some function f and get back the value f(x).

learning this function is the learning task; f could be a predictor or a classifier
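
Example (a minimal Python sketch, not from the lecture): a straight line f(x) = a * x + b as the parametrized function, with the parameters a and b estimated by ordinary least squares. The choice of a linear function and the data are assumptions for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares estimates of slope a and intercept b in f(x) = a * x + b."""
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Hypothetical data, roughly following y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

a, b = fit_line(xs, ys)      # learning = finding the parameters

def f(x):
    return a * x + b         # the fitted function

prediction = f(6)            # predict y for an unseen x
print(a, b, prediction)
```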

30
Q

classification

A

The model learns from the input data given to it and then uses what it has learned to classify new observations.

Each instance (point) has a class label and can be represented by a feature vector, e.g. [x1, x2] (in case the classes are defined by 2 features)

The classes are divided by a decision boundary

31
Q

Decision boundaries

A

A decision boundary is nothing more than the function of x (so f(x), the learning task, the classifier, the model).

This function, this classifier, is trained on the dataset (labelled data points) to ‘draw’ a division between classes. ITERATIVE PROCESS

The decision boundary is considered to be a model of the separation between classes.

NOTE
There is a difference between the terms algorithm and a model.
Algorithm = a mathematical technique or equation (that is, a framework) with parameters.
Model = equation that is formed by using data to find the parameters in the equation of an algorithm.
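
Example (a minimal Python sketch, not from the lecture): a perceptron that iteratively adjusts a linear decision boundary until the labelled points are separated. The perceptron is one possible algorithm, chosen here as an assumption; the fitted values of `w1`, `w2` and `bias` are the model. The data points are hypothetical.

```python
def train_perceptron(points, labels, epochs=20, learning_rate=1.0):
    """Perceptron algorithm: returns the learned parameters (w1, w2, bias)."""
    w1 = w2 = bias = 0.0
    for _ in range(epochs):                       # iterative process
        for (x1, x2), label in zip(points, labels):
            predicted = 1 if w1 * x1 + w2 * x2 + bias > 0 else -1
            if predicted != label:                # move the boundary after a mistake
                w1 += learning_rate * label * x1
                w2 += learning_rate * label * x2
                bias += learning_rate * label
    return w1, w2, bias

# Hypothetical, linearly separable 2-feature data with class labels -1 / +1.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
labels = [-1, -1, -1, 1, 1, 1]

w1, w2, bias = train_perceptron(points, labels)   # the model = these fitted parameters

def classify(x1, x2):
    # Decision boundary: the line w1 * x1 + w2 * x2 + bias = 0
    return 1 if w1 * x1 + w2 * x2 + bias > 0 else -1

training_accuracy = sum(classify(*p) == y for p, y in zip(points, labels)) / len(points)
print(w1, w2, bias, training_accuracy)
```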

32
Q

Decision boundaries can be 2 types;

A

linear (straight line)

non-linear (wiggly line); depends on the number of parameters and the polynomial degree

33
Q

Dimensionality reduction

A

feature selection (select relevant features)

feature extraction (define relevant features) e.g. PCA, FA for
• Image Processing: edge detection
• From pixels to reduced set of features

https://www.youtube.com/watch?v=LDhqqxOVqV0
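
Example (a minimal Python sketch, not from the lecture): feature extraction with PCA on made-up 2-feature data, reducing each point to a single coordinate. For two features the direction of largest variance has a closed-form angle, which is used here instead of a general eigenvalue routine. All names and data are hypothetical.

```python
from math import atan2, cos, sin
from statistics import mean

# Hypothetical 2-feature data lying close to a line.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]

mx = mean(p[0] for p in points)
my = mean(p[1] for p in points)
centered = [(x - mx, y - my) for x, y in points]

n = len(points)
cxx = sum(x * x for x, _ in centered) / (n - 1)   # variance of feature 1
cyy = sum(y * y for _, y in centered) / (n - 1)   # variance of feature 2
cxy = sum(x * y for x, y in centered) / (n - 1)   # covariance

# Angle of the first principal axis of a 2x2 covariance matrix.
theta = 0.5 * atan2(2 * cxy, cxx - cyy)
axis = (cos(theta), sin(theta))

# Each 2-D point becomes one extracted feature: its coordinate along the axis.
projected = [x * axis[0] + y * axis[1] for x, y in centered]

# Share of the total variance that the single extracted feature keeps.
explained = (sum(t * t for t in projected) / (n - 1)) / (cxx + cyy)
print(theta, explained)
```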

34
Q

feature selection (4) -dimensionality reduction

A

reduce complexity and easier interpretation

reduce demand on resources (computation/RAM)

reduce ‘curse of dimensionality’

reduce chance of overfitting

35
Q

Pattern mining

A

Identifying rules that describe specific patterns within the data. For example, supermarkets used market-basket analysis to identify items that were often purchased together—for instance, a store featuring a fish sale would also stock up on tartar sauce.
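
Example (a minimal Python sketch, not from the lecture): counting which items occur together in made-up shopping baskets. The measures `support` and `confidence` are the usual names from association-rule mining and are added here as an assumption; the baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets
baskets = [
    {"fish", "tartar sauce", "lemon"},
    {"fish", "tartar sauce"},
    {"fish", "bread"},
    {"bread", "milk"},
    {"fish", "tartar sauce", "milk"},
]

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

def support(item_a, item_b):
    """Fraction of all baskets that contain both items."""
    return pair_counts[tuple(sorted((item_a, item_b)))] / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets with `antecedent`, the fraction that also contain `consequent`."""
    return pair_counts[tuple(sorted((antecedent, consequent)))] / item_counts[antecedent]

rule_support = support("fish", "tartar sauce")
rule_confidence = confidence("fish", "tartar sauce")
print(rule_support, rule_confidence)
```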

36
Q

Difference between correlation and covariance

A

Covariance

  • Tries to look into and measure how X and Y change together.
  • tells us direction in which two variables vary with each other.
  • Covariance can be classified as positive covariance (large values of one variable are associated with large values of the other), negative covariance (large values of one variable are associated with small values of the other), or no trend

========================
Correlation.

  • Tells us both the direction and the magnitude of how X and Y vary with each other.
  • Serves as a scaled version of covariance: it makes covariance unitless, a kind of normalisation (same idea as going from variance to standard deviation)
  • three categories: positive, negative, or zero
37
Q

What is Data Mining

A

Process of using computer science (algorithms and database methods) and statistics, to find applicable, usable knowledge (like patterns, trends and relationships) in raw (big) data.

We apply an algorithm that “learns” something about the data. These algorithms are machine learning algorithms.

NOTE
There is a difference between the terms algorithm and a model.
Algorithm = a mathematical technique or equation (that is, a framework) with parameters.
Model = equation that is formed by using data to find the parameters in the equation of an algorithm.