Lecture 1 Flashcards
Key aspects of Data Mining
What is data mining? (see last card)
Trade-off between processing time and memory
Computers as a tool for coping with growing amounts of data
From unstructured data to structured knowledge
What characterizes a large amount of data (big data)?
volume
variety
velocity
3 characteristics of volume (big data)
too big for manual analysis
too big to store in RAM
too big to store on disk
3 characteristics of variety (big data)
variance
outliers, confounders, noise
different data types
2 characteristics of velocity (big data)
results before data changes
streaming data
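A minimal sketch of how streaming data can be processed without holding it all in memory: a running mean that is updated one value at a time. The function name `running_mean` is hypothetical and chosen only for this illustration.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, one value at a time.

    Only the count and the current mean are kept in memory,
    so the full stream never has to be stored.
    """
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: move the mean toward the new value.
        mean += (value - mean) / count
        yield mean


print(list(running_mean([2, 4, 6])))  # [2.0, 3.0, 4.0]
```

This trades a small amount of processing per value for constant memory use, which is one way to read the time/memory trade-off.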
What makes predictions possible?
associations between features/target
numerical: correlation coefficient
categorical: mutual information
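As an illustration of the two association measures named on this card, here is a minimal sketch in plain Python. It assumes the Pearson correlation coefficient for numerical data and mutual information measured in bits for categorical data; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient between two numerical sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mutual_information(xs, ys):
    """Mutual information (in bits) between two categorical sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y).
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy data, invented for illustration.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(mutual_information(["a", "a", "b", "b"], ["x", "x", "y", "y"]))
print(mutual_information(["a", "a", "b", "b"], ["x", "y", "x", "y"]))
```

In the toy data the first pair of sequences is perfectly linearly related, the second pair of categorical sequences determines each other completely, and the third pair is independent.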
Supervised learning (2 types)
regression (predictors)
classification (classifiers)
Unsupervised learning (2 types)
clustering
dimensionality reduction
Learning
A program is said to learn from experience E with respect to task T and performance measure P if its performance at task T, as measured by P, improves with experience E.
Suppose your email program watches which emails you do or do not mark as spam and based on that learns how to better filter spam. What is E, T and P?
E = watching you label emails; T = classifying emails as spam/ham; P = the number (or fraction) of emails correctly classified as spam/ham
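The performance measure P from this card can be sketched as a simple accuracy computation. The labels below are hypothetical and invented only to show P improving after the filter has seen some experience E.

```python
def accuracy(predicted, actual):
    """P: fraction of emails whose predicted label matches the user's label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels, invented for illustration.
actual = ["spam", "ham", "ham", "spam"]  # what the user marked
before = ["ham", "ham", "ham", "ham"]    # filter output before any experience E
after = ["spam", "ham", "ham", "ham"]    # filter output after some experience E

print(accuracy(before, actual))  # 0.5
print(accuracy(after, actual))   # 0.75
```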
characteristics of supervised learning
trying to predict a specific quantity (e.g. tomorrow's Dow Jones value, or whether an e-mail is spam or ham)
have training examples with labels
can measure accuracy directly
characteristics of unsupervised learning
not looking for something specific, you want to ‘understand the data’
looking for structured (or unstructured) patterns
does not require labeled data
evaluation usually indirect or qualitative
description of supervised learning
We label the data manually, and these labels are what we want to predict as well as possible. The algorithm is given supervision: examples of what you want it to output.
workflow of supervised learning
collect data
label the data manually (target variable)
choose representation
train the model to learn
evaluate
what is meant by ‘representation’ (workflow SL)
feature selection
(possibly) convert to feature vector
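One possible way to picture the two steps on this card is a bag-of-words sketch: a small vocabulary stands in for feature selection, and each text is converted to a vector of word counts. The vocabulary, the function name `to_feature_vector`, and the example text are assumptions made for illustration, not part of the lecture.

```python
def to_feature_vector(text, vocabulary):
    """Count how often each selected word occurs in the text."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]


# Hypothetical selected features (the "feature selection" step).
vocabulary = ["free", "money", "meeting"]

print(to_feature_vector("Free money free prizes", vocabulary))  # [2, 1, 0]
```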
Split the set into …. and …. for ‘train model’ (workflow SL)
train set for learning
validation set for hyperparameter tuning
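A minimal sketch of such a split, assuming a random shuffle and a hypothetical 80/20 ratio (the card does not specify proportions). The function name and its parameters are chosen only for this illustration.

```python
import random


def split_train_validation(examples, validation_fraction=0.2, seed=0):
    """Shuffle the examples and split them into a train and a validation set."""
    shuffled = list(examples)
    # A fixed seed makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    train = shuffled[n_validation:]
    validation = shuffled[:n_validation]
    return train, validation


train, validation = split_train_validation(range(10))
print(len(train), len(validation))  # 8 2
```

The model is fitted on `train` only; `validation` is held back to compare hyperparameter settings.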