Chapter 4: Components of Learning Flashcards
How is data usually represented?
As a matrix
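As a small illustration, the matrix layout usually means one row per record and one column per attribute. The sketch below uses plain Python lists; the column names and values are made up for the example.

```python
# Each row is one record; each column is one attribute.
# Column names and values are hypothetical, chosen only for illustration.
columns = ["height_cm", "weight_kg", "age_years"]
X = [
    [170.0, 65.0, 34.0],
    [182.0, 80.0, 41.0],
    [158.0, 52.0, 27.0],
]

n_rows = len(X)      # number of records
n_cols = len(X[0])   # number of attributes

# Read one attribute (a column) across all records.
ages = [row[columns.index("age_years")] for row in X]
print(n_rows, n_cols, ages)
```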
Data scientists spend the most time
cleaning data
Target function
An unknown function that maps X to Y. The goal of learning is to approximate it as closely as possible
Learning job
Create a hypothesis function that also maps X to Y and behaves as much like the target function as possible
Learning steps (5)
- We have an unknown target function F which maps X to Y
- We have training examples, which are fed to a learning algorithm
- The learning algorithm has a set of candidate hypotheses to choose from
- Using the training examples, the learning algorithm selects a final hypothesis from that set
- We hope this final hypothesis is very close to the target function
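The five steps can be sketched in a toy setting. Everything here is an assumption made for illustration: the stand-in `target` function (in practice it is unknown), the threshold-rule hypothesis set, and the "fewest training errors" selection rule.

```python
# Stand-in for the unknown target function F: X -> Y.
# In practice we never see F, only examples produced by it.
def target(x):
    return 1 if x >= 5 else 0

# Training examples (x, y).
train_x = [1, 2, 4, 6, 8, 9]
training = [(x, target(x)) for x in train_x]

# Hypothesis set: one threshold rule per candidate threshold.
def make_hypothesis(threshold):
    return lambda x: 1 if x >= threshold else 0

thresholds = range(0, 11)

def training_errors(h):
    """Count training examples that hypothesis h gets wrong."""
    return sum(1 for x, y in training if h(x) != y)

# Learning algorithm: pick the hypothesis with the fewest training errors.
best_threshold = min(thresholds, key=lambda t: training_errors(make_hypothesis(t)))
g = make_hypothesis(best_threshold)  # final hypothesis

print(best_threshold, training_errors(g))
```

Note that more than one threshold fits these training examples perfectly, which is why the final hypothesis is only hoped to be close to the target function.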
Learner input output
seen data as input, classifier as output
Classifier input output
unseen data, response to that data
Model is an
artifact: the learner builds the model, and the classifier uses that model to make predictions
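A minimal sketch of the learner / model / classifier split. The nearest-class-mean rule and the function names `learner` and `classifier` are assumptions chosen to keep the example short, not a prescribed method.

```python
def learner(seen_x, seen_y):
    """Learner: takes seen (labeled) data and builds a model."""
    sums, counts = {}, {}
    for x, y in zip(seen_x, seen_y):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    # The model is just an artifact: here, the mean value of each class.
    return {label: sums[label] / counts[label] for label in sums}

def classifier(model, unseen_x):
    """Classifier: uses the model to respond to unseen data."""
    return min(model, key=lambda label: abs(unseen_x - model[label]))

model = learner(
    [1.0, 2.0, 3.0, 8.0, 9.0, 10.0],
    ["small", "small", "small", "large", "large", "large"],
)
print(model)
print(classifier(model, 2.5), classifier(model, 7.0))
```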
Curse of dimensionality
The challenges and complications that come from data with very many dimensions: the number of possible cases grows exponentially with the number of attributes, so no realistic data set can cover every single case
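A rough way to see the growth: if each attribute is cut into k intervals, d attributes give k^d distinct cells. The data set size and interval count below are hypothetical numbers for illustration.

```python
def cell_count(k, d):
    """Number of distinct cells when each of d attributes has k intervals."""
    return k ** d

n_records = 1000  # hypothetical data set size
k = 10            # hypothetical number of intervals per attribute

for d in (1, 2, 3, 6, 10):
    cells = cell_count(k, d)
    # Each record occupies at most one cell, so coverage is capped.
    max_coverage = min(1.0, n_records / cells)
    print(d, cells, max_coverage)
```

With 3 attributes the 1000 records could in principle fill every cell; with 6 attributes they can cover at most a tiny fraction of them.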
Generalizing
Being able to adapt to data that the model has not seen before
Selection
Selecting the data you need
Preprocessing
Clean data and understand what you need to remove
Transformation
Transform it into the shape you want, add/remove attributes
Data Mining
Get patterns from the data
Supervised learning
Model is trained on labeled data (input-output pairs). Algorithm learns to map inputs to outputs
Unsupervised Learning
Model is trained on unlabeled data. Algorithm looks for patterns or structure in the data
Reinforcement learning
Algorithm receives feedback in form of rewards or penalties. Goal is to maximize reward over time
Binarization
Converting continuous or categorical data into binary form
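Two common ways to binarize, sketched in plain Python: thresholding a continuous value, and giving a categorical attribute one binary column per category (one-hot encoding). The helper names and the 0.5 threshold are assumptions for the example.

```python
def binarize_continuous(values, threshold):
    """Map each number to 1 if it is above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

def binarize_categorical(values):
    """One binary column per category (one-hot encoding)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

bits = binarize_continuous([0.2, 0.7, 0.5, 0.9], threshold=0.5)
categories, one_hot = binarize_categorical(["red", "green", "red", "blue"])
print(bits)
print(categories, one_hot)
```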
Discretization
Converting continuous data into discrete categories or intervals
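A minimal sketch of discretization using the standard library's `bisect` module. The cut points (18 and 65) and the interval labels are made up for illustration.

```python
import bisect

def discretize(values, edges, labels):
    """Assign each value to the interval it falls in.

    edges are the interior cut points in increasing order;
    labels has len(edges) + 1 entries, one per interval.
    A value equal to a cut point goes to the interval on its right.
    """
    return [labels[bisect.bisect_right(edges, v)] for v in values]

ages = [3, 15, 18, 42, 70]
groups = discretize(ages, edges=[18, 65], labels=["minor", "adult", "senior"])
print(groups)
```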
Classification
The task of learning a target function that maps each attribute set x to one of the predefined class labels
How do decision trees work?
A tree induction algorithm takes training data and learns a model (the tree) from it. The model is then applied to new records (deduction) to predict their responses
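Are decision trees always binary?
No. Some algorithms, such as CART, produce only binary splits, but others, such as ID3 and C4.5, allow multiway splits on categorical attributes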
Hunt's Algorithm
Let Dt be the set of training records that reach a node t
If all records in Dt belong to the same class Yt, then t is a leaf node labeled Yt
If Dt is an empty set, then t is a leaf node labeled with the default class Yd
If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
Recursively apply the procedure to each subset
What is the default class?
The class that is most frequent in the data set
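The steps of Hunt's algorithm can be sketched as a short recursive function. Several parts are assumptions for illustration: the tree representation, the toy data, splitting on the first unused attribute (real implementations choose the test with an impurity measure), and falling back to the majority class when no attribute is left to test.

```python
from collections import Counter

def majority_class(records):
    """Most frequent class label among (attributes, label) records."""
    return Counter(label for _, label in records).most_common(1)[0][0]

def hunt(records, domains, default):
    """Grow a tree from the records Dt that reach the current node.

    domains maps each unused attribute to its possible values.
    """
    if not records:              # Dt is empty: leaf with the default class
        return default
    labels = {label for _, label in records}
    if len(labels) == 1:         # all records share one class: leaf
        return labels.pop()
    if not domains:              # assumption: no test left, use the majority
        return majority_class(records)
    # Assumption: split on the first unused attribute.
    attr = next(iter(domains))
    rest = {a: v for a, v in domains.items() if a != attr}
    children = {}
    for value in domains[attr]:
        subset = [r for r in records if r[0][attr] == value]
        children[value] = hunt(subset, rest, default)  # recurse on each subset
    return (attr, children)

def predict(tree, attributes):
    """Follow attribute tests from the root until a leaf label is reached."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[attributes[attr]]
    return tree

# Hypothetical toy data: (attributes, class label).
data = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
domains = {"outlook": ["sunny", "rainy", "overcast"], "windy": ["no", "yes"]}
default = majority_class(data)   # most frequent class in the data set
tree = hunt(data, domains, default)
print(tree)
print(predict(tree, {"outlook": "overcast", "windy": "no"}))
```

No training record has `outlook` equal to `overcast`, so that branch reaches an empty Dt and becomes a leaf labeled with the default class.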