Lecture 3 - Machine Learning Flashcards
What is supervised learning?
- Machine learns by using LABELLED data
- Uses regression and classification
- Maps labelled input to known output.
- Linear regression, logistic regression, KNN
What is unsupervised learning?
- Machine is trained using unlabelled data, without guidance.
- Uses association and clustering
- Finds patterns and discovers the output on its own.
- K-means, C-means, etc.
What is reinforcement learning?
- An agent interacts with its environment by producing actions and discovers errors and rewards.
- Uses a reward-based approach.
- No pre-defined data
- Follows a trial-and-error method - Q-learning, etc.
What is the sum rule in probability?
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
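A quick worked example (invented, not from the slides): roll a fair die with A = "even number" and B = "greater than 3". Then P(A) = 1/2, P(B) = 1/2, P(A ∩ B) = P({4, 6}) = 1/3, so P(A ∪ B) = 1/2 + 1/2 - 1/3 = 2/3.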
What is posterior/conditional probability P(A|B)
Probability of an event given that you know that B is true (B = some evidence)
e.g. P(rain today | cloudy) = 0.8
i.e. your belief about A given that you know B
P(A|B)=P(A ∩ B)/P(B)
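For instance (made-up numbers): if P(A ∩ B) = 0.2 and P(B) = 0.4, then P(A|B) = 0.2 / 0.4 = 0.5.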
What is Bayes Theorem?
P(A|B) = (P(B|A)*P(A))/P(B)
Bayes Reasoning Formula
best hypothesis = argmax_Hi P(Hi|E) = argmax_Hi P(Hi) × P(E|Hi) / P(E)
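A worked example with invented numbers: suppose P(SPAM) = 0.3, P(HAM) = 0.7, P("free"|SPAM) = 0.5 and P("free"|HAM) = 0.1. For a message containing "free", P(SPAM) × P("free"|SPAM) = 0.15 beats P(HAM) × P("free"|HAM) = 0.07, so the argmax picks SPAM. Note that P(E) is the same for every hypothesis, so it can be dropped from the argmax.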
What are smoothed probabilities?
what if we have a P(wi|cj) = 0…?
ex. the word “dumbo” never appeared in the class SPAM?
then P(“dumbo”| SPAM) = 0
so if a text contains the word “dumbo”, the class SPAM is
completely ruled out !
to solve this: we assume that every word always appears at
least once (or a smaller value, like 0.5)
ex: add-1 smoothing:
P(wi | cj) = (frequency of wi in cj + 1) / (total number of words in cj + size of vocabulary)
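A minimal sketch of add-1 (or add-delta) smoothing in Python; the word counts and vocabulary size are made-up illustrations, not from the lecture:

```python
from collections import Counter

def smoothed_prob(word, class_word_counts, vocab_size, delta=1.0):
    """Add-delta smoothed estimate of P(word | class)."""
    total = sum(class_word_counts.values())
    # Every word is assumed to appear at least delta times,
    # so unseen words get a small non-zero probability.
    return (class_word_counts[word] + delta) / (total + delta * vocab_size)

# Hypothetical counts for the class SPAM
spam_counts = Counter({"free": 30, "win": 20, "money": 50})
vocab_size = 1000

print(smoothed_prob("dumbo", spam_counts, vocab_size))  # small, but not 0
print(smoothed_prob("free", spam_counts, vocab_size))
```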
Instead of multiplying the probabilities, we can use logs to…?
if we really do the product of probabilities…
argmax_cj P(cj) × ∏i P(wi|cj)
we soon have numerical underflow…
ex: 0.01 x 0.02 x 0.05 x …
so instead, we add the log of the probs
argmax_cj log(P(cj)) + Σi log(P(wi|cj))
ex: log(0.01) + log(0.02) + log(0.05) + …
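To see the underflow in action, here is a small Python comparison (the probability values are arbitrary):

```python
import math

probs = [0.01] * 200  # 200 word probabilities, each 0.01

# The direct product underflows to exactly 0.0 in floating point
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0  (the true value, 1e-400, is below the float range)

# Summing logs stays perfectly representable
log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # about -921.0; the argmax over classes is unchanged
```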
Benefits and problems with Naive Bayes
- Makes a strong assumption of conditional independence that is often incorrect
ex: the word "ambulance" is not conditionally independent of the word "accident" given the class SPORTS
BUT:
- surprisingly very effective on real-world tasks
- basis of many spam filters
- fast, simple, easy to apply
- gives confidence in its class predictions (i.e., the scores)
- often used as a baseline algorithm before trying other methods
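As a usage illustration (my choice of library, not the lecture's): scikit-learn's MultinomialNB implements this kind of classifier, works with log-probabilities internally, and exposes the smoothing value as alpha. The tiny corpus below is invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win free money now", "meeting at noon",
         "free prize win", "lunch with the team"]
labels = ["SPAM", "HAM", "SPAM", "HAM"]

vec = CountVectorizer()
X = vec.fit_transform(texts)      # bag-of-words counts
clf = MultinomialNB(alpha=1.0)    # alpha=1.0 is add-1 smoothing
clf.fit(X, labels)

print(clf.predict(vec.transform(["free money"])))        # likely ['SPAM']
print(clf.predict_proba(vec.transform(["free money"])))  # class confidence scores
```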
How to split machine learning data?
Split the data set into 3 sub-sets:
1. Actual training set (~80% of the training portion)
2. Validation set (~20% of the training portion)
3. Test set (~20% of all the data)
Sets 1 and 2 together make up the remaining 80% of the data.
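One common way to get this split is scikit-learn's train_test_split, applied twice (a sketch with dummy data; the 80/20 ratios follow the card):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(200).reshape(100, 2), np.arange(100) % 2  # dummy data

# First hold out the test set: 20% of all the data
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remaining 80% into training and validation sets (80/20 again)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=0)
```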
What are the steps to training a model?
- Collect a large set of examples (all with correct classifications)
- Divide the collection into training, validation and test sets
- Loop:
  - Apply the learning algorithm to the training set to learn the parameters
  - Measure performance on the validation set, and adjust the hyper-parameters to improve it
- Finally, measure performance on the test set
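A hedged sketch of that loop in Python, reusing the X_train/X_val/X_test split from the previous card's example; the decision tree and its max_depth hyper-parameter are illustrative choices, not the lecture's:

```python
from sklearn.tree import DecisionTreeClassifier

best_score, best_depth = -1.0, None
for depth in [2, 4, 8, 16]:                        # candidate hyper-parameter values
    model = DecisionTreeClassifier(max_depth=depth)
    model.fit(X_train, y_train)                    # learn parameters on the training set
    score = model.score(X_val, y_val)              # measure on the validation set
    if score > best_score:
        best_score, best_depth = score, depth

# Retrain with the best hyper-parameter, then measure once on the test set
final = DecisionTreeClassifier(max_depth=best_depth).fit(X_train, y_train)
print(final.score(X_test, y_test))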
Parameters vs hyper-parameters
Parameters: basic values learned by the ML model, e.g.
- for NB: prior & conditional probabilities
- for DTs: features to split on
- for ANNs: weights
Hyper-parameters: parameters used to set up the ML model, e.g.
- for NB: value of delta for smoothing
- for DTs: pruning level
- for ANNs: number of hidden layers, number of nodes per layer…
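A concrete illustration of the distinction, again with a scikit-learn decision tree (a hypothetical choice, reusing X_train/y_train from the split sketch above):

```python
from sklearn.tree import DecisionTreeClassifier

# Hyper-parameter: chosen by us BEFORE training
clf = DecisionTreeClassifier(max_depth=3)

clf.fit(X_train, y_train)

# Parameters: learned FROM the data during fit
print(clf.tree_.feature)    # which feature each node splits on
print(clf.tree_.threshold)  # the learned split thresholds
```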
What are the metrics in machine learning?
- Accuracy: % of instances of the test set the algorithm correctly classifies; used when all classes are equally important and equally represented
- Recall, Precision & F-measure: used when one class is more important than the others
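A minimal sketch computing these metrics from raw counts, using the standard definitions (the TP/FP/FN numbers are invented):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # of predicted positives, how many are right
    recall = tp / (tp + fn)      # of actual positives, how many we found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

print(precision_recall_f1(tp=40, fp=10, fn=20))  # (0.8, ~0.667, ~0.727)
```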
Why can Accuracy be misleading?
Problem: when one class C is more important than the others, e.g. when the data set is unbalanced, a classifier can get a high accuracy simply by always predicting the majority class while never finding the minority one.
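A made-up illustration: on a data set of 95 non-spam and 5 spam emails, a classifier that always answers "not spam" reaches 95% accuracy, yet its recall on the spam class is 0%.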