Lecture 3-Intro to ML Flashcards
Why Machine Learning(4)?
-Increase in Data Generation
-Improve Decision Making
-Uncover patterns and trends in data
-Solve complex problems
When was ML created, and by whom?
1959, by Arthur Samuel (who coined the term "machine learning")
What is the ML process for training data(8)?
1.Dataset
2.Data cleaning
3.Feature Engineering
4.Training data
5.Learning algorithm
6.Train model
7.Score model
8.Evaluate model
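A minimal sketch of the eight steps above, using only the Python standard library. The toy data set, the BMI feature, and the nearest-centroid learner are all assumptions chosen for illustration; the lecture does not prescribe a particular algorithm. For brevity the model is evaluated on its own training data, which only shows the mechanics and says nothing about how it would do on unseen examples.

```python
# Step 1: hypothetical toy data set: (height_cm, weight_kg, label).
# None marks a missing value.
raw_dataset = [
    (150.0, 50.0, "small"),
    (155.0, None, "small"),   # incomplete row, removed during cleaning
    (160.0, 55.0, "small"),
    (180.0, 85.0, "large"),
    (185.0, 90.0, "large"),
    (190.0, 95.0, "large"),
]

def clean(rows):
    """Step 2: drop rows that contain missing values."""
    return [r for r in rows if None not in r]

def engineer_features(rows):
    """Step 3: derive one feature (body-mass index) from the raw columns."""
    return [((w / (h / 100) ** 2,), label) for h, w, label in rows]

def train(examples):
    """Steps 5-6: nearest-centroid learner; the model is the mean feature value per class."""
    sums, counts = {}, {}
    for (x,), label in examples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def score(model, features):
    """Step 7: predict the class whose centroid is closest to the example."""
    (x,) = features
    return min(model, key=lambda label: abs(model[label] - x))

def evaluate(model, examples):
    """Step 8: fraction of examples classified correctly."""
    correct = sum(score(model, f) == label for f, label in examples)
    return correct / len(examples)

training_data = engineer_features(clean(raw_dataset))  # step 4
model = train(training_data)
training_accuracy = evaluate(model, training_data)
print(model, training_accuracy)
```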
What is the ML process for new data(6)?
1.Dataset
2.Data cleaning
3.Feature Engineering
4.New data
5.Score model
6.Evaluate model
Which task takes the most time in the ML process?
Data cleaning takes 80-90% of the time
What are the 3 types of Machine Learning?
-Supervised learning
-Unsupervised learning
-Reinforcement learning
What is supervised learning?
The machine learns by using labelled data
What is unsupervised learning?
The machine is trained on unlabeled data without any guidance
What is reinforcement learning?
An agent interacts with its environment by taking actions and learns from the errors and rewards that result
What does EDA stand for in supervised and unsupervised learning?
Exploratory Data Analysis
What is ML widely used in?
-In data mining aka Knowledge Discovery in Databases(KDD)
Examples: clustering, anomaly detection, association rule mining
What is prior(or unconditional) probability?
Probability of an event before any evidence is obtained
What is posterior(or conditional) probability?
Probability of an event given that you know that some evidence is true
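A small worked example of going from a prior to a posterior with Bayes' theorem. The spam scenario and all three probabilities are made-up numbers for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """P(H | E) by Bayes' theorem for a binary hypothesis H and evidence E."""
    # Total probability of seeing the evidence, whether or not H is true.
    p_evidence = p_evidence_given_h * prior + p_evidence_given_not_h * (1 - prior)
    return p_evidence_given_h * prior / p_evidence

prior_spam = 0.2           # P(spam): belief before looking at the message
p_word_given_spam = 0.6    # P("free" appears | spam)
p_word_given_ham = 0.05    # P("free" appears | not spam)

posterior_spam = posterior(prior_spam, p_word_given_spam, p_word_given_ham)
print(round(posterior_spam, 2))  # 0.75
```

Seeing the evidence raises the probability of spam from the prior of 0.2 to a posterior of about 0.75.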
What is Naive Bayes Classifier?
A simple probabilistic classifier based on Bayes’ theorem where:
-a strong independence assumption is made (it often does not hold)
-the features/attributes are assumed to be conditionally independent given the class
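A minimal sketch of a Naive Bayes classifier for categorical features, standard library only. The weather data is hypothetical, and add-one (Laplace) smoothing is an assumption added here so that unseen feature values do not produce zero probabilities; it is not something the card above states. The conditional independence assumption shows up as the per-feature probabilities simply being multiplied (summed in log space).

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(rows, labels):
    """Count class frequencies and, per class, how often each feature value occurs."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature_index, label) -> value -> count
    vocab = defaultdict(set)             # feature_index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab

def predict_proba(model, row):
    """Return P(class | row) for every class, assuming independent features."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        log_p = math.log(n / total)  # log prior P(class)
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # Add-one smoothed estimate of P(value | class).
            log_p += math.log((count + 1) / (n + len(vocab[i])))
        log_scores[label] = log_p
    # Normalise the scores so they sum to 1.
    top = max(log_scores.values())
    unnormalised = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalised.values())
    return {c: v / z for c, v in unnormalised.items()}

# Hypothetical training data: (outlook, temperature) -> play outside?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot"), ("overcast", "cool")]
labels = ["no", "no", "yes", "yes", "yes", "yes"]

model = train_naive_bayes(rows, labels)
probs = predict_proba(model, ("sunny", "hot"))
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

The returned probabilities are the confidence the classifier attaches to each class; the prediction is the class with the highest one.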
What are 4 pros of Naive Bayes Classification?
-Very effective on real-world tasks
-Used as baseline algo before trying other methods
-Fast, simple
-Gives confidence in its class predictions
What is the main con in Naive Bayes Classification?
-Makes a strong assumption of conditional independence that is often INCORRECT
How do we evaluate a learning model, i.e. check that what was learned is correct?
Run the classifier on a data set of unseen examples (not used for training) for which you know the correct classification
What are the 3 sub-sets we can divide the data set into?
1.Actual training set(~80% of the training data)
2.Validation set(~20% of the training data)
3.Test set(~20% of the full data set, held out)
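A minimal sketch of a three-way split, standard library only. `three_way_split` is a hypothetical helper, and the fractions are parameters: the defaults used here (hold out 20% for testing, then take 20% of the remainder for validation) are an illustrative assumption, not a rule.

```python
import random

def three_way_split(examples, test_fraction=0.2, validation_fraction=0.2, seed=0):
    """Shuffle, hold out a test set, then split the rest into training and validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_test = round(len(shuffled) * test_fraction)
    test, remainder = shuffled[:n_test], shuffled[n_test:]

    n_validation = round(len(remainder) * validation_fraction)
    validation, training = remainder[:n_validation], remainder[n_validation:]
    return training, validation, test

training, validation, test = three_way_split(range(100))
print(len(training), len(validation), len(test))  # 64 16 20
```

The model is fitted on the training set, tuned against the validation set, and scored once on the test set at the end.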
What are the metrics used when evaluating a learning model?
-Accuracy
-Recall
-Precision
-F-measure
What is the def of accuracy?
-% of instances of the test set the algo correctly classifies
-How many % were correct overall?
What is the definition of recall?
What % of the instances of class C were found correctly?
What is the def of precision?
Of the detected instances of C, what % were correct?
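A minimal sketch computing the four metrics for one class of interest C. The labels and predictions are made up, `classification_metrics` is a hypothetical helper, and the F-measure is taken to be the balanced F1 score (the harmonic mean of precision and recall), which is an assumption since the cards do not give its formula.

```python
def classification_metrics(actual, predicted, positive):
    """Accuracy, precision, recall and F1 with `positive` as the class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # C found correctly
    fp = sum(a != positive and p == positive for a, p in pairs)  # wrongly flagged as C
    fn = sum(a == positive and p != positive for a, p in pairs)  # instances of C missed

    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical test set: 4 instances of C, 6 of another class.
actual    = ["C", "C", "C", "C", "other", "other", "other", "other", "other", "other"]
predicted = ["C", "C", "C", "other", "C", "other", "other", "other", "other", "other"]

metrics = classification_metrics(actual, predicted, positive="C")
print(metrics)
```

Here 3 of the 4 real instances of C are found (recall 0.75), 3 of the 4 instances flagged as C are right (precision 0.75), and 8 of 10 predictions are correct overall (accuracy 0.8).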
When to use accuracy?
When all classes are equally important and equally represented
When to use recall, precision & f-measure?
When one class is more important than the others