Midsemester Exam Flashcards
To revise content for the midsemester exam
What’s the difference between Active and Passive learning?
Passive learning is where an expert tells you all the features that help you classify the data and you memorise them. Active learning is where an expert classifies a dataset for you and you discover the features for yourself.
What is the process of Classification?
- Get a training set of data
- Apply a learning algorithm on the training set to create a model.
- Apply the model to new, unseen data and classify it
What is one way you can classify data?
Using a decision tree
Why don’t you explicitly tell the model the characteristics of data?
This can be difficult to do. For example, how would you enumerate all the features that make an email spam? It is much easier to tell the model which emails are spam and which are not, and let it learn the features itself.
What is the difference between Classification and Clustering?
Classification is supervised learning: you are the expert who supplies the class labels that supervise the model. Clustering is unsupervised learning: there are no class labels, and the algorithm groups similar objects on its own.
What are the two steps in the classification process?
1. Model construction. 2. Model usage.
Explain the model construction step in the classification process.
To construct a model we need to
- get a training set
- Determine the class labels for the training set, e.g. salmon / not salmon
- Decide how to represent the model, e.g. decision trees, mathematical formulas, or rules
What do we do in the model usage step?
Check the accuracy of the model. How? By checking that the test samples’ class labels are correctly predicted.
What’s the difference between training data and testing data?
Training data and testing data are both partitions of the dataset. The model never sees the testing data’s class labels; its job is to predict them accurately.
What is one way of determining the accuracy of a model?
Use a confusion matrix to determine the accuracy of the model
How do you use a confusion matrix to determine the accuracy of a model?
General idea: The number of correctly classified tuples divided by all the tuples.
Accuracy = (T-Positive + T-Negative) / (T-Positive + F-Positive + T-Negative + F-Negative)
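The accuracy formula above can be sketched in a few lines of code (a minimal illustration; the function and variable names are my own, not from the course notes):

```python
# Accuracy from confusion-matrix counts: correctly classified tuples
# divided by all tuples.

def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """(TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)

# 90 true positives, 5 false positives, 900 true negatives, 5 false negatives
print(accuracy(90, 5, 900, 5))  # 0.99
```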
What is a disadvantage of the Confusion Matrix method of testing model accuracy?
Consider 10,000 fish to be classified, of which only 10 are salmon. A model that labels everything “not salmon” still scores 99.9% accuracy, yet it never detects a single salmon. With such a skewed class distribution, high accuracy can hide the fact that the model isn’t detecting the minority class at all.
What is precision and recall?
Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive?
Recall: completeness – what % of positive tuples did the classifier correctly label as positive?
How do we calculate precision?
Precision = True Positive / (True Positive + False Positive)
How do we calculate recall?
Recall = True Positive / (True Positive + False Negative)
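Both measures are easy to sketch in code (illustrative names, not from the notes):

```python
def precision(tp: int, fp: int) -> float:
    """Exactness: what fraction of tuples labelled positive are actually positive?"""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Completeness: what fraction of the actual positives did the classifier find?"""
    return tp / (tp + fn)

# 8 true positives, 2 false positives, 4 false negatives
print(precision(8, 2))  # 0.8
print(recall(8, 4))     # 0.666...
```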
What is the holdout method of estimation?
Randomly take 70% of the data as the training set and the remaining 30% as the testing set. Repeat this a few times with different random partitions and average the results.
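A minimal sketch of one holdout split, assuming a simple shuffle-then-cut approach (names are illustrative):

```python
import random

def holdout_split(data, train_fraction=0.7, seed=0):
    """Shuffle the data, then take the first 70% as training, the rest as testing."""
    rng = random.Random(seed)      # fixed seed only so the example is repeatable
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(range(10))
print(len(train), len(test))  # 7 3
```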
What is the cross validation method of estimation?
Divide the dataset into k subsets. Use each subset in turn as the test set, with all the other subsets as the training set, until every subset has been tested. Example with 10 subsets:
- Step 1: Subset 1 -> test set; Subsets 2-10 -> training set
- Step 2: Subset 2 -> test set; Subsets 1, 3-10 -> training set
- ...and so on for the remaining subsets
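The fold rotation described above can be sketched like this (a toy illustration; the function name and fold-slicing scheme are my own choices):

```python
def cross_validation_folds(data, k=10):
    """Split data into k subsets; yield (test_set, training_set) pairs,
    using each subset as the test set exactly once."""
    folds = [data[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield test, train

for test, train in cross_validation_folds(list(range(20)), k=4):
    print(len(test), len(train))  # each round: 5 test records, 15 training records
```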
What are the three main algorithms for classifying data?
Nearest Neighbour, Bayes, and Decision Tree
What’s an instance-based classifier?
A rote-learner model of classification: it memorises the training data and classifies a record only if it matches a stored instance exactly.
What are the three things you need to get K-nearest neighbour to work?
You need three things: 1. The stored training records. 2. A distance metric to compute the distance to neighbours. 3. A value of k, the number of neighbours to compare against.
What are some things to consider for classification algorithms?
- Accuracy: does it correctly predict class labels?
- Speed: time to build the model and time to use it
- Robustness: handling noise in the data
- Scalability: handling large amounts of data
- Interpretability: can we understand the insight the model gives?
What are similarity/dissimilarity in the context of classification algorithms?
Similarity measures how alike two data objects are; it often falls in [0, 1], where 1 means identical. Dissimilarity is the opposite: how different two objects are, where 0 means identical.
What measure do we use to compute similarity between data objects?
Use Euclidean distance: the square root of the sum of the squared differences between the attributes. Example: John (45 yo, $10,000) and Kelly (34 yo, $15,000): Distance = sqrt[(45 - 34)^2 + (10,000 - 15,000)^2]
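The John/Kelly example works out like this in code (a minimal sketch; the names are from the example above):

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared differences over all attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# John (45 yo, $10,000) vs Kelly (34 yo, $15,000)
john, kelly = (45, 10_000), (34, 15_000)
print(euclidean(john, kelly))  # ~5000.01 -- the salary difference dominates
```

Note how the raw salary gap of 5,000 swamps the age gap of 11, which is exactly why the later card on normalisation matters.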
What are the steps of the K-nearest algorithm/classification?
- Compute the Euclidean distance from the new data point to the stored data points
- Get the class label from the nearest neighbours. How? A majority vote weighted by a weight factor
- What’s the weight factor? Weight = 1/distance^2, so closer neighbours count more
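The steps above can be sketched end to end (a toy illustration assuming numeric attributes; the fish data and function names are invented for the example):

```python
import math

def knn_predict(records, query, k=3):
    """records: list of ((attributes...), class_label) pairs.
    Weighted majority vote among the k nearest neighbours, weight = 1/d^2."""
    nearest = sorted(
        (math.dist(feats, query), label) for feats, label in records
    )[:k]
    votes = {}
    for d, label in nearest:
        # An exact match (d == 0) would divide by zero; give it infinite weight.
        votes[label] = votes.get(label, 0.0) + (1 / d**2 if d else float("inf"))
    return max(votes, key=votes.get)

fish = [((70, 4.0), "salmon"), ((72, 4.2), "salmon"),
        ((40, 1.5), "not salmon"), ((38, 1.2), "not salmon")]
print(knn_predict(fish, (69, 3.9), k=3))  # salmon
```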
Why is k-nearest neighbour considered lazy?
It doesn’t build a model. It remembers all the training data and has to compute distance to neighbours every time it is run.
What are the issues with choosing a k-value for nearest neighbour?
If k is too small it is sensitive to noise. If k is too large it may pick up data from other class labels.
How do we stop one attribute from dominating the k-nearest neighbour algorithm?
Normalize the data so an attribute with a large range doesn’t dramatically affect the distance calculation.
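One common way to do this is min-max normalization, rescaling each attribute to [0, 1] (a sketch; the source doesn't name a specific normalization scheme, and the sample values are invented):

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] so attributes with large ranges
    (e.g. salary vs age) don't dominate the distance calculation."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages     = [34, 45, 56]
salaries = [10_000, 15_000, 20_000]
print(min_max_normalize(ages))      # [0.0, 0.5, 1.0]
print(min_max_normalize(salaries))  # [0.0, 0.5, 1.0]
```

After rescaling, an 11-year age gap and a $5,000 salary gap contribute equally to the distance.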
What’s an obvious disadvantage of the k-nearest neighbour algorithm?
Because it doesn’t build a model, it has to compute the distance to every training record each time it classifies a new one. This is a relatively expensive operation.
What’s a Naive Bayes Classifier?
A Naive Bayes Classifier uses Bayes theorem to compute the probability of a class label.
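For categorical data, the idea can be sketched as P(label | features) being proportional to P(label) x the product of P(feature_i | label), with the "naive" assumption that features are independent given the label (a toy illustration without smoothing; the weather data and names are invented):

```python
from collections import Counter

def naive_bayes_predict(records, query):
    """records: list of ((categorical_features...), label) pairs.
    Score each label by prior * product of per-feature likelihoods."""
    labels = Counter(label for _, label in records)
    n = len(records)
    best_label, best_score = None, -1.0
    for label, count in labels.items():
        score = count / n  # prior P(label)
        for i, value in enumerate(query):
            match = sum(1 for feats, lab in records
                        if lab == label and feats[i] == value)
            score *= match / count  # likelihood P(feature_i | label)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

weather = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
           (("rain", "mild"), "yes"), (("rain", "hot"), "yes")]
print(naive_bayes_predict(weather, ("rain", "mild")))  # yes
```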
What is an advantage of using a Naive Bayes Classifier?
It’s easy to implement and generally gets good results.