CHAP 5 : ML Intro and KNN Flashcards
What is the difference between traditional programming and machine learning?
In traditional programming, the computer merely follows the instructions in the program. In machine learning, the computer learns from experience, much as a human does, and then makes decisions.
Traditional programming:
- Input : Data, rules (logic)
- Output : Answers
Machine learning:
- Input : Data, answers
- Output : Logic
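A minimal sketch of the contrast above, using a made-up spam example (the feature `num_links`, the hand-picked threshold and the toy data are all hypothetical). The learned version assumes the two classes are cleanly separable on a single feature.

```python
# Traditional programming: a human writes the rule (logic) by hand.
def is_spam_rule(num_links):
    return num_links > 3  # hand-picked threshold (hypothetical)


# Machine learning: the rule is derived from data plus answers (labels).
def learn_threshold(values, labels):
    """Return a cut-off halfway between the largest negative
    and the smallest positive example (assumes separable data)."""
    positives = [v for v, y in zip(values, labels) if y]
    negatives = [v for v, y in zip(values, labels) if not y]
    return (max(negatives) + min(positives)) / 2


# Toy labelled data: number of links per email, and whether it was spam.
values = [0, 1, 2, 6, 8, 9]
labels = [False, False, False, True, True, True]

threshold = learn_threshold(values, labels)  # the learned "logic"


def is_spam_learned(num_links):
    return num_links > threshold
```

Here the input to `learn_threshold` is data plus answers, and its output is the logic (the threshold) that `is_spam_learned` then applies.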
What are the 3 categories that machine learning can be broadly classified into?
- Supervised learning
- Unsupervised learning
- Reinforcement learning
What kinds of tasks are solved in supervised learning? Give examples.
- Classification – Object classification in surveillance videos / photos etc
- Regression – Prediction of house prices
What is the goal of supervised learning and what kind of data is passed into supervised learning?
The goal of a supervised learning algorithm is to learn patterns in the data and build a general set of rules that maps each input to its class / event.
Human-labelled data is passed as input.
What are the 3 stages of a supervised learning model?
- Training
- Testing / Validation
- Classification / prediction
What kinds of tasks are solved in unsupervised learning? Give examples.
- Clustering – grouping of similar customer profiles (aka customer segmentation for marketing purposes)
- Dimensionality reduction – finding key features in data
- Anomaly detection – detecting fraudulent credit card transactions
What is dimensionality reduction? [not so impt]
The transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data.
What is the idea behind reinforcement learning?
- These algorithms map situations to the actions that yield the maximum reward
- For every action, there is a reward defined by the user
- The algorithm learns to find the action that maximises the reward
Give examples of reinforcement learning.
AI for games (e.g. ATARI Game AIs / game bots)
Self-driving car
What are the 3 major tasks in ML?
- Classification
- Regression
- Clustering
What kinds of data are used in ML algorithms? [2]
- Numerical
- Categorical
What are the 2 kinds of numerical data? Give examples.
- Discrete data – a count that involves only integers; the values cannot be subdivided into parts (e.g. number of children)
- Continuous data – can be meaningfully divided into finer levels, is measured on a scale / continuum and can take almost any numeric value (e.g. weight and height)
What are the 2 kinds of categorical data? Give examples.
- Nominal data – used for labelling variables, has no order (sometimes referred to as “labels”) – e.g. Gender
- Ordinal data – values of ordinal data have some natural ordering – e.g. rating product on a scale of 1-5, clothes sizing (small, medium, large)
Higher dimensional data provides more details but incurs more computation in Machine Learning. True or False?
True
What are the characteristics of the kNN algorithm?
kNN is a
1. Supervised – needs labeled training data
2. Non-parametric – no assumption made on data distribution
3. Lazy learning – no explicit training phase and no model generated; the training data is simply stored and used at prediction time
algorithm
What is the key idea behind kNN?
It is used to find the k-most SIMILAR datapoints from the dataset.
How does kNN algo work? (How is the class of new data point determined?)
- The algorithm works on a majority vote among the classes of the k nearest neighbours
- New datapoint is assigned to the most common class of the k-nearest neighbours
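The voting procedure above can be sketched as follows. This is an illustrative version only: the function name `knn_classify`, the toy points and the choice of Euclidean distance (via the standard library's `math.dist`, Python 3.8+) are assumptions, not part of the original notes.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Assign `query` the most common class among its k nearest neighbours."""
    # Lazy learning: nothing is trained up front; every stored point
    # is compared with the query at prediction time.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the labels of the k closest training points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the neighbours' classes.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy 2-feature dataset with two classes (hypothetical values).
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
                (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(train_points, train_labels, (2.0, 2.0), k=3))
```

With two classes, an odd k avoids tied votes. The same dataset and query always give the same answer, because nothing in the procedure is random.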
What are similarity measures in kNN?
A similarity measure is based on a distance computed in a space whose dimensions represent the features of the objects: the smaller the distance, the more similar the objects.
What are the kinds of similarity measures (distance metrics) used in kNN?
- Euclidean distance **
- Manhattan distance **
- Minkowski distance
- Hamming distance (number of positions at which two vectors differ; same as Manhattan distance when the values are binary)
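A short sketch of the four metrics listed above, written as plain Python functions over equal-length sequences. The function names and sample points are illustrative only.

```python
def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    # Sum of absolute differences along each axis.
    return sum(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p):
    # General form: p = 1 gives Manhattan, p = 2 gives Euclidean.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def hamming(a, b):
    # Number of positions where the two sequences differ.
    return sum(x != y for x, y in zip(a, b))


a, b = (0, 0), (3, 4)
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, 2))
print(hamming((1, 0, 1, 1), (1, 1, 0, 1)))
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 and the Manhattan distance is 7; the two binary vectors differ in 2 positions.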
How is similarity measured?
Similarity is measured in the range 0 to 1.
For 2 features X and Y,
– if similarity score of X and Y = 1, X == Y (identical)
– if similarity score of X and Y = 0, X and Y are completely dissimilar
Why is feature scaling necessary?
Since ML algorithms take only the magnitude of features into account and neglect the units, features with larger magnitudes may be given higher weightage.
With feature scaling, all features are brought to a comparable magnitude.
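To illustrate the point, here is a small sketch in which an unscaled large-magnitude feature decides the nearest neighbour. The customers, the query and the feature ranges used for min-max scaling are all made-up values.

```python
import math

# Hypothetical customers: (age in years, income in dollars).
customers = {"P": (60, 50_500), "Q": (26, 52_000)}
query = (25, 50_000)

# Assumed (min, max) range of each feature, used for min-max scaling.
ranges = [(20, 70), (20_000, 120_000)]


def scale(point):
    # Map each feature into the range 0 to 1 using its assumed min and max.
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(point, ranges))


def nearest(points, target):
    # Name of the point with the smallest Euclidean distance to the target.
    return min(points, key=lambda name: math.dist(points[name], target))


nearest_raw = nearest(customers, query)
scaled_customers = {name: scale(p) for name, p in customers.items()}
nearest_scaled = nearest(scaled_customers, scale(query))
print(nearest_raw, nearest_scaled)
```

On the raw values, income dominates the distance, so P is nearest even though P is 35 years older than the query. After scaling, Q (similar age, similar income) becomes the nearest neighbour.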
What are the 2 ways we can scale features? When are they used?
- Standardisation : x’ = (x-mean) / (stdev)
– Used when data has varying range and when data follows a Gaussian (normal) distribution
- Normalisation : x’ = [x-min(x)] / [max(x)-min(x)]
– Used when there is no assumption made on the distribution of data
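Both formulas can be sketched with the standard library as below. The function names and sample heights are illustrative, and using the population standard deviation (`statistics.pstdev`) rather than the sample one is an assumption, since the notes do not say which is meant.

```python
import statistics


def standardise(values):
    # x' = (x - mean) / stdev
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # assumption: population std dev
    return [(x - mean) / stdev for x in values]


def normalise(values):
    # x' = (x - min) / (max - min), which maps values into the range 0 to 1
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


heights_cm = [150, 160, 170, 180, 190]
print(normalise(heights_cm))
print(standardise(heights_cm))
```

After standardisation the values have mean 0 and standard deviation 1; after normalisation the smallest value becomes 0 and the largest becomes 1. Both functions divide by zero if every value is identical, so a constant feature would need separate handling.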
kNN works well with large, high-dimensional data. Is the statement true or false?
False
Which type of data do the Cumulative Average Points (CAP) fall under?
Continuous
Which type of data do school rankings fall under (NUS, NTU, SMU rankings)?
Ordinal
Given the same dataset and input, multiple runs of the kNN algorithm will yield different results. Is the statement true or false?
False
The Manhattan distance is the straight line distance between two points. Is the statement true or false?
False. That is Euclidean distance; Manhattan distance is the sum of the absolute differences along each axis.
Euclidean distance is a variation of Minkowski distance where the order of the norm is 2. Is the statement true or false?
True
Hamming Distance can be used on (x,y) co-ordinate values. Is the statement true or false?
False, used on binary values
Feature scaling should be used in kNN because the algorithm relies on distance metrics. Is the statement true or false?
True