CHAP 5 : ML Intro and KNN Flashcards

1
Q

What is the difference between traditional programming and machine learning?

A

In traditional programming, the computer merely follows the instructions in the program. In machine learning, the computer learns from experience, much as a human does, and makes decisions based on what it has learned.

Traditional programming:
- Input : Data, rules (logic)
- Output : Answers

Machine learning:
- Input : Data, answers
- Output : Logic
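
A minimal sketch of the contrast above, using a made-up pass/fail task. The function names, the scores and the threshold of 50 are all illustrative assumptions, not part of the original card. The first function is a rule written by hand; the second derives a rule from data plus answers.

```python
# Traditional programming: a human writes the rule (logic) by hand.
def is_pass_rule(score):
    return score >= 50  # hand-written rule; the value 50 is illustrative


# Machine learning: the rule is derived from data plus answers (labels).
def learn_threshold(scores, labels):
    """Pick the cut-off that classifies the most labelled examples correctly."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(scores):
        # Count how many examples the rule "score >= candidate" gets right.
        correct = sum((s >= candidate) == y for s, y in zip(scores, labels))
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


scores = [20, 35, 48, 52, 70, 90]                 # data
labels = [False, False, False, True, True, True]  # answers
threshold = learn_threshold(scores, labels)       # learned logic
print(threshold)  # 52
```

Here the "logic" that comes out is just a single number, but the input/output roles match the two lists above.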

2
Q

What are the 3 categories that machine learning can be broadly classified into?

A
  1. Supervised learning
  2. Unsupervised learning
  3. Reinforcement learning
3
Q

What kinds of tasks are solved in supervised learning? Give examples.

A
  1. Classification – Object classification in surveillance videos / photos etc
  2. Regression – Prediction of house prices
4
Q

What is the goal of supervised learning and what kind of data is passed into supervised learning?

A

The goal of a supervised learning algorithm is to learn patterns in the data and build a general set of rules that maps an input to its class / event.

Human-labelled data is passed as input.

5
Q

What are the 3 stages of a supervised learning model?

A
  1. Training
  2. Testing / Validation
  3. Classification / prediction
6
Q

What kinds of tasks are solved in unsupervised learning? Give examples.

A
  1. Clustering – grouping of similar customer profiles (aka customer segmentation for marketing purposes)
  2. Dimensionality reduction – finding key features in data
  3. Anomaly detection – detecting fraudulent credit card transactions
7
Q

What is dimensionality reduction? [not so impt]

A

The transformation of data from a high-dimensional space into a low-dimensional space such that the low-dimensional representation retains some meaningful properties of the original data.

8
Q

What is the idea behind reinforcement learning?

A
  • These algorithms map situations to actions that yield maximum rewards
  • For every action, there is a reward defined by the user
  • The algorithm learns to find the action that maximises the reward
9
Q

Give examples of reinforcement learning.

A

AI for games (e.g. ATARI Game AIs / game bots)
Self-driving car

10
Q

What are the 3 major tasks in ML?

A
  1. Classification
  2. Regression
  3. Clustering
11
Q

What kinds of data are used in ML algorithms ? [2]

A
  1. Numerical
  2. Categorical
12
Q

What are the 2 kinds of numerical data? Give examples.

A
  1. Discrete data – counts that take only integer values and cannot be subdivided into parts (e.g. number of children)
  2. Continuous data – can be meaningfully divided into finer levels, is measured on a scale / continuum and can take almost any numeric value (e.g. weight and height)
13
Q

What are the 2 kinds of categorical data? Give examples.

A
  1. Nominal data – used for labelling variables, has no order (sometimes referred to as “labels”) – e.g. Gender
  2. Ordinal data – values of ordinal data have some natural ordering – e.g. rating product on a scale of 1-5, clothes sizing (small, medium, large)
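
The distinction matters when categories are turned into numbers for an ML algorithm. The sketch below shows one common way to encode each kind. The colour feature and the helper name `one_hot` are assumptions for illustration, not part of the original card.

```python
# Ordinal data: the order is meaningful, so map categories to ranked integers.
size_order = {"small": 0, "medium": 1, "large": 2}
sizes = ["medium", "small", "large"]
encoded_sizes = [size_order[s] for s in sizes]
print(encoded_sizes)  # [1, 0, 2]


# Nominal data: there is no order, so ranked integers would invent one.
# A one-hot vector marks the category without implying any ranking.
def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]


colours = ["red", "green", "blue"]  # hypothetical nominal feature
encoded_colour = one_hot("green", colours)
print(encoded_colour)  # [0, 1, 0]
```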
14
Q

Higher dimensional data provides more details but incurs more computation in Machine Learning. True or False?

A

True

15
Q

What are the characteristics of the kNN algorithm?

A

kNN is an algorithm that is:
  1. Supervised – needs labelled training data
  2. Non-parametric – makes no assumption about the data distribution
  3. Lazy learning – has no explicit training phase and builds no model; the training data is stored and used directly at prediction time

16
Q

What is the key idea behind kNN?

A

It is used to find the k-most SIMILAR datapoints from the dataset.

17
Q

How does kNN algo work? (How is the class of new data point determined?)

A
  • Algo works based on a majority vote over the classes of the k nearest neighbours
  • New datapoint is assigned to the most common class of the k-nearest neighbours
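
The two points above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the function name `knn_predict`, the toy points and the use of Euclidean distance are all assumptions.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every stored training point (Euclidean here).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 5.5), (6.5, 7.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]
prediction = knn_predict(train_points, train_labels, (2.0, 2.0), k=3)
print(prediction)  # A
```

With two classes, an odd k avoids tied votes. In this sketch a tie would go to whichever tied class appears first among the sorted neighbours.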
18
Q

What are similarity measures in kNN?

A

A similarity measure is a distance measure with dimensions representing features of the objects.

19
Q

What are the kinds of similarity measures (distance metrics) used in kNN?

A
  1. Euclidean distance **
  2. Manhattan distance **
  3. Minkowski distance
  4. Hamming distance (counts the positions at which two vectors differ; on binary values it equals the Manhattan distance)
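
As a rough illustration, the four metrics can be written as short Python functions over equal-length vectors. The function names and sample points are assumptions chosen for this sketch.

```python
import math


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p):
    """Generalised distance of order p."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def hamming(a, b):
    """Number of positions at which the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


a, b = (0, 0), (3, 4)
print(euclidean(a, b))                       # 5.0
print(manhattan(a, b))                       # 7
print(hamming((1, 0, 1, 1), (1, 1, 0, 1)))   # 2
```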
20
Q

How is similarity measured?

A

Similarity is measured in the range 0 to 1.

For 2 feature vectors X and Y,
– if similarity score of X and Y = 1, X and Y are identical (X == Y)
– if similarity score of X and Y = 0, X and Y are completely dissimilar (X != Y)
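
The card does not say how a distance is turned into a score between 0 and 1. One common choice, used here purely as an assumption, is 1 / (1 + distance): identical points score exactly 1, and the score approaches 0 as the points move further apart.

```python
import math


def similarity(a, b):
    """Map Euclidean distance to a score in (0, 1]; 1 means identical points."""
    return 1.0 / (1.0 + math.dist(a, b))


print(similarity((1, 2), (1, 2)))  # 1.0
print(similarity((0, 0), (3, 4)) > similarity((0, 0), (30, 40)))  # True
```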

21
Q

Why is feature scaling necessary?

A

Many ML algorithms, including distance-based ones such as kNN, use only the magnitude of each feature and ignore its units, so features with larger magnitudes can end up being given a higher weightage.

With feature scaling, all features are brought to a comparable range.
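
A small sketch of the effect, using three hypothetical people described by age and income. All names and values are made up for illustration. Before scaling, income dominates the distance; after min-max scaling, both features contribute on the same 0-to-1 range.

```python
import math

# Hypothetical people as (age in years, income in dollars).
people = {"alice": (25, 50_000), "bob": (60, 50_500), "carol": (26, 52_000)}


def min_max_scale(points):
    """Rescale every feature (column) to the range [0, 1]."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]


raw = list(people.values())
scaled = min_max_scale(raw)

# Raw distances: income dominates, so Bob (35 years older) looks nearest to Alice.
raw_bob, raw_carol = math.dist(raw[0], raw[1]), math.dist(raw[0], raw[2])
print(raw_bob < raw_carol)  # True

# Scaled distances: the 35-year age gap now counts as much as the income gap.
scaled_bob, scaled_carol = math.dist(scaled[0], scaled[1]), math.dist(scaled[0], scaled[2])
print(scaled_carol < scaled_bob)  # True
```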

22
Q

What are the 2 ways we can scale features? When are they used?

A
  1. Standardisation : x’ = (x - mean) / stdev

– Used when the data has a varying range and follows a known distribution, such as a Gaussian

  2. Normalisation : x’ = [x - min(x)] / [max(x) - min(x)]

– Used when no assumption is made about the distribution of the data
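
Both formulas can be sketched for a single feature using only the standard library. The function names and sample heights are assumptions, and the sketch uses the population standard deviation, which is also an assumption since the card does not say which one is meant.

```python
import statistics


def standardise(values):
    """x' = (x - mean) / stdev, giving mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population stdev (an assumption)
    return [(x - mean) / stdev for x in values]


def normalise(values):
    """x' = (x - min) / (max - min), giving values in the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


heights_cm = [150, 160, 170, 180, 190]
print(normalise(heights_cm))  # [0.0, 0.25, 0.5, 0.75, 1.0]
standardised = standardise(heights_cm)
```

Both functions divide by zero if every value in the feature is identical, so a constant feature needs to be handled separately.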

23
Q

kNN works well with large and high-dimensional data. Is the statement true or false?

A

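False. kNN computes the distance from the query to every stored training point, so prediction becomes slow on large datasets, and distances become less discriminative as the number of dimensions grows.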

24
Q

Which type of data does the Cumulative Average Point (CAP) fall under?

A

Continuous

25
Q

Which type of data do school rankings fall under (NUS, NTU, SMU rankings)?

A

Ordinal

26
Q

Given the same dataset and input, multiple runs of the kNN algorithm will yield different results. Is the statement true or false?

A

False

27
Q

The Manhattan distance is the straight-line distance between two points. Is the statement true or false?

A

False. That is the Euclidean distance. The Manhattan distance is the sum of the absolute differences along each axis.

28
Q

Euclidean distance is a variation of Minkowski distance where the order of the norm is 2. Is the statement true or false?

A

True
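
For reference, the standard definition of the Minkowski distance of order p between two n-dimensional points x and y, and the two special cases it reduces to (the symbol D_p is notation chosen here, not taken from the card):

```latex
D_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

p = 2: \quad D_2(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \quad \text{(Euclidean)}

p = 1: \quad D_1(x, y) = \sum_{i=1}^{n} |x_i - y_i| \quad \text{(Manhattan)}
```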

29
Q

Hamming Distance can be used on (x,y) co-ordinate values. Is the statement true or false?

A

False, used on binary values

30
Q

Feature scaling methods should be used in kNN because the algorithm relies on distance metrics. Is the statement true or false?

A

True