6: Machine Learning 1 Flashcards

Question 1

Q

What is machine learning?

Answer

A

Understanding the world by learning from the data
Not so much interested in cause and mechanisms
Interested in classification and predictions.

Question 2

Q

What are some example uses of machine learning?

Answer

A

Spam detection
Community detection
Sorting news
Text translation
Face recognition
Optical character recognition (OCR)
Suggested content and ads

Question 3

Q

What is unsupervised learning?

Answer

A

Techniques where the machine is NOT given labels, or corresponding outputs.
- The machine will detect patterns from the data with no example to rely on.

The dataset containing the data to learn from is called an unlabelled dataset.

Question 4

Q

What is supervised learning?

Answer

A

Techniques where the machine is given inputs and corresponding outputs to learn from.
- The machine will try to adjust parameters to make the best prediction of the output when given an input.

The dataset containing inputs and corresponding outputs is called a labelled dataset.

Question 5

Q

What is reinforcement learning?

Answer

A

The machine learns through trial and error.
- The method includes a feedback loop with rewards. While attempting trials, the machine tries to maximise the rewards.

Question 6

Q

What is a good algorithm?

Answer

A

An algorithm capable of making correct predictions, including on data it was not trained on.

Question 7

Q

What is the goal of machine learning and how is this tested?

Answer

A

Try to identify patterns and make predictions from data.
Does not matter if input causes output, as long as the input is enough to predict the output.
Algorithm is trained on a TRAINING DATASET.
Then, the accuracy of the model can be tested with a TEST DATASET.

Question 8

Q

What kind of model fit do you want to avoid?

Answer

A

You do not want a model that is fitted to the random variations (noise) in your data.

Question 9

Q

What is underfitting and overfitting of a model?

Answer

A

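Under: Not enough parameters to correctly predict Y (e.g. the model is linear when the relationship is not).

Over: Too many parameters, so the model follows the random variations in the training data (touches every data point) = small residuals on the training data, but poor predictions on new data.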
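
A minimal sketch of the difference, assuming hypothetical data where y = 2x + 1 plus noise. The three models (`underfit`, `linear`, `overfit`) are illustrative names, not from the course: one ignores x entirely, one fits a straight line, and one simply memorises every training point.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def make_data(n):
    """Hypothetical data: y = 2x + 1 plus random noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + random.gauss(0, 1) for x in xs]
    return xs, ys

def mse(model, xs, ys):
    """Mean squared residual of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(30)
test_x, test_y = make_data(30)

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)

# Underfit: a single parameter (the mean of y); x is ignored.
def underfit(x):
    return mean_y

# Reasonable fit: two parameters (slope and intercept) by least squares.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x

def linear(x):
    return slope * x + intercept

# Overfit: memorises the training set and returns the y of the nearest
# training x, so it touches every training point (noise included).
def overfit(x):
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

for name, model in [("underfit", underfit), ("linear", linear), ("overfit", overfit)]:
    print(name,
          "train:", round(mse(model, train_x, train_y), 2),
          "test:", round(mse(model, test_x, test_y), 2))
```

The memorising model has zero residual on the training data by construction. Its error on the test data, which is typically worse than the straight line's, is what reveals the overfitting.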

Question 10

Q

What are the three components the machine learns with?

Answer

A

A decision process: recipe of calculations/steps that takes in the data and returns a “guess” at the kind of pattern in the data the algorithm is looking to find.
An error function: a method of measuring how good the guess was by comparing it to known examples (when available). It quantifies how bad a miss was.
An updating or optimisation process: the algorithm looks at the miss and updates how the decision process comes to its final decision, so that next time the miss won’t be as great.
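
The three components can be sketched with a one-parameter model. This is only an illustration under assumptions: the model y = w * x, the mean squared error, the gradient-descent update and the learning rate `lr` are choices made for this example, not something the card specifies.

```python
def predict(w, x):
    """Decision process: turn an input into a guess."""
    return w * x

def error(w, xs, ys):
    """Error function: mean squared difference between guesses and known outputs."""
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def update(w, xs, ys, lr):
    """Updating process: one gradient-descent step that reduces the error."""
    grad = sum(2 * (predict(w, x) - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w - lr * grad

# Hypothetical labelled dataset generated by y = 3x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.0                      # initial guess for the parameter
start_error = error(w, xs, ys)
for _ in range(200):
    w = update(w, xs, ys, lr=0.01)

print("learned w:", round(w, 4))
print("error before:", start_error, "error after:", error(w, xs, ys))
```

Each pass through the loop makes a guess, measures the miss, and adjusts the parameter so that the next miss is smaller.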

Question 11

Q

What are some challenges of machine learning?

Answer

A

Biases and discrimination. If data are biased, predictions are biased.
Privacy issues: data can be memorised and exploited by machines.
Legitimacy and accountability: who is responsible in case of failure or an unintended outcome?

Question 12

Q

What is an example of a model within unsupervised learning?

Answer

A

K-means clustering

Question 13

Q

What is the intuition behind K-means clustering?

Answer

A

Observations belonging to the same groups must share the same characteristics.

If we know we have K groups (clusters) in our dataset, we can try to group the observations so that the distance among observations:
- within a group is the smallest possible
- between groups is the largest possible.

Question 14

Q

How do we estimate in K-means clustering?

Answer

A

1. Choose the number of clusters (K).
2. Randomly pick K observations from the dataset: these are the initial centers of each cluster (centroids).
3. Assign each observation in the dataset to the closest centroid (measured by Euclidean distance).
4. Update the centroids: each new centroid is the mean of the data points in its cluster.
5. Repeat steps 3 and 4 until observations stop changing clusters, or until the maximum number of iterations is reached.
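
The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, its parameters (`max_iter`, `seed`) and the six example points are assumptions made for this example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of coordinate tuples into k groups."""
    rng = random.Random(seed)
    # Randomly pick k observations as the initial centroids.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each observation to its closest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop when no observation changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return centroids, assignment

# Hypothetical dataset with two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = kmeans(points, k=2)
print("centroids:", centroids)
print("labels:", labels)
```

Which cluster is numbered 0 and which is numbered 1 depends on the random starting centroids, so only the grouping itself is meaningful.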

Question 15

Q

How do we assess accuracy, sensitivity and specificity?

Answer

A

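By training the model on a training dataset and comparing its predictions on a test dataset with the known labels. Typically done by splitting the original dataset into two random groups: 70% to train the model, 30% to test it.

Create a confusion matrix (counts of TP, TN, FP and FN) to calculate them.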
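
A rough sketch of both ideas, assuming a dataset held as a Python list. The helper names `train_test_split` and `confusion_matrix`, and the spam/ham labels, are made up for this example.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into a training and a test group."""
    rng = random.Random(seed)
    shuffled = rows[:]           # copy, so the original order is kept
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def confusion_matrix(actual, predicted, positive):
    """Count TP, TN, FP and FN for one category treated as 'positive'."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

# Hypothetical dataset of 10 observations, split 70% / 30%.
train, test = train_test_split(list(range(10)))
print(len(train), "training rows,", len(test), "test rows")

# Hypothetical true labels and model predictions for a test set.
actual    = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham",  "ham", "spam", "spam", "ham"]
cm = confusion_matrix(actual, predicted, positive="spam")
print(cm)
```

The four counts are the inputs to the accuracy, sensitivity and specificity formulas.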

Question 16

Q

What is accuracy?

Answer

A

Proportion of observations correctly labelled by the algorithm.

= (TP+TN) / (TP+TN+FP+FN)

Question 17

Q

What is sensitivity?

Answer

A

Proportion of observations that truly belong to a category and are correctly predicted to belong to it (true positive rate).

= TP / (TP+FN)

Question 18

Q

What is specificity?

Answer

A

Proportion of observations that truly do NOT belong to a category and are correctly predicted to NOT belong to it (true negative rate).

= TN / (TN+FP)
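
As a quick check of the accuracy, sensitivity and specificity formulas, here is a small sketch that applies them to made-up confusion-matrix counts. The counts are hypothetical and chosen only to keep the arithmetic simple.

```python
def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)"""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """TP / (TP + FN)"""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TN / (TN + FP)"""
    return tn / (tn + fp)

# Hypothetical counts from a test set of 100 observations.
tp, tn, fp, fn = 40, 45, 5, 10

print("accuracy:   ", accuracy(tp, tn, fp, fn))   # 85 / 100
print("sensitivity:", sensitivity(tp, fn))        # 40 / 50
print("specificity:", specificity(tn, fp))        # 45 / 50
```

With these counts the model finds 80% of the true members of the category (sensitivity) and correctly rejects 90% of the non-members (specificity).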

Question 19

Q

What two methods can we use if we don’t know the number of clusters in our data (for K-means clustering)?

Answer

A

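The Elbow technique. Compute the within-cluster sum of squares (wss) for a range of K values and choose the K at the "elbow", where adding more clusters stops reducing wss by much.
The Silhouette technique. The score ranges from -1 to +1; a high score means data points match well within their own cluster and poorly with the neighbouring cluster (choose the K with the highest score).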
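
Both scores can be sketched in a few lines. This is an illustration under assumptions: the cluster labels for K = 1, 2 and 3 are written by hand here (in practice they would come from running K-means for each K), and the function names `wss` and `silhouette` are chosen for this example.

```python
import math

def wss(points, labels):
    """Within-cluster sum of squares: squared distance of each point to its centroid."""
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total

def silhouette(points, labels):
    """Mean silhouette score; needs at least two clusters."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a: mean distance to the other points in the same cluster.
        same = [math.dist(p, q)
                for j, (q, l) in enumerate(zip(points, labels))
                if l == own and j != i]
        if not same:             # a cluster of one point scores 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        # b: mean distance to the points of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != own
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical dataset with two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
# Hand-written cluster labels for K = 1, 2 and 3 (assumed, not computed).
labelings = {1: [0, 0, 0, 0, 0, 0], 2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 2, 1, 1, 1]}

for k, labels in labelings.items():
    line = f"K={k} wss={wss(points, labels):.2f}"
    if k > 1:
        line += f" silhouette={silhouette(points, labels):.2f}"
    print(line)
```

On this data wss drops sharply from K = 1 to K = 2 and only slightly from K = 2 to K = 3, so the elbow sits at K = 2, and K = 2 also has the higher silhouette score.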

(19 cards)