6: Machine Learning 1 Flashcards

1
Q

What is machine learning?

A
  • Understanding the world by learning from the data
  • Not so much interested in cause and mechanisms
  • Interested in classification and predictions.
2
Q

What are some example uses of machine learning?

A
  • Spam detection
  • Community detection
  • Sorting news
  • Text translation
  • Face recognition
  • Optical character recognition (OCR)
  • Suggested content and ads
3
Q

What is unsupervised learning?

A

Techniques where the machine is NOT given labels, or corresponding outputs.
- The machine will detect patterns from the data with no example to rely on.

The dataset containing the data to learn from is called an unlabelled dataset.

4
Q

What is supervised learning?

A

Techniques where the machine is given inputs and corresponding outputs to learn from.
- The machine will try to adjust parameters to make the best prediction of the output when given an input.

A dataset containing inputs and corresponding outputs is called a labelled dataset.

5
Q

What is reinforcement learning?

A

The machine learns through trial and error.
- The method includes a feedback loop with rewards. Over repeated trials, the machine tries to maximise the rewards.
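The reward feedback loop can be sketched with a small example. This is only an illustration, assuming a three-option "bandit" problem and an epsilon-greedy strategy; neither appears on the card, and the function name `run_bandit`, the reward probabilities and the parameter values are all hypothetical.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error on a multi-armed bandit (illustrative)."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    counts = [0] * n_arms      # how often each option was tried
    values = [0.0] * n_arms    # running average reward per option
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # explore: random trial
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit best estimate
        reward = 1 if rng.random() < reward_probs[arm] else 0  # feedback from the environment
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # update the running average
        total_reward += reward
    return values, counts, total_reward

# Hypothetical environment: option 2 pays out most often.
values, counts, total_reward = run_bandit([0.2, 0.5, 0.8])
best_arm = max(range(len(values)), key=lambda a: values[a])
print(best_arm, counts)
```

Most trials end up on the option with the highest estimated reward, while the occasional random trial keeps the other estimates up to date.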

6
Q

What is a good algorithm?

A

An algorithm capable of making the correct prediction.

7
Q

What is the goal of machine learning and how is this tested?

A
  • Try to identify patterns and make predictions from data.
  • Does not matter if input causes output, as long as the input is enough to predict the output.
  • Algorithm is trained on a TRAINING DATASET.
  • Then, the accuracy of the model can be tested with a TEST DATASET.
8
Q

What kind of model fit do you want to avoid?

A

You do not want a model that is fitted to the random variations (noise) in your data.

9
Q

What is underfitting and overfitting of a model?

A

Under: Not enough parameters to correctly predict Y (the model may be linear when it should not be).

Over: Too many parameters to correctly predict Y. The model touches every data point, so residuals on the training data are small, but it predicts new data poorly.
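A small sketch can make the contrast concrete. It assumes made-up data that follow y = x² plus fixed noise, a straight line as the underfitted model, and a polynomial that passes through every training point (Lagrange interpolation) as the overfitted model. The data, helper names and model choices are illustrative assumptions, not part of the card.

```python
def fit_line(xs, ys):
    """Least-squares straight line: too few parameters for curved data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_interpolant(xs, ys):
    """Polynomial through every training point: one parameter per observation."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared residual of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Training data: y = x^2 plus a fixed, made-up noise term.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
noise = [0.8, -1.1, 0.5, 1.2, -0.9, 0.3, -1.0, 0.7]
train_y = [x ** 2 + e for x, e in zip(train_x, noise)]

# Test data: unseen x values with the true (noise-free) y.
test_x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
test_y = [x ** 2 for x in test_x]

under, over = fit_line(train_x, train_y), fit_interpolant(train_x, train_y)
under_train, under_test = mse(under, train_x, train_y), mse(under, test_x, test_y)
over_train, over_test = mse(over, train_x, train_y), mse(over, test_x, test_y)
print("underfit train/test MSE:", under_train, under_test)
print("overfit  train/test MSE:", over_train, over_test)
```

The interpolating polynomial has zero residual on the training data because it has also fitted the noise, which is why its error reappears on the test points.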

10
Q

What are the three components the machine learns with?

A
  1. A decision process: a recipe of calculations or steps that takes in the data and returns a “guess” at the kind of pattern the algorithm is looking for.
  2. An error function: a method of measuring how good the guess was by comparing it to known examples (when available). It quantifies how bad a miss was.
  3. An updating or optimisation process: the algorithm looks at the miss and updates how the decision process reaches its final decision, so that the next miss is smaller.
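The three components can be sketched in a few lines. This is a minimal illustration, assuming a one-parameter model y = w·x, mean squared error as the error function, and gradient descent as the updating process. These specific choices, the toy data and the learning rate are assumptions made for the example.

```python
# Toy labelled data generated by y = 2x (made up for the example).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def predict(w, x):
    """1. Decision process: turn an input into a guess."""
    return w * x

def error(w):
    """2. Error function: mean squared difference between guesses and known outputs."""
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def update(w, learning_rate=0.01):
    """3. Updating process: one gradient descent step that reduces the error."""
    gradient = sum(2 * (predict(w, x) - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w - learning_rate * gradient

w = 0.0                      # initial guess for the parameter
initial_error = error(w)
for _ in range(100):
    w = update(w)
final_error = error(w)
print(w, initial_error, final_error)
```

Each pass through the loop makes a guess, measures the miss and adjusts `w`, so the error shrinks as `w` approaches 2.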
11
Q

What are some challenges of machine learning?

A
  • Biases and discrimination. If data are biased, predictions are biased.
  • Privacy issues: data can be memorised and exploited by machines.
  • Legitimacy and accountability: who is responsible in case of failure or an unintended outcome?
12
Q

What is an example of a model within unsupervised learning?

A

K-means clustering

13
Q

What is the intuition behind K-means clustering?

A
  • Observations belonging to the same groups must share the same characteristics.

If we know we have K groups (clusters) in our dataset, we can try to group the observations so that the distance among observations:
- within a group is the smallest possible
- between groups is the largest possible.

14
Q

How do we estimate the clusters in K-means clustering?

A
  1. Choose the number of clusters (K).
  2. Randomly pick K observations from the dataset: these are the initial centres of each cluster (centroids).
  3. Assign each observation in the dataset to the closest centroid (measured by Euclidean distance).
  4. Update the centroids: each new centroid is the mean of the data points in its cluster.
  5. Repeat steps 3 and 4 until observations stop changing clusters, or until the maximum number of iterations is reached.
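The five steps above can be sketched in plain Python. This is a minimal version under stated assumptions: the function name `kmeans`, the toy 2D points and the rule of keeping an old centroid when a cluster is empty are illustrative choices, not part of the card.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means following the steps listed above."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 2: K random observations
    labels = None
    for _ in range(max_iter):                    # step 5: repeat up to max_iter times
        # Step 3: assign each observation to its closest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:                 # step 5: nothing changed cluster, so stop
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:                          # assumption: an empty cluster keeps its centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Toy data with two visibly separate groups (made up for the example).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)          # step 1: choose K = 2
print(labels, centroids)
```

Which group receives label 0 depends on the random starting centroids, so only the grouping itself is meaningful, not the label values.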
15
Q

How do we assess accuracy, sensitivity and specificity?

A

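By training the model on a training dataset and comparing its predictions on a test dataset with the known labels. This is typically done by splitting the original dataset into two random groups: 70% to train the model, 30% to test it.

Create a confusion matrix from the test predictions to calculate them.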
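A random 70/30 split can be sketched as follows. The helper name `split_dataset`, the fixed seed and the ten placeholder observations are assumptions made for the example.

```python
import random

def split_dataset(data, train_fraction=0.7, seed=0):
    """Shuffle a copy of the data and cut it into a training and a test part."""
    rng = random.Random(seed)
    shuffled = list(data)          # copy, so the original order is left untouched
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

observations = list(range(10))     # stand-in for ten labelled observations
train, test = split_dataset(observations)
print(len(train), len(test))
```

Shuffling before cutting matters: if the original data are ordered, for example by category, an unshuffled split would give the model a training set that does not resemble the test set.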

16
Q

What is accuracy?

A

Proportion of observations correctly labelled by the algorithm.

= (TP+TN) / (TP+TN+FP+FN)

17
Q

What is sensitivity?

A

Proportion of the observations that truly belong to a category that are correctly predicted to belong to it (true positive rate).

= TP / (TP+FN)

18
Q

What is specificity?

A

Proportion of the observations that truly do NOT belong to a category that are correctly predicted to NOT belong to it (true negative rate).

= TN / (TN+FP)
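The formulas for accuracy, sensitivity and specificity can be checked with a short sketch that counts the four confusion-matrix cells from true and predicted labels. The function name and the ten example labels are made up for illustration, with 1 meaning "belongs to the category".

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for a binary problem (1 = in category, 0 = not)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Made-up test labels and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(tp, tn, fp, fn)                       # 3 4 2 1
print(accuracy, sensitivity, specificity)
```

Here the model finds 3 of the 4 true members (sensitivity 0.75) and correctly rejects 4 of the 6 non-members (specificity about 0.67), for an accuracy of 0.7.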

19
Q

What two methods can we use if we don’t know the number of clusters in our data (for K-means clustering)?

A
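  1. The Elbow technique. Compute the within-cluster sum of squares (WSS) for several values of K and choose the K at the "elbow", where adding more clusters no longer reduces the WSS much.
  2. The Silhouette technique.
    The score ranges from -1 to +1. A high score means data points match well within their own cluster and poorly with the neighbouring cluster (choose the K with the highest score).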
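Both techniques can be sketched on a small example. To keep the sketch short, the cluster labels for K = 1, 2 and 3 are written by hand, as if they had come from K-means. The toy points, these labelings and the function names are assumptions made for illustration.

```python
import math

def wss(points, labels):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for c in set(labels):
        members = [p for p, label in zip(points, labels) if label == c]
        centroid = tuple(sum(v) / len(members) for v in zip(*members))
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total

def silhouette(points, labels):
    """Mean silhouette score; needs at least two clusters and distinct points."""
    scores = []
    for i, (p, c) in enumerate(zip(points, labels)):
        same = [math.dist(p, q) for j, (q, label) in enumerate(zip(points, labels))
                if label == c and j != i]
        if not same:                 # a point alone in its cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)    # mean distance to its own cluster
        b = min(                     # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, label in zip(points, labels) if label == other)
            / labels.count(other)
            for other in set(labels) if other != c
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
# Hand-written labelings standing in for K-means output at each K.
labelings = {1: [0, 0, 0, 0, 0, 0], 2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 0, 1, 2, 2]}

wss_by_k = {k: wss(points, labels) for k, labels in labelings.items()}
sil_by_k = {k: silhouette(points, labels) for k, labels in labelings.items() if k >= 2}
best_k = max(sil_by_k, key=sil_by_k.get)
print(wss_by_k, sil_by_k, best_k)
```

The WSS drops sharply from K = 1 to K = 2 and only slightly from K = 2 to K = 3, which is the elbow, and the silhouette score is also highest at K = 2.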