deck_15595778 Flashcards

1
Q

What is a perceptron?
What does it do?

A
  • an artificial neuron that can be used for binary classification
  • it receives input signals through weighted connections, sums those weighted inputs to compute its activation level, and “fires” by outputting a 1 if the total exceeds a given threshold. Otherwise, it outputs a 0.
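A minimal sketch of that computation in Python (the names are illustrative, not from any particular library):

    def perceptron_output(xs, ws, threshold):
        # weighted sum of the inputs
        total = sum(w * x for w, x in zip(ws, xs))
        # "fire" only if the total exceeds the threshold
        return 1 if total > threshold else 0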
2
Q

What happens during training of a perceptron?

A

During training, a series of small adjustments is made to the connection weights and threshold using the perceptron’s learning rate.

3
Q

Where do the inputs to the perceptron in a classification task come from?

A

the inputs are the feature values, and the output is the classification label

4
Q

What happens during the training phase if the output of the perceptron is wrong?

A

the threshold and the weights are adjusted according to the learning rate

5
Q

Output was 0, target was 1
what happens to the threshold and weights?

A

lower the threshold
raise the weights
(inputs that are 0 are effectively ignored, since their weight update is x * lr = 0)

6
Q

Output was 1, target was 0
what happens to the threshold and weights?

A

raise the threshold
lower the weights
(inputs that are 0 are effectively ignored, since their weight update is x * lr = 0)

7
Q

What type of classifier is a perceptron?

A

A linear classifier.
Perceptrons create a straight-line decision boundary in the feature space.
They will only succeed (converge) if the data is linearly separable.

8
Q

What does it mean to train a perceptron

A

finding the coefficients of a linear equation
the coefficients are the connection weights

9
Q

What are the differences between the Perceptron learning algorithm and Stochastic Gradient Descent

A
  • both can be used for classification
  • SGD finds an optimal solution based on a loss function that aims to ‘center’ the decision boundary between the classes
  • the Perceptron learning algorithm will find a solution if one exists, but not necessarily the best solution
10
Q

what is the Perceptron Learning Algorithm

A
  • initialize the weights (ws), threshold (t), and learning rate (lr)
  • repeat until done:
        for each training example (xs, target):
            compute the perceptron output: 1 if sum(ws * xs) > t, else 0
            if output < target:
                for each (w, x) in (ws, xs):
                    w = w + x * lr
                t = t - lr
            else if output > target:
                for each (w, x) in (ws, xs):
                    w = w - x * lr
                t = t + lr
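A runnable Python version of the same loop (the AND data, epoch count, and function name are illustrative):

    import numpy as np

    def train_perceptron(X, targets, lr=0.1, epochs=20):
        ws = np.zeros(X.shape[1])   # weights
        t = 0.0                     # threshold
        for _ in range(epochs):                      # "repeat until done"
            for xs, target in zip(X, targets):
                output = 1 if ws @ xs > t else 0
                if output < target:                  # under-fired: 0 instead of 1
                    ws += xs * lr
                    t -= lr
                elif output > target:                # over-fired: 1 instead of 0
                    ws -= xs * lr
                    t += lr
        return ws, t

    # toy usage: logical AND is linearly separable, so this converges
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    ws, t = train_perceptron(X, np.array([0, 0, 0, 1]))
    print([1 if ws @ x > t else 0 for x in X])       # expected: [0, 0, 0, 1]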
11
Q

Initializing the weights and threshold of a perceptron:
why is starting with random weights useful in a multi-layer perceptron?

A

Starting with random weights is useful because, depending on where you start, you might converge on a better or worse solution.

12
Q

When do we stop repeating the perceptron Learning Algorithm

A
  • after a set number of epochs, or
  • when the perceptron reaches a high level of accuracy, or
  • when the perceptron hasn’t improved its accuracy in a while, or
  • some other method or combination of methods.
13
Q

when adjusting the weights of a perceptron Learning Algorithm, why do we normalize the data?

A

normalize the data before training so that the weights are all adjusted at about the same rate

14
Q

what is Randomized Presentation in a perceptron Learning Algorithm

A

you can randomize the presentation order of the examples within each epoch instead of presenting them in the same order every time.
This can stop the network from getting stuck in a suboptimal solution (especially with more complex multi-layer perceptrons)

15
Q

what is Batch Learning in a perceptron Learning Algorithm

A

instead of updating after every example, you compute the output for the entire batch of examples in the training set, and then update the weights only once per epoch based on the outputs that were wrong
- this lets you write very short code with numpy (see the sketch below) and might prevent the network from getting stuck in a suboptimal solution
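One possible numpy sketch of a single batch-update epoch (the function and variable names are illustrative, not from any library):

    import numpy as np

    def batch_epoch(X, targets, ws, t, lr):
        outputs = (X @ ws > t).astype(int)   # outputs for every example at once
        errors = targets - outputs           # +1 where under-fired, -1 where over-fired
        ws = ws + lr * (X.T @ errors)        # one accumulated weight update per epoch
        t = t - lr * errors.sum()            # threshold moves opposite to the weights
        return ws, t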

16
Q

What is the bias of a perceptron

A

the bias is a weighted connection to an extra input that is always set to 1 for every example; its weight is updated along with the other weights.
Mathematically, using a bias is the same as using an adjustable threshold:
the bias is the negation of the threshold, since sum(ws * xs) > t is the same test as sum(ws * xs) + b > 0 with b = -t
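A tiny sketch of that equivalence (illustrative names):

    def output_with_bias(xs, ws, bias):
        # the bias input is always 1, so it contributes bias * 1 to the sum;
        # with bias = -threshold this is the same test as total > threshold
        total = sum(w * x for w, x in zip(ws, xs)) + bias
        return 1 if total > 0 else 0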

17
Q

How can you overcome the limitation that a perceptron can only converge on a solution if the data is linearly separable?

A

use a multi-layer perceptron

18
Q

What is a multi-layer perceptron (MLP)?

A

An MLP is a network of artificial neurons arranged in layers. Each individual neuron is similar to a perceptron.

19
Q

What does each neuron do in a MLP?

A

accumulates input from the weighted connections and produces an output using an activation function

20
Q

Why are MLPs sometimes referred to as feedforward networks

A

because activation always flows in one direction, from the input layer to the output layer

21
Q

What are MLPs with a large number of hidden layers called?

A

Deep networks
Training a deep network is referred to as deep learning

22
Q

What makes a network fully connected (or dense)?

A

each unit is connected to every neuron in the previous and next layers

23
Q

Can multi layer perceptron classifiers learn a decision boundary of any shape?

A

Only if you have the right configuration of hidden layers, the right activation function, luck in choosing the random starting weights, and enough time and computational power to complete the training.
If the configuration isn’t right, training might never converge.

24
Q

What is the limitation of the output neurons of an MLP?

A

they’re limited to linear combinations of the output from the previous layer.

25
Q

What does it mean if an MLP has to classify data that is not linearly separable?

A

the hidden layers must be performing computations on the inputs that yield a new, linearly separable representation of the problem to present to the output layer.
Hidden layers operate a bit like the kernel functions of SVMs

26
Q

Why can’t we use the perceptron rule to train MLPs?

A

the perceptron rule is based on the difference between the actual and target output, but we don’t know in advance what a hidden unit’s output should be, so we can’t use that method for the hidden layers

27
Q

What is backpropagation?

A

a method used in neural networks to update the weights of neurons by propagating the error signal from the output layer back to the input layer through hidden layers

28
Q

What is the most popular way to train the weights of an MLP?

A

stochastic gradient descent with backpropagation.

29
Q

what is the problem of local minima

A

Because networks are initialized with small random values for the connection weights, each run might end up in a better or worse state.

    \
     \      /\
      \    /  \      /
       \__/    \    /
      local     \  /
       min       \/
            global minimum

Your network can get stuck in a suboptimal state (a local minimum) and fail to converge to an optimal state.
- always train several networks with the same architecture and choose the network with the best performance

30
Q

Why does the threshold function not have a derivative?

A

the threshold function’s derivative is 0 everywhere it is defined (the function is flat on both sides of the threshold), and at the jump itself the derivative does not exist, so there is no slope for gradient-based training to follow

31
Q

What are the activation functions of an MLP

A
  • Logistic sigmoid
    outputs values between 0 and 1
  • Hyperbolic tangent (tanh)
    outputs values between -1 and 1
  • Rectified Linear Unit (ReLU)
    f(x) = max(0, x)
    if the activation value is negative, make it 0;
    if the activation value is positive, use it as-is.
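The three functions in numpy (standard definitions):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1)

    def tanh(x):
        return np.tanh(x)                 # squashes to (-1, 1)

    def relu(x):
        return np.maximum(0, x)           # negatives become 0, positives pass through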
32
Q

Explain Non-Binary Classification using an MLP

A

To use an MLP for non-binary (multi-class) classification:
- add an output node for every class
- add a softmax layer that rescales the output values so they sum to 1 and can be read as class probabilities; the class with the highest probability becomes the predicted label
e.g., 0.2 for class A, 0.5 for class B, 0.3 for class C
the label is class B
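A minimal numpy softmax (standard definition; the example scores are made up):

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))   # subtracting the max avoids overflow; result is unchanged
        return e / e.sum()

    print(softmax(np.array([2.0, 2.9, 2.4])))   # three class scores -> probabilities summing to 1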

33
Q

Explain Regression using an MLP

A

Perceptrons can’t be used for regression because they always output a 0 or 1.
But with an MLP you can have a single output unit with the identity activation function: f(a) = a.
The output unit sums up all the outputs from the previous layer;
the output is compared to the target and training happens in much the same way.
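In sklearn this corresponds to MLPRegressor, whose output unit uses the identity activation (a minimal sketch; the toy data and layer size are illustrative):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    X = np.random.rand(100, 2)          # toy features
    y = X[:, 0] + 2 * X[:, 1]           # toy numeric target
    reg = MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000).fit(X, y)
    print(reg.predict(X[:3]))           # real-valued outputs, not just 0/1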

34
Q

What are the hyperparameters you can use when training a multi-layer perceptron

A

Feature Scaling:
- MLPs are sensitive to feature scale
- the recommendation is to scale all the data to between -1 and 1

Hidden Layer Configuration:
Consider:
1. You can look at the output of each hidden neuron as an interim classification computed along the way; the network learns from these interim classifications. (You usually won’t inspect this directly.)
2. If a layer has more neurons than the one before, it can transform the input by adding new meta-features. Increasing the dimensionality this way can make a linear separation possible in the higher-dimensional space.
3. If a layer has fewer neurons than the layer before, it sends less information forward to the next layer. If some features are redundant, or if it’s useful to combine features, this can be a good thing; otherwise it might hurt performance. Reducing dimensions too quickly creates a bottleneck in the network, so increase and decrease layer sizes gradually.
4. More layers means more epochs are needed to train the network.
5. An approach that often works well is to start with a large hidden layer (bigger than the input layer) and then slowly reduce the number of units in each layer until you get to the output.
6. When considering the size of the next layer, think in multiples of the previous layer. For example, with 1000 units in the previous layer, consider increasing by 50% (to 1500) or decreasing by 25% (to 750), but don’t drop suddenly from 1000 to 10 units (a 99% reduction); that might make it hard to learn.

Activation Function:
- ReLU, tanh, sigmoid, linear (experiment to see which works best)

Learning Rate:
- Larger values might lead to faster convergence; smaller values might yield higher accuracy but are more likely to get stuck in a local minimum
- usually start with a value like 0.001 (not too small or too big a step) and experiment from there
- if your model does not converge and the loss jumps around a lot, reduce the learning rate
- an adaptive learning rate, adjusted based on the loss, is often the best choice

Batch size and shuffling:
- With a larger batch size, you accumulate error signal over a larger number of examples before adjusting the weights
- With a smaller batch size, the weights jump around a lot more, and this can help you avoid getting stuck in a local minimum.
- larger batch size can lead to faster learning.
- shuffle data between each epoch - helps avoid getting stuck in a local minimum

Stopping Condition:
- max number of epochs (max_iter in sklearn)
- when the error rate is only changing by a small amount (tol in sklearn)

Regularization:
- Regularization refers to a set of mathematical techniques applied to the backpropagation algorithm to avoid overfitting.
- it tries to prevent the weights from becoming too specific to the training data (alpha in sklearn)

Type of Backpropagation:
- stochastic gradient descent with backpropagation
- the solver parameter in sklearn
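How these map onto sklearn’s MLPClassifier (a sketch; the specific values are illustrative, not recommendations):

    from sklearn.neural_network import MLPClassifier

    clf = MLPClassifier(
        hidden_layer_sizes=(100, 50),   # hidden layer configuration
        activation='relu',              # relu, tanh, logistic, or identity
        learning_rate_init=0.001,       # starting learning rate
        learning_rate='adaptive',       # shrink the rate when progress stalls (sgd solver only)
        batch_size=32,                  # examples per weight update
        shuffle=True,                   # reshuffle data between epochs
        max_iter=200,                   # stopping condition: max epochs
        tol=1e-4,                       # stopping condition: minimum improvement
        alpha=0.0001,                   # regularization strength
        solver='sgd',                   # type of backpropagation-based training
    )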

35
Q

What is Deep Learning

A

Deep learning is machine learning that involves very large datasets and deep neural networks with many hidden layers.
Architectures for deep learning: multi-layer perceptrons (MLPs), recurrent neural networks (RNNs), generative adversarial networks (GANs), convolutional neural networks (CNNs), and transformers.

36
Q

What is transfer learning and why do we use it

A

Training a deep neural network is expensive, so developers often opt to fine-tune a network that has already been partially or fully trained on a standard learning task, hoping that the learning from the original task can be re-purposed, or “transferred”, to the task they’re interested in

37
Q

What technical developments have enabled the boom in deep learning

A

ACCESS TO BIG DATASETS
- Deep learning networks require massive amounts of data to train effectively; we now have access to freely available text data from the web and social media
ACCESS TO FAST HARDWARE
- Graphics Processing Units (GPUs), with their large onboard memory caches and massively parallel architectures, can be repurposed for training neural networks
TECHNICAL DEVELOPMENTS
- solutions to the vanishing gradients problem: new activation functions (like ReLU), new approaches to connection-weight initialization, Batch Normalization (the input to each layer is rescaled prior to processing), etc.

38
Q

Supervised Learning vs Unsupervised Learning

A

Supervised learning: the training data consists of sets of features paired with correct outputs (either class labels or numerical values). The task of the learning algorithm is to find a model that does a good job of predicting the correct outputs for previously unseen examples.
Unsupervised learning: the data contains lists of feature values, but no “correct” output is given and there are no solved examples to give to the learner. You don’t know exactly what you are looking for, but you do know what sort of thing you are looking for.

39
Q

What do all approaches to automatic clustering have in common

A

there is always a distance calculation involved at the heart of all automatic clustering approaches

40
Q

Advantage / Disadvantage of k-Means Clustering

A

Advantage: easy to implement
Disadvantages: slow on large data sets, does not always find the optimal solution on the first run (you must do multiple runs and take the best)

41
Q

How does the k-Means algorithm work?

A
  • based on finding the centroids of a set of clusters
  • chooses k centroids randomly, then assigns every point in the sample data to its closest centroid
  • adjusts the centroids by computing the mean of the feature values in each cluster - those mean values become the new centroids
  • then reassigns every point to the new closest centroid
  • this process repeats until there are no further improvements to be had
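A compact numpy sketch of that loop (illustrative and unoptimized; it also assumes no cluster ever ends up empty, which real implementations must handle):

    import numpy as np

    def kmeans(X, k, iters=100):
        rng = np.random.default_rng()
        centroids = X[rng.choice(len(X), k, replace=False)]  # pick k data points
        for _ in range(iters):
            # assign each point to its closest centroid
            dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each centroid as the mean of its cluster
            new = np.array([X[labels == i].mean(axis=0) for i in range(k)])
            if np.allclose(new, centroids):   # no further improvement
                break
            centroids = new
        return centroids, labels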
42
Q

How do you measure cluster quality

A
  • inertia, or the Sum of Squared Errors (SSE)
    SSE measures the distance between each point and its assigned centroid, squares the results, and adds them up.
    Squaring emphasizes points that are farther from their centroid.
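Computing it in numpy (array names are illustrative; labels holds each point’s assigned centroid index, as in the k-means sketch above):

    import numpy as np

    def sse(X, centroids, labels):
        diffs = X - centroids[labels]   # vector from each point to its centroid
        return np.sum(diffs ** 2)       # square the distances and add them up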
43
Q

How can you choose centroids

A
  • a random point within the feature space
  • randomly from among the data points
  • analyze the dataset and choose deliberately
44
Q

How does k-Means clustering make predictions

A

compute the distance of the new data point to each centroid - the closest centroid represents the cluster to which the new data point is assigned
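A short numpy sketch of that prediction (illustrative names):

    import numpy as np

    def predict(point, centroids):
        # index of the centroid closest to the new data point
        return int(np.argmin(np.linalg.norm(centroids - point, axis=1)))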

45
Q

How do you find the right k value for k-Means clustering?

A

Graph the SSE for different values of k and look for the elbow in the graph; that is the right value for k.
If there is no clear elbow, you’ll have to think carefully about the particular problem you are trying to solve and what you want to get out of the clustering algorithm.
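A sketch of collecting the SSE curve with sklearn (the toy data is illustrative; plot sses against k and look for the elbow):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(200, 2)   # substitute your own feature matrix
    sses = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in range(1, 11)]
    # inertia_ is sklearn's name for the SSE of the fitted clustering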

46
Q

What are the parameters in SKLearn for k-Means clustering?

A

Initializing the Centroids
- random, or the default ‘smart’ initialization (k-means++), which spaces the initial centroids out nicely
Number of Runs
- by default it performs 10 runs (n_init) and remembers the best solution from those runs
Stopping Condition
- tolerance factor (tol): if the improvement in inertia is less than tol, it stops
- max_iter: stops after a set number of iterations
Verbosity
- verbose controls the level of log output the algorithm generates during operation
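The same parameters in code (the values shown are sklearn’s usual defaults, as far as I know):

    from sklearn.cluster import KMeans

    km = KMeans(
        n_clusters=3,          # k
        init='k-means++',      # the 'smart' spaced-out initialization
        n_init=10,             # number of runs; the best one is kept
        max_iter=300,          # stop after this many iterations
        tol=1e-4,              # stop when inertia improves by less than this
        verbose=0,             # log output level
    )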