Lecture 8 - (Stochastic) Gradient Descent, Regularization, Artificial Neural Networks, Perceptron Flashcards

Question

True or false. Regularization drives weights further away from the origin by adding a regularization term to the objective function.

Answer 1

False. The regularization term penalizes large weights, so it drives them closer to the origin.
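
The shrinking effect can be seen on a toy problem. The sketch below is illustrative only: the data, the penalty strength `lam`, and the helper `fit` are assumptions made up for this example, not part of the lecture. It fits y ≈ w * x by gradient descent with and without an L2 penalty.

```python
# Toy 1-D least squares: fit y ≈ w * x, optionally adding the penalty lam * w**2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated with w = 2 (made-up data)

def fit(lam, lr=0.01, steps=2000):
    """Gradient descent on mean squared error plus lam * w**2."""
    w = 0.0
    for _ in range(steps):
        data_grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        penalty_grad = 2 * lam * w  # gradient of the regularization term
        w -= lr * (data_grad + penalty_grad)
    return w

w_plain = fit(lam=0.0)  # converges to about 2.0
w_ridge = fit(lam=1.0)  # converges to about 28/17, closer to the origin
```

With the penalty switched on, the fitted weight is smaller in magnitude than the unregularized one.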

Answer 2

True. Lasso Regression automatically performs feature selection and outputs a sparse model, because it can drive the weights of the least important features exactly to 0, which drops those features completely.
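
Why the L1 penalty gives exact zeros can be sketched in the simplest setting, a single weight (equivalently, orthonormal features). The functions and numbers below are assumptions chosen for illustration. The L1 minimizer is the soft-thresholding rule, while the L2 minimizer only rescales.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * abs(w), the L1 (lasso) penalty."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small weights are set exactly to zero

def ridge_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + 0.5 * lam * w**2, the L2 (ridge) penalty."""
    return z / (1.0 + lam)

unregularized = [3.0, -0.2, 0.05, -1.5]  # hypothetical least-squares weights
lasso_weights = [soft_threshold(z, 0.5) for z in unregularized]
ridge_weights = [ridge_shrink(z, 0.5) for z in unregularized]
# lasso_weights -> [2.5, 0.0, 0.0, -1.0]; ridge_weights are all non-zero
```

The two small weights are dropped entirely under L1, whereas L2 keeps every weight non-zero.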

Answer 3

Early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Process:
1. Monitor the validation error while running stochastic gradient descent.
2. If the validation error starts to increase, the model has started to overfit the data.
3. Stop training at that point, i.e., as soon as the validation error has reached its minimum.
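
The process above can be sketched as follows. This is a simplified illustration: the validation curve is made up, and the `patience` parameter (how many non-improving epochs to tolerate before stopping) is a common practical addition that the card itself does not mention.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return (best_epoch, best_error), stopping once the validation error
    has not improved for `patience` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err  # new minimum: remember it
        elif epoch - best_epoch >= patience:
            break  # validation error keeps rising: stop training
    return best_epoch, best_error

# Made-up validation curve: falls, bottoms out at epoch 3, then rises.
curve = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
best = early_stopping_epoch(curve)  # -> (3, 0.4)
```

In a real training loop the model parameters from the best epoch would be saved and restored after stopping.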

Answer 4

Overfitting means that the model performs well on the training data but does not generalize well. Underfitting means that the model is too simple to learn the underlying structure of the data.

Answer 5

Dropout (a regularization technique for neural networks)

Answer 6

It is a real-valued function that measures how wrong the model is at estimating the relationship between inputs and outputs, e.g., mean squared error, cross-entropy, or negative log-likelihood (the loss that maximum-likelihood classification minimizes).
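
Two of the losses named above can be written in a few lines. This is a minimal sketch using plain Python lists; the function names are chosen for this example.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred):
    """Average negative log-likelihood of 0/1 labels under predicted
    probabilities (each probability must lie strictly between 0 and 1)."""
    total = sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, p_pred))
    return -total / len(y_true)

mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # 4/3
good = binary_cross_entropy([1, 0], [0.9, 0.1])  # confident and right: small loss
bad = binary_cross_entropy([1, 0], [0.2, 0.7])   # mostly wrong: larger loss
```

Both functions return a single real number that grows as the predictions get worse.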

Answer 7

It minimizes or maximizes a function f(x) by altering x, e.g., Gradient Descent reduces f(x) by moving x in small steps in the direction opposite to the sign of the derivative.
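
A minimal sketch of this update rule, assuming a one-dimensional function whose derivative is known. The function f(x) = (x - 3)**2 and the step size are made up for the example.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly move x against the sign of the derivative."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # step opposite to the derivative
    return x

# f(x) = (x - 3)**2 has derivative 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Starting from x = 0, the iterate ends up very close to 3.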

Answer 8

x represents the input → the input is passed to the latent features (the hidden units, with weights w) → their outputs are combined using the parameters v → this gives the output

Answer 9

the learning rate and the activation function

Answer 10

Hidden Layers

Answer 11

A Neural Network starts with identical Activation Functions but, using different Weights and Biases on the connections, it flips and stretches the Activation Functions into new shapes, which are then added together to fit a squiggle that is shifted to fit the data

Answer 12

A learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving towards a minimum of a loss function.

Answer 13

A learning rate that is too high makes the learning jump over the minimum point. A learning rate that is too low will either take too long to converge or get stuck in an undesired local minimum.
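
The first two effects can be reproduced on f(x) = x**2, whose derivative is 2 * x. The three learning rates below are assumptions picked for illustration. Because this function is convex, the sketch shows overshooting and slow convergence only, not the local-minimum case.

```python
def run(lr, steps=50, x0=1.0):
    """Gradient descent on f(x) = x**2 starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

too_high = run(1.1)    # each step overshoots the minimum and |x| grows
too_low = run(0.001)   # still close to the starting point after 50 steps
reasonable = run(0.3)  # essentially at the minimum x = 0
```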

Answer 14

A good learning rate can be found by training the model for a few hundred iterations, starting with a very low learning rate and gradually increasing it at each iteration (multiplying by a constant factor), then picking a rate slightly below the point where the loss starts to climb.

Answer 15

Sigmoid, ReLU, Softmax
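
For reference, the three functions can be sketched in plain Python. Subtracting the maximum inside `softmax` is a standard numerical-stability step and does not change the result.

```python
import math

def sigmoid(z):
    """Squashes a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Passes positive values through and clips negatives to 0."""
    return max(0.0, z)

def softmax(zs):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])  # the largest score gets the largest probability
```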

Answer 16

A TLU is an object that inputs an array of weighted quantities, sums them, and if this sum meets or surpasses some threshold, outputs a quantity. A TLU can classify. Imagine a TLU that has two inputs, whose weights equal 1, and whose theta equals 1.5. When this TLU inputs <0,0>, <0,1>, <1,0>, and <1,1>, it outputs 0, 0, 0, and 1 respectively. In a neural network, the Threshold Logic Unit (TLU) Algorithm develops a weight matrix and a threshold matrix that describes lines that separate the various class inputs.
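
The two-input example above can be checked directly. The function name `tlu` is chosen for this sketch.

```python
def tlu(inputs, weights, theta):
    """Threshold logic unit: output 1 if the weighted sum of the inputs
    meets or surpasses the threshold theta, otherwise 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= theta else 0

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [tlu(p, weights=[1, 1], theta=1.5) for p in points]
# outputs -> [0, 0, 0, 1]
```

With both weights equal to 1 and theta equal to 1.5, this TLU computes the logical AND of its inputs.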

Answer 17

A Perceptron is an algorithm used for supervised learning of binary classifiers. A binary classifier decides whether an input, usually represented by a vector of numbers, belongs to a specific class. In short, a perceptron is a single-layer neural network.

Answer 18

A single TLU can be used for simple linear binary classification. A perceptron is composed of a single layer of TLUs. Each TLU is connected to a number of inputs and computes a linear combination of them. If the result exceeds a threshold, it outputs the positive class.

Answer 19

1. A perceptron is fed one training instance at a time, and for each instance it makes its predictions.
2. For every output neuron that produces a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction.
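
The two steps above can be sketched for a single output neuron learning the logical AND. The learning rate, the epoch count, and the update `w += lr * (target - prediction) * x` are the usual textbook choices and are assumed here rather than taken from the card.

```python
def predict(x, w, b):
    """Hard-threshold prediction of a single output neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

def train_perceptron(data, lr=1.0, epochs=20):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in data:                   # one instance at a time
            error = target - predict(x, w, b)    # 0 when the prediction is right
            # On a wrong prediction, adjust the weights of the active inputs.
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
predictions = [predict(x, w, b) for x, _ in and_data]  # -> [0, 0, 0, 1]
```

AND is linearly separable, so the rule settles on weights that classify all four points correctly.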

Answer 20

An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training. Its multiple layers and non-linear activation distinguish MLP from a linear perceptron. It can distinguish data that is not linearly separable.

Answer 21

A feedforward neural network is an artificial neural network in which the connections between the nodes do not form a cycle. As such, it differs from its descendant, the recurrent neural network.

Answer 22

Backpropagation is an algorithm for supervised learning of artificial neural networks using gradient descent. Given an artificial neural network and an error function, the method calculates the gradient of the error function with respect to the neural network's weights.

Answer 23

1. For each training instance, the backpropagation algorithm first makes a prediction (forward pass) and then measures the error.
2. It then goes through each layer in reverse to measure the error contribution from each connection (backward pass).
3. Finally, it tweaks the connection weights to reduce the error (Gradient Descent step).
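
The three steps above can be sketched on a deliberately tiny network: one scalar input, two tanh hidden units with weights `w`, and a linear output with weights `v`, trained on squared error. The architecture, the numbers, and the learning rate are assumptions made for this example.

```python
import math

def forward(x, w, v):
    """Hidden activations h = tanh(w * x), output y_hat = sum(v * h)."""
    h = [math.tanh(wi * x) for wi in w]
    y_hat = sum(vi * hi for vi, hi in zip(v, h))
    return h, y_hat

def loss(x, y, w, v):
    return 0.5 * (forward(x, w, v)[1] - y) ** 2

def backprop(x, y, w, v):
    h, y_hat = forward(x, w, v)             # 1. forward pass
    delta_out = y_hat - y                   #    error at the output
    grad_v = [delta_out * hi for hi in h]   # 2. backward pass: output weights
    grad_w = [delta_out * vi * (1 - hi ** 2) * x  # then hidden weights
              for vi, hi in zip(v, h)]            # (tanh' = 1 - tanh**2)
    return grad_w, grad_v

x, y = 0.5, 1.0
w, v = [0.3, -0.8], [0.7, 0.2]
grad_w, grad_v = backprop(x, y, w, v)
lr = 0.1
w_new = [wi - lr * g for wi, g in zip(w, grad_w)]  # 3. gradient descent step
v_new = [vi - lr * g for vi, g in zip(v, grad_v)]
```

After the update the loss on this training instance is lower than before, and the analytic gradients agree with finite-difference estimates.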

Answer 24

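If there are m layers and each has k units, the cost of a forward pass is O(dk + mk^2), where d is presumably the input dimension: the first layer costs about d*k multiplications and each later layer about k^2. The backward pass has the same asymptotic cost as the forward pass.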
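
The bound can be made concrete by counting multiplications layer by layer. The sketch assumes fully connected layers, reads d as the input dimension, and ignores biases and activation functions.

```python
def forward_multiplications(d, k, m):
    """Multiplications in one forward pass through m dense layers of width k
    applied to a d-dimensional input."""
    layer_sizes = [d] + [k] * m
    # A dense layer with n_in inputs and n_out outputs needs n_in * n_out products.
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

count = forward_multiplications(d=10, k=4, m=3)  # 10*4 + 4*4 + 4*4 = 72
```

The exact count is d*k + (m - 1)*k^2, which is O(dk + mk^2).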

Answer 25

1. A Perceptron is incapable of learning complex patterns, because the decision boundary of each output neuron is linear (just like Logistic Regression classifiers).
2. Weakness: it is incapable of solving the Exclusive OR (XOR) classification problem.
   Solution: XOR can be solved by stacking multiple Perceptrons. The resulting ANN is called a Multilayer Perceptron (MLP).
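
Stacking can be illustrated with hand-picked weights rather than learned ones. The thresholds below are one possible choice, assumed for this sketch: a hidden layer of two threshold units computes OR and AND, and an output unit fires when OR is on but AND is off.

```python
def step(z):
    """Hard threshold used by each unit."""
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    h_or = step(x1 + x2 - 0.5)        # hidden unit 1: logical OR
    h_and = step(x1 + x2 - 1.5)       # hidden unit 2: logical AND
    return step(h_or - h_and - 0.5)   # output: OR and not AND

xor_outputs = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# xor_outputs -> [0, 1, 1, 0]
```

No single threshold unit can produce this output pattern, because the two classes are not linearly separable.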

Answer 26

False. Perceptrons do not output a class probability; they make predictions based on a hard threshold.

(53 cards)