Neural Network Basic Ingredients Flashcards

1
Q

Activation functions are typically what type of functions?

A

Nonlinear functions

2
Q

List three example activation functions

A

Sigmoid, ReLU, tanh

3
Q

The softmax function is a generalization of what?

A

The logistic function, generalized to multiple dimensions

4
Q

The softmax function is also known as what?

A

Softargmax or normalized exponential function

5
Q

Softmax does what?

A

It converts a vector z of K real numbers (the entries may be positive or negative, in any mix of signs) into a vector s of K probabilities, i.e., a probability distribution: the mapping from z to s is order preserving and the elements of s sum to one.
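A minimal NumPy sketch of this mapping (the max subtraction is a common numerical-stability trick, not part of the definition):

import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the shift cancels in the ratio.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])   # mixed signs are fine
s = softmax(z)
print(s, s.sum())                # entries in (0, 1); they sum to 1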

6
Q

What is forward propagation?

A

The process by which a neural network computes an output for a given input, passing the input forward through its layers in order
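As a sketch, a tiny two-layer network's forward pass might look like this in NumPy (the layer sizes and random weights are made up for illustration):

import numpy as np

def relu(x):
    return np.maximum(x, 0)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # input vector
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer parameters
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output layer parameters

h = relu(W1 @ x + b1)   # hidden activations
y = W2 @ h + b2         # the network's output for input x
print(y)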

7
Q

Explain what the term backpropagation strictly refers to

A

Backpropagation is, strictly, only the process of calculating the gradient of the loss function with respect to the weights, working backward through the network by repeated application of the chain rule. Formally, it does not refer to how the gradient is used. In practice, however, the term is often used loosely for the entire learning algorithm, including the gradient-based update step (such as stochastic gradient descent).
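A minimal sketch of the distinction for a single neuron y = sigmoid(w * x) with squared-error loss (the numbers are made up): the chain-rule gradient computation is the backpropagation part; the weight update afterward is gradient descent, not backpropagation itself.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w, x, t, lr = 0.5, 2.0, 1.0, 0.1
y = sigmoid(w * x)                 # forward pass
loss = (y - t) ** 2

# Backpropagation: dL/dw = dL/dy * dy/dz * dz/dw (chain rule)
dL_dy = 2 * (y - t)
dy_dz = y * (1 - y)                # derivative of sigmoid, written in terms of y
dz_dw = x
grad = dL_dy * dy_dz * dz_dw

w -= lr * grad                     # this update step is SGD, not backprop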

8
Q

Stochastic gradient descent

A

Gradient descent in which each iteration estimates the gradient from a single randomly chosen training example (batch size 1), rather than from the full training set, and updates the weights using that noisy estimate
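A minimal sketch of SGD fitting a one-parameter linear model with squared-error loss (the data and learning rate are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=20)
ys = 3.0 * xs + rng.normal(scale=0.1, size=20)   # noisy samples of y = 3x

w, lr = 0.0, 0.05
for _ in range(50):                              # epochs
    for i in rng.permutation(len(xs)):           # one example per update
        grad = 2 * (w * xs[i] - ys[i]) * xs[i]   # d/dw of (w*x - y)^2
        w -= lr * grad
print(w)                                         # approaches 3.0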
9
Q

Rprop

A

Resilient backpropagation

10
Q

Batch

A

In gradient descent, a batch is the set of training examples used to calculate the gradient in a single iteration.

11
Q

Batch size for stochastic gradient descent

A

1

12
Q

Convex problems

A

Problems whose loss function (taking the weights as the independent variables) has only one minimum; that is, only one place where the slope is exactly 0

13
Q

ln(1)

A

0

14
Q

Cross entropy

A

A measure of how good a classifier's predicted distribution is. More specifically, when classifying an item over a discrete set of classes x ∈ X, let p(x) be the true distribution, usually expressed as a one-hot vector such as <0 0 0 0 1.0 0 0> for a class set of size 7. The model predicts some distribution q(x). The cross entropy is then H(p, q) = −Σ_{x ∈ X} p(x) log q(x); with a one-hot p this reduces to −log q(true class), so better predictions yield lower values.
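A minimal sketch of that computation with a one-hot p and a made-up predicted q:

import numpy as np

p = np.array([0, 0, 0, 0, 1.0, 0, 0])                    # true class is index 4
q = np.array([0.05, 0.05, 0.1, 0.1, 0.6, 0.05, 0.05])    # model's predicted distribution

cross_entropy = -np.sum(p * np.log(q))   # reduces to -log(q[4])
print(cross_entropy)                     # ≈ 0.51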

15
Q

What does the graph of a logistic function look like?

A

An S-shaped curve: it approaches 0 as x → −∞, approaches 1 as x → +∞, and passes through 0.5 at x = 0
16
Q

What are the domain and range of the sigmoid function?

A

Domain: (−∞, ∞)
Range: (0, 1)

17
Q

Why is softmax (softargmax) considered the generalization of the logistic function to multiple dimensions?

A

The logistic function takes a scalar of any real value and converts it into another scalar in the range (0, 1). Softmax takes a d-dimensional vector of any real values and converts it into another d-dimensional vector whose values all lie in (0, 1) and sum to one.

18
Q

When is the softmax function used?

A

It is used as the activation function in the output layer of neural networks performing multi-class classification.

19
Q

What is the formula for the Jacobian of the softmax function?

A

∂s_i/∂z_j = s_i(δ_ij − s_j), where s = softmax(z) and δ_ij is the Kronecker delta
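A sketch checking this formula against finite differences (softmax as defined earlier; the test vector is arbitrary):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, -0.5])
s = softmax(z)

# Analytic Jacobian: J[i, j] = s_i * (delta_ij - s_j)
J = np.diag(s) - np.outer(s, s)

# Numerical Jacobian via central differences
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-8))   # True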
20
Q

Categorical cross entropy loss is also called what?

A

Softmax loss

21
Q

Classification problems can be subdivided into what two categories?

A

Multi-class classification and multi-label classification

22
Q

Explain the difference between multi-class and multi-label classification

A

multi-class classification - each sample belongs to only one class (mutually exclusive)

multi-label classification - each sample may belong to multiple classes (or to no class)

23
Q

Kronecker delta function

A

δ_ij = 1 if i = j, and 0 otherwise
24
Q

List four benefits of ReLU neurons

A

1) it helps prevent vanishing gradients
2) it's idempotent
3) it creates sparse activations
4) it's faster to compute in both forward and backward passes

25
Q

What is a downside of using ReLUs?

A

Dying ReLUs

26
Q

What would three consecutive sigmoid activations look like?

A
27
Q

What would ReLU(ReLU(ReLU(2))) yield?

A

2

28
Q

Principal difference between sigmoid and tanh?

A

Sigmoid range is (0, 1). Tanh range is (-1, 1)

29
Q

Sigmoid formula

A

1/(1+e^(-x))

30
Q

LU in ReLU stands for what?

A

Linear Unit

31
Q

FC layer stands for what?

A

Fully connected layer

32
Q

A unit with an identity activation function is equivalent to what?

A

Multiple linear regression

33
Q

A unit with a sigmoid activation function is equivalent to what?

A

Logistic regression

34
Q

ReLU stands for what?

A

Rectified linear unit

35
Q

Activation functions serve what two primary purposes?

A

1) help a model account for interaction effects
2) help a model account for non-linear effects

36
Q

What is an interactive (interaction) effect?

A

It is when one variable A affects a prediction differently depending on the value of another variable B. For example, if a model wanted to know whether a certain body weight indicates an increased risk of diabetes, it would also have to know the individual's height.

37
Q

Give an example of how the output of of a node’s value is calculated

A

You might have inputs A and B, a bias b, weights wA = 2 and wB = 3, and ReLU as the activation function F. The node's output would then be ReLU(2A + 3B + b).
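The same calculation as a quick sketch (the input and bias values are made up):

import numpy as np

A, B, b = 1.5, -2.0, 0.5
output = np.maximum(2 * A + 3 * B + b, 0)   # ReLU(2A + 3B + b)
print(output)                               # ReLU(-2.5) = 0.0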

38
Q

Tanh range

A

-1 to 1

39
Q

A Sigmoid function is especially useful when we have to predict what?

A

The probability of something

40
Q

Why is the sigmoid function useful for predicting probability?

A

Because a probability must be between 0 and 1, which is the same range as the sigmoid's output

41
Q

What layer is sigmoid mostly used in?

A

Activation layer

42
Q

Why is sigmoid mainly used on activation layer?

A

Because when used in other layers it can cause the network to get stuck during training

43
Q

Explain the dying ReLu phenomenon

A

When the network's weights cause a neuron's pre-activation value to be negative for all inputs, the neuron always outputs 0 and can be called “dead”: it no longer affects downstream neurons, and because ReLU's gradient is 0 for negative inputs, it may never recover.
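A small sketch of the phenomenon (the weight and bias values are made up): with a sufficiently negative bias, the pre-activation stays negative for these inputs, so the neuron outputs 0 and ReLU's gradient there is 0 as well.

import numpy as np

def relu(x):
    return np.maximum(x, 0)

w, b = 0.1, -10.0                  # bias so negative the neuron never fires
for x in [-1.0, 0.0, 5.0, 50.0]:
    z = w * x + b                  # pre-activation is negative for all of these
    print(relu(z))                 # 0.0 every time; gradient through ReLU is also 0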

44
Q

Give two examples of saturating non linearities

A

tanh(x) and sigmoid(x)

45
Q

ReLU formula

A

ReLU(x) = max(x, 0)

46
Q

Rectified networks

A

Networks that use the rectifier function (ReLU) for hidden layers

47
Q

Why is it useful that part of ReLU is linear?

A

Because ReLU preserves many of the useful features of linear models

48
Q

List two of the useful properties of linear activation functions

A

1) they're easy to optimize with gradient-based methods
2) they generalize well

49
Q

Is ReLU considered linear or nonlinear?

A

Nonlinear

50
Q

List two common learning rate schedules

A

1) constant learning rate
2) monotonically decaying rate after an optional warmup

51
Q

MLP stands for?

A

Multilayer perceptron

52
Q

What is an MLP?

A

A sequence of layers of neurons

53
Q

What is an intuitive understanding of what the bias is?

A

The bias controls the trigger-happiness of the neuron: the higher the bias, the more likely the neuron is to fire (if the activation function is ReLU). More generally, it simply adds a constant to the neuron's pre-activation value.

54
Q

In rectified networks does every hidden layer use ReLUs?

A

No, not necessarily

55
Q

Are activation functions part of a neuron? Or are they applied after a neuron, i.e., is there an entire activation-function layer applied to the outputs of a layer?

A

Logically, activation functions are part of a neuron: they convert the neuron's pre-activation value into its output.

Some frameworks or notations, however, apply the activation function as a separate layer, element-wise, to a whole set of neuron outputs (i.e., a layer's output).