Neural Network Basic Ingredients Flashcards

1
Q

Activation functions are typically what type of functions?

A

Nonlinear functions

2
Q

List three example activation functions

A

Sigmoid, ReLU, tanh

3
Q

The softmax function is a generalization of what?

A

The logistic function, generalized to multiple dimensions

4
Q

The softmax function is also known as what?

A

Softargmax or normalized exponential function

5
Q

Softmax does what?

A

It converts a vector z of K real numbers (the entries may be positive or negative, in any mix of signs) into a vector s of K probabilities, i.e., a probability distribution: the mapping from z to s is order preserving and the elements of s sum to one.
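A minimal NumPy sketch of this mapping (the max subtraction is a common numerical-stability trick, not part of the definition):

import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the shift cancels in the ratio.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])   # mixed signs are fine
s = softmax(z)
print(s, s.sum())                # entries in (0, 1); they sum to 1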

6
Q

What is forward propagation?

A

The process by which a neural network computes an output for a given input, passing the input forward through its layers in order
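As a sketch, a tiny two-layer network's forward pass might look like this in NumPy (the layer sizes and random weights are made up for illustration):

import numpy as np

def relu(x):
    return np.maximum(x, 0)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # input vector
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer parameters
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output layer parameters

h = relu(W1 @ x + b1)   # hidden activations
y = W2 @ h + b2         # the network's output for input x
print(y)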

7
Q

Explain what the term backpropagation strictly refers to

A

Backpropagation is, strictly, only the process of calculating the gradient of the loss function with respect to the weights, working backward through the network by repeated application of the chain rule. Formally, it does not refer to how the gradient is used. In practice, however, the term is often used loosely for the entire learning algorithm, including the gradient-based update step (such as stochastic gradient descent).
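A minimal sketch of the distinction for a single neuron y = sigmoid(w * x) with squared-error loss (the numbers are made up): the chain-rule gradient computation is the backpropagation part; the weight update afterward is gradient descent, not backpropagation itself.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w, x, t, lr = 0.5, 2.0, 1.0, 0.1
y = sigmoid(w * x)                 # forward pass
loss = (y - t) ** 2

# Backpropagation: dL/dw = dL/dy * dy/dz * dz/dw (chain rule)
dL_dy = 2 * (y - t)
dy_dz = y * (1 - y)                # derivative of sigmoid, written in terms of y
dz_dw = x
grad = dL_dy * dy_dz * dz_dw

w -= lr * grad                     # this update step is SGD, not backprop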

8
Q

Stochastic gradient descent

A

Gradient descent in which each iteration estimates the gradient from a single randomly chosen training example (batch size 1), rather than from the full training set, and updates the weights using that noisy estimate
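A minimal sketch of SGD fitting a one-parameter linear model with squared-error loss (the data and learning rate are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=20)
ys = 3.0 * xs + rng.normal(scale=0.1, size=20)   # noisy samples of y = 3x

w, lr = 0.0, 0.05
for _ in range(50):                              # epochs
    for i in rng.permutation(len(xs)):           # one example per update
        grad = 2 * (w * xs[i] - ys[i]) * xs[i]   # d/dw of (w*x - y)^2
        w -= lr * grad
print(w)                                         # approaches 3.0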
9
Q

Rprop

A

Resilient backpropagation

10
Q

Batch

A

In gradient descent, a batch is the set of training examples used to calculate the gradient in a single iteration.

11
Q

Batch size for stochastic gradient descent

A

1

12
Q

Convex problems

A

Problems whose loss function (taking the weights as the independent variables) has only one minimum; that is, only one place where the slope is exactly 0

13
Q

ln(1)

A

0

14
Q

Cross entropy

A

A measure of how good a classifier's predicted distribution is. More specifically, when classifying an item over a discrete set of classes x ∈ X, let p(x) be the true distribution, usually expressed as a one-hot vector such as <0 0 0 0 1.0 0 0> for a class set of size 7. The model predicts some distribution q(x). The cross entropy is then H(p, q) = −Σ_{x ∈ X} p(x) log q(x); with a one-hot p this reduces to −log q(true class), so better predictions yield lower values.
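A minimal sketch of that computation with a one-hot p and a made-up predicted q:

import numpy as np

p = np.array([0, 0, 0, 0, 1.0, 0, 0])                    # true class is index 4
q = np.array([0.05, 0.05, 0.1, 0.1, 0.6, 0.05, 0.05])    # model's predicted distribution

cross_entropy = -np.sum(p * np.log(q))   # reduces to -log(q[4])
print(cross_entropy)                     # ≈ 0.51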

15
Q

What does the graph of a logistic function look like?

A

An S-shaped curve: it approaches 0 as x → −∞, approaches 1 as x → +∞, and passes through 0.5 at x = 0
16
Q

What are the domain and range of the sigmoid function?

A

Domain: (−∞, ∞)
Range: (0, 1)

17
Q

Why is softmax (softargmax) considered the generalization of the logistic function to multiple dimensions?

A

The logistic function takes a scalar of any real value and converts it into another scalar in the range (0, 1). Softmax takes a d-dimensional vector of any real values and converts it into another d-dimensional vector whose values all lie in (0, 1) and sum to one.

18
Q

When is the softmax function used?

A

It is used as the activation function in the output layer of neural networks performing multi-class classification.

19
Q

What is the formula for the Jacobian of the softmax function?

A

∂s_i/∂z_j = s_i(δ_ij − s_j), where s = softmax(z) and δ_ij is the Kronecker delta
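A sketch checking this formula against finite differences (softmax as defined earlier; the test vector is arbitrary):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, -0.5])
s = softmax(z)

# Analytic Jacobian: J[i, j] = s_i * (delta_ij - s_j)
J = np.diag(s) - np.outer(s, s)

# Numerical Jacobian via central differences
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-8))   # True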
20
Q

Categorical cross entropy loss is also called what?

A

Softmax loss

21
Q

Classification problems can be subdivided into what two categories?

A

Multi-class classification and multi-label classification

22
Q

Explain the difference between multi-class and multi-label classification

A

multi-class classification - each sample belongs to only one class (mutually exclusive)

multi-label classification - each sample may belong to multiple classes (or to no class)

23
Q

Kronecker delta function

A

δ_ij = 1 if i = j, and 0 otherwise
24
Q

List four benefits of ReLU neurons

A

1) it helps prevent vanishing gradients
2) it's idempotent
3) it creates sparse activations
4) it's faster to compute in both forward and backward passes

25
Q

What is a downside of using ReLUs?

A

Dying ReLUs

26
Q

What would three consecutive sigmoid activations look like?

A
27
Q

What would ReLU(ReLU(ReLU(2))) yield?

A

2

28
Q

Principal difference between sigmoid and tanh?

A

Sigmoid range is (0, 1). Tanh range is (-1, 1)

29
Q

Sigmoid formula

A

1/(1+e^(-x))

30
Q

LU in ReLU stands for what?

A

Linear Unit

31
Q

FC layer stands for what?

A

Fully connected layer

32
Q

A unit with an identity activation function is equivalent to what?

A

Multiple linear regression

33
Q

A unit with a sigmoid activation function is equivalent to what?

A

Logistic regression

34
Q

ReLU stands for what?

A

Rectified linear unit

35
Q

Activation functions serve what two primary purposes?

A

1) help a model account for interaction effects
2) help a model account for non-linear effects

36
Q

What is an interactive (interaction) effect?

A

It is when one variable A affects a prediction differently depending on the value of another variable B. For example, if a model wanted to know whether a certain body weight indicates an increased risk of diabetes, it would also have to know the individual's height.

37
Q

Give an example of how the output of of a node’s value is calculated

A

You might have inputs A and B, a bias b, weights wA = 2 and wB = 3, and ReLU as the activation function F. The node's output would then be ReLU(2A + 3B + b).
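The same calculation as a quick sketch (the input and bias values are made up):

import numpy as np

A, B, b = 1.5, -2.0, 0.5
output = np.maximum(2 * A + 3 * B + b, 0)   # ReLU(2A + 3B + b)
print(output)                               # ReLU(-2.5) = 0.0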

38
Q

Tanh range

A

-1 to 1

39
Q

A Sigmoid function is especially useful when we have to predict what?

A

The probability of something

40
Q

Why is the sigmoid function useful for predicting probability?

A

Because a probability must be between 0 and 1, which is the same range as the sigmoid's output

41
Q

What layer is sigmoid mostly used in?

A

Activation layer

42
Q

Why is sigmoid mainly used on activation layer?

A

Because when used in other layers it can cause the network to get stuck during training

43
Q

Explain the dying ReLu phenomenon

A

When the network's weights cause a neuron's pre-activation value to be negative for all inputs, the neuron always outputs 0 and can be called “dead”: it no longer affects downstream neurons, and because ReLU's gradient is 0 for negative inputs, it may never recover.
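A small sketch of the phenomenon (the weight and bias values are made up): with a sufficiently negative bias, the pre-activation stays negative for these inputs, so the neuron outputs 0 and ReLU's gradient there is 0 as well.

import numpy as np

def relu(x):
    return np.maximum(x, 0)

w, b = 0.1, -10.0                  # bias so negative the neuron never fires
for x in [-1.0, 0.0, 5.0, 50.0]:
    z = w * x + b                  # pre-activation is negative for all of these
    print(relu(z))                 # 0.0 every time; gradient through ReLU is also 0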

44
Q

Give two examples of saturating non linearities

A

tanh(x) and sigmoid(x)

45
Q

ReLU formula

A

ReLU(x) = max(x, 0)

46
Q

Rectified networks

A

Networks that use the rectifier function (ReLU) for hidden layers

47
Q

Why is it useful that part of ReLU is linear?

A

Because ReLU preserves many of the useful features of linear models

48
Q

List two of the useful properties of linear activation functions

A

1) they're easy to optimize with gradient-based methods
2) they generalize well

49
Q

Is ReLU considered linear or nonlinear?

A

Nonlinear

50
Q

List two common learning rate schedules

A

1) constant learning rate
2) monotonically decaying rate after an optional warmup

51
Q

MLP stands for?

A

Multilayer perceptron

52
Q

What is an MLP?

A

A sequence of layers of neurons

53
Q

What is an intuitive understanding of what the bias is?

A

The bias controls the trigger-happiness of the neuron: the higher the bias, the more likely the neuron is to fire (if the activation function is ReLU). More generally, it simply adds a constant to the neuron's pre-activation value.

54
Q

In rectified networks does every hidden layer use ReLUs?

A

No, not necessarily

55
Q

Are activation functions part of a neuron? Or are they applied after a neuron, i.e., is there an entire activation-function layer applied to the outputs of a layer?

A

Logically, activation functions are part of a neuron: they convert the neuron's pre-activation value into its output.

Some frameworks or notations, however, apply the activation function as a separate layer, element-wise, to a whole set of neuron outputs (i.e., a layer's output).