Advanced Learning Algorithms (From Practice Quizzes) Flashcards

1
Q

Which of these are terms used to refer to components of an artificial neural network? (hint: three of these are correct)

A.) layers
B.) neurons
C.) activation function
D.) axon

A

A.), B.), and C.) Layers, neurons, and activation functions are all components of an artificial neural network; an axon is part of a biological neuron.

2
Q

True/False? Neural networks take inspiration from, but do not very accurately mimic, how neurons in a biological brain learn.

A

True; Artificial neural networks use a very simplified mathematical model of what a biological neuron does.

3
Q

For the following code:

model = Sequential([
    Dense(units=25, activation="sigmoid"),
    Dense(units=15, activation="sigmoid"),
    Dense(units=10, activation="sigmoid"),
    Dense(units=1, activation="sigmoid")])

This code will define a neural network with how many layers?

A.) 4
B.) 5
C.) 3
D.) 25

A

A.) 4. Each Dense() call adds one layer, and there are four Dense layers (three hidden layers plus the output layer); the input is not counted as a layer.
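
A minimal sketch (assuming TensorFlow/Keras is installed) that builds the model above and counts its layers:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation="sigmoid"),
    Dense(units=15, activation="sigmoid"),
    Dense(units=10, activation="sigmoid"),
    Dense(units=1, activation="sigmoid")])

print(len(model.layers))  # prints 4 -- one entry per Dense layer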

4
Q

Using TensorFlow, how do you define the second neural network layer with 4 neurons and a sigmoid activation?

A.) Dense(layer=2, units=4, activation='sigmoid')
B.) Dense(units=4, activation='sigmoid')
C.) Dense(units=4)
D.) Dense(units=[4], activation=['sigmoid'])

A

B.) Dense(units=4, activation='sigmoid'). In a Sequential model, the layer's position in the list determines that it is the second layer; Dense has no layer= argument, and without the activation argument the layer would default to a linear activation.

5
Q

Which of the following activation functions is the most common choice for the hidden layers of a neural network?

A.) Sigmoid
B.) Linear
C.) ReLU
D.) Most hidden layers do not use any activation function

A

C.) ReLU is used most often because it is faster to train than sigmoid: ReLU is flat only on one side (the left), whereas sigmoid flattens out (slope approaching zero) on both sides of the curve, which slows gradient descent.

6
Q

For the task of predicting housing prices, which activation functions could you choose for the output layer? Choose the 2 options that apply.

A.) Linear
B.) Sigmoid
C.) ReLU

A

A.) and C.). A linear activation function can be used for a regression task where the output can be both negative and positive, but it’s also possible to use it for a task where the output is 0 or greater (like with house prices). ReLU outputs values 0 or greater, and housing prices are positive values.
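
A minimal sketch of such an output layer, assuming TensorFlow/Keras and illustrative layer sizes; either a linear or a ReLU activation fits the output layer, since prices are non-negative:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

price_model = Sequential([
    Dense(units=32, activation="relu"),    # hidden layers
    Dense(units=16, activation="relu"),
    Dense(units=1, activation="linear")])  # output; activation="relu" also works for prices >= 0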

7
Q

True/False? A neural network with many layers but no activation function (in the hidden layers) is not effective; that’s why we should instead use the linear activation function in every hidden layer.

A

False. A linear activation is equivalent to using no activation function at all, so putting a linear activation in every hidden layer does not help: the network still collapses into an equivalent linear model.

8
Q

For a multiclass classification task that has 4 possible outputs, the sum of all the activations adds up to 1. For a multiclass classification task that has 3 possible outputs, the sum of all the activations should add up to ….

A.) Less than 1
B.) 1
C.) It will vary, depending on the input x
D.) More than 1

A

B.) 1. The softmax activations always sum to 1, regardless of the number of classes.
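
A small NumPy sketch (with illustrative logit values) showing that softmax activations sum to 1 no matter how many classes there are:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

a = softmax(np.array([2.0, -1.0, 0.5]))  # 3 possible outputs
print(a, a.sum())                        # the activations sum to 1.0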

9
Q

For multiclass classification, the cross entropy loss is used for training the model. If there are 4 possible classes for the output, and for a particular training example, the true class of the example is class 3 (y=3), then what does the cross entropy loss simplify to? [Hint: This loss should get smaller when a_3 gets larger.]

A.) z_3/(z_1+z_2+z_3+z_4)
B.) z_3
C.) −log(a_3)

A

C.) −log(a_3). The full cross entropy loss is −∑_j y_j log(a_j); since y_3 = 1 and every other y_j = 0, it reduces to −log(a_3), which shrinks as a_3 grows toward 1.

10
Q

For multiclass classification, the recommended way to implement softmax regression is to set from_logits=True in the loss function, and also to define the model’s output layer with…

A.) a ‘softmax’ activation
B.) a ‘linear’ activation

A

B.) Set the output layer's activation to linear; the loss function then handles the softmax calculation internally with a more numerically stable method.
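
A minimal sketch of this recommended setup, assuming TensorFlow/Keras and illustrative layer sizes for a 10-class problem:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model = Sequential([
    Dense(units=25, activation="relu"),
    Dense(units=15, activation="relu"),
    Dense(units=10, activation="linear")])  # linear output: the layer produces logits

# from_logits=True lets the loss apply the softmax internally, which is more numerically stable.
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))

At prediction time the model outputs logits, so they are passed through a softmax (e.g., tf.nn.softmax) to recover probabilities.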

11
Q

The Adam optimizer is the recommended optimizer for finding the optimal parameters of the model. How do you use the Adam optimizer in TensorFlow?

A.) The call to model.compile() will automatically pick the best optimizer, whether it is gradient descent, Adam or something else. So there’s no need to pick an optimizer manually.
B.) The call to model.compile() uses the Adam optimizer by default
C.) The Adam optimizer works only with Softmax outputs. So if a neural network has a Softmax output layer, TensorFlow will automatically pick the Adam optimizer.
D.) When calling model.compile, set optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3).

A

D.) Pass the optimizer explicitly when compiling: optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3).
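
A minimal sketch of option D, assuming TensorFlow/Keras and an illustrative model:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(units=25, activation="relu"),
    Dense(units=10, activation="linear")])

# The optimizer is passed explicitly to model.compile(); 1e-3 is a common starting learning rate.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))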

12
Q

What is the name of the layer type in which each neuron looks at only a subset of the values in the input vector that is fed into that layer?

A.) Convolutional layer
B.) Fully connected layer
C.) Image layer
D.) 1D layer or 2D layer (depending on the input dimension)

A

A.) For a convolutional layer, each neuron takes as input a subset of the vector that is fed into that layer.

13
Q

In the context of machine learning, what is a diagnostic?

A.) This refers to the process of measuring how well a learning algorithm does on a test set (data that the algorithm was not trained on).
B.) A test that you run to gain insight into what is/isn’t working with a learning algorithm.
C.) An application of machine learning to medical applications, with the goal of diagnosing patients’ conditions.
D.) A process by which we quickly try as many different ways to improve an algorithm as possible, so as to see what works.

A

B.) A diagnostic is a test that you run to gain insight into what is/isn't working with a learning algorithm and to guide how to improve its performance.

14
Q

True/False? It is always true that the better an algorithm does on the training set, the better it will do on generalizing to new data.

A

False; if a model overfits the training set, it may not generalize well to new data.

15
Q

For a classification task, suppose you train three different models using three different neural network architectures. Which data do you use to evaluate the three models in order to choose the best one?

A.) The test set
B.) The cross validation set
C.) The training set
D.) All the data – training, cross validation and test sets put together.

A

B.) Use the cross-validation set to calculate the cross-validation error on all three models in order to compare which of the three models is best.
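
A minimal sketch of the data split behind this answer, assuming scikit-learn and randomly generated placeholder data (a 60/20/20 split):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)             # placeholder features
y = np.random.randint(0, 2, size=1000)  # placeholder labels

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=1)
X_cv, X_test, y_cv, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1)

# Train each candidate architecture on (X_train, y_train), compare their errors on
# (X_cv, y_cv) to pick the best model, and use (X_test, y_test) only at the end to
# estimate the chosen model's generalization error.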

16
Q

If the model’s cross-validation error Jcv is much higher than the training error Jtrain, this is an indication that the model has…

A.) high bias
B.) low bias
C.) high variance
D.) low variance

A

C.) When Jcv >> Jtrain (whether or not Jtrain is also high), this is a sign that the model is overfitting the training data and performing much worse on new examples.

17
Q

Which of these is the best way to determine whether your model has high bias (has underfit the training data)?

A.) See if the cross-validation error is high compared to the baseline level of performance
B.) Compare the training error to the baseline level of performance
C.) See if the training error is high (above 15% or so)
D.) Compare the training error to the cross-validation error.

A

B.) Compare your model's training error to a baseline level of performance (such as human-level performance, or the performance of other well-established models). If your model's training error is much higher than the baseline, that is a sign the model has high bias (has underfit).

18
Q

You find that your algorithm has high bias. Which of these seem like good options for improving the algorithm’s performance? Hint: two of these are correct.

A.) Remove examples from the training set
B.) Collect more training examples
C.) Collect additional features or add polynomial features
D.) Decrease the regularization parameter λ (lambda)

A

C.) and D.) Additional or polynomial features give the model more capacity to fit the training examples, and decreasing the regularization parameter λ also lets the model fit the training data more closely.

19
Q

You find that your algorithm has a training error of 2%, and a cross validation error of 20% (much higher than the training error). Based on the conclusion you would draw about whether the algorithm has a high bias or high variance problem, which of these seem like good options for improving the algorithm’s performance? Hint: two of these are correct.

A.) Decrease the regularization parameter λ
B.) Collect more training data
C.) Collect more training data
D.) Increase the regularization parameter λ

A

B.) and D.) The model appears to have high variance (it has overfit the training set). Collecting more training examples helps reduce high variance, and increasing the regularization parameter λ also reduces high variance.

20
Q

Which of these is a way to do error analysis?

A.) Calculating the test error Jtest
B.) Collecting additional training data in order to help the algorithm do better.
C.) Manually examining a sample of the training examples that the model misclassified in order to identify common traits and trends.
D.) Calculating the training error Jtrain

A

C.) By identifying similar types of errors, you can collect more data that are similar to these misclassified examples in order to train the model to improve on these types of examples.

21
Q

We sometimes take an existing training example and modify it (for example, by rotating an image slightly) to create a new example with the same label. What is this process called?

A.) Data augmentation
B.) Machine learning diagnostic
C.) Error analysis
D.) Bias/variance analysis

A

A.) Modifying existing data (such as images, or audio) is called data augmentation.
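
A small NumPy sketch of data augmentation on a hypothetical 100x100 grayscale image; each transformed copy keeps the original label:

import numpy as np

image = np.random.rand(100, 100)  # placeholder image with values in [0, 1]

rotated = np.rot90(image)                          # rotate 90 degrees
flipped = np.fliplr(image)                         # mirror left-right
noisy = image + 0.01 * np.random.randn(100, 100)   # add a little noise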

22
Q

What are two possible ways to perform transfer learning? Hint: two of the four choices are correct.

A.) You can choose to train all parameters of the model, including the output layers, as well as the earlier layers.
B.) You can choose to train just the output layers’ parameters and leave the other parameters of the model fixed.
C.) Given a dataset, pre-train and then further fine-tune a neural network on the same dataset.
D.) Download a pre-trained model and use it for prediction without modifying or re-training it.

A

A.) and B.) It may help to train all the layers of the model on your own training set, though this takes more time than training only the output layers' parameters. Alternatively, the earlier layers can often be reused as-is, because they identify low-level features that are still relevant to your task, so only the output layers' parameters need training.
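
A minimal sketch of option B, assuming TensorFlow/Keras; MobileNetV2, the input size, and the 5-class output layer are illustrative choices, not part of the quiz:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D

base = tf.keras.applications.MobileNetV2(include_top=False, input_shape=(160, 160, 3))
base.trainable = False  # option B: keep the pre-trained layers fixed

model = Sequential([
    base,
    GlobalAveragePooling2D(),
    Dense(units=5, activation="linear")])  # new output layer trained on your own data

# Option A would instead set base.trainable = True and fine-tune every layer,
# usually with a smaller learning rate.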

23
Q

Consider training a decision tree to classify emails as spam or non-spam. There are 20 training examples at the root node: 10 spam and 10 non-spam emails. If the algorithm can choose from among four features, resulting in four corresponding splits, which would it choose (i.e., which gives the highest purity)?

A.) Left split: 5 of 10 emails are spam. Right split: 5 of 10 emails are spam.
B.) Left split: 2 of 2 emails are spam. Right split: 8 of 18 emails are spam.
C.) Left split: 10 of 10 emails are spam. Right split: 0 of 10 emails are spam.
D.) Left split: 7 of 8 emails are spam. Right split: 3 of 12 emails are spam.

A

C.) This split produces two completely pure nodes (left: all spam, right: all non-spam), so it has the highest purity and the largest information gain.
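
A small NumPy sketch comparing the weighted entropy of two of the splits; lower weighted entropy means higher purity (option C reaches 0, i.e., perfectly pure child nodes):

import numpy as np

def entropy(p):
    # Binary entropy of a node where a fraction p of its examples are spam.
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Option C: left node 10/10 spam, right node 0/10 spam (20 examples total).
print(10/20 * entropy(1.0) + 10/20 * entropy(0.0))      # 0.0 -> perfectly pure

# Option D for comparison: left node 7/8 spam, right node 3/12 spam.
print(8/20 * entropy(7/8) + 12/20 * entropy(3/12))      # > 0 -> less pure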

24
Q

When working with Decision Trees, which of these are commonly used criteria to decide to stop splitting?

A.) When the tree has reached a maximum depth
B.) When the number of examples in a node is below a threshold
C.) When the information gain from additional splits is too large
D.) When a node is 50% one class and 50% another class (highest possible value of entropy)

A

A.) and B.) Common stopping criteria are reaching a maximum tree depth and having fewer examples in a node than a threshold.

25
Q

To represent 3 possible values for the ear shape, you can define 3 features for ear shape: pointy ears, floppy ears, oval ears. For an animal whose ears are not pointy, not floppy, but are oval, how can you represent this information as a feature vector?

A.) [1,0,0]
B.) [1, 1, 0]
C.) [0, 0, 1]
D.) [0, 1, 0]

A

C.) 0 is used to represent the absence of that feature (not pointy, not floppy), and 1 is used to represent the presence of that feature (oval).

26
Q

For the random forest, how do you build each individual tree so that they are not all identical to each other?

A.) Sample the training data with replacement
B.) Sample the training data without replacement
C.) If you are training B trees, train each one on 1/B of the training set, so each tree is trained on a distinct set of examples.
D.) Train the algorithm multiple times on the same training set. This will naturally result in different trees.

A

A.) You can generate a training set that is unique for each individual tree by sampling the training data with replacement.
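
A small NumPy sketch of the bootstrap sampling behind a random forest; m and the number of trees are illustrative:

import numpy as np

m = 10                          # number of training examples
rng = np.random.default_rng(0)

# Each tree gets its own sample of m indices drawn with replacement, so some
# examples repeat and others are left out, making the trees differ.
for tree in range(3):
    idx = rng.choice(m, size=m, replace=True)
    print(sorted(idx))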

27
Q

You are choosing between a decision tree and a neural network for a classification task where the input x is a 100x100 resolution image. Which would you choose?

A.) A neural network, because the input is unstructured data and neural networks typically work better with unstructured data.
B.) A neural network, because the input is structured data and neural networks typically work better with structured data.
C.) A decision tree, because the input is structured data and decision trees typically work better with structured data.
D.) A decision tree, because the input is unstructured and decision trees typically work better with unstructured data.

A

A.) A 100x100 image is unstructured data (raw pixels), and neural networks typically work better than decision trees on unstructured data such as images, audio, and text.

28
Q

What does sampling with replacement refer to?

A.) Drawing a sequence of examples where, when picking the next example, we first put all previously drawn examples back into the set we are picking from.
B.) It refers to a process of making an identical copy of the training set.
C.) Drawing a sequence of examples where, when picking the next example, we first remove all previously drawn examples from the set we are picking from.
D.) It refers to using a new sample of data that we use to permanently overwrite (that is, to replace) the original data.

A

A.) Each example is drawn from the full training set, so the same example can be picked more than once.