Practice Exam Flashcards

1
Q

T/F
Q-learning can learn the optimal Q-function $$Q^*$$ without ever executing the optimal policy.

A

True

Yes, this is a property called off-policy learning.
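
A minimal sketch of this idea, under assumptions not in the card: a hypothetical 5-state corridor (`N_STATES`, `step`), a reward of -1 per move, and a behavior policy that picks actions uniformly at random. The agent never deliberately follows the best policy, yet the greedy policy read off the learned table should still point toward the exit.

```python
import random

N_STATES = 5          # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = (-1, +1)    # move left, move right

def step(state, action):
    """Deterministic move with walls at both ends; every move costs -1."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    return next_state, -1.0, next_state == N_STATES - 1

def q_learning(episodes=500, alpha=0.5, gamma=1.0, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            a_idx = rng.randrange(2)  # behavior policy: uniformly random
            next_state, reward, done = step(state, ACTIONS[a_idx])
            # Off-policy target: bootstrap from the best next action,
            # regardless of what the behavior policy will actually do.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a_idx] += alpha * (target - q[state][a_idx])
            state = next_state
    return q

q = q_learning()
greedy = [ACTIONS[max(range(2), key=lambda i: q[s][i])] for s in range(N_STATES - 1)]
print(greedy)
```

The update target uses `max(q[next_state])` rather than the value of the action the random policy takes next, which is what makes the update off-policy.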

2
Q

Which of the following would be the best reward function for a robot
that is trying to learn to escape a maze quickly (assume a discount of $$\gamma = 1$$)?

(A) Reward of +1 for escaping the maze and a reward of zero at all other times.
(B) Reward of +1 for escaping the maze and a reward of -1 at all other times.
(C) Reward of +1000 for escaping the maze and a reward of +1 at all other times.

A

(B) Reward of +1 for escaping the maze and a reward of -1 at all other times.
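
To see why, it may help to compare undiscounted returns for a short and a long escape path under each scheme. The helper `undiscounted_return` and the path lengths 5 and 50 are illustrative assumptions, not part of the original question.

```python
def undiscounted_return(steps, step_reward, escape_reward):
    """Return with gamma = 1: (steps - 1) ordinary moves, then the escaping move."""
    return (steps - 1) * step_reward + escape_reward

schemes = {"A": (0, 1), "B": (-1, 1), "C": (1, 1000)}
for name, (step_reward, escape_reward) in schemes.items():
    short = undiscounted_return(5, step_reward, escape_reward)
    long = undiscounted_return(50, step_reward, escape_reward)
    print(name, short, long)
```

Under (A) both paths earn the same return, so nothing pushes the robot to hurry; under (C) the longer path earns more; only (B) makes the shorter path strictly better.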

3
Q

What does regret let us quantify?

(A) Whether our policy is optimal or not.
(B) The relative goodness of exploration procedures.
(C) The negative utility of a state like a fire pit.
(D) How accurately we estimated the probabilities of the transition function.

A

(B) The relative goodness of exploration procedures.

4
Q

Which of the following is NOT True for both MDPs and Reinforcement Learning?

(A) A discounted future reward is used.
(B) An instantaneous reward is used.
(C) After selecting an action at a state, the resulting state is probabilistically determined.
(D) The values for the transition function are known in advance.

A

(D) The values for the transition function are known in advance.

5
Q

T/F
The utility function estimate must be completely accurate in order to get an optimal policy.

A

False
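
One way to see this: a greedy policy depends only on which action ranks highest in each state, not on the exact values. The sketch below uses made-up action values (`true_q`, `rough_q`) for two hypothetical states; the rough estimate is numerically far off but still yields the same policy.

```python
def greedy_policy(q_values):
    """Pick the highest-valued action in each state."""
    return {state: max(acts, key=acts.get) for state, acts in q_values.items()}

# Hypothetical action values (illustrative numbers only).
true_q = {"s0": {"left": 1.0, "right": 2.0}, "s1": {"left": 5.0, "right": 3.0}}
# A deliberately inaccurate estimate that keeps the same ranking of actions.
rough_q = {"s0": {"left": 0.2, "right": 0.9}, "s1": {"left": 40.0, "right": 10.0}}

same_policy = greedy_policy(true_q) == greedy_policy(rough_q)
print(same_policy)
```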

6
Q

What is a contraction?
(A) The time savings from estimating the optimal policy via policy iteration instead of value iteration.
(B) Part of the proof of convergence for the value iteration algorithm.
(C) A shorter path to a node in the A* algorithm when that node is already present on the priority queue.
(D) The part of the state space that is not observable in partially observable MDPs.

A

(B) Part of the proof of convergence for the value iteration algorithm.
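
For reference, the contraction property used in that proof can be stated as follows, writing $$B$$ for the Bellman update, $$\gamma < 1$$ for the discount, and $$\|\cdot\|$$ for the max norm (notation assumed here, not taken from the card):

$$\|B U - B U'\| \le \gamma \, \|U - U'\|$$

Because each update shrinks the distance between any two utility estimates by at least a factor of $$\gamma$$, repeated updates converge to a single fixed point.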

7
Q

In the MDP framework we model the interaction between an agent and an environment. Which of the following statements are true of that framework?

2.3.1
The agent selects actions, which deterministically move it to a new state in the environment.

A

False

8
Q

In the MDP framework we model the interaction between an agent and an environment. Which of the following statements are true of that framework?

2.3.2
The agent receives a reward only once it arrives in its goal state.

A

False

9
Q

You roll two regular six-sided dice. What is the probability of getting a total sum of 10 or more given that the first die shows a 6? Write as a decimal.

A

0.5
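
The answer can be checked by brute-force enumeration: restrict the 36 outcomes to those where the first die is 6, then count how many reach a sum of at least 10. The variable names below are illustrative.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))           # all 36 outcomes
given = [(a, b) for a, b in rolls if a == 6]           # condition: first die is 6
hits = [(a, b) for a, b in given if a + b >= 10]       # second die must be 4, 5, or 6
prob = Fraction(len(hits), len(given))
print(float(prob))
```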

10
Q

How many ways are there to apply the chain rule to a joint distribution with $$N$$ random variables?

(A) $$N$$
(B) $$N^2$$
(C) $$2^N$$
(D) $$N!$$

A

(D) $$N!$$
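
Each ordering of the variables gives one chain-rule factorization, which is where $$N!$$ comes from. A small sketch that enumerates them for three variables (the helper `chain_rule_factorizations` is hypothetical):

```python
from itertools import permutations
from math import factorial

def chain_rule_factorizations(variables):
    """Build one chain-rule factorization per ordering of the variables."""
    results = []
    for order in permutations(variables):
        factors = []
        for i, var in enumerate(order):
            given = order[:i]  # condition on everything earlier in the ordering
            factors.append(f"P({var}|{','.join(given)})" if given else f"P({var})")
        results.append(" ".join(factors))
    return results

factorizations = chain_rule_factorizations(["A", "B", "C"])
print(len(factorizations), factorial(3))
```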

11
Q

T/F
The Markov property says that given the past state, the present and the future are independent.

A

False
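
The statement on the card swaps the roles of past and present. A common way to write the first-order Markov property, using $$X_t$$ for the state at time $$t$$ (notation assumed here), is:

$$P(X_{t+1} \mid X_0, \dots, X_t) = P(X_{t+1} \mid X_t)$$

That is, given the present state, the future is independent of the past.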

12
Q

If a process is stationary, it means that:

(A) the state itself does not change
(B) the conditional probability table does not change over time
(C) the transition table is deterministic
(D) the agent has reached a terminal state

A

(B) the conditional probability table does not change over time

13
Q

Which of the following is unnecessary to construct a dynamic Bayesian network (DBN)?

(A) The sensor model.
(B) The transition model.
(C) The prior distribution over the state variables.
(D) Multiple state and evidence variables.

A

(D) Multiple state and evidence variables.

14
Q

What is the effect of the Markov assumption in n-gram language models?

(A) It makes it possible to estimate the probabilities from data.
(B) Long distance relationships, like subject verb agreement, are taken into account.
(C) The probability of a word is determined by all previous words in the sentence.
(D) The probability of a word is determined only by a single preceding word.

A

(A) It makes it possible to estimate the probabilities from data.

15
Q

How are n-gram language models typically evaluated?

(A) Correlation with human judgments
(B) Cross-entropy measured against gold standard labels
(C) Perplexity on a test set
(D) Precision and recall

A

(C) Perplexity on a test set
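
As a rough illustration, perplexity is the exponential of the average negative log-probability the model assigns to each test token. The function below assumes the per-token probabilities have already been computed by some model; the input values are made up.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability over the test tokens."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# A model that assigns probability 1/4 to every test token behaves like a
# uniform choice among 4 words, so its perplexity is about 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
print(uniform)
```

Lower perplexity means the model found the test set less surprising.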

16
Q

T/F
The conditional probability distribution of a variable in a Bayesian network should be specified based on the probability distributions of all of the other variables (nodes).

A

False

Only the node's parents are needed.

17
Q

Write the joint probability for the Bayes’ Net shown below, encoding its independence assumptions into your equation. $$P(A, B, C, D, E) =$$

A

$$P(A) \cdot P(B) \cdot P(C|A) \cdot P(D|C,A) \cdot P(E|C,B)$$

18
Q

Which of the following is true of locally structured (sparse) systems?

(A) Each subcomponent must interact directly with all the other components.
(B) The structure grows linearly in complexity (rather than exponentially).
(C) Every variable cannot be influenced by all of the others.
(D) All such systems are compact.

A

(B) The structure grows linearly in complexity (rather than exponentially).

19
Q

T/F

It is possible that more than one Bayesian network can be used to represent the same joint distribution.

A

True

20
Q

If two variables (nodes) X and Y in a Bayesian network do not share a path, which of the following must be true?

(A) X and Y can never be true at the same time.
(B) X and Y are conditionally independent.
(C) X has a direct influence on Y.
(D) There exists a causal relationship between X and Y

A

(B) X and Y are conditionally independent.

21
Q

T/F

The Naive Bayes model is “naive” because it assumes that the features are conditionally independent of each other, given the class.

A

True

22
Q

Write down the form of the joint probability model $$P(X_1, X_2, X_3, Y )$$ for this data using the Naive Bayes assumption.

A

$$P(Y)P(X_1|Y)P(X_2|Y)P(X_3|Y)$$
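
A sketch of evaluating this factorization for binary variables. The data the question refers to is not reproduced here, so the tables `prior` and `likelihoods` below are invented purely for illustration.

```python
from itertools import product

def naive_bayes_joint(prior, likelihoods, y, xs):
    """P(Y=y) times the product of P(X_i = x_i | Y = y)."""
    p = prior[y]
    for table, x in zip(likelihoods, xs):
        p *= table[y][x]
    return p

# Hypothetical tables: prior[y] and likelihoods[i][y][x].
prior = {0: 0.5, 1: 0.5}
likelihoods = [
    {0: {0: 0.5, 1: 0.5}, 1: {0: 0.25, 1: 0.75}},   # P(X_1 | Y)
    {0: {0: 0.25, 1: 0.75}, 1: {0: 0.5, 1: 0.5}},   # P(X_2 | Y)
    {0: {0: 0.5, 1: 0.5}, 1: {0: 0.25, 1: 0.75}},   # P(X_3 | Y)
]

joint = naive_bayes_joint(prior, likelihoods, y=1, xs=(1, 0, 1))
total = sum(
    naive_bayes_joint(prior, likelihoods, y, xs)
    for y in (0, 1)
    for xs in product((0, 1), repeat=3)
)
print(joint, total)
```

Summing the joint over every assignment should give 1, which is a quick sanity check on the tables.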

23
Q

Which of the following can handle unknown contexts (assume words not in the vocabulary are all assigned to the same keyword)?
(Check all that apply.)

[A] Maximum Likelihood Estimate
[B] Stupid backoff
[C] Laplace Smoothing
[D] Smart backoff

A

[B] Stupid backoff
[C] Laplace Smoothing
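
A sketch of add-one (Laplace) smoothing for bigrams, showing that a context never seen in training still receives a well-defined probability, whereas the maximum likelihood estimate would divide zero by zero. The toy corpus and the `<UNK>` token name are assumptions for illustration.

```python
from collections import Counter

def laplace_bigram_prob(tokens, vocab, prev, word):
    """Add-one estimate of P(word | prev) from a token list."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    # Counter returns 0 for unseen keys, so unseen contexts reduce to 1 / |V|.
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

tokens = "the cat sat on the mat".split()
vocab = set(tokens) | {"<UNK>"}

seen = laplace_bigram_prob(tokens, vocab, "the", "cat")
unseen_context = laplace_bigram_prob(tokens, vocab, "<UNK>", "cat")
print(seen, unseen_context)
```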

24
Q

Which of the following best characterizes the difference between parametric and nonparametric models?

(A) Parametric models cannot summarize data with a large number of training examples.
(B) Parametric models can be used if each hypothesis considers all of the other training examples to make the next prediction.
(C) Instance-based learning and memory-based learning use parametric models.
(D) A parametric model has a fixed number of parameters.

A

(D) A parametric model has a fixed number of parameters.

25
Q

Which of the following is NOT true about the bag-of-words model?

(A) Each position is identically distributed.
(B) Predict label conditioned on feature variables.
(C) It is sensitive to word order or reordering.
(D) All positions share the same conditional probabilities.

A

(C) It is sensitive to word order or reordering.

26
Q

Which of the following are true about linear functions?

(A) They can only be used for classification.
(B) All sets of data points are linearly separable.
(C) When using the perceptron learning rule, the weights are updated when the actual output does not match the hypothesis output.
(D) The learning rule must be applied to one example at a time.

A

(C) When using the perceptron learning rule, the weights are updated when the actual output does not match the hypothesis output.
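
A minimal sketch of that rule for a binary perceptron that outputs 1 when the raw output is greater than 0. The AND dataset, the learning rate, and the function names are assumptions chosen for illustration; weights change only on examples where the prediction and the label disagree.

```python
def predict(weights, bias, x):
    """Threshold unit: 1 if the weighted sum plus bias is positive, else 0."""
    raw = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if raw > 0 else 0

def train_perceptron(data, lr=1.0, max_epochs=100):
    weights, bias = [0.0] * len(data[0][0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            error = y - predict(weights, bias, x)
            if error != 0:  # update only when output and label differ
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
                mistakes += 1
        if mistakes == 0:   # a full pass with no errors: stop
            break
    return weights, bias

# Logical AND is linearly separable, so training should terminate.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_data)
print(weights, bias)
```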

27
Q

T/F
For linearly separable data, there exists only one decision boundary that separates the classes.

A

False
raw output > 0 -> 1, else 0

28
Q

T/F
If a binary perceptron has a high bias value, it is easier for the perceptron to output a 1.

A

True

29
Q

T/F
If all of its weights are greater than or equal to 0, a binary perceptron will always fire.

A

False
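
Two quick counterexamples, assuming the unit fires when the raw output is greater than 0: non-negative weights can still produce a non-positive sum if the inputs are negative or the bias is sufficiently negative. The numbers are illustrative.

```python
def fires(weights, bias, x):
    """True if the weighted sum plus bias is strictly positive."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias > 0

weights = [1.0, 2.0]  # all weights are >= 0

negative_inputs = fires(weights, 0.0, [-1.0, -1.0])   # raw output is -3
negative_bias = fires(weights, -5.0, [1.0, 1.0])      # raw output is -2
positive_case = fires(weights, 0.0, [1.0, 1.0])       # raw output is 3
print(negative_inputs, negative_bias, positive_case)
```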

30
Q

Which of the following is correct about perceptrons?

(A) A perceptron is guaranteed to learn a decision boundary that perfectly separates the data within a finite number of training steps.
(B) A perceptron can only converge when the data points are linearly separable.
(C) The training error always decreases after each run on the entire training set.
(D) A perceptron can find multiple decision boundaries.

A

(D) A perceptron can find multiple decision boundaries.

[[A - only correct for linearly separable data.]]
[[B - can converge with a decaying learning rate.]]
[[C - Typically, the variation across runs is very large.]]

31
Q

A perceptron converges to a minimum-error solution when:

(A) A fixed learning rate is used
(B) The learning rate increases as training time increases
(C) The learning rate decreases as training time increases
(D) The learning rate decreases only when validation error decreases

A

(C) The learning rate decreases as training time increases

32
Q

Relative to perceptrons, neural networks gain their power from

(A) smooth activation function
(B) stacking of layers
(C) convolution
(D) backpropagation

A

(B) stacking of layers

33
Q

In neural networks, nonlinear activation functions: Please check all that apply.

[ ] Make it possible to do the gradient calculation in backpropagation, as opposed to using a step function, which isn’t differentiable.
[ ] Help to learn nonlinear decision boundaries.
[ ] Are applied only to the output units.
[ ] Always output values between 0 and 1.

A

[X] Make it possible to do the gradient calculation in backpropagation, as opposed to using a step function, which isn’t differentiable.
[X] Help to learn nonlinear decision boundaries.

34
Q

Which of the following is correct about RNN?

(A) RNN network uses an earlier layer’s outputs as inputs.
(B) The hidden layer of RNN includes a recurrent connection as part of its outputs.
(C) Forward inference in RNN means mapping a sequence of outputs to a sequence of inputs.
(D) RNN is an effective approach for visual feature extraction.

A

(A) RNN network uses an earlier layer’s outputs as inputs.

35
Q

Which of the following are NOT taken into consideration by the backpropagation algorithm?

(A) The rate of change of the cost with respect to any weight.
(B) The rate of change of the cost with respect to any bias.
(C) The error of one layer in terms of the error in the next layer.
(D) The rate of change of one weight with respect to the other weights of neurons in the same layer.

A

(D) The rate of change of one weight with respect to the other weights of neurons in the same layer.

36
Q

Which of the following is NOT true about activation functions?

(A) Sigmoid function maps the output into the range [0, 1] and it is differentiable.
(B) The tanh function is a variant of the sigmoid function and maps outlier values toward the mean.
(C) The result of ReLU for negative numbers is the same as a linear function.

A

(C) The result of ReLU for negative numbers is the same as a linear function.
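
A short check of the ReLU behavior this answer relies on: negative inputs are clipped to zero rather than passed through unchanged, and only positive inputs follow the identity. The sample inputs are arbitrary.

```python
def relu(x):
    """Rectified linear unit: 0 for negative inputs, x otherwise."""
    return max(0.0, x)

outputs = [relu(x) for x in (-2.0, -0.5, 0.0, 1.5)]
print(outputs)
```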