Practice Exam Flashcards
T/F
Q-learning can learn the optimal Q-function $$Q^*$$ without ever executing the optimal policy.
True
Yes, this is a property called off-policy learning.
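For intuition, here is a minimal tabular Q-learning sketch (the variable names and hyperparameters are illustrative, not from the flashcard): the TD target uses the max over next actions, so the learned values do not depend on the exploratory policy that actually generated the experience.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch. The update target takes the max over next
# actions, so Q is learned about the greedy policy even while an
# exploratory (epsilon-greedy) policy is the one being executed -- off-policy.
Q = defaultdict(float)            # Q[(state, action)] -> value estimate
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def q_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def behavior_action(s, actions):
    # The policy that gathers experience explores randomly some of the time;
    # it never has to be the optimal policy for Q to converge toward Q*.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```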
Which of the following would be the best reward function for a robot that is trying to learn to escape a maze quickly (assume a discount of $$\gamma = 1$$):
(A) Reward of +1 for escaping the maze and a reward of zero at all other times.
(B) Reward of +1 for escaping the maze and a reward of -1 at all other times.
(C) Reward of +1000 for escaping the maze and a reward of 1 at all other times.
(B) Reward of +1 for escaping the maze and a reward of -1 at all other times.
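A quick check of why (B) rewards speed, assuming the robot escapes after $$T$$ steps and collects the per-step reward on the $$T-1$$ steps before escaping (the variable $$T$$ is introduced here for illustration): under (A) every escape path earns the same return, while under (B) each extra step costs one unit of reward, so the shortest escape is optimal.

$$G_{(A)} = 1 \quad \text{for any } T, \qquad G_{(B)} = -(T-1) + 1 = 2 - T$$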
What does regret let us quantify?
(A) Whether our policy is optimal or not.
(B) The relative goodness of exploration procedures.
(C) The negative utility of a state like a fire pit.
(D) How accurately we estimated the probabilities of the transition function.
(B) The relative goodness of exploration procedures.
Which of the following is NOT true for both MDPs and Reinforcement Learning?
(A) A discounted future reward is used.
(B) An instantaneous reward is used.
(C) After selecting an action at a state, the resulting state is probabilistically determined.
(D) The values for the transition function are known in advance.
(D) The values for the transition function are known in advance.
T/F
The utility function estimate must be completely accurate in order to get an optimal policy.
False
What is a contraction?
(A) The time savings from estimating the optimal policy via policy iteration instead of value iteration.
(B) Part of the proof of convergence for the value iteration algorithm.
(C) A shorter path to a node in the A* algorithm when that node is already present on the priority queue.
(D) The part of the state space that is not observable in partially observable MDPs.
(B) Part of the proof of convergence for the value iteration algorithm.
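For reference, the standard property being named: an operator $$B$$ is a contraction (here in the max norm) if it brings any two value functions closer together by a factor $$\gamma < 1$$; the Bellman update has this property, which is what the value iteration convergence proof relies on.

$$\lVert B U - B U' \rVert_{\infty} \le \gamma \, \lVert U - U' \rVert_{\infty}$$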
In the MDP framework we model the interaction between an agent and an environment. Which of the following statements are true of that framework?
2.3.1
The agent selects actions, which deterministically move it to a new state in the environment.
False
In the MDP framework we model the interaction between an agent and an environment. Which of the following statements are true of that framework?
2.3.2
The agent receives a reward only once it arrives in its goal state.
False
You roll two regular six-sided dice. What is the probability of getting a total sum of 10 or more, given that the first die shows a 6? Write as a decimal.
0.5
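Worked out: with the first die fixed at 6, the sum reaches 10 or more exactly when the second die shows 4, 5, or 6.

$$P(\text{sum} \ge 10 \mid \text{first} = 6) = P(\text{second} \ge 4) = \tfrac{3}{6} = 0.5$$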
How many ways are there to apply the chain rule to a joint distribution with $$N$$ random variables?
(A) $$N$$
(B) $$N^2$$
(C) $$2^N$$
(D) $$N!$$
(D) $$N!$$
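Each ordering of the variables gives a different (but equivalent) factorization, and there are $$N!$$ orderings. For example, with $$N = 3$$:

$$P(A,B,C) = P(A)\,P(B \mid A)\,P(C \mid A,B) = P(C)\,P(B \mid C)\,P(A \mid B,C) = \dots \quad (3! = 6 \text{ orderings})$$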
T/F
The Markov property says that given the past state, the present and the future are independent.
False
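The correct (first-order) statement conditions on the present state, not the past:

$$P(X_{t+1} \mid X_1, \dots, X_t) = P(X_{t+1} \mid X_t)$$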
If a process is stationary, it means that:
(A) the state itself does not change
(B) the conditional probability table does not change over time
(C) the transition table is deterministic
(D) the agent has reached a terminal state
(B) the conditional probability table does not change over time
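In symbols, stationarity in this sense means the same conditional distribution governs every time step:

$$P(X_{t+1} = x' \mid X_t = x) \text{ does not depend on } t$$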
Which of the following is unnecessary to construct a dynamic Bayesian network (DBN)?
(A) The sensor model.
(B) The transition model.
(C) The prior distribution over the state variables.
(D) Multiple state and evidence variables.
(D) Multiple state and evidence variables.
What is the effect of the Markov assumption in n-gram language models?
(A) It makes it possible to estimate the probabilities from data.
(B) Long distance relationships, like subject verb agreement, are taken into account.
(C) The probability of a word is determined by all previous words in the sentence.
(D) The probability of a word is determined only by a single preceding word.
(A) It makes it possible to estimate the probabilities from data.
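A minimal sketch of why the assumption helps (illustrative code; the toy corpus and function name are made up here): conditioning only on the previous word means bigram probabilities can be estimated directly from counts, whereas conditioning on the entire history would leave almost every context with too few observations to estimate.

```python
from collections import Counter

def bigram_probs(tokens):
    """Maximum-likelihood bigram estimates P(w | prev) from raw counts."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])
    return {(prev, w): count / prev_counts[prev]
            for (prev, w), count in pair_counts.items()}

tokens = "the cat sat on the mat".split()
probs = bigram_probs(tokens)
print(probs[("the", "cat")])  # 0.5 -- "the" is followed once by "cat" and once by "mat"
```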
How are n-gram language models typically evaluated?
(A) Correlation with human judgments
(B) Cross-entropy measured against gold standard labels
(C) Perplexity on a test set
(D) Precision and recall
(C) Perplexity on a test set
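For reference, perplexity of a model on a test set of $$N$$ tokens is the inverse probability the model assigns to the test set, normalized by length (lower is better):

$$PP(w_1, \dots, w_N) = P(w_1, \dots, w_N)^{-1/N}$$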