Reinforcement Learning Flashcards by Phil Pieper

Q

What are agents?

A

Q

When is an agent autonomous?

A

Q

What are rational agents?

A

Q

What are reflexive agents?

A

Q

What are agents with internal state?

A

Q

What are goal-based agents?

A

Q

What are agents with a utility function?

A

Q

Describe the Markov decision process.

A

Q

What does the Markov property state?

A

The next state depends only on the current state and action, not on the history before them.
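
Written out as a formula, this could look as follows. The symbols s_t (state at time t) and a_t (action at time t) are notation assumed here, not taken from the deck:

```latex
P(s_{t+1} \mid s_t, a_t) \;=\; P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0)
```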

Q

What is epsilon-greedy?

A

Q

How does the choice of epsilon and beta affect epsilon-greedy?

A

Q

What is a Q-table?

A

A table indexed by one-hot state encodings × one-hot action encodings, holding one Q-value per (state, action) pair.
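
A minimal sketch of that indexing, assuming a toy problem with 3 states and 2 actions. The sizes, the Q-values, and the helper names `one_hot` and `lookup` are illustrative assumptions, not part of the deck:

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def lookup(q_table, state_vec, action_vec):
    """Select one Q-value: state_vec (1 x S) * table (S x A) * action_vec (A x 1)."""
    return sum(
        s * a * q_table[i][j]
        for i, s in enumerate(state_vec)
        for j, a in enumerate(action_vec)
    )


# Hypothetical table: one row per state, one column per action.
q_table = [
    [0.0, 1.0],
    [0.5, 0.2],
    [0.3, 0.9],
]

state, action = 2, 1
q_value = lookup(q_table, one_hot(state, 3), one_hot(action, 2))

# The one-hot product is equivalent to plain indexing.
same_value = q_table[state][action]

# Greedy action for this state: the column with the largest Q-value.
greedy_action = max(range(2), key=lambda a: q_table[state][a])
```

The one-hot view is mainly useful because it matches what a network input and output layer would look like; for a pure table, direct indexing does the same job.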

Q

How does tabular RL work?

A

Q

How can tabular RL with Q-tables be realized as deep RL?

A

Q

Deep RL

What is the implicit model of action selection for?

A

Q

How does temporal difference (TD) learning work?

A

Q

State the Bellman equation.

A

Q

TD learning adapts the Q-value for the current (s, a) pair. How?

A

Q

Describe the SARSA algorithm.

A

Q

Compare SARSA and Q-learning.

A

Q

Explain actor-critic learning.

A

Q

Explain deep RL for SARSA or Q-learning.
How is the error backpropagated?

A

Q

What is value-based RL?

A

Q

What is policy-gradient-based RL?

A

It adjusts the policy directly to maximize the expected future return R.
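
One common way to write this objective and its gradient is sketched below. The notation is assumed rather than taken from the deck: θ are the policy parameters, π_θ is the policy, τ is a sampled trajectory, and R(τ) is its return. The second line is the standard REINFORCE-style estimator.

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big]
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]
```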

How does one ensure that all weights can be trained in policy-gradient-based RL?

What is goal-conditioned RL?

What is experience replay?

Performing many trials in the environment can be costly. Solution: store the experience and learn from it multiple times.
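
A minimal sketch of a replay buffer, assuming transitions are stored as (state, action, reward, next_state, done) tuples. The class name `ReplayBuffer`, the tuple layout, and the fixed capacity are assumptions made for illustration:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions that can be sampled repeatedly."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest entry when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement within a batch; the same transition
        # can still appear again in later batches.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

batch = buffer.sample(batch_size=2)
```

Each environment step is paid for once, but a stored transition can contribute to many updates for as long as it stays in the buffer.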

What is Hindsight Experience Replay (HER)?

The agent learns how to reach any state it has experienced.
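
A minimal sketch of the relabeling step, assuming a sparse reward (0 when the goal is reached, -1 otherwise) and using the final state of the episode as the substitute goal. The tuple layout, the reward function, and the function names are assumptions for illustration:

```python
def sparse_reward(achieved_state, goal):
    """Assumed reward scheme: 0.0 on reaching the goal, -1.0 otherwise."""
    return 0.0 if achieved_state == goal else -1.0


def relabel_with_final_state(episode):
    """Relabel an episode as if its last reached state had been the goal.

    episode: list of (state, action, next_state, original_goal) tuples.
    Returns (state, action, reward, next_state, goal) tuples for replay.
    """
    hindsight_goal = episode[-1][2]  # state actually reached at the end
    relabeled = []
    for state, action, next_state, _original_goal in episode:
        # The reward is recomputed after the fact for the new goal.
        reward = sparse_reward(next_state, hindsight_goal)
        relabeled.append((state, action, reward, next_state, hindsight_goal))
    return relabeled


# The agent aimed for state 5 but only got as far as state 3.
episode = [(0, "right", 1, 5), (1, "right", 2, 5), (2, "right", 3, 5)]
relabeled = relabel_with_final_state(episode)
```

Under the original goal every transition in this episode would be a failure; after relabeling, the last transition becomes a successful example of reaching state 3.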

What is hierarchical RL?

The higher level tells the lower level which goal to pursue.

How does model-based RL work? How is an action chosen?

like a tree search
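
A minimal sketch of such a search, assuming a known deterministic model and an exhaustive lookahead to a fixed depth. The toy chain environment `chain_model`, the discount `gamma`, and the function names are hypothetical and only serve as an example:

```python
def best_action(state, model, actions, depth, gamma=0.9):
    """Return (value, action) found by exhaustive lookahead to `depth`."""
    if depth == 0:
        return 0.0, None
    best_value, best = float("-inf"), None
    for action in actions:
        # Query the known model instead of acting in the real environment.
        next_state, reward = model(state, action)
        future_value, _ = best_action(next_state, model, actions, depth - 1, gamma)
        value = reward + gamma * future_value
        if value > best_value:
            best_value, best = value, action
    return best_value, best


def chain_model(state, action):
    """Hypothetical dynamics: states 0..3 on a line, reward on entering state 3."""
    next_state = max(0, min(3, state + (1 if action == "right" else -1)))
    reward = 1.0 if next_state == 3 and state != 3 else 0.0
    return next_state, reward


value, action = best_action(1, chain_model, ["left", "right"], depth=3)
```

The number of evaluated nodes grows with the number of actions raised to the search depth, and the fixed depth limit is what keeps the recursion from running forever when states repeat.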

What are the limitations of model-based RL?

exponential number of states/actions; cycles

Explain: advantages of model-based vs. model-free RL.

Describe the architecture of MuZero.

Describe MuZero: planning by Monte Carlo tree search.

How is MuZero trained?

via Experience Replay

How can the MuZero architecture be extended? What does that achieve?

How can MuZero be extended to continuous actions?

Which of the following statements on Hierarchical Reinforcement Learning are correct?
1. If the subtasks don't yield rewards individually, Hierarchical RL can't be used.
2. Hierarchical Agents always use the same set of actions on each level of the hierarchy.
3. In Hierarchical RL high-level agents set the goals for lower-level agents.

3

Assume we have the following information on 2-itemsets generated with the Apriori algorithm:
Frequent: {A,B}, {A,C}
Not frequent: {A,D}
Which 3-itemsets could potentially be tested in the following iteration of the Apriori algorithm, independent of any additional information on other itemsets?
1. {A,B,D}
2. {A,C,D}
3. {B,C,D}
4. {A,B,C}

4
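
The reasoning can be sketched with the usual join-and-prune candidate generation, using only the itemsets named in the question. The function names `join_step` and `prune_step` are illustrative, and whether {B,C} is frequent remains unknown here:

```python
from itertools import combinations


def join_step(frequent_itemsets):
    """Join sorted k-itemsets that agree on all but their last item."""
    ordered = sorted(tuple(sorted(s)) for s in frequent_itemsets)
    candidates = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if first[:-1] == second[:-1]:
                candidates.append(first + (second[-1],))
    return candidates


def prune_step(candidates, known_infrequent):
    """Drop candidates that contain a subset already known to be infrequent."""
    kept = []
    for cand in candidates:
        subsets = combinations(cand, len(cand) - 1)
        if not any(frozenset(s) in known_infrequent for s in subsets):
            kept.append(cand)
    return kept


frequent = [{"A", "B"}, {"A", "C"}]
known_infrequent = {frozenset({"A", "D"})}

candidates = prune_step(join_step(frequent), known_infrequent)
```

Only {A,B,C} can be built from the known frequent 2-itemsets. {A,B,D} and {A,C,D} both contain the infrequent {A,D}, and {B,C,D} cannot be generated without further information about {B,C}, {B,D}, and {C,D}.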

0.2 / 0.4 = 0.5

Which of the following statements on Policy Gradient methods and the REINFORCE algorithm are correct?
1. The REINFORCE algorithm learns how to estimate the best Q-Values.
2. Policy Gradient-based RL learns to estimate the probabilities of the actions for a given state directly.
3. The REINFORCE algorithm uses the TD-error to update the network weights.

2

Which of the statements about the k-Nearest Neighbors (k-NN) algorithm are correct?
1. k-NN can also be used for regression.
2. k-NN can be used for imputing missing values of both categorical and continuous variables.
3. k-NN performs much better if all of the data have the same scale.
4. k-NN is only defined for Euclidean distance metric.

1, 2, and 3
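
A minimal sketch of k-NN used for regression with an interchangeable distance function, which illustrates why statement 1 holds and statement 4 does not. The data points and function names are made up for illustration:

```python
def knn_regress(train_x, train_y, query, k, distance):
    """Predict the mean target of the k training points closest to `query`."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: distance(pair[0], query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / k


def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


def euclidean(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


# Hypothetical training data: 2-D features with numeric targets.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0]

prediction = knn_regress(train_x, train_y, (0.2, 0.1), k=3, distance=manhattan)
```

Because both distances sum over raw feature differences, a feature with a much larger numeric range would dominate the neighbor ranking, which is the reason behind statement 3.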

The goal of model-based RL strategies is to reduce the complexity of searching for the best solution in an environment in which the dynamics are known. Is that correct?

Yes

Which of the following statements on the MuZero algorithm are correct?
1. In MuZero, a model of the world has to be provided.
2. MuZero uses planning in the latent space to find the best action.
3. For each episode, only one search tree has to be created.
4. The dynamics function maps a hidden state and an action to another hidden state.

2 and 4

One of the major disadvantages of the Apriori algorithm is high computational complexity. Which of the following factors are predominantly responsible for this?
1. A high number of items in each itemset
2. A low minimum confidence
3. A high minimum support
4. A high number of itemsets in the database

1 and 4

2/5 = 0.4

In Hindsight Experience Replay the reward scheme is already known while the experiences are gathered. Is that correct?

No