Reinforcement Learning Flashcards

1
Q

What is a one line summary of Reinforcement Learning?

A

Reinforcement Learning trains an agent to take actions in an environment; the environment responds by sending the agent a reward and a new state.

2
Q

What are some of the main aspects of Reinforcement Learning?

A

It employs trial and error to find a solution
It makes a sequence of decisions
It does not require labelled input/output pairs, only rules for reward and penalty
It aims to take actions that maximise the cumulative reward

3
Q

What are some of the main applications for Reinforcement Learning?

A

Game Playing
Robotics
Logistics
Autonomous Driving

4
Q

What are the key elements of Reinforcement Learning that describe the overall process?

A

The Environment - Physical world in which the agent operates

Agent - Learns to act in a way that maximises the cumulative reward

State - Current situation of the Agent

Reward - Feedback the Agent gets from the environment

Policy - The method to map the Agent’s state to actions

Value - Future reward that an Agent would receive by taking an action in a particular state

5
Q

What are the elements of the Markov Decision Process?

A

Construct a set of environment states (S)
Define a set of possible actions (A)
Define a real-valued reward function (R)
Build a transition model P(s' | s, a)
Use a hyperparameter γ (gamma) as a Discount Factor
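As a concrete illustration, these elements can be written down for a tiny two-state problem; all state and action names below are made up:

```python
# Toy MDP sketch: two states, two actions (names are illustrative).
S = ["s0", "s1"]                       # set of environment states
A = ["stay", "move"]                   # set of possible actions
R = {("s0", "move"): 1.0,              # real-valued reward function R(s, a)
     ("s0", "stay"): 0.0,
     ("s1", "move"): 0.0,
     ("s1", "stay"): 0.5}
# Transition model P(s' | s, a): probability of landing in state s'.
P = {("s0", "move"): {"s1": 0.9, "s0": 0.1},
     ("s0", "stay"): {"s0": 1.0},
     ("s1", "move"): {"s0": 0.9, "s1": 0.1},
     ("s1", "stay"): {"s1": 1.0}}
gamma = 0.9                            # discount factor

# Sanity check: each transition distribution must sum to 1.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in P.values())
```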

6
Q

What is the primary goal of the Markov Decision Process?

A

Find a good policy for the Agent to act on in its current state, one which maximises the cumulative reward

7
Q

What is the step-by-step process of the Markov Decision Process?

A

At the beginning, the environment samples the initial state of the agent
Then, until the episode terminates:
- The agent selects an action
- The environment samples the reward
- The environment samples the next state
- The agent receives the reward and the next state
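The loop above can be sketched as follows, assuming a toy environment whose reward and transition rules are made up for illustration:

```python
import random

random.seed(0)

def sample_initial_state():
    # The environment samples the agent's initial state (toy example).
    return 0

def step(state, action):
    # The environment samples the reward and the next state.
    reward = 1.0 if action == 1 else 0.0
    next_state = (state + action) % 3
    return reward, next_state

state = sample_initial_state()
total_reward = 0.0
for _ in range(10):                      # "until terminated": fixed horizon here
    action = random.choice([0, 1])       # the agent selects an action
    reward, state = step(state, action)  # agent receives reward and next state
    total_reward += reward
```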

8
Q

What is a Policy in the context of the Markov Decision Process?

A

A policy is a function from S to A that specifies what action to take in each state.
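For a small, finite MDP, such a policy can be as simple as a lookup table; the state and action names here are illustrative:

```python
# A deterministic policy: a mapping from states S to actions A.
policy = {"s0": "move", "s1": "stay"}

def act(state):
    # The policy specifies what action to take in each state.
    return policy[state]
```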

9
Q

What is Q-Learning designed for?

A

Q-Learning is a method for finding the best next action given the current state, with the aim of maximising the cumulative reward

10
Q

What is the Bellman Equation?

A

In words: new Q-value = old Q-value + learning rate × (reward + discount rate × maximum expected future reward − old Q-value). In symbols:

Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]
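A minimal sketch of this update as code, assuming the Q-table is stored as a dictionary keyed by (state, action); alpha is the learning rate, gamma the discount rate, and the state/action names are made up:

```python
def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Bellman (Q-learning) update for one (state, action) pair."""
    # Maximum expected future reward from the next state.
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    old = q.get((s, a), 0.0)
    # old value + learning rate * (reward + discounted future - old value)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)

q = {}  # all Q-values implicitly start at 0
q_update(q, "s0", "move", 1.0, "s1", ["stay", "move"])
# With everything at 0, the new value is 0 + 0.1 * (1 + 0 - 0) = 0.1
```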

11
Q

What does a Q-value define?

A

A Q-value is a representation of the quality of a State/Action pair

12
Q

What is the learning process for Q-Learning?

A

Initialise all Q-values in a Q-table to 0
Choose the action with the best Q-value for the current state
Perform the action, which results in a new state
Measure the reward for taking that action from that state
Update the respective Q-value using the Bellman Equation
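Putting these steps together, here is a sketch of tabular Q-Learning on a made-up 4-state chain where moving right eventually earns a reward; the environment and hyperparameters are illustrative:

```python
import random

random.seed(1)

n_states, actions = 4, [0, 1]          # chain: move left (0) or right (1)
alpha, gamma, epsilon = 0.5, 0.9, 0.3

# Step 1: initialise all Q-values in the Q-table to 0.
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

def step(s, a):
    # Reaching the right end of the chain yields reward 1 and restarts.
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return r, 0 if r else s2

s = 0
for _ in range(5000):
    # Step 2: choose the action with the best Q-value
    # (epsilon-greedy, so exploration still happens).
    a = random.choice(actions) if random.random() < epsilon \
        else max(actions, key=lambda a2: Q[(s, a2)])
    r, s2 = step(s, a)                 # steps 3-4: act, measure the reward
    # Step 5: update the Q-value with the Bellman Equation.
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, a2)] for a2 in actions)
                          - Q[(s, a)])
    s = s2
```

After training, the greedy policy should prefer moving right from the start state.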

13
Q

What is the drawback of Q-Learning?

A

It is computationally expensive in both the training and inference stages, since the Q-table grows with the number of states and actions

14
Q

Why do we use the Epsilon-Greedy Exploration Strategy in Q-Learning?

A

The aim is to introduce some randomness into action selection, encouraging the Agent to explore other courses of action rather than constantly picking the best short-term option.
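A sketch of the strategy, assuming the Q-values for the current state are stored in a dictionary (names are illustrative):

```python
import random

def epsilon_greedy(q_row, epsilon=0.1):
    """With probability epsilon explore a random action,
    otherwise exploit the action with the highest Q-value."""
    actions = list(q_row)
    if random.random() < epsilon:
        return random.choice(actions)    # explore
    return max(actions, key=q_row.get)   # exploit

random.seed(0)
q_row = {"left": 0.2, "right": 0.7}
picks = [epsilon_greedy(q_row, epsilon=0.1) for _ in range(1000)]
# "right" dominates, but "left" is still sampled occasionally.
```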

15
Q

What is different between Q-Learning and Deep Q-Learning?

A

Q-Learning stores all state/action values in a table, called a Q-table. Deep Q-Learning replaces the Q-table with a Neural Network that maps states to the Q-values of specific actions.

16
Q

What is a Policy Gradient?

A

Policy Gradients use a policy network to estimate actions directly, rather than estimating quality (Q) values first

17
Q

How do you train a Policy Network?

A

Initialise the network with random weights
Forward-pass the network to generate a possible action
With this action, keep playing 'the game' until the end; if the outcome is positive, backpropagate a positive gradient, otherwise backpropagate a negative gradient
Play N episodes, and update the weights based on the accumulated positive and negative gradients.
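A minimal sketch of this idea, shrunk to a one-parameter "policy network" (a single logistic unit) on a made-up one-step game, so the gradient can be written by hand instead of backpropagated:

```python
import math
import random

random.seed(0)

theta = 0.0                              # initial (zero) network weight
lr = 0.1                                 # learning rate

def p_action1(theta):
    # Forward pass: probability of choosing action 1.
    return 1.0 / (1.0 + math.exp(-theta))

for episode in range(500):               # play N episodes
    p = p_action1(theta)
    action = 1 if random.random() < p else 0   # sample a possible action
    reward = 1.0 if action == 1 else -1.0      # positive/negative outcome
    # Gradient of the log-likelihood of the chosen action, scaled by
    # the reward: positive outcomes push the policy toward the action
    # taken, negative outcomes push it away.
    grad = (action - p) * reward
    theta += lr * grad
```

After training, the policy should assign high probability to the winning action.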

18
Q

What are some challenges in Reinforcement Learning?

A

Requires large amounts of data to train, since the Agent is not controlled directly but only shaped through rewards.

Reaching a local optimum - the Agent performs worse than expected but thinks it is doing well.

Reward hacking - the Agent finds a shortcut to the rewards without completing the designed task.

Model-free methods are the way forward, and they are more flexible in dealing with unknown environments, but they are slow to train.