%md
ScaDaMaLe Course [site](https://lamastex.github.io/scalable-data-science/sds/3/x/) and [book](https://lamastex.github.io/ScaDaMaLe/index.html)
This is a 2019-2021 augmentation and update of [Adam Breindel](https://www.linkedin.com/in/adbreind)'s initial notebooks.
_Thanks to [Christian von Koch](https://www.linkedin.com/in/christianvonkoch/) and [William Anzén](https://www.linkedin.com/in/william-anz%C3%A9n-b52003199/) for their contributions towards making these materials Spark 3.0.1 and Python 3+ compliant._
%md
# Playing Games and Driving Cars: Reinforcement Learning
<img src="https://i.imgur.com/VKEJsy6.jpg" width=900>
## In a Nutshell
__In reinforcement learning, an agent takes multiple actions, and the positive or negative outcome of those actions serves as a loss function for subsequent training.__
## Training an Agent
What is an agent?
How is training an agent different from training the models we've used so far?
Most things stay the same, and we can use all of the knowledge we've built:
* We can use any or all of the network models, including feed-forward, convolutional, recurrent, and combinations of those.
* We will still train in batches using some variant of gradient descent
* As before, the model will ideally learn a complex non-obvious function of many parameters
A few things change ... well, not really change, but "specialize":
* The inputs may start out as entire frames (or frame deltas) of a video feed
* We may feature engineer more explicitly (or not)
* The outputs may be a low-cardinality set of categories that represent actions (e.g., direction of a digital joystick, or input to a small number of control systems)
* We may model state explicitly (outside the network) as well as implicitly (inside the network)
* The function we're learning is one that will "tell our agent what to do" -- or, assuming there is no disconnect between knowing what to do and doing it, the function essentially *is* the agent
* __The loss function depends on the outcome of the game, and the game requires many actions to reach an outcome, and so requires some slightly different approaches from the ones we've used before.__
%md
# Principal Approaches: Deep Q-Learning and Policy Gradient Learning
* Policy Gradient is straightforward and shows a lot of research promise, but can be quite difficult to use. The challenge is less in the math, code, or concepts, and more in terms of effective training. We'll look very briefly at PG.
* Deep Q-Learning is more constrained and a little more complex mathematically. These factors would seem to cut against DQL, but they allow for relatively fast and effective training, so it is very widely used. We'll go deeper into DQL and work with an example.
There are, of course, many variants on these as well as some other strategies.
%md
## Policy Gradient Learning
With Policy Gradient Learning, we directly try to learn a "policy" function that selects a (possibly continuous-valued) move for an agent to make given the current state of the "world."
We want to maximize total discounted future reward, but we do not need discrete actions to take or a model that tells us a specific "next reward."
Instead, we can make fine-grained moves and we can collect all the moves that lead to a reward, and then apply that reward to all of them.
---
> __ASIDE:__ The term *gradient* here comes from a formula which indicates the gradient (or steepest direction) to improve the policy parameters with respect to the loss function. That is, which direction to adjust the parameters to maximize improvement in expected total reward.
---
In some sense, this is a more straightforward, direct approach than the other approach we'll work with, Deep Q-Learning.
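To make the "apply that reward to all of the moves" idea concrete, here is a tiny sketch -- not part of this notebook's code -- that computes the discounted return for each step of one episode. In a policy gradient method, these per-step returns weight the gradient of the log-probability of each action taken; the `gamma` value below is just an illustrative choice.

```
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Propagate an episode's rewards backwards so every earlier move
    shares in the credit, discounted by how far in the future the reward was."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A 5-move episode where only the final move is rewarded (a win = +1):
# every earlier move receives a discounted share of that final reward.
print(discounted_returns([0, 0, 0, 0, 1], gamma=0.9))
# -> approximately [0.66, 0.73, 0.81, 0.9, 1.0]
```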
### Challenges with Policy Gradients
Policy gradients, despite achieving remarkable results, are a form of brute-force solution.
Thus they require a large amount of input data and extraordinary amounts of training time.
Some of these challenges come down to the *credit assignment problem* -- properly attributing reward to actions in the past which may (or may not) be responsible for the reward. Mitigations include more complex reward functions, feeding more frequent reward signals into the training, adding domain knowledge to the policy, or adding an entire separate network -- a "critic network" -- that learns to provide feedback to the actor network.
Another challenge is the size of the search space, and tractable approaches to exploring it.
PG is challenging to use in practice, though there are a number of "tricks" in various publications that you can try.
### Next Steps
* Great post by Andrej Karpathy on policy gradient learning: http://karpathy.github.io/2016/05/31/rl/
* A nice first step on policy gradients with real code: *Using Keras and Deep Deterministic Policy Gradient to play TORCS*: https://yanpanlau.github.io/2016/10/11/Torcs-Keras.html
If you'd like to explore a variety of reinforcement learning techniques, Matthias Plappert, at the Karlsruhe Institute of Technology, has created an add-on framework for Keras that implements a variety of state-of-the-art RL techniques, including those discussed today.
His framework, keras-rl, is at https://github.com/matthiasplappert/keras-rl and includes examples that integrate with OpenAI Gym.
%md
## Deep Q-Learning
Deep Q-Learning is deep learning applied to "Q-Learning."
So what is Q-Learning?
Q-Learning is a model that posits a map for optimal actions for each possible state in a game.
Specifically, given a state and an action, there is a "Q-function" that provides the value or quality (the Q stands for quality) of that action taken in that state.
So, if an agent is in state s, choosing an action could be as simple as looking at Q(s, a) for all a, and choosing the highest "quality value" -- aka <img src="https://i.imgur.com/RerRQzk.png" width=110>
There are some other considerations, such as explore-exploit tradeoff, but let's focus on this Q function.
In small state spaces, this function can be represented as a table, a bit like basic strategy blackjack tables.
<img src="http://i.imgur.com/rfxLSls.png" width=400>
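As a toy illustration (not the notebook's code, and with invented numbers), a tabular Q function is just a lookup plus an argmax:

```
import numpy as np

# Toy Q-table for a catch-style game: one row of action-values per state,
# with three actions (0 = left, 1 = stay, 2 = right). The numbers are made up.
Q = {
    "ball_left_of_basket":  np.array([ 0.7, 0.1, -0.4]),
    "ball_over_basket":     np.array([-0.2, 0.9, -0.2]),
    "ball_right_of_basket": np.array([-0.4, 0.1,  0.7]),
}

def choose_action(state):
    # argmax over a: pick the action with the highest quality in this state
    return int(np.argmax(Q[state]))

print(choose_action("ball_over_basket"))  # -> 1 (stay)
```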
Even a simple Atari-style video game may have hundreds of thousands of states, though. This is where the neural network comes in.
__What we need is a way to learn a Q function, when we don't know what the error of a particular move is, since the error (loss) may be dependent on many future actions and can also be non-deterministic (e.g., if there are randomly generated enemies or conditions in the game).__
The tricks -- or insights -- here are:
[1] Model the total future reward -- what we're really after -- as a recursive calculation from the immediate reward (*r*) and future outcomes:
<img src="https://i.imgur.com/ePXoQfR.png" width=250>
* \\({\gamma}\\) is a "discount factor" on future reward
* Assume the game terminates or "effectively terminates" to make the recursion tractable
* This equation is a simplified case of the Bellman Equation
[2] Assume that if you iteratively run this process starting with an arbitrary Q model, and you train the Q model with actual outcomes, your Q model will eventually converge toward the "true" Q function
* This seems intuitively to resemble various Monte Carlo sampling methods (if that helps at all)
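In the tabular case, one iteration of that process is a single line of arithmetic: nudge the current estimate of Q(s, a) toward the Bellman target. A minimal sketch, reusing a table like the toy one above (again, not the notebook's code; `gamma` and `lr` are illustrative):

```
import numpy as np

def q_update(Q, s, a, r, s_next, done, gamma=0.9, lr=0.1):
    # Bellman target: immediate reward plus discounted best future value
    target = r if done else r + gamma * np.max(Q[s_next])
    # move the current estimate a small step toward that target
    Q[s][a] += lr * (target - Q[s][a])
    return Q
```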
#### As improbable as this might seem at first for teaching an agent a complex game or task, it actually works, and in a straightforward way.
How do we apply this to our neural network code?
Unlike before, when we called "fit" to train a network automatically, here we'll need some interplay between the agent's behavior in the game and the training. That is, we need the agent to play some moves in order to get actual numbers to train with. And as soon as we have some actual numbers, we want to do a little training with them right away so that our Q function improves. So we'll alternate one or more in-game actions with a manual call to train on a single batch of data.
The algorithm looks like this (credit for the nice summary to Tambet Matiisen; read his longer explanation at https://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/ for review):
1. Do a feedforward pass for the current state s to get predicted Q-values for all actions.
2. Do a feedforward pass for the next state s′ and calculate maximum over all network outputs <img src="https://i.imgur.com/3QiH4Z1.png" width=110>
3. Set Q-value target for action a to <img src="https://i.imgur.com/2RNmkl6.png" width=140> (use the max calculated in step 2). For all other actions, set the Q-value target to the same as originally returned from step 1, making the error 0 for those outputs.
4. Update the weights using backpropagation.
If there is "reward" throughout the game, we can model the loss as
<img src="https://i.imgur.com/OojwPbJ.png" width=350>
If the game is win/lose only ... most of the r's go away and the entire loss is based on a 0/1 or -1/1 score at the end of a game.
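Putting steps 1-4 together for a single (state, action, reward, next-state) transition, a hedged sketch of the target construction looks like the code below. In this notebook the equivalent logic lives in the experience replay object's `get_batch` method, which we'll meet shortly; `gamma` here is illustrative.

```
import numpy as np

def q_targets_for_transition(model, s, a, r, s_next, game_over, gamma=0.9):
    """Build the training target vector for one transition; states are assumed
    to already be single-row batches, as env.observe() returns them."""
    targets = model.predict(s)[0]                 # step 1: predicted Q-values for all actions in s
    if game_over:
        targets[a] = r                            # terminal state: no future reward to add
    else:
        q_next = model.predict(s_next)[0]         # step 2: Q-values for the next state s'
        targets[a] = r + gamma * np.max(q_next)   # step 3: Bellman target for the chosen action
    # all other entries keep the model's own predictions, so their error is 0
    return targets.reshape(1, -1)                 # step 4: train with (s, targets), e.g. train_on_batch
```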
### Practical Consideration 1: Experience Replay
To improve training, we cache all (or as much as possible) of the agent's state/move/reward/next-state data. Then, when we go to perform a training run, we can build a batch out of a subset of all previous moves. This provides diversity in the training data, whose value we discussed earlier.
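A minimal sketch of such a buffer (this notebook uses its own `ExperienceReplay` class with a `remember`/`get_batch` interface; the version below just shows the idea):

```
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, max_memory=500):
        # once full, the oldest experiences are dropped automatically
        self.memory = deque(maxlen=max_memory)

    def remember(self, state, action, reward, next_state, game_over):
        self.memory.append((state, action, reward, next_state, game_over))

    def sample(self, batch_size=50):
        # a random subset of past moves makes each training batch more diverse
        return random.sample(list(self.memory), min(batch_size, len(self.memory)))
```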
### Practical Consideration 2: Explore-Exploit Balance
In order to add more diversity to the agent's actions, we set a threshold ("epsilon") which represents the probability that the agent ignores its experience-based model and just chooses a random action. This prevents the agent from taking an overly narrow, 100% greedy (best-performance-so-far) path.
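In code, the explore-exploit choice is just a coin flip in front of the argmax -- the training loop we'll see below uses exactly this pattern inline (the `epsilon` default here is only illustrative):

```
import numpy as np

def epsilon_greedy_action(model, state, num_actions, epsilon=0.1):
    if np.random.rand() <= epsilon:
        return np.random.randint(0, num_actions)   # explore: ignore the model, act randomly
    q = model.predict(state)                        # exploit: trust the current Q model
    return int(np.argmax(q[0]))
```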
### Let's Look at the Code!
Reinforcement learning code examples are a bit more complex than the other examples we've seen so far, because in the other examples, the data sets (training and test) exist outside the program as assets (e.g., the MNIST digit data).
In reinforcement learning, the training and reward data come from some environment that the agent is supposed to learn. Typically, the environment is simulated by local code, or *represented by local code* even if the real environment is remote or comes from the physical world via sensors.
#### So the code contains not just the neural net and training logic, but part (or all) of a game world itself.
One of the most elegant small, but complete, examples comes courtesy of former Ph.D. researcher (Univ. of Florida) and Apple rockstar Eder Santana. It's a simplified catch-the-falling-brick game (a bit like Atari Kaboom! but even simpler) that nevertheless is complex enough to illustrate DQL and to be impressive in action.
When we're done, we'll basically have a game, and an agent that plays, which run like this:
<img src="http://i.imgur.com/PB6OgNF.gif">
%md
First, let's get familiar with the game environment itself, since we'll need to see how it works, before we can focus on the reinforcement learning part of the program.
import numpy as np


# A minimal catch-the-falling-fruit environment on a grid_size x grid_size board
class Catch(object):
    def __init__(self, grid_size=10):
        self.grid_size = grid_size
        self.reset()

    def _update_state(self, action):
        """
        Input: action and states
        Output: new states and reward
        """
        state = self.state
        if action == 0:  # left
            action = -1
        elif action == 1:  # stay
            action = 0
        else:
            action = 1  # right
        f0, f1, basket = state[0]
        new_basket = min(max(1, basket + action), self.grid_size-1)
        f0 += 1
        out = np.asarray([f0, f1, new_basket])
        out = out[np.newaxis]

        assert len(out.shape) == 2
        self.state = out

    def _draw_state(self):
        im_size = (self.grid_size,)*2
        state = self.state[0]
        canvas = np.zeros(im_size)
        canvas[state[0], state[1]] = 1  # draw fruit
        canvas[-1, state[2]-1:state[2] + 2] = 1  # draw basket
        return canvas

    def _get_reward(self):
        fruit_row, fruit_col, basket = self.state[0]
        if fruit_row == self.grid_size-1:
            if abs(fruit_col - basket) <= 1:
                return 1
            else:
                return -1
        else:
            return 0

    def _is_over(self):
        if self.state[0, 0] == self.grid_size-1:
            return True
        else:
            return False

    def observe(self):
        canvas = self._draw_state()
        return canvas.reshape((1, -1))

    def act(self, action):
        self._update_state(action)
        reward = self._get_reward()
        game_over = self._is_over()
        return self.observe(), reward, game_over

    def reset(self):
        n = np.random.randint(0, self.grid_size-1, size=1)
        m = np.random.randint(1, self.grid_size-2, size=1)
        # index with [0] so the state is a plain (1, 3) integer array
        self.state = np.array([0, n[0], m[0]])[np.newaxis].astype('int64')
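# A quick sanity check of the environment above (not part of the original
# training code): reset the game, repeatedly "stay", and watch the
# observation, reward, and game-over flag it returns. The fruit starts at the
# top row, so after grid_size - 1 steps the game ends with a reward of +1 or -1.
env = Catch(grid_size=10)
env.reset()
for _ in range(9):
    obs, reward, game_over = env.act(1)   # action 1 == "stay"
    print(obs.shape, reward, game_over)   # (1, 100); reward stays 0 until the fruit lands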
%md
Next, let's look at the network itself -- it's super simple, so we can get that out of the way too:
```
model = Sequential()
model.add(Dense(hidden_size, input_shape=(grid_size**2,), activation='relu'))
model.add(Dense(hidden_size, activation='relu'))
model.add(Dense(num_actions))
model.compile(sgd(lr=.2), "mse")
```
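Note that `grid_size`, `num_actions`, and `hidden_size` are defined elsewhere in the notebook. As a rough guide (treat these as assumptions rather than the notebook's exact settings), values along these lines fit this game:

```
grid_size = 10      # the Catch board is grid_size x grid_size, flattened to grid_size**2 inputs
num_actions = 3     # left, stay, right
hidden_size = 100   # width of the two hidden layers
```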
%md
Note that the output layer has `num_actions` neurons.
We are going to implement the training target as
* the estimated reward for the one action taken when the game doesn't conclude, or
* error/reward for the specific action that loses/wins a game
In any case, we only train with an error/reward for actions the agent actually chose. We neutralize the hypothetical rewards for other actions, as they are not causally chained to any ground truth.
Next, let's zoom in on the main game training loop:
```
win_cnt = 0
for e in range(epoch):
    loss = 0.
    env.reset()
    game_over = False
    # get initial input
    input_t = env.observe()

    while not game_over:
        input_tm1 = input_t
        # get next action
        if np.random.rand() <= epsilon:
            action = np.random.randint(0, num_actions, size=1)
        else:
            q = model.predict(input_tm1)
            action = np.argmax(q[0])

        # apply action, get rewards and new state
        input_t, reward, game_over = env.act(action)
        if reward == 1:
            win_cnt += 1

        # store experience
        exp_replay.remember([input_tm1, action, reward, input_t], game_over)

        # adapt model
        inputs, targets = exp_replay.get_batch(model, batch_size=batch_size)
        loss += model.train_on_batch(inputs, targets)
    print("Epoch {:03d}/{:d} | Loss {:.4f} | Win count {}".format(e, epoch - 1, loss, win_cnt))
```
%md
The key bits are:
* Choose an action
* Act and collect the reward and new state
* Cache previous state, action, reward, and new state in "Experience Replay" buffer
* Ask buffer for a batch of action data to train on
* Call `model.train_on_batch` to perform one training batch
Last, let's dive into where the actual Q-Learning calculations occur, which, in this code, happen in the `get_batch` call to the experience replay buffer object: