063_DLbyABr_07-ReinforcementLearning(Python)


ScaDaMaLe Course site and book

This is a 2019-2021 augmentation and update of Adam Breindel's initial notebooks.

Thanks to Christian von Koch and William Anzén for their contributions towards making these materials Spark 3.0.1 and Python 3+ compliant.

Playing Games and Driving Cars: Reinforcement Learning

In a Nutshell

In reinforcement learning, an agent takes multiple actions, and the positive or negative outcome of those actions serves as a loss function for subsequent training.

Training an Agent

What is an agent?

How is training an agent different from training the models we've used so far?

Most things stay the same, and we can use all of the knowledge we've built:

  • We can use any or all of the network models, including feed-forward, convolutional, recurrent, and combinations of those.
  • We will still train in batches using some variant of gradient descent
  • As before, the model will ideally learn a complex non-obvious function of many parameters

A few things change ... well, not really change, but "specialize":

  • The inputs may start out as entire frames (or frame deltas) of a video feed
    • We may feature engineer more explicitly (or not)
  • The outputs may be a low-cardinality set of categories that represent actions (e.g., direction of a digital joystick, or input to a small number of control systems)
  • We may model state explicitly (outside the network) as well as implicitly (inside the network)
  • The function we're learning is one that "tells our agent what to do"; assuming there is no disconnect between knowing what to do and doing it, the function essentially is the agent
  • The loss function depends on the outcome of the game, and since the game requires many actions to reach an outcome, this calls for some slightly different training approaches from the ones we've used before

Principal Approaches: Deep Q-Learning and Policy Gradient Learning

  • Policy Gradient is straightforward and shows a lot of research promise, but can be quite difficult to use. The challenge is less in the math, code, or concepts, and more in terms of effective training. We'll look very briefly at PG.
  • Deep Q-Learning is more constrained and a little more complex mathematically. These factors would seem to cut against the use of DQL, but they allow for relatively fast and effective training, so it is very widely used. We'll go deeper into DQL and work with an example.

There are, of course, many variants on these as well as some other strategies.

Policy Gradient Learning

With Policy Gradient Learning, we directly try to learn a "policy" function that selects a (possibly continuous-valued) move for an agent to make given the current state of the "world."

We want to maximize total discounted future reward, but we do not need a discrete set of actions to choose from, nor a model that tells us a specific "next reward."

Instead, we can make fine-grained moves and we can collect all the moves that lead to a reward, and then apply that reward to all of them.
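
For instance, if only the last move of an episode earns a reward, every earlier move in that episode still receives a (discounted) share of it. Here is a minimal sketch of that bookkeeping (ours, not part of the original material):

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Assign to each step the discounted sum of all rewards from that step onward."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([0, 0, 0, 1]))  # approx. [0.97, 0.98, 0.99, 1.0]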


ASIDE: The term gradient here comes from a formula that gives the gradient (direction of steepest ascent) of the expected total reward with respect to the policy parameters. That is, it tells us which way to adjust the parameters to maximize the improvement in expected total reward.
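
For reference, the core of that formula in the standard REINFORCE notation (θ for the policy parameters, π_θ(a|s) for the policy, and R for the discounted return; this notation is ours, not the notebook's) is:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\, \nabla_\theta \log \pi_\theta(a \mid s)\, R \,\right]$$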


In some sense, this is a more straightforward, direct approach than the other approach we'll work with, Deep Q-Learning.

Challenges with Policy Gradients

Policy gradients, despite achieving remarkable results, are a form of brute-force solution.

Thus they require a large amount of input data and extraordinary amounts of training time.

Some of these challenges come down to the credit assignment problem -- properly attributing reward to actions in the past which may (or may not) be responsible for the reward. Mitigations therefore include designing more elaborate reward functions, feeding more frequent reward signals into training, adding domain knowledge to the policy, or adding an entire separate network, called a "critic network," that learns to provide feedback to the actor network.

Another challenge is the size of the search space, and tractable approaches to exploring it.

PG is challenging to use in practice, though there are a number of "tricks" in various publications that you can try.

Next Steps

If you'd like to explore a variety of reinforcement learning techniques, Matthias Plappert, at the Karlsruhe Institute of Technology, has created an add-on framework for Keras that implements a number of state-of-the-art RL techniques, including those discussed today.

His framework, keras-rl, is at https://github.com/matthiasplappert/keras-rl and includes examples that integrate with OpenAI Gym.

Deep Q-Learning

Deep Q-Learning is deep learning applied to "Q-Learning."

So what is Q-Learning?

Q-Learning is an approach that posits a mapping from each possible state in a game to the optimal action to take in that state.

Specifically, given a state and an action, there is a "Q-function" that provides the value or quality (the Q stands for quality) of that action taken in that state.

So, if an agent is in state s, choosing an action could be as simple as looking at Q(s, a) for all possible actions a and choosing the one with the highest "quality value" -- i.e., argmax_a Q(s, a).

There are some other considerations, such as explore-exploit tradeoff, but let's focus on this Q function.

In small state spaces, this function can be represented as a table, a bit like basic strategy blackjack tables.
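
As a toy illustration (ours, with made-up sizes), such a table and the greedy lookup described above could be as simple as:

import numpy as np

n_states, n_actions = 5, 3              # a made-up, tiny problem
Q = np.zeros((n_states, n_actions))     # Q[s, a] = estimated quality of action a in state s

s = 2                                   # current state
best_action = np.argmax(Q[s])           # pick the action with the highest quality value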

Even a simple Atari-style video game may have hundreds of thousands of states, though. This is where the neural network comes in.

What we need is a way to learn a Q function, when we don't know what the error of a particular move is, since the error (loss) may be dependent on many future actions and can also be non-deterministic (e.g., if there are randomly generated enemies or conditions in the game).

The tricks -- or insights -- here are:

[1] Model the total future reward -- what we're really after -- as a recursive calculation from the immediate reward (r) and future outcomes:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$

  • γ is a "discount factor" on future reward
  • Assume the game terminates or "effectively terminates" to make the recursion tractable
  • This equation is a simplified case of the Bellman Equation

[2] Assume that if you iteratively run this process starting with an arbitrary Q model, and you train the Q model with actual outcomes, your Q model will eventually converge toward the "true" Q function

  • This seems intuitively to resemble various Monte Carlo sampling methods (if that helps at all)

As improbable as this might seem at first for teaching an agent a complex game or task, it actually works, and in a straightforward way.
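
In the small, tabular setting, one iteration of that idea is just a couple of lines (a sketch of ours, with an assumed learning rate alpha); the deep version below replaces the table with a neural network:

import numpy as np

def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q[s, a] toward r + gamma * max_a' Q[s_next, a']."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q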

How do we apply this to our neural network code?

Unlike before, when we called "fit" to train a network automatically, here we'll need some interplay between the agent's behavior in the game and the training. That is, we need the agent to play some moves in order to get actual numbers to train with. And as soon as we have some actual numbers, we want to do a little training with them right away so that our Q function improves. So we'll alternate one or more in-game actions with a manual call to train on a single batch of data.

The algorithm looks like this (credit for the nice summary to Tambet Matiisen; read his longer explanation at https://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/ for review):

  1. Do a feedforward pass for the current state s to get predicted Q-values for all actions.
  2. Do a feedforward pass for the next state s′ and calculate the maximum over all network outputs, i.e., max_a′ Q(s′, a′).
  3. Set the Q-value target for action a to r + γ · max_a′ Q(s′, a′) (the max calculated in step 2, discounted and added to the immediate reward). For all other actions, set the Q-value target to the value originally returned from step 1, making the error 0 for those outputs.
  4. Update the weights using backpropagation.

If there is "reward" throughout the game, we can model the loss as

$$L = \tfrac{1}{2}\left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]^2$$

If the game is win/lose only ... most of the r's go away and the entire loss is based on a 0/1 or -1/1 score at the end of a game.

Practical Consideration 1: Experience Replay

To improve training, we cache all (or as much as possible) of the agent's state/move/reward/next-state data. Then, when we go to perform a training run, we can build a batch out of a subset of all previous moves. This provides diversity in the training data, whose value we discussed earlier.
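
Conceptually, such a buffer is just a bounded list plus random sampling. A minimal sketch (ours; not the ExperienceReplay class this notebook actually uses, whose get_batch method we examine at the end) might look like this:

import random

class ReplayBuffer(object):
    def __init__(self, max_memory=500):
        self.max_memory = max_memory
        self.memory = []

    def remember(self, experience, game_over):
        # experience is [state, action, reward, next_state]
        self.memory.append([experience, game_over])
        if len(self.memory) > self.max_memory:
            del self.memory[0]                      # evict the oldest entry

    def sample(self, batch_size=32):
        k = min(batch_size, len(self.memory))
        return random.sample(self.memory, k)        # a diverse batch of past transitions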

Practical Consideration 2: Explore-Exploit Balance

In order to add more diversity to the agent's actions, we set a threshold ("epsilon") which represents the probability that the agent ignores its experience-based model and just chooses a random action. This also adds diversity by preventing the agent from following an overly narrow, 100% greedy (best-performance-so-far) path.

Let's Look at the Code!

Reinforcement learning code examples are a bit more complex than the other examples we've seen so far, because in the other examples, the data sets (training and test) exist outside the program as assets (e.g., the MNIST digit data).

In reinforcement learning, the training and reward data come from some environment that the agent is supposed to learn. Typically, the environment is simulated by local code, or represented by local code even if the real environment is remote or comes from the physical world via sensors.

So the code contains not just the neural net and training logic, but part (or all) of a game world itself.

One of the most elegant small, but complete, examples comes courtesy of former Ph.D. researcher (Univ. of Florida) and Apple rockstar Eder Santana. It's a simplified catch-the-falling-brick game (a bit like Atari Kaboom! but even simpler) that nevertheless is complex enough to illustrate DQL and to be impressive in action.

When we're done, we'll basically have a game, and an agent that plays, which run like this:

First, let's get familiar with the game environment itself, since we'll need to see how it works, before we can focus on the reinforcement learning part of the program.

import numpy as np

class Catch(object):
    def __init__(self, grid_size=10):
        self.grid_size = grid_size
        self.reset()
 
    def _update_state(self, action):
        """
        Input: action and states
        Output: new states and reward
        """
        state = self.state
        if action == 0:  # left
            action = -1
        elif action == 1:  # stay
            action = 0
        else:
            action = 1  # right
        f0, f1, basket = state[0]
        new_basket = min(max(1, basket + action), self.grid_size-1)
        f0 += 1
        out = np.asarray([f0, f1, new_basket])
        out = out[np.newaxis]
 
        assert len(out.shape) == 2
        self.state = out
 
    def _draw_state(self):
        im_size = (self.grid_size,)*2
        state = self.state[0]
        canvas = np.zeros(im_size)
        canvas[state[0], state[1]] = 1  # draw fruit
        canvas[-1, state[2]-1:state[2] + 2] = 1  # draw basket
        return canvas
 
    def _get_reward(self):
        fruit_row, fruit_col, basket = self.state[0]
        if fruit_row == self.grid_size-1:
            if abs(fruit_col - basket) <= 1:
                return 1
            else:
                return -1
        else:
            return 0
 
    def _is_over(self):
        if self.state[0, 0] == self.grid_size-1:
            return True
        else:
            return False
 
    def observe(self):
        canvas = self._draw_state()
        return canvas.reshape((1, -1))
 
    def act(self, action):
        self._update_state(action)
        reward = self._get_reward()
        game_over = self._is_over()
        return self.observe(), reward, game_over
 
    def reset(self):
        n = np.random.randint(0, self.grid_size-1, size=1)
        m = np.random.randint(1, self.grid_size-2, size=1)
        self.state = np.array([0, n, m])[np.newaxis].astype('int64')
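
To see the environment on its own, one game can be played by hand like this (a quick illustrative snippet, not part of the original notebook):

env = Catch(grid_size=10)
env.reset()
game_over = False
while not game_over:
    observation, reward, game_over = env.act(1)  # action 1 = "stay": let the fruit fall straight down
print(reward)  # 1 if the fruit lands within one column of the basket, -1 otherwise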

Next, let's look at the network itself -- it's super simple, so we can get that out of the way too:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# hidden_size, num_actions, and grid_size are hyperparameters assumed to be set in an earlier cell
model = Sequential()
model.add(Dense(hidden_size, input_shape=(grid_size**2,), activation='relu'))
model.add(Dense(hidden_size, activation='relu'))
model.add(Dense(num_actions))                 # one output (Q-value) per possible action
model.compile(SGD(lr=.2), "mse")

Note that the output layer has num_actions neurons.

We are going to implement the training target as

  • the immediate reward plus the discounted estimate of future reward, for the one action taken, when the game doesn't conclude on this move, or
  • the raw reward/penalty for the specific action that wins/loses the game

In any case, we only train with an error/reward for actions the agent actually chose. We neutralize the hypothetical rewards for other actions, as they are not causally chained to any ground truth.
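
In code, computing those targets for a single remembered transition might look roughly like the following sketch (our illustration; state_t, action_t, reward_t, state_t1, and game_over name one stored experience, and gamma is the discount factor). The notebook's actual version lives in the ExperienceReplay.get_batch method, which we look at below:

import numpy as np

def q_targets(model, state_t, action_t, reward_t, state_t1, game_over, gamma=0.9):
    """Per-action training targets for one remembered (state, action, reward, next-state) step."""
    targets = model.predict(state_t)                      # start from the network's current predictions
    if game_over:
        targets[0, action_t] = reward_t                   # terminal step: the target is just the reward
    else:
        q_next = np.max(model.predict(state_t1)[0])       # best predicted value of the next state
        targets[0, action_t] = reward_t + gamma * q_next  # discounted Bellman target
    return targets  # untouched actions keep their predicted values, so their error is 0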

Next, let's zoom in on the main game training loop:

# env (a Catch instance), exp_replay (the experience-replay buffer), and the hyperparameters
# epoch, epsilon, batch_size, and num_actions are assumed to be defined elsewhere in the notebook
win_cnt = 0
for e in range(epoch):
    loss = 0.
    env.reset()
    game_over = False
    # get initial input
    input_t = env.observe()

    while not game_over:
        input_tm1 = input_t
        # get next action
        if np.random.rand() <= epsilon:
            action = np.random.randint(0, num_actions, size=1)
        else:
            q = model.predict(input_tm1)
            action = np.argmax(q[0])

        # apply action, get rewards and new state
        input_t, reward, game_over = env.act(action)
        if reward == 1:
            win_cnt += 1

        # store experience
        exp_replay.remember([input_tm1, action, reward, input_t], game_over)

        # adapt model
        inputs, targets = exp_replay.get_batch(model, batch_size=batch_size)

        loss += model.train_on_batch(inputs, targets)
    print("Epoch {:03d}/{:d} | Loss {:.4f} | Win count {}".format(e, epoch - 1, loss, win_cnt))

The key bits are:

  • Choose an action
  • Act and collect the reward and new state
  • Cache previous state, action, reward, and new state in "Experience Replay" buffer
  • Ask buffer for a batch of action data to train on
  • Call model.train_on_batch to perform one training batch

Last, let's dive into where the actual Q-Learning calculations occur, which in this code happens to be in the get_batch call on the experience replay buffer object: