Introduction
Configuration
Windows 64-bit + Python 3.7 + PyCharm + OpenAI Gym
Prior knowledge
In the paper Playing Atari with Deep Reinforcement Learning, the authors explain why they introduce deep learning into reinforcement learning (RL). They want a general solution that works across quite different games, and to achieve this we should stop using hand-crafted features, which may differ between two distinct types of games. Just as in computer vision and speech recognition, convolutional networks extract features quickly and efficiently regardless of how the raw images vary. This advance in deep learning (DL) makes it possible to extract features directly from raw image or sound data, so it is natural to ask whether DL can also benefit RL applied to raw sensory data.
The paper also answers the question of what distinguishes DL from RL. DL assumes a fixed data distribution in which samples do not interfere with one another: the appearance of one image does not change the probability of another image appearing.
RL, however, is different. What you observe now strongly influences what you will observe next: one frame from your in-vehicle camera tells you a lot about the frames that follow. So the data in RL are not independently and identically distributed, and the distribution itself changes as a new policy is learned.
Simple network building
In this section, I will build a basic DQN for CartPole from the ground up, following this tutorial.
CartPole
CartPole is just a pole attached to a cart: if you pull the cart forward, the pole falls backward due to inertia, as in the GIF above. Click here to learn more about it.
For this simple game, what you need to know is that each observation is an array containing four numbers:
[position of cart, velocity of cart, angle of pole, rotation rate of pole]
You can choose two actions, and they are:
applying a force of +1 or -1 to the cart
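To make the observation and actions concrete, here is a minimal interaction sketch (it assumes the classic Gym CartPole-v0 API used throughout this post, where step() returns four values):

import gym

env = gym.make("CartPole-v0")
observation = env.reset()
print(observation)        # [position of cart, velocity of cart, angle of pole, rotation rate of pole]
print(env.action_space)   # Discrete(2): action 0 pushes the cart one way, action 1 the other

# take one step and look at what comes back
observation, reward, done, info = env.step(1)
print(observation, reward, done)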
Guess what? When I first saw these four numbers, I wanted to compute a function that decides which action to choose, and that is exactly the idea behind DQN! In the next section, I will build a network to solve the CartPole problem step by step.
CartPole solution
In the following code, I import some packages. The variables are:
- LR: learning rate, how fast the weights change
- goal_steps: the maximum number of steps one game lasts
- score_requirement: the minimum score you are willing to accept
- initial_games: the number of games played to gather sufficient data
Then, I create and reset a new game.
import gym
import random
import numpy as np
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.estimator import regression
from statistics import median, mean
from collections import Counter
LR = 1e-3
env = gym.make("CartPole-v0")
env.reset()
goal_steps = 500
score_requirement = 50
initial_games = 10000
Now, I guess you are thinking about building a network, but wait. Even if you do not know how to solve this problem or how to build a network with TFlearn, you probably know how to play a game: you do trial and error to accumulate experience, train yourself, and finally examine what you have learned.
Now, let's run the game and collect some data.
def initial_population():
    # training_data holds [observation, one-hot action] pairs
    training_data = []
    # In the following loop, we generate sufficient data to train on
    for _ in range(initial_games):
        # we use score to filter out poor performance
        score = 0
        # game_memory stores the whole decision sequence of one game
        game_memory = []
        # the observation you get back is the one *after* the action,
        # so you need to keep the previous observation around
        prev_observation = []
        # In this loop, we cap each game at goal_steps rounds,
        # take random actions and keep only the good sequences
        for _ in range(goal_steps):
            action = random.randrange(0, 2)
            observation, reward, done, info = env.step(action)
            if len(prev_observation) > 0:
                game_memory.append([prev_observation, action])
            prev_observation = observation
            score += reward
            if done: break
        if score >= score_requirement:
            for data in game_memory:
                # this is the one-hot action label
                if data[1] == 1:
                    output = [0, 1]
                elif data[1] == 0:
                    output = [1, 0]
                training_data.append([data[0], output])
        env.reset()
    return training_data
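Before training, it is worth a quick sanity check on what initial_population() returns (a small sketch of my own, not part of the tutorial):

# each element of training_data is [observation, one-hot action label]
sample_data = initial_population()
print("number of training pairs:", len(sample_data))
print("first observation:", sample_data[0][0])
print("first action label:", sample_data[0][1])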
Then, build a network for the training_data. If you do not understand the TFlearn code, click here for my other post.
def neural_network_model(input_size):
    # the training data fed into this network comes only from games
    # whose score is at least score_requirement
    network = input_data(shape=[None, input_size, 1], name='input')
    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 512, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=LR, loss='categorical_crossentropy', name='targets')
    model = tflearn.DNN(network, tensorboard_dir='log')
    return model
Next, we reshape training_data to train our model.
def train_model(training_data, model=False):
    # reshape data
    X = np.array([i[0] for i in training_data]).reshape(-1, len(training_data[0][0]), 1)
    y = [i[1] for i in training_data]
    if not model:
        model = neural_network_model(input_size=len(X[0]))
    model.fit({'input': X}, {'targets': y}, n_epoch=5, snapshot_step=500, show_metric=True, run_id='openai_learning')
    return model
training_data = initial_population()
model = train_model(training_data)
Finally, rerun the game and use the model!
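Here is one way to do that, as a sketch that follows the same conventions as the data-collection code above (prev_obs starts empty, so the very first action of each game is random):

# play 10 games with the trained model and collect the scores
scores = []
for _ in range(10):
    env.reset()
    score = 0
    prev_obs = []
    for _ in range(goal_steps):
        if len(prev_obs) == 0:
            action = random.randrange(0, 2)
        else:
            # pick the action the network believes is best
            action = np.argmax(model.predict(prev_obs.reshape(-1, len(prev_obs), 1))[0])
        observation, reward, done, info = env.step(action)
        prev_obs = observation
        score += reward
        if done: break
    scores.append(score)
print('Average:', mean(scores), 'Median:', median(scores), 'SD:', np.std(scores))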
Analysis
Rerun the game a few times and calculate the average, median, and standard deviation of the scores. The results are below:
Epoch | Average | Median | SD |
---|---|---|---|
100 | 189.3 | 200.0 | 23.7 |
100 | 179.2 | 200.0 | 30.3 |
100 | 166.4 | 180.0 | 35.3 |
150 | 176.8 | 200.0 | 30.9 |
150 | 156.0 | 152.5 | 35.0 |
150 | 200.0 | 200.0 | 0.0 |
From the table we can see that the simple network raises the game reward, but the result is not stable: the model treats every training example equally, so on states that resemble the barely-passing 50-score sequences it can still fail quickly.
Advanced network with Q value
In the previous section, I computed the probability of each action to decide how to move from the current state to the next. For the training data I chose sequences scoring more than 50 as 'good' input; sequences scoring less than 50 are more likely to be noisy data that took a wrong action early on.
It is tricky to decide which data are 'good' and which are 'bad', so we need a more advanced algorithm that can measure how good a sequence is and iterate on itself until the game is solved: this is where the Q value comes in.
What is Q value?
The Q value is a scalar that indicates how good it is to take a given action in a given state.
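For reference, the Q value of taking action a in state s under a policy π is the expected discounted return, where r_t is the reward at step t and γ is the discount factor:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0}=s,\ a_{0}=a\right]$$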
The Q value is calculated as above; if the formula is unfamiliar, an introductory RL textbook covers it in detail. Here I will point out a few things that other tutorials do not tell you:
- Why there is a π right above Q: because you choose actions according to a policy, which means the action is drawn from a distribution rather than fixed in advance.
- Why the left-hand side is an expectation: when both your policy and the state transitions are stochastic, all you can compute is the expected return.
How to compute the Q value with a network?
To simplify the concept: the state (s) and the action (a) contribute to the Q value in different ways. As in the previous section, we compute Q(s, a) by feeding the state into the network, while the reward R is tied to the action; different actions lead to different next states and therefore different rewards.
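In other words, the target the network is trained toward (used later in experience_replay) is the standard one-step Bellman estimate: the immediate reward plus the discounted best Q value of the next state, or just the reward when the game ends:

$$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$$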
Following this paper, I import deque to construct the replay memory pool.
import gym
import random
import numpy as np
import tflearn
from tflearn.layers.core import input_data, fully_connected
from tflearn.layers.estimator import regression
from collections import deque
The parameters are:
- LR: learning rate
- GAMMA: discount factor
- MEMORY_SIZE: the size of the replay memory
- BATCH_SIZE: the mini-batch size
- EXPLORATION_MAX: the initial exploration rate
- EXPLORATION_MIN: the minimum exploration rate
- EXPLORATION_DECAY: the decay factor applied to the exploration rate
LR = 1e-3
GAMMA = 0.95
MEMORY_SIZE = 1000000
BATCH_SIZE = 32
EXPLORATION_MAX = 1
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995
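As a quick side calculation (my own, not in the original code), these values mean the exploration rate shrinks by 0.5% per replay step, so it takes roughly 900 replay steps to decay from EXPLORATION_MAX down to EXPLORATION_MIN:

import math

# number of multiplicative decay steps until the exploration rate hits its floor
steps_to_min = math.log(EXPLORATION_MIN / EXPLORATION_MAX) / math.log(EXPLORATION_DECAY)
print(round(steps_to_min))  # about 919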
Because the CartPole game is simple, we build a smaller network, which is easy to comprehend. Notice that I change the activation of the last layer to linear, so the output is no longer limited to the 0-1 interval, and the loss is changed to mean squared error.
def neural_network_model(input_size):
    network = input_data(shape=[None, input_size, 1], name='input')
    network = fully_connected(network, 24, activation='relu')
    network = fully_connected(network, 24, activation='relu')
    network = fully_connected(network, 2, activation='linear')
    network = regression(network, optimizer='adam', loss='mean_square', learning_rate=LR, name='targets')
    model = tflearn.DNN(network, tensorboard_dir='log')
    return model
So our main function cartpole()
is:
def cartpole():
    env = gym.make('CartPole-v0')
    observation_space = env.observation_space.shape[0]
    action_space = env.action_space.n
    # dqn_solver builds the network, selects actions,
    # stores transitions and trains the model
    dqn_solver = DQNSolver(observation_space, action_space)
    run = 0
    # we play 1000 games to train the model
    for i in range(1000):
        run += 1
        state = env.reset()
        state = np.reshape(state, (-1, len(state), 1))
        step = 0
        while True:
            step += 1
            action = dqn_solver.act(state)
            state_next, reward, terminal, _ = env.step(action)
            # penalise the step that ends the game
            reward = -reward if terminal else reward
            state_next = state_next.reshape(-1, len(state_next), 1)
            dqn_solver.remember(state, action, reward, state_next, terminal)
            state = state_next
            if terminal:
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                break
            dqn_solver.experience_replay()
The DQNSolver class is:
class DQNSolver:

    def __init__(self, observation_space, action_space):
        self.exploration_rate = EXPLORATION_MAX
        self.action_space = action_space
        self.memory = deque(maxlen=MEMORY_SIZE)
        self.model = neural_network_model(observation_space)

    def remember(self, state, action, reward, next_state, done):
        # store one transition in the replay memory
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # epsilon-greedy action selection
        if np.random.rand() < self.exploration_rate:
            return random.randrange(self.action_space)
        q_values = self.model.predict(state)
        return np.argmax(q_values[0])

    def experience_replay(self):
        if len(self.memory) < BATCH_SIZE:
            return
        # sample a random mini-batch from the replay memory
        batch = random.sample(self.memory, BATCH_SIZE)
        state_batch = []
        q_values_batch = []
        for state, action, reward, state_next, terminal in batch:
            # one-step Bellman target: r + GAMMA * max_a' Q(s', a')
            q_update = reward
            if not terminal:
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))
            q_values = self.model.predict(state)
            q_values[0][action] = q_update
            state_batch.append(state[0])
            q_values_batch.append(q_values[0])
        self.model.fit({'input': state_batch}, {'targets': q_values_batch}, n_epoch=1, snapshot_epoch=False, run_id='openai_learning')
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)
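To actually start training, call the main function (assuming all the definitions above live in one script):

if __name__ == "__main__":
    cartpole()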
If you run my code, you will find that it may converge within the 1000 training games, but it can also diverge and end up with a really low score. Mini-batch training is an important tool for tackling this convergence problem, and that technique will be introduced in another post. In the next post, I will continue to improve the Q-value network. Wait for me.