Chapter 6 TD Learning

Sutton & Barto identify temporal-difference (TD) learning as an idea central and novel to reinforcement learning; this chapter covers its tabular forms.

Tabular TD(0) Prediction

TD(0) prediction uses the temporal-difference error to estimate the state-value function of a given policy.

_images/tabular_td0_eval.jpg

The main loop of the IntroRL TD(0) prediction function is shown below.

while (num_episodes<=max_num_episodes-1) and keep_looping:

    keep_looping = False
    abserr = 0.0 # calculated below as part of termination criteria

    # policy evaluation
    for start_hash in start_stateL:
        num_episodes += 1
        if num_episodes > max_num_episodes:
            break

        s_hash = start_hash
        a_desc = policy.get_single_action( s_hash )
        for _ in range( max_episode_steps ):

            sn_hash, reward = Env.get_action_snext_reward( s_hash, a_desc )

            state_value_coll.td0_update( s_hash=s_hash, alpha=alpha_obj(),
                                                     gamma=gamma, sn_hash=sn_hash,
                                                     reward=reward)

            if (sn_hash in Env.terminal_set) or (sn_hash is None):
                break

            # get ready for next step
            s_hash = sn_hash

            a_desc = policy.get_single_action( s_hash )
            if a_desc is None:
                print('a_desc is None for policy.get_single_action( "%s" ) ='%\
                      str(s_hash), a_desc)

    abserr = state_value_coll.get_biggest_action_state_err()
    if abserr > max_abserr:
        keep_looping = True

    if num_episodes < min_num_episodes:
        keep_looping = True # must loop for min_num_episodes at least
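
The state_value_coll.td0_update call above applies the tabular TD(0) update rule. Below is a minimal sketch of that update against a plain dict of state values (a stand-in for the actual IntroRL state_value_coll object, which is not shown here); a None next state is treated as terminal with value zero.

def td0_update(V, s_hash, alpha, gamma, sn_hash, reward):
    """Apply one tabular TD(0) update: V(s) <- V(s) + alpha*(r + gamma*V(s') - V(s)).

    V is assumed to be a dict mapping state hash -> estimated value.
    A terminal (None) next state contributes zero to the target.
    """
    v_s  = V.get(s_hash, 0.0)
    v_sn = 0.0 if sn_hash is None else V.get(sn_hash, 0.0)
    td_error = reward + gamma * v_sn - v_s
    V[s_hash] = v_s + alpha * td_error
    return abs(td_error)   # useful for a convergence check like max_abserr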

TD(0) Prediction Full Source Code

Example 6.2 Random Walk

Example 6.2 in Sutton & Barto on page 125 uses the random walk Markov Reward Process (MRP) shown below.

_images/example_6_2_random_walk_mdp.jpg
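
A minimal sketch of that MRP, assuming the standard Example 6.2 setup (five non-terminal states A through E, start at C, equal-probability left and right moves, and a reward of +1 only on exiting past E to the right), is shown below.

import random

def random_walk_episode(start='C'):
    """Generate one Example 6.2 random-walk episode as a list of (state, reward, next_state) steps.

    States A..E are non-terminal; stepping past either end terminates the episode.
    Every reward is 0 except the transition off the right end, which is +1.
    """
    states = ['A', 'B', 'C', 'D', 'E']
    i = states.index(start)
    steps = []
    while True:
        j = i + random.choice([-1, 1])        # left or right with equal probability
        if j < 0:                             # exited to the left, reward 0
            steps.append((states[i], 0.0, None))
            return steps
        if j >= len(states):                  # exited to the right, reward +1
            steps.append((states[i], 1.0, None))
            return steps
        steps.append((states[i], 0.0, states[j]))
        i = j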

The solutions from Sutton & Barto and from IntroRL are shown below.

They both show that TD(0) converges more quickly than MC for this problem.

_images/example_6_2_random_walk_rms_sutton.jpg _images/example_6_2_mc_td_random_walk.png

IntroRL Example 6.2 Source Code

Figure 6.2 Batch Training

Figure 6.2 on page 127 of Sutton & Barto shows that batch TD is consistently better than batch MC for the random walk shown above in example 6.2.
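
Under batch updating, the episodes collected so far are replayed repeatedly: TD(0) increments are computed for every step, summed over the whole batch, and applied only once per sweep, until the value function stops changing. A minimal sketch is below, assuming episodes stored as lists of (state, reward, next_state) steps (as in the random-walk sketch above) and values held in a plain dict.

def batch_td0(V, episodes, alpha=0.001, gamma=1.0, tol=1e-4):
    """Batch TD(0): sweep all stored episodes, summing increments, and apply
    them once per sweep; repeat until the largest change is below tol.
    """
    while True:
        increments = {}
        for episode in episodes:
            for s, reward, sn in episode:
                v_sn = 0.0 if sn is None else V.get(sn, 0.0)
                delta = alpha * (reward + gamma * v_sn - V.get(s, 0.0))
                increments[s] = increments.get(s, 0.0) + delta
        for s, dv in increments.items():
            V[s] = V.get(s, 0.0) + dv
        if not increments or max(abs(dv) for dv in increments.values()) < tol:
            return V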

The figure on the left below is taken directly from Sutton & Barto.

The figure on the right below is generated by IntroRL and overlays the results from IntroRL, Sutton & Barto, and Shangtong Zhang.

_images/figure_6_2_batch_td_mc_sutton.jpg _images/fig_6_2_mc_td_random_walk.png

IntroRL Figure 6.2 Source Code

Sarsa On-Policy TD Control

This establishes a baseline for the epsilon-greedy version of Sarsa.

The pseudo code for Sarsa estimation of Q(s,a) is given on page 130 of Sutton & Barto and is shown below.

_images/sarsa_td_control_sutton.jpg
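
In the IntroRL loop below, actions are chosen with action_value_coll.get_best_eps_greedy_action. A minimal sketch of epsilon-greedy selection over a hypothetical dict-backed Q table (not the actual IntroRL method) is given here for reference.

import random

def eps_greedy_action(Q, s_hash, action_list, epsilon=0.1):
    """Return a random action with probability epsilon, otherwise a greedy
    (highest Q(s,a)) action, breaking ties at random.

    Q is assumed to be a dict mapping (s_hash, a_desc) -> value.
    """
    if random.random() < epsilon:
        return random.choice(action_list)
    best_q = max(Q.get((s_hash, a), 0.0) for a in action_list)
    best_actions = [a for a in action_list if Q.get((s_hash, a), 0.0) == best_q]
    return random.choice(best_actions)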

The main loop of the IntroRL epsilon-greedy Sarsa control function is shown below.

while (episode_loop_counter<=max_num_episodes-1) and keep_looping :

    keep_looping = False
    abserr = 0.0 # calculated below as part of termination criteria
    Nterminal_episodes = set() # tracks if start_hash got to terminal_set or max_num_episodes

    for start_hash in loop_stateL:
        episode_loop_counter += 1
        if episode_loop_counter > max_num_episodes:
            break

        if learn_tracker is not None:
            learn_tracker.add_new_episode()

        s_hash = start_hash
        a_desc = action_value_coll.get_best_eps_greedy_action( s_hash, epsgreedy_obj=eg )

        for n_episode_steps in range( max_episode_steps ):

            # take one step of the episode; a_desc may be None if no action is available
            if a_desc is None:
                Nterminal_episodes.add( start_hash )
                print('break for a_desc==None')
                break
            else:
                sn_hash, reward = environment.get_action_snext_reward( s_hash, a_desc )
                if learn_tracker is not None:
                    learn_tracker.add_sarsn_to_current_episode( s_hash, a_desc,
                                                                reward, sn_hash)

                if sn_hash is None:
                    Nterminal_episodes.add( start_hash )
                    print('break for sn_hash==None')
                    break
                else:
                    an_desc = action_value_coll.get_best_eps_greedy_action( sn_hash,
                                                                            epsgreedy_obj=eg )

                    action_value_coll.sarsa_update( s_hash=s_hash, a_desc=a_desc,
                                                    alpha=alpha_obj(), gamma=gamma,
                                                    sn_hash=sn_hash, an_desc=an_desc,
                                                    reward=reward)

                    if sn_hash in environment.terminal_set:
                        Nterminal_episodes.add( start_hash )
                        if (n_episode_steps==0) and (num_s_hash>2):
                            print('1st step break for sn_hash in terminal_set', sn_hash,
                                  ' s_hash=%s'%str(s_hash), ' a_desc=%s'%str(a_desc))
                        break
                    s_hash = sn_hash
                    a_desc = an_desc

    # increment episode counter on EpsilonGreedy and Alpha objects
    eg.inc_N_episodes()
    alpha_obj.inc_N_episodes()

    abserr = action_value_coll.get_biggest_action_state_err()
    if abserr > max_abserr:
        keep_looping = True

    if episode_loop_counter < min_num_episodes:
        keep_looping = True # must loop for min_num_episodes at least
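
The action_value_coll.sarsa_update call above applies the on-policy Sarsa update, bootstrapping from the next action actually selected. A minimal sketch against a plain dict-backed Q table (a stand-in for the actual IntroRL action_value_coll object) is shown below.

def sarsa_update(Q, s_hash, a_desc, alpha, gamma, sn_hash, an_desc, reward):
    """Q(s,a) <- Q(s,a) + alpha*(r + gamma*Q(s',a') - Q(s,a)).

    Q is assumed to be a dict mapping (s_hash, a_desc) -> value.
    A terminal next state (or missing next action) contributes zero to the target.
    """
    q_sa = Q.get((s_hash, a_desc), 0.0)
    if sn_hash is None or an_desc is None:
        q_next = 0.0
    else:
        q_next = Q.get((sn_hash, an_desc), 0.0)
    Q[(s_hash, a_desc)] = q_sa + alpha * (reward + gamma * q_next - q_sa)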

Sarsa On-Policy, Full Source Code

Example 6.5 Windy Gridworld

Example 6.5 applies epsilon-greedy Sarsa to the Windy Gridworld.

The case is run with gamma=1.0, epsilon=0.1, and alpha=0.5 (see Example 6.5 Windy Gridworld, Full Source Code).

Shown below on the left is the answer published in Sutton & Barto.

On the right is a plot comparing the results of IntroRL, Sutton & Barto, and Denny Britz.

_images/example_6_5_windy_gridworld_sutton.jpg _images/example_6_5_windy_gridworld.png

The policy from the IntroRL run produces the episode shown below; however, the resulting policy was not consistent and often got caught in infinite loops:

 _________________ Windy Gridworld Sutton Ex6.5 Episode Summary ________________
      *       *       *       *       *       *  [7->R]  [8->R]  [9->R] [10->D]  ||
      *       *       *       *       *  [6->R]       *       *       * [11->D]  ||
      *       *       *       *  [5->R]       *       *       *       * [12->D]  ||
 [1->R]  [2->R]  [3->R]  [4->R]       *       *       *       *       * [13->D]  ||
      *       *       *       *       *       *       *       * [15->L] [14->L]  ||
      *       *       *       *       *       *       *       *       *       *  ||
      *       *       *       *       *       *       *       *       *       *  ||
___0_______0_______0_______1_______1_______1_______2_______2_______1_______0___
 ______________________________ Upward Wind Speed ______________________________
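
The bottom row of the summary is the upward wind speed in each column. A minimal sketch of the step dynamics implied by that wind profile, assuming the standard 7x10 grid of Example 6.5 with row 0 at the top and moves clipped at the grid edges, is shown below.

WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]     # upward push per column, as shown above
N_ROWS, N_COLS = 7, 10

def windy_step(row, col, action):
    """Apply one move ('U', 'D', 'L', 'R') plus the starting column's upward wind,
    clipping the result to stay on the grid."""
    d_row, d_col = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}[action]
    new_row = row + d_row - WIND[col]      # wind pushes toward row 0 (the top)
    new_col = col + d_col
    new_row = min(max(new_row, 0), N_ROWS - 1)
    new_col = min(max(new_col, 0), N_COLS - 1)
    return new_row, new_col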

Example 6.6 Cliff Walking

Example 6.6 compares epsilon-greedy Q-learning to epsilon-greedy Sarsa for a Cliff Walking Gridworld. The Q-learning pseudo code is given on page 131 of Sutton & Barto.

_images/qlearning_pseudocode_sutton.jpg

The Example 6.6 Cliff Walking, Full Source Code of the IntroRL Q-learning function is much like the Sarsa On-Policy, Full Source Code except for the update of the Q(s,a) value.

# Q-learning Q(s,a) update
action_value_coll.qlearning_update( s_hash=s_hash, a_desc=a_desc, sn_hash=sn_hash,
                                    alpha=alpha_obj(), gamma=gamma,
                                    reward=reward)

# Sarsa Q(s,a) update
action_value_coll.sarsa_update( s_hash=s_hash, a_desc=a_desc,
                                alpha=alpha_obj(), gamma=gamma,
                                sn_hash=sn_hash, an_desc=an_desc,
                                reward=reward)
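
The difference is entirely in the bootstrap target: Sarsa uses Q(s',a') for the next action actually chosen, while Q-learning maximizes over all actions available in the next state, regardless of which one will be taken. A minimal sketch of the Q-learning update, against the same hypothetical dict-backed Q table used in the Sarsa sketch above, is shown below.

def qlearning_update(Q, s_hash, a_desc, alpha, gamma, sn_hash, reward, action_list):
    """Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a)).

    Unlike the Sarsa update, no next action is supplied; the target maximizes
    over every action available in the next state.
    """
    q_sa = Q.get((s_hash, a_desc), 0.0)
    if sn_hash is None:
        q_next = 0.0                       # terminal next state
    else:
        q_next = max(Q.get((sn_hash, a), 0.0) for a in action_list)
    Q[(s_hash, a_desc)] = q_sa + alpha * (reward + gamma * q_next - q_sa)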

The point of Example 6.6 is to show that Sarsa takes the epsilon-greedy action selection into account and travels the safer route along the top of the grid.

Q-learning takes the more efficient route along the edge of the cliff, ignoring the exploration dangers of epsilon-greedy.

_images/cliffwalk_gridworld_sutton.jpg _images/example_6_6_soln_sutton.jpg _images/example_6_6_cliff_walking_alpha_0_5.png

The above-right image overlays the results of Sutton & Barto and IntroRL.

Figure 6.3 Expected Sarsa

Figure 6.3 on page 133 of Sutton & Barto illustrates the benefit of Expected Sarsa over both Sarsa and Q-learning.
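
Expected Sarsa replaces the sampled Q(s',a') in the target with its expected value under the current policy, removing the variance introduced by the random choice of the next action. A minimal sketch for an epsilon-greedy behavior policy, again over a hypothetical dict-backed Q table, is shown below.

def expected_sarsa_update(Q, s_hash, a_desc, alpha, gamma, sn_hash, reward,
                          action_list, epsilon=0.1):
    """Q(s,a) <- Q(s,a) + alpha*(r + gamma*E_pi[Q(s',a')] - Q(s,a)),
    where the expectation is taken over an epsilon-greedy policy at s'.
    """
    q_sa = Q.get((s_hash, a_desc), 0.0)
    if sn_hash is None:
        expected_q = 0.0
    else:
        q_vals = [Q.get((sn_hash, a), 0.0) for a in action_list]
        n = len(q_vals)
        # epsilon/n probability on every action, plus (1 - epsilon) extra
        # mass on the greedy action
        expected_q = (epsilon / n) * sum(q_vals) + (1.0 - epsilon) * max(q_vals)
    Q[(s_hash, a_desc)] = q_sa + alpha * (reward + gamma * expected_q - q_sa)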

The description of Figure 6.3 applies to the chart on the left, below.

The chart on the right was generated by IntroRL with fewer runs than the Sutton & Barto chart.

_images/figure_6_3_text_sutton.jpg _images/figure_6_3_chart_sutton.jpg _images/figure_6_3_cliff_walking_plot.png

The plot above is created by first running a script to make a data file: Make Figure 6.3 Expected Sarsa Data File Code.

A second script then creates the plot: Figure 6.3 Expected Sarsa, Plotting Source Code.

Figure 6.5 Double Q-learning

Double Q-learning is outlined in the pseudo code below.

_images/double_qlearning_pseudocode.jpg
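
The core idea is to keep two action-value tables, Q1 and Q2, and on each step update one of them at random, selecting the greedy next action with the table being updated but evaluating that action with the other table. A minimal sketch over two hypothetical dict-backed tables is shown below.

import random

def double_q_update(Q1, Q2, s_hash, a_desc, alpha, gamma, sn_hash, reward, action_list):
    """With probability 0.5 update Q1 using Q2 to evaluate Q1's greedy next action;
    otherwise update Q2 symmetrically.  Q1 and Q2 are dicts keyed by (state, action).
    """
    if random.random() < 0.5:
        q_update, q_eval = Q1, Q2
    else:
        q_update, q_eval = Q2, Q1

    q_sa = q_update.get((s_hash, a_desc), 0.0)
    if sn_hash is None:
        q_next = 0.0
    else:
        # argmax under the table being updated ...
        a_star = max(action_list, key=lambda a: q_update.get((sn_hash, a), 0.0))
        # ... evaluated by the other table
        q_next = q_eval.get((sn_hash, a_star), 0.0)
    q_update[(s_hash, a_desc)] = q_sa + alpha * (reward + gamma * q_next - q_sa)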

Figure 6.5 demonstrates how Double Q-learning is virtually unaffected by maximization bias.

_images/figure_6_5_text_sutton.jpg _images/figure_6_5_chart_sutton.jpg _images/figure_6_5_maximization_bias_nb10.png

The IntroRL chart, above right, was run with 10 choices from state (B) in the MDP (see Figure 6.5 Double Q-learning, Full Source Code).

The IntroRL values are overlaid on the Sutton & Barto values.
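
For reference, a minimal sketch of the Figure 6.5 maximization-bias MDP as run here, assuming the standard Sutton & Barto setup: from start state A, 'right' terminates with reward 0 and 'left' moves to B with reward 0; each of the 10 choices from B terminates with a reward drawn from a normal distribution with mean -0.1 and standard deviation 1.0.

import random

N_B_CHOICES = 10    # matches the 10 choices from (B) used for the chart above

def max_bias_step(state, action):
    """Return (next_state, reward); next_state is None when the episode terminates."""
    if state == 'A':
        if action == 'right':
            return None, 0.0              # straight to terminal, reward 0
        return 'B', 0.0                   # 'left' moves to B, reward 0
    # any of the N_B_CHOICES actions from B terminates with a noisy reward
    return None, random.gauss(-0.1, 1.0)

def available_actions(state):
    """Available actions in each non-terminal state."""
    if state == 'A':
        return ['left', 'right']
    return list(range(N_B_CHOICES))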