Chapter 8 Planning & Learning

Tabular DynaQ

_images/tabular_dynaq_pseudocode.jpg

IntroRL implements the above pseudo code with the class DynaQAgent

Figure 8.2 DynaQ Grid

Figure 8.2 from Sutton & Barto is shown below on the left.

It illustrates the difference that planning steps make in solving the grid from “S” to “G”.

_images/fig_8_2_sutton.jpg _images/fig_8_2_dyna_maze.png

The figure shown above on the right is from the IntroRL implementation of tabular DynaQ.

The code used to generate the above figure is: Figure 8.2 Code

Figure 8.4 Blocking Maze

Figure 8.4 goes with Example 8.2 Blocking Maze on page 166 of Sutton & Barto . It illustrates how the DynaQ and DynaQ+ algorithms adapt to a maze that changes configuration as the analysis progresses.

In particular, this maze closes off the shortest route to the goal, and opens a longer route to the goal partway through the analysis. The figure on the lower left illustrates the two maze configurations used during the analysis.

The figure on the lower right compares the results of IntroRL with Sutton & Barto and Shangtong Zhang

_images/figure_8_4_sutton.jpg _images/fig_8_4_blocking_maze.png

The code used to generate the above figure is: Figure 8.4 Code

The figure also relies on the code for DynaQAgent and DynaQPlusAgent

Figure 8.5 Shortcut Maze

Figure 8.5 goes with Example 8.3 Shortcut Maze on page 167 of Sutton & Barto . It illustrates how the DynaQ and DynaQ+ algorithms adapt to a maze that changes configuration as the analysis progresses.

Unlike the Blocking Maze, this maze opens a shortcut to the goal state partway through the analysis. The figure on the lower left illustrates the two maze configurations used during the analysis.

The figure on the lower right compares the results of IntroRL with Sutton & Barto

_images/figure_8_5_sutton.jpg _images/fig_8_5_shortcut_maze.png

The code used to generate the above figure is: Figure 8.5 Code

Example 8.4 Prioritized Sweeping

Prioritized Sweeping creates a priority queue of states-action pairs that should be examined in the planning portion of an agent’s search. The pseudo code from page 170 of Sutton & Barto is shown below.

_images/prioritized_sweeping_psuedocode.jpg

PrioritySweepAgent implements the above pseudo code.

Note that updates to Q(s,a) are only performed in the planning section on (s,a) pairs taken from the priority queue.

This approach was shown by Peng and Williams (1993) to perform well on maze solutions by reducing the number of Q(s,a) updates required to reach a solution.

_images/scaled_maze.jpg

The above image shows the maze used in Example 8.4 to compare Dyna-Q to Prioritized Sweeping. The maze was scaled from the maze on the left, to ever-larger mazes by integer factors. The above two mazes are for scale factors of 1 and 4.

The two plots below show the results on page 170 of Sutton & Barto as well as the results generated by IntroRL.

The plot on the right includes all the results from Sutton & Barto, Shangtong Zhang and IntroRL.

_images/example_8_4_psweep_sutton.jpg _images/example_8_4_psweep.png

The code used to generate the above figure is: Example 8.4 Code