Chapter 8 Planning & Learning¶
Tabular DynaQ¶
IntroRL implements the above pseudo code with the class DynaQAgent
Figure 8.2 DynaQ Grid¶
Figure 8.2 from Sutton & Barto is shown below on the left.
It illustrates the difference that planning steps make in solving the grid from “S” to “G”.
The figure shown above on the right is from the IntroRL implementation of tabular DynaQ.
The code used to generate the above figure is: Figure 8.2 Code
Figure 8.4 Blocking Maze¶
Figure 8.4 goes with Example 8.2 Blocking Maze
on page 166 of
Sutton & Barto .
It illustrates how the DynaQ and DynaQ+ algorithms adapt to a maze that changes configuration
as the analysis progresses.
In particular, this maze closes off the shortest route to the goal, and opens a longer route to the goal partway through the analysis. The figure on the lower left illustrates the two maze configurations used during the analysis.
The figure on the lower right compares the results of IntroRL with Sutton & Barto and Shangtong Zhang
The code used to generate the above figure is: Figure 8.4 Code
The figure also relies on the code for DynaQAgent and DynaQPlusAgent
Figure 8.5 Shortcut Maze¶
Figure 8.5 goes with Example 8.3 Shortcut Maze
on page 167 of
Sutton & Barto .
It illustrates how the DynaQ and DynaQ+ algorithms adapt to a maze that changes configuration
as the analysis progresses.
Unlike the Blocking Maze, this maze opens a shortcut to the goal state partway through the analysis. The figure on the lower left illustrates the two maze configurations used during the analysis.
The figure on the lower right compares the results of IntroRL with Sutton & Barto
The code used to generate the above figure is: Figure 8.5 Code
Example 8.4 Prioritized Sweeping¶
Prioritized Sweeping creates a priority queue of states-action pairs that should be examined in the planning portion of an agent’s search. The pseudo code from page 170 of Sutton & Barto is shown below.
PrioritySweepAgent implements the above pseudo code.
Note that updates to Q(s,a) are only performed in the planning section on (s,a) pairs taken from the priority queue.
This approach was shown by Peng and Williams (1993) to perform well on maze solutions by reducing the number of Q(s,a) updates required to reach a solution.
The above image shows the maze used in Example 8.4 to compare Dyna-Q to Prioritized Sweeping. The maze was scaled from the maze on the left, to ever-larger mazes by integer factors. The above two mazes are for scale factors of 1 and 4.
The two plots below show the results on page 170 of Sutton & Barto as well as the results generated by IntroRL.
The plot on the right includes all the results from Sutton & Barto, Shangtong Zhang and IntroRL.
The code used to generate the above figure is: Example 8.4 Code