
Tables 1, 2, and 3 document the experimental settings. The number of runs, training batches, and batches between evaluations are reported separately for hyperparameter search and definitive runs. The number of training batches is adapted to how soon each estimator appears to converge; note that this setting is very difficult to establish before hyperparameter search. The number of batches between evaluations is then adapted so that there are 100 evaluation steps in total.
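The evaluation cadence described above is a simple ratio: the interval between evaluations is the total number of training batches divided by the target number of evaluation steps. A minimal sketch (the helper name is illustrative, not from the paper):

```python
def eval_interval(training_batches: int, eval_steps: int = 100) -> int:
    """Batches between evaluations, targeting `eval_steps` evaluations per run."""
    return max(1, training_batches // eval_steps)

# Values consistent with Table 1 (Bit Flipping, 8 bits, definitive runs):
print(eval_interval(5000))  # 50
print(eval_interval(1400))  # 14
```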

Table 1:

Experimental Settings for the Bit Flipping Environments.

| Setting | 8 bits, batch size 2 | 8 bits, batch size 16 | 16 bits, batch size 2 | 16 bits, batch size 16 |
|---|---|---|---|---|
| Runs (definitive) | 20 | 20 | 20 | 20 |
| Training batches (definitive) | 5000 | 1400 | 15,000 | 1000 |
| Batches between evaluations (definitive) | 50 | 14 | 150 | 10 |
| Runs (search) | 10 | 10 | 10 | 10 |
| Training batches (search) | 4000 | 1400 | 4000 | 1000 |
| Batches between evaluations (search) | 40 | 14 | 40 | 10 |
| Policy learning rates | R1 | R1 | R1 | R1 |
| Baseline learning rates | R1 | R1 | R1 | R1 |
| Episodes per evaluation | 256 | 256 | 256 | 256 |
| Maximum active goals per episode | | | | |
Table 2:

Experimental Settings for the Grid World Environments.

| Setting | Empty Room, batch size 2 | Empty Room, batch size 16 | Four Rooms, batch size 2 | Four Rooms, batch size 16 |
|---|---|---|---|---|
| Runs (definitive) | 20 | 20 | 20 | 20 |
| Training batches (definitive) | 2200 | 200 | 10,000 | 1700 |
| Batches between evaluations (definitive) | 22 | 2 | 100 | 17 |
| Runs (search) | 10 | 10 | 10 | 10 |
| Training batches (search) | 2500 | 800 | 10,000 | 3500 |
| Batches between evaluations (search) | 25 | 8 | 100 | 35 |
| Policy learning rates | R1 | R1 | R1 | R1 |
| Baseline learning rates | R1 | R1 | R1 | R1 |
| Episodes per evaluation | 256 | 256 | 256 | 256 |
| Maximum active goals per episode | | | | |
Table 3:

Experimental Settings for the Ms. Pac-Man and FetchPush Environments.

| Setting | Ms. Pac-Man, batch size 2 | Ms. Pac-Man, batch size 16 | FetchPush, batch size 2 | FetchPush, batch size 16 |
|---|---|---|---|---|
| Runs (definitive) | 10 | 10 | 10 | 10 |
| Training batches (definitive) | 40,000 | 12,500 | 40,000 | 12,500 |
| Batches between evaluations (definitive) | 400 | 125 | 400 | 125 |
| Runs (search) | | | | |
| Training batches (search) | 40,000 | 12,000 | 40,000 | 15,000 |
| Batches between evaluations (search) | 800 | 120 | 800 | 300 |
| Policy learning rates | R2 | R2 | R2 | R2 |
| Episodes per evaluation | 240 | 240 | 512 | 512 |
| Maximum active goals per episode | | | | |
