Proximal Policy Optimization with Dual Network Architecture (PPO-DNA)
Overview
PPO-DNA is a more sample-efficient variant of PPO that uses separate optimizers and hyperparameters for the actor (policy) and critic (value) networks.
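As a rough illustration of what "separate hyperparameters" can mean in practice, the sketch below keeps one configuration object per network and counts how many gradient updates each network would receive from the same rollout batch. The names `PhaseConfig`, `DNAConfig`, and `count_updates`, and every numeric value, are hypothetical placeholders chosen for the example; they are not the defaults of `ppo_dna_atari_envpool.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseConfig:
    """Hyperparameters for updating one network (illustrative values only)."""
    learning_rate: float
    update_epochs: int
    minibatch_size: int


@dataclass(frozen=True)
class DNAConfig:
    """One independent hyperparameter set per network."""
    policy: PhaseConfig
    value: PhaseConfig


def count_updates(phase: PhaseConfig, batch_size: int) -> int:
    """Number of optimizer steps one rollout batch produces for this phase."""
    # Ceiling division: a final, smaller minibatch still counts as one step.
    minibatches_per_epoch = -(-batch_size // phase.minibatch_size)
    return phase.update_epochs * minibatches_per_epoch


# Hypothetical settings: the two networks differ in every field.
config = DNAConfig(
    policy=PhaseConfig(learning_rate=1e-4, update_epochs=2, minibatch_size=2048),
    value=PhaseConfig(learning_rate=2e-4, update_epochs=1, minibatch_size=512),
)

batch_size = 16384  # assumed rollout size, e.g. 128 envs x 128 steps
policy_steps = count_updates(config.policy, batch_size)
value_steps = count_updates(config.value, batch_size)
print(policy_steps, value_steps)
```

In a real training script each configuration would drive its own optimizer instance, so that changing the critic's learning rate or minibatch size never touches the actor's update.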
Original paper:
Implemented Variants
| Variants Implemented | Description |
| --- | --- |
| ppo_dna_atari_envpool.py, docs | Uses the blazing fast Envpool Atari vectorized environment. |
Below are our single-file implementations of PPO-DNA:
ppo_dna_atari_envpool.py
ppo_dna_atari_envpool.py has the following features:
- Uses the blazing fast Envpool vectorized environment.
- For Atari games. It uses convolutional layers and common Atari-based pre-processing techniques.
- Works with Atari's pixel Box observation space of shape (210, 160, 3).
- Works with the Discrete action space.
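To make the interaction between pre-processing and the convolutional layers concrete, the following sketch computes the spatial size of the feature map after each convolution. It assumes the three-layer "Nature CNN" layout (8x8 stride 4, 4x4 stride 2, 3x3 stride 1, no padding) and an 84x84 pre-processed frame, both of which are common choices for Atari agents but are assumptions here rather than facts taken from this page.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed layout: (out_channels, kernel, stride) for each conv layer.
NATURE_CNN = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]


def flattened_features(height: int, width: int, layers=NATURE_CNN) -> int:
    """Size of the flattened feature vector fed to the first linear layer."""
    channels = 0
    for out_channels, kernel, stride in layers:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
        channels = out_channels
    return channels * height * width


# Pre-processed 84x84 frames versus raw 210x160 frames.
print(flattened_features(84, 84))
print(flattened_features(210, 160))
```

Under these assumptions, downsampling to 84x84 shrinks the flattened feature vector by roughly a factor of seven compared with feeding raw 210x160 frames, which is one reason the resize step is so widely used.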
Warning
Note that ppo_dna_atari_envpool.py does not work on Windows or macOS. See envpool's built wheels here: https://pypi.org/project/envpool/#files
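A launcher script can fail early with a clear message instead of an import error. The helper below is a minimal sketch that treats only Linux as supported, which is an assumption inferred from the warning above; `envpool_platform_supported` is a hypothetical name, and the wheel list linked above remains the authoritative source.

```python
import sys
from typing import Optional


def envpool_platform_supported(platform: Optional[str] = None) -> bool:
    """Return True when the platform string looks like Linux.

    Assumption: Linux is the supported platform, since the warning above
    rules out Windows and macOS.
    """
    platform = sys.platform if platform is None else platform
    return platform.startswith("linux")


if not envpool_platform_supported():
    print(f"envpool is probably unavailable on {sys.platform!r}; "
          "check the published wheels before installing.")
```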
Usage
poetry install -E envpool
python cleanrl/ppo_dna_atari_envpool.py --help
python cleanrl/ppo_dna_atari_envpool.py --env-id Breakout-v5
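For sweeps over several games, the commands above can be generated programmatically. This sketch only builds and prints the command lines; it does not run them. The `--env-id` flag comes from the usage above, while `--seed` is assumed to be accepted by the script, so confirm it with `--help` before relying on it. `build_commands` is a hypothetical helper.

```python
import shlex

SCRIPT = "cleanrl/ppo_dna_atari_envpool.py"


def build_commands(env_ids, seeds):
    """Return one argument list per (environment, seed) pair."""
    commands = []
    for env_id in env_ids:
        for seed in seeds:
            # "--seed" is an assumed flag; verify it via `--help`.
            commands.append(
                ["python", SCRIPT, "--env-id", env_id, "--seed", str(seed)]
            )
    return commands


commands = build_commands(["Breakout-v5", "Pong-v5"], [1, 2])
for command in commands:
    print(shlex.join(command))
```

Each argument list can be passed directly to `subprocess.run` if the runs should be launched from Python.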
Explanation of the logged metrics
See related docs for ppo.py.
Implementation details
ppo_dna_atari_envpool.py uses a customized RecordEpisodeStatistics to work with envpool, but otherwise shares the same implementation details as ppo_atari.py (see related docs).
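The idea behind an episode-statistics wrapper for a batched environment can be sketched as follows: keep a running return and length per sub-environment, report them when that sub-environment finishes an episode, then reset its counters. This is a simplified, list-based illustration and not the script's actual wrapper; `EpisodeStatsTracker` is a hypothetical name, and the real implementation operates on the batched arrays that envpool returns.

```python
class EpisodeStatsTracker:
    """Tracks episodic return and length for each of `num_envs` environments."""

    def __init__(self, num_envs: int):
        self.returns = [0.0] * num_envs
        self.lengths = [0] * num_envs

    def step(self, rewards, dones):
        """Record one batched step.

        Returns a list of (env_index, episodic_return, episodic_length)
        tuples for the environments whose episode ended on this step.
        """
        finished = []
        for i, (reward, done) in enumerate(zip(rewards, dones)):
            self.returns[i] += reward
            self.lengths[i] += 1
            if done:
                finished.append((i, self.returns[i], self.lengths[i]))
                # Reset the counters so the next episode starts from zero.
                self.returns[i] = 0.0
                self.lengths[i] = 0
        return finished


tracker = EpisodeStatsTracker(num_envs=2)
first = tracker.step([1.0, 0.5], [False, False])
second = tracker.step([2.0, 0.5], [True, False])
third = tracker.step([1.0, 1.0], [False, True])
print(first, second, third)
```

Because vectorized environments typically reset a finished sub-environment automatically, the statistics have to be captured at the step where `done` is reported, before the counters are cleared.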
Experiment results
Below are the average episodic returns for ppo_dna_atari_envpool.py compared to ppo_atari_envpool.py.
| Environment | ppo_dna_atari_envpool.py | ppo_atari_envpool.py
| ----------- | ----------- | ----------- |
| BattleZone-v5 (40M steps) | 94800 ± 18300 | 28800 ± 6800
| BeamRider-v5 (10M steps) | 5470 ± 850 | 1990 ± 560
| Breakout-v5 (10M steps) | 321 ± 63 | 352 ± 52
| DoubleDunk-v5 (40M steps) | -4.9 ± 0.3 | -2.0 ± 0.8
| NameThisGame-v5 (40M steps) | 8500 ± 2600 | 4400 ± 1200
| Phoenix-v5 (45M steps) | 184000 ± 58000 | 10200 ± 2700
| Pong-v5 (3M steps) | 19.5 ± 1.1 | 16.6 ± 2.3
| Qbert-v5 (45M steps) | 12600 ± 4600 | 10800 ± 3300
| Tennis-v5 (10M steps) | 13.0 ± 2.3 | -12.4 ± 2.9
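The table can be summarized with a few lines of arithmetic. The sketch below copies the mean returns from the table and lists the environments where each implementation has the higher mean. It deliberately ignores the ± spreads, so close results such as Breakout-v5 and Qbert-v5 should not be read as significant differences.

```python
# (ppo_dna_atari_envpool.py mean, ppo_atari_envpool.py mean), from the table.
MEAN_RETURNS = {
    "BattleZone-v5": (94800, 28800),
    "BeamRider-v5": (5470, 1990),
    "Breakout-v5": (321, 352),
    "DoubleDunk-v5": (-4.9, -2.0),
    "NameThisGame-v5": (8500, 4400),
    "Phoenix-v5": (184000, 10200),
    "Pong-v5": (19.5, 16.6),
    "Qbert-v5": (12600, 10800),
    "Tennis-v5": (13.0, -12.4),
}

# Split environments by which implementation reports the higher mean return.
dna_higher = [env for env, (dna, ppo) in MEAN_RETURNS.items() if dna > ppo]
ppo_higher = [env for env, (dna, ppo) in MEAN_RETURNS.items() if ppo > dna]

print(f"PPO-DNA higher in {len(dna_higher)} of {len(MEAN_RETURNS)} environments")
print("PPO higher in:", ", ".join(ppo_higher))
```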
Learning curves:
Tracked experiments: