Large-Scale Study of Curiosity-Driven Learning
Explores four methods for generating intrinsic rewards
- Goal
- (a) Intro
- (b) Feature spaces for Learning Dynamics-based curiosity
- (c) Experiments
- (d) Related Work ** Seems pretty explorable **
Goal
- Study curiosity driven learning without extrinsic rewards
- Show sufficiency of random features, and better generalization of learned features
- Demo limitation of prediction based rewards in stochastic setups
(a) Intro
Intrinsic rewards - rewards generated by the agent itself, e.g. curiosity (prediction error as reward signal) and visitation counts (discourage revisiting the same states).
Dynamics Based Curiosity
- In this paper, the intrinsic reward is the error in predicting the consequence of the agent's action given its current state, i.e., the prediction error of the agent's learned forward dynamics (see the sketch after this list).
- Issue - Predicting future states in the high-D raw observation space is challenging. To address this, the paper examines ways of encoding observations such that the embedding space is compact/lower-dimensional, carries sufficient information, and is a stationary function of the observations. See (b).
- Limitation - If the agent itself is a source of stochasticity in the environment (stochastic dynamics), it can reward itself without making any actual progress.
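A minimal sketch of the dynamics-based intrinsic reward described above, assuming a PyTorch setup; `ForwardDynamics`, `intrinsic_reward`, the embedding `phi`, and all layer sizes are illustrative placeholders rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardDynamics(nn.Module):
    """Predicts phi(s_{t+1}) from phi(s_t) and a_t; the prediction error is the intrinsic reward."""
    def __init__(self, feat_dim, num_actions, hidden=256):
        super().__init__()
        self.num_actions = num_actions
        self.net = nn.Sequential(
            nn.Linear(feat_dim + num_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, feat_t, action_t):
        a_onehot = F.one_hot(action_t, self.num_actions).float()
        return self.net(torch.cat([feat_t, a_onehot], dim=-1))

def intrinsic_reward(fwd_model, phi, obs_t, obs_tp1, action_t):
    """r_t = 0.5 * || f(phi(s_t), a_t) - phi(s_{t+1}) ||^2, computed without gradients."""
    with torch.no_grad():
        feat_t, feat_tp1 = phi(obs_t), phi(obs_tp1)
        pred = fwd_model(feat_t, action_t)
        return 0.5 * (pred - feat_tp1).pow(2).mean(dim=-1)
```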
Conclusion & Future Work
- A curiosity-like objective aligns with the hand-designed extrinsic reward in many games.
- Random features do well, but learned features appear to generalize better -> learning features may gain importance once the env is complex enough.
- Future work: transfer from unlabelled envs (no extrinsic reward function) to labelled envs.
(b) Feature spaces for Learning Dynamics-based curiosity
Good feature-space qualities:
- Stable: The forward-dynamics model evolves as it is trained, and the features change as they are learned; both are sources of non-stationarity in the reward.
- Compact: Lower-dimensional (Low-D); filters out irrelevant parts of the observation.
- Sufficient: Otherwise the agent may not be rewarded for exploring a relevant part of the env.
1. Pixels
$φ(x) = x$. Good: Sufficient (no info discarded), Stable (no learning component). Issue: Observation space is high-D and complex.
2. Random Features
- Take a CNN and fix it after random initialization. Good: Stable (since fixed), possibly compact. Issue: May be insufficient.
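A minimal sketch of a fixed random-feature encoder, assuming PyTorch and 84x84 observations with 4 stacked frames (typical Atari preprocessing); `make_random_encoder` and the layer sizes are illustrative, not taken from the paper.

```python
import torch.nn as nn

def make_random_encoder(feat_dim=512):
    """CNN embedding that is frozen right after random initialization (never trained)."""
    enc = nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.LeakyReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.LeakyReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.LeakyReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, feat_dim),  # 7x7 spatial map for an 84x84 input
    )
    for p in enc.parameters():
        p.requires_grad_(False)  # freeze: features stay stationary
    return enc
```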
3. Variational AutoEncoders (VAE)
- Latent variable $z$. Observed variable $x$. Aim to fit $p(x, z)$
- Use variational inference with an inference network $q(z|x)$ to approximate the posterior $p(z|x)$.
Good: Low-D, Sufficient. Issue: Features change as the VAE learns (non-stationary).
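A minimal sketch of using a VAE inference network as the feature map, assuming PyTorch; `VAEEncoder`, `vae_features`, and the latent/hidden sizes are illustrative placeholders (the reconstruction/KL training loop is omitted).

```python
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Inference network q(z|x); the posterior mean is used as the feature phi(x)."""
    def __init__(self, obs_dim, latent_dim=128, hidden=512):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.logvar(h)

def vae_features(encoder, x):
    mu, _ = encoder(x)  # take the mean of q(z|x) as phi(x)
    return mu           # phi changes as the VAE trains -> non-stationary reward
```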
4. Inverse Dynamics Features
- Given a transition $(s_t, s_{t+1}, a_t)$, predict the action $a_t$ given $s_t$ and $s_{t+1}$.
- Use a neural network $φ$ to first embed $s_t$ and $s_{t+1}$.
Good: Low-D. Issue: Unstable (features change as learning progresses) -> Fix: pre-train the VAE or IDF. Potentially insufficient (may not represent env parts the agent cannot affect).
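A minimal sketch of inverse-dynamics feature learning, assuming PyTorch and a discrete action space; `InverseDynamics`, `idf_loss`, the shared `encoder`, and the layer sizes are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseDynamics(nn.Module):
    """Learns the embedding phi by predicting a_t from (phi(s_t), phi(s_{t+1}))."""
    def __init__(self, encoder, feat_dim, num_actions, hidden=256):
        super().__init__()
        self.encoder = encoder  # shared phi, also reused by the forward-dynamics model
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, obs_t, obs_tp1):
        feats = torch.cat([self.encoder(obs_t), self.encoder(obs_tp1)], dim=-1)
        return self.head(feats)  # logits over the discrete action space

def idf_loss(model, obs_t, obs_tp1, action_t):
    """Cross-entropy between predicted and taken actions; its gradients shape phi."""
    return F.cross_entropy(model(obs_t, obs_tp1), action_t)
```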
Practical Considerations
- The authors' main goal is to reduce non-stationarity so learning is more stable. Their measures: Reward, Advantage, Observation, and Feature (batch-norm) normalization, many parallel actors, and PPO (stable, needs little tuning). A normalization sketch follows this list.
- No end-of-episode DONE signal (to separate the gains from the agent's exploration from those of the death signal).
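A minimal sketch of the reward-normalization step, assuming intrinsic rewards are divided by a running estimate of the standard deviation of the discounted returns; `RunningStd` and `normalize_rewards` are illustrative placeholders, not the authors' code.

```python
import numpy as np

class RunningStd:
    """Running mean/variance via Welford-style parallel updates."""
    def __init__(self, eps=1e-4):
        self.mean, self.var, self.count = 0.0, 1.0, eps

    def update(self, xs):
        b_mean, b_var, n = np.mean(xs), np.var(xs), len(xs)
        delta, tot = b_mean - self.mean, self.count + n
        self.mean += delta * n / tot
        self.var = (self.var * self.count + b_var * n +
                    delta ** 2 * self.count * n / tot) / tot
        self.count = tot

    @property
    def std(self):
        return np.sqrt(self.var)

def normalize_rewards(rewards, returns, ret_rms, gamma=0.99):
    """Divide intrinsic rewards (array over parallel envs) by a running std of discounted returns."""
    returns = gamma * returns + rewards   # per-environment discounted return estimate
    ret_rms.update(returns)
    return rewards / (ret_rms.std + 1e-8), returns
```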
(c) Experiments
Goal: Investigate i) the effect of running a curiosity-driven agent without any extrinsic rewards; ii) the behaviours to expect from such agents; iii) the effect of different feature-learning methods on these behaviours.
Atari
- Extrinsic reward is used only for evaluation. A curious agent can do worse than a random agent when the extrinsic reward has little correlation with the agent's exploration, or when the agent fails to explore efficiently.
Feature learning methods (In Atari)
- IDF does better than a random agent 75% of the time; random-CNN features, 70% of the time.
- IDF does better than random-CNN 55% of the time.
Generalization and Transfer
- The authors transfer an agent pre-trained on Mario level 1 to a new Mario level (2 or 3).
- IDF transferred in both cases, while Random Features (RF) transfer to level 2 but only weakly to level 3 (which is harder).
- Hence the suggestion of learned features generalizing better.
Curiosity + Sparse External Reward
Terminal reward: e.g. in maze navigation, the agent is rewarded only at the goal location. With the extrinsic reward alone the agent never finds the goal; intrinsic + extrinsic converges every time.
Sparse reward: extrinsic + intrinsic does better in 4/5 runs, combining rewards as extrinsic * 1.0 + intrinsic * 0.01 (see the sketch below). Future work: how to optimally combine intrinsic and extrinsic rewards.
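A trivial sketch of the reward combination noted above (extrinsic weight 1.0, intrinsic weight 0.01); `combined_reward` is an illustrative name.

```python
def combined_reward(r_ext, r_int, ext_coef=1.0, int_coef=0.01):
    """Weighted sum of the extrinsic and (normalized) intrinsic rewards."""
    return ext_coef * r_ext + int_coef * r_int
```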
(d) Related Work ** Seems pretty explorable **
Discussion of other approaches to intrinsic motivation that reward the agent based on prediction error, prediction uncertainty, or improvement of a dynamics model trained along with a policy. (Little work has been done with no extrinsic rewards at all.)
- State-visitation counts are also used for intrinsic rewards. It is unclear when they should be preferred over dynamics-based curiosity approaches (as in this work).
- Further listing of exploration methods used in combination with maximizing a reward function.