Solving complex RL tasks without a reward function, while providing human feedback (oversight) on less than 1% of the agent's interactions with the environment.

1. Introduction

  • There’s a misalignment between our values and the objectives of our RL systems; if we could communicate our actual objectives to the agents, that would be a step forward.

  • Inverse RL is not directly applicable to behaviours that are difficult for humans to demonstrate.

  • This algorithm fits a reward function to human preferences while training a policy to optimize the current predicted reward. The human compares short video clips (trajectory segments) of the agent's behaviour and selects the one showing the more desirable behaviour (see the training-loop sketch after this list).

  • Authors say this is sufficient to learn most RL tasks with no observed reward function, with roughly 15 minutes to 5 hours of human feedback.
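A minimal sketch of the interleaved training loop as I understand it. The helper names (`rollout_clips`, `ask_human`, `fit_reward_model`, `improve_policy`) are hypothetical stubs standing in for the real environment, human interface, reward network, and RL algorithm; this is not the authors' implementation.

```python
import random

def rollout_clips(policy, n_clips, clip_len=25):
    """Stub: collect short (obs, action) trajectory segments under the current policy."""
    return [[(random.random(), policy()) for _ in range(clip_len)] for _ in range(n_clips)]

def ask_human(clip_a, clip_b):
    """Stub: the human returns a preference distribution mu over the two clips."""
    return (1.0, 0.0) if random.random() < 0.5 else (0.0, 1.0)

def fit_reward_model(preferences):
    """Stub: supervised fit of r_hat to all preferences collected so far."""
    return lambda obs, act: random.random()         # placeholder predicted reward

def improve_policy(policy, clips, r_hat):
    """Stub: one update of any standard RL algorithm, using r_hat instead of a true reward."""
    return policy

policy = lambda: random.random()                     # placeholder stochastic policy
preferences = []
r_hat = fit_reward_model(preferences)

for iteration in range(10):
    clips = rollout_clips(policy, n_clips=8)         # 1. agent interacts with the environment
    sigma1, sigma2 = random.sample(clips, 2)         # 2. a small fraction of clip pairs
    preferences.append((sigma1, sigma2, ask_human(sigma1, sigma2)))  #    goes to the human
    r_hat = fit_reward_model(preferences)            # 3. refit the reward predictor
    policy = improve_policy(policy, clips, r_hat)    #    and keep optimizing the predicted reward
```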

Conclusion: Human interaction is often more expensive than agent-environment interaction. By learning a separate reward function with supervised learning, the human-interaction complexity can be reduced by ~3 orders of magnitude. The authors say we are hitting diminishing returns on further sample-complexity improvements, since the compute cost (at time of writing) is already comparable to the cost of the non-expert human labour needed to provide the feedback.

2. Preliminaries

  • The goal of the agent is to produce trajectories which are preferred by the human while making as few queries to the human as possible.

  • Quantitatively, when the preferences are generated by an underlying reward function $r$, the agent should achieve reward nearly as high as if it had optimized $r$ directly with RL. Where there’s no reward function, just evaluate qualitatively how satisfied the human is with the agent's behaviour.

  • The clips shown are 1-2 seconds long. The human judgement is recorded as a triple $(\sigma^1, \sigma^2, \mu)$, where $\sigma^1$, $\sigma^2$ are the two trajectory segments and $\mu$ is the choice expressed as a distribution over them, i.e., equal mass if they are equally preferred, or the mass concentrated on one if it is preferred.
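A minimal way to represent such a judgement in code (illustrative names and types; here a clip is stored as a list of (observation, action) steps):

```python
from typing import List, NamedTuple, Tuple

Step = Tuple[float, int]           # (observation, action); real types depend on the environment

class Preference(NamedTuple):
    sigma1: List[Step]             # first 1-2 second clip (trajectory segment)
    sigma2: List[Step]             # second clip
    mu: Tuple[float, float]        # preference distribution over the two clips

# Human strictly prefers clip 1:
p1 = Preference(sigma1=[(0.0, 1), (0.3, 0)], sigma2=[(0.1, 0), (0.2, 1)], mu=(1.0, 0.0))
# Clips judged equally preferable:
p2 = Preference(sigma1=p1.sigma1, sigma2=p1.sigma2, mu=(0.5, 0.5))
```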

Reward function

  • The reward estimate $\hat{r}$ is treated as a latent factor explaining the human judgements: the probability of preferring a clip is assumed to grow exponentially with the clip's total latent reward.
\[\hat{P}[\sigma^1 \succ \sigma^2] = \frac{\exp\left(\sum_t \hat{r}(o_t^1, a_t^1)\right)}{\exp\left(\sum_t \hat{r}(o_t^1, a_t^1)\right) + \exp\left(\sum_t \hat{r}(o_t^2, a_t^2)\right)}\]
  • $\hat{r}$ is fit to minimize the cross-entropy loss (CEL) between these predicted preference probabilities and the human labels (a code sketch follows at the end of this subsection).
\[\text{loss}(\hat{r}) = - \sum_{(\sigma^1, \sigma^2, \mu) \in \mathcal{D}} \mu(1)\log\hat{P}[\sigma^1 \succ \sigma^2] + \mu(2)\log\hat{P}[\sigma^2 \succ \sigma^1]\]
  • In addition, the authors:

    1. Fit an ensemble of predictors, each on its own sample of the data; normalize each predictor's output and average them.
    2. A fraction of $\mathcal{D}$ is held out as a validation set, and $\ell_2$ regularization is applied.
    3. Rather than applying the softmax directly, assume a small constant probability that the human responds uniformly at random, to account for human error.
  • Queries are selected by sampling many candidate pairs of clips and asking the human about those pairs whose predicted preference varies the most across the ensemble (sketches of the loss and of this selection rule follow below).
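A sketch of the preference model and cross-entropy loss above, assuming clips are lists of (observation, action) pairs and `r_hat` is any callable reward predictor. The human-error mixture `eps` corresponds to point 3; the value 0.1 is an assumption on my part.

```python
import numpy as np

def preference_prob(r_hat, clip1, clip2, eps=0.1):
    """P_hat[clip1 preferred over clip2]: softmax over summed predicted rewards,
    mixed with probability eps of a uniformly random human response."""
    s1 = sum(r_hat(o, a) for o, a in clip1)     # total predicted reward of clip 1
    s2 = sum(r_hat(o, a) for o, a in clip2)
    p = 1.0 / (1.0 + np.exp(s2 - s1))           # equals exp(s1) / (exp(s1) + exp(s2))
    return (1.0 - eps) * p + eps * 0.5          # eps = assumed human-error rate

def preference_loss(r_hat, dataset, eps=0.1):
    """Cross-entropy between predicted preference probabilities and the labels mu."""
    loss = 0.0
    for sigma1, sigma2, (mu1, mu2) in dataset:
        p12 = preference_prob(r_hat, sigma1, sigma2, eps)
        loss -= mu1 * np.log(p12) + mu2 * np.log(1.0 - p12)   # P[sigma2 > sigma1] = 1 - p12
    return loss

# Toy usage with a linear reward predictor:
r_hat = lambda o, a: 0.1 * o + 0.05 * a
data = [([(1.0, 1), (0.5, 0)], [(0.2, 0), (0.1, 1)], (1.0, 0.0))]
print(preference_loss(r_hat, data))
```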
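Reusing `preference_prob` from the sketch above, the variance-based query selection might look roughly like this; again a sketch, not the authors' exact implementation.

```python
import numpy as np

def select_queries(ensemble, candidate_pairs, n_queries, eps=0.1):
    """Pick the clip pairs whose predicted preference disagrees the most
    (highest variance) across the ensemble of reward predictors."""
    variances = []
    for sigma1, sigma2 in candidate_pairs:
        probs = [preference_prob(r_hat, sigma1, sigma2, eps) for r_hat in ensemble]
        variances.append(np.var(probs))
    ranked = np.argsort(variances)[::-1]        # most-disagreed-upon pairs first
    return [candidate_pairs[i] for i in ranked[:n_queries]]
```

The disagreement heuristic stands in for a full value-of-information computation, which would be more expensive to evaluate.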

3. Experimental Results

  • The algorithm (nearly) matches standard RL performance on the MuJoCo (Gym) tasks with 700 labels, and performs slightly better than optimizing the true reward with 1,400 labels.

  • Authors compare three settings: standard RL on the true reward, real human labels, and synthetic feedback from an oracle whose preferences are generated directly from the true reward (the oracle prefers whichever clip has the higher true return).

  • Human feedback performs slightly worse than synthetic feedback on most games for an equal label count; it does better than synthetic feedback with 40% fewer labels, possibly because of human labelling errors and inconsistency.

  • For Atari, 5,500 labels are used. The method fails on Qbert, where short clips are difficult to evaluate.

  • For continuous control tasks, predicting comparisons works better than regressing directly onto scores, because variance in the scale of rewards complicates the regression.

  • Training the reward predictor offline (on a fixed set of comparisons) only recovers a partial reward and leads to undesirable behaviour, e.g. with offline training on Pong, the agent learns to avoid losing points but not to score them.