Solving complex RL tasks without a reward function, while providing human feedback (oversight) on less than 1% of the agent's interactions with the environment.

1. Introduction

  • There’s a misalignment between our values and the objectives of our RL systems; if we could communicate our actual objectives to the agents, that would be a step forward.

  • Inverse RL is not directly applicable to behaviours that are difficult for humans to demonstrate.

  • This algorithm fits a reward function to human preferences while training a policy to optimize the current predicted reward. The human compares short video clips (trajectory segments) of the agent's behaviour and selects the one showing the more desirable behaviour (see the training-loop sketch after this list).

  • Authors say this is sufficient to learn most RL tasks with no observed reward function, with roughly 15 minutes to 5 hours of human feedback.
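A minimal sketch of the interleaved training loop as I understand it. The helper names (`rollout_clips`, `ask_human`, `fit_reward_model`, `improve_policy`) are hypothetical stubs standing in for the real environment, human interface, reward network, and RL algorithm; this is not the authors' implementation.

```python
import random

def rollout_clips(policy, n_clips, clip_len=25):
    """Stub: collect short (obs, action) trajectory segments under the current policy."""
    return [[(random.random(), policy()) for _ in range(clip_len)] for _ in range(n_clips)]

def ask_human(clip_a, clip_b):
    """Stub: the human returns a preference distribution mu over the two clips."""
    return (1.0, 0.0) if random.random() < 0.5 else (0.0, 1.0)

def fit_reward_model(preferences):
    """Stub: supervised fit of r_hat to all preferences collected so far."""
    return lambda obs, act: random.random()         # placeholder predicted reward

def improve_policy(policy, clips, r_hat):
    """Stub: one update of any standard RL algorithm, using r_hat instead of a true reward."""
    return policy

policy = lambda: random.random()                     # placeholder stochastic policy
preferences = []
r_hat = fit_reward_model(preferences)

for iteration in range(10):
    clips = rollout_clips(policy, n_clips=8)         # 1. agent interacts with the environment
    sigma1, sigma2 = random.sample(clips, 2)         # 2. a small fraction of clip pairs
    preferences.append((sigma1, sigma2, ask_human(sigma1, sigma2)))  #    goes to the human
    r_hat = fit_reward_model(preferences)            # 3. refit the reward predictor
    policy = improve_policy(policy, clips, r_hat)    #    and keep optimizing the predicted reward
```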

Conclusion: Human interaction is often more expensive than agent-environment interaction. By learning a separate reward function with supervised learning, the human-interaction complexity can be reduced by ~3 orders of magnitude. The authors say we are hitting diminishing returns on further sample-complexity improvements, since the compute cost (at time of writing) is already comparable to the cost of the non-expert human labour needed to provide the feedback.

2. Preliminaries

  • The goal of the agent is to produce trajectories which are preferred by the human while making as few queries to the human as possible.

  • Quantitatively, when the preferences are generated by an underlying reward function $r$, the agent should achieve reward nearly as high as if it had optimized $r$ directly with RL. Where there’s no reward function, just evaluate qualitatively how satisfied the human is with the agent's behaviour.

  • The clips shown are 1-2 seconds long. The human judgement is recorded as a triple $(\sigma^1, \sigma^2, \mu)$, where $\sigma^1$, $\sigma^2$ are the two trajectory segments and $\mu$ is the choice expressed as a distribution over them, i.e., equal mass if they are equally preferred, or the mass concentrated on one if it is preferred.
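A minimal way to represent such a judgement in code (illustrative names and types; here a clip is stored as a list of (observation, action) steps):

```python
from typing import List, NamedTuple, Tuple

Step = Tuple[float, int]           # (observation, action); real types depend on the environment

class Preference(NamedTuple):
    sigma1: List[Step]             # first 1-2 second clip (trajectory segment)
    sigma2: List[Step]             # second clip
    mu: Tuple[float, float]        # preference distribution over the two clips

# Human strictly prefers clip 1:
p1 = Preference(sigma1=[(0.0, 1), (0.3, 0)], sigma2=[(0.1, 0), (0.2, 1)], mu=(1.0, 0.0))
# Clips judged equally preferable:
p2 = Preference(sigma1=p1.sigma1, sigma2=p1.sigma2, mu=(0.5, 0.5))
```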

Reward function

  • The reward estimate $\hat{r}$ is treated as a latent factor explaining the human judgements: the probability of preferring a clip is assumed to grow exponentially with the clip's total latent reward.
\[\hat{P}[\sigma^1 \succ \sigma^2] = \frac{\exp\left(\sum_t \hat{r}(o_t^1, a_t^1)\right)}{\exp\left(\sum_t \hat{r}(o_t^1, a_t^1)\right) + \exp\left(\sum_t \hat{r}(o_t^2, a_t^2)\right)}\]
  • $\hat{r}$ is fit to minimize the cross-entropy loss (CEL) between these predicted preference probabilities and the human labels (a code sketch follows at the end of this subsection).
\[\text{loss}(\hat{r}) = - \sum_{(\sigma^1, \sigma^2, \mu) \in \mathcal{D}} \mu(1)\log\hat{P}[\sigma^1 \succ \sigma^2] + \mu(2)\log\hat{P}[\sigma^2 \succ \sigma^1]\]
  • In addition, the authors:

    1. Fit an ensemble of predictors, each on its own sample of the data; normalize each predictor's output and average them.
    2. A fraction of $\mathcal{D}$ is held out as a validation set, and $\ell_2$ regularization is applied.
    3. Rather than applying the softmax directly, assume a small constant probability that the human responds uniformly at random, to account for human error.
  • Queries are selected by sampling many candidate pairs of clips and asking the human about those pairs whose predicted preference varies the most across the ensemble (sketches of the loss and of this selection rule follow below).
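A sketch of the preference model and cross-entropy loss above, assuming clips are lists of (observation, action) pairs and `r_hat` is any callable reward predictor. The human-error mixture `eps` corresponds to point 3; the value 0.1 is an assumption on my part.

```python
import numpy as np

def preference_prob(r_hat, clip1, clip2, eps=0.1):
    """P_hat[clip1 preferred over clip2]: softmax over summed predicted rewards,
    mixed with probability eps of a uniformly random human response."""
    s1 = sum(r_hat(o, a) for o, a in clip1)     # total predicted reward of clip 1
    s2 = sum(r_hat(o, a) for o, a in clip2)
    p = 1.0 / (1.0 + np.exp(s2 - s1))           # equals exp(s1) / (exp(s1) + exp(s2))
    return (1.0 - eps) * p + eps * 0.5          # eps = assumed human-error rate

def preference_loss(r_hat, dataset, eps=0.1):
    """Cross-entropy between predicted preference probabilities and the labels mu."""
    loss = 0.0
    for sigma1, sigma2, (mu1, mu2) in dataset:
        p12 = preference_prob(r_hat, sigma1, sigma2, eps)
        loss -= mu1 * np.log(p12) + mu2 * np.log(1.0 - p12)   # P[sigma2 > sigma1] = 1 - p12
    return loss

# Toy usage with a linear reward predictor:
r_hat = lambda o, a: 0.1 * o + 0.05 * a
data = [([(1.0, 1), (0.5, 0)], [(0.2, 0), (0.1, 1)], (1.0, 0.0))]
print(preference_loss(r_hat, data))
```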
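Reusing `preference_prob` from the sketch above, the variance-based query selection might look roughly like this; again a sketch, not the authors' exact implementation.

```python
import numpy as np

def select_queries(ensemble, candidate_pairs, n_queries, eps=0.1):
    """Pick the clip pairs whose predicted preference disagrees the most
    (highest variance) across the ensemble of reward predictors."""
    variances = []
    for sigma1, sigma2 in candidate_pairs:
        probs = [preference_prob(r_hat, sigma1, sigma2, eps) for r_hat in ensemble]
        variances.append(np.var(probs))
    ranked = np.argsort(variances)[::-1]        # most-disagreed-upon pairs first
    return [candidate_pairs[i] for i in ranked[:n_queries]]
```

The disagreement heuristic stands in for a full value-of-information computation, which would be more expensive to evaluate.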

3. Experimental Results

  • The algorithm (nearly) matches standard RL performance on the MuJoCo (Gym) tasks with 700 labels, and performs slightly better than optimizing the true reward with 1,400 labels.

  • Authors compare three settings: standard RL on the true reward, real human labels, and synthetic feedback from an oracle whose preferences are generated directly from the true reward (the oracle prefers whichever clip has the higher true return).

  • Human feedback performs slightly worse than synthetic feedback on most games for an equal label count; it does better than synthetic feedback with 40% fewer labels, possibly because of human labelling errors and inconsistency.

  • For Atari, 5,500 labels are used. The method fails on Qbert, where short clips are difficult to evaluate.

  • For continuous control tasks, predicting comparisons works better than regressing directly onto scores, because variance in the scale of rewards complicates the regression.

  • Training the reward predictor offline (on a fixed set of comparisons) only recovers a partial reward and leads to undesirable behaviour, e.g. with offline training on Pong, the agent learns to avoid losing points but not to score them.