Learning from Human Preferences
Solving complex RL tasks without a reward function, while requesting human feedback (oversight) on less than 1% of the agent's interactions with the environment.
-
Year: 2017
1. Introduction
-
There’s a misalignment between our values and the objectives of our RL systems; being able to communicate our actual objectives to the agents would be a step towards fixing this.
-
Inverse RL is not directly applicable to behaviours that are difficult for humans to demonstrate.
-
The algorithm fits a reward function to human preferences while simultaneously training a policy to optimize the currently predicted reward. The human compares short video clips (trajectory segments) of agent behaviour and selects the one showing more desirable behaviour.
-
Authors say this is sufficient to learn most RL tasks with no observed reward function, using between 15 minutes and 5 hours of human feedback.
Conclusion: Human interaction is often more expensive than agent-environment interaction. By learning a separate reward function with supervised learning, the interaction complexity can be reduced by ~3 orders of magnitude. Authors say we are hitting diminishing returns on further sample-complexity improvements, since the compute cost (at the time of writing) is already comparable to the cost of the non-expert human labour providing the feedback.
2. Preliminaries
-
The goal of the agent is to produce trajectories which are preferred by the human while making as few queries to the human as possible.
-
Quantitatively, when the preferences are generated by a reward function $r$, the agent should achieve reward nearly as high as if it had optimized $r$ directly with RL. Where there’s no reward function, the evaluation is qualitative: how well is the human satisfied with the agent behaviour?
-
The clips shown are 1-2 seconds long. A human judgement is recorded as a triple $(\sigma^1, \sigma^2, \mu)$, where $\sigma^1$, $\sigma^2$ are the two trajectory segments and $\mu$ is the choice expressed as a distribution over the pair: equal mass if both are equally preferred, higher mass on one segment if it is preferred.
Reward function
- The reward estimate $\hat{r}$ is treated as a latent factor explaining the human judgements: the probability of preferring a clip is assumed to depend exponentially on that clip's total latent reward.
- $\hat{r}$ is fit to minimize the cross-entropy loss (CEL) between these predicted preference probabilities and the human labels.
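- Written out from the description above (the standard Bradley-Terry form), the predicted preference probability and the fitting loss are:

$$\hat{P}\left[\sigma^1 \succ \sigma^2\right] = \frac{\exp \sum_t \hat{r}(o^1_t, a^1_t)}{\exp \sum_t \hat{r}(o^1_t, a^1_t) + \exp \sum_t \hat{r}(o^2_t, a^2_t)}$$

$$\text{loss}(\hat{r}) = -\sum_{(\sigma^1, \sigma^2, \mu) \in \mathcal{D}} \mu(1) \log \hat{P}\left[\sigma^1 \succ \sigma^2\right] + \mu(2) \log \hat{P}\left[\sigma^2 \succ \sigma^1\right]$$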
-
In addition, the authors:
- Fit an ensemble of predictors; each predictor's output is normalized, and the ensemble is averaged.
- A fraction of $\mathcal{D}$ is held out as a validation set, and $\ell_2$ regularization is applied.
- Don’t apply the softmax directly; instead assume a small probability (10%) that the human responds uniformly at random, to account for human error (see the sketch below).
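A minimal sketch of that adjustment together with the cross-entropy fit, in plain NumPy (the function names and the `epsilon` parameter are my own; only the 10% random-response assumption comes from the paper):

```python
import numpy as np

def preference_prob(sum_r1, sum_r2, epsilon=0.1):
    """P[clip 1 preferred]: Bradley-Terry softmax over the summed predicted
    rewards of each clip, mixed with a uniform response to model human error."""
    p = 1.0 / (1.0 + np.exp(sum_r2 - sum_r1))  # stable two-way softmax
    return (1.0 - epsilon) * p + epsilon * 0.5

def preference_loss(sum_r1, sum_r2, mu):
    """Cross-entropy between the predicted preference and the human label mu = (mu1, mu2)."""
    p1 = preference_prob(sum_r1, sum_r2)
    return -(mu[0] * np.log(p1) + mu[1] * np.log(1.0 - p1))

# Example: clip 1 has the higher predicted return and the human preferred it.
print(preference_loss(sum_r1=3.2, sum_r2=1.1, mu=(1.0, 0.0)))
```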
-
Pairs of clips to show the human are selected by sampling many candidate pairs and querying those whose preference predictions have the highest variance across the ensemble, an approximation of uncertainty (sketch below).
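A rough NumPy sketch of that selection rule (the shape of `pair_probs` and the function name are illustrative, not from the paper's code):

```python
import numpy as np

def select_queries(pair_probs, n_queries):
    """pair_probs: shape (n_pairs, n_ensemble), each ensemble member's predicted
    P[clip 1 preferred] for each candidate pair of clips.
    Returns the indices of the pairs with the largest disagreement (variance)."""
    variance = pair_probs.var(axis=1)
    return np.argsort(variance)[-n_queries:]

# Example: 5 candidate pairs, 3 ensemble members; query the 2 most uncertain pairs.
probs = np.array([[0.9, 0.80, 0.85],
                  [0.2, 0.70, 0.50],
                  [0.5, 0.50, 0.50],
                  [0.1, 0.90, 0.40],
                  [0.6, 0.65, 0.70]])
print(select_queries(probs, n_queries=2))  # -> the two highest-variance pairs
```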
3. Experiment Results
-
The algorithm (nearly) matches true-reward RL performance on the gym tasks with 700 labels, and performs slightly better than the true reward with 1,400 labels.
-
Authors compare: RL on the true reward, human labels, and synthetic feedback using an oracle (3). I don't understand what is meant by (3).
-
Human feedback performs slightly worse than synthetic feedback on most games for an equal label count, possibly due to human labelling errors and inconsistency; it does better than synthetic feedback given ~40% fewer labels.
-
For Atari, 5,500 labels are used. The method fails on Qbert, where short clips are difficult to evaluate.
-
For continuous control tasks, predicting comparisons works better than regressing to scores, because variance in the scale of the rewards complicates the regression.
-
Training the reward predictor offline (rather than interleaved with policy training) only recovers a partial reward function and leads to undesirable behaviour, e.g. with offline training on Pong the agent learns to avoid losing points but not to score.