Trial without Error: Towards Safe Reinforcement Learning via Human Intervention
This work trains a supervised learner to imitate a human overseer's intervention decisions.
- https://arxiv.org/pdf/1707.05173.pdf
- Year: 2017
1. Introduction
The goal of HIRL (Human Intervention Reinforcement Learning) is to enable the agent to learn without a single catastrophe.
- This is achieved in simple environments; with more complex catastrophes, the number of catastrophes is significantly reduced, though not to zero.
- The authors compare this to giving the agent large negative rewards for catastrophic events without preventing them (reward shaping): that agent still causes catastrophes later in training, unlike with HIRL.
- A catastrophic action is one the human overseer deems unacceptable.
2. Specification of HIRL
- At each time-step, the human observes the current state $s$ and proposed action $a$. If $(s, a)$ is catastrophic, the human sends a safe action $a^\ast$ to the environment instead.
- The reward $r = R(s, a^\ast)$ that would normally be returned is replaced with a penalty $r^\ast$.
- During the oversight period, each $(s, a)$ pair is stored with a binary label indicating whether the human blocked it.
- This dataset is used to train a classifier (the "Blocker") by supervised learning (SL) to imitate the human's blocking decisions.
- Oversight lasts until the Blocker performs well (no precise threshold is given) on a held-out dataset, after which the Blocker replaces the human for the rest of training.
- Oversight is done in multiple phases because the $(s, a)$ distribution shifts during learning; the same shift occurs when the Blocker is transferred to another agent.
- This scheme works with any RL algorithm (a minimal sketch of the loop is given below).
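The loop above is straightforward to express in code. The following is a minimal sketch under stated assumptions, not the authors' implementation: `featurize`, the gym-style `env.step` interface, the `overseer` object (with hypothetical `is_catastrophic` and `safe_action` methods), and the fallback action used once the Blocker takes over are all illustrative choices the paper does not specify.

```python
# Minimal sketch of the HIRL protocol (not the authors' code).
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(state, action):
    # Assumption: states and actions can be flattened into one feature vector.
    return np.concatenate([np.ravel(state), np.atleast_1d(action).ravel()])

class HIRLWrapper:
    """Intercepts each proposed action before it reaches the environment,
    first via a human overseer, later via the trained Blocker."""

    def __init__(self, env, overseer, safe_fallback, penalty=-1.0):
        self.env = env
        self.overseer = overseer            # human, present only during oversight
        self.safe_fallback = safe_fallback  # assumed substitute once the human leaves
        self.penalty = penalty              # r*, returned instead of R(s, a*) on a block
        self.blocker = None                 # SL classifier, trained later
        self.dataset = []                   # (features(s, a), blocked?) pairs

    def step(self, state, action):
        x = featurize(state, action)
        if self.blocker is None:            # human oversight phase
            blocked = self.overseer.is_catastrophic(state, action)
            safe = self.overseer.safe_action(state) if blocked else action
            self.dataset.append((x, int(blocked)))  # label for the Blocker
        else:                               # Blocker has taken over
            blocked = bool(self.blocker.predict(x[None])[0])
            safe = self.safe_fallback if blocked else action
        next_state, reward, done, info = self.env.step(safe)
        if blocked:
            reward = self.penalty           # the agent never sees R(s, a*)
        return next_state, reward, done, info

    def train_blocker(self):
        """Fit the Blocker on the human's recorded blocking decisions."""
        X, y = (np.array(v) for v in zip(*self.dataset))
        self.blocker = LogisticRegression(max_iter=1000).fit(X, y)
```

A logistic regression stands in for the Blocker here to keep the sketch self-contained; the paper's Atari experiments use a convolutional network over frames, but any binary classifier fits the protocol.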
3. Discussion
- Oversight was done for about 4.5 hours on most Atari games, but this was insufficient for more complex tasks (e.g., Road Runner).
- HIRL cannot scale to complex tasks as-is, since the human labelling time required would be infeasible.
- The authors conclude with possible ways to address this, including:
- Data efficiency: reducing both the amount of data needed to train the Blocker and the amount the agent needs to learn to avoid unsafe actions.
- Seeking out catastrophes during oversight, i.e., maximizing the fraction of human labels spent on $(s, a)$ pairs that are actually unsafe.
- Active learning: the agent requests human feedback only on states it is unsure about, instead of requiring labels over the entire oversight period (sketched below).
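As a rough illustration of that last point, the Blocker's own predicted probability can decide when a human label is worth requesting. This is a hedged sketch, not from the paper: `predict_proba` follows scikit-learn's standard API, `featurize` and `overseer` are the hypothetical helpers from the earlier sketch, and the uncertainty band is an arbitrary illustrative choice.

```python
# Query the human only when the Blocker's confidence is low.
def maybe_query_human(blocker, overseer, state, action, band=(0.3, 0.7)):
    x = featurize(state, action)[None]        # reuses featurize from the sketch above
    p_block = blocker.predict_proba(x)[0, 1]  # P((s, a) is catastrophic)
    if band[0] < p_block < band[1]:           # unsure: spend a human label here
        return overseer.is_catastrophic(state, action)
    return p_block >= band[1]                 # confident: trust the Blocker
```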