Concrete Problems in AI Safety
Categorizes the causes of harmful and unintended behaviour in ML systems, and presents possible research approaches for each category
- https://arxiv.org/pdf/1606.06565.pdf
- Year: 2016
1. Introduction
This work focuses on accidents (unintended or harmful behaviour) in ML systems. Conclusion: presents five research problems and discusses approaches for each. Related work mentions:
- Work from futurist communities: the Future of Humanity Institute and the Machine Intelligence Research Institute
2. Problems Overview
The problems are seen as arising from:
- Ignoring some environment variables that might be harmful to change.
- An objective function that is easy to formalize and maximize but fails to capture the intent.
- Correcting the objective, e.g., through human intervention, being too expensive to do frequently (limited access to the objective).
- Exploring safely while making decisions from insufficient data or an insufficiently expressive model.
- Making decisions on inputs different from those seen in training.
The problems are divided into the categories:
- Avoiding Negative Side Effects: Undesirable consequences while meeting objective
- Avoiding Reward Hacking: Cheating the reward function. e.g., aim: to clean; instead the agent covers up dirt so it is not visible.
- Scalable Oversight: Respecting aspects of the objective that are too expensive to evaluate frequently, e.g., human feedback
- Safe Exploration: Don’t try out dangerous things
- Robustness to Distributional Shift: Transfer to new environments
Approaches I found interesting are marked in bold, from (b) onwards. At (a) I did that for every one before realising they would be too many to list out comprehensively, since this should be a summary.
a) Avoiding Negative Side-Effects
Formalizing “Perform task X” gives undesirable results. Frame it as “perform task X while observing common-sense constraints on the environment”.
Approaches:
- Define an Impact Regularizer: Penalize change to the environment (change the agent introduces, not change in the natural course of environment evolution). What is this change? A state distance d(s_i, s_0) between the present state s_i and some initial state s_0. However, penalizing this resists all sources of change, including natural environment evolution.
Or compare the future state under the current policy to the future state (or distribution over states) under a hypothetical policy π_null where the agent just did nothing (see the sketch after this list).
- Learn an Impact Regularizer: Learn the regularizer via transfer over many tasks, separating the side-effect component from the task itself. This parallels model-based RL, where the dynamics model is transferred rather than the value function.
- Penalize Influence: Discourage the agent from getting into positions where it could easily have large side effects. This influence can be measured through empowerment, i.e., the mutual information between the agent's actions and future states: a measure of how much control the agent has over the environment.
- Multi-agent approaches: a) Cooperative Inverse Reinforcement Learning: an agent and a human work together to achieve the human's goals. b) Reward autoencoder: an observer can infer what the agent is trying to do.
- Reward Uncertainty: Instead of a single reward function, use a prior probability distribution over rewards that reflects the property that random changes to the environment are more likely to be bad than good. Issue: choosing the baseline against which changes are measured.
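To make the two impact-regularizer baselines above concrete (distance to the initial state vs. distance to a counterfactual null-policy rollout), here is a minimal Python sketch. The toy drifting dynamics, the `step` function, and the Euclidean distance are my own illustrative assumptions, not the paper's formulation.

```python
# Minimal sketch of the two impact-regularizer baselines discussed above.
# The drifting toy dynamics and Euclidean distance are illustrative assumptions.
import numpy as np

DRIFT = np.array([0.0, 0.5])  # the environment evolves on its own each step

def step(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Toy deterministic dynamics: natural drift plus the agent's action."""
    return state + DRIFT + action

def impact_penalty_initial(state: np.ndarray, initial_state: np.ndarray) -> float:
    """Penalize the distance d(s_i, s_0) to the initial state.
    Problem: this also penalizes change the environment undergoes on its own."""
    return float(np.linalg.norm(state - initial_state))

def impact_penalty_null_policy(state: np.ndarray, initial_state: np.ndarray,
                               horizon: int) -> float:
    """Penalize the distance to the state reached under a hypothetical
    do-nothing policy pi_null, so natural drift is not penalized."""
    s_null = initial_state
    for _ in range(horizon):
        s_null = step(s_null, np.zeros_like(initial_state))
    return float(np.linalg.norm(state - s_null))

if __name__ == "__main__":
    s0 = np.array([0.0, 0.0])
    s1 = step(s0, np.array([1.0, 0.0]))           # the agent acts once
    print(impact_penalty_initial(s1, s0))          # ~1.12: natural drift is (wrongly) penalized too
    print(impact_penalty_null_policy(s1, s0, 1))   # 1.0: only the agent's own impact
```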
b) Avoiding Reward Hacking
The authors describe various ways agents can hack the reward function to get high reward without fulfilling the intended objective.
Approaches:
- Adversarial Reward Functions: Treat the reward function as its own agent that explores the environment, looking for scenarios the ML system claims are high reward but that a human labels as low reward (think GANs).
- Model Lookahead: As in model-based RL, consider what future states an action will lead to and give reward based on this anticipation rather than the present state.
- Adversarial Blinding / Cross-validation for agents: Hide how the reward is generated by hiding some parts (variables) of the environment from the agent.
- Careful Engineering: Formal verification and practical testing of system parts.
- Reward Capping: Capping the maximum reward can prevent extreme low-probability, high-payoff strategies.
- Counterexample Resistance: Abstract rewards vulnerable to adversarial counterexamples can draw on existing adversarial training methods.
- Multiple Rewards: e.g., different physical implementations of the same reward function, combined via the average, minimum, a quantile, etc. (see the sketch after this list).
- Reward Pretraining: Train the reward function in advance (inverse RL is mentioned). Drawback: the reward function can't be learned further after pretraining.
- Variable Indifference: Optimize some variables while remaining indifferent to others (the ones we don't want the agent to influence). Could have broad applicability in safety.
- Trip Wires: Deliberately introduce and monitor vulnerabilities to check whether the agent takes advantage of one (it shouldn't if the reward function is correct).
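As a toy illustration of two of these mitigations, reward capping and combining multiple reward implementations, here is a short Python sketch. The placeholder reward functions, the cap value, and the quantile level are assumptions for illustration only, not the paper's design.

```python
# Toy sketch of reward capping plus multiple reward implementations,
# aggregated conservatively so a single hacked implementation cannot dominate.
import numpy as np

def combined_reward(state: np.ndarray, reward_fns, mode: str = "min",
                    cap: float = 10.0) -> float:
    """Evaluate several independent reward implementations on the same state,
    aggregate them, and cap the result to rule out extreme
    low-probability / high-payoff strategies."""
    values = np.array([fn(state) for fn in reward_fns])
    if mode == "mean":
        aggregated = values.mean()
    elif mode == "min":
        aggregated = values.min()          # most conservative aggregation
    elif mode == "quantile":
        aggregated = np.quantile(values, 0.25)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return float(min(aggregated, cap))     # reward capping

if __name__ == "__main__":
    # Three "implementations" of the same intended reward; one is broken or
    # hacked and reports an absurdly high value.
    reward_fns = [lambda s: s.sum(), lambda s: 1.1 * s.sum(), lambda s: 1000.0]
    state = np.array([1.0, 2.0])
    print(combined_reward(state, reward_fns, mode="min"))    # 3.0: ignores the outlier
    print(combined_reward(state, reward_fns, mode="mean"))   # 10.0: mean is huge, but capped
```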
c) Scalable Oversight
There's mention of semi-supervised RL as a solution. Here, the agent sees its reward in only a fraction of the episodes (the rest are unlabelled) but must optimize reward across all episodes.
A baseline (ordinary RL) algorithm would learn only from the labelled episodes, and would thus be slow. The task is to use the unlabelled episodes to accelerate learning.
Semi-supervised RL needs to identify proxies that predict reward and learn when those proxies are valid. The authors also say it can incentivize communication and transparency from the agent when true but sparse approval is the source of reward.
Various possible approaches to semi-supervised RL are mentioned.
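Here is a minimal sketch of that setup, assuming a toy episode generator and a simple linear reward proxy (both my own inventions, not the paper's algorithm): the true reward is revealed on only a small fraction of episodes, and a proxy fitted on those fills in estimates for the rest.

```python
# Minimal sketch of the semi-supervised RL setup described above: reward is
# observed on a small fraction of episodes; a learned proxy estimates the rest.
import numpy as np

rng = np.random.default_rng(0)

def run_episode() -> tuple[np.ndarray, float]:
    """Toy episode: returns episode features and the true (possibly hidden) reward."""
    features = rng.normal(size=3)
    true_reward = 2.0 * features[0] - features[2] + 0.1 * rng.normal()
    return features, true_reward

labeled_X, labeled_y, estimated_rewards = [], [], []
label_fraction = 0.2  # reward is visible on only ~20% of episodes

for episode in range(200):
    feats, true_r = run_episode()
    if rng.random() < label_fraction:
        labeled_X.append(feats)
        labeled_y.append(true_r)      # reward is observed for this episode
    elif len(labeled_y) >= 5:
        # Fit a linear reward proxy on the labeled episodes and use it to
        # estimate the (hidden) reward for this unlabeled episode.
        w, *_ = np.linalg.lstsq(np.array(labeled_X), np.array(labeled_y), rcond=None)
        estimated_rewards.append(feats @ w)

print(f"labeled episodes: {len(labeled_y)}, proxy-estimated episodes: {len(estimated_rewards)}")
```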
Other approaches:
- Hierarchical RL: Since the top-level agent delegates actions to sub-agents, it doesn't need to know the details of how the policy is implemented and can work with sparse rewards.
- Distant Supervision: Provide information about the system's decisions in the aggregate, or as noisy hints about the correct evaluation. e.g., generalized expectation criteria, where the user provides population-level statistics such as “this sentence has exactly 1 noun” (see the sketch after this list).
- At the time of this work (2016), this hadn't been applied to agents.
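A small sketch of the generalized-expectation flavour of distant supervision mentioned above, under the assumption that the population-level statistic is a target positive rate and the penalty is a squared gap (both illustrative choices, not from the paper):

```python
# Sketch of distant supervision via a population-level statistic: the user
# gives an aggregate target rather than per-example labels, and the model is
# penalized for deviating from it in aggregate.
import numpy as np

def population_statistic_penalty(predicted_probs: np.ndarray,
                                 target_positive_rate: float) -> float:
    """Squared gap between the model's average predicted positive rate
    and the user-provided population-level statistic."""
    model_rate = predicted_probs.mean()
    return float((model_rate - target_positive_rate) ** 2)

# e.g. the user says "about 30% of sentences in this corpus are positive",
# without labeling any individual sentence.
predicted = np.array([0.9, 0.2, 0.1, 0.4, 0.05, 0.6])
print(population_statistic_penalty(predicted, target_positive_rate=0.30))  # ~0.0056
```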
d) Safe Exploration
This has been studied more deeply in the papers:
- Safe exploration techniques for reinforcement learning–an overview
- A Comprehensive Survey on Safe Reinforcement Learning.
Safety is framed as staying within a region of state space where actions are reversible.
Approaches
- Risk-sensitive performance criteria: Penalize bad performance or high variance in performance (see the sketch after this list). May incorporate uncertainty and off-policy evaluation.
- Use of demonstrations
- Simulated environments: Open question of how to reliably update the policy with simulated trajectories and represent their consequences as off-policy trajectories.
- Bounded Exploration: Define a set of safe states. With a model, predict whether an action will take the agent outside the safe region. See work on safe exploration in Markov Decision Processes.
- Human oversight: Check potentially unsafe actions with a human; this raises the expensive-feedback problem from scalable oversight.
- Trusted Policy Oversight: Limit exploration to actions the trusted policy believes we can recover from. Assumes we have a model.
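For the risk-sensitive criterion mentioned at the top of this list, a minimal sketch: rank policies by mean return minus a standard-deviation penalty. The risk weight and the toy return samples are assumptions for illustration.

```python
# Minimal sketch of a risk-sensitive performance criterion: mean return minus
# a penalty on the standard deviation of returns.
import numpy as np

def risk_sensitive_score(returns: np.ndarray, risk_weight: float = 1.0) -> float:
    """Mean return minus a variance-style penalty, so a policy with
    occasional catastrophic episodes scores lower than a reliable one."""
    return float(returns.mean() - risk_weight * returns.std())

if __name__ == "__main__":
    steady = np.array([1.0, 1.1, 0.9, 1.0])    # modest but reliable returns
    spiky  = np.array([3.0, 3.0, 3.0, -10.0])  # usually great, occasionally catastrophic
    print(risk_sensitive_score(steady))  # ~0.93
    print(risk_sensitive_score(spiky))   # ~-5.9: the catastrophe dominates the score
```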
e) Robustness to Distributional Shift
Change in testing distribution compared to training distribution.
Approaches
- Covariate shift and marginal likelihood:
  - The covariate shift assumption: though strong, it's untestable, which from a safety view is a problem.
  - Use of the marginal likelihood.
  - Well-specified models such as Neural Turing Machines; use of bootstrapping to estimate finite-sample variation in learned network parameters.
- Approaches not assuming covariate shift:
  - Use of generative models of the distribution.
- All the above approaches rely on well-specified models, i.e., model families containing the true distribution. This is pursued through expressive models like Turing machines, very large neural nets, or kernels.
- Partially specified models (make assumptions about only some aspects of the distribution):
  - Method of moments in the estimation of latent variable models
  - Modelling the distribution of errors of a model
- Training on multiple distributions
- How to respond when out of distribution: what should a model do when it realizes it's in a new distribution?
  - Ask for human feedback: it's unclear what question to ask (the authors mention work on pinpointing what the model is uncertain about), and there may be time constraints (reachability analysis and robust policy improvement are mentioned). A rough sketch of this uncertainty-based deferral idea follows below.
  - RL agents can gather more environment information for clarity.
  - Counterfactual reasoning and learning with contracts: the simplest contract is that the system performs well on both the training and test distributions, but this is difficult. Weaker contracts: partially specified models, reachability analysis.
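As a rough sketch of the uncertainty-based deferral idea (ask a human when the input looks out of distribution), assuming a small hand-picked ensemble of linear predictors and an arbitrary disagreement threshold, neither of which comes from the paper:

```python
# Rough sketch: use disagreement within a small ensemble as an uncertainty
# signal and defer to a human when it exceeds a threshold.
import numpy as np

# Toy "ensemble": three linear predictors that agree near familiar inputs
# but diverge on inputs unlike anything seen during training.
ensemble = [
    np.array([1.0, 0.5, -0.2]),
    np.array([0.9, 0.6, -0.1]),
    np.array([1.1, 0.4, -0.3]),
]

def predict_or_defer(x: np.ndarray, threshold: float = 0.5):
    """Return (prediction, action). Defer to a human when the ensemble disagrees."""
    preds = np.array([w @ x for w in ensemble])
    if preds.std() > threshold:   # high disagreement: the input is likely novel
        return None, "defer to human / gather more information"
    return float(preds.mean()), "act on model prediction"

in_dist = np.array([0.1, 0.0, 0.1])    # small-magnitude, familiar-looking input
ood     = np.array([10.0, -8.0, 9.0])  # far outside the familiar range
print(predict_or_defer(in_dist))  # acts on the prediction
print(predict_or_defer(ood))      # defers: ensemble disagreement exceeds the threshold
```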
Completed Sep 22, 2020