Concrete Problems in AI Safety
Categorizes the causes of harmful and unintended behaviour in ML systems, and presents possible research approaches for each category
- https://arxiv.org/pdf/1606.06565.pdf
- Year: 2016
1. Introduction
This work focuses on accidents (unintended or harmful behaviour) in ML systems. Conclusion: presents five research problems and discusses approaches for each. Related work mentions:
- Work from futurist communities: the Future of Humanity Institute and the Machine Intelligence Research Institute
2. Problems Overview
The problems are seen as arising from:
- Ignoring some environment variables that might be harmful to change.
- An objective function that is easy to formalize and maximize but fails to capture the intent.
- Correcting the objective, e.g., through human intervention, being too expensive to do frequently (limited access to the objective).
- Exploring safely while making decisions from insufficient data or an insufficiently expressive model.
- Making decisions on inputs different from those seen in training.
The problems are divided into the categories:
- Avoiding Negative Side Effects: Undesirable consequences while meeting objective
- Avoiding Reward Hacking: Cheating the reward function. e.g., aim: to clean; instead the agent covers up dirt so it is not visible.
- Scalable Oversight: Respecting aspects of the objective that are too expensive to evaluate frequently, e.g., human feedback
- Safe Exploration: Don’t try out dangerous things
- Robustness to Distributional Shift: Transfer to new environments
Approaches I found interesting are marked in bold, from (b) onwards. At (a) I did that for every one before realising they would be too many to list out comprehensively, since this should be a summary.
a) Avoiding Negative Side-Effects
Formalizing “Perform task X” gives undesirable results. Frame it as “perform task X while observing common-sense constraints on the environment”.
Approaches:
- Define an Impact Regularizer: Penalize change to the environment (change the agent introduces, not change in the natural course of environment evolution). What is this change? A state distance d(s_i, s_0) between the present state s_i and some initial state s_0. However, penalizing this resists all sources of change, including natural environment evolution.
Or compare the future state under the current policy to the future state (or distribution over states) under a hypothetical policy π_null where the agent just did nothing (see the sketch after this list).
- Learn an Impact Regularizer: Learn the regularizer via transfer over many tasks, separating the side-effect component from the task itself. This parallels model-based RL, where the dynamics model is transferred rather than the value function.
- Penalize Influence: Discourage the agent from getting into positions where it could easily have large side effects. This influence can be measured through empowerment, i.e., the mutual information between the agent's actions and future states: a measure of how much control the agent has over the environment.
- Multi-agent approaches: a) Cooperative Inverse Reinforcement Learning: an agent and a human work together to achieve the human's goals. b) Reward autoencoder: an observer can infer what the agent is trying to do.
- Reward Uncertainty: Instead of a single reward function, use a prior probability distribution over rewards that reflects the property that random changes to the environment are more likely to be bad than good. Issue: choosing the baseline against which changes are measured.
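To make the two impact-regularizer baselines above concrete (distance to the initial state vs. distance to a counterfactual null-policy rollout), here is a minimal Python sketch. The toy drifting dynamics, the `step` function, and the Euclidean distance are my own illustrative assumptions, not the paper's formulation.

```python
# Minimal sketch of the two impact-regularizer baselines discussed above.
# The drifting toy dynamics and Euclidean distance are illustrative assumptions.
import numpy as np

DRIFT = np.array([0.0, 0.5])  # the environment evolves on its own each step

def step(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Toy deterministic dynamics: natural drift plus the agent's action."""
    return state + DRIFT + action

def impact_penalty_initial(state: np.ndarray, initial_state: np.ndarray) -> float:
    """Penalize the distance d(s_i, s_0) to the initial state.
    Problem: this also penalizes change the environment undergoes on its own."""
    return float(np.linalg.norm(state - initial_state))

def impact_penalty_null_policy(state: np.ndarray, initial_state: np.ndarray,
                               horizon: int) -> float:
    """Penalize the distance to the state reached under a hypothetical
    do-nothing policy pi_null, so natural drift is not penalized."""
    s_null = initial_state
    for _ in range(horizon):
        s_null = step(s_null, np.zeros_like(initial_state))
    return float(np.linalg.norm(state - s_null))

if __name__ == "__main__":
    s0 = np.array([0.0, 0.0])
    s1 = step(s0, np.array([1.0, 0.0]))           # the agent acts once
    print(impact_penalty_initial(s1, s0))          # ~1.12: natural drift is (wrongly) penalized too
    print(impact_penalty_null_policy(s1, s0, 1))   # 1.0: only the agent's own impact
```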
b) Avoiding Reward Hacking
The authors describe various ways agents can hack the reward function to get high reward without fulfilling the intended objective.
Approaches:
- Adversarial Reward Functions: Treat the reward function as its own agent that explores the environment, looking for scenarios the ML system claims are high reward but that a human labels as low reward (think GANs).
- Model Lookahead: As in model-based RL, consider what future states an action will lead to and give reward based on this anticipation rather than the present state.
- Adversarial Blinding / Cross-validation for agents: Hide how the reward is generated by hiding some parts (variables) of the environment from the agent.
- Careful Engineering: Formal verification and practical testing of system parts.
- Reward Capping: Capping the maximum reward can prevent extreme low-probability, high-payoff strategies.
- Counterexample Resistance: Abstract rewards vulnerable to adversarial counterexamples can draw on existing adversarial training methods.
- Multiple Rewards: e.g., different physical implementations of the same reward function, combined via the average, minimum, a quantile, etc. (see the sketch after this list).
- Reward Pretraining: Train the reward function in advance (inverse RL is mentioned). Drawback: the reward function can't be learned further after pretraining.
- Variable Indifference: Optimize some variables while remaining indifferent to others (the ones we don't want the agent to influence). Could have broad applicability in safety.
- Trip Wires: Deliberately introduce and monitor vulnerabilities to check whether the agent takes advantage of one (it shouldn't if the reward function is correct).
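As a toy illustration of two of these mitigations, reward capping and combining multiple reward implementations, here is a short Python sketch. The placeholder reward functions, the cap value, and the quantile level are assumptions for illustration only, not the paper's design.

```python
# Toy sketch of reward capping plus multiple reward implementations,
# aggregated conservatively so a single hacked implementation cannot dominate.
import numpy as np

def combined_reward(state: np.ndarray, reward_fns, mode: str = "min",
                    cap: float = 10.0) -> float:
    """Evaluate several independent reward implementations on the same state,
    aggregate them, and cap the result to rule out extreme
    low-probability / high-payoff strategies."""
    values = np.array([fn(state) for fn in reward_fns])
    if mode == "mean":
        aggregated = values.mean()
    elif mode == "min":
        aggregated = values.min()          # most conservative aggregation
    elif mode == "quantile":
        aggregated = np.quantile(values, 0.25)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return float(min(aggregated, cap))     # reward capping

if __name__ == "__main__":
    # Three "implementations" of the same intended reward; one is broken or
    # hacked and reports an absurdly high value.
    reward_fns = [lambda s: s.sum(), lambda s: 1.1 * s.sum(), lambda s: 1000.0]
    state = np.array([1.0, 2.0])
    print(combined_reward(state, reward_fns, mode="min"))    # 3.0: ignores the outlier
    print(combined_reward(state, reward_fns, mode="mean"))   # 10.0: mean is huge, but capped
```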
c) Scalable Oversight
There's mention of semi-supervised RL as a solution. Here, the agent sees its reward in only a fraction of the episodes (the rest are unlabelled) but must optimize reward across all episodes.
A baseline (ordinary RL) algorithm would learn only from the labelled episodes, and would thus be slow. The task is to use the unlabelled episodes to accelerate learning.
Semi-supervised RL needs to identify proxies that predict reward and learn when those proxies are valid. The authors also say it can incentivize communication and transparency from the agent when true but sparse approval is the source of reward.
Various possible approaches to semi-supervised RL are mentioned.
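Here is a minimal sketch of that setup, assuming a toy episode generator and a simple linear reward proxy (both my own inventions, not the paper's algorithm): the true reward is revealed on only a small fraction of episodes, and a proxy fitted on those fills in estimates for the rest.

```python
# Minimal sketch of the semi-supervised RL setup described above: reward is
# observed on a small fraction of episodes; a learned proxy estimates the rest.
import numpy as np

rng = np.random.default_rng(0)

def run_episode() -> tuple[np.ndarray, float]:
    """Toy episode: returns episode features and the true (possibly hidden) reward."""
    features = rng.normal(size=3)
    true_reward = 2.0 * features[0] - features[2] + 0.1 * rng.normal()
    return features, true_reward

labeled_X, labeled_y, estimated_rewards = [], [], []
label_fraction = 0.2  # reward is visible on only ~20% of episodes

for episode in range(200):
    feats, true_r = run_episode()
    if rng.random() < label_fraction:
        labeled_X.append(feats)
        labeled_y.append(true_r)      # reward is observed for this episode
    elif len(labeled_y) >= 5:
        # Fit a linear reward proxy on the labeled episodes and use it to
        # estimate the (hidden) reward for this unlabeled episode.
        w, *_ = np.linalg.lstsq(np.array(labeled_X), np.array(labeled_y), rcond=None)
        estimated_rewards.append(feats @ w)

print(f"labeled episodes: {len(labeled_y)}, proxy-estimated episodes: {len(estimated_rewards)}")
```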
Other approaches:
- Hierarchical RL: Since the top-level agent delegates actions to sub-agents, it doesn't need to know the details of how the policy is implemented and can work with sparse rewards.
- Distant Supervision: Provide information about the system's decisions in the aggregate, or as noisy hints about the correct evaluation. e.g., generalized expectation criteria, where the user provides population-level statistics such as “this sentence has exactly 1 noun” (see the sketch after this list).
- At the time of this work (2016), this hadn't been applied to agents.
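A small sketch of the generalized-expectation flavour of distant supervision mentioned above, under the assumption that the population-level statistic is a target positive rate and the penalty is a squared gap (both illustrative choices, not from the paper):

```python
# Sketch of distant supervision via a population-level statistic: the user
# gives an aggregate target rather than per-example labels, and the model is
# penalized for deviating from it in aggregate.
import numpy as np

def population_statistic_penalty(predicted_probs: np.ndarray,
                                 target_positive_rate: float) -> float:
    """Squared gap between the model's average predicted positive rate
    and the user-provided population-level statistic."""
    model_rate = predicted_probs.mean()
    return float((model_rate - target_positive_rate) ** 2)

# e.g. the user says "about 30% of sentences in this corpus are positive",
# without labeling any individual sentence.
predicted = np.array([0.9, 0.2, 0.1, 0.4, 0.05, 0.6])
print(population_statistic_penalty(predicted, target_positive_rate=0.30))  # ~0.0056
```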
d) Safe Exploration
This has been studied more deeply in the papers:
- Safe exploration techniques for reinforcement learning–an overview
- A Comprehensive Survey on Safe Reinforcement Learning.
Safety is framed as staying within a region of state space where actions are reversible.
Approaches
- Risk-sensitive performance criteria: Penalize bad performance or high variance in performance (see the sketch after this list). May incorporate uncertainty and off-policy evaluation.
- Use of demonstrations
- Simulated environments: Open question of how to reliably update the policy with simulated trajectories and represent their consequences as off-policy trajectories.
- Bounded Exploration: Define a set of safe states. With a model, predict whether an action will take the agent outside the safe region. See work on safe exploration in Markov Decision Processes.
- Human oversight: Check potentially unsafe actions with a human; this raises the expensive-feedback problem from scalable oversight.
- Trusted Policy Oversight: Limit exploration to actions the trusted policy believes we can recover from. Assumes we have a model.
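For the risk-sensitive criterion mentioned at the top of this list, a minimal sketch: rank policies by mean return minus a standard-deviation penalty. The risk weight and the toy return samples are assumptions for illustration.

```python
# Minimal sketch of a risk-sensitive performance criterion: mean return minus
# a penalty on the standard deviation of returns.
import numpy as np

def risk_sensitive_score(returns: np.ndarray, risk_weight: float = 1.0) -> float:
    """Mean return minus a variance-style penalty, so a policy with
    occasional catastrophic episodes scores lower than a reliable one."""
    return float(returns.mean() - risk_weight * returns.std())

if __name__ == "__main__":
    steady = np.array([1.0, 1.1, 0.9, 1.0])    # modest but reliable returns
    spiky  = np.array([3.0, 3.0, 3.0, -10.0])  # usually great, occasionally catastrophic
    print(risk_sensitive_score(steady))  # ~0.93
    print(risk_sensitive_score(spiky))   # ~-5.9: the catastrophe dominates the score
```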
e) Robustness to Distributional Shift
Change in testing distribution compared to training distribution.
Approaches
- Covariate shift and marginal likelihood:
  - The covariate shift assumption: though strong, it's untestable, which from a safety view is a problem.
  - Use of the marginal likelihood.
  - Well-specified models such as Neural Turing Machines; use of bootstrapping to estimate finite-sample variation in learned network parameters.
- Approaches not assuming covariate shift:
  - Use of generative models of the distribution.
- All the above approaches rely on well-specified models, i.e., model families containing the true distribution. This is pursued through expressive models like Turing machines, very large neural nets, or kernels.
- Partially specified models (make assumptions about only some aspects of the distribution):
  - Method of moments in the estimation of latent variable models
  - Modelling the distribution of errors of a model
- Training on multiple distributions
- How to respond when out of distribution: what should a model do when it realizes it's in a new distribution?
  - Ask for human feedback: it's unclear what question to ask (the authors mention work on pinpointing what the model is uncertain about), and there may be time constraints (reachability analysis and robust policy improvement are mentioned). A rough sketch of this uncertainty-based deferral idea follows below.
  - RL agents can gather more environment information for clarity.
  - Counterfactual reasoning and learning with contracts: the simplest contract is that the system performs well on both the training and test distributions, but this is difficult. Weaker contracts: partially specified models, reachability analysis.
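As a rough sketch of the uncertainty-based deferral idea (ask a human when the input looks out of distribution), assuming a small hand-picked ensemble of linear predictors and an arbitrary disagreement threshold, neither of which comes from the paper:

```python
# Rough sketch: use disagreement within a small ensemble as an uncertainty
# signal and defer to a human when it exceeds a threshold.
import numpy as np

# Toy "ensemble": three linear predictors that agree near familiar inputs
# but diverge on inputs unlike anything seen during training.
ensemble = [
    np.array([1.0, 0.5, -0.2]),
    np.array([0.9, 0.6, -0.1]),
    np.array([1.1, 0.4, -0.3]),
]

def predict_or_defer(x: np.ndarray, threshold: float = 0.5):
    """Return (prediction, action). Defer to a human when the ensemble disagrees."""
    preds = np.array([w @ x for w in ensemble])
    if preds.std() > threshold:   # high disagreement: the input is likely novel
        return None, "defer to human / gather more information"
    return float(preds.mean()), "act on model prediction"

in_dist = np.array([0.1, 0.0, 0.1])    # small-magnitude, familiar-looking input
ood     = np.array([10.0, -8.0, 9.0])  # far outside the familiar range
print(predict_or_defer(in_dist))  # acts on the prediction
print(predict_or_defer(ood))      # defers: ensemble disagreement exceeds the threshold
```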
Completed Sep 22, 2020