A Lyapunov-based Approach to Safe Reinforcement Learning
This safety approach operates within a Constrained Markov Decision Process (CMDP). It is a method for constructing Lyapunov functions that provides a guarantee of global safety for the behaviour policy during training via a set of local linear constraints.
- 1. Introduction
- 2. Preliminaries
- 3. A Lyapunov Approach to Solve CMDPs
- 4. Safe RL using Lyapunov functions
- 5. Lyapunov-based Safe RL Algorithms
- 6. Experiments
- Year: 2018
- https://arxiv.org/pdf/1805.07708
1. Introduction
- The Lagrangian method, as a way of solving CMDPs, runs into a saddle-point problem when learning the Lagrange multiplier and doesn't guarantee behaviour-policy safety at each training iteration (a toy primal-dual sketch follows this list).
- An improvement to this, using a stepwise surrogate constraint that depends only on the current (s, a) pair, is conservative and has suboptimal performance.
- Authors say analysing convergence of constrained policy optimization (CPO) is challenging, and applying it to other RL algorithms (besides TRPO) is non-trivial.
- Instead of a handcrafted Lyapunov candidate, this work uses an LP-based algorithm to construct Lyapunov functions w.r.t. generic CMDP constraints.
- Algorithms presented:
- Safe Policy Iteration (SPI)
- Safe Value Iteration (SVI)
- Safe DQN & safe DPI (for large s/a spaces - scalable)
Conclusion:
- Derives an LP-based method to generate Lyapunov functions guaranteeing feasibility and optimality in the algorithm.
- The Lyapunov approach guarantees safe and robust learning in RL
- Future work: (i) apply this to policy gradient (PG) methods and compare to CPO on continuous action spaces; (ii) test in the real world
2. Preliminaries
- A constrained MDP adds these elements to the unconstrained MDP: $d(x) \in [0, D_{\max}]$, an immediate constraint cost; and $d_0 \in \mathbb{R}_{\geq 0}$, an upper bound on the expected cumulative (through time) constraint cost.
- The CMDP $OPT$ problem is formulated: given an initial state $x_0$ and a threshold $d_0$, solve
\[\min_{\pi \in \Delta} \; C_\pi(x_0) \quad \text{s.t.} \quad D_\pi(x_0) \le d_0.\]
If the feasible set is non-empty, the optimal policy is denoted by $\pi^*$.
- $D_\pi(x_0) \le d_0$ is the safety constraint, with $D_\pi$ given as
\[D_\pi(x_0) := \mathbb{E}\Big[\sum_{t=0}^{T^*-1} d(x_t) \,\Big|\, x_0, \pi\Big],\]
where $T^*$ is the first hitting time of a terminal state under $\pi$ (a Monte Carlo sketch of $D_\pi$ follows this list).
- This paper has plenty of functions new to me; I won't list them all. Refer to the paper for the full definitions and proofs.
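A small sketch of these quantities on a toy tabular CMDP of my own (state 4 terminal, random transitions; all names are assumptions): it estimates $D_\pi(x_0)$ by Monte Carlo rollouts truncated at the first hitting time $T^*$.

```python
# Minimal sketch (env and names are mine, not the paper's) of the quantities in OPT:
# a tabular CMDP with constraint cost d(x), threshold d0, and a Monte Carlo estimate
# of D_pi(x0) = E[ sum_{t=0}^{T*-1} d(x_t) | x0, pi ].
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, terminal = 5, 2, 4                           # state 4 is terminal
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a] -> dist over x'
d = rng.uniform(0.0, 1.0, size=n_states)                          # immediate constraint cost
d[terminal] = 0.0
d0 = 2.0                                                          # threshold on cumulative cost

def D_pi(pi, x0, n_rollouts=2000, horizon=200):
    """Monte Carlo estimate of the expected cumulative constraint cost."""
    total = 0.0
    for _ in range(n_rollouts):
        x, ret = x0, 0.0
        for _ in range(horizon):
            if x == terminal:          # T*: first hitting time of the terminal state
                break
            ret += d[x]
            a = rng.choice(n_actions, p=pi[x])
            x = rng.choice(n_states, p=P[x, a])
        total += ret
    return total / n_rollouts

pi_uniform = np.full((n_states, n_actions), 1.0 / n_actions)
est = D_pi(pi_uniform, x0=0)
print("D_pi(x0) ~", round(est, 3), " feasible:", est <= d0)
```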
3. A Lyapunov Approach to Solve CMDPs
The main goal is to construct a Lyapunov function $L \in \mathcal{L}_{\pi_B}(x_0, d_0)$ such that $L(x_0) \le d_0$.
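A minimal membership test for $\mathcal{L}_{\pi_B}(x_0, d_0)$, assuming the two defining conditions as I read them (the Bellman-type inequality $T_{\pi_B,d}[L](x) \le L(x)$ at every state, plus $L(x_0) \le d_0$); the tabular setup and names are mine.

```python
# Sketch (my notation) of the membership test for the Lyapunov set L_{pi_B}(x0, d0).
import numpy as np

def T_pi_d(L, pi_B, P, d):
    """T_{pi_B,d}[L](x) = d(x) + sum_a pi_B(a|x) sum_x' P(x'|x,a) L(x')."""
    return d + np.einsum("xa,xay,y->x", pi_B, P, L)

def in_lyapunov_set(L, pi_B, P, d, x0, d0, tol=1e-8):
    """L is a Lyapunov function w.r.t. pi_B if T[L] <= L everywhere and L(x0) <= d0."""
    return bool(np.all(T_pi_d(L, pi_B, P, d) <= L + tol) and L[x0] <= d0)

# Tiny usage: 3 states, state 2 terminal/absorbing with zero cost.
P = np.zeros((3, 2, 3)); P[:, :, 2] = 1.0            # every action jumps to the terminal state
d = np.array([0.5, 0.3, 0.0])
pi_B = np.full((3, 2), 0.5)
L = np.array([1.0, 1.0, 0.0])                        # candidate Lyapunov function
print(in_lyapunov_set(L, pi_B, P, d, x0=0, d0=1.0))  # True: 0.5 <= 1, 0.3 <= 1, 0 <= 0, L(x0) <= d0
```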
- The constraint value function $D_{\pi^*}$ of the optimal policy $\pi^*$ can be transformed into a Lyapunov function induced by some baseline policy $\pi_B$, i.e., $L_{\epsilon^*}(x) \in \mathcal{L}_{\pi_B}(x_0, d_0)$.
- A cost-shaping term $\epsilon(x_t)$ is added to the constraint cost, and $L_{\epsilon^*}$ is obtained as the fixed point of the contraction mapping $T_{\pi_B, d + \epsilon}$.
- In conclusion, the Lyapunov function $L_{\epsilon^*}$ is a uniform upper bound on the constraint cost, i.e., $D_\pi(x) \le L_{\epsilon^*}(x)$ at every state $x$ for every $L_{\epsilon^*}$-induced policy $\pi$.
- The authors enforce a condition that the optimal policy $\pi^*$ be close to $\pi_B$, so that the set of $L_{\epsilon^*}$-induced policies contains an optimal policy.
- The solution to $OPT$ is $V^*(x_0)$, where $V^*$ is the value function of the Bellman operator restricted to the set of $L_{\epsilon^*}$-induced policies (see the definitions displayed after this list).
- If $\pi_B$ satisfies the closeness condition to $\pi^*$, then at $x = x_0$, $V^*(x_0)$ gives the $OPT$ solution.
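To pin down the objects referenced above (notation adapted from my reading of the paper; see its Section 3 for the precise statements), the set of $L$-induced policies at a state and the restricted Bellman recursion whose fixed point is $V^*$ are roughly:
\[
\mathcal{F}_{L}(x) \;=\; \Big\{ \pi(\cdot\mid x) \;:\; T_{\pi, d}[L](x) \le L(x) \Big\},
\qquad
V^*(x) \;=\; \min_{\pi(\cdot\mid x)\,\in\,\mathcal{F}_{L_{\epsilon^*}}(x)} \;\sum_a \pi(a\mid x)\Big( c(x,a) + \sum_{x'} P(x'\mid x,a)\, V^*(x') \Big).
\]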
4. Safe RL using Lyapunov functions
- $\epsilon^\star$ is approximated with an auxiliary constraint cost; the largest such cost gives a better chance of including $\pi^*$ in the set of feasible policies.
- In safe policy iteration (SPI), updating the Lyapunov function involves bootstrapping: $L_{\tilde{\epsilon}}$ is recomputed at each iteration using the total visiting probability from the initial state to each state, and this visiting-probability function is a maximizer of the objective (a sketch of the constrained policy-improvement step appears after this list).
- SPI and SVI update the Lyapunov function based on the best baseline policy.
- They enjoy monotonic policy improvement and convergence guarantees.
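A sketch of the constrained policy-improvement step behind these updates, assuming a tabular setting and my own variable names: at each state a tiny LP minimizes the objective Q-values subject to the local linear Lyapunov constraint $\sum_a (\pi(a|x) - \pi_B(a|x))\, Q_L(x,a) \le \tilde{\epsilon}(x)$.

```python
# Sketch (tabular setting; names are mine) of a Lyapunov-constrained policy improvement:
# per state, minimize the expected objective Q-value subject to the local linear constraint
#   sum_a (pi(a|x) - pi_B(a|x)) * Q_L(x, a) <= eps_tilde(x).
import numpy as np
from scipy.optimize import linprog

def lyapunov_policy_improvement(Q_C, Q_L, pi_B, eps_tilde):
    """Q_C, Q_L, pi_B: (n_states, n_actions); eps_tilde: (n_states,)."""
    n_states, n_actions = Q_C.shape
    pi_new = np.zeros_like(pi_B)
    for x in range(n_states):
        res = linprog(
            c=Q_C[x],                                # minimize expected objective cost
            A_ub=[Q_L[x]],                           # sum_a pi(a) Q_L(x,a) <= ...
            b_ub=[pi_B[x] @ Q_L[x] + eps_tilde[x]],  # ... pi_B's value plus the slack
            A_eq=[np.ones(n_actions)], b_eq=[1.0],   # pi(.|x) is a distribution
            bounds=[(0.0, 1.0)] * n_actions,
        )
        pi_new[x] = res.x
    return pi_new

# Toy usage with made-up Q-values; eps_tilde would come from the auxiliary constraint cost,
# e.g. roughly (d0 - D_{pi_B}(x0)) / E[T*] in the paper's construction (as I recall it).
Q_C = np.array([[1.0, 0.2], [0.5, 0.4]])
Q_L = np.array([[0.1, 0.9], [0.2, 0.3]])
pi_B = np.full((2, 2), 0.5)
print(lyapunov_policy_improvement(Q_C, Q_L, pi_B, eps_tilde=np.array([0.1, 0.1])))
```

The baseline policy $\pi_B$ is always feasible for each per-state LP, so the improvement step never loses the safety guarantee while shifting probability toward better-performing actions.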
5. Lyapunov-based Safe RL Algorithms
- Safe DQN (SDQN) and safe DPI (SDPI) replace the exact policy and value updates with function approximators
- Policy updates utilize policy distillation and the Jensen-Shannon divergence between successive policies
- Both SDQN and SDPI sample mini-batches from a replay buffer and minimize an MSE loss in the value-function updates.
- Safe Q-learning (SDQN): it uses a constraint value network $Q_D$ and a stopping-time value network $Q_T$ to form a Lyapunov function estimate; the auxiliary constraint cost $\epsilon^\prime$ is also function-approximated (see the action-selection sketch after this list).
- SDPI: Value estimation is done using policy evaluation
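A sketch of how the $Q_D$ and $Q_T$ networks could be combined into a Lyapunov estimate for action selection; this is my reconstruction under simplifying assumptions (a deterministic baseline action, a constant auxiliary cost $\tilde{\epsilon}$), not the paper's exact implementation.

```python
# Sketch (my reconstruction; details are assumptions, not the paper's exact design) of
# SDQN-style action selection: combine the constraint network Q_D and the stopping-time
# network Q_T into a Lyapunov estimate Q_L, then act greedily on the reward network Q_R
# among actions that keep the local Lyapunov constraint relative to the baseline action.
import numpy as np

def safe_greedy_action(x, Q_R, Q_D, Q_T, baseline_action, eps_tilde):
    """Q_R, Q_D, Q_T: callables state -> per-action value arrays (stand-ins for networks)."""
    q_l = Q_D(x) + eps_tilde * Q_T(x)           # Lyapunov Q-estimate Q_L(x, .)
    budget = q_l[baseline_action] + eps_tilde   # local constraint relative to pi_B(x)
    safe = np.flatnonzero(q_l <= budget)        # actions satisfying the constraint
    if safe.size == 0:                          # fall back to the baseline action
        return baseline_action
    return safe[np.argmax(Q_R(x)[safe])]        # best-rewarding action among the safe ones

# Toy usage with hand-made "networks" over 3 actions.
Q_R = lambda x: np.array([1.0, 2.0, 3.0])
Q_D = lambda x: np.array([0.1, 0.4, 0.9])
Q_T = lambda x: np.array([2.0, 2.0, 2.0])
print(safe_greedy_action(x=0, Q_R=Q_R, Q_D=Q_D, Q_T=Q_T,
                         baseline_action=0, eps_tilde=0.35))  # prints 1: action 2 costs too much
```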
6. Experiments
- The algorithms are tested in a 2D grid-planning problem: the goal is to reach a given goal cell, with the constraint being to avoid hitting obstacles (a minimal environment sketch follows this list).
- The number of obstacles is controlled by a density $\rho \in (0, 1)$.
- For the planning algorithms there's explicit knowledge of the transition probabilities and the reward function.
- The Lyapunov approaches (SDQN, SDPI) also perform safe RL without a known model of the environment, using deep function approximators.
- SPI has good performance
- SVI and Lagrangian approaches degrade as $\rho$ grows
- SDQN and SDPI guarantee that, once a safe policy is found, all subsequent updates remain safe. The Lagrangian approaches achieve worse rewards, violate constraints, and don't guarantee safety when the constraint threshold is large ($d_0 = 5$).
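For reference, a minimal environment sketch matching the described setup; the class name, reward shaping, and dynamics details are mine, not the paper's exact benchmark.

```python
# Minimal environment sketch (details are mine): a 2D grid where the agent must reach a
# goal cell while avoiding obstacles placed at random with density rho; stepping onto an
# obstacle incurs a constraint cost of 1.
import numpy as np

class GridPlanningEnv:
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=10, rho=0.2, seed=0):
        rng = np.random.default_rng(seed)
        self.size = size
        self.obstacles = rng.random((size, size)) < rho  # obstacle density rho
        self.goal = (size - 1, size - 1)
        self.obstacles[0, 0] = self.obstacles[self.goal] = False
        self.pos = (0, 0)

    def step(self, a):
        dr, dc = self.ACTIONS[a]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        reward = 1.0 if self.pos == self.goal else 0.0
        d_cost = 1.0 if self.obstacles[self.pos] else 0.0  # constraint cost: obstacle hit
        done = self.pos == self.goal
        return self.pos, reward, d_cost, done

env = GridPlanningEnv(rho=0.25)
state, reward, d_cost, done = env.step(3)  # move right from the start cell
print(state, reward, d_cost, done)
```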