For many applications of reinforcement learning, it can be more convenient to specify a constraint along with a reward function. For example, systems that physically interact with or around humans should satisfy safety constraints. Recent advances in policy search algorithms (Mnih et al., 2016; Schulman et al., 2015; Lillicrap et al., 2016; Levine et al., 2016) have enabled new capabilities in high-dimensional control, but do not consider the constrained setting. We propose Constrained Policy Optimization (CPO), the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration. Our method allows us to train neural network policies for high-dimensional control while providing guarantees about policy behavior throughout training.
URL: https://arxiv.org/abs/1705.10528
Author: Joshua Achiam et al. (UC Berkeley)
Topic: Safe RL
Conference: ICML 2017
1. Background
Recent work in deep RL assumes that agents are free to explore any behavior during learning. In many realistic domains, however, it may be unacceptable to give an agent complete freedom. Consider an industrial robot arm as an example: some behaviors could cause damage or harm people working nearby. In domains like this, safe exploration for RL agents is important (Moldovan & Abbeel, 2012; Amodei et al., 2016). A natural way to incorporate safety is via constraints.
A standard and well-studied formulation for reinforcement learning with constraints is the constrained Markov decision process (CMDP) framework (Altman, 1999), in which agents must satisfy constraints on expectations of auxiliary costs. Although optimal policies for finite CMDPs with known models can be obtained by linear programming, methods for high-dimensional control are lacking: there is currently no approach for policy search in continuous CMDPs that guarantees constraint satisfaction throughout training.
2. Preliminaries
A Markov decision process (MDP) is a tuple, $(S, A, R, P, \mu)$, where $S$ is the set of states, $A$ is the set of actions, $R: S \times A \times S \rightarrow \mathbb{R}$ is the reward function, $P: S \times A \times S \rightarrow[0,1]$ is the transition probability function (with $P\left(s^{\prime} \mid s, a\right)$ the probability of transitioning to $s^{\prime}$ from $s$ under action $a$), and $\mu: S \rightarrow[0,1]$ is the starting state distribution. A stationary policy $\pi: S \rightarrow \mathcal{P}(A)$ is a map from states to probability distributions over actions, with $\pi(a \mid s)$ denoting the probability of selecting action $a$ in state $s$. We denote the set of all stationary policies by $\Pi$. In reinforcement learning, we aim to select a policy $\pi$ which maximizes a performance measure, $J(\pi)$, which is typically taken to be the infinite horizon discounted total return,
$J(\pi) \doteq \underset{\tau \sim \pi}{\mathrm{E}}\left[\sum_{t=0}^{\infty} \gamma^{t} R\left(s_{t}, a_{t}, s_{t+1}\right)\right] .$
Letting $R(\tau)$ denote the discounted return of a trajectory, we express the on-policy value function as $V^{\pi}(s) \doteq$ $\mathrm{E}_{\tau \sim \pi}\left[R(\tau) \mid s_{0}=s\right]$ and the on-policy action-value function as $Q^{\pi}(s, a) \doteq \mathrm{E}_{\tau \sim \pi}\left[R(\tau) \mid s_{0}=s, a_{0}=a\right]$. The advantage function is $A^{\pi}(s, a) \doteq Q^{\pi}(s, a)-V^{\pi}(s)$. Also of interest is the discounted future state distribution, $d^{\pi}$, defined by $d^{\pi}(s)=(1-\gamma) \sum_{t=0}^{\infty} \gamma^{t} P\left(s_{t}=s \mid \pi\right)$. It allows us to compactly express the difference in performance between two policies $\pi^{\prime}, \pi$ as
$J\left(\pi^{\prime}\right)-J(\pi)=\frac{1}{1-\gamma} \underset{s \sim d^{\pi^{\prime}} \atop a \sim \pi^{\prime}}{\mathrm{E}}\left[A^{\pi}(s, a)\right]$ (1),
where by $a \sim \pi^{\prime}$, we mean $a \sim \pi^{\prime}(\cdot \mid s)$, with explicit notation dropped to reduce clutter. For proof of (1), see (Kakade & Langford, 2002) or Section 10 in the supplementary material.
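Identity (1) can be checked numerically on a tiny tabular MDP where $V^{\pi}$, $A^{\pi}$, and $d^{\pi}$ are computable in closed form. Below is a minimal sketch with toy dynamics and randomly drawn policies of our own choosing (not from the paper):

```python
import numpy as np

# A minimal sketch (toy MDP, assumed values): numerically verifying identity (1).
gamma = 0.9
nS, nA = 2, 2
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition probabilities
R = rng.normal(size=(nS, nA, nS))               # R(s, a, s') rewards
mu = np.array([0.6, 0.4])                       # starting-state distribution

def random_policy():
    return rng.dirichlet(np.ones(nA), size=nS)  # pi[s, a]

def mdp_quantities(pi):
    P_pi = np.einsum('sa,sap->sp', pi, P)                       # state-to-state kernel under pi
    r_pi = np.einsum('sa,sap,sap->s', pi, P, R)                 # expected one-step reward under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)        # V^pi
    Q = np.einsum('sap,sap->sa', P, R) + gamma * np.einsum('sap,p->sa', P, V)   # Q^pi
    A = Q - V[:, None]                                          # A^pi
    J = mu @ V                                                  # J(pi)
    d = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, mu)          # d^pi
    return J, A, d

pi, pi_new = random_policy(), random_policy()
J_old, A_old, _ = mdp_quantities(pi)
J_new, _, d_new = mdp_quantities(pi_new)

# Right-hand side of (1): advantages of pi, states from d^{pi'}, actions from pi'.
rhs = np.einsum('s,sa,sa->', d_new, pi_new, A_old) / (1 - gamma)
print(J_new - J_old, rhs)   # the two numbers agree up to floating-point error
```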
3. Constrained Markov Decision Processes
A constrained Markov decision process (CMDP) is an MDP augmented with constraints that restrict the set of allowable policies for that MDP. Specifically, we augment the MDP with a set $C$ of auxiliary cost functions, $C_{1}, \ldots, C_{m}$ (with each one a function $C_{i}: S \times A \times S \rightarrow \mathbb{R}$ mapping transition tuples to costs, like the usual reward), and limits $d_{1}, \ldots, d_{m} .$ Let $J_{C_{i}}(\pi)$ denote the expected discounted return of policy $\pi$ with respect to cost function $C_{i}$:
$J_{C_{i}}(\pi)=\underset{\tau \sim \pi}{\mathrm{E}}\left[\sum_{t=0}^{\infty} \gamma^{t} C_{i}\left(s_{t}, a_{t}, s_{t+1}\right)\right] .$
The set of feasible stationary policies for a CMDP is then
$\Pi_{C} \doteq\left\{\pi \in \Pi: \forall i, J_{C_{i}}(\pi) \leq d_{i}\right\}$
and the reinforcement learning problem in a CMDP is
$\pi^{*}=\arg \max _{\pi \in \Pi_{C}} J(\pi) .$
The choice of optimizing only over stationary policies is justified: it has been shown that the set of all optimal policies for a CMDP includes stationary policies, under mild technical conditions. For a thorough review of CMDPs and CMDP theory, we refer the reader to (Altman, 1999). We refer to $J_{C_{i}}$ as a constraint return, or $C_{i}$ -return for short. Lastly, we define on-policy value functions, action-value functions, and advantage functions for the auxiliary costs in analogy to $V^{\pi}, Q^{\pi}$, and $A^{\pi}$, with $C_{i}$ replacing $R$ : respectively, we denote these by $V_{C_{i}}^{\pi}, Q_{C_{i}}^{\pi}$, and $A_{C_{i}}^{\pi}$.
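To make the CMDP quantities concrete, the following minimal sketch (with assumed trajectory data and an assumed limit $d_1$, not from the paper) estimates $J(\pi)$ and the $C_i$-returns from sampled trajectories and checks feasibility against the limits:

```python
import numpy as np

# A minimal sketch (assumed interfaces and values): estimating J(pi) and the
# constraint returns J_{C_i}(pi) from sampled trajectories, then checking
# feasibility against the limits d_i.
gamma = 0.99
limits = np.array([25.0])          # d_i, one entry per auxiliary cost (assumed)

def discounted_sum(xs, gamma):
    return sum((gamma ** t) * x for t, x in enumerate(xs))

def estimate_returns(trajectories):
    """trajectories: list of dicts with per-step 'rewards' (shape (T,)) and
    'costs' (shape (T, m)) arrays for m auxiliary cost functions."""
    J_hat = np.mean([discounted_sum(tau['rewards'], gamma) for tau in trajectories])
    JC_hat = np.mean(
        [[discounted_sum(tau['costs'][:, i], gamma) for i in range(tau['costs'].shape[1])]
         for tau in trajectories], axis=0)
    return J_hat, JC_hat

# Usage with fake rollouts: the policy is feasible if every estimated C_i-return
# is below its limit d_i.
rng = np.random.default_rng(1)
fake = [{'rewards': rng.normal(1.0, 0.5, size=100),
         'costs': rng.uniform(0.0, 0.5, size=(100, 1))} for _ in range(16)]
J_hat, JC_hat = estimate_returns(fake)
print(J_hat, JC_hat, bool(np.all(JC_hat <= limits)))
```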
4. Constrained Policy Optimization
For large or continuous MDPs, solving for the exact optimal policy is intractable due to the curse of dimensionality (Sutton & Barto, 1998). Policy search algorithms approach this problem by searching for the optimal policy within a set $\Pi_{\theta} \subseteq \Pi$ of parametrized policies with parameters $\theta$ (for example, neural networks of a fixed architecture). In local policy search (Peters & Schaal, 2008), the policy is iteratively updated by maximizing $J(\pi)$ over a local neighborhood of the most recent iterate $\pi_{k}$ :
$\pi_{k+1}=\arg \max _{\pi \in \Pi_{\theta}} J(\pi)$ s.t. $D\left(\pi, \pi_{k}\right) \leq \delta$, (2)
where $D$ is some distance measure, and $\delta>0$ is a step size. When the objective is linearized around $\pi_{k}$ as $J\left(\pi_{k}\right)+g^{T}\left(\theta-\theta_{k}\right)$, where $g$ is the policy gradient, the standard policy gradient update is obtained by choosing $D\left(\pi, \pi_{k}\right)=\left\|\theta-\theta_{k}\right\|_{2}$ (Schulman et al., 2015). In local policy search for CMDPs, we additionally require policy iterates to be feasible for the CMDP, so instead of optimizing over $\Pi_{\theta}$, we optimize over $\Pi_{\theta} \cap \Pi_{C}$ :
$\pi_{k+1}=\arg \max _{\pi \in \Pi_{\theta}} J(\pi)$
s.t. $J_{C_{i}}(\pi) \leq d_{i} \quad i=1, \ldots, m \quad$ and $\quad D\left(\pi, \pi_{k}\right) \leq \delta$. (3)
This update is difficult to implement in practice because it requires evaluation of the constraint functions to determine whether a proposed point $\pi$ is feasible. When using sampling to compute policy updates, as is typically done in high-dimensional control (Duan et al., 2016), this requires off-policy evaluation, which is known to be challenging (Jiang & Li, 2015). In this work, we take a different approach, motivated by recent methods for trust-region optimization (Schulman et al., 2015). We develop a principled approximation to (3) with a particular choice of D, where we replace the objective and constraints with surrogate functions. The surrogates we choose are easy to estimate from samples collected on $\pi_k,$ and are good local approximations for the objective and constraints. Our theoretical analysis shows that for our choices of surrogates, we can bound our update’s worst-case performance and worst-case constraint violation with values that depend on a hyperparameter of the algorithm.
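For intuition about what these surrogates look like in sample-based form, the surrogate objective that appears later in (8) and (10), $\mathrm{E}_{s \sim d^{\pi_{k}}, a \sim \pi}\left[A^{\pi_{k}}(s, a)\right]$, can be estimated from on-policy samples of $\pi_k$ with importance weights $\pi(a \mid s) / \pi_{k}(a \mid s)$. A minimal sketch with assumed array interfaces:

```python
import numpy as np

# A minimal sketch (assumed interfaces, not the paper's code): estimating the
# surrogate objective E_{s~d^{pi_k}, a~pi}[A^{pi_k}(s, a)] from samples of pi_k.

def surrogate_objective(logp_new, logp_old, advantages):
    """All inputs are 1-D arrays over (s, a) pairs sampled from pi_k."""
    ratios = np.exp(logp_new - logp_old)      # pi(a|s) / pi_k(a|s)
    return np.mean(ratios * advantages)

# Toy usage with fake per-sample log-probabilities and advantages.
rng = np.random.default_rng(0)
logp_old = rng.normal(-1.0, 0.1, size=1024)
logp_new = logp_old + rng.normal(0.0, 0.05, size=1024)
adv = rng.normal(size=1024)
print(surrogate_objective(logp_new, logp_old, adv))

# The same estimator with cost advantages A_{C_i}^{pi_k} in place of A^{pi_k}
# gives a sample-based surrogate for each constraint.
```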
To prove the performance guarantees associated with our surrogates, we first prove new bounds on the difference in returns (or constraint returns) between two arbitrary stochastic policies in terms of an average divergence between them. We then show how our bounds permit a new analysis of trust-region methods in general: specifically, we prove a worst-case performance degradation at each update. We conclude by motivating, presenting, and proving guarantees on our algorithm, Constrained Policy Optimization (CPO), a trust region method for CMDPs.
4.1. Policy Performance Bounds
4.2. Trust Region Methods
Trust region algorithms for reinforcement learning (Schulman et al., 2015; 2016) have policy updates of the form,
$\pi_{k+1}=\arg \max _{\pi \in \Pi_{\theta}} \underset{s \sim d^{\pi_{k}} \atop a \sim \pi}{\mathrm{E}}\left[A^{\pi_{k}}(s, a)\right] \quad$ s.t. $\bar{D}_{K L}\left(\pi \| \pi_{k}\right) \leq \delta$, (8)
where $\bar{D}_{K L}\left(\pi \| \pi_{k}\right)=\mathrm{E}_{s \sim d^{\pi_{k}}}\left[D_{K L}\left(\pi \| \pi_{k}\right)[s]\right]$, and $\delta>0$ is the step size. The set $\left\{\pi_{\theta} \in \Pi_{\theta}: \bar{D}_{K L}\left(\pi_{\theta} \| \pi_{k}\right) \leq \delta\right\}$ is called the trust region.
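In practice, $\bar{D}_{K L}$ is estimated by averaging a per-state KL divergence over states collected under $\pi_k$. The sketch below assumes diagonal-Gaussian policies (a common but here unspecified choice) with toy linear mean functions:

```python
import numpy as np

# A minimal sketch (assumed diagonal-Gaussian policies, not from the paper):
# estimating Dbar_KL(pi || pi_k) = E_{s ~ d^{pi_k}}[ KL(pi(.|s) || pi_k(.|s)) ]
# by averaging a closed-form per-state KL over states sampled under pi_k.

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL( N(mu_p, std_p^2) || N(mu_q, std_q^2) ), summed over action dimensions."""
    return np.sum(np.log(std_q / std_p)
                  + (std_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * std_q ** 2)
                  - 0.5, axis=-1)

def mean_kl(policy_new, policy_old, states):
    """Average KL over a batch of states collected under pi_k (= policy_old)."""
    mu_new, std_new = policy_new(states)
    mu_old, std_old = policy_old(states)
    return np.mean(gaussian_kl(mu_new, std_new, mu_old, std_old))

# Toy usage: linear-Gaussian policies on a 3-dim state, 2-dim action space.
rng = np.random.default_rng(0)
W_old = rng.normal(size=(3, 2))
W_new = W_old + 0.01 * rng.normal(size=(3, 2))     # a small parameter perturbation
policy_old = lambda s: (s @ W_old, np.full((s.shape[0], 2), 0.5))
policy_new = lambda s: (s @ W_new, np.full((s.shape[0], 2), 0.5))
states = rng.normal(size=(256, 3))

delta = 0.01
print(mean_kl(policy_new, policy_old, states) <= delta)   # inside the trust region?
```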
The primary motivation for this update is that it is an approximation to optimizing the lower bound on policy performance given in (5), which would guarantee monotonic performance improvements. This is important for optimizing neural network policies, which are known to suffer from performance collapse after bad updates (Duan et al., 2016). Despite the approximation, trust-region steps usually give monotonic improvements (Schulman et al., 2015; Duan et al., 2016) and have shown state-of-the-art performance in the deep RL setting (Duan et al., 2016; Gu et al., 2017), making the approach appealing for developing policy search methods for CMDPs.
Until now, the particular choice of trust-region for (8) was heuristically motivated; with (5) and Corollary 3, we are able to show that it is principled and comes with a worst-case performance degradation guarantee that depends on δ.
Proposition 1 (Trust Region Update Performance). Suppose $\pi_{k}, \pi_{k+1}$ are related by (8), and that $\pi_{k} \in \Pi_{\theta}$. A lower bound on the policy performance difference between $\pi_{k}$ and $\pi_{k+1}$ is
$J\left(\pi_{k+1}\right)-J\left(\pi_{k}\right) \geq \frac{-\sqrt{2 \delta} \gamma \epsilon^{\pi_{k+1}}}{(1-\gamma)^{2}}$
where $\epsilon^{\pi_{k+1}}=\max _{s}\left|\mathrm{E}_{a \sim \pi_{k+1}}\left[A^{\pi_{k}}(s, a)\right]\right|$.

This result is useful for two reasons: 1) it is of independent interest, as it helps tighten the connection between theory and practice for deep RL, and 2) the choice to develop CPO as a trust region method means that CPO inherits this performance guarantee.
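To get a feel for the scale of this guarantee, the short sketch below plugs illustrative (assumed) values of $\delta$, $\gamma$, and $\epsilon^{\pi_{k+1}}$ into the bound; note the $\sqrt{\delta} /(1-\gamma)^{2}$ dependence.

```python
import numpy as np

# Illustrative numbers only (assumed, not from the paper): how the worst-case
# performance-degradation bound of Proposition 1 scales with the step size delta.
gamma = 0.995
eps = 1.0   # assumed value of max_s |E_{a~pi_{k+1}}[A^{pi_k}(s, a)]|

for delta in (0.02, 0.01, 0.001):
    bound = -np.sqrt(2.0 * delta) * gamma * eps / (1.0 - gamma) ** 2
    print(f"delta={delta}: J(pi_k+1) - J(pi_k) >= {bound:.1f}")
```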
4.3. Trust Region Optimization for Constrained MDPs
Constrained policy optimization (CPO), which we present and justify in this section, is a policy search algorithm for CMDPs with updates that approximately solve (3) with a particular choice of $D$. First, we describe a policy search update for CMDPs that alleviates the issue of off-policy evaluation and comes with guarantees of monotonic performance improvement and constraint satisfaction. Then, because the theoretically guaranteed update would take steps that are too small in practice, we propose CPO as a practical approximation based on trust-region methods.
By Corollaries 1, 2, and 3, for appropriate coefficients $\alpha_{k}$ and $\beta_{k}^{i}$, the update
$$
\begin{array}{l}
\pi_{k+1}=\arg \max _{\pi \in \Pi_{\theta}} \underset{s \sim d^{\pi_{k}} \atop a \sim \pi}{\mathrm{E}}\left[A^{\pi_{k}}(s, a)\right]-\alpha_{k} \sqrt{\bar{D}_{K L}\left(\pi \| \pi_{k}\right)} \\
\text { s.t. } J_{C_{i}}\left(\pi_{k}\right)+\underset{s \sim d^{\pi_{k}} \atop a \sim \pi}{\mathrm{E}}\left[\frac{A_{C_{i}}^{\pi_{k}}(s, a)}{1-\gamma}\right]+\beta_{k}^{i} \sqrt{\bar{D}_{K L}\left(\pi \| \pi_{k}\right)} \leq d_{i} \quad(9)
\end{array}
$$
is guaranteed to produce policies with monotonically nondecreasing returns that satisfy the original constraints. (Observe that the constraint here is on an upper bound for $J_{C_{i}}(\pi)$ by (6).) The off-policy evaluation issue is alleviated because both the objective and constraints involve expectations over the state distribution $d^{\pi_{k}}$, which we presume to have samples from. Because the bounds are tight at $\pi=\pi_{k}$, the problem is always feasible (as long as $\pi_{0}$ is feasible). However, the penalties on policy divergence are quite steep for discount factors close to 1, so steps taken with this update might be small.
Inspired by trust-region methods, we propose CPO, which uses a trust region instead of penalties on policy divergence to enable larger step sizes:
$\pi_{k+1}=\arg \max _{\pi \in \Pi_{\theta}} \underset{s \sim d^{\pi_{k}} \atop a \sim \pi}{\mathrm{E}}\left[A^{\pi_{k}}(s, a)\right]$
s.t. $J_{C_{i}}\left(\pi_{k}\right)+\frac{1}{1-\gamma} \underset{s \sim d^{\pi_{k}} \atop a \sim \pi}{\mathrm{E}}\left[A_{C_{i}}^{\pi_{k}}(s, a)\right] \leq d_{i} \quad \forall i$, and $\bar{D}_{K L}\left(\pi \| \pi_{k}\right) \leq \delta$. (10)
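In sample-based form, the left-hand side of the cost constraint in (10) is estimated with the same importance-weighted estimator as the objective. A minimal sketch with assumed interfaces (per-sample log-probabilities, cost advantages, and a separately estimated average KL):

```python
import numpy as np

# A minimal sketch (assumed interfaces, not the paper's code): checking the
# surrogate constraints of (10) for a candidate policy pi against pi_k.

def surrogate_cost_return(JC_old, logp_new, logp_old, cost_advantages, gamma):
    """J_{C_i}(pi_k) + (1/(1-gamma)) * E[ (pi/pi_k) * A_{C_i}^{pi_k} ]."""
    ratios = np.exp(logp_new - logp_old)            # pi(a|s) / pi_k(a|s)
    return JC_old + np.mean(ratios * cost_advantages) / (1.0 - gamma)

def within_cpo_constraints(JC_old, logp_new, logp_old, cost_adv, limits,
                           mean_kl, gamma=0.99, delta=0.01):
    """True iff every surrogate C_i-return is below d_i and the estimated average
    KL divergence Dbar_KL(pi || pi_k) is inside the trust region."""
    surrogates = np.array([
        surrogate_cost_return(JC_old[i], logp_new, logp_old, cost_adv[i], gamma)
        for i in range(len(limits))])
    return bool(np.all(surrogates <= np.asarray(limits)) and mean_kl <= delta)
```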
Because this is a trust region method, it inherits the performance guarantee of Proposition 1. Furthermore, by corollaries 2 and 3, we have a performance guarantee for approximate satisfaction of constraints:
Proposition 2 (CPO Update Worst-Case Constraint Violation). Suppose $\pi_{k}, \pi_{k+1}$ are related by (10), and that $\Pi_{\theta}$ in (10) is any set of policies with $\pi_{k} \in \Pi_{\theta}$. An upper bound on the $C_{i}$-return of $\pi_{k+1}$ is
$J_{C_{i}}\left(\pi_{k+1}\right) \leq d_{i}+\frac{\sqrt{2 \delta} \gamma \epsilon_{C_{i}}^{\pi_{k+1}}}{(1-\gamma)^{2}}$
where $\epsilon_{C_{i}}^{\pi_{k+1}}=\max _{s}\left|\mathrm{E}_{a \sim \pi_{k+1}}\left[A_{C_{i}}^{\pi_{k}}(s, a)\right]\right| .$
4.4. Practical Implementation
For policies with high-dimensional parameter spaces like neural networks, (10) can be impractical to solve directly because of the computational cost. However, for small step sizes $\delta$, the objective and cost constraints are well-approximated by linearizing around $\pi_{k}$, and the KL-divergence constraint is well-approximated by a second-order expansion (at $\pi=\pi_{k}$, the KL-divergence and its gradient are both zero). Denoting the gradient of the objective as $g$, the gradient of constraint $i$ as $b_{i}$, the Hessian of the KL-divergence as $H$, and defining $c_{i} \doteq J_{C_{i}}\left(\pi_{k}\right)-d_{i}$, the approximation to (10) is:
$\begin{aligned} \theta_{k+1}=\arg \max _{\theta} & g^{T}\left(\theta-\theta_{k}\right) \\ \text { s.t. } & c_{i}+b_{i}^{T}\left(\theta-\theta_{k}\right) \leq 0 \quad i=1, \ldots, m \\ & \frac{1}{2}\left(\theta-\theta_{k}\right)^{T} H\left(\theta-\theta_{k}\right) \leq \delta \end{aligned}$ (11)
Because the Fisher information matrix (FIM) $H$ is always positive semi-definite (and we will assume it to be positive-definite in what follows), this optimization problem is convex and, when feasible, can be solved efficiently using duality. (We reserve the case where it is not feasible for the next subsection.) With $B \doteq\left[b_{1}, \ldots, b_{m}\right]$ and $c \doteq\left[c_{1}, \ldots, c_{m}\right]^{T}$, a dual to (11) can be expressed as
$\max _{\lambda \geq 0 \atop \nu \succeq 0} \frac{-1}{2 \lambda}\left(g^{T} H^{-1} g-2 r^{T} \nu+\nu^{T} S \nu\right)+\nu^{T} c-\frac{\lambda \delta}{2}$ (12)
where $r \doteq g^{T} H^{-1} B, S \doteq B^{T} H^{-1} B .$ This is a convex program in $m+1$ variables; when the number of constraints is small by comparison to the dimension of $\theta$, this is much easier to solve than (11). If $\lambda^{*}, \nu^{*}$ are a solution to the dual, the solution to the primal is
$\theta^{*}=\theta_{k}+\frac{1}{\lambda^{*}} H^{-1}\left(g-B \nu^{*}\right) .$ (13)
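As an illustration of the dual-then-primal recovery, the sketch below builds a toy instance of (11) with assumed values (random $g$, $B$, and a positive-definite stand-in for $H$), maximizes the dual (12) with a generic bounded optimizer, and recovers the step via (13). (The paper's experiments instead use the analytic single-constraint solution.)

```python
import numpy as np
from scipy.optimize import minimize

# A minimal sketch (toy dimensions, assumed values; not the paper's implementation):
# numerically maximizing the dual (12) and recovering the update via (13).
rng = np.random.default_rng(0)
n, m = 5, 2                            # parameter dimension, number of constraints
delta = 0.01

M = rng.normal(size=(n, n))
H = M @ M.T + n * np.eye(n)            # positive-definite stand-in for the FIM
g = rng.normal(size=n)                 # objective gradient
B = rng.normal(size=(n, m))            # constraint gradients b_i as columns
c = np.array([-0.05, 0.0])             # c_i = J_{C_i}(pi_k) - d_i (pi_k feasible here)

Hinv_g = np.linalg.solve(H, g)
Hinv_B = np.linalg.solve(H, B)
q = g @ Hinv_g                         # g^T H^{-1} g
r = B.T @ Hinv_g                       # (g^T H^{-1} B)^T
S = B.T @ Hinv_B                       # B^T H^{-1} B

def neg_dual(x):
    lam, nu = x[0], x[1:]
    return -((-1.0 / (2.0 * lam)) * (q - 2.0 * r @ nu + nu @ S @ nu)
             + nu @ c - lam * delta / 2.0)

x0 = np.ones(1 + m)
bounds = [(1e-8, None)] + [(0.0, None)] * m       # lambda >= 0, nu >= 0 elementwise
sol = minimize(neg_dual, x0, bounds=bounds)
lam_star, nu_star = sol.x[0], sol.x[1:]

theta_step = (1.0 / lam_star) * np.linalg.solve(H, g - B @ nu_star)   # eq. (13)

# Sanity checks against (11), up to solver tolerance:
print(c + B.T @ theta_step)                          # each entry should be <= ~0
print(0.5 * theta_step @ H @ theta_step, "<=", delta)
```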
Our algorithm solves the dual for $\lambda^{*}, \nu^{*}$ and uses them to propose the policy update (13). For the special case where there is only one constraint, we give an analytical solution in the supplementary material (Theorem 2) which removes the need for an inner-loop optimization. Our experiments have only a single constraint and make use of the analytical solution.
Because of the approximation error, the proposed update may not satisfy the constraints in (10); a backtracking line search is used to ensure surrogate constraint satisfaction. Also, for high-dimensional policies, it is impractically expensive to invert the FIM. This poses a challenge for computing $H^{-1} g$ and $H^{-1} b_{i}$, which appear in the dual. Like (Schulman et al., 2015), we approximately compute them using the conjugate gradient method.
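A minimal sketch of the conjugate gradient routine (the standard algorithm, not the paper's code) for approximating $H^{-1} g$ using only Hessian-vector products; in practice the `hvp` callable would be implemented with automatic differentiation of the average KL-divergence, which is assumed here:

```python
import numpy as np

# A minimal sketch: conjugate gradient solves H x = g given only the map v -> H v,
# so the FIM is never formed or inverted explicitly.

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g; hvp(v) must return H @ v."""
    x = np.zeros_like(g)
    r = g.copy()                  # residual g - H x (x = 0 initially)
    p = r.copy()
    r_dot = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = r_dot / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        r_dot_new = r @ r
        if r_dot_new < tol:
            break
        p = r + (r_dot_new / r_dot) * p
        r_dot = r_dot_new
    return x

# Toy usage with an explicit positive-definite matrix standing in for the FIM.
rng = np.random.default_rng(0)
M = rng.normal(size=(50, 50))
H = M @ M.T + 50 * np.eye(50)
g = rng.normal(size=50)
x = conjugate_gradient(lambda v: H @ v, g, iters=50)
print(np.linalg.norm(H @ x - g))      # small residual
```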