Exploration remains a significant challenge for reinforcement learning methods, especially in environments where reward signals are sparse. Recent methods for learning from demonstrations have shown promise in overcoming exploration difficulties but typically require a considerable number of high-quality demonstrations that are difficult to collect. We propose to effectively leverage available demonstrations to guide exploration by enforcing occupancy measure matching between the learned policy and the demonstrations, and develop a novel Policy Optimization from Demonstration (POfD) method. We show that POfD induces implicit dynamic reward shaping and brings provable benefits for policy improvement; combined with policy gradient methods, it produces state-of-the-art results even when the demonstrations are few and imperfect.
Title: Policy Optimization with Demonstrations
Author: Bingyi Kang et al.
Topic: Demonstration
Conference: ICML 2018
1. Background
Some recent works (Nair et al., 2017; Houthooft et al., 2016; Plappert et al., 2017; Pathak et al., 2017) have been devoted to tackling the exploration problem in RL. They are basically rooted in the following two ideas. 1) Reshape the original reward function by encouraging the agent to visit states never seen before, driven by intrinsic curiosity (Pathak et al., 2017) or information gain (Houthooft et al., 2016). 2) Use demonstration trajectories sampled from an expert policy to guide the learning procedure, by either putting the demonstrations into a replay memory (Nair et al., 2017; Hester et al., 2017; Večerík et al., 2017) or using them to pre-train the policy in a supervised manner (Silver et al., 2016). Learning from demonstrations has shown promising performance in overcoming exploration difficulties in sparse reward environments. However, existing schemes cannot fully leverage the power of the demonstration data, since they treat it in the same way as self-generated data, and they usually require a tremendous number of high-quality demonstrations, which are difficult to collect at scale.
To address this problem, we propose to combine the above two ideas by developing a principled method that rewards demonstration-like actions more during interaction with the environment, which encourages exploration toward meaningful states when feedback is sparse. The intuition is that, when the reward signal is not available, the agent should mimic the demonstrated behavior in the early learning stages for exploration. After acquiring sufficient skills, the agent can explore new states on its own. This is effectively a dynamic intrinsic reward mechanism for reshaping native rewards in RL. We realize our idea with three technical novelties. 1) We reformulate the policy optimization objective by adding a demonstration-guided exploration term, which measures the divergence between the current policy and the expert one, forcing expert-alike exploration. 2) We convert the proposed objective to a new one defined on occupancy measures (Ho & Ermon, 2016) for better exploiting demonstrations and establish an optimization-friendly lower bound. 3) We eventually draw a connection between optimizing the derived lower bound and generative adversarial training (Goodfellow et al., 2014), and show that the optimization can thus proceed by alternating between two sub-procedures. One is fitting a discriminator measuring the similarity between expert data and self-generated data to reshape the reward; the other is updating the policy with the gradient of the reshaped reward function. POfD is general and compatible with most policy gradient methods. We also show that existing replay-memory-based learning from demonstration methods (Hester et al., 2017; Večerík et al., 2017) can be interpreted as degenerate cases of our method in terms of how they leverage the demonstration data.
2. Method
Now, given a stochastic policy $\pi(a|s) = p(a|s;\pi)$, the performance of $\pi$ is usually evaluated by its expected discounted reward $\eta(\pi)$:
$\eta(\pi)=\mathbb{E}_{\pi}[r(s, a)]=\mathbb{E}_{\left(s_{0}, a_{0}, s_{1}, \ldots\right)}\left[\sum_{t=0}^{\infty} \gamma^{t} r\left(s_{t}, a_{t}\right)\right]$ (1).
Then, the occupancy measure, defined as follows, characterizes the distribution of action-state pairs when executing the policy $\pi$.
Definition 1. (Occupancy measure) Let $\rho_{\pi}(s): \mathcal{S} \rightarrow \mathbb{R}$ denote the unnormalized distribution of state visitation by following policy $\pi$ in the environment:
$ \rho_{\pi}(s)=\sum_{t=0}^{\infty} \gamma^{t} P\left(s_{t}=s \mid \pi\right) . $
Then the unnormalized distribution of state-action pairs $\rho_{\pi}(s, a)=\rho_{\pi}(s) \pi(a \mid s)$ is called occupancy measure of policy $\pi$. Substituting $\rho_{\pi}(s, a)$ into Eqn. (1), we have the following equivalent formulation for the expectation over policy $\pi:$
$\begin{aligned} \mathbb{E}_{\pi}[r(s, a)] &=\sum_{t=0}^{\infty} \sum_{s} P\left(s_{t}=s \mid \pi\right) \sum_{a} \pi(a \mid s) \gamma^{t} r(s, a) \\ &=\sum_{s} \rho_{\pi}(s) \sum_{a} \pi(a \mid s) r(s, a) \\ &=\sum_{s, a} \rho_{\pi}(s, a) r(s, a), \end{aligned} (2)$
which provides an alternative way to compute the expected discounted return. An important property of the occupancy measure is that it uniquely specifies a policy, as described in the lemma from Syed et al., 2008.
Lemma 1. (Theorem 2 of (Syed et al., 2008)) Suppose $\rho$ is the occupancy measure for $\pi_{\rho}(a \mid s) \triangleq \frac{\rho(s, a)}{\sum_{a^{\prime}} \rho\left(s, a^{\prime}\right)} .$ Then $\pi_{\rho}$
is the only policy whose occupancy measure is $\rho .$
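To make Definition 1 and Eqn. (2) concrete, below is a minimal sketch (my own illustration, not code from the paper) that estimates the discounted return of a policy both directly as in Eqn. (1) and through an empirical, unnormalized occupancy measure as in Eqn. (2); the two estimates coincide whenever the reward depends only on the state-action pair. The trajectory format and function names are assumptions made for illustration.

```python
# A minimal sketch (not from the paper): trajectories are assumed to be lists
# of (state, action, reward) tuples collected by running a fixed policy.
from collections import defaultdict

GAMMA = 0.99  # discount factor

def discounted_return(trajectories):
    """Monte-Carlo estimate of Eqn. (1): average of sum_t gamma^t * r_t."""
    returns = [sum(GAMMA ** t * r for t, (_, _, r) in enumerate(traj))
               for traj in trajectories]
    return sum(returns) / len(returns)

def empirical_occupancy(trajectories):
    """Empirical unnormalized occupancy measure of Definition 1:
    rho(s, a) ~ average over trajectories of sum_t gamma^t * 1[s_t=s, a_t=a]."""
    rho = defaultdict(float)
    for traj in trajectories:
        for t, (s, a, _) in enumerate(traj):
            rho[(s, a)] += GAMMA ** t / len(trajectories)
    return rho

def return_from_occupancy(trajectories, reward_fn):
    """Eqn. (2): sum_{s,a} rho(s, a) * r(s, a); agrees with discounted_return
    whenever the reward is a function of (s, a) only."""
    rho = empirical_occupancy(trajectories)
    return sum(p * reward_fn(s, a) for (s, a), p in rho.items())
```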
In addition to the sparse rewards from the environment as in traditional RL settings, the agent is provided with a few demonstrations $\mathcal{D}^E = \{\tau_1, \tau_2, \ldots, \tau_N\}$, where the $i$-th trajectory $\tau_i = [(s_0^i, a_0^i), (s_1^i, a_1^i), \ldots, (s_T^i, a_T^i)]$ (possibly imperfect) is generated by executing an unknown expert policy $\pi_E$ in the environment. We aim to develop a method that boosts exploration by effectively leveraging $\mathcal{D}^E$ in such settings and maximizes $\eta(\pi)$ in Eqn. (1). Throughout the paper, we use $\pi_E$ to denote the expert policy, which achieves a relatively good $\eta(\pi_E)$, and use $\hat{\mathbb{E}}_{\mathcal{D}}$ to denote the empirical expectation estimated from the demonstrated trajectories $\mathcal{D}^E$. In the early learning stages, we assume that acting according to the expert policy $\pi_E$ yields an advantage of at least $\delta$ over the current policy $\pi$. We do not need to assume that $\pi_E$ is advantageous over all policies: POfD learns from both rewards and demonstrations, and can possibly learn a policy better than $\pi_E$ through exploration on its own in later learning stages.
2.1. Policy optimization with demonstrations
Policy gradient methods usually start optimizing the policy $\pi_\theta$ with random exploration, which may cause slow convergence when the state and action spaces are high-dimensional and may even lead to failure when the environment feedback is sparse. We propose to address this problem by forcing the policy to explore in the neighborhood of the expert policy $\pi_E$, which is specified by several demonstrated trajectories $\mathcal{D}^E$. We therefore add a demonstration-guided exploration term $\mathcal{L}_{M}\left(\pi_{\theta}, \pi_{E}\right)=D_{JS}\left(\pi_{\theta}, \pi_{E}\right)$, defined as the Jensen-Shannon divergence between the current policy and the expert one, to the vanilla objective $\eta(\pi_\theta)$:
$\mathcal{L}\left(\pi_{\theta}\right)=-\eta\left(\pi_{\theta}\right)+\lambda_{1} D_{J S}\left(\pi_{\theta}, \pi_{E}\right)$.
However, the above divergence between $\pi_\theta$ and $\pi_E$ cannot be computed directly, as $\pi_E$ is unknown. Fortunately, leveraging the one-to-one correspondence between a policy and its occupancy measure given by Lemma 1, we can instead define the divergence over the occupancy measures $\rho_\theta(s, a)$ and $\rho_E(s, a)$, which is easier to optimize through adversarial training on demonstrations. Defining $\mathcal{L}_{M} \triangleq D_{JS}\left(\rho_{\theta}, \rho_{E}\right)$, where $\rho_\theta$ and $\rho_E$ are short for $\rho_{\pi_\theta}$ and $\rho_{\pi_E}$, the proposed demonstration-guided learning objective becomes:
$\mathcal{L}\left(\pi_{\theta}\right)=-\eta\left(\pi_{\theta}\right)+\lambda_{1} D_{J S}\left(\rho_{\theta}, \rho_{E}\right)$ (3).
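For intuition only, the Jensen-Shannon regularizer in Eqn. (3) can be evaluated exactly when the two occupancy measures are small discrete distributions. The sketch below (my own, with made-up numbers) does this directly; POfD never evaluates $D_{JS}$ in closed form and instead relies on the adversarial lower bound derived next.

```python
# Exact D_JS between two (normalized) discrete occupancy measures, for
# illustration only; the (s, a) support and the numbers are made up.
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """D_JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m) with m = (p + q) / 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rho_theta = [0.4, 0.3, 0.2, 0.1]   # occupancy of the current policy
rho_expert = [0.1, 0.2, 0.3, 0.4]  # occupancy of the expert policy
print(js_divergence(rho_theta, rho_expert))  # > 0; equals 0 iff the two match
```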
2.2. Optimization
In this subsection, we introduce a practical optimization algorithm for Eqn. (3), which is compatible with any policy gradient method. In particular, instead of performing optimization on the difficult Jensen-Shannon divergence directly, we optimize its lower bound given as follows.
Theorem 2. Let $h(u)=\log \left(\frac{1}{1+e^{-u}}\right)$, $\bar{h}(u)=\log \left(\frac{e^{-u}}{1+e^{-u}}\right)$, and $U(s, a): \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ be an arbitrary function. Then,
$$
\begin{array}{l}
D_{J S}\left(\rho_{\pi}, \rho_{E}\right) \geq \sup _{U}\left(\mathbb{E}_{\rho_{\pi}}[h(U(s, a))]+\mathbb{E}_{\rho_{E}}[\bar{h}(U(s, a))]\right)+\log 4
\end{array}
$$
With the above theorem, the occupancy measure matching objective $\mathcal{L}_{M}$ can be written as,
$$
\mathcal{L}_{M} \triangleq \sup _{D:\, \mathcal{S} \times \mathcal{A} \rightarrow(0,1)} \mathbb{E}_{\pi_{\theta}}[\log (D(s, a))]+\mathbb{E}_{\pi_{E}}[\log (1-D(s, a))],
$$
where $D(s, a)=\frac{1}{1+e^{-U(s, a)}}: \mathcal{S} \times \mathcal{A} \rightarrow(0,1)$ is an arbitrary function composed with a sigmoid activation that scales its output into $(0,1)$. The supremum over $D(s, a)$ thus corresponds to the optimal binary classifier distinguishing the current policy $\pi_{\theta}$ from the expert policy $\pi_{E}$ w.r.t. state-action pairs sampled from $\rho_{\theta}$ and $\rho_{E}$.
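In practice, $D(s, a)$ is typically parameterized by a neural network. Below is a minimal PyTorch sketch of such a discriminator; the architecture (a small MLP over concatenated state-action features with tanh activations) is my own assumption, not the authors' design.

```python
# A hypothetical discriminator D_w(s, a): the logits play the role of U(s, a)
# in Theorem 2, and the sigmoid scales them into (0, 1).
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        logits = self.net(torch.cat([state, action], dim=-1))  # U(s, a)
        return torch.sigmoid(logits)                           # D_w(s, a)
```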
To avoid potential overfitting risks, we introduce causal entropy $-H\left(\pi_{\theta}\right)$ as another regularization term, similar to (Ziebart et al., 2008; Ziebart, 2010). Therefore, the overall objective of our proposed POfD is formulated as
$$
\begin{array}{l}
\min _{\theta} \mathcal{L}=-\eta\left(\pi_{\theta}\right)-\lambda_{2} H\left(\pi_{\theta}\right)+\lambda_{1} \sup _{D:\, \mathcal{S} \times \mathcal{A} \rightarrow(0,1)}\left(\mathbb{E}_{\pi_{\theta}}[\log (D(s, a))]+\mathbb{E}_{\pi_{E}}[\log (1-D(s, a))]\right)
\end{array}
$$
This is a minimax problem closely related to the learning objective of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). GANs train a generative model $G$ to produce samples indistinguishable from those of the real distribution, even to a well-trained discriminative model $D$. In our case, the true distribution is the expert policy $\pi_{E}(a \mid s)$, or equivalently the expert occupancy measure $\rho_{E}(s, a)$, and the generator to learn is the policy model $\pi_{\theta}(a \mid s)$. Suppose $D$ is parameterized by $w$. By labeling expert state-action pairs as true ("1") and policy state-action pairs as false ("0"), we get the following objective,
$$
\begin{array}{l}
\min _{\theta} \max _{w} \mathcal{L}=-\eta\left(\pi_{\theta}\right)-\lambda_{2} H\left(\pi_{\theta}\right) + \lambda_{1}\left(\mathbb{E}_{\pi_{\theta}}\left[\log \left(D_{w}(s, a)\right)\right]+\mathbb{E}_{\pi_{E}}\left[\log \left(1-D_{w}(s, a)\right)\right]\right)
\end{array}
$$
Moreover, the above minimax objective can be simplified by substituting Eqn. (1) and Eqn. (2) into it, resulting in a dynamic reward-reshaping mechanism over the original reward signal:
$\begin{aligned} \min _{\theta} \max _{w}-\mathbb{E}_{\pi_{\theta}} &\left[r^{\prime}(s, a)\right]-\lambda_{2} H\left(\pi_{\theta}\right) + \lambda_{1} \mathbb{E}_{\pi_{E}}\left[\log \left(1-D_{w}(s, a)\right)\right] \end{aligned}$
where $r^{\prime}(s, a)=r(s, a)-\lambda_{1} \log \left(D_{w}(s, a)\right)$ is the reshaped reward function, which augments the environment reward with demonstration information. When the environment feedback is sparse or exploration is insufficient, the augmented reward forces the policy to generate trajectories similar to those of the expert policy $\pi_{E}$; in other words, the divergence between $\pi_{\theta}$ and $\pi_{E}$ is minimized. Therefore, our algorithm is able to explore the environment more efficiently. The above objective can be optimized efficiently by alternately updating the policy parameters $\theta$ and the discriminator parameters $w$. First, trajectories are sampled by executing the current policy $\pi_{\theta}$ and mixed with the demonstration data to train the discriminator by SGD. The gradient is given by,
$$
\mathbb{E}_{\pi}\left[\nabla_{w} \log \left(D_{w}(s, a)\right)\right]+\mathbb{E}_{\pi_{E}}\left[\nabla_{w} \log \left(1-D_{w}(s, a)\right)\right]
$$
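Continuing the discriminator sketch above, the following hedged example shows one stochastic ascent step on this objective and the resulting reshaped reward $r^{\prime}(s, a)=r(s, a)-\lambda_{1} \log D_{w}(s, a)$; the optimizer choice, batching, and numerical epsilons are my own assumptions.

```python
# One discriminator update (ascend E_pi[log D] + E_piE[log(1 - D)]) followed by
# reward reshaping with D_w held fixed; `disc` is assumed to behave like the
# Discriminator sketched earlier.
import torch

def discriminator_step(disc, optimizer, policy_s, policy_a, expert_s, expert_a):
    d_policy = disc(policy_s, policy_a)   # D_w on self-generated pairs
    d_expert = disc(expert_s, expert_a)   # D_w on demonstration pairs
    # Maximizing the objective is equivalent to minimizing its negative.
    loss = -(torch.log(d_policy + 1e-8).mean()
             + torch.log(1.0 - d_expert + 1e-8).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def reshaped_reward(disc, s, a, env_reward, lambda1=1.0):
    """r'(s, a) = r(s, a) - lambda1 * log D_w(s, a), no gradient through D_w."""
    with torch.no_grad():
        bonus = -lambda1 * torch.log(disc(s, a) + 1e-8).squeeze(-1)
    return env_reward + bonus
```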
Then, fixing the discriminator $D_{w}$, i.e., a fixed reward function $r^{\prime}(s, a)$, the policy $\pi_{\theta}$ is optimized with a chosen policy gradient method. In particular, by the policy gradient theorem (Sutton et al., 2000), the reshaped policy gradient is:
$\begin{aligned}
\nabla_{\theta} \mathbb{E}_{\pi_{\theta}}\left[r^{\prime}(s, a)\right] &=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s) Q^{\prime}(s, a)\right] \\
\text { where } Q^{\prime}(\bar{s}, \bar{a}) &=\mathbb{E}_{\pi_{\theta}}\left[r^{\prime}(s, a) \mid s_{0}=\bar{s}, a_{0}=\bar{a}\right]
\end{aligned}$
The gradient for causal entropy regularization is given by
$\nabla_{\theta} \mathbb{E}_{\pi_{\theta}}\left[-\log \pi_{\theta}(a \mid s)\right]=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s) Q^{H}(s, a)\right]$
$\quad$ where $Q^{H}(\bar{s}, \bar{a})=\mathbb{E}_{\pi_{\theta}}\left[-\log \pi_{\theta}(a \mid s) \mid s_{0}=\bar{s}, a_{0}=\bar{a}\right]$
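Putting the two gradients together, the sketch below performs one REINFORCE-style policy update, using the discounted reward-to-go of the reshaped rewards as a Monte-Carlo estimate of $Q^{\prime}(s, a)$ and adding the causal-entropy bonus. This is only a simple stand-in: the paper plugs the reshaped reward into stronger policy gradient methods such as TRPO or PPO, and the hyperparameter values and helper signature here are assumptions.

```python
# One policy gradient step on the reshaped objective: ascend
# E[grad log pi * Q'] + lambda2 * grad H(pi), with Q' estimated by the
# discounted reward-to-go of the reshaped rewards along one sampled trajectory.
import torch

def policy_step(optimizer, log_probs, reshaped_rewards, entropies,
                gamma=0.99, lambda2=1e-3):
    # log_probs / entropies: lists of scalar tensors from the current policy;
    # reshaped_rewards: list of floats r'(s_t, a_t) along one trajectory.
    returns, g = [], 0.0
    for r in reversed(reshaped_rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    log_probs = torch.stack(log_probs)
    entropies = torch.stack(entropies)
    # Optimizers minimize, so negate the objective we want to maximize.
    loss = -(log_probs * returns).mean() - lambda2 * entropies.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```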
The optimization details are summarized in Alg. 1. The procedure is compatible with any policy gradient method, e.g., TRPO (Schulman et al., 2015) and PPO (Schulman et al., 2017).
3. Related Work
Recently, learning from demonstration (LfD) (Schaal, 1997) has received increasing attention as a promising way to overcome exploration difficulties in RL (Subramanian et al., 2016). Most LfD methods adopt value-based RL algorithms, which are off-policy in nature. For instance, DQfD (Hester et al., 2017) introduces LfD into DQN (Mnih et al., 2015) by adding demonstration data into the replay buffer; it uses a refined prioritized replay mechanism (Schaul et al., 2015) that gives additional priority to the demonstration data. However, DQfD is limited to applications with discrete action spaces. DDPGfD (Večerík et al., 2017) extends LfD to continuous action domains and is built upon DDPG (Lillicrap et al., 2015) in a way similar to DQfD. Both DQfD and DDPGfD under-exploit the demonstration data, as discussed above. Recently, Brys et al. (2015) introduced a reward-reshaping mechanism to encourage taking actions close to the demonstrated ones. They share a similar motivation with ours but take a different approach: they define a potential function based on a multivariate Gaussian to model the distribution of state-action pairs. Different from our POfD, all the above methods require a large amount of perfect demonstration data to achieve satisfactory performance.
Imitation learning aims at mimicking expert behaviors and can be realized in multiple ways. Recent successful imitation learning algorithms are based on Inverse Reinforcement Learning (IRL) (Ng et al., 2000), which targets recovering the reward function of a given task from samples, without knowing the dynamics. Following this line, Syed & Schapire (2008) and Syed et al. (2008) cast IRL as a two-player zero-sum game that is solved by alternating between fitting the reward function and selecting the policy. However, these approaches are limited to small-scale problems. Recently, Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016) has been shown to be very effective in high-dimensional continuous control problems. Instead of fitting the underlying reward function as traditional IRL algorithms do, GAIL uses a discriminator to distinguish whether a state-action pair comes from the expert or from the learned policy, while optimizing the policy to confuse the discriminator. Though effective for imitation learning, these algorithms cannot leverage the valuable feedback given by the environment and usually suffer a sharp performance decline when the expert data are imperfect. Our POfD overcomes these inherent limitations by also learning from environment feedback, and can learn well-performing policies even when the expert data are imperfect.