
[2018.02] Diversity is all you need: Learning skills without a reward function


Intelligent creatures can explore their environments and learn useful skills without supervision. In this paper, we propose “Diversity is All You Need”(DIAYN), a method for learning useful skills without a reward function. Our proposed method learns skills by maximizing an information-theoretic objective using a maximum entropy policy. On a variety of simulated robotic tasks, we show that this simple objective results in the unsupervised emergence of diverse skills, such as walking and jumping. We show how pre-trained skills can provide a good parameter initialization for downstream tasks and can be composed hierarchically to solve complex, sparse reward tasks. Our results suggest that unsupervised discovery of skills can serve as an effective pretraining mechanism for overcoming challenges of exploration and data efficiency in reinforcement learning. 

Our paper makes five contributions. First, we propose a method for learning useful skills without any rewards. We formalize our discriminability goal as maximizing an information-theoretic objective with a maximum entropy policy. Second, we show that this simple exploration objective results in the unsupervised emergence of diverse skills, such as running and jumping, on several simulated robotic tasks. In a number of RL benchmark environments, our method is able to solve the benchmark task despite never receiving the true task reward. In these environments, some of the learned skills correspond to solving the task, and each skill that solves the task does so in a distinct manner. Third, we propose a simple method for using learned skills for hierarchical RL and find this method solves challenging tasks. Fourth, we demonstrate how the discovered skills can be quickly adapted to solve a new task. Finally, we show how the discovered skills can be used for imitation learning.

Figure 1: DIAYN Algorithm: We update the discriminator to better predict the skill, and update the skill to visit diverse states that make it more discriminable.

URL: arxiv.org/pdf/1802.06070.pdf
Topic: Skill Discovery
Post: talkingaboutme.tistory.com/entry/RL-Meta-Reinforcement-Learning, lynnn.tistory.com/108

1. Background

Learning skills without reward has several practical applications. Environments with sparse rewards effectively have no reward until the agent randomly reaches a goal state. Learning useful skills without supervision may help address challenges in exploration in these environments. For long horizon tasks, skills discovered without reward can serve as primitives for hierarchical RL, effectively shortening the episode length. In many practical settings, interacting with the environment is essentially free, but evaluating the reward requires human feedback (Christiano et al., 2017). Unsupervised learning of skills may reduce the amount of supervision necessary to learn a task. 

We propose a method for learning diverse skills with deep RL in the absence of any rewards. We hypothesize that in order to acquire skills that are useful, we must train the skills so that they maximize coverage over the set of possible behaviors. While one skill might perform a useless behavior like random dithering, other skills should perform behaviors that are distinguishable from random dithering, and therefore more useful. A key idea in our work is to use discriminability between skills as an objective. However, skills that are merely distinguishable are not necessarily maximally diverse: a slight difference in states makes two skills distinguishable, but not necessarily diverse in a semantically meaningful way. To combat this problem, we want to learn skills that are not only distinguishable but also as diverse as possible. By learning distinguishable skills that are as random as possible, we can “push” the skills away from each other, making each skill robust to perturbations and effectively exploring the environment.

Previous work on hierarchical RL has learned skills to maximize a single, known reward function by jointly learning a set of skills and a meta-controller. One problem with joint training is that the meta-policy does not select “bad” options, so these options never receive any reward signal that would let them improve. Our work prevents this degeneracy by using a random meta-policy during unsupervised skill learning, such that neither the skills nor the meta-policy is aiming to solve any single task. A second important difference is that our approach learns skills with no reward at all. Eschewing a reward function not only avoids the difficult problem of reward design but also allows our method to learn task-agnostic skills.

2. Diversity is all you need

We consider an unsupervised RL paradigm in this work, where the agent is allowed an unsupervised “exploration” stage followed by a supervised stage. In our work, the aim of the unsupervised stage is to learn skills that eventually will make it easier to maximize the task reward in the supervised stage. Conveniently, because skills are learned without a priori knowledge of the task, the learned skills can be used for many different tasks.

Our method for unsupervised skill discovery, DIAYN (“Diversity is All You Need”), builds off of three ideas. First, for skills to be useful, we want the skill to dictate the states that the agent visits. Different skills should visit different states, and hence be distinguishable. Second, we want to use states, not actions, to distinguish skills, because actions that do not affect the environment are not visible to an outside observer. For example, an outside observer cannot tell how much force a robotic arm applies when grasping a cup if the cup does not move. Finally, we encourage exploration and incentivize the skills to be as diverse as possible by learning skills that act as randomly as possible. Skills with high entropy that remain discriminable must explore a part of the state space far away from other skills, lest the randomness in its actions lead it to states where it cannot be distinguished. 

We construct our objective using notation from information theory: $S$ and $A$ are random variables for states and actions, respectively; $Z \sim p(z)$ is a latent variable, on which we condition our policy; we refer to a policy conditioned on a fixed $Z$ as a "skill"; $I(\cdot; \cdot)$ and $\mathcal{H}[\cdot]$ refer to mutual information and Shannon entropy, both computed with base $e$. In our objective, we maximize the mutual information between skills and states, $I(S; Z)$, to encode the idea that the skill should control which states the agent visits. Conveniently, this mutual information also means that we can infer the skill from the states visited. To ensure that states, not actions, are used to distinguish skills, we minimize the mutual information between skills and actions given the state, $I(A; Z \mid S)$. Viewing all skills together with $p(z)$ as a mixture of policies, we maximize the entropy $\mathcal{H}[A \mid S]$ of this mixture policy.

In summary, we maximize,

$$
\begin{aligned}
\mathcal{F}(\theta) & \triangleq I(S ; Z)+\mathcal{H}[A \mid S]-I(A ; Z \mid S) \quad (1) \\
&=(\mathcal{H}[Z]-\mathcal{H}[Z \mid S])+\mathcal{H}[A \mid S]-(\mathcal{H}[A \mid S]-\mathcal{H}[A \mid S, Z]) \\
&=\mathcal{H}[Z]-\mathcal{H}[Z \mid S]+\mathcal{H}[A \mid S, Z]  \quad (2)
\end{aligned}
$$


We rearranged our objective in Equation 2 to give intuition on how we optimize it. The first term encourages our prior distribution over skills, $p(z)$, to have high entropy. We fix $p(z)$ to be uniform in our approach, guaranteeing that it has maximum entropy. The second term suggests that it should be easy to infer the skill $z$ from the current state. The third term suggests that each skill should act as randomly as possible, which we achieve by using a maximum entropy policy to represent each skill. As we cannot integrate over all states and skills to compute the posterior $p(z \mid s)$ exactly, we approximate it with a learned discriminator $q_{\phi}(z \mid s)$. Jensen's inequality tells us that replacing $p(z \mid s)$ with $q_{\phi}(z \mid s)$ gives us a variational lower bound $\mathcal{G}(\theta, \phi)$ on our objective $\mathcal{F}(\theta)$ (see Agakov (2004) for a detailed derivation):

$$
\begin{aligned}
\mathcal{F}(\theta) &=\mathcal{H}[A \mid S, Z]-\mathcal{H}[Z \mid S]+\mathcal{H}[Z] \\
&=\mathcal{H}[A \mid S, Z]+\mathbb{E}_{z \sim p(z), s \sim \pi(z)}[\log p(z \mid s)]-\mathbb{E}_{z \sim p(z)}[\log p(z)] \\
& \geq \mathcal{H}[A \mid S, Z]+\mathbb{E}_{z \sim p(z), s \sim \pi(z)}\left[\log q_{\phi}(z \mid s)-\log p(z)\right] \triangleq \mathcal{G}(\theta, \phi)
\end{aligned}
$$
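
To make the variational approximation more concrete, below is a minimal sketch (not the authors' code) of the discriminator $q_{\phi}(z \mid s)$ as a categorical classifier over skill indices: maximizing $\mathbb{E}[\log q_{\phi}(z \mid s)]$ amounts to minimizing cross-entropy on (state, skill) pairs collected during rollouts. The names `Discriminator`, `state_dim`, and `num_skills` are placeholders, not identifiers from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Approximate posterior q_phi(z | s): a categorical classifier over skill indices."""

    def __init__(self, state_dim: int, num_skills: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_skills),  # unnormalized logits over skills
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states)  # shape: (batch, num_skills)

def discriminator_loss(disc: Discriminator,
                       states: torch.Tensor,   # (batch, state_dim), visited states
                       skills: torch.Tensor) -> torch.Tensor:  # (batch,), integer skill ids
    """Maximizing E[log q_phi(z | s)] is equivalent to minimizing the cross-entropy
    between the discriminator's prediction and the skill that generated each state."""
    return F.cross_entropy(disc(states), skills)
```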

2.1. Implementation

We implement DIAYN with soft actor-critic (Haarnoja et al., 2018), learning a policy $\pi_{\theta}(a \mid s, z)$ that is conditioned on the latent variable $z$. Soft actor-critic maximizes the policy's entropy over actions, which takes care of the entropy term in our objective $\mathcal{G}$. Following Haarnoja et al. (2018), we scale the entropy regularizer $\mathcal{H}[a \mid s, z]$ by $\alpha$. We found empirically that $\alpha = 0.1$ provides a good trade-off between exploration and discriminability. We maximize the expectation in $\mathcal{G}$ by replacing the task reward with the following pseudo-reward:

$$
r_{z}(s, a) \triangleq \log q_{\phi}(z \mid s)-\log p(z)
$$
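
Assuming the hypothetical `Discriminator` sketched above and a uniform categorical $p(z)$ over `num_skills` skills, the pseudo-reward could be computed roughly as follows; `pseudo_reward` is an illustrative helper, not the paper's implementation.

```python
import math
import torch
import torch.nn.functional as F

def pseudo_reward(disc, state: torch.Tensor, skill: int, num_skills: int) -> float:
    """r_z(s) = log q_phi(z | s) - log p(z), with p(z) uniform over num_skills skills."""
    with torch.no_grad():
        log_q = F.log_softmax(disc(state.unsqueeze(0)), dim=-1)[0, skill]
    log_p = -math.log(num_skills)  # log of the uniform prior, p(z) = 1 / num_skills
    return (log_q - log_p).item()
```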

We use a categorical distribution for $p(z)$. During unsupervised learning, we sample a skill $z \sim p(z)$ at the start of each episode and act according to that skill throughout the episode. The agent is rewarded for visiting states that are easy to discriminate, while the discriminator is updated to better infer the skill $z$ from states visited. Entropy regularization occurs as part of the SAC update.
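
Putting the pieces together, one unsupervised DIAYN episode might look like the sketch below. It reuses the hypothetical `discriminator_loss` and `pseudo_reward` helpers from above; `env`, `policy`, `replay`, `disc_opt`, and `sac_update` are placeholder interfaces (the paper's implementation uses soft actor-critic), and a gym-style `env.step` API is assumed.

```python
import numpy as np
import torch

def diayn_episode(env, policy, disc, disc_opt, replay, num_skills, sac_update):
    """One unsupervised DIAYN episode (sketch): sample a skill, roll it out with the
    skill-conditioned policy, reward discriminability, then update both networks."""
    z = np.random.randint(num_skills)          # z ~ p(z), uniform categorical prior
    s = env.reset()
    done = False
    while not done:
        a = policy.act(s, z)                   # skill-conditioned policy pi_theta(a | s, z)
        s_next, _, done, _ = env.step(a)       # the true environment reward is discarded
        r = pseudo_reward(disc, torch.as_tensor(s_next, dtype=torch.float32), z, num_skills)
        replay.add(s, z, a, r, s_next, done)
        s = s_next

    # Update the discriminator to better infer z from the visited states.
    states, skills = replay.sample_state_skill_batch()
    loss = discriminator_loss(disc, states, skills)
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()

    # Update the policy with SAC on the pseudo-reward; the entropy term of the
    # objective is handled by SAC's own entropy regularization.
    sac_update(policy, replay)
```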