[2018.03] Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning

This work proposes to learn how to quickly and effectively adapt online to new tasks. To enable sample-efficient learning, we consider learning online adaptation in the context of model-based reinforcement learning. Our approach uses meta-learning to train a dynamics model prior such that, when combined with recent data, this prior can be rapidly adapted to the local context. 

URL: arxiv.org/abs/1803.11347
Topic: Meta-RL

1. Introduction

We specifically develop a model-based meta-reinforcement learning algorithm, where data for updating the model is readily available at every timestep in the form of recent experiences. More crucially, the meta-training process for such an adaptive model can be much more sample efficient than model-free meta-RL approaches. Further, our approach forgoes the episodic framework on which model-free meta-RL approaches rely, where tasks are pre-defined as different rewards or environments and exist only at the trajectory level. Instead, our method considers each timestep to potentially be a new “task,” where any detail or setting could have changed. This view induces a more general meta-RL problem setting, in which the notion of a task can represent anything from being in a different part of the state space to experiencing disturbances or attempting to achieve a new goal.

Our algorithm efficiently trains a global model that is capable of using its recent experiences to quickly adapt, achieving fast online adaptation in dynamic environments. We evaluate two versions of our approach, a recurrence-based adaptive learner (ReBAL) and a gradient-based adaptive learner (GrBAL), on stochastic and simulated continuous control tasks with complex contact dynamics (Fig. 2). We highlight not only the sample efficiency of our meta model-based reinforcement learning approach, but also the importance of fast online adaptation in the real world.

2. Preliminaries

2.1. Model-Based RL

Consider a Markov decision process (MDP) defined by the tuple $\left(\mathcal{S}, \mathcal{A}, p, r, \gamma, \rho_{0}, H\right) .$ Here, $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $p\left(\mathbf{s}^{\prime} \mid \mathbf{s}, \mathbf{a}\right)$ is the state transition distribution, $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is a bounded reward function, $\rho_{0}: \mathcal{S} \rightarrow \mathbb{R}_{+}$ is the initial state distribution, $\gamma$ is the discount factor, and $H$ is the horizon. A trajectory segment is denoted by $\tau(i, j):=\left(\mathbf{s}_{i}, \mathbf{a}_{i}, \ldots, \mathbf{s}_{j}, \mathbf{a}_{j}, \mathbf{s}_{j+1}\right)$. 

Model-based RL aims to solve this problem by learning the transition distribution $p\left(\mathbf{s}^{\prime} \mid \mathbf{s}, \mathbf{a}\right)$, which is also referred to as the dynamics model. This can be done using a function approximator $\hat{p}_{\boldsymbol{\theta}}\left(\mathbf{s}^{\prime} \mid \mathbf{s}, \mathbf{a}\right)$ to approximate the dynamics, where the weights $\boldsymbol{\theta}$ are optimized to maximize the log-likelihood of the observed data $\mathcal{D}$. In practice, this model is then used in the process of action selection by either producing data points from which to train a policy, or by producing predictions and dynamics constraints to be optimized by a controller.
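As a concrete (and hedged) illustration, the following PyTorch sketch fits such a dynamics model by maximizing log-likelihood; assuming a fixed-variance Gaussian, this reduces to minimizing a squared error on the predicted next state. The class and variable names are illustrative, not the paper's implementation.

```python
# Minimal sketch: fitting a dynamics model p_hat_theta(s' | s, a) by maximum likelihood.
# Assumes a fixed-variance Gaussian, so the NLL reduces to a mean squared error.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a):
        # Predict the change in state; s' ≈ s + Δ is a common parameterization.
        return s + self.net(torch.cat([s, a], dim=-1))

def nll_loss(model, s, a, s_next):
    # Negative log-likelihood under a unit-variance Gaussian ∝ squared error.
    return ((model(s, a) - s_next) ** 2).mean()

# Usage sketch:
# model = DynamicsModel(state_dim, action_dim)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = nll_loss(model, s_batch, a_batch, s_next_batch)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```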

2.2. Meta-Learning

Meta-learning leverages data from previous tasks to acquire a learning procedure that can quickly adapt to new tasks. These methods operate under the assumption that the previous meta-training tasks and the new meta-test tasks are drawn from the same task distribution $\rho(\mathcal{T})$ and share a common structure that can be exploited for fast learning. In the supervised learning setting, we aim to learn a function $f_{\boldsymbol{\theta}}$ with parameters $\boldsymbol{\theta}$ that minimizes a supervised loss $\mathcal{L}_{\mathcal{T}}$. The goal of meta-learning is then to find a learning procedure, denoted as:

$\boldsymbol{\theta}^{\prime}=u_{\boldsymbol{\psi}}\left(\mathcal{D}_{\mathcal{T}}^{\mathrm{tr}},  \boldsymbol{\theta}\right)$,

that can learn a range of tasks $\mathcal{T}$ from small datasets $\mathcal{D}_{\mathcal{T}}^{\mathrm{tr}}$. We can formalize this meta-learning problem setting as optimizing for the parameters of the learning procedure $\boldsymbol{\theta}, \boldsymbol{\psi}$ as follows:

$\min _{\boldsymbol{\theta}, \boldsymbol{\psi}} \mathbb{E}_{\mathcal{T} \sim \rho(\mathcal{T})}\left[\mathcal{L}\left(\mathcal{D}_{\mathcal{T}}^{\text {test }}, \boldsymbol{\theta}^{\prime}\right)\right] \quad$ s.t. $\quad \boldsymbol{\theta}^{\prime}=u_{\boldsymbol{\psi}}\left(\mathcal{D}_{\mathcal{T}}^{\mathrm{tr}}, \boldsymbol{\theta}\right)$

where $\mathcal{D}_{\mathcal{T}}^{\mathrm{tr}}, \mathcal{D}_{\mathcal{T}}^{\text {test }}$ are sampled without replacement from the meta-training dataset $\mathcal{D}_{\mathcal{T}}$. Once meta-training optimizes for the parameters $\boldsymbol{\theta}_{*}, \boldsymbol{\psi}_{*}$, the learning procedure $u_{\boldsymbol{\psi}}(\cdot, \boldsymbol{\theta})$ can then be used to learn new held-out tasks from small amounts of data. We will also refer to the learning procedure $u$ as the update function. 

Gradient-based meta-learning: Model-agnostic meta-learning (MAML) (Finn et al., 2017) aims to learn the initial parameters of a neural network such that taking one or several gradient descent steps from this initialization leads to effective few-shot generalization to new tasks. When presented with new tasks, the model with the meta-learned initialization can then be quickly fine-tuned using a few data points from the new tasks. Using the notation from before, MAML uses gradient descent as its learning algorithm:

$u_{\boldsymbol{\psi}}\left(\mathcal{D}_{\mathcal{T}}^{\mathrm{tr}}, \boldsymbol{\theta}\right)=\boldsymbol{\theta}-\alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}\left(\mathcal{D}_{\mathcal{T}}^{\mathrm{tr}}, \boldsymbol{\theta}\right)$

The learning rate $\alpha$ may be a learnable parameter (in which case $\psi=\alpha$ ) or fixed as a hyperparameter, leading to $\psi=\varnothing$. Despite the update rule being fixed, a learned initialization of an overparameterized deep network followed by gradient descent is as expressive as update rules represented by deep recurrent networks (Finn and Levine, 2017). 
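To make this concrete, here is a hedged PyTorch sketch of a one-step MAML-style update for a dynamics model, treating $\alpha$ as a fixed hyperparameter (so $\psi=\varnothing$). The squared-error loss and all names are assumptions rather than the paper's exact setup; it relies on the functional API available in PyTorch 2.x.

```python
# Sketch of the MAML update rule u_psi(D_tr, theta) = theta - alpha * grad L(D_tr, theta).
import torch
from torch.func import functional_call, grad  # PyTorch >= 2.0

def transition_loss(model, params, batch):
    # NLL under a fixed-variance Gaussian ∝ mean squared error on the next state.
    s, a, s_next = batch
    pred = functional_call(model, params, (s, a))
    return ((pred - s_next) ** 2).mean()

def maml_inner_update(model, theta, train_batch, alpha=0.01):
    # One gradient step from the meta-learned initialization theta.
    grads = grad(lambda p: transition_loss(model, p, train_batch))(theta)
    return {name: theta[name] - alpha * grads[name] for name in theta}

def maml_meta_loss(model, theta, train_batch, test_batch, alpha=0.01):
    # Outer objective: evaluate the adapted parameters theta' on held-out data D_test.
    theta_prime = maml_inner_update(model, theta, train_batch, alpha)
    return transition_loss(model, theta_prime, test_batch)

# Outer step (second-order, sketch):
# meta_grads = grad(lambda p: maml_meta_loss(model, p, d_tr, d_te))(theta)
```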

Recurrence-based meta-learning: Another approach to meta-learning is to use recurrent models. In this case, the update function is always learned, and ψ corresponds to the weights of the recurrent model that update the hidden state. The parameters θ of the prediction model correspond to the remainder of the weights of the recurrent model and the hidden state. Both gradient-based and recurrence-based meta-learning methods have been used for meta model-free RL (Finn et al., 2017; Duan et al., 2016). We will build upon these ideas to develop a meta model-based RL algorithm that enables adaptation in dynamic environments, in an online way.

3. Meta-Learning for Online Model Adaptation

Our notion of a task is slightly more fluid: every segment of a trajectory can be considered a different “task,” and observations from the past M timesteps (rather than the past M episodes) provide information about the current task setting. Since changes in system dynamics, terrain details, or other environmental factors can occur at any time, we consider, at every timestep, the problem of adapting the model using the past M timesteps to predict the next K timesteps. In this setting, M and K are pre-specified hyperparameters.

In this work, we use the notion of environment $\mathcal{E}$ to denote different settings or configurations of a particular problem, ranging from malfunctions in the system's joints to the state of external disturbances. We assume a distribution of environments $\rho(\mathcal{E})$ that share some common structure, such as the same observation and action space, but may differ in their dynamics $p_{\mathcal{E}}\left(\mathbf{s}^{\prime} \mid \mathbf{s}, \mathbf{a}\right)$. We denote a trajectory segment by $\tau_{\mathcal{E}}(i, j)$, which represents a sequence of states and actions $\left(\mathbf{s}_{i}, \mathbf{a}_{i}, \ldots, \mathbf{s}_{j}, \mathbf{a}_{j}, \mathbf{s}_{j+1}\right)$ sampled within an environment $\mathcal{E}$. Our algorithm assumes that the environment is locally consistent, in that every segment of length $j-i$ has the same environment. Even though this assumption is not always correct, it allows us to learn to adapt from data without knowing when the environment has changed. 

We pose the meta-RL problem in this setting as an optimization over $(\boldsymbol{\theta}, \boldsymbol{\psi})$ with respect to a maximum likelihood meta-objective. The meta-objective is the likelihood of the data under a predictive model $\hat{p}_{\boldsymbol{\theta}^{\prime}}\left(\mathbf{s}^{\prime} \mid \mathbf{s}, \mathbf{a}\right)$ with parameters $\boldsymbol{\theta}^{\prime}$, where $\boldsymbol{\theta}^{\prime}=u_{\boldsymbol{\psi}}\left(\tau_{\mathcal{E}}(t-M, t-1), \boldsymbol{\theta}\right)$ corresponds to model parameters that were updated using the past $M$ data points. Concretely, this corresponds to the following optimization: 

$\min _{\boldsymbol{\theta}, \boldsymbol{\psi}} \mathbb{E}_{\tau_{\mathcal{E}}(t-M, t+K) \sim \mathcal{D}}\left[\mathcal{L}\left(\tau_{\mathcal{E}}(t, t+K), \boldsymbol{\theta}_{\mathcal{E}}^{\prime}\right)\right] \quad$ s.t.: $\quad \boldsymbol{\theta}_{\mathcal{E}}^{\prime}=u_{\boldsymbol{\psi}}\left(\tau_{\mathcal{E}}(t-M, t-1), \boldsymbol{\theta}\right) \quad (3)$

where $\tau_{\mathcal{E}}(t-M, t+K) \sim \mathcal{D}$ corresponds to trajectory segments sampled from our previous experience, and the loss $\mathcal{L}$ corresponds to the negative log-likelihood of the data under the model:

$\mathcal{L}\left(\tau_{\mathcal{E}}(t, t+K), \boldsymbol{\theta}_{\mathcal{E}}^{\prime}\right) \triangleq-\frac{1}{K} \sum_{k=t}^{t+K} \log \hat{p}_{\boldsymbol{\theta}_{\mathcal{E}}^{\prime}}\left(\mathbf{s}_{k+1} \mid \mathbf{s}_{k}, \mathbf{a}_{k}\right) \quad (4)$

In the meta-objective in Equation 3, note that the past $M$ points are used to adapt $\theta$ into $\theta^{\prime}$, and the loss of this $\boldsymbol{\theta}^{\prime}$ is evaluated on the future $K$ points. Thus, we use the past $M$ timesteps to provide insight into how to adapt our model to perform well for nearby future timesteps. As outlined in Algorithm 1, the update rule $u_{\psi}$ for the inner update and a gradient step on $\theta$ for the outer update allow us to optimize this meta-objective of adaptation. Thus, we achieve fast adaptation at test time by being able to fine-tune the model using just $M$ data points. While we focus on reinforcement learning problems in our experiments, this meta-learning approach could be used for learning to adapt online in a variety of sequence modeling domains. We present our algorithm using both a recurrence and a gradient-based meta-learner, as we discuss next.
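The following sketch shows one way the meta-objective of Equations 3 and 4 could be assembled from replay data: sample a segment of length M + K from a single (locally consistent) environment, adapt on the first M transitions with the update rule $u_{\psi}$, and score the adapted parameters on the next K. The data layout and function names are illustrative assumptions, not the authors' code.

```python
# Sketch of the adaptation meta-objective: adapt on the past M steps, evaluate on the next K.
import random

def sample_segment(trajectory, M, K):
    # trajectory: list of (s, a, s') transitions collected in one environment E.
    t = random.randint(M, len(trajectory) - K)
    past = trajectory[t - M:t]       # tau_E(t-M, t-1): adaptation data
    future = trajectory[t:t + K]     # tau_E(t, t+K): evaluation data
    return past, future

def meta_objective(model, theta, trajectory, M, K, update_fn, nll):
    # L(tau_E(t, t+K), theta'_E) with theta'_E = u_psi(tau_E(t-M, t-1), theta);
    # averaging over sampled segments and environments gives the outer training loss.
    past, future = sample_segment(trajectory, M, K)
    theta_prime = update_fn(model, theta, past)
    return nll(model, theta_prime, future)
```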

Gradient-Based Adaptive Learner (GrBAL): GrBAL uses gradient-based meta-learning to perform online adaptation; in particular, we use MAML (Finn et al., 2017). In this case, our update rule is prescribed by gradient descent (Equation 5):

$\boldsymbol{\theta}_{\mathcal{E}}^{\prime}=u_{\boldsymbol{\psi}}\left(\tau_{\mathcal{E}}(t-M, t-1), \boldsymbol{\theta}\right)=\boldsymbol{\theta}_{\mathcal{E}}+\boldsymbol{\psi} \nabla_{\boldsymbol{\theta}} \frac{1}{M} \sum_{m=t-M}^{t-1} \log \hat{p}_{\boldsymbol{\theta}_{\mathcal{E}}}\left(\mathbf{s}_{m+1} \mid \mathbf{s}_{m}, \mathbf{a}_{m}\right) \quad (5)$
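A minimal sketch of this inner update, with $\psi$ shown as a single learned step size and `avg_log_likelihood` standing in for the average log-likelihood over the past M transitions (both are assumptions for illustration):

```python
# Sketch of the GrBAL adaptation step (Equation 5): gradient ascent on the
# average log-likelihood of the M most recent transitions, starting from theta.
from torch.func import grad

def grbal_adapt(model, theta, psi, past_batch, avg_log_likelihood):
    grads = grad(lambda p: avg_log_likelihood(model, p, past_batch))(theta)
    return {name: theta[name] + psi * grads[name] for name in theta}
```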

Recurrence-Based Adaptive Learner (ReBAL): ReBAL, instead, utilizes a recurrent model, which learns its own update rule (i.e., through its internal gating structure). In this case, ψ and uψ correspond to the weights of the recurrent model that update its hidden state.
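As an illustration of this idea, a small recurrent dynamics model might look like the sketch below, where the GRU weights play the role of ψ (the learned update rule over the hidden state) and the prediction head, together with the hidden state, plays the role of θ. The architecture details are assumptions, not the paper's exact design.

```python
# Sketch of a ReBAL-style recurrent dynamics model: the hidden state summarizes
# the last M transitions and conditions the next-state prediction.
import torch
import torch.nn as nn

class RecurrentDynamics(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(state_dim + action_dim, hidden, batch_first=True)  # update rule (psi)
        self.head = nn.Linear(hidden + state_dim + action_dim, state_dim)    # prediction weights (theta)

    def forward(self, past_sa, s, a):
        # past_sa: (batch, M, state_dim + action_dim) recent transitions.
        _, h = self.rnn(past_sa)
        context = h[-1]                              # current task/environment context
        return s + self.head(torch.cat([context, s, a], dim=-1))
```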

4. Model-based Meta-Reinforcement Learning

Now that we have discussed our approach for enabling online adaptation, we next propose how to build upon this idea to develop a model-based meta-reinforcement learning algorithm. First, we explain how the agent can use the adapted model to perform a task, given parameters $\boldsymbol{\theta}_{*}$ and $\boldsymbol{\psi}_{*}$ from optimizing the meta-learning objective.

Given $\boldsymbol{\theta}_{*}$ and $\boldsymbol{\psi}_{*}$, we use the agent's recent experience to adapt the model parameters: $\boldsymbol{\theta}_{*}^{\prime}=u_{\boldsymbol{\psi}_{*}}\left(\tau(t-M, t), \boldsymbol{\theta}_{*}\right)$. This results in a model $\hat{p}_{\boldsymbol{\theta}_{*}^{\prime}}$ that better captures the local dynamics in the current setting, task, or environment. This adapted model is then passed to our controller, along with the reward function $r$ and a planning horizon $H$. We use a planning horizon $H$ that is smaller than the adaptation horizon $K$, since the adapted model is only valid within the current context. We use model predictive path integral control (MPPI) (Williams et al., 2015), but, in principle, our model adaptation approach is agnostic to the model predictive control (MPC) method used.

The use of MPC compensates for model inaccuracies by preventing accumulating errors, since we replan at each time step using updated state information. MPC also offers further benefits in this setting of online adaptation, because the model $\hat{p}_{\boldsymbol{\theta}_{*}^{\prime}}$ itself will also have improved by the next time step. After taking each step, we append the resulting state transition to our dataset, reset the model parameters back to $\boldsymbol{\theta}_{*}$, and repeat the entire planning process. See Algorithm 2 for this adaptation procedure. Finally, in addition to using it at test time, we also perform this online adaptation procedure during the meta-training phase itself, to provide on-policy rollouts for meta-training. For the complete meta-RL algorithm, see Algorithm 1.
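To make the per-timestep adapt-plan-act loop of Algorithm 2 concrete, here is a hedged sketch; `mpc_plan` is a placeholder for the MPPI controller, `adapt_fn` stands in for $u_{\boldsymbol{\psi}_{*}}$, and a classic gym-style environment interface is assumed.

```python
# Sketch of test-time online adaptation (Algorithm 2): adapt from the meta-learned
# prior theta* using the last M transitions, plan with MPC, act, and repeat.
def run_online_adaptation(env, model, theta_star, adapt_fn, mpc_plan, M, H, episode_len):
    recent = []                                  # rolling buffer of recent transitions
    s = env.reset()
    for t in range(episode_len):
        # Adaptation always starts from theta*; the adapted parameters are discarded
        # after acting, so errors do not accumulate across timesteps.
        theta_adapted = adapt_fn(model, theta_star, recent[-M:]) if len(recent) >= M else theta_star
        a = mpc_plan(model, theta_adapted, s, horizon=H)   # replan at every step
        s_next, reward, done, info = env.step(a)           # gym-style step assumed
        recent.append((s, a, s_next))
        s = s_next
        if done:
            break
```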