
[2018.05] Learning Attentional Communication for Multi-Agent Cooperation


Communication could potentially be an effective way for multi-agent cooperation. However, information sharing among all agents or in predefined communication architectures that existing methods adopt can be problematic. When there is a large number of agents, agents cannot differentiate valuable information that helps cooperative decision-making from globally shared information. Therefore, communication barely helps, and could even impair the learning of multi-agent cooperation. Predefined communication architectures, on the other hand, restrict communication among agents and thus restrain potential cooperation. To tackle these difficulties, in this paper, we propose an attentional communication model that learns when communication is needed and how to integrate shared information for cooperative decision making. Our model leads to efficient and effective communication for large-scale multi-agent cooperation. Empirically, we show the strength of our model in a variety of cooperative scenarios, where agents are able to develop more coordinated and sophisticated strategies than existing methods.

Figure 1: ATOC architecture.

URL: arxiv.org/abs/1805.07733
Topic: Multi-Agent RL

1. Introduction

Multi-agent reinforcement learning (MARL) can be simply seen as independent RL, where each learner treats the other agents as part of its environment. However, the strategies of other agents are uncertain and changing as training progresses, so the environment becomes unstable from the perspective of any individual agent and thus it is hard for agents to collaborate. Moreover, policies learned using independent RL can easily overfit to other agents’ policies [10].

We argue one of the keys to solving this problem is communication, which could enhance strategy coordination. There are several approaches for learning communication in MARL, including DIAL [4], CommNet [23], BiCNet [19], and Master-Slave [8]. However, information sharing among all agents, or within the predefined communication architectures these methods adopt, can be problematic. When there is a large number of agents, agents cannot differentiate valuable information that helps cooperative decision-making from globally shared information. Moreover, in real-world applications, it is costly and computationally inefficient for all agents to communicate with each other. Predefined communication architectures, e.g., Master-Slave [8], might help, but they restrict communication to specific agents and thus restrain potential cooperation.

To tackle these difficulties, we propose an attentional communication model, called ATOC, to enable agents to learn effective and efficient communication under a partially observable distributed environment for large-scale MARL. The attention unit in ATOC determines whether the agent should communicate with other agents to cooperate in its observable field. If so, the agent, called initiator, selects collaborators to form a communication group for coordinated strategies. We exploit a bidirectional LSTM unit as the communication channel to connect each agent within a communication group. The LSTM unit takes as input internal states (i.e., encoding of local observation and action intention) and returns thoughts that guide agents for coordinated strategies. Unlike CommNet and BiCNet that perform arithmetic mean and weighted mean of internal states, respectively, our LSTM unit selectively outputs important information for cooperative decision making, which makes it possible for agents to learn coordinated strategies in dynamic communication environments.

We implement ATOC as an extension of the actor-critic model. Since all agents share the parameters of the policy network, Q-network, attention unit, and communication channel, ATOC is suitable for large-scale multi-agent environments. We empirically show the success of ATOC in three scenarios, which correspond to the cooperation of agents for local reward, a shared global reward, and reward in competition, respectively. 

2. Related Work

Recently, several end-to-end trainable models have been proven effective to learn communication in MARL. DIAL [4] is the first to propose learnable communication via backpropagation with deep Q-networks. At each timestep, an agent generates its message as the input of other agents for the next timestep. Gradients flow from one agent to another through the communication channel, bringing rich feedback to train an effective channel. However, the communication of DIAL is rather simple, just selecting predefined messages. Further, communication in terms of sequences of discrete symbols is investigated in [7] and [18]. 

CommNet [23] is a large feed-forward neural network that maps the inputs of all agents to their actions, where each agent occupies a subset of units and additionally has access to a broadcasting communication channel to share information. At a single communication step, each agent sends its hidden state as the communication message to the channel; the averaged message from the other agents is the input to the next layer. However, CommNet is a single large network for all agents, so it cannot easily scale and would perform poorly in environments with a large number of agents. It is worth mentioning that CommNet has been extended to abstractive summarization [2] in natural language processing.

BiCNet [19] is based on the actor-critic model for continuous action, using recurrent networks to connect each individual agent's policy and value networks. BiCNet is able to handle real-time strategy games such as StarCraft micromanagement tasks. Master-Slave [8] is also a communication architecture for real-time strategy games, where the action of each slave agent is composed of contributions from both the slave agent and the master agent. However, both works assume that agents know the global state of the environment, which is not realistic in practice. Moreover, their predefined communication architectures restrict communication and hence restrain potential cooperation among agents; as a result, they cannot adapt to changing scenarios.

MADDPG [14] is an extension of the actor-critic model for mixed cooperative-competitive environments. COMA [5] is proposed to solve multi-agent credit assignment in cooperative settings. MADDPG and COMA both use a centralized critic that takes as input the observations and actions of all agents. However, MADDPG and COMA have to train an independent policy network for each agent, where each agent would learn a policy specializing in specific tasks [11], and the policy network easily overfits to the number of agents. Therefore, MADDPG and COMA are infeasible in large-scale MARL. Mean Field [24] takes as input an agent's observation and the average action of its neighboring agents to make decisions. However, the mean action eliminates the differences among neighboring agents in terms of action and observation and thus incurs the loss of important information that could help cooperative decision-making.

3. Background

Deterministic Policy Gradient (DPG): Different from value-based algorithms like DQN, the main idea of policy gradient methods is to directly adjust the parameters $\theta$ of the policy to maximize the objective $J(\theta)=\mathbb{E}_{s \sim p^{\pi}, a \sim \pi_{\theta}}[R]$ along the direction of the policy gradient $\nabla_{\theta} J(\theta)$, which can be written as $\nabla_{\theta} J(\theta)=\mathbb{E}_{s \sim p^{\pi}, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s) Q^{\pi}(s, a)\right]$. This can be further extended to deterministic policies [21] $\mu_{\theta}: \mathcal{S} \mapsto \mathcal{A}$, giving $\nabla_{\theta} J(\theta)=\mathbb{E}_{s \sim \mathcal{D}}\left[\left.\nabla_{\theta} \mu_{\theta}(s) \nabla_{a} Q^{\mu}(s, a)\right|_{a=\mu_{\theta}(s)}\right]$. To ensure $\nabla_{a} Q^{\mu}(s, a)$ exists, the action space must be continuous.

Deep Deterministic Policy Gradient (DDPG): DDPG [13] is an actor-critic algorithm based on DPG. It respectively uses deep neural networks parameterized by $\theta^{\mu}$ and $\theta^{Q}$ to approximate the deterministic policy $a=\mu\left(s \mid \theta^{\mu}\right)$ and action-value function $Q\left(s, a \mid \theta^{Q}\right)$. The policy network infers actions according to states, corresponding to the actor; the Q-network approximates the value function of the state-action pair and provides the gradient, corresponding to the critic. 
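To make the actor/critic roles concrete, here is a minimal DDPG update sketch in PyTorch that instantiates the deterministic policy gradient above. The network sizes, learning rates, and the assumption that `batch` comes from a replay buffer are illustrative only; target networks and exploration noise, which DDPG uses in practice, are omitted for brevity.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # illustrative dimensions

# Actor mu(s | theta_mu) and critic Q(s, a | theta_Q)
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(batch, gamma=0.99):
    s, a, r, s_next = batch  # tensors sampled from a replay buffer (assumed), r has shape [batch, 1]

    # Critic: regress Q(s, a) toward the one-step bootstrap target.
    with torch.no_grad():
        target = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=-1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=-1)), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient -- ascend Q(s, mu(s)),
    # with gradients flowing through the action into theta_mu.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```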

Recurrent Attention Model (RAM): When perceiving an image, instead of processing the whole perception field, humans focus attention on important parts, obtaining information when and where it is needed, and then move from one part to another. RAM [16] uses an RNN to model this attention mechanism. At each timestep, the agent obtains and processes a partial observation via a bandwidth-limited sensor. The glimpse feature extracted from past observations is stored in an internal state, which is encoded in the hidden layer of the RNN. By decoding the internal state, the agent decides both the next location of the sensor and the action to take in the environment.
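A rough sketch of this loop in PyTorch is shown below. The recurrent cell choice, layer sizes, and module names are assumptions for illustration, not the exact RAM architecture.

```python
import torch
import torch.nn as nn

class RecurrentAttention(nn.Module):
    def __init__(self, glimpse_dim=64, hidden_dim=128, num_actions=10):
        super().__init__()
        self.glimpse_net = nn.Linear(glimpse_dim + 2, hidden_dim)  # glimpse features + (x, y) sensor location
        self.core = nn.GRUCell(hidden_dim, hidden_dim)             # internal state of the attention RNN
        self.locator = nn.Linear(hidden_dim, 2)                    # decodes where to look next
        self.actor = nn.Linear(hidden_dim, num_actions)            # decodes the action on the environment

    def step(self, glimpse_feat, loc, h):
        # Fold the current partial observation into the internal state,
        # then decode the next sensor location and the action.
        g = torch.relu(self.glimpse_net(torch.cat([glimpse_feat, loc], dim=-1)))
        h = self.core(g, h)
        next_loc = torch.tanh(self.locator(h))
        action_logits = self.actor(h)
        return next_loc, action_logits, h
```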

4. Methods

ATOC is instantiated as an extension of the actor-critic model, but it can also be realized using value-based methods. ATOC consists of a policy network, a Q-network, an attention unit, and a communication channel, as illustrated in Figure 1.

We consider a partially observable distributed environment for MARL, where each agent $i$ receives a local observation $o_{t}^{i}$ correlated with the state $s_{t}$ at time $t$. The policy network takes the local observation as input and extracts a hidden layer as a thought, which encodes both local observation and action intention, represented as $h_{t}^{i}=\mu_{\mathrm{I}}\left(o_{t}^{i} ; \theta^{\mu}\right)$. Every $T$ timesteps, the attention unit takes $h_{t}^{i}$ as input and determines whether communication is needed for cooperation in the agent's observable field. If so, the agent, called an initiator, selects other agents, called collaborators, in its field to form a communication group, and the group stays the same for $T$ timesteps. When $T$ is equal to 1, communication (when and for how long to communicate) is fully determined by the attention unit; $T$ can also be tuned for the consistency of cooperation. The communication channel connects the agents of the communication group, takes as input the thought of each agent, and outputs an integrated thought that guides the agents to generate coordinated actions. The integrated thought $\tilde{h}_{t}^{i}$ is merged with $h_{t}^{i}$ and fed into the rest of the policy network. The policy network then outputs the action $a_{t}^{i}=\mu_{\mathrm{II}}\left(h_{t}^{i}, \tilde{h}_{t}^{i} ; \theta^{\mu}\right)$. By sharing encodings of local observation and action intention within a dynamically formed group, individual agents can build up a relatively more global perception of the environment, infer the intent of other agents, and cooperate on decision making.
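To make this data flow concrete, below is a minimal sketch of one ATOC agent's components in PyTorch, following the description above. All layer sizes are assumptions; in ATOC the parameters of $\mu_{\mathrm{I}}$, $\mu_{\mathrm{II}}$, the attention unit, and the channel are shared across agents.

```python
import torch
import torch.nn as nn

class ATOCAgent(nn.Module):
    def __init__(self, obs_dim=30, thought_dim=64, act_dim=2):
        super().__init__()
        # mu_I: local observation -> thought h_t (encodes observation and action intention)
        self.mu_I = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, thought_dim))
        # Attention unit: thought -> probability that communication is needed
        self.attention = nn.Sequential(nn.Linear(thought_dim, 32), nn.ReLU(),
                                       nn.Linear(32, 1), nn.Sigmoid())
        # Communication channel g: bidirectional LSTM over the thoughts of a group
        self.channel = nn.LSTM(thought_dim, thought_dim // 2, bidirectional=True, batch_first=True)
        # mu_II: (h_t, integrated thought h~_t) -> action
        self.mu_II = nn.Sequential(nn.Linear(2 * thought_dim, 64), nn.ReLU(),
                                   nn.Linear(64, act_dim), nn.Tanh())

    def thought(self, obs):
        return self.mu_I(obs)

    def wants_to_communicate(self, h):
        return self.attention(h)  # evaluated every T timesteps

    def communicate(self, group_thoughts):
        # group_thoughts: (1, group_size, thought_dim) -- thoughts of one communication group
        integrated, _ = self.channel(group_thoughts)
        return integrated  # integrated thoughts h~_t, one per group member

    def act(self, h, h_tilde):
        return self.mu_II(torch.cat([h, h_tilde], dim=-1))
```

When the attention unit decides against communication, one simple choice (an assumption of this sketch, not specified above) is to feed a zero vector in place of $\tilde{h}_{t}^{i}$.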

4.1. Attention Model

Our attention unit never senses the environment in full; it only uses the encoding of an agent's observable field and action intention to decide whether communication would help cooperation. The attention unit can be instantiated by an RNN or an MLP. The first part of the actor network, which produces the thought, corresponds to the glimpse network, and the thought $h_{t}^{i}$ can be considered the glimpse feature vector. The attention unit takes the thought representation as input and produces the probability that the observable field of the agent becomes an attention focus (i.e., the probability of communication).

4.2. Communications

When an initiator selects its collaborators, it only considers the agents in its observable field and ignores those it cannot perceive. There are three types of agents in the observable field of the initiator: other initiators; agents that have already been selected by other initiators; and agents that have not been selected. We assume a fixed communication bandwidth, which means each initiator can select at most $m$ collaborators. The initiator first chooses collaborators from agents that have not been selected, then from agents selected by other initiators, and finally from other initiators, all based on proximity, as sketched below.
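The selection priority can be expressed as a simple ranking. In the sketch below, only the ordering rule and the bandwidth limit $m$ come from the text; the data layout and the function signature are assumptions for illustration.

```python
def select_collaborators(visible_agents, m):
    """visible_agents: list of (agent_id, distance_to_initiator, status) tuples,
    where status is 'unselected', 'selected', or 'initiator'."""
    priority = {'unselected': 0, 'selected': 1, 'initiator': 2}
    # Prefer unselected agents, then agents already chosen by other initiators,
    # then other initiators; within each class, prefer closer agents.
    ranked = sorted(visible_agents, key=lambda a: (priority[a[2]], a[1]))
    return [agent_id for agent_id, _, _ in ranked[:m]]
```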

When an agent is selected by multiple initiators, it participates in the communication of each group. Suppose agent $k$ is selected by two initiators $p$ and $q$, in that order. Agent $k$ first participates in the communication of $p$'s group, where the communication channel integrates their thoughts: $\left\{\tilde{h}_{t}^{p}, \cdots, \tilde{h}_{t}^{k^{\prime}}\right\}=g\left(h_{t}^{p}, \cdots, h_{t}^{k}\right)$. Then agent $k$ communicates with $q$'s group: $\left\{\tilde{h}_{t}^{q}, \cdots, \tilde{h}_{t}^{k^{\prime \prime}}\right\}=g\left(h_{t}^{q}, \cdots, \tilde{h}_{t}^{k^{\prime}}\right)$. The agent shared by multiple groups bridges the information gap and strategy division among individual groups: it can disseminate the thought within one group to other groups, which can eventually lead to coordinated strategies among the groups. This is especially critical when all agents collaborate on a single task. In addition, to deal with the issues of role assignment and heterogeneous agent types, we can fix the position of agents who participate in the communication. The bidirectional LSTM unit acts as the communication channel: it integrates the internal states of the agents within a group and guides them towards coordinated decision-making.
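A rough sketch of this sequential integration is given below, assuming `channel` is the shared communication channel $g$ (e.g., the `communicate` method from the earlier ATOC sketch); the tensor layout and argument names are illustrative assumptions.

```python
import torch

def communicate_with_shared_agent(channel, thoughts_p, thoughts_q, k_in_p, k_in_q):
    """thoughts_p / thoughts_q: (group_size, thought_dim) thoughts of p's and q's groups;
    k_in_p / k_in_q: index of the shared agent k within each group."""
    # p's group communicates first; agent k's thought becomes h~_k'.
    integrated_p = channel(thoughts_p.unsqueeze(0)).squeeze(0)
    h_k_prime = integrated_p[k_in_p]

    # Agent k then joins q's group carrying h~_k' in place of its raw thought,
    # so information from p's group flows into q's group.
    thoughts_q = thoughts_q.clone()
    thoughts_q[k_in_q] = h_k_prime
    integrated_q = channel(thoughts_q.unsqueeze(0)).squeeze(0)
    return integrated_p, integrated_q
```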