Chapter 20 introduced the problems studied in multi-agent reinforcement learning and the most basic solution paradigms. This chapter presents a classic and effective advanced paradigm: centralized training with decentralized execution (CTDE). In CTDE, training makes use of global information that no single agent can observe on its own in order to obtain a better training signal, while execution does not use this information: each agent acts purely according to its own policy, achieving decentralized execution. CTDE algorithms can therefore exploit global information during training for better and more stable learning, yet rely only on local information at policy-inference time, which gives them a degree of scalability. CTDE can be compared to how a football team trains and plays: during training, the 11 players receive direct guidance from the coach to coordinate as a team, and the coach, who sees the whole match, gives instructions from the perspective of the entire team and the entire game; once trained, the 11 players make decisions on the pitch directly from the real-time situation, without the coach's guidance.
CTDE algorithms fall into two main families: value-based methods, such as VDN and QMIX, and Actor-Critic methods, such as MADDPG and COMA. This chapter focuses on the MADDPG algorithm.
Multi-agent DDPG (MADDPG), as its name suggests, runs a DDPG-style algorithm for each agent. Each agent maintains its own centralized Critic network, which takes the observations and actions of all agents as input and guides that agent's Actor network during training; at execution time each agent's Actor acts entirely on its own local observation, i.e., execution is decentralized.
The application scenarios of CTDE algorithms can usually be modeled as a partially observable Markov game: let $S$ denote the space of global states shared by the $N$ agents. Each agent $i$ has an action space $A_i$ and an observation space $O_i$; its policy $\pi_{\theta_i}: O_i \times A_i \to [0,1]$ gives the probability of each action under each observation. The environment's state-transition function is $T: S \times A_1 \times \cdots \times A_N \to \Omega(S)$, each agent has a reward function $r_i: S \times A \to \mathbb{R}$ and receives a partial observation $o_i: S \to O_i$ of the global state, and the initial state is drawn from a distribution $\rho: S \to [0,1]$. Each agent aims to maximize its own expected cumulative reward $\mathbb{E}\left[\sum_{t} \gamma^t r_i^t\right]$, where $\gamma$ is the discount factor.
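To make this notation concrete, here is a minimal sketch that simply collects the components of a partially observable Markov game into one Python structure; the class and field names are illustrative only and are not used by the code later in this chapter.

from dataclasses import dataclass
from typing import Any, Callable, List


# Illustrative container for the partially observable Markov game tuple.
# The field names are chosen for this sketch and do not correspond to any library API.
@dataclass
class PartiallyObservableMarkovGame:
    num_agents: int                  # N
    state_space: Any                 # S: global state space
    action_spaces: List[Any]         # A_1, ..., A_N
    observation_spaces: List[Any]    # O_1, ..., O_N
    transition: Callable             # T(s, a_1, ..., a_N) -> distribution over S
    reward_fns: List[Callable]       # r_i(s, a) -> float
    observe_fns: List[Callable]      # o_i(s) -> partial observation of agent i
    initial_state_dist: Callable     # rho() -> initial state s
    gamma: float = 0.95              # discount factor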
Let us now look at the main details of the MADDPG algorithm. As shown in Figure 21-1, each agent is trained with an Actor-Critic method, but unlike the traditional single-agent setting, in MADDPG the Critic of every agent has access to the policy information of the other agents. Concretely, consider a game with $N$ agents whose policies are parameterized by $\theta = \{\theta_1, \ldots, \theta_N\}$, and write $\pi = \{\pi_1, \ldots, \pi_N\}$ for the set of all agents' policies. Under stochastic policies, the gradient of agent $i$'s expected return is

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{s \sim p^{\pi},\, a \sim \pi}\left[\nabla_{\theta_i} \log \pi_i(a_i \mid o_i)\, Q_i^{\pi}(\mathbf{x}, a_1, \ldots, a_N)\right]$$

where $Q_i^{\pi}(\mathbf{x}, a_1, \ldots, a_N)$ is a centralized action-value function. It is called centralized because its input $\mathbf{x} = (o_1, \ldots, o_N)$ contains the observations of all agents, and it also takes the current actions of all agents as input; it can therefore only be evaluated when every agent provides both its observation and its action.

For deterministic policies, consider $N$ continuous policies $\mu_{\theta_i}$ (abbreviated $\mu_i$). The DDPG-style gradient becomes

$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\nabla_{\theta_i} \mu_i(o_i)\, \nabla_{a_i} Q_i^{\mu}(\mathbf{x}, a_1, \ldots, a_N)\big|_{a_i = \mu_i(o_i)}\right]$$

where $\mathcal{D}$ is the experience replay buffer; each stored transition has the form $(\mathbf{x}, \mathbf{x}', a_1, \ldots, a_N, r_1, \ldots, r_N)$. In MADDPG the centralized action-value function is updated by minimizing the loss

$$\mathcal{L}(\omega_i) = \mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}'}\left[\left(Q_i^{\mu}(\mathbf{x}, a_1, \ldots, a_N) - y\right)^2\right], \qquad y = r_i + \gamma\, Q_i^{\mu'}(\mathbf{x}', a_1', \ldots, a_N')\big|_{a_j' = \mu_j'(o_j')}$$

where $\mu' = (\mu_{\theta_1'}, \ldots, \mu_{\theta_N'})$ is the set of target policies, whose parameters are updated with a delay (soft updates).
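The implementation later in this chapter follows these formulas almost literally. As a preview, the sketch below (with hypothetical function and variable names and tensor layouts, not the classes defined later) shows how the TD target $y$ and the critic loss of agent $i$ could be computed in PyTorch:

import torch
import torch.nn.functional as F


# Hypothetical batch layout: obs[j], next_obs[j] have shape (batch, obs_dim_j),
# act[j] has shape (batch, act_dim_j), one entry per agent j.
def critic_loss_for_agent(i, critic_i, target_critic_i, target_actors,
                          obs, act, rew_i, next_obs, done_i, gamma=0.95):
    with torch.no_grad():
        # Target actions a_j' = mu_j'(o_j') for every agent j
        next_act = [mu(o) for mu, o in zip(target_actors, next_obs)]
        # Centralized input: all observations and all actions concatenated
        target_in = torch.cat((*next_obs, *next_act), dim=1)
        y = rew_i.view(-1, 1) + gamma * target_critic_i(target_in) * (1 - done_i.view(-1, 1))
    critic_in = torch.cat((*obs, *act), dim=1)
    return F.mse_loss(critic_i(critic_in), y)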
The overall procedure of MADDPG can be summarized as follows:

- For each agent $i$, initialize an Actor network $\mu_i$, a centralized Critic network $Q_i$, and their target networks, and initialize a shared experience replay buffer $\mathcal{D}$.
- For each episode: reset the environment; at every step, each agent selects an action from its own observation (with exploration noise), and the joint transition $(\mathbf{x}, a_1, \ldots, a_N, r_1, \ldots, r_N, \mathbf{x}')$ is stored in $\mathcal{D}$.
- Periodically sample a minibatch from $\mathcal{D}$ and, for each agent $i$, update the Critic by minimizing $\mathcal{L}(\omega_i)$ and update the Actor along $\nabla_{\theta_i} J(\mu_i)$ as given above.
- Softly update all target networks.
Let us now see how to implement the MADDPG algorithm. First, import the required packages.
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import random
import rl_utils
The environment we will use is the multi-agent particle environment (MPE), a collection of environments designed for multi-agent interaction. In it, particle agents can move, communicate, "see" other agents, and interact with landmarks at fixed positions.
Next we install the environment. Since the official MPE repository is no longer maintained and depends on an old version of gym, we also need to reinstall the gym library accordingly.
!git clone https://github.com/boyu-ai/multiagent-particle-envs.git --quiet
!pip install -e multiagent-particle-envs
import sys
sys.path.append("multiagent-particle-envs")
# The underlying implementation of multiagent-particle-envs has some version
# issues, so gym needs to be pinned to a compatible version
!pip install --upgrade gym==0.10.5 -q
import gym
from multiagent.environment import MultiAgentEnv
import multiagent.scenarios as scenarios


def make_env(scenario_name):
    # Create the environment from the scenario script
    scenario = scenarios.load(scenario_name + ".py").Scenario()
    world = scenario.make_world()
    env = MultiAgentEnv(world, scenario.reset_world, scenario.reward,
                        scenario.observation)
    return env
This chapter uses the simple_adversary environment from MPE as the running example, as shown in Figure 21-2. The environment contains 1 red adversarial agent (adversary), N blue cooperative ("good") agents, and N landmarks (here N = 2), one of which is the target landmark (shown in green). The good agents know which landmark is the target, while the adversary does not. The good agents cooperate: if any one of them is close enough to the target landmark, every good agent receives the same reward. The adversary is also rewarded for being close to the target landmark, but it has to guess which landmark is the target. The good agents therefore need to cooperate and spread out over different landmarks in order to deceive the adversary.
Note that in MPE each agent's action space is discrete. As discussed in Chapter 13, DDPG requires the agent's action to be differentiable with respect to its policy parameters, which holds for continuous action spaces but not for discrete ones. This does not mean that MADDPG cannot solve the task at hand, because we can use a technique called Gumbel-Softmax to obtain an approximate, differentiable sample from a discrete distribution. Below we briefly explain its principle and give the implementation.
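To see the problem concretely, the toy snippet below (purely illustrative, not part of the algorithm) shows that turning logits into a discrete action via argmax and one-hot encoding cuts the gradient path back to the policy parameters:

import torch
import torch.nn.functional as F

logits = torch.tensor([[1.0, 2.0, 0.5]], requires_grad=True)    # action preferences
action_index = logits.argmax(dim=1)                             # discrete "sample": index 1
one_hot = F.one_hot(action_index, num_classes=3).float()
print(one_hot.requires_grad)   # False: argmax/one-hot are not differentiable,
                               # so no gradient can reach `logits`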
Suppose a random variable $Z$ follows a discrete distribution with class probabilities $a = (a_1, \ldots, a_k)$, where $a_i \in [0,1]$ is $P(Z = i)$ and $\sum_{i=1}^{k} a_i = 1$. When we need a discrete action sampled from this distribution, the sampling operation itself is not differentiable with respect to the distribution's parameters.

Is there any way to make sampling from a discrete distribution differentiable? Yes: the reparameterization trick, which was introduced with the SAC algorithm in Chapter 14; here we use its Gumbel-Softmax variant. Specifically, we introduce a reparameterization noise term $g_i$ sampled from the $\text{Gumbel}(0,1)$ distribution:

$$g_i = -\log(-\log u), \quad u \sim \text{Uniform}(0, 1)$$

The Gumbel-Softmax sample can then be written as

$$y_i = \frac{\exp\left((\log a_i + g_i)/\tau\right)}{\sum_{j=1}^{k} \exp\left((\log a_j + g_j)/\tau\right)}, \quad i = 1, \ldots, k$$

where $\tau > 0$ is the temperature parameter. If we now take $z = \arg\max_i y_i$, the resulting discrete value approximates a true sample $z$ from the original discrete distribution, while the continuous vector $y$ naturally carries gradients with respect to $a$. The temperature controls how closely the Gumbel-Softmax distribution approximates the discrete one: the smaller $\tau$ is, the closer $y$ is to the one-hot vector $\text{onehot}(\arg\max_i (\log a_i + g_i))$; the larger $\tau$ is, the closer $y$ is to a uniform distribution.
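The short NumPy experiment below (independent of the utility functions implemented later in this chapter) illustrates the effect of the temperature: a small τ yields a nearly one-hot sample, whereas a large τ pushes the sample toward the uniform distribution.

import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.2, 0.7])   # a discrete distribution over 3 actions


def gumbel_softmax_demo(probs, tau):
    g = -np.log(-np.log(rng.uniform(size=probs.shape)))   # Gumbel(0,1) noise
    logits = (np.log(probs) + g) / tau
    e = np.exp(logits - logits.max())                      # numerically stable softmax
    return e / e.sum()


print(gumbel_softmax_demo(probs, tau=0.1))    # nearly one-hot
print(gumbel_softmax_demo(probs, tau=10.0))   # close to uniform [0.33, 0.33, 0.33]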
Next we define some utility functions, including those implementing Gumbel-Softmax sampling, which allows DDPG to work with discrete action spaces.
def onehot_from_logits(logits, eps=0.01):
    ''' Return the greedy action in one-hot form '''
    argmax_acs = (logits == logits.max(1, keepdim=True)[0]).float()
    # Sample random actions and convert them to one-hot form
    rand_acs = torch.autograd.Variable(torch.eye(logits.shape[1])[[
        np.random.choice(range(logits.shape[1]), size=logits.shape[0])
    ]], requires_grad=False).to(logits.device)
    # Choose between the greedy and the random action (epsilon-greedy)
    return torch.stack([
        argmax_acs[i] if r > eps else rand_acs[i]
        for i, r in enumerate(torch.rand(logits.shape[0]))
    ])


def sample_gumbel(shape, eps=1e-20, tens_type=torch.FloatTensor):
    """ Sample from a Gumbel(0,1) distribution """
    U = torch.autograd.Variable(tens_type(*shape).uniform_(),
                                requires_grad=False)
    return -torch.log(-torch.log(U + eps) + eps)


def gumbel_softmax_sample(logits, temperature):
    """ Sample from the Gumbel-Softmax distribution """
    y = logits + sample_gumbel(logits.shape, tens_type=type(logits.data)).to(
        logits.device)
    return F.softmax(y / temperature, dim=1)


def gumbel_softmax(logits, temperature=1.0):
    """ Sample from the Gumbel-Softmax distribution and discretize the result """
    y = gumbel_softmax_sample(logits, temperature)
    y_hard = onehot_from_logits(y)
    y = (y_hard.to(logits.device) - y).detach() + y
    # Return a one-hot vector equal to y_hard whose gradient is that of y, so we
    # get a discrete action for interacting with the environment while still
    # being able to backpropagate gradients correctly
    return y
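As a quick sanity check (not part of the original training pipeline), we can verify that gumbel_softmax outputs one-hot actions that still pass gradients back to the logits, which is exactly the property the actor update needs:

logits = torch.randn(4, 5, requires_grad=True)   # batch of 4 states, 5 discrete actions
actions = gumbel_softmax(logits, temperature=1.0)
print(actions)                                   # each row is a one-hot vector
actions.sum().backward()
print(logits.grad is not None)                   # True: gradients flow through the sample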
Next we implement the single-agent DDPG. It contains the Actor and Critic networks as well as the function for computing actions, all of which were introduced in Chapter 13 and are not repeated here. Note that this class has no function for updating the network parameters; that will be implemented in the MADDPG class.
class TwoLayerFC(torch.nn.Module):
    def __init__(self, num_in, num_out, hidden_dim):
        super().__init__()
        self.fc1 = torch.nn.Linear(num_in, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = torch.nn.Linear(hidden_dim, num_out)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


class DDPG:
    ''' DDPG algorithm '''
    def __init__(self, state_dim, action_dim, critic_input_dim, hidden_dim,
                 actor_lr, critic_lr, device):
        self.actor = TwoLayerFC(state_dim, action_dim, hidden_dim).to(device)
        self.target_actor = TwoLayerFC(state_dim, action_dim,
                                       hidden_dim).to(device)
        self.critic = TwoLayerFC(critic_input_dim, 1, hidden_dim).to(device)
        self.target_critic = TwoLayerFC(critic_input_dim, 1,
                                        hidden_dim).to(device)
        self.target_critic.load_state_dict(self.critic.state_dict())
        self.target_actor.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(),
                                                lr=actor_lr)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(),
                                                 lr=critic_lr)

    def take_action(self, state, explore=False):
        action = self.actor(state)
        if explore:
            action = gumbel_softmax(action)
        else:
            action = onehot_from_logits(action)
        return action.detach().cpu().numpy()[0]

    def soft_update(self, net, target_net, tau):
        for param_target, param in zip(target_net.parameters(),
                                       net.parameters()):
            param_target.data.copy_(param_target.data * (1.0 - tau) +
                                    param.data * tau)
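For illustration only, a single DDPG agent can be instantiated and queried as follows; the dimensions used here are made up and do not correspond to the MPE environment:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
agent = DDPG(state_dim=8, action_dim=5, critic_input_dim=30, hidden_dim=64,
             actor_lr=1e-2, critic_lr=1e-2, device=device)
obs = torch.rand(1, 8, device=device)         # a dummy observation batch of size 1
print(agent.take_action(obs, explore=True))   # a one-hot action as a numpy array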
Next we implement the MADDPG class itself, which maintains one DDPG agent for each agent in the environment. Their policy and value-function updates follow the formulas for $\nabla_{\theta_i} J(\mu_i)$ and $\mathcal{L}(\omega_i)$ given in Section 21.2.
class MADDPG:
    def __init__(self, env, device, actor_lr, critic_lr, hidden_dim,
                 state_dims, action_dims, critic_input_dim, gamma, tau):
        self.agents = []
        for i in range(len(env.agents)):
            self.agents.append(
                DDPG(state_dims[i], action_dims[i], critic_input_dim,
                     hidden_dim, actor_lr, critic_lr, device))
        self.gamma = gamma
        self.tau = tau
        self.critic_criterion = torch.nn.MSELoss()
        self.device = device

    @property
    def policies(self):
        return [agt.actor for agt in self.agents]

    @property
    def target_policies(self):
        return [agt.target_actor for agt in self.agents]

    def take_action(self, states, explore):
        states = [
            torch.tensor([states[i]], dtype=torch.float, device=self.device)
            for i in range(len(self.agents))
        ]
        return [
            agent.take_action(state, explore)
            for agent, state in zip(self.agents, states)
        ]

    def update(self, sample, i_agent):
        obs, act, rew, next_obs, done = sample
        cur_agent = self.agents[i_agent]

        cur_agent.critic_optimizer.zero_grad()
        all_target_act = [
            onehot_from_logits(pi(_next_obs))
            for pi, _next_obs in zip(self.target_policies, next_obs)
        ]
        target_critic_input = torch.cat((*next_obs, *all_target_act), dim=1)
        target_critic_value = rew[i_agent].view(
            -1, 1) + self.gamma * cur_agent.target_critic(
                target_critic_input) * (1 - done[i_agent].view(-1, 1))
        critic_input = torch.cat((*obs, *act), dim=1)
        critic_value = cur_agent.critic(critic_input)
        critic_loss = self.critic_criterion(critic_value,
                                            target_critic_value.detach())
        critic_loss.backward()
        cur_agent.critic_optimizer.step()

        cur_agent.actor_optimizer.zero_grad()
        cur_actor_out = cur_agent.actor(obs[i_agent])
        cur_act_vf_in = gumbel_softmax(cur_actor_out)
        all_actor_acs = []
        for i, (pi, _obs) in enumerate(zip(self.policies, obs)):
            if i == i_agent:
                all_actor_acs.append(cur_act_vf_in)
            else:
                all_actor_acs.append(onehot_from_logits(pi(_obs)))
        vf_in = torch.cat((*obs, *all_actor_acs), dim=1)
        actor_loss = -cur_agent.critic(vf_in).mean()
        actor_loss += (cur_actor_out**2).mean() * 1e-3
        actor_loss.backward()
        cur_agent.actor_optimizer.step()

    def update_all_targets(self):
        for agt in self.agents:
            agt.soft_update(agt.actor, agt.target_actor, self.tau)
            agt.soft_update(agt.critic, agt.target_critic, self.tau)
Now we define the hyperparameters, and create the environment, the agents, and the experience replay buffer in preparation for training.
num_episodes = 5000
episode_length = 25  # maximum length of each episode
buffer_size = 100000
hidden_dim = 64
actor_lr = 1e-2
critic_lr = 1e-2
gamma = 0.95
tau = 1e-2
batch_size = 1024
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
update_interval = 100
minimal_size = 4000

env_id = "simple_adversary"
env = make_env(env_id)
replay_buffer = rl_utils.ReplayBuffer(buffer_size)

state_dims = []
action_dims = []
for action_space in env.action_space:
    action_dims.append(action_space.n)
for state_space in env.observation_space:
    state_dims.append(state_space.shape[0])
critic_input_dim = sum(state_dims) + sum(action_dims)

maddpg = MADDPG(env, device, actor_lr, critic_lr, hidden_dim, state_dims,
                action_dims, critic_input_dim, gamma, tau)
Next we implement a function for evaluating the learned policies, and then we can start training!
def evaluate(env_id, maddpg, n_episode=10, episode_length=25):
    # Evaluate the learned policies; no exploration is performed here
    env = make_env(env_id)
    returns = np.zeros(len(env.agents))
    for _ in range(n_episode):
        obs = env.reset()
        for t_i in range(episode_length):
            actions = maddpg.take_action(obs, explore=False)
            obs, rew, done, info = env.step(actions)
            rew = np.array(rew)
            returns += rew / n_episode
    return returns.tolist()


return_list = []  # record the returns of each evaluation round
total_step = 0
for i_episode in range(num_episodes):
    state = env.reset()
    for e_i in range(episode_length):
        actions = maddpg.take_action(state, explore=True)
        next_state, reward, done, _ = env.step(actions)
        replay_buffer.add(state, actions, reward, next_state, done)
        state = next_state
        total_step += 1
        if replay_buffer.size() >= minimal_size and total_step % update_interval == 0:
            sample = replay_buffer.sample(batch_size)

            def stack_array(x):
                rearranged = [[sub_x[i] for sub_x in x]
                              for i in range(len(x[0]))]
                return [
                    torch.FloatTensor(np.vstack(aa)).to(device)
                    for aa in rearranged
                ]

            sample = [stack_array(x) for x in sample]
            for a_i in range(len(env.agents)):
                maddpg.update(sample, a_i)
            maddpg.update_all_targets()
    if (i_episode + 1) % 100 == 0:
        ep_returns = evaluate(env_id, maddpg, n_episode=100)
        return_list.append(ep_returns)
        print(f"Episode: {i_episode+1}, {ep_returns}")
Episode: 100, [-162.09349111961225, 9.000666921056728, 9.000666921056728]
Episode: 200, [-121.85087049356082, 20.082544683591127, 20.082544683591127]
Episode: 300, [-28.086124816732802, -23.51493605339695, -23.51493605339695]
Episode: 400, [-35.91437846570877, -6.574264880829929, -6.574264880829929]
Episode: 500, [-12.83238365700212, -5.402338391212475, -5.402338391212475]
Episode: 600, [-11.692053500921567, 2.904343355450921, 2.904343355450921]
Episode: 700, [-11.21261001095729, 6.13003213658482, 6.13003213658482]
Episode: 800, [-12.581086056359824, 7.13450533137511, 7.13450533137511]
Episode: 900, [-10.932824468382302, 7.534917449533213, 7.534917449533213]
Episode: 1000, [-10.454432036663551, 7.467940904661571, 7.467940904661571]
Episode: 1100, [-10.099017183836345, 6.764091427064233, 6.764091427064233]
Episode: 1200, [-9.970202627245511, 6.839233648010857, 6.839233648010857]
Episode: 1300, [-8.23988889957424, 5.928539785965939, 5.928539785965939]
Episode: 1400, [-7.618319791914515, 5.4721657785273665, 5.4721657785273665]
Episode: 1500, [-9.528028248906292, 6.716548343395567, 6.716548343395567]
Episode: 1600, [-9.27198788506915, 6.25794360791615, 6.25794360791615]
Episode: 1700, [-9.439913314907297, 6.552076175517556, 6.552076175517556]
Episode: 1800, [-9.41018120255451, 6.170898260988019, 6.170898260988019]
Episode: 1900, [-8.293080671760299, 5.710058304479939, 5.710058304479939]
Episode: 2000, [-8.876670052284371, 5.804116304916539, 5.804116304916539]
Episode: 2100, [-8.20415531215746, 5.170909738207094, 5.170909738207094]
Episode: 2200, [-8.773275999321958, 4.961748911238369, 4.961748911238369]
Episode: 2300, [-8.06474017837516, 5.223795184183733, 5.223795184183733]
Episode: 2400, [-6.587706872401325, 4.366625235204875, 4.366625235204875]
Episode: 2500, [-7.691312056289927, 4.856855290592445, 4.856855290592445]
Episode: 2600, [-8.813560406139358, 5.508815842509804, 5.508815842509804]
Episode: 2700, [-7.056761924960759, 4.758538712873507, 4.758538712873507]
Episode: 2800, [-8.68842389422384, 5.661161581099521, 5.661161581099521]
Episode: 2900, [-7.930406418494052, 4.366106102743839, 4.366106102743839]
Episode: 3000, [-8.114850902595816, 5.1274853968197265, 5.1274853968197265]
Episode: 3100, [-8.381402942461598, 5.093518450135181, 5.093518450135181]
Episode: 3200, [-9.493930234055618, 5.472500034114433, 5.472500034114433]
Episode: 3300, [-8.53312311113189, 4.963767973071618, 4.963767973071618]
Episode: 3400, [-9.229941671093316, 5.555036222150763, 5.555036222150763]
Episode: 3500, [-10.67973248813069, 6.0258368192309115, 6.0258368192309115]
Episode: 3600, [-8.785648619797922, 5.360050159370962, 5.360050159370962]
Episode: 3700, [-10.050750001897885, 5.962048108721202, 5.962048108721202]
Episode: 3800, [-6.673053043055956, 3.732181204778823, 3.732181204778823]
Episode: 3900, [-10.567190838130202, 5.705831860427992, 5.705831860427992]
Episode: 4000, [-9.288291495674969, 5.298166543261745, 5.298166543261745]
Episode: 4100, [-9.433352212890984, 6.016868802323455, 6.016868802323455]
Episode: 4200, [-8.573388252905312, 4.673785791835532, 4.673785791835532]
Episode: 4300, [-8.466209564326363, 5.482892841309288, 5.482892841309288]
Episode: 4400, [-9.988322102926736, 5.5203824927807155, 5.5203824927807155]
Episode: 4500, [-7.4937676078180155, 4.730897948468445, 4.730897948468445]
Episode: 4600, [-8.755589567322176, 5.494709505886223, 5.494709505886223]
Episode: 4700, [-9.16743075823155, 5.234841527940852, 5.234841527940852]
Episode: 4800, [-8.597439825247829, 4.615078133167369, 4.615078133167369]
Episode: 4900, [-9.918505853931377, 5.08561749388552, 5.08561749388552]
Episode: 5000, [-10.16405662517592, 5.43335871613719, 5.43335871613719]
Training is finished; let us see how well the learned policies perform.
return_array = np.array(return_list)
for i, agent_name in enumerate(["adversary_0", "agent_0", "agent_1"]):
    plt.figure()
    plt.plot(
        np.arange(return_array.shape[0]) * 100,
        rl_utils.moving_average(return_array[:, i], 9))
    plt.xlabel("Episodes")
    plt.ylabel("Returns")
    plt.title(f"{agent_name} by MADDPG")
We can see that the good agents agent_0 and agent_1 obtain exactly the same returns, because their reward functions are identical. The good agents end up with positive returns, which shows that by cooperating they manage to occupy two different landmarks, preventing the adversary from telling which landmark is the target. We can also see that MADDPG converges reasonably quickly and stably.
This chapter presented MADDPG, a classic algorithm in the CTDE paradigm of multi-agent reinforcement learning; many later multi-agent RL algorithms build on it. Understanding MADDPG is therefore key to digging deeper into multi-agent algorithms, and interested readers are encouraged to read the original MADDPG paper [1] for a deeper understanding.
[1] LOWE R, WU Y, TAMAR A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments [J]. Advances in Neural Information Processing Systems, 2017, 30: 6379-6390.
[2] MPE benchmarks (see the maddpg_replication.ipynb file in the google/maddpg-replication project on GitHub).