Chapter 20 introduced the problems studied in multi-agent reinforcement learning and the most basic solution paradigms. This chapter presents a classic and effective advanced paradigm: centralized training with decentralized execution (CTDE). In CTDE, training makes use of global information that no single agent can observe on its own in order to obtain a better training signal, while execution does not use this information: each agent acts purely according to its own policy, achieving decentralized execution. CTDE algorithms can therefore exploit global information during training for better and more stable learning, yet rely only on local information at policy-inference time, which gives them a degree of scalability. CTDE can be compared to how a football team trains and plays: during training, the 11 players receive direct guidance from the coach to coordinate as a team, and the coach, who sees the whole match, gives instructions from the perspective of the entire team and the entire game; once trained, the 11 players make decisions on the pitch directly from the real-time situation, without the coach's guidance.
CTDE algorithms fall into two main families: value-based methods, such as VDN and QMIX, and Actor-Critic methods, such as MADDPG and COMA. This chapter focuses on the MADDPG algorithm.
Multi-agent DDPG (MADDPG), as its name suggests, runs a DDPG-style algorithm for each agent. Each agent maintains its own centralized Critic network, which takes the observations and actions of all agents as input and guides that agent's Actor network during training; at execution time each agent's Actor acts entirely on its own local observation, i.e., execution is decentralized.
The application scenarios of CTDE algorithms can usually be modeled as a partially observable Markov game: let $S$ denote the space of global states shared by the $N$ agents. Each agent $i$ has an action space $A_i$ and an observation space $O_i$; its policy $\pi_{\theta_i}: O_i \times A_i \to [0,1]$ gives the probability of each action under each observation. The environment's state-transition function is $T: S \times A_1 \times \cdots \times A_N \to \Omega(S)$, each agent has a reward function $r_i: S \times A \to \mathbb{R}$ and receives a partial observation $o_i: S \to O_i$ of the global state, and the initial state is drawn from a distribution $\rho: S \to [0,1]$. Each agent aims to maximize its own expected cumulative reward $\mathbb{E}\left[\sum_{t} \gamma^t r_i^t\right]$, where $\gamma$ is the discount factor.
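To make this notation concrete, here is a minimal sketch that simply collects the components of a partially observable Markov game into one Python structure; the class and field names are illustrative only and are not used by the code later in this chapter.

from dataclasses import dataclass
from typing import Any, Callable, List


# Illustrative container for the partially observable Markov game tuple.
# The field names are chosen for this sketch and do not correspond to any library API.
@dataclass
class PartiallyObservableMarkovGame:
    num_agents: int                  # N
    state_space: Any                 # S: global state space
    action_spaces: List[Any]         # A_1, ..., A_N
    observation_spaces: List[Any]    # O_1, ..., O_N
    transition: Callable             # T(s, a_1, ..., a_N) -> distribution over S
    reward_fns: List[Callable]       # r_i(s, a) -> float
    observe_fns: List[Callable]      # o_i(s) -> partial observation of agent i
    initial_state_dist: Callable     # rho() -> initial state s
    gamma: float = 0.95              # discount factor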
Let us now look at the main details of the MADDPG algorithm. As shown in Figure 21-1, each agent is trained with an Actor-Critic method, but unlike the traditional single-agent setting, in MADDPG the Critic of every agent has access to the policy information of the other agents. Concretely, consider a game with $N$ agents whose policies are parameterized by $\theta = \{\theta_1, \ldots, \theta_N\}$, and write $\pi = \{\pi_1, \ldots, \pi_N\}$ for the set of all agents' policies. Under stochastic policies, the gradient of agent $i$'s expected return is

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{s \sim p^{\pi},\, a \sim \pi}\left[\nabla_{\theta_i} \log \pi_i(a_i \mid o_i)\, Q_i^{\pi}(\mathbf{x}, a_1, \ldots, a_N)\right]$$

where $Q_i^{\pi}(\mathbf{x}, a_1, \ldots, a_N)$ is a centralized action-value function. It is called centralized because its input $\mathbf{x} = (o_1, \ldots, o_N)$ contains the observations of all agents, and it also takes the current actions of all agents as input; it can therefore only be evaluated when every agent provides both its observation and its action.

For deterministic policies, consider $N$ continuous policies $\mu_{\theta_i}$ (abbreviated $\mu_i$). The DDPG-style gradient becomes

$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\nabla_{\theta_i} \mu_i(o_i)\, \nabla_{a_i} Q_i^{\mu}(\mathbf{x}, a_1, \ldots, a_N)\big|_{a_i = \mu_i(o_i)}\right]$$

where $\mathcal{D}$ is the experience replay buffer; each stored transition has the form $(\mathbf{x}, \mathbf{x}', a_1, \ldots, a_N, r_1, \ldots, r_N)$. In MADDPG the centralized action-value function is updated by minimizing the loss

$$\mathcal{L}(\omega_i) = \mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}'}\left[\left(Q_i^{\mu}(\mathbf{x}, a_1, \ldots, a_N) - y\right)^2\right], \qquad y = r_i + \gamma\, Q_i^{\mu'}(\mathbf{x}', a_1', \ldots, a_N')\big|_{a_j' = \mu_j'(o_j')}$$

where $\mu' = (\mu_{\theta_1'}, \ldots, \mu_{\theta_N'})$ is the set of target policies, whose parameters are updated with a delay (soft updates).
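The implementation later in this chapter follows these formulas almost literally. As a preview, the sketch below (with hypothetical function and variable names and tensor layouts, not the classes defined later) shows how the TD target $y$ and the critic loss of agent $i$ could be computed in PyTorch:

import torch
import torch.nn.functional as F


# Hypothetical batch layout: obs[j], next_obs[j] have shape (batch, obs_dim_j),
# act[j] has shape (batch, act_dim_j), one entry per agent j.
def critic_loss_for_agent(i, critic_i, target_critic_i, target_actors,
                          obs, act, rew_i, next_obs, done_i, gamma=0.95):
    with torch.no_grad():
        # Target actions a_j' = mu_j'(o_j') for every agent j
        next_act = [mu(o) for mu, o in zip(target_actors, next_obs)]
        # Centralized input: all observations and all actions concatenated
        target_in = torch.cat((*next_obs, *next_act), dim=1)
        y = rew_i.view(-1, 1) + gamma * target_critic_i(target_in) * (1 - done_i.view(-1, 1))
    critic_in = torch.cat((*obs, *act), dim=1)
    return F.mse_loss(critic_i(critic_in), y)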
The overall procedure of MADDPG can be summarized as follows:

- For each agent $i$, initialize an Actor network $\mu_i$, a centralized Critic network $Q_i$, and their target networks, and initialize a shared experience replay buffer $\mathcal{D}$.
- For each episode: reset the environment; at every step, each agent selects an action from its own observation (with exploration noise), and the joint transition $(\mathbf{x}, a_1, \ldots, a_N, r_1, \ldots, r_N, \mathbf{x}')$ is stored in $\mathcal{D}$.
- Periodically sample a minibatch from $\mathcal{D}$ and, for each agent $i$, update the Critic by minimizing $\mathcal{L}(\omega_i)$ and update the Actor along $\nabla_{\theta_i} J(\mu_i)$ as given above.
- Softly update all target networks.
Let us now see how to implement the MADDPG algorithm. First, import the required packages.
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import random
import rl_utils
The environment we will use is the multi-agent particle environment (MPE), a collection of environments designed for multi-agent interaction. In it, particle agents can move, communicate, "see" other agents, and interact with landmarks at fixed positions.
Next we install the environment. Since the official MPE repository is no longer maintained and depends on an old version of gym, we also need to reinstall the gym library accordingly.
!git clone https://github.com/boyu-ai/multiagent-particle-envs.git --quiet
!pip install -e multiagent-particle-envs
import sys
sys.path.append("multiagent-particle-envs")
# The underlying implementation of multiagent-particle-envs has some version
# issues, so gym needs to be pinned to a compatible version
!pip install --upgrade gym==0.10.5 -q
import gym
from multiagent.environment import MultiAgentEnv
import multiagent.scenarios as scenarios


def make_env(scenario_name):
    # Create the environment from the scenario script
    scenario = scenarios.load(scenario_name + ".py").Scenario()
    world = scenario.make_world()
    env = MultiAgentEnv(world, scenario.reset_world, scenario.reward,
                        scenario.observation)
    return env
This chapter uses the simple_adversary environment from MPE as the running example, as shown in Figure 21-2. The environment contains 1 red adversarial agent (adversary), N blue cooperative ("good") agents, and N landmarks (here N = 2), one of which is the target landmark (shown in green). The good agents know which landmark is the target, while the adversary does not. The good agents cooperate: if any one of them is close enough to the target landmark, every good agent receives the same reward. The adversary is also rewarded for being close to the target landmark, but it has to guess which landmark is the target. The good agents therefore need to cooperate and spread out over different landmarks in order to deceive the adversary.
Note that in MPE each agent's action space is discrete. As discussed in Chapter 13, DDPG requires the agent's action to be differentiable with respect to its policy parameters, which holds for continuous action spaces but not for discrete ones. This does not mean that MADDPG cannot solve the task at hand, because we can use a technique called Gumbel-Softmax to obtain an approximate, differentiable sample from a discrete distribution. Below we briefly explain its principle and give the implementation.
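To see the problem concretely, the toy snippet below (purely illustrative, not part of the algorithm) shows that turning logits into a discrete action via argmax and one-hot encoding cuts the gradient path back to the policy parameters:

import torch
import torch.nn.functional as F

logits = torch.tensor([[1.0, 2.0, 0.5]], requires_grad=True)    # action preferences
action_index = logits.argmax(dim=1)                             # discrete "sample": index 1
one_hot = F.one_hot(action_index, num_classes=3).float()
print(one_hot.requires_grad)   # False: argmax/one-hot are not differentiable,
                               # so no gradient can reach `logits`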
Suppose a random variable $Z$ follows a discrete distribution with class probabilities $a = (a_1, \ldots, a_k)$, where $a_i \in [0,1]$ is $P(Z = i)$ and $\sum_{i=1}^{k} a_i = 1$. When we need a discrete action sampled from this distribution, the sampling operation itself is not differentiable with respect to the distribution's parameters.

Is there any way to make sampling from a discrete distribution differentiable? Yes: the reparameterization trick, which was introduced with the SAC algorithm in Chapter 14; here we use its Gumbel-Softmax variant. Specifically, we introduce a reparameterization noise term $g_i$ sampled from the $\text{Gumbel}(0,1)$ distribution:

$$g_i = -\log(-\log u), \quad u \sim \text{Uniform}(0, 1)$$

The Gumbel-Softmax sample can then be written as

$$y_i = \frac{\exp\left((\log a_i + g_i)/\tau\right)}{\sum_{j=1}^{k} \exp\left((\log a_j + g_j)/\tau\right)}, \quad i = 1, \ldots, k$$

where $\tau > 0$ is the temperature parameter. If we now take $z = \arg\max_i y_i$, the resulting discrete value approximates a true sample $z$ from the original discrete distribution, while the continuous vector $y$ naturally carries gradients with respect to $a$. The temperature controls how closely the Gumbel-Softmax distribution approximates the discrete one: the smaller $\tau$ is, the closer $y$ is to the one-hot vector $\text{onehot}(\arg\max_i (\log a_i + g_i))$; the larger $\tau$ is, the closer $y$ is to a uniform distribution.
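The short NumPy experiment below (independent of the utility functions implemented later in this chapter) illustrates the effect of the temperature: a small τ yields a nearly one-hot sample, whereas a large τ pushes the sample toward the uniform distribution.

import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.2, 0.7])   # a discrete distribution over 3 actions


def gumbel_softmax_demo(probs, tau):
    g = -np.log(-np.log(rng.uniform(size=probs.shape)))   # Gumbel(0,1) noise
    logits = (np.log(probs) + g) / tau
    e = np.exp(logits - logits.max())                      # numerically stable softmax
    return e / e.sum()


print(gumbel_softmax_demo(probs, tau=0.1))    # nearly one-hot
print(gumbel_softmax_demo(probs, tau=10.0))   # close to uniform [0.33, 0.33, 0.33]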
Next we define some utility functions, including those implementing Gumbel-Softmax sampling, which allows DDPG to work with discrete action spaces.
def onehot_from_logits(logits, eps=0.01):
    ''' Return the greedy action in one-hot form '''
    argmax_acs = (logits == logits.max(1, keepdim=True)[0]).float()
    # Sample random actions and convert them to one-hot form
    rand_acs = torch.autograd.Variable(torch.eye(logits.shape[1])[[
        np.random.choice(range(logits.shape[1]), size=logits.shape[0])
    ]], requires_grad=False).to(logits.device)
    # Choose between the greedy and the random action (epsilon-greedy)
    return torch.stack([
        argmax_acs[i] if r > eps else rand_acs[i]
        for i, r in enumerate(torch.rand(logits.shape[0]))
    ])


def sample_gumbel(shape, eps=1e-20, tens_type=torch.FloatTensor):
    """ Sample from a Gumbel(0,1) distribution """
    U = torch.autograd.Variable(tens_type(*shape).uniform_(),
                                requires_grad=False)
    return -torch.log(-torch.log(U + eps) + eps)


def gumbel_softmax_sample(logits, temperature):
    """ Sample from the Gumbel-Softmax distribution """
    y = logits + sample_gumbel(logits.shape, tens_type=type(logits.data)).to(
        logits.device)
    return F.softmax(y / temperature, dim=1)


def gumbel_softmax(logits, temperature=1.0):
    """ Sample from the Gumbel-Softmax distribution and discretize the result """
    y = gumbel_softmax_sample(logits, temperature)
    y_hard = onehot_from_logits(y)
    y = (y_hard.to(logits.device) - y).detach() + y
    # Return a one-hot vector equal to y_hard whose gradient is that of y, so we
    # get a discrete action for interacting with the environment while still
    # being able to backpropagate gradients correctly
    return y
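As a quick sanity check (not part of the original training pipeline), we can verify that gumbel_softmax outputs one-hot actions that still pass gradients back to the logits, which is exactly the property the actor update needs:

logits = torch.randn(4, 5, requires_grad=True)   # batch of 4 states, 5 discrete actions
actions = gumbel_softmax(logits, temperature=1.0)
print(actions)                                   # each row is a one-hot vector
actions.sum().backward()
print(logits.grad is not None)                   # True: gradients flow through the sample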
Next we implement the single-agent DDPG. It contains the Actor and Critic networks as well as the function for computing actions, all of which were introduced in Chapter 13 and are not repeated here. Note that this class has no function for updating the network parameters; that will be implemented in the MADDPG class.
class TwoLayerFC(torch.nn.Module):
    def __init__(self, num_in, num_out, hidden_dim):
        super().__init__()
        self.fc1 = torch.nn.Linear(num_in, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = torch.nn.Linear(hidden_dim, num_out)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


class DDPG:
    ''' DDPG algorithm '''
    def __init__(self, state_dim, action_dim, critic_input_dim, hidden_dim,
                 actor_lr, critic_lr, device):
        self.actor = TwoLayerFC(state_dim, action_dim, hidden_dim).to(device)
        self.target_actor = TwoLayerFC(state_dim, action_dim,
                                       hidden_dim).to(device)
        self.critic = TwoLayerFC(critic_input_dim, 1, hidden_dim).to(device)
        self.target_critic = TwoLayerFC(critic_input_dim, 1,
                                        hidden_dim).to(device)
        self.target_critic.load_state_dict(self.critic.state_dict())
        self.target_actor.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(),
                                                lr=actor_lr)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(),
                                                 lr=critic_lr)

    def take_action(self, state, explore=False):
        action = self.actor(state)
        if explore:
            action = gumbel_softmax(action)
        else:
            action = onehot_from_logits(action)
        return action.detach().cpu().numpy()[0]

    def soft_update(self, net, target_net, tau):
        for param_target, param in zip(target_net.parameters(),
                                       net.parameters()):
            param_target.data.copy_(param_target.data * (1.0 - tau) +
                                    param.data * tau)
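For illustration only, a single DDPG agent can be instantiated and queried as follows; the dimensions used here are made up and do not correspond to the MPE environment:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
agent = DDPG(state_dim=8, action_dim=5, critic_input_dim=30, hidden_dim=64,
             actor_lr=1e-2, critic_lr=1e-2, device=device)
obs = torch.rand(1, 8, device=device)         # a dummy observation batch of size 1
print(agent.take_action(obs, explore=True))   # a one-hot action as a numpy array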
Next we implement the MADDPG class itself, which maintains one DDPG agent for each agent in the environment. Their policy and value-function updates follow the formulas for $\nabla_{\theta_i} J(\mu_i)$ and $\mathcal{L}(\omega_i)$ given in Section 21.2.
class MADDPG:
    def __init__(self, env, device, actor_lr, critic_lr, hidden_dim,
                 state_dims, action_dims, critic_input_dim, gamma, tau):
        self.agents = []
        for i in range(len(env.agents)):
            self.agents.append(
                DDPG(state_dims[i], action_dims[i], critic_input_dim,
                     hidden_dim, actor_lr, critic_lr, device))
        self.gamma = gamma
        self.tau = tau
        self.critic_criterion = torch.nn.MSELoss()
        self.device = device

    @property
    def policies(self):
        return [agt.actor for agt in self.agents]

    @property
    def target_policies(self):
        return [agt.target_actor for agt in self.agents]

    def take_action(self, states, explore):
        states = [
            torch.tensor([states[i]], dtype=torch.float, device=self.device)
            for i in range(len(self.agents))
        ]
        return [
            agent.take_action(state, explore)
            for agent, state in zip(self.agents, states)
        ]

    def update(self, sample, i_agent):
        obs, act, rew, next_obs, done = sample
        cur_agent = self.agents[i_agent]

        cur_agent.critic_optimizer.zero_grad()
        all_target_act = [
            onehot_from_logits(pi(_next_obs))
            for pi, _next_obs in zip(self.target_policies, next_obs)
        ]
        target_critic_input = torch.cat((*next_obs, *all_target_act), dim=1)
        target_critic_value = rew[i_agent].view(
            -1, 1) + self.gamma * cur_agent.target_critic(
                target_critic_input) * (1 - done[i_agent].view(-1, 1))
        critic_input = torch.cat((*obs, *act), dim=1)
        critic_value = cur_agent.critic(critic_input)
        critic_loss = self.critic_criterion(critic_value,
                                            target_critic_value.detach())
        critic_loss.backward()
        cur_agent.critic_optimizer.step()

        cur_agent.actor_optimizer.zero_grad()
        cur_actor_out = cur_agent.actor(obs[i_agent])
        cur_act_vf_in = gumbel_softmax(cur_actor_out)
        all_actor_acs = []
        for i, (pi, _obs) in enumerate(zip(self.policies, obs)):
            if i == i_agent:
                all_actor_acs.append(cur_act_vf_in)
            else:
                all_actor_acs.append(onehot_from_logits(pi(_obs)))
        vf_in = torch.cat((*obs, *all_actor_acs), dim=1)
        actor_loss = -cur_agent.critic(vf_in).mean()
        actor_loss += (cur_actor_out**2).mean() * 1e-3
        actor_loss.backward()
        cur_agent.actor_optimizer.step()

    def update_all_targets(self):
        for agt in self.agents:
            agt.soft_update(agt.actor, agt.target_actor, self.tau)
            agt.soft_update(agt.critic, agt.target_critic, self.tau)
Now we define the hyperparameters, and create the environment, the agents, and the experience replay buffer in preparation for training.
num_episodes = 5000
episode_length = 25  # maximum length of each episode
buffer_size = 100000
hidden_dim = 64
actor_lr = 1e-2
critic_lr = 1e-2
gamma = 0.95
tau = 1e-2
batch_size = 1024
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
update_interval = 100
minimal_size = 4000

env_id = "simple_adversary"
env = make_env(env_id)
replay_buffer = rl_utils.ReplayBuffer(buffer_size)

state_dims = []
action_dims = []
for action_space in env.action_space:
    action_dims.append(action_space.n)
for state_space in env.observation_space:
    state_dims.append(state_space.shape[0])
critic_input_dim = sum(state_dims) + sum(action_dims)

maddpg = MADDPG(env, device, actor_lr, critic_lr, hidden_dim, state_dims,
                action_dims, critic_input_dim, gamma, tau)
Next we implement a function for evaluating the learned policies, and then we can start training!
def evaluate(env_id, maddpg, n_episode=10, episode_length=25):
    # Evaluate the learned policies; no exploration is performed here
    env = make_env(env_id)
    returns = np.zeros(len(env.agents))
    for _ in range(n_episode):
        obs = env.reset()
        for t_i in range(episode_length):
            actions = maddpg.take_action(obs, explore=False)
            obs, rew, done, info = env.step(actions)
            rew = np.array(rew)
            returns += rew / n_episode
    return returns.tolist()


return_list = []  # record the returns of each evaluation round
total_step = 0
for i_episode in range(num_episodes):
    state = env.reset()
    for e_i in range(episode_length):
        actions = maddpg.take_action(state, explore=True)
        next_state, reward, done, _ = env.step(actions)
        replay_buffer.add(state, actions, reward, next_state, done)
        state = next_state
        total_step += 1
        if replay_buffer.size() >= minimal_size and total_step % update_interval == 0:
            sample = replay_buffer.sample(batch_size)

            def stack_array(x):
                rearranged = [[sub_x[i] for sub_x in x]
                              for i in range(len(x[0]))]
                return [
                    torch.FloatTensor(np.vstack(aa)).to(device)
                    for aa in rearranged
                ]

            sample = [stack_array(x) for x in sample]
            for a_i in range(len(env.agents)):
                maddpg.update(sample, a_i)
            maddpg.update_all_targets()
    if (i_episode + 1) % 100 == 0:
        ep_returns = evaluate(env_id, maddpg, n_episode=100)
        return_list.append(ep_returns)
        print(f"Episode: {i_episode+1}, {ep_returns}")
Episode: 100, [-162.09349111961225, 9.000666921056728, 9.000666921056728]
Episode: 200, [-121.85087049356082, 20.082544683591127, 20.082544683591127]
Episode: 300, [-28.086124816732802, -23.51493605339695, -23.51493605339695]
Episode: 400, [-35.91437846570877, -6.574264880829929, -6.574264880829929]
Episode: 500, [-12.83238365700212, -5.402338391212475, -5.402338391212475]
Episode: 600, [-11.692053500921567, 2.904343355450921, 2.904343355450921]
Episode: 700, [-11.21261001095729, 6.13003213658482, 6.13003213658482]
Episode: 800, [-12.581086056359824, 7.13450533137511, 7.13450533137511]
Episode: 900, [-10.932824468382302, 7.534917449533213, 7.534917449533213]
Episode: 1000, [-10.454432036663551, 7.467940904661571, 7.467940904661571]
Episode: 1100, [-10.099017183836345, 6.764091427064233, 6.764091427064233]
Episode: 1200, [-9.970202627245511, 6.839233648010857, 6.839233648010857]
Episode: 1300, [-8.23988889957424, 5.928539785965939, 5.928539785965939]
Episode: 1400, [-7.618319791914515, 5.4721657785273665, 5.4721657785273665]
Episode: 1500, [-9.528028248906292, 6.716548343395567, 6.716548343395567]
Episode: 1600, [-9.27198788506915, 6.25794360791615, 6.25794360791615]
Episode: 1700, [-9.439913314907297, 6.552076175517556, 6.552076175517556]
Episode: 1800, [-9.41018120255451, 6.170898260988019, 6.170898260988019]
Episode: 1900, [-8.293080671760299, 5.710058304479939, 5.710058304479939]
Episode: 2000, [-8.876670052284371, 5.804116304916539, 5.804116304916539]
Episode: 2100, [-8.20415531215746, 5.170909738207094, 5.170909738207094]
Episode: 2200, [-8.773275999321958, 4.961748911238369, 4.961748911238369]
Episode: 2300, [-8.06474017837516, 5.223795184183733, 5.223795184183733]
Episode: 2400, [-6.587706872401325, 4.366625235204875, 4.366625235204875]
Episode: 2500, [-7.691312056289927, 4.856855290592445, 4.856855290592445]
Episode: 2600, [-8.813560406139358, 5.508815842509804, 5.508815842509804]
Episode: 2700, [-7.056761924960759, 4.758538712873507, 4.758538712873507]
Episode: 2800, [-8.68842389422384, 5.661161581099521, 5.661161581099521]
Episode: 2900, [-7.930406418494052, 4.366106102743839, 4.366106102743839]
Episode: 3000, [-8.114850902595816, 5.1274853968197265, 5.1274853968197265]
Episode: 3100, [-8.381402942461598, 5.093518450135181, 5.093518450135181]
Episode: 3200, [-9.493930234055618, 5.472500034114433, 5.472500034114433]
Episode: 3300, [-8.53312311113189, 4.963767973071618, 4.963767973071618]
Episode: 3400, [-9.229941671093316, 5.555036222150763, 5.555036222150763]
Episode: 3500, [-10.67973248813069, 6.0258368192309115, 6.0258368192309115]
Episode: 3600, [-8.785648619797922, 5.360050159370962, 5.360050159370962]
Episode: 3700, [-10.050750001897885, 5.962048108721202, 5.962048108721202]
Episode: 3800, [-6.673053043055956, 3.732181204778823, 3.732181204778823]
Episode: 3900, [-10.567190838130202, 5.705831860427992, 5.705831860427992]
Episode: 4000, [-9.288291495674969, 5.298166543261745, 5.298166543261745]
Episode: 4100, [-9.433352212890984, 6.016868802323455, 6.016868802323455]
Episode: 4200, [-8.573388252905312, 4.673785791835532, 4.673785791835532]
Episode: 4300, [-8.466209564326363, 5.482892841309288, 5.482892841309288]
Episode: 4400, [-9.988322102926736, 5.5203824927807155, 5.5203824927807155]
Episode: 4500, [-7.4937676078180155, 4.730897948468445, 4.730897948468445]
Episode: 4600, [-8.755589567322176, 5.494709505886223, 5.494709505886223]
Episode: 4700, [-9.16743075823155, 5.234841527940852, 5.234841527940852]
Episode: 4800, [-8.597439825247829, 4.615078133167369, 4.615078133167369]
Episode: 4900, [-9.918505853931377, 5.08561749388552, 5.08561749388552]
Episode: 5000, [-10.16405662517592, 5.43335871613719, 5.43335871613719]
Training is finished; let us see how well the learned policies perform.
return_array = np.array(return_list)
for i, agent_name in enumerate(["adversary_0", "agent_0", "agent_1"]):
    plt.figure()
    plt.plot(
        np.arange(return_array.shape[0]) * 100,
        rl_utils.moving_average(return_array[:, i], 9))
    plt.xlabel("Episodes")
    plt.ylabel("Returns")
    plt.title(f"{agent_name} by MADDPG")
We can see that the good agents agent_0 and agent_1 obtain exactly the same returns, because their reward functions are identical. The good agents end up with positive returns, which shows that by cooperating they manage to occupy two different landmarks, preventing the adversary from telling which landmark is the target. We can also see that MADDPG converges reasonably quickly and stably.
This chapter presented MADDPG, a classic algorithm in the CTDE paradigm of multi-agent reinforcement learning; many later multi-agent RL algorithms build on it. Understanding MADDPG is therefore key to digging deeper into multi-agent algorithms, and interested readers are encouraged to read the original MADDPG paper [1] for a deeper understanding.
[1] LOWE R, WU Y, TAMAR A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments [J]. Advances in Neural Information Processing Systems, 2017, 30: 6379-6390.
[2] MPE benchmarks (see the maddpg_replication.ipynb file in the google/maddpg-replication project on GitHub).