The previous chapters introduced the policy-gradient algorithms REINFORCE and Actor-Critic, as well as two improved algorithms, TRPO and PPO. These algorithms share a common trait: they are all on-policy, which means their sample efficiency is relatively low. Recall the DQN algorithm: DQN directly estimates the optimal Q function and can learn off-policy, but it can only handle environments with a finite action space, because it must select, from all available actions, the one with the largest Q value. The deep deterministic policy gradient (DDPG) algorithm covered in this chapter carries the off-policy idea over to continuous action spaces: instead of a stochastic policy, it learns a deterministic policy and trains it by gradient ascent on the Q function, so it is also a kind of Actor-Critic algorithm.
The policies we studied previously are stochastic and can be written as $a \sim \pi_\theta(\cdot \mid s)$; a deterministic policy instead outputs a single action, written as $a = \mu_\theta(s)$. Analogously to the policy gradient theorem, one can derive the deterministic policy gradient (DPG) theorem:
$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \nu^{\beta}}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s, a)\big|_{a=\mu_\theta(s)}\right]$$
where $\beta$ is the behavior policy used to collect data and $\nu^{\beta}$ is its state distribution (the full derivation is given at the end of this chapter). The theorem can be understood as follows: suppose we already have a function $Q$ and are given a state $s$. Because the action space is now continuous, we cannot enumerate all actions to find the one with the largest value; instead we want a policy $\mu$ that directly outputs the maximizing action, $\mu(s) = \arg\max_a Q(s, a)$. Here $Q$ plays the role of the Critic and $\mu$ plays the role of the Actor, so this is again an Actor-Critic framework.
How do we obtain such a $\mu$? We differentiate $Q(s, \mu_\theta(s))$ with respect to $\theta$ using the chain rule, first with respect to the action $a$ and then with respect to $\theta$, and perform gradient ascent on $Q$ so that the policy outputs actions with ever larger values.
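To make this chain-rule step concrete, here is a toy sketch (not part of the chapter's code; the quadratic Q and linear policy are made up for illustration): for a known differentiable Q, a few gradient-ascent steps on $Q(s, \mu_\theta(s))$ push the policy's output toward $\arg\max_a Q(s, a)$.

import torch

# Toy check of gradient ascent through Q (illustrative only):
# Q(s, a) = -(a - s)^2 is maximized at a = s, so the policy should learn mu(s) ≈ s.
torch.manual_seed(0)
mu = torch.nn.Linear(1, 1)                      # deterministic policy mu_theta(s)
opt = torch.optim.SGD(mu.parameters(), lr=0.1)

def Q(s, a):                                    # a known, differentiable critic
    return -(a - s) ** 2

s = torch.tensor([[0.7]])
for _ in range(200):
    loss = -Q(s, mu(s)).mean()                  # maximize Q  <=>  minimize -Q
    opt.zero_grad()
    loss.backward()                             # chain rule: dQ/da * da/dtheta
    opt.step()
print(mu(s).item())                             # ≈ 0.7 = argmax_a Q(s, a)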
Now let us look at the details of the DDPG algorithm. DDPG uses four neural networks: one for the Actor, one for the Critic, and a target network for each of them. The Actor also needs a target network because the target Actor's action is used when computing the target Q value. Unlike DQN, which copies the Q network into the target Q network every fixed number of steps, DDPG updates its target networks with a soft update, letting each target network slowly track its online counterpart:
$$\omega^- \leftarrow \tau \omega + (1 - \tau)\,\omega^-$$
Here $\tau$ is usually a small number; when $\tau = 1$ the rule degenerates into DQN's hard copy. The target policy network is updated in the same soft way, $\theta^- \leftarrow \tau \theta + (1 - \tau)\,\theta^-$.
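As a tiny illustration of how slowly the target parameters move under a soft update (a throwaway sketch with toy tensors standing in for network parameters, not part of the chapter's code):

import torch

# Soft update w_target <- tau*w + (1 - tau)*w_target on toy "parameters".
tau = 0.005
w = torch.ones(3)           # parameters of the online network
w_target = torch.zeros(3)   # parameters of the target network

for step in range(3):
    w_target = tau * w + (1 - tau) * w_target
    print(step, w_target)   # creeps toward w by roughly 0.5% per update

# Setting tau = 1 would overwrite the target in a single step, recovering the
# hard periodic copy used by DQN.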
In addition, since the function Q is approximated by a neural network and trained from bootstrapped targets, DDPG also relies on experience replay in the same way as DQN: transitions collected by interacting with the environment are stored in a replay buffer, and mini-batches are sampled from it to update both the Critic and the Actor, which is what makes the algorithm off-policy.
Below we take the inverted pendulum (Pendulum) environment as an example and walk through a concrete implementation of DDPG together with the code.
import random
import gym
import numpy as np
from tqdm import tqdm
import torch
from torch import nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import rl_utils
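The rl_utils module is the book's shared utility code and its source is not reproduced in this chapter. For readers who do not have it at hand, the following is a minimal sketch of the three pieces used below (ReplayBuffer, moving_average, train_off_policy_agent), under the assumptions of a uniform-sampling buffer, the old gym step API, and a plain episode loop without the tqdm progress bars seen in the output further down; the book's actual implementation may differ in such details.

import collections
import random
import numpy as np

class ReplayBuffer:
    ''' Uniformly sampled experience replay buffer (sketch). '''
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        transitions = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = zip(*transitions)
        return np.array(state), action, reward, np.array(next_state), done

    def size(self):
        return len(self.buffer)

def moving_average(a, window_size):
    ''' Same-length smoothing of the return curve; edges use a partial window (sketch). '''
    a = np.asarray(a, dtype=float)
    half = window_size // 2
    return np.array([a[max(0, i - half):i + half + 1].mean() for i in range(len(a))])

def train_off_policy_agent(env, agent, num_episodes, replay_buffer, minimal_size, batch_size):
    ''' Generic training loop for off-policy agents (sketch, old gym API). '''
    return_list = []
    for i_episode in range(num_episodes):
        episode_return = 0
        state = env.reset()
        done = False
        while not done:
            action = agent.take_action(state)
            next_state, reward, done, _ = env.step(action)
            replay_buffer.add(state, action, reward, next_state, done)
            state = next_state
            episode_return += reward
            if replay_buffer.size() > minimal_size:  # start updating once enough data is stored
                b_s, b_a, b_r, b_ns, b_d = replay_buffer.sample(batch_size)
                transition_dict = {'states': b_s, 'actions': b_a, 'rewards': b_r,
                                   'next_states': b_ns, 'dones': b_d}
                agent.update(transition_dict)
        return_list.append(episode_return)
    return return_list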
For the policy network and the value network we use small fully connected networks: the policy network has a single hidden layer, and the Q network below uses two. The output layer of the policy network uses the hyperbolic tangent (tanh) activation, whose range is $[-1, 1]$, so the output can be rescaled to the action range the environment accepts. The Q network takes the concatenation of the state and the action as input and outputs a single scalar, the estimated value of that state-action pair.
class PolicyNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim, action_bound):
        super(PolicyNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, action_dim)
        self.action_bound = action_bound  # action_bound is the largest action value the environment accepts

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return torch.tanh(self.fc2(x)) * self.action_bound


class QValueNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(QValueNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim + action_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, hidden_dim)
        self.fc_out = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x, a):
        cat = torch.cat([x, a], dim=1)  # concatenate the state and the action
        x = F.relu(self.fc1(cat))
        x = F.relu(self.fc2(x))
        return self.fc_out(x)
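As a quick sanity check (illustrative only, with Pendulum-like dimensions), we can confirm that the policy output respects the action bound and that the Q network returns one scalar per state-action pair:

import torch

# Illustrative shape check of the two networks (not part of the chapter's training code).
policy = PolicyNet(state_dim=3, hidden_dim=64, action_dim=1, action_bound=2.0)
q_net = QValueNet(state_dim=3, hidden_dim=64, action_dim=1)

states = torch.randn(5, 3)                 # a batch of 5 fake Pendulum states
actions = policy(states)
print(actions.shape, actions.abs().max())  # torch.Size([5, 1]), always <= 2.0
print(q_net(states, actions).shape)        # torch.Size([5, 1])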
Next comes the main body of the DDPG algorithm. When acting with the policy network, we add noise to the chosen action to encourage exploration. In the original DDPG paper, the added noise follows an Ornstein-Uhlenbeck (OU) process:
$$\mathrm{d}x_t = \theta\,(\mu - x_t)\,\mathrm{d}t + \sigma\,\mathrm{d}W_t$$
where $\mu$ is the mean the noise reverts to, $\theta$ controls how quickly it reverts, $\sigma$ scales the random perturbation, and $W_t$ is a Wiener process (here $\theta$, $\mu$ and $\sigma$ are hyperparameters of the noise process, not the policy's parameters). OU noise is temporally correlated, which can help exploration in control problems with inertia. In practice, however, uncorrelated Gaussian noise works about as well and is much simpler, so the implementation in this chapter simply adds zero-mean Gaussian noise to the actions.
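For completeness, a discretized OU process can be simulated in a few lines; the sketch below uses illustrative hyperparameter values rather than anything taken from the chapter, and the implementation that follows uses plain Gaussian noise instead.

import numpy as np

# Euler-Maruyama discretization of the OU process dx_t = theta*(mu - x_t)dt + sigma*dW_t.
# The hyperparameter values are illustrative only.
class OUNoise:
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.ones(action_dim) * mu

    def sample(self):
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x

noise = OUNoise(action_dim=1)
print([noise.sample().item() for _ in range(5)])  # temporally correlated noise samples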
class DDPG:
    ''' DDPG algorithm '''
    def __init__(self, state_dim, hidden_dim, action_dim, action_bound, sigma, actor_lr, critic_lr, tau, gamma, device):
        self.actor = PolicyNet(state_dim, hidden_dim, action_dim, action_bound).to(device)
        self.critic = QValueNet(state_dim, hidden_dim, action_dim).to(device)
        self.target_actor = PolicyNet(state_dim, hidden_dim, action_dim, action_bound).to(device)
        self.target_critic = QValueNet(state_dim, hidden_dim, action_dim).to(device)
        # initialize the target value network with the same parameters as the value network
        self.target_critic.load_state_dict(self.critic.state_dict())
        # initialize the target policy network with the same parameters as the policy network
        self.target_actor.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=critic_lr)
        self.gamma = gamma
        self.sigma = sigma  # standard deviation of the Gaussian noise; its mean is simply 0
        self.tau = tau  # soft-update coefficient for the target networks
        self.action_dim = action_dim
        self.device = device

    def take_action(self, state):
        state = torch.tensor([state], dtype=torch.float).to(self.device)
        action = self.actor(state).item()
        # add noise to the action to encourage exploration
        action = action + self.sigma * np.random.randn(self.action_dim)
        return action

    def soft_update(self, net, target_net):
        for param_target, param in zip(target_net.parameters(), net.parameters()):
            param_target.data.copy_(param_target.data * (1.0 - self.tau) + param.data * self.tau)

    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'], dtype=torch.float).to(self.device)
        actions = torch.tensor(transition_dict['actions'], dtype=torch.float).view(-1, 1).to(self.device)
        rewards = torch.tensor(transition_dict['rewards'], dtype=torch.float).view(-1, 1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'], dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'], dtype=torch.float).view(-1, 1).to(self.device)

        next_q_values = self.target_critic(next_states, self.target_actor(next_states))
        q_targets = rewards + self.gamma * next_q_values * (1 - dones)
        critic_loss = torch.mean(F.mse_loss(self.critic(states, actions), q_targets))
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        actor_loss = -torch.mean(self.critic(states, self.actor(states)))
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        self.soft_update(self.actor, self.target_actor)  # soft-update the policy network
        self.soft_update(self.critic, self.target_critic)  # soft-update the value network
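Before launching full training, a quick smoke test with random data (illustrative only; the dimensions are chosen to match Pendulum) shows the transition_dict format that update expects:

import numpy as np
import torch

# Smoke test with random data: one update step and one noisy action (illustrative only).
device = torch.device('cpu')
agent = DDPG(state_dim=3, hidden_dim=64, action_dim=1, action_bound=2.0,
             sigma=0.01, actor_lr=3e-4, critic_lr=3e-3, tau=0.005,
             gamma=0.98, device=device)
batch = {'states': np.random.randn(8, 3).astype(np.float32),
         'actions': np.random.uniform(-2, 2, size=(8, 1)).astype(np.float32),
         'rewards': np.random.randn(8).astype(np.float32),
         'next_states': np.random.randn(8, 3).astype(np.float32),
         'dones': np.zeros(8, dtype=np.float32)}
agent.update(batch)                          # one gradient step on random data
print(agent.take_action([0.1, 0.0, 0.0]))    # a 1-D noisy action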
Next we train DDPG in the inverted pendulum environment and plot its performance curve.
actor_lr = 3e-4
critic_lr = 3e-3
num_episodes = 200
hidden_dim = 64
gamma = 0.98
tau = 0.005  # soft-update coefficient
buffer_size = 10000
minimal_size = 1000
batch_size = 64
sigma = 0.01  # standard deviation of the Gaussian exploration noise
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

env_name = 'Pendulum-v0'
env = gym.make(env_name)
random.seed(0)
np.random.seed(0)
env.seed(0)
torch.manual_seed(0)
replay_buffer = rl_utils.ReplayBuffer(buffer_size)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
action_bound = env.action_space.high[0]  # largest action value
agent = DDPG(state_dim, hidden_dim, action_dim, action_bound, sigma, actor_lr, critic_lr, tau, gamma, device)

return_list = rl_utils.train_off_policy_agent(env, agent, num_episodes, replay_buffer, minimal_size, batch_size)
Iteration 0: 100%|██████████| 20/20 [00:11<00:00, 1.78it/s, episode=20, return=-1266.015]
Iteration 1: 100%|██████████| 20/20 [00:14<00:00, 1.39it/s, episode=40, return=-610.296]
Iteration 2: 100%|██████████| 20/20 [00:14<00:00, 1.37it/s, episode=60, return=-185.336]
Iteration 3: 100%|██████████| 20/20 [00:14<00:00, 1.36it/s, episode=80, return=-201.593]
Iteration 4: 100%|██████████| 20/20 [00:14<00:00, 1.37it/s, episode=100, return=-157.392]
Iteration 5: 100%|██████████| 20/20 [00:14<00:00, 1.39it/s, episode=120, return=-156.995]
Iteration 6: 100%|██████████| 20/20 [00:14<00:00, 1.39it/s, episode=140, return=-175.051]
Iteration 7: 100%|██████████| 20/20 [00:14<00:00, 1.36it/s, episode=160, return=-191.872]
Iteration 8: 100%|██████████| 20/20 [00:14<00:00, 1.38it/s, episode=180, return=-192.037]
Iteration 9: 100%|██████████| 20/20 [00:14<00:00, 1.36it/s, episode=200, return=-204.490]
episodes_list = list(range(len(return_list)))
plt.plot(episodes_list, return_list)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('DDPG on {}'.format(env_name))
plt.show()

mv_return = rl_utils.moving_average(return_list, 9)
plt.plot(episodes_list, mv_return)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('DDPG on {}'.format(env_name))
plt.show()
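Beyond the training curve, one can also roll out the learned policy without exploration noise to gauge its final quality. The following is a minimal sketch that reuses the env, agent, and device defined above:

# Evaluate the trained policy without exploration noise (illustrative sketch).
eval_returns = []
for _ in range(10):
    state = env.reset()
    done, episode_return = False, 0
    while not done:
        state_tensor = torch.tensor([state], dtype=torch.float).to(device)
        action = agent.actor(state_tensor).detach().cpu().numpy()[0]  # deterministic action, no noise
        state, reward, done, _ = env.step(action)
        episode_return += reward
    eval_returns.append(episode_return)
print(np.mean(eval_returns))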
We can see that DDPG performs quite well in the inverted pendulum environment: it learns very quickly and does not need many samples. Interested readers can try tuning the hyperparameters themselves (for example, the scale of the Gaussian exploration noise) and observe how the training results change.
This chapter presented the deep deterministic policy gradient (DDPG) algorithm, a classic algorithm for training deep deterministic policies in continuous action spaces. Compared with its predecessor, the deterministic policy gradient (DPG) algorithm, DDPG adds target networks and soft updates, which are key to stable learning when the value network and the policy network are both built from deep models. DDPG has also been carried over to multi-agent reinforcement learning, giving rise to the MADDPG algorithm, which we will discuss in a later chapter.
For a deterministic policy $\mu_\theta$, the value function satisfies $V^{\mu_\theta}(s) = Q^{\mu_\theta}(s, \mu_\theta(s))$, and the objective is the expected value under the initial state distribution:
$$J(\mu_\theta) = \mathbb{E}_{s \sim \rho_0}\left[V^{\mu_\theta}(s)\right]$$
where $\rho_0$ denotes the initial state distribution. Below, $p(s \to s', t, \mu_\theta)$ denotes the probability density of reaching state $s'$ from state $s$ after $t$ steps when following $\mu_\theta$.
We first compute $\nabla_\theta V^{\mu_\theta}(s)$ directly:
$$
\begin{aligned}
\nabla_\theta V^{\mu_\theta}(s)
&= \nabla_\theta Q^{\mu_\theta}(s, \mu_\theta(s)) \\
&= \nabla_\theta \left( r(s, \mu_\theta(s)) + \int_{\mathcal{S}} \gamma\, p(s' \mid s, \mu_\theta(s))\, V^{\mu_\theta}(s')\, \mathrm{d}s' \right) \\
&= \nabla_\theta \mu_\theta(s)\, \nabla_a r(s, a)\big|_{a=\mu_\theta(s)} + \nabla_\theta \int_{\mathcal{S}} \gamma\, p(s' \mid s, \mu_\theta(s))\, V^{\mu_\theta}(s')\, \mathrm{d}s' \\
&= \nabla_\theta \mu_\theta(s)\, \nabla_a r(s, a)\big|_{a=\mu_\theta(s)}
 + \int_{\mathcal{S}} \gamma \left( p(s' \mid s, \mu_\theta(s))\, \nabla_\theta V^{\mu_\theta}(s')
 + \nabla_\theta \mu_\theta(s)\, \nabla_a p(s' \mid s, a)\big|_{a=\mu_\theta(s)}\, V^{\mu_\theta}(s') \right) \mathrm{d}s' \\
&= \nabla_\theta \mu_\theta(s)\, \nabla_a \left( r(s, a) + \int_{\mathcal{S}} \gamma\, p(s' \mid s, a)\, V^{\mu_\theta}(s')\, \mathrm{d}s' \right)\bigg|_{a=\mu_\theta(s)}
 + \int_{\mathcal{S}} \gamma\, p(s' \mid s, \mu_\theta(s))\, \nabla_\theta V^{\mu_\theta}(s')\, \mathrm{d}s' \\
&= \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s, a)\big|_{a=\mu_\theta(s)}
 + \int_{\mathcal{S}} \gamma\, p(s \to s', 1, \mu_\theta)\, \nabla_\theta V^{\mu_\theta}(s')\, \mathrm{d}s'
\end{aligned}
$$
So far we have only applied the chain rule, grouped like terms, and substituted definitions; the key step is handling the integral on the right-hand side. The term $\nabla_\theta V^{\mu_\theta}(s')$ appearing inside the integral has exactly the same form as the left-hand side, so we can substitute the whole expression into itself and unroll it over time. Doing so repeatedly gives
$$\nabla_\theta V^{\mu_\theta}(s) = \int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^t\, p(s \to s', t, \mu_\theta)\, \nabla_\theta \mu_\theta(s')\, \nabla_a Q^{\mu_\theta}(s', a)\big|_{a=\mu_\theta(s')}\, \mathrm{d}s'$$
This gives us $\nabla_\theta V^{\mu_\theta}(s)$.
We can now compute the gradient of the objective $J(\mu_\theta)$:
$$
\begin{aligned}
\nabla_\theta J(\mu_\theta)
&= \nabla_\theta \mathbb{E}_{s \sim \rho_0}\left[V^{\mu_\theta}(s)\right]
 = \int_{\mathcal{S}} \rho_0(s)\, \nabla_\theta V^{\mu_\theta}(s)\, \mathrm{d}s \\
&= \int_{\mathcal{S}} \int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^t\, \rho_0(s)\, p(s \to s', t, \mu_\theta)\, \nabla_\theta \mu_\theta(s')\, \nabla_a Q^{\mu_\theta}(s', a)\big|_{a=\mu_\theta(s')}\, \mathrm{d}s'\, \mathrm{d}s \\
&= \int_{\mathcal{S}} \nu^{\mu_\theta}(s')\, \nabla_\theta \mu_\theta(s')\, \nabla_a Q^{\mu_\theta}(s', a)\big|_{a=\mu_\theta(s')}\, \mathrm{d}s' \\
&= \mathbb{E}_{s \sim \nu^{\mu_\theta}}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s, a)\big|_{a=\mu_\theta(s)}\right]
\end{aligned}
$$
where $\nu^{\mu_\theta}(s') = \int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^t \rho_0(s)\, p(s \to s', t, \mu_\theta)\, \mathrm{d}s$ is the discounted state visitation distribution induced by $\mu_\theta$.
The derivation above proves the on-policy form of the DPG theorem: the subscript of the expectation makes explicit that the states are drawn from the state distribution $\nu^{\mu_\theta}$ induced by the policy $\mu_\theta$ itself. In the off-policy setting, the data are instead collected by a behavior policy $\beta$, and the gradient is approximated by replacing $\nu^{\mu_\theta}$ with the behavior policy's state distribution $\nu^{\beta}$, which yields the form stated earlier in this chapter and used when training DDPG.
[1] SILVER D, LEVER G, HEESS N, et al. Deterministic policy gradient algorithms [C]// International Conference on Machine Learning, PMLR, 2014: 387-395.
[2] LILLICRAP T P, HUNT J J, PRITZEL A, et al. Continuous control with deep reinforcement learning [C]// International Conference on Learning Representations, 2016.