

A Step-by-Step PyTorch Implementation of DDPG Reinforcement Learning

Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy deep reinforcement learning algorithm inspired by Deep Q-Network (DQN). It is an Actor-Critic method based on policy gradients. In this article we implement it in full with PyTorch and explain each step.

The key components of DDPG are:

  • Replay Buffer
  • Actor-Critic neural network
  • Exploration Noise
  • Target network
  • Soft Target Updates for Target Network

Let's implement them one by one:

Replay Buffer

DDPG uses a replay buffer to store the transitions and rewards (sₜ, aₜ, rₜ, sₜ₊₁) collected while exploring the environment. The replay buffer plays a crucial role both in speeding up the agent's learning and in the stability of DDPG:

  • Minimizes correlation between samples: storing past experiences in the replay buffer allows the agent to learn from a variety of experiences.
  • Enables off-policy learning: the agent can sample transitions from the replay buffer rather than only from the current policy.
  • Efficient sampling: storing past experiences in the buffer allows the agent to learn from the same experiences multiple times.
import numpy as np

class Replay_buffer():
    '''
    Code based on:
    https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py
    Expects tuples of (state, next_state, action, reward, done)
    '''
    def __init__(self, max_size=capacity):
        """Create the replay buffer.
        Parameters
        ----------
        max_size: int
            Max number of transitions to store in the buffer. When the buffer
            overflows, the oldest memories are dropped.
        """
        self.storage = []
        self.max_size = max_size
        self.ptr = 0

    def push(self, data):
        if len(self.storage) == self.max_size:
            self.storage[int(self.ptr)] = data
            self.ptr = (self.ptr + 1) % self.max_size
        else:
            self.storage.append(data)

    def sample(self, batch_size):
        """Sample a batch of experiences.
        Parameters
        ----------
        batch_size: int
            How many transitions to sample.
        Returns
        -------
        state: np.array
            batch of states or observations
        next_state: np.array
            next states or observations seen after executing the action
        action: np.array
            batch of actions executed given a state
        reward: np.array
            rewards received as a result of executing the action
        done: np.array
            done[i] = 1 if executing action[i] resulted in
            the end of an episode and 0 otherwise.
        """
        ind = np.random.randint(0, len(self.storage), size=batch_size)
        state, next_state, action, reward, done = [], [], [], [], []
        for i in ind:
            st, n_st, act, rew, dn = self.storage[i]
            state.append(np.array(st, copy=False))
            next_state.append(np.array(n_st, copy=False))
            action.append(np.array(act, copy=False))
            reward.append(np.array(rew, copy=False))
            done.append(np.array(dn, copy=False))
        return np.array(state), np.array(next_state), np.array(action), np.array(reward).reshape(-1, 1), np.array(done).reshape(-1, 1)
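
As a quick sanity check, here is a hypothetical usage sketch of the buffer (not part of the original article). Note that the max_size=capacity default means the global capacity must already be defined when the class definition is executed; it is set in the hyperparameter block further below.

# Hypothetical usage example of Replay_buffer (assumes `capacity` has been defined)
import numpy as np

buffer = Replay_buffer(max_size=1000)
for _ in range(100):
    s  = np.random.randn(2)                 # state (MountainCarContinuous has a 2-dim observation)
    s2 = np.random.randn(2)                 # next state
    a  = np.random.uniform(-1, 1, size=1)   # 1-dim continuous action
    r  = float(np.random.randn())           # reward
    d  = 0.0                                # done flag
    buffer.push((s, s2, a, r, d))

state, next_state, action, reward, done = buffer.sample(batch_size=16)
print(state.shape, action.shape, reward.shape, done.shape)  # (16, 2) (16, 1) (16, 1) (16, 1)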

Actor-Critic Neural Network

This is a PyTorch implementation of the Actor-Critic part of the algorithm. The code defines two neural network models, an Actor and a Critic.

The Actor model's input is the environment state; its output is an action with continuous values.

The Critic model's input is the environment state together with an action; its output is the Q-value, i.e. the expected total reward for the current state-action pair.

import torch
import torch.nn as nn

class Actor(nn.Module):
    """
    The Actor model takes a state observation as input and
    outputs an action, which is a continuous value.
    It consists of four fully connected linear layers with ReLU activation functions and
    a final output layer that selects one single optimized action for the state.
    """
    def __init__(self, n_states, action_dim, hidden1):
        super(Actor, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, hidden1),
            nn.ReLU(),
            nn.Linear(hidden1, hidden1),
            nn.ReLU(),
            nn.Linear(hidden1, hidden1),
            nn.ReLU(),
            nn.Linear(hidden1, 1)
        )

    def forward(self, state):
        return self.net(state)


class Critic(nn.Module):
    """
    The Critic model takes both a state observation and an action as input and
    outputs a Q-value, which estimates the expected total reward for the current state-action pair.
    It consists of four linear layers with ReLU activation functions.
    The state and action inputs are concatenated before being fed into the first linear layer.
    The output layer has a single output, representing the Q-value.
    """
    def __init__(self, n_states, action_dim, hidden2):
        super(Critic, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states + action_dim, hidden2),
            nn.ReLU(),
            nn.Linear(hidden2, hidden2),
            nn.ReLU(),
            nn.Linear(hidden2, hidden2),
            nn.ReLU(),
            nn.Linear(hidden2, action_dim)
        )

    def forward(self, state, action):
        return self.net(torch.cat((state, action), 1))
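
To make the input and output shapes concrete, here is a small hypothetical shape check (the dimensions correspond to MountainCarContinuous-v0, where the observation has 2 dimensions and the action has 1):

# Hypothetical shape check for the Actor and Critic networks
import torch

n_states, action_dim = 2, 1                  # MountainCarContinuous-v0 dimensions
actor = Actor(n_states, action_dim, hidden1=20)
critic = Critic(n_states, action_dim, hidden2=64)

state = torch.randn(8, n_states)             # a batch of 8 states
action = actor(state)                        # shape (8, 1)
q_value = critic(state, action)              # shape (8, 1), since action_dim == 1 here
print(action.shape, q_value.shape)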

Exploration Noise

Adding noise to the actions chosen by the Actor is a technique used in DDPG to encourage exploration and improve the learning process.

Either Gaussian noise or Ornstein-Uhlenbeck noise can be used. Gaussian noise is simple and easy to implement, whereas Ornstein-Uhlenbeck noise generates temporally correlated noise, which can help the agent explore the action space more effectively. Compared with Gaussian noise, however, Ornstein-Uhlenbeck noise fluctuates more smoothly and is less random.

import numpy as np
import random
import copy

class OU_Noise(object):
    """Ornstein-Uhlenbeck process.
    Code from:
    https://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab
    The OU_Noise class has four attributes:
        size: the size of the noise vector to be generated
        mu: the mean of the noise, set to 0 by default
        theta: the rate of mean reversion, controlling how quickly the noise returns to the mean
        sigma: the volatility of the noise, controlling the magnitude of fluctuations
    """
    def __init__(self, size, seed, mu=0., theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.seed = random.seed(seed)
        self.reset()

    def reset(self):
        """Reset the internal state (= noise) to the mean (mu)."""
        self.state = copy.copy(self.mu)

    def sample(self):
        """Update the internal state and return it as a noise sample.
        This method uses the current state of the noise to generate the next sample.
        """
        dx = self.theta * (self.mu - self.state) + self.sigma * np.array([np.random.normal() for _ in range(len(self.state))])
        self.state += dx
        return self.state

To use Gaussian noise in DDPG, you simply add Gaussian noise to the agent's action-selection step.
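
A minimal, self-contained sketch of this pattern (the values below are illustrative; the actual training loop further below does the same thing with the trained actor's output):

# Minimal sketch: add Gaussian noise to a deterministic action for exploration
# (illustrative, self-contained values; not taken from the original article)
import numpy as np

max_action = 1.0                          # action bound of the environment
exploration_noise = 0.1 * max_action      # noise scale
action_dim = 1

deterministic_action = np.array([0.3])    # e.g. the actor's output for some state
noise = np.random.normal(0, exploration_noise, size=action_dim)
noisy_action = (deterministic_action + noise).clip(-max_action, max_action)
print(noisy_action)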

DDPG

DDPG (Deep Deterministic Policy Gradient) uses two pairs of Actor-Critic neural networks for function approximation. In DDPG, the target networks are also an Actor-Critic pair, with the same structure and parameterization as the main Actor-Critic networks.

During training, the agent interacts with the environment using its Actor-Critic networks and stores the experience tuples (Sₜ, Aₜ, Rₜ, Sₜ₊₁) in the replay buffer. The agent then samples from the replay buffer and uses that data to update the Actor-Critic networks. Rather than copying the weights of the Actor-Critic networks directly, the DDPG algorithm updates the target network weights slowly through a process called soft target updates.

A soft target update transfers only a small fraction of the Actor-Critic network weights, determined by the target update rate (τ), to the target networks.

The soft update formula is as follows, where θ are the weights of the main (Actor or Critic) network, θ′ are the weights of the corresponding target network, and τ is the target update rate:

θ′ ← τ·θ + (1 − τ)·θ′

Using soft target updates greatly improves the stability of learning.
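
Written as a standalone helper, the soft update looks like this (a hypothetical helper for illustration; the update() method of the DDPG class below performs exactly this parameter-wise interpolation inline):

# Hypothetical helper illustrating the soft target update;
# DDPG.update() below performs the same loop inline.
def soft_update(net, target_net, tau):
    for param, target_param in zip(net.parameters(), target_net.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)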

import torch
import torch.nn.functional as F
import torch.optim as optim

# Set hyperparameters
# (adapted for performance; the source of these values is linked at the end of the article)
capacity = 1000000
batch_size = 64
update_iteration = 200
tau = 0.001       # tau for the soft target updates
gamma = 0.99      # discount factor
directory = './'
hidden1 = 20      # hidden layer size for the actor
hidden2 = 64      # hidden layer size for the critic

# Note: `device` is defined in the training setup further below.
class DDPG(object):
    def __init__(self, state_dim, action_dim):
        """
        Initializes the DDPG agent.
        Takes two arguments:
            state_dim, the dimensionality of the state space, and
            action_dim, the dimensionality of the action space.
        Creates a replay buffer, the actor-critic networks and their corresponding target networks.
        It also initializes the optimizers for both the actor and critic networks, along with
        counters to track the number of training iterations.
        """
        self.replay_buffer = Replay_buffer()

        self.actor = Actor(state_dim, action_dim, hidden1).to(device)
        self.actor_target = Actor(state_dim, action_dim, hidden1).to(device)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=3e-3)

        self.critic = Critic(state_dim, action_dim, hidden2).to(device)
        self.critic_target = Critic(state_dim, action_dim, hidden2).to(device)
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=2e-2)

        self.num_critic_update_iteration = 0
        self.num_actor_update_iteration = 0
        self.num_training = 0

    def select_action(self, state):
        """
        Takes the current state as input and returns an action to take in that state.
        It uses the actor network to map the state to an action.
        """
        state = torch.FloatTensor(state.reshape(1, -1)).to(device)
        return self.actor(state).cpu().data.numpy().flatten()

    def update(self):
        """
        Updates the actor and critic networks using batches of samples from the replay buffer.
        For each batch, it computes the target Q value using the target critic and target actor
        networks, and the current Q value using the critic network and the stored action.
        The critic loss is the mean squared error between the target Q value and the current
        Q value, and the critic network is updated by gradient descent.
        The actor loss is the negative mean Q value of the actor's actions under the critic
        network, and the actor network is updated accordingly.
        Finally, the target networks are updated with soft updates, where a small fraction (tau)
        of the actor and critic network weights is transferred to their target counterparts.
        This process is repeated for a fixed number of iterations.
        """
        for it in range(update_iteration):
            # Sample a batch from the replay buffer
            state, next_state, action, reward, done = self.replay_buffer.sample(batch_size)
            state = torch.FloatTensor(state).to(device)
            action = torch.FloatTensor(action).to(device)
            next_state = torch.FloatTensor(next_state).to(device)
            done = torch.FloatTensor(1 - done).to(device)
            reward = torch.FloatTensor(reward).to(device)

            # Compute the target Q value
            target_Q = self.critic_target(next_state, self.actor_target(next_state))
            target_Q = reward + (done * gamma * target_Q).detach()

            # Get the current Q estimate
            current_Q = self.critic(state, action)

            # Compute the critic loss
            critic_loss = F.mse_loss(current_Q, target_Q)

            # Optimize the critic
            self.critic_optimizer.zero_grad()
            critic_loss.backward()
            self.critic_optimizer.step()

            # Compute the actor loss as the negative mean Q value of the actor's actions
            actor_loss = -self.critic(state, self.actor(state)).mean()

            # Optimize the actor
            self.actor_optimizer.zero_grad()
            actor_loss.backward()
            self.actor_optimizer.step()

            # Soft-update the frozen target networks: transfer a small fraction (tau)
            # of the actor and critic weights to their target counterparts
            for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
                target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
            for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

            self.num_actor_update_iteration += 1
            self.num_critic_update_iteration += 1

    def save(self):
        """Saves the state dictionaries of the actor and critic networks to files."""
        torch.save(self.actor.state_dict(), directory + 'actor.pth')
        torch.save(self.critic.state_dict(), directory + 'critic.pth')

    def load(self):
        """Loads the state dictionaries of the actor and critic networks from files."""
        self.actor.load_state_dict(torch.load(directory + 'actor.pth'))
        self.critic.load_state_dict(torch.load(directory + 'critic.pth'))

Training DDPG

Here we use OpenAI Gym's "MountainCarContinuous-v0" environment to train our DDPG RL model. This environment provides continuous action and observation spaces, and the goal is to get the car to the top of the mountain as quickly as possible.

Below we define the various parameters of the algorithm, such as the maximum number of training episodes, the exploration noise, the rendering interval, and so on. Using a fixed random seed makes the runs reproducible.

import gym

# Create the environment
env_name = 'MountainCarContinuous-v0'
env = gym.make(env_name)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Define the parameters for training the agent
max_episode = 100
max_time_steps = 5000
ep_r = 0
total_step = 0
score_hist = []

# For rendering the environment
render = True
render_interval = 10

# For reproducibility
env.seed(0)
torch.manual_seed(0)
np.random.seed(0)

# Environment action and state dimensions
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
max_action = float(env.action_space.high[0])
min_Val = torch.tensor(1e-7).float().to(device)

# Exploration noise scale
exploration_noise = 0.1 * max_action
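
The training loop below adds Gaussian noise to the actions, but it also contains a commented-out line for OU noise. If you would rather use the OU_Noise class defined earlier, you could instantiate it here (a sketch; the name ou_noise matches the commented-out line in the loop):

# Optional: create the OU noise process defined earlier so that the
# commented-out `action += ou_noise.sample()` line in the loop can be used.
# It is common to call ou_noise.reset() at the start of each episode.
ou_noise = OU_Noise(size=action_dim, seed=0)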

We create an instance of the DDPG agent class and train it for the specified number of episodes. The agent's update() method is called at the end of each episode to update its parameters, and every ten episodes the save() method writes the agent's parameters to a file.

# Create a DDPG instance
agent = DDPG(state_dim, action_dim)

# Train the agent for max_episode episodes
for i in range(max_episode):
    total_reward = 0
    step = 0
    state = env.reset()
    for t in range(max_time_steps):
        action = agent.select_action(state)
        # Add Gaussian noise to the action for exploration
        action = (action + np.random.normal(0, 1, size=action_dim)).clip(-max_action, max_action)
        # action += ou_noise.sample()
        next_state, reward, done, info = env.step(action)
        total_reward += reward
        if render and i >= render_interval:
            env.render()
        agent.replay_buffer.push((state, next_state, action, reward, float(done)))
        state = next_state
        if done:
            break
        step += 1

    score_hist.append(total_reward)
    total_step += step + 1
    print("Episode: \t{} Total Reward: \t{:0.2f}".format(i, total_reward))
    agent.update()
    if i % 10 == 0:
        agent.save()

env.close()
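
After training, the per-episode returns collected in score_hist can be plotted to inspect learning progress (an optional sketch, assuming matplotlib is installed):

# Optional: plot the per-episode return collected during training (requires matplotlib)
import matplotlib.pyplot as plt

plt.plot(score_hist)
plt.xlabel('Episode')
plt.ylabel('Total reward')
plt.title('DDPG on MountainCarContinuous-v0')
plt.show()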

Testing DDPG

from itertools import count

test_iteration = 100

for i in range(test_iteration):
    state = env.reset()
    for t in count():
        action = agent.select_action(state)
        next_state, reward, done, info = env.step(np.float32(action))
        ep_r += reward
        print(reward)
        env.render()
        if done:
            print("reward {}".format(reward))
            print("Episode \t{}, the episode reward is \t{:0.2f}".format(i, ep_r))
            ep_r = 0
            env.render()
            break
        state = next_state

We used the following settings to get the model to converge:

  • Sample the noise from a standard normal distribution instead of random sampling.
  • Change the polyak constant (tau) from 0.99 to 0.001.
  • Change the Critic network's hidden layer sizes to [64, 64] and remove the ReLU activation after the second layer, so the Critic becomes (Linear, ReLU, Linear, Linear); see the sketch after this list.
  • Change the maximum buffer size to 1,000,000.
  • Change the batch_size from 128 to 64.
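
For reference, here is a sketch of the Critic variant described in the third bullet above: hidden sizes [64, 64] and no ReLU after the second linear layer, i.e. (Linear, ReLU, Linear, Linear). This is an interpretation of the author's note, not code taken from the article:

# Sketch of the modified Critic described above (an interpretation of the note,
# not code from the article): hidden sizes [64, 64], ReLU removed after the
# second linear layer, single Q-value output.
import torch
import torch.nn as nn

class CriticModified(nn.Module):
    def __init__(self, n_states, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states + action_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.Linear(64, 1),    # single Q-value output
        )

    def forward(self, state, action):
        return self.net(torch.cat((state, action), 1))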

Here is how the agent performs after 75 episodes of training:

Summary

The DDPG algorithm is a model-free, off-policy Actor-Critic algorithm inspired by the Deep Q-Network (DQN) algorithm. It combines the advantages of policy gradient methods and Q-learning to learn a deterministic policy for continuous action spaces.

Like DQN, it uses a replay buffer to store past experiences and target networks for training, which improves the stability of the training process.

The DDPG algorithm requires careful hyperparameter tuning to achieve its best performance. The hyperparameters include the learning rates, batch size, target network update rate, and exploration noise parameters. Small changes in these hyperparameters can have a significant impact on the algorithm's performance.

The parameters above come from:

https://ai.stackexchange.com/questions/22945/ddpg-doesnt-converge-for-mountaincarcontinuous-v0-gym-environment

The complete code for this article:

https://github.com/arshren/Reinforcement_Learning/blob/main/DDPG-MountainCar.ipynb

Author: Renu Khandelwal
