

Basic Reinforcement Learning (RL) Algorithms with Detailed Code Demos


  • Gym environments: https://www.gymlibrary.dev/
  • Installation
  - My versions:

| package | version |
| --- | --- |
| gym | 0.24.0 |
| ale-py | 0.7.5 |
| torch | 1.11.0 |
| torchvision | 0.12.0 |
| tensorboard | 2.6.0 |

  - Install commands:

```
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple gym
pip install --no-index -f https://github.com/Kojoley/atari-py/releases atari_py
pip install gym[atari]
pip uninstall ale-py
pip install ale-py
```

  - Installing box2d: you may run into "building wheel failed for box2d". Download the matching PyBox2D .whl file from https://www.lfd.uci.edu/~gohlke/pythonlibs/ and install it from the command line:

```
pip install D:\FILES\PYTHON_PROJECTS\Box2D-2.3.10-cp37-cp37m-win_amd64.whl
```

  A quick sanity check of the installation is sketched below.
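A minimal sketch for checking that the install works, assuming gym and torch are installed as above (CartPole-v1 is used here only because it needs no Atari or Box2D extras; the printed version strings will be whatever you installed):

```python
# Minimal installation check: print library versions and run one random step.
import gym
import torch

print("gym:", gym.__version__)      # expected 0.24.x for the code in this post
print("torch:", torch.__version__)

env = gym.make("CartPole-v1")       # no Atari/Box2D dependency needed
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
print("step OK, reward =", reward)
env.close()
```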

1. Sarsa (Cliff Walking)

1.1 The CliffWalking-v0 environment

In a 4x12 grid, the agent starts at the bottom-left corner and must reach the goal at the bottom-right corner. At every step it moves one cell up, down, left, or right, and each move yields a reward of -1.

(figure: the CliffWalking-v0 grid)

  • If the agent "falls into the cliff", it is immediately sent back to the start position and receives a reward of -100.
  • The episode ends when the agent reaches the goal; the episode return is the sum of the per-step rewards.
```python
import gym

env = gym.make("CliffWalking-v0")
observation = env.reset()
env.render()
```


  • The shortest path from start to goal takes 13 steps, each worth -1, so the best possible episode reward is -13. The goal of training is a model whose test-episode reward gets close to -13. For comparison, a random-policy rollout is sketched below.
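A minimal sketch of such a random-policy episode (gym 0.24 API, as used throughout this post); the reward it prints is usually far below -13, since a random walk keeps falling into the cliff:

```python
# Roll out one episode with a random policy on CliffWalking-v0.
import gym

env = gym.make("CliffWalking-v0")
obs = env.reset()
total_reward, done, steps = 0, False, 0
while not done and steps < 20000:   # cap the rollout; a random walk can wander for a very long time
    obs, reward, done, _ = env.step(env.action_space.sample())
    total_reward += reward
    steps += 1
print("random-policy episode reward:", total_reward)
env.close()
```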

1.2 The Sarsa algorithm

Algorithm parameters: step size $\alpha \in (0,1]$ and a small $\epsilon > 0$ (the two hyperparameters)

For all $(s,a)$, initialize $Q(s,a)$ arbitrarily; at the terminal state, $Q(s_{end},a) = 0$

for (each trajectory):

Initialize $s_t$ and choose $a_t = \epsilon\text{-}greedy(s_t)$

for (each step):

Execute $a_t$, observe $(r_{t+1}, s_{t+1})$

Choose $a_{t+1} = \epsilon\text{-}greedy(s_{t+1})$

$Q(s_t,a_t) = Q(s_t,a_t) + \alpha\left[r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\right]$

$s_t = s_{t+1},\ a_t = a_{t+1}$
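The update rule above turns into a single line of numpy once Q is stored as an (n_states, n_actions) table; a minimal sketch of just that step (the full agent in 1.3 wraps exactly this update):

```python
import numpy as np

n_states, n_actions = 48, 4          # CliffWalking: 4 x 12 grid, 4 actions
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

def sarsa_update(s, a, r, s_next, a_next, done):
    # The TD target bootstraps from the action that will actually be taken next (on-policy).
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```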

1.3 Full code

```python
import numpy as np
import gym
import time


class SarsaAgent:
    def __init__(self, obs_n, act_n, learning_rate=0.01, gamma=0.9, e_greed=0.1):
        self.act_n = act_n
        self.lr = learning_rate
        self.gamma = gamma
        self.epsilon = e_greed
        self.Q = np.zeros((obs_n, act_n))

    # epsilon-greedy: given s_t, pick a_t
    def sample(self, obs):
        if np.random.uniform(0, 1) < (1.0 - self.epsilon):
            action = self.predict(obs)
        else:
            action = np.random.choice(self.act_n)  # 0,1,2,3
        return action

    # a_t = argmax Q(s)
    def predict(self, obs):
        Q_list = self.Q[obs, :]                     # Q values of all actions in the current state
        maxQ = np.max(Q_list)
        action_list = np.where(Q_list == maxQ)[0]   # indices of all actions whose Q equals maxQ
        action = np.random.choice(action_list)
        return action

    def learn(self, obs, action, reward, next_obs, next_action, done):  # (S,A,R,S,A)
        '''
        done: whether the episode has ended
        '''
        predict_Q = self.Q[obs, action]
        if done:
            target_Q = reward
        else:
            target_Q = reward + self.gamma * self.Q[next_obs, next_action]
        # update the Q table
        self.Q[obs, action] += self.lr * (target_Q - predict_Q)

    def save(self):
        npy_file = './q_table.npy'
        np.save(npy_file, self.Q)
        print(npy_file + ' saved.')

    def load(self, npy_file='./q_table.npy'):
        self.Q = np.load(npy_file)
        print(npy_file + ' loaded.')


def run_episode(env, agent, render=False):
    total_steps = 0   # number of steps taken in this episode
    total_reward = 0
    obs = env.reset()
    action = agent.sample(obs)
    while True:
        next_obs, reward, done, _ = env.step(action)
        next_action = agent.sample(next_obs)
        agent.learn(obs, action, reward, next_obs, next_action, done)
        action = next_action
        obs = next_obs
        total_reward += reward
        total_steps += 1
        if render:
            env.render()
            time.sleep(0.)
        if done:
            break
    return total_reward, total_steps


def test_episode(env, agent):
    total_steps = 0   # number of steps taken in this episode
    total_reward = 0
    obs = env.reset()
    while True:
        action = agent.predict(obs)
        next_obs, reward, done, _ = env.step(action)
        total_reward += reward
        total_steps += 1
        obs = next_obs
        time.sleep(0.5)
        env.render()
        if done:
            break
    return total_reward, total_steps


def main():
    env = gym.make("CliffWalking-v0")
    agent = SarsaAgent(obs_n=env.observation_space.n,
                       act_n=env.action_space.n,
                       learning_rate=0.025, gamma=0.9, e_greed=0.1)
    for episode in range(1000):
        total_reward, total_steps = run_episode(env, agent, False)
        print('Episode %s: total_steps = %s , total_reward = %.1f' % (episode, total_steps, total_reward))
    test_episode(env, agent)


main()
```

1.4 Demo results

After training for 1000 episodes, the test episode reward is about $reward = -23$. This is close to, but below, the optimal -13: with ε-greedy exploration, Sarsa tends to settle on a safer path that keeps away from the cliff.


2. Q-Learning (Cliff Walking)

2.1 The CliffWalking-v0 environment

(See 1.1.)

2.2 The Q-Learning algorithm

(In terms of the actions it actually executes, Q-Learning behaves just like Sarsa, following the same ε-greedy behavior policy; the difference is that the policy it learns about is the greedy optimal policy, so it learns off-policy, while Sarsa learns the more conservative on-policy one.)

Algorithm parameters: step size $\alpha \in (0,1]$ and a small $\epsilon > 0$ (the two hyperparameters)

For all $(s,a)$, initialize $Q(s,a)$ arbitrarily; at the terminal state, $Q(s_{end},a) = 0$

for (each trajectory):

Initialize $s_t$

for (each step):

$a_t = \epsilon\text{-}greedy(s_t)$ (behavior policy)

Execute $a_t$, observe $(r_{t+1}, s_{t+1})$

$Q(s_t,a_t) = Q(s_t,a_t) + \alpha\left[r_{t+1} + \gamma\, \underset{a}{\max}\, Q(s_{t+1},a) - Q(s_t,a_t)\right]$

$s_t = s_{t+1}$
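Compared with Sarsa, the only change is the TD target. A minimal sketch of the two targets side by side, with stand-in values for the transition (the numbers are arbitrary and only for illustration):

```python
import numpy as np

Q = np.random.rand(48, 4)        # tabular Q, as in section 1
gamma = 0.9
s_next, a_next, r = 25, 1, -1.0  # one arbitrary transition, for illustration

# Sarsa (on-policy): bootstrap from the action that will actually be taken next
sarsa_target = r + gamma * Q[s_next, a_next]

# Q-Learning (off-policy): bootstrap from the greedy action in s_next
q_learning_target = r + gamma * np.max(Q[s_next, :])

print(sarsa_target, q_learning_target)
```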

2.3 Full code

```python
import numpy as np
import gym
import time


class QLearningAgent:
    def __init__(self, obs_n, act_n, learning_rate=1e-2, gamma=0.9, e_greed=0.1):
        self.act_n = act_n          # number of available actions
        self.lr = learning_rate     # learning rate
        self.gamma = gamma          # reward discount factor
        self.epsilon = e_greed      # probability of choosing a random action
        self.Q = np.zeros((obs_n, act_n))

    def sample(self, obs):
        if np.random.uniform(0, 1) < (1.0 - self.epsilon):  # choose an action based on the Q table
            action = self.predict(obs)
        else:
            action = np.random.choice(self.act_n)  # explore: choose a random action
        return action

    # given an observation, predict the greedy action
    def predict(self, obs):
        Q_list = self.Q[obs, :]
        maxQ = np.max(Q_list)
        action_list = np.where(Q_list == maxQ)[0]  # maxQ may correspond to several actions
        action = np.random.choice(action_list)
        return action

    def learn(self, obs, action, reward, next_obs, done):  # (S,A,R,S)
        predict_Q = self.Q[obs, action]
        if done:
            target_Q = reward
        else:
            target_Q = reward + self.gamma * np.max(self.Q[next_obs, :])
        self.Q[obs, action] += self.lr * (target_Q - predict_Q)

    def save(self):
        npy_file = './q_table.npy'
        np.save(npy_file, self.Q)
        print(npy_file + ' saved.')

    def load(self, npy_file='./q_table.npy'):
        self.Q = np.load(npy_file)
        print(npy_file + ' loaded.')


def run_episode(env, agent, render=False):
    # the executed policy is the same epsilon-greedy one as Sarsa;
    # only the learning target (max over actions) differs
    total_steps = 0
    total_reward = 0
    obs = env.reset()
    while True:
        action = agent.sample(obs)
        next_obs, reward, done, _ = env.step(action)
        agent.learn(obs, action, reward, next_obs, done)
        obs = next_obs
        total_reward += reward
        total_steps += 1
        if render:
            env.render()
        if done:
            break
    return total_reward, total_steps


def test_episode(env, agent):
    total_reward = 0
    obs = env.reset()
    while True:
        action = agent.predict(obs)  # greedy
        next_obs, reward, done, _ = env.step(action)
        total_reward += reward
        obs = next_obs
        time.sleep(0.5)
        env.render()
        if done:
            break
    return total_reward


def main():
    env = gym.make("CliffWalking-v0")  # 0 up, 1 right, 2 down, 3 left
    # create an agent instance with the hyperparameters
    agent = QLearningAgent(
        obs_n=env.observation_space.n,
        act_n=env.action_space.n,
        learning_rate=0.1,
        gamma=0.9,
        e_greed=0.1)
    # train for 500 episodes and print the score of each one
    for episode in range(500):
        ep_reward, ep_steps = run_episode(env, agent, False)
        print('Episode %s: steps = %s , reward = %.1f' % (episode, ep_steps, ep_reward))
    # after training, check how the learned policy performs
    test_reward = test_episode(env, agent)
    print('test reward = %.1f' % (test_reward))


main()
```

2.4 Demo results

(figure)

3. Policy Gradient (PG) (CartPole)

3.1 The CartPole-v1 environment

(Cart Pole - Gym Documentation (gymlibrary.dev))

A pole is attached by an un-actuated joint to a cart that moves along a frictionless track. The pole starts upright on the cart, and the goal is to balance it by applying forces to the left or right of the cart.

The inverted pendulum:

(figures)

  • **obs: (1,4)**

| Num | Observation | Min | Max |
| --- | --- | --- | --- |
| 0 | Cart Position | -4.8 | 4.8 |
| 1 | Cart Velocity | -Inf | Inf |
| 2 | Pole Angle | -0.418 rad | 0.418 rad |
| 3 | Pole Angular Velocity | -Inf | Inf |

  • **action: (1,2)** The action space is discrete:

| Num | Action |
| --- | --- |
| 0 | Push the cart to the left |
| 1 | Push the cart to the right |

  • reward: +1 for every timestep the pole stays up.
  • Termination: ① the pole angle exceeds 12°; ② the cart position satisfies |x| > 2.4; ③ the episode exceeds 500 steps.
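The observation and action spaces above can be checked directly from the environment; a small sketch:

```python
import gym

env = gym.make("CartPole-v1")
print(env.observation_space)   # Box of shape (4,): position, velocity, angle, angular velocity
print(env.action_space)        # Discrete(2): push left / push right
obs = env.reset()
print(obs.shape)               # (4,)
env.close()
```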

3.2 The PG algorithm (REINFORCE)

Input: a differentiable policy parameterization $\pi(a|s,\theta)$

Algorithm parameter: step size $\alpha > 0$

Initialize the policy parameters $\theta$

Loop (each trajectory):

Generate an episode $S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T$ following $\pi(\cdot|\cdot,\theta)$

Loop over each step of the episode, $t = 0, 1, \dots, T-1$:

$G = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$

$\theta = \theta + \alpha\, \gamma^t\, G\, \nabla \ln \pi(a_t|s_t,\theta)$
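The inner loop computes the discounted return $G_t$ for every step of one episode; a minimal sketch of that computation (the `learn` function in 3.3 does the same thing and then normalizes the returns):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_{k > t} gamma^(k-t-1) * R_k for every step of one episode."""
    returns = []
    G = 0.0
    for r in reversed(rewards):   # walk the episode backwards
        G = r + gamma * G
        returns.insert(0, G)
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.9))  # roughly [2.71, 1.9, 1.0]
```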

3.3 Full code

```python
import torch
import gym
import numpy as np
import torch.nn as nn
from torch.nn import Linear
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
import time

lr = 0.002
gamma = 0.8


class PGPolicy(nn.Module):
    def __init__(self, input_size=4, hidden_size=128, output_size=2):
        super(PGPolicy, self).__init__()
        self.fc1 = Linear(input_size, hidden_size)
        self.fc2 = Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(p=0.6)
        self.saved_log_probs = []  # log-probability of the action taken at each step
        self.rewards = []          # reward received at each step

    def forward(self, x):
        x = self.fc1(x)
        x = self.dropout(x)
        x = F.relu(x)
        x = self.fc2(x)
        out = F.softmax(x, dim=1)
        return out


def choose_action(state, policy):
    state = torch.from_numpy(state).float().unsqueeze(0)  # add a batch dimension at index 0
    probs = policy(state)
    m = Categorical(probs)  # categorical distribution over actions; m.sample() draws with these probabilities
    action = m.sample()
    policy.saved_log_probs.append(m.log_prob(action))
    return action.item()    # return a plain int


def learn(policy, optimizer):
    R = 0
    policy_loss = []
    returns = []
    for r in policy.rewards[::-1]:
        R = r + gamma * R
        returns.insert(0, R)  # insert at the front, i.e. build the list backwards
    returns = torch.tensor(returns)
    # normalize (zero mean, unit variance); eps is a tiny number that avoids division by zero
    eps = np.finfo(np.float64).eps.item()
    returns = (returns - returns.mean()) / (returns.std() + eps)
    for log_prob, R in zip(policy.saved_log_probs, returns):
        policy_loss.append(-log_prob * R)
    optimizer.zero_grad()
    policy_loss = torch.cat(policy_loss).sum()
    policy_loss.backward()
    optimizer.step()
    del policy.rewards[:]          # clear the episode buffers
    del policy.saved_log_probs[:]


def train(episode_num):
    env = gym.make('CartPole-v1')
    env.seed(1)
    torch.manual_seed(1)
    policy = PGPolicy()
    # policy.load_state_dict(torch.load('save_model.pt'))  # load a saved model
    optimizer = optim.Adam(policy.parameters(), lr)
    average_r = 0
    for i in range(1, episode_num + 1):  # collect this many trajectories
        obs = env.reset()
        ep_r = 0
        for t in range(1, 10000):
            action = choose_action(obs, policy)
            obs, reward, done, _ = env.step(action)
            policy.rewards.append(reward)
            ep_r += reward
            if done:
                break
        average_r = 0.05 * ep_r + (1 - 0.05) * average_r
        learn(policy, optimizer)
        if i % 10 == 0:
            print('Episode {}\tLast reward: {:.2f}\tAverage reward: {:.2f}'.format(i, ep_r, average_r))
    torch.save(policy.state_dict(), 'PGPolicy.pt')


def test():
    env = gym.make('CartPole-v1')
    env.seed(1)
    torch.manual_seed(1)
    policy = PGPolicy()
    policy.load_state_dict(torch.load('PGPolicy.pt'))  # load the trained model
    average_r = 0
    with torch.no_grad():
        obs = env.reset()
        ep_r = 0
        for t in range(1, 10000):
            action = choose_action(obs, policy)
            obs, reward, done, _ = env.step(action)
            policy.rewards.append(reward)
            env.render()
            time.sleep(0.1)
            ep_r += reward
            if done:
                break


train(1000)
# test()
```

3.4 Demo results

Training process:

(figures)

4. PPO (LunarLander)

4.1 The LunarLander-v2 environment

(This environment requires box2d.)

https://www.gymlibrary.dev/environments/box2d/lunar_lander/?highlight=lunarlander


  • **observation (1,8)**

| Num | Observation |
| --- | --- |
| 0 | x |
| 1 | y |
| 2 | $V_x$ |
| 3 | $V_y$ |
| 4 | angle |
| 5 | angular velocity |
| 6 | left leg in contact with the ground (bool) |
| 7 | right leg in contact with the ground (bool) |

  • **action (1,4)**

| Num | Action |
| --- | --- |
| 0 | do nothing |
| 1 | fire the left engine |
| 2 | fire the main (bottom) engine |
| 3 | fire the right engine |

  • reward: moving from the top of the screen to the landing pad is worth roughly 100 to 140 points; moving away from the pad loses that reward. Crashing gives an extra -100, coming to rest gives an extra +100, and each leg that touches the ground gives +10. Firing the main engine costs -0.3 per frame and each side engine -0.03 per frame. The environment counts as solved at 200 points.
  • Termination: the lander comes into contact with the moon, or |x| > 1.

4.2 The PPO-Clip algorithm

Initialize the policy parameters $\theta_0$ and the value-function parameters $\phi_0$

for k = 0, 1, 2, …

Collect a set of trajectories $D_k = \{\tau_k\}$ by running the policy $\pi(\theta_k)$

Compute the rewards-to-go $R_t$

Compute the advantage estimates $A_t$

Update the policy:

$\theta_{k+1}=\underset{\theta}{\arg\max}\ \frac{1}{|D_k|T}\underset{\tau}{\sum}\underset{t}{\sum}\min\!\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}A(s_t,a_t),\ g\big(\epsilon, A(s_t,a_t)\big)\right)$

Update the value function:

$\phi_{k+1}=\underset{\phi}{\arg\min}\ \frac{1}{|D_k|T}\underset{\tau}{\sum}\underset{t}{\sum}\big(V_\phi(s_t)-R_t\big)^2$
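The policy update above is the clipped surrogate objective. A minimal torch sketch of just that term, with random tensors standing in for real log-probabilities and advantages (the `update` method in 4.3 embeds the same expression):

```python
import torch

eps_clip = 0.2
# stand-ins for one batch: log pi_theta(a|s), log pi_theta_old(a|s), advantage estimates
logprobs = torch.randn(64)
old_logprobs = torch.randn(64)
advantages = torch.randn(64)

ratios = torch.exp(logprobs - old_logprobs)                      # pi_theta / pi_theta_old
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1 - eps_clip, 1 + eps_clip) * advantages
policy_loss = -torch.min(surr1, surr2).mean()                    # maximize => minimize the negative
print(policy_loss)
```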

4.3 Full code

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical
import gym

device = 'cpu'


class Memory:
    def __init__(self):
        self.actions = []
        self.states = []
        self.logprobs = []
        self.rewards = []
        self.is_terminals = []

    def clear_memory(self):
        del self.actions[:]
        del self.states[:]
        del self.logprobs[:]
        del self.rewards[:]
        del self.is_terminals[:]


class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim, n_latent_var):
        super(ActorCritic, self).__init__()
        # actor
        self.action_layer = nn.Sequential(
            nn.Linear(state_dim, n_latent_var),
            nn.Tanh(),
            nn.Linear(n_latent_var, n_latent_var),
            nn.Tanh(),
            nn.Linear(n_latent_var, action_dim),
            nn.Softmax(dim=-1))
        # critic
        self.value_layer = nn.Sequential(
            nn.Linear(state_dim, n_latent_var),
            nn.Tanh(),
            nn.Linear(n_latent_var, n_latent_var),
            nn.Tanh(),
            nn.Linear(n_latent_var, 1))

    def forward(self):
        # raises an error if called without being overridden in a subclass
        raise NotImplementedError

    def act(self, state, memory):
        state = torch.from_numpy(state).float().to(device)
        action_probs = self.action_layer(state)
        dist = Categorical(action_probs)
        action = dist.sample()
        memory.states.append(state)
        memory.actions.append(action)
        memory.logprobs.append(dist.log_prob(action))
        return action.item()

    def evaluate(self, state, action):
        action_probs = self.action_layer(state)
        dist = Categorical(action_probs)
        action_logprobs = dist.log_prob(action)
        dist_entropy = dist.entropy()
        state_value = self.value_layer(state)
        return action_logprobs, torch.squeeze(state_value), dist_entropy


class PPO:
    def __init__(self, state_dim, action_dim, n_latent_var, lr, betas, gamma, K_epochs, eps_clip):
        self.lr = lr
        self.betas = betas
        self.gamma = gamma
        self.eps_clip = eps_clip
        self.K_epochs = K_epochs
        self.policy = ActorCritic(state_dim, action_dim, n_latent_var).to(device)
        self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr, betas=betas)
        self.policy_old = ActorCritic(state_dim, action_dim, n_latent_var).to(device)
        self.policy_old.load_state_dict(self.policy.state_dict())
        self.MseLoss = nn.MSELoss()

    def update(self, memory):
        # Monte Carlo estimate of state rewards:
        rewards = []
        discounted_reward = 0
        for reward, is_terminal in zip(reversed(memory.rewards), reversed(memory.is_terminals)):
            if is_terminal:
                discounted_reward = 0
            discounted_reward = reward + (self.gamma * discounted_reward)
            rewards.insert(0, discounted_reward)
        # Normalizing the rewards:
        rewards = torch.tensor(rewards).to(device).to(torch.float32)
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5)
        # convert list to tensor
        old_states = torch.stack(memory.states).to(device).detach().to(torch.float32)
        old_actions = torch.stack(memory.actions).to(device).detach().to(torch.float32)
        old_logprobs = torch.stack(memory.logprobs).to(device).detach().to(torch.float32)
        # Optimize policy for K epochs:
        for _ in range(self.K_epochs):
            # Evaluating old actions and values:
            logprobs, state_values, dist_entropy = self.policy.evaluate(old_states, old_actions)
            # Finding the ratio (pi_theta / pi_theta__old):
            ratios = torch.exp(logprobs - old_logprobs.detach())
            # Finding Surrogate Loss:
            advantages = rewards - state_values.detach()
            surr1 = ratios * advantages
            surr2 = torch.clamp(ratios, 1 - self.eps_clip, 1 + self.eps_clip) * advantages
            loss = -torch.min(surr1, surr2) + 0.5 * self.MseLoss(state_values, rewards) - 0.01 * dist_entropy
            loss = loss.to(torch.float32)
            # take gradient step
            self.optimizer.zero_grad()
            loss.mean().backward()
            self.optimizer.step()
        # Copy new weights into old policy:
        self.policy_old.load_state_dict(self.policy.state_dict())


def main():
    ############## Hyperparameters ##############
    env_name = 'LunarLander-v2'
    # creating environment
    env = gym.make(env_name)
    env = env.unwrapped
    state_dim = env.observation_space.shape[0]
    action_dim = 4
    render = False
    solved_reward = 200      # stop training if avg_reward > solved_reward
    log_interval = 20        # print avg reward in the interval
    max_episodes = 5000      # max training episodes
    max_timesteps = 1000     # max timesteps in one episode
    n_latent_var = 64        # number of variables in hidden layer
    update_timestep = 2000   # update policy every n timesteps
    lr = 0.002
    betas = (0.9, 0.999)
    gamma = 0.99             # discount factor
    K_epochs = 4             # update policy using 1 trajectory for K epochs
    eps_clip = 0.2           # clip parameter for PPO
    random_seed = 123
    #############################################
    if random_seed:
        torch.manual_seed(random_seed)
        env.seed(random_seed)
    memory = Memory()
    ppo = PPO(state_dim, action_dim, n_latent_var, lr, betas, gamma, K_epochs, eps_clip)
    print(lr, betas)
    # logging variables
    running_reward = 0
    avg_length = 0
    timestep = 0
    # training loop
    for i_episode in range(1, max_episodes + 1):
        state = env.reset()
        for t in range(max_timesteps):
            timestep += 1
            # Running policy_old:
            action = ppo.policy_old.act(state, memory)
            state, reward, done, _ = env.step(action)
            # Saving reward and is_terminal:
            memory.rewards.append(reward)
            memory.is_terminals.append(done)
            # update if its time
            if timestep % update_timestep == 0:
                ppo.update(memory)
                memory.clear_memory()
                timestep = 0
            running_reward += reward
            if render:
                env.render()
            if done:
                break
        avg_length += t
        # stop training if avg_reward > solved_reward
        if running_reward > (log_interval * solved_reward):
            print("########## Solved! ##########")
            torch.save(ppo.policy.state_dict(), './PPO_{}_{}.pth'.format(env_name, lr))
            break
        # logging
        if i_episode % log_interval == 0:
            avg_length = int(avg_length / log_interval)
            running_reward = int((running_reward / log_interval))
            print('Episode {} \t avg length: {} \t reward: {}'.format(i_episode, avg_length, running_reward))
            running_reward = 0
            avg_length = 0
        if i_episode % 2000 == 0:
            torch.save(ppo.policy.state_dict(), './PPO_{}_{}.pth'.format(env_name, lr))


def test():
    ############## Hyperparameters ##############
    env_name = "LunarLander-v2"
    # creating environment
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = 4
    render = False
    max_timesteps = 500
    n_latent_var = 64        # number of variables in hidden layer
    lr = 0.0002
    betas = (0.9, 0.999)
    gamma = 0.99             # discount factor
    K_epochs = 4             # update policy for K epochs
    eps_clip = 0.2           # clip parameter for PPO
    #############################################
    n_episodes = 3
    max_timesteps = 300
    render = True
    save_gif = False
    filename = "PPO_{}_0.002.pth".format(env_name)
    directory = "./"
    memory = Memory()
    ppo = PPO(state_dim, action_dim, n_latent_var, lr, betas, gamma, K_epochs, eps_clip)
    ppo.policy_old.load_state_dict(torch.load(directory + filename))
    for ep in range(1, n_episodes + 1):
        ep_reward = 0
        state = env.reset()
        for t in range(max_timesteps):
            action = ppo.policy_old.act(state, memory)
            state, reward, done, _ = env.step(action)
            ep_reward += reward
            if render:
                env.render()
            if done:
                break
        print('Episode: {}\tReward: {}'.format(ep, int(ep_reward)))
        ep_reward = 0
    env.close()


if __name__ == '__main__':
    main()
    # test()
```

4.4 Demo results

(figures)

5. DQN (Breakout)

5.1 The Breakout-v0 environment

Breakout - Gym Documentation (gymlibrary.dev)


  • observation: (210, 160, 3) RGB frames
  • action (1,4)

| Num | Action |
| --- | --- |
| 0 | NOOP |
| 1 | FIRE |
| 2 | RIGHT |
| 3 | LEFT |

  • reward: points are awarded for destroying bricks; see the score table in the Gym documentation (the original figure is omitted here).
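The raw 210x160 RGB frames are too large to feed into the Q-network directly, so, as `get_next_state` in 5.3 does with PIL, each frame is resized to 84x84 grayscale and the last 4 frames are stacked. A rough sketch of the same idea using OpenCV (the helper names here are made up for illustration):

```python
import cv2
import numpy as np

def preprocess(frame):
    """210x160x3 RGB frame -> 84x84 grayscale uint8."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

def stack_frames(stack, frame):
    """Keep a rolling stack of the 4 most recent preprocessed frames, shape (4, 84, 84)."""
    if stack is None:
        return np.stack([preprocess(frame)] * 4, axis=0)
    return np.append(stack[1:], [preprocess(frame)], axis=0)

dummy = np.zeros((210, 160, 3), dtype=np.uint8)   # stand-in frame
state = stack_frames(None, dummy)
print(state.shape)   # (4, 84, 84)
```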

5.2 The DQN algorithm

(DQN with an experience replay buffer)

Initialize the replay buffer $D$ with capacity $N$

Randomly initialize the action-value function $Q$

for (each episode):

Initialize the sequence $s_1 = [x_1]$ and its preprocessed version $\phi_1 = \phi(s_1)$

for (each step):

With probability $1-\epsilon$, choose $a_t = \underset{a}{\arg\max}\, Q(\phi(s_t), a; \theta)$; otherwise choose a random action

Execute $a_t$, observe the reward $r_t$ and the next frame $x_{t+1}$

Set $s_{t+1} = (s_t, a_t, x_{t+1})$ and $\phi_{t+1} = \phi(s_{t+1})$

Store the transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in $D$

Sample a random minibatch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from $D$

$y_j = \left\{\begin{matrix} r_j & (\text{terminal } \phi_{j+1})\\ r_j + \gamma\, \underset{a'}{\max}\, Q(\phi_{j+1}, a'; \theta) & (\text{non-terminal } \phi_{j+1}) \end{matrix}\right.$

Perform a gradient-descent step on $\big(y_j - Q(\phi_j, a_j; \theta)\big)^2$
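The two cases of $y_j$ are usually written as one vectorized expression using a terminal mask; a minimal torch sketch with stand-in tensors (the training loop in 5.3 computes its `tq` target the same way):

```python
import torch

gamma = 0.99
batch = 32
rewards = torch.randn(batch, 1)                       # r_j, stand-in values
dones = torch.randint(0, 2, (batch, 1)).float()       # 1 if phi_{j+1} is terminal
q_next = torch.randn(batch, 1)                        # max_a' Q(phi_{j+1}, a'; theta) from the target net

# (1 - dones) zeroes out the bootstrap term exactly in the terminal case
targets = rewards + gamma * (1 - dones) * q_next
print(targets.shape)   # torch.Size([32, 1])
```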

5.3 Full code

```python
import gym
import cv2
import torch
import numpy as np
import torch.nn as nn
import pandas as pd
from torch.nn import Linear, Conv2d, ReLU
import PIL.Image as Image

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


# experience replay buffer
class DQBReplayer:
    def __init__(self, capacity):  # stores (S,A,R,S,done)
        self.memory = pd.DataFrame(index=range(capacity),
                                   columns=['observation', 'action', 'reward', 'next_observation', 'done'])
        self.i = 0
        self.count = 0
        self.capacity = capacity

    def store(self, *args):
        self.memory.loc[self.i] = args
        self.i = (self.i + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)

    def sample(self, size):
        indics = np.random.choice(self.count, size=size)
        return (np.stack(self.memory.loc[indics, field]) for field in self.memory.columns)


# Q-Network
class DQN_net(nn.Module):
    def __init__(self):
        super(DQN_net, self).__init__()
        self.conv = nn.Sequential(
            Conv2d(in_channels=4, out_channels=32, kernel_size=8, stride=4),
            ReLU(),
            Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2),
            ReLU(),
            Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1),
            ReLU())
        self.classifier = nn.Sequential(
            Linear(3136, 512),
            ReLU(),
            Linear(512, 4))

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        output = self.classifier(x)
        return output


class DQN(nn.Module):
    def __init__(self, input_shape, env):
        super(DQN, self).__init__()
        self.replayer_start_size = 100000
        self.upon_times = 20
        self.replayer = DQBReplayer(capacity=self.replayer_start_size)
        self.action_n = env.action_space.n
        self.image_stack = input_shape[2]
        self.gamma = 0.99
        self.image_shape = (input_shape[0], input_shape[1])
        self.e_net = DQN_net()   # online network
        self.t_net = DQN_net()   # target network
        self.learn_step = 0
        self.max_learn_step = 650000
        self.epsilon = 1.
        self.start_learn = False

    def get_next_state(self, state=None, observation=None):
        # convert the raw RGB frame to an 84x84 grayscale image and stack it with the previous frames
        img = Image.fromarray(observation, "RGB")
        img = img.resize(self.image_shape).convert('L')
        img = np.asarray(img.getdata(), dtype=np.uint8).reshape(img.size[1], img.size[0])
        if state is None:
            next_state = np.array([img, ] * self.image_stack)
        else:
            next_state = np.append(state[1:], [img, ], axis=0)
        return next_state

    def decide(self, state, step):
        if not self.start_learn:  # act randomly for the first 50000 steps
            action = np.random.randint(0, 4)
            return action
        else:
            self.epsilon -= 0.0000053
            # the first 30 steps of every game are random; for the middle ~300k steps the
            # network chooses with probability (1 - epsilon); for the last ~300k steps it
            # chooses through the network with probability ~0.999
            if step < 30:
                action = np.random.randint(0, 4)
            elif np.random.random() < max(self.epsilon, 0.0005):
                action = np.random.randint(0, 4)
            else:
                state = state / 128 - 1
                y = torch.Tensor(state).float().unsqueeze(0)
                y = y.to(device)
                x = self.e_net(y).detach()
                if self.learn_step % 2000 == 0:
                    print("q value{}".format(x))
                action = torch.argmax(x).data.item()
            return action


def main():
    sum_reward = 0
    store_count = 0
    env = gym.make('Breakout-v0')
    net = DQN([84, 84, 4], env).cuda()
    Load_Net = 0
    if Load_Net == 1:
        load_net_path = './epsiode_2575_reward_10.0.pkl'
        print("Load old net and the path is:", load_net_path)
        net.e_net = torch.load(load_net_path)
        net.t_net = torch.load(load_net_path)
    max_score = 0
    mse = nn.MSELoss()
    mse = mse.cuda()
    opt = torch.optim.RMSprop(net.e_net.parameters(), lr=0.0015)
    for i in range(20000):
        lives = 5
        obs = env.reset()
        state = net.get_next_state(None, obs)
        epoch_reward = 0
        if i % 100 == 0:
            # note: epoch_reward was just reset, so this prints 0 at the start of the game
            print("{} times_game".format(i), end=':')
            print('epoch_reward:{}'.format(epoch_reward))
        for step in range(500000):
            action = net.decide(state, step=step)
            obs, reward, done, _ = env.step(action)
            next_state = net.get_next_state(state, obs)
            epoch_reward += reward
            net.replayer.store(state, action, reward, next_state, done)
            net.learn_step += 1
            if net.learn_step >= net.replayer_start_size // 2 and net.learn_step % 4 == 0:
                if not net.start_learn:
                    net.start_learn = True
                    print('Start Learn!')
                sample_n = 32
                states, actions, rewards, next_states, dones = net.replayer.sample(sample_n)
                states, next_states = states / 128 - 1, next_states / 128 - 1
                rewards = torch.Tensor(np.clip(rewards, -1, 1)).unsqueeze(1).cuda()
                states, next_states = torch.Tensor(states).cuda(), torch.Tensor(next_states).cuda()
                actions = torch.Tensor(actions).long().unsqueeze(1).cuda()
                dones = torch.Tensor(dones).unsqueeze(1).cuda()
                q = net.e_net(states).gather(1, actions)
                q_next = net.t_net(next_states).detach().max(1)[0].reshape(sample_n, 1)
                tq = rewards + net.gamma * (1 - dones) * q_next
                loss = mse(q, tq)
                opt.zero_grad()
                loss.backward()
                opt.step()
                if net.learn_step % (net.upon_times * 5) == 0:
                    net.t_net.load_state_dict(net.e_net.state_dict())  # sync the target network
                if net.learn_step % 100 == 0:
                    loss_record = loss.item()
                    a_r = torch.mean(rewards, 0).item()
            state = next_state
            if done:
                save_net_path = './'
                sum_reward += epoch_reward
                if epoch_reward > max_score:
                    name = "epsiode_" + str(net.learn_step) + "_reward_" + str(epoch_reward) + ".pkl"
                    torch.save(net.e_net, save_net_path + name)
                    max_score = epoch_reward
                elif i % 1000 == 0:
                    name = "No." + str(i) + ".pkl"
                    torch.save(net.e_net, save_net_path + name)
                if i % 10 == 0:
                    sum_reward = 0
                break


def PictureArray2Video(pic_list, path='./test.mp4'):
    h, w, _ = pic_list[0].shape[0], pic_list[0].shape[1], pic_list[0].shape[2]
    print(h, w)
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc('m', 'p', '4', 'v'), 10, (w, h), True)
    total_frame = len(pic_list)
    for i in range(total_frame):
        writer.write(pic_list[i])
    writer.release()


def test():
    pics = []
    sum_reward = 0
    store_count = 0
    env = gym.make('Breakout-v0')
    net = DQN([84, 84, 4], env).cuda()
    Load_Net = 1
    if Load_Net == 1:
        load_net_path = './epsiode_10219_reward_9.0.pkl'
        print("Load old net and the path is:", load_net_path)
        net.e_net = torch.load(load_net_path)
        net.t_net = torch.load(load_net_path)
    max_score = 0
    mse = nn.MSELoss()
    mse = mse.cuda()
    obs = env.reset()
    state = net.get_next_state(None, obs)
    epoch_reward = 0
    for step in range(500000):
        action = net.decide(state, step=step)
        obs, reward, done, _ = env.step(action)
        pic = env.render(mode='rgb_array')
        pic = cv2.cvtColor(pic, cv2.COLOR_BGR2RGB)
        next_state = net.get_next_state(state, obs)
        pics.append(pic)
        state = next_state
        if done:
            PictureArray2Video(pics)
            break


main()
# test()
```

5.4 Demo results

This one seems to need a long time to train. I trained for about two hours, reached reward = 11, and stopped there.


6. DDPG (Pendulum)

6.1 The Pendulum-v1 environment

https://www.gymlibrary.dev/environments/classic_control/pendulum/?highlight=pendulum+v1

  • observation (1,3)

| Num | Observation | Min | Max |
| --- | --- | --- | --- |
| 0 | cos(theta) | -1 | 1 |
| 1 | sin(theta) | -1 | 1 |
| 2 | angular velocity | -8.0 | 8.0 |

  • action (1,): the torque applied to the pendulum, a value between -2 and 2
  • reward: $r = -(\theta^2 + 0.1\times\omega^2 + 0.001\times\text{torque}^2)$

6.2 The DDPG algorithm

Randomly initialize the critic $Q(s,a|\theta^Q)$ and the actor $\mu(s|\theta^\mu)$

Initialize the target networks $Q'$ and $\mu'$ with $\theta^{Q'} = \theta^Q$ and $\theta^{\mu'} = \theta^\mu$

Initialize the replay buffer *R*

for (each episode):

for (each step):

$a_t = \mu(s_t|\theta^\mu)$ (plus exploration noise)

$s_{t+1}, r_t, done, \_ = env.step(a_t)$

Store $(s_t, a_t, r_t, s_{t+1})$ in *R*

Sample a minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from *R*

$y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})\,\big|\,\theta^{Q'}\big)$

$Loss = \frac{1}{N}\Sigma\big(y_i - Q(s_i,a_i|\theta^Q)\big)^2$, used to update the critic network

$\nabla_{\theta^\mu} J \approx \frac{1}{N}\Sigma\, \nabla_a Q(s,a|\theta^Q)\big|_{s=s_i,\,a=\mu(s_i)}\, \nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s_i}$, used to update the actor network

Update the target networks:

$\theta^{Q'} = \tau\,\theta^Q + (1-\tau)\,\theta^{Q'}$

$\theta^{\mu'} = \tau\,\theta^\mu + (1-\tau)\,\theta^{\mu'}$
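The soft (Polyak) target update at the end of the loop can be written directly over the network parameters; a minimal torch sketch, which does the same thing as, but avoids, the string-`eval` trick used in the code in 6.3:

```python
import torch
import torch.nn as nn

tau = 0.01

def soft_update(target: nn.Module, source: nn.Module, tau: float):
    """theta_target <- tau * theta_source + (1 - tau) * theta_target."""
    with torch.no_grad():
        for t_param, s_param in zip(target.parameters(), source.parameters()):
            t_param.data.mul_(1 - tau).add_(tau * s_param.data)

# usage: two networks with identical architecture
actor, actor_target = nn.Linear(3, 1), nn.Linear(3, 1)
actor_target.load_state_dict(actor.state_dict())
soft_update(actor_target, actor, tau)
```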

6.3 Full code

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import gym
import time

##################### hyper parameters ####################
EPISODES = 200
EP_STEPS = 200
LR_ACTOR = 0.001
LR_CRITIC = 0.002
GAMMA = 0.9
TAU = 0.01
MEMORY_CAPACITY = 10000
BATCH_SIZE = 32
RENDER = False
ENV_NAME = 'Pendulum-v1'


########################## DDPG Framework ######################
class ActorNet(nn.Module):  # define the network structure for the actor
    def __init__(self, s_dim, a_dim):
        super(ActorNet, self).__init__()
        self.fc1 = nn.Linear(s_dim, 30)
        self.fc1.weight.data.normal_(0, 0.1)  # initialization of FC1
        self.out = nn.Linear(30, a_dim)
        self.out.weight.data.normal_(0, 0.1)  # initialization of OUT

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        x = self.out(x)
        x = torch.tanh(x)
        actions = x * 2  # for Pendulum, the action range is [-2, 2]
        return actions


class CriticNet(nn.Module):
    def __init__(self, s_dim, a_dim):
        super(CriticNet, self).__init__()
        self.fcs = nn.Linear(s_dim, 30)
        self.fcs.weight.data.normal_(0, 0.1)
        self.fca = nn.Linear(a_dim, 30)
        self.fca.weight.data.normal_(0, 0.1)
        self.out = nn.Linear(30, 1)
        self.out.weight.data.normal_(0, 0.1)

    def forward(self, s, a):
        x = self.fcs(s)
        y = self.fca(a)
        actions_value = self.out(F.relu(x + y))
        return actions_value


class DDPG(object):
    def __init__(self, a_dim, s_dim, a_bound):
        self.a_dim, self.s_dim, self.a_bound = a_dim, s_dim, a_bound
        self.memory = np.zeros((MEMORY_CAPACITY, s_dim * 2 + a_dim + 1), dtype=np.float32)
        self.pointer = 0  # serves as updating the memory data
        # Create the 4 network objects
        self.actor_eval = ActorNet(s_dim, a_dim)
        self.actor_target = ActorNet(s_dim, a_dim)
        self.critic_eval = CriticNet(s_dim, a_dim)
        self.critic_target = CriticNet(s_dim, a_dim)
        # create 2 optimizers for actor and critic
        self.actor_optimizer = torch.optim.Adam(self.actor_eval.parameters(), lr=LR_ACTOR)
        self.critic_optimizer = torch.optim.Adam(self.critic_eval.parameters(), lr=LR_CRITIC)
        # Define the loss function for critic network update
        self.loss_func = nn.MSELoss()

    def store_transition(self, s, a, r, s_):  # how to store the episodic data to buffer
        transition = np.hstack((s, a, [r], s_))
        index = self.pointer % MEMORY_CAPACITY  # replace the old data with new data
        self.memory[index, :] = transition
        self.pointer += 1

    def choose_action(self, s):
        s = torch.unsqueeze(torch.FloatTensor(s), 0)
        return self.actor_eval(s)[0].detach()

    def learn(self):
        # softly update the target networks
        # (relies on the fact that state_dict keys such as 'fc1.weight' match the attribute paths)
        for x in self.actor_target.state_dict().keys():
            eval('self.actor_target.' + x + '.data.mul_((1-TAU))')
            eval('self.actor_target.' + x + '.data.add_(TAU*self.actor_eval.' + x + '.data)')
        for x in self.critic_target.state_dict().keys():
            eval('self.critic_target.' + x + '.data.mul_((1-TAU))')
            eval('self.critic_target.' + x + '.data.add_(TAU*self.critic_eval.' + x + '.data)')
        # sample a mini-batch from the buffer
        indices = np.random.choice(MEMORY_CAPACITY, size=BATCH_SIZE)
        batch_trans = self.memory[indices, :]
        # extract data from mini-batch of transitions including s, a, r, s_
        batch_s = torch.FloatTensor(batch_trans[:, :self.s_dim])
        batch_a = torch.FloatTensor(batch_trans[:, self.s_dim:self.s_dim + self.a_dim])
        batch_r = torch.FloatTensor(batch_trans[:, -self.s_dim - 1:-self.s_dim])
        batch_s_ = torch.FloatTensor(batch_trans[:, -self.s_dim:])
        # make action and evaluate its action values
        a = self.actor_eval(batch_s)
        q = self.critic_eval(batch_s, a)
        actor_loss = -torch.mean(q)
        # optimize the loss of actor network
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()
        # compute the target Q value using the information of next state
        a_target = self.actor_target(batch_s_)
        q_tmp = self.critic_target(batch_s_, a_target)
        q_target = batch_r + GAMMA * q_tmp
        # compute the current q value and the loss
        q_eval = self.critic_eval(batch_s, batch_a)
        td_error = self.loss_func(q_target, q_eval)
        # optimize the loss of critic network
        self.critic_optimizer.zero_grad()
        td_error.backward()
        self.critic_optimizer.step()


############################### Training ######################################
def main():
    global RENDER
    # Define the env in gym
    env = gym.make(ENV_NAME)
    env = env.unwrapped
    env.seed(1)
    s_dim = env.observation_space.shape[0]
    a_dim = env.action_space.shape[0]
    a_bound = env.action_space.high
    a_low_bound = env.action_space.low
    ddpg = DDPG(a_dim, s_dim, a_bound)
    var = 3  # the controller of exploration which will decay during training process
    t1 = time.time()
    for i in range(EPISODES):
        s = env.reset()
        ep_r = 0
        for j in range(EP_STEPS):
            if RENDER:
                env.render()
            # add explorative noise to action
            a = ddpg.choose_action(s)
            a = np.clip(np.random.normal(a, var), a_low_bound, a_bound)
            # with gym 0.24 step returns 4 values; newer gym versions return 5 (terminated/truncated)
            s_, r, done, info = env.step(a)
            ddpg.store_transition(s, a, r / 10, s_)  # store the transition to memory
            if ddpg.pointer > MEMORY_CAPACITY:
                var *= 0.9995  # decay the exploration controller factor
                ddpg.learn()
            s = s_
            ep_r += r
            if j == EP_STEPS - 1:
                print('Episode: ', i, ' Reward: %i' % (ep_r), 'Explore: %.2f' % var)
                if ep_r > -300:
                    RENDER = True
                break
    print('Running time: ', time.time() - t1)
    env.close()


if __name__ == "__main__":
    main()
```

6.4 Demo results

(figure)


Reposted from: https://blog.csdn.net/weixin_45696231/article/details/126723207
Copyright belongs to the original author, Promethe_us. If there is any infringement, please contact us and it will be removed.
