Continuous DDPG doesn't seem to converge on a two-dimensional spatial search problem ("Hunt the Thimble")

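Results

After running the algorithm for about 400 episodes in the MLP case and 700 episodes in the custom-policy case, with 1000 steps per episode, it doesn't seem to have learned anything useful. The average return during the test runs did not increase, and when I checked the behavior for three different starting positions, the agent always walked toward the (0, 1) corner of the area; even when it started right next to the target position, it walked past it and on toward the (0, 1) corner. I also noticed that the agent with the custom policy architecture produced a much smaller standard deviation of the test-episode returns.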

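Question

I would like to know why the algorithm doesn't learn anything for the given setup, and what needs to be changed to make it converge. I could not trace the failure to a conceptual issue with the problem itself, and I can't identify its source, so I would be glad if someone could help.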

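Average test return (custom policy architecture):

[Plot: average test return (custom policy architecture)]

(Vertical bars indicate the standard deviation of the test-episode returns.)

Average test return (MLP policy architecture):

[Plot: average test return (MLP policy architecture)]

Test cases (custom policy architecture):

Test cases (MLP policy architecture):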

Code

    import logging
    import os

    import gym
    from gym.wrappers.time_limit import TimeLimit
    import numpy as np
    from spinup.algos.ddpg.ddpg import core, ddpg
    import tensorflow as tf


    class TestEnv(gym.Env):
        target = np.array([0.7, 0.8])
        action_limit = 0.01
        observation_space = gym.spaces.Box(low=np.zeros(2), high=np.ones(2), dtype=np.float32)
        action_space = gym.spaces.Box(-action_limit * np.ones(2), action_limit * np.ones(2), dtype=np.float32)

        def __init__(self):
            super().__init__()
            self.pos = np.empty(2, dtype=np.float32)
            self.reset()

        def step(self, action):
            self.pos += action
            self.pos = np.clip(self.pos, self.observation_space.low, self.observation_space.high)
            reward_ctrl = -np.square(action).sum() / self.action_limit**2
            reward_dist = -np.linalg.norm(self.pos - self.target)
            reward = reward_ctrl + reward_dist
            done = abs(reward_dist) < 1e-9
            logging.debug('Observation: %s', self.pos)
            logging.debug('Reward: %.6f (reward (ctrl): %.6f, reward (dist): %.6f)', reward, reward_ctrl, reward_dist)
            return self.pos, reward, done, {}

        def reset(self):
            self.pos[:] = np.random.uniform(self.observation_space.low, self.observation_space.high, size=2)
            logging.info(f'[Reset] New position: {self.pos}')
            return self.pos

        def render(self, *args, **kwargs):
            pass


    def mlp_actor_critic(x, a, hidden_sizes, activation=tf.nn.relu, action_space=None):
        act_dim = a.shape.as_list()[-1]
        act_limit = action_space.high[0]
        with tf.variable_scope('pi'):
            # pi = core.mlp(x, list(hidden_sizes)+[act_dim], activation, output_activation=None)  # The standard way.
            pi = tf.layers.dense(x, act_dim, use_bias=True)  # Target position should be learned via the bias term.
            pi = pi / (tf.norm(pi) + 1e-9) * act_limit  # Prevent division by zero.
        with tf.variable_scope('q'):
            q = tf.squeeze(core.mlp(tf.concat([x,a], axis=-1), list(hidden_sizes)+[1], activation, None), axis=1)
        with tf.variable_scope('q', reuse=True):
            q_pi = tf.squeeze(core.mlp(tf.concat([x,pi], axis=-1), list(hidden_sizes)+[1], activation, None), axis=1)
        return pi, q, q_pi


    if __name__ == '__main__':
        log_dir = 'spinup-ddpg'
        if not os.path.exists(log_dir):
            os.mkdir(log_dir)
        logging.basicConfig(level=logging.INFO)
        ep_length = 1000
        ddpg(
            lambda: TimeLimit(TestEnv(), ep_length),
            mlp_actor_critic,
            ac_kwargs=dict(hidden_sizes=(64, 64, 64)),
            steps_per_epoch=ep_length,
            epochs=1_000,
            replay_size=1_000_000,
            start_steps=10_000,
            act_noise=TestEnv.action_limit/2,
            gamma=0.99,  # Use large gamma, because of action limit it matters where we walk to early in the episode.
            polyak=0.995,
            max_ep_len=ep_length,
            save_freq=10,
            logger_kwargs=dict(output_dir=log_dir)
        )
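For readers unfamiliar with the `polyak=0.995` argument passed to `ddpg` above: in Spinning Up's DDPG it is, as far as I understand, the interpolation factor for the soft target-network update, target <- polyak * target + (1 - polyak) * main. The following is only a NumPy sketch of that rule, not the library's code; `polyak_update` and the toy parameter lists are hypothetical names introduced for illustration.

```python
import numpy as np


def polyak_update(target_params, main_params, polyak=0.995):
    """Soft update: move each target parameter a fraction (1 - polyak) toward the main one."""
    return [polyak * t + (1.0 - polyak) * m for t, m in zip(target_params, main_params)]


# Hypothetical toy parameters standing in for network weights.
main = [np.ones(3)]
target = [np.zeros(3)]

# Apply 1000 soft updates while the main parameters stay fixed.
for _ in range(1000):
    target = polyak_update(target, main, polyak=0.995)

# Remaining distance between target and main parameters.
gap = float(np.abs(main[0] - target[0]).max())
print(gap)
```

With the main parameters held fixed, the gap shrinks by a factor of 0.995 per update, so after 1000 updates less than 1% of the initial gap remains. Values of `polyak` closer to 1 make the target networks track the main networks more slowly.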

1 answer

User

Reply #1 · posted 2024-04-26 09:22:28

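You are using a huge network (64x64x64) for a very small problem. That alone can be a big issue. You are also keeping 1 million samples in the replay memory; for a very simple problem this can also be harmful and slow down convergence. Try a simpler setup first (a 32x32 network and a memory of 100,000 samples, or even a linear approximator with polynomial features). Also, how do you update your target network? What is polyak? Finally, normalizing the action like that is probably not a good idea. It is better to just clip it or to use a tanh layer at the end.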
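To make the last point concrete, here is a small NumPy sketch comparing ways of bounding a raw policy output. It is an illustration under stated assumptions, not the asker's or the library's code: the `raw` batch is made up, and `np.linalg.norm` stands in for `tf.norm`, which (as far as I can tell) also reduces over all elements of the tensor when no `axis` is given.

```python
import numpy as np

act_limit = 0.01  # same value as TestEnv.action_limit in the question

# Hypothetical raw (unbounded) policy outputs for a batch of three states.
raw = np.array([[0.5, -0.2],
                [0.001, 0.002],
                [-3.0, 4.0]])

# (a) One norm over the whole batch (no axis argument): every action is
#     divided by the same scalar, so actions in a batch influence each other.
batch_normed = raw / (np.linalg.norm(raw) + 1e-9) * act_limit

# (b) Per-sample normalization: every action gets length act_limit, so the
#     policy can pick a direction but can never take a smaller step or stop.
row_normed = raw / (np.linalg.norm(raw, axis=1, keepdims=True) + 1e-9) * act_limit

# (c) Clipping each component to the action bounds.
clipped = np.clip(raw, -act_limit, act_limit)

# (d) tanh squashing: bounded and smooth, and small raw outputs stay small.
squashed = act_limit * np.tanh(raw)

for name, acts in [("batch norm", batch_normed), ("row norm", row_normed),
                   ("clip", clipped), ("tanh", squashed)]:
    print(name, np.linalg.norm(acts, axis=1))
```

Only the tanh variant keeps the action inside the bounds while still letting the agent take a near-zero step once it is at the target. In TensorFlow 1.x the analogous expression would be `act_limit * tf.tanh(...)` applied to the last layer's output.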
