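I'm trying to solve the CartPole-v1 problem from OpenAI Gym by running backprop on a neural network with one hidden layer, updating the model at every time step with state-action values (Q(s,a)). I can't get the average reward above about 42 steps per episode. Can anyone help? Is my approach sound? For example, if I update the Q-values at every time step rather than in a batch at the end of each episode, can the agent still learn the optimal solution? In theory it should be possible.
Details: after experimenting with activation functions and stochastic policies, I settled on a deterministic policy with a linear activation function and the parameters listed below. With that setup my agent converges consistently (in about 100-300 steps) to an average return of about 42 steps, but it never gets above 45. Tuning the parameters in the program below (epsilon, discount rate, and learning rate) makes little difference.
I've looked online for similar solutions, but none of them match the approach I'm following. Almost all of them learn at the end of each episode (by storing SARS data). Increasing the number of hidden layers doesn't help either. I also think the algorithm is unlikely to converge to a better value later on, because I have run it for more than 10,000 episodes and the average return is still around 40.
First, the hyperparameters: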
epsilon = 0.5
lr = 0.05
discount_rate=0.9
# number of features in environment observations
num_inputs = 4
hidden_layer_nodes = 6
num_outputs = 2
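The Q-function:
import numpy as np

def calculateNNOutputs(observation, m1=None, m2=None):
    # Fall back to the global weights so the single-argument calls below work.
    m1 = weights_1 if m1 is None else m1
    m2 = weights_2 if m2 is None else m2
    scaled_observation = scaleFeatures(observation)
    hidden_layer = np.dot(scaled_observation, m1)  # 1x4 X 4x6 -> 1x6
    outputs = np.dot(hidden_layer, m2)  # 1x6 X 6x2 -> 1x2
    return np.asmatrix(outputs)  # 1x2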
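Because no activation is applied between the two matrix products, the forward pass above amounts to a single linear map of the observation. The sketch below illustrates this with randomly initialised stand-in weights; the names `m1` and `m2` follow the post, while the random values, the seed, and the stand-in observation `x` are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
x = rng.normal(size=(1, 4))     # stand-in for one scaled observation
m1 = rng.normal(size=(4, 6))    # stand-in input-to-hidden weights
m2 = rng.normal(size=(6, 2))    # stand-in hidden-to-output weights

two_layer = np.dot(np.dot(x, m1), m2)  # forward pass as written in the post
collapsed = np.dot(x, np.dot(m1, m2))  # a single 4x2 linear map

same = np.allclose(two_layer, collapsed)
print(same)
```

With a purely linear activation, adding hidden nodes or layers therefore does not enlarge the family of Q-functions the network can represent, which may be related to why extra hidden layers made no difference.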
Action selection (policy):
import random

def selectAction(observation):
    # explore
    global epsilon
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, 1)
    # exploit
    outputs = calculateNNOutputs(observation)
    print(outputs)
    if outputs[0, 0] > outputs[0, 1]:
        return 0
    else:
        return 1
Backprop:
def backProp(prev_obs, m1, m2, experimental_values):
    global lr
    scaled_observation = np.asmatrix(scaleFeatures(prev_obs))  # 1x4
    hidden_layer = np.asmatrix(np.dot(scaled_observation, m1))  # 1x4 X 4x6 -> 1x6
    outputs = np.asmatrix(np.dot(hidden_layer, m2))  # 1x6 X 6x2 -> 1x2
    delta_out = np.asmatrix(outputs - experimental_values)  # 1x2
    delta_2 = np.transpose(np.dot(m2, np.transpose(delta_out)))  # (6x2 X 2x1)^T -> 1x6
    GRADIENT_2 = np.transpose(hidden_layer) * delta_out  # 6x1 X 1x2 -> 6x2, same shape as m2
    GRADIENT_1 = np.multiply(np.transpose(scaled_observation), delta_2)  # 4x1 broadcast with 1x6 -> 4x6, same shape as m1
    m1 = m1 - lr * GRADIENT_1
    m2 = m2 - lr * GRADIENT_2
    return m1, m2
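One way to rule out the gradient formulas as the source of the plateau is a finite-difference check. The sketch below re-implements the same gradient expressions with plain NumPy arrays and compares them against numerical derivatives of the squared error 0.5 * ||x m1 m2 - target||^2. The helper names `loss`, `analytic_grads`, and `numeric_grad` are hypothetical, and the random inputs are stand-ins rather than values from the post.

```python
import numpy as np

def loss(x, m1, m2, target):
    # Squared error 0.5 * ||x m1 m2 - target||^2 for the linear network.
    diff = x @ m1 @ m2 - target
    return 0.5 * float(np.sum(diff ** 2))

def analytic_grads(x, m1, m2, target):
    # Same expressions as backProp in the post, written with plain arrays.
    hidden = x @ m1                    # 1x6
    delta_out = hidden @ m2 - target   # 1x2
    delta_2 = delta_out @ m2.T         # 1x6
    grad_2 = hidden.T @ delta_out      # 6x2
    grad_1 = x.T @ delta_2             # 4x6
    return grad_1, grad_2

def numeric_grad(f, w, eps=1e-6):
    # Central finite differences, perturbing one weight at a time in place.
    g = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        old = w[idx]
        w[idx] = old + eps
        up = f()
        w[idx] = old - eps
        down = f()
        w[idx] = old
        g[idx] = (up - down) / (2 * eps)
    return g

rng = np.random.default_rng(0)  # arbitrary seed
x = rng.normal(size=(1, 4))
m1 = rng.normal(size=(4, 6))
m2 = rng.normal(size=(6, 2))
target = rng.normal(size=(1, 2))

g1, g2 = analytic_grads(x, m1, m2, target)
f = lambda: loss(x, m1, m2, target)
ok_1 = np.allclose(g1, numeric_grad(f, m1), atol=1e-4)
ok_2 = np.allclose(g2, numeric_grad(f, m2), atol=1e-4)
print(ok_1, ok_2)
```

If both checks pass, the gradient expressions match the squared-error loss, and the cause of the plateau is more likely to lie elsewhere, for example in the targets or in the schedules.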
Q-learning:
def updateWeights(prev_obs, action, obs, reward, done):
    global weights_1, weights_2
    calculated_value = calculateNNOutputs(prev_obs)
    if done:
        experimental_value = -1
    else:
        actionValues = calculateNNOutputs(obs)  # 1x2
        experimental_value = reward + discount_rate * (np.amax(actionValues, axis=1)[0, 0])
    # Only the entry for the action taken gets a new target; the other keeps its current estimate.
    if action == 0:
        weights_1, weights_2 = backProp(prev_obs, weights_1, weights_2, np.array([[experimental_value, calculated_value[0, 1]]]))
    else:
        weights_1, weights_2 = backProp(prev_obs, weights_1, weights_2, np.array([[calculated_value[0, 0], experimental_value]]))
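The target construction in `updateWeights` can be isolated and checked by hand. The sketch below is a standalone version of that step; the function name `td_target` is hypothetical and the Q-values are made up for illustration. The terminal value of -1 and the discount rate of 0.9 are taken from the post.

```python
import numpy as np

def td_target(q_prev, q_next, action, reward, done, discount_rate=0.9):
    # Copy the current estimates so the untaken action contributes zero error.
    target = np.array(q_prev, dtype=float).reshape(1, -1)
    if done:
        target[0, action] = -1.0  # terminal value used in the post
    else:
        target[0, action] = reward + discount_rate * float(np.max(q_next))
    return target

q_prev = np.array([[1.0, 2.0]])  # made-up Q(s, .)
q_next = np.array([[3.0, 0.5]])  # made-up Q(s', .)
mid_episode = td_target(q_prev, q_next, action=0, reward=1.0, done=False)
terminal = td_target(q_prev, q_next, action=1, reward=1.0, done=True)
print(mid_episode)  # entry 0 becomes 1.0 + 0.9 * 3.0 = 3.7
print(terminal)     # entry 1 becomes -1.0
```

Note that the target is computed from the same weights that are being updated, so it moves after every step.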
Edit: the main loop:
record = 0
total = 0
for i_episode in range(num_episodes):
    if i_episode % 10 == 0:
        print("W1 = ", weights_1)
        print("W2 = ", weights_2)
    observation = env.reset()
    epsilon = max(epsilon * 0.9, 0.01)
    lr = max(lr * 0.9, 0.01)
    print("Average steps = ", total / (i_episode + 1))
    print("Record = ", record)
    for t in range(1000):
        action_taken = selectAction(observation)
        print(action_taken)
        previous_observation = observation
        observation, reward, done, info = env.step(action_taken)  # take the selected action
        updateWeights(previous_observation, action_taken, observation, reward, done)  # perform backprop to update the action value
        if done:
            total = total + t
            if t > record:
                record = t
            print("Episode {} finished after {} timesteps".format(i_episode, t + 1))
            break
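The loop above multiplies both epsilon and lr by 0.9 once per episode, with a floor of 0.01. A small sketch shows how quickly those schedules bottom out; the helper name `episodes_until_floor` is hypothetical, and the starting values are the hyperparameters listed earlier.

```python
def episodes_until_floor(start, decay, floor):
    # Count how many per-episode decays it takes for max(v * decay, floor)
    # to reach the floor value.
    value, episodes = start, 0
    while value > floor:
        value = max(value * decay, floor)
        episodes += 1
    return episodes

epsilon_episodes = episodes_until_floor(0.5, 0.9, 0.01)
lr_episodes = episodes_until_floor(0.05, 0.9, 0.01)
print(epsilon_episodes, lr_episodes)
```

Under these settings epsilon hits its floor after 38 episodes and lr after 16, so for nearly all of a 10,000-episode run the agent explores only 1% of the time with the minimum learning rate.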
Do I need to change anything in the approach, the implementation, or the parameter tuning?