Why is my TensorFlow agent's training getting slower? Is it my batch feeding strategy?


I am trying to solve the "BipedalWalker-v2" problem from OpenAI Gym using Python and TensorFlow. To solve it I implemented an episodic policy gradient algorithm. Because the "BipedalWalker-v2" actions are continuous, my policy is approximated by a multivariate Gaussian distribution, whose mean is approximated with a fully connected neural network. My network has the following layers: [input: 24, hidden: 5, hidden: 5, output: 4]. My problem is that while I train the agent, the training process gets slower and slower until it almost freezes. My guess is that I am misusing sess.run and not feeding the batches in an efficient way, but that is only a guess. My questions are: is my guess correct? If it is, how can I improve it? And if it is something else, what is it? I am not looking for a quick fix; I just want to understand how to improve the training.

Thanks in advance,

My computer is an Inspiron 15 7000 Gaming with a GeForce NVIDIA GTX 1050, 8 GB of RAM, and an i5 CPU.

My code:

Libraries:

import tensorflow as tf
import tensorflow.contrib.slim as slim
import numpy as np
import gym
import matplotlib.pyplot as plt

The agent class:

class agent_episodic_continuous_action():
    def __init__(self, lr, s_size,a_size,batch_size,dist_type):

        self.stuck = False
        self.gamma = 0.99
        self.dist_type = dist_type
        self.is_brain_present = False
        self.s_size = s_size
        self.batch_size=batch_size
        self.state_in= tf.placeholder(shape=[None,s_size],dtype=tf.float32)
        self.a_size=a_size
        self.reward_holder = tf.placeholder(shape=[None],dtype=tf.float32)
        self.cov = tf.eye(a_size)
        self.reduction = 0.01
        if a_size > 1:
            self.action_holder = tf.placeholder(shape=[None,a_size],dtype=tf.float32)
        else:
            self.action_holder = tf.placeholder(shape=[None],dtype=tf.float32)

        self.gradient_holders = []
        self.optimizer = tf.train.AdamOptimizer(learning_rate=lr)

    def save_model(self,path,sess):
        self.saver.save(sess, path)


    def load_model(self,path,sess):
        self.saver.restore(sess, path)

    def create_brain(self,hidd_layer,hidd_layer_act_fn,output_act_fn):

        self.is_brain_present =  True
        hidden_output=slim.stack(self.state_in,slim.fully_connected,hidd_layer,activation_fn=hidd_layer_act_fn)
        self.output = slim.fully_connected(hidden_output,self.a_size,activation_fn=output_act_fn,biases_initializer=None)

    def create_pi_dist(self):

        if self.dist_type == "normal":
         #   amplify= tf.pow(slim.fully_connected(self.output,1,activation_fn=None,biases_initializer=None),2)
            mean= self.output
            #cov =tf.eye(self.a_size,batch_shape=[self.batch_size])*amplify
            normal = tf.contrib.distributions.MultivariateNormalFullCovariance(
                 loc=mean,
                 covariance_matrix=self.cov*self.reduction)  
            self.dist = normal  

    def create_loss(self):

        self.loss = -tf.reduce_mean(tf.log(self.dist.prob(self.action_holder))*self.reward_holder)

    def get_gradients_holder(self):

        for idx,var in enumerate(self.tvars):
            placeholder = tf.placeholder(tf.float32,name=str(idx)+'_holder')
            self.gradient_holders.append(placeholder)

    def sample_action(self,sess,state):

        sample_action= sess.run(self.dist.sample(),feed_dict={self.state_in:state})       
        return sample_action

    def calculate_loss_gradient(self):
        self.gradients = tf.gradients(self.loss,self.tvars)


    def update_weights(self):
        self.update_batch = self.optimizer.apply_gradients(zip(self.gradients,self.tvars))

        return self.update_batch

    def memorize_data(self,episode,first):
        if first:
            self.episode_history = episode
            self.stuck = False
        else:
            self.episode_history = np.vstack((self.episode_history,episode))




    def shuffle_memories(self):
        np.random.shuffle(self.episode_history)

    def create_graph_connections(self):
        if self.is_brain_present:
            self.create_pi_dist()
            self.create_loss()
            self.tvars = tf.trainable_variables()
            self.calculate_loss_gradient()
            self.saver = tf.train.Saver()
            self.update_weights()
        else:
            print("initialize brain first")

        self.init = tf.global_variables_initializer()

    def memory_batch_generator(self):

        total=self.episode_history.shape[0]

        amount_of_batches= int(total/self.batch_size)    
        for i in range(amount_of_batches+1):

            if i < amount_of_batches:
                top=(i+1)*self.batch_size
                bottom =i*self.batch_size 
                yield (self.episode_history[bottom:top,0:self.s_size],self.episode_history[bottom:top,self.s_size:self.s_size+self.a_size],self.episode_history[bottom:top,self.s_size+self.a_size:self.s_size+self.a_size+1],self.episode_history[bottom:top,self.s_size+self.a_size+1:])
            else:
                yield (self.episode_history[top:,0:self.s_size],self.episode_history[top:,self.s_size:self.s_size+self.a_size],self.episode_history[top:,self.s_size+self.a_size:self.s_size+self.a_size+1],self.episode_history[top:,self.s_size+self.a_size+1:])

    def train_with_current_memories(self,sess):
        self.sess = sess
        for step_sample_batch in self.memory_batch_generator():

             sess.run(self.update_weights(), feed_dict={self.state_in:step_sample_batch[0],self.action_holder:step_sample_batch[1],self.reward_holder:step_sample_batch[2].reshape([step_sample_batch[2].shape[0]])})

    def get_returns(self):

        self.episode_history[:,self.s_size+self.a_size:self.s_size+self.a_size+1] = self.discount_rewards(self.episode_history[:,self.s_size+self.a_size:self.s_size+self.a_size+1])


    def discount_rewards(self,r):
        """ take 1D float array of rewards and compute discounted reward """
        discounted_r = np.zeros_like(r)
        running_add = 0
        for t in reversed(range(0, r.size)):
            running_add = running_add * self.gamma + r[t]
            discounted_r[t] = running_add 
        return discounted_r   

    def prob_action(self,sess,action,state):

        prob = sess.run(self.dist.prob(action),feed_dict={self.state_in:state})
        return prob

    def check_movement(self):

        ep_back = 5
        jump = 3
        threshold = 3
        if len(self.episode_history) > ep_back*2:
            difference = sum(abs(self.episode_history[-ep_back:-1,:]-self.episode_history[-ep_back-jump:-1-jump,:]).flatten())
            print(difference)
            if difference < threshold:
                self.stuck = True

    def print_last_n_returns(self,n):

        if len(self.episode_history[:,self.s_size+self.a_size:self.s_size+self.a_size+1])>n:
            n_returns = sum(self.episode_history[-n:,self.s_size+self.a_size:self.s_size+self.a_size+1])/float(n)
            print(n_returns)
            return n_returns

The training loop:

tf.reset_default_graph()
agent_2= agent_episodic_continuous_action(1e-2,s_size=24,a_size=4,batch_size=30,dist_type="normal")

agent_2.create_brain([5,5],tf.nn.relu,None)

agent_2.create_graph_connections()


env = gym.make('BipedalWalker-v2')
with tf.Session() as sess:             

    sess.run(agent_2.init)
    for i in range(200):

        s = env.reset()
        d = False
        a=agent_2.sample_action(sess,[s])[0]

        print(a)

        if None in a:
            print("None in a! inside for")
            print(s)

        s1,r,d,_ = env.step(a)
        episode = np.hstack((s,a,r,s1))
        agent_2.memorize_data(episode=episode,first=True)
        count = 0 
        while not d:
            count = count + 1
            s = s1
            a=agent_2.sample_action(sess,[s])[0]
            s1,r,d,_ = env.step(a)
            episode = np.hstack((s,a,r,s1))
           # env.render()
            agent_2.memorize_data(episode=episode,first=False)
           # print(s1)
            if count % 5 == 0 :
                agent_2.check_movement()
            if agent_2.stuck:
                d = True

        agent_2.get_returns()
        agent_2.print_last_n_returns(20)
        agent_2.shuffle_memories()
        agent_2.train_with_current_memories(sess)

env.close()     

Could the slowdown be caused by agent.update_weights() being executed for every batch of 30 samples:

def update_weights(self):
    self.update_batch = self.optimizer.apply_gradients(zip(self.gradients,self.tvars))

when I execute:

def train_with_current_memories(self,sess):
    self.sess = sess
    for step_sample_batch in self.memory_batch_generator():
        sess.run(self.update_weights(), feed_dict={self.state_in:step_sample_batch[0],self.action_holder:step_sample_batch[1],self.reward_holder:step_sample_batch[2].reshape([step_sample_batch[2].shape[0]])})

Or is this sluggishness the expected behavior?


1 Answer

The code got slower after every iteration because the graph was growing with every iteration. This happened because I was creating new graph elements inside the iteration loop.
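
One simple way to confirm this kind of graph growth (a diagnostic sketch, not part of the original post) is to print the number of operations in the default graph at the end of every episode; once the graph is fully built the count should stay constant, so a steadily increasing number means new nodes are still being created inside the training loop:

# Diagnostic sketch: print this once per episode; a growing count means
# new ops keep being added to the graph after construction.
print("ops in graph:", len(tf.get_default_graph().get_operations()))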

During every iteration, the following function was being called:

def update_weights(self):
        self.update_batch = self.optimizer.apply_gradients(zip(self.gradients,self.tvars))    
        return self.update_batch

This function creates new elements in the graph every time it is called.
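
A minimal sketch of a fix, assuming the rest of the class stays unchanged: create_graph_connections() already builds the apply_gradients op once and caches it as self.update_batch, so the training loop only needs to run that cached op instead of calling update_weights() again:

# Sketch of a fixed training loop (assumes self.update_batch was already built
# once by create_graph_connections() and is simply reused here):
def train_with_current_memories(self, sess):
    for states, actions, returns, _ in self.memory_batch_generator():
        sess.run(self.update_batch,   # cached op: no new graph nodes are added
                 feed_dict={self.state_in: states,
                            self.action_holder: actions,
                            self.reward_holder: returns.reshape([returns.shape[0]])})

Note that sample_action() and prob_action() follow the same pattern: self.dist.sample() and self.dist.prob(...) build new ops on every call, so those tensors would also need to be created once during graph construction and then reused.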

The best way to guard against this kind of "graph leak" is to add the line

sess.graph.finalize()

as soon as the session is created. That way, TensorFlow raises an exception if anything tries to add new nodes to the graph afterwards.
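
Applied to the training loop from the question, the call would go right after the variables are initialized (a sketch, not the exact original code):

with tf.Session() as sess:
    sess.run(agent_2.init)
    sess.graph.finalize()   # any later attempt to add ops raises a RuntimeError
    # ... training loop as before ...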
