Hemant Vishwakarma: Why could the A2C reinforcement learning algorithm have surprisingly different test performance?

I am using an advantage actor-critic (A2C) reinforcement learning model on a different environment that has a large action space. With a learning rate of 0.0001 for both the actor and the critic, I get almost the same action in every testing episode. The maximum reward is 230, which is the best I have obtained. In the testing phase, however, the results are good in some steps and really bad in others, because the actions do not change.

I got rid of the repeated actions by changing the learning rate to 0.001 for both the actor and the critic. The actions now vary, but the results are not good and the reward does not converge: the maximum reward is only 100.

I also tried other settings, changing the learning rate, the state, and the reward, but the results are still not good. The maximum reward is 150.

Here are the actor and critic networks:

        g = tf.Graph()  # separate graph for the actor, mirroring `p` for the critic
        with g.as_default():
            #==============================actor==============================#
            actorstate = tf.placeholder(dtype=tf.float32, shape=n_input, name='state')
            actoraction = tf.placeholder(dtype=tf.int32, name='action')
            actortarget = tf.placeholder(dtype=tf.float32, name='target')

            hidden_layer1 = tf.layers.dense(inputs=tf.expand_dims(actorstate, 0), units=500, activation=tf.nn.relu, kernel_initializer=tf.zeros_initializer())
            hidden_layer2 = tf.layers.dense(inputs=hidden_layer1, units=250, activation=tf.nn.relu, kernel_initializer=tf.zeros_initializer())
            hidden_layer3 = tf.layers.dense(inputs=hidden_layer2, units=120, activation=tf.nn.relu, kernel_initializer=tf.zeros_initializer())
            output_layer = tf.layers.dense(inputs=hidden_layer3, units=n_output, kernel_initializer=tf.zeros_initializer())
            action_probs = tf.squeeze(tf.nn.softmax(output_layer))
            picked_action_prob = tf.gather(action_probs, actoraction)

            actorloss = -tf.log(picked_action_prob) * actortarget
            # actorloss = tf.reduce_mean(tf.losses.huber_loss(picked_action_prob, actortarget, delta=1.0), name='actorloss')

            actoroptimizer1 = tf.train.AdamOptimizer(learning_rate=var.learning_rate)

            if var.opt == 2:
                actoroptimizer1 = tf.train.RMSPropOptimizer(learning_rate=var.learning_rate, momentum=0.95,
                                                            epsilon=0.01)
            elif var.opt == 0:
                actoroptimizer1 = tf.train.GradientDescentOptimizer(learning_rate=var.learning_rate)

            actortrain_op = actoroptimizer1.minimize(actorloss)

            init = tf.global_variables_initializer()
            saver = tf.train.Saver(max_to_keep=var.n)

        p = tf.Graph()
        with p.as_default():
            #==============================critic==============================#
            criticstate = tf.placeholder(dtype=tf.float32, shape=n_input, name='state')
            critictarget = tf.placeholder(dtype=tf.float32, name='target')

            hidden_layer4 = tf.layers.dense(inputs=tf.expand_dims(criticstate, 0), units=500, activation=tf.nn.relu, kernel_initializer=tf.zeros_initializer())
            hidden_layer5 = tf.layers.dense(inputs=hidden_layer4, units=250, activation=tf.nn.relu, kernel_initializer=tf.zeros_initializer())
            hidden_layer6 = tf.layers.dense(inputs=hidden_layer5, units=120, activation=tf.nn.relu, kernel_initializer=tf.zeros_initializer())
            output_layer2 = tf.layers.dense(inputs=hidden_layer6, units=1, kernel_initializer=tf.zeros_initializer())
            value_estimate = tf.squeeze(output_layer2)

            criticloss = tf.reduce_mean(tf.losses.huber_loss(output_layer2, critictarget, delta=0.5), name='criticloss')
            optimizer2 = tf.train.AdamOptimizer(learning_rate=var.learning_rateMADDPG_c)
            if var.opt == 2:
                optimizer2 = tf.train.RMSPropOptimizer(learning_rate=var.learning_rate_c, momentum=0.95,
                                                            epsilon=0.01)
            elif var.opt == 0:
                optimizer2 = tf.train.GradientDescentOptimizer(learning_rate=var.learning_rateMADDPG_c)

            update_step2 = optimizer2.minimize(criticloss)

            init2 = tf.global_variables_initializer()
            saver2 = tf.train.Saver(max_to_keep=var.n)
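
One thing that may be worth checking in the networks above is `kernel_initializer=tf.zeros_initializer()` on every layer. The following is a minimal sketch in plain Python, not the TensorFlow graph from the post: the layer sizes, the helper names `dense`, `softmax`, and `zero_init_gradients`, and the sample inputs are all made up for illustration. It shows what all-zero kernels and biases do to a small ReLU network trained with the same `-log(pi) * target` loss.

```python
import math

def dense(x, weights, bias):
    """Fully connected layer: one weight row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_init_gradients(state, action, advantage, n_hidden=4, n_actions=3):
    """Forward and backward pass of a one-hidden-layer policy with zero init.

    Hypothetical toy model: sizes are illustrative, not those of the post.
    """
    # All kernels and biases start at zero, as with tf.zeros_initializer().
    w1 = [[0.0] * len(state) for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [[0.0] * n_hidden for _ in range(n_actions)]
    b2 = [0.0] * n_actions

    # Forward pass: ReLU hidden layer, then softmax over the action logits.
    pre = dense(state, w1, b1)
    hidden = [max(0.0, z) for z in pre]
    probs = softmax(dense(hidden, w2, b2))

    # Backward pass for loss = -log(probs[action]) * advantage.
    # d(loss)/d(logits) = (probs - one_hot(action)) * advantage
    dlogits = [(p - (1.0 if i == action else 0.0)) * advantage
               for i, p in enumerate(probs)]
    grad_b2 = dlogits
    grad_w2 = [[d * h for h in hidden] for d in dlogits]
    # Gradient flowing back into the hidden layer goes through w2 (all zeros).
    dhidden = [sum(w2[j][k] * dlogits[j] for j in range(n_actions))
               for k in range(n_hidden)]
    dpre = [dh if z > 0 else 0.0 for dh, z in zip(dhidden, pre)]
    grad_w1 = [[d * xi for xi in state] for d in dpre]
    return probs, grad_w1, grad_w2, grad_b2

probs, grad_w1, grad_w2, grad_b2 = zero_init_gradients(
    [0.5, -1.0, 2.0], action=1, advantage=2.0)
print("initial policy:", probs)
print("hidden kernel gradient:", grad_w1)
print("output kernel gradient:", grad_w2)
print("output bias gradient:", grad_b2)
```

In this toy model the initial policy is uniform and only the output bias receives a non-zero gradient. If the same holds in the real graph, the hidden activations stay at zero and the action probabilities cannot depend on the state, which would be consistent with seeing the same action in every test episode. Trying a non-zero initializer (for example, leaving `kernel_initializer` unset) might be a quick experiment.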

This is how the action is chosen:

        def take_action(self, state):
            """Sample an action from the actor's predicted probabilities."""
            action_probs = self.actor.predict(state)
            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
            return action
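
To check whether the sampling in `take_action` is effectively deterministic, it may help to look at the entropy of `action_probs` and at how often each action is actually drawn. A minimal sketch using only the standard library follows; `random.choices` stands in for `np.random.choice`, and the function names and the two example distributions are hypothetical.

```python
import math
import random
from collections import Counter

def policy_entropy(action_probs):
    """Shannon entropy in nats; values near 0 mean one action dominates."""
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)

def sample_action_counts(action_probs, n_samples, seed=0):
    """Draw n_samples actions from the distribution and count each one."""
    rng = random.Random(seed)
    actions = rng.choices(range(len(action_probs)),
                          weights=action_probs, k=n_samples)
    return Counter(actions)

# Illustrative distributions, not taken from the post.
peaked = [0.997, 0.001, 0.001, 0.001]   # almost always picks action 0
uniform = [0.25, 0.25, 0.25, 0.25]      # every action equally likely

print("entropy (peaked): ", policy_entropy(peaked))
print("entropy (uniform):", policy_entropy(uniform))
print("counts (peaked):  ", sample_action_counts(peaked, 1000))
print("counts (uniform): ", sample_action_counts(uniform, 1000))
```

Logging the entropy per step during testing would show whether the policy has collapsed onto one action, or whether it still spreads probability over many actions and the repeated action comes from somewhere else.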

This is the actor's predict function:

        def predict(self, s):
            return self._sess.run(self._action_probs, {self._state: s})
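
The post does not show how the `actortarget` and `critictarget` placeholders are fed. A common one-step formulation, given here only as an assumption about the setup, uses the TD target `r + gamma * V(s')` for the critic and the TD error `r + gamma * V(s') - V(s)` as the actor's target. A sketch, where the function name and the `gamma` default are illustrative:

```python
def td_target_and_advantage(reward, value, next_value, done, gamma=0.99):
    """One-step TD target (for the critic) and advantage (for the actor).

    Assumed formulation, not confirmed by the post:
      td_target = r + gamma * V(s')   (no bootstrap on terminal steps)
      advantage = td_target - V(s)
    """
    bootstrap = 0.0 if done else gamma * next_value
    td_target = reward + bootstrap
    advantage = td_target - value
    return td_target, advantage

# Non-terminal step: bootstrap from the critic's estimate of the next state.
print(td_target_and_advantage(1.0, 0.5, 2.0, done=False, gamma=0.5))
# Terminal step: the target is just the final reward.
print(td_target_and_advantage(1.0, 0.5, 2.0, done=True, gamma=0.5))
```

If the actor is instead fed the raw return or the raw reward, the gradient can have much higher variance, so it may be worth confirming which quantity is passed as `actortarget`.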

I also tried different rewards and settings that work well with DQN, but I never got good results with A2C. Any idea what is causing this?

Hemant Vishwakarma

Tuesday, 21 September 2021
