Thursday, 1 October 2020

Why does scaling down the parameters on every forward pass during training make the learning speed the same for all weights in Progressive GAN?

Equalized learning rate is one of the distinctive techniques in Progressive GAN, a paper from the NVIDIA team. The authors claim that with this method:

Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights.

In detail, they initialize all learnable parameters from a standard normal distribution, N(0, 1). Then, on every forward pass during training, they scale the result by the per-layer normalization constant from He's initializer.
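To make the description above concrete, here is a minimal sketch of runtime scaling. It assumes the scale is applied by multiplication, as in the code quoted in this post. The helper names `he_constant` and `sample_effective_weights`, and the example weight shape, are hypothetical and not taken from the paper or the repo. It draws stored weights from N(0, 1) and checks that the scaled weights have a standard deviation close to the He constant.

```python
import math
import random
import statistics

def he_constant(weight_shape):
    """Per-layer constant sqrt(2 / fan_in); fan_in is the product of all dims except the first."""
    fan_in = math.prod(weight_shape[1:])
    return math.sqrt(2.0 / fan_in)

def sample_effective_weights(weight_shape, n_samples, seed=0):
    """Draw stored weights from N(0, 1), then scale them by c as a forward pass would."""
    rng = random.Random(seed)
    c = he_constant(weight_shape)
    stored = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return [w * c for w in stored], c

# Hypothetical conv weight shape: (out_channels, in_channels, kernel_h, kernel_w)
effective, c = sample_effective_weights((512, 512, 3, 3), n_samples=20000)
effective_std = statistics.pstdev(effective)
print(c, effective_std)  # the two values should be close to each other
```

Under this assumption the stored parameters all live on the same N(0, 1) scale, while the weights the layer effectively uses have the spread that He's initializer would have given them.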

I reproduced the code from the pytorch GAN zoo GitHub repo:

import math
from numpy import prod

def forward(self, x, equalized):
    # Compute the He constant sqrt(2 / fan_in) from the shape of the weight tensor W
    size = self.module.weight.size()
    fan_in = prod(size[1:])
    he_constant = math.sqrt(2.0 / fan_in)
    # Example of a wrapped module:
    #   import torch.nn as nn
    #   module = nn.Conv2d(nChannelsPrevious, nChannels, kernelSize, padding=padding, bias=bias)
    x = self.module(x)

    if equalized:
        # Scale the layer output by the He constant on every forward pass
        x *= he_constant
    return x
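The code above scales the layer output rather than the weights themselves. As a small sketch in plain Python (the toy `linear` helper and the numbers are made up for illustration), multiplying the output of a bias-free linear layer by c gives the same result as multiplying every weight by c first:

```python
import math

def linear(weights, x):
    """Bias-free linear layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

weights = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]  # toy weights, shape (out=2, in=3)
x = [1.0, 2.0, 3.0]
c = math.sqrt(2.0 / len(weights[0]))  # He constant with fan_in = 3

# Option A: scale the output, as the forward() above does
scaled_output = [c * y for y in linear(weights, x)]

# Option B: scale the weights, then apply the layer
scaled_weights = [[c * w for w in row] for row in weights]
output_of_scaled_weights = linear(scaled_weights, x)

print(scaled_output, output_of_scaled_weights)
```

If the layer has a bias, scaling the output also scales the bias, so the two options only agree when the bias is zero or handled separately.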

At first, I thought the He constant would be the one from He's paper:

c=\sqrt{\frac{2}{n_l}}

Normally n_l>2, so the He constant c is smaller than 1, and dividing by it as in the formula from the ProGAN paper, \hat{w}_i=\frac{w_i}{c}, scales w_l up. That increases the gradient in backpropagation and so should prevent vanishing gradients.

However, the code shows \hat{w}_i=w_i \cdot c: the weights are multiplied by c, not divided by it.
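To see how far apart the two readings are, here is a small numeric sketch. The layer size (a 3x3 convolution with 512 input channels) is a hypothetical example chosen so that the constant comes out as a round number:

```python
import math

fan_in = 512 * 3 * 3         # hypothetical layer: 512 input channels, 3x3 kernel
c = math.sqrt(2.0 / fan_in)  # He constant, equal to 1/48 for this fan_in

w = 1.0                      # a stored weight of typical N(0, 1) magnitude
divided = w / c              # reading of the paper's formula: scales the weight up to 48
multiplied = w * c           # what the code does: scales the weight down to 1/48

print(c, divided, multiplied)
```

For this layer the two conventions differ by a factor of 1/c^2 = 2304, so they are not interchangeable.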

In summary, I can't understand why scaling down the parameters on every forward pass during training makes the learning speed more stable.

I have asked this question in several communities, e.g. Artificial Intelligence and Mathematics, and still haven't received an answer.

Please help me understand this. Thank you!


