Batch Normalization

[Mini] Batch Normalization:

First, read the previous post on weight initialization.

We want unit Gaussian activations (outputs)? Then we simply make them so [forcefully].

Say the activations have shape N x D: N = current mini-batch size (number of training examples in a forward pass), D = dimension of each example (number of features).

For fully connected layers, we compute the empirical mean and variance independently for each dimension, i.e. each feature. [Which axes we average over depends on the layer type, mainly fully connected vs. convolutional layers, as we'll see later.]

These statistics are computed over the batch, i.e. the current mini-batch that we have.
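A minimal NumPy sketch of the fully connected case described above. The function name batchnorm_fc, the eps value and the random test data are assumptions made for illustration, not part of the original notes.

```python
import numpy as np

def batchnorm_fc(x, eps=1e-5):
    """Normalize an (N, D) batch feature-wise: one mean/variance per column."""
    mean = x.mean(axis=0)   # shape (D,): one mean per feature, over the batch
    var = x.var(axis=0)     # shape (D,): one variance per feature, over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return x_hat, mean, var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(1000, 500))  # N = 1000, D = 500
x_hat, mean, var = batchnorm_fc(x)

print(mean.shape, var.shape)            # one statistic per feature
print(np.abs(x_hat.mean(axis=0)).max())  # close to 0 for every feature
print(x_hat.std(axis=0).mean())          # close to 1 for every feature
```

Each column of x_hat ends up with roughly zero mean and unit variance, regardless of the scale of the raw inputs.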

Usually inserted after fully connected or convolutional layers and before nonlinearity.

--> FC --> BN --> tanh --> FC --> BN --> tanh

BN scales the input going into each neuron; each neuron is one feature.

We can apply this the same way to convolutional layers as to fully connected layers. The only difference is that, with convolutional layers, we normalize not just across all the training examples independently for each feature dimension, but jointly across all the spatial locations in the activation map as well as all the training examples. We do this because we want to obey the convolutional property: nearby locations should be normalized the same way. So for a convolutional layer we have one mean and one standard deviation per activation map (channel), computed over all the examples in the batch. [See Ioffe and Szegedy, 2015]
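A small NumPy sketch of the convolutional case, assuming an NHWC layout (batch, height, width, channels). The function name batchnorm_conv_nhwc, the eps value and the toy shapes are illustrative assumptions.

```python
import numpy as np

def batchnorm_conv_nhwc(x, eps=1e-5):
    """Normalize an (N, H, W, C) batch with one mean/variance per channel."""
    mean = x.mean(axis=(0, 1, 2))  # shape (C,): over examples and spatial locations
    var = x.var(axis=(0, 1, 2))    # shape (C,)
    x_hat = (x - mean) / np.sqrt(var + eps)  # broadcasts over the last (channel) axis
    return x_hat, mean, var

rng = np.random.default_rng(0)
# Give every channel a different scale and offset so the effect is visible.
scale = np.array([1.0, 2.0, 3.0, 4.0])
shift = np.array([0.0, 1.0, 2.0, 3.0])
x = rng.normal(size=(8, 2, 3, 4)) * scale + shift

x_hat, mean, var = batchnorm_conv_nhwc(x)
print(mean.shape)                  # (4,): one mean per activation map
print(x_hat.mean(axis=(0, 1, 2)))  # close to 0 per channel
print(x_hat.var(axis=(0, 1, 2)))   # close to 1 per channel
```

Every spatial location within a channel is shifted and scaled by the same two numbers, which is what keeps the normalization consistent with the convolution.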

Making the inputs unit Gaussian removes almost all saturation for tanh, but we may want to allow some saturation.
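To make the saturation point concrete, here is a rough NumPy experiment. The pre-activation scale of 5.0 and the 0.99 threshold for calling a tanh unit "saturated" are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def saturated_fraction(z, threshold=0.99):
    """Fraction of tanh outputs whose magnitude exceeds the threshold."""
    return float(np.mean(np.abs(np.tanh(z)) > threshold))

rng = np.random.default_rng(0)
# Poorly scaled pre-activations: standard deviation 5 instead of 1.
pre = rng.normal(loc=0.0, scale=5.0, size=(1000, 500))

# Feature-wise normalization over the batch, as in the fully connected case.
pre_norm = (pre - pre.mean(axis=0)) / np.sqrt(pre.var(axis=0) + 1e-5)

frac_before = saturated_fraction(pre)
frac_after = saturated_fraction(pre_norm)
print(frac_before)  # a majority of units are saturated
print(frac_after)   # only a small fraction remains saturated
```

With these settings more than half of the units are saturated before normalization and only around one percent afterwards.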

Say x_ is the normalized data.
Then we compute y = gamma * x_ + beta in every BN layer (one gamma and one beta per feature):
scale by gamma, shift by beta.
If gamma = sqrt(var(x)) (the standard deviation) and beta = mean(x), batch normalization is undone and we recover the identity mapping. So by allowing the network to learn gamma and beta, we are giving it that slack.
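A quick numerical check of when the scale/shift undoes the normalization: the scale has to equal the standard deviation sqrt(var(x) + eps), whereas using the variance itself does not give back x. The data and the eps value are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(1000, 5))
eps = 1e-5

mean = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mean) / np.sqrt(var + eps)  # normalized data (x_ in the notes)

# Scale by the standard deviation, shift by the mean: recovers the input.
gamma = np.sqrt(var + eps)
beta = mean
y = gamma * x_hat + beta
print(np.allclose(y, x))

# Scaling by the variance instead does not recover the input (variance is about 4 here).
y_var = var * x_hat + mean
print(np.allclose(y_var, x))
```

In practice gamma and beta are learned, so the network can settle anywhere between full normalization (gamma = 1, beta = 0) and the identity mapping.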

BN for convolutional layers:

Mean and variance are computed per channel, across all examples of the mini-batch and all spatial locations.

Summary:

Three cases:

Case 1 - BN in the example shown in the Lecture: (Not actually BN but the plot they showed)

Mean and variance are calculated across all the data, so at one layer there is a single mean over 1000*500 values (1000 = number of examples, 500 = number of features).

Case 2 - BN in Fully Connected Layers.

Mean and variance are calculated for each feature independently, over all the examples in the batch. So for the same case of 1000 examples and 500 features, we'll have 500 means, each computed over the 1000 values of one feature.

Case 3 - BN in Convolutional Layers.

Mean and variance are calculated per activation map (channel), over all examples and all spatial locations. So each channel has 1 mean and 1 SD: if the batch size is 1000 and a channel's activation map is 500x1, that channel's mean is over 1000x500x1 values.

When it comes to a (2D) CNN, we normalize over the batch_size * height * width values of each channel, so gamma and beta each have length channel_count.
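The three cases differ only in which axes are averaged over. A short NumPy sketch of the resulting shapes, with a hypothetical 16 x 8 x 8 x 32 NHWC activation tensor standing in for the convolutional case:

```python
import numpy as np

rng = np.random.default_rng(0)

# Cases 1 and 2: 1000 examples, 500 features.
acts = rng.normal(size=(1000, 500))
case1_mean = acts.mean()        # Case 1: a single mean over all 1000*500 values
case2_mean = acts.mean(axis=0)  # Case 2: 500 means, each over 1000 values

# Case 3: convolutional activations in NHWC layout (sizes are assumptions).
conv = rng.normal(size=(16, 8, 8, 32))
case3_mean = conv.mean(axis=(0, 1, 2))  # 32 means, each over 16*8*8 values

# gamma and beta have one entry per channel.
gamma = np.ones(conv.shape[-1])
beta = np.zeros(conv.shape[-1])

print(np.ndim(case1_mean), case2_mean.shape, case3_mean.shape, gamma.shape)
```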

[Source]

In batch normalization we are not changing the weights; we are normalizing the input to each layer / activation. One possible reason to apply it before the activation (an intuition, not a proof): in backprop, d f_norm(w*x) / dw, where the output of f is normalized, is in general not equal to d f(w*x) / dw, whereas normalizing the input, f(w*x_norm), keeps the same function f and essentially only rescales and shifts its argument, so the gradient has the same form. "_norm" denotes normalization.

Note: at test time the BatchNorm layer functions differently. The mean/std are not computed from the batch. Instead, fixed empirical mean and variance of the activations collected during training are used (e.g. they can be estimated during training with running averages).
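A framework-free sketch of the train/test difference, using exponential running averages. The class name RunningBatchNorm is hypothetical, and the decay of 0.9 and the toy data are assumptions chosen so the averages converge quickly.

```python
import numpy as np

class RunningBatchNorm:
    """Feature-wise batch norm (without gamma/beta) that tracks running statistics."""

    def __init__(self, dim, decay=0.99, eps=1e-3):
        self.decay = decay
        self.eps = eps
        self.pop_mean = np.zeros(dim)  # running mean, used at test time
        self.pop_var = np.ones(dim)    # running variance, used at test time

    def __call__(self, x, is_train):
        if is_train:
            # Training: normalize with the statistics of the current mini-batch
            # and update the running averages on the side.
            mean, var = x.mean(axis=0), x.var(axis=0)
            self.pop_mean = self.decay * self.pop_mean + (1 - self.decay) * mean
            self.pop_var = self.decay * self.pop_var + (1 - self.decay) * var
        else:
            # Test: use the fixed running statistics, independent of the batch.
            mean, var = self.pop_mean, self.pop_var
        return (x - mean) / np.sqrt(var + self.eps)

rng = np.random.default_rng(0)
bn = RunningBatchNorm(dim=4, decay=0.9)
for _ in range(200):
    bn(rng.normal(loc=2.0, scale=3.0, size=(64, 4)), is_train=True)

# At test time even a single example can be normalized.
test_out = bn(rng.normal(loc=2.0, scale=3.0, size=(1, 4)), is_train=False)
print(bn.pop_mean)  # roughly 2 per feature
print(bn.pop_var)   # roughly 9 per feature
print(test_out.shape)
```

Because the test-time statistics are fixed, the output for a given example no longer depends on which other examples happen to share its batch.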

** Usually used after FC / Conv layers and before activations: FC / Conv --> BN --> Activation

https://github.com/tmulc18/BatchNormalization

#batch normalization
import tensorflow as tf #TensorFlow 1.x API (tf.get_variable, tf.assign, tf.cond)

#We don't use ema.average during training and instead just use batch averages. 
#At test time, the mean/std are not computed based on the batch.
#Instead, fixed empirical mean/variance of the activations collected during training are used.
#Example: they can be estimated during training with running averages

#When it comes to (2D) CNN, we normalize batch_size * height * width over each channel.
#So that gamma and beta have the lengths the same as channel_count

#*******************************************************************************************************
#    beta = shift, scale = scale (gamma); maybe better to initialize beta = all zeros and scale = all ones
#    https://www.programcreek.com/python/example/90419/tensorflow.assign Example 26
#*******************************************************************************************************
def batch_norm(name, x, is_train, decay=0.99, epsilon=0.001):
 #averaged mean and variance to be used for test
 dim = x.get_shape().as_list()[-1] #channel dimension
 pop_mean = tf.get_variable(name=name+'_pop_mean', shape=[dim], dtype=tf.float32,
            initializer=tf.constant_initializer(0.0), trainable=False)
 pop_var = tf.get_variable(name=name+'_pop_var', shape=[dim], dtype=tf.float32,
            initializer=tf.constant_initializer(1.0), trainable=False)
 beta = tf.get_variable(name=name+'_beta', shape=[dim], dtype=tf.float32,
            initializer=tf.constant_initializer(0.0), trainable=True)
 scale = tf.get_variable(name=name+'_scale', shape=[dim], dtype=tf.float32,
            initializer=tf.constant_initializer(1.0), trainable=True)

 def bn_train():
  batch_mean, batch_var = tf.nn.moments(x, axes=[0, 1, 2]) #Assuming NHWC: mean/variance over all axes except channels
  # batch_mean and batch_var are 1-D tensors of length 'C'
  # they are only used for batch normalization while training,
  # but we also need running averages that can be used while testing
  mean_op = tf.assign(pop_mean, pop_mean * decay + batch_mean * (1 - decay))
  var_op = tf.assign(pop_var, pop_var * decay + batch_var * (1 - decay))
  with tf.control_dependencies([mean_op, var_op]): #forces the pop_mean/pop_var updates to run; their values are not used in training
   return tf.nn.batch_normalization(x, batch_mean, batch_var, beta, scale, epsilon)
 
 #pop_mean and pop_var are not used in training, only at test time
 def bn_test():
  return tf.nn.batch_normalization(x, pop_mean, pop_var, beta, scale, epsilon)

 return tf.cond(is_train, bn_train, bn_test)

################ THE ABOVE ONE IS THE FINAL ONE ################################


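#Earlier variant: identical to the final one above except that beta/scale use truncated_normal initializers
def batch_norm(name, x, is_train, decay=0.99, epsilon=0.001):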
 #averaged mean and variance to be used for test
 dim = x.get_shape().as_list()[-1] #channel dimension
 pop_mean = tf.get_variable(name=name+'_pop_mean', shape=[dim], dtype=tf.float32,
            initializer=tf.constant_initializer(0.0), trainable=False)
 pop_var = tf.get_variable(name=name+'_pop_var', shape=[dim], dtype=tf.float32,
            initializer=tf.constant_initializer(1.0), trainable=False)
 beta = tf.get_variable(name=name+'_beta', shape=[dim], dtype=tf.float32,
            initializer=tf.truncated_normal_initializer(stddev=0.0), trainable=True)
 scale = tf.get_variable(name=name+'_scale', shape=[dim], dtype=tf.float32,
            initializer=tf.truncated_normal_initializer(stddev=0.1), trainable=True)

 def bn_train():
  batch_mean, batch_var = tf.nn.moments(x, axes=[0, 1, 2]) #Assuming NHWC: mean/variance over all axes except channels
  # batch_mean and batch_var are 1-D tensors of length 'C'
  # they are only used for batch normalization while training,
  # but we also need running averages that can be used while testing
  mean_op = tf.assign(pop_mean, pop_mean * decay + batch_mean * (1 - decay))
  var_op = tf.assign(pop_var, pop_var * decay + batch_var * (1 - decay))
  with tf.control_dependencies([mean_op, var_op]): #forces the pop_mean/pop_var updates to run; their values are not used in training
   return tf.nn.batch_normalization(x, batch_mean, batch_var, beta, scale, epsilon)
 
 #pop_mean and pop_var are not used in training, only at test time
 def bn_test():
  return tf.nn.batch_normalization(x, pop_mean, pop_var, beta, scale, epsilon)

 return tf.cond(is_train, bn_train, bn_test)


def batch_norm(name, x, is_train, decay=0.99, epsilon=0.001):
 #variant that tracks the test-time statistics with tf.train.ExponentialMovingAverage
 mean, var = tf.nn.moments(x, axes=[0, 1, 2]) #Assuming NHWC: mean/variance over all axes except channels


 dim = x.get_shape().as_list()[-1] #channel dimension
 beta = tf.get_variable(name=name+'_beta', shape=[dim], dtype=tf.float32,
            initializer=tf.truncated_normal_initializer(stddev=0.0), trainable=True)
 scale = tf.get_variable(name=name+'_scale', shape=[dim], dtype=tf.float32,
            initializer=tf.truncated_normal_initializer(stddev=0.1), trainable=True)

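 ema = tf.train.ExponentialMovingAverage(decay=decay) #use the decay argument instead of a hard-coded 0.99

 def update_mean_var(): #updates the moving averages (used at test time) and returns the batch statistics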
  emp_op = ema.apply([mean, var])
  with tf.control_dependencies([emp_op]):
   return tf.identity(mean), tf.identity(var)
        
 batch_mean, batch_var = tf.cond( is_train, lambda: update_mean_var(), lambda: (ema.average(mean), ema.average(var)) )

 return tf.nn.batch_normalization(x, batch_mean, batch_var, beta, scale, epsilon)


## CHECKING BATCH NORM
import tensorflow as tf
import numpy as np

tf.reset_default_graph()

np.random.seed(23) #setting random number generator with a fixed value will always generate same random values
#I personally prefer batch_norm1
def batch_norm1(name, x, is_train, decay=0.99, epsilon=0.001):
 #averaged mean and variance to be used for test
 #is_train must be a tf.bool tensor, not a Python bool; otherwise tf.cond raises an error
 dim = x.get_shape().as_list()[-1] #channel dimension
 pop_mean = tf.get_variable(name=name+'_pop_mean', shape=[dim], dtype=tf.float32,
            initializer=tf.constant_initializer(0.0), trainable=False)
 pop_var = tf.get_variable(name=name+'_pop_var', shape=[dim], dtype=tf.float32,
            initializer=tf.constant_initializer(0.0), trainable=False)
 beta = tf.get_variable(name=name+'_beta', shape=[dim], dtype=tf.float32,
            initializer=tf.truncated_normal_initializer(stddev=0.0), trainable=True)
 scale = tf.get_variable(name=name+'_scale', shape=[dim], dtype=tf.float32,
            initializer=tf.truncated_normal_initializer(stddev=0.1), trainable=True)

 def bn_train():
  batch_mean, batch_var = tf.nn.moments(x, axes=[0, 1, 2]) #Assuming NHWC: mean/variance over all axes except channels
  # batch_mean and batch_var are 1-D tensors of length 'C'
  # they are only used for batch normalization while training,
  # but we also need running averages that can be used while testing
  mean_op = tf.assign(pop_mean, pop_mean * decay + batch_mean * (1 - decay))
  var_op = tf.assign(pop_var, pop_var * decay + batch_var * (1 - decay))
  with tf.control_dependencies([mean_op, var_op]): #forces the pop_mean/pop_var updates to run; their values are not used in training
   return tf.identity(batch_mean), tf.identity(batch_var)
   # returning batch_mean, batch_var directly won't work:
   # a new node has to be created inside the control_dependencies block so that the updates execute on sess.run
   # creating a tf.nn op here (as in the line below) is fine too.
   #return tf.nn.batch_normalization(x, batch_mean, batch_var, beta, scale, epsilon)
 
 #pop_mean and pop_var are not used in training, only at test time
 def bn_test():
  return pop_mean, pop_var
  #return tf.nn.batch_normalization(x, pop_mean, pop_var, beta, scale, epsilon)

 return tf.cond(is_train, bn_train, bn_test)


def batch_norm2(name, x, is_train, decay=0.99, epsilon=0.001):
 #variant that tracks the test-time statistics with tf.train.ExponentialMovingAverage
 mean, var = tf.nn.moments(x, axes=[0, 1, 2]) #Assuming NHWC: mean/variance over all axes except channels


 dim = x.get_shape().as_list()[-1] #channel dimension
 beta = tf.get_variable(name=name+'_beta', shape=[dim], dtype=tf.float32,
            initializer=tf.truncated_normal_initializer(stddev=0.0), trainable=True)
 scale = tf.get_variable(name=name+'_scale', shape=[dim], dtype=tf.float32,
            initializer=tf.truncated_normal_initializer(stddev=0.1), trainable=True)

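 ema = tf.train.ExponentialMovingAverage(decay=decay) #use the decay argument instead of a hard-coded 0.99

 def update_mean_var(): #updates the moving averages (used at test time) and returns the batch statistics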
  emp_op = ema.apply([mean, var])
  with tf.control_dependencies([emp_op]):
   return tf.identity(mean), tf.identity(var)
        
 batch_mean, batch_var = tf.cond( is_train, lambda: update_mean_var(), lambda: (ema.average(mean), ema.average(var)) )

 return batch_mean, batch_var
 #return tf.nn.batch_normalization(x, batch_mean, batch_var, beta, scale, epsilon)

x = tf.placeholder(tf.float32, shape=[8,2,3,4])
t_bool = tf.placeholder(tf.bool) #must be a tf.bool tensor; passing a Python bool here is what caused the error
mean1, var1 = batch_norm1('bn1', x, is_train=t_bool)
mean2, var2 = batch_norm2('bn2', x, is_train=t_bool)
config = tf.ConfigProto()
# config.allow_soft_placement=True
config.log_device_placement = True
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
 sess.run(tf.global_variables_initializer())
 for i in range(3):
  feed_dict = {x: np.random.rand(8,2,3,4), t_bool: True}
  m1, v1, m2, v2 = sess.run([mean1, var1, mean2, var2], feed_dict=feed_dict)
  print('Norm1 Train:' + str([m1, v1]))
  print('Norm2 Train:' + str([m2, v2]))
 feed_dict = {x: np.random.rand(8,2,3,4), t_bool: False}
 m1, v1, m2, v2 = sess.run([mean1, var1, mean2, var2], feed_dict=feed_dict)
 print('Norm1 Test:' + str([m1, v1]))
 print('Norm2 Test:' + str([m2, v2]))

https://medium.com/@SeoJaeDuk/deeper-understanding-of-batch-normalization-with-interactive-code-in-tensorflow-manual-back-1d50d6903d35
See especially the code part.
https://wiseodd.github.io/techblog/2016/07/04/batchnorm/
