Regularization is a set of techniques that helps a learning model converge while retaining good generalization capacity. This kind of result is known as a "*right fit*".

In a right fit, the function follows the data points closely, but without chasing every fluctuation too rigorously.

When we train a model, there are two types of fitting problems:

**Underfitting:** the trained model isn't able to represent the input features.

**Overfitting:** the model has learned to recognize the training examples very well, but it isn't able to categorize new incoming samples.

Even though in different ways, underfitting and overfitting are not able to give us a generalization of the model. They show two kinds of problems:

**Bias:** a measure of the approximation error over all the samples in the training set; put simply, how far our model is from the samples.

**Variance:** the variance of the model's approximation over all the samples.

Underfitting presents *high bias* and *low variance*, whereas overfitting presents *low bias* and *high variance*. These are the two kinds of wrong fitting we mainly need to deal with. Generalization problems stem from the choice of parameters. In the high-bias case, we usually don't have enough significant parameters, or perhaps the training set is too small. High variance, on the contrary, means that we have too many parameters, not all of which are required. Generally, we need to penalize some parameters and give more importance to others; unfortunately, we don't know in advance which ones to work on. To address the problem, we use techniques called regularizers. They are applied to the function that measures the error of our estimates against the known target values, i.e. the cost function.
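In its general form (a standard formulation, with $\lambda$ controlling the regularization strength), the regularized cost is:

$$\tilde{J}(w) = J(w) + \lambda R(w)$$

where $J(w)$ is the unregularized cost (e.g. the cross-entropy) and $R(w)$ is the penalty term introduced by the regularizer.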

## L2 Regularization

Also known as *Ridge Regression* and *Tikhonov Regularization*, it penalizes half of the squared Euclidean norm of all parameters.
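Written out (this matches `tf.nn.l2_loss`, which computes $\frac{1}{2}\sum_i w_i^2$; $\lambda$ is the weight-decay factor):

$$R(w) = \frac{\lambda}{2}\lVert w \rVert_2^2 = \frac{\lambda}{2}\sum_i w_i^2$$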

It produces a non-sparse output and penalizes all parameters: all of them are retained, none of them goes to zero, but all values stay small.

## L1 Regularization

It is also called *LASSO*. It penalizes the sum of the absolute values of the weights.
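In formula (again with $\lambda$ as the regularization strength):

$$R(w) = \lambda\lVert w \rVert_1 = \lambda\sum_i \lvert w_i \rvert$$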

It produces a sparse output, effectively performing feature selection. For example, if our model uses 1000 parameters, maybe not all of them are useful, and L1 zeroes a part of them; in the end, we could be left with only 100 non-zero values. It doesn't limit the growth of the weights.
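To make the two penalties concrete, here is a small NumPy sketch (an illustration of the formulas above, not part of the original code; the toy weight vector and `weight_decay` value are made up):

```python
import numpy as np

weights = np.array([0.5, -0.25, 0.0, 1.0])
weight_decay = 0.01  # the lambda hyperparameter

# L2 penalty: half the squared Euclidean norm (matches tf.nn.l2_loss)
l2_penalty = weight_decay * 0.5 * np.sum(weights ** 2)

# L1 penalty: sum of absolute values
l1_penalty = weight_decay * np.sum(np.abs(weights))

print(l2_penalty)  # 0.0065625
print(l1_penalty)  # 0.0175
```

Note how the L2 term grows quadratically with each weight, while the L1 term grows linearly: this is why L2 mostly shrinks large weights, whereas L1 can push small ones exactly to zero.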

## Dropout

It is a different way to regularize a network [1]. It doesn't act on the loss function; it is based on the idea that some neurons can become saturated during the training process and should be deactivated so that their contribution doesn't harm training. Here "saturation" is expressed as co-adaptation, meaning that a neuron is only useful as a function of another specific neuron. This problem is clearly a limitation to model generalization. The question is: which neurons should be deactivated, and how can they be chosen? Since we don't know which neurons cause the problem, we simply pick a set of them at random.
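The random deactivation can be sketched in a few lines of NumPy (an illustrative toy, not the actual TensorFlow implementation; the function and names are mine). Like `tf.nn.dropout`, it rescales the kept activations by `1/keep_prob` so that the expected output is unchanged ("inverted" dropout):

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob and scale the
    survivors by 1/keep_prob, so the expected activation is unchanged."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones(10)
out = dropout(a, keep_prob=0.5, rng=rng)
# each entry of `out` is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

With `keep_prob=1` the mask keeps everything and the input passes through untouched, which is exactly what we want at test time.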

**N.B.** dropout should be applied only during the training phase; at test time it has to be disabled, because we need the full generalization power of our model.

## Usage in TensorFlow

### L1/L2 regularization

In TensorFlow, it is very simple to use regularization. Regardless of which kind of regularizer we want to use (L1 or L2), we define it during the variable creation process. I like to define a function that handles the variable creation for me:

```python
def create_variable(name, shape, weight_decay=None, loss=tf.nn.l2_loss):
    with tf.device("/cpu:0"):
        var = tf.get_variable(name, dtype=tf.float32, shape=shape,
                              initializer=tf.truncated_normal_initializer(stddev=0.05))
    if weight_decay:
        wd = loss(var) * weight_decay
        tf.add_to_collection("weight_decay", wd)
    return var
```

`create_variable` has a `weight_decay` parameter that defines whether the variable should be affected by regularization. Why would we want to exclude some variables from regularization? The function is used for both weight and bias creation, and biases shouldn't be regularized, as reported in [2]; only the weights should. Since TensorFlow builds a computation graph, we can apply the loss to the variable and store it in a collection, which is a key/value storage. This way, regularization is applied to the updated weight value each time it is needed. I point this out because at first glance it may appear that we regularize the weights only when they are created, acting on the initial value alone, but that's not the case!

The second element that needs to be modified to allow regularization to take place is the model cost computation. The regularization term is applied after the cross-entropy function: we first compute the "pure" loss for the model and then add the regularization term:

```python
def compute_loss(name_scope, logits, labels, sparse=True):
    if not sparse:
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=labels,
                                                                logits=logits)
    else:
        cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                                       logits=logits)
    cross_entropy_mean = tf.reduce_mean(cross_entropy)
    tf.summary.scalar(name_scope + '_cross_entropy', cross_entropy_mean)

    weight_decay_loss = tf.get_collection('weight_decay')
    if len(weight_decay_loss) > 0:
        tf.summary.scalar(name_scope + '_weight_decay_loss',
                          tf.reduce_mean(weight_decay_loss))
        tf.summary.histogram(name_scope + '_weight_decay_loss', weight_decay_loss)

        # Calculate the total loss for the current tower: the cross-entropy
        # plus the sum of all the collected regularization terms.
        total_loss = cross_entropy_mean + tf.add_n(weight_decay_loss)
        tf.summary.scalar(name_scope + '_total_loss', total_loss)
    else:
        total_loss = cross_entropy_mean
    return total_loss
```

It is a very simple way to regularize our model: we only need to get the values in the "weight_decay" collection and add their sum to the cross-entropy mean. The magic has happened!

In the end, we only need to write code to create a variable and choose a regularization type.

```python
weights = {
    'w1': create_variable("w1", [input_size, 1024], FLAGS.weight_decay),
    'w2': create_variable("w2", [1024, 512], FLAGS.weight_decay),
    'w3': create_variable("w3", [512, 64], FLAGS.weight_decay),
    'wout': create_variable("wout", [64, FLAGS.num_classes], FLAGS.weight_decay)
}
```

I used L2 as the default, but we can pass whatever regularization we want. Unfortunately, L1 isn't available as a built-in TensorFlow loss function, so we have to create it ourselves and use it in place of L2:

```python
def l1_loss(params):
    return tf.reduce_sum(tf.abs(params))

# ...

weights = {
    'w1': create_variable("w1", [input_size, 1024], FLAGS.weight_decay, loss=l1_loss),
    'w2': create_variable("w2", [1024, 512], FLAGS.weight_decay, loss=l1_loss),
    'w3': create_variable("w3", [512, 64], FLAGS.weight_decay, loss=l1_loss),
    'wout': create_variable("wout", [64, FLAGS.num_classes], FLAGS.weight_decay,
                            loss=l1_loss)
}
```

In theory, there is some optimization that can be done to write less code, but I think that this is the clearest way.

### Dropout

Dropout is applied to neurons, not to parameters, so in the case of a Multi-Layer Perceptron we can use this kind of regularization on one or more layers. We give the hidden layer a dropout parameter (the probability of keeping a neuron), with a default value of 1:

```python
def hidden_layer(input, weights, bias, name, dropout_prob=1., activation=tf.nn.relu):
    with tf.name_scope(name):
        out = tf.matmul(input, weights) + bias
        out = activation(out)
        out = tf.nn.dropout(out, dropout_prob)
        return out
```

In this way, we apply dropout only to specific layers; in the others, the probability that a neuron is kept stays at 1. As previously mentioned, dropout has to be applied only during the training phase, so we need a way to change the dropout value for a specific layer between training and testing. This can be obtained with a placeholder, which is later passed to the layer (feeding 1 at test time disables dropout):

```python
dropout_placeholder = tf.placeholder(tf.float32, shape=(), name="dropout_placeholder")

# ...

net = hidden_layer(train_x, weights=weights['w1'], bias=bias["b1"], name="hidden_1")
net = hidden_layer(net, weights=weights['w2'], bias=bias["b2"], name="hidden_2")
net = hidden_layer(net, weights=weights['w3'], bias=bias["b3"], name="hidden_3",
                   dropout_prob=dropout_placeholder)
```

Now we are able to choose the value of dropout in the session run:

```python
_, t_loss = sess.run([train_op, loss], feed_dict={
    train_x: train_samples,
    train_y: train_labels,
    dropout_placeholder: FLAGS.dropout
})
```

Full code is available in the GoDeep GitHub repository.


References

1. Improving neural networks by preventing co-adaptation of feature detectors
2. https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf