The most popular regularization method for neural networks is called dropout. The idea is simple: at each training iteration, a random subset of neurons is temporarily removed from the network, and only the remaining nodes are trained.
You specify a dropout rate, which is the fraction of nodes eliminated at each training step. Let’s say we use a dropout rate of 33% on the following network:
In one training iteration, one of the 3 nodes will randomly be “dropped out” (due to the dropout rate of 33%), and so only the following weights will be updated:
In the next training iteration, only the following weights could be updated:
And so on for each training iteration.
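This per-iteration masking can be sketched in a few lines of NumPy. The sketch below uses the common "inverted dropout" variant, where surviving activations are scaled up by 1/(1 − rate) so their expected value is unchanged; the function name and the example activations are illustrative, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, rate):
    # Draw a fresh random mask each call: each node survives with
    # probability (1 - rate), i.e. a fraction `rate` is "dropped out".
    mask = rng.random(activations.shape) >= rate
    # Inverted dropout: scale survivors by 1/(1 - rate) so the
    # expected activation matches the no-dropout network.
    return activations * mask / (1.0 - rate)

# Three hidden-node activations, dropout rate of 33% as in the text.
# A different node (or nodes) is zeroed on each training iteration:
h = np.array([0.5, -1.2, 0.8])
for step in range(3):
    print(dropout_forward(h, rate=1/3))
```

Because the mask is redrawn every iteration, the set of weights that receives gradient updates changes from step to step, which is exactly the behavior described above.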
Why does this work so well? One reason is that dropout prevents any single node, or small group of nodes, from storing all of the important information while the rest go unused; instead, every weight is pushed to carry useful information (similar in spirit to L2 regularization). The network is forced to learn redundant representations, which makes it more robust and helps prevent overfitting.