Introduction to neural network optimizers [part 3] – Adam optimizer

This is the third part of a three-part series introducing some general concepts and concrete algorithms in the field of neural network optimizers. As a reminder, here is the table of contents:

1. Part 1: momentum optimizer
2. Part 2: adaptive learning rates
3. Part 3: Adam optimizer

We covered two important concepts of optimizers in the previous sections, namely the introduction of a momentum term and adaptive learning rates. However, other variations, combinations and even additional concepts have also been proposed.

Each optimizer has its own advantages and limitations, making it suitable for specific contexts. It is beyond the scope of this series to name or introduce them all. Instead, we briefly explain the well-established Adam optimizer as one example. It also re-uses some of the ideas discussed previously.

Before we proceed, we want to share some thoughts regarding the combination of optimizers. One obvious choice might be to combine the momentum optimizer with the adaptive learning scheme. Even though this is theoretically possible, and even an option in an implementation of the RMSProp algorithm, there might be a problem.

The main concept of the momentum optimizer is to accelerate when the direction of the gradient remains the same in subsequent iterations. As a result, the update vector increases in magnitude. This, however, contradicts one of the goals of adaptive learning rates, namely to keep the gradients in “reasonable ranges”. This may lead to issues when the momentum vector $$\fvec{m}$$ increases but then gets scaled down again by the scaling vector $$\fvec{s}$$.

The authors of RMSProp also note that the direct combination of adaptive learning rates with a momentum term does not work so well. The theoretical argument above might be one cause of these observations.

In the following, we first define the Adam algorithm and then look at the differences compared to previous approaches. The first is the usage of first-order moments, which behave differently compared to a momentum vector. We use an example to see how this choice helps in skipping suboptimal local minima. The second difference is the usage of bias-correction terms, which are necessary due to the zero-initialization of the moment vectors. Finally, we also take a look at different trajectories.

Mathematical formulation

This optimizer was introduced by Diederik P. Kingma and Jimmy Ba in 2014. It mainly builds upon the ideas from AdaGrad and RMSProp, i.e. adaptive learning rates, and extends these approaches. The name is derived from adaptive moment estimation.

Definition 1: Adam optimizer

In addition to the variables used in classical gradient descent, let $$\fvec{m} = (m_1, m_2, \ldots, m_n) \in \mathbb{R}^n$$ and $$\fvec{s} = (s_1, s_2, \ldots, s_n) \in \mathbb{R}^n$$ be the vectors with the estimates of the first and second raw moments of the gradients (same length as the weight vector $$\fvec{w}$$). Both vectors are initialized to zero, i.e. $$\fvec{m}(0) = \fvec{0}$$ and $$\fvec{s}(0) = \fvec{0}$$. The hyperparameters $$\beta_1, \beta_2 \in [0;1[$$ denote the decaying rates for the moment estimates and $$\varepsilon \in \mathbb{R}^+$$ is a smoothing term. Then, the Adam optimizer defines the update rules

\begin{align} \begin{split} \fvec{m}(t) &= \beta_1 \cdot \fvec{m}(t-1) + (1-\beta_1) \cdot \nabla E\left( \fvec{w}(t-1) \right) \\ \fvec{s}(t) &= \beta_2 \cdot \fvec{s}(t-1) + (1-\beta_2) \cdot \nabla E \left( \fvec{w}(t-1) \right) \odot \nabla E \left( \fvec{w}(t-1) \right) \\ \fvec{w}(t) &= \fvec{w}(t-1) - \eta \cdot \frac{\fvec{m}(t)}{1-\beta_1^t} \oslash \sqrt{\frac{\fvec{s}(t)}{1-\beta_2^t} + \varepsilon} \end{split} \label{eq:AdamOptimizer_Adam} \end{align}

to find a path from the initial position $$\fvec{w}(0)$$ to a local minimum of the error function $$E\left(\fvec{w}\right)$$. The symbol $$\odot$$ denotes the point-wise multiplication and $$\oslash$$ the point-wise division between vectors.
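To make the update rules concrete, here is a minimal plain-Python sketch (the function name and the list-based representation are our own; real training code would rely on a deep learning library). Note that, as in \eqref{eq:AdamOptimizer_Adam}, $$\varepsilon$$ is added inside the square root:

```python
import math

def adam_step(w, m, s, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for the weight list w at iteration t (t starts at 1)."""
    # Exponentially moving averages of the first and second raw moments
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    s = [beta2 * si + (1 - beta2) * gi * gi for si, gi in zip(s, grad)]
    # Weight update with bias-corrected moments (eps inside the square root)
    w = [wi - eta * (mi / (1 - beta1 ** t)) / math.sqrt(si / (1 - beta2 ** t) + eps)
         for wi, mi, si in zip(w, m, s)]
    return w, m, s

# First iteration on a single weight with a large gradient
w, m, s = adam_step([0.0], [0.0], [0.0], [100.0], t=1)
```

A side effect of the bias correction: in the very first step, the corrected moments satisfy $$\hat{m} = g$$ and $$\hat{s} = g^2$$, so the step magnitude is roughly $$\eta$$ for every component, independent of the gradient's scale.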

There is a very close relationship to adaptive learning rates. In fact, the update rule of $$\fvec{s}(t)$$ in \eqref{eq:AdamOptimizer_Adam} is identical to the one in the adaptive learning scheme. We also see that there is an $$\fvec{m}$$ vector, although it is different from the one defined in momentum optimization. We will pick up this point shortly.

In the description of Adam, the arguments are more statistically-driven: $$\fvec{m}$$ and $$\fvec{s}$$ are interpreted as exponentially moving averages of the first and second raw moment of the gradient. That is, $$\fvec{m}$$ is a biased estimate of the means of the gradients and $$\fvec{s}$$ is a biased estimate of the uncentred variances of the gradients. In total, we can say that the Adam update process uses information about where the gradients are located on average and how they tend to scatter.

First-order moments

In momentum optimization, we keep track of an exponentially decaying sum whereas in Adam we have an exponentially decaying average. The difference is that in Adam we do not add the full new gradient vector $$\nabla E\left( \fvec{w}(t-1) \right)$$. Instead, only a fraction is used while at the same time a fraction of the old momentum is removed (the last part is identical to the momentum optimizer). For example, if we set $$\beta_1 = 0.9$$, we keep 90 % of the old value and add 10 % of the new. The bottom line is that we build much less momentum, i.e. the momentum vector does not grow that much.

In the analogy of a ball rolling down a valley, we may think of the moment updates in \eqref{eq:AdamOptimizer_Adam} as a very heavy ball with a lot of friction. It accelerates less and needs more time to take the gradient information into account. The ball rolls down the valley according to the running average of gradients along the track. Since it takes some time until the old gradient information is lost, it is less likely to stop at small plateaus and can hence overshoot small local minima.

We now want to test this argument on a small example function. For this, we leave out the second moments $$\fvec{s}$$ for now so that \eqref{eq:AdamOptimizer_Adam} reduces to

\begin{align} \begin{split} \fvec{m}(t) &= \beta_1 \cdot \fvec{m}(t-1) + (1-\beta_1) \cdot \nabla E\left( \fvec{w}(t-1) \right) \\ \fvec{w}(t) &= \fvec{w}(t-1) - \eta \cdot \frac{\fvec{m}(t)}{1-\beta_1^t}. \end{split} \label{eq:AdamOptimizer_AdamFirstMoment} \end{align}
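As a small sketch (scalar version, hypothetical names), the reduced rule reads in code:

```python
def first_moment_step(w, m, grad, t, eta=0.1, beta1=0.95):
    """One update using only the bias-corrected first moment."""
    m = beta1 * m + (1 - beta1) * grad          # moving average of gradients
    w = w - eta * m / (1 - beta1 ** t)          # bias-corrected update
    return w, m
```

For $$\beta_1 = 0$$ this collapses to classical gradient descent, since then $$m(t) = \nabla E$$ and the correction factor equals 1. Due to the bias correction, even the very first step is identical for any $$\beta_1$$.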

We want to compare these first moment updates with classical gradient descent. The following figure shows the example function and allows you to play around with a trajectory which starts near the summit of the hill.

Directly after the first descent is a small local minimum and we see that classical gradient descent ($$\beta_1 = 0$$) gets stuck here. However, with first-order moments (e.g. $$\beta_1 = 0.95$$), we leverage the fact that the moving average does not decrease immediately, so we can still roll over this small hole and make it down to the valley.

We can see from the error landscape that the first gradient component has the major impact on the updates as it is the direction of the steepest hill. It is insightful to visualize the first component $$m_1(t)$$ of the first-order moments over iteration time $$t$$:

Figure 2: First component $$m_1(t)$$ of the first-order moments over iteration time $$t$$. The values are calculated according to \eqref{eq:AdamOptimizer_AdamFirstMoment} and use the same starting point as the trajectory in the previous figure. 150 iterations and a global learning rate of $$\eta=20$$ were used. The $$\beta_1 = 0$$ curve corresponds to classical gradient descent and the $$\beta_1 = 0.95$$ curve to an update scheme which employs first-order moments.

With classical gradient descent ($$\beta_1 = 0$$), we move fast down the hill but then get stuck in the first local minimum. As only local gradient information is used in the update process, the chances of escaping the hole are very low.

In contrast, when using first-order moments, we pick up speed more slowly as only a fraction of the large first gradients is used. However, $$m_1(t)$$ also decreases more slowly when reaching the first hole. In this case, the behaviour of the moving average helps to step over the short increase and to move further down the valley.

Building momentum and accelerating when we move in the same direction in subsequent iterations is the main concept and advantage of momentum optimization. However, as we already saw in the toy example used in the momentum optimizer article, large momentum vectors may be problematic as they can overstep local minima and lead to oscillations. What is more, as stressed in the argument above, it is not entirely clear whether momentum optimization works well together with adaptive learning rates. This may explain why the momentum optimizer is not used directly in Adam.

Bias correction

The final change in the Adam optimizer compared to its predecessors is the pair of bias-correction terms, where we divide the moment vectors by $$(1-\beta_1^t)$$ and $$(1-\beta_2^t)$$, respectively. This is necessary because the moment vectors are initialized to zero, so the moving averages are, especially in the beginning, biased towards the origin. The factors are a countermeasure to correct this bias.

Practically speaking, these terms boost both vectors in the beginning since each is divided by a number usually $$< 1$$. This can speed up convergence when the true moving averages are not located at the origin but are larger instead. As the factors have the iteration number $$t$$ in the exponent of the hyperparameters, they approach 1 over time and hence become less influential.
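A small numerical illustration of this bias (an assumed constant gradient stream of $$g = 5$$): the raw moving average $$m(t)$$ starts far too low, while the corrected value $$m(t)/(1-\beta_1^t)$$ recovers $$g$$ exactly:

```python
beta1, g = 0.9, 5.0  # decay rate and a constant gradient stream
m = 0.0              # zero-initialization biases m towards the origin
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)      # bias-corrected estimate
    print(t, round(m, 4), round(m_hat, 4))  # m_hat is 5.0 in every iteration
```

After the first iteration, $$m(1) = 0.1 \cdot 5 = 0.5$$, i.e. only a tenth of the true mean; dividing by $$1-\beta_1^1 = 0.1$$ restores the correct value.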

We now consider, once again, a one-dimensional example and define measures to compare the update vectors of the second iteration using either classical gradient descent or the Adam optimizer. To visualize the effect of the bias-correction terms, we repeat the process with these terms left out.

Denoting the gradients of the first two iterations as $$g_t = \nabla E\left( w(t-1) \right)$$, we build the moment estimates

\begin{align*} m(1) &= \beta_1 \cdot m(0) + (1-\beta_1) \cdot g_1 = (1-\beta_1) \cdot g_1 \\ m(2) &= \beta_1 \cdot m(1) + (1-\beta_1) \cdot g_2 = \beta_1 \cdot (1-\beta_1) \cdot g_1 + (1-\beta_1) \cdot g_2 \\ s(1) &= \beta_2 \cdot s(0) + (1-\beta_2) \cdot g_1^2 = (1-\beta_2) \cdot g_1^2 \\ s(2) &= \beta_2 \cdot s(1) + (1-\beta_2) \cdot g_2^2 = \beta_2 \cdot (1-\beta_2) \cdot g_1^2 + (1-\beta_2) \cdot g_2^2 \end{align*}

so that we can define a comparison measure as

\begin{equation} \label{eq:AdamOptimizer_AdamMeasureCorrection} C_A(g_1,g_2) = \left| \eta \cdot \frac{\frac{m(2)}{1-\beta_1^2}}{\sqrt{\frac{s(2)}{1-\beta_2^2} + \varepsilon}} \right| - |\eta \cdot g_2| = \left| \eta \cdot \frac{\sqrt{1-\beta_2^2}}{1-\beta_1^2} \cdot \frac{m(2)}{\sqrt{s(2) + (1-\beta_2^2) \cdot \varepsilon}} \right| - |\eta \cdot g_2|. \end{equation}

To make the effect of the bias-correction terms more evident, we moved them out of the compound fraction and used them as a prefactor. We define a similar measure without these terms

\begin{equation} \label{eq:AdamOptimizer_AdamMeasureNoCorrection} \tilde{C}_A(g_1,g_2) = \left| \eta \cdot \frac{m(2)}{\sqrt{s(2) + \varepsilon}} \right| - |\eta \cdot g_2|. \end{equation}
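Both measures can be evaluated directly in code (a sketch using the hyperparameter values $$\eta = 1$$, $$\varepsilon = 10^{-8}$$, $$\beta_1 = 0.9$$ and $$\beta_2 = 0.999$$):

```python
import math

eta, eps = 1.0, 1e-8
beta1, beta2 = 0.9, 0.999

def measures(g1, g2):
    """Return (C_A, C_A_tilde): Adam's second update step compared with
    classical gradient descent, with and without bias correction."""
    m2 = beta1 * (1 - beta1) * g1 + (1 - beta1) * g2
    s2 = beta2 * (1 - beta2) * g1 ** 2 + (1 - beta2) * g2 ** 2
    c = abs(eta * (m2 / (1 - beta1 ** 2)) / math.sqrt(s2 / (1 - beta2 ** 2) + eps)) \
        - abs(eta * g2)
    c_tilde = abs(eta * m2 / math.sqrt(s2 + eps)) - abs(eta * g2)
    return c, c_tilde
```

Evaluating both measures for small and large gradient pairs confirms the discussion below: small gradients are amplified, large ones attenuated, and without the correction terms the amplification is even stronger.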

The following figure compares the two measures by interpreting the gradients of the first two iterations as variables.

Figure 3: Effect of the bias correction terms in the Adam optimizer. The left plot shows the measure $$C_A(g_1,g_2)$$ (\eqref{eq:AdamOptimizer_AdamMeasureCorrection}) and the right $$\tilde{C}_A(g_1,g_2)$$ (\eqref{eq:AdamOptimizer_AdamMeasureNoCorrection}). In the former measure, bias correction terms are used and in the latter not. Both measures compare the updates of the Adam optimizer with the ones of classical gradient descent. The learning rate is set to $$\eta = 1$$, the smoothing term to $$\varepsilon = 10^{-8}$$ and the exponentially decaying rates to $$\beta_1 = 0.9$$ and $$\beta_2 = 0.999$$.

With correction terms (left image), we can observe that small gradients get amplified and larger ones attenuated. This is an inheritance from the adaptive learning scheme. Back then, however, this behaviour was more centred around the origin, whereas here smaller gradients get amplified less strongly and more independently of $$g_1$$. This is likely an effect of the $$m(2)$$ term, which uses only a small fraction (10 % in this case) of the first gradient $$g_1$$, leading to a smaller numerator.

When we compare this result with the one without any bias corrections (right image), we see a much brighter picture. That is, both the amplification of small and the attenuation of large gradients are stronger. This is not surprising, as the prefactor

\begin{equation*} \frac{\sqrt{1-\beta_2^2}}{1-\beta_1^2} = \frac{\sqrt{1-0.999^2}}{1-0.9^2} \approx 0.2353 \end{equation*}

is smaller than 1 and hence leads to an overall decrease (the term $$(1-\beta_2^2) \cdot \varepsilon$$ is too small to have a visible effect). Therefore, the bias-correction terms ensure that the update vectors also behave moderately at the beginning of the learning process.

Trajectories

Like in previous articles, we now also want to compare different trajectories when using the Adam optimizer. For this, we can use the following widget which implements the Adam optimizer.

Basically, the parameters behave as expected: larger values of $$\beta_1$$ make the accumulated gradients decay more slowly so that we first overshoot the minimum. $$\beta_2$$ again controls the preference of direction ($$\beta_2$$ small) vs. magnitude ($$\beta_2$$ large).

Note that even though the Adam optimizer is much more advanced than classical gradient descent, this does not mean that it is immune to extreme settings. It is still possible that strange effects such as oscillations occur or that the overshooting mechanism discards good minima (example settings). Hence, it may still be worthwhile to search for good values of the hyperparameters.

Conclusion

This concludes the main concepts of the Adam optimizer. It is a popular optimization technique and its default settings are often a good starting point. Personally, I have had good experience with this optimizer and would definitely use it again. However, depending on the problem, it might not be the best choice or may require tuning of the hyperparameters. For this, it is good to know what they do and also how the other optimization techniques work.

List of attached files:

• AdamOptimizer.nb [PDF] (Mathematica notebook with some basic computations and visualizations used to write this article)

Introduction to neural network optimizers [part 2] – adaptive learning rates (RMSProp, AdaGrad)

This is the second part of a three-part series introducing some general concepts and concrete algorithms in the field of neural network optimizers. As a reminder, here is the table of contents:

1. Part 1: momentum optimizer
2. Part 2: adaptive learning rates
3. Part 3: Adam optimizer

In the optimization techniques discussed previously (classical gradient descent and momentum optimizer), we used a global learning rate $$\eta$$ for every weight in the model. However, this might not be a suitable setting as some neurons can benefit from a higher and others from a lower learning rate. What is more, per-neuron learning rates can help the weights move more in lockstep in the error landscape. This can be beneficial in approaching an optimum.

In a neural network with maybe thousands and thousands of parameters, we certainly do not want to adjust them all by hand. Hence, an automatic solution is required. There are multiple approaches to tackle this problem. Here, we introduce the concept of adaptive learning rates, which apply a different scaling to each gradient component according to information from previous gradients, essentially changing the learning rate per weight.

The optimization technique we describe here follows the RMSProp algorithm introduced by Tieleman and Hinton in 2012. This is again an advancement of the AdaGrad method which was proposed in 2011 by Duchi, Hazan and Singer. We stick to the advanced version and refer to it as the adaptive learning scheme.

There is a second family of problems with relevance to this context. There are two extreme cases of gradient descent which can become problematic. On the one hand, there is the problem of plateaus in the error landscape. There, the gradient is very low and hence learning happens only very slowly. It would be nice to escape from these plateaus faster. Related to this problem are vanishing gradients, which can occur especially in deep neural networks, originating from the multiplication of small numbers (e.g. due to a plateau) and leading to even smaller numbers. If the gradients were larger, we could alleviate these problems.

On the other hand, gradients can also get very large, e.g. when moving across a steep hill in the error landscape. There, at least some components of the gradients are very large. If the steps are too big, we may overstep local minima hampering the goal of finding a good solution. Related to this are exploding gradients where gradients can grow uncontrollably fast due to the multiplication of large numbers. Even though less likely than vanishing gradients, they can still be problematic. If the gradients were smaller, we could alleviate these problems as well.

Generally speaking, problems can occur when gradients become either too large or too small. One goal is to keep the gradients in “reasonable ranges”. That is, amplify very small and attenuate very large gradients so that the problems discussed are less likely. Using an update scheme with adaptive learning rates can help.

An additional advantage of adaptive learning rates is that the global learning rate $$\eta$$ needs much less tuning because each weight adapts its learning speed on its own. This leaves $$\eta$$ being more of a general indicator than a critical design choice. To put it differently, the damage which can be caused by $$\eta$$, e.g. by setting it way too high, is far smaller as each weight also applies its own scaling.

We start by defining the update rules for the adaptive learning scheme. We then look at a small numerical example, play around with trajectories in the error landscape and discuss some general concepts. Last but not least, there is also a new hyperparameter on which we want to elaborate. We skip speed comparisons as this would require a more realistic error function which can leverage some of the advantages of the procedure (e.g. faster escape from plateaus).

Mathematical formulation

The basic extension of the adaptive learning scheme, compared to classical gradient descent, is the introduction of scaling factors $$\fvec{s}$$ for each weight.

Definition 1: Adaptive learning scheme

In addition to the variables used in classical gradient descent, let $$\fvec{s} = (s_1, s_2, \ldots, s_n) \in \mathbb{R}^n$$ be a vector with scaling factors for each weight in $$\fvec{w}$$, initialized to zero, i.e. $$\fvec{s}(0) = \fvec{0}$$, $$\beta \in [0;1]$$ the decaying parameter and $$\varepsilon \in \mathbb{R}_{>0}$$ a smoothing term. Then, the update rules

\begin{align} \begin{split} \fvec{s}(t) &= \beta \cdot \fvec{s}(t-1) + (1-\beta) \cdot \nabla E \left( \fvec{w}(t-1) \right) \odot \nabla E \left( \fvec{w}(t-1) \right) \\ \fvec{w}(t) &= \fvec{w}(t-1) - \eta \cdot \nabla E \left( \fvec{w}(t-1) \right) \oslash \sqrt{\fvec{s}(t) + \varepsilon} \end{split} \label{eq:AdaptiveLearning_AdaptiveLearningScheme} \end{align}

define the adaptive learning scheme and find a path from the initial position $$\fvec{w}(0)$$ to a local minimum of the error function $$E\left(\fvec{w}\right)$$. The symbol $$\odot$$ denotes the point-wise multiplication and $$\oslash$$ the point-wise division between vectors.
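A minimal sketch of these update rules in plain Python (hypothetical names; this corresponds to an RMSProp-style update):

```python
import math

def adaptive_step(w, s, grad, eta=0.05, beta=0.9, eps=1e-8):
    """One update of the adaptive learning scheme (element-wise scaling)."""
    # Exponentially decaying average of the squared gradients
    s = [beta * si + (1 - beta) * gi * gi for si, gi in zip(s, grad)]
    # Scale each gradient component individually (eps avoids division by zero)
    w = [wi - eta * gi / math.sqrt(si + eps)
         for wi, gi, si in zip(w, grad, s)]
    return w, s
```

Each component is scaled individually, so a weight with a history of large gradients takes smaller steps and vice versa.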

The scaling factors $$\fvec{s}$$ accumulate information about all previous gradients in their squared form so that only the magnitude and not the sign of the gradient is changed. The squaring has the advantage (e.g. compared to the absolute value) that small gradients get even smaller and large gradients even larger so that their contribution in the scaling factors $$\fvec{s}$$ is boosted. This intensifies the effects of the adaptive learning rates which we discuss below.

The hyperparameter $$\beta$$ controls the influence of the previous scales $$\fvec{s}(t-1)$$ compared to the influence of the new gradient $$\nabla E \left( \fvec{w}(t-1) \right)$$. We are going to take a look at this one a bit later.

In each step, the weights now move in the direction of the update vector

\begin{equation} \label{eq:Optimizer_AdaptiveUpdateVector} \fvec{v}(t) = \eta \cdot \nabla E \left( \fvec{w}(t-1) \right) \oslash \sqrt{\fvec{s}(t) + \varepsilon} \end{equation}

which scales the gradient according to the square root of $$\fvec{s}$$ effectively reverting the previous squaring step. Additionally, a small smoothing term $$\varepsilon$$ to avoid division by zero is used. It is usually set to a very small number, e.g. $$\varepsilon = 10^{-8}$$.

Numerical example

With the mathematical toolbox set up, we are ready to start with a small numerical example where we compare classical gradient descent with adaptive learning rates. We are using the same error function as before, i.e.

\begin{equation} \label{eq:AdaptiveLearning_ExampleFunction} E(\fvec{w}) = 3 \cdot w_1^2 + 10 \cdot w_2^2. \end{equation}

We do not change the parameters for classical gradient descent so the results $$\fvec{w}_c(1) = (-14.91, 19.6)$$ and $$\fvec{w}_c(2) \approx (-14.82, 19.21)$$ remain valid.

We set the decaying parameter to $$\beta = 0.9$$ and the smoothing term to $$\varepsilon = 10^{-8}$$. This leaves only the global learning rate to clarify. As we do not change the classical gradient descent approach, this leaves $$\eta_c = 0.001$$. This is not an appropriate rate for the adaptive scheme, however. Due to the attenuation of higher gradients, this value is way too low so that we would not make much progress. We therefore increase it to $$\eta_a = 0.05$$. This would make speed comparisons more difficult but since we are not intending to do them, this is of no concern. With the gradient of the error function

\begin{equation} \label{eq:AdaptiveLearning_ExampleFunctionGradient} \nabla E\left( \fvec{w} \right) = \cvec{\xfrac{\partial E}{\partial w_1} \\ \xfrac{\partial E}{\partial w_2}} = \cvec{6 \cdot w_1 \\ 20 \cdot w_2} \end{equation}

evaluated at the initial position

\begin{equation*} \nabla E \left( \fvec{w}(0) \right) = \cvec{6 \cdot (-15) \\ 20 \cdot 20} = \cvec{-90 \\ 400}, \end{equation*}

we can apply the first update step (results are rounded to two decimal places)

\begin{align} \begin{split} \fvec{s}(1) &= \beta \cdot \fvec{s}(0) + (1-\beta) \cdot \nabla E \left( \fvec{w}(0) \right) \odot \nabla E \left( \fvec{w}(0) \right) \\ &= 0.9 \cdot \fvec{0} + 0.1 \cdot \cvec{ (-90) \cdot (-90) \\ 400 \cdot 400 } = \cvec{810 \\ 16000} \\ \fvec{w}_a(1) &= \fvec{w}(0) - \eta_a \cdot \nabla E \left( \fvec{w}(0) \right) \oslash \sqrt{\fvec{s}(1) + \varepsilon} \\ &= \cvec{-15 \\ 20} - 0.05 \cdot \cvec{\xfrac{-90}{\sqrt{810 + 10^{-8}}} \\ \xfrac{400}{\sqrt{16000 + 10^{-8}}}} \approx \cvec{-14.84 \\ 19.84}. \end{split} \label{eq:AdaptiveLearning_AdaptiveStep1} \end{align}

It is not a coincidence that the fractional part $$.84$$ is the same for both vector components of $$\fvec{w}_a(1)$$. This is because we initialized the scaling factors to $$\fvec{s}(0) = \fvec{0}$$ so that only the squared gradient times $$1-\beta$$ remains. In the weight update step, the gradient then nearly cancels out due to the square root. In the end, both weight components change roughly by the same value ($$\approx +0.16$$ for the first and $$\approx -0.16$$ for the second component). Only the direction, indicated by the sign of the gradient, is different (we move to the right for $$w_1$$ and to the left for $$w_2$$). We take up this point again later.

The new weight position $$\fvec{w}_a(1)$$ is the basis to evaluate the next gradient

\begin{equation*} \nabla E \left( \fvec{w}_a(1) \right) = \cvec{6 \cdot (-14.84) \\ 20 \cdot 19.84} = \cvec{-89.04 \\ 396.8} \end{equation*}

so that we can proceed in a similar way with the second update

\begin{align*} \fvec{s}(2) &= \beta \cdot \fvec{s}(1) + (1-\beta) \cdot \nabla E \left( \fvec{w}_a(1) \right) \odot \nabla E \left( \fvec{w}_a(1) \right) \\ &= 0.9 \cdot \cvec{810 \\ 16000} + 0.1 \cdot \cvec{ (-89.04) \cdot (-89.04) \\ 396.8 \cdot 396.8 } \approx \cvec{1521.81 \\ 30145.02} \\ \fvec{w}_a(2) &= \fvec{w}_a(1) - \eta_a \cdot \nabla E \left( \fvec{w}_a(1) \right) \oslash \sqrt{\fvec{s}(2) + \varepsilon} \\ &= \cvec{-14.84 \\ 19.84} - 0.05 \cdot \cvec{\xfrac{-89.04}{\sqrt{1521.81 + 10^{-8}}} \\ \xfrac{396.8}{\sqrt{30145.02 + 10^{-8}}}} \approx \cvec{-14.73 \\ 19.73}. \end{align*}
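The two worked update steps can be reproduced with a short script (a sketch; it works with full precision, so intermediate values differ slightly from the hand-rounded ones above, but the final rounded result agrees):

```python
import math

eta, beta, eps = 0.05, 0.9, 1e-8

def grad(w):
    # gradient of E(w) = 3*w1^2 + 10*w2^2
    return [6 * w[0], 20 * w[1]]

w, s = [-15.0, 20.0], [0.0, 0.0]
for _ in range(2):
    g = grad(w)
    s = [beta * si + (1 - beta) * gi * gi for si, gi in zip(s, g)]
    w = [wi - eta * gi / math.sqrt(si + eps)
         for wi, gi, si in zip(w, g, s)]

print([round(wi, 2) for wi in w])  # → [-14.73, 19.73]
```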

This situation is also visualized in the following figure.

The picture here is quite different from the momentum optimizer example. The difference is not in how far the vectors move (this would be hard to compare anyway due to the different learning rates) but rather how they move. With adaptive learning rates, we move more to the right and less to the bottom than with classical gradient descent.

Trajectories

Similar to what we did in momentum optimization, we want to know how the sequence from the example proceeds further. For this, you can play around in the following animation. It is the same as in the previous article except that the update sequence now uses adaptive learning rates.

We can see that with adaptive learning rates the trajectory is very different from classical gradient descent. Instead of moving down the steepest hill and then walking along the horizontal direction of the valley, the adaptive route points more directly towards the optimum. In this case, this results in a shorter path.

This is an effect of the general principle that the adaptive scheme tends to focus more on the direction than on the magnitude of the gradient. In momentum optimization, on the other hand, we mainly focused on improving the magnitude. We even accepted small divergences from the optimal path for larger step sizes (cf. the oscillations of an $$\alpha = 0.9$$ path). The opposite is the case with adaptive learning rates.

The reason behind this lies in the scaling by the $$\fvec{s}$$ vector. On the one hand, when the previous gradients were high, the scaling factor $$\sqrt{\fvec{s} + \varepsilon} > 1$$ reduces the update step. On the other hand, with small previous gradients, the scaling factor $$\sqrt{\fvec{s} + \varepsilon} < 1$$ increases the update step. This happens for each vector component individually.

The scaling has the effect that the magnitude of the gradient becomes less important as the update vectors $$\fvec{v}(t)$$ are more forced to stay in “reasonable ranges”. This design was chosen with the intent to overcome the aforementioned problems of extremely small or large gradients. However, it is to note that the sign of the gradient is unaffected by this scaling which ensures that the update vector $$\fvec{v}(t)$$ points always in the same direction as the gradient $$\nabla E(t)$$.

In our example, this means that the vertical vector component decreases more since the gradients are higher in this direction (this is the direction with the steepest decline). As a result, the update vectors $$\fvec{v}$$ point more to the right than with classical gradient descent. In the case of elongated error functions, this helps to point more directly at the optimum.

When we compare a $$\beta = 0$$ with a $$\beta = 0.9$$ trajectory, we see that the former is shaped more angularly and the latter is smoother. This is the case because the $$\beta = 0.9$$ path does involve, at least to some extent, the magnitude of the gradients in the update decisions so that effectively a broader range of orientations is incorporated in the path. With $$\beta = 0$$, however, we mainly restrict the path to the eight cardinal and intercardinal directions, e.g. in this case first to the south-east ($$\fvec{v} \approx (1, -1)$$) and then to the south ($$\fvec{v} \approx (0, -1)$$).

The decaying parameter $$\beta$$

We discussed that with adaptive learning rates the focus is more on the direction than on the magnitude of the gradients. A valid question may be whether we have control over this focus. It turns out that we can use the hyperparameter $$\beta$$ for this purpose and this is what we want to discuss now.

Let us begin with the extreme cases. When we set $$\beta = 0$$, we do not use information from previous gradients and use only the current gradient instead. This is essentially what happened in the first update step of the numerical example (\eqref{eq:AdaptiveLearning_AdaptiveStep1}) where both vector components were updated by roughly the same value. As we saw, this cancelled the magnitude of the gradient nearly completely out, giving a strong importance to the sign. This is a way of expressing an exclusive focus on the direction.

Setting $$\beta = 1$$ does not use the current gradient in the scaling vector $$\fvec{s}$$. Instead, only the initial value $$\fvec{s}(0)$$ is used. In the case of zero-initialization, this leaves $$\sqrt{\varepsilon}$$, i.e. a very small number, in the denominator so that an increase of the magnitude $$\left| \nabla E \left( \fvec{w}_a(1) \right) \right|$$ is likely. In a way, this is an extreme focus on the magnitude. However, it is doubtful that this is a useful case as it contradicts the goal of keeping the update vectors in “reasonable ranges”.

For values in-between, i.e. $$\beta \in \; ]0;1[$$, we have the choice of focusing more on the direction (smaller values of $$\beta$$) or more on the magnitude (larger values of $$\beta$$). For the latter, note that the upper limit is not the original magnitude. Rather, the magnitude may even be amplified, as in the case of the extreme value $$\beta = 1$$.

We now want to analyse how the update vectors $$\fvec{v}$$ of the adaptive learning scheme are different from the gradients $$\nabla E$$ used directly in classical gradient descent. For this, we use a one-dimensional example, consider two update iterations and interpret the gradients $$g_t = \nabla E \left( w(t-1) \right)$$ as variables. We can then set up the scaling values

\begin{align*} s(1) &= (1-\beta) \cdot g_1^2 \\ s(2) &= \beta \cdot s(1) + (1-\beta) \cdot g_2^2 = \beta \cdot \left( (1-\beta) \cdot g_1^2 \right) + (1-\beta) \cdot g_2^2 \\ &= (1-\beta) \cdot (\beta \cdot g_1^2 + g_2^2). \end{align*}

We are interested in the interplay between the first two gradients so we consider only the update value of the second iteration

\begin{equation*} v(2) = \eta \cdot \frac{g_2}{\sqrt{s(2) + \varepsilon}} = \eta \cdot \frac{g_2}{\sqrt{(1-\beta) \cdot (\beta \cdot g_1^2 + g_2^2) + \varepsilon}}. \end{equation*}

The question now is in which way the update value $$v(2)$$ is different from the gradient $$g_2$$, i.e. how we move differently with adaptive learning rates than without. More precisely, we want to know if the original gradient is amplified or attenuated. The idea is to measure this as a simple difference between the two update schemes:

\begin{equation} \label{eq:AdaptiveLearning_AdaptiveDifference} C_a(g_1, g_2) = |v(2)| - |\eta \cdot g_2| = \begin{cases} v(2) - \eta \cdot g_2, & g_2 \geq 0 \\ \eta \cdot g_2 - v(2), & g_2 < 0 \\ \end{cases} \end{equation}

The measure $$C_a(g_1, g_2)$$ informs us about whether we move further in the adaptive learning scheme compared to classical gradient descent. This is, again, a sign-dependent decision as moving further means more to the right of $$g_2$$ when $$g_2$$ is already positive and more to the left of $$g_2$$ when $$g_2$$ is already negative.

When we interpret the sign and the magnitude of $$C_a(g_1, g_2)$$, we can distinguish between the following cases:

1. For $$C_a(g_1, g_2) > 0$$, the magnitude of the gradient $$|g_2|$$ is amplified by $$\left| C_a(g_1, g_2) \right|$$. The update step is larger with adaptive learning rates than without.
2. For $$C_a(g_1, g_2) < 0$$, the magnitude of the gradient $$|g_2|$$ is attenuated by $$\left| C_a(g_1, g_2) \right|$$. The update step is smaller with adaptive learning rates than without.
3. For $$C_a(g_1, g_2) = 0$$, the magnitude of the gradient $$|g_2|$$ remains unchanged. The update step is the same with both approaches.
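These cases can be checked with a small numerical sketch (plain Python; the function name and the chosen values for $$\beta$$, $$\eta$$ and $$\varepsilon$$ are illustrative) which compares the magnitude of the adaptive update value $$v(2)$$ with the magnitude of the classical step $$\eta \cdot g_2$$:

```python
import math

def adaptive_difference(g1, g2, beta=0.9, eta=0.01, eps=1e-8):
    """Difference between the adaptive step size and the classical
    step size eta * g2 after two iterations (positive: amplification,
    negative: attenuation)."""
    # Exponentially weighted accumulation of the squared gradients
    s1 = (1 - beta) * g1**2
    s2 = beta * s1 + (1 - beta) * g2**2
    # Second update value of the adaptive learning scheme
    v2 = eta * g2 / math.sqrt(s2 + eps)
    return abs(v2) - abs(eta * g2)

# Two small gradients: the update step is amplified (first case)
print(adaptive_difference(0.01, 0.01) > 0)  # True
# Two large gradients: the update step is attenuated (second case)
print(adaptive_difference(5.0, 5.0) < 0)    # True
```

Varying `beta` in this sketch mirrors the shift of focus between direction and magnitude discussed above.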

The following figure visualizes \eqref{eq:AdaptiveLearning_AdaptiveDifference} for different values of the hyperparameter $$\beta$$.

The red regions show where the gradient $$g_2$$ is amplified (first case). We see that this happens especially in areas where both gradients are small. This helps us leave a plateau of the error landscape more quickly and also alleviates the problem of vanishing gradients.

In the blue regions, the gradient $$g_2$$ gets smaller. This corresponds to the second case. We see that these regions are most visible at the borders, i.e. when either $$g_1$$ or $$g_2$$ is large (or both). This makes sure that the gradients do not become too large, so that steep hills in the error landscape pose less of a risk of overstepping local minima. Smaller gradients also alleviate the problem of exploding gradients.

When we compare the results for different $$\beta$$-values, we can see how the focus changes from the direction to the magnitude of the gradients. The extreme value $$\beta = 0$$ reveals a nearly linear course of \eqref{eq:AdaptiveLearning_AdaptiveDifference} in $$g_2$$ since the magnitude of the gradient is almost completely neglected, setting the update essentially to $$\left| v(2) \right| \approx \eta$$.

Conclusion

This was also the story about adaptive learning rates. The global learning rate $$\eta$$ does not have the power to suit the needs of every weight. Hence, using an individual learning rate (via the scaling vector $$\fvec{s}$$) per weight is a powerful idea which is also an important aspect of the Adam optimizer (covered in the next article). It is also convenient that the scaling factors are calculated automatically. But, to be fair, we do not have much of a choice either, since there is no practical way to manually adjust the learning rates for each weight in a larger network.

Another advantage is that the global learning rate $$\eta$$ becomes a less critical hyperparameter to adjust. Even though it still has an influence (otherwise, we could remove it), the effects are less drastic. This also means that it can cause less damage when it is set incorrectly.

1. There is, for example, a technique named SuperSAB introduced in 1990 by Tom Tollenaere which uses an individual and adaptable learning rate for each weight.
2. The subscript $$_a$$ is used for variables of the adaptive learning scheme.
3. This is powered by the math.js parser so you can even set the error function to something like sin(w1) + cos(w2).
4. However, due to a different reason, namely the $$\fvec{s}(0) = \fvec{0}$$ initialization. But the effect is very similar.
5. There is a similar approach named resilient propagation described by Riedmiller and Braun in 1993 which uses only the sign of the gradient and weight-dependent learning rates in the update process.

Introduction to neural network optimizers [part 1] – momentum optimization

A neural network is a model with a lot of parameters which are used to derive an output based on an input. In the learning process, we show the network a series of example inputs with associated outputs so that it can adapt its parameters (weights) according to a defined error function. Since these error functions are far too complex, we cannot simply determine an explicit formula for the optimal parameters. The usual approach to tackle this problem is to start with a random initialization of the weights and use gradient descent to iteratively find a local minimum in the error landscape. That is, we adapt the weights over multiple iterations according to the gradients until the value of the error function is sufficiently low.

However, using this approach without modifications (classical gradient descent) has some disadvantages as it can be slow, may end up in a suboptimal local minimum and requires a careful choice of the learning rate. Hence, multiple approaches have been developed over the years to improve classical gradient descent and address a few of the problems.

This is the first part of a series consisting of three articles with the goal to introduce some general concepts and concrete algorithms in the field of neural network optimizers. We start in this article with the momentum optimizer which tries to improve convergence by speeding up when we keep moving in the same direction. The second part introduces the concept of individual learning rates for each weight in the form of an adaptive learning scheme which scales according to the accumulation of past gradients. Finally, the third part introduces the Adam optimizer which re-uses some of the ideas discussed in the first two parts and also tries to target better local minima. In summary:

1. Part 1: momentum optimizer
2. Part 2: adaptive learning rates
3. Part 3: Adam optimizer

The basic idea behind momentum optimization

In classical gradient descent, we perform each update step solely on the basis of the current gradient. In particular, no information about the previous gradients is included in the update step. However, this might not be very beneficial as the previous gradients can provide us with information to speed up convergence. This is where the momentum optimizer comes into play1.

The general idea is to speed up learning when we move in the same direction over multiple iterations and we slow down when the direction changes. With “direction” we mean the sign of the gradient components as we update each weight individually. For example, if we had moved to the right in the first two iterations (positive gradient components), we could assume that our optimum is somewhere at the right and move even a bit further in this direction. Of course, this decision is only possible when we still have the information about the gradient of the first iteration.

There is an analogy which is often used: imagine we place a ball at a hill and it rolls down into a valley. We use gradient descent to update the position of the ball based on the current slopes. If we use classical gradient descent, it is like placing the ball each time on the new position completely independent of the previous trajectory. With momentum optimization, however, the ball accelerates in speed as it builds momentum2. This allows the ball to reach the valley much faster than with classical gradient descent.

There is one problem with the current approach, though: if we accelerate without control as long as we move in the same direction, we might end up with huge update steps predestined to overshoot local minima. Hence, we need some kind of friction mechanism to act as a counterforce which slows down the ball again.

In the following, we introduce the concepts behind momentum optimization. We begin with the mathematical formulation where we also compare it with classical gradient descent. We then apply the formulas on a small numerical example to understand their basic functionality and look at different trajectories. Next, we analyse the speed improvements of momentum optimization from an exemplary and theoretical point of view. Lastly, we discuss the two basic concepts in momentum optimization, namely acceleration and deceleration, in more detail.

Mathematical formulation

The basic algorithm to iteratively find a minimum of a function is (classical) gradient descent. It uses, as the name suggests, the gradients $$\nabla E$$ of the error function $$E\left(\fvec{w}\right)$$ as the main ingredient.

Definition 1: Classical gradient descent

Let $$\fvec{w} = (w_1, w_2, \ldots, w_n) \in \mathbb{R}^n$$ be a vector of weights (e.g. of a neural network) initialized randomly at $$\fvec{w}(0)$$, $$\eta \in \mathbb{R}^+$$ a global learning rate and $$t \in \mathbb{N}^+$$ the iteration number. Then, classical gradient descent uses the update rule

\begin{equation} \label{eq:MomentumOptimizer_ClassicalGradientDescent} \fvec{w}(t) = \fvec{w}(t-1) - \eta \cdot \nabla E\left( \fvec{w}(t-1) \right) \end{equation}

to find a path from the initial position $$\fvec{w}(0)$$ to a local minimum of the error function $$E\left(\fvec{w}\right)$$ based on the gradients $$\nabla E$$.
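As a minimal sketch (plain Python; all names are illustrative), the update rule \eqref{eq:MomentumOptimizer_ClassicalGradientDescent} could be implemented like this, here demonstrated with a simple quadratic error function:

```python
def gradient_descent(w0, grad, eta, iterations):
    """Classical gradient descent: w(t) = w(t-1) - eta * grad(w(t-1))."""
    w = list(w0)
    for _ in range(iterations):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Gradient of the toy error function E(w) = 3*w1^2 + 10*w2^2
grad = lambda w: [6 * w[0], 20 * w[1]]
print(gradient_descent([-15.0, 20.0], grad, eta=0.001, iterations=1))
# ≈ [-14.91, 19.6]
```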

That is, we move in each iteration in the direction of the current gradient vector (the gradient is subtracted since we want to reach a minimum and not a maximum of $$E(\fvec{w})$$ in gradient descent). In momentum optimization, we move in the direction of the momentum vector $$\fvec{m}$$ instead.

Definition 2: Momentum optimizer

Additionally to the variables used in classical gradient descent, let $$\fvec{m} = (m_1, m_2, \ldots, m_n) \in \mathbb{R}^n$$ be the momentum vector of the same length as the weights $$\fvec{w}$$ which is initialized to zero, i.e. $$\fvec{m}(0) = \fvec{0}$$, and $$\alpha \in [0;1]$$ the momentum parameter (friction parameter). Then, the momentum optimizer defines the update rules

\begin{align} \begin{split} \fvec{m}(t) &= \alpha \cdot \fvec{m}(t-1) + \nabla E\left( \fvec{w}(t-1) \right) \\ \fvec{w}(t) &= \fvec{w}(t-1) - \eta \cdot \fvec{m}(t) \end{split} \label{eq:MomentumOptimizer_Momentum} \end{align}

to find a path from the initial position $$\fvec{w}(0)$$ to a local minimum of the error function $$E\left(\fvec{w}\right)$$.

These equations summarize the former ideas: we keep track of previous gradients by adding them to the momentum vector, we subtract the momentum vector $$\fvec{m}$$ which includes information about the complete previous trajectory and we have a friction mechanism in terms of the momentum parameter $$\alpha$$.

If we set $$\alpha = 0$$, we do not consider previous gradients and end up with classical gradient descent, i.e. \eqref{eq:MomentumOptimizer_ClassicalGradientDescent} and \eqref{eq:MomentumOptimizer_Momentum} produce the same sequence of weights. In the other extreme, $$\alpha = 1$$, we have no friction at all and accumulate every previous gradient. This is not really a useful setting as the updates tend to get way too large so that we almost certainly overshoot every local minimum. For values in-between, i.e. $$\alpha \in \; ]0;1[$$, we can control the amount of friction we want to have with lower values of $$\alpha$$ meaning a higher amount of friction as more of the previous gradients are discarded. We are comparing different values for this hyperparameter3 later.
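The two update rules can be sketched analogously (plain Python, illustrative names). Setting $$\alpha = 0$$ indeed reproduces the weights of classical gradient descent, as argued above:

```python
def momentum_descent(w0, grad, eta, alpha, iterations):
    """Momentum optimizer:
    m(t) = alpha * m(t-1) + grad(w(t-1))
    w(t) = w(t-1) - eta * m(t)
    """
    w = list(w0)
    m = [0.0] * len(w)  # zero-initialized momentum vector m(0)
    for _ in range(iterations):
        g = grad(w)
        m = [alpha * mi + gi for mi, gi in zip(m, g)]
        w = [wi - eta * mi for wi, mi in zip(w, m)]
    return w

# With alpha = 0, every step uses only the current gradient
grad = lambda w: [6 * w[0], 20 * w[1]]
print(momentum_descent([-15.0, 20.0], grad, eta=0.001, alpha=0.0, iterations=2))
# ≈ [-14.82, 19.21] (same as classical gradient descent)
```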

Numerical example

First, let us look at a numerical example which compares classical gradient descent with momentum optimization. For this, suppose that we have two weights $$w_1$$ and $$w_2$$ used in the (fictional) error function

\begin{equation} \label{eq:MomentumOptimizer_ExampleFunction} E(\fvec{w}) = 3 \cdot w_1^2 + 10 \cdot w_2^2. \end{equation}

We are starting in our error landscape at the position $$\fvec{w}(0) = (-15,20)$$, using a learning rate of $$\eta = 0.001$$ and want to apply two updates to the weights $$\fvec{w}$$. We first need to calculate the gradients of our error function

\begin{equation} \label{eq:MomentumOptimizer_ExampleFunctionGradient} \nabla E\left( \fvec{w} \right) = \cvec{\xfrac{\partial E}{\partial w_1} \\ \xfrac{\partial E}{\partial w_2}} = \cvec{6 \cdot w_1 \\ 20 \cdot w_2} \end{equation}

and evaluate it at the initial position

\begin{equation*} \nabla E \left( \fvec{w}(0) \right) = \cvec{6 \cdot (-15) \\ 20 \cdot 20} = \cvec{-90 \\ 400}. \end{equation*}

We can then start with classical gradient descent and apply the first update4

\begin{equation} \label{eq:MomentumOptimizer_W1Classic} \fvec{w}_c(1) = \fvec{w}(0) - \eta \cdot \nabla E\left( \fvec{w}(0) \right) = \cvec{-15 \\ 20} - 0.001 \cdot \cvec{-90 \\ 400} = \cvec{-14.91 \\ 19.6}. \end{equation}

The new weight position $$\fvec{w}_c(1)$$ is the basis for the evaluation of the next gradient

\begin{equation*} \nabla E \left( \fvec{w}_c(1) \right) = \cvec{6 \cdot (-14.91) \\ 20 \cdot 19.6} = \cvec{-89.46 \\ 392} \end{equation*}

required for the second update (results are rounded to two decimal places)

\begin{equation} \label{eq:MomentumOptimizer_W2Classic} \fvec{w}_c(2) = \fvec{w}_c(1) - \eta \cdot \nabla E\left( \fvec{w}_c(1) \right) = \cvec{-14.91 \\ 19.6} - 0.001 \cdot \cvec{-89.46 \\ 392} \approx \cvec{-14.82 \\ 19.21}. \end{equation}

We now repeat the same process but with momentum optimization enabled and the momentum parameter set to $$\alpha = 0.9$$. Since the momentum vector is initialized to $$\fvec{m}(0) = \fvec{0}$$, the first update is the same as before

\begin{align*} \fvec{m}(1) &= \alpha \cdot \fvec{m}(0) + \nabla E\left( \fvec{w}(0) \right) = 0.9 \cdot \fvec{0} + \cvec{-90 \\ 400} = \cvec{-90 \\ 400} \\ \fvec{w}_m(1) &= \fvec{w}(0) - \eta \cdot \fvec{m}(1) = \cvec{-15 \\ 20} - 0.001 \cdot \cvec{-90 \\ 400} = \cvec{-14.91 \\ 19.6}. \end{align*}

In the second update, however, the momentum optimizer leverages the fact that we move again in the same direction ($$w_1$$ to the right and $$w_2$$ downwards) and accelerates (the gradient is the same as before, i.e. $$\nabla E \left( \fvec{w}_m(1) \right) = \nabla E \left( \fvec{w}_c(1) \right)$$)

\begin{align*} \fvec{m}(2) &= \alpha \cdot \fvec{m}(1) + \nabla E\left( \fvec{w}_m(1) \right) = 0.9 \cdot \cvec{-90 \\ 400} + \cvec{-89.46 \\ 392} = \cvec{-170.46 \\ 752} \\ \fvec{w}_m(2) &= \fvec{w}_m(1) - \eta \cdot \fvec{m}(2) = \cvec{-14.91 \\ 19.6} - 0.001 \cdot \cvec{-170.46 \\ 752} \approx \cvec{-14.74 \\ 18.85}. \end{align*}

This situation is also visualized in the following figure:

We can see that the position with momentum $$\fvec{w}_m(2)$$ is a fair way ahead of the position $$\fvec{w}_c(2)$$ reached via classical gradient descent. That is, the first weight moved more to the right ($$w_{1,m}(2) > w_{1,c}(2)$$) and the second weight more to the left ($$w_{2,m}(2) < w_{2,c}(2)$$).
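The two updates of this example can also be reproduced in a few lines of code (plain Python; variable names chosen to match the text):

```python
eta, alpha = 0.001, 0.9
grad = lambda w: (6 * w[0], 20 * w[1])  # gradient of E(w) = 3*w1^2 + 10*w2^2

# Classical gradient descent: two updates starting at w(0) = (-15, 20)
w_c = (-15.0, 20.0)
for _ in range(2):
    g = grad(w_c)
    w_c = (w_c[0] - eta * g[0], w_c[1] - eta * g[1])

# Momentum optimization: two updates from the same starting position
w_m, m = (-15.0, 20.0), (0.0, 0.0)
for _ in range(2):
    g = grad(w_m)
    m = (alpha * m[0] + g[0], alpha * m[1] + g[1])
    w_m = (w_m[0] - eta * m[0], w_m[1] - eta * m[1])

print(w_c)  # ≈ (-14.82, 19.21)
print(w_m)  # ≈ (-14.74, 18.85)
```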

Trajectories

So, how do the two approaches proceed further on their way to the minimum? To answer this question, you can play around in the following animation. With the default error function and without considering extreme settings, all trajectories should find their way to the local (and in this case also global) minimum of $$\fvec{w}^* = (0, 0)$$.

If we start with classical gradient descent ($$\alpha = 0$$) and then increase the momentum parameter $$\alpha$$, we see how we get closer to the minimum in the same number of iterations. However, setting the parameter too high can lead to negative effects. For example, the $$\alpha = 0.9$$ path first moves in the wrong direction (down) before it turns back to the minimum. This is a consequence of too little friction. The accumulation from the first few gradients is still very high so that the step sizes are very large (enable the $$m_2$$ trace to make this effect more prominent). It takes some iterations before the friction can do its job and decrease the accumulation so that the path can turn and move in the correct direction again. This happens once more when the path is already near the local minimum (zoom in to see this better).

Generally speaking, when we set the momentum parameter $$\alpha$$ too high, it is likely that the path oscillates around an optimum. It may overshoot the minimum several times before the accumulated values have decreased enough.

Of course, instead of adjusting the momentum parameter $$\alpha$$, we could also tune the learning rate $$\eta$$. Increasing it can also help to reach the minimum faster and it can lead to similar oscillation problems6. However, $$\eta$$ is a global setting and influences all step sizes independently of the current location in the error surface. The momentum parameter $$\alpha$$, on the other hand, specifies how much we take from the currently accumulated momentum and this depends on all previous gradients. What is more, a good setting for the learning rate $$\eta$$ also helps in momentum optimization.

Note that in neural networks, which usually have many more parameters than just two, it is not easily possible to visualize the trajectory as is done here. One has to rely on other measures, like the value of the error function $$E(\fvec{w})$$, instead.

Speed improvements

The goal of momentum optimization is to converge faster to a local optimum and we can see that this is indeed the case when we keep track of the error course. To make this comparison even clearer, we are elaborating on this point a bit further in this section.

One note before we proceed, though: the results here serve only as a general hint and do not necessarily represent realistic speed improvements. That is because we use \eqref{eq:MomentumOptimizer_ExampleFunction} as error function which is only a toy example. Error surfaces from the real world are usually more complex so that optimizers may behave differently.

However, the nice thing about this error function is that it has only one minimum so that any reduction of the error value means that we come closer to the same optimum. This allows us to compare the value of the error function in each step $$t$$. This is done in the following figure for three different trajectories with different values for the momentum parameter $$\alpha$$. The lower the value on the $$y$$-axis, the closer is the path to the minimum and the earlier this happens, the better.

Figure 3: Comparison of the convergence speed of three trajectories using different values for the momentum parameter $$\alpha$$ over the course of 700 iterations. The other parameters are set to the initial values of the previous figure. The $$y$$-axis is not shown in its full range to restrict the comparison to its relevant parts.

We can clearly see that with classical gradient descent ($$\alpha = 0$$) we approach the optimum most slowly compared to the other curves. Using the momentum optimizer helps here in both cases, with $$\alpha = 0.9$$ being even a bit faster than $$\alpha = 0.6$$. However, we also see the problems of the $$\alpha = 0.9$$ curve, which has a small bump at around $$t = 20$$. This is when the corresponding trajectory moves too far to the south of the error function before turning to the optimum.

Theoretical speedup

Regarding the speed improvements, there is also a theoretical argument we want to stress here. It concerns a possible theoretical speedup with momentum optimization. We use a one-dimensional example and assume that the gradient $$g$$ stays constant during all iterations. We can then calculate the sequence of momentum updates

\begin{align*} m(0) &= 0 \\ m(1) &= \alpha \cdot m(0) + g = g \\ m(2) &= \alpha \cdot m(1) + g = \alpha \cdot g + g \\ m(3) &= \alpha \cdot m(2) + g = \alpha \cdot (\alpha \cdot g + g) + g = \alpha^2 \cdot g + \alpha \cdot g + g \\ &\vdots \\ m(t) &= \sum_{i=0}^{t-1} \alpha^i \cdot g \Rightarrow \lim\limits_{t \rightarrow \infty} g \sum_{i=0}^{t-1} \alpha^i = g \frac{1}{1-\alpha}. \end{align*}

In the last step, we assumed $$|\alpha| < 1$$ and used the limit of the geometric series to simplify the formula.

As we can see, the influence of previous gradients decreases exponentially. In total, momentum optimization effectively amplifies the gradient by a factor depending on the momentum parameter $$\alpha$$. For example, if we use $$\alpha = 0.9$$, we get a speedup of $$\xfrac{1}{0.1} = 10$$ in the limit. That is, the weight updates become up to 10 times larger than without momentum optimization.

Note that this is only a theoretical perspective and the simplification of constant gradients is not very realistic. After all, in this case we could just increase the learning rate over the iterations to achieve a similar effect. Still, it gives a general idea of the convergence improvements of momentum optimization and of the role the hyperparameter $$\alpha$$ plays.
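The limit $$g / (1 - \alpha)$$ can be verified numerically with a small sketch (plain Python, illustrative names) that feeds a constant gradient into the momentum update:

```python
def momentum_limit(alpha, g, iterations=1000):
    """Apply the momentum update with a constant gradient g and
    return the final momentum value."""
    m = 0.0
    for _ in range(iterations):
        m = alpha * m + g
    return m

# For alpha = 0.9, the momentum approaches g / (1 - 0.9) = 10 * g
print(momentum_limit(0.9, 1.0))  # ≈ 10.0
```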

Two basic cases

So far, we have focused mainly on the acceleration functionality of momentum optimization. As we saw, we accelerate when we move in the same direction as in the previous iteration7. That is, the magnitude of the momentum vector increases, i.e. $$\left\| \fvec{m}(t) \right\| > \left\| \fvec{m}(t-1) \right\|$$. On the other hand, we decelerate when we change directions. In this case, the magnitude of the momentum vector decreases, i.e. $$\left\| \fvec{m}(t) \right\| < \left\| \fvec{m}(t-1) \right\|$$. These are two fundamental cases in the momentum optimizer. We now want to discuss in more depth when and to what extent these cases apply.

To make things simpler, we use, again, one dimension so that only one weight and, more importantly, one gradient per iteration remains. We consider two iterations of the momentum update and interpret the gradients as variables, $$g_1 = \nabla E \left( w(0) \right)$$ for the first and $$g_2 = \nabla E \left( w(1) \right)$$ for the second iteration:

\begin{align} \begin{split} m(1) &= g_1 \\ m(2) &= \alpha \cdot g_1 + g_2. \end{split} \label{eq:MomentumOptimizer_CasesUpdates} \end{align}

The idea is to measure the acceleration and deceleration between the two momentum updates as a function of the gradients $$g_i$$. We can do so by simply calculating the difference between $$m(1)$$ and $$m(2)$$ to see how much the momentum changes. However, since “the correct direction” (positive to the right, negative to the left) depends on the sign of the first gradient $$g_1$$, we need to handle these cases separately:

\begin{equation} \label{eq:MomentumOptimizer_Cases} C_m(g_1, g_2) = \begin{cases} m(2) - m(1), & g_1 \geq 0 \\ m(1) - m(2), & g_1 < 0 \\ \end{cases} = \begin{cases} (\alpha \cdot g_1 + g_2) - g_1, & g_1 \geq 0 \\ g_1 - (\alpha \cdot g_1 + g_2), & g_1 < 0 \\ \end{cases} \end{equation}

For the measure $$C_m(g_1, g_2)$$, we can interpret the sign and the magnitude leading to the following cases:

1. For $$C_m(g_1, g_2) > 0$$, the gradient $$g_2$$ points in the same direction as $$g_1$$ and the magnitude $$\left| m \right|$$ increases by $$\left|C_m(g_1, g_2)\right|$$.
2. For $$C_m(g_1, g_2) < 0$$, the gradient $$g_2$$ points in the opposite direction as $$g_1$$ and the magnitude $$\left| m \right|$$ decreases by $$\left|C_m(g_1, g_2)\right|$$.
3. For $$C_m(g_1, g_2) = 0$$, the magnitude $$\left| m \right|$$ remains unchanged. This is not really an interesting case and only listed for the sake of completeness.

Note that the converse of the above statements is not true in general. For example, if the gradient $$g_2$$ points in the same direction as $$g_1$$, it does not necessarily mean that the magnitude of the momentum increases. Let $$g_1 = 2, g_2 = 1$$ and $$\alpha = 0.4$$ to make this point clear:

\begin{align*} \begin{split} m(1) &= 2 \\ m(2) &= 0.4 \cdot 2 + 1 = 1.8. \end{split} \end{align*}

Both gradients point in the same direction but the momentum still changes by $$m(2) - m(1) = -0.2$$, i.e. it gets smaller. This is due to the friction parameter $$\alpha$$ and the second gradient being too small to account for the friction loss $$0.6 \cdot 2 = 1.2$$.
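The measure \eqref{eq:MomentumOptimizer_Cases} and the counterexample above can be written as a short check (plain Python, function name illustrative):

```python
def momentum_change(g1, g2, alpha):
    """C_m(g1, g2): signed change of the momentum magnitude
    between the first and the second update."""
    m1 = g1
    m2 = alpha * g1 + g2
    # The sign of g1 determines which direction counts as "forward"
    return (m2 - m1) if g1 >= 0 else (m1 - m2)

# Same direction, but g2 cannot make up for the friction loss 0.6 * 2 = 1.2
print(momentum_change(2.0, 1.0, alpha=0.4))  # ≈ -0.2 (momentum decreases)
```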

The following figure visualizes \eqref{eq:MomentumOptimizer_Cases} graphically. The red regions correspond to the first case and the blue regions to the second case. Additionally, we see by the intensity of the colour how much the momentum changes. We are building momentum in the bottom-left and top-right quarters, i.e. when the gradients share the same sign and the second gradient $$g_2$$ can account for the friction loss.

The friction parameter $$\alpha$$ influences the slope of the contour lines. It is highest for $$\alpha = 0$$ where the previous gradient is ignored completely and the momentum term only increases when the new gradient is larger, i.e. $$g_2 > g_1$$. This is not surprising as this case reduces to classical gradient descent where we do not have a momentum term. The slope is lowest (more precisely: zero) for $$\alpha = 1$$. This is the case where we do not have any friction at all and sum up the full magnitudes of all previous gradients. Hence, when the second gradient has e.g. a value of $$g_2 = 1$$, the momentum term increases by this value: $$m(2) - m(1) = 1$$.

Conclusion

This was the story about momentum optimization. It introduces an important idea to speed up convergence in neural network optimization, namely that we can be smarter when we keep moving in the same direction over iteration time. However, it should not be used without caution since the friction parameter $$\alpha$$ highly influences our success. Set it too low and we may learn more slowly than we could. Set it too high and we may lose control over our trajectory in the error landscape. This is especially hard to catch for higher-dimensional error functions (usually the case in neural networks).

Even though the concept of momentum optimization has its right to exist, I have to admit that I can't remember ever using it on its own. There are usually enough (critical) hyperparameters in a network so that adjusting the friction parameter $$\alpha$$ can be a bit scary. What is more, other optimization techniques have been developed which work reasonably well and have less critical hyperparameters.

A popular example is the Adam optimizer which reuses the idea of momentum even though in a different form and with different effects. We are also covering the Adam optimizer in this series about neural network optimization (third part). But before we are taking a look at this technique, we should first learn about a new concept: adaptive learning rates. This is also very important for the Adam optimizer.

List of attached files:

• MomentumOptimizer.nb [PDF] (Mathematica notebook with some basic computations and visualizations used to write this article)

1. The following description and notation follow loosely the style of the book Hands-On Machine Learning with Scikit-Learn and TensorFlow from Aurélien Géron (pages 295–296). The method itself was originally proposed by Boris Polyak in 1964 in his work Some methods of speeding up the convergence of iteration methods.
2. to build momentum is an idiom meaning that something, once started, keeps going on by itself, e.g. like an idea which spreads around and develops further even after initiation. It is related to the physical meaning of momentum as the amount of motion an object with mass has (defined as the product of mass and velocity). This explains the origin of this optimization technique's name.
3. The term hyperparameter comprises all configuration options of a model in distinction to its learnable parameters (e.g. the weights of a neural network).
4. $$\fvec{w_c}$$ denotes the weights when using classical gradient descent and $$\fvec{w}_m$$ when momentum optimization is used.
5. This is powered by the math.js parser so you can even set the error function to something like sin(w1) + cos(w2).
6. This can also be seen in this example based on a simple neural network.
7. Strictly speaking, it is also necessary that we move sufficiently enough in the same direction due to the friction parameter $$\alpha$$. If the new update step is too small and does not account for the friction loss, the momentum decreases as well (even though the sign of both updates was the same). We are discussing this aspect in more detail in a moment.