SVRG Algorithm and Convergence Analysis

SVRG Algorithm

Let $f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$ be the objective function we want to minimize. The Stochastic Variance Reduced Gradient (SVRG) algorithm works as follows:

  • Parameters: update frequency $m$, learning rate $\eta$
  • Initialize: $\tilde{w}_0$
  • Iterate: for $s = 1, 2, \ldots$
    • Set $\tilde{w} = \tilde{w}_{s-1}$, compute $\tilde{\mu} = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\tilde{w})$, and set $w_0 = \tilde{w}$
    • Iterate: for $t = 1, 2, \ldots, m$
      • Sample $i_t$ uniformly at random from $\{1, \ldots, n\}$ and set $w_t = w_{t-1} - \eta\left(\nabla f_{i_t}(w_{t-1}) - \nabla f_{i_t}(\tilde{w}) + \tilde{\mu}\right)$
    • Option I: Set $\tilde{w}_s = w_m$
    • Option II: Set $\tilde{w}_s = w_t$ for randomly chosen $t \in \{0, 1, \ldots, m-1\}$
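The loop structure above can be sketched in NumPy. This is a minimal illustration on a toy least-squares problem of my own choosing (the problem instance, step size, and epoch count are assumptions for demonstration, not from the source), using Option I:

```python
import numpy as np

def svrg(grad_i, full_grad, w0, n, m, eta, epochs, rng):
    """SVRG with Option I: take the last inner iterate as the next snapshot."""
    w_tilde = w0.copy()
    for _ in range(epochs):
        mu_tilde = full_grad(w_tilde)          # full gradient at the snapshot
        w = w_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)                # sample i_t uniformly
            v = grad_i(w, i) - grad_i(w_tilde, i) + mu_tilde
            w = w - eta * v                    # variance-reduced step
        w_tilde = w                            # Option I
    return w_tilde

# Toy problem: f_i(w) = 0.5 * (a_i^T w - b_i)^2 (least squares)
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)

grad_i = lambda w, i: (A[i] @ w - b[i]) * A[i]
full_grad = lambda w: A.T @ (A @ w - b) / n

L = np.max(np.sum(A**2, axis=1))               # smoothness of the worst component
w = svrg(grad_i, full_grad, np.zeros(d), n, m=5 * n, eta=0.1 / L, epochs=50, rng=rng)
print(np.linalg.norm(full_grad(w)))            # near zero: w is close to the minimizer
```

Note that each epoch needs one full gradient pass ($n$ component gradients) plus two component gradients per inner step, i.e. $n + 2m$ evaluations in total.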

Convergence Analysis

We assume each $f_i$ is convex and $L$-smooth, and $f$ is $\mu$-strongly convex. Let $w_* = \arg\min_w f(w)$.

Variance Reduction Property

We analyze the variance of

$$v_t = \nabla f_{i_t}(w_{t-1}) - \nabla f_{i_t}(\tilde{w}) + \tilde{\mu},$$

which is an unbiased estimator of $\nabla f(w_{t-1})$, since $\mathbb{E}_{i_t}[v_t] = \nabla f(w_{t-1}) - \nabla f(\tilde{w}) + \tilde{\mu} = \nabla f(w_{t-1})$. We have

$$\mathbb{E}\,\|v_t\|^2 \le 4L\left[f(w_{t-1}) - f(w_*) + f(\tilde{w}) - f(w_*)\right].$$

On the other hand, for standard SGD, we have

$$\mathbb{E}\,\|\nabla f_{i_t}(w_{t-1})\|^2 \to \frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(w_*)\|^2 \quad \text{as } w_{t-1} \to w_*.$$

Importantly, when $w_{t-1} \to w_*$ and $\tilde{w} \to w_*$, the variance of $v_t$ goes to zero, while $\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(w_*)\|^2$ does not vanish since it is a constant reflecting the inherent gradient noise.
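This contrast is easy to observe numerically. The sketch below (an illustrative noisy least-squares setup of my own choosing, not from the source) compares the second moment of the plain SGD estimator with that of the SVRG estimator when both the iterate and the snapshot sit close to $w_*$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)  # noisy targets

# f_i(w) = 0.5 * (a_i^T w - b_i)^2; minimizer via least squares
w_star, *_ = np.linalg.lstsq(A, b, rcond=None)
grads = lambda w: (A @ w - b)[:, None] * A     # all n component gradients at w

w_tilde = w_star + 1e-3 * rng.standard_normal(d)  # snapshot near the optimum
w = w_star + 1e-3 * rng.standard_normal(d)        # current iterate near the optimum

g_w = grads(w)
g_snap = grads(w_tilde)
mu_tilde = g_snap.mean(axis=0)                 # full gradient at the snapshot

sgd_est = g_w                                  # plain SGD estimator, one row per i
svrg_est = g_w - g_snap + mu_tilde             # SVRG estimator v_t, one row per i

# second moments over the uniform choice of i (the full gradient here is ~0,
# so these are essentially the variances)
m_sgd = np.mean(np.sum(sgd_est**2, axis=1))
m_svrg = np.mean(np.sum(svrg_est**2, axis=1))
print(m_sgd, m_svrg)  # SVRG's second moment is orders of magnitude smaller
```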

Convergence Rate

We analyze the convergence of SVRG. Using the update rule $w_t = w_{t-1} - \eta v_t$, we have

$$\mathbb{E}\,\|w_t - w_*\|^2 \le \|w_{t-1} - w_*\|^2 - 2\eta\,(w_{t-1} - w_*)^\top \nabla f(w_{t-1}) + \eta^2\,\mathbb{E}\,\|v_t\|^2.$$

Thus, by convexity of $f$ and the variance bound above,

$$\mathbb{E}\,\|w_t - w_*\|^2 \le \|w_{t-1} - w_*\|^2 - 2\eta(1 - 2L\eta)\left[f(w_{t-1}) - f(w_*)\right] + 4L\eta^2\left[f(\tilde{w}) - f(w_*)\right].$$

Summing over $t = 1, \ldots, m$, taking expectations, and using Option II together with strong convexity (which gives $\|\tilde{w} - w_*\|^2 \le \frac{2}{\mu}\left[f(\tilde{w}) - f(w_*)\right]$),

$$2\eta(1 - 2L\eta)\,m\,\mathbb{E}\left[f(\tilde{w}_s) - f(w_*)\right] \le \left(\frac{2}{\mu} + 4L\eta^2 m\right)\left[f(\tilde{w}_{s-1}) - f(w_*)\right].$$

Let

$$\alpha = \frac{1}{\mu\eta(1 - 2L\eta)m} + \frac{2L\eta}{1 - 2L\eta}.$$

Then

$$\mathbb{E}\left[f(\tilde{w}_s) - f(w_*)\right] \le \alpha\,\mathbb{E}\left[f(\tilde{w}_{s-1}) - f(w_*)\right].$$

Thus, after $s$ epochs,

$$\mathbb{E}\left[f(\tilde{w}_s) - f(w_*)\right] \le \alpha^s\left[f(\tilde{w}_0) - f(w_*)\right].$$

When $\alpha < 1$, we achieve linear convergence.
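The geometric decrease of the optimality gap per epoch can be checked empirically. Below is a sketch on an illustrative least-squares instance with hyperparameters of my own choosing (and Option I for simplicity, though the analysis above uses Option II):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)
w_star, *_ = np.linalg.lstsq(A, b, rcond=None)

f = lambda w: 0.5 * np.mean((A @ w - b) ** 2)  # f = (1/n) sum_i 0.5*(a_i^T w - b_i)^2
L = np.max(np.sum(A**2, axis=1))
eta, m = 0.1 / L, 5 * n

w_tilde = np.zeros(d)
gaps = []
for s in range(8):
    mu_tilde = A.T @ (A @ w_tilde - b) / n     # snapshot full gradient
    w = w_tilde.copy()
    for _ in range(m):
        i = rng.integers(n)
        r = A[i] @ w - b[i]
        r_snap = A[i] @ w_tilde - b[i]
        w = w - eta * ((r - r_snap) * A[i] + mu_tilde)
    w_tilde = w                                # Option I
    gaps.append(f(w_tilde) - f(w_star))

print(gaps)  # the gap shrinks by a roughly constant factor each epoch
```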

Comparing with GD

Let $T_{\mathrm{SVRG}}$ be the total number of component gradient evaluations SVRG needs to reach accuracy $\epsilon$. Each epoch costs $n + 2m$ evaluations, and $O(\log(1/\epsilon))$ epochs suffice when $\alpha < 1$ is a constant, so

$$T_{\mathrm{SVRG}} = O\big((n + m)\log(1/\epsilon)\big).$$

For GD, we have

$$T_{\mathrm{GD}} = O\big(n\kappa\log(1/\epsilon)\big),$$

where $\kappa = L/\mu$ is the condition number. Recall that

$$\alpha = \frac{1}{\mu\eta(1 - 2L\eta)m} + \frac{2L\eta}{1 - 2L\eta}.$$

By properly choosing $\eta$ and $m$, we can make $\alpha$ much smaller than $1$. For example, let $\eta = 0.1/L$ and $m = 50\kappa$; then $\alpha = 0.5$. When $n \gg \kappa$, we have $T_{\mathrm{SVRG}} = O(n\log(1/\epsilon))$ while $T_{\mathrm{GD}} = O(n\kappa\log(1/\epsilon))$. In this case, SVRG requires far fewer gradient evaluations to achieve the same accuracy as GD, and each of its epochs only requires a constant factor more gradient evaluations than a single GD iteration.
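The convergence factor is easy to evaluate numerically. A quick check of the standard choice $\eta = 0.1/L$ and $m = 50\kappa$ (the constants $L$ and $\mu$ below are hypothetical; only their ratio $\kappa$ matters for $\alpha$):

```python
# alpha = 1/(mu*eta*(1-2*L*eta)*m) + 2*L*eta/(1-2*L*eta), from the analysis above
def alpha(L, mu, eta, m):
    return 1.0 / (mu * eta * (1 - 2 * L * eta) * m) + 2 * L * eta / (1 - 2 * L * eta)

L, mu = 100.0, 1.0          # hypothetical smoothness / strong-convexity constants
kappa = L / mu
a = alpha(L, mu, eta=0.1 / L, m=50 * kappa)
print(round(a, 6))  # 0.5: both terms equal 0.25 for this choice
```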