Projected Gradient Descent Method - Line Search + GPU = Fast

This post revisits the last post, where we used projected gradient descent to solve an optimal portfolio problem. We will get even larger performance gains here than in the last post, by changing how large a step we take in our gradient descent and by solving the problem on the GPU. This post assumes you have read the last post or are otherwise familiar with gradient descent and optimization.

One of the most important features of gradient-descent-type algorithms is the choice of step size ($\alpha$): a step size that is too small leads to very slow convergence, while a step size that is too large means the iterates diverge away from a local minimum. The obvious lower bound for the step size is zero, and the less obvious upper bound is $\frac{2}{L}$, with $L$ being the Lipschitz constant of the objective's gradient. In the last post, I glossed over this: $L$ was estimated via the power iteration algorithm, and we set $\alpha_k \approx \frac{1}{2L}$. This is not the most effective strategy, given that we have a route to derive a much better step size rule.
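For context, here is a minimal sketch of that power iteration estimate of $L$ (the largest eigenvalue of $Q$); the function name is illustrative, not the exact code from the last post.

    import numpy

    def estimate_lipschitz(Q, iters=100):
        # power iteration: repeatedly applying Q drives a random
        # vector toward the dominant eigenvector of Q
        v = numpy.random.randn(Q.shape[0], 1)
        v /= numpy.linalg.norm(v)
        for _ in range(iters):
            w = Q@v
            v = w / numpy.linalg.norm(w)
        # the Rayleigh quotient approximates the largest eigenvalue of Q,
        # which is the Lipschitz constant of the gradient of 0.5*x'Qx + c'x
        return float(v.T@(Q@v))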

The most powerful step size selection rule is exact 1D line search. This is a bad idea for many nonlinear optimization problems, as fully solving the line search problem to optimality is not cheap. However, for this problem there is a rather simple solution that gives the best possible step size, which 1) removes the need to compute the Lipschitz constant and 2) gives a better improvement on each iteration.

$ \begin{align} \alpha_k = \arg\min_{\alpha} f(x_k + \alpha \nabla f(x_k)) \end{align} $

The route to compute this is rather simple: we substitute the problem data into the line search problem and see that it becomes the minimization of a convex 1D quadratic function, which can be solved in $\mathcal{O}(1)$ time. Finding this function from the problem data is, unfortunately, $\mathcal{O}(n^2)$, but this is asymptotically the same cost as the gradient calculation, as it involves a matrix-vector product. After some manipulation, the step size rule becomes the following.

$ \begin{align} \alpha_k = -\frac{||\nabla f(x_k)||^2_2}{\nabla f(x_k)^TQ\nabla f(x_k)} \end{align} $
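For completeness, the manipulation is short: write $g = \nabla f(x_k) = Qx_k + c$ for the quadratic objective $f(x) = \frac{1}{2}x^TQx + c^Tx$ from the last post, and substitute the line search iterate into $f$ to get a 1D quadratic in $\alpha$,

$ \begin{align} \phi(\alpha) = f(x_k + \alpha g) = f(x_k) + \alpha ||g||^2_2 + \frac{\alpha^2}{2} g^TQg \end{align} $

which is minimized where $\phi'(\alpha) = ||g||^2_2 + \alpha g^TQg = 0$, recovering the rule above. Since the quadratic is convex ($g^TQg > 0$), the denominator is positive and $\alpha_k$ comes out negative: in this convention the update is $x_{k+1} = x_k + \alpha_k \nabla f(x_k)$, a step downhill.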

Implementing this idea in code is fairly simple, as it involves elementary linear algebra operations that numpy can perform.


    # one iteration's step size computation; Q, c and the
    # current iterate x are assumed to be numpy arrays

    # compute the mat-vec Q@x (used for the gradient)
    Qx = Q@x
    
    # compute the gradient
    dx = Qx + c
    
    # compute the optimal step size
    grad_norm = float(numpy.linalg.norm(dx,2))**2
    dxQdx = float(dx.T@(Q@dx))
    alpha = -grad_norm / dxQdx

If we go back and apply this to the same problems we looked at in the last post, we see a dramatic reduction in time to solve. I upgraded my computer since the last post, so I benchmarked all of the algorithms again, and now that I have more RAM, I can test a larger problem that wouldn't have fit in memory previously. We can see some interesting things here: 1) the time complexity of the PGD algorithms seems to be $\mathcal{O}(n^2)$ for this problem, as doubling the size of the problem approximately quadruples the time to solve; 2) with the exact line search, we are approximately 4 times faster than with the fixed step size.

Asset Count    Gurobi (s)    PGD (s)    PGD + 1D (s)
10             0.024         0.001      0.001
100            0.051         0.001      0.001
500            0.162         0.008      0.004
1000           0.531         0.011      0.004
5000           36.53         0.279      0.089
10000          307.94        0.998      0.268
20000          -             3.660      0.972

This is exciting news: we are now over 1000 times faster than Gurobi for the largest problem I was willing to wait for it to solve. However, I'd like to know if we can do better by solving this on the GPU. The GPU in my laptop is a mobile 4070, so that will be the GPU baseline. Here I will be doing a rather simple transformation of the code we have written so far, which is more or less just writing cupy instead of numpy when doing linear algebra operations. The code for this can be seen in the appendix.

It took approximately one second to transfer the largest problem from host memory to the GPU, so the main bottleneck is moving the problem data onto the GPU. If you are only solving a single problem with the same problem data, the GPU likely doesn't make much sense; however, as this is a portfolio problem with risk as a parameter, one would usually solve the problem repeatedly with different risk levels. If the problem data is left on the GPU, the overhead of transferring it doesn't have to be paid for each solve. If we look at raw solve times (assuming the data is already where it is supposed to be), this is another 10x improvement over the 1D line search on the CPU. In the table below, the number in parentheses is the time to load the matrix and solve the problem; the number without parentheses is the time to solve only. If we look at the GPU iteration progress, it only has to do 9 iterations before it is within our tolerance bounds, and some back-of-the-napkin math says we are only using 1% of the theoretical FLOPS of the GPU and ~10% of the memory bandwidth. Given that, there is likely another multiplicative speedup available from memory ordering optimizations and the use of a more direct interface.

Asset Count    Gurobi (s)    PGD (s)    PGD + 1D (s)    PGD + 1D + GPU (s)
10             0.024         0.001      0.001           0.019 (0.020)
100            0.051         0.001      0.001           0.010 (0.011)
500            0.162         0.008      0.004           0.013 (0.016)
1000           0.531         0.011      0.004           0.010 (0.11)
5000           36.53         0.279      0.089           0.024 (0.11)
10000          307.94        0.998      0.268           0.042 (0.32)
20000          -             3.660      0.972           0.096 (1.28)
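
Since one would usually sweep the risk level, here is a minimal sketch of how that amortization might look, assuming the portfolio objective is parameterized as $\frac{\lambda}{2}x^T\Sigma x - \mu^Tx$ (so $Q = \lambda\Sigma$ and $c = -\mu$). Sigma, mu, lam, and solve_pgd_gpu are hypothetical names, with solve_pgd_gpu wrapping the loop shown in the appendix.

    import cupy

    # pay the host-to-device transfer cost once (hypothetical names:
    # Sigma is the covariance matrix, mu the expected returns)
    Sigma_gpu = cupy.asarray(Sigma, dtype = cupy.float32)
    mu_gpu = cupy.asarray(mu, dtype = cupy.float32)

    portfolios = {}
    for lam in [0.1, 0.5, 1.0, 5.0, 10.0]:
        # Q = lam*Sigma and c = -mu are formed on the GPU, so no further
        # transfers are paid between solves at different risk levels
        portfolios[lam] = solve_pgd_gpu(lam*Sigma_gpu, -mu_gpu)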

So, starting from a place of beating a commercial solver by two orders of magnitude, we are now moving into the three- and four-order-of-magnitude speedup domain (if we ignore data transfer overhead) by combining 1D line search with GPU computing!

Code Listings

import cupy

# Q_gpu and c_gpu hold the problem data already transferred to the GPU;
# x_0 and project_to_simplex are from the last post
x_gpu = cupy.asarray(x_0.copy(), dtype = cupy.float32)
objs = []

for i in range(100):
    
    # compute the mat-vec Q@x and store it for reuse
    Qx = Q_gpu@x_gpu
    
    # compute the objective and store it
    objs.append(float(0.5*x_gpu.T@Qx + c_gpu.T@x_gpu))
    
    # compute the gradient
    dx = Qx + c_gpu
    
    # compute the optimal step size
    grad_norm = float(cupy.linalg.norm(dx,2))**2
    dxQdx = float(dx.T@(Q_gpu@dx))
    alpha = -grad_norm / dxQdx
    
    # make the gradient descent move
    y = x_gpu + alpha*dx
    
    # project back to simplex using the CPU based code - kinda ugly 
    x_gpu = cupy.asarray(project_to_simplex(y.get().flatten())).reshape(-1,1)
    
    # stop once the objective improvement falls below tolerance
    if i > 2:
        if objs[-2] - objs[-1] <= 10**-10:
            break
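
For reference, project_to_simplex above is the CPU projection from the last post; here is a minimal sketch of one standard implementation (the sort-and-threshold algorithm of Duchi et al.), which may differ from the last post's version in its details.

import numpy

def project_to_simplex(v):
    # Euclidean projection of a 1D vector v onto the probability
    # simplex {x : x >= 0, sum(x) = 1} via sort-and-threshold
    n = v.shape[0]
    u = numpy.sort(v)[::-1]
    css = numpy.cumsum(u)
    # find the largest k (0-indexed) with u_k - (cumsum_k - 1)/(k+1) > 0
    ks = numpy.arange(1, n + 1)
    rho = numpy.nonzero(u - (css - 1.0)/ks > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return numpy.maximum(v - theta, 0.0)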