In today’s post, I will take a break from discussing how to make a solver for QUBO problems and discuss the projected gradient method. This fun method allows optimization problems with (usually simple) constraints to be solved without needing particularly powerful machinery that is needed for interior point methods or active set methods. It will allow us to solve problems like the following by splitting them into two subroutines.

\[\begin{align} \min_{x \in \mathcal{C}} f(x) \end{align}\]

As you can guess by the name, there is a gradient and a projection. This means we take the gradient step that we would do as normal in any gradient descent method and then project back to the closest feasible point. This means that we push ourselves back to feasibility every time we take a step. Isn’t the projection step nearly as expensive as the original problem and itself an optimization problem? The general answer is yes, but for specific feasible regions (and collections of these feasible regions), explicit solutions to this project operation will simplify that problem to be very cheap, which is where this method shines. Now, the word “specific” sounds quite restrictive, but for many practical problems, these map to these “specific” feasible spaces.

Anyways, the two-step procedure looks like the following. We make a gradient decent step and then project it back into feasibility. With $\alpha_k$ being the step size, that can be optimized at each iteration (e.g., a 1D line search) or selected as a constant number (a safe one being the inverse of the largest lipshitz constant of the function if known).

\[y_{k+1} = x_k - \alpha_k \nabla f(x_k)\] \[x_{k+1} = \text{Proj}_ {\mathcal{C}}(y_{k+1}) = \min_{x_{k+1} \in \mathcal{C}} ||x_{k+1} - y_{k+1}||_2^2\]

An example of a problem where this is very simple is the quadratic program with nonnegativity constraints, so the projection function is simply $\max(y_{k+1},0)$. The projection step has a time complexity of $\mathcal{O}(n)$ while the gradient step will typically have a time complexity of $\mathcal{O}(n^2)$, which means that solving this problem with non-negative constraints is more or less the same cost as solving it without constraints, given a large enough $n$.

\[\begin{align} \min_{x \geq 0} \frac{1}{2}x^TQx+c^Tx \end{align}\]

Now, speaking of this method in generality is not that interesting. However, we can look at an interesting application where we try to plan a stock portfolio via projected gradient descent without needing a solution that can cost money/time/software dependencies and look at solving it directly. So the general problem format that we are looking at is the following QP, with a constraint that all the assets must sum to 1, and all must be positive, e.g., $x \in \Delta^n$ the unit simplex. $S$ is the correlation matrix between the assets, $r$ is the individual expected return vector from the individual assets, and $q$ is the risk allowance parameter; if $q = 0$, then the safest portfolio will be selected, and if $q = \infty$ is selected then the most risky portfolio will be selected. We looked at this problem under a slightly different formulation in a different post.

\[\begin{align} \min_{x \in \Delta^n} \frac{1}{2}(x^TSx+qR^Tx) \end{align}\]

Now, the projection function for the simplex is somewhat more complicated than the nonnegative reals, so I will not go into detail on it in this post, but it is an efficient algorithm that runs in $\mathcal{O}(n\log(n))$ time, so compared to the gradient step it is (asymptotically) cheap compared to computing a dense QP’s gradient. Here, A simple loop is written to iterate over projected gradient decent, with a simple break statement if there is a lack of improvement. We keep track of the objective w.r.t. iteration in the objs lists. Here, we will use a power iteration to estimate the largest eigenvalue of $S$ and, for safe measure, multiply it by to generate a constant step size $ \alpha = \alpha_k$.

def grad_step(x, Q, c, alpha):
    return x - alpha * grad(x, Q, c)

alpha = 1 / ( 2 * power_iteration(S, 50))

objs = [obj(x, S, R)]

for i in range(10000):
    
    y = grad_step(x, S, R, alpha)
    x = proj_simplex(y)
    objs.append(obj(x, S, R))
    
    print(objs[-2] - objs[-1])
    
    if abs(objs[-2] - objs[-1]) <= 10**-10:
        break

Now, we can try out applying this to some problem instances. Instead of reaching out and getting real financial data (I am not a financial planner), I will make enough data realistic. If you want the exact numbers and values I used, then use the same seed values; otherwise, feel free to plug in other values. This makes the correlation matrix $S$ and the individual return vector $qR$; we have $q$ set to 1 for our tests.

import numpy
import numpy as np
from scipy.stats import random_correlation

num_assets = 4000
numpy.random.seed(123)
rng = np.random.default_rng(seed=123)

rev = numpy.random.rand(num_assets)
rev = (num_assets / numpy.sum(rev)) * rev

S = random_correlation.rvs(eigs = rev, random_state=rng, tol = 10**-10)
R = 0.5*numpy.random.rand(num_assets).reshape(-1,1)
x_0 = numpy.random.rand(num_assets).reshape(-1,1)
x_0 = x_0 / numpy.sum(x_0)

With all that out of the way, we can now solve these problems! That is exciting. We can sanity-check ourselves using a commercial solver (Gurobi 11.0) to make sure they arrive at the same solution, and they do! So now all that is left is to benchmark them against each other, as that seems to be the only thing I do any more in these posts….

Asset Count	$f_{\text{GRB}}(x^*)$	$f_{\text{PGD}}(x^*)$	Gurobi- Time	PGD - Time
10	0.1956729992887787	0.1956729994542699	0.024	0.001
100	0.06705550565755472	0.06705550546472361	0.065	0.002
500	0.029396196860722635	0.029396195699599052	0.601	0.013
1000	0.01860531914042334	0.018605318371605714	2.146	0.025
5000	0.009798037836193776	0.009798034487112582	38.745	0.648
10000	0.007405201565971605	0.00740519903908969	337.097	2.416

We actually see some fairly surprising results that I didn’t expect. The projeccted gradient method is faster (by two orders on the top end) and more accurate when compared to Gurobi’s interior point solver. This makes sense in that no linear systems are solved, and thus essentially, no numerical noise is introduced in this projected gradient decent method vs the alternative approach. While the difference in value is not particularly large, it is interesting that even simple, special case methods can be this much faster than commercial solvers. So the takeaway message is that for many problem types with a very specific structure, there might be more effective ways to solve these problems than you might have originally thought. This occurs all over the machine learning literature, where special algorithms are tailored to specific problem instances to get solutions that scale way better than using a more general approach. There are even linear time algorithms for specific quadratic programming problem instances, such as the quadratic transport problem.

Code Listings

def obj(x, Q, c):
    return float(0.5*x.T@Q@x + c.T@x)

def grad(x, Q, c):
    return Q@x + c

def grad_step(x, Q, c, alpha):
    return x - alpha * grad(x, Q, c)

Projection function for nonnegative feasible space.

def proj_positive(x):
    return numpy.maximum(x, 0)

Projection function for simpex feasible space.

def proj_simplex(x):
    """
    Projects a point v onto the unit simplex.
    """

    # make a 1D array
    x = x.flatten()

    n = len(x)
    
    # Sort the vector v in descending order
    u = numpy.sort(x)[::-1]
    
    # Compute the cumulative sum of u - 1
    cumsum_u = numpy.cumsum(u) - 1
    
    # Compute the rho value
    rho = numpy.where(u > (cumsum_u / np.arange(1, n+1)))[0][-1]
    
    # Compute the threshold theta
    theta = cumsum_u[rho] / (rho + 1)
    
    # Project onto the simplex
    p = numpy.maximum(x - theta, 0)
    
    return p.reshape(-1,1)

Power iteration adapted from the wikipedia article.

def power_iteration(A, num_iterations: int):

    b_k = np.random.randn(A.shape[1])

    for _ in range(num_iterations):
        # calculate the matrix-by-vector product Ab
        b_k1 = np.dot(A, b_k)

        # calculate the norm
        b_k1_norm = np.linalg.norm(b_k1)

        # re normalize the vector
        b_k = b_k1 / b_k1_norm

    return numpy.linalg.norm(A@b_k)

Python Code Optimization Visualization

Code Listings

Related Posts

MixingCut - Rust MAXCUT SDP solver 03 Oct 2024

Projected Gradient Descent Method - Line Search + GPU = Fast 15 Aug 2024

Branch And Bound of QUBO, Problem Preprocessing & Subproblem Projection 01 Jun 2024