We are still looking at the QUBO problem and upgrading the Branch and Bound solver we have been developing. This post mostly covers making Hercules a multithreaded solver using the rayon crate. This requires us to change some aspects of the inside of the solve() function. First, we move to an event-based structure: instead of each thread mutating the solver state throughout the process, each worker returns an enum that signals to the main thread what actions need to be taken. Doing so makes the subproblems completely independent, which removes the need for communication between threads and simplifies our parallelization scheme quite a bit. Instead of atomics, mutexes, and locks scattered through a large tree structure, we have a parallel map. Because all of our subproblems are independent, we can explore the branch and bound tree in parallel without worrying about data races or thread safety, and thus solve our QUBO problems faster. If you have not read the previous posts in this series, I recommend reading them before continuing here. Many other minor changes were made as well: we extracted several of the smaller structs into their own files, and some features were made more general for upcoming changes, e.g., swapping subproblem solvers.
Changing how actions are propagated to the solver
In a single-threaded application, you can have mutable access to a variable from almost anywhere in the code without worry, as you won't accidentally write over it while you are using it. That is no longer true once multiple threads are involved, so instead of calling functions that modify the solver state directly, we store records of the changes to apply after we have solved the batch of subproblems we are considering. Rather than trying to update the best solution immediately, we create an Event enum that stores the action that needs to be taken.
pub enum Event {
    // a worker found a new candidate best solution
    UpdateBestSolution(Array1<f64>),
    // a worker branched a node into two child subproblems
    AddBranches(QuboBBNode, QuboBBNode),
}
Below is a simplified view of the solver’s main loop. We have moved most of the calculation and subproblem solving to the process_node
function. Since this function no longer needs mutable access, we can use a parallel map to solve the problems in parallel. For those familiar with Rust, the code does not need a single lifetime annotation (thank goodness!).
fn solve(&mut self) -> (Array1<f64>, f64) {
    ...
    // until we have hit a termination condition, we keep iterating
    while !self.termination_condition() {
        // grab the next batch of nodes (up to options.threads of them) to solve simultaneously
        let nodes = self.get_next_nodes(self.options.threads);

        // process all the nodes we are considering
        let process_results: Vec<_> = nodes.iter().map(|node| {
            self.process_node(node)
        }).collect();

        // iterate over the resulting events and apply the changes
        for state in process_results {
            self.apply_event_option(state.event);
        }
    }
    ...
}
Currently, the only work done on the main thread is collecting the nodes to hand out for processing and applying the Events generated by the map. This also organizes the solver's structure around types and reduces the solver code's 'spaghetti factor.'
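For concreteness, here is a minimal, self-contained sketch of what applying these events on the main thread could look like. The names here (ToySolver, ProcessResult, best_objective, node_queue, compute_objective) are illustrative assumptions, not the actual Hercules internals.

use ndarray::Array1;

// Stand-in types so the sketch compiles on its own; not the real Hercules types.
pub struct QuboBBNode;

pub enum Event {
    UpdateBestSolution(Array1<f64>),
    AddBranches(QuboBBNode, QuboBBNode),
}

// What each worker hands back to the main thread (name assumed for illustration).
pub struct ProcessResult {
    pub event: Option<Event>,
}

pub struct ToySolver {
    best_solution: Array1<f64>,
    best_objective: f64,
    node_queue: Vec<QuboBBNode>,
}

impl ToySolver {
    // Placeholder objective; the real solver evaluates the QUBO objective.
    fn compute_objective(&self, _x: &Array1<f64>) -> f64 {
        0.0
    }

    // Only the main thread calls this, so it can take &mut self without any locking.
    fn apply_event_option(&mut self, event: Option<Event>) {
        match event {
            Some(Event::UpdateBestSolution(x)) => {
                let obj = self.compute_objective(&x);
                if obj < self.best_objective {
                    self.best_objective = obj;
                    self.best_solution = x;
                }
            }
            Some(Event::AddBranches(zero_branch, one_branch)) => {
                self.node_queue.push(zero_branch);
                self.node_queue.push(one_branch);
            }
            None => {} // node was pruned or infeasible; nothing to apply
        }
    }
}

The key property is that workers only produce values; all mutation funnels through this one method on the main thread.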
Making it Parallel
With these changes in place, all we need to do is change a single line. This is the 'fearless concurrency' people talk about in Rust. Using the rayon crate, we can swap out iter() for par_iter(), which turns our map into a parallel map. Of course, the reason this is now a one-line change is that we already made the significant changes to the solver's internals described above.
let process_results: Vec<_> = nodes.par_iter().map(|node| {
    self.process_node(node)
}).collect();
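If you have not used rayon before, the pattern is identical to the standard iterator API. Here is a tiny standalone example (nothing to do with the solver) showing the same swap, assuming rayon = "1" has been added to Cargo.toml:

use rayon::prelude::*;

fn main() {
    let inputs: Vec<u64> = (0..1_000).collect();

    // sequential map
    let serial: Vec<u64> = inputs.iter().map(|x| x * x).collect();

    // parallel map: the only change is iter() -> par_iter()
    let parallel: Vec<u64> = inputs.par_iter().map(|x| x * x).collect();

    assert_eq!(serial, parallel);
}

The only real requirement this places on the solver is that process_node works through a shared reference (and the solver type is Sync), which is exactly what the event-based refactor bought us.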
Now that we have added parallelism, we can ask two questions: does this scale well, and does it actually help with our original goal of solving QUBO problems faster?
Benching the Multithreaded Feature
Just like before, we create a set of test problems to see the effect of adding parallelism to the solver and compare the solve times. We generated 50 dense instances with 60 binary variables each; these would be practically impossible to solve with any of the Python versions of the solver. We use the 'most violated' branching rule and an initial solution generated from particle swarm optimization, and the multithreaded version takes batches of 200 nodes at a time.
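As an aside, a random dense instance of this kind can be generated in a few lines; the sketch below (using ndarray and rand) is just one way to do it and is not the actual generator used for this test set.

use ndarray::Array2;
use rand::Rng;

// One way to build a random dense symmetric QUBO matrix; the actual
// instance generator used for the benchmark set may differ.
fn random_dense_qubo(n: usize) -> Array2<f64> {
    let mut rng = rand::thread_rng();
    let mut q = Array2::<f64>::zeros((n, n));
    for i in 0..n {
        for j in i..n {
            let value: f64 = rng.gen_range(-1.0..=1.0);
            q[[i, j]] = value;
            q[[j, i]] = value; // keep the matrix symmetric
        }
    }
    q
}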
All in all, this sped up the solver by quite a bit. With an average speedup of 7.04x and a median speedup of 6.3x, we significantly reduced the time to solve across the entire test set. If we look at a larger dense problem instance, one with 90 variables, we can see some interesting behavior.
We stay within a factor of 2 of idealized scaling as we increase the number of nodes we solve at the same time. We also solved this problem with the same batch size of 200 nodes and got a speedup of 8.52x compared to serial execution, which is quite good given that the CPU we ran this on is a 12700K. While the 12700K has 12 cores, they are uneven (a mix of performance and efficiency cores), so the theoretical maximum is not 12x the serial execution speed even with perfect parallelism. Taking the Cinebench rendering benchmark, which is nearly perfectly parallel, as a baseline, the maximum achievable speedup is about 10.3x. At 8.52/10.3 ≈ 0.83, we are within roughly 20% of that ceiling, which is quite good considering how simple our scheme is.
There are some fairly important pieces still missing, such as a much more powerful presolve; right now, we are unfortunately at the forefront of the 1970s with our presolver. Another addition would be cuts, which would turn this into a branch-and-cut solver, but I want to see how far we can push the solver while keeping it branch-and-bound only. Something else that has been nagging at me is that the subproblems should be easy to hot-start from the previous solution, yet we are solving them from scratch every time.