Utilizing Hardware - Python, Numba, and CUDA

I recently did some Monte Carlo simulation, and I wanted to show the performance differences between native Python code, parallel Numba-compiled NumPy code, and unoptimized GPU CUDA code using the CuPy package. Part of what makes Python so powerful is the ability to tap into these resources without changing much code (as you will see in the examples).

I am going to use the standard Monte Carlo estimation of pi as the benchmark. Draw N points (x, y) uniformly from the unit square; the fraction landing inside the quarter circle approaches pi/4:

\[\pi = \lim_{N\to\infty} 4\,\frac{\operatorname{count}(x^2 + y^2 < 1)}{N}\]
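As a quick sanity check of the estimator before any optimization, a few lines of pure Python (standard library only; the function name and sample count are my own for illustration) already land in the right neighborhood:

```python
import math
import random

def estimate_pi(n: int) -> float:
    """Draw n points in the unit square; the fraction landing inside the
    quarter circle x^2 + y^2 < 1 approaches pi/4."""
    inside = sum(random.random() ** 2 + random.random() ** 2 < 1
                 for _ in range(n))
    return 4.0 * inside / n

random.seed(0)
print(estimate_pi(100_000))  # prints a value near 3.14
```

The standard error of the estimate shrinks like 1/sqrt(N), which is why the benchmarks below push the sample counts into the hundreds of millions.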

The full code appears at the end; here are the results first (8600K @ 4.5 GHz, Nvidia 1080).

Method   Number of Samples   Time (s)   Million samples/sec
Python   100 million         43         2.32
Numba    100 million         0.295      339
CUDA     100 million         0.068      1470
Numba    10 billion          20.5       488
CUDA     10 billion          7          1428

By leveraging the thousands of streaming processors on the GPU, we get an incredible speedup on this task.

Here is another example of the difference: solving a dense linear system of 10,000 variables with NumPy (CPU) and CuPy (GPU). We will see a massive performance gap between fp32 (float) and fp64 (double), because consumer GPU hardware ships far more fp32 ALUs than fp64 units.

Method        Size    Time (s)
NumPy         10000   2.95
CUDA (fp32)   10000   0.49
CUDA (fp64)   10000   5.7

As expected, we see a major performance regression with fp64 on my consumer GPU. Try the code on your machine and see the difference!
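The fp32 speedup is not free: single precision also loses accuracy. A small NumPy-only sketch (no GPU needed; the 200x200 size and seed are my own, chosen so it runs quickly) shows the gap in solve residuals:

```python
import numpy

rng = numpy.random.default_rng(0)
A = rng.random((200, 200))
b = rng.random((200, 1))

# Solve the same system in both precisions and compare residuals ||Ax - b||.
x64 = numpy.linalg.solve(A, b)
x32 = numpy.linalg.solve(A.astype(numpy.float32), b.astype(numpy.float32))

r64 = numpy.linalg.norm(A @ x64 - b)
r32 = numpy.linalg.norm(A @ x32.astype(numpy.float64) - b)
print(r64, r32)  # the fp64 residual is orders of magnitude smaller
```

Whether that accuracy loss matters depends on the conditioning of your system; for a well-conditioned Monte Carlo workload like the pi estimate, fp32 is usually a fine trade.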

Example Code - Monte-Carlo Pi Estimation

from random import random
import numba
import numpy
import cupy

def monti_carlo_python(n: int, m: int) -> float:
    """Pure-Python baseline: m batches of n random points each."""
    accum = 0
    for i in range(m):
        partial = 0
        for j in range(n):
            x = random()
            y = random()
            # The comparison is True (1) when the point falls inside
            # the quarter circle.
            partial += x**2 + y**2 < 1
        accum += partial / n
    return 4.0 * accum / m

@numba.njit(parallel=True)
def monte_carlo_cpu(n: int, m: int) -> float:
    """Numba-compiled version: prange spreads the m batches across CPU cores,
    and each batch is a vectorized NumPy operation."""
    accum = 0
    for i in numba.prange(m):
        x = numpy.random.random(n)
        y = numpy.random.random(n)
        r = x**2 + y**2 < 1.0
        accum += numpy.sum(r) / n
    return 4.0 * accum / m

def monte_carlo_gpu(n: int, m: int) -> float:
    """CuPy version: each batch runs as vectorized kernels on the GPU."""
    accum = 0
    for i in range(m):
        # fp32 keeps us on the GPU's fast single-precision path
        x = cupy.random.random(n, dtype=numpy.float32)
        y = cupy.random.random(n, dtype=numpy.float32)
        r = cupy.less(x**2 + y**2, 1.0)
        accum += cupy.sum(r) / n
    return 4.0 * accum / m

%timeit monti_carlo_python(1000000, 100)
%timeit monte_carlo_cpu(1000000, 100)
%timeit monte_carlo_gpu(1000000, 100)
%timeit monte_carlo_cpu(1000000, 10000)
%timeit monte_carlo_gpu(1000000, 10000)
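One timing caveat worth knowing (my own note, not part of the benchmark above): CuPy launches kernels asynchronously, so a wall-clock timer can stop before the GPU has actually finished. A minimal sketch of a safer timing wrapper, assuming CuPy's `cupy.cuda.Device().synchronize()` and falling back to a no-op when CuPy is not installed:

```python
import time

try:
    import cupy
except ImportError:
    cupy = None  # no GPU stack available on this machine

def time_gpu(fn, *args, repeats=3):
    """Time fn(*args), forcing the GPU to finish before reading the clock."""
    if cupy is None:
        return None
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        cupy.cuda.Device().synchronize()  # wait for all queued kernels
        best = min(best, time.perf_counter() - start)
    return best
```

In this particular benchmark, `cupy.sum(r)` pulls a result back each iteration, which already forces synchronization, so the `%timeit` numbers above are honest; the wrapper matters when a function leaves results on the device.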

Example Code - Linear System Solve

A = numpy.random.random((10000,10000))
b = numpy.random.random((10000,1))

A_gpufp64 = cupy.array(A)
b_gpufp64 = cupy.array(b)

A_gpufp32 = cupy.array(A.astype(numpy.float32))
b_gpufp32 = cupy.array(b.astype(numpy.float32))


%timeit numpy.linalg.solve(A, b)
%timeit cupy.linalg.solve(A_gpufp32, b_gpufp32)
%timeit cupy.linalg.solve(A_gpufp64, b_gpufp64)

Note: remember to clear out unused variables; CuPy's memory pool holds onto GPU allocations until you release them:

mempool = cupy.get_default_memory_pool()
mempool.free_all_blocks()