Accelerating Python Code with Numba Vectorize

In the world of numerical computing, performance optimization is paramount. Python, with its interpreted nature, may not always offer the desired speed for computationally intensive operations. However, Numba, a powerful library, comes to the rescue with its array-oriented computing capabilities and just-in-time (JIT) compilation. In this article, we will explore one of Numba’s most valuable features: Numba Vectorize.

We will delve into the inner workings of Numba Vectorize, understand how it harnesses the power of Single Instruction Multiple Data (SIMD) operations, and showcase a few code examples that demonstrate its efficiency in various scenarios.

If you haven’t read our main tutorial on Numba yet, we advise you to do so (unless you are already familiar with using Numba).

How Numba Vectorize Works:

Numba Vectorize is a decorator that turns a function written for scalar inputs into a NumPy universal function (ufunc) that operates on arrays element-wise. The function is compiled to machine code, and the compiler can use the SIMD capabilities of modern CPUs to process several elements of an array with a single instruction.

Numba automatically generates optimized machine code tailored to the specified data types, eliminating the need for explicit loops and enhancing performance.

Here are some examples where we apply the @vectorize decorator, and compare its performance to the non-vectorized version. Make sure to explicitly enter the type of the return values and parameters of the Numba function. This ensures the function gets compiled in the very beginning, not when the function is first called.

1. Element-wise Addition:

import numpy as np
from numba import vectorize
import time

@vectorize('int32(int32, int32)', nopython=True)
def add_func(x, y):
    return x + y

def numpy_add_func(x, y):
    return x + y

a = np.random.randint(0, 1001, size=(1000, 1000), dtype=np.int32)
b = np.random.randint(0, 1001, size=(1000, 1000), dtype=np.int32)

In this example, the add_func function performs element-wise addition on two arrays using Numba Vectorize, while numpy_add_func does the same with NumPy's built-in addition. Note that the default target runs on a single CPU core, so any speedup comes from the compiled element-wise loop rather than from multithreading.

We will benchmark the above code using the following:

start = time.perf_counter()
add_func(a, b)
end = time.perf_counter()
print(f"Numba JIT (vectorized) = {end - start:.7f}")

start = time.perf_counter()
numpy_add_func(a, b)
end = time.perf_counter()
print(f"CPython (numpy) = {end - start:.7f}")

Numba JIT (vectorized) = 0.0015579
CPython (numpy) = 0.0025988

As we can see from these results, Numba JIT was able to outperform even NumPy (which is already heavily optimized in C) in this run. If you were to write the non-Numba code as a pure Python loop, it would be far slower than the Numba JIT version.

We conducted this test on various arrays (of different sizes and dimensions). We found that the larger and more complex the array, the better Numba performed. With small arrays (e.g. a 1D array of size 100), Numba and NumPy proved to be approximately equal.

2. Element-wise Maximum:

import numpy as np
from numba import vectorize

@vectorize
def max_func(x, y):
    return max(x, y)

a = np.array([1, 2, 3, 4])
b = np.array([3, 1, 5, 2])

result = max_func(a, b)
print(result)  # Output: [3 2 5 4]

In this example, the max_func function finds the element-wise maximum between two arrays. Because no signature is passed to @vectorize here, the function is compiled on its first call (for the data types it receives) rather than at decoration time.

Here we did not observe much of a performance difference between Numba and the optimized NumPy implementation.

Numba JIT = 0.0012470
CPython = 0.0015955

During repeated tests, Numba proved to be slightly faster in all cases, by between 10% and 30%.

3. Non-Vectorizable Function:

import numpy as np
from numba import vectorize
import time

@vectorize('int32(int32)', nopython=True)
def non_vectorizable_func(x):
    if x < 0:
        return 0
    else:
        return x * 2

def normal_non_vectorizable_func(x):
    return np.where(x < 0, 0, x * 2)

x = np.random.randint(-1000, 1001, size=(1000, 1000), dtype=np.int32)

In this example, the non_vectorizable_func function doubles non-negative values and replaces negative values with 0. The conditional statement can make it harder for the compiler to emit SIMD instructions for the loop. @vectorize still compiles the function into a ufunc that is applied to every element; the branch may simply limit how much of the work can be done with SIMD, resulting in potentially slower performance compared to branch-free functions.

To compare, we have equivalent array-based code using NumPy, which is highly optimized. Let’s see which performs better, shall we?

Here are the results:

Numba JIT = 0.0014120
CPython = 0.0074535

As you can see, despite the conditional, Numba still pulled through, running roughly five times faster than the NumPy version in this test.

4. Element-wise Trigonometric Functions:

import numpy as np
from numba import vectorize

@vectorize('float64(float64)', nopython=True)
def trig_func(x):
    return np.sin(x) + np.cos(x)

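def normal_trig_func(x):
    # Python-level loop; writes to a new array so the input is not overwritten
    result = np.empty_like(x)
    for index, value in enumerate(x):
        result[index] = np.sin(value) + np.cos(value)
    return result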

x = np.linspace(0, 2 * np.pi, 10**6, dtype=np.float64)

In this example, trig_func calculates the sum of the sine and cosine of each element in the array. Because the whole element-wise loop runs as compiled machine code, Numba Vectorize provides a significant speedup compared to an equivalent Python-level loop.

We didn’t use NumPy's array operations for the comparison this time, because we wanted to show you the raw performance boost of using Numba vs a regular Python loop (run by the standard interpreter, CPython).

Here are the benchmarks:

Numba JIT = 0.0129561
CPython = 1.7318294

As you can see, Numba proves to be over 100 times faster than the Python loop running on CPython!

5. Element-wise Polynomial Evaluation:

import numpy as np
from numba import vectorize, njit, prange
import time

@vectorize('int64(int64)', nopython=True)
def poly_eval(x):
    return 3 * x**3 + 2 * x**2 + x + 1

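@njit('int64[:](int64[:])')
def normal_poly_eval(x):
    # Without parallel=True, prange behaves like an ordinary range
    for index in prange(len(x)):
        value = x[index]
        # Note: the result overwrites the input array in place
        x[index] = 3 * value**3 + 2 * value**2 + value + 1
    return x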

x = np.random.randint(0, 1000, size=10000, dtype=np.int64)

In this example, the poly_eval function evaluates a third-degree polynomial for each element in the array. What’s slightly different here is that we are comparing a plain Numba (@njit) loop with the vectorized Numba approach, to see what difference (if any) exists. Otherwise, there wouldn’t be much point in using @vectorize, right?

Here are the benchmarks.

Numba JIT (vectorized) = 0.0000136
Numba JIT (non-vectorized) = 0.0003214

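As we can see, the vectorized approach was roughly 24 times faster than the explicit-loop version in this run. This is not because @vectorize has no loop: the generated ufunc still iterates over the elements, just in compiled code that Numba builds for us. With an array of only 10,000 elements, timings this small are also sensitive to fixed per-call overhead, so treat the ratio as indicative rather than exact.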

This marks the end of the Numba Vectorize Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the tutorial content can be asked in the comments section below.
