SoatDev IT Consulting
SoatDev IT Consulting
  • About us
  • Expertise
  • Services
  • How it works
  • Contact Us
  • News
  • June 13, 2023
  • Rss Fetcher

How to outperform NumPy function by 50% using Maths

The 2D Fourier Transform is one of the foremost computer science algorithms of the century. It has gained application in our everyday life, ranging from Instagram filters to the processing of MP3 files.

The most frequent implementation used by the average user, sometimes even unknowingly, is an adaptation from NumPy. However, despite its popularity, their algorithm is not the most efficient. With a few simple manipulations and an article from 2015, we beat the NumPy algorithm by 30–60 percent in performance. The core problem with the current implementation is the simple fact that it is originally derived from a performance-weak algorithm.

In its essence, the algorithm implemented by NumPy applies the regular one-dimensional FFT to two dimensions in turn, which clearly cannot be an optimal solution.

Conversely, in 2015, two Russian scientists came up with their version of the algorithm, adapting the idea of a one-dimensional butterfly transform for two-dimensional signals. We have implemented their basic algorithm concept efficiently by adding our ideas.

M. V. Noskov, article author

After building the naive algorithm from the article, we proceeded to optimization, which is shown below:

void _fft2d( /* Square matrix of size N */ ) {
// base case {
if (N == 1) return;
// } base case

int n = N >> 1;

/* pseudo code {

...Creating 4 temprorary matrices here...

// X(x, y, i, j)
// x, y -- indexing over tempreorty submatricies
// i, j -- indexing over rows, columns in each submatrix

_fft2d(&X(0, 0), root * root, ...);
_fft2d(&X(0, 1), root * root, ...);
_fft2d(&X(1, 0), root * root, ...);
_fft2d(&X(1, 1), root * root, ...);
} pseudo code */

for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
auto x00 = X(0, 0, i, j);
auto x10 = X(1, 0, i, j) * /* W[i] */;
auto x01 = X(0, 1, i, j) * /* W[j] */;
auto x11 = X(1, 1, i, j) * /* W[i] * W[j] */;
X(0, 0, i, j) = x00 + x10 + x01 + x11;
X(0, 1, i, j) = x00 + x10 - x01 - x11;
X(1, 0, i, j) = x00 - x10 + x01 - x11;
X(1, 1, i, j) = x00 - x10 - x01 + x11;
}
}
}

Any recursive algorithm can be enhanced by increasing the size of the base case. Here is how we handled it:

void _fft2d( /* Square matrix of size N */ ) {
// base case {
if (N == 1) return;
if (N == 2) {
#define Y(y, x) (V[(y)*rowsize + (x)])
auto x00 = Y(0, 0);
auto x10 = Y(1, 0);
auto x01 = Y(0, 1);
auto x11 = Y(1, 1);
Y(0, 0) = x00 + x10 + x01 + x11;
Y(0, 1) = x00 + x10 - x01 - x11;
Y(1, 0) = x00 - x10 + x01 - x11;
Y(1, 1) = x00 - x10 - x01 + x11;
return;
}
// } base case

// ...
}

The further logical step was to eliminate creating four unnecessary temporary submatrices in each recursive step in favor of a single one. To do this, we used the concept from the algorithmica article and modified it for a two-dimensional matrix. This feature also helps us to reduce unnecessary allocations and increases the number of cache hits.

// Computing values for rev_bits[n]
auto revbits = [](size_t *v, size_t n) {
int lg_n = log2(n);
forn(i, n) {
int revi = 0;
forn(l, lg_n) revi |= ((i >> l) & 1) << (lg_n - l - 1);
v[i] = revi;
}
};

size_t *rev_n = new size_t[N], *rev_m = new size_t[M];
revbits(rev_n, N), revbits(rev_m, M);

// Transforming matrix
forn(i, N) {
int rev_i = rev_n[i];
forn(j, M) {
if ((i < rev_i) || ((i == rev_i) && (j < rev_m[j])))
std::swap(V[i * M + j], V[rev_i * M + rev_m[j]]);
}
}

Our next challenge was to precalculate the roots of unity:

int mxdim = std::max(N, M);
const int lg_dim = log2(mxdim);
auto W = new fft_type[mxdim];
auto rooti = std::polar(1., (inverse ? 2 : -2) * fft::pi / mxdim);

// Computing look-up matrix for roots
auto cur_root = rooti;
W[0] = 1;
forn (i, mxdim - 1) W[i + 1] = W[i] * cur_root;

How can we get by with just such an array? Let’s note that in the naive implementation, in the initial recursion step, we pass through an array of degrees of root. We also pass to the next recursion level, the square of this root(W[2]). On the next recursion level, we pass through the same array of powers but in increments of 2. From this observation, we can derive that on the i-th level of recursion, the step at which we will pass through the array W is 2ⁱ.

At this stage, we receive the following code:

void _fft2d(
fft_type *__restrict__ V,
size_t N,
size_t rowsize,
fft_type *__restrict__ W,
int step
) {
// base case {
if (N == 1) return;
if (N == 2) {
#define Y(y, x) (V[(y)*rowsize + (x)])
auto x00 = Y(0, 0);
auto x10 = Y(1, 0);
auto x01 = Y(0, 1);
auto x11 = Y(1, 1);
Y(0, 0) = x00 + x10 + x01 + x11;
Y(0, 1) = x00 + x10 - x01 - x11;
Y(1, 0) = x00 - x10 + x01 - x11;
Y(1, 1) = x00 - x10 - x01 + x11;
return;
}
// } base case

int n = N >> 1;

#define X(y, x, i, j) (V[((y)*n + (i)) * rowsize + ((x)*n) + j])
#define params n, rowsize, W, (step << 1)
_fft2d(&X(0, 0, 0, 0), params);
_fft2d(&X(0, 1, 0, 0), params);
_fft2d(&X(1, 0, 0, 0), params);
_fft2d(&X(1, 1, 0, 0), params);

for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
auto x00 = X(0, 0, i, j);
auto x10 = X(1, 0, i, j) * W[step * i];
auto x01 = X(0, 1, i, j) * W[step * j];
auto x11 = X(1, 1, i, j) * W[step * (i + j)];
X(0, 0, i, j) = x00 + x10 + x01 + x11;
X(0, 1, i, j) = x00 + x10 - x01 - x11;
X(1, 0, i, j) = x00 - x10 + x01 - x11;
X(1, 1, i, j) = x00 - x10 - x01 + x11;
}
}
}

The original algorithm has another obvious drawback — it only handles square matrices with dimensions equal to a power of two. Using some simple modifications, we can extend it to rectangular matrices.

void _plan(
fft_type *__restrict__ V,
size_t N,
size_t M,
size_t rowsize,
fft_type *__restrict__ W,
int step_i,
int step_j
) {
// Computing square matrix
if (N == M) _fft2d(V, N, rowsize, W, step_i);
// Performing vertical split
else if (N > M) {
int n = N >> 1;
#define Y(y, i, j) (V[((y)*n + (i)) * rowsize + j])
#define params n, M, rowsize, W, (step_i << 1), step_j
_plan(&Y(0, 0, 0), params);
_plan(&Y(1, 0, 0), params);

forn (i, n) {
forn (j, M) {
auto y00 = Y(0, i, j);
auto y10 = Y(1, i, j) * W[i * step_i];
Y(0, i, j) = y00 + y10;
Y(1, i, j) = y00 - y10;
}
}
// Performing horizontal split
} else { /* ...Analogical approach... */ }
}

It will be important to mention that NumPy makes an extra allocation under FFT in its algorithm, bringing the types in it to np.complex128; if we avoid this step, we can get an advantage of about 10%. We also ended up implementing multithreading.

As a visual representation, we can deliver the table with runtimes and also a graph showing the efficiency of our work about NumPy:

Representing graph

Synopsis

The Russian mathematicians’ modified algorithm overtakes the “rows and columns” under the bonnet of NumPy in terms of efficiency. Several logical manipulations, such as base case increases, boosted our optimization significantly.

Crucially, the steps we performed during implementation can also be used in the case of other algorithms, which might be helpful for you in future. It is equally worth noting that while we have made a solid effort, the implementation can still be strengthened by adding padding matrices of different sizes. This post aims to share the source code, which may help improve the calculation of the transformation in all sorts of projects.

The repository link can be found below, or you can also import the package directly using the terminal:

pip3 install git+https://github.com/2D-FFT-Project/2d-fft

Thanks for your attention 🙂

Repository with the source code

Resources

  • FFT, Algorithmica
  • Two-dimensional fast Fourier transform: Butterfly in analog of Cooley-Tukey algorithm, by V. S. Tutatchikov for IEEE, 2016


Beating NumPy in 2DFFT was originally published in Better Programming on Medium, where people are continuing the conversation by highlighting and responding to this story.

Previous Post
Next Post

Recent Posts

  • Integrating AI agents: Navigating challenges, ensuring security, and driving adoption
  • Elon Musk says XChat is rolling out to all, but questions remain about its alleged security
  • Houston’s long term planning
  • Startup Battlefield 200: Final week to submit your application
  • 5 days left to claim your exhibitor table for TechCrunch All Stage 

Categories

  • Industry News
  • Programming
  • RSS Fetched Articles
  • Uncategorized

Archives

  • June 2025
  • May 2025
  • April 2025
  • February 2025
  • January 2025
  • December 2024
  • November 2024
  • October 2024
  • September 2024
  • August 2024
  • July 2024
  • June 2024
  • May 2024
  • April 2024
  • March 2024
  • February 2024
  • January 2024
  • December 2023
  • November 2023
  • October 2023
  • September 2023
  • August 2023
  • July 2023
  • June 2023
  • May 2023
  • April 2023

Tap into the power of Microservices, MVC Architecture, Cloud, Containers, UML, and Scrum methodologies to bolster your project planning, execution, and application development processes.

Solutions

  • IT Consultation
  • Agile Transformation
  • Software Development
  • DevOps & CI/CD

Regions Covered

  • Montreal
  • New York
  • Paris
  • Mauritius
  • Abidjan
  • Dakar

Subscribe to Newsletter

Join our monthly newsletter subscribers to get the latest news and insights.

© Copyright 2023. All Rights Reserved by Soatdev IT Consulting Inc.