How to outperform NumPy function by 50% using Maths
The 2D Fourier Transform is one of the foremost computer science algorithms of the century. It has gained application in our everyday life, ranging from Instagram filters to the processing of MP3 files.
The most frequent implementation used by the average user, sometimes even unknowingly, is an adaptation from NumPy. However, despite its popularity, their algorithm is not the most efficient. With a few simple manipulations and an article from 2015, we beat the NumPy algorithm by 30–60 percent in performance. The core problem with the current implementation is the simple fact that it is originally derived from a performance-weak algorithm.
In its essence, the algorithm implemented by NumPy applies the regular one-dimensional FFT to two dimensions in turn, which clearly cannot be an optimal solution.
Conversely, in 2015, two Russian scientists came up with their version of the algorithm, adapting the idea of a one-dimensional butterfly transform for two-dimensional signals. We have implemented their basic algorithm concept efficiently by adding our ideas.
After building the naive algorithm from the article, we proceeded to optimization, which is shown below:
void _fft2d( /* Square matrix of size N */ ) {
// base case {
if (N == 1) return;
// } base case
int n = N >> 1;
/* pseudo code {
...Creating 4 temprorary matrices here...
// X(x, y, i, j)
// x, y -- indexing over tempreorty submatricies
// i, j -- indexing over rows, columns in each submatrix
_fft2d(&X(0, 0), root * root, ...);
_fft2d(&X(0, 1), root * root, ...);
_fft2d(&X(1, 0), root * root, ...);
_fft2d(&X(1, 1), root * root, ...);
} pseudo code */
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
auto x00 = X(0, 0, i, j);
auto x10 = X(1, 0, i, j) * /* W[i] */;
auto x01 = X(0, 1, i, j) * /* W[j] */;
auto x11 = X(1, 1, i, j) * /* W[i] * W[j] */;
X(0, 0, i, j) = x00 + x10 + x01 + x11;
X(0, 1, i, j) = x00 + x10 - x01 - x11;
X(1, 0, i, j) = x00 - x10 + x01 - x11;
X(1, 1, i, j) = x00 - x10 - x01 + x11;
}
}
}
Any recursive algorithm can be enhanced by increasing the size of the base case. Here is how we handled it:
void _fft2d( /* Square matrix of size N */ ) {
// base case {
if (N == 1) return;
if (N == 2) {
#define Y(y, x) (V[(y)*rowsize + (x)])
auto x00 = Y(0, 0);
auto x10 = Y(1, 0);
auto x01 = Y(0, 1);
auto x11 = Y(1, 1);
Y(0, 0) = x00 + x10 + x01 + x11;
Y(0, 1) = x00 + x10 - x01 - x11;
Y(1, 0) = x00 - x10 + x01 - x11;
Y(1, 1) = x00 - x10 - x01 + x11;
return;
}
// } base case
// ...
}
The further logical step was to eliminate creating four unnecessary temporary submatrices in each recursive step in favor of a single one. To do this, we used the concept from the algorithmica article and modified it for a two-dimensional matrix. This feature also helps us to reduce unnecessary allocations and increases the number of cache hits.
// Computing values for rev_bits[n]
auto revbits = [](size_t *v, size_t n) {
int lg_n = log2(n);
forn(i, n) {
int revi = 0;
forn(l, lg_n) revi |= ((i >> l) & 1) << (lg_n - l - 1);
v[i] = revi;
}
};
size_t *rev_n = new size_t[N], *rev_m = new size_t[M];
revbits(rev_n, N), revbits(rev_m, M);
// Transforming matrix
forn(i, N) {
int rev_i = rev_n[i];
forn(j, M) {
if ((i < rev_i) || ((i == rev_i) && (j < rev_m[j])))
std::swap(V[i * M + j], V[rev_i * M + rev_m[j]]);
}
}
Our next challenge was to precalculate the roots of unity:
int mxdim = std::max(N, M);
const int lg_dim = log2(mxdim);
auto W = new fft_type[mxdim];
auto rooti = std::polar(1., (inverse ? 2 : -2) * fft::pi / mxdim);
// Computing look-up matrix for roots
auto cur_root = rooti;
W[0] = 1;
forn (i, mxdim - 1) W[i + 1] = W[i] * cur_root;
How can we get by with just such an array? Let’s note that in the naive implementation, in the initial recursion step, we pass through an array of degrees of root. We also pass to the next recursion level, the square of this root(W[2]). On the next recursion level, we pass through the same array of powers but in increments of 2. From this observation, we can derive that on the i-th level of recursion, the step at which we will pass through the array W is 2ⁱ.
At this stage, we receive the following code:
void _fft2d(
fft_type *__restrict__ V,
size_t N,
size_t rowsize,
fft_type *__restrict__ W,
int step
) {
// base case {
if (N == 1) return;
if (N == 2) {
#define Y(y, x) (V[(y)*rowsize + (x)])
auto x00 = Y(0, 0);
auto x10 = Y(1, 0);
auto x01 = Y(0, 1);
auto x11 = Y(1, 1);
Y(0, 0) = x00 + x10 + x01 + x11;
Y(0, 1) = x00 + x10 - x01 - x11;
Y(1, 0) = x00 - x10 + x01 - x11;
Y(1, 1) = x00 - x10 - x01 + x11;
return;
}
// } base case
int n = N >> 1;
#define X(y, x, i, j) (V[((y)*n + (i)) * rowsize + ((x)*n) + j])
#define params n, rowsize, W, (step << 1)
_fft2d(&X(0, 0, 0, 0), params);
_fft2d(&X(0, 1, 0, 0), params);
_fft2d(&X(1, 0, 0, 0), params);
_fft2d(&X(1, 1, 0, 0), params);
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
auto x00 = X(0, 0, i, j);
auto x10 = X(1, 0, i, j) * W[step * i];
auto x01 = X(0, 1, i, j) * W[step * j];
auto x11 = X(1, 1, i, j) * W[step * (i + j)];
X(0, 0, i, j) = x00 + x10 + x01 + x11;
X(0, 1, i, j) = x00 + x10 - x01 - x11;
X(1, 0, i, j) = x00 - x10 + x01 - x11;
X(1, 1, i, j) = x00 - x10 - x01 + x11;
}
}
}
The original algorithm has another obvious drawback — it only handles square matrices with dimensions equal to a power of two. Using some simple modifications, we can extend it to rectangular matrices.
void _plan(
fft_type *__restrict__ V,
size_t N,
size_t M,
size_t rowsize,
fft_type *__restrict__ W,
int step_i,
int step_j
) {
// Computing square matrix
if (N == M) _fft2d(V, N, rowsize, W, step_i);
// Performing vertical split
else if (N > M) {
int n = N >> 1;
#define Y(y, i, j) (V[((y)*n + (i)) * rowsize + j])
#define params n, M, rowsize, W, (step_i << 1), step_j
_plan(&Y(0, 0, 0), params);
_plan(&Y(1, 0, 0), params);
forn (i, n) {
forn (j, M) {
auto y00 = Y(0, i, j);
auto y10 = Y(1, i, j) * W[i * step_i];
Y(0, i, j) = y00 + y10;
Y(1, i, j) = y00 - y10;
}
}
// Performing horizontal split
} else { /* ...Analogical approach... */ }
}
It will be important to mention that NumPy makes an extra allocation under FFT in its algorithm, bringing the types in it to np.complex128; if we avoid this step, we can get an advantage of about 10%. We also ended up implementing multithreading.
As a visual representation, we can deliver the table with runtimes and also a graph showing the efficiency of our work about NumPy:
Synopsis
The Russian mathematicians’ modified algorithm overtakes the “rows and columns” under the bonnet of NumPy in terms of efficiency. Several logical manipulations, such as base case increases, boosted our optimization significantly.
Crucially, the steps we performed during implementation can also be used in the case of other algorithms, which might be helpful for you in future. It is equally worth noting that while we have made a solid effort, the implementation can still be strengthened by adding padding matrices of different sizes. This post aims to share the source code, which may help improve the calculation of the transformation in all sorts of projects.
The repository link can be found below, or you can also import the package directly using the terminal:
pip3 install git+https://github.com/2D-FFT-Project/2d-fft
Thanks for your attention 🙂
Repository with the source code
Resources
- FFT, Algorithmica
- Two-dimensional fast Fourier transform: Butterfly in analog of Cooley-Tukey algorithm, by V. S. Tutatchikov for IEEE, 2016
Beating NumPy in 2DFFT was originally published in Better Programming on Medium, where people are continuing the conversation by highlighting and responding to this story.