Radix-2 Fast Fourier Transform

This is an introduction to the famous Fast Fourier Transform algorithm devised by Cooley and Tukey in the sixties. The goal of the FFT algorithm is to compute the Discrete Fourier Transform (DFT) in $O(n \log n)$ time, significantly improving on the naive $O(n^2)$ implementation.

DFT recap

The definition of the DFT is

\begin{equation} X_k = \sum^{N-1}_{n=0}x_n \cdot e^{-j\frac{2\pi}{N}kn} \end{equation}

This can be expanded with Euler's formula to \begin{equation} X_k = \sum^{N-1}_{n=0}x_n \cdot \left(\cos\left(\frac{2\pi}{N}kn\right) - j \sin\left(\frac{2\pi}{N}kn\right)\right) \end{equation}
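As a point of reference, here is a minimal sketch of the naive $O(n^2)$ DFT written directly from the definition above (the name dft_naive is just for illustration):

```python
import numpy as np

def dft_naive(x):
    # Direct O(n^2) DFT, straight from the definition; reference only.
    x = np.asarray(x, dtype=complex)
    N = len(x)
    X = np.zeros(N, dtype=complex)
    for k in range(N):
        for n in range(N):
            X[k] += x[n] * np.exp(-2j * np.pi * k * n / N)
    return X
```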

Matrix form of the DFT

We can also write the exponential term more compactly as

\begin{equation} e^{-j\frac{2\pi}{N}kn} = W_N^{nk} \end{equation}

Quite simply, $W_N$ is an N by N matrix: each row of it is multiplied element-wise by the x vector and summed up to get one transformed data point.

Now the matrix operation version of the DFT looks like

\begin{equation} X_k = \sum_{n=0}^{N-1} x_n W_N^{nk} \quad\Longleftrightarrow\quad X = W_N \cdot x \end{equation}

\begin{equation} \begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{bmatrix}= \begin{bmatrix} W^{00} & W^{01} & W^{02} & W^{03} \\ W^{10} & W^{11} & W^{12} & W^{13} \\ W^{20} & W^{21} & W^{22} & W^{23} \\ W^{30} & W^{31} & W^{32} & W^{33} \\ \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} \end{equation}

The key thing to understand here is that the W matrix does not depend on the value vector x, but only on the integer indices n and k. Here we can instantly see an improvement: since $nk = kn$, the entries $W^{nk}$ and $W^{kn}$ must be identical, so the matrix is symmetric.

We can write the W matrix out for a simple data set of N = 4, using Python to compute the entries.
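A minimal numpy sketch of that computation:

```python
import numpy as np

N = 4
n = np.arange(N)
# np.outer(n, n) builds the full matrix of index products n*k at once.
W = np.exp(-2j * np.pi * np.outer(n, n) / N)
print(np.round(W))
```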

\begin{equation} W_N^{nk} = e^{-j\frac{2\pi}{N}kn} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -j & -1 & j \\ 1 & -1 & 1 & -1 \\ 1 & j & -1 & -j \\ \end{bmatrix} \end{equation}

Here we can confirm that the matrix is symmetric about the diagonal, depending only on the product of k and n. Furthermore, since the entries depend only on $nk \bmod N$, the matrix W can contain at most N distinct values, so calculating the required values can easily be done in $O(n)$ time. Note that here we could simply get the DFT by multiplying this W matrix with the input vector x.
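A quick sanity check of that last claim, reusing the W matrix computed above on some made-up data:

```python
x = np.array([1.0, 2.0, 3.0, 4.0])   # arbitrary example input
X = W @ x                            # DFT as a matrix-vector product
print(np.allclose(X, np.fft.fft(x)))  # expect True
```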

Now we know that we can roughly halve the work of building the W matrix, but the algorithm still takes $O(n^2)$ time because we still have to calculate the product of the W matrix and the x vector. So this doesn't really solve anything on its own, but it has shown us the redundancy that exists; now we just need to eliminate it.

Divide and conquer

For simplicity, we start by assuming that the data contains $2^m$ values, i.e. that N is a power of two. A divide and conquer algorithm can be used to drastically reduce the amount of required computation. The approach is based on splitting the input data into two groups, the even-indexed and the odd-indexed values, both groups having N/2 values.

\begin{equation} X_k = E_k + W^k_N O_k \end{equation}

where $E_k$ comes from the sum over the even-indexed values and $O_k$ from the odd-indexed values.

For even values:

\begin{equation} E_k = \sum_{n=0}^{N/2-1}x_{2n}W_{N}^{2kn} \end{equation}

And for odd values, with the common factor $W_N^k$ factored out in front:

\begin{equation} W_N^k O_{k} = \sum_{n=0}^{N/2-1}x_{2n+1}W_{N}^{(2n+1)k} \end{equation}

The most important detail is in the indexing of the W matrix. As we previously noticed, the indexing of the W matrix only depends on the values of k and n. Let's see what this means for the exponential terms.

The even indexing: \begin{equation} W_{N}^{2kn} = e^{-j\frac{2\pi}{N}2kn} = e^{-j\frac{2\pi}{N/2}kn} = W_{N/2}^{kn} \end{equation}

For odd indexing: \begin{equation} W_{N}^{(2n+1)k} = e^{-j\frac{2\pi}{N}(2n+1)k} = e^{-j\frac{2\pi}{N/2}kn}e^{-j\frac{2\pi}{N}k} = W_{N/2}^{kn}W_{N}^k \end{equation}

This means that we just effectively reduced the size of the W matrix from N x N to N/2 x N/2! It is now clear that repeating this split halves the size at every step, so only a logarithmic number of steps is needed to reach trivially small matrices.

Now we do this until the segment length reaches 2, and the divide and conquer is ready. The W matrices are now so small that precomputing them is trivial. We can also further exploit the symmetry of the W values: since $W_N^{N/2} = e^{-j\pi} = -1$, we have $W_N^{k+N/2} = -W_N^{k}$, meaning the last half of the $W^k_{N}$ vector is equal to the negative of the first half. For our split DFT this means

First half: \begin{equation} X_k = E_k + W^k_N O_k, \quad 0 \le k < N/2 \end{equation}

Second half: \begin{equation} X_{k+N/2} = E_k - W^k_N O_k, \quad 0 \le k < N/2 \end{equation}

Then we can just concatenate the two halves to obtain the full $X$ vector.

Note that we assume here that the size N is a power of two (hence Radix-2 FFT). We are ready to implement the algorithm using recursion.
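A minimal recursive sketch of the algorithm (the name fft_recursive is just for illustration; the input length is assumed to be a power of two):

```python
import numpy as np

def fft_recursive(x):
    # Radix-2 FFT; assumes len(x) is a power of two.
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    # Transform the even- and odd-indexed halves recursively.
    E = fft_recursive(x[0::2])
    O = fft_recursive(x[1::2])
    # Twiddle factors W_N^k for k = 0, ..., N/2 - 1.
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    # Butterfly step: the second half uses W_N^{k + N/2} = -W_N^k.
    return np.concatenate([E + W * O, E - W * O])
```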

For fun, we can now test our FFT against the built-in FFT in the numpy library.
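For example, on random test data (assuming the fft_recursive sketch above):

```python
x = np.random.random(1024)
print(np.allclose(fft_recursive(x), np.fft.fft(x)))  # expect True
```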

Further optimization by removing recursion

Success, we managed to create an $O(n \log n)$ algorithm. However, we are still quite far away from numpy's version. From my experience with computer science I know that recursion is costly: function calls are slow, and with long enough inputs we run into memory issues on the call stack. So we should improve the algorithm by removing the recursion.

In this version of the algorithm, we start from the bottom up. We reshape the input into segments of the length at which we would stop splitting the arrays in the recursive version, and calculate the DFTs of all of the small segments simultaneously. The rest of the calculation is then just a matter of combining the subarrays using the same method as in the recursive version, as sketched below.
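A minimal non-recursive sketch along those lines (fft_iterative is an illustrative name; the input length is again assumed to be a power of two, at least 2):

```python
import numpy as np

def fft_iterative(x):
    x = np.asarray(x, dtype=complex)
    N = len(x)
    # Reshaping to (2, N/2) puts each length-2 leaf segment in a column:
    # column j holds x[j] and x[j + N/2], matching the recursive even/odd split.
    X = x.reshape((2, -1))
    # Length-2 DFT of every column at once: (a, b) -> (a + b, a - b).
    X = np.vstack([X[0] + X[1], X[0] - X[1]])
    # Merge pairs of sub-transforms until a single full-length transform remains.
    while X.shape[0] < N:
        m = X.shape[0]               # current sub-transform length
        E = X[:, :X.shape[1] // 2]   # DFTs of the even-indexed subsequences
        O = X[:, X.shape[1] // 2:]   # DFTs of the odd-indexed subsequences
        W = np.exp(-1j * np.pi * np.arange(m) / m)[:, None]  # W_{2m}^k
        X = np.vstack([E + W * O, E - W * O])
    return X.ravel()
```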

Closing remarks

This is good enough for understanding the algorithm, but it could still be optimized a lot: the computation of the W vectors, or "twiddle factors", is still repeated a lot in the nonrecursive version, and computing them more effectively would speed up the algorithm. Also, with larger input sizes memory access becomes an issue, and that can't easily be optimized using Python (numpy relies on backends implemented in more low-level languages).
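As one example of that direction, here is a minimal sketch that precomputes all the twiddle factors once and takes a strided slice per stage (a common trick, not part of the implementation above; fft_iterative_cached is an illustrative name):

```python
import numpy as np

def fft_iterative_cached(x):
    # Same algorithm as fft_iterative above, but the twiddle factors
    # are computed once up front instead of once per merge stage.
    x = np.asarray(x, dtype=complex)
    N = len(x)
    # All W_N^k for k = 0..N/2-1; every stage's factors are a slice of this.
    W_full = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    X = x.reshape((2, -1))
    X = np.vstack([X[0] + X[1], X[0] - X[1]])
    while X.shape[0] < N:
        m = X.shape[0]
        E = X[:, :X.shape[1] // 2]
        O = X[:, X.shape[1] // 2:]
        # W_{2m}^k = W_N^{k * N/(2m)}: a stride-N/(2m) slice of the table.
        W = W_full[::N // (2 * m)][:, None]
        X = np.vstack([E + W * O, E - W * O])
    return X.ravel()
```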