You can make a real-to-complex plan which should be a little faster than the complex-to-complex plan (half the size). Just call
fftw_plan_dft_r2c_3d instead, leaving out the
FFTW_FORWARD argument because it's implied. Although that only fills half the array I think.
It may be possible to get a speed up by operating directly on the tensor object's storage, accessible via member function
data;
Beware problems involving storage order mismatches. As we've discovered before FFTW works with basically any reasonable memory layouts as long as you're careful to specify it properly.
There is a member function called Eigen::Tensor::swap_layout that could help, although the documentation says "only the default column major layout is currently fully supported, and it is therefore not recommended to attempt to use the row major layout at the moment":
https://eigen.tuxfamily.org/dox/unsupported/eigen_tensors.html
It will be beneficial to re-use FFTW plans. If you re-use them you could reasonably ask FFTW to find a particularly fast decomposition with FFTW_MEASURE.
I know basically nothing about OpenMP but read this page carefully:
https://www.fftw.org/fftw3_doc/Usage-of-Multi_002dthreaded-FFTW.html
It briefly mentions OpenMP compatibility.
The following program is definitely wrong because I didn't address storage order nor even read enough of Eigen's documentation to make sure it's reasonable. But as a smoke test, on my computer, this program takes about 3ms per FFT.
average Time in s: 0.00291012s
This is just a decent computer from late 2019. The CPU is a AMD Ryzen 9 3950X, with 16 cores.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
|
static const int nx = 128;
static const int ny = 128;
static const int nz = 128;
#include "D:\jamieal\20230227\eigen\unsupported\Eigen\CXX11\Tensor"
#include "omp.h"
#include "fftw3.h"
#include <iostream>
int main()
{
Eigen::Tensor<double, 3> Re(nx, ny, nz);
Re.setRandom();
Eigen::Tensor<std::complex<double>, 3> Im(nx, ny, nz);
fftw_init_threads();
fftw_plan_with_nthreads(omp_get_max_threads());
fftw_plan plan = fftw_plan_dft_r2c_3d(nx, ny, nz, Re.data(), reinterpret_cast<fftw_complex*>(Im.data()), FFTW_MEASURE);
Re.setRandom();
double start_time = omp_get_wtime();
for (int i = 0; i < 1000; ++i)
{
fftw_execute(plan);
}
double run_time = omp_get_wtime() - start_time;
std::cout << "average Time in s: " << run_time / 1000 << "s\n";
}
| |
You could also investigate hardware-accelerated implementations e.g.
https://docs.nvidia.com/cuda/cufft/