> my speedy FFT, 3x faster than CUFFT
vvolkov
post Jun 14 2008, 12:36 PM
Post #1



From: Berkeley, CA
Org.: UC Berkeley



my speedy FFT

Hi, I'd like to share an FFT implementation that achieves 160 Gflop/s on the GeForce 8800 GTX, about 3x faster than the 50 Gflop/s that CUFFT delivers. It is designed for n = 512, which is hardcoded. The implementation also includes the n = 8 and n = 64 cases, which work in a special data layout.

Compile using CUDA 2.0 beta or later.

Vasily

Update (Sep 8, 2008): I attached a newer version that adds the n = 16, n = 256 and n = 1024 cases. Here is its performance in Gflop/s with driver 177.11 (the newer 177.84 driver is up to ~40% slower):

CODE
   n     batch   8600GTS  8800GTX  9800GTX  GTX280
   8   2097152    22       67       52       102
  16   1048576    28       89       68       125
  64    262144    35      126      104       231
 256     65536    39      143      137       221
 512     32768    41      162      154       298
1024     16384    41      159      166       260

Update (Jan 12, 2009): I attached a quickly patched version that supports forward and inverse FFTs for n = 8, 16, 64, 256, 512, 1024, 2048, 4096 and 8192. Here are results for CUDA 2.1 beta, driver 180.60:

CODE
Device: GeForce GTX 280, 1296 MHz clock, 1024 MB memory.
Compiled with CUDA 2010.
             --------CUFFT-------  ---This prototype---  ---two way---
   N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s error
   8 1048576    8.8    9.4   1.8     83.0   88.6   1.6     82.8   2.1
  16  524288   19.7   15.8   2.1     91.9   73.5   1.5     92.5   1.8
  64  131072   55.8   29.8   2.3    185.6   99.0   2.4    185.4   3.0
 256   32768   97.1   38.8   2.2    160.1   64.1   2.0    160.1   3.0
 512   16384   65.2   23.2   2.9    240.9   85.6   2.5    240.9   3.7
1024    8192   86.7   27.8   2.7    211.4   67.7   2.5    211.9   3.9
2048    4096   50.6   14.7   3.7    160.0   46.6   3.0    159.6   4.5
4096    2048   48.9   13.0   4.0    171.0   45.6   3.3    170.2   4.9
8192    1024   47.4   11.7   4.4    185.9   45.8   3.4    184.1   5.2
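
The post does not say how Gflop/s and GB/s are counted. A minimal sketch of one reading that reproduces the table: assume the usual 5 n log2(n) flop count per complex FFT, and assume memory traffic of one read plus one write of n single-precision complex values (8 bytes each). Both conventions and all function names below are assumptions, not taken from the attached code.

```python
import math

def fft_flops(n):
    """Nominal flop count for one complex FFT of size n (assumed 5 n log2 n convention)."""
    return 5 * n * math.log2(n)

def fft_bytes(n, bytes_per_element=8):
    """Assumed traffic per FFT: read n and write n single-precision complex values."""
    return 2 * n * bytes_per_element

def gbps_from_gflops(n, gflops):
    """GB/s implied by a Gflop/s figure under the two assumptions above."""
    return gflops * fft_bytes(n) / fft_flops(n)

def ffts_per_second(n, gflops):
    """Number of size-n FFTs completed per second at the given Gflop/s rate."""
    return gflops * 1e9 / fft_flops(n)

if __name__ == "__main__":
    # (N, prototype Gflop/s, prototype GB/s) rows copied from the GTX 280 table.
    for n, gflops, gbps in [(8, 83.0, 88.6), (512, 240.9, 85.6), (8192, 185.9, 45.8)]:
        print(n, round(gbps_from_gflops(n, gflops), 1), gbps)
```

Under these assumptions the computed GB/s agrees with the table to within rounding, and the same flop count turns a Gflop/s figure into FFTs per second.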

Errors are given in ULPs and are expected to be of order 1.
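
The post does not say how the error columns are computed. As a rough sketch of what "error in ULPs" can mean, the code below divides the largest absolute deviation from a reference result by the single-precision ULP at the largest reference magnitude. This normalization and the helper names are assumptions chosen for illustration; the attached code may define the error differently.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def ulp_f32(x):
    """Spacing between |x| (rounded to float32) and the next larger float32."""
    x = abs(to_f32(x))
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    nxt = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return nxt - x

def error_in_ulps(computed, reference):
    """Max absolute error, in units of the float32 ULP at the largest reference magnitude."""
    scale = max(abs(r) for r in reference)
    err = max(abs(c - r) for c, r in zip(computed, reference))
    return err / ulp_f32(scale)

if __name__ == "__main__":
    reference = [complex(1.0, 0.0), complex(0.5, -0.25)]
    computed = [complex(1.0 + 2.0 ** -23, 0.0), complex(0.5, -0.25)]
    print(error_in_ulps(computed, reference))
```

The reference would normally come from a higher-precision transform of the same input.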

Some of the data layouts used are not straightforward. If you need filtering based on element-wise multiplication by a vector in frequency space, I'd suggest storing that vector in a similar layout.
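
A small sketch of why matching layouts helps. If the spectrum comes out in some permuted order, permuting the filter vector once into the same order lets the multiplication stay element-wise, with no reordering of the spectrum. The bit-reversed order used here is only a stand-in; the actual layouts in the attached code are not described in this post, and all names below are hypothetical.

```python
def bit_reverse_permutation(n):
    """Hypothetical layout: position i holds the element with bit-reversed index (n a power of two, n >= 2)."""
    bits = n.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]

def to_layout(vec, perm):
    """Reorder a natural-order vector into the permuted layout."""
    return [vec[p] for p in perm]

def from_layout(vec_p, perm):
    """Undo to_layout: scatter permuted elements back to natural order."""
    out = [0] * len(vec_p)
    for i, p in enumerate(perm):
        out[p] = vec_p[i]
    return out

def filter_in_layout(spectrum_p, filt_p):
    """Element-wise product of a spectrum and a filter stored in the same layout."""
    return [s * h for s, h in zip(spectrum_p, filt_p)]

if __name__ == "__main__":
    n = 8
    perm = bit_reverse_permutation(n)
    spectrum = [complex(k, -k) for k in range(n)]      # stand-in for FFT output
    filt = [complex(k + 1, 0) for k in range(n)]       # frequency-domain filter
    filt_p = to_layout(filt, perm)                     # permuted once, reused for every batch
    result_p = filter_in_layout(to_layout(spectrum, perm), filt_p)
    print(from_layout(result_p, perm) == [s * h for s, h in zip(spectrum, filt)])
```

The filter is permuted once up front, so the per-batch cost is just the multiplication.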

This post has been edited by vvolkov: Jan 12 2009, 03:54 PM
Attached File(s)
Attached File  FFT_061408.zip ( 7.43K ) Number of downloads: 703
Attached File  FFT_090808.zip ( 11.04K ) Number of downloads: 541
Attached File  IFFT_011209.zip ( 19.55K ) Number of downloads: 644
 


