![]() ![]() |
Jun 14 2008, 12:36 PM
Post
#1
|
|
![]() ![]() ![]() ![]() ![]() Group: Extranet Users Posts: 107 Joined: 6-October 07 From: Berkeley, CA Member No.: 72,970 Org.: UC Berkeley |
my speedy FFT
Hi, I'd like to share an implementation of the FFT that achieves 160 Gflop/s on the GeForce 8800 GTX, which is 3x faster than 50 Gflop/s offered by the CUFFT. It is designed for n = 512, which is hardcoded. The implementation also includes cases n = 8 and n = 64 working in a special data layout. Compile using CUDA 2.0 beta or later. Vasily Update (Sep 8, 2008): I attached a newer version that includes n = 16, n = 256 and n = 1024 cases. Here is its performance in Gflop/s when using driver 177.11 (up to ~40% slower with the newer 177.84 version): CODE ...n batch 8600GTS 8800GTX 9800GTX GTX280 ...8 2097152 22 67 52 102 ..16 1048576 28 89 68 125 ..64 262144 35 126 104 231 .256 65536 39 143 137 221 .512 32768 41 162 154 298 1024 16384 41 159 166 260 Update (Jan 12, 2009): I attached a quickly patched version that supports forward and inverse FFTs for n = 8, 16, 64, 256, 512, 1024, 2048, 4096 and 8192. Here are results for CUDA 2.1 beta, driver 180.60: CODE Device: GeForce GTX 280, 1296 MHz clock, 1024 MB memory. Compiled with CUDA 2010. --------CUFFT------- ---This prototype--- ---two way--- ...N Batch Gflop/s GB/s error Gflop/s GB/s error Gflop/s error ...8 1048576 8.8 9.4 1.8 83.0 88.6 1.6 82.8 2.1 ..16 524288 19.7 15.8 2.1 91.9 73.5 1.5 92.5 1.8 ..64 131072 55.8 29.8 2.3 185.6 99.0 2.4 185.4 3.0 .256 32768 97.1 38.8 2.2 160.1 64.1 2.0 160.1 3.0 .512 16384 65.2 23.2 2.9 240.9 85.6 2.5 240.9 3.7 1024 8192 86.7 27.8 2.7 211.4 67.7 2.5 211.9 3.9 2048 4096 50.6 14.7 3.7 160.0 46.6 3.0 159.6 4.5 4096 2048 48.9 13.0 4.0 171.0 45.6 3.3 170.2 4.9 8192 1024 47.4 11.7 4.4 185.9 45.8 3.4 184.1 5.2 Errors are supposed to be of order of 1 (ULPs). Some of the layouts used are not straightforward. If you need filtering based on element-wise multiplication by a vector in the frequency space, I'd suggest using a similar layout for the vector. This post has been edited by vvolkov: Jan 12 2009, 03:54 PM
Attached File(s)
FFT_061408.zip ( 7.43K )
Number of downloads: 667
FFT_090808.zip ( 11.04K )
Number of downloads: 490
IFFT_011209.zip ( 19.55K )
Number of downloads: 572 |
|
|
|
Jun 15 2008, 05:02 AM
Post
#2
|
|
![]() ![]() ![]() ![]() ![]() Group: Members Posts: 155 Joined: 23-March 07 Member No.: 46,540 Org.: Techniscan Medical Systems |
Very cool. Thanks for posting this. I'm excited to give it a try.
I noticed one thing that can be changed that might speed it up a little more. Change exp_i from this: CODE inline __device__ float2 exp_i( float phi ) { return make_float2( __cosf(phi), __sinf(phi) ); } to this: CODE inline __device__ float2 exp_i( float phi ) { float sint, cost; __sincosf(phi, &sint, &cost); return make_float2( cost, sint ); } I don't know if it will help or not, but I think it might after noticing the loop inside twiddle. Thanks again. |
|
|
|
Jun 15 2008, 05:26 AM
Post
#3
|
|
![]() ![]() ![]() ![]() ![]() Group: Extranet Users Posts: 107 Joined: 6-October 07 From: Berkeley, CA Member No.: 72,970 Org.: UC Berkeley |
QUOTE(jimh @ Jun 14 2008, 09:02 PM) I noticed one thing that can be changed that might speed it up a little more. Thanks for looking in the code! I tried your technique but didn't get any difference. According to decuda, both choices produce nearly identical machine code. The only difference is that a couple of instructions were reordered and few registers were renamed. It seems that __sincosf is compiled into the same instructions as pair of __cosf and __sinf. Vasily |
|
|
|
Jun 15 2008, 11:26 AM
Post
#4
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: NVIDIA Employees Posts: 883 Joined: 16-February 07 From: Santa Clara, CA Member No.: 41,162 Org.: NVIDIA |
Using sincos instead of the sin and cos will be faster for the full precision call.
|
|
|
|
Jun 15 2008, 04:26 PM
Post
#5
|
|
![]() ![]() ![]() ![]() ![]() Group: Members Posts: 155 Joined: 23-March 07 Member No.: 46,540 Org.: Techniscan Medical Systems |
Vasily, I was surprised to hear it didn't make a difference - I have some rotation code that got a nice benefit from using sincosf versus sinf and cosf. Thanks to both of you for the explanation. Turns out I use the full precision calls.
|
|
|
|
Jun 17 2008, 02:43 PM
Post
#6
|
|
![]() ![]() Group: Members Posts: 15 Joined: 30-April 08 From: Paris, France Member No.: 102,231 |
Hi,
I have a couple of questions, i'm kind of a beginner so it might not be very interesting... I've been working on DCTs recenlty, that's why i'm interested in your code. CODE __global__ void FFT512_device( float2 *work ) { � work += (blockIdx.y * gridDim.x + blockIdx.x) * 512; � � int tid = threadIdx.x; � int hi = tid>>3; � int lo = tid&7; � � float2 a[8]; � __shared__ float smem[8*8*9]; � � load( a, work + tid, 64 ); � � FFT8( a ); � � twiddle( a, tid, 512 ); � tranpose( a, &smem[hi*8+lo], 66, &smem[lo*66+hi], 8 ); � � FFT8( a ); � � twiddle( a, hi, 64 ); � tranpose( a, &smem[hi*8+lo], 8*9, &smem[hi*8*9+lo], 8, 0xE ); � � FFT8( a ); � � store( a, work + tid, 64 ); } 1) Why don't use put the "float2 a[8]" in shared mem ? Is it because of shared mem size limitation ? 2) Why don't you synchronize threads before calling FFT8 ? Or after, or even between twiddle and transpose ? I don't really understand why it should work. I have tried to launch your code on CUDA 1.1, i replaced double by float, because they aren't in CUDA 1.1. I have weird results : QUOTE Device: Quadro FX 4600, 1188 MHz clock, 767 MB memory. --------CUFFT------- ---This prototype--- N Batch Gflop/s GB/s error Gflop/s GB/s error 8 524288 3.1 3.3 1.7 56.2 59.9 1.8 64 65536 19.9 10.6 2.5 87.5 46.6 2.3 512 8192 22.7 8.1 3.1 10.8 3.8 2.2 Thanks. oYo This post has been edited by oYo2k7: Jun 17 2008, 02:47 PM |
|
|
|
Jun 18 2008, 01:17 AM
Post
#7
|
|
![]() ![]() ![]() ![]() ![]() Group: Extranet Users Posts: 107 Joined: 6-October 07 From: Berkeley, CA Member No.: 72,970 Org.: UC Berkeley |
QUOTE(oYo2k7 @ Jun 17 2008, 06:43 AM) 1) Why don't use put the "float2 a[8]" in shared mem ? Is it because of shared mem size limitation ? Because shared memory is substantially slower than registers.QUOTE(oYo2k7 @ Jun 17 2008, 06:43 AM) 2) Why don't you synchronize threads before calling FFT8 ? Or after, or even between twiddle and transpose ? I don't really understand why it should work. Synchronization is needed to coordinate access to shared memory. All shared memory accesses are done in transpose(), that performs all synchronization that is required. QUOTE(oYo2k7 @ Jun 17 2008, 06:43 AM) I have tried to launch your code on CUDA 1.1, i replaced double by float, because they aren't in CUDA 1.1. I have weird results : I was using CUDA 2.0 when writing this code. Later, I found that it runs much slower if compiled in CUDA 1.1. I could try to reoptimize the code for this earlier version of the compiler, but don't really see the point. If you still want to call it from an application code running in CUDA 1.1, you may compile this code into .cubin using CUDA 2.0 and then call it from CUDA 1.1 code using driver API. |
|
|
|
Jun 18 2008, 04:59 AM
Post
#8
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: Extranet Users Posts: 1,413 Joined: 4-March 08 Member No.: 94,948 |
QUOTE(vvolkov @ Jun 18 2008, 03:17 AM) Because shared memory is substantially slower than registers. [right][snapback]395288[/snapback][/right] But a declaration like this: float2 a[8]; will be placed in local memory, not in registers. -------------------- greets,
Denis |
|
|
|
Jun 18 2008, 05:23 AM
Post
#9
|
|
![]() ![]() ![]() ![]() ![]() Group: Extranet Users Posts: 107 Joined: 6-October 07 From: Berkeley, CA Member No.: 72,970 Org.: UC Berkeley |
QUOTE(E.D. Riedijk @ Jun 17 2008, 08:59 PM) But a declaration like this: float2 a[8]; will be placed in local memory, not in registers. [right][snapback]395368[/snapback][/right] I don't think so. All local variables are placed in registers unless spilled to local memory. FFT512_device() doesn't spill if compiled in CUDA 2.0 but spills in CUDA 1.1. FFT8_device() contains similar declaration that does not spill in either case. |
|
|
|
Jun 18 2008, 07:02 AM
Post
#10
|
|
![]() ![]() Group: Members Posts: 15 Joined: 30-April 08 From: Paris, France Member No.: 102,231 |
Ok !
Sorry for the second question, i knew you did not use shared mem except in transpose .. T_T One more thing, could you give me a typical result table ? (I cannot install CUDA 2.0 at the moment) Thx again =) |
|
|
|
Jun 18 2008, 07:31 AM
Post
#11
|
|
![]() ![]() ![]() ![]() ![]() Group: Extranet Users Posts: 107 Joined: 6-October 07 From: Berkeley, CA Member No.: 72,970 Org.: UC Berkeley |
QUOTE(oYo2k7 @ Jun 17 2008, 11:02 PM) One more thing, could you give me a typical result table ? (I cannot install CUDA 2.0 at the moment)[right][snapback]395411[/snapback][/right] Sure, here it is:CODE Device: GeForce 8800 Ultra, 1512 MHz clock, 768 MB memory.
--------CUFFT------- ---This prototype--- N Batch Gflop/s GB/s error Gflop/s GB/s error 8 524288 5.8 6.1 1.8 76.8 81.9 1.6 64 65536 34.2 18.3 2.3 142.3 75.9 2.2 512 8192 38.8 13.8 2.9 181.5 64.5 2.5 Errors are supposed to be of order of 1 (ULPs). |
|
|
|
Jun 18 2008, 12:29 PM
Post
#12
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: Extranet Users Posts: 1,413 Joined: 4-March 08 Member No.: 94,948 |
QUOTE(vvolkov @ Jun 18 2008, 07:23 AM) I don't think so. All local variables are placed in registers unless spilled to local memory. FFT512_device() doesn't spill if compiled in CUDA 2.0 but spills in CUDA 1.1. FFT8_device() contains similar declaration that does not spill in either case. [right][snapback]395376[/snapback][/right] Then you have constant indexing maybe? Any array declared inside a global function is put in local memory, _unless_ you have constant indexing (I looked briefly at your code yesterday evening, and I think you are indeed using constant indexing because your loops are unrolled, so for your case it will be put in registers indeed) This is irregardless of spilling registers over to local memory. As an example, compile a __global__ function that calls sincosf(). You will see that local memory is used for that kernel, even when using <10 registers, just because the index cannot be determined at compile time. (It is used in a code path that might not get triggered during runtime, so it might be that you will never see local load & store in profiler output) -------------------- greets,
Denis |
|
|
|
Jun 18 2008, 01:52 PM
Post
#13
|
|
![]() ![]() ![]() ![]() ![]() Group: Extranet Users Posts: 107 Joined: 6-October 07 From: Berkeley, CA Member No.: 72,970 Org.: UC Berkeley |
QUOTE(E.D. Riedijk @ Jun 18 2008, 04:29 AM) Any array declared inside a global function is put in local memory, _unless_ you have constant indexing[right][snapback]395519[/snapback][/right] I didn't know that you can use non-constant indexing for local variables. Thanks for pointing this out!
|
|
|
|
Jun 18 2008, 03:23 PM
Post
#14
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: Extranet Users Posts: 1,413 Joined: 4-March 08 Member No.: 94,948 |
well, you can, but it will be slow memory compared to shared memory
-------------------- greets,
Denis |
|
|
|
Sep 5 2008, 07:45 PM
Post
#15
|
|
![]() Group: Members Posts: 2 Joined: 5-September 08 Member No.: 118,108 |
I looked through the code and it seems like complex part of the output is not produced.
Therefore, comparison to CUDAFFT which produces true complex-to-complex FFT is not fair, is it ? Can it explain the performance advantage over FFT reported ? Do the authors have complex-to-complex version of the code ? best QUOTE(vvolkov @ Jun 14 2008, 05:36 AM) my speedy FFT Hi, I'd like to share an implementation of the FFT that achieves 160 Gflop/s on the GeForce 8800 GTX, which is 3x faster than 50 Gflop/s offered by the CUFFT. It is designed for n = 512, which is hardcoded. The implementation also includes cases n = 8 and n = 64 working in a special data layout. Compile using CUDA 2.0 beta or later. Vasily [right][snapback]393493[/snapback][/right] |
|
|
|
Sep 5 2008, 07:55 PM
Post
#16
|
|
![]() ![]() ![]() ![]() ![]() Group: Extranet Users Posts: 107 Joined: 6-October 07 From: Berkeley, CA Member No.: 72,970 Org.: UC Berkeley |
QUOTE(cudacuda321 @ Sep 5 2008, 11:45 AM) I looked through the code and it seems like complex part of the output is not produced. Therefore, comparison to CUDAFFT which produces true complex-to-complex FFT is not fair, is it ? Can it explain the performance advantage over FFT reported ? Do the authors have complex-to-complex version of the code ? best [right][snapback]435944[/snapback][/right] This is an interesting discovery as I thought that the complex part is produced. What made you think that it is not? I'd appreciate a hint. Thanks, Vasily |
|
|
|
Sep 6 2008, 06:33 AM
Post
#17
|
|
![]() Group: Members Posts: 2 Joined: 5-September 08 Member No.: 118,108 |
QUOTE(vvolkov @ Sep 5 2008, 12:55 PM) This is an interesting discovery as I thought that the complex part is produced. What made you think that it is not? I'd appreciate a hint. Thanks, Vasily [right][snapback]435947[/snapback][/right] Sorry, I was mistaken as I was looking at shared memory allocation and I expected it to be double sized but I realized the algorithm works somewhat differently. I have another question, is it easy to modify the algorithm to produce fft of size 1024 ? thanks a lot! |
|
|
|
Sep 6 2008, 07:04 AM
Post
#18
|
|
![]() ![]() ![]() ![]() ![]() Group: Extranet Users Posts: 107 Joined: 6-October 07 From: Berkeley, CA Member No.: 72,970 Org.: UC Berkeley |
QUOTE(cudacuda321 @ Sep 5 2008, 10:33 PM) Sorry, I was mistaken as I was looking at shared memory allocation and I expected it to be double sized but I realized the algorithm works somewhat differently. I have another question, is it easy to modify the algorithm to produce fft of size 1024 ? thanks a lot! [right][snapback]436143[/snapback][/right] It is fairly easy to accomplish. I have produced such algorithm by upgrading n=512 case up to radix-16. Of course, 16*16*16 > 1024, so one of the stages (the middle one) has a lower radix of 4, i.e. computes four independent radix-4 transforms instead. Higher radix requires more registers but I did manage to fit into 40 so that 3 thread blocks can run at a time per multiprocessor. It still fits into shared memory due to the particular shared memory allocation that you already found. I applied a fair effort to tune it but ultimately failed to get higher Gflop/s rate than in the n=512 case so I didn't post it here. |
|
|
|
Sep 24 2008, 05:10 PM
Post
#19
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: Members Posts: 617 Joined: 14-August 08 Member No.: 115,518 Org.: University of Alabama |
Vasily, I've read some of your papers, and noticed that some of your algorithm improvements have been included in CUBLAS 2.0; are you going to be adding the improvements from this work into the next release of CUFFT?
|
|
|
|
Sep 26 2008, 08:20 AM
Post
#20
|
|
![]() ![]() ![]() ![]() ![]() Group: Extranet Users Posts: 107 Joined: 6-October 07 From: Berkeley, CA Member No.: 72,970 Org.: UC Berkeley |
QUOTE(profquail @ Sep 24 2008, 09:10 AM) are you going to be adding the improvements from this work into the next release of CUFFT? [right][snapback]443729[/snapback][/right] I heard that NVIDIA is working on it. |
|
|
|
![]() ![]() |
| Copyright 2008 NVIDIA Corporation. Terms of Use | Legal Info | Privacy Policy | Time is now: 26th November 2009 - 04:40 PM |