IPB

Welcome Guest ( Log In | Register )

6 Pages V   1 2 3 > »   
Reply to this topicStart new topic
> my speedy FFT, 3x faster than CUFFT
vvolkov
post Jun 14 2008, 12:36 PM
Post #1



*****

Group: Extranet Users
Posts: 107
Joined: 6-October 07
From: Berkeley, CA
Member No.: 72,970
Org.: UC Berkeley



my speedy FFT

Hi, I'd like to share an implementation of the FFT that achieves 160 Gflop/s on the GeForce 8800 GTX, which is 3x faster than 50 Gflop/s offered by the CUFFT. It is designed for n = 512, which is hardcoded. The implementation also includes cases n = 8 and n = 64 working in a special data layout.

Compile using CUDA 2.0 beta or later.

Vasily

Update (Sep 8, 2008): I attached a newer version that includes n = 16, n = 256 and n = 1024 cases. Here is its performance in Gflop/s when using driver 177.11 (up to ~40% slower with the newer 177.84 version):

CODE
...n     batch   8600GTS  8800GTX  9800GTX  GTX280
...8   2097152    22       67       52       102
..16   1048576    28       89       68       125
..64    262144    35      126      104       231
.256     65536    39      143      137       221
.512     32768    41      162      154       298
1024     16384    41      159      166       260

Update (Jan 12, 2009): I attached a quickly patched version that supports forward and inverse FFTs for n = 8, 16, 64, 256, 512, 1024, 2048, 4096 and 8192. Here are results for CUDA 2.1 beta, driver 180.60:

CODE
Device: GeForce GTX 280, 1296 MHz clock, 1024 MB memory.
Compiled with CUDA 2010.
             --------CUFFT-------  ---This prototype---  ---two way---
...N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error  Gflop/s error
...8 1048576    8.8    9.4   1.8     83.0   88.6   1.6     82.8   2.1
..16  524288   19.7   15.8   2.1     91.9   73.5   1.5     92.5   1.8
..64  131072   55.8   29.8   2.3    185.6   99.0   2.4    185.4   3.0
.256   32768   97.1   38.8   2.2    160.1   64.1   2.0    160.1   3.0
.512   16384   65.2   23.2   2.9    240.9   85.6   2.5    240.9   3.7
1024    8192   86.7   27.8   2.7    211.4   67.7   2.5    211.9   3.9
2048    4096   50.6   14.7   3.7    160.0   46.6   3.0    159.6   4.5
4096    2048   48.9   13.0   4.0    171.0   45.6   3.3    170.2   4.9
8192    1024   47.4   11.7   4.4    185.9   45.8   3.4    184.1   5.2

Errors are supposed to be of order of 1 (ULPs).

Some of the layouts used are not straightforward. If you need filtering based on element-wise multiplication by a vector in the frequency space, I'd suggest using a similar layout for the vector.

This post has been edited by vvolkov: Jan 12 2009, 03:54 PM
Attached File(s)
Attached File  FFT_061408.zip ( 7.43K ) Number of downloads: 667
Attached File  FFT_090808.zip ( 11.04K ) Number of downloads: 490
Attached File  IFFT_011209.zip ( 19.55K ) Number of downloads: 572
 
Go to the top of the page
 
+Quote Post
jimh
post Jun 15 2008, 05:02 AM
Post #2



*****

Group: Members
Posts: 155
Joined: 23-March 07
Member No.: 46,540
Org.: Techniscan Medical Systems



Very cool. Thanks for posting this. I'm excited to give it a try.

I noticed one thing that can be changed that might speed it up a little more. Change exp_i from this:

CODE

inline __device__ float2 exp_i( float phi )
{
   return make_float2( __cosf(phi), __sinf(phi) );
}


to this:
CODE

inline __device__ float2 exp_i( float phi )
{
   float sint, cost;
   __sincosf(phi, &sint, &cost);
   return make_float2( cost, sint );
}


I don't know if it will help or not, but I think it might after noticing the loop inside twiddle. Thanks again.
Go to the top of the page
 
+Quote Post
vvolkov
post Jun 15 2008, 05:26 AM
Post #3



*****

Group: Extranet Users
Posts: 107
Joined: 6-October 07
From: Berkeley, CA
Member No.: 72,970
Org.: UC Berkeley



QUOTE(jimh @ Jun 14 2008, 09:02 PM)
I noticed one thing that can be changed that might speed it up a little more.

Thanks for looking in the code! I tried your technique but didn't get any difference. According to decuda, both choices produce nearly identical machine code. The only difference is that a couple of instructions were reordered and few registers were renamed. It seems that __sincosf is compiled into the same instructions as pair of __cosf and __sinf.

Vasily
Go to the top of the page
 
+Quote Post
mfatica
post Jun 15 2008, 11:26 AM
Post #4



*******

Group: NVIDIA Employees
Posts: 883
Joined: 16-February 07
From: Santa Clara, CA
Member No.: 41,162
Org.: NVIDIA



Using sincos instead of the sin and cos will be faster for the full precision call.
Go to the top of the page
 
+Quote Post
jimh
post Jun 15 2008, 04:26 PM
Post #5



*****

Group: Members
Posts: 155
Joined: 23-March 07
Member No.: 46,540
Org.: Techniscan Medical Systems



Vasily, I was surprised to hear it didn't make a difference - I have some rotation code that got a nice benefit from using sincosf versus sinf and cosf. Thanks to both of you for the explanation. Turns out I use the full precision calls.
Go to the top of the page
 
+Quote Post
oYo2k7
post Jun 17 2008, 02:43 PM
Post #6



**

Group: Members
Posts: 15
Joined: 30-April 08
From: Paris, France
Member No.: 102,231



Hi,

I have a couple of questions, i'm kind of a beginner so it might not be very interesting...
I've been working on DCTs recenlty, that's why i'm interested in your code.

CODE

__global__ void FFT512_device( float2 *work )
{
�   work += (blockIdx.y * gridDim.x + blockIdx.x) * 512;
�   
�   int tid = threadIdx.x;
�   int hi = tid>>3;
�   int lo = tid&7;
�   
�   float2 a[8];
�   __shared__ float smem[8*8*9];
�   
�   load( a, work + tid, 64 );
�   
�   FFT8( a );
�   
�   twiddle( a, tid, 512 );
�   tranpose( a, &smem[hi*8+lo], 66, &smem[lo*66+hi], 8 );
�   
�   FFT8( a );
�   
�   twiddle( a, hi, 64 );
�   tranpose( a, &smem[hi*8+lo], 8*9, &smem[hi*8*9+lo], 8, 0xE );
�   
�   FFT8( a );
�   
�   store( a, work + tid, 64 );
}


1) Why don't use put the "float2 a[8]" in shared mem ? Is it because of shared mem size limitation ?
2) Why don't you synchronize threads before calling FFT8 ? Or after, or even between twiddle and transpose ? I don't really understand why it should work.

I have tried to launch your code on CUDA 1.1, i replaced double by float, because they aren't in CUDA 1.1.
I have weird results :

QUOTE
Device: Quadro FX 4600, 1188 MHz clock, 767 MB memory.
            --------CUFFT-------  ---This prototype---
  N      Batch  Gflop/s  GB/s  error  Gflop/s  GB/s  error
  8    524288    3.1      3.3    1.7    56.2  59.9    1.8
  64  65536    19.9    10.6    2.5    87.5  46.6    2.3
512    8192    22.7      8.1    3.1    10.8    3.8      2.2


Thanks.

oYo

This post has been edited by oYo2k7: Jun 17 2008, 02:47 PM
Go to the top of the page
 
+Quote Post
vvolkov
post Jun 18 2008, 01:17 AM
Post #7



*****

Group: Extranet Users
Posts: 107
Joined: 6-October 07
From: Berkeley, CA
Member No.: 72,970
Org.: UC Berkeley



QUOTE(oYo2k7 @ Jun 17 2008, 06:43 AM)
1) Why don't use put the "float2 a[8]" in shared mem ? Is it because of shared mem size limitation ?
Because shared memory is substantially slower than registers.

QUOTE(oYo2k7 @ Jun 17 2008, 06:43 AM)
2) Why don't you synchronize threads before calling FFT8 ? Or after, or even between twiddle and transpose ? I don't really understand why it should work.

Synchronization is needed to coordinate access to shared memory. All shared memory accesses are done in transpose(), that performs all synchronization that is required.

QUOTE(oYo2k7 @ Jun 17 2008, 06:43 AM)
I have tried to launch your code on CUDA 1.1, i replaced double by float, because they aren't in CUDA 1.1. I have weird results :

I was using CUDA 2.0 when writing this code. Later, I found that it runs much slower if compiled in CUDA 1.1. I could try to reoptimize the code for this earlier version of the compiler, but don't really see the point.

If you still want to call it from an application code running in CUDA 1.1, you may compile this code into .cubin using CUDA 2.0 and then call it from CUDA 1.1 code using driver API.
Go to the top of the page
 
+Quote Post
E.D. Riedijk
post Jun 18 2008, 04:59 AM
Post #8



********

Group: Extranet Users
Posts: 1,413
Joined: 4-March 08
Member No.: 94,948



QUOTE(vvolkov @ Jun 18 2008, 03:17 AM)
Because shared memory is substantially slower than registers.
[right][snapback]395288[/snapback][/right]


But a declaration like this:
float2 a[8];
will be placed in local memory, not in registers.


--------------------
greets,
Denis
Go to the top of the page
 
+Quote Post
vvolkov
post Jun 18 2008, 05:23 AM
Post #9



*****

Group: Extranet Users
Posts: 107
Joined: 6-October 07
From: Berkeley, CA
Member No.: 72,970
Org.: UC Berkeley



QUOTE(E.D. Riedijk @ Jun 17 2008, 08:59 PM)
But a declaration like this:
float2 a[8];
will be placed in local memory, not in registers.
[right][snapback]395368[/snapback][/right]

I don't think so. All local variables are placed in registers unless spilled to local memory. FFT512_device() doesn't spill if compiled in CUDA 2.0 but spills in CUDA 1.1. FFT8_device() contains similar declaration that does not spill in either case.
Go to the top of the page
 
+Quote Post
oYo2k7
post Jun 18 2008, 07:02 AM
Post #10



**

Group: Members
Posts: 15
Joined: 30-April 08
From: Paris, France
Member No.: 102,231



Ok !

Sorry for the second question, i knew you did not use shared mem except in transpose .. T_T

One more thing, could you give me a typical result table ? (I cannot install CUDA 2.0 at the moment)


Thx again =)
Go to the top of the page
 
+Quote Post
vvolkov
post Jun 18 2008, 07:31 AM
Post #11



*****

Group: Extranet Users
Posts: 107
Joined: 6-October 07
From: Berkeley, CA
Member No.: 72,970
Org.: UC Berkeley



QUOTE(oYo2k7 @ Jun 17 2008, 11:02 PM)
One more thing, could you give me a typical result table ? (I cannot install CUDA 2.0 at the moment)[right][snapback]395411[/snapback][/right]
Sure, here it is:

CODE
Device: GeForce 8800 Ultra, 1512 MHz clock, 768 MB memory.
            --------CUFFT-------  ---This prototype---
  N   Batch Gflop/s  GB/s  error  Gflop/s  GB/s  error
  8  524288    5.8    6.1   1.8     76.8   81.9   1.6
 64   65536   34.2   18.3   2.3    142.3   75.9   2.2
512    8192   38.8   13.8   2.9    181.5   64.5   2.5

Errors are supposed to be of order of 1 (ULPs).
Go to the top of the page
 
+Quote Post
E.D. Riedijk
post Jun 18 2008, 12:29 PM
Post #12



********

Group: Extranet Users
Posts: 1,413
Joined: 4-March 08
Member No.: 94,948



QUOTE(vvolkov @ Jun 18 2008, 07:23 AM)
I don't think so. All local variables are placed in registers unless spilled to local memory. FFT512_device() doesn't spill if compiled in CUDA 2.0 but spills in CUDA 1.1. FFT8_device() contains similar declaration that does not spill in either case.
[right][snapback]395376[/snapback][/right]


Then you have constant indexing maybe? Any array declared inside a global function is put in local memory, _unless_ you have constant indexing (I looked briefly at your code yesterday evening, and I think you are indeed using constant indexing because your loops are unrolled, so for your case it will be put in registers indeed) This is irregardless of spilling registers over to local memory.

As an example, compile a __global__ function that calls sincosf(). You will see that local memory is used for that kernel, even when using <10 registers, just because the index cannot be determined at compile time. (It is used in a code path that might not get triggered during runtime, so it might be that you will never see local load & store in profiler output)


--------------------
greets,
Denis
Go to the top of the page
 
+Quote Post
vvolkov
post Jun 18 2008, 01:52 PM
Post #13



*****

Group: Extranet Users
Posts: 107
Joined: 6-October 07
From: Berkeley, CA
Member No.: 72,970
Org.: UC Berkeley



QUOTE(E.D. Riedijk @ Jun 18 2008, 04:29 AM)
Any array declared inside a global function is put in local memory, _unless_ you have constant indexing[right][snapback]395519[/snapback][/right]
I didn't know that you can use non-constant indexing for local variables. Thanks for pointing this out!
Go to the top of the page
 
+Quote Post
E.D. Riedijk
post Jun 18 2008, 03:23 PM
Post #14



********

Group: Extranet Users
Posts: 1,413
Joined: 4-March 08
Member No.: 94,948



well, you can, but it will be slow memory compared to shared memory wink.gif


--------------------
greets,
Denis
Go to the top of the page
 
+Quote Post
cudacuda321
post Sep 5 2008, 07:45 PM
Post #15



*

Group: Members
Posts: 2
Joined: 5-September 08
Member No.: 118,108



I looked through the code and it seems like complex part of the output is not produced.
Therefore, comparison to CUDAFFT which produces true complex-to-complex FFT is not fair, is it ?
Can it explain the performance advantage over FFT reported ?

Do the authors have complex-to-complex version of the code ?

best


QUOTE(vvolkov @ Jun 14 2008, 05:36 AM)
my speedy FFT

Hi, I'd like to share an implementation of the FFT that achieves 160 Gflop/s on the GeForce 8800 GTX, which is 3x faster than 50 Gflop/s offered by the CUFFT. It is designed for n = 512, which is hardcoded. The implementation also includes cases n = 8 and n = 64 working in a special data layout.

Compile using CUDA 2.0 beta or later.

Vasily
[right][snapback]393493[/snapback][/right]

Go to the top of the page
 
+Quote Post
vvolkov
post Sep 5 2008, 07:55 PM
Post #16



*****

Group: Extranet Users
Posts: 107
Joined: 6-October 07
From: Berkeley, CA
Member No.: 72,970
Org.: UC Berkeley



QUOTE(cudacuda321 @ Sep 5 2008, 11:45 AM)
I looked through the code and it seems like complex part of the output is not produced.
Therefore, comparison to CUDAFFT which produces true complex-to-complex FFT is not fair, is it ?
Can it explain the performance advantage over FFT reported ?

Do the authors have complex-to-complex version of the code ?

best
[right][snapback]435944[/snapback][/right]


This is an interesting discovery as I thought that the complex part is produced. What made you think that it is not? I'd appreciate a hint.
Thanks,
Vasily
Go to the top of the page
 
+Quote Post
cudacuda321
post Sep 6 2008, 06:33 AM
Post #17



*

Group: Members
Posts: 2
Joined: 5-September 08
Member No.: 118,108



QUOTE(vvolkov @ Sep 5 2008, 12:55 PM)
This is an interesting discovery as I thought that the complex part is produced. What made you think that it is not? I'd appreciate a hint.
Thanks,
Vasily
[right][snapback]435947[/snapback][/right]


Sorry, I was mistaken as I was looking at shared memory allocation and I expected it to be double sized but I realized the algorithm works somewhat differently.

I have another question, is it easy to modify the algorithm to produce fft of size 1024 ?

thanks a lot!

Go to the top of the page
 
+Quote Post
vvolkov
post Sep 6 2008, 07:04 AM
Post #18



*****

Group: Extranet Users
Posts: 107
Joined: 6-October 07
From: Berkeley, CA
Member No.: 72,970
Org.: UC Berkeley



QUOTE(cudacuda321 @ Sep 5 2008, 10:33 PM)
Sorry, I was mistaken as I was looking at shared memory allocation and I expected it to be double sized but I realized the algorithm works somewhat differently.

I have another question, is it easy to modify the algorithm to produce fft of size 1024 ?

thanks a lot!
[right][snapback]436143[/snapback][/right]


It is fairly easy to accomplish. I have produced such algorithm by upgrading n=512 case up to radix-16. Of course, 16*16*16 > 1024, so one of the stages (the middle one) has a lower radix of 4, i.e. computes four independent radix-4 transforms instead. Higher radix requires more registers but I did manage to fit into 40 so that 3 thread blocks can run at a time per multiprocessor. It still fits into shared memory due to the particular shared memory allocation that you already found. I applied a fair effort to tune it but ultimately failed to get higher Gflop/s rate than in the n=512 case so I didn't post it here.
Go to the top of the page
 
+Quote Post
profquail
post Sep 24 2008, 05:10 PM
Post #19



*******

Group: Members
Posts: 617
Joined: 14-August 08
Member No.: 115,518
Org.: University of Alabama



Vasily, I've read some of your papers, and noticed that some of your algorithm improvements have been included in CUBLAS 2.0; are you going to be adding the improvements from this work into the next release of CUFFT?
Go to the top of the page
 
+Quote Post
vvolkov
post Sep 26 2008, 08:20 AM
Post #20



*****

Group: Extranet Users
Posts: 107
Joined: 6-October 07
From: Berkeley, CA
Member No.: 72,970
Org.: UC Berkeley



QUOTE(profquail @ Sep 24 2008, 09:10 AM)
are you going to be adding the improvements from this work into the next release of CUFFT?
[right][snapback]443729[/snapback][/right]


I heard that NVIDIA is working on it.
Go to the top of the page
 
+Quote Post

6 Pages V   1 2 3 > » 
Reply to this topicStart new topic

 



Copyright 2008 NVIDIA Corporation.  Terms of Use | Legal Info | Privacy Policy Time is now: 26th November 2009 - 04:40 PM
Unites States Argentina Brazil Chile China Colombia France Germany India Italy Japan Korea Mexico Poland Russia Spain Taiwan United Kingdom Venezuela