IPB

Welcome Guest ( Log In | Register )

 
Reply to this topicStart new topic
> difference between __threadfence_block and __syncthreads
Eremey
post Aug 5 2009, 10:59 AM
Post #1



*

Group: Members
Posts: 7
Joined: 5-August 09
Member No.: 169,186



Hi all,

forgive me my ignorance, but could somebody tell me the difference between the __threadfence_block() and __syncthreads()?
according to the CUDA programming guide 2.2.1 they both wait until all writes to global and shared memory are finished...
Go to the top of the page
 
+Quote Post
Eremey
post Aug 5 2009, 11:01 AM
Post #2



*

Group: Members
Posts: 7
Joined: 5-August 09
Member No.: 169,186



in a block, of course
Go to the top of the page
 
+Quote Post
Sarnath
post Aug 5 2009, 12:19 PM
Post #3



********

Group: Members
Posts: 1,567
Joined: 23-November 07
From: Bangalore
Member No.: 79,873
Org.: HCL Technologies





__syncthreads() must be executed by all threads - Otherwise, your kernel would hang.

but, I think , that is NOT the case with __threadfence().


--------------------
Ignorance Rules; Knowledge Liberates!
Go to the top of the page
 
+Quote Post
Gregory Diamos
post Aug 5 2009, 01:02 PM
Post #4



*****

Group: Members
Posts: 145
Joined: 7-November 08
Member No.: 125,063
Org.: Georgia Institue of Technology



QUOTE (Sarnath @ Aug 5 2009, 01:19 PM) *
__syncthreads() must be executed by all threads - Otherwise, your kernel would hang.

but, I think , that is NOT the case with __threadfence().


This is correct I think.

As an aside, has anyone ever run into a problem that required __threadfence() ? I have tried on numerous occasions to get memory transactions from other threads in the same CTA to be not visible, resulting in incorrect results with no success. I have even tried removing the __threadfence() intrinsics from the threadFenceReduction SDK example with no success. I have tried this with 2.3 on a 280GTX and a 285 GTX with similar results..

EDIT: is it possible that in current hardware, all memory transactions become visible in a consistent order to other threads in the same CTA and all threads in other concurrently running CTAs?

This post has been edited by Gregory Diamos: Aug 5 2009, 01:05 PM
Go to the top of the page
 
+Quote Post
Eremey
post Aug 5 2009, 01:47 PM
Post #5



*

Group: Members
Posts: 7
Joined: 5-August 09
Member No.: 169,186



thanks, guys!
Go to the top of the page
 
+Quote Post
Sarnath
post Aug 5 2009, 02:05 PM
Post #6



********

Group: Members
Posts: 1,567
Joined: 23-November 07
From: Bangalore
Member No.: 79,873
Org.: HCL Technologies



QUOTE (Gregory Diamos @ Aug 5 2009, 06:32 PM) *
EDIT: is it possible that in current hardware, all memory transactions become visible in a consistent order to other threads in the same CTA and all threads in other concurrently running CTAs?



For threads in same CTA, I think "yes"...

But there is a compiler bug or so.. Google "shared mem atomics sarnath cvnguyen sylvain site:http://forums.nvidia.com" and pick the first result link..


--------------------
Ignorance Rules; Knowledge Liberates!
Go to the top of the page
 
+Quote Post
tmurray
post Aug 5 2009, 04:39 PM
Post #7



********

Group: NVIDIA Employees
Posts: 2,072
Joined: 3-June 08
From: Santa Clara, CA
Member No.: 106,363
Org.: NVIDIA



QUOTE (Gregory Diamos @ Aug 5 2009, 06:02 AM) *
As an aside, has anyone ever run into a problem that required __threadfence() ?

Oh yes. smile.gif
Go to the top of the page
 
+Quote Post
Gregory Diamos
post Aug 5 2009, 05:26 PM
Post #8



*****

Group: Members
Posts: 145
Joined: 7-November 08
Member No.: 125,063
Org.: Georgia Institue of Technology



QUOTE (Sarnath @ Aug 5 2009, 03:05 PM) *
For threads in same CTA, I think "yes"...

But there is a compiler bug or so.. Google "shared mem atomics sarnath cvnguyen sylvain site:http://forums.nvidia.com" and pick the first result link..


Thanks for the link, this was very informative. I found this section particularly humorous:

QUOTE
(Sarnath @ Jun 18 2009, 11:54 AM) *
Do u mean to say the "PTX" instruction
CODE
atom.shared.cas.b32     %rv1, [%r3], %r8, %r1;

translates to a WHILE loop in cubin???


QUOTE
Yes, exactly:
CODE
join.label label7
label5: mov.acquire.b32 $p1|$r0, s[0x0010]
@$p1.lt bra.label label6
bra.label label5
label6: set.eq.u32 $p1|$o127, $r0, c1[0x0004]
mov.b32 $r1, $r0
@$p1.ne mov.b32 $r1, $r2
mov.b32 s[0x0010], $r1
mov.release.b32 s[0x0010], $r1
label7: nop.join


So my understand as of this point is that the current generation hardware enforces a global order on memory operations, which is why I cannot cause code without a __threadfence() to generate an incorrect result. However, future hardware might drop this global ordering property for performance reasons, making __threadfence() necessary for writing portable and correct code. Is this right?

This post has been edited by Gregory Diamos: Aug 5 2009, 05:36 PM
Go to the top of the page
 
+Quote Post
Sylvain Collange
post Aug 5 2009, 07:50 PM
Post #9



*****

Group: Members
Posts: 165
Joined: 7-November 07
Member No.: 77,477
Org.: Université de Perpignan



QUOTE (Gregory Diamos @ Aug 5 2009, 07:26 PM) *
I found this section particularly humorous:


Glad to see that someone understands my humor! wink.gif

QUOTE
So my understand as of this point is that the current generation hardware enforces a global order on memory operations, which is why I cannot cause code without a __threadfence() to generate an incorrect result. However, future hardware might drop this global ordering property for performance reasons, making __threadfence() necessary for writing portable and correct code. Is this right?


__threadfence() and __threadfence_block() are two different issues.

I think it is possible to write code doing inter-block communication that need __threadfence() to work properly. (Plus I believe Tim when he says so...)

A way to achieve this in a synthetic benchmark would be to generate heavy contention on one memory partition (refer to http://forums.nvidia.com/index.php?showtopic=96423 for a discussion on partition camping).
For example, make blocks 2 to n continuously read and write to the same memory partition 0, to flood it with requests. In parallel, make block 1 write a word in partition 0, then another word in partition 1. And make block 0 read both words written by block 1 at the same time.
If my intuition is correct (and the whole choreography is timed properly), you will encounter a case when block 0 reads the older value from partition 0 and the newer value from partition 1, that is in the opposite order as it was written.

Now, about __threadfence_block(). I wrote previously that it was currently implemented as NOP because memory ordering is always correct inside one block, but that might change in the future.
Actually, another explanation is that __threadfence_block() tells the compiler to avoid optimizations that may change the order of reads or writes around this boundary. Just like volatile, but more localized.


--------------------
Go to the top of the page
 
+Quote Post
enemyben88
post Nov 4 2009, 10:12 PM
Post #10



***

Group: Members
Posts: 37
Joined: 29-January 09
Member No.: 137,800



Just wondering, does any one have an example of this? I'd be interested in seeing an example where not using __threadfence results in this problem.
Go to the top of the page
 
+Quote Post
tmurray
post Nov 4 2009, 10:29 PM
Post #11



********

Group: NVIDIA Employees
Posts: 2,072
Joined: 3-June 08
From: Santa Clara, CA
Member No.: 106,363
Org.: NVIDIA



Inter-block communication during persistent CTA launches. Besides the raytracing guys and me, I don't know of anyone who does this.
Go to the top of the page
 
+Quote Post
enemyben88
post Nov 4 2009, 10:53 PM
Post #12



***

Group: Members
Posts: 37
Joined: 29-January 09
Member No.: 137,800



QUOTE (tmurray @ Nov 4 2009, 05:29 PM) *
Inter-block communication during persistent CTA launches. Besides the raytracing guys and me, I don't know of anyone who does this.



I am working on an application that uses inter-block communication, so its actually interesting to me to check this out. I tried writing a small testbench as Sylvain described with 2 blocks communicating data.

Block 1 spin-waits for Block 0 to unlock him via atomic ops. Block 0 attempts to queue up a bunch of writes in a single memory partition, then performs one last specific write to a single memory location in that same partition(=12345). Next Block 0 uses the atomic ops to let Block 1 exit the spin-wait. The first thing Block 1 does is read the single memory partition (which may have not seen the 12345 written yet, as the many memory-writes have queued it up).

However, I am always able to see the 12345 without needing the __threadfence(). The code is below, does anyone have any suggestions on how I can modify it to catch the need for __threadfence?

CODE
#include <stdio.h>
#include <stdlib.h>

#define DATA_INDEX 128 // Pick an index where we try and catch an old value
#define NUMTHREADS 256
#define PARTITION_STRIDE 384 // In words for 8/9 series, 512 for 200-series GPU
#define STRESS 10000
#define ARRAYSIZE 1000
__device__ unsigned int g_lock= 0;

__global__ void
feedforwardQ(float* g_idata, float* g_odata){

unsigned int i;
unsigned int old = 0;
unsigned int temp; //Try and catch an older write
unsigned int tid = threadIdx.x;
if(blockIdx.x==1){ // Block 1 waits for Block 0
if(threadIdx.x==0){
while(old!=1){ // Spin-wait for Block 0 to set lock
old = atomicCAS(&g_lock, 1, 0) ;
}
temp = g_idata[DATA_INDEX]; // Get indexed data immediately
}

__syncthreads();

if(threadIdx.x==0){
g_odata[DATA_INDEX] = temp; // Copy index data to output
}
else{
g_odata[threadIdx.x] = g_idata[threadIdx.x]; // Copy inputs to outputs
}
}
else{ //Block 0 queues many writes (memory camping)
for(i=0;i<STRESS;i++){
g_idata[DATA_INDEX+PARTITION_STRIDE+i%64] = tid;
}
//__threadfence();
//__syncthreads();

if(threadIdx.x==0){
g_idata[DATA_INDEX] = 12345;
}
//__threadfence(); ??
//__syncthreads();

if(threadIdx.x==0){
atomicInc(&g_lock,1);
}
}
__syncthreads();
}


int main(){
unsigned int i;
cudaError_t err;
float h_data[ARRAYSIZE];
float* d_data;
float* d_temp;

cudaSetDevice(0);

for(i=0;i<ARRAYSIZE;i++){
h_data[i] = i;
}

cudaMalloc( (void**) &d_data, sizeof(float)*ARRAYSIZE);
cudaMalloc( (void**) &d_temp, sizeof(float)*ARRAYSIZE);
cudaMemcpy(d_temp,&h_data, sizeof(float)*ARRAYSIZE,cudaMemcpyHostToDevice) ;

feedforwardQ<<< 2, NUMTHREADS>>>(d_temp,d_data);
err = cudaThreadSynchronize();
if (err != 0){
return -1;
}

cudaMemcpy(h_data, d_data, sizeof(float)*ARRAYSIZE,cudaMemcpyDeviceToHost) ;

printf("data = %f\n", h_data[DATA_INDEX]);
}
Go to the top of the page
 
+Quote Post
tmurray
post Nov 4 2009, 11:28 PM
Post #13



********

Group: NVIDIA Employees
Posts: 2,072
Joined: 3-June 08
From: Santa Clara, CA
Member No.: 106,363
Org.: NVIDIA



One variable won't need __threadfence on current hardware because of how memory is laid out (physically), but if you write to much larger chunks of memory, there is no guarantee that you won't need __threadfence.

(this gets into arcane details about CTA scheduling, partitions, and things like that)
Go to the top of the page
 
+Quote Post
enemyben88
post Nov 4 2009, 11:51 PM
Post #14



***

Group: Members
Posts: 37
Joined: 29-January 09
Member No.: 137,800



QUOTE (tmurray @ Nov 4 2009, 06:28 PM) *
One variable won't need __threadfence on current hardware because of how memory is laid out (physically), but if you write to much larger chunks of memory, there is no guarantee that you won't need __threadfence.



Thanks for the responses. What you say makes sense, and I believe I may have seen instances where I am reading/writing huge chunks of memory, and using the atomic ops to sync between blocks when the writing is "done". However, when I don't use __threadfence, its possible for a waiting-Block to get signaled by the Atomic-op variable, yet the huge chunks of writes aren't completed by the time I begin reading them.

I guess I was just trying to find/create a much smaller program that exhibits the same type of behavior.
Go to the top of the page
 
+Quote Post
tmurray
post Nov 5 2009, 12:03 AM
Post #15



********

Group: NVIDIA Employees
Posts: 2,072
Joined: 3-June 08
From: Santa Clara, CA
Member No.: 106,363
Org.: NVIDIA



The real fun starts when you get into interactions between atomic operations and __threadfence.
Go to the top of the page
 
+Quote Post
sergeyn
post Nov 5 2009, 07:53 AM
Post #16



****

Group: Members
Posts: 82
Joined: 16-May 09
Member No.: 155,172
Org.: funcom




QUOTE (tmurray @ Nov 5 2009, 02:03 AM) *
The real fun starts when you get into interactions between atomic operations and __threadfence.


How do atomic operations work with other memory load/stores? Do they make other memory requests visible to the others before/after atomic operation?
Go to the top of the page
 
+Quote Post

Reply to this topicStart new topic

 



Copyright 2008 NVIDIA Corporation.  Terms of Use | Legal Info | Privacy Policy Time is now: 24th November 2009 - 12:51 AM
Unites States Argentina Brazil Chile China Colombia France Germany India Italy Japan Korea Mexico Poland Russia Spain Taiwan United Kingdom Venezuela