![]() ![]() |
Aug 5 2009, 10:59 AM
Post
#1
|
|
![]() Group: Members Posts: 7 Joined: 5-August 09 Member No.: 169,186 |
Hi all,
forgive me my ignorance, but could somebody tell me the difference between the __threadfence_block() and __syncthreads()? according to the CUDA programming guide 2.2.1 they both wait until all writes to global and shared memory are finished... |
|
|
|
Aug 5 2009, 11:01 AM
Post
#2
|
|
![]() Group: Members Posts: 7 Joined: 5-August 09 Member No.: 169,186 |
in a block, of course
|
|
|
|
Aug 5 2009, 12:19 PM
Post
#3
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: Members Posts: 1,567 Joined: 23-November 07 From: Bangalore Member No.: 79,873 Org.: HCL Technologies |
__syncthreads() must be executed by all threads - Otherwise, your kernel would hang. but, I think , that is NOT the case with __threadfence(). -------------------- Ignorance Rules; Knowledge Liberates!
|
|
|
|
Aug 5 2009, 01:02 PM
Post
#4
|
|
![]() ![]() ![]() ![]() ![]() Group: Members Posts: 145 Joined: 7-November 08 Member No.: 125,063 Org.: Georgia Institue of Technology |
__syncthreads() must be executed by all threads - Otherwise, your kernel would hang. but, I think , that is NOT the case with __threadfence(). This is correct I think. As an aside, has anyone ever run into a problem that required __threadfence() ? I have tried on numerous occasions to get memory transactions from other threads in the same CTA to be not visible, resulting in incorrect results with no success. I have even tried removing the __threadfence() intrinsics from the threadFenceReduction SDK example with no success. I have tried this with 2.3 on a 280GTX and a 285 GTX with similar results.. EDIT: is it possible that in current hardware, all memory transactions become visible in a consistent order to other threads in the same CTA and all threads in other concurrently running CTAs? This post has been edited by Gregory Diamos: Aug 5 2009, 01:05 PM |
|
|
|
Aug 5 2009, 01:47 PM
Post
#5
|
|
![]() Group: Members Posts: 7 Joined: 5-August 09 Member No.: 169,186 |
thanks, guys!
|
|
|
|
Aug 5 2009, 02:05 PM
Post
#6
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: Members Posts: 1,567 Joined: 23-November 07 From: Bangalore Member No.: 79,873 Org.: HCL Technologies |
EDIT: is it possible that in current hardware, all memory transactions become visible in a consistent order to other threads in the same CTA and all threads in other concurrently running CTAs? For threads in same CTA, I think "yes"... But there is a compiler bug or so.. Google "shared mem atomics sarnath cvnguyen sylvain site:http://forums.nvidia.com" and pick the first result link.. -------------------- Ignorance Rules; Knowledge Liberates!
|
|
|
|
Aug 5 2009, 04:39 PM
Post
#7
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: NVIDIA Employees Posts: 2,072 Joined: 3-June 08 From: Santa Clara, CA Member No.: 106,363 Org.: NVIDIA |
|
|
|
|
Aug 5 2009, 05:26 PM
Post
#8
|
|
![]() ![]() ![]() ![]() ![]() Group: Members Posts: 145 Joined: 7-November 08 Member No.: 125,063 Org.: Georgia Institue of Technology |
For threads in same CTA, I think "yes"... But there is a compiler bug or so.. Google "shared mem atomics sarnath cvnguyen sylvain site:http://forums.nvidia.com" and pick the first result link.. Thanks for the link, this was very informative. I found this section particularly humorous: QUOTE (Sarnath @ Jun 18 2009, 11:54 AM) * Do u mean to say the "PTX" instruction CODE atom.shared.cas.b32 %rv1, [%r3], %r8, %r1; translates to a WHILE loop in cubin??? QUOTE Yes, exactly: CODE join.label label7 label5: mov.acquire.b32 $p1|$r0, s[0x0010] @$p1.lt bra.label label6 bra.label label5 label6: set.eq.u32 $p1|$o127, $r0, c1[0x0004] mov.b32 $r1, $r0 @$p1.ne mov.b32 $r1, $r2 mov.b32 s[0x0010], $r1 mov.release.b32 s[0x0010], $r1 label7: nop.join So my understand as of this point is that the current generation hardware enforces a global order on memory operations, which is why I cannot cause code without a __threadfence() to generate an incorrect result. However, future hardware might drop this global ordering property for performance reasons, making __threadfence() necessary for writing portable and correct code. Is this right? This post has been edited by Gregory Diamos: Aug 5 2009, 05:36 PM |
|
|
|
Aug 5 2009, 07:50 PM
Post
#9
|
|
![]() ![]() ![]() ![]() ![]() Group: Members Posts: 165 Joined: 7-November 07 Member No.: 77,477 Org.: Université de Perpignan |
I found this section particularly humorous: Glad to see that someone understands my humor! QUOTE So my understand as of this point is that the current generation hardware enforces a global order on memory operations, which is why I cannot cause code without a __threadfence() to generate an incorrect result. However, future hardware might drop this global ordering property for performance reasons, making __threadfence() necessary for writing portable and correct code. Is this right? __threadfence() and __threadfence_block() are two different issues. I think it is possible to write code doing inter-block communication that need __threadfence() to work properly. (Plus I believe Tim when he says so...) A way to achieve this in a synthetic benchmark would be to generate heavy contention on one memory partition (refer to http://forums.nvidia.com/index.php?showtopic=96423 for a discussion on partition camping). For example, make blocks 2 to n continuously read and write to the same memory partition 0, to flood it with requests. In parallel, make block 1 write a word in partition 0, then another word in partition 1. And make block 0 read both words written by block 1 at the same time. If my intuition is correct (and the whole choreography is timed properly), you will encounter a case when block 0 reads the older value from partition 0 and the newer value from partition 1, that is in the opposite order as it was written. Now, about __threadfence_block(). I wrote previously that it was currently implemented as NOP because memory ordering is always correct inside one block, but that might change in the future. Actually, another explanation is that __threadfence_block() tells the compiler to avoid optimizations that may change the order of reads or writes around this boundary. Just like volatile, but more localized. -------------------- |
|
|
|
Nov 4 2009, 10:12 PM
Post
#10
|
|
![]() ![]() ![]() Group: Members Posts: 37 Joined: 29-January 09 Member No.: 137,800 |
Just wondering, does any one have an example of this? I'd be interested in seeing an example where not using __threadfence results in this problem.
|
|
|
|
Nov 4 2009, 10:29 PM
Post
#11
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: NVIDIA Employees Posts: 2,072 Joined: 3-June 08 From: Santa Clara, CA Member No.: 106,363 Org.: NVIDIA |
Inter-block communication during persistent CTA launches. Besides the raytracing guys and me, I don't know of anyone who does this.
|
|
|
|
Nov 4 2009, 10:53 PM
Post
#12
|
|
![]() ![]() ![]() Group: Members Posts: 37 Joined: 29-January 09 Member No.: 137,800 |
Inter-block communication during persistent CTA launches. Besides the raytracing guys and me, I don't know of anyone who does this. I am working on an application that uses inter-block communication, so its actually interesting to me to check this out. I tried writing a small testbench as Sylvain described with 2 blocks communicating data. Block 1 spin-waits for Block 0 to unlock him via atomic ops. Block 0 attempts to queue up a bunch of writes in a single memory partition, then performs one last specific write to a single memory location in that same partition(=12345). Next Block 0 uses the atomic ops to let Block 1 exit the spin-wait. The first thing Block 1 does is read the single memory partition (which may have not seen the 12345 written yet, as the many memory-writes have queued it up). However, I am always able to see the 12345 without needing the __threadfence(). The code is below, does anyone have any suggestions on how I can modify it to catch the need for __threadfence? CODE #include <stdio.h>
#include <stdlib.h> #define DATA_INDEX 128 // Pick an index where we try and catch an old value #define NUMTHREADS 256 #define PARTITION_STRIDE 384 // In words for 8/9 series, 512 for 200-series GPU #define STRESS 10000 #define ARRAYSIZE 1000 __device__ unsigned int g_lock= 0; __global__ void feedforwardQ(float* g_idata, float* g_odata){ unsigned int i; unsigned int old = 0; unsigned int temp; //Try and catch an older write unsigned int tid = threadIdx.x; if(blockIdx.x==1){ // Block 1 waits for Block 0 if(threadIdx.x==0){ while(old!=1){ // Spin-wait for Block 0 to set lock old = atomicCAS(&g_lock, 1, 0) ; } temp = g_idata[DATA_INDEX]; // Get indexed data immediately } __syncthreads(); if(threadIdx.x==0){ g_odata[DATA_INDEX] = temp; // Copy index data to output } else{ g_odata[threadIdx.x] = g_idata[threadIdx.x]; // Copy inputs to outputs } } else{ //Block 0 queues many writes (memory camping) for(i=0;i<STRESS;i++){ g_idata[DATA_INDEX+PARTITION_STRIDE+i%64] = tid; } //__threadfence(); //__syncthreads(); if(threadIdx.x==0){ g_idata[DATA_INDEX] = 12345; } //__threadfence(); ?? //__syncthreads(); if(threadIdx.x==0){ atomicInc(&g_lock,1); } } __syncthreads(); } int main(){ unsigned int i; cudaError_t err; float h_data[ARRAYSIZE]; float* d_data; float* d_temp; cudaSetDevice(0); for(i=0;i<ARRAYSIZE;i++){ h_data[i] = i; } cudaMalloc( (void**) &d_data, sizeof(float)*ARRAYSIZE); cudaMalloc( (void**) &d_temp, sizeof(float)*ARRAYSIZE); cudaMemcpy(d_temp,&h_data, sizeof(float)*ARRAYSIZE,cudaMemcpyHostToDevice) ; feedforwardQ<<< 2, NUMTHREADS>>>(d_temp,d_data); err = cudaThreadSynchronize(); if (err != 0){ return -1; } cudaMemcpy(h_data, d_data, sizeof(float)*ARRAYSIZE,cudaMemcpyDeviceToHost) ; printf("data = %f\n", h_data[DATA_INDEX]); } |
|
|
|
Nov 4 2009, 11:28 PM
Post
#13
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: NVIDIA Employees Posts: 2,072 Joined: 3-June 08 From: Santa Clara, CA Member No.: 106,363 Org.: NVIDIA |
One variable won't need __threadfence on current hardware because of how memory is laid out (physically), but if you write to much larger chunks of memory, there is no guarantee that you won't need __threadfence.
(this gets into arcane details about CTA scheduling, partitions, and things like that) |
|
|
|
Nov 4 2009, 11:51 PM
Post
#14
|
|
![]() ![]() ![]() Group: Members Posts: 37 Joined: 29-January 09 Member No.: 137,800 |
One variable won't need __threadfence on current hardware because of how memory is laid out (physically), but if you write to much larger chunks of memory, there is no guarantee that you won't need __threadfence. Thanks for the responses. What you say makes sense, and I believe I may have seen instances where I am reading/writing huge chunks of memory, and using the atomic ops to sync between blocks when the writing is "done". However, when I don't use __threadfence, its possible for a waiting-Block to get signaled by the Atomic-op variable, yet the huge chunks of writes aren't completed by the time I begin reading them. I guess I was just trying to find/create a much smaller program that exhibits the same type of behavior. |
|
|
|
Nov 5 2009, 12:03 AM
Post
#15
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: NVIDIA Employees Posts: 2,072 Joined: 3-June 08 From: Santa Clara, CA Member No.: 106,363 Org.: NVIDIA |
The real fun starts when you get into interactions between atomic operations and __threadfence.
|
|
|
|
Nov 5 2009, 07:53 AM
Post
#16
|
|
![]() ![]() ![]() ![]() Group: Members Posts: 82 Joined: 16-May 09 Member No.: 155,172 Org.: funcom |
|
|
|
|
![]() ![]() |
| Copyright 2008 NVIDIA Corporation. Terms of Use | Legal Info | Privacy Policy | Time is now: 24th November 2009 - 12:51 AM |