IPB

Welcome Guest ( Log In | Register )

 
Reply to this topicStart new topic
> Why shared memory is slower than global memory with gradient computation?
Linh Ha
post Nov 6 2009, 06:18 PM
Post #1



*****

Group: Members
Posts: 116
Joined: 28-August 07
Member No.: 67,654
Org.: University of Utah



I have two functions to compute gradient of 3D image. One function use the global memory directly, other use shared memory.
When I benchmark the speed of two function, the global memory function is 1.2Mpixels /second while the shared memory only achieves 170k.
the result runs on the Tesla S1070 with input size 256^3

So my question is why the shared memory version uses much less memory bandwidth but much slower (about 7 time slower) than the global version ?
What is the best practice for the stencil buffer problem, like gradient or FDM ? When we should use shared memory, when you should not

Any ideas are greatly appreciated!

Here are the codes

CODE

template<int wrap>
__global__ void cudaComputeGradient_kernel(float * d_gdx, float * d_gdy, float * d_gdz,
float * d_i,
int w, int h, int l,
float sx, float sy, float sz)

{
int i = threadIdx.x + blockIdx.x * blockDim.x;
int j = threadIdx.y + blockIdx.y * blockDim.y;

int wh = w * h;
int id = j * w + i;

if (i<w && j < h)
for (int k=0 ; k < l; ++k, id += wh){

if (i == 0)
d_gdx[id] = (d_i[id+1] - d_i[id]) / sx;
else if (i == w - 1)
d_gdx[id] = (d_i[id] - d_i[id-1]) / sx;
else
d_gdx[id] = (d_i[id+1] - d_i[id-1]) / (2.f * sx);

if (j == 0)
d_gdy[id] = (d_i[id+w] - d_i[id]) / sy;
else if (j == h - 1)
d_gdy[id] = (d_i[id] - d_i[id-w]) / sy;
else
d_gdy[id] = (d_i[id+w] - d_i[id-w]) / (2.f * sy);

if (k == 0)
d_gdz[id] = (d_i[id+wh] - d_i[id]) / sz;
else if (k == l - 1)
d_gdz[id] = (d_i[id] - d_i[id-wh]) / sz;
else
d_gdz[id] = (d_i[id+wh] - d_i[id-wh]) / (2.f * sz);
}
}

void cudaComputeGradient(float * d_gdx, float * d_gdy, float * d_gdz, // out put gradient
float * d_i, // input image
uint w, uint h, uint l, // size of image
float sx, float sy, float sz) // spacing
{
dim3 threads(16,16);
dim3 grids(iDivUp16(w), iDivUp16(h));

cudaComputeGradient_kernel<<<grids, threads>>>(d_gdx, d_gdy, d_gdz, d_i,
w, h, l,
sx, sy, sz);
}



CODE

__global__ void cudaComputeGradient_shared_kernel(float * d_gdx, float * d_gdy, float * d_gdz,
float * d_i,
int w, int h, int l,
float sx, float sy, float sz)

{
const int radius = 1;

float s_data[16 + radius * 2][16 + radius * 2];

int i = threadIdx.x + blockIdx.x * blockDim.x;
int j = threadIdx.y + blockIdx.y * blockDim.y;


if (i < w && j < h){

int planeSize = w* h;
int id = j * w + i;

float back = 0;
float current = 0;
float front = d_i[id];

for (int k=0; k< l; ++k){
back = current;
current = front;
front = d_i[id + planeSize];

__syncthreads();

int tx = threadIdx.x + radius;
int ty = threadIdx.y + radius;

if (threadIdx.x < radius){
s_data[ty][threadIdx.x] = d_i[id - radius];
s_data[ty][threadIdx.x + 16 + radius] = d_i[id + 16];
}

if (threadIdx.y < radius){
s_data[threadIdx.y][tx] = d_i[id - radius * w];
s_data[threadIdx.y + 16 + radius][tx] = d_i[id + 16 * w];
}


s_data[ty][tx] = current;

__syncthreads();

if (i == 0)
d_gdx[id] = (s_data[ty][tx+1] -s_data[ty][tx]) / sx;
else if (i == w - 1)
d_gdx[id] = (s_data[ty][tx] -s_data[ty][tx-1]) / sx;
else
d_gdx[id] = (s_data[ty][tx+1] -s_data[ty][tx-1])/ (2 * sx);

if (j == 0)
d_gdy[id] = (s_data[ty+1][tx] -s_data[ty][tx]) /sy;
else if (j == h - 1)
d_gdy[id] = (s_data[ty][tx] -s_data[ty-1][tx]) /sy;
else
d_gdy[id] = (s_data[ty+1][tx] -s_data[ty-1][tx])/(2 * sy);

if (k==0)
d_gdz[id] = (front - current)/sz;
else if (k == l - 1)
d_gdz[id] = (current - back) /sz;
else
d_gdz[id] = (front - back) /(2 * sz);

id += planeSize;
}
}
}
void cudaComputeGradient_shared(float * d_gdx, float * d_gdy, float * d_gdz, // out put gradient
float * d_i, // input image
uint w, uint h, uint l, // size of image
float sx, float sy, float sz) // spacing
{
dim3 threads(16,16);
dim3 grids(iDivUp16(w), iDivUp16(h));

cudaComputeGradient_shared_kernel<<<grids, threads>>>(d_gdx, d_gdy, d_gdz, d_i,
w, h, l,
sx, sy, sz);
}



This post has been edited by Linh Ha: Nov 8 2009, 03:06 PM
Go to the top of the page
 
+Quote Post
cbuchner1
post Nov 6 2009, 09:34 PM
Post #2



*******

Group: Members
Posts: 692
Joined: 4-April 06
From: Karlsruhe / Munich, Germany
Member No.: 18,632
Org.: Nomor Research GmbH



In your second example, did you really combine shared memory with texture reads?

Doing that does not seem to make much sense. The texture reads already solve the bandwidth issue caused by reading the same pixel several times, eliminating the need for shared memory use.

I suggest you analyze your shared memory access patterns for bank conflicts, for example with the bank checker macro of cutil. It's not perfect, but in give some clues about what is going on here.

Christian

This post has been edited by cbuchner1: Nov 6 2009, 09:36 PM
Go to the top of the page
 
+Quote Post
avidday
post Nov 6 2009, 09:41 PM
Post #3



*******

Group: Members
Posts: 814
Joined: 1-April 09
Member No.: 148,556



The second example also contains a number of full 32 bit integer multiplies (and integer->float casts) which are going to be a lot slower than the code in the first kernel. I have used the same basic shared memory idea (sans the texture reads) in 3D finite difference kernels and my (probably crappy) version hits about 90Gb/s on a GTX 275 for second order compact stencil, so there is nothing wrong with the idea.

I feel compelled to add that those scrolling code boxes are just evil.
Go to the top of the page
 
+Quote Post
Cygnus X1
post Nov 6 2009, 09:48 PM
Post #4



*****

Group: Members
Posts: 160
Joined: 30-October 08
From: Saarbruecken, Germany & Kraków, Poland
Member No.: 124,006
Org.: Saarland University



@cbuchner1: There is no getTexVal function in the manual. I think it is just some __device__ function

@Linh Ha: You are not using shared memory in the second example!
If you want s_data to reside in shared memory you have to write __shared__ before type specifier:
CODE
__shared__ float s_data[dimX][dimY]


The way you have written it right now, there is a separate array for each thread, and it is residing in local memory. Local memory is approximately as slow as global one.

Also I would suggest precomputing a recipocial of sx, sy, sz and where you compute d_gdx and d_gdy values, multiplying by that precomputed value instead. Division is much slower that multiplication.

This post has been edited by Cygnus X1: Nov 6 2009, 09:49 PM
Go to the top of the page
 
+Quote Post
Linh Ha
post Nov 8 2009, 03:09 PM
Post #5



*****

Group: Members
Posts: 116
Joined: 28-August 07
Member No.: 67,654
Org.: University of Utah



QUOTE (cbuchner1 @ Nov 6 2009, 03:34 PM) *
In your second example, did you really combine shared memory with texture reads?

Doing that does not seem to make much sense. The texture reads already solve the bandwidth issue caused by reading the same pixel several times, eliminating the need for shared memory use.


Christian


I did try texture. Agree that texture memory access partially solve the problem. I have tried with texture, it is not significantly faster than global mem (since this one also satisfies the coalesced condition)
Go to the top of the page
 
+Quote Post
Linh Ha
post Nov 8 2009, 03:15 PM
Post #6



*****

Group: Members
Posts: 116
Joined: 28-August 07
Member No.: 67,654
Org.: University of Utah



QUOTE (Cygnus X1 @ Nov 6 2009, 03:48 PM) *
@Linh Ha: You are not using shared memory in the second example!
If you want s_data to reside in shared memory you have to write __shared__ before type specifier:
CODE
__shared__ float s_data[dimX][dimY]

You are right, i forgot that. It explains the problem, i will check the performance. Thanks
Go to the top of the page
 
+Quote Post
Linh Ha
post Nov 9 2009, 01:31 AM
Post #7



*****

Group: Members
Posts: 116
Joined: 28-August 07
Member No.: 67,654
Org.: University of Utah



With only a small change as suggested
__shared__ float s_data, it is twice as fast. Thank everyone
Go to the top of the page
 
+Quote Post

Reply to this topicStart new topic

 



Copyright 2008 NVIDIA Corporation.  Terms of Use | Legal Info | Privacy Policy Time is now: 24th November 2009 - 01:35 AM
Unites States Argentina Brazil Chile China Colombia France Germany India Italy Japan Korea Mexico Poland Russia Spain Taiwan United Kingdom Venezuela