![]() ![]() |
Nov 6 2009, 06:18 PM
Post
#1
|
|
![]() ![]() ![]() ![]() ![]() Group: Members Posts: 116 Joined: 28-August 07 Member No.: 67,654 Org.: University of Utah |
I have two functions to compute gradient of 3D image. One function use the global memory directly, other use shared memory.
When I benchmark the speed of two function, the global memory function is 1.2Mpixels /second while the shared memory only achieves 170k. the result runs on the Tesla S1070 with input size 256^3 So my question is why the shared memory version uses much less memory bandwidth but much slower (about 7 time slower) than the global version ? What is the best practice for the stencil buffer problem, like gradient or FDM ? When we should use shared memory, when you should not Any ideas are greatly appreciated! Here are the codes CODE template<int wrap> __global__ void cudaComputeGradient_kernel(float * d_gdx, float * d_gdy, float * d_gdz, float * d_i, int w, int h, int l, float sx, float sy, float sz) { int i = threadIdx.x + blockIdx.x * blockDim.x; int j = threadIdx.y + blockIdx.y * blockDim.y; int wh = w * h; int id = j * w + i; if (i<w && j < h) for (int k=0 ; k < l; ++k, id += wh){ if (i == 0) d_gdx[id] = (d_i[id+1] - d_i[id]) / sx; else if (i == w - 1) d_gdx[id] = (d_i[id] - d_i[id-1]) / sx; else d_gdx[id] = (d_i[id+1] - d_i[id-1]) / (2.f * sx); if (j == 0) d_gdy[id] = (d_i[id+w] - d_i[id]) / sy; else if (j == h - 1) d_gdy[id] = (d_i[id] - d_i[id-w]) / sy; else d_gdy[id] = (d_i[id+w] - d_i[id-w]) / (2.f * sy); if (k == 0) d_gdz[id] = (d_i[id+wh] - d_i[id]) / sz; else if (k == l - 1) d_gdz[id] = (d_i[id] - d_i[id-wh]) / sz; else d_gdz[id] = (d_i[id+wh] - d_i[id-wh]) / (2.f * sz); } } void cudaComputeGradient(float * d_gdx, float * d_gdy, float * d_gdz, // out put gradient float * d_i, // input image uint w, uint h, uint l, // size of image float sx, float sy, float sz) // spacing { dim3 threads(16,16); dim3 grids(iDivUp16(w), iDivUp16(h)); cudaComputeGradient_kernel<<<grids, threads>>>(d_gdx, d_gdy, d_gdz, d_i, w, h, l, sx, sy, sz); } CODE __global__ void cudaComputeGradient_shared_kernel(float * d_gdx, float * d_gdy, float * d_gdz, float * d_i, int w, int h, int l, float sx, float sy, float sz) { const int radius = 1; float s_data[16 + radius * 2][16 + radius * 2]; int i = threadIdx.x + blockIdx.x * blockDim.x; int j = threadIdx.y + blockIdx.y * blockDim.y; if (i < w && j < h){ int planeSize = w* h; int id = j * w + i; float back = 0; float current = 0; float front = d_i[id]; for (int k=0; k< l; ++k){ back = current; current = front; front = d_i[id + planeSize]; __syncthreads(); int tx = threadIdx.x + radius; int ty = threadIdx.y + radius; if (threadIdx.x < radius){ s_data[ty][threadIdx.x] = d_i[id - radius]; s_data[ty][threadIdx.x + 16 + radius] = d_i[id + 16]; } if (threadIdx.y < radius){ s_data[threadIdx.y][tx] = d_i[id - radius * w]; s_data[threadIdx.y + 16 + radius][tx] = d_i[id + 16 * w]; } s_data[ty][tx] = current; __syncthreads(); if (i == 0) d_gdx[id] = (s_data[ty][tx+1] -s_data[ty][tx]) / sx; else if (i == w - 1) d_gdx[id] = (s_data[ty][tx] -s_data[ty][tx-1]) / sx; else d_gdx[id] = (s_data[ty][tx+1] -s_data[ty][tx-1])/ (2 * sx); if (j == 0) d_gdy[id] = (s_data[ty+1][tx] -s_data[ty][tx]) /sy; else if (j == h - 1) d_gdy[id] = (s_data[ty][tx] -s_data[ty-1][tx]) /sy; else d_gdy[id] = (s_data[ty+1][tx] -s_data[ty-1][tx])/(2 * sy); if (k==0) d_gdz[id] = (front - current)/sz; else if (k == l - 1) d_gdz[id] = (current - back) /sz; else d_gdz[id] = (front - back) /(2 * sz); id += planeSize; } } } void cudaComputeGradient_shared(float * d_gdx, float * d_gdy, float * d_gdz, // out put gradient float * d_i, // input image uint w, uint h, uint l, // size of image float sx, float sy, float sz) // spacing { dim3 threads(16,16); dim3 grids(iDivUp16(w), iDivUp16(h)); cudaComputeGradient_shared_kernel<<<grids, threads>>>(d_gdx, d_gdy, d_gdz, d_i, w, h, l, sx, sy, sz); } This post has been edited by Linh Ha: Nov 8 2009, 03:06 PM |
|
|
|
Nov 6 2009, 09:34 PM
Post
#2
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: Members Posts: 692 Joined: 4-April 06 From: Karlsruhe / Munich, Germany Member No.: 18,632 Org.: Nomor Research GmbH |
In your second example, did you really combine shared memory with texture reads?
Doing that does not seem to make much sense. The texture reads already solve the bandwidth issue caused by reading the same pixel several times, eliminating the need for shared memory use. I suggest you analyze your shared memory access patterns for bank conflicts, for example with the bank checker macro of cutil. It's not perfect, but in give some clues about what is going on here. Christian This post has been edited by cbuchner1: Nov 6 2009, 09:36 PM |
|
|
|
Nov 6 2009, 09:41 PM
Post
#3
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: Members Posts: 814 Joined: 1-April 09 Member No.: 148,556 |
The second example also contains a number of full 32 bit integer multiplies (and integer->float casts) which are going to be a lot slower than the code in the first kernel. I have used the same basic shared memory idea (sans the texture reads) in 3D finite difference kernels and my (probably crappy) version hits about 90Gb/s on a GTX 275 for second order compact stencil, so there is nothing wrong with the idea.
I feel compelled to add that those scrolling code boxes are just evil. |
|
|
|
Nov 6 2009, 09:48 PM
Post
#4
|
|
![]() ![]() ![]() ![]() ![]() ![]() Group: Members Posts: 160 Joined: 30-October 08 From: Saarbruecken, Germany & Kraków, Poland Member No.: 124,006 Org.: Saarland University |
@cbuchner1: There is no getTexVal function in the manual. I think it is just some __device__ function
@Linh Ha: You are not using shared memory in the second example! If you want s_data to reside in shared memory you have to write __shared__ before type specifier: CODE __shared__ float s_data[dimX][dimY] The way you have written it right now, there is a separate array for each thread, and it is residing in local memory. Local memory is approximately as slow as global one. Also I would suggest precomputing a recipocial of sx, sy, sz and where you compute d_gdx and d_gdy values, multiplying by that precomputed value instead. Division is much slower that multiplication. This post has been edited by Cygnus X1: Nov 6 2009, 09:49 PM |
|
|
|
Nov 8 2009, 03:09 PM
Post
#5
|
|
![]() ![]() ![]() ![]() ![]() Group: Members Posts: 116 Joined: 28-August 07 Member No.: 67,654 Org.: University of Utah |
In your second example, did you really combine shared memory with texture reads? Doing that does not seem to make much sense. The texture reads already solve the bandwidth issue caused by reading the same pixel several times, eliminating the need for shared memory use. Christian I did try texture. Agree that texture memory access partially solve the problem. I have tried with texture, it is not significantly faster than global mem (since this one also satisfies the coalesced condition) |
|
|
|
Nov 8 2009, 03:15 PM
Post
#6
|
|
![]() ![]() ![]() ![]() ![]() Group: Members Posts: 116 Joined: 28-August 07 Member No.: 67,654 Org.: University of Utah |
@Linh Ha: You are not using shared memory in the second example! If you want s_data to reside in shared memory you have to write __shared__ before type specifier: CODE __shared__ float s_data[dimX][dimY] You are right, i forgot that. It explains the problem, i will check the performance. Thanks |
|
|
|
Nov 9 2009, 01:31 AM
Post
#7
|
|
![]() ![]() ![]() ![]() ![]() Group: Members Posts: 116 Joined: 28-August 07 Member No.: 67,654 Org.: University of Utah |
With only a small change as suggested
__shared__ float s_data, it is twice as fast. Thank everyone |
|
|
|
![]() ![]() |
| Copyright 2008 NVIDIA Corporation. Terms of Use | Legal Info | Privacy Policy | Time is now: 24th November 2009 - 01:35 AM |