Help - Search - Members - Calendar
Full Version: Tables not correct when using __constant__
NVIDIA Forums > CUDA GPU Computing > CUDA Programming and Development
magneei
Hi,

I am currently implementing a lookup-based implementation of AES ctr encryption using CUDA.

The lookup-tables are defined as following:

__device__ u32 Te0[256] =
{
0xc66363a5U, 0xf87c7c84U, 0xee777799U, 0xf67b7b8dU,
... (256 entries in total)
}

This works perfectly, but I want to increase speed by having it in constant memory. Defining the table as

__device__ __constant__ u32 Te0[256] =
{
0xc66363a5U, 0xf87c7c84U, 0xee777799U, 0xf67b7b8dU,
... (256 entries in total)
}

compiles, and runs about 10-15 times faster, however the output is wrong. Am I using the __constant__ qualifier wrong, or could anyone give me any hints on how to use constant memory?

If required I can post the whole code.

Sincerly Magne Eimot
MisterAnderson42
Try just "__constant__ u32 Te0 ......" (i.e. remove the __device__).
Sarnath
QUOTE(MisterAnderson42 @ Mar 12 2008, 10:59 PM)
Try just "__constant__ u32 Te0 ......" (i.e. remove the __device__).
[right][snapback]341801[/snapback][/right]



I have never used __constant__. BUt your explanation is quite baffling. Is it based on any of your strange experience with CUDA?
jordyvaneijk
QUOTE(magneei @ Mar 12 2008, 07:13 PM)
Hi,

I am currently implementing a lookup-based implementation of AES ctr encryption using CUDA.

The lookup-tables are defined as following:

__device__ u32 Te0[256] =
{
  0xc66363a5U, 0xf87c7c84U, 0xee777799U, 0xf67b7b8dU,
... (256 entries in total)
}

This works perfectly, but I want to increase speed by having it in constant memory. Defining the table as

__device__ __constant__ u32 Te0[256] =
{
  0xc66363a5U, 0xf87c7c84U, 0xee777799U, 0xf67b7b8dU,
... (256 entries in total)
}

compiles, and runs about 10-15 times faster, however the output is wrong. Am I using the __constant__ qualifier wrong, or could anyone give me any hints on how to use constant memory?

If required I can post the whole code.

Sincerly Magne Eimot
[right][snapback]341789[/snapback][/right]


This is how we use constants:
CODE

float h_x_plane1_x1 = (x_plane[0] - x1); float h_x_planeN_x1 = ((x_plane[0]+dimx*dx) - x1);

CUDA_SAFE_CALL(cudaMemcpyToSymbol("x_plane1_x1", &h_x_plane1_x1, sizeof(h_x_plane1_x1)));


Maybe you can do something with it. I think it is the same for an array of constants.

oops forgot to mention:
CODE

__device__ __constant__ float x_plane1_x1;

this part is inside a header file like we use it.
Sarnath
QUOTE(jordyvaneijk @ Mar 13 2008, 01:55 PM)

__device__ __constant__ float x_plane1_x1;
this part is inside a header file like we use it.

[right][snapback]342066[/snapback][/right]



Ur usage is fine. But I would be surprised if we cannot declared look-up tables in constant memory in the data-segment. It would hinder readability.

In the worst case, one has to declare this constant table in host memory and then do a symbol copy as per jordy..., and get it done.

Can some1 more knowledgeable comment here?
MisterAnderson42
I only said to remove the __device__ because in my code that successfully uses constants, I don't have it smile.gif

Maybe it is a compiler issue. Compile with -keep and check the .ptx and see if your data is there.
Sarnath
QUOTE(MisterAnderson42 @ Mar 13 2008, 09:08 PM)
I only said to remove the __device__ because in my code that successfully uses constants, I don't have it smile.gif

Maybe it is a  compiler issue. Compile with -keep and check the .ptx and see if your data is there.
[right][snapback]342230[/snapback][/right]


I see.. smile.gif

Actually, I remembered it like this:

When you say "__device__" -- it tells the compiler that it is a GPU element.
And, you further qualify it with "__constant__", "__global__", etc.... Thats why I was surprised by your answer. ANyway, Never mind...
jordyvaneijk
QUOTE(MisterAnderson42 @ Mar 13 2008, 05:38 PM)
I only said to remove the __device__ because in my code that successfully uses constants, I don't have it smile.gif

Maybe it is a  compiler issue. Compile with -keep and check the .ptx and see if your data is there.
[right][snapback]342230[/snapback][/right]


I once tried that without the __device__ and gave a lot of errors on my side. Tooklit version 1.0

Yes I'm still using the old toolkit.
MisterAnderson42
OK, this has gotten a little off topic since my original post. Check the manual section 4.2.2.2 (CUDA 1.1 at least). It specifically states that the use of __device__ with __constant__ is optional. In both cases, the declared variable resides in the constant memory space on the device.

To get back on topic for the OP, you are going to need to narrow your problem down to a minimal reproduction and post the code here. I just wrote a test using an initialized constant array and had no problems whatsoever.

CODE

#include <stdio.h>

#  define CUDA_SAFE_CALL( call) do {                                         \
   cudaError err = call;                                                    \
   if( cudaSuccess != err) {                                                \
       fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",        \
               __FILE__, __LINE__, cudaGetErrorString( err) );              \
   exit(EXIT_FAILURE);                                                      \
   } } while (0)

#ifdef NDEBUG
#define CUT_CHECK_ERROR(errorMessage)
#else
#  define CUT_CHECK_ERROR(errorMessage) do {                                 \
   cudaThreadSynchronize();                                                \
   cudaError_t err = cudaGetLastError();                                    \
   if( cudaSuccess != err) {                                                \
       fprintf(stderr, "Cuda error: %s in file '%s' in line %i : %s.\n",    \
               errorMessage, __FILE__, __LINE__, cudaGetErrorString( err) );\
       exit(EXIT_FAILURE);                                                  \
   } } while (0)
#endif


__device__ __constant__ int constA[32] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};

__constant__ int constB[32] = {10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131};

__global__ void copy_gmemA(int* g_odata)
{
const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

g_odata[idx] = constA[threadIdx.x];
}

__global__ void copy_gmemB(int* g_odata)
{
const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

g_odata[idx] = constB[threadIdx.x];
}

int main()
{
int *d_odata, *h_odata;
int len = 32;
int num_threads = 32;

CUDA_SAFE_CALL( cudaMalloc((void**)&d_odata, sizeof(int)*(len)) );
h_odata = (int *)malloc(sizeof(int) * len);

dim3  threads(num_threads, 1, 1);
dim3  grid(1, 1, 1);

copy_gmemA<<< grid, threads >>>(d_odata);
CUDA_SAFE_CALL( cudaMemcpy(h_odata, d_odata, sizeof(int)*len, cudaMemcpyDeviceToHost) );
printf("A: ");
for (int i = 0; i < 32; i++)
 printf("%d ", h_odata[i]);

printf("\n\n");
copy_gmemB<<< grid, threads >>>(d_odata);
CUDA_SAFE_CALL( cudaMemcpy(h_odata, d_odata, sizeof(int)*len, cudaMemcpyDeviceToHost) );
printf("A: ");
for (int i = 0; i < 32; i++)
 printf("%d ", h_odata[i]);

return 0;
}


When I run this, I get the expected output:
CODE

$ ./constant_test
A: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

A: 10 11 12 13 14 15 16 17 18 19 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131
oserra
Ok, I have had the same problem for a long time, with 1.0 and 1.1 version (as far as I remember). At some point, and copying line by line the SDK examples, I realized that the only way it works is to put the copy command, the __constant__ declaration and the kernel in the same .cu file. Sincerely, I don't understand why, but after a long trial and error process it is the only way it works.

Did anyone else experience the same solution? Could anyone else having the same problem please try it to see if we are talking about the same problem?
MisterAnderson42
QUOTE(oserra @ Mar 17 2008, 02:25 AM)
... I realized that the only way it works is to put the copy command, the __constant__ declaration and the kernel in the same .cu file. Sincerely, I don't understand why, but after a long trial and error process it is the only way it works.
[right][snapback]344062[/snapback][/right]

This is expected behavior. The programming guide states that variables declared __constant__ have implied static storage in the file that they are defined in. The same goes for texture references and shared memory declarations. All must be in the same .cu file in order to be accessed by both host and device code in that file. The simple (though inelegant) solution is to #include all of your .cu files into one "big.cu' file and only compile that one.
x248
QUOTE (MisterAnderson42 @ Mar 17 2008, 02:33 PM) *
This is expected behavior. The programming guide states that variables declared __constant__ have implied static storage in the file that they are defined in. The same goes for texture references and shared memory declarations. All must be in the same .cu file in order to be accessed by both host and device code in that file. The simple (though inelegant) solution is to #include all of your .cu files into one "big.cu' file and only compile that one.


is it always the case with the 2.2 ?

In fact I have a very strange problem.

I use constant memory.
All the declarations are done in a .cu file which is included in all the files using those constants.

It seemed to be ok, but now I have a very strange behavior.
I have sometimes wrong results in certain part of my program.
If I rebbot my PC and launch the .exe I have wrong results. Then, if I launch an other exe (a previous
without the problem), and I re-lauch after the first one, I have the good result.
So my results are different after a boot or a launch of another program.

Of course in EMuDebug, nothing appears.

Can it come from my declaration of the .cu in an included file ?
Does someone had the same problem, or Ihave an idea about this behaviour?


Thanks in advance.

This is a "lo-fi" version of our main content. To view the full version with more information, formatting and images, please click here.