> CUDA Occupancy Calculator, Helps pick optimal thread block size
Mark Harris
post Mar 22 2007, 10:00 AM
Post #1



******

Group: Members
Posts: 382
Joined: 15-February 07
From: Brisbane, Australia
Member No.: 40,950
Org.: NVIDIA



Hello CUDA users,

We've just posted a new tool on the CUDA site, the CUDA Occupancy Calculator. This tool is a Microsoft Excel spreadsheet that helps you choose a thread block size for your kernel in order to achieve the highest occupancy of the GPU. The CUDA Occupancy Calculator can be found on the CUDA homepage.

Here is a direct link: CUDA Occupancy Calculator

Hopefully this will answer your questions about occupancy, register file size, how register and shared memory usage affect efficiency, and how to tune your thread block size. The documentation for this tool follows, but it is also included in the Excel spreadsheet on the "help" tab.

As we release new GPUs with different parameters, we'll add them to this tool so it can be used for any GPU.

If you have questions about or problems with the CUDA Occupancy Calculator please post them in the forums.

Cheers,
Mark

------------------------------------ Documentation ------------------------------------------


Overview
The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU. Each multiprocessor on the device has a set of N registers available for use by CUDA thread programs. These registers are a shared resource that is allocated among the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage to maximize the number of thread blocks that can be active in the machine simultaneously. If a program tries to launch a kernel for which the registers used per thread times the thread block size is greater than N, the launch will fail.

The size of N on G80 is 8192 32-bit registers per multiprocessor.

Maximizing the occupancy can help to cover latency during global memory loads that are followed by a __syncthreads(). The occupancy is determined by the amount of shared memory and registers used by each thread block. Because of this, programmers need to choose the size of thread blocks with care in order to maximize occupancy. The CUDA Occupancy Calculator can assist in choosing thread block size based on shared memory and register requirements.
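
To make the definition above concrete, here is a minimal Python sketch of how such an occupancy figure could be estimated for one G80 multiprocessor. It is not the spreadsheet's actual formula. The 8192 registers come from the text above; the other limits (16 KB of shared memory, 24 warps of 32 threads, 8 resident blocks, 512 threads per block) and the function name estimate_occupancy are assumptions made for illustration, and allocation granularity is ignored, so the spreadsheet may report different values.

```python
# Simplified occupancy estimate for one G80 multiprocessor.
# All limits except "registers" are assumptions for illustration;
# allocation granularity is ignored.
G80 = {
    "registers": 8192,             # 32-bit registers per multiprocessor
    "shared_mem_bytes": 16 * 1024,  # assumed shared memory per multiprocessor
    "max_warps": 24,               # assumed: 768 threads / 32 threads per warp
    "max_blocks": 8,               # assumed resident block limit
    "warp_size": 32,
    "max_threads_per_block": 512,
}


def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block, gpu=G80):
    """Return active warps / maximum warps for one multiprocessor (0.0 to 1.0)."""
    if not 1 <= threads_per_block <= gpu["max_threads_per_block"]:
        raise ValueError("threads_per_block is outside the supported range")

    regs_per_block = regs_per_thread * threads_per_block
    if regs_per_block > gpu["registers"] or smem_per_block > gpu["shared_mem_bytes"]:
        return 0.0  # a single block does not fit, so the launch would fail

    # Round the block size up to whole warps.
    warps_per_block = -(-threads_per_block // gpu["warp_size"])

    # Each resource caps the number of blocks that can be resident at once.
    limits = [gpu["max_blocks"], gpu["max_warps"] // warps_per_block]
    if regs_per_block > 0:
        limits.append(gpu["registers"] // regs_per_block)
    if smem_per_block > 0:
        limits.append(gpu["shared_mem_bytes"] // smem_per_block)

    active_blocks = min(limits)
    return active_blocks * warps_per_block / gpu["max_warps"]
```

For example, estimate_occupancy(256, 14, 2048) gives about 0.67 under these assumptions, because the register file limits the multiprocessor to two blocks of eight warps each.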

Instructions
Using the CUDA Occupancy Calculator is as easy as 1-2-3. Change to the calculator sheet and follow these three steps.
1.) First select your GPU in the green box.

2.) For the kernel you are profiling, enter the number of threads per thread block, the registers used per thread, and the total shared memory used per thread block (in bytes) in the orange box. See below for how to find the registers used per thread.

3.) Examine the blue box, and the graph to the right. This will tell you the occupancy, as well as the number of active threads, warps, and thread blocks per multiprocessor, and the maximum number of active blocks on the GPU. The graph will show you the occupancy for your chosen block size as a red triangle, and for all other possible block sizes as a line graph.

You can now experiment with how different thread block sizes, register counts, and shared memory usages can affect your GPU occupancy.

Determining Registers Per Thread and Shared Memory Per Thread Block
To determine the number of registers used per thread in your kernel, simply compile the kernel code using the -cubin option to nvcc. This will generate a .cubin file, which you can open in a text editor. Look for the "code" section with your kernel's name. Within the curly braces ("{ ... }") for that code block, you will see a line with "reg = X", where X is the number of registers used by your kernel. You can also see the amount of shared memory used as "smem = Y". However, if your kernel declares any external shared memory that is allocated dynamically, you will need to add the number in the .cubin file to the amount you dynamically allocate at run time to get the correct shared memory usage. An example is below:

code {
name = my_kernel
lmem = 0
smem = 24
reg = 5
bar = 0
bincode { }
const { }
}

Let's say "my_kernel" contains an external shared memory array which is allocated to be 2048 bytes at run time. Then our total shared memory usage is 2072 bytes. We enter this into the box labeled "shared memory per block (bytes)", and we enter the number of registers used by my_kernel, 5, in the box labeled registers per thread. We then enter our thread block size and the calculator will display the occupancy.
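
The bookkeeping in this example can be scripted. The following Python sketch assumes the plain-text .cubin layout shown above (one "key = value" pair per line inside a "code { ... }" section). The function name kernel_resources and its parameters are hypothetical, and real .cubin files may contain details this simple parser does not handle.

```python
import re

# The example section from above, used here as sample input.
CUBIN_TEXT = """
code {
name = my_kernel
lmem = 0
smem = 24
reg = 5
bar = 0
bincode { }
const { }
}
"""


def kernel_resources(cubin_text, kernel_name, dynamic_smem_bytes=0):
    """Return (registers per thread, total shared memory per block in bytes)."""
    fields = {}
    in_kernel = False
    for raw_line in cubin_text.splitlines():
        line = raw_line.strip()
        if line.startswith("code"):
            in_kernel = False  # a new code section begins
            continue
        match = re.match(r"(\w+)\s*=\s*(\S+)$", line)
        if not match:
            continue
        key, value = match.groups()
        if key == "name":
            in_kernel = (value == kernel_name)
        elif in_kernel and key in ("reg", "smem"):
            fields[key] = int(value)
    if "reg" not in fields or "smem" not in fields:
        raise ValueError("kernel %r not found in cubin text" % kernel_name)
    # Static shared memory from the file plus the run-time (extern) allocation.
    return fields["reg"], fields["smem"] + dynamic_smem_bytes


if __name__ == "__main__":
    regs, smem = kernel_resources(CUBIN_TEXT, "my_kernel", dynamic_smem_bytes=2048)
    print("registers per thread:", regs)
    print("shared memory per block (bytes):", smem)
```

The two returned numbers are the values to enter in the "registers per thread" and "shared memory per block (bytes)" boxes.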

For more information on NVIDIA CUDA, visit http://developer.nvidia.com/cuda
FlyingSquirrel
post Mar 23 2007, 08:30 PM
Post #2



**

Group: Members
Posts: 13
Joined: 8-March 07
Member No.: 44,356



Hi,

In section 5.1 of the documentation it is noted that the maximum number of threads per block is 512. In the CUDA Occupancy Calculator this value is 768, but when I tried to enter it in the Occupancy Calculator, I got an error. So my question is: which of those values is correct?
eelsen
post Mar 23 2007, 08:41 PM
Post #3



*****

Group: Members
Posts: 129
Joined: 17-February 07
Member No.: 41,213
Org.: Stanford University



QUOTE(FlyingSquirrel @ Mar 23 2007, 01:30 PM)
Hi,

In section 5.1 of the documentation it is noted that the maximum number of threads per block is 512. In the CUDA Occupancy Calculator this value is 768, but when I tried to enter it in the Occupancy Calculator, I got an error. So my question is: which of those values is correct?


The maximum number of threads per block is not the same as the maximum number of threads per multiprocessor. With multiple blocks you can have 768 threads resident on one multiprocessor.
FlyingSquirrel
post Mar 23 2007, 08:52 PM
Post #4



**

Group: Members
Posts: 13
Joined: 8-March 07
Member No.: 44,356



Thanks for the answer :)
sicb0161
post Mar 29 2007, 01:11 PM
Post #5



****

Group: Members
Posts: 85
Joined: 22-February 07
From: Turkey
Member No.: 42,165



Hi,

Do you have the specs for the GeForce 8800 GTS version?
I would like to add it to the calculator!

Thank you very much in advance.
Mark Harris
post Mar 30 2007, 01:48 PM
Post #6



******

Group: Members
Posts: 382
Joined: 15-February 07
From: Brisbane, Australia
Member No.: 40,950
Org.: NVIDIA



GTS is the same as GTX except that it has fewer multiprocessors. Since the occupancy is calculated per multiprocessor, this change doesn't affect the calculator, so I didn't add GTS as a different GPU. A G80 is a G80 is a G80 in this regard. :) When future CUDA-supporting GPUs are released, I will add their information and post a new calculator.

Mark
mehussein
post Apr 1 2007, 06:48 PM
Post #7



**

Group: Members
Posts: 14
Joined: 20-February 07
Member No.: 41,780
Org.: UMD



When I enter my kernel's data in the calculator (192 threads/block, 35 reg/thread, 3872 smem/block) I get 0% occupancy. What does it mean to have 0% occupancy? My kernel seems to work fine and produce correct results though. It is also faster than the CPU implementation and my former GPU implementation (OpenGL/Cg).

-MH
shyam.pm
post Apr 2 2007, 08:50 PM
Post #8



***

Group: Members
Posts: 25
Joined: 20-February 07
Member No.: 41,721



Hi,

Could you please let me know how I could do this in a Windows environment? I am using Visual Studio 2003. I changed the compile option to -cubin instead of -ccbin in the command line in the properties for the .cu file, but it did not work.

Thanks,

Shyam
mehussein
post Apr 3 2007, 02:07 AM
Post #9



**

Group: Members
Posts: 14
Joined: 20-February 07
Member No.: 41,780
Org.: UMD



I made a simple batch file to do this from command line. The file contains only one line as follows:

CODE

nvcc -ccbin "C:\Program Files\Microsoft Visual Studio .NET 2003\Vc7\bin" -cubin -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/O2,/Zi,/MT -I"C:\CUDA\include" -I./ -I"C:\Program Files\NVIDIA Corporation\NVIDIA SDK 10\NVIDIA CUDA SDK\common\inc" %1


I use it from command line like this

CODE
runnvcccubin.bat file_name.cu


You may need to change the Visual Studio and NVIDIA SDK directories to make it work in your environment.

-MH
sicb0161
post Apr 3 2007, 04:28 PM
Post #10



****

Group: Members
Posts: 85
Joined: 22-February 07
From: Turkey
Member No.: 42,165



QUOTE(Mark Harris @ Mar 30 2007, 05:48 AM)
GTS is the same as GTX except that it has fewer multiprocessors. Since the occupancy is calculated per multiprocessor, this change doesn't affect the calculator, so I didn't add GTS as a different GPU. A G80 is a G80 is a G80 in this regard. :) When future CUDA-supporting GPUs are released, I will add their information and post a new calculator.

Mark


Yes, I see. I thought that, for instance, the total number of registers might change.
Thanks for the answer.

I have three additional questions:
1. I have evaluated the performance of the graphics card by comparing the computation time on the CPU and on the NVIDIA GPU using matrixmul and simplecublas. The size of the matrix is variable, so one can clearly see the speedup from the graphics card for bigger matrices.
However, the computation time of matrixmul is always greater than that of simplecublas. Why is that? (One possible answer might be that the CUBLAS implementations are optimized, but I am not sure.)

2. If one takes a closer look at the occupancy calculator for matrixmul with threads/block = 256, registers/thread = 14, and shared mem/block = 2048, the occupancy of each multiprocessor is 67%.
As I understand it, there are three main options for increasing performance: changing the thread block dimensions, the shared memory size, or the number of registers. In the case of matrixmul, the number of registers used per thread is the bottleneck.
What can I do to increase the occupancy, given that I have no influence on the number of registers (since the compiler already tries to minimize register usage to allow more threads)?

3. Although I have not changed a single line of matrixmul, the program crashes when the matrix dimension increases to 5120x5120. Memory allocation works fine. I don't have a clue why the program is crashing.

Thanks in advance for your help.
Mark Harris
post Apr 4 2007, 10:06 AM
Post #11



******

Group: Members
Posts: 382
Joined: 15-February 07
From: Brisbane, Australia
Member No.: 40,950
Org.: NVIDIA



This post is getting off topic for this thread. I want to keep this thread for discussion of the occupancy calculator tool, so please repost your questions in a new topic.

Thanks,
Mark
shyam.pm
post Apr 4 2007, 12:41 PM
Post #12



***

Group: Members
Posts: 25
Joined: 20-February 07
Member No.: 41,721



QUOTE

I made a simple batch file to do this from command line. The file contains only one line as follows:

CODE

nvcc -ccbin "C:\Program Files\Microsoft Visual Studio .NET 2003\Vc7\bin" -cubin -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/O2,/Zi,/MT -I"C:\CUDA\include" -I./ -I"C:\Program Files\NVIDIA Corporation\NVIDIA SDK 10\NVIDIA CUDA SDK\common\inc" %1


I use it from command line like this

CODE
runnvcccubin.bat file_name.cu


You may need to change the Visual Studio and NVIDIA SDK directories to make it work in your environment.

-MH


Thanks a lot. The batch file worked perfectly.

Shyam
Mark Harris
post Apr 26 2007, 08:31 AM
Post #13



******

Group: Members
Posts: 382
Joined: 15-February 07
From: Brisbane, Australia
Member No.: 40,950
Org.: NVIDIA



Hi all,

There's an update to the CUDA Occupancy Calculator on the CUDA webpage. This is version 1.1, and it includes a fix for issues with the occupancy being incorrectly calculated.

It also now has 3 graphs -- the block size graph from v1.0, plus graphs of how occupancy will change for the selected block size with varying register and shared memory usage.

Mark
nogradi
post Apr 30 2007, 09:43 AM
Post #14



***

Group: Members
Posts: 48
Joined: 18-February 07
Member No.: 41,484



Great! Thanks very much for the new release.

Daniel
jesser
post May 8 2007, 08:15 PM
Post #15



**

Group:
Posts: 12
Joined: 22-March 07
Member No.: 46,393



Mark,

After using the occupancy calculator I found that I am getting 13% occupancy per multiprocessor, which sounds bad. The kernel is definitely "bandwidth bound", as you say: much more time is spent fetching data from global memory than in calculations. The nature of my kernel is that none of the threads can really run until all the data is loaded into shared memory. They are set up to load global data coalesced into shared memory as fast as possible, with the fewest instructions, do some calculations, and then write the data back out again in a similar fashion.

Will increasing the occupancy by increasing the number of threads per block give me better performance in this case? What type of memory access patterns benefit from more threads per block?
osiris
post May 10 2007, 10:56 PM
Post #16



*****

Group: Members
Posts: 181
Joined: 19-April 07
From: Sydney
Member No.: 50,218



Mark, I wonder if you could simply list the necessary and sufficient conditions for 100% occupancy for a given kernel. My understanding is:
1. registers <= 10
2. threads/block mod 32 == 0
3. warps/block is a divisor of 24
4. shared mem/block <= 16 KB * (warps/block) / 24 (any alignment constraint?)
Constant memory does not come into it, as it is the same for all blocks.

An official confirmation would be helpful, and perhaps it should be in the manual in the G80-specific area. If it is there, I have missed it.
Thanks, Eric

ed: and then run N * 16 * 24 / (warps/block) blocks, assuming they all execute for the same time.

This post has been edited by osiris: May 11 2007, 04:26 AM
xaraca
post May 10 2007, 11:24 PM
Post #17



*

Group: Members
Posts: 9
Joined: 27-February 07
Member No.: 42,998



QUOTE
1. registers <= 10
2. threads/block mod 32 == 0
3. warps/block is a divisor of 24
4. shared mem/block <= 16 KB * (warps/block) / 24 (any alignment constraint?)


Since there are only a few thread counts that satisfy those requirements, maybe we can summarize like this:

Max registers per thread: 10

Threads per block    Max shared mem per block (bytes)
96                   2048
128                  2730
192                  4096
256                  5461
384                  8192
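
The table can be reproduced with a few lines of Python. This is only a sketch under assumed G80 limits (8192 registers, 16 KB of shared memory and 24 warps per multiprocessor, at most 8 resident blocks, at most 512 threads per block). It ignores any allocation granularity or alignment, which the question above leaves open, and the function name full_occupancy_configs is made up for illustration.

```python
# Assumed G80 per-multiprocessor limits (illustrative, not an official list).
WARP_SIZE = 32
MAX_WARPS = 24                 # 768 threads per multiprocessor
MAX_BLOCKS = 8                 # resident blocks per multiprocessor
MAX_THREADS_PER_BLOCK = 512
REGISTERS = 8192
SHARED_MEM_BYTES = 16 * 1024

# Registers per thread that still allow all 768 threads to be resident.
MAX_REGISTERS_PER_THREAD = REGISTERS // (MAX_WARPS * WARP_SIZE)


def full_occupancy_configs():
    """Return (threads per block, max shared memory per block) pairs for 100% occupancy."""
    rows = []
    for warps_per_block in range(1, MAX_WARPS + 1):
        if MAX_WARPS % warps_per_block != 0:
            continue  # the blocks must fill all 24 warps exactly
        blocks = MAX_WARPS // warps_per_block
        threads = warps_per_block * WARP_SIZE
        if blocks > MAX_BLOCKS or threads > MAX_THREADS_PER_BLOCK:
            continue  # too many resident blocks, or block too large to launch
        # Shared memory is split evenly among the resident blocks.
        rows.append((threads, SHARED_MEM_BYTES // blocks))
    return rows


if __name__ == "__main__":
    print("Max registers per thread:", MAX_REGISTERS_PER_THREAD)
    for threads, smem in full_occupancy_configs():
        print(threads, smem)
```

Block sizes of 32 and 64 threads drop out because they would need more than 8 resident blocks, and 768 drops out because it exceeds the per-block thread limit.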
lord_jake
post Jun 20 2007, 09:35 AM
Post #18



**

Group: Members
Posts: 15
Joined: 19-June 07
Member No.: 57,921



Dear Mark,

There is something that keeps puzzling me. My kernel's cubin file tells me that

lmem = 640, smem = 1092, reg = 33

Now I know that the number of registers is the limiting factor for my kernel according to
your Occupancy Calculator. But could you please tell me what impact the lmem = 640 has.
I get this lmem in my code when I allocate arrays like

float4 fvList[40];

I thought lmem (= local memory?) was the same as registers. But it seems otherwise,
because changing that value does not change the occupancy, since it's never used in
your calculation.

Any explanation would be welcome,
Jake
Romant
post Jul 27 2007, 05:41 PM
Post #19



*****

Group: Members
Posts: 196
Joined: 25-July 07
Member No.: 62,733



QUOTE(Mark Harris @ Mar 30 2007, 05:48 PM)
GTS is the same as GTX except that it has fewer multiprocessors. Since the occupancy is calculated per multiprocessor, this change doesn't affect the calculator, so I didn't add GTS as a different GPU. A G80 is a G80 is a G80 in this regard. :) When future CUDA-supporting GPUs are released, I will add their information and post a new calculator.

Mark


Mark, and what about G86? :-)
I have an 8500 GT, which is built on the G86 hardware. Could you please provide the info for this chip?

Many thanks in advance.
osiris
post Jul 28 2007, 10:59 PM
Post #20



*****

Group: Members
Posts: 181
Joined: 19-April 07
From: Sydney
Member No.: 50,218



Hi Mark,

There was no response to my report of this documentation bug. It also affects the occupancy calculator, which shows 33% occupancy for 32 threads and 32 registers when in fact I measure 17% occupancy and the profiler shows 0.167 occupancy. Reducing registers to 16 gets you up to 33%.

So is the bug in the 64-bit driver, or is it a hardware restriction that is not correctly documented in the guide, plus a bug in the occupancy calculator?

The particular configuration of 8 blocks, 32 threads and 32 registers is a good one on the G80, apart from the fact that it does not work!

Eric
