![]() ![]() |
Oct 26 2009, 10:43 PM
Post
#1
|
|
![]() ![]() ![]() ![]() Group: Members Posts: 60 Joined: 3-September 08 Member No.: 117,781 |
Hi all,
I calculated the kernel launch overhead for GTX 280 around 60microsecond, but I've seen papers reporting this time around 10microsecond (including thread synchronizing). Does this kernel lunch overhead depends on the size of grid and thread blocks? what is the standard way to calculate this kernel lunch overhead? Thanks |
|
|
|
Nov 3 2009, 05:31 PM
Post
#2
|
|
![]() ![]() ![]() ![]() Group: Members Posts: 60 Joined: 3-September 08 Member No.: 117,781 |
No any reply for this post?!! please guys...
|
|
|
|
Nov 3 2009, 05:35 PM
Post
#3
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: Extranet Users Posts: 2,289 Joined: 23-March 07 Member No.: 46,425 Org.: University of Michigan |
Hi all, I calculated the kernel launch overhead for GTX 280 around 60microsecond, but I've seen papers reporting this time around 10microsecond (including thread synchronizing). Does this kernel lunch overhead depends on the size of grid and thread blocks? what is the standard way to calculate this kernel lunch overhead? Thanks A very long time ago, I looked at the launch overhead of 8800 GTX as a function of the number of blocks and found it to be linearly increasing. I haven't looked at it since then. It is a very simple benchmark to perform just time an empty kernel with increasingly larger block sizes and plot the results. I suggest you try it and post the plot! |
|
|
|
Nov 3 2009, 05:45 PM
Post
#4
|
|
![]() ![]() ![]() ![]() Group: Members Posts: 60 Joined: 3-September 08 Member No.: 117,781 |
I found this paper http://www.netlib.org/lapack/lawnspdf/lawn202.pdf which in section 3.1 it says the kernel lunch time is around 10microsec for OLD gpus, I measured this time as 60microsec for GTX 280, I supposed in 280 the kernel lunch time must be lower or at least equal to older gpus, am I right?
|
|
|
|
Nov 3 2009, 06:43 PM
Post
#5
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: Extranet Users Posts: 2,289 Joined: 23-March 07 Member No.: 46,425 Org.: University of Michigan |
The launch overhead (for small kernels) has steadily improved as CUDA has matured. Current hardware and CUDA 2.3 typically gets a launch overhead of around 7-8 microseconds.
You aren't running on windows Vista or windows 7, but chance? The overhead on those platforms is much higher compared to linux. This post has been edited by MisterAnderson42: Nov 3 2009, 06:43 PM |
|
|
|
Nov 4 2009, 01:45 PM
Post
#6
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: Members Posts: 692 Joined: 4-April 06 From: Karlsruhe / Munich, Germany Member No.: 18,632 Org.: Nomor Research GmbH |
|
|
|
|
Nov 4 2009, 09:08 PM
Post
#7
|
|
![]() ![]() ![]() ![]() Group: Members Posts: 60 Joined: 3-September 08 Member No.: 117,781 |
I'm running on windows XP64, and CUDA 2.3! This could be the reason?
The launch overhead (for small kernels) has steadily improved as CUDA has matured. Current hardware and CUDA 2.3 typically gets a launch overhead of around 7-8 microseconds. You aren't running on windows Vista or windows 7, but chance? The overhead on those platforms is much higher compared to linux. |
|
|
|
Nov 4 2009, 09:10 PM
Post
#8
|
|
![]() ![]() ![]() ![]() Group: Members Posts: 60 Joined: 3-September 08 Member No.: 117,781 |
|
|
|
|
Nov 4 2009, 10:02 PM
Post
#9
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: NVIDIA Employees Posts: 2,072 Joined: 3-June 08 From: Santa Clara, CA Member No.: 106,363 Org.: NVIDIA |
XP64 launch overhead is <10 microseconds, depending on your PCIe bandwidth. Vista/Win7 is... not.
|
|
|
|
Nov 4 2009, 10:14 PM
Post
#10
|
|
![]() ![]() ![]() Group: Members Posts: 42 Joined: 30-June 08 Member No.: 110,141 Org.: NCSA (University of Illinois) |
I have done the test a week ago, here is some number in gtx280 (forget the datasize, it is an empty function call)
Empty kernel call only: datasize(bytes) blockdim griddim time(s) 1 (1,1,1) (1,1,1) 0.000015 2 (2,1,1) (1,1,1) 0.000014 4 (4,1,1) (1,1,1) 0.000016 8 (8,1,1) (1,1,1) 0.000015 16 (16,1,1) (1,1,1) 0.000015 32 (32,1,1) (1,1,1) 0.000015 64 (64,1,1) (1,1,1) 0.000014 128 (128,1,1) (1,1,1) 0.000015 256 (256,1,1) (1,1,1) 0.000016 512 (512,1,1) (1,1,1) 0.000015 1024 (512,1,1) (2,1,1) 0.000015 2048 (512,1,1) (4,1,1) 0.000014 4096 (512,1,1) (8,1,1) 0.000015 8192 (512,1,1) (16,1,1) 0.000014 16384 (512,1,1) (32,1,1) 0.000017 32768 (512,1,1) (64,1,1) 0.000016 65536 (512,1,1) (128,1,1) 0.000017 131072 (512,1,1) (256,1,1) 0.000018 262144 (512,1,1) (512,1,1) 0.000022 524288 (512,1,1) (1024,1,1) 0.000029 1048576 (512,1,1) (2048,1,1) 0.000039 2097152 (512,1,1) (4096,1,1) 0.000064 4194304 (512,1,1) (8192,1,1) 0.000115 8388608 (512,1,1) (16384,1,1) 0.000212 |
|
|
|
Nov 5 2009, 12:12 AM
Post
#11
|
|
![]() ![]() ![]() ![]() Group: Members Posts: 60 Joined: 3-September 08 Member No.: 117,781 |
If these times include the cudaThreadSynchronize() too?
Is that possible to share your code to do some comparison? I inserted my code here. For a GTX 280 and on Windows XP 64, the resulted time including cudaThreadSynchronize() is 0.48millisec, and excluding cudaThreadSynchronize() is 0.039millisec!! Is it a big time?! Thanks. CODE __global__ void HelloCUDA2(float* device_result1) { } int main(int argc, char* argv[]) { float *d_A; int N=1500; CUDA_SAFE_CALL( cudaMalloc((void**) &d_A, sizeof(float) * N * N)); unsigned int timer = 0; CUT_SAFE_CALL( cutCreateTimer( &timer)); CUT_SAFE_CALL( cutStartTimer( timer)); HelloCUDA2<<<(N+127)/128, 128>>>(d_A); CUDA_SAFE_CALL( cudaThreadSynchronize() ); CUT_SAFE_CALL( cutStopTimer( timer)); printf("Processing time: %f (ms)\n", cutGetTimerValue( timer)); CUT_SAFE_CALL( cutDeleteTimer( timer)); CUDA_SAFE_CALL( cudaFree(d_A)); return 0; } I have done the test a week ago, here is some number in gtx280 (forget the datasize, it is an empty function call) Empty kernel call only: datasize(bytes) blockdim griddim time(s) 1 (1,1,1) (1,1,1) 0.000015 2 (2,1,1) (1,1,1) 0.000014 4 (4,1,1) (1,1,1) 0.000016 8 (8,1,1) (1,1,1) 0.000015 16 (16,1,1) (1,1,1) 0.000015 32 (32,1,1) (1,1,1) 0.000015 64 (64,1,1) (1,1,1) 0.000014 128 (128,1,1) (1,1,1) 0.000015 256 (256,1,1) (1,1,1) 0.000016 512 (512,1,1) (1,1,1) 0.000015 1024 (512,1,1) (2,1,1) 0.000015 2048 (512,1,1) (4,1,1) 0.000014 4096 (512,1,1) (8,1,1) 0.000015 8192 (512,1,1) (16,1,1) 0.000014 16384 (512,1,1) (32,1,1) 0.000017 32768 (512,1,1) (64,1,1) 0.000016 65536 (512,1,1) (128,1,1) 0.000017 131072 (512,1,1) (256,1,1) 0.000018 262144 (512,1,1) (512,1,1) 0.000022 524288 (512,1,1) (1024,1,1) 0.000029 1048576 (512,1,1) (2048,1,1) 0.000039 2097152 (512,1,1) (4096,1,1) 0.000064 4194304 (512,1,1) (8192,1,1) 0.000115 8388608 (512,1,1) (16384,1,1) 0.000212 |
|
|
|
Nov 5 2009, 12:16 AM
Post
#12
|
|
![]() ![]() ![]() ![]() Group: Members Posts: 60 Joined: 3-September 08 Member No.: 117,781 |
|
|
|
|
Nov 5 2009, 12:19 AM
Post
#13
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: Members Posts: 692 Joined: 4-April 06 From: Karlsruhe / Munich, Germany Member No.: 18,632 Org.: Nomor Research GmbH |
For my case it is much more than 10microsec!! I inserted my code at gshi's response...please take a look and see if I am doing something wrong!?! Thank you. First kernel launch takes longer. Try cooking your kernel first. No, this is not another food joke. Cooking means launching it once (e..g after the cudaMalloc), doing a cudaThreadSynchronize(). Only then do your timing. |
|
|
|
Nov 5 2009, 12:41 AM
Post
#14
|
|
![]() ![]() ![]() Group: Members Posts: 42 Joined: 30-June 08 Member No.: 110,141 Org.: NCSA (University of Illinois) |
yes, the first kernel needs to be outside the loop. Here is the code
CODE #include <stdlib.h> #include <stdio.h> #include <cuda.h> #include <unistd.h> #include <sys/time.h> #include <sched.h> #define CUERR do{ cudaError_t err; \ cudaThreadSynchronize(); \ if ((err = cudaGetLastError()) != cudaSuccess) { \ printf("ERROR: CUDA error: %s, line %d\n", cudaGetErrorString(err), __LINE__); \ exit(-1); \ }}while(0) double gettime(void) { struct timeval t; gettimeofday(&t, NULL); return t.tv_sec + 0.000001* t.tv_usec; } __global__ void empty_kernel(void* p, int len) { } int kernel_launch_test() { char* A = NULL; char* hostA = NULL; double t0, t1; unsigned int max_data_size = 1 << 29; int i; int repeat_num = 1000; char hostname[64]; gethostname(hostname, 64); printf("Runing on host %s\n", hostname); cpu_set_t cpuset; CPU_ZERO(&cpuset); CPU_SET(0, &cpuset); int device = 1; cudaSetDevice(device);CUERR; int datasize =1; cudaMallocHost((void**)&hostA, max_data_size); CUERR; cudaMalloc((void**)&A, max_data_size); CUERR; cudaMemcpy(A, hostA, datasize, cudaMemcpyHostToDevice);CUERR; empty_kernel<<<1, 1>>>(A, datasize); printf("datasize(bytes) blockdim griddim time(s)\n"); while( datasize <= max_data_size){ int blocksize= (datasize <=512)? datasize: 512; int gridsize_x = datasize/ blocksize; int gridsize_y = 1; if (gridsize_x ==0){ gridsize_x= 1; } #define K64 (32*1024) if(gridsize_x > K64){ gridsize_y = gridsize_x /K64; gridsize_x = K64; } dim3 griddim(gridsize_x, gridsize_y); t0 = gettime(); for (i =0;i <repeat_num;i ++){ //cudaMemcpy(A, hostA, datasize, cudaMemcpyHostToDevice);CUERR; empty_kernel<<<griddim, blocksize>>>(A, datasize);CUERR; } cudaThreadSynchronize();CUERR; t1 = gettime(); double gpu_time = t1 - t0; printf("%10d\t(%d,%d,%d) \t(%d,%d,%d) \t%f\n", datasize, blocksize, 1,1, griddim.x, griddim.y, griddim.z, gpu_time/repeat_num); fflush(stdout); datasize *=2; } cudaFreeHost(hostA);CUERR; cudaFree(A); CUERR; cudaThreadExit(); return 0; } int main() { kernel_launch_test(); } First kernel launch takes longer. Try cooking your kernel first. No, this is not another food joke. Cooking means launching it once (e..g after the cudaMalloc), doing a cudaThreadSynchronize(). Only then do your timing. |
|
|
|
Nov 5 2009, 01:02 AM
Post
#15
|
|
![]() ![]() ![]() ![]() Group: Members Posts: 60 Joined: 3-September 08 Member No.: 117,781 |
Thanks, I had completely forgotten this point...
I cooked the kernel, and now it is close to 20microsec! it is still big, isn't it?! First kernel launch takes longer. Try cooking your kernel first. No, this is not another food joke. Cooking means launching it once (e..g after the cudaMalloc), doing a cudaThreadSynchronize(). Only then do your timing. |
|
|
|
Nov 5 2009, 01:05 AM
Post
#16
|
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() Group: Members Posts: 692 Joined: 4-April 06 From: Karlsruhe / Munich, Germany Member No.: 18,632 Org.: Nomor Research GmbH |
Thanks, I had completely forgotten this point... I cooked the kernel, and now it is close to 20microsec! it is still big, isn't it?! This depends also on your CPU speed, I would assume. A lower clocked CPU might take 20microsec, a faster one only 15microsec to perform the same task. |
|
|
|
Nov 5 2009, 01:06 AM
Post
#17
|
|
![]() ![]() ![]() ![]() Group: Members Posts: 60 Joined: 3-September 08 Member No.: 117,781 |
|
|
|
|
Nov 5 2009, 01:15 AM
Post
#18
|
|
![]() ![]() ![]() ![]() Group: Members Posts: 60 Joined: 3-September 08 Member No.: 117,781 |
mine is AMD Phenom 9850, 2.5GHz, 4GB RAM, and PCIe bandwidth 8GB/S.
I just want to be sure the GPU works well with its most efficiency! This depends also on your CPU speed, I would assume. A lower clocked CPU might take 20microsec, a faster one only 15microsec to perform the same task. |
|
|
|
![]() ![]() |
| Copyright 2008 NVIDIA Corporation. Terms of Use | Legal Info | Privacy Policy | Time is now: 24th November 2009 - 12:46 AM |