IPB

Welcome Guest ( Log In | Register )

 
Reply to this topicStart new topic
> kernel launch overhead for GTX 280
kolonel
post Oct 26 2009, 10:43 PM
Post #1



****

Group: Members
Posts: 60
Joined: 3-September 08
Member No.: 117,781



Hi all,
I calculated the kernel launch overhead for GTX 280 around 60microsecond, but I've seen papers reporting this time around 10microsecond (including thread synchronizing). Does this kernel lunch overhead depends on the size of grid and thread blocks? what is the standard way to calculate this kernel lunch overhead?
Thanks
Go to the top of the page
 
+Quote Post
kolonel
post Nov 3 2009, 05:31 PM
Post #2



****

Group: Members
Posts: 60
Joined: 3-September 08
Member No.: 117,781



No any reply for this post?!! please guys...
Go to the top of the page
 
+Quote Post
MisterAnderson42
post Nov 3 2009, 05:35 PM
Post #3



********

Group: Extranet Users
Posts: 2,289
Joined: 23-March 07
Member No.: 46,425
Org.: University of Michigan



QUOTE (kolonel @ Oct 26 2009, 04:43 PM) *
Hi all,
I calculated the kernel launch overhead for GTX 280 around 60microsecond, but I've seen papers reporting this time around 10microsecond (including thread synchronizing). Does this kernel lunch overhead depends on the size of grid and thread blocks? what is the standard way to calculate this kernel lunch overhead?
Thanks

A very long time ago, I looked at the launch overhead of 8800 GTX as a function of the number of blocks and found it to be linearly increasing. I haven't looked at it since then. It is a very simple benchmark to perform just time an empty kernel with increasingly larger block sizes and plot the results. I suggest you try it and post the plot!
Go to the top of the page
 
+Quote Post
kolonel
post Nov 3 2009, 05:45 PM
Post #4



****

Group: Members
Posts: 60
Joined: 3-September 08
Member No.: 117,781



I found this paper http://www.netlib.org/lapack/lawnspdf/lawn202.pdf which in section 3.1 it says the kernel lunch time is around 10microsec for OLD gpus, I measured this time as 60microsec for GTX 280, I supposed in 280 the kernel lunch time must be lower or at least equal to older gpus, am I right?
Go to the top of the page
 
+Quote Post
MisterAnderson42
post Nov 3 2009, 06:43 PM
Post #5



********

Group: Extranet Users
Posts: 2,289
Joined: 23-March 07
Member No.: 46,425
Org.: University of Michigan



The launch overhead (for small kernels) has steadily improved as CUDA has matured. Current hardware and CUDA 2.3 typically gets a launch overhead of around 7-8 microseconds.

You aren't running on windows Vista or windows 7, but chance? The overhead on those platforms is much higher compared to linux.

This post has been edited by MisterAnderson42: Nov 3 2009, 06:43 PM
Go to the top of the page
 
+Quote Post
cbuchner1
post Nov 4 2009, 01:45 PM
Post #6



*******

Group: Members
Posts: 692
Joined: 4-April 06
From: Karlsruhe / Munich, Germany
Member No.: 18,632
Org.: Nomor Research GmbH



QUOTE (kolonel @ Oct 26 2009, 11:43 PM) *
what is the standard way to calculate this kernel lunch overhead?


My lunch time is around 45 minutes to an hour, but then I don't really like to eat kernels as I am not a vegetarian. ;-)

Christian
Go to the top of the page
 
+Quote Post
kolonel
post Nov 4 2009, 09:08 PM
Post #7



****

Group: Members
Posts: 60
Joined: 3-September 08
Member No.: 117,781



I'm running on windows XP64, and CUDA 2.3! This could be the reason? blink.gif

QUOTE (MisterAnderson42 @ Nov 3 2009, 11:43 AM) *
The launch overhead (for small kernels) has steadily improved as CUDA has matured. Current hardware and CUDA 2.3 typically gets a launch overhead of around 7-8 microseconds.

You aren't running on windows Vista or windows 7, but chance? The overhead on those platforms is much higher compared to linux.

Go to the top of the page
 
+Quote Post
kolonel
post Nov 4 2009, 09:10 PM
Post #8



****

Group: Members
Posts: 60
Joined: 3-September 08
Member No.: 117,781



good one! thumbup.gif

QUOTE (cbuchner1 @ Nov 4 2009, 06:45 AM) *
My lunch time is around 45 minutes to an hour, but then I don't really like to eat kernels as I am not a vegetarian. ;-)

Christian

Go to the top of the page
 
+Quote Post
tmurray
post Nov 4 2009, 10:02 PM
Post #9



********

Group: NVIDIA Employees
Posts: 2,072
Joined: 3-June 08
From: Santa Clara, CA
Member No.: 106,363
Org.: NVIDIA



XP64 launch overhead is <10 microseconds, depending on your PCIe bandwidth. Vista/Win7 is... not.
Go to the top of the page
 
+Quote Post
gshi
post Nov 4 2009, 10:14 PM
Post #10



***

Group: Members
Posts: 42
Joined: 30-June 08
Member No.: 110,141
Org.: NCSA (University of Illinois)



I have done the test a week ago, here is some number in gtx280 (forget the datasize, it is an empty function call)

Empty kernel call only:
datasize(bytes) blockdim griddim time(s)
1 (1,1,1) (1,1,1) 0.000015
2 (2,1,1) (1,1,1) 0.000014
4 (4,1,1) (1,1,1) 0.000016
8 (8,1,1) (1,1,1) 0.000015
16 (16,1,1) (1,1,1) 0.000015
32 (32,1,1) (1,1,1) 0.000015
64 (64,1,1) (1,1,1) 0.000014
128 (128,1,1) (1,1,1) 0.000015
256 (256,1,1) (1,1,1) 0.000016
512 (512,1,1) (1,1,1) 0.000015
1024 (512,1,1) (2,1,1) 0.000015
2048 (512,1,1) (4,1,1) 0.000014
4096 (512,1,1) (8,1,1) 0.000015
8192 (512,1,1) (16,1,1) 0.000014
16384 (512,1,1) (32,1,1) 0.000017
32768 (512,1,1) (64,1,1) 0.000016
65536 (512,1,1) (128,1,1) 0.000017
131072 (512,1,1) (256,1,1) 0.000018
262144 (512,1,1) (512,1,1) 0.000022
524288 (512,1,1) (1024,1,1) 0.000029
1048576 (512,1,1) (2048,1,1) 0.000039
2097152 (512,1,1) (4096,1,1) 0.000064
4194304 (512,1,1) (8192,1,1) 0.000115
8388608 (512,1,1) (16384,1,1) 0.000212
Go to the top of the page
 
+Quote Post
kolonel
post Nov 5 2009, 12:12 AM
Post #11



****

Group: Members
Posts: 60
Joined: 3-September 08
Member No.: 117,781



If these times include the cudaThreadSynchronize() too?
Is that possible to share your code to do some comparison?

I inserted my code here. For a GTX 280 and on Windows XP 64, the resulted time including cudaThreadSynchronize() is 0.48millisec, and excluding cudaThreadSynchronize() is 0.039millisec!! Is it a big time?! blink.gif
Thanks.

CODE
__global__  
void HelloCUDA2(float* device_result1)
{
    
}

int main(int argc, char* argv[])
{

    float *d_A;
    int N=1500;
    CUDA_SAFE_CALL( cudaMalloc((void**) &d_A, sizeof(float) * N * N));

    unsigned int timer = 0;
    CUT_SAFE_CALL( cutCreateTimer( &timer));
    CUT_SAFE_CALL( cutStartTimer( timer));

    HelloCUDA2<<<(N+127)/128, 128>>>(d_A);

    CUDA_SAFE_CALL( cudaThreadSynchronize() );
    
    CUT_SAFE_CALL( cutStopTimer( timer));
    printf("Processing time: %f (ms)\n", cutGetTimerValue( timer));
    CUT_SAFE_CALL( cutDeleteTimer( timer));

    CUDA_SAFE_CALL( cudaFree(d_A));

    return 0;
}



QUOTE (gshi @ Nov 4 2009, 03:14 PM) *
I have done the test a week ago, here is some number in gtx280 (forget the datasize, it is an empty function call)

Empty kernel call only:
datasize(bytes) blockdim griddim time(s)
1 (1,1,1) (1,1,1) 0.000015
2 (2,1,1) (1,1,1) 0.000014
4 (4,1,1) (1,1,1) 0.000016
8 (8,1,1) (1,1,1) 0.000015
16 (16,1,1) (1,1,1) 0.000015
32 (32,1,1) (1,1,1) 0.000015
64 (64,1,1) (1,1,1) 0.000014
128 (128,1,1) (1,1,1) 0.000015
256 (256,1,1) (1,1,1) 0.000016
512 (512,1,1) (1,1,1) 0.000015
1024 (512,1,1) (2,1,1) 0.000015
2048 (512,1,1) (4,1,1) 0.000014
4096 (512,1,1) (8,1,1) 0.000015
8192 (512,1,1) (16,1,1) 0.000014
16384 (512,1,1) (32,1,1) 0.000017
32768 (512,1,1) (64,1,1) 0.000016
65536 (512,1,1) (128,1,1) 0.000017
131072 (512,1,1) (256,1,1) 0.000018
262144 (512,1,1) (512,1,1) 0.000022
524288 (512,1,1) (1024,1,1) 0.000029
1048576 (512,1,1) (2048,1,1) 0.000039
2097152 (512,1,1) (4096,1,1) 0.000064
4194304 (512,1,1) (8192,1,1) 0.000115
8388608 (512,1,1) (16384,1,1) 0.000212

Go to the top of the page
 
+Quote Post
kolonel
post Nov 5 2009, 12:16 AM
Post #12



****

Group: Members
Posts: 60
Joined: 3-September 08
Member No.: 117,781



For my case it is much more than 10microsec!! I inserted my code at gshi's response...please take a look and see if I am doing something wrong!?!
Thank you.


QUOTE (tmurray @ Nov 4 2009, 03:02 PM) *
XP64 launch overhead is <10 microseconds, depending on your PCIe bandwidth. Vista/Win7 is... not.

Go to the top of the page
 
+Quote Post
cbuchner1
post Nov 5 2009, 12:19 AM
Post #13



*******

Group: Members
Posts: 692
Joined: 4-April 06
From: Karlsruhe / Munich, Germany
Member No.: 18,632
Org.: Nomor Research GmbH



QUOTE (kolonel @ Nov 5 2009, 01:16 AM) *
For my case it is much more than 10microsec!! I inserted my code at gshi's response...please take a look and see if I am doing something wrong!?!
Thank you.


First kernel launch takes longer.

Try cooking your kernel first. No, this is not another food joke.

Cooking means launching it once (e..g after the cudaMalloc), doing a cudaThreadSynchronize(). Only then do your timing.
Go to the top of the page
 
+Quote Post
gshi
post Nov 5 2009, 12:41 AM
Post #14



***

Group: Members
Posts: 42
Joined: 30-June 08
Member No.: 110,141
Org.: NCSA (University of Illinois)



yes, the first kernel needs to be outside the loop. Here is the code

CODE
#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <unistd.h>
#include <sys/time.h>

#include <sched.h>


#define CUERR  do{ cudaError_t err; \
        cudaThreadSynchronize(); \
        if ((err = cudaGetLastError()) != cudaSuccess) { \
        printf("ERROR: CUDA error: %s, line %d\n", cudaGetErrorString(err), __LINE__); \
        exit(-1);                                                       \
        }}while(0)



double
gettime(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);

    return t.tv_sec + 0.000001* t.tv_usec;


}

__global__ void
empty_kernel(void* p, int len)
{
}

int
kernel_launch_test()
{
    char* A = NULL;
    char* hostA = NULL;
    double t0, t1;
    unsigned int max_data_size = 1 << 29;

    int i;
    int repeat_num = 1000;

    char hostname[64];

    gethostname(hostname, 64);
    printf("Runing on host %s\n", hostname);

    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(0, &cpuset);


    int device = 1;
    cudaSetDevice(device);CUERR;

    int datasize =1;
    cudaMallocHost((void**)&hostA, max_data_size); CUERR;
    cudaMalloc((void**)&A, max_data_size); CUERR;
    cudaMemcpy(A, hostA, datasize, cudaMemcpyHostToDevice);CUERR;

    empty_kernel<<<1, 1>>>(A, datasize);

    printf("datasize(bytes)  blockdim      griddim        time(s)\n");
    while( datasize <= max_data_size){


        int blocksize= (datasize <=512)? datasize: 512;
        int gridsize_x = datasize/ blocksize;
        int gridsize_y = 1;
        if (gridsize_x ==0){
            gridsize_x= 1;
        }
#define K64 (32*1024)
        if(gridsize_x > K64){
            gridsize_y = gridsize_x /K64;
            gridsize_x = K64;
        }
        dim3 griddim(gridsize_x, gridsize_y);

        t0 = gettime();

        for (i =0;i <repeat_num;i ++){
            //cudaMemcpy(A, hostA, datasize, cudaMemcpyHostToDevice);CUERR;
            empty_kernel<<<griddim, blocksize>>>(A, datasize);CUERR;
        }
        cudaThreadSynchronize();CUERR;
        t1 = gettime();

        double gpu_time = t1 - t0;

        printf("%10d\t(%d,%d,%d) \t(%d,%d,%d) \t%f\n",
               datasize,
               blocksize, 1,1,
               griddim.x, griddim.y, griddim.z,
               gpu_time/repeat_num);
        fflush(stdout);

        datasize *=2;


    }

    cudaFreeHost(hostA);CUERR;
    cudaFree(A); CUERR;

    cudaThreadExit();
    return 0;

}

int main()
{
    kernel_launch_test();
}

QUOTE (cbuchner1 @ Nov 4 2009, 07:19 PM) *
First kernel launch takes longer.

Try cooking your kernel first. No, this is not another food joke.

Cooking means launching it once (e..g after the cudaMalloc), doing a cudaThreadSynchronize(). Only then do your timing.

Go to the top of the page
 
+Quote Post
kolonel
post Nov 5 2009, 01:02 AM
Post #15



****

Group: Members
Posts: 60
Joined: 3-September 08
Member No.: 117,781



Thanks, I had completely forgotten this point...
I cooked the kernel, and now it is close to 20microsec! it is still big, isn't it?!


QUOTE (cbuchner1 @ Nov 4 2009, 05:19 PM) *
First kernel launch takes longer.

Try cooking your kernel first. No, this is not another food joke.

Cooking means launching it once (e..g after the cudaMalloc), doing a cudaThreadSynchronize(). Only then do your timing.

Go to the top of the page
 
+Quote Post
cbuchner1
post Nov 5 2009, 01:05 AM
Post #16



*******

Group: Members
Posts: 692
Joined: 4-April 06
From: Karlsruhe / Munich, Germany
Member No.: 18,632
Org.: Nomor Research GmbH



QUOTE (kolonel @ Nov 5 2009, 02:02 AM) *
Thanks, I had completely forgotten this point...
I cooked the kernel, and now it is close to 20microsec! it is still big, isn't it?!


This depends also on your CPU speed, I would assume. A lower clocked CPU might take 20microsec, a faster one only 15microsec to perform the same task.
Go to the top of the page
 
+Quote Post
kolonel
post Nov 5 2009, 01:06 AM
Post #17



****

Group: Members
Posts: 60
Joined: 3-September 08
Member No.: 117,781



Thanks for the code, BTW, what is your platform: linux or windows?!
I tried to compile your code, but it cannot find sched.h and unistd.h!

QUOTE (gshi @ Nov 4 2009, 05:41 PM) *
yes, the first kernel needs to be outside the loop. Here is the code

Go to the top of the page
 
+Quote Post
kolonel
post Nov 5 2009, 01:15 AM
Post #18



****

Group: Members
Posts: 60
Joined: 3-September 08
Member No.: 117,781



mine is AMD Phenom 9850, 2.5GHz, 4GB RAM, and PCIe bandwidth 8GB/S.
I just want to be sure the GPU works well with its most efficiency!

QUOTE (cbuchner1 @ Nov 4 2009, 06:05 PM) *
This depends also on your CPU speed, I would assume. A lower clocked CPU might take 20microsec, a faster one only 15microsec to perform the same task.

Go to the top of the page
 
+Quote Post

Reply to this topicStart new topic

 



Copyright 2008 NVIDIA Corporation.  Terms of Use | Legal Info | Privacy Policy Time is now: 24th November 2009 - 12:46 AM
Unites States Argentina Brazil Chile China Colombia France Germany India Italy Japan Korea Mexico Poland Russia Spain Taiwan United Kingdom Venezuela