> Varying the number of cores at runtime
saiyedul
post Nov 7 2009, 09:26 AM
Post #1



*

Group: Members
Posts: 3
Joined: 31-August 09
Member No.: 191,077



Can we specify the number of processor cores that our program should run on?
Actually, I'm doing a project on showing the speedup achieved by the same algorithm on different numbers of processing cores...
Does C for CUDA provide any way of doing this?

Please, anybody, do reply...
avidday
post Nov 7 2009, 09:47 AM
Post #2



*******

Group: Members
Posts: 814
Joined: 1-April 09
Member No.: 148,556



The short answer is no.

In CUDA, the hardware details are abstracted away and the programmer really only controls the number of threads to be run. The hardware itself decides how that thread count is translated into hardware execution parameters, and that translation can change depending on the hardware generation you are running on.
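
To make that concrete, here is a minimal sketch, assuming a reasonably recent CUDA runtime. The names `scale`, `sm_count` and `launch_scale` are illustrative and do not come from this thread. The multiprocessor count can be read from the device properties, but the launch itself only takes a block count and a thread count.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each thread scales one element.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Returns the number of multiprocessors on `device`, or -1 on error.
// The value can only be read; the launch syntax has no slot for it.
int sm_count(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.multiProcessorCount;
}

// Launches `scale` over n device elements. Only the block and thread counts
// are chosen here; placing blocks on multiprocessors is left to the hardware.
// Returns 0 on success, -1 on a CUDA error.
int launch_scale(float *d_x, int n, float a)
{
    if (n <= 0) return 0;                       // nothing to do
    int threads = 128;
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
    scale<<<blocks, threads>>>(d_x, n, a);
    if (cudaGetLastError() != cudaSuccess) return -1;
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : -1;
}
```

Calling `sm_count(0)` on the first device tells you how many multiprocessors there are, and that is about as far as the runtime goes.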
Cygnus X1
post Nov 7 2009, 10:00 AM
Post #3



*****

Group: Members
Posts: 160
Joined: 30-October 08
From: Saarbruecken, Germany & Kraków, Poland
Member No.: 124,006
Org.: Saarland University



I was told you can use RivaTuner to do that. It was originally designed for overclocking, but you can do various things with it, and disabling some SMs is likely one of its functions. If I am not mistaken, you can also downclock the cores or the memory to see how these affect your execution and, in a way, check whether your kernels are compute- or bandwidth-limited.

Never used it so far, but I will probably try it one day too...

This post has been edited by Cygnus X1: Nov 7 2009, 10:01 AM
saiyedul
post Nov 7 2009, 10:14 AM
Post #4



*

Group: Members
Posts: 3
Joined: 31-August 09
Member No.: 191,077



QUOTE (Cygnus X1 @ Nov 7 2009, 03:30 PM) *
I was told you can use RivaTuner to do that. It was originally designed for overclocking, but you can do various things with it, and disabling some SMs is likely one of its functions. If I am not mistaken, you can also downclock the cores or the memory to see how these affect your execution and, in a way, check whether your kernels are compute- or bandwidth-limited.

Never used it so far, but I will probably try it one day too...



Hey, actually I have a 9400GT (2 multiprocessors, 16 SMs)... Originally I had planned to run my program on 2, 4, 8 and 16 SMs and watch the difference in run time...
So what should I do now?
avidday
post Nov 7 2009, 10:19 AM
Post #5



*******

Group: Members
Posts: 814
Joined: 1-April 09
Member No.: 148,556



QUOTE (saiyedul @ Nov 7 2009, 12:14 PM) *
Hey, actually I have a 9400GT (2 multiprocessors, 16 SMs)... Originally I had planned to run my program on 2, 4, 8 and 16 SMs and watch the difference in run time...
So what should I do now?

Give up, because you can't do that. Scheduling and execution control happen at the multiprocessor level. There is no way to have finer-grained scheduling than that.
cbuchner1
post Nov 7 2009, 10:43 AM
Post #6



*******

Group: Members
Posts: 692
Joined: 4-April 06
From: Karlsruhe / Munich, Germany
Member No.: 18,632
Org.: Nomor Research GmbH



QUOTE (avidday @ Nov 7 2009, 11:19 AM) *
Give up, because you can't do that. Scheduling and execution control happen at the multiprocessor level. There is no way to have finer-grained scheduling than that.


Giving up? That word is not known to me.

If you take one card with, say, 12 SMs like the 9600GSO, you can easily launch a grid consisting of hundreds of blocks, where you hardcode the first N blocks to wait for the rest to do the real work (using spinlocks built on global atomics, for example).

CODE
    if (blockIdx.x < N)
    {
        // spinlock until algorithm is finished
    }
    else
    {
        // do real work based on blockIdx.x - N as "true" block index.
    }


N should be < 12 of course. So you can have your algorithm execute on 12-N SMs.
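
A fuller sketch of this idea follows, under stated assumptions. The names `throttled`, `run_throttled` and `true_block_index` are made up for illustration. The "real work" is a trivial placeholder, and a global counter stands in for the spinlock. It also assumes that the idle blocks start first and that each SM holds a single block, which the hardware does not promise, so treat any timing from it as an approximation. On a GPU that also drives a display, a long spin may trip the driver watchdog.

```cuda
#include <cuda_runtime.h>
#include <vector>

// "True" block index once the first `idle` blocks are set aside (illustrative helper).
__host__ __device__ inline int true_block_index(int raw, int idle)
{
    return raw - idle;
}

// Hypothetical kernel: blocks [0, idle) only occupy an SM, the rest do the work.
__global__ void throttled(float *x, int n, int idle, unsigned int *done)
{
    int work_blocks = (int)gridDim.x - idle;

    if ((int)blockIdx.x < idle) {
        // Thread 0 spins until every working block has reported completion.
        if (threadIdx.x == 0) {
            volatile unsigned int *d = done;
            while (*d < (unsigned int)work_blocks) { }
        }
        return;
    }

    int b = true_block_index((int)blockIdx.x, idle);
    // Stride loop: any number of working blocks covers all n elements.
    for (int i = b * blockDim.x + threadIdx.x; i < n; i += work_blocks * blockDim.x)
        x[i] = 2.0f * x[i] + 1.0f;

    __syncthreads();              // the whole block has finished its share
    if (threadIdx.x == 0)
        atomicAdd(done, 1u);      // report completion to the spinning blocks
}

// Launches `blocks` blocks of which the first `idle` spin, padding each block with
// `smem_bytes` of dynamic shared memory. Writes the elapsed time to *ms and returns
// the number of wrong results (0 if all correct), or -1 on a CUDA error.
int run_throttled(int blocks, int idle, size_t smem_bytes, float *ms)
{
    const int n = 1 << 20, threads = 128;
    if (idle < 0 || idle >= blocks) return -1;

    std::vector<float> h(n, 1.0f);
    float *d_x = 0;
    unsigned int *d_done = 0;
    cudaEvent_t t0, t1;
    if (cudaMalloc(&d_x, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_done, sizeof(unsigned int)) != cudaSuccess) { cudaFree(d_x); return -1; }
    cudaMemcpy(d_x, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_done, 0, sizeof(unsigned int));
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0);
    throttled<<<blocks, threads, smem_bytes>>>(d_x, n, idle, d_done);
    cudaEventRecord(t1);
    cudaError_t err = cudaEventSynchronize(t1);
    if (err == cudaSuccess) err = cudaGetLastError();

    int wrong = -1;
    if (err == cudaSuccess) {
        cudaEventElapsedTime(ms, t0, t1);
        cudaMemcpy(h.data(), d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
        wrong = 0;
        for (int i = 0; i < n; ++i)
            if (h[i] != 3.0f) ++wrong;   // every element started at 1, so 2*1+1 = 3
    }
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(d_x);
    cudaFree(d_done);
    return wrong;
}
```

On a 12 SM card you would call, for example, `run_throttled(12, N, pad, &ms)` for each N and compare the times, where `pad` is whatever padding keeps a second block off each SM.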

However, you have to make sure that each SM executes only one block at a time, for example by allocating more than 8192 bytes of shared memory per block (over half of the 16 KB each SM has on this hardware generation).
Or by using enough registers that a second block does not fit.
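
Newer CUDA toolkits can check the one-block-per-SM condition for you. This is a hedged sketch using the runtime's occupancy query, which appeared in toolkits released well after this hardware. `my_kernel`, `blocks_per_sm` and `min_padding_for_one_block` are hypothetical names. On GPUs where the per-block shared memory limit is less than half of what an SM has, padding alone may not be enough, and the helper then reports -1.

```cuda
#include <cuda_runtime.h>

// Stand-in for the real kernel (hypothetical). Dynamic shared memory requested at
// launch is reserved for the block whether or not the kernel touches it.
__global__ void my_kernel(float *x)
{
    x[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

// Number of my_kernel blocks that can be resident on one SM for this block size
// and dynamic shared memory request, or -1 on error.
int blocks_per_sm(int threads, size_t smem_bytes)
{
    int n = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &n, my_kernel, threads, smem_bytes);
    return err == cudaSuccess ? n : -1;
}

// Smallest padding (in 1 KB steps) that limits my_kernel to one block per SM on
// device 0, or -1 if no padding within the per-block limit achieves it.
long min_padding_for_one_block(int threads)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    for (size_t bytes = 0; bytes <= prop.sharedMemPerBlock; bytes += 1024)
        if (blocks_per_sm(threads, bytes) == 1) return (long)bytes;
    return -1;
}
```

The returned padding would be passed as the third launch parameter, as in `my_kernel<<<blocks, threads, pad>>>(d_x)`.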

Cheers, you can quote me when you get the Nobel prize for your work.

Christian

This post has been edited by cbuchner1: Nov 7 2009, 10:46 AM
saiyedul
post Nov 7 2009, 12:34 PM
Post #7



*

Group: Members
Posts: 3
Joined: 31-August 09
Member No.: 191,077



Thanks, buddy, for your idea... but I'm totally new to parallel programming and CUDA (in fact I'm a 3rd-year engineering graduate student), so could you please explain your method a bit more, or suggest some further reading on this particular problem?
avidday
post Nov 7 2009, 12:59 PM
Post #8



*******

Group: Members
Posts: 814
Joined: 1-April 09
Member No.: 148,556



What he is suggesting is that you write code which runs on all cores, but simulates running on fewer. You do that by manipulating the execution parameters to control precisely how many threads run on each multiprocessor, and by using communication between the threads to determine which of them actually do computations. Through careful instrumentation you could deduce the approximate scalability of the algorithm on 2N cores, where N could be 1, 2, 4 or 8 for your 2 multiprocessor GPU.

It isn't trivial and it isn't really what you are asking for, just an approximation of it. CUDA isn't like MPI or OpenMP, where you can just pick a number of processes or threads and their affinity at runtime and the code itself need not know anything about it.
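
For the instrumentation part, one possible sketch is below. It times a launch with CUDA events and turns two measured times into a speedup and an efficiency figure. The kernel `busy` and the helpers `time_launch_ms`, `speedup` and `efficiency` are illustrative names, not something from this thread, and how you restrict the number of active units for each run is left open.

```cuda
#include <cuda_runtime.h>

// Illustrative workload: a stride loop, so the total work is the same for any grid size.
__global__ void busy(float *x, int n, int iters)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        float v = x[i];
        for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
        x[i] = v;
    }
}

// Times one launch of `busy` in milliseconds using CUDA events; -1.0f on error.
float time_launch_ms(float *d_x, int n, int blocks, int threads, int iters)
{
    cudaEvent_t t0, t1;
    float ms = -1.0f;
    if (cudaEventCreate(&t0) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&t1) != cudaSuccess) { cudaEventDestroy(t0); return -1.0f; }

    cudaEventRecord(t0);
    busy<<<blocks, threads>>>(d_x, n, iters);
    cudaError_t launch = cudaGetLastError();    // catches a bad launch configuration
    cudaEventRecord(t1);

    if (launch != cudaSuccess ||
        cudaEventSynchronize(t1) != cudaSuccess ||
        cudaEventElapsedTime(&ms, t0, t1) != cudaSuccess)
        ms = -1.0f;

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return ms;
}

// Speedup of a run that took t_k relative to the baseline time t_base.
inline double speedup(double t_base, double t_k) { return t_base / t_k; }

// Parallel efficiency: speedup divided by the number of units k used for t_k.
inline double efficiency(double t_base, double t_k, int k)
{
    return speedup(t_base, t_k) / k;
}
```

If the baseline run takes 8 ms and the run on 4 units takes 2 ms, `speedup(8.0, 2.0)` gives 4 and `efficiency(8.0, 2.0, 4)` gives 1, i.e. perfect scaling. Real measurements will normally land below that.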

This post has been edited by avidday: Nov 7 2009, 01:03 PM
Cygnus X1
post Nov 7 2009, 01:12 PM
Post #9



*****

Group: Members
Posts: 160
Joined: 30-October 08
From: Saarbruecken, Germany & Kraków, Poland
Member No.: 124,006
Org.: Saarland University



I still believe RivaTuner is the simplest way to do it. Try googling it. It is a program which runs in the background and allows you to tweak a lot of parameters of your GPU. Just be careful: if you set your GPU to do too much too fast, you may simply burn it!
From my understanding, you don't have to change which multiprocessors are running at runtime, but rather between program launches.

Btw. "SM" stands for "Streaming Multiprocessor". I understand you want to try running it on one or both of them on your GPU.
I think that tweaking it so much that you use fewer than all the SPs (scalar processors) of your SM wouldn't be that useful. What for? So far, warp size does not change between different GPUs, while SM count does.


P.S.
QUOTE
Giving up? That word is not known to me.

I like this way of thinking :D

This post has been edited by Cygnus X1: Nov 7 2009, 01:13 PM
Copyright 2008 NVIDIA Corporation.