CUDA 2.2 beta features
tmurray
post Mar 18 2009, 08:31 PM
Post #1



********

Group: NVIDIA Employees
Posts: 2,065
Joined: 3-June 08
From: Santa Clara, CA
Member No.: 106,363
Org.: NVIDIA



The CUDA 2.2 beta is available to registered developers--if you want to become a registered developer, sign up here.

A brief overview of CUDA 2.2 beta features:

- Zero-copy support (see this thread for more information)
- Asynchronous memcpy on Vista/Server 2008
- Texturing from pitch-linear memory
- cuda-gdb for 64-bit Linux (it is pretty great)
- OGL interop performance improvements
- CUDA profiler supports a lot more counters on GT200. I think this includes memory bandwidth counters (counters for each transaction size) and instruction count. In other words, you can very easily determine if you're bandwidth limited or compute limited, which makes it far more useful than it used to be.
- CUDA profiler works on Vista
- >4GB of pinned memory in a single allocation (except in Vista, where the limit is still 256MB per allocation, but I think this is going to be raised between now and the final release)
- Blocking sync for all platforms. Whether this made it into the headers for the beta, I'm not entirely sure--I've heard conflicting reports and need to check this afternoon. Basically, it's a context creation flag where instead of spinlocking or spinlocking+yielding when a thread is waiting for the GPU, the thread will sleep and the driver will wake it up when the event has completed. It's not the default mode because you're at the mercy of the OS thread scheduler which will sometimes increase latency, but if you want to minimize CPU utilization, it's very nice.
- Officially supports Ubuntu 8.10, RHEL 5.3, Fedora 10
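To make the blocking-sync trade-off in the list above concrete, here is a minimal host-only C++ sketch of the two waiting strategies. It is an analogy, not driver code: `Completion`, `wait_spin_yield`, `wait_blocking`, and `run_job` are hypothetical names, and a worker thread stands in for the GPU.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for "the GPU finished the work" (not a CUDA type).
struct Completion {
    std::atomic<bool> done{false};
    std::mutex m;
    std::condition_variable cv;

    // Called by the worker (the "GPU" in this analogy) when it finishes.
    void signal() {
        {
            std::lock_guard<std::mutex> lock(m);
            done.store(true, std::memory_order_release);
        }
        cv.notify_all();
    }
};

// Spin + yield: low wake-up latency, but the waiting thread keeps polling.
inline void wait_spin_yield(Completion& c) {
    while (!c.done.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
}

// Blocking: the thread sleeps until signalled, so CPU use stays low,
// but the wake-up time now depends on the OS scheduler.
inline void wait_blocking(Completion& c) {
    std::unique_lock<std::mutex> lock(c.m);
    c.cv.wait(lock, [&c] { return c.done.load(std::memory_order_acquire); });
}

// Runs one "job" on a worker thread and waits with the chosen strategy.
inline int run_job(bool blocking) {
    Completion c;
    int result = 0;
    std::thread worker([&] {
        result = 42;  // pretend this is the GPU work
        c.signal();
    });
    if (blocking) {
        wait_blocking(c);
    } else {
        wait_spin_yield(c);
    }
    assert(c.done.load());
    worker.join();
    return result;
}
```

Both strategies return the same result; the difference is only in how the waiting thread spends its time, which is the CPU-utilization versus latency trade-off described for the context creation flag.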

There's one last feature that didn't make it in the beta that I think is the best feature in 2.2 (even compared to the dramatically improved profiler, zero-copy and the 64-bit debugger), but I don't want to spoil it...

Edit: Here's the 2.2 beta programming guide.

edit 2: I am bad at not revealing surprises. There's still a second surprise in the final release for Windows users, though.

edit 3: Surprise 2: a test version of /MD CUDART. I revealed it because I want feedback on it and whether anyone has objections to moving everything over to /MD going forward.
 
tmurray
post Mar 18 2009, 09:43 PM
Post #2

Some more features:

- There are a number of new device functions:

__brev(), __brevll() 32-bit and 64-bit bit reversal
__frcp_r{n,z,u,d}() single-precision reciprocal with IEEE rounding
__fsqrt_r{n,z,u,d}() single-precision square root with IEEE rounding
__fdiv_r{n,z,u,d}() single-precision division with IEEE rounding
__fadd_r{u,d}() single-precision addition with directed rounding
__fmul_r{u,d}() single-precision multiplication with directed rounding

__threadfence(): I'm not sure if there are docs for this yet--it's kind of hard to explain, so I'm not going to comment too much about it here because I forget what its exact behavior is.

- Context creation flags can now be set in CUDART.
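For readers unfamiliar with directed rounding, here is a host-side C++ sketch of what round-up and round-down addition mean, using the standard `<cfenv>` rounding modes. It is only an analogy for `__fadd_ru()` and `__fadd_rd()`: the helper names `add_with_mode`, `add_round_up`, and `add_round_down` are made up for illustration, and this sketch relies on a global rounding mode, which is an assumption of the host version only.

```cpp
#include <cassert>
#include <cfenv>

// Adds a + b under the given rounding mode, then restores the previous mode.
// The volatile variables keep the compiler from folding or moving the addition.
inline float add_with_mode(float a, float b, int mode) {
    const int saved = std::fegetround();
    std::fesetround(mode);
    volatile float x = a;
    volatile float y = b;
    volatile float sum = x + y;
    std::fesetround(saved);
    return sum;
}

// Result is rounded toward +infinity.
inline float add_round_up(float a, float b) {
    return add_with_mode(a, b, FE_UPWARD);
}

// Result is rounded toward -infinity.
inline float add_round_down(float a, float b) {
    return add_with_mode(a, b, FE_DOWNWARD);
}
```

Adding a tiny positive value to `1.0f` shows the effect: the round-down result stays at `1.0f`, while the round-up result is the next representable float above it. The pair brackets the exact sum, which is the usual reason to want directed rounding (interval arithmetic, for example).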
 
profquail
post Mar 18 2009, 09:52 PM
Post #3



*******

Group: Members
Posts: 615
Joined: 14-August 08
Member No.: 115,518
Org.: University of Alabama



One other function that might be neat to have would be a byte-order reversal method. Though CUDA only runs on little-endian systems, there are times when certain file types store their information in big-endian format; converting values like integers between endianness on the CPU could be a big bottleneck in those cases, but it is also something the GPU could do in massively parallel fashion. Perhaps versions for 2-byte, 4-byte, and 8-byte values (which would obviously work for both floating-point and integer types).
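As a reference for what such a byte-order reversal would compute, here is a minimal portable C++ sketch for 2-, 4-, and 8-byte values. The names `bswap16`, `bswap32`, `bswap64`, and `float_from_swapped_bits` are hypothetical, not CUDA functions; the shift-and-mask bodies are plain integer arithmetic, so similar code could presumably be used in a kernel, though that is not tested here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Swap the two bytes of a 16-bit value.
constexpr std::uint16_t bswap16(std::uint16_t v) {
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

// Reverse the four bytes of a 32-bit value.
constexpr std::uint32_t bswap32(std::uint32_t v) {
    return (v << 24) | ((v << 8) & 0x00FF0000u) |
           ((v >> 8) & 0x0000FF00u) | (v >> 24);
}

// Reverse the eight bytes of a 64-bit value by swapping each half
// and exchanging the halves.
constexpr std::uint64_t bswap64(std::uint64_t v) {
    return (static_cast<std::uint64_t>(bswap32(static_cast<std::uint32_t>(v))) << 32) |
           bswap32(static_cast<std::uint32_t>(v >> 32));
}

// Floating-point data: swap the raw bits first, then reinterpret them.
// The opposite-endian value is kept as an integer until it has been swapped.
inline float float_from_swapped_bits(std::uint32_t raw) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    const std::uint32_t bits = bswap32(raw);
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

static_assert(bswap16(0x1234u) == 0x3412u, "16-bit swap");
static_assert(bswap32(0x12345678u) == 0x78563412u, "32-bit swap");
static_assert(bswap64(0x0102030405060708ull) == 0x0807060504030201ull, "64-bit swap");
```

Each element is converted independently of its neighbours, which is why the operation maps naturally onto one thread per value.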

Also, will there ever be support for non-nVidia chipsets using the zero-copy methods (even if it's not for another few releases)? As I wrote in one of the other threads, I'm looking at building a new development machine later this year (when PCIe 3.0 and SATA 6Gbps are available), and I'd like to get something that is supported.

This post has been edited by profquail: Mar 18 2009, 09:56 PM
 
AndreiB
post Mar 18 2009, 10:11 PM
Post #4



*******

Group: Members
Posts: 577
Joined: 10-August 07
From: Russia, Moscow
Member No.: 65,038
Org.: ElcomSoft Co. Ltd.



Are there any limitations on device compute capability for using __brev() and __brevll()?

QUOTE
__threadfence(): I'm not sure if there are docs for this yet

Yes, there is something about it in 2.2 Programming Guide.


--------------------
// everything is reversible
 
tmurray
post Mar 18 2009, 10:42 PM
Post #5

Yeah, I haven't looked at the docs for 2.2 yet...

There's no limitation on device capability for those two functions. There's another function left out of the earlier post:

__fmaf_r{n,z,u,d}() // single-precision fused multiply-add with IEEE rounding

These are all done in software, so they're primarily for convenience, not speed.
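To give a feel for what a software bit reversal can look like, here is a portable C++ sketch using the classic approach of swapping progressively larger groups of bits. This is not NVIDIA's implementation, which the thread does not show; `bit_reverse32` and `bit_reverse64` are hypothetical names chosen only to mirror `__brev()` and `__brevll()`.

```cpp
#include <cassert>
#include <cstdint>

// Reverse the bit order of a 32-bit value in five mask-and-shift steps.
inline std::uint32_t bit_reverse32(std::uint32_t v) {
    v = ((v >> 1) & 0x55555555u) | ((v & 0x55555555u) << 1);  // swap adjacent bits
    v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);  // swap bit pairs
    v = ((v >> 4) & 0x0F0F0F0Fu) | ((v & 0x0F0F0F0Fu) << 4);  // swap nibbles
    v = ((v >> 8) & 0x00FF00FFu) | ((v & 0x00FF00FFu) << 8);  // swap bytes
    return (v >> 16) | (v << 16);                             // swap 16-bit halves
}

// Reverse a 64-bit value: reverse each 32-bit half and exchange the halves.
inline std::uint64_t bit_reverse64(std::uint64_t v) {
    const std::uint64_t lo = bit_reverse32(static_cast<std::uint32_t>(v));
    const std::uint64_t hi = bit_reverse32(static_cast<std::uint32_t>(v >> 32));
    return (lo << 32) | hi;
}
```

A sequence like this costs a couple of dozen simple integer operations per value, which fits the description of a convenience function rather than a single fast instruction.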
 
E.D. Riedijk
post Mar 18 2009, 11:10 PM
Post #6



********

Group: Extranet Users
Posts: 1,412
Joined: 4-March 08
Member No.: 94,948



Since you are in such a talkative mode (apart from the surprise still in store for us ;) )
The PTX ISA has been raised to 1.4, while the compute capability is still at 1.3. As far as I know, the two used to stay in sync. Does this mean that next-generation hardware will be compute capability 2.0?

Apart from that I can't wait to upgrade my dev box from FC8 to FC10 so I can install the 2.2 beta (profiler and debugger, here I come). Anyone have any tips for upgrading 8 -> 10?

This post has been edited by E.D. Riedijk: Mar 18 2009, 11:12 PM


--------------------
greets,
Denis
 
tmurray
post Mar 18 2009, 11:37 PM
Post #7

I can't comment on what the future holds, sorry. (Unless, of course, you want to know about CUDA 2.2...)
 
Sarnath
post Mar 19 2009, 05:22 AM
Post #8



********

Group: Members
Posts: 1,567
Joined: 23-November 07
From: Bangalore
Member No.: 79,873
Org.: HCL Technologies



QUOTE (E.D. Riedijk @ Mar 19 2009, 04:40 AM) *
Anyone have any tips for upgrading 8 -> 10?


Add a 2 :D


--------------------
Ignorance Rules; Knowledge Liberates!
 
SPWorley
post Mar 19 2009, 07:37 AM
Post #9



*******

Group: Members
Posts: 792
Joined: 13-June 08
From: California USA
Member No.: 107,688



QUOTE (tmurray @ Mar 18 2009, 01:43 PM) *
__brev(), __brevll() 32-bit and 64-bit bit reversal


Ohh these are awesome. Simple little trivial functions but handy!
These can be really useful in FFTs, and also in random number generation and seeding... I even posted to the wishlist thread!

Are they supported natively by all hardware? One clock ops?
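The FFT use mentioned above is the bit-reversal permutation: a radix-2 FFT of size 2^k reads or writes its data in the order given by reversing the low k bits of each index. A small C++ sketch of that index mapping follows, using a plain loop for clarity; `reverse_low_bits` and `bit_reversed_order` are hypothetical helper names, not CUDA functions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Reverse the lowest `bits` bits of `i` with a simple loop (clarity over speed).
inline std::uint32_t reverse_low_bits(std::uint32_t i, unsigned bits) {
    std::uint32_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1u);
        i >>= 1;
    }
    return r;
}

// Build the bit-reversed index order for a transform of size 2^log2_n.
inline std::vector<std::uint32_t> bit_reversed_order(unsigned log2_n) {
    assert(log2_n < 32);
    const std::uint32_t n = 1u << log2_n;
    std::vector<std::uint32_t> order(n);
    for (std::uint32_t i = 0; i < n; ++i) {
        order[i] = reverse_low_bits(i, log2_n);
    }
    return order;
}
```

For a size-8 transform this gives the order 0, 4, 2, 6, 1, 5, 3, 7. With a full 32-bit reversal available, the loop could be replaced by reversing all 32 bits and shifting right by `32 - log2_n` (for `log2_n` of at least 1).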
 
Sarnath
post Mar 19 2009, 07:46 AM
Post #10

QUOTE (SPWorley @ Mar 19 2009, 01:07 PM) *
Are they supported natively by all hardware? One clock ops?


I would suspect 2 clocks if there is only one register involved: one clock to gate the bits in parallel into another register, and another to copy that register back normally.

If it involves 2 registers, then it will take one clock.

Just my crude guesses...


--------------------
Ignorance Rules; Knowledge Liberates!
 
tmurray
post Mar 19 2009, 07:48 AM
Post #11

QUOTE (SPWorley @ Mar 19 2009, 12:37 AM) *
Ohh these are awesome. Simple little trivial functions but handy!
These can be really useful in FFTs, and also in random number generation and seeding... I even posted to the wishlist thread!

Are they supported natively by all hardware? One clock ops?

No, they're done in software, not hardware. Feature requests from the forums, basically.
 
pvonkaenel
post Mar 19 2009, 12:13 PM
Post #12



***

Group: Members
Posts: 32
Joined: 23-February 09
Member No.: 142,319
Org.: Harmonic Inc.



QUOTE (tmurray @ Mar 18 2009, 05:43 PM) *
Some more features:

- There are a number of new device functions:

__brev(), __brevll() 32-bit and 64-bit bit reversal
__frcp_r{n,z,u,d}() single-precision reciprocal with IEEE rounding
__fsqrt_r{n,z,u,d}() single-precision square root with IEEE rounding
__fdiv_r{n,z,u,d}() single-precision division with IEEE rounding
__fadd_r{u,d}() single-precision addition with directed rounding
__fmul_r{u,d}() single-precision multiplication with directed rounding

__threadfence(): I'm not sure if there are docs for this yet--it's kind of hard to explain, so I'm not going to comment too much about it here because I forget what its exact behavior is.

- Context creation flags can now be set in CUDART.


Whoa, are these basically GPU intrinsics? I'm very new to CUDA development and would love more information on calls like these, but I must have missed them in the docs. Could someone please point me to where I can learn more about GPU intrinsics?

Thanks,
Peter
 
SPWorley
post Mar 19 2009, 12:31 PM
Post #13

(re new intrinsics for bit reversal, etc)
QUOTE (tmurray @ Mar 18 2009, 11:48 PM) *
No, they're done in software, not hardware. Feature requests from the forums, basically.


Out of curiosity, are these kinds of library functions done at some lower level of coding that's more efficient? (Some microcode kind of access?) Or are they just sort of wrappers around the kinds of calls we could theoretically do ourselves in PTX?

There are so many layers of abstraction in any architecture, but in CUDA there are even more than most, and I'm just curious if the layer of abstraction below intrinsic functions is something powerful and promising for future (pleasant) surprises like this.

(BTW, please give those low level hackers a thumbs up from us all..)

 
MisterAnderson42
post Mar 19 2009, 12:45 PM
Post #14



********

Group: Extranet Users
Posts: 2,289
Joined: 23-March 07
Member No.: 46,425
Org.: University of Michigan



QUOTE (pvonkaenel @ Mar 19 2009, 06:13 AM) *
Could someone please point me to where I can learn more about GPU intrinsics?

Appendix B of the CUDA programming guide.
 
Sarnath
post Mar 19 2009, 01:34 PM
Post #15

QUOTE (SPWorley @ Mar 19 2009, 06:01 PM) *
(re new intrinsics for bit reversal, etc)

Out of curiosity, are these kind of library functions done at some lower level of coding that's more efficient? (Some microcode kind of access?) Or are they just sort of wrappers around the kinds of calls we could theoretically do ourselves in PTX?


I would expect it to be a C macro, probably using some hardware feature to do the bit reversing, because the argument that you pass could be in shared memory, in local memory, a local variable, or anything else, and that should not matter.
That would work only with a macro-like thing...


--------------------
Ignorance Rules; Knowledge Liberates!
 
YDD
post Mar 19 2009, 01:38 PM
Post #16



*****

Group: Members
Posts: 177
Joined: 4-September 07
From: Boston, MA
Member No.: 68,682



Does zero-copy get support on the new MacBooks? I'd like to make a business case for spending someone else's money, but I can't find official word on which motherboard the MacBook uses (merely plenty of rumours).
 
tmurray
post Mar 19 2009, 04:26 PM
Post #17

QUOTE (YDD @ Mar 19 2009, 06:38 AM) *
Does zero-copy get support on the new MacBooks? I'd like to make a business case for spending someone else's money, but I can't find official word on which motherboard the MacBook uses (merely plenty of rumours).

The new MacBooks and MacBook Pros support zero-copy.

I don't know that there's anything magic about how we're doing these intrinsics--I think the answer is probably not. They're really just there for convenience.
 
YDD
post Mar 19 2009, 05:22 PM
Post #18

QUOTE (tmurray @ Mar 19 2009, 12:26 PM) *
The new MacBooks and MacBook Pros support zero-copy.

On both GPUs for the MacBook Pro? It could be very interesting to investigate the effect of the PCIe bus on transfer latency & bandwidth.
 
tmurray
post Mar 19 2009, 05:44 PM
Post #19

QUOTE (YDD @ Mar 19 2009, 10:22 AM) *
On both GPUs for the MacBook Pro? It could be very interesting to investigate the effect of the PCIe bus on transfer latency & bandwidth.

The 9400M supports zero-copy (and copy elimination), the 9600M supports neither.
 
tmurray
post Mar 19 2009, 06:07 PM
Post #20

Even more new features:
QUOTE
New math library functions in 2.2 (in response to user requests):

erfinvf() single-precision inverse error function
erfcinvf() single-precision inverse complementary error function
erfinv() double-precision inverse error function
erfcinv() double-precision inverse complementary error function
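For anyone who wants to sanity-check values from these functions on the host, here is a rough C++ sketch that inverts `std::erf` with Newton's method. It is a hypothetical reference helper, not how the CUDA math library computes its results: `erfinv_newton` and `erfcinv_newton` are made-up names, and accuracy is only reasonable away from the ends of the domain.

```cpp
#include <cassert>
#include <cmath>

// Solve erf(x) = y for x by Newton iteration, using
// d/dx erf(x) = (2 / sqrt(pi)) * exp(-x^2).
inline double erfinv_newton(double y) {
    assert(y > -1.0 && y < 1.0);             // the inverse is finite only on (-1, 1)
    if (y < 0.0) return -erfinv_newton(-y);  // erf is odd, so its inverse is odd too
    const double two_over_sqrt_pi = 1.1283791670955126;
    double x = 0.0;
    for (int i = 0; i < 100; ++i) {
        const double err = std::erf(x) - y;
        const double step = err / (two_over_sqrt_pi * std::exp(-x * x));
        x -= step;
        if (std::fabs(step) < 1e-15) break;  // converged
    }
    return x;
}

// erfc(x) = 1 - erf(x), so erfcinv(y) = erfinv(1 - y) for y in (0, 2).
// Forming 1 - y loses precision when y is very close to 0.
inline double erfcinv_newton(double y) {
    return erfinv_newton(1.0 - y);
}
```

A round trip such as `std::erf(erfinv_newton(0.5))` should come back to 0.5 to within rounding error, which makes a convenient cross-check against device results.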