Mar 18 2009, 08:31 PM
Post #1
Group: NVIDIA Employees Posts: 2,065 Joined: 3-June 08 From: Santa Clara, CA Member No.: 106,363 Org.: NVIDIA
The CUDA 2.2 beta is available to registered developers--if you want to become a registered developer, sign up here.
A brief overview of CUDA 2.2 beta features:
- Zero-copy support (see this thread for more information)
- Asynchronous memcpy on Vista/Server 2008
- Texturing from pitch-linear memory
- cuda-gdb for 64-bit Linux (it is pretty great)
- OGL interop performance improvements
- CUDA profiler supports a lot more counters on GT200. I think this includes memory bandwidth counters (counters for each transaction size) and instruction count. In other words, you can very easily determine if you're bandwidth limited or compute limited, which makes it far more useful than it used to be.
- CUDA profiler works on Vista
- >4GB of pinned memory in a single allocation (except in Vista, where the limit is still 256MB per allocation, but I think this is going to be raised between now and the final release)
- Blocking sync for all platforms. Whether this made it into the headers for the beta, I'm not entirely sure--I've heard conflicting reports and need to check this afternoon. Basically, it's a context creation flag where instead of spinlocking or spinlocking+yielding when a thread is waiting for the GPU, the thread will sleep and the driver will wake it up when the event has completed. It's not the default mode because you're at the mercy of the OS thread scheduler, which will sometimes increase latency, but if you want to minimize CPU utilization, it's very nice.
- Officially supports Ubuntu 8.10, RHEL 5.3, Fedora 10
There's one last feature that didn't make it in the beta that I think is the best feature in 2.2 (even compared to the dramatically improved profiler, zero-copy and the 64-bit debugger), but I don't want to spoil it...
Edit: Here's the 2.2 beta programming guide.
Edit 2: I am bad at not revealing surprises. There's still a second surprise in the final release for Windows users, though.
Edit 3: Surprise 2: a test version of /MD CUDART. I revealed it because I want feedback on it and whether anyone has objections to moving everything over to /MD going forward.

Mar 18 2009, 09:43 PM
Post #2
Group: NVIDIA Employees Posts: 2,065 Joined: 3-June 08 From: Santa Clara, CA Member No.: 106,363 Org.: NVIDIA
Some more features:
- There are a number of new device functions:
  __brev(), __brevll(): 32-bit and 64-bit bit reversal
  __frcp_r{n,z,u,d}(): single-precision reciprocal with IEEE rounding
  __fsqrt_r{n,z,u,d}(): single-precision square root with IEEE rounding
  __fdiv_r{n,z,u,d}(): single-precision division with IEEE rounding
  __fadd_r{u,d}(): single-precision addition with directed rounding
  __fmul_r{u,d}(): single-precision multiplication with directed rounding
  __threadfence(): I'm not sure if there are docs for this yet--it's kind of hard to explain, so I'm not going to comment too much about it here because I forget what its exact behavior is.
- Context creation flags can now be set in CUDART.
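To make "directed rounding" concrete, here is a minimal host-side sketch in standard C++. It is only an analogy for what __fadd_ru() and __fadd_rd() compute, not how the CUDA intrinsics are implemented. The helper names (add_with_rounding, add_round_up, add_round_down) are made up for this illustration, and the sketch assumes a platform whose <cfenv> defines FE_UPWARD and FE_DOWNWARD.

```cpp
#include <cassert>
#include <cfenv>

// Host-side illustration only: add two floats under a chosen rounding mode.
// `mode` is one of FE_UPWARD, FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO.
inline float add_with_rounding(float a, float b, int mode) {
    const int saved = std::fegetround();  // remember the caller's mode
    std::fesetround(mode);
    // volatile keeps the compiler from folding the addition at compile time
    // or moving it outside the region where the rounding mode is changed.
    volatile float x = a;
    volatile float y = b;
    volatile float sum = x + y;
    std::fesetround(saved);               // restore before returning
    return sum;
}

// Rough host analogues of __fadd_ru() and __fadd_rd() (hypothetical names).
inline float add_round_up(float a, float b) {
    return add_with_rounding(a, b, FE_UPWARD);
}
inline float add_round_down(float a, float b) {
    return add_with_rounding(a, b, FE_DOWNWARD);
}

inline void directed_rounding_demo() {
    // 1 + 1e-10 is not representable as a float, so the two modes disagree.
    assert(add_round_down(1.0f, 1e-10f) == 1.0f);
    assert(add_round_up(1.0f, 1e-10f) > 1.0f);
}
```

Rounding the same sum both up and down gives a guaranteed enclosure of the exact result, which is the usual reason to want these modes (interval arithmetic, for example).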

Mar 18 2009, 09:52 PM
Post #3
Group: Members Posts: 615 Joined: 14-August 08 Member No.: 115,518 Org.: University of Alabama
One other function that might be neat to have would be a byte-order reversal method. Though CUDA only runs on little-endian systems, there are times when certain file types store their information in big-endian format; converting values like integers between endianness on the CPU could be a big bottleneck in those cases, but it is also something the GPU could do in massively parallel fashion. Perhaps versions for 2-byte, 4-byte, and 8-byte values (which would obviously work for both floating-point and integer types).
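A byte-order reversal of this kind can be sketched with plain shifts and masks. The code below is an illustration, not an existing CUDA function: the names byte_swap16/32/64 and the HOST_DEVICE macro are assumptions. The macro only lets the same source build as ordinary C++ or, under nvcc, as host and device code.

```cpp
#include <cassert>
#include <cstdint>

// Expands to CUDA qualifiers under nvcc and to nothing in plain C++.
#ifndef HOST_DEVICE
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif
#endif

// Reverse the byte order of a 2-byte value.
HOST_DEVICE inline std::uint16_t byte_swap16(std::uint16_t v) {
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

// Reverse the byte order of a 4-byte value.
HOST_DEVICE inline std::uint32_t byte_swap32(std::uint32_t v) {
    return (v << 24) | ((v & 0x0000FF00u) << 8) |
           ((v >> 8) & 0x0000FF00u) | (v >> 24);
}

// Reverse the byte order of an 8-byte value by swapping each half
// and exchanging the halves.
HOST_DEVICE inline std::uint64_t byte_swap64(std::uint64_t v) {
    const std::uint64_t lo = byte_swap32(static_cast<std::uint32_t>(v));
    const std::uint64_t hi = byte_swap32(static_cast<std::uint32_t>(v >> 32));
    return (lo << 32) | hi;
}

inline void byte_swap_demo() {
    assert(byte_swap16(0x1234u) == 0x3412u);
    assert(byte_swap32(0x11223344u) == 0x44332211u);
}
```

In a kernel, each thread would apply one of these to its own element. Floating-point values would be handled by reinterpreting their bits as an integer of the same width, swapping, and reinterpreting back.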
Also, will there ever be support for non-nVidia chipsets using the zero-copy methods (even if it's not for another few releases)? As I wrote in one of the other threads, I'm looking at building a new development machine later this year (when PCIe 3.0 and SATA 6Gbps are available), and I'd like to get something that is supported.
This post has been edited by profquail: Mar 18 2009, 09:56 PM

Mar 18 2009, 10:11 PM
Post #4
Group: Members Posts: 577 Joined: 10-August 07 From: Russia, Moscow Member No.: 65,038 Org.: ElcomSoft Co. Ltd.
Are there any limitations on device compute capability to use __brev() and __brevll()?
QUOTE __threadfence(): I'm not sure if there are docs for this yet
Yes, there is something about it in the 2.2 Programming Guide.
--------------------
// everything is reversible

Mar 18 2009, 10:42 PM
Post #5
Group: NVIDIA Employees Posts: 2,065 Joined: 3-June 08 From: Santa Clara, CA Member No.: 106,363 Org.: NVIDIA
Yeah, I haven't looked at the docs for 2.2 yet...
There's no limitation on device capability for those two functions. There's another function left out of the earlier post:
__fmaf_r{n,z,u,d}() // single-precision fused multiply-add with IEEE rounding
These are all done in software, so they're primarily for convenience, not speed.

Mar 18 2009, 11:10 PM
Post #6
Group: Extranet Users Posts: 1,412 Joined: 4-March 08 Member No.: 94,948
Since you are in such a talkative mode (apart from the surprise still in store for us):
The PTX ISA has been raised to 1.4, while the compute capability is still at 1.3. As far as I know, the two were in sync before. Does this mean that next-generation hardware will be compute capability 2.0?
Apart from that, I can't wait to upgrade my dev box from FC8 to FC10 so I can install the 2.2 beta (profiler and debugger, here I come). Anyone have any tips for upgrading 8 -> 10?
This post has been edited by E.D. Riedijk: Mar 18 2009, 11:12 PM
--------------------
greets,
Denis

Mar 18 2009, 11:37 PM
Post #7
Group: NVIDIA Employees Posts: 2,065 Joined: 3-June 08 From: Santa Clara, CA Member No.: 106,363 Org.: NVIDIA
I can't comment on what the future holds, sorry. (Unless, of course, you want to know about CUDA 2.2...)

Mar 19 2009, 05:22 AM
Post #8
Group: Members Posts: 1,567 Joined: 23-November 07 From: Bangalore Member No.: 79,873 Org.: HCL Technologies
-------------------- Ignorance Rules; Knowledge Liberates!

Mar 19 2009, 07:37 AM
Post #9
Group: Members Posts: 792 Joined: 13-June 08 From: California USA Member No.: 107,688
QUOTE __brev(), __brevll() 32-bit and 64-bit bit reversal
Ohh these are awesome. Simple little trivial functions but handy! These can be really useful in FFTs, and also in random number generation and seeding... I even posted to the wishlist thread! Are they supported natively by all hardware? One clock ops?

Mar 19 2009, 07:46 AM
Post #10
Group: Members Posts: 1,567 Joined: 23-November 07 From: Bangalore Member No.: 79,873 Org.: HCL Technologies
QUOTE Are they supported natively by all hardware? One clock ops?
I would suspect 2 clocks if there is only one register involved: one clock to open the gates in parallel to another register, and another to copy that register back normally. If it involves 2 registers, then it will take one clock. Just my crude guesses...
--------------------
Ignorance Rules; Knowledge Liberates!

Mar 19 2009, 07:48 AM
Post #11
Group: NVIDIA Employees Posts: 2,065 Joined: 3-June 08 From: Santa Clara, CA Member No.: 106,363 Org.: NVIDIA
QUOTE Ohh these are awesome. Simple little trivial functions but handy! These can be really useful in FFTs, and also in random number generation and seeding... I even posted to the wishlist thread! Are they supported natively by all hardware? One clock ops?
No, they're done in software, not hardware. Feature requests from the forums, basically.
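For readers wondering what a software bit reversal looks like, the sketch below uses the well-known swap-by-halves technique. It is one common way to do it and is not claimed to be what the CUDA toolkit actually does. The names bit_reverse32/64 and the HOST_DEVICE macro are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>

// Expands to CUDA qualifiers under nvcc and to nothing in plain C++.
#ifndef HOST_DEVICE
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif
#endif

// Reverse the 32 bits of v by swapping progressively larger groups.
HOST_DEVICE inline std::uint32_t bit_reverse32(std::uint32_t v) {
    v = ((v >> 1) & 0x55555555u) | ((v & 0x55555555u) << 1);  // adjacent bits
    v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);  // bit pairs
    v = ((v >> 4) & 0x0F0F0F0Fu) | ((v & 0x0F0F0F0Fu) << 4);  // nibbles
    v = ((v >> 8) & 0x00FF00FFu) | ((v & 0x00FF00FFu) << 8);  // bytes
    return (v >> 16) | (v << 16);                             // half-words
}

// Reverse 64 bits: reverse each 32-bit half, then exchange the halves.
HOST_DEVICE inline std::uint64_t bit_reverse64(std::uint64_t v) {
    const std::uint64_t lo = bit_reverse32(static_cast<std::uint32_t>(v));
    const std::uint64_t hi = bit_reverse32(static_cast<std::uint32_t>(v >> 32));
    return (lo << 32) | hi;
}

inline void bit_reverse_demo() {
    assert(bit_reverse32(0x00000001u) == 0x80000000u);
    assert(bit_reverse64(0x1ull) == 0x8000000000000000ull);
}
```

The 32-bit version costs five shift-and-mask steps, which is consistent with the point that a software routine like this is a convenience rather than a single-clock operation.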

Mar 19 2009, 12:13 PM
Post #12
Group: Members Posts: 32 Joined: 23-February 09 Member No.: 142,319 Org.: Harmonic Inc.
QUOTE Some more features:
- There are a number of new device functions:
  __brev(), __brevll(): 32-bit and 64-bit bit reversal
  __frcp_r{n,z,u,d}(): single-precision reciprocal with IEEE rounding
  __fsqrt_r{n,z,u,d}(): single-precision square root with IEEE rounding
  __fdiv_r{n,z,u,d}(): single-precision division with IEEE rounding
  __fadd_r{u,d}(): single-precision addition with directed rounding
  __fmul_r{u,d}(): single-precision multiplication with directed rounding
  __threadfence(): I'm not sure if there are docs for this yet--it's kind of hard to explain, so I'm not going to comment too much about it here because I forget what its exact behavior is.
- Context creation flags can now be set in CUDART.
Whoa, are these basically GPU intrinsics? I'm very new to CUDA development, and would love more information on calls like these, but must have missed them in the docs. Could someone please point me to where I can learn more about GPU intrinsics?
Thanks,
Peter

Mar 19 2009, 12:31 PM
Post #13
Group: Members Posts: 792 Joined: 13-June 08 From: California USA Member No.: 107,688
(re new intrinsics for bit reversal, etc)
QUOTE No, they're done in software, not hardware. Feature requests from the forums, basically.
Out of curiosity, are these kinds of library functions done at some lower level of coding that's more efficient? (Some microcode kind of access?) Or are they just sort of wrappers around the kinds of calls we could theoretically do ourselves in PTX? There are so many layers of abstraction in any architecture, but in CUDA there are even more than most, and I'm just curious if the layer of abstraction below intrinsic functions is something powerful and promising for future (pleasant) surprises like this.
(BTW, please give those low level hackers a thumbs up from us all..)

Mar 19 2009, 12:45 PM
Post #14
Group: Extranet Users Posts: 2,289 Joined: 23-March 07 Member No.: 46,425 Org.: University of Michigan

Mar 19 2009, 01:34 PM
Post #15
Group: Members Posts: 1,567 Joined: 23-November 07 From: Bangalore Member No.: 79,873 Org.: HCL Technologies
QUOTE (re new intrinsics for bit reversal, etc) Out of curiosity, are these kind of library functions done at some lower level of coding that's more efficient? (Some microcode kind of access?) Or are they just sort of wrappers around the kinds of calls we could theoretically do ourselves in PTX?
I would expect it to be a "C" macro, probably using some "hardware" feature to do the bit reversing, because the argument that you pass could be in shared memory, in local memory, a local variable, or anything else. That should not matter, and it would work only with something like a macro....
--------------------
Ignorance Rules; Knowledge Liberates!

Mar 19 2009, 01:38 PM
Post #16
Group: Members Posts: 177 Joined: 4-September 07 From: Boston, MA Member No.: 68,682
Does zero-copy get support on the new MacBooks? I'd like to make a business case for spending someone else's money, but I can't find official word on which motherboard the MacBook uses (merely plenty of rumours).

Mar 19 2009, 04:26 PM
Post #17
Group: NVIDIA Employees Posts: 2,065 Joined: 3-June 08 From: Santa Clara, CA Member No.: 106,363 Org.: NVIDIA
QUOTE Does zero-copy get support on the new MacBooks? I'd like to make a business case for spending someone else's money, but I can't find official word on which motherboard the MacBook uses (merely plenty of rumours).
The new MacBooks and MacBook Pros support zero-copy.
I don't know that there's anything magic about how we're doing these intrinsics--I think the answer is probably not. They're really just there for convenience.

Mar 19 2009, 05:22 PM
Post #18
Group: Members Posts: 177 Joined: 4-September 07 From: Boston, MA Member No.: 68,682

Mar 19 2009, 05:44 PM
Post #19
Group: NVIDIA Employees Posts: 2,065 Joined: 3-June 08 From: Santa Clara, CA Member No.: 106,363 Org.: NVIDIA

Mar 19 2009, 06:07 PM
Post #20
Group: NVIDIA Employees Posts: 2,065 Joined: 3-June 08 From: Santa Clara, CA Member No.: 106,363 Org.: NVIDIA
Even more new features:
QUOTE New math library functions in 2.2 (in response to user requests):
erfinvf(): single-precision inverse error function
erfcinvf(): single-precision inverse complementary error function
erfinv(): double-precision inverse error function
erfcinv(): double-precision inverse complementary error function
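As a rough illustration of what an inverse error function computes, the sketch below solves erf(x) = y with Newton's method on the host, using std::erf from standard C++. This is not the algorithm used by the CUDA math library, which is not described in this thread. The names erfinv_newton and erfcinv_newton are hypothetical, and the routine is meant for moderate |y| rather than as a production implementation.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: solve erf(x) = y for x with Newton's method.
// Starting at x = 0, the iterates move monotonically toward the root
// because erf is increasing and concave on x > 0 (convex on x < 0).
inline double erfinv_newton(double y) {
    if (y <= -1.0) return -HUGE_VAL;  // erf never reaches -1
    if (y >= 1.0) return HUGE_VAL;    // erf never reaches +1
    const double two_over_sqrt_pi = 1.1283791670955126;  // slope of erf at 0
    double x = 0.0;
    for (int i = 0; i < 200; ++i) {
        const double err = std::erf(x) - y;
        // d/dx erf(x) = (2 / sqrt(pi)) * exp(-x^2)
        const double step = err / (two_over_sqrt_pi * std::exp(-x * x));
        x -= step;
        if (std::fabs(step) <= 1e-15 * (1.0 + std::fabs(x))) break;
    }
    return x;
}

// Inverse complementary error function via erfc(x) = 1 - erf(x).
// Note: forming 1 - y loses precision when y is very small.
inline double erfcinv_newton(double y) {
    return erfinv_newton(1.0 - y);
}

inline void erfinv_demo() {
    const double x = erfinv_newton(0.5);
    assert(std::fabs(std::erf(x) - 0.5) < 1e-12);  // round trip through erf
}
```

A typical use is turning uniform random numbers into normally distributed ones, since the normal quantile function can be written in terms of the inverse error function.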