> CUDA Toolkit 3.0 beta released, now with public downloads
tmurray
post Nov 5 2009, 11:06 PM
Post #1




Group: Moderators
Posts: 2,619
Joined: 3-June 08
From: Santa Clara, CA
Member No.: 106,363
Club SLI Member: No
Org.: NVIDIA



The CUDA Toolkit 3.0 Beta is now available.

Highlights for this release include:
  • CUDA Driver / Runtime Buffer Interoperability, which allows applications using the CUDA Driver API to also use libraries implemented using the CUDA C Runtime.
  • A new, separate version of the CUDA C Runtime (CUDART) for debugging in emulation-mode.
  • C++ Class Inheritance and Template Inheritance support for increased programmer productivity
  • A new unified interoperability API for Direct3D and OpenGL, with support for:
    • OpenGL texture interop
    • Direct3D 11 interop support
  • cuda-gdb hardware debugging support for applications that use the CUDA Driver API
  • New CUDA Memory Checker reports misalignment and out of bounds errors, available as a debugging mode within cuda-gdb and also as a stand-alone utility.
  • CUDA Toolkit libraries are now versioned, enabling applications to require a specific version, support multiple versions explicitly, etc.
  • CUDA C/C++ kernels are now compiled to standard ELF format
  • Support for all the OpenCL features in the latest R195.39 beta driver:
    • Double Precision
    • OpenGL Interoperability, for interactive high performance visualization
    • Query for Compute Capability, so you can target optimizations for GPU architectures (cl_nv_device_attribute_query)
    • Ability to control compiler optimization settings, etc. via support for NVIDIA Compiler Flags (cl_nv_compiler_options)
    • OpenCL Images support, for better/faster image filtering
    • 32-bit Atomics for fast, convenient data manipulation
    • Byte Addressable Stores, for faster video/image processing and compression algorithms
    • Support for the latest OpenCL spec revision 48 and latest official Khronos OpenCL headers as of 11/1/2009
  • Early support for the Fermi architecture, including:
    • Native 64-bit GPU support
    • Multiple Copy Engine support
    • ECC reporting
    • Concurrent Kernel Execution
    • Fermi HW debugging support in cuda-gdb
For more information on general purpose computing features of the Fermi architecture, see: www.nvidia.com/fermi.

Windows developers should be sure to sign up for the Nexus (codename) beta program, and test drive the integrated support for GPU hardware debugging, profiling, and platform trace/analysis features at: www.nvidia.com/nexus

Please review the errata document for important notes about using this beta release.

Special Notice for MacOS X Developers
  • Use the cudadriver_3.0.0-beta1_macos.pkg driver with all NVIDIA GPUs except Quadro FX 4800 and GeForce GTX 285 on Mac OS X 10.5.6 and later (pre-Snow Leopard).
  • Use the cudadriver_3.0.1-beta1_macos.pkg driver with Quadro FX 4800 and GeForce GTX 285 on Mac OS X 10.5.6 and later (pre-Snow Leopard).
  • Use the cudadriver_3.0.1-beta1_macos.pkg driver with all NVIDIA GPUs on Mac OS X 10.6 Snow Leopard and later.


Downloads
Getting Started - Linux
Getting Started - OS X
Getting Started - Windows

XP32 195.39
XP64 195.39
Vista/Win7 32 195.39
Vista/Win7 64 195.39

Notebook XP32 195.39
Notebook XP64 195.39
Notebook Vista/Win7 32 195.39
Notebook Vista/Win7 64 195.39

Linux 32 195.17
Linux 64 195.17

3.0.0 for Non-GT200 Leopard
3.0.1 for GT200 Leopard and Snow Leopard

CUDA Toolkit for Fedora 10 32-bit
CUDA Toolkit for RHEL 4.8 32-bit
CUDA Toolkit for RHEL 5.3 32-bit
CUDA Toolkit for SLED 11.0 32-bit
CUDA Toolkit for SuSE 11.1 32-bit
CUDA Toolkit for Ubuntu 9.04 32-bit

CUDA Toolkit for Fedora 10 64-bit
CUDA Toolkit for RHEL 4.8 64-bit
CUDA Toolkit for RHEL 5.3 64-bit
CUDA Toolkit for SLED 11.0 64-bit
CUDA Toolkit for SuSE 11.1 64-bit
CUDA Toolkit for Ubuntu 9.04 64-bit

CUDA Toolkit for OS X

CUDA Toolkit for Windows 32-bit
CUDA Toolkit for Windows 64-bit

CUDA Profiler 3.0 Beta Readme
CUDA Profiler 3.0 Beta Release Notes for Linux
CUDA Profiler 3.0 Beta Release Notes for OS X
CUDA Profiler 3.0 Beta Release Notes for Windows
CUDA Toolkit EULA
CUDA-GDB Readme
CUDA-GDB User Manual
CUDA Reference Manual
CUDA Toolkit Release Notes for Linux
CUDA Toolkit Release Notes for OS X
CUDA Toolkit Release Notes for Windows
CUDA Programming Guide
CUDA Best Practices Guide
Online Documentation

GPU Computing SDK for Linux
GPU Computing SDK for OS X
GPU Computing SDK for Win32
GPU Computing SDK for Win64

CUDA SDK Release Notes
DirectCompute Release Notes
OpenCL Release Notes
GPU Computing EULA
tmurray
post Nov 5 2009, 11:09 PM
Post #2




Group: Moderators
Posts: 2,619
Joined: 3-June 08
From: Santa Clara, CA
Member No.: 106,363
Club SLI Member: No
Org.: NVIDIA



Documentation updates:

Fermi Compatibility Guide
Fermi Tuning Guide
Preview: CUDA Programming Guide for CUDA Toolkit 3.0
GregR
post Nov 6 2009, 01:17 AM
Post #3




Group: Members
Posts: 17
Joined: 5-March 07
Member No.: 43,819



Is it possible to have the "CUDA 3 Beta Programming Guide" available for separate download?

This would provide a way for me to learn more about the release without downloading/installing the entire SDK.
sergeyn
post Nov 6 2009, 01:45 AM
Post #4




Group: Members
Posts: 90
Joined: 16-May 09
Member No.: 155,172
Org.: funcom



It just bluescreened when I tried to run my code under the profiler. I have a minidump if anyone from NVIDIA would like to take a look.
SPWorley
post Nov 6 2009, 03:41 AM
Post #5




Group: Members
Posts: 1,210
Joined: 13-June 08
From: California USA
Member No.: 107,688



QUOTE (GregR @ Nov 5 2009, 06:17 PM) *
Is it possible to have the "CUDA 3 Beta Programming Guide" available for separate download?

This would provide a way for me to learn more about the release without downloading/installing the entire SDK.



The toolkit beta teases you with new docs labelled "CUDA 3 Beta Programming Guide" but they are just the 2.3 docs.
That's the first thing I wanted to look at!
The best practices guide, nvcc docs, and PTX spec are also all 2.3 versions.

I checked both Linux and Windows.

There are new 3.0 CUBLAS and CUFFT beta docs though.

Tim, are we allowed to openly discuss the 3.0 toolkit beta here? The rules were relaxed for 2.3 beta and that was nice to allow forum discussion.
There are some promising new features in 3.0 even ignoring the Fermi support!
tmurray
post Nov 6 2009, 07:17 AM
Post #6




Group: Moderators
Posts: 2,619
Joined: 3-June 08
From: Santa Clara, CA
Member No.: 106,363
Club SLI Member: No
Org.: NVIDIA



yeah, feel free to discuss as per usual.

I'll investigate the documentation packaging issue tomorrow...
acano
post Nov 6 2009, 10:19 AM
Post #7




Group: Members
Posts: 3
Joined: 18-April 09
Member No.: 151,009



Nice! But my code now runs 3 times slower than with the 2.3 SDK! The generated PTX code is the same, so is it the driver? Is it the toolkit? I know this is a beta release.
apaehler
post Nov 6 2009, 01:22 PM
Post #8




Group: Members
Posts: 89
Joined: 6-September 07
From: Berlin, Germany
Member No.: 68,914



I downloaded the driver, toolkit, and SDK last night (openSuSE 11.1, 64-bit), compiled everything, and ran some first tests. nbody and my own test examples (sgemm/dgemm and others) run just as fast as before. The setup is 2x GTX 260. I have one code which runs on dual GPUs using data partitioning and QThreads (from PyQt/Qt), and it does fine. What I did notice, since my dual-GPU code reports both the total time spent in the kernel and the elapsed time for the whole thread (data is already on the GPU), is that the CUDA runtime initialization part is much faster. It used to add an extra 0.6 to 1.5 seconds(!) on top of the pure kernel run time. The two are now almost the same. This is what it looked like before:
CODE
+++< GPU count: 2 GPUs >+++

All GPUs in parallel - one thread per GPU
-----------------------------------------

GPU-0 : GeForce GTX 260 (1.3) - sum : 0.689 sec [404.1 GFlops]
GPU-1 : GeForce GTX 260 (1.3) - sum : 0.692 sec [402.8 GFlops]
GPU-0 : GeForce GTX 260 (1.3) - run : 2.105 sec [132.3 GFlops]
GPU-1 : GeForce GTX 260 (1.3) - run : 2.108 sec [132.1 GFlops]
All GPUs - run : 2.185 sec [254.9 GFlops]
Last two elements GPU: 1.019e-03 3.220e-04


+++< CPU count: 2 CPUs >+++

All CPUs in parallel - one thread per CPU
-----------------------------------------

Thread-0 - sum : 708.240 sec [ 3.4 GFlops]
Thread-0 - run : 708.271 sec [ 3.4 GFlops]
Thread-1 - sum : 717.896 sec [ 3.4 GFlops]
Thread-1 - run : 717.946 sec [ 3.4 GFlops]
All CPU - run : 717.947 sec [ 6.8 GFlops]
Last two elements CPU: 1.019e-03 3.220e-04


Speedup: 328.5


max error at : 3929812 12 49 98 6.52e-01 6.53e-01 2.93e-04

L2 rel error : 2.5e-04
max rel error : 4.5e-04


The speedups are based on times, not flops values. Flops are defined differently on the CPU and GPU because the computation involves trig functions (one trig call is counted as 4 flops on the GPU and 140 flops on the CPU). The output is now (no CPU part, as I do not want to wait 12 minutes for the CPU):

CODE
+++< GPU count: 2 GPUs >+++

All GPUs in parallel - one thread per GPU
-----------------------------------------

GPU-0 : GeForce GTX 260 (1.3) - sum : 0.698 sec [399.2 GFlops]
GPU-0 : GeForce GTX 260 (1.3) - run : 0.841 sec [331.2 GFlops]
GPU-1 : GeForce GTX 260 (1.3) - sum : 0.700 sec [397.8 GFlops]
GPU-1 : GeForce GTX 260 (1.3) - run : 0.844 sec [329.9 GFlops]
All GPUs - run : 1.026 sec [543.1 GFlops]
Last two elements GPU: 1.019e-03 3.220e-04
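
The two runs above can be compared with a few lines of arithmetic. The sketch below is only an illustration: it copies the GPU-0 figures from the listings, treats "run minus sum" as the per-thread overhead outside the kernel, and recomputes the time-based speedup from the rounded times, so the last digit can differ slightly from the 328.5 printed above. The function and variable names are made up for this example.

```python
# Figures copied from the two listings above (GPU-0, seconds).
OLD = {"sum": 0.689, "run": 2.105}   # first listing (before)
NEW = {"sum": 0.698, "run": 0.841}   # second listing (3.0 beta)

def overhead(timing):
    """Time spent in the thread outside the kernel (run minus sum)."""
    return timing["run"] - timing["sum"]

def time_speedup(cpu_seconds, gpu_seconds):
    """Speedup based on elapsed times, not on flops values."""
    return cpu_seconds / gpu_seconds

old_overhead = overhead(OLD)
new_overhead = overhead(NEW)
# "All CPU - run" divided by "All GPUs - run" from the first listing.
speedup = time_speedup(717.947, 2.185)

if __name__ == "__main__":
    print(f"overhead before: {old_overhead:.3f} s")
    print(f"overhead now:    {new_overhead:.3f} s")
    print(f"time-based speedup: {speedup:.1f}")
```

The overhead drops from roughly 1.4 seconds to roughly 0.14 seconds per thread, which is the change described above.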


Now that OpenCL is included with the driver, I have an issue with it. Last week I wrote ctypes-based Python bindings for OpenCL, like I did for CUDA itself previously (python-cuda). With driver 190.29 a device query gave me

CODE
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS : 3
CL_DEVICE_MAX_WORK_ITEM_SIZES : 512 512 64


I now get (I changed the format to hex to see what's going on):

CODE
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS : 0x3
CL_DEVICE_MAX_WORK_ITEM_SIZES : 0x200 0x7F6800000200 0x7F6800000040


CL_DEVICE_MAX_WORK_ITEM_SIZES is queried by passing a pointer to (size_t*3), i.e. a ctypes array of three 64-bit longs. The first element is correct; the second and third contain "garbage" (which looks like a device address?) in the high-order bits and the correct value in the low 32 bits. size_t is 64-bit, and passing a pointer to an array of three 32-bit ints just gives all zeros. So 64-bit, as per the OpenCL spec, should be correct.
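
To make the symptom concrete, here is a small sketch that does not call OpenCL at all. It just loads the three values reported above into a ctypes array of size_t and masks off everything but the low 32 bits. It assumes a 64-bit platform, the helper name `low32` is made up for this example, and the masking is only a way to inspect the values, not a fix for the underlying problem.

```python
import ctypes

# Values as printed by the device query above (driver 195.17).
reported = (ctypes.c_size_t * 3)(0x200, 0x7F6800000200, 0x7F6800000040)

def low32(values):
    """Keep only the low 32 bits of each element."""
    return [v & 0xFFFFFFFF for v in values]

sizes = low32(reported)

if __name__ == "__main__":
    print("size_t bytes:", ctypes.sizeof(ctypes.c_size_t))
    print("raw:   ", [hex(v) for v in reported])
    print("masked:", sizes)
```

The masked values are 512, 512 and 64, the same sizes the 190.29 driver reported.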

What happened to OpenCL between 190.29 and 195.17? In addition, I also get a double free error thrown by glibc on one machine and a segmentation fault on another, after the code has run successfully. Just to be clear: I am calling OpenCL from Python. The vector_add example and a simple bandwidth test work just fine, and the rest of the data returned by the device query is basically correct.

The segmentation fault was apparently caused by a bug in my code; after a rewrite it went away. However, the MAX_WORK_ITEM_SIZES issue still exists. The C++ SDK sample gives the correct result, and Python::OpenCL (I don't recall its exact name right now), which is based on Cython, also gives a correct answer. PyOpenCL, based on boost.python, also gets one value wrong (the last dimension). Despite cl_khr_fp64 being available, both my ctypes OpenCL bindings and PyOpenCL report the preferred vector width for double as 0, while the SDK reports 1.

As far as speed is concerned, I see no slowdown with 3.0b1, tested on a 9650M GT, dual GTX 260 and dual GTX 280. I do notice that X11 windows seem to pop up significantly faster than with the previous driver version.

This post has been edited by apaehler: Nov 7 2009, 06:56 PM
Tobi_W
post Nov 6 2009, 03:49 PM
Post #9




Group: Members
Posts: 77
Joined: 16-February 09
From: Germany
Member No.: 141,107



I wonder why the 'sdk' folder within the CUDA-SDK is an exact copy of the SDK itself!? So every example (cuda and opencl), library, etc. exists twice.

Edit:

I get 1403 "unused parameter X" warnings from nvcc when compiling my programs with "--compiler-options -Wall,-Wextra" in the following header files:
  • device_functions.h
  • sm_11_atomic_functions.h
  • sm_12_atomic_functions.h
  • sm_13_double_functions.h
  • sm_20_atomic_functions.h
  • sm_20_intrinsics.h
  • surface_functions.h
  • texture_fetch_functions.h

There are a lot of "declared 'static' but never defined" warnings, too. I'm using gcc 4.3.2 on a 64 bit linux machine. In spite of the warnings everything seems to work fine, but in fact a little bit slower than with 2.3.

This post has been edited by Tobi_W: Nov 6 2009, 04:45 PM
tmurray
post Nov 7 2009, 12:25 AM
Post #10




Group: Moderators
Posts: 2,619
Joined: 3-June 08
From: Santa Clara, CA
Member No.: 106,363
Club SLI Member: No
Org.: NVIDIA



QUOTE (acano @ Nov 6 2009, 02:19 AM) *
Nice! But my code now runs 3 times slower than with the 2.3 SDK! The generated PTX code is the same, so is it the driver? Is it the toolkit? I know this is a beta release.

Would love a repro case...
acano
post Nov 7 2009, 11:03 AM
Post #11




Group: Members
Posts: 3
Joined: 18-April 09
Member No.: 151,009



I've tried the 190.42 and 195.17 beta drivers on Ubuntu 9.10 64-bit using CUDA SDK 2.3 and 3.0 beta with gcc 4.3.
I'm using two GTX 285 devices, and my code is set to use both (SLI is OFF). I also use a third card (8400 GS) for display (not for CUDA).

190.42 + SDK 2.3 = 13 seconds
195.17 + SDK 2.3 || 195.17 + SDK 3.0 beta = 34 seconds !

I've checked that the 8400 is not being used at any time.

So I suppose it's a driver problem (I really hope).
However, the nbody demo performs better with 195.17 + SDK 3.0 (up to 500 GFLOPs), but smokeparticles performance is also messed up :(
aplyer
post Nov 9 2009, 09:34 AM
Post #12




Group: Members
Posts: 16
Joined: 8-December 08
Member No.: 129,480
Org.: Onera



I have gcc 4.4.
With the 2.3 CUDA SDK/toolkit I use the '--compiler-bindir' option to choose gcc-4.3.
With nvcc in 3.0 beta, this option is apparently parsed incorrectly:
with "--compiler-bindir=/usr/bin/gcc-4.3"
I get an unsupported-compiler error naming '/usr/bin/gcc-4'.

My workaround (a hack) is to remove all the /usr/bin/{gcc,g++,cpp, ... } links that point to 4.4 and make links to 4.3 instead.
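
A less invasive variant of the same hack, sketched in Python: rather than relinking /usr/bin, put gcc and g++ links into a private directory and point --compiler-bindir at that directory. This assumes nvcc accepts a directory for that option and that the versioned binaries live at paths like /usr/bin/gcc-4.3; the function name and the directory are invented for this example. As an aside, the truncated name in the error is what you would get from stripping a file extension (os.path.splitext("/usr/bin/gcc-4.3") returns ("/usr/bin/gcc-4", ".3")), though that is only a guess about what nvcc does.

```python
import os

def make_compiler_dir(link_dir, gcc_path, gxx_path):
    """Create link_dir containing 'gcc' and 'g++' symlinks to the given binaries."""
    os.makedirs(link_dir, exist_ok=True)
    for name, target in (("gcc", gcc_path), ("g++", gxx_path)):
        link = os.path.join(link_dir, name)
        if os.path.islink(link):
            os.remove(link)          # replace a stale link from an earlier run
        os.symlink(target, link)
    return link_dir

if __name__ == "__main__":
    # Hypothetical paths; adjust to wherever gcc 4.3 is installed.
    bindir = make_compiler_dir(os.path.expanduser("~/cuda-gcc43"),
                               "/usr/bin/gcc-4.3", "/usr/bin/g++-4.3")
    print("nvcc --compiler-bindir", bindir, "...")
```

The directory name contains no dot, so it sidesteps whatever is cutting off the ".3", and the system-wide gcc links stay untouched.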
wanderine
post Nov 10 2009, 03:50 PM
Post #13




Group: Members
Posts: 201
Joined: 19-December 08
Member No.: 131,297



How do I become a registered developer?
Simon Green
post Nov 10 2009, 06:04 PM
Post #14




Group: NVIDIA Employees
Posts: 858
Joined: 17-November 04
From: London, England
Member No.: 243
Org.: NVIDIA Developer Technologies



Sign up as a "GPU Computing Developer" here:
http://developer.nvidia.com/page/registere...er_program.html
profquail
post Nov 11 2009, 12:51 PM
Post #15




Group: Members
Posts: 732
Joined: 14-August 08
From: Cambridge, United Kingdom
Member No.: 115,518
Org.: TidePowerd, Ltd.



As reported in this thread, I was having some problems with CUDA and Windows 7:

http://forums.nvidia.com/index.php?showtopic=101930

I installed the new 3.0-beta drivers, toolkit and SDK and tried running some of the examples, and I'm still having the same problem (kernels take several seconds before executing, and the entire system freezes during that time).

I went into the Nvidia Control Panel and disabled my 2nd, 3rd, and 4th monitors and enabled the multi-GPU acceleration; now the examples run just fine, but when I run the deviceQueryDrv example, it only shows a single device. Since I'm not running displays on the other 3 GPUs (I have 2x GTX 295's), why don't they show up? Also, the device query on the device that does show up says that there is no time limit on kernel execution.

EDIT: Does anyone know if the PTX version will increase to version 1.5 for this release of the CUDA driver? The 3.0-beta toolkit includes the PTX 1.4 specification.

This post has been edited by profquail: Nov 11 2009, 12:57 PM
theMarix
post Nov 11 2009, 01:16 PM
Post #16




Group: Members
Posts: 161
Joined: 14-February 08
From: Heidelberg
Member No.: 92,485
Org.: Frankfurt Institute for Advanced Studies



QUOTE (Tobi_W @ Nov 6 2009, 04:49 PM) *
I get 1403 "unused parameter X" warnings from nvcc when compiling my programs with "--compiler-options -Wall,-Wextra" in the following header files:


In my experience those are caused by gcc when it is invoked from nvcc. You should be able to silence them by telling gcc that those are system directories, so that it does not warn about problems within those files. (BTW, this is a lot easier if you use CMake to drive the compilation.)
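
One way to read that suggestion, as a sketch only: pass gcc's -isystem flag through nvcc's --compiler-options so the CUDA headers count as system headers, for which gcc suppresses most warnings. The include path /usr/local/cuda/include and the helper name `nvcc_command` are assumptions, and whether this actually silences the warnings with the 3.0 beta nvcc is untested.

```python
def nvcc_command(source, cuda_include="/usr/local/cuda/include",
                 warnings=("-Wall", "-Wextra")):
    """Build an nvcc argument list that marks cuda_include as a system directory."""
    # nvcc splits the --compiler-options value on commas and forwards each
    # piece to the host compiler, so "-isystem,<dir>" reaches gcc as two arguments.
    host_flags = list(warnings) + ["-isystem", cuda_include]
    return ["nvcc", "--compiler-options", ",".join(host_flags), "-c", source]

if __name__ == "__main__":
    print(" ".join(nvcc_command("kernel.cu")))
```

The returned list can be handed to subprocess or pasted into a Makefile rule; only the include path needs to match the local install.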
theMarix
post Nov 11 2009, 01:24 PM
Post #17




Group: Members
Posts: 161
Joined: 14-February 08
From: Heidelberg
Member No.: 92,485
Org.: Frankfurt Institute for Advanced Studies



I noticed there is now a --multicore (and even --multicore-llvm) switch in the compiler; however, the headers disable compilation for all compilers but MSVC if this switch is used. Is multicore support on Linux planned for 3.0 final?
soeren87
post Nov 12 2009, 10:31 AM
Post #18




Group: Members
Posts: 9
Joined: 6-November 09
From: Bremen, Germany
Member No.: 244,381
Club SLI Member: No



Are there any problems with gcc 4.4 (from Ubuntu 9.10) and CUDA SDK 3.0 beta?

If not, I would like to register as a developer. Can I do this as a hobby programmer just for simple tests?
cbuchner1
post Nov 12 2009, 10:35 AM
Post #19




Group: Members
Posts: 982
Joined: 4-April 06
From: Munich, Germany
Member No.: 18,632
Org.: Nomor Research GmbH



QUOTE (soeren87 @ Nov 12 2009, 11:31 AM) *
Are there any problems with gcc 4.4 (from Ubuntu 9.10) and CUDA SDK 3.0 beta?

If not, I would like to register as a developer. Can I do this as a hobby programmer just for simple tests?


They do ask for company size, job position, area of work and such things. You can always specify a company size of "1" ;)

Christian
soeren87
post Nov 12 2009, 10:45 AM
Post #20




Group: Members
Posts: 9
Joined: 6-November 09
From: Bremen, Germany
Member No.: 244,381
Club SLI Member: No



this is no way to get beta feedback.

I do not want to report all these things just to test a beta version
