Jan 13 2009, 11:36 PM | Post #1
Group: NVIDIA Employees | Posts: 2,065 | Joined: 3-June 08 | From: Santa Clara, CA | Member No.: 106,363 | Org.: NVIDIA
As part of my continuing effort to make more of my internal tools for system testing available to you guys, here's a burn-in test I wrote for GT200-based systems. It performs DGEMMs on every capable device simultaneously until device memory is filled and will repeat if you want. It also checks the results of each individual DGEMM to help you track down general stability problems. Time to completion varies widely with options, so feel free to take a look.
It requires CUDA 2.1, because it uses the ability to poll for an active watchdog timer (you can guess who the major proponent of this was). Like most of what I do, it's Linux only for the moment, although I'm in the process of porting it to Windows. Compile with:
nvcc -o dgemmSweep -arch sm_13 dgemmSweep.cu -lcublas
Feedback is welcome.

Jan 13 2009, 11:36 PM | Post #2
Group: NVIDIA Employees | Posts: 2,065 | Joined: 3-June 08 | From: Santa Clara, CA | Member No.: 106,363 | Org.: NVIDIA
Stealing this post again for a changelog:
1.0: initial release, Linux only.
1.1: still Linux only; fixed a stupid bug with launching threads on mixed-GPU machines.

Jan 15 2009, 06:18 AM | Post #3
Group: Members | Posts: 21 | Joined: 29-September 08 | Member No.: 120,599 | Org.: Four Pi Solutions Inc.
Thanks for another useful tool.
Unfortunately, I am having trouble getting it compiled on a fresh Ubuntu 8.04 / CUDA 2.1 install with GTX 280 hardware:
CODE
hpc-user@gpu-hpc:~$ nvcc -o dgemmSweep -arch sm_13 dgemmSweep.cu -lcublas
dgemmSweep.cu(196): error: class "cudaDeviceProp" has no member "kernelExecTimeoutEnabled"
1 error detected in the compilation of "/tmp/tmpxft_000012fc_00000000-4_dgemmSweep.cpp1.ii".

Any hints on what the problem is?

Jan 15 2009, 08:04 AM | Post #4
Group: NVIDIA Employees | Posts: 2,065 | Joined: 3-June 08 | From: Santa Clara, CA | Member No.: 106,363 | Org.: NVIDIA
Are you sure that's 2.1 final and not 2.1 beta? It has to be 2.1 final.

Jan 15 2009, 02:24 PM | Post #5
Group: Members | Posts: 21 | Joined: 29-September 08 | Member No.: 120,599 | Org.: Four Pi Solutions Inc.
QUOTE
Are you sure that's 2.1 final and not 2.1 beta? It has to be 2.1 final.

Yes, it is 2.1 beta. Is 2.1 final available to the general public for Debian/Ubuntu? I would appreciate a link if possible.
Also, will these tools (dgemm burn-in, concBandwidthTest, ...) be making an appearance in the toolkit? I think they would be great additions.
Thanks

Jan 15 2009, 05:21 PM | Post #6
Group: NVIDIA Employees | Posts: 2,065 | Joined: 3-June 08 | From: Santa Clara, CA | Member No.: 106,363 | Org.: NVIDIA
QUOTE
Yes, it is 2.1 beta. Is 2.1 final available to the general public for Debian/Ubuntu? I would appreciate a link if possible.
Also, will these tools (dgemm burn-in, concBandwidthTest, ...) be making an appearance in the toolkit? I think they would be great additions.
Thanks

2.1 final is out (STILL probably not on the website, but check the CUDA announcements forum for a link). These tools will eventually be included somewhere; I'm just trying to figure out the right place for them.

Jan 15 2009, 06:25 PM | Post #7
Group: Members | Posts: 21 | Joined: 29-September 08 | Member No.: 120,599 | Org.: Four Pi Solutions Inc.
Found the new driver and toolkit (180.22). Compilation goes without issue now.
Thanks.
This post has been edited by ldpaniak: Jan 15 2009, 06:28 PM

Jan 16 2009, 05:35 AM | Post #8
Group: Members | Posts: 792 | Joined: 13-June 08 | From: California USA | Member No.: 107,688
Tim, excellent tool!
I had thought about making a burn-in test myself, but I am very lazy and never did anything.
Do you think DGEMM has good cascading behavior, so that one small error in memory or compute gets magnified enough to make the error obvious? I thought I might use an FFT as a basis, since a single-sample error would create a delta function on input, which propagates to all frequencies of the FFT. (Hmm, but that wouldn't magnify the magnitude of the error; ideally there would be some feedback that makes it grow.)
Big extra points to anyone who whips up a script to iterate over various memory and shader clocks and uses this test to make a Shmoo plot of your card's stability regions.

Feb 10 2009, 01:24 AM | Post #9
Group: NVIDIA Employees | Posts: 2,065 | Joined: 3-June 08 | From: Santa Clara, CA | Member No.: 106,363 | Org.: NVIDIA
Bump: an updated version that isn't stupid about launching threads on mixed-GPU machines.

Feb 10 2009, 04:32 AM | Post #10
Group: Members | Posts: 21 | Joined: 29-September 08 | Member No.: 120,599 | Org.: Four Pi Solutions Inc.
Hi,
The new script does not see one of the three capable devices on the system (a third GTX 280):
CODE
hpc-user@gpu-hpc:~$ ./dgemmSweep11 1
Testing device 1: GeForce GTX 280
Testing device 2: GeForce GTX 280
device = 0
device = 0
iterSize = 5952
Device 1: i = 128
...

Feb 10 2009, 07:51 AM | Post #11
Group: NVIDIA Employees | Posts: 2,065 | Joined: 3-June 08 | From: Santa Clara, CA | Member No.: 106,363 | Org.: NVIDIA
Are you using it for display? If so, it's not capable.

Feb 10 2009, 02:49 PM | Post #12
Group: Extranet Users | Posts: 754 | Joined: 14-February 07 | Member No.: 40,832 | Org.: NVIDIA Corp.
QUOTE
Hi,
The new script does not see one of the three capable devices on the system (a third GTX 280):
CODE
hpc-user@gpu-hpc:~$ ./dgemmSweep11 1
Testing device 1: GeForce GTX 280
Testing device 2: GeForce GTX 280
device = 0
device = 0
iterSize = 5952
Device 1: i = 128
...

Does deviceQuery from the SDK see all 3? Which driver are you using?

Feb 10 2009, 05:12 PM | Post #13
Group: NVIDIA Employees | Posts: 2,065 | Joined: 3-June 08 | From: Santa Clara, CA | Member No.: 106,363 | Org.: NVIDIA
A bit of clarification, because I think I made netllama all worried:
dgemmSweep will not use cards that have a watchdog timer enabled, because large DGEMMs will trigger the watchdog.

Feb 11 2009, 12:18 AM | Post #14
Group: Members | Posts: 21 | Joined: 29-September 08 | Member No.: 120,599 | Org.: Four Pi Solutions Inc.
CODE
hpc-user@gpu-hpc:~$ deviceQuery
There are 3 devices supporting CUDA
...

Driver is 180.22 for CUDA 2.1 on 64-bit Linux (Ubuntu 8.04.2). No attached monitor. The system runs HOOMD very well on all three GPUs.

Feb 11 2009, 12:45 AM | Post #15
Group: NVIDIA Employees | Posts: 2,065 | Joined: 3-June 08 | From: Santa Clara, CA | Member No.: 106,363 | Org.: NVIDIA
Is it booting into GDM?

Feb 11 2009, 02:20 AM | Post #16
Group: Members | Posts: 21 | Joined: 29-September 08 | Member No.: 120,599 | Org.: Four Pi Solutions Inc.

Feb 11 2009, 07:45 AM | Post #17
Group: NVIDIA Employees | Posts: 2,065 | Joined: 3-June 08 | From: Santa Clara, CA | Member No.: 106,363 | Org.: NVIDIA
So it's running X on one card, which therefore has a watchdog timer enabled, meaning that card won't be used by dgemmSweep.

Feb 11 2009, 01:30 PM | Post #18
Group: Members | Posts: 21 | Joined: 29-September 08 | Member No.: 120,599 | Org.: Four Pi Solutions Inc.
QUOTE
so it's running X on one card and therefore has a watchdog timer enabled, meaning it won't be used by this

This begs the question: Is there a way to install CUDA on Linux without an X installation on the system? The NVIDIA driver installer insists on it by default. Is there a switch to override that? There is often no reason for a headless compute server to run X.

Feb 11 2009, 01:41 PM | Post #19
Group: Members | Posts: 1,425 | Joined: 22-February 07 | Member No.: 42,046 | Org.: Los Alamos National Laboratory
QUOTE
This begs the question: Is there a way to install CUDA in Linux without an X installation on the system? The nvidia driver installer insists on it by default. Is there a switch to override? There is often no reason for a headless compute server to run X.

Change the default runlevel in /etc/inittab from 5 to 3. Then xdm won't start. Since X also creates the /dev/nvidia* devices for you, you'll have to use the script in the Release Notes to create these device files at boot time.

Feb 11 2009, 02:47 PM | Post #20
Group: Extranet Users | Posts: 2,289 | Joined: 23-March 07 | Member No.: 46,425 | Org.: University of Michigan
QUOTE
The nvidia driver installer insists on it by default.

No, it doesn't. I've installed the stock NVIDIA driver dozens of times on boxes without X installed. It asks if you want to update some OpenGL library, and it doesn't really matter if you say yes or no. The library can be installed even if no one can use it.