DGEMM-based burn-in test
tmurray
post Jan 13 2009, 11:36 PM
Post #1



********

Group: NVIDIA Employees
Posts: 2,065
Joined: 3-June 08
From: Santa Clara, CA
Member No.: 106,363
Org.: NVIDIA



As part of my continuing effort to make more of my internal tools for system testing available to you guys, here's a burn-in test I wrote for GT200-based systems. It performs DGEMMs on every capable device simultaneously until device memory is filled and will repeat if you want. It also checks the results of each individual DGEMM to help you track down general stability problems. Time to completion varies widely with options, so feel free to take a look.
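A minimal sketch of the "run a DGEMM, then check the result" idea, in pure Python so it runs without a GPU. The thread does not say how dgemmSweep validates its output, so comparing against a host-side reference is an assumption, and `reference_dgemm`, `count_mismatches`, `sweep`, `flaky_dgemm`, and the `run_dgemm` callback are all hypothetical names rather than anything from the attached source.

```python
import random


def reference_dgemm(alpha, A, B, beta, C):
    """Host-side reference: returns alpha*A*B + beta*C as a new matrix."""
    inner = len(B)
    return [
        [alpha * sum(A[i][p] * B[p][j] for p in range(inner)) + beta * C[i][j]
         for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def count_mismatches(result, expected, rel_tol=1e-12):
    """Count elements that differ from the reference by more than rel_tol."""
    return sum(
        1
        for row, exp_row in zip(result, expected)
        for got, want in zip(row, exp_row)
        if abs(got - want) > rel_tol * max(1.0, abs(want))
    )


def sweep(sizes, run_dgemm, seed=0):
    """Run one square DGEMM per size and report the mismatch count per size."""
    rng = random.Random(seed)
    report = {}
    for n in sizes:
        # Three random n x n matrices: A, B and the accumulator C.
        A, B, C = ([[rng.random() for _ in range(n)] for _ in range(n)]
                   for _ in range(3))
        expected = reference_dgemm(1.0, A, B, 1.0, C)
        report[n] = count_mismatches(run_dgemm(1.0, A, B, 1.0, C), expected)
    return report


def flaky_dgemm(alpha, A, B, beta, C):
    """Stand-in for an unstable device: corrupts one output element."""
    out = reference_dgemm(alpha, A, B, beta, C)
    out[0][0] += 1.0
    return out


if __name__ == "__main__":
    print(sweep([2, 4, 8], reference_dgemm))  # healthy run: no mismatches
    print(sweep([2, 4, 8], flaky_dgemm))      # one bad element per size
```

Reporting mismatches per problem size, rather than a single pass/fail flag, is one way to see at which size a marginal system starts to fail.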

It requires CUDA 2.1, because it uses the ability to poll for an active watchdog timer (you can guess who the major proponent of this was). Like most of what I do, it's Linux only for the moment, although I'm in the process of porting it to Windows. Compile with

nvcc -o dgemmSweep -arch sm_13 dgemmSweep.cu -lcublas

Feedback is welcome.
Attached file: dgemmSweep.1.1.cu.txt (7.64K)
 
 
tmurray
post Jan 13 2009, 11:36 PM
Post #2



stealing this post again for a changelog:

1.0: initial release, Linux only.
1.1: still Linux only, fixed a stupid bug with launching threads on mixed-GPU machines.
 
ldpaniak
post Jan 15 2009, 06:18 AM
Post #3



**

Group: Members
Posts: 21
Joined: 29-September 08
Member No.: 120,599
Org.: Four Pi Solutions Inc.



Thanks for another useful tool.

Unfortunately, I am having trouble compiling it on a fresh Ubuntu 8.04 / CUDA 2.1 install with GTX 280 hardware:

CODE
hpc-user@gpu-hpc:~$ nvcc -o dgemmSweep -arch sm_13 dgemmSweep.cu -lcublas
dgemmSweep.cu(196): error: class "cudaDeviceProp" has no member "kernelExecTimeoutEnabled"

1 error detected in the compilation of "/tmp/tmpxft_000012fc_00000000-4_dgemmSweep.cpp1.ii".


Any hints on what the problem is?
 
tmurray
post Jan 15 2009, 08:04 AM
Post #4



Are you sure that's 2.1 final and not 2.1 beta? It has to be 2.1 final.
 
ldpaniak
post Jan 15 2009, 02:24 PM
Post #5



QUOTE (tmurray @ Jan 15 2009, 03:04 AM)
Are you sure that's 2.1 final and not 2.1 beta? It has to be 2.1 final.


Yes, it is 2.1 beta. Is 2.1 final available to the general public for debian/ubuntu? I would appreciate a link if possible.

Also, will these tools (dgemm burn-in, concBandwidthTest..) be making an appearance in the toolkit? I think they would be great additions.

Thanks
 
tmurray
post Jan 15 2009, 05:21 PM
Post #6



QUOTE (ldpaniak @ Jan 15 2009, 06:24 AM)
Yes, it is 2.1 beta. Is 2.1 final available to the general public for debian/ubuntu? I would appreciate a link if possible.

Also, will these tools (dgemm burn-in, concBandwidthTest..) be making an appearance in the toolkit? I think they would be great additions.

Thanks

2.1 final is out (STILL probably not on the website, but check the CUDA announcements forum for a link). These tools will eventually be included somewhere; we're just trying to figure out the right place for them.
 
ldpaniak
post Jan 15 2009, 06:25 PM
Post #7



Found the new driver and toolkit (180.22). Compilation goes without issue now.

Thanks.

This post has been edited by ldpaniak: Jan 15 2009, 06:28 PM
 
SPWorley
post Jan 16 2009, 05:35 AM
Post #8



*******

Group: Members
Posts: 792
Joined: 13-June 08
From: California USA
Member No.: 107,688



Tim, excellent tool!
I had thought about making a burn-in test myself, but I am very lazy and never did anything.

Do you think DGEMM has a good cascading behavior, so one small error in a memory or compute will get magnified to make the error obvious?
I thought I might use an FFT as a basis, since a single sample error would create a delta function on input, which propagates to all frequencies of the FFT. (Hmm, but that wouldn't magnify the magnitude of the error; ideally it should be a nice feedback that makes it grow.)
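To make the propagation argument concrete, here is a small pure-Python sketch (a naive DFT stands in for an FFT; all function names and the input values are made up for illustration). Under these assumptions, one corrupted input sample shifts every frequency bin by the same magnitude as the injected error, so it spreads everywhere without growing, while one corrupted element of A in a single product C = A*B changes exactly one row of C.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft_error_spread(n=8, index=3, eps=1e-3):
    """Per-bin magnitude of the change caused by one corrupted input sample."""
    clean = [float(t) for t in range(n)]
    corrupt = list(clean)
    corrupt[index] += eps
    return [abs(a - b) for a, b in zip(dft(corrupt), dft(clean))]


def matmul(A, B):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def gemm_error_footprint(n=4, row=2, col=1, eps=1e-3):
    """Positions in C = A*B that change when A[row][col] is corrupted."""
    A = [[float(i + p) for p in range(n)] for i in range(n)]
    B = [[float(1 + p + j) for j in range(n)] for p in range(n)]  # no zeros
    clean = matmul(A, B)
    A[row][col] += eps
    corrupt = matmul(A, B)
    return {(i, j) for i in range(n) for j in range(n)
            if corrupt[i][j] != clean[i][j]}


if __name__ == "__main__":
    print(fft_error_spread())              # every bin moves by about eps
    print(sorted(gemm_error_footprint()))  # only row 2 of C is affected
```

Neither operation amplifies the error on its own in this toy setup; getting growth would need the output fed back in as input, as the post suggests.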


Big extra points to anyone who whips up a script to iterate over various memory and shader clocks and use this test to make a Shmoo plot of your card's stability regions.
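A rough sketch of what such a sweep script could look like. Setting clocks is hardware- and tool-specific and is not covered in this thread, so the `is_stable` callback is a hypothetical hook that would set the clocks and run the burn-in test; `fake_is_stable` and the clock values are invented purely so the example runs anywhere.

```python
def shmoo(mem_clocks, shader_clocks, is_stable):
    """Return the text rows of a Shmoo plot: '#' = passed, '.' = failed."""
    rows = []
    for shader in sorted(shader_clocks, reverse=True):
        cells = "".join(
            "#" if is_stable(mem, shader) else "." for mem in mem_clocks
        )
        rows.append(f"{shader:>5} {cells}")
    return rows


def fake_is_stable(mem_mhz, shader_mhz):
    """Made-up stability model, used only so the sketch runs without a GPU."""
    return mem_mhz + shader_mhz < 2500


if __name__ == "__main__":
    # Columns are memory clocks (left to right), rows are shader clocks.
    for line in shmoo([1000, 1100, 1200], [1300, 1400], fake_is_stable):
        print(line)
```

In a real script, `is_stable` would be replaced by something that applies the clocks, runs dgemmSweep, and decides pass or fail from its output; how the tool reports failure is not described here, so that part is left open.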

 
tmurray
post Feb 10 2009, 01:24 AM
Post #9



Bump: an updated version that isn't stupid about launching threads on mixed-GPU machines.
 
ldpaniak
post Feb 10 2009, 04:32 AM
Post #10



Hi,

The new script does not see one of the three capable devices on the system (a third GTX280):
CODE
hpc-user@gpu-hpc:~$ ./dgemmSweep11 1
Testing device 1: GeForce GTX 280
Testing device 2: GeForce GTX 280
device = 0
device = 0
iterSize = 5952
Device 1: i = 128
...
 
tmurray
post Feb 10 2009, 07:51 AM
Post #11



Are you using that card for display? If so, it doesn't count as capable.
 
netllama
post Feb 10 2009, 02:49 PM
Post #12



*******

Group: Extranet Users
Posts: 754
Joined: 14-February 07
Member No.: 40,832
Org.: NVIDIA Corp.



QUOTE (ldpaniak @ Feb 9 2009, 08:32 PM)
Hi,

The new script does not see one of the three capable devices on the system (a third GTX280):
CODE
hpc-user@gpu-hpc:~$ ./dgemmSweep11 1
Testing device 1: GeForce GTX 280
Testing device 2: GeForce GTX 280
device = 0
device = 0
iterSize = 5952
Device 1: i = 128
...


Does deviceQuery from the SDK see all 3? Which driver are you using?
 
tmurray
post Feb 10 2009, 05:12 PM
Post #13



a bit of clarification because I think I made netllama all worried:

dgemmSweep will not use cards that have a watchdog timer enabled because large DGEMMs will trigger the watchdog.
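The selection rule can be sketched in a few lines of Python. The dictionary keys mirror the `cudaDeviceProp` fields `major`, `minor`, and `kernelExecTimeoutEnabled` (the last one is the member named in the compile error earlier in the thread); the `capable_devices` helper, the compute-capability check, and the example machine are assumptions for illustration, not code taken from dgemmSweep.

```python
def capable_devices(props):
    """Indices of devices with double precision and no watchdog timer."""
    return [
        index
        for index, p in enumerate(props)
        if (p["major"], p["minor"]) >= (1, 3)    # sm_13 or newer for doubles
        and not p["kernelExecTimeoutEnabled"]    # watchdog must be off
    ]


# Hypothetical three-GPU box where device 0 also drives an X display.
example_machine = [
    {"name": "GeForce GTX 280", "major": 1, "minor": 3,
     "kernelExecTimeoutEnabled": True},
    {"name": "GeForce GTX 280", "major": 1, "minor": 3,
     "kernelExecTimeoutEnabled": False},
    {"name": "GeForce GTX 280", "major": 1, "minor": 3,
     "kernelExecTimeoutEnabled": False},
]

if __name__ == "__main__":
    print(capable_devices(example_machine))
```

On a machine like this hypothetical one, only two of the three cards would be tested, even though all three support double precision.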
 
ldpaniak
post Feb 11 2009, 12:18 AM
Post #14



CODE
hpc-user@gpu-hpc:~$ deviceQuery
There are 3 devices supporting CUDA
...


Driver is 180.22 for CUDA 2.1 on 64-bit Linux (Ubuntu 8.04.2).

No attached monitor.

The system runs HOOMD very well on all three GPUs

 
tmurray
post Feb 11 2009, 12:45 AM
Post #15



is it booting into gdm?
 
ldpaniak
post Feb 11 2009, 02:20 AM
Post #16



QUOTE (tmurray @ Feb 10 2009, 07:45 PM)
is it booting into gdm?


xdm
 
tmurray
post Feb 11 2009, 07:45 AM
Post #17



So it's running X on one card, which therefore has a watchdog timer enabled, meaning dgemmSweep won't use it.
 
ldpaniak
post Feb 11 2009, 01:30 PM
Post #18



QUOTE (tmurray @ Feb 11 2009, 02:45 AM)
so it's running X on one card and therefore has a watchdog timer enabled, meaning it won't be used by this


This begs the question: Is there a way to install CUDA in Linux without an X installation on the system? The nvidia driver installer insists on it by default. Is there a switch to override? There is often no reason for a headless compute server to run X.
 
seibert
post Feb 11 2009, 01:41 PM
Post #19



********

Group: Members
Posts: 1,425
Joined: 22-February 07
Member No.: 42,046
Org.: Los Alamos National Laboratory



QUOTE (ldpaniak @ Feb 11 2009, 08:30 AM)
This begs the question: Is there a way to install CUDA in Linux without an X installation on the system? The nvidia driver installer insists on it by default. Is there a switch to override? There is often no reason for a headless compute server to run X.


Change the default runlevel in /etc/inittab from 5 to 3. Then xdm won't start. Since X also creates the /dev/nvidia* devices for you, you'll have to use the script in the Release Notes to create these device files at boot time.
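For anyone who prefers to script the runlevel change, a minimal sketch follows. It assumes a SysV-style /etc/inittab containing a line of the form `id:5:initdefault:`; the `set_default_runlevel` helper is hypothetical and works on text only, so it can be tried on a copy of the file first. Creating the /dev/nvidia* device files is not covered here; use the script from the Release Notes for that.

```python
import re


def set_default_runlevel(inittab_text, level):
    """Return inittab text with the initdefault runlevel set to `level`."""
    pattern = re.compile(r"^id:\d:initdefault:", re.MULTILINE)
    new_text, count = pattern.subn(f"id:{level}:initdefault:", inittab_text)
    if count == 0:
        raise ValueError("no 'id:N:initdefault:' line found")
    return new_text


if __name__ == "__main__":
    sample = "# default runlevel\nid:5:initdefault:\n"
    print(set_default_runlevel(sample, 3))
```

Commented-out lines are left alone because the pattern only matches lines that start with `id:`.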
 
MisterAnderson42
post Feb 11 2009, 02:47 PM
Post #20



********

Group: Extranet Users
Posts: 2,289
Joined: 23-March 07
Member No.: 46,425
Org.: University of Michigan



QUOTE (ldpaniak @ Feb 11 2009, 07:30 AM)
The nvidia driver installer insists on it by default.

No it doesn't. I've installed the stock nvidia driver dozens of times on boxes without X installed.

It asks if you want to update some OpenGL library and it doesn't really matter if you say yes or no. The library can be installed even if no one can use it.
 
