Oct 5 2007, 11:50 PM
Post #1
Group: Members | Posts: 991 | Joined: 29-August 07 | Member No.: 67,724
I've tried implementing a matrix multiply that's as fast as I could make it. On my 8600GT OC, it gets 38 Gflop/s vs 23 with CUBLAS. Could somebody who has an 8800GTX get some numbers?
I tried very hard to optimize it, and experimented with many techniques. I'm afraid, however, that further improvement is unlikely without direct access to cubin. Minor changes cause rather chaotic swings in performance (up or down by 30-200%), even with ptxas run at -O0. Turning on ptxas optimizations usually hurts performance.
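For anyone reproducing the numbers: the post does not say how the Gflop/s figures were computed, so the following is only a sketch under the usual assumption that an SGEMM on an m x k matrix times a k x n matrix counts as 2*m*n*k floating-point operations. The function names (`sgemm_flops`, `sgemm_gflops`, `speedup_percent`) are made up for illustration and are not from the attached code or from CUBLAS.

```c
/* Assumed convention: C = alpha*A*B + beta*C with A of size m x k and
   B of size k x n costs 2*m*n*k flops (one multiply and one add per
   inner-product term); the alpha/beta scaling work is ignored. */
double sgemm_flops(double m, double n, double k)
{
    return 2.0 * m * n * k;
}

/* Convert a measured run time in seconds into a rate in Gflop/s. */
double sgemm_gflops(double m, double n, double k, double seconds)
{
    return sgemm_flops(m, n, k) / seconds / 1e9;
}

/* Relative advantage of one rate over a reference rate, in percent. */
double speedup_percent(double gflops_new, double gflops_ref)
{
    return (gflops_new / gflops_ref - 1.0) * 100.0;
}
```

With the figures quoted above, `speedup_percent(38.0, 23.0)` comes out at roughly 65 percent. The timing itself should cover only the kernel and should be taken after the GPU has finished the work, otherwise the rate will be overstated.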
Attached File(s)
-------------------- CUDA consulting/programming available: cuda@almson.net. Pay after results.
alex_dubinsky my speedy SGEMM Oct 5 2007, 11:50 PM
dhoff QUOTE(alex_dubinsky @ Oct 5 2007, 04:50 PM)Co... Oct 6 2007, 12:28 AM
vvolkov I get 137.9 Gflop/s on 8800 GTX. CUBLAS runs at 12... Oct 6 2007, 04:14 PM
nutti QUOTECould somebody who has an 8800GTX get some nu... Oct 7 2007, 07:41 AM
alex_dubinsky QUOTE(nutti @ Oct 7 2007, 03:41 AM)I get 171.... Oct 7 2007, 05:34 PM

dhoff QUOTE(alex_dubinsky @ Oct 7 2007, 10:34 AM)wo... Oct 7 2007, 05:44 PM

alex_dubinsky QUOTE(dhoff @ Oct 7 2007, 01:44 PM)I doubt he... Oct 7 2007, 08:39 PM

seb I've also got the Ultra.
When I don't rec... Oct 7 2007, 09:14 PM
vvolkov So, this code is 65% faster than CUBLAS when run o... Oct 7 2007, 09:39 PM
paulius QUOTE(vvolkov @ Oct 7 2007, 02:39 PM)...
Also... Oct 8 2007, 05:22 PM
vvolkov QUOTE(paulius @ Oct 8 2007, 09:22 AM)Calling ... Oct 9 2007, 11:29 AM
seb QUOTE(vvolkov @ Oct 9 2007, 06:29 AM)This con... Oct 9 2007, 04:06 PM
vvolkov OK, i've got it. They are queued. And in my ca... Oct 9 2007, 04:30 PM
mfatica Alex,
nice code optimization but what you have co... Oct 9 2007, 05:11 AM
alex_dubinsky QUOTE(mfatica @ Oct 9 2007, 01:11 AM)Alex,
n... Oct 9 2007, 10:08 PM
CBD QUOTE(mfatica @ Oct 8 2007, 10:11 PM)Alex,
n... Nov 12 2007, 06:30 PM
mfatica This is sustained on the card for any alpha and be... Nov 12 2007, 06:39 PM
BeauPaisley QUOTE(mfatica @ Nov 12 2007, 12:39 PM)This is... Nov 20 2007, 11:41 PM
BeauPaisley The matrices in the test were 750x750 which is typ... Dec 5 2007, 06:06 PM
BeauPaisley Why does mod 32 give better performance? If the m... Dec 9 2007, 10:08 PM
BeauPaisley Also, I guess I'm not clear on 'padding... Dec 9 2007, 10:21 PM
BeauPaisley Thank you for the explanation. Indeed doing this ... Dec 10 2007, 02:15 PM
waveone QUOTE(BeauPaisley @ Dec 10 2007, 08:15 AM)Tha... Dec 10 2007, 04:55 PM
waveone Beau and others,
If the data flow into a gpu boar... Dec 10 2007, 05:02 PM
mfatica The 8500GT has 16 processors clocked at 900Mhz and... Nov 21 2007, 12:51 AM
Stanimire Tomov I am trying to run the code on a Quadro FX 5600 bu... Dec 5 2007, 05:24 PM
mfatica On a Tesla card with CUDA 1.1, these are the resul... Dec 5 2007, 06:18 PM
alex_dubinsky What a coincidence that this thread gets zombified... Dec 5 2007, 07:40 PM
lowenz On a 8600 GT @625/800, with ForceWare 169.13:
I... Dec 6 2007, 08:37 AM

waveone Folks,
Good idea to test gemm codes, who is runni... Dec 8 2007, 04:26 AM
alex_dubinsky Was able to take a much closer look with wumpus... Dec 6 2007, 04:49 PM
mfatica The idea of padding is the following one:
1) tran... Dec 10 2007, 02:10 AM
mfatica CUBLAS prefers multiples of 32 for all matrix dime... Dec 10 2007, 05:33 PM
vvolkov I managed to achieve 178 Gflop/s in sgemm on GeFor... Jan 3 2008, 03:12 AM
paulius Accessing an smem operand is as fast as a register... Jan 3 2008, 08:06 PM
pleventi QUOTE(paulius @ Jan 3 2008, 04:06 PM)Accessin... Jan 14 2008, 06:02 AM
vvolkov QUOTE(pleventi @ Jan 13 2008, 10:02 PM)The di... Jan 14 2008, 10:25 AM
Stanimire Tomov I also get 178 GFlop/s on a Quadro FX 5600 (on mac... Jan 7 2008, 06:38 AM
vvolkov QUOTE(Stanimire Tomov @ Jan 6 2008, 10:38 PM)... Jan 13 2008, 06:07 AM
pleventi Hi vvolkov,
BTW, I've got a couple tweaks (mi... Jan 14 2008, 05:37 PM
vvolkov Hey Paul,
I've got 185 Gflop/s with your code... Jan 15 2008, 04:37 AM

pleventi QUOTE(vvolkov @ Jan 15 2008, 12:37 AM)Hey Pau... Jan 15 2008, 05:17 AM


vvolkov I'm sorry, that comment was so long that I did... Jan 15 2008, 09:31 AM


pleventi I'll give it a whirl and see whether I can eek... Jan 15 2008, 02:21 PM


vvolkov Paul,
My last version has 86 instructions in the ... Jan 16 2008, 06:37 AM


pleventi I *think* that the two pairs of add.half instructi... Jan 16 2008, 09:22 AM

samuelmurdoch Dear Vasily,
I'm a very very beginner, and I... May 24 2008, 06:37 PM

vvolkov QUOTE(samuelmurdoch @ May 24 2008, 10:37 AM)i... May 25 2008, 09:33 AM

samuelmurdoch QUOTE(vvolkov @ May 25 2008, 10:33 AM)It... May 25 2008, 11:03 AM

vvolkov QUOTE(samuelmurdoch @ May 25 2008, 03:03 AM)Y... May 25 2008, 11:25 AM

samuelmurdoch QUOTE(vvolkov @ May 25 2008, 12:25 PM)Is it G... May 25 2008, 04:10 PM

vvolkov This change can explain something. To the best of ... May 26 2008, 03:58 AM

samuelmurdoch eheh, Ok, I tried with decuda..
(the moved instruc... May 26 2008, 04:13 PM

E.D. Riedijk QUOTE(samuelmurdoch @ May 26 2008, 09:13 AM)a... May 26 2008, 05:23 PM


samuelmurdoch QUOTE(E.D. Riedijk @ May 26 2008, 06:23 PM)ad... May 26 2008, 06:03 PM

vvolkov I think half means 32-bit opcode (vs. 64-bit). Hal... May 27 2008, 06:35 AM

samuelmurdoch Thank you, I got the facts!
Yeah, decuda is a... May 27 2008, 09:14 AM
langermatze Hi!
I've tried nearly the same approach t... Jan 19 2008, 01:38 PM
vvolkov I think that achieving high performance in CGEMM s... Jan 20 2008, 10:02 AM
Mark Harris Great work guys. Regarding your modeling of the i... Jan 22 2008, 02:09 PM
vvolkov Mark, thanks for the discussion.
QUOTE(Mark Harris... Jan 24 2008, 11:21 AM
Mark Harris QUOTE(vvolkov @ Jan 24 2008, 12:21 PM)But I d... Jan 24 2008, 04:34 PM
pkeir Hi,
I downloaded sgemmN_012408.zip but only manag... Apr 3 2008, 11:36 AM
vvolkov QUOTE(pkeir @ Apr 3 2008, 03:36 AM)I download... Apr 3 2008, 11:43 AM
pkeir Thanks, of course 1.1 kills all my cygwin bash scr... Apr 7 2008, 09:18 AM
mfatica Vasily and Paul,
great work on SGEMM!!... Jan 24 2008, 04:38 PM
vvolkov I found that ssyrk in CUBLAS 2.0 runs on 8800GTX a... May 2 2008, 01:44 PM
pleventi Hi Mark,
According to the disassembly output, the... Jan 24 2008, 06:09 PM
Sarnath Paul,
First of all, Congrats on your speedup... Jan 31 2008, 08:02 AM
julien38 I got 205 Gflop/s on my 8800GTX !!
(vs ... Jan 29 2008, 02:38 PM
vvolkov QUOTE(julien38 @ Jan 29 2008, 06:38 AM)My app... Jan 29 2008, 07:05 PM
gemini0x4d Hello@all,
I've read this thread so far and I... Apr 6 2009, 10:31 AM
vvolkov QUOTE (gemini0x4d @ Apr 6 2009, 02:31 AM)... Apr 6 2009, 10:35 AM
gemini0x4d QUOTE (vvolkov @ Apr 6 2009, 12:35 PM) Su... Apr 6 2009, 12:27 PM
vvolkov QUOTE (gemini0x4d @ Apr 6 2009, 04:27 AM)... Apr 6 2009, 12:53 PM
gemini0x4d QUOTE (vvolkov @ Apr 6 2009, 02:53 PM) OK... Apr 7 2009, 06:39 AM
Sayantan Hi,
I am new to CUDA and currently working on ma... May 11 2009, 06:32 PM
Copyright 2008 NVIDIA Corporation.