> my speedy SGEMM
alex_dubinsky
post Oct 5 2007, 11:50 PM
Post #1



I've tried implementing a matrix multiply that's as fast as I could make it. On my 8600GT OC, it gets 38 Gflop/s vs. 23 Gflop/s with CUBLAS. Could somebody who has an 8800GTX get some numbers?

I tried very hard to optimize it and experimented with many techniques. I'm afraid, however, that further improvement is unlikely without direct access to the cubin. Minor changes cause rather chaotic swings in performance (up or down by 30-200%), even with ptxas run at -O0. Turning on ptxas optimizations usually hurts performance.
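
For anyone comparing numbers, here is a rough sketch of how an SGEMM Gflop/s figure is usually derived: count 2*m*n*k floating-point operations (one multiply and one add per inner-product term) and divide by the measured time. This is not code from the attached archive. The names `sgemm_gflops` and `sgemm_ref` are hypothetical, and the row-major layout is an assumption made only to keep the example short (BLAS and CUBLAS use column-major storage). The plain-C reference multiply is only there to check a fast kernel's results.

```c
#include <assert.h>
#include <stddef.h>

/* Gflop/s for C = alpha*A*B + beta*C with A (m x k), B (k x n), C (m x n).
   Uses the conventional 2*m*n*k operation count; the scaling by alpha and
   beta is ignored, as is common when quoting GEMM rates. */
double sgemm_gflops(size_t m, size_t n, size_t k, double seconds)
{
    assert(seconds > 0.0);
    double flops = 2.0 * (double)m * (double)n * (double)k;
    return flops / seconds / 1e9;
}

/* Naive reference SGEMM, row-major (an assumption for this sketch).
   Intended for validating results, not for speed. */
void sgemm_ref(size_t m, size_t n, size_t k,
               float alpha, const float *A, const float *B,
               float beta, float *C)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

For example, a 1000x1000x1000 multiply that takes 2 seconds works out to `sgemm_gflops(1000, 1000, 1000, 2.0)`, i.e. 1 Gflop/s. When timing a GPU kernel, the timed interval should cover only completed kernel execution, otherwise the figure is not comparable.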
Attached file: milestone_6.3.rar (863.27K)
 


--------------------
CUDA consulting/programming available: cuda@almson.net. Pay after results.

Posts in this topic
- alex_dubinsky   my speedy SGEMM   Oct 5 2007, 11:50 PM
- - dhoff   QUOTE(alex_dubinsky @ Oct 5 2007, 04:50 PM)Co...   Oct 6 2007, 12:28 AM
- - vvolkov   I get 137.9 Gflop/s on 8800 GTX. CUBLAS runs at 12...   Oct 6 2007, 04:14 PM
- - nutti   QUOTECould somebody who has an 8800GTX get some nu...   Oct 7 2007, 07:41 AM
|- - alex_dubinsky   QUOTE(nutti @ Oct 7 2007, 03:41 AM)I get 171....   Oct 7 2007, 05:34 PM
||- - dhoff   QUOTE(alex_dubinsky @ Oct 7 2007, 10:34 AM)wo...   Oct 7 2007, 05:44 PM
||- - alex_dubinsky   QUOTE(dhoff @ Oct 7 2007, 01:44 PM)I doubt he...   Oct 7 2007, 08:39 PM
||- - seb   I've also got the Ultra. When I don't rec...   Oct 7 2007, 09:14 PM
|- - vvolkov   So, this code is 65% faster than CUBLAS when run o...   Oct 7 2007, 09:39 PM
|- - paulius   QUOTE(vvolkov @ Oct 7 2007, 02:39 PM)... Also...   Oct 8 2007, 05:22 PM
|- - vvolkov   QUOTE(paulius @ Oct 8 2007, 09:22 AM)Calling ...   Oct 9 2007, 11:29 AM
|- - seb   QUOTE(vvolkov @ Oct 9 2007, 06:29 AM)This con...   Oct 9 2007, 04:06 PM
|- - vvolkov   OK, i've got it. They are queued. And in my ca...   Oct 9 2007, 04:30 PM
- - mfatica   Alex, nice code optimization but what you have co...   Oct 9 2007, 05:11 AM
|- - alex_dubinsky   QUOTE(mfatica @ Oct 9 2007, 01:11 AM)Alex, n...   Oct 9 2007, 10:08 PM
|- - CBD   QUOTE(mfatica @ Oct 8 2007, 10:11 PM)Alex, n...   Nov 12 2007, 06:30 PM
- - mfatica   This is sustained on the card for any alpha and be...   Nov 12 2007, 06:39 PM
|- - BeauPaisley   QUOTE(mfatica @ Nov 12 2007, 12:39 PM)This is...   Nov 20 2007, 11:41 PM
|- - BeauPaisley   The matrices in the test were 750x750 which is typ...   Dec 5 2007, 06:06 PM
|- - BeauPaisley   Why does mod 32 give better performance? If the m...   Dec 9 2007, 10:08 PM
|- - BeauPaisley   Also, I guess I'm not clear on 'padding...   Dec 9 2007, 10:21 PM
|- - BeauPaisley   Thank you for the explanation. Indeed doing this ...   Dec 10 2007, 02:15 PM
|- - waveone   QUOTE(BeauPaisley @ Dec 10 2007, 08:15 AM)Tha...   Dec 10 2007, 04:55 PM
|- - waveone   Beau and others, If the data flow into a gpu boar...   Dec 10 2007, 05:02 PM
- - mfatica   The 8500GT has 16 processors clocked at 900Mhz and...   Nov 21 2007, 12:51 AM
- - Stanimire Tomov   I am trying to run the code on a Quadro FX 5600 bu...   Dec 5 2007, 05:24 PM
- - mfatica   On a Tesla card with CUDA 1.1, these are the resul...   Dec 5 2007, 06:18 PM
|- - alex_dubinsky   What a coincidence that this thread gets zombified...   Dec 5 2007, 07:40 PM
|- - lowenz   On a 8600 GT @625/800, with ForceWare 169.13: I...   Dec 6 2007, 08:37 AM
||- - waveone   Folks, Good idea to test gemm codes, who is runni...   Dec 8 2007, 04:26 AM
|- - alex_dubinsky   Was able to take a much closer look with wumpus...   Dec 6 2007, 04:49 PM
- - mfatica   The idea of padding is the following one: 1) tran...   Dec 10 2007, 02:10 AM
- - mfatica   CUBLAS prefers multiples of 32 for all matrix dime...   Dec 10 2007, 05:33 PM
- - vvolkov   I managed to achieve 178 Gflop/s in sgemm on GeFor...   Jan 3 2008, 03:12 AM
|- - paulius   Accessing an smem operand is as fast as a register...   Jan 3 2008, 08:06 PM
|- - pleventi   QUOTE(paulius @ Jan 3 2008, 04:06 PM)Accessin...   Jan 14 2008, 06:02 AM
|- - vvolkov   QUOTE(pleventi @ Jan 13 2008, 10:02 PM)The di...   Jan 14 2008, 10:25 AM
- - Stanimire Tomov   I also get 178 GFlop/s on a Quadro FX 5600 (on mac...   Jan 7 2008, 06:38 AM
|- - vvolkov   QUOTE(Stanimire Tomov @ Jan 6 2008, 10:38 PM)...   Jan 13 2008, 06:07 AM
- - pleventi   Hi vvolkov, BTW, I've got a couple tweaks (mi...   Jan 14 2008, 05:37 PM
|- - vvolkov   Hey Paul, I've got 185 Gflop/s with your code...   Jan 15 2008, 04:37 AM
||- - pleventi   QUOTE(vvolkov @ Jan 15 2008, 12:37 AM)Hey Pau...   Jan 15 2008, 05:17 AM
|||- - vvolkov   I'm sorry, that comment was so long that I did...   Jan 15 2008, 09:31 AM
|||- - pleventi   I'll give it a whirl and see whether I can eek...   Jan 15 2008, 02:21 PM
|||- - vvolkov   Paul, My last version has 86 instructions in the ...   Jan 16 2008, 06:37 AM
|||- - pleventi   I *think* that the two pairs of add.half instructi...   Jan 16 2008, 09:22 AM
||- - samuelmurdoch   Dear Vasily, I'm a very very beginner, and I...   May 24 2008, 06:37 PM
||- - vvolkov   QUOTE(samuelmurdoch @ May 24 2008, 10:37 AM)i...   May 25 2008, 09:33 AM
||- - samuelmurdoch   QUOTE(vvolkov @ May 25 2008, 10:33 AM)It...   May 25 2008, 11:03 AM
||- - vvolkov   QUOTE(samuelmurdoch @ May 25 2008, 03:03 AM)Y...   May 25 2008, 11:25 AM
||- - samuelmurdoch   QUOTE(vvolkov @ May 25 2008, 12:25 PM)Is it G...   May 25 2008, 04:10 PM
||- - vvolkov   This change can explain something. To the best of ...   May 26 2008, 03:58 AM
||- - samuelmurdoch   eheh, Ok, I tried with decuda.. (the moved instruc...   May 26 2008, 04:13 PM
||- - E.D. Riedijk   QUOTE(samuelmurdoch @ May 26 2008, 09:13 AM)a...   May 26 2008, 05:23 PM
|||- - samuelmurdoch   QUOTE(E.D. Riedijk @ May 26 2008, 06:23 PM)ad...   May 26 2008, 06:03 PM
||- - vvolkov   I think half means 32-bit opcode (vs. 64-bit). Hal...   May 27 2008, 06:35 AM
||- - samuelmurdoch   Thank you, I got the facts! Yeah, decuda is a...   May 27 2008, 09:14 AM
|- - langermatze   Hi! I've tried nearly the same approach t...   Jan 19 2008, 01:38 PM
|- - vvolkov   I think that achieving high performance in CGEMM s...   Jan 20 2008, 10:02 AM
|- - Mark Harris   Great work guys. Regarding your modeling of the i...   Jan 22 2008, 02:09 PM
|- - vvolkov   Mark, thanks for the discussion. QUOTE(Mark Harris...   Jan 24 2008, 11:21 AM
|- - Mark Harris   QUOTE(vvolkov @ Jan 24 2008, 12:21 PM)But I d...   Jan 24 2008, 04:34 PM
|- - pkeir   Hi, I downloaded sgemmN_012408.zip but only manag...   Apr 3 2008, 11:36 AM
|- - vvolkov   QUOTE(pkeir @ Apr 3 2008, 03:36 AM)I download...   Apr 3 2008, 11:43 AM
|- - pkeir   Thanks, of course 1.1 kills all my cygwin bash scr...   Apr 7 2008, 09:18 AM
- - mfatica   Vasily and Paul, great work on SGEMM!!...   Jan 24 2008, 04:38 PM
|- - vvolkov   I found that ssyrk in CUBLAS 2.0 runs on 8800GTX a...   May 2 2008, 01:44 PM
- - pleventi   Hi Mark, According to the disassembly output, the...   Jan 24 2008, 06:09 PM
|- - Sarnath   Paul, First of all, Congrats on your speedup...   Jan 31 2008, 08:02 AM
- - julien38   I got 205 Gflop/s on my 8800GTX !! (vs ...   Jan 29 2008, 02:38 PM
|- - vvolkov   QUOTE(julien38 @ Jan 29 2008, 06:38 AM)My app...   Jan 29 2008, 07:05 PM
- - gemini0x4d   Hello@all, I've read this thread so far and I...   Apr 6 2009, 10:31 AM
|- - vvolkov   QUOTE (gemini0x4d @ Apr 6 2009, 02:31 AM)...   Apr 6 2009, 10:35 AM
|- - gemini0x4d   QUOTE (vvolkov @ Apr 6 2009, 12:35 PM) Su...   Apr 6 2009, 12:27 PM
|- - vvolkov   QUOTE (gemini0x4d @ Apr 6 2009, 04:27 AM)...   Apr 6 2009, 12:53 PM
|- - gemini0x4d   QUOTE (vvolkov @ Apr 6 2009, 02:53 PM) OK...   Apr 7 2009, 06:39 AM
- - Sayantan   Hi, I am new to CUDA and currently working on ma...   May 11 2009, 06:32 PM



 



Copyright 2008 NVIDIA Corporation.