> LU, QR and Cholesky factorizations using GPU
vvolkov
post Feb 9 2009, 03:53 PM
Post #1



*****

Group: Extranet Users
Posts: 107
Joined: 6-October 07
From: Berkeley, CA
Member No.: 72,970
Org.: UC Berkeley



I'd like to share an implementation of LAPACK's routines SGETRF, SPOTRF, and SGEQRF that is accelerated using a GPU. This implementation is limited to factorization of square matrices that reside in host memory (i.e. on the CPU side). The following figure shows the sustained performance on this platform: Intel Core2 Quad 2.83 GHz (Q9550), PCIe 2.0 x16, Intel MKL 10.1, Windows XP 64-bit, NVIDIA driver 181.20, CUDA 2.1:


The implementation follows the description given in the following paper; however, some of the finer tunings described, such as recursive and variable blocking, are not included in this release:
Volkov, V., and Demmel, J. W. 2008. Benchmarking GPUs to tune dense linear algebra, SC08.
Regards,

Vasily

05/02/09 edit: updated dead URL to the paper.

This post has been edited by vvolkov: May 2 2009, 12:30 PM
Attached File(s)
Attached File  glapack_090209.zip ( 18.55K ) Number of downloads: 606
 
zhenyu
post Feb 9 2009, 04:36 PM
Post #2



**

Group: Members
Posts: 17
Joined: 20-January 09
From: Eindhoven, the Netherlands
Member No.: 136,332



Thank you very much!
For the QR decomposition, I wonder whether using Givens rotations instead of Householder reflectors would be more efficient for a GPU implementation.
Some people have been using Givens rotations to do QR decomposition on GPUs in the HPEC challenges of 07 and 08.
But I have not found anyone who has measured which method is better on the GPU.
zhenyu
post Feb 9 2009, 04:40 PM
Post #3



In fact, I am working on a Givens rotation version of QR decomposition.
Maybe we can compare whose solution is faster : )
vvolkov
post Feb 9 2009, 04:58 PM
Post #4



QUOTE (zhenyu @ Feb 9 2009, 08:36 AM) *
Thank you very much!
For the QR decomposition, I wonder whether using Givens rotations instead of Householder reflectors would be more efficient for a GPU implementation.
Some people have been using Givens rotations to do QR decomposition on GPUs in the HPEC challenges of 07 and 08.
But I have not found anyone who has measured which method is better on the GPU.


I use the block Householder update as done in LAPACK. It is a BLAS3 operation, so it runs about as fast as GEMM does. I wonder if you can do better.

Vasily
VictorGre
post Feb 11 2009, 03:23 PM
Post #5



*

Group: Members
Posts: 1
Joined: 19-January 09
Member No.: 136,225



Many thanks! Could you also make and post a version for double precision?
Boxed Cylon
post Feb 15 2009, 01:29 PM
Post #6



****

Group: Members
Posts: 69
Joined: 11-June 08
Member No.: 107,465
Org.: New Caprica U.



With routines such as these, we are ever so close to having a functional "sgetrs", which calls on the existing "strsm" and the simple, but not yet existing, "slaswp". The combination of sgetrf and sgetrs solves the equation Ax=b for x, i.e., x=A\b, which is something of a holy grail at the moment.
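
To make the sgetrf/sgetrs relationship concrete, here is a minimal pure-Python sketch. It is not the glapack code and not a LAPACK binding; the names `lu_factor` and `lu_solve` are hypothetical and only mirror the roles of sgetrf and sgetrs. The solve step applies the recorded row swaps to b (the slaswp part) and then does two triangular solves (the strsm part), here for a single right-hand side rather than a matrix of them.

```python
def lu_factor(a):
    """Factor a square matrix as P*A = L*U with partial pivoting.

    Returns (lu, piv): lu holds U on and above the diagonal and the
    multipliers of the unit lower triangular L below it; piv[k] is the
    row that was swapped with row k at step k (0-based).
    """
    n = len(a)
    lu = [row[:] for row in a]          # work on a copy of the input
    piv = list(range(n))
    for k in range(n):
        # Pick the largest entry in column k (rows k..n-1) as the pivot.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        piv[k] = p
        lu[k], lu[p] = lu[p], lu[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]        # multiplier, stored in place of L
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve A*x = b from the factors returned by lu_factor."""
    n = len(lu)
    x = list(b)
    # Step 1 (the slaswp role): apply the recorded row interchanges to b.
    for k in range(n):
        x[k], x[piv[k]] = x[piv[k]], x[k]
    # Step 2 (strsm role, unit lower L): forward substitution.
    for i in range(n):
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    # Step 3 (strsm role, upper U): back substitution.
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


# Small example whose exact solution is x = [1, 1, 1].
A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
b = [4.0, 10.0, 24.0]
lu, piv = lu_factor(A)
x = lu_solve(lu, piv, b)
```

On the GPU the two triangular solves would map onto strsm calls, so the row-swap step is the only piece that has no ready-made routine in the setup described here.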

I have hardware one step below the Q9550/gtx 280: a Q6600 quadcore cpu and a gtx 260. I get the following:

CODE

...glapack> ./benchmark

Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.

Errors reported are 1-norms of the residual such as ||A-QR||_1.

           Cholesky              LU                 QR
        ---------------    ---------------    ---------------
    N   Gflop/s   error    Gflop/s   error    Gflop/s   error
-----   ---------------    ---------------    ---------------
 1000     14.83    0.80      42.96   34.48      54.31    8.78
 2000    101.17    1.07      97.62   60.93     123.00   12.67
 3000    140.38    1.21     130.77   80.04     150.68   13.79
 4000    111.16    0.94     101.29  106.74     168.95   16.81
 5000    174.11    1.53     154.04  124.38     188.27   17.73
 6000    172.13    1.43     173.10  146.37     196.90   20.60
 7000    180.64    1.68     173.76  159.69     202.71   21.18
 8000    190.27    1.61     180.69  193.38     207.50   22.29
 9000    194.35    1.50     187.24  206.19     212.15   25.96
10000    198.41    1.67     192.23  225.67     215.90   27.75
11000    199.69    1.78     194.05  238.32     220.92   26.88


I am somewhat stunned that the 260 is only about 2/3 as fast as the 280 for this benchmark. Perhaps it is the cpu/gpu combination that is conspiring to be slower? I have 8 GB of slowish ram in my system, preferring lots of ram over fast ram. Perhaps the code has some special tuning for the 280?

CODE

... glapack> ./benchmark -cpu

Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.

Errors reported are 1-norms of the residual such as ||A-QR||_1.

           Cholesky              LU                 QR
        ---------------    ---------------    ---------------
    N   Gflop/s   error    Gflop/s   error    Gflop/s   error
-----   ---------------    ---------------    ---------------
 1000     12.95    0.87      32.01   24.60      39.90    6.47
 2000     32.06    0.97      36.39   53.54      51.76    6.71
 3000     38.37    0.90      44.59   81.45      47.21    9.00
 4000     48.96    0.85      45.72   98.10      49.07    7.62
 5000     47.45    1.11      42.56  125.48      50.32   11.28
 6000     46.80    1.21      42.53  155.80      51.31   10.47
 7000     46.76    1.17      51.04  166.25      51.59   13.42
 8000     40.01    1.19      52.32  197.28      52.47   14.37
 9000     48.41    1.18      43.29  223.64      52.66   13.83
10000     48.89    1.21      53.09  244.25      42.80   16.26
11000     51.22    1.18      43.80  265.48      52.91   16.33
12000     50.13    1.23      43.68  300.44      43.16   17.32
13000     40.73    1.20      43.53  300.52      43.38   19.32
14000     40.94    1.22      44.06  335.17      43.21   19.20
15000     41.40    1.32      43.36  346.97      42.79   18.02


I've toyed with upgrading to a Q9550 but I am not sure it is worth the $300 it would take... I paid $400 for my gtx 260 last June which brings tears to my eyes now...
vvolkov
post Feb 15 2009, 01:50 PM
Post #7



As far as I can see, the GTX 260 has 3/4 of the peak arithmetic throughput (= number of cores * clock rate) of the GTX 280, and the Q6600 has 94% of the arithmetic throughput of the Q9550. So indeed, you lose ~10% somewhere.

Could you tell me more about your system? Is it PCIe 2.0 x16? Do you use a 64-bit operating system?
frea
post Feb 15 2009, 08:24 PM
Post #8



**

Group: Members
Posts: 20
Joined: 9-July 08
Member No.: 111,249



vvolkov, could you also post how much time each run takes? I am mainly interested in results for the 8800, but any will be fine :). I am trying to implement a GPU-only QR, and it would be nice to have something to compare against.
vvolkov
post Feb 15 2009, 09:10 PM
Post #9



QUOTE (frea @ Feb 15 2009, 12:24 PM) *
vvolkov, could you also post how much time each run takes? I am mainly interested in results for the 8800, but any will be fine :). I am trying to implement a GPU-only QR, and it would be nice to have something to compare against.

Here are the time results for QR on 8800GTX:
CODE
n             1000    2000   3000   4000  5000  6000  7000  8000  9000  10000  11000  12000  13000
seconds     0.0194  0.0918  0.256  0.566  1.05  1.74  2.71  3.94  5.56   7.55   9.92   12.8   16.3
I used the formula: Gflop/s rate = 4e-9*n*n*n/3/seconds.
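
As a quick check of that formula, the short script below converts the timings in the table above into Gflop/s. It is only a sketch; the helper name `qr_gflops` is made up for illustration, and the data are copied from the table.

```python
# Timings for QR on the 8800 GTX, copied from the table above.
n_values = [1000, 2000, 3000, 4000, 5000, 6000, 7000,
            8000, 9000, 10000, 11000, 12000, 13000]
seconds = [0.0194, 0.0918, 0.256, 0.566, 1.05, 1.74, 2.71,
           3.94, 5.56, 7.55, 9.92, 12.8, 16.3]


def qr_gflops(n, secs):
    """Gflop/s for QR of an n-by-n matrix, counting (4/3)*n^3 flops."""
    return 4e-9 * n * n * n / 3 / secs


rates = [qr_gflops(n, t) for n, t in zip(n_values, seconds)]
for n, t, r in zip(n_values, seconds, rates):
    print(f"{n:6d}  {t:8.4f} s  {r:7.1f} Gflop/s")
```

The same function can be inverted to turn any of the Gflop/s figures quoted in this thread back into seconds, since the flop count depends only on n.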
Boxed Cylon
post Feb 15 2009, 11:03 PM
Post #10



QUOTE (vvolkov @ Feb 15 2009, 05:50 AM) *
As far as I can see, the GTX 260 has 3/4 of the peak arithmetic throughput (= number of cores * clock rate) of the GTX 280, and the Q6600 has 94% of the arithmetic throughput of the Q9550. So indeed, you lose ~10% somewhere.

Could you tell me more about your system? Is it PCIe 2.0 x16? Do you use a 64-bit operating system?


I am using a stock Suse Linux 10.3, 64-bit version. I have a Gigabyte GA-P35-DS3R motherboard which has one PCIe X16 slot. I mentioned my RAM is slow - I think it is 8 GB of DDR2 800. I run the cpu at normal speed. I ran this benchmark using the latest 180.29 version of the nvidia driver, and I ran the benchmark with X turned off with "init 3". I think that may be all the relevant information...

Thanks for the factorizations!
vvolkov
post Feb 15 2009, 11:27 PM
Post #11



QUOTE (Boxed Cylon @ Feb 15 2009, 03:03 PM) *
I am using a stock Suse Linux 10.3, 64-bit version. I have a Gigabyte GA-P35-DS3R motherboard which has one PCIe X16 slot. I mentioned my RAM is slow - I think it is 8 GB of DDR2 800.

I guess this is PCIe 1.1, which is 2x slower than the newer PCIe 2.0. This can be checked using bandwidthTest in the CUDA SDK. If it shows only up to ~3 GB/s in pinned mode, then it is PCIe 1.1.
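
The reasoning behind that check can be sketched as follows. The per-lane rates are the standard ones (2.5 GT/s for PCIe 1.x and 5 GT/s for PCIe 2.0, both with 8b/10b encoding), but the function names and the classification rule are illustrative assumptions, not something bandwidthTest reports.

```python
# Per-lane, per-direction payload rate in MB/s after 8b/10b encoding:
# PCIe 1.x signals at 2.5 GT/s, PCIe 2.0 at 5 GT/s.
MB_PER_LANE = {"1.1": 250.0, "2.0": 500.0}


def theoretical_mb_per_s(gen, lanes=16):
    """Upper bound on one-direction bandwidth for a link of the given width."""
    return MB_PER_LANE[gen] * lanes


def guess_link(measured_mb_per_s, lanes=16):
    """Rough guess from a pinned-memory bandwidth figure (hypothetical rule).

    Anything above the PCIe 1.1 ceiling must be a 2.0 link; anything below
    it is running at PCIe 1.1-class speed at best.
    """
    if measured_mb_per_s > theoretical_mb_per_s("1.1", lanes):
        return "PCIe 2.0"
    return "PCIe 1.1-class speed"


# ~3 GB/s is the figure mentioned above; 5500 MB/s is a made-up example.
slow = guess_link(3000.0)
fast = guess_link(5500.0)
```

Note that a result below the 4000 MB/s ceiling does not prove the slot is PCIe 1.1; a misconfigured 2.0 link can land in the same range.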

I wonder why you get only up to 53 Gflop/s on CPU, which is ~70% of peak. I get up to 85% of peak with Intel MKL 10.1 on my system. I don't know if it is due to the processor or the library. Can't tell much about DDR2 speed either. I guess that chipset also matters.

Anyway, thanks for reporting the performance!
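
For reference, the "~70% of peak" figure can be reproduced with a back-of-the-envelope calculation. This sketch assumes 8 single-precision flops per cycle per core for Core 2 (one 4-wide SSE add and one 4-wide SSE multiply per cycle) and stock clocks of 2.4 GHz for the Q6600 and 2.83 GHz for the Q9550; `peak_gflops` is a hypothetical helper name.

```python
def peak_gflops(cores, ghz, flops_per_cycle=8):
    """Peak single-precision Gflop/s = cores * clock (GHz) * flops per cycle."""
    return cores * ghz * flops_per_cycle


# Q6600: 4 cores at 2.4 GHz (assumed stock clock).
q6600_peak = peak_gflops(4, 2.4)

# Best CPU figure in the reported ./benchmark -cpu run (LU at N = 10000).
best_measured = 53.09
fraction_of_peak = best_measured / q6600_peak

# Q9550: 4 cores at 2.83 GHz; 85% of peak is the figure quoted for MKL 10.1.
q9550_peak = peak_gflops(4, 2.83)
q9550_at_85_percent = 0.85 * q9550_peak

print(f"Q6600 peak {q6600_peak:.1f} Gflop/s, measured fraction {fraction_of_peak:.2f}")
print(f"Q9550 peak {q9550_peak:.2f} Gflop/s, 85% of peak {q9550_at_85_percent:.1f} Gflop/s")
```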
Boxed Cylon
post Feb 16 2009, 12:04 AM
Post #12



QUOTE (vvolkov @ Feb 15 2009, 03:27 PM) *
I guess this is PCIe 1.1, which is 2x slower than the newer PCIe 2.0. This can be checked using bandwidthTest in the CUDA SDK. If it shows only up to ~3 GB/s in pinned mode, then it is PCIe 1.1.


You are correct - some research shows that the P35 chipset is PCIe 1.1. The benchmarks for bandwidthTest below support that notion. It looks like I am underbandwidthing my GTX 260...I see a hardware upgrade in my future...

CODE

./bandwidthTest --memory=pinned
Running on......
device 0:GeForce GTX 260
Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2488.1

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1821.5

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 94576.6

&&&& Test PASSED
Boxed Cylon
post Feb 17 2009, 01:32 PM
Post #13



It so happens that I just today reconfigured my small cluster and can test out my gtx 260 using a Phenom II 940 and a 790X motherboard. The PCIE on this motherboard is indeed version 2.0. Here are the numbers:

CODE

./bandwidthTest --memory=pinned
Running on......
device 0:GT200
Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2657.6

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3216.0

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 94604.6

&&&& Test PASSED



CODE

> ./benchmark

Device: GT200, 1296 MHz clock, 895 MB memory.

Errors reported are 1-norms of the residual such as ||A-QR||_1.

           Cholesky              LU                 QR
        ---------------    ---------------    ---------------
    N   Gflop/s   error    Gflop/s   error    Gflop/s   error
-----   ---------------    ---------------    ---------------
 1000     10.89    1.19      51.91   33.91      66.81    7.97
 2000     93.23    1.30     103.11   61.33     133.75   11.92
 3000     99.65    1.37     147.28   93.44     163.54   15.11
 4000    144.83    1.37     146.59  110.46     191.76   16.54
 5000    180.16    1.73     185.05  122.42     209.02   19.93
 6000    196.58    1.73     198.55  148.36     222.11   21.75
 7000    204.16    1.73     206.30  164.98     228.16   22.82
 8000    215.18    1.84     214.76  187.30     236.05   24.78
 9000    218.61    1.82     219.08  210.76     240.26   26.26
10000    222.41    1.95     223.29  225.35     243.33   24.91
11000    228.40    1.96     227.84  265.51     247.76   28.26


CODE

./benchmark -cpu

Device: GT200, 1296 MHz clock, 895 MB memory.

Errors reported are 1-norms of the residual such as ||A-QR||_1.

           Cholesky              LU                 QR
        ---------------    ---------------    ---------------
    N   Gflop/s   error    Gflop/s   error    Gflop/s   error
-----   ---------------    ---------------    ---------------
 1000     11.24    0.96      44.42   24.25      48.78    5.49
 2000     29.42    1.02      38.59   53.91      47.82    7.53
 3000     42.78    1.15      49.68   84.66      55.93    8.73
 4000     54.96    1.13      54.84  101.21      61.57   10.20
 5000     59.25    1.18      60.71  119.79      65.25   11.36
 6000     60.77    1.28      63.37  138.30      66.97   12.60
 7000     61.40    1.36      64.24  167.32      66.96   13.04
 8000     62.72    1.26      66.55  190.67      67.95   14.13
 9000     63.74    1.29      66.59  219.02      68.66   15.21
10000     64.27    1.29      67.57  241.27      69.10   15.91
11000     63.43    1.34      69.39  258.35      69.52   16.80


This seems to place the 260 more in the expected place with respect to the 280.

The reviews rather beat up on the Phenoms, but for pure number crunching they seemed to have the edge over comparable Intel offerings. I can't speak to the more recent Intel offerings, but the Q6600 (2.4 GHz) was something of a lightweight when I asked all four cores to compute at once. The Phenom 9600 (2.3 GHz) scaled far better.
vvolkov
post Feb 17 2009, 02:34 PM
Post #14



QUOTE (Boxed Cylon @ Feb 17 2009, 05:32 AM) *
It so happens that I just today reconfigured my small cluster and can test out my gtx 260 using a Phenom II 940 and a 790X motherboard. The PCIE on this motherboard is indeed version 2.0. Here are the numbers:

I wonder why your PCIe 2.0 is so slow. Here are my numbers for comparison:
CODE
bandwidthTest.exe --memory=pinned

Running on......
device 0:GeForce GTX 280
Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5582.3

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5426.2

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 114908.7

&&&& Test PASSED

Press ENTER to exit...

Here are my numbers on PCIe 1.1 system:
CODE
bandwidthTest.exe --memory=pinned

Running on......
device 0:GeForce GTX 280
Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3054.5

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3192.1

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 114682.8

&&&& Test PASSED

Press ENTER to exit...

You can see that your PCIe 2.0 runs about as fast as my PCIe 1.1 and much slower than my PCIe 2.0. I use Alienware desktops with nForce 790i Ultra SLI and nForce 680i SLI chipsets.

I have also noticed that your device is recognized as GT200. That happened to me when I was using the now ancient 177.11 drivers. I don't think this is a performance issue, but I'd double-check.
Boxed Cylon
post Feb 17 2009, 04:00 PM
Post #15



The short answer as to why I get sub-standard bandwidth is I don't know. I've tried the 180.22 and 180.29 drivers with the same result - Linux does not have 181.20 as yet that I know of. Both drivers report the generic "GT200". I've checked the BIOS settings and found nothing. And I know the card is in the 16X slot rather than the 8X slot. I'd have to suspect the Linux drivers are lagging to some extent. If I sort out the issue, I'll post again.
Boxed Cylon
post Feb 18 2009, 06:20 AM
Post #16



Ah ha! It turns out that on this Gigabyte motherboard, if I hit Ctrl-F1 when in the BIOS I can get at some additional options for PCIe. They were all set to "disabled" and I set them to "auto" - exactly what the settings are, I could not say. But the effect is to boost the bandwidth up to the expected level:

CODE

./bandwidthTest --memory=pinned
Running on......
device 0:GeForce GTX 260
Quick Mode
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5280.0

Quick Mode
Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5290.8

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 94576.6


The new results for the glapack test are:

CODE

./benchmark

Device: GeForce GTX 260, 1296 MHz clock, 895 MB memory.

Errors reported are 1-norms of the residual such as ||A-QR||_1.

           Cholesky              LU                 QR
        ---------------    ---------------    ---------------
    N   Gflop/s   error    Gflop/s   error    Gflop/s   error
-----   ---------------    ---------------    ---------------
 1000      2.01    1.19      58.71   35.05      70.78    8.29
 2000    103.33    1.33     117.06   58.29     113.79   12.38
 3000    125.02    1.35     166.93   83.37     182.32   14.91
 4000    158.30    1.47     160.73  112.14     201.49   16.90
 5000    195.72    1.64     203.88  125.93     218.48   19.45
 6000    213.32    1.69     216.52  151.77     230.96   21.74
 7000    220.10    1.86     222.57  177.18     236.10   22.58
 8000    222.84    1.82     230.17  160.59     243.12   24.23
 9000    231.44    1.82     232.00  216.55     247.03   27.03
10000    236.08    1.90     236.43  222.46     249.57   28.09
11000    241.31    2.02     240.32  254.06     252.09   28.49


It looks like doubling the bandwidth in this case boosted the benchmark numbers by 5% or so. I think my system is tuned up now.
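
The "5% or so" estimate can be checked against the N = 11000 rows. This sketch assumes the "before" figures are those from the earlier ./benchmark run posted with the ~3 GB/s bandwidthTest results, and the "after" figures are from the run above; `percent_gain` is a hypothetical helper name.

```python
# Gflop/s at N = 11000 from the two ./benchmark runs reported in this thread:
# "before" is the earlier run at roughly 3 GB/s, "after" is the run above.
before = {"Cholesky": 228.40, "LU": 227.84, "QR": 247.76}
after = {"Cholesky": 241.31, "LU": 240.32, "QR": 252.09}


def percent_gain(old, new):
    """Relative improvement of new over old, in percent."""
    return 100.0 * (new - old) / old


gains = {name: percent_gain(before[name], after[name]) for name in before}
for name, g in gains.items():
    print(f"{name:9s} {g:5.2f}%")
```

By this measure Cholesky and LU gain a little over 5% at the largest size, while QR gains noticeably less.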

This post has been edited by Boxed Cylon: Feb 18 2009, 06:21 AM
vvolkov
post Feb 18 2009, 06:25 AM
Post #17



QUOTE (Boxed Cylon @ Feb 17 2009, 10:20 PM) *
Ah ha! It turns out that on this Gigabyte motherboard, if I hit Ctrl-F1 when in the BIOS I can get at some additional options for PCIe. They were all set to "disabled" and I set them to "auto" - exactly what the settings are, I could not say. But the effect is to boost the bandwidth up to the expected level:

Cool! Thanks for getting better performance numbers with my code! :-D
marcof
post Jun 25 2009, 06:19 PM
Post #18



*

Group: Members
Posts: 2
Joined: 18-June 09
Member No.: 159,830



Hi, great work here!

Do you think there is a way to implement sparse Cholesky factorization in CUDA?

CHOLMOD is so efficient on the CPU that it makes me dream about having it ported to the GPU.
vvolkov
post Jun 25 2009, 06:46 PM
Post #19



QUOTE (marcof @ Jun 25 2009, 11:19 AM) *
Do you think there is a way to implement sparse Cholesky factorization in CUDA?

I think sparse codes need fast communication between thread blocks/multiprocessors, which is currently lacking.
tmurray
post Jun 25 2009, 06:52 PM
Post #20



********

Group: NVIDIA Employees
Posts: 2,065
Joined: 3-June 08
From: Santa Clara, CA
Member No.: 106,363
Org.: NVIDIA



QUOTE (vvolkov @ Jun 25 2009, 11:46 AM) *
I think sparse codes need fast communication between thread blocks/multiprocessors, which is currently lacking.

I'm curious to hear what you'd consider fast in this case (or if you want, you could just email me).