IPB

Welcome Guest ( Log In | Register )

 
Reply to this topicStart new topic
> Any good ideas for this special "reduction" ?
Philipp82
post Nov 9 2009, 02:52 PM
Post #1



****

Group: Members
Posts: 50
Joined: 18-February 09
From: Bochum
Member No.: 141,375
Org.: Ruhr University Bochum



Hi,

This is my problem:
I got a very long array A of float values (eg. 1Mio, or 10Mio). All float values are randomly in the range of [0...N]. N can be about 1000 - 4000.
I have a second array B , this one is much smaller, with just N floats. All values of the first array have to be "put into" the second array, with linear weighting.

That means for example:
One element of the array A is "100.2". So element "100" of the second array B is incremented by 0.8 and element "101" is incremented by 0.2.
Another element of the array A is "132.5". So element "132" of the second array B is incremented by 0.5 and element "133" is incremented by 0.5.
...
Until all 10Mio Elements of array A are done.

Does anyone have a good idea, how to manage this (fast) with cuda?

Thanks for any help!
Go to the top of the page
 
+Quote Post
LSChien
post Nov 10 2009, 01:26 AM
Post #2



******

Group: Members
Posts: 261
Joined: 24-September 07
Member No.: 71,314
Org.: Tsing Hua university, R.O.C (Taiwan)



QUOTE (Philipp82 @ Nov 9 2009, 06:52 AM) *
Hi,

This is my problem:
I got a very long array A of float values (eg. 1Mio, or 10Mio). All float values are randomly in the range of [0...N]. N can be about 1000 - 4000.
I have a second array B , this one is much smaller, with just N floats. All values of the first array have to be "put into" the second array, with linear weighting.



I suggest that each thread block deal with a chunk of array A and store result into shared memory, finally

use atomic operation to write data in shared memory to global memory B.

since you need to decrease number of atomic operations, so one thread block can deal with more data.

for example, if thread block = 512, then it can deal with 4K data or more.


--------------------
Department of Mathematics, Tsing Hua university, R.O.C.
Lung Sheng Chien
Go to the top of the page
 
+Quote Post
Cygnus X1
post Nov 10 2009, 03:18 AM
Post #3



*****

Group: Members
Posts: 160
Joined: 30-October 08
From: Saarbruecken, Germany & Kraków, Poland
Member No.: 124,006
Org.: Saarland University



One strange idea just popped into my head. I don't claim it is the best and fastest, but it is a start...

1. We start by sorting the A array (not that long after all, there are good parallel sorting algorithms)
2. For each element i of our array A we split a number to integer part I_i and factorial part F_i.
3. Then we perform a conditional perfix sum. In conditional prefix sum we add two values only if certain condition is met. In our case for numbers N_p and N_q:
if (I_p == I_q)
we add up F_p and F_q.

Numbers consiting of the same integer part N lie all in a single block without holes (because we performed the sort), therefore after the conditional prefix sum, the first element of the block will hold the sum of all factorials of numbers of given integer part: sum(F(N))

4. Now we should know the following values:
#I(N) - number of numbers of integer part equal to N
sum(F(N)) - sum of all factoral parts of numbers which integer part is equal to N.

so finally
your array B[N] = #I(N) - sum(F(N)) + sum(F(N-1))

This post has been edited by Cygnus X1: Nov 10 2009, 03:20 AM
Go to the top of the page
 
+Quote Post
Sarnath
post Nov 10 2009, 07:02 AM
Post #4



********

Group: Members
Posts: 1,567
Joined: 23-November 07
From: Bangalore
Member No.: 79,873
Org.: HCL Technologies



My first thought was sorting.. similar to Cygnus.. But now that he has given a solution, let me talk something else..

IDEA 1
----------
1. Assume you start with "N" blocks. Each block would update a private 1000-4000 range array. Thus your array "B" needs to be replicated "N" times.
For 100 blocks, you would need 100 arrays of size of "B".

2. Divide the 10milion numbers among the N blocks. Each block would update its private B array

3. Then, you need to reduce these private B arrays onto 1 single array.

IDEA 1++
-------------
But then, it still requires atomic operations inside a block...

Extending the idea further, we could have private arrays each thread. On a TESLA C1060, you would need 30*192 threads minimum to keep your latencies away. So, that would mean 5760 arrays of 3000 floats (for 1000-4000 range) each. That boils down to 68 MB. So, you could be fine until your range is kept minimum.
At the end, you need to reduce all these 5760 arrays.

IDEA 2
---------
Assign each element in the range to 1 thread. Each thread will scan all the array elements and increment a local counter (which can be thread local register) and finally write it down to the B array.
If your B array is very very less, then you could assign multiple threads to do the job and do a reduction finally.
Threads in a block can load shared memory with some data and all threads can scan all of data and increment their counters.
Memory bandwidth would be an issue because all threads would need to scan everybody... This would work good in architectures like FERMI, I would guess (because of large fat common caches )

Good Luck,

This post has been edited by Sarnath: Nov 10 2009, 07:21 AM


--------------------
Ignorance Rules; Knowledge Liberates!
Go to the top of the page
 
+Quote Post
Philipp82
post Nov 10 2009, 07:28 AM
Post #5



****

Group: Members
Posts: 50
Joined: 18-February 09
From: Bochum
Member No.: 141,375
Org.: Ruhr University Bochum



Hi all,

Thanks for your answers!!!
Doing all that on my CPU takes about 150ms for 10 Mio elements in array A, so that's what I have to beat. Sorting takes time of the same scale I guess, so that is not an option. I will try your recommendations!

Better ways are still welcome :)

Philipp.
Go to the top of the page
 
+Quote Post
Philipp82
post Nov 10 2009, 10:14 AM
Post #6



****

Group: Members
Posts: 50
Joined: 18-February 09
From: Bochum
Member No.: 141,375
Org.: Ruhr University Bochum



Hi again,

I tryed one thing with an atomic float add function that I found in this forum, which I use on shared Memory.
Now my thread blocks have their own array B in shared memory. Can anyone explain me how I can sum those shared mem arrays on the global memory B field?

Philipp.
Go to the top of the page
 
+Quote Post
Philipp82
post Nov 10 2009, 12:36 PM
Post #7



****

Group: Members
Posts: 50
Joined: 18-February 09
From: Bochum
Member No.: 141,375
Org.: Ruhr University Bochum



My result: From 140ms to 5,7ms

I do the following:
-Every thread-block has a shared memory array B.
-Every thread handles about 1000 elements of array A
-All elements of a block are sorted into the shared mem array via atomic float add functions
-at the end of my kernel the shared mem array is added to the overall array via atomic float add
Go to the top of the page
 
+Quote Post
LSChien
post Nov 10 2009, 01:29 PM
Post #8



******

Group: Members
Posts: 261
Joined: 24-September 07
Member No.: 71,314
Org.: Tsing Hua university, R.O.C (Taiwan)



QUOTE (Philipp82 @ Nov 10 2009, 04:36 AM) *
My result: From 140ms to 5,7ms

I do the following:
-Every thread-block has a shared memory array B.
-Every thread handles about 1000 elements of array A
-All elements of a block are sorted into the shared mem array via atomic float add functions
-at the end of my kernel the shared mem array is added to the overall array via atomic float add


I take TeslaC1060 as an example, it has 30 SM.

if you want to use all the resource, then 30 thread blocks must be used at least.

suppose you use 30 thread blocks and each thread block has 512 threads, and 10M data elements ,

then each trhead deals with 651 data elements.

1. at the end, you must write share memory to global B array via atomic operation.

this means that you need to write N= 1000 to array B 30 times serially.

I think that if array A can be reused, then you can copy shared memory into array A and

merge array A to array B via bisection method.

2. inside a thread block, I think that you can use additional integer array count[1:N]

and float B_shared[1:N] (suppose N is not large such that count and B_shared can be put into shared memory)

and add data into corresponding entry of B_shared, say

100.2 ---> add to B_shared[100], count[100]++
100.3 ---> add to B_shared[100], count[100]++

finally, correct weight in B_shared, then it can reduce number of atomic operation.



--------------------
Department of Mathematics, Tsing Hua university, R.O.C.
Lung Sheng Chien
Go to the top of the page
 
+Quote Post
coggrc
post Nov 18 2009, 10:32 PM
Post #9



**

Group: Members
Posts: 16
Joined: 20-July 09
Member No.: 165,101



QUOTE (Philipp82 @ Nov 10 2009, 07:36 AM) *
My result: From 140ms to 5,7ms

I do the following:
-Every thread-block has a shared memory array B.
-Every thread handles about 1000 elements of array A
-All elements of a block are sorted into the shared mem array via atomic float add functions
-at the end of my kernel the shared mem array is added to the overall array via atomic float add

those sound like great results... i would like to do something very similar... would you be willing to post your kernel here? also, which atomic float add method are you using? please post that as well, if you would,
thanks

Go to the top of the page
 
+Quote Post
gshi
post Nov 19 2009, 03:14 PM
Post #10



***

Group: Members
Posts: 42
Joined: 30-June 08
Member No.: 110,141
Org.: NCSA (University of Illinois)



The Idea 2 sounds great. If texture is used for array A, then it can broadcast each segment of A to all SMs.
Array A is effectively loaded only once.
I would like to see the idea 2 implemented and see what the peformance is.

-gshi

QUOTE (Sarnath @ Nov 10 2009, 01:02 AM) *
My first thought was sorting.. similar to Cygnus.. But now that he has given a solution, let me talk something else..

IDEA 1
----------
1. Assume you start with "N" blocks. Each block would update a private 1000-4000 range array. Thus your array "B" needs to be replicated "N" times.
For 100 blocks, you would need 100 arrays of size of "B".

2. Divide the 10milion numbers among the N blocks. Each block would update its private B array

3. Then, you need to reduce these private B arrays onto 1 single array.

IDEA 1++
-------------
But then, it still requires atomic operations inside a block...

Extending the idea further, we could have private arrays each thread. On a TESLA C1060, you would need 30*192 threads minimum to keep your latencies away. So, that would mean 5760 arrays of 3000 floats (for 1000-4000 range) each. That boils down to 68 MB. So, you could be fine until your range is kept minimum.
At the end, you need to reduce all these 5760 arrays.

IDEA 2
---------
Assign each element in the range to 1 thread. Each thread will scan all the array elements and increment a local counter (which can be thread local register) and finally write it down to the B array.
If your B array is very very less, then you could assign multiple threads to do the job and do a reduction finally.
Threads in a block can load shared memory with some data and all threads can scan all of data and increment their counters.
Memory bandwidth would be an issue because all threads would need to scan everybody... This would work good in architectures like FERMI, I would guess (because of large fat common caches )

Good Luck,

Go to the top of the page
 
+Quote Post
Sarnath
post Nov 20 2009, 07:21 AM
Post #11



********

Group: Members
Posts: 1,567
Joined: 23-November 07
From: Bangalore
Member No.: 79,873
Org.: HCL Technologies




Oh yes! Textures! Thats a good point gshi!


--------------------
Ignorance Rules; Knowledge Liberates!
Go to the top of the page
 
+Quote Post

Reply to this topicStart new topic

 



Copyright 2008 NVIDIA Corporation.  Terms of Use | Legal Info | Privacy Policy Time is now: 23rd November 2009 - 01:58 PM
Unites States Argentina Brazil Chile China Colombia France Germany India Italy Japan Korea Mexico Poland Russia Spain Taiwan United Kingdom Venezuela