Help - Search - Members - Calendar
Full Version: Fast texture fetching
NVIDIA Forums > CUDA GPU Computing > General CUDA GPU Computing Discussion
Jover
Dear all!
I have a question about texture fetching. As it's written in a programming guide and discussed a lot at the present forum the speed of tex1Dfetch inside the kernel is highly dependent on the spatial locality of those fetches. In other words all kernel's fetches must be performed from nearby places in global memory to provide maximal speed.
In my case each kernel performs several fetches from the texture binded to the 1D global array which corresponds to 3D geometrical points. Moreover all the kernel's fetches are performed for the points which are geometrical neighbours. So there is a question if there is any CPU or GPU implemented algorithm which performs an optimal 3D points renumbering in the sence that geometrical neighbours become neighbours in a corresponded 1D global array.
Thank you in advance.
_Big_Mac_
If you want to optimally fetch elements that have a 3D geometrical arrangement you should use 3D textures, not 1D textures. To use 3D textures you will first need to copy your 1D linear global array to a 3D cuda array and then bind a 3D texture to that 3D cuda array. You then use tex3Dfetch inside kernels. It's all described in more detail in the programming guide.

Copying 1D memory to a 3D cuda array renumbers the indexes.

Also, 3D textures were introduced in CUDA 2.0.
MisterAnderson42
QUOTE
If you want to optimally fetch elements that have a 3D geometrical arrangement you should use 3D textures, not 1D textures.

Of course. But only if the data is perfectly arranged on a 3D grid. If the points in the data set are somewhat randomly distributed, 3D textures are out of the question.

QUOTE(Jover @ Sep 15 2008, 03:07 PM)
So there is a question if there is any CPU or GPU implemented algorithm which performs an optimal 3D points renumbering in the sence that geometrical neighbours become neighbours in a corresponded 1D global array.
[right][snapback]440365[/snapback][/right]

Yes. You can reorder the points based on a space filling curve. The hilbert curve works best in my testing, but the much simpler Morton order curve is a close 2nd.

Our paper discusses the algorithm we use and contains several references to other methods. You can find the reference or download the preprint from http://www.ameslab.gov/hoomd/about.html#paper
eelsen
QUOTE(_Big_Mac_ @ Sep 15 2008, 02:17 PM)
If you want to optimally fetch elements that have a 3D geometrical arrangement you should use 3D textures, not 1D textures. To use 3D textures you will first need to copy your 1D linear global array to a 3D cuda array and then bind a 3D texture to that 3D cuda array. You then use tex3Dfetch inside kernels. It's all described in more detail in the programming guide.

Copying 1D memory to a 3D cuda array renumbers the indexes.

Also, 3D textures were introduced in CUDA 2.0.
[right][snapback]440367[/snapback][/right]


From an implementation standpoint, this is definitely correct.

To answer the question of what the algorithm is/what the 3D texture is doing for you behind your back is that it is uses a space filling curve, probably similar to this one: http://en.wikipedia.org/wiki/Z-order_(curve)
Jover
Thank you for the answers.
Actually MisterAnderson42 is completely right. thumbup.gif 3D textures are't my case definitely however I've thought about them ) The reason is that all my 3D points are just an output of some 3D mesh generator which gives just a 1D array of arbitrary distributed mesh nodes.
I read about space filling Z-curve at wikipedia as one of the possible approaches however googling gave several links about more complicated approaches. For example it's easy to demonstrate that my question is equivalent to the well-known problem of a sparse matrix bandwidth reduction. And the best of the suggested approaches is based on the reverse Cuthill-McKee algorithm. However I've met an article where guys solve our problem just searching different variants of numbering at the Cray cluster - such a brutal preprocessing cool.gif
Anyway my question was about an implemented CPU or GPU renumbering algorithm (available code will be the best option). As far as I understand I should invent a bicycle myself. sad.gif
MisterAnderson42
QUOTE(Jover @ Sep 15 2008, 03:51 PM)
Anyway my question was about an implemented CPU or GPU renumbering algorithm (available code will be the best option). As far as I understand I should invent a bicycle myself.  sad.gif
[right][snapback]440381[/snapback][/right]

It's actually pretty easy, assuming your points are guarunteed to be bounded within a reasonably sized box. I just assign the points to a grid and then walk through the grid based on the space filling curve, renumbering particles as I go. Of course, this is done on the CPU and the time it takes is absolutely tiny, given that I precomute the space filling curve and reuse it many times for the same grid.

If your points aren't bounded by a nice box, check the references from our paper. One of them has a nice algorithm for the Hilbert curve to recursively identify where each point belongs without ever having to generate the complete curve in memory.
Jover
Thanks I'll look through.
An approach based on the space filling curve seems to be realistic.
Last question. MisterAnderson42, I've read in one of the forum's posts that your renumbering gave you a resulted speedup of almost 5 times! w00t.gif Is it really so that just nodes' renumbering can provide such a drastic speedup?
P.S. In case someone is interested in bandwith reduction algorithms which can be applied for an optimal renumbering I can send found articles via email.
MisterAnderson42
QUOTE(Jover @ Sep 15 2008, 04:12 PM)
MisterAnderson42, I've read in one of the forum's posts that your renumbering gave you a resulted speedup of almost 5 times! 
[right][snapback]440392[/snapback][/right]

Yep! Although the speedup depends on the size of the system. Here is a quick benchmark run I just did for 64,000 particles. The only difference is that one has the data sort and the other does not.
CODE

./force_compute_bmark --half_nlist=0 -N 64000 -q --sort=1
0.003538443 s/step
joaander@pion ~/hoomd/bin/benchmarks $ ./force_compute_bmark --half_nlist=0 -N 64000 -q --sort=0
0.01123575 s/step
joaander@pion ~/hoomd/bin/benchmarks $ ./force_compute_bmark --half_nlist=0 -N 300000 -q --sort=1

That is 3.2x faster with the sort.

With 300,000 particles, the speedup is greater.
CODE

joaander@pion ~/hoomd/bin/benchmarks $ ./force_compute_bmark --half_nlist=0 -N 300000 -q --sort=1
0.01508135 s/step
joaander@pion ~/hoomd/bin/benchmarks $ ./force_compute_bmark --half_nlist=0 -N 300000 -q --sort=0
0.06657493 s/step

That is 4.41x faster (hmmm... my 5x number must have come from an even larger system size)

This benchmark is a run of a kernel where each thread accesses ~90 nearby particles to sum a force. The "unsorted" benchmarks have their particles placed by a random number generator, so the reads for a single thread are likely to be distributed equally among the entire array. The benchmarks with sort=1 have the hilbert curve sort applied to the same randomly placed particles.

YMMV of course, depending on how many neighbors you are accessing in each thread and how "random" the data was to begin with. In my application the random data is completley realistic of a real simulation, since the particles diffuse over time (hence why I need to apply the sort more than one time to keep them in check).
This is a "lo-fi" version of our main content. To view the full version with more information, formatting and images, please click here.