QUOTE(Jover @ Sep 15 2008, 04:12 PM)
MisterAnderson42, I've read in one of the forum's posts that your renumbering gave you a speedup of almost 5 times!
Yep! Although the speedup depends on the size of the system. Here is a quick benchmark run I just did for 64,000 particles. The only difference is that one has the data sort and the other does not.
CODE
joaander@pion ~/hoomd/bin/benchmarks $ ./force_compute_bmark --half_nlist=0 -N 64000 -q --sort=1
0.003538443 s/step
joaander@pion ~/hoomd/bin/benchmarks $ ./force_compute_bmark --half_nlist=0 -N 64000 -q --sort=0
0.01123575 s/step
That is 3.2x faster with the sort.
With 300,000 particles, the speedup is greater.
CODE
joaander@pion ~/hoomd/bin/benchmarks $ ./force_compute_bmark --half_nlist=0 -N 300000 -q --sort=1
0.01508135 s/step
joaander@pion ~/hoomd/bin/benchmarks $ ./force_compute_bmark --half_nlist=0 -N 300000 -q --sort=0
0.06657493 s/step
That is 4.41x faster (hmmm... my 5x number must have come from an even larger system size)
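For anyone who wants to check the arithmetic, here is a small Python sketch that recomputes the two speedups from the timings pasted above. The names `timings` and `speedup` are just illustrative; they are not part of the benchmark program.

```python
# Timings in seconds per step, copied from the benchmark output above.
timings = {
    64000: {"sorted": 0.003538443, "unsorted": 0.01123575},
    300000: {"sorted": 0.01508135, "unsorted": 0.06657493},
}


def speedup(unsorted_time, sorted_time):
    """Ratio of unsorted to sorted time per step (how many times faster)."""
    return unsorted_time / sorted_time


speedups = {
    n: speedup(t["unsorted"], t["sorted"]) for n, t in timings.items()
}

for n, s in speedups.items():
    print(f"N={n}: {s:.2f}x faster with the sort")
```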
This benchmark runs a kernel in which each thread reads ~90 nearby particles to sum a force. The "unsorted" benchmarks place their particles with a random number generator, so the reads for a single thread are likely to be spread evenly across the entire array. The benchmarks with sort=1 apply the Hilbert curve sort to the same randomly placed particles.
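To illustrate the idea of a Hilbert curve sort, here is a minimal Python sketch. It is not the code used in the benchmark: it works in 2D rather than 3D, bins particles on a hypothetical `n` x `n` grid (with `n` a power of two), and uses the standard textbook mapping from grid cell to distance along the curve. The function names `hilbert_index` and `hilbert_order` are made up for this example.

```python
import random


def hilbert_index(n, x, y):
    """Distance along the Hilbert curve of cell (x, y) on an n x n grid.

    n must be a power of two; x and y are integers in [0, n).
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(points, n=64):
    """Return particle indices ordered along the Hilbert curve.

    points is a list of (x, y) pairs with coordinates in [0, 1).
    """
    def key(i):
        px, py = points[i]
        # Bin the particle into a grid cell, clamping to the last cell.
        cx = min(int(px * n), n - 1)
        cy = min(int(py * n), n - 1)
        return hilbert_index(n, cx, cy)

    return sorted(range(len(points)), key=key)


# Randomly placed particles, reordered so that particles close in space
# also tend to be close in memory.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(1000)]
order = hilbert_order(points)
sorted_points = [points[i] for i in order]
```

Consecutive cells along the curve are always spatial neighbors, so after the reorder the ~90 particles a thread reads tend to sit in a few compact stretches of the array instead of being scattered across all of it.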
YMMV of course, depending on how many neighbors you access in each thread and how "random" the data was to begin with. In my application the random data is completely representative of a real simulation, since the particles diffuse over time (which is why I need to apply the sort more than once to keep them in check).