QUOTE (pawel_astro @ Jan 28 2009, 01:36 PM)

On the other hand, in order to do an nbody calculation, I would have to run not just 100 or 200 CPUs, but a much larger number of nodes on a cluster, because nbody is a well-connected problem and bandwidth via the typical interconnect is horrible, so the calculation would be communication-bound.
A direct n^2 summation is unlikely to be communication bound even on a cluster with crappy interconnects.
Since we are talking about a direct n^2 summation, you could duplicate all particles on all processors. If you actually had so many particles you couldn't fit them all into the memory of one node, your calculation would take hopelessly long no matter how you did it. I would estimate you could fit 30 million particles into 2GB of RAM; that's almost 10^15 interactions per step (30 million squared is 9 x 10^14). You would need a cluster the size of Blue Gene/L to tackle that in a reasonable amount of time. Or a cluster with 1000 GPUs.
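To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch. The function names, the 20 flops per interaction and the 1 Tflop/s sustained rate are my own assumptions for illustration, not measured numbers; plug in whatever your hardware actually delivers.

```python
def budget(n_particles, ram_bytes):
    """Return (bytes available per particle, pairwise interactions per step)."""
    bytes_per_particle = ram_bytes / n_particles
    # Direct n^2 summation: every particle against every particle.
    interactions = n_particles * n_particles
    return bytes_per_particle, interactions


def seconds_per_step(interactions, flops_per_interaction, sustained_flops):
    """Compute time for one force evaluation at a given sustained rate."""
    return interactions * flops_per_interaction / sustained_flops


n = 30_000_000            # particles, from the estimate above
ram = 2 * 1024**3         # 2GB of RAM on one node
bytes_pp, inter = budget(n, ram)

# ASSUMPTIONS (hypothetical): ~20 flops per pairwise interaction and
# 1e12 flop/s sustained across the whole machine.
t = seconds_per_step(inter, 20, 1e12)

print(f"{bytes_pp:.1f} bytes per particle, {inter:.2e} interactions, {t:.0f} s per step")
```

About 70 bytes per particle is roughly position, velocity and mass in double precision with a little room to spare, which is where the 30 million figure comes from.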
Even if you did need to break the particles up into N (where N is almost certainly <10) groups, the cost of shuffling them around would still be dwarfed by the compute cost.
Then all you would have to transfer each step would be a broadcast of the computed forces, one per particle. If you wanted to get really fancy, you could start broadcasting forces as each particle (or a small group of them) completed and almost completely eliminate communication overhead.
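Here is a minimal sketch of the replicated-data scheme described above, with the nodes simulated in a single process. Everything in it is illustrative: `accel_slice`, `replicated_step`, the softening length and G = 1 are my own choices, and on a real cluster the final concatenation would be a collective such as MPI_Allgather rather than a list join.

```python
import math


def accel_slice(pos, mass, start, stop, soft=1e-3):
    """Accelerations (G = 1) on particles start..stop-1 due to ALL particles."""
    out = []
    for i in range(start, stop):
        xi, yi, zi = pos[i]
        ax = ay = az = 0.0
        for j, (xj, yj, zj) in enumerate(pos):
            if i == j:
                continue  # no self-interaction
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            r2 = dx * dx + dy * dy + dz * dz + soft * soft  # softened distance^2
            w = mass[j] / (r2 * math.sqrt(r2))
            ax += dx * w
            ay += dy * w
            az += dz * w
        out.append((ax, ay, az))
    return out


def replicated_step(pos, mass, n_ranks):
    """Simulate n_ranks nodes that each hold a full copy of pos and mass."""
    n = len(pos)
    bounds = [n * r // n_ranks for r in range(n_ranks + 1)]
    # Each "rank" works on its own slice; nothing is communicated during this.
    partial = [accel_slice(pos, mass, bounds[r], bounds[r + 1])
               for r in range(n_ranks)]
    # Stand-in for the broadcast/all-gather: only n force vectors in total
    # cross the network, versus n^2 interactions of compute.
    return [a for chunk in partial for a in chunk]


pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0),
       (1.5, 1.5, 0.5), (-1.0, 0.5, 2.0)]
mass = [1.0, 2.0, 0.5, 1.5, 1.0]
acc = replicated_step(pos, mass, n_ranks=3)
```

The point is the ratio: per step the compute grows like n^2 while the data exchanged grows like n, so the bigger the problem, the less the interconnect matters.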