Nov 7 2009, 03:34 AM | Post #1
Group: Members | Posts: 3 | Joined: 20-October 09 | Member No.: 197,895 | Club SLI Member: No | Org.: Intellectual Ventures
Hi,
I'm trying to get some benchmarking numbers out of a test port of a large Monte Carlo simulation our group has developed. These numbers will directly influence our purchasing decision, so you can imagine my surprise when, running the profiler, I noticed that my timings of kernel calls indicated an extra factor of 2 speed *increase* over what my program normally does when run by itself.

I need to know if this is real and why it is happening. Unfortunately I don't have a lot of time to investigate myself, nor can I check if the program results are still correct. Can any NVIDIA people explain this as a known possibility and tell me if it represents real potential performance? I report my numbers on Monday morning, and given where our benchmarking is sitting now, it could determine whether we invest in a Tesla-based cluster or a traditional cluster (!)

Thank you for any help you can provide!
Daniel

Nov 7 2009, 04:03 PM | Post #2
Group: Members | Posts: 176 | Joined: 30-July 08 | Member No.: 113,716
Doesn't sound real to me - are you sure the profiler ran through to completion and didn't hit the profiler time limit? That's the only case that I can think of where the profiler would appear to run faster. I think the default time limit is 30 seconds in the visual profiler. I don't think there is one with the command line profiler.
Thinking about it, I don't think I've ever seen the output from a profile run giving different results. I suppose it could happen in the case of a race condition which the different runtime settings might expose. I'm rather surprised you have no way of validating your output. I suppose you could sanity check the FLOPS/memory bandwidth to see if you have gone over the theoretical max, but I imagine this might not be possible in your case.
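The FLOPS/bandwidth sanity check suggested above can be sketched roughly as follows. This is only an illustration in plain C++: the function names are made up for this example, and the operation count, byte count, and peak figures all have to be supplied by hand from your own kernel and your card's published specifications.

```cpp
#include <cassert>

// Throughput implied by a measured kernel time.
struct Throughput {
    double gflops;    // billions of floating-point operations per second
    double gbytes_s;  // billions of bytes moved per second
};

// flop_count and bytes_moved are hand-counted minimums for one kernel run;
// seconds is the time reported for that run. (Hypothetical helper.)
Throughput implied_throughput(double flop_count, double bytes_moved, double seconds) {
    assert(seconds > 0.0);
    return Throughput{flop_count / seconds / 1e9, bytes_moved / seconds / 1e9};
}

// True when the timing implies more work per second than the stated hardware
// peaks allow, which means the timing or the work count must be wrong.
bool timing_is_implausible(const Throughput& t, double peak_gflops, double peak_gbytes_s) {
    return t.gflops > peak_gflops || t.gbytes_s > peak_gbytes_s;
}
```

If the profiler's time for a kernel makes `timing_is_implausible` return true while your own timing does not, that would be a strong hint that the faster number is not real. A result under the peaks does not prove the timing is right; it only fails to rule it out.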

Nov 7 2009, 04:19 PM | Post #3
Group: Members | Posts: 814 | Joined: 1-April 09 | Member No.: 148,556
I would agree with all that. In my experience, code run under the profiler is usually about 15-25% slower than run normally (note the 30 second default cutoff in the visual profiler). If your timings are real, and I don't think they are, then the only possible thing I can think of is that the profiler only instruments a single multiprocessor and then scales the results. If your code structure is such that the performance from block to block can vary wildly, it might be that the analysis could be skewed in some way.
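To make the single-multiprocessor idea concrete, here is a toy model in plain C++. It is not how the profiler is known to work: the round-robin block assignment, the function names, and the cost numbers are all assumptions chosen only to show how watching one multiprocessor could misreport a kernel whose blocks vary in cost.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (assumption): block i runs on multiprocessor (i % num_sms),
// and each multiprocessor runs its blocks back to back.
std::vector<double> per_sm_time(const std::vector<double>& block_cost, std::size_t num_sms) {
    assert(num_sms > 0);
    std::vector<double> sm_time(num_sms, 0.0);
    for (std::size_t i = 0; i < block_cost.size(); ++i) {
        sm_time[i % num_sms] += block_cost[i];
    }
    return sm_time;
}

// The kernel finishes only when the slowest multiprocessor finishes.
double kernel_time(const std::vector<double>& sm_time) {
    assert(!sm_time.empty());
    return *std::max_element(sm_time.begin(), sm_time.end());
}

// What a tool would see if it watched just one multiprocessor.
double sampled_time(const std::vector<double>& sm_time, std::size_t sampled_sm) {
    return sm_time.at(sampled_sm);
}
```

With block costs `{1, 4, 1, 4}` on two multiprocessors, the model gives per-multiprocessor times of 2 and 8: the kernel takes 8, but a tool watching only the first multiprocessor would see 2. With uniform block costs the two numbers agree, which fits the point that only uneven block-to-block performance would skew the result.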
BTW, I think you are being pretty optimistic posting a plea for help at what is probably late on Friday evening in the US and expecting a reply for a Monday deadline...

This post has been edited by avidday: Nov 7 2009, 04:20 PM

Nov 7 2009, 04:51 PM | Post #4
Group: Members | Posts: 176 | Joined: 30-July 08 | Member No.: 113,716
Sorry - misread that a bit. I had thought that you were measuring whole program execution time; however, it seems you are measuring individual kernels. I'm still not quite clear what you mean.
If you're saying that the profile results are different from your own timings then I'm not sure. I don't know if the profiler targets one specific multiprocessor when timing kernels or not. I would trust individual times from the profiler more than your own timings. Or are you saying that your own timings (not from the profiler) around each kernel call report a two-fold increase in speed when run under the profiler? This would be different, and would suggest (again) that your timing logic is flawed. Either way, program execution time seems a superior metric.
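As a rough illustration of measuring execution time from the host side, a wall-clock helper in standard C++ might look like the sketch below. The name `wall_seconds` is made up for this example. One general caveat: CUDA kernel launches return to the host before the kernel has finished, so the timed callable has to wait for the GPU work to complete, otherwise only the launch is measured. Whether that is what went wrong in the original poster's timers is not known from this thread.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs work() once and returns the elapsed wall-clock time in seconds.
// For GPU code, work() must block until the device has finished
// (hypothetical helper, host-side timing only).
double wall_seconds(const std::function<void()>& work) {
    assert(static_cast<bool>(work));  // reject an empty std::function
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Wrapping the whole simulation run in one call gives the program-level number, and running the same binary with and without the profiler attached makes the two cases directly comparable.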

Nov 8 2009, 10:18 PM | Post #5
Group: Members | Posts: 174 | Joined: 13-December 04 | Member No.: 1,673
This isn't exactly a new problem. I reported similar behaviour (much faster execution timings in the visual profiler) not long after the first visual profiler was released...
Still not sure what the cause of the problem is, but needless to say this issue + the other bugs in the visual profiler (e.g. incorrect and/or missing counters) mean I don't use this tool anymore.

Nov 9 2009, 01:30 AM | Post #6
Group: NVIDIA Employees | Posts: 2,073 | Joined: 3-June 08 | From: Santa Clara, CA | Member No.: 106,363 | Org.: NVIDIA
If you're looking at GPU timings only, keep in mind that they hide additional sources of driver/OS overhead. (Of course, if that's a factor of 2 difference, your app is not very well optimized.)

Nov 9 2009, 09:52 AM | Post #7
Group: Members | Posts: 66 | Joined: 10-September 09 | Member No.: 192,528
I'm quite new to CUDA, but so far I've written and tested about 10 different CUDA-based programs, and I can tell you that my timers and the profiler have so far been totally accurate.
What sometimes happens is that I've got some piece of code like
CODE
system("pause");
which keeps the profiler from finishing, so it aborts after 30 s.
Btw, I wouldn't be too quick about making a decision like that. I would simply tell them I need more time, just be honest.
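One way to keep a leftover pause from stalling a profiler run is to make it opt-in, sketched below in C++. The environment variable name `MYAPP_PAUSE_ON_EXIT` and both function names are hypothetical, chosen only for this example.

```cpp
#include <cstdlib>
#include <cstring>

// Decides whether to pause, given the value of an opt-in environment
// variable (nullptr when the variable is unset).
bool should_pause(const char* env_value) {
    return env_value != nullptr && std::strcmp(env_value, "1") == 0;
}

// Call at the end of main(). MYAPP_PAUSE_ON_EXIT is a made-up name.
void pause_if_requested() {
    if (should_pause(std::getenv("MYAPP_PAUSE_ON_EXIT"))) {
        std::system("pause");  // Windows shell command, as in the post above
    }
}
```

With this arrangement the program exits on its own unless the variable is set to `1`, so a profiler run started without it should not sit waiting for a keypress until the 30 s limit.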

Copyright 2008 NVIDIA Corporation.