Jul 8 2009, 10:04 PM
Post
#1
|
|
Group: NVIDIA Employees Posts: 2,080 Joined: 3-June 08 From: Santa Clara, CA Member No.: 106,363 Org.: NVIDIA
We've just released the CUDA C Programming Best Practices Guide. This guide is designed to help developers programming for the CUDA architecture using C with CUDA extensions implement high performance parallel algorithms and understand best practices for GPU Computing. Chapters on the following topics and more are included in the guide:
|
|
|
|
Jul 8 2009, 10:13 PM
Post
#2
|
|
Group: Members Posts: 825 Joined: 1-April 09 Member No.: 148,556
Thanks a bunch for this, it is much more accessible and concise than the other documentation.
|
|
|
|
Jul 8 2009, 10:38 PM
Post
#3
|
|
Group: Extranet Users Posts: 110 Joined: 2-April 08 Member No.: 98,683
The email that announced this to registered developers had some confidentiality language in it, but given that tmurray posted the link to the PDF, I believe I am ok to comment (the email actually invited comments and suggestions for improvement in the announcements forum, but this is a better thread).
In general, this is excellent. Terrific job. Unfortunately, this document contains less information than the various tutorial slides NVIDIA employees have posted in one place or another; see for instance http://gpgpu.org/isc2009 and http://www.cse.unsw.edu.au/~pls/cuda-workshop09/
One thing I particularly dislike is that the first code example in the document contrasts the driver API and the runtime API (which, btw, is now called low level C interface and high-level C++ interface in the progguide and the reference manual version 2.2). This document still does not lower the CUDA entry barrier. If I were starting with CUDA now, I'd ignore all the text in the document, fast-forward to the first actual code example, copy and paste it into an editor, and get my hands dirty. That is pretty much orthogonal to the approach taken in the "best practice guide". This is why we took the simple stupid axpy kernel and turned it into some standalone code and a Makefile (VS solution) when designing the CUDA section on gpgpu.org (check out the "minimalistic CUDA tutorial" at http://gpgpu.org/developer/cuda#code-tutorials).
The SDK is nice, no doubt about it, but it almost certainly discourages newbies. There is no documentation on where to start, and the SDK build system (in my opinion) discourages proper CUDA-CPU comparison practice (and is a pain in the butt when integrating CUDA into a moderately complex existing code base). Code up a hack that gives the same result, call it "gold", time it and publish 1000x speedups.
Now here's some constructive criticism: Launch a book. "CUDA-GEMS" is a tentative title based on the successful GPU-GEMS series. Volume 1 would be a collection of the current SDK whitepapers and examples, presented in a way that does not artificially steepen the learning curve like the current SDK does. This probably requires code duplication, but it is well worth it.
If you don't go for a book, publish something like the axpy example, the reduction whitepaper and something on scan prominently on the CUDA web page. This should, if backed up with some single-file code, lower the learning curve.
Sorry for the rant, dom
This post has been edited by Dominik Göddeke: Jul 8 2009, 10:42 PM
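For readers who have not seen one, here is a minimal sketch of the kind of single-file axpy example the post argues for. It is not the gpgpu.org tutorial code: the names `saxpy` and `run_saxpy_check`, the block size of 256 and the test values are all assumptions made for illustration. It also uses a real CPU loop as the reference, in the spirit of the remark about "gold" comparisons.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

// y[i] = a * x[i] + y[i], one element per thread.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // the last block may be only partly used
        y[i] = a * x[i] + y[i];
}

// Hypothetical helper: runs saxpy on the GPU and compares every element
// against a plain CPU loop. Returns true if all elements agree.
bool run_saxpy_check(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *hx = (float *)malloc(bytes);
    float *hy = (float *)malloc(bytes);
    float *href = (float *)malloc(bytes);
    const float a = 2.0f;
    for (int i = 0; i < n; ++i) {
        hx[i] = (float)i;
        hy[i] = 1.0f;
        href[i] = a * hx[i] + hy[i];    // CPU reference result
    }

    float *dx = NULL, *dy = NULL;
    bool ok = cudaMalloc((void **)&dx, bytes) == cudaSuccess &&
              cudaMalloc((void **)&dy, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

        int threads = 256;                          // 1D blocks
        int blocks = (n + threads - 1) / threads;   // 1D grid, rounded up
        saxpy<<<blocks, threads>>>(n, a, dx, dy);

        // Blocking copy: it also waits for the kernel to finish.
        ok = cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i)
            ok = fabsf(hy[i] - href[i]) <= 1e-5f * fabsf(href[i]);
    }

    cudaFree(dx);
    cudaFree(dy);
    free(hx);
    free(hy);
    free(href);
    return ok;
}

int main(void)
{
    bool ok = run_saxpy_check(1 << 20);
    printf("saxpy %s\n", ok ? "PASSED" : "FAILED");
    return ok ? 0 : 1;
}
```

Something like `nvcc saxpy.cu -o saxpy` should be enough to build it; no SDK build system is involved.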
|
|
|
Jul 8 2009, 10:41 PM
Post
#4
|
|
Group: NVIDIA Employees Posts: 2,080 Joined: 3-June 08 From: Santa Clara, CA Member No.: 106,363 Org.: NVIDIA
Dom, don't worry. We are certainly aware that the current documentation can be discouraging to new users, and we're working on various things to correct this.
|
|
|
|
Jul 8 2009, 10:57 PM
Post
#5
|
|
Group: Extranet Users Posts: 110 Joined: 2-April 08 Member No.: 98,683
Given that I'm occasionally teaching CUDA, I have to worry.
In my experience, a very simple "what should I read to get started" guide, posted *very* prominently on the CUDA web page, would do the trick. You have all the material ready! Currently I refer people to the slides Mark used at his UNSW workshop (link in my previous post), which in turn reference whitepapers in the SDK. The 2.2 "Quickstart" document is a good start actually, but it essentially just steps through verifying that installing the toolkit and building the SDK worked well. Add a chapter to that document on how to continue now that the most primitive SDK example runs fine, and I'll shut up.
Keep up the good work! All I am complaining about is that the current state of the CUDA documentation is not optimal for self-study. If you know where to look for conference tutorials, you are fine.
This post has been edited by Dominik Göddeke: Jul 8 2009, 10:58 PM
|
|
|
Jul 9 2009, 08:42 AM
Post
#6
|
|
Group: Members Posts: 145 Joined: 6-September 08 From: Stockholm Member No.: 118,162
There isn't anything on fixed point:
Is the throughput for fixed-point integer mul similar to integer mul with 24-bit operands?
"Filtering: Valid only if the texture reference returns floating-point data". This is yet another missed opportunity to point out that float4 is also a floating-point datatype - possibly the most efficient one in this context.
Section "1.1.1 Differences Between Host and Device" ignores that the CPU is not scalar but has a 4-element vector unit - which some smartass will always be keen to point out, diverting the discussion ...
This post has been edited by jma: Jul 9 2009, 09:05 AM
-------------------- Waiting for forums.nvidia.com...
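For context, the 24-bit integer multiply mentioned in the question is exposed in CUDA as the `__mul24` (and `__umul24`) intrinsics. The sketch below only shows that `__mul24` gives the same result as the ordinary multiply when both operands fit in 24 bits; it says nothing about throughput, which differs between GPU generations and is exactly what the post asks the guide to document. The kernel and helper names and the operand values are assumptions for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Computes each product twice: with the ordinary 32-bit multiply and with
// the __mul24 intrinsic, which uses only the low 24 bits of each operand.
__global__ void mul_both(const int *a, const int *b, int *full, int *fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        full[i] = a[i] * b[i];
        fast[i] = __mul24(a[i], b[i]);
    }
}

// Hypothetical helper: returns true if both multiplies agree with the CPU
// for small non-negative operands (all well below 2^24).
bool run_mul24_check(void)
{
    const int n = 1024;
    static int ha[n], hb[n], hfull[n], hfast[n];
    for (int i = 0; i < n; ++i) {
        ha[i] = i;
        hb[i] = 3 * i + 7;
    }

    size_t bytes = n * sizeof(int);
    int *d = NULL;                       // one allocation holding four arrays
    if (cudaMalloc((void **)&d, 4 * bytes) != cudaSuccess)
        return false;
    int *da = d, *db = d + n, *dfull = d + 2 * n, *dfast = d + 3 * n;

    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);
    mul_both<<<(n + 255) / 256, 256>>>(da, db, dfull, dfast, n);

    bool ok = cudaMemcpy(hfull, dfull, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
              cudaMemcpy(hfast, dfast, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < n; ++i)
        ok = hfull[i] == ha[i] * hb[i] && hfast[i] == hfull[i];

    cudaFree(d);
    return ok;
}

int main(void)
{
    bool ok = run_mul24_check();
    printf("mul24 check %s\n", ok ? "PASSED" : "FAILED");
    return ok ? 0 : 1;
}
```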
|
|
|
|
Jul 9 2009, 02:32 PM
Post
#7
|
|
Group: Members Posts: 324 Joined: 14-September 06 Member No.: 26,125
Looking good so far
I'm no expert on the driver API, but I believe that on page 9 cuDeviceGet(&hContext,0); should be replaced by: cuDeviceGet(&hDevice,0);
N.
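The correction matches the driver API's types: `cuDeviceGet` fills in a `CUdevice` handle, and a `CUcontext` is only created from that handle afterwards. A minimal sketch of the device lookup, assuming device ordinal 0 exists; `hDevice` follows the naming in the post, while `query_first_device` is a hypothetical helper name. Context creation is deliberately left out here.

```cuda
#include <cstdio>
#include <cuda.h>   // driver API header; link with -lcuda

// Initializes the driver API and looks up device 0.
// Returns true on success and writes the device name into `name`.
bool query_first_device(char *name, int len)
{
    CUdevice hDevice;   // a device handle, not a context

    if (cuInit(0) != CUDA_SUCCESS)
        return false;
    if (cuDeviceGet(&hDevice, 0) != CUDA_SUCCESS)   // takes a CUdevice*
        return false;
    return cuDeviceGetName(name, len, hDevice) == CUDA_SUCCESS;
}

int main(void)
{
    char name[256];
    if (!query_first_device(name, (int)sizeof(name))) {
        printf("no usable CUDA device found\n");
        return 1;
    }
    printf("device 0: %s\n", name);
    return 0;
}
```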
|
|
|
Jul 10 2009, 01:47 AM
Post
#8
|
|
Group: Members Posts: 3 Joined: 15-November 08 From: Santa Clara, CA Member No.: 126,037 Org.: NVIDIA
... One thing I particularly dislike is that the first code example in the document contrasts the driver API and the runtime API (which, btw, is now called low level C interface and high-level C++ interface in the progguide and the reference manual version 2.2). ...
Low-level and High-level C++ refer to the different types of functionality available in the Runtime API, which can be used in a strictly C setting (low-level functions only), or a mixed C/C++ setting (low and high level functions can all be used). The Driver API provides only C entry points. In reality, the high-level C++ API is just a bunch of convenience wrappers for templating some of the low-level Runtime API functions.
|
|
|
Jul 13 2009, 09:48 PM
Post
#9
|
|
Group: Members Posts: 503 Joined: 4-March 07 Member No.: 43,741 Org.: NVIDIA
Given that I'm occasionally teaching CUDA, I have to worry In my experience, a very simple "what should I read to get started" guide, posted *very* prominently on the CUDA web page, would do the trick. You have all the material ready! ...
Would the following slides cover the need you describe? These are the slides used in the basic CUDA webinar.
Attached File(s)
|
|
|
|
Jul 13 2009, 10:16 PM
Post
#10
|
|
Group: Members Posts: 127 Joined: 2-June 08 Member No.: 106,229 Org.: Delft University of Technology
|
|
|
|
Jul 13 2009, 11:38 PM
Post
#11
|
|
Group: Extranet Users Posts: 110 Joined: 2-April 08 Member No.: 98,683
Would the following slides cover the need you describe? These are the slides used in the basic CUDA webinar.
This is the best and most concise "newbie presentation" I've seen to date. Excellent! Some minor suggestions for improvement:
- any particular reason why integers are used instead of floats in the example? I am still under the impression that most people compute on FP data
- page 16: these are maximum dimensions. I believe the reader will benefit from learning, here already, that say a 1D grid of 1D blocks is perfectly fine. You actually say this ("up to...") on page 20.
- page 17: maybe add the restriction that kernels can't allocate device memory
- page 22 is just excellent, there is simply no way to convey more information!
- typo Walkthruogh on page 23
- slide 31: I always was under the impression that while events are indeed clock-cycle accurate, cudaEventElapsedTime() only gives a maximum granularity of ~0.5ms because it returns a float. I believe all walkthroughs in this presentation would not give meaningful timings when "benchmarked" this way.
- slide 32: one CPU thread can control several GPUs at the price of context switching (cudaSetDevice() called repeatedly)
- slide 34: lifetime: kernel
- slide 38: occasionally you write "threadblock" and occasionally "thread-block"
- slide n-1 (before the marketing starts
Enough nitpicking, these slides are excellent, no doubt about that, and I am 100% sure that making them available prominently on the CUDA web page would help a great deal.
dom
|
|
|
Jul 14 2009, 03:31 AM
Post
#12
|
|
Group: Members Posts: 503 Joined: 4-March 07 Member No.: 43,741 Org.: NVIDIA
Good feedback, thanks. I've made a few of the suggested additions. I'll check on the resolution returned by cudaEventElapsedTime. You're right, all FD computation is at least SP FP. I used integers purely for convenience - it is much easier to print values concisely to the console and check arithmetic results when time is very limited (the walkthroughs were used for the hands-on portion of training sessions, where everyone would code from scratch (including me on the projector), instead of looking at the finished code). I think these do get posted somewhere on the webinar portion of the website, I'll inquire about making them more prominent - the intent was to get someone coding CUDA from scratch in a short amount of time.
Paulius
|
|
|
Jul 14 2009, 03:38 AM
Post
#13
|
|
Group: Members Posts: 503 Joined: 4-March 07 Member No.: 43,741 Org.: NVIDIA
Read the cudaEventElapsedTime description in the reference - resolution is ~0.5us (microseconds). Not quite clock period grain, but still pretty good.
Paulius
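A minimal sketch of event-based timing with `cudaEventElapsedTime`, which reports milliseconds as a float. To sidestep the concern raised earlier about very short kernels, it times a batch of launches and divides by the count; that averaging is an assumption of this sketch, not something taken from the slides. The `scale` kernel and the `time_kernel_ms` helper are hypothetical names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used only as something to time.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Hypothetical helper: returns the average time per launch in milliseconds,
// or -1.0f if something failed.
float time_kernel_ms(int n, int repeats)
{
    float *d = NULL;
    if (cudaMalloc((void **)&d, (size_t)n * sizeof(float)) != cudaSuccess)
        return -1.0f;
    cudaMemset(d, 0, (size_t)n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d, n, 1.0f);     // warm-up launch, not timed

    cudaEventRecord(start, 0);
    for (int r = 0; r < repeats; ++r)
        scale<<<blocks, threads>>>(d, n, 1.0f);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                 // wait until `stop` has been reached

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return err == cudaSuccess ? ms / repeats : -1.0f;
}

int main(void)
{
    float ms = time_kernel_ms(1 << 20, 100);
    printf("average time per launch: %f ms\n", ms);
    return ms > 0.0f ? 0 : 1;
}
```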
|
|
|
Jul 14 2009, 07:40 AM
Post
#14
|
|
Group: Members Posts: 357 Joined: 12-September 08 Member No.: 118,861
Would the following slides cover the need you describe? These are the slides used in the basic CUDA webinar.
Hi, Great basic tutorial - I think it's evident from the forums that many new users need it. A few suggestions though:
- Personally I think a PDF/HTML is far better than PPT.
- An explanation of emulation mode and the difference between emulation and release is something many new users fail to understand.
- In the samples you put cudaMemcpy after the kernel invocation - many people fail to understand that cudaMemcpy will implicitly call cudaThreadSynchronize, and therefore you see code that calls kernels and doesn't synchronize correctly. Maybe a description of implicit and explicit synchronization should be added as well. Page 29 talks about it, but there is no code sample showing how/where to use it.
- Doubles vs floats - arch sm_XX is also something new users don't take into account.
- More about why a kernel would fail and how to see what's causing it. People run kernels (which fail because of too many resources or access violations) and think that after ten minutes of coding they've achieved a x1000 performance boost. Users should understand how to check for errors. Page 30 addresses this a bit; I think it can be extended, as this is one of the most common pitfalls for new users.
- Some more info maybe on kernel resources: register pressure and how to see the kernel resource usage: --ptxas-options="-v -mem"
- Differences between shared memory and global memory - people think that to boost the application they simply need to use shared memory instead of global memory. Sometimes people fail to understand that it's not just a matter of choosing the memory to use; you need to understand how to load data, sync it and use shared memory wisely in order to gain performance.
- I would also suggest that people get familiar with threading issues on the CPU before coding for the GPU. People who don't understand CPU threads, synchronization issues, data dependency et al will never be able to use GPUs correctly.
- Maybe add some "nVidia methodology" as to how to find the bottlenecks, debug (for example on Windows without a debugger), reduce resource pressure and stuff like that. I know I'd like to hear what nVidia thinks.
- Maybe mention the dead-code optimizer. People sometimes don't understand that the kernel was optimized out and think that the kernel gave a x1000 boost.
I understand that some of those issues might add some more pages, but I think that those (along with what the document already addresses and what Dominik wrote) are the most common issues and misunderstandings new users are facing.
my 1 cent, eyal
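The point about failed launches being mistaken for huge speedups can be illustrated with a short sketch. It checks `cudaGetLastError()` right after the launch and then synchronizes explicitly. Note that `cudaThreadSynchronize`, named in the post, was later deprecated in favor of `cudaDeviceSynchronize`, which is what this sketch assumes is available. The `fill` kernel and the `launch_and_check` helper are hypothetical names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Writes `value` into every element of `out`.
__global__ void fill(int *out, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = value;
}

// Hypothetical helper: launches `fill` with the given (positive) block size
// and returns the first error seen, or cudaSuccess.
cudaError_t launch_and_check(int threadsPerBlock)
{
    const int n = 1024;
    int *d = NULL;
    cudaError_t err = cudaMalloc((void **)&d, n * sizeof(int));
    if (err != cudaSuccess)
        return err;

    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    fill<<<blocks, threadsPerBlock>>>(d, n, 42);

    err = cudaGetLastError();           // catches bad launch configurations
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();  // catches errors raised while the kernel ran

    cudaFree(d);
    return err;
}

int main(void)
{
    // A sane block size should succeed.
    printf("256 threads/block:   %s\n", cudaGetErrorString(launch_and_check(256)));
    // Far more threads per block than any device allows: the kernel never
    // runs, and without the check the program would simply appear very fast.
    printf("1<<20 threads/block: %s\n", cudaGetErrorString(launch_and_check(1 << 20)));
    return 0;
}
```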
|
|
|
Jul 14 2009, 07:00 PM
Post
#15
|
|
Group: Members Posts: 503 Joined: 4-March 07 Member No.: 43,741 Org.: NVIDIA
Thanks for the comments. I think most of your requests fall into the optimization category (the slides above are intended as a minimal basic starter). There are separate presentations (and webinars) on CUDA optimization techniques. And, of course, as the forum-thread title suggests, there's the best practices guide that covers the issues in more detail.
Paulius
|
|
|
Jul 19 2009, 02:43 PM
Post
#16
|
|
Group: Members Posts: 61 Joined: 23-May 09 From: Portugal Member No.: 156,206
This certainly looks good and covers important subjects for users who are taking CUDA to the next level.
Thanks for this, nice job! |
|
|
|
Jul 20 2009, 08:55 PM
Post
#17
|
|
Group: Members Posts: 118 Joined: 12-May 08 From: Montreal, Quebec (Canada) Member No.: 103,628
Thanks for the document, it was worth reading. Although everything in it is already in the programmer's guide and reference documentation, it focuses on obtaining fast results :-)
-------------------- GeForce 9400M, 9600M GT & 8800 GTS, Mac OS X | Linux | Windows
CudaChess.com Open-source Chess Engine using CUDA-enabled GPU, and more ... |
|
|
|
| Copyright 2008 NVIDIA Corporation. Terms of Use | Legal Info | Privacy Policy | Time is now: 25th November 2009 - 02:27 AM |