performance - How do I choose grid and block dimensions for CUDA kernels?


This is a question about how to determine the CUDA grid, block, and thread sizes. It is an additional question to the one posted here:

https://stackoverflow.com/a/5643838/1292251

Following that link, the answer from talonmies contains a code snippet (see below). I don't understand the comment "value usually chosen by tuning and hardware constraints".

I haven't found a good explanation or clarification of this in the CUDA documentation. In summary, my question is how to determine the optimal blocksize (= number of threads) given the following code:

    const int n = 128 * 1024;
    int blocksize = 512;          // value usually chosen by tuning and hardware constraints
    int nblocks = n / blocksize;  // value determined by block size and total work
    madd<<<nblocks, blocksize>>>(a, b, c, n);
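Note that this snippet assumes n is an exact multiple of blocksize (here 128 * 1024 / 512 divides evenly). For arbitrary n, a common pattern is to round the grid size up and bounds-check inside the kernel; a minimal sketch follows, with an assumed element-wise body for madd, since the question does not show the actual kernel:

    // madd's body is assumed here (element-wise add over n values);
    // the question does not show the real kernel.
    __global__ void madd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // the last block may have spare threads
            c[i] = a[i] + b[i];
    }

    // Round the grid size up so every element is covered even when
    // n is not a multiple of blocksize:
    int nblocks = (n + blocksize - 1) / blocksize;
    madd<<<nblocks, blocksize>>>(a, b, c, n);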

BTW, I started from the question linked above because it partly answers my first question. If this is not the proper way to ask questions on Stack Overflow, please excuse me or advise me.

There are two parts to this answer (I wrote it). One part is easy to quantify, the other is more empirical.

Hardware constraints:

This is the easy to quantify part. Appendix F of the current CUDA programming guide lists a number of hard limits which restrict how many threads per block a kernel launch can have. If you exceed any of these, your kernel will never run. They can be roughly summarized as:

  1. Each block cannot have more than 512/1024 threads in total (Compute Capability 1.x, or 2.x-3.x respectively)
  2. The maximum dimensions of each block are limited to [512,512,64]/[1024,1024,64] (Compute 1.x/2.x)
  3. Each block cannot consume more than 8k/16k/32k registers in total (Compute 1.0,1.1/1.2,1.3/2.x)
  4. Each block cannot consume more than 16kb/48kb of shared memory (Compute 1.x/2.x)

If you stay within those limits, any kernel you can successfully compile will launch without error.
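If you would rather query these limits at runtime than look them up in the appendix, the standard cudaGetDeviceProperties call exposes them; a minimal sketch (error checking omitted for brevity):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0

        printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
        printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
        printf("Max block dimensions:    [%d, %d, %d]\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Registers per block:     %d\n", prop.regsPerBlock);
        printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        return 0;
    }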

Performance tuning:

This is the empirical part. The number of threads per block you choose within the hardware constraints outlined above can and does affect the performance of code running on the hardware. How each code behaves is different, and the only real way to quantify it is by careful benchmarking and profiling. But again, roughly summarized:

  1. The number of threads per block should be a round multiple of the warp size, which is 32 on all current hardware.
  2. Each streaming multiprocessor unit on the GPU must have enough active warps to sufficiently hide all of the different memory and instruction pipeline latency of the architecture and achieve maximum throughput. The orthodox approach here is to try to achieve optimal hardware occupancy (what Roger Dahl's answer is referring to).

The second point is a huge topic which I doubt anyone is going to try and cover in a single Stack Overflow answer. There are people writing PhD theses around the quantitative analysis of aspects of the problem (see this presentation by Vasily Volkov from UC Berkeley and this paper by Henry Wong from the University of Toronto for examples of just how complex the question really is).
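Later CUDA releases (6.5 onwards) added an occupancy API that automates the orthodox occupancy-driven choice from point 2; a minimal sketch, assuming the madd kernel from the question:

    #include <cuda_runtime.h>

    __global__ void madd(const float *a, const float *b, float *c, int n);

    void launch_madd(const float *a, const float *b, float *c, int n)
    {
        int minGridSize = 0;  // smallest grid that can reach full occupancy
        int blockSize   = 0;  // block size suggested by the heuristic

        // Ask the runtime which block size maximizes occupancy for this
        // particular kernel on the current device.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, madd);

        int nblocks = (n + blockSize - 1) / blockSize;
        madd<<<nblocks, blockSize>>>(a, b, c, n);
    }

Keep in mind (as Volkov's presentation argues) that maximal occupancy is not always maximal performance, so treat the suggested block size as a starting point for benchmarking rather than a final answer.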

At the entry level, you should mostly be aware that the block size you choose (within the range of legal block sizes defined by the constraints above) can and does have an impact on how fast your code will run, but it depends on the hardware you have and the code you are running. By benchmarking, you will probably find that most non-trivial code has a "sweet spot" in the 128-512 threads per block range, but it will require some analysis on your part to find where that is. The good news is that because you are working in multiples of the warp size, the search space is very finite, and the best configuration for a given piece of code is relatively easy to find.
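To make that search concrete, a brute-force sweep over the finite space might look like the following sketch, using cudaEvent timing; madd and its arguments are as above, and a real benchmark would warm up the device and average several runs rather than timing a single launch:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void madd(const float *a, const float *b, float *c, int n);

    // Time one launch of madd at a given block size, in milliseconds.
    float time_launch(int blocksize, const float *a, const float *b,
                      float *c, int n)
    {
        int nblocks = (n + blocksize - 1) / blocksize;

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        madd<<<nblocks, blocksize>>>(a, b, c, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    // Sweep legal block sizes in warp-size steps and report the fastest.
    void find_sweet_spot(const float *a, const float *b, float *c, int n)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int best_block = 32;
        float best_ms = 1e30f;
        for (int bs = 32; bs <= prop.maxThreadsPerBlock; bs += 32) {
            float ms = time_launch(bs, a, b, c, n);
            if (ms < best_ms) { best_ms = ms; best_block = bs; }
        }
        printf("best block size: %d (%.3f ms)\n", best_block, best_ms);
    }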

