cuda - Avoiding cudaMemcpy in an iterative loop


I am wondering about the best way to do the following in CUDA: imagine you have a long array and want the sum of all its elements to be below 1. If the sum is above 1, you divide every element by 2 and calculate the sum again. The dividing by 2 and the calculation of the sum are both done on the GPU. My question now: what is the best way to check on the CPU side whether the sum is below 1 or not? I could do a cudaMemcpy within every iteration, but I have read (and seen) that it is better to do as few transfers between the two memories as possible. I found dynamic parallelism and thought maybe I could start a kernel with 1 block and 1 thread that runs the while loop and calls the sum and divide kernels, but unfortunately my hardware only has compute capability 3.2, and dynamic parallelism starts with 3.5. Is there any other way, besides doing a cudaMemcpy in every iteration, to tell the CPU when it can stop the while loop?

*The algorithm above is a toy problem to explain my situation (hopefully). The actual algorithm is a Newton-Raphson method, but the question remains valid for any iterative method where I have to decide whether to stop or not based on a value calculated on the GPU.

For compute capability >= 3.5 the answer is, as you correctly identified, dynamic parallelism.
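For illustration, a controller kernel along these lines could run the whole loop on the device. This is a sketch only: sumKernel and halveKernel are hypothetical kernels, the code must be built with -rdc=true, and device-side cudaDeviceSynchronize() is the classic pre-CUDA-12 dynamic-parallelism idiom:

    // Launched as controlKernel<<<1, 1>>>(...): a single thread runs the loop.
    __global__ void controlKernel(float *d_data, float *d_sum, int n)
    {
        do {
            sumKernel<<<64, 256>>>(d_data, d_sum, n); // child grid: compute the sum
            cudaDeviceSynchronize();                  // device-side wait for the child grid
            if (*d_sum >= 1.0f)
                halveKernel<<<64, 256>>>(d_data, n);  // child grid: divide every element by 2
        } while (*d_sum >= 1.0f);
    }

Child grids launched by the same thread into the same stream execute in order, so a single synchronize after the sum is enough before reading *d_sum.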

For compute capability < 3.5 things are less clear cut. There are two options: the first is to look at the latency cost of the memcpy and kernel launch; the second is to use more advanced techniques for finer control over blocks.

Optimising latency

If you are using memcpy, make sure you don't synchronize explicitly before launching the memcpy. If you don't synchronize, much of the overhead associated with the copy can be hidden by the kernel.
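Concretely, for the toy problem the loop could look like the sketch below (sumKernel, halveKernel, and the allocations of d_data and d_sum are assumed). The cudaMemcpy is issued into the same stream as the kernel and waits for it implicitly, so an explicit cudaDeviceSynchronize() beforehand would only add latency:

    float h_sum;
    do {
        sumKernel<<<64, 256>>>(d_data, d_sum, n);
        // No explicit synchronize here: the copy below is queued behind the
        // kernel in the same stream, so the copy's launch overhead overlaps
        // with the kernel's execution.
        cudaMemcpy(&h_sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
        if (h_sum >= 1.0f)
            halveKernel<<<64, 256>>>(d_data, n);
    } while (h_sum >= 1.0f);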

That said, the lowest-latency path I found for this case was using mapped memory: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#mapped-memory. With mapped memory the kernel writes directly into host memory, without you having to explicitly launch a cudaMemcpy.
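A minimal sketch of the mapped-memory variant (error checking omitted; sumKernel and halveKernel hypothetical as before):

    float *h_sum, *d_sum;
    cudaSetDeviceFlags(cudaDeviceMapHost);                              // before the context is created
    cudaHostAlloc((void **)&h_sum, sizeof(float), cudaHostAllocMapped); // pinned + mapped host memory
    cudaHostGetDevicePointer((void **)&d_sum, h_sum, 0);                // device view of the same buffer

    do {
        sumKernel<<<64, 256>>>(d_data, d_sum, n); // writes the sum straight into host memory
        cudaDeviceSynchronize();                  // make sure the write has landed
        if (*h_sum >= 1.0f)
            halveKernel<<<64, 256>>>(d_data, n);
    } while (*h_sum >= 1.0f);

    cudaFreeHost(h_sum);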

Block control

For this problem you don't need global synchronisation, so by being clever you can avoid some trips to the host. In this case I would consider over-subscribing the GPU. If you know you need x blocks to complete one iteration of your problem, consider launching, for example, 5x blocks. Because the order in which blocks launch is undefined, you need to create your own ordering using atomics (atomically incrementing a global integer once per block).
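The ordering part can be as simple as this sketch (blockCounter is assumed to be a device global zeroed, e.g. with cudaMemset, before each launch):

    __device__ unsigned int blockCounter; // reset to 0 before each launch

    __global__ void iterationKernel(/* ... */ unsigned int blocksPerStep)
    {
        __shared__ unsigned int myId;
        if (threadIdx.x == 0)
            myId = atomicAdd(&blockCounter, 1); // logical ID in arrival order
        __syncthreads();

        // The first blocksPerStep blocks to arrive serve step 0, the next
        // batch serves step 1, and so on.
        unsigned int myStep = myId / blocksPerStep;
        // ... rest of the kernel
    }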

With this block ordering you know which blocks are going to take part in the first step of the iteration. Blocks not taking part in the first iteration can wait by spinning on a flag in global memory:

do {
    locked = *((volatile int *)&flag); // volatile load: force a fresh read from memory each pass
} while (locked);

Once the first batch of blocks has completed its operation, and the output has been written to global memory, it can set the flag (make sure you use __threadfence() correctly!), allowing the blocks for the next step to start. These blocks can then either do the next step, or return (after allowing the blocks that depend on them to continue) if the stopping condition has been met.

The net result of this is that you have blocks ready on the GPU, waiting to start. By managing your block ordering you know that each iteration always has enough blocks to complete, so the spinning blocks are released at the right time. The three things you need to get right (a combined sketch follows this list) are:

  1. You manage your own block IDs using atomics.
  2. You load the flag using the volatile keyword, to ensure the correct value is read.
  3. You apply __threadfence() to ensure the output is visible before allowing dependent blocks to continue.
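Putting these together, the over-subscription pattern might look like the following. Everything here is illustrative rather than a drop-in implementation: the counters, the doWork() helper, and the launch configuration are hypothetical, and blocksPerStep must not exceed the number of blocks that can be resident at once, or the spin will deadlock.

    __device__ unsigned int blockCounter; // all three zeroed (cudaMemset) before launch
    __device__ unsigned int currentStep;
    __device__ unsigned int doneCounter;

    __global__ void oversubscribedKernel(float *data, float *sum, int n,
                                         unsigned int blocksPerStep)
    {
        // 1. Manage our own block ordering with atomics (arrival order,
        //    not launch order).
        __shared__ unsigned int myId;
        if (threadIdx.x == 0)
            myId = atomicAdd(&blockCounter, 1);
        __syncthreads();

        unsigned int myStep = myId / blocksPerStep;

        // 2. Spin with a volatile load until our step is released.
        if (threadIdx.x == 0)
            while (*(volatile unsigned int *)&currentStep < myStep)
                ;
        __syncthreads();

        doWork(data, sum, n, myId % blocksPerStep); // hypothetical per-block work

        // 3. Fence, then let the last block of this batch open the gate
        //    for the next step.
        if (threadIdx.x == 0) {
            __threadfence(); // make our global-memory writes visible first
            unsigned int finished = atomicAdd(&doneCounter, 1) + 1;
            if (finished == (myStep + 1) * blocksPerStep)
                atomicAdd(&currentStep, 1);
        }
    }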

Obviously, launching exactly the correct number of blocks is unlikely, so you will have to go back to the host to launch more blocks from time to time. The overhead of launching too many blocks shouldn't be too bad, but it is a risk.

Before you implement this, make sure that the latency cost of the copies is actually resulting in a significant slowdown. The overhead of a copy to host plus a conditional kernel launch should be of the order of 20 microseconds per iteration. This method adds quite a lot of complexity to your code, so be sure you need to save those microseconds!

