Unfortunately, the asynchronous launch of GPU force buffer clearing can
take up to 2% of the total run time when iteration times are short and
there are many or fast cores per GPU. Timing it will at least remove it
from the "Rest".
Change-Id: I397c563ead24d87181de1b03879f164d1a97c2ca
wallcycle_stop(wcycle,ewcWAIT_GPU_NB_L);
/* now clear the GPU outputs while we finish the step on the CPU */
+
+ wallcycle_start_nocount(wcycle,ewcLAUNCH_GPU_NB);
nbnxn_cuda_clear_outputs(nbv->cu_nbv, flags);
+ wallcycle_stop(wcycle,ewcLAUNCH_GPU_NB);
}
else
{