Merge common nbnxn CUDA/OpenCL GPU wait code-paths