Make the wait on nonbonded GPU results conditional
When the force reduction is done on the GPU and no energy or shift
force outputs are required, the CPU does not need to block and wait
until the GPU nonbonded kernels complete.
This change makes the wait conditional on whether any nonbonded force,
energy or shift force outputs are required, so the blocking wait is now
skipped on force-only steps when GPU buffer ops are used.
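The condition described above can be sketched roughly as follows. This
is a minimal illustration, not the code in this change: the struct, its
fields and the function name are hypothetical stand-ins for the actual
step flags, and it assumes that forces are only needed on the CPU when
GPU buffer ops are not in use.

```cpp
// Hypothetical description of what one MD step requires from the
// nonbonded GPU task; these are NOT the actual GROMACS types or names.
struct StepRequirements
{
    bool useGpuBufferOps;    // force reduction is done on the GPU
    bool computeEnergy;      // energy output is required this step
    bool computeShiftForces; // shift force output is required this step
};

// Returns true when the CPU has to block until the nonbonded GPU
// kernels complete. Assumption: without GPU buffer ops the forces are
// reduced on the CPU, so the wait is always needed in that case.
constexpr bool mustWaitForNonbondedGpu(const StepRequirements& step)
{
    return !step.useGpuBufferOps || step.computeEnergy || step.computeShiftForces;
}
```

On a force-only step with GPU buffer ops, this predicate is false and
the blocking wait can be skipped; any energy or shift force output
turns it back on.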
Also removes the now-unnecessary boolean argument passed to
gpu_launch_cpyback().
Refs #3128
Change-Id: Ic1285f5a00ac910cd1d6c4358f41f2c7c41dea4c