Fix CUDA inter-stream synchronization issue

With the introduction of multiple hardware queues in NVIDIA GPUs of
compute capability 3.5 and later, the implicit dependency between tasks
in the local and non-local streams was eliminated. However, the
misc_ops_done event that the non-local stream synchronizes with preceded
the local coordinate transfer, so waiting on it did not guarantee that
the transfer had finished. As a result, even though the tasks in the
local stream are always issued first, under rare circumstances the
non-local kernel could start before the local coordinate transfer
completed. This would lead to non-local interactions being calculated
using coordinates (and charges) from the previous step.

This change moves the synchronization point so that it creates a
dependency between the local coordinate transfer and the non-local
non-bonded kernel.
Change-Id: I0b3837d46db6469f6b1d9869a3a73b5176d93d99