Unify coordinate copy handling across GPU platforms
Instead of handling the OpenCL host to device copy as a special case
where this operation is issued in the PME stream, as the "Update" stream
is anyway created unconditionally it makes sense to simplify codepaths
at the negligible cost of an extra GPU-side synchronization.
This allows unifying the coordinate handling in stateGpu as well as in
PME spread (with combined PP+PME) which removes some complexity and makes
it easier to reason about event consumption, reducing the need for balancing
workarounds.
The change also introduces a data member in StatePropagatorDataGpu to
indicate whether it is use on a separate PME rank, case in which marking
of the coordinate H2D event can be skipped as no event dependency is
required.
Refs #3988