Allow x D2H to overlap with GPU force compute

With GPU update, coordinates are transferred back to the CPU every step
if there are forces to compute on the CPU. Originally this was
implemented with a back-to-back transfer launch and wait at the
beginning of do_force().

This change moves the CPU wait for the completion of the coordinate
transfer closer to the consumer tasks, so that the wait no longer
blocks the launch of GPU force tasks and compute and transfer can
overlap.
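The reordering can be illustrated with a minimal host-only sketch. It
is an analogy, not the actual do_force() code: std::async stands in for
the asynchronous D2H copy and the GPU force launch, and every function
name below is hypothetical.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the asynchronous coordinate D2H copy.
std::future<std::vector<float>> launchCoordinateCopy(const std::vector<float>& deviceX)
{
    return std::async(std::launch::async, [deviceX] { return deviceX; });
}

// Hypothetical stand-in for the GPU force tasks; they do not need the host copy of x.
std::future<float> launchGpuForces(const std::vector<float>& deviceX)
{
    return std::async(std::launch::async, [deviceX] {
        return std::accumulate(deviceX.begin(), deviceX.end(), 0.0F);
    });
}

// Hypothetical CPU force task: the consumer of the host-side coordinates.
float computeCpuForces(const std::vector<float>& hostX)
{
    return 2.0F * std::accumulate(hostX.begin(), hostX.end(), 0.0F);
}

float doForceSketch(const std::vector<float>& deviceX)
{
    // Launch the transfer, but do not wait for it here.
    auto xCopy = launchCoordinateCopy(deviceX);

    // The GPU force work is launched without blocking on the copy.
    auto gpuForces = launchGpuForces(deviceX);

    // Wait only at the point where the CPU consumer needs the coordinates.
    const std::vector<float> hostX = xCopy.get();
    const float cpuResult = computeCpuForces(hostX);

    return cpuResult + gpuForces.get();
}
```

In the old ordering, the equivalent of xCopy.get() would sit directly
after launchCoordinateCopy(), so nothing else could be launched until
the copy had finished.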
Fixes #3221

Change-Id: Ia6641147bbec1186b54c1445d36dc31000eae9c4