-constraint calculation on a CUDA-compatible GPU. This allows to having all (compatible)
-parts of a simulation step on the GPU, so that no unnecessary transfers are needed between
-GPU and CPU. This currently only works with single domain cases, and needs to be explicitly
-requested by the user. It is possible to change the default behaviour by setting the
+constraint calculation on a CUDA-compatible GPU. This allows executing all
+(supported) computations of a simulation step on the GPU.
+This feature is supported in single domain runs (unless using the experimental
+GPU domain decomposition feature), and needs to be explicitly requested by the user.
+This is a new parallelization mode where all force and coordinate
+data can be "GPU resident" for a number of steps, typically between neighbor searching steps.
+This has the benefit that there is less coupling between the CPU host and the GPU, and
+on typical MD steps data does not need to be transferred between them.
+In this scheme it is, however, still possible for part of the computation to be
+executed on the CPU concurrently with GPU calculation.
+This helps support the broad range of |Gromacs| features, not all of which are
+ported to GPUs. At the same time, it also allows improving performance by making
+use of the otherwise mostly idle CPU. It can often be advantageous to move the bonded
+or PME calculation back to the CPU, but the details of this will depend on the
+relative performance of the CPU cores paired with a GPU in a simulation.
+
+It is possible to change the default behaviour by setting the