Merge release-2021 into master
author    Andrey Alekseenko <al42and@gmail.com>
          Mon, 18 Oct 2021 14:31:36 +0000 (16:31 +0200)
committer Andrey Alekseenko <al42and@gmail.com>
          Mon, 18 Oct 2021 14:39:47 +0000 (16:39 +0200)
Brings the following fixes from the release branch into master:

- #4148 (!1941)
- #4167 (!1966)
- #4190 (!2050)
- !2039

docs/gmxapi/userguide/install.rst
src/gromacs/ewald/pme.cpp
src/gromacs/ewald/pme.h
src/gromacs/mdlib/vsite.cpp
src/gromacs/mdrun/runner.cpp
src/gromacs/selection/evaluate.cpp
src/gromacs/taskassignment/decidegpuusage.cpp
src/gromacs/taskassignment/decidegpuusage.h

diff --combined docs/gmxapi/userguide/install.rst
index f0ab2210fb9ced25c1be0cb28f42833538b96ce8,8588156a48b3d85140459260e93457ec49e99d3f..e1994e9c74b343f80ed9c496eafad0eb6a494693
@@@ -65,7 -65,8 +65,7 @@@ Set up a Python virtual environment
      python3 -m venv $HOME/myvenv
      . $HOME/myvenv/bin/activate
      python -m ensurepip --default-pip
 -    pip install --upgrade pip setuptools
 -    pip install --upgrade cmake scikit-build
 +    pip install --upgrade pip setuptools wheel
  
  .. seealso:: :ref:`gmxapi venv`
  
@@@ -85,7 -86,7 +85,7 @@@ Background
  *gmxapi* comes in three parts:
  
  * GROMACS gmxapi library for C++.
 -* This Python package, supporting Python 3.6 and higher
 +* This Python package, supporting Python 3.7 and higher
  * MD restraint plugins and sample gmxapi client code
  
  GROMACS requirements
@@@ -111,9 -112,9 +111,9 @@@ to the build configuration
  Build system requirements
  -------------------------
  
 -gmxapi can be built for Python 3.6 and higher.
 +gmxapi can be built for Python 3.7 and higher.
  
 -You will need a C++ 14 compatible compiler and a reasonably up-to-date version
 +You will need a C++ 17 compatible compiler and a reasonably up-to-date version
  of CMake.
  Full gmxapi functionality may also require an MPI compiler (e.g. :command:`mpicc`).
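
  For example, a quick way to check what your toolchain provides (a sketch;
  the exact compiler and MPI wrapper commands depend on your system)::

      cmake --version
      mpicc --version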
  
@@@ -143,7 -144,7 +143,7 @@@ that works
  Python environment requirements
  -------------------------------
  
 -gmxapi requires Python 3.6 or higher. Check your version with
 +gmxapi requires Python 3.7 or higher. Check your version with
  :command:`python3 --version` or :command:`python --version`.
  
  ..  note::
      :command:`python3 --version` or :command:`python --version` and :command:`pip --version`.
  
  To build and install, you need the Python packages for
 -cmake_, networkx_, scikit-build_, and setuptools_
 +cmake_, networkx_, and setuptools_
  (all available from `PyPI with pip <https://pip.pypa.io/en/stable/>`_).
  
  For full functionality, you should also have mpi4py_ and numpy_.
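
  For example, a single command covering all of these (a sketch; install only
  what your use case needs)::

      pip install --upgrade cmake networkx setuptools mpi4py numpy
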
@@@ -339,7 -340,7 +339,7 @@@ the remaining examples assume the virtual environment
  Install dependencies
  ^^^^^^^^^^^^^^^^^^^^
  
 -It is always a good idea to update pip_ and setuptools_ before installing
 +It is always a good idea to update pip_, setuptools_, and wheel_ before installing
  new Python packages::
  
      pip install --upgrade pip setuptools wheel
@@@ -349,7 -350,7 +349,7 @@@ they are installed and up to date before
  
  ::
  
 -    pip install --upgrade cmake scikit-build
 +    pip install --upgrade cmake pybind11
  
  For MPI, we use mpi4py_.
  Make sure it is using the same MPI installation that we are building
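
  For example, one common approach (a sketch, following mpi4py's installation
  notes; it assumes :command:`mpicc` is the wrapper for the MPI library you
  will build GROMACS against)::

      MPICC=$(which mpicc) pip install --upgrade --no-cache-dir mpi4py
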
@@@ -431,7 -432,7 +431,7 @@@ without internet access, either from th
  
  For example, the last line of the previous example could be replaced with::
  
 -    pip install --no-cache-dir --no-deps --no-index --no-build-isolation .
 +    pip install --no-cache-dir --no-deps --no-index .
  
  Refer to pip_ documentation for descriptions of these options.
  
@@@ -448,11 -449,11 +448,11 @@@ Building a source archive
  -------------------------
  
  A source archive for the gmxapi python package can be built from the GROMACS
 -source repository using Python ``setuptools`` and ``scikit-build``.
 +source repository using Python ``setuptools``.
  
  Example::
  
 -    pip install --upgrade setuptools scikit-build
 +    pip install --upgrade setuptools
      cd python_packaging/src
      python setup.py sdist
  
@@@ -461,13 -462,6 +461,13 @@@ archive file. The file name has the form
  *<version>* is the version from the ``setup.py`` file, and *<suffix>* is
  determined by the local environment or by additional arguments to ``setup.py``.
  
 +The new `build <https://pypa-build.readthedocs.io/en/latest/>`__ module is somewhat tidier.
 +It automatically manages a temporary venv with the necessary dependencies::
 +
 +    pip install --upgrade build
 +    cd python_packaging/src
 +    python -m build --sdist .
 +
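 +By default, the resulting archive is written to the :file:`dist/` directory
 +(see the *build* documentation linked above).
 +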
  .. seealso::
  
      Python documentation for
@@@ -502,29 -496,16 +502,29 @@@ This will first build the *gmxapi* Python
  location in the build tree. Sphinx can then import the package to automatically
  extract Python docstrings.
  
 +Note that this is an entirely CMake-driven installation and Python dependencies
 +will not be installed automatically. You can update your Python environment
 +(before configuring with CMake) using the :file:`requirements.txt` files provided
 +in the :file:`python_packaging/` directory of the repository. Example::
 +
 +    pip install -r python_packaging/requirements-docs.txt
 +
 +or
 +
 +::
 +
 +    pip install -r python_packaging/requirements-test.txt
 +
  Sometimes the build environment can choose a different Python interpreter than
  the one you intended.
- You can set the ``Python3_ROOT`` or ``CMAKE_PREFIX_PATH`` CMake variable to
+ You can set the ``Python3_ROOT_DIR`` or ``CMAKE_PREFIX_PATH`` CMake variable to
  explicitly choose the Python installation or *venv* directory.
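
  For example, to point CMake at the venv created earlier in this guide
  (a sketch assuming the :file:`$HOME/myvenv` path from above)::

      cmake .. -DPython3_ROOT_DIR=$HOME/myvenv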
  
  If you use pyenv or pyenv-virtualenv to dynamically manage your Python version,
  you can help identify a particular version with ``pyenv version-name`` and the
  directory with ``pyenv prefix {version}``. For example::
  
-     -DPython3_ROOT=$(pyenv prefix $(pyenv version-name))
+     -DPython3_ROOT_DIR=$(pyenv prefix $(pyenv version-name))
  
  Docker web server
  -----------------
@@@ -570,27 -551,6 +570,27 @@@ and see if that fixes the problem. If not
  (see above) to help narrow down the problem before you
  `open an issue <https://gitlab.com/gromacs/gromacs/-/issues/>`_.
  
 +Errors regarding pybind11
 +-------------------------
 +
 +An error may occur in ``setup.py`` with output that contains something like the following::
 +
 +      ModuleNotFoundError: No module named 'pybind11'
 +      Building wheel for gmxapi (pyproject.toml): finished with status 'error'
 +      ERROR: Failed building wheel for gmxapi
 +    Failed to build gmxapi
 +    ERROR: Could not build wheels for gmxapi, which is required to install pyproject.toml-based projects
 +
 +The important information here is that ``pybind11`` was not found.
 +
 +Build dependencies aren't always automatically installed.
 +Even if you are using ``pip``, you may have disabled automatic dependency fulfillment with an option like ``--no-build-isolation`` or ``--no-deps``.
 +
 +In any case, the problem should be resolved by explicitly installing the ``pybind11``
 +Python package before attempting to build ``gmxapi``::
 +
 +    pip install --upgrade pybind11
 +
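 +To confirm that the package is now visible, you can run a quick check
 +(``pybind11.get_include()`` is a helper provided by the pybind11 Python
 +package)::
 +
 +    python -c "import pybind11; print(pybind11.get_include())"
 +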
  Couldn't find the ``gmxapi`` support library?
  ---------------------------------------------
  
@@@ -709,5 -669,3 +709,5 @@@ rebase your feature branches for the ne
  .. _scikit-build: https://pypi.org/project/scikit-build/
  
  .. _setuptools: https://pypi.org/project/setuptools/
 +
 +.. _wheel: https://pypi.org/project/wheel/
diff --combined src/gromacs/ewald/pme.cpp
index 7438077d0b26031ecaf055243eea06eb928dc27a,b0b59e6eedbaf7fe8f37134805eab50a5280b711..e2c87459a0589403888d57fdfead803bbabb869f
  #include "gromacs/utility/logger.h"
  #include "gromacs/utility/real.h"
  #include "gromacs/utility/smalloc.h"
 -#include "gromacs/utility/stringutil.h"
 +#include "gromacs/utility/message_string_collector.h"
  #include "gromacs/utility/unique_cptr.h"
  
  #include "calculate_spline_moduli.h"
  #include "pme_spline_work.h"
  #include "pme_spread.h"
  
 -/*! \brief Help build a descriptive message in \c error if there are
 - * \c errorReasons why PME on GPU is not supported.
 - *
 - * \returns Whether the lack of errorReasons indicate there is support. */
 -static bool addMessageIfNotSupported(const std::list<std::string>& errorReasons, std::string* error)
 -{
 -    bool isSupported = errorReasons.empty();
 -    if (!isSupported && error)
 -    {
 -        std::string regressionTestMarker = "PME GPU does not support";
 -        // this prefix is tested for in the regression tests script gmxtest.pl
 -        *error = regressionTestMarker;
 -        if (errorReasons.size() == 1)
 -        {
 -            *error += " " + errorReasons.back();
 -        }
 -        else
 -        {
 -            *error += ": " + gmx::joinStrings(errorReasons, "; ");
 -        }
 -        *error += ".";
 -    }
 -    return isSupported;
 -}
 +//NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
 +bool g_allowPmeWithSyclForTesting = false;
  
  bool pme_gpu_supports_build(std::string* error)
  {
 -    std::list<std::string> errorReasons;
 -    if (GMX_DOUBLE)
 -    {
 -        errorReasons.emplace_back("a double-precision build");
 -    }
 -    if (!GMX_GPU)
 -    {
 -        errorReasons.emplace_back("a non-GPU build");
 -    }
 -    if (GMX_GPU_SYCL)
 -    {
 -        errorReasons.emplace_back("SYCL build"); // SYCL-TODO
 -    }
 -    return addMessageIfNotSupported(errorReasons, error);
 +    gmx::MessageStringCollector errorReasons;
 +    // Before changing the prefix string, make sure that it is not searched for in regression tests.
 +    errorReasons.startContext("PME GPU does not support:");
 +    errorReasons.appendIf(GMX_DOUBLE, "Double-precision build of GROMACS.");
 +    errorReasons.appendIf(!GMX_GPU, "Non-GPU build of GROMACS.");
 +    errorReasons.appendIf(GMX_GPU_SYCL && !g_allowPmeWithSyclForTesting, "SYCL build."); // SYCL-TODO
 +    errorReasons.finishContext();
 +    if (error != nullptr)
 +    {
 +        *error = errorReasons.toString();
 +    }
 +    return errorReasons.isEmpty();
  }
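
  // Hypothetical caller sketch (illustration only, not part of this change):
  //
  //     std::string error;
  //     if (!pme_gpu_supports_build(&error))
  //     {
  //         GMX_THROW(gmx::NotImplementedError(error));
  //     }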
  
  bool pme_gpu_supports_hardware(const gmx_hw_info_t gmx_unused& hwinfo, std::string* error)
  {
 -    std::list<std::string> errorReasons;
 -
 -    if (GMX_GPU_OPENCL)
 -    {
 +    gmx::MessageStringCollector errorReasons;
 +    // Before changing the prefix string, make sure that it is not searched for in regression tests.
 +    errorReasons.startContext("PME GPU does not support:");
  #ifdef __APPLE__
 -        errorReasons.emplace_back("Apple OS X operating system");
 +    errorReasons.appendIf(GMX_GPU_OPENCL, "Apple OS X operating system");
  #endif
 +    errorReasons.finishContext();
 +    if (error != nullptr)
 +    {
 +        *error = errorReasons.toString();
      }
 -    return addMessageIfNotSupported(errorReasons, error);
 +    return errorReasons.isEmpty();
  }
  
  bool pme_gpu_supports_input(const t_inputrec& ir, std::string* error)
  {
 -    std::list<std::string> errorReasons;
 -    if (!EEL_PME(ir.coulombtype))
 -    {
 -        errorReasons.emplace_back("systems that do not use PME for electrostatics");
 -    }
 -    if (ir.pme_order != 4)
 -    {
 -        errorReasons.emplace_back("interpolation orders other than 4");
 -    }
 -    if (EVDW_PME(ir.vdwtype))
 -    {
 -        errorReasons.emplace_back("Lennard-Jones PME");
 -    }
 -    if (!EI_DYNAMICS(ir.eI))
 -    {
 -        errorReasons.emplace_back(
 -                "Cannot compute PME interactions on a GPU, because PME GPU requires a dynamical "
 -                "integrator (md, sd, etc).");
 -    }
 -    return addMessageIfNotSupported(errorReasons, error);
 +    gmx::MessageStringCollector errorReasons;
 +    // Before changing the prefix string, make sure that it is not searched for in regression tests.
 +    errorReasons.startContext("PME GPU does not support:");
 +    errorReasons.appendIf(!EEL_PME(ir.coulombtype),
 +                          "Systems that do not use PME for electrostatics.");
 +    errorReasons.appendIf((ir.pme_order != 4), "Interpolation orders other than 4.");
 +    errorReasons.appendIf(EVDW_PME(ir.vdwtype), "Lennard-Jones PME.");
 +    errorReasons.appendIf(!EI_DYNAMICS(ir.eI), "Non-dynamical integrator (use md, sd, etc).");
 +    errorReasons.finishContext();
 +    if (error != nullptr)
 +    {
 +        *error = errorReasons.toString();
 +    }
 +    return errorReasons.isEmpty();
  }
  
 -    std::list<std::string> errorReasons;
 -    if (ir.efep != efepNO)
+ bool pme_gpu_mixed_mode_supports_input(const t_inputrec& ir, std::string* error)
+ {
 -        errorReasons.emplace_back("Free Energy Perturbation (in PME GPU mixed mode)");
++    gmx::MessageStringCollector errorReasons;
++    // Before changing the prefix string, make sure that it is not searched for in regression tests.
++    errorReasons.startContext("PME GPU in Mixed mode does not support:");
++    errorReasons.appendIf(ir.efep != FreeEnergyPerturbationType::No, "Free Energy Perturbation.");
++    errorReasons.finishContext();
++    if (error != nullptr)
+     {
 -    return addMessageIfNotSupported(errorReasons, error);
++        *error = errorReasons.toString();
+     }
++    return errorReasons.isEmpty();
+ }

  /*! \brief \libinternal
   * Finds out if PME with given inputs is possible to run on GPU.
   * This function is an internal final check, validating the whole PME structure on creation,
   */
  static bool pme_gpu_check_restrictions(const gmx_pme_t* pme, std::string* error)
  {
 -    std::list<std::string> errorReasons;
 -    if (pme->nnodes != 1)
 -    {
 -        errorReasons.emplace_back("PME decomposition");
 -    }
 -    if (pme->pme_order != 4)
 -    {
 -        errorReasons.emplace_back("interpolation orders other than 4");
 -    }
 -    if (pme->doLJ)
 -    {
 -        errorReasons.emplace_back("Lennard-Jones PME");
 -    }
 -    if (GMX_DOUBLE)
 -    {
 -        errorReasons.emplace_back("double precision");
 -    }
 -    if (!GMX_GPU)
 -    {
 -        errorReasons.emplace_back("non-GPU build of GROMACS");
 -    }
 -    if (GMX_GPU_SYCL)
 -    {
 -        errorReasons.emplace_back("SYCL build of GROMACS"); // SYCL-TODO
 -    }
 -    return addMessageIfNotSupported(errorReasons, error);
 +    gmx::MessageStringCollector errorReasons;
 +    // Before changing the prefix string, make sure that it is not searched for in regression tests.
 +    errorReasons.startContext("PME GPU does not support:");
 +    errorReasons.appendIf((pme->nnodes != 1), "PME decomposition.");
 +    errorReasons.appendIf((pme->pme_order != 4), "Interpolation orders other than 4.");
 +    errorReasons.appendIf(pme->doLJ, "Lennard-Jones PME.");
 +    errorReasons.appendIf(GMX_DOUBLE, "Double precision build of GROMACS.");
 +    errorReasons.appendIf(!GMX_GPU, "Non-GPU build of GROMACS.");
 +    errorReasons.appendIf(GMX_GPU_SYCL && !g_allowPmeWithSyclForTesting, "SYCL build of GROMACS."); // SYCL-TODO
 +    errorReasons.finishContext();
 +    if (error != nullptr)
 +    {
 +        *error = errorReasons.toString();
 +    }
 +    return errorReasons.isEmpty();
  }
  
  PmeRunMode pme_run_mode(const gmx_pme_t* pme)
@@@ -285,7 -333,11 +299,7 @@@ PmeAtomComm::PmeAtomComm(MPI_Comm   Pme
                           const int  pmeOrder,
                           const int  dimIndex,
                           const bool doSpread) :
 -    dimind(dimIndex),
 -    bSpread(doSpread),
 -    pme_order(pmeOrder),
 -    nthread(numThreads),
 -    spline(nthread)
 +    dimind(dimIndex), bSpread(doSpread), pme_order(pmeOrder), nthread(numThreads), spline(nthread)
  {
      if (PmeMpiCommunicator != MPI_COMM_NULL)
      {
@@@ -433,18 -485,8 +447,18 @@@ static void init_overlap_comm(pme_overl
      MPI_Status stat;
      for (size_t b = 0; b < ol->comm_data.size(); b++)
      {
 -        MPI_Sendrecv(&ol->send_size, 1, MPI_INT, ol->comm_data[b].send_id, b, &ol->comm_data[b].recv_size,
 -                     1, MPI_INT, ol->comm_data[b].recv_id, b, ol->mpi_comm, &stat);
 +        MPI_Sendrecv(&ol->send_size,
 +                     1,
 +                     MPI_INT,
 +                     ol->comm_data[b].send_id,
 +                     b,
 +                     &ol->comm_data[b].recv_size,
 +                     1,
 +                     MPI_INT,
 +                     ol->comm_data[b].recv_id,
 +                     b,
 +                     ol->mpi_comm,
 +                     &stat);
      }
  #endif
  
@@@ -482,8 -524,7 +496,8 @@@ bool gmx_pme_check_restrictions(int pme
          std::string message = gmx::formatString(
                  "pme_order (%d) is larger than the maximum allowed value (%d). Modify and "
                  "recompile the code if you really need such a high order.",
 -                pme_order, PME_ORDER_MAX);
 +                pme_order,
 +                PME_ORDER_MAX);
          GMX_THROW(gmx::InconsistentInputError(message));
      }
  
                    "threads, the number of grid lines per rank along x should be >= pme_order (%d) "
                    "or = pmeorder-1. To resolve this issue, use fewer ranks along x (and possibly "
                    "more along y and/or z) by specifying -dd manually.",
 -                  nkx / static_cast<double>(numPmeDomainsAlongX), pme_order);
 +                  nkx / static_cast<double>(numPmeDomainsAlongX),
 +                  pme_order);
      }
  
      return true;
@@@ -553,6 -593,9 +567,6 @@@ gmx_pme_t* gmx_pme_init(const t_commrec
  
      gmx::unique_cptr<gmx_pme_t, gmx_pme_destroy> pme(new gmx_pme_t());
  
 -    pme->sum_qgrid_tmp    = nullptr;
 -    pme->sum_qgrid_dd_tmp = nullptr;
 -
      pme->buf_nalloc = 0;
  
      pme->nnodes  = 1;
              pme->ndecompdim = 2;
  
  #if GMX_MPI
 -            MPI_Comm_split(pme->mpi_comm, pme->nodeid % numPmeDomains.y, pme->nodeid,
 +            MPI_Comm_split(pme->mpi_comm,
 +                           pme->nodeid % numPmeDomains.y,
 +                           pme->nodeid,
                             &pme->mpi_comm_d[0]); /* My communicator along major dimension */
 -            MPI_Comm_split(pme->mpi_comm, pme->nodeid / numPmeDomains.y, pme->nodeid,
 +            MPI_Comm_split(pme->mpi_comm,
 +                           pme->nodeid / numPmeDomains.y,
 +                           pme->nodeid,
                             &pme->mpi_comm_d[1]); /* My communicator along minor dimension */
  
              MPI_Comm_rank(pme->mpi_comm_d[0], &pme->nodeid_major);
       * not calculating free-energy for Coulomb and/or LJ while gmx_pme_init()
       * configures with free-energy, but that has never been tested.
       */
 -    pme->doCoulomb     = EEL_PME(ir->coulombtype);
 -    pme->doLJ          = EVDW_PME(ir->vdwtype);
 -    pme->bFEP_q        = ((ir->efep != efepNO) && bFreeEnergy_q);
 -    pme->bFEP_lj       = ((ir->efep != efepNO) && bFreeEnergy_lj);
 -    pme->bFEP          = (pme->bFEP_q || pme->bFEP_lj);
 -    pme->nkx           = ir->nkx;
 -    pme->nky           = ir->nky;
 -    pme->nkz           = ir->nkz;
 -    pme->bP3M          = (ir->coulombtype == eelP3M_AD || getenv("GMX_PME_P3M") != nullptr);
 +    pme->doCoulomb = EEL_PME(ir->coulombtype);
 +    pme->doLJ      = EVDW_PME(ir->vdwtype);
 +    pme->bFEP_q    = ((ir->efep != FreeEnergyPerturbationType::No) && bFreeEnergy_q);
 +    pme->bFEP_lj   = ((ir->efep != FreeEnergyPerturbationType::No) && bFreeEnergy_lj);
 +    pme->bFEP      = (pme->bFEP_q || pme->bFEP_lj);
 +    pme->nkx       = ir->nkx;
 +    pme->nky       = ir->nky;
 +    pme->nkz       = ir->nkz;
 +    pme->bP3M = (ir->coulombtype == CoulombInteractionType::P3mAD || getenv("GMX_PME_P3M") != nullptr);
      pme->pme_order     = ir->pme_order;
      pme->ewaldcoeff_q  = ewaldcoeff_q;
      pme->ewaldcoeff_lj = ewaldcoeff_lj;
  
      // The box requires scaling with nwalls = 2, we store that condition as well
      // as the scaling factor
 -    delete pme->boxScaler;
 -    pme->boxScaler = new EwaldBoxZScaler(*ir);
 +    pme->boxScaler = std::make_unique<EwaldBoxZScaler>(
 +            EwaldBoxZScaler(inputrecPbcXY2Walls(ir), ir->wall_ewald_zfac));
  
      /* If we violate restrictions, generate a fatal error here */
 -    gmx_pme_check_restrictions(pme->pme_order, pme->nkx, pme->nky, pme->nkz, pme->nnodes_major,
 -                               pme->bUseThreads, true);
 +    gmx_pme_check_restrictions(
 +            pme->pme_order, pme->nkx, pme->nky, pme->nkz, pme->nnodes_major, pme->bUseThreads, true);
  
      if (pme->nnodes > 1)
      {
                              "      and PME grid_y (%d) and grid_z (%d) should be divisible by "
                              "#PME_ranks_y "
                              "(%d)",
 -                            gmx::roundToInt((imbal - 1) * 100), pme->nkx, pme->nky,
 -                            pme->nnodes_major, pme->nky, pme->nkz, pme->nnodes_minor);
 +                            gmx::roundToInt((imbal - 1) * 100),
 +                            pme->nkx,
 +                            pme->nky,
 +                            pme->nnodes_major,
 +                            pme->nky,
 +                            pme->nkz,
 +                            pme->nnodes_minor);
          }
      }
  
       * y is always copied through a buffer: we don't need padding in z,
       * but we do need the overlap in x because of the communication order.
       */
 -    init_overlap_comm(&pme->overlap[0], pme->pme_order, pme->mpi_comm_d[0], pme->nnodes_major,
 -                      pme->nodeid_major, pme->nkx,
 +    init_overlap_comm(&pme->overlap[0],
 +                      pme->pme_order,
 +                      pme->mpi_comm_d[0],
 +                      pme->nnodes_major,
 +                      pme->nodeid_major,
 +                      pme->nkx,
                        (div_round_up(pme->nky, pme->nnodes_minor) + pme->pme_order)
                                * (pme->nkz + pme->pme_order - 1));
  
       * We do this with an offset buffer of equal size, so we need to allocate
       * extra for the offset. That's what the (+1)*pme->nkz is for.
       */
 -    init_overlap_comm(&pme->overlap[1], pme->pme_order, pme->mpi_comm_d[1], pme->nnodes_minor,
 -                      pme->nodeid_minor, pme->nky,
 +    init_overlap_comm(&pme->overlap[1],
 +                      pme->pme_order,
 +                      pme->mpi_comm_d[1],
 +                      pme->nnodes_minor,
 +                      pme->nodeid_minor,
 +                      pme->nky,
                        (div_round_up(pme->nkx, pme->nnodes_major) + pme->pme_order + 1) * pme->nkz);
  
      /* Double-check for a limitation of the (current) sum_fftgrid_dd code.
      pme->pmegrid_start_iy = pme->overlap[1].s2g0[pme->nodeid_minor];
      pme->pmegrid_start_iz = 0;
  
 -    make_gridindex_to_localindex(pme->nkx, pme->pmegrid_start_ix,
 -                                 pme->pmegrid_nx - (pme->pme_order - 1), &pme->nnx, &pme->fshx);
 -    make_gridindex_to_localindex(pme->nky, pme->pmegrid_start_iy,
 -                                 pme->pmegrid_ny - (pme->pme_order - 1), &pme->nny, &pme->fshy);
 -    make_gridindex_to_localindex(pme->nkz, pme->pmegrid_start_iz, pme->pmegrid_nz_base, &pme->nnz,
 -                                 &pme->fshz);
 +    make_gridindex_to_localindex(
 +            pme->nkx, pme->pmegrid_start_ix, pme->pmegrid_nx - (pme->pme_order - 1), &pme->nnx, &pme->fshx);
 +    make_gridindex_to_localindex(
 +            pme->nky, pme->pmegrid_start_iy, pme->pmegrid_ny - (pme->pme_order - 1), &pme->nny, &pme->fshy);
 +    make_gridindex_to_localindex(
 +            pme->nkz, pme->pmegrid_start_iz, pme->pmegrid_nz_base, &pme->nnz, &pme->fshz);
  
      pme->spline_work = make_pme_spline_work(pme->pme_order);
  
       */
      if (pme->doLJ)
      {
 -        pme->ngrids = ((ir->ljpme_combination_rule == eljpmeLB) ? DO_Q_AND_LJ_LB : DO_Q_AND_LJ);
 +        pme->ngrids = ((ir->ljpme_combination_rule == LongRangeVdW::LB) ? DO_Q_AND_LJ_LB : DO_Q_AND_LJ);
      }
      else
      {
      {
          if ((i < DO_Q && pme->doCoulomb && (i == 0 || bFreeEnergy_q))
              || (i >= DO_Q && pme->doLJ
 -                && (i == 2 || bFreeEnergy_lj || ir->ljpme_combination_rule == eljpmeLB)))
 +                && (i == 2 || bFreeEnergy_lj || ir->ljpme_combination_rule == LongRangeVdW::LB)))
          {
 -            pmegrids_init(&pme->pmegrid[i], pme->pmegrid_nx, pme->pmegrid_ny, pme->pmegrid_nz,
 -                          pme->pmegrid_nz_base, pme->pme_order, pme->bUseThreads, pme->nthread,
 +            pmegrids_init(&pme->pmegrid[i],
 +                          pme->pmegrid_nx,
 +                          pme->pmegrid_ny,
 +                          pme->pmegrid_nz,
 +                          pme->pmegrid_nz_base,
 +                          pme->pme_order,
 +                          pme->bUseThreads,
 +                          pme->nthread,
                            pme->overlap[0].s2g1[pme->nodeid_major]
                                    - pme->overlap[0].s2g0[pme->nodeid_major + 1],
                            pme->overlap[1].s2g1[pme->nodeid_minor]
              const auto allocateRealGridForGpu = (pme->runMode == PmeRunMode::Mixed)
                                                          ? gmx::PinningPolicy::PinnedIfSupported
                                                          : gmx::PinningPolicy::CannotBePinned;
 -            gmx_parallel_3dfft_init(&pme->pfft_setup[i], ndata, &pme->fftgrid[i], &pme->cfftgrid[i],
 -                                    pme->mpi_comm_d, bReproducible, pme->nthread, allocateRealGridForGpu);
 +            gmx_parallel_3dfft_init(&pme->pfft_setup[i],
 +                                    ndata,
 +                                    &pme->fftgrid[i],
 +                                    &pme->cfftgrid[i],
 +                                    pme->mpi_comm_d,
 +                                    bReproducible,
 +                                    pme->nthread,
 +                                    allocateRealGridForGpu);
          }
      }
  
@@@ -924,27 -938,15 +938,27 @@@ void gmx_pme_reinit(struct gmx_pme_t** 
          const gmx::MDLogger dummyLogger;
          GMX_ASSERT(pmedata, "Invalid PME pointer");
          NumPmeDomains numPmeDomains = { pme_src->nnodes_major, pme_src->nnodes_minor };
 -        *pmedata = gmx_pme_init(cr, numPmeDomains, &irc, pme_src->bFEP_q, pme_src->bFEP_lj, FALSE,
 -                                ewaldcoeff_q, ewaldcoeff_lj, pme_src->nthread, pme_src->runMode,
 -                                pme_src->gpu, nullptr, nullptr, nullptr, dummyLogger);
 +        *pmedata                    = gmx_pme_init(cr,
 +                                numPmeDomains,
 +                                &irc,
 +                                pme_src->bFEP_q,
 +                                pme_src->bFEP_lj,
 +                                FALSE,
 +                                ewaldcoeff_q,
 +                                ewaldcoeff_lj,
 +                                pme_src->nthread,
 +                                pme_src->runMode,
 +                                pme_src->gpu,
 +                                nullptr,
 +                                nullptr,
 +                                nullptr,
 +                                dummyLogger);
          /* When running PME on the CPU not using domain decomposition,
           * the atom data is allocated once only in gmx_pme_(re)init().
           */
          if (!pme_src->gpu && pme_src->nnodes == 1)
          {
 -            gmx_pme_reinit_atoms(*pmedata, pme_src->atc[0].numAtoms(), nullptr, nullptr);
 +            gmx_pme_reinit_atoms(*pmedata, pme_src->atc[0].numAtoms(), {}, {});
          }
          // TODO this is mostly passing around current values
      }
      /* We would like to reuse the fft grids, but that's harder */
  }
  
 -void gmx_pme_calc_energy(gmx_pme_t* pme, gmx::ArrayRef<const gmx::RVec> x, gmx::ArrayRef<const real> q, real* V)
 +real gmx_pme_calc_energy(gmx_pme_t* pme, gmx::ArrayRef<const gmx::RVec> x, gmx::ArrayRef<const real> q)
  {
      pmegrids_t* grid;
  
      /* Only calculate the spline coefficients, don't actually spread */
      spread_on_grid(pme, atc, nullptr, TRUE, FALSE, pme->fftgrid[PME_GRID_QA], FALSE, PME_GRID_QA);
  
 -    *V = gather_energy_bsplines(pme, grid->grid.grid, atc);
 +    return gather_energy_bsplines(pme, grid->grid.grid, atc);
  }
  
  /*! \brief Calculate initial Lorentz-Berthelot coefficients for LJ-PME */
 -static void calc_initial_lb_coeffs(gmx::ArrayRef<real> coefficient, const real* local_c6, const real* local_sigma)
 +static void calc_initial_lb_coeffs(gmx::ArrayRef<real>       coefficient,
 +                                   gmx::ArrayRef<const real> local_c6,
 +                                   gmx::ArrayRef<const real> local_sigma)
  {
      for (gmx::index i = 0; i < coefficient.ssize(); ++i)
      {
  }
  
  /*! \brief Calculate next Lorentz-Berthelot coefficients for LJ-PME */
 -static void calc_next_lb_coeffs(gmx::ArrayRef<real> coefficient, const real* local_sigma)
 +static void calc_next_lb_coeffs(gmx::ArrayRef<real> coefficient, gmx::ArrayRef<const real> local_sigma)
  {
      for (gmx::index i = 0; i < coefficient.ssize(); ++i)
      {
  int gmx_pme_do(struct gmx_pme_t*              pme,
                 gmx::ArrayRef<const gmx::RVec> coordinates,
                 gmx::ArrayRef<gmx::RVec>       forces,
 -               real                           chargeA[],
 -               real                           chargeB[],
 -               real                           c6A[],
 -               real                           c6B[],
 -               real                           sigmaA[],
 -               real                           sigmaB[],
 +               gmx::ArrayRef<const real>      chargeA,
 +               gmx::ArrayRef<const real>      chargeB,
 +               gmx::ArrayRef<const real>      c6A,
 +               gmx::ArrayRef<const real>      c6B,
 +               gmx::ArrayRef<const real>      sigmaA,
 +               gmx::ArrayRef<const real>      sigmaB,
                 const matrix                   box,
                 const t_commrec*               cr,
                 int                            maxshift_x,
      GMX_ASSERT(pme->runMode == PmeRunMode::CPU,
                 "gmx_pme_do should not be called on the GPU PME run.");
  
 -    int                  d, npme, grid_index, max_grid_index;
 -    PmeAtomComm&         atc         = pme->atc[0];
 -    pmegrids_t*          pmegrid     = nullptr;
 -    real*                grid        = nullptr;
 -    real*                coefficient = nullptr;
 -    PmeOutput            output[2]; // The second is used for the B state with FEP
 -    real                 scale, lambda;
 -    gmx_bool             bClearF;
 -    gmx_parallel_3dfft_t pfft_setup;
 -    real*                fftgrid;
 -    t_complex*           cfftgrid;
 -    int                  thread;
 -    gmx_bool             bFirst, bDoSplines;
 -    int                  fep_state;
 -    int                  fep_states_lj = pme->bFEP_lj ? 2 : 1;
 +    PmeAtomComm&              atc     = pme->atc[0];
 +    pmegrids_t*               pmegrid = nullptr;
 +    real*                     grid    = nullptr;
 +    gmx::ArrayRef<const real> coefficient;
 +    std::array<PmeOutput, 2>  output; // The second is used for the B state with FEP
 +    gmx_parallel_3dfft_t      pfft_setup;
 +    real*                     fftgrid;
 +    t_complex*                cfftgrid;
 +    int                       thread;
 +    const int                 fep_states_lj = pme->bFEP_lj ? 2 : 1;
      // There's no support for computing energy without virial, or vice versa
      const bool computeEnergyAndVirial = (stepWork.computeEnergy || stepWork.computeVirial);
  
      pme->boxScaler->scaleBox(box, scaledBox);
  
      gmx::invertBoxMatrix(scaledBox, pme->recipbox);
 -    bFirst = TRUE;
 +    bool bFirst = true;
  
      /* For simplicity, we construct the splines for all particles if
       * more than one PME calculations is needed. Some optimization
       * that don't yet have them.
       */
  
 -    bDoSplines = pme->bFEP || (pme->doCoulomb && pme->doLJ);
 +    bool bDoSplines = pme->bFEP || (pme->doCoulomb && pme->doLJ);
  
      /* We need a maximum of four separate PME calculations:
       * grid_index=0: Coulomb PME with charges from state A
       */
  
      /* If we are doing LJ-PME with LB, we only do Q here */
 -    max_grid_index = (pme->ljpme_combination_rule == eljpmeLB) ? DO_Q : DO_Q_AND_LJ;
 -
 -    for (grid_index = 0; grid_index < max_grid_index; ++grid_index)
 +    const int max_grid_index = (pme->ljpme_combination_rule == LongRangeVdW::LB) ? DO_Q : DO_Q_AND_LJ;
 +    bool      bClearF;
 +    for (int grid_index = 0; grid_index < max_grid_index; ++grid_index)
      {
          /* Check if we should do calculations at this grid_index
           * If grid_index is odd we should be doing FEP
  
          grid = pmegrid->grid.grid;
  
 -        if (debug)
 -        {
 -            fprintf(debug, "PME: number of ranks = %d, rank = %d\n", cr->nnodes, cr->nodeid);
 -            fprintf(debug, "Grid = %p\n", static_cast<void*>(grid));
 -            if (grid == nullptr)
 -            {
 -                gmx_fatal(FARGS, "No grid!");
 -            }
 -        }
 -
          if (pme->nnodes == 1)
          {
 -            atc.coefficient = gmx::arrayRefFromArray(coefficient, coordinates.size());
 +            atc.coefficient = coefficient;
          }
          else
          {
 -            wallcycle_start(wcycle, ewcPME_REDISTXF);
 +            wallcycle_start(wcycle, WallCycleCounter::PmeRedistXF);
              do_redist_pos_coeffs(pme, cr, bFirst, coordinates, coefficient);
  
 -            wallcycle_stop(wcycle, ewcPME_REDISTXF);
 +            wallcycle_stop(wcycle, WallCycleCounter::PmeRedistXF);
          }
  
 -        if (debug)
 -        {
 -            fprintf(debug, "Rank= %6d, pme local particles=%6d\n", cr->nodeid, atc.numAtoms());
 -        }
 -
 -        wallcycle_start(wcycle, ewcPME_SPREAD);
 +        wallcycle_start(wcycle, WallCycleCounter::PmeSpread);
  
          /* Spread the coefficients on a grid */
          spread_on_grid(pme, &atc, pmegrid, bFirst, TRUE, fftgrid, bDoSplines, grid_index);
              copy_pmegrid_to_fftgrid(pme, grid, fftgrid, grid_index);
          }
  
 -        wallcycle_stop(wcycle, ewcPME_SPREAD);
 +        wallcycle_stop(wcycle, WallCycleCounter::PmeSpread);
  
          /* TODO If the OpenMP and single-threaded implementations
             converge, then spread_on_grid() and
                  /* do 3d-fft */
                  if (thread == 0)
                  {
 -                    wallcycle_start(wcycle, ewcPME_FFT);
 +                    wallcycle_start(wcycle, WallCycleCounter::PmeFft);
                  }
                  gmx_parallel_3dfft_execute(pfft_setup, GMX_FFT_REAL_TO_COMPLEX, thread, wcycle);
                  if (thread == 0)
                  {
 -                    wallcycle_stop(wcycle, ewcPME_FFT);
 +                    wallcycle_stop(wcycle, WallCycleCounter::PmeFft);
                  }
  
                  /* solve in k-space for our local cells */
                  if (thread == 0)
                  {
 -                    wallcycle_start(wcycle, (grid_index < DO_Q ? ewcPME_SOLVE : ewcLJPME));
 +                    wallcycle_start(
 +                            wcycle,
 +                            (grid_index < DO_Q ? WallCycleCounter::PmeSolve : WallCycleCounter::LJPme));
                  }
                  if (grid_index < DO_Q)
                  {
 -                    loop_count = solve_pme_yzx(
 -                            pme, cfftgrid, scaledBox[XX][XX] * scaledBox[YY][YY] * scaledBox[ZZ][ZZ],
 -                            computeEnergyAndVirial, pme->nthread, thread);
 +                    loop_count = solve_pme_yzx(pme,
 +                                               cfftgrid,
 +                                               scaledBox[XX][XX] * scaledBox[YY][YY] * scaledBox[ZZ][ZZ],
 +                                               computeEnergyAndVirial,
 +                                               pme->nthread,
 +                                               thread);
                  }
                  else
                  {
                      loop_count =
 -                            solve_pme_lj_yzx(pme, &cfftgrid, FALSE,
 +                            solve_pme_lj_yzx(pme,
 +                                             &cfftgrid,
 +                                             FALSE,
                                               scaledBox[XX][XX] * scaledBox[YY][YY] * scaledBox[ZZ][ZZ],
 -                                             computeEnergyAndVirial, pme->nthread, thread);
 +                                             computeEnergyAndVirial,
 +                                             pme->nthread,
 +                                             thread);
                  }
  
                  if (thread == 0)
                  {
 -                    wallcycle_stop(wcycle, (grid_index < DO_Q ? ewcPME_SOLVE : ewcLJPME));
 +                    wallcycle_stop(
 +                            wcycle,
 +                            (grid_index < DO_Q ? WallCycleCounter::PmeSolve : WallCycleCounter::LJPme));
                      inc_nrnb(nrnb, eNR_SOLVEPME, loop_count);
                  }
  
                  /* do 3d-invfft */
                  if (thread == 0)
                  {
 -                    wallcycle_start(wcycle, ewcPME_FFT);
 +                    wallcycle_start(wcycle, WallCycleCounter::PmeFft);
                  }
                  gmx_parallel_3dfft_execute(pfft_setup, GMX_FFT_COMPLEX_TO_REAL, thread, wcycle);
                  if (thread == 0)
                  {
 -                    wallcycle_stop(wcycle, ewcPME_FFT);
 +                    wallcycle_stop(wcycle, WallCycleCounter::PmeFft);
  
  
                      if (pme->nodeid == 0)
                      {
 -                        real ntot = pme->nkx * pme->nky * pme->nkz;
 -                        npme      = static_cast<int>(ntot * std::log(ntot) / std::log(2.0));
 +                        real      ntot = pme->nkx * pme->nky * pme->nkz;
 +                        const int npme = static_cast<int>(ntot * std::log(ntot) / std::log(2.0));
                          inc_nrnb(nrnb, eNR_FFT, 2 * npme);
                      }
  
                      /* Note: this wallcycle region is closed below
                         outside an OpenMP region, so take care if
                         refactoring code here. */
 -                    wallcycle_start(wcycle, ewcPME_GATHER);
 +                    wallcycle_start(wcycle, WallCycleCounter::PmeGather);
                  }
  
                  copy_fftgrid_to_pmegrid(pme, fftgrid, grid, grid_index, pme->nthread, thread);
               * atc->f is the actual force array, not a buffer,
               * therefore we should not clear it.
               */
 -            lambda  = grid_index < DO_Q ? lambda_q : lambda_lj;
 -            bClearF = (bFirst && PAR(cr));
 +            real lambda = grid_index < DO_Q ? lambda_q : lambda_lj;
 +            bClearF     = (bFirst && PAR(cr));
  #pragma omp parallel for num_threads(pme->nthread) schedule(static)
              for (thread = 0; thread < pme->nthread; thread++)
              {
                  try
                  {
 -                    gather_f_bsplines(pme, grid, bClearF, &atc, &atc.spline[thread],
 +                    gather_f_bsplines(pme,
 +                                      grid,
 +                                      bClearF,
 +                                      &atc,
 +                                      &atc.spline[thread],
                                        pme->bFEP ? (grid_index % 2 == 0 ? 1.0 - lambda : lambda) : 1.0);
                  }
                  GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
              }
  
  
 -            inc_nrnb(nrnb, eNR_GATHERFBSP,
 -                     pme->pme_order * pme->pme_order * pme->pme_order * atc.numAtoms());
 +            inc_nrnb(nrnb, eNR_GATHERFBSP, pme->pme_order * pme->pme_order * pme->pme_order * atc.numAtoms());
              /* Note: this wallcycle region is opened above inside an OpenMP
                 region, so take care if refactoring code here. */
 -            wallcycle_stop(wcycle, ewcPME_GATHER);
 +            wallcycle_stop(wcycle, WallCycleCounter::PmeGather);
          }
  
          if (computeEnergyAndVirial)
      /* For Lorentz-Berthelot combination rules in LJ-PME, we need to calculate
       * seven terms. */
  
 -    if (pme->doLJ && pme->ljpme_combination_rule == eljpmeLB)
 +    if (pme->doLJ && pme->ljpme_combination_rule == LongRangeVdW::LB)
      {
          /* Loop over A- and B-state if we are doing FEP */
 -        for (fep_state = 0; fep_state < fep_states_lj; ++fep_state)
 +        for (int fep_state = 0; fep_state < fep_states_lj; ++fep_state)
          {
 -            real *local_c6 = nullptr, *local_sigma = nullptr, *RedistC6 = nullptr, *RedistSigma = nullptr;
 -            gmx::ArrayRef<real> coefficientBuffer;
 +            std::vector<real>         local_c6;
 +            std::vector<real>         local_sigma;
 +            gmx::ArrayRef<const real> RedistC6;
 +            gmx::ArrayRef<const real> RedistSigma;
 +            gmx::ArrayRef<real>       coefficientBuffer;
              if (pme->nnodes == 1)
              {
                  pme->lb_buf1.resize(atc.numAtoms());
                  switch (fep_state)
                  {
                      case 0:
 -                        local_c6    = c6A;
 -                        local_sigma = sigmaA;
 +                        local_c6.assign(c6A.begin(), c6A.end());
 +                        local_sigma.assign(sigmaA.begin(), sigmaA.end());
                          break;
                      case 1:
 -                        local_c6    = c6B;
 -                        local_sigma = sigmaB;
 +                        local_c6.assign(c6B.begin(), c6B.end());
 +                        local_sigma.assign(sigmaB.begin(), sigmaB.end());
                          break;
                      default: gmx_incons("Trying to access wrong FEP-state in LJ-PME routine");
                  }
                          break;
                      default: gmx_incons("Trying to access wrong FEP-state in LJ-PME routine");
                  }
 -                wallcycle_start(wcycle, ewcPME_REDISTXF);
 +                wallcycle_start(wcycle, WallCycleCounter::PmeRedistXF);
  
                  do_redist_pos_coeffs(pme, cr, bFirst, coordinates, RedistC6);
                  pme->lb_buf1.resize(atc.numAtoms());
                  pme->lb_buf2.resize(atc.numAtoms());
 -                local_c6 = pme->lb_buf1.data();
 +                local_c6.assign(pme->lb_buf1.begin(), pme->lb_buf1.end());
                  for (int i = 0; i < atc.numAtoms(); ++i)
                  {
                      local_c6[i] = atc.coefficient[i];
                  }
  
                  do_redist_pos_coeffs(pme, cr, FALSE, coordinates, RedistSigma);
 -                local_sigma = pme->lb_buf2.data();
 +                local_sigma.assign(pme->lb_buf2.begin(), pme->lb_buf2.end());
                  for (int i = 0; i < atc.numAtoms(); ++i)
                  {
                      local_sigma[i] = atc.coefficient[i];
                  }
  
 -                wallcycle_stop(wcycle, ewcPME_REDISTXF);
 +                wallcycle_stop(wcycle, WallCycleCounter::PmeRedistXF);
              }
              atc.coefficient = coefficientBuffer;
              calc_initial_lb_coeffs(coefficientBuffer, local_c6, local_sigma);
  
              /*Seven terms in LJ-PME with LB, grid_index < 2 reserved for electrostatics*/
 -            for (grid_index = 2; grid_index < 9; ++grid_index)
 +            for (int grid_index = 2; grid_index < 9; ++grid_index)
              {
                  /* Unpack structure */
                  pmegrid    = &pme->pmegrid[grid_index];
                  calc_next_lb_coeffs(coefficientBuffer, local_sigma);
                  grid = pmegrid->grid.grid;
  
 -                wallcycle_start(wcycle, ewcPME_SPREAD);
 +                wallcycle_start(wcycle, WallCycleCounter::PmeSpread);
                  /* Spread the c6 on a grid */
                  spread_on_grid(pme, &atc, pmegrid, bFirst, TRUE, fftgrid, bDoSplines, grid_index);
  
                      inc_nrnb(nrnb, eNR_WEIGHTS, DIM * atc.numAtoms());
                  }
  
 -                inc_nrnb(nrnb, eNR_SPREADBSP,
 +                inc_nrnb(nrnb,
 +                         eNR_SPREADBSP,
                           pme->pme_order * pme->pme_order * pme->pme_order * atc.numAtoms());
                  if (pme->nthread == 1)
                  {
                      }
                      copy_pmegrid_to_fftgrid(pme, grid, fftgrid, grid_index);
                  }
 -                wallcycle_stop(wcycle, ewcPME_SPREAD);
 +                wallcycle_stop(wcycle, WallCycleCounter::PmeSpread);
  
                  /*Here we start a large thread parallel region*/
  #pragma omp parallel num_threads(pme->nthread) private(thread)
                          /* do 3d-fft */
                          if (thread == 0)
                          {
 -                            wallcycle_start(wcycle, ewcPME_FFT);
 +                            wallcycle_start(wcycle, WallCycleCounter::PmeFft);
                          }
  
                          gmx_parallel_3dfft_execute(pfft_setup, GMX_FFT_REAL_TO_COMPLEX, thread, wcycle);
                          if (thread == 0)
                          {
 -                            wallcycle_stop(wcycle, ewcPME_FFT);
 +                            wallcycle_stop(wcycle, WallCycleCounter::PmeFft);
                          }
                      }
                      GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
                      thread = gmx_omp_get_thread_num();
                      if (thread == 0)
                      {
 -                        wallcycle_start(wcycle, ewcLJPME);
 +                        wallcycle_start(wcycle, WallCycleCounter::LJPme);
                      }
  
                      loop_count =
 -                            solve_pme_lj_yzx(pme, &pme->cfftgrid[2], TRUE,
 +                            solve_pme_lj_yzx(pme,
 +                                             &pme->cfftgrid[2],
 +                                             TRUE,
                                               scaledBox[XX][XX] * scaledBox[YY][YY] * scaledBox[ZZ][ZZ],
 -                                             computeEnergyAndVirial, pme->nthread, thread);
 +                                             computeEnergyAndVirial,
 +                                             pme->nthread,
 +                                             thread);
                      if (thread == 0)
                      {
 -                        wallcycle_stop(wcycle, ewcLJPME);
 +                        wallcycle_stop(wcycle, WallCycleCounter::LJPme);
                          inc_nrnb(nrnb, eNR_SOLVEPME, loop_count);
                      }
                  }
  
              bFirst = !pme->doCoulomb;
              calc_initial_lb_coeffs(coefficientBuffer, local_c6, local_sigma);
 -            for (grid_index = 8; grid_index >= 2; --grid_index)
 +            for (int grid_index = 8; grid_index >= 2; --grid_index)
              {
                  /* Unpack structure */
                  pmegrid    = &pme->pmegrid[grid_index];
                          /* do 3d-invfft */
                          if (thread == 0)
                          {
 -                            wallcycle_start(wcycle, ewcPME_FFT);
 +                            wallcycle_start(wcycle, WallCycleCounter::PmeFft);
                          }
  
                          gmx_parallel_3dfft_execute(pfft_setup, GMX_FFT_COMPLEX_TO_REAL, thread, wcycle);
                          if (thread == 0)
                          {
 -                            wallcycle_stop(wcycle, ewcPME_FFT);
 +                            wallcycle_stop(wcycle, WallCycleCounter::PmeFft);
  
  
                              if (pme->nodeid == 0)
                              {
 -                                real ntot = pme->nkx * pme->nky * pme->nkz;
 -                                npme      = static_cast<int>(ntot * std::log(ntot) / std::log(2.0));
 +                                real      ntot = pme->nkx * pme->nky * pme->nkz;
 +                                const int npme = static_cast<int>(ntot * std::log(ntot) / std::log(2.0));
                                  inc_nrnb(nrnb, eNR_FFT, 2 * npme);
                              }
 -                            wallcycle_start(wcycle, ewcPME_GATHER);
 +                            wallcycle_start(wcycle, WallCycleCounter::PmeGather);
                          }
  
                          copy_fftgrid_to_pmegrid(pme, fftgrid, grid, grid_index, pme->nthread, thread);
                  if (stepWork.computeForces)
                  {
                      /* interpolate forces for our local atoms */
 -                    bClearF = (bFirst && PAR(cr));
 -                    scale   = pme->bFEP ? (fep_state < 1 ? 1.0 - lambda_lj : lambda_lj) : 1.0;
 +                    bClearF    = (bFirst && PAR(cr));
 +                    real scale = pme->bFEP ? (fep_state < 1 ? 1.0 - lambda_lj : lambda_lj) : 1.0;
                      scale *= lb_scale_factor[grid_index - 2];
  
  #pragma omp parallel for num_threads(pme->nthread) schedule(static)
                      {
                          try
                          {
 -                            gather_f_bsplines(pme, grid, bClearF, &pme->atc[0],
 -                                              &pme->atc[0].spline[thread], scale);
 +                            gather_f_bsplines(
 +                                    pme, grid, bClearF, &pme->atc[0], &pme->atc[0].spline[thread], scale);
                          }
                          GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
                      }
  
  
 -                    inc_nrnb(nrnb, eNR_GATHERFBSP,
 +                    inc_nrnb(nrnb,
 +                             eNR_GATHERFBSP,
                               pme->pme_order * pme->pme_order * pme->pme_order * pme->atc[0].numAtoms());
                  }
 -                wallcycle_stop(wcycle, ewcPME_GATHER);
 +                wallcycle_stop(wcycle, WallCycleCounter::PmeGather);
  
                  bFirst = FALSE;
              } /* for (grid_index = 8; grid_index >= 2; --grid_index) */
          }     /* for (fep_state = 0; fep_state < fep_states_lj; ++fep_state) */
 -    }         /* if (pme->doLJ && pme->ljpme_combination_rule == eljpmeLB) */
 +    }         /* if (pme->doLJ && pme->ljpme_combination_rule == LongRangeVdW::LB) */
  
      if (stepWork.computeForces && pme->nnodes > 1)
      {
 -        wallcycle_start(wcycle, ewcPME_REDISTXF);
 -        for (d = 0; d < pme->ndecompdim; d++)
 +        wallcycle_start(wcycle, WallCycleCounter::PmeRedistXF);
 +        for (int d = 0; d < pme->ndecompdim; d++)
          {
              gmx::ArrayRef<gmx::RVec> forcesRef;
              if (d == pme->ndecompdim - 1)
              {
                  forcesRef = pme->atc[d + 1].f;
              }
 -            if (DOMAINDECOMP(cr))
 +            if (haveDDAtomOrdering(*cr))
              {
                  dd_pmeredist_f(pme, &pme->atc[d], forcesRef, d == pme->ndecompdim - 1 && pme->bPPnode);
              }
          }
  
 -        wallcycle_stop(wcycle, ewcPME_REDISTXF);
 +        wallcycle_stop(wcycle, WallCycleCounter::PmeRedistXF);
      }
  
      if (computeEnergyAndVirial)
                      }
                  }
              }
 -            if (debug)
 -            {
 -                fprintf(debug, "Electrostatic PME mesh energy: %g\n", *energy_q);
 -            }
          }
          else
          {
                      }
                  }
              }
 -            if (debug)
 -            {
 -                fprintf(debug, "Lennard-Jones PME mesh energy: %g\n", *energy_lj);
 -            }
          }
          else
          {
@@@ -1655,6 -1660,8 +1669,6 @@@ void gmx_pme_destroy(gmx_pme_t* pme
          return;
      }
  
 -    delete pme->boxScaler;
 -
      sfree(pme->nnx);
      sfree(pme->nny);
      sfree(pme->nnz);
          pme_free_all_work(&pme->solve_work, pme->nthread);
      }
  
 -    sfree(pme->sum_qgrid_tmp);
 -    sfree(pme->sum_qgrid_dd_tmp);
 -
      destroy_pme_spline_work(pme->spline_work);
  
      if (pme->gpu != nullptr)
      delete pme;
  }
  
 -void gmx_pme_reinit_atoms(gmx_pme_t* pme, const int numAtoms, const real* chargesA, const real* chargesB)
 +void gmx_pme_reinit_atoms(gmx_pme_t*                pme,
 +                          const int                 numAtoms,
 +                          gmx::ArrayRef<const real> chargesA,
 +                          gmx::ArrayRef<const real> chargesB)
  {
      if (pme->gpu != nullptr)
      {
 -        GMX_ASSERT(!(pme->bFEP_q && chargesB == nullptr),
 +        GMX_ASSERT(!(pme->bFEP_q && chargesB.empty()),
                     "B state charges must be specified if running Coulomb FEP on the GPU");
 -        pme_gpu_reinit_atoms(pme->gpu, numAtoms, chargesA, pme->bFEP_q ? chargesB : nullptr);
 +        pme_gpu_reinit_atoms(pme->gpu, numAtoms, chargesA.data(), pme->bFEP_q ? chargesB.data() : nullptr);
      }
      else
      {
@@@ -1722,23 -1729,3 +1736,23 @@@ bool gmx_pme_grid_matches(const gmx_pme
  {
      return (pme.nkx == grid_size[XX] && pme.nky == grid_size[YY] && pme.nkz == grid_size[ZZ]);
  }
 +
 +void gmx::SeparatePmeRanksPermitted::disablePmeRanks(const std::string& reason)
 +{
 +    permitSeparatePmeRanks_ = false;
 +
 +    if (!reason.empty())
 +    {
 +        reasons_.push_back(reason);
 +    }
 +}
 +
 +bool gmx::SeparatePmeRanksPermitted::permitSeparatePmeRanks() const
 +{
 +    return permitSeparatePmeRanks_;
 +}
 +
 +std::string gmx::SeparatePmeRanksPermitted::reasonsWhyDisabled() const
 +{
 +    return joinStrings(reasons_, "; ");
 +}
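 +
 +// Hypothetical usage sketch (illustration only; 'mdlog' and the triggering
 +// condition are assumed caller-side details):
 +//
 +//     gmx::SeparatePmeRanksPermitted separatePmeRanksPermitted;
 +//     separatePmeRanksPermitted.disablePmeRanks("feature X is not compatible with separate PME ranks");
 +//     if (!separatePmeRanksPermitted.permitSeparatePmeRanks())
 +//     {
 +//         GMX_LOG(mdlog.warning)
 +//                 .asParagraph()
 +//                 .appendText(separatePmeRanksPermitted.reasonsWhyDisabled());
 +//     }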
diff --combined src/gromacs/ewald/pme.h
index 03d4da52c07097da2a8521823638d58f99902045,3ac892b3ac786583d5e2a06e633f9c6119e7273f..eb0c5651753535db601959d2101f7f8e12669ebd
  #define GMX_EWALD_PME_H
  
  #include <string>
 +#include <vector>
  
  #include "gromacs/gpu_utils/devicebuffer_datatype.h"
  #include "gromacs/gpu_utils/gpu_macros.h"
  #include "gromacs/math/vectypes.h"
 -#include "gromacs/utility/basedefinitions.h"
  #include "gromacs/utility/real.h"
  
  struct gmx_hw_info_t;
@@@ -75,19 -75,6 +75,19 @@@ enum class GpuTaskCompletion
  class PmeGpuProgram;
  class GpuEventSynchronizer;
  
 +/*! \brief Hack to selectively enable some parts of PME during unit testing.
 + *
 + * Set to \c false by default. If any of the tests sets it to \c true, it will
 + * make the compatibility check consider PME to be supported in SYCL builds.
 + *
 + * Currently we don't have proper PME implementation with SYCL, but we still want
 + * to run tests for some of the kernels.
 + *
 + * \todo Remove after #3927 is done and PME is fully enabled in SYCL builds.
 + */
 +//NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
 +extern bool g_allowPmeWithSyclForTesting;
 +
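 +// For example, a test might flip the flag before calling the support checks
 +// (hypothetical sketch):
 +//
 +//     g_allowPmeWithSyclForTesting = true;
 +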
  namespace gmx
  {
  template<typename>
@@@ -96,30 -83,6 +96,30 @@@ class ForceWithVirial
  class MDLogger;
  enum class PinningPolicy : int;
  class StepWorkload;
 +
 +/*! \libinternal \brief Class for managing the usage of separate PME-only ranks
 + *
 + * Used to check whether any part of the setup forbids the use of separate PME-only ranks
 + *
 + */
 +class SeparatePmeRanksPermitted
 +{
 +public:
 +    //! Disables the use of separate PME ranks and records the reason
 +    void disablePmeRanks(const std::string& reason);
 +    //! Returns whether separate PME ranks are permitted
 +    bool permitSeparatePmeRanks() const;
 +    //! Returns all reasons, for not using PME ranks
 +    std::string reasonsWhyDisabled() const;
 +
 +private:
 +    //! Flag that indicates whether the simulation could use dedicated PME ranks
 +    bool permitSeparatePmeRanks_ = true;
 +    //! Storage for all reasons, why PME ranks could not be used
 +    std::vector<std::string> reasons_;
 +};
 +
 +class PmeCoordinateReceiverGpu;
  } // namespace gmx
  
  enum
@@@ -218,12 -181,12 +218,12 @@@ void gmx_pme_destroy(gmx_pme_t* pme)
  int gmx_pme_do(struct gmx_pme_t*              pme,
                 gmx::ArrayRef<const gmx::RVec> coordinates,
                 gmx::ArrayRef<gmx::RVec>       forces,
 -               real                           chargeA[],
 -               real                           chargeB[],
 -               real                           c6A[],
 -               real                           c6B[],
 -               real                           sigmaA[],
 -               real                           sigmaB[],
 +               gmx::ArrayRef<const real>      chargeA,
 +               gmx::ArrayRef<const real>      chargeB,
 +               gmx::ArrayRef<const real>      c6A,
 +               gmx::ArrayRef<const real>      c6B,
 +               gmx::ArrayRef<const real>      sigmaA,
 +               gmx::ArrayRef<const real>      sigmaB,
                 const matrix                   box,
                 const t_commrec*               cr,
                 int                            maxshift_x,
   * pme struct. Currently does not work in parallel or with free
   * energy.
   */
 -void gmx_pme_calc_energy(gmx_pme_t* pme, gmx::ArrayRef<const gmx::RVec> x, gmx::ArrayRef<const real> q, real* V);
 +real gmx_pme_calc_energy(gmx_pme_t* pme, gmx::ArrayRef<const gmx::RVec> x, gmx::ArrayRef<const real> q);
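
Call sites change accordingly (sketch)::

    // Before: real V; gmx_pme_calc_energy(pme, x, q, &V);
    const real coulombEnergy = gmx_pme_calc_energy(pme, x, q);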
  
  /*! \brief
   * This function updates the local atom data on GPU after DD (charges, coordinates, etc.).
   * \param[in]     chargesB   The array of particle charges in state B. Only used if
   * charges are perturbed and can otherwise be empty.
   */
 -void gmx_pme_reinit_atoms(gmx_pme_t* pme, int numAtoms, const real* chargesA, const real* chargesB);
 +void gmx_pme_reinit_atoms(gmx_pme_t*                pme,
 +                          int                       numAtoms,
 +                          gmx::ArrayRef<const real> chargesA,
 +                          gmx::ArrayRef<const real> chargesB);
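
With ArrayRef parameters, an absent B-state charge set is an empty view rather than a null pointer (sketch; haveFepCharges is illustrative)::

    gmx_pme_reinit_atoms(pme, numAtoms, chargesA,
                         haveFepCharges ? chargesB : gmx::ArrayRef<const real>{});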
  
  /* A block of PME GPU functions */
  
@@@ -300,6 -260,17 +300,17 @@@ bool pme_gpu_supports_hardware(const gm
   */
  bool pme_gpu_supports_input(const t_inputrec& ir, std::string* error);
  
+ /*! \brief Checks whether the input system allows running PME on a GPU in Mixed mode.
+  * Assumes that the input system is otherwise compatible with GPU PME, that is,
+  * before calling this function one should check that \ref pme_gpu_supports_input returns \c true.
+  *
+  * \param[in]  ir     Input system.
+  * \param[out] error  If non-null, the error message if the input is not supported.
+  *
+  * \returns true if PME can run on GPU in Mixed mode with this input, false otherwise.
+  */
+ bool pme_gpu_mixed_mode_supports_input(const t_inputrec& ir, std::string* error);
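
Per the documented contract, callers check full GPU-PME support before the Mixed-mode check (sketch)::

    std::string error;
    if (pme_gpu_supports_input(ir, &error) && pme_gpu_mixed_mode_supports_input(ir, &error))
    {
        // Mixed mode (GPU spread/gather, CPU FFT and solve) is usable for this input
    }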
  /*! \brief
   * Returns the active PME codepath (CPU, GPU, mixed).
   * \todo This is rather static data that should be managed by the higher-level task scheduler.
@@@ -374,24 -345,17 +385,24 @@@ GPU_FUNC_QUALIFIER void pme_gpu_prepare
  /*! \brief
   * Launches first stage of PME on GPU - spreading kernel.
   *
 - * \param[in] pme                The PME data structure.
 - * \param[in] xReadyOnDevice     Event synchronizer indicating that the coordinates
 - * are ready in the device memory; nullptr allowed only on separate PME ranks.
 - * \param[in] wcycle             The wallclock counter.
 - * \param[in] lambdaQ            The Coulomb lambda of the current state of the
 - * system. Only used if FEP of Coulomb is active.
 + * \param[in] pme                            The PME data structure.
 + * \param[in] xReadyOnDevice                 Event synchronizer indicating that the coordinates
 + *                                           are ready in the device memory; nullptr allowed only
 + *                                           on separate PME ranks.
 + * \param[in] wcycle                         The wallclock counter.
 + * \param[in] lambdaQ                        The Coulomb lambda of the current state of the
 + *                                           system. Only used if FEP of Coulomb is active.
 + * \param[in] useGpuDirectComm               Whether direct GPU PME-PP communication is active.
 + * \param[in] pmeCoordinateReceiverGpu       Coordinate receiver object, which must be valid when
 + *                                           direct GPU PME-PP communication is active.
   */
 -GPU_FUNC_QUALIFIER void pme_gpu_launch_spread(gmx_pme_t*            GPU_FUNC_ARGUMENT(pme),
 -                                              GpuEventSynchronizer* GPU_FUNC_ARGUMENT(xReadyOnDevice),
 -                                              gmx_wallcycle*        GPU_FUNC_ARGUMENT(wcycle),
 -                                              real GPU_FUNC_ARGUMENT(lambdaQ)) GPU_FUNC_TERM;
 +GPU_FUNC_QUALIFIER void pme_gpu_launch_spread(
 +        gmx_pme_t*                     GPU_FUNC_ARGUMENT(pme),
 +        GpuEventSynchronizer*          GPU_FUNC_ARGUMENT(xReadyOnDevice),
 +        gmx_wallcycle*                 GPU_FUNC_ARGUMENT(wcycle),
 +        real                           GPU_FUNC_ARGUMENT(lambdaQ),
 +        bool                           GPU_FUNC_ARGUMENT(useGpuDirectComm),
 +        gmx::PmeCoordinateReceiverGpu* GPU_FUNC_ARGUMENT(pmeCoordinateReceiverGpu)) GPU_FUNC_TERM;
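
A call sketch for the extended launch (variable names are illustrative; the receiver only needs to be valid when direct communication is active)::

    pme_gpu_launch_spread(pme, xReadyOnDevice, wcycle, lambdaQ,
                          useGpuDirectComm,
                          useGpuDirectComm ? pmeCoordinateReceiverGpu : nullptr);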
  
  /*! \brief
   * Launches middle stages of PME (FFT R2C, solving, FFT C2R) either on GPU or on CPU, depending on the run mode.
@@@ -490,12 -454,12 +501,12 @@@ GPU_FUNC_QUALIFIER void pme_gpu_set_dev
   * \param[in] pme            The PME data structure.
   * \returns                  Pointer to force data
   */
 -GPU_FUNC_QUALIFIER void* pme_gpu_get_device_f(const gmx_pme_t* GPU_FUNC_ARGUMENT(pme))
 -        GPU_FUNC_TERM_WITH_RETURN(nullptr);
 +GPU_FUNC_QUALIFIER DeviceBuffer<gmx::RVec> pme_gpu_get_device_f(const gmx_pme_t* GPU_FUNC_ARGUMENT(pme))
 +        GPU_FUNC_TERM_WITH_RETURN(DeviceBuffer<gmx::RVec>{});
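
The stronger return type propagates to callers (sketch)::

    // Before: void* f = pme_gpu_get_device_f(pme);
    DeviceBuffer<gmx::RVec> d_forces = pme_gpu_get_device_f(pme);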
  
  /*! \brief Get pointer to the device synchronizer object that allows syncing on PME force calculation completion
   * \param[in] pme            The PME data structure.
 - * \returns                  Pointer to sychronizer
 + * \returns                  Pointer to synchronizer
   */
  GPU_FUNC_QUALIFIER GpuEventSynchronizer* pme_gpu_get_f_ready_synchronizer(const gmx_pme_t* GPU_FUNC_ARGUMENT(pme))
          GPU_FUNC_TERM_WITH_RETURN(nullptr);
index d030f9d4aba335b2e3f1b592663e9d27e320473e,f96b7eafe4849cde05797ee65f020ff627f8d1f9..c3f9edb3b27d65b522f598d0cfdac1f8fa4de1f7
@@@ -60,6 -60,7 +60,6 @@@
  #include "gromacs/math/vec.h"
  #include "gromacs/mdlib/gmx_omp_nthreads.h"
  #include "gromacs/mdtypes/commrec.h"
 -#include "gromacs/mdtypes/mdatom.h"
  #include "gromacs/pbcutil/ishift.h"
  #include "gromacs/pbcutil/pbc.h"
  #include "gromacs/timing/wallcycle.h"
@@@ -164,7 -165,7 +164,7 @@@ struct VsiteThrea
      //! The interaction lists, only vsite entries are used
      std::array<InteractionList, F_NRE> ilist;
      //! Local fshift accumulation buffer
 -    std::array<RVec, SHIFTS> fshift;
 +    std::array<RVec, c_numShiftVectors> fshift;
      //! Local virial dx*df accumulation buffer
      matrix dxdf;
      //! Tells if interdependent task idTask should be used (in addition to the rest of this task), this bool has the same value on all threads
@@@ -213,9 -214,7 +213,9 @@@ public
      //! Set VSites and distribute VSite work over threads, should be called after DD partitioning
      void setVirtualSites(ArrayRef<const InteractionList> ilist,
                           ArrayRef<const t_iparams>       iparams,
 -                         const t_mdatoms&                mdatoms,
 +                         int                             numAtoms,
 +                         int                             homenr,
 +                         ArrayRef<const ParticleType>    ptype,
                           bool                            useDomdec);
  
  private:
@@@ -233,28 -232,22 +233,28 @@@ class VirtualSitesHandler::Imp
  {
  public:
      //! Constructor, domdec should be nullptr without DD
 -    Impl(const gmx_mtop_t& mtop, gmx_domdec_t* domdec, PbcType pbcType);
 +    Impl(const gmx_mtop_t&                 mtop,
 +         gmx_domdec_t*                     domdec,
 +         PbcType                           pbcType,
 +         ArrayRef<const RangePartitioning> updateGroupingPerMoleculeType);
  
      //! Returns the number of virtual sites acting over multiple update groups
      int numInterUpdategroupVirtualSites() const { return numInterUpdategroupVirtualSites_; }
  
      //! Set VSites and distribute VSite work over threads, should be called after DD partitioning
 -    void setVirtualSites(ArrayRef<const InteractionList> ilist, const t_mdatoms& mdatoms);
 +    void setVirtualSites(ArrayRef<const InteractionList> ilist,
 +                         int                             numAtoms,
 +                         int                             homenr,
 +                         ArrayRef<const ParticleType>    ptype);
  
      /*! \brief Create positions of vsite atoms for the local system
       *
 -     * \param[in,out] x        The coordinates
 -     * \param[in]     dt       The time step
 -     * \param[in,out] v        When != nullptr, velocities for vsites are set as displacement/dt
 -     * \param[in]     box      The box
 +     * \param[in,out] x          The coordinates
 +     * \param[in,out] v          The velocities, needed if the operation requires it
 +     * \param[in]     box        The box
 +     * \param[in]     operation  Whether we calculate positions, velocities, or both
       */
 -    void construct(ArrayRef<RVec> x, real dt, ArrayRef<RVec> v, const matrix box) const;
 +    void construct(ArrayRef<RVec> x, ArrayRef<RVec> v, const matrix box, VSiteOperation operation) const;
  
      /*! \brief Spread the force operating on the vsite atoms on the surrounding atoms.
       *
@@@ -320,7 -313,7 +320,7 @@@ static int pbc_rvec_sub(const t_pbc* pb
      else
      {
          rvec_sub(xi, xj, dx);
 -        return CENTRAL;
 +        return c_centralShiftIndex;
      }
  }
  
@@@ -330,167 -323,87 +330,167 @@@ static inline real inverseNorm(const rv
      return gmx::invsqrt(iprod(x, x));
  }
  
 +//! Whether we're calculating the virtual site position
 +enum class VSiteCalculatePosition
 +{
 +    Yes,
 +    No
 +};
 +//! Whether we're calculating the virtual site velocity
 +enum class VSiteCalculateVelocity
 +{
 +    Yes,
 +    No
 +};
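
These enums select code paths at compile time; a direct instantiation would look like the sketch below (the dispatch further down uses the getVOrNull helper instead of raw nullptr)::

    constr_vsite1<VSiteCalculatePosition::Yes, VSiteCalculateVelocity::No>(
            x[ai], x[avsite], nullptr, nullptr);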
 +
  #ifndef DOXYGEN
  /* Vsite construction routines */
  
 -static void constr_vsite1(const rvec xi, rvec x)
 -{
 -    copy_rvec(xi, x);
 +// GCC 8 falsely flags unused variables if constexpr prunes a code path, fixed in GCC 9
 +// https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85827
 +// clang-format off
 +GCC_DIAGNOSTIC_IGNORE(-Wunused-but-set-parameter)
 +// clang-format on
  
 -    /* TOTAL: 0 flops */
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static void constr_vsite1(const rvec xi, rvec x, const rvec vi, rvec v)
 +{
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        copy_rvec(xi, x);
 +        /* TOTAL: 0 flops */
 +    }
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        copy_rvec(vi, v);
 +    }
  }
  
 -static void constr_vsite2(const rvec xi, const rvec xj, rvec x, real a, const t_pbc* pbc)
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static void
 +constr_vsite2(const rvec xi, const rvec xj, rvec x, real a, const t_pbc* pbc, const rvec vi, const rvec vj, rvec v)
  {
 -    real b = 1 - a;
 +    const real b = 1 - a;
      /* 1 flop */
  
 -    if (pbc)
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
      {
 -        rvec dx;
 -        pbc_dx_aiuc(pbc, xj, xi, dx);
 -        x[XX] = xi[XX] + a * dx[XX];
 -        x[YY] = xi[YY] + a * dx[YY];
 -        x[ZZ] = xi[ZZ] + a * dx[ZZ];
 +        if (pbc)
 +        {
 +            rvec dx;
 +            pbc_dx_aiuc(pbc, xj, xi, dx);
 +            x[XX] = xi[XX] + a * dx[XX];
 +            x[YY] = xi[YY] + a * dx[YY];
 +            x[ZZ] = xi[ZZ] + a * dx[ZZ];
 +        }
 +        else
 +        {
 +            x[XX] = b * xi[XX] + a * xj[XX];
 +            x[YY] = b * xi[YY] + a * xj[YY];
 +            x[ZZ] = b * xi[ZZ] + a * xj[ZZ];
 +            /* 9 Flops */
 +        }
 +        /* TOTAL: 10 flops */
      }
 -    else
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
      {
 -        x[XX] = b * xi[XX] + a * xj[XX];
 -        x[YY] = b * xi[YY] + a * xj[YY];
 -        x[ZZ] = b * xi[ZZ] + a * xj[ZZ];
 -        /* 9 Flops */
 +        v[XX] = b * vi[XX] + a * vj[XX];
 +        v[YY] = b * vi[YY] + a * vj[YY];
 +        v[ZZ] = b * vi[ZZ] + a * vj[ZZ];
      }
 -
 -    /* TOTAL: 10 flops */
  }
  
 -static void constr_vsite2FD(const rvec xi, const rvec xj, rvec x, real a, const t_pbc* pbc)
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static void
 +constr_vsite2FD(const rvec xi, const rvec xj, rvec x, real a, const t_pbc* pbc, const rvec vi, const rvec vj, rvec v)
  {
 -    rvec xij;
 +    rvec xij = { 0 };
      pbc_rvec_sub(pbc, xj, xi, xij);
      /* 3 flops */
  
 -    const real b = a * inverseNorm(xij);
 +    const real invNormXij = inverseNorm(xij);
 +    const real b          = a * invNormXij;
      /* 6 + 10 flops */
  
 -    x[XX] = xi[XX] + b * xij[XX];
 -    x[YY] = xi[YY] + b * xij[YY];
 -    x[ZZ] = xi[ZZ] + b * xij[ZZ];
 -    /* 6 Flops */
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        x[XX] = xi[XX] + b * xij[XX];
 +        x[YY] = xi[YY] + b * xij[YY];
 +        x[ZZ] = xi[ZZ] + b * xij[ZZ];
 +        /* 6 Flops */
 +        /* TOTAL: 25 flops */
 +    }
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        rvec vij = { 0 };
 +        rvec_sub(vj, vi, vij);
 +        const real vijDotXij = iprod(vij, xij);
  
 -    /* TOTAL: 25 flops */
 +        v[XX] = vi[XX] + b * (vij[XX] - xij[XX] * vijDotXij * invNormXij * invNormXij);
 +        v[YY] = vi[YY] + b * (vij[YY] - xij[YY] * vijDotXij * invNormXij * invNormXij);
 +        v[ZZ] = vi[ZZ] + b * (vij[ZZ] - xij[ZZ] * vijDotXij * invNormXij * invNormXij);
 +    }
  }
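
For reference, the velocity branch is the time derivative of the 2FD construction x = x_i + a * x_ij / |x_ij| (a consistency check, not part of the patch):

    \mathbf{v} = \mathbf{v}_i + \frac{a}{|\mathbf{x}_{ij}|}
                 \left( \mathbf{v}_{ij}
                        - \mathbf{x}_{ij} \, \frac{\mathbf{x}_{ij} \cdot \mathbf{v}_{ij}}{|\mathbf{x}_{ij}|^{2}} \right)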
  
 -static void constr_vsite3(const rvec xi, const rvec xj, const rvec xk, rvec x, real a, real b, const t_pbc* pbc)
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static void constr_vsite3(const rvec   xi,
 +                          const rvec   xj,
 +                          const rvec   xk,
 +                          rvec         x,
 +                          real         a,
 +                          real         b,
 +                          const t_pbc* pbc,
 +                          const rvec   vi,
 +                          const rvec   vj,
 +                          const rvec   vk,
 +                          rvec         v)
  {
 -    real c = 1 - a - b;
 +    const real c = 1 - a - b;
      /* 2 flops */
  
 -    if (pbc)
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
      {
 -        rvec dxj, dxk;
 +        if (pbc)
 +        {
 +            rvec dxj, dxk;
  
 -        pbc_dx_aiuc(pbc, xj, xi, dxj);
 -        pbc_dx_aiuc(pbc, xk, xi, dxk);
 -        x[XX] = xi[XX] + a * dxj[XX] + b * dxk[XX];
 -        x[YY] = xi[YY] + a * dxj[YY] + b * dxk[YY];
 -        x[ZZ] = xi[ZZ] + a * dxj[ZZ] + b * dxk[ZZ];
 +            pbc_dx_aiuc(pbc, xj, xi, dxj);
 +            pbc_dx_aiuc(pbc, xk, xi, dxk);
 +            x[XX] = xi[XX] + a * dxj[XX] + b * dxk[XX];
 +            x[YY] = xi[YY] + a * dxj[YY] + b * dxk[YY];
 +            x[ZZ] = xi[ZZ] + a * dxj[ZZ] + b * dxk[ZZ];
 +        }
 +        else
 +        {
 +            x[XX] = c * xi[XX] + a * xj[XX] + b * xk[XX];
 +            x[YY] = c * xi[YY] + a * xj[YY] + b * xk[YY];
 +            x[ZZ] = c * xi[ZZ] + a * xj[ZZ] + b * xk[ZZ];
 +            /* 15 Flops */
 +        }
 +        /* TOTAL: 17 flops */
      }
 -    else
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
      {
 -        x[XX] = c * xi[XX] + a * xj[XX] + b * xk[XX];
 -        x[YY] = c * xi[YY] + a * xj[YY] + b * xk[YY];
 -        x[ZZ] = c * xi[ZZ] + a * xj[ZZ] + b * xk[ZZ];
 -        /* 15 Flops */
 +        v[XX] = c * vi[XX] + a * vj[XX] + b * vk[XX];
 +        v[YY] = c * vi[YY] + a * vj[YY] + b * vk[YY];
 +        v[ZZ] = c * vi[ZZ] + a * vj[ZZ] + b * vk[ZZ];
      }
 -
 -    /* TOTAL: 17 flops */
  }
  
 -static void constr_vsite3FD(const rvec xi, const rvec xj, const rvec xk, rvec x, real a, real b, const t_pbc* pbc)
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static void constr_vsite3FD(const rvec   xi,
 +                            const rvec   xj,
 +                            const rvec   xk,
 +                            rvec         x,
 +                            real         a,
 +                            real         b,
 +                            const t_pbc* pbc,
 +                            const rvec   vi,
 +                            const rvec   vj,
 +                            const rvec   vk,
 +                            rvec         v)
  {
      rvec xij, xjk, temp;
 -    real c;
  
      pbc_rvec_sub(pbc, xj, xi, xij);
      pbc_rvec_sub(pbc, xk, xj, xjk);
      temp[ZZ] = xij[ZZ] + a * xjk[ZZ];
      /* 6 flops */
  
 -    c = b * inverseNorm(temp);
 +    const real invNormTemp = inverseNorm(temp);
 +    const real c           = b * invNormTemp;
      /* 6 + 10 flops */
  
 -    x[XX] = xi[XX] + c * temp[XX];
 -    x[YY] = xi[YY] + c * temp[YY];
 -    x[ZZ] = xi[ZZ] + c * temp[ZZ];
 -    /* 6 Flops */
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        x[XX] = xi[XX] + c * temp[XX];
 +        x[YY] = xi[YY] + c * temp[YY];
 +        x[ZZ] = xi[ZZ] + c * temp[ZZ];
 +        /* 6 Flops */
 +        /* TOTAL: 34 flops */
 +    }
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        rvec vij = { 0 };
 +        rvec vjk = { 0 };
 +        rvec_sub(vj, vi, vij);
 +        rvec_sub(vk, vj, vjk);
 +        const rvec tempV = { vij[XX] + a * vjk[XX], vij[YY] + a * vjk[YY], vij[ZZ] + a * vjk[ZZ] };
 +        const real tempDotTempV = iprod(temp, tempV);
  
 -    /* TOTAL: 34 flops */
 +        v[XX] = vi[XX] + c * (tempV[XX] - temp[XX] * tempDotTempV * invNormTemp * invNormTemp);
 +        v[YY] = vi[YY] + c * (tempV[YY] - temp[YY] * tempDotTempV * invNormTemp * invNormTemp);
 +        v[ZZ] = vi[ZZ] + c * (tempV[ZZ] - temp[ZZ] * tempDotTempV * invNormTemp * invNormTemp);
 +    }
  }
  
 -static void constr_vsite3FAD(const rvec xi, const rvec xj, const rvec xk, rvec x, real a, real b, const t_pbc* pbc)
 -{
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static void constr_vsite3FAD(const rvec   xi,
 +                             const rvec   xj,
 +                             const rvec   xk,
 +                             rvec         x,
 +                             real         a,
 +                             real         b,
 +                             const t_pbc* pbc,
 +                             const rvec   vi,
 +                             const rvec   vj,
 +                             const rvec   vk,
 +                             rvec         v)
 +{ // Note: a = d * cos(theta)
 +    //       b = d * sin(theta)
      rvec xij, xjk, xp;
 -    real a1, b1, c1, invdij;
  
      pbc_rvec_sub(pbc, xj, xi, xij);
      pbc_rvec_sub(pbc, xk, xj, xjk);
      /* 6 flops */
  
 -    invdij = inverseNorm(xij);
 -    c1     = invdij * invdij * iprod(xij, xjk);
 -    xp[XX] = xjk[XX] - c1 * xij[XX];
 -    xp[YY] = xjk[YY] - c1 * xij[YY];
 -    xp[ZZ] = xjk[ZZ] - c1 * xij[ZZ];
 -    a1     = a * invdij;
 -    b1     = b * inverseNorm(xp);
 +    const real invdij    = inverseNorm(xij);
 +    const real xijDotXjk = iprod(xij, xjk);
 +    const real c1        = invdij * invdij * xijDotXjk;
 +    xp[XX]               = xjk[XX] - c1 * xij[XX];
 +    xp[YY]               = xjk[YY] - c1 * xij[YY];
 +    xp[ZZ]               = xjk[ZZ] - c1 * xij[ZZ];
 +    const real a1        = a * invdij;
 +    const real invNormXp = inverseNorm(xp);
 +    const real b1        = b * invNormXp;
      /* 45 */
  
 -    x[XX] = xi[XX] + a1 * xij[XX] + b1 * xp[XX];
 -    x[YY] = xi[YY] + a1 * xij[YY] + b1 * xp[YY];
 -    x[ZZ] = xi[ZZ] + a1 * xij[ZZ] + b1 * xp[ZZ];
 -    /* 12 Flops */
 -
 -    /* TOTAL: 63 flops */
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        x[XX] = xi[XX] + a1 * xij[XX] + b1 * xp[XX];
 +        x[YY] = xi[YY] + a1 * xij[YY] + b1 * xp[YY];
 +        x[ZZ] = xi[ZZ] + a1 * xij[ZZ] + b1 * xp[ZZ];
 +        /* 12 Flops */
 +        /* TOTAL: 63 flops */
 +    }
 +
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        rvec vij = { 0 };
 +        rvec vjk = { 0 };
 +        rvec_sub(vj, vi, vij);
 +        rvec_sub(vk, vj, vjk);
 +
 +        const real vijDotXjkPlusXijDotVjk = iprod(vij, xjk) + iprod(xij, vjk);
 +        const real xijDotVij              = iprod(xij, vij);
 +        const real invNormXij2            = invdij * invdij;
 +
 +        rvec vp = { 0 };
 +        vp[XX]  = vjk[XX]
 +                 - xij[XX] * invNormXij2
 +                           * (vijDotXjkPlusXijDotVjk - invNormXij2 * xijDotXjk * xijDotVij * 2)
 +                 - vij[XX] * xijDotXjk * invNormXij2;
 +        vp[YY] = vjk[YY]
 +                 - xij[YY] * invNormXij2
 +                           * (vijDotXjkPlusXijDotVjk - invNormXij2 * xijDotXjk * xijDotVij * 2)
 +                 - vij[YY] * xijDotXjk * invNormXij2;
 +        vp[ZZ] = vjk[ZZ]
 +                 - xij[ZZ] * invNormXij2
 +                           * (vijDotXjkPlusXijDotVjk - invNormXij2 * xijDotXjk * xijDotVij * 2)
 +                 - vij[ZZ] * xijDotXjk * invNormXij2;
 +
 +        const real xpDotVp = iprod(xp, vp);
 +
 +        v[XX] = vi[XX] + a1 * (vij[XX] - xij[XX] * xijDotVij * invdij * invdij)
 +                + b1 * (vp[XX] - xp[XX] * xpDotVp * invNormXp * invNormXp);
 +        v[YY] = vi[YY] + a1 * (vij[YY] - xij[YY] * xijDotVij * invdij * invdij)
 +                + b1 * (vp[YY] - xp[YY] * xpDotVp * invNormXp * invNormXp);
 +        v[ZZ] = vi[ZZ] + a1 * (vij[ZZ] - xij[ZZ] * xijDotVij * invdij * invdij)
 +                + b1 * (vp[ZZ] - xp[ZZ] * xpDotVp * invNormXp * invNormXp);
 +    }
  }
  
 -static void
 -constr_vsite3OUT(const rvec xi, const rvec xj, const rvec xk, rvec x, real a, real b, real c, const t_pbc* pbc)
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static void constr_vsite3OUT(const rvec   xi,
 +                             const rvec   xj,
 +                             const rvec   xk,
 +                             rvec         x,
 +                             real         a,
 +                             real         b,
 +                             real         c,
 +                             const t_pbc* pbc,
 +                             const rvec   vi,
 +                             const rvec   vj,
 +                             const rvec   vk,
 +                             rvec         v)
  {
      rvec xij, xik, temp;
  
      cprod(xij, xik, temp);
      /* 15 Flops */
  
 -    x[XX] = xi[XX] + a * xij[XX] + b * xik[XX] + c * temp[XX];
 -    x[YY] = xi[YY] + a * xij[YY] + b * xik[YY] + c * temp[YY];
 -    x[ZZ] = xi[ZZ] + a * xij[ZZ] + b * xik[ZZ] + c * temp[ZZ];
 -    /* 18 Flops */
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        x[XX] = xi[XX] + a * xij[XX] + b * xik[XX] + c * temp[XX];
 +        x[YY] = xi[YY] + a * xij[YY] + b * xik[YY] + c * temp[YY];
 +        x[ZZ] = xi[ZZ] + a * xij[ZZ] + b * xik[ZZ] + c * temp[ZZ];
 +        /* 18 Flops */
 +        /* TOTAL: 33 flops */
 +    }
 +
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        rvec vij = { 0 };
 +        rvec vik = { 0 };
 +        rvec_sub(vj, vi, vij);
 +        rvec_sub(vk, vi, vik);
 +
 +        rvec temp1 = { 0 };
 +        rvec temp2 = { 0 };
 +        cprod(vij, xik, temp1);
 +        cprod(xij, vik, temp2);
  
 -    /* TOTAL: 33 flops */
 +        v[XX] = vi[XX] + a * vij[XX] + b * vik[XX] + c * (temp1[XX] + temp2[XX]);
 +        v[YY] = vi[YY] + a * vij[YY] + b * vik[YY] + c * (temp1[YY] + temp2[YY]);
 +        v[ZZ] = vi[ZZ] + a * vij[ZZ] + b * vik[ZZ] + c * (temp1[ZZ] + temp2[ZZ]);
 +    }
  }
  
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
  static void constr_vsite4FD(const rvec   xi,
                              const rvec   xj,
                              const rvec   xk,
                              real         a,
                              real         b,
                              real         c,
 -                            const t_pbc* pbc)
 +                            const t_pbc* pbc,
 +                            const rvec   vi,
 +                            const rvec   vj,
 +                            const rvec   vk,
 +                            const rvec   vl,
 +                            rvec         v)
  {
      rvec xij, xjk, xjl, temp;
      real d;
      temp[ZZ] = xij[ZZ] + a * xjk[ZZ] + b * xjl[ZZ];
      /* 12 flops */
  
 -    d = c * inverseNorm(temp);
 +    const real invRm = inverseNorm(temp);
 +    d                = c * invRm;
      /* 6 + 10 flops */
  
 -    x[XX] = xi[XX] + d * temp[XX];
 -    x[YY] = xi[YY] + d * temp[YY];
 -    x[ZZ] = xi[ZZ] + d * temp[ZZ];
 -    /* 6 Flops */
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        x[XX] = xi[XX] + d * temp[XX];
 +        x[YY] = xi[YY] + d * temp[YY];
 +        x[ZZ] = xi[ZZ] + d * temp[ZZ];
 +        /* 6 Flops */
 +        /* TOTAL: 43 flops */
 +    }
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        rvec vij = { 0 };
 +        rvec vjk = { 0 };
 +        rvec vjl = { 0 };
 +
 +        rvec_sub(vj, vi, vij);
 +        rvec_sub(vk, vj, vjk);
 +        rvec_sub(vl, vj, vjl);
  
 -    /* TOTAL: 43 flops */
 +        rvec vm = { 0 };
 +        vm[XX]  = vij[XX] + a * vjk[XX] + b * vjl[XX];
 +        vm[YY]  = vij[YY] + a * vjk[YY] + b * vjl[YY];
 +        vm[ZZ]  = vij[ZZ] + a * vjk[ZZ] + b * vjl[ZZ];
 +
 +        const real vmDotRm = iprod(vm, temp);
 +        v[XX]              = vi[XX] + d * (vm[XX] - temp[XX] * vmDotRm * invRm * invRm);
 +        v[YY]              = vi[YY] + d * (vm[YY] - temp[YY] * vmDotRm * invRm * invRm);
 +        v[ZZ]              = vi[ZZ] + d * (vm[ZZ] - temp[ZZ] * vmDotRm * invRm * invRm);
 +    }
  }
  
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
  static void constr_vsite4FDN(const rvec   xi,
                               const rvec   xj,
                               const rvec   xk,
                               real         a,
                               real         b,
                               real         c,
 -                             const t_pbc* pbc)
 +                             const t_pbc* pbc,
 +                             const rvec   vi,
 +                             const rvec   vj,
 +                             const rvec   vk,
 +                             const rvec   vl,
 +                             rvec         v)
  {
      rvec xij, xik, xil, ra, rb, rja, rjb, rm;
      real d;
      cprod(rja, rjb, rm);
      /* 9 flops */
  
 -    d = c * inverseNorm(rm);
 +    const real invNormRm = inverseNorm(rm);
 +    d                    = c * invNormRm;
      /* 5+5+1 flops */
  
 -    x[XX] = xi[XX] + d * rm[XX];
 -    x[YY] = xi[YY] + d * rm[YY];
 -    x[ZZ] = xi[ZZ] + d * rm[ZZ];
 -    /* 6 Flops */
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        x[XX] = xi[XX] + d * rm[XX];
 +        x[YY] = xi[YY] + d * rm[YY];
 +        x[ZZ] = xi[ZZ] + d * rm[ZZ];
 +        /* 6 Flops */
 +        /* TOTAL: 47 flops */
 +    }
 +
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        rvec vij = { 0 };
 +        rvec vik = { 0 };
 +        rvec vil = { 0 };
 +        rvec_sub(vj, vi, vij);
 +        rvec_sub(vk, vi, vik);
 +        rvec_sub(vl, vi, vil);
 +
 +        rvec vja = { 0 };
 +        rvec vjb = { 0 };
 +
 +        vja[XX] = a * vik[XX] - vij[XX];
 +        vja[YY] = a * vik[YY] - vij[YY];
 +        vja[ZZ] = a * vik[ZZ] - vij[ZZ];
 +        vjb[XX] = b * vil[XX] - vij[XX];
 +        vjb[YY] = b * vil[YY] - vij[YY];
 +        vjb[ZZ] = b * vil[ZZ] - vij[ZZ];
 +
 +        rvec temp1 = { 0 };
 +        rvec temp2 = { 0 };
 +        cprod(vja, rjb, temp1);
 +        cprod(rja, vjb, temp2);
  
 -    /* TOTAL: 47 flops */
 +        rvec vm = { 0 };
 +        vm[XX]  = temp1[XX] + temp2[XX];
 +        vm[YY]  = temp1[YY] + temp2[YY];
 +        vm[ZZ]  = temp1[ZZ] + temp2[ZZ];
 +
 +        const real rmDotVm = iprod(rm, vm);
 +        v[XX]              = vi[XX] + d * (vm[XX] - rm[XX] * rmDotVm * invNormRm * invNormRm);
 +        v[YY]              = vi[YY] + d * (vm[YY] - rm[YY] * rmDotVm * invNormRm * invNormRm);
 +        v[ZZ]              = vi[ZZ] + d * (vm[ZZ] - rm[ZZ] * rmDotVm * invNormRm * invNormRm);
 +    }
  }
  
 -static int constr_vsiten(const t_iatom* ia, ArrayRef<const t_iparams> ip, ArrayRef<RVec> x, const t_pbc* pbc)
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static int constr_vsiten(const t_iatom*            ia,
 +                         ArrayRef<const t_iparams> ip,
 +                         ArrayRef<RVec>            x,
 +                         const t_pbc*              pbc,
 +                         ArrayRef<RVec>            v)
  {
      rvec x1, dx;
      dvec dsum;
 -    int  n3, av, ai;
      real a;
 +    dvec dvsum = { 0 };
 +    rvec v1    = { 0 };
  
 -    n3 = 3 * ip[ia[0]].vsiten.n;
 -    av = ia[1];
 -    ai = ia[2];
 +    const int n3 = 3 * ip[ia[0]].vsiten.n;
 +    const int av = ia[1];
 +    int       ai = ia[2];
      copy_rvec(x[ai], x1);
 +    copy_rvec(v[ai], v1);
      clear_dvec(dsum);
      for (int i = 3; i < n3; i += 3)
      {
          ai = ia[i + 2];
          a  = ip[ia[i]].vsiten.a;
 -        if (pbc)
 +        if (calculatePosition == VSiteCalculatePosition::Yes)
          {
 -            pbc_dx_aiuc(pbc, x[ai], x1, dx);
 +            if (pbc)
 +            {
 +                pbc_dx_aiuc(pbc, x[ai], x1, dx);
 +            }
 +            else
 +            {
 +                rvec_sub(x[ai], x1, dx);
 +            }
 +            dsum[XX] += a * dx[XX];
 +            dsum[YY] += a * dx[YY];
 +            dsum[ZZ] += a * dx[ZZ];
 +            /* 9 Flops */
          }
 -        else
 +        if (calculateVelocity == VSiteCalculateVelocity::Yes)
          {
 -            rvec_sub(x[ai], x1, dx);
 +            rvec_sub(v[ai], v1, dx);
 +            dvsum[XX] += a * dx[XX];
 +            dvsum[YY] += a * dx[YY];
 +            dvsum[ZZ] += a * dx[ZZ];
 +            /* 9 Flops */
          }
 -        dsum[XX] += a * dx[XX];
 -        dsum[YY] += a * dx[YY];
 -        dsum[ZZ] += a * dx[ZZ];
 -        /* 9 Flops */
      }
  
 -    x[av][XX] = x1[XX] + dsum[XX];
 -    x[av][YY] = x1[YY] + dsum[YY];
 -    x[av][ZZ] = x1[ZZ] + dsum[ZZ];
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        x[av][XX] = x1[XX] + dsum[XX];
 +        x[av][YY] = x1[YY] + dsum[YY];
 +        x[av][ZZ] = x1[ZZ] + dsum[ZZ];
 +    }
 +
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        v[av][XX] = v1[XX] + dvsum[XX];
 +        v[av][YY] = v1[YY] + dvsum[YY];
 +        v[av][ZZ] = v1[ZZ] + dvsum[ZZ];
 +    }
  
      return n3;
  }
 +// End GCC 8 bug
 +GCC_DIAGNOSTIC_RESET
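
Because the vsiten position is the affine combination x_av = x_1 + sum_i a_i (x_i - x_1), its time derivative is the same combination of velocities, which is why dvsum mirrors dsum above:

    \mathbf{v}_{av} = \mathbf{v}_1 + \sum_i a_i \, (\mathbf{v}_i - \mathbf{v}_1)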
  
  #endif // DOXYGEN
  
@@@ -901,49 -615,29 +901,49 @@@ static PbcMode getPbcMode(const t_pbc* 
  
  /*! \brief Executes the vsite construction task for a single thread
   *
 + * \tparam        calculatePosition  Whether we are calculating vsite positions
 + * \tparam        calculateVelocity  Whether we are calculating vsite velocities
   * \param[in,out] x   Coordinates to construct vsites for
 - * \param[in]     dt  Time step, needed when v is not empty
 - * \param[in,out] v   When not empty, velocities are generated for virtual sites
 + * \param[in,out] v   Velocities are generated for virtual sites when \p calculateVelocity is Yes
   * \param[in]     ip  Interaction parameters for all interactions, only vsite parameters are used
   * \param[in]     ilist  The interaction lists, only vsites are used
   * \param[in]     pbc_null  PBC struct, used for PBC distance calculations when !=nullptr
   */
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
  static void construct_vsites_thread(ArrayRef<RVec>                  x,
 -                                    const real                      dt,
                                      ArrayRef<RVec>                  v,
                                      ArrayRef<const t_iparams>       ip,
                                      ArrayRef<const InteractionList> ilist,
                                      const t_pbc*                    pbc_null)
  {
 -    real inv_dt;
 -    if (!v.empty())
 -    {
 -        inv_dt = 1.0 / dt;
 -    }
 -    else
 -    {
 -        inv_dt = 1.0;
 -    }
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        GMX_RELEASE_ASSERT(!v.empty(),
 +                           "Can't calculate velocities without access to velocity vector.");
 +    }
 +
 +    // Work around clang bug (unfixed as of Feb 2021)
 +    // https://bugs.llvm.org/show_bug.cgi?id=35450
 +    // clang-format off
 +    CLANG_DIAGNOSTIC_IGNORE(-Wunused-lambda-capture)
 +    // clang-format on
 +    // GCC 8 falsely flags unused variables if constexpr prunes a code path, fixed in GCC 9
 +    // https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85827
 +    // clang-format off
 +    GCC_DIAGNOSTIC_IGNORE(-Wunused-but-set-parameter)
 +    // clang-format on
 +    // getVOrNull returns a velocity rvec if we need it, nullptr otherwise.
 +    auto getVOrNull = [v](int idx) -> real* {
 +        if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +        {
 +            return v[idx].as_vec();
 +        }
 +        else
 +        {
 +            return nullptr;
 +        }
 +    };
 +    GCC_DIAGNOSTIC_RESET
 +    CLANG_DIAGNOSTIC_RESET
  
      const PbcMode pbcMode = getPbcMode(pbc_null);
      /* We need another pbc pointer, as with charge groups we switch per vsite */
                  real b1, c1;
                  switch (ftype)
                  {
 -                    case F_VSITE1: constr_vsite1(x[ai], x[avsite]); break;
 +                    case F_VSITE1:
 +                        constr_vsite1<calculatePosition, calculateVelocity>(
 +                                x[ai], x[avsite], getVOrNull(ai), getVOrNull(avsite));
 +                        break;
                      case F_VSITE2:
                          aj = ia[3];
 -                        constr_vsite2(x[ai], x[aj], x[avsite], a1, pbc_null2);
 +                        constr_vsite2<calculatePosition, calculateVelocity>(x[ai],
 +                                                                            x[aj],
 +                                                                            x[avsite],
 +                                                                            a1,
 +                                                                            pbc_null2,
 +                                                                            getVOrNull(ai),
 +                                                                            getVOrNull(aj),
 +                                                                            getVOrNull(avsite));
                          break;
                      case F_VSITE2FD:
                          aj = ia[3];
 -                        constr_vsite2FD(x[ai], x[aj], x[avsite], a1, pbc_null2);
 +                        constr_vsite2FD<calculatePosition, calculateVelocity>(x[ai],
 +                                                                              x[aj],
 +                                                                              x[avsite],
 +                                                                              a1,
 +                                                                              pbc_null2,
 +                                                                              getVOrNull(ai),
 +                                                                              getVOrNull(aj),
 +                                                                              getVOrNull(avsite));
                          break;
                      case F_VSITE3:
                          aj = ia[3];
                          ak = ia[4];
                          b1 = ip[tp].vsite.b;
 -                        constr_vsite3(x[ai], x[aj], x[ak], x[avsite], a1, b1, pbc_null2);
 +                        constr_vsite3<calculatePosition, calculateVelocity>(x[ai],
 +                                                                            x[aj],
 +                                                                            x[ak],
 +                                                                            x[avsite],
 +                                                                            a1,
 +                                                                            b1,
 +                                                                            pbc_null2,
 +                                                                            getVOrNull(ai),
 +                                                                            getVOrNull(aj),
 +                                                                            getVOrNull(ak),
 +                                                                            getVOrNull(avsite));
                          break;
                      case F_VSITE3FD:
                          aj = ia[3];
                          ak = ia[4];
                          b1 = ip[tp].vsite.b;
 -                        constr_vsite3FD(x[ai], x[aj], x[ak], x[avsite], a1, b1, pbc_null2);
 +                        constr_vsite3FD<calculatePosition, calculateVelocity>(x[ai],
 +                                                                              x[aj],
 +                                                                              x[ak],
 +                                                                              x[avsite],
 +                                                                              a1,
 +                                                                              b1,
 +                                                                              pbc_null2,
 +                                                                              getVOrNull(ai),
 +                                                                              getVOrNull(aj),
 +                                                                              getVOrNull(ak),
 +                                                                              getVOrNull(avsite));
                          break;
                      case F_VSITE3FAD:
                          aj = ia[3];
                          ak = ia[4];
                          b1 = ip[tp].vsite.b;
 -                        constr_vsite3FAD(x[ai], x[aj], x[ak], x[avsite], a1, b1, pbc_null2);
 +                        constr_vsite3FAD<calculatePosition, calculateVelocity>(x[ai],
 +                                                                               x[aj],
 +                                                                               x[ak],
 +                                                                               x[avsite],
 +                                                                               a1,
 +                                                                               b1,
 +                                                                               pbc_null2,
 +                                                                               getVOrNull(ai),
 +                                                                               getVOrNull(aj),
 +                                                                               getVOrNull(ak),
 +                                                                               getVOrNull(avsite));
                          break;
                      case F_VSITE3OUT:
                          aj = ia[3];
                          ak = ia[4];
                          b1 = ip[tp].vsite.b;
                          c1 = ip[tp].vsite.c;
 -                        constr_vsite3OUT(x[ai], x[aj], x[ak], x[avsite], a1, b1, c1, pbc_null2);
 +                        constr_vsite3OUT<calculatePosition, calculateVelocity>(x[ai],
 +                                                                               x[aj],
 +                                                                               x[ak],
 +                                                                               x[avsite],
 +                                                                               a1,
 +                                                                               b1,
 +                                                                               c1,
 +                                                                               pbc_null2,
 +                                                                               getVOrNull(ai),
 +                                                                               getVOrNull(aj),
 +                                                                               getVOrNull(ak),
 +                                                                               getVOrNull(avsite));
                          break;
                      case F_VSITE4FD:
                          aj = ia[3];
                          al = ia[5];
                          b1 = ip[tp].vsite.b;
                          c1 = ip[tp].vsite.c;
 -                        constr_vsite4FD(x[ai], x[aj], x[ak], x[al], x[avsite], a1, b1, c1, pbc_null2);
 +                        constr_vsite4FD<calculatePosition, calculateVelocity>(x[ai],
 +                                                                              x[aj],
 +                                                                              x[ak],
 +                                                                              x[al],
 +                                                                              x[avsite],
 +                                                                              a1,
 +                                                                              b1,
 +                                                                              c1,
 +                                                                              pbc_null2,
 +                                                                              getVOrNull(ai),
 +                                                                              getVOrNull(aj),
 +                                                                              getVOrNull(ak),
 +                                                                              getVOrNull(al),
 +                                                                              getVOrNull(avsite));
                          break;
                      case F_VSITE4FDN:
                          aj = ia[3];
                          al = ia[5];
                          b1 = ip[tp].vsite.b;
                          c1 = ip[tp].vsite.c;
 -                        constr_vsite4FDN(x[ai], x[aj], x[ak], x[al], x[avsite], a1, b1, c1, pbc_null2);
 +                        constr_vsite4FDN<calculatePosition, calculateVelocity>(x[ai],
 +                                                                               x[aj],
 +                                                                               x[ak],
 +                                                                               x[al],
 +                                                                               x[avsite],
 +                                                                               a1,
 +                                                                               b1,
 +                                                                               c1,
 +                                                                               pbc_null2,
 +                                                                               getVOrNull(ai),
 +                                                                               getVOrNull(aj),
 +                                                                               getVOrNull(ak),
 +                                                                               getVOrNull(al),
 +                                                                               getVOrNull(avsite));
 +                        break;
 +                    case F_VSITEN:
 +                        inc = constr_vsiten<calculatePosition, calculateVelocity>(ia, ip, x, pbc_null2, v);
                          break;
 -                    case F_VSITEN: inc = constr_vsiten(ia, ip, x, pbc_null2); break;
                      default:
                          gmx_fatal(FARGS, "No such vsite type %d in %s, line %d", ftype, __FILE__, __LINE__);
                  }
                      /* Keep the vsite in the same periodic image as before */
                      rvec dx;
                      int  ishift = pbc_dx_aiuc(pbc_null, x[avsite], xv, dx);
 -                    if (ishift != CENTRAL)
 +                    if (ishift != c_centralShiftIndex)
                      {
                          rvec_add(xv, dx, x[avsite]);
                      }
                  }
 -                if (!v.empty())
 -                {
 -                    /* Calculate velocity of vsite... */
 -                    rvec vv;
 -                    rvec_sub(x[avsite], xv, vv);
 -                    svmul(inv_dt, vv, v[avsite]);
 -                }
  
                  /* Increment loop variables */
                  i += inc;
   *
   * \param[in]     threadingInfo  Used to divide work over threads when != nullptr
   * \param[in,out] x   Coordinates to construct vsites for
 - * \param[in]     dt  Time step, needed when v is not empty
   * \param[in,out] v   When not empty, velocities are generated for virtual sites
   * \param[in]     ip  Interaction parameters for all interactions, only vsite parameters are used
   * \param[in]     ilist  The interaction lists, only vsites are used
   * \param[in]     domainInfo  Information about PBC and DD
   * \param[in]     box  Used for PBC when PBC is set in domainInfo
   */
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
  static void construct_vsites(const ThreadingInfo*            threadingInfo,
                               ArrayRef<RVec>                  x,
 -                             real                            dt,
                               ArrayRef<RVec>                  v,
                               ArrayRef<const t_iparams>       ip,
                               ArrayRef<const InteractionList> ilist,
           */
          ivec null_ivec;
          clear_ivec(null_ivec);
 -        pbc_null = set_pbc_dd(&pbc, domainInfo.pbcType_,
 -                              useDomdec ? domainInfo.domdec_->numCells : null_ivec, FALSE, box);
 +        pbc_null = set_pbc_dd(
 +                &pbc, domainInfo.pbcType_, useDomdec ? domainInfo.domdec_->numCells : null_ivec, FALSE, box);
      }
      else
      {
  
      if (useDomdec)
      {
 -        dd_move_x_vsites(*domainInfo.domdec_, box, as_rvec_array(x.data()));
 +        if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +        {
 +            dd_move_x_and_v_vsites(
 +                    *domainInfo.domdec_, box, as_rvec_array(x.data()), as_rvec_array(v.data()));
 +        }
 +        else
 +        {
 +            dd_move_x_vsites(*domainInfo.domdec_, box, as_rvec_array(x.data()));
 +        }
      }
  
      if (threadingInfo == nullptr || threadingInfo->numThreads() == 1)
      {
 -        construct_vsites_thread(x, dt, v, ip, ilist, pbc_null);
 +        construct_vsites_thread<calculatePosition, calculateVelocity>(x, v, ip, ilist, pbc_null);
      }
      else
      {
                  GMX_ASSERT(tData.rangeStart >= 0,
                             "The thread data should be initialized before calling construct_vsites");
  
 -                construct_vsites_thread(x, dt, v, ip, tData.ilist, pbc_null);
 +                construct_vsites_thread<calculatePosition, calculateVelocity>(
 +                        x, v, ip, tData.ilist, pbc_null);
                  if (tData.useInterdependentTask)
                  {
                      /* Here we don't need a barrier (unlike the spreading),
                       * since both tasks only construct vsites from particles,
                       * or local vsites, not from non-local vsites.
                       */
 -                    construct_vsites_thread(x, dt, v, ip, tData.idTask.ilist, pbc_null);
 +                    construct_vsites_thread<calculatePosition, calculateVelocity>(
 +                            x, v, ip, tData.idTask.ilist, pbc_null);
                  }
              }
              GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
          }
          /* Now we can construct the vsites that might depend on other vsites */
 -        construct_vsites_thread(x, dt, v, ip, threadingInfo->threadDataNonLocalDependent().ilist, pbc_null);
 +        construct_vsites_thread<calculatePosition, calculateVelocity>(
 +                x, v, ip, threadingInfo->threadDataNonLocalDependent().ilist, pbc_null);
      }
  }
  
 -void VirtualSitesHandler::Impl::construct(ArrayRef<RVec> x, real dt, ArrayRef<RVec> v, const matrix box) const
 +void VirtualSitesHandler::Impl::construct(ArrayRef<RVec> x,
 +                                          ArrayRef<RVec> v,
 +                                          const matrix   box,
 +                                          VSiteOperation operation) const
  {
 -    construct_vsites(&threadingInfo_, x, dt, v, iparams_, ilists_, domainInfo_, box);
 +    switch (operation)
 +    {
 +        case VSiteOperation::Positions:
 +            construct_vsites<VSiteCalculatePosition::Yes, VSiteCalculateVelocity::No>(
 +                    &threadingInfo_, x, v, iparams_, ilists_, domainInfo_, box);
 +            break;
 +        case VSiteOperation::Velocities:
 +            construct_vsites<VSiteCalculatePosition::No, VSiteCalculateVelocity::Yes>(
 +                    &threadingInfo_, x, v, iparams_, ilists_, domainInfo_, box);
 +            break;
 +        case VSiteOperation::PositionsAndVelocities:
 +            construct_vsites<VSiteCalculatePosition::Yes, VSiteCalculateVelocity::Yes>(
 +                    &threadingInfo_, x, v, iparams_, ilists_, domainInfo_, box);
 +            break;
 +        default: gmx_fatal(FARGS, "Unknown virtual site operation");
 +    }
  }
  
 -void VirtualSitesHandler::construct(ArrayRef<RVec> x, real dt, ArrayRef<RVec> v, const matrix box) const
 +void VirtualSitesHandler::construct(ArrayRef<RVec> x, ArrayRef<RVec> v, const matrix box, VSiteOperation operation) const
  {
 -    impl_->construct(x, dt, v, box);
 +    impl_->construct(x, v, box, operation);
  }
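
A call-site sketch of the new interface (handler and array names are illustrative; the dt argument is gone, and the operation selects what is computed)::

    vsiteHandler->construct(x, v, box, VSiteOperation::PositionsAndVelocities);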
  
  void constructVirtualSites(ArrayRef<RVec> x, ArrayRef<const t_iparams> ip, ArrayRef<const InteractionList> ilist)
  {
      // No PBC, no DD
      const DomainInfo domainInfo;
 -    construct_vsites(nullptr, x, 0, {}, ip, ilist, domainInfo, nullptr);
 +    construct_vsites<VSiteCalculatePosition::Yes, VSiteCalculateVelocity::No>(
 +            nullptr, x, {}, ip, ilist, domainInfo, nullptr);
  }
  
  #ifndef DOXYGEN
@@@ -1313,14 -899,14 +1313,14 @@@ static void spread_vsite2(const t_iato
          }
          else
          {
 -            siv = CENTRAL;
 -            sij = CENTRAL;
 +            siv = c_centralShiftIndex;
 +            sij = c_centralShiftIndex;
          }
  
 -        if (siv != CENTRAL || sij != CENTRAL)
 +        if (siv != c_centralShiftIndex || sij != c_centralShiftIndex)
          {
              rvec_inc(fshift[siv], f[av]);
 -            rvec_dec(fshift[CENTRAL], fi);
 +            rvec_dec(fshift[c_centralShiftIndex], fi);
              rvec_dec(fshift[sij], fj);
          }
      }
@@@ -1343,8 -929,8 +1343,8 @@@ void constructVirtualSitesGlobal(const 
              int atomOffset = mtop.moleculeBlockIndices[mb].globalAtomStart;
              for (int mol = 0; mol < molb.nmol; mol++)
              {
 -                constructVirtualSites(x.subArray(atomOffset, molt.atoms.nr), mtop.ffparams.iparams,
 -                                      molt.ilist);
 +                constructVirtualSites(
 +                        x.subArray(atomOffset, molt.atoms.nr), mtop.ffparams.iparams, molt.ilist);
                  atomOffset += molt.atoms.nr;
              }
          }
@@@ -1403,15 -989,15 +1403,15 @@@ static void spread_vsite2FD(const t_iat
          }
          else
          {
 -            svi = CENTRAL;
 +            svi = c_centralShiftIndex;
          }
  
 -        if (svi != CENTRAL || sji != CENTRAL)
 +        if (svi != c_centralShiftIndex || sji != c_centralShiftIndex)
          {
              rvec_dec(fshift[svi], fv);
 -            fshift[CENTRAL][XX] += fv[XX] - fj[XX];
 -            fshift[CENTRAL][YY] += fv[YY] - fj[YY];
 -            fshift[CENTRAL][ZZ] += fv[ZZ] - fj[ZZ];
 +            fshift[c_centralShiftIndex][XX] += fv[XX] - fj[XX];
 +            fshift[c_centralShiftIndex][YY] += fv[YY] - fj[YY];
 +            fshift[c_centralShiftIndex][ZZ] += fv[ZZ] - fj[ZZ];
              fshift[sji][XX] += fj[XX];
              fshift[sji][YY] += fj[YY];
              fshift[sji][ZZ] += fj[ZZ];
@@@ -1485,15 -1071,15 +1485,15 @@@ static void spread_vsite3(const t_iato
          }
          else
          {
 -            siv = CENTRAL;
 -            sij = CENTRAL;
 -            sik = CENTRAL;
 +            siv = c_centralShiftIndex;
 +            sij = c_centralShiftIndex;
 +            sik = c_centralShiftIndex;
          }
  
 -        if (siv != CENTRAL || sij != CENTRAL || sik != CENTRAL)
 +        if (siv != c_centralShiftIndex || sij != c_centralShiftIndex || sik != c_centralShiftIndex)
          {
              rvec_inc(fshift[siv], f[av]);
 -            rvec_dec(fshift[CENTRAL], fi);
 +            rvec_dec(fshift[c_centralShiftIndex], fi);
              rvec_dec(fshift[sij], fj);
              rvec_dec(fshift[sik], fk);
          }
@@@ -1568,15 -1154,15 +1568,15 @@@ static void spread_vsite3FD(const t_iat
          }
          else
          {
 -            svi = CENTRAL;
 +            svi = c_centralShiftIndex;
          }
  
 -        if (svi != CENTRAL || sji != CENTRAL || skj != CENTRAL)
 +        if (svi != c_centralShiftIndex || sji != c_centralShiftIndex || skj != c_centralShiftIndex)
          {
              rvec_dec(fshift[svi], fv);
 -            fshift[CENTRAL][XX] += fv[XX] - (1 + a) * temp[XX];
 -            fshift[CENTRAL][YY] += fv[YY] - (1 + a) * temp[YY];
 -            fshift[CENTRAL][ZZ] += fv[ZZ] - (1 + a) * temp[ZZ];
 +            fshift[c_centralShiftIndex][XX] += fv[XX] - (1 + a) * temp[XX];
 +            fshift[c_centralShiftIndex][YY] += fv[YY] - (1 + a) * temp[YY];
 +            fshift[c_centralShiftIndex][ZZ] += fv[ZZ] - (1 + a) * temp[ZZ];
              fshift[sji][XX] += temp[XX];
              fshift[sji][YY] += temp[YY];
              fshift[sji][ZZ] += temp[ZZ];
@@@ -1690,15 -1276,15 +1690,15 @@@ static void spread_vsite3FAD(const t_ia
          }
          else
          {
 -            svi = CENTRAL;
 +            svi = c_centralShiftIndex;
          }
  
 -        if (svi != CENTRAL || sji != CENTRAL || skj != CENTRAL)
 +        if (svi != c_centralShiftIndex || sji != c_centralShiftIndex || skj != c_centralShiftIndex)
          {
              rvec_dec(fshift[svi], fv);
 -            fshift[CENTRAL][XX] += fv[XX] - f1[XX] - (1 - c1) * f2[XX] + f3[XX];
 -            fshift[CENTRAL][YY] += fv[YY] - f1[YY] - (1 - c1) * f2[YY] + f3[YY];
 -            fshift[CENTRAL][ZZ] += fv[ZZ] - f1[ZZ] - (1 - c1) * f2[ZZ] + f3[ZZ];
 +            fshift[c_centralShiftIndex][XX] += fv[XX] - f1[XX] - (1 - c1) * f2[XX] + f3[XX];
 +            fshift[c_centralShiftIndex][YY] += fv[YY] - f1[YY] - (1 - c1) * f2[YY] + f3[YY];
 +            fshift[c_centralShiftIndex][ZZ] += fv[ZZ] - f1[ZZ] - (1 - c1) * f2[ZZ] + f3[ZZ];
              fshift[sji][XX] += f1[XX] - c1 * f2[XX] - f3[XX];
              fshift[sji][YY] += f1[YY] - c1 * f2[YY] - f3[YY];
              fshift[sji][ZZ] += f1[ZZ] - c1 * f2[ZZ] - f3[ZZ];
@@@ -1784,15 -1370,15 +1784,15 @@@ static void spread_vsite3OUT(const t_ia
          }
          else
          {
 -            svi = CENTRAL;
 +            svi = c_centralShiftIndex;
          }
  
 -        if (svi != CENTRAL || sji != CENTRAL || ski != CENTRAL)
 +        if (svi != c_centralShiftIndex || sji != c_centralShiftIndex || ski != c_centralShiftIndex)
          {
              rvec_dec(fshift[svi], fv);
 -            fshift[CENTRAL][XX] += fv[XX] - fj[XX] - fk[XX];
 -            fshift[CENTRAL][YY] += fv[YY] - fj[YY] - fk[YY];
 -            fshift[CENTRAL][ZZ] += fv[ZZ] - fj[ZZ] - fk[ZZ];
 +            fshift[c_centralShiftIndex][XX] += fv[XX] - fj[XX] - fk[XX];
 +            fshift[c_centralShiftIndex][YY] += fv[YY] - fj[YY] - fk[YY];
 +            fshift[c_centralShiftIndex][ZZ] += fv[ZZ] - fj[ZZ] - fk[ZZ];
              rvec_inc(fshift[sji], fj);
              rvec_inc(fshift[ski], fk);
          }
@@@ -1886,16 -1472,15 +1886,16 @@@ static void spread_vsite4FD(const t_iat
          }
          else
          {
 -            svi = CENTRAL;
 +            svi = c_centralShiftIndex;
          }
  
 -        if (svi != CENTRAL || sji != CENTRAL || skj != CENTRAL || slj != CENTRAL)
 +        if (svi != c_centralShiftIndex || sji != c_centralShiftIndex || skj != c_centralShiftIndex
 +            || slj != c_centralShiftIndex)
          {
              rvec_dec(fshift[svi], fv);
              for (m = 0; m < DIM; m++)
              {
 -                fshift[CENTRAL][m] += fv[m] - (1 + a + b) * temp[m];
 +                fshift[c_centralShiftIndex][m] += fv[m] - (1 + a + b) * temp[m];
                  fshift[sji][m] += temp[m];
                  fshift[skj][m] += a * temp[m];
                  fshift[slj][m] += b * temp[m];
@@@ -2046,16 -1631,15 +2046,16 @@@ static void spread_vsite4FDN(const t_ia
          }
          else
          {
 -            svi = CENTRAL;
 +            svi = c_centralShiftIndex;
          }
  
 -        if (svi != CENTRAL || sij != CENTRAL || sik != CENTRAL || sil != CENTRAL)
 +        if (svi != c_centralShiftIndex || sij != c_centralShiftIndex || sik != c_centralShiftIndex
 +            || sil != c_centralShiftIndex)
          {
              rvec_dec(fshift[svi], fv);
 -            fshift[CENTRAL][XX] += fv[XX] - fj[XX] - fk[XX] - fl[XX];
 -            fshift[CENTRAL][YY] += fv[YY] - fj[YY] - fk[YY] - fl[YY];
 -            fshift[CENTRAL][ZZ] += fv[ZZ] - fj[ZZ] - fk[ZZ] - fl[ZZ];
 +            fshift[c_centralShiftIndex][XX] += fv[XX] - fj[XX] - fk[XX] - fl[XX];
 +            fshift[c_centralShiftIndex][YY] += fv[YY] - fj[YY] - fk[YY] - fl[YY];
 +            fshift[c_centralShiftIndex][ZZ] += fv[ZZ] - fj[ZZ] - fk[ZZ] - fl[ZZ];
              rvec_inc(fshift[sij], fj);
              rvec_inc(fshift[sik], fk);
              rvec_inc(fshift[sil], fl);
@@@ -2107,16 -1691,16 +2107,16 @@@ static int spread_vsiten(const t_iato
          }
          else
          {
 -            siv = CENTRAL;
 +            siv = c_centralShiftIndex;
          }
          a = ip[ia[i]].vsiten.a;
          svmul(a, f[av], fi);
          rvec_inc(f[ai], fi);
  
 -        if (virialHandling == VirialHandling::Pbc && siv != CENTRAL)
 +        if (virialHandling == VirialHandling::Pbc && siv != c_centralShiftIndex)
          {
              rvec_inc(fshift[siv], fi);
 -            rvec_dec(fshift[CENTRAL], fi);
 +            rvec_dec(fshift[c_centralShiftIndex], fi);
          }
          /* 6 Flops */
      }
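
For reference, all of the spread_vsite* routines above maintain the same invariant: any force booked into a non-central shift cell must be balanced in the central cell, so the total over all shift cells is unchanged. A minimal standalone sketch of that bookkeeping (a toy three-cell setup, not GROMACS code; c_centralShiftIndex here only mirrors the role of the constant used above):

    #include <array>
    #include <cassert>

    int main()
    {
        constexpr int         c_centralShiftIndex = 1; // toy cells: {-1, 0, +1}
        std::array<double, 3> fshift{};                // one scalar "force" per shift cell

        const double fv  = 5.0; // force carried by the virtual site
        const double fj  = 2.0; // share spread to a constructing atom in cell sji
        const int    sji = 2;   // shift cell of that constructing atom
        const int    svi = 0;   // shift cell of the virtual site itself

        fshift[svi] -= fv;                      // the vsite's cell gives up fv
        fshift[c_centralShiftIndex] += fv - fj; // the central cell absorbs the remainder
        fshift[sji] += fj;                      // the atom's cell receives its share

        // The bookkeeping conserves the sum over all shift cells.
        assert(fshift[0] + fshift[1] + fshift[2] == 0.0);
        return 0;
    }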
@@@ -2290,7 -1874,7 +2290,7 @@@ void VirtualSitesHandler::Impl::spreadF
                                               const matrix         box,
                                               gmx_wallcycle*       wcycle)
  {
 -    wallcycle_start(wcycle, ewcVSITESPREAD);
 +    wallcycle_start(wcycle, WallCycleCounter::VsiteSpread);
  
      const bool useDomdec = domainInfo_.useDomdec();
  
          /* This is wasting some CPU time as we now do this multiple times
           * per MD step.
           */
 -        pbc_null = set_pbc_dd(&pbc, domainInfo_.pbcType_,
 -                              useDomdec ? domainInfo_.domdec_->numCells : nullptr, FALSE, box);
 +        pbc_null = set_pbc_dd(
 +                &pbc, domainInfo_.pbcType_, useDomdec ? domainInfo_.domdec_->numCells : nullptr, FALSE, box);
      }
      else
      {
      {
          /* First spread the vsites that might depend on non-local vsites */
          auto& nlDependentVSites = threadingInfo_.threadDataNonLocalDependent();
 -        spreadForceWrapper(x, f, virialHandling, fshift, nlDependentVSites.dxdf, true, iparams_,
 -                           nlDependentVSites.ilist, pbc_null);
 +        spreadForceWrapper(x,
 +                           f,
 +                           virialHandling,
 +                           fshift,
 +                           nlDependentVSites.dxdf,
 +                           true,
 +                           iparams_,
 +                           nlDependentVSites.ilist,
 +                           pbc_null);
  
  #pragma omp parallel num_threads(numThreads)
          {
                      {
                          fshift_t = tData.fshift;
  
 -                        for (int i = 0; i < SHIFTS; i++)
 +                        for (int i = 0; i < c_numShiftVectors; i++)
                          {
                              clear_rvec(fshift_t[i]);
                          }
                      {
                          copy_rvec(f[idTask->vsite[i]], idTask->force[idTask->vsite[i]]);
                      }
 -                    spreadForceWrapper(x, idTask->force, virialHandling, fshift_t, tData.dxdf, true,
 -                                       iparams_, tData.idTask.ilist, pbc_null);
 +                    spreadForceWrapper(x,
 +                                       idTask->force,
 +                                       virialHandling,
 +                                       fshift_t,
 +                                       tData.dxdf,
 +                                       true,
 +                                       iparams_,
 +                                       tData.idTask.ilist,
 +                                       pbc_null);
  
                      /* We need a barrier before reducing forces below
                       * that have been produced by a different thread above.
                  }
  
                  /* Spread the vsites that spread locally only */
 -                spreadForceWrapper(x, f, virialHandling, fshift_t, tData.dxdf, false, iparams_,
 -                                   tData.ilist, pbc_null);
 +                spreadForceWrapper(
 +                        x, f, virialHandling, fshift_t, tData.dxdf, false, iparams_, tData.ilist, pbc_null);
              }
              GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
          }
          {
              for (int th = 1; th < numThreads; th++)
              {
 -                for (int i = 0; i < SHIFTS; i++)
 +                for (int i = 0; i < c_numShiftVectors; i++)
                  {
                      rvec_inc(fshift[i], threadingInfo_.threadData(th).fshift[i]);
                  }
      inc_nrnb(nrnb, eNR_VSITE4FDN, vsite_count(ilists_, F_VSITE4FDN));
      inc_nrnb(nrnb, eNR_VSITEN, vsite_count(ilists_, F_VSITEN));
  
 -    wallcycle_stop(wcycle, ewcVSITESPREAD);
 +    wallcycle_stop(wcycle, WallCycleCounter::VsiteSpread);
  }
  
   /*! \brief Returns an array with group indices for each atom
@@@ -2537,7 -2107,7 +2537,7 @@@ void VirtualSitesHandler::spreadForces(
  }
  
  int countInterUpdategroupVsites(const gmx_mtop_t&                           mtop,
 -                                gmx::ArrayRef<const gmx::RangePartitioning> updateGroupingPerMoleculetype)
 +                                gmx::ArrayRef<const gmx::RangePartitioning> updateGroupingsPerMoleculeType)
  {
      int n_intercg_vsite = 0;
      for (const gmx_molblock_t& molb : mtop.molblock)
          const gmx_moltype_t& molt = mtop.moltype[molb.type];
  
          std::vector<int> atomToGroup;
 -        if (!updateGroupingPerMoleculetype.empty())
 +        if (!updateGroupingsPerMoleculeType.empty())
          {
 -            atomToGroup = makeAtomToGroupMapping(updateGroupingPerMoleculetype[molb.type]);
 +            atomToGroup = makeAtomToGroupMapping(updateGroupingsPerMoleculeType[molb.type]);
          }
          for (int ftype = c_ftypeVsiteStart; ftype < c_ftypeVsiteEnd; ftype++)
          {
  
  std::unique_ptr<VirtualSitesHandler> makeVirtualSitesHandler(const gmx_mtop_t& mtop,
                                                               const t_commrec*  cr,
 -                                                             PbcType           pbcType)
 +                                                             PbcType           pbcType,
 +                                                             ArrayRef<const RangePartitioning> updateGroupingPerMoleculeType)
  {
      GMX_RELEASE_ASSERT(cr != nullptr, "We need a valid commrec");
  
              GMX_ASSERT(ftype >= c_ftypeVsiteStart && ftype < c_ftypeVsiteEnd,
                         "c_ftypeVsiteStart and/or c_ftypeVsiteEnd do not have correct values");
  
 -            nvsite += gmx_mtop_ftype_count(&mtop, ftype);
 +            nvsite += gmx_mtop_ftype_count(mtop, ftype);
          }
          else
          {
          return vsite;
      }
  
 -    return std::make_unique<VirtualSitesHandler>(mtop, cr->dd, pbcType);
 +    return std::make_unique<VirtualSitesHandler>(mtop, cr->dd, pbcType, updateGroupingPerMoleculeType);
  }
  
 -ThreadingInfo::ThreadingInfo() : numThreads_(gmx_omp_nthreads_get(emntVSITE))
 +ThreadingInfo::ThreadingInfo() : numThreads_(gmx_omp_nthreads_get(ModuleMultiThread::VirtualSite))
  {
      if (numThreads_ > 1)
      {
      }
  }
  
 -//! Returns the number of inter update-group vsites
 -static int getNumInterUpdategroupVsites(const gmx_mtop_t& mtop, const gmx_domdec_t* domdec)
 -{
 -    gmx::ArrayRef<const gmx::RangePartitioning> updateGroupingPerMoleculetype;
 -    if (domdec)
 -    {
 -        updateGroupingPerMoleculetype = getUpdateGroupingPerMoleculetype(*domdec);
 -    }
 -
 -    return countInterUpdategroupVsites(mtop, updateGroupingPerMoleculetype);
 -}
 -
 -VirtualSitesHandler::Impl::Impl(const gmx_mtop_t& mtop, gmx_domdec_t* domdec, const PbcType pbcType) :
 -    numInterUpdategroupVirtualSites_(getNumInterUpdategroupVsites(mtop, domdec)),
 +VirtualSitesHandler::Impl::Impl(const gmx_mtop_t&                       mtop,
 +                                gmx_domdec_t*                           domdec,
 +                                const PbcType                           pbcType,
 +                                const ArrayRef<const RangePartitioning> updateGroupingPerMoleculeType) :
 +    numInterUpdategroupVirtualSites_(countInterUpdategroupVsites(mtop, updateGroupingPerMoleculeType)),
      domainInfo_({ pbcType, pbcType != PbcType::No && numInterUpdategroupVirtualSites_ > 0, domdec }),
      iparams_(mtop.ffparams.iparams)
  {
  }
  
 -VirtualSitesHandler::VirtualSitesHandler(const gmx_mtop_t& mtop, gmx_domdec_t* domdec, const PbcType pbcType) :
 -    impl_(new Impl(mtop, domdec, pbcType))
 +VirtualSitesHandler::VirtualSitesHandler(const gmx_mtop_t&                       mtop,
 +                                         gmx_domdec_t*                           domdec,
 +                                         const PbcType                           pbcType,
 +                                         const ArrayRef<const RangePartitioning> updateGroupingPerMoleculeType) :
 +    impl_(new Impl(mtop, domdec, pbcType, updateGroupingPerMoleculeType))
  {
  }
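
A hedged call-site sketch of the widened factory above: the update groupings now flow in explicitly instead of being pulled from the domain decomposition inside the handler. The header path and namespace qualifications are assumptions about the surrounding tree:

    #include <memory>

    #include "gromacs/mdlib/vsite.h" // assumed header for the declarations above

    std::unique_ptr<gmx::VirtualSitesHandler>
    buildVsites(const gmx_mtop_t&                           mtop,
                const t_commrec*                            cr,
                PbcType                                     pbcType,
                gmx::ArrayRef<const gmx::RangePartitioning> updateGroupingsPerMoleculeType)
    {
        // The caller now owns the decision of which update groupings apply;
        // makeVirtualSitesHandler simply forwards them to the implementation.
        return gmx::makeVirtualSitesHandler(mtop, cr, pbcType, updateGroupingsPerMoleculeType);
    }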
  
@@@ -2690,7 -2265,7 +2690,7 @@@ static void assignVsitesToThread(VsiteT
                                   gmx::ArrayRef<int>              taskIndex,
                                   ArrayRef<const InteractionList> ilist,
                                   ArrayRef<const t_iparams>       ip,
 -                                 const unsigned short*           ptype)
 +                                 ArrayRef<const ParticleType>    ptype)
  {
      for (int ftype = c_ftypeVsiteStart; ftype < c_ftypeVsiteEnd; ftype++)
      {
                      /* Do a range check to avoid a harmless race on taskIndex */
                      if (iat[j] < tData->rangeStart || iat[j] >= tData->rangeEnd || taskIndex[iat[j]] != thread)
                      {
 -                        if (!tData->useInterdependentTask || ptype[iat[j]] == eptVSite)
 +                        if (!tData->useInterdependentTask || ptype[iat[j]] == ParticleType::VSite)
                          {
                              /* At least one constructing atom is a vsite
                               * that is not assigned to the same thread.
                      /* Do a range check to avoid a harmless race on taskIndex */
                      if (iat[j] < tData->rangeStart || iat[j] >= tData->rangeEnd || taskIndex[iat[j]] != thread)
                      {
 -                        GMX_ASSERT(ptype[iat[j]] != eptVSite,
 +                        GMX_ASSERT(ptype[iat[j]] != ParticleType::VSite,
                                     "A vsite to be assigned in assignVsitesToThread has a vsite as "
                                     "a constructing atom that does not belong to our task, such "
                                     "vsites should be assigned to the single 'master' task");
  
-                         task = nthread + thread;
+                         if (tData->useInterdependentTask)
+                         {
+                             // Assign to the interdependent task
+                             task = nthread + thread;
+                         }
+                         else
+                         {
+                             // Assign to the separate, non-parallel task
+                             task = 2 * nthread;
+                         }
                      }
                  }
              }
@@@ -2849,9 -2433,7 +2858,9 @@@ static void assignVsitesToSingleTask(Vs
  
  void ThreadingInfo::setVirtualSites(ArrayRef<const InteractionList> ilists,
                                      ArrayRef<const t_iparams>       iparams,
 -                                    const t_mdatoms&                mdatoms,
 +                                    const int                       numAtoms,
 +                                    const int                       homenr,
 +                                    ArrayRef<const ParticleType>    ptype,
                                      const bool                      useDomdec)
  {
      if (numThreads_ <= 1)
           * When assigning vsites to threads, we should take care that the last
           * thread also covers the non-local range.
           */
 -        vsite_atom_range = mdatoms.nr;
 -        natperthread     = (mdatoms.homenr + numThreads_ - 1) / numThreads_;
 +        vsite_atom_range = numAtoms;
 +        natperthread     = (homenr + numThreads_ - 1) / numThreads_;
      }
  
      if (debug)
      {
 -        fprintf(debug, "virtual site thread dist: natoms %d, range %d, natperthread %d\n",
 -                mdatoms.nr, vsite_atom_range, natperthread);
 +        fprintf(debug,
 +                "virtual site thread dist: natoms %d, range %d, natperthread %d\n",
 +                numAtoms,
 +                vsite_atom_range,
 +                natperthread);
      }
  
      /* To simplify the vsite assignment, we make an index which tells us
       * to which task particles, both non-vsites and vsites, are assigned.
       */
 -    taskIndex_.resize(mdatoms.nr);
 +    taskIndex_.resize(numAtoms);
  
      /* Initialize the task index array. Here we assign the non-vsite
       * particles to task=thread, so we easily figure out if vsites
       */
      {
          int thread = 0;
 -        for (int i = 0; i < mdatoms.nr; i++)
 +        for (int i = 0; i < numAtoms; i++)
          {
 -            if (mdatoms.ptype[i] == eptVSite)
 +            if (ptype[i] == ParticleType::VSite)
              {
                  /* vsites are not assigned to a task yet */
                  taskIndex_[i] = -1;
              }
  
              /* To avoid large f_buf allocations of #threads*vsite_atom_range
-              * we don't use task2 with more than 200000 atoms. This doesn't
-              * affect performance, since with such a large range relatively few
-              * vsites will end up in the separate task.
-              * Note that useTask2 should be the same for all threads.
+              * we don't use the interdependent tasks with more than 200000 atoms.
+              * This doesn't affect performance, since with such a large range
+              * relatively few vsites will end up in the separate task.
+              * Note that useInterdependentTask should be the same for all threads.
               */
-             tData.useInterdependentTask = (vsite_atom_range <= 200000);
+             const int c_maxNumLocalAtomsForInterdependentTask = 200000;
+             tData.useInterdependentTask = (vsite_atom_range <= c_maxNumLocalAtomsForInterdependentTask);
              if (tData.useInterdependentTask)
              {
                  size_t              natoms_use_in_vsites = vsite_atom_range;
              else
              {
                  /* The last thread should cover up to the end of the range */
 -                tData.rangeEnd = mdatoms.nr;
 +                tData.rangeEnd = numAtoms;
              }
 -            assignVsitesToThread(&tData, thread, numThreads_, natperthread, taskIndex_, ilists,
 -                                 iparams, mdatoms.ptype);
 +            assignVsitesToThread(
 +                    &tData, thread, numThreads_, natperthread, taskIndex_, ilists, iparams, ptype);
  
              if (tData.useInterdependentTask)
              {
  
      if (debug && numThreads_ > 1)
      {
 -        fprintf(debug, "virtual site useInterdependentTask %d, nuse:\n",
 +        fprintf(debug,
 +                "virtual site useInterdependentTask %d, nuse:\n",
                  static_cast<int>(tData_[0]->useInterdependentTask));
          for (int th = 0; th < numThreads_ + 1; th++)
          {
                  fprintf(debug, "%-20s thread dist:", interaction_function[ftype].longname);
                  for (int th = 0; th < numThreads_ + 1; th++)
                  {
 -                    fprintf(debug, " %4d %4d ", tData_[th]->ilist[ftype].size(),
 +                    fprintf(debug,
 +                            " %4d %4d ",
 +                            tData_[th]->ilist[ftype].size(),
                              tData_[th]->idTask.ilist[ftype].size());
                  }
                  fprintf(debug, "\n");
  }
  
  void VirtualSitesHandler::Impl::setVirtualSites(ArrayRef<const InteractionList> ilists,
 -                                                const t_mdatoms&                mdatoms)
 +                                                const int                       numAtoms,
 +                                                const int                       homenr,
 +                                                ArrayRef<const ParticleType>    ptype)
  {
      ilists_ = ilists;
  
 -    threadingInfo_.setVirtualSites(ilists, iparams_, mdatoms, domainInfo_.useDomdec());
 +    threadingInfo_.setVirtualSites(ilists, iparams_, numAtoms, homenr, ptype, domainInfo_.useDomdec());
  }
  
 -void VirtualSitesHandler::setVirtualSites(ArrayRef<const InteractionList> ilists, const t_mdatoms& mdatoms)
 +void VirtualSitesHandler::setVirtualSites(ArrayRef<const InteractionList> ilists,
 +                                          const int                       numAtoms,
 +                                          const int                       homenr,
 +                                          ArrayRef<const ParticleType>    ptype)
  {
 -    impl_->setVirtualSites(ilists, mdatoms);
 +    impl_->setVirtualSites(ilists, numAtoms, homenr, ptype);
  }
  
  } // namespace gmx
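
The setVirtualSites refactoring replaces the t_mdatoms parameter with the three fields it actually uses. A hedged sketch of the corresponding call-site change (the t_mdatoms member names are assumptions based on the removed code paths above):

    void setVsitesAfterPartitioning(gmx::VirtualSitesHandler*            vsite,
                                    gmx::ArrayRef<const InteractionList> ilists,
                                    const t_mdatoms&                     mdatoms)
    {
        // Pass only what the threading setup needs: the total atom count,
        // the home-atom count, and the per-particle types.
        vsite->setVirtualSites(ilists,
                               mdatoms.nr,
                               mdatoms.homenr,
                               gmx::arrayRefFromArray(mdatoms.ptype, mdatoms.nr));
    }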
index 3bf1d5071cd6d9d21bb56dd62ad182f178488c99,423a90905cbb3eb86779f38271287c9ee0df3912..887e7ba7093ee95bbfddcea4cbdce399c1f34446
  #include "gromacs/domdec/domdec_struct.h"
  #include "gromacs/domdec/gpuhaloexchange.h"
  #include "gromacs/domdec/localatomsetmanager.h"
 +#include "gromacs/domdec/makebondedlinks.h"
  #include "gromacs/domdec/partition.h"
 +#include "gromacs/domdec/reversetopology.h"
  #include "gromacs/ewald/ewald_utils.h"
 +#include "gromacs/ewald/pme.h"
  #include "gromacs/ewald/pme_gpu_program.h"
  #include "gromacs/ewald/pme_only.h"
  #include "gromacs/ewald/pme_pp_comm_gpu.h"
@@@ -84,7 -81,7 +84,7 @@@
  #include "gromacs/hardware/printhardware.h"
  #include "gromacs/imd/imd.h"
  #include "gromacs/listed_forces/disre.h"
 -#include "gromacs/listed_forces/gpubonded.h"
 +#include "gromacs/listed_forces/listed_forces_gpu.h"
  #include "gromacs/listed_forces/listed_forces.h"
  #include "gromacs/listed_forces/orires.h"
  #include "gromacs/math/functions.h"
  #include "gromacs/mdtypes/mdatom.h"
  #include "gromacs/mdtypes/mdrunoptions.h"
  #include "gromacs/mdtypes/observableshistory.h"
 +#include "gromacs/mdtypes/observablesreducer.h"
  #include "gromacs/mdtypes/simulation_workload.h"
  #include "gromacs/mdtypes/state.h"
  #include "gromacs/mdtypes/state_propagator_data_gpu.h"
  #include "gromacs/utility/keyvaluetree.h"
  #include "gromacs/utility/logger.h"
  #include "gromacs/utility/loggerbuilder.h"
 -#include "gromacs/utility/mdmodulenotification.h"
 +#include "gromacs/utility/mdmodulesnotifiers.h"
  #include "gromacs/utility/physicalnodecommunicator.h"
  #include "gromacs/utility/pleasecite.h"
  #include "gromacs/utility/programcontext.h"
  #include "gromacs/utility/smalloc.h"
  #include "gromacs/utility/stringutil.h"
 +#include "gromacs/utility/mpiinfo.h"
  
  #include "isimulator.h"
  #include "membedholder.h"
@@@ -203,77 -198,19 +203,77 @@@ static DevelopmentFeatureFlags manageDe
  {
      DevelopmentFeatureFlags devFlags;
  
 -    // Some builds of GCC 5 give false positive warnings that these
 -    // getenv results are ignored when clearly they are used.
 -#pragma GCC diagnostic push
 -#pragma GCC diagnostic ignored "-Wunused-result"
 -
      devFlags.enableGpuBufferOps =
              GMX_GPU_CUDA && useGpuForNonbonded && (getenv("GMX_USE_GPU_BUFFER_OPS") != nullptr);
 -    devFlags.enableGpuHaloExchange = GMX_GPU_CUDA && GMX_THREAD_MPI && getenv("GMX_GPU_DD_COMMS") != nullptr;
 +    devFlags.enableGpuHaloExchange = GMX_MPI && GMX_GPU_CUDA && getenv("GMX_GPU_DD_COMMS") != nullptr;
      devFlags.forceGpuUpdateDefault = (getenv("GMX_FORCE_UPDATE_DEFAULT_GPU") != nullptr) || GMX_FAHCORE;
 -    devFlags.enableGpuPmePPComm =
 -            GMX_GPU_CUDA && GMX_THREAD_MPI && getenv("GMX_GPU_PME_PP_COMMS") != nullptr;
 +    devFlags.enableGpuPmePPComm = GMX_MPI && GMX_GPU_CUDA && getenv("GMX_GPU_PME_PP_COMMS") != nullptr;
 +
 +    // The direct GPU communication paths require CUDA-aware MPI;
 +    // make sure the underlying MPI implementation supports it.
 +    if (!GMX_THREAD_MPI && (devFlags.enableGpuPmePPComm || devFlags.enableGpuHaloExchange))
 +    {
 +        const bool haveDetectedCudaAwareMpi =
 +                (checkMpiCudaAwareSupport() == CudaAwareMpiStatus::Supported);
 +        const bool forceCudaAwareMpi = (getenv("GMX_FORCE_CUDA_AWARE_MPI") != nullptr);
 +
 +        if (!haveDetectedCudaAwareMpi && forceCudaAwareMpi)
 +        {
 +            // CUDA-aware support was not detected in the MPI library, but the user has forced its use
 +            GMX_LOG(mdlog.warning)
 +                    .asParagraph()
 +                    .appendTextFormatted(
 +                            "This run has forced use of 'CUDA-aware MPI'. "
 +                            "However, GROMACS cannot determine whether the underlying MPI "
 +                            "is CUDA-aware. GROMACS recommends the latest OpenMPI version "
 +                            "for CUDA-aware support. "
 +                            "If you observe failures at runtime, try unsetting the "
 +                            "GMX_FORCE_CUDA_AWARE_MPI environment variable.");
 +        }
 +
 +        if (haveDetectedCudaAwareMpi || forceCudaAwareMpi)
 +        {
 +            devFlags.usingCudaAwareMpi = true;
 +            GMX_LOG(mdlog.warning)
 +                    .asParagraph()
 +                    .appendTextFormatted(
 +                            "Using CUDA-aware MPI for the 'GPU halo exchange' or 'GPU PME-PP "
 +                            "communications' features.");
 +        }
 +        else
 +        {
 +            if (devFlags.enableGpuHaloExchange)
 +            {
 +                GMX_LOG(mdlog.warning)
 +                        .asParagraph()
 +                        .appendTextFormatted(
 +                                "GMX_GPU_DD_COMMS environment variable detected, but the 'GPU "
 +                                "halo exchange' feature will not be enabled because GROMACS could "
 +                                "not detect CUDA-aware support in the underlying MPI implementation.");
 +                devFlags.enableGpuHaloExchange = false;
 +            }
 +            if (devFlags.enableGpuPmePPComm)
 +            {
 +                GMX_LOG(mdlog.warning)
 +                        .asParagraph()
 +                        .appendText(
 +                                "GMX_GPU_PME_PP_COMMS environment variable detected, but the "
 +                                "'GPU PME-PP communications' feature will not be enabled because "
 +                                "GROMACS could not detect CUDA-aware support in the underlying "
 +                                "MPI implementation.");
 +                devFlags.enableGpuPmePPComm = false;
 +            }
  
 -#pragma GCC diagnostic pop
 +            GMX_LOG(mdlog.warning)
 +                    .asParagraph()
 +                    .appendTextFormatted(
 +                            "GROMACS recommends the latest OpenMPI version for CUDA-aware "
 +                            "support. "
 +                            "If you are certain about CUDA-aware support in your MPI library, "
 +                            "you can force its use by setting the environment variable "
 +                            "GMX_FORCE_CUDA_AWARE_MPI.");
 +        }
 +    }
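
The block above follows the usual pattern for development feature flags: a feature is enabled only when the build supports it and the user opts in via an environment variable, and the runtime capability check can veto the opt-in or be overridden (here via GMX_FORCE_CUDA_AWARE_MPI). A standalone toy sketch of that pattern (not GROMACS code):

    #include <cstdlib>

    static bool envSet(const char* name)
    {
        return std::getenv(name) != nullptr;
    }

    int main()
    {
        const bool buildSupportsFeature  = true; // stands in for GMX_MPI && GMX_GPU_CUDA
        bool       enableGpuHaloExchange = buildSupportsFeature && envSet("GMX_GPU_DD_COMMS");

        const bool mpiIsCudaAware = false; // stands in for the runtime capability check
        const bool forced         = envSet("GMX_FORCE_CUDA_AWARE_MPI");
        if (enableGpuHaloExchange && !(mpiIsCudaAware || forced))
        {
            enableGpuHaloExchange = false; // the capability check vetoes the opt-in
        }
        return enableGpuHaloExchange ? 0 : 1;
    }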
  
      if (devFlags.enableGpuBufferOps)
      {
@@@ -435,7 -372,7 +435,7 @@@ static void mdrunner_start_fn(const voi
  {
      try
      {
 -        auto masterMdrunner = reinterpret_cast<const gmx::Mdrunner*>(arg);
 +        const auto* masterMdrunner = reinterpret_cast<const gmx::Mdrunner*>(arg);
          /* copy the arg list to make sure that it's thread-local. This
             doesn't copy pointed-to items, of course; fnm, cr and fplog
             are reset in the call below, all others should be const. */
@@@ -451,7 -388,8 +451,7 @@@ void Mdrunner::spawnThreads(int numThre
  #if GMX_THREAD_MPI
      /* now spawn new threads that start mdrunner_start_fn(), while
         the main thread returns. Thread affinity is handled later. */
 -    if (tMPI_Init_fn(TRUE, numThreadsToLaunch, TMPI_AFFINITY_NONE, mdrunner_start_fn,
 -                     static_cast<const void*>(this))
 +    if (tMPI_Init_fn(TRUE, numThreadsToLaunch, TMPI_AFFINITY_NONE, mdrunner_start_fn, static_cast<const void*>(this))
          != TMPI_SUCCESS)
      {
          GMX_THROW(gmx::InternalError("Failed to spawn thread-MPI threads"));
@@@ -475,14 -413,14 +475,14 @@@ static void prepare_verlet_scheme(FILE
                                    t_commrec*          cr,
                                    t_inputrec*         ir,
                                    int                 nstlist_cmdline,
 -                                  const gmx_mtop_t*   mtop,
 +                                  const gmx_mtop_t&   mtop,
                                    const matrix        box,
                                    bool                makeGpuPairList,
                                    const gmx::CpuInfo& cpuinfo)
  {
      // We checked the cut-offs in grompp, but double-check here.
      // We have PME+LJcutoff kernels for rcoulomb>rvdw.
 -    if (EEL_PME_EWALD(ir->coulombtype) && ir->vdwtype == eelCUT)
 +    if (EEL_PME_EWALD(ir->coulombtype) && ir->vdwtype == VanDerWaalsType::Cut)
      {
          GMX_RELEASE_ASSERT(ir->rcoulomb >= ir->rvdw,
                             "With Verlet lists and PME we should have rcoulomb>=rvdw");
                             "With Verlet lists and no PME rcoulomb and rvdw should be identical");
      }
      /* For NVE simulations, we will retain the initial list buffer */
 -    if (EI_DYNAMICS(ir->eI) && ir->verletbuf_tol > 0 && !(EI_MD(ir->eI) && ir->etc == etcNO))
 +    if (EI_DYNAMICS(ir->eI) && ir->verletbuf_tol > 0
 +        && !(EI_MD(ir->eI) && ir->etc == TemperatureCoupling::No))
      {
          /* Update the Verlet buffer size for the current run setup */
  
          VerletbufListSetup listSetup = verletbufGetSafeListSetup(listType);
  
          const real rlist_new =
 -                calcVerletBufferSize(*mtop, det(box), *ir, ir->nstlist, ir->nstlist - 1, -1, listSetup);
 +                calcVerletBufferSize(mtop, det(box), *ir, ir->nstlist, ir->nstlist - 1, -1, listSetup);
  
          if (rlist_new != ir->rlist)
          {
              {
                  fprintf(fplog,
                          "\nChanging rlist from %g to %g for non-bonded %dx%d atom kernels\n\n",
 -                        ir->rlist, rlist_new, listSetup.cluster_size_i, listSetup.cluster_size_j);
 +                        ir->rlist,
 +                        rlist_new,
 +                        listSetup.cluster_size_i,
 +                        listSetup.cluster_size_j);
              }
              ir->rlist = rlist_new;
          }
  
      if (nstlist_cmdline > 0 && (!EI_DYNAMICS(ir->eI) || ir->verletbuf_tol <= 0))
      {
 -        gmx_fatal(FARGS, "Can not set nstlist without %s",
 +        gmx_fatal(FARGS,
 +                  "Can not set nstlist without %s",
                    !EI_DYNAMICS(ir->eI) ? "dynamics" : "verlet-buffer-tolerance");
      }
  
      if (EI_DYNAMICS(ir->eI))
      {
          /* Set or try nstlist values */
 -        increaseNstlist(fplog, cr, ir, nstlist_cmdline, mtop, box, makeGpuPairList, cpuinfo);
 +        increaseNstlist(fplog, cr, ir, nstlist_cmdline, &mtop, box, makeGpuPairList, cpuinfo);
      }
  }
  
@@@ -557,13 -490,11 +557,13 @@@ static void override_nsteps_cmdline(con
          {
              sprintf(sbuf_msg,
                      "Overriding nsteps with value passed on the command line: %s steps, %.3g ps",
 -                    gmx_step_str(nsteps_cmdline, sbuf_steps), fabs(nsteps_cmdline * ir->delta_t));
 +                    gmx_step_str(nsteps_cmdline, sbuf_steps),
 +                    fabs(nsteps_cmdline * ir->delta_t));
          }
          else
          {
 -            sprintf(sbuf_msg, "Overriding nsteps with value passed on the command line: %s steps",
 +            sprintf(sbuf_msg,
 +                    "Overriding nsteps with value passed on the command line: %s steps",
                      gmx_step_str(nsteps_cmdline, sbuf_steps));
          }
  
@@@ -659,9 -590,9 +659,9 @@@ static TaskTarget findTaskTarget(const 
  static void finish_run(FILE*                     fplog,
                         const gmx::MDLogger&      mdlog,
                         const t_commrec*          cr,
 -                       const t_inputrec*         inputrec,
 +                       const t_inputrec&         inputrec,
                         t_nrnb                    nrnb[],
 -                       gmx_wallcycle_t           wcycle,
 +                       gmx_wallcycle*            wcycle,
                         gmx_walltime_accounting_t walltime_accounting,
                         nonbonded_verlet_t*       nbv,
                         const gmx_pme_t*          pme,
         Further, we only report performance for dynamical integrators,
         because those are the only ones for which we plan to
         consider doing any optimizations. */
 -    bool printReport = EI_DYNAMICS(inputrec->eI) && SIMMASTER(cr);
 +    bool printReport = EI_DYNAMICS(inputrec.eI) && SIMMASTER(cr);
  
      if (printReport && !walltime_accounting_get_valid_finish(walltime_accounting))
      {
          nrnbTotalStorage = std::make_unique<t_nrnb>();
          nrnb_tot         = nrnbTotalStorage.get();
  #if GMX_MPI
 -        MPI_Allreduce(nrnb->n, nrnb_tot->n, eNRNB, MPI_DOUBLE, MPI_SUM, cr->mpi_comm_mysim);
 +        MPI_Allreduce(nrnb->n.data(), nrnb_tot->n.data(), eNRNB, MPI_DOUBLE, MPI_SUM, cr->mpi_comm_mysim);
  #endif
      }
      else
      {
  #if GMX_MPI
          /* reduce elapsed_time over all MPI ranks in the current simulation */
 -        MPI_Allreduce(&elapsed_time, &elapsed_time_over_all_ranks, 1, MPI_DOUBLE, MPI_SUM,
 -                      cr->mpi_comm_mysim);
 +        MPI_Allreduce(&elapsed_time, &elapsed_time_over_all_ranks, 1, MPI_DOUBLE, MPI_SUM, cr->mpi_comm_mysim);
          elapsed_time_over_all_ranks /= cr->nnodes;
          /* Reduce elapsed_time_over_all_threads over all MPI ranks in the
           * current simulation. */
 -        MPI_Allreduce(&elapsed_time_over_all_threads, &elapsed_time_over_all_threads_over_all_ranks,
 -                      1, MPI_DOUBLE, MPI_SUM, cr->mpi_comm_mysim);
 +        MPI_Allreduce(&elapsed_time_over_all_threads,
 +                      &elapsed_time_over_all_threads_over_all_ranks,
 +                      1,
 +                      MPI_DOUBLE,
 +                      MPI_SUM,
 +                      cr->mpi_comm_mysim);
  #endif
      }
      else
          print_flop(fplog, nrnb_tot, &nbfs, &mflop);
      }
  
 -    if (thisRankHasDuty(cr, DUTY_PP) && DOMAINDECOMP(cr))
 +    if (thisRankHasDuty(cr, DUTY_PP) && haveDDAtomOrdering(*cr))
      {
          print_dd_statistics(cr, inputrec, fplog);
      }
       * to the code that handled the thread region, so that there's a
       * mechanism to keep cycle counting working during the transition
       * to task parallelism. */
 -    int nthreads_pp  = gmx_omp_nthreads_get(emntNonbonded);
 -    int nthreads_pme = gmx_omp_nthreads_get(emntPME);
 -    wallcycle_scale_by_num_threads(wcycle, thisRankHasDuty(cr, DUTY_PME) && !thisRankHasDuty(cr, DUTY_PP),
 -                                   nthreads_pp, nthreads_pme);
 +    int nthreads_pp  = gmx_omp_nthreads_get(ModuleMultiThread::Nonbonded);
 +    int nthreads_pme = gmx_omp_nthreads_get(ModuleMultiThread::Pme);
 +    wallcycle_scale_by_num_threads(
 +            wcycle, thisRankHasDuty(cr, DUTY_PME) && !thisRankHasDuty(cr, DUTY_PP), nthreads_pp, nthreads_pme);
      auto cycle_sum(wallcycle_sum(cr, wcycle));
  
      if (printReport)
      {
 -        auto nbnxn_gpu_timings =
 +        auto* nbnxn_gpu_timings =
                  (nbv != nullptr && nbv->useGpu()) ? Nbnxm::gpu_get_timings(nbv->gpu_nbv) : nullptr;
          gmx_wallclock_gpu_pme_t pme_gpu_timings = {};
  
          {
              pme_gpu_get_timings(pme, &pme_gpu_timings);
          }
 -        wallcycle_print(fplog, mdlog, cr->nnodes, cr->npmenodes, nthreads_pp, nthreads_pme,
 -                        elapsed_time_over_all_ranks, wcycle, cycle_sum, nbnxn_gpu_timings,
 +        wallcycle_print(fplog,
 +                        mdlog,
 +                        cr->nnodes,
 +                        cr->npmenodes,
 +                        nthreads_pp,
 +                        nthreads_pme,
 +                        elapsed_time_over_all_ranks,
 +                        wcycle,
 +                        cycle_sum,
 +                        nbnxn_gpu_timings,
                          &pme_gpu_timings);
  
 -        if (EI_DYNAMICS(inputrec->eI))
 +        if (EI_DYNAMICS(inputrec.eI))
          {
 -            delta_t = inputrec->delta_t;
 +            delta_t = inputrec.delta_t;
          }
  
          if (fplog)
          {
 -            print_perf(fplog, elapsed_time_over_all_threads_over_all_ranks, elapsed_time_over_all_ranks,
 +            print_perf(fplog,
 +                       elapsed_time_over_all_threads_over_all_ranks,
 +                       elapsed_time_over_all_ranks,
                         walltime_accounting_get_nsteps_done_since_reset(walltime_accounting),
 -                       delta_t, nbfs, mflop);
 +                       delta_t,
 +                       nbfs,
 +                       mflop);
          }
          if (bWriteStat)
          {
 -            print_perf(stderr, elapsed_time_over_all_threads_over_all_ranks, elapsed_time_over_all_ranks,
 +            print_perf(stderr,
 +                       elapsed_time_over_all_threads_over_all_ranks,
 +                       elapsed_time_over_all_ranks,
                         walltime_accounting_get_nsteps_done_since_reset(walltime_accounting),
 -                       delta_t, nbfs, mflop);
 +                       delta_t,
 +                       nbfs,
 +                       mflop);
          }
      }
  }
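
The reductions in finish_run average per-rank wall times with MPI_Allreduce. A minimal standalone MPI sketch of that idiom (compile with an MPI wrapper such as mpicxx; the timed work here is a placeholder):

    #include <cstdio>

    #include <mpi.h>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int nranks = 1;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const double t0 = MPI_Wtime();
        // ... per-rank work would run here ...
        double elapsed = MPI_Wtime() - t0;

        double summed = 0.0;
        MPI_Allreduce(&elapsed, &summed, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        std::printf("average elapsed over %d ranks: %g s\n", nranks, summed / nranks);

        MPI_Finalize();
        return 0;
    }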
  
  int Mdrunner::mdrunner()
  {
 -    matrix                    box;
 -    t_forcerec*               fr               = nullptr;
 -    real                      ewaldcoeff_q     = 0;
 -    real                      ewaldcoeff_lj    = 0;
 -    int                       nChargePerturbed = -1, nTypePerturbed = 0;
 -    gmx_wallcycle_t           wcycle;
 -    gmx_walltime_accounting_t walltime_accounting = nullptr;
 -    MembedHolder              membedHolder(filenames.size(), filenames.data());
 +    matrix                      box;
 +    std::unique_ptr<t_forcerec> fr;
 +    real                        ewaldcoeff_q     = 0;
 +    real                        ewaldcoeff_lj    = 0;
 +    int                         nChargePerturbed = -1, nTypePerturbed = 0;
 +    gmx_walltime_accounting_t   walltime_accounting = nullptr;
 +    MembedHolder                membedHolder(filenames.size(), filenames.data());
  
      /* CAUTION: threads may be started later on in this function, so
         cr doesn't reflect the final parallel state right now */
  
      gmx_print_detected_hardware(fplog, isSimulationMasterRank && isMasterSim(ms), mdlog, hwinfo_);
  
 -    std::vector<int> gpuIdsToUse = makeGpuIdsToUse(hwinfo_->deviceInfoList, hw_opt.gpuIdsAvailable);
 -    const int        numDevicesToUse = gmx::ssize(gpuIdsToUse);
 +    std::vector<int> availableDevices =
 +            makeListOfAvailableDevices(hwinfo_->deviceInfoList, hw_opt.devicesSelectedByUser);
 +    const int numAvailableDevices = gmx::ssize(availableDevices);
  
      // Print citation requests after all software/hardware printing
      pleaseCiteGromacs(fplog);
          /* Read (nearly) all data required for the simulation
           * and keep the partly serialized tpr contents to send to other ranks later
           */
 -        applyGlobalSimulationState(*inputHolder_.get(), partialDeserializedTpr.get(),
 -                                   globalState.get(), inputrec.get(), &mtop);
 +        applyGlobalSimulationState(
 +                *inputHolder_.get(), partialDeserializedTpr.get(), globalState.get(), inputrec.get(), &mtop);
      }
  
      /* Check and update the hardware options for internal consistency */
 -    checkAndUpdateHardwareOptions(mdlog, &hw_opt, isSimulationMasterRank, domdecOptions.numPmeRanks,
 -                                  inputrec.get());
 +    checkAndUpdateHardwareOptions(
 +            mdlog, &hw_opt, isSimulationMasterRank, domdecOptions.numPmeRanks, inputrec.get());
  
      if (GMX_THREAD_MPI && isSimulationMasterRank)
      {
              // the number of GPUs to choose the number of ranks.
              auto canUseGpuForNonbonded = buildSupportsNonbondedOnGpu(nullptr);
              useGpuForNonbonded         = decideWhetherToUseGpusForNonbondedWithThreadMpi(
 -                    nonbondedTarget, numDevicesToUse, userGpuTaskAssignment, emulateGpuNonbonded,
 +                    nonbondedTarget,
 +                    numAvailableDevices > 0,
 +                    userGpuTaskAssignment,
 +                    emulateGpuNonbonded,
                      canUseGpuForNonbonded,
                      gpuAccelerationOfNonbondedIsUseful(mdlog, *inputrec, GMX_THREAD_MPI),
                      hw_opt.nthreads_tmpi);
 -            useGpuForPme = decideWhetherToUseGpusForPmeWithThreadMpi(
 -                    useGpuForNonbonded, pmeTarget, pmeFftTarget, numDevicesToUse, userGpuTaskAssignment,
 -                    *hwinfo_, *inputrec, hw_opt.nthreads_tmpi, domdecOptions.numPmeRanks);
 +            useGpuForPme = decideWhetherToUseGpusForPmeWithThreadMpi(useGpuForNonbonded,
 +                                                                     pmeTarget,
++                                                                     pmeFftTarget,
 +                                                                     numAvailableDevices,
 +                                                                     userGpuTaskAssignment,
 +                                                                     *hwinfo_,
 +                                                                     *inputrec,
 +                                                                     hw_opt.nthreads_tmpi,
 +                                                                     domdecOptions.numPmeRanks);
          }
          GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
  
           * TODO Over-writing the user-supplied value here does
           * prevent any possible subsequent checks from working
           * correctly. */
 -        hw_opt.nthreads_tmpi =
 -                get_nthreads_mpi(hwinfo_, &hw_opt, numDevicesToUse, useGpuForNonbonded, useGpuForPme,
 -                                 inputrec.get(), &mtop, mdlog, membedHolder.doMembed());
 +        hw_opt.nthreads_tmpi = get_nthreads_mpi(hwinfo_,
 +                                                &hw_opt,
 +                                                numAvailableDevices,
 +                                                useGpuForNonbonded,
 +                                                useGpuForPme,
 +                                                inputrec.get(),
 +                                                mtop,
 +                                                mdlog,
 +                                                membedHolder.doMembed());
  
          // Now start the threads for thread MPI.
          spawnThreads(hw_opt.nthreads_tmpi);
          // master and spawned threads joins at the end of this block.
      }
  
 -    GMX_RELEASE_ASSERT(ms || simulationCommunicator != MPI_COMM_NULL,
 +    GMX_RELEASE_ASSERT(!GMX_MPI || ms || simulationCommunicator != MPI_COMM_NULL,
                         "Must have valid communicator unless running a multi-simulation");
      CommrecHandle crHandle = init_commrec(simulationCommunicator);
      t_commrec*    cr       = crHandle.get();
              // On non-master ranks, allocate the object that will receive data in the following call.
              inputrec = std::make_unique<t_inputrec>();
          }
 -        init_parallel(cr->mpiDefaultCommunicator, MASTER(cr), inputrec.get(), &mtop,
 +        init_parallel(cr->mpiDefaultCommunicator,
 +                      MASTER(cr),
 +                      inputrec.get(),
 +                      &mtop,
                        partialDeserializedTpr.get());
      }
      GMX_RELEASE_ASSERT(inputrec != nullptr, "All ranks should have a valid inputrec now");
      partialDeserializedTpr.reset(nullptr);
  
 +    GMX_RELEASE_ASSERT(
 +            !inputrec->useConstantAcceleration,
 +            "Linear acceleration has been removed in GROMACS 2022, and was broken for many years "
 +            "before that. Use GROMACS 4.5 or earlier if you need this feature.");
 +
 +    // Now we decide whether to use the domain decomposition machinery.
 +    // Note that this does not necessarily imply actually using multiple domains.
      // Now the number of ranks is known to all ranks, and each knows
      // the inputrec read by the master rank. The ranks can now all run
      // the task-deciding functions and will agree on the result
      // without needing to communicate.
 -    const bool useDomainDecomposition = (PAR(cr) && !(EI_TPI(inputrec->eI) || inputrec->eI == eiNM));
 +    // The LBFGS minimizer, test-particle insertion, normal modes and shell dynamics don't support DD
 +    const bool useDomainDecomposition =
 +            !(inputrec->eI == IntegrationAlgorithm::LBFGS || EI_TPI(inputrec->eI)
 +              || inputrec->eI == IntegrationAlgorithm::NM
 +              || gmx_mtop_particletype_count(mtop)[ParticleType::Shell] > 0);
  
      // Note that these variables describe only their own node.
      //
          // assignment.
          auto canUseGpuForNonbonded = buildSupportsNonbondedOnGpu(nullptr);
          useGpuForNonbonded         = decideWhetherToUseGpusForNonbonded(
 -                nonbondedTarget, userGpuTaskAssignment, emulateGpuNonbonded, canUseGpuForNonbonded,
 -                gpuAccelerationOfNonbondedIsUseful(mdlog, *inputrec, !GMX_THREAD_MPI), gpusWereDetected);
 -        useGpuForPme = decideWhetherToUseGpusForPme(
 -                useGpuForNonbonded, pmeTarget, pmeFftTarget, userGpuTaskAssignment, *hwinfo_,
 -                *inputrec, cr->sizeOfDefaultCommunicator, domdecOptions.numPmeRanks, gpusWereDetected);
 -        useGpuForBonded = decideWhetherToUseGpusForBonded(useGpuForNonbonded, useGpuForPme,
 -                                                          bondedTarget, *inputrec, mtop,
 -                                                          domdecOptions.numPmeRanks, gpusWereDetected);
 +                nonbondedTarget,
 +                userGpuTaskAssignment,
 +                emulateGpuNonbonded,
 +                canUseGpuForNonbonded,
 +                gpuAccelerationOfNonbondedIsUseful(mdlog, *inputrec, !GMX_THREAD_MPI),
 +                gpusWereDetected);
 +        useGpuForPme    = decideWhetherToUseGpusForPme(useGpuForNonbonded,
 +                                                    pmeTarget,
++                                                    pmeFftTarget,
 +                                                    userGpuTaskAssignment,
 +                                                    *hwinfo_,
 +                                                    *inputrec,
 +                                                    cr->sizeOfDefaultCommunicator,
 +                                                    domdecOptions.numPmeRanks,
 +                                                    gpusWereDetected);
 +        useGpuForBonded = decideWhetherToUseGpusForBonded(
 +                useGpuForNonbonded, useGpuForPme, bondedTarget, *inputrec, mtop, domdecOptions.numPmeRanks, gpusWereDetected);
      }
      GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
  
      const DevelopmentFeatureFlags devFlags =
              manageDevelopmentFeatures(mdlog, useGpuForNonbonded, pmeRunMode);
  
 -    const bool useModularSimulator =
 -            checkUseModularSimulator(false, inputrec.get(), doRerun, mtop, ms, replExParams,
 -                                     nullptr, doEssentialDynamics, membedHolder.doMembed());
 +    const bool useModularSimulator = checkUseModularSimulator(false,
 +                                                              inputrec.get(),
 +                                                              doRerun,
 +                                                              mtop,
 +                                                              ms,
 +                                                              replExParams,
 +                                                              nullptr,
 +                                                              doEssentialDynamics,
 +                                                              membedHolder.doMembed());
 +
 +    ObservablesReducerBuilder observablesReducerBuilder;
  
      // Build restraints.
      // TODO: hide restraint implementation details from Mdrunner.
  
      // TODO: Error handling
      mdModules_->assignOptionsToModules(*inputrec->params, nullptr);
 -    // now that the MdModules know their options, they know which callbacks to sign up to
 +    // now that the MDModules know their options, they know which callbacks to sign up to
      mdModules_->subscribeToSimulationSetupNotifications();
 -    const auto& mdModulesNotifier = mdModules_->notifier().simulationSetupNotifications_;
 +    const auto& setupNotifier = mdModules_->notifiers().simulationSetupNotifier_;
 +
 +    // Notify MdModules of existing logger
 +    setupNotifier.notify(mdlog);
  
 +    // Notify MdModules of internal parameters, saved into KVT
      if (inputrec->internalParameters != nullptr)
      {
 -        mdModulesNotifier.notify(*inputrec->internalParameters);
 +        setupNotifier.notify(*inputrec->internalParameters);
 +    }
 +
 +    // Let MdModules know the .tpr filename
 +    {
 +        gmx::MdRunInputFilename mdRunInputFilename = { ftp2fn(efTPR, filenames.size(), filenames.data()) };
 +        setupNotifier.notify(mdRunInputFilename);
      }
  
      if (fplog != nullptr)
      if (SIMMASTER(cr))
      {
          /* In rerun, set velocities to zero if present */
 -        if (doRerun && ((globalState->flags & (1 << estV)) != 0))
 +        if (doRerun && ((globalState->flags & enumValueToBitMask(StateEntry::V)) != 0))
          {
              // rerun does not use velocities
              GMX_LOG(mdlog.info)
              {
                  clear_rvec(globalState->v[i]);
              }
 -            globalState->flags &= ~(1 << estV);
 +            globalState->flags &= ~enumValueToBitMask(StateEntry::V);
          }
  
          /* now make sure the state is initialized and propagated */
      /* NM and TPI parallelize over force/energy calculations, not atoms,
       * so we need to initialize and broadcast the global state.
       */
 -    if (inputrec->eI == eiNM || inputrec->eI == eiTPI)
 +    if (inputrec->eI == IntegrationAlgorithm::NM || inputrec->eI == IntegrationAlgorithm::TPI)
      {
          if (!MASTER(cr))
          {
              globalState = std::make_unique<t_state>();
          }
 -        broadcastStateWithoutDynamics(cr->mpiDefaultCommunicator, DOMAINDECOMP(cr), PAR(cr),
 -                                      globalState.get());
 +        broadcastStateWithoutDynamics(
 +                cr->mpiDefaultCommunicator, haveDDAtomOrdering(*cr), PAR(cr), globalState.get());
      }
  
      /* A parallel command line option consistency check that we can
  #endif
      }
  
 -    if (doRerun && (EI_ENERGY_MINIMIZATION(inputrec->eI) || eiNM == inputrec->eI))
 +    if (doRerun && (EI_ENERGY_MINIMIZATION(inputrec->eI) || IntegrationAlgorithm::NM == inputrec->eI))
      {
          gmx_fatal(FARGS,
                    "The .mdp file specified an energy mininization or normal mode algorithm, and "
                    "these are not compatible with mdrun -rerun");
      }
  
 -    if (!(EEL_PME(inputrec->coulombtype) || EVDW_PME(inputrec->vdwtype)))
 -    {
 -        if (domdecOptions.numPmeRanks > 0)
 -        {
 -            gmx_fatal_collective(FARGS, cr->mpiDefaultCommunicator, MASTER(cr),
 -                                 "PME-only ranks are requested, but the system does not use PME "
 -                                 "for electrostatics or LJ");
 -        }
 -
 -        domdecOptions.numPmeRanks = 0;
 -    }
 -
 -    if (useGpuForNonbonded && domdecOptions.numPmeRanks < 0)
 -    {
 -        /* With NB GPUs we don't automatically use PME-only CPU ranks. PME ranks can
 -         * improve performance with many threads per GPU, since our OpenMP
 -         * scaling is bad, but it's difficult to automate the setup.
 -         */
 -        domdecOptions.numPmeRanks = 0;
 -    }
 -    if (useGpuForPme)
 -    {
 -        if (domdecOptions.numPmeRanks < 0)
 -        {
 -            domdecOptions.numPmeRanks = 0;
 -            // TODO possibly print a note that one can opt-in for a separate PME GPU rank?
 -        }
 -        else
 -        {
 -            GMX_RELEASE_ASSERT(domdecOptions.numPmeRanks <= 1,
 -                               "PME GPU decomposition is not supported");
 -        }
 -    }
 -
      /* NMR restraints must be initialized before load_checkpoint,
       * since with time averaging the history is added to t_state.
       * For proper consistency check we therefore need to extend
      /* This needs to be called before read_checkpoint to extend the state */
      t_disresdata* disresdata;
      snew(disresdata, 1);
 -    init_disres(fplog, &mtop, inputrec.get(), DisResRunMode::MDRun,
 +    init_disres(fplog,
 +                mtop,
 +                inputrec.get(),
 +                DisResRunMode::MDRun,
                  MASTER(cr) ? DDRole::Master : DDRole::Agent,
 -                PAR(cr) ? NumRanks::Multiple : NumRanks::Single, cr->mpi_comm_mysim, ms, disresdata,
 -                globalState.get(), replExParams.exchangeInterval > 0);
 +                PAR(cr) ? NumRanks::Multiple : NumRanks::Single,
 +                cr->mpi_comm_mysim,
 +                ms,
 +                disresdata,
 +                globalState.get(),
 +                replExParams.exchangeInterval > 0);
  
 -    t_oriresdata* oriresdata;
 -    snew(oriresdata, 1);
 -    init_orires(fplog, &mtop, inputrec.get(), cr, ms, globalState.get(), oriresdata);
 +    if (gmx_mtop_ftype_count(mtop, F_ORIRES) > 0 && isSimulationMasterRank)
 +    {
 +        extendStateWithOriresHistory(mtop, *inputrec, globalState.get());
 +    }
  
 -    auto deform = prepareBoxDeformation(
 -            globalState != nullptr ? globalState->box : box, MASTER(cr) ? DDRole::Master : DDRole::Agent,
 -            PAR(cr) ? NumRanks::Multiple : NumRanks::Single, cr->mpi_comm_mygroup, *inputrec);
 +    auto deform = prepareBoxDeformation(globalState != nullptr ? globalState->box : box,
 +                                        MASTER(cr) ? DDRole::Master : DDRole::Agent,
 +                                        PAR(cr) ? NumRanks::Multiple : NumRanks::Single,
 +                                        cr->mpi_comm_mygroup,
 +                                        *inputrec);
  
  #if GMX_FAHCORE
      /* We have to remember the generation's first step before reading checkpoint.
  
          // Finish applying initial simulation state information from external sources on all ranks.
          // Reconcile checkpoint file data with Mdrunner state established up to this point.
 -        applyLocalState(*inputHolder_.get(), logFileHandle, cr, domdecOptions.numCells,
 -                        inputrec.get(), globalState.get(), &observablesHistory,
 -                        mdrunOptions.reproducible, mdModules_->notifier(),
 -                        modularSimulatorCheckpointData.get(), useModularSimulator);
 +        applyLocalState(*inputHolder_.get(),
 +                        logFileHandle,
 +                        cr,
 +                        domdecOptions.numCells,
 +                        inputrec.get(),
 +                        globalState.get(),
 +                        &observablesHistory,
 +                        mdrunOptions.reproducible,
 +                        mdModules_->notifiers(),
 +                        modularSimulatorCheckpointData.get(),
 +                        useModularSimulator);
          // TODO: (#3652) Synchronize filesystem state, SimulationInput contents, and program
          //  invariants
          //  on all code paths.
          gmx_bcast(sizeof(box), box, cr->mpiDefaultCommunicator);
      }
  
 -    if (inputrec->cutoff_scheme != ecutsVERLET)
 +    if (inputrec->cutoff_scheme != CutoffScheme::Verlet)
      {
          gmx_fatal(FARGS,
                    "This group-scheme .tpr file can no longer be run by mdrun. Please update to the "
       * increase rlist) tries to check if the newly chosen value fits with the DD scheme. As this is
       * run before any DD scheme is set up, this check is never executed. See #3334 for more details.
       */
 -    prepare_verlet_scheme(fplog, cr, inputrec.get(), nstlist_cmdline, &mtop, box,
 +    prepare_verlet_scheme(fplog,
 +                          cr,
 +                          inputrec.get(),
 +                          nstlist_cmdline,
 +                          mtop,
 +                          box,
                            useGpuForNonbonded || (emulateGpuNonbonded == EmulateGpuNonbonded::Yes),
                            *hwinfo_->cpuInfo);
  
 +    // We need to decide on update groups early, as this affects
 +    // inter-domain communication distances.
 +    auto       updateGroupingsPerMoleculeType = makeUpdateGroupingsPerMoleculeType(mtop);
 +    const real maxUpdateGroupRadius           = computeMaxUpdateGroupRadius(
 +            mtop, updateGroupingsPerMoleculeType, maxReferenceTemperature(*inputrec));
 +    const real   cutoffMargin = std::sqrt(max_cutoff2(inputrec->pbcType, box)) - inputrec->rlist;
 +    UpdateGroups updateGroups = makeUpdateGroups(mdlog,
 +                                                 std::move(updateGroupingsPerMoleculeType),
 +                                                 maxUpdateGroupRadius,
 +                                                 useDomainDecomposition,
 +                                                 systemHasConstraintsOrVsites(mtop),
 +                                                 cutoffMargin);
 +
 +    try
 +    {
 +        const bool haveFrozenAtoms = inputrecFrozenAtoms(inputrec.get());
 +
 +        useGpuForUpdate = decideWhetherToUseGpuForUpdate(useDomainDecomposition,
 +                                                         updateGroups.useUpdateGroups(),
 +                                                         pmeRunMode,
 +                                                         domdecOptions.numPmeRanks > 0,
 +                                                         useGpuForNonbonded,
 +                                                         updateTarget,
 +                                                         gpusWereDetected,
 +                                                         *inputrec,
 +                                                         mtop,
 +                                                         doEssentialDynamics,
 +                                                         gmx_mtop_ftype_count(mtop, F_ORIRES) > 0,
 +                                                         haveFrozenAtoms,
 +                                                         doRerun,
 +                                                         devFlags,
 +                                                         mdlog);
 +    }
 +    GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
 +
 +    bool useGpuDirectHalo = false;
 +
 +    if (useGpuForNonbonded)
 +    {
 +        // cr->npmenodes is not yet initialized.
 +        // domdecOptions.numPmeRanks == -1 results in 0 separate PME ranks when useGpuForNonbonded is true.
 +        // TODO: remove this assumption later once auto mode has support for separate PME ranks
 +        const int numPmeRanks = domdecOptions.numPmeRanks > 0 ? domdecOptions.numPmeRanks : 0;
 +        bool      havePPDomainDecomposition = (cr->sizeOfDefaultCommunicator - numPmeRanks) > 1;
 +        useGpuDirectHalo                    = decideWhetherToUseGpuForHalo(devFlags,
 +                                                        havePPDomainDecomposition,
 +                                                        useGpuForNonbonded,
 +                                                        useModularSimulator,
 +                                                        doRerun,
 +                                                        EI_ENERGY_MINIMIZATION(inputrec->eI));
 +    }
 +
      // This builder is necessary while we have multi-part construction
      // of DD. Before DD is constructed, we use the existence of
      // the builder object to indicate that further construction of DD
      std::unique_ptr<DomainDecompositionBuilder> ddBuilder;
      if (useDomainDecomposition)
      {
 -        ddBuilder = std::make_unique<DomainDecompositionBuilder>(
 -                mdlog, cr, domdecOptions, mdrunOptions, mtop, *inputrec, box,
 -                positionsFromStatePointer(globalState.get()));
 +        // P2P GPU communication + GPU update leads to a case in which we enqueue
 +        // async work for multiple timesteps. DLB needs to be disabled in that case.
 +        const bool directGpuCommUsedWithGpuUpdate = GMX_THREAD_MPI && useGpuDirectHalo && useGpuForUpdate;
 +        ddBuilder                                 = std::make_unique<DomainDecompositionBuilder>(
 +                mdlog,
 +                cr,
 +                domdecOptions,
 +                mdrunOptions,
 +                mtop,
 +                *inputrec,
 +                mdModules_->notifiers(),
 +                box,
 +                updateGroups.updateGroupingPerMoleculeType(),
 +                updateGroups.useUpdateGroups(),
 +                updateGroups.maxUpdateGroupRadius(),
 +                positionsFromStatePointer(globalState.get()),
 +                useGpuForNonbonded,
 +                useGpuForPme,
 +                directGpuCommUsedWithGpuUpdate);
      }
      else
      {
  
      // Produce the task assignment for this rank - done after DD is constructed
      GpuTaskAssignments gpuTaskAssignments = GpuTaskAssignmentsBuilder::build(
 -            gpuIdsToUse, userGpuTaskAssignment, *hwinfo_, simulationCommunicator, physicalNodeComm,
 -            nonbondedTarget, pmeTarget, bondedTarget, updateTarget, useGpuForNonbonded,
 -            useGpuForPme, thisRankHasDuty(cr, DUTY_PP),
 +            availableDevices,
 +            userGpuTaskAssignment,
 +            *hwinfo_,
 +            simulationCommunicator,
 +            physicalNodeComm,
 +            nonbondedTarget,
 +            pmeTarget,
 +            bondedTarget,
 +            updateTarget,
 +            useGpuForNonbonded,
 +            useGpuForPme,
 +            thisRankHasDuty(cr, DUTY_PP),
              // TODO cr->duty & DUTY_PME should imply that a PME
              // algorithm is active, but currently does not.
              EEL_PME(inputrec->coulombtype) && thisRankHasDuty(cr, DUTY_PME));
      // requires it (e.g. pull, CompEl, density fitting), so that we
      // don't update the local atom sets unilaterally every step.
      LocalAtomSetManager atomSets;
 +
 +    // Local state and topology are declared (and perhaps constructed)
 +    // now, because DD needs them for the LocalTopologyChecker, but
 +    // they do not contain valid data until after the first DD
 +    // partition.
 +    std::unique_ptr<t_state> localStateInstance;
 +    t_state*                 localState;
 +    gmx_localtop_t           localTopology(mtop.ffparams);
 +
      if (ddBuilder)
      {
 +        localStateInstance = std::make_unique<t_state>();
 +        localState         = localStateInstance.get();
          // TODO Pass the GPU streams to ddBuilder to use in buffer
          // transfers (e.g. halo exchange)
 -        cr->dd = ddBuilder->build(&atomSets);
 +        cr->dd = ddBuilder->build(&atomSets, localTopology, *localState, &observablesReducerBuilder);
          // The builder's job is done, so destruct it
          ddBuilder.reset(nullptr);
          // Note that local state still does not exist yet.
      }
 +    else
 +    {
 +        // Without DD, the local state is merely an alias to the global state,
 +        // so we don't need to allocate anything.
 +        localState = globalState.get();
 +    }
 +
      // Ensure that all atoms within the same update group are in the
      // same periodic image. Otherwise, a simulation that did not use
      // update groups (e.g. a single-rank simulation) cannot always be
          }
      }
  
 -    // The GPU update is decided here because we need to know whether the constraints or
 -    // SETTLEs can span accross the domain borders (i.e. whether or not update groups are
 -    // defined). This is only known after DD is initialized, hence decision on using GPU
 -    // update is done so late.
 -    try
 -    {
 -        const bool useUpdateGroups = cr->dd ? ddUsesUpdateGroups(*cr->dd) : false;
 -        const bool haveFrozenAtoms = inputrecFrozenAtoms(inputrec.get());
 -
 -        useGpuForUpdate = decideWhetherToUseGpuForUpdate(
 -                useDomainDecomposition, useUpdateGroups, pmeRunMode, domdecOptions.numPmeRanks > 0,
 -                useGpuForNonbonded, updateTarget, gpusWereDetected, *inputrec, mtop,
 -                doEssentialDynamics, gmx_mtop_ftype_count(mtop, F_ORIRES) > 0,
 -                replExParams.exchangeInterval > 0, haveFrozenAtoms, doRerun, devFlags, mdlog);
 -    }
 -    GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
 -
      const bool printHostName = (cr->nnodes > 1);
      gpuTaskAssignments.reportGpuUsage(mdlog, printHostName, useGpuForBonded, pmeRunMode, useGpuForUpdate);
  
  
      MdrunScheduleWorkload runScheduleWork;
  
 -    bool useGpuDirectHalo = decideWhetherToUseGpuForHalo(
 -            devFlags, havePPDomainDecomposition(cr), useGpuForNonbonded, useModularSimulator,
 -            doRerun, EI_ENERGY_MINIMIZATION(inputrec->eI));
 -
      // Also populates the simulation constant workload description.
 -    runScheduleWork.simulationWork = createSimulationWorkload(
 -            *inputrec, disableNonbondedCalculation, devFlags, useGpuForNonbonded, pmeRunMode,
 -            useGpuForBonded, useGpuForUpdate, useGpuDirectHalo);
 +    // Note: currently the default duty is DUTY_PP | DUTY_PME for all simulations, including those without PME,
 +    // so this boolean is sufficient on all ranks to determine whether separate PME ranks are used,
 +    // but this will no longer be the case if cr->duty is changed for !EEL_PME(fr->ic->eeltype).
 +    const bool haveSeparatePmeRank = (!thisRankHasDuty(cr, DUTY_PP) || !thisRankHasDuty(cr, DUTY_PME));
 +    runScheduleWork.simulationWork = createSimulationWorkload(*inputrec,
 +                                                              disableNonbondedCalculation,
 +                                                              devFlags,
 +                                                              havePPDomainDecomposition(cr),
 +                                                              haveSeparatePmeRank,
 +                                                              useGpuForNonbonded,
 +                                                              pmeRunMode,
 +                                                              useGpuForBonded,
 +                                                              useGpuForUpdate,
 +                                                              useGpuDirectHalo);
  
      std::unique_ptr<DeviceStreamManager> deviceStreamManager = nullptr;
  
      if (deviceInfo != nullptr)
      {
 -        if (DOMAINDECOMP(cr) && thisRankHasDuty(cr, DUTY_PP))
 +        if (runScheduleWork.simulationWork.havePpDomainDecomposition && thisRankHasDuty(cr, DUTY_PP))
          {
              dd_setup_dlb_resource_sharing(cr, deviceId);
          }
      // where appropriate.
      if (!userGpuTaskAssignment.empty())
      {
 -        gpuTaskAssignments.logPerformanceHints(mdlog, numDevicesToUse);
 +        gpuTaskAssignments.logPerformanceHints(mdlog, numAvailableDevices);
      }
  
      if (PAR(cr))
                  .appendTextFormatted(
                          "This is simulation %d out of %d running as a composite GROMACS\n"
                          "multi-simulation job. Setup for this simulation:\n",
 -                        ms->simulationIndex_, ms->numSimulations_);
 +                        ms->simulationIndex_,
 +                        ms->numSimulations_);
      }
      GMX_LOG(mdlog.warning)
 -            .appendTextFormatted("Using %d MPI %s\n", cr->nnodes,
 +            .appendTextFormatted("Using %d MPI %s\n",
 +                                 cr->nnodes,
  #    if GMX_THREAD_MPI
                                   cr->nnodes == 1 ? "thread" : "threads"
  #    else
      // the OpenMP support.
      gmx_check_thread_affinity_set(mdlog, &hw_opt, hwinfo_->nthreads_hw_avail, FALSE);
      /* Check and update the number of OpenMP threads requested */
 -    checkAndUpdateRequestedNumOpenmpThreads(&hw_opt, *hwinfo_, cr, ms, physicalNodeComm.size_,
 -                                            pmeRunMode, mtop, *inputrec);
 -
 -    gmx_omp_nthreads_init(mdlog, cr, hwinfo_->nthreads_hw_avail, physicalNodeComm.size_,
 -                          hw_opt.nthreads_omp, hw_opt.nthreads_omp_pme, !thisRankHasDuty(cr, DUTY_PP));
 -
 -    // Enable FP exception detection, but not in
 -    // Release mode and not for compilers with known buggy FP
 -    // exception support (clang with any optimization) or suspected
 -    // buggy FP exception support (gcc 7.* with optimization).
 -#if !defined NDEBUG                                                                         \
 -        && !((defined __clang__ || (defined(__GNUC__) && !defined(__ICC) && __GNUC__ == 7)) \
 -             && defined __OPTIMIZE__)
 -    const bool bEnableFPE = true;
 -#else
 -    const bool bEnableFPE = false;
 -#endif
 +    checkAndUpdateRequestedNumOpenmpThreads(
 +            &hw_opt, *hwinfo_, cr, ms, physicalNodeComm.size_, pmeRunMode, mtop, *inputrec);
 +
 +    gmx_omp_nthreads_init(mdlog,
 +                          cr,
 +                          hwinfo_->nthreads_hw_avail,
 +                          physicalNodeComm.size_,
 +                          hw_opt.nthreads_omp,
 +                          hw_opt.nthreads_omp_pme,
 +                          !thisRankHasDuty(cr, DUTY_PP));
 +
 +    const bool bEnableFPE = gmxShouldEnableFPExceptions();
      // FIXME - reconcile with gmx_feenableexcept() call from CommandLineModuleManager::run()
      if (bEnableFPE)
      {
      }
  
      /* Now that we know the setup is consistent, check for efficiency */
 -    check_resource_division_efficiency(hwinfo_, gpuTaskAssignments.thisRankHasAnyGpuTask(),
 -                                       mdrunOptions.ntompOptionIsSet, cr, mdlog);
 +    check_resource_division_efficiency(
 +            hwinfo_, gpuTaskAssignments.thisRankHasAnyGpuTask(), mdrunOptions.ntompOptionIsSet, cr, mdlog);
  
      /* getting number of PP/PME threads on this MPI / tMPI rank.
         PME: env variable should be read only on one node to make sure it is
         identical everywhere;
       */
 -    const int numThreadsOnThisRank = thisRankHasDuty(cr, DUTY_PP) ? gmx_omp_nthreads_get(emntNonbonded)
 -                                                                  : gmx_omp_nthreads_get(emntPME);
 -    checkHardwareOversubscription(numThreadsOnThisRank, cr->nodeid, *hwinfo_->hardwareTopology,
 -                                  physicalNodeComm, mdlog);
 +    const int numThreadsOnThisRank = thisRankHasDuty(cr, DUTY_PP)
 +                                             ? gmx_omp_nthreads_get(ModuleMultiThread::Nonbonded)
 +                                             : gmx_omp_nthreads_get(ModuleMultiThread::Pme);
 +    checkHardwareOversubscription(
 +            numThreadsOnThisRank, cr->nodeid, *hwinfo_->hardwareTopology, physicalNodeComm, mdlog);
  
      // Enable Peer access between GPUs where available
      // Only for DD, only master PP rank needs to perform setup, and only if thread MPI plus
      // any of the GPU communication features are active.
 -    if (DOMAINDECOMP(cr) && MASTER(cr) && thisRankHasDuty(cr, DUTY_PP) && GMX_THREAD_MPI
 +    if (haveDDAtomOrdering(*cr) && MASTER(cr) && thisRankHasDuty(cr, DUTY_PP) && GMX_THREAD_MPI
          && (runScheduleWork.simulationWork.useGpuHaloExchange
              || runScheduleWork.simulationWork.useGpuPmePpCommunication))
      {
 -        setupGpuDevicePeerAccess(gpuIdsToUse, mdlog);
 +        setupGpuDevicePeerAccess(gpuTaskAssignments.deviceIdsAssigned(), mdlog);
      }
  
      if (hw_opt.threadAffinity != ThreadAffinity::Off)
          gmx_check_thread_affinity_set(mdlog, &hw_opt, hwinfo_->nthreads_hw_avail, TRUE);
  
          int numThreadsOnThisNode, intraNodeThreadOffset;
 -        analyzeThreadsOnThisNode(physicalNodeComm, numThreadsOnThisRank, &numThreadsOnThisNode,
 -                                 &intraNodeThreadOffset);
 +        analyzeThreadsOnThisNode(
 +                physicalNodeComm, numThreadsOnThisRank, &numThreadsOnThisNode, &intraNodeThreadOffset);
  
          /* Set the CPU affinity */
 -        gmx_set_thread_affinity(mdlog, cr, &hw_opt, *hwinfo_->hardwareTopology, numThreadsOnThisRank,
 -                                numThreadsOnThisNode, intraNodeThreadOffset, nullptr);
 +        gmx_set_thread_affinity(mdlog,
 +                                cr,
 +                                &hw_opt,
 +                                *hwinfo_->hardwareTopology,
 +                                numThreadsOnThisRank,
 +                                numThreadsOnThisNode,
 +                                intraNodeThreadOffset,
 +                                nullptr);
      }
  
      if (mdrunOptions.timingOptions.resetStep > -1)
                          "The -resetstep functionality is deprecated, and may be removed in a "
                          "future version.");
      }
 -    wcycle = wallcycle_init(fplog, mdrunOptions.timingOptions.resetStep, cr);
 +    std::unique_ptr<gmx_wallcycle> wcycle =
 +            wallcycle_init(fplog, mdrunOptions.timingOptions.resetStep, cr);
  
      if (PAR(cr))
      {
          /* Master synchronizes its value of reset_counters with all nodes
           * including PME only nodes */
 -        int64_t reset_counters = wcycle_get_reset_counters(wcycle);
 +        int64_t reset_counters = wcycle_get_reset_counters(wcycle.get());
          gmx_bcast(sizeof(reset_counters), &reset_counters, cr->mpi_comm_mysim);
 -        wcycle_set_reset_counters(wcycle, reset_counters);
 +        wcycle_set_reset_counters(wcycle.get(), reset_counters);
      }
  
      // Membrane embedding must be initialized before we call init_forcerec()
 -    membedHolder.initializeMembed(fplog, filenames.size(), filenames.data(), &mtop, inputrec.get(),
 -                                  globalState.get(), cr, &mdrunOptions.checkpointOptions.period);
 +    membedHolder.initializeMembed(fplog,
 +                                  filenames.size(),
 +                                  filenames.data(),
 +                                  &mtop,
 +                                  inputrec.get(),
 +                                  globalState.get(),
 +                                  cr,
 +                                  &mdrunOptions.checkpointOptions.period);
  
      const bool               thisRankHasPmeGpuTask = gpuTaskAssignments.thisRankHasPmeGpuTask();
      std::unique_ptr<MDAtoms> mdAtoms;
      std::unique_ptr<VirtualSitesHandler> vsite;
 -    std::unique_ptr<GpuBonded>           gpuBonded;
  
      t_nrnb nrnb;
      if (thisRankHasDuty(cr, DUTY_PP))
      {
 -        mdModulesNotifier.notify(*cr);
 -        mdModulesNotifier.notify(&atomSets);
 -        mdModulesNotifier.notify(inputrec->pbcType);
 -        mdModulesNotifier.notify(SimulationTimeStep{ inputrec->delta_t });
 +        setupNotifier.notify(*cr);
 +        setupNotifier.notify(&atomSets);
 +        setupNotifier.notify(mtop);
 +        setupNotifier.notify(inputrec->pbcType);
 +        setupNotifier.notify(SimulationTimeStep{ inputrec->delta_t });
          /* Initiate forcerecord */
 -        fr                 = new t_forcerec;
 +        fr                 = std::make_unique<t_forcerec>();
          fr->forceProviders = mdModules_->initForceProviders();
 -        init_forcerec(fplog, mdlog, fr, inputrec.get(), &mtop, cr, box,
 +        init_forcerec(fplog,
 +                      mdlog,
 +                      runScheduleWork.simulationWork,
 +                      fr.get(),
 +                      *inputrec,
 +                      mtop,
 +                      cr,
 +                      box,
                        opt2fn("-table", filenames.size(), filenames.data()),
                        opt2fn("-tablep", filenames.size(), filenames.data()),
 -                      opt2fns("-tableb", filenames.size(), filenames.data()), pforce);
 +                      opt2fns("-tableb", filenames.size(), filenames.data()),
 +                      pforce);
          // Dirty hack: to fix this properly, disres and orires should be made MDModules
          fr->fcdata->disres = disresdata;
 -        fr->fcdata->orires = oriresdata;
 +        if (gmx_mtop_ftype_count(mtop, F_ORIRES) > 0)
 +        {
 +            fr->fcdata->orires = std::make_unique<t_oriresdata>(
 +                    fplog, mtop, *inputrec, ms, globalState.get(), &atomSets);
 +        }
  
          // Save a handle to device stream manager to use elsewhere in the code
          // TODO: Forcerec is not a correct place to store it.
                      "GPU PP-PME stream should be valid in order to use GPU PME-PP direct "
                      "communications.");
              fr->pmePpCommGpu = std::make_unique<gmx::PmePpCommGpu>(
 -                    cr->mpi_comm_mysim, cr->dd->pme_nodeid, deviceStreamManager->context(),
 +                    cr->mpi_comm_mysim,
 +                    cr->dd->pme_nodeid,
 +                    &cr->dd->pmeForceReceiveBuffer,
 +                    deviceStreamManager->context(),
                      deviceStreamManager->stream(DeviceStreamType::PmePpTransfer));
          }
  
 -        fr->nbv = Nbnxm::init_nb_verlet(mdlog, inputrec.get(), fr, cr, *hwinfo_,
 +        fr->nbv = Nbnxm::init_nb_verlet(mdlog,
 +                                        *inputrec,
 +                                        *fr,
 +                                        cr,
 +                                        *hwinfo_,
                                          runScheduleWork.simulationWork.useGpuNonbonded,
 -                                        deviceStreamManager.get(), &mtop, box, wcycle);
 +                                        deviceStreamManager.get(),
 +                                        mtop,
 +                                        box,
 +                                        wcycle.get());
          // TODO: Move the logic below to a GPU bonded builder
          if (runScheduleWork.simulationWork.useGpuBonded)
          {
              GMX_RELEASE_ASSERT(deviceStreamManager != nullptr,
                                 "GPU device stream manager should be valid in order to use GPU "
                                 "version of bonded forces.");
 -            gpuBonded = std::make_unique<GpuBonded>(
 -                    mtop.ffparams, fr->ic->epsfac * fr->fudgeQQ, deviceStreamManager->context(),
 -                    deviceStreamManager->bondedStream(havePPDomainDecomposition(cr)), wcycle);
 -            fr->gpuBonded = gpuBonded.get();
 +            fr->listedForcesGpu = std::make_unique<ListedForcesGpu>(
 +                    mtop.ffparams,
 +                    fr->ic->epsfac * fr->fudgeQQ,
 +                    deviceStreamManager->context(),
 +                    deviceStreamManager->bondedStream(havePPDomainDecomposition(cr)),
 +                    wcycle.get());
          }
 +        fr->longRangeNonbondeds = std::make_unique<CpuPpLongRangeNonbondeds>(fr->n_tpi,
 +                                                                             fr->ic->ewaldcoeff_q,
 +                                                                             fr->ic->epsilon_r,
 +                                                                             fr->qsum,
 +                                                                             fr->ic->eeltype,
 +                                                                             fr->ic->vdwtype,
 +                                                                             *inputrec,
 +                                                                             &nrnb,
 +                                                                             wcycle.get(),
 +                                                                             fplog);
  
          /* Initialize the mdAtoms structure.
           * mdAtoms is not filled with atom data,
          }
  
          /* Initialize the virtual site communication */
 -        vsite = makeVirtualSitesHandler(mtop, cr, fr->pbcType);
 +        vsite = makeVirtualSitesHandler(
 +                mtop, cr, fr->pbcType, updateGroups.updateGroupingPerMoleculeType());
  
          calc_shifts(box, fr->shift_vec);
  
                  constructVirtualSitesGlobal(mtop, globalState->x);
              }
          }
 +        // Make the DD reverse topology, now that any vsites that are present are available
 +        if (haveDDAtomOrdering(*cr))
 +        {
 +            dd_make_reverse_top(fplog, cr->dd, mtop, vsite.get(), *inputrec, domdecOptions.ddBondedChecking);
 +        }
  
          if (EEL_PME(fr->ic->eeltype) || EVDW_PME(fr->ic->vdwtype))
          {
                  const DeviceContext* deviceContext = runScheduleWork.simulationWork.useGpuPme
                                                               ? &deviceStreamManager->context()
                                                               : nullptr;
 -                const DeviceStream* pmeStream =
 +                const DeviceStream*  pmeStream =
                          runScheduleWork.simulationWork.useGpuPme
 -                                ? &deviceStreamManager->stream(DeviceStreamType::Pme)
 -                                : nullptr;
 -
 -                pmedata = gmx_pme_init(cr, getNumPmeDomains(cr->dd), inputrec.get(),
 -                                       nChargePerturbed != 0, nTypePerturbed != 0,
 -                                       mdrunOptions.reproducible, ewaldcoeff_q, ewaldcoeff_lj,
 -                                       gmx_omp_nthreads_get(emntPME), pmeRunMode, nullptr,
 -                                       deviceContext, pmeStream, pmeGpuProgram.get(), mdlog);
 +                                 ? &deviceStreamManager->stream(DeviceStreamType::Pme)
 +                                 : nullptr;
 +
 +                pmedata = gmx_pme_init(cr,
 +                                       getNumPmeDomains(cr->dd),
 +                                       inputrec.get(),
 +                                       nChargePerturbed != 0,
 +                                       nTypePerturbed != 0,
 +                                       mdrunOptions.reproducible,
 +                                       ewaldcoeff_q,
 +                                       ewaldcoeff_lj,
 +                                       gmx_omp_nthreads_get(ModuleMultiThread::Pme),
 +                                       pmeRunMode,
 +                                       nullptr,
 +                                       deviceContext,
 +                                       pmeStream,
 +                                       pmeGpuProgram.get(),
 +                                       mdlog);
              }
              GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
          }
      if (thisRankHasDuty(cr, DUTY_PP))
      {
          /* Assumes uniform use of the number of OpenMP threads */
 -        walltime_accounting = walltime_accounting_init(gmx_omp_nthreads_get(emntDefault));
 +        walltime_accounting = walltime_accounting_init(gmx_omp_nthreads_get(ModuleMultiThread::Default));
  
          if (inputrec->bPull)
          {
              /* Initialize pull code */
 -            pull_work = init_pull(fplog, inputrec->pull.get(), inputrec.get(), &mtop, cr, &atomSets,
 +            pull_work = init_pull(fplog,
 +                                  inputrec->pull.get(),
 +                                  inputrec.get(),
 +                                  mtop,
 +                                  cr,
 +                                  &atomSets,
                                    inputrec->fepvals->init_lambda);
              if (inputrec->pull->bXOutAverage || inputrec->pull->bFOutAverage)
              {
          if (inputrec->bRot)
          {
              /* Initialize enforced rotation code */
 -            enforcedRotation = init_rot(fplog, inputrec.get(), filenames.size(), filenames.data(),
 -                                        cr, &atomSets, globalState.get(), &mtop, oenv, mdrunOptions,
 +            enforcedRotation = init_rot(fplog,
 +                                        inputrec.get(),
 +                                        filenames.size(),
 +                                        filenames.data(),
 +                                        cr,
 +                                        &atomSets,
 +                                        globalState.get(),
 +                                        mtop,
 +                                        oenv,
 +                                        mdrunOptions,
                                          startingBehavior);
          }
  
          t_swap* swap = nullptr;
 -        if (inputrec->eSwapCoords != eswapNO)
 +        if (inputrec->eSwapCoords != SwapType::No)
          {
              /* Initialize ion swapping code */
 -            swap = init_swapcoords(fplog, inputrec.get(),
 +            swap = init_swapcoords(fplog,
 +                                   inputrec.get(),
                                     opt2fn_master("-swap", filenames.size(), filenames.data(), cr),
 -                                   &mtop, globalState.get(), &observablesHistory, cr, &atomSets,
 -                                   oenv, mdrunOptions, startingBehavior);
 +                                   mtop,
 +                                   globalState.get(),
 +                                   &observablesHistory,
 +                                   cr,
 +                                   &atomSets,
 +                                   oenv,
 +                                   mdrunOptions,
 +                                   startingBehavior);
          }
  
          /* Let makeConstraints know whether we have essential dynamics constraints. */
 -        auto constr = makeConstraints(mtop, *inputrec, pull_work, doEssentialDynamics, fplog, cr,
 -                                      ms, &nrnb, wcycle, fr->bMolPBC);
 +        auto constr = makeConstraints(mtop,
 +                                      *inputrec,
 +                                      pull_work,
 +                                      doEssentialDynamics,
 +                                      fplog,
 +                                      cr,
 +                                      updateGroups.useUpdateGroups(),
 +                                      ms,
 +                                      &nrnb,
 +                                      wcycle.get(),
 +                                      fr->bMolPBC,
 +                                      &observablesReducerBuilder);
  
          /* Energy terms and groups */
          gmx_enerdata_t enerd(mtop.groups.groups[SimulationAtomGroupType::EnergyOutput].size(),
  
          // cos acceleration is only supported by md, but older tpr
          // files might still combine it with other integrators
 -        GMX_RELEASE_ASSERT(inputrec->cos_accel == 0.0 || inputrec->eI == eiMD,
 +        GMX_RELEASE_ASSERT(inputrec->cos_accel == 0.0 || inputrec->eI == IntegrationAlgorithm::MD,
                             "cos_acceleration is only supported by integrator=md");
  
          /* Kinetic energy data */
 -        gmx_ekindata_t ekind;
 -        init_ekindata(fplog, &mtop, &(inputrec->opts), &ekind, inputrec->cos_accel);
 +        gmx_ekindata_t ekind(inputrec->opts.ngtc,
 +                             inputrec->cos_accel,
 +                             gmx_omp_nthreads_get(ModuleMultiThread::Update));
  
          /* Set up interactive MD (IMD) */
 -        auto imdSession =
 -                makeImdSession(inputrec.get(), cr, wcycle, &enerd, ms, &mtop, mdlog,
 -                               MASTER(cr) ? globalState->x.rvec_array() : nullptr, filenames.size(),
 -                               filenames.data(), oenv, mdrunOptions.imdOptions, startingBehavior);
 -
 -        if (DOMAINDECOMP(cr))
 +        auto imdSession = makeImdSession(inputrec.get(),
 +                                         cr,
 +                                         wcycle.get(),
 +                                         &enerd,
 +                                         ms,
 +                                         mtop,
 +                                         mdlog,
 +                                         MASTER(cr) ? globalState->x : gmx::ArrayRef<gmx::RVec>(),
 +                                         filenames.size(),
 +                                         filenames.data(),
 +                                         oenv,
 +                                         mdrunOptions.imdOptions,
 +                                         startingBehavior);
 +
 +        if (haveDDAtomOrdering(*cr))
          {
              GMX_RELEASE_ASSERT(fr, "fr was NULL while cr->duty was DUTY_PP");
 -            /* This call is not included in init_domain_decomposition mainly
 -             * because fr->cginfo_mb is set later.
 +            /* This call is not included in init_domain_decomposition
 +             * because fr->atomInfoForEachMoleculeBlock is set later.
               */
 -            dd_init_bondeds(fplog, cr->dd, mtop, vsite.get(), inputrec.get(),
 -                            domdecOptions.checkBondedInteractions, fr->cginfo_mb);
 +            makeBondedLinks(cr->dd, mtop, fr->atomInfoForEachMoleculeBlock);
          }
  
          if (runScheduleWork.simulationWork.useGpuBufferOps)
          {
              fr->gpuForceReduction[gmx::AtomLocality::Local] = std::make_unique<gmx::GpuForceReduction>(
                      deviceStreamManager->context(),
 -                    deviceStreamManager->stream(gmx::DeviceStreamType::NonBondedLocal), wcycle);
 +                    deviceStreamManager->stream(gmx::DeviceStreamType::NonBondedLocal),
 +                    wcycle.get());
              fr->gpuForceReduction[gmx::AtomLocality::NonLocal] = std::make_unique<gmx::GpuForceReduction>(
                      deviceStreamManager->context(),
 -                    deviceStreamManager->stream(gmx::DeviceStreamType::NonBondedNonLocal), wcycle);
 +                    deviceStreamManager->stream(gmx::DeviceStreamType::NonBondedNonLocal),
 +                    wcycle.get());
          }
  
          std::unique_ptr<gmx::StatePropagatorDataGpu> stateGpu;
              && ((runScheduleWork.simulationWork.useGpuPme && thisRankHasDuty(cr, DUTY_PME))
                  || runScheduleWork.simulationWork.useGpuBufferOps))
          {
 -            GpuApiCallBehavior transferKind = (inputrec->eI == eiMD && !doRerun && !useModularSimulator)
 -                                                      ? GpuApiCallBehavior::Async
 -                                                      : GpuApiCallBehavior::Sync;
 +            GpuApiCallBehavior transferKind =
 +                    (inputrec->eI == IntegrationAlgorithm::MD && !doRerun && !useModularSimulator)
 +                            ? GpuApiCallBehavior::Async
 +                            : GpuApiCallBehavior::Sync;
              GMX_RELEASE_ASSERT(deviceStreamManager != nullptr,
                                 "GPU device stream manager should be initialized to use GPU.");
              stateGpu = std::make_unique<gmx::StatePropagatorDataGpu>(
 -                    *deviceStreamManager, transferKind, pme_gpu_get_block_size(fr->pmedata), wcycle);
 +                    *deviceStreamManager, transferKind, pme_gpu_get_block_size(fr->pmedata), wcycle.get());
              fr->stateGpu = stateGpu.get();
          }
  
          GMX_ASSERT(stopHandlerBuilder_, "Runner must provide StopHandlerBuilder to simulator.");
          SimulatorBuilder simulatorBuilder;
  
 -        simulatorBuilder.add(SimulatorStateData(globalState.get(), &observablesHistory, &enerd, &ekind));
 +        simulatorBuilder.add(SimulatorStateData(
 +                globalState.get(), localState, &observablesHistory, &enerd, &ekind));
          simulatorBuilder.add(std::move(membedHolder));
          simulatorBuilder.add(std::move(stopHandlerBuilder_));
          simulatorBuilder.add(SimulatorConfig(mdrunOptions, startingBehavior, &runScheduleWork));
  
  
 -        simulatorBuilder.add(SimulatorEnv(fplog, cr, ms, mdlog, oenv));
 -        simulatorBuilder.add(Profiling(&nrnb, walltime_accounting, wcycle));
 +        simulatorBuilder.add(SimulatorEnv(fplog, cr, ms, mdlog, oenv, &observablesReducerBuilder));
 +        simulatorBuilder.add(Profiling(&nrnb, walltime_accounting, wcycle.get()));
          simulatorBuilder.add(ConstraintsParam(
 -                constr.get(), enforcedRotation ? enforcedRotation->getLegacyEnfrot() : nullptr,
 -                vsite.get()));
 +                constr.get(), enforcedRotation ? enforcedRotation->getLegacyEnfrot() : nullptr, vsite.get()));
          // TODO: Separate `fr` to a separate add, and make the `build` handle the coupling sensibly.
 -        simulatorBuilder.add(LegacyInput(static_cast<int>(filenames.size()), filenames.data(),
 -                                         inputrec.get(), fr));
 +        simulatorBuilder.add(LegacyInput(
 +                static_cast<int>(filenames.size()), filenames.data(), inputrec.get(), fr.get()));
          simulatorBuilder.add(ReplicaExchangeParameters(replExParams));
          simulatorBuilder.add(InteractiveMD(imdSession.get()));
 -        simulatorBuilder.add(SimulatorModules(mdModules_->outputProvider(), mdModules_->notifier()));
 +        simulatorBuilder.add(SimulatorModules(mdModules_->outputProvider(), mdModules_->notifiers()));
          simulatorBuilder.add(CenterOfMassPulling(pull_work));
          // TODO: move to an MDModule
          simulatorBuilder.add(IonSwapping(swap));
 -        simulatorBuilder.add(TopologyData(&mtop, mdAtoms.get()));
 +        simulatorBuilder.add(TopologyData(mtop, &localTopology, mdAtoms.get()));
          simulatorBuilder.add(BoxDeformationHandle(deform.get()));
          simulatorBuilder.add(std::move(modularSimulatorCheckpointData));
  
      {
          GMX_RELEASE_ASSERT(pmedata, "pmedata was NULL while cr->duty was not DUTY_PP");
          /* do PME only */
 -        walltime_accounting = walltime_accounting_init(gmx_omp_nthreads_get(emntPME));
 -        gmx_pmeonly(pmedata, cr, &nrnb, wcycle, walltime_accounting, inputrec.get(), pmeRunMode,
 +        walltime_accounting = walltime_accounting_init(gmx_omp_nthreads_get(ModuleMultiThread::Pme));
 +        gmx_pmeonly(pmedata,
 +                    cr,
 +                    &nrnb,
 +                    wcycle.get(),
 +                    walltime_accounting,
 +                    inputrec.get(),
 +                    pmeRunMode,
 +                    runScheduleWork.simulationWork.useGpuPmePpCommunication,
                      deviceStreamManager.get());
      }
  
 -    wallcycle_stop(wcycle, ewcRUN);
 +    wallcycle_stop(wcycle.get(), WallCycleCounter::Run);
  
      /* Finish up, write some stuff
       * if rerunMD, don't write last frame again
       */
 -    finish_run(fplog, mdlog, cr, inputrec.get(), &nrnb, wcycle, walltime_accounting,
 -               fr ? fr->nbv.get() : nullptr, pmedata, EI_DYNAMICS(inputrec->eI) && !isMultiSim(ms));
 +    finish_run(fplog,
 +               mdlog,
 +               cr,
 +               *inputrec,
 +               &nrnb,
 +               wcycle.get(),
 +               walltime_accounting,
 +               fr ? fr->nbv.get() : nullptr,
 +               pmedata,
 +               EI_DYNAMICS(inputrec->eI) && !isMultiSim(ms));
  
 -    // clean up cycle counter
 -    wallcycle_destroy(wcycle);
  
      deviceStreamManager.reset(nullptr);
      // Free PME data
      // As soon as we destroy GPU contexts after mdrunner() exits, these lines should go.
      mdAtoms.reset(nullptr);
      globalState.reset(nullptr);
 +    localStateInstance.reset(nullptr);
      mdModules_.reset(nullptr); // destruct force providers here as they might also use the GPU
 -    gpuBonded.reset(nullptr);
 -    /* Free pinned buffers in *fr */
 -    delete fr;
 -    fr = nullptr;
 +    fr.reset(nullptr);         // destruct the forcerec before the GPU context
      // TODO convert to C++ so we can get rid of these frees
      sfree(disresdata);
 -    sfree(oriresdata);
  
      if (!hwinfo_->deviceInfoList.empty())
      {
      {
          physicalNodeComm.barrier();
      }
 -    releaseDevice(deviceInfo);
 +
 +    if (!devFlags.usingCudaAwareMpi)
 +    {
 +        // Don't reset the GPU when using CUDA-aware MPI: UCX creates CUDA buffers
 +        // that are cleaned up as part of MPI_Finalize(), and resetting the device
 +        // before MPI_Finalize() results in crashes inside UCX.
 +        releaseDevice(deviceInfo);
 +    }
  
      /* Does what it says */
      print_date_and_time(fplog, cr->nodeid, "Finished mdrun", gmx_gettime());
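
A pattern worth noting in the runner.cpp changes above: lifetimes that were previously managed by hand (snew()/sfree(), new/delete, wallcycle_destroy()) now flow through std::unique_ptr, as for wcycle and fr. The sketch below is illustrative only — Resource and makeResource() are hypothetical stand-ins, not GROMACS API — and shows the shape of the pattern: a factory returns an owning pointer, legacy C-style functions receive .get(), and destruction order is forced with reset() where it matters.

    #include <memory>

    // Hypothetical stand-in for a C-style struct such as gmx_wallcycle.
    struct Resource
    {
        int counters = 0;
    };

    // Factory returning an owning pointer, analogous to wallcycle_init().
    std::unique_ptr<Resource> makeResource()
    {
        return std::make_unique<Resource>();
    }

    // Legacy C-style APIs keep taking raw pointers; callers pass .get().
    void useResource(Resource* r)
    {
        r->counters++;
    }

    int main()
    {
        std::unique_ptr<Resource> res = makeResource();
        useResource(res.get());
        // Destruction order can still be made explicit where it matters,
        // e.g. destructing the forcerec before the GPU context is released:
        res.reset(nullptr);
        return 0;
    }
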
index 23781cee8cfc0177dc1e85f98eefd9603273f17f,a6b511a3051164867f0aa0173a9387b312543705..079f906cb9701b21d8c94381a1762c42f28a1fb9
@@@ -3,7 -3,7 +3,7 @@@
   *
   * Copyright (c) 2009,2010,2011,2012,2013 by the GROMACS development team.
   * Copyright (c) 2014,2015,2016,2017,2018 by the GROMACS development team.
-  * Copyright (c) 2019,2020, by the GROMACS development team, led by
+  * Copyright (c) 2019,2020,2021, by the GROMACS development team, led by
   * Mark Abraham, David van der Spoel, Berk Hess, and Erik Lindahl,
   * and including many others, as listed in the AUTHORS file in the
   * top-level source directory and at http://www.gromacs.org.
@@@ -522,6 -522,7 +522,7 @@@ void _gmx_sel_evaluate_static(gmx_sel_e
  {
      if (sel->flags & SEL_UNSORTED)
      {
+         gmx_ana_index_reserve(sel->v.u.g, sel->u.cgrp.isize);
          // This only works if g contains all the atoms, but that is currently
          // the only supported case.
          gmx_ana_index_copy(sel->v.u.g, &sel->u.cgrp, false);
@@@ -953,7 -954,8 +954,7 @@@ void _gmx_sel_evaluate_method(gmx_sel_e
                      expandValueForPositions(sel->v.u.r, &sel->v.nr, sel->u.expr.pos);
                      break;
                  default:
 -                    GMX_RELEASE_ASSERT(false,
 -                                       "Unimplemented value type for position update method");
 +                    GMX_RELEASE_ASSERT(false, "Unimplemented value type for position update method");
              }
          }
      }
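
The one-line evaluate.cpp fix above reserves capacity in the destination group before gmx_ana_index_copy() writes into it. As a hedged illustration (plain C++ containers, not the selection-engine API), the same reserve-before-copy discipline looks like this:

    #include <cassert>
    #include <cstring>
    #include <vector>

    // Ensure the destination can hold the source before a raw copy, mirroring
    // the intent of the added gmx_ana_index_reserve() call.
    void copyGroup(std::vector<int>* dst, const std::vector<int>& src)
    {
        dst->resize(src.size()); // without this, the memcpy would write out of bounds
        std::memcpy(dst->data(), src.data(), src.size() * sizeof(int));
    }

    int main()
    {
        const std::vector<int> source = { 3, 1, 2 };
        std::vector<int>       destination; // starts empty, like a fresh index group
        copyGroup(&destination, source);
        assert(destination.size() == source.size());
        return 0;
    }
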
index 36c4650f0649311fe23d213ec27e7a66f737e389,bc9af8d4e09ade7668c959c3345e9723dc00e5e8..62010fe2697772148bdcdb604511933f4c06209b
@@@ -57,7 -57,7 +57,7 @@@
  #include "gromacs/hardware/detecthardware.h"
  #include "gromacs/hardware/hardwaretopology.h"
  #include "gromacs/hardware/hw_info.h"
 -#include "gromacs/listed_forces/gpubonded.h"
 +#include "gromacs/listed_forces/listed_forces_gpu.h"
  #include "gromacs/mdlib/gmx_omp_nthreads.h"
  #include "gromacs/mdlib/update_constrain_gpu.h"
  #include "gromacs/mdtypes/commrec.h"
@@@ -83,7 -83,7 +83,7 @@@ namespac
  {
  
  //! Helper variable to localise the text of an often repeated message.
 -const char* g_specifyEverythingFormatString =
 +const char* const g_specifyEverythingFormatString =
          "When you use mdrun -gputasks, %s must be set to non-default "
          "values, so that the device IDs can be interpreted correctly."
  #if GMX_GPU
          // OpenCL standard, but the only current relevant case for GROMACS
          // is AMD OpenCL, which offers this variable.
          "GPU_DEVICE_ORDINAL"
 -#    elif GMX_GPU_SYCL
 -        // As with OpenCL, there are no portable way to do it.
 -        // Intel reference: https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md
 -        // While SYCL_DEVICE_FILTER is a better option, as of 2021.1-beta10 it is not yet supported.
 -        "SYCL_DEVICE_ALLOWLIST"
 +#    elif GMX_GPU_SYCL && GMX_SYCL_DPCPP
 +        // https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md
 +        "SYCL_DEVICE_FILTER"
 +#    elif GMX_GPU_SYCL && GMX_SYCL_HIPSYCL
 +        // Not true if we use hipSYCL over CUDA or IntelLLVM, but in that case the user probably
 +        // knows what they are doing.
 +        // https://rocmdocs.amd.com/en/latest/Other_Solutions/Other-Solutions.html#hip-environment-variables
 +        "HIP_VISIBLE_DEVICES"
  #    else
  #        error "Unreachable branch"
  #    endif
  } // namespace
  
  bool decideWhetherToUseGpusForNonbondedWithThreadMpi(const TaskTarget        nonbondedTarget,
 -                                                     const int               numDevicesToUse,
 +                                                     const bool              haveAvailableDevices,
                                                       const std::vector<int>& userGpuTaskAssignment,
                                                       const EmulateGpuNonbonded emulateGpuNonbonded,
                                                       const bool buildSupportsNonbondedOnGpu,
      // all potential ranks can use, and can use that in a global
      // decision that will later be consistent.
      // If we get here, then the user permitted or required GPUs.
 -    return (numDevicesToUse > 0);
 +    return haveAvailableDevices;
  }
  
  bool decideWhetherToUseGpusForPmeWithThreadMpi(const bool              useGpuForNonbonded,
                                                 const TaskTarget        pmeTarget,
+                                                const TaskTarget        pmeFftTarget,
                                                 const int               numDevicesToUse,
                                                 const std::vector<int>& userGpuTaskAssignment,
                                                 const gmx_hw_info_t&    hardwareInfo,
      if ((pmeTarget == TaskTarget::Cpu) || !useGpuForNonbonded || !pme_gpu_supports_build(nullptr)
          || !pme_gpu_supports_hardware(hardwareInfo, nullptr) || !pme_gpu_supports_input(inputrec, nullptr))
      {
-         // PME can't run on a GPU. If the user required that, we issue
-         // an error later.
+         // PME can't run on a GPU. If the user required that, we issue an error later.
+         return false;
+     }
+     if (pmeFftTarget == TaskTarget::Cpu && !pme_gpu_mixed_mode_supports_input(inputrec, nullptr))
+     {
+         /* User requested PME FFT on CPU, but the current system is not compatible with Mixed mode,
+          * so we don't use GPUs at all.
+          * If the user had -pme gpu, we issue an error later. */
          return false;
      }
  
@@@ -330,6 -335,7 +338,7 @@@ bool decideWhetherToUseGpusForNonbonded
  
  bool decideWhetherToUseGpusForPme(const bool              useGpuForNonbonded,
                                    const TaskTarget        pmeTarget,
+                                   const TaskTarget        pmeFftTarget,
                                    const std::vector<int>& userGpuTaskAssignment,
                                    const gmx_hw_info_t&    hardwareInfo,
                                    const t_inputrec&       inputrec,
          }
          return false;
      }
+     if (pmeFftTarget == TaskTarget::Cpu && !pme_gpu_mixed_mode_supports_input(inputrec, &message))
+     {
+         /* User requested PME FFT on CPU, but the current system is not compatible with Mixed mode,
+          * so we don't use GPUs at all. */
+         if (pmeTarget == TaskTarget::Gpu)
+         {
+             GMX_THROW(NotImplementedError("Cannot compute PME interactions in Mixed mode, because " + message));
+         }
+         return false;
+     }
  
      if (pmeTarget == TaskTarget::Cpu)
      {
@@@ -479,7 -495,7 +498,7 @@@ bool decideWhetherToUseGpusForBonded(bo
  
      std::string errorMessage;
  
 -    if (!buildSupportsGpuBondeds(&errorMessage))
 +    if (!buildSupportsListedForcesGpu(&errorMessage))
      {
          if (bondedTarget == TaskTarget::Gpu)
          {
          return false;
      }
  
 -    if (!inputSupportsGpuBondeds(inputrec, mtop, &errorMessage))
 +    if (!inputSupportsListedForcesGpu(inputrec, mtop, &errorMessage))
      {
          if (bondedTarget == TaskTarget::Gpu)
          {
      // is busy, for which we currently only check PME or Ewald.
      // (It would be better to dynamically assign bondeds based on timings)
      // Note that here we assume that the auto setting of PME ranks will not
 -    // choose seperate PME ranks when nonBonded are assigned to the GPU.
 +    // choose separate PME ranks when nonBonded are assigned to the GPU.
      bool usingOurCpuForPmeOrEwald =
              (EVDW_PME(inputrec.vdwtype)
               || (EEL_PME_EWALD(inputrec.coulombtype) && !useGpuForPme && numPmeRanksPerSimulation <= 0));
@@@ -546,6 -562,7 +565,6 @@@ bool decideWhetherToUseGpuForUpdate(con
                                      const gmx_mtop_t&              mtop,
                                      const bool                     useEssentialDynamics,
                                      const bool                     doOrientationRestraints,
 -                                    const bool                     useReplicaExchange,
                                      const bool                     haveFrozenAtoms,
                                      const bool                     doRerun,
                                      const DevelopmentFeatureFlags& devFlags,
          errorMessage += "Multiple time stepping is not supported.\n";
      }
  
 -    if (inputrec.eConstrAlg == econtSHAKE && hasAnyConstraints && gmx_mtop_ftype_count(mtop, F_CONSTR) > 0)
 +    if (inputrec.eConstrAlg == ConstraintAlgorithm::Shake && hasAnyConstraints
 +        && gmx_mtop_ftype_count(mtop, F_CONSTR) > 0)
      {
          errorMessage += "SHAKE constraints are not supported.\n";
      }
          errorMessage +=
                  "Either PME or short-ranged non-bonded interaction tasks must run on the GPU.\n";
      }
 -
      if (!gpusWereDetected)
      {
          errorMessage += "Compatible GPUs must have been found.\n";
      {
          errorMessage += "Only a CUDA build is supported.\n";
      }
 -    if (inputrec.eI != eiMD)
 +    if (inputrec.eI != IntegrationAlgorithm::MD)
      {
          errorMessage += "Only the md integrator is supported.\n";
      }
 -    if (inputrec.etc == etcNOSEHOOVER)
 +    if (inputrec.etc == TemperatureCoupling::NoseHoover)
      {
          errorMessage += "Nose-Hoover temperature coupling is not supported.\n";
      }
 -    if (!(inputrec.epc == epcNO || inputrec.epc == epcPARRINELLORAHMAN
 -          || inputrec.epc == epcBERENDSEN || inputrec.epc == epcCRESCALE))
 +    if (!(inputrec.epc == PressureCoupling::No || inputrec.epc == PressureCoupling::ParrinelloRahman
 +          || inputrec.epc == PressureCoupling::Berendsen || inputrec.epc == PressureCoupling::CRescale))
      {
          errorMessage +=
                  "Only Parrinello-Rahman, Berendsen, and C-rescale pressure coupling are "
          // The graph is needed, but not supported
          errorMessage += "Orientation restraints are not supported.\n";
      }
 -    if (inputrec.efep != efepNO && (haveFepPerturbedMasses(mtop) || havePerturbedConstraints(mtop)))
 +    if (inputrec.efep != FreeEnergyPerturbationType::No
 +        && (haveFepPerturbedMasses(mtop) || havePerturbedConstraints(mtop)))
      {
          errorMessage += "Free energy perturbation for mass and constraints are not supported.\n";
      }
      const auto particleTypes = gmx_mtop_particletype_count(mtop);
 -    if (particleTypes[eptShell] > 0)
 +    if (particleTypes[ParticleType::Shell] > 0)
      {
          errorMessage += "Shells are not supported.\n";
      }
 -    if (useReplicaExchange)
 -    {
 -        errorMessage += "Replica exchange simulations are not supported.\n";
 -    }
 -    if (inputrec.eSwapCoords != eswapNO)
 +    if (inputrec.eSwapCoords != SwapType::No)
      {
          errorMessage += "Swapping the coordinates is not supported.\n";
      }
                  "The number of coupled constraints is higher than supported in the GPU LINCS "
                  "code.\n";
      }
 +    if (hasAnyConstraints && !UpdateConstrainGpu::areConstraintsSupported())
 +    {
 +        errorMessage += "Chosen GPU implementation does not support constraints.\n";
 +    }
      if (haveFrozenAtoms)
      {
          // There is a known bug with frozen atoms and GPU update, see Issue #3920.
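
decideWhetherToUseGpuForUpdate() above collects every blocker into errorMessage and only fails hard when the user explicitly forced the GPU target; in auto mode it falls back to the CPU path. A minimal sketch of that decision pattern, with hypothetical names (decideWhetherToUseGpuForFeature is not a real GROMACS function):

    #include <stdexcept>
    #include <string>

    enum class TaskTarget { Auto, Cpu, Gpu };

    // Collect all blockers first, then throw only if the user required the GPU.
    bool decideWhetherToUseGpuForFeature(TaskTarget target, bool gpusWereDetected, bool inputIsSupported)
    {
        if (target == TaskTarget::Cpu)
        {
            return false;
        }
        std::string errorMessage;
        if (!gpusWereDetected)
        {
            errorMessage += "Compatible GPUs must have been found.\n";
        }
        if (!inputIsSupported)
        {
            errorMessage += "The input system is not supported.\n";
        }
        if (!errorMessage.empty())
        {
            if (target == TaskTarget::Gpu)
            {
                // The real code uses GMX_THROW(NotImplementedError(...)) here.
                throw std::runtime_error("Cannot run on a GPU because:\n" + errorMessage);
            }
            return false; // auto mode: quietly fall back to the CPU path
        }
        return true;
    }

    int main()
    {
        return decideWhetherToUseGpuForFeature(TaskTarget::Auto, true, true) ? 0 : 1;
    }
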
index 3404bebb6af6f6c6e8bc3097dd8d77695fb6768e,8f16f25ae80b198aabf6cb2116ff741bc976ca32..76b6f40042a0fec1ecd7bcec55482aec9d3ac72d
@@@ -88,8 -88,6 +88,8 @@@ struct DevelopmentFeatureFlag
      bool enableGpuHaloExchange = false;
      //! True if the PME PP direct communication GPU development feature is enabled
      bool enableGpuPmePPComm = false;
 +    //! True if the CUDA-aware MPI is being used for GPU direct communication feature
 +    bool usingCudaAwareMpi = false;
  };
  
  
@@@ -105,7 -103,8 +105,7 @@@ class MDAtoms
   *
   * \param[in] nonbondedTarget              The user's choice for mdrun -nb for where to assign
   *                                         short-ranged nonbonded interaction tasks.
 - * \param[in] numDevicesToUse              Number of compatible GPUs that the user permitted
 - *                                         us to use.
 + * \param[in] haveAvailableDevices         Whether there are available devices.
   * \param[in] userGpuTaskAssignment        The user-specified assignment of GPU tasks to device IDs.
   * \param[in] emulateGpuNonbonded          Whether we will emulate GPU calculation of nonbonded
   *                                         interactions.
   * \throws     std::bad_alloc          If out of memory
   *             InconsistentInputError  If the user requirements are inconsistent. */
  bool decideWhetherToUseGpusForNonbondedWithThreadMpi(TaskTarget              nonbondedTarget,
 -                                                     int                     numDevicesToUse,
 +                                                     bool                    haveAvailableDevices,
                                                       const std::vector<int>& userGpuTaskAssignment,
                                                       EmulateGpuNonbonded     emulateGpuNonbonded,
                                                       bool buildSupportsNonbondedOnGpu,
   * \param[in]  useGpuForNonbonded        Whether GPUs will be used for nonbonded interactions.
   * \param[in]  pmeTarget                 The user's choice for mdrun -pme for where to assign
   *                                       long-ranged PME nonbonded interaction tasks.
+  * \param[in]  pmeFftTarget              The user's choice for mdrun -pmefft for where to run FFT.
   * \param[in]  numDevicesToUse           The number of compatible GPUs that the user permitted us to use.
   * \param[in]  userGpuTaskAssignment     The user-specified assignment of GPU tasks to device IDs.
   * \param[in]  hardwareInfo              Hardware information
   *             InconsistentInputError  If the user requirements are inconsistent. */
  bool decideWhetherToUseGpusForPmeWithThreadMpi(bool                    useGpuForNonbonded,
                                                 TaskTarget              pmeTarget,
+                                                TaskTarget              pmeFftTarget,
                                                 int                     numDevicesToUse,
                                                 const std::vector<int>& userGpuTaskAssignment,
                                                 const gmx_hw_info_t&    hardwareInfo,
@@@ -209,6 -210,7 +211,7 @@@ bool decideWhetherToUseGpusForNonbonded
   *
   * \param[in]  useGpuForNonbonded        Whether GPUs will be used for nonbonded interactions.
   * \param[in]  pmeTarget                 The user's choice for mdrun -pme for where to assign long-ranged PME nonbonded interaction tasks.
+  * \param[in]  pmeFftTarget              The user's choice for mdrun -pmefft for where to do FFT for PME.
   * \param[in]  userGpuTaskAssignment     The user-specified assignment of GPU tasks to device IDs.
   * \param[in]  hardwareInfo              Hardware information
   * \param[in]  inputrec                  The user input
   *             InconsistentInputError  If the user requirements are inconsistent. */
  bool decideWhetherToUseGpusForPme(bool                    useGpuForNonbonded,
                                    TaskTarget              pmeTarget,
+                                   TaskTarget              pmeFftTarget,
                                    const std::vector<int>& userGpuTaskAssignment,
                                    const gmx_hw_info_t&    hardwareInfo,
                                    const t_inputrec&       inputrec,
@@@ -278,6 -281,7 +282,6 @@@ bool decideWhetherToUseGpusForBonded(bo
   * \param[in]  mtop                         The global topology.
   * \param[in]  useEssentialDynamics         If essential dynamics is active.
   * \param[in]  doOrientationRestraints      If orientation restraints are enabled.
 - * \param[in]  useReplicaExchange           If this is a REMD simulation.
   * \param[in]  haveFrozenAtoms              If this simulation has frozen atoms (see Issue #3920).
  * \param[in]  doRerun                      If this is a rerun.
   * \param[in]  devFlags                     GPU development / experimental feature flags.
@@@ -298,6 -302,7 +302,6 @@@ bool decideWhetherToUseGpuForUpdate(boo
                                      const gmx_mtop_t&              mtop,
                                      bool                           useEssentialDynamics,
                                      bool                           doOrientationRestraints,
 -                                    bool                           useReplicaExchange,
                                      bool                           haveFrozenAtoms,
                                      bool                           doRerun,
                                      const DevelopmentFeatureFlags& devFlags,
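
The new pmeFftTarget parameter threads the mdrun -pmefft choice into both PME decision functions, so an input that Mixed mode cannot handle is caught early. A sketch under stated assumptions — the two support checks below are stand-ins, not the real pme_gpu_supports_input()/pme_gpu_mixed_mode_supports_input() signatures:

    #include <string>

    enum class TaskTarget { Auto, Cpu, Gpu };

    // Stand-ins: the real checks inspect the t_inputrec and fill a reason string.
    bool fullGpuPmeSupported(std::string* /*reason*/) { return true; }
    bool mixedModePmeSupported(std::string* reason)
    {
        *reason = "the system is not compatible with Mixed mode";
        return false;
    }

    bool decideWhetherToUseGpusForPmeSketch(TaskTarget pmeTarget, TaskTarget pmeFftTarget)
    {
        std::string reason;
        if (pmeTarget == TaskTarget::Cpu || !fullGpuPmeSupported(&reason))
        {
            return false; // PME cannot run on a GPU at all
        }
        if (pmeFftTarget == TaskTarget::Cpu && !mixedModePmeSupported(&reason))
        {
            // Mirrors the new check: with -pmefft cpu but an incompatible input,
            // fall back to CPU PME (the real code errors out later if the user
            // explicitly required -pme gpu).
            return false;
        }
        return true;
    }

    int main()
    {
        return decideWhetherToUseGpusForPmeSketch(TaskTarget::Auto, TaskTarget::Cpu) ? 0 : 1;
    }
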