Merge release-2021 into master
author    Andrey Alekseenko <al42and@gmail.com>
          Mon, 18 Oct 2021 14:31:36 +0000 (16:31 +0200)
committer Andrey Alekseenko <al42and@gmail.com>
          Mon, 18 Oct 2021 14:39:47 +0000 (16:39 +0200)
Brings the following fixes from the release branch into master:

- #4148 (!1941)
- #4167 (!1966)
- #4190 (!2050)
- !2039

docs/gmxapi/userguide/install.rst
src/gromacs/ewald/pme.cpp
src/gromacs/ewald/pme.h
src/gromacs/mdlib/vsite.cpp
src/gromacs/mdrun/runner.cpp
src/gromacs/selection/evaluate.cpp
src/gromacs/taskassignment/decidegpuusage.cpp
src/gromacs/taskassignment/decidegpuusage.h

diff --combined docs/gmxapi/userguide/install.rst
index f0ab2210fb9ced25c1be0cb28f42833538b96ce8,8588156a48b3d85140459260e93457ec49e99d3f..e1994e9c74b343f80ed9c496eafad0eb6a494693
@@@ -65,7 -65,8 +65,7 @@@ Set up a Python virtual environment
      python3 -m venv $HOME/myvenv
      . $HOME/myvenv/bin/activate
      python -m ensurepip --default-pip
 -    pip install --upgrade pip setuptools
 -    pip install --upgrade cmake scikit-build
 +    pip install --upgrade pip setuptools wheel
  
  .. seealso:: :ref:`gmxapi venv`
  
@@@ -85,7 -86,7 +85,7 @@@ Background
  *gmxapi* comes in three parts:
  
  * GROMACS gmxapi library for C++.
 -* This Python package, supporting Python 3.6 and higher
 +* This Python package, supporting Python 3.7 and higher
  * MD restraint plugins and sample gmxapi client code
  
  GROMACS requirements
@@@ -111,9 -112,9 +111,9 @@@ to the build configuration
  Build system requirements
  -------------------------
  
 -gmxapi can be built for Python 3.6 and higher.
 +gmxapi can be built for Python 3.7 and higher.
  
 -You will need a C++ 14 compatible compiler and a reasonably up-to-date version
 +You will need a C++ 17 compatible compiler and a reasonably up-to-date version
  of CMake.
  Full gmxapi functionality may also require an MPI compiler (e.g. :command:`mpicc`).
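
  For example, a quick way to check what your toolchain provides (a sketch;
  the exact compiler and MPI wrapper commands depend on your system)::

      cmake --version
      mpicc --version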
  
@@@ -143,7 -144,7 +143,7 @@@ that works
  Python environment requirements
  -------------------------------
  
 -gmxapi requires Python 3.6 or higher. Check your version with
 +gmxapi requires Python 3.7 or higher. Check your version with
  :command:`python3 --version` or :command:`python --version`.
  
  ..  note::
      :command:`python3 --version` or :command:`python --version` and :command:`pip --version`.
  
  To build and install, you need the Python packages for
 -cmake_, networkx_, scikit-build_, and setuptools_
 +cmake_, networkx_, and setuptools_
  (all available from `PyPI with pip <https://pip.pypa.io/en/stable/>`_).
  
  For full functionality, you should also have mpi4py_ and numpy_.
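
  For example, a single command covering all of these (a sketch; install only
  what your use case needs)::

      pip install --upgrade cmake networkx setuptools mpi4py numpy
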
@@@ -339,7 -340,7 +339,7 @@@ the remaining examples assume the virtual environment
  Install dependencies
  ^^^^^^^^^^^^^^^^^^^^
  
 -It is always a good idea to update pip_ and setuptools_ before installing
 +It is always a good idea to update pip_, setuptools_, and wheel_ before installing
  new Python packages::
  
      pip install --upgrade pip setuptools wheel
@@@ -349,7 -350,7 +349,7 @@@ they are installed and up to date before
  
  ::
  
 -    pip install --upgrade cmake scikit-build
 +    pip install --upgrade cmake pybind11
  
  For MPI, we use mpi4py_.
  Make sure it is using the same MPI installation that we are building
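
  For example, one common approach (a sketch, following mpi4py's installation
  notes; it assumes :command:`mpicc` is the wrapper for the MPI library you
  will build GROMACS against)::

      MPICC=$(which mpicc) pip install --upgrade --no-cache-dir mpi4py
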
@@@ -431,7 -432,7 +431,7 @@@ without internet access, either from th
  
  For example, the last line of the previous example could be replaced with::
  
 -    pip install --no-cache-dir --no-deps --no-index --no-build-isolation .
 +    pip install --no-cache-dir --no-deps --no-index .
  
  Refer to pip_ documentation for descriptions of these options.
  
@@@ -448,11 -449,11 +448,11 @@@ Building a source archive
  -------------------------
  
  A source archive for the gmxapi python package can be built from the GROMACS
 -source repository using Python ``setuptools`` and ``scikit-build``.
 +source repository using Python ``setuptools``.
  
  Example::
  
 -    pip install --upgrade setuptools scikit-build
 +    pip install --upgrade setuptools
      cd python_packaging/src
      python setup.py sdist
  
@@@ -461,13 -462,6 +461,13 @@@ archive file. The file name has the form
  *<version>* is the version from the ``setup.py`` file, and *<suffix>* is
  determined by the local environment or by additional arguments to ``setup.py``.
  
 +The new `build <https://pypa-build.readthedocs.io/en/latest/>`__ module is somewhat tidier.
 +It automatically manages a temporary venv with the necessary dependencies::
 +
 +    pip install --upgrade build
 +    cd python_packaging/src
 +    python -m build --sdist .
 +
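 +By default, the resulting archive is written to the :file:`dist/` directory
 +(see the *build* documentation linked above).
 +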
  .. seealso::
  
      Python documentation for
@@@ -502,29 -496,16 +502,29 @@@ This will first build the *gmxapi* Python
  location in the build tree. Sphinx can then import the package to automatically
  extract Python docstrings.
  
 +Note that this is an entirely CMake-driven installation and Python dependencies
 +will not be installed automatically. You can update your Python environment
 +(before configuring with CMake) using the :file:`requirements.txt` files provided
 +in the :file:`python_packaging/` directory of the repository. Example::
 +
 +    pip install -r python_packaging/requirements-docs.txt
 +
 +or
 +
 +::
 +
 +    pip install -r python_packaging/requirements-test.txt
 +
  Sometimes the build environment can choose a different Python interpreter than
  the one you intended.
- You can set the ``Python3_ROOT`` or ``CMAKE_PREFIX_PATH`` CMake variable to
+ You can set the ``Python3_ROOT_DIR`` or ``CMAKE_PREFIX_PATH`` CMake variable to
  explicitly choose the Python installation or *venv* directory.
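
  For example, to point CMake at the venv created earlier in this guide
  (a sketch assuming the :file:`$HOME/myvenv` path from above)::

      cmake .. -DPython3_ROOT_DIR=$HOME/myvenv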
  
  If you use pyenv or pyenv-virtualenv to dynamically manage your Python version,
  you can help identify a particular version with ``pyenv version-name`` and the
  directory with ``pyenv prefix {version}``. For example::
  
-     -DPython3_ROOT=$(pyenv prefix $(pyenv version-name))
+     -DPython3_ROOT_DIR=$(pyenv prefix $(pyenv version-name))
  
  Docker web server
  -----------------
@@@ -570,27 -551,6 +570,27 @@@ and see if that fixes the problem. If not
  (see above) to help narrow down the problem before you
  `open an issue <https://gitlab.com/gromacs/gromacs/-/issues/>`_.
  
 +Errors regarding pybind11
 +-------------------------
 +
 +An error may occur in ``setup.py`` with output that contains something like the following::
 +
 +      ModuleNotFoundError: No module named 'pybind11'
 +      Building wheel for gmxapi (pyproject.toml): finished with status 'error'
 +      ERROR: Failed building wheel for gmxapi
 +    Failed to build gmxapi
 +    ERROR: Could not build wheels for gmxapi, which is required to install pyproject.toml-based projects
 +
 +The important information here is that ``pybind11`` was not found.
 +
 +Build dependencies aren't always automatically installed.
 +Even if you are using ``pip``, you may have disabled automatic dependency fulfillment with an option like ``--no-build-isolation`` or ``--no-deps``.
 +
 +In any case, the problem should be resolved by explicitly installing the ``pybind11``
 +Python package before attempting to build ``gmxapi``::
 +
 +    pip install --upgrade pybind11
 +
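 +To confirm that the package is now visible, you can run a quick check
 +(``pybind11.get_include()`` is a helper provided by the pybind11 Python
 +package)::
 +
 +    python -c "import pybind11; print(pybind11.get_include())"
 +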
  Couldn't find the ``gmxapi`` support library?
  ---------------------------------------------
  
@@@ -709,5 -669,3 +709,5 @@@ rebase your feature branches for the ne
  .. _scikit-build: https://pypi.org/project/scikit-build/
  
  .. _setuptools: https://pypi.org/project/setuptools/
 +
 +.. _wheel: https://pypi.org/project/wheel/
diff --combined src/gromacs/ewald/pme.cpp
index 7438077d0b26031ecaf055243eea06eb928dc27a,b0b59e6eedbaf7fe8f37134805eab50a5280b711..e2c87459a0589403888d57fdfead803bbabb869f
  #include "gromacs/utility/logger.h"
  #include "gromacs/utility/real.h"
  #include "gromacs/utility/smalloc.h"
 -#include "gromacs/utility/stringutil.h"
 +#include "gromacs/utility/message_string_collector.h"
  #include "gromacs/utility/unique_cptr.h"
  
  #include "calculate_spline_moduli.h"
  #include "pme_spline_work.h"
  #include "pme_spread.h"
  
 -/*! \brief Help build a descriptive message in \c error if there are
 - * \c errorReasons why PME on GPU is not supported.
 - *
 - * \returns Whether the lack of errorReasons indicate there is support. */
 -static bool addMessageIfNotSupported(const std::list<std::string>& errorReasons, std::string* error)
 -{
 -    bool isSupported = errorReasons.empty();
 -    if (!isSupported && error)
 -    {
 -        std::string regressionTestMarker = "PME GPU does not support";
 -        // this prefix is tested for in the regression tests script gmxtest.pl
 -        *error = regressionTestMarker;
 -        if (errorReasons.size() == 1)
 -        {
 -            *error += " " + errorReasons.back();
 -        }
 -        else
 -        {
 -            *error += ": " + gmx::joinStrings(errorReasons, "; ");
 -        }
 -        *error += ".";
 -    }
 -    return isSupported;
 -}
 +//NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
 +bool g_allowPmeWithSyclForTesting = false;
  
  bool pme_gpu_supports_build(std::string* error)
  {
 -    std::list<std::string> errorReasons;
 -    if (GMX_DOUBLE)
 -    {
 -        errorReasons.emplace_back("a double-precision build");
 -    }
 -    if (!GMX_GPU)
 -    {
 -        errorReasons.emplace_back("a non-GPU build");
 -    }
 -    if (GMX_GPU_SYCL)
 -    {
 -        errorReasons.emplace_back("SYCL build"); // SYCL-TODO
 -    }
 -    return addMessageIfNotSupported(errorReasons, error);
 +    gmx::MessageStringCollector errorReasons;
 +    // Before changing the prefix string, make sure that it is not searched for in regression tests.
 +    errorReasons.startContext("PME GPU does not support:");
 +    errorReasons.appendIf(GMX_DOUBLE, "Double-precision build of GROMACS.");
 +    errorReasons.appendIf(!GMX_GPU, "Non-GPU build of GROMACS.");
 +    errorReasons.appendIf(GMX_GPU_SYCL && !g_allowPmeWithSyclForTesting, "SYCL build."); // SYCL-TODO
 +    errorReasons.finishContext();
 +    if (error != nullptr)
 +    {
 +        *error = errorReasons.toString();
 +    }
 +    return errorReasons.isEmpty();
  }
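
  // Hypothetical caller sketch (illustration only, not part of this change):
  //
  //     std::string error;
  //     if (!pme_gpu_supports_build(&error))
  //     {
  //         GMX_THROW(gmx::NotImplementedError(error));
  //     }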
  
  bool pme_gpu_supports_hardware(const gmx_hw_info_t gmx_unused& hwinfo, std::string* error)
  {
 -    std::list<std::string> errorReasons;
 -
 -    if (GMX_GPU_OPENCL)
 -    {
 +    gmx::MessageStringCollector errorReasons;
 +    // Before changing the prefix string, make sure that it is not searched for in regression tests.
 +    errorReasons.startContext("PME GPU does not support:");
  #ifdef __APPLE__
 -        errorReasons.emplace_back("Apple OS X operating system");
 +    errorReasons.appendIf(GMX_GPU_OPENCL, "Apple OS X operating system");
  #endif
 +    errorReasons.finishContext();
 +    if (error != nullptr)
 +    {
 +        *error = errorReasons.toString();
      }
 -    return addMessageIfNotSupported(errorReasons, error);
 +    return errorReasons.isEmpty();
  }
  
  bool pme_gpu_supports_input(const t_inputrec& ir, std::string* error)
  {
 -    std::list<std::string> errorReasons;
 -    if (!EEL_PME(ir.coulombtype))
 -    {
 -        errorReasons.emplace_back("systems that do not use PME for electrostatics");
 -    }
 -    if (ir.pme_order != 4)
 -    {
 -        errorReasons.emplace_back("interpolation orders other than 4");
 -    }
 -    if (EVDW_PME(ir.vdwtype))
 -    {
 -        errorReasons.emplace_back("Lennard-Jones PME");
 -    }
 -    if (!EI_DYNAMICS(ir.eI))
 -    {
 -        errorReasons.emplace_back(
 -                "Cannot compute PME interactions on a GPU, because PME GPU requires a dynamical "
 -                "integrator (md, sd, etc).");
 -    }
 -    return addMessageIfNotSupported(errorReasons, error);
 +    gmx::MessageStringCollector errorReasons;
 +    // Before changing the prefix string, make sure that it is not searched for in regression tests.
 +    errorReasons.startContext("PME GPU does not support:");
 +    errorReasons.appendIf(!EEL_PME(ir.coulombtype),
 +                          "Systems that do not use PME for electrostatics.");
 +    errorReasons.appendIf((ir.pme_order != 4), "Interpolation orders other than 4.");
 +    errorReasons.appendIf(EVDW_PME(ir.vdwtype), "Lennard-Jones PME.");
 +    errorReasons.appendIf(!EI_DYNAMICS(ir.eI), "Non-dynamical integrator (use md, sd, etc).");
 +    errorReasons.finishContext();
 +    if (error != nullptr)
 +    {
 +        *error = errorReasons.toString();
 +    }
 +    return errorReasons.isEmpty();
  }
  
 -    std::list<std::string> errorReasons;
 -    if (ir.efep != efepNO)
+ bool pme_gpu_mixed_mode_supports_input(const t_inputrec& ir, std::string* error)
+ {
 -        errorReasons.emplace_back("Free Energy Perturbation (in PME GPU mixed mode)");
++    gmx::MessageStringCollector errorReasons;
++    // Before changing the prefix string, make sure that it is not searched for in regression tests.
++    errorReasons.startContext("PME GPU in Mixed mode does not support:");
++    errorReasons.appendIf(ir.efep != FreeEnergyPerturbationType::No, "Free Energy Perturbation.");
++    errorReasons.finishContext();
++    if (error != nullptr)
+     {
 -    return addMessageIfNotSupported(errorReasons, error);
++        *error = errorReasons.toString();
+     }
++    return errorReasons.isEmpty();
+ }

  /*! \brief \libinternal
   * Finds out if PME with given inputs is possible to run on GPU.
   * This function is an internal final check, validating the whole PME structure on creation,
   */
  static bool pme_gpu_check_restrictions(const gmx_pme_t* pme, std::string* error)
  {
 -    std::list<std::string> errorReasons;
 -    if (pme->nnodes != 1)
 -    {
 -        errorReasons.emplace_back("PME decomposition");
 -    }
 -    if (pme->pme_order != 4)
 -    {
 -        errorReasons.emplace_back("interpolation orders other than 4");
 -    }
 -    if (pme->doLJ)
 -    {
 -        errorReasons.emplace_back("Lennard-Jones PME");
 -    }
 -    if (GMX_DOUBLE)
 -    {
 -        errorReasons.emplace_back("double precision");
 -    }
 -    if (!GMX_GPU)
 -    {
 -        errorReasons.emplace_back("non-GPU build of GROMACS");
 -    }
 -    if (GMX_GPU_SYCL)
 -    {
 -        errorReasons.emplace_back("SYCL build of GROMACS"); // SYCL-TODO
 -    }
 -    return addMessageIfNotSupported(errorReasons, error);
 +    gmx::MessageStringCollector errorReasons;
 +    // Before changing the prefix string, make sure that it is not searched for in regression tests.
 +    errorReasons.startContext("PME GPU does not support:");
 +    errorReasons.appendIf((pme->nnodes != 1), "PME decomposition.");
 +    errorReasons.appendIf((pme->pme_order != 4), "Interpolation orders other than 4.");
 +    errorReasons.appendIf(pme->doLJ, "Lennard-Jones PME.");
 +    errorReasons.appendIf(GMX_DOUBLE, "Double precision build of GROMACS.");
 +    errorReasons.appendIf(!GMX_GPU, "Non-GPU build of GROMACS.");
 +    errorReasons.appendIf(GMX_GPU_SYCL && !g_allowPmeWithSyclForTesting, "SYCL build of GROMACS."); // SYCL-TODO
 +    errorReasons.finishContext();
 +    if (error != nullptr)
 +    {
 +        *error = errorReasons.toString();
 +    }
 +    return errorReasons.isEmpty();
  }
  
  PmeRunMode pme_run_mode(const gmx_pme_t* pme)
@@@ -285,7 -333,11 +299,7 @@@ PmeAtomComm::PmeAtomComm(MPI_Comm   Pme
                           const int  pmeOrder,
                           const int  dimIndex,
                           const bool doSpread) :
 -    dimind(dimIndex),
 -    bSpread(doSpread),
 -    pme_order(pmeOrder),
 -    nthread(numThreads),
 -    spline(nthread)
 +    dimind(dimIndex), bSpread(doSpread), pme_order(pmeOrder), nthread(numThreads), spline(nthread)
  {
      if (PmeMpiCommunicator != MPI_COMM_NULL)
      {
@@@ -433,18 -485,8 +447,18 @@@ static void init_overlap_comm(pme_overl
      MPI_Status stat;
      for (size_t b = 0; b < ol->comm_data.size(); b++)
      {
 -        MPI_Sendrecv(&ol->send_size, 1, MPI_INT, ol->comm_data[b].send_id, b, &ol->comm_data[b].recv_size,
 -                     1, MPI_INT, ol->comm_data[b].recv_id, b, ol->mpi_comm, &stat);
 +        MPI_Sendrecv(&ol->send_size,
 +                     1,
 +                     MPI_INT,
 +                     ol->comm_data[b].send_id,
 +                     b,
 +                     &ol->comm_data[b].recv_size,
 +                     1,
 +                     MPI_INT,
 +                     ol->comm_data[b].recv_id,
 +                     b,
 +                     ol->mpi_comm,
 +                     &stat);
      }
  #endif
  
@@@ -482,8 -524,7 +496,8 @@@ bool gmx_pme_check_restrictions(int pme
          std::string message = gmx::formatString(
                  "pme_order (%d) is larger than the maximum allowed value (%d). Modify and "
                  "recompile the code if you really need such a high order.",
 -                pme_order, PME_ORDER_MAX);
 +                pme_order,
 +                PME_ORDER_MAX);
          GMX_THROW(gmx::InconsistentInputError(message));
      }
  
                    "threads, the number of grid lines per rank along x should be >= pme_order (%d) "
                    "or = pmeorder-1. To resolve this issue, use fewer ranks along x (and possibly "
                    "more along y and/or z) by specifying -dd manually.",
 -                  nkx / static_cast<double>(numPmeDomainsAlongX), pme_order);
 +                  nkx / static_cast<double>(numPmeDomainsAlongX),
 +                  pme_order);
      }
  
      return true;
@@@ -553,6 -593,9 +567,6 @@@ gmx_pme_t* gmx_pme_init(const t_commrec
  
      gmx::unique_cptr<gmx_pme_t, gmx_pme_destroy> pme(new gmx_pme_t());
  
 -    pme->sum_qgrid_tmp    = nullptr;
 -    pme->sum_qgrid_dd_tmp = nullptr;
 -
      pme->buf_nalloc = 0;
  
      pme->nnodes  = 1;
              pme->ndecompdim = 2;
  
  #if GMX_MPI
 -            MPI_Comm_split(pme->mpi_comm, pme->nodeid % numPmeDomains.y, pme->nodeid,
 +            MPI_Comm_split(pme->mpi_comm,
 +                           pme->nodeid % numPmeDomains.y,
 +                           pme->nodeid,
                             &pme->mpi_comm_d[0]); /* My communicator along major dimension */
 -            MPI_Comm_split(pme->mpi_comm, pme->nodeid / numPmeDomains.y, pme->nodeid,
 +            MPI_Comm_split(pme->mpi_comm,
 +                           pme->nodeid / numPmeDomains.y,
 +                           pme->nodeid,
                             &pme->mpi_comm_d[1]); /* My communicator along minor dimension */
  
              MPI_Comm_rank(pme->mpi_comm_d[0], &pme->nodeid_major);
       * not calculating free-energy for Coulomb and/or LJ while gmx_pme_init()
       * configures with free-energy, but that has never been tested.
       */
 -    pme->doCoulomb     = EEL_PME(ir->coulombtype);
 -    pme->doLJ          = EVDW_PME(ir->vdwtype);
 -    pme->bFEP_q        = ((ir->efep != efepNO) && bFreeEnergy_q);
 -    pme->bFEP_lj       = ((ir->efep != efepNO) && bFreeEnergy_lj);
 -    pme->bFEP          = (pme->bFEP_q || pme->bFEP_lj);
 -    pme->nkx           = ir->nkx;
 -    pme->nky           = ir->nky;
 -    pme->nkz           = ir->nkz;
 -    pme->bP3M          = (ir->coulombtype == eelP3M_AD || getenv("GMX_PME_P3M") != nullptr);
 +    pme->doCoulomb = EEL_PME(ir->coulombtype);
 +    pme->doLJ      = EVDW_PME(ir->vdwtype);
 +    pme->bFEP_q    = ((ir->efep != FreeEnergyPerturbationType::No) && bFreeEnergy_q);
 +    pme->bFEP_lj   = ((ir->efep != FreeEnergyPerturbationType::No) && bFreeEnergy_lj);
 +    pme->bFEP      = (pme->bFEP_q || pme->bFEP_lj);
 +    pme->nkx       = ir->nkx;
 +    pme->nky       = ir->nky;
 +    pme->nkz       = ir->nkz;
 +    pme->bP3M = (ir->coulombtype == CoulombInteractionType::P3mAD || getenv("GMX_PME_P3M") != nullptr);
      pme->pme_order     = ir->pme_order;
      pme->ewaldcoeff_q  = ewaldcoeff_q;
      pme->ewaldcoeff_lj = ewaldcoeff_lj;
  
      // The box requires scaling with nwalls = 2, we store that condition as well
      // as the scaling factor
 -    delete pme->boxScaler;
 -    pme->boxScaler = new EwaldBoxZScaler(*ir);
 +    pme->boxScaler = std::make_unique<EwaldBoxZScaler>(
 +            EwaldBoxZScaler(inputrecPbcXY2Walls(ir), ir->wall_ewald_zfac));
  
      /* If we violate restrictions, generate a fatal error here */
 -    gmx_pme_check_restrictions(pme->pme_order, pme->nkx, pme->nky, pme->nkz, pme->nnodes_major,
 -                               pme->bUseThreads, true);
 +    gmx_pme_check_restrictions(
 +            pme->pme_order, pme->nkx, pme->nky, pme->nkz, pme->nnodes_major, pme->bUseThreads, true);
  
      if (pme->nnodes > 1)
      {
                              "      and PME grid_y (%d) and grid_z (%d) should be divisible by "
                              "#PME_ranks_y "
                              "(%d)",
 -                            gmx::roundToInt((imbal - 1) * 100), pme->nkx, pme->nky,
 -                            pme->nnodes_major, pme->nky, pme->nkz, pme->nnodes_minor);
 +                            gmx::roundToInt((imbal - 1) * 100),
 +                            pme->nkx,
 +                            pme->nky,
 +                            pme->nnodes_major,
 +                            pme->nky,
 +                            pme->nkz,
 +                            pme->nnodes_minor);
          }
      }
  
       * y is always copied through a buffer: we don't need padding in z,
       * but we do need the overlap in x because of the communication order.
       */
 -    init_overlap_comm(&pme->overlap[0], pme->pme_order, pme->mpi_comm_d[0], pme->nnodes_major,
 -                      pme->nodeid_major, pme->nkx,
 +    init_overlap_comm(&pme->overlap[0],
 +                      pme->pme_order,
 +                      pme->mpi_comm_d[0],
 +                      pme->nnodes_major,
 +                      pme->nodeid_major,
 +                      pme->nkx,
                        (div_round_up(pme->nky, pme->nnodes_minor) + pme->pme_order)
                                * (pme->nkz + pme->pme_order - 1));
  
       * We do this with an offset buffer of equal size, so we need to allocate
       * extra for the offset. That's what the (+1)*pme->nkz is for.
       */
 -    init_overlap_comm(&pme->overlap[1], pme->pme_order, pme->mpi_comm_d[1], pme->nnodes_minor,
 -                      pme->nodeid_minor, pme->nky,
 +    init_overlap_comm(&pme->overlap[1],
 +                      pme->pme_order,
 +                      pme->mpi_comm_d[1],
 +                      pme->nnodes_minor,
 +                      pme->nodeid_minor,
 +                      pme->nky,
                        (div_round_up(pme->nkx, pme->nnodes_major) + pme->pme_order + 1) * pme->nkz);
  
      /* Double-check for a limitation of the (current) sum_fftgrid_dd code.
      pme->pmegrid_start_iy = pme->overlap[1].s2g0[pme->nodeid_minor];
      pme->pmegrid_start_iz = 0;
  
 -    make_gridindex_to_localindex(pme->nkx, pme->pmegrid_start_ix,
 -                                 pme->pmegrid_nx - (pme->pme_order - 1), &pme->nnx, &pme->fshx);
 -    make_gridindex_to_localindex(pme->nky, pme->pmegrid_start_iy,
 -                                 pme->pmegrid_ny - (pme->pme_order - 1), &pme->nny, &pme->fshy);
 -    make_gridindex_to_localindex(pme->nkz, pme->pmegrid_start_iz, pme->pmegrid_nz_base, &pme->nnz,
 -                                 &pme->fshz);
 +    make_gridindex_to_localindex(
 +            pme->nkx, pme->pmegrid_start_ix, pme->pmegrid_nx - (pme->pme_order - 1), &pme->nnx, &pme->fshx);
 +    make_gridindex_to_localindex(
 +            pme->nky, pme->pmegrid_start_iy, pme->pmegrid_ny - (pme->pme_order - 1), &pme->nny, &pme->fshy);
 +    make_gridindex_to_localindex(
 +            pme->nkz, pme->pmegrid_start_iz, pme->pmegrid_nz_base, &pme->nnz, &pme->fshz);
  
      pme->spline_work = make_pme_spline_work(pme->pme_order);
  
       */
      if (pme->doLJ)
      {
 -        pme->ngrids = ((ir->ljpme_combination_rule == eljpmeLB) ? DO_Q_AND_LJ_LB : DO_Q_AND_LJ);
 +        pme->ngrids = ((ir->ljpme_combination_rule == LongRangeVdW::LB) ? DO_Q_AND_LJ_LB : DO_Q_AND_LJ);
      }
      else
      {
      {
          if ((i < DO_Q && pme->doCoulomb && (i == 0 || bFreeEnergy_q))
              || (i >= DO_Q && pme->doLJ
 -                && (i == 2 || bFreeEnergy_lj || ir->ljpme_combination_rule == eljpmeLB)))
 +                && (i == 2 || bFreeEnergy_lj || ir->ljpme_combination_rule == LongRangeVdW::LB)))
          {
 -            pmegrids_init(&pme->pmegrid[i], pme->pmegrid_nx, pme->pmegrid_ny, pme->pmegrid_nz,
 -                          pme->pmegrid_nz_base, pme->pme_order, pme->bUseThreads, pme->nthread,
 +            pmegrids_init(&pme->pmegrid[i],
 +                          pme->pmegrid_nx,
 +                          pme->pmegrid_ny,
 +                          pme->pmegrid_nz,
 +                          pme->pmegrid_nz_base,
 +                          pme->pme_order,
 +                          pme->bUseThreads,
 +                          pme->nthread,
                            pme->overlap[0].s2g1[pme->nodeid_major]
                                    - pme->overlap[0].s2g0[pme->nodeid_major + 1],
                            pme->overlap[1].s2g1[pme->nodeid_minor]
              const auto allocateRealGridForGpu = (pme->runMode == PmeRunMode::Mixed)
                                                          ? gmx::PinningPolicy::PinnedIfSupported
                                                          : gmx::PinningPolicy::CannotBePinned;
 -            gmx_parallel_3dfft_init(&pme->pfft_setup[i], ndata, &pme->fftgrid[i], &pme->cfftgrid[i],
 -                                    pme->mpi_comm_d, bReproducible, pme->nthread, allocateRealGridForGpu);
 +            gmx_parallel_3dfft_init(&pme->pfft_setup[i],
 +                                    ndata,
 +                                    &pme->fftgrid[i],
 +                                    &pme->cfftgrid[i],
 +                                    pme->mpi_comm_d,
 +                                    bReproducible,
 +                                    pme->nthread,
 +                                    allocateRealGridForGpu);
          }
      }
  
@@@ -924,27 -938,15 +938,27 @@@ void gmx_pme_reinit(struct gmx_pme_t** 
          const gmx::MDLogger dummyLogger;
          GMX_ASSERT(pmedata, "Invalid PME pointer");
          NumPmeDomains numPmeDomains = { pme_src->nnodes_major, pme_src->nnodes_minor };
 -        *pmedata = gmx_pme_init(cr, numPmeDomains, &irc, pme_src->bFEP_q, pme_src->bFEP_lj, FALSE,
 -                                ewaldcoeff_q, ewaldcoeff_lj, pme_src->nthread, pme_src->runMode,
 -                                pme_src->gpu, nullptr, nullptr, nullptr, dummyLogger);
 +        *pmedata                    = gmx_pme_init(cr,
 +                                numPmeDomains,
 +                                &irc,
 +                                pme_src->bFEP_q,
 +                                pme_src->bFEP_lj,
 +                                FALSE,
 +                                ewaldcoeff_q,
 +                                ewaldcoeff_lj,
 +                                pme_src->nthread,
 +                                pme_src->runMode,
 +                                pme_src->gpu,
 +                                nullptr,
 +                                nullptr,
 +                                nullptr,
 +                                dummyLogger);
          /* When running PME on the CPU not using domain decomposition,
           * the atom data is allocated once only in gmx_pme_(re)init().
           */
          if (!pme_src->gpu && pme_src->nnodes == 1)
          {
 -            gmx_pme_reinit_atoms(*pmedata, pme_src->atc[0].numAtoms(), nullptr, nullptr);
 +            gmx_pme_reinit_atoms(*pmedata, pme_src->atc[0].numAtoms(), {}, {});
          }
          // TODO this is mostly passing around current values
      }
      /* We would like to reuse the fft grids, but that's harder */
  }
  
 -void gmx_pme_calc_energy(gmx_pme_t* pme, gmx::ArrayRef<const gmx::RVec> x, gmx::ArrayRef<const real> q, real* V)
 +real gmx_pme_calc_energy(gmx_pme_t* pme, gmx::ArrayRef<const gmx::RVec> x, gmx::ArrayRef<const real> q)
  {
      pmegrids_t* grid;
  
      /* Only calculate the spline coefficients, don't actually spread */
      spread_on_grid(pme, atc, nullptr, TRUE, FALSE, pme->fftgrid[PME_GRID_QA], FALSE, PME_GRID_QA);
  
 -    *V = gather_energy_bsplines(pme, grid->grid.grid, atc);
 +    return gather_energy_bsplines(pme, grid->grid.grid, atc);
  }
  
  /*! \brief Calculate initial Lorentz-Berthelot coefficients for LJ-PME */
 -static void calc_initial_lb_coeffs(gmx::ArrayRef<real> coefficient, const real* local_c6, const real* local_sigma)
 +static void calc_initial_lb_coeffs(gmx::ArrayRef<real>       coefficient,
 +                                   gmx::ArrayRef<const real> local_c6,
 +                                   gmx::ArrayRef<const real> local_sigma)
  {
      for (gmx::index i = 0; i < coefficient.ssize(); ++i)
      {
  }
  
  /*! \brief Calculate next Lorentz-Berthelot coefficients for LJ-PME */
 -static void calc_next_lb_coeffs(gmx::ArrayRef<real> coefficient, const real* local_sigma)
 +static void calc_next_lb_coeffs(gmx::ArrayRef<real> coefficient, gmx::ArrayRef<const real> local_sigma)
  {
      for (gmx::index i = 0; i < coefficient.ssize(); ++i)
      {
  int gmx_pme_do(struct gmx_pme_t*              pme,
                 gmx::ArrayRef<const gmx::RVec> coordinates,
                 gmx::ArrayRef<gmx::RVec>       forces,
 -               real                           chargeA[],
 -               real                           chargeB[],
 -               real                           c6A[],
 -               real                           c6B[],
 -               real                           sigmaA[],
 -               real                           sigmaB[],
 +               gmx::ArrayRef<const real>      chargeA,
 +               gmx::ArrayRef<const real>      chargeB,
 +               gmx::ArrayRef<const real>      c6A,
 +               gmx::ArrayRef<const real>      c6B,
 +               gmx::ArrayRef<const real>      sigmaA,
 +               gmx::ArrayRef<const real>      sigmaB,
                 const matrix                   box,
                 const t_commrec*               cr,
                 int                            maxshift_x,
      GMX_ASSERT(pme->runMode == PmeRunMode::CPU,
                 "gmx_pme_do should not be called on the GPU PME run.");
  
 -    int                  d, npme, grid_index, max_grid_index;
 -    PmeAtomComm&         atc         = pme->atc[0];
 -    pmegrids_t*          pmegrid     = nullptr;
 -    real*                grid        = nullptr;
 -    real*                coefficient = nullptr;
 -    PmeOutput            output[2]; // The second is used for the B state with FEP
 -    real                 scale, lambda;
 -    gmx_bool             bClearF;
 -    gmx_parallel_3dfft_t pfft_setup;
 -    real*                fftgrid;
 -    t_complex*           cfftgrid;
 -    int                  thread;
 -    gmx_bool             bFirst, bDoSplines;
 -    int                  fep_state;
 -    int                  fep_states_lj = pme->bFEP_lj ? 2 : 1;
 +    PmeAtomComm&              atc     = pme->atc[0];
 +    pmegrids_t*               pmegrid = nullptr;
 +    real*                     grid    = nullptr;
 +    gmx::ArrayRef<const real> coefficient;
 +    std::array<PmeOutput, 2>  output; // The second is used for the B state with FEP
 +    gmx_parallel_3dfft_t      pfft_setup;
 +    real*                     fftgrid;
 +    t_complex*                cfftgrid;
 +    int                       thread;
 +    const int                 fep_states_lj = pme->bFEP_lj ? 2 : 1;
      // There's no support for computing energy without virial, or vice versa
      const bool computeEnergyAndVirial = (stepWork.computeEnergy || stepWork.computeVirial);
  
      pme->boxScaler->scaleBox(box, scaledBox);
  
      gmx::invertBoxMatrix(scaledBox, pme->recipbox);
 -    bFirst = TRUE;
 +    bool bFirst = true;
  
      /* For simplicity, we construct the splines for all particles if
       * more than one PME calculations is needed. Some optimization
       * that don't yet have them.
       */
  
 -    bDoSplines = pme->bFEP || (pme->doCoulomb && pme->doLJ);
 +    bool bDoSplines = pme->bFEP || (pme->doCoulomb && pme->doLJ);
  
      /* We need a maximum of four separate PME calculations:
       * grid_index=0: Coulomb PME with charges from state A
       */
  
      /* If we are doing LJ-PME with LB, we only do Q here */
 -    max_grid_index = (pme->ljpme_combination_rule == eljpmeLB) ? DO_Q : DO_Q_AND_LJ;
 -
 -    for (grid_index = 0; grid_index < max_grid_index; ++grid_index)
 +    const int max_grid_index = (pme->ljpme_combination_rule == LongRangeVdW::LB) ? DO_Q : DO_Q_AND_LJ;
 +    bool      bClearF;
 +    for (int grid_index = 0; grid_index < max_grid_index; ++grid_index)
      {
          /* Check if we should do calculations at this grid_index
           * If grid_index is odd we should be doing FEP
  
          grid = pmegrid->grid.grid;
  
 -        if (debug)
 -        {
 -            fprintf(debug, "PME: number of ranks = %d, rank = %d\n", cr->nnodes, cr->nodeid);
 -            fprintf(debug, "Grid = %p\n", static_cast<void*>(grid));
 -            if (grid == nullptr)
 -            {
 -                gmx_fatal(FARGS, "No grid!");
 -            }
 -        }
 -
          if (pme->nnodes == 1)
          {
 -            atc.coefficient = gmx::arrayRefFromArray(coefficient, coordinates.size());
 +            atc.coefficient = coefficient;
          }
          else
          {
 -            wallcycle_start(wcycle, ewcPME_REDISTXF);
 +            wallcycle_start(wcycle, WallCycleCounter::PmeRedistXF);
              do_redist_pos_coeffs(pme, cr, bFirst, coordinates, coefficient);
  
 -            wallcycle_stop(wcycle, ewcPME_REDISTXF);
 +            wallcycle_stop(wcycle, WallCycleCounter::PmeRedistXF);
          }
  
 -        if (debug)
 -        {
 -            fprintf(debug, "Rank= %6d, pme local particles=%6d\n", cr->nodeid, atc.numAtoms());
 -        }
 -
 -        wallcycle_start(wcycle, ewcPME_SPREAD);
 +        wallcycle_start(wcycle, WallCycleCounter::PmeSpread);
  
          /* Spread the coefficients on a grid */
          spread_on_grid(pme, &atc, pmegrid, bFirst, TRUE, fftgrid, bDoSplines, grid_index);
              copy_pmegrid_to_fftgrid(pme, grid, fftgrid, grid_index);
          }
  
 -        wallcycle_stop(wcycle, ewcPME_SPREAD);
 +        wallcycle_stop(wcycle, WallCycleCounter::PmeSpread);
  
          /* TODO If the OpenMP and single-threaded implementations
             converge, then spread_on_grid() and
                  /* do 3d-fft */
                  if (thread == 0)
                  {
 -                    wallcycle_start(wcycle, ewcPME_FFT);
 +                    wallcycle_start(wcycle, WallCycleCounter::PmeFft);
                  }
                  gmx_parallel_3dfft_execute(pfft_setup, GMX_FFT_REAL_TO_COMPLEX, thread, wcycle);
                  if (thread == 0)
                  {
 -                    wallcycle_stop(wcycle, ewcPME_FFT);
 +                    wallcycle_stop(wcycle, WallCycleCounter::PmeFft);
                  }
  
                  /* solve in k-space for our local cells */
                  if (thread == 0)
                  {
 -                    wallcycle_start(wcycle, (grid_index < DO_Q ? ewcPME_SOLVE : ewcLJPME));
 +                    wallcycle_start(
 +                            wcycle,
 +                            (grid_index < DO_Q ? WallCycleCounter::PmeSolve : WallCycleCounter::LJPme));
                  }
                  if (grid_index < DO_Q)
                  {
 -                    loop_count = solve_pme_yzx(
 -                            pme, cfftgrid, scaledBox[XX][XX] * scaledBox[YY][YY] * scaledBox[ZZ][ZZ],
 -                            computeEnergyAndVirial, pme->nthread, thread);
 +                    loop_count = solve_pme_yzx(pme,
 +                                               cfftgrid,
 +                                               scaledBox[XX][XX] * scaledBox[YY][YY] * scaledBox[ZZ][ZZ],
 +                                               computeEnergyAndVirial,
 +                                               pme->nthread,
 +                                               thread);
                  }
                  else
                  {
                      loop_count =
 -                            solve_pme_lj_yzx(pme, &cfftgrid, FALSE,
 +                            solve_pme_lj_yzx(pme,
 +                                             &cfftgrid,
 +                                             FALSE,
                                               scaledBox[XX][XX] * scaledBox[YY][YY] * scaledBox[ZZ][ZZ],
 -                                             computeEnergyAndVirial, pme->nthread, thread);
 +                                             computeEnergyAndVirial,
 +                                             pme->nthread,
 +                                             thread);
                  }
  
                  if (thread == 0)
                  {
 -                    wallcycle_stop(wcycle, (grid_index < DO_Q ? ewcPME_SOLVE : ewcLJPME));
 +                    wallcycle_stop(
 +                            wcycle,
 +                            (grid_index < DO_Q ? WallCycleCounter::PmeSolve : WallCycleCounter::LJPme));
                      inc_nrnb(nrnb, eNR_SOLVEPME, loop_count);
                  }
  
                  /* do 3d-invfft */
                  if (thread == 0)
                  {
 -                    wallcycle_start(wcycle, ewcPME_FFT);
 +                    wallcycle_start(wcycle, WallCycleCounter::PmeFft);
                  }
                  gmx_parallel_3dfft_execute(pfft_setup, GMX_FFT_COMPLEX_TO_REAL, thread, wcycle);
                  if (thread == 0)
                  {
 -                    wallcycle_stop(wcycle, ewcPME_FFT);
 +                    wallcycle_stop(wcycle, WallCycleCounter::PmeFft);
  
  
                      if (pme->nodeid == 0)
                      {
 -                        real ntot = pme->nkx * pme->nky * pme->nkz;
 -                        npme      = static_cast<int>(ntot * std::log(ntot) / std::log(2.0));
 +                        real      ntot = pme->nkx * pme->nky * pme->nkz;
 +                        const int npme = static_cast<int>(ntot * std::log(ntot) / std::log(2.0));
                          inc_nrnb(nrnb, eNR_FFT, 2 * npme);
                      }
  
                      /* Note: this wallcycle region is closed below
                         outside an OpenMP region, so take care if
                         refactoring code here. */
 -                    wallcycle_start(wcycle, ewcPME_GATHER);
 +                    wallcycle_start(wcycle, WallCycleCounter::PmeGather);
                  }
  
                  copy_fftgrid_to_pmegrid(pme, fftgrid, grid, grid_index, pme->nthread, thread);
               * atc->f is the actual force array, not a buffer,
               * therefore we should not clear it.
               */
 -            lambda  = grid_index < DO_Q ? lambda_q : lambda_lj;
 -            bClearF = (bFirst && PAR(cr));
 +            real lambda = grid_index < DO_Q ? lambda_q : lambda_lj;
 +            bClearF     = (bFirst && PAR(cr));
  #pragma omp parallel for num_threads(pme->nthread) schedule(static)
              for (thread = 0; thread < pme->nthread; thread++)
              {
                  try
                  {
 -                    gather_f_bsplines(pme, grid, bClearF, &atc, &atc.spline[thread],
 +                    gather_f_bsplines(pme,
 +                                      grid,
 +                                      bClearF,
 +                                      &atc,
 +                                      &atc.spline[thread],
                                        pme->bFEP ? (grid_index % 2 == 0 ? 1.0 - lambda : lambda) : 1.0);
                  }
                  GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
              }
  
  
 -            inc_nrnb(nrnb, eNR_GATHERFBSP,
 -                     pme->pme_order * pme->pme_order * pme->pme_order * atc.numAtoms());
 +            inc_nrnb(nrnb, eNR_GATHERFBSP, pme->pme_order * pme->pme_order * pme->pme_order * atc.numAtoms());
              /* Note: this wallcycle region is opened above inside an OpenMP
                 region, so take care if refactoring code here. */
 -            wallcycle_stop(wcycle, ewcPME_GATHER);
 +            wallcycle_stop(wcycle, WallCycleCounter::PmeGather);
          }
  
          if (computeEnergyAndVirial)
      /* For Lorentz-Berthelot combination rules in LJ-PME, we need to calculate
       * seven terms. */
  
 -    if (pme->doLJ && pme->ljpme_combination_rule == eljpmeLB)
 +    if (pme->doLJ && pme->ljpme_combination_rule == LongRangeVdW::LB)
      {
          /* Loop over A- and B-state if we are doing FEP */
 -        for (fep_state = 0; fep_state < fep_states_lj; ++fep_state)
 +        for (int fep_state = 0; fep_state < fep_states_lj; ++fep_state)
          {
 -            real *local_c6 = nullptr, *local_sigma = nullptr, *RedistC6 = nullptr, *RedistSigma = nullptr;
 -            gmx::ArrayRef<real> coefficientBuffer;
 +            std::vector<real>         local_c6;
 +            std::vector<real>         local_sigma;
 +            gmx::ArrayRef<const real> RedistC6;
 +            gmx::ArrayRef<const real> RedistSigma;
 +            gmx::ArrayRef<real>       coefficientBuffer;
              if (pme->nnodes == 1)
              {
                  pme->lb_buf1.resize(atc.numAtoms());
                  switch (fep_state)
                  {
                      case 0:
 -                        local_c6    = c6A;
 -                        local_sigma = sigmaA;
 +                        local_c6.assign(c6A.begin(), c6A.end());
 +                        local_sigma.assign(sigmaA.begin(), sigmaA.end());
                          break;
                      case 1:
 -                        local_c6    = c6B;
 -                        local_sigma = sigmaB;
 +                        local_c6.assign(c6B.begin(), c6B.end());
 +                        local_sigma.assign(sigmaB.begin(), sigmaB.end());
                          break;
                      default: gmx_incons("Trying to access wrong FEP-state in LJ-PME routine");
                  }
                          break;
                      default: gmx_incons("Trying to access wrong FEP-state in LJ-PME routine");
                  }
 -                wallcycle_start(wcycle, ewcPME_REDISTXF);
 +                wallcycle_start(wcycle, WallCycleCounter::PmeRedistXF);
  
                  do_redist_pos_coeffs(pme, cr, bFirst, coordinates, RedistC6);
                  pme->lb_buf1.resize(atc.numAtoms());
                  pme->lb_buf2.resize(atc.numAtoms());
 -                local_c6 = pme->lb_buf1.data();
 +                local_c6.assign(pme->lb_buf1.begin(), pme->lb_buf1.end());
                  for (int i = 0; i < atc.numAtoms(); ++i)
                  {
                      local_c6[i] = atc.coefficient[i];
                  }
  
                  do_redist_pos_coeffs(pme, cr, FALSE, coordinates, RedistSigma);
 -                local_sigma = pme->lb_buf2.data();
 +                local_sigma.assign(pme->lb_buf2.begin(), pme->lb_buf2.end());
                  for (int i = 0; i < atc.numAtoms(); ++i)
                  {
                      local_sigma[i] = atc.coefficient[i];
                  }
  
 -                wallcycle_stop(wcycle, ewcPME_REDISTXF);
 +                wallcycle_stop(wcycle, WallCycleCounter::PmeRedistXF);
              }
              atc.coefficient = coefficientBuffer;
              calc_initial_lb_coeffs(coefficientBuffer, local_c6, local_sigma);
  
              /*Seven terms in LJ-PME with LB, grid_index < 2 reserved for electrostatics*/
 -            for (grid_index = 2; grid_index < 9; ++grid_index)
 +            for (int grid_index = 2; grid_index < 9; ++grid_index)
              {
                  /* Unpack structure */
                  pmegrid    = &pme->pmegrid[grid_index];
                  calc_next_lb_coeffs(coefficientBuffer, local_sigma);
                  grid = pmegrid->grid.grid;
  
 -                wallcycle_start(wcycle, ewcPME_SPREAD);
 +                wallcycle_start(wcycle, WallCycleCounter::PmeSpread);
                  /* Spread the c6 on a grid */
                  spread_on_grid(pme, &atc, pmegrid, bFirst, TRUE, fftgrid, bDoSplines, grid_index);
  
                      inc_nrnb(nrnb, eNR_WEIGHTS, DIM * atc.numAtoms());
                  }
  
 -                inc_nrnb(nrnb, eNR_SPREADBSP,
 +                inc_nrnb(nrnb,
 +                         eNR_SPREADBSP,
                           pme->pme_order * pme->pme_order * pme->pme_order * atc.numAtoms());
                  if (pme->nthread == 1)
                  {
                      }
                      copy_pmegrid_to_fftgrid(pme, grid, fftgrid, grid_index);
                  }
 -                wallcycle_stop(wcycle, ewcPME_SPREAD);
 +                wallcycle_stop(wcycle, WallCycleCounter::PmeSpread);
  
                  /*Here we start a large thread parallel region*/
  #pragma omp parallel num_threads(pme->nthread) private(thread)
                          /* do 3d-fft */
                          if (thread == 0)
                          {
 -                            wallcycle_start(wcycle, ewcPME_FFT);
 +                            wallcycle_start(wcycle, WallCycleCounter::PmeFft);
                          }
  
                          gmx_parallel_3dfft_execute(pfft_setup, GMX_FFT_REAL_TO_COMPLEX, thread, wcycle);
                          if (thread == 0)
                          {
 -                            wallcycle_stop(wcycle, ewcPME_FFT);
 +                            wallcycle_stop(wcycle, WallCycleCounter::PmeFft);
                          }
                      }
                      GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
                      thread = gmx_omp_get_thread_num();
                      if (thread == 0)
                      {
 -                        wallcycle_start(wcycle, ewcLJPME);
 +                        wallcycle_start(wcycle, WallCycleCounter::LJPme);
                      }
  
                      loop_count =
 -                            solve_pme_lj_yzx(pme, &pme->cfftgrid[2], TRUE,
 +                            solve_pme_lj_yzx(pme,
 +                                             &pme->cfftgrid[2],
 +                                             TRUE,
                                               scaledBox[XX][XX] * scaledBox[YY][YY] * scaledBox[ZZ][ZZ],
 -                                             computeEnergyAndVirial, pme->nthread, thread);
 +                                             computeEnergyAndVirial,
 +                                             pme->nthread,
 +                                             thread);
                      if (thread == 0)
                      {
 -                        wallcycle_stop(wcycle, ewcLJPME);
 +                        wallcycle_stop(wcycle, WallCycleCounter::LJPme);
                          inc_nrnb(nrnb, eNR_SOLVEPME, loop_count);
                      }
                  }
  
              bFirst = !pme->doCoulomb;
              calc_initial_lb_coeffs(coefficientBuffer, local_c6, local_sigma);
 -            for (grid_index = 8; grid_index >= 2; --grid_index)
 +            for (int grid_index = 8; grid_index >= 2; --grid_index)
              {
                  /* Unpack structure */
                  pmegrid    = &pme->pmegrid[grid_index];
                          /* do 3d-invfft */
                          if (thread == 0)
                          {
 -                            wallcycle_start(wcycle, ewcPME_FFT);
 +                            wallcycle_start(wcycle, WallCycleCounter::PmeFft);
                          }
  
                          gmx_parallel_3dfft_execute(pfft_setup, GMX_FFT_COMPLEX_TO_REAL, thread, wcycle);
                          if (thread == 0)
                          {
 -                            wallcycle_stop(wcycle, ewcPME_FFT);
 +                            wallcycle_stop(wcycle, WallCycleCounter::PmeFft);
  
  
                              if (pme->nodeid == 0)
                              {
 -                                real ntot = pme->nkx * pme->nky * pme->nkz;
 -                                npme      = static_cast<int>(ntot * std::log(ntot) / std::log(2.0));
 +                                real      ntot = pme->nkx * pme->nky * pme->nkz;
 +                                const int npme = static_cast<int>(ntot * std::log(ntot) / std::log(2.0));
                                  inc_nrnb(nrnb, eNR_FFT, 2 * npme);
                              }
 -                            wallcycle_start(wcycle, ewcPME_GATHER);
 +                            wallcycle_start(wcycle, WallCycleCounter::PmeGather);
                          }
  
                          copy_fftgrid_to_pmegrid(pme, fftgrid, grid, grid_index, pme->nthread, thread);
                  if (stepWork.computeForces)
                  {
                      /* interpolate forces for our local atoms */
 -                    bClearF = (bFirst && PAR(cr));
 -                    scale   = pme->bFEP ? (fep_state < 1 ? 1.0 - lambda_lj : lambda_lj) : 1.0;
 +                    bClearF    = (bFirst && PAR(cr));
 +                    real scale = pme->bFEP ? (fep_state < 1 ? 1.0 - lambda_lj : lambda_lj) : 1.0;
                      scale *= lb_scale_factor[grid_index - 2];
  
  #pragma omp parallel for num_threads(pme->nthread) schedule(static)
                      {
                          try
                          {
 -                            gather_f_bsplines(pme, grid, bClearF, &pme->atc[0],
 -                                              &pme->atc[0].spline[thread], scale);
 +                            gather_f_bsplines(
 +                                    pme, grid, bClearF, &pme->atc[0], &pme->atc[0].spline[thread], scale);
                          }
                          GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
                      }
  
  
 -                    inc_nrnb(nrnb, eNR_GATHERFBSP,
 +                    inc_nrnb(nrnb,
 +                             eNR_GATHERFBSP,
                               pme->pme_order * pme->pme_order * pme->pme_order * pme->atc[0].numAtoms());
                  }
 -                wallcycle_stop(wcycle, ewcPME_GATHER);
 +                wallcycle_stop(wcycle, WallCycleCounter::PmeGather);
  
                  bFirst = FALSE;
              } /* for (grid_index = 8; grid_index >= 2; --grid_index) */
          }     /* for (fep_state = 0; fep_state < fep_states_lj; ++fep_state) */
 -    }         /* if (pme->doLJ && pme->ljpme_combination_rule == eljpmeLB) */
 +    }         /* if (pme->doLJ && pme->ljpme_combination_rule == LongRangeVdW::LB) */
  
      if (stepWork.computeForces && pme->nnodes > 1)
      {
 -        wallcycle_start(wcycle, ewcPME_REDISTXF);
 -        for (d = 0; d < pme->ndecompdim; d++)
 +        wallcycle_start(wcycle, WallCycleCounter::PmeRedistXF);
 +        for (int d = 0; d < pme->ndecompdim; d++)
          {
              gmx::ArrayRef<gmx::RVec> forcesRef;
              if (d == pme->ndecompdim - 1)
              {
                  forcesRef = pme->atc[d + 1].f;
              }
 -            if (DOMAINDECOMP(cr))
 +            if (haveDDAtomOrdering(*cr))
              {
                  dd_pmeredist_f(pme, &pme->atc[d], forcesRef, d == pme->ndecompdim - 1 && pme->bPPnode);
              }
          }
  
 -        wallcycle_stop(wcycle, ewcPME_REDISTXF);
 +        wallcycle_stop(wcycle, WallCycleCounter::PmeRedistXF);
      }
  
      if (computeEnergyAndVirial)
                      }
                  }
              }
 -            if (debug)
 -            {
 -                fprintf(debug, "Electrostatic PME mesh energy: %g\n", *energy_q);
 -            }
          }
          else
          {
                      }
                  }
              }
 -            if (debug)
 -            {
 -                fprintf(debug, "Lennard-Jones PME mesh energy: %g\n", *energy_lj);
 -            }
          }
          else
          {
@@@ -1655,6 -1660,8 +1669,6 @@@ void gmx_pme_destroy(gmx_pme_t* pme
          return;
      }
  
 -    delete pme->boxScaler;
 -
      sfree(pme->nnx);
      sfree(pme->nny);
      sfree(pme->nnz);
          pme_free_all_work(&pme->solve_work, pme->nthread);
      }
  
 -    sfree(pme->sum_qgrid_tmp);
 -    sfree(pme->sum_qgrid_dd_tmp);
 -
      destroy_pme_spline_work(pme->spline_work);
  
      if (pme->gpu != nullptr)
      delete pme;
  }
  
 -void gmx_pme_reinit_atoms(gmx_pme_t* pme, const int numAtoms, const real* chargesA, const real* chargesB)
 +void gmx_pme_reinit_atoms(gmx_pme_t*                pme,
 +                          const int                 numAtoms,
 +                          gmx::ArrayRef<const real> chargesA,
 +                          gmx::ArrayRef<const real> chargesB)
  {
      if (pme->gpu != nullptr)
      {
 -        GMX_ASSERT(!(pme->bFEP_q && chargesB == nullptr),
 +        GMX_ASSERT(!(pme->bFEP_q && chargesB.empty()),
                     "B state charges must be specified if running Coulomb FEP on the GPU");
 -        pme_gpu_reinit_atoms(pme->gpu, numAtoms, chargesA, pme->bFEP_q ? chargesB : nullptr);
 +        pme_gpu_reinit_atoms(pme->gpu, numAtoms, chargesA.data(), pme->bFEP_q ? chargesB.data() : nullptr);
      }
      else
      {
@@@ -1722,23 -1729,3 +1736,23 @@@ bool gmx_pme_grid_matches(const gmx_pme
  {
      return (pme.nkx == grid_size[XX] && pme.nky == grid_size[YY] && pme.nkz == grid_size[ZZ]);
  }
 +
 +void gmx::SeparatePmeRanksPermitted::disablePmeRanks(const std::string& reason)
 +{
 +    permitSeparatePmeRanks_ = false;
 +
 +    if (!reason.empty())
 +    {
 +        reasons_.push_back(reason);
 +    }
 +}
 +
 +bool gmx::SeparatePmeRanksPermitted::permitSeparatePmeRanks() const
 +{
 +    return permitSeparatePmeRanks_;
 +}
 +
 +std::string gmx::SeparatePmeRanksPermitted::reasonsWhyDisabled() const
 +{
 +    return joinStrings(reasons_, "; ");
 +}
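 +
 +// Hypothetical usage sketch (illustration only; 'mdlog' and the triggering
 +// condition are assumed caller-side details):
 +//
 +//     gmx::SeparatePmeRanksPermitted separatePmeRanksPermitted;
 +//     separatePmeRanksPermitted.disablePmeRanks("feature X is not compatible with separate PME ranks");
 +//     if (!separatePmeRanksPermitted.permitSeparatePmeRanks())
 +//     {
 +//         GMX_LOG(mdlog.warning)
 +//                 .asParagraph()
 +//                 .appendText(separatePmeRanksPermitted.reasonsWhyDisabled());
 +//     }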
diff --combined src/gromacs/ewald/pme.h
index 03d4da52c07097da2a8521823638d58f99902045,3ac892b3ac786583d5e2a06e633f9c6119e7273f..eb0c5651753535db601959d2101f7f8e12669ebd
  #define GMX_EWALD_PME_H
  
  #include <string>
 +#include <vector>
  
  #include "gromacs/gpu_utils/devicebuffer_datatype.h"
  #include "gromacs/gpu_utils/gpu_macros.h"
  #include "gromacs/math/vectypes.h"
 -#include "gromacs/utility/basedefinitions.h"
  #include "gromacs/utility/real.h"
  
  struct gmx_hw_info_t;
@@@ -75,19 -75,6 +75,19 @@@ enum class GpuTaskCompletion
  class PmeGpuProgram;
  class GpuEventSynchronizer;
  
 +/*! \brief Hack to selectively enable some parts of PME during unit testing.
 + *
 + * Set to \c false by default. If any of the tests sets it to \c true, it will
 + * make the compatibility check consider PME to be supported in SYCL builds.
 + *
 + * Currently we don't have proper PME implementation with SYCL, but we still want
 + * to run tests for some of the kernels.
 + *
 + * \todo Remove after #3927 is done and PME is fully enabled in SYCL builds.
 + */
 +//NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
 +extern bool g_allowPmeWithSyclForTesting;
 +
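 +// For example, a test might flip the flag before calling the support checks
 +// (hypothetical sketch):
 +//
 +//     g_allowPmeWithSyclForTesting = true;
 +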
  namespace gmx
  {
  template<typename>
@@@ -96,30 -83,6 +96,30 @@@ class ForceWithVirial
  class MDLogger;
  enum class PinningPolicy : int;
  class StepWorkload;
 +
 +/*! \libinternal \brief Class for managing the usage of separate PME-only ranks
 + *
 + * Used to check whether any part of the setup forbids the use of separate PME-only ranks
 + *
 + */
 +class SeparatePmeRanksPermitted
 +{
 +public:
 +    //! Disables the use of separate PME ranks and records the reason
 +    void disablePmeRanks(const std::string& reason);
 +    //! Returns whether separate PME ranks are permitted
 +    bool permitSeparatePmeRanks() const;
 +    //! Returns all reasons, for not using PME ranks
 +    std::string reasonsWhyDisabled() const;
 +
 +private:
 +    //! Flag that indicates whether the simulation could use dedicated PME ranks
 +    bool permitSeparatePmeRanks_ = true;
 +    //! Storage for all reasons, why PME ranks could not be used
 +    std::vector<std::string> reasons_;
 +};
 +
 +class PmeCoordinateReceiverGpu;
  } // namespace gmx
  
  enum
@@@ -218,12 -181,12 +218,12 @@@ void gmx_pme_destroy(gmx_pme_t* pme)
  int gmx_pme_do(struct gmx_pme_t*              pme,
                 gmx::ArrayRef<const gmx::RVec> coordinates,
                 gmx::ArrayRef<gmx::RVec>       forces,
 -               real                           chargeA[],
 -               real                           chargeB[],
 -               real                           c6A[],
 -               real                           c6B[],
 -               real                           sigmaA[],
 -               real                           sigmaB[],
 +               gmx::ArrayRef<const real>      chargeA,
 +               gmx::ArrayRef<const real>      chargeB,
 +               gmx::ArrayRef<const real>      c6A,
 +               gmx::ArrayRef<const real>      c6B,
 +               gmx::ArrayRef<const real>      sigmaA,
 +               gmx::ArrayRef<const real>      sigmaB,
                 const matrix                   box,
                 const t_commrec*               cr,
                 int                            maxshift_x,
   * pme struct. Currently does not work in parallel or with free
   * energy.
   */
 -void gmx_pme_calc_energy(gmx_pme_t* pme, gmx::ArrayRef<const gmx::RVec> x, gmx::ArrayRef<const real> q, real* V);
 +real gmx_pme_calc_energy(gmx_pme_t* pme, gmx::ArrayRef<const gmx::RVec> x, gmx::ArrayRef<const real> q);
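
Call sites change accordingly (sketch)::

    // Before: real V; gmx_pme_calc_energy(pme, x, q, &V);
    const real coulombEnergy = gmx_pme_calc_energy(pme, x, q);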
  
  /*! \brief
   * This function updates the local atom data on GPU after DD (charges, coordinates, etc.).
   * \param[in]     chargesB   The array of particle charges in state B. Only used if
   * charges are perturbed and can otherwise be empty.
   */
 -void gmx_pme_reinit_atoms(gmx_pme_t* pme, int numAtoms, const real* chargesA, const real* chargesB);
 +void gmx_pme_reinit_atoms(gmx_pme_t*                pme,
 +                          int                       numAtoms,
 +                          gmx::ArrayRef<const real> chargesA,
 +                          gmx::ArrayRef<const real> chargesB);
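
With ArrayRef parameters, an absent B-state charge set is an empty view rather than a null pointer (sketch; haveFepCharges is illustrative)::

    gmx_pme_reinit_atoms(pme, numAtoms, chargesA,
                         haveFepCharges ? chargesB : gmx::ArrayRef<const real>{});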
  
  /* A block of PME GPU functions */
  
@@@ -300,6 -260,17 +300,17 @@@ bool pme_gpu_supports_hardware(const gm
   */
  bool pme_gpu_supports_input(const t_inputrec& ir, std::string* error);
  
+ /*! \brief Checks whether the input system allows running PME on a GPU in Mixed mode.
+  * Assumes that the input system is otherwise compatible with GPU PME, that is,
+  * before calling this function one should check that \ref pme_gpu_supports_input returns \c true.
+  *
+  * \param[in]  ir     Input system.
+  * \param[out] error  If non-null, the error message if the input is not supported.
+  *
+  * \returns true if PME can run on GPU in Mixed mode with this input, false otherwise.
+  */
+ bool pme_gpu_mixed_mode_supports_input(const t_inputrec& ir, std::string* error);
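
Per the documented contract, callers check full GPU-PME support before the Mixed-mode check (sketch)::

    std::string error;
    if (pme_gpu_supports_input(ir, &error) && pme_gpu_mixed_mode_supports_input(ir, &error))
    {
        // Mixed mode (GPU spread/gather, CPU FFT and solve) is usable for this input
    }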
  /*! \brief
   * Returns the active PME codepath (CPU, GPU, mixed).
   * \todo This is rather static data that should be managed by the higher-level task scheduler.
@@@ -374,24 -345,17 +385,24 @@@ GPU_FUNC_QUALIFIER void pme_gpu_prepare
  /*! \brief
   * Launches first stage of PME on GPU - spreading kernel.
   *
 - * \param[in] pme                The PME data structure.
 - * \param[in] xReadyOnDevice     Event synchronizer indicating that the coordinates
 - * are ready in the device memory; nullptr allowed only on separate PME ranks.
 - * \param[in] wcycle             The wallclock counter.
 - * \param[in] lambdaQ            The Coulomb lambda of the current state of the
 - * system. Only used if FEP of Coulomb is active.
 + * \param[in] pme                            The PME data structure.
 + * \param[in] xReadyOnDevice                 Event synchronizer indicating that the coordinates
 + *                                           are ready in the device memory; nullptr allowed only
 + *                                           on separate PME ranks.
 + * \param[in] wcycle                         The wallclock counter.
 + * \param[in] lambdaQ                        The Coulomb lambda of the current state of the
 + *                                           system. Only used if FEP of Coulomb is active.
 + * \param[in] useGpuDirectComm               Whether direct GPU PME-PP communication is active.
 + * \param[in] pmeCoordinateReceiverGpu       Coordinate receiver object, which must be valid when
 + *                                           direct GPU PME-PP communication is active.
   */
 -GPU_FUNC_QUALIFIER void pme_gpu_launch_spread(gmx_pme_t*            GPU_FUNC_ARGUMENT(pme),
 -                                              GpuEventSynchronizer* GPU_FUNC_ARGUMENT(xReadyOnDevice),
 -                                              gmx_wallcycle*        GPU_FUNC_ARGUMENT(wcycle),
 -                                              real GPU_FUNC_ARGUMENT(lambdaQ)) GPU_FUNC_TERM;
 +GPU_FUNC_QUALIFIER void pme_gpu_launch_spread(
 +        gmx_pme_t*                     GPU_FUNC_ARGUMENT(pme),
 +        GpuEventSynchronizer*          GPU_FUNC_ARGUMENT(xReadyOnDevice),
 +        gmx_wallcycle*                 GPU_FUNC_ARGUMENT(wcycle),
 +        real                           GPU_FUNC_ARGUMENT(lambdaQ),
 +        bool                           GPU_FUNC_ARGUMENT(useGpuDirectComm),
 +        gmx::PmeCoordinateReceiverGpu* GPU_FUNC_ARGUMENT(pmeCoordinateReceiverGpu)) GPU_FUNC_TERM;
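
A call sketch for the extended launch (variable names are illustrative; the receiver only needs to be valid when direct communication is active)::

    pme_gpu_launch_spread(pme, xReadyOnDevice, wcycle, lambdaQ,
                          useGpuDirectComm,
                          useGpuDirectComm ? pmeCoordinateReceiverGpu : nullptr);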
  
  /*! \brief
   * Launches middle stages of PME (FFT R2C, solving, FFT C2R) either on GPU or on CPU, depending on the run mode.
@@@ -490,12 -454,12 +501,12 @@@ GPU_FUNC_QUALIFIER void pme_gpu_set_dev
   * \param[in] pme            The PME data structure.
   * \returns                  Pointer to force data
   */
 -GPU_FUNC_QUALIFIER void* pme_gpu_get_device_f(const gmx_pme_t* GPU_FUNC_ARGUMENT(pme))
 -        GPU_FUNC_TERM_WITH_RETURN(nullptr);
 +GPU_FUNC_QUALIFIER DeviceBuffer<gmx::RVec> pme_gpu_get_device_f(const gmx_pme_t* GPU_FUNC_ARGUMENT(pme))
 +        GPU_FUNC_TERM_WITH_RETURN(DeviceBuffer<gmx::RVec>{});
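
The stronger return type propagates to callers (sketch)::

    // Before: void* f = pme_gpu_get_device_f(pme);
    DeviceBuffer<gmx::RVec> d_forces = pme_gpu_get_device_f(pme);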
  
  /*! \brief Get pointer to the device synchronizer object that allows syncing on PME force calculation completion
   * \param[in] pme            The PME data structure.
 - * \returns                  Pointer to sychronizer
 + * \returns                  Pointer to synchronizer
   */
  GPU_FUNC_QUALIFIER GpuEventSynchronizer* pme_gpu_get_f_ready_synchronizer(const gmx_pme_t* GPU_FUNC_ARGUMENT(pme))
          GPU_FUNC_TERM_WITH_RETURN(nullptr);
index d030f9d4aba335b2e3f1b592663e9d27e320473e,f96b7eafe4849cde05797ee65f020ff627f8d1f9..c3f9edb3b27d65b522f598d0cfdac1f8fa4de1f7
@@@ -60,6 -60,7 +60,6 @@@
  #include "gromacs/math/vec.h"
  #include "gromacs/mdlib/gmx_omp_nthreads.h"
  #include "gromacs/mdtypes/commrec.h"
 -#include "gromacs/mdtypes/mdatom.h"
  #include "gromacs/pbcutil/ishift.h"
  #include "gromacs/pbcutil/pbc.h"
  #include "gromacs/timing/wallcycle.h"
@@@ -164,7 -165,7 +164,7 @@@ struct VsiteThrea
      //! The interaction lists, only vsite entries are used
      std::array<InteractionList, F_NRE> ilist;
      //! Local fshift accumulation buffer
 -    std::array<RVec, SHIFTS> fshift;
 +    std::array<RVec, c_numShiftVectors> fshift;
      //! Local virial dx*df accumulation buffer
      matrix dxdf;
      //! Tells if interdependent task idTask should be used (in addition to the rest of this task), this bool has the same value on all threads
@@@ -213,9 -214,7 +213,9 @@@ public
      //! Set VSites and distribute VSite work over threads, should be called after DD partitioning
      void setVirtualSites(ArrayRef<const InteractionList> ilist,
                           ArrayRef<const t_iparams>       iparams,
 -                         const t_mdatoms&                mdatoms,
 +                         int                             numAtoms,
 +                         int                             homenr,
 +                         ArrayRef<const ParticleType>    ptype,
                           bool                            useDomdec);
  
  private:
@@@ -233,28 -232,22 +233,28 @@@ class VirtualSitesHandler::Imp
  {
  public:
      //! Constructor, domdec should be nullptr without DD
 -    Impl(const gmx_mtop_t& mtop, gmx_domdec_t* domdec, PbcType pbcType);
 +    Impl(const gmx_mtop_t&                 mtop,
 +         gmx_domdec_t*                     domdec,
 +         PbcType                           pbcType,
 +         ArrayRef<const RangePartitioning> updateGroupingPerMoleculeType);
  
      //! Returns the number of virtual sites acting over multiple update groups
      int numInterUpdategroupVirtualSites() const { return numInterUpdategroupVirtualSites_; }
  
      //! Set VSites and distribute VSite work over threads, should be called after DD partitioning
 -    void setVirtualSites(ArrayRef<const InteractionList> ilist, const t_mdatoms& mdatoms);
 +    void setVirtualSites(ArrayRef<const InteractionList> ilist,
 +                         int                             numAtoms,
 +                         int                             homenr,
 +                         ArrayRef<const ParticleType>    ptype);
  
      /*! \brief Create positions of vsite atoms for the local system
       *
 -     * \param[in,out] x        The coordinates
 -     * \param[in]     dt       The time step
 -     * \param[in,out] v        When != nullptr, velocities for vsites are set as displacement/dt
 -     * \param[in]     box      The box
 +     * \param[in,out] x          The coordinates
 +     * \param[in,out] v          The velocities, needed if the operation requires it
 +     * \param[in]     box        The box
 +     * \param[in]     operation  Whether we calculate positions, velocities, or both
       */
 -    void construct(ArrayRef<RVec> x, real dt, ArrayRef<RVec> v, const matrix box) const;
 +    void construct(ArrayRef<RVec> x, ArrayRef<RVec> v, const matrix box, VSiteOperation operation) const;
  
      /*! \brief Spread the force operating on the vsite atoms on the surrounding atoms.
       *
@@@ -320,7 -313,7 +320,7 @@@ static int pbc_rvec_sub(const t_pbc* pb
      else
      {
          rvec_sub(xi, xj, dx);
 -        return CENTRAL;
 +        return c_centralShiftIndex;
      }
  }
  
@@@ -330,167 -323,87 +330,167 @@@ static inline real inverseNorm(const rv
      return gmx::invsqrt(iprod(x, x));
  }
  
 +//! Whether we're calculating the virtual site position
 +enum class VSiteCalculatePosition
 +{
 +    Yes,
 +    No
 +};
 +//! Whether we're calculating the virtual site velocity
 +enum class VSiteCalculateVelocity
 +{
 +    Yes,
 +    No
 +};
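
These enums select code paths at compile time; a direct instantiation would look like the sketch below (the dispatch further down uses the getVOrNull helper instead of raw nullptr)::

    constr_vsite1<VSiteCalculatePosition::Yes, VSiteCalculateVelocity::No>(
            x[ai], x[avsite], nullptr, nullptr);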
 +
  #ifndef DOXYGEN
  /* Vsite construction routines */
  
 -static void constr_vsite1(const rvec xi, rvec x)
 -{
 -    copy_rvec(xi, x);
 +// GCC 8 falsely flags unused variables if constexpr prunes a code path, fixed in GCC 9
 +// https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85827
 +// clang-format off
 +GCC_DIAGNOSTIC_IGNORE(-Wunused-but-set-parameter)
 +// clang-format on
  
 -    /* TOTAL: 0 flops */
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static void constr_vsite1(const rvec xi, rvec x, const rvec vi, rvec v)
 +{
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        copy_rvec(xi, x);
 +        /* TOTAL: 0 flops */
 +    }
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        copy_rvec(vi, v);
 +    }
  }
  
 -static void constr_vsite2(const rvec xi, const rvec xj, rvec x, real a, const t_pbc* pbc)
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static void
 +constr_vsite2(const rvec xi, const rvec xj, rvec x, real a, const t_pbc* pbc, const rvec vi, const rvec vj, rvec v)
  {
 -    real b = 1 - a;
 +    const real b = 1 - a;
      /* 1 flop */
  
 -    if (pbc)
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
      {
 -        rvec dx;
 -        pbc_dx_aiuc(pbc, xj, xi, dx);
 -        x[XX] = xi[XX] + a * dx[XX];
 -        x[YY] = xi[YY] + a * dx[YY];
 -        x[ZZ] = xi[ZZ] + a * dx[ZZ];
 +        if (pbc)
 +        {
 +            rvec dx;
 +            pbc_dx_aiuc(pbc, xj, xi, dx);
 +            x[XX] = xi[XX] + a * dx[XX];
 +            x[YY] = xi[YY] + a * dx[YY];
 +            x[ZZ] = xi[ZZ] + a * dx[ZZ];
 +        }
 +        else
 +        {
 +            x[XX] = b * xi[XX] + a * xj[XX];
 +            x[YY] = b * xi[YY] + a * xj[YY];
 +            x[ZZ] = b * xi[ZZ] + a * xj[ZZ];
 +            /* 9 Flops */
 +        }
 +        /* TOTAL: 10 flops */
      }
 -    else
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
      {
 -        x[XX] = b * xi[XX] + a * xj[XX];
 -        x[YY] = b * xi[YY] + a * xj[YY];
 -        x[ZZ] = b * xi[ZZ] + a * xj[ZZ];
 -        /* 9 Flops */
 +        v[XX] = b * vi[XX] + a * vj[XX];
 +        v[YY] = b * vi[YY] + a * vj[YY];
 +        v[ZZ] = b * vi[ZZ] + a * vj[ZZ];
      }
 -
 -    /* TOTAL: 10 flops */
  }
  
 -static void constr_vsite2FD(const rvec xi, const rvec xj, rvec x, real a, const t_pbc* pbc)
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static void
 +constr_vsite2FD(const rvec xi, const rvec xj, rvec x, real a, const t_pbc* pbc, const rvec vi, const rvec vj, rvec v)
  {
 -    rvec xij;
 +    rvec xij = { 0 };
      pbc_rvec_sub(pbc, xj, xi, xij);
      /* 3 flops */
  
 -    const real b = a * inverseNorm(xij);
 +    const real invNormXij = inverseNorm(xij);
 +    const real b          = a * invNormXij;
      /* 6 + 10 flops */
  
 -    x[XX] = xi[XX] + b * xij[XX];
 -    x[YY] = xi[YY] + b * xij[YY];
 -    x[ZZ] = xi[ZZ] + b * xij[ZZ];
 -    /* 6 Flops */
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        x[XX] = xi[XX] + b * xij[XX];
 +        x[YY] = xi[YY] + b * xij[YY];
 +        x[ZZ] = xi[ZZ] + b * xij[ZZ];
 +        /* 6 Flops */
 +        /* TOTAL: 25 flops */
 +    }
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        rvec vij = { 0 };
 +        rvec_sub(vj, vi, vij);
 +        const real vijDotXij = iprod(vij, xij);
  
 -    /* TOTAL: 25 flops */
 +        v[XX] = vi[XX] + b * (vij[XX] - xij[XX] * vijDotXij * invNormXij * invNormXij);
 +        v[YY] = vi[YY] + b * (vij[YY] - xij[YY] * vijDotXij * invNormXij * invNormXij);
 +        v[ZZ] = vi[ZZ] + b * (vij[ZZ] - xij[ZZ] * vijDotXij * invNormXij * invNormXij);
 +    }
  }
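
For reference, the velocity branch is the time derivative of the 2FD construction x = x_i + a * x_ij / |x_ij| (a consistency check, not part of the patch):

    \mathbf{v} = \mathbf{v}_i + \frac{a}{|\mathbf{x}_{ij}|}
                 \left( \mathbf{v}_{ij}
                        - \mathbf{x}_{ij} \, \frac{\mathbf{x}_{ij} \cdot \mathbf{v}_{ij}}{|\mathbf{x}_{ij}|^{2}} \right)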
  
 -static void constr_vsite3(const rvec xi, const rvec xj, const rvec xk, rvec x, real a, real b, const t_pbc* pbc)
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static void constr_vsite3(const rvec   xi,
 +                          const rvec   xj,
 +                          const rvec   xk,
 +                          rvec         x,
 +                          real         a,
 +                          real         b,
 +                          const t_pbc* pbc,
 +                          const rvec   vi,
 +                          const rvec   vj,
 +                          const rvec   vk,
 +                          rvec         v)
  {
 -    real c = 1 - a - b;
 +    const real c = 1 - a - b;
      /* 2 flops */
  
 -    if (pbc)
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
      {
 -        rvec dxj, dxk;
 +        if (pbc)
 +        {
 +            rvec dxj, dxk;
  
 -        pbc_dx_aiuc(pbc, xj, xi, dxj);
 -        pbc_dx_aiuc(pbc, xk, xi, dxk);
 -        x[XX] = xi[XX] + a * dxj[XX] + b * dxk[XX];
 -        x[YY] = xi[YY] + a * dxj[YY] + b * dxk[YY];
 -        x[ZZ] = xi[ZZ] + a * dxj[ZZ] + b * dxk[ZZ];
 +            pbc_dx_aiuc(pbc, xj, xi, dxj);
 +            pbc_dx_aiuc(pbc, xk, xi, dxk);
 +            x[XX] = xi[XX] + a * dxj[XX] + b * dxk[XX];
 +            x[YY] = xi[YY] + a * dxj[YY] + b * dxk[YY];
 +            x[ZZ] = xi[ZZ] + a * dxj[ZZ] + b * dxk[ZZ];
 +        }
 +        else
 +        {
 +            x[XX] = c * xi[XX] + a * xj[XX] + b * xk[XX];
 +            x[YY] = c * xi[YY] + a * xj[YY] + b * xk[YY];
 +            x[ZZ] = c * xi[ZZ] + a * xj[ZZ] + b * xk[ZZ];
 +            /* 15 Flops */
 +        }
 +        /* TOTAL: 17 flops */
      }
 -    else
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
      {
 -        x[XX] = c * xi[XX] + a * xj[XX] + b * xk[XX];
 -        x[YY] = c * xi[YY] + a * xj[YY] + b * xk[YY];
 -        x[ZZ] = c * xi[ZZ] + a * xj[ZZ] + b * xk[ZZ];
 -        /* 15 Flops */
 +        v[XX] = c * vi[XX] + a * vj[XX] + b * vk[XX];
 +        v[YY] = c * vi[YY] + a * vj[YY] + b * vk[YY];
 +        v[ZZ] = c * vi[ZZ] + a * vj[ZZ] + b * vk[ZZ];
      }
 -
 -    /* TOTAL: 17 flops */
  }
  
 -static void constr_vsite3FD(const rvec xi, const rvec xj, const rvec xk, rvec x, real a, real b, const t_pbc* pbc)
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static void constr_vsite3FD(const rvec   xi,
 +                            const rvec   xj,
 +                            const rvec   xk,
 +                            rvec         x,
 +                            real         a,
 +                            real         b,
 +                            const t_pbc* pbc,
 +                            const rvec   vi,
 +                            const rvec   vj,
 +                            const rvec   vk,
 +                            rvec         v)
  {
      rvec xij, xjk, temp;
 -    real c;
  
      pbc_rvec_sub(pbc, xj, xi, xij);
      pbc_rvec_sub(pbc, xk, xj, xjk);
      temp[ZZ] = xij[ZZ] + a * xjk[ZZ];
      /* 6 flops */
  
 -    c = b * inverseNorm(temp);
 +    const real invNormTemp = inverseNorm(temp);
 +    const real c           = b * invNormTemp;
      /* 6 + 10 flops */
  
 -    x[XX] = xi[XX] + c * temp[XX];
 -    x[YY] = xi[YY] + c * temp[YY];
 -    x[ZZ] = xi[ZZ] + c * temp[ZZ];
 -    /* 6 Flops */
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        x[XX] = xi[XX] + c * temp[XX];
 +        x[YY] = xi[YY] + c * temp[YY];
 +        x[ZZ] = xi[ZZ] + c * temp[ZZ];
 +        /* 6 Flops */
 +        /* TOTAL: 34 flops */
 +    }
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        rvec vij = { 0 };
 +        rvec vjk = { 0 };
 +        rvec_sub(vj, vi, vij);
 +        rvec_sub(vk, vj, vjk);
 +        const rvec tempV = { vij[XX] + a * vjk[XX], vij[YY] + a * vjk[YY], vij[ZZ] + a * vjk[ZZ] };
 +        const real tempDotTempV = iprod(temp, tempV);
  
 -    /* TOTAL: 34 flops */
 +        v[XX] = vi[XX] + c * (tempV[XX] - temp[XX] * tempDotTempV * invNormTemp * invNormTemp);
 +        v[YY] = vi[YY] + c * (tempV[YY] - temp[YY] * tempDotTempV * invNormTemp * invNormTemp);
 +        v[ZZ] = vi[ZZ] + c * (tempV[ZZ] - temp[ZZ] * tempDotTempV * invNormTemp * invNormTemp);
 +    }
  }
  
 -static void constr_vsite3FAD(const rvec xi, const rvec xj, const rvec xk, rvec x, real a, real b, const t_pbc* pbc)
 -{
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static void constr_vsite3FAD(const rvec   xi,
 +                             const rvec   xj,
 +                             const rvec   xk,
 +                             rvec         x,
 +                             real         a,
 +                             real         b,
 +                             const t_pbc* pbc,
 +                             const rvec   vi,
 +                             const rvec   vj,
 +                             const rvec   vk,
 +                             rvec         v)
 +{ // Note: a = d * cos(theta)
 +    //       b = d * sin(theta)
      rvec xij, xjk, xp;
 -    real a1, b1, c1, invdij;
  
      pbc_rvec_sub(pbc, xj, xi, xij);
      pbc_rvec_sub(pbc, xk, xj, xjk);
      /* 6 flops */
  
 -    invdij = inverseNorm(xij);
 -    c1     = invdij * invdij * iprod(xij, xjk);
 -    xp[XX] = xjk[XX] - c1 * xij[XX];
 -    xp[YY] = xjk[YY] - c1 * xij[YY];
 -    xp[ZZ] = xjk[ZZ] - c1 * xij[ZZ];
 -    a1     = a * invdij;
 -    b1     = b * inverseNorm(xp);
 +    const real invdij    = inverseNorm(xij);
 +    const real xijDotXjk = iprod(xij, xjk);
 +    const real c1        = invdij * invdij * xijDotXjk;
 +    xp[XX]               = xjk[XX] - c1 * xij[XX];
 +    xp[YY]               = xjk[YY] - c1 * xij[YY];
 +    xp[ZZ]               = xjk[ZZ] - c1 * xij[ZZ];
 +    const real a1        = a * invdij;
 +    const real invNormXp = inverseNorm(xp);
 +    const real b1        = b * invNormXp;
      /* 45 */
  
 -    x[XX] = xi[XX] + a1 * xij[XX] + b1 * xp[XX];
 -    x[YY] = xi[YY] + a1 * xij[YY] + b1 * xp[YY];
 -    x[ZZ] = xi[ZZ] + a1 * xij[ZZ] + b1 * xp[ZZ];
 -    /* 12 Flops */
 -
 -    /* TOTAL: 63 flops */
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        x[XX] = xi[XX] + a1 * xij[XX] + b1 * xp[XX];
 +        x[YY] = xi[YY] + a1 * xij[YY] + b1 * xp[YY];
 +        x[ZZ] = xi[ZZ] + a1 * xij[ZZ] + b1 * xp[ZZ];
 +        /* 12 Flops */
 +        /* TOTAL: 63 flops */
 +    }
 +
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        rvec vij = { 0 };
 +        rvec vjk = { 0 };
 +        rvec_sub(vj, vi, vij);
 +        rvec_sub(vk, vj, vjk);
 +
 +        const real vijDotXjkPlusXijDotVjk = iprod(vij, xjk) + iprod(xij, vjk);
 +        const real xijDotVij              = iprod(xij, vij);
 +        const real invNormXij2            = invdij * invdij;
 +
 +        rvec vp = { 0 };
 +        vp[XX]  = vjk[XX]
 +                 - xij[XX] * invNormXij2
 +                           * (vijDotXjkPlusXijDotVjk - invNormXij2 * xijDotXjk * xijDotVij * 2)
 +                 - vij[XX] * xijDotXjk * invNormXij2;
 +        vp[YY] = vjk[YY]
 +                 - xij[YY] * invNormXij2
 +                           * (vijDotXjkPlusXijDotVjk - invNormXij2 * xijDotXjk * xijDotVij * 2)
 +                 - vij[YY] * xijDotXjk * invNormXij2;
 +        vp[ZZ] = vjk[ZZ]
 +                 - xij[ZZ] * invNormXij2
 +                           * (vijDotXjkPlusXijDotVjk - invNormXij2 * xijDotXjk * xijDotVij * 2)
 +                 - vij[ZZ] * xijDotXjk * invNormXij2;
 +
 +        const real xpDotVp = iprod(xp, vp);
 +
 +        v[XX] = vi[XX] + a1 * (vij[XX] - xij[XX] * xijDotVij * invdij * invdij)
 +                + b1 * (vp[XX] - xp[XX] * xpDotVp * invNormXp * invNormXp);
 +        v[YY] = vi[YY] + a1 * (vij[YY] - xij[YY] * xijDotVij * invdij * invdij)
 +                + b1 * (vp[YY] - xp[YY] * xpDotVp * invNormXp * invNormXp);
 +        v[ZZ] = vi[ZZ] + a1 * (vij[ZZ] - xij[ZZ] * xijDotVij * invdij * invdij)
 +                + b1 * (vp[ZZ] - xp[ZZ] * xpDotVp * invNormXp * invNormXp);
 +    }
  }
  
 -static void
 -constr_vsite3OUT(const rvec xi, const rvec xj, const rvec xk, rvec x, real a, real b, real c, const t_pbc* pbc)
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static void constr_vsite3OUT(const rvec   xi,
 +                             const rvec   xj,
 +                             const rvec   xk,
 +                             rvec         x,
 +                             real         a,
 +                             real         b,
 +                             real         c,
 +                             const t_pbc* pbc,
 +                             const rvec   vi,
 +                             const rvec   vj,
 +                             const rvec   vk,
 +                             rvec         v)
  {
      rvec xij, xik, temp;
  
      cprod(xij, xik, temp);
      /* 15 Flops */
  
 -    x[XX] = xi[XX] + a * xij[XX] + b * xik[XX] + c * temp[XX];
 -    x[YY] = xi[YY] + a * xij[YY] + b * xik[YY] + c * temp[YY];
 -    x[ZZ] = xi[ZZ] + a * xij[ZZ] + b * xik[ZZ] + c * temp[ZZ];
 -    /* 18 Flops */
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        x[XX] = xi[XX] + a * xij[XX] + b * xik[XX] + c * temp[XX];
 +        x[YY] = xi[YY] + a * xij[YY] + b * xik[YY] + c * temp[YY];
 +        x[ZZ] = xi[ZZ] + a * xij[ZZ] + b * xik[ZZ] + c * temp[ZZ];
 +        /* 18 Flops */
 +        /* TOTAL: 33 flops */
 +    }
 +
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        rvec vij = { 0 };
 +        rvec vik = { 0 };
 +        rvec_sub(vj, vi, vij);
 +        rvec_sub(vk, vi, vik);
 +
 +        rvec temp1 = { 0 };
 +        rvec temp2 = { 0 };
 +        cprod(vij, xik, temp1);
 +        cprod(xij, vik, temp2);
  
 -    /* TOTAL: 33 flops */
 +        v[XX] = vi[XX] + a * vij[XX] + b * vik[XX] + c * (temp1[XX] + temp2[XX]);
 +        v[YY] = vi[YY] + a * vij[YY] + b * vik[YY] + c * (temp1[YY] + temp2[YY]);
 +        v[ZZ] = vi[ZZ] + a * vij[ZZ] + b * vik[ZZ] + c * (temp1[ZZ] + temp2[ZZ]);
 +    }
  }
  
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
  static void constr_vsite4FD(const rvec   xi,
                              const rvec   xj,
                              const rvec   xk,
                              real         a,
                              real         b,
                              real         c,
 -                            const t_pbc* pbc)
 +                            const t_pbc* pbc,
 +                            const rvec   vi,
 +                            const rvec   vj,
 +                            const rvec   vk,
 +                            const rvec   vl,
 +                            rvec         v)
  {
      rvec xij, xjk, xjl, temp;
      real d;
      temp[ZZ] = xij[ZZ] + a * xjk[ZZ] + b * xjl[ZZ];
      /* 12 flops */
  
 -    d = c * inverseNorm(temp);
 +    const real invRm = inverseNorm(temp);
 +    d                = c * invRm;
      /* 6 + 10 flops */
  
 -    x[XX] = xi[XX] + d * temp[XX];
 -    x[YY] = xi[YY] + d * temp[YY];
 -    x[ZZ] = xi[ZZ] + d * temp[ZZ];
 -    /* 6 Flops */
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        x[XX] = xi[XX] + d * temp[XX];
 +        x[YY] = xi[YY] + d * temp[YY];
 +        x[ZZ] = xi[ZZ] + d * temp[ZZ];
 +        /* 6 Flops */
 +        /* TOTAL: 43 flops */
 +    }
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        rvec vij = { 0 };
 +        rvec vjk = { 0 };
 +        rvec vjl = { 0 };
 +
 +        rvec_sub(vj, vi, vij);
 +        rvec_sub(vk, vj, vjk);
 +        rvec_sub(vl, vj, vjl);
  
 -    /* TOTAL: 43 flops */
 +        rvec vm = { 0 };
 +        vm[XX]  = vij[XX] + a * vjk[XX] + b * vjl[XX];
 +        vm[YY]  = vij[YY] + a * vjk[YY] + b * vjl[YY];
 +        vm[ZZ]  = vij[ZZ] + a * vjk[ZZ] + b * vjl[ZZ];
 +
 +        const real vmDotRm = iprod(vm, temp);
 +        v[XX]              = vi[XX] + d * (vm[XX] - temp[XX] * vmDotRm * invRm * invRm);
 +        v[YY]              = vi[YY] + d * (vm[YY] - temp[YY] * vmDotRm * invRm * invRm);
 +        v[ZZ]              = vi[ZZ] + d * (vm[ZZ] - temp[ZZ] * vmDotRm * invRm * invRm);
 +    }
  }
  
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
  static void constr_vsite4FDN(const rvec   xi,
                               const rvec   xj,
                               const rvec   xk,
                               real         a,
                               real         b,
                               real         c,
 -                             const t_pbc* pbc)
 +                             const t_pbc* pbc,
 +                             const rvec   vi,
 +                             const rvec   vj,
 +                             const rvec   vk,
 +                             const rvec   vl,
 +                             rvec         v)
  {
      rvec xij, xik, xil, ra, rb, rja, rjb, rm;
      real d;
      cprod(rja, rjb, rm);
      /* 9 flops */
  
 -    d = c * inverseNorm(rm);
 +    const real invNormRm = inverseNorm(rm);
 +    d                    = c * invNormRm;
      /* 5+5+1 flops */
  
 -    x[XX] = xi[XX] + d * rm[XX];
 -    x[YY] = xi[YY] + d * rm[YY];
 -    x[ZZ] = xi[ZZ] + d * rm[ZZ];
 -    /* 6 Flops */
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        x[XX] = xi[XX] + d * rm[XX];
 +        x[YY] = xi[YY] + d * rm[YY];
 +        x[ZZ] = xi[ZZ] + d * rm[ZZ];
 +        /* 6 Flops */
 +        /* TOTAL: 47 flops */
 +    }
 +
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        rvec vij = { 0 };
 +        rvec vik = { 0 };
 +        rvec vil = { 0 };
 +        rvec_sub(vj, vi, vij);
 +        rvec_sub(vk, vi, vik);
 +        rvec_sub(vl, vi, vil);
 +
 +        rvec vja = { 0 };
 +        rvec vjb = { 0 };
 +
 +        vja[XX] = a * vik[XX] - vij[XX];
 +        vja[YY] = a * vik[YY] - vij[YY];
 +        vja[ZZ] = a * vik[ZZ] - vij[ZZ];
 +        vjb[XX] = b * vil[XX] - vij[XX];
 +        vjb[YY] = b * vil[YY] - vij[YY];
 +        vjb[ZZ] = b * vil[ZZ] - vij[ZZ];
 +
 +        rvec temp1 = { 0 };
 +        rvec temp2 = { 0 };
 +        cprod(vja, rjb, temp1);
 +        cprod(rja, vjb, temp2);
  
 -    /* TOTAL: 47 flops */
 +        rvec vm = { 0 };
 +        vm[XX]  = temp1[XX] + temp2[XX];
 +        vm[YY]  = temp1[YY] + temp2[YY];
 +        vm[ZZ]  = temp1[ZZ] + temp2[ZZ];
 +
 +        const real rmDotVm = iprod(rm, vm);
 +        v[XX]              = vi[XX] + d * (vm[XX] - rm[XX] * rmDotVm * invNormRm * invNormRm);
 +        v[YY]              = vi[YY] + d * (vm[YY] - rm[YY] * rmDotVm * invNormRm * invNormRm);
 +        v[ZZ]              = vi[ZZ] + d * (vm[ZZ] - rm[ZZ] * rmDotVm * invNormRm * invNormRm);
 +    }
  }
  
 -static int constr_vsiten(const t_iatom* ia, ArrayRef<const t_iparams> ip, ArrayRef<RVec> x, const t_pbc* pbc)
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
 +static int constr_vsiten(const t_iatom*            ia,
 +                         ArrayRef<const t_iparams> ip,
 +                         ArrayRef<RVec>            x,
 +                         const t_pbc*              pbc,
 +                         ArrayRef<RVec>            v)
  {
      rvec x1, dx;
      dvec dsum;
 -    int  n3, av, ai;
      real a;
 +    dvec dvsum = { 0 };
 +    rvec v1    = { 0 };
  
 -    n3 = 3 * ip[ia[0]].vsiten.n;
 -    av = ia[1];
 -    ai = ia[2];
 +    const int n3 = 3 * ip[ia[0]].vsiten.n;
 +    const int av = ia[1];
 +    int       ai = ia[2];
      copy_rvec(x[ai], x1);
 +    copy_rvec(v[ai], v1);
      clear_dvec(dsum);
      for (int i = 3; i < n3; i += 3)
      {
          ai = ia[i + 2];
          a  = ip[ia[i]].vsiten.a;
 -        if (pbc)
 +        if (calculatePosition == VSiteCalculatePosition::Yes)
          {
 -            pbc_dx_aiuc(pbc, x[ai], x1, dx);
 +            if (pbc)
 +            {
 +                pbc_dx_aiuc(pbc, x[ai], x1, dx);
 +            }
 +            else
 +            {
 +                rvec_sub(x[ai], x1, dx);
 +            }
 +            dsum[XX] += a * dx[XX];
 +            dsum[YY] += a * dx[YY];
 +            dsum[ZZ] += a * dx[ZZ];
 +            /* 9 Flops */
          }
 -        else
 +        if (calculateVelocity == VSiteCalculateVelocity::Yes)
          {
 -            rvec_sub(x[ai], x1, dx);
 +            rvec_sub(v[ai], v1, dx);
 +            dvsum[XX] += a * dx[XX];
 +            dvsum[YY] += a * dx[YY];
 +            dvsum[ZZ] += a * dx[ZZ];
 +            /* 9 Flops */
          }
 -        dsum[XX] += a * dx[XX];
 -        dsum[YY] += a * dx[YY];
 -        dsum[ZZ] += a * dx[ZZ];
 -        /* 9 Flops */
      }
  
 -    x[av][XX] = x1[XX] + dsum[XX];
 -    x[av][YY] = x1[YY] + dsum[YY];
 -    x[av][ZZ] = x1[ZZ] + dsum[ZZ];
 +    if (calculatePosition == VSiteCalculatePosition::Yes)
 +    {
 +        x[av][XX] = x1[XX] + dsum[XX];
 +        x[av][YY] = x1[YY] + dsum[YY];
 +        x[av][ZZ] = x1[ZZ] + dsum[ZZ];
 +    }
 +
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        v[av][XX] = v1[XX] + dvsum[XX];
 +        v[av][YY] = v1[YY] + dvsum[YY];
 +        v[av][ZZ] = v1[ZZ] + dvsum[ZZ];
 +    }
  
      return n3;
  }
 +// End GCC 8 bug
 +GCC_DIAGNOSTIC_RESET
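
Because the vsiten position is the affine combination x_av = x_1 + sum_i a_i (x_i - x_1), its time derivative is the same combination of velocities, which is why dvsum mirrors dsum above:

    \mathbf{v}_{av} = \mathbf{v}_1 + \sum_i a_i \, (\mathbf{v}_i - \mathbf{v}_1)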
  
  #endif // DOXYGEN
  
@@@ -901,49 -615,29 +901,49 @@@ static PbcMode getPbcMode(const t_pbc* 
  
  /*! \brief Executes the vsite construction task for a single thread
   *
 + * \tparam        calculatePosition  Whether we are calculating vsite positions
 + * \tparam        calculateVelocity  Whether we are calculating vsite velocities
   * \param[in,out] x   Coordinates to construct vsites for
 - * \param[in]     dt  Time step, needed when v is not empty
 - * \param[in,out] v   When not empty, velocities are generated for virtual sites
 + * \param[in,out] v   Velocities are generated for virtual sites when \p calculateVelocity is Yes
   * \param[in]     ip  Interaction parameters for all interactions, only vsite parameters are used
   * \param[in]     ilist  The interaction lists, only vsites are used
   * \param[in]     pbc_null  PBC struct, used for PBC distance calculations when !=nullptr
   */
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
  static void construct_vsites_thread(ArrayRef<RVec>                  x,
 -                                    const real                      dt,
                                      ArrayRef<RVec>                  v,
                                      ArrayRef<const t_iparams>       ip,
                                      ArrayRef<const InteractionList> ilist,
                                      const t_pbc*                    pbc_null)
  {
 -    real inv_dt;
 -    if (!v.empty())
 -    {
 -        inv_dt = 1.0 / dt;
 -    }
 -    else
 -    {
 -        inv_dt = 1.0;
 -    }
 +    if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +    {
 +        GMX_RELEASE_ASSERT(!v.empty(),
 +                           "Can't calculate velocities without access to velocity vector.");
 +    }
 +
 +    // Work around clang bug (unfixed as of Feb 2021)
 +    // https://bugs.llvm.org/show_bug.cgi?id=35450
 +    // clang-format off
 +    CLANG_DIAGNOSTIC_IGNORE(-Wunused-lambda-capture)
 +    // clang-format on
 +    // GCC 8 falsely flags unused variables if constexpr prunes a code path, fixed in GCC 9
 +    // https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85827
 +    // clang-format off
 +    GCC_DIAGNOSTIC_IGNORE(-Wunused-but-set-parameter)
 +    // clang-format on
 +    // getVOrNull returns a velocity rvec if we need it, nullptr otherwise.
 +    auto getVOrNull = [v](int idx) -> real* {
 +        if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +        {
 +            return v[idx].as_vec();
 +        }
 +        else
 +        {
 +            return nullptr;
 +        }
 +    };
 +    GCC_DIAGNOSTIC_RESET
 +    CLANG_DIAGNOSTIC_RESET
  
      const PbcMode pbcMode = getPbcMode(pbc_null);
      /* We need another pbc pointer, as with charge groups we switch per vsite */
                  real b1, c1;
                  switch (ftype)
                  {
 -                    case F_VSITE1: constr_vsite1(x[ai], x[avsite]); break;
 +                    case F_VSITE1:
 +                        constr_vsite1<calculatePosition, calculateVelocity>(
 +                                x[ai], x[avsite], getVOrNull(ai), getVOrNull(avsite));
 +                        break;
                      case F_VSITE2:
                          aj = ia[3];
 -                        constr_vsite2(x[ai], x[aj], x[avsite], a1, pbc_null2);
 +                        constr_vsite2<calculatePosition, calculateVelocity>(x[ai],
 +                                                                            x[aj],
 +                                                                            x[avsite],
 +                                                                            a1,
 +                                                                            pbc_null2,
 +                                                                            getVOrNull(ai),
 +                                                                            getVOrNull(aj),
 +                                                                            getVOrNull(avsite));
                          break;
                      case F_VSITE2FD:
                          aj = ia[3];
 -                        constr_vsite2FD(x[ai], x[aj], x[avsite], a1, pbc_null2);
 +                        constr_vsite2FD<calculatePosition, calculateVelocity>(x[ai],
 +                                                                              x[aj],
 +                                                                              x[avsite],
 +                                                                              a1,
 +                                                                              pbc_null2,
 +                                                                              getVOrNull(ai),
 +                                                                              getVOrNull(aj),
 +                                                                              getVOrNull(avsite));
                          break;
                      case F_VSITE3:
                          aj = ia[3];
                          ak = ia[4];
                          b1 = ip[tp].vsite.b;
 -                        constr_vsite3(x[ai], x[aj], x[ak], x[avsite], a1, b1, pbc_null2);
 +                        constr_vsite3<calculatePosition, calculateVelocity>(x[ai],
 +                                                                            x[aj],
 +                                                                            x[ak],
 +                                                                            x[avsite],
 +                                                                            a1,
 +                                                                            b1,
 +                                                                            pbc_null2,
 +                                                                            getVOrNull(ai),
 +                                                                            getVOrNull(aj),
 +                                                                            getVOrNull(ak),
 +                                                                            getVOrNull(avsite));
                          break;
                      case F_VSITE3FD:
                          aj = ia[3];
                          ak = ia[4];
                          b1 = ip[tp].vsite.b;
 -                        constr_vsite3FD(x[ai], x[aj], x[ak], x[avsite], a1, b1, pbc_null2);
 +                        constr_vsite3FD<calculatePosition, calculateVelocity>(x[ai],
 +                                                                              x[aj],
 +                                                                              x[ak],
 +                                                                              x[avsite],
 +                                                                              a1,
 +                                                                              b1,
 +                                                                              pbc_null2,
 +                                                                              getVOrNull(ai),
 +                                                                              getVOrNull(aj),
 +                                                                              getVOrNull(ak),
 +                                                                              getVOrNull(avsite));
                          break;
                      case F_VSITE3FAD:
                          aj = ia[3];
                          ak = ia[4];
                          b1 = ip[tp].vsite.b;
 -                        constr_vsite3FAD(x[ai], x[aj], x[ak], x[avsite], a1, b1, pbc_null2);
 +                        constr_vsite3FAD<calculatePosition, calculateVelocity>(x[ai],
 +                                                                               x[aj],
 +                                                                               x[ak],
 +                                                                               x[avsite],
 +                                                                               a1,
 +                                                                               b1,
 +                                                                               pbc_null2,
 +                                                                               getVOrNull(ai),
 +                                                                               getVOrNull(aj),
 +                                                                               getVOrNull(ak),
 +                                                                               getVOrNull(avsite));
                          break;
                      case F_VSITE3OUT:
                          aj = ia[3];
                          ak = ia[4];
                          b1 = ip[tp].vsite.b;
                          c1 = ip[tp].vsite.c;
 -                        constr_vsite3OUT(x[ai], x[aj], x[ak], x[avsite], a1, b1, c1, pbc_null2);
 +                        constr_vsite3OUT<calculatePosition, calculateVelocity>(x[ai],
 +                                                                               x[aj],
 +                                                                               x[ak],
 +                                                                               x[avsite],
 +                                                                               a1,
 +                                                                               b1,
 +                                                                               c1,
 +                                                                               pbc_null2,
 +                                                                               getVOrNull(ai),
 +                                                                               getVOrNull(aj),
 +                                                                               getVOrNull(ak),
 +                                                                               getVOrNull(avsite));
                          break;
                      case F_VSITE4FD:
                          aj = ia[3];
                          al = ia[5];
                          b1 = ip[tp].vsite.b;
                          c1 = ip[tp].vsite.c;
 -                        constr_vsite4FD(x[ai], x[aj], x[ak], x[al], x[avsite], a1, b1, c1, pbc_null2);
 +                        constr_vsite4FD<calculatePosition, calculateVelocity>(x[ai],
 +                                                                              x[aj],
 +                                                                              x[ak],
 +                                                                              x[al],
 +                                                                              x[avsite],
 +                                                                              a1,
 +                                                                              b1,
 +                                                                              c1,
 +                                                                              pbc_null2,
 +                                                                              getVOrNull(ai),
 +                                                                              getVOrNull(aj),
 +                                                                              getVOrNull(ak),
 +                                                                              getVOrNull(al),
 +                                                                              getVOrNull(avsite));
                          break;
                      case F_VSITE4FDN:
                          aj = ia[3];
                          al = ia[5];
                          b1 = ip[tp].vsite.b;
                          c1 = ip[tp].vsite.c;
 -                        constr_vsite4FDN(x[ai], x[aj], x[ak], x[al], x[avsite], a1, b1, c1, pbc_null2);
 +                        constr_vsite4FDN<calculatePosition, calculateVelocity>(x[ai],
 +                                                                               x[aj],
 +                                                                               x[ak],
 +                                                                               x[al],
 +                                                                               x[avsite],
 +                                                                               a1,
 +                                                                               b1,
 +                                                                               c1,
 +                                                                               pbc_null2,
 +                                                                               getVOrNull(ai),
 +                                                                               getVOrNull(aj),
 +                                                                               getVOrNull(ak),
 +                                                                               getVOrNull(al),
 +                                                                               getVOrNull(avsite));
 +                        break;
 +                    case F_VSITEN:
 +                        inc = constr_vsiten<calculatePosition, calculateVelocity>(ia, ip, x, pbc_null2, v);
                          break;
 -                    case F_VSITEN: inc = constr_vsiten(ia, ip, x, pbc_null2); break;
                      default:
                          gmx_fatal(FARGS, "No such vsite type %d in %s, line %d", ftype, __FILE__, __LINE__);
                  }
                      /* Keep the vsite in the same periodic image as before */
                      rvec dx;
                      int  ishift = pbc_dx_aiuc(pbc_null, x[avsite], xv, dx);
 -                    if (ishift != CENTRAL)
 +                    if (ishift != c_centralShiftIndex)
                      {
                          rvec_add(xv, dx, x[avsite]);
                      }
                  }
 -                if (!v.empty())
 -                {
 -                    /* Calculate velocity of vsite... */
 -                    rvec vv;
 -                    rvec_sub(x[avsite], xv, vv);
 -                    svmul(inv_dt, vv, v[avsite]);
 -                }
  
                  /* Increment loop variables */
                  i += inc;
   *
   * \param[in]     threadingInfo  Used to divide work over threads when != nullptr
   * \param[in,out] x   Coordinates to construct vsites for
 - * \param[in]     dt  Time step, needed when v is not empty
   * \param[in,out] v   When not empty, velocities are generated for virtual sites
   * \param[in]     ip  Interaction parameters for all interactions, only vsite parameters are used
   * \param[in]     ilist  The interaction lists, only vsites are used
   * \param[in]     domainInfo  Information about PBC and DD
   * \param[in]     box  Used for PBC when PBC is set in domainInfo
   */
 +template<VSiteCalculatePosition calculatePosition, VSiteCalculateVelocity calculateVelocity>
  static void construct_vsites(const ThreadingInfo*            threadingInfo,
                               ArrayRef<RVec>                  x,
 -                             real                            dt,
                               ArrayRef<RVec>                  v,
                               ArrayRef<const t_iparams>       ip,
                               ArrayRef<const InteractionList> ilist,
           */
          ivec null_ivec;
          clear_ivec(null_ivec);
 -        pbc_null = set_pbc_dd(&pbc, domainInfo.pbcType_,
 -                              useDomdec ? domainInfo.domdec_->numCells : null_ivec, FALSE, box);
 +        pbc_null = set_pbc_dd(
 +                &pbc, domainInfo.pbcType_, useDomdec ? domainInfo.domdec_->numCells : null_ivec, FALSE, box);
      }
      else
      {
  
      if (useDomdec)
      {
 -        dd_move_x_vsites(*domainInfo.domdec_, box, as_rvec_array(x.data()));
 +        if (calculateVelocity == VSiteCalculateVelocity::Yes)
 +        {
 +            dd_move_x_and_v_vsites(
 +                    *domainInfo.domdec_, box, as_rvec_array(x.data()), as_rvec_array(v.data()));
 +        }
 +        else
 +        {
 +            dd_move_x_vsites(*domainInfo.domdec_, box, as_rvec_array(x.data()));
 +        }
      }
  
      if (threadingInfo == nullptr || threadingInfo->numThreads() == 1)
      {
 -        construct_vsites_thread(x, dt, v, ip, ilist, pbc_null);
 +        construct_vsites_thread<calculatePosition, calculateVelocity>(x, v, ip, ilist, pbc_null);
      }
      else
      {
                  GMX_ASSERT(tData.rangeStart >= 0,
                             "The thread data should be initialized before calling construct_vsites");
  
 -                construct_vsites_thread(x, dt, v, ip, tData.ilist, pbc_null);
 +                construct_vsites_thread<calculatePosition, calculateVelocity>(
 +                        x, v, ip, tData.ilist, pbc_null);
                  if (tData.useInterdependentTask)
                  {
                      /* Here we don't need a barrier (unlike the spreading),
                       * since both tasks only construct vsites from particles,
                       * or local vsites, not from non-local vsites.
                       */
 -                    construct_vsites_thread(x, dt, v, ip, tData.idTask.ilist, pbc_null);
 +                    construct_vsites_thread<calculatePosition, calculateVelocity>(
 +                            x, v, ip, tData.idTask.ilist, pbc_null);
                  }
              }
              GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
          }
          /* Now we can construct the vsites that might depend on other vsites */
 -        construct_vsites_thread(x, dt, v, ip, threadingInfo->threadDataNonLocalDependent().ilist, pbc_null);
 +        construct_vsites_thread<calculatePosition, calculateVelocity>(
 +                x, v, ip, threadingInfo->threadDataNonLocalDependent().ilist, pbc_null);
      }
  }
  
 -void VirtualSitesHandler::Impl::construct(ArrayRef<RVec> x, real dt, ArrayRef<RVec> v, const matrix box) const
 +void VirtualSitesHandler::Impl::construct(ArrayRef<RVec> x,
 +                                          ArrayRef<RVec> v,
 +                                          const matrix   box,
 +                                          VSiteOperation operation) const
  {
 -    construct_vsites(&threadingInfo_, x, dt, v, iparams_, ilists_, domainInfo_, box);
 +    switch (operation)
 +    {
 +        case VSiteOperation::Positions:
 +            construct_vsites<VSiteCalculatePosition::Yes, VSiteCalculateVelocity::No>(
 +                    &threadingInfo_, x, v, iparams_, ilists_, domainInfo_, box);
 +            break;
 +        case VSiteOperation::Velocities:
 +            construct_vsites<VSiteCalculatePosition::No, VSiteCalculateVelocity::Yes>(
 +                    &threadingInfo_, x, v, iparams_, ilists_, domainInfo_, box);
 +            break;
 +        case VSiteOperation::PositionsAndVelocities:
 +            construct_vsites<VSiteCalculatePosition::Yes, VSiteCalculateVelocity::Yes>(
 +                    &threadingInfo_, x, v, iparams_, ilists_, domainInfo_, box);
 +            break;
 +        default: gmx_fatal(FARGS, "Unknown virtual site operation");
 +    }
  }
  
 -void VirtualSitesHandler::construct(ArrayRef<RVec> x, real dt, ArrayRef<RVec> v, const matrix box) const
 +void VirtualSitesHandler::construct(ArrayRef<RVec> x, ArrayRef<RVec> v, const matrix box, VSiteOperation operation) const
  {
 -    impl_->construct(x, dt, v, box);
 +    impl_->construct(x, v, box, operation);
  }
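
A call-site sketch of the new interface (handler and array names are illustrative; the dt argument is gone, and the operation selects what is computed)::

    vsiteHandler->construct(x, v, box, VSiteOperation::PositionsAndVelocities);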
  
  void constructVirtualSites(ArrayRef<RVec> x, ArrayRef<const t_iparams> ip, ArrayRef<const InteractionList> ilist)
  {
      // No PBC, no DD
      const DomainInfo domainInfo;
 -    construct_vsites(nullptr, x, 0, {}, ip, ilist, domainInfo, nullptr);
 +    construct_vsites<VSiteCalculatePosition::Yes, VSiteCalculateVelocity::No>(
 +            nullptr, x, {}, ip, ilist, domainInfo, nullptr);
  }
  
  #ifndef DOXYGEN
@@@ -1313,14 -899,14 +1313,14 @@@ static void spread_vsite2(const t_iato
          }
          else
          {
 -            siv = CENTRAL;
 -            sij = CENTRAL;
 +            siv = c_centralShiftIndex;
 +            sij = c_centralShiftIndex;
          }
  
 -        if (siv != CENTRAL || sij != CENTRAL)
 +        if (siv != c_centralShiftIndex || sij != c_centralShiftIndex)
          {
              rvec_inc(fshift[siv], f[av]);
 -            rvec_dec(fshift[CENTRAL], fi);
 +            rvec_dec(fshift[c_centralShiftIndex], fi);
              rvec_dec(fshift[sij], fj);
          }
      }
@@@ -1343,8 -929,8 +1343,8 @@@ void constructVirtualSitesGlobal(const 
              int atomOffset = mtop.moleculeBlockIndices[mb].globalAtomStart;
              for (int mol = 0; mol < molb.nmol; mol++)
              {
 -                constructVirtualSites(x.subArray(atomOffset, molt.atoms.nr), mtop.ffparams.iparams,
 -                                      molt.ilist);
 +                constructVirtualSites(
 +                        x.subArray(atomOffset, molt.atoms.nr), mtop.ffparams.iparams, molt.ilist);
                  atomOffset += molt.atoms.nr;
              }
          }
@@@ -1403,15 -989,15 +1403,15 @@@ static void spread_vsite2FD(const t_iat
          }
          else
          {
 -            svi = CENTRAL;
 +            svi = c_centralShiftIndex;
          }
  
 -        if (svi != CENTRAL || sji != CENTRAL)
 +        if (svi != c_centralShiftIndex || sji != c_centralShiftIndex)
          {
              rvec_dec(fshift[svi], fv);
 -            fshift[CENTRAL][XX] += fv[XX] - fj[XX];
 -            fshift[CENTRAL][YY] += fv[YY] - fj[YY];
 -            fshift[CENTRAL][ZZ] += fv[ZZ] - fj[ZZ];
 +            fshift[c_centralShiftIndex][XX] += fv[XX] - fj[XX];
 +            fshift[c_centralShiftIndex][YY] += fv[YY] - fj[YY];
 +            fshift[c_centralShiftIndex][ZZ] += fv[ZZ] - fj[ZZ];
              fshift[sji][XX] += fj[XX];
              fshift[sji][YY] += fj[YY];
              fshift[sji][ZZ] += fj[ZZ];
@@@ -1485,15 -1071,15 +1485,15 @@@ static void spread_vsite3(const t_iato
          }
          else
          {
 -            siv = CENTRAL;
 -            sij = CENTRAL;
 -            sik = CENTRAL;
 +            siv = c_centralShiftIndex;
 +            sij = c_centralShiftIndex;
 +            sik = c_centralShiftIndex;
          }
  
 -        if (siv != CENTRAL || sij != CENTRAL || sik != CENTRAL)
 +        if (siv != c_centralShiftIndex || sij != c_centralShiftIndex || sik != c_centralShiftIndex)
          {
              rvec_inc(fshift[siv], f[av]);
 -            rvec_dec(fshift[CENTRAL], fi);
 +            rvec_dec(fshift[c_centralShiftIndex], fi);
              rvec_dec(fshift[sij], fj);
              rvec_dec(fshift[sik], fk);
          }
@@@ -1568,15 -1154,15 +1568,15 @@@ static void spread_vsite3FD(const t_iat
          }
          else
          {
 -            svi = CENTRAL;
 +            svi = c_centralShiftIndex;
          }
  
 -        if (svi != CENTRAL || sji != CENTRAL || skj != CENTRAL)
 +        if (svi != c_centralShiftIndex || sji != c_centralShiftIndex || skj != c_centralShiftIndex)
          {
              rvec_dec(fshift[svi], fv);
 -            fshift[CENTRAL][XX] += fv[XX] - (1 + a) * temp[XX];
 -            fshift[CENTRAL][YY] += fv[YY] - (1 + a) * temp[YY];
 -            fshift[CENTRAL][ZZ] += fv[ZZ] - (1 + a) * temp[ZZ];
 +            fshift[c_centralShiftIndex][XX] += fv[XX] - (1 + a) * temp[XX];
 +            fshift[c_centralShiftIndex][YY] += fv[YY] - (1 + a) * temp[YY];
 +            fshift[c_centralShiftIndex][ZZ] += fv[ZZ] - (1 + a) * temp[ZZ];
              fshift[sji][XX] += temp[XX];
              fshift[sji][YY] += temp[YY];
              fshift[sji][ZZ] += temp[ZZ];
@@@ -1690,15 -1276,15 +1690,15 @@@ static void spread_vsite3FAD(const t_ia
          }
          else
          {
 -            svi = CENTRAL;
 +            svi = c_centralShiftIndex;
          }
  
 -        if (svi != CENTRAL || sji != CENTRAL || skj != CENTRAL)
 +        if (svi != c_centralShiftIndex || sji != c_centralShiftIndex || skj != c_centralShiftIndex)
          {
              rvec_dec(fshift[svi], fv);
 -            fshift[CENTRAL][XX] += fv[XX] - f1[XX] - (1 - c1) * f2[XX] + f3[XX];
 -            fshift[CENTRAL][YY] += fv[YY] - f1[YY] - (1 - c1) * f2[YY] + f3[YY];
 -            fshift[CENTRAL][ZZ] += fv[ZZ] - f1[ZZ] - (1 - c1) * f2[ZZ] + f3[ZZ];
 +            fshift[c_centralShiftIndex][XX] += fv[XX] - f1[XX] - (1 - c1) * f2[XX] + f3[XX];
 +            fshift[c_centralShiftIndex][YY] += fv[YY] - f1[YY] - (1 - c1) * f2[YY] + f3[YY];
 +            fshift[c_centralShiftIndex][ZZ] += fv[ZZ] - f1[ZZ] - (1 - c1) * f2[ZZ] + f3[ZZ];
              fshift[sji][XX] += f1[XX] - c1 * f2[XX] - f3[XX];
              fshift[sji][YY] += f1[YY] - c1 * f2[YY] - f3[YY];
              fshift[sji][ZZ] += f1[ZZ] - c1 * f2[ZZ] - f3[ZZ];
@@@ -1784,15 -1370,15 +1784,15 @@@ static void spread_vsite3OUT(const t_ia
          }
          else
          {
 -            svi = CENTRAL;
 +            svi = c_centralShiftIndex;
          }
  
 -        if (svi != CENTRAL || sji != CENTRAL || ski != CENTRAL)
 +        if (svi != c_centralShiftIndex || sji != c_centralShiftIndex || ski != c_centralShiftIndex)
          {
              rvec_dec(fshift[svi], fv);
 -            fshift[CENTRAL][XX] += fv[XX] - fj[XX] - fk[XX];
 -            fshift[CENTRAL][YY] += fv[YY] - fj[YY] - fk[YY];
 -            fshift[CENTRAL][ZZ] += fv[ZZ] - fj[ZZ] - fk[ZZ];
 +            fshift[c_centralShiftIndex][XX] += fv[XX] - fj[XX] - fk[XX];
 +            fshift[c_centralShiftIndex][YY] += fv[YY] - fj[YY] - fk[YY];
 +            fshift[c_centralShiftIndex][ZZ] += fv[ZZ] - fj[ZZ] - fk[ZZ];
              rvec_inc(fshift[sji], fj);
              rvec_inc(fshift[ski], fk);
          }
@@@ -1886,16 -1472,15 +1886,16 @@@ static void spread_vsite4FD(const t_iat
          }
          else
          {
 -            svi = CENTRAL;
 +            svi = c_centralShiftIndex;
          }
  
 -        if (svi != CENTRAL || sji != CENTRAL || skj != CENTRAL || slj != CENTRAL)
 +        if (svi != c_centralShiftIndex || sji != c_centralShiftIndex || skj != c_centralShiftIndex
 +            || slj != c_centralShiftIndex)
          {
              rvec_dec(fshift[svi], fv);
              for (m = 0; m < DIM; m++)
              {
 -                fshift[CENTRAL][m] += fv[m] - (1 + a + b) * temp[m];
 +                fshift[c_centralShiftIndex][m] += fv[m] - (1 + a + b) * temp[m];
                  fshift[sji][m] += temp[m];
                  fshift[skj][m] += a * temp[m];
                  fshift[slj][m] += b * temp[m];
@@@ -2046,16 -1631,15 +2046,16 @@@ static void spread_vsite4FDN(const t_ia
          }
          else
          {
 -            svi = CENTRAL;
 +            svi = c_centralShiftIndex;
          }
  
 -        if (svi != CENTRAL || sij != CENTRAL || sik != CENTRAL || sil != CENTRAL)
 +        if (svi != c_centralShiftIndex || sij != c_centralShiftIndex || sik != c_centralShiftIndex
 +            || sil != c_centralShiftIndex)
          {
              rvec_dec(fshift[svi], fv);
 -            fshift[CENTRAL][XX] += fv[XX] - fj[XX] - fk[XX] - fl[XX];
 -            fshift[CENTRAL][YY] += fv[YY] - fj[YY] - fk[YY] - fl[YY];
 -            fshift[CENTRAL][ZZ] += fv[ZZ] - fj[ZZ] - fk[ZZ] - fl[ZZ];
 +            fshift[c_centralShiftIndex][XX] += fv[XX] - fj[XX] - fk[XX] - fl[XX];
 +            fshift[c_centralShiftIndex][YY] += fv[YY] - fj[YY] - fk[YY] - fl[YY];
 +            fshift[c_centralShiftIndex][ZZ] += fv[ZZ] - fj[ZZ] - fk[ZZ] - fl[ZZ];
              rvec_inc(fshift[sij], fj);
              rvec_inc(fshift[sik], fk);
              rvec_inc(fshift[sil], fl);
@@@ -2107,16 -1691,16 +2107,16 @@@ static int spread_vsiten(const t_iato
          }
          else
          {
 -            siv = CENTRAL;
 +            siv = c_centralShiftIndex;
          }
          a = ip[ia[i]].vsiten.a;
          svmul(a, f[av], fi);
          rvec_inc(f[ai], fi);
  
 -        if (virialHandling == VirialHandling::Pbc && siv != CENTRAL)
 +        if (virialHandling == VirialHandling::Pbc && siv != c_centralShiftIndex)
          {
              rvec_inc(fshift[siv], fi);
 -            rvec_dec(fshift[CENTRAL], fi);
 +            rvec_dec(fshift[c_centralShiftIndex], fi);
          }
          /* 6 Flops */
      }
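
For reference, all of the spread_vsite* routines above maintain the same invariant: any force booked into a non-central shift cell must be balanced in the central cell, so the total over all shift cells is unchanged. A minimal standalone sketch of that bookkeeping (a toy three-cell setup, not GROMACS code; c_centralShiftIndex here only mirrors the role of the constant used above):

    #include <array>
    #include <cassert>

    int main()
    {
        constexpr int         c_centralShiftIndex = 1; // toy cells: {-1, 0, +1}
        std::array<double, 3> fshift{};                // one scalar "force" per shift cell

        const double fv  = 5.0; // force carried by the virtual site
        const double fj  = 2.0; // share spread to a constructing atom in cell sji
        const int    sji = 2;   // shift cell of that constructing atom
        const int    svi = 0;   // shift cell of the virtual site itself

        fshift[svi] -= fv;                      // the vsite's cell gives up fv
        fshift[c_centralShiftIndex] += fv - fj; // the central cell absorbs the remainder
        fshift[sji] += fj;                      // the atom's cell receives its share

        // The bookkeeping conserves the sum over all shift cells.
        assert(fshift[0] + fshift[1] + fshift[2] == 0.0);
        return 0;
    }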
@@@ -2290,7 -1874,7 +2290,7 @@@ void VirtualSitesHandler::Impl::spreadF
                                               const matrix         box,
                                               gmx_wallcycle*       wcycle)
  {
 -    wallcycle_start(wcycle, ewcVSITESPREAD);
 +    wallcycle_start(wcycle, WallCycleCounter::VsiteSpread);
  
      const bool useDomdec = domainInfo_.useDomdec();
  
          /* This is wasting some CPU time as we now do this multiple times
           * per MD step.
           */
 -        pbc_null = set_pbc_dd(&pbc, domainInfo_.pbcType_,
 -                              useDomdec ? domainInfo_.domdec_->numCells : nullptr, FALSE, box);
 +        pbc_null = set_pbc_dd(
 +                &pbc, domainInfo_.pbcType_, useDomdec ? domainInfo_.domdec_->numCells : nullptr, FALSE, box);
      }
      else
      {
      {
          /* First spread the vsites that might depend on non-local vsites */
          auto& nlDependentVSites = threadingInfo_.threadDataNonLocalDependent();
 -        spreadForceWrapper(x, f, virialHandling, fshift, nlDependentVSites.dxdf, true, iparams_,
 -                           nlDependentVSites.ilist, pbc_null);
 +        spreadForceWrapper(x,
 +                           f,
 +                           virialHandling,
 +                           fshift,
 +                           nlDependentVSites.dxdf,
 +                           true,
 +                           iparams_,
 +                           nlDependentVSites.ilist,
 +                           pbc_null);
  
  #pragma omp parallel num_threads(numThreads)
          {
                      {
                          fshift_t = tData.fshift;
  
 -                        for (int i = 0; i < SHIFTS; i++)
 +                        for (int i = 0; i < c_numShiftVectors; i++)
                          {
                              clear_rvec(fshift_t[i]);
                          }
                      {
                          copy_rvec(f[idTask->vsite[i]], idTask->force[idTask->vsite[i]]);
                      }
 -                    spreadForceWrapper(x, idTask->force, virialHandling, fshift_t, tData.dxdf, true,
 -                                       iparams_, tData.idTask.ilist, pbc_null);
 +                    spreadForceWrapper(x,
 +                                       idTask->force,
 +                                       virialHandling,
 +                                       fshift_t,
 +                                       tData.dxdf,
 +                                       true,
 +                                       iparams_,
 +                                       tData.idTask.ilist,
 +                                       pbc_null);
  
                      /* We need a barrier before reducing forces below
                       * that have been produced by a different thread above.
                  }
  
                  /* Spread the vsites that spread locally only */
 -                spreadForceWrapper(x, f, virialHandling, fshift_t, tData.dxdf, false, iparams_,
 -                                   tData.ilist, pbc_null);
 +                spreadForceWrapper(
 +                        x, f, virialHandling, fshift_t, tData.dxdf, false, iparams_, tData.ilist, pbc_null);
              }
              GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
          }
          {
              for (int th = 1; th < numThreads; th++)
              {
 -                for (int i = 0; i < SHIFTS; i++)
 +                for (int i = 0; i < c_numShiftVectors; i++)
                  {
                      rvec_inc(fshift[i], threadingInfo_.threadData(th).fshift[i]);
                  }
      inc_nrnb(nrnb, eNR_VSITE4FDN, vsite_count(ilists_, F_VSITE4FDN));
      inc_nrnb(nrnb, eNR_VSITEN, vsite_count(ilists_, F_VSITEN));
  
 -    wallcycle_stop(wcycle, ewcVSITESPREAD);
 +    wallcycle_stop(wcycle, WallCycleCounter::VsiteSpread);
  }
  
   /*! \brief Returns an array with group indices for each atom
@@@ -2537,7 -2107,7 +2537,7 @@@ void VirtualSitesHandler::spreadForces(
  }
  
  int countInterUpdategroupVsites(const gmx_mtop_t&                           mtop,
 -                                gmx::ArrayRef<const gmx::RangePartitioning> updateGroupingPerMoleculetype)
 +                                gmx::ArrayRef<const gmx::RangePartitioning> updateGroupingsPerMoleculeType)
  {
      int n_intercg_vsite = 0;
      for (const gmx_molblock_t& molb : mtop.molblock)
          const gmx_moltype_t& molt = mtop.moltype[molb.type];
  
          std::vector<int> atomToGroup;
 -        if (!updateGroupingPerMoleculetype.empty())
 +        if (!updateGroupingsPerMoleculeType.empty())
          {
 -            atomToGroup = makeAtomToGroupMapping(updateGroupingPerMoleculetype[molb.type]);
 +            atomToGroup = makeAtomToGroupMapping(updateGroupingsPerMoleculeType[molb.type]);
          }
          for (int ftype = c_ftypeVsiteStart; ftype < c_ftypeVsiteEnd; ftype++)
          {
  
  std::unique_ptr<VirtualSitesHandler> makeVirtualSitesHandler(const gmx_mtop_t& mtop,
                                                               const t_commrec*  cr,
 -                                                             PbcType           pbcType)
 +                                                             PbcType           pbcType,
 +                                                             ArrayRef<const RangePartitioning> updateGroupingPerMoleculeType)
  {
      GMX_RELEASE_ASSERT(cr != nullptr, "We need a valid commrec");
  
              GMX_ASSERT(ftype >= c_ftypeVsiteStart && ftype < c_ftypeVsiteEnd,
                         "c_ftypeVsiteStart and/or c_ftypeVsiteEnd do not have correct values");
  
 -            nvsite += gmx_mtop_ftype_count(&mtop, ftype);
 +            nvsite += gmx_mtop_ftype_count(mtop, ftype);
          }
          else
          {
          return vsite;
      }
  
 -    return std::make_unique<VirtualSitesHandler>(mtop, cr->dd, pbcType);
 +    return std::make_unique<VirtualSitesHandler>(mtop, cr->dd, pbcType, updateGroupingPerMoleculeType);
  }
  
 -ThreadingInfo::ThreadingInfo() : numThreads_(gmx_omp_nthreads_get(emntVSITE))
 +ThreadingInfo::ThreadingInfo() : numThreads_(gmx_omp_nthreads_get(ModuleMultiThread::VirtualSite))
  {
      if (numThreads_ > 1)
      {
      }
  }
  
 -//! Returns the number of inter update-group vsites
 -static int getNumInterUpdategroupVsites(const gmx_mtop_t& mtop, const gmx_domdec_t* domdec)
 -{
 -    gmx::ArrayRef<const gmx::RangePartitioning> updateGroupingPerMoleculetype;
 -    if (domdec)
 -    {
 -        updateGroupingPerMoleculetype = getUpdateGroupingPerMoleculetype(*domdec);
 -    }
 -
 -    return countInterUpdategroupVsites(mtop, updateGroupingPerMoleculetype);
 -}
 -
 -VirtualSitesHandler::Impl::Impl(const gmx_mtop_t& mtop, gmx_domdec_t* domdec, const PbcType pbcType) :
 -    numInterUpdategroupVirtualSites_(getNumInterUpdategroupVsites(mtop, domdec)),
 +VirtualSitesHandler::Impl::Impl(const gmx_mtop_t&                       mtop,
 +                                gmx_domdec_t*                           domdec,
 +                                const PbcType                           pbcType,
 +                                const ArrayRef<const RangePartitioning> updateGroupingPerMoleculeType) :
 +    numInterUpdategroupVirtualSites_(countInterUpdategroupVsites(mtop, updateGroupingPerMoleculeType)),
      domainInfo_({ pbcType, pbcType != PbcType::No && numInterUpdategroupVirtualSites_ > 0, domdec }),
      iparams_(mtop.ffparams.iparams)
  {
  }
  
 -VirtualSitesHandler::VirtualSitesHandler(const gmx_mtop_t& mtop, gmx_domdec_t* domdec, const PbcType pbcType) :
 -    impl_(new Impl(mtop, domdec, pbcType))
 +VirtualSitesHandler::VirtualSitesHandler(const gmx_mtop_t&                       mtop,
 +                                         gmx_domdec_t*                           domdec,
 +                                         const PbcType                           pbcType,
 +                                         const ArrayRef<const RangePartitioning> updateGroupingPerMoleculeType) :
 +    impl_(new Impl(mtop, domdec, pbcType, updateGroupingPerMoleculeType))
  {
  }
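
A hedged call-site sketch of the widened factory above: the update groupings now flow in explicitly instead of being pulled from the domain decomposition inside the handler. The header path and namespace qualifications are assumptions about the surrounding tree:

    #include <memory>

    #include "gromacs/mdlib/vsite.h" // assumed header for the declarations above

    std::unique_ptr<gmx::VirtualSitesHandler>
    buildVsites(const gmx_mtop_t&                           mtop,
                const t_commrec*                            cr,
                PbcType                                     pbcType,
                gmx::ArrayRef<const gmx::RangePartitioning> updateGroupingsPerMoleculeType)
    {
        // The caller now owns the decision of which update groupings apply;
        // makeVirtualSitesHandler simply forwards them to the implementation.
        return gmx::makeVirtualSitesHandler(mtop, cr, pbcType, updateGroupingsPerMoleculeType);
    }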
  
@@@ -2690,7 -2265,7 +2690,7 @@@ static void assignVsitesToThread(VsiteT
                                   gmx::ArrayRef<int>              taskIndex,
                                   ArrayRef<const InteractionList> ilist,
                                   ArrayRef<const t_iparams>       ip,
 -                                 const unsigned short*           ptype)
 +                                 ArrayRef<const ParticleType>    ptype)
  {
      for (int ftype = c_ftypeVsiteStart; ftype < c_ftypeVsiteEnd; ftype++)
      {
                      /* Do a range check to avoid a harmless race on taskIndex */
                      if (iat[j] < tData->rangeStart || iat[j] >= tData->rangeEnd || taskIndex[iat[j]] != thread)
                      {
 -                        if (!tData->useInterdependentTask || ptype[iat[j]] == eptVSite)
 +                        if (!tData->useInterdependentTask || ptype[iat[j]] == ParticleType::VSite)
                          {
                              /* At least one constructing atom is a vsite
                               * that is not assigned to the same thread.
                      /* Do a range check to avoid a harmless race on taskIndex */
                      if (iat[j] < tData->rangeStart || iat[j] >= tData->rangeEnd || taskIndex[iat[j]] != thread)
                      {
 -                        GMX_ASSERT(ptype[iat[j]] != eptVSite,
 +                        GMX_ASSERT(ptype[iat[j]] != ParticleType::VSite,
                                     "A vsite to be assigned in assignVsitesToThread has a vsite as "
                                     "a constructing atom that does not belong to our task, such "
                                     "vsites should be assigned to the single 'master' task");
  
-                         task = nthread + thread;
+                         if (tData->useInterdependentTask)
+                         {
+                             // Assign to the interdependent task
+                             task = nthread + thread;
+                         }
+                         else
+                         {
+                             // Assign to the separate, non-parallel task
+                             task = 2 * nthread;
+                         }
                      }
                  }
              }
@@@ -2849,9 -2433,7 +2858,9 @@@ static void assignVsitesToSingleTask(Vs
  
  void ThreadingInfo::setVirtualSites(ArrayRef<const InteractionList> ilists,
                                      ArrayRef<const t_iparams>       iparams,
 -                                    const t_mdatoms&                mdatoms,
 +                                    const int                       numAtoms,
 +                                    const int                       homenr,
 +                                    ArrayRef<const ParticleType>    ptype,
                                      const bool                      useDomdec)
  {
      if (numThreads_ <= 1)
           * When assigning vsites to threads, we should take care that the last
           * thread also covers the non-local range.
           */
 -        vsite_atom_range = mdatoms.nr;
 -        natperthread     = (mdatoms.homenr + numThreads_ - 1) / numThreads_;
 +        vsite_atom_range = numAtoms;
 +        natperthread     = (homenr + numThreads_ - 1) / numThreads_;
      }
  
      if (debug)
      {
 -        fprintf(debug, "virtual site thread dist: natoms %d, range %d, natperthread %d\n",
 -                mdatoms.nr, vsite_atom_range, natperthread);
 +        fprintf(debug,
 +                "virtual site thread dist: natoms %d, range %d, natperthread %d\n",
 +                numAtoms,
 +                vsite_atom_range,
 +                natperthread);
      }
  
      /* To simplify the vsite assignment, we make an index which tells us
       * to which task particles, both non-vsites and vsites, are assigned.
       */
 -    taskIndex_.resize(mdatoms.nr);
 +    taskIndex_.resize(numAtoms);
  
      /* Initialize the task index array. Here we assign the non-vsite
       * particles to task=thread, so we easily figure out if vsites
       */
      {
          int thread = 0;
 -        for (int i = 0; i < mdatoms.nr; i++)
 +        for (int i = 0; i < numAtoms; i++)
          {
 -            if (mdatoms.ptype[i] == eptVSite)
 +            if (ptype[i] == ParticleType::VSite)
              {
                  /* vsites are not assigned to a task yet */
                  taskIndex_[i] = -1;
              }
  
              /* To avoid large f_buf allocations of #threads*vsite_atom_range
-              * we don't use task2 with more than 200000 atoms. This doesn't
-              * affect performance, since with such a large range relatively few
-              * vsites will end up in the separate task.
-              * Note that useTask2 should be the same for all threads.
+              * we don't use the interdependent tasks with more than 200000 atoms.
+              * This doesn't affect performance, since with such a large range
+              * relatively few vsites will end up in the separate task.
+              * Note that useInterdependentTask should be the same for all threads.
               */
-             tData.useInterdependentTask = (vsite_atom_range <= 200000);
+             const int c_maxNumLocalAtomsForInterdependentTask = 200000;
+             tData.useInterdependentTask = (vsite_atom_range <= c_maxNumLocalAtomsForInterdependentTask);
              if (tData.useInterdependentTask)
              {
                  size_t              natoms_use_in_vsites = vsite_atom_range;
              else
              {
                  /* The last thread should cover up to the end of the range */
 -                tData.rangeEnd = mdatoms.nr;
 +                tData.rangeEnd = numAtoms;
              }
 -            assignVsitesToThread(&tData, thread, numThreads_, natperthread, taskIndex_, ilists,
 -                                 iparams, mdatoms.ptype);
 +            assignVsitesToThread(
 +                    &tData, thread, numThreads_, natperthread, taskIndex_, ilists, iparams, ptype);
  
              if (tData.useInterdependentTask)
              {
  
      if (debug && numThreads_ > 1)
      {
 -        fprintf(debug, "virtual site useInterdependentTask %d, nuse:\n",
 +        fprintf(debug,
 +                "virtual site useInterdependentTask %d, nuse:\n",
                  static_cast<int>(tData_[0]->useInterdependentTask));
          for (int th = 0; th < numThreads_ + 1; th++)
          {
                  fprintf(debug, "%-20s thread dist:", interaction_function[ftype].longname);
                  for (int th = 0; th < numThreads_ + 1; th++)
                  {
 -                    fprintf(debug, " %4d %4d ", tData_[th]->ilist[ftype].size(),
 +                    fprintf(debug,
 +                            " %4d %4d ",
 +                            tData_[th]->ilist[ftype].size(),
                              tData_[th]->idTask.ilist[ftype].size());
                  }
                  fprintf(debug, "\n");
  }
  
  void VirtualSitesHandler::Impl::setVirtualSites(ArrayRef<const InteractionList> ilists,
 -                                                const t_mdatoms&                mdatoms)
 +                                                const int                       numAtoms,
 +                                                const int                       homenr,
 +                                                ArrayRef<const ParticleType>    ptype)
  {
      ilists_ = ilists;
  
 -    threadingInfo_.setVirtualSites(ilists, iparams_, mdatoms, domainInfo_.useDomdec());
 +    threadingInfo_.setVirtualSites(ilists, iparams_, numAtoms, homenr, ptype, domainInfo_.useDomdec());
  }
  
 -void VirtualSitesHandler::setVirtualSites(ArrayRef<const InteractionList> ilists, const t_mdatoms& mdatoms)
 +void VirtualSitesHandler::setVirtualSites(ArrayRef<const InteractionList> ilists,
 +                                          const int                       numAtoms,
 +                                          const int                       homenr,
 +                                          ArrayRef<const ParticleType>    ptype)
  {
 -    impl_->setVirtualSites(ilists, mdatoms);
 +    impl_->setVirtualSites(ilists, numAtoms, homenr, ptype);
  }
  
  } // namespace gmx
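
The setVirtualSites refactoring replaces the t_mdatoms parameter with the three fields it actually uses. A hedged sketch of the corresponding call-site change (the t_mdatoms member names are assumptions based on the removed code paths above):

    void setVsitesAfterPartitioning(gmx::VirtualSitesHandler*            vsite,
                                    gmx::ArrayRef<const InteractionList> ilists,
                                    const t_mdatoms&                     mdatoms)
    {
        // Pass only what the threading setup needs: the total atom count,
        // the home-atom count, and the per-particle types.
        vsite->setVirtualSites(ilists,
                               mdatoms.nr,
                               mdatoms.homenr,
                               gmx::arrayRefFromArray(mdatoms.ptype, mdatoms.nr));
    }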
index 3bf1d5071cd6d9d21bb56dd62ad182f178488c99,423a90905cbb3eb86779f38271287c9ee0df3912..887e7ba7093ee95bbfddcea4cbdce399c1f34446
  #include "gromacs/domdec/domdec_struct.h"
  #include "gromacs/domdec/gpuhaloexchange.h"
  #include "gromacs/domdec/localatomsetmanager.h"
 +#include "gromacs/domdec/makebondedlinks.h"
  #include "gromacs/domdec/partition.h"
 +#include "gromacs/domdec/reversetopology.h"
  #include "gromacs/ewald/ewald_utils.h"
 +#include "gromacs/ewald/pme.h"
  #include "gromacs/ewald/pme_gpu_program.h"
  #include "gromacs/ewald/pme_only.h"
  #include "gromacs/ewald/pme_pp_comm_gpu.h"
@@@ -84,7 -81,7 +84,7 @@@
  #include "gromacs/hardware/printhardware.h"
  #include "gromacs/imd/imd.h"
  #include "gromacs/listed_forces/disre.h"
 -#include "gromacs/listed_forces/gpubonded.h"
 +#include "gromacs/listed_forces/listed_forces_gpu.h"
  #include "gromacs/listed_forces/listed_forces.h"
  #include "gromacs/listed_forces/orires.h"
  #include "gromacs/math/functions.h"
  #include "gromacs/mdtypes/mdatom.h"
  #include "gromacs/mdtypes/mdrunoptions.h"
  #include "gromacs/mdtypes/observableshistory.h"
 +#include "gromacs/mdtypes/observablesreducer.h"
  #include "gromacs/mdtypes/simulation_workload.h"
  #include "gromacs/mdtypes/state.h"
  #include "gromacs/mdtypes/state_propagator_data_gpu.h"
  #include "gromacs/utility/keyvaluetree.h"
  #include "gromacs/utility/logger.h"
  #include "gromacs/utility/loggerbuilder.h"
 -#include "gromacs/utility/mdmodulenotification.h"
 +#include "gromacs/utility/mdmodulesnotifiers.h"
  #include "gromacs/utility/physicalnodecommunicator.h"
  #include "gromacs/utility/pleasecite.h"
  #include "gromacs/utility/programcontext.h"
  #include "gromacs/utility/smalloc.h"
  #include "gromacs/utility/stringutil.h"
 +#include "gromacs/utility/mpiinfo.h"
  
  #include "isimulator.h"
  #include "membedholder.h"
@@@ -203,77 -198,19 +203,77 @@@ static DevelopmentFeatureFlags manageDe
  {
      DevelopmentFeatureFlags devFlags;
  
 -    // Some builds of GCC 5 give false positive warnings that these
 -    // getenv results are ignored when clearly they are used.
 -#pragma GCC diagnostic push
 -#pragma GCC diagnostic ignored "-Wunused-result"
 -
      devFlags.enableGpuBufferOps =
              GMX_GPU_CUDA && useGpuForNonbonded && (getenv("GMX_USE_GPU_BUFFER_OPS") != nullptr);
 -    devFlags.enableGpuHaloExchange = GMX_GPU_CUDA && GMX_THREAD_MPI && getenv("GMX_GPU_DD_COMMS") != nullptr;
 +    devFlags.enableGpuHaloExchange = GMX_MPI && GMX_GPU_CUDA && getenv("GMX_GPU_DD_COMMS") != nullptr;
      devFlags.forceGpuUpdateDefault = (getenv("GMX_FORCE_UPDATE_DEFAULT_GPU") != nullptr) || GMX_FAHCORE;
 -    devFlags.enableGpuPmePPComm =
 -            GMX_GPU_CUDA && GMX_THREAD_MPI && getenv("GMX_GPU_PME_PP_COMMS") != nullptr;
 +    devFlags.enableGpuPmePPComm = GMX_MPI && GMX_GPU_CUDA && getenv("GMX_GPU_PME_PP_COMMS") != nullptr;
 +
 +    // The direct GPU communication paths require CUDA-aware MPI;
 +    // make sure the underlying MPI implementation supports it.
 +    if (!GMX_THREAD_MPI && (devFlags.enableGpuPmePPComm || devFlags.enableGpuHaloExchange))
 +    {
 +        const bool haveDetectedCudaAwareMpi =
 +                (checkMpiCudaAwareSupport() == CudaAwareMpiStatus::Supported);
 +        const bool forceCudaAwareMpi = (getenv("GMX_FORCE_CUDA_AWARE_MPI") != nullptr);
 +
 +        if (!haveDetectedCudaAwareMpi && forceCudaAwareMpi)
 +        {
 +            // CUDA-aware support was not detected in the MPI library, but the user has forced its use
 +            GMX_LOG(mdlog.warning)
 +                    .asParagraph()
 +                    .appendTextFormatted(
 +                            "This run has forced use of 'CUDA-aware MPI'. "
 +                            "However, GROMACS cannot determine whether the underlying MPI "
 +                            "is CUDA-aware. GROMACS recommends the latest OpenMPI version "
 +                            "for CUDA-aware support. "
 +                            "If you observe failures at runtime, try unsetting the "
 +                            "GMX_FORCE_CUDA_AWARE_MPI environment variable.");
 +        }
 +
 +        if (haveDetectedCudaAwareMpi || forceCudaAwareMpi)
 +        {
 +            devFlags.usingCudaAwareMpi = true;
 +            GMX_LOG(mdlog.warning)
 +                    .asParagraph()
 +                    .appendTextFormatted(
 +                            "Using CUDA-aware MPI for the 'GPU halo exchange' or 'GPU PME-PP "
 +                            "communications' features.");
 +        }
 +        else
 +        {
 +            if (devFlags.enableGpuHaloExchange)
 +            {
 +                GMX_LOG(mdlog.warning)
 +                        .asParagraph()
 +                        .appendTextFormatted(
 +                                "GMX_GPU_DD_COMMS environment variable detected, but the 'GPU "
 +                                "halo exchange' feature will not be enabled because GROMACS could "
 +                                "not detect CUDA-aware support in the underlying MPI implementation.");
 +                devFlags.enableGpuHaloExchange = false;
 +            }
 +            if (devFlags.enableGpuPmePPComm)
 +            {
 +                GMX_LOG(mdlog.warning)
 +                        .asParagraph()
 +                        .appendText(
 +                                "GMX_GPU_PME_PP_COMMS environment variable detected, but the "
 +                                "'GPU PME-PP communications' feature will not be enabled because "
 +                                "GROMACS could not detect CUDA-aware support in the underlying "
 +                                "MPI implementation.");
 +                devFlags.enableGpuPmePPComm = false;
 +            }
  
 -#pragma GCC diagnostic pop
 +            GMX_LOG(mdlog.warning)
 +                    .asParagraph()
 +                    .appendTextFormatted(
 +                            "GROMACS recommends the latest OpenMPI version for CUDA-aware "
 +                            "support. "
 +                            "If you are certain about CUDA-aware support in your MPI library, "
 +                            "you can force its use by setting the environment variable "
 +                            "GMX_FORCE_CUDA_AWARE_MPI.");
 +        }
 +    }
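
The block above follows the usual pattern for development feature flags: a feature is enabled only when the build supports it and the user opts in via an environment variable, and the runtime capability check can veto the opt-in or be overridden (here via GMX_FORCE_CUDA_AWARE_MPI). A standalone toy sketch of that pattern (not GROMACS code):

    #include <cstdlib>

    static bool envSet(const char* name)
    {
        return std::getenv(name) != nullptr;
    }

    int main()
    {
        const bool buildSupportsFeature  = true; // stands in for GMX_MPI && GMX_GPU_CUDA
        bool       enableGpuHaloExchange = buildSupportsFeature && envSet("GMX_GPU_DD_COMMS");

        const bool mpiIsCudaAware = false; // stands in for the runtime capability check
        const bool forced         = envSet("GMX_FORCE_CUDA_AWARE_MPI");
        if (enableGpuHaloExchange && !(mpiIsCudaAware || forced))
        {
            enableGpuHaloExchange = false; // the capability check vetoes the opt-in
        }
        return enableGpuHaloExchange ? 0 : 1;
    }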
  
      if (devFlags.enableGpuBufferOps)
      {
@@@ -435,7 -372,7 +435,7 @@@ static void mdrunner_start_fn(const voi
  {
      try
      {
 -        auto masterMdrunner = reinterpret_cast<const gmx::Mdrunner*>(arg);
 +        const auto* masterMdrunner = reinterpret_cast<const gmx::Mdrunner*>(arg);
          /* copy the arg list to make sure that it's thread-local. This
             doesn't copy pointed-to items, of course; fnm, cr and fplog
             are reset in the call below, all others should be const. */
@@@ -451,7 -388,8 +451,7 @@@ void Mdrunner::spawnThreads(int numThre
  #if GMX_THREAD_MPI
      /* now spawn new threads that start mdrunner_start_fn(), while
         the main thread returns. Thread affinity is handled later. */
 -    if (tMPI_Init_fn(TRUE, numThreadsToLaunch, TMPI_AFFINITY_NONE, mdrunner_start_fn,
 -                     static_cast<const void*>(this))
 +    if (tMPI_Init_fn(TRUE, numThreadsToLaunch, TMPI_AFFINITY_NONE, mdrunner_start_fn, static_cast<const void*>(this))
          != TMPI_SUCCESS)
      {
          GMX_THROW(gmx::InternalError("Failed to spawn thread-MPI threads"));
@@@ -475,14 -413,14 +475,14 @@@ static void prepare_verlet_scheme(FILE
                                    t_commrec*          cr,
                                    t_inputrec*         ir,
                                    int                 nstlist_cmdline,
 -                                  const gmx_mtop_t*   mtop,
 +                                  const gmx_mtop_t&   mtop,
                                    const matrix        box,
                                    bool                makeGpuPairList,
                                    const gmx::CpuInfo& cpuinfo)
  {
      // We checked the cut-offs in grompp, but double-check here.
      // We have PME+LJcutoff kernels for rcoulomb>rvdw.
 -    if (EEL_PME_EWALD(ir->coulombtype) && ir->vdwtype == eelCUT)
 +    if (EEL_PME_EWALD(ir->coulombtype) && ir->vdwtype == VanDerWaalsType::Cut)
      {
          GMX_RELEASE_ASSERT(ir->rcoulomb >= ir->rvdw,
                             "With Verlet lists and PME we should have rcoulomb>=rvdw");
                             "With Verlet lists and no PME rcoulomb and rvdw should be identical");
      }
      /* For NVE simulations, we will retain the initial list buffer */
 -    if (EI_DYNAMICS(ir->eI) && ir->verletbuf_tol > 0 && !(EI_MD(ir->eI) && ir->etc == etcNO))
 +    if (EI_DYNAMICS(ir->eI) && ir->verletbuf_tol > 0
 +        && !(EI_MD(ir->eI) && ir->etc == TemperatureCoupling::No))
      {
          /* Update the Verlet buffer size for the current run setup */
  
          VerletbufListSetup listSetup = verletbufGetSafeListSetup(listType);
  
          const real rlist_new =
 -                calcVerletBufferSize(*mtop, det(box), *ir, ir->nstlist, ir->nstlist - 1, -1, listSetup);
 +                calcVerletBufferSize(mtop, det(box), *ir, ir->nstlist, ir->nstlist - 1, -1, listSetup);
  
          if (rlist_new != ir->rlist)
          {
              {
                  fprintf(fplog,
                          "\nChanging rlist from %g to %g for non-bonded %dx%d atom kernels\n\n",
 -                        ir->rlist, rlist_new, listSetup.cluster_size_i, listSetup.cluster_size_j);
 +                        ir->rlist,
 +                        rlist_new,
 +                        listSetup.cluster_size_i,
 +                        listSetup.cluster_size_j);
              }
              ir->rlist = rlist_new;
          }
  
      if (nstlist_cmdline > 0 && (!EI_DYNAMICS(ir->eI) || ir->verletbuf_tol <= 0))
      {
 -        gmx_fatal(FARGS, "Can not set nstlist without %s",
 +        gmx_fatal(FARGS,
 +                  "Can not set nstlist without %s",
                    !EI_DYNAMICS(ir->eI) ? "dynamics" : "verlet-buffer-tolerance");
      }
  
      if (EI_DYNAMICS(ir->eI))
      {
          /* Set or try nstlist values */
 -        increaseNstlist(fplog, cr, ir, nstlist_cmdline, mtop, box, makeGpuPairList, cpuinfo);
 +        increaseNstlist(fplog, cr, ir, nstlist_cmdline, &mtop, box, makeGpuPairList, cpuinfo);
      }
  }
  
@@@ -557,13 -490,11 +557,13 @@@ static void override_nsteps_cmdline(con
          {
              sprintf(sbuf_msg,
                      "Overriding nsteps with value passed on the command line: %s steps, %.3g ps",
 -                    gmx_step_str(nsteps_cmdline, sbuf_steps), fabs(nsteps_cmdline * ir->delta_t));
 +                    gmx_step_str(nsteps_cmdline, sbuf_steps),
 +                    fabs(nsteps_cmdline * ir->delta_t));
          }
          else
          {
 -            sprintf(sbuf_msg, "Overriding nsteps with value passed on the command line: %s steps",
 +            sprintf(sbuf_msg,
 +                    "Overriding nsteps with value passed on the command line: %s steps",
                      gmx_step_str(nsteps_cmdline, sbuf_steps));
          }
  
@@@ -659,9 -590,9 +659,9 @@@ static TaskTarget findTaskTarget(const 
  static void finish_run(FILE*                     fplog,
                         const gmx::MDLogger&      mdlog,
                         const t_commrec*          cr,
 -                       const t_inputrec*         inputrec,
 +                       const t_inputrec&         inputrec,
                         t_nrnb                    nrnb[],
 -                       gmx_wallcycle_t           wcycle,
 +                       gmx_wallcycle*            wcycle,
                         gmx_walltime_accounting_t walltime_accounting,
                         nonbonded_verlet_t*       nbv,
                         const gmx_pme_t*          pme,
         Further, we only report performance for dynamical integrators,
         because those are the only ones for which we plan to
         consider doing any optimizations. */
 -    bool printReport = EI_DYNAMICS(inputrec->eI) && SIMMASTER(cr);
 +    bool printReport = EI_DYNAMICS(inputrec.eI) && SIMMASTER(cr);
  
      if (printReport && !walltime_accounting_get_valid_finish(walltime_accounting))
      {
          nrnbTotalStorage = std::make_unique<t_nrnb>();
          nrnb_tot         = nrnbTotalStorage.get();
  #if GMX_MPI
 -        MPI_Allreduce(nrnb->n, nrnb_tot->n, eNRNB, MPI_DOUBLE, MPI_SUM, cr->mpi_comm_mysim);
 +        MPI_Allreduce(nrnb->n.data(), nrnb_tot->n.data(), eNRNB, MPI_DOUBLE, MPI_SUM, cr->mpi_comm_mysim);
  #endif
      }
      else
      {
  #if GMX_MPI
          /* reduce elapsed_time over all MPI ranks in the current simulation */
 -        MPI_Allreduce(&elapsed_time, &elapsed_time_over_all_ranks, 1, MPI_DOUBLE, MPI_SUM,
 -                      cr->mpi_comm_mysim);
 +        MPI_Allreduce(&elapsed_time, &elapsed_time_over_all_ranks, 1, MPI_DOUBLE, MPI_SUM, cr->mpi_comm_mysim);
          elapsed_time_over_all_ranks /= cr->nnodes;
          /* Reduce elapsed_time_over_all_threads over all MPI ranks in the
           * current simulation. */
 -        MPI_Allreduce(&elapsed_time_over_all_threads, &elapsed_time_over_all_threads_over_all_ranks,
 -                      1, MPI_DOUBLE, MPI_SUM, cr->mpi_comm_mysim);
 +        MPI_Allreduce(&elapsed_time_over_all_threads,
 +                      &elapsed_time_over_all_threads_over_all_ranks,
 +                      1,
 +                      MPI_DOUBLE,
 +                      MPI_SUM,
 +                      cr->mpi_comm_mysim);
  #endif
      }
      else
          print_flop(fplog, nrnb_tot, &nbfs, &mflop);
      }
  
 -    if (thisRankHasDuty(cr, DUTY_PP) && DOMAINDECOMP(cr))
 +    if (thisRankHasDuty(cr, DUTY_PP) && haveDDAtomOrdering(*cr))
      {
          print_dd_statistics(cr, inputrec, fplog);
      }
       * to the code that handled the thread region, so that there's a
       * mechanism to keep cycle counting working during the transition
       * to task parallelism. */
 -    int nthreads_pp  = gmx_omp_nthreads_get(emntNonbonded);
 -    int nthreads_pme = gmx_omp_nthreads_get(emntPME);
 -    wallcycle_scale_by_num_threads(wcycle, thisRankHasDuty(cr, DUTY_PME) && !thisRankHasDuty(cr, DUTY_PP),
 -                                   nthreads_pp, nthreads_pme);
 +    int nthreads_pp  = gmx_omp_nthreads_get(ModuleMultiThread::Nonbonded);
 +    int nthreads_pme = gmx_omp_nthreads_get(ModuleMultiThread::Pme);
 +    wallcycle_scale_by_num_threads(
 +            wcycle, thisRankHasDuty(cr, DUTY_PME) && !thisRankHasDuty(cr, DUTY_PP), nthreads_pp, nthreads_pme);
      auto cycle_sum(wallcycle_sum(cr, wcycle));
  
      if (printReport)
      {
 -        auto nbnxn_gpu_timings =
 +        auto* nbnxn_gpu_timings =
                  (nbv != nullptr && nbv->useGpu()) ? Nbnxm::gpu_get_timings(nbv->gpu_nbv) : nullptr;
          gmx_wallclock_gpu_pme_t pme_gpu_timings = {};
  
          {
              pme_gpu_get_timings(pme, &pme_gpu_timings);
          }
 -        wallcycle_print(fplog, mdlog, cr->nnodes, cr->npmenodes, nthreads_pp, nthreads_pme,
 -                        elapsed_time_over_all_ranks, wcycle, cycle_sum, nbnxn_gpu_timings,
 +        wallcycle_print(fplog,
 +                        mdlog,
 +                        cr->nnodes,
 +                        cr->npmenodes,
 +                        nthreads_pp,
 +                        nthreads_pme,
 +                        elapsed_time_over_all_ranks,
 +                        wcycle,
 +                        cycle_sum,
 +                        nbnxn_gpu_timings,
                          &pme_gpu_timings);
  
 -        if (EI_DYNAMICS(inputrec->eI))
 +        if (EI_DYNAMICS(inputrec.eI))
          {
 -            delta_t = inputrec->delta_t;
 +            delta_t = inputrec.delta_t;
          }
  
          if (fplog)
          {
 -            print_perf(fplog, elapsed_time_over_all_threads_over_all_ranks, elapsed_time_over_all_ranks,
 +            print_perf(fplog,
 +                       elapsed_time_over_all_threads_over_all_ranks,
 +                       elapsed_time_over_all_ranks,
                         walltime_accounting_get_nsteps_done_since_reset(walltime_accounting),
 -                       delta_t, nbfs, mflop);
 +                       delta_t,
 +                       nbfs,
 +                       mflop);
          }
          if (bWriteStat)
          {
 -            print_perf(stderr, elapsed_time_over_all_threads_over_all_ranks, elapsed_time_over_all_ranks,
 +            print_perf(stderr,
 +                       elapsed_time_over_all_threads_over_all_ranks,
 +                       elapsed_time_over_all_ranks,
                         walltime_accounting_get_nsteps_done_since_reset(walltime_accounting),
 -                       delta_t, nbfs, mflop);
 +                       delta_t,
 +                       nbfs,
 +                       mflop);
          }
      }
  }
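
The reductions in finish_run average per-rank wall times with MPI_Allreduce. A minimal standalone MPI sketch of that idiom (compile with an MPI wrapper such as mpicxx; the timed work here is a placeholder):

    #include <cstdio>

    #include <mpi.h>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int nranks = 1;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const double t0 = MPI_Wtime();
        // ... per-rank work would run here ...
        double elapsed = MPI_Wtime() - t0;

        double summed = 0.0;
        MPI_Allreduce(&elapsed, &summed, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        std::printf("average elapsed over %d ranks: %g s\n", nranks, summed / nranks);

        MPI_Finalize();
        return 0;
    }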
  
  int Mdrunner::mdrunner()
  {
 -    matrix                    box;
 -    t_forcerec*               fr               = nullptr;
 -    real                      ewaldcoeff_q     = 0;
 -    real                      ewaldcoeff_lj    = 0;
 -    int                       nChargePerturbed = -1, nTypePerturbed = 0;
 -    gmx_wallcycle_t           wcycle;
 -    gmx_walltime_accounting_t walltime_accounting = nullptr;
 -    MembedHolder              membedHolder(filenames.size(), filenames.data());
 +    matrix                      box;
 +    std::unique_ptr<t_forcerec> fr;
 +    real                        ewaldcoeff_q     = 0;
 +    real                        ewaldcoeff_lj    = 0;
 +    int                         nChargePerturbed = -1, nTypePerturbed = 0;
 +    gmx_walltime_accounting_t   walltime_accounting = nullptr;
 +    MembedHolder                membedHolder(filenames.size(), filenames.data());
  
      /* CAUTION: threads may be started later on in this function, so
         cr doesn't reflect the final parallel state right now */
  
      gmx_print_detected_hardware(fplog, isSimulationMasterRank && isMasterSim(ms), mdlog, hwinfo_);
  
 -    std::vector<int> gpuIdsToUse = makeGpuIdsToUse(hwinfo_->deviceInfoList, hw_opt.gpuIdsAvailable);
 -    const int        numDevicesToUse = gmx::ssize(gpuIdsToUse);
 +    std::vector<int> availableDevices =
 +            makeListOfAvailableDevices(hwinfo_->deviceInfoList, hw_opt.devicesSelectedByUser);
 +    const int numAvailableDevices = gmx::ssize(availableDevices);
  
      // Print citation requests after all software/hardware printing
      pleaseCiteGromacs(fplog);
          /* Read (nearly) all data required for the simulation
           * and keep the partly serialized tpr contents to send to other ranks later
           */
 -        applyGlobalSimulationState(*inputHolder_.get(), partialDeserializedTpr.get(),
 -                                   globalState.get(), inputrec.get(), &mtop);
 +        applyGlobalSimulationState(
 +                *inputHolder_.get(), partialDeserializedTpr.get(), globalState.get(), inputrec.get(), &mtop);
      }
  
      /* Check and update the hardware options for internal consistency */
 -    checkAndUpdateHardwareOptions(mdlog, &hw_opt, isSimulationMasterRank, domdecOptions.numPmeRanks,
 -                                  inputrec.get());
 +    checkAndUpdateHardwareOptions(
 +            mdlog, &hw_opt, isSimulationMasterRank, domdecOptions.numPmeRanks, inputrec.get());
  
      if (GMX_THREAD_MPI && isSimulationMasterRank)
      {
              // the number of GPUs to choose the number of ranks.
              auto canUseGpuForNonbonded = buildSupportsNonbondedOnGpu(nullptr);
              useGpuForNonbonded         = decideWhetherToUseGpusForNonbondedWithThreadMpi(
 -                    nonbondedTarget, numDevicesToUse, userGpuTaskAssignment, emulateGpuNonbonded,
 +                    nonbondedTarget,
 +                    numAvailableDevices > 0,
 +                    userGpuTaskAssignment,
 +                    emulateGpuNonbonded,
                      canUseGpuForNonbonded,
                      gpuAccelerationOfNonbondedIsUseful(mdlog, *inputrec, GMX_THREAD_MPI),
                      hw_opt.nthreads_tmpi);
 -            useGpuForPme = decideWhetherToUseGpusForPmeWithThreadMpi(
 -                    useGpuForNonbonded, pmeTarget, pmeFftTarget, numDevicesToUse, userGpuTaskAssignment,
 -                    *hwinfo_, *inputrec, hw_opt.nthreads_tmpi, domdecOptions.numPmeRanks);
 +            useGpuForPme = decideWhetherToUseGpusForPmeWithThreadMpi(useGpuForNonbonded,
 +                                                                     pmeTarget,
++                                                                     pmeFftTarget,
 +                                                                     numAvailableDevices,
 +                                                                     userGpuTaskAssignment,
 +                                                                     *hwinfo_,
 +                                                                     *inputrec,
 +                                                                     hw_opt.nthreads_tmpi,
 +                                                                     domdecOptions.numPmeRanks);
          }
          GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
  
           * TODO Over-writing the user-supplied value here does
           * prevent any possible subsequent checks from working
           * correctly. */
 -        hw_opt.nthreads_tmpi =
 -                get_nthreads_mpi(hwinfo_, &hw_opt, numDevicesToUse, useGpuForNonbonded, useGpuForPme,
 -                                 inputrec.get(), &mtop, mdlog, membedHolder.doMembed());
 +        hw_opt.nthreads_tmpi = get_nthreads_mpi(hwinfo_,
 +                                                &hw_opt,
 +                                                numAvailableDevices,
 +                                                useGpuForNonbonded,
 +                                                useGpuForPme,
 +                                                inputrec.get(),
 +                                                mtop,
 +                                                mdlog,
 +                                                membedHolder.doMembed());
  
          // Now start the threads for thread MPI.
          spawnThreads(hw_opt.nthreads_tmpi);
          // master and spawned threads joins at the end of this block.
      }
  
 -    GMX_RELEASE_ASSERT(ms || simulationCommunicator != MPI_COMM_NULL,
 +    GMX_RELEASE_ASSERT(!GMX_MPI || ms || simulationCommunicator != MPI_COMM_NULL,
                         "Must have valid communicator unless running a multi-simulation");
      CommrecHandle crHandle = init_commrec(simulationCommunicator);
      t_commrec*    cr       = crHandle.get();
              // On non-master ranks, allocate the object that will receive data in the following call.
              inputrec = std::make_unique<t_inputrec>();
          }
 -        init_parallel(cr->mpiDefaultCommunicator, MASTER(cr), inputrec.get(), &mtop,
 +        init_parallel(cr->mpiDefaultCommunicator,
 +                      MASTER(cr),
 +                      inputrec.get(),
 +                      &mtop,
                        partialDeserializedTpr.get());
      }
      GMX_RELEASE_ASSERT(inputrec != nullptr, "All ranks should have a valid inputrec now");
      partialDeserializedTpr.reset(nullptr);
  
 +    GMX_RELEASE_ASSERT(
 +            !inputrec->useConstantAcceleration,
 +            "Linear acceleration has been removed in GROMACS 2022, and was broken for many years "
 +            "before that. Use GROMACS 4.5 or earlier if you need this feature.");
 +
 +    // Now we decide whether to use the domain decomposition machinery.
 +    // Note that this does not necessarily imply actually using multiple domains.
      // Now the number of ranks is known to all ranks, and each knows
      // the inputrec read by the master rank. The ranks can now all run
      // the task-deciding functions and will agree on the result
      // without needing to communicate.
 -    const bool useDomainDecomposition = (PAR(cr) && !(EI_TPI(inputrec->eI) || inputrec->eI == eiNM));
 +    // The LBFGS minimizer, test-particle insertion, normal modes and shell dynamics don't support DD
 +    const bool useDomainDecomposition =
 +            !(inputrec->eI == IntegrationAlgorithm::LBFGS || EI_TPI(inputrec->eI)
 +              || inputrec->eI == IntegrationAlgorithm::NM
 +              || gmx_mtop_particletype_count(mtop)[ParticleType::Shell] > 0);
  
      // Note that these variables describe only their own node.
      //
          // assignment.
          auto canUseGpuForNonbonded = buildSupportsNonbondedOnGpu(nullptr);
          useGpuForNonbonded         = decideWhetherToUseGpusForNonbonded(
 -                nonbondedTarget, userGpuTaskAssignment, emulateGpuNonbonded, canUseGpuForNonbonded,
 -                gpuAccelerationOfNonbondedIsUseful(mdlog, *inputrec, !GMX_THREAD_MPI), gpusWereDetected);
 -        useGpuForPme = decideWhetherToUseGpusForPme(
 -                useGpuForNonbonded, pmeTarget, pmeFftTarget, userGpuTaskAssignment, *hwinfo_,
 -                *inputrec, cr->sizeOfDefaultCommunicator, domdecOptions.numPmeRanks, gpusWereDetected);
 -        useGpuForBonded = decideWhetherToUseGpusForBonded(useGpuForNonbonded, useGpuForPme,
 -                                                          bondedTarget, *inputrec, mtop,
 -                                                          domdecOptions.numPmeRanks, gpusWereDetected);
 +                nonbondedTarget,
 +                userGpuTaskAssignment,
 +                emulateGpuNonbonded,
 +                canUseGpuForNonbonded,
 +                gpuAccelerationOfNonbondedIsUseful(mdlog, *inputrec, !GMX_THREAD_MPI),
 +                gpusWereDetected);
 +        useGpuForPme    = decideWhetherToUseGpusForPme(useGpuForNonbonded,
 +                                                    pmeTarget,
++                                                    pmeFftTarget,
 +                                                    userGpuTaskAssignment,
 +                                                    *hwinfo_,
 +                                                    *inputrec,
 +                                                    cr->sizeOfDefaultCommunicator,
 +                                                    domdecOptions.numPmeRanks,
 +                                                    gpusWereDetected);
 +        useGpuForBonded = decideWhetherToUseGpusForBonded(
 +                useGpuForNonbonded, useGpuForPme, bondedTarget, *inputrec, mtop, domdecOptions.numPmeRanks, gpusWereDetected);
      }
      GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
  
      const DevelopmentFeatureFlags devFlags =
              manageDevelopmentFeatures(mdlog, useGpuForNonbonded, pmeRunMode);
  
 -    const bool useModularSimulator =
 -            checkUseModularSimulator(false, inputrec.get(), doRerun, mtop, ms, replExParams,
 -                                     nullptr, doEssentialDynamics, membedHolder.doMembed());
 +    const bool useModularSimulator = checkUseModularSimulator(false,
 +                                                              inputrec.get(),
 +                                                              doRerun,
 +                                                              mtop,
 +                                                              ms,
 +                                                              replExParams,
 +                                                              nullptr,
 +                                                              doEssentialDynamics,
 +                                                              membedHolder.doMembed());
 +
 +    ObservablesReducerBuilder observablesReducerBuilder;
  
      // Build restraints.
      // TODO: hide restraint implementation details from Mdrunner.
  
      // TODO: Error handling
      mdModules_->assignOptionsToModules(*inputrec->params, nullptr);
 -    // now that the MdModules know their options, they know which callbacks to sign up to
 +    // now that the MDModules know their options, they know which callbacks to sign up to
      mdModules_->subscribeToSimulationSetupNotifications();
 -    const auto& mdModulesNotifier = mdModules_->notifier().simulationSetupNotifications_;
 +    const auto& setupNotifier = mdModules_->notifiers().simulationSetupNotifier_;
 +
 +    // Notify MdModules of existing logger
 +    setupNotifier.notify(mdlog);
  
 +    // Notify MdModules of internal parameters, saved into KVT
      if (inputrec->internalParameters != nullptr)
      {
 -        mdModulesNotifier.notify(*inputrec->internalParameters);
 +        setupNotifier.notify(*inputrec->internalParameters);
 +    }
 +
 +    // Let MdModules know the .tpr filename
 +    {
 +        gmx::MdRunInputFilename mdRunInputFilename = { ftp2fn(efTPR, filenames.size(), filenames.data()) };
 +        setupNotifier.notify(mdRunInputFilename);
      }
  
      if (fplog != nullptr)
      if (SIMMASTER(cr))
      {
          /* In rerun, set velocities to zero if present */
 -        if (doRerun && ((globalState->flags & (1 << estV)) != 0))
 +        if (doRerun && ((globalState->flags & enumValueToBitMask(StateEntry::V)) != 0))
          {
              // rerun does not use velocities
              GMX_LOG(mdlog.info)
              {
                  clear_rvec(globalState->v[i]);
              }
 -            globalState->flags &= ~(1 << estV);
 +            globalState->flags &= ~enumValueToBitMask(StateEntry::V);
          }
  
          /* now make sure the state is initialized and propagated */
      /* NM and TPI parallelize over force/energy calculations, not atoms,
       * so we need to initialize and broadcast the global state.
       */
 -    if (inputrec->eI == eiNM || inputrec->eI == eiTPI)
 +    if (inputrec->eI == IntegrationAlgorithm::NM || inputrec->eI == IntegrationAlgorithm::TPI)
      {
          if (!MASTER(cr))
          {
              globalState = std::make_unique<t_state>();
          }
 -        broadcastStateWithoutDynamics(cr->mpiDefaultCommunicator, DOMAINDECOMP(cr), PAR(cr),
 -                                      globalState.get());
 +        broadcastStateWithoutDynamics(
 +                cr->mpiDefaultCommunicator, haveDDAtomOrdering(*cr), PAR(cr), globalState.get());
      }
  
      /* A parallel command line option consistency check that we can
  #endif
      }
  
 -    if (doRerun && (EI_ENERGY_MINIMIZATION(inputrec->eI) || eiNM == inputrec->eI))
 +    if (doRerun && (EI_ENERGY_MINIMIZATION(inputrec->eI) || IntegrationAlgorithm::NM == inputrec->eI))
      {
          gmx_fatal(FARGS,
                    "The .mdp file specified an energy mininization or normal mode algorithm, and "
                    "these are not compatible with mdrun -rerun");
      }
  
 -    if (!(EEL_PME(inputrec->coulombtype) || EVDW_PME(inputrec->vdwtype)))
 -    {
 -        if (domdecOptions.numPmeRanks > 0)
 -        {
 -            gmx_fatal_collective(FARGS, cr->mpiDefaultCommunicator, MASTER(cr),
 -                                 "PME-only ranks are requested, but the system does not use PME "
 -                                 "for electrostatics or LJ");
 -        }
 -
 -        domdecOptions.numPmeRanks = 0;
 -    }
 -
 -    if (useGpuForNonbonded && domdecOptions.numPmeRanks < 0)
 -    {
 -        /* With NB GPUs we don't automatically use PME-only CPU ranks. PME ranks can
 -         * improve performance with many threads per GPU, since our OpenMP
 -         * scaling is bad, but it's difficult to automate the setup.
 -         */
 -        domdecOptions.numPmeRanks = 0;
 -    }
 -    if (useGpuForPme)
 -    {
 -        if (domdecOptions.numPmeRanks < 0)
 -        {
 -            domdecOptions.numPmeRanks = 0;
 -            // TODO possibly print a note that one can opt-in for a separate PME GPU rank?
 -        }
 -        else
 -        {
 -            GMX_RELEASE_ASSERT(domdecOptions.numPmeRanks <= 1,
 -                               "PME GPU decomposition is not supported");
 -        }
 -    }
 -
      /* NMR restraints must be initialized before load_checkpoint,
       * since with time averaging the history is added to t_state.
       * For proper consistency check we therefore need to extend
      /* This needs to be called before read_checkpoint to extend the state */
      t_disresdata* disresdata;
      snew(disresdata, 1);
 -    init_disres(fplog, &mtop, inputrec.get(), DisResRunMode::MDRun,
 +    init_disres(fplog,
 +                mtop,
 +                inputrec.get(),
 +                DisResRunMode::MDRun,
                  MASTER(cr) ? DDRole::Master : DDRole::Agent,
 -                PAR(cr) ? NumRanks::Multiple : NumRanks::Single, cr->mpi_comm_mysim, ms, disresdata,
 -                globalState.get(), replExParams.exchangeInterval > 0);
 +                PAR(cr) ? NumRanks::Multiple : NumRanks::Single,
 +                cr->mpi_comm_mysim,
 +                ms,
 +                disresdata,
 +                globalState.get(),
 +                replExParams.exchangeInterval > 0);
  
 -    t_oriresdata* oriresdata;
 -    snew(oriresdata, 1);
 -    init_orires(fplog, &mtop, inputrec.get(), cr, ms, globalState.get(), oriresdata);
 +    if (gmx_mtop_ftype_count(mtop, F_ORIRES) > 0 && isSimulationMasterRank)
 +    {
 +        extendStateWithOriresHistory(mtop, *inputrec, globalState.get());
 +    }
  
 -    auto deform = prepareBoxDeformation(
 -            globalState != nullptr ? globalState->box : box, MASTER(cr) ? DDRole::Master : DDRole::Agent,
 -            PAR(cr) ? NumRanks::Multiple : NumRanks::Single, cr->mpi_comm_mygroup, *inputrec);
 +    auto deform = prepareBoxDeformation(globalState != nullptr ? globalState->box : box,
 +                                        MASTER(cr) ? DDRole::Master : DDRole::Agent,
 +                                        PAR(cr) ? NumRanks::Multiple : NumRanks::Single,
 +                                        cr->mpi_comm_mygroup,
 +                                        *inputrec);
  
  #if GMX_FAHCORE
      /* We have to remember the generation's first step before reading checkpoint.
  
          // Finish applying initial simulation state information from external sources on all ranks.
          // Reconcile checkpoint file data with Mdrunner state established up to this point.
 -        applyLocalState(*inputHolder_.get(), logFileHandle, cr, domdecOptions.numCells,
 -                        inputrec.get(), globalState.get(), &observablesHistory,
 -                        mdrunOptions.reproducible, mdModules_->notifier(),
 -                        modularSimulatorCheckpointData.get(), useModularSimulator);
 +        applyLocalState(*inputHolder_.get(),
 +                        logFileHandle,
 +                        cr,
 +                        domdecOptions.numCells,
 +                        inputrec.get(),
 +                        globalState.get(),
 +                        &observablesHistory,
 +                        mdrunOptions.reproducible,
 +                        mdModules_->notifiers(),
 +                        modularSimulatorCheckpointData.get(),
 +                        useModularSimulator);
          // TODO: (#3652) Synchronize filesystem state, SimulationInput contents, and program
          //  invariants
          //  on all code paths.
          gmx_bcast(sizeof(box), box, cr->mpiDefaultCommunicator);
      }
  
 -    if (inputrec->cutoff_scheme != ecutsVERLET)
 +    if (inputrec->cutoff_scheme != CutoffScheme::Verlet)
      {
          gmx_fatal(FARGS,
                    "This group-scheme .tpr file can no longer be run by mdrun. Please update to the "
       * increase rlist) tries to check if the newly chosen value fits with the DD scheme. As this is
       * run before any DD scheme is set up, this check is never executed. See #3334 for more details.
       */
 -    prepare_verlet_scheme(fplog, cr, inputrec.get(), nstlist_cmdline, &mtop, box,
 +    prepare_verlet_scheme(fplog,
 +                          cr,
 +                          inputrec.get(),
 +                          nstlist_cmdline,
 +                          mtop,
 +                          box,
                            useGpuForNonbonded || (emulateGpuNonbonded == EmulateGpuNonbonded::Yes),
                            *hwinfo_->cpuInfo);
  
 +    // We need to decide on update groups early, as this affects
 +    // inter-domain communication distances.
 +    auto       updateGroupingsPerMoleculeType = makeUpdateGroupingsPerMoleculeType(mtop);
 +    const real maxUpdateGroupRadius           = computeMaxUpdateGroupRadius(
 +            mtop, updateGroupingsPerMoleculeType, maxReferenceTemperature(*inputrec));
 +    const real   cutoffMargin = std::sqrt(max_cutoff2(inputrec->pbcType, box)) - inputrec->rlist;
 +    UpdateGroups updateGroups = makeUpdateGroups(mdlog,
 +                                                 std::move(updateGroupingsPerMoleculeType),
 +                                                 maxUpdateGroupRadius,
 +                                                 useDomainDecomposition,
 +                                                 systemHasConstraintsOrVsites(mtop),
 +                                                 cutoffMargin);
 +
 +    try
 +    {
 +        const bool haveFrozenAtoms = inputrecFrozenAtoms(inputrec.get());
 +
 +        useGpuForUpdate = decideWhetherToUseGpuForUpdate(useDomainDecomposition,
 +                                                         updateGroups.useUpdateGroups(),
 +                                                         pmeRunMode,
 +                                                         domdecOptions.numPmeRanks > 0,
 +                                                         useGpuForNonbonded,
 +                                                         updateTarget,
 +                                                         gpusWereDetected,
 +                                                         *inputrec,
 +                                                         mtop,
 +                                                         doEssentialDynamics,
 +                                                         gmx_mtop_ftype_count(mtop, F_ORIRES) > 0,
 +                                                         haveFrozenAtoms,
 +                                                         doRerun,
 +                                                         devFlags,
 +                                                         mdlog);
 +    }
 +    GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
 +
 +    bool useGpuDirectHalo = false;
 +
 +    if (useGpuForNonbonded)
 +    {
 +        // cr->npmenodes is not yet initialized.
 +        // domdecOptions.numPmeRanks == -1 results in 0 separate PME ranks when useGpuForNonbonded is true.
 +        // TODO: remove this assumption later once auto mode has support for separate PME ranks
 +        const int numPmeRanks = domdecOptions.numPmeRanks > 0 ? domdecOptions.numPmeRanks : 0;
 +        bool      havePPDomainDecomposition = (cr->sizeOfDefaultCommunicator - numPmeRanks) > 1;
 +        useGpuDirectHalo                    = decideWhetherToUseGpuForHalo(devFlags,
 +                                                        havePPDomainDecomposition,
 +                                                        useGpuForNonbonded,
 +                                                        useModularSimulator,
 +                                                        doRerun,
 +                                                        EI_ENERGY_MINIMIZATION(inputrec->eI));
 +    }
 +
      // This builder is necessary while we have multi-part construction
      // of DD. Before DD is constructed, we use the existence of
      // the builder object to indicate that further construction of DD
      std::unique_ptr<DomainDecompositionBuilder> ddBuilder;
      if (useDomainDecomposition)
      {
 -        ddBuilder = std::make_unique<DomainDecompositionBuilder>(
 -                mdlog, cr, domdecOptions, mdrunOptions, mtop, *inputrec, box,
 -                positionsFromStatePointer(globalState.get()));
 +        // P2P GPU communication + GPU update leads to a case in which we enqueue
 +        // async work for multiple timesteps. DLB needs to be disabled in that case.
 +        const bool directGpuCommUsedWithGpuUpdate = GMX_THREAD_MPI && useGpuDirectHalo && useGpuForUpdate;
 +        ddBuilder                                 = std::make_unique<DomainDecompositionBuilder>(
 +                mdlog,
 +                cr,
 +                domdecOptions,
 +                mdrunOptions,
 +                mtop,
 +                *inputrec,
 +                mdModules_->notifiers(),
 +                box,
 +                updateGroups.updateGroupingPerMoleculeType(),
 +                updateGroups.useUpdateGroups(),
 +                updateGroups.maxUpdateGroupRadius(),
 +                positionsFromStatePointer(globalState.get()),
 +                useGpuForNonbonded,
 +                useGpuForPme,
 +                directGpuCommUsedWithGpuUpdate);
      }
      else
      {
  
      // Produce the task assignment for this rank - done after DD is constructed
      GpuTaskAssignments gpuTaskAssignments = GpuTaskAssignmentsBuilder::build(
 -            gpuIdsToUse, userGpuTaskAssignment, *hwinfo_, simulationCommunicator, physicalNodeComm,
 -            nonbondedTarget, pmeTarget, bondedTarget, updateTarget, useGpuForNonbonded,
 -            useGpuForPme, thisRankHasDuty(cr, DUTY_PP),
 +            availableDevices,
 +            userGpuTaskAssignment,
 +            *hwinfo_,
 +            simulationCommunicator,
 +            physicalNodeComm,
 +            nonbondedTarget,
 +            pmeTarget,
 +            bondedTarget,
 +            updateTarget,
 +            useGpuForNonbonded,
 +            useGpuForPme,
 +            thisRankHasDuty(cr, DUTY_PP),
              // TODO cr->duty & DUTY_PME should imply that a PME
              // algorithm is active, but currently does not.
              EEL_PME(inputrec->coulombtype) && thisRankHasDuty(cr, DUTY_PME));
      // requires it (e.g. pull, CompEl, density fitting), so that we
      // don't update the local atom sets unilaterally every step.
      LocalAtomSetManager atomSets;
 +
 +    // Local state and topology are declared (and perhaps constructed)
 +    // now, because DD needs them for the LocalTopologyChecker, but
 +    // they do not contain valid data until after the first DD
 +    // partition.
 +    std::unique_ptr<t_state> localStateInstance;
 +    t_state*                 localState;
 +    gmx_localtop_t           localTopology(mtop.ffparams);
 +
      if (ddBuilder)
      {
 +        localStateInstance = std::make_unique<t_state>();
 +        localState         = localStateInstance.get();
          // TODO Pass the GPU streams to ddBuilder to use in buffer
          // transfers (e.g. halo exchange)
 -        cr->dd = ddBuilder->build(&atomSets);
 +        cr->dd = ddBuilder->build(&atomSets, localTopology, *localState, &observablesReducerBuilder);
          // The builder's job is done, so destruct it
          ddBuilder.reset(nullptr);
          // Note that local state still does not exist yet.
      }
 +    else
 +    {
 +        // Without DD, the local state is merely an alias to the global state,
 +        // so we don't need to allocate anything.
 +        localState = globalState.get();
 +    }
 +
      // Ensure that all atoms within the same update group are in the
      // same periodic image. Otherwise, a simulation that did not use
      // update groups (e.g. a single-rank simulation) cannot always be
          }
      }
  
 -    // The GPU update is decided here because we need to know whether the constraints or
 -    // SETTLEs can span accross the domain borders (i.e. whether or not update groups are
 -    // defined). This is only known after DD is initialized, hence decision on using GPU
 -    // update is done so late.
 -    try
 -    {
 -        const bool useUpdateGroups = cr->dd ? ddUsesUpdateGroups(*cr->dd) : false;
 -        const bool haveFrozenAtoms = inputrecFrozenAtoms(inputrec.get());
 -
 -        useGpuForUpdate = decideWhetherToUseGpuForUpdate(
 -                useDomainDecomposition, useUpdateGroups, pmeRunMode, domdecOptions.numPmeRanks > 0,
 -                useGpuForNonbonded, updateTarget, gpusWereDetected, *inputrec, mtop,
 -                doEssentialDynamics, gmx_mtop_ftype_count(mtop, F_ORIRES) > 0,
 -                replExParams.exchangeInterval > 0, haveFrozenAtoms, doRerun, devFlags, mdlog);
 -    }
 -    GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
 -
      const bool printHostName = (cr->nnodes > 1);
      gpuTaskAssignments.reportGpuUsage(mdlog, printHostName, useGpuForBonded, pmeRunMode, useGpuForUpdate);
  
  
      MdrunScheduleWorkload runScheduleWork;
  
 -    bool useGpuDirectHalo = decideWhetherToUseGpuForHalo(
 -            devFlags, havePPDomainDecomposition(cr), useGpuForNonbonded, useModularSimulator,
 -            doRerun, EI_ENERGY_MINIMIZATION(inputrec->eI));
 -
      // Also populates the simulation constant workload description.
 -    runScheduleWork.simulationWork = createSimulationWorkload(
 -            *inputrec, disableNonbondedCalculation, devFlags, useGpuForNonbonded, pmeRunMode,
 -            useGpuForBonded, useGpuForUpdate, useGpuDirectHalo);
 +    // Note: currently the default duty is DUTY_PP | DUTY_PME for all simulations, including those without PME,
 +    // so this boolean is sufficient on all ranks to determine whether separate PME ranks are used,
 +    // but this will no longer be the case if cr->duty is changed for !EEL_PME(fr->ic->eeltype).
 +    const bool haveSeparatePmeRank = (!thisRankHasDuty(cr, DUTY_PP) || !thisRankHasDuty(cr, DUTY_PME));
 +    runScheduleWork.simulationWork = createSimulationWorkload(*inputrec,
 +                                                              disableNonbondedCalculation,
 +                                                              devFlags,
 +                                                              havePPDomainDecomposition(cr),
 +                                                              haveSeparatePmeRank,
 +                                                              useGpuForNonbonded,
 +                                                              pmeRunMode,
 +                                                              useGpuForBonded,
 +                                                              useGpuForUpdate,
 +                                                              useGpuDirectHalo);
  
      std::unique_ptr<DeviceStreamManager> deviceStreamManager = nullptr;
  
      if (deviceInfo != nullptr)
      {
 -        if (DOMAINDECOMP(cr) && thisRankHasDuty(cr, DUTY_PP))
 +        if (runScheduleWork.simulationWork.havePpDomainDecomposition && thisRankHasDuty(cr, DUTY_PP))
          {
              dd_setup_dlb_resource_sharing(cr, deviceId);
          }
      // where appropriate.
      if (!userGpuTaskAssignment.empty())
      {
 -        gpuTaskAssignments.logPerformanceHints(mdlog, numDevicesToUse);
 +        gpuTaskAssignments.logPerformanceHints(mdlog, numAvailableDevices);
      }
  
      if (PAR(cr))
                  .appendTextFormatted(
                          "This is simulation %d out of %d running as a composite GROMACS\n"
                          "multi-simulation job. Setup for this simulation:\n",
 -                        ms->simulationIndex_, ms->numSimulations_);
 +                        ms->simulationIndex_,
 +                        ms->numSimulations_);
      }
      GMX_LOG(mdlog.warning)
 -            .appendTextFormatted("Using %d MPI %s\n", cr->nnodes,
 +            .appendTextFormatted("Using %d MPI %s\n",
 +                                 cr->nnodes,
  #    if GMX_THREAD_MPI
                                   cr->nnodes == 1 ? "thread" : "threads"
  #    else
      // the OpenMP support.
      gmx_check_thread_affinity_set(mdlog, &hw_opt, hwinfo_->nthreads_hw_avail, FALSE);
      /* Check and update the number of OpenMP threads requested */
 -    checkAndUpdateRequestedNumOpenmpThreads(&hw_opt, *hwinfo_, cr, ms, physicalNodeComm.size_,
 -                                            pmeRunMode, mtop, *inputrec);
 -
 -    gmx_omp_nthreads_init(mdlog, cr, hwinfo_->nthreads_hw_avail, physicalNodeComm.size_,
 -                          hw_opt.nthreads_omp, hw_opt.nthreads_omp_pme, !thisRankHasDuty(cr, DUTY_PP));
 -
 -    // Enable FP exception detection, but not in
 -    // Release mode and not for compilers with known buggy FP
 -    // exception support (clang with any optimization) or suspected
 -    // buggy FP exception support (gcc 7.* with optimization).
 -#if !defined NDEBUG                                                                         \
 -        && !((defined __clang__ || (defined(__GNUC__) && !defined(__ICC) && __GNUC__ == 7)) \
 -             && defined __OPTIMIZE__)
 -    const bool bEnableFPE = true;
 -#else
 -    const bool bEnableFPE = false;
 -#endif
 +    checkAndUpdateRequestedNumOpenmpThreads(
 +            &hw_opt, *hwinfo_, cr, ms, physicalNodeComm.size_, pmeRunMode, mtop, *inputrec);
 +
 +    gmx_omp_nthreads_init(mdlog,
 +                          cr,
 +                          hwinfo_->nthreads_hw_avail,
 +                          physicalNodeComm.size_,
 +                          hw_opt.nthreads_omp,
 +                          hw_opt.nthreads_omp_pme,
 +                          !thisRankHasDuty(cr, DUTY_PP));
 +
 +    const bool bEnableFPE = gmxShouldEnableFPExceptions();
      // FIXME - reconcile with gmx_feenableexcept() call from CommandLineModuleManager::run()
      if (bEnableFPE)
      {
      }
  
      /* Now that we know the setup is consistent, check for efficiency */
 -    check_resource_division_efficiency(hwinfo_, gpuTaskAssignments.thisRankHasAnyGpuTask(),
 -                                       mdrunOptions.ntompOptionIsSet, cr, mdlog);
 +    check_resource_division_efficiency(
 +            hwinfo_, gpuTaskAssignments.thisRankHasAnyGpuTask(), mdrunOptions.ntompOptionIsSet, cr, mdlog);
  
      /* getting number of PP/PME threads on this MPI / tMPI rank.
         PME: env variable should be read only on one node to make sure it is
         identical everywhere;
       */
 -    const int numThreadsOnThisRank = thisRankHasDuty(cr, DUTY_PP) ? gmx_omp_nthreads_get(emntNonbonded)
 -                                                                  : gmx_omp_nthreads_get(emntPME);
 -    checkHardwareOversubscription(numThreadsOnThisRank, cr->nodeid, *hwinfo_->hardwareTopology,
 -                                  physicalNodeComm, mdlog);
 +    const int numThreadsOnThisRank = thisRankHasDuty(cr, DUTY_PP)
 +                                             ? gmx_omp_nthreads_get(ModuleMultiThread::Nonbonded)
 +                                             : gmx_omp_nthreads_get(ModuleMultiThread::Pme);
 +    checkHardwareOversubscription(
 +            numThreadsOnThisRank, cr->nodeid, *hwinfo_->hardwareTopology, physicalNodeComm, mdlog);
  
      // Enable Peer access between GPUs where available
      // Only for DD, only master PP rank needs to perform setup, and only if thread MPI plus
      // any of the GPU communication features are active.
 -    if (DOMAINDECOMP(cr) && MASTER(cr) && thisRankHasDuty(cr, DUTY_PP) && GMX_THREAD_MPI
 +    if (haveDDAtomOrdering(*cr) && MASTER(cr) && thisRankHasDuty(cr, DUTY_PP) && GMX_THREAD_MPI
          && (runScheduleWork.simulationWork.useGpuHaloExchange
              || runScheduleWork.simulationWork.useGpuPmePpCommunication))
      {
 -        setupGpuDevicePeerAccess(gpuIdsToUse, mdlog);
 +        setupGpuDevicePeerAccess(gpuTaskAssignments.deviceIdsAssigned(), mdlog);
      }
  
      if (hw_opt.threadAffinity != ThreadAffinity::Off)
          gmx_check_thread_affinity_set(mdlog, &hw_opt, hwinfo_->nthreads_hw_avail, TRUE);
  
          int numThreadsOnThisNode, intraNodeThreadOffset;
 -        analyzeThreadsOnThisNode(physicalNodeComm, numThreadsOnThisRank, &numThreadsOnThisNode,
 -                                 &intraNodeThreadOffset);
 +        analyzeThreadsOnThisNode(
 +                physicalNodeComm, numThreadsOnThisRank, &numThreadsOnThisNode, &intraNodeThreadOffset);
  
          /* Set the CPU affinity */
 -        gmx_set_thread_affinity(mdlog, cr, &hw_opt, *hwinfo_->hardwareTopology, numThreadsOnThisRank,
 -                                numThreadsOnThisNode, intraNodeThreadOffset, nullptr);
 +        gmx_set_thread_affinity(mdlog,
 +                                cr,
 +                                &hw_opt,
 +                                *hwinfo_->hardwareTopology,
 +                                numThreadsOnThisRank,
 +                                numThreadsOnThisNode,
 +                                intraNodeThreadOffset,
 +                                nullptr);
      }
  
      if (mdrunOptions.timingOptions.resetStep > -1)
                          "The -resetstep functionality is deprecated, and may be removed in a "
                          "future version.");
      }
 -    wcycle = wallcycle_init(fplog, mdrunOptions.timingOptions.resetStep, cr);
 +    std::unique_ptr<gmx_wallcycle> wcycle =
 +            wallcycle_init(fplog, mdrunOptions.timingOptions.resetStep, cr);
  
      if (PAR(cr))
      {
          /* Master synchronizes its value of reset_counters with all nodes
           * including PME only nodes */
 -        int64_t reset_counters = wcycle_get_reset_counters(wcycle);
 +        int64_t reset_counters = wcycle_get_reset_counters(wcycle.get());
          gmx_bcast(sizeof(reset_counters), &reset_counters, cr->mpi_comm_mysim);
 -        wcycle_set_reset_counters(wcycle, reset_counters);
 +        wcycle_set_reset_counters(wcycle.get(), reset_counters);
      }
  
      // Membrane embedding must be initialized before we call init_forcerec()
 -    membedHolder.initializeMembed(fplog, filenames.size(), filenames.data(), &mtop, inputrec.get(),
 -                                  globalState.get(), cr, &mdrunOptions.checkpointOptions.period);
 +    membedHolder.initializeMembed(fplog,
 +                                  filenames.size(),
 +                                  filenames.data(),
 +                                  &mtop,
 +                                  inputrec.get(),
 +                                  globalState.get(),
 +                                  cr,
 +                                  &mdrunOptions.checkpointOptions.period);
  
      const bool               thisRankHasPmeGpuTask = gpuTaskAssignments.thisRankHasPmeGpuTask();
      std::unique_ptr<MDAtoms> mdAtoms;
      std::unique_ptr<VirtualSitesHandler> vsite;
 -    std::unique_ptr<GpuBonded>           gpuBonded;
  
      t_nrnb nrnb;
      if (thisRankHasDuty(cr, DUTY_PP))
      {
 -        mdModulesNotifier.notify(*cr);
 -        mdModulesNotifier.notify(&atomSets);
 -        mdModulesNotifier.notify(inputrec->pbcType);
 -        mdModulesNotifier.notify(SimulationTimeStep{ inputrec->delta_t });
 +        setupNotifier.notify(*cr);
 +        setupNotifier.notify(&atomSets);
 +        setupNotifier.notify(mtop);
 +        setupNotifier.notify(inputrec->pbcType);
 +        setupNotifier.notify(SimulationTimeStep{ inputrec->delta_t });
          /* Initiate forcerecord */
 -        fr                 = new t_forcerec;
 +        fr                 = std::make_unique<t_forcerec>();
          fr->forceProviders = mdModules_->initForceProviders();
 -        init_forcerec(fplog, mdlog, fr, inputrec.get(), &mtop, cr, box,
 +        init_forcerec(fplog,
 +                      mdlog,
 +                      runScheduleWork.simulationWork,
 +                      fr.get(),
 +                      *inputrec,
 +                      mtop,
 +                      cr,
 +                      box,
                        opt2fn("-table", filenames.size(), filenames.data()),
                        opt2fn("-tablep", filenames.size(), filenames.data()),
 -                      opt2fns("-tableb", filenames.size(), filenames.data()), pforce);
 +                      opt2fns("-tableb", filenames.size(), filenames.data()),
 +                      pforce);
          // Dirty hack: to fix this properly, disres and orires should be made MDModules
          fr->fcdata->disres = disresdata;
 -        fr->fcdata->orires = oriresdata;
 +        if (gmx_mtop_ftype_count(mtop, F_ORIRES) > 0)
 +        {
 +            fr->fcdata->orires = std::make_unique<t_oriresdata>(
 +                    fplog, mtop, *inputrec, ms, globalState.get(), &atomSets);
 +        }
  
          // Save a handle to device stream manager to use elsewhere in the code
          // TODO: Forcerec is not a correct place to store it.
                      "GPU PP-PME stream should be valid in order to use GPU PME-PP direct "
                      "communications.");
              fr->pmePpCommGpu = std::make_unique<gmx::PmePpCommGpu>(
 -                    cr->mpi_comm_mysim, cr->dd->pme_nodeid, deviceStreamManager->context(),
 +                    cr->mpi_comm_mysim,
 +                    cr->dd->pme_nodeid,
 +                    &cr->dd->pmeForceReceiveBuffer,
 +                    deviceStreamManager->context(),
                      deviceStreamManager->stream(DeviceStreamType::PmePpTransfer));
          }
  
 -        fr->nbv = Nbnxm::init_nb_verlet(mdlog, inputrec.get(), fr, cr, *hwinfo_,
 +        fr->nbv = Nbnxm::init_nb_verlet(mdlog,
 +                                        *inputrec,
 +                                        *fr,
 +                                        cr,
 +                                        *hwinfo_,
                                          runScheduleWork.simulationWork.useGpuNonbonded,
 -                                        deviceStreamManager.get(), &mtop, box, wcycle);
 +                                        deviceStreamManager.get(),
 +                                        mtop,
 +                                        box,
 +                                        wcycle.get());
          // TODO: Move the logic below to a GPU bonded builder
          if (runScheduleWork.simulationWork.useGpuBonded)
          {
              GMX_RELEASE_ASSERT(deviceStreamManager != nullptr,
                                 "GPU device stream manager should be valid in order to use GPU "
                                 "version of bonded forces.");
 -            gpuBonded = std::make_unique<GpuBonded>(
 -                    mtop.ffparams, fr->ic->epsfac * fr->fudgeQQ, deviceStreamManager->context(),
 -                    deviceStreamManager->bondedStream(havePPDomainDecomposition(cr)), wcycle);
 -            fr->gpuBonded = gpuBonded.get();
 +            fr->listedForcesGpu = std::make_unique<ListedForcesGpu>(
 +                    mtop.ffparams,
 +                    fr->ic->epsfac * fr->fudgeQQ,
 +                    deviceStreamManager->context(),
 +                    deviceStreamManager->bondedStream(havePPDomainDecomposition(cr)),
 +                    wcycle.get());
          }
 +        fr->longRangeNonbondeds = std::make_unique<CpuPpLongRangeNonbondeds>(fr->n_tpi,
 +                                                                             fr->ic->ewaldcoeff_q,
 +                                                                             fr->ic->epsilon_r,
 +                                                                             fr->qsum,
 +                                                                             fr->ic->eeltype,
 +                                                                             fr->ic->vdwtype,
 +                                                                             *inputrec,
 +                                                                             &nrnb,
 +                                                                             wcycle.get(),
 +                                                                             fplog);
  
          /* Initialize the mdAtoms structure.
           * mdAtoms is not filled with atom data,
          }
  
          /* Initialize the virtual site communication */
 -        vsite = makeVirtualSitesHandler(mtop, cr, fr->pbcType);
 +        vsite = makeVirtualSitesHandler(
 +                mtop, cr, fr->pbcType, updateGroups.updateGroupingPerMoleculeType());
  
          calc_shifts(box, fr->shift_vec);
  
                  constructVirtualSitesGlobal(mtop, globalState->x);
              }
          }
 +        // Make the DD reverse topology, now that any vsites that are present are available
 +        if (haveDDAtomOrdering(*cr))
 +        {
 +            dd_make_reverse_top(fplog, cr->dd, mtop, vsite.get(), *inputrec, domdecOptions.ddBondedChecking);
 +        }
  
          if (EEL_PME(fr->ic->eeltype) || EVDW_PME(fr->ic->vdwtype))
          {
                  const DeviceContext* deviceContext = runScheduleWork.simulationWork.useGpuPme
                                                               ? &deviceStreamManager->context()
                                                               : nullptr;
 -                const DeviceStream* pmeStream =
 +                const DeviceStream*  pmeStream =
                          runScheduleWork.simulationWork.useGpuPme
 -                                ? &deviceStreamManager->stream(DeviceStreamType::Pme)
 -                                : nullptr;
 -
 -                pmedata = gmx_pme_init(cr, getNumPmeDomains(cr->dd), inputrec.get(),
 -                                       nChargePerturbed != 0, nTypePerturbed != 0,
 -                                       mdrunOptions.reproducible, ewaldcoeff_q, ewaldcoeff_lj,
 -                                       gmx_omp_nthreads_get(emntPME), pmeRunMode, nullptr,
 -                                       deviceContext, pmeStream, pmeGpuProgram.get(), mdlog);
 +                                 ? &deviceStreamManager->stream(DeviceStreamType::Pme)
 +                                 : nullptr;
 +
 +                pmedata = gmx_pme_init(cr,
 +                                       getNumPmeDomains(cr->dd),
 +                                       inputrec.get(),
 +                                       nChargePerturbed != 0,
 +                                       nTypePerturbed != 0,
 +                                       mdrunOptions.reproducible,
 +                                       ewaldcoeff_q,
 +                                       ewaldcoeff_lj,
 +                                       gmx_omp_nthreads_get(ModuleMultiThread::Pme),
 +                                       pmeRunMode,
 +                                       nullptr,
 +                                       deviceContext,
 +                                       pmeStream,
 +                                       pmeGpuProgram.get(),
 +                                       mdlog);
              }
              GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR
          }
      if (thisRankHasDuty(cr, DUTY_PP))
      {
          /* Assumes uniform use of the number of OpenMP threads */
 -        walltime_accounting = walltime_accounting_init(gmx_omp_nthreads_get(emntDefault));
 +        walltime_accounting = walltime_accounting_init(gmx_omp_nthreads_get(ModuleMultiThread::Default));
  
          if (inputrec->bPull)
          {
              /* Initialize pull code */
 -            pull_work = init_pull(fplog, inputrec->pull.get(), inputrec.get(), &mtop, cr, &atomSets,
 +            pull_work = init_pull(fplog,
 +                                  inputrec->pull.get(),
 +                                  inputrec.get(),
 +                                  mtop,
 +                                  cr,
 +                                  &atomSets,
                                    inputrec->fepvals->init_lambda);
              if (inputrec->pull->bXOutAverage || inputrec->pull->bFOutAverage)
              {
          if (inputrec->bRot)
          {
              /* Initialize enforced rotation code */
 -            enforcedRotation = init_rot(fplog, inputrec.get(), filenames.size(), filenames.data(),
 -                                        cr, &atomSets, globalState.get(), &mtop, oenv, mdrunOptions,
 +            enforcedRotation = init_rot(fplog,
 +                                        inputrec.get(),
 +                                        filenames.size(),
 +                                        filenames.data(),
 +                                        cr,
 +                                        &atomSets,
 +                                        globalState.get(),
 +                                        mtop,
 +                                        oenv,
 +                                        mdrunOptions,
                                          startingBehavior);
          }
  
          t_swap* swap = nullptr;
 -        if (inputrec->eSwapCoords != eswapNO)
 +        if (inputrec->eSwapCoords != SwapType::No)
          {
              /* Initialize ion swapping code */
 -            swap = init_swapcoords(fplog, inputrec.get(),
 +            swap = init_swapcoords(fplog,
 +                                   inputrec.get(),
                                     opt2fn_master("-swap", filenames.size(), filenames.data(), cr),
 -                                   &mtop, globalState.get(), &observablesHistory, cr, &atomSets,
 -                                   oenv, mdrunOptions, startingBehavior);
 +                                   mtop,
 +                                   globalState.get(),
 +                                   &observablesHistory,
 +                                   cr,
 +                                   &atomSets,
 +                                   oenv,
 +                                   mdrunOptions,
 +                                   startingBehavior);
          }
  
          /* Let makeConstraints know whether we have essential dynamics constraints. */
 -        auto constr = makeConstraints(mtop, *inputrec, pull_work, doEssentialDynamics, fplog, cr,
 -                                      ms, &nrnb, wcycle, fr->bMolPBC);
 +        auto constr = makeConstraints(mtop,
 +                                      *inputrec,
 +                                      pull_work,
 +                                      doEssentialDynamics,
 +                                      fplog,
 +                                      cr,
 +                                      updateGroups.useUpdateGroups(),
 +                                      ms,
 +                                      &nrnb,
 +                                      wcycle.get(),
 +                                      fr->bMolPBC,
 +                                      &observablesReducerBuilder);
  
          /* Energy terms and groups */
          gmx_enerdata_t enerd(mtop.groups.groups[SimulationAtomGroupType::EnergyOutput].size(),
  
          // cos acceleration is only supported by md, but older tpr
          // files might still combine it with other integrators
 -        GMX_RELEASE_ASSERT(inputrec->cos_accel == 0.0 || inputrec->eI == eiMD,
 +        GMX_RELEASE_ASSERT(inputrec->cos_accel == 0.0 || inputrec->eI == IntegrationAlgorithm::MD,
                             "cos_acceleration is only supported by integrator=md");
  
          /* Kinetic energy data */
 -        gmx_ekindata_t ekind;
 -        init_ekindata(fplog, &mtop, &(inputrec->opts), &ekind, inputrec->cos_accel);
 +        gmx_ekindata_t ekind(inputrec->opts.ngtc,
 +                             inputrec->cos_accel,
 +                             gmx_omp_nthreads_get(ModuleMultiThread::Update));
  
          /* Set up interactive MD (IMD) */
 -        auto imdSession =
 -                makeImdSession(inputrec.get(), cr, wcycle, &enerd, ms, &mtop, mdlog,
 -                               MASTER(cr) ? globalState->x.rvec_array() : nullptr, filenames.size(),
 -                               filenames.data(), oenv, mdrunOptions.imdOptions, startingBehavior);
 -
 -        if (DOMAINDECOMP(cr))
 +        auto imdSession = makeImdSession(inputrec.get(),
 +                                         cr,
 +                                         wcycle.get(),
 +                                         &enerd,
 +                                         ms,
 +                                         mtop,
 +                                         mdlog,
 +                                         MASTER(cr) ? globalState->x : gmx::ArrayRef<gmx::RVec>(),
 +                                         filenames.size(),
 +                                         filenames.data(),
 +                                         oenv,
 +                                         mdrunOptions.imdOptions,
 +                                         startingBehavior);
 +
 +        if (haveDDAtomOrdering(*cr))
          {
              GMX_RELEASE_ASSERT(fr, "fr was NULL while cr->duty was DUTY_PP");
 -            /* This call is not included in init_domain_decomposition mainly
 -             * because fr->cginfo_mb is set later.
 +            /* This call is not included in init_domain_decomposition
 +             * because fr->atomInfoForEachMoleculeBlock is set later.
               */
 -            dd_init_bondeds(fplog, cr->dd, mtop, vsite.get(), inputrec.get(),
 -                            domdecOptions.checkBondedInteractions, fr->cginfo_mb);
 +            makeBondedLinks(cr->dd, mtop, fr->atomInfoForEachMoleculeBlock);
          }
  
          if (runScheduleWork.simulationWork.useGpuBufferOps)
          {
              fr->gpuForceReduction[gmx::AtomLocality::Local] = std::make_unique<gmx::GpuForceReduction>(
                      deviceStreamManager->context(),
 -                    deviceStreamManager->stream(gmx::DeviceStreamType::NonBondedLocal), wcycle);
 +                    deviceStreamManager->stream(gmx::DeviceStreamType::NonBondedLocal),
 +                    wcycle.get());
              fr->gpuForceReduction[gmx::AtomLocality::NonLocal] = std::make_unique<gmx::GpuForceReduction>(
                      deviceStreamManager->context(),
 -                    deviceStreamManager->stream(gmx::DeviceStreamType::NonBondedNonLocal), wcycle);
 +                    deviceStreamManager->stream(gmx::DeviceStreamType::NonBondedNonLocal),
 +                    wcycle.get());
          }
  
          std::unique_ptr<gmx::StatePropagatorDataGpu> stateGpu;
              && ((runScheduleWork.simulationWork.useGpuPme && thisRankHasDuty(cr, DUTY_PME))
                  || runScheduleWork.simulationWork.useGpuBufferOps))
          {
 -            GpuApiCallBehavior transferKind = (inputrec->eI == eiMD && !doRerun && !useModularSimulator)
 -                                                      ? GpuApiCallBehavior::Async
 -                                                      : GpuApiCallBehavior::Sync;
 +            GpuApiCallBehavior transferKind =
 +                    (inputrec->eI == IntegrationAlgorithm::MD && !doRerun && !useModularSimulator)
 +                            ? GpuApiCallBehavior::Async
 +                            : GpuApiCallBehavior::Sync;
              GMX_RELEASE_ASSERT(deviceStreamManager != nullptr,
                                 "GPU device stream manager should be initialized to use GPU.");
              stateGpu = std::make_unique<gmx::StatePropagatorDataGpu>(
 -                    *deviceStreamManager, transferKind, pme_gpu_get_block_size(fr->pmedata), wcycle);
 +                    *deviceStreamManager, transferKind, pme_gpu_get_block_size(fr->pmedata), wcycle.get());
              fr->stateGpu = stateGpu.get();
          }
  
          GMX_ASSERT(stopHandlerBuilder_, "Runner must provide StopHandlerBuilder to simulator.");
          SimulatorBuilder simulatorBuilder;
  
 -        simulatorBuilder.add(SimulatorStateData(globalState.get(), &observablesHistory, &enerd, &ekind));
 +        simulatorBuilder.add(SimulatorStateData(
 +                globalState.get(), localState, &observablesHistory, &enerd, &ekind));
          simulatorBuilder.add(std::move(membedHolder));
          simulatorBuilder.add(std::move(stopHandlerBuilder_));
          simulatorBuilder.add(SimulatorConfig(mdrunOptions, startingBehavior, &runScheduleWork));
  
  
 -        simulatorBuilder.add(SimulatorEnv(fplog, cr, ms, mdlog, oenv));
 -        simulatorBuilder.add(Profiling(&nrnb, walltime_accounting, wcycle));
 +        simulatorBuilder.add(SimulatorEnv(fplog, cr, ms, mdlog, oenv, &observablesReducerBuilder));
 +        simulatorBuilder.add(Profiling(&nrnb, walltime_accounting, wcycle.get()));
          simulatorBuilder.add(ConstraintsParam(
 -                constr.get(), enforcedRotation ? enforcedRotation->getLegacyEnfrot() : nullptr,
 -                vsite.get()));
 +                constr.get(), enforcedRotation ? enforcedRotation->getLegacyEnfrot() : nullptr, vsite.get()));
          // TODO: Separate `fr` to a separate add, and make the `build` handle the coupling sensibly.
 -        simulatorBuilder.add(LegacyInput(static_cast<int>(filenames.size()), filenames.data(),
 -                                         inputrec.get(), fr));
 +        simulatorBuilder.add(LegacyInput(
 +                static_cast<int>(filenames.size()), filenames.data(), inputrec.get(), fr.get()));
          simulatorBuilder.add(ReplicaExchangeParameters(replExParams));
          simulatorBuilder.add(InteractiveMD(imdSession.get()));
 -        simulatorBuilder.add(SimulatorModules(mdModules_->outputProvider(), mdModules_->notifier()));
 +        simulatorBuilder.add(SimulatorModules(mdModules_->outputProvider(), mdModules_->notifiers()));
          simulatorBuilder.add(CenterOfMassPulling(pull_work));
          // TODO: move to an MDModule
          simulatorBuilder.add(IonSwapping(swap));
 -        simulatorBuilder.add(TopologyData(&mtop, mdAtoms.get()));
 +        simulatorBuilder.add(TopologyData(mtop, &localTopology, mdAtoms.get()));
          simulatorBuilder.add(BoxDeformationHandle(deform.get()));
          simulatorBuilder.add(std::move(modularSimulatorCheckpointData));
  
      {
          GMX_RELEASE_ASSERT(pmedata, "pmedata was NULL while cr->duty was not DUTY_PP");
          /* do PME only */
 -        walltime_accounting = walltime_accounting_init(gmx_omp_nthreads_get(emntPME));
 -        gmx_pmeonly(pmedata, cr, &nrnb, wcycle, walltime_accounting, inputrec.get(), pmeRunMode,
 +        walltime_accounting = walltime_accounting_init(gmx_omp_nthreads_get(ModuleMultiThread::Pme));
 +        gmx_pmeonly(pmedata,
 +                    cr,
 +                    &nrnb,
 +                    wcycle.get(),
 +                    walltime_accounting,
 +                    inputrec.get(),
 +                    pmeRunMode,
 +                    runScheduleWork.simulationWork.useGpuPmePpCommunication,
                      deviceStreamManager.get());
      }
  
 -    wallcycle_stop(wcycle, ewcRUN);
 +    wallcycle_stop(wcycle.get(), WallCycleCounter::Run);
  
      /* Finish up, write some stuff
       * if rerunMD, don't write last frame again
       */
 -    finish_run(fplog, mdlog, cr, inputrec.get(), &nrnb, wcycle, walltime_accounting,
 -               fr ? fr->nbv.get() : nullptr, pmedata, EI_DYNAMICS(inputrec->eI) && !isMultiSim(ms));
 +    finish_run(fplog,
 +               mdlog,
 +               cr,
 +               *inputrec,
 +               &nrnb,
 +               wcycle.get(),
 +               walltime_accounting,
 +               fr ? fr->nbv.get() : nullptr,
 +               pmedata,
 +               EI_DYNAMICS(inputrec->eI) && !isMultiSim(ms));
  
 -    // clean up cycle counter
 -    wallcycle_destroy(wcycle);
  
      deviceStreamManager.reset(nullptr);
      // Free PME data
      // As soon as we destroy GPU contexts after mdrunner() exits, these lines should go.
      mdAtoms.reset(nullptr);
      globalState.reset(nullptr);
 +    localStateInstance.reset(nullptr);
      mdModules_.reset(nullptr); // destruct force providers here as they might also use the GPU
 -    gpuBonded.reset(nullptr);
 -    /* Free pinned buffers in *fr */
 -    delete fr;
 -    fr = nullptr;
 +    fr.reset(nullptr);         // destruct the forcerec before the GPU context
      // TODO convert to C++ so we can get rid of these frees
      sfree(disresdata);
 -    sfree(oriresdata);
  
      if (!hwinfo_->deviceInfoList.empty())
      {
      {
          physicalNodeComm.barrier();
      }
 -    releaseDevice(deviceInfo);
 +
 +    if (!devFlags.usingCudaAwareMpi)
 +    {
 +        // Don't reset the GPU when using CUDA-aware MPI: UCX creates CUDA buffers
 +        // that are cleaned up as part of MPI_Finalize(), and resetting the device
 +        // before MPI_Finalize() results in crashes inside UCX.
 +        releaseDevice(deviceInfo);
 +    }
  
      /* Does what it says */
      print_date_and_time(fplog, cr->nodeid, "Finished mdrun", gmx_gettime());
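
A pattern worth noting in the runner.cpp changes above: lifetimes that were previously managed by hand (snew()/sfree(), new/delete, wallcycle_destroy()) now flow through std::unique_ptr, as for wcycle and fr. The sketch below is illustrative only — Resource and makeResource() are hypothetical stand-ins, not GROMACS API — and shows the shape of the pattern: a factory returns an owning pointer, legacy C-style functions receive .get(), and destruction order is forced with reset() where it matters.

    #include <memory>

    // Hypothetical stand-in for a C-style struct such as gmx_wallcycle.
    struct Resource
    {
        int counters = 0;
    };

    // Factory returning an owning pointer, analogous to wallcycle_init().
    std::unique_ptr<Resource> makeResource()
    {
        return std::make_unique<Resource>();
    }

    // Legacy C-style APIs keep taking raw pointers; callers pass .get().
    void useResource(Resource* r)
    {
        r->counters++;
    }

    int main()
    {
        std::unique_ptr<Resource> res = makeResource();
        useResource(res.get());
        // Destruction order can still be made explicit where it matters,
        // e.g. destructing the forcerec before the GPU context is released:
        res.reset(nullptr);
        return 0;
    }
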
index 23781cee8cfc0177dc1e85f98eefd9603273f17f,a6b511a3051164867f0aa0173a9387b312543705..079f906cb9701b21d8c94381a1762c42f28a1fb9
@@@ -3,7 -3,7 +3,7 @@@
   *
   * Copyright (c) 2009,2010,2011,2012,2013 by the GROMACS development team.
   * Copyright (c) 2014,2015,2016,2017,2018 by the GROMACS development team.
-  * Copyright (c) 2019,2020, by the GROMACS development team, led by
+  * Copyright (c) 2019,2020,2021, by the GROMACS development team, led by
   * Mark Abraham, David van der Spoel, Berk Hess, and Erik Lindahl,
   * and including many others, as listed in the AUTHORS file in the
   * top-level source directory and at http://www.gromacs.org.
@@@ -522,6 -522,7 +522,7 @@@ void _gmx_sel_evaluate_static(gmx_sel_e
  {
      if (sel->flags & SEL_UNSORTED)
      {
+         gmx_ana_index_reserve(sel->v.u.g, sel->u.cgrp.isize);
          // This only works if g contains all the atoms, but that is currently
          // the only supported case.
          gmx_ana_index_copy(sel->v.u.g, &sel->u.cgrp, false);
@@@ -953,7 -954,8 +954,7 @@@ void _gmx_sel_evaluate_method(gmx_sel_e
                      expandValueForPositions(sel->v.u.r, &sel->v.nr, sel->u.expr.pos);
                      break;
                  default:
 -                    GMX_RELEASE_ASSERT(false,
 -                                       "Unimplemented value type for position update method");
 +                    GMX_RELEASE_ASSERT(false, "Unimplemented value type for position update method");
              }
          }
      }
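
The one-line evaluate.cpp fix above reserves capacity in the destination group before gmx_ana_index_copy() writes into it. As a hedged illustration (plain C++ containers, not the selection-engine API), the same reserve-before-copy discipline looks like this:

    #include <cassert>
    #include <cstring>
    #include <vector>

    // Ensure the destination can hold the source before a raw copy, mirroring
    // the intent of the added gmx_ana_index_reserve() call.
    void copyGroup(std::vector<int>* dst, const std::vector<int>& src)
    {
        dst->resize(src.size()); // without this, the memcpy would write out of bounds
        std::memcpy(dst->data(), src.data(), src.size() * sizeof(int));
    }

    int main()
    {
        const std::vector<int> source = { 3, 1, 2 };
        std::vector<int>       destination; // starts empty, like a fresh index group
        copyGroup(&destination, source);
        assert(destination.size() == source.size());
        return 0;
    }
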
index 36c4650f0649311fe23d213ec27e7a66f737e389,bc9af8d4e09ade7668c959c3345e9723dc00e5e8..62010fe2697772148bdcdb604511933f4c06209b
@@@ -57,7 -57,7 +57,7 @@@
  #include "gromacs/hardware/detecthardware.h"
  #include "gromacs/hardware/hardwaretopology.h"
  #include "gromacs/hardware/hw_info.h"
 -#include "gromacs/listed_forces/gpubonded.h"
 +#include "gromacs/listed_forces/listed_forces_gpu.h"
  #include "gromacs/mdlib/gmx_omp_nthreads.h"
  #include "gromacs/mdlib/update_constrain_gpu.h"
  #include "gromacs/mdtypes/commrec.h"
@@@ -83,7 -83,7 +83,7 @@@ namespac
  {
  
  //! Helper variable to localise the text of an often repeated message.
 -const char* g_specifyEverythingFormatString =
 +const char* const g_specifyEverythingFormatString =
          "When you use mdrun -gputasks, %s must be set to non-default "
          "values, so that the device IDs can be interpreted correctly."
  #if GMX_GPU
          // OpenCL standard, but the only current relevant case for GROMACS
          // is AMD OpenCL, which offers this variable.
          "GPU_DEVICE_ORDINAL"
 -#    elif GMX_GPU_SYCL
 -        // As with OpenCL, there are no portable way to do it.
 -        // Intel reference: https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md
 -        // While SYCL_DEVICE_FILTER is a better option, as of 2021.1-beta10 it is not yet supported.
 -        "SYCL_DEVICE_ALLOWLIST"
 +#    elif GMX_GPU_SYCL && GMX_SYCL_DPCPP
 +        // https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md
 +        "SYCL_DEVICE_FILTER"
 +#    elif GMX_GPU_SYCL && GMX_SYCL_HIPSYCL
 +        // Not true if we use hipSYCL over CUDA or IntelLLVM, but in that case the user probably
 +        // knows what they are doing.
 +        // https://rocmdocs.amd.com/en/latest/Other_Solutions/Other-Solutions.html#hip-environment-variables
 +        "HIP_VISIBLE_DEVICES"
  #    else
  #        error "Unreachable branch"
  #    endif
  } // namespace
  
  bool decideWhetherToUseGpusForNonbondedWithThreadMpi(const TaskTarget        nonbondedTarget,
 -                                                     const int               numDevicesToUse,
 +                                                     const bool              haveAvailableDevices,
                                                       const std::vector<int>& userGpuTaskAssignment,
                                                       const EmulateGpuNonbonded emulateGpuNonbonded,
                                                       const bool buildSupportsNonbondedOnGpu,
      // all potential ranks can use, and can use that in a global
      // decision that will later be consistent.
      // If we get here, then the user permitted or required GPUs.
 -    return (numDevicesToUse > 0);
 +    return haveAvailableDevices;
  }
  
  bool decideWhetherToUseGpusForPmeWithThreadMpi(const bool              useGpuForNonbonded,
                                                 const TaskTarget        pmeTarget,
+                                                const TaskTarget        pmeFftTarget,
                                                 const int               numDevicesToUse,
                                                 const std::vector<int>& userGpuTaskAssignment,
                                                 const gmx_hw_info_t&    hardwareInfo,
      if ((pmeTarget == TaskTarget::Cpu) || !useGpuForNonbonded || !pme_gpu_supports_build(nullptr)
          || !pme_gpu_supports_hardware(hardwareInfo, nullptr) || !pme_gpu_supports_input(inputrec, nullptr))
      {
-         // PME can't run on a GPU. If the user required that, we issue
-         // an error later.
+         // PME can't run on a GPU. If the user required that, we issue an error later.
+         return false;
+     }
+     if (pmeFftTarget == TaskTarget::Cpu && !pme_gpu_mixed_mode_supports_input(inputrec, nullptr))
+     {
+         /* User requested PME FFT on CPU, but the current system is not compatible with Mixed mode,
+          * so we don't use GPUs at all.
+          * If the user had -pme gpu, we issue an error later. */
          return false;
      }
  
@@@ -330,6 -335,7 +338,7 @@@ bool decideWhetherToUseGpusForNonbonded
  
  bool decideWhetherToUseGpusForPme(const bool              useGpuForNonbonded,
                                    const TaskTarget        pmeTarget,
+                                   const TaskTarget        pmeFftTarget,
                                    const std::vector<int>& userGpuTaskAssignment,
                                    const gmx_hw_info_t&    hardwareInfo,
                                    const t_inputrec&       inputrec,
          }
          return false;
      }
+     if (pmeFftTarget == TaskTarget::Cpu && !pme_gpu_mixed_mode_supports_input(inputrec, &message))
+     {
+         /* User requested PME FFT on CPU, but the current system is not compatible with Mixed mode,
+          * so we don't use GPUs at all. */
+         if (pmeTarget == TaskTarget::Gpu)
+         {
+             GMX_THROW(NotImplementedError("Cannot compute PME interactions in Mixed mode, because " + message));
+         }
+         return false;
+     }
  
      if (pmeTarget == TaskTarget::Cpu)
      {
@@@ -479,7 -495,7 +498,7 @@@ bool decideWhetherToUseGpusForBonded(bo
  
      std::string errorMessage;
  
 -    if (!buildSupportsGpuBondeds(&errorMessage))
 +    if (!buildSupportsListedForcesGpu(&errorMessage))
      {
          if (bondedTarget == TaskTarget::Gpu)
          {
          return false;
      }
  
 -    if (!inputSupportsGpuBondeds(inputrec, mtop, &errorMessage))
 +    if (!inputSupportsListedForcesGpu(inputrec, mtop, &errorMessage))
      {
          if (bondedTarget == TaskTarget::Gpu)
          {
      // is busy, for which we currently only check PME or Ewald.
      // (It would be better to dynamically assign bondeds based on timings)
      // Note that here we assume that the auto setting of PME ranks will not
 -    // choose seperate PME ranks when nonBonded are assigned to the GPU.
 +    // choose separate PME ranks when nonBonded are assigned to the GPU.
      bool usingOurCpuForPmeOrEwald =
              (EVDW_PME(inputrec.vdwtype)
               || (EEL_PME_EWALD(inputrec.coulombtype) && !useGpuForPme && numPmeRanksPerSimulation <= 0));
@@@ -546,6 -562,7 +565,6 @@@ bool decideWhetherToUseGpuForUpdate(con
                                      const gmx_mtop_t&              mtop,
                                      const bool                     useEssentialDynamics,
                                      const bool                     doOrientationRestraints,
 -                                    const bool                     useReplicaExchange,
                                      const bool                     haveFrozenAtoms,
                                      const bool                     doRerun,
                                      const DevelopmentFeatureFlags& devFlags,
          errorMessage += "Multiple time stepping is not supported.\n";
      }
  
 -    if (inputrec.eConstrAlg == econtSHAKE && hasAnyConstraints && gmx_mtop_ftype_count(mtop, F_CONSTR) > 0)
 +    if (inputrec.eConstrAlg == ConstraintAlgorithm::Shake && hasAnyConstraints
 +        && gmx_mtop_ftype_count(mtop, F_CONSTR) > 0)
      {
          errorMessage += "SHAKE constraints are not supported.\n";
      }
          errorMessage +=
                  "Either PME or short-ranged non-bonded interaction tasks must run on the GPU.\n";
      }
 -
      if (!gpusWereDetected)
      {
          errorMessage += "Compatible GPUs must have been found.\n";
      {
          errorMessage += "Only a CUDA build is supported.\n";
      }
 -    if (inputrec.eI != eiMD)
 +    if (inputrec.eI != IntegrationAlgorithm::MD)
      {
          errorMessage += "Only the md integrator is supported.\n";
      }
 -    if (inputrec.etc == etcNOSEHOOVER)
 +    if (inputrec.etc == TemperatureCoupling::NoseHoover)
      {
          errorMessage += "Nose-Hoover temperature coupling is not supported.\n";
      }
 -    if (!(inputrec.epc == epcNO || inputrec.epc == epcPARRINELLORAHMAN
 -          || inputrec.epc == epcBERENDSEN || inputrec.epc == epcCRESCALE))
 +    if (!(inputrec.epc == PressureCoupling::No || inputrec.epc == PressureCoupling::ParrinelloRahman
 +          || inputrec.epc == PressureCoupling::Berendsen || inputrec.epc == PressureCoupling::CRescale))
      {
          errorMessage +=
                  "Only Parrinello-Rahman, Berendsen, and C-rescale pressure coupling are "
          // The graph is needed, but not supported
          errorMessage += "Orientation restraints are not supported.\n";
      }
 -    if (inputrec.efep != efepNO && (haveFepPerturbedMasses(mtop) || havePerturbedConstraints(mtop)))
 +    if (inputrec.efep != FreeEnergyPerturbationType::No
 +        && (haveFepPerturbedMasses(mtop) || havePerturbedConstraints(mtop)))
      {
          errorMessage += "Free energy perturbation for mass and constraints are not supported.\n";
      }
      const auto particleTypes = gmx_mtop_particletype_count(mtop);
 -    if (particleTypes[eptShell] > 0)
 +    if (particleTypes[ParticleType::Shell] > 0)
      {
          errorMessage += "Shells are not supported.\n";
      }
 -    if (useReplicaExchange)
 -    {
 -        errorMessage += "Replica exchange simulations are not supported.\n";
 -    }
 -    if (inputrec.eSwapCoords != eswapNO)
 +    if (inputrec.eSwapCoords != SwapType::No)
      {
          errorMessage += "Swapping the coordinates is not supported.\n";
      }
                  "The number of coupled constraints is higher than supported in the GPU LINCS "
                  "code.\n";
      }
 +    if (hasAnyConstraints && !UpdateConstrainGpu::areConstraintsSupported())
 +    {
 +        errorMessage += "Chosen GPU implementation does not support constraints.\n";
 +    }
      if (haveFrozenAtoms)
      {
          // There is a known bug with frozen atoms and GPU update, see Issue #3920.
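
decideWhetherToUseGpuForUpdate() above collects every blocker into errorMessage and only fails hard when the user explicitly forced the GPU target; in auto mode it falls back to the CPU path. A minimal sketch of that decision pattern, with hypothetical names (decideWhetherToUseGpuForFeature is not a real GROMACS function):

    #include <stdexcept>
    #include <string>

    enum class TaskTarget { Auto, Cpu, Gpu };

    // Collect all blockers first, then throw only if the user required the GPU.
    bool decideWhetherToUseGpuForFeature(TaskTarget target, bool gpusWereDetected, bool inputIsSupported)
    {
        if (target == TaskTarget::Cpu)
        {
            return false;
        }
        std::string errorMessage;
        if (!gpusWereDetected)
        {
            errorMessage += "Compatible GPUs must have been found.\n";
        }
        if (!inputIsSupported)
        {
            errorMessage += "The input system is not supported.\n";
        }
        if (!errorMessage.empty())
        {
            if (target == TaskTarget::Gpu)
            {
                // The real code uses GMX_THROW(NotImplementedError(...)) here.
                throw std::runtime_error("Cannot run on a GPU because:\n" + errorMessage);
            }
            return false; // auto mode: quietly fall back to the CPU path
        }
        return true;
    }

    int main()
    {
        return decideWhetherToUseGpuForFeature(TaskTarget::Auto, true, true) ? 0 : 1;
    }
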
index 3404bebb6af6f6c6e8bc3097dd8d77695fb6768e,8f16f25ae80b198aabf6cb2116ff741bc976ca32..76b6f40042a0fec1ecd7bcec55482aec9d3ac72d
@@@ -88,8 -88,6 +88,8 @@@ struct DevelopmentFeatureFlag
      bool enableGpuHaloExchange = false;
      //! True if the PME PP direct communication GPU development feature is enabled
      bool enableGpuPmePPComm = false;
 +    //! True if the CUDA-aware MPI is being used for GPU direct communication feature
 +    bool usingCudaAwareMpi = false;
  };
  
  
@@@ -105,7 -103,8 +105,7 @@@ class MDAtoms
   *
   * \param[in] nonbondedTarget              The user's choice for mdrun -nb for where to assign
   *                                         short-ranged nonbonded interaction tasks.
 - * \param[in] numDevicesToUse              Number of compatible GPUs that the user permitted
 - *                                         us to use.
 + * \param[in] haveAvailableDevices         Whether there are available devices.
   * \param[in] userGpuTaskAssignment        The user-specified assignment of GPU tasks to device IDs.
   * \param[in] emulateGpuNonbonded          Whether we will emulate GPU calculation of nonbonded
   *                                         interactions.
   * \throws     std::bad_alloc          If out of memory
   *             InconsistentInputError  If the user requirements are inconsistent. */
  bool decideWhetherToUseGpusForNonbondedWithThreadMpi(TaskTarget              nonbondedTarget,
 -                                                     int                     numDevicesToUse,
 +                                                     bool                    haveAvailableDevices,
                                                       const std::vector<int>& userGpuTaskAssignment,
                                                       EmulateGpuNonbonded     emulateGpuNonbonded,
                                                       bool buildSupportsNonbondedOnGpu,
   * \param[in]  useGpuForNonbonded        Whether GPUs will be used for nonbonded interactions.
   * \param[in]  pmeTarget                 The user's choice for mdrun -pme for where to assign
   *                                       long-ranged PME nonbonded interaction tasks.
+  * \param[in]  pmeFftTarget              The user's choice for mdrun -pmefft for where to run FFT.
   * \param[in]  numDevicesToUse           The number of compatible GPUs that the user permitted us to use.
   * \param[in]  userGpuTaskAssignment     The user-specified assignment of GPU tasks to device IDs.
   * \param[in]  hardwareInfo              Hardware information
   *             InconsistentInputError  If the user requirements are inconsistent. */
  bool decideWhetherToUseGpusForPmeWithThreadMpi(bool                    useGpuForNonbonded,
                                                 TaskTarget              pmeTarget,
+                                                TaskTarget              pmeFftTarget,
                                                 int                     numDevicesToUse,
                                                 const std::vector<int>& userGpuTaskAssignment,
                                                 const gmx_hw_info_t&    hardwareInfo,
@@@ -209,6 -210,7 +211,7 @@@ bool decideWhetherToUseGpusForNonbonded
   *
   * \param[in]  useGpuForNonbonded        Whether GPUs will be used for nonbonded interactions.
   * \param[in]  pmeTarget                 The user's choice for mdrun -pme for where to assign long-ranged PME nonbonded interaction tasks.
+  * \param[in]  pmeFftTarget              The user's choice for mdrun -pmefft for where to do FFT for PME.
   * \param[in]  userGpuTaskAssignment     The user-specified assignment of GPU tasks to device IDs.
   * \param[in]  hardwareInfo              Hardware information
   * \param[in]  inputrec                  The user input
   *             InconsistentInputError  If the user requirements are inconsistent. */
  bool decideWhetherToUseGpusForPme(bool                    useGpuForNonbonded,
                                    TaskTarget              pmeTarget,
+                                   TaskTarget              pmeFftTarget,
                                    const std::vector<int>& userGpuTaskAssignment,
                                    const gmx_hw_info_t&    hardwareInfo,
                                    const t_inputrec&       inputrec,
@@@ -278,6 -281,7 +282,6 @@@ bool decideWhetherToUseGpusForBonded(bo
   * \param[in]  mtop                         The global topology.
   * \param[in]  useEssentialDynamics         If essential dynamics is active.
   * \param[in]  doOrientationRestraints      If orientation restraints are enabled.
 - * \param[in]  useReplicaExchange           If this is a REMD simulation.
   * \param[in]  haveFrozenAtoms              If this simulation has frozen atoms (see Issue #3920).
  * \param[in]  doRerun                      If this is a rerun.
   * \param[in]  devFlags                     GPU development / experimental feature flags.
@@@ -298,6 -302,7 +302,6 @@@ bool decideWhetherToUseGpuForUpdate(boo
                                      const gmx_mtop_t&              mtop,
                                      bool                           useEssentialDynamics,
                                      bool                           doOrientationRestraints,
 -                                    bool                           useReplicaExchange,
                                      bool                           haveFrozenAtoms,
                                      bool                           doRerun,
                                      const DevelopmentFeatureFlags& devFlags,
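
The new pmeFftTarget parameter threads the mdrun -pmefft choice into both PME decision functions, so an input that Mixed mode cannot handle is caught early. A sketch under stated assumptions — the two support checks below are stand-ins, not the real pme_gpu_supports_input()/pme_gpu_mixed_mode_supports_input() signatures:

    #include <string>

    enum class TaskTarget { Auto, Cpu, Gpu };

    // Stand-ins: the real checks inspect the t_inputrec and fill a reason string.
    bool fullGpuPmeSupported(std::string* /*reason*/) { return true; }
    bool mixedModePmeSupported(std::string* reason)
    {
        *reason = "the system is not compatible with Mixed mode";
        return false;
    }

    bool decideWhetherToUseGpusForPmeSketch(TaskTarget pmeTarget, TaskTarget pmeFftTarget)
    {
        std::string reason;
        if (pmeTarget == TaskTarget::Cpu || !fullGpuPmeSupported(&reason))
        {
            return false; // PME cannot run on a GPU at all
        }
        if (pmeFftTarget == TaskTarget::Cpu && !mixedModePmeSupported(&reason))
        {
            // Mirrors the new check: with -pmefft cpu but an incompatible input,
            // fall back to CPU PME (the real code errors out later if the user
            // explicitly required -pme gpu).
            return false;
        }
        return true;
    }

    int main()
    {
        return decideWhetherToUseGpusForPmeSketch(TaskTarget::Auto, TaskTarget::Cpu) ? 0 : 1;
    }
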