docs/reference-manual/algorithms/parallelization-domain-decomp.rst

   1 Parallelization
   2 ---------------
   3
   4 The CPU time required for a simulation can be reduced by running the
   5 simulation in parallel over more than one core. Ideally, one would want
   6 to have linear scaling: running on :math:`N` cores makes the simulation
   7 :math:`N` times faster. In practice this can only be achieved for a
   8 small number of cores. The scaling will depend a lot on the algorithms
   9 used. Also, different algorithms can have different restrictions on the
  10 interaction ranges between atoms.
  11
  12 Domain decomposition
  13 --------------------
  14
  15 Since most interactions in molecular simulations are local, domain
  16 decomposition is a natural way to decompose the system. In domain
  17 decomposition, a spatial domain is assigned to each rank, which will
  18 then integrate the equations of motion for the particles that currently
  19 reside in its local domain. With domain decomposition, there are two
  20 choices that have to be made: the division of the unit cell into domains
  21 and the assignment of the forces to domains. Most molecular simulation
  22 packages use the half-shell method for assigning the forces. But there
  23 are two methods that always require less communication: the eighth
  24 shell \ :ref:`69 <refLiem1991>` and the midpoint \ :ref:`70 <refShaw2006>`
  25 method. |Gromacs| currently uses the eighth shell method, but
  26 for certain systems or hardware architectures it might be advantageous
  27 to use the midpoint method. Therefore, we might implement the midpoint
  28 method in the future. Most of the details of the domain decomposition
  29 can be found in the |Gromacs| 4 paper \ :ref:`5 <refHess2008b>`.
  30
  31 Coordinate and force communication
  32 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  33
  34 In the most general case of a triclinic unit cell, the space in divided
  35 with a 1-, 2-, or 3-D grid in parallelepipeds that we call domain
  36 decomposition cells. Each cell is assigned to a particle-particle rank.
  37 The system is partitioned over the ranks at the beginning of each MD
  38 step in which neighbor searching is performed. The minimum unit of
  39 partitioning can be an atom, or a charge group with the (deprecated)
  40 group cut-off scheme or an update group. An update group is a group
  41 of atoms that has dependencies during update, which occurs when using
  42 constraints and/or virtual sites. Thus different update groups can be
  43 updated independenly. Currently update groups can only be used with at most
  44 two sequential constraints, which is the case when only constraining
  45 bonds involving hydrogen atoms. The advantages of update groups are that
  46 no communication is required in the update and that this allows updating part
  47 of the system while computing forces for other parts. Atom groups are assigned
  48 to the cell where their center of geometry resides. Before the forces can
  49 be calculated, the coordinates from some neighboring cells need to be
  50 communicated, and after the forces are calculated, the forces need to be
  51 communicated in the other direction. The communication and force
  52 assignment is based on zones that can cover one or multiple cells. An
  53 example of a zone setup is shown in :numref:`Fig. %s <fig-ddcells>`.
  54
  55 .. _fig-ddcells:
  56
  57 .. figure:: plots/dd-cells.*
  58    :width: 6.00000cm
  59
  60    A non-staggered domain decomposition grid of
  61    3\ :math:`\times`\ 2\ :math:`\times`\ 2 cells. Coordinates in zones 1
  62    to 7 are communicated to the corner cell that has its home particles
  63    in zone 0. :math:`r_c` is the cut-off radius.
  64
  65 The coordinates are communicated by moving data along the “negative”
  66 direction in :math:`x`, :math:`y` or :math:`z` to the next neighbor.
  67 This can be done in one or multiple pulses. In :numref:`Fig. %s <fig-ddcells>` two
  68 pulses in :math:`x` are required, then one in :math:`y` and then one in
  69 :math:`z`. The forces are communicated by reversing this procedure. See
  70 the |Gromacs| 4 paper \ :ref:`5 <refHess2008b>` for details on determining which
  71 non-bonded and bonded forces should be calculated on which rank.
  72
  73 Dynamic load balancing
  74 ~~~~~~~~~~~~~~~~~~~~~~
  75
  76 When different ranks have a different computational load (load
  77 imbalance), all ranks will have to wait for the one that takes the most
  78 time. One would like to avoid such a situation. Load imbalance can occur
  79 due to four reasons:
  80
  81 -  inhomogeneous particle distribution
  82
  83 -  inhomogeneous interaction cost distribution (charged/uncharged,
  84    water/non-water due to |Gromacs| water innerloops)
  85
  86 -  statistical fluctuation (only with small particle numbers)
  87
  88 -  differences in communication time, due to network topology and/or
  89    other jobs on the machine interfering with our communication
  90
  91 So we need a dynamic load balancing algorithm where the volume of each
  92 domain decomposition cell can be adjusted *independently*. To achieve
  93 this, the 2- or 3-D domain decomposition grids need to be staggered.
  94 :numref:`Fig. %s <fig-ddtric>` shows the most general case in 2-D. Due to the
  95 staggering, one might require two distance checks for deciding if a
  96 charge group needs to be communicated: a non-bonded distance and a
  97 bonded distance check.
  98
  99 .. _fig-ddtric:
 100
 101 .. figure:: plots/dd-tric.*
 102    :width: 7.00000cm
 103
 104    The zones to communicate to the rank of zone 0, see the text
 105    for details. :math:`r_c` and :math:`r_b` are the non-bonded and
 106    bonded cut-off radii respectively, :math:`d` is an example of a
 107    distance between following, staggered boundaries of cells.
 108
 109 By default, :ref:`mdrun <gmx mdrun>` automatically turns on the dynamic load balancing
 110 during a simulation when the total performance loss due to the force
 111 calculation imbalance is 2% or more. **Note** that the reported force
 112 load imbalance numbers might be higher, since the force calculation is
 113 only part of work that needs to be done during an integration step. The
 114 load imbalance is reported in the log file at log output steps and when
 115 the ``-v`` option is used also on screen. The average load imbalance and the
 116 total performance loss due to load imbalance are reported at the end of
 117 the log file.
 118
 119 There is one important parameter for the dynamic load balancing, which
 120 is the minimum allowed scaling. By default, each dimension of the domain
 121 decomposition cell can scale down by at least a factor of 0.8. For 3-D
 122 domain decomposition this allows cells to change their volume by about a
 123 factor of 0.5, which should allow for compensation of a load imbalance
 124 of 100%. The minimum allowed scaling can be changed with the
 125 ``-dds`` option of :ref:`mdrun <gmx mdrun>`.
 126
 127 The load imbalance is measured by timing a single region of the MD step
 128 on each MPI rank. This region can not include MPI communication, as
 129 timing of MPI calls does not allow separating wait due to imbalance from
 130 actual communication. The domain volumes are then scaled, with
 131 under-relaxation, inversely proportional with the measured time. This
 132 procedure will decrease the load imbalance when the change in load in
 133 the measured region correlates with the change in domain volume and the
 134 load outside the measured region does not depend strongly on the domain
 135 volume. In CPU-only simulations, the load is measured between the
 136 coordinate and the force communication. In simulations with non-bonded
 137 work on GPUs, we overlap communication and work on the CPU with
 138 calculation on the GPU. Therefore we measure from the last communication
 139 before the force calculation to when the CPU or GPU is finished,
 140 whichever is last. When not using PME ranks, we subtract the time in PME
 141 from the CPU time, as this includes MPI calls and the PME load is
 142 independent of domain size. This generally works well, unless the
 143 non-bonded load is low and there is imbalance in the bonded
 144 interactions. Then two issues can arise. Dynamic load balancing can
 145 increase the imbalance in update and constraints and with PME the
 146 coordinate and force redistribution time can go up significantly.
 147 Although dynamic load balancing can significantly improve performance in
 148 cases where there is imbalance in the bonded interactions on the CPU,
 149 there are many situations in which some domains continue decreasing in
 150 size and the load imbalance increases and/or PME coordinate and force
 151 redistribution cost increases significantly. As of version 2016.1, :ref:`mdrun <gmx mdrun>`
 152 disables the dynamic load balancing when measurement indicates that it
 153 deteriorates performance. This means that in most cases the user will
 154 get good performance with the default, automated dynamic load balancing
 155 setting.
 156
 157 .. _plincs:
 158
 159 Constraints in parallel
 160 ~~~~~~~~~~~~~~~~~~~~~~~
 161
 162 Since with domain decomposition parts of molecules can reside on
 163 different ranks, bond constraints can cross cell boundaries. Therefore a
 164 parallel constraint algorithm is required. |Gromacs| uses the P-LINCS
 165 algorithm \ :ref:`50 <refHess2008a>`, which is the parallel version of the LINCS
 166 algorithm \ :ref:`49 <refHess97>` (see :ref:`lincs`). The P-LINCS procedure
 167 is illustrated in :numref:`Fig. %s <fig-plincs>`. When molecules cross the cell
 168 boundaries, atoms in such molecules up to (``lincs_order + 1``) bonds away
 169 are communicated over the cell boundaries. Then, the normal LINCS
 170 algorithm can be applied to the local bonds plus the communicated ones.
 171 After this procedure, the local bonds are correctly constrained, even
 172 though the extra communicated ones are not. One coordinate communication
 173 step is required for the initial LINCS step and one for each iteration.
 174 Forces do not need to be communicated.
 175
 176 .. _fig-plincs:
 177
 178 .. figure:: plots/par-lincs2.*
 179    :width: 6.00000cm
 180
 181    Example of the parallel setup of P-LINCS with one molecule
 182    split over three domain decomposition cells, using a matrix expansion
 183    order of 3. The top part shows which atom coordinates need to be
 184    communicated to which cells. The bottom parts show the local
 185    constraints (solid) and the non-local constraints (dashed) for each
 186    of the three cells.
 187
 188 Interaction ranges
 189 ~~~~~~~~~~~~~~~~~~
 190
 191 Domain decomposition takes advantage of the locality of interactions.
 192 This means that there will be limitations on the range of interactions.
 193 By default, :ref:`mdrun <gmx mdrun>` tries to find the optimal balance between interaction
 194 range and efficiency. But it can happen that a simulation stops with an
 195 error message about missing interactions, or that a simulation might run
 196 slightly faster with shorter interaction ranges. A list of interaction
 197 ranges and their default values is given in :numref:`Table %s <table-ddranges>`
 198
 199 .. |nbrange| replace:: :math:`r_c`\ =\ max(\ :math:`r_{\mathrm{list}}`\ ,\ :math:`r_{\mathrm{VdW}}`\ ,\ :math:`r_{\mathrm{Coul}}`\ )
 200 .. |tbrange| replace:: max(:math:`r_{\mathrm{mb}}`\ ,\ :math:`r_c`)
 201 .. |mbrange| replace:: :math:`r_{\mathrm{mb}}`
 202 .. |csrange| replace:: :math:`r_{\mathrm{con}}`
 203 .. |vsrange| replace:: :math:`r_{\mathrm{con}}`
 204 .. |mdrunr| replace:: :ref:`mdrun <gmx mdrun>` ``-rdd``
 205 .. |mdrunc| replace:: :ref:`mdrun <gmx mdrun>` ``-rcon``
 206
 207 .. _table-ddranges:
 208
 209 .. table:: The interaction ranges with domain decomposition.
 210     :widths: auto
 211     :align: center
 212
 213     +-------------------+-----------+-----------------+------------------------+
 214     | interaction       | range     | option          | default                |
 215     +===================+===========+=================+========================+
 216     | non-bonded        | |nbrange| | :ref:`mdp` file |                        |
 217     +-------------------+-----------+-----------------+------------------------+
 218     | two-body bonded   | |tbrange| | |mdrunr|        | starting conf. + 10%   |
 219     +-------------------+-----------+-----------------+------------------------+
 220     | multi-body bonded | |mbrange| | |mdrunr|        | starting conf. + 10%   |
 221     +-------------------+-----------+-----------------+------------------------+
 222     | constraints       | |csrange| | |mdrunc|        | est. from bond lengths |
 223     +-------------------+-----------+-----------------+------------------------+
 224     | virtual sites     | |vsrange| | |mdrunc|        | 0                      |
 225     +-------------------+-----------+-----------------+------------------------+
 226
 227 In most cases the defaults of :ref:`mdrun <gmx mdrun>` should not cause the simulation to
 228 stop with an error message of missing interactions. The range for the
 229 bonded interactions is determined from the distance between bonded
 230 charge-groups in the starting configuration, with 10% added for
 231 headroom. For the constraints, the value of :math:`r_{\mathrm{con}}` is
 232 determined by taking the maximum distance that (``lincs_order + 1``) bonds
 233 can cover when they all connect at angles of 120 degrees. The actual
 234 constraint communication is not limited by :math:`r_{\mathrm{con}}`, but
 235 by the minimum cell size :math:`L_C`, which has the following lower
 236 limit:
 237
 238 .. math:: L_C \geq \max(r_{\mathrm{mb}},r_{\mathrm{con}})
 239
 240 Without dynamic load balancing the system is actually allowed to scale
 241 beyond this limit when pressure scaling is used. **Note** that for
 242 triclinic boxes, :math:`L_C` is not simply the box diagonal component
 243 divided by the number of cells in that direction, rather it is the
 244 shortest distance between the triclinic cells borders. For rhombic
 245 dodecahedra this is a factor of :math:`\sqrt{3/2}` shorter along
 246 :math:`x` and :math:`y`.
 247
 248 When :math:`r_{\mathrm{mb}} > r_c`, :ref:`mdrun <gmx mdrun>` employs a smart algorithm to
 249 reduce the communication. Simply communicating all charge groups within
 250 :math:`r_{\mathrm{mb}}` would increase the amount of communication
 251 enormously. Therefore only charge-groups that are connected by bonded
 252 interactions to charge groups which are not locally present are
 253 communicated. This leads to little extra communication, but also to a
 254 slightly increased cost for the domain decomposition setup. In some
 255 cases, *e.g.* coarse-grained simulations with a very short cut-off, one
 256 might want to set :math:`r_{\mathrm{mb}}` by hand to reduce this cost.
 257
 258 .. _mpmdpme:
 259
 260 Multiple-Program, Multiple-Data PME parallelization
 261 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 262
 263 Electrostatics interactions are long-range, therefore special algorithms
 264 are used to avoid summation over many atom pairs. In |Gromacs| this is
 265 usually PME (sec. :ref:`pme`). Since with PME all particles interact with
 266 each other, global communication is required. This will usually be the
 267 limiting factor for scaling with domain decomposition. To reduce the
 268 effect of this problem, we have come up with a Multiple-Program,
 269 Multiple-Data approach \ :ref:`5 <refHess2008b>`. Here, some ranks are
 270 selected to do only the PME mesh calculation, while the other ranks,
 271 called particle-particle (PP) ranks, do all the rest of the work. For
 272 rectangular boxes the optimal PP to PME rank ratio is usually 3:1, for
 273 rhombic dodecahedra usually 2:1. When the number of PME ranks is reduced
 274 by a factor of 4, the number of communication calls is reduced by about
 275 a factor of 16. Or put differently, we can now scale to 4 times more
 276 ranks. In addition, for modern 4 or 8 core machines in a network, the
 277 effective network bandwidth for PME is quadrupled, since only a quarter
 278 of the cores will be using the network connection on each machine during
 279 the PME calculations.
 280
 281 .. _fig-mpmdpme:
 282
 283 .. figure:: plots/mpmd-pme.*
 284    :width: 12.00000cm
 285
 286    Example of 8 ranks without (left) and with (right) MPMD. The
 287    PME communication (red arrows) is much higher on the left than on the
 288    right. For MPMD additional PP - PME coordinate and force
 289    communication (blue arrows) is required, but the total communication
 290    complexity is lower.
 291
 292 :ref:`mdrun <gmx mdrun>` will by default interleave the PP and PME ranks.
 293 If the ranks are not number consecutively inside the machines, one might
 294 want to use :ref:`mdrun <gmx mdrun>` ``-ddorder pp_pme``. For machines with a
 295 real 3-D torus and proper communication software that assigns the ranks
 296 accordingly one should use :ref:`mdrun <gmx mdrun>` ``-ddorder cartesian``.
 297
 298 To optimize the performance one should usually set up the cut-offs and
 299 the PME grid such that the PME load is 25 to 33% of the total
 300 calculation load. :ref:`grompp <gmx grompp>` will print an estimate for this load at the end
 301 and also :ref:`mdrun <gmx mdrun>` calculates the same estimate to determine the optimal
 302 number of PME ranks to use. For high parallelization it might be
 303 worthwhile to optimize the PME load with the :ref:`mdp` settings and/or the
 304 number of PME ranks with the ``-npme`` option of :ref:`mdrun <gmx mdrun>`. For changing the
 305 electrostatics settings it is useful to know the accuracy of the
 306 electrostatics remains nearly constant when the Coulomb cut-off and the
 307 PME grid spacing are scaled by the same factor. **Note** that it is
 308 usually better to overestimate than to underestimate the number of PME
 309 ranks, since the number of PME ranks is smaller than the number of PP
 310 ranks, which leads to less total waiting time.
 311
 312 The PME domain decomposition can be 1-D or 2-D along the :math:`x`
 313 and/or :math:`y` axis. 2-D decomposition is also known as pencil
 314 decomposition because of the shape of the domains at high
 315 parallelization. 1-D decomposition along the :math:`y` axis can only be
 316 used when the PP decomposition has only 1 domain along :math:`x`. 2-D
 317 PME decomposition has to have the number of domains along :math:`x`
 318 equal to the number of the PP decomposition. :ref:`mdrun <gmx mdrun>` automatically chooses
 319 1-D or 2-D PME decomposition (when possible with the total given number
 320 of ranks), based on the minimum amount of communication for the
 321 coordinate redistribution in PME plus the communication for the grid
 322 overlap and transposes. To avoid superfluous communication of
 323 coordinates and forces between the PP and PME ranks, the number of DD
 324 cells in the :math:`x` direction should ideally be the same or a
 325 multiple of the number of PME ranks. By default, :ref:`mdrun <gmx mdrun>` takes care of
 326 this issue.
 327
 328 Domain decomposition flow chart
 329 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 330
 331 In :numref:`Fig. %s <fig-ddflow>` a flow chart is shown for domain decomposition
 332 with all possible communication for different algorithms. For simpler
 333 simulations, the same flow chart applies, without the algorithms and
 334 communication for the algorithms that are not used.
 335
 336 .. _fig-ddflow:
 337
 338 .. figure:: plots/flowchart.*
 339    :width: 12.00000cm
 340
 341    Flow chart showing the algorithms and communication (arrows)
 342    for a standard MD simulation with virtual sites, constraints and
 343    separate PME-mesh ranks.
 344
 345 .. raw:: latex
 346
 347     \clearpage
 348
 349