docs/reference-manual/parallelization-domain-decomp.rst

   1 Parallelization
   2 ---------------
   3
   4 The CPU time required for a simulation can be reduced by running the
   5 simulation in parallel over more than one core. Ideally, one would want
   6 to have linear scaling: running on :math:`N` cores makes the simulation
   7 :math:`N` times faster. In practice this can only be achieved for a
   8 small number of cores. The scaling will depend a lot on the algorithms
   9 used. Also, different algorithms can have different restrictions on the
  10 interaction ranges between atoms.
  11
  12 Domain decomposition
  13 --------------------
  14
  15 Since most interactions in molecular simulations are local, domain
  16 decomposition is a natural way to decompose the system. In domain
  17 decomposition, a spatial domain is assigned to each rank, which will
  18 then integrate the equations of motion for the particles that currently
  19 reside in its local domain. With domain decomposition, there are two
  20 choices that have to be made: the division of the unit cell into domains
  21 and the assignment of the forces to domains. Most molecular simulation
  22 packages use the half-shell method for assigning the forces. But there
  23 are two methods that always require less communication: the eighth
  24 shell \ :ref:`69 <refLiem1991>` and the midpoint \ :ref:`70 <refShaw2006>`
  25 method. |Gromacs| currently uses the eighth shell method, but
  26 for certain systems or hardware architectures it might be advantageous
  27 to use the midpoint method. Therefore, we might implement the midpoint
  28 method in the future. Most of the details of the domain decomposition
  29 can be found in the |Gromacs| 4 paper \ :ref:`5 <refHess2008b>`.
  30
  31 Coordinate and force communication
  32 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  33
  34 In the most general case of a triclinic unit cell, the space in divided
  35 with a 1-, 2-, or 3-D grid in parallelepipeds that we call domain
  36 decomposition cells. Each cell is assigned to a particle-particle rank.
  37 The system is partitioned over the ranks at the beginning of each MD
  38 step in which neighbor searching is performed. Since the neighbor
  39 searching is based on charge groups, charge groups are also the units
  40 for the domain decomposition. Charge groups are assigned to the cell
  41 where their center of geometry resides. Before the forces can be
  42 calculated, the coordinates from some neighboring cells need to be
  43 communicated, and after the forces are calculated, the forces need to be
  44 communicated in the other direction. The communication and force
  45 assignment is based on zones that can cover one or multiple cells. An
  46 example of a zone setup is shown in :numref:`Fig. %s <fig-ddcells>`.
  47
  48 .. _fig-ddcells:
  49
  50 .. figure:: plots/dd-cells.*
  51    :width: 6.00000cm
  52
  53    A non-staggered domain decomposition grid of
  54    3\ :math:`\times`\ 2\ :math:`\times`\ 2 cells. Coordinates in zones 1
  55    to 7 are communicated to the corner cell that has its home particles
  56    in zone 0. :math:`r_c` is the cut-off radius.
  57
  58 The coordinates are communicated by moving data along the “negative”
  59 direction in :math:`x`, :math:`y` or :math:`z` to the next neighbor.
  60 This can be done in one or multiple pulses. In :numref:`Fig. %s <fig-ddcells>` two
  61 pulses in :math:`x` are required, then one in :math:`y` and then one in
  62 :math:`z`. The forces are communicated by reversing this procedure. See
  63 the |Gromacs| 4 paper \ :ref:`5 <refHess2008b>` for details on determining which
  64 non-bonded and bonded forces should be calculated on which rank.
  65
  66 Dynamic load balancing
  67 ~~~~~~~~~~~~~~~~~~~~~~
  68
  69 When different ranks have a different computational load (load
  70 imbalance), all ranks will have to wait for the one that takes the most
  71 time. One would like to avoid such a situation. Load imbalance can occur
  72 due to four reasons:
  73
  74 -  inhomogeneous particle distribution
  75
  76 -  inhomogeneous interaction cost distribution (charged/uncharged,
  77    water/non-water due to |Gromacs| water innerloops)
  78
  79 -  statistical fluctuation (only with small particle numbers)
  80
  81 -  differences in communication time, due to network topology and/or
  82    other jobs on the machine interfering with our communication
  83
  84 So we need a dynamic load balancing algorithm where the volume of each
  85 domain decomposition cell can be adjusted *independently*. To achieve
  86 this, the 2- or 3-D domain decomposition grids need to be staggered.
  87 :numref:`Fig. %s <fig-ddtric>` shows the most general case in 2-D. Due to the
  88 staggering, one might require two distance checks for deciding if a
  89 charge group needs to be communicated: a non-bonded distance and a
  90 bonded distance check.
  91
  92 .. _fig-ddtric:
  93
  94 .. figure:: plots/dd-tric.*
  95    :width: 7.00000cm
  96
  97    The zones to communicate to the rank of zone 0, see the text
  98    for details. :math:`r_c` and :math:`r_b` are the non-bonded and
  99    bonded cut-off radii respectively, :math:`d` is an example of a
 100    distance between following, staggered boundaries of cells.
 101
 102 By default, :ref:`mdrun <gmx mdrun>` automatically turns on the dynamic load balancing
 103 during a simulation when the total performance loss due to the force
 104 calculation imbalance is 2% or more. **Note** that the reported force
 105 load imbalance numbers might be higher, since the force calculation is
 106 only part of work that needs to be done during an integration step. The
 107 load imbalance is reported in the log file at log output steps and when
 108 the ``-v`` option is used also on screen. The average load imbalance and the
 109 total performance loss due to load imbalance are reported at the end of
 110 the log file.
 111
 112 There is one important parameter for the dynamic load balancing, which
 113 is the minimum allowed scaling. By default, each dimension of the domain
 114 decomposition cell can scale down by at least a factor of 0.8. For 3-D
 115 domain decomposition this allows cells to change their volume by about a
 116 factor of 0.5, which should allow for compensation of a load imbalance
 117 of 100%. The minimum allowed scaling can be changed with the
 118 ``-dds`` option of :ref:`mdrun <gmx mdrun>`.
 119
 120 The load imbalance is measured by timing a single region of the MD step
 121 on each MPI rank. This region can not include MPI communication, as
 122 timing of MPI calls does not allow separating wait due to imbalance from
 123 actual communication. The domain volumes are then scaled, with
 124 under-relaxation, inversely proportional with the measured time. This
 125 procedure will decrease the load imbalance when the change in load in
 126 the measured region correlates with the change in domain volume and the
 127 load outside the measured region does not depend strongly on the domain
 128 volume. In CPU-only simulations, the load is measured between the
 129 coordinate and the force communication. In simulations with non-bonded
 130 work on GPUs, we overlap communication and work on the CPU with
 131 calculation on the GPU. Therefore we measure from the last communication
 132 before the force calculation to when the CPU or GPU is finished,
 133 whichever is last. When not using PME ranks, we subtract the time in PME
 134 from the CPU time, as this includes MPI calls and the PME load is
 135 independent of domain size. This generally works well, unless the
 136 non-bonded load is low and there is imbalance in the bonded
 137 interactions. Then two issues can arise. Dynamic load balancing can
 138 increase the imbalance in update and constraints and with PME the
 139 coordinate and force redistribution time can go up significantly.
 140 Although dynamic load balancing can significantly improve performance in
 141 cases where there is imbalance in the bonded interactions on the CPU,
 142 there are many situations in which some domains continue decreasing in
 143 size and the load imbalance increases and/or PME coordinate and force
 144 redistribution cost increases significantly. As of version 2016.1, :ref:`mdrun <gmx mdrun>`
 145 disables the dynamic load balancing when measurement indicates that it
 146 deteriorates performance. This means that in most cases the user will
 147 get good performance with the default, automated dynamic load balancing
 148 setting.
 149
 150 .. _plincs:
 151
 152 Constraints in parallel
 153 ~~~~~~~~~~~~~~~~~~~~~~~
 154
 155 Since with domain decomposition parts of molecules can reside on
 156 different ranks, bond constraints can cross cell boundaries. Therefore a
 157 parallel constraint algorithm is required. |Gromacs| uses the P-LINCS
 158 algorithm \ :ref:`50 <refHess2008a>`, which is the parallel version of the LINCS
 159 algorithm \ :ref:`49 <refHess97>` (see :ref:`lincs`). The P-LINCS procedure
 160 is illustrated in :numref:`Fig. %s <fig-plincs>`. When molecules cross the cell
 161 boundaries, atoms in such molecules up to (``lincs_order + 1``) bonds away
 162 are communicated over the cell boundaries. Then, the normal LINCS
 163 algorithm can be applied to the local bonds plus the communicated ones.
 164 After this procedure, the local bonds are correctly constrained, even
 165 though the extra communicated ones are not. One coordinate communication
 166 step is required for the initial LINCS step and one for each iteration.
 167 Forces do not need to be communicated.
 168
 169 .. _fig-plincs:
 170
 171 .. figure:: plots/par-lincs2.*
 172    :width: 6.00000cm
 173
 174    Example of the parallel setup of P-LINCS with one molecule
 175    split over three domain decomposition cells, using a matrix expansion
 176    order of 3. The top part shows which atom coordinates need to be
 177    communicated to which cells. The bottom parts show the local
 178    constraints (solid) and the non-local constraints (dashed) for each
 179    of the three cells.
 180
 181 Interaction ranges
 182 ~~~~~~~~~~~~~~~~~~
 183
 184 Domain decomposition takes advantage of the locality of interactions.
 185 This means that there will be limitations on the range of interactions.
 186 By default, :ref:`mdrun <gmx mdrun>` tries to find the optimal balance between interaction
 187 range and efficiency. But it can happen that a simulation stops with an
 188 error message about missing interactions, or that a simulation might run
 189 slightly faster with shorter interaction ranges. A list of interaction
 190 ranges and their default values is given in :numref:`Table %s <table-ddranges>`
 191
 192 .. |nbrange| replace:: :math:`r_c`\ =\ max(\ :math:`r_{\mathrm{list}}`\ ,\ :math:`r_{\mathrm{VdW}}`\ ,\ :math:`r_{\mathrm{Coul}}`\ )
 193 .. |tbrange| replace:: max(:math:`r_{\mathrm{mb}}`\ ,\ :math:`r_c`)
 194 .. |mbrange| replace:: :math:`r_{\mathrm{mb}}`
 195 .. |csrange| replace:: :math:`r_{\mathrm{con}}`
 196 .. |vsrange| replace:: :math:`r_{\mathrm{con}}`
 197 .. |mdrunr| replace:: :ref:`mdrun <gmx mdrun>` ``-rdd``
 198 .. |mdrunc| replace:: :ref:`mdrun <gmx mdrun>` ``-rcon``
 199
 200 .. _table-ddranges:
 201
 202 .. table:: The interaction ranges with domain decomposition.
 203     :widths: auto
 204     :align: center
 205
 206     +-------------------+-----------+-----------------+------------------------+
 207     | interaction       | range     | option          | default                |
 208     +===================+===========+=================+========================+
 209     | non-bonded        | |nbrange| | :ref:`mdp` file |                        |
 210     +-------------------+-----------+-----------------+------------------------+
 211     | two-body bonded   | |tbrange| | |mdrunr|        | starting conf. + 10%   |
 212     +-------------------+-----------+-----------------+------------------------+
 213     | multi-body bonded | |mbrange| | |mdrunr|        | starting conf. + 10%   |
 214     +-------------------+-----------+-----------------+------------------------+
 215     | constraints       | |csrange| | |mdrunc|        | est. from bond lengths |
 216     +-------------------+-----------+-----------------+------------------------+
 217     | virtual sites     | |vsrange| | |mdrunc|        | 0                      |
 218     +-------------------+-----------+-----------------+------------------------+
 219
 220 In most cases the defaults of :ref:`mdrun <gmx mdrun>` should not cause the simulation to
 221 stop with an error message of missing interactions. The range for the
 222 bonded interactions is determined from the distance between bonded
 223 charge-groups in the starting configuration, with 10% added for
 224 headroom. For the constraints, the value of :math:`r_{\mathrm{con}}` is
 225 determined by taking the maximum distance that (``lincs_order + 1``) bonds
 226 can cover when they all connect at angles of 120 degrees. The actual
 227 constraint communication is not limited by :math:`r_{\mathrm{con}}`, but
 228 by the minimum cell size :math:`L_C`, which has the following lower
 229 limit:
 230
 231 .. math:: L_C \geq \max(r_{\mathrm{mb}},r_{\mathrm{con}})
 232
 233 Without dynamic load balancing the system is actually allowed to scale
 234 beyond this limit when pressure scaling is used. **Note** that for
 235 triclinic boxes, :math:`L_C` is not simply the box diagonal component
 236 divided by the number of cells in that direction, rather it is the
 237 shortest distance between the triclinic cells borders. For rhombic
 238 dodecahedra this is a factor of :math:`\sqrt{3/2}` shorter along
 239 :math:`x` and :math:`y`.
 240
 241 When :math:`r_{\mathrm{mb}} > r_c`, :ref:`mdrun <gmx mdrun>` employs a smart algorithm to
 242 reduce the communication. Simply communicating all charge groups within
 243 :math:`r_{\mathrm{mb}}` would increase the amount of communication
 244 enormously. Therefore only charge-groups that are connected by bonded
 245 interactions to charge groups which are not locally present are
 246 communicated. This leads to little extra communication, but also to a
 247 slightly increased cost for the domain decomposition setup. In some
 248 cases, *e.g.* coarse-grained simulations with a very short cut-off, one
 249 might want to set :math:`r_{\mathrm{mb}}` by hand to reduce this cost.
 250
 251 .. _mpmdpme:
 252
 253 Multiple-Program, Multiple-Data PME parallelization
 254 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 255
 256 Electrostatics interactions are long-range, therefore special algorithms
 257 are used to avoid summation over many atom pairs. In |Gromacs| this is
 258 usually PME (sec. :ref:`pme`). Since with PME all particles interact with
 259 each other, global communication is required. This will usually be the
 260 limiting factor for scaling with domain decomposition. To reduce the
 261 effect of this problem, we have come up with a Multiple-Program,
 262 Multiple-Data approach \ :ref:`5 <refHess2008b>`. Here, some ranks are
 263 selected to do only the PME mesh calculation, while the other ranks,
 264 called particle-particle (PP) ranks, do all the rest of the work. For
 265 rectangular boxes the optimal PP to PME rank ratio is usually 3:1, for
 266 rhombic dodecahedra usually 2:1. When the number of PME ranks is reduced
 267 by a factor of 4, the number of communication calls is reduced by about
 268 a factor of 16. Or put differently, we can now scale to 4 times more
 269 ranks. In addition, for modern 4 or 8 core machines in a network, the
 270 effective network bandwidth for PME is quadrupled, since only a quarter
 271 of the cores will be using the network connection on each machine during
 272 the PME calculations.
 273
 274 .. _fig-mpmdpme:
 275
 276 .. figure:: plots/mpmd-pme.*
 277    :width: 12.00000cm
 278
 279    Example of 8 ranks without (left) and with (right) MPMD. The
 280    PME communication (red arrows) is much higher on the left than on the
 281    right. For MPMD additional PP - PME coordinate and force
 282    communication (blue arrows) is required, but the total communication
 283    complexity is lower.
 284
 285 :ref:`mdrun <gmx mdrun>` will by default interleave the PP and PME ranks.
 286 If the ranks are not number consecutively inside the machines, one might
 287 want to use :ref:`mdrun <gmx mdrun>` ``-ddorder pp_pme``. For machines with a
 288 real 3-D torus and proper communication software that assigns the ranks
 289 accordingly one should use :ref:`mdrun <gmx mdrun>` ``-ddorder cartesian``.
 290
 291 To optimize the performance one should usually set up the cut-offs and
 292 the PME grid such that the PME load is 25 to 33% of the total
 293 calculation load. :ref:`grompp <gmx grompp>` will print an estimate for this load at the end
 294 and also :ref:`mdrun <gmx mdrun>` calculates the same estimate to determine the optimal
 295 number of PME ranks to use. For high parallelization it might be
 296 worthwhile to optimize the PME load with the :ref:`mdp` settings and/or the
 297 number of PME ranks with the ``-npme`` option of :ref:`mdrun <gmx mdrun>`. For changing the
 298 electrostatics settings it is useful to know the accuracy of the
 299 electrostatics remains nearly constant when the Coulomb cut-off and the
 300 PME grid spacing are scaled by the same factor. **Note** that it is
 301 usually better to overestimate than to underestimate the number of PME
 302 ranks, since the number of PME ranks is smaller than the number of PP
 303 ranks, which leads to less total waiting time.
 304
 305 The PME domain decomposition can be 1-D or 2-D along the :math:`x`
 306 and/or :math:`y` axis. 2-D decomposition is also known as pencil
 307 decomposition because of the shape of the domains at high
 308 parallelization. 1-D decomposition along the :math:`y` axis can only be
 309 used when the PP decomposition has only 1 domain along :math:`x`. 2-D
 310 PME decomposition has to have the number of domains along :math:`x`
 311 equal to the number of the PP decomposition. :ref:`mdrun <gmx mdrun>` automatically chooses
 312 1-D or 2-D PME decomposition (when possible with the total given number
 313 of ranks), based on the minimum amount of communication for the
 314 coordinate redistribution in PME plus the communication for the grid
 315 overlap and transposes. To avoid superfluous communication of
 316 coordinates and forces between the PP and PME ranks, the number of DD
 317 cells in the :math:`x` direction should ideally be the same or a
 318 multiple of the number of PME ranks. By default, :ref:`mdrun <gmx mdrun>` takes care of
 319 this issue.
 320
 321 Domain decomposition flow chart
 322 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 323
 324 In :numref:`Fig. %s <fig-ddflow>` a flow chart is shown for domain decomposition
 325 with all possible communication for different algorithms. For simpler
 326 simulations, the same flow chart applies, without the algorithms and
 327 communication for the algorithms that are not used.
 328
 329 .. _fig-ddflow:
 330
 331 .. figure:: plots/flowchart.*
 332    :width: 12.00000cm
 333
 334    Flow chart showing the algorithms and communication (arrows)
 335    for a standard MD simulation with virtual sites, constraints and
 336    separate PME-mesh ranks.
 337
 338 .. raw:: latex
 339
 340     \clearpage
 341
 342