docs/user-guide/managing-simulations.rst

   1 .. _managing long simulations:
   2
   3 Managing long simulations
   4 =========================
   5
   6 Molecular simulations often extend beyond the lifetime of a single
   7 UNIX command-line process. It is useful to be able to stop and
   8 restart the simulation in a
   9 way that is equivalent to a single run. When :ref:`gmx mdrun` is
  10 halted, it writes a checkpoint file that can restart the simulation
  11 exactly as if there had been no interruption. To do this, the checkpoint
  12 retains a full-precision version of the positions and velocities,
  13 along with the state information needed to restart algorithms, e.g.
  14 those that implement coupling to external thermal reservoirs. A restart can
  15 be attempted using e.g. a :ref:`gro` file with velocities, but since
  16 the :ref:`gro` file has significantly less precision, and none of
  17 the coupling algorithms will have their state carried over, such
  18 a restart is less continuous than a normal MD step.
  19
  20 Such a checkpoint file is also written periodically by :ref:`gmx
  21 mdrun` during the run. The interval is given by the ``-cpt`` flag to
  22 :ref:`gmx mdrun`. When :ref:`gmx mdrun` attempts to write each
  23 successive checkpoint file, it first renames the old file with the
  24 suffix ``_prev``, so that even if something goes wrong while writing
  25 the new checkpoint file, only recent progress can be lost.
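
     For illustration, the following sketch requests more frequent
     checkpoints than the default. The value is interpreted as minutes of
     wall-clock time; the choice of 5 is arbitrary and only an example.

     ::

        # Write a checkpoint roughly every 5 minutes of wall-clock time
        gmx mdrun -cpt 5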
  26
  27 :ref:`gmx mdrun` can be halted in several ways:
  28
  29 * the number of simulation steps set by :mdp:`nsteps` is reached
  30 * the user issues a termination signal (e.g. with Ctrl-C on the terminal)
  31 * the job scheduler issues a termination signal when time expires
  32 * :ref:`gmx mdrun` detects that the length specified with
  33   ``-maxh`` has elapsed (this option is useful to help cooperate with
  34   a job scheduler, but can be problematic if jobs can be suspended)
  35 * some kind of catastrophic failure occurs, such as loss of power,
  36   a disk filling up, or a network failing
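
     The ``-maxh`` option above can be combined with ``-cpi`` in a job
     script. The following is a minimal sketch, not a complete script: the
     function name ``run_part`` and its argument are hypothetical, and the
     wall-time limit should be taken from your own scheduler settings.

     ::

        # Hypothetical helper: continue from state.cpt and stop cleanly
        # before the wall-time limit given in hours as the first argument.
        run_part() {
            gmx mdrun -cpi state -maxh "$1"
        }

     Calling ``run_part 24`` in a job with a 24-hour limit would then ask
     mdrun to stop shortly before the limit and write a final checkpoint.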
  37
  38 To use the checkpoint file for a restart, use a command line such as
  39
  40 ::
  41
  42    gmx mdrun -cpi state
  43
  44 which directs mdrun to use the checkpoint file (which is named
  45 ``state.cpt`` by default). You can choose to give the output
  46 checkpoint file a different name with the ``-cpo`` flag, but if so
  47 then you must provide that name as input to ``-cpi`` when you later
  48 use that file. You can
  49 query the contents of checkpoint files with :ref:`gmx check` and
  50 :ref:`gmx dump`.
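
     As a sketch of the options just described, the commands below write
     the checkpoint under a non-default name and then inspect it. The file
     name ``mystate.cpt`` is an arbitrary example.

     ::

        # Continue from, and keep writing to, a custom checkpoint file
        gmx mdrun -cpi mystate.cpt -cpo mystate.cpt

        # Summarize the checkpoint, or print its contents in readable form
        gmx check -f mystate.cpt
        gmx dump -cp mystate.cpt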
  51
  52 Appending to output files
  53 -------------------------
  54
  55 By default, :ref:`gmx mdrun` will append to the old output files. If
  56 the previous part ended in a regular way, then the performance data at
  57 the end of the log file will be removed, some new information
  58 about the run context written, and the simulation will proceed. Otherwise,
  59 mdrun will truncate all the output files back to the time of the last
  60 written checkpoint file, and continue from there, as if the simulation
  61 stopped at that checkpoint in a regular way.
  62
  63 You can choose not to append to the output files by using the
  64 ``-noappend`` flag, which forces mdrun to write each output to a
  65 separate file, whose name includes a ".partXXXX" string to describe
  66 which simulation part is contained in this file. This numbering starts
  67 from zero and increases monotonically as simulations are restarted,
  68 but does not reflect the number of simulation steps in each part. The
  69 :mdp:`simulation-part` option can be used to set this number manually
  70 in :ref:`gmx grompp`, which can be useful if data has been lost,
  71 e.g. through filesystem failure or user error.
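
     A restart without appending, followed by joining the parts afterwards,
     might look like the sketch below. The output names assume the default
     mdrun file names and are only examples; adjust the patterns to the
     files your run actually wrote.

     ::

        # Continue the run, writing each output to a new .partXXXX file
        gmx mdrun -cpi state -noappend

        # Later, concatenate the parts for analysis
        gmx trjcat -f traj_comp.part*.xtc -o traj_all.xtc
        gmx eneconv -f ener.part*.edr -o ener_all.edr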
  72
  73 Appending will not work if any output files have been modified or
  74 removed after mdrun wrote them, because the checkpoint file maintains
  75 a checksum of each file that it will verify before it writes to them
  76 again. In such cases, you must either restore the files, name them
  77 as the checkpoint file expects, or continue with ``-noappend``. If
  78 your original run used ``-deffnm``, and you want appending, then
  79 your continuations must also use ``-deffnm``.
  80
  81 Backing up your files
  82 ---------------------
  83
  84 You should arrange to back up your simulation files frequently. Network
  85 file systems on clusters can be configured in more or less conservative
  86 ways, and this can lead :ref:`gmx mdrun` to be told that a checkpoint
  87 file has been written to disk when actually it is still in memory
  88 somewhere and vulnerable to a power failure or a disk that fills or
  89 fails in the meantime. The UNIX tool rsync can be a useful way to
  90 periodically copy your simulation output to a remote storage location,
  91 which works safely even while the simulation is underway. Keeping a copy
  92 of the final checkpoint file from each part of a job submitted to a
  93 cluster can be useful if a file system is unreliable.
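
     A periodic copy with rsync could be sketched as follows. The function
     name ``backup_run``, the host ``backup.example.org`` and the paths are
     hypothetical placeholders.

     ::

        # Hypothetical helper: copy a run directory to remote storage.
        # -a preserves file attributes; --partial keeps partly transferred
        # files so that an interrupted copy can resume.
        backup_run() {
            rsync -a --partial "$1"/ "$2"
        }

     For example, ``backup_run ~/sim user@backup.example.org:backups/sim``
     could be run from a cron job or between parts of a multi-part job.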
  94
  95 Extending a .tpr file
  96 ---------------------
  97
  98 If the simulation described by a :ref:`tpr` file has completed and should
  99 be extended, use the :ref:`gmx convert-tpr` tool to extend the run, e.g.
 100
 101 ::
 102
 103    gmx convert-tpr -s previous.tpr -extend timetoextendby -o next.tpr
 104    gmx mdrun -s next.tpr -cpi state.cpt
 105
 106 The time can also be extended using the ``-until`` and ``-nsteps``
 107 options. Note that the original :ref:`mdp` file may have generated
 108 velocities, but that is a one-time operation within :ref:`gmx grompp`
 109 that is never performed again by any other tool.
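
     For instance, each of the following alternatives produces a run input
     file for a longer simulation. Times are given in ps, and the numbers
     are arbitrary examples.

     ::

        # Extend by 10000 ps beyond the current end time
        gmx convert-tpr -s previous.tpr -extend 10000 -o next.tpr

        # Or: run until a total simulation time of 50000 ps
        gmx convert-tpr -s previous.tpr -until 50000 -o next.tpr

        # Or: set the number of steps that remain to be run
        gmx convert-tpr -s previous.tpr -nsteps 500000 -o next.tpr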
 110
 111 Changing mdp options for a restart
 112 ----------------------------------
 113
 114 If you wish to make changes to your simulation settings other than
 115 length, then you should do so in the :ref:`mdp` file or topology, and
 116 then call
 117
 118 ::
 119
 120    gmx grompp -f possibly-changed.mdp -p possibly-changed.top -c state.cpt -o new.tpr
 121    gmx mdrun -s new.tpr -cpi state.cpt
 122
 123 to instruct :ref:`gmx grompp` to copy the full-precision coordinates
 124 in the checkpoint file into the new :ref:`tpr` file. You should
 125 consider your choices for :mdp:`tinit`, :mdp:`init-step`,
 126 :mdp:`nsteps` and :mdp:`simulation-part`. You should generally not
 127 regenerate velocities with :mdp:`gen-vel`, and generally select
 128 :mdp:`continuation` so that constraints are not re-applied before
 129 the first integration step.
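
     Before calling :ref:`gmx grompp`, it can help to check the relevant
     settings in the changed :ref:`mdp` file. The helper below is a
     hypothetical sketch using grep; it accepts both hyphen and underscore
     spellings of the option names.

     ::

        # Hypothetical helper: print the restart-related settings of the
        # mdp file given as the first argument.
        show_restart_settings() {
            grep -E '^[[:space:]]*(gen[-_]vel|continuation|tinit|init[-_]step|nsteps|simulation[-_]part)[[:space:]]*=' "$1"
        }

     For a continuation, the output of
     ``show_restart_settings possibly-changed.mdp`` would typically include
     ``continuation = yes`` and ``gen-vel = no``.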
 130
 131 Restarts without checkpoint files
 132 ---------------------------------
 133
 134 It used to be possible to continue simulations without the checkpoint
 135 files. As this approach could be unreliable or lead to
 136 unphysical results, only restarts from checkpoints are permitted now.
 137
 138 Are continuations exact?
 139 ------------------------
 140
 141 If you had a computer with unlimited precision, or if you integrated
 142 the time-discretized equations of motion by hand, exact continuation
 143 would lead to identical results. But since practical computers have
 144 limited precision and MD is chaotic, trajectories will diverge very
 145 rapidly even if one bit is different. Such trajectories will all be
 146 equally valid, but eventually very different. Continuation using a
 147 checkpoint file, using the same code compiled with the same compiler
 148 and running on the same computer architecture using the same number of
 149 processors without GPUs (see next section) would lead to binary
 150 identical results. However,
 151 by default the actual work load will be balanced across the hardware
 152 according to the observed execution times. Such trajectories are
 153 in principle not reproducible, and in particular a run that took
 154 place in more than one part will not be identical to an equivalent
 155 run in one part, but neither of them is better in any sense.
 156
 157 Reproducibility
 158 ---------------
 159
 160 The following factors affect the reproducibility of a simulation, and thus its output:
 161
 162 * Precision (mixed / double) with double giving "better" reproducibility.
 163 * Number of cores, due to different order in which forces are
 164   accumulated. For instance (a+b)+c is not necessarily binary
 165   identical to a+(b+c) in floating-point arithmetic.
 166 * Type of processors. Even within the same processor family there can be slight differences.
 167 * Optimization level when compiling.
 168 * Optimizations at run time: e.g. the FFTW library that is typically
 169   used for fast Fourier transforms determines at startup which version
 170   of their algorithms is fastest, and uses that for the remainder of
 171   the calculations. Since the speed estimate is not deterministic, the
 172   results may vary from run to run.
 173 * Random numbers used for instance as a seed for generating velocities
 174   (in |Gromacs| at the preprocessing stage).
 175 * Uninitialized variables in the code (but there shouldn't be any)
 176 * Dynamic linking to different versions of shared libraries (e.g. for FFTs)
 177 * Dynamic load balancing, since particles are redistributed to
 178   processors based on elapsed wallclock time, which will lead to
 179   (a+b)+c != a+(b+c) issues as above
 180 * Number of PME-only ranks (for parallel PME simulations)
 181 * MPI reductions typically do not guarantee the order of the
 182   operations, and so the absence of associativity for floating-point
 183   arithmetic means the result of a reduction depends on the order
 184   actually chosen
 185 * On GPUs, the reduction of e.g. non-bonded forces has a non-deterministic
 186   summation order, so any fast implementation is non-reproducible by
 187   design.
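
     The non-associativity mentioned in the list above can be seen with a
     few double-precision numbers. This sketch uses awk only as a
     convenient calculator; the function name ``sum_both_orders`` is
     hypothetical, and the example is unrelated to how |Gromacs| actually
     accumulates forces.

     ::

        # Hypothetical helper: sum the same three numbers in two orders
        sum_both_orders() {
            awk 'BEGIN {
                a = 0.1; b = 0.2; c = 0.3
                printf "%.17g\n", (a + b) + c
                printf "%.17g\n", a + (b + c)
            }'
        }
        sum_both_orders

     The two printed values differ in their last digits, even though the
     sums are mathematically identical.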
 188
 189 The important question is whether it is a problem if simulations are
 190 not completely reproducible. The answer is yes and no. Reproducibility
 191 is a cornerstone of science in general, and hence it is important.
 192 The `Central Limit Theorem <https://en.wikipedia.org/wiki/Central_limit_theorem>`_
 193 tells us that in the case of infinitely long
 194 simulations, all observables converge to their equilibrium
 195 values. Molecular simulations in |Gromacs| adhere to this theorem, and
 196 hence, for instance, the energy of your system will converge to a
 197 finite value, the diffusion constant of your water molecules will
 198 converge to a finite value, and so on. That means all the important
 199 observables, which are the values you would like to get out of your
 200 simulation, are reproducible. Each individual trajectory is not
 201 reproducible, however.
 202
 203 However, there are a few cases where it would be useful if
 204 trajectories were reproducible, too. These include debugging by
 205 developers, and searching a trajectory for a rare event: once it
 206 occurs, you may want to go back to a manually saved checkpoint file
 207 and restart the simulation under different conditions, e.g.
 208 writing output much more frequently.
 209
 210 To obtain such a reproducible trajectory, it is important
 211 to look over the list above and eliminate the factors that could
 212 affect it. Further, using
 213
 214 ::
 215
 216    gmx mdrun -reprod
 217
 218 will eliminate all sources of non-reproducibility that it can,
 219 i.e. same executable + same hardware + same shared libraries + same
 220 run input file + same command line parameters will lead to
 221 reproducible results.
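
     To check whether two runs did produce the same trajectory, their
     outputs can be compared with :ref:`gmx check`. The file names below
     are hypothetical.

     ::

        # Compare the trajectories and the energies of two runs
        gmx check -f run1.trr -f2 run2.trr
        gmx check -e run1.edr -e2 run2.edr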