--- /dev/null
+Managing long simulations
+=========================
+
+Molecular simulations often extend beyond the lifetime of a single
+UNIX command-line process. It is useful to be able to stop and
+restart the simulation in a
+way that is equivalent to a single run. When :ref:`gmx mdrun` is
+halted, it writes a checkpoint file that can restart the simulation
+exactly as if there had been no interruption. To do this, the checkpoint
+retains a full-precision version of the positions and velocities,
+along with the state information necessary to restart algorithms, e.g.
+those that implement coupling to external thermal reservoirs. A restart can
+be attempted using e.g. a :ref:`gro` file with velocities, but since
+the :ref:`gro` file has significantly less precision, and none of
+the coupling algorithms will have their state carried over, such
+a restart is less continuous than a normal MD step.
+
+Such a checkpoint file is also written periodically by :ref:`gmx
+mdrun` during the run. The interval, in minutes, is given by the ``-cpt`` flag to
+:ref:`gmx mdrun`. When :ref:`gmx mdrun` attempts to write each
+successive checkpoint file, it first renames the old file with the
+suffix ``_prev``, so that even if something goes wrong while writing
+the new checkpoint file, only recent progress can be lost.
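+
+The rename-then-write pattern described above can be illustrated with a
+minimal Python sketch. This is not the :ref:`gmx mdrun` implementation; the
+function name ``write_checkpoint``, the byte-string payload and the placement
+of ``_prev`` before the file extension are assumptions made for illustration.
+

```python
import os


def write_checkpoint(path, payload):
    """Write payload to path, first keeping any existing file as <stem>_prev<ext>.

    Hypothetical helper: it only illustrates the backup-before-write idea.
    """
    stem, ext = os.path.splitext(path)
    prev_path = stem + "_prev" + ext
    if os.path.exists(path):
        # Preserve the last complete checkpoint before writing the new one.
        os.replace(path, prev_path)
    with open(path, "wb") as handle:
        handle.write(payload)
        handle.flush()
        # Ask the operating system to push the data to storage.
        os.fsync(handle.fileno())
    return prev_path
```

+
+If writing the new file fails part-way, the file with the ``_prev`` suffix
+still holds the previous complete state, so at most one checkpoint interval
+of progress is lost.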
+
+:ref:`gmx mdrun` can be halted in several ways:
+
+* the number of simulation :mdp:`nsteps` can expire
+* the user issues a termination signal (e.g. with Ctrl-C on the terminal)
+* the job scheduler issues a termination signal when time expires
+* :ref:`gmx mdrun` detects that the wall-clock time limit specified with
+ ``-maxh`` has elapsed (this option is useful to help cooperate with
+ a job scheduler, but can be problematic if jobs can be suspended)
+* some kind of catastrophic failure, such as loss of power, or a
+ disk filling up, or a network failing
+
+To use the checkpoint file for a restart, use a command line such as
+
+::
+
+ gmx mdrun -cpi state
+
+which directs mdrun to use the checkpoint file (which is named
+``state.cpt`` by default). You can choose to give the output
+checkpoint file a different name with the ``-cpo`` flag, but if so
+then you must provide that name as input to ``-cpi`` when you later
+use that file. You can
+query the contents of checkpoint files with :ref:`gmx check` and
+:ref:`gmx dump`.
+
+Appending to output files
+-------------------------
+
+By default, :ref:`gmx mdrun` will append to the old output files. If
+the previous part ended in a regular way, then the performance data at
+the end of the log file will be removed, some new information
+about the run context written, and the simulation will proceed. Otherwise,
+mdrun will truncate all the output files back to the time of the last
+written checkpoint file, and continue from there, as if the simulation
+stopped at that checkpoint in a regular way.
+
+You can choose not to append to the output files by using the
+``-noappend`` flag, which forces mdrun to write each output to a
+separate file, whose name includes a ``.partXXXX`` string to describe
+which simulation part is contained in this file. This numbering starts
+from zero and increases monotonically as simulations are restarted,
+but does not reflect the number of simulation steps in each part. The
+:mdp:`simulation-part` option can be used to set this number manually
+in :ref:`gmx grompp`, which can be useful if data has been lost,
+e.g. through filesystem failure or user error.
+
+Appending will not work if any output files have been modified or
+removed after mdrun wrote them, because the checkpoint file maintains
+a checksum of each file that it will verify before it writes to them
+again. In such cases, you must either restore the files, name them
+as the checkpoint file expects, or continue with ``-noappend``. If
+your original run used ``-deffnm``, and you want appending, then
+your continuations must also use ``-deffnm``.
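+
+The idea of verifying a recorded checksum before appending can be sketched
+in Python as follows. This is only an illustration under stated assumptions:
+the helper names ``file_digest`` and ``safe_to_append`` are hypothetical, and
+SHA-256 is chosen for convenience, not because :ref:`gmx mdrun` is known to
+use it.
+

```python
import hashlib


def file_digest(path):
    """Return a hex digest of the file contents (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large output files need not fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_to_append(path, recorded_digest):
    """Return True only if the file still matches the recorded digest."""
    try:
        return file_digest(path) == recorded_digest
    except FileNotFoundError:
        # A removed file can never be appended to.
        return False
```

+
+A file that was edited or deleted after the digest was recorded fails this
+check, which mirrors the situations above in which appending is refused.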
+
+Backing up your files
+---------------------
+
+You should arrange to back up your simulation files frequently. Network
+file systems on clusters can be configured in more or less conservative
+ways, and this can lead :ref:`gmx mdrun` to be told that a checkpoint
+file has been written to disk when actually it is still in memory
+somewhere and vulnerable to a power failure or disk that fills or
+fails in the meantime. The UNIX tool rsync can be a useful way to
+periodically copy your simulation output to a remote storage location,
+which works safely even while the simulation is underway. Keeping a copy
+of the final checkpoint file from each part of a job submitted to a
+cluster can be useful if a file system is unreliable.
+
+Extending a .tpr file
+---------------------
+
+If the simulation described by a :ref:`tpr` file has completed and should
+be extended, use the :ref:`gmx convert-tpr` tool to extend the run, e.g.
+
+::
+
+ gmx convert-tpr -s previous.tpr -extend timetoextendby -o next.tpr
+ gmx mdrun -s next.tpr -cpi state.cpt
+
+The time can also be extended using the ``-until`` and ``-nsteps``
+options. Note that the original :ref:`mdp` file may have generated
+velocities, but that is a one-time operation within :ref:`gmx grompp`
+that is never performed again by any other tool.
+
+Changing mdp options for a restart
+----------------------------------
+
+If you wish to make changes to your simulation settings other than
+length, then you should do so in the :ref:`mdp` file or topology, and
+then call
+
+::
+
+ gmx grompp -f possibly-changed.mdp -p possibly-changed.top -c state.cpt -o new.tpr
+ gmx mdrun -s new.tpr -cpi state.cpt
+
+to instruct :ref:`gmx grompp` to copy the full-precision coordinates
+in the checkpoint file into the new :ref:`tpr` file. You should
+consider your choices for :mdp:`tinit`, :mdp:`init-step`,
+:mdp:`nsteps` and :mdp:`simulation-part`. You should generally not
+regenerate velocities with :mdp:`gen-vel`, and generally select
+:mdp:`continuation` so that constraints are not re-applied before
+the first integration step.
+
+Restarts without checkpoint files
+---------------------------------
+
+It is possible to perform an exact restart of a simulation if you lack a
+checkpoint file but have a matching pair of frames in a :ref:`trr` and
+:ref:`edr` file written by :ref:`gmx mdrun`. To do this, use
+
+::
+
+ gmx convert-tpr -s old.tpr -e matching.edr -t matching.trr -o new.tpr
+
+Are continuations exact?
+------------------------
+
+If you had a computer with unlimited precision, or if you integrated
+the time-discretized equations of motion by hand, exact continuation
+would lead to identical results. But since practical computers have
+limited precision and MD is chaotic, trajectories will diverge very
+rapidly even if one bit is different. Such trajectories will all be
+equally valid, but eventually very different. Continuation using a
+checkpoint file, using the same code compiled with the same compiler
+and running on the same computer architecture using the same number of
+processors without GPUs (see next section) would lead to binary
+identical results. However,
+by default the actual work load will be balanced across the hardware
+according to the observed execution times. Such trajectories are
+in principle not reproducible, and in particular a run that took
+place in more than one part will not be identical with an equivalent
+run in one part - but neither of them is better in any sense.
+
+Reproducibility
+---------------
+
+The following factors affect the reproducibility of a simulation, and thus its output:
+
+* Precision (mixed / double) with double giving "better" reproducibility.
+* Number of cores, due to different order in which forces are
+ accumulated. For instance (a+b)+c is not necessarily binary
+ identical to a+(b+c) in floating-point arithmetic.
+* Type of processors. Even within the same processor family there can be slight differences.
+* Optimization level when compiling.
+* Optimizations at run time: e.g. the FFTW library that is typically
+ used for fast Fourier transforms determines at startup which version
+ of their algorithms is fastest, and uses that for the remainder of
+ the calculations. Since the speed estimate is not deterministic, the
+ results may vary from run to run.
+* Random numbers used for instance as a seed for generating velocities
+ (in GROMACS at the preprocessing stage).
+* Uninitialized variables in the code (but there shouldn't be any)
+* Dynamic linking to different versions of shared libraries (e.g. for FFTs)
+* Dynamic load balancing, since particles are redistributed to
+ processors based on elapsed wallclock time, which will lead to
+ (a+b)+c != a+(b+c) issues as above
+* Number of PME-only ranks (for parallel PME simulations)
+* MPI reductions typically do not guarantee the order of the
+ operations, and so the absence of associativity for floating-point
+ arithmetic means the result of a reduction depends on the order
+ actually chosen
+* On GPUs, the reduction of e.g. non-bonded forces has a non-deterministic
+ summation order, so any fast implementation is non-reproducible by
+ design.
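+
+The non-associativity of floating-point addition that underlies several of
+the items above can be seen in a short Python example. The values are
+arbitrary and chosen only for illustration; Python floats are IEEE 754
+double precision.
+

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # a + b is rounded first, then c is added
right = a + (b + c)  # b + c is rounded first, then a is added

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

+
+The two results differ only in the last bit, but in a chaotic system such
+as an MD simulation a difference of that size is enough for trajectories to
+diverge.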
+
+The important question is whether it is a problem if simulations are
+not completely reproducible. The answer is yes and no. Reproducibility
+is a cornerstone of science in general, and hence it is important.
+The `Central Limit Theorem <https://en.wikipedia.org/wiki/Central_limit_theorem>`_
+tells us that in the case of infinitely long
+simulations, all observables converge to their equilibrium
+values. Molecular simulations in GROMACS adhere to this theorem, and
+hence, for instance, the energy of your system will converge to a
+finite value, the diffusion constant of your water molecules will
+converge to a finite value, and so on. That means all the important
+observables, which are the values you would like to get out of your
+simulation, are reproducible. Each individual trajectory is not
+reproducible, however.
+
+However, there are a few cases where it would be useful if
+trajectories were reproducible, too. These include developers doing
+debugging, and searching for a rare event in a trajectory. In the
+latter case, if the event occurs, you want to have saved a checkpoint
+file from before it, so that you can restart the simulation under
+different conditions, e.g. writing output much more frequently.
+
+In order to obtain a reproducible trajectory, it is important
+to look over the list above and eliminate the factors that could
+affect it. Further, using
+
+::
+
+ gmx mdrun -reprod
+
+will eliminate all sources of non-reproducibility that it can,
+i.e. same executable + same hardware + same shared libraries + same
+run input file + same command line parameters will lead to
+reproducible results.