"([TT]-x[tt]).[PAR]",
"The option [TT]-dhdl[tt] is only used when free energy calculation is",
"turned on.[PAR]",
- "A simulation can be run in parallel using two different parallelization",
- "schemes: MPI parallelization and/or OpenMP thread parallelization.",
- "The MPI parallelization uses multiple processes when [TT]mdrun[tt] is",
- "compiled with a normal MPI library or threads when [TT]mdrun[tt] is",
- "compiled with the GROMACS built-in thread-MPI library. OpenMP threads",
- "are supported when [TT]mdrun[tt] is compiled with OpenMP. Full OpenMP support",
- "is only available with the Verlet cut-off scheme, with the (older)",
- "group scheme only PME-only ranks can use OpenMP parallelization.",
- "In all cases [TT]mdrun[tt] will by default try to use all the available",
- "hardware resources. With a normal MPI library only the options",
- "[TT]-ntomp[tt] (with the Verlet cut-off scheme) and [TT]-ntomp_pme[tt],",
- "for PME-only ranks, can be used to control the number of threads.",
- "With thread-MPI there are additional options [TT]-nt[tt], which sets",
- "the total number of threads, and [TT]-ntmpi[tt], which sets the number",
- "of thread-MPI threads.",
- "The number of OpenMP threads used by [TT]mdrun[tt] can also be set with",
- "the standard environment variable, [TT]OMP_NUM_THREADS[tt].",
- "The [TT]GMX_PME_NUM_THREADS[tt] environment variable can be used to specify",
- "the number of threads used by the PME-only ranks.[PAR]",
- "Note that combined MPI+OpenMP parallelization is in many cases",
- "slower than either on its own. However, at high parallelization, using the",
- "combination is often beneficial as it reduces the number of domains and/or",
- "the number of MPI ranks. (Less and larger domains can improve scaling,",
- "with separate PME ranks, using fewer MPI ranks reduces communication costs.)",
- "OpenMP-only parallelization is typically faster than MPI-only parallelization",
- "on a single CPU(-die). Since we currently don't have proper hardware",
- "topology detection, [TT]mdrun[tt] compiled with thread-MPI will only",
- "automatically use OpenMP-only parallelization when you use up to 4",
- "threads, up to 12 threads with Intel Nehalem/Westmere, or up to 16",
- "threads with Intel Sandy Bridge or newer CPUs. Otherwise MPI-only",
- "parallelization is used (except with GPUs, see below).",
- "[PAR]",
- "With GPUs (only supported with the Verlet cut-off scheme), the number",
- "of GPUs should match the number of particle-particle ranks, i.e.",
- "excluding PME-only ranks. With thread-MPI, unless set on the command line, the number",
- "of MPI threads will automatically be set to the number of GPUs detected.",
- "To use a subset of the available GPUs, or to manually provide a mapping of",
- "GPUs to PP ranks, you can use the [TT]-gpu_id[tt] option. The argument of [TT]-gpu_id[tt] is",
- "a string of digits (without delimiter) representing device id-s of the GPUs to be used.",
- "For example, \"[TT]02[tt]\" specifies using GPUs 0 and 2 in the first and second PP ranks per compute node",
- "respectively. To select different sets of GPU-s",
- "on different nodes of a compute cluster, use the [TT]GMX_GPU_ID[tt] environment",
- "variable instead. The format for [TT]GMX_GPU_ID[tt] is identical to ",
- "[TT]-gpu_id[tt], with the difference that an environment variable can have",
- "different values on different compute nodes. Multiple MPI ranks on each node",
- "can share GPUs. This is accomplished by specifying the id(s) of the GPU(s)",
- "multiple times, e.g. \"[TT]0011[tt]\" for four ranks sharing two GPUs in this node.",
- "This works within a single simulation, or a multi-simulation, with any form of MPI.",
- "[PAR]",
- "With the Verlet cut-off scheme and verlet-buffer-tolerance set,",
- "the pair-list update interval nstlist can be chosen freely with",
- "the option [TT]-nstlist[tt]. [TT]mdrun[tt] will then adjust",
- "the pair-list cut-off to maintain accuracy, and not adjust nstlist.",
- "Otherwise, by default, [TT]mdrun[tt] will try to increase the",
- "value of nstlist set in the [TT].mdp[tt] file to improve the",
- "performance. For CPU-only runs, nstlist might increase to 20, for",
- "GPU runs up to 40. For medium to high parallelization or with",
- "fast GPUs, a (user-supplied) larger nstlist value can give much",
- "better performance.",
- "[PAR]",
- "When using PME with separate PME ranks or with a GPU, the two major",
- "compute tasks, the non-bonded force calculation and the PME calculation",
- "run on different compute resources. If this load is not balanced,",
- "some of the resources will be idle part of time. With the Verlet",
- "cut-off scheme this load is automatically balanced when the PME load",
- "is too high (but not when it is too low). This is done by scaling",
- "the Coulomb cut-off and PME grid spacing by the same amount. In the first",
- "few hundred steps different settings are tried and the fastest is chosen",
- "for the rest of the simulation. This does not affect the accuracy of",
- "the results, but it does affect the decomposition of the Coulomb energy",
- "into particle and mesh contributions. The auto-tuning can be turned off",
- "with the option [TT]-notunepme[tt].",
- "[PAR]",
- "[TT]mdrun[tt] pins (sets affinity of) threads to specific cores,",
- "when all (logical) cores on a compute node are used by [TT]mdrun[tt],",
- "even when no multi-threading is used,",
- "as this usually results in significantly better performance.",
- "If the queuing systems or the OpenMP library pinned threads, we honor",
- "this and don't pin again, even though the layout may be sub-optimal.",
- "If you want to have [TT]mdrun[tt] override an already set thread affinity",
- "or pin threads when using less cores, use [TT]-pin on[tt].",
- "With SMT (simultaneous multithreading), e.g. Intel Hyper-Threading,",
- "there are multiple logical cores per physical core.",
- "The option [TT]-pinstride[tt] sets the stride in logical cores for",
- "pinning consecutive threads. Without SMT, 1 is usually the best choice.",
- "With Intel Hyper-Threading 2 is best when using half or less of the",
- "logical cores, 1 otherwise. The default value of 0 do exactly that:",
- "it minimizes the threads per logical core, to optimize performance.",
- "If you want to run multiple [TT]mdrun[tt] jobs on the same physical node,"
- "you should set [TT]-pinstride[tt] to 1 when using all logical cores.",
- "When running multiple [TT]mdrun[tt] (or other) simulations on the same physical",
- "node, some simulations need to start pinning from a non-zero core",
- "to avoid overloading cores; with [TT]-pinoffset[tt] you can specify",
- "the offset in logical cores for pinning.",
- "[PAR]",
- "When [TT]mdrun[tt] is started with more than 1 rank,",
- "parallelization with domain decomposition is used.",
- "[PAR]",
- "With domain decomposition, the spatial decomposition can be set",
- "with option [TT]-dd[tt]. By default [TT]mdrun[tt] selects a good decomposition.",
- "The user only needs to change this when the system is very inhomogeneous.",
- "Dynamic load balancing is set with the option [TT]-dlb[tt],",
- "which can give a significant performance improvement,",
- "especially for inhomogeneous systems. The only disadvantage of",
- "dynamic load balancing is that runs are no longer binary reproducible,",
- "but in most cases this is not important.",
- "By default the dynamic load balancing is automatically turned on",
- "when the measured performance loss due to load imbalance is 5% or more.",
- "At low parallelization these are the only important options",
- "for domain decomposition.",
- "At high parallelization the options in the next two sections",
- "could be important for increasing the performace.",
- "[PAR]",
- "When PME is used with domain decomposition, separate ranks can",
- "be assigned to do only the PME mesh calculation;",
- "this is computationally more efficient starting at about 12 ranks,",
- "or even fewer when OpenMP parallelization is used.",
- "The number of PME ranks is set with option [TT]-npme[tt],",
- "but this cannot be more than half of the ranks.",
- "By default [TT]mdrun[tt] makes a guess for the number of PME",
- "ranks when the number of ranks is larger than 16. With GPUs,",
- "using separate PME ranks is not selected automatically,",
- "since the optimal setup depends very much on the details",
- "of the hardware. In all cases, you might gain performance",
- "by optimizing [TT]-npme[tt]. Performance statistics on this issue",
- "are written at the end of the log file.",
- "For good load balancing at high parallelization, the PME grid x and y",
- "dimensions should be divisible by the number of PME ranks",
- "(the simulation will run correctly also when this is not the case).",
- "[PAR]",
- "This section lists all options that affect the domain decomposition.",
- "[PAR]",
- "Option [TT]-rdd[tt] can be used to set the required maximum distance",
- "for inter charge-group bonded interactions.",
- "Communication for two-body bonded interactions below the non-bonded",
- "cut-off distance always comes for free with the non-bonded communication.",
- "Atoms beyond the non-bonded cut-off are only communicated when they have",
- "missing bonded interactions; this means that the extra cost is minor",
- "and nearly indepedent of the value of [TT]-rdd[tt].",
- "With dynamic load balancing option [TT]-rdd[tt] also sets",
- "the lower limit for the domain decomposition cell sizes.",
- "By default [TT]-rdd[tt] is determined by [TT]mdrun[tt] based on",
- "the initial coordinates. The chosen value will be a balance",
- "between interaction range and communication cost.",
- "[PAR]",
- "When inter charge-group bonded interactions are beyond",
- "the bonded cut-off distance, [TT]mdrun[tt] terminates with an error message.",
- "For pair interactions and tabulated bonds",
- "that do not generate exclusions, this check can be turned off",
- "with the option [TT]-noddcheck[tt].",
- "[PAR]",
- "When constraints are present, option [TT]-rcon[tt] influences",
- "the cell size limit as well.",
- "Atoms connected by NC constraints, where NC is the LINCS order plus 1,",
- "should not be beyond the smallest cell size. A error message is",
- "generated when this happens and the user should change the decomposition",
- "or decrease the LINCS order and increase the number of LINCS iterations.",
- "By default [TT]mdrun[tt] estimates the minimum cell size required for P-LINCS",
- "in a conservative fashion. For high parallelization it can be useful",
- "to set the distance required for P-LINCS with the option [TT]-rcon[tt].",
- "[PAR]",
- "The [TT]-dds[tt] option sets the minimum allowed x, y and/or z scaling",
- "of the cells with dynamic load balancing. [TT]mdrun[tt] will ensure that",
- "the cells can scale down by at least this factor. This option is used",
- "for the automated spatial decomposition (when not using [TT]-dd[tt])",
- "as well as for determining the number of grid pulses, which in turn",
- "sets the minimum allowed cell size. Under certain circumstances",
- "the value of [TT]-dds[tt] might need to be adjusted to account for",
- "high or low spatial inhomogeneity of the system.",
- "[PAR]",
- "The option [TT]-gcom[tt] can be used to only do global communication",
- "every n steps.",
- "This can improve performance for highly parallel simulations",
- "where this global communication step becomes the bottleneck.",
- "For a global thermostat and/or barostat the temperature",
- "and/or pressure will also only be updated every [TT]-gcom[tt] steps.",
- "By default it is set to the minimum of nstcalcenergy and nstlist.[PAR]",
- "With [TT]-rerun[tt] an input trajectory can be given for which ",
- "forces and energies will be (re)calculated. Neighbor searching will be",
- "performed for every frame, unless [TT]nstlist[tt] is zero",
- "(see the [TT].mdp[tt] file).[PAR]",
+ "Running mdrun efficiently in parallel is a complex topic topic,",
+ "many aspects of which are covered in the online User Guide. You",
+ "should look there for practical advice on using many of the options",
+ "available in mdrun.[PAR]",
"ED (essential dynamics) sampling and/or additional flooding potentials",
"are switched on by using the [TT]-ei[tt] flag followed by an [TT].edi[tt]",
"file. The [TT].edi[tt] file can be produced with the [TT]make_edi[tt] tool",
"The options [TT]-px[tt] and [TT]-pf[tt] are used for writing pull COM",
"coordinates and forces when pulling is selected",
"in the [TT].mdp[tt] file.[PAR]",
- "With [TT]-multi[tt] or [TT]-multidir[tt], multiple systems can be ",
- "simulated in parallel.",
- "As many input files/directories are required as the number of systems. ",
- "The [TT]-multidir[tt] option takes a list of directories (one for each ",
- "system) and runs in each of them, using the input/output file names, ",
- "such as specified by e.g. the [TT]-s[tt] option, relative to these ",
- "directories.",
- "With [TT]-multi[tt], the system number is appended to the run input ",
- "and each output filename, for instance [TT]topol.tpr[tt] becomes",
- "[TT]topol0.tpr[tt], [TT]topol1.tpr[tt] etc.",
- "The number of ranks per system is the total number of ranks",
- "divided by the number of systems.",
- "One use of this option is for NMR refinement: when distance",
- "or orientation restraints are present these can be ensemble averaged",
- "over all the systems.[PAR]",
- "With [TT]-replex[tt] replica exchange is attempted every given number",
- "of steps. The number of replicas is set with the [TT]-multi[tt] or ",
- "[TT]-multidir[tt] option, described above.",
- "All run input files should use a different coupling temperature,",
- "the order of the files is not important. The random seed is set with",
- "[TT]-reseed[tt]. The velocities are scaled and neighbor searching",
- "is performed after every exchange.[PAR]",
"Finally some experimental algorithms can be tested when the",
"appropriate options have been given. Currently under",
"investigation are: polarizability.",