and configurable intelligence to detect your hardware and make pretty
effective use of that hardware. For a lot of casual and serious use of
:ref:`gmx mdrun`, the automatic machinery works well enough. But to get the
most from your hardware to maximize your scientific quality, read on!
Hardware background information
-------------------------------
members of its domain. A GPU may perform work for more than
one PP rank, but it is normally most efficient to use a single
PP rank per GPU and for that rank to have thousands of
particles. When the work of a PP rank is done on the CPU, mdrun
will make extensive use of the SIMD capabilities of the
core. There are various :ref:`command-line options
<controlling-the-domain-decomposition-algorithm>` to control
Running mdrun within a single node
----------------------------------
:ref:`gmx mdrun` can be configured and compiled in several different ways that
are efficient to use within a single :term:`node`. The default configuration
using a suitable compiler will deploy a multi-level hybrid parallelism
that uses CUDA, OpenMP and the threading platform native to the
hardware. For programming convenience, in GROMACS, those native
inform the user about the choices made and possible consequences.
A number of command-line parameters are available to vary the default
behavior.
``-nt``
The total number of threads to use. The default, 0, will start as
many threads as there are available cores.
``-ntomp``
The total number of OpenMP threads per rank to start. The
default, 0, will start one thread on each available core.
Alternatively, mdrun will honor the appropriate system
environment variable (e.g. ``OMP_NUM_THREADS``) if set.
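
For illustration (``topol.tpr`` is just a placeholder input file
name), these thread options can be set explicitly on the command line
or via the environment::

    # cap the total number of threads at 8
    gmx mdrun -nt 8 -s topol.tpr

    # use 4 OpenMP threads per rank instead of the default
    gmx mdrun -ntomp 4 -s topol.tpr

    # equivalently, via the environment variable
    OMP_NUM_THREADS=4 gmx mdrun -s topol.tpr
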
``-npme``
core is needed to continue to improve absolute performance.
The location of the scaling limit depends on the processor,
presence of GPUs, network, and simulation algorithm, but
it is worth measuring at around ~200 particles/core if you
need maximum throughput.
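
As a rough worked example (``mpirun`` and the ``gmx_mpi`` binary name
depend on your MPI installation and build), a hypothetical system of
300,000 particles run at the ~200 particles/core limit would use about
300000/200 = 1500 cores::

    mpirun -np 1500 gmx_mpi mdrun -s topol.tpr
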
There are further command-line parameters that are relevant in these
cases.
``-tunepme``
Defaults to "on." If "on," will optimize various aspects of the
PME and DD algorithms, shifting load between ranks and/or GPUs to
maximize throughput.

``-dlb``
Can be set to "auto," "no," or "yes."
Defaults to "auto." Dynamic load balancing between MPI ranks
is needed to maximize performance. This is particularly important
for molecular systems with heterogeneous particle or interaction
density. When a certain threshold for performance loss is
exceeded, DLB activates and shifts particles between ranks to improve
performance.
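
For example, to force dynamic load balancing on from the start for a
markedly heterogeneous system (``topol.tpr`` is a placeholder)::

    gmx mdrun -dlb yes -s topol.tpr
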
``-gcom``
During the simulation :ref:`gmx mdrun` must communicate between all
ranks to compute quantities such as kinetic energy. By default, this
happens whenever plausible, and is influenced by a number of
:ref:`mdp options <mdp-general>`. The period between communication
phases must be a multiple of :mdp:`nstlist`, and defaults to the
minimum of :mdp:`nstcalcenergy` and :mdp:`nstlist`. ``mdrun -gcom``
sets the number of steps that must elapse between such communication
phases, which can improve performance when running on a lot of
nodes. Note that this means that e.g. temperature coupling algorithms
will effectively remain at constant energy until the next global
communication phase.

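For instance, to restrict global communication to every 100 steps on a
highly parallel run, accepting the caveat above about coupling
algorithms (``topol.tpr`` is a placeholder)::

    gmx mdrun -gcom 100 -s topol.tpr
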
Note that ``-tunepme`` has more effect when there is more than one
:term:`node`, because the cost of communication for the PP and PME
ranks differs. It still shifts load between PP and PME ranks, but does
not change the number of separate PME ranks in use.

Note also that ``-dlb`` and ``-tunepme`` can interfere with each other, so
if you experience performance variation that could result from this,
you may wish to tune PME separately, and run the result with ``mdrun
-notunepme -dlb yes``.
The :ref:`gmx tune_pme` utility is available to search a wider
range of parameter space, including making safe
``-rdd``
Can be used to set the required maximum distance for inter
charge-group bonded interactions. Communication for two-body
bonded interactions below the non-bonded cut-off distance always
comes for free with the non-bonded communication. Particles beyond
the non-bonded cut-off are only communicated when they have
missing bonded interactions; this means that the extra cost is
minor and nearly independent of the value of ``-rdd``. With dynamic
load balancing, option ``-rdd`` also sets the lower limit for the
domain decomposition cell sizes. By default ``-rdd`` is determined
by :ref:`gmx mdrun` based on the initial coordinates. The chosen value will
``-rcon``
When constraints are present, option ``-rcon`` influences
the cell size limit as well.
Particles connected by NC constraints, where NC is the LINCS order
plus 1, should not be beyond the smallest cell size. An error
message is generated when this happens, and the user should change
the decomposition or decrease the LINCS order and increase the
number of LINCS iterations.
Running mdrun with GPUs
-----------------------
TODO In future patch: any tips not covered above

Running the OpenCL version of mdrun
-----------------------------------

The current version works with GCN-based AMD GPUs and NVIDIA CUDA
GPUs. Make sure that you have the latest drivers installed. The
minimum OpenCL version required is |REQUIRED_OPENCL_MIN_VERSION|. See
also the :ref:`known limitations <opencl-known-limitations>`.

The same ``-gpu_id`` option (or ``GMX_GPU_ID`` environment variable)
used to select CUDA devices, or to define a mapping of GPUs to PP
ranks, is used for OpenCL devices.

The following devices are known to work correctly:

- AMD: FirePro W5100, HD 7950, FirePro W9100, Radeon R7 240,
  Radeon R7 M260, Radeon R9 290
- NVIDIA: GeForce GTX 660M, GeForce GTX 660Ti, GeForce GTX 750Ti,
  GeForce GTX 780, GTX Titan
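
As an illustration with CUDA devices (the OpenCL build currently
supports only one GPU per node), ``-gpu_id`` lists device ids in PP
rank order; ``topol.tpr`` is a placeholder::

    # map two thread-MPI PP ranks to GPUs 0 and 1
    gmx mdrun -ntmpi 2 -gpu_id 01 -s topol.tpr
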

Building an OpenCL program can take a significant amount of
time. NVIDIA implements a mechanism to cache the result of the
build. As a consequence, only the first run will take longer (because
of the kernel builds), and the following runs will be very fast. AMD
drivers, on the other hand, implement no caching and the initial phase
of running an OpenCL program can be very slow. This is not normally a
problem for long production MD, but you might prefer to do some kinds
of work on just the CPU (e.g. see ``-nb`` above).
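
For example, to sidestep the slow OpenCL build phase for a short run,
the non-bonded work can be kept on the CPU (``topol.tpr`` is a
placeholder)::

    gmx mdrun -nb cpu -s topol.tpr
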

Some other :ref:`OpenCL management <opencl-management>` environment
variables may be of interest to developers.

.. _opencl-known-limitations:

Known limitations of the OpenCL support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Limitations in the current OpenCL support of interest to |Gromacs| users:

- Using more than one GPU on a node is not supported
- Sharing a GPU between multiple PP ranks is not supported
- No Intel devices (CPUs, GPUs or Xeon Phi) are supported
- Due to blocking behavior of clEnqueue functions in the NVIDIA driver, there is
  almost no performance gain when using NVIDIA GPUs. A bug report has already
  been filed about this issue. A possible workaround would be to have a
  separate thread for issuing GPU commands, but this has not been
  implemented yet.

Limitations of interest to |Gromacs| developers:

- The current implementation is not compatible with OpenCL devices that are
  not using warp/wavefronts or for which the warp/wavefront size is not a
  multiple of 32
- Some Ewald tabulated kernels are known to produce incorrect results, so
  (correct) analytical kernels are used instead.