and configurable intelligence to detect your hardware and make pretty
effective use of that hardware. For a lot of casual and serious use of
:ref:`gmx mdrun`, the automatic machinery works well enough. But to get the
most from your hardware to maximize your scientific quality, read on!
Hardware background information
-------------------------------
members of its domain. A GPU may perform work for more than
one PP rank, but it is normally most efficient to use a single
PP rank per GPU and for that rank to have thousands of
particles. When the work of a PP rank is done on the CPU, mdrun
will make extensive use of the SIMD capabilities of the
core. There are various :ref:`command-line options
<controlling-the-domain-decomposition-algorithm>` to control
Running mdrun within a single node
----------------------------------
:ref:`gmx mdrun` can be configured and compiled in several different ways that
are efficient to use within a single :term:`node`. The default configuration
using a suitable compiler will deploy a multi-level hybrid parallelism
that uses CUDA, OpenMP and the threading platform native to the
hardware. For programming convenience, in GROMACS, those native
inform the user about the choices made and possible consequences.
A number of command-line parameters are available to vary the default
behavior.
``-nt``
The total number of threads to use. The default, 0, will start as
many threads as there are available cores.
``-ntomp``
The total number of OpenMP threads per rank to start. The
default, 0, will start one thread on each available core.
Alternatively, mdrun will honor the appropriate system
environment variable (e.g. ``OMP_NUM_THREADS``) if set.
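
For illustration (``topol.tpr`` is just a placeholder input file
name), these thread options can be set explicitly on the command line
or via the environment::

    # cap the total number of threads at 8
    gmx mdrun -nt 8 -s topol.tpr

    # use 4 OpenMP threads per rank instead of the default
    gmx mdrun -ntomp 4 -s topol.tpr

    # equivalently, via the environment variable
    OMP_NUM_THREADS=4 gmx mdrun -s topol.tpr
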
``-npme``
core is needed to continue to improve absolute performance.
The location of the scaling limit depends on the processor,
presence of GPUs, network, and simulation algorithm, but
it is worth measuring at around ~200 particles/core if you
need maximum throughput.
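
As a rough worked example (``mpirun`` and the ``gmx_mpi`` binary name
depend on your MPI installation and build), a hypothetical system of
300,000 particles run at the ~200 particles/core limit would use about
300000/200 = 1500 cores::

    mpirun -np 1500 gmx_mpi mdrun -s topol.tpr
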
There are further command-line parameters that are relevant in these
cases.
``-tunepme``
Defaults to "on." If "on," will optimize various aspects of the
PME and DD algorithms, shifting load between ranks and/or GPUs to
maximize throughput.

``-dlb``
Can be set to "auto," "no," or "yes."
Defaults to "auto." Dynamic load balancing between MPI ranks
is needed to maximize performance. This is particularly important
for molecular systems with heterogeneous particle or interaction
density. When a certain threshold for performance loss is
exceeded, DLB activates and shifts particles between ranks to improve
performance.
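
For example, to force dynamic load balancing on from the start for a
markedly heterogeneous system (``topol.tpr`` is a placeholder)::

    gmx mdrun -dlb yes -s topol.tpr
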
``-gcom``
During the simulation :ref:`gmx mdrun` must communicate between all
ranks to compute quantities such as kinetic energy. By default, this
happens whenever plausible, and is influenced by a number of
:ref:`mdp options <mdp-general>`. The period between communication
phases must be a multiple of :mdp:`nstlist`, and defaults to the
minimum of :mdp:`nstcalcenergy` and :mdp:`nstlist`. ``mdrun -gcom``
sets the number of steps that must elapse between such communication
phases, which can improve performance when running on a lot of
nodes. Note that this means that e.g. temperature coupling algorithms
will effectively remain at constant energy until the next global
communication phase.

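For instance, to restrict global communication to every 100 steps on a
highly parallel run, accepting the caveat above about coupling
algorithms (``topol.tpr`` is a placeholder)::

    gmx mdrun -gcom 100 -s topol.tpr
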
Note that ``-tunepme`` has more effect when there is more than one
:term:`node`, because the cost of communication for the PP and PME
ranks differs. It still shifts load between PP and PME ranks, but does
not change the number of separate PME ranks in use.

Note also that ``-dlb`` and ``-tunepme`` can interfere with each other, so
if you experience performance variation that could result from this,
you may wish to tune PME separately, and run the result with ``mdrun
-notunepme -dlb yes``.
The :ref:`gmx tune_pme` utility is available to search a wider
range of parameter space, including making safe
``-rdd``
Can be used to set the required maximum distance for inter
charge-group bonded interactions. Communication for two-body
bonded interactions below the non-bonded cut-off distance always
comes for free with the non-bonded communication. Particles beyond
the non-bonded cut-off are only communicated when they have
missing bonded interactions; this means that the extra cost is
minor and nearly independent of the value of ``-rdd``. With dynamic
load balancing, option ``-rdd`` also sets the lower limit for the
domain decomposition cell sizes. By default ``-rdd`` is determined
by :ref:`gmx mdrun` based on the initial coordinates. The chosen value will
``-rcon``
When constraints are present, option ``-rcon`` influences
the cell size limit as well.
Particles connected by NC constraints, where NC is the LINCS order
plus 1, should not be beyond the smallest cell size. An error
message is generated when this happens, and the user should change
the decomposition or decrease the LINCS order and increase the
number of LINCS iterations.
Running mdrun with GPUs
-----------------------
TODO In future patch: any tips not covered above

Running the OpenCL version of mdrun
-----------------------------------

The current version works with GCN-based AMD GPUs and NVIDIA CUDA
GPUs. Make sure that you have the latest drivers installed. The
minimum OpenCL version required is |REQUIRED_OPENCL_MIN_VERSION|. See
also the :ref:`known limitations <opencl-known-limitations>`.

The same ``-gpu_id`` option (or ``GMX_GPU_ID`` environment variable)
used to select CUDA devices, or to define a mapping of GPUs to PP
ranks, is used for OpenCL devices.

The following devices are known to work correctly:

- AMD: FirePro W5100, HD 7950, FirePro W9100, Radeon R7 240,
  Radeon R7 M260, Radeon R9 290
- NVIDIA: GeForce GTX 660M, GeForce GTX 660Ti, GeForce GTX 750Ti,
  GeForce GTX 780, GTX Titan
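
As an illustration with CUDA devices (the OpenCL build currently
supports only one GPU per node), ``-gpu_id`` lists device ids in PP
rank order; ``topol.tpr`` is a placeholder::

    # map two thread-MPI PP ranks to GPUs 0 and 1
    gmx mdrun -ntmpi 2 -gpu_id 01 -s topol.tpr
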

Building an OpenCL program can take a significant amount of
time. NVIDIA implements a mechanism to cache the result of the
build. As a consequence, only the first run will take longer (because
of the kernel builds), and the following runs will be very fast. AMD
drivers, on the other hand, implement no caching and the initial phase
of running an OpenCL program can be very slow. This is not normally a
problem for long production MD, but you might prefer to do some kinds
of work on just the CPU (e.g. see ``-nb`` above).
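
For example, to sidestep the slow OpenCL build phase for a short run,
the non-bonded work can be kept on the CPU (``topol.tpr`` is a
placeholder)::

    gmx mdrun -nb cpu -s topol.tpr
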

Some other :ref:`OpenCL management <opencl-management>` environment
variables may be of interest to developers.

.. _opencl-known-limitations:

Known limitations of the OpenCL support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Limitations in the current OpenCL support of interest to |Gromacs| users:

- Using more than one GPU on a node is not supported
- Sharing a GPU between multiple PP ranks is not supported
- No Intel devices (CPUs, GPUs or Xeon Phi) are supported
- Due to blocking behavior of clEnqueue functions in the NVIDIA driver, there is
  almost no performance gain when using NVIDIA GPUs. A bug report has already
  been filed about this issue. A possible workaround would be to have a
  separate thread for issuing GPU commands, but this has not been
  implemented yet.

Limitations of interest to |Gromacs| developers:

- The current implementation is not compatible with OpenCL devices that are
  not using warp/wavefronts or for which the warp/wavefront size is not a
  multiple of 32
- Some Ewald tabulated kernels are known to produce incorrect results, so
  (correct) analytical kernels are used instead.