Unify handling of GMX_ENABLE_GPU_TIMING and GMX_DISABLE_GPU_TIMING

[alexxy/gromacs.git] / docs / user-guide / environment-variables.rst
diff --git a/docs/user-guide/environment-variables.rst b/docs/user-guide/environment-variables.rst

index 597ed8f900e9f36f5ebd5cf9bf01568ee4c42213..1f323174e1124da09523dbf792d20286fd844d23 100644
--- a/docs/user-guide/environment-variables.rst
+++ b/docs/user-guide/environment-variables.rst
@@ -1,3 +1,9 @@
+.. NOTE: Below is a useful bash one-liner to verify whether there are variables in this file
+..        no longer present in the code.
+.. ( export INPUT_FILE='docs/user-guide/environment-variables.rst' GIT_PAGER="cat "; for s in $(grep '^`'  $INPUT_FILE | sed 's/`//g' | sed 's/,/ /g'); do count=$(git grep $s | grep -v $INPUT_FILE | wc -l); [ $count -eq 0 ] && printf "%-30s%s\n" $s $count; done ; )
+.. Another useful one-liner to find undocumented variables:
+..  ( export INPUT_FILE=docs/user-guide/environment-variables.rst; GIT_PAGER="cat ";   for ss in `for s in $(git grep getenv |  sed 's/.*getenv("\(.*\)".*/\1/' | sort -u  | grep '^[A-Z]'); do [ $(grep $s $INPUT_FILE -c) -eq 0 ] && echo $s; done `; do git grep $ss ; done )
+
  Environment Variables
  =====================
  
@@ -16,8 +22,8 @@ you should consult your local documentation for details.
  
  Output Control
  --------------
-``GMX_CONSTRAINTVIR``
-        Print constraint virial and force virial energy terms.
+``GMX_DUMP_NL``
+        Neighbour list dump level; default 0.
  
  ``GMX_MAXBACKUP``
          |Gromacs| automatically backs up old
@@ -52,9 +58,6 @@ Output Control
          Be careful not to use a command which blocks the terminal
          (e.g. ``vi``), since multiple instances might be run.
  
-``GMX_VIRIAL_TEMPERATURE``
-        print virial temperature energy term
-
  ``GMX_LOG_BUFFER``
          the size of the buffer for file I/O. When set
          to 0, all file I/O will be unbuffered and therefore very slow.
@@ -78,6 +81,15 @@ Output Control
          Defaults to 1, which prints frame count e.g. when reading trajectory
          files. Set to 0 for quiet operation.
  
+``GMX_ENABLE_GPU_TIMING``
+        Enables GPU timings in the log file for CUDA and SYCL. Note that CUDA
+        timings are incorrect with multiple streams, as happens with domain
+        decomposition or with both non-bondeds and PME on the GPU (this is
+        also the main reason why they are not turned on by default).
+
+``GMX_DISABLE_GPU_TIMING``
+        Disables GPU timings in the log file for OpenCL.
+
  Debugging
  ---------
  ``GMX_PRINT_DEBUG_LINES``
@@ -104,6 +116,18 @@ Debugging
          over-ride the number of DD pulses used
          (default 0, meaning no over-ride). Normally 1 or 2.
  
+``GMX_DISABLE_ALTERNATING_GPU_WAIT``
+        disables the specialized polling wait path used to wait for completion of the PME and nonbonded
+        GPU tasks, which lets the reduction of whichever forces arrive first overlap with the
+        remaining wait. Setting this variable switches to the generic path with a fixed waiting
+        order.
+
+``GMX_TEST_REQUIRED_NUMBER_OF_DEVICES``
+        sets the number of GPUs required by the test suite. By default, the test suite
+        falls back to using the CPU if GPUs cannot be detected. Set it to a positive integer value
+        to ensure that at least this number of usable GPUs is detected. Default:
+        0 (not testing GPU availability).
+
  There are a number of extra environment variables like these
  that are used in debugging - check the code!
  
@@ -115,33 +139,37 @@ Performance and Run Control
          file. Normally, :mdp:`epsilon-r` must be greater than zero to prevent a fatal error.
          See webpage_ for example input files for a planetary simulation.
  
-``GMX_ALLOW_CPT_MISMATCH``
-        when set, runs will not exit if the
-        ensemble set in the :ref:`tpr` file does not match that of the
-        :ref:`cpt` file.
+``GMX_BONDED_NTHREAD_UNIFORM``
+        Value of the number of threads per rank from which to switch from uniform
+        to localized bonded interaction distribution; optimal value dependent on
+        system and hardware, default value is 4.
  
-``GMX_CUDA_NB_EWALD_TWINCUT``
+``GMX_GPU_NB_EWALD_TWINCUT``
          force the use of twin-range cutoff kernel even if :mdp:`rvdw` equals
          :mdp:`rcoulomb` after PP-PME load balancing. The switch to twin-range kernels is automated,
          so this variable should be used only for benchmarking.
  
-``GMX_CUDA_NB_ANA_EWALD``
+``GMX_GPU_NB_ANA_EWALD``
          force the use of analytical Ewald kernels. Should be used only for benchmarking.
  
-``GMX_CUDA_NB_TAB_EWALD``
+``GMX_GPU_NB_TAB_EWALD``
          force the use of tabulated Ewald kernels. Should be used only for benchmarking.
  
-``GMX_CUDA_STREAMSYNC``
-        force the use of cudaStreamSynchronize on ECC-enabled GPUs, which leads
-        to performance loss due to a known CUDA driver bug present in API v5.0 NVIDIA drivers (pre-30x.xx).
-        Cannot be set simultaneously with ``GMX_NO_CUDA_STREAMSYNC``.
+``GMX_DISABLE_CUDA_TIMING``
+        Deprecated. Use ``GMX_DISABLE_GPU_TIMING`` instead.
+
+``GMX_GPU_DD_COMMS``
+        perform domain decomposition halo exchange communication operations (on coordinate and force buffers)
+        directly on GPU memory spaces, without the staging of data through CPU memory, where possible.
  
-``GMX_DISABLE_CUDALAUNCH``
-        disable the use of the lower-latency cudaLaunchKernel API even when supported (CUDA >=v7.0).
-        Should only be used for benchmarking purposes.
+``GMX_GPU_PME_PP_COMMS``
+        when the simulation uses a separate PME rank, perform communication operations between PP and PME rank
+        (for coordinate and force buffers) directly on GPU memory spaces, without the staging of data through CPU
+        memory, where possible. 
  
-``GMX_DISABLE_CUDA_TIMING``
-        Disables GPU timing of CUDA tasks; synonymous with ``GMX_DISABLE_GPU_TIMING``.
+``GMX_GPU_SYCL_NO_SYNCHRONIZE``
+        disable synchronizations between different GPU streams in SYCL builds, instead relying on the SYCL runtime to
+        do scheduling based on data dependencies. Experimental.
  
  ``GMX_CYCLE_ALL``
          times all code during runs.  Incompatible with threads.
@@ -183,6 +211,7 @@ Performance and Run Control
  ``GMX_DISABLE_GPU_TIMING``
          timing of asynchronously executed GPU operations can have a
          non-negligible overhead with short step times. Disabling timing can improve performance in these cases.
+        Timings are disabled by default with CUDA and SYCL.
  
  ``GMX_DISABLE_GPU_DETECTION``
          when set, disables GPU detection even if :ref:`gmx mdrun` was compiled
@@ -199,8 +228,6 @@ Performance and Run Control
  ``GMX_EMULATE_GPU``
          emulate GPU runs by using algorithmically equivalent CPU reference code instead of
          GPU-accelerated functions. As the CPU code is slow, it is intended to be used only for debugging purposes.
-        The behavior is automatically triggered if non-bonded calculations are turned off using ``GMX_NO_NONBONDED``
-        case in which the non-bonded calculations will not be called, but the CPU-GPU transfer will also be skipped.
  
  ``GMX_ENX_NO_FATAL``
          disable exiting upon encountering a corrupted frame in an :ref:`edr`
@@ -209,11 +236,29 @@ Performance and Run Control
  ``GMX_FORCE_UPDATE``
          update forces when invoking ``mdrun -rerun``.
  
+``GMX_FORCE_UPDATE_DEFAULT_GPU``
+        Force update to run on the GPU by default, overriding the ``mdrun -update auto`` option. Works similarly to setting
+        ``mdrun -update gpu``, but (1) falls back to the CPU code-path if set with input that is not supported and
+        (2) can be used to run update on GPUs in multi-rank cases. The latter case should be
+        considered experimental since it lacks substantial testing. Also, GPU update is only supported with GPU direct
+        communications, so the ``GMX_FORCE_UPDATE_DEFAULT_GPU`` variable should be set together with the ``GMX_GPU_DD_COMMS``
+        and ``GMX_GPU_PME_PP_COMMS`` environment variables in the multi-rank case. Does not override ``mdrun -update cpu``.
+
  ``GMX_GPU_ID``
          set in the same way as ``mdrun -gpu_id``, ``GMX_GPU_ID``
-        allows the user to specify different GPU id-s, which can be useful for selecting different
+        allows the user to specify different GPU IDs for different ranks, which can be useful for selecting different
          devices on different compute nodes in a cluster.  Cannot be used in conjunction with ``mdrun -gpu_id``.
  
+``GMX_GPUTASKS``
+        set in the same way as ``mdrun -gputasks``, ``GMX_GPUTASKS`` allows the mapping
+        of GPU tasks to GPU device IDs to be different on different ranks, if e.g. the MPI
+        runtime permits this variable to be different for different ranks. Cannot be used
+        in conjunction with ``mdrun -gputasks``. Has all the same requirements as ``mdrun -gputasks``.
+
+``GMX_GPU_DISABLE_COMPATIBILITY_CHECK``
+        Disables the hardware compatibility check in OpenCL and SYCL. Useful for developers
+        and allows testing the OpenCL/SYCL kernels on non-supported platforms without source code modification.
+
  ``GMX_IGNORE_FSYNC_FAILURE_ENV``
          allow :ref:`gmx mdrun` to continue even if
          a file is missing.
@@ -226,17 +271,12 @@ Performance and Run Control
          if set to -1, :ref:`gmx mdrun` will
          not exit if it produces too many LINCS warnings.
  
-``GMX_NB_GENERIC``
-        use the generic C kernel.  Should be set if using
-        the group-based cutoff scheme and also sets ``GMX_NO_SOLV_OPT`` to be true,
-        thus disabling solvent optimizations as well.
-
  ``GMX_NB_MIN_CI``
          neighbor list balancing parameter used when running on GPU. Sets the
          target minimum number pair-lists in order to improve multi-processor load-balance for better
          performance with small simulation systems. Must be set to a non-negative integer,
          the 0 value disables list splitting.
-        The default value is optimized for supported GPUs (NVIDIA Fermi to Maxwell),
+        The default value is optimized for supported GPUs,
          therefore changing it is not necessary for normal usage, but it can be useful on future architectures.
  
  ``GMX_NBLISTCG``
@@ -261,8 +301,8 @@ Performance and Run Control
          force the use of 4xN SIMD CPU non-bonded kernels,
          mutually exclusive of ``GMX_NBNXN_SIMD_2XNN``.
  
-``GMX_NO_ALLVSALL``
-        disables optimized all-vs-all kernels.
+``GMX_NOOPTIMIZEDKERNELS``
+        deprecated, use ``GMX_DISABLE_SIMD_KERNELS`` instead.
  
  ``GMX_NO_CART_REORDER``
          used in initializing domain decomposition communicators. Rank reordering
@@ -272,11 +312,6 @@ Performance and Run Control
          force the use of LJ paremeter lookup instead of using combination rules
          in the non-bonded kernels.
  
-``GMX_NO_CUDA_STREAMSYNC``
-        the opposite of ``GMX_CUDA_STREAMSYNC``. Disables the use of the
-        standard cudaStreamSynchronize-based GPU waiting to improve performance when using CUDA driver API
-        ealier than v5.0 with ECC-enabled GPUs.
-
  ``GMX_NO_INT``, ``GMX_NO_TERM``, ``GMX_NO_USR1``
          disable signal handlers for SIGINT,
          SIGTERM, and SIGUSR1, respectively.
@@ -288,25 +323,27 @@ Performance and Run Control
          skip non-bonded calculations; can be used to estimate the possible
          performance gain from adding a GPU accelerator to the current hardware setup -- assuming that this is
          fast enough to complete the non-bonded calculations while the CPU does bonded force and PME computation.
+        Freezing the particles will be required to stop the system blowing up.
  
-``GMX_NO_PULLVIR``
-        when set, do not add virial contribution to COM pull forces.
+``GMX_PULL_PARTICIPATE_ALL``
+        disable the default heuristic for when to use a separate pull MPI communicator (at >=32 ranks).
  
  ``GMX_NOPREDICT``
          shell positions are not predicted.
  
-``GMX_NO_SOLV_OPT``
-        turns off solvent optimizations; automatic if ``GMX_NB_GENERIC``
-        is enabled.
+``GMX_NO_UPDATEGROUPS``
+        turns off update groups. May allow for a decomposition of more
+        domains for small systems at the cost of communication during update.
  
  ``GMX_NSCELL_NCG``
          the ideal number of charge groups per neighbor searching grid cell is hard-coded
          to a value of 10. Setting this environment variable to any other integer value overrides this hard-coded
          value.
  
-``GMX_PME_NTHREADS``
-        set the number of OpenMP or PME threads (overrides the number guessed by
-        :ref:`gmx mdrun`.
+``GMX_PME_NUM_THREADS``
+        set the number of OpenMP or PME threads; overrides the default set by
+        :ref:`gmx mdrun`; can be used instead of the ``-ntomp_pme`` command line option,
+        also useful to set heterogeneous per-process/-node thread count.
  
  ``GMX_PME_P3M``
          use P3M-optimized influence function instead of smooth PME B-spline interpolation.
@@ -314,7 +351,7 @@ Performance and Run Control
  ``GMX_PME_THREAD_DIVISION``
          PME thread division in the format "x y z" for all three dimensions. The
          sum of the threads in each dimension must equal the total number of PME threads (set in
-        `GMX_PME_NTHREADS`).
+        :envvar:`GMX_PME_NUM_THREADS`).
  
  ``GMX_PMEONEDD``
          if the number of domain decomposition cells is set to 1 for both x and y,
@@ -327,11 +364,6 @@ Performance and Run Control
          require the use of tabulated Coulombic
          and van der Waals interactions.
  
-``GMX_SCSIGMA_MIN``
-        the minimum value for soft-core sigma. **Note** that this value is set
-        using the :mdp:`sc-sigma` keyword in the :ref:`mdp` file, but this environment variable can be used
-        to reproduce pre-4.5 behavior with respect to this parameter.
-
  ``GMX_TPIC_MASSES``
          should contain multiple masses used for test particle insertion into a cavity.
          The center of mass of the last atoms is used for insertion into the cavity.
@@ -346,7 +378,7 @@ Performance and Run Control
  ``HWLOC_XMLFILE``
          Not strictly a |Gromacs| environment variable, but on large machines
          the hwloc detection can take a few seconds if you have lots of MPI processes.
-        If you run the hwloc command `lstopo out.xml` and set this environment
+        If you run the hwloc command :command:`lstopo out.xml` and set this environment
          variable to point to the location of this file, the hwloc library will use
          the cached information instead, which can be faster.
  
@@ -356,13 +388,15 @@ Performance and Run Control
  ``MDRUN``
          the :ref:`gmx mdrun` command used by :ref:`gmx tune_pme`.
  
-``GMX_NSTLIST``
-        sets the default value for :mdp:`nstlist`, preventing it from being tuned during
-        :ref:`gmx mdrun` startup when using the Verlet cutoff scheme.
+``GMX_DISABLE_DYNAMICPRUNING``
+        disables dynamic pair-list pruning. Note that :ref:`gmx mdrun` will
+        still tune nstlist to the optimal value picked assuming dynamic pruning. Thus
+        for good performance the -nstlist option should be used.
  
-``GMX_USE_TREEREDUCE``
-        use tree reduction for nbnxn force reduction. Potentially faster for large number of
-        OpenMP threads (if memory locality is important).
+``GMX_NSTLIST_DYNAMICPRUNING``
+        overrides the dynamic pair-list pruning interval chosen heuristically
+        by mdrun. Values should be between the pruning frequency value
+        (1 for CPU and 2 for GPU) and :mdp:`nstlist` ``- 1``.
  
  .. _opencl-management:
  
@@ -390,6 +424,7 @@ compilation of OpenCL kernels, but they are also used in device selection.
  
  ``GMX_OCL_DISABLE_FASTMATH``
          Prevents the use of ``-cl-fast-relaxed-math`` compiler option.
+        Note: fast math is always disabled on Intel devices due to instability.
  
  ``GMX_OCL_DUMP_LOG``
          If defined, the OpenCL build log is always written to the
@@ -409,8 +444,8 @@ compilation of OpenCL kernels, but they are also used in device selection.
          ``GMX_OCL_NOGENCACHE``).
  
              - NVIDIA GPUs: PTX code is saved in the current directory
-             with the name ``device_name.ptx``
-           - AMD GPUs: ``.IL/.ISA`` files will be created for each OpenCL
+              with the name ``device_name.ptx``
+            - AMD GPUs: ``.IL/.ISA`` files will be created for each OpenCL
                kernel built.  For details about where these files are
                created check AMD documentation for ``-save-temps`` compiler
                option.
@@ -430,60 +465,30 @@ compilation of OpenCL kernels, but they are also used in device selection.
          simplicity of stepping in a kernel and see what is happening.
  
  ``GMX_OCL_DISABLE_I_PREFETCH``
-        Disables i-atom data (type or LJ parameter) prefetch allowig
+        Disables i-atom data (type or LJ parameter) prefetch allowing
          testing.
  
  ``GMX_OCL_ENABLE_I_PREFETCH``
-        Enables i-atom data (type or LJ parameter) prefetch allowig
+        Enables i-atom data (type or LJ parameter) prefetch allowing
          testing on platforms where this behavior is not default.
  
-``GMX_OCL_NB_ANA_EWALD``
-        Forces the use of analytical Ewald kernels. Equivalent of
-        CUDA environment variable ``GMX_CUDA_NB_ANA_EWALD``
-
-``GMX_OCL_NB_TAB_EWALD``
-        Forces the use of tabulated Ewald kernel. Equivalent
-        of CUDA environment variable ``GMX_OCL_NB_TAB_EWALD``
-
-``GMX_OCL_NB_EWALD_TWINCUT``
-        Forces the use of twin-range cutoff kernel. Equivalent of
-        CUDA environment variable ``GMX_CUDA_NB_EWALD_TWINCUT``
-
-``GMX_DISABLE_OCL_TIMING``
-        Disables timing for OpenCL operations
-
  ``GMX_OCL_FILE_PATH``
          Use this parameter to force |Gromacs| to load the OpenCL
          kernels from a custom location. Use it only if you want to
          override |Gromacs| default behavior, or if you want to test
          your own kernels.
  
+``GMX_OCL_SHOW_DIAGNOSTICS``
+        Use the Intel OpenCL extension to show additional runtime performance
+        diagnostics.
+
  Analysis and Core Functions
  ---------------------------
-``GMX_QM_ACCURACY``
-        accuracy in Gaussian L510 (MC-SCF) component program.
-
-``GMX_QM_ORCA_BASENAME``
-        prefix of :ref:`tpr` files, used in Orca calculations
-        for input and output file names.
-
-``GMX_QM_CPMCSCF``
-        when set to a nonzero value, Gaussian QM calculations will
-        iteratively solve the CP-MCSCF equations.
-
-``GMX_QM_MODIFIED_LINKS_DIR``
-        location of modified links in Gaussian.
  
  ``DSSP``
          used by :ref:`gmx do_dssp` to point to the ``dssp``
          executable (not just its path).
  
-``GMX_QM_GAUSS_DIR``
-        directory where Gaussian is installed.
-
-``GMX_QM_GAUSS_EXE``
-        name of the Gaussian executable.
-
  ``GMX_DIPOLE_SPACING``
          spacing used by :ref:`gmx dipoles`.
  
@@ -491,14 +496,11 @@ Analysis and Core Functions
          sets the maximum number of residues to be renumbered by
          :ref:`gmx grompp`. A value of -1 indicates all residues should be renumbered.
  
-``GMX_FFRTP_TER_RENAME``
+``GMX_NO_FFRTP_TER_RENAME``
          Some force fields (like AMBER) use specific names for N- and C-
          terminal residues (NXXX and CXXX) as :ref:`rtp` entries that are normally renamed. Setting
          this environment variable disables this renaming.
  
-``GMX_PATH_GZIP``
-        ``gunzip`` executable, used by :ref:`gmx wham`.
-
  ``GMX_FONT``
          name of X11 font used by :ref:`gmx view`.
  
@@ -506,9 +508,6 @@ Analysis and Core Functions
          the time unit used in output files, can be
          anything in fs, ps, ns, us, ms, s, m or h.
  
-``GMX_QM_GAUSSIAN_MEMORY``
-        memory used for Gaussian QM calculation.
-
  ``MULTIPROT``
          name of the ``multiprot`` executable, used by the
          contributed program ``do_multiprot``.
@@ -516,15 +515,6 @@ Analysis and Core Functions
  ``NCPUS``
          number of CPUs to be used for Gaussian QM calculation
  
-``GMX_ORCA_PATH``
-        directory where Orca is installed.
-
-``GMX_QM_SA_STEP``
-        simulated annealing step size for Gaussian QM calculation.
-
-``GMX_QM_GROUND_STATE``
-        defines state for Gaussian surface hopping calculation.
-
  ``GMX_TOTAL``
          name of the ``total`` executable used by the contributed
          ``do_shift`` program.