docs/release-notes/2020/major/performance.rst

   1 Performance improvements
   2 ^^^^^^^^^^^^^^^^^^^^^^^^
   3
   4 .. Note to developers!
   5    Please use """"""" to underline the individual entries for fixed issues in the subfolders,
   6    otherwise the formatting on the webpage is messed up.
   7    Also, please use the syntax :issue:`number` to reference issues on GitLab, without
   8    a space between the colon and the number!
   9
  10 Up to a factor 2.5 speed-up of the non-bonded free-energy kernel
  11 """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
  12
  13 The non-bonded free-energy kernel is a factor of 2.5 faster when both the A and B
  14 states are non-zero, and a factor of 1.5 faster when one state is zero. This especially
  15 improves run performance when non-perturbed non-bonded interactions are offloaded to a
  16 GPU. In that case the PME-mesh calculation now always takes the most CPU time.
  17
  18
  19 Proper dihedrals of Fourier type and improper dihedrals of periodic type are SIMD accelerated
  20 """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
  21
  22 Avoid configuring the own-FFTW with AVX512 enabled when |Gromacs| does not use AVX512
  23 """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
  24
  25 Previously, if |Gromacs| was configured to use any AVX flavor, the internally built FFTW
  26 would be configured to also contain AVX512 kernels. This could cause performance loss
  27 if the (often noisy) FFTW auto-tuner picked an AVX512 kernel in a run that otherwise
  28 only used AVX/AVX2, which could have run at higher CPU clocks without the AVX512 clock-speed limitation.
  29 Now AVX512 is only enabled for the internal FFTW if |Gromacs| is itself configured with
  30 an AVX512 SIMD flavor.
  31
  32 Update and constraints can run on a GPU
  33 """""""""""""""""""""""""""""""""""""""
  34
  35 For standard simulations (see the user guide for more details),
  36 update and constraints can be offloaded to a GPU with CUDA. Thus all compute-intensive
  37 parts of a simulation can be offloaded, which provides
  38 better performance when using a fast GPU combined with a slow CPU.
  39 By default, update runs on the CPU. To run it on a GPU in single-rank simulations,
  40 use the new ``-update gpu`` command-line option.
  41 For use with domain decomposition, please see below.
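
As a minimal sketch of the single-rank case, the Python snippet below assembles
(but does not run) an mdrun command line that requests GPU update. The helper
name ``build_mdrun_command`` and the ``topol`` file prefix are hypothetical, and
``-ntmpi 1`` is added here only as an assumption to force a single thread-MPI
rank; whether GPU update is actually supported still depends on the simulation
setup described in the user guide.

```python
import shlex


def build_mdrun_command(deffnm="topol", update_on_gpu=True):
    """Assemble a single-rank mdrun command line as a list of arguments.

    Hypothetical helper for illustration; it only builds the argument list.
    """
    # -ntmpi 1 is an assumption made here to request one thread-MPI rank.
    cmd = ["gmx", "mdrun", "-deffnm", deffnm, "-ntmpi", "1"]
    if update_on_gpu:
        # The new option from this release; the default keeps update on the CPU.
        cmd += ["-update", "gpu"]
    return cmd


if __name__ == "__main__":
    # Print the command so it can be inspected or pasted into a shell.
    print(shlex.join(build_mdrun_command()))
```

Calling ``build_mdrun_command(update_on_gpu=False)`` yields the same command
without the option, which is convenient for comparing against the default CPU
update path.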
  42
  43 GPU Direct Communications
  44 """""""""""""""""""""""""
  45
  46 When running on multiple GPUs with CUDA, communication operations can
  47 now be performed directly between GPU memory spaces (automatically
  48 routed, including via NVLink where available). This behaviour is not
  49 yet enabled by default: the new codepaths have been verified by the
  50 standard |Gromacs| regression tests, but (at the time of release) still
  51 lack substantial "real-world" testing. They can be enabled by setting
  52 the following environment variables to any non-NULL value in your
  53 shell: ``GMX_GPU_DD_COMMS`` (for halo exchange communications between PP
  54 tasks) and ``GMX_GPU_PME_PP_COMMS`` (for communications between PME and PP
  55 tasks). ``GMX_FORCE_UPDATE_DEFAULT_GPU`` can also be set in
  56 order to combine these with the new GPU update feature (above). The
  57 combination of these will (for many common simulations) keep data
  58 resident on the GPU across most timesteps, avoiding expensive data
  59 transfers. Note that these currently require |Gromacs| to be built
  60 with its internal thread-MPI library rather than any external MPI
  61 library, and are limited to a single compute node. We stress that
  62 users should carefully verify results against the default path, and
  63 any reported issues will be gratefully received to help us mature the
  64 software.
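
The opt-in described above can be sketched in Python as follows. The three
variable names are taken from this note; the helper name ``gpu_direct_env`` and
the value ``"1"`` are illustrative assumptions (any value should do, since only
the presence of the variable is checked). The function only prepares an
environment mapping and does not launch anything.

```python
import os

# Variable names as given in the release note.
DIRECT_COMM_VARS = ("GMX_GPU_DD_COMMS", "GMX_GPU_PME_PP_COMMS")
GPU_UPDATE_VAR = "GMX_FORCE_UPDATE_DEFAULT_GPU"


def gpu_direct_env(base_env=None, with_gpu_update=True):
    """Return a copy of an environment with the opt-in variables set.

    Hypothetical helper for illustration; the input mapping is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name in DIRECT_COMM_VARS:
        env[name] = "1"  # arbitrary value chosen for this sketch
    if with_gpu_update:
        env[GPU_UPDATE_VAR] = "1"
    return env
```

The returned mapping can be passed as the ``env`` argument of
``subprocess.run`` when launching a thread-MPI build of mdrun on a single node.
Running the same input once with and once without these variables gives the
comparison against the default path that this note recommends.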
  65
  66
  67 Bonded kernels on GPU have been fused
  68 """""""""""""""""""""""""""""""""""""
  69
  70 Instead of launching one GPU kernel for each listed interaction type, there is now one
  71 GPU kernel that handles all listed interactions. This improves the performance when
  72 running bonded calculations on a GPU.
  73
  74 Delay for ramp-up added to PP-PME tuning
  75 """"""""""""""""""""""""""""""""""""""""
  76
  77 Modern CPUs and GPUs can take a few seconds to ramp up their clock speeds.
  78 Therefore, PP-PME load balancing now starts after 5 seconds instead
  79 of after a few MD steps. This avoids choosing sub-optimal settings from timings taken before the clocks have ramped up.
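
The idea of gating tuning on elapsed wall-clock time rather than on a step
count can be illustrated with the short Python sketch below. This is not the
|Gromacs| implementation: the class name ``DelayedTuningStart`` and its
interface are hypothetical, and only the 5-second delay is taken from this note.

```python
import time


class DelayedTuningStart:
    """Gate that opens once a wall-clock delay has elapsed since creation.

    Illustrative only; the clock is injectable so the gate can be tested
    without actually waiting.
    """

    def __init__(self, delay_seconds=5.0, clock=time.monotonic):
        self._clock = clock
        self._start = clock()  # reference time taken at construction
        self._delay = delay_seconds

    def may_start(self):
        """Return True once at least delay_seconds have passed."""
        return self._clock() - self._start >= self._delay
```

A run loop would create the gate at start-up and begin collecting tuning
timings only once ``may_start()`` returns ``True``, so that measurements taken
during clock ramp-up are ignored.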