docs/doxygen/lib/simd.md

   1 Single-instruction Multiple-data (SIMD) coding {#page_simd}
   2 ==============================================
   3
   4 Coding with SIMD instructions
   5 =============================
   6
   7 One important way for \Gromacs to achieve high performance is
   8 to use modern hardware capabilities where a single assembly
   9 instruction operates on multiple data units, essentially short
  10 fixed-length vectors (usually 2, 4, 8, or 16 elements). This provides
  11 a very efficient way for the CPU to increase floating-point
  12 performance, but it is much less versatile than general purpose
  13 registers. For this reason it is difficult for the compiler to
  14 generate efficient SIMD code, so the user has to organize the
  15 data so that it can be accessed as vectors, and these vectors
  16 often need to be aligned on cache boundaries.
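
As a minimal illustration of the data-organization point above, the following standalone sketch stores coordinates as separate, aligned x/y/z arrays so that consecutive elements of one component can be fetched as a vector. It does not use the \Gromacs SIMD module; the names `SketchCoordinates`, `sketchIsAligned`, `sketchNorm2`, and the width and alignment constants are hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical values for illustration only; real code should query the
// SIMD module instead of hard-coding a width or an alignment.
constexpr std::size_t kSketchWidth     = 4;
constexpr std::size_t kSketchAlignment = 64;

// Structure-of-arrays layout: each component is contiguous and aligned, so
// kSketchWidth consecutive x values could be fetched with one vector load.
struct SketchCoordinates
{
    alignas(kSketchAlignment) float x[2 * kSketchWidth];
    alignas(kSketchAlignment) float y[2 * kSketchWidth];
    alignas(kSketchAlignment) float z[2 * kSketchWidth];
};

// Returns true if the address is a multiple of kSketchAlignment bytes.
inline bool sketchIsAligned(const void* ptr)
{
    return reinterpret_cast<std::uintptr_t>(ptr) % kSketchAlignment == 0;
}

// Squared norms for one block of kSketchWidth particles. The inner loop has
// the element-wise shape that maps directly onto SIMD multiplies and adds.
inline void sketchNorm2(const SketchCoordinates& c, std::size_t block, float* out)
{
    assert(sketchIsAligned(c.x) && sketchIsAligned(c.y) && sketchIsAligned(c.z));
    const std::size_t first = block * kSketchWidth;
    for (std::size_t i = 0; i < kSketchWidth; i++)
    {
        const std::size_t j = first + i;
        out[i] = c.x[j] * c.x[j] + c.y[j] * c.y[j] + c.z[j] * c.z[j];
    }
}
```

With an array-of-structures layout (x, y, z interleaved per particle) the same loop would need a transpose or gather before any vector arithmetic could start.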
  17
  18 We have supported a number of different SIMD instruction sets in
  19 the group kernels for ages, and SIMD is now also used in the
  20 Verlet kernels and a few other places. However, with the increased
  21 usage and several architectures with different capabilities we now
  22 use a vendor-agnostic \Gromacs SIMD module, as documented in
  23 \ref module_simd.
  24
  25 Design of the \Gromacs SIMD module
  26 ==================================
  27
  28 The functions in `src/gromacs/simd` are intended to be used for writing
  29 architecture-independent SIMD intrinsics code. Rather than making assumptions
  30 based on architecture, we have introduced a limited number of
  31 predefined preprocessor macros that describe the capabilities of the
  32 current implementation - these are the ones you need to check when
  33 writing SIMD code. As you will see, the functionality exposed by
  34 this module is typically a small subset of general SIMD implementations,
  35 and in particular we do not even try to expose advanced shuffling or
  36 permute operations, simply because we haven't been able to describe those
  37 in a generic way that can be implemented efficiently regardless of the
  38 hardware. However, the advantage of this approach is that it is straightforward
  39 to extend with support for new SIMD instruction sets in the future,
  40 and that will instantly speed up old code too.
  41
  42 To support the more complex stuff in the \Gromacs nonbonded kernels and
  43 to make it possible to use SIMD intrinsics even for some parts of the code
  44 where the data is not in SIMD-friendly layout, we have also added about 10
  45 higher-level utility routines. These perform gather/scatter operations
  46 on coordinate triplets, they load table data and aligned pairs (Lennard-Jones
  47 parameters), and sum up the forces needed in the outer loop of the nonbonded
  48 kernels. They are very straightforward to implement, but since they are
  49 performance-critical we want to exploit all features of each architecture,
  50 and for this reason they are part of the SIMD implementation.
  51
  52 Finally, for some architectures with large or very large SIMD width (e.g. AVX
  53 with 8 elements in single precision, or AVX-512 with 16), the nonbonded
  54 kernels can become inefficient. Since all such architectures presently known
  55 (AVX, AVX2, MIC, AVX512) also provide extensive support for accessing
  56 parts of the register, we optionally define a handful of routines to
  57 perform load, store, and reduce operations based on half-SIMD-width data,
  58 which can improve performance. These routines are only useful for wide
  59 implementations, and they can safely be ignored when first porting to a
  60 new platform - they are only needed for the so-called 2xnn SIMD kernels.
  61
  62 Unfortunately there is no standard for SIMD architectures. The available
  63 features vary a lot, but we still need to use quite a few of them to
  64 get the best performance possible. This means some features will only
  65 be available on certain platforms, and it is critical that we do NOT make
  66 too many assumptions about the storage formats, their size or SIMD width.
  67 Just to give a few examples:
  68
  69 - On x86, double precision (64-bit) floating-point values always convert
  70   to 32-bit integers, while many other platforms use 64-bit, and some cannot
  71   use 32-bit integers at all. This means we cannot use a mask (boolean)
  72   derived from integer operations to select double-precision floating-point
  73   values, and it could get very complex for higher-level code if all these
  74   decisions were exposed. Instead, we want to keep integers 32-bit since
  75   all algorithms anyway need to work in single precision (with 32-bit ints).
  76 - AVX1 only supports 4-wide 128-bit integer SIMD arithmetic, but the integer
  77   _conversions_ can still be done 8-wide which corresponds to the single
  78   precision floating-point width. Similarly, with AVX1 conversions between
  79   double-precision and integers use the 32-bit 4-wide 128-bit registers where
  80   we can also do integer arithmetic. AVX2 adds proper arithmetic for
  81   8-wide integers. We would severely limit performance if we had to say
  82   that integer support was not present, so instead we stick to 32-bit ints
  83   but limit the operations we expose (and do shuffling internally).
  84 - For SSE2 through SSE4.1, double precision is 2-wide, but when we convert
  85   to integers they will be put in the first two elements of a 4-wide integer
  86   type. This means we cannot assume that floating-point SIMD registers and
  87   corresponding integer registers (after conversion) have the same width.
  88 - Since boolean values can have different width for float/double and the
  89   integers corresponding to float/double, we need to use separate boolean
  90   types for all these values and convert between them if we e.g. want to use
  91   the result of an integer compare to select floating-point values.
  92
  93 While this might sound complicated, it is actually far easier than writing
  94 separate SIMD code for 10 architectures in both single & double. The point
  95 is not that you need to remember the limitations above, but it is critical
  96 that you *never assume anything about the SIMD implementation*. We
  97 typically implement SIMD support for a new architecture in days with this
  98 new module, and the extensions required for Verlet kernels
  99 are also very straightforward (group kernels can be more complex, but those
 100 are gradually on their way out). For the higher-level
 101 code, the only important thing is to never _assume_ anything about the SIMD
 102 architecture. Our general strategy in \Gromacs is to split the SIMD coding
 103 into the following levels:
 104
 105 <dl>
 106 <dt>Base level generic SIMD</dt>
 107 <dd>
 108 The base level SIMD module (which we get by including `gromacs/simd/simd.h`)
 109 provides the API to define and manipulate SIMD datatypes. This will be enough
 110 for lots of cases, and it is a huge advantage that there is roughly
 111 parity between different architectures.
 112 </dd>
 113 <dt>Higher-level architecture-specific SIMD utility functions</dt>
 114 <dd>
 115 For some parts of the code this is not enough. In particular, both the
 116 group and Verlet kernels do insane amounts of floating-point operations,
 117 and since we spend 85-90% of the time in these kernels it is critical that
 118 we can optimize them as much as possible. Here, our strategy is first to
 119 define larger high-level functions that e.g. take a number of distances
 120 and load the table interactions for this interaction. This way we can
 121 move this architecture-specific implementation to the SIMD module, and
 122 achieve a reasonably clean kernel while still optimizing a lot. This
 123 is what we have done for the approximately 10 functions for the nonbonded
 124 kernels, to load tables and Lennard-Jones parameters, and to sum up the
 125 forces in the outer loop. These functions have intentionally been given
 126 names that describe what they do with the data, rather than what their
 127 function is in \Gromacs. By looking at the documentation for these routines,
 128 and the reference implementation, it should be quite straightforward to
 129 implement them for a new architecture too.
 130 </dd>
 131 <dt>Half-SIMD-width architecture-specific utility functions</dt>
 132 <dd>
 133 As described earlier, as the SIMD width increases to 8 or more elements,
 134 the nonbonded kernels can become inefficient due to the large j-particle
 135 cluster size. Things will still work, but if an architecture supports
 136 efficient access to partial SIMD registers (e.g. loading half the width),
 137 we can use this to alter the balance between memory load/store operations
 138 and floating-point arithmetic operations by processing either e.g. 4-by-4
 139 or 2-by-8 interactions in one iteration. When \ref
 140 GMX_SIMD_HAVE_HSIMD_UTIL_REAL is set, a handful of routines to
 141 use this in the nonbonded kernels is present. Avoid using these routines
 142 outside the nonbonded kernels since they are slightly more complex, and
 143 it is not straightforward to determine which alternative provides the best
 144 performance.
 145 </dd>
 146 <dt>Architecture-specific kernels (directories/files)</dt>
 147 <dd>
 148 No code outside the SIMD module implementation directories should try
 149 to execute anything hardware specific. Note that this includes even checking
 150 for what architecture the current SIMD implementation is - you should check
 151 for features instead, so it will work with future ports too.
 152 </dd>
 153 </dl>
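
To make the "check for features, not architectures" rule concrete, here is a minimal standalone sketch. It branches only on \ref GMX_SIMD_HAVE_REAL and \ref GMX_SIMD_REAL_WIDTH and never on a compiler or architecture macro. The fallback definitions of the two macros are stand-ins so that the sketch compiles on its own; in real code they come from `gromacs/simd/simd.h`. The function `sketchSum` is hypothetical, it uses `float` where \Gromacs code would use `real`, and its lane loop is a plain C++ placeholder for the SIMD load and add calls.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins so this sketch compiles by itself. In real code these macros are
// provided by gromacs/simd/simd.h and must never be defined by hand.
#ifndef GMX_SIMD_HAVE_REAL
#    define GMX_SIMD_HAVE_REAL 0
#endif
#ifndef GMX_SIMD_REAL_WIDTH
#    define GMX_SIMD_REAL_WIDTH 1
#endif

// Hypothetical helper that sums n values.
inline float sketchSum(const float* data, std::size_t n)
{
    assert(data != nullptr || n == 0);
    float       sum = 0.0f;
    std::size_t i   = 0;
#if GMX_SIMD_HAVE_REAL
    // Feature available: consume GMX_SIMD_REAL_WIDTH elements per iteration.
    // The lane loops below are placeholders for SIMD load/add/reduce calls.
    float lanes[GMX_SIMD_REAL_WIDTH] = {};
    for (; i + GMX_SIMD_REAL_WIDTH <= n; i += GMX_SIMD_REAL_WIDTH)
    {
        for (std::size_t lane = 0; lane < GMX_SIMD_REAL_WIDTH; lane++)
        {
            lanes[lane] += data[i + lane];
        }
    }
    for (std::size_t lane = 0; lane < GMX_SIMD_REAL_WIDTH; lane++)
    {
        sum += lanes[lane];
    }
#endif
    // Scalar path: the whole array without SIMD, or just the remainder with it.
    for (; i < n; i++)
    {
        sum += data[i];
    }
    return sum;
}
```

Because the only conditions are feature macros, the same source keeps working when a port to a new architecture defines them.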
 154
 155 File organization
 156 =================
 157
 158 The SIMD module uses a couple of different files:
 159
 160 <dl>
 161 <dt>`gromacs/simd/simd.h`</dt>
 162 <dd>
 163 This is the top-level wrapper that you should always include first.
 164 It will check the settings made at configuration time and include a
 165 suitable low-level implementation (that can be either single, double,
 166 or both). It also contains the routines for memory alignment, and
 167 based on the current \Gromacs precision it will set aliases to 'real'
 168 SIMD datatypes (see further down) so the implementations do not have
 169 to care about \Gromacs-specific details. However, note that you might
 170 not get all SIMD support you hoped for: If you compiled \Gromacs in
 171 double precision but the hardware only supports single-precision SIMD
 172 there will not be any SIMD routines for default \Gromacs 'real' precision.
 173 There are \#defines you can use to check this, as described further down.
 174 </dd>
 175 <dt>`gromacs/simd/impl_reference/impl_reference.h`</dt>
 176 <dd>
 177 This is an example of a low-level implementation. You should never, ever,
 178 work directly with these in higher-level code. The reference implementation
 179 contains the documentation for all SIMD wrappers, though. This file will
 180 in turn include other separate implementation files for single, double,
 181 simd4, etc. Since we want to be able to run the low-level SIMD implementation
 182 in simulators for new platforms, these files are intentionally not using
 183 the rest of the GROMACS infrastructure, e.g. for asserts().
 184 </dd>
 185 <dt>`gromacs/simd/simd_math.h`</dt>
 186 <dd>
 187 SIMD math functions. All functions in this file have to be designed
 188 so they work no matter whether the hardware supports integer SIMD, logical
 189 operations on integer or floating-point SIMD, or arithmetic operations
 190 on integers. However, a few routines check for defines and use faster
 191 algorithms if these features are present.
 192 </dd>
 193 <dt>`gromacs/simd/vector_operations.h`</dt>
 194 <dd>
 195 This file contains a few rvec-related SIMD functions, e.g. to
 196 calculate scalar products, norms, or cross products. They obviously
 197 cannot operate on scalar \Gromacs rvec types, but use separate SIMD
 198 variables for X,Y, and Z vector components.
 199 </dd>
 200 </dl>
 201
 202
 203 SIMD datatypes
 204 ==============
 205
 206 The SIMD module handles the challenges mentioned in the introduction
 207 by introducing a number of datatypes;
 208 many of these might map to the same underlying SIMD types, but we need separate
 209 types because some architectures use different registers e.g. for boolean
 210 types.
 211
 212 Floating-point data
 213 -------------------
 214
 215 <dl>
 216 <dt>`#gmx::SimdReal`</dt>
 217 <dd>
 218 This is the SIMD-version of \Gromacs' real type,
 219 which is set based on the CMake configuration and internally aliased
 220 to one of the next two types.
 221 </dd>
 222 <dt>`#gmx::SimdFloat`</dt>
 223 <dd>
 224 This is always single-precision data, but it
 225 might not be supported on all architectures.
 226 </dd>
 227 <dt>`gmx::SimdDouble`</dt>
 228 <dd>
 229 This is always double precision when available,
 230 and in rare cases you might want to use a specific precision.
 231 </dd>
 232 </dl>
 233
 234 Integers corresponding to floating-point values
 235 -----------------------------------------------
 236
 237 For these types, 'correspond' means that it is the integer type we
 238 get when we convert data e.g. from single (or double) precision
 239 floating-point SIMD variables. Those need to be different, since many
 240 common implementations only use half as many elements for double as
 241 for single SIMD variables, and then we only get half the number of
 242 integers too.
 243
 244 <dl>
 245 <dt>`#gmx::SimdInt32`</dt>
 246 <dd>
 247 This is used for integers when converting to/from \Gromacs default "real" type.
 248 </dd>
 249 <dt>`gmx::SimdFInt32`</dt>
 250 <dd>
 251 Integers obtained when converting from single precision, or intended to be
 252 converted to single precision floating-point. These are normal integers
 253 (not a special conversion type), but since some SIMD architectures such as
 254 SSE or AVX use different registers for integer SIMD variables having the
 255 same width as float and double, respectively, we need to separate these
 256 two types of integers. The actual operations you perform on them are normal
 257 ones such as addition or multiplication.
 258 This will also be the widest integer data type if you want to do pure
 259 integer SIMD operations, but that will not be supported on all platforms.
 260 If the architecture does not support any SIMD integer type at all, this
 261 will likely be defined from the floating-point SIMD type, without support
 262 for any integer operations apart from load/store/convert.
 263 </dd>
 264 <dt>`gmx::SimdDInt32`</dt>
 265 <dd>
 266 Integers used when converting to/from double. See the preceding item
 267 for a detailed explanation. On many architectures,
 268 including all x86 ones, this will be a narrower type than `gmx::SimdFInt32`.
 269 </dd>
 270 </dl>
 271
 272 Note that all integer load/store operations defined here load/store 32-bit
 273 integers, even when the internal register storage might be 64-bit, and we
 274 set the "width" of the SIMD implementation based on how many float/double/
 275 integers we load/store - even if the internal width could be larger.
 276
 277 Boolean values
 278 --------------
 279
 280 We need a separate boolean datatype for masks and comparison results, since
 281 we cannot assume they are identical either to integers, floats or double -
 282 some implementations use specific predicate registers for booleans.
 283
 284 <dl>
 285 <dt>`#gmx::SimdBool`</dt>
 286 <dd>
 287 Results from boolean operations involving reals, and the booleans we use
 288 to select between real values. The corresponding routines have suffix `B`,
 289 like `gmx::simdOrB()`.
 290 </dd>
 291 <dt>`gmx::SimdFBool`</dt>
 292 <dd>
 293 Booleans specifically for single precision.
 294 </dd>
 295 <dt>`gmx::SimdDBool`</dt>
 296 <dd>
 297 Operations specifically on double.
 298 </dd>
 299 <dt>`#gmx::SimdIBool`</dt>
 300 <dd>
 301 Boolean operations on integers corresponding to real (see floating-point
 302 descriptions above).
 303 </dd>
 304 <dt>`gmx::SimdFIBool`</dt>
 305 <dd>
 306 Booleans for integers corresponding to float.
 307 </dd>
 308 <dt>`gmx::SimdDIBool`</dt>
 309 <dd>
 310 Booleans for integers corresponding to double.
 311 </dd>
 312 </dl>
 313
 314 Note: You should NOT try to store and load boolean SIMD types to memory - that
 315 is the whole reason why there are no store or load operations provided for
 316 them. While it will be technically possible to achieve by defining objects
 317 inside a structure and then doing a placement new with aligned memory, this
 318 can be a very expensive operation on platforms where special single-bit
 319 predicate registers are used to represent booleans. You will need to find
 320 a more portable algorithm for your code instead.
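
One portable alternative to storing masks is to recompute the comparison where it is needed and consume the mask immediately in a select operation. The sketch below emulates this with plain arrays. All names (`SketchReal`, `SketchBool`, `sketchLessThan`, `sketchSelectByMask`, `sketchApplyCutoff`) and the width of 4 are hypothetical stand-ins for this example, not the \Gromacs API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kSketchMaskWidth = 4; // hypothetical SIMD width

struct SketchReal
{
    std::array<float, kSketchMaskWidth> v;
};

// Opaque mask type: deliberately has no load/store functions, since a real
// implementation might keep it in a predicate register.
struct SketchBool
{
    std::array<bool, kSketchMaskWidth> m;
};

// Broadcast one scalar to all lanes.
inline SketchReal sketchSet1(float x)
{
    SketchReal r;
    r.v.fill(x);
    return r;
}

// Lane-wise a < b.
inline SketchBool sketchLessThan(const SketchReal& a, const SketchReal& b)
{
    SketchBool r;
    for (std::size_t i = 0; i < kSketchMaskWidth; i++)
    {
        r.m[i] = a.v[i] < b.v[i];
    }
    return r;
}

// Keep a where the mask is true, zero elsewhere.
inline SketchReal sketchSelectByMask(const SketchReal& a, const SketchBool& mask)
{
    SketchReal r;
    for (std::size_t i = 0; i < kSketchMaskWidth; i++)
    {
        r.v[i] = mask.m[i] ? a.v[i] : 0.0f;
    }
    return r;
}

// The mask lives only inside this expression and never touches memory.
inline SketchReal sketchApplyCutoff(const SketchReal& r2, const SketchReal& value, float cutoff2)
{
    assert(cutoff2 >= 0.0f);
    return sketchSelectByMask(value, sketchLessThan(r2, sketchSet1(cutoff2)));
}
```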
 321
 322 The subset you should use in practice
 323 -------------------------------------
 324
 325 If this seems daunting, in practice you should only need to use these types
 326 when you start coding:
 327
 328 <dl>
 329 <dt>`#gmx::SimdReal`</dt>
 330 <dd>
 331 Floating-point data.
 332 </dd>
 333 <dt>`#gmx::SimdBool`</dt>
 334 <dd>
 335 Booleans.
 336 </dd>
 337 <dt>`#gmx::SimdInt32`</dt>
 338 <dd>
 339 Integer data. Might not be supported, so you must check
 340 the preprocessor macros described below.
 341 </dd>
 342 </dl>
 343
 344 Operations on these types will be defined to either float/double (or
 345 corresponding integers) based on the current \Gromacs precision, so the
 346 documentation is occasionally more detailed for the lower-level actual
 347 implementation functions.
 348
 349 Note that it is critical for these types to be aligned in memory. This
 350 should always be the case when you declare variables on the stack, but
 351 unfortunately some compilers (at least clang-3.7 on OS X) appear to be
 352 buggy when our SIMD datatypes are placed inside a structure. Somewhere
 353 in the process where this structure includes our class, which in turn
 354 includes the actual SIMD datatype, the alignment appears to be lost.
 355 Thus, even though the compiler will not warn you, until further notice
 356 we need to avoid putting the SIMD datatypes into other structures. This
 357 is particularly severe when allocating memory on the heap, but it occurs
 358 for stack structures/classes too.
 359
 360
 361 SIMD4 implementation
 362 --------------------
 363
 364 The above should be sufficient for code that works with the full SIMD width.
 365 Unfortunately reality is not that simple. Some algorithms like lattice
 366 summation need quartets of elements, so even when the SIMD width is >4 we
 367 need width-4 SIMD if it is supported. The availability of SIMD4 is indicated
 368 by \ref GMX_SIMD4_HAVE_FLOAT and \ref GMX_SIMD4_HAVE_DOUBLE. For now we only
 369 support a small subset of SIMD operations for SIMD4. Because SIMD4 doesn't
 370 scale with increasingly large SIMD width it should be avoided for all new
 371 code and SIMD4N should be used instead.
 372
 373 SIMD4N implementation
 374 ---------------------
 375
 376 Some code, like lattice summation, has inner loops which are smaller
 377 than the full SIMD width. In GROMACS algorithms 3 and 4 iterations are common
 378 because of PME order and three dimensions. This makes 4 an important special
 379 case. Vectorizing such loops efficiently requires collapsing the two
 380 innermost loops and using e.g. one 8-wide SIMD vector for 2 outer
 381 and 4 inner iterations or one 16-wide SIMD vector for 4 outer and 4 inner
 382 iterations. For this, SIMD4N functions are
 383 provided. The availability of these functions is indicated by
 384 \ref GMX_SIMD_HAVE_4NSIMD_UTIL_FLOAT and
 385 \ref GMX_SIMD_HAVE_4NSIMD_UTIL_DOUBLE.
 386 These functions return the type alias Simd4NFloat / Simd4NDouble which is
 387 either the normal SIMD type or the SIMD4 type and thus only supports
 388 the operations the SIMD4 type supports.
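
The loop collapsing described above can be illustrated without any SIMD types. In the following standalone sketch an inner loop of 4 iterations and an outer loop of 2 iterations are mapped onto the 8 lanes of one (emulated) vector, so that a single full-width multiply replaces two half-empty ones. The names and the 8-wide width are assumptions made for this example; it does not use the actual SIMD4N functions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kSketchInner = 4; // e.g. four inner iterations
constexpr std::size_t kSketchWide  = 8; // hypothetical 8-wide SIMD
constexpr std::size_t kSketchOuter = kSketchWide / kSketchInner;
static_assert(kSketchWide % kSketchInner == 0, "width must be a multiple of 4");

using SketchWideVec = std::array<float, kSketchWide>;

// Reference version: the inner loop is too short to fill an 8-wide vector.
inline SketchWideVec sketchNestedLoops(const std::array<float, kSketchOuter>& outer,
                                       const std::array<float, kSketchInner>& inner)
{
    SketchWideVec out{};
    for (std::size_t o = 0; o < kSketchOuter; o++)
    {
        for (std::size_t i = 0; i < kSketchInner; i++)
        {
            out[o * kSketchInner + i] = outer[o] * inner[i];
        }
    }
    return out;
}

// Collapsed version: every lane handles one (outer, inner) pair.
inline SketchWideVec sketchCollapsedLoops(const std::array<float, kSketchOuter>& outer,
                                          const std::array<float, kSketchInner>& inner)
{
    SketchWideVec outerLanes{}, innerLanes{}, out{};
    for (std::size_t lane = 0; lane < kSketchWide; lane++)
    {
        outerLanes[lane] = outer[lane / kSketchInner]; // each outer value repeated 4 times
        innerLanes[lane] = inner[lane % kSketchInner]; // inner values tiled twice
    }
    // Stands in for one full-width SIMD multiply.
    for (std::size_t lane = 0; lane < kSketchWide; lane++)
    {
        out[lane] = outerLanes[lane] * innerLanes[lane];
    }
    return out;
}

// Usage: both forms give the same eight products.
inline void sketchCheckCollapse()
{
    const std::array<float, kSketchOuter> outer = { 2.0f, 3.0f };
    const std::array<float, kSketchInner> inner = { 1.0f, 10.0f, 100.0f, 1000.0f };
    assert(sketchNestedLoops(outer, inner) == sketchCollapsedLoops(outer, inner));
}
```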
 389
 390 Predefined SIMD preprocessor macros
 391 ===================================
 392
 393 Functionality-wise, we have a small set of core features that we
 394 require to be present on all platforms, while more advanced features can be
 395 used in the code when defines like e.g. \ref GMX_SIMD_HAVE_LOADU have the
 396 value 1.
 397
 398 This is a summary of the currently available preprocessor defines that
 399 you should use to check for support when using the corresponding features.
 400 We first list the float/double/int defines set by the _implementation_; in
 401 most cases you do not want to check directly for float/double defines, but
 402 you should instead use the derived "real" defines set in this file - we list
 403 those at the end below.
 404
 405 The following preprocessor macros are set by the low-level implementation.
 406 These only have the value 1 if they work for all datatypes;
 407 \ref GMX_SIMD_HAVE_LOADU thus means we can load float, double, and
 408 integers from unaligned memory, and that the unaligned loads are available
 409 for SIMD4 too.
 410
 411 <dl>
 412 <dt>\ref GMX_SIMD</dt>
 413 <dd>
 414 Some sort of SIMD architecture is enabled.
 415 </dd>
 416 <dt>\ref GMX_SIMD_HAVE_FLOAT</dt>
 417 <dd>
 418 Single-precision instructions available.
 419 </dd>
 420 <dt>\ref GMX_SIMD_HAVE_DOUBLE</dt>
 421 <dd>
 422 Double-precision instructions available.
 423 </dd>
 424 <dt>\ref GMX_SIMD_HAVE_LOADU</dt>
 425 <dd>
 426 Load from unaligned memory available.
 427 </dd>
 428 <dt>\ref GMX_SIMD_HAVE_STOREU</dt>
 429 <dd>
 430 Store to unaligned memory available.
 431 </dd>
 432 <dt>\ref GMX_SIMD_HAVE_LOGICAL</dt>
 433 <dd>
 434 Support for and/andnot/or/xor on floating-point variables.
 435 </dd>
 436 <dt>\ref GMX_SIMD_HAVE_FMA</dt>
 437 <dd>
 438 Floating-point fused multiply-add.
 439 Note: We provide emulated FMA instructions if you do not have FMA
 440 support, but in that case you might be able to code it more efficiently without FMA.
 441 </dd>
 442 <dt>\ref GMX_SIMD_HAVE_FINT32_EXTRACT</dt>
 443 <dd>
 444 Support for extracting integer SIMD elements from `gmx::SimdFInt32`.
 445 </dd>
 446 <dt>\ref GMX_SIMD_HAVE_FINT32_LOGICAL</dt>
 447 <dd>
 448 Bitwise shifts on `gmx::SimdFInt32`.
 449 </dd>
 450 <dt>\ref GMX_SIMD_HAVE_FINT32_ARITHMETICS</dt>
 451 <dd>
 452 Arithmetic ops for `gmx::SimdFInt32`.
 453 </dd>
 454 <dt>\ref GMX_SIMD_HAVE_DINT32_EXTRACT</dt>
 455 <dd>
 456 Support for extracting integer SIMD elements from `gmx::SimdDInt32`.
 457 </dd>
 458 <dt>\ref GMX_SIMD_HAVE_DINT32_LOGICAL</dt>
 459 <dd>
 460 Bitwise shifts on `gmx::SimdDInt32`.
 461 </dd>
 462 <dt>\ref GMX_SIMD_HAVE_DINT32_ARITHMETICS</dt>
 463 <dd>
 464 Arithmetic ops for `gmx::SimdDInt32`.
 465 </dd>
 466 <dt>\ref GMX_SIMD_HAVE_HSIMD_UTIL_FLOAT</dt>
 467 <dd>
 468 Half-SIMD-width nonbonded kernel utilities available for float SIMD.
 469 </dd>
 470 <dt>\ref GMX_SIMD_HAVE_HSIMD_UTIL_DOUBLE</dt>
 471 <dd>
 472 Half-SIMD-width nonbonded kernel utilities available for double SIMD.
 473 </dd>
 474 <dt>\ref GMX_SIMD_HAVE_GATHER_LOADU_BYSIMDINT_TRANSPOSE_FLOAT</dt>
 475 <dd>
 476 Can load pairs of unaligned floats from simd offsets (meant for linear tables).
 477 </dd>
 478 <dt>\ref GMX_SIMD_HAVE_GATHER_LOADU_BYSIMDINT_TRANSPOSE_DOUBLE</dt>
 479 <dd>
 480 Can load pairs of unaligned doubles from simd offsets (meant for linear tables).
 481 </dd>
 482 </dl>
 483
 484 There are also two macros specific to SIMD4: \ref GMX_SIMD4_HAVE_FLOAT is set
 485 if we can use SIMD4 in single precision, and \ref GMX_SIMD4_HAVE_DOUBLE
 486 similarly denotes support for a double-precision SIMD4 implementation. For
 487 generic properties (e.g. whether SIMD4 FMA is supported), you should check
 488 the normal SIMD macros above.
 489
 490 Implementation properties
 491 -------------------------
 492
 493 Higher-level code can use these macros to find information about the implementation,
 494 for instance what the SIMD width is:
 495
 496 <dl>
 497 <dt>\ref GMX_SIMD_FLOAT_WIDTH</dt>
 498 <dd>
 499 Number of elements in `gmx::SimdFloat`, and practical width of `gmx::SimdFInt32`.
 500 </dd>
 501 <dt>\ref GMX_SIMD_DOUBLE_WIDTH</dt>
 502 <dd>
 503 Number of elements in `gmx::SimdDouble`, and practical width of `gmx::SimdDInt32`.</dd>
 504 <dt>\ref GMX_SIMD_RSQRT_BITS</dt>
 505 <dd>
 506 Accuracy (bits) of 1/sqrt(x) lookup step.
 507 </dd>
 508 <dt>\ref GMX_SIMD_RCP_BITS</dt>
 509 <dd>
 510 Accuracy (bits) of 1/x lookup step.
 511 </dd>
 512 </dl>
 513
 514 After including the low-level architecture-specific implementation, this
 515 header sets the following derived defines based on the current precision;
 516 these are the ones you should check for unless you absolutely want to dig
 517 deep into the explicit single/double precision implementations:
 518
 519 <dl>
 520 <dt>\ref GMX_SIMD_HAVE_REAL</dt>
 521 <dd>
 522 Set to either \ref GMX_SIMD_HAVE_FLOAT or \ref GMX_SIMD_HAVE_DOUBLE
 523 </dd>
 524 <dt>\ref GMX_SIMD4_HAVE_REAL</dt>
 525 <dd>
 526 Set to either \ref GMX_SIMD4_HAVE_FLOAT or \ref GMX_SIMD4_HAVE_DOUBLE
 527 </dd>
 528 <dt>\ref GMX_SIMD_REAL_WIDTH</dt>
 529 <dd>
 530 Set to either \ref GMX_SIMD_FLOAT_WIDTH or \ref GMX_SIMD_DOUBLE_WIDTH
 531 </dd>
 532 <dt>\ref GMX_SIMD_HAVE_INT32_EXTRACT</dt>
 533 <dd>
 534 Set to either \ref GMX_SIMD_HAVE_FINT32_EXTRACT or \ref GMX_SIMD_HAVE_DINT32_EXTRACT
 535 </dd>
 536 <dt>\ref GMX_SIMD_HAVE_INT32_LOGICAL</dt>
 537 <dd>
 538 Set to either \ref GMX_SIMD_HAVE_FINT32_LOGICAL or \ref GMX_SIMD_HAVE_DINT32_LOGICAL
 539 </dd>
 540 <dt>\ref GMX_SIMD_HAVE_INT32_ARITHMETICS</dt>
 541 <dd>
 542 Set to either \ref GMX_SIMD_HAVE_FINT32_ARITHMETICS or \ref GMX_SIMD_HAVE_DINT32_ARITHMETICS
 543 </dd>
 544 <dt>\ref GMX_SIMD_HAVE_HSIMD_UTIL_REAL</dt>
 545 <dd>
 546 Set to either \ref GMX_SIMD_HAVE_HSIMD_UTIL_FLOAT or \ref GMX_SIMD_HAVE_HSIMD_UTIL_DOUBLE
 547 </dd>
 548 <dt>\ref GMX_SIMD_HAVE_GATHER_LOADU_BYSIMDINT_TRANSPOSE_REAL</dt>
 549 <dd>
 550 Set to either \ref GMX_SIMD_HAVE_GATHER_LOADU_BYSIMDINT_TRANSPOSE_FLOAT or \ref GMX_SIMD_HAVE_GATHER_LOADU_BYSIMDINT_TRANSPOSE_DOUBLE
 551 </dd>
 552 </dl>
 553
 554 For convenience we also define \ref GMX_SIMD4_WIDTH to 4. This will never vary,
 555 but using it helps you make it clear that a loop or array refers to the
 556 SIMD4 width rather than some other '4'.
 557
 558 While all these defines are available to specify the features of the
 559 hardware, we would strongly recommend that you do NOT sprinkle your code
 560 with defines - if nothing else it will be a debug nightmare. Instead you can
 561 write a slower generic SIMD function that works everywhere, and then override
 562 this with faster architecture-specific versions for some implementations. The
 563 recommended way to do that is to add a define around the generic function
 564 that skips it if the name is already defined. The actual implementations in
 565 the lowest-level files are typically defined to an architecture-specific name
 566 (such as `simdSinCosD_Sse2`) so we can override it (e.g. in SSE4) by
 567 simply undefining and setting a new definition. Still, this is an
 568 implementation detail you won't have to worry about until you start writing
 569 support for a new SIMD architecture.
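
The override mechanism described above can be sketched in a few lines. This is a standalone illustration with hypothetical names (`sketchInvsqrt`, `sketchInvsqrtArchA`, `sketchInverseDistance`), not code from the SIMD module: an implementation file provides a function under an architecture-specific name and defines the generic name to point at it, and the generic version is wrapped in a guard that skips it when the name is already defined.

```cpp
#include <cassert>
#include <cmath>

// --- Would live in an architecture-specific implementation file ---
// Hypothetical faster version, defined under an architecture-specific name.
inline float sketchInvsqrtArchA(float x)
{
    assert(x > 0.0f);
    return 1.0f / std::sqrt(x);
}
#define sketchInvsqrt sketchInvsqrtArchA
// A later, more capable implementation could override this again with
//   #undef sketchInvsqrt
//   #define sketchInvsqrt sketchInvsqrtArchB

// --- Would live in the generic header ---
// The slower generic version is skipped if an implementation already
// provided the name.
#ifndef sketchInvsqrt
inline float sketchInvsqrt(float x)
{
    return std::pow(x, -0.5f);
}
#endif

// Higher-level code only ever uses the generic name and needs no #if.
inline float sketchInverseDistance(float dx, float dy, float dz)
{
    return sketchInvsqrt(dx * dx + dy * dy + dz * dz);
}
```

The design choice here is that the preprocessor logic stays in one place next to the generic function, while calling code remains free of conditionals.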
 570
 571
 572 Function naming
 573 ---------------
 574
 575 We rely on C++ overloading, so the name of a function is usually identical
 576 regardless of what datatype it operates on. There are a few exceptions to this
 577 for functions that do not take arguments but only return a value, e.g. setZero(),
 578 since overloading only works if the formal parameters are different. To solve this,
 579 we use different low-level function names in these cases, but then create proxy
 580 objects in the high-level `gromacs/simd/simd.h` so that you can still get the
 581 functionality by simply writing setZero() in the code.
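
The proxy-object idea can be sketched as follows. This is a simplified standalone illustration with hypothetical types and names (`SketchFloat`, `SketchDouble`, `SketchSetZeroProxy`, `sketchSetZero`), not the actual definitions in `gromacs/simd/simd.h`: the argument-free function returns a small proxy, and the proxy's conversion operators pick the low-level function that matches the type being initialized.

```cpp
#include <cassert>

struct SketchFloat
{
    float v[4];
};
struct SketchDouble
{
    double v[2];
};

// Low-level functions need distinct names, since overloading cannot
// distinguish functions that differ only in their return type.
inline SketchFloat sketchSetZeroF()
{
    return SketchFloat{ { 0.0f, 0.0f, 0.0f, 0.0f } };
}
inline SketchDouble sketchSetZeroD()
{
    return SketchDouble{ { 0.0, 0.0 } };
}

// The proxy converts itself to whichever type the caller asks for.
class SketchSetZeroProxy
{
public:
    operator SketchFloat() const { return sketchSetZeroF(); }
    operator SketchDouble() const { return sketchSetZeroD(); }
};

// The single name that high-level code uses.
inline SketchSetZeroProxy sketchSetZero()
{
    return {};
}

// Usage: the declared type on the left selects the conversion.
inline void sketchSetZeroUsage()
{
    SketchFloat  f = sketchSetZero();
    SketchDouble d = sketchSetZero();
    assert(f.v[3] == 0.0f && d.v[1] == 0.0);
}
```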
 582
 583 Automated checking
 584 ------------------
 585
 586 Having fallback implementations when SIMD is not supported can be a
 587 performance problem if the code does not correctly include
 588 `gromacs/simd/simd.h`, particularly after refactoring.
 589 `make check-source` checks the whole code for the use of symbols defined
 590 in `gromacs/simd/simd.h` and requires that files using those symbols
 591 do the correct include. Similar checking is done for higher-level
 592 SIMD-management headers, e.g. `gromacs/ewald/pme_simd.h`.
 593
 594
 595 The SIMD math library
 596 =====================
 597
 598 In addition to the low-level SIMD instructions, \Gromacs comes with a fairly
 599 extensive SIMD math library in `gromacs/simd/simd_math.h` to support various
 600 mathematical functions. The functions are available both in single and
 601 double precision (overloaded on the usual math function names), and we also
 602 provide a special version of functions that use double precision arguments,
 603 but that only evaluate the result to single precision accuracy. This is
 604 useful when you don’t need highly accurate results, but you want to avoid
 605 the overhead of doing multiple single/double conversions, or if the hardware
 606 architecture only provides a double precision SIMD implementation.
 607
 608 For a few functions such as the square root and exponential that are
 609 performance-critical, we provide additional template parameters where the
 610 default choice is to execute the normal function version, but it is also
 611 possible to choose an unsafe execution path that completely bypasses all
 612 argument checking. Make absolutely sure your arguments always fulfil the
 613 restrictions listed in the documentation of such a function before using it,
 614 and it might even be a good idea to add a note before each call to an unsafe
 615 function justifying why that flavor is fine to use here.
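
The safe/unsafe template pattern, including the recommended justification note at the call site, might look like the following scalar sketch. The names `SketchMathOptimization`, `sketchSqrt`, and `sketchShiftedLength` are hypothetical, and the zero check is only one example of an argument check; the real restrictions are listed in the documentation of each math function.

```cpp
#include <cassert>
#include <cmath>

enum class SketchMathOptimization
{
    Safe,
    Unsafe
};

// Square root computed as x * (1/sqrt(x)), as is common when only an
// inverse square root is cheap. The safe flavor guards against x == 0,
// where 0 * inf would otherwise produce NaN.
template<SketchMathOptimization opt = SketchMathOptimization::Safe>
inline float sketchSqrt(float x)
{
    if (opt == SketchMathOptimization::Safe)
    {
        if (x == 0.0f)
        {
            return 0.0f;
        }
    }
    return x * (1.0f / std::sqrt(x));
}

// Returns sqrt(r2 + 1).
inline float sketchShiftedLength(float r2)
{
    assert(r2 >= 0.0f);
    // Unsafe is fine here: the argument is at least 1, so the zero check
    // in the safe flavor could never trigger.
    return sketchSqrt<SketchMathOptimization::Unsafe>(r2 + 1.0f);
}
```

Since the flavor is a template parameter with a default, existing calls keep the checked behavior and only call sites that opt in explicitly skip the checks.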