docs/doxygen/lib/simd.md

   1 Single-instruction Multiple-data (SIMD) coding {#page_simd}
   2 ==============================================
   3
   4 Coding with SIMD instructions
   5 =============================
   6
   7 One important way for \Gromacs to achieve high performance is
   8 to use modern hardware capabilities where a single assembly
   9 instruction operates on multiple data units, essentially short
  10 fixed-length vectors (usually 2,4,8, or 16 elements). This provides
  11 a very efficient way for the CPU to increase floating-point
  12 performance, but it is much less versatile than general purpose
  13 registers. For this reason it is difficult for the compiler to
  14 generate efficient SIMD code, so the user has to organize the
  15 data in a way where it is possible to access as vectors, and
  16 these vectors often need to be aligned on cache boundaries.
  17
  18 We have supported a number of different SIMD instruction sets in
  19 the group kernels for ages, and it is now also present in the
  20 verlet kernels and a few other places. However, with the increased
  21 usage and several architectures with different capabilities we now
  22 use a vendor-agnostic \Gromacs SIMD module, as documented in
  23 \ref module_simd.
  24
  25 Design of the \Gromacs SIMD module
  26 ==================================
  27
  28 The macros in `src/gromacs/simd` are intended to be used for writing
  29 architecture-independent SIMD intrinsics code. Rather than making assumptions
  30 based on architecture, we have introduced a limited number of
  31 predefined preprocessor macros that describe the capabilities of the
  32 current implementation - these are the ones you need to check when
  33 writing SIMD code. As you will see, the functionality exposed by
  34 this module as typically a small subset of general SIMD implementations,
  35 and in particular we do not even try to expose advanced shuffling or
  36 permute operations, simply because we haven't been able to describe those
  37 in a generic way that can be implemented efficiently regardless of the
  38 hardware. However, the advantage of this approach is that it is straightforward
  39 to extend with support for new simd instruction sets in the future,
  40 and that will instantly speed up old code too.
  41
  42 Unfortunately there is no standard for SIMD architectures. The available
  43 features vary a lot, but we still need to use quite a few of them to
  44 get the best performance possible. This means some features will only
  45 be available on certain platforms, and it is critical that we do NOT make
  46 to many assumptions about the storage formats, their size or SIMD width.
  47 Just to give a few examples:
  48
  49 - On x86, double precision (64-bit) floating-point values always convert
  50   to 32-bit integers, while many other platforms use 64-bit, and some cannot
  51   use 32-bit integers at all. This means we cannot use a mask (boolean)
  52   derived from integer operations to select double-precision floating-point
  53   values, and it could get very complex for higher-level code if all these
  54   decisions were exposed. Instead, we want to keep integers 32-bit since
  55   all algorithms anyway need to work in single precision (w. 32-bit ints).
  56 - IBM QPX uses 4-wide SIMD both for single and double precision. Integer
  57   support is highly limited, and the storage format means QPX does not
  58   use x86-style all-ones masks (which have different widths in single/double)
  59   but it uses the sign bit to denote the _false_ value. In particular, this
  60   means we cannot use the bit contents for any fancy mask operations.
  61 - AVX1 only supports 4-wide 128-bit integer SIMD arithmetics, but the integer
  62   _conversions_ can still be done 8-wide which corresponds to the single
  63   precision floating-point width. Similarly, with AVX1 conversions between
  64   double-precision and integers use the 32-bit 4-wide 128bit registers where
  65   we can also do integer arithmetics. AVX2 adds proper arithmetics for
  66   8-wide integers. We would severely limit performance if we had to say
  67   that integer support was not present, so instead we stick to 32-bit ints
  68   but limit the operations we expose (and do shuffling internally).
  69 - For SSE2 through SSE4.1, double precision is 2-wide, but when we convert
  70   to integers they will be put in the first two elements of a 4-wide integer
  71   type. This means we cannot assume that floating-point SIMD registers and
  72   corresponding integer registers (after conversion) have the same width.
  73 - The 2-wide SIMD instructions on BlueGene/L and BlueGene/P cannot do any
  74   floating-point logical operations (and/andnot/or/xor) whatsoever, which
  75   can be a pain when implementing approximations for math functions.
  76 - Since boolean values can have different width for float/double and the
  77   integers corresponding to float/double, we need to use separate boolean
  78   types for all these values and convert between them if we e.g. want to use
  79   result of an integer compare to select floating-point values.
  80
  81 While this might sound complicated, it is actually far easier than writing
  82 separate SIMD code for 10 architectures in both single & double. The point
  83 is not that you need to remember the limitations above, but it is critical
  84 that you *never assume anything about the SIMD implementation*. We
  85 typically implement SIMD support for a new architecture in days with this
  86 new module, and the extensions required for verlet kernels
  87 are also very straightforward (group kernels can be more complex, but those
  88 are gradually on their way out). For the higher-level
  89 code, the only important thing is to never _assume_ anything about the SIMD
  90 architecture. Our general strategy in \Gromacs is to split the SIMD coding
  91 in three levels:
  92
  93 <dl>
  94 <dt>Base level generic SIMD</dt>
  95 <dd>
  96 The base level SIMD module (which we get by including `gromacs/simd/simd.h`
  97 provides the API to define and manipulate SIMD datatypes. This will be enough
  98 for lots of cases, and it is a huge advantage that there is roughly
  99 parity between different architectures.
 100 </dd>
 101 <dt>Larger architecture-specific SIMD functions</dt>
 102 <dd>
 103 For some parts of the code this is not enough. In particular, both the
 104 group and Verlet kernels do insane amounts of floating-point operations,
 105 and since we spend 85-90% of the time in these kernels it is critical that
 106 we can optimize them as much as possible. Here, our strategy is first to
 107 define larger high-level functions that e.g. take a number of distances
 108 and loads the table interactions for this interaction. This way we can
 109 move this architecture-specific implementation to the SIMD module, and
 110 both achieve a reasonably clean kernel but still optimize a lot.
 111 </dd>
 112 <dt>Architecture-specific kernels (directories/files)</dt>
 113 <dd>
 114 When it is absolutely impossible to use a shared implementation we might
 115 have to code SIMD (just as GPU code). When this happens, we should create
 116 subdirectory or otherwise clearly names a file with a suffix for the
 117 SIMD architecture, to clarify to the user that the SIMD file has a
 118 direct non-SIMD correspondence. Since this code can be very hard to read,
 119 it is important to be explicit and use lots of comments - this is not the
 120 type of code where you should use smart optimization with hundreds of
 121 preprocessor directives. Keep it simple so other developers can help you
 122 support it. The question is not whether you can get a function 20% faster,
 123 but whether it justifies the added complexity of the code.
 124 </dd>
 125 </dl>
 126
 127 File organization
 128 =================
 129
 130 The SIMD module uses a couple of different files:
 131
 132 <dl>
 133 <dt>`gromacs/simd/simd.h`</dt>
 134 <dd>
 135 This is the top-level wrapper that you should always include first.
 136 It will check the settings made at configuration time and include a
 137 suitable low-level implementation (that can be either single, double,
 138 or both). It also contains the routines for memory alignment, and
 139 based on the current Gromacs precision it will set aliases to 'real'
 140 SIMD datatypes (see further down) so the implementations do not have
 141 to care about Gromacs-specific details. However, note that you might
 142 not get all SIMD support you hoped for: If you compiled Gromacs in
 143 double precision but the hardware only supports single-precision SIMD
 144 there will not be any SIMD routines for default Gromacs 'real' precision.
 145 There are \#defines you can use to check this, as described further down.
 146 </dd>
 147 <dt>`gromacs/simd/impl_reference.h`</dt>
 148 <dd>
 149 This is an example of a low-level implementation. You should never, ever,
 150 work directly with these in higher-level code. The reference implementation
 151 contains the documentation for all SIMD wrappers, though.
 152 </dd>
 153 <dt>`gromacs/simd/simd_math.h`</dt>
 154 <dd>
 155 SIMD math functions. All functions in this file have to be designed
 156 so they work no matter whether the hardware supports integer SIMD, logical
 157 operations on integer or floating-point SIMD, or arithmetic operations
 158 on integers. However, a few routines check for defines and use faster
 159 algorithms if these features are present.
 160 </dd>
 161 <dt>`gromacs/simd/vector_operations.h`</dt>
 162 <dd>
 163 This file contains a few rvec-related SIMD functions, e.g. to
 164 calculate scalar products, norms, or cross products. They obviously
 165 cannot operate on scalar Gromacs rvec types, but use separate SIMD
 166 variables for X,Y, and Z vector components.
 167 </dd>
 168 </dl>
 169
 170
 171 SIMD datatypes
 172 ==============
 173
 174 The SIMD module handles the challenges mentioned in the introduction
 175 by introducing a number of datatypes;
 176 many of these might map to the same underlying SIMD types, but we need separate
 177 types because some architectures use different registers e.g. for boolean
 178 types.
 179
 180 Floating-point data
 181 -------------------
 182
 183 <dl>
 184 <dt>`#gmx_simd_real_t`</dt>
 185 <dd>
 186 This is the SIMD-version of \Gromacs' real type,
 187 which is set based on the CMake configuration and internally aliased
 188 to one of the next two types.
 189 Operations on these variables have the suffix `_r`, e.g. `gmx_simd_add_r()`.
 190 </dd>
 191 <dt>`#gmx_simd_float_t`</dt>
 192 <dd>
 193 This is always single-precision data, but it
 194 might not be supported on all architectures. Suffix `_f` is used for
 195 explicit single-precision routines, e.g. `gmx_simd_mul_f()`.
 196 </dd>
 197 <dt>`gmx_simd_double_t`</dt>
 198 <dd>
 199 This is always double precision when available,
 200 and in rare cases you might want to use a specific precision.
 201 Suffix `_d` is used for explicit double-precision routines,
 202 e.g. `gmx_simd_mul_d()`
 203 </dd>
 204 </dl>
 205
 206 Integers corresponding to floating-point values
 207 -----------------------------------------------
 208
 209 For these types, 'correspond' means that it is the integer type we
 210 get when we convert data e.g. from single (or double) precision
 211 floating-point SIMD variables. Those need to be different, since many
 212 common implementations only use half as many elements for double as
 213 for single SIMD variables, and then we only get half the number of
 214 integers too.
 215
 216 <dl>
 217 <dt>`#gmx_simd_int32_t`</dt>
 218 <dd>
 219 This is used for integers when converting to/from Gromacs default "real" type.
 220 The corresponding routines have suffix `_i`, e.g. `gmx_simd_add_i()`.
 221 </dd>
 222 <dt>`gmx_simd_fint32_t`</dt>
 223 <dd>
 224 Integers obtained when converting from single precision, or intended to be
 225 converted to single precision floating-point. These are normal integers
 226 (not a special conversion type), but since some SIMD architectures such as
 227 SSE or AVX use different registers for integer SIMD variables having the
 228 same width as float and double, respectively, we need to separate these
 229 two types of integers. The actual operations you perform on the are normal
 230 ones such as addition or multiplication. The routines
 231 operating on these variables have suffix `_fi`, like `gmx_simd_add_fi()`.
 232 This will also be the widest integer data type if you want to do pure
 233 integer SIMD operations, but that will not be supported on all platforms.
 234 </dd>
 235 <dt>`gmx_simd_dint32_t`</dt>
 236 <dd>
 237 Integers used when converting to/from double. See the preceding item
 238 for a detailed explanation. On many architectures,
 239 including all x86 ones, this will be a narrower type than `gmx_simd_fint32_t`.
 240 The correspoding routines have suffix `_di`, like `gmx_simd_add_di()`.
 241 </dd>
 242 </dl>
 243
 244 Note that all integer load/stores operations defined here load/store 32-bit
 245 integers, even when the internal register storage might be 64-bit, and we
 246 set the "width" of the SIMD implementation based on how many float/double/
 247 integers we load/store - even if the internal width could be larger.
 248
 249 Boolean values
 250 --------------
 251
 252 We need a separate boolean datatype for masks and comparison results, since
 253 we cannot assume they are identical either to integers, floats or double -
 254 some implementations use specific predicate registers for booleans.
 255
 256 <dl>
 257 <dt>`#gmx_simd_bool_t`</dt>
 258 <dd>
 259 Results from boolean operations involving reals, and the booleans we use
 260 to select between real values. The corresponding routines have suffix `_b`,
 261 like `gmx_simd_or_b()`.
 262 </dd>
 263 <dt>`gmx_simd_fbool_t`</dt>
 264 <dd>
 265 Booleans specifically for single precision. Corresponding function suffix
 266 is `_fb`, like `gmx_simd_or_fb()`.
 267 </dd>
 268 <dt>`gmx_simd_dbool_t`</dt>
 269 <dd>
 270 Operations specifically on double. Operations have suffix `_db`: `gmx_simd_or_db()`
 271 </dd>
 272 <dt>`#gmx_simd_ibool_t`</dt>
 273 <dd>
 274 Boolean operations on integers corresponding to real (see floating-point
 275 descriptions above). Operations on these booleans use suffix `_ib`,
 276 like `gmx_simd_or_ib()`.
 277 </dd>
 278 <dt>`gmx_simd_fibool_t`</dt>
 279 <dd>
 280 Booleans for integers corresponding to float. Operation suffix is `_fib`,
 281 like `gmx_simd_or_fib()`.
 282 </dd>
 283 <dt>`gmx_simd_dibool_t`</dt>
 284 <dd>
 285 Booleans for integers corresponding to double. Operation suffix is `_dib`,
 286 like `gmx_simd_or_dib()`.
 287 </dd>
 288 </dl>
 289
 290 The subset you should use in practice
 291 -------------------------------------
 292
 293 If this seems daunting, in practice you should only need to use these types
 294 when you start coding:
 295
 296 <dl>
 297 <dt>`#gmx_simd_real_t`</dt>
 298 <dd>
 299 Floating-point data.
 300 </dd>
 301 <dt>`#gmx_simd_bool_t`</dt>
 302 <dd>
 303 Booleans.
 304 </dd>
 305 <dt>`#gmx_simd_int32_t`</dt>
 306 <dd>
 307 Integer data. Might not be supported, so you must check
 308 the preprocessor macros described below.
 309 </dd>
 310 </dl>
 311
 312 Operations on these types will be defined to either float/double (or corresponding integers) based on the current Gromacs precision, so the documentation is occasionally more detailed for the lower-level actual implementation functions.
 313
 314 SIMD4 Macros
 315 ------------
 316
 317 The above should be sufficient for code that works with the full SIMD width.
 318 Unfortunately reality is not that simple. Some algorithms like lattice
 319 summation need quartets of elements, so even when the SIMD width is >4 we
 320 need width-4 SIMD if it is supported. These datatypes and operations use the
 321 prefix `gmx_simd4_`, and availability is indicated by `GMX_SIMD4_HAVE_FLOAT`
 322 and `GMX_SIMD4_HAVE_DOUBLE`. For now we only support a small subset of SIMD
 323 operations for SIMD4, but that is trivial to extend if we need to.
 324
 325 Predefined SIMD preprocessor macros
 326 ===================================
 327
 328 Functionality-wise, we have a small set of core set of features that we
 329 require to be present on all platforms, while more avanced features can be
 330 used in the code when defines like e.g. `GMX_SIMD_HAVE_LOADU` are set.
 331
 332 This is a summary of the currently available preprocessor defines that
 333 you should use to check for support when using the corresponding features.
 334 We first list the float/double/int defines set by the _implementation_; in
 335 most cases you do not want to check directly for float/double defines, but
 336 you should instead use the derived "real" defines set in this file - we list
 337 those at the end below.
 338
 339 Preprocessor predefined macro defines set by the low-level implementation.
 340 These are only set if they work for all datatypes; `GMX_SIMD_HAVE_LOADU`
 341 thus means we can load both float, double, and integers from unaligned memory,
 342 and that the unaligned loads are available for SIMD4 too.
 343
 344 <dl>
 345 <dt>`GMX_SIMD_HAVE_FLOAT`</dt>
 346 <dd>
 347 Single-precision instructions available.
 348 </dd>
 349 <dt>`GMX_SIMD_HAVE_DOUBLE `</dt>
 350 <dd>
 351 Double-precision instructions available.
 352 </dd>
 353 <dt>`GMX_SIMD_HAVE_HARDWARE`</dt>
 354 <dd>
 355 Set when we are NOT emulating SIMD.
 356 </dd>
 357 <dt>`GMX_SIMD_HAVE_LOADU`</dt>
 358 <dd>
 359 Load from unaligned memory available.
 360 </dd>
 361 <dt>`GMX_SIMD_HAVE_STOREU`</dt>
 362 <dd>
 363 Store to unaligned memory available.
 364 </dd>
 365 <dt>`GMX_SIMD_HAVE_LOGICAL`</dt>
 366 <dd>
 367 Support for and/andnot/or/xor on floating-point variables.
 368 </dd>
 369 <dt>`GMX_SIMD_HAVE_FMA`</dt>
 370 <dd>
 371 Floating-point fused multiply-add.
 372 Note: We provide emulated FMA instructions if you do not have FMA
 373 support, but in that case you might be able to code it more efficient w/o FMA.
 374 </dd>
 375 <dt>`GMX_SIMD_HAVE_FRACTION`</dt>
 376 <dd>
 377 Instruction to get decimal fraction. Same as FMA: This denotes
 378 hardware support, otherwise instruction will be emulated.
 379 </dd>
 380 <dt>`GMX_SIMD_HAVE_FINT32`</dt>
 381 <dd>
 382 Integer conversions to/from float available.
 383 </dd>
 384 <dt>`GMX_SIMD_HAVE_FINT32_EXTRACT`</dt>
 385 <dd>
 386 Support for extracting integer SIMD elements from `gmx_simd_fint32_t`.
 387 </dd>
 388 <dt>`GMX_SIMD_HAVE_FINT32_LOGICAL`</dt>
 389 <dd>
 390 Bitwise shifts on `gmx_simd_fint32_t`.
 391 </dd>
 392 <dt>`GMX_SIMD_HAVE_FINT32_ARITHMETICS`</dt>
 393 <dd>
 394 Arithmetic ops for `gmx_simd_fint32_t`.
 395 </dd>
 396 <dt>`GMX_SIMD_HAVE_DINT32`</dt>
 397 <dd>
 398 Integer conversions to/from double available.
 399 </dd>
 400 <dt>`GMX_SIMD_HAVE_DINT32_EXTRACT`</dt>
 401 <dd>
 402 Support for extracting integer SIMD elements from `gmx_simd_dint32_t`.
 403 </dd>
 404 <dt>`GMX_SIMD_HAVE_DINT32_LOGICAL`</dt>
 405 <dd>
 406 Bitwise shifts on `gmx_simd_dint32_t`.
 407 </dd>
 408 <dt>`GMX_SIMD_HAVE_DINT32_ARITHMETICS`</dt>
 409 <dd>
 410 Arithmetic ops for `gmx_simd_dint32_t`.
 411 </dd>
 412 </dl>
 413
 414 There are also two macros specific to SIMD4: `GMX_SIMD4_HAVE_FLOAT` is set
 415 if we can use SIMD4 in single precision, and `GMX_SIMD4_HAVE_DOUBLE`
 416 similarly denotes support for a double-precision SIMD4 implementation. For
 417 generic properties (e.g. whether SIMD4 FMA is supported), you should check
 418 the normal SIMD macros above.
 419
 420 Implementation properties
 421 -------------------------
 422
 423 Higher-level code can use these macros to find information about the implementation,
 424 for instance what the SIMD width is:
 425
 426 <dl>
 427 <dt>`GMX_SIMD_FLOAT_WIDTH`</dt>
 428 <dd>
 429 Number of elements in `gmx_simd_float_t`, and practical width of `gmx_simd_fint32_t`.
 430 </dd>
 431 <dt>`GMX_SIMD_DOUBLE_WIDTH`</dt>
 432 <dd>
 433 Number of elements in `gmx_simd_double_t`, and practical width of `gmx_simd_dint32_t`</dd>
 434 <dt>`GMX_SIMD_RSQRT_BITS`</dt>
 435 <dd>
 436 Accuracy (bits) of 1/sqrt(x) lookup step.
 437 </dd>
 438 <dt>`GMX_SIMD_RCP_BITS`</dt>
 439 <dd>
 440 Accuracy (bits) of 1/x lookup step.
 441 </dd>
 442 </dl>
 443
 444 After including the low-level architecture-specific implementation, this
 445 header sets the following derived defines based on the current precision;
 446 these are the ones you should check for unless you absolutely want to dig
 447 deep into the explicit single/double precision implementations:
 448
 449 <dl>
 450 <dt>`GMX_SIMD_HAVE_REAL`</dt>
 451 <dd>
 452 Set either to `GMX_SIMD_HAVE_FLOAT` or `GMX_SIMD_HAVE_DOUBLE`
 453 </dd>
 454 <dt>`GMX_SIMD4_HAVE_REAL`</dt>
 455 <dd>
 456 Set either to `GMX_SIMD4_HAVE_FLOAT` or `GMX_SIMD4_HAVE_DOUBLE`
 457 </dd>
 458 <dt>`GMX_SIMD_REAL_WIDTH`</dt>
 459 <dd>
 460 Set either to `GMX_SIMD_FLOAT_WIDTH` or `GMX_SIMD_DOUBLE_WIDTH`
 461 </dd>
 462 <dt>`GMX_SIMD_HAVE_INT32`</dt>
 463 <dd>
 464 Set either to `GMX_SIMD_HAVE_FINT32` or `GMX_SIMD_HAVE_DINT32`
 465 </dd>
 466 <dt>`GMX_SIMD_INT32_WIDTH`</dt>
 467 <dd>
 468 Set either to `GMX_SIMD_FINT32_WIDTH` or `GMX_SIMD_DINT32_WIDTH`
 469 </dd>
 470 <dt>`GMX_SIMD_HAVE_INT32_EXTRACT`</dt>
 471 <dd>
 472 Set either to `GMX_SIMD_HAVE_FINT32_EXTRACT` or `GMX_SIMD_HAVE_DINT32_EXTRACT`
 473 </dd>
 474 <dt>`GMX_SIMD_HAVE_INT32_LOGICAL`</dt>
 475 <dd>
 476 Set either to `GMX_SIMD_HAVE_FINT32_LOGICAL` or `GMX_SIMD_HAVE_DINT32_LOGICAL`
 477 </dd>
 478 <dt>`GMX_SIMD_HAVE_INT32_ARITHMETICS`</dt>
 479 <dd>
 480 Set either to `GMX_SIMD_HAVE_FINT32_ARITHMETICS` or `GMX_SIMD_HAVE_DINT32_ARITHMETICS`
 481 </dd>
 482 </dl>
 483
 484 For convenience we also define `GMX_SIMD4_WIDTH` to 4. This will never vary,
 485 but using it helps you make it clear that a loop or array refers to the
 486 SIMD4 width rather than some other '4'.
 487
 488 While all these defines are available to specify the features of the
 489 hardware, we would strongly recommend that you do NOT sprinkle your code
 490 with defines - if nothing else it will be a debug nightmare. Instead you can
 491 write a slower generic SIMD function that works everywhere, and then override
 492 this with faster architecture-specific versions for some implementations. The
 493 recommended way to do that is to add a define around the generic function
 494 that skips it if the name is already defined. The actual implementations in
 495 the lowest-level files are typically defined to an architecture-specific name
 496 (such as `gmx_simd_sincos_d_sse2`) so we can override it (e.g. in SSE4) by
 497 simply undefining and setting a new definition. Still, this is an
 498 implementation detail you won't have to worry about until you start writing
 499 support for a new SIMD architecture.
 500
 501
 502