PME OpenCL kernels (spread, gather, solve) are added.
PmeGpuContext is taught to compile them. PME with OpenCL
is enabled everywhere (mdrun, unit tests, integration tests).
Added preliminary documentation.
Change-Id: Iecf1393ddeb442f9b66ca31100bac9b88f83e52b
to the clusters were not written. This adds support for writing this :ref:`index <ndx>` file
as well as proper :ref:`PDB <pdb>` files.
-Intel integrated GPUs are now supported for GPU offload with OpenCL
-"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-On Intel CPUs with integrated GPUs, it is now possible to offload nonbonded tasks
-to the GPU the same way as offload is done to other GPU architectures.
-This can have performance benefits, in particular on modern desktop and mobile
-Intel CPUs this offload can give up to 20% higher simulation performance.
-
Allow using COM of previous step as PBC reference
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
PME can now be run on a GPU when doing free energy perturbations
that do not involve perturbing charges.
+
+PME long-ranged interaction GPU offload now available with OpenCL
+"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+On supported devices from all supported vendors (AMD, Intel, NVIDIA),
+it is now possible to offload PME tasks to the GPU using OpenCL. This
+works in the same way as the former CUDA offload. A single GPU can
+now be used to accelerate the computation of the long-ranged PME
+interactions. This means that a run with only 2-4 CPU cores per
+GPU can now be about as fast as the 2018 version, which needed many more
+CPU cores to balance the GPU. Hardware that already had a good
+balance of GPU and CPU also shows minor improvements, and the capacity
+of hardware with strong GPUs to run effective simulations is now
+greatly improved.
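+
+For example, one possible single-GPU run that offloads both the short-ranged
+nonbonded and the long-ranged PME work (an illustrative invocation; see the
+user guide for the full set of options and automatic defaults) is::
+
+  gmx mdrun -nb gpu -pme gpu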
+
+Intel integrated GPUs are now supported for GPU offload with OpenCL
+"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+On Intel CPUs with integrated GPUs, it is now possible to offload nonbonded tasks
+to the GPU the same way as offload is done to other GPU architectures.
+This can have performance benefits; in particular, on modern desktop and mobile
+Intel CPUs this offload can give up to 20% higher simulation performance.
)
gmx_compile_cpp_as_cuda(
pme-gpu-internal.cpp
+ pme-gpu-program.cpp
pme-gpu-timings.cpp
)
elseif (GMX_USE_OPENCL)
pme-gpu-internal.cpp
pme-gpu-timings.cpp
)
+ file(GLOB EWALD_OPENCL_RUNTIME_SOURCES pme-spread.clh pme-solve.clh pme-gather.clh pme-gpu-utils.clh pme-program.cl pme-gpu-types.h)
+ set(MDLIB_OPENCL_KERNELS ${MDLIB_OPENCL_KERNELS} ${EWALD_OPENCL_RUNTIME_SOURCES} PARENT_SCOPE)
else()
gmx_add_libgromacs_sources(
# Files that implement stubs
--- /dev/null
+/*
+ * This file is part of the GROMACS molecular simulation package.
+ *
+ * Copyright (c) 2018, by the GROMACS development team, led by
+ * Mark Abraham, David van der Spoel, Berk Hess, and Erik Lindahl,
+ * and including many others, as listed in the AUTHORS file in the
+ * top-level source directory and at http://www.gromacs.org.
+ *
+ * GROMACS is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public License
+ * as published by the Free Software Foundation; either version 2.1
+ * of the License, or (at your option) any later version.
+ *
+ * GROMACS is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with GROMACS; if not, see
+ * http://www.gnu.org/licenses, or write to the Free Software Foundation,
+ * Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * If you want to redistribute modifications to GROMACS, please
+ * consider that scientific software is very special. Version
+ * control is crucial - bugs must be traceable. We will be happy to
+ * consider code for inclusion in the official distribution, but
+ * derived work must not be called official GROMACS. Details are found
+ * in the README & COPYING files - if they are missing, get the
+ * official version at http://www.gromacs.org.
+ *
+ * To help us fund GROMACS development, we humbly ask that you cite
+ * the research papers on the package. Check out http://www.gromacs.org.
+ */
+
+/*! \internal \file
+ * \brief Implements PME OpenCL force gathering kernel.
+ * When including this and other PME OpenCL kernel files, plenty of common
+ * constants/macros are expected to be defined (such as "order", which is the PME interpolation order).
+ * For details, please see how pme-program.cl is compiled in pme-gpu-program-impl-ocl.cpp.
+ *
+ * This file's kernels specifically expect the following definitions:
+ *
+ * - atomsPerBlock which expresses how many atoms are processed by a single work group
+ * - order which is the PME interpolation order
+ * - overwriteForces must evaluate to either true or false to specify whether the kernel
+ * overwrites or reduces into the forces buffer
+ * - wrapX and wrapY must evaluate to either true or false to specify whether the grid overlap
+ * in dimension X/Y is to be used
+ *
+ * \author Aleksei Iupinov <a.yupinov@gmail.com>
+ */
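+
+/* Purely as an illustration (nothing is defined here): when pme-program.cl is
+ * compiled (see pme-gpu-program-impl-ocl.cpp), the force-overwriting flavor of
+ * this kernel ends up with definitions along the lines of
+ *   order            4
+ *   atomsPerBlock    (c_gatherWorkGroupSize / PME_SPREADGATHER_THREADS_PER_ATOM)
+ *   overwriteForces  1
+ *   wrapX, wrapY     true
+ */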
+
+#include "pme-gpu-types.h"
+#include "pme-gpu-utils.clh"
+
+#ifndef COMPILE_GATHER_HELPERS_ONCE
+#define COMPILE_GATHER_HELPERS_ONCE
+
+/*! \brief
+ * Unrolls the dynamic index accesses to the constant grid sizes to avoid local memory operations.
+ */
+inline float read_grid_size(const float *realGridSizeFP,
+ const int dimIndex)
+{
+ switch (dimIndex)
+ {
+ case XX: return realGridSizeFP[XX];
+ case YY: return realGridSizeFP[YY];
+ case ZZ: return realGridSizeFP[ZZ];
+ }
+ assert(false);
+ return 0.0f;
+}
+
+/*! \brief Reduce the partial force contributions.
+ *
+ * FIXME: this reduction should be simplified and improved; it does a 3x16 force component
+ * reduction per 16 threads, so no extra shared memory should be needed for intermediates
+ * or for passing results back.
+ *
+ * \param[out] sm_forces Local memory array with the output forces (rvec).
+ * \param[in] atomIndexLocal Local atom index
+ * \param[in] splineIndex Spline index
+ * \param[in] lineIndex Line index (same as threadLocalId)
+ * \param[in] realGridSizeFP Local grid size constant
+ * \param[in] fx Input force partial component X
+ * \param[in] fy Input force partial component Y
+ * \param[in] fz Input force partial component Z
+ * \param[in,out] sm_forceReduction Reduction working buffer
+ * \param[in] sm_forceTemp Convenience pointers into \p sm_forceReduction
+ */
+inline void reduce_atom_forces(__local float * __restrict__ sm_forces,
+ const int atomIndexLocal,
+ const int splineIndex,
+ const int lineIndex,
+ const float *realGridSizeFP,
+ float fx,
+ float fy,
+ float fz,
+ __local float * __restrict__ sm_forceReduction,
+ __local float ** __restrict__ sm_forceTemp
+ )
+
+{
+ // TODO: implement AMD intrinsics reduction, like with shuffles in CUDA version. #2514
+
+ /* Number of data components and threads for a single atom */
+#define atomDataSize PME_SPREADGATHER_THREADS_PER_ATOM
+ // We use blockSize local memory elements to read fx, or fy, or fz, and then reduce them to fit into smemPerDim elements
+ // All those guys are defines and not consts, because they go into the local memory array size.
+#define blockSize (atomsPerBlock * atomDataSize)
+#define smemPerDim warp_size
+#define smemReserved ((DIM - 1) * smemPerDim)
+
+ const int numWarps = blockSize / smemPerDim;
+ const int minStride = max(1, atomDataSize / numWarps);
+
+#pragma unroll DIM
+ for (int dimIndex = 0; dimIndex < DIM; dimIndex++)
+ {
+ int elementIndex = smemReserved + lineIndex;
+ // Store input force contributions
+ sm_forceReduction[elementIndex] = (dimIndex == XX) ? fx : (dimIndex == YY) ? fy : fz;
+ /* This barrier was not needed in CUDA. Different OpenCL compilers might have different ideas
+ * about #pragma unroll, though. OpenCL 2 has _attribute__((opencl_unroll_hint)).
+ * #2519
+ */
+ barrier(CLK_LOCAL_MEM_FENCE);
+
+ // Reduce to fit into smemPerDim (warp size)
+#pragma unroll
+ for (int redStride = atomDataSize >> 1; redStride > minStride; redStride >>= 1)
+ {
+ if (splineIndex < redStride)
+ {
+ sm_forceReduction[elementIndex] += sm_forceReduction[elementIndex + redStride];
+ }
+ }
+ barrier(CLK_LOCAL_MEM_FENCE);
+ // Last iteration - packing everything to be nearby, storing convenience pointer
+ sm_forceTemp[dimIndex] = sm_forceReduction + dimIndex * smemPerDim;
+ int redStride = minStride;
+ if (splineIndex < redStride)
+ {
+ const int packedIndex = atomIndexLocal * redStride + splineIndex;
+ sm_forceTemp[dimIndex][packedIndex] = sm_forceReduction[elementIndex] + sm_forceReduction[elementIndex + redStride];
+ }
+ }
+
+ barrier(CLK_LOCAL_MEM_FENCE);
+
+ assert ((blockSize / warp_size) >= DIM);
+
+ const int warpIndex = lineIndex / warp_size;
+ const int dimIndex = warpIndex;
+
+ // First 3 warps can now process 1 dimension each
+ if (dimIndex < DIM)
+ {
+ int sourceIndex = lineIndex % warp_size;
+#pragma unroll
+ for (int redStride = minStride >> 1; redStride > 1; redStride >>= 1)
+ {
+ if (!(splineIndex & redStride))
+ {
+ sm_forceTemp[dimIndex][sourceIndex] += sm_forceTemp[dimIndex][sourceIndex + redStride];
+ }
+ }
+
+ const float n = read_grid_size(realGridSizeFP, dimIndex);
+
+ const int atomIndex = sourceIndex / minStride;
+ if (sourceIndex == minStride * atomIndex)
+ {
+ sm_forces[atomIndex * DIM + dimIndex] = (sm_forceTemp[dimIndex][sourceIndex] + sm_forceTemp[dimIndex][sourceIndex + 1]) * n;
+ }
+ }
+}
+
+#endif //COMPILE_GATHER_HELPERS_ONCE
+
+/*! \brief
+ * An OpenCL kernel which gathers the atom forces from the grid.
+ * The grid is assumed to be wrapped in dimension Z.
+ * Please see the file description for additional defines which this kernel expects.
+ *
+ * \param[in] kernelParams All the PME GPU data.
+ * \param[in] gm_coefficients Atom charges/coefficients.
+ * \param[in] gm_grid Global 3D grid.
+ * \param[in] gm_theta Atom spline parameter values
+ * \param[in] gm_dtheta Atom spline parameter derivatives
+ * \param[in] gm_gridlineIndices Atom gridline indices (ivec)
+ * \param[in,out] gm_forces Atom forces (rvec)
+ */
+__attribute__((reqd_work_group_size(order, order, atomsPerBlock)))
+__kernel void CUSTOMIZED_KERNEL_NAME(pme_gather_kernel)(const struct PmeOpenCLKernelParams kernelParams,
+ __global const float * __restrict__ gm_coefficients,
+ __global const float * __restrict__ gm_grid,
+ __global const float * __restrict__ gm_theta,
+ __global const float * __restrict__ gm_dtheta,
+ __global const int * __restrict__ gm_gridlineIndices,
+ __global float * __restrict__ gm_forces
+ )
+{
+ /* These are the atom indices - for the shared and global memory */
+ const int atomIndexLocal = get_local_id(ZZ);
+ const int atomIndexOffset = get_group_id(XX) * atomsPerBlock;
+ const int atomIndexGlobal = atomIndexOffset + atomIndexLocal;
+
+ /* Some sizes which are defines and not consts because they go into the array size */
+ #define blockSize (atomsPerBlock * atomDataSize)
+ #define smemPerDim warp_size
+ #define smemReserved ((DIM - 1) * smemPerDim)
+ #define totalSharedMemory (smemReserved + blockSize)
+ #define gridlineIndicesSize (atomsPerBlock * DIM)
+ #define splineParamsSize (atomsPerBlock * DIM * order)
+
+ __local int sm_gridlineIndices[gridlineIndicesSize];
+ __local float2 sm_splineParams[splineParamsSize]; /* Theta/dtheta pairs as .x/.y */
+
+ /* Spline Y/Z coordinates */
+ const int ithy = get_local_id(YY);
+ const int ithz = get_local_id(XX);
+
+ const int threadLocalId = (get_local_id(2) * get_local_size(1) + get_local_id(1)) * get_local_size(0) + get_local_id(0);
+
+ /* These are the spline contribution indices in shared memory */
+ const int splineIndex = (get_local_id(1) * get_local_size(0) + get_local_id(0)); /* Relative to the current particle, 0..15 for order 4 */
+ const int lineIndex = threadLocalId; /* And to all the block's particles */
+
+ /* Staging the atom gridline indices, DIM * atomsPerBlock threads */
+ const int localGridlineIndicesIndex = threadLocalId;
+ const int globalGridlineIndicesIndex = get_group_id(XX) * gridlineIndicesSize + localGridlineIndicesIndex;
+ const int globalCheckIndices = pme_gpu_check_atom_data_index(globalGridlineIndicesIndex, kernelParams.atoms.nAtoms * DIM);
+ if ((localGridlineIndicesIndex < gridlineIndicesSize) & globalCheckIndices)
+ {
+ sm_gridlineIndices[localGridlineIndicesIndex] = gm_gridlineIndices[globalGridlineIndicesIndex];
+ assert(sm_gridlineIndices[localGridlineIndicesIndex] >= 0);
+ }
+ /* Staging the spline parameters, DIM * order * atomsPerBlock threads */
+ const int localSplineParamsIndex = threadLocalId;
+ const int globalSplineParamsIndex = get_group_id(XX) * splineParamsSize + localSplineParamsIndex;
+ const int globalCheckSplineParams = pme_gpu_check_atom_data_index(globalSplineParamsIndex, kernelParams.atoms.nAtoms * DIM * order);
+ if ((localSplineParamsIndex < splineParamsSize) && globalCheckSplineParams)
+ {
+ sm_splineParams[localSplineParamsIndex].x = gm_theta[globalSplineParamsIndex];
+ sm_splineParams[localSplineParamsIndex].y = gm_dtheta[globalSplineParamsIndex];
+ assert(isfinite(sm_splineParams[localSplineParamsIndex].x));
+ assert(isfinite(sm_splineParams[localSplineParamsIndex].y));
+ }
+ barrier(CLK_LOCAL_MEM_FENCE);
+
+ float fx = 0.0f;
+ float fy = 0.0f;
+ float fz = 0.0f;
+
+ const int globalCheck = pme_gpu_check_atom_data_index(atomIndexGlobal, kernelParams.atoms.nAtoms);
+ const int chargeCheck = pme_gpu_check_atom_charge(gm_coefficients[atomIndexGlobal]);
+
+ if (chargeCheck & globalCheck)
+ {
+ const int nx = kernelParams.grid.realGridSize[XX];
+ const int ny = kernelParams.grid.realGridSize[YY];
+ const int nz = kernelParams.grid.realGridSize[ZZ];
+ const int pny = kernelParams.grid.realGridSizePadded[YY];
+ const int pnz = kernelParams.grid.realGridSizePadded[ZZ];
+
+ const int atomWarpIndex = atomIndexLocal % PME_SPREADGATHER_ATOMS_PER_WARP;
+ const int warpIndex = atomIndexLocal / PME_SPREADGATHER_ATOMS_PER_WARP;
+
+ const int splineIndexBase = getSplineParamIndexBase(warpIndex, atomWarpIndex);
+ const int splineIndexY = getSplineParamIndex(splineIndexBase, YY, ithy);
+ const float2 tdy = sm_splineParams[splineIndexY];
+ const int splineIndexZ = getSplineParamIndex(splineIndexBase, ZZ, ithz);
+ const float2 tdz = sm_splineParams[splineIndexZ];
+
+ const int ixBase = sm_gridlineIndices[atomIndexLocal * DIM + XX];
+ int iy = sm_gridlineIndices[atomIndexLocal * DIM + YY] + ithy;
+ if (wrapY & (iy >= ny))
+ {
+ iy -= ny;
+ }
+ int iz = sm_gridlineIndices[atomIndexLocal * DIM + ZZ] + ithz;
+ if (iz >= nz)
+ {
+ iz -= nz;
+ }
+ const int constOffset = iy * pnz + iz;
+
+#pragma unroll order
+ for (int ithx = 0; (ithx < order); ithx++)
+ {
+ int ix = ixBase + ithx;
+ if (wrapX & (ix >= nx))
+ {
+ ix -= nx;
+ }
+ const int gridIndexGlobal = ix * pny * pnz + constOffset;
+ assert(gridIndexGlobal >= 0);
+ const float gridValue = gm_grid[gridIndexGlobal];
+ assert(isfinite(gridValue));
+ const int splineIndexX = getSplineParamIndex(splineIndexBase, XX, ithx);
+ const float2 tdx = sm_splineParams[splineIndexX];
+ const float fxy1 = tdz.x * gridValue;
+ const float fz1 = tdz.y * gridValue;
+ fx += tdx.y * tdy.x * fxy1;
+ fy += tdx.x * tdy.y * fxy1;
+ fz += tdx.x * tdy.x * fz1;
+ }
+ }
+
+ // Reduction of partial force contributions
+ __local float sm_forces[atomsPerBlock * DIM];
+
+ __local float sm_forceReduction[totalSharedMemory];
+ __local float *sm_forceTemp[DIM];
+
+ reduce_atom_forces(sm_forces,
+ atomIndexLocal, splineIndex, lineIndex,
+ kernelParams.grid.realGridSizeFP,
+ fx, fy, fz,
+ sm_forceReduction,
+ sm_forceTemp);
+ barrier(CLK_LOCAL_MEM_FENCE);
+
+ /* Calculating the final forces with no component branching, atomsPerBlock threads */
+ const int forceIndexLocal = threadLocalId;
+ const int forceIndexGlobal = atomIndexOffset + forceIndexLocal;
+ const int calcIndexCheck = pme_gpu_check_atom_data_index(forceIndexGlobal, kernelParams.atoms.nAtoms);
+ if ((forceIndexLocal < atomsPerBlock) & calcIndexCheck)
+ {
+ const float3 atomForces = vload3(forceIndexLocal, sm_forces);
+ const float negCoefficient = -gm_coefficients[forceIndexGlobal];
+ float3 result;
+ result.x = negCoefficient * kernelParams.current.recipBox[XX][XX] * atomForces.x;
+ result.y = negCoefficient * (kernelParams.current.recipBox[XX][YY] * atomForces.x +
+ kernelParams.current.recipBox[YY][YY] * atomForces.y);
+ result.z = negCoefficient * (kernelParams.current.recipBox[XX][ZZ] * atomForces.x +
+ kernelParams.current.recipBox[YY][ZZ] * atomForces.y +
+ kernelParams.current.recipBox[ZZ][ZZ] * atomForces.z);
+ vstore3(result, forceIndexLocal, sm_forces);
+ }
+
+#if !defined(_AMD_SOURCE_) && !defined(_NVIDIA_SOURCE_)
+ /* This is only here for execution of e.g. 32-sized warps on 16-wide hardware; this was gmx_syncwarp() in CUDA.
+ * #2519
+ */
+ barrier(CLK_LOCAL_MEM_FENCE);
+#endif
+
+ assert(atomsPerBlock <= warp_size);
+
+ /* Writing or adding the final forces component-wise, single warp */
+ const int blockForcesSize = atomsPerBlock * DIM;
+ const int numIter = (blockForcesSize + warp_size - 1) / warp_size;
+ const int iterThreads = blockForcesSize / numIter;
+ if (threadLocalId < iterThreads)
+ {
+#pragma unroll
+ for (int i = 0; i < numIter; i++)
+ {
+ const int outputIndexLocal = i * iterThreads + threadLocalId;
+ const int outputIndexGlobal = get_group_id(XX) * blockForcesSize + outputIndexLocal;
+ const int globalOutputCheck = pme_gpu_check_atom_data_index(outputIndexGlobal, kernelParams.atoms.nAtoms * DIM);
+ if (globalOutputCheck)
+ {
+ const float outputForceComponent = sm_forces[outputIndexLocal];
+ if (overwriteForces)
+ {
+ gm_forces[outputIndexGlobal] = outputForceComponent;
+ }
+ else
+ {
+ gm_forces[outputIndexGlobal] += outputForceComponent;
+ }
+ }
+ }
+ }
+}
void pme_gpu_init_internal(PmeGpu *pmeGpu)
{
+#if GMX_GPU == GMX_GPU_CUDA
+ // Prepare to use the device that this PME task was assigned earlier.
+ // Other entities, such as CUDA timing events, are known to implicitly use the device context.
+ CU_RET_ERR(cudaSetDevice(pmeGpu->deviceInfo->id), "Switching to PME CUDA device");
+#endif
+
/* Allocate the target-specific structures */
pmeGpu->archSpecific.reset(new PmeGpuSpecific());
pmeGpu->kernelParams.reset(new PmeGpuKernelParams());
* TODO: PME could also try to pick up nice grid sizes (with factors of 2, 3, 5, 7).
*/
+ // TODO: this is just a convenient reuse because programHandle_ currently is in charge of creating context
+ pmeGpu->archSpecific->context = pmeGpu->programHandle_->impl_->context;
+
// timing enabling - TODO put this in gpu_utils (even though generally this is just option handling?) and reuse in NB
if (GMX_GPU == GMX_GPU_CUDA)
{
pmeGpu->archSpecific->useTiming = (getenv("GMX_DISABLE_GPU_TIMING") == nullptr);
}
- // TODO: this is just a convenient reuse because programHandle_ currently is in charge of creating context
- pmeGpu->archSpecific->context = pmeGpu->programHandle_->impl_->context;
-
-#if GMX_GPU == GMX_GPU_CUDA
- // Prepare to use the device that this PME task was assigned earlier.
- CU_RET_ERR(cudaSetDevice(pmeGpu->deviceInfo->id), "Switching to PME CUDA device");
-#endif
-
#if GMX_GPU == GMX_GPU_CUDA
pmeGpu->maxGridWidthX = pmeGpu->deviceInfo->prop.maxGridSize[0];
#elif GMX_GPU == GMX_GPU_OPENCL
- //TODO we'll need work size checks for OpenCL too
+ pmeGpu->maxGridWidthX = INT32_MAX / 2;
+ // TODO: is there really no global work size limit in OpenCL?
#endif
/* Creating a PME GPU stream:
int indexBase, result;
switch (atomsPerWarp)
{
+ case 1:
+ indexBase = getSplineParamIndexBase<fixedOrder, 1>(warpIndex, atomWarpIndex);
+ result = getSplineParamIndex<fixedOrder, 1>(indexBase, dimIndex, splineIndex);
+ break;
+
case 2:
indexBase = getSplineParamIndexBase<fixedOrder, 2>(warpIndex, atomWarpIndex);
result = getSplineParamIndex<fixedOrder, 2>(indexBase, dimIndex, splineIndex);
break;
+ case 4:
+ indexBase = getSplineParamIndexBase<fixedOrder, 4>(warpIndex, atomWarpIndex);
+ result = getSplineParamIndex<fixedOrder, 4>(indexBase, dimIndex, splineIndex);
+ break;
+
+ case 8:
+ indexBase = getSplineParamIndexBase<fixedOrder, 8>(warpIndex, atomWarpIndex);
+ result = getSplineParamIndex<fixedOrder, 8>(indexBase, dimIndex, splineIndex);
+ break;
+
default:
GMX_THROW(gmx::NotImplementedError(gmx::formatString("Test function call not unrolled for atomsPerWarp = %d in getSplineParamFullIndex", atomsPerWarp)));
}
{
int timerId = (dir == GMX_FFT_REAL_TO_COMPLEX) ? gtPME_FFT_R2C : gtPME_FFT_C2R;
- auto *timingEvent = pme_gpu_fetch_timing_event(pmeGpu, timerId);
pme_gpu_start_timing(pmeGpu, timerId);
- pmeGpu->archSpecific->fftSetup[grid_index]->perform3dFft(dir, timingEvent);
+ pmeGpu->archSpecific->fftSetup[grid_index]->perform3dFft(dir, pme_gpu_fetch_timing_event(pmeGpu, timerId));
pme_gpu_stop_timing(pmeGpu, timerId);
}
const auto *kernelParamsPtr = pmeGpu->kernelParams.get();
GMX_ASSERT(kernelParamsPtr->atoms.nAtoms > 0, "No atom data in PME GPU spread");
- const int maxThreadsPerBlock = c_spreadMaxWarpsPerBlock * pmeGpu->programHandle_->impl_->warpSize;
- // TODO take device limitations (CL_DEVICE_MAX_WORK_GROUP_SIZE) into account for all 3 kernels
- const int order = pmeGpu->common->pme_order;
- const int atomsPerBlock = maxThreadsPerBlock / PME_SPREADGATHER_THREADS_PER_ATOM;
+ const size_t blockSize = pmeGpu->programHandle_->impl_->spreadWorkGroupSize;
+
+ const int order = pmeGpu->common->pme_order;
+ const int atomsPerBlock = blockSize / PME_SPREADGATHER_THREADS_PER_ATOM;
// TODO: pick smaller block size in runtime if needed
// (e.g. on 660 Ti where 50% occupancy is ~25% faster than 100% occupancy with RNAse (~17.8k atoms))
// If doing so, change atomsPerBlock in the kernels as well.
pme_gpu_start_timing(pmeGpu, timingId);
auto *timingEvent = pme_gpu_fetch_timing_event(pmeGpu, timingId);
- const auto kernelArgs = prepareGpuKernelArguments(kernelPtr, config, kernelParamsPtr);
+#if c_canEmbedBuffers
+ const auto kernelArgs = prepareGpuKernelArguments(kernelPtr, config, kernelParamsPtr);
+#else
+ const auto kernelArgs = prepareGpuKernelArguments(kernelPtr, config, kernelParamsPtr,
+ &kernelParamsPtr->atoms.d_theta,
+ &kernelParamsPtr->atoms.d_dtheta,
+ &kernelParamsPtr->atoms.d_gridlineIndices,
+ &kernelParamsPtr->grid.d_realGrid,
+ &kernelParamsPtr->grid.d_fractShiftsTable,
+ &kernelParamsPtr->grid.d_gridlineIndicesTable,
+ &kernelParamsPtr->atoms.d_coefficients,
+ &kernelParamsPtr->atoms.d_coordinates
+ );
+#endif
+
launchGpuKernel(kernelPtr, config, timingEvent, "PME spline/spread", kernelArgs);
pme_gpu_stop_timing(pmeGpu, timingId);
GMX_ASSERT(false, "Implement grid ordering here and below for the kernel launch");
}
- const int maxThreadsPerBlock = c_solveMaxWarpsPerBlock * pmeGpu->programHandle_->impl_->warpSize;
- const int maxBlockSize = maxThreadsPerBlock;
+ const size_t maxBlockSize = pmeGpu->programHandle_->impl_->solveMaxWorkGroupSize;
+
const int gridLineSize = pmeGpu->kernelParams->grid.complexGridSize[minorDim];
- const int gridLinesPerBlock = std::max(maxBlockSize / gridLineSize, 1);
+ const int gridLinesPerBlock = std::max(maxBlockSize / gridLineSize, 1UL);
const int blocksPerGridLine = (gridLineSize + maxBlockSize - 1) / maxBlockSize;
const int cellsPerBlock = gridLineSize * gridLinesPerBlock;
const size_t warpSize = pmeGpu->programHandle_->impl_->warpSize;
pme_gpu_start_timing(pmeGpu, timingId);
auto *timingEvent = pme_gpu_fetch_timing_event(pmeGpu, timingId);
- const auto kernelArgs = prepareGpuKernelArguments(kernelPtr, config, kernelParamsPtr);
+#if c_canEmbedBuffers
+ const auto kernelArgs = prepareGpuKernelArguments(kernelPtr, config, kernelParamsPtr);
+#else
+ const auto kernelArgs = prepareGpuKernelArguments(kernelPtr, config, kernelParamsPtr,
+ &kernelParamsPtr->grid.d_splineModuli,
+ &kernelParamsPtr->constants.d_virialAndEnergy,
+ &kernelParamsPtr->grid.d_fourierGrid);
+#endif
launchGpuKernel(kernelPtr, config, timingEvent, "PME solve", kernelArgs);
pme_gpu_stop_timing(pmeGpu, timingId);
pme_gpu_copy_input_gather_atom_data(pmeGpu);
}
- const int maxThreadsPerBlock = c_gatherMaxWarpsPerBlock * pmeGpu->programHandle_->impl_->warpSize;
- const int atomsPerBlock = maxThreadsPerBlock / PME_SPREADGATHER_THREADS_PER_ATOM;
+ const size_t blockSize = pmeGpu->programHandle_->impl_->gatherWorkGroupSize;
+ const int atomsPerBlock = blockSize / PME_SPREADGATHER_THREADS_PER_ATOM;
GMX_ASSERT(!c_usePadding || !(PME_ATOM_DATA_ALIGNMENT % atomsPerBlock), "inconsistent atom data padding vs. gathering block size");
const int blockCount = pmeGpu->nAtomsPadded / atomsPerBlock;
pme_gpu_start_timing(pmeGpu, timingId);
auto *timingEvent = pme_gpu_fetch_timing_event(pmeGpu, timingId);
- const auto kernelArgs = prepareGpuKernelArguments(kernelPtr, config, kernelParamsPtr);
+#if c_canEmbedBuffers
+ const auto kernelArgs = prepareGpuKernelArguments(kernelPtr, config, kernelParamsPtr);
+#else
+ const auto kernelArgs = prepareGpuKernelArguments(kernelPtr, config, kernelParamsPtr,
+ &kernelParamsPtr->atoms.d_coefficients,
+ &kernelParamsPtr->grid.d_realGrid,
+ &kernelParamsPtr->atoms.d_theta,
+ &kernelParamsPtr->atoms.d_dtheta,
+ &kernelParamsPtr->atoms.d_gridlineIndices,
+ &kernelParamsPtr->atoms.d_forces);
+#endif
launchGpuKernel(kernelPtr, config, timingEvent, "PME gather", kernelArgs);
pme_gpu_stop_timing(pmeGpu, timingId);
#include "gromacs/gpu_utils/ocl_compiler.h"
#include "gromacs/utility/stringutil.h"
+#include "pme-gpu-constants.h"
#include "pme-gpu-internal.h" // for GridOrdering enum
#include "pme-gpu-program-impl.h"
#include "pme-gpu-types-host.h"
+#include "pme-grid.h"
PmeGpuProgramImpl::PmeGpuProgramImpl(const gmx_device_info_t *deviceInfo)
{
- // Context creation (which should happen outside of this class: #2522
+ // Context creation (which should happen outside of this class: #2522)
cl_platform_id platformId = deviceInfo->ocl_gpu_id.ocl_platform_id;
cl_device_id deviceId = deviceInfo->ocl_gpu_id.ocl_device_id;
cl_context_properties contextProperties[3];
GMX_THROW(gmx::InternalError(errorString));
}
- warpSize = gmx::ocl::getWarpSize(context, deviceId);
+ // kernel parameters
+ warpSize = gmx::ocl::getWarpSize(context, deviceId);
+ spreadWorkGroupSize = std::min(c_spreadMaxWarpsPerBlock * warpSize,
+ deviceInfo->maxWorkGroupSize);
+ solveMaxWorkGroupSize = std::min(c_solveMaxWarpsPerBlock * warpSize,
+ deviceInfo->maxWorkGroupSize);
+ gatherWorkGroupSize = std::min(c_gatherMaxWarpsPerBlock * warpSize,
+ deviceInfo->maxWorkGroupSize);
- //TODO: OpenCL kernel compilation should be here.
+ compileKernels(deviceInfo);
}
PmeGpuProgramImpl::~PmeGpuProgramImpl()
{
// TODO: log releasing errors
- clReleaseContext(context);
+ cl_int gmx_used_in_debug stat = 0;
+ stat |= clReleaseKernel(splineAndSpreadKernel);
+ stat |= clReleaseKernel(splineKernel);
+ stat |= clReleaseKernel(spreadKernel);
+ stat |= clReleaseKernel(gatherKernel);
+ stat |= clReleaseKernel(gatherReduceWithInputKernel);
+ stat |= clReleaseKernel(solveXYZKernel);
+ stat |= clReleaseKernel(solveXYZEnergyKernel);
+ stat |= clReleaseKernel(solveYZXKernel);
+ stat |= clReleaseKernel(solveYZXEnergyKernel);
+ stat |= clReleaseContext(context);
+ GMX_ASSERT(stat == CL_SUCCESS, gmx::formatString("Failed to release PME OpenCL resources %d: %s",
+ stat, ocl_get_error_string(stat).c_str()).c_str());
+}
+
+void PmeGpuProgramImpl::compileKernels(const gmx_device_info_t *deviceInfo)
+{
+ // We might consider storing program as a member variable if it's needed later
+ cl_program program = nullptr;
+ /* Need to catch std::bad_alloc here and during compilation string handling. */
+ try
+ {
+ /* Here we pass macros and static const int variables defined in include
+ * files outside as macros, to avoid including those files
+ * in the JIT compilation that happens at runtime.
+ */
+ constexpr int order = 4;
+ const std::string commonDefines = gmx::formatString(
+ "-Dwarp_size=%zd "
+ "-Dorder=%d "
+ "-DPME_SPREADGATHER_ATOMS_PER_WARP=%zd "
+ "-DPME_SPREADGATHER_THREADS_PER_ATOM=%d "
+ // forwarding from pme-grid.h, used for spline computation table sizes only
+ "-Dc_pmeMaxUnitcellShift=%f "
+ // forwarding PME behavior constants from pme-gpu-constants.h
+ "-Dc_usePadding=%d "
+ "-Dc_skipNeutralAtoms=%d "
+ "-Dc_virialAndEnergyCount=%d "
+ // forwarding kernel work sizes
+ "-Dc_spreadWorkGroupSize=%zd "
+ "-Dc_solveMaxWorkGroupSize=%zd "
+ "-Dc_gatherWorkGroupSize=%zd "
+ // forwarding from vectypes.h
+ "-DDIM=%d -DXX=%d -DYY=%d -DZZ=%d "
+ // decomposition parameter placeholders
+ "-DwrapX=true -DwrapY=true ",
+ warpSize,
+ order,
+ warpSize / PME_SPREADGATHER_THREADS_PER_ATOM,
+ PME_SPREADGATHER_THREADS_PER_ATOM,
+ static_cast<float>(c_pmeMaxUnitcellShift),
+ c_usePadding,
+ c_skipNeutralAtoms,
+ c_virialAndEnergyCount,
+ spreadWorkGroupSize,
+ solveMaxWorkGroupSize,
+ gatherWorkGroupSize,
+ DIM, XX, YY, ZZ);
+ try
+ {
+ /* TODO when we have a proper MPI-aware logging module,
+ the log output here should be written there */
+ program = gmx::ocl::compileProgram(stderr,
+ "src/gromacs/ewald",
+ "pme-program.cl",
+ commonDefines,
+ context,
+ deviceInfo->ocl_gpu_id.ocl_device_id,
+ deviceInfo->vendor_e);
+ }
+ catch (gmx::GromacsException &e)
+ {
+ e.prependContext(gmx::formatString("Failed to compile PME kernels for GPU #%s\n",
+ deviceInfo->device_name));
+ throw;
+ }
+ }
+ GMX_CATCH_ALL_AND_EXIT_WITH_FATAL_ERROR;
+
+ constexpr cl_uint expectedKernelCount = 9;
+ // Has to be equal to or larger than the number of kernel instances.
+ // If it is not, CL_INVALID_VALUE is returned.
+ std::vector<cl_kernel> kernels(expectedKernelCount, nullptr);
+ cl_uint actualKernelCount = 0;
+ cl_int clError = clCreateKernelsInProgram(program, kernels.size(), kernels.data(), &actualKernelCount);
+ if (clError != CL_SUCCESS)
+ {
+ const std::string errorString = gmx::formatString("Failed to create kernels for PME on GPU #%s:\n OpenCL error %d: %s",
+ deviceInfo->device_name, clError, ocl_get_error_string(clError).c_str());
+ GMX_THROW(gmx::InternalError(errorString));
+ }
+ kernels.resize(actualKernelCount);
+
+ std::array<char, 100> kernelNamesBuffer;
+ for (const auto &kernel : kernels)
+ {
+ clError = clGetKernelInfo(kernel, CL_KERNEL_FUNCTION_NAME,
+ kernelNamesBuffer.size(), kernelNamesBuffer.data(), nullptr);
+ if (clError != CL_SUCCESS)
+ {
+ const std::string errorString = gmx::formatString("Failed to parse kernels for PME on GPU #%s:\n OpenCL error %d: %s",
+ deviceInfo->device_name, clError, ocl_get_error_string(clError).c_str());
+ GMX_THROW(gmx::InternalError(errorString));
+ }
+
+ // The names below must correspond to those defined in pme-program.cl
+ // TODO use a map with string key instead?
+ if (!strcmp(kernelNamesBuffer.data(), "pmeSplineKernel"))
+ {
+ splineKernel = kernel;
+ }
+ else if (!strcmp(kernelNamesBuffer.data(), "pmeSplineAndSpreadKernel"))
+ {
+ splineAndSpreadKernel = kernel;
+ }
+ else if (!strcmp(kernelNamesBuffer.data(), "pmeSpreadKernel"))
+ {
+ spreadKernel = kernel;
+ }
+ else if (!strcmp(kernelNamesBuffer.data(), "pmeGatherKernel"))
+ {
+ gatherKernel = kernel;
+ }
+ else if (!strcmp(kernelNamesBuffer.data(), "pmeGatherReduceWithInputKernel"))
+ {
+ gatherReduceWithInputKernel = kernel;
+ }
+ else if (!strcmp(kernelNamesBuffer.data(), "pmeSolveYZXKernel"))
+ {
+ solveYZXKernel = kernel;
+ }
+ else if (!strcmp(kernelNamesBuffer.data(), "pmeSolveYZXEnergyKernel"))
+ {
+ solveYZXEnergyKernel = kernel;
+ }
+ else if (!strcmp(kernelNamesBuffer.data(), "pmeSolveXYZKernel"))
+ {
+ solveXYZKernel = kernel;
+ }
+ else if (!strcmp(kernelNamesBuffer.data(), "pmeSolveXYZEnergyKernel"))
+ {
+ solveXYZEnergyKernel = kernel;
+ }
+ }
+ clReleaseProgram(program);
}
#include "pme-gpu-program-impl.h"
-PmeGpuProgramImpl::PmeGpuProgramImpl(const gmx_device_info_t * /*unused*/) : warpSize(0){}
+PmeGpuProgramImpl::PmeGpuProgramImpl(const gmx_device_info_t * /*unused*/) :
+ warpSize(0),
+ spreadWorkGroupSize(0),
+ gatherWorkGroupSize(0),
+ solveMaxWorkGroupSize(0)
+{}
PmeGpuProgramImpl::~PmeGpuProgramImpl() = default;
#include "pme-gpu-program-impl.h"
-#include "gromacs/gpu_utils/cuda_arch_utils.cuh" // only for warp_size
-
+#include "pme-gpu-constants.h"
#include "pme-gpu-internal.h" // for GridOrdering enum
#include "pme-gpu-types-host.h"
PmeGpuProgramImpl::PmeGpuProgramImpl(const gmx_device_info_t *)
{
- warpSize = warp_size;
+ // kernel parameters
+ warpSize = warp_size;
+ spreadWorkGroupSize = c_spreadMaxThreadsPerBlock;
+ solveMaxWorkGroupSize = c_solveMaxThreadsPerBlock;
+ gatherWorkGroupSize = c_gatherMaxThreadsPerBlock;
// PME interpolation order
constexpr int pmeOrder = 4;
#include "gromacs/utility/classhelpers.h"
#if GMX_GPU == GMX_GPU_CUDA
-// TODO uncomment when we learn to compile .cpp with CUDA compiler
-//! include "gromacs/gpu_utils/gputraits.cuh"
-using Context = void *;
+#include "gromacs/gpu_utils/gputraits.cuh"
#elif GMX_GPU == GMX_GPU_OPENCL
#include "gromacs/gpu_utils/gputraits_ocl.h"
#elif GMX_GPU == GMX_GPU_NONE
* Spreading kernels also have hardcoded X/Y indices wrapping parameters,
* as a placeholder for implementing 1/2D decomposition.
*/
+ size_t spreadWorkGroupSize;
+
PmeKernelHandle splineKernel;
PmeKernelHandle spreadKernel;
PmeKernelHandle splineAndSpreadKernel;
/** Same for gather: hardcoded X/Y unwrap parameters, order of 4, plus
* it can either reduce with previous forces in the host buffer, or ignore them.
*/
+ size_t gatherWorkGroupSize;
+
PmeKernelHandle gatherReduceWithInputKernel;
PmeKernelHandle gatherKernel;
//@}
/** Solve kernel doesn't care about the interpolation order, but can optionally
* compute energy and virial, and supports XYZ and YZX grid orderings.
*/
+ size_t solveMaxWorkGroupSize;
+
PmeKernelHandle solveYZXKernel;
PmeKernelHandle solveXYZKernel;
PmeKernelHandle solveYZXEnergyKernel;
explicit PmeGpuProgramImpl(const gmx_device_info_t *deviceInfo);
~PmeGpuProgramImpl();
GMX_DISALLOW_COPY_AND_ASSIGN(PmeGpuProgramImpl);
+
+ private:
+ // Compiles kernels, if supported. Called by the constructor.
+ void compileKernels(const gmx_device_info_t *deviceInfo);
};
#endif
/*! \brief Kernel scheduling grid width limit in X - derived from deviceinfo compute capability in CUDA.
* Declared as very large int to make it useful in computations with type promotion, to avoid overflows.
+ * OpenCL does not seem to have a readily available global work size limit, so we just assign a large arbitrary constant to this instead.
+ * TODO: this should be in PmeGpuProgram(Impl)
*/
std::intmax_t maxGridWidthX;
--- /dev/null
+/*
+ * This file is part of the GROMACS molecular simulation package.
+ *
+ * Copyright (c) 2018, by the GROMACS development team, led by
+ * Mark Abraham, David van der Spoel, Berk Hess, and Erik Lindahl,
+ * and including many others, as listed in the AUTHORS file in the
+ * top-level source directory and at http://www.gromacs.org.
+ *
+ * GROMACS is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public License
+ * as published by the Free Software Foundation; either version 2.1
+ * of the License, or (at your option) any later version.
+ *
+ * GROMACS is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with GROMACS; if not, see
+ * http://www.gnu.org/licenses, or write to the Free Software Foundation,
+ * Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * If you want to redistribute modifications to GROMACS, please
+ * consider that scientific software is very special. Version
+ * control is crucial - bugs must be traceable. We will be happy to
+ * consider code for inclusion in the official distribution, but
+ * derived work must not be called official GROMACS. Details are found
+ * in the README & COPYING files - if they are missing, get the
+ * official version at http://www.gromacs.org.
+ *
+ * To help us fund GROMACS development, we humbly ask that you cite
+ * the research papers on the package. Check out http://www.gromacs.org.
+ */
+#ifndef GMX_EWALD_PME_GPU_UTILS_CLH
+#define GMX_EWALD_PME_GPU_UTILS_CLH
+
+/*! \internal \file
+ * \brief This file defines the small PME OpenCL inline device functions.
+ * This closely mirrors pme-gpu-utils.h (which is used in CUDA and unit tests), except with no templates.
+ * Instead of templated parameters, this file expects the following defines during compilation:
+ * - order - PME interpolation order;
+ * - PME_SPREADGATHER_ATOMS_PER_WARP - number of atoms processed by a warp (fixed for spread and gather kernels to be the same);
+ * - c_usePadding and c_skipNeutralAtoms - same as in pme-gpu-constants.h.
+ *
+ * \author Aleksei Iupinov <a.yupinov@gmail.com>
+ * \ingroup module_ewald
+ */
+
+/*! \internal \brief
+ * Gets a base of the unique index to an element in a spline parameter buffer (theta/dtheta),
+ * which is laid out for GPU spread/gather kernels. The base only corresponds to the atom index within the execution block.
+ * Feed the result into getSplineParamIndex() to get a full index.
+ * TODO: it's likely that both parameters can be just replaced with a single atom index, as they are derived from it.
+ * Do that, verifying that the generated code is not bloated, and/or revise the spline indexing scheme.
+ * Removing warp dependency would also be nice (and would probably coincide with removing PME_SPREADGATHER_ATOMS_PER_WARP).
+ *
+ * \param[in] warpIndex Warp index wrt the block.
+ * \param[in] atomWarpIndex Atom index wrt the warp (from 0 to PME_SPREADGATHER_ATOMS_PER_WARP - 1).
+ *
+ * \returns Index into theta or dtheta array using GPU layout.
+ */
+inline int getSplineParamIndexBase(int warpIndex, int atomWarpIndex)
+{
+ assert((atomWarpIndex >= 0) && (atomWarpIndex < PME_SPREADGATHER_ATOMS_PER_WARP));
+ const int dimIndex = 0;
+ const int splineIndex = 0;
+ // The zeroes are here to preserve the full index formula for reference
+ return (((splineIndex + order * warpIndex) * DIM + dimIndex) * PME_SPREADGATHER_ATOMS_PER_WARP + atomWarpIndex);
+}
+
+/*! \internal \brief
+ * Gets a unique index to an element in a spline parameter buffer (theta/dtheta),
+ * which is laid out for GPU spread/gather kernels. The index is wrt the execution block,
+ * in range(0, atomsPerBlock * order * DIM).
+ * This function consumes result of getSplineParamIndexBase() and adjusts it for \p dimIndex and \p splineIndex.
+ *
+ * \param[in] paramIndexBase Must be result of getSplineParamIndexBase().
+ * \param[in] dimIndex Dimension index (from 0 to 2)
+ * \param[in] splineIndex Spline contribution index (from 0 to \p order - 1)
+ * (\c order and \c PME_SPREADGATHER_ATOMS_PER_WARP are compile-time defines here, not function parameters)
+ *
+ * \returns Index into theta or dtheta array using GPU layout.
+ */
+inline int getSplineParamIndex(int paramIndexBase, int dimIndex, int splineIndex)
+{
+ assert((dimIndex >= XX) && (dimIndex < DIM));
+ assert((splineIndex >= 0) && (splineIndex < order));
+ return (paramIndexBase + (splineIndex * DIM + dimIndex) * PME_SPREADGATHER_ATOMS_PER_WARP);
+}
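+
+/* For illustration only (a hypothetical example, not computed anywhere in this file):
+ * with order = 4 and PME_SPREADGATHER_ATOMS_PER_WARP = 2, the second atom of warp 0
+ * (warpIndex = 0, atomWarpIndex = 1) gets
+ *   getSplineParamIndexBase(0, 1) = ((0 + 4 * 0) * 3 + 0) * 2 + 1 = 1,
+ * and its Y-dimension spline contribution 2 is then at
+ *   getSplineParamIndex(1, YY, 2) = 1 + (2 * 3 + 1) * 2 = 15.
+ */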
+
+/*! \brief
+ * A function for checking the global atom data indices against the atom data array sizes.
+ *
+ * \param[in] atomDataIndex The atom data index.
+ * \param[in] nAtomData The atom data array element count.
+ * \returns Non-0 if index is within bounds (or PME data padding is enabled), 0 otherwise.
+ *
+ * This is called from the spline_and_spread and gather PME kernels.
+ * The goal is to isolate the global range checks, and allow avoiding them with c_usePadding being true.
+ */
+inline int pme_gpu_check_atom_data_index(const size_t atomDataIndex, const size_t nAtomData)
+{
+ return c_usePadding ? 1 : (atomDataIndex < nAtomData);
+}
+
+/*! \brief
+ * A function for optionally skipping neutral charges, depending on c_skipNeutralAtoms.
+ *
+ * \param[in] coefficient The atom charge/coefficient.
+ * \returns Non-0 if atom should be processed, 0 otherwise.
+ */
+inline int pme_gpu_check_atom_charge(const float coefficient)
+{
+ return c_skipNeutralAtoms ? (coefficient != 0.0f) : 1;
+}
+
+#endif
#define GMX_EWALD_PME_GPU_UTILS_H
/*! \internal \file
- * \brief This file defines the small PME GPU inline host/device functions.
+ * \brief This file defines small PME GPU inline host/device functions.
+ * Note that OpenCL device-side functions can't use C++ features, so they are
+ * located in a similar file pme-gpu-utils.clh.
+ * Be sure to keep the logic in sync in both files when changing it!
*
* \author Aleksei Iupinov <a.yupinov@gmail.com>
* \ingroup module_ewald
return (paramIndexBase + (splineIndex * DIM + dimIndex) * atomsPerWarp);
}
+#if GMX_GPU == GMX_GPU_CUDA
+// CUDA device code helpers below
+
+/*! \internal \brief
+ * An inline CUDA function for checking the global atom data indices against the atom data array sizes.
+ *
+ * \param[in] atomDataIndex The atom data index.
+ * \param[in] nAtomData The atom data array element count.
+ * \returns Non-0 if index is within bounds (or PME data padding is enabled), 0 otherwise.
+ *
+ * This is called from the spline_and_spread and gather PME kernels.
+ * The goal is to isolate the global range checks, and allow avoiding them with c_usePadding enabled.
+ */
+int __device__ __forceinline__ pme_gpu_check_atom_data_index(const int atomDataIndex, const int nAtomData)
+{
+ return c_usePadding ? 1 : (atomDataIndex < nAtomData);
+}
+
+/*! \internal \brief
+ * An inline CUDA function for skipping the zero-charge atoms.
+ *
+ * \returns Non-0 if atom should be processed, 0 otherwise.
+ * \param[in] coefficient The atom charge.
+ *
+ * This is called from the spline_and_spread and gather PME kernels.
+ */
+int __device__ __forceinline__ pme_gpu_check_atom_charge(const float coefficient)
+{
+ assert(isfinite(coefficient));
+ return c_skipNeutralAtoms ? (coefficient != 0.0f) : 1;
+}
+
+#endif
+
#endif
#include "gmxpre.h"
+#include "config.h"
+
#include <list>
#include "gromacs/ewald/ewald-utils.h"
wallcycle_start_nocount(wcycle, ewcWAIT_GPU_PME_GATHER);
- if (completionKind == GpuTaskCompletion::Check)
+ constexpr bool c_streamQuerySupported = (GMX_GPU == GMX_GPU_CUDA);
+ // TODO: implement c_streamQuerySupported with an additional GpuEventSynchronizer per stream (#2521)
+ if ((completionKind == GpuTaskCompletion::Check) && c_streamQuerySupported)
{
// Query the PME stream for completion of all tasks enqueued and
// if we're not done, stop the timer before early return.
--- /dev/null
+/*
+ * This file is part of the GROMACS molecular simulation package.
+ *
+ * Copyright (c) 2018, by the GROMACS development team, led by
+ * Mark Abraham, David van der Spoel, Berk Hess, and Erik Lindahl,
+ * and including many others, as listed in the AUTHORS file in the
+ * top-level source directory and at http://www.gromacs.org.
+ *
+ * GROMACS is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public License
+ * as published by the Free Software Foundation; either version 2.1
+ * of the License, or (at your option) any later version.
+ *
+ * GROMACS is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with GROMACS; if not, see
+ * http://www.gnu.org/licenses, or write to the Free Software Foundation,
+ * Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * If you want to redistribute modifications to GROMACS, please
+ * consider that scientific software is very special. Version
+ * control is crucial - bugs must be traceable. We will be happy to
+ * consider code for inclusion in the official distribution, but
+ * derived work must not be called official GROMACS. Details are found
+ * in the README & COPYING files - if they are missing, get the
+ * official version at http://www.gromacs.org.
+ *
+ * To help us fund GROMACS development, we humbly ask that you cite
+ * the research papers on the package. Check out http://www.gromacs.org.
+ */
+
+/*! \internal \file
+ * \brief Top-level file for generating PME OpenCL kernels.
+ * This file includes all the PME OpenCL kernel files multiple times, with additional defines.
+ * Plenty of common constants/macros are also expected to be already defined by the compiler
+ * (such as "order", which is the PME interpolation order).
+ * For details, please see how pme-program.cl is compiled in pme-gpu-program-impl-ocl.cpp.
+ *
+ * This file specifically expects the following work size definitions:
+ *
+ * - c_spreadWorkGroupSize is a spline/spread kernels work group size.
+ * - c_solveMaxWorkGroupSize is a solve kernels maximum work group size.
+ * - c_gatherWorkGroupSize is a gather kernels work group size.
+ *
+ * \author Aleksei Iupinov <a.yupinov@gmail.com>
+ */
+
+// Assert placeholders, so that the asserts need not be stripped out of the OpenCL implementation - hopefully they come in handy some day with OpenCL 2.0
+#define static_assert(a, b)
+#define assert(a)
+
+#define PmeOpenCLKernelParams PmeGpuKernelParamsBase
+
+/* SPREAD/SPLINE */
+
+#define atomsPerBlock (c_spreadWorkGroupSize / PME_SPREADGATHER_THREADS_PER_ATOM)
+
+// spline/spread fused
+#define computeSplines 1
+#define spreadCharges 1
+#define CUSTOMIZED_KERNEL_NAME(x) pmeSplineAndSpreadKernel
+#include "pme-spread.clh"
+#undef computeSplines
+#undef spreadCharges
+#undef CUSTOMIZED_KERNEL_NAME
+
+// spline
+#define computeSplines 1
+#define spreadCharges 0
+#define CUSTOMIZED_KERNEL_NAME(x) pmeSplineKernel
+#include "pme-spread.clh"
+#undef computeSplines
+#undef spreadCharges
+#undef CUSTOMIZED_KERNEL_NAME
+
+// spread
+#define computeSplines 0
+#define spreadCharges 1
+#define CUSTOMIZED_KERNEL_NAME(x) pmeSpreadKernel
+#include "pme-spread.clh"
+#undef computeSplines
+#undef spreadCharges
+#undef CUSTOMIZED_KERNEL_NAME
+
+
+#undef atomsPerBlock
+
+
+/* GATHER */
+
+#define atomsPerBlock (c_gatherWorkGroupSize / PME_SPREADGATHER_THREADS_PER_ATOM)
+
+// gather
+#define overwriteForces 1
+#define CUSTOMIZED_KERNEL_NAME(x) pmeGatherKernel
+#include "pme-gather.clh"
+#undef overwriteForces
+#undef CUSTOMIZED_KERNEL_NAME
+
+// gather with reduction
+#define overwriteForces 0
+#define CUSTOMIZED_KERNEL_NAME(x) pmeGatherReduceWithInputKernel
+#include "pme-gather.clh"
+#undef overwriteForces
+#undef CUSTOMIZED_KERNEL_NAME
+
+
+#undef atomsPerBlock
+
+/* SOLVE */
+
+// GridOrdering enum replacement
+#define YZX 1
+#define XYZ 2
+
+// solve, YZX dimension order
+#define gridOrdering YZX
+#define computeEnergyAndVirial 0
+#define CUSTOMIZED_KERNEL_NAME(x) pmeSolveYZXKernel
+#include "pme-solve.clh"
+#undef gridOrdering
+#undef computeEnergyAndVirial
+#undef CUSTOMIZED_KERNEL_NAME
+
+// solve with reduction, YZX dimension order
+#define gridOrdering YZX
+#define computeEnergyAndVirial 1
+#define CUSTOMIZED_KERNEL_NAME(x) pmeSolveYZXEnergyKernel
+#include "pme-solve.clh"
+#undef gridOrdering
+#undef computeEnergyAndVirial
+#undef CUSTOMIZED_KERNEL_NAME
+
+// solve, XYZ dimension order
+#define gridOrdering XYZ
+#define computeEnergyAndVirial 0
+#define CUSTOMIZED_KERNEL_NAME(x) pmeSolveXYZKernel
+#include "pme-solve.clh"
+#undef gridOrdering
+#undef computeEnergyAndVirial
+#undef CUSTOMIZED_KERNEL_NAME
+
+// solve with reduction, XYZ dimension order
+#define gridOrdering XYZ
+#define computeEnergyAndVirial 1
+#define CUSTOMIZED_KERNEL_NAME(x) pmeSolveXYZEnergyKernel
+#include "pme-solve.clh"
+#undef gridOrdering
+#undef computeEnergyAndVirial
+#undef CUSTOMIZED_KERNEL_NAME
--- /dev/null
+/*
+ * This file is part of the GROMACS molecular simulation package.
+ *
+ * Copyright (c) 2018, by the GROMACS development team, led by
+ * Mark Abraham, David van der Spoel, Berk Hess, and Erik Lindahl,
+ * and including many others, as listed in the AUTHORS file in the
+ * top-level source directory and at http://www.gromacs.org.
+ *
+ * GROMACS is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public License
+ * as published by the Free Software Foundation; either version 2.1
+ * of the License, or (at your option) any later version.
+ *
+ * GROMACS is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with GROMACS; if not, see
+ * http://www.gnu.org/licenses, or write to the Free Software Foundation,
+ * Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * If you want to redistribute modifications to GROMACS, please
+ * consider that scientific software is very special. Version
+ * control is crucial - bugs must be traceable. We will be happy to
+ * consider code for inclusion in the official distribution, but
+ * derived work must not be called official GROMACS. Details are found
+ * in the README & COPYING files - if they are missing, get the
+ * official version at http://www.gromacs.org.
+ *
+ * To help us fund GROMACS development, we humbly ask that you cite
+ * the research papers on the package. Check out http://www.gromacs.org.
+ */
+
+/*! \internal \file
+ * \brief Implements PME OpenCL solving kernel.
+ * When including this and other PME OpenCL kernel files, plenty of common
+ * constants/macros are expected to be defined.
+ * For details, please see how pme-program.cl is compiled in pme-gpu-program-impl-ocl.cpp.
+ *
+ * This file's solving kernel specifically expects the following definitions for its flavors:
+ *
+ * - gridOrdering must be defined to either XYZ or YZX
+ * and corresponds to the dimension order of the grid (GridOrdering enum in CUDA kernels);
+ * - computeEnergyAndVirial must evaluate to true or false, and expresses
+ * whether the reduction is performed.
+ *
+ * \author Aleksei Iupinov <a.yupinov@gmail.com>
+ */
+
+#include "pme-gpu-types.h"
+#include "gromacs/gpu_utils/vectype_ops.clh"
+
+/*! \brief
+ * PME complex grid solver kernel function.
+ * Please see the file description for additional defines which this kernel expects.
+ *
+ * \param[in] kernelParams Input PME GPU data in constant memory.
+ * \param[in] gm_splineModuli B-Spline moduli.
+ * \param[out] gm_virialAndEnergy Reduced virial and energy (only with computeEnergyAndVirial == true)
+ * \param[in,out] gm_grid Fourier grid to transform.
+ */
+__attribute__((work_group_size_hint(c_solveMaxWorkGroupSize, 1, 1)))
+__kernel void CUSTOMIZED_KERNEL_NAME(pme_solve_kernel)(const struct PmeOpenCLKernelParams kernelParams,
+ __global const float * __restrict__ gm_splineModuli,
+ __global float * __restrict__ gm_virialAndEnergy,
+ __global float2 * __restrict__ gm_grid)
+{
+ /* This kernel supports 2 different grid dimension orderings: YZX and XYZ */
+ int majorDim, middleDim, minorDim;
+ if (gridOrdering == YZX)
+ {
+ majorDim = YY;
+ middleDim = ZZ;
+ minorDim = XX;
+ }
+ if (gridOrdering == XYZ)
+ {
+ majorDim = XX;
+ middleDim = YY;
+ minorDim = ZZ;
+ }
+
+ __global const float * __restrict__ gm_splineValueMajor = gm_splineModuli + kernelParams.grid.splineValuesOffset[majorDim];
+ __global const float * __restrict__ gm_splineValueMiddle = gm_splineModuli + kernelParams.grid.splineValuesOffset[middleDim];
+ __global const float * __restrict__ gm_splineValueMinor = gm_splineModuli + kernelParams.grid.splineValuesOffset[minorDim];
+
+ /* Various grid sizes and indices */
+ const int localOffsetMinor = 0, localOffsetMajor = 0, localOffsetMiddle = 0; //unused
+ const int localSizeMinor = kernelParams.grid.complexGridSizePadded[minorDim];
+ const int localSizeMiddle = kernelParams.grid.complexGridSizePadded[middleDim];
+ const int localCountMiddle = kernelParams.grid.complexGridSize[middleDim];
+ const int localCountMinor = kernelParams.grid.complexGridSize[minorDim];
+ const int nMajor = kernelParams.grid.realGridSize[majorDim];
+ const int nMiddle = kernelParams.grid.realGridSize[middleDim];
+ const int nMinor = kernelParams.grid.realGridSize[minorDim];
+ const int maxkMajor = (nMajor + 1) / 2; // X or Y
+ const int maxkMiddle = (nMiddle + 1) / 2; // Y OR Z => only check for !YZX
+ const int maxkMinor = (nMinor + 1) / 2; // Z or X => only check for YZX
+
+ /* Each thread works on one cell of the Fourier space complex 3D grid (gm_grid).
+ * Each block handles up to c_solveMaxWorkGroupSize cells;
+ * depending on the size of the grid's contiguous dimension,
+ * that can range from a part of a single gridline to several complete gridlines.
+ */
+ const int threadLocalId = get_local_id(XX);
+ const int gridLineSize = localCountMinor;
+ const int gridLineIndex = threadLocalId / gridLineSize;
+ const int gridLineCellIndex = threadLocalId - gridLineSize * gridLineIndex;
+ const int gridLinesPerBlock = get_local_size(XX) / gridLineSize;
+ const int activeWarps = (get_local_size(XX) / warp_size);
+ const int indexMinor = get_group_id(XX) * get_local_size(XX) + gridLineCellIndex;
+ const int indexMiddle = get_group_id(YY) * gridLinesPerBlock + gridLineIndex;
+ const int indexMajor = get_group_id(ZZ);
+
+ /* Optional outputs */
+ float energy = 0.0f;
+ float virxx = 0.0f;
+ float virxy = 0.0f;
+ float virxz = 0.0f;
+ float viryy = 0.0f;
+ float viryz = 0.0f;
+ float virzz = 0.0f;
+
+ assert(indexMajor < kernelParams.grid.complexGridSize[majorDim]);
+ if ((indexMiddle < localCountMiddle) & (indexMinor < localCountMinor) & (gridLineIndex < gridLinesPerBlock))
+ {
+ /* The offset should be equal to the global thread index for coalesced access */
+ const int gridIndex = (indexMajor * localSizeMiddle + indexMiddle) * localSizeMinor + indexMinor;
+ __global float2 * __restrict__ gm_gridCell = gm_grid + gridIndex;
+
+ const int kMajor = indexMajor + localOffsetMajor;
+ /* Checking either X in XYZ, or Y in YZX cases */
+ const float mMajor = (kMajor < maxkMajor) ? kMajor : (kMajor - nMajor);
+
+ const int kMiddle = indexMiddle + localOffsetMiddle;
+ float mMiddle = kMiddle;
+ /* Checking Y in XYZ case */
+ if (gridOrdering == XYZ)
+ {
+ mMiddle = (kMiddle < maxkMiddle) ? kMiddle : (kMiddle - nMiddle);
+ }
+ const int kMinor = localOffsetMinor + indexMinor;
+ float mMinor = kMinor;
+ /* Checking X in YZX case */
+ if (gridOrdering == YZX)
+ {
+ mMinor = (kMinor < maxkMinor) ? kMinor : (kMinor - nMinor);
+ }
+ /* We should skip the k-space point (0,0,0) */
+ const bool notZeroPoint = (kMinor > 0) | (kMajor > 0) | (kMiddle > 0);
+
+ float mX, mY, mZ;
+ if (gridOrdering == YZX)
+ {
+ mX = mMinor;
+ mY = mMajor;
+ mZ = mMiddle;
+ }
+ if (gridOrdering == XYZ)
+ {
+ mX = mMajor;
+ mY = mMiddle;
+ mZ = mMinor;
+ }
+
+ /* 0.5 correction factor for the first and last components of a Z dimension */
+ float corner_fac = 1.0f;
+ if (gridOrdering == YZX)
+ {
+ if ((kMiddle == 0) | (kMiddle == maxkMiddle))
+ {
+ corner_fac = 0.5f;
+ }
+ }
+ if (gridOrdering == XYZ)
+ {
+ if ((kMinor == 0) | (kMinor == maxkMinor))
+ {
+ corner_fac = 0.5f;
+ }
+ }
+
+ if (notZeroPoint)
+ {
+ const float mhxk = mX * kernelParams.current.recipBox[XX][XX];
+ const float mhyk = mX * kernelParams.current.recipBox[XX][YY] + mY * kernelParams.current.recipBox[YY][YY];
+ const float mhzk = mX * kernelParams.current.recipBox[XX][ZZ] + mY * kernelParams.current.recipBox[YY][ZZ] + mZ * kernelParams.current.recipBox[ZZ][ZZ];
+
+ const float m2k = mhxk * mhxk + mhyk * mhyk + mhzk * mhzk;
+ assert(m2k != 0.0f);
+ const float denom = m2k * M_PI_F * kernelParams.current.boxVolume * gm_splineValueMajor[kMajor] * gm_splineValueMiddle[kMiddle] * gm_splineValueMinor[kMinor];
+ assert(isfinite(denom));
+ assert(denom != 0.0f);
+ const float tmp1 = exp(-kernelParams.grid.ewaldFactor * m2k);
+ const float etermk = kernelParams.constants.elFactor * tmp1 / denom;
+
+ float2 gridValue = *gm_gridCell;
+ const float2 oldGridValue = gridValue;
+
+ gridValue.x *= etermk;
+ gridValue.y *= etermk;
+ *gm_gridCell = gridValue;
+
+ if (computeEnergyAndVirial)
+ {
+ const float tmp1k = 2.0f * (gridValue.x * oldGridValue.x + gridValue.y * oldGridValue.y);
+
+ const float vfactor = (kernelParams.grid.ewaldFactor + 1.0f / m2k) * 2.0f;
+ const float ets2 = corner_fac * tmp1k;
+ energy = ets2;
+
+ const float ets2vf = ets2 * vfactor;
+
+ virxx = ets2vf * mhxk * mhxk - ets2;
+ virxy = ets2vf * mhxk * mhyk;
+ virxz = ets2vf * mhxk * mhzk;
+ viryy = ets2vf * mhyk * mhyk - ets2;
+ viryz = ets2vf * mhyk * mhzk;
+ virzz = ets2vf * mhzk * mhzk - ets2;
+ }
+ }
+ }
+
+ // This is used only for the reduction below; in OpenCL 1.2 all local memory must be declared at kernel scope.
+ __local float sm_virialAndEnergy[c_virialAndEnergyCount * warp_size];
+
+ /* Optional energy/virial reduction */
+ if (computeEnergyAndVirial)
+ {
+ // TODO: implement an AMD intrinsics reduction, as is done with shuffles in the CUDA version. #2514
+
+ /* Local memory reduction with atomics.
+ * Each component is first reduced into warp_size positions in local memory;
+ * then the first few warps reduce everything further and add the result to global memory.
+ * This can likely be improved, but is already faster than the previous straightforward reduction,
+ * which used too much local memory (storing all 7 floats for each thread).
+ */
+
+ const int lane = threadLocalId & (warp_size - 1);
+ const int warpIndex = threadLocalId / warp_size;
+ const bool firstWarp = (warpIndex == 0);
+ if (firstWarp)
+ {
+ sm_virialAndEnergy[0 * warp_size + lane] = virxx;
+ sm_virialAndEnergy[1 * warp_size + lane] = viryy;
+ sm_virialAndEnergy[2 * warp_size + lane] = virzz;
+ sm_virialAndEnergy[3 * warp_size + lane] = virxy;
+ sm_virialAndEnergy[4 * warp_size + lane] = virxz;
+ sm_virialAndEnergy[5 * warp_size + lane] = viryz;
+ sm_virialAndEnergy[6 * warp_size + lane] = energy;
+ }
+ barrier(CLK_LOCAL_MEM_FENCE);
+ if (!firstWarp)
+ {
+ atomicAdd_l_f(sm_virialAndEnergy + 0 * warp_size + lane, virxx);
+ atomicAdd_l_f(sm_virialAndEnergy + 1 * warp_size + lane, viryy);
+ atomicAdd_l_f(sm_virialAndEnergy + 2 * warp_size + lane, virzz);
+ atomicAdd_l_f(sm_virialAndEnergy + 3 * warp_size + lane, virxy);
+ atomicAdd_l_f(sm_virialAndEnergy + 4 * warp_size + lane, virxz);
+ atomicAdd_l_f(sm_virialAndEnergy + 5 * warp_size + lane, viryz);
+ atomicAdd_l_f(sm_virialAndEnergy + 6 * warp_size + lane, energy);
+ }
+ barrier(CLK_LOCAL_MEM_FENCE);
+
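+ /* Each of the first few warps tree-reduces one (or more) of the c_virialAndEnergyCount components
+ * from warp_size partial sums down to a single value in lane 0, which is then atomically added
+ * to the global output buffer. */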
+ const int numIter = (c_virialAndEnergyCount + activeWarps - 1) / activeWarps;
+ for (int i = 0; i < numIter; i++)
+ {
+ const int componentIndex = i * activeWarps + warpIndex;
+ if (componentIndex < c_virialAndEnergyCount)
+ {
+ const int targetIndex = componentIndex * warp_size + lane;
+ #pragma unroll
+ for (int reductionStride = warp_size >> 1; reductionStride >= 1; reductionStride >>= 1)
+ {
+ if (lane < reductionStride)
+ {
+ sm_virialAndEnergy[targetIndex] += sm_virialAndEnergy[targetIndex + reductionStride];
+ }
+#ifdef _NVIDIA_SOURCE_
+ /* FIXME: this reduction operates within the execution width (warp), yet the NVIDIA OpenCL
+ * implementation still fails without this memory barrier. #2519
+ */
+ barrier(CLK_LOCAL_MEM_FENCE);
+#endif
+ }
+ if (lane == 0)
+ {
+ atomicAdd_g_f(gm_virialAndEnergy + componentIndex, sm_virialAndEnergy[targetIndex]);
+ }
+ }
+ }
+ }
+}
--- /dev/null
+/*
+ * This file is part of the GROMACS molecular simulation package.
+ *
+ * Copyright (c) 2018, by the GROMACS development team, led by
+ * Mark Abraham, David van der Spoel, Berk Hess, and Erik Lindahl,
+ * and including many others, as listed in the AUTHORS file in the
+ * top-level source directory and at http://www.gromacs.org.
+ *
+ * GROMACS is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public License
+ * as published by the Free Software Foundation; either version 2.1
+ * of the License, or (at your option) any later version.
+ *
+ * GROMACS is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with GROMACS; if not, see
+ * http://www.gnu.org/licenses, or write to the Free Software Foundation,
+ * Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ * If you want to redistribute modifications to GROMACS, please
+ * consider that scientific software is very special. Version
+ * control is crucial - bugs must be traceable. We will be happy to
+ * consider code for inclusion in the official distribution, but
+ * derived work must not be called official GROMACS. Details are found
+ * in the README & COPYING files - if they are missing, get the
+ * official version at http://www.gromacs.org.
+ *
+ * To help us fund GROMACS development, we humbly ask that you cite
+ * the research papers on the package. Check out http://www.gromacs.org.
+ */
+
+/*! \internal \file
+ * \brief Implements PME OpenCL spline parameter computation and charge spread kernels.
+ * When including this and other PME OpenCL kernel files, several common
+ * constants/macros are expected to be defined (such as "order", the PME interpolation order).
+ * For details, please see how pme-program.cl is compiled in pme-gpu-program-impl-ocl.cpp.
+ *
+ * This file's kernels specifically expect the following definitions:
+ *
+ * - atomsPerBlock, which expresses how many atoms are processed by a single work group
+ * - order, which is the PME interpolation order
+ * - computeSplines and spreadCharges, which must evaluate to either true or false to specify
+ *   which kernel flavor is being compiled
+ * - wrapX and wrapY must evaluate to either true or false to specify whether the grid overlap
+ * in dimension X/Y is to be used
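+ *
+ * A hypothetical compilation of the spline-and-spread flavor might, for example, effectively define
+ * order=4, computeSplines=true, spreadCharges=true, wrapX=true, wrapY=true and a device-dependent
+ * atomsPerBlock; the actual definitions are assembled when pme-program.cl is compiled
+ * (see pme-gpu-program-impl-ocl.cpp).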
+ *
+ * \author Aleksei Iupinov <a.yupinov@gmail.com>
+ */
+
+#include "pme-gpu-types.h"
+#include "pme-gpu-utils.clh"
+#include "gromacs/gpu_utils/vectype_ops.clh"
+
+/*
+ * This define affects the spline calculation behaviour in the kernel.
+ * 0: a single GPU thread handles a single dimension of a single particle (calculating and storing (order) spline values and derivatives).
+ * 1: (order) threads do redundant work on this same task, each one stores only a single theta and single dtheta into global arrays.
+ * The only efficiency difference is fewer global store operations, countered by more redundant spline computation.
+ *
+ * TODO: estimate if this should be a boolean parameter (and add it to the unit test if so).
+ */
+#define PME_GPU_PARALLEL_SPLINE 0
+
+#ifndef COMPILE_SPREAD_HELPERS_ONCE
+#define COMPILE_SPREAD_HELPERS_ONCE
+
+/*! \brief
+ * General purpose function for loading atom-related data from global to shared memory.
+ *
+ * \param[in] kernelParams Input PME GPU data in constant memory.
+ * \param[out] sm_destination Local memory array for output.
+ * \param[in] gm_source Global memory array for input.
+ * \param[in] dataCountPerAtom Number of data elements per single atom (e.g. DIM for an rvec coordinates array).
+ *
+ */
+inline void pme_gpu_stage_atom_data(const struct PmeOpenCLKernelParams kernelParams,
+ __local float * __restrict__ sm_destination,
+ __global const float * __restrict__ gm_source,
+ const int dataCountPerAtom)
+{
+ const size_t threadLocalIndex = ((get_local_id(2) * get_local_size(1) + get_local_id(1)) * get_local_size(0) + get_local_id(0));
+ const size_t localIndex = threadLocalIndex;
+ const size_t globalIndexBase = get_group_id(XX) * atomsPerBlock * dataCountPerAtom;
+ const size_t globalIndex = globalIndexBase + localIndex;
+ const int globalCheck = pme_gpu_check_atom_data_index(globalIndex, kernelParams.atoms.nAtoms * dataCountPerAtom);
+ if ((localIndex < atomsPerBlock * dataCountPerAtom) & globalCheck)
+ {
+ assert(isfinite(float(gm_source[globalIndex])));
+ sm_destination[localIndex] = gm_source[globalIndex];
+ }
+}
+
+/*! \brief
+ * PME GPU spline parameter and gridline indices calculation.
+ * This corresponds to the CPU functions calc_interpolation_idx() and make_bsplines().
+ * First stage of the whole kernel.
+ *
+ * \param[in] kernelParams Input PME GPU data in constant memory.
+ * \param[in] atomIndexOffset Starting atom index for the execution block w.r.t. global memory.
+ * \param[in] sm_coordinates Atom coordinates in the shared memory (rvec)
+ * \param[in] sm_coefficients Atom charges/coefficients in the shared memory.
+ * \param[out] sm_theta Atom spline values in the shared memory.
+ * \param[out] sm_gridlineIndices Atom gridline indices in the shared memory.
+ * \param[out] sm_fractCoords Atom fractional coordinates (rvec)
+ * \param[out] gm_theta Atom spline parameter values
+ * \param[out] gm_dtheta Atom spline parameter derivatives
+ * \param[out] gm_gridlineIndices Same as \p sm_gridlineIndices but in global memory.
+ * \param[in] gm_fractShiftsTable Atom fractional coordinates correction table
+ * \param[in]  gm_gridlineIndicesTable  Gridline indices lookup table
+ */
+inline void calculate_splines(const struct PmeOpenCLKernelParams kernelParams,
+ const int atomIndexOffset,
+ __local const float * __restrict__ sm_coordinates,
+ __local const float * __restrict__ sm_coefficients,
+ __local float * __restrict__ sm_theta,
+ __local int * __restrict__ sm_gridlineIndices,
+ __local float * __restrict__ sm_fractCoords,
+ __global float * __restrict__ gm_theta,
+ __global float * __restrict__ gm_dtheta,
+ __global int * __restrict__ gm_gridlineIndices,
+ __global const float * __restrict__ gm_fractShiftsTable,
+ __global const int * __restrict__ gm_gridlineIndicesTable)
+{
+ /* Thread index w.r.t. block */
+ const int threadLocalIndex = ((get_local_id(2) * get_local_size(1) + get_local_id(1)) * get_local_size(0) + get_local_id(0));
+ /* Warp index w.r.t. block - could probably be obtained more easily? */
+ const int warpIndex = threadLocalIndex / warp_size;
+ /* Thread index w.r.t. warp */
+ const int threadWarpIndex = threadLocalIndex % warp_size;
+ /* Atom index w.r.t. warp - alternating 0 1 0 1 .. */
+ const int atomWarpIndex = threadWarpIndex % PME_SPREADGATHER_ATOMS_PER_WARP;
+ /* Atom index w.r.t. block/shared memory */
+ const int atomIndexLocal = warpIndex * PME_SPREADGATHER_ATOMS_PER_WARP + atomWarpIndex;
+
+ /* Atom index w.r.t. global memory */
+ const int atomIndexGlobal = atomIndexOffset + atomIndexLocal;
+ /* Spline contribution index in one dimension */
+ const int orderIndex = threadWarpIndex / (PME_SPREADGATHER_ATOMS_PER_WARP * DIM);
+ /* Dimension index */
+ const int dimIndex = (threadWarpIndex - orderIndex * (PME_SPREADGATHER_ATOMS_PER_WARP * DIM)) / PME_SPREADGATHER_ATOMS_PER_WARP;
+
+ /* Multi-purpose index of rvec/ivec atom data */
+ const int sharedMemoryIndex = atomIndexLocal * DIM + dimIndex;
+
+ /* Spline parameters need a working buffer.
+ * With PME_GPU_PARALLEL_SPLINE == 0 it is just a local array of (order) values for some of the threads, which is fine;
+ * With PME_GPU_PARALLEL_SPLINE == 1 (order) times more threads are involved, so the shared memory is used to avoid overhead.
+ * The buffer's size, striding and indexing are adjusted accordingly.
+ * The buffer is accessed with SPLINE_DATA_PTR and SPLINE_DATA macros.
+ */
+#if PME_GPU_PARALLEL_SPLINE
+ #define splineDataStride (atomsPerBlock * DIM)
+ const int splineDataIndex = sharedMemoryIndex;
+ __local float sm_splineData[splineDataStride * order];
+ __local float *splineDataPtr = sm_splineData;
+#else
+ #define splineDataStride 1
+ const int splineDataIndex = 0;
+ float splineData[splineDataStride * order];
+ float *splineDataPtr = splineData;
+#endif
+
+#define SPLINE_DATA_PTR(i) (splineDataPtr + ((i) * splineDataStride + splineDataIndex))
+#define SPLINE_DATA(i) (*SPLINE_DATA_PTR(i))
+
+ const int localCheck = (dimIndex < DIM) && (orderIndex < (PME_GPU_PARALLEL_SPLINE ? order : 1));
+ const int globalCheck = pme_gpu_check_atom_data_index(atomIndexGlobal, kernelParams.atoms.nAtoms);
+
+ if (localCheck && globalCheck)
+ {
+ /* Indices interpolation */
+ if (orderIndex == 0)
+ {
+ int tableIndex, tInt;
+ float n, t;
+ const float3 x = vload3(atomIndexLocal, sm_coordinates);
+
+ /* Accessing fields in fshOffset/nXYZ/recipbox/... with dimIndex offset
+ * puts them into local memory(!) instead of accessing the constant memory directly.
+ * That's the reason for the switch, to unroll explicitly.
+ * The commented parts correspond to the 0 components of the recipbox.
+ */
+ switch (dimIndex)
+ {
+ case XX:
+ tableIndex = kernelParams.grid.tablesOffsets[XX];
+ n = kernelParams.grid.realGridSizeFP[XX];
+ t = x.x * kernelParams.current.recipBox[dimIndex][XX] + x.y * kernelParams.current.recipBox[dimIndex][YY] + x.z * kernelParams.current.recipBox[dimIndex][ZZ];
+ break;
+
+ case YY:
+ tableIndex = kernelParams.grid.tablesOffsets[YY];
+ n = kernelParams.grid.realGridSizeFP[YY];
+ t = /*x.x * kernelParams.current.recipbox[dimIndex][XX] + */ x.y * kernelParams.current.recipBox[dimIndex][YY] + x.z * kernelParams.current.recipBox[dimIndex][ZZ];
+ break;
+
+ case ZZ:
+ tableIndex = kernelParams.grid.tablesOffsets[ZZ];
+ n = kernelParams.grid.realGridSizeFP[ZZ];
+ t = /*x.x * kernelParams.current.recipbox[dimIndex][XX] + x.y * kernelParams.current.recipbox[dimIndex][YY] + */ x.z * kernelParams.current.recipBox[dimIndex][ZZ];
+ break;
+ }
+ const float shift = c_pmeMaxUnitcellShift;
+
+ /* Fractional coordinates along box vectors, adding a positive shift to ensure t is positive for triclinic boxes */
+ t = (t + shift) * n;
+ tInt = (int)t;
+ sm_fractCoords[sharedMemoryIndex] = t - tInt;
+ tableIndex += tInt;
+ assert(tInt >= 0);
+ assert(tInt < c_pmeNeighborUnitcellCount * n);
+ sm_fractCoords[sharedMemoryIndex] += gm_fractShiftsTable[tableIndex];
+ sm_gridlineIndices[sharedMemoryIndex] = gm_gridlineIndicesTable[tableIndex];
+ gm_gridlineIndices[atomIndexOffset * DIM + sharedMemoryIndex] = sm_gridlineIndices[sharedMemoryIndex];
+ }
+
+ /* B-spline calculation */
+
+ const int chargeCheck = pme_gpu_check_atom_charge(sm_coefficients[atomIndexLocal]);
+ if (chargeCheck)
+ {
+ float div;
+ int o = orderIndex; // This is an index that is set once for PME_GPU_PARALLEL_SPLINE == 1
+
+ const float dr = sm_fractCoords[sharedMemoryIndex];
+ assert(isfinite(dr));
+
+ /* dr is relative offset from lower cell limit */
+ *SPLINE_DATA_PTR(order - 1) = 0.0f;
+ *SPLINE_DATA_PTR(1) = dr;
+ *SPLINE_DATA_PTR(0) = 1.0f - dr;
+
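+ /* Raise the B-spline order iteratively from 2 up to (order - 1); the final order-raising
+ * step is done below, interleaved with the derivative (dtheta) computation. */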
+#pragma unroll order
+ for (int k = 3; k < order; k++)
+ {
+ div = 1.0f / (k - 1.0f);
+ *SPLINE_DATA_PTR(k - 1) = div * dr * SPLINE_DATA(k - 2);
+#pragma unroll
+ for (int l = 1; l < (k - 1); l++)
+ {
+ *SPLINE_DATA_PTR(k - l - 1) = div * ((dr + l) * SPLINE_DATA(k - l - 2) + (k - l - dr) * SPLINE_DATA(k - l - 1));
+ }
+ *SPLINE_DATA_PTR(0) = div * (1.0f - dr) * SPLINE_DATA(0);
+ }
+
+ const int thetaIndexBase = getSplineParamIndexBase(warpIndex, atomWarpIndex);
+ const int thetaGlobalOffsetBase = atomIndexOffset * DIM * order;
+
+ /* Differentiation and storing the spline derivatives (dtheta) */
+#if !PME_GPU_PARALLEL_SPLINE
+ // With PME_GPU_PARALLEL_SPLINE == 1, o is already set to orderIndex;
+ // With PME_GPU_PARALLEL_SPLINE == 0, we loop o over range(order).
+#pragma unroll order
+ for (o = 0; o < order; o++)
+#endif
+ {
+ const int thetaIndex = getSplineParamIndex(thetaIndexBase, dimIndex, o);
+ const int thetaGlobalIndex = thetaGlobalOffsetBase + thetaIndex;
+
+ const float dtheta = ((o > 0) ? SPLINE_DATA(o - 1) : 0.0f) - SPLINE_DATA(o);
+ assert(isfinite(dtheta));
+ gm_dtheta[thetaGlobalIndex] = dtheta;
+ }
+
+ div = 1.0f / (order - 1.0f);
+ *SPLINE_DATA_PTR(order - 1) = div * dr * SPLINE_DATA(order - 2);
+#pragma unroll
+ for (int k = 1; k < (order - 1); k++)
+ {
+ *SPLINE_DATA_PTR(order - k - 1) = div * ((dr + k) * SPLINE_DATA(order - k - 2) + (order - k - dr) * SPLINE_DATA(order - k - 1));
+ }
+ *SPLINE_DATA_PTR(0) = div * (1.0f - dr) * SPLINE_DATA(0);
+
+ /* Storing the spline values (theta) */
+#if !PME_GPU_PARALLEL_SPLINE
+ // See comment for the loop above
+#pragma unroll order
+ for (o = 0; o < order; o++)
+#endif
+ {
+ const int thetaIndex = getSplineParamIndex(thetaIndexBase, dimIndex, o);
+ const int thetaGlobalIndex = thetaGlobalOffsetBase + thetaIndex;
+
+ sm_theta[thetaIndex] = SPLINE_DATA(o);
+ assert(isfinite(sm_theta[thetaIndex]));
+ gm_theta[thetaGlobalIndex] = SPLINE_DATA(o);
+ }
+ }
+ }
+}
+
+/*! \brief
+ * Charge spreading onto the grid.
+ * This corresponds to the CPU function spread_coefficients_bsplines_thread().
+ * Second stage of the whole kernel.
+ *
+ * \param[in] kernelParams Input PME GPU data in constant memory.
+ * \param[in] atomIndexOffset Starting atom index for the execution block w.r.t. global memory.
+ * \param[in]  sm_coefficients     Atom coefficients/charges in the shared memory.
+ * \param[in] sm_gridlineIndices Atom gridline indices in the shared memory.
+ * \param[in] sm_theta Atom spline values in the shared memory.
+ * \param[out] gm_grid Global 3D grid for spreading.
+ */
+inline void spread_charges(const struct PmeOpenCLKernelParams kernelParams,
+ int atomIndexOffset,
+ __local const float * __restrict__ sm_coefficients,
+ __local const int * __restrict__ sm_gridlineIndices,
+ __local const float * __restrict__ sm_theta,
+ __global float * __restrict__ gm_grid)
+{
+ const int nx = kernelParams.grid.realGridSize[XX];
+ const int ny = kernelParams.grid.realGridSize[YY];
+ const int nz = kernelParams.grid.realGridSize[ZZ];
+ const int pny = kernelParams.grid.realGridSizePadded[YY];
+ const int pnz = kernelParams.grid.realGridSizePadded[ZZ];
+
+ const int offx = 0, offy = 0, offz = 0; // unused for now
+
+ const int atomIndexLocal = get_local_id(ZZ);
+ const int atomIndexGlobal = atomIndexOffset + atomIndexLocal;
+
+ const int globalCheck = pme_gpu_check_atom_data_index(atomIndexGlobal, kernelParams.atoms.nAtoms);
+ const int chargeCheck = pme_gpu_check_atom_charge(sm_coefficients[atomIndexLocal]);
+ if (chargeCheck & globalCheck)
+ {
+ // Spline Y/Z coordinates
+ const int ithy = get_local_id(YY);
+ const int ithz = get_local_id(XX);
+ const int ixBase = sm_gridlineIndices[atomIndexLocal * DIM + XX] - offx;
+ int iy = sm_gridlineIndices[atomIndexLocal * DIM + YY] - offy + ithy;
+ if (wrapY & (iy >= ny))
+ {
+ iy -= ny;
+ }
+ int iz = sm_gridlineIndices[atomIndexLocal * DIM + ZZ] - offz + ithz;
+ if (iz >= nz)
+ {
+ iz -= nz;
+ }
+
+ /* Atom index w.r.t. warp - alternating 0 1 0 1 .. */
+ const int atomWarpIndex = atomIndexLocal % PME_SPREADGATHER_ATOMS_PER_WARP;
+ /* Warp index w.r.t. block - could probably be obtained more easily? */
+ const int warpIndex = atomIndexLocal / PME_SPREADGATHER_ATOMS_PER_WARP;
+
+ const int splineIndexBase = getSplineParamIndexBase(warpIndex, atomWarpIndex);
+ const int splineIndexZ = getSplineParamIndex(splineIndexBase, ZZ, ithz);
+ const float thetaZ = sm_theta[splineIndexZ];
+ const int splineIndexY = getSplineParamIndex(splineIndexBase, YY, ithy);
+ const float thetaY = sm_theta[splineIndexY];
+ const float constVal = thetaZ * thetaY * sm_coefficients[atomIndexLocal];
+ assert(isfinite(constVal));
+ const int constOffset = iy * pnz + iz;
+
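+ /* Each work-item handles one (ithy, ithz) spline point of its atom and loops over the
+ * order points along X, adding one contribution to the grid per iteration. */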
+#pragma unroll order
+ for (int ithx = 0; (ithx < order); ithx++)
+ {
+ int ix = ixBase + ithx;
+ if (wrapX & (ix >= nx))
+ {
+ ix -= nx;
+ }
+ const int gridIndexGlobal = ix * pny * pnz + constOffset;
+ const int splineIndexX = getSplineParamIndex(splineIndexBase, XX, ithx);
+ const float thetaX = sm_theta[splineIndexX];
+ assert(isfinite(thetaX));
+ assert(isfinite(gm_grid[gridIndexGlobal]));
+ atomicAdd_g_f(gm_grid + gridIndexGlobal, thetaX * constVal);
+ }
+ }
+}
+
+#endif //COMPILE_SPREAD_HELPERS_ONCE
+
+/*! \brief
+ * A spline computation and/or charge spreading kernel function.
+ * Please see the file description for additional defines which this kernel expects.
+ *
+ * \param[in] kernelParams Input PME GPU data in constant memory.
+ * \param[in,out] gm_theta Atom spline parameter values
+ * \param[out] gm_dtheta Atom spline parameter derivatives
+ * \param[in,out] gm_gridlineIndices Atom gridline indices (ivec)
+ * \param[out] gm_grid Global 3D grid for charge spreading.
+ * \param[in] gm_fractShiftsTable Atom fractional coordinates correction table
+ * \param[in]     gm_gridlineIndicesTable  Gridline indices lookup table
+ * \param[in] gm_coefficients Atom charges/coefficients.
+ * \param[in] gm_coordinates Atom coordinates (rvec)
+ */
+__attribute__((reqd_work_group_size(order, order, atomsPerBlock)))
+__kernel void CUSTOMIZED_KERNEL_NAME(pme_spline_and_spread_kernel)(const struct PmeOpenCLKernelParams kernelParams,
+ __global float * __restrict__ gm_theta,
+ __global float * __restrict__ gm_dtheta,
+ __global int * __restrict__ gm_gridlineIndices,
+ __global float * __restrict__ gm_grid,
+ __global const float * __restrict__ gm_fractShiftsTable,
+ __global const int * __restrict__ gm_gridlineIndicesTable,
+ __global const float * __restrict__ gm_coefficients,
+ __global const float * __restrict__ gm_coordinates)
+{
+ // Gridline indices, ivec
+ __local int sm_gridlineIndices[atomsPerBlock * DIM];
+ // Charges
+ __local float sm_coefficients[atomsPerBlock];
+ // Spline values
+ __local float sm_theta[atomsPerBlock * DIM * order];
+ // Fractional coordinates - only for spline computation
+ __local float sm_fractCoords[atomsPerBlock * DIM];
+ // Staging coordinates - only for spline computation
+ __local float sm_coordinates[DIM * atomsPerBlock];
+
+ const int atomIndexOffset = get_group_id(XX) * atomsPerBlock;
+
+ /* Staging coefficients/charges for both spline and spread */
+ pme_gpu_stage_atom_data(kernelParams, sm_coefficients, gm_coefficients, 1);
+
+ if (computeSplines)
+ {
+ /* Staging coordinates */
+ pme_gpu_stage_atom_data(kernelParams, sm_coordinates, gm_coordinates, DIM);
+
+ barrier(CLK_LOCAL_MEM_FENCE);
+ calculate_splines(kernelParams, atomIndexOffset, sm_coordinates,
+ sm_coefficients, sm_theta, sm_gridlineIndices,
+ sm_fractCoords, gm_theta, gm_dtheta, gm_gridlineIndices,
+ gm_fractShiftsTable, gm_gridlineIndicesTable);
+#if !defined(_AMD_SOURCE_) && !defined(_NVIDIA_SOURCE_)
+ /* This is only here for execution of e.g. 32-sized warps on 16-wide hardware; this was gmx_syncwarp() in CUDA.
+ * #2519
+ */
+ barrier(CLK_LOCAL_MEM_FENCE);
+#endif
+ }
+ else
+ {
+ /* Staging the data for spread
+ * (the data is assumed to be in GPU global memory with proper layout already,
+ * as in after running the spline kernel)
+ */
+ /* Spline data - only thetas (dthetas will only be needed in gather) */
+ pme_gpu_stage_atom_data(kernelParams, sm_theta, gm_theta, DIM * order);
+ /* Gridline indices - they are actually int, not float, but OpenCL C has no overloading, hence the casts */
+ pme_gpu_stage_atom_data(kernelParams, (__local float *)sm_gridlineIndices, (__global const float *)gm_gridlineIndices, DIM);
+
+ barrier(CLK_LOCAL_MEM_FENCE);
+ }
+ /* Spreading */
+ if (spreadCharges)
+ {
+ spread_charges(kernelParams, atomIndexOffset, sm_coefficients, sm_gridlineIndices, sm_theta, gm_grid);
+ }
+}
{
errorReasons.emplace_back("double precision");
}
- if (GMX_GPU != GMX_GPU_CUDA)
+ if (GMX_GPU == GMX_GPU_NONE)
{
- errorReasons.emplace_back("non-CUDA build of GROMACS");
+ errorReasons.emplace_back("non-GPU build of GROMACS");
}
return addMessageIfNotSupported(errorReasons, error);
}
{
errorReasons.emplace_back("double precision");
}
- if (GMX_GPU != GMX_GPU_CUDA)
+ if (GMX_GPU == GMX_GPU_NONE)
{
- errorReasons.emplace_back("non-CUDA build of GROMACS");
+ errorReasons.emplace_back("non-GPU build of GROMACS");
}
return addMessageIfNotSupported(errorReasons, error);
*/
/*! \internal \file
- * \brief This file defines the PME CUDA-specific data structure,
- * various compile-time constants shared among the PME CUDA kernels,
- * and also names some PME CUDA memory management routines.
- * TODO: consider changing defines into variables where possible; have inline getters.
+ * \brief This file defines the PME CUDA-specific kernel parameter data structure.
+ * \todo Rename the file (pme-gpu-types.cuh?), reconsider inheritance approach.
*
* \author Aleksei Iupinov <a.yupinov@gmail.com>
*/
#ifndef GMX_EWALD_PME_CUH
#define GMX_EWALD_PME_CUH
-#include <cassert>
+#include "gromacs/math/vectypes.h" // for DIM
#include "pme-gpu-constants.h"
-#include "pme-gpu-internal.h"
+#include "pme-gpu-internal.h" // for GridOrdering
#include "pme-gpu-types.h"
-#include "pme-gpu-types-host.h"
-#include "pme-gpu-types-host-impl.h"
-
-/*! \brief \internal
- * An inline CUDA function for checking the global atom data indices against the atom data array sizes.
- *
- * \param[in] atomDataIndexGlobal The atom data index.
- * \param[in] nAtomData The atom data array element count.
- * \returns Non-0 if index is within bounds (or PME data padding is enabled), 0 otherwise.
- *
- * This is called from the spline_and_spread and gather PME kernels.
- * The goal is to isolate the global range checks, and allow avoiding them with c_usePadding enabled.
- */
-int __device__ __forceinline__ pme_gpu_check_atom_data_index(const int atomDataIndex, const int nAtomData)
-{
- return c_usePadding ? 1 : (atomDataIndex < nAtomData);
-}
-
-/*! \brief \internal
- * An inline CUDA function for skipping the zero-charge atoms.
- *
- * \returns Non-0 if atom should be processed, 0 otherwise.
- * \param[in] coefficient The atom charge.
- *
- * This is called from the spline_and_spread and gather PME kernels.
- */
-int __device__ __forceinline__ pme_gpu_check_atom_charge(const float coefficient)
-{
- assert(isfinite(coefficient));
- return c_skipNeutralAtoms ? (coefficient != 0.0f) : 1;
-}
/*! \brief \internal
* A single structure encompassing all the PME data used in CUDA kernels.
}
std::map<GridOrdering, std::string> gridOrderingsToTest = {{GridOrdering::YZX, "YZX"}};
- if (codePath == CodePath::CUDA)
+ if (codePath == CodePath::GPU)
{
gridOrderingsToTest[GridOrdering::XYZ] = "XYZ";
}
implemented = true;
break;
- case CodePath::CUDA:
+ case CodePath::GPU:
implemented = (pme_gpu_supports_build(nullptr) &&
pme_gpu_supports_input(*inputRec, mtop, nullptr));
break;
)
{
const MDLogger dummyLogger;
- const auto runMode = (mode == CodePath::CPU) ? PmeRunMode::CPU : PmeRunMode::GPU;
+ const auto runMode = (mode == CodePath::CPU) ? PmeRunMode::CPU : PmeRunMode::Mixed;
t_commrec dummyCommrec = {0};
NumPmeDomains numPmeDomains = { 1, 1 };
gmx_pme_t *pmeDataRaw = gmx_pme_init(&dummyCommrec, numPmeDomains, inputRec, atomCount, false, false, true,
invertBoxMatrix(boxTemp, pme->recipbox);
break;
- case CodePath::CUDA:
+ case CodePath::GPU:
pme_gpu_set_testing(pme->gpu, true);
pme_gpu_update_input_box(pme->gpu, boxTemp);
break;
/* With decomposition there would be more boilerplate atc code here, e.g. do_redist_pos_coeffs */
break;
- case CodePath::CUDA:
+ case CodePath::GPU:
gmx_pme_reinit_atoms(pmeSafe.get(), atomCount, charges.data());
pme_gpu_copy_input_coordinates(pmeSafe->gpu, as_rvec_array(coordinates.data()));
break;
gmx_parallel_3dfft_real_limits(pme->pfft_setup[gridIndex], gridSize, gridOffsetUnused, paddedGridSize);
break;
- case CodePath::CUDA:
+ case CodePath::GPU:
pme_gpu_get_real_grid_sizes(pme->gpu, &gridSize, &paddedGridSize);
break;
}
break;
- case CodePath::CUDA:
+ case CodePath::GPU:
pme_gpu_spread(pme->gpu, gridIndex, fftgrid, computeSplines, spreadCharges);
break;
}
break;
- case CodePath::CUDA:
+ case CodePath::GPU:
switch (method)
{
case PmeSolveAlgorithm::Coulomb:
gather_f_bsplines(pme, pmegrid, !forceReductionWithInput, atc, &atc->spline[threadIndex], scale);
break;
- case CodePath::CUDA:
+ case CodePath::GPU:
{
// Variable initialization needs a non-switch scope
auto stagingForces = pme_gpu_get_forces(pme->gpu);
case CodePath::CPU:
break;
- case CodePath::CUDA:
+ case CodePath::GPU:
pme_gpu_synchronize(pme->gpu);
break;
std::copy(splineValues.begin(), splineValues.end(), splineBuffer);
break;
- case CodePath::CUDA:
+ case CodePath::GPU:
std::copy(splineValues.begin(), splineValues.end(), splineBuffer);
pme_gpu_transform_spline_atom_data(pme->gpu, atc, type, dimIndex, PmeLayoutTransform::HostToGpu);
break;
switch (mode)
{
- case CodePath::CUDA:
+ case CodePath::GPU:
memcpy(pme->gpu->staging.h_gridlineIndices, gridLineIndices.data(), atomCount * sizeof(gridLineIndices[0]));
break;
switch (mode)
{
- case CodePath::CUDA: // intentional absence of break, the grid will be copied from the host buffer in testing mode
+ case CodePath::GPU: // intentional absence of break, the grid will be copied from the host buffer in testing mode
case CodePath::CPU:
std::memset(grid, 0, paddedGridSize[XX] * paddedGridSize[YY] * paddedGridSize[ZZ] * sizeof(ValueType));
for (const auto &gridValue : gridValues)
SplineParamsDimVector result;
switch (mode)
{
- case CodePath::CUDA:
+ case CodePath::GPU:
pme_gpu_transform_spline_atom_data(pme->gpu, atc, type, dimIndex, PmeLayoutTransform::GpuToHost);
result = arrayRefFromArray(sourceBuffer, dimSize);
break;
GridLineIndicesVector gridLineIndices;
switch (mode)
{
- case CodePath::CUDA:
+ case CodePath::GPU:
gridLineIndices = arrayRefFromArray(reinterpret_cast<IVec *>(pme->gpu->staging.h_gridlineIndices), atomCount);
break;
SparseGridValuesOutput<ValueType> gridValues;
switch (mode)
{
- case CodePath::CUDA: // intentional absence of break
+ case CodePath::GPU: // intentional absence of break
case CodePath::CPU:
gridValues.clear();
for (int ix = 0; ix < gridSize[XX]; ix++)
GMX_THROW(InternalError("Test not implemented for this mode"));
}
break;
- case CodePath::CUDA:
+ case CodePath::GPU:
switch (method)
{
case PmeSolveAlgorithm::Coulomb:
{
case CodePath::CPU:
return "CPU";
- case CodePath::CUDA:
- return "CUDA";
+ case CodePath::GPU:
+ return "GPU";
default:
GMX_THROW(NotImplementedError("This CodePath should support codePathToString"));
}
char stmp[200] = {};
get_gpu_device_info_string(stmp, hardwareInfo_->gpu_info, gpuIndex);
std::string description = "(GPU " + std::string(stmp) + ") ";
- // TODO should this be CodePath::GPU?
hardwareContexts_.emplace_back(compat::make_unique<TestHardwareContext>
- (CodePath::CUDA, description.c_str(),
+ (CodePath::GPU, description.c_str(),
deviceInfo));
}
}
enum class CodePath
{
CPU,
- CUDA
+ GPU
};
//! Return a string useful for human-readable messages describing a \c codePath.
gpu_info->gpu_dev[device_index].vendor_e = get_vendor_id(gpu_info->gpu_dev[device_index].device_vendor);
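+ /* Store the device's work-group size limits; these are needed e.g. to check that the
+ * PME OpenCL kernel launch configurations fit the device. */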
+ clGetDeviceInfo(ocl_device_ids[j], CL_DEVICE_MAX_WORK_ITEM_SIZES, 3 * sizeof(size_t), &gpu_info->gpu_dev[device_index].maxWorkItemSizes, nullptr);
+
+ clGetDeviceInfo(ocl_device_ids[j], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &gpu_info->gpu_dev[device_index].maxWorkGroupSize, nullptr);
+
gpu_info->gpu_dev[device_index].stat = is_gmx_supported_gpu_id(gpu_info->gpu_dev + device_index);
if (egpuCompatible == gpu_info->gpu_dev[device_index].stat)
CommandStream stream = nullptr; //!< Stream to launch kernel in
};
+//! Sets whether device code can use arrays that are embedded in structs.
+#define c_canEmbedBuffers true
+// TODO this should be constexpr bool
+
#endif
CommandStream stream = nullptr; //!< Stream to launch kernel in
};
+/*! \brief Sets whether device code can use arrays that are embedded in structs.
+ * Note that OpenCL 2.x might be able to do this, but we use 1.2.
+ */
+#define c_canEmbedBuffers false
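+// In practice this likely means that device buffers (cl_mem) cannot be embedded in struct kernel
+// arguments and must instead be passed as separate kernel arguments.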
+
#endif
int adress_bits; /**< number of adress bits the device is capable of */
int stat; /**< device status takes values of e_gpu_detect_res_t */
ocl_vendor_id_t vendor_e; /**< device vendor as defined by ocl_vendor_id_t */
+ size_t maxWorkItemSizes[3]; /**< workgroup size limits (CL_DEVICE_MAX_WORK_ITEM_SIZES) */
+ size_t maxWorkGroupSize; /**< workgroup total size limit (CL_DEVICE_MAX_WORK_GROUP_SIZE) */
};
/*! \internal
*/
#include "gmxpre.h"
-#include "config.h"
-
#include <map>
#include <string>
#include <vector>
//! Before any test is run, work out whether any compatible GPUs exist.
static void SetUpTestCase();
//! Store whether any compatible GPUs exist.
- static bool s_hasCompatibleCudaGpus;
+ static bool s_hasCompatibleGpus;
//! Convenience typedef
using RunModesList = std::map < std::string, std::vector < const char *>>;
//! Runs the test with the given inputs
void runTest(const RunModesList &runModes);
};
-bool PmeTest::s_hasCompatibleCudaGpus = false;
+bool PmeTest::s_hasCompatibleGpus = false;
void PmeTest::SetUpTestCase()
{
// there is no GPU support in the build.
//
// TODO report any error messages gracefully.
- if (GMX_GPU == GMX_GPU_CUDA &&
- canDetectGpus(nullptr))
+ if (canDetectGpus(nullptr))
{
findGpus(&gpuInfo);
- s_hasCompatibleCudaGpus = (gpuInfo.n_dev_compatible > 0);
+ s_hasCompatibleGpus = (gpuInfo.n_dev_compatible > 0);
}
free_gpu_info(&gpuInfo);
}
for (const auto &mode : runModes)
{
auto modeTargetsGpus = (mode.first.find("Gpu") != std::string::npos);
- if (modeTargetsGpus && !s_hasCompatibleCudaGpus)
+ if (modeTargetsGpus && !s_hasCompatibleGpus)
{
// This run mode will cause a fatal error from mdrun when
// it can't find GPUs, which is not something we're trying