Merge branch release-5-1

author Mark Abraham <mark.j.abraham@gmail.com>

Mon, 28 Dec 2015 00:24:52 +0000 (11:24 +1100)

committer Mark Abraham <mark.j.abraham@gmail.com>

Mon, 28 Dec 2015 00:51:25 +0000 (11:51 +1100)
diff --combined docs/install-guide/index.rst

index 88ec60a454db1759f03a24d7686970b3dccea9fc,9a4cf3f9858b084c5e4646805abfee09e0e4d82d..5a814fd1813744c8eaa863712dbf5ebefd050225
--- 1/docs/install-guide/index.rst
--- 2/docs/install-guide/index.rst
+++ b/docs/install-guide/index.rst
@@@ -85,30 -85,14 +85,30 @@@ architectures including x86, AMD64/x86-
   Compiler
   --------
   Technically, |Gromacs| can be compiled on any platform with an ANSI C99
- -and C++98 compiler, and their respective standard C/C++ libraries.
- -We use only a few C99 features, but note that the C++ compiler also needs to
- -support these C99 features (notably, int64_t and related things), which are not
- -part of the C++98 standard.
+ +and C++11 compiler, and their respective standard C/C++ libraries.
+ +GROMACS uses a subset of C99 and C++11. A not fully standard compliant
+ +compiler might be able to compile GROMACS.
   Getting good performance on an OS and architecture requires choosing a
   good compiler. In practice, many compilers struggle to do a good job
   optimizing the |Gromacs| architecture-optimized SIMD kernels.
   
+ +C++11 support requires both support in the compiler as well as in the
+ +C++ library. Multiple compilers do not provide their own library
+ +but use the system library. It is required to select a library with
+ +sufficient C++11 support. Both the Intel and clang compiler on Linux use
+ +the libstdc++ which comes with gcc as the default C++ library. 4.6.1 of
+ +that library is required. Also the C++ library version has to be
+ +supported by the compiler. To select the C++ library version use:
+ +
+ +* For Intel: ``CXXFLAGS=-gcc-name=/path/to/gcc/binary`` or make sure
+ +  that the correct gcc version is first in path (e.g. by loading the gcc
+ +  module)
+ +* For clang: ``CFLAGS=--gcc-toolchain=/path/to/gcc/folder
+ +  CXXFLAGS=--gcc-toolchain=/path/to/gcc/folder``. This folder should
+ +  contain ``include/c++``.
+ +* On Windows with e.g. Intel: at least MSVC 2013 is required. Load the
+ +  environment with vcvarsall.bat.
+ +
   For best performance, the |Gromacs| team strongly recommends you get the
   most recent version of your preferred compiler for your platform.
   There is a large amount of |Gromacs| code that depends on effective
@@@ -188,9 -172,10 +188,10 @@@ contexts
   
   To make it possible to use other accelerators, |Gromacs| also includes
   OpenCL_ support. The current version is recommended for use with
- GCN-based AMD GPUs. It does work with NVIDIA GPUs, but see the
- known limitations in the user guide. The minimum
- OpenCL version required is |REQUIRED_OPENCL_MIN_VERSION|.
+ GCN-based AMD GPUs. It does work with NVIDIA GPUs, but using the latest
+ NVIDIA driver (which includes the NVIDIA OpenCL runtime) is recommended,
+ and please see the known limitations in the |Gromacs| user guide. The
+ minimum OpenCL version required is |REQUIRED_OPENCL_MIN_VERSION|.
   
   It is not possible to configure both CUDA and OpenCL support in the
   same version of |Gromacs|.
@@@ -311,6 -296,9 +312,6 @@@ Optional build component
   -------------------------
   * Compiling to run on NVIDIA GPUs requires CUDA_
   * Compiling to run on AMD GPUs requires OpenCL_
- -* An external Boost library can be used to provide better
- -  implementation support for smart pointers and exception handling,
- -  but the |Gromacs| source bundles a subset of Boost 1.55.0 as a fallback
   * Hardware-optimized BLAS and LAPACK libraries are useful
     for a few of the |Gromacs| utilities focused on normal modes and
     matrix manipulation, but they do not provide any benefits for normal
@@@ -1155,8 -1143,8 +1156,8 @@@ much everywhere, it is important that w
   it works because we have tested it. We do test on Linux, Windows, and
   Mac with a range of compilers and libraries for a range of our
   configuration options. Every commit in our git source code repository
- -is currently tested on x86 with gcc versions ranging from 4.1 through
- -5.1, and versions 12 through 15 of the Intel compiler as well as Clang
+ +is currently tested on x86 with gcc versions ranging from 4.6 through
+ +5.1, and versions 14 and 15 of the Intel compiler as well as Clang
   version 3.4 through 3.6. For this, we use a variety of GNU/Linux
   flavors and versions as well as recent versions of Mac OS X and Windows.  Under
   Windows we test both MSVC and the Intel compiler. For details, you can
diff --combined src/gromacs/gpu_utils/oclutils.cpp

index be95e9c6a9f26ea766feb484f21ed76a97bde047,0000000000000000000000000000000000000000..466a6bee375b5078b7999430dea2bde7e544a190

mode 100644,000000..100644
--- 1/src/gromacs/gpu_utils/oclutils.cpp
--- /dev/null
+++ b/src/gromacs/gpu_utils/oclutils.cpp
@@@ -1,195 -1,0 +1,274 @@@
+ +/*
+ + * This file is part of the GROMACS molecular simulation package.
+ + *
+ + * Copyright (c) 2014,2015, by the GROMACS development team, led by
+ + * Mark Abraham, David van der Spoel, Berk Hess, and Erik Lindahl,
+ + * and including many others, as listed in the AUTHORS file in the
+ + * top-level source directory and at http://www.gromacs.org.
+ + *
+ + * GROMACS is free software; you can redistribute it and/or
+ + * modify it under the terms of the GNU Lesser General Public License
+ + * as published by the Free Software Foundation; either version 2.1
+ + * of the License, or (at your option) any later version.
+ + *
+ + * GROMACS is distributed in the hope that it will be useful,
+ + * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ + * Lesser General Public License for more details.
+ + *
+ + * You should have received a copy of the GNU Lesser General Public
+ + * License along with GROMACS; if not, see
+ + * http://www.gnu.org/licenses, or write to the Free Software Foundation,
+ + * Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA.
+ + *
+ + * If you want to redistribute modifications to GROMACS, please
+ + * consider that scientific software is very special. Version
+ + * control is crucial - bugs must be traceable. We will be happy to
+ + * consider code for inclusion in the official distribution, but
+ + * derived work must not be called official GROMACS. Details are found
+ + * in the README & COPYING files - if they are missing, get the
+ + * official version at http://www.gromacs.org.
+ + *
+ + * To help us fund GROMACS development, we humbly ask that you cite
+ + * the research papers on the package. Check out http://www.gromacs.org.
+ + */
+ +/*! \internal \file
+ + *  \brief Define utility routines for OpenCL
+ + *
+ + *  \author Anca Hamuraru <anca@streamcomputing.eu>
+ + */
+ +#include "gmxpre.h"
+ +
+ +#include "oclutils.h"
+ +
+ +#include <stdlib.h>
+ +
+ +#include <cassert>
+ +#include <cstdio>
+ +
+ +#include "gromacs/utility/fatalerror.h"
+ +#include "gromacs/utility/smalloc.h"
+ +
+ +/*! \brief Launches synchronous or asynchronous host to device memory copy.
+ + *
+ + *  If copy_event is not NULL, on return it will contain an event object
+ + *  identifying this particular host to device operation. The event can further
+ + *  be used to queue a wait for this operation or to query profiling information.
+ + */
+ +static int ocl_copy_H2D_generic(cl_mem d_dest, void* h_src,
+ +                                size_t offset, size_t bytes,
+ +                                bool bAsync /* = false*/,
+ +                                cl_command_queue command_queue,
+ +                                cl_event *copy_event)
+ +{
+ +    cl_int gmx_unused cl_error;
+ +
+ +    if (d_dest == NULL || h_src == NULL || bytes == 0)
+ +    {
+ +        return -1;
+ +    }
+ +
+ +    if (bAsync)
+ +    {
+ +        cl_error = clEnqueueWriteBuffer(command_queue, d_dest, CL_FALSE, offset, bytes, h_src, 0, NULL, copy_event);
+ +        assert(cl_error == CL_SUCCESS);
+ +        // TODO: handle errors
+ +    }
+ +    else
+ +    {
+ +        cl_error = clEnqueueWriteBuffer(command_queue, d_dest, CL_TRUE, offset, bytes, h_src, 0, NULL, copy_event);
+ +        assert(cl_error == CL_SUCCESS);
+ +        // TODO: handle errors
+ +    }
+ +
+ +    return 0;
+ +}
+ +
+ +/*! \brief Launches asynchronous host to device memory copy.
+ + *
+ + *  If copy_event is not NULL, on return it will contain an event object
+ + *  identifying this particular host to device operation. The event can further
+ + *  be used to queue a wait for this operation or to query profiling information.
+ + */
+ +int ocl_copy_H2D_async(cl_mem d_dest, void * h_src,
+ +                       size_t offset, size_t bytes,
+ +                       cl_command_queue command_queue,
+ +                       cl_event *copy_event)
+ +{
+ +    return ocl_copy_H2D_generic(d_dest, h_src, offset, bytes, true, command_queue, copy_event);
+ +}
+ +
+ +/*! \brief Launches synchronous host to device memory copy.
+ + */
+ +int ocl_copy_H2D(cl_mem d_dest, void * h_src,
+ +                 size_t offset, size_t bytes,
+ +                 cl_command_queue command_queue)
+ +{
+ +    return ocl_copy_H2D_generic(d_dest, h_src, offset, bytes, false, command_queue, NULL);
+ +}
+ +
+ +/*! \brief Launches synchronous or asynchronous device to host memory copy.
+ + *
+ + *  If copy_event is not NULL, on return it will contain an event object
+ + *  identifying this particular device to host operation. The event can further
+ + *  be used to queue a wait for this operation or to query profiling information.
+ + */
+ +int ocl_copy_D2H_generic(void * h_dest, cl_mem d_src,
+ +                         size_t offset, size_t bytes,
+ +                         bool bAsync,
+ +                         cl_command_queue command_queue,
+ +                         cl_event *copy_event)
+ +{
+ +    cl_int gmx_unused cl_error;
+ +
+ +    if (h_dest == NULL || d_src == NULL || bytes == 0)
+ +    {
+ +        return -1;
+ +    }
+ +
+ +    if (bAsync)
+ +    {
+ +        cl_error = clEnqueueReadBuffer(command_queue, d_src, CL_FALSE, offset, bytes, h_dest, 0, NULL, copy_event);
+ +        assert(cl_error == CL_SUCCESS);
+ +        // TODO: handle errors
+ +    }
+ +    else
+ +    {
+ +        cl_error = clEnqueueReadBuffer(command_queue, d_src, CL_TRUE, offset, bytes, h_dest, 0, NULL, copy_event);
+ +        assert(cl_error == CL_SUCCESS);
+ +        // TODO: handle errors
+ +    }
+ +
+ +    return 0;
+ +}
+ +
+ +/*! \brief Launches asynchronous device to host memory copy.
+ + *
+ + *  If copy_event is not NULL, on return it will contain an event object
+ + *  identifying this particular device to host operation. The event can further
+ + *  be used to queue a wait for this operation or to query profiling information.
+ + */
+ +int ocl_copy_D2H_async(void * h_dest, cl_mem d_src,
+ +                       size_t offset, size_t bytes,
+ +                       cl_command_queue command_queue,
+ +                       cl_event *copy_event)
+ +{
+ +    return ocl_copy_D2H_generic(h_dest, d_src, offset, bytes, true, command_queue, copy_event);
+ +}
+ +
+ +/*! \brief Allocates nbytes of host memory. Use ocl_pfree to free memory allocated with this function.
+ + *
+ + *  \todo
+ + *  This function should allocate page-locked memory to help reduce D2H and H2D
+ + *  transfer times, similar with pmalloc from pmalloc_cuda.cu.
+ + *
+ + * \param[in,out]    h_ptr   Pointer where to store the address of the newly allocated buffer.
+ + * \param[in]        nbytes  Size in bytes of the buffer to be allocated.
+ + */
+ +void ocl_pmalloc(void **h_ptr, size_t nbytes)
+ +{
+ +    /* Need a temporary type whose size is 1 byte, so that the
+ +     * implementation of snew_aligned can cope without issuing
+ +     * warnings. */
+ +    char **temporary = reinterpret_cast<char **>(h_ptr);
+ +
+ +    /* 16-byte alignment is required by the neighbour-searching code,
+ +     * because it uses four-wide SIMD for bounding-box calculation.
+ +     * However, when we organize using page-locked memory for
+ +     * device-host transfers, it will probably need to be aligned to a
+ +     * 4kb page, like CUDA does. */
+ +    snew_aligned(*temporary, nbytes, 16);
+ +}
+ +
+ +/*! \brief Frees memory allocated with ocl_pmalloc.
+ + *
+ + * \param[in]    h_ptr   Buffer allocated with ocl_pmalloc that needs to be freed.
+ + */
+ +void ocl_pfree(void *h_ptr)
+ +{
+ +
+ +    if (h_ptr)
+ +    {
+ +        sfree_aligned(h_ptr);
+ +    }
+ +    return;
+ +}
++
++/*! \brief Convert error code to diagnostic string */
++const char *ocl_get_error_string(cl_int error)
++{
++    switch (error)
++    {
++        // run-time and JIT compiler errors
++        case 0: return "CL_SUCCESS";
++        case -1: return "CL_DEVICE_NOT_FOUND";
++        case -2: return "CL_DEVICE_NOT_AVAILABLE";
++        case -3: return "CL_COMPILER_NOT_AVAILABLE";
++        case -4: return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
++        case -5: return "CL_OUT_OF_RESOURCES";
++        case -6: return "CL_OUT_OF_HOST_MEMORY";
++        case -7: return "CL_PROFILING_INFO_NOT_AVAILABLE";
++        case -8: return "CL_MEM_COPY_OVERLAP";
++        case -9: return "CL_IMAGE_FORMAT_MISMATCH";
++        case -10: return "CL_IMAGE_FORMAT_NOT_SUPPORTED";
++        case -11: return "CL_BUILD_PROGRAM_FAILURE";
++        case -12: return "CL_MAP_FAILURE";
++        case -13: return "CL_MISALIGNED_SUB_BUFFER_OFFSET";
++        case -14: return "CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST";
++        case -15: return "CL_COMPILE_PROGRAM_FAILURE";
++        case -16: return "CL_LINKER_NOT_AVAILABLE";
++        case -17: return "CL_LINK_PROGRAM_FAILURE";
++        case -18: return "CL_DEVICE_PARTITION_FAILED";
++        case -19: return "CL_KERNEL_ARG_INFO_NOT_AVAILABLE";
++
++        // compile-time errors
++        case -30: return "CL_INVALID_VALUE";
++        case -31: return "CL_INVALID_DEVICE_TYPE";
++        case -32: return "CL_INVALID_PLATFORM";
++        case -33: return "CL_INVALID_DEVICE";
++        case -34: return "CL_INVALID_CONTEXT";
++        case -35: return "CL_INVALID_QUEUE_PROPERTIES";
++        case -36: return "CL_INVALID_COMMAND_QUEUE";
++        case -37: return "CL_INVALID_HOST_PTR";
++        case -38: return "CL_INVALID_MEM_OBJECT";
++        case -39: return "CL_INVALID_IMAGE_FORMAT_DESCRIPTOR";
++        case -40: return "CL_INVALID_IMAGE_SIZE";
++        case -41: return "CL_INVALID_SAMPLER";
++        case -42: return "CL_INVALID_BINARY";
++        case -43: return "CL_INVALID_BUILD_OPTIONS";
++        case -44: return "CL_INVALID_PROGRAM";
++        case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
++        case -46: return "CL_INVALID_KERNEL_NAME";
++        case -47: return "CL_INVALID_KERNEL_DEFINITION";
++        case -48: return "CL_INVALID_KERNEL";
++        case -49: return "CL_INVALID_ARG_INDEX";
++        case -50: return "CL_INVALID_ARG_VALUE";
++        case -51: return "CL_INVALID_ARG_SIZE";
++        case -52: return "CL_INVALID_KERNEL_ARGS";
++        case -53: return "CL_INVALID_WORK_DIMENSION";
++        case -54: return "CL_INVALID_WORK_GROUP_SIZE";
++        case -55: return "CL_INVALID_WORK_ITEM_SIZE";
++        case -56: return "CL_INVALID_GLOBAL_OFFSET";
++        case -57: return "CL_INVALID_EVENT_WAIT_LIST";
++        case -58: return "CL_INVALID_EVENT";
++        case -59: return "CL_INVALID_OPERATION";
++        case -60: return "CL_INVALID_GL_OBJECT";
++        case -61: return "CL_INVALID_BUFFER_SIZE";
++        case -62: return "CL_INVALID_MIP_LEVEL";
++        case -63: return "CL_INVALID_GLOBAL_WORK_SIZE";
++        case -64: return "CL_INVALID_PROPERTY";
++        case -65: return "CL_INVALID_IMAGE_DESCRIPTOR";
++        case -66: return "CL_INVALID_COMPILER_OPTIONS";
++        case -67: return "CL_INVALID_LINKER_OPTIONS";
++        case -68: return "CL_INVALID_DEVICE_PARTITION_COUNT";
++
++        // extension errors
++        case -1000: return "CL_INVALID_GL_SHAREGROUP_REFERENCE_KHR";
++        case -1001: return "CL_PLATFORM_NOT_FOUND_KHR";
++        case -1002: return "CL_INVALID_D3D10_DEVICE_KHR";
++        case -1003: return "CL_INVALID_D3D10_RESOURCE_KHR";
++        case -1004: return "CL_D3D10_RESOURCE_ALREADY_ACQUIRED_KHR";
++        case -1005: return "CL_D3D10_RESOURCE_NOT_ACQUIRED_KHR";
++        default: return "Unknown OpenCL error";
++    }
++}
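The new ocl_get_error_string() lookup can be used to turn a bare OpenCL status code into a readable diagnostic. The following is a minimal sketch, not code from this commit: it assumes a stand-in integer type in place of cl_int so that it builds without the OpenCL headers, it lists only a handful of codes, and the names cl_int_sketch, sketch_error_string and sketch_check are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Stand-in for cl_int so the sketch builds without OpenCL headers (assumption).
typedef std::int32_t cl_int_sketch;

// Abbreviated mirror of ocl_get_error_string(); only a few codes are listed here.
const char *sketch_error_string(cl_int_sketch error)
{
    switch (error)
    {
        case 0:   return "CL_SUCCESS";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -36: return "CL_INVALID_COMMAND_QUEUE";
        default:  return "Unknown OpenCL error";
    }
}

// Hypothetical helper: report a failed status with the name of the call
// and the symbolic error name instead of a bare number.
void sketch_check(cl_int_sketch status, const char *what)
{
    assert(what != nullptr);
    if (status != 0)
    {
        throw std::runtime_error(std::string(what) + " failed: "
                                 + sketch_error_string(status));
    }
}
```

A helper of this shape is one possible way to address the "TODO: handle errors" notes in the copy routines above; the commit itself only passes the string to GMX_RELEASE_ASSERT in sync_ocl_event().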
diff --combined src/gromacs/gpu_utils/oclutils.h

index 9c329408e181a2e9d5d1af76719f24aed25267a7,0000000000000000000000000000000000000000..365b91328a2a1d4a9257404221bafa3bd715e8ed

mode 100644,000000..100644
--- 1/src/gromacs/gpu_utils/oclutils.h
--- /dev/null
+++ b/src/gromacs/gpu_utils/oclutils.h
@@@ -1,129 -1,0 +1,132 @@@
+ +/*
+ + * This file is part of the GROMACS molecular simulation package.
+ + *
+ + * Copyright (c) 2014,2015, by the GROMACS development team, led by
+ + * Mark Abraham, David van der Spoel, Berk Hess, and Erik Lindahl,
+ + * and including many others, as listed in the AUTHORS file in the
+ + * top-level source directory and at http://www.gromacs.org.
+ + *
+ + * GROMACS is free software; you can redistribute it and/or
+ + * modify it under the terms of the GNU Lesser General Public License
+ + * as published by the Free Software Foundation; either version 2.1
+ + * of the License, or (at your option) any later version.
+ + *
+ + * GROMACS is distributed in the hope that it will be useful,
+ + * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ + * Lesser General Public License for more details.
+ + *
+ + * You should have received a copy of the GNU Lesser General Public
+ + * License along with GROMACS; if not, see
+ + * http://www.gnu.org/licenses, or write to the Free Software Foundation,
+ + * Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA.
+ + *
+ + * If you want to redistribute modifications to GROMACS, please
+ + * consider that scientific software is very special. Version
+ + * control is crucial - bugs must be traceable. We will be happy to
+ + * consider code for inclusion in the official distribution, but
+ + * derived work must not be called official GROMACS. Details are found
+ + * in the README & COPYING files - if they are missing, get the
+ + * official version at http://www.gromacs.org.
+ + *
+ + * To help us fund GROMACS development, we humbly ask that you cite
+ + * the research papers on the package. Check out http://www.gromacs.org.
+ + */
+ +/*! \libinternal \file
+ + *  \brief Declare utility routines for OpenCL
+ + *
+ + *  \author Anca Hamuraru <anca@streamcomputing.eu>
+ + *  \inlibraryapi
+ + */
+ +#ifndef GMX_GPU_UTILS_OCLUTILS_H
+ +#define GMX_GPU_UTILS_OCLUTILS_H
+ +
+ +/*! \brief Declare to OpenCL SDKs that we intend to use OpenCL API
+ +   features that were deprecated in 2.0, so that they don't warn about
+ +   it. */
+ +#define CL_USE_DEPRECATED_OPENCL_2_0_APIS
+ +#ifdef __APPLE__
+ +#    include <OpenCL/opencl.h>
+ +#else
+ +#    include <CL/opencl.h>
+ +#endif
+ +
+ +/*! \brief OpenCL vendor IDs */
+ +typedef enum {
+ +    OCL_VENDOR_NVIDIA = 0,
+ +    OCL_VENDOR_AMD,
+ +    OCL_VENDOR_INTEL,
+ +    OCL_VENDOR_UNKNOWN
+ +} ocl_vendor_id_t;
+ +
+ +/*! \internal \brief OpenCL GPU device identifier
+ + * An OpenCL device is identified by its ID.
+ + * The platform ID is also included for caching reasons.
+ + */
+ +typedef struct
+ +{
+ +    cl_platform_id      ocl_platform_id; /**< Platform ID */
+ +    cl_device_id        ocl_device_id;   /**< Device ID */
+ +} ocl_gpu_id_t;
+ +
+ +/*! \internal \brief OpenCL GPU information
+ + *
+ + * \todo Move context and program outside this data structure.
+ + * They are specific to a certain usage of the device (e.g. with/without OpenGL
+ + * interop) and do not provide general device information as the data structure
+ + * name indicates.
+ + *
+ + * TODO Document fields
+ + */
+ +struct gmx_device_info_t
+ +{
+ +    //! @cond Doxygen_Suppress
+ +    ocl_gpu_id_t        ocl_gpu_id;
+ +    char                device_name[256];
+ +    char                device_version[256];
+ +    char                device_vendor[256];
+ +    int                 compute_units;
+ +    int                 adress_bits;
+ +    int                 stat;
+ +    ocl_vendor_id_t     vendor_e;
+ +
+ +    cl_context          context;
+ +    cl_program          program;
+ +    //! @endcond Doxygen_Suppress
+ +
+ +};
+ +
+ +#if !defined(NDEBUG)
+ +/* Debugger callable function that prints the name of a kernel function pointer */
+ +cl_int dbg_ocl_kernel_name(const cl_kernel kernel);
+ +cl_int dbg_ocl_kernel_name_address(void* kernel);
+ +#endif
+ +
+ +
+ +/*! \brief Launches asynchronous host to device memory copy. */
+ +int ocl_copy_H2D_async(cl_mem d_dest, void * h_src,
+ +                       size_t offset, size_t bytes,
+ +                       cl_command_queue command_queue,
+ +                       cl_event *copy_event);
+ +
+ +/*! \brief Launches asynchronous device to host memory copy. */
+ +int ocl_copy_D2H_async(void * h_dest, cl_mem d_src,
+ +                       size_t offset, size_t bytes,
+ +                       cl_command_queue command_queue,
+ +                       cl_event *copy_event);
+ +
+ +/*! \brief Launches synchronous host to device memory copy. */
+ +int ocl_copy_H2D(cl_mem d_dest, void * h_src,
+ +                 size_t offset, size_t bytes,
+ +                 cl_command_queue command_queue);
+ +
+ +/*! \brief Allocate host memory in malloc style */
+ +void ocl_pmalloc(void **h_ptr, size_t nbytes);
+ +
+ +/*! \brief Free host memory in malloc style */
+ +void ocl_pfree(void *h_ptr);
+ +
++/*! \brief Convert error code to diagnostic string */
++const char *ocl_get_error_string(cl_int error);
++
+ +#endif
diff --combined src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl.cpp

index 19f94a92351059c2394c3d5176281204ea98f36b,8ce759d744c31ee27c224c542877dfedc0355082..c4ed640a80e0051aeaef86b7eb27957c115cd68d
--- 1/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl.cpp
--- 2/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl.cpp
+++ b/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl.cpp
@@@ -51,9 -51,10 +51,9 @@@
   #include <limits>
   #endif
   
- -#include "gromacs/gmxlib/ocl_tools/oclutils.h"
- -#include "gromacs/legacyheaders/types/force_flags.h"
- -#include "gromacs/legacyheaders/types/hw_info.h"
- -#include "gromacs/legacyheaders/types/simple.h"
+ +#include "gromacs/gpu_utils/oclutils.h"
+ +#include "gromacs/hardware/hw_info.h"
+ +#include "gromacs/mdlib/force_flags.h"
   #include "gromacs/mdlib/nb_verlet.h"
   #include "gromacs/mdlib/nbnxn_consts.h"
   #include "gromacs/mdlib/nbnxn_pairlist.h"
@@@ -68,6 -69,7 +68,7 @@@
   #include "gromacs/pbcutil/ishift.h"
   #include "gromacs/utility/cstringutil.h"
   #include "gromacs/utility/fatalerror.h"
+ #include "gromacs/utility/gmxassert.h"
   
   #include "nbnxn_ocl_types.h"
   
@@@ -251,16 -253,11 +252,16 @@@ static inline int calc_shmem_required(
       /* size of shmem (force-buffers/xq/atom type preloading) */
       /* NOTE: with the default kernel on sm3.0 we need shmem only for pre-loading */
       /* i-atom x+q in shared memory */
- -    //shmem  = NCL_PER_SUPERCL * CL_SIZE * sizeof(float4);
       shmem  = NCL_PER_SUPERCL * CL_SIZE * sizeof(float) * 4; /* xqib */
       /* cj in shared memory, for both warps separately */
       shmem += 2 * NBNXN_GPU_JGROUP_SIZE * sizeof(int);       /* cjs  */
- -#ifdef IATYPE_SHMEM                                         // CUDA ARCH >= 300
+ +#ifdef IATYPE_SHMEM
+ +    /* FIXME: this should not be compile-time decided but rather at runtime.
+ +     * This issue propagated from the CUDA code where due to the source to source
+ +     * compilation there was confusion the way to set up arch-dependent launch parameters.
+ +     * Here too this should be converted to a hardware/arch/generation dependent
+ +     * conditional when re-evaluating the need for i atom type preloading.
+ +     */
       /* i-atom types in shared memory */
       #pragma error "Should not be defined"
       shmem += NCL_PER_SUPERCL * CL_SIZE * sizeof(int);       /* atib */
@@@ -335,7 -332,7 +336,7 @@@ void sync_ocl_event(cl_command_queue st
       cl_error = clEnqueueWaitForEvents(stream, 1, ocl_event);
   #endif
   
-     assert(CL_SUCCESS == cl_error);
+     GMX_RELEASE_ASSERT(CL_SUCCESS == cl_error, ocl_get_error_string(cl_error));
   
       /* Release event and reset it to 0. It is ok to release it as enqueuewaitforevents performs implicit retain for events. */
       cl_error = clReleaseEvent(*ocl_event);
@@@ -343,7 -340,7 +344,7 @@@
       *ocl_event = 0;
   }
   
- -/*! \brief Returns the duration in miliseconds for the command associated with the event.
+ +/*! \brief Returns the duration in milliseconds for the command associated with the event.
    *
    * It then releases the event and sets it to 0.
    * Before calling this function, make sure the command has finished either by
@@@ -932,6 -929,15 +933,15 @@@ void nbnxn_gpu_launch_cpyback(gmx_nbnxn
       /* don't launch non-local copy-back if there was no non-local work to do */
       if (iloc == eintNonlocal && nb->plist[iloc]->nsci == 0)
       {
+         /* TODO An alternative way to signal that non-local work is
+            complete is to use a clEnqueueMarker+clEnqueueBarrier
+            pair. However, the use of bNonLocalStreamActive has the
+            advantage of being local to the host, so probably minimizes
+            overhead. Curiously, for NVIDIA OpenCL with an empty-domain
+            test case, overall simulation performance was higher with
+            the API calls, but this has not been tested on AMD OpenCL,
+            so could be worth considering in future. */
+         nb->bNonLocalStreamActive = false;
           return;
       }
   
@@@ -951,7 -957,7 +961,7 @@@
   
       /* With DD the local D2H transfer can only start after the non-local
          has been launched. */
-     if (iloc == eintLocal && nb->bUseTwoStreams)
+     if (iloc == eintLocal && nb->bNonLocalStreamActive)
       {
           sync_ocl_event(stream, &(nb->nonlocal_done));
       }
@@@ -976,6 -982,7 +986,7 @@@
           cl_error = clEnqueueMarker(stream, &(nb->nonlocal_done));
   #endif
           assert(CL_SUCCESS == cl_error);
+         nb->bNonLocalStreamActive = true;
       }
   
       /* only transfer energies in the local stream */
@@@ -1007,6 -1014,7 +1018,6 @@@
    * transfers to finish.
    */
   void nbnxn_gpu_wait_for_gpu(gmx_nbnxn_ocl_t *nb,
- -                            const nbnxn_atomdata_t gmx_unused *nbatom,
                               int flags, int aloc,
                               real *e_lj, real *e_el, rvec *fshift)
   {
@@@ -1140,12 -1148,9 +1151,12 @@@ int nbnxn_gpu_pick_ewald_kernel_type(bo
                      "requested through environment variables.");
       }
   
- -    /* CUDA: By default, on SM 3.0 and later use analytical Ewald, on earlier tabulated. */
- -    /* OpenCL: By default, use analytical Ewald, on earlier tabulated. */
- -    // TODO: decide if dev_info parameter should be added to recognize NVIDIA CC>=3.0 devices.
+ +    /* OpenCL: By default, use analytical Ewald
+ +     * TODO: tabulated does not work, it needs fixing, see init_nbparam() in nbnxn_ocl_data_mgmt.cpp
+ +     *
+ +     * TODO: decide if dev_info parameter should be added to recognize NVIDIA CC>=3.0 devices.
+ +     *
+ +     */
       //if ((dev_info->prop.major >= 3 || bForceAnalyticalEwald) && !bForceTabulatedEwald)
       if ((1                         || bForceAnalyticalEwald) && !bForceTabulatedEwald)
       {
diff --combined src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl_types.h

index a41588066922012f3e65f63f7bb62606239e89e4,000805fe690340c4ff38d9f98e10ae035611f645..67529febe0bf496dcc3a3beedd8d6771f7038497
--- 1/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl_types.h
--- 2/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl_types.h
+++ b/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl_types.h
@@@ -54,8 -54,8 +54,8 @@@
   #    include <CL/opencl.h>
   #endif
   
- -#include "gromacs/legacyheaders/types/interaction_const.h"
   #include "gromacs/mdlib/nbnxn_pairlist.h"
+ +#include "gromacs/mdtypes/interaction_const.h"
   #include "gromacs/utility/real.h"
   
   /* kernel does #include "gromacs/math/utilities.h" */
@@@ -289,16 -289,17 +289,17 @@@ struct gmx_nbnxn_ocl_
       cl_kernel           kernel_zero_e_fshift;
       ///@}
   
-     cl_bool             bUseTwoStreams; /**< true if doing both local/non-local NB work on GPU          */
+     cl_bool             bUseTwoStreams;        /**< true if doing both local/non-local NB work on GPU          */
+     cl_bool             bNonLocalStreamActive; /**< true indicates that the nonlocal_done event was enqueued   */
   
-     cl_atomdata_t      *atdat;          /**< atom data                                                  */
-     cl_nbparam_t       *nbparam;        /**< parameters required for the non-bonded calc.               */
-     cl_plist_t         *plist[2];       /**< pair-list data structures (local and non-local)            */
-     cl_nb_staging_t     nbst;           /**< staging area where fshift/energies get downloaded          */
+     cl_atomdata_t      *atdat;                 /**< atom data                                                  */
+     cl_nbparam_t       *nbparam;               /**< parameters required for the non-bonded calc.               */
+     cl_plist_t         *plist[2];              /**< pair-list data structures (local and non-local)            */
+     cl_nb_staging_t     nbst;                  /**< staging area where fshift/energies get downloaded          */
   
-     cl_mem              debug_buffer;   /**< debug buffer */
+     cl_mem              debug_buffer;          /**< debug buffer */
   
-     cl_command_queue    stream[2];      /**< local and non-local GPU queues                             */
+     cl_command_queue    stream[2];             /**< local and non-local GPU queues                             */
   
       /** events used for synchronization */
       cl_event nonlocal_done;               /**< event triggered when the non-local non-bonded kernel
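The role of the new bNonLocalStreamActive flag can be pictured with a host-only model. This is a rough sketch under stated assumptions, not the GROMACS implementation: it makes no OpenCL calls, the counters stand in for clEnqueueMarker and sync_ocl_event(), the condition for enqueuing the marker is simplified to "non-local launch with work", and the names StreamModel, Locality and launchCopyBack are hypothetical.

```cpp
#include <cassert>

// Hypothetical host-only model of the synchronization bookkeeping.
struct StreamModel
{
    bool nonLocalStreamActive = false; // mirrors bNonLocalStreamActive
    int  eventsEnqueued       = 0;     // stands in for enqueuing nonlocal_done
    int  eventsWaited         = 0;     // stands in for sync_ocl_event()
};

enum Locality { Local, NonLocal };

// Simplified control flow of the copy-back launch for one locality.
void launchCopyBack(StreamModel &m, Locality loc, int nonLocalWork)
{
    assert(nonLocalWork >= 0);
    if (loc == NonLocal && nonLocalWork == 0)
    {
        // Nothing was launched, so no event will ever be signalled.
        m.nonLocalStreamActive = false;
        return;
    }
    if (loc == Local && m.nonLocalStreamActive)
    {
        // Wait only when an event was really enqueued this step.
        ++m.eventsWaited;
    }
    if (loc == NonLocal)
    {
        ++m.eventsEnqueued;
        m.nonLocalStreamActive = true;
    }
}
```

The point of the model is that the local stream waits on the event only in steps where the non-local stream enqueued it, which is what replacing the bUseTwoStreams test with bNonLocalStreamActive appears intended to guarantee for empty non-local domains.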
diff --combined src/programs/mdrun/tests/moduletest.cpp

index 5e1726a7ee0668e39cad941a24c97ecb3dec2487,fe14dfbedea1105f4f228479d6d49632814582cb..56dfa17725a922c24a5e71dd294755fd45718a60
--- 1/src/programs/mdrun/tests/moduletest.cpp
--- 2/src/programs/mdrun/tests/moduletest.cpp
+++ b/src/programs/mdrun/tests/moduletest.cpp
@@@ -45,13 -45,16 +45,16 @@@
   
   #include "config.h"
   
+ #include <cstdio>
+ 
   #include "gromacs/gmxpreprocess/grompp.h"
- -#include "gromacs/legacyheaders/gmx_detect_hardware.h"
++#include "gromacs/hardware/detecthardware.h"
   #include "gromacs/options/basicoptions.h"
- -#include "gromacs/options/options.h"
+ +#include "gromacs/options/ioptionscontainer.h"
   #include "gromacs/utility/basedefinitions.h"
   #include "gromacs/utility/basenetwork.h"
- -#include "gromacs/utility/file.h"
   #include "gromacs/utility/gmxmpi.h"
+ +#include "gromacs/utility/textwriter.h"
   #include "programs/mdrun/mdrun_main.h"
   
   #include "testutils/cmdlinetest.h"
@@@ -133,13 -136,13 +136,13 @@@ SimulationRunner::useStringAsMdpFile(co
   void
   SimulationRunner::useStringAsMdpFile(const std::string &mdpString)
   {
- -    gmx::File::writeFileFromString(mdpInputFileName_, mdpString);
+ +    gmx::TextWriter::writeFileFromString(mdpInputFileName_, mdpString);
   }
   
   void
   SimulationRunner::useStringAsNdxFile(const char *ndxString)
   {
- -    gmx::File::writeFileFromString(ndxFileName_, ndxString);
+ +    gmx::TextWriter::writeFileFromString(ndxFileName_, ndxString);
   }
   
   void
@@@ -217,26 -220,38 +220,50 @@@ SimulationRunner::callMdrun(const Comma
           caller.addOption("-nsteps", nsteps_);
       }
   
+ +#ifdef GMX_MPI
+ +#  if GMX_GPU != GMX_GPU_NONE
+ +#    ifdef GMX_THREAD_MPI
+ +    int         numGpusNeeded = g_numThreads;
+ +#    else   /* Must be real MPI */
+ +    int         numGpusNeeded = gmx_node_num();
+ +#    endif
+ +    std::string gpuIdString(numGpusNeeded, '0');
+ +    caller.addOption("-gpu_id", gpuIdString.c_str());
+ +#  endif
+ +#endif
+ +
   #ifdef GMX_THREAD_MPI
-     caller.addOption("-nt", g_numThreads);
+     caller.addOption("-ntmpi", g_numThreads);
   #endif
   
   #ifdef GMX_OPENMP
       caller.addOption("-ntomp", g_numOpenMPThreads);
   #endif
   
+ #if defined GMX_GPU
+     /* TODO Ideally, with real MPI, we could call
+      * gmx_collect_hardware_mpi() here and find out how many nodes
+      * mdrun will run on. For now, we assume that we're running on one
+      * node regardless of the number of ranks, because that's true in
+      * Jenkins and for most developers running the tests. */
+     int numberOfNodes = 1;
+ #if defined GMX_THREAD_MPI
+     /* Can't use gmx_node_num() because it is only valid after spawn of thread-MPI threads */
+     int numberOfRanks = g_numThreads;
+ #elif defined GMX_LIB_MPI
+     int numberOfRanks = gmx_node_num();
+ #else
+     int numberOfRanks = 1;
+ #endif
+     if (numberOfRanks > numberOfNodes && !gmx_multiple_gpu_per_node_supported())
+     {
+         if (gmx_node_rank() == 0)
+         {
+             fprintf(stderr, "GROMACS in this build configuration cannot run on more than one GPU per node,\n so with %d ranks and %d nodes, this test will disable GPU support", numberOfRanks, numberOfNodes);
+         }
+         caller.addOption("-nb", "cpu");
+     }
+ #endif
       return gmx_mdrun(caller.argc(), caller.argv());
   }
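Two small pieces of logic in callMdrun() can be illustrated in isolation: building the -gpu_id string that maps every rank to GPU 0, and deciding when the test falls back to -nb cpu. The sketch below is an illustration only; the function names makeGpuIdString and shouldFallBackToCpu are hypothetical, and the boolean parameter stands in for the result of gmx_multiple_gpu_per_node_supported().

```cpp
#include <cassert>
#include <string>

// Hypothetical helper mirroring `std::string gpuIdString(numGpusNeeded, '0')`:
// one '0' character per rank, so every rank is assigned GPU id 0.
std::string makeGpuIdString(int numRanks)
{
    assert(numRanks >= 0);
    return std::string(static_cast<std::string::size_type>(numRanks), '0');
}

// Hypothetical helper mirroring the condition in the diff: with more ranks
// than nodes, a build that cannot use several GPUs per node runs on the CPU.
bool shouldFallBackToCpu(int numberOfRanks, int numberOfNodes,
                         bool multipleGpuPerNodeSupported)
{
    return numberOfRanks > numberOfNodes && !multipleGpuPerNodeSupported;
}
```

For example, four thread-MPI ranks would yield the string "0000", and a two-rank run on one node would be switched to the CPU only when the build reports no support for multiple GPUs per node.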