From: Berk Hess Date: Wed, 10 Jun 2015 08:07:35 +0000 (+0200) Subject: Improve pair search thread load balance X-Git-Url: http://biod.pnpi.spb.ru/gitweb/?a=commitdiff_plain;h=301edb905b2c69d52f791284f569b09084e01a30;p=alexxy%2Fgromacs.git Improve pair search thread load balance With very small systems and many OpenMP threads, especially when using GPUs, some threads can end up without pair search work. Better load balancing reduces the pair search time. Also the CPU non-bonded kernel time is slightly reduced in the extreme parallelization limit. Change-Id: Ib036ea3ba59f497eeee7afa73a71fb0e0ccd216e --- diff --git a/src/gromacs/mdlib/nbnxn_search.c b/src/gromacs/mdlib/nbnxn_search.c index 7e278a94ee..6ec71ccd2a 100644 --- a/src/gromacs/mdlib/nbnxn_search.c +++ b/src/gromacs/mdlib/nbnxn_search.c @@ -4818,11 +4818,24 @@ static int get_ci_block_size(const nbnxn_grid_t *gridi, /* Without domain decomposition * or with less than 3 blocks per task, divide in nth blocks. */ - if (!bDomDec || ci_block*3*nth > gridi->nc) + if (!bDomDec || nth*3*ci_block > gridi->nc) { ci_block = (gridi->nc + nth - 1)/nth; } + if (ci_block > 1 && (nth - 1)*ci_block >= gridi->nc) + { + /* Some threads have no work. Although reducing the block size + * does not decrease the block count on the first few threads, + * with GPUs better mixing of "upper" cells that have more empty + * clusters results in a somewhat lower max load over all threads. + * Without GPUs the regime of so few atoms per thread is less + * performance relevant, but with 8-wide SIMD the same reasoning + * applies, since the pair list uses 4 i-atom "sub-clusters". + */ + ci_block--; + } + return ci_block; }