Commit f2b9db26 introduced the z component of the thread index as a
stride in the middle j4 loop. Because this index is not a compile-time
constant but a value loaded from a special register, the change caused
a performance loss of up to a few percent in the force kernels. This
went unnoticed because some architectures (cc 3.5/5.2) and some
compilers (CUDA 7.0) were barely affected. When NTHREAD_Z == 1, the
index is now set to the constant 0 instead of being read from
threadIdx.z.
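
For illustration only, here is a minimal host-side sketch of the idea, not the actual kernel code. It assumes a j4-style loop that starts at the z index and advances by NTHREAD_Z; that loop shape, the names ThreadIdxStandIn, zIndex and countJ4Iterations, and the default NTHREAD_Z value are all hypothetical. The struct stands in for CUDA's built-in threadIdx so the snippet builds with a plain C++ compiler.

```cpp
// Hypothetical default; in the kernels NTHREAD_Z is set by the build.
#ifndef NTHREAD_Z
#define NTHREAD_Z 1
#endif

// Host-side stand-in for CUDA's built-in threadIdx (assumption, so that
// the sketch compiles without nvcc).
struct ThreadIdxStandIn
{
    unsigned int x, y, z;
};

// Returns the z thread index. When the block is one thread deep, the
// value is the literal 0, so the compiler can fold it into the loop
// bounds instead of emitting a read of the runtime value.
inline unsigned int zIndex(const ThreadIdxStandIn& threadIdx)
{
#if NTHREAD_Z == 1
    (void)threadIdx; // unused in this configuration
    return 0;
#else
    return threadIdx.z;
#endif
}

// Hypothetical j4-style loop: counts the iterations that one thread
// executes over the half-open range [j4Begin, j4End).
inline int countJ4Iterations(const ThreadIdxStandIn& threadIdx, int j4Begin, int j4End)
{
    int count = 0;
    for (int j4 = j4Begin + static_cast<int>(zIndex(threadIdx)); j4 < j4End; j4 += NTHREAD_Z)
    {
        ++count;
    }
    return count;
}
```

With NTHREAD_Z == 1 the loop start no longer depends on a runtime value, which is presumably what lets the compiler generate the same code as before f2b9db26.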
Change-Id: I423790e8fb01a35f7234d26ff064dcc555e73c48
unsigned int tidxi = threadIdx.x;
unsigned int tidxj = threadIdx.y;
unsigned int tidx = threadIdx.y * blockDim.x + threadIdx.x;
+#if NTHREAD_Z == 1
+ unsigned int tidxz = 0;
+#else
unsigned int tidxz = threadIdx.z;
+#endif
unsigned int bidx = blockIdx.x;
unsigned int widx = tidx / WARP_SIZE; /* warp index */