Fix AMD OpenCL float3 array optimization bug
Because a float3 occupies 16 bytes per the OpenCL spec, when it is used
as an array element type the allocation needs to be optimized to avoid
unnecessary register use.
The nbnxm kernels use a float3 i-force accumulator array in registers.
Starting with ROCm 2.3 the AMD OpenCL compiler regressed and lost
its ability to effectively optimize code that uses float3 register
arrays. The large number of extra registers used limits the kernel
occupancy and significantly impacts performance.
Only the AMD platform is affected; other vendors' compilers are able to
do the necessary transformations to avoid the extra register use.
This change converts the float3 array to an array of float[3], saving
8*4 bytes of register space. This improves nonbonded kernel performance
on an AMD Vega GPU by 25% and 40% for the most common flavor of the
Ewald and RF force-only kernels, respectively.
Note that eliminating the remaining non-array uses of float3 has no
significant impact on performance.
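
The 8*4 bytes figure can be checked with a small host-side C model. This is
only a sketch, not the kernel code: model_float3, NUM_I_ENTRIES and the
function names are hypothetical, the entry count of 8 is an assumption taken
from the "8*4" figure, and the struct merely imitates the 16-byte size that
OpenCL gives float3.

```c
#include <assert.h> /* static_assert (C11) */
#include <stddef.h> /* size_t */

/* ASSUMPTION: eight accumulator entries, chosen to match the "8*4 bytes"
 * figure; the name is hypothetical, not taken from the kernel source. */
#define NUM_I_ENTRIES 8

/* Host-side stand-in for OpenCL's float3: three used components plus one
 * padding float, giving the 16-byte size the OpenCL spec assigns to it. */
typedef struct {
    float x, y, z;
    float pad; /* unused fourth slot */
} model_float3;

static_assert(sizeof(model_float3) == 16, "model_float3 must be 16 bytes");

/* Bytes used by the original layout: an array of padded float3 values. */
size_t float3_array_bytes(void)
{
    return sizeof(model_float3[NUM_I_ENTRIES]);
}

/* Bytes used by the new layout: an array of float[3] without padding. */
size_t float_array_bytes(void)
{
    return sizeof(float[NUM_I_ENTRIES][3]);
}

/* Difference between the two layouts: one padding float per entry. */
size_t bytes_saved(void)
{
    return float3_array_bytes() - float_array_bytes();
}
```

With 4-byte floats the two layouts take 128 and 96 bytes, so bytes_saved()
evaluates to 32, i.e. one padding float for each of the 8 entries. Whether
that padding actually ends up in registers depends on the compiler, which is
the regression described above.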