Allow OCL CL_SIZE to be set to 4 for Intel
Add GMX_OCL_CLUSTER_SIZE which can be set to 4 for e.g. Intel.
The kernel should now work on any HW with at least
CL_SIZE*CL_SIZE/2 wide sub-groups (warp-sync execution).
This is 8(/32) for CL_SIZE 4(/8). Not tested for CL_SIZE other
than 4 or 8.
Fixes:
- make_fep_list_supersub was incorrect for CL_SIZE!=8.
- reduce_force_i_pow2 was incorrect for CL_SIZE<8 and 2 warps.
- i-atom preload, nbnxn_excl_t, warp-any init for CL_SIZE!=8.
- gpu_ref for CL_SIZE!=8.
Change-Id: I1114e408d28b9eb6306722c41fd6a6ccec52211b