Added CUDA LJ-PME nbnxn kernels

This change implements CUDA non-bonded kernels for LJ-PME, which was
introduced in the Verlet scheme with 99029d.

The CUDA kernels implement both geometric and Lorentz-Berthelot (LB)
combination rules (unlike the CPU SIMD kernels), mostly because, even
though PME is very slow with LB, it is still beneficial to let the user
offload the non-bondeds to a GPU and potentially increase the cut-off to
further reduce the CPU PME load.
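
For reference, the two combination rules differ only in how sigma is
combined. The following is a minimal host-side sketch, not the kernel
code from this change; the LJParams struct and the function names are
hypothetical and chosen for illustration only.

```cpp
#include <cmath>

// Hypothetical per-atom-type LJ parameters (illustration only).
struct LJParams
{
    double sigma;
    double epsilon;
};

// Geometric rule: geometric mean of both sigma and epsilon.
LJParams combineGeometric(const LJParams& a, const LJParams& b)
{
    return { std::sqrt(a.sigma * b.sigma), std::sqrt(a.epsilon * b.epsilon) };
}

// Lorentz-Berthelot rule: arithmetic mean of sigma,
// geometric mean of epsilon.
LJParams combineLorentzBerthelot(const LJParams& a, const LJParams& b)
{
    return { 0.5 * (a.sigma + b.sigma), std::sqrt(a.epsilon * b.epsilon) };
}

// Dispersion coefficient C6 = 4 * epsilon * sigma^6 of a combined pair.
double dispersionC6(const LJParams& p)
{
    return 4.0 * p.epsilon * std::pow(p.sigma, 6);
}
```

With the geometric rule the pair C6 factorizes as sqrt(C6_i * C6_j),
whereas with LB it does not, because sigma is averaged arithmetically.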

Note that, as we now have 120 kernels compiled for up to four different
target architectures, the nbnxn_cuda module takes a very long time to
build and can become the bottleneck during compilation. This will be
addressed in a later change.

Change-Id: I819b59a8948da0c8492eac6a43d4a7fb6dc98354