removed last OpenMP O(nthreads) bottleneck

The nbnxn non-bonded force buffer clear would zero the full force
buffer for every OpenMP thread, which led to O(nthreads) operations.
Now each thread only clears the blocks of the buffer it actually uses.
The buffer flagging procedure has been simplified and the reduction
has been made more efficient. With domain decomposition there is not
much improvement yet; to get one we should change the assignment of
pairs to threads.
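
A minimal sketch of the idea, not the actual nbnxn code: all names
below (thread_buf_t, BLOCK_SIZE, MAX_BLOCKS, clear_used_blocks,
reduce_used_blocks) are hypothetical, and the threads are represented
by an array of buffers rather than by real OpenMP threads. Each thread
flags and zeroes only the blocks it will write to, and the reduction
skips blocks that a thread did not flag.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE 4   /* hypothetical: buffer elements per flagged block */
#define MAX_BLOCKS 64  /* hypothetical upper bound on the number of blocks */

/* Hypothetical per-thread force buffer with one "used" flag per block. */
typedef struct {
    float        *f;                /* at least nblocks*BLOCK_SIZE elements */
    unsigned char used[MAX_BLOCKS]; /* 1 if this thread writes to the block */
} thread_buf_t;

/* Flag the blocks this thread will write to and zero only those blocks.
 * Only the small flag array is reset in full, not the force buffer. */
static void clear_used_blocks(thread_buf_t *tb, const int *blocks, int n)
{
    memset(tb->used, 0, sizeof(tb->used));
    for (int i = 0; i < n; i++) {
        int b = blocks[i];
        assert(b >= 0 && b < MAX_BLOCKS);
        if (!tb->used[b]) {
            tb->used[b] = 1;
            memset(tb->f + b * BLOCK_SIZE, 0, BLOCK_SIZE * sizeof(float));
        }
    }
}

/* Add the per-thread buffers into f_out, visiting only flagged blocks.
 * The loop over blocks is the part that could be split over threads. */
static void reduce_used_blocks(float *f_out, const thread_buf_t *tb,
                               int nthreads, int nblocks)
{
    for (int b = 0; b < nblocks; b++) {
        for (int t = 0; t < nthreads; t++) {
            if (!tb[t].used[b]) {
                continue; /* never cleared, never written: skip */
            }
            for (int i = b * BLOCK_SIZE; i < (b + 1) * BLOCK_SIZE; i++) {
                f_out[i] += tb[t].f[i];
            }
        }
    }
}
```

In this sketch the clearing cost per thread is proportional to the
number of blocks that thread touches, so the total no longer grows
with nthreads times the full buffer size.
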
Change-Id: Ib657f8eaa394afdd82086dcc59cbb8c5926f77f0