Merge "minor speed-up and code clean-up in nbnxn kernels" into release-4-6