fix minor CUDA NB kernel performance regression