add workaround for clang on AMD with FMA
The clang bug which causes incorrect code to be generated on AMD
with FMA (Piledriver and Bulldozer) is in fact caused by an assembler
bug and can be worked around by switching to an external assembler.
This commit implements such the workaround for clang 3.1 and 3.2.
The bug was fix in clang version 3.2-svn-r173176:
http://llvm.org/bugs/show_bug.cgi?id=15040
Hence, no workaround is needed for clang 3.3 with AVX_128_FMA
acceleration.
This workaround is ineffective on Mac OS because there is no suitable
external assembler, but distinguishing between vanilla and Apple clang
requires more code than what's worth adding to cover the case of
AMD-based Hackintoshes (all Mac-s are Intel-based).
It is appropriate to amend only C and not C++ compilation for
release-4-6 branch, but when merging this to master branch, reconsider
the detail of the implementation.
Refs #1099
Change-Id: I835e0f2436ee33507514d3bc3980fcc227e2e36f