Adapted clang-AMD-FMA workaround for both C and C++
Based on
34a402e7bfb0c950edd7a7a624acf48334333a2d
The clang bug which causes incorrect code to be generated on AMD
with FMA (Piledriver and Bulldozer) is in fact caused by an assembler
bug and can be worked around by switching to an external assembler.
This commit implements such the workaround for clang 3.1 and 3.2.
The bug was fix in clang version 3.2-svn-r173176:
http://llvm.org/bugs/show_bug.cgi?id=15040
Hence, no workaround is needed for clang 3.3 with AVX_128_FMA
acceleration.
This workaround is ineffective on Mac OS because there is no suitable
external assembler, but distinguishing between vanilla and Apple clang
requires more code than what's worth adding to cover the case of
AMD-based Hackintoshes (all Mac-s are Intel-based).
Fixes #1099
Change-Id: Ice7c17a792fe257c4516628030b680d3b3b1a484