core: Add ARM NEON-optimized sample conversion code
final:
* include minor style fixes and build-time changes that allow
building a single binary for NEON and non-NEON systems
v4:
* fix conversion when the sample length is < 4
v3:
* convert from intrinsics to inline assembly
v2:
* load and store data with vld1/vld1q and vst1/vst1q, respectively, to
work around alignment issues with the compiler-generated vldmia instruction
* remove redundant check for NEON flags
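
For illustration only, the short-input case from v4 can be sketched in
portable C. This is not the code from this patch: the function names, the
float-to-s16 direction, and the rounding mode are assumptions, and the
block-of-four loop merely stands in for where a NEON path would use
vld1q/vst1q. Samples are processed four at a time, and a scalar tail loop
handles the remainder, so inputs shorter than four samples never enter
the vector loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scale one float sample to s16 and saturate. */
static int16_t f32_to_s16(float v) {
    float s = v * 32768.0f;

    if (s >= 32767.0f)
        return INT16_MAX;
    if (s <= -32768.0f)
        return INT16_MIN;

    /* Round half away from zero; an actual NEON path may round differently. */
    return (int16_t) (s < 0.0f ? s - 0.5f : s + 0.5f);
}

/* Hypothetical converter: n is the number of samples, not bytes. */
static void convert_f32_to_s16(size_t n, const float *src, int16_t *dst) {
    size_t blocks = n / 4;  /* whole groups of four samples */
    size_t tail = n % 4;    /* 0..3 leftover samples */

    /* Stand-in for the vector loop; skipped entirely when n < 4. */
    for (; blocks > 0; blocks--, src += 4, dst += 4) {
        dst[0] = f32_to_s16(src[0]);
        dst[1] = f32_to_s16(src[1]);
        dst[2] = f32_to_s16(src[2]);
        dst[3] = f32_to_s16(src[3]);
    }

    /* Scalar tail: the only loop that runs for inputs shorter than four. */
    for (; tail > 0; tail--)
        *dst++ = f32_to_s16(*src++);
}
```

With this split, a three-sample input writes exactly three output
samples and leaves the rest of the destination buffer untouched.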