neon: Use the real intrinsics for divf and sqrtf
The existing implementation computed both operations via the reciprocal, without filtering out denormal inputs.
(Marking as draft because I'd like feedback on whether the previous approach could be optimized, and how folks test the Arm backend's performance.)
Fixes #62