The goal of this MR is to implement a generic SIMD backend (packet ops) for Eigen that uses clang vector extensions instead of platform-dependent intrinsics. Ideally, this should make it possible to build Eigen and achieve reasonable speed on any platform that has a recent clang compiler, without having to write any inline assembly or intrinsics.
Caveats:
* The current implementation is a proof of concept and supports vectorization for float, double, int32_t, and int64_t using fixed-size 512-bit vectors (a somewhat arbitrary choice). I have not done much to tune this for speed yet.
* For now, there is no way to enable this other than setting -DEIGEN_VECTORIZE_GENERIC on the command line.
* This only compiles with newer versions of clang. I have tested that it compiles and all tests pass with clang 19.1.7.
https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectorsCloses#2998 and #2997
See merge request libeigen/eigen!2051
Co-authored-by: Rasmus Munk Larsen <rmlarsen@google.com>
Co-authored-by: Antonio Sánchez <cantonios@google.com>
Current implementations fail to consider half-float packets, only
half-float scalars. Added specializations for packets on AVX, AVX512 and
NEON. Added tests to `special_packetmath`.
The current `special_functions` tests would fail for half and bfloat16 due to
lack of precision. The NEON tests also fail with precision issues and
due to different handling of `sqrt(inf)`, so special functions bessel, ndtri
have been disabled.
Tested with AVX, AVX512.
- Split SpecialFunctions files in to a separate BesselFunctions file.
In particular add:
- Modified bessel functions of the second kind k0, k1, k0e, k1e
- Bessel functions of the first kind j0, j1
- Bessel functions of the second kind y0, y1
The major changes are
1. Moving CUDA/PacketMath.h to GPU/PacketMath.h
2. Moving CUDA/MathFunctions.h to GPU/MathFunction.h
3. Moving CUDA/CudaSpecialFunctions.h to GPU/GpuSpecialFunctions.h
The above three changes effectively enable the Eigen "Packet" layer for the HIP platform
4. Merging the "hip_basic" and "cuda_basic" unit tests into one ("gpu_basic")
5. Updating the "EIGEN_DEVICE_FUNC" marking in some places
The change has been tested on the HIP and CUDA platforms.
In addition to igamma(a, x), this code implements:
* igamma_der_a(a, x) = d igamma(a, x) / da -- derivative of igamma with respect to the parameter
* gamma_sample_der_alpha(alpha, sample) -- reparameterization derivative of a Gamma(alpha, 1) random variable sample with respect to the alpha parameter
The derivatives are computed by forward mode differentiation of the igamma(a, x) code. Although gamma_sample_der_alpha can be implemented via igamma_der_a, a separate function is more accurate and efficient due to analytical cancellation of some terms. All three functions are implemented by a method parameterized with "mode" that always computes the derivatives, but does not return them unless required by the mode. The compiler is expected to (and, based on benchmarks, does) skip the unnecessary computations depending on the mode.
The functions are conventionally called i0e and i1e. The exponentially scaled version is more numerically stable. The standard Bessel functions can be obtained as i0(x) = exp(|x|) i0e(x)
The code is ported from Cephes and tested against SciPy.