Commit Graph

1023 Commits

Author SHA1 Message Date
Antonio Sanchez
3f4684f87d Include <cstdint> in one place, remove custom typedefs
Originating from
[this SO issue](https://stackoverflow.com/questions/65901014/how-to-solve-this-all-error-2-in-this-case),
some win32 compilers define `__int32` as a `long`, but MinGW defines
`std::int32_t` as an `int`, leading to a type conflict.

To avoid this, we remove the custom `typedef` definitions for win32.  The
Tensor module requires C++11 anyways, so we are guaranteed to have
included `<cstdint>` already in `Eigen/Core`.

Also re-arranged the headers to only include `<cstdint>` in one place to
avoid this type of error again.
2021-01-26 14:23:05 -08:00
Chip Kerchner
0784d9f87b Fix sqrt, ldexp and frexp compilation errors. 2021-01-25 15:22:19 -06:00
Antonio Sanchez
f0e46ed5d4 Fix pow and other cwise ops for half/bfloat16.
The new `generic_pow` implementation was failing for half/bfloat16 since
their construction from int/float is not `constexpr`. Modified
in `GenericPacketMathFunctions` to remove `constexpr`.

While adding tests for half/bfloat16, found other issues related to
implicit conversions.

Also needed to implement `numext::arg` for non-integer, non-complex,
non-float/double/long double types.  These seem to be  implicitly
converted to `std::complex<T>`, which then fails for half/bfloat16.
2021-01-22 11:10:54 -08:00
Antonio Sanchez
f19bcffee6 Specialize std::complex operators for use on GPU device.
NVCC and older versions of clang do not fully support `std::complex` on device,
leading to either compile errors (Cannot call `__host__` function) or worse,
runtime errors (Illegal instruction).  For most functions, we can
implement specialized `numext` versions. Here we specialize the standard
operators (with the exception of stream operators and member function operators
with a scalar that are already specialized in `<complex>`) so they can be used
in device code as well.

To import these operators into the current scope, use
`EIGEN_USING_STD_COMPLEX_OPERATORS`. By default, these are imported into
the `Eigen`, `Eigen:internal`, and `Eigen::numext` namespaces.

This allow us to remove specializations of the
sum/difference/product/quotient ops, and allow us to treat complex
numbers like most other scalars (e.g. in tests).
2021-01-22 18:19:19 +00:00
David Tellenbach
65e2169c45 Add support for Arm SVE
This patch adds support for Arm's new vector extension SVE (Scalable Vector Extension). In contrast to other vector extensions that are supported by Eigen, SVE types are inherently *sizeless*. For the use in Eigen we fix their size at compile-time (note that this is not necessary in general, SVE is *length agnostic*).

During compilation the flag `-msve-vector-bits=N` has to be set where `N` is a power of two in the range of `128`to `2048`, indicating the length of an SVE vector.

Since SVE is rather young, we decided to disable it by default even if it would be available. A user has to enable it explicitly by defining `EIGEN_ARM64_USE_SVE`.

This patch introduces the packet types `PacketXf` and `PacketXi` for packets of `float` and `int32_t` respectively. The size of these packets depends on the SVE vector length. E.g. if `-msve-vector-bits=512` is set, `PacketXf` will contain `512/32 = 16` elements.

This MR is joint work with Miguel Tairum <miguel.tairum@arm.com>.
2021-01-21 21:11:57 +00:00
Antonio Sanchez
b2126fd6b5 Fix pfrexp/pldexp for half.
The recent addition of vectorized pow (!330) relies on `pfrexp` and
`pldexp`.  This was missing for `Eigen::half` and `Eigen::bfloat16`.
Adding tests for these packet ops also exposed an issue with handling
negative values in `pfrexp`, returning an incorrect exponent.

Added the missing implementations, corrected the exponent in `pfrexp1`,
and added `packetmath` tests.
2021-01-21 19:32:28 +00:00
Rasmus Munk Larsen
cdd8fdc32e Vectorize pow(x, y). This closes https://gitlab.com/libeigen/eigen/-/issues/2085, which also contains a description of the algorithm.
I ran some testing (comparing to `std::pow(double(x), double(y)))` for `x` in the set of all (positive) floats in the interval `[std::sqrt(std::numeric_limits<float>::min()), std::sqrt(std::numeric_limits<float>::max())]`, and `y` in `{2, sqrt(2), -sqrt(2)}` I get the following error statistics:

```
max_rel_error = 8.34405e-07
rms_rel_error = 2.76654e-07
```

If I widen the range to all normal float I see lower accuracy for arguments where the result is subnormal, e.g. for `y = sqrt(2)`:

```
max_rel_error = 0.666667
rms = 6.8727e-05
count = 1335165689
argmax = 2.56049e-32, 2.10195e-45 != 1.4013e-45
```

which seems reasonable, since these results are subnormals with only couple of significant bits left.
2021-01-18 13:25:16 +00:00
Antonio Sanchez
bde6741641 Improved std::complex sqrt and rsqrt.
Replaces `std::sqrt` with `complex_sqrt` for all platforms (previously
`complex_sqrt` was only used for CUDA and MSVC), and implements
custom `complex_rsqrt`.

Also introduces `numext::rsqrt` to simplify implementation, and modified
`numext::hypot` to adhere to IEEE IEC 6059 for special cases.

The `complex_sqrt` and `complex_rsqrt` implementations were found to be
significantly faster than `std::sqrt<std::complex<T>>` and
`1/numext::sqrt<std::complex<T>>`.

Benchmark file attached.
```
GCC 10, Intel Xeon, x86_64:
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
BM_Sqrt<std::complex<float>>           9.21 ns         9.21 ns     73225448
BM_StdSqrt<std::complex<float>>        17.1 ns         17.1 ns     40966545
BM_Sqrt<std::complex<double>>          8.53 ns         8.53 ns     81111062
BM_StdSqrt<std::complex<double>>       21.5 ns         21.5 ns     32757248
BM_Rsqrt<std::complex<float>>          10.3 ns         10.3 ns     68047474
BM_DivSqrt<std::complex<float>>        16.3 ns         16.3 ns     42770127
BM_Rsqrt<std::complex<double>>         11.3 ns         11.3 ns     61322028
BM_DivSqrt<std::complex<double>>       16.5 ns         16.5 ns     42200711

Clang 11, Intel Xeon, x86_64:
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
BM_Sqrt<std::complex<float>>           7.46 ns         7.45 ns     90742042
BM_StdSqrt<std::complex<float>>        16.6 ns         16.6 ns     42369878
BM_Sqrt<std::complex<double>>          8.49 ns         8.49 ns     81629030
BM_StdSqrt<std::complex<double>>       21.8 ns         21.7 ns     31809588
BM_Rsqrt<std::complex<float>>          8.39 ns         8.39 ns     82933666
BM_DivSqrt<std::complex<float>>        14.4 ns         14.4 ns     48638676
BM_Rsqrt<std::complex<double>>         9.83 ns         9.82 ns     70068956
BM_DivSqrt<std::complex<double>>       15.7 ns         15.7 ns     44487798

Clang 9, Pixel 2, aarch64:
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
BM_Sqrt<std::complex<float>>           24.2 ns         24.1 ns     28616031
BM_StdSqrt<std::complex<float>>         104 ns          103 ns      6826926
BM_Sqrt<std::complex<double>>          31.8 ns         31.8 ns     22157591
BM_StdSqrt<std::complex<double>>        128 ns          128 ns      5437375
BM_Rsqrt<std::complex<float>>          31.9 ns         31.8 ns     22384383
BM_DivSqrt<std::complex<float>>        99.2 ns         98.9 ns      7250438
BM_Rsqrt<std::complex<double>>         46.0 ns         45.8 ns     15338689
BM_DivSqrt<std::complex<double>>        119 ns          119 ns      5898944
```
2021-01-17 08:50:57 -08:00
Guoqiang QI
38ae5353ab 1)provide a better generic paddsub op implementation
2)make paddsub op support the Packet2cf/Packet4f/Packet2f in NEON
3)make paddsub op support the Packet2cf/Packet4f in SSE
2021-01-13 22:54:03 +00:00
Antonio Sanchez
587fd6ab70 Only specialize complex sqrt_impl for CUDA if not MSVC.
We already specialize `sqrt_impl` on windows due to MSVC's mishandling
of `inf` (!355).
2021-01-11 09:15:45 -08:00
Antonio Sanchez
f149e0ebc3 Fix MSVC complex sqrt and packetmath test.
MSVC incorrectly handles `inf` cases for `std::sqrt<std::complex<T>>`.
Here we replace it with a custom version (currently used on GPU).

Also fixed the `packetmath` test, which previously skipped several
corner cases since `CHECK_CWISE1` only tests the first `PacketSize`
elements.
2021-01-08 01:17:19 +00:00
Antonio Sanchez
070d303d56 Add CUDA complex sqrt.
This is to support scalar `sqrt` of complex numbers `std::complex<T>` on
device, requested by Tensorflow folks.

Technically `std::complex` is not supported by NVCC on device
(though it is by clang), so the default `sqrt(std::complex<T>)` function only
works on the host. Here we create an overload to add back the
functionality.

Also modified the CMake file to add `--relaxed-constexpr` (or
equivalent) flag for NVCC to allow calling constexpr functions from
device functions, and added support for specifying compute architecture for
NVCC (was already available for clang).
2020-12-22 23:25:23 -08:00
Rasmus Munk Larsen
05754100fe * Add iterative psqrt<double> for AVX and SSE when FMA is available. This provides a ~10% speedup.
* Write iterative sqrt explicitly in terms of pmadd. This gives up to 7% speedup for psqrt<float> with AVX & SSE with FMA.
* Remove iterative psqrt<double> for NEON, because the initial rsqrt apprimation is not accurate enough for convergence in 2 Newton-Raphson steps and with 3 steps, just calling the builtin sqrt insn is faster.

The following benchmarks were compiled with clang "-O2 -fast-math -mfma" and with and without -mavx.

AVX+FMA (float)

name                      old cpu/op  new cpu/op  delta
BM_eigen_sqrt_float/1     1.08ns ± 0%  1.09ns ± 1%    ~
BM_eigen_sqrt_float/8     2.07ns ± 0%  2.08ns ± 1%    ~
BM_eigen_sqrt_float/64    12.4ns ± 0%  12.4ns ± 1%    ~
BM_eigen_sqrt_float/512   95.7ns ± 0%  95.5ns ± 0%    ~
BM_eigen_sqrt_float/4k     776ns ± 0%   763ns ± 0%  -1.67%
BM_eigen_sqrt_float/32k   6.57µs ± 1%  6.13µs ± 0%  -6.69%
BM_eigen_sqrt_float/256k  83.7µs ± 3%  83.3µs ± 2%    ~
BM_eigen_sqrt_float/1M     335µs ± 2%   332µs ± 2%    ~

SSE+FMA (float)
name                      old cpu/op  new cpu/op  delta
BM_eigen_sqrt_float/1     1.08ns ± 0%  1.09ns ± 0%    ~
BM_eigen_sqrt_float/8     2.07ns ± 0%  2.06ns ± 0%    ~
BM_eigen_sqrt_float/64    12.4ns ± 0%  12.4ns ± 1%    ~
BM_eigen_sqrt_float/512   95.7ns ± 0%  96.3ns ± 4%    ~
BM_eigen_sqrt_float/4k     774ns ± 0%   763ns ± 0%  -1.50%
BM_eigen_sqrt_float/32k   6.58µs ± 2%  6.11µs ± 0%  -7.06%
BM_eigen_sqrt_float/256k  82.7µs ± 1%  82.6µs ± 1%    ~
BM_eigen_sqrt_float/1M     330µs ± 1%   329µs ± 2%    ~

SSE+FMA (double)
BM_eigen_sqrt_double/1      1.63ns ± 0%  1.63ns ± 0%     ~
BM_eigen_sqrt_double/8      6.51ns ± 0%  6.08ns ± 0%   -6.68%
BM_eigen_sqrt_double/64     52.1ns ± 0%  46.5ns ± 1%  -10.65%
BM_eigen_sqrt_double/512     417ns ± 0%   374ns ± 1%  -10.29%
BM_eigen_sqrt_double/4k     3.33µs ± 0%  2.97µs ± 1%  -11.00%
BM_eigen_sqrt_double/32k    26.7µs ± 0%  23.7µs ± 0%  -11.07%
BM_eigen_sqrt_double/256k    213µs ± 0%   206µs ± 1%   -3.31%
BM_eigen_sqrt_double/1M      862µs ± 0%   870µs ± 2%   +0.96%

AVX+FMA (double)
name                        old cpu/op  new cpu/op  delta
BM_eigen_sqrt_double/1      1.63ns ± 0%  1.63ns ± 0%     ~
BM_eigen_sqrt_double/8      6.51ns ± 0%  6.06ns ± 0%   -6.95%
BM_eigen_sqrt_double/64     52.1ns ± 0%  46.5ns ± 1%  -10.80%
BM_eigen_sqrt_double/512     417ns ± 0%   373ns ± 1%  -10.59%
BM_eigen_sqrt_double/4k     3.33µs ± 0%  2.97µs ± 1%  -10.79%
BM_eigen_sqrt_double/32k    26.7µs ± 0%  23.8µs ± 0%  -10.94%
BM_eigen_sqrt_double/256k    214µs ± 0%   208µs ± 2%   -2.76%
BM_eigen_sqrt_double/1M      866µs ± 3%   923µs ± 7%     ~
2020-12-16 18:16:11 +00:00
Rasmus Munk Larsen
6cee8d347e Add an additional step of Newton-Raphson for psqrt<double> on Arm, which otherwise has an error of ~1000 ulps. 2020-12-15 04:06:41 +00:00
David Tellenbach
751f18f2c0 Remove comma at the end of enumeration list to silence C++03 warnings 2020-12-13 18:11:02 +01:00
Antonio Sanchez
5dc2fbabee Fix implicit cast to double.
Triggers `-Wimplicit-float-conversion`, causing a bunch of build errors
in Google due to `-Wall`.
2020-12-12 09:26:20 -08:00
Antonio Sanchez
55967f87d1 Fix NEON pmax<PropagateNumbers,Packet4bf>.
Simple typo, the max impl called pmin instead of pmax for floats.
2020-12-11 21:50:52 -08:00
Antonio Sanchez
839aa505c3 Fix typo in AVX512 packet math. 2020-12-11 21:35:44 -08:00
David Tellenbach
536c8a79f2 Remove unused macro in Half.h 2020-12-12 00:53:26 +01:00
Antonio Sanchez
8c9976d7f0 Fix more SSE/AVX packet conversions for peven.
MSVC doesn't like function-style casts and forces us to use intrinsics.
2020-12-11 15:46:42 -08:00
Antonio Sanchez
c6efc4e0ba Replace M_LOG2E and M_LN2 with custom macros.
For these to exist we would need to define `_USE_MATH_DEFINES` before
`cmath` or `math.h` is first included.  However, we don't
control the include order for projects outside Eigen, so even defining
the macro in `Eigen/Core` does not fix the issue for projects that
end up including `<cmath>` before Eigen does (explicitly or transitively).

To fix this, we define `EIGEN_LOG2E` and `EIGEN_LN2` ourselves.
2020-12-11 14:34:31 -08:00
Antonio Sanchez
e82722a4a7 Fix MSVC SSE casts.
MSVC doesn't like __m128(__m128i) c-style casts, so packets need to be
converted using intrinsic methods.
2020-12-11 08:52:59 -08:00
Deven Desai
f3d2ea48f5 Fix for broken ROCm/HIP Support
The following commit introduced a breakage in ROCm/HIP support for Eigen.

5ec4907434 (1958e65719641efe5483abc4ce0b61806270f6f3_525_517)

```
Building HIPCC object test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o
In file included from /home/rocm-user/eigen/test/gpu_basic.cu:20:
In file included from /home/rocm-user/eigen/test/main.h:356:
In file included from /home/rocm-user/eigen/Eigen/QR:11:
In file included from /home/rocm-user/eigen/Eigen/Core:222:
/home/rocm-user/eigen/Eigen/src/Core/arch/GPU/PacketMath.h:556:10: error: use of undeclared identifier 'half2half2'; did you mean '__half2half2'?
  return half2half2(from);
         ^~~~~~~~~~
         __half2half2
/opt/rocm/hip/include/hip/hcc_detail/hip_fp16.h:547:21: note: '__half2half2' declared here
            __half2 __half2half2(__half x)
                    ^
1 error generated when compiling for gfx900.

```

The cause seems to be a copy-paster error, and the fix is trivial
2020-12-11 16:14:57 +00:00
David Tellenbach
c7eb3a74cb Don't guard psqrt for std::complex<float> with EIGEN_ARCH_ARM64 2020-12-11 12:41:52 +01:00
Everton Constantino
bccf055a7c Add Armv8 guard on PropagateNumbers implementation. 2020-12-10 22:01:55 -03:00
David Tellenbach
00be0a7ff3 Fix vectorization of complex sqrt on NEON 2020-12-10 15:23:23 +00:00
David Tellenbach
8eb461a431 Remove comma at end of enumerator list in NEON PacketMath 2020-12-10 15:22:55 +01:00
Rasmus Munk Larsen
125cc9a5df Implement vectorized complex square root.
Closes #1905

Measured speedup for sqrt of `complex<float>` on Skylake:

SSE:
```
name                      old time/op             new time/op  delta
BM_eigen_sqrt_ctype/1     49.4ns ± 0%             54.3ns ± 0%  +10.01%
BM_eigen_sqrt_ctype/8      332ns ± 0%               50ns ± 1%  -84.97%
BM_eigen_sqrt_ctype/64    2.81µs ± 1%             0.38µs ± 0%  -86.49%
BM_eigen_sqrt_ctype/512   23.8µs ± 0%              3.0µs ± 0%  -87.32%
BM_eigen_sqrt_ctype/4k     202µs ± 0%               24µs ± 2%  -88.03%
BM_eigen_sqrt_ctype/32k   1.63ms ± 0%             0.19ms ± 0%  -88.18%
BM_eigen_sqrt_ctype/256k  13.0ms ± 0%              1.5ms ± 1%  -88.20%
BM_eigen_sqrt_ctype/1M    52.1ms ± 0%              6.2ms ± 0%  -88.18%
```

AVX2:
```
name                      old cpu/op  new cpu/op  delta
BM_eigen_sqrt_ctype/1     53.6ns ± 0%  55.6ns ± 0%   +3.71%
BM_eigen_sqrt_ctype/8      334ns ± 0%    27ns ± 0%  -91.86%
BM_eigen_sqrt_ctype/64    2.79µs ± 0%  0.22µs ± 2%  -92.28%
BM_eigen_sqrt_ctype/512   23.8µs ± 1%   1.7µs ± 1%  -92.81%
BM_eigen_sqrt_ctype/4k     201µs ± 0%    14µs ± 1%  -93.24%
BM_eigen_sqrt_ctype/32k   1.62ms ± 0%  0.11ms ± 1%  -93.29%
BM_eigen_sqrt_ctype/256k  13.0ms ± 0%   0.9ms ± 1%  -93.31%
BM_eigen_sqrt_ctype/1M    52.0ms ± 0%   3.5ms ± 1%  -93.31%
```

AVX512:
```
name                      old cpu/op  new cpu/op  delta
BM_eigen_sqrt_ctype/1     53.7ns ± 0%  56.2ns ± 1%   +4.75%
BM_eigen_sqrt_ctype/8      334ns ± 0%    18ns ± 2%  -94.63%
BM_eigen_sqrt_ctype/64    2.79µs ± 0%  0.12µs ± 1%  -95.54%
BM_eigen_sqrt_ctype/512   23.9µs ± 1%   1.0µs ± 1%  -95.89%
BM_eigen_sqrt_ctype/4k     202µs ± 0%     8µs ± 1%  -96.13%
BM_eigen_sqrt_ctype/32k   1.63ms ± 0%  0.06ms ± 1%  -96.15%
BM_eigen_sqrt_ctype/256k  13.0ms ± 0%   0.5ms ± 4%  -96.11%
BM_eigen_sqrt_ctype/1M    52.1ms ± 0%   2.0ms ± 1%  -96.13%
```
2020-12-08 18:13:35 -08:00
Antonio Sanchez
8cfe0db108 Fix host/device calls for __half.
The previous code had `__host__ __device__` functions calling `__device__`
functions (e.g. `__low2half`) which caused build failures in tensorflow.
Also tried to simplify the `#ifdef` guards to make them more clear.
2020-12-08 20:31:02 +00:00
Everton Constantino
baf9d762b7 - Enabling PropagateNaN and PropagateNumbers for NEON.
- Adding propagate tests to bfloat16.
2020-12-08 17:05:05 +00:00
Antonio Sanchez
5ec4907434 Clean up #ifs in GPU PacketPath.
Removed redundant checks and redundant code for CUDA/HIP.

Note: there are several issues here of calling `__device__` functions
from `__host__ __device__` functions, in particular `__low2half`.
We do not address that here -- only modifying this file enough
to get our current tests to compile.

Fixed: #1847
2020-12-04 16:14:03 -08:00
Rasmus Munk Larsen
f9fac1d5b0 Add log2() to Eigen. 2020-12-04 21:45:09 +00:00
Antonio Sanchez
e2f21465fe Special function implementations for half/bfloat16 packets.
Current implementations fail to consider half-float packets, only
half-float scalars.  Added specializations for packets on AVX, AVX512 and
NEON.  Added tests to `special_packetmath`.

The current `special_functions` tests would fail for half and bfloat16 due to
lack of precision. The NEON tests also fail with precision issues and
due to different handling of `sqrt(inf)`, so special functions bessel, ndtri
have been disabled.

Tested with AVX, AVX512.
2020-12-04 10:16:29 -08:00
Antonio Sanchez
9ee9ac81de Fix shfl* macros for CUDA/HIP
The `shfl*` functions are `__device__` only, and adjusted `#ifdef`s so
they are defined whenever the corresponding CUDA/HIP ones are.

Also changed the HIP/CUDA<9.0 versions to cast to int instead of
doing the conversion `half`<->`float`.

Fixes #2083
2020-12-04 17:18:32 +00:00
Rasmus Munk Larsen
f23dc5b971 Revert "Add log2() operator to Eigen"
This reverts commit 4d91519a9b.
2020-12-03 14:32:45 -08:00
Rasmus Munk Larsen
4d91519a9b Add log2() operator to Eigen 2020-12-03 22:31:44 +00:00
Rasmus Munk Larsen
25d8ae7465 Small cleanup of generic plog implementations:
Adding the term e*ln(2) is split into two step for no obvious reason.
This dates back to the original Cephes code from which the algorithm is adapted.
It appears that this was done in Cephes to prevent the compiler from reordering
the addition of the 3 terms in the approximation

  log(1+x) ~= x - 0.5*x^2 + x^3*P(x)/Q(x)

which must be added in reverse order since |x| < (sqrt(2)-1).

This allows rewriting the code to just 2 pmadd and 1 padd instructions,
which on a Skylake processor speeds up the code by 5-7%.
2020-12-03 19:40:40 +00:00
Antonio Sanchez
70fbcf82ed Fix typo in F32MaskToBf16Mask. 2020-12-02 07:58:34 -08:00
Antonio Sanchez
2627e2f2e6 Fix neon cmp* functions for bf16.
The current impl corrupts the comparison masks when converting
from float back to bfloat16.  The resulting masks are then
no longer all zeros or all ones, which breaks when used with
`pselect` (e.g. in `pmin<PropagateNumbers>`).  This was
causing `packetmath_15` to fail on arm.

Introducing a simple `F32MaskToBf16Mask` corrects this (takes
the lower 16-bits for each float mask).
2020-12-02 01:29:34 +00:00
Antonio Sanchez
ddd48b242c Implement CUDA __shfl* for Eigen::half
Prior to this fix, `TensorContractionGpu` and the `cxx11_tensor_of_float16_gpu`
test are broken, as well as several ops in Tensorflow. The gpu functions
`__shfl*` became ambiguous now that `Eigen::half` implicitly converts to float.
Here we add the required specializations.
2020-12-01 14:36:52 -08:00
Rasmus Munk Larsen
e57281a741 Fix a few issues for AVX512. This change enables vectorized versions of log, exp, log1p, expm1 when AVX512DQ is not available. 2020-12-01 11:31:47 -08:00
Antonio Sanchez
1992af3de2 Fix #2077, EIGEN_CONSTEXPR in Half.
`bit_cast` cannot be `constexpr`, so we need to remove `EIGEN_CONSTEXPR` from
`raw_half_as_uint16(...)`.  This shouldn't affect anything else, since
it is only used in `a bit_cast<uint16_t,half>()` which is not itself
`constexpr`.

Fixes #2077.
2020-12-01 03:10:21 +00:00
Antonio Sanchez
89f90b585d AVX512 missing ops.
This allows the `packetmath` tests to pass for AVX512 on skylake.
Made `half` and `bfloat16` consistent in terms of ops they support.

Note the `log` tests are currently disabled for `bfloat16` since
they fail due to poor precision (they were previously disabled for
`Packet8bf` via test function specialization -- I just removed that
specialization and disabled it in the generic test).
2020-11-30 16:28:57 +00:00
Andreas Krebbel
1e74f93d55 Fix some packet-functions in the IBM ZVector packet-math. 2020-11-25 14:11:23 +00:00
Rasmus Munk Larsen
79818216ed Revert "Fix Half NaN definition and test."
This reverts commit c770746d70.
2020-11-24 12:57:28 -08:00
Rasmus Munk Larsen
c770746d70 Fix Half NaN definition and test.
The `half_float` test was failing with `-mcpu=cortex-a55` (native `__fp16`) due
to a bad NaN bit-pattern comparison (in the case of casting a float to `__fp16`,
the signaling `NaN` is quieted). There was also an inconsistency between
`numeric_limits<half>::quiet_NaN()` and `NumTraits::quiet_NaN()`.  Here we
correct the inconsistency and compare NaNs according to the IEEE 754
definition.

Also modified the `bfloat16_float` test to match.

Tested with `cortex-a53` and `cortex-a55`.
2020-11-24 20:53:07 +00:00
Antonio Sanchez
22f67b5958 Fix boolean float conversion and product warnings.
This fixes some gcc warnings such as:
```
Eigen/src/Core/GenericPacketMath.h:655:63: warning: implicit conversion turns floating-point number into bool: 'typename __gnu_cxx::__enable_if<__is_integer<bool>::__value, double>::__type' (aka 'double') to 'bool' [-Wimplicit-conversion-floating-point-to-bool]
    Packet psqrt(const Packet& a) { EIGEN_USING_STD(sqrt); return sqrt(a); }
```

Details:

- Added `scalar_sqrt_op<bool>` (`-Wimplicit-conversion-floating-point-to-bool`).

- Added `scalar_square_op<bool>` and `scalar_cube_op<bool>`
specializations (`-Wint-in-bool-context`)

- Deprecated above specialized ops for bool.

- Modified `cxx11_tensor_block_eval` to specialize generator for
booleans (`-Wint-in-bool-context`) and to use `abs` instead of `square` to
avoid deprecated bool ops.
2020-11-24 20:20:36 +00:00
Antonio Sanchez
a3b300f1af Implement missing AVX half ops.
Minimal implementation of AVX `Eigen::half` ops to bring in line
with `bfloat16`.  Allows `packetmath_13` to pass.

Also adjusted `bfloat16` packet traits to match the supported set
of ops (e.g. Bessel is not actually implemented).
2020-11-24 16:46:41 +00:00
Antonio Sanchez
38abf2be42 Fix Half NaN definition and test.
The `half_float` test was failing with `-mcpu=cortex-a55` (native `__fp16`) due
to a bad NaN bit-pattern comparison (in the case of casting a float to `__fp16`,
the signaling `NaN` is quieted). There was also an inconsistency between
`numeric_limits<half>::quiet_NaN()` and `NumTraits::quiet_NaN()`.  Here we
correct the inconsistency and compare NaNs according to the IEEE 754
definition.

Also modified the `bfloat16_float` test to match.

Tested with `cortex-a53` and `cortex-a55`.
2020-11-23 14:13:59 -08:00
Antonio Sanchez
4cf01d2cf5 Update AVX half packets, disable test.
The AVX half implementation is incomplete, causing the `packetmath_13` test
to fail.  This disables the test.

Also refactored the existing AVX implementation to use `bit_cast`
instead of direct access to `.x`.
2020-11-21 09:05:10 -08:00