Commit Graph

118 Commits

Author SHA1 Message Date
Chip Kerchner
9b51dc7972 Fixed performance issues for VSX and P10 MMA in general_matrix_matrix_product 2021-02-17 17:49:23 +00:00
Antonio Sanchez
7ff0b7a980 Updated pfrexp implementation.
The original implementation fails for 0, denormals, inf, and NaN.

See #2150
2021-02-17 02:23:24 +00:00
Antonio Sanchez
4cb563a01e Fix ldexp implementations.
The previous implementations produced garbage values if the exponent did
not fit within the exponent bits.  See #2131 for a complete discussion,
and !375 for other possible implementations.

Here we implement the 4-factor version. See `pldexp_impl` in
`GenericPacketMathFunctions.h` for a full description.

The SSE `pcmp*` methods were moved down since `pcmp_le<Packet4i>`
requires `por`.

Left as a "TODO" is to delegate to a faster version if we know the
exponent does fit within the exponent bits.

Fixes #2131.
2021-02-10 22:45:41 +00:00
Antonio Sanchez
1615a27993 Fix altivec packetmath.
Allows the altivec packetmath tests to pass.  There were a few issues:
- `pstoreu` was missing MSQ on `_BIG_ENDIAN` systems
- `cmp_*` didn't properly handle conversion of bool flags (0x7FC instead
of 0xFFFF)
- `pfrexp` needed to set the `exponent` argument.

Related to !370, #2128

cc: @ChipKerchner @pdrocaldeira

Tested on `_BIG_ENDIAN` running on QEMU with VSX.  Couldn't figure out build
flags to get it to work for little endian.
2021-01-28 18:37:09 +00:00
Chip Kerchner
1414e2212c Fix clang compilation for AltiVec from previous check-in 2021-01-28 18:36:40 +00:00
Chip Kerchner
0784d9f87b Fix sqrt, ldexp and frexp compilation errors. 2021-01-25 15:22:19 -06:00
Pedro Caldeira
35d149e34c Add missing functions for Packet8bf in Altivec architecture.
Including new tests for bfloat16 Packets.
Fix prsqrt on GenericPacketMath.
2020-09-08 09:22:11 -05:00
Everton Constantino
6568856275 Changing u/int8_t to un/signed char because clang does not understand
it.

Implementing pcmp_eq to Packet8 and Packet16.
2020-09-02 17:02:15 -03:00
Chip Kerchner
e5886457c8 Change Packet8s and Packet8us to use vector commands on Power for pmadd, pmul and psub. 2020-08-28 19:27:32 +00:00
Pedro Caldeira
704798d1df Add support for Bfloat16 to use vector instructions on Altivec
architecture
2020-08-10 13:22:01 -05:00
Pedro Caldeira
2d67af2d2b Add pscatter for Packet16{u}c (int8) 2020-05-20 17:29:34 -03:00
Pedro Caldeira
5fdc179241 Altivec template functions to better code reusability 2020-05-11 21:04:51 +00:00
Rasmus Munk Larsen
225ab040e0 Remove unused packet op "palign".
Clean up a compiler warning in c++03 mode in AVX512/Complex.h.
2020-05-07 17:14:26 -07:00
Pedro Caldeira
29f0917a43 Add support to vector instructions to Packet16uc and Packet16c 2020-04-27 12:48:08 -03:00
Rasmus Munk Larsen
e80ec24357 Remove unused packet op "preduxp". 2020-04-23 18:17:14 +00:00
Pedro Caldeira
0c67b855d2 Add Packet8s and Packet8us to support signed/unsigned int16/short Altivec vector operations 2020-04-21 14:52:46 -03:00
Everton Constantino
deb93ed1bf Adhere to recommended load/store intrinsics for pp64le 2020-03-23 15:18:15 -03:00
Everton Constantino
5afdaa473a Fixing float32's pround halfway criteria to match STL's criteria. 2020-03-21 22:30:54 -05:00
Joel Holdsworth
232f904082 Add shift_left<N> and shift_right<N> coefficient-wise unary Array functions 2020-03-19 17:24:06 +00:00
Everton Constantino
5a8b97b401 Switching unpacket_traits<Packet4i> to vectorizable=true. 2020-01-13 16:08:20 -03:00
Rasmus Munk Larsen
6de5ed08d8 Add generic PacketMath implementation of the Error Function (erf). 2019-09-19 12:48:30 -07:00
João P. L. de Carvalho
5ac7984ffa Fix debug macros in p{load,store}u 2019-08-14 11:59:12 -06:00
João P. L. de Carvalho
db9147ae40 Add missing pcmp_XX methods for double/Packet2d
This actually fixes an issue in unit-test packetmath_2 with pcmp_eq when it is compiled with clang. When pcmp_eq(Packet4f,Packet4f) is used instead of pcmp_eq(Packet2d,Packet2d), the unit-test does not pass due to NaN on ref vector.
2019-08-14 10:37:39 -06:00
João P. L. de Carvalho
787f6ef025 Fix packed load/store for PowerPC's VSX
The vec_vsx_ld/vec_vsx_st builtins were wrongly used for aligned load/store. In fact, they perform unaligned memory access and, even when the address is 16-byte aligned, they are much slower (at least 2x) than their aligned counterparts.

For double/Packet2d vec_xl/vec_xst should be prefered over vec_ld/vec_st, although the latter works when casted to float/Packet4f.

Silencing some weird warning with throw but some GCC versions. Such warning are not thrown by Clang.
2019-08-09 16:02:55 -06:00
João P. L. de Carvalho
4d29aa0294 Fix offset argument of ploadu/pstoreu for Altivec
If no offset is given, them it should be zero.

Also passes full address to vec_vsx_ld/st builtins.

Removes userless _EIGEN_ALIGNED_PTR & _EIGEN_MASK_ALIGNMENT.

Removes unnecessary casts.
2019-08-09 15:59:26 -06:00
João P. L. de Carvalho
66d073c38e bug #1718: Add cast to successfully compile with clang on PowerPC
Ignoring -Wc11-extensions warnings thrown by clang at Altivec/PacketMath.h
2019-08-09 15:56:26 -06:00
Eugene Zhulenev
e9f0eb8a5e Add masked_store_available to unpacket_traits 2019-05-02 14:52:58 -07:00
Anuj Rawat
8c7a6feb8e Adding lowlevel APIs for optimized RHS packet load in TensorFlow
SpatialConvolution

Low-level APIs are added in order to optimized packet load in gemm_pack_rhs
in TensorFlow SpatialConvolution. The optimization is for scenario when a
packet is split across 2 adjacent columns. In this case we read it as two
'partial' packets and then merge these into 1. Currently this only works for
Packet16f (AVX512) and Packet8f (AVX2). We plan to add this for other
packet types (such as Packet8d) also.

This optimization shows significant speedup in SpatialConvolution with
certain parameters. Some examples are below.

Benchmark parameters are specified as:
Batch size, Input dim, Depth, Num of filters, Filter dim

Speedup numbers are specified for number of threads 1, 2, 4, 8, 16.

AVX512:

Parameters                  | Speedup (Num of threads: 1, 2, 4, 8, 16)
----------------------------|------------------------------------------
128,   24x24,  3, 64,   5x5 |2.18X, 2.13X, 1.73X, 1.64X, 1.66X
128,   24x24,  1, 64,   8x8 |2.00X, 1.98X, 1.93X, 1.91X, 1.91X
 32,   24x24,  3, 64,   5x5 |2.26X, 2.14X, 2.17X, 2.22X, 2.33X
128,   24x24,  3, 64,   3x3 |1.51X, 1.45X, 1.45X, 1.67X, 1.57X
 32,   14x14, 24, 64,   5x5 |1.21X, 1.19X, 1.16X, 1.70X, 1.17X
128, 128x128,  3, 96, 11x11 |2.17X, 2.18X, 2.19X, 2.20X, 2.18X

AVX2:

Parameters                  | Speedup (Num of threads: 1, 2, 4, 8, 16)
----------------------------|------------------------------------------
128,   24x24,  3, 64,   5x5 | 1.66X, 1.65X, 1.61X, 1.56X, 1.49X
 32,   24x24,  3, 64,   5x5 | 1.71X, 1.63X, 1.77X, 1.58X, 1.68X
128,   24x24,  1, 64,   5x5 | 1.44X, 1.40X, 1.38X, 1.37X, 1.33X
128,   24x24,  3, 64,   3x3 | 1.68X, 1.63X, 1.58X, 1.56X, 1.62X
128, 128x128,  3, 96, 11x11 | 1.36X, 1.36X, 1.37X, 1.37X, 1.37X

In the higher level benchmark cifar10, we observe a runtime improvement
of around 6% for AVX512 on Intel Skylake server (8 cores).

On lower level PackRhs micro-benchmarks specified in TensorFlow
tensorflow/core/kernels/eigen_spatial_convolutions_test.cc, we observe
the following runtime numbers:

AVX512:

Parameters                                                     | Runtime without patch (ns) | Runtime with patch (ns) | Speedup
---------------------------------------------------------------|----------------------------|-------------------------|---------
BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56)  |  41350                     | 15073                   | 2.74X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56)  |   7277                     |  7341                   | 0.99X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56)  |   8675                     |  8681                   | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56)  |  24155                     | 16079                   | 1.50X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56)  |  25052                     | 17152                   | 1.46X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) |  18269                     | 18345                   | 1.00X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) |  19468                     | 19872                   | 0.98X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432)   | 156060                     | 42432                   | 3.68X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432)   | 132701                     | 36944                   | 3.59X

AVX2:

Parameters                                                     | Runtime without patch (ns) | Runtime with patch (ns) | Speedup
---------------------------------------------------------------|----------------------------|-------------------------|---------
BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56)  | 26233                      | 12393                   | 2.12X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56)  |  6091                      |  6062                   | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56)  |  7427                      |  7408                   | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56)  | 23453                      | 20826                   | 1.13X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56)  | 23167                      | 22091                   | 1.09X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) | 23422                      | 23682                   | 0.99X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) | 23165                      | 23663                   | 0.98X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432)   | 72689                      | 44969                   | 1.62X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432)   | 61732                      | 39779                   | 1.55X

All benchmarks on Intel Skylake server with 8 cores.
2019-04-20 06:46:43 +00:00
Gael Guennebaud
eb4c6bb22d Fix conflicts and merge 2019-01-30 15:57:08 +01:00
Gael Guennebaud
250dcd1fdb bug #1652: fix position of EIGEN_ALIGN16 attributes in Neon and Altivec 2019-01-14 21:45:56 +01:00
Gael Guennebaud
47810cf5b7 Add dedicated implementations of predux_any for AVX512, NEON, and Altivec/VSE 2019-01-09 16:40:42 +01:00
Gustavo Lima Chaves
e763fcd09e Introducing "vectorized" byte on unpacket_traits structs
This is a preparation to a change on gebp_traits, where a new template
argument will be introduced to dictate the packet size, so it won't be
bound to the current/max packet size only anymore.

By having packet types defined early on gebp_traits, one has now to
act on packet types, not scalars anymore, for the enum values defined
on that class. One approach for reaching the vectorizable/size
properties one needs there could be getting the packet's scalar again
with unpacket_traits<>, then the size/Vectorizable enum entries from
packet_traits<>. It turns out guards like "#ifndef
EIGEN_VECTORIZE_AVX512" at AVX/PacketMath.h will hide smaller packet
variations of packet_traits<> for some types (and it makes sense to
keep that). In other words, one can't go back to the scalar and create
a new PacketType, as this will always lead to the maximum packet type
for the architecture.

The less costly/invasive solution for that, thus, is to add the
vectorizable info on every unpacket_traits struct as well.
2018-12-19 14:24:44 -08:00
Gael Guennebaud
c785464430 Add packet sin and cos to Altivec/VSX and NEON 2018-11-30 16:21:33 +01:00
Gael Guennebaud
b131a4db24 bug #1631: fix compilation with ARM NEON and clang, and cleanup the weird pshiftright_and_cast and pcast_and_shiftleft functions. 2018-11-27 23:45:00 +01:00
Gael Guennebaud
b5695a6008 Unify Altivec/VSX pexp(double) with default implementation 2018-11-27 13:53:05 +01:00
Gael Guennebaud
5c8406babc Unify Altivec/VSX's pexp with generic implementation 2018-11-26 16:47:13 +01:00
Gael Guennebaud
c2f35b1b47 Unify Altivec/VSX's plog with generic implementation, and enable it! 2018-11-26 15:58:11 +01:00
Rasmus Munk Larsen
bda71ad394 Fix typo in pbend for AltiVec. 2018-06-22 15:04:35 -07:00
Gael Guennebaud
13f5df9f67 Add a note on vec_min vs asm 2018-04-04 13:10:38 +02:00
Gael Guennebaud
e91e314347 bug #1494: makes pmin/pmax behave on Altivec/VSX as on x86 regading NaNs 2018-04-04 11:39:19 +02:00
luz.paz
e3912f5e63 MIsc. source and comment typos
Found using `codespell` and `grep` from downstream FreeCAD
2018-03-11 10:01:44 -04:00
Gael Guennebaud
ca79c1545a Add std:: namespace prefix to all (hopefully) instances if size_t/ptrdfiff_t 2017-01-23 22:02:53 +01:00
Gael Guennebaud
8c0e701504 bug #1360: fix sign issue with pmull on altivec 2016-12-18 22:13:19 +00:00
Konstantinos Margaritis
9f7caa7e7d minor fixes for big endian altivec/vsx 2016-07-10 07:05:10 -03:00
Konstantinos Margaritis
be107e387b fix compilation with clang 3.9, fix performance with pset1, use vector operators instead of intrinsics in some cases 2016-06-23 10:19:05 -03:00
Konstantinos Margaritis
b410d46482 mostly cleanups and modernizing code 2016-06-19 16:12:52 -03:00
Konstantinos Margaritis
8ed26120c8 bring Altivec/VSX to a better state, implement some of the missing functions 2016-04-28 14:32:42 -03:00
Doug Kwan
5c9ee73eb9 Implement plog and pexp for AltiVec. 2015-07-30 11:12:42 -07:00
Gael Guennebaud
6245591349 Fix prototype of plset and generalize linspace functor. 2015-08-07 19:27:59 +02:00
Gael Guennebaud
ce57dbd937 Let unpacket_traits<> exposes the required alignment and make use of it everywhere 2015-08-07 10:44:01 +02:00