eigen

mirror of https://gitlab.com/libeigen/eigen.git synced 2026-04-10 11:34:33 +08:00

Author	SHA1	Message	Date
Rasmus Munk Larsen	a3298b22ec	Implement vectorized versions of log1p and expm1 in Eigen using Kahan's formulas, and change the scalar implementations to properly handle infinite arguments. Depending on instruction set, significant speedups are observed for the vectorized path: log1p wall time is reduced 60-93% (2.5x - 15x speedup); expm1 wall time is reduced 0-85% (1x - 7x speedup). The scalar path is slower by 20-30% due to the extra branch needed to handle +infinity correctly. Full benchmarks measured on Intel(R) Xeon(R) Gold 6154 here: https://bitbucket.org/snippets/rmlarsen/MXBkpM	2019-08-12 13:53:28 -07:00
Rasmus Munk Larsen	988f24b730	Various fixes for packet ops. 1. Fix buggy pcmp_eq and unit test for half types. 2. Add unit test for pselect and add specializations for SSE 4.1, AVX512, and half types. 3. Get rid of FIXME: Implement faster pnegate for half by XOR'ing with a sign bit mask.	2019-06-20 11:47:49 -07:00
Eugene Zhulenev	e9f0eb8a5e	Add masked_store_available to unpacket_traits	2019-05-02 14:52:58 -07:00
Eugene Zhulenev	b4010f02f9	Add masked pstoreu to AVX and AVX512 PacketMath	2019-05-02 13:14:18 -07:00
Anuj Rawat	8c7a6feb8e	Adding lowlevel APIs for optimized RHS packet load in TensorFlow SpatialConvolution

Low-level APIs are added in order to optimize packet load in gemm_pack_rhs in TensorFlow SpatialConvolution. The optimization is for the scenario when a packet is split across 2 adjacent columns. In this case we read it as two 'partial' packets and then merge these into 1. Currently this only works for Packet16f (AVX512) and Packet8f (AVX2). We plan to add this for other packet types (such as Packet8d) also.

This optimization shows significant speedup in SpatialConvolution with certain parameters. Some examples are below.

Benchmark parameters are specified as: Batch size, Input dim, Depth, Num of filters, Filter dim.
Speedup numbers are specified for number of threads 1, 2, 4, 8, 16.

AVX512:

Parameters                  | Speedup (Num of threads: 1, 2, 4, 8, 16)
----------------------------|------------------------------------------
128, 24x24, 3, 64, 5x5      | 2.18X, 2.13X, 1.73X, 1.64X, 1.66X
128, 24x24, 1, 64, 8x8      | 2.00X, 1.98X, 1.93X, 1.91X, 1.91X
32, 24x24, 3, 64, 5x5       | 2.26X, 2.14X, 2.17X, 2.22X, 2.33X
128, 24x24, 3, 64, 3x3      | 1.51X, 1.45X, 1.45X, 1.67X, 1.57X
32, 14x14, 24, 64, 5x5      | 1.21X, 1.19X, 1.16X, 1.70X, 1.17X
128, 128x128, 3, 96, 11x11  | 2.17X, 2.18X, 2.19X, 2.20X, 2.18X

AVX2:

Parameters                  | Speedup (Num of threads: 1, 2, 4, 8, 16)
----------------------------|------------------------------------------
128, 24x24, 3, 64, 5x5      | 1.66X, 1.65X, 1.61X, 1.56X, 1.49X
32, 24x24, 3, 64, 5x5       | 1.71X, 1.63X, 1.77X, 1.58X, 1.68X
128, 24x24, 1, 64, 5x5      | 1.44X, 1.40X, 1.38X, 1.37X, 1.33X
128, 24x24, 3, 64, 3x3      | 1.68X, 1.63X, 1.58X, 1.56X, 1.62X
128, 128x128, 3, 96, 11x11  | 1.36X, 1.36X, 1.37X, 1.37X, 1.37X

In the higher level benchmark cifar10, we observe a runtime improvement of around 6% for AVX512 on Intel Skylake server (8 cores).

On lower level PackRhs micro-benchmarks specified in TensorFlow tensorflow/core/kernels/eigen_spatial_convolutions_test.cc, we observe the following runtime numbers:

AVX512:

Parameters                                                     | Runtime without patch (ns) | Runtime with patch (ns) | Speedup
---------------------------------------------------------------|----------------------------|-------------------------|---------
BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56)  | 41350                      | 15073                   | 2.74X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56)  | 7277                       | 7341                    | 0.99X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56)  | 8675                       | 8681                    | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56)  | 24155                      | 16079                   | 1.50X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56)  | 25052                      | 17152                   | 1.46X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) | 18269                      | 18345                   | 1.00X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) | 19468                      | 19872                   | 0.98X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432)   | 156060                     | 42432                   | 3.68X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432)   | 132701                     | 36944                   | 3.59X

AVX2:

Parameters                                                     | Runtime without patch (ns) | Runtime with patch (ns) | Speedup
---------------------------------------------------------------|----------------------------|-------------------------|---------
BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56)  | 26233                      | 12393                   | 2.12X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56)  | 6091                       | 6062                    | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56)  | 7427                       | 7408                    | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56)  | 23453                      | 20826                   | 1.13X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56)  | 23167                      | 22091                   | 1.09X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) | 23422                      | 23682                   | 0.99X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) | 23165                      | 23663                   | 0.98X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432)   | 72689                      | 44969                   | 1.62X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432)   | 61732                      | 39779                   | 1.55X

All benchmarks on Intel Skylake server with 8 cores.	2019-04-20 06:46:43 +00:00
Gael Guennebaud	0b25a5c431	fix alignment in ploadquad	2019-02-22 21:39:36 +01:00
Gael Guennebaud	cca6c207f4	AVX512: implement faster ploadquad<Packet16f> thus speeding up GEMM	2019-02-21 17:18:28 +01:00
Gael Guennebaud	d85ae650bf	bug #1678 : workaround MSVC compilation issues with AVX512	2019-02-15 10:24:17 +01:00
Gael Guennebaud	eb4c6bb22d	Fix conflicts and merge	2019-01-30 15:57:08 +01:00
Christoph Hertzberg	5a52e35f9a	Renaming some more `I` identifiers	2019-01-26 13:18:21 +01:00
Gael Guennebaud	61b6eb05fe	AVX512 (r)sqrt(double) was mistakenly disabled with clang and others	2019-01-14 17:28:47 +01:00
Rasmus Munk Larsen	fcfced13ed	Rename pones -> ptrue. Use _CMP_TRUE_UQ where appropriate.	2019-01-09 17:20:33 -08:00
Rasmus Munk Larsen	8f04442526	Collapsed revision * Collapsed revision * Add packet op "pones". Write pnot(a) as pxor(pones(a), a). * Collapsed revision * Simplify a bit. * Undo useless diffs. * Fix typo.	2019-01-09 16:34:23 -08:00
Rasmus Munk Larsen	cb955df9a6	Add packet op "pones". Write pnot(a) as pxor(pones(a), a).	2019-01-09 16:17:08 -08:00
Rasmus Larsen	cb3c059fa4	Merged eigen/eigen into default	2019-01-09 15:04:17 -08:00
Gael Guennebaud	47810cf5b7	Add dedicated implementations of predux_any for AVX512, NEON, and Altivec/VSX	2019-01-09 16:40:42 +01:00
Gael Guennebaud	aeec68f77b	Add missing pcmp_lt and others for AVX512	2019-01-09 15:36:41 +01:00
Rasmus Munk Larsen	055f0b73db	Add support for pcmp_eq and pnot, including for complex types.	2019-01-07 16:53:36 -08:00
Mark D Ryan	bc5dd4cafd	PR560: Fix the AVX512f only builds Commit `c53eececb0` introduced AVX512 support for complex numbers but required avx512dq to build. Commit `1d683ae2f5` fixed some but not, it would seem all, of the hard avx512dq dependencies. Build failures are still evident on Eigen and TensorFlow when compiling with just avx512f and no avx512dq using gcc 7.3. Looking at the code there does indeed seem to be a problem. Commit `c53eececb0` calls avx512dq intrinsics directly, e.g, _mm512_extractf32x8_ps and _mm512_and_ps. This commit fixes the issue by replacing the direct intrinsic calls with the various wrapper functions that are safe to use on avx512f only builds.	2019-01-03 14:33:04 +01:00
Gael Guennebaud	60d3fe9a89	One more stupid AVX 512 fix (I don't have direct access to AVX512 machines)	2018-12-24 13:05:03 +01:00
Gael Guennebaud	4aa667b510	Add EIGEN_STRONG_INLINE where required	2018-12-24 10:45:01 +01:00
Gael Guennebaud	961ff567e8	Add missing pcmp_lt_or_nan for AVX512	2018-12-23 22:13:29 +01:00
Gustavo Lima Chaves	e763fcd09e	Introducing "vectorized" byte on unpacket_traits structs This is a preparation to a change on gebp_traits, where a new template argument will be introduced to dictate the packet size, so it won't be bound to the current/max packet size only anymore. By having packet types defined early on gebp_traits, one has now to act on packet types, not scalars anymore, for the enum values defined on that class. One approach for reaching the vectorizable/size properties one needs there could be getting the packet's scalar again with unpacket_traits<>, then the size/Vectorizable enum entries from packet_traits<>. It turns out guards like "#ifndef EIGEN_VECTORIZE_AVX512" at AVX/PacketMath.h will hide smaller packet variations of packet_traits<> for some types (and it makes sense to keep that). In other words, one can't go back to the scalar and create a new PacketType, as this will always lead to the maximum packet type for the architecture. The less costly/invasive solution for that, thus, is to add the vectorizable info on every unpacket_traits struct as well.	2018-12-19 14:24:44 -08:00
Gael Guennebaud	0a7e7af6fd	Properly set the number of registers for AVX512	2018-12-11 15:33:17 +01:00
Gael Guennebaud	cbf2f4b7a0	AVX512f includes FMA but GCC does not define __FMA__ with -mavx512f only	2018-12-06 18:21:56 +01:00
Gael Guennebaud	c53eececb0	Implement AVX512 vectorization of std::complex<float/double>	2018-12-06 15:58:06 +01:00
Gael Guennebaud	69ace742be	Several improvements regarding packet-bitwise operations: - add unit tests - optimize their AVX512f implementation - add missing implementations (half, Packet4f, ...)	2018-11-30 15:56:08 +01:00
Gael Guennebaud	fa87f9d876	Add psin/pcos on AVX512 -> almost for free, at last!	2018-11-30 14:33:13 +01:00
Gael Guennebaud	f91500d303	Fix pandnot order in AVX512	2018-11-30 14:32:06 +01:00
Gael Guennebaud	43633fbaba	Fix warning with AVX512f	2018-10-11 10:13:48 +02:00
Gael Guennebaud	b3f66d29a5	Enable avx512 plog with clang	2018-10-11 10:12:21 +02:00
Gael Guennebaud	626942d9dd	fix alignment issue in ploaddup for AVX512	2018-09-28 16:57:32 +02:00
Gael Guennebaud	5a30eed17e	Fix warnings in AVX512	2018-09-20 16:58:51 +02:00
Christoph Hertzberg	ad4a08fb68	Use Intel cast intrinsics, since MSVC does not allow direct casting. Reported by David Winkler.	2018-08-24 19:04:33 +02:00
Gael Guennebaud	7134fa7a2e	Fix compilation with MSVC by reverting to char* for _mm_prefetch except for PGI (the latter being the one that has the wrong prototype).	2018-06-07 09:33:10 +02:00
Gael Guennebaud	584951ca4d	Rename predux_downto4 to be more accurate on its semantic.	2018-04-03 14:28:38 +02:00
Gael Guennebaud	6719409cd9	AVX512: add missing pinsertfirst and pinsertlast, implement pblend for Packet8d, fix compilation without AVX512DQ	2018-04-03 14:11:56 +02:00
Rasmus Munk Larsen	7b6aaa3440	Fix NaN propagation for AVX512.	2017-01-24 13:37:08 -08:00
Benoit Steiner	354baa0fb1	Avoid using horizontal adds since they're not very efficient.	2016-12-21 20:55:07 -08:00
Benoit Steiner	d7825b6707	Use native AVX512 types instead of Eigen Packets whenever possible.	2016-12-21 20:06:18 -08:00
Benoit Steiner	923acadfac	Fixed compilation errors with gcc6 when compiling the AVX512 intrinsics	2016-12-19 13:02:27 -08:00
Benoit Steiner	507b661106	Renamed predux_half into predux_downto4	2016-10-06 17:57:04 -07:00
Benoit Steiner	a498ff7df6	Fixed incorrect comment	2016-10-06 15:27:27 -07:00
Benoit Steiner	a7473d6d5a	Fixed compilation error with gcc >= 5.3	2016-10-06 14:33:22 -07:00
Benoit Steiner	5e64cea896	Silenced a compilation warning	2016-10-06 14:24:17 -07:00
Benoit Steiner	cb5cd69872	Silenced a compilation warning.	2016-10-05 18:50:53 -07:00
Benoit Steiner	9c2b6c049b	Silenced a few compilation warnings	2016-10-05 18:37:31 -07:00
Benoit Steiner	fa5a8f055a	Implemented palign_impl for AVX512	2016-04-29 13:30:13 -07:00
Benoit Steiner	ef3ac9d05a	Fixed the AVX512 packet traits	2016-04-29 13:28:36 -07:00
Benoit Steiner	d7b75e8d86	Added pdiv packet primitives for avx512	2016-04-29 13:26:47 -07:00
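
Commit a3298b22ec above refers to Kahan's formulas for log1p and expm1 and to an extra branch for infinite arguments. As a rough illustration of what those formulas look like in scalar form, here is a minimal sketch. The function names kahan_log1p and kahan_expm1 are hypothetical, chosen for this example, and the code is not taken from Eigen's sources or from its vectorized path.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper (not Eigen's API): log(1 + x) via Kahan's correction.
// The rounding error made when forming u = 1 + x is compensated by the
// factor x / (u - 1).
template <typename T>
T kahan_log1p(T x) {
  const T u = T(1) + x;
  const T log_u = std::log(u);
  // u == 1: x is too small to change 1, so log1p(x) ~= x.
  // u == log_u: only possible when both are +infinity; return x (= +inf)
  // instead of evaluating inf / inf, which would give NaN.
  return (u == T(1) || u == log_u) ? x : x * (log_u / (u - T(1)));
}

// Hypothetical helper (not Eigen's API): exp(x) - 1 via Kahan's correction.
template <typename T>
T kahan_expm1(T x) {
  const T u = std::exp(x);
  if (u == T(1)) return x;            // exp(x) rounds to 1: expm1(x) ~= x
  const T um1 = u - T(1);
  if (um1 == T(-1)) return T(-1);     // exp(x) underflowed: result is -1
  const T log_u = std::log(u);
  // u == log_u only when both are +infinity; return +inf directly.
  return (u == log_u) ? u : um1 * x / log_u;
}
```

For small arguments such as kahan_log1p(1e-10), the result should agree closely with std::log1p, whereas a naive std::log(1.0 + 1e-10) loses most of its significant digits. The two comparisons against log_u are the kind of additional branch the commit message associates with correct handling of +infinity.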

64 Commits