eigen

mirror of https://gitlab.com/libeigen/eigen.git synced 2026-04-10 11:34:33 +08:00

Author	SHA1	Message	Date
Mehdi Goli	0b24e1cb5c	[SYCL] Adding the SYCL memory model. The SYCL memory model provides : * an interface for SYCL buffers to behave as a non-dereferenceable pointer * an interface for placeholder accessor to behave like a pointer on both host and device	2019-07-01 16:02:30 +01:00
Rasmus Munk Larsen	8053eeb51e	Fix CUDA compilation error for pselect<half>.	2019-06-28 12:07:29 -07:00
Mehdi Goli	16a56b2ddd	[SYCL] This PR adds the minimum modifications to Eigen core required to run Eigen unsupported modules on devices supporting SYCL. * Adding SYCL memory model * Enabling/Disabling SYCL backend in Core * Supporting Vectorization	2019-06-27 12:25:09 +01:00
Deven Desai	ba506d5bd2	fix for a ROCm/HIP specific compile error introduced by a recent commit.	2019-06-22 00:06:05 +00:00
Rasmus Munk Larsen	c9394d7a0e	Remove extra "one" in comment.	2019-06-20 16:23:19 -07:00
Rasmus Munk Larsen	b8f8dac4eb	Update comment as suggested by tra@google.com.	2019-06-20 16:18:37 -07:00
Rasmus Munk Larsen	e5e63c2cad	Fix grammar.	2019-06-20 16:03:59 -07:00
Rasmus Munk Larsen	302a404b7e	Added comment explaining the surprising EIGEN_COMP_CLANG && !EIGEN_COMP_NVCC clause.	2019-06-20 15:59:08 -07:00
Rasmus Munk Larsen	b5237f53b1	Fix CUDA build on Mac.	2019-06-20 15:44:14 -07:00
Rasmus Munk Larsen	988f24b730	Various fixes for packet ops. 1. Fix buggy pcmp_eq and unit test for half types. 2. Add unit test for pselect and add specializations for SSE 4.1, AVX512, and half types. 3. Get rid of FIXME: Implement faster pnegate for half by XOR'ing with a sign bit mask.	2019-06-20 11:47:49 -07:00
Rasmus Munk Larsen	b08527b0c1	Clean up CUDA/NVCC version macros and their use in Eigen, and a few other CUDA build failures.	2019-05-31 15:26:06 -07:00
Deven Desai	2c38930161	fix for HIP build errors that were introduced by a commit earlier this week	2019-05-24 14:25:32 +00:00
Rasmus Munk Larsen	ab0a30e429	Make Eigen build with cuda 10 and clang.	2019-05-15 13:32:15 -07:00
Anuj Rawat	ad372084f5	Removing unused API to fix compile error in TensorFlow due to AVX512VL, AVX512BW usage	2019-05-12 14:43:10 +00:00
Eugene Zhulenev	45b40d91ca	Fix AVX512 & GCC 6.3 compilation	2019-05-07 16:44:55 -07:00
Eugene Zhulenev	e9f0eb8a5e	Add masked_store_available to unpacket_traits	2019-05-02 14:52:58 -07:00
Eugene Zhulenev	96e30e936a	Add masked pstoreu for Packet16h	2019-05-02 14:11:01 -07:00
Eugene Zhulenev	b4010f02f9	Add masked pstoreu to AVX and AVX512 PacketMath	2019-05-02 13:14:18 -07:00
Gael Guennebaud	578407f42f	Fix regression in changeset `ae33e866c7`	2019-05-02 15:45:21 +02:00
Andy May	ae33e866c7	Fix compilation with PGI version 19	2019-04-25 21:23:19 +01:00
Eugene Zhulenev	68a2a8c445	Use packet ops instead of AVX2 intrinsics	2019-04-23 11:41:02 -07:00
Anuj Rawat	8c7a6feb8e	Adding lowlevel APIs for optimized RHS packet load in TensorFlow SpatialConvolution

Low-level APIs are added in order to optimize packet load in gemm_pack_rhs in TensorFlow SpatialConvolution. The optimization is for the scenario when a packet is split across 2 adjacent columns. In this case we read it as two 'partial' packets and then merge these into 1. Currently this only works for Packet16f (AVX512) and Packet8f (AVX2). We plan to add this for other packet types (such as Packet8d) also.

This optimization shows significant speedup in SpatialConvolution with certain parameters. Some examples are below.

Benchmark parameters are specified as: Batch size, Input dim, Depth, Num of filters, Filter dim
Speedup numbers are specified for number of threads 1, 2, 4, 8, 16.

AVX512:

Parameters                 | Speedup (Num of threads: 1, 2, 4, 8, 16)
---------------------------|-----------------------------------------
128, 24x24, 3, 64, 5x5     | 2.18X, 2.13X, 1.73X, 1.64X, 1.66X
128, 24x24, 1, 64, 8x8     | 2.00X, 1.98X, 1.93X, 1.91X, 1.91X
32, 24x24, 3, 64, 5x5      | 2.26X, 2.14X, 2.17X, 2.22X, 2.33X
128, 24x24, 3, 64, 3x3     | 1.51X, 1.45X, 1.45X, 1.67X, 1.57X
32, 14x14, 24, 64, 5x5     | 1.21X, 1.19X, 1.16X, 1.70X, 1.17X
128, 128x128, 3, 96, 11x11 | 2.17X, 2.18X, 2.19X, 2.20X, 2.18X

AVX2:

Parameters                 | Speedup (Num of threads: 1, 2, 4, 8, 16)
---------------------------|-----------------------------------------
128, 24x24, 3, 64, 5x5     | 1.66X, 1.65X, 1.61X, 1.56X, 1.49X
32, 24x24, 3, 64, 5x5      | 1.71X, 1.63X, 1.77X, 1.58X, 1.68X
128, 24x24, 1, 64, 5x5     | 1.44X, 1.40X, 1.38X, 1.37X, 1.33X
128, 24x24, 3, 64, 3x3     | 1.68X, 1.63X, 1.58X, 1.56X, 1.62X
128, 128x128, 3, 96, 11x11 | 1.36X, 1.36X, 1.37X, 1.37X, 1.37X

In the higher level benchmark cifar10, we observe a runtime improvement of around 6% for AVX512 on Intel Skylake server (8 cores).

On lower level PackRhs micro-benchmarks specified in TensorFlow tensorflow/core/kernels/eigen_spatial_convolutions_test.cc, we observe the following runtime numbers:

AVX512:

Parameters                                                      | Runtime without patch (ns) | Runtime with patch (ns) | Speedup
----------------------------------------------------------------|----------------------------|-------------------------|--------
BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56)   | 41350                      | 15073                   | 2.74X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56)   | 7277                       | 7341                    | 0.99X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56)   | 8675                       | 8681                    | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56)   | 24155                      | 16079                   | 1.50X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56)   | 25052                      | 17152                   | 1.46X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56)  | 18269                      | 18345                   | 1.00X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56)  | 19468                      | 19872                   | 0.98X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432)    | 156060                     | 42432                   | 3.68X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432)    | 132701                     | 36944                   | 3.59X

AVX2:

Parameters                                                      | Runtime without patch (ns) | Runtime with patch (ns) | Speedup
----------------------------------------------------------------|----------------------------|-------------------------|--------
BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56)   | 26233                      | 12393                   | 2.12X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56)   | 6091                       | 6062                    | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56)   | 7427                       | 7408                    | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56)   | 23453                      | 20826                   | 1.13X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56)   | 23167                      | 22091                   | 1.09X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56)  | 23422                      | 23682                   | 0.99X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56)  | 23165                      | 23663                   | 0.98X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432)    | 72689                      | 44969                   | 1.62X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432)    | 61732                      | 39779                   | 1.55X

All benchmarks on Intel Skylake server with 8 cores.	2019-04-20 06:46:43 +00:00
William D. Irons	8de66719f9	Collapsed revision from PR-619 * Add support for pcmp_eq in AltiVec/Complex.h * Fixed implementation of pcmp_eq for double The new logic is based on the logic from NEON for double.	2019-03-26 18:14:49 +00:00
Gael Guennebaud	f11364290e	ICC does not support -fno-unsafe-math-optimizations	2019-03-22 09:26:24 +01:00
Deven Desai	51e399fc15	updates requested in the PR feedback. Also dropping code within #ifdef EIGEN_HAS_OLD_HIP_FP16	2019-03-19 21:45:25 +00:00
Deven Desai	2dbea5510f	Merged eigen/eigen into default	2019-03-19 16:52:38 -04:00
Rasmus Munk Larsen	8450a6d519	Clean up half packet traits and add a few more missing packet ops.	2019-03-14 15:18:06 -07:00
Rasmus Munk Larsen	77f7d4a894	Clean up PacketMathHalf.h and add a few missing logical packet ops.	2019-03-11 17:51:16 -07:00
Gael Guennebaud	656d9bc66b	Apply SSE's pmin/pmax fix for GCC <= 5 to AVX's pmin/pmax	2019-03-10 21:19:18 +01:00
Gael Guennebaud	0b25a5c431	fix alignment in ploadquad	2019-02-22 21:39:36 +01:00
Gael Guennebaud	cca6c207f4	AVX512: implement faster ploadquad<Packet16f> thus speeding up GEMM	2019-02-21 17:18:28 +01:00
Gael Guennebaud	1c09ee8541	bug #1674 : workaround clang fast-math aggressive optimizations	2019-02-22 15:48:53 +01:00
Gael Guennebaud	7e3084bb6f	Fix compilation on ARM.	2019-02-22 14:56:12 +01:00
Rasmus Munk Larsen	4d7f317102	Add a few missing packet ops: cmp_eq for NEON. pfloor for GPU.	2019-02-21 13:32:13 -08:00
Gael Guennebaud	d85ae650bf	bug #1678 : workaround MSVC compilation issues with AVX512	2019-02-15 10:24:17 +01:00
Gael Guennebaud	871e2e5339	bug #1674 : disable GCC's unsafe-math-optimizations in sin/cos vectorization (results are completely wrong otherwise)	2019-02-03 08:54:47 +01:00
Gael Guennebaud	eb4c6bb22d	Fix conflicts and merge	2019-01-30 15:57:08 +01:00
Christoph Hertzberg	5a52e35f9a	Renaming some more `I` identifiers	2019-01-26 13:18:21 +01:00
Rasmus Munk Larsen	2eccbaf3f7	Add missing logical packet ops for GPU and NEON.	2019-01-17 17:45:08 -08:00
Rasmus Munk Larsen	7401e2541d	Fix compilation error for logical packet ops with older compilers.	2019-01-16 14:43:33 -08:00
Gael Guennebaud	250dcd1fdb	bug #1652 : fix position of EIGEN_ALIGN16 attributes in Neon and Altivec	2019-01-14 21:45:56 +01:00
Gael Guennebaud	3c9e6d206d	AVX512: fix pgather/pscatter for Packet4cd and unaligned pointers	2019-01-14 17:57:28 +01:00
Gael Guennebaud	61b6eb05fe	AVX512 (r)sqrt(double) was mistakenly disabled with clang and others	2019-01-14 17:28:47 +01:00
Gael Guennebaud	4356a55a61	PR 571: Implements an accurate argument reduction algorithm for huge inputs of sin/cos and call it instead of falling back to std::sin/std::cos. This makes both the small and huge argument cases faster because: - for small inputs this removes the last pselect - for large inputs only the reduction part follows a scalar path, the rest use the same SIMD path as the small-argument case.	2019-01-14 13:54:01 +01:00
Gael Guennebaud	9005f0111f	Replace compiler's alignas/alignof extension by respective c++11 keywords when available. This also fixes a compilation issue with gcc-4.7.	2019-01-11 17:10:54 +01:00
Rasmus Munk Larsen	89c4001d6f	Fix warnings in ptrue for complex and half types.	2019-01-11 14:10:57 -08:00
Rasmus Munk Larsen	df29511ac0	Fix merge.	2019-01-11 10:36:36 -08:00
Rasmus Munk Larsen	9396ace46b	Merge.	2019-01-11 10:28:52 -08:00
Rasmus Larsen	74882471d0	Merged eigen/eigen into default	2019-01-11 10:20:55 -08:00
Mark D Ryan	3c9add6598	Remove reinterpret_cast from AVX512 complex implementation The reinterpret_casts used in ptranspose(PacketBlock<Packet8cf,4>&) ptranspose(PacketBlock<Packet8cf,8>&) don't appear to be working correctly. They're used to convert the kernel parameters to PacketBlock<Packet8d,T>& so that the complex number versions of ptranspose can be written using the existing double implementations. Unfortunately, they don't seem to work and are responsible for 9 unit test failures in the AVX512 build of tensorflow master. This commit fixes the issue by manually initialising PacketBlock<Packet8d,T> variables with the contents of the kernel parameter before calling the double version of ptranspose, and then copying the resulting values back into the kernel parameter before returning.	2019-01-11 14:02:09 +01:00


769 Commits