Commit Graph

77 Commits

Author SHA1 Message Date
Rasmus Munk Larsen
350544eb01 Clean up TensorDeviceThreadPool.h 2025-03-07 18:14:17 +00:00
Devon Loehr
b3e3b7b0ec Remove implicit this capture in lambdas 2024-08-02 20:06:35 +00:00
Tobias Wood
f38e16c193 Apply clang-format 2023-11-29 11:12:48 +00:00
Charles Schlosser
6f9ad7da61 fix Wshorten-64-to-32 warnings in div_ceil 2023-10-27 15:52:00 +00:00
Rasmus Munk Larsen
a96545777b Consolidate multiple implementations of divup/div_up/div_ceil. 2023-10-10 17:16:59 +00:00
Antonio Sánchez
6e4d5d4832 Add IWYU private pragmas to internal headers. 2023-08-21 16:25:22 +00:00
Antonio Sánchez
e5794873cb Replace assert with eigen_assert. 2022-10-04 17:11:23 +00:00
Erik Schultheis
64909b82bd static const class members turned into constexpr 2022-04-04 17:33:33 +00:00
Erik Schultheis
421cbf0866 Replace Eigen type metaprogramming with corresponding std types and make use of alias templates 2022-03-16 16:43:40 +00:00
Kolja Brix
afa616bc9e Fix some typos found 2021-09-23 15:22:00 +00:00
sciencewhiz
4b6036e276 fix various typos 2021-09-22 16:15:06 +00:00
Rasmus Munk Larsen
d7d0bf832d Issue an error in case of direct inclusion of internal headers. 2021-09-10 19:12:26 +00:00
Antonio Sanchez
1e6c6c1576 Replace memset with fill to work for non-trivial scalars.
For custom scalars, zero is not necessarily represented by
a zeroed-out memory block (e.g. gnu MPFR). We therefore
cannot rely on `memset` if we want to fill a matrix or tensor
with zeroes. Instead, we should rely on `fill`, which for trivial
types does end up getting converted to a `memset` under-the-hood
(at least with gcc/clang).

Requires adding a `fill(begin, end, v)` to `TensorDevice`.

Replaced all potentially bad instances of memset with fill.

Fixes #2245.
2021-07-08 18:34:41 +00:00
Eugene Zhulenev
bb7ccac3af Add recursive work splitting to EvalShardedByInnerDimContext 2019-12-05 14:51:49 -08:00
Eugene Zhulenev
6e40454a6e Add beta to TensorContractionKernel and make memset optional 2019-10-02 11:06:02 -07:00
Eugene Zhulenev
f35b9ab510 Fix a bug in a packed block type in TensorContractionThreadPool 2019-09-24 16:54:36 -07:00
Christoph Hertzberg
ba0736fa8e Fix (or mask away) conversion warnings introduced in 553caeb6a3
.
2019-09-23 15:58:05 +02:00
Eugene Zhulenev
bf8866b466 Fix maybe-unitialized warnings in TensorContractionThreadPool 2019-09-13 14:29:55 -07:00
Eugene Zhulenev
553caeb6a3 Use ThreadLocal container in TensorContractionThreadPool 2019-09-13 12:14:44 -07:00
Eugene Zhulenev
79c402e40e Fix shadow warnings in TensorContractionThreadPool 2019-08-30 15:38:31 -07:00
Eugene Zhulenev
f0b36fb9a4 evalSubExprsIfNeededAsync + async TensorContractionThreadPool 2019-08-30 15:13:38 -07:00
Eugene Zhulenev
071311821e Remove XSMM support from Tensor module 2019-08-19 11:44:25 -07:00
Rasmus Munk Larsen
144ca33321 Remove deprecation annotation from typedef Eigen::Index Index, as it would generate too many build warnings. 2019-04-24 08:50:07 -07:00
Rasmus Munk Larsen
039ee52125 Tweak cost model for tensor contraction when parallelizing over the inner dimension.
https://bitbucket.org/snippets/rmlarsen/MexxLo
2019-04-12 13:35:10 -07:00
Eugene Zhulenev
4e2f6de1a8 Add support for custom packed Lhs/Rhs blocks in tensor contractions 2019-04-01 11:47:31 -07:00
Eugene Zhulenev
a407e022e6 Tune tensor contraction threadpool heuristics 2019-03-05 14:19:59 -08:00
Eugene Zhulenev
59998117bb Don't do parallel_pack if we can use thread_local memory in tensor contractions 2019-02-07 09:21:25 -08:00
Eugene Zhulenev
8491127082 Do not reduce parallelism too much in contractions with small number of threads 2019-02-04 12:59:33 -08:00
Eugene Zhulenev
eb21bab769 Parallelize tensor contraction only by sharding dimension and use 'thread-local' memory for packing 2019-02-04 10:43:16 -08:00
Eugene Zhulenev
1e6d15b55b Fix shorten-64-to-32 warning in TensorContractionThreadPool 2019-01-11 11:41:53 -08:00
Eugene Zhulenev
0abe03764c Fix shorten-64-to-32 warning in TensorContractionThreadPool 2019-01-10 10:27:55 -08:00
Eugene Zhulenev
e70ffef967 Optimize evalShardedByInnerDim 2019-01-08 16:26:31 -08:00
Mark D Ryan
36f8f6d0be Fix evalShardedByInnerDim for AVX512 builds
evalShardedByInnerDim ensures that the values it passes for start_k and
end_k to evalGemmPartialWithoutOutputKernel are multiples of 8 as the kernel
does not work correctly when the values of k are not multiples of the
packet_size.  While this precaution works for AVX builds, it is insufficient
for AVX512 builds where the maximum packet size is 16.  The result is slightly
incorrect float32 contractions on AVX512 builds.

This commit fixes the problem by ensuring that k is always a multiple of
the packet_size if the packet_size is > 8.
2018-12-05 12:29:03 +01:00
Christoph Hertzberg
b786ce8c72 Fix conversion warning ... again 2018-10-02 18:35:25 +02:00
Rasmus Munk Larsen
104e8fa074 Fix a few warnings and rename a variable to not shadow "last". 2018-09-28 12:00:08 -07:00
Rasmus Munk Larsen
7c1b47840a Merged in ezhulenev/eigen-01 (pull request PR-514)
Add tests for evalShardedByInnerDim contraction + fix bugs
2018-09-28 18:37:54 +00:00
Eugene Zhulenev
524c81f3fa Add tests for evalShardedByInnerDim contraction + fix bugs 2018-09-28 11:24:08 -07:00
Christoph Hertzberg
86ba50be39 Fix integer conversion warnings 2018-09-28 19:33:39 +02:00
Eugene Zhulenev
a7a3e9f2b6 Merge with eigen/eigen default 2018-09-27 12:05:06 -07:00
Eugene Zhulenev
9f4988959f Remove explicit mkldnn support and redundant TensorContractionKernelBlocking 2018-09-27 11:49:19 -07:00
Rasmus Munk Larsen
d956204ab2 Remove "false &&" left over from test. 2018-09-26 17:03:30 -07:00
Rasmus Munk Larsen
3815aeed7a Parallelize tensor contraction over the inner dimension in cases where where one or both of the outer dimensions (m and n) are small but k is large. This speeds up individual matmul microbenchmarks by up to 85%.
Naming below is BM_Matmul_M_K_N_THREADS, measured on a 2-socket Intel Broadwell-based server.

Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_Matmul_1_80_13522_1                  387457    396013     -2.2%
BM_Matmul_1_80_13522_2                  406487    230789    +43.2%
BM_Matmul_1_80_13522_4                  395821    123211    +68.9%
BM_Matmul_1_80_13522_6                  391625     97002    +75.2%
BM_Matmul_1_80_13522_8                  408986    113828    +72.2%
BM_Matmul_1_80_13522_16                 399988     67600    +83.1%
BM_Matmul_1_80_13522_22                 411546     60044    +85.4%
BM_Matmul_1_80_13522_32                 393528     57312    +85.4%
BM_Matmul_1_80_13522_44                 390047     63525    +83.7%
BM_Matmul_1_80_13522_88                 387876     63592    +83.6%
BM_Matmul_1_1500_500_1                  245359    248119     -1.1%
BM_Matmul_1_1500_500_2                  401833    143271    +64.3%
BM_Matmul_1_1500_500_4                  210519    100231    +52.4%
BM_Matmul_1_1500_500_6                  251582     86575    +65.6%
BM_Matmul_1_1500_500_8                  211499     80444    +62.0%
BM_Matmul_3_250_512_1                    70297     68551     +2.5%
BM_Matmul_3_250_512_2                    70141     52450    +25.2%
BM_Matmul_3_250_512_4                    67872     58204    +14.2%
BM_Matmul_3_250_512_6                    71378     63340    +11.3%
BM_Matmul_3_250_512_8                    69595     41652    +40.2%
BM_Matmul_3_250_512_16                   72055     42549    +40.9%
BM_Matmul_3_250_512_22                   70158     54023    +23.0%
BM_Matmul_3_250_512_32                   71541     56042    +21.7%
BM_Matmul_3_250_512_44                   71843     57019    +20.6%
BM_Matmul_3_250_512_88                   69951     54045    +22.7%
BM_Matmul_3_1500_512_1                  369328    374284     -1.4%
BM_Matmul_3_1500_512_2                  428656    223603    +47.8%
BM_Matmul_3_1500_512_4                  205599    139508    +32.1%
BM_Matmul_3_1500_512_6                  214278    139071    +35.1%
BM_Matmul_3_1500_512_8                  184149    142338    +22.7%
BM_Matmul_3_1500_512_16                 156462    156983     -0.3%
BM_Matmul_3_1500_512_22                 163905    158259     +3.4%
BM_Matmul_3_1500_512_32                 155314    157662     -1.5%
BM_Matmul_3_1500_512_44                 235434    158657    +32.6%
BM_Matmul_3_1500_512_88                 156779    160275     -2.2%
BM_Matmul_1500_4_512_1                  363358    349528     +3.8%
BM_Matmul_1500_4_512_2                  303134    263319    +13.1%
BM_Matmul_1500_4_512_4                  176208    130086    +26.2%
BM_Matmul_1500_4_512_6                  148026    115449    +22.0%
BM_Matmul_1500_4_512_8                  131656     98421    +25.2%
BM_Matmul_1500_4_512_16                 134011     82861    +38.2%
BM_Matmul_1500_4_512_22                 134950     85685    +36.5%
BM_Matmul_1500_4_512_32                 133165     90081    +32.4%
BM_Matmul_1500_4_512_44                 133203     90644    +32.0%
BM_Matmul_1500_4_512_88                 134106    100566    +25.0%
BM_Matmul_4_1500_512_1                  439243    435058     +1.0%
BM_Matmul_4_1500_512_2                  451830    257032    +43.1%
BM_Matmul_4_1500_512_4                  276434    164513    +40.5%
BM_Matmul_4_1500_512_6                  182542    144827    +20.7%
BM_Matmul_4_1500_512_8                  179411    166256     +7.3%
BM_Matmul_4_1500_512_16                 158101    155560     +1.6%
BM_Matmul_4_1500_512_22                 152435    155448     -1.9%
BM_Matmul_4_1500_512_32                 155150    149538     +3.6%
BM_Matmul_4_1500_512_44                 193842    149777    +22.7%
BM_Matmul_4_1500_512_88                 149544    154468     -3.3%
2018-09-26 16:47:13 -07:00
Eugene Zhulenev
71cd3fbd6a Support multiple contraction kernel types in TensorContractionThreadPool 2018-09-26 11:08:47 -07:00
Gael Guennebaud
9419f506d0 Fix regression introduced by the previous fix for AVX512.
It brokes the complex-complex case on SSE.
2018-09-20 17:32:34 +02:00
Benoit Steiner
edf46bd7a2 Merged in yuefengz/eigen (pull request PR-370)
Use device's allocate function instead of internal::aligned_malloc.
2018-07-31 22:38:28 +00:00
Rasmus Munk Larsen
e478532625 Reduce the number of template specializations of classes related to tensor contraction to reduce binary size. 2018-07-27 12:36:34 -07:00
Eugene Zhulenev
6e654f3379 Reduce number of allocations in TensorContractionThreadPool. 2018-07-16 14:26:39 -07:00
Yuefeng Zhou
1eff6cf8a7 Use device's allocate function instead of internal::aligned_malloc. This would make it easier to track memory usage in device instances. 2018-02-20 16:50:05 -08:00
Eugene Zhulenev
e204ecdaaf Remove SimpleThreadPool and always use {NonBlocking}ThreadPool 2018-07-16 15:06:57 -07:00
Eugene Zhulenev
01fd4096d3 Fuse computations into the Tensor contractions using output kernel 2018-07-10 13:16:38 -07:00