This is substantially faster on ARM, where it's important to minimize the number of loads.
This is specific to the case where all packet types are of size 4. I made my best attempt to minimize how dirty this is... opinions welcome.
Eventually one could have a generic rotated kernel, but it would take some work to get there. Also, on sandy bridge, in my experience, it's not beneficial (even about 1% slower).
This is only a debugging/testing patch. It allows testing specific
product blocking sizes, typically to study the impact on performance.
Example usage:
int testk, testm, testn;
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZES
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_K testk
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_M testm
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_N testn
#include <Eigen/Core>
bug #877, bug #572: Get rid of Index conversion warnings, summary of changes:
- Introduce a global typedef Eigen::Index making Eigen::DenseIndex and AnyExpr<>::Index deprecated (default is std::ptrdiff_t).
- Eigen::Index is used throughout the API to represent indices, offsets, and sizes.
- Classes storing an array of indices uses the type StorageIndex to store them. This is a template parameter of the class. Default is int.
- Methods that *explicitly* set or return an element of such an array take or return a StorageIndex type. In all other cases, the Index type is used.
- fix usage of Index (API) versus StorageIndex (when multiple indexes are stored)
- use StorageIndex(val) when the input has already been check
- use internal::convert_index<StorageIndex>(val) when val is potentially unsafe (directly comes from user input)
because this is what they are about. "Fused" means "no intermediate rounding
between the mul and the add, only one rounding at the end". Instead,
what we are concerned about here is whether a temporary register is needed,
i.e. whether the MUL and ADD are separate instructions.
Concretely, on ARM NEON, a single-instruction mul-add is always available: VMLA.
But a true fused mul-add is only available on VFPv4: VFMA.
in both GCC and Clang on ARM/NEON, whereby they spill registers,
severely harming performance. The reason why the asm comments
make a difference is that they prevent the compiler from
reordering code across these boundaries, which has the effect
of extending the lifetime of local variables and increasing
register pressure on this register-tight code.