Add the operator interface needed for GPU iterative solvers:
- BLAS Level-1 on DeviceMatrix: dot(), norm(), squaredNorm(), setZero(),
noalias(), operator+=/-=/*= dispatching to cuBLAS axpy/scal/dot/nrm2.
- DeviceScalar<Scalar>: device-resident scalar returned by reductions.
Defers host sync until value is read (implicit conversion). Device-side
division via NPP for real types.
- GpuContext: stream-borrowing constructor, setThreadLocal(), cublasLtHandle(),
cusparseHandle().
- GEMM upgraded from cublasGemmEx to cublasLtMatmul with heuristic algorithm
selection and plan caching.
- GpuSparseContext: GpuContext& constructor for same-stream execution,
deviceView() returning DeviceSparseView with operator* for device-resident
SpMV (d_y = d_A * d_x).
- geam expressions: d_C = d_A + alpha * d_B via cublasXgeam.
- GpuSVD::matrixV() convenience wrapper.
These additions make DeviceMatrix usable as a VectorType in Eigen algorithm
templates. Conjugate gradient is the motivating example and is tested against
CPU ConjugateGradient for correctness.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add GPU sparse direct solvers (Cholesky, LDL^T, LU) via cuDSS, 1D/2D FFT
via cuFFT with plan caching, and sparse matrix-vector/matrix multiply
(SpMV/SpMM) via cuSPARSE.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add QR (geqrf + ormqr + trsm), SVD (gesvd), and self-adjoint eigenvalue
decomposition (syevd) via cuSOLVER. All support host and DeviceMatrix input.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>