Add the operator interface needed for GPU iterative solvers: - BLAS Level-1 on DeviceMatrix: dot(), norm(), squaredNorm(), setZero(), noalias(), operator+=/-=/\*= dispatching to cuBLAS axpy/scal/dot/nrm2. - DeviceScalar<Scalar>: device-resident scalar returned by reductions. Defers host sync until value is read (implicit conversion). Device-side division via NPP for real types. - GpuContext: stream-borrowing constructor, setThreadLocal(), cublasLtHandle(), cusparseHandle(). - GEMM upgraded from cublasGemmEx to cublasLtMatmul with heuristic algorithm selection and plan caching. - GpuSparseContext: GpuContext& constructor for same-stream execution, deviceView() returning DeviceSparseView with operator* for device-resident SpMV (d_y = d_A * d_x). - geam expressions: d_C = d_A + alpha * d_B via cublasXgeam. - GpuSVD::matrixV() convenience wrapper. These additions make DeviceMatrix usable as a VectorType in Eigen algorithm templates. Conjugate gradient is the motivating example and is tested against CPU ConjugateGradient for correctness. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6.3 KiB
Bundle Adjustment: GPU CG vs CPU CG Results
Benchmark of Eigen's GPU CG pipeline on normal equations arising from bundle
adjustment (BAL datasets). Compares CPU ConjugateGradient (Jacobi preconditioner)
against GPU CG using DeviceMatrix + GpuSparseContext + DeviceScalar.
Hardware
- CPU: Intel Core i7-13700HX (Raptor Lake, 12 cores / 24 threads, single thread for Eigen CG)
- GPU: NVIDIA GeForce RTX 4070 Laptop GPU (Ada Lovelace, 4608 CUDA cores, 8 GB GDDR6)
- CUDA: 13.2 / Driver 595.79
- OS: Ubuntu 24.04 (WSL2, kernel 6.6.87)
Software
- Eigen:
eigen-gpu-cgbranch - Google Benchmark 1.9.1
- Compiler: nvcc 13.2 + g++ 13.3
- Normal equations: H = J^T*J + I (Levenberg-Marquardt damping lambda=1.0)
- CG tolerance: 1e-8, max iterations: 10000
Method
For each BAL problem file:
- Parse the BAL file (cameras, 3D points, 2D observations)
- Compute the full Jacobian J using the BAL camera model (Rodrigues rotation + perspective projection + radial distortion) with central finite differences
- Form the normal equations H = J^TJ + lambdaI (sparse, symmetric positive definite)
- Solve Hdx = -J^Tr using CG with Jacobi preconditioner on CPU and GPU
- Report wall-clock time (mean of 3 repetitions)
GPU CG uses: GpuSparseContext for SpMV, DeviceMatrix for vectors,
DeviceScalar with CUBLAS_POINTER_MODE_DEVICE for dot/norm reductions,
in-place cwiseProduct via NPP for Jacobi preconditioner application,
device-pointer-mode scal to avoid host sync on the beta update.
Results
Summary table
| Dataset | Cameras | Points | Obs | H size | H nnz | CG iters | CPU CG (ms) | GPU CG (ms) | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| Ladybug-49 | 49 | 7,776 | 31,843 | 23,769 | 1.8M | 4,421 | 4,006 | 1,152 | 3.5x |
| Ladybug-138 | 138 | 19,878 | 85,217 | 60,876 | 4.8M | 7,008 | 21,498 | 3,553 | 6.1x |
| Ladybug-646 | 646 | 73,584 | 327,297 | 226,566 | 18.4M | 10,000* | 123,727 | 14,268 | 8.7x |
| Dubrovnik-356 | 356 | 226,730 | 1,255,268 | 683,394 | 69.8M | 4,308 | 216,149 | 24,493 | 8.8x |
* Hit 10,000 iteration cap (poorly conditioned problem). Both CPU and GPU hit the same cap, so timing comparison remains valid.
Profile breakdown (Ladybug-138, nsys)
GPU kernel time is dominated by SpMV (91%). The remaining 9% is BLAS-1 operations (dot, axpy, scal) and NPP element-wise ops (cwiseProduct).
| Kernel | Time (ms) | % | Calls |
|---|---|---|---|
| cuSPARSE csrmv (SpMV) | 2507 | 91.3% | 7,006 |
| cuBLAS dot | 92 | 3.4% | 21,020 |
| cuBLAS axpy (device ptr) | 27 | 1.0% | 14,012 |
| cuSPARSE partition | 19 | 0.7% | 7,006 |
| NPP cwiseProduct | 16 + 13 | 1.1% | 14,011 + 7,006 |
| cuBLAS axpy (host ptr) | 12 | 0.5% | 7,005 |
| cuBLAS scal (device ptr) | 11 | 0.4% | 7,005 |
| NPP scalar ops | 7 | 0.2% | 7,006 |
Optimizations applied
Three profiling-driven optimizations reduced GPU CG time by 1.8x (6.5s → 3.6s on Ladybug-138):
-
In-place
cwiseProduct: The Jacobi preconditioner apply (z = invdiag .* residual) was allocating a new DeviceMatrix every iteration. Addedz.cwiseProduct(ctx, a, b)that reusesz's buffer. ReducedcudaMalloccalls from 7,053 to 23 (saving 2.3s). -
squaredNormviadot(x,x): cuBLASnrm2uses a numerically careful scaled-sum-of-squares algorithm (29µs/call). Replaced withdot(x,x)(6.4µs/call) — 4.5x faster per call, saving ~320ms. -
Device-pointer
scal:p *= betawas convertingDeviceScalarbeta to host (triggering a stream sync), then calling host-pointer-mode scal. Addedoperator*=(DeviceScalar)that uses device-pointer-mode scal, eliminating one sync per iteration. HalvedcudaStreamSynchronizecalls from 14K to 7K.
Observations
-
GPU speedup scales with problem size: from 3.5x on small problems (24K variables) to 8.8x on large problems (683K variables). This is expected — larger problems have more parallelism for the GPU to exploit.
-
Iteration counts match: CPU and GPU CG converge in the same number of iterations (within 1%), confirming numerical equivalence.
-
Bottleneck is SpMV: CG iteration time is dominated (91%) by the sparse matrix-vector product on H. Further speedup requires either faster SpMV (e.g., block-sparse formats) or algorithmic improvements (Schur complement, better preconditioners).
-
Remaining overhead: CUDA API calls (cudaMemcpyAsync for 8-byte DeviceScalar transfers) account for ~50% of non-kernel time. Batching multiple scalar reductions into a single transfer would help.
-
Jacobi preconditioner is weak for BA: The Ladybug-646 problem does not converge in 10K iterations. Ceres uses block Jacobi or Schur complement preconditioners that would also benefit from GPU acceleration.
Scaling plot data
# n nnz_H cpu_ms gpu_ms speedup
23769 1793475 4006 1152 3.48
60876 4791762 21498 3553 6.05
226566 18387948 123727 14268 8.67
683394 69827066 216149 24493 8.82
BAL datasets
Downloaded from http://grail.cs.washington.edu/projects/bal/
| File | Source |
|---|---|
| problem-49-7776-pre.txt | Ladybug sequence |
| problem-138-19878-pre.txt | Ladybug sequence |
| problem-646-73584-pre.txt | Ladybug sequence |
| problem-356-226730-pre.txt | Dubrovnik reconstruction |
Reproducing
# Build
cmake -G Ninja -B build-bench-gpu -S benchmarks/GPU -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build-bench-gpu --target bench_gpu_ba
# Download BAL datasets
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-49-7776-pre.txt.bz2
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-138-19878-pre.txt.bz2
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-646-73584-pre.txt.bz2
wget http://grail.cs.washington.edu/projects/bal/data/dubrovnik/problem-356-226730-pre.txt.bz2
bunzip2 *.bz2
# Run (one at a time)
BAL_FILE=problem-49-7776-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3
BAL_FILE=problem-138-19878-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3
BAL_FILE=problem-646-73584-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3
BAL_FILE=problem-356-226730-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3