Files
eigen/benchmarks/GPU/ba_results.md
Rasmus Munk Larsen 014f12f11a GPU: Add BLAS-1 ops, DeviceScalar, device-resident SpMV, and CG interop (5/5)
Add the operator interface needed for GPU iterative solvers:

- BLAS Level-1 on DeviceMatrix: dot(), norm(), squaredNorm(), setZero(),
  noalias(), operator+=/-=/\*= dispatching to cuBLAS axpy/scal/dot/nrm2.
- DeviceScalar<Scalar>: device-resident scalar returned by reductions.
  Defers host sync until value is read (implicit conversion). Device-side
  division via NPP for real types.
- GpuContext: stream-borrowing constructor, setThreadLocal(), cublasLtHandle(),
  cusparseHandle().
- GEMM upgraded from cublasGemmEx to cublasLtMatmul with heuristic algorithm
  selection and plan caching.
- GpuSparseContext: GpuContext& constructor for same-stream execution,
  deviceView() returning DeviceSparseView with operator* for device-resident
  SpMV (d_y = d_A * d_x).
- geam expressions: d_C = d_A + alpha * d_B via cublasXgeam.
- GpuSVD::matrixV() convenience wrapper.

These additions make DeviceMatrix usable as a VectorType in Eigen algorithm
templates. Conjugate gradient is the motivating example and is tested against
CPU ConjugateGradient for correctness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 20:19:59 -07:00

6.3 KiB

Bundle Adjustment: GPU CG vs CPU CG Results

Benchmark of Eigen's GPU CG pipeline on normal equations arising from bundle adjustment (BAL datasets). Compares CPU ConjugateGradient (Jacobi preconditioner) against GPU CG using DeviceMatrix + GpuSparseContext + DeviceScalar.

Hardware

  • CPU: Intel Core i7-13700HX (Raptor Lake, 12 cores / 24 threads, single thread for Eigen CG)
  • GPU: NVIDIA GeForce RTX 4070 Laptop GPU (Ada Lovelace, 4608 CUDA cores, 8 GB GDDR6)
  • CUDA: 13.2 / Driver 595.79
  • OS: Ubuntu 24.04 (WSL2, kernel 6.6.87)

Software

  • Eigen: eigen-gpu-cg branch
  • Google Benchmark 1.9.1
  • Compiler: nvcc 13.2 + g++ 13.3
  • Normal equations: H = J^T*J + I (Levenberg-Marquardt damping lambda=1.0)
  • CG tolerance: 1e-8, max iterations: 10000

Method

For each BAL problem file:

  1. Parse the BAL file (cameras, 3D points, 2D observations)
  2. Compute the full Jacobian J using the BAL camera model (Rodrigues rotation + perspective projection + radial distortion) with central finite differences
  3. Form the normal equations H = J^TJ + lambdaI (sparse, symmetric positive definite)
  4. Solve Hdx = -J^Tr using CG with Jacobi preconditioner on CPU and GPU
  5. Report wall-clock time (mean of 3 repetitions)

GPU CG uses: GpuSparseContext for SpMV, DeviceMatrix for vectors, DeviceScalar with CUBLAS_POINTER_MODE_DEVICE for dot/norm reductions, in-place cwiseProduct via NPP for Jacobi preconditioner application, device-pointer-mode scal to avoid host sync on the beta update.

Results

Summary table

Dataset Cameras Points Obs H size H nnz CG iters CPU CG (ms) GPU CG (ms) Speedup
Ladybug-49 49 7,776 31,843 23,769 1.8M 4,421 4,006 1,152 3.5x
Ladybug-138 138 19,878 85,217 60,876 4.8M 7,008 21,498 3,553 6.1x
Ladybug-646 646 73,584 327,297 226,566 18.4M 10,000* 123,727 14,268 8.7x
Dubrovnik-356 356 226,730 1,255,268 683,394 69.8M 4,308 216,149 24,493 8.8x

* Hit 10,000 iteration cap (poorly conditioned problem). Both CPU and GPU hit the same cap, so timing comparison remains valid.

Profile breakdown (Ladybug-138, nsys)

GPU kernel time is dominated by SpMV (91%). The remaining 9% is BLAS-1 operations (dot, axpy, scal) and NPP element-wise ops (cwiseProduct).

Kernel Time (ms) % Calls
cuSPARSE csrmv (SpMV) 2507 91.3% 7,006
cuBLAS dot 92 3.4% 21,020
cuBLAS axpy (device ptr) 27 1.0% 14,012
cuSPARSE partition 19 0.7% 7,006
NPP cwiseProduct 16 + 13 1.1% 14,011 + 7,006
cuBLAS axpy (host ptr) 12 0.5% 7,005
cuBLAS scal (device ptr) 11 0.4% 7,005
NPP scalar ops 7 0.2% 7,006

Optimizations applied

Three profiling-driven optimizations reduced GPU CG time by 1.8x (6.5s → 3.6s on Ladybug-138):

  1. In-place cwiseProduct: The Jacobi preconditioner apply (z = invdiag .* residual) was allocating a new DeviceMatrix every iteration. Added z.cwiseProduct(ctx, a, b) that reuses z's buffer. Reduced cudaMalloc calls from 7,053 to 23 (saving 2.3s).

  2. squaredNorm via dot(x,x): cuBLAS nrm2 uses a numerically careful scaled-sum-of-squares algorithm (29µs/call). Replaced with dot(x,x) (6.4µs/call) — 4.5x faster per call, saving ~320ms.

  3. Device-pointer scal: p *= beta was converting DeviceScalar beta to host (triggering a stream sync), then calling host-pointer-mode scal. Added operator*=(DeviceScalar) that uses device-pointer-mode scal, eliminating one sync per iteration. Halved cudaStreamSynchronize calls from 14K to 7K.

Observations

  1. GPU speedup scales with problem size: from 3.5x on small problems (24K variables) to 8.8x on large problems (683K variables). This is expected — larger problems have more parallelism for the GPU to exploit.

  2. Iteration counts match: CPU and GPU CG converge in the same number of iterations (within 1%), confirming numerical equivalence.

  3. Bottleneck is SpMV: CG iteration time is dominated (91%) by the sparse matrix-vector product on H. Further speedup requires either faster SpMV (e.g., block-sparse formats) or algorithmic improvements (Schur complement, better preconditioners).

  4. Remaining overhead: CUDA API calls (cudaMemcpyAsync for 8-byte DeviceScalar transfers) account for ~50% of non-kernel time. Batching multiple scalar reductions into a single transfer would help.

  5. Jacobi preconditioner is weak for BA: The Ladybug-646 problem does not converge in 10K iterations. Ceres uses block Jacobi or Schur complement preconditioners that would also benefit from GPU acceleration.

Scaling plot data

# n        nnz_H       cpu_ms    gpu_ms    speedup
23769      1793475     4006      1152      3.48
60876      4791762     21498     3553      6.05
226566     18387948    123727    14268     8.67
683394     69827066    216149    24493     8.82

BAL datasets

Downloaded from http://grail.cs.washington.edu/projects/bal/

File Source
problem-49-7776-pre.txt Ladybug sequence
problem-138-19878-pre.txt Ladybug sequence
problem-646-73584-pre.txt Ladybug sequence
problem-356-226730-pre.txt Dubrovnik reconstruction

Reproducing

# Build
cmake -G Ninja -B build-bench-gpu -S benchmarks/GPU -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build-bench-gpu --target bench_gpu_ba

# Download BAL datasets
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-49-7776-pre.txt.bz2
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-138-19878-pre.txt.bz2
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-646-73584-pre.txt.bz2
wget http://grail.cs.washington.edu/projects/bal/data/dubrovnik/problem-356-226730-pre.txt.bz2
bunzip2 *.bz2

# Run (one at a time)
BAL_FILE=problem-49-7776-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3
BAL_FILE=problem-138-19878-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3
BAL_FILE=problem-646-73584-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3
BAL_FILE=problem-356-226730-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3