Files
eigen/benchmarks/GPU/ba_results.md
Rasmus Munk Larsen 014f12f11a GPU: Add BLAS-1 ops, DeviceScalar, device-resident SpMV, and CG interop (5/5)
Add the operator interface needed for GPU iterative solvers:

- BLAS Level-1 on DeviceMatrix: dot(), norm(), squaredNorm(), setZero(),
  noalias(), operator+=/-=/\*= dispatching to cuBLAS axpy/scal/dot/nrm2.
- DeviceScalar<Scalar>: device-resident scalar returned by reductions.
  Defers host sync until value is read (implicit conversion). Device-side
  division via NPP for real types.
- GpuContext: stream-borrowing constructor, setThreadLocal(), cublasLtHandle(),
  cusparseHandle().
- GEMM upgraded from cublasGemmEx to cublasLtMatmul with heuristic algorithm
  selection and plan caching.
- GpuSparseContext: GpuContext& constructor for same-stream execution,
  deviceView() returning DeviceSparseView with operator* for device-resident
  SpMV (d_y = d_A * d_x).
- geam expressions: d_C = d_A + alpha * d_B via cublasXgeam.
- GpuSVD::matrixV() convenience wrapper.

These additions make DeviceMatrix usable as a VectorType in Eigen algorithm
templates. Conjugate gradient is the motivating example and is tested against
CPU ConjugateGradient for correctness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 20:19:59 -07:00

150 lines
6.3 KiB
Markdown

# Bundle Adjustment: GPU CG vs CPU CG Results
Benchmark of Eigen's GPU CG pipeline on normal equations arising from bundle
adjustment (BAL datasets). Compares CPU `ConjugateGradient` (Jacobi preconditioner)
against GPU CG using `DeviceMatrix` + `GpuSparseContext` + `DeviceScalar`.
## Hardware
- **CPU**: Intel Core i7-13700HX (Raptor Lake, 12 cores / 24 threads, single thread for Eigen CG)
- **GPU**: NVIDIA GeForce RTX 4070 Laptop GPU (Ada Lovelace, 4608 CUDA cores, 8 GB GDDR6)
- **CUDA**: 13.2 / Driver 595.79
- **OS**: Ubuntu 24.04 (WSL2, kernel 6.6.87)
## Software
- Eigen: `eigen-gpu-cg` branch
- Google Benchmark 1.9.1
- Compiler: nvcc 13.2 + g++ 13.3
- Normal equations: H = J^T*J + I (Levenberg-Marquardt damping lambda=1.0)
- CG tolerance: 1e-8, max iterations: 10000
## Method
For each BAL problem file:
1. Parse the BAL file (cameras, 3D points, 2D observations)
2. Compute the full Jacobian J using the BAL camera model (Rodrigues rotation +
perspective projection + radial distortion) with central finite differences
3. Form the normal equations H = J^T*J + lambda*I (sparse, symmetric positive definite)
4. Solve H*dx = -J^T*r using CG with Jacobi preconditioner on CPU and GPU
5. Report wall-clock time (mean of 3 repetitions)
GPU CG uses: `GpuSparseContext` for SpMV, `DeviceMatrix` for vectors,
`DeviceScalar` with `CUBLAS_POINTER_MODE_DEVICE` for dot/norm reductions,
in-place `cwiseProduct` via NPP for Jacobi preconditioner application,
device-pointer-mode `scal` to avoid host sync on the beta update.
## Results
### Summary table
| Dataset | Cameras | Points | Obs | H size | H nnz | CG iters | CPU CG (ms) | GPU CG (ms) | Speedup |
|---------|---------|--------|-----|--------|-------|----------|-------------|-------------|---------|
| Ladybug-49 | 49 | 7,776 | 31,843 | 23,769 | 1.8M | 4,421 | 4,006 | 1,152 | **3.5x** |
| Ladybug-138 | 138 | 19,878 | 85,217 | 60,876 | 4.8M | 7,008 | 21,498 | 3,553 | **6.1x** |
| Ladybug-646 | 646 | 73,584 | 327,297 | 226,566 | 18.4M | 10,000* | 123,727 | 14,268 | **8.7x** |
| Dubrovnik-356 | 356 | 226,730 | 1,255,268 | 683,394 | 69.8M | 4,308 | 216,149 | 24,493 | **8.8x** |
\* Hit 10,000 iteration cap (poorly conditioned problem). Both CPU and GPU
hit the same cap, so timing comparison remains valid.
### Profile breakdown (Ladybug-138, nsys)
GPU kernel time is dominated by SpMV (91%). The remaining 9% is BLAS-1
operations (dot, axpy, scal) and NPP element-wise ops (cwiseProduct).
| Kernel | Time (ms) | % | Calls |
|--------|-----------|---|-------|
| cuSPARSE csrmv (SpMV) | 2507 | 91.3% | 7,006 |
| cuBLAS dot | 92 | 3.4% | 21,020 |
| cuBLAS axpy (device ptr) | 27 | 1.0% | 14,012 |
| cuSPARSE partition | 19 | 0.7% | 7,006 |
| NPP cwiseProduct | 16 + 13 | 1.1% | 14,011 + 7,006 |
| cuBLAS axpy (host ptr) | 12 | 0.5% | 7,005 |
| cuBLAS scal (device ptr) | 11 | 0.4% | 7,005 |
| NPP scalar ops | 7 | 0.2% | 7,006 |
### Optimizations applied
Three profiling-driven optimizations reduced GPU CG time by **1.8x**
(6.5s → 3.6s on Ladybug-138):
1. **In-place `cwiseProduct`**: The Jacobi preconditioner apply
(`z = invdiag .* residual`) was allocating a new DeviceMatrix every
iteration. Added `z.cwiseProduct(ctx, a, b)` that reuses `z`'s buffer.
Reduced `cudaMalloc` calls from 7,053 to 23 (saving 2.3s).
2. **`squaredNorm` via `dot(x,x)`**: cuBLAS `nrm2` uses a numerically
careful scaled-sum-of-squares algorithm (29µs/call). Replaced with
`dot(x,x)` (6.4µs/call) — 4.5x faster per call, saving ~320ms.
3. **Device-pointer `scal`**: `p *= beta` was converting `DeviceScalar`
beta to host (triggering a stream sync), then calling host-pointer-mode
scal. Added `operator*=(DeviceScalar)` that uses device-pointer-mode
scal, eliminating one sync per iteration. Halved `cudaStreamSynchronize`
calls from 14K to 7K.
### Observations
1. **GPU speedup scales with problem size**: from 3.5x on small problems
(24K variables) to 8.8x on large problems (683K variables). This is
expected — larger problems have more parallelism for the GPU to exploit.
2. **Iteration counts match**: CPU and GPU CG converge in the same number
of iterations (within 1%), confirming numerical equivalence.
3. **Bottleneck is SpMV**: CG iteration time is dominated (91%) by the
sparse matrix-vector product on H. Further speedup requires either
faster SpMV (e.g., block-sparse formats) or algorithmic improvements
(Schur complement, better preconditioners).
4. **Remaining overhead**: CUDA API calls (cudaMemcpyAsync for 8-byte
DeviceScalar transfers) account for ~50% of non-kernel time. Batching
multiple scalar reductions into a single transfer would help.
5. **Jacobi preconditioner is weak for BA**: The Ladybug-646 problem does
not converge in 10K iterations. Ceres uses block Jacobi or Schur
complement preconditioners that would also benefit from GPU acceleration.
### Scaling plot data
```
# n nnz_H cpu_ms gpu_ms speedup
23769 1793475 4006 1152 3.48
60876 4791762 21498 3553 6.05
226566 18387948 123727 14268 8.67
683394 69827066 216149 24493 8.82
```
## BAL datasets
Downloaded from http://grail.cs.washington.edu/projects/bal/
| File | Source |
|------|--------|
| problem-49-7776-pre.txt | Ladybug sequence |
| problem-138-19878-pre.txt | Ladybug sequence |
| problem-646-73584-pre.txt | Ladybug sequence |
| problem-356-226730-pre.txt | Dubrovnik reconstruction |
## Reproducing
```bash
# Build
cmake -G Ninja -B build-bench-gpu -S benchmarks/GPU -DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build-bench-gpu --target bench_gpu_ba
# Download BAL datasets
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-49-7776-pre.txt.bz2
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-138-19878-pre.txt.bz2
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-646-73584-pre.txt.bz2
wget http://grail.cs.washington.edu/projects/bal/data/dubrovnik/problem-356-226730-pre.txt.bz2
bunzip2 *.bz2
# Run (one at a time)
BAL_FILE=problem-49-7776-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3
BAL_FILE=problem-138-19878-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3
BAL_FILE=problem-646-73584-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3
BAL_FILE=problem-356-226730-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3
```