mirror of
https://gitlab.com/libeigen/eigen.git
synced 2026-04-10 11:34:33 +08:00
Add the operator interface needed for GPU iterative solvers: - BLAS Level-1 on DeviceMatrix: dot(), norm(), squaredNorm(), setZero(), noalias(), operator+=/-=/\*= dispatching to cuBLAS axpy/scal/dot/nrm2. - DeviceScalar<Scalar>: device-resident scalar returned by reductions. Defers host sync until value is read (implicit conversion). Device-side division via NPP for real types. - GpuContext: stream-borrowing constructor, setThreadLocal(), cublasLtHandle(), cusparseHandle(). - GEMM upgraded from cublasGemmEx to cublasLtMatmul with heuristic algorithm selection and plan caching. - GpuSparseContext: GpuContext& constructor for same-stream execution, deviceView() returning DeviceSparseView with operator* for device-resident SpMV (d_y = d_A * d_x). - geam expressions: d_C = d_A + alpha * d_B via cublasXgeam. - GpuSVD::matrixV() convenience wrapper. These additions make DeviceMatrix usable as a VectorType in Eigen algorithm templates. Conjugate gradient is the motivating example and is tested against CPU ConjugateGradient for correctness. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
150 lines
6.3 KiB
Markdown
150 lines
6.3 KiB
Markdown
# Bundle Adjustment: GPU CG vs CPU CG Results
|
|
|
|
Benchmark of Eigen's GPU CG pipeline on normal equations arising from bundle
|
|
adjustment (BAL datasets). Compares CPU `ConjugateGradient` (Jacobi preconditioner)
|
|
against GPU CG using `DeviceMatrix` + `GpuSparseContext` + `DeviceScalar`.
|
|
|
|
## Hardware
|
|
|
|
- **CPU**: Intel Core i7-13700HX (Raptor Lake, 12 cores / 24 threads, single thread for Eigen CG)
|
|
- **GPU**: NVIDIA GeForce RTX 4070 Laptop GPU (Ada Lovelace, 4608 CUDA cores, 8 GB GDDR6)
|
|
- **CUDA**: 13.2 / Driver 595.79
|
|
- **OS**: Ubuntu 24.04 (WSL2, kernel 6.6.87)
|
|
|
|
## Software
|
|
|
|
- Eigen: `eigen-gpu-cg` branch
|
|
- Google Benchmark 1.9.1
|
|
- Compiler: nvcc 13.2 + g++ 13.3
|
|
- Normal equations: H = J^T*J + I (Levenberg-Marquardt damping lambda=1.0)
|
|
- CG tolerance: 1e-8, max iterations: 10000
|
|
|
|
## Method
|
|
|
|
For each BAL problem file:
|
|
1. Parse the BAL file (cameras, 3D points, 2D observations)
|
|
2. Compute the full Jacobian J using the BAL camera model (Rodrigues rotation +
|
|
perspective projection + radial distortion) with central finite differences
|
|
3. Form the normal equations H = J^T*J + lambda*I (sparse, symmetric positive definite)
|
|
4. Solve H*dx = -J^T*r using CG with Jacobi preconditioner on CPU and GPU
|
|
5. Report wall-clock time (mean of 3 repetitions)
|
|
|
|
GPU CG uses: `GpuSparseContext` for SpMV, `DeviceMatrix` for vectors,
|
|
`DeviceScalar` with `CUBLAS_POINTER_MODE_DEVICE` for dot/norm reductions,
|
|
in-place `cwiseProduct` via NPP for Jacobi preconditioner application,
|
|
device-pointer-mode `scal` to avoid host sync on the beta update.
|
|
|
|
## Results
|
|
|
|
### Summary table
|
|
|
|
| Dataset | Cameras | Points | Obs | H size | H nnz | CG iters | CPU CG (ms) | GPU CG (ms) | Speedup |
|
|
|---------|---------|--------|-----|--------|-------|----------|-------------|-------------|---------|
|
|
| Ladybug-49 | 49 | 7,776 | 31,843 | 23,769 | 1.8M | 4,421 | 4,006 | 1,152 | **3.5x** |
|
|
| Ladybug-138 | 138 | 19,878 | 85,217 | 60,876 | 4.8M | 7,008 | 21,498 | 3,553 | **6.1x** |
|
|
| Ladybug-646 | 646 | 73,584 | 327,297 | 226,566 | 18.4M | 10,000* | 123,727 | 14,268 | **8.7x** |
|
|
| Dubrovnik-356 | 356 | 226,730 | 1,255,268 | 683,394 | 69.8M | 4,308 | 216,149 | 24,493 | **8.8x** |
|
|
|
|
\* Hit 10,000 iteration cap (poorly conditioned problem). Both CPU and GPU
|
|
hit the same cap, so timing comparison remains valid.
|
|
|
|
### Profile breakdown (Ladybug-138, nsys)
|
|
|
|
GPU kernel time is dominated by SpMV (91%). The remaining 9% is BLAS-1
|
|
operations (dot, axpy, scal) and NPP element-wise ops (cwiseProduct).
|
|
|
|
| Kernel | Time (ms) | % | Calls |
|
|
|--------|-----------|---|-------|
|
|
| cuSPARSE csrmv (SpMV) | 2507 | 91.3% | 7,006 |
|
|
| cuBLAS dot | 92 | 3.4% | 21,020 |
|
|
| cuBLAS axpy (device ptr) | 27 | 1.0% | 14,012 |
|
|
| cuSPARSE partition | 19 | 0.7% | 7,006 |
|
|
| NPP cwiseProduct | 16 + 13 | 1.1% | 14,011 + 7,006 |
|
|
| cuBLAS axpy (host ptr) | 12 | 0.5% | 7,005 |
|
|
| cuBLAS scal (device ptr) | 11 | 0.4% | 7,005 |
|
|
| NPP scalar ops | 7 | 0.2% | 7,006 |
|
|
|
|
### Optimizations applied
|
|
|
|
Three profiling-driven optimizations reduced GPU CG time by **1.8x**
|
|
(6.5s → 3.6s on Ladybug-138):
|
|
|
|
1. **In-place `cwiseProduct`**: The Jacobi preconditioner apply
|
|
(`z = invdiag .* residual`) was allocating a new DeviceMatrix every
|
|
iteration. Added `z.cwiseProduct(ctx, a, b)` that reuses `z`'s buffer.
|
|
Reduced `cudaMalloc` calls from 7,053 to 23 (saving 2.3s).
|
|
|
|
2. **`squaredNorm` via `dot(x,x)`**: cuBLAS `nrm2` uses a numerically
|
|
careful scaled-sum-of-squares algorithm (29µs/call). Replaced with
|
|
`dot(x,x)` (6.4µs/call) — 4.5x faster per call, saving ~320ms.
|
|
|
|
3. **Device-pointer `scal`**: `p *= beta` was converting `DeviceScalar`
|
|
beta to host (triggering a stream sync), then calling host-pointer-mode
|
|
scal. Added `operator*=(DeviceScalar)` that uses device-pointer-mode
|
|
scal, eliminating one sync per iteration. Halved `cudaStreamSynchronize`
|
|
calls from 14K to 7K.
|
|
|
|
### Observations
|
|
|
|
1. **GPU speedup scales with problem size**: from 3.5x on small problems
|
|
(24K variables) to 8.8x on large problems (683K variables). This is
|
|
expected — larger problems have more parallelism for the GPU to exploit.
|
|
|
|
2. **Iteration counts match**: CPU and GPU CG converge in the same number
|
|
of iterations (within 1%), confirming numerical equivalence.
|
|
|
|
3. **Bottleneck is SpMV**: CG iteration time is dominated (91%) by the
|
|
sparse matrix-vector product on H. Further speedup requires either
|
|
faster SpMV (e.g., block-sparse formats) or algorithmic improvements
|
|
(Schur complement, better preconditioners).
|
|
|
|
4. **Remaining overhead**: CUDA API calls (cudaMemcpyAsync for 8-byte
|
|
DeviceScalar transfers) account for ~50% of non-kernel time. Batching
|
|
multiple scalar reductions into a single transfer would help.
|
|
|
|
5. **Jacobi preconditioner is weak for BA**: The Ladybug-646 problem does
|
|
not converge in 10K iterations. Ceres uses block Jacobi or Schur
|
|
complement preconditioners that would also benefit from GPU acceleration.
|
|
|
|
### Scaling plot data
|
|
|
|
```
|
|
# n nnz_H cpu_ms gpu_ms speedup
|
|
23769 1793475 4006 1152 3.48
|
|
60876 4791762 21498 3553 6.05
|
|
226566 18387948 123727 14268 8.67
|
|
683394 69827066 216149 24493 8.82
|
|
```
|
|
|
|
## BAL datasets
|
|
|
|
Downloaded from http://grail.cs.washington.edu/projects/bal/
|
|
|
|
| File | Source |
|
|
|------|--------|
|
|
| problem-49-7776-pre.txt | Ladybug sequence |
|
|
| problem-138-19878-pre.txt | Ladybug sequence |
|
|
| problem-646-73584-pre.txt | Ladybug sequence |
|
|
| problem-356-226730-pre.txt | Dubrovnik reconstruction |
|
|
|
|
## Reproducing
|
|
|
|
```bash
|
|
# Build
|
|
cmake -G Ninja -B build-bench-gpu -S benchmarks/GPU -DCMAKE_CUDA_ARCHITECTURES=89
|
|
cmake --build build-bench-gpu --target bench_gpu_ba
|
|
|
|
# Download BAL datasets
|
|
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-49-7776-pre.txt.bz2
|
|
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-138-19878-pre.txt.bz2
|
|
wget http://grail.cs.washington.edu/projects/bal/data/ladybug/problem-646-73584-pre.txt.bz2
|
|
wget http://grail.cs.washington.edu/projects/bal/data/dubrovnik/problem-356-226730-pre.txt.bz2
|
|
bunzip2 *.bz2
|
|
|
|
# Run (one at a time)
|
|
BAL_FILE=problem-49-7776-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3
|
|
BAL_FILE=problem-138-19878-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3
|
|
BAL_FILE=problem-646-73584-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3
|
|
BAL_FILE=problem-356-226730-pre.txt ./build-bench-gpu/bench_gpu_ba --benchmark_repetitions=3
|
|
```
|