Add GPU implementation of preconditioner - v2 #31
base: main
Conversation
This update fixes two issues in the GPU preconditioner:
Hi @ZedongPeng and @jinwen-yang, I have now completed the experiments on MILPLIB and the Mittelmann LP benchmark. Below is an example figure comparing CPU vs GPU preconditioning time across all tested instances; for details of the results and plots, please see the notebook Read_Summary.ipynb.

Overall, the results show that GPU preconditioning is consistently faster than CPU preconditioning, often by several orders of magnitude across the benchmarks. The few cases in which the CPU appears marginally faster occur only when the preconditioning time on the CPU is already very small (< 0.1 sec). Consequently, because the overall solve time is typically dominated by the iterative optimization phase rather than the preconditioning step, the choice between CPU and GPU preconditioning has only a minor impact on the total solve time.
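For reference, the GPU numbers quoted here are wall-clock times around the preconditioning stage; a minimal sketch of how such a measurement can be taken with CUDA events is below. The kernel and function names are placeholders, not the actual identifiers in this PR.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the PR's preconditioning kernels (Ruiz + Pock–Chambolle +
// bound-objective scaling); a no-op placeholder so the sketch compiles.
__global__ void preconditioner_stub() {}

// Measure GPU preconditioning time in seconds with CUDA events.
float time_gpu_preconditioner()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    preconditioner_stub<<<1, 1>>>();   // real code: the actual scaling kernels
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);        // wait for all preconditioning work to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / 1000.0f;               // seconds, to match the numbers quoted above
}

int main()
{
    printf("GPU preconditioning took %.6f s\n", time_gpu_preconditioner());
    return 0;
}
```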
Solving timing observations

During the full MILPLIB and Mittelmann benchmark runs, I observed some minor numerical mismatches between the CPU and GPU runs. These differences occur only in a small subset of instances. Below is the full list of mismatches detected. In most mismatched cases, the differences are at the level of floating-point noise, and both implementations return the same status and a very similar number of iterations. However, two instances stand out with noticeably larger discrepancies between the CPU and GPU runs:

- neos-4332810-sesia
- neos-4332801-seret

All solutions satisfy the required relative duality gap tolerance.

Preconditioner timing observations

I also looked at the GPU preconditioner timing. Almost all instances have very cheap preconditioning on the GPU (Ruiz + Pock–Chambolle + bound-objective scaling are usually on the order of 1e-3 seconds in total). The only clear outliers are again neos-4332810-sesia and neos-4332801-seret. These are currently the only instances where GPU preconditioning exceeds ~2 seconds; all other instances are around the 1e-3 second level for the same stages. Most of the extra time comes specifically from the Ruiz scaling stage, which is significantly more expensive on these instances than on the rest of the benchmark set. For completeness, the CPU preconditioner times for the same two problematic instances are listed as well: as expected, CPU preconditioning is consistently slower on these instances, but the GPU times for these two cases are still unusually large.
…to preconditioner-v2
Although waipa has far more nonzeros than neos-4332810-sesia and neos-4332801-seret, its GPU rescaling is much faster, because the preconditioner's runtime is not determined solely by the number of nonzeros. The GPU implementation repeatedly launches vector kernels over arrays whose size is set by the number of rows and columns, so instances that require many scaling passes pay kernel-launch and synchronization overhead that is largely independent of nnz.
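To make the kernel-launch pattern concrete, here is a rough sketch of the per-row kernels involved in one Ruiz pass; the identifiers (row_ptr, val, E) are illustrative and not the ones used in preconditioner.cu:

```cuda
#include <math.h>

// Row infinity norms over a CSR matrix: one thread per row, output of size m.
__global__ void row_inf_norm(int m, const int *row_ptr, const double *val,
                             double *row_norm)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;
    double nrm = 0.0;
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        nrm = fmax(nrm, fabs(val[k]));
    row_norm[i] = nrm;
}

// Ruiz update of the cumulative row scaling E (size m): E[i] *= 1/sqrt(norm_i).
__global__ void update_row_scaling(int m, const double *row_norm, double *E)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;
    double s = (row_norm[i] > 0.0) ? rsqrt(row_norm[i]) : 1.0;
    E[i] *= s;
}
```

The column-side kernels are analogous over arrays of size n, so each Ruiz iteration amounts to several launches of this kind; that is where instances needing many passes can spend disproportionate time even with modest nnz.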
I fully compared the cumulative scaling factors produced by the CPU and GPU preconditioners on neos-4332801-seret:

Comparing constraint rescaling vectors...
Comparing variable rescaling vectors...

I also compared the two scalar scaling factors used in the bound-objective rescaling step. These results confirm that the CPU and GPU preconditioners produce numerically equivalent scaling. Next, I will check whether the remaining discrepancies come from the solver itself.
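For reference, the element-wise comparison can be reproduced with a small host-side helper along the lines of the sketch below; the function name and output format are mine, not part of the repository:

```c
#include <math.h>
#include <stdio.h>

/* Compare two cumulative scaling vectors element-wise and report the
 * worst absolute and relative deviation.  In the experiment these were
 * the CPU- and GPU-produced constraint/variable rescaling vectors. */
void compare_scaling(const double *cpu, const double *gpu, int n, const char *label)
{
    double max_abs = 0.0, max_rel = 0.0;
    for (int i = 0; i < n; ++i) {
        double diff  = fabs(cpu[i] - gpu[i]);
        double denom = fmax(fabs(cpu[i]), 1e-300);  /* guard against division by zero */
        if (diff > max_abs) max_abs = diff;
        if (diff / denom > max_rel) max_rel = diff / denom;
    }
    printf("Comparing %s: max abs diff = %.3e, max rel diff = %.3e\n",
           label, max_abs, max_rel);
}
```

Running it on the constraint and variable rescaling vectors (and on the two bound-objective scalars with n = 1) produces the kind of maximum-difference numbers referenced above.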
I performed an additional experiment to check whether the discrepancies come from the solver itself. To isolate the effect, I swapped the CPU/GPU scaling factors, running each solver with the scaling vectors produced by the other implementation.

Under this setup, the CPU and GPU solvers behave consistently: given identical scaling vectors, they produce nearly identical trajectories and final results. This strongly suggests that the remaining mismatches are driven by the tiny differences in the scaling vectors rather than by the solver implementations. Therefore, @ZedongPeng and @jinwen-yang, even though the scaling vectors differ only at the level of floating-point rounding, those differences appear to be enough to change the solve trajectory on a few sensitive instances.
I also checked the values directly and compared CPU and GPU runs on the following instances:
Instance:
Instance:
This indicates that the CPU/GPU discrepancy is not coming from these values.

Summary
This PR introduces a CUDA-based implementation of the preconditioner module, including Ruiz, Pock–Chambolle, and objective-bound scaling.
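For quick reference, the standard forms of the first two stages are given below; the exact norms and parameters used in this repository may differ slightly. One Ruiz iteration rescales $A \leftarrow D_r A D_c$ with

$$(D_r)_{ii} = \frac{1}{\sqrt{\lVert A_{i,:}\rVert_\infty}}, \qquad (D_c)_{jj} = \frac{1}{\sqrt{\lVert A_{:,j}\rVert_\infty}},$$

repeated until the row and column infinity norms are close to 1, and the Pock–Chambolle step (with $\alpha = 1$) applies one further diagonal rescaling with

$$e_i = \frac{1}{\sqrt{\sum_j |a_{ij}|}}, \qquad d_j = \frac{1}{\sqrt{\sum_i |a_{ij}|}}.$$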
Main Changes
- Added `preconditioner.cu` alongside `preconditioner.c`
- Updated `initialize_solver_state` in `solver.cu` for GPU preconditioner integration

Implementation Details
- Scaling is applied in place on the matrix entries (`A[i,j] *= E[i]`) without extra lookups; a sketch of this pattern is shown below.
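As an illustration of the in-place scaling pattern above (a sketch only; identifiers are illustrative and may not match the actual kernel in `preconditioner.cu`):

```cuda
// In-place row scaling of a CSR matrix: each thread owns one row and
// multiplies its nonzeros by E[i], so the scaling factor is read once
// per row and no per-entry row lookup is needed.
__global__ void scale_rows_csr(int m, const int *row_ptr, double *val,
                               const double *E)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;
    double e = E[i];
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        val[k] *= e;   // A[i,j] *= E[i]
}
```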
Next Step

Note
`Reduce_bound_norm_sq_atomic` currently relies on `atomicAdd(double*)` for the bound-norm reduction, which requires CMAKE_CUDA_ARCHITECTURES ≥ 60. Would it be preferable to:
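For context on the question above, the current approach is roughly the following (a sketch with illustrative names, not the exact code in this PR); the double-precision `atomicAdd` overload it relies on only exists for compute capability 6.0 and newer:

```cuda
// Accumulate the squared 2-norm of the bound vector into a single double
// via atomicAdd. Requires sm_60+ for atomicAdd(double*, double), hence
// the CMAKE_CUDA_ARCHITECTURES >= 60 requirement mentioned above.
__global__ void reduce_bound_norm_sq_atomic(int n, const double *bound,
                                            double *norm_sq /* initialized to 0 */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    atomicAdd(norm_sq, bound[i] * bound[i]);
}
```

An architecture-independent alternative would be a shared-memory block reduction followed by a second pass (or a library reduction such as cub::DeviceReduce::Sum), which avoids the sm_60 requirement at the cost of slightly more code.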