Implement Harmony2 paper improvements for preventing overintegration#625
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough: Adds a stabilized penalty mode and per-(batch, cluster) ridge regularization.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
src/rapids_singlecell/preprocessing/_harmony/_helper.py (1)
554-583: ⚠️ Potential issue | 🟠 Major: Validate `lambda_kb` before building the inverse tensor.

`lambda_kb` moved from a scalar to a full matrix, but this helper never checks that it matches `O.shape`. A `(n_clusters,)` or `(n_batches, 1)` input will broadcast through the transpose arithmetic and silently produce the wrong inverse matrices. Normalizing both inputs to `dtype` here also keeps this path aligned with the requested precision. As per coding guidelines, `**/*.py`: "Check for and fix dtype mismatches (float32 vs float64) between CuPy arrays and Python scalars, especially when passing arrays to CUDA kernels."

♻️ Proposed fix

```diff
 def _compute_inv_mats_batched(
     O: cp.ndarray,
     lambda_kb: cp.ndarray,
     dtype: cp.dtype,
 ) -> cp.ndarray:
@@
-    n_batches, n_clusters = O.shape
+    if lambda_kb.shape != O.shape:
+        raise ValueError(
+            "lambda_kb must have shape (n_batches, n_clusters) matching O"
+        )
+    O = O.astype(dtype, copy=False)
+    lambda_kb = lambda_kb.astype(dtype, copy=False)
+
+    n_batches, n_clusters = O.shape
     n_batches_p1 = n_batches + 1
@@
-    factor = 1.0 / (O.T + lambda_kb.T)
+    factor = cp.reciprocal((O + lambda_kb).T)
```

🤖 Prompt for AI Agents
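A CPU sketch of the validation suggested above; numpy stands in for CuPy, and `compute_inv_factor` is an illustrative name, not the library's helper:

```python
import numpy as np

def compute_inv_factor(O, lambda_kb, dtype=np.float32):
    """Sketch of the shape/dtype guard proposed for _compute_inv_mats_batched.

    Both inputs are cast to the requested dtype, and lambda_kb must match
    O's (n_batches, n_clusters) shape exactly so nothing broadcasts silently.
    """
    O = np.asarray(O, dtype=dtype)
    lambda_kb = np.asarray(lambda_kb, dtype=dtype)
    if lambda_kb.shape != O.shape:
        raise ValueError(
            "lambda_kb must have shape (n_batches, n_clusters) matching O"
        )
    # Equivalent to the proposed factor = cp.reciprocal((O + lambda_kb).T)
    return np.reciprocal((O + lambda_kb).T)

O = np.array([[1.0, 3.0], [2.0, 4.0]])
lam = np.array([[1.0, 1.0], [2.0, 1.0]])
factor = compute_inv_factor(O, lam)   # shape (n_clusters, n_batches)
```

With this guard, a `(n_clusters,)` input fails fast instead of broadcasting through the transpose arithmetic.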
Verify each finding against the current code and only fix it if needed. In `src/rapids_singlecell/preprocessing/_harmony/_helper.py` around lines 554-583: in `_compute_inv_mats_batched`, validate and normalize `lambda_kb` before using it: ensure it is a CuPy array with the exact shape `(n_batches, n_clusters)` (raise a clear ValueError if not), cast both `O` and `lambda_kb` to the function dtype (e.g., via `cp.asarray(..., dtype=dtype)`) to avoid float32/float64 mismatches, and only then compute `factor = 1.0 / (O.T + lambda_kb.T)`; this prevents unintended broadcasting from shapes like `(n_clusters,)` or `(n_batches, 1)` and keeps precision consistent with `inv_mats`.

tests/test_harmony.py (1)
145-154: ⚠️ Potential issue | 🟠 Major: Add a numerical reference check for the new default path.

This keeps the Harmony1 fallback covered, but the new default (`stabilized_penalty=True`, `dynamic_lambda=True`) is still only exercised by shape/correlation checks in this file. A wiring regression in the Harmony2 path can now pass while this reference test stays green. Please add a companion assertion against a frozen Harmony2 output or another trusted reference implementation. As per coding guidelines, `**/*test*.{py,cpp}`: "HIGH: Missing validation of numerical correctness against CPU reference, missing edge case coverage (single row, empty input, max-size input), or tests that only check 'runs without error'."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tests/test_harmony.py` around lines 145-154: the test currently exercises the Harmony1 fallback but lacks a numeric reference for the new Harmony2 default path; add a companion assertion that runs `rsc.pp.harmony_integrate` on the same `adata_reference` with `stabilized_penalty=True` and `dynamic_lambda=True` (and a fixed random seed/dtype) and compares a deterministic numeric output (e.g., the corrected X or embedding) against a frozen expected numpy array using a tight tolerance (`np.testing.assert_allclose` or `pytest.approx`); store the frozen reference as a literal array in tests/test_harmony.py (or load it from a committed fixture) and include the new assertion in the same test or a new test function to ensure wiring/regression coverage for the Harmony2 path.

tests/test_harmony_kernels.py (2)
82-109: ⚠️ Potential issue | 🟠 Major: Add an explicit absent-batch case for the Harmony2 formulas.

Both stabilized test paths still sample strictly positive `O`/`E`, so they never hit the `O_kb == 0` condition this PR is trying to stabilize. A regression in the absent-batch branch would still pass here; please pin at least one entry to `O = 0` (ideally with a large `E`) and assert the kernel stays finite and matches the reference in both modes. As per coding guidelines, "HIGH: Missing validation of numerical correctness against CPU reference, missing edge case coverage (single row, empty input, max-size input), or tests that only check 'runs without error'."
Also applies to: 241-295
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_harmony_kernels.py` around lines 82 - 109, The test_penalty case currently samples strictly positive O and E so it never exercises the absent-batch branch (O_kb == 0); update the test to pin at least one O entry to 0 (preferably with a large corresponding E) before calling _pen.penalty (and likewise add the same absent-batch scenario in the other related test block covering lines 241-295), then compute the CPU reference denom/expected exactly as before and assert the kernel output is finite and matches expected for both stabilized=True and False; reference the test function test_penalty and the kernel call _pen.penalty to locate where to inject the O[k] = 0 case and the additional assertions.
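The missing edge case can be sketched on the CPU: pin one entry to `O == 0` with a large `E` and check both modes stay finite. The penalty formula below is a simplified stand-in, not the actual kernel math:

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters = 4
O = rng.random(n_clusters) + 0.5
E = rng.random(n_clusters) + 0.5

# Absent-batch condition the tests never hit: O == 0 with a large E.
O[2] = 0.0
E[2] = 50.0

def reference_penalty(O, E, stabilized, eps=1e-12):
    """Simplified stand-in for the penalty term; only the O_kb == 0
    handling pattern is the point, not the exact formula."""
    denom = (O + eps) if stabilized else O
    return (E + 1.0) / (denom + 1.0)

for stabilized in (True, False):
    pen = reference_penalty(O, E, stabilized)
    assert np.all(np.isfinite(pen))   # the absent batch must not blow up
    assert pen[2] > pen[0]            # zero O, large E => strongest penalty
```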
328-352: ⚠️ Potential issue | 🟠 Major: Make `lambda_kb` non-uniform in `test_compute_inv_mat`.

`cp.full_like(O, ridge_lambda)` only verifies the legacy scalar case. A bad stride, transpose, or accidental scalar read in `compute_inv_mats_kernel` would still pass because every entry is identical; use distinct per-batch/per-cluster values and include a `1e30` sentinel so the new indexing and pruning path are actually exercised.

Suggested test shape

```diff
-    ridge_lambda = 1.0
-    lambda_kb = cp.full_like(O, ridge_lambda)
+    lambda_kb = cp.asarray(
+        rng.random((n_batches, n_clusters)) * 2 + 0.1,
+        dtype=dtype,
+    )
+    lambda_kb[0, 0] = dtype(1e30)
```

As per coding guidelines, "HIGH: Missing validation of numerical correctness against CPU reference, missing edge case coverage (single row, empty input, max-size input), or tests that only check 'runs without error'."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_harmony_kernels.py` around lines 328 - 352, In test_compute_inv_mat replace the uniform lambda_kb = cp.full_like(O, ridge_lambda) with a non-uniform per-batch/per-cluster array so the kernel’s indexing/pruning is exercised: build lambda_kb with the same shape as O (e.g. lambda_kb = ridge_lambda + (cp.arange(O.size, dtype=dtype).reshape(O.shape) * small_scale).astype(dtype)) and then set at least one sentinel value to 1e30 (e.g. lambda_kb[0,0]=dtype(1e30)); keep using that lambda_kb when calling _corr.compute_inv_mat (and ensure dtype and shape match O, n_batches, n_clusters, cluster_k, inv_mat, g_factor, g_P_row0, stream).
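On the CPU, the suggested non-uniform construction (with the `1e30` sentinel) looks like this; numpy stands in for CuPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_batches, n_clusters = 3, 4
dtype = np.float32
O = rng.random((n_batches, n_clusters)).astype(dtype)

# Distinct per-(batch, cluster) ridge values instead of a uniform fill,
# plus one 1e30 sentinel so the pruning path is actually exercised.
lambda_kb = (rng.random((n_batches, n_clusters)) * 2 + 0.1).astype(dtype)
lambda_kb[0, 0] = dtype(1e30)

assert lambda_kb.shape == O.shape
assert np.unique(lambda_kb).size > 1                       # genuinely non-uniform
assert np.isclose(1.0 / (O[0, 0] + lambda_kb[0, 0]), 0.0)  # sentinel kills the correction
```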
🧹 Nitpick comments (1)
src/rapids_singlecell/preprocessing/_harmony_integrate.py (1)
33-41: Consider making the Harmony2 controls explicit kwargs.

`stabilized_penalty` and `dynamic_lambda` are now documented as public knobs, but they still disappear inside `**kwargs`, so they will not show up in signatures or generated API docs. Making them explicit, or at least listing `alpha` and `batch_prune_threshold` under kwargs, would make the default behavior change much easier to discover.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/rapids_singlecell/preprocessing/_harmony_integrate.py` around lines 33 - 41, The doc comment points out that stabilized_penalty and dynamic_lambda are documented but remain hidden in **kwargs; update the harmony_integrate function signature to expose these parameters explicitly (add stabilized_penalty: bool = True, dynamic_lambda: bool = True) and likewise add explicit kwargs for alpha and batch_prune_threshold with their defaults, remove them from blind **kwargs handling, pass them through to the underlying Harmony2 call (wherever harmony_integrate forwards params), and update the function docstring to list these parameters so they appear in generated API docs; ensure any internal use of **kwargs in harmony_integrate (or helper functions it calls) is adjusted to accept the new named parameters and that callers are updated if necessary.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/rapids_singlecell/_cuda/harmony/correction/kernels_correction_fast.cuh`:
- Around line 52-57: The division can produce inf/NaN when o_val + lam == 0;
ensure the per-entry ridge term is strictly positive by flooring the denominator
in the loop that computes f and p (the block using O, lambda_kb, f, p inside the
for over b). Compute a safe denominator = o_val + lam and if it is <= 0 (or
below a tiny epsilon) replace it with a small positive epsilon (use an
appropriate T epsilon, e.g. numeric_limits<T>::epsilon() or a small constant
like 1e-6 as T) before computing f = T(1)/denom and p = -f * o_val so no
division-by-zero or near-zero occurs; alternatively ensure lambda_kb is
validated upstream to be > 0 for all (b,k).
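The flooring described above can be sketched in scalar Python; the CUDA kernel would apply the same logic per (b, k) entry with a type-appropriate epsilon:

```python
import math

def safe_factor(o_val, lam, eps=1e-6):
    """Sketch of the denominator floor: f = 1/denom, p = -f * o_val,
    with denom clamped to a small positive epsilon when o_val + lam <= 0."""
    denom = o_val + lam
    if denom <= 0.0:
        denom = eps
    f = 1.0 / denom
    p = -f * o_val
    return f, p

f, p = safe_factor(0.0, 0.0)    # would divide by zero without the floor
assert math.isfinite(f) and math.isfinite(p)
f2, p2 = safe_factor(2.0, 1.0)  # normal path is untouched
assert abs(f2 - 1.0 / 3.0) < 1e-12 and abs(p2 + 2.0 / 3.0) < 1e-12
```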
In `@src/rapids_singlecell/preprocessing/_harmony/__init__.py`:
- Around line 58-61: Add upfront validation for the new public knobs so invalid
values fail fast: before the solver is invoked (i.e., before computing lambda_kb
where alpha is used and before applying batch pruning using
batch_prune_threshold in the code in __init__.py), assert that alpha is >= 0 and
that batch_prune_threshold is not None and lies in [0, 1] (or raise a ValueError
with a clear message referencing alpha and batch_prune_threshold). Also document
these checks next to the stabilized_penalty and dynamic_lambda parameter
declarations so callers cannot silently pass alpha < 0 or batch_prune_threshold
> 1.
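A minimal sketch of that upfront validation, assuming the parameter names from this PR (`alpha`, `batch_prune_threshold`):

```python
import math

def validate_harmony2_params(alpha, batch_prune_threshold):
    # Reject NaN/inf as well as negatives, so lambda_kb can't be poisoned.
    if not math.isfinite(alpha) or alpha < 0:
        raise ValueError(f"alpha must be a finite non-negative number, got {alpha}")
    # The pruning threshold is a fraction, so it must lie in [0, 1].
    if batch_prune_threshold is None or not 0.0 <= batch_prune_threshold <= 1.0:
        raise ValueError(
            f"batch_prune_threshold must lie in [0, 1], got {batch_prune_threshold}"
        )

validate_harmony2_params(0.2, 1e-5)   # the Harmony2 defaults pass
for bad_alpha in (-1.0, float("nan"), float("inf")):
    try:
        validate_harmony2_params(bad_alpha, 1e-5)
    except ValueError:
        pass
    else:
        raise AssertionError(f"alpha={bad_alpha} should have been rejected")
```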
- Around line 440-457: _compute_lambda_kb computes per-(k,b) regularization but
divides by N_b when forming prune_mask, which can be zero for unused categorical
levels; change the logic to guard against zero N_b by treating those batches as
pruned (i.e., set prune_mask True for rows where N_b == 0) or by using a safe
divisor (e.g., N_b_safe = where(N_b==0, 1, N_b)) before computing (O /
N_b_safe[:, None]); then assign a large regularizer (E.dtype.type(1e30)) to
lambda_kb for those pruned/zero-count batches so downstream 1/(O + lambda_kb)
never divides by zero. Ensure the fix is applied only when dynamic_lambda is
true and keep existing behavior for non-zero N_b.
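A CPU sketch of the zero-count guard; the dynamic formula itself is simplified here (`alpha * E` is a stand-in), and only the `N_b == 0` handling mirrors the comment:

```python
import numpy as np

def compute_lambda_kb_sketch(O, E, N_b, alpha=0.2, thresh=1e-5, sentinel=1e30):
    """Sketch of the guard suggested for _compute_lambda_kb: treat
    zero-count batches as pruned and use a safe divisor for N_b."""
    N_b = np.asarray(N_b, dtype=np.float64)
    N_b_safe = np.where(N_b == 0, 1.0, N_b)            # avoid division by zero
    frac = O / N_b_safe[:, None]                       # batch fraction per cluster
    prune_mask = (frac < thresh) | (N_b == 0)[:, None]
    lambda_kb = alpha * E                              # placeholder dynamic ridge
    lambda_kb[prune_mask] = sentinel                   # suppress pruned entries
    return lambda_kb

O = np.array([[5.0, 0.0], [0.0, 0.0]])
E = np.array([[4.0, 1.0], [1.0, 1.0]])
N_b = np.array([10.0, 0.0])                            # second batch level unused
lkb = compute_lambda_kb_sketch(O, E, N_b)
assert np.all(lkb[1] == 1e30)                          # zero-count batch fully pruned
assert np.all(np.isfinite(1.0 / (O + lkb)))            # downstream division is safe
```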
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5dd65525-b9aa-4eda-986c-b2e844796cbc
📒 Files selected for processing (13)
docs/references.bib
src/rapids_singlecell/_cuda/harmony/clustering/clustering.cu
src/rapids_singlecell/_cuda/harmony/clustering/kernels_clustering.cuh
src/rapids_singlecell/_cuda/harmony/correction/correction_batched.cu
src/rapids_singlecell/_cuda/harmony/correction/correction_fast.cu
src/rapids_singlecell/_cuda/harmony/correction/kernels_correction_fast.cuh
src/rapids_singlecell/_cuda/harmony/pen/kernels_pen.cuh
src/rapids_singlecell/_cuda/harmony/pen/pen.cu
src/rapids_singlecell/preprocessing/_harmony/__init__.py
src/rapids_singlecell/preprocessing/_harmony/_helper.py
src/rapids_singlecell/preprocessing/_harmony_integrate.py
tests/test_harmony.py
tests/test_harmony_kernels.py
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/rapids_singlecell/preprocessing/_harmony_integrate.py`:
- Around line 23-26: Add an end-to-end unit test that exercises the public API
functions harmony_integrate() and/or harmonize() using a small synthetic dataset
to ensure the new public knobs (stabilized_penalty, dynamic_lambda, alpha,
batch_prune_threshold) are wired correctly: call the function once with the new
defaults (stabilized_penalty=True, dynamic_lambda=True, alpha=0.2,
batch_prune_threshold=1e-5) and once with the Harmony1 fallback
(stabilized_penalty=False, dynamic_lambda=False) and assert key integration
properties (e.g., output shape, no NaNs, batch-correction effect between
batches, and deterministic behavior for fixed seed); vary alpha and
batch_prune_threshold in a couple of cases to ensure parameters are passed
through to the kernels.
In `@src/rapids_singlecell/preprocessing/_harmony/__init__.py`:
- Around line 176-180: The alpha validation currently only checks for negative
values but allows non-finite values (NaN/inf) which break downstream (lambda_kb)
computations; update the validation near the existing alpha check in __init__.py
to reject non-finite values by using a finite check (e.g., math.isfinite or
numpy.isfinite) and raise ValueError if not finite; keep the existing
batch_prune_threshold check unchanged and include alpha in the error message
(e.g., "alpha must be a finite non-negative number, got {alpha}").
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f620f875-30e9-4810-ae3a-197b17326cc8
📒 Files selected for processing (5)
src/rapids_singlecell/preprocessing/_harmony/__init__.py
src/rapids_singlecell/preprocessing/_harmony/_helper.py
src/rapids_singlecell/preprocessing/_harmony_integrate.py
tests/test_harmony.py
tests/test_harmony_kernels.py
💤 Files with no reviewable changes (1)
- src/rapids_singlecell/preprocessing/_harmony/_helper.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_harmony.py
♻️ Duplicate comments (1)
src/rapids_singlecell/preprocessing/_harmony/__init__.py (1)
176-181: ⚠️ Potential issue | 🟡 Minor: Reject non-finite `alpha` values.

`alpha=float("nan")` and `alpha=float("inf")` both pass the current check. NaN will poison `lambda_kb`, and inf will suppress all corrections. Add a finite check.

Suggested fix

```diff
-    if alpha < 0:
-        raise ValueError(f"alpha must be non-negative, got {alpha}.")
+    if not np.isfinite(alpha) or alpha < 0:
+        raise ValueError(f"alpha must be a finite, non-negative value, got {alpha}.")
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `src/rapids_singlecell/preprocessing/_harmony/__init__.py` around lines 176-181: the current validation for `alpha` only checks for negativity but allows non-finite values like NaN and Inf; update the check to also reject non-finite values using `math.isfinite` or `numpy.isfinite` and raise a ValueError with a clear message (e.g., "alpha must be a finite non-negative number, got {alpha}") so that NaN and Inf are rejected before they can affect `lambda_kb` or corrections.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c235f253-98ea-4ce7-9efd-18018e22455e
📒 Files selected for processing (2)
src/rapids_singlecell/preprocessing/_harmony/__init__.py
tests/test_harmony.py
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/rapids_singlecell/preprocessing/_harmony/__init__.py`:
- Around line 461-472: The early return when dynamic_lambda is False bypasses
the zero-denominator guard; fix by constructing lambda_kb even in the
fixed-ridge branch (instead of returning cp.full_like(E, ridge_lambda)
immediately) or validate ridge_lambda up front to ensure it cannot be zero, then
always apply the sentinel suppression lambda_kb[(O + lambda_kb) == 0] = sentinel
before returning; refer to the symbols dynamic_lambda, cp.full_like(E,
ridge_lambda), lambda_kb, O, sentinel and ensure the sentinel guard runs
unconditionally for both dynamic and fixed branches (and add a regression test
for dynamic_lambda=False with ridge_lambda=0).
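The unconditional sentinel guard for the fixed-ridge branch can be sketched as:

```python
import numpy as np

def build_lambda_kb_fixed(E, O, ridge_lambda, sentinel=1e30):
    """Sketch of the fix: build the matrix first, then apply the sentinel
    guard so it runs for the fixed-ridge branch too, not only the dynamic one."""
    lambda_kb = np.full_like(E, ridge_lambda)
    lambda_kb[(O + lambda_kb) == 0] = sentinel   # zero denominators suppressed
    return lambda_kb

# Regression case from the review: dynamic_lambda=False with ridge_lambda=0.
E = np.ones((2, 3))
O = np.zeros((2, 3))
lkb = build_lambda_kb_fixed(E, O, ridge_lambda=0.0)
assert np.all(lkb == 1e30)
assert np.all(np.isfinite(1.0 / (O + lkb)))   # downstream 1/(O + lambda) is safe
```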
In `@tests/test_harmony.py`:
- Around line 63-90: The fixture adata_ircolitis_harmony2 currently uses
pooch.retrieve to download external files (pcs_file, harmony2_file, obs_file)
which makes tests network-bound; replace this by either bundling a small TSV/CSV
fixture under tests/ (e.g., tests/data/ircolitis_*.tsv.gz) and reading it with
pd.read_csv, or generate a deterministic synthetic AnnData in the fixture that
creates obs (DataFrame) and obsm entries "X_pca" and "harmony2_ref" with
appropriate shapes (use numpy for reproducible random data via a fixed seed);
update the fixture adata_ircolitis_harmony2 to remove pooch.retrieve calls and
load from the new local files or synthetic generation so tests no longer depend
on exampledata.scverse.org.
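A network-free fixture could be generated deterministically instead; this sketch uses plain numpy arrays, and the field names (`X_pca`, `harmony2_ref`) mirror the review comment rather than the real downloaded files:

```python
import numpy as np

def make_synthetic_harmony_fixture(n_cells=40, n_pcs=5, n_batches=2, seed=0):
    """Sketch of a deterministic synthetic replacement for the pooch-backed
    fixture: obs-like labels plus obsm-like embeddings, fully offline."""
    rng = np.random.default_rng(seed)
    batch = np.repeat(np.arange(n_batches), n_cells // n_batches)
    X_pca = rng.standard_normal((n_cells, n_pcs))
    X_pca = X_pca + batch[:, None] * 0.5           # inject a batch effect to correct
    harmony2_ref = X_pca - batch[:, None] * 0.5    # stand-in frozen reference
    return {"obs": {"batch": batch},
            "obsm": {"X_pca": X_pca, "harmony2_ref": harmony2_ref}}

fx = make_synthetic_harmony_fixture()
fx2 = make_synthetic_harmony_fixture()
# Same seed, same data: the fixture is fully reproducible without downloads.
assert np.array_equal(fx["obsm"]["X_pca"], fx2["obsm"]["X_pca"])
```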
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 50682935-1b5a-4354-b362-5fb05986aa23
📒 Files selected for processing (2)
src/rapids_singlecell/preprocessing/_harmony/__init__.py
tests/test_harmony.py
@coderabbitai review
✅ Actions performed: Review triggered.
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #625      +/-   ##
==========================================
+ Coverage   87.75%   87.98%   +0.23%
==========================================
  Files          96       96
  Lines        7013     7027      +14
==========================================
+ Hits         6154     6183      +29
+ Misses        859      844      -15
```
Implements three algorithmic improvements from the Harmony2 paper (Patikas et al., 2026; doi:10.64898/2026.03.16.711825) to prevent overintegration in biologically heterogeneous datasets:

1. Stabilized penalty: the old formula could dominate the objective and force overintegrated clusters. Controlled via stabilized_penalty (default True), templated as a compile-time bool in the CUDA kernels.
2. Dynamic per-(batch, cluster) lambda: regularization now scales with batch representation in each cluster. Well-represented batches get strong correction; rare or absent batches get weak correction. Controlled via dynamic_lambda (default True) and alpha (default 0.2).
3. Batch pruning builds on the dynamic lambda: pruned entries get lambda_kb = 1e30, which drives the correction to zero without any kernel changes. Controlled via batch_prune_threshold (default 1e-5).

All three features default to Harmony2 behavior. To recover the original Harmony1 behavior, pass stabilized_penalty=False, dynamic_lambda=False.
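A small numeric illustration of why the lambda_kb = 1e30 sentinel needs no kernel changes: the per-entry factor 1 / (O + lambda_kb) simply collapses to ~0 for pruned entries.

```python
import numpy as np

# Per-cluster counts for three batches; the last one is pruned.
O = np.array([10.0, 0.5, 0.0])
lambda_kb = np.array([1.0, 1.0, 1e30])   # sentinel on the pruned entry

factor = 1.0 / (O + lambda_kb)
assert factor[2] < 1e-20                 # pruned: correction effectively zero
assert factor[0] > 0.05                  # well-represented: normal correction
```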