@smarter smarter commented Nov 11, 2025

This includes using fixtures for ground truth generation and test configuration,
so that we can just do:

uv run pytest -sv tests/ekfac_tests

and ground truth will be auto-generated.
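A minimal sketch of the generate-once-and-cache idea behind such a fixture. The helper name `ensure_ground_truth` and the JSON cache file are hypothetical, not the PR's actual API:

```python
import json
from pathlib import Path

def ensure_ground_truth(cache_dir):
    """Generate ground truth on first use, then reuse the cached copy.
    A session-scoped pytest fixture can simply call this, so a plain
    `pytest` run needs no manual preparation step."""
    path = Path(cache_dir) / "ground_truth.json"
    if path.exists():
        return json.loads(path.read_text())
    data = {"eigenvalues": [1.0, 2.0]}  # stand-in for the real computation
    path.write_text(json.dumps(data))
    return data

# In conftest.py this would be wrapped as a fixture, roughly:
# @pytest.fixture(scope="session")
# def ground_truth(tmp_path_factory):
#     return ensure_ground_truth(tmp_path_factory.mktemp("ekfac"))
```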

Additionally, this required adding a pass/fail threshold to tests/ekfac_tests/test_eigenvalue_correction.py.

I haven't tested test_apply_ekfac yet.

The keyword parameter `dtype` for AutoModelForCausalLM.from_pretrained does not
exist in version 4.54.1, which was the version pinned in uv.lock (this parameter
used to be called `torch_dtype`, which is now a deprecated alias).
Also rework default handling so that default values are no longer specified in multiple places.
…ection)

This way when we start using pytest, test failures will be properly reported.

test_eigenvalue_correction had no explicit criterion for success, so I made one
up.
Ran "uv pre-commit run --all-files", which reads from .pre-commit-config.yaml.

Unfortunately pre-commit does not respect tool settings in pyproject.toml, so
right now there is conflicting information between pyproject.toml and
.pre-commit-config.yaml, and different settings and tool versions are used
depending on how we run the tools.
test_eigenvalue_corrections had to be disabled due to precision errors:

  h.6.attn.attention.out_proj: max_rel_diff=2.285%
  h.6.mlp.c_proj: max_rel_diff=3.599%
  h.7.attn.attention.out_proj: max_rel_diff=4.041%
  h.7.mlp.c_proj: max_rel_diff=2.204%
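For reference, a pure-Python sketch of how a relative-difference check like this can be gated by an explicit pass/fail threshold. `max_rel_diff` and `THRESHOLD_PCT` are illustrative names, not the PR's exact code:

```python
def max_rel_diff(approx, exact, eps=1e-12):
    """Largest elementwise relative difference between two flat
    sequences of values, expressed in percent."""
    return max(abs(a - e) / max(abs(e), eps)
               for a, e in zip(approx, exact)) * 100

# A test can then fail loudly instead of silently printing numbers:
THRESHOLD_PCT = 2.0  # hypothetical tolerance; the layers above exceed it
assert max_rel_diff([1.01, 2.0], [1.0, 2.0]) < THRESHOLD_PCT
```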
It seems the working-directory parameter in the CI config is ignored if
pyproject.toml configures pyright, so tweak that instead.
Overwriting is allowed using the --overwrite flag.
@smarter force-pushed the ekfac-pytests branch 5 times, most recently from 2f2ff46 to 08b9e5e on December 10, 2025.
Use loss.sum().backward() to avoid scaling the gradients by 1/B (and the
covariance matrix by 1/B^2).

Without this change, G2/G1 is empirically ~0.2 with the default set of
parameters.
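The scaling argument can be checked with a tiny hand-computed example (pure Python, no autograd; the toy squared-error loss is purely illustrative):

```python
# For loss_i = (w * x_i - y_i)^2, the per-example gradient is
# dloss_i/dw = 2 * x_i * (w * x_i - y_i).
B = 4
w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 0.0, 1.0]

per_example = [2 * x * (w * x - y) for x, y in zip(xs, ys)]
g_sum = sum(per_example)   # gradient of loss.sum()
g_mean = g_sum / B         # gradient of loss.mean(): scaled by 1/B

# The gradient covariance is built from outer products of gradients,
# so using mean() scales it by 1/B^2 relative to sum().
assert abs(g_sum - B * g_mean) < 1e-12
assert abs(g_sum**2 - B**2 * g_mean**2) < 1e-9
```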
This compares the KFAC approximation against the exact FIM computed on a toy
model. We intentionally restrict the test conditions to avoid exercising the issues
with padding and last-token gradients, which are fixed in the next commit.
When batching sequences of different lengths, we pad shorter sequences. These
padding positions aren't real data and shouldn't contribute to the FIM.
Similarly, the last position of each sequence has no next token to predict.

Invalid positions affected both covariances differently. The activation
covariance A was contaminated with out-of-distribution activations for padding.
The gradient covariance G was underestimated: gradients are zero for invalid
positions, but total_processed included them in the denominator. When
sample=True, there was a third issue: sampled labels didn't preserve -100 for
padding, so G was corrupted with non-zero gradients.

The fix computes valid_masks in pad_and_tensor() and uses it to filter
activations and restrict loss computation to valid positions.
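A pure-Python sketch of the masking idea (the name `pad_and_mask` and the pad id are hypothetical stand-ins for the PR's `pad_and_tensor`):

```python
PAD_ID = 0  # hypothetical pad token id

def pad_and_mask(seqs, pad_id=PAD_ID):
    """Pad variable-length token sequences to a common length and record
    which positions are valid: neither padding nor the final token of a
    sequence (which has no next token to predict)."""
    T = max(len(s) for s in seqs)
    input_ids, valid_masks = [], []
    for s in seqs:
        input_ids.append(s + [pad_id] * (T - len(s)))
        # Valid positions are 0 .. len(s)-2; padding and the last real
        # token are excluded so they contribute to neither A nor G.
        valid_masks.append([i < len(s) - 1 for i in range(T)])
    return input_ids, valid_masks

ids, masks = pad_and_mask([[5, 6, 7], [8, 9]])
# total_processed should count only valid positions (here 2 + 1 = 3),
# not the full B*T = 6 slots.
assert sum(map(sum, masks)) == 3
```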
CovarianceCollector was called without the target_modules parameter, causing it
to hook into all MLP layers instead of just the specified target modules.
LambdaCollector and the ground truth collectors already had this parameter set
correctly.
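The effect of the missing parameter can be illustrated by the name filtering it controls. This is a generic sketch, not CovarianceCollector's actual implementation:

```python
def select_modules(named_modules, target_modules):
    """Keep only the (name, module) pairs whose names match one of the
    requested targets; without this filter, hooks land on every layer."""
    return {name: mod for name, mod in named_modules
            if any(name.endswith(t) for t in target_modules)}

layers = [
    ("h.6.mlp.c_fc", "fc"),
    ("h.6.mlp.c_proj", "proj"),
    ("h.6.attn.attention.out_proj", "out"),
]
selected = select_modules(layers, ["c_proj", "out_proj"])
assert set(selected) == {"h.6.mlp.c_proj", "h.6.attn.attention.out_proj"}
```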