ActivitySim Testing Plan #1038

@jpn--

Description

Background and Goals

ActivitySim's current testing regime relies heavily on end-to-end integration tests that run entire model pipelines and check for bitwise-identical outputs. While these tests provide meaningful coverage, they have two serious drawbacks: they're slow (CI runs take an hour or more), and they're brittle — any small code change tends to break them, triggering investigation and manual correction of expected outputs rather than actual debugging. The goal of this plan is to shift the balance toward faster, more targeted unit tests, while preserving the integration tests that catch real-world regressions. Critically, this shift needs to happen incrementally, without a large upfront investment that would be difficult to justify to funders.

The Problem with ActivitySim's Current Testing Approach

ActivitySim's testing infrastructure has grown organically alongside the codebase, and it shows. The result is a system that is simultaneously slow, brittle, and incomplete — a combination that creates real friction for developers and quietly erodes confidence in the codebase over time.

The dominant form of testing today is what might be called "implementation testing": running an entire model pipeline from start to finish and checking that the outputs are identical to a previously saved reference. These tests are valuable in one sense — they do catch regressions — but they come with serious costs. A full CI run takes an hour or more, which means developers often wait a long time only to discover a problem they could have caught much earlier. Worse, because the tests check for bitwise-identical outputs, even a trivially small change to the code — adjusting a rounding behavior, reordering an operation, fixing an unrelated bug — can cause a cascade of test failures. These failures don't indicate that something is wrong with the model; they indicate that something changed. Distinguishing between the two requires manual investigation, and correcting the reference outputs is tedious work that adds no real value. The practical effect is that developers learn to dread test failures rather than trust them.

The tests are also poorly organized. They are spread across multiple locations without a clear rationale, and there is no obvious place to look when trying to understand what is and isn't covered. This makes it hard for a new contributor to know where to add a test, or even whether a test already exists for the thing they're changing. The path of least resistance is to not write a test at all.

Underlying all of this is a structural gap: there are few unit tests. Many individual functions and components are not tested in isolation, which means bugs can hide in plain sight as long as the end-to-end outputs happen to remain stable.
The end-to-end nature of the testing also makes it difficult to exercise different combinations of input settings. If one feature of a model has three possible settings, and another feature it interacts with can be turned on or off, exercising those interactions end to end requires 3 * 2 = 6 full model runs. The combinatorial explosion that comes with each new feature makes rigorous, comprehensive testing via end-to-end integration tests impractical.

When something does break, it's often difficult to pinpoint the cause quickly, because there's no lower-level test suite to consult. ActivitySim's internal calling conventions — the pipeline system, configuration loading, the inject framework — make it genuinely difficult to test a single component without standing up a large amount of surrounding infrastructure, and no shared tools exist to make that easier.

The cumulative effect is a testing regime that imposes high costs while providing incomplete coverage. It slows down development, produces noise that developers learn to tune out, and does little to guide contributors toward writing better code. Addressing this is not just a matter of software hygiene — it directly affects how quickly and confidently the project can evolve.

Guiding Principles

The Boy Scout Rule applies to tests. When a developer touches a piece of code — to fix a bug, add a feature, or refactor — they are expected to leave the tests for that code in better shape than they found them. This doesn't mean rewriting everything; it means adding one or two focused unit tests for the specific behavior they changed. Over time this compounds into meaningful coverage without requiring a dedicated testing sprint.

It should be clear where to put tests. Currently, tests live in many different places: a top-level tests/ directory outside the main package source, plus individual tests/ directories nested at various levels inside the package source. There should be one "correct" place to put a test for any particular thing, so users can easily tell whether that thing is tested, and so contributors can easily find the right place to add a new test.

Integration tests are a safety net, not a development tool. The existing end-to-end tests should be retained, but their role should be reframed. They exist to catch catastrophic regressions, not to validate every intermediate computation. They should not be the first thing a developer runs, and failures in them should not be the primary mechanism for discovering bugs.

Test Categories

The plan organizes tests into tiers, each with different expectations about speed, scope, and how often they run.

Tier 1: Unit and Component Tests. These test a single function, class, or component in isolation. They should run in milliseconds to seconds each, require no external files or data other than standard small synthetic testing data, and be completely deterministic. The goal is that the full unit test suite runs in under five minutes. These run on every pull request and every push.
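
To make the Tier 1 idea concrete, the sketch below shows roughly what such a test might look like. The values are illustrative only, and the assertions assume that the small reindex helper in activitysim.core.util keeps its documented behavior of aligning the first series to the index of the second:

```python
import pandas as pd

from activitysim.core.util import reindex


def test_reindex_maps_zone_values_onto_households():
    # values keyed by zone id
    values_by_zone = pd.Series([10.0, 20.0, 30.0], index=[1, 2, 3])
    # zone id of each household, keyed by household id
    zone_of_household = pd.Series([3, 1, 2], index=[101, 102, 103])

    result = reindex(values_by_zone, zone_of_household)

    # the result should be aligned to the household index, with each
    # household picking up the value of its home zone
    assert list(result.index) == [101, 102, 103]
    assert list(result.values) == [30.0, 10.0, 20.0]
```

A test like this needs no pipeline, no configuration files, and no example data, so it runs in milliseconds and a failure points directly at the broken function.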

Tier 2: Integration Tests. These are the existing end-to-end tests, largely as they exist today. They run the full model pipelines against the example repositories and check for stable outputs. Because they are slow, they run on pull requests but not on every push.

Addressing the Complexity of Testing ActivitySim Components

The unusual calling conventions and data needs in ActivitySim (pipeline state, the use of large skims, the automatic configuration loading from files) make it genuinely hard to write isolated tests. Rather than requiring every contributor to figure this out from scratch, the plan calls for maintaining a small library of test helpers and fixtures in a dedicated activitysim/testing/ module. This module should provide the following (a rough sketch of what these helpers could look like appears after the list):

  • A minimal pipeline context that can be set up in a few lines, sufficient to run a single component.

  • Two clear, centralized, and well-documented small generic synthetic datasets (perhaps 100 households, 25 zones, 50 microzones) that are representative enough to exercise model logic: one for a one-zone system and one for a two-zone system.

  • Helper functions for initializing a workflow State directly in Python with all the necessary data and configurations pre-loaded, without standing up the full model or requiring a bunch of individual test-specific YAML or CSV files.

  • Helper functions to add a zone, household, tour, or other data element with some specific characteristic to the generic data, so that corner cases and similar weirdness can be tested without breaking other existing tests or requiring a whole new synthetic dataset. For example, if we have code that handles infeasible situations, such as a tour made by walking whose stop purpose is not accessible by walking, we should be able to inject one such tour into the tour data without maintaining an entire separate small data file constructed around it.
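
The sketch below illustrates one possible shape for these helpers. None of this is existing ActivitySim API; the function names, table columns, and default sizes are placeholders for discussion, and wiring the tables into a workflow State is deliberately omitted:

```python
# activitysim/testing/synthetic.py (hypothetical)
import pandas as pd


def synthetic_one_zone_data(n_households: int = 100, n_zones: int = 25) -> dict:
    """Build a tiny, deterministic set of input tables for component tests."""
    land_use = pd.DataFrame(
        {
            "zone_id": list(range(1, n_zones + 1)),
            "area_type": 1,
            "tot_emp": 100,
        }
    ).set_index("zone_id")
    households = pd.DataFrame(
        {
            "household_id": list(range(1, n_households + 1)),
            "home_zone_id": [(i % n_zones) + 1 for i in range(n_households)],
            "income": 50_000,
            "hhsize": 2,
        }
    ).set_index("household_id")
    return {"land_use": land_use, "households": households}


def add_tour(tours: pd.DataFrame, **overrides) -> pd.DataFrame:
    """Append one extra tour with specific characteristics (for example, a walk
    tour whose stop purpose is not walk-accessible) without disturbing the
    generic data that other tests rely on."""
    template = tours.iloc[-1].to_dict() if len(tours) else {}
    template.update(overrides)
    new_id = int(tours.index.max()) + 1 if len(tours) else 1
    new_row = pd.DataFrame([template], index=pd.Index([new_id], name=tours.index.name))
    return pd.concat([tours, new_row])
```

A component test would then request the generic tables, use add_tour (or its zone and household equivalents) to inject the one corner-case row it cares about, and hand the result to the component under test.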

Expectations for Contributors

The contribution guidelines should be updated to state the following clearly:

When submitting a pull request that modifies the behavior of existing code, include at least one new unit or component test that would have caught the regression or demonstrates the new behavior. When fixing a bug, the test should fail before the fix and pass after. When adding a new model component, include a component test using the synthetic fixtures. Reviewers are expected to check for this and should not merge PRs that omit tests for non-trivial changes. For bug fixes in particular, it should be made easy for any reviewer to copy just the test out of a pull request, run the test against the existing codebase, and affirm that it fails.
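
As an illustration of the expected shape of such a bug-fix test (the function and the bug here are invented for the example), the assertion encodes the corrected behavior, so the test fails against the unpatched codebase and passes once the fix is applied:

```python
import pandas as pd
import pytest


# stand-in for the function under test; in a real pull request this would be
# an import of the ActivitySim component whose behavior was fixed
def renormalize(probs: pd.Series) -> pd.Series:
    return probs / probs.sum()


def test_renormalize_handles_a_zero_probability_alternative():
    probs = pd.Series([0.0, 0.25, 0.75], index=["0_stops", "1_stop", "2_stops"])

    result = renormalize(probs)

    # encodes the corrected expectation: under the (hypothetical) bug the
    # probabilities summed to less than one, so this assertion failed
    assert result.sum() == pytest.approx(1.0)
    assert result["0_stops"] == 0.0
```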

These expectations should be framed as a quality floor, not a bureaucratic checklist. The goal is to make the codebase easier to work with over time.
