Skip to content

global_substream_cursor: configured lookback_window not applied to should_be_synced filter, causing silent data loss #1014

@devin-ai-integration

Description

@devin-ai-integration

Community report

Reported by a user in the Airbyte community Slack: https://airbytehq.slack.com/archives/C021JANJ6TY/p1776936626673869

The user has an Intercom connector with hourly syncs, lookback_window = 1 day, and conversation_parts in incremental append+deduped mode. They found that statistics.count_conversation_parts on conversations consistently exceeds the number of parts in their destination table. Manual verification against the Intercom API confirmed the API returns the correct number of parts but Airbyte drops some during sync.

Summary

When a substream uses global_substream_cursor: true + is_client_side_incremental: true, the configured lookback_window on the DatetimeBasedCursor is not applied to the should_be_synced client-side filter. Instead, the filter uses the runtime_lookback_window (previous sync's duration in seconds), which is typically only a few minutes. This causes parts whose updated_at is even slightly behind the global cursor to be silently dropped.

Affected code path

  1. ConcurrentPerPartitionCursor._get_cursor() creates a cursor using self._lookback_window, which is the runtime lookback (previous sync duration in seconds), loaded from state at airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py line 474:

    self._lookback_window = int(stream_state.get("lookback_window", 0))
  2. This value is passed to _create_cursor() at line 596-600:

    if self._use_global_cursor:
        return self._create_cursor(
            self._global_cursor,
            self._lookback_window if self._global_cursor else 0,
        )
  3. In model_to_component_factory.py lines 1416-1426, the runtime lookback is subtracted from the cursor value:

    stream_state_value = stream_state.get(cursor_field.cursor_field_key)
    if runtime_lookback_window and stream_state_value:
        new_stream_state = (
            connector_state_converter.parse_timestamp(stream_state_value)
            - runtime_lookback_window
        )
  4. This becomes self.start in ConcurrentCursor.__init__() at airbyte_cdk/sources/streams/concurrent/cursor.py line 225.

  5. should_be_synced() at line 561-573 uses self.start as the lower bound:

    return self.start <= record_cursor_value <= self._end_provider()

The configured lookback_window (e.g., P1D) from the manifest's DatetimeBasedCursor is passed to ConcurrentCursor as self._lookback_window (line 226) but is only used in stream_slices() for generating API request time ranges. It is never referenced by should_be_synced().

Reproduction

Connector: source-intercom
Stream: conversation_parts (substream of conversations)
Config: lookback_window = 1, hourly syncs, incremental append+deduped
Manifest settings: global_substream_cursor: true, is_client_side_incremental: true

Observed behavior:

  • Intercom API returns 44 parts for a conversation
  • Airbyte destination table has 38 parts (6 missing)
  • The 6 missing parts were created seconds/minutes before other parts on the same conversation that were successfully synced
  • Example: On April 28, parts created at 15:33:10-16 were dropped, while parts created at 15:33:18 (2 seconds later) on the same conversation were synced. The should_be_synced cutoff landed between them.

Expected behavior:

  • With a 1-day lookback_window, self.start should be global_cursor - 1 day, not global_cursor - previous_sync_duration_seconds
  • All 44 parts should be synced

Root cause

The global_substream_cursor tracks one cursor across all partitions. When the cursor advances (from processing parts on other conversations), parts from individual conversations that have updated_at slightly behind the global max are dropped by should_be_synced. The runtime lookback (a few minutes) is far too small to cover the spread of updated_at values across conversations in a workspace.

Suggested fix

Apply the configured lookback_window (from the manifest's DatetimeBasedCursor) to the should_be_synced filter when using global_substream_cursor, either instead of or in addition to the runtime lookback. The effective lookback should be max(configured_lookback_window, runtime_lookback_window).


Devin session — requested by @mark-grivnin

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions