Skip to content

Fix case-sensitive session ID handling in ServiceBusOrchestrationService#1334

Merged
nytian merged 4 commits intoAzure:mainfrom
solankisamir:sasolank/case-invariant-dictionary
Apr 2, 2026
Merged

Fix case-sensitive session ID handling in ServiceBusOrchestrationService#1334
nytian merged 4 commits intoAzure:mainfrom
solankisamir:sasolank/case-invariant-dictionary

Conversation

@solankisamir
Copy link
Copy Markdown
Member

@solankisamir solankisamir commented Apr 2, 2026

Summary

Service Bus can change the casing of session IDs during upgrades or failovers. The DurableTask framework used ordinal (case-sensitive) ConcurrentDictionary keys for orchestrationSessions and orchestrationMessages, causing a lowercased session ID to create a ghost session with empty state instead of finding the existing session.

Problem

This broke eternal orchestrations (ContinueAsNew timer bridge) because:

  1. Timer message sent with PascalCase session ID (e.g., System_MoveBillingEvents_a3c79b00)
  2. Service Bus delivered to lowercased session ID after upgrade (system_movebillingevents_a3c79b00)
  3. Framework created a new empty session (ghost) instead of finding the existing one
  4. Real session orphaned permanently - no pending messages, stuck forever

Impact

  • Eternal orchestration were getting stuck forever

Changes

1. Case-insensitive dictionary (core fix)

Changed both ConcurrentDictionary instances in ServiceBusOrchestrationService.StartAsync() to use StringComparer.OrdinalIgnoreCase:

\\csharp
// Before
this.orchestrationSessions = new ConcurrentDictionary<string, ServiceBusOrchestrationSession>();

// After
this.orchestrationSessions = new ConcurrentDictionary<string, ServiceBusOrchestrationSession>(StringComparer.OrdinalIgnoreCase);
\\

2. Diagnostic logging (3 new log points)

  • TrySetSessionState-DeletingState (Warning) - Logs the reason when session state is set to null (previously silent)
  • GetSessionState-EmptyState (Warning) - Warns when a session has null/empty state (possible ghost session)
  • SentMessageLog enhancement - Now includes ScheduledEnqueueTimeUtc and target SessionId for timer messages

3. Unit tests (5 new tests)

  • Case-insensitive dictionary lookup with mixed casing
  • Ghost session prevention (duplicate add with different casing fails)
  • Exact reproduction of the IcM scenario
  • Multiple casing variants treated as same key
  • Structural verification of the field type

Testing

All 5 new unit tests pass. Existing build succeeds with 0 warnings, 0 errors.

solankisamir and others added 2 commits April 1, 2026 18:22
Service Bus can change the casing of session IDs during upgrades or
failovers. The DurableTask framework used ordinal (case-sensitive)
ConcurrentDictionary keys for orchestrationSessions and
orchestrationMessages, causing a lowercased session ID to create a
ghost session with empty state instead of finding the existing session.

This broke eternal orchestrations (ContinueAsNew timer bridge) because:
1. Timer message sent to PascalCase session ID
2. Service Bus delivered to lowercased session ID after upgrade
3. Framework created new empty session (ghost) instead of finding existing
4. Real session orphaned permanently with no pending messages

Fix: Use StringComparer.OrdinalIgnoreCase for both ConcurrentDictionary
instances so session lookups are resilient to casing changes.

Incident: IcM 771856247 — Service Bus scheduled message loss
Impact: 15+ APIM tenants, billing orchestrations stuck 43+ hours

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Added 3 new log points to ease investigation of session-related failures:

1. TrySetSessionState-DeletingState (Warning) — Logs the reason when
   session state is set to null (runtime state null, missing
   ExecutionStartedEvent, or non-Running status). Previously silent.

2. GetSessionState-EmptyState (Warning) — Warns when a session has null
   or empty state, which may indicate a ghost session from a casing
   change.

3. SentMessageLog enhancement — Now includes ScheduledEnqueueTimeUtc
   and target SessionId for timer messages, enabling end-to-end timer
   lifecycle tracing without cross-event correlation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 2, 2026 01:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a reliability issue in the Service Bus backend where session IDs can be delivered with different casing (e.g., after upgrades/failovers), which previously caused “ghost sessions” and could stall long-running/eternal orchestrations.

Changes:

  • Initialize orchestrationSessions (and currently also orchestrationMessages) with a case-insensitive ConcurrentDictionary comparer.
  • Add diagnostic logging around empty session state reads and state deletion.
  • Add new unit tests intended to validate case-insensitive behavior and reproduce the incident scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
src/DurableTask.ServiceBus/ServiceBusOrchestrationService.cs Switches dictionaries to OrdinalIgnoreCase and adds new warning logs + enhanced sent-message logging.
Test/DurableTask.ServiceBus.Tests/SessionIdCaseInsensitiveTests.cs Adds new MSTest tests for case-insensitive dictionary behavior and incident reproduction (but currently placed outside the solution’s active test/ test project tree).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@cgillum
Copy link
Copy Markdown
Member

cgillum commented Apr 2, 2026

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI review requested due to automatic review settings April 2, 2026 03:14
@solankisamir
Copy link
Copy Markdown
Member Author

/azp run

@azure-pipelines
Copy link
Copy Markdown

Commenter does not have sufficient privileges for PR 1334 in repo Azure/durabletask

@solankisamir
Copy link
Copy Markdown
Member Author

#1318

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/DurableTask.ServiceBus/ServiceBusOrchestrationService.cs:1612

  • GetSessionStateAsync logs when the session state is null/empty, but it still constructs a MemoryStream when state.Length == 0. An empty stream will deserialize as an empty string and will throw in RuntimeStateStreamConverter.DeserializeToRuntimeStateWithFallback. Consider treating empty state the same as null when creating rawSessionStream (e.g., only create the stream when state?.Length > 0).
            byte[] state = await session.GetStateAsync();

            if (state == null || state.Length == 0)
            {
                TraceHelper.TraceSession(
                    TraceEventType.Information,
                    "ServiceBusOrchestrationService-GetSessionState-EmptyState",
                    session.SessionId,
                    $"Session '{session.SessionId}' has null or empty state ({state?.Length ?? 0} bytes).");
            }

            using (Stream rawSessionStream = state != null ? new MemoryStream(state) : null)
            {
                this.ServiceStats.OrchestrationDispatcherStats.SessionGets.Increment();
                return await RuntimeStateStreamConverter.RawStreamToRuntimeState(rawSessionStream, session.SessionId, orchestrationServiceBlobStore, JsonDataConverter.Default);
            }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

change warning to informational
@solankisamir solankisamir force-pushed the sasolank/case-invariant-dictionary branch from 05189a0 to a59d98c Compare April 2, 2026 03:23
@cgillum
Copy link
Copy Markdown
Member

cgillum commented Apr 2, 2026

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@nytian nytian merged commit a5567a9 into Azure:main Apr 2, 2026
45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants