@chamons
Contributor

@chamons chamons commented Dec 23, 2025

A MultiTopicConsumer contains many child Consumers, one for every partitioned topic, and when you poll the MultiTopicConsumer it polls all child consumers until a new message is ready.

This works great in non-error conditions, but in some rare cases, such as when one or more Pulsar brokers are restarted in sequence, a child consumer can return `Poll::Ready(Some(Err(e)))`.

The previous behavior was to assume the child consumer was dead and to remove it from the MultiTopicConsumer. This is reasonable if you are using the regex refresh feature, where topics will be discovered at runtime dynamically. However, this is completely the wrong behavior when you have a fixed topic list at startup. Silently dropping a child topic means your MultiTopicConsumer will never listen to that topic ever again.

There is no test included, as an automated test that shuts down the entire broker mid-run would not behave well locally or in CI. It was tested with the example included in the issue - https://github.com/chamons/pulsar-load-shed-repro
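
For readers who want to see the behavior change without a broker, here is a minimal model of the conditional removal described above. It is a sketch, not the actual patch: `ChildConsumer`, `MultiTopic`, `Item`, and `multi_topic` are hypothetical stand-ins, messages and errors are plain strings, and the real consumer polls asynchronous streams with a task context rather than a scripted list.

```rust
use std::collections::BTreeMap;
use std::task::Poll;

// Hypothetical simplified item type: messages and errors are plain strings.
type Item = Result<String, String>;

/// Stand-in for a child consumer that replays a scripted list of poll results.
struct ChildConsumer {
    script: Vec<Poll<Option<Item>>>,
}

impl ChildConsumer {
    fn poll_next(&mut self) -> Poll<Option<Item>> {
        if self.script.is_empty() {
            Poll::Pending
        } else {
            self.script.remove(0)
        }
    }
}

/// Stand-in for MultiTopicConsumer. `topic_regex` is `Some(..)` only when
/// topics are rediscovered at runtime.
struct MultiTopic {
    consumers: BTreeMap<String, ChildConsumer>,
    topic_regex: Option<String>,
}

impl MultiTopic {
    fn poll_next(&mut self) -> Poll<Option<Item>> {
        let mut failed: Option<(String, String)> = None;
        for (topic, consumer) in self.consumers.iter_mut() {
            match consumer.poll_next() {
                Poll::Ready(Some(Ok(msg))) => return Poll::Ready(Some(Ok(msg))),
                Poll::Ready(Some(Err(e))) => {
                    failed = Some((topic.clone(), e));
                    break;
                }
                Poll::Ready(None) | Poll::Pending => {}
            }
        }
        match failed {
            Some((topic, e)) => {
                // Only drop the child when a regex refresh can rediscover it.
                // With a fixed topic list the child stays and is polled again.
                if self.topic_regex.is_some() {
                    self.consumers.remove(&topic);
                }
                Poll::Ready(Some(Err(e)))
            }
            None => Poll::Pending,
        }
    }
}

/// Builds a model with one child that errors once and then yields a message.
fn multi_topic(topic_regex: Option<&str>) -> MultiTopic {
    let mut consumers = BTreeMap::new();
    consumers.insert(
        "topic-a".to_string(),
        ChildConsumer {
            script: vec![
                Poll::Ready(Some(Err("broker restarting".to_string()))),
                Poll::Ready(Some(Ok("a1".to_string()))),
            ],
        },
    );
    MultiTopic { consumers, topic_regex: topic_regex.map(String::from) }
}
```

In this model, `multi_topic(None)` surfaces the error on the first poll but keeps `topic-a`, so the second poll delivers its message. `multi_topic(Some("topic-.*"))` removes the child on the first error and relies on a later refresh to bring it back.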

…on errors if not using regex

- streamnative#375

@chamons
Contributor Author

chamons commented Dec 23, 2025

@BewareMyPower @freeznet - You've previously reviewed my PRs or have recent active development history. Would you be willing to review my PR?

@mdeltito

More a question for maintainers, but I do wonder if it's ever appropriate to remove a topic from the list when we encounter an "unexpected" error in this code path.

@chamons
Contributor Author

chamons commented Dec 23, 2025

@mdeltito - If you have a regex and that regex covers all of your topics, then I think removing it from the list is acceptable. In 30 seconds, or whatever your refresh timer is, you'll pick it up again.

Technically my fix would fail if you had a regex that didn't cover all of your topics, but it would be no worse than before the fix.
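
To make that edge case concrete, here is a rough sketch of a refresh pass. Everything in it is an assumption for illustration: `refresh` is a hypothetical helper, the `matches` closure stands in for a compiled regex, and the topic names are made up. It only shows that a removed topic comes back if the pattern matches it, and stays gone otherwise.

```rust
use std::collections::BTreeSet;

/// Hypothetical refresh step: subscribe to every broker topic that the
/// pattern matches. Topics already subscribed are left unchanged.
/// `matches` stands in for a compiled regex.
fn refresh(
    subscribed: &mut BTreeSet<String>,
    broker_topics: &[&str],
    matches: impl Fn(&str) -> bool,
) {
    for &topic in broker_topics {
        if matches(topic) {
            subscribed.insert(topic.to_string());
        }
    }
}
```

For example, if errors drop both `orders-1` and `audit`, and the pattern only matches topics starting with `orders-`, a refresh restores `orders-1` but never `audit`. That is the same outcome as before the fix for the uncovered topic.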

@BewareMyPower BewareMyPower merged commit 8a966f9 into streamnative:master Dec 24, 2025
12 checks passed

Copilot AI left a comment

Pull request overview

This PR fixes a critical bug where MultiTopicConsumer would silently drop topics from consumption when connection errors occurred. The issue manifested during Pulsar broker restarts, where child consumers would return errors and be permanently removed from the multi-topic consumer, even when using a fixed topic list (not regex-based discovery).

Key Changes:

  • Modified error handling to only remove topics from MultiTopicConsumer on errors when topic_regex is present
  • Added explanatory comment documenting the conditional removal logic

@mdeltito

@chamons Thanks! I agree it's acceptable, and I understand that topic_regex smooths over the problem of removing the topic from the list due to the refresh interval. I was mostly looking for some further insight into why removing topics from the list is necessary/desirable in such unexpected scenarios. It seems that continuing to poll the future for the consumer in this scenario would be appropriate regardless of topic_regex.

Also, thanks in general for digging into some of these tricky issues and sharing your findings!
