Skip to content

KAFKA-20314: Update GraalVM to version 25 to fix native image segfault on startup#22059

Open
sstremler wants to merge 3 commits intoapache:trunkfrom
sstremler:KAFKA-20314-update-graalvm-to-25
Open

KAFKA-20314: Update GraalVM to version 25 to fix native image segfault on startup#22059
sstremler wants to merge 3 commits intoapache:trunkfrom
sstremler:KAFKA-20314-update-graalvm-to-25

Conversation

@sstremler
Copy link
Copy Markdown
Contributor

Update GraalVM to version 25 to fix native image segfault on startup.

getpwuid is not thread-safe and caused intermittent segfaults (~2%) during startup of kafka-native. GraalVM 25 fixes this by calling getpwuid_r instead.

Reviewers: TaiJuWu tjwu1217@gmail.com, PoAn Yang payang@apache.org, Ken Huang s7133700@gmail.com, Chia-Ping Tsai chia7712@gmail.com

@github-actions github-actions bot added docker Official Docker image small Small PRs triage PRs from the community labels Apr 14, 2026
Copy link
Copy Markdown

@utafrali utafrali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core fix is correct and well-scoped: swapping the GraalVM base image tag from 21 to 25 directly addresses the getpwuid thread-safety segfault, and the Python test helper change cleanly adds local-archive support alongside the existing URL path. Two minor points: the validation ValueError gets swallowed by the broad except clause, and a one-line comment on the Dockerfile FROM line would document the minimum-version constraint for future maintainers.

shutil.copy(kafka_archive, f"{temp_dir_path}/kafka.tgz")
else:
raise ValueError("Either kafka_url or kafka_archive must be provided")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ValueError is raised inside the try block whose except: clause catches everything and re-raises as SystemError("Failed to run the tests"). That swallows the descriptive message, making it harder to diagnose when run_docker_tests is called programmatically without either argument.

Move the guard before the try:

def run_docker_tests(image, tag, kafka_url, image_type, kafka_archive=None):
    if not kafka_url and not kafka_archive:
        raise ValueError("Either kafka_url or kafka_archive must be provided")
    temp_dir_path = tempfile.mkdtemp()
    try:
        ...

Note: the archive_group = parser.add_mutually_exclusive_group(required=True) in __main__ already prevents both from being None when called from the CLI, so this guard only matters for programmatic callers. Keeping it outside the try block makes the intent clearer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread docker/native/Dockerfile
# limitations under the License.

FROM ghcr.io/graalvm/graalvm-community:21 AS build-native-image
FROM ghcr.io/graalvm/graalvm-community:25 AS build-native-image
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a short comment here would help future maintainers understand why this specific major version is required:

# GraalVM 25+ required: earlier versions call getpwuid (not thread-safe), causing
# intermittent segfaults (~2%) on startup. GraalVM 25 switches to getpwuid_r.
FROM ghcr.io/graalvm/graalvm-community:25 AS build-native-image

Without this, the next person who sees version 25 here has no clue whether downgrading to 21 would be safe.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@sstremler
Copy link
Copy Markdown
Contributor Author

The core fix is correct and well-scoped: swapping the GraalVM base image tag from 21 to 25 directly addresses the getpwuid thread-safety segfault, and the Python test helper change cleanly adds local-archive support alongside the existing URL path. Two minor points: the validation ValueError gets swallowed by the broad except clause, and a one-line comment on the Dockerfile FROM line would document the minimum-version constraint for future maintainers.

@utafrali Thank you for the review. I added the changes you suggested.

@github-actions github-actions bot removed the triage PRs from the community label Apr 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docker Official Docker image small Small PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants