feat: Add Alpamayo end-to-end Expert diffusion support#67

Open
Turoad wants to merge 1 commit into NVIDIA:main from Turoad:feature/alpamayo-expert-e2e

Conversation


@Turoad Turoad commented Apr 13, 2026

What does this PR do?

Type of change: New feature

Overview: Add end-to-end TensorRT inference for Alpamayo 1.5 (10B VLM + Diffusion Expert) autonomous driving model. After VLM decode produces a KV cache, the Expert runner performs 10-step flow-matching Euler integration to generate 6 candidate trajectories — all on GPU without host round-trips.

New files

  • AlpamayoExpertRunner (cpp/runtime/alpamayoExpertRunner.h/.cpp): Loads Expert TRT engine, manages GPU buffers (KV reshape, noisy action, timestep, attention mask), runs 10-step Euler diffusion
  • CUDA kernels (cpp/kernels/alpamayoExpertKernels/): kvCacheReshapeRepeat, buildPositionIds, fillTimestep, eulerUpdate
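As a rough illustration of what a kernel like kvCacheReshapeRepeat might compute, here is a CPU reference loop, assuming the kernel broadcasts the single-sequence VLM KV cache across the numCandidates diffusion sequences. The [heads, seqLen, headDim] layout and the function name are illustrative assumptions, not the engine's actual binding contract.

```cpp
#include <cstddef>
#include <vector>

// CPU reference for a repeat-across-candidates KV copy. Assumed layout:
//   input  [numHeads, seqLen, headDim]               (one sequence from VLM decode)
//   output [numCandidates, numHeads, seqLen, headDim]
// The real kernel would do this on-GPU; this loop nest documents the indexing only.
std::vector<float> kvRepeatForCandidates(const std::vector<float>& kv,
                                         std::size_t numHeads,
                                         std::size_t seqLen,
                                         std::size_t headDim,
                                         std::size_t numCandidates) {
    const std::size_t perSeq = numHeads * seqLen * headDim;
    std::vector<float> out(numCandidates * perSeq);
    for (std::size_t c = 0; c < numCandidates; ++c)
        for (std::size_t i = 0; i < perSeq; ++i)
            out[c * perSeq + i] = kv[i];
    return out;
}
```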

Modified files

  • llmInferenceRuntime: Expert integration after VLM decode, StopAfterEOS for <traj_future_start> token, single-seq and multi-seq candidate modes, KV cache dump for offline validation
  • llm_inference.cpp: CLI flags --expertEngine, --numCandidates, --numDiffusionSteps, --multiSeq, --dumpKVCache
  • QwenViTRunner: preprocessPreparedVisual() for pre-processed multi-camera images (bypasses runtime image decoding)
  • CMakeLists: curand linkage, SM 110 (Jetson Thor) architecture
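The CMakeLists change could look roughly like the fragment below; the target name and surrounding layout are illustrative, since the actual CMakeLists is not shown in this summary.

```cmake
# Target Jetson Thor (SM 110) alongside any architectures already listed.
set(CMAKE_CUDA_ARCHITECTURES "110")

# Link cuRAND for the noisy-action initialization used by the Expert runner.
find_package(CUDAToolkit REQUIRED)
target_link_libraries(llm_inference PRIVATE CUDA::curand)
```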

Validation

Validated on Jetson AGX Thor with 300 samples:

| Config | Mean minADE | Median minADE | Steady-state latency |
| --- | --- | --- | --- |
| FP8 VLM + BF16 Expert 6x (recommended) | 0.799 m | 0.637 m | ~2.83 s/sample |
| PyTorch reference | 0.827 m | 0.700 m | 3.62 s/sample |

Usage

llm_inference \
  --engineDir <vlm_engine> \
  --multimodalEngineDir <visual_engine> \
  --expertEngine <expert_trt_engine> \
  --numCandidates 6 \
  --numDiffusionSteps 10 \
  --inputFile input.json \
  --outputFile output.json

🚀 Pull Request Checklist

✅ Pre-commit Checks

  • Code formatted with clang-format (style=file)
  • codespell passed (0 errors)
  • License headers added (SPDX Apache-2.0)

🧪 Tests

  • Compiled and tested on Jetson AGX Thor (ARM64, CUDA 12.8, TRT 10.x)
  • 300-sample precision validation (minADE aligned with PyTorch reference)
  • Speed validation (no regression vs baseline)

📄 Documentation

  • CLI flags documented in commit message

⚙️ Compatibility

  • Backward compatible — Expert features are opt-in via --expertEngine flag
  • No changes to existing inference paths when Expert is not configured

Additional Information

Related issue: #32

Add end-to-end TensorRT inference for Alpamayo 1.5 (10B VLM + Diffusion
Expert) autonomous driving model. After VLM decode produces a KV cache,
the Expert runner performs 10-step flow-matching Euler integration to
generate 6 candidate trajectories — all on GPU without host round-trips.

New files:
- AlpamayoExpertRunner: loads Expert TRT engine, manages GPU buffers
  (KV reshape, noisy action, timestep, attention mask), runs diffusion
- CUDA kernels: kvCacheReshapeRepeat, buildPositionIds, fillTimestep,
  eulerUpdate

Modified files:
- llmInferenceRuntime: Expert integration after VLM decode, StopAfterEOS
  for traj_future_start token, single-seq and multi-seq candidate modes,
  KV cache dump for offline validation
- llm_inference.cpp: CLI flags --expertEngine, --numCandidates,
  --numDiffusionSteps, --multiSeq, --dumpKVCache
- QwenViTRunner: preprocessPreparedVisual for pre-processed multi-camera
  images (bypasses runtime image decoding)
- CMakeLists: curand linkage, SM 110 (Jetson Thor) architecture

Validated on Jetson AGX Thor with 300 samples: FP8 VLM + BF16 Expert 6x
steady-state ~2.83s/sample, mean minADE 0.799m, median 0.637m (aligned
with PyTorch reference mean 0.827m, median 0.700m).

Signed-off-by: thor <thor@nvidia.com>
@Turoad Turoad requested a review from a team April 13, 2026 12:19
@nvluxiaoz
Collaborator

Thanks a lot for this PR! The core team is also working on supporting Alpamayo in a new release. We will take this PR as a reference and properly cite this great contribution.

@Turoad
Author

Turoad commented Apr 21, 2026

Thanks for the response! Glad to hear the core team is also working on Alpamayo support.

I'd love to collaborate rather than have parallel efforts — a few thoughts:

  1. This PR is validated: 300-sample precision benchmark on Jetson AGX Thor, steady-state 2.83s/sample, aligned with PyTorch reference. Happy to share the full test harness and dataset scripts if helpful.

  2. Happy to adapt: If the core team's implementation has a different architecture or API design, I'm glad to refactor this PR to align with your internal conventions — just let me know the target structure.

  3. Incremental merge? If the full PR is too large to review at once, I can split it into smaller pieces (e.g., kernels first, then ExpertRunner, then integration).

What would make this most useful for the team — merge as-is, adapt to your internal branch, or contribute specific pieces? I'm flexible on the path, just want to make sure the validated work doesn't go to waste.

@genie-ahughes

@Turoad I'd love the test harness and dataset scripts you offered above. Trying to reproduce your end-to-end pipeline on a Jetson AGX Thor against nvidia/Alpamayo-R1-10B. Anywhere you can drop them (gist, branch on the fork, attachment) would be hugely appreciated — particularly the action-expert ONNX export script so the engine I/O matches your alpamayoExpertRunner.cpp contract (single fused kv_cache binding etc.). Thanks!
