
[Feature] VLLM support for VLM training #642

@HwVanICI

Description

Checklist

  • This feature will maintain backward compatibility with the current APIs in
    areal/api/. If not, please raise a refactor issue first.

Motivation

The current VLLM inference engine integration supports LLM training only; SGLang is currently the only inference backend that supports VLM training. This is a critical bottleneck when training VLMs on NPUs, where VLLM-Ascend is the only viable inference engine. Adding VLM support to the VLLM inference engine would remove this bottleneck on NPUs and also offer more flexibility when training on GPUs.

This limitation also prevents production environments that rely on VLLM-Ascend from running efficient VLM training, especially with multi-modal data that combines vision and language inputs.


Problem

Currently, the VLLM engine's /v1/completions endpoint only accepts text input, which is insufficient for multi-modal tasks that require both vision and language processing. In contrast, SGLang's /generate endpoint handles both text and multi-modal inputs, making it more flexible for VLM use cases. VLLM therefore needs to be adapted to accept multi-modal inputs so it can be used for VLM training, particularly on NPUs.
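
For reference, here is a minimal sketch contrasting the two request shapes, following the OpenAI-compatible API format that VLLM serves. The model name, prompt text, and image placeholder are illustrative only, not values taken from the areal codebase:

```python
# Text-only request shape accepted by /v1/completions (current VLLM path).
completions_payload = {
    "model": "qwen2-vl-7b-instruct",  # placeholder model name
    "prompt": "How many objects are in the scene?",  # plain text or token IDs
    "max_tokens": 512,
    "temperature": 1.0,
}

# Multi-modal request shape accepted by /v1/chat/completions: the prompt is
# expressed as `messages`, and each message's content can mix text and images.
chat_completions_payload = {
    "model": "qwen2-vl-7b-instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "data:image/png;base64,<BASE64_IMAGE>"},
                },
                {"type": "text", "text": "How many objects are in the scene?"},
            ],
        }
    ],
    "max_tokens": 512,
    "temperature": 1.0,
}
```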


Proposed Solution

To enable multi-modal functionality in the VLLM inference engine, several modifications are required:

  1. Change the VLLM Inference Endpoint:

    • The /v1/completions endpoint currently used by VLLM only accepts text input. For multi-modal support, it should be replaced with the /v1/chat/completions endpoint, which can handle multi-modal inputs (e.g., combined text and image data).
  2. Adjust the Payload Structure:

    • Unlike /v1/completions, the /v1/chat/completions endpoint does not accept input_ids as a prompt. It instead requires a messages field whose structured content carries the multi-modal data (text and images); see the payload sketch after this list.
  3. Update Relevant Code Files:
    The following files need to be modified to support multi-modal VLLM inference:

    • areal/engine/vllm_remote.py – Adjust endpoint and payload handling logic.
    • areal/workflow/vision_rlvr.py – Modify workflow logic to support multi-modal inputs with the new VLLM endpoint.
    • example/vlm/clevr_count_70k_grpo.py – Update example scripts to work with the new VLLM endpoint and input structure.
    • example/vlm/clevr_count_70k_grpo.yaml – Modify configuration to reflect the changes in the VLLM endpoint and input type.
    • areal/api/io_struct.py – Update data structures to accommodate the new messages format and multi-modal input handling.
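
As a rough sketch of the payload change in areal/engine/vllm_remote.py, the request builder could switch from a prompt-based body to a messages-based one along these lines. The function build_chat_payload, its parameters, the model name, and the server address are hypothetical names used for illustration, not existing areal or VLLM APIs; only the request body format follows the OpenAI-compatible chat endpoint that VLLM exposes:

```python
import base64
from typing import Any, Optional


def build_chat_payload(
    model: str,
    text: str,
    image_bytes: Optional[bytes] = None,
    max_new_tokens: int = 512,
    temperature: float = 1.0,
) -> dict[str, Any]:
    """Build a /v1/chat/completions request body (hypothetical helper).

    Images are passed as base64 data URLs inside the message content,
    which the OpenAI-compatible chat endpoint accepts for VLM models.
    """
    content: list[dict[str, Any]] = []
    if image_bytes is not None:
        b64 = base64.b64encode(image_bytes).decode("utf-8")
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
        )
    content.append({"type": "text", "text": text})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_new_tokens,
        "temperature": temperature,
    }


# Usage sketch: POST to /v1/chat/completions instead of /v1/completions.
# The server address is an assumption; use the actual rollout server URL.
if __name__ == "__main__":
    import requests

    with open("clevr_example.png", "rb") as f:
        payload = build_chat_payload(
            model="qwen2-vl-7b-instruct",
            text="How many objects are in the image?",
            image_bytes=f.read(),
        )
    resp = requests.post(
        "http://127.0.0.1:8000/v1/chat/completions", json=payload, timeout=60
    )
    print(resp.json()["choices"][0]["message"]["content"])
```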

Expected Outcome

With these changes, the VLLM inference engine will be able to process multi-modal inputs, enabling efficient training of Vision-Language Models on both NPUs and GPUs. This improves the flexibility and scalability of VLM training workflows, especially in environments where VLLM-Ascend is the only available inference engine for NPUs.
