Description
Checklist
- This feature will maintain backward compatibility with the current APIs in `areal/api/`. If not, please raise a refactor issue first.
Motivation
The current implementation of the VLLM inference engine is limited to text-only LLM training; at present, SGLang is the only supported inference framework for VLM training. This is a critical bottleneck when training VLMs on NPUs, where VLLM-Ascend is the only viable inference engine. Adding multi-modal support to the VLLM inference engine will provide better flexibility and optimization for VLM training, particularly on NPUs, and will also offer more flexibility when training on GPUs.
This limitation restricts the ability to leverage the full capabilities of VLLM-Ascend in production environments that require efficient VLM training, especially when working with multi-modal data that combines vision and language inputs.
Problem
Currently, the VLLM engine is driven through the `/v1/completions` endpoint, which only accepts text input and is therefore insufficient for multi-modal tasks that require both vision and language processing. In contrast, SGLang's `/generate` endpoint is designed to handle both text and multi-modal inputs, making it more flexible for VLM use cases. VLLM therefore needs to be adapted to accept multi-modal inputs for VLM training compatibility, particularly on NPUs. The two request shapes are sketched below.
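To illustrate the gap, here is a minimal sketch of the two request shapes. Field names follow the public vLLM OpenAI-compatible API and SGLang's native `/generate` API as commonly documented; the model name, image placeholder, and exact sampling fields are illustrative and not taken from AReaL's code.

```python
# Text-only request shape for VLLM's OpenAI-compatible /v1/completions endpoint.
# The prompt may be a string or a list of token IDs, but there is no image field.
vllm_completions_payload = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",   # placeholder model name
    "prompt": [151644, 872, 198],             # input_ids-style token prompt
    "max_tokens": 128,
    "temperature": 1.0,
}

# SGLang's /generate endpoint accepts text (or input_ids) together with image
# data, which is what currently makes it usable for VLM rollouts.
sglang_generate_payload = {
    "text": "<image>How many objects are in the picture?",
    "image_data": "data:image/png;base64,<BASE64_IMAGE>",  # placeholder image
    "sampling_params": {"max_new_tokens": 128, "temperature": 1.0},
}
```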
Proposed Solution
To enable multi-modal functionality in the VLLM inference engine, several modifications are required:
- Change the VLLM Inference Endpoint:
  - The `/v1/completions` endpoint currently used by VLLM only accepts text input. For multi-modal support, it should be replaced with the `/v1/chat/completions` endpoint, which can handle multi-modal inputs (e.g., combined text and image data); a sketch of the new payload follows this list.
- Adjust the Payload Structure:
  - The `/v1/chat/completions` endpoint does not accept `input_ids` as a `prompt` (which is what the `/v1/completions` endpoint uses). Instead, it requires a `messages` field in the payload, a structured format that supports multi-modal data (text and images).
- Update Relevant Code Files:
  The following files need to be modified to support multi-modal VLLM inference:
  - `areal/engine/vllm_remote.py` – adjust endpoint and payload handling logic.
  - `areal/workflow/vision_rlvr.py` – modify the workflow logic to support multi-modal inputs with the new VLLM endpoint.
  - `example/vlm/clevr_count_70k_grpo.py` – update the example script to work with the new VLLM endpoint and input structure.
  - `example/vlm/clevr_count_70k_grpo.yaml` – modify the configuration to reflect the new VLLM endpoint and input type.
  - `areal/api/io_struct.py` – update data structures to accommodate the new `messages` format and multi-modal input handling (a possible shape is sketched after this list).
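As a rough sketch of the proposed request shape (assuming the standard OpenAI-compatible chat schema that VLLM serves; the model name and image path are placeholders, and the exact fields AReaL ends up sending may differ):

```python
import base64


def encode_image(path: str) -> str:
    """Base64-encode an image file so it can be embedded in a data URL."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# Multi-modal request for the OpenAI-compatible /v1/chat/completions endpoint:
# instead of a token-ID prompt, the payload carries a `messages` list whose
# content parts mix text and image data.
chat_completions_payload = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder model name
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How many objects are in the picture?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{encode_image('scene.png')}"
                    },
                },
            ],
        }
    ],
    "max_tokens": 128,
    "temperature": 1.0,
}
```

The key change is that images travel inside the `messages` content parts rather than alongside a token-ID prompt.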
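For the `areal/api/io_struct.py` change, one possible (purely hypothetical) shape for a request structure that carries `messages` instead of pre-tokenized `input_ids` is sketched below; the class and field names are illustrative and not taken from the repository:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class VLMRequest:
    """Hypothetical request structure (not the actual io_struct.py definition):
    carries OpenAI-style `messages` instead of pre-tokenized input_ids so that
    text and image content can be forwarded to /v1/chat/completions."""

    rid: str
    messages: list[dict[str, Any]] = field(default_factory=list)
    max_new_tokens: int = 128
    temperature: float = 1.0
```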
Expected Outcome
With these changes, the VLLM inference engine will be capable of processing multi-modal inputs, enabling efficient training of Vision-Language Models on both NPUs and GPUs. This improvement will enhance the flexibility and scalability of VLM training workflows, especially in environments where VLLM-Ascend is the only available inference engine for NPUs.