Description
Checklist
- This feature will maintain backward compatibility with the current APIs in `areal/api/`. If not, please raise a refactor issue first.
Motivation
The current implementation of the VLLM inference engine is limited to text-only LLM training; at present, SGLang is the only supported inference framework for VLM training. This is a critical bottleneck when training VLMs on NPUs, where VLLM-Ascend is the only viable inference engine. Adding multi-modal support to the VLLM inference engine will provide better flexibility and optimization for VLM training, particularly on NPUs, and will also offer more flexibility when training on GPUs.
This limitation restricts the ability to leverage the full capabilities of VLLM-Ascend in production environments that require efficient VLM training, especially when working with multi-modal data that combines vision and language inputs.
Problem
Currently, the VLLM engine is driven through the `/v1/completions` endpoint, which only accepts text input and is therefore insufficient for multi-modal tasks that require both vision and language processing. In contrast, SGLang's `/generate` endpoint is designed to handle both text and multi-modal inputs, making it more flexible for VLM use cases. VLLM therefore needs to be adapted to accept multi-modal inputs for VLM training compatibility, particularly on NPUs. The two request shapes are sketched below.
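To illustrate the gap, here is a minimal sketch of the two request shapes. Field names follow the public vLLM OpenAI-compatible API and SGLang's native `/generate` API as commonly documented; the model name, image placeholder, and exact sampling fields are illustrative and not taken from AReaL's code.

```python
# Text-only request shape for VLLM's OpenAI-compatible /v1/completions endpoint.
# The prompt may be a string or a list of token IDs, but there is no image field.
vllm_completions_payload = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",   # placeholder model name
    "prompt": [151644, 872, 198],             # input_ids-style token prompt
    "max_tokens": 128,
    "temperature": 1.0,
}

# SGLang's /generate endpoint accepts text (or input_ids) together with image
# data, which is what currently makes it usable for VLM rollouts.
sglang_generate_payload = {
    "text": "<image>How many objects are in the picture?",
    "image_data": "data:image/png;base64,<BASE64_IMAGE>",  # placeholder image
    "sampling_params": {"max_new_tokens": 128, "temperature": 1.0},
}
```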
Proposed Solution
To enable multi-modal functionality in the VLLM inference engine, several modifications are required:
- Change the VLLM Inference Endpoint:
  - The `/v1/completions` endpoint currently used by VLLM only accepts text input. For multi-modal support, it should be replaced with the `/v1/chat/completions` endpoint, which can handle multi-modal inputs (e.g., combined text and image data); a sketch of the new payload follows this list.
- Adjust the Payload Structure:
  - The `/v1/chat/completions` endpoint does not accept `input_ids` as a `prompt` (which is what the `/v1/completions` endpoint uses). Instead, it requires a `messages` field in the payload, a structured format that supports multi-modal data (text and images).
- Update Relevant Code Files:
  The following files need to be modified to support multi-modal VLLM inference:
  - `areal/engine/vllm_remote.py` – adjust endpoint and payload handling logic.
  - `areal/workflow/vision_rlvr.py` – modify the workflow logic to support multi-modal inputs with the new VLLM endpoint.
  - `example/vlm/clevr_count_70k_grpo.py` – update the example script to work with the new VLLM endpoint and input structure.
  - `example/vlm/clevr_count_70k_grpo.yaml` – modify the configuration to reflect the new VLLM endpoint and input type.
  - `areal/api/io_struct.py` – update data structures to accommodate the new `messages` format and multi-modal input handling (a possible shape is sketched after this list).
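As a rough sketch of the proposed request shape (assuming the standard OpenAI-compatible chat schema that VLLM serves; the model name and image path are placeholders, and the exact fields AReaL ends up sending may differ):

```python
import base64


def encode_image(path: str) -> str:
    """Base64-encode an image file so it can be embedded in a data URL."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# Multi-modal request for the OpenAI-compatible /v1/chat/completions endpoint:
# instead of a token-ID prompt, the payload carries a `messages` list whose
# content parts mix text and image data.
chat_completions_payload = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder model name
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How many objects are in the picture?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{encode_image('scene.png')}"
                    },
                },
            ],
        }
    ],
    "max_tokens": 128,
    "temperature": 1.0,
}
```

The key change is that images travel inside the `messages` content parts rather than alongside a token-ID prompt.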
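For the `areal/api/io_struct.py` change, one possible (purely hypothetical) shape for a request structure that carries `messages` instead of pre-tokenized `input_ids` is sketched below; the class and field names are illustrative and not taken from the repository:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class VLMRequest:
    """Hypothetical request structure (not the actual io_struct.py definition):
    carries OpenAI-style `messages` instead of pre-tokenized input_ids so that
    text and image content can be forwarded to /v1/chat/completions."""

    rid: str
    messages: list[dict[str, Any]] = field(default_factory=list)
    max_new_tokens: int = 128
    temperature: float = 1.0
```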
Expected Outcome
With these changes, the VLLM inference engine will be capable of processing multi-modal inputs, enabling efficient training of Vision-Language Models on both NPUs and GPUs. This improvement will enhance the flexibility and scalability of VLM training workflows, especially in environments where VLLM-Ascend is the only available inference engine for NPUs.