
Conversation

@MajikalExplosions MajikalExplosions commented Nov 20, 2025

Summary

Implements multimodal (image + text) fine-tuning support, enabling 9 image datasets to preserve visual information so vision-language models can be trained on the images themselves rather than on text descriptions of them.

Implementation

schema/observation/image.py: Schema enhancements that:

  • Add optional fields to ImageAnnotation: content_description, clickable, editable
  • Maintain backward compatibility (all fields default to None)
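
A minimal sketch of the new fields, assuming ImageAnnotation is a Pydantic model (the base class and existing fields are not shown in this PR description, so they are elided here):

from pydantic import BaseModel

class ImageAnnotation(BaseModel):
    # ... existing fields unchanged ...

    # New optional fields; defaulting to None keeps existing data valid.
    content_description: str | None = None  # human-readable label for the element
    clickable: bool | None = None           # None means "unknown", not "not clickable"
    editable: bool | None = None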

agents/openhands/std_to_sft.py: Multimodal SFT conversion that:

  • Inserts <image> tokens in conversation text at appropriate positions
  • Tracks image paths via internal _image_path metadata
  • Converts annotations to a human-readable format with interactivity indicators (e.g., [clickable], [editable])
  • Generates LLaMA Factory-compatible output with a separate images array (see the sketch after this list)
  • Supports both file paths and base64-encoded images as input (even though the schema explicitly requires paths)
  • Handles nested images in WebObservations
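
A hypothetical sketch of the two pieces above; the function names and record layout are illustrative rather than the actual std_to_sft.py API, and only the [clickable]/[editable] tags, the <image> placeholder convention, and the separate images array come from the description:

def annotation_to_text(ann) -> str:
    # Render an annotation as human-readable text with interactivity indicators.
    parts = [ann.content_description or "unlabeled element"]
    if ann.clickable:
        parts.append("[clickable]")
    if ann.editable:
        parts.append("[editable]")
    return " ".join(parts)

def build_sft_record(conversations: list[dict], image_paths: list[str]) -> dict:
    # Assemble a LLaMA Factory-style multimodal record; the conversation text is
    # expected to already contain one <image> token per entry in image_paths.
    return {"conversations": conversations, "images": image_paths}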

Dataset converters (raw_to_standardized.py): Updated the converters for 9 image datasets.

  • android_in_the_wild: Intelligent two-pass conversion that analyzes action sequences to infer clickable/editable UI elements (inference logic from the source repo), then generates the trajectory with enriched annotations (a rough sketch of the two-pass idea follows this list)
  • androidcontrol: Populates content_description with resource ID/hint/tooltip, and clickable/editable from raw data fields
  • llava_plus: Regenerated samples only
  • omniact: Moves semantic labels from text to content_description, marks all elements as clickable=True
  • webarena_successful: Regenerated samples only
  • weblinx: Regenerated samples only
  • wonderbread: Populates content_description with XPath information
  • go-browse-wa: Regenerated samples only
  • openhands: Converts base64 screenshots to files, saving them under the datasets/openhands/screenshots/{trajectory_id}/ directory structure (also sketched after this list)
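
Two rough sketches of the ideas above. First, the two-pass inference for android_in_the_wild; the raw field names (action_type, target_element_id) are assumptions about the source data, not the actual converter code:

def infer_interactivity(steps: list[dict]) -> tuple[set, set]:
    # Pass 1: scan the action sequence to learn which elements were clicked or typed into.
    clicked, edited = set(), set()
    for step in steps:
        if step["action_type"] == "click":
            clicked.add(step["target_element_id"])
        elif step["action_type"] == "type":
            edited.add(step["target_element_id"])
    # Pass 2 would mark these element IDs as clickable/editable in the emitted annotations.
    return clicked, edited

Second, the base64-to-file step for openhands; the target directory layout comes from the description above, while the helper name and file naming are illustrative:

import base64
from pathlib import Path

def save_screenshot(b64_data: str, trajectory_id: str, step_index: int) -> str:
    out_dir = Path("datasets/openhands/screenshots") / trajectory_id
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"step_{step_index:04d}.png"
    out_path.write_bytes(base64.b64decode(b64_data))
    return str(out_path)  # the standardized schema stores file paths, not base64 blobs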

Testing

In progress; some verification has been performed on the output.

Notes

  • Output matches LLaMA Factory's multimodal requirements: the number of <image> tokens equals the length of the images array (a quick check is sketched below)
  • Tests are forthcoming.
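
A quick consistency check one could run on each generated record (a sketch assuming a ShareGPT-style conversations/value layout, not the project's actual tests):

def check_image_token_count(record: dict) -> None:
    n_tokens = sum(turn["value"].count("<image>") for turn in record["conversations"])
    assert n_tokens == len(record["images"]), (
        f"expected {len(record['images'])} <image> tokens, found {n_tokens}"
    )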

@MajikalExplosions MajikalExplosions marked this pull request as draft November 20, 2025 18:18

neubig commented Nov 21, 2025

@MajikalExplosions could you check the failing pre-commit checks and Python unit tests? Thanks!


@neubig neubig left a comment

submitting a review comment so I pop it off my review stack, but please re-request review when tests are passing!

@neubig neubig left a comment

A few comments!

openhands-ai bot commented Dec 7, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit Checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #158 at branch `android-in-the-wild-images`

Feel free to include any additional details that might help me get this PR into a better state.

@neubig neubig left a comment

Hey, basically looks good! One comment, though: I think we should try to fix the schema to support the things we need instead of using getattr and hasattr.


# Handle nested image observation
image_path = None
if hasattr(event, "image_observation") and event.image_observation:

I'm a bit confused, where would this nested image observation come from? I didn't see it in other parts of the code.

In general, "getattr" and "hasattr" are kinda anti-patterns in Python programming. They are indicative of not strictly adhering to type definitions, and can cause all kinds of tricky runtime errors. Let's try to write this without using these.
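
For illustration, a minimal sketch of the typed alternative being suggested here: declare the nested observation on the schema so the check becomes a plain None test (class and field names are assumptions, not the repo's actual definitions):

from pydantic import BaseModel

class ImageObservation(BaseModel):
    image_path: str | None = None

class WebObservation(BaseModel):
    # Declaring the field explicitly lets type checkers see it and removes the
    # need for hasattr/getattr in the converter.
    image_observation: ImageObservation | None = None

# The conversion code can then rely on the type:
# image_path = event.image_observation.image_path if event.image_observation else None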
