
Conversation

@MajikalExplosions MajikalExplosions commented Nov 20, 2025

Summary

Implements multimodal (image + text) fine-tuning support, enabling 9 image datasets to preserve visual information so vision-language models can be trained on the images themselves rather than on text descriptions of them.

Implementation

schema/observation/image.py: Schema enhancements that:

  • Add optional fields to ImageAnnotation: content_description, clickable, editable
  • Maintain backward compatibility (all fields default to None)
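
A minimal sketch of the new fields, assuming ImageAnnotation is a Pydantic model (the base class and existing fields are not shown in this PR description, so they are elided here):

from pydantic import BaseModel

class ImageAnnotation(BaseModel):
    # ... existing fields unchanged ...

    # New optional fields; defaulting to None keeps existing data valid.
    content_description: str | None = None  # human-readable label for the element
    clickable: bool | None = None           # None means "unknown", not "not clickable"
    editable: bool | None = None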

agents/openhands/std_to_sft.py: Multimodal SFT conversion that:

  • Inserts <image> tokens in conversation text at appropriate positions
  • Tracks image paths via internal _image_path metadata
  • Converts annotations to a human-readable format with interactivity indicators (e.g., [clickable], [editable])
  • Generates LLaMA Factory-compatible output with a separate images array (see the sketch after this list)
  • Supports both file paths and base64-encoded images as input (even though the schema explicitly requires paths)
  • Handles nested images in WebObservations
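
A hypothetical sketch of the two pieces above; the function names and record layout are illustrative rather than the actual std_to_sft.py API, and only the [clickable]/[editable] tags, the <image> placeholder convention, and the separate images array come from the description:

def annotation_to_text(ann) -> str:
    # Render an annotation as human-readable text with interactivity indicators.
    parts = [ann.content_description or "unlabeled element"]
    if ann.clickable:
        parts.append("[clickable]")
    if ann.editable:
        parts.append("[editable]")
    return " ".join(parts)

def build_sft_record(conversations: list[dict], image_paths: list[str]) -> dict:
    # Assemble a LLaMA Factory-style multimodal record; the conversation text is
    # expected to already contain one <image> token per entry in image_paths.
    return {"conversations": conversations, "images": image_paths}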

Dataset converters (raw_to_standardized.py): Updated the converters for 9 image datasets.

  • android_in_the_wild: Intelligent two-pass conversion that analyzes action sequences to infer clickable/editable UI elements (inference logic from the source repo), then generates the trajectory with enriched annotations (a rough sketch of the two-pass idea follows this list)
  • androidcontrol: Populates content_description with resource ID/hint/tooltip, and clickable/editable from raw data fields
  • llava_plus: Regenerated samples only
  • omniact: Moves semantic labels from text to content_description, marks all elements as clickable=True
  • webarena_successful: Regenerated samples only
  • weblinx: Regenerated samples only
  • wonderbread: Populates content_description with XPath information
  • go-browse-wa: Regenerated samples only
  • openhands: Converts base64 screenshots to files, saving them under the datasets/openhands/screenshots/{trajectory_id}/ directory structure (also sketched after this list)
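
Two rough sketches of the ideas above. First, the two-pass inference for android_in_the_wild; the raw field names (action_type, target_element_id) are assumptions about the source data, not the actual converter code:

def infer_interactivity(steps: list[dict]) -> tuple[set, set]:
    # Pass 1: scan the action sequence to learn which elements were clicked or typed into.
    clicked, edited = set(), set()
    for step in steps:
        if step["action_type"] == "click":
            clicked.add(step["target_element_id"])
        elif step["action_type"] == "type":
            edited.add(step["target_element_id"])
    # Pass 2 would mark these element IDs as clickable/editable in the emitted annotations.
    return clicked, edited

Second, the base64-to-file step for openhands; the target directory layout comes from the description above, while the helper name and file naming are illustrative:

import base64
from pathlib import Path

def save_screenshot(b64_data: str, trajectory_id: str, step_index: int) -> str:
    out_dir = Path("datasets/openhands/screenshots") / trajectory_id
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"step_{step_index:04d}.png"
    out_path.write_bytes(base64.b64decode(b64_data))
    return str(out_path)  # the standardized schema stores file paths, not base64 blobs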

Testing

In progress; some verification has been performed on the output.

Notes

  • Output matches LLaMA Factory's multimodal requirements: the number of <image> tokens equals the length of the images array (a quick check is sketched below)
  • Tests are forthcoming.
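
A quick consistency check one could run on each generated record (a sketch assuming a ShareGPT-style conversations/value layout, not the project's actual tests):

def check_image_token_count(record: dict) -> None:
    n_tokens = sum(turn["value"].count("<image>") for turn in record["conversations"])
    assert n_tokens == len(record["images"]), (
        f"expected {len(record['images'])} <image> tokens, found {n_tokens}"
    )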

@MajikalExplosions MajikalExplosions marked this pull request as draft November 20, 2025 18:18

neubig commented Nov 21, 2025

@MajikalExplosions could you check the failing pre-commit checks and Python unit tests? Thanks!


@neubig neubig left a comment

submitting a review comment so I pop it off my review stack, but please re-request review when tests are passing!

@neubig neubig left a comment

A few comments!

openhands-ai bot commented Dec 7, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit Checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #158 at branch `android-in-the-wild-images`

Feel free to include any additional details that might help me get this PR into a better state.

@neubig neubig left a comment

Hey, basically looks good! One comment, though: I think we should try to fix the schema to support the things we need instead of using getattr and hasattr.


# Handle nested image observation
image_path = None
if hasattr(event, "image_observation") and event.image_observation:

I'm a bit confused, where would this nested image observation come from? I didn't see it in other parts of the code.

In general, "getattr" and "hasattr" are kinda anti-patterns in Python programming. They are indicative of not strictly adhering to type definitions, and can cause all kinds of tricky runtime errors. Let's try to write this without using these.
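
For illustration, a minimal sketch of the typed alternative being suggested here: declare the nested observation on the schema so the check becomes a plain None test (class and field names are assumptions, not the repo's actual definitions):

from pydantic import BaseModel

class ImageObservation(BaseModel):
    image_path: str | None = None

class WebObservation(BaseModel):
    # Declaring the field explicitly lets type checkers see it and removes the
    # need for hasattr/getattr in the converter.
    image_observation: ImageObservation | None = None

# The conversion code can then rely on the type:
# image_path = event.image_observation.image_path if event.image_observation else None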
