Potential answer-distribution bias in several EventBench tasks #4

@HYLZ-2019

Hi, thank you for releasing EventBench.

While analyzing the annotations, we noticed potential answer-distribution bias in several tasks:

  • Detailed Understanding (DU): The correct answer is heavily concentrated on option B. A fixed-answer baseline that always selects B gets 257/483 = 53.21% accuracy. We also noticed that this task appears to mix questions with different numbers of options, including 3-choice and 5-choice questions.
  • Causal Reasoning (CR): The correct answer is heavily concentrated on option A. A fixed-answer baseline that always selects A gets 119/180 = 66.11% accuracy.
  • Spatial Relationship (SR): The label distribution is imbalanced. A fixed-answer baseline that always predicts left gets 109/301 = 36.21% accuracy.
  • Object Counting (OC): The count distribution is strongly imbalanced. A fixed-answer baseline that always predicts 2 gets 253/414 = 61.11% accuracy.

These fixed-answer baselines suggest that a model could score well on these tasks without any event understanding, especially on CR, DU, and OC, where a constant prediction already exceeds 50% accuracy.
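For anyone who wants to reproduce these numbers, here is a minimal sketch of how we computed the fixed-answer baseline. The function name `fixed_answer_baseline` and the flat list-of-labels input are our own assumptions for illustration and do not reflect EventBench's actual annotation format. The `reported` dictionary only restates the counts listed above.

```python
from collections import Counter


def fixed_answer_baseline(labels):
    """Score a constant predictor that always outputs the most frequent label.

    `labels` is a flat list of ground-truth answers for one task
    (hypothetical input format, not EventBench's actual schema).
    Returns (label, hits, total, accuracy).
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    label, hits = Counter(labels).most_common(1)[0]
    total = len(labels)
    return label, hits, total, hits / total


# Counts reported in this issue: task -> (fixed answer, hits, total).
reported = {
    "DU": ("B", 257, 483),
    "CR": ("A", 119, 180),
    "SR": ("left", 109, 301),
    "OC": ("2", 253, 414),
}

# Recompute each percentage from its raw counts.
reported_pct = {
    task: round(100 * hits / total, 2)
    for task, (_, hits, total) in reported.items()
}

for task, (answer, hits, total) in reported.items():
    print(f"{task}: always '{answer}' -> {hits}/{total} = {reported_pct[task]:.2f}%")
```

Running `fixed_answer_baseline` on the ground-truth answers of each task should give the same counts if our reading of the annotations is correct.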

If these issues are confirmed after checking the annotations, would it be possible to release an updated version of the benchmark? For the multiple-choice tasks, one possible fix is to shuffle the answer options and regenerate the questions/annotations so that the correct option positions are more balanced.
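As a rough sketch of the shuffling idea, the snippet below permutes the options of one question and tracks where the correct answer ends up. It assumes each question stores its options as a list plus the index of the correct one; the helper `shuffle_options` and that layout are hypothetical, so the real annotation files may need a different adapter.

```python
import random


def shuffle_options(options, correct_index, rng):
    """Return (shuffled_options, new_correct_index) for one question.

    `options` is the list of answer texts and `correct_index` the position
    of the correct one (assumed layout, not EventBench's actual schema).
    `rng` is a random.Random instance so the shuffle is reproducible.
    """
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct option moves to wherever its old index landed in `order`.
    return shuffled, order.index(correct_index)


rng = random.Random(0)  # fixed seed so the regenerated files are reproducible
options = ["option one", "option two", "option three"]  # placeholder texts
shuffled, new_index = shuffle_options(options, 0, rng)
letters = "ABCDE"
print(shuffled, "correct:", letters[new_index])
```

One caveat: options that refer to other options by position (for example "both A and B") would need manual review after shuffling.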

Thanks again for your work on this benchmark.
