Potential answer-distribution bias in several EventBench tasks

Hi, thank you for releasing EventBench.

While analyzing the annotations, we noticed potential answer-distribution bias in several tasks:

- **Detailed Understanding (DU)**: The correct answer is heavily concentrated on option **B**. A fixed-answer baseline that always selects B gets **257/483 = 53.21%** accuracy. We also noticed that this task appears to mix questions with different numbers of options, including 3-choice and 5-choice questions, so the random-chance level is not the same for every question.
- **Causal Reasoning (CR)**: The correct answer is heavily concentrated on option **A**. A fixed-answer baseline that always selects A gets **119/180 = 66.11%** accuracy.
- **Spatial Relationship (SR)**: The label distribution is imbalanced. A fixed-answer baseline that always predicts **left** gets **109/301 = 36.21%** accuracy.
- **Object Counting (OC)**: The count distribution is strongly imbalanced. A fixed-answer baseline that always predicts **2** gets **253/414 = 61.11%** accuracy.

These fixed-answer baselines suggest that a model can score well above what the task design implies without relying on event understanding at all: always giving the most frequent answer already reaches 53% to 66% accuracy on DU, OC, and CR.
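
For reference, here is a minimal sketch of how a fixed-answer baseline like this can be computed. It assumes the gold labels of one task are available as a flat list; the function name `fixed_answer_baseline` and the toy labels are illustrative and not part of EventBench. The last loop only re-derives the percentages from the counts quoted in this issue.

```python
from collections import Counter


def fixed_answer_baseline(labels):
    """Return (answer, hits, total, accuracy) for always predicting the most frequent label.

    `labels` is assumed to be a non-empty list of gold answers for one task.
    On ties, Counter.most_common keeps the label that was seen first.
    """
    counts = Counter(labels)
    answer, hits = counts.most_common(1)[0]
    total = len(labels)
    return answer, hits, total, hits / total


# Toy labels for illustration only (not real EventBench annotations).
toy_labels = ["B", "A", "B", "C", "B"]
answer, hits, total, acc = fixed_answer_baseline(toy_labels)
print(f"always predict {answer}: {hits}/{total} = {100 * acc:.2f}%")

# Re-derive the percentages from the counts reported in this issue.
reported = {"DU": (257, 483), "CR": (119, 180), "SR": (109, 301), "OC": (253, 414)}
for task, (task_hits, task_total) in reported.items():
    print(f"{task}: {task_hits}/{task_total} = {100 * task_hits / task_total:.2f}%")
```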

If these issues are confirmed after checking the annotations, would it be possible to release an updated version of the benchmark? For the multiple-choice tasks (DU and CR), one possible fix is to shuffle the answer options of each question and regenerate the questions/annotations accordingly, so that the position of the correct option is more evenly distributed.
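
A rough sketch of the shuffling idea, assuming each multiple-choice item stores its options as a list plus the index of the correct one. This layout and the helper name `shuffle_options` are our assumptions, not the actual EventBench annotation format. A seeded `random.Random` keeps the regenerated file reproducible.

```python
import random
from collections import Counter


def shuffle_options(options, answer_index, rng):
    """Shuffle the options and return (shuffled_options, new_answer_index)."""
    order = list(range(len(options)))
    rng.shuffle(order)  # order[k] is the original index now shown at position k
    shuffled = [options[i] for i in order]
    new_answer_index = order.index(answer_index)
    return shuffled, new_answer_index


rng = random.Random(0)  # fixed seed so the shuffled benchmark can be regenerated

# Toy items that all have the correct answer in position 0 ("A").
items = [(["correct", "wrong 1", "wrong 2", "wrong 3"], 0) for _ in range(1000)]

positions = Counter()
for options, answer_index in items:
    shuffled, new_index = shuffle_options(options, answer_index, rng)
    assert shuffled[new_index] == options[answer_index]  # answer text is preserved
    positions["ABCD"[new_index]] += 1

# After shuffling, the correct option should be spread roughly evenly over A-D.
print(dict(positions))
```

If the option letters are also embedded in the question text, the prompt would need to be rebuilt from the shuffled list, which is presumably why the questions themselves would have to be regenerated.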

Thanks again for your work on this benchmark.
