Hi, thank you for releasing EventBench.
While analyzing the annotations, we noticed potential answer-distribution bias in several tasks:
- Detailed Understanding (DU): The correct answer is heavily concentrated on option B. A fixed-answer baseline that always selects B gets 257/483 = 53.21% accuracy. We also noticed that this task appears to mix questions with different numbers of options, including 3-choice and 5-choice questions.
- Causal Reasoning (CR): The correct answer is heavily concentrated on option A. A fixed-answer baseline that always selects A gets 119/180 = 66.11% accuracy.
- Spatial Relationship (SR): The label distribution is imbalanced. A fixed-answer baseline that always predicts left gets 109/301 = 36.21% accuracy.
- Object Counting (OC): The count distribution is strongly imbalanced. A fixed-answer baseline that always predicts 2 gets 253/414 = 61.11% accuracy.
These fixed-answer baselines suggest that models may obtain non-trivial scores without relying on event understanding, especially for CR, DU, and OC, where always giving a single answer already exceeds 50% accuracy.
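For reference, the fixed-answer numbers above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the gold answers for one task have already been loaded into a plain list; the function name `majority_baseline` and the toy labels are ours for illustration and are not part of EventBench.

```python
from collections import Counter


def majority_baseline(gold_labels):
    """Return (label, hits, total, accuracy) for the best fixed-answer baseline."""
    if not gold_labels:
        raise ValueError("gold_labels must not be empty")
    # The most frequent gold label is the best answer to always give.
    label, hits = Counter(gold_labels).most_common(1)[0]
    total = len(gold_labels)
    return label, hits, total, hits / total


# Toy labels for illustration only; these are NOT EventBench annotations.
toy_gold = ["B", "A", "B", "C", "B", "B", "A", "B"]
label, hits, total, acc = majority_baseline(toy_gold)
print(f"always-{label}: {hits}/{total} = {acc:.2%}")  # always-B: 5/8 = 62.50%
```

Running the same function on each task's gold answers should give the counts we report, for example 257/483 for DU.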
If these issues are confirmed after checking the annotations, would it be possible to release an updated version of the benchmark? For the multiple-choice tasks, one possible fix is to shuffle the answer options and regenerate the questions/annotations so that the correct option positions are more balanced.
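A possible way to do the shuffling is sketched below. It assumes each multiple-choice item is stored as a list of option strings plus a zero-based index of the correct option; the actual EventBench annotation format may differ, and `shuffle_options` is a hypothetical helper name. It works for any number of options, so mixed 3-choice and 5-choice questions are handled the same way.

```python
import random


def shuffle_options(options, answer_index, rng):
    """Shuffle the options and return (shuffled_options, new_answer_index)."""
    order = list(range(len(options)))
    rng.shuffle(order)  # order[k] is the original index now shown at position k
    shuffled = [options[i] for i in order]
    new_answer_index = order.index(answer_index)
    return shuffled, new_answer_index


# Toy item for illustration only; this is NOT an EventBench question.
rng = random.Random(0)  # fixed seed so the regenerated benchmark is reproducible
options = ["the ball bounces", "the cup falls", "the door opens"]
shuffled, new_idx = shuffle_options(options, 1, rng)
print(shuffled, "correct:", "ABC"[new_idx])
```

If option letters are also embedded in the question text, that text would need to be regenerated from the shuffled list as well.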
Thanks again for your work on this benchmark.