
Recommending a New Research Benchmark to Better Showcase Agent Capabilities #41

@black-yt

Problem Statement

Auto research agents are developing rapidly, and recent systems show impressive multi-step reasoning, coding, and tool use.

However, benchmarks that rigorously evaluate whether these agents can perform end-to-end scientific research are still lacking.

Most existing benchmarks focus on:

  • knowledge recall,
  • reasoning tasks,
  • or code generation,

but they do not capture the full research workflow, from understanding raw data, through analysis, to producing paper-level conclusions.

As a result, it is still unclear:

  • whether current agents can genuinely reproduce scientific findings,
  • how different research agents compare under a unified setting,
  • and what gaps remain between current systems and real-world research capability.

Proposed Solution

We would like to suggest trying ResearchClawBench, a benchmark specifically designed for evaluating auto research agents.

It introduces a two-stage evaluation framework:

  • Stage 1 — Autonomous Research
    The agent is given raw datasets, task instructions, and references, and must independently perform data analysis, coding, visualization, and report writing.

  • Stage 2 — Paper-level Evaluation
    The generated report is compared against a real published paper using expert-designed checklists (rubrics) and an LLM-based judge.
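The two stages above can be sketched as a small harness. This is only an illustration of the described flow, not ResearchClawBench's actual interface: `Task`, `Agent`, `Judge`, and `evaluate` are hypothetical names, and the judge is assumed to return one numeric score per checklist item.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """Hypothetical container for the inputs the agent sees in Stage 1."""
    instructions: str
    dataset_paths: list[str]
    references: list[str]


# Hypothetical signatures (assumptions, not the benchmark's API):
# an agent maps a task to a written report, and a judge maps
# (report, checklist item) to a numeric score.
Agent = Callable[[Task], str]
Judge = Callable[[str, str], float]


def evaluate(task: Task, checklist: list[str],
             agent: Agent, judge: Judge) -> dict[str, float]:
    # Stage 1: the agent works independently and returns its report.
    # The checklist is deliberately not passed to the agent.
    report = agent(task)
    # Stage 2: the judge scores the report against each checklist item.
    return {item: judge(report, item) for item in checklist}
```

In practice `agent` would wrap a real research agent and `judge` an LLM call; any callable with the matching shape can be plugged in for a dry run.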

The scoring is calibrated such that:

  • a score of ~50 corresponds to reproducing the original paper (Re-Discovery)
  • higher scores indicate surpassing the original work (New Discovery)
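To make the calibration concrete, here is a minimal sketch of how per-item judge scores might be combined and read. The issue does not state the aggregation rule, so the weighted mean, the 0-100 item scale, the `ItemResult` structure, and the tolerance band around 50 are all assumptions made for illustration.

```python
from dataclasses import dataclass

REDISCOVERY_SCORE = 50.0  # per the issue: ~50 means the paper was reproduced


@dataclass
class ItemResult:
    """One judged checklist item (hypothetical structure)."""
    criterion: str
    weight: float  # relative importance; assumed to come from the experts
    score: float   # judge's score for this item; 0-100 scale is assumed


def overall_score(results: list[ItemResult]) -> float:
    """Weighted mean of item scores (an assumed aggregation rule)."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("checklist needs a positive total weight")
    return sum(r.weight * r.score for r in results) / total_weight


def interpret(score: float, tolerance: float = 5.0) -> str:
    """Map a score to the issue's labels; the tolerance band is an assumption."""
    if score > REDISCOVERY_SCORE + tolerance:
        return "New Discovery"
    if score >= REDISCOVERY_SCORE - tolerance:
        return "Re-Discovery"
    return "below Re-Discovery"
```

For example, two items weighted 2 and 1 with scores 50 and 80 average to 60, which this sketch would label "New Discovery".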

The benchmark includes:

  • 40 tasks across 10 scientific domains
  • real datasets and reproducible setups
  • fine-grained evaluation grounded in expert annotations
  • support for multiple agents and easy integration of custom systems

We believe this setup may provide a more direct way to evaluate and demonstrate the research capabilities of agents.
If relevant, it would be interesting to see how your system performs on this benchmark.

Links:

ResearchClawBench.mp4

Alternatives Considered

No response

Feature Area

AI / Chat / Agent

Additional Context

No response

Labels: enhancement (New feature or request)
