Recommending a New Research Benchmark to Better Showcase Agent Capabilities

### Problem Statement

Auto research agents are developing rapidly, and systems capable of multi-step reasoning, coding, and tool use are increasingly impressive.

However, benchmarks that rigorously evaluate whether these agents can truly perform *end-to-end scientific research* are still lacking.

Most existing benchmarks focus on:
- knowledge recall,
- reasoning tasks,
- or code generation,

but they do not capture the full research workflow: understanding raw data, analyzing it, and producing paper-level conclusions.

As a result, it is still unclear:
- whether current agents can genuinely reproduce scientific findings,
- how different research agents compare under a unified setting,
- and what gaps remain between current systems and real-world research capability.

### Proposed Solution

We would like to suggest trying **ResearchClawBench**, a benchmark specifically designed for evaluating auto research agents.

It introduces a two-stage evaluation framework:

- **Stage 1 — Autonomous Research**  
  The agent is given raw datasets, task instructions, and references, and must independently perform data analysis, coding, visualization, and report writing.

- **Stage 2 — Paper-level Evaluation**  
  The generated report is compared against a real published paper using expert-designed checklists (rubrics) and an LLM-based judge.
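
To make the two stages concrete, here is a minimal sketch of how the flow could be wired together. Everything in it (`Task`, `ChecklistItem`, `run_two_stage`, the stub agent, and the keyword-based stub judge) is a hypothetical illustration and not ResearchClawBench's actual API; a real setup would call a research agent in Stage 1 and an LLM-based judge in Stage 2.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Task:
    """Inputs handed to the agent in Stage 1 (hypothetical structure)."""
    instructions: str
    dataset_paths: Sequence[str]
    references: Sequence[str]


@dataclass
class ChecklistItem:
    """One expert-written rubric entry (hypothetical structure)."""
    criterion: str


# Stage 1: an agent turns a task into a written report.
Agent = Callable[[Task], str]
# Stage 2: a judge scores the report against one checklist item.
Judge = Callable[[str, ChecklistItem], float]


def run_two_stage(task: Task, agent: Agent,
                  checklist: Sequence[ChecklistItem],
                  judge: Judge) -> List[float]:
    """Run Stage 1 once, then score the report on every checklist item."""
    report = agent(task)
    return [judge(report, item) for item in checklist]


# Stand-ins for illustration only; they do not analyze data or call an LLM.
def stub_agent(task: Task) -> str:
    return "We fit a linear model and report the slope."


def stub_judge(report: str, item: ChecklistItem) -> float:
    # Toy rule: 50 if the criterion text appears in the report, else 0.
    return 50.0 if item.criterion in report else 0.0


task = Task("Estimate the trend.", ["data.csv"], ["paper.pdf"])
checklist = [ChecklistItem("linear model"), ChecklistItem("confidence interval")]
scores = run_two_stage(task, stub_agent, checklist, stub_judge)
```

Keeping the agent and the judge as plain callables is one way to let different systems be swapped in under the same tasks and rubrics.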

The scoring is calibrated such that:
- a score of ~50 corresponds to reproducing the original paper (Re-Discovery)
- scores above that indicate surpassing the original work (New Discovery)
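
As a rough illustration of how such a calibrated score might be read, the sketch below combines per-item checklist scores into one number and labels it relative to the ~50 anchor. The weighted mean, the `tolerance` band around 50, and the "below re-discovery" label are assumptions made for this example; the benchmark's own aggregation may differ.

```python
from typing import Optional, Sequence

# From the description above: ~50 means the original paper was reproduced.
REDISCOVERY_SCORE = 50.0


def weighted_mean(scores: Sequence[float],
                  weights: Optional[Sequence[float]] = None) -> float:
    """Combine per-item scores; equal weights are assumed if none are given."""
    if not scores:
        raise ValueError("at least one checklist score is required")
    if weights is None:
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("scores and weights must have the same length")
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def interpret(score: float, tolerance: float = 5.0) -> str:
    """Label a score relative to the ~50 anchor (tolerance is an assumption)."""
    if score > REDISCOVERY_SCORE + tolerance:
        return "new discovery"
    if score >= REDISCOVERY_SCORE - tolerance:
        return "re-discovery"
    return "below re-discovery"


overall = weighted_mean([40.0, 60.0, 70.0], weights=[1.0, 1.0, 2.0])
label = interpret(overall)
```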

The benchmark includes:
- 40 tasks across 10 scientific domains  
- real datasets and reproducible setups  
- fine-grained evaluation grounded in expert annotations  
- support for multiple agents and easy integration of custom systems  

We believe this setup may provide a more direct way to evaluate and demonstrate the research capabilities of agents.  
If relevant, it could be interesting to see how your system performs on this benchmark.

Links:
- https://internscience.github.io/ResearchClawBench-Home/
- https://github.com/InternScience/ResearchClawBench

https://github.com/user-attachments/assets/342a9d64-7b4b-4713-9931-297d9ccd11a2

### Alternatives Considered

_No response_

### Feature Area

AI / Chat / Agent

### Additional Context

_No response_

Recommending a New Research Benchmark to Better Showcase Agent Capabilities #41
