Fix open-source filter for mini-SWE-agent-v2 on Verified leaderboard #56

Open
aorwall wants to merge 1 commit into master from fix/regenerate-leaderboards-os-model

Conversation

@aorwall
Member

aorwall commented Mar 27, 2026

Summary

Fixes SWE-bench/SWE-bench#544 — selecting Verified → mini-SWE-agent-v2 → Open source only showed no results because leaderboards.json was stale.

  • Root cause: os_model was false for 4 open-weight models (DeepSeek V3.2, GLM-5, Kimi K2.5, MiniMax M2.5) in the generated JSON, even though their metadata.yaml files in experiments/ correctly had os_model: true. The JSON had not been regenerated.
  • Fix: Regenerated leaderboards.json via python -m analysis.get_leaderboard from the experiments repo.
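
A minimal sketch of a staleness check that could flag this kind of drift before deploy. It assumes each generated entry is a dict with `name` and `os_model` keys and that the `os_model` values from the metadata.yaml files have already been loaded into a name-to-bool mapping. Both shapes and the `find_stale_os_model` helper are hypothetical and may not match the real layout of leaderboards.json or the experiments repo:

```python
def find_stale_os_model(metadata_by_name, leaderboard_entries):
    """Return names whose os_model in the generated JSON disagrees with metadata.

    metadata_by_name: hypothetical mapping of model name -> os_model bool.
    leaderboard_entries: hypothetical list of dicts with "name" and "os_model".
    """
    stale = []
    for entry in leaderboard_entries:
        name = entry["name"]
        # Only compare entries that have a metadata counterpart.
        if name in metadata_by_name and metadata_by_name[name] != entry["os_model"]:
            stale.append(name)
    return stale


# Illustrative data mirroring the situation described above:
# metadata says open-weight, the stale JSON says proprietary.
metadata = {
    "DeepSeek V3.2": True,
    "GLM-5": True,
    "Kimi K2.5": True,
    "MiniMax M2.5": True,
}
generated = [{"name": name, "os_model": False} for name in metadata]

stale_names = find_stale_os_model(metadata, generated)
print(stale_names)
```

With the stale data above, all four model names are reported. After regeneration the list would be expected to be empty.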

All changes from regeneration

  • os_model: false → true for DeepSeek V3.2, GLM-5, Kimi K2.5, MiniMax M2.5 (+ their Verified cross-listings)
  • model_release_date populated (was null for all bash-only/multilingual entries)
  • Duplicate "GPT 5.2 Codex" entry removed from bash-only and Multilingual
  • S3 logs URLs added for GPT 5.2 Codex entries
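
The duplicate-entry cleanup can be verified with a small check like the one below. This is only a sketch: it assumes entries within one leaderboard are dicts with a `name` key, and `duplicate_names` is a hypothetical helper, not part of the experiments repo:

```python
from collections import Counter


def duplicate_names(entries):
    """Return the sorted list of names that appear more than once."""
    counts = Counter(entry["name"] for entry in entries)
    return sorted(name for name, count in counts.items() if count > 1)


# Illustrative entries for a single leaderboard before regeneration.
before = [
    {"name": "GPT 5.2 Codex"},
    {"name": "GLM-5"},
    {"name": "GPT 5.2 Codex"},
]
dupes = duplicate_names(before)
print(dupes)
```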

Related: SWE-bench/experiments#433 (metadata name fix for GPT 5.2 Codex)

Test plan

  • Open https://www.swebench.com/index.html after deploy
  • Select Verified → mini-SWE-agent-v2 → Open source only → verify 4 models appear
  • Select Proprietary only → verify 9 models appear
  • Select All models → verify all 13 models appear
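
The expected counts in the test plan can be sketched as a filter over `os_model`. The frontend's actual filter code is not shown in this PR, so `filter_models`, the mode strings, and the placeholder proprietary names below are all assumptions for illustration:

```python
def filter_models(entries, mode):
    """Filter entries by a hypothetical mode: "open", "proprietary", or "all"."""
    if mode == "open":
        return [e for e in entries if e["os_model"]]
    if mode == "proprietary":
        return [e for e in entries if not e["os_model"]]
    if mode == "all":
        return list(entries)
    raise ValueError(f"unknown filter mode: {mode!r}")


# Four open-weight models named in this PR, plus nine placeholder
# proprietary entries (their names are made up for the example).
open_names = ["DeepSeek V3.2", "GLM-5", "Kimi K2.5", "MiniMax M2.5"]
entries = [{"name": name, "os_model": True} for name in open_names]
entries += [{"name": f"proprietary-{i}", "os_model": False} for i in range(9)]

counts = {mode: len(filter_models(entries, mode)) for mode in ("open", "proprietary", "all")}
print(counts)
```

If every entry has `os_model` set to false, as in the stale JSON, the "open" filter returns an empty list, which matches the reported symptom.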

@ofirpress
Member

> name corrected to "GPT-5-2 Codex"

hmm it should stay "GPT 5.2 Codex" i think

Fixes SWE-bench/SWE-bench#544 — the "Open source only" filter on the
Verified leaderboard for mini-SWE-agent-v2 showed no results because
four open-weight models had os_model incorrectly set to false.

Changes from regeneration:
- Fix os_model: false → true for DeepSeek V3.2, GLM-5, Kimi K2.5,
  MiniMax M2.5 (plus their Verified cross-listings)
- Populate model_release_date (was null for all bash-only/multilingual entries)
- Remove duplicate "GPT 5.2 Codex" entry, fix name to "GPT-5-2 Codex"
- Add S3 logs URLs for GPT-5-2 Codex entries
@aorwall aorwall force-pushed the fix/regenerate-leaderboards-os-model branch from 36b3d65 to 12dd14d Compare March 29, 2026 10:14
@aorwall
Member Author

aorwall commented Mar 29, 2026

> > name corrected to "GPT-5-2 Codex"
>
> hmm it should stay "GPT 5.2 Codex" i think

Updated SWE-bench/experiments#433

Development

Successfully merging this pull request may close these issues.

Verified Leaderboard for mini-SWE-agent-v2 treats all models as proprietary
