Fix open-source filter for mini-SWE-agent-v2 on Verified leaderboard #56
Open
Fixes SWE-bench/SWE-bench#544: the "Open source only" filter on the Verified leaderboard for mini-SWE-agent-v2 showed no results because four open-weight models had os_model incorrectly set to false.

Changes from regeneration:
- Fix os_model: false → true for DeepSeek V3.2, GLM-5, Kimi K2.5, MiniMax M2.5 (plus their Verified cross-listings)
- Populate model_release_date (was null for all bash-only/multilingual entries)
- Remove duplicate "GPT 5.2 Codex" entry, fix name to "GPT-5-2 Codex"
- Add S3 logs URLs for GPT-5-2 Codex entries
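To illustrate why a wrong flag empties the filtered view, here is a minimal sketch of an "open source only" filter. The entry shape (a list of dicts with `name` and `os_model` keys) and the function name `open_source_only` are assumptions made for illustration; the real leaderboard schema and front-end code may differ.

```python
# Hypothetical entry shape; the actual leaderboards.json schema may differ.
MODELS = ["DeepSeek V3.2", "GLM-5", "Kimi K2.5", "MiniMax M2.5"]

def open_source_only(entries):
    """Keep only entries explicitly flagged as open-source/open-weight."""
    return [e for e in entries if e.get("os_model") is True]

# Stale data: every open-weight model is wrongly flagged os_model=False.
stale_entries = [{"name": name, "os_model": False} for name in MODELS]

# Corrected data: the same entries with the flag set to True.
fixed_entries = [{**e, "os_model": True} for e in stale_entries]

stale_view = open_source_only(stale_entries)  # empty: nothing passes the filter
fixed_view = open_source_only(fixed_entries)  # all four models pass
```

With the stale flags the filter returns an empty list, which matches the "no results" symptom reported in the issue.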
36b3d65 to 12dd14d
Updated SWE-bench/experiments#433
Summary
Fixes SWE-bench/SWE-bench#544: selecting Verified → mini-SWE-agent-v2 → Open source only showed no results because `leaderboards.json` was stale.
- `os_model` was `false` for 4 open-weight models (DeepSeek V3.2, GLM-5, Kimi K2.5, MiniMax M2.5) in the generated JSON, even though their `metadata.yaml` files in `experiments/` correctly had `os_model: true`. The JSON had not been regenerated.
- Regenerated `leaderboards.json` via `python -m analysis.get_leaderboard` from the experiments repo.

All changes from regeneration
- `os_model: false → true` for DeepSeek V3.2, GLM-5, Kimi K2.5, MiniMax M2.5 (+ their Verified cross-listings)
- `model_release_date` populated (was `null` for all bash-only/multilingual entries)
- `logs` URLs added for GPT 5.2 Codex entries

Related: SWE-bench/experiments#433 (metadata name fix for GPT 5.2 Codex)
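The kind of drift described here, where a generated JSON file lags behind its source metadata, could be caught with a small consistency check. The sketch below is illustrative only: the function name `find_stale_flags`, the in-memory dict/list shapes, and the sample values are assumptions, not the repository's actual schema or tooling.

```python
def find_stale_flags(source_flags, generated_entries, key="os_model"):
    """Return sorted names whose generated value differs from the source value.

    source_flags: dict mapping model name -> value taken from source metadata
    generated_entries: list of dicts as they might appear in generated JSON
    """
    stale = []
    for entry in generated_entries:
        name = entry["name"]
        if name in source_flags and entry.get(key) != source_flags[name]:
            stale.append(name)
    return sorted(stale)

# Hypothetical sample data mirroring the situation in this PR.
source = {
    "DeepSeek V3.2": True,
    "GLM-5": True,
    "Kimi K2.5": True,
    "MiniMax M2.5": True,
}
generated_before = [{"name": n, "os_model": False} for n in source]
generated_after = [{"name": n, "os_model": True} for n in source]

mismatches_before = find_stale_flags(source, generated_before)
mismatches_after = find_stale_flags(source, generated_after)
```

Before regeneration the check reports all four models; after regeneration it reports none. In a real setup the `source` dict would be built from the metadata files rather than hard-coded.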
Test plan