[Leaderboard] Altimate Code — Claude Sonnet 4.6 — Pass@1 0.671 (relaxed validators), 0.6187 (original validators) #44
Hi @sahrizvi, thank you for your contribution!
Hi @Ruiying-Ma! Thanks for the quick turnaround. The full set of per-trial traces (270 trials, 54 queries × 5) is attached as a 13.4 MB zip.
Layout
Happy to provide additional metadata or adjust the layout if anything is harder to verify than expected.
Thank you, @sahrizvi! Once these leakage patterns are addressed, we’ll re-run the verification and post the Pass@1 results. Thank you!
Hi @Ruiying-Ma, thank you for the leakage flag; it surfaced more than the agnews issue you found. We did a thorough audit. We've reworked the harness and re-executed the affected trials. Attached is the updated submission.
What changed
New result
The drop from our prior 0.6710 number is mostly due to the format_hint correction. Per-dataset table is in […]
Provenance
The 270 trial answers come from three runs of the same hardened harness on the same machine:
Package
Same shape as the previous traces zip (
Happy to answer questions or re-run any specific trials you'd like spot-checked under tighter conditions.
Altimate Code — Leaderboard Submission
Agent name: Altimate Code
Project page: altimate.sh
Backbone LLM: Claude Sonnet 4.6 (via OpenRouter: openrouter/anthropic/claude-sonnet-4.6)
Hints: Yes (db_description_withhint.txt injected into the user prompt)
Trials: 5 per query (270 trials total across 12 datasets, 54 queries)
Result
Result
The same 270 trial answers were scored against two validator versions of the benchmark: the version we ran against, and the post-relaxation version on main at submission time.
[Results table: original validators (9031c68ad) vs. relaxed validators (5ec934595); values not recovered]
Note on validator versions. Our trials executed when vendor/DataAgentBench was at commit 9031c68ad. Upstream subsequently merged commits 16ccc3cbd ("Relax 16 validators to accept semantically-correct answers") and 7c94cbf4c ("Relax 3 more validators"), which together updated 17 validate.py files across 6 datasets. Re-running the scoring step (no agent re-execution) against the relaxed validators lifted 12 trials from fail to pass. Both numbers are reproducible from the same trial answers in submission.json.
Architecture
A heterogeneous-warehouse data agent built on top of altimate-code, an open-source TypeScript agent runtime, that:
- […] db_description_withhint.txt (injected into the user prompt) for cross-DB join keys, term codes, and output-format guidance.
- […] (schema_index, schema_search, schema_inspect, sql_execute, warehouse_list) to introspect schemas and run queries against PostgreSQL, SQLite, DuckDB, and MongoDB.
- […] (sql-review, query-optimize, lineage-diff, sql-translate) to catch SQL anti-patterns and trace column provenance before committing an answer.
- […] ANSWER rather than producing a meta-summary.
- […] solve.py per query and iterates in place (Edit, not rewrite) until convergence; the final answer goes to ANSWER.
Per-dataset Pass@1
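As a rough illustration of how per-dataset numbers relate to per-trial outcomes, the Python sketch below averages pass/fail flags per dataset and counts trials that flip from fail to pass between two validator versions. The tuple layout, the function names, and the toy data are assumptions made for illustration; they are not the format of submission.json or the benchmark's scoring script, and how the overall Pass@1 is aggregated across datasets is not specified here.

```python
from collections import defaultdict


def per_dataset_pass_at_1(trials):
    """Return {dataset: fraction of trials that passed}.

    `trials` is an iterable of (dataset, query_id, passed) tuples,
    one tuple per trial (hypothetical layout).
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for dataset, _query_id, ok in trials:
        total[dataset] += 1
        passed[dataset] += int(ok)
    return {dataset: passed[dataset] / total[dataset] for dataset in total}


def count_flips(original, relaxed):
    """Count trials that fail under `original` but pass under `relaxed`.

    Both arguments map a trial key to a bool; the keys must match.
    """
    return sum(1 for key, ok in relaxed.items() if ok and not original[key])


# Toy data: two datasets, one query each, two trials per query.
toy_trials = [
    ("A", "q1", True),
    ("A", "q1", False),
    ("B", "q2", False),
    ("B", "q2", False),
]
scores = per_dataset_pass_at_1(toy_trials)  # {"A": 0.5, "B": 0.0}

# The same two trials scored by two validator versions.
original = {("A", "q1", 0): True, ("A", "q1", 1): False}
relaxed = {("A", "q1", 0): True, ("A", "q1", 1): True}
flips = count_flips(original, relaxed)  # 1
```

Because only the scoring step differs between the two validator versions, the agent's answers stay fixed and a comparison like `count_flips` needs nothing beyond the two sets of pass/fail flags.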
Note on PATENTS
PATENTS scores 0.000 under both validator sets. Our agent produced well-formed CSV answers on every PATENTS trial but reached a different subset of CPC codes than the reference; the failure mode is query interpretation (specifically, the EMA initialization convention and the CPC hierarchy-level definition are not pinned down by the question), not format or harness.
We chose not to add per-dataset hand-tuning to lift this number, in keeping with our principle of only using general-purpose agent improvements.
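To see why the EMA initialization convention alone can change which codes are selected, here is a small Python sketch comparing two common conventions: seeding the average with the first observation versus seeding it with zero. The series, the smoothing factor, and the code labels are made up for illustration; they are not taken from the PATENTS dataset, the question text, or the reference solution.

```python
def ema(values, alpha, seed_with_first=True):
    """Exponential moving average of `values`, returning the final state.

    seed_with_first=True  -> state starts at values[0]
    seed_with_first=False -> state starts at 0.0 and values[0] is smoothed in
    """
    if seed_with_first:
        state = values[0]
        rest = values[1:]
    else:
        state = 0.0
        rest = values
    for v in rest:
        state = alpha * v + (1 - alpha) * state
    return state


def top_code(series_by_code, alpha, seed_with_first):
    """Return the code whose series has the highest final EMA."""
    return max(
        series_by_code,
        key=lambda code: ema(series_by_code[code], alpha, seed_with_first),
    )


# Hypothetical yearly counts for two hypothetical codes.
series_by_code = {"A": [10, 0], "B": [3, 6]}
alpha = 0.2

best_seed_first = top_code(series_by_code, alpha, seed_with_first=True)
best_seed_zero = top_code(series_by_code, alpha, seed_with_first=False)
```

With these numbers, seeding from the first observation favors "A" (its large first value dominates), while seeding from zero favors "B", so two equally defensible readings of the same question select different codes.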
Configuration