Expecto: Extracting Formal Specifications from Natural Language Description for Trustworthy Oracles (Artifact)

This repository contains the artifact implementation for the paper Expecto: Extracting Formal Specifications from Natural Language Description for Trustworthy Oracles.

1. Getting started

1.1 System requirements

The experiments in the paper were conducted with the following setup:

  • Ubuntu 22.04
  • Docker 24.0.2
  • 64 CPU cores (Xeon Gold 6226R)
  • 128 GB RAM
  • 256 GB storage

1.2 Setup with Docker

Step 1. Pull or load the Docker image

First, obtain the Docker image that includes the datasets, dependencies, and experiment code. Pulling the image is the easiest option:

docker pull prosyslab/expecto-artifact

If you downloaded expecto-artifact.tar.gz from Zenodo, you can load the image instead:

gunzip -c expecto-artifact.tar.gz | docker load

You can verify that the image was pulled or loaded correctly by running:

docker images | grep expecto-artifact

Expected output:

prosyslab/expecto-artifact   latest    ...

Step 1 (Alternative). Set up without Docker using Conda

If Docker is not available, you can set up the environment with Conda instead. Installing and building the dependencies takes about 15-20 minutes.

# clone the repository
git clone https://github.com/prosyslab/expecto-artifact

# create the Conda environment
conda env create -f environment.yml
conda activate expecto-artifact

# install dependencies
bash setup.sh

# reactivate the Conda environment to apply environment variables
conda deactivate
conda activate expecto-artifact

# download the datasets from Zenodo
wget https://zenodo.org/records/19083598/files/datasets.tar.gz?download=1 -O datasets.tar.gz
tar -xzf datasets.tar.gz

Step 2. Start the container or activate the Conda environment

If you used Docker in Step 1:

docker run -it --name expecto-artifact prosyslab/expecto-artifact zsh

Otherwise:

conda activate expecto-artifact

Step 3. Create the .env file

Expecto uses GPT-4.1-mini (OpenAI) as its LLM backend, so you need a valid OpenAI API key to run the experiments. Inside the container, change to /workspace/expecto-artifact and create a .env file with your OpenAI API key.

Note: An OpenAI API key will be provided through a HotCRP comment.

cd /workspace/expecto-artifact
cat > .env <<'EOF'
OPENAI_API_KEY=YOUR_KEY_HERE
EOF
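The artifact's own scripts/test.py performs the real setup checks, but the .env contract above is simple enough to sketch. The helper below is a hypothetical illustration (not part of the artifact) of how a KEY=VALUE .env file can be parsed and the API key validated, using only the standard library:

```python
# Illustrative sketch only: scripts/test.py does the authoritative checks.
from pathlib import Path


def read_env(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env


def key_looks_valid(env: dict) -> bool:
    """OpenAI API keys conventionally start with 'sk-'."""
    return env.get("OPENAI_API_KEY", "").startswith("sk-")
```

This mirrors the two .env-related [PASS] items in the test script's output: the file exists, and OPENAI_API_KEY is defined and starts with 'sk-'.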

Step 4. Check the setup with the test script

Run the following script to confirm that the setup completed successfully. Make sure every item in the Summary section is marked as [PASS]. It takes about 2 minutes.

Command:

python3 scripts/test.py

Expected output:

[PASS] .env file: /workspace/expecto-artifact/.env
[PASS] OPENAI_API_KEY: OPENAI_API_KEY is defined in .env and starts with 'sk-'
[PASS] datasets directory: /workspace/expecto-artifact/datasets
[PASS] dataset file: datasets/apps.json
[PASS] dataset file: datasets/human_eval_plus.json
[PASS] dataset file: datasets/defects4j.jsonl
Running RQ1 smoke test:
/usr/bin/python3 /workspace/expecto-artifact/scripts/run_artifact.py rq1 --limit 1 --output-root /workspace/data/experiment/artifact/test-smoke --force
...
[PASS] RQ1 execution: run_artifact.py completed successfully
[PASS] Expecto markers: .../full/runs/apps/ts/evaluation_result/manifest.json
[PASS] NL2Postcond marker: .../full/runs/apps/nl2_base/.../aggregated_result.json
[PASS] RQ1 figure output: .../full/figures/rq1/evaluation.rq1.table.pdf

Summary:
[PASS] .env file
[PASS] OPENAI_API_KEY
[PASS] datasets directory
[PASS] dataset files
[PASS] RQ1 execution
[PASS] Expecto markers
[PASS] NL2Postcond markers
[PASS] RQ1 figure outputs
[PASS] sample result outputs

This test script checks the following:

  1. The required datasets exist.
  2. OPENAI_API_KEY is set.
  3. The RQ1 experiment can run on a single sample, and the final figure files are generated correctly.

Note: The figures generated by test.py are not intended to match the exact values reported in the paper. They only confirm that the workflow runs correctly before you start a full reproduction.

1.3 Notes before you start

  • LLM non-determinism: The generated specifications and their classifications may differ from those reported in the paper because of LLM non-determinism. However, the overall tendencies (e.g., Expecto producing more S&C specifications than NL2Postcond) should still hold.

2. Directory structure

The expecto-artifact repository has the following directory structure:

/workspace/expecto-artifact
├── README.md                          <- Top-level README (this file)
├── analyzer/                          <- Figure and table generation scripts
├── datasets/                          <- HumanEval+, APPS, Defects4J, and benchmark extension examples
│   └── examples/                      <- Minimal APPS-style and Java/Defects4J-style benchmark examples
├── expecto/                           <- Core Expecto implementation
├── nl-2-postcond/                     <- NL2Postcond baseline from FSE '24
└── scripts/                           <- Scripts for running artifact experiments
    └── run_artifact.py                <- Main artifact runner

Generated outputs from run_artifact.py are stored as follows. The most important files to inspect are sample_results.json for per-run summaries and the PDFs under figures/ for the reproduced paper artifacts.

/workspace/data/experiment/artifact
├── target/                            <- Outputs for targeted reproduction (Section 3)
│   └── runs/                          <- Raw LLM outputs and evaluation results
│       └── <BENCHMARK>/<CONFIG>/      <- Outputs for one benchmark/configuration pair
│           └── sample_results.json    <- Summary of generated sample results
├── full/                              <- Outputs for full paper reproduction (Sections 4 and 5)
│   ├── runs/                          <- Raw LLM outputs and evaluation results
│   └── figures/                       <- Tables and figures generated from `runs/`
│       ├── rq1/                       <- Figures for RQ1
│       ├── rq2/                       <- Figures for RQ2
│       ├── rq3/                       <- Figures for RQ3
│       └── rq4/                       <- Figures for RQ4
└── mini/                              <- Outputs for reduced quick-check runs (Section 6)
    ├── runs/
    └── figures/
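Since every profile stores its summaries at the same depth (`<PROFILE>/runs/<BENCHMARK>/<CONFIG>/sample_results.json`), you can enumerate whatever has been generated so far with a one-glob helper. This is an illustrative sketch, not an artifact script; it only assumes the layout shown above:

```python
# Illustrative helper, not part of the artifact: list generated summaries
# under the documented layout <PROFILE>/runs/<BENCHMARK>/<CONFIG>/.
from pathlib import Path


def find_summaries(root: str) -> list[str]:
    """Return relative paths of all sample_results.json files under root."""
    return sorted(
        str(p.relative_to(root))
        for p in Path(root).glob("*/runs/*/*/sample_results.json")
    )
```

For example, `find_summaries("/workspace/data/experiment/artifact")` would list one entry per completed benchmark/configuration run.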

3. Reproducing specific benchmark problems

Use the target command when you want to reproduce individual benchmark problems instead of the full paper results. The paper covers three benchmarks, four Expecto configurations, and two NL2Postcond configurations; you can run any supported combination by specifying the benchmark, configuration, and sample ID(s).

3.1 Reproducing the paper's motivating example (Fig. 2)

Run the following commands to compare Expecto's tree-search configuration (ts) with the two NL2Postcond configurations (nl2_base and nl2_simple) on the motivating example in Fig. 2.

It takes about 2 minutes.

# run Expecto (`ts` means the tree search specification synthesis + specification validation with test cases)
python3 scripts/run_artifact.py target \
  --benchmark apps \
  --config ts \
  --sample-ids 75

# run NL2Postcond base prompt configuration
python3 scripts/run_artifact.py target \
  --benchmark apps \
  --config nl2_base \
  --sample-ids 75

# run NL2Postcond simple prompt configuration
python3 scripts/run_artifact.py target \
  --benchmark apps \
  --config nl2_simple \
  --sample-ids 75

Each command writes its result to one of the following files:

  • /workspace/data/experiment/artifact/target/runs/apps/ts/sample_results.json:
[
  {
    "id": "75",
    "classification": "S&C",
    "nl_description": "You are given a description of a depot...",
    "specification": "predicate spec(n: int, m: int, grid: list[string], out_status: string, out_x: int, out_y: int) { ... }"
  }
]
  • /workspace/data/experiment/artifact/target/runs/apps/nl2_base/sample_results.json:
[
  {
    "id": "75",
    "classification": "S",
    "nl_description": "You are given a description of a depot...",
    "specification": "assert ..."
  }
]
  • /workspace/data/experiment/artifact/target/runs/apps/nl2_simple/sample_results.json:
[
  {
    "id": "75",
    "classification": "C",
    "nl_description": "You are given a description of a depot...",
    "specification": "assert ..."
  }
]

Each sample_results.json entry contains:

  • id: the benchmark problem ID. In this example, it is always 75.
  • classification: the evaluation result for the generated specification. Here, Expecto produces a sound and complete (S&C) specification, while the two NL2Postcond configurations produce sound but incomplete (S) or complete but unsound (C) specifications.
  • nl_description: the original natural-language description from the APPS benchmark prompt.
  • specification: the generated formal specification. This field stores the specifications shown in Fig. 2. Expecto emits a DSL specification such as predicate spec(...), and NL2Postcond emits a single assertion such as assert ....

3.2 Running other target problems

Use the following command template for any other target problem:

python3 scripts/run_artifact.py target \
  --benchmark <BENCHMARK> \
  --config <CONFIG> \
  --sample-ids <SAMPLE-IDS>
  • BENCHMARK specifies which benchmark to use. You can choose apps, humaneval_plus, or defects4j.
  • CONFIG specifies the Expecto or NL2Postcond configuration used in the paper.
Method | CONFIG value | Description | Paper section(s)
--- | --- | --- | ---
Expecto | mono | Monolithic specification synthesis + specification validation with test cases | 4.3
Expecto | topdown | Top-down specification synthesis + specification validation with test cases | 4.3
Expecto | ts | Tree search specification synthesis + specification validation with test cases | 4.2, 4.3, 4.4, 4.5
Expecto | without_tc | Tree search specification synthesis | 4.4
NL2Postcond | nl2_base | NL2Postcond base prompt strategy | 4.2, 4.5
NL2Postcond | nl2_simple | NL2Postcond simple prompt strategy | 4.2, 4.5
  • SAMPLE-IDS specifies the benchmark problem IDs as a comma-separated list. For example, you can provide 15,23,56. You can find the available IDs for each benchmark in /workspace/expecto-artifact/datasets/available_target_ids.csv.

This command generates raw LLM outputs and evaluates the resulting specifications for soundness and completeness. The generated outputs and evaluation results are saved in /workspace/data/experiment/artifact/target/runs/<BENCHMARK>/<CONFIG>. After the evaluation is complete, the runner also generates /workspace/data/experiment/artifact/target/runs/<BENCHMARK>/<CONFIG>/sample_results.json, so you can immediately inspect the classification result, original natural-language description, and generated specification for each sample.

sample_results.json contains one object per sample, and each object has the following fields:

  • id: the sample ID
  • classification: the sample classification result. It can be one of the following values:
    • S&C: sound and complete
    • S: sound but incomplete
    • C: complete but unsound
    • W: neither sound nor complete
  • nl_description: the original natural-language description
    • APPS and HumanEval+ use the benchmark prompt text
    • Defects4J uses the method Javadoc description
  • specification: the generated specification
    • Expecto generates specifications in the form of predicate spec(...) { ... }
    • NL2Postcond generates specifications in the form of assert ...
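Given these fields, summarizing a finished run reduces to tallying the classification labels. A minimal sketch (assuming only the documented sample_results.json format; this is not an artifact script):

```python
# Illustrative sketch: count S&C / S / C / W labels in a sample_results.json.
import json
from collections import Counter


def classification_counts(path: str) -> Counter:
    """Tally the classification labels across all samples in one run."""
    with open(path) as f:
        samples = json.load(f)
    return Counter(s["classification"] for s in samples)
```

For instance, `classification_counts(".../runs/apps/ts/sample_results.json")` returns a Counter keyed by the four labels, which is the per-setting breakdown that the RQ tables and bar charts report.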

4. Reproducing the full paper results

This section explains the full command, which runs all experiments in the paper for RQ1 through RQ4. It performs the generation and evaluation described in Section 3 on the full benchmarks, then produces the paper's figures and tables. The total expected runtime of the full command is approximately 50 hours.

4.1 How to run full and what this command does

It takes about 50 hours.

python3 scripts/run_artifact.py full

Running the command above executes all generation-and-evaluation settings used in the paper. For APPS and HumanEval+, it runs 6 settings per benchmark: 4 Expecto configurations and 2 NL2Postcond configurations. For Defects4J, it runs 3 settings: 1 Expecto configuration and 2 NL2Postcond configurations. In total, this is 6 + 6 + 3 = 15 generation-and-evaluation runs. It then processes the outputs into the same figure and table formats shown in the paper.

  • The results for each benchmark / generation-configuration combination are stored at:
    • /workspace/data/experiment/artifact/full/runs/<BENCHMARK>/<CONFIG>/sample_results.json
    • The raw LLM outputs and evaluation artifacts used to build each summary are stored alongside it under /workspace/data/experiment/artifact/full/runs/<BENCHMARK>/<CONFIG>/
  • The paper's figures and tables are stored at:
    • Table 1 (RQ1 main comparison): /workspace/data/experiment/artifact/full/figures/rq1/evaluation.rq1.table.pdf
    • Fig. 8 (RQ1 threshold analysis): /workspace/data/experiment/artifact/full/figures/rq1/evaluation.thresholds.pdf
    • Fig. 9 (RQ2 generation algorithm ablation): /workspace/data/experiment/artifact/full/figures/rq2/evaluation.rq2.pdf
    • Fig. 10 (RQ3 test-case ablation): /workspace/data/experiment/artifact/full/figures/rq3/evaluation.rq3.testcase.pdf
    • Table 2 (RQ4 Defects4J comparison): /workspace/data/experiment/artifact/full/figures/rq4/evaluation.rq4.defects4j.table.pdf

For details on each RQ, see Section 5.

5. Reproducing each RQ

This section explains the commands and outputs used to reproduce each research question (RQ). If you have already completed the full experiment with the full command, you can inspect the generated outputs instead of running the commands again.

5.1 RQ1: Effectiveness of Expecto in formal specification generation (Table 1 and Fig. 8)

This experiment compares Expecto against the two NL2Postcond prompt strategies (nl2_base and nl2_simple) on the APPS and HumanEval+ benchmarks. For Expecto, the comparison uses the tree-search-with-test-cases (ts) configuration.

How to run rq1 and what this command does

It takes about 12 hours from scratch, but if you have already run the full command, you can skip this step and inspect the generated outputs instead.

python3 scripts/run_artifact.py rq1

This command runs specification generation and evaluation for six experiment settings (2 benchmarks x 3 configs) and produces the main RQ1 outputs: Table 1 and Fig. 8. Table 1 shows the number of samples in each classification category (S&C, S, C, and W) for each setting.

Fig. 8 shows how the number of samples in the S&C and W categories changes as the threshold X in the soundness criterion varies. For example, suppose a benchmark has 10 incorrect input-output pairs, and a generated specification rejects 8 of them. The specification is then sound when X = 80, because it rejects at least 80% of the incorrect pairs, but not sound when X = 90. Fig. 8 repeats this counting for different threshold values and shows how the classification changes.

The generated raw data and final outputs can be found in:

  • Paths to sample_results.json:
    • /workspace/data/experiment/artifact/full/runs/apps/ts/sample_results.json
    • /workspace/data/experiment/artifact/full/runs/apps/nl2_base/sample_results.json
    • /workspace/data/experiment/artifact/full/runs/apps/nl2_simple/sample_results.json
    • /workspace/data/experiment/artifact/full/runs/humaneval_plus/ts/sample_results.json
    • /workspace/data/experiment/artifact/full/runs/humaneval_plus/nl2_base/sample_results.json
    • /workspace/data/experiment/artifact/full/runs/humaneval_plus/nl2_simple/sample_results.json
  • Table 1 (RQ1 main comparison): /workspace/data/experiment/artifact/full/figures/rq1/evaluation.rq1.table.pdf
  • Fig. 8 (RQ1 threshold analysis): /workspace/data/experiment/artifact/full/figures/rq1/evaluation.thresholds.pdf
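The threshold rule behind Fig. 8 can be stated as a one-line check: a specification is sound at threshold X if it rejects at least X% of the incorrect input-output pairs. This sketch only illustrates the counting rule, not the artifact's actual evaluation code:

```python
def is_sound_at(rejected: int, total_incorrect: int, threshold_pct: float) -> bool:
    """Sound at threshold X if at least X% of incorrect pairs are rejected."""
    return 100.0 * rejected / total_incorrect >= threshold_pct


# The worked example from the text: 8 of 10 incorrect pairs rejected.
assert is_sound_at(8, 10, 80)       # sound at X = 80
assert not is_sound_at(8, 10, 90)   # not sound at X = 90
```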

Notes on RQ1 results

  • DSL compiler bug fix: After the submission version, we found bugs in the DSL compiler that caused some specifications to be misclassified as W or S instead of S&C. We have fixed these bugs, so you may see more S&C specifications than those reported in the paper, especially for Expecto on HumanEval+. However, the overall trends and conclusions should still hold.

5.2 RQ2: Effectiveness of top-down synthesis with tree search

This experiment evaluates the two main generation components of Expecto: top-down specification synthesis and tree search. It compares three Expecto configurations on the APPS and HumanEval+ benchmarks: monolithic generation (mono), top-down synthesis without tree search (topdown), and top-down synthesis with tree search (ts).

How to run rq2 and what this command does

It takes about 24 hours from scratch, but if you have already run the full command, you can skip this step and inspect the generated outputs instead.

python3 scripts/run_artifact.py rq2

This command runs specification generation and evaluation for six Expecto ablation settings (2 benchmarks x 3 configs) and produces the main RQ2 output, Fig. 9. Fig. 9 is a bar chart comparing the number of samples in each classification category for different Expecto generation algorithms. It shows that introducing top-down synthesis and tree search improves the quality of the generated specifications, for example by increasing the number of S&C specifications and decreasing the number of W specifications. This can be seen from the fact that, for each benchmark, the number of S&C specifications increases in the order mono < topdown < ts. The generated raw data and final output can be found in:

  • Paths to sample_results.json:
    • /workspace/data/experiment/artifact/full/runs/apps/mono/sample_results.json
    • /workspace/data/experiment/artifact/full/runs/apps/topdown/sample_results.json
    • /workspace/data/experiment/artifact/full/runs/apps/ts/sample_results.json
    • /workspace/data/experiment/artifact/full/runs/humaneval_plus/mono/sample_results.json
    • /workspace/data/experiment/artifact/full/runs/humaneval_plus/topdown/sample_results.json
    • /workspace/data/experiment/artifact/full/runs/humaneval_plus/ts/sample_results.json
  • Fig. 9 (RQ2 generation algorithm ablation): /workspace/data/experiment/artifact/full/figures/rq2/evaluation.rq2.pdf

5.3 RQ3: Impact of test cases on specification generation

This experiment measures how much user-provided test cases help Expecto during specification generation. It compares the full tree-search configuration with test cases (ts) against the version that does not use test cases (without_tc) on the APPS and HumanEval+ benchmarks.

How to run rq3 and what this command does

It takes about 18 hours from scratch, but if you have already run the full command, you can skip this step and inspect the generated outputs instead.

python3 scripts/run_artifact.py rq3

This command runs specification generation and evaluation for four experiment settings (2 benchmarks x 2 configs) and produces the main RQ3 output, Fig. 10. Fig. 10 is a bar chart comparing the number of samples in each classification category (S&C, S, C, W) between the ts configuration, which uses test cases, and the without_tc configuration, which does not. It shows that introducing test cases improves the quality of the generated specifications, for example by increasing the number of S&C specifications. The generated raw data and final output can be found in:

  • Paths to sample_results.json:
    • /workspace/data/experiment/artifact/full/runs/apps/ts/sample_results.json
    • /workspace/data/experiment/artifact/full/runs/apps/without_tc/sample_results.json
    • /workspace/data/experiment/artifact/full/runs/humaneval_plus/ts/sample_results.json
    • /workspace/data/experiment/artifact/full/runs/humaneval_plus/without_tc/sample_results.json
  • Fig. 10 (RQ3 test-case ablation): /workspace/data/experiment/artifact/full/figures/rq3/evaluation.rq3.testcase.pdf

5.4 RQ4: Effectiveness of Expecto for bug detection in real-world software

This experiment evaluates whether Expecto can generate bug-detecting specifications for real-world software methods. It compares Expecto with the two NL2Postcond prompt strategies (nl2_base and nl2_simple) on the Defects4J benchmark, using the tree-search configuration with test cases (ts) for Expecto.

How to run rq4 and what this command does

It takes about 12 hours from scratch, but if you have already run the full command, you can skip this step and inspect the generated outputs instead.

python3 scripts/run_artifact.py rq4

This command runs specification generation and evaluation for three Defects4J experiment settings (1 benchmark x 3 configs) and produces the main RQ4 output, Table 2. Table 2 compares the number of specifications in each classification category (S&C, S, C, W) produced on the Defects4J benchmark by Expecto (ts) and the two NL2Postcond prompt strategies (nl2_base, nl2_simple). The table shows that Expecto produces more specifications that are both correct and capable of detecting bugs (S&C) than NL2Postcond, while also producing fewer W specifications. The generated raw data and final output can be found in:

  • Paths to sample_results.json:
    • /workspace/data/experiment/artifact/full/runs/defects4j/ts/sample_results.json
    • /workspace/data/experiment/artifact/full/runs/defects4j/nl2_base/sample_results.json
    • /workspace/data/experiment/artifact/full/runs/defects4j/nl2_simple/sample_results.json
  • Table 2 (RQ4 Defects4J comparison): /workspace/data/experiment/artifact/full/figures/rq4/evaluation.rq4.defects4j.table.pdf

Notes on RQ4 results

  • DSL compiler bug fix: After the submission version, we found bugs in the DSL compiler that caused some specifications to be misclassified as W or S instead of S&C. We have fixed these bugs, so you may see more S&C specifications than those reported in the paper, especially for Expecto on Defects4J. However, the overall trends and conclusions should still hold.

6. Running reduced benchmarks for quick inspection

This section explains the mini command, which runs a reduced version of the experiments on fixed subsets of the benchmarks used in the paper. It performs the same generation, evaluation, and figure/table export steps as the full command, but uses only 20 samples from each benchmark so that the results can be produced much more quickly.

We recommend running mini first to check that your setup produces reasonable results before committing to the ~50-hour full run. It takes about 4.5 hours.

Note: mini is not meant to reproduce the exact paper numbers. It is meant for quick inspection and trend checking.

6.1 How to run mini and what this command does

To run the reduced-profile experiments for RQ1 through RQ4, use:

python3 scripts/run_artifact.py mini

If you only want to run the reduced-profile experiment for a specific RQ, specify the RQ number as follows:

python3 scripts/run_artifact.py <rq1|rq2|rq3|rq4> --mini

Running the command above executes the same experiment families as the full profile for RQ1 through RQ4, but on fixed 20-sample subsets of APPS, HumanEval+, and Defects4J. It writes per-run outputs to:

  • /workspace/data/experiment/artifact/mini/runs/<BENCHMARK>/<CONFIG>/sample_results.json

It also generates the corresponding reduced-profile figures and tables at:

  • Table 1 (RQ1 main comparison): /workspace/data/experiment/artifact/mini/figures/rq1/evaluation.rq1.table.pdf
  • Fig. 8 (RQ1 threshold analysis): /workspace/data/experiment/artifact/mini/figures/rq1/evaluation.thresholds.pdf
  • Fig. 9 (RQ2 generation algorithm ablation): /workspace/data/experiment/artifact/mini/figures/rq2/evaluation.rq2.pdf
  • Fig. 10 (RQ3 test-case ablation): /workspace/data/experiment/artifact/mini/figures/rq3/evaluation.rq3.testcase.pdf
  • Table 2 (RQ4 Defects4J comparison): /workspace/data/experiment/artifact/mini/figures/rq4/evaluation.rq4.defects4j.table.pdf

The fixed sample IDs used by mini are listed below. You do not need these IDs unless you want to inspect exactly which subset is used.

  • APPS IDs: 15, 57, 23, 76, 83, 39, 101, 3701, 4004, 37, 94, 16, 61, 52, 42, 47, 72, 90, 4005, 71
  • HumanEval+ IDs: 22, 52, 75, 61, 83, 92, 34, 53, 73, 129, 140, 158, 54, 124, 6, 71, 32, 123, 162, 63
  • Defects4J IDs:
Chart_6_workspace_objdump_d4j_full_fresh_chart_6_source_org_jfree_chart_util_ShapeList_java_boolean_equals_Object_obj, Cli_18_workspace_objdump_d4j_full_fresh_cli_18_src_java_org_apache_commons_cli_PosixParser_java_void_processOptionToken_String_token_boolean_stopAtNonOption, Compress_40_workspace_objdump_d4j_full_compress_40_src_main_java_org_apache_commons_compress_utils_BitInputStream_java_long_readBits_int_count, Jsoup_76_workspace_objdump_d4j_full_jsoup_76_src_main_java_org_jsoup_parser_HtmlTreeBuilderState_java_boolean_process_Token_t_HtmlTreeBuilder_tb, Jsoup_85_workspace_objdump_d4j_full_jsoup_85_src_main_java_org_jsoup_nodes_Attribute_java_Attribute_String_key_String_val_Attributes_parent, Lang_32_workspace_objdump_d4j_full_lang_32_src_main_java_org_apache_commons_lang3_builder_HashCodeBuilder_java_boolean_isRegistered_Object_value, Math_35_workspace_objdump_d4j_full_math_35_src_main_java_org_apache_commons_math3_genetics_ElitisticListPopulation_java_ElitisticListPopulation_int_populationLimit_double_elitismRate, Math_73_workspace_objdump_d4j_full_math_73_src_main_java_org_apache_commons_math_analysis_solvers_BrentSolver_java_double_solve_UnivariateRealFunction_f_double_min_double_max, Math_80_workspace_objdump_d4j_full_math_80_src_main_java_org_apache_commons_math_linear_EigenDecompositionImpl_java_boolean_flipIfWarranted_int_n_int_step, Math_96_workspace_objdump_d4j_full_math_96_src_java_org_apache_commons_math_complex_Complex_java_boolean_equals_Object_other, Cli_10_workspace_objdump_d4j_full_fresh_cli_10_src_java_org_apache_commons_cli_Parser_java_CommandLine_parse_Options_options_String_arguments_Properties_properties_boolean_stopAtNonOption, Cli_18_workspace_objdump_d4j_full_fresh_cli_18_src_java_org_apache_commons_cli_PosixParser_java_String_flatten_Options_options_String_arguments_boolean_stopAtNonOption, 
Cli_32_workspace_objdump_d4j_full_fresh_cli_32_src_main_java_org_apache_commons_cli_HelpFormatter_java_int_findWrapPos_String_text_int_width_int_startPos, Closure_114_workspace_objdump_d4j_full_fresh_closure_114_src_com_google_javascript_jscomp_NameAnalyzer_java_void_visit_NodeTraversal_t_Node_n_Node_parent, Closure_74_workspace_objdump_d4j_full_fresh_closure_74_src_com_google_javascript_jscomp_PeepholeFoldConstants_java_Node_tryFoldBinaryOperator_Node_subtree, Closure_78_workspace_objdump_d4j_full_fresh_closure_78_src_com_google_javascript_jscomp_PeepholeFoldConstants_java_Node_tryFoldArithmeticOp_Node_n_Node_left_Node_right, Closure_97_workspace_objdump_d4j_full_fresh_closure_97_src_com_google_javascript_jscomp_PeepholeFoldConstants_java_Node_tryFoldBinaryOperator_Node_subtree, Codec_4_workspace_objdump_d4j_full_codec_4_src_java_org_apache_commons_codec_binary_Base64_java_byte_encodeBase64_byte_binaryData_boolean_isChunked_boolean_urlSafe_int_maxResultSize, Jsoup_38_workspace_objdump_d4j_full_jsoup_38_src_main_java_org_jsoup_parser_HtmlTreeBuilderState_java_boolean_process_Token_t_HtmlTreeBuilder_tb, Jsoup_46_workspace_objdump_d4j_full_jsoup_46_src_main_java_org_jsoup_nodes_Entities_java_void_escape_StringBuilder_accum_String_string_Document_OutputSettings_out_boolean_inAttribute_boolean_normaliseWhite_boolean_stripLeadingWhite

7. Adding a new benchmark

This section gives two complete hello-world-sized examples you can copy from:

  • datasets/examples/apps_hello_world.json: a minimal APPS-style task with a parser
  • datasets/examples/defects4j_hello_world.jsonl: a minimal Java / Defects4J-style task

7.1 APPS-style hello world with a parser

Start from datasets/examples/apps_hello_world.json. It contains a single task named hello_world_sum:

[
  {
    "problem_id": "hello_world_sum",
    "difficulty": "introductory",
    "prompt": "Read two integers separated by whitespace and print their sum.",
    "input_output": "{\"inputs\": [\"1 2\\n\", \"10 -3\\n\", \"0 0\\n\"], \"outputs\": [\"3\\n\", \"7\\n\", \"0\\n\"]}",
    "mutated_input_output": "{\"inputs\": [\"1 2\\n\", \"10 -3\\n\", \"0 0\\n\"], \"outputs\": [\"4\\n\", \"8\\n\", \"1\\n\"]}",
    "parser": "def parser(input: str, output: str) -> tuple[int, int, int]:\n    a_str, b_str = input.strip().split()\n    return int(a_str), int(b_str), int(output.strip())",
    "signature": "def postcondition(a: int, b: int, result: int):\n    pass"
  }
]

This example works as follows:

  • problem_id = "hello_world_sum" becomes the sample identifier.
  • prompt is the natural-language task description shown to the model.
  • input_output stores correct stdin/stdout examples. Expecto should accept these examples when checking completeness.
  • mutated_input_output stores incorrect stdin/stdout examples. Expecto should reject these examples when checking soundness.
  • parser converts raw stdin/stdout text into the typed arguments expected by postcondition(...).
  • signature tells Expecto that the synthesized specification is over a: int, b: int, and result: int.
  • In this example, that means the parser must return values with types (int, int, int) in exactly that order.

What the parser is allowed and expected to do

For APPS-style datasets, the parser is only a conversion layer from stored text examples to typed values. The parser in apps_hello_world.json is intentionally small because that is the right contract:

  • It reads the raw input string and raw output string from the dataset.
  • It converts them to the exact argument types declared in signature.
  • It returns a deterministic tuple whose order matches postcondition(...).

The mapping rule is positional. Expecto evaluates parser(input, output) and then passes the returned values to postcondition(...) in the same order as the parameters in signature. So the parser return type must match the postcondition argument types exactly. You should think of the parser as producing the full argument list for postcondition(...). For the hello-world example:

def postcondition(a: int, b: int, result: int):
    pass

def parser(input: str, output: str) -> tuple[int, int, int]:
    a_str, b_str = input.strip().split()
    return int(a_str), int(b_str), int(output.strip())

Here the parser return type is tuple[int, int, int] because postcondition(...) takes (a: int, b: int, result: int). For input = "1 2\n" and output = "3\n", Expecto does:

parser("1 2\n", "3\n") == (1, 2, 3)
postcondition(1, 2, 3)

The same rule applies in general:

  • def postcondition(x: int, ok: bool): ... means parser return type tuple[int, bool]
  • def postcondition(nums: list[int], result: bool): ... means parser return type tuple[list[int], bool]

There is no name-based matching here. If the parser returns the wrong arity or incompatible types, evaluation will fail. Expecto may still apply small compatibility conversions after parsing, such as bool -> int or int -> float, when those are required by the annotated signature types. In practice, you should still write the parser so that its annotated return type already matches the postcondition parameter types directly, rather than relying on those conversions.
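The positional contract can be written down directly: evaluation splats the parser's returned tuple into the postcondition, with no name-based matching. The dispatch function below is a sketch of that rule, not the actual harness, and the `result == a + b` spec body is purely illustrative (the real specification is synthesized by the tool):

```python
def evaluate_example(parser, postcondition, raw_input: str, raw_output: str):
    """Positional dispatch: the parser's tuple becomes the argument list."""
    args = parser(raw_input, raw_output)
    return postcondition(*args)


# The hello-world parser and signature from above:
def parser(input: str, output: str) -> tuple[int, int, int]:
    a_str, b_str = input.strip().split()
    return int(a_str), int(b_str), int(output.strip())


def postcondition(a: int, b: int, result: int) -> bool:
    return result == a + b  # illustrative body; the real spec is synthesized


assert evaluate_example(parser, postcondition, "1 2\n", "3\n")
```

If the parser returned, say, `(str, str, int)` instead, the call would still happen positionally and fail at evaluation time, which is why the annotated return type should match the signature exactly.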

The parser should not:

  • re-implement the task logic
  • decide whether an example is correct or incorrect
  • read external files, call network services, or depend on hidden state

If the raw output format sometimes includes exceptional strings, the parser may normalize them into a sentinel value such as -1 or 0, as long as the returned type still matches signature. If your dataset already stores structured values, as HumanEval+ does, set parser to null instead of writing a dummy parser.
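A normalization of that kind might look like the sketch below. The "ERROR" marker and the -1 sentinel are hypothetical choices for illustration; adapt them to whatever exceptional strings your dataset actually contains. The key point is that the return type still matches the (int, int, int) signature:

```python
# Hypothetical example: normalize an exceptional output string to a sentinel.
# "ERROR" and -1 are illustrative; substitute your dataset's conventions.
def parser(input: str, output: str) -> tuple[int, int, int]:
    a_str, b_str = input.strip().split()
    out = output.strip()
    result = -1 if out == "ERROR" else int(out)  # sentinel keeps the int type
    return int(a_str), int(b_str), result
```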

Minimal steps to run this as a new task

  1. Copy expecto/src/tasks/apps.py to expecto/src/tasks/hello_world_apps.py.
  2. Change @task(name="apps") to @task(name="hello_world_apps").
  3. Change the dataset file from apps.json to examples/apps_hello_world.json.
  4. Register the new task in expecto/src/tasks/__init__.py.
  5. Run one sample with:
python3 scripts/executor.py \
  --task hello_world_apps \
  --solver tree_search \
  --model openai/gpt-4.1-mini \
  --base_dir /workspace/data/hello-world \
  --exp_name hello-world-apps \
  --sample-ids hello_world_sum \
  --n_completions 1 \
  --max_attempts 1 \
  --use_test_cases \
  --use_memo \
  --scorers postcondition \
  --export-sample-results \
  --sample-results-benchmark apps \
  --sample-results-variant topdown

If /workspace/data/hello-world/hello-world-apps/ does not already exist, this command writes its outputs there. Because hello_world_apps is a new task name, pass --scorers postcondition explicitly instead of relying on the built-in default scorer map. The most important files are:

  • /workspace/data/hello-world/hello-world-apps/evaluation_result/manifest.json
  • /workspace/data/hello-world/hello-world-apps/evaluation_result/samples/hello_world_sum.json
  • /workspace/data/hello-world/hello-world-apps/sample_results.json

If that directory already exists, scripts/executor.py creates a suffixed directory such as /workspace/data/hello-world/hello-world-apps_1/ instead.

7.2 Java hello world (Defects4J-style)

Start from datasets/examples/defects4j_hello_world.jsonl. This file contains one Java method example:

/**
 * Returns a greeting for the supplied name.
 *
 * @param name the name to greet
 * @return the greeting string
 */
public static String hello(final String name) {
    return "Hello, " + name + "!";
}

Unlike APPS, the Java benchmark does not use stdin/stdout strings plus a parser. Instead, each record contains:

  • method_info: the Java signature, Javadoc, code, and entry/exit schemas
  • corrects: serialized executions that should satisfy the specification
  • incorrects: serialized executions that should be rejected by the specification

For this hello-world example, the important schema fields are:

  • entry_schema.params = "record[name: string]"
  • entry_schema.self = "nonetype"
  • exit_schema.self = "nonetype"
  • exit_schema.ret = "string"

The corresponding serialized executions look like this conceptually:

{
  "entry": {
    "params": {"name": "World"},
    "self": null
  },
  "exit": {
    "self": null,
    "ret": "Hello, World!"
  }
}

Use this format when your Java benchmark is method-based rather than stdin/stdout-based. There is no parser field here because the dataset already stores serialized arguments, pre-state, post-state, and return values. In this hello-world example, self is always null because the method is static, so plain nonetype is enough. You only need option[...] when the schema genuinely needs an optional value.
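As a quick sanity check, an execution in this shape can be validated with a few lines of plain Python (a minimal sketch specific to this hello-world schema, not a general schema checker):

```python
execution = {
    "entry": {"params": {"name": "World"}, "self": None},
    "exit": {"self": None, "ret": "Hello, World!"},
}

# entry_schema: params = record[name: string], self = nonetype
assert isinstance(execution["entry"]["params"]["name"], str)
assert execution["entry"]["self"] is None

# exit_schema: self = nonetype, ret = string
assert execution["exit"]["self"] is None
assert isinstance(execution["exit"]["ret"], str)
```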

Minimal steps to run this as a new Java task

  1. Copy expecto/src/tasks/defects4j.py to expecto/src/tasks/hello_world_java.py.
  2. Change @task(name="defects4j") to @task(name="hello_world_java").
  3. Change the dataset file resolver so it reads datasets/examples/defects4j_hello_world.jsonl.
  4. Register the new task in expecto/src/tasks/__init__.py.
  5. Run one sample with:
python3 scripts/executor.py \
  --task hello_world_java \
  --solver defects4j_tree_search \
  --model openai/gpt-4.1-mini \
  --base_dir /workspace/data/hello-world \
  --exp_name hello-world-java \
  --sample-ids Hello_1_example_Hello_java_String_hello_String_name \
  --n_completions 1 \
  --max_attempts 1 \
  --use_test_cases \
  --use_memo \
  --scorers defects4j \
  --export-sample-results \
  --sample-results-benchmark defects4j \
  --sample-results-variant ts

If /workspace/data/hello-world/hello-world-java/ does not already exist, this command writes its outputs there. Because hello_world_java is a new task name, pass --scorers defects4j explicitly instead of relying on the built-in default scorer map. The most important files are:

  • /workspace/data/hello-world/hello-world-java/evaluation_result/manifest.json
  • /workspace/data/hello-world/hello-world-java/evaluation_result/samples/Hello_1_example_Hello_java_String_hello_String_name.json
  • /workspace/data/hello-world/hello-world-java/sample_results.json

If that directory already exists, scripts/executor.py creates a suffixed directory such as /workspace/data/hello-world/hello-world-java_1/ instead.

7.3 General rules for APPS/HumanEval+-style items

If your new benchmark follows the same structure, each item should contain the following fields.

  • problem_id (required): Becomes Sample.id and metadata["problem_id"]. It is also used as the stable key for sample filtering and validation-test sampling in record_to_sample(...).
  • prompt (required): Becomes the natural-language problem description seen by the solver. In APPS, the loader combines it with a few example test cases; in HumanEval+, the loader uses it directly as Sample.input.
  • input_output (required): A string representation of an object that contains inputs and outputs for correct input-output pairs. record_to_sample(...) calls json.loads(...) on this string, pairs each input with the corresponding output, and stores the result in metadata["test_list"]. This list is used for completeness evaluation. A small subset also becomes metadata["prompt_test_list"] to guide generation.
  • mutated_input_output (required): A string representation of an object that contains inputs and outputs for incorrect or bug-revealing input-output pairs. record_to_sample(...) calls json.loads(...) on this string, pairs each input with the corresponding output, and stores the result in metadata["mutated_test_list"]. This list is used for soundness evaluation.
  • signature (required): Copied to metadata["signature"]. Every parameter must have an explicit type annotation, written in Python type syntax such as list[int], bool, and tuple[int, str]. The existing internal converters transform those Python type annotations into the specification type system used by Expecto, so reuse that path instead of writing Expecto-specific types directly in the dataset.
  • parser (optional): Copied to metadata["parser"]. Only needed when the stored input and output examples must be converted before they match the signature. APPS needs it because its examples are stored as standard input and output strings; HumanEval+ stores structured values already, so it uses null. When a parser is present, Expecto evaluates parser(input, output) and passes the returned values to postcondition(...) positionally, in signature order, so the parser return type should be the tuple of postcondition parameter types in that same order. See datasets/examples/apps_hello_world.json for a minimal parser.

The fields listed above are the ones the current APPS- and HumanEval+-style loaders handle; if your benchmark needs extra metadata, you may add further fields freely for benchmark-specific use.

7.4 Expected shape of the example fields

The input_output and mutated_input_output fields are stored as strings. Each string must contain a JSON object with two arrays named inputs and outputs.

All benchmarks in this format share the same first loading step:

  1. Read the string in input_output or mutated_input_output.
  2. Call json.loads(...) on that string.
  3. Read the inputs array and the outputs array from the decoded object.
  4. Pair them with zip(inputs, outputs).
  5. Store the resulting list of (input, output) pairs in sample metadata.

Regardless of the benchmark, these fields must decode into an object with the exact keys inputs and outputs.
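The five loading steps can be sketched as:

```python
import json

# A minimal record with one correct input-output pair.
record = {"input_output": '{"inputs": ["1 2\\n"], "outputs": ["3\\n"]}'}

decoded = json.loads(record["input_output"])             # step 2
inputs, outputs = decoded["inputs"], decoded["outputs"]  # step 3
test_list = list(zip(inputs, outputs))                   # steps 4-5
assert test_list == [("1 2\n", "3\n")]
```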

Case 1. HumanEval+ style: json.loads(...) is enough

If your benchmark already stores structured values, json.loads(...) is enough, and you usually do not need a parser. HumanEval+ uses this style. After loading, the data can look like this:

{
  "inputs": [[[1, 2, 3], 2], [[5, 5], 1]],
  "outputs": [true, false]
}

In this case, each decoded input value already matches the structure expected by the signature. For example, if the signature is def postcondition(nums: list[int], target: int, result: bool):, an input entry such as [[1, 2, 3], 2] already provides nums and target as structured values, and an output may already be true or false. When these values come from input_output, they are treated as correct input-output pairs and used for completeness evaluation. When they come from mutated_input_output, they are treated as incorrect input-output pairs and used for soundness evaluation.

Case 2. APPS style: json.loads(...) plus parser

If your benchmark stores raw standard input and standard output text, json.loads(...) is only the first step. APPS uses this style. After loading, the data may look like this:

{
  "inputs": ["1 2 3\n", "5 4\n"],
  "outputs": ["True\n", "False\n"]
}

In this case, the decoded values are still just strings. They do not yet match the structure expected by the signature, so you need a parser field that converts an input string and an output string into the argument tuple required by postcondition(...).

For example, if the signature is def postcondition(nums: list[int], result: bool):, the parser may need to:

  1. Split the input string.
  2. Convert the tokens to integers.
  3. Convert the output string to a Boolean value.
  4. Return (nums, result).

The important rule is that the parser return type must match the postcondition parameter types. For this example, the expected parser return type is tuple[list[int], bool], because postcondition(...) takes nums: list[int] and result: bool.
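The four steps above can be sketched as a parser (assuming space-separated integers on input and "True"/"False" on output; your benchmark's raw format may differ):

```python
def parser(input: str, output: str) -> tuple[list[int], bool]:
    nums = [int(tok) for tok in input.split()]  # steps 1-2: split and convert
    result = output.strip() == "True"           # step 3: output string to bool
    return nums, result                         # step 4: signature order
```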

The loader still stores (input, output) pairs in metadata, but later generation and evaluation steps call the parser to convert those raw strings into meaningful values. Pairs from input_output are used to check completeness, and pairs from mutated_input_output are used to check soundness.

In both cases, the loader pairs inputs[i] with outputs[i], so both arrays must have the same length.

7.5 Type annotations in signature

For signature, prefer ordinary Python type annotations. For example:

def postcondition(xs: list[int], target: int, result: bool):
    pass

This project already contains converters that read these Python type annotations and turn them into the internal specification types used during template generation and evaluation. You should reuse that conversion path rather than inventing a separate type notation for the dataset.

  • int -> int
  • bool -> bool
  • float -> real
  • str -> string
  • list[T] -> list[T]
  • set[T] -> set[T]
  • tuple[T1, T2, ...] -> tuple[T1, T2, ...]
  • dict[K, V] -> map[K, V]
  • Optional[T] -> option[T]

The relevant conversion logic is in /workspace/expecto-artifact/expecto/src/utils/dsl.py.
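The mapping above can be illustrated with a small sketch (a simplified illustration, not the actual code in dsl.py):

```python
from typing import Optional, Union, get_args, get_origin

# Simplified mapping from Python annotations to specification types,
# mirroring the table above.
BASE = {int: "int", bool: "bool", float: "real", str: "string"}
CONTAINER = {list: "list", set: "set", tuple: "tuple", dict: "map"}

def to_spec(ann) -> str:
    origin = get_origin(ann)
    if origin is None:
        return BASE[ann]                      # plain base type
    if origin is Union:                       # Optional[T] == Union[T, None]
        inner = [a for a in get_args(ann) if a is not type(None)]
        return f"option[{to_spec(inner[0])}]"
    args = ", ".join(to_spec(a) for a in get_args(ann))
    return f"{CONTAINER[origin]}[{args}]"
```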

7.6 General rules for Defects4J-style items

Defects4J-style datasets are method-level Java datasets. Use this format when each item describes a Java method execution with explicit before-and-after state, rather than stdin/stdout examples. The default loader accepts either a JSONL file with one bug record per line or a JSON array of the same records.

Each top-level bug record should contain the following fields.

  • project (required): Becomes part of every method sample ID and is copied into sample metadata.
  • bug_id (required): Becomes part of every method sample ID and is copied into sample metadata. Store it as a string or a value that can be converted to a string.
  • method_dumps (required): A list of method-level records. Each element becomes one benchmark sample.

Each method_dumps element should contain the following fields.

  • method_info (required): Describes the target Java method and the schemas used to interpret serialized executions.
  • corrects (required): A mapping from test IDs to serialized executions that should satisfy the generated specification. These executions are used for completeness evaluation.
  • incorrects (required): A mapping from test IDs to serialized executions that should be rejected by the generated specification. These executions are used for soundness evaluation.

The method_info object should contain the following fields.

  • code (required): The Java method source shown to the model when method code is included in the prompt.
  • file (required): The source file path. Together with signature, it is used to build the stable method sample ID.
  • javadoc (required): A structured Javadoc object shown to the model. The existing examples use description, params, returns, and throws.
  • signature (required): The Java method signature shown to the model and used in the stable method sample ID. This is a Java signature such as String hello(String name), not a Python postcondition(...) signature.
  • entry_schema (required): A schema object with params and self DSL type strings describing the pre-state.
  • exit_schema (required): A schema object with self and ret DSL type strings describing the post-state and return value.

The generated sample ID is derived from project, bug_id, method_info.file, and method_info.signature. The file path and signature are sanitized into an identifier, so an example such as project = "Hello", bug_id = "1", file = "example/Hello.java", and signature = "String hello(String name)" becomes:

Hello_1_example_Hello_java_String_hello_String_name

If two methods produce the same base ID inside the same bug record, the loader appends a numeric suffix such as _2. Use the generated IDs with --sample-ids.
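The ID construction can be approximated as follows (a sketch of the sanitization rule inferred from the example above, not the loader's exact code; the numeric-suffix handling for duplicates is omitted):

```python
import re

def sample_id(project: str, bug_id: str, file: str, signature: str) -> str:
    def clean(s: str) -> str:
        # Replace every run of non-alphanumeric characters with "_".
        return re.sub(r"[^0-9A-Za-z]+", "_", s).strip("_")
    return f"{project}_{bug_id}_{clean(file)}_{clean(signature)}"

assert sample_id("Hello", "1", "example/Hello.java",
                 "String hello(String name)") \
    == "Hello_1_example_Hello_java_String_hello_String_name"
```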

Schemas and serialized executions

Defects4J-style schemas use Expecto DSL type syntax directly. They do not use the Python type annotations from APPS/HumanEval+-style signature fields.

Common schema types are:

  • integer value -> int
  • Boolean value -> bool
  • floating-point value -> real
  • string value -> string
  • no receiver or no return value -> nonetype
  • nullable value -> option[T]
  • array or list -> list[T]
  • set -> set[T]
  • fixed tuple -> tuple[T1, T2, ...]
  • map -> map[K, V]
  • object or parameter bundle -> record[field1: T1, field2: T2, ...]

entry_schema.params should normally be a record[...] with one field for each Java parameter, and the field names in that record must match the keys stored in each execution's entry.params object. In addition:

  • For a static method, set entry_schema.self = "nonetype" and exit_schema.self = "nonetype", and store null for both entry.self and exit.self.
  • For an instance method, use self to store the receiver object before and after the call.
  • For a void method, set exit_schema.ret = "nonetype" and store null for exit.ret.

Each execution in corrects or incorrects must have this shape:

{
  "entry": {
    "params": {"name": "World"},
    "self": null
  },
  "exit": {
    "self": null,
    "ret": "Hello, World!"
  }
}

The values inside entry and exit must match entry_schema and exit_schema. For option[T], store null when the value is absent; non-null values are wrapped as some(...) during DSL evaluation. Avoid relying on missing keys: provide every key described by the schema so the dataset is easy to inspect and failures are explicit.

There is no parser, input_output, mutated_input_output, or Python signature field in Defects4J-style items. The dataset already stores serialized parameter values, receiver state, return values, and post-state. During evaluation, corrects must all be accepted for completeness to pass. For soundness, the current Defects4J scorer passes a sample when at least one execution in incorrects is rejected, so include incorrect executions that represent meaningful bug-revealing behavior.
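The evaluation rule above can be sketched as follows (a simplified illustration; spec, evaluate, and the toy data are illustrative names, not the artifact's API):

```python
def evaluate(spec, corrects, incorrects):
    # Completeness: every correct execution must be accepted.
    completeness = all(spec(e) for e in corrects.values())
    # Soundness (Defects4J scorer): at least one incorrect
    # execution must be rejected.
    soundness = any(not spec(e) for e in incorrects.values())
    return completeness, soundness

# Toy specification over hello-world executions: the return value
# must be the greeting built from the name parameter.
spec = lambda e: e["exit"]["ret"] == "Hello, " + e["entry"]["params"]["name"] + "!"
corrects = {"t1": {"entry": {"params": {"name": "World"}}, "exit": {"ret": "Hello, World!"}}}
incorrects = {"t2": {"entry": {"params": {"name": "World"}}, "exit": {"ret": "Goodbye"}}}
assert evaluate(spec, corrects, incorrects) == (True, True)
```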

If your benchmark needs extra metadata, add it only after preserving the fields above. The built-in Defects4J loader ignores unknown fields; a custom task can copy extra fields into sample metadata if it needs them.

7.7 Steps to add the benchmark

  1. Add a new dataset file such as datasets/my_benchmark.json.
  2. Create a new task file by copying /workspace/expecto-artifact/expecto/src/tasks/apps.py or /workspace/expecto-artifact/expecto/src/tasks/humaneval_plus.py.
  3. Update the task name, dataset file name, and benchmark= argument passed to sample_sequence_for_validation(...).
  4. Keep the record_to_sample(...) contract compatible with the existing solvers and scorers.

If your new benchmark can be expressed with this schema, you usually do not need to change the solver or scorer code: adding the dataset file and a new task module modeled after /workspace/expecto-artifact/expecto/src/tasks/apps.py or /workspace/expecto-artifact/expecto/src/tasks/humaneval_plus.py is typically enough.
