23 changes: 23 additions & 0 deletions examples/10_Agentic_Inference/kimi_agentic_benchmark.yaml
@@ -24,6 +24,29 @@ datasets:
       # Required benchmark default; set to true only for faster optimization/debug runs.
       stop_issuing_on_first_user_complete: false

+  - name: "aime25::gptoss"
+    type: "accuracy"
+    accuracy_config:
+      eval_method: "pass_at_1"
+      ground_truth: "answer"
+      extractor: "boxed_math_extractor"
+      num_repeats: 8
+
+  - name: "gpqa::gptoss"
+    type: "accuracy"
+    accuracy_config:
+      eval_method: "pass_at_1"
+      extractor: "abcd_extractor"
+      ground_truth: "ground_truth"
+      num_repeats: 5
+
+  - name: "livecodebench::gptoss"
+    type: "accuracy"
+    accuracy_config:
+      eval_method: "code_bench_scorer"
+      extractor: "python_code_extractor"
+      num_repeats: 3
+
 settings:
   runtime:
     min_duration_ms: 0
63 changes: 63 additions & 0 deletions examples/10_Agentic_Inference/qwen_agentic_benchmark.yaml
@@ -0,0 +1,63 @@
+name: "qwen-agentic-benchmark"
+version: "1.0"
+type: "online"
+
+model_params:
+  name: "Qwen/Qwen3.6-35B-A3B"
+  temperature: 1.0
+  top_k: 20
+  top_p: 0.95
+  repetition_penalty: 1.0
+  presence_penalty: 1.5
+  max_new_tokens: 8192
+  chat_template_kwargs:
+    preserve_thinking: true
+
+datasets:
+  - name: agentic_coding
+    type: performance
+    path: /home/tianmuli/vllm_test/datasets/agentic_combined_v4.jsonl

medium

The dataset path is hardcoded to a local absolute path (/home/tianmuli/vllm_test/datasets/agentic_combined_v4.jsonl). This makes the example configuration non-portable and will fail for other users. Please use a placeholder path (e.g., /path/to/agentic_combined_v4.jsonl) or a relative path instead.

    path: /path/to/agentic_combined_v4.jsonl
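
If a relative or environment-based path is preferred over a placeholder, the resolution could look roughly like the sketch below. This is only an illustration: `resolve_dataset_path` and the `DATASET_DIR` variable are hypothetical names, and nothing in this PR shows that the benchmark's config loader expands environment variables or anchors relative paths this way.

```python
import os
from pathlib import Path


def resolve_dataset_path(raw_path: str, config_dir: str) -> Path:
    """Expand env vars and '~', then anchor relative paths at the config's directory.

    Hypothetical helper, not part of the project.
    """
    expanded = os.path.expanduser(os.path.expandvars(raw_path))
    path = Path(expanded)
    if not path.is_absolute():
        # Relative paths are interpreted relative to the YAML file's directory.
        path = Path(config_dir) / path
    return path


# DATASET_DIR is an assumed variable name used only for this example.
os.environ["DATASET_DIR"] = "/data/agentic"
print(resolve_dataset_path("$DATASET_DIR/agentic_combined_v4.jsonl", "/repo/examples"))
print(resolve_dataset_path("datasets/agentic_combined_v4.jsonl", "/repo/examples"))
```

Either form keeps the example config free of a user-specific home directory.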

+    accuracy_config:
+      eval_method: agentic_inference_inline # required benchmark default.
+    agentic_inference:
+      turn_timeout_s: 14400.0
+      num_trajectories_to_issue: 1
+
+  - name: "aime25::gptoss"
+    type: "accuracy"
+    accuracy_config:
+      eval_method: "pass_at_1"
+      ground_truth: "answer"
+      extractor: "boxed_math_extractor"
+      num_repeats: 8
+
+  - name: "gpqa::gptoss"
+    type: "accuracy"
+    accuracy_config:
+      eval_method: "pass_at_1"
+      extractor: "abcd_extractor"
+      ground_truth: "ground_truth"
+      num_repeats: 5
+
+  - name: "livecodebench::gptoss"
+    type: "accuracy"
+    accuracy_config:
+      eval_method: "code_bench_scorer"
+      extractor: "python_code_extractor"
+      num_repeats: 3
+
+settings:
+  runtime:
+    min_duration_ms: 0
+    max_duration_ms: 36000000
+
+  load_pattern:
+    type: agentic_inference
+    target_concurrency: 128 # Submission-specific concurrency.
+
+endpoint_config:
+  endpoints:
+    - "http://localhost:30000"
+  api_type: openai
+
+report_dir: logs/qwen_agentic
33 changes: 26 additions & 7 deletions src/inference_endpoint/commands/benchmark/execute.py
@@ -395,7 +395,20 @@ def setup_benchmark(config: BenchmarkConfig, test_mode: TestMode) -> BenchmarkCo
     dataloader, accuracy_datasets, eval_configs = _load_datasets(config, report_dir)

     # Setup runtime settings using factory method
-    rt_settings = RuntimeSettings.from_config(config, dataloader.num_samples())
+    agentic_overrides: dict = {}
+    if isinstance(dataloader, AgenticInferenceDataset):
+        perf_cfgs = [d for d in config.datasets if d.type == DatasetType.PERFORMANCE]
+        agentic_cfg = perf_cfgs[0].agentic_inference if perf_cfgs else None
+        assert dataloader.conversation_metadata is not None
+        agentic_overrides = {
+            "agentic_num_conversations": dataloader.conversation_metadata.num_conversations,
+            "agentic_num_trajectories": agentic_cfg.num_trajectories_to_issue
+            if agentic_cfg is not None
+            else None,
+        }
+    rt_settings = RuntimeSettings.from_config(
+        config, dataloader.num_samples(), **agentic_overrides
+    )
Comment on lines +398 to +411

medium

There are two improvement opportunities here:

  1. Defensive Programming: Using assert for runtime validation/control flow is discouraged because assertions can be globally disabled in Python (e.g., via the -O flag or PYTHONOPTIMIZE environment variable). If assertions are disabled, dataloader.conversation_metadata could be None and lead to an unhandled AttributeError on the next line. It is safer to explicitly check for None and raise a SetupError.
  2. Code Reusability: Instead of manually filtering config.datasets to find the performance dataset, you can leverage the existing config.get_single_dataset() helper method.
Suggested change
-    agentic_overrides: dict = {}
-    if isinstance(dataloader, AgenticInferenceDataset):
-        perf_cfgs = [d for d in config.datasets if d.type == DatasetType.PERFORMANCE]
-        agentic_cfg = perf_cfgs[0].agentic_inference if perf_cfgs else None
-        assert dataloader.conversation_metadata is not None
-        agentic_overrides = {
-            "agentic_num_conversations": dataloader.conversation_metadata.num_conversations,
-            "agentic_num_trajectories": agentic_cfg.num_trajectories_to_issue
-            if agentic_cfg is not None
-            else None,
-        }
-    rt_settings = RuntimeSettings.from_config(
-        config, dataloader.num_samples(), **agentic_overrides
-    )
+    agentic_overrides: dict = {}
+    if isinstance(dataloader, AgenticInferenceDataset):
+        perf_ds = config.get_single_dataset()
+        agentic_cfg = perf_ds.agentic_inference if perf_ds is not None else None
+        metadata = dataloader.conversation_metadata
+        if metadata is None:
+            raise SetupError("AgenticInferenceDataset is missing conversation metadata.")
+        agentic_overrides = {
+            "agentic_num_conversations": metadata.num_conversations,
+            "agentic_num_trajectories": agentic_cfg.num_trajectories_to_issue
+            if agentic_cfg is not None
+            else None,
+        }
+    rt_settings = RuntimeSettings.from_config(
+        config, dataloader.num_samples(), **agentic_overrides
+    )
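
To illustrate the first point in isolation, here is a minimal self-contained sketch of the explicit check. `SetupError` and `FakeDataset` below are stand-ins defined only for this example (the real `SetupError` is assumed to be an ordinary `Exception` subclass); unlike an `assert`, the `if ... raise` form is still enforced when Python runs with `-O`.

```python
class SetupError(Exception):
    """Stand-in for the project's SetupError (assumed to be an Exception subclass)."""


class FakeDataset:
    """Hypothetical dataset whose conversation metadata may be missing."""

    def __init__(self, conversation_metadata=None):
        self.conversation_metadata = conversation_metadata


def require_metadata(dataset):
    """Return the metadata, or raise SetupError; not stripped under `python -O`."""
    metadata = dataset.conversation_metadata
    if metadata is None:
        raise SetupError("AgenticInferenceDataset is missing conversation metadata.")
    return metadata


try:
    require_metadata(FakeDataset())
except SetupError as exc:
    # The failure surfaces as a domain error instead of a later AttributeError.
    print(f"setup failed cleanly: {exc}")
```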


     # Calculate and display expected sample count
     total_samples = rt_settings.total_samples_to_issue()
@@ -476,6 +489,7 @@ def _build_phases(

     # Accuracy phases — use eval_cfg.dataset_name as phase name so it matches
     # what Scorer._load_sample_index_map() looks up in sample_idx_map.json
+    perf_lp = ctx.rt_settings.load_pattern
     for eval_cfg in ctx.eval_configs:
         if eval_cfg.dataset_name == "performance":
             continue
@@ -486,12 +500,17 @@ def _build_phases(
                 "AgenticInferenceDataset, which is not yet supported for "
                 "accuracy evaluation."
             )
-        # Accuracy phases run at MAX_THROUGHPUT; inheriting perf_lp (e.g. POISSON)
-        # would silently rate-limit evaluation until an agentic inference accuracy strategy
-        # and QPS-budgeting support are added.
-        acc_load_pattern: LoadPattern | None = LoadPattern(
-            type=LoadPatternType.MAX_THROUGHPUT
-        )
+        # Accuracy phases inherit the perf load pattern so the KV pool is not
+        # exhausted by simultaneous burst requests. AGENTIC_INFERENCE is
+        # downgraded to CONCURRENCY with the same cap because plain accuracy
+        # datasets are single-turn and cannot use the agentic scheduler.
+        if perf_lp is not None and perf_lp.type == LoadPatternType.AGENTIC_INFERENCE:
+            acc_load_pattern: LoadPattern | None = LoadPattern(
+                type=LoadPatternType.CONCURRENCY,
+                target_concurrency=perf_lp.target_concurrency,
+            )
+        else:
+            acc_load_pattern = perf_lp
         acc_settings = RuntimeSettings(
             metric_target=ctx.rt_settings.metric_target,
             reported_metrics=ctx.rt_settings.reported_metrics,
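
The load-pattern selection in this hunk can be read as a small pure function. The sketch below restates it with stand-in types: `LoadPattern` and `LoadPatternType` here are simplified local definitions for illustration (the real classes live in the project and may carry more fields), and `accuracy_load_pattern` is a hypothetical name.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LoadPatternType(Enum):
    # Simplified stand-in; only the members mentioned in the diff.
    MAX_THROUGHPUT = "max_throughput"
    POISSON = "poisson"
    CONCURRENCY = "concurrency"
    AGENTIC_INFERENCE = "agentic_inference"


@dataclass(frozen=True)
class LoadPattern:
    type: LoadPatternType
    target_concurrency: Optional[int] = None


def accuracy_load_pattern(perf_lp: Optional[LoadPattern]) -> Optional[LoadPattern]:
    """Pick the accuracy-phase pattern from the performance-phase pattern."""
    if perf_lp is not None and perf_lp.type == LoadPatternType.AGENTIC_INFERENCE:
        # Single-turn accuracy datasets cannot use the agentic scheduler,
        # so keep the same concurrency cap under a plain CONCURRENCY pattern.
        return LoadPattern(
            type=LoadPatternType.CONCURRENCY,
            target_concurrency=perf_lp.target_concurrency,
        )
    # Any other pattern (including None) is inherited unchanged.
    return perf_lp


agentic = LoadPattern(LoadPatternType.AGENTIC_INFERENCE, target_concurrency=128)
print(accuracy_load_pattern(agentic))
```

With the example config's `target_concurrency: 128`, accuracy phases would run at a concurrency cap of 128 rather than at maximum throughput.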
34 changes: 29 additions & 5 deletions src/inference_endpoint/config/runtime_settings.py
@@ -85,6 +85,12 @@ class RuntimeSettings:
     load_pattern: LoadPattern | None
     """Load pattern configuration"""

+    agentic_num_trajectories: int | None = None
+    """For agentic inference: num_trajectories_to_issue from dataset config (None = all)."""
+
+    agentic_num_conversations: int | None = None
+    """For agentic inference: total distinct conversations in the loaded dataset."""
+
     @classmethod
     def from_config(
         cls,
@@ -200,19 +206,37 @@ def total_samples_to_issue(
             self.load_pattern is not None
             and self.load_pattern.type == LoadPatternType.AGENTIC_INFERENCE
         ):
-            if self.n_samples_from_dataset < self.min_sample_count:
+            total = self.n_samples_from_dataset
+            if (
+                self.agentic_num_trajectories is not None
+                and self.agentic_num_conversations is not None
+                and self.agentic_num_conversations > 0
+            ):
+                # Scale proportionally: turns_per_trajectory ≈ total_turns / num_conversations
+                total = max(
+                    1,
+                    round(
+                        self.n_samples_from_dataset
+                        * self.agentic_num_trajectories
+                        / self.agentic_num_conversations
+                    ),
+                )
+            if total < self.min_sample_count:
                 logger.warning(
                     "Agentic inference run: min_sample_count=%d exceeds dataset "
                     "client-turn count=%d; using dataset size. Agentic inference cannot "
                     "issue more samples than the dataset provides.",
                     self.min_sample_count,
-                    self.n_samples_from_dataset,
+                    total,
                 )
             logger.debug(
-                "Sample count: %d (agentic inference: issuing all client turns)",
-                self.n_samples_from_dataset,
+                "Sample count: %d (agentic inference: %s)",
+                total,
+                f"{self.agentic_num_trajectories} trajectories × avg turns"
+                if self.agentic_num_trajectories is not None
+                else "issuing all client turns",
             )
-            return self.n_samples_from_dataset
+            return total

         # If min_duration is 0, use all dataset samples (new CLI default behavior)
         if self.min_duration_ms == 0:
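
A worked example of the proportional scaling in this hunk may help when checking expected sample counts. The function below is a standalone restatement of the arithmetic, with a hypothetical name (`estimate_agentic_samples`) and made-up input numbers; it is not the project's API. Note that Python's built-in `round` rounds exact halves to the nearest even integer, so 2.5 becomes 2 while 3.5 becomes 4.

```python
def estimate_agentic_samples(total_turns, num_trajectories, num_conversations):
    """Estimate issued samples as total_turns * trajectories / conversations."""
    if (
        num_trajectories is None
        or num_conversations is None
        or num_conversations <= 0
    ):
        # No trajectory budget (or no metadata): every client turn is issued.
        return total_turns
    # At least one sample is always reported, even for very small ratios.
    return max(1, round(total_turns * num_trajectories / num_conversations))


print(estimate_agentic_samples(1200, 10, 100))    # 1200 * 10 / 100 = 120
print(estimate_agentic_samples(5, 1, 2))          # 2.5 rounds to even -> 2
print(estimate_agentic_samples(3, 1, 10))         # 0.3 rounds to 0, clamped to 1
print(estimate_agentic_samples(1200, None, 100))  # no budget -> 1200
```

Because the result is an average over conversations, it is only an estimate when trajectories differ in length.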
4 changes: 2 additions & 2 deletions src/inference_endpoint/config/schema.py
@@ -273,14 +273,14 @@ class AgenticInferenceConfig(BaseModel):
         ),
     )
     enable_salt: bool = Field(
-        False,
+        True,
         description=(
             "Add deterministic salt markers before and after the system prompt "
             "to prevent KV cache reuse across trajectories in agentic inference setting."
         ),
     )
Comment on lines 275 to 281

medium

Changing the default of enable_salt to True introduces a strict requirement that every conversation in the dataset must have a system prompt. If a dataset lacks a system prompt, AgenticInferenceStrategy._validate_salt_system_prompts will raise an InputValidationError, breaking backward compatibility for simpler datasets.

To make this default robust, consider updating AgenticInferenceStrategy (in a separate PR or file) to automatically prepend an empty system prompt if none is present, rather than failing. Otherwise, keeping the default as False might be safer to avoid unexpected validation failures for users.
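
As a rough sketch of the fallback described above, assuming each conversation is available as a list of message dicts with `role` and `content` keys (an assumption about the data shape, not something this diff confirms), an empty system prompt could be prepended like this. `ensure_system_prompt` is a hypothetical helper name.

```python
def ensure_system_prompt(messages):
    """Return a copy of `messages` that is guaranteed to start with a system message."""
    if messages and messages[0].get("role") == "system":
        return list(messages)
    # No leading system prompt: insert an empty one so salt markers have a place to go.
    return [{"role": "system", "content": ""}] + list(messages)


conversation = [{"role": "user", "content": "First"}]
print(ensure_system_prompt(conversation))
```

The input list is left unmodified, so callers that share dataset rows across trajectories would not see side effects.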

     inject_tool_delay: bool = Field(
-        False,
+        True,
         description=(
             "Pause for a predefined duration between turns. Duration is defined "
             "in dataset."
3 changes: 2 additions & 1 deletion tests/integration/test_agentic_inference.py
@@ -93,6 +93,7 @@ def _make_strategy(
     inject_tool_delay: bool = False,
 ) -> AgenticInferenceStrategy:
     agentic_cfg = AgenticInferenceConfig(
+        enable_salt=False,
         turn_timeout_s=10.0,
         inject_tool_delay=inject_tool_delay,
     )
@@ -377,7 +378,7 @@ async def test_turn_ordering_enforced_end_to_end(echo_server):
         {"conversation_id": "c1", "turn": 3, "role": "user", "content": "Second"},
     ]
     ds = _make_dataset(rows)
-    agentic_cfg = AgenticInferenceConfig(turn_timeout_s=10.0)
+    agentic_cfg = AgenticInferenceConfig(enable_salt=False, turn_timeout_s=10.0)
     conv_manager = ConversationManager()
     strategy = AgenticInferenceStrategy(
         conversation_manager=conv_manager,
27 changes: 20 additions & 7 deletions tests/unit/load_generator/test_agentic_inference_strategy.py
@@ -158,7 +158,7 @@ def _make_dataset_metadata(conversations: dict[str, list[int]]) -> ConversationM
 async def test_first_user_complete_stops_tracking_but_can_continue_for_accuracy():
     conv_manager = ConversationManager()
     metadata = _make_dataset_metadata({"conv1": [1], "conv2": [1, 2]})
-    cfg = AgenticInferenceConfig(num_trajectories_to_issue=2)
+    cfg = AgenticInferenceConfig(enable_salt=False, num_trajectories_to_issue=2)
     strategy = AgenticInferenceStrategy(
         conv_manager,
         metadata,
@@ -209,6 +209,7 @@ async def test_stop_on_first_user_complete_refills_until_budget_exhausted():
     conv_manager = ConversationManager()
     metadata = _make_dataset_metadata({"conv1": [1], "conv2": [1], "conv3": [1]})
     cfg = AgenticInferenceConfig(
+        enable_salt=False,
         stop_issuing_on_first_user_complete=True,
         num_trajectories_to_issue=3,
     )
@@ -766,7 +767,9 @@ async def test_abort_remaining_turns_includes_pending_delayed_turn():
     conv_manager = ConversationManager()
     conv_manager.get_or_create("c1", expected_client_turns=3)
     metadata = _metadata_with_delay("c1", [1, 2, 3], delay_turn=2, delay=60.0)
-    cfg = AgenticInferenceConfig(turn_timeout_s=5.0, inject_tool_delay=True)
+    cfg = AgenticInferenceConfig(
+        enable_salt=False, turn_timeout_s=5.0, inject_tool_delay=True
+    )
     strategy = AgenticInferenceStrategy(
         conv_manager, metadata, agentic_inference_config=cfg
     )
@@ -814,7 +817,9 @@ async def test_execute_waits_for_delayed_first_turns():
     conv_manager = ConversationManager()
     metadata = _make_dataset_metadata({"c1": [1], "c2": [1]})
     metadata.delay_seconds_by_key = {("c1", 1): 0.02, ("c2", 1): 0.02}
-    cfg = AgenticInferenceConfig(turn_timeout_s=5.0, inject_tool_delay=True)
+    cfg = AgenticInferenceConfig(
+        enable_salt=False, turn_timeout_s=5.0, inject_tool_delay=True
+    )
     strategy = AgenticInferenceStrategy(
         conv_manager,
         metadata,
@@ -851,7 +856,9 @@ async def test_inject_tool_delay_defers_issue_via_call_later():

     conv_manager = ConversationManager()
     metadata = _metadata_with_delay("c1", [1, 2], delay_turn=2, delay=0.05)
-    cfg = AgenticInferenceConfig(turn_timeout_s=5.0, inject_tool_delay=True)
+    cfg = AgenticInferenceConfig(
+        enable_salt=False, turn_timeout_s=5.0, inject_tool_delay=True
+    )
     strategy = AgenticInferenceStrategy(
         conv_manager, metadata, agentic_inference_config=cfg
     )
@@ -998,7 +1005,9 @@ async def test_inject_tool_delay_disabled_issues_immediately():

     conv_manager = ConversationManager()
     metadata = _metadata_with_delay("c1", [1, 2], delay_turn=2, delay=2.0)
-    cfg = AgenticInferenceConfig(turn_timeout_s=5.0, inject_tool_delay=False)
+    cfg = AgenticInferenceConfig(
+        enable_salt=False, turn_timeout_s=5.0, inject_tool_delay=False
+    )
     strategy = AgenticInferenceStrategy(
         conv_manager, metadata, agentic_inference_config=cfg
     )
@@ -1049,7 +1058,9 @@ async def test_inject_tool_delay_no_dataset_field_back_compat():

     conv_manager = ConversationManager()
     metadata = _make_dataset_metadata({"c1": [1, 2]})
-    cfg = AgenticInferenceConfig(turn_timeout_s=5.0, inject_tool_delay=True)
+    cfg = AgenticInferenceConfig(
+        enable_salt=False, turn_timeout_s=5.0, inject_tool_delay=True
+    )
     strategy = AgenticInferenceStrategy(
         conv_manager, metadata, agentic_inference_config=cfg
     )
@@ -1082,7 +1093,9 @@ async def test_inject_tool_delay_cancels_on_timeout():

     conv_manager = ConversationManager()
     metadata = _metadata_with_delay("c1", [1, 2, 3], delay_turn=3, delay=1.0)
-    cfg = AgenticInferenceConfig(turn_timeout_s=0.1, inject_tool_delay=True)
+    cfg = AgenticInferenceConfig(
+        enable_salt=False, turn_timeout_s=0.1, inject_tool_delay=True
+    )
     strategy = AgenticInferenceStrategy(
         conv_manager, metadata, agentic_inference_config=cfg
     )