E2E RAG reference implementation by mkankana · Pull Request #2602 · mlcommons/inference

mkankana · 2026-06-15T21:04:22Z

This PR introduces a comprehensive end-to-end Retrieval-Augmented Generation (RAG) benchmark for evaluating multi-hop question answering systems on the FRAMES dataset. The benchmark measures both retrieval accuracy and answer quality for complex questions requiring information synthesis from multiple Wikipedia documents.

## Multi-Shot Iterative Retrieval

Novel multi-hop retrieval pipeline:

1. Initial retrieval with the user query
2. The LLM evaluates document relevance
3. If the retrieved documents are insufficient, the LLM generates focused sub-queries
4. Additional documents are retrieved for each sub-query
5. The loop iterates until the information is sufficient or the maximum number of iterations is reached
6. The final answer is generated from the accumulated context

This approach achieves 29-32% LLM judge accuracy on the full 824-query FRAMES dataset, compared with 20% for single-shot retrieval.
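The retrieval loop described above can be sketched as follows. This is a minimal illustration, not the PR's code: `retrieve` and `assess` are hypothetical callables standing in for the vector DB lookup and the LLM relevance check, and the default of 3 iterations is an assumption. Final answer generation is left to the caller.

```python
from typing import Callable, List, Tuple


def multi_shot_retrieve(
    query: str,
    retrieve: Callable[[str], List[str]],
    assess: Callable[[str, List[str]], Tuple[bool, List[str]]],
    max_iterations: int = 3,  # assumed default, not taken from the PR
) -> List[str]:
    """Accumulate documents until `assess` is satisfied or the budget runs out."""
    context: List[str] = []
    seen = set()

    def add(docs: List[str]) -> None:
        # Keep first-seen order and drop documents already in the context.
        for doc in docs:
            if doc not in seen:
                seen.add(doc)
                context.append(doc)

    add(retrieve(query))  # initial retrieval with the user query
    for _ in range(max_iterations):
        # `assess` returns (sufficient?, focused sub-queries).
        sufficient, sub_queries = assess(query, context)
        if sufficient or not sub_queries:
            break
        for sub_query in sub_queries:
            add(retrieve(sub_query))  # one extra retrieval per sub-query
    return context
```

Passing the retriever and the judge as callables keeps the loop independent of any particular vector store or LLM endpoint, which makes it easy to exercise with stubs.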

## Architecture

### Pipeline Flow

Documents → Chunking → Embedding → Vector DB → Retrieval → Reranking → LLM → Answer

## Benchmark Results
Full FRAMES dataset (824 queries):
| Approach | Precision@N | Recall@N | F1@N | LLM Judge Accuracy |
|---|---|---|---|---|
| Oracle (ground truth docs) | 100% | 100% | 100% | 68% |
| Single-shot retrieval | 39% | 70% | 42% | 20% |
| **Multi-shot iterative** | **72%** | **67%** | **66%** | **29-32%** |
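The PR page does not spell out how the retrieval columns are computed. A minimal sketch, assuming set-based precision, recall, and F1 are computed per query and then averaged over all queries (the function names are hypothetical):

```python
from typing import Iterable, List, Set, Tuple


def prf(retrieved: Iterable[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1 for a single query."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def macro_average(per_query: List[Tuple[float, float, float]]) -> Tuple[float, float, float]:
    """Average each metric over queries (every query weighs the same)."""
    if not per_query:
        return 0.0, 0.0, 0.0
    n = len(per_query)
    p, r, f = (sum(column) / n for column in zip(*per_query))
    return p, r, f


# Two illustrative queries (made-up document ids, not FRAMES data).
scores = [
    prf(["a", "b", "c", "d"], {"a", "b"}),  # precision 0.5, recall 1.0
    prf(["x"], {"x", "y"}),                 # precision 1.0, recall 0.5
]
avg_p, avg_r, avg_f1 = macro_average(scores)
```

Under this averaging scheme the F1 column need not equal the harmonic mean of the Precision and Recall columns: in the toy example both averages are 0.75, while the averaged F1 is about 0.67.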

- INFERENCE_RERANKER_NUMA_NODE: pin reranker child to NUMA node N
- INFERENCE_RERANKER_OMP_NUM_THREADS: override reranker OMP threads
- INFERENCE_EMBEDDING_NUMA_NODES: CSV (one per --num_embedding_devices)
- INFERENCE_EMBEDDING_OMP_NUM_THREADS: cap per worker (default = even split)
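One way the embedding-side variables could be interpreted is sketched below. The helper names are hypothetical, and the exact semantics are assumptions: the CSV is taken to hold exactly one integer per embedding device, and the OMP variable, when set, is used directly as the per-worker thread count.

```python
import os
from typing import List, Mapping, Optional


def parse_embedding_numa_nodes(env: Mapping[str, str], num_devices: int) -> Optional[List[int]]:
    """Return one NUMA node per embedding device, or None if the variable is unset."""
    raw = env.get("INFERENCE_EMBEDDING_NUMA_NODES", "").strip()
    if not raw:
        return None
    nodes = [int(token) for token in raw.split(",")]
    if len(nodes) != num_devices:
        raise ValueError(
            f"expected {num_devices} NUMA nodes, got {len(nodes)}: {raw!r}"
        )
    return nodes


def embedding_omp_threads(env: Mapping[str, str], num_devices: int, cpu_count: int) -> int:
    """Threads per embedding worker: explicit value if set, else an even split."""
    raw = env.get("INFERENCE_EMBEDDING_OMP_NUM_THREADS", "").strip()
    if raw:
        return max(1, int(raw))
    return max(1, cpu_count // num_devices)


# Reading from the real environment; both helpers also accept a plain dict.
numa_nodes = parse_embedding_numa_nodes(os.environ, num_devices=2) if False else None
```

Taking the environment as a `Mapping` argument rather than reading `os.environ` inside the helpers makes the parsing easy to check without touching process state.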

…reading

- Mask all OpenRouter API keys (security)
- Default to local vLLM server (http://127.0.0.1:8123)
- Add max_async_queries=10 for concurrent processing
- Fix cache key to use stable sample_id
- Update run_multi_shot.sh and evaluate.py to use local vLLM
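A bounded-concurrency pattern consistent with `max_async_queries=10` might look like the sketch below. `run_bounded`, `cache_key`, and the use of SHA-256 are assumptions for illustration; the commit only states that the key is now based on a stable `sample_id`.

```python
import asyncio
import hashlib
from typing import Awaitable, Callable, List


def cache_key(sample_id: str) -> str:
    """Key derived only from the sample id, so it is identical across runs."""
    return hashlib.sha256(sample_id.encode("utf-8")).hexdigest()


async def run_bounded(
    sample_ids: List[str],
    worker: Callable[[str], Awaitable[str]],
    max_async_queries: int = 10,
) -> List[str]:
    """Run `worker` for every sample with at most `max_async_queries` in flight."""
    semaphore = asyncio.Semaphore(max_async_queries)

    async def guarded(sample_id: str) -> str:
        async with semaphore:
            return await worker(sample_id)

    # gather() returns results in the order of its arguments.
    return await asyncio.gather(*(guarded(sid) for sid in sample_ids))
```

A digest of the id is used here instead of Python's built-in `hash()`, which is salted per process for strings and therefore not reproducible between runs.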

- Updated .gitignore to exclude data artifacts, cache files, and generated outputs
- Removed WIP language and internal references from README
- Replaced hardcoded paths with generic defaults (intfloat/e5-base-v2, "llama")
- Added Apache 2.0 license headers to remaining Python files

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

github-actions · 2026-06-15T21:04:35Z

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

- Add --judge_service_url and --judge_model CLI arguments to accuracy_eval.py
- Update reference_mlperf_accuracy.sh to use Llama-3.1-8B-Instruct as default judge
- Pass judge configuration through reference_mlperf.py to accuracy_eval.py
- Allow override via JUDGE_SERVICE_URL and JUDGE_MODEL environment variables

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
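The flag and environment-variable names below come from the commit message; everything else is an assumption. In particular, the precedence (CLI flag, then environment variable, then built-in default), the default URL, and the exact model string are placeholders, and `build_judge_parser` is a hypothetical helper rather than code from accuracy_eval.py.

```python
import argparse
import os
from typing import Mapping, Optional


def build_judge_parser(env: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    """Judge options whose defaults fall back to environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Accuracy evaluation (sketch)")
    parser.add_argument(
        "--judge_service_url",
        # Placeholder default; the real default is not stated on the PR page.
        default=env.get("JUDGE_SERVICE_URL", "http://127.0.0.1:8123"),
        help="Endpoint of the LLM judge service",
    )
    parser.add_argument(
        "--judge_model",
        default=env.get("JUDGE_MODEL", "Llama-3.1-8B-Instruct"),
        help="Model name passed to the judge service",
    )
    return parser


# No flags given: the environment (or the built-in default) decides.
args = build_judge_parser({"JUDGE_MODEL": "my-judge"}).parse_args([])
```

Resolving the environment when the parser is built keeps a single source of truth for defaults, so `--help` shows the value that will actually be used.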

- Make parsing and chunking mandatory (removed indexing-only mode)
- Set default database name to vector_html_hnsw_len768_ov32_word to match MLPerf expectations
- Set default manifest file to scripts/db_manifest_intel_xpu.json.gz
- Move database validation after save (exclude from performance measurement)
- Fix metric field name mismatches (data_setup_time_seconds, throughput_passages_per_second)
- Suppress confusing benchmark summary from read_docs.py
- Exclude DB initialization time from KPI (only measure parsing+chunking+indexing+save)
- Add temp_complete_kpi_*.json to .gitignore

Performance measurement now correctly includes:

1. HTML parsing (BeautifulSoup) + text extraction
2. Passage chunking (fixed-length splitting)
3. Embedding generation + FAISS indexing
4. Database serialization

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
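Fixed-length splitting with overlap can be sketched as below. Reading the database name `vector_html_hnsw_len768_ov32_word` as "768-word passages with a 32-word overlap" is an assumption, and `chunk_words` is a hypothetical helper, not the function used in the PR.

```python
from typing import List


def chunk_words(text: str, length: int = 768, overlap: int = 32) -> List[str]:
    """Split text into passages of `length` words that share `overlap` words."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must satisfy 0 <= overlap < length")
    words = text.split()
    step = length - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + length]))
        if start + length >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Small illustrative call: 10 words, windows of 4 with 1 word of overlap.
sample_chunks = chunk_words(" ".join(str(i) for i in range(10)), length=4, overlap=1)
```

The early `break` avoids emitting a trailing passage that would consist only of words already covered by the previous window.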

Implements automated TEST09 compliance test to verify output token length in performance mode, preventing output truncation cheating.

Files added:

- compliance/TEST09/e2e-rag/audit.config: LoadGen config with thresholds
  * min_output_tokens: 211.92 (90% of reference)
  * max_output_tokens: 259.02 (110% of reference)
  * reference mean: 235.47 tokens from 5 production runs
- compliance/TEST09/e2e-rag/README.md: Usage instructions
- run_compliance_test09.sh: Fully automated test runner
  * Copies audit.config to working directory
  * Runs performance test with compliance logging
  * Verifies output token length thresholds
  * Generates submission artifacts
  * Cleans up automatically
- ISL_OSL_statistics.txt: Reference data for threshold calculation
  * answer_generator OSL from 5 runs (4021 total samples)
  * Weighted mean: 235.47 tokens

Usage: bash run_compliance_test09.sh

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
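The thresholds quoted above follow from the reference mean by simple arithmetic. The sketch below reproduces them; the helper names are hypothetical, and the check itself is only an illustration of a ±10% band, not the compliance script's logic.

```python
from typing import Tuple

REFERENCE_MEAN_OSL = 235.47  # weighted mean output length quoted in the commit


def token_bounds(reference_mean: float, tolerance: float = 0.10) -> Tuple[float, float]:
    """Lower and upper bounds at (1 - tolerance) and (1 + tolerance) of the reference."""
    lower = round(reference_mean * (1 - tolerance), 2)
    upper = round(reference_mean * (1 + tolerance), 2)
    return lower, upper


def within_bounds(mean_output_tokens: float, reference_mean: float = REFERENCE_MEAN_OSL) -> bool:
    """True if a run's mean output length falls inside the allowed band."""
    lower, upper = token_bounds(reference_mean)
    return lower <= mean_output_tokens <= upper


min_output_tokens, max_output_tokens = token_bounds(REFERENCE_MEAN_OSL)
```

With the reference mean of 235.47 tokens, the bounds come out to 211.92 and 259.02, matching the values listed for audit.config.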

Both e2e-rag and e2e-datasetup workloads now sort query_samples by index before submitting to thread pools. This ensures consistent processing order across runs despite loadgen's sample shuffling.

Benefits:

- Reduces run-to-run performance variation
- Maintains deterministic database construction (datasetup)
- Improves reproducibility for compliance testing

The sorting overhead is negligible (O(n log n) for batch size n) and parallel processing is fully preserved.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
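The sort-then-submit idea can be sketched as follows. `QuerySample` here is a stand-in dataclass with `id` and `index` fields, and `process_in_index_order` and `handler` are hypothetical names; the actual workload code may be organized differently.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class QuerySample:
    id: int     # response id assigned by the load generator
    index: int  # position of the sample in the dataset


def process_in_index_order(
    samples: List[QuerySample],
    handler: Callable[[QuerySample], str],
    max_workers: int = 4,
) -> Dict[int, str]:
    """Submit samples to the pool sorted by dataset index; return results by id."""
    ordered = sorted(samples, key=lambda sample: sample.index)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submission order is deterministic even if the incoming batch was shuffled.
        futures = [(sample, pool.submit(handler, sample)) for sample in ordered]
        return {sample.id: future.result() for sample, future in futures}
```

Only the submission order is fixed; with more than one worker the tasks still run concurrently, so completion order can vary between runs.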

- Add scripts/download_dataset_and_models.sh for one-command download
- Create MLCOMMONS_ASSETS.md with detailed model/dataset information
- Simplify README.md with streamlined workflow and MLCommons downloads
- Remove alternative download methods (HuggingFace, etc.)
- Update Prerequisites section with MLCommons storage instructions
- Add Environment Setup section with Docker instructions
- Clean up Table of Contents to match README structure

All assets now downloaded from MLCommons storage for reliability:

- FRAMES dataset (~674KB)
- e5-base-v2 embedding model (~2.2GB)
- ColBERTv2.0 reranker (~1.4GB)
- GPT-OSS-120B model (~196GB)
- GPT-OSS-20B model (~83GB)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

mlcommons-bot and others added 30 commits December 20, 2024 22:46

[Automated Commit] Format Codebase

9b59d2e

Merge branch 'mlcommons:master' into master

fe51c12

[Automated Commit] Format Codebase

1f2666c

Add download_pdf.py

ec84225

Add artefacts downloading scripts

a77eb4d

Fix downloading script

9f0db17

[Automated Commit] Format Codebase

b4dcd57

cleanup read_pdf.py

e0f28b2

add env setup files

d301bff

add single-shot retrieval

daa07f8

renmae setup file

23aaea6

Add README

a15bc6d

name fix

fb5f157

Add reranker

94b11d1

Add README.md

9275aaa

Save vector db

c5c7213

Support Intel XPU for embedding and reranker model

5142a06

Separate tok_k for retrieval and reranking

9f5d8c5

save url mapping in passage for evaluation

41cd5c4

Implement evaluation

7b9390f

- Each query is scored 0 to 1 depending on the number of correct links
- Final score is averaged

Support bm25. RagDB is a superclass of vectordb and bm25db

142d41d

Fix a bug in scoring; code cleanup

d1ed96d

Add BM25 params

73d1350

Change url_mapping to exclude file extension to support both pdf and txt

9ab82f9

Support txt file to ingest: --passages renamed to --ingest

beeb74f

- Support no-save feature
- Add more bm25 params
- Refactor single shot script to consolidate vector and bm25 db

Add stemmer to bm25

1f70b6c

Change default bm25 backend to numba (faster)

Add missing ingest function for vectordb

94a46cb

Add metrics recall, precision, F1, MAP (Mean Average Precision)

1140a8a

add no-rerank option to compare vector and bm25 method

implment retrieval strategies (top_p, relative, elbow, ...)

4fd651c

Simplified evaluation logic

rename evaluate() to evaluate_retrieval()

140add3

hans-intel and others added 20 commits May 22, 2026 12:52

Add per-process GPU index allocator

b701630

- Looks for idle GPUs on which to load the embedding and reranker models
- Set INFERENCE_EMBEDDING_GPU_DEVICES or INFERENCE_RERANKER_GPU_DEVICES to override the selection
- Fixed a bug where the embedding index and the GPU index were assumed to be the same (GPU indices can start from a non-zero value)

add cross-system DB verification scripts

88df0be

Decouple endpoint routing from OpenRouter

ff80999

Removed --llm_service_url / --llm_model and added url endpoint and model for each component

Update readme

9527fdd

performance test simulated with cached result

e025c66

add a script that calculates prefix cache hit rate

99835eb

Fixes to download docs (with delay and proper URL formatting), WARN a…

32041c5

…bout OPENROUTER_API_KEY Signed-off-by: Rajesh Poornachandran <rajesh.poornachandran@amd.com>

added missing perf cache result file

ed2419b

Merge branch 'multi-vendor-support' of https://github.com/hans-intel/…

e811b74

…inference into perf_test_with_cached_output

performance metric added for variation test purpose

826c784

added loadgen integration

2ee4fd3

seperated endpoint parameters for different servers

5fbae0c

added indexing measurments kpi

9e0c1dd

passing the threads option from reference script

60f8e6d

fixed threading loadgen integration

16ab751

added datasetup KPI implementation

0edc531

Add Apache 2.0 license headers to all Python files

6635959

Added MLPerf copyright and Apache License 2.0 headers to all core Python modules for compliance with project licensing requirements. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

mkankana requested a review from a team as a code owner June 15, 2026 21:04

mkankana and others added 8 commits June 15, 2026 14:19

Merge remote-tracking branch 'upstream/master' into perf_test_with_ca…

be83c3f

…ched_output

loadgen integration for e2e-datasetup workload

e8ec8a6

E2E RAG reference implementation #2602
mkankana wants to merge 126 commits into mlcommons:master from hans-intel:perf_test_with_cached_output

5 participants
