E2E RAG reference implementation #2602

Open
mkankana wants to merge 126 commits into mlcommons:master from hans-intel:perf_test_with_cached_output

@mkankana

This PR introduces a comprehensive end-to-end Retrieval-Augmented Generation (RAG) benchmark for evaluating multi-hop question answering systems on the FRAMES dataset. The benchmark measures both retrieval accuracy and answer quality for complex questions requiring information synthesis from multiple Wikipedia documents.

Multi-Shot Iterative Retrieval

Novel multi-hop retrieval pipeline:

  1. Initial retrieval with user query
  2. LLM evaluates document relevance
  3. If insufficient, LLM generates focused sub-queries
  4. Retrieves additional documents per sub-query
  5. Iterates until sufficient information or max iterations reached
  6. Final answer generation from accumulated context

This approach achieves 29-32% LLM-judge accuracy on the full 824-query FRAMES dataset, compared with 20% for single-shot retrieval.
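
The six steps above can be sketched as a small control loop. This is a minimal illustration, not the PR's implementation: the function name `multi_shot_retrieve`, the callback signatures, the default of three iterations, and the toy corpus are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

def multi_shot_retrieve(
    query: str,
    retrieve: Callable[[str], List[str]],
    assess: Callable[[str, List[str]], Tuple[bool, List[str]]],
    generate: Callable[[str, List[str]], str],
    max_iterations: int = 3,  # hypothetical default
) -> Tuple[str, List[str]]:
    """Iteratively gather documents, then answer from the accumulated context."""
    context: List[str] = []

    def add(docs: List[str]) -> None:
        # Keep first-seen order and skip documents already in the context.
        for doc in docs:
            if doc not in context:
                context.append(doc)

    add(retrieve(query))  # step 1: initial retrieval with the user query
    for _ in range(max_iterations):  # step 5: bounded iteration
        # steps 2-3: the LLM judges sufficiency and proposes sub-queries
        sufficient, sub_queries = assess(query, context)
        if sufficient or not sub_queries:
            break
        for sub_query in sub_queries:  # step 4: retrieve per sub-query
            add(retrieve(sub_query))
    return generate(query, context), context  # step 6: final answer

# Toy stand-ins for the retriever and the LLM calls (illustrative only).
corpus = {
    "who directed film X": ["doc_film"],
    "birth year of director Y": ["doc_director"],
}

def toy_retrieve(q: str) -> List[str]:
    return corpus.get(q, [])

def toy_assess(q: str, ctx: List[str]) -> Tuple[bool, List[str]]:
    if "doc_director" in ctx:
        return True, []
    return False, ["birth year of director Y"]

def toy_generate(q: str, ctx: List[str]) -> str:
    return f"answer from {len(ctx)} docs"

answer, used_docs = multi_shot_retrieve(
    "who directed film X", toy_retrieve, toy_assess, toy_generate
)
```

Passing the retriever and the LLM calls in as callbacks keeps the loop testable without a vector database or a model server.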

Architecture

Pipeline Flow

Documents → Chunking → Embedding → Vector DB → Retrieval → Reranking → LLM → Answer

## Benchmark Results
Full FRAMES dataset (824 queries):
| Approach | Precision@N | Recall@N | F1@N | LLM Judge Accuracy |
|---|---|---|---|---|
| Oracle (ground truth docs) | 100% | 100% | 100% | 68% |
| Single-shot retrieval | 39% | 70% | 42% | 20% |
| **Multi-shot iterative** | **72%** | **67%** | **66%** | **29-32%** |
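
For readers unfamiliar with the retrieval columns, the following sketch shows one common way to compute per-query precision, recall, and F1 over retrieved document links and then average them across queries. It assumes set-based matching and per-query (macro) averaging; the PR does not spell out its exact formula, and the helper names are hypothetical. Under macro averaging, the reported F1 need not equal the harmonic mean of the averaged precision and recall.

```python
from typing import Iterable, List, Tuple

def precision_recall_f1(
    retrieved: Iterable[str], relevant: Iterable[str]
) -> Tuple[float, float, float]:
    """Score one query by set overlap between retrieved and ground-truth links."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def macro_average(
    per_query: List[Tuple[float, float, float]]
) -> Tuple[float, float, float]:
    """Average each metric over queries (assumed aggregation)."""
    n = len(per_query)
    return (
        sum(s[0] for s in per_query) / n,
        sum(s[1] for s in per_query) / n,
        sum(s[2] for s in per_query) / n,
    )

# Two illustrative queries with made-up document identifiers.
scores = [
    precision_recall_f1(["a", "b", "c", "d"], ["a", "b"]),  # P=0.5, R=1.0
    precision_recall_f1(["x"], ["x", "y"]),                 # P=1.0, R=0.5
]
avg_precision, avg_recall, avg_f1 = macro_average(scores)
```

In this toy case the averaged precision and recall are both 0.75, while the averaged F1 is about 0.667, which illustrates why the F1 column cannot be derived from the other two columns.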

mlcommons-bot and others added 30 commits December 20, 2024 22:46
- Each query is scored from 0 to 1 depending on the number of correct links
- The final score is the average over all queries
Support no-save feature
Add more bm25 params
Refactor single shot script to consolidate vector and bm25 db
Change default bm25 backend to numba (faster)
add no-rerank option to compare vector and bm25 method
hans-intel and others added 20 commits May 22, 2026 12:52
- Look for idle GPUs on which to load the embedding and reranker models
- INFERENCE_EMBEDDING_GPU_DEVICES and INFERENCE_RERANKER_GPU_DEVICES override the automatic selection
- Fixed a bug where the embedding index and the GPU index were treated as the same (GPU indices can start from a non-zero value)
- INFERENCE_RERANKER_NUMA_NODE: pin the reranker child process to NUMA node N
- INFERENCE_RERANKER_OMP_NUM_THREADS: override the reranker's OMP thread count
- INFERENCE_EMBEDDING_NUMA_NODES: CSV list of NUMA nodes (one per --num_embedding_devices)
- INFERENCE_EMBEDDING_OMP_NUM_THREADS: cap on OMP threads per worker (default = even split)
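
A rough sketch of how the two embedding variables above might be read. The variable names come from the commit message; the helper names, the validation rule, and the reading of "even split" as total cores divided by the number of devices are assumptions, not the PR's code.

```python
from typing import List, Mapping, Optional

def embedding_numa_nodes(env: Mapping[str, str], num_devices: int) -> Optional[List[int]]:
    """Parse the CSV of NUMA nodes; None means no pinning was requested."""
    raw = env.get("INFERENCE_EMBEDDING_NUMA_NODES", "").strip()
    if not raw:
        return None
    nodes = [int(part) for part in raw.split(",") if part.strip()]
    if len(nodes) != num_devices:
        # Assumed rule: exactly one NUMA node per embedding device.
        raise ValueError(
            f"expected {num_devices} NUMA nodes, got {len(nodes)}: {raw!r}"
        )
    return nodes

def embedding_omp_threads(
    env: Mapping[str, str], num_devices: int, total_cores: int
) -> int:
    """Return the per-worker OMP thread count, defaulting to an even split."""
    raw = env.get("INFERENCE_EMBEDDING_OMP_NUM_THREADS", "").strip()
    if raw:
        return int(raw)
    return max(1, total_cores // num_devices)

# Example with a plain dict standing in for os.environ.
example_env = {"INFERENCE_EMBEDDING_NUMA_NODES": "0,1"}
nodes = embedding_numa_nodes(example_env, num_devices=2)
threads = embedding_omp_threads(example_env, num_devices=2, total_cores=64)
```

Accepting any mapping rather than reading `os.environ` directly makes the parsing easy to exercise in isolation; in real use one would pass `os.environ`.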
Removed --llm_service_url / --llm_model and added a URL endpoint and model option for each component
…bout OPENROUTER_API_KEY

Signed-off-by: Rajesh Poornachandran <rajesh.poornachandran@amd.com>
…reading

- Mask all OpenRouter API keys (security)
- Default to local vLLM server (http://127.0.0.1:8123)
- Add max_async_queries=10 for concurrent processing
- Fix cache key to use stable sample_id
- Update run_multi_shot.sh and evaluate.py to use local vLLM
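
To illustrate two items from this commit, the sketch below bounds concurrency with a semaphore sized by `max_async_queries` and derives a cache key from the sample's stable `sample_id`. Everything apart from those two names is assumed for the example: the function names, the SHA-256 key format, and the fake LLM call are hypothetical, and the PR's actual cache layout is not shown in the source.

```python
import asyncio
import hashlib
from typing import Awaitable, Callable, Dict, List, Optional

def cache_key(sample_id: str, model: str) -> str:
    """Build a key that is identical across processes and runs."""
    return hashlib.sha256(f"{model}\x00{sample_id}".encode("utf-8")).hexdigest()

async def answer_all(
    sample_ids: List[str],
    answer_fn: Callable[[str], Awaitable[str]],
    model: str = "llama",
    max_async_queries: int = 10,
    cache: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Answer every sample with at most max_async_queries requests in flight."""
    cache = {} if cache is None else cache
    semaphore = asyncio.Semaphore(max_async_queries)

    async def answer_one(sample_id: str) -> str:
        key = cache_key(sample_id, model)
        if key in cache:
            return cache[key]
        async with semaphore:  # blocks while the limit is reached
            cache[key] = await answer_fn(sample_id)
        return cache[key]

    # gather preserves input order in its result list
    return await asyncio.gather(*(answer_one(s) for s in sample_ids))

# Fake LLM call that records how many requests overlap (illustrative only).
in_flight = 0
peak_in_flight = 0

async def fake_llm(sample_id: str) -> str:
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return f"answer-{sample_id}"

results = asyncio.run(answer_all([f"q{i}" for i in range(25)], fake_llm))
```

A digest from `hashlib` is used here because Python's built-in `hash()` for strings varies between processes, which would defeat a cache that is reused across runs.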
Added MLPerf copyright and Apache License 2.0 headers to all core
Python modules for compliance with project licensing requirements.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Updated .gitignore to exclude data artifacts, cache files, and generated outputs
- Removed WIP language and internal references from README
- Replaced hardcoded paths with generic defaults (intfloat/e5-base-v2, "llama")
- Added Apache 2.0 license headers to remaining Python files

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
mkankana requested a review from a team as a code owner June 15, 2026 21:04
@github-actions

github-actions Bot commented Jun 15, 2026

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

mkankana and others added 8 commits June 15, 2026 14:19
- Add --judge_service_url and --judge_model CLI arguments to accuracy_eval.py
- Update reference_mlperf_accuracy.sh to use Llama-3.1-8B-Instruct as default judge
- Pass judge configuration through reference_mlperf.py to accuracy_eval.py
- Allow override via JUDGE_SERVICE_URL and JUDGE_MODEL environment variables

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
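
One plausible way to wire the judge options described in this commit is an `argparse` parser whose defaults come from the environment. The flag and variable names are taken from the commit message; the precedence (command line, then environment, then built-in default), the function name `build_judge_parser`, and the fallback URL (borrowed from the local vLLM default mentioned elsewhere in this PR) are assumptions.

```python
import argparse
import os
from typing import Mapping

# Assumed fallbacks; the real defaults live in the PR's scripts.
DEFAULT_JUDGE_URL = "http://127.0.0.1:8123"
DEFAULT_JUDGE_MODEL = "Llama-3.1-8B-Instruct"

def build_judge_parser(env: Mapping[str, str] = os.environ) -> argparse.ArgumentParser:
    """Create a parser whose defaults can be overridden by environment variables."""
    parser = argparse.ArgumentParser(description="Accuracy evaluation (sketch)")
    parser.add_argument(
        "--judge_service_url",
        default=env.get("JUDGE_SERVICE_URL", DEFAULT_JUDGE_URL),
        help="Endpoint of the LLM judge service",
    )
    parser.add_argument(
        "--judge_model",
        default=env.get("JUDGE_MODEL", DEFAULT_JUDGE_MODEL),
        help="Model name used by the LLM judge",
    )
    return parser

# An explicit flag wins over the environment, which wins over the default.
args = build_judge_parser({"JUDGE_MODEL": "env-model"}).parse_args(
    ["--judge_service_url", "http://judge.example:9000"]
)
```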
- Make parsing and chunking mandatory (removed indexing-only mode)
- Set default database name to vector_html_hnsw_len768_ov32_word to match MLPerf expectations
- Set default manifest file to scripts/db_manifest_intel_xpu.json.gz
- Move database validation after save (exclude from performance measurement)
- Fix metric field name mismatches (data_setup_time_seconds, throughput_passages_per_second)
- Suppress confusing benchmark summary from read_docs.py
- Exclude DB initialization time from KPI (only measure parsing+chunking+indexing+save)
- Add temp_complete_kpi_*.json to .gitignore

Performance measurement now correctly includes:
  1. HTML parsing (BeautifulSoup) + text extraction
  2. Passage chunking (fixed-length splitting)
  3. Embedding generation + FAISS indexing
  4. Database serialization

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
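
The measurement boundary described in this commit can be illustrated with a small timing helper. The two KPI field names come from the commit message; the stage names, the `StageTimer` class, and `build_kpi` are hypothetical and only show how initialization and validation might be left out of the measured total.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, Mapping, Sequence

class StageTimer:
    """Accumulate wall-clock seconds per named stage."""

    def __init__(self) -> None:
        self.seconds: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

# Assumed stage names for the four measured phases.
MEASURED_STAGES = ("parsing", "chunking", "indexing", "save")

def build_kpi(
    stage_seconds: Mapping[str, float],
    num_passages: int,
    measured: Sequence[str] = MEASURED_STAGES,
) -> Dict[str, float]:
    """Sum only the measured stages; other stages are timed but excluded."""
    total = sum(stage_seconds.get(name, 0.0) for name in measured)
    return {
        "data_setup_time_seconds": total,
        "throughput_passages_per_second": num_passages / total if total > 0 else 0.0,
    }

# Usage: time real work with the context manager, then build the KPI.
timer = StageTimer()
with timer.stage("db_init"):      # excluded from the KPI
    pass
with timer.stage("parsing"):      # included in the KPI
    sum(range(1000))
kpi = build_kpi(timer.seconds, num_passages=10)

# Deterministic example with made-up timings.
example_kpi = build_kpi(
    {"db_init": 5.0, "parsing": 2.0, "chunking": 1.0,
     "indexing": 6.0, "save": 1.0, "validation": 3.0},
    num_passages=1000,
)
```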
Implements the automated TEST09 compliance test, which verifies output token
length in performance mode to prevent gaming the benchmark by truncating outputs.

Files added:
- compliance/TEST09/e2e-rag/audit.config: LoadGen config with thresholds
  * min_output_tokens: 211.92 (90% of reference)
  * max_output_tokens: 259.02 (110% of reference)
  * reference mean: 235.47 tokens from 5 production runs

- compliance/TEST09/e2e-rag/README.md: Usage instructions

- run_compliance_test09.sh: Fully automated test runner
  * Copies audit.config to working directory
  * Runs performance test with compliance logging
  * Verifies output token length thresholds
  * Generates submission artifacts
  * Cleans up automatically

- ISL_OSL_statistics.txt: Reference data for threshold calculation
  * answer_generator OSL from 5 runs (4021 total samples)
  * Weighted mean: 235.47 tokens

Usage: bash run_compliance_test09.sh

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
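
The threshold arithmetic in this commit can be reproduced directly: the bounds are the reference mean of 235.47 tokens scaled by 90% and 110%. The sketch below assumes the check compares the mean output length of a run against those bounds; the function names and the per-run input format of `weighted_mean` are hypothetical.

```python
from typing import List, Sequence, Tuple

REFERENCE_MEAN_TOKENS = 235.47  # from ISL_OSL_statistics.txt per the commit message

def token_length_bounds(reference_mean: float, tolerance: float = 0.10) -> Tuple[float, float]:
    """Return (min, max) output-token thresholds around the reference mean."""
    low = round(reference_mean * (1 - tolerance), 2)
    high = round(reference_mean * (1 + tolerance), 2)
    return low, high

def weighted_mean(run_stats: Sequence[Tuple[int, float]]) -> float:
    """Combine per-run (num_samples, mean_tokens) pairs into one mean."""
    total_samples = sum(n for n, _ in run_stats)
    return sum(n * mean for n, mean in run_stats) / total_samples

def passes_test09(output_token_counts: List[int], low: float, high: float) -> bool:
    """Assumed check: the run's mean output length must fall within the bounds."""
    mean_tokens = sum(output_token_counts) / len(output_token_counts)
    return low <= mean_tokens <= high

min_output_tokens, max_output_tokens = token_length_bounds(REFERENCE_MEAN_TOKENS)
```

With the reference mean from the commit message, this yields the same 211.92 and 259.02 values listed for `audit.config`.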
Both e2e-rag and e2e-datasetup workloads now sort query_samples by
index before submitting to thread pools. This ensures consistent
processing order across runs despite loadgen's sample shuffling.

Benefits:
- Reduces run-to-run performance variation
- Maintains deterministic database construction (datasetup)
- Improves reproducibility for compliance testing

The sorting overhead is negligible (O(n log n) for batch size n)
and parallel processing is fully preserved.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
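
A minimal sketch of the ordering change described in this commit: samples are sorted by index before being handed to a thread pool, so submission order is the same on every run while work still executes in parallel. The `QuerySample` dataclass is a stand-in with the `id` and `index` fields of a LoadGen query sample; `issue_queries`, the worker count, and the toy `process` function are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class QuerySample:
    """Stand-in for a LoadGen query sample (response id plus dataset index)."""
    id: int
    index: int

def issue_queries(
    query_samples: List[QuerySample],
    process: Callable[[QuerySample], str],
    max_workers: int = 4,  # hypothetical pool size
) -> List[Tuple[int, str]]:
    """Submit samples in index order and return (id, result) pairs in that order."""
    ordered = sorted(query_samples, key=lambda s: s.index)  # O(n log n)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [(s, pool.submit(process, s)) for s in ordered]
        return [(s.id, f.result()) for s, f in futures]

# Shuffled input, as LoadGen might deliver it.
shuffled = [QuerySample(id=100, index=7), QuerySample(id=101, index=2),
            QuerySample(id=102, index=5)]
responses = issue_queries(shuffled, lambda s: f"result-{s.index}")
```

Sorting affects only the order of submission; each response is still paired with its sample's `id`, so results can be reported back against the right query.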
- Add scripts/download_dataset_and_models.sh for one-command download
- Create MLCOMMONS_ASSETS.md with detailed model/dataset information
- Simplify README.md with streamlined workflow and MLCommons downloads
- Remove alternative download methods (HuggingFace, etc.)
- Update Prerequisites section with MLCommons storage instructions
- Add Environment Setup section with Docker instructions
- Clean up Table of Contents to match README structure

All assets are now downloaded from MLCommons storage for reliability:
- FRAMES dataset (~674KB)
- e5-base-v2 embedding model (~2.2GB)
- ColBERTv2.0 reranker (~1.4GB)
- GPT-OSS-120B model (~196GB)
- GPT-OSS-20B model (~83GB)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>