E2E RAG reference implementation #2602
Open
mkankana wants to merge 126 commits into
- Each query is scored from 0 to 1 depending on the number of correct links
- The final score is the average of the per-query scores
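A minimal sketch of the scoring rule described above, assuming the per-query score is the fraction of reference links that appear among the retrieved links. The commit message does not spell out the exact formula, so `score_query` and `final_score` are hypothetical names and the normalization is an assumption.

```python
from typing import Iterable, Sequence


def score_query(retrieved_links: Iterable[str], gold_links: Iterable[str]) -> float:
    """Return a score in [0, 1]: the share of gold links that were retrieved.

    Assumption: the score is hits / number of gold links. The PR only states
    that the score depends on the number of correct links.
    """
    gold = set(gold_links)
    if not gold:
        return 0.0  # nothing to find, so no credit can be given
    hits = len(gold & set(retrieved_links))
    return hits / len(gold)


def final_score(per_query_scores: Sequence[float]) -> float:
    """Average the per-query scores into one benchmark score."""
    if not per_query_scores:
        return 0.0
    return sum(per_query_scores) / len(per_query_scores)
```

For example, a query with four reference links of which two are retrieved would score 0.5 under this assumption.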
- Support no-save feature
- Add more bm25 params
- Refactor single shot script to consolidate vector and bm25 db
Change default bm25 backend to numba (faster)
Add no-rerank option to compare vector and bm25 methods
Simplified evaluation logic
- Look for empty GPUs on which to load the embedding model and the reranker
- Set INFERENCE_EMBEDDING_GPU_DEVICES or INFERENCE_RERANKER_GPU_DEVICES to override the automatic selection
- Fixed a bug where the embedding index and the GPU index were treated as the same value (GPU indices can start from a non-zero value)
- INFERENCE_RERANKER_NUMA_NODE: pin the reranker child to NUMA node N
- INFERENCE_RERANKER_OMP_NUM_THREADS: override the reranker OMP threads
- INFERENCE_EMBEDDING_NUMA_NODES: CSV (one entry per --num_embedding_devices)
- INFERENCE_EMBEDDING_OMP_NUM_THREADS: cap per worker (default = even split)
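The embedding-side variables above could be resolved roughly as follows. This is a sketch under stated assumptions, not the PR's code: `embedding_worker_settings` and `total_cores` are hypothetical, the "even split" is taken to mean an integer division of the available cores across workers, and the OMP value is applied as given.

```python
import os
from typing import Mapping, Optional


def embedding_worker_settings(
    num_devices: int,
    total_cores: int,
    env: Optional[Mapping[str, str]] = None,
) -> list:
    """Resolve a NUMA node and OMP thread count for each embedding worker."""
    env = os.environ if env is None else env

    # INFERENCE_EMBEDDING_NUMA_NODES is a CSV with one entry per device.
    numa_csv = env.get("INFERENCE_EMBEDDING_NUMA_NODES", "")
    numa_nodes = [int(item) for item in numa_csv.split(",") if item.strip()]
    if numa_nodes and len(numa_nodes) != num_devices:
        raise ValueError("expected one NUMA node per embedding device")

    # Default: split the cores evenly (assumed to be integer division).
    even_split = max(1, total_cores // num_devices)
    override = env.get("INFERENCE_EMBEDDING_OMP_NUM_THREADS")
    threads = int(override) if override else even_split

    return [
        {
            "numa_node": numa_nodes[i] if numa_nodes else None,
            "omp_num_threads": threads,
        }
        for i in range(num_devices)
    ]
```

Passing `env` explicitly keeps the function easy to exercise without touching the real process environment.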
Removed --llm_service_url / --llm_model and added a URL endpoint and a model option for each component
…bout OPENROUTER_API_KEY Signed-off-by: Rajesh Poornachandran <rajesh.poornachandran@amd.com>
…inference into perf_test_with_cached_output
…reading
- Mask all OpenRouter API keys (security)
- Default to local vLLM server (http://127.0.0.1:8123)
- Add max_async_queries=10 for concurrent processing
- Fix cache key to use stable sample_id
- Update run_multi_shot.sh and evaluate.py to use local vLLM
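One common way to bound concurrency and to key a cache on a stable identifier is sketched below. It assumes an asyncio-based pipeline. `run_queries`, `cache_key`, and `handler` are hypothetical names, and only the `max_async_queries=10` default and the use of `sample_id` come from the commit message.

```python
import asyncio
import hashlib
from typing import Any, Awaitable, Callable, Sequence


def cache_key(sample_id: str, stage: str) -> str:
    """Derive a cache key from the stable sample_id rather than from
    anything that can change between runs (for example, list position)."""
    return hashlib.sha256(f"{stage}:{sample_id}".encode("utf-8")).hexdigest()


async def run_queries(
    samples: Sequence[Any],
    handler: Callable[[Any], Awaitable[Any]],
    max_async_queries: int = 10,
) -> list:
    """Process all samples, with at most max_async_queries in flight."""
    semaphore = asyncio.Semaphore(max_async_queries)

    async def guarded(sample: Any) -> Any:
        async with semaphore:
            return await handler(sample)

    # gather() returns results in the same order as the input samples.
    return await asyncio.gather(*(guarded(sample) for sample in samples))
```

A caller would run this with `asyncio.run(run_queries(samples, handler))`, where `handler` is whatever coroutine sends one query to the local server.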
Added MLPerf copyright and Apache License 2.0 headers to all core Python modules for compliance with project licensing requirements.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Updated .gitignore to exclude data artifacts, cache files, and generated outputs
- Removed WIP language and internal references from README
- Replaced hardcoded paths with generic defaults (intfloat/e5-base-v2, "llama")
- Added Apache 2.0 license headers to remaining Python files
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
- Add --judge_service_url and --judge_model CLI arguments to accuracy_eval.py
- Update reference_mlperf_accuracy.sh to use Llama-3.1-8B-Instruct as default judge
- Pass judge configuration through reference_mlperf.py to accuracy_eval.py
- Allow override via JUDGE_SERVICE_URL and JUDGE_MODEL environment variables
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
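The flag and environment-variable combination above might be wired up as in this sketch. The flag and variable names are from the commit message. `build_parser`, the default URL (borrowed from the local vLLM default mentioned earlier), and the precedence (CLI over environment over built-in default) are assumptions.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser(env: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    """Build a parser whose judge defaults can be overridden by environment
    variables, and whose flags override both."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Judge configuration (sketch)")
    parser.add_argument(
        "--judge_service_url",
        # Assumed default: the local vLLM endpoint named elsewhere in this PR.
        default=env.get("JUDGE_SERVICE_URL", "http://127.0.0.1:8123"),
        help="Endpoint of the judge LLM service",
    )
    parser.add_argument(
        "--judge_model",
        default=env.get("JUDGE_MODEL", "Llama-3.1-8B-Instruct"),
        help="Model name used by the judge",
    )
    return parser
```

Reading the environment when the parser is built, rather than after parsing, lets argparse apply the usual rule that an explicit flag wins over a default.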
- Make parsing and chunking mandatory (removed indexing-only mode)
- Set default database name to vector_html_hnsw_len768_ov32_word to match MLPerf expectations
- Set default manifest file to scripts/db_manifest_intel_xpu.json.gz
- Move database validation after save (exclude from performance measurement)
- Fix metric field name mismatches (data_setup_time_seconds, throughput_passages_per_second)
- Suppress confusing benchmark summary from read_docs.py
- Exclude DB initialization time from KPI (only measure parsing+chunking+indexing+save)
- Add temp_complete_kpi_*.json to .gitignore
Performance measurement now correctly includes:
1. HTML parsing (BeautifulSoup) + text extraction
2. Passage chunking (fixed-length splitting)
3. Embedding generation + FAISS indexing
4. Database serialization
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements automated TEST09 compliance test to verify output token length in performance mode, preventing output truncation cheating.
Files added:
- compliance/TEST09/e2e-rag/audit.config: LoadGen config with thresholds
  * min_output_tokens: 211.92 (90% of reference)
  * max_output_tokens: 259.02 (110% of reference)
  * reference mean: 235.47 tokens from 5 production runs
- compliance/TEST09/e2e-rag/README.md: Usage instructions
- run_compliance_test09.sh: Fully automated test runner
  * Copies audit.config to working directory
  * Runs performance test with compliance logging
  * Verifies output token length thresholds
  * Generates submission artifacts
  * Cleans up automatically
- ISL_OSL_statistics.txt: Reference data for threshold calculation
  * answer_generator OSL from 5 runs (4021 total samples)
  * Weighted mean: 235.47 tokens
Usage: bash run_compliance_test09.sh
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
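The threshold arithmetic above can be reproduced with a few lines. This is an illustrative sketch: `weighted_mean`, `token_thresholds`, and `within_thresholds` are hypothetical helpers, and the per-run sample counts and means are not listed in this PR text, so only the published reference mean of 235.47 is used.

```python
from typing import Iterable, Tuple


def weighted_mean(run_stats: Iterable[Tuple[int, float]]) -> float:
    """Combine per-run (sample_count, mean_tokens) pairs into one mean,
    weighting each run by its sample count."""
    stats = list(run_stats)
    total = sum(count for count, _ in stats)
    if total == 0:
        raise ValueError("no samples")
    return sum(count * mean for count, mean in stats) / total


def token_thresholds(reference_mean: float, tolerance: float = 0.10) -> Tuple[float, float]:
    """Return (min_output_tokens, max_output_tokens) as 90% and 110% of the
    reference mean, rounded to two decimals as in audit.config."""
    return (
        round(reference_mean * (1 - tolerance), 2),
        round(reference_mean * (1 + tolerance), 2),
    )


def within_thresholds(observed_mean: float, reference_mean: float) -> bool:
    """Check whether an observed mean output length passes the test."""
    low, high = token_thresholds(reference_mean)
    return low <= observed_mean <= high


print(token_thresholds(235.47))  # (211.92, 259.02)
```

The printed pair matches the `min_output_tokens` and `max_output_tokens` values listed for audit.config.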
Both e2e-rag and e2e-datasetup workloads now sort query_samples by index before submitting to thread pools. This ensures consistent processing order across runs despite loadgen's sample shuffling.
Benefits:
- Reduces run-to-run performance variation
- Maintains deterministic database construction (datasetup)
- Improves reproducibility for compliance testing
The sorting overhead is negligible (O(n log n) for batch size n) and parallel processing is fully preserved.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
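The sort-then-submit pattern described above might look like the sketch below. The `QuerySample` dataclass is a stand-in with the `id` and `index` fields that LoadGen query samples carry. `issue_queries` and `process_sample` are hypothetical names, not the PR's functions.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass(frozen=True)
class QuerySample:
    """Stand-in for a LoadGen query sample (assumed fields: id, index)."""
    id: int
    index: int


def issue_queries(
    query_samples: Sequence[QuerySample],
    process_sample: Callable[[QuerySample], Any],
    max_workers: int = 4,
) -> list:
    """Sort samples by dataset index, then process them in parallel."""
    # Sorting fixes the submission order regardless of how the batch was shuffled.
    ordered = sorted(query_samples, key=lambda sample: sample.index)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() submits in sorted order and yields results in that same order,
        # while the work itself still runs concurrently.
        return list(pool.map(process_sample, ordered))
```

Only the submission order becomes deterministic here. Completion order inside the pool can still vary between runs.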
- Add scripts/download_dataset_and_models.sh for one-command download
- Create MLCOMMONS_ASSETS.md with detailed model/dataset information
- Simplify README.md with streamlined workflow and MLCommons downloads
- Remove alternative download methods (HuggingFace, etc.)
- Update Prerequisites section with MLCommons storage instructions
- Add Environment Setup section with Docker instructions
- Clean up Table of Contents to match README structure
All assets now downloaded from MLCommons storage for reliability:
- FRAMES dataset (~674KB)
- e5-base-v2 embedding model (~2.2GB)
- ColBERTv2.0 reranker (~1.4GB)
- GPT-OSS-120B model (~196GB)
- GPT-OSS-20B model (~83GB)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This PR introduces a comprehensive end-to-end Retrieval-Augmented Generation (RAG) benchmark for evaluating multi-hop question answering systems on the FRAMES dataset. The benchmark measures both retrieval accuracy and answer quality for complex questions requiring information synthesis from multiple Wikipedia documents.
Multi-Shot Iterative Retrieval
Novel multi-hop retrieval pipeline:
This approach achieves 29-32% on the full 824-query FRAMES dataset, significantly outperforming single-shot retrieval (20%).
Architecture
Pipeline Flow