A specialized Small Language Model (SLM) for answering questions about experimentation methodology, statistical analysis, and A/B testing best practices.
Current Focus: Fine-tuning Llama 3.2 3B with QLoRA on experimentation Q&A data.
Future: Graph-based Retrieval Augmented Generation (GraphRAG) for enhanced multi-hop reasoning.
Phase I: Model Training (In Progress)
- ✅ Data collection infrastructure (Cross Validated, ArXiv)
- ✅ Synthetic Q&A generation with Claude API
- ✅ QLoRA training infrastructure with HuggingFace TRL
- 🚧 Generating ~860 synthetic Q&A pairs from 172 ArXiv papers
- ⏳ Fine-tune Llama 3.2 3B on 5,000-10,000 training examples
- ⏳ Build evaluation framework
- ⏳ Deploy inference API
Phase II: GraphRAG Enhancement (Planned)
- Knowledge graph construction
- SurrealDB integration
- Graph-enhanced retrieval
I'm leveraging Claude and other LLM tools to reach a proof of concept faster, while using my software engineering skills to direct, fix, and modify the project code. The code should be readable and follow best practices for security and performance.
Build a domain-specific AI assistant that provides expert-level guidance on:
- Statistical Concepts: Confidence intervals, p-values, statistical significance, power analysis
- Experimentation Design: Sample size calculations, test duration, metric selection
- Analysis Methods: Hypothesis testing, variance reduction, multiple testing corrections
- Best Practices: Our internal experimentation processes and methodologies
- Entities & Relationships: Statistical concepts are interconnected (e.g., "power analysis" relates to "sample size", "effect size", "significance level")
- Multi-hop Reasoning: Questions like "How does variance affect my sample size calculation?" require traversing concept relationships
- Context Preservation: Graph structure maintains the semantic relationships between experimentation concepts
- Better Retrieval: Find relevant information through concept graphs, not just keyword matching
Why a Small Language Model:
- Cost Efficiency: 10-100x cheaper than GPT-4 for high-volume queries
- Speed: Sub-second responses for real-time guidance
- Privacy: All data stays in-house (important for proprietary methodologies)
- Specialization: Focused training on statistics/experimentation = better domain performance
Why SurrealDB:
- Native Graph + Document Store: Store both structured relationships and unstructured content
- GraphQL Support: Flexible querying for graph traversal
- Scalable: Handles growing knowledge base as we add experiments
- SQL-like Syntax: Easy to work with, familiar to data scientists
```
┌───────────────────────────────────────────────────────────┐
│                       User Question                       │
│   "What sample size do I need for 5% MDE at 80% power?"   │
└─────────────────────────────┬─────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│                 Query Understanding Layer                 │
│  - Extract entities (sample size, MDE, power)             │
│  - Identify question type (calculation vs explanation)    │
└─────────────────────────────┬─────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│                     GraphRAG Pipeline                     │
│  ┌─────────────────────────────────────────────────────┐  │
│  │ 1. Entity Extraction                                │  │
│  │    - Identify concepts in query                     │  │
│  └──────────────────────────┬──────────────────────────┘  │
│  ┌──────────────────────────▼──────────────────────────┐  │
│  │ 2. Graph Traversal (SurrealDB)                      │  │
│  │    - Find related concepts                          │  │
│  │    - Retrieve connected documentation               │  │
│  └──────────────────────────┬──────────────────────────┘  │
│  ┌──────────────────────────▼──────────────────────────┐  │
│  │ 3. Context Assembly                                 │  │
│  │    - Rank by relevance                              │  │
│  │    - Build knowledge subgraph                       │  │
│  └──────────────────────────┬──────────────────────────┘  │
└─────────────────────────────┼─────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│                Small Language Model (3-7B)                │
│  - Phi-3 / Llama 3.2 / Qwen2.5                            │
│  - Fine-tuned on statistics/experimentation               │
│  - Generates answer from retrieved context                │
└─────────────────────────────┬─────────────────────────────┘
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│                    Response Formatter                     │
│  - Answer + reasoning                                     │
│  - Source citations (graph nodes)                         │
│  - Related concepts to explore                            │
└───────────────────────────────────────────────────────────┘
```
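To make the example question above concrete, here is one way its answer can be computed. This is a minimal sketch using statsmodels; the 10% baseline conversion rate and the reading of "5% MDE" as a relative lift are assumptions, not values from the project.

```python
# Minimal sketch: sample size for a two-proportion test at 80% power.
# Assumptions: 10% baseline conversion rate, "5% MDE" read as a relative lift.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10                              # assumed baseline rate
mde = 0.05                                   # 5% relative lift: 10.0% -> 10.5%
effect = proportion_effectsize(baseline, baseline * (1 + mde))  # Cohen's h

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,   # two-sided significance level
    power=0.80,
    ratio=1.0,    # equal-sized control and treatment arms
)
print(f"~{n_per_arm:,.0f} users per arm")    # ~28,800 under these assumptions
```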
Rationale: Start with fine-tuning to establish baseline performance before building GraphRAG infrastructure.
Goal: Gather training data from public sources
- Cross Validated (StackExchange) (see the collection sketch below)
  - Real Q&A about experimentation and statistics
  - Tags: a-b-testing, experimental-design, sample-size, statistical-power, etc.
  - ~100-1,000 question-answer pairs
- ArXiv Papers
  - Research papers on experimentation methodology
  - ~172 papers collected
  - Metadata and abstracts
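A minimal sketch of pulling tagged questions from the StackExchange API (Cross Validated's site slug is `stats`). The `fetch_questions` helper is hypothetical and not necessarily how the repo's collector is structured.

```python
# Minimal sketch of fetching tagged Q&A from the public StackExchange API.
import requests

def fetch_questions(tag: str, page_size: int = 50) -> list[dict]:
    """Fetch top-voted Cross Validated questions for one tag."""
    resp = requests.get(
        "https://api.stackexchange.com/2.3/questions",
        params={
            "site": "stats",       # Cross Validated
            "tagged": tag,
            "sort": "votes",
            "order": "desc",
            "pagesize": page_size,
            "filter": "withbody",  # include question bodies in the response
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]

questions = fetch_questions("a-b-testing")
```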
Goal: Generate training data from ArXiv papers using the Claude API
- Claude API Integration
  - Rate-limited wrapper with retry logic
  - Cost estimation and tracking
- Q&A Generation from Abstracts
  - 5 diverse Q&A pairs per paper
  - Question types: definition, methodology, comparison, best practice
  - ~860 synthetic Q&A pairs from 172 papers
- Quality Validation
  - Length validation (50-5,000 characters)
  - Content quality checks
  - ChatML formatting for Llama 3.2 (see the sketch below)
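A minimal sketch of the validate-and-format step. Rather than hand-writing chat tags, `apply_chat_template` renders whatever template the model's tokenizer ships with; applying the 50-5,000 character bound per field is an assumption.

```python
# Minimal sketch: validate a Q&A pair and render it with the chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

def is_valid(question: str, answer: str) -> bool:
    """Length validation; the 50-5,000 character bounds come from above."""
    return all(50 <= len(text) <= 5000 for text in (question, answer))

def format_example(question: str, answer: str) -> str:
    """Render one training example using the tokenizer's chat template."""
    messages = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)
```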
Goal: Fine-tune Llama 3.2 3B with QLoRA
- Training Infrastructure (see the sketch below)
  - QLoRA configuration (4-bit quantization + LoRA adapters)
  - HuggingFace TRL SFTTrainer
  - Dataset combination and preprocessing
- Training Execution
  - Combine the Cross Validated and ArXiv synthetic datasets
  - Train for 3 epochs on 5,000-10,000 examples
  - Evaluate on a holdout test set
- Model Artifacts
  - Save LoRA adapters
  - Export merged model
  - Document hyperparameters
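A condensed sketch of the QLoRA setup described above. The hyperparameters (r, alpha, target modules) and dataset path are illustrative assumptions, and TRL's API shifts between versions, so treat this as orientation rather than the project's actual training script.

```python
# Condensed QLoRA sketch: 4-bit base model + LoRA adapters + TRL's SFTTrainer.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"

# 4-bit NF4 quantization keeps the 3B base model within ~6 GB of VRAM.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, quantization_config=bnb)

# LoRA adapters on the attention projections (assumed hyperparameters).
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

dataset = load_dataset("json", data_files="data/train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora,
    args=SFTConfig(output_dir="checkpoints", num_train_epochs=3),
)
trainer.train()
trainer.save_model("checkpoints/final")  # writes the LoRA adapters
```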
Goal: Measure model quality
- Test Set Creation
  - 100 curated questions covering:
    - Definitions (20%)
    - Calculations (30%)
    - Methodology (30%)
    - Best practices (20%)
- Metrics (see the ROUGE sketch below)
  - Accuracy on classification questions
  - ROUGE/BLEU for generation quality
  - Human evaluation on critical questions
- Target Performance
  - 80%+ accuracy overall
  - 85%+ on definition questions
  - 75%+ on methodology questions
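For the generation-quality metrics, ROUGE is straightforward to compute with HuggingFace's `evaluate` library. A minimal sketch; the prediction and reference strings are made up for illustration.

```python
# Minimal sketch of ROUGE scoring with HuggingFace's `evaluate` library.
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["Statistical power is the probability of detecting a true effect."],
    references=["Power is the probability a test rejects a false null hypothesis."],
)
print(scores)  # dict with rouge1, rouge2, rougeL, rougeLsum F-scores
```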
Goal: Serve the fine-tuned model
- Inference API (see the endpoint sketch below)
  - FastAPI server with model loading
  - `/query` endpoint for Q&A
  - Response streaming support
- Optimization
  - Quantization for faster inference
  - Request batching
  - Response caching
  - Metrics: accuracy, completeness, citation quality
  - Latency benchmarks
- Documentation
  - API docs (auto-generated with FastAPI)
  - Usage examples
  - Deployment guide
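A minimal sketch of what the `/query` endpoint could look like. `generate_answer` is a stub standing in for the fine-tuned model's inference call; none of these names are confirmed by the repo.

```python
# Minimal sketch of the planned /query endpoint (names are assumptions).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Experimentation Q&A API")

class Query(BaseModel):
    question: str

class Answer(BaseModel):
    answer: str

def generate_answer(question: str) -> str:
    """Stub; the real handler would run the fine-tuned model here."""
    return f"(model answer for: {question!r})"

@app.post("/query", response_model=Answer)
def query(payload: Query) -> Answer:
    return Answer(answer=generate_answer(payload.question))
```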
Goal: Add graph-based retrieval for multi-hop reasoning and enhanced context
- Entity Extraction
  - Identify statistical concepts from trained model outputs
  - Extract definitions, formulas, relationships
  - Tag by category (statistical test, metric, methodology)
- Relationship Mapping
  - Map concept relationships: "power analysis REQUIRES sample_size, effect_size"
  - Build prerequisite chains: "understand p-value BEFORE FDR"
  - Connect related methods: "t-test SIMILAR_TO mann_whitney"
- SurrealDB Integration
  - Graph + document + vector database
  - Schema design for concepts and relationships
  - Efficient graph traversal queries
- Query Understanding
  - Entity recognition in user questions
  - Intent classification (definition, calculation, methodology)
- Graph Traversal (see the BFS sketch below)
  - BFS/DFS for concept exploration
  - Weighted path finding
  - Subgraph extraction around query entities
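To illustrate the traversal step, here is a minimal in-memory sketch of BFS subgraph extraction around query entities. The toy concept graph is an assumption; in Phase II this would run as a SurrealDB graph query instead.

```python
# Minimal in-memory sketch of BFS subgraph extraction around query entities.
from collections import deque

# Toy adjacency list of (relation, neighbor) pairs; illustrative only.
GRAPH: dict[str, list[tuple[str, str]]] = {
    "power_analysis": [("REQUIRES", "sample_size"), ("REQUIRES", "effect_size")],
    "sample_size": [("AFFECTED_BY", "variance")],
    "effect_size": [],
    "variance": [],
}

def extract_subgraph(entities: list[str], max_hops: int = 2) -> set[tuple[str, str, str]]:
    """Collect (node, relation, neighbor) edges within max_hops of the entities."""
    edges: set[tuple[str, str, str]] = set()
    seen = set(entities)
    frontier = deque((e, 0) for e in entities)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for relation, neighbor in GRAPH.get(node, []):
            edges.add((node, relation, neighbor))
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return edges

# "How does variance affect my sample size calculation?" needs both hops:
print(extract_subgraph(["power_analysis"]))
```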
- Hybrid Retrieval
  - Combine fine-tuned model with graph context
  - Enhanced multi-hop reasoning
  - Improved citation tracking
- Comparative Analysis
  - Fine-tuned SLM alone (Phase I)
  - Fine-tuned SLM + GraphRAG (Phase II)
  - Measure improvement on complex multi-hop questions
- Monitoring and observability
- A/B testing framework for model changes
- User feedback collection
- Continuous improvement pipeline
- Multi-turn conversations
- Calculation tools integration
- Experiment analysis integration
- Internal knowledge base (company-specific)
Phase I Targets (Fine-Tuned SLM):
- 80%+ overall accuracy on test questions
- 85%+ accuracy on statistical definition questions
- 75%+ accuracy on methodology questions
- End-to-end latency < 2 seconds
- Cost < $0.001 per query (vs $0.01-0.03 for GPT-4)
- Training cost < $10 total (data collection + fine-tuning)
Phase II Targets (SLM + GraphRAG):
- 90%+ accuracy on multi-hop reasoning questions
- 95%+ citation accuracy (graph-backed answers)
- Retrieval latency < 200ms
- Improved context relevance through graph traversal
| Component | Technology | Status | Reasoning |
|---|---|---|---|
| Language | Python 3.13+ | ✅ Active | ML ecosystem, rapid development |
| Package Mgr | uv | ✅ Active | 10-100x faster than pip |
| Data Collection | ArXiv, StackExchange API | ✅ Active | Public Q&A and research papers |
| Synthetic Data | Claude Sonnet 4.5 | ✅ Active | High-quality Q&A generation |
| LLM | Llama 3.2 3B Instruct | ✅ Active | Strong reasoning, 128K context |
| Training | HuggingFace TRL + QLoRA | ✅ Active | Efficient 4-bit fine-tuning |
| Dataset Mgr | HuggingFace Datasets | ✅ Active | Standard dataset loading |
| Quantization | BitsAndBytes (4-bit) | ✅ Active | Memory-efficient training |
| API Framework | FastAPI | ✅ Active | Async, auto-docs, type hints |
| Testing | pytest | ✅ Active | Standard Python testing |
| Graph DB | SurrealDB | Phase II | Native graph + document store |
| Vector Store | SurrealDB (native) | Phase II | Single DB for graph + embeddings |
| Embeddings | sentence-transformers | Phase II | Local, fast, good quality |
| Inference | Ollama | Phase II | Easy local deployment |
| Orchestration | Docker Compose | Phase II | Multi-service setup |
- Python 3.13+
- uv package manager
- GPU with 6+ GB VRAM (recommended for training)
- API keys:
  - StackExchange API key (optional; increases rate limits)
  - Anthropic API key (required for synthetic Q&A generation)
```bash
# Clone repo
git clone <repo-url>
cd slm-rag-experimentation

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# or: make install-uv

# Run automated setup
make setup

# This will:
# - Create virtual environment
# - Install all dependencies
# - Set up .env file (add your API keys here)
```

1. Collect Data

```bash
# Collect from Cross Validated (safe without API key)
make collect-cv

# Collect ArXiv paper metadata
make collect-arxiv
```

2. Generate Synthetic Q&A

```bash
# Add your Anthropic API key to .env first
# ANTHROPIC_API_KEY=sk-ant-your-key-here

# Test with 10 papers (~$0.24)
make generate-qa-test

# Full generation from all papers (~$4-5)
make generate-qa
```

3. Train Model

```bash
# Combine datasets and train (4-12 hours on GPU)
make train-full

# Or step by step:
make combine-datasets
make train
```

4. Deploy (Coming Soon)

```bash
# Run inference API
make run

# Start infrastructure
docker-compose up -d

# Start API server
python -m uvicorn src.api.main:app --reload

# Test query
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is statistical power?"}'
```

We're currently building the training data pipeline and SLM fine-tuning:

```bash
# Collect training data from Cross Validated
make collect-cv

# Collect ArXiv papers for synthetic generation
make collect-arxiv

# Test configuration
make test-config
```

Learning Materials:
- GraphRAG Paper - Microsoft Research
- LlamaIndex Knowledge Graphs
- SurrealDB Graph Docs
Data Sources:
- OpenStax Statistics Textbook (CC-BY)
- Trustworthy Online Controlled Experiments (Kohavi et al.)
- Evan Miller's blog on statistics
- Netflix, Booking.com, Airbnb tech blogs
This is a research/experimentation project. Key areas for contribution:
- New data sources for experimentation knowledge
- Improved entity extraction methods
- Better graph traversal algorithms
- Evaluation test cases
- Documentation improvements