Multiple LLMs compete to solve coding challenges.
The Vitalis compiler is the impartial judge.
Evolution breeds the winners into code no single model could write.
Claude's elegance × GPT's brute force × Gemini's lateral thinking → native compiled code
Every AI coding agent generates code. None of them evolve it.
The Forge takes solutions from multiple LLMs, compiles them through a real JIT compiler (Vitalis), benchmarks them with statistical rigor, and then breeds the winners together — crossover, mutation, selection — for hundreds of generations.
The result: code that is measurably better than what any single model wrote on its own.
4 LLMs × 1 challenge × 100 generations = code no single model could write
```
┌─────────────────────────────────────────────────────────────┐
│ 8. NATIVE DESKTOP BRIDGE                                    │
│    File Explorer · VS Code · PowerShell · Clipboard Sync    │
├─────────────────────────────────────────────────────────────┤
│ 7. GOVERNANCE LAYER                                         │
│    SHA-256 provenance · Policy engine · Circuit breaker     │
├─────────────────────────────────────────────────────────────┤
│ 6. MEMORY & SKILLS                                          │
│    Working → Episodic → Semantic · Cosine retrieval · Decay │
├─────────────────────────────────────────────────────────────┤
│ 5. SELF-IMPROVEMENT ENGINE                                  │
│    LLM patching · AST validation · Fitness gating · Backup  │
├─────────────────────────────────────────────────────────────┤
│ 4. AGENT FACTORY                                            │
│    16 artifact types · 8 languages · Auto-deploy            │
├─────────────────────────────────────────────────────────────┤
│ 3. EVOLUTION LAYER                                          │
│    Boltzmann · Quantum annealing · Bayesian UCB · Lévy flight│
├─────────────────────────────────────────────────────────────┤
│ 2. COMPILER LAYER (Vitalis)                                 │
│    Lex → Parse → Type-check → Lint → Capability → JIT → Exec│
├─────────────────────────────────────────────────────────────┤
│ 1. ORCHESTRATION LAYER                                      │
│    Multi-LLM dispatch · Budget tracking · Cost observability│
└─────────────────────────────────────────────────────────────┘
```
| Tool | Version | Purpose |
|---|---|---|
| Python | 3.12+ | Orchestration engine |
| API Key | Any vendor | At least one: OpenRouter, Anthropic, OpenAI, Google, or DeepSeek |
| Ollama (optional) | Latest | Local GPU/CPU inference (qwen, llama, etc.) |
| Vitalis (optional) | v60 | JIT compilation backend (falls back to Python) |
```shell
# Clone The Forge
git clone https://github.com/ModernOps888/the-forge.git
cd the-forge

# Install dependencies (minimal — mostly stdlib)
pip install -r requirements.txt

# Set at least one API key (see .env.example for all options)
# Windows PowerShell:
$env:OPENROUTER_API_KEY = "sk-or-v1-your-key-here"

# Or use direct vendor keys for lower latency:
$env:ANTHROPIC_API_KEY = "sk-ant-..."
$env:OPENAI_API_KEY = "sk-..."

# Linux/macOS:
export OPENROUTER_API_KEY="sk-or-v1-your-key-here"

# Launch the server + dashboard
python forge_server.py

# Open dashboard
# → http://localhost:8777
```

```shell
# Run a demo tournament (mock providers, no API keys needed)
python forge_cli.py demo --generations 5 --population 10

# Compile a .sl snippet through Vitalis
python forge_cli.py compile "fn main() -> i64 { 42 }"

# Type-check without executing
python forge_cli.py check "fn main() -> i64 { 42 }"

# Run a full tournament from a challenge file
python forge_cli.py challenge challenges/sort_integers.json --generations 20

# Chat with any model
python forge_cli.py chat "Explain quicksort in Vitalis .sl" --model claude

# Multi-agent research (all models answer, then synthesize)
python forge_cli.py consensus "Best practices for Rust error handling"

# Build an artifact (4 models compete)
python forge_cli.py build "Build a GitHub webhook receiver with HMAC verification"

# Orchestrate a complex task
python forge_cli.py orchestrate "Design a microservice architecture for a payment gateway"
```

The Factory is the crown jewel. Tell it what to build, and 4 models compete to write the best implementation.
```python
from forge import factory_build

result = factory_build(
    description="Build a GitHub MCP server with PR review and issue triage tools",
    artifact_type="mcp_server",  # or "auto" to detect
)

print(result.leaderboard())
# Winner is auto-saved with all deployment config
```

| Type | Language | What It Generates |
|---|---|---|
| `agent` | Python | Autonomous agent with tools, memory, multi-step reasoning |
| `mcp_server` | Python | MCP server with JSON-RPC, tool schemas, stdio transport |
| `api` | Python | FastAPI REST API with Pydantic models, CORS, OpenAPI docs |
| `cli` | Python | CLI tool with argparse, rich output, config management |
| `pipeline` | Python | Data pipeline with transform stages, validation, sinks |
| `integration` | Python | Third-party connector (Slack, GitHub, Notion) |
| `sdk` | Python | Client library with retries, auth, pagination |
| `webhook` | Python | Webhook receiver with HMAC verification, retry queue |
| `rust_lib` | Rust | Library crate with error types, tests, Cargo.toml |
| `go_service` | Go | Microservice with handlers, middleware, graceful shutdown |
| `react_app` | TypeScript | React/Next.js component with hooks, CSS, accessibility |
| `terraform` | HCL | IaC module with variables, outputs, remote state |
| `workflow` | YAML | GitHub Actions CI/CD workflow |
| `extension` | TypeScript | VS Code extension with package.json, tsconfig |
| `docker` | Dockerfile | Dockerfile + docker-compose.yml |
| `skill` | Vitalis .sl | Native JIT-compiled hot-path function |
```
LLM Generation (4 models)
        ↓
Vitalis Heuristic Scoring (sub-µs, native Rust)
  • Line count, import density, error handling, complexity
        ↓
LLM Semantic Scoring (Gemini, cheapest model)
  • Correctness, completeness, security, quality, production-readiness
        ↓
Final Score = 40% LLM + 60% Heuristic
  (We trust math over LLM self-evaluation)
        ↓
Winner saved + auto-deployed with config files
```
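The 40/60 blend above is a one-liner; here is a minimal sketch, assuming both scores live on a 0-100 scale (the function name is illustrative, not The Forge's API):

```python
def blend_scores(llm_score: float, heuristic_score: float) -> float:
    """Blend semantic and heuristic scores using the pipeline's weights.

    Hypothetical helper: the 0-100 scale and name are assumptions.
    """
    # "We trust math over LLM self-evaluation": heuristics get the majority weight.
    return 0.40 * llm_score + 0.60 * heuristic_score

print(blend_scores(90.0, 80.0))  # ~84.0: the heuristic side dominates
```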
Every artifact type gets full deployment wiring:
| Type | Generated Files |
|---|---|
| `mcp_server` | `server.py`, `mcp.json` (Cursor), `vscode-mcp-snippet.json`, `README.md` |
| `api` | `main.py`, `requirements.txt`, `Dockerfile`, `docker-compose.yml`, `.env.example` |
| `cli` | `tool.py`, `pyproject.toml`, `run.sh` |
| `agent` | `agent.py`, `requirements.txt`, `run.sh`, `agent.service` (systemd) |
| `rust_lib` | `src/lib.rs`, `Cargo.toml` |
| `go_service` | `main.go`, `go.mod`, `Makefile`, `Dockerfile` |
| `react_app` | `Component.tsx`, `package.json` |
| `terraform` | `main.tf`, `backend.tf`, `terraform.tfvars.example` |
Not a basic genetic algorithm. This is research-grade evolutionary computation powered by native Rust hotpaths.
| Algorithm | Purpose |
|---|---|
| Boltzmann (Softmax) Selection | Temperature-controlled exploration/exploitation. T→0 = greedy, T→∞ = random |
| Bayesian UCB1 | Multi-armed bandit — try under-explored solutions more |
| Elite Passthrough | Top N individuals survive unchanged |
| Quantum-Inspired Annealing | Probabilistically accept worse solutions to escape local optima |
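Boltzmann selection from the table above fits in a few lines. This is an illustrative pure-Python sketch, not The Forge's native Rust hotpath:

```python
import math
import random

def boltzmann_select(fitnesses: list[float], temperature: float) -> int:
    """Pick an index with probability proportional to exp(fitness / T).

    Low T concentrates mass on the best individual (greedy);
    high T flattens toward uniform (random).
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(fitnesses)
    weights = [math.exp((f - m) / temperature) for f in fitnesses]
    r = random.uniform(0.0, sum(weights))
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc and w > 0.0:
            return i
    return len(fitnesses) - 1

# At T -> 0 the best individual wins essentially every draw.
picks = [boltzmann_select([10.0, 50.0, 90.0], temperature=0.01) for _ in range(100)]
assert all(p == 2 for p in picks)
```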
| Operator | What It Does |
|---|---|
| Uniform Crossover | For each line, randomly pick from parent A or B |
| Single-Point Crossover | Swap tails at a random body line |
| AST-Aware Crossover | Swap entire brace-depth blocks between parents (prevents syntax breaks) |
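The simplest operator above, uniform crossover, can be sketched directly over line lists (the function name and the equal-length assumption are illustrative):

```python
import random

def uniform_crossover(parent_a: list[str], parent_b: list[str]) -> list[str]:
    """For each line position, randomly inherit from parent A or parent B.

    Sketch only: assumes equal-length parents; a real operator also
    handles length mismatches and protects header/footer lines.
    """
    return [random.choice(pair) for pair in zip(parent_a, parent_b)]

a = ["fn main() {", "    let x = 1;", "}"]
b = ["fn main() {", "    let x = 2;", "}"]
child = uniform_crossover(a, b)
assert len(child) == 3 and child[0] == "fn main() {"
```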
| Operator | What It Does |
|---|---|
| Swap Lines | Swap two adjacent body lines |
| Insert Guard | Add an early-return guard clause |
| Change Constant | Alter a numeric literal by a Lévy-flight step |
| Add Comment | Inject a descriptive comment (drives novelty score) |
| Rename Variable | Rename a variable for code distance |
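The Lévy-flight constant mutation can be sketched as below. The inverse-power-law sampler is one common way to approximate a heavy-tailed step (mostly small tweaks, occasional large jumps); it is not necessarily the sampler The Forge uses:

```python
import math
import random

def levy_step(alpha: float = 1.5) -> float:
    """Draw a non-negative heavy-tailed step via inverse power-law sampling."""
    u = random.random()                       # u in [0, 1)
    return (1.0 - u) ** (-1.0 / alpha) - 1.0  # 0 for u=0, heavy tail as u -> 1

def mutate_constant(value: int) -> int:
    """Nudge a numeric literal by a signed Levy-flight step (illustrative)."""
    sign = random.choice([-1, 1])
    return value + sign * round(levy_step())
```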
| Dimension | Weight | Source |
|---|---|---|
| Correctness | 30% | Test case pass rate |
| Performance | 25% | Relative execution speed, algorithmic complexity |
| Code Quality | 15% | Structure, naming, documentation, type hints |
| Robustness | 15% | Error handling, input validation, edge cases |
| Efficiency | 10% | Memory patterns, allocations, streaming |
| Novelty | 5% | Code distance from population (prevents convergence) |
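The total is just a weighted sum over these six dimensions. A minimal sketch, assuming each dimension is scored 0-100 (the dict keys are illustrative names, not The Forge's schema):

```python
WEIGHTS = {
    "correctness": 0.30, "performance": 0.25, "code_quality": 0.15,
    "robustness": 0.15, "efficiency": 0.10, "novelty": 0.05,
}

def total_fitness(dims: dict[str, float]) -> float:
    """Weighted sum over the six dimensions; missing dimensions score 0."""
    return sum(WEIGHTS[k] * dims.get(k, 0.0) for k in WEIGHTS)

perfect = {k: 100.0 for k in WEIGHTS}
assert abs(total_fitness(perfect) - 100.0) < 1e-9  # weights sum to 1.0
```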
Every submission passes through 7 gates before being scored:
```
Source → LEX → PARSE → TYPE CHECK → LINT → CAPABILITY → COMPILE → EXECUTE
          ↓      ↓         ↓          ↓         ↓           ↓        ↓
        Reject Reject    Reject     Score     Reject      Reject   Score
```
If any gate fails, the submission is eliminated. The compiler is the impartial judge — it doesn't care which LLM wrote the code.
```
┌──────────────────────────────────────────┐
│ Tier 1: WORKING MEMORY (in-RAM)          │
│   Per-session, cosine dedup via Vitalis  │
│   Max 30 entries, importance-weighted    │
├──────────────────────────────────────────┤
│ Tier 2: EPISODIC MEMORY (JSON on disk)   │
│   EMA importance decay, 500 entry limit  │
│   Automatic eviction of low-importance   │
├──────────────────────────────────────────┤
│ Tier 3: SEMANTIC (Skills + Instructions) │
│   Injected into system prompts           │
│   Boltzmann-selected based on query match│
└──────────────────────────────────────────┘
```
Skills are context-aware prompt injections. When you ask about debugging, the debugging skill activates and enriches the system prompt. Each skill carries a fitness score that is updated adaptively as the skill proves useful (or fails to).
Instructions are standing orders (e.g., "Be concise", "Use type hints") that persist across sessions.
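The Tier 1 cosine dedup can be sketched as follows. This is a pure-Python stand-in for the Vitalis hotpath; the function names and the 0.95 threshold are assumptions:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def is_novel(entries: list[list[float]], new: list[float],
             threshold: float = 0.95) -> bool:
    """True if `new` is far enough from every stored entry to keep."""
    return all(cosine(e, new) < threshold for e in entries)

assert is_novel([[1.0, 0.0]], [0.0, 1.0]) is True    # orthogonal: keep
assert is_novel([[1.0, 0.0]], [1.0, 0.0]) is False   # duplicate: drop
```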
The Forge can modify its own source code:
```
User instruction → LLM generates line-range patches
        ↓
  JSON repair layer
        ↓
  AST syntax validation (Python)
        ↓
┌─── Fitness Gate ───┐
│  Baseline score    │
│  Mutant score      │
│  Regression? BLOCK │
│  Below floor? BLOCK│
└────────────────────┘
        ↓
  Auto-backup (.bak)
        ↓
  Overwrite + hot-swap
```
Safety layers:
- JSON repair (fix truncated LLM output)
- AST dry-run before overwrite
- 6-dimensional fitness scoring — no regressions allowed
- Quality floor (30/100 minimum)
- Auto-backup before every write
- Hot-swap only when patching the server itself
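The AST dry-run reduces to a try/except around `ast.parse`. A minimal sketch (the real gate also runs the fitness comparison before any file is overwritten):

```python
import ast

def ast_gate(patched_source: str) -> bool:
    """Dry-run syntax check: refuse to write a patch that does not parse."""
    try:
        ast.parse(patched_source)
        return True
    except SyntaxError:
        return False

assert ast_gate("def ok():\n    return 1\n") is True
assert ast_gate("def broken(:\n") is False
```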
Every action is recorded in a tamper-evident, hash-chained ledger:
```python
entry = ProvenanceEntry(
    timestamp=time.time(),
    event_type="factory_complete",  # submission, compilation, mutation, promotion...
    entity_id="run_a1b2c3d4",
    actor="factory",
    data={"winner_provider": "claude", "score": 87.2},
    parent_hash="previous_entry_sha256",
    entry_hash="sha256(timestamp|type|id|actor|data|parent)",
)
```

Modify any entry → all subsequent hashes invalidate → tamper detected.
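Chain verification follows directly from that structure. A simplified sketch with a flattened field layout (the real `ProvenanceEntry` hashes more fields than this):

```python
import hashlib
import json

def entry_hash(data: dict, parent_hash: str) -> str:
    """SHA-256 over the entry payload plus the previous entry's hash."""
    payload = json.dumps(data, sort_keys=True) + "|" + parent_hash
    return hashlib.sha256(payload.encode()).hexdigest()

def verify_chain(entries: list[dict]) -> bool:
    """Recompute every hash in order; any edited entry breaks the chain."""
    parent = "0" * 64  # genesis
    for e in entries:
        if e["entry_hash"] != entry_hash(e["data"], parent):
            return False
        parent = e["entry_hash"]
    return True

# Build a tiny ledger, then tamper with the middle entry.
ledger, parent = [], "0" * 64
for i in range(3):
    data = {"event": "submission", "seq": i}
    h = entry_hash(data, parent)
    ledger.append({"data": data, "entry_hash": h})
    parent = h
assert verify_chain(ledger) is True
ledger[1]["data"]["seq"] = 99          # modify any entry...
assert verify_chain(ledger) is False   # ...and tampering is detected
```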
| Policy | Enforcement |
|---|---|
| No file system access | Compile-time capability check |
| No network access | Compile-time capability check |
| No process execution | Compile-time capability check |
| Budget enforcement | Halt all calls when budget exhausted |
| Type safety | Type-check must pass before JIT |
| Timeout enforcement | Kill execution exceeding limit |
| Regression prevention | Champion must beat current best |
| Novelty threshold | Reject <10% code distance |
Protects against cascading API failures:
- Closed → normal operation
- Open → blocks all calls after N consecutive failures
- Half-open → allows one test call after cooldown
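The three states can be sketched as a small class. This is illustrative, not The Forge's implementation, and the thresholds are arbitrary:

```python
import time

class CircuitBreaker:
    """Minimal closed / open / half-open breaker (illustrative sketch)."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.max_failures:
            return True                                   # closed: normal operation
        if time.time() - self.opened_at >= self.cooldown_s:
            return True                                   # half-open: allow a probe
        return False                                      # open: block all calls

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0                             # success re-closes
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()              # trip open

cb = CircuitBreaker(max_failures=2, cooldown_s=60.0)
cb.record(False); cb.record(False)
assert cb.allow() is False   # open after 2 consecutive failures
cb.record(True)
assert cb.allow() is True    # success closes the breaker again
```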
Every API call is traced with:
| Metric | Tracked |
|---|---|
| Input/output tokens | Per request |
| Cost (USD) | Per request, per provider |
| Latency | Per request (ms) |
| EMA spike detection | Alert on anomalous costs |
| Per-provider breakdown | Who's expensive, who's cheap |
| Budget enforcement | Hard cap with remaining balance |
| Optimization hints | "Switch to Gemini for 3x savings" |
```python
tracker = get_tracker(BudgetConfig(max_budget_usd=50.0))
# Every call auto-records. Dashboard shows real-time spend.
```

The Forge ships with a full interactive dashboard at http://localhost:8777:
| Tab | Features |
|---|---|
| Chat | Multi-model chat with vendor badges (DIRECT/OPENROUTER/LOCAL) |
| Cowork | Desktop co-work agent with native OS hooks |
| Research | Fan query to multiple models, consensus synthesis |
| Build | Agent Factory — 4 models compete, leaderboard scoring |
| Compile | Vitalis .sl code compilation & execution |
| Dashboard | KPIs, Compute & Platform panel, Vendor Routing matrix |
| Orchestrate | Multi-step complex task execution |
| Evolve | Self-improvement engine with fitness gating |
| Trace | Execution traces + Human-in-the-Loop approvals |
| Antigravity | IDE export inbox |
- 6 KPI Cards: Total Cost, Tokens, Requests, Active Vendors, GPU Count, Vitalis Version
- Compute & Platform Panel: Detected GPU (name, VRAM bar), OS badge, CPU cores, RAM, compute mode selector
- Vendor Routing Matrix: Shows which vendors are ACTIVE/INACTIVE with IDE export targets
- Real-time telemetry: CPU%, RAM%, GPU% with VRAM utilization in the top bar
```python
from forge import Arena, ArenaConfig, Challenge

# Define a challenge
challenge = Challenge(
    name="Fibonacci",
    description="Compute the nth Fibonacci number",
    function_signature="fn fib(n: i64) -> i64",
)

# Create arena
arena = Arena(config=ArenaConfig(max_generations=50))
arena.register_provider("claude", my_claude_api)
arena.register_provider("gpt", my_gpt_api)

# Run tournament
tournament = arena.run(challenge)

# Get the champion
print(f"Winner: {tournament.champion.fitness.total}")
print(tournament.champion.source_code)
```

```python
from forge import multi_agent_research, consensus

# Ask all models the same question
results = multi_agent_research("Best Rust error handling patterns")
# → {"claude": "...", "gpt": "...", "gemini": "...", "deepseek": "..."}

# Or get a synthesized consensus
result = consensus("Compare actor model vs CSP for Go concurrency")
# → {"individual": {...}, "consensus": "merged answer", "cost_usd": 0.003}
```

```python
from forge import auto_route

# Automatically picks the cheapest model that can handle the task
answer = auto_route("What's 2+2?")            # → Gemini (cheapest)
answer = auto_route("Design a payment API")   # → Claude (premium)
```

| Name | Model ID | Tier | Routing |
|---|---|---|---|
| `claude` | `anthropic/claude-sonnet-4.6` | Premium | Direct → OpenRouter fallback |
| `claude-opus` | `anthropic/claude-opus-4.7` | Elite | Direct → OpenRouter fallback |
| `gpt` | `openai/gpt-5.4` | Premium | Direct → OpenRouter fallback |
| `gpt-mini` | `openai/gpt-5.4-mini` | Mid | Direct → OpenRouter fallback |
| `gpt-nano` | `openai/gpt-5.4-nano` | Economy | Direct → OpenRouter fallback |
| `gemini` | `google/gemini-2.5-pro` | Premium | Direct → OpenRouter fallback |
| `gemini-lite` | `google/gemini-2.5-flash` | Economy | Direct → OpenRouter fallback |
| `deepseek` | `deepseek/deepseek-chat-v3-0324` | Economy | Direct → OpenRouter fallback |
| `qwen-local` | `qwen2.5-coder:7b` | Local | Ollama (GPU/CPU/Split) |
The Forge uses a smart routing strategy:
- Direct vendor key present? → Call vendor API directly (lowest latency)
- No direct key? → Fall back to OpenRouter (universal relay)
- Local model? → Route to Ollama with a hardware-aware `num_gpu` setting
```
Anthropic key → api.anthropic.com   (direct, ~200ms)
No key        → openrouter.ai/api   (relay, ~400ms)
qwen-local    → localhost:11434     (Ollama, GPU/CPU)
```
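The decision itself is a short cascade. Here is a sketch assuming direct keys are detected via the env vars shown in the quick start; the vendor table and return values are illustrative, not The Forge's internals:

```python
import os

# Hypothetical vendor table; env var names match the quick-start examples.
DIRECT_KEYS = {
    "claude": "ANTHROPIC_API_KEY",
    "gpt": "OPENAI_API_KEY",
}

def pick_route(model: str) -> str:
    """Direct vendor key if present, else OpenRouter; local models hit Ollama."""
    if model.endswith("-local"):
        return "http://localhost:11434"      # Ollama
    if os.environ.get(DIRECT_KEYS.get(model, "")):
        return "direct"                      # vendor API, lowest latency
    return "https://openrouter.ai/api"       # universal relay

print(pick_route("qwen-local"))  # → http://localhost:11434
```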
The Forge auto-detects available hardware for local inference:
| Platform | GPU Detection | RAM Detection |
|---|---|---|
| Windows | `nvidia-smi` (NVIDIA) | `wmic` / `psutil` |
| Linux | `nvidia-smi` (NVIDIA), `rocm-smi` (AMD) | `psutil` |
| macOS | Apple Silicon (unified memory) | `psutil` |
| Mode | Behavior | Use Case |
|---|---|---|
| `auto` | Detect GPU, fall back to CPU | Default — works everywhere |
| `gpu` | Force all layers on GPU | Dedicated GPU machines |
| `cpu` | Force CPU-only (`num_gpu=0`) | No GPU, 16GB+ RAM |
| `split` | Split layers across GPU+CPU | Limited VRAM (4-6GB) |
The dashboard's Compute & Platform panel shows detected hardware in real-time:
- GPU name, VRAM total/free with usage bar
- Platform badge (🪟 Windows / 🐧 Linux / 🍎 macOS)
- CPU cores, system RAM with capacity bar
- Active compute mode selector visualization
The Forge doesn't use vibes. Every comparison uses proper statistics:
- Welford's algorithm for numerically stable online mean/variance
- 95% confidence intervals via Student's t distribution
- Welch's t-test for comparing two candidates with unequal variance
- Cohen's d effect size measurement
- MAD-based outlier detection (modified Z-score)
- Minimum 30 runs before declaring significance
- Pareto front computation for multi-objective optimization
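Welford's algorithm, the first item above, is compact enough to show whole. An illustrative stand-in for the benchmark module:

```python
class Welford:
    """Numerically stable online mean/variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations

    def add(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance (n-1 denominator)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

w = Welford()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    w.add(x)
assert abs(w.mean - 5.0) < 1e-9
assert abs(w.variance - 32.0 / 7.0) < 1e-9
```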
```
TheForge/
├── forge/
│   ├── __init__.py        # Package exports (v2.0.0)
│   ├── models.py          # Data models (Challenge, Submission, Score, etc.)
│   ├── compiler.py        # Vitalis JIT integration (7-gate pipeline)
│   ├── benchmark.py       # Welford stats, t-tests, outlier detection
│   ├── evolution.py       # Crossover, mutation, Boltzmann/UCB selection
│   ├── arena.py           # Tournament orchestrator
│   ├── factory.py         # Agent Factory (16 types, 8 languages)
│   ├── deployer.py        # Auto-deployment wiring for all artifact types
│   ├── fitness_engine.py  # 6-dimensional polyglot code scoring
│   ├── providers.py       # OpenRouter + Ollama LLM providers
│   ├── budget.py          # Cost tracking, EMA spike detection
│   ├── governance.py      # Provenance chain, policy engine, circuit breaker
│   ├── memory.py          # Three-tier memory (working/episodic/semantic)
│   ├── marketplace.py     # Skill marketplace registry
│   ├── skills.py          # Skill DNA certification from compilation
│   ├── tracer.py          # Execution traces
│   ├── replay.py          # Deterministic tournament replay
│   └── vitalis_ffi.py     # Vitalis Rust FFI bindings (234K LOC of bindings)
├── dashboard/
│   └── index.html         # Full interactive dashboard (single-file, no deps)
├── challenges/
│   └── sort_integers.json # Example challenge definition
├── forge_server.py        # HTTP API server (1800+ lines, stdlib only)
├── forge_cli.py           # Full CLI entry point
├── forge_desktop.py       # Native Windows desktop integration
├── forge_mcp.py           # MCP server interface
├── launch.ps1             # Windows PowerShell launcher
├── requirements.txt       # Minimal dependencies
└── README.md
```
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/chat` | Send message to any model |
| POST | `/api/research` | Fan query to selected models |
| POST | `/api/consensus` | All models + synthesis |
| POST | `/api/compile` | Compile .sl code via Vitalis |
| POST | `/api/build` | Agent Factory — build any artifact |
| POST | `/api/deploy` | Deploy a factory result |
| POST | `/api/orchestrate` | Multi-step complex task execution |
| POST | `/api/score` | Score code with fitness engine |
| POST | `/api/self-improve` | Self-improvement patching |
| POST | `/api/cowork` | Desktop co-work agent |
| POST | `/api/quick-chat` | Stateless single-shot chat |
| POST | `/api/analyze-file` | AI code review of local file |
| POST | `/api/export` | Export to IDE bridge (multi-target) |
| POST | `/api/native` | Native OS bridge (explorer/IDE/terminal) |
| GET | `/api/health` | System health + active vendors |
| GET | `/api/models` | Available models + pricing + vendor badges |
| GET | `/api/budget` | Cost & token report |
| GET | `/api/provenance` | Governance audit trail |
| GET | `/api/memory` | Memory system stats |
| GET | `/api/traces` | Execution traces |
| GET | `/api/vendor-health` | Vendor connection status |
| GET | `/api/compute` | Hardware detection (GPUs, CPU, RAM) |
| GET | `/api/ide-targets` | Configured IDE export targets |
- Core engine (compiler, benchmark, evolution, arena)
- CLI with demo mode
- Vitalis JIT integration with native Rust hotpaths
- Real LLM providers via OpenRouter (Claude, GPT, Gemini, DeepSeek)
- Multi-vendor smart routing (direct Anthropic/OpenAI/Google/DeepSeek + OpenRouter fallback)
- Cross-platform compute detection (NVIDIA, AMD ROCm, Apple Silicon, CPU-only)
- IDE-agnostic export bridge (VS Code, Cursor, Windsurf, Antigravity, clipboard)
- Budget & cost tracking with EMA spike detection
- Enterprise governance (provenance chain, policy engine, circuit breaker)
- Agent Factory (16 artifact types, 8 languages)
- Auto-deployment wiring for all artifact types
- Three-tier memory engine (working/episodic/semantic)
- Self-improvement engine with fitness gating
- Interactive dashboard with 10+ tabs
- Native desktop bridge (File Explorer, VS Code, PowerShell)
- Polyglot fitness engine (Python, Rust, Go, TypeScript, etc.)
- Local model support via Ollama (GPU/CPU/split compute modes)
- MCP server (11 tools, smart-routed)
- SQLite evolution ledger
- AST-level crossover (not line-level)
- Property-based fuzz testing via Vitalis
- Multi-user collaboration mode
- WebSocket real-time dashboard updates
MIT License — see LICENSE for details.
Built with 🔥 by ModernOps888
Where code is forged through competition, refined by evolution, and hardened by compilation.

