
Indexed dary heap #26

Merged
pratzl merged 24 commits into main from indexed-dary-heap
Apr 28, 2026
Conversation


@pratzl pratzl commented Apr 28, 2026

No description provided.

pratzl added 24 commits April 25, 2026 23:22
…Phase 0)

- dijkstra_fixtures.hpp: Erdős–Rényi, 2D grid, Barabási–Albert, path
  graph generators for CSR and VoV containers
- benchmark_dijkstra.cpp: Google Benchmark suite covering 4 topologies
  × 2 containers × 3 scales (1K/10K/100K)
- benchmark/data/README.md: fetch instructions for SNAP real-world graphs
- .gitignore: exclude benchmark/data/ graph files
- indexed_dary_heap_baseline.md: captured priority_queue baseline
  (CSR ER Sparse 100K = 29.1 ms; target ≤ 22 ms after Phase 4)
- indexed_dary_heap_plan.md: initial plan
- indexed_dary_heap.hpp: templated O(log_d V) heap with push, pop, top,
  decrease(k), contains; position map keeps one entry per vertex (O(V))
- heap_position_map.hpp: two adapters —
    vector_position_map (integral keys, O(1) lookup via vector<size_t>)
    assoc_position_map (hashable keys, unordered_map<Key, size_t>)
- test_indexed_dary_heap.cpp: unit tests for arity 2/4/8, both position
  map adapters, decrease-key, custom comparator, heap ordering invariant
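The decrease-key mechanism described above can be sketched as follows. This is a minimal illustration, not the library's actual indexed_dary_heap interface; the names (toy_indexed_heap, place, sift_up, sift_down) are invented for the sketch. The pos_ vector plays the role of the commit's vector_position_map: one O(1) slot per key, npos when the key is absent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

template <std::size_t D = 4>
class toy_indexed_heap {
  static constexpr std::size_t npos = static_cast<std::size_t>(-1);
  std::vector<std::size_t> keys_;   // heap array of keys
  std::vector<double>      prio_;   // priority per key (indexed by key)
  std::vector<std::size_t> pos_;    // key -> index in keys_, or npos

  void place(std::size_t i, std::size_t key) {
    keys_[i] = key;
    pos_[key] = i;
  }
  void sift_up(std::size_t i) {
    while (i > 0) {
      std::size_t parent = (i - 1) / D;
      if (prio_[keys_[parent]] <= prio_[keys_[i]]) break;
      std::size_t k = keys_[i];
      place(i, keys_[parent]);
      place(parent, k);
      i = parent;
    }
  }
  void sift_down(std::size_t i) {
    for (;;) {
      std::size_t best  = i;
      std::size_t first = D * i + 1;
      for (std::size_t c = first; c < keys_.size() && c < first + D; ++c)
        if (prio_[keys_[c]] < prio_[keys_[best]]) best = c;
      if (best == i) break;
      std::size_t k = keys_[i];
      place(i, keys_[best]);
      place(best, k);
      i = best;
    }
  }

public:
  explicit toy_indexed_heap(std::size_t num_keys)
      : prio_(num_keys), pos_(num_keys, npos) {}

  bool contains(std::size_t key) const { return pos_[key] != npos; }

  void push(std::size_t key, double p) {
    prio_[key] = p;
    keys_.push_back(key);
    pos_[key] = keys_.size() - 1;
    sift_up(keys_.size() - 1);
  }
  // decrease-key in O(log_d n): the position map locates the entry,
  // avoiding the re-push of a stale duplicate
  void decrease(std::size_t key, double p) {
    prio_[key] = p;
    sift_up(pos_[key]);
  }
  std::size_t top() const { return keys_.front(); }
  void pop() {
    std::size_t last = keys_.back();
    pos_[keys_.front()] = npos;
    keys_.pop_back();
    if (!keys_.empty()) { place(0, last); sift_down(0); }
  }
  bool empty() const { return keys_.empty(); }
};
```

The one-entry-per-vertex position map is what keeps storage at O(V) and makes decrease(k) constant-time to locate.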
Phase 2 — heap-selector tag + dense dispatch:
- use_default_heap / use_indexed_dary_heap<Arity=4> tags added
- if constexpr dispatch on index_vertex_range<G>: dense path uses
  vector_position_map; removes O(E) lazy heap, replaces re-push with
  decrease(k); visitor semantics verified (examine/finish fire once)

Phase 3 — mapped-container support:
- else branch selects assoc_position_map<key_type> for graphs satisfying
  hashable_vertex_id; reserve(num_vertices(g)) preallocates the map
- SPARSE_VERTEX_TYPES (mov/mod/mol/uov/uod/uol) tests: sparse CLRS
  distances match default heap; visitor parity (examine==finish==5)
- Non-integral string-keyed graph test: CLRS topology with VId=std::string,
  textbook distances and per-vertex parity with default heap
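The dense/sparse split in Phases 2–3 can be sketched with two stand-in adapters. These are illustrative only; the real vector_position_map / assoc_position_map live in heap_position_map.hpp and their interfaces may differ, and the trait here branches on integral vertex-id types as a stand-in for the library's index_vertex_range<G> dispatch:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <vector>

inline constexpr std::size_t npos = static_cast<std::size_t>(-1);

// Dense path: integral, contiguous ids -> O(1) lookup via a vector slot.
struct vector_position_map {
  std::vector<std::size_t> pos;
  explicit vector_position_map(std::size_t n) : pos(n, npos) {}
  std::size_t& operator[](std::size_t k) { return pos[k]; }
};

// Sparse / mapped-id path: hashable keys -> unordered_map<Key, size_t>.
template <class Key>
struct assoc_position_map {
  std::unordered_map<Key, std::size_t> pos;
  // mirrors the commit's reserve(num_vertices(g)) preallocation
  explicit assoc_position_map(std::size_t n) { pos.reserve(n); }
  std::size_t& operator[](Key const& k) {
    auto [it, inserted] = pos.try_emplace(k, npos);
    return it->second;
  }
};

// Stand-in for the if constexpr dispatch on index_vertex_range<G>.
template <class VId>
auto make_position_map(std::size_t n) {
  if constexpr (std::is_integral_v<VId>)
    return vector_position_map(n);
  else
    return assoc_position_map<VId>(n);
}
```

With this shape, a string-keyed graph (as in the VId=std::string test above) transparently gets the associative adapter while CSR-style graphs keep the vector fast path.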

21 test cases / 158 assertions; 846 total / 4206 assertions, 0 failures
… 4.1)

Extend benchmark_dijkstra.cpp to cover all four heap variants (Default,
Idx2, Idx4, Idx8) across all four topologies for CSR; add Idx4 for VoV.
Run 3x; results in agents/indexed_dary_heap_results.md.

Key findings at CSR 100K:
  ER Sparse: Idx8 −25% (20.2 ms vs 27.0 ms avg; −31% vs Phase 0 baseline)
  BA:        Idx8 −17% (19.0 ms vs 22.9 ms)
  Grid:      Default wins; Idx4 +35%, Idx8 +39% (position-map overhead)
  Path:      Default wins; all indexed variants +22%

Recommendation: keep default_heap as the default; document
use_indexed_dary_heap<8> for high-E/V random/BA workloads on CSR.
…d (Phase 4.2)

Phase 4.1 results were mixed across topologies:

  CSR 100K (3-run avg, vs use_default_heap):
    ER Sparse  E/V≈8: Idx8 -25%   (20.2 ms vs 27.0 ms)
    BA         E/V≈8: Idx8 -17%   (19.0 ms vs 22.9 ms)
    Grid       E/V≈4: Idx8 +39%   (8.4  ms vs 6.0  ms)
    Path       E/V=1: Idx8 +22%   (0.33 ms vs 0.27 ms)

The grid regression is too large to justify switching the default. A
heuristic E/V dispatch was considered but rejected as premature — users
with known graph shapes can opt in explicitly.

Decision: keep use_default_heap as the default; document
use_indexed_dary_heap<8> as the recommended opt-in for high-E/V random
or scale-free workloads on compressed_graph.
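The opt-in mechanism can be sketched with tag dispatch. The tag names follow the commit; the entry point and its dispatch below are illustrative, not the library's real dijkstra signature:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct use_default_heap {};
template <std::size_t Arity = 4>
struct use_indexed_dary_heap {};

// Hypothetical algorithm entry point: defaults to the std::priority_queue
// path; callers with high-E/V random or scale-free graphs opt in explicitly.
template <class HeapTag = use_default_heap>
std::string dijkstra_heap_choice(HeapTag = {}) {
  return "priority_queue";
}
template <std::size_t Arity>
std::string dijkstra_heap_choice(use_indexed_dary_heap<Arity>) {
  return "indexed d-ary, arity " + std::to_string(Arity);
}
```

Because the tag carries the arity as a template parameter, the heap layout is fixed at compile time and the default path pays nothing for the option.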

  - Update use_default_heap / use_indexed_dary_heap doc comments with
    workload guidance and the measured numbers.
  - Add CHANGELOG entry under [Unreleased] describing the new selector
    tags, the position-map adapters, and the default decision.
  - Mark Phase 4.1 / 4.2 complete in agents/indexed_dary_heap_plan.md
    with the decision recorded.

Verification: full suite still green (4847/4847 tests pass).
…Heap)

Reorder the trailing template and function parameters on all four
overloads of dijkstra_shortest_paths / dijkstra_shortest_distances so
that Alloc / alloc is the last parameter, following the new Heap /
heap_tag parameter rather than preceding it.

Rationale: Alloc has been a stable parameter since 0.5.0 while Heap is
new in this release. Putting the more-stable, less-frequently-used
parameter last gives callers who only want to override the heap tag a
single positional argument to add, instead of having to also restate
the allocator.

  Old order: ..., Combine, Alloc, Heap
  New order: ..., Combine, Heap, Alloc
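The ergonomic payoff of the new order can be shown with a stand-in signature (not the library's real overloads): with Heap ahead of Alloc, a caller overriding only the heap adds one positional argument and never restates the allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Illustrative stand-ins only.
struct use_default_heap {};
template <std::size_t A = 4> struct use_indexed_dary_heap {};

template <class Heap> constexpr std::size_t arity_of(Heap) { return 0; }
template <std::size_t A>
constexpr std::size_t arity_of(use_indexed_dary_heap<A>) { return A; }

// New trailing order: ..., Combine, Heap, Alloc.
template <class Combine, class Heap = use_default_heap,
          class Alloc = std::allocator<double>>
std::size_t run_dijkstra(Combine combine, Heap heap = {}, Alloc = {}) {
  // toy body: apply combine, report the selected heap's arity
  return combine(40, 2) + arity_of(heap);
}
```

Under the old order (..., Combine, Alloc, Heap) the same call would have had to spell out std::allocator<double>{} just to reach the heap slot.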

Updated:
  - all 4 dijkstra overloads (template params, function params, internal
    forwarding from single-source -> multi-source overloads)
  - prim_minimum_spanning_tree in mst.hpp (was passing alloc into the
    Heap slot positionally; now passes use_default_heap{}, alloc)
  - benchmark_dijkstra.cpp DEFINE_DIJKSTRA_BM macro
  - all call sites in test_dijkstra_indexed_heap.cpp

Verification: full suite green (4847/4847 tests pass).
Add optional Boost.Graph (BGL) comparison benchmarks alongside the
existing graph-v3 Dijkstra benchmarks. Both libraries operate on
topologically identical graphs built from the same edge_list, so the
numbers can be compared directly.

  - benchmark/algorithms/bgl_dijkstra_fixtures.hpp
      * make_bgl_csr  -> boost::compressed_sparse_row_graph
      * make_bgl_adj  -> boost::adjacency_list<vecS, vecS, directedS, ...>
      * run_bgl_dijkstra: dijkstra_shortest_paths_no_color_map_no_init
        for fairness vs graph-v3's no-init semantics.

  - benchmark/algorithms/benchmark_dijkstra.cpp
      * BENCH_BGL-gated section adds 8 BGL benchmarks
        (CSR/Adj x ER/Grid/BA/Path).
      * Startup parity check (check_bgl_distance_parity) asserts
        BGL and graph-v3 produce identical distance vectors on ER, BA,
        and Path at n=1024 from source 0; abort otherwise.

  - benchmark/algorithms/CMakeLists.txt
      * New options DIJKSTRA_BENCH_BGL + BGL_INCLUDE_DIR. Fatal error if
        ON without a directory containing boost/graph headers.

Results at n=100K, 3-run average (CSR):

  Topology   | gv3 def  | gv3 Idx8 | BGL CSR  | BGL adj
  -----------|----------|----------|----------|--------
  ER Sparse  | 26.2 ms  | 22.9 ms  | 19.9 ms  | 34.2 ms
  BA         | 26.9 ms  | 21.7 ms  | 19.6 ms  | 30.9 ms
  Grid       |  6.2 ms  |  8.9 ms  |  6.1 ms  |  9.9 ms
  Path       | 0.27 ms  | 0.33 ms  | 0.28 ms  |  0.52 ms

Conclusions:
  - graph-v3 default beats BGL adjacency_list on every topology by
    23-48% (no missing-feature gap on the closer-equivalent container).
  - BGL CSR is 10-15% faster than graph-v3 Idx8 on dense graphs;
    remaining gap is plausibly CSR layout, not heap.
  - On low-E/V graphs (grid, path) graph-v3 ties or beats BGL CSR.
  - No further heap changes recommended; Phase 4.2 decision (default =
    use_default_heap, opt-in use_indexed_dary_heap<8>) stands.

Full numbers and discussion: agents/indexed_dary_heap_results.md § Phase 4.3.
… (Phase 5)

Pre-existing latent bug: prim() delegates to dijkstra_shortest_paths
with combine = (d_u, w_uv) -> w_uv, breaking Dijkstra's monotonicity
invariant. A finalized vertex v could be re-relaxed by a later-popped
neighbor, silently corrupting weight[v] (the MST output) on the
default heap and crashing the indexed heap (decrease() on a popped
vertex with position == npos).

Fix (Option 1 from indexed_dary_heap_plan.md § 5.2): track finalized
vertices in a set and wrap weight_fn so finalized targets report
+infinity, suppressing the relax. Storage is dispatched on
adj_list::index_vertex_range<G>: std::vector<bool> for dense /
contiguous-id graphs, std::unordered_set<vertex_id_t<G>> for sparse /
mapped-id graphs.
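The Option-1 fix can be sketched as a weight-function wrapper. The names here (edge, suppress_finalized) are illustrative; the dense path uses std::vector<bool> as above, and the sparse path would substitute std::unordered_set of vertex ids:

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct edge {
  std::size_t target;
  double      weight;
};

// Wrap the weight function so edges into already-finalized vertices report
// +infinity; the relax step then never improves the distance, so a
// finalized vertex is never re-relaxed (and the indexed heap never sees a
// decrease() on a popped vertex).
template <class WeightFn>
auto suppress_finalized(WeightFn wf, std::vector<bool> const& finalized) {
  return [wf, &finalized](edge const& e) {
    return finalized[e.target] ? std::numeric_limits<double>::infinity()
                               : wf(e);
  };
}
```

The wrapper captures the finalized set by reference, so marking a vertex finalized after the wrapper is built is immediately visible to subsequent weight calls.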

Also exposes the Heap template parameter on prim() so callers can
opt into use_indexed_dary_heap<D>{} (Phase 4.2 recommendation for
dense / scale-free workloads).

Option 2 (standalone Prim, ~5-10% faster on dense graphs by removing
Dijkstra's distance[] shadow and combine-call overhead) is documented
in the plan as a deferred future optimization.

New regression test 'prim - indexed d-ary heap parity' (MST = 18 on
an 8-vertex graph cross-checked with Kruskal) verifies all three heap
strategies (default, Idx4, Idx8) agree.

Also documents Open Questions 1, 2, 6 in the plan and adds a comment
explaining the compile-time-arity rationale on indexed_dary_heap.

Full ctest: 4848/4848 pass.
Documents a detailed investigation into the performance gap between
graph-v3 and BGL's CSR Dijkstra implementations, ruling out heap arity
and confirming the bottleneck is in the edge-value access path. Adds
results tables, an investigation plan, and a test confirming visitor
event parity for multi-source runs between heap variants. Guides future
profiling efforts away from the heap and towards the CSR access path.
Three changes that together unblock the Windows performance workflow described in indexed_dary_heap_plan.md (Phase 4.3b on Windows):

1. CMakePresets.json: windows-base now declares architecture=x64 (strategy=external). The preset previously inherited whatever the host shell offered, which on Visual Studio's default Developer PowerShell is x86. That gave 32-bit size_t and tripped a static_assert in test_dynamic_graph_integration.cpp (sizeof vertex_id_t == sizeof uint64_t). With architecture pinned, callers must launch from a vcvars64-initialised shell or CMake will hard-fail with a clear preset-architecture-mismatch message instead of silently building the wrong word size.

2. benchmark/algorithms/CMakeLists.txt: the BGL include directory now auto-discovers. Resolution order: -DBGL_INCLUDE_DIR -> env BGL_INCLUDE_DIR -> env BOOST_ROOT -> per-platform default list. Windows defaults start with D:/dev_graph/boost (the workspace location); Linux keeps its existing default. Adding a new environment is one line in DIJKSTRA_BENCH_BGL_DEFAULT_PATHS.

3. agents/indexed_dary_heap_baseline_msvc.md: MSVC release baseline of CSR Dijkstra benchmark (4 topologies x 4 heap variants x 3 sizes, median of 5 reps, single-core pinned, High priority). Same machine as the Linux baseline so MSVC-vs-GCC differences are toolchain-only. Headline finding: Path under MSVC has the indexed heap 2.7x faster than std::priority_queue at n=100K, opposite of the GCC ordering. Anchors all subsequent VTune profiling in Phase 4.3b.
Cross-references the MSVC baseline (agents/indexed_dary_heap_baseline_msvc.md) and clarifies that this is a separate, MSVC-specific issue in std::priority_queue codegen - not the BGL CSR gap this plan is investigating. The note also establishes the MSVC baseline as the comparison anchor for any Windows-side profiling work, so VTune samples are never cross-compared against the Linux/GCC numbers.
…ot inlined

Phase 4.3b first profile run on Windows. The MSVC build of benchmark_dijkstra (Grid_Idx4/100K, 30s sample) shows that sift_down_ (31.2%), three separate copies of std::less<double>::operator() (17.5% combined), container_value_fn::operator() (9.5%), and the relax/incidence/iterator infrastructure (~12%) appear as live, non-inlined symbols consuming ~80% of the workload.

This contradicts the GCC-verified Open Question 3 in indexed_dary_heap_plan.md, which had concluded the heap-comparator chain collapses to a single ucomisd at -O3 and -O2. That collapse holds on GCC but NOT on MSVC /O2. The original csr_edge_value_perf_plan.md diagnosis (gap is in the edge-value access path) is GCC-specific; on MSVC the heap itself dominates.

Captured user-mode (software) sampling only because hardware event-based sampling needs the SEP driver or admin privileges, neither available in the current session. The microarchitecture exploration run (Front-End / Bad-Speculation / Back-End-Memory / Retiring breakdown) is deferred and will be the next step once HW counters are available.

Added build/vtune/ to .gitignore so raw multi-MB collections do not leak into the repo.
- Annotated less_than_, place_, sift_up_, sift_down_ with GRAPH_DETAIL_FORCE_INLINE
- VTune results 004 (inner-only) and 005 (all sift fns) both show sift_down_
  unchanged at ~31-33% CPU: MSVC ignores __forceinline at /O2 /Ob2 for
  functions of this complexity at a large template call site
- Updated macro comment to document the negative result and intent
- Next step: investigate /Ob3 in the release CMake preset (see Phase 4.3d
  next steps in agents/indexed_dary_heap_results.md)
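A plausible portable definition of the macro named above (the project's actual definition may differ) looks like this; note the negative result recorded in this commit: MSVC may still ignore __forceinline at /O2 /Ob2 for complex functions at large template call sites, so the annotation only reliably pays off on small one-liners.

```cpp
#include <cassert>

// Assumed cross-compiler definition; illustrative, not the project's exact macro.
#if defined(_MSC_VER)
#  define GRAPH_DETAIL_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define GRAPH_DETAIL_FORCE_INLINE inline __attribute__((always_inline))
#else
#  define GRAPH_DETAIL_FORCE_INLINE inline
#endif

// Small one-liner of the kind that kept the annotation (cf. less_than_,
// place_ in the commit); the function itself is a hypothetical example.
GRAPH_DETAIL_FORCE_INLINE bool less_than(double a, double b) { return a < b; }
```

On GCC/Clang, always_inline is close to a hard requirement (failure to inline is a diagnostic); on MSVC, __forceinline remains a request the optimizer can decline, which is exactly what the VTune results 004/005 above observed.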
- Restore accidentally-dropped 'namespace graph::detail {' opening line
- Remove GRAPH_DETAIL_FORCE_INLINE from sift_up_ and sift_down_: Phase 4.3d
  showed MSVC ignores __forceinline on functions of this complexity; the
  annotation had no measurable effect and caused C2059 parse errors at /Ob3.
  less_than_ and place_ retain their annotations (small one-liners).
- All 4847 tests pass on windows-msvc-relwithdebinfo x64
- CMakePresets.json: windows-msvc-release now sets /O2 /Ob3 /DNDEBUG
- indexed_dary_heap.hpp: re-annotate sift_down_ with GRAPH_DETAIL_FORCE_INLINE
- VTune result ob3_001: 98.8% CPU in single inlined run-lambda; sift_down_,
  std::less, container_value_fn all gone from the profile (same shape as GCC)
- Wall-clock: Path_Idx4/100K -14.9% win; Grid needs full-suite comparison
- 4847/4847 tests pass on windows-msvc-release /Ob3
… 4.3e)

Full-suite comparison /Ob2 vs /Ob3 + FORCE_INLINE(sift_down_) at 100K:
  Path/Idx4:  -7.6% win  (inlining bottleneck resolved)
  Grid/Idx4:  +8.2% regression (icache pressure from expanded lambda)
  BA/Idx4:    +6.3% regression (same cause)
  ER/Idx4:    -2.6% (within noise)

Net: FORCE_INLINE on sift_down_ is not universally beneficial. Revert it.
/Ob3 flag is retained in CMakePresets (windows-msvc-release) — provides the
budget for less_than_/place_ annotations and is net-neutral on the suite.
Full data in agents/indexed_dary_heap_baseline_msvc.md Phase 4.3e section.
Reverts conflated changes from Phase 4.3e:
  - windows-msvc-release: back to MSVC defaults (/O2 /Ob2 /DNDEBUG).
    Phase 4.3e full-suite data showed /Ob3 regresses Grid +8.2% and
    BA +6.3%; net loss for production codegen.
  - windows-msvc-relwithdebinfo: back to MSVC defaults (/O2 /Ob1 /Zi).
    Tests should run against MSVC's intentional 'debuggable optimized'
    config, not a flag set chosen for profiling visibility.
  - Adds windows-msvc-profile: /O2 /Ob3 /Zi /DNDEBUG + /DEBUG linker.
    Dedicated investigation preset for VTune / disassembly. Inherits
    from windows-msvc-release; only diverges on Ob3 + debug info.

Also: BUILD_BENCHMARKS=ON moved into windows-msvc-release so the
benchmark binary is buildable from the standard release preset.
Note: this exposes a pre-existing teardown SEGFAULT in the
graph_benchmarks ctest harness (BFS/TopoSort exe); does not affect
benchmark_dijkstra or any correctness test (4847/4848 pass).

.gitignore: add vtune/ alongside the existing build/vtune/.
Original Phase 4.3a baseline (Linux GCC, 2025): graph-v3 was +7% to +37%
slower than BGL CSR. Re-running on Windows MSVC with the new
windows-msvc-profile preset (/O2 /Ob3 /Zi) shows graph-v3 is now 34-64%
*faster* than BGL on every topology at n=100K.

The gap has fully inverted. Two plausible causes documented in the plan:
  - Toolchain effect: GCC inlines BGL's property-map chain aggressively;
    MSVC /Ob3 inlines graph-v3's views::incidence + edge_value chain
    aggressively (settled in Phase 4.3e).
  - Code drift since 4.3a: 5085c60 Edge desc, 7645a19 traversal_common
    simplification, 1c871a8 basic_incidence, aa95fe0 target_id in
    incidence_view all touch the suspect access path.

Linux GCC re-run is needed to know if the gap is gone on the original
toolchain too. That's blocked on this Windows session. Proceeding with
Phase 2 disassembly on MSVC anyway (cheap with the profile preset's PDB)
to document the access-path codegen.

Also: enable DIJKSTRA_BENCH_BGL=ON in windows-msvc-profile preset so
BGL benchmarks build by default in the investigation preset.
…utomation

Replaces ad-hoc PowerShell one-liners that have been the bottleneck for
Phases 4.3-e and Thread B. All scripts are stdlib-only (Python 3.10+).

  bench_run.py     orchestrate Google Benchmark with core-pinning and
                   priority High; emit median/CV rows as JSON.
  bench_compare.py diff two bench_run.py JSONs as a markdown table with
                   regression/win flags at a configurable threshold.
  vtune_top.py     parse a VTune -format csv hotspots report; emit a
                   normalized top-N table with template clutter stripped.
  disasm_func.py   find a function by demangled-name substring; dump only
                   that function's bytes via dumpbin (avoids the 14k+
                   irrelevant entries of /disasm on the full exe).

Smoke-tested on the windows-msvc-profile build:
  - bench_run.py captured 4 aggregate rows from a 3-rep run cleanly.
  - vtune_top.py parsed the new hot_grid_idx4_profile_001 collection
    and emitted a 12-row markdown table with normalized symbols.
  - disasm_func.py found 3 sift_down_ instantiations in the exe and
    dumped Idx4's range to artifacts/perf/sift_down_idx4.asm.

.gitignore: add artifacts/ (bench JSON, hotspot CSV, disassembly captures).
VTune anchor on windows-msvc-profile (with /Zi):
  heap::sift_down_                  34.9 %
  less::operator() (3 copies)       16.5 % combined
  cfn::operator() (2 copies)         9.0 % combined
  incidence_view iter operator*      5.9 %
  vector<double>::operator[]         4.8 %

  Note: this differs from Phase 4.3e's 98.8% in one frame because /Zi
  preserves function boundaries for symbol attribution; codegen is
  identical between /O2 /Ob3 and /O2 /Ob3 /Zi.

Idx4 sift_down_ inner loop (artifacts/perf/sift_down_idx4.asm):
  mov   eax, [r11 + r8*4]      ; child key
  mov   ecx, [r11 + r9*4]      ; other child key
  movsd xmm0, [r10 + rax*8]    ; child distance
  comisd xmm0, [r10 + rcx*8]   ; compare
  cmova r8, r9                 ; conditional best update

Five instructions per comparison, no calls, no template scaffolding,
no pointer subtractions. Comparator chain (std::less, container_value_fn,
distance_fn) fully collapsed by /Ob3. 4-children-per-iteration unrolled
outer loop, identical 1-child remainder loop. This is the textbook shape
Open Q3 hypothesised; on MSVC it requires /Ob3 to materialise.

Phases 3-5 deferred pending Linux GCC re-run of Phase 4.3a (the only
place the original BGL gap lived).
Diagnosed the iteration-time cost of the previous Phase-2 work and rebuilt
the perf tooling around the bottlenecks:

  * dumpbin /disasm:nobytes on the 1.4 MB benchmark exe takes ~30s every
    invocation. New scripts/perf/sym_index.py caches the parsed symbol
    table to <exe>.symidx.json, invalidated by mtime+size. Subsequent
    lookups drop from ~30s to ~0.5s (60x).

  * cmd /c interprets < and > as redirection even inside double quotes,
    so passing 'use_indexed_dary_heap<4>' on the command line silently
    fails. New tools accept --regex (which can use '.' wildcards instead
    of literal angle brackets) and skip @ilt thunks by default.

  * Bulk capture beats per-symbol invocations. New scripts/perf/capture_asm.py
    consumes a manifest (basename, length, regex, substrings) and
    one-shot dumps every entry against the cached index. With :N suffix
    on the basename you can disambiguate when the same regex matches
    multiple symbols.

  * WSL exposes no PMU, so the Linux session cannot run 'perf stat
    -e cache-misses'. To compensate, this commit pre-collects everything
    that DOES need the PMU on the Windows side and lands it in
    artifacts/perf/msvc_profile/ (gitignored) plus a tracked inventory at
    agents/perf_msvc_profile_inventory.md. The Linux session compares
    against those reference artifacts using software-only events.

New scripts:
  scripts/perf/sym_index.py        cached dumpbin index
  scripts/perf/find_func.py        symbol search by substring + regex
  scripts/perf/capture_asm.py      bulk dumpbin manifest driver
  scripts/perf/objdump_capture.py  Linux/GCC counterpart (nm + objdump)
  scripts/perf/linux_gcc_capture.sh  one-shot Linux capture runbook

Updated:
  scripts/perf/disasm_func.py      uses sym_index, adds --regex,
                                   --rebuild-cache, --no-truncate
  scripts/perf/README.md           full inventory + cmd quoting note
  .gitignore                       __pycache__, *.pyc

New documentation:
  agents/thread_b_linux_runbook.md      WSL-aware Linux runbook
  agents/perf_capture_manifest.txt      MSVC capture targets (12 syms)
  agents/perf_capture_manifest_linux.txt   GCC counterpart
  agents/perf_msvc_profile_inventory.md inventory of pre-collected refs

Pre-collected MSVC reference (artifacts/perf/msvc_profile/, gitignored):
  - 12 disassembly captures (sift_down_, sift_up_, dijkstra body, BGL
    counterparts, container_value_fn) totaling ~140 KB
  - VTune hotspots.csv and callstacks.csv from the Idx4/Grid/100K run
  - 96-row wallclock_baseline.json across 24 benchmarks x 4 aggregates

Headline finding from line-count proxies (MSVC, /O2 /Ob3 /Zi):
  graph-v3 dijkstra body         206 lines
  BGL      dijkstra body         505 lines  (~2.5x larger)
  graph-v3 sift_down_(Idx4)      184 lines
  BGL      preserve_heap_property_down  299 lines  (~1.6x larger)
Consistent with Phase 1.1 wall-clock data (graph-v3 -34% to -64% vs BGL).
… BGL)

Reruns the BGL comparison and disassembly capture on linux-gcc-release
at the indexed-dary-heap HEAD, per agents/thread_b_linux_runbook.md.

Findings (decision-tree branch: 'still +30%+ slower on Grid' fires):
- graph-v3 CSR Idx4 vs BGL CSR (Linux GCC, median, 5 reps, CV <= 5%):
    ER_Sparse 100K  +14.7 %   slower
    Grid      100K  +36.2 %   slower  <- 2025 4.3a worst case, intact
    BA        100K   +6.0 %   slower
    Path      100K  +15.2 %   slower
- The post-4.3a commits (5085c60, 7645a19, 1c871a8, aa95fe0) closed
  the gap on MSVC (graph-v3 -34 % to -64 %) but not on GCC.

Phase 2 (objdump) localises the size delta:
- MSVC: graph-v3 inlined body 499 lines vs BGL 1,008 (2.0x).
- GCC:  graph-v3 inlined body 387 lines vs BGL   412 (1.06x).
  GCC compresses BGL's get(weight, edge) chain ~59 %; graph-v3's
  per-edge chain only ~22 %.

Phases 3-5 of csr_edge_value_perf_plan.md are un-deferred.

Manifest fix: agents/perf_capture_manifest_linux.txt now matches the
inlined GCC dijkstra closure (use_indexed_dary_heap<Nul> + operator()
+ graph-type substring) and BGL's run_bgl_dijkstra wrapper rather
than the standalone sift_down_/preserve_heap_property symbols, which
have no body under -O3.

Files (gitignored, regenerable via scripts/perf/linux_gcc_capture.sh):
  artifacts/perf/linux_gcc/wallclock_baseline.json
  artifacts/perf/linux_gcc/diff_msvc_vs_gcc.md
  artifacts/perf/linux_gcc/dijkstra_{csr_idx2,csr_idx4,csr_idx8,vov_idx4}.asm
  artifacts/perf/linux_gcc/dijkstra_bgl_{csr,adj}.asm
  artifacts/perf/linux_gcc/perfstat_*.{stdout,stderr}
Consolidated reference document for the Heap template parameter
evaluation across Phase 4.1–4.3e. Covers:
- Linux/GCC and Windows/MSVC benchmark results (CSR + VoV)
- Topology-by-topology heap comparison (ER, BA, Grid, Path)
- vs BGL compressed_sparse_row_graph and adjacency_list
- Default decision: use_default_heap, Arity=4 for indexed variant
- MSVC inlining investigation (/Ob3 + GRAPH_DETAIL_FORCE_INLINE)
- Open follow-ups in CSR edge-value access path
Move dary_heap-related agent files into agents/dary_heap/:
  csr_edge_value_perf_plan.md
  indexed_dary_heap_baseline.md
  indexed_dary_heap_baseline_msvc.md
  indexed_dary_heap_plan.md
  indexed_dary_heap_results.md
  perf_capture_manifest.txt
  perf_capture_manifest_linux.txt
  perf_linux_gcc_inventory.md
  perf_msvc_profile_inventory.md
  thread_b_linux_runbook.md

Move unrelated agent files into agents/archive/:
  doc_revision_plan.md
  index_vertex_descriptor_plan.md
  map_container_plan.md
  map_container_strategy.md
@pratzl pratzl merged commit ce2f042 into main Apr 28, 2026
11 checks passed