
Indexed dary heap #26

Merged
pratzl merged 24 commits into main from indexed-dary-heap
Apr 28, 2026
Conversation


@pratzl pratzl commented Apr 28, 2026

No description provided.

pratzl added 24 commits April 25, 2026 23:22
…Phase 0)

- dijkstra_fixtures.hpp: Erdős–Rényi, 2D grid, Barabási–Albert, path
  graph generators for CSR and VoV containers
- benchmark_dijkstra.cpp: Google Benchmark suite covering 4 topologies
  × 2 containers × 3 scales (1K/10K/100K)
- benchmark/data/README.md: fetch instructions for SNAP real-world graphs
- .gitignore: exclude benchmark/data/ graph files
- indexed_dary_heap_baseline.md: captured priority_queue baseline
  (CSR ER Sparse 100K = 29.1 ms; target ≤ 22 ms after Phase 4)
- indexed_dary_heap_plan.md: initial plan
- indexed_dary_heap.hpp: templated O(log_d V) heap with push, pop, top,
  decrease(k), contains; position map keeps one entry per vertex (O(V))
- heap_position_map.hpp: two adapters —
    vector_position_map (integral keys, O(1) lookup via vector<size_t>)
    assoc_position_map (hashable keys, unordered_map<Key, size_t>)
- test_indexed_dary_heap.cpp: unit tests for arity 2/4/8, both position
  map adapters, decrease-key, custom comparator, heap ordering invariant
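The decrease-key mechanism described above can be sketched as follows. This is a minimal illustration, not the library's actual indexed_dary_heap interface; the names (toy_indexed_heap, place, sift_up, sift_down) are invented for the sketch. The pos_ vector plays the role of the commit's vector_position_map: one O(1) slot per key, npos when the key is absent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

template <std::size_t D = 4>
class toy_indexed_heap {
  static constexpr std::size_t npos = static_cast<std::size_t>(-1);
  std::vector<std::size_t> keys_;   // heap array of keys
  std::vector<double>      prio_;   // priority per key (indexed by key)
  std::vector<std::size_t> pos_;    // key -> index in keys_, or npos

  void place(std::size_t i, std::size_t key) {
    keys_[i] = key;
    pos_[key] = i;
  }
  void sift_up(std::size_t i) {
    while (i > 0) {
      std::size_t parent = (i - 1) / D;
      if (prio_[keys_[parent]] <= prio_[keys_[i]]) break;
      std::size_t k = keys_[i];
      place(i, keys_[parent]);
      place(parent, k);
      i = parent;
    }
  }
  void sift_down(std::size_t i) {
    for (;;) {
      std::size_t best  = i;
      std::size_t first = D * i + 1;
      for (std::size_t c = first; c < keys_.size() && c < first + D; ++c)
        if (prio_[keys_[c]] < prio_[keys_[best]]) best = c;
      if (best == i) break;
      std::size_t k = keys_[i];
      place(i, keys_[best]);
      place(best, k);
      i = best;
    }
  }

public:
  explicit toy_indexed_heap(std::size_t num_keys)
      : prio_(num_keys), pos_(num_keys, npos) {}

  bool contains(std::size_t key) const { return pos_[key] != npos; }

  void push(std::size_t key, double p) {
    prio_[key] = p;
    keys_.push_back(key);
    pos_[key] = keys_.size() - 1;
    sift_up(keys_.size() - 1);
  }
  // decrease-key in O(log_d n): the position map locates the entry,
  // avoiding the re-push of a stale duplicate
  void decrease(std::size_t key, double p) {
    prio_[key] = p;
    sift_up(pos_[key]);
  }
  std::size_t top() const { return keys_.front(); }
  void pop() {
    std::size_t last = keys_.back();
    pos_[keys_.front()] = npos;
    keys_.pop_back();
    if (!keys_.empty()) { place(0, last); sift_down(0); }
  }
  bool empty() const { return keys_.empty(); }
};
```

The one-entry-per-vertex position map is what keeps storage at O(V) and makes decrease(k) constant-time to locate.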
Phase 2 — heap-selector tag + dense dispatch:
- use_default_heap / use_indexed_dary_heap<Arity=4> tags added
- if constexpr dispatch on index_vertex_range<G>: dense path uses
  vector_position_map; removes O(E) lazy heap, replaces re-push with
  decrease(k); visitor semantics verified (examine/finish fire once)

Phase 3 — mapped-container support:
- else branch selects assoc_position_map<key_type> for graphs satisfying
  hashable_vertex_id; reserve(num_vertices(g)) preallocates the map
- SPARSE_VERTEX_TYPES (mov/mod/mol/uov/uod/uol) tests: sparse CLRS
  distances match default heap; visitor parity (examine==finish==5)
- Non-integral string-keyed graph test: CLRS topology with VId=std::string,
  textbook distances and per-vertex parity with default heap
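The dense/sparse split in Phases 2–3 can be sketched with two stand-in adapters. These are illustrative only; the real vector_position_map / assoc_position_map live in heap_position_map.hpp and their interfaces may differ, and the trait here branches on integral vertex-id types as a stand-in for the library's index_vertex_range<G> dispatch:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <vector>

inline constexpr std::size_t npos = static_cast<std::size_t>(-1);

// Dense path: integral, contiguous ids -> O(1) lookup via a vector slot.
struct vector_position_map {
  std::vector<std::size_t> pos;
  explicit vector_position_map(std::size_t n) : pos(n, npos) {}
  std::size_t& operator[](std::size_t k) { return pos[k]; }
};

// Sparse / mapped-id path: hashable keys -> unordered_map<Key, size_t>.
template <class Key>
struct assoc_position_map {
  std::unordered_map<Key, std::size_t> pos;
  // mirrors the commit's reserve(num_vertices(g)) preallocation
  explicit assoc_position_map(std::size_t n) { pos.reserve(n); }
  std::size_t& operator[](Key const& k) {
    auto [it, inserted] = pos.try_emplace(k, npos);
    return it->second;
  }
};

// Stand-in for the if constexpr dispatch on index_vertex_range<G>.
template <class VId>
auto make_position_map(std::size_t n) {
  if constexpr (std::is_integral_v<VId>)
    return vector_position_map(n);
  else
    return assoc_position_map<VId>(n);
}
```

With this shape, a string-keyed graph (as in the VId=std::string test above) transparently gets the associative adapter while CSR-style graphs keep the vector fast path.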

21 test cases / 158 assertions; 846 total / 4206 assertions, 0 failures
… 4.1)

Extend benchmark_dijkstra.cpp to cover all four heap variants (Default,
Idx2, Idx4, Idx8) across all four topologies for CSR; add Idx4 for VoV.
Run 3x; results in agents/indexed_dary_heap_results.md.

Key findings at CSR 100K:
  ER Sparse: Idx8 −25% (20.2 ms vs 27.0 ms avg; −31% vs Phase 0 baseline)
  BA:        Idx8 −17% (19.0 ms vs 22.9 ms)
  Grid:      Default wins; Idx4 +35%, Idx8 +39% (position-map overhead)
  Path:      Default wins; all indexed variants +22%

Recommendation: keep default_heap as the default; document
use_indexed_dary_heap<8> for high-E/V random/BA workloads on CSR.
…d (Phase 4.2)

Phase 4.1 results were mixed across topologies:

  CSR 100K (3-run avg, vs use_default_heap):
    ER Sparse  E/V≈8: Idx8 -25%   (20.2 ms vs 27.0 ms)
    BA         E/V≈8: Idx8 -17%   (19.0 ms vs 22.9 ms)
    Grid       E/V≈4: Idx8 +39%   (8.4  ms vs 6.0  ms)
    Path       E/V=1: Idx8 +22%   (0.33 ms vs 0.27 ms)

The grid regression is too large to justify switching the default. A
heuristic E/V dispatch was considered but rejected as premature — users
with known graph shapes can opt in explicitly.

Decision: keep use_default_heap as the default; document
use_indexed_dary_heap<8> as the recommended opt-in for high-E/V random
or scale-free workloads on compressed_graph.
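The opt-in mechanism can be sketched with tag dispatch. The tag names follow the commit; the entry point and its dispatch below are illustrative, not the library's real dijkstra signature:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct use_default_heap {};
template <std::size_t Arity = 4>
struct use_indexed_dary_heap {};

// Hypothetical algorithm entry point: defaults to the std::priority_queue
// path; callers with high-E/V random or scale-free graphs opt in explicitly.
template <class HeapTag = use_default_heap>
std::string dijkstra_heap_choice(HeapTag = {}) {
  return "priority_queue";
}
template <std::size_t Arity>
std::string dijkstra_heap_choice(use_indexed_dary_heap<Arity>) {
  return "indexed d-ary, arity " + std::to_string(Arity);
}
```

Because the tag carries the arity as a template parameter, the heap layout is fixed at compile time and the default path pays nothing for the option.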

  - Update use_default_heap / use_indexed_dary_heap doc comments with
    workload guidance and the measured numbers.
  - Add CHANGELOG entry under [Unreleased] describing the new selector
    tags, the position-map adapters, and the default decision.
  - Mark Phase 4.1 / 4.2 complete in agents/indexed_dary_heap_plan.md
    with the decision recorded.

Verification: full suite still green (4847/4847 tests pass).
…Heap)

Reorder the trailing template and function parameters on all four
overloads of dijkstra_shortest_paths / dijkstra_shortest_distances so
that Alloc / alloc is the last parameter, following the new Heap /
heap_tag parameter rather than preceding it.

Rationale: Alloc has been a stable parameter since 0.5.0 while Heap is
new in this release. Putting the more-stable, less-frequently-used
parameter last gives callers who only want to override the heap tag a
single positional argument to add, instead of having to also restate
the allocator.

  Old order: ..., Combine, Alloc, Heap
  New order: ..., Combine, Heap, Alloc
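The ergonomic payoff of the new order can be shown with a stand-in signature (not the library's real overloads): with Heap ahead of Alloc, a caller overriding only the heap adds one positional argument and never restates the allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Illustrative stand-ins only.
struct use_default_heap {};
template <std::size_t A = 4> struct use_indexed_dary_heap {};

template <class Heap> constexpr std::size_t arity_of(Heap) { return 0; }
template <std::size_t A>
constexpr std::size_t arity_of(use_indexed_dary_heap<A>) { return A; }

// New trailing order: ..., Combine, Heap, Alloc.
template <class Combine, class Heap = use_default_heap,
          class Alloc = std::allocator<double>>
std::size_t run_dijkstra(Combine combine, Heap heap = {}, Alloc = {}) {
  // toy body: apply combine, report the selected heap's arity
  return combine(40, 2) + arity_of(heap);
}
```

Under the old order (..., Combine, Alloc, Heap) the same call would have had to spell out std::allocator<double>{} just to reach the heap slot.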

Updated:
  - all 4 dijkstra overloads (template params, function params, internal
    forwarding from single-source -> multi-source overloads)
  - prim_minimum_spanning_tree in mst.hpp (was passing alloc into the
    Heap slot positionally; now passes use_default_heap{}, alloc)
  - benchmark_dijkstra.cpp DEFINE_DIJKSTRA_BM macro
  - all call sites in test_dijkstra_indexed_heap.cpp

Verification: full suite green (4847/4847 tests pass).
Add optional Boost.Graph (BGL) comparison benchmarks alongside the
existing graph-v3 Dijkstra benchmarks. Both libraries operate on
topologically identical graphs built from the same edge_list, so the
numbers can be compared directly.

  - benchmark/algorithms/bgl_dijkstra_fixtures.hpp
      * make_bgl_csr  -> boost::compressed_sparse_row_graph
      * make_bgl_adj  -> boost::adjacency_list<vecS, vecS, directedS, ...>
      * run_bgl_dijkstra: dijkstra_shortest_paths_no_color_map_no_init
        for fairness vs graph-v3's no-init semantics.

  - benchmark/algorithms/benchmark_dijkstra.cpp
      * BENCH_BGL-gated section adds 8 BGL benchmarks
        (CSR/Adj x ER/Grid/BA/Path).
      * Startup parity check (check_bgl_distance_parity) asserts
        BGL and graph-v3 produce identical distance vectors on ER, BA,
        and Path at n=1024 from source 0; abort otherwise.

  - benchmark/algorithms/CMakeLists.txt
      * New options DIJKSTRA_BENCH_BGL + BGL_INCLUDE_DIR. Fatal error if
        ON without a directory containing boost/graph headers.

Results at n=100K, 3-run average (CSR):

  Topology   | gv3 def  | gv3 Idx8 | BGL CSR  | BGL adj
  -----------|----------|----------|----------|--------
  ER Sparse  | 26.2 ms  | 22.9 ms  | 19.9 ms  | 34.2 ms
  BA         | 26.9 ms  | 21.7 ms  | 19.6 ms  | 30.9 ms
  Grid       |  6.2 ms  |  8.9 ms  |  6.1 ms  |  9.9 ms
  Path       | 0.27 ms  | 0.33 ms  | 0.28 ms  |  0.52 ms

Conclusions:
  - graph-v3 default beats BGL adjacency_list on every topology by
    23-48% (no missing-feature gap on the closer-equivalent container).
  - BGL CSR is 10-15% faster than graph-v3 Idx8 on dense graphs;
    remaining gap is plausibly CSR layout, not heap.
  - On low-E/V graphs (grid, path) graph-v3 ties or beats BGL CSR.
  - No further heap changes recommended; Phase 4.2 decision (default =
    use_default_heap, opt-in use_indexed_dary_heap<8>) stands.

Full numbers and discussion: agents/indexed_dary_heap_results.md § Phase 4.3.
… (Phase 5)

Pre-existing latent bug: prim() delegates to dijkstra_shortest_paths
with combine = (d_u, w_uv) -> w_uv, breaking Dijkstra's monotonicity
invariant. A finalized vertex v could be re-relaxed by a later-popped
neighbor, silently corrupting weight[v] (the MST output) on the
default heap and crashing the indexed heap (decrease() on a popped
vertex with position == npos).

Fix (Option 1 from indexed_dary_heap_plan.md § 5.2): track finalized
vertices in a set and wrap weight_fn so finalized targets report
+infinity, suppressing the relax. Storage is dispatched on
adj_list::index_vertex_range<G>: std::vector<bool> for dense /
contiguous-id graphs, std::unordered_set<vertex_id_t<G>> for sparse /
mapped-id graphs.
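The Option-1 fix can be sketched as a weight-function wrapper. The names here (edge, suppress_finalized) are illustrative; the dense path uses std::vector<bool> as above, and the sparse path would substitute std::unordered_set of vertex ids:

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct edge {
  std::size_t target;
  double      weight;
};

// Wrap the weight function so edges into already-finalized vertices report
// +infinity; the relax step then never improves the distance, so a
// finalized vertex is never re-relaxed (and the indexed heap never sees a
// decrease() on a popped vertex).
template <class WeightFn>
auto suppress_finalized(WeightFn wf, std::vector<bool> const& finalized) {
  return [wf, &finalized](edge const& e) {
    return finalized[e.target] ? std::numeric_limits<double>::infinity()
                               : wf(e);
  };
}
```

The wrapper captures the finalized set by reference, so marking a vertex finalized after the wrapper is built is immediately visible to subsequent weight calls.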

Also exposes the Heap template parameter on prim() so callers can
opt into use_indexed_dary_heap<D>{} (Phase 4.2 recommendation for
dense / scale-free workloads).

Option 2 (standalone Prim, ~5-10% faster on dense graphs by removing
Dijkstra's distance[] shadow and combine-call overhead) is documented
in the plan as a deferred future optimization.

New regression test 'prim - indexed d-ary heap parity' (MST = 18 on
an 8-vertex graph cross-checked with Kruskal) verifies all three heap
strategies (default, Idx4, Idx8) agree.

Also documents Open Questions 1, 2, 6 in the plan and adds a comment
explaining the compile-time-arity rationale on indexed_dary_heap.

Full ctest: 4848/4848 pass.
Documents a detailed investigation into the performance gap between
graph-v3 and BGL's CSR Dijkstra implementations, ruling out heap arity
and confirming the bottleneck is in the edge-value access path. Adds
results tables, an investigation plan, and a test confirming visitor
event parity for multi-source runs between heap variants. Guides future
profiling efforts away from the heap and towards the CSR access path.
Three changes that together unblock the Windows performance workflow described in indexed_dary_heap_plan.md (Phase 4.3b on Windows):

1. CMakePresets.json: windows-base now declares architecture=x64 (strategy=external). The preset previously inherited whatever the host shell offered, which on Visual Studio's default Developer PowerShell is x86. That gave 32-bit size_t and tripped a static_assert in test_dynamic_graph_integration.cpp (sizeof vertex_id_t == sizeof uint64_t). With architecture pinned, callers must launch from a vcvars64-initialised shell or CMake will hard-fail with a clear preset-architecture-mismatch message instead of silently building the wrong word size.

2. benchmark/algorithms/CMakeLists.txt: the BGL include directory now auto-discovers. Resolution order: -DBGL_INCLUDE_DIR -> env BGL_INCLUDE_DIR -> env BOOST_ROOT -> per-platform default list. Windows defaults start with D:/dev_graph/boost (the workspace location); Linux keeps its existing default. Adding a new environment is one line in DIJKSTRA_BENCH_BGL_DEFAULT_PATHS.

3. agents/indexed_dary_heap_baseline_msvc.md: MSVC release baseline of CSR Dijkstra benchmark (4 topologies x 4 heap variants x 3 sizes, median of 5 reps, single-core pinned, High priority). Same machine as the Linux baseline so MSVC-vs-GCC differences are toolchain-only. Headline finding: Path under MSVC has the indexed heap 2.7x faster than std::priority_queue at n=100K, opposite of the GCC ordering. Anchors all subsequent VTune profiling in Phase 4.3b.
Cross-references the MSVC baseline (agents/indexed_dary_heap_baseline_msvc.md) and clarifies that this is a separate, MSVC-specific issue in std::priority_queue codegen - not the BGL CSR gap this plan is investigating. The note also establishes the MSVC baseline as the comparison anchor for any Windows-side profiling work, so VTune samples are never cross-compared against the Linux/GCC numbers.
…ot inlined

Phase 4.3b first profile run on Windows. The MSVC build of benchmark_dijkstra (Grid_Idx4/100K, 30s sample) shows that sift_down_ (31.2%), three separate copies of std::less<double>::operator() (17.5% combined), container_value_fn::operator() (9.5%), and the relax/incidence/iterator infrastructure (~12%) appear as live, non-inlined symbols consuming ~80% of the workload.

This contradicts the GCC-verified Open Question 3 in indexed_dary_heap_plan.md, which had concluded the heap-comparator chain collapses to a single ucomisd at -O3 and -O2. That collapse holds on GCC but NOT on MSVC /O2. The original csr_edge_value_perf_plan.md diagnosis (gap is in the edge-value access path) is GCC-specific; on MSVC the heap itself dominates.

Captured user-mode (software) sampling only because hardware event-based sampling needs the SEP driver or admin privileges, neither available in the current session. The microarchitecture exploration run (Front-End / Bad-Speculation / Back-End-Memory / Retiring breakdown) is deferred and will be the next step once HW counters are available.

Added build/vtune/ to .gitignore so raw multi-MB collections do not leak into the repo.
- Annotated less_than_, place_, sift_up_, sift_down_ with GRAPH_DETAIL_FORCE_INLINE
- VTune results 004 (inner-only) and 005 (all sift fns) both show sift_down_
  unchanged at ~31-33% CPU: MSVC ignores __forceinline at /O2 /Ob2 for
  functions of this complexity at a large template call site
- Updated macro comment to document the negative result and intent
- Next step: investigate /Ob3 in the release CMake preset (see Phase 4.3d
  next steps in agents/indexed_dary_heap_results.md)
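A plausible portable definition of the macro named above (the project's actual definition may differ) looks like this; note the negative result recorded in this commit: MSVC may still ignore __forceinline at /O2 /Ob2 for complex functions at large template call sites, so the annotation only reliably pays off on small one-liners.

```cpp
#include <cassert>

// Assumed cross-compiler definition; illustrative, not the project's exact macro.
#if defined(_MSC_VER)
#  define GRAPH_DETAIL_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define GRAPH_DETAIL_FORCE_INLINE inline __attribute__((always_inline))
#else
#  define GRAPH_DETAIL_FORCE_INLINE inline
#endif

// Small one-liner of the kind that kept the annotation (cf. less_than_,
// place_ in the commit); the function itself is a hypothetical example.
GRAPH_DETAIL_FORCE_INLINE bool less_than(double a, double b) { return a < b; }
```

On GCC/Clang, always_inline is close to a hard requirement (failure to inline is a diagnostic); on MSVC, __forceinline remains a request the optimizer can decline, which is exactly what the VTune results 004/005 above observed.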
- Restore accidentally-dropped 'namespace graph::detail {' opening line
- Remove GRAPH_DETAIL_FORCE_INLINE from sift_up_ and sift_down_: Phase 4.3d
  showed MSVC ignores __forceinline on functions of this complexity; the
  annotation had no measurable effect and caused C2059 parse errors at /Ob3.
  less_than_ and place_ retain their annotations (small one-liners).
- All 4847 tests pass on windows-msvc-relwithdebinfo x64
- CMakePresets.json: windows-msvc-release now sets /O2 /Ob3 /DNDEBUG
- indexed_dary_heap.hpp: re-annotate sift_down_ with GRAPH_DETAIL_FORCE_INLINE
- VTune result ob3_001: 98.8% CPU in single inlined run-lambda; sift_down_,
  std::less, container_value_fn all gone from the profile (same shape as GCC)
- Wall-clock: Path_Idx4/100K -14.9% win; Grid needs full-suite comparison
- 4847/4847 tests pass on windows-msvc-release /Ob3
… 4.3e)

Full-suite comparison /Ob2 vs /Ob3 + FORCE_INLINE(sift_down_) at 100K:
  Path/Idx4:  -7.6% win  (inlining bottleneck resolved)
  Grid/Idx4:  +8.2% regression (icache pressure from expanded lambda)
  BA/Idx4:    +6.3% regression (same cause)
  ER/Idx4:    -2.6% (within noise)

Net: FORCE_INLINE on sift_down_ is not universally beneficial. Revert it.
/Ob3 flag is retained in CMakePresets (windows-msvc-release) — provides the
budget for less_than_/place_ annotations and is net-neutral on the suite.
Full data in agents/indexed_dary_heap_baseline_msvc.md Phase 4.3e section.
Reverts conflated changes from Phase 4.3e:
  - windows-msvc-release: back to MSVC defaults (/O2 /Ob2 /DNDEBUG).
    Phase 4.3e full-suite data showed /Ob3 regresses Grid +8.2% and
    BA +6.3%; net loss for production codegen.
  - windows-msvc-relwithdebinfo: back to MSVC defaults (/O2 /Ob1 /Zi).
    Tests should run against MSVC's intentional 'debuggable optimized'
    config, not a flag set chosen for profiling visibility.
  - Adds windows-msvc-profile: /O2 /Ob3 /Zi /DNDEBUG + /DEBUG linker.
    Dedicated investigation preset for VTune / disassembly. Inherits
    from windows-msvc-release; only diverges on Ob3 + debug info.

Also: BUILD_BENCHMARKS=ON moved into windows-msvc-release so the
benchmark binary is buildable from the standard release preset.
Note: this exposes a pre-existing teardown SEGFAULT in the
graph_benchmarks ctest harness (BFS/TopoSort exe); does not affect
benchmark_dijkstra or any correctness test (4847/4848 pass).

.gitignore: add vtune/ alongside the existing build/vtune/.
Original Phase 4.3a baseline (Linux GCC, 2025): graph-v3 was +7% to +37%
slower than BGL CSR. Re-running on Windows MSVC with the new
windows-msvc-profile preset (/O2 /Ob3 /Zi) shows graph-v3 is now 34-64%
*faster* than BGL on every topology at n=100K.

The gap has fully inverted. Two plausible causes documented in the plan:
  - Toolchain effect: GCC inlines BGL's property-map chain aggressively;
    MSVC /Ob3 inlines graph-v3's views::incidence + edge_value chain
    aggressively (settled in Phase 4.3e).
  - Code drift since 4.3a: 5085c60 Edge desc, 7645a19 traversal_common
    simplification, 1c871a8 basic_incidence, aa95fe0 target_id in
    incidence_view all touch the suspect access path.

Linux GCC re-run is needed to know if the gap is gone on the original
toolchain too. That's blocked on this Windows session. Proceeding with
Phase 2 disassembly on MSVC anyway (cheap with the profile preset's PDB)
to document the access-path codegen.

Also: enable DIJKSTRA_BENCH_BGL=ON in windows-msvc-profile preset so
BGL benchmarks build by default in the investigation preset.
…utomation

Replaces ad-hoc PowerShell one-liners that have been the bottleneck for
Phases 4.3-e and Thread B. All scripts are stdlib-only (Python 3.10+).

  bench_run.py     orchestrate Google Benchmark with core-pinning and
                   priority High; emit median/CV rows as JSON.
  bench_compare.py diff two bench_run.py JSONs as a markdown table with
                   regression/win flags at a configurable threshold.
  vtune_top.py     parse a VTune -format csv hotspots report; emit a
                   normalized top-N table with template clutter stripped.
  disasm_func.py   find a function by demangled-name substring; dump only
                   that function's bytes via dumpbin (avoids the 14k+
                   irrelevant entries of /disasm on the full exe).

Smoke-tested on the windows-msvc-profile build:
  - bench_run.py captured 4 aggregate rows from a 3-rep run cleanly.
  - vtune_top.py parsed the new hot_grid_idx4_profile_001 collection
    and emitted a 12-row markdown table with normalized symbols.
  - disasm_func.py found 3 sift_down_ instantiations in the exe and
    dumped Idx4's range to artifacts/perf/sift_down_idx4.asm.

.gitignore: add artifacts/ (bench JSON, hotspot CSV, disassembly captures).
VTune anchor on windows-msvc-profile (with /Zi):
  heap::sift_down_                  34.9 %
  less::operator() (3 copies)       16.5 % combined
  cfn::operator() (2 copies)         9.0 % combined
  incidence_view iter operator*      5.9 %
  vector<double>::operator[]         4.8 %

  Note: this differs from Phase 4.3e's 98.8% in one frame because /Zi
  preserves function boundaries for symbol attribution; codegen is
  identical between /O2 /Ob3 and /O2 /Ob3 /Zi.

Idx4 sift_down_ inner loop (artifacts/perf/sift_down_idx4.asm):
  mov   eax, [r11 + r8*4]      ; child key
  mov   ecx, [r11 + r9*4]      ; other child key
  movsd xmm0, [r10 + rax*8]    ; child distance
  comisd xmm0, [r10 + rcx*8]   ; compare
  cmova r8, r9                 ; conditional best update

Five instructions per comparison, no calls, no template scaffolding,
no pointer subtractions. Comparator chain (std::less, container_value_fn,
distance_fn) fully collapsed by /Ob3. 4-children-per-iteration unrolled
outer loop, identical 1-child remainder loop. This is the textbook shape
Open Q3 hypothesised; on MSVC it requires /Ob3 to materialise.

Phases 3-5 deferred pending Linux GCC re-run of Phase 4.3a (the only
place the original BGL gap lived).
Diagnosed the iteration-time cost of the previous Phase-2 work and rebuilt
the perf tooling around the bottlenecks:

  * dumpbin /disasm:nobytes on the 1.4 MB benchmark exe takes ~30s every
    invocation. New scripts/perf/sym_index.py caches the parsed symbol
    table to <exe>.symidx.json, invalidated by mtime+size. Subsequent
    lookups drop from ~30s to ~0.5s (60x).

  * cmd /c interprets < and > as redirection even inside double quotes,
    so passing 'use_indexed_dary_heap<4>' on the command line silently
    fails. New tools accept --regex (which can use '.' wildcards instead
    of literal angle brackets) and skip @ilt thunks by default.

  * Bulk capture beats per-symbol invocations. New scripts/perf/capture_asm.py
    consumes a manifest (basename, length, regex, substrings) and
    one-shot dumps every entry against the cached index. With :N suffix
    on the basename you can disambiguate when the same regex matches
    multiple symbols.

  * WSL exposes no PMU, so the Linux session cannot run 'perf stat
    -e cache-misses'. To compensate, this commit pre-collects everything
    that DOES need the PMU on the Windows side and lands it in
    artifacts/perf/msvc_profile/ (gitignored) plus a tracked inventory at
    agents/perf_msvc_profile_inventory.md. The Linux session compares
    against those reference artifacts using software-only events.

New scripts:
  scripts/perf/sym_index.py        cached dumpbin index
  scripts/perf/find_func.py        symbol search by substring + regex
  scripts/perf/capture_asm.py      bulk dumpbin manifest driver
  scripts/perf/objdump_capture.py  Linux/GCC counterpart (nm + objdump)
  scripts/perf/linux_gcc_capture.sh  one-shot Linux capture runbook

Updated:
  scripts/perf/disasm_func.py      uses sym_index, adds --regex,
                                   --rebuild-cache, --no-truncate
  scripts/perf/README.md           full inventory + cmd quoting note
  .gitignore                       __pycache__, *.pyc

New documentation:
  agents/thread_b_linux_runbook.md      WSL-aware Linux runbook
  agents/perf_capture_manifest.txt      MSVC capture targets (12 syms)
  agents/perf_capture_manifest_linux.txt   GCC counterpart
  agents/perf_msvc_profile_inventory.md inventory of pre-collected refs

Pre-collected MSVC reference (artifacts/perf/msvc_profile/, gitignored):
  - 12 disassembly captures (sift_down_, sift_up_, dijkstra body, BGL
    counterparts, container_value_fn) totaling ~140 KB
  - VTune hotspots.csv and callstacks.csv from the Idx4/Grid/100K run
  - 96-row wallclock_baseline.json across 24 benchmarks x 4 aggregates

Headline finding from line-count proxies (MSVC, /O2 /Ob3 /Zi):
  graph-v3 dijkstra body         206 lines
  BGL      dijkstra body         505 lines  (~2.5x larger)
  graph-v3 sift_down_(Idx4)      184 lines
  BGL      preserve_heap_property_down  299 lines  (~1.6x larger)
Consistent with Phase 1.1 wall-clock data (graph-v3 -34% to -64% vs BGL).
… BGL)

Reruns the BGL comparison and disassembly capture on linux-gcc-release
at the indexed-dary-heap HEAD, per agents/thread_b_linux_runbook.md.

Findings (decision-tree branch: 'still +30%+ slower on Grid' fires):
- graph-v3 CSR Idx4 vs BGL CSR (Linux GCC, median, 5 reps, CV <= 5%):
    ER_Sparse 100K  +14.7 %   slower
    Grid      100K  +36.2 %   slower  <- 2025 4.3a worst case, intact
    BA        100K   +6.0 %   slower
    Path      100K  +15.2 %   slower
- The post-4.3a commits (5085c60, 7645a19, 1c871a8, aa95fe0) closed
  the gap on MSVC (graph-v3 -34 % to -64 %) but not on GCC.

Phase 2 (objdump) localises the size delta:
- MSVC: graph-v3 inlined body 499 lines vs BGL 1,008 (2.0x).
- GCC:  graph-v3 inlined body 387 lines vs BGL   412 (1.06x).
  GCC compresses BGL's get(weight, edge) chain ~59 %; graph-v3's
  per-edge chain only ~22 %.

Phases 3-5 of csr_edge_value_perf_plan.md are un-deferred.

Manifest fix: agents/perf_capture_manifest_linux.txt now matches the
inlined GCC dijkstra closure (use_indexed_dary_heap<Nul> + operator()
+ graph-type substring) and BGL's run_bgl_dijkstra wrapper rather
than the standalone sift_down_/preserve_heap_property symbols, which
have no body under -O3.

Files (gitignored, regenerable via scripts/perf/linux_gcc_capture.sh):
  artifacts/perf/linux_gcc/wallclock_baseline.json
  artifacts/perf/linux_gcc/diff_msvc_vs_gcc.md
  artifacts/perf/linux_gcc/dijkstra_{csr_idx2,csr_idx4,csr_idx8,vov_idx4}.asm
  artifacts/perf/linux_gcc/dijkstra_bgl_{csr,adj}.asm
  artifacts/perf/linux_gcc/perfstat_*.{stdout,stderr}
Consolidated reference document for the Heap template parameter
evaluation across Phase 4.1–4.3e. Covers:
- Linux/GCC and Windows/MSVC benchmark results (CSR + VoV)
- Topology-by-topology heap comparison (ER, BA, Grid, Path)
- vs BGL compressed_sparse_row_graph and adjacency_list
- Default decision: use_default_heap, Arity=4 for indexed variant
- MSVC inlining investigation (/Ob3 + GRAPH_DETAIL_FORCE_INLINE)
- Open follow-ups in CSR edge-value access path
Move dary_heap-related agent files into agents/dary_heap/:
  csr_edge_value_perf_plan.md
  indexed_dary_heap_baseline.md
  indexed_dary_heap_baseline_msvc.md
  indexed_dary_heap_plan.md
  indexed_dary_heap_results.md
  perf_capture_manifest.txt
  perf_capture_manifest_linux.txt
  perf_linux_gcc_inventory.md
  perf_msvc_profile_inventory.md
  thread_b_linux_runbook.md

Move unrelated agent files into agents/archive/:
  doc_revision_plan.md
  index_vertex_descriptor_plan.md
  map_container_plan.md
  map_container_strategy.md
@pratzl pratzl merged commit ce2f042 into main Apr 28, 2026
11 checks passed