Not understanding read performance higher than NVME capacity #662

fede-vaccaro · 2026-04-05T17:57:33Z

fede-vaccaro
Apr 5, 2026

Hello everyone! I was playing with the driver_test benchmark, but I cannot wrap my mind around the fact that I'm getting such high randread throughput, despite the cache size being smaller than the database size and using O_DIRECT.
Here's the benchmark report and the command I launched

:~/splinterdb$ ./driver_test splinter_test --perf --num-inserts 100000000 --data-size 100 --db-capacity-gib 30 --set-O_DIRECT --cache-capacity-gib 2                      
./driver_test: splinterdb_build_version 2b6d030f-dirty
Dispatch test splinter_test
Bumped up IO queue size to 1280
Running splinter_test with 1 caches
splinter_test: SplinterDB performance test started with 1 tables
splinter_perf_inserts() starting num_insert_threads=14, num_threads=20, num_inserts=100003840 (~100 million) ...
Thread 12 inserting  99% complete for table 0 Test done for all 1 tables.
 Test done for all 1 tables.
 Test done for all 1 tables.
 Test done for all 1 tables.
 Test done for all 1 tables.
 Test done for all 1 tables.
 Test done for all 1 tables.
 Test done for all 1 tables.
 Test done for all 1 tables.
 Test done for all 1 tables.
 Test done for all 1 tables.
 Test done for all 1 tables.
 Test done for all 1 tables.
 Test done for all 1 tables.

Total time=25s, per-splinter per-thread insert time per tuple 3501 ns
splinter total insertion rate: 3998453 insertions/second
splinter bandwidth: 0 megabytes/second
splinter max insert latency: 0 msec
Statistics are not enabled
Space use: trunk 13500416 bytes, maplet 179748864 bytes, branch 14020472832 bytes, total 14213722112 bytes
--------------------------------------------------------------------
| height | trunk bytes | maplet bytes | branch bytes | total bytes |
|--------|-------------|--------------|--------------|-------------|
|      0 |    11403264 |    141914112 |  12322316288 | 12475633664 |
|      1 |     1703936 |     32411648 |   1430347776 |  1464463360 |
|      2 |      262144 |      4894720 |    235163648 |   240320512 |
|      3 |      131072 |       528384 |     32645120 |    33304576 |
--------------------------------------------------------------------
splinter_perf_inserts() completed.

splinter_perf_lookups() starting num_lookup_threads=20, max_async_inflight=64 ...
Thread 8 lookups  99% complete for table 00
Total time=65s, per-splinter per-thread lookup time per tuple 13082 ns
splinter total lookup rate: 1528751 lookups/second
100% lookups were async
max lookup latency ns (sync=0, async=74078469)
Statistics are not enabled
splinter_perf_lookups() completed.

Here you see a whopping 1.5M lookups/second.

However, on another tab I was inspecting my Nvme activity with iostat 1 -x, and here's a sample of it (I can tell it was quite uniform during the randread test, so this single sample is representative)

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1       600981,00 2403924,00     0,00   0,00    0,07     4,00   29,00    212,00    22,00  43,14    1,31     7,31    4,00    168,00     0,00   0,00    0,25    42,00    6,00    3,00   40,86  86,90

So, on one hand I'm reading around 600K pages/s, which I assume for a randread test it should translate to 600K read/s (under uniform distribution, each read is going to hit a different disk page); if I consider the 2GB cache, with 100M key values (roughly each is 128 byte), it should roughly cover the 16% of the dataset, so I should expect no more than Disk reads/s * 1.16 , which would be around 696K read/s, still way lower than the measured 1.5M reads/s ! And vmtouch shows that the page cache is actually bypassed, so - even though the --set-O_DIRECT flag is set, I'm positive that I'm not going through the OS page cache.

Even considered this, the most I can get from my NVME with fio is between 800K and 1M rand read/s with O_DIRECT.

My question here is: what am I missing here? Where is the 1.5M reads/s coming from?

Thank you!

rtjohnso · 2026-04-06T20:33:24Z

rtjohnso
Apr 6, 2026
Maintainer

Thanks for the interesting inquiry.

I ran this locally and I see similar results: fio can drive my disk to about 1M random 4K IOs per second, iostat during the splinter lookup phase indicates it is driving about 500K IOPS, but the benchmark reports about 1.25M lookups/sec.

I enabled --stats in the benchmark and, after some bug fixes (which I expect to merge into main in a few minutes), I see that the cache indicates that there are only about 28M cache misses during the benchmark of 100M lookups.

I think that is clearly the smoking gun: somehow we are getting a very high cache hit rate in the benchmark. That explains the performance discrepancy.

I am still working to figure out why we get such a high cache hit rate. My first guess is that perhaps the random-number generator we use to generate the workload is not very good. The ith number output by our generator is, more-or-less, XXH64(i). Perhaps xxhash is not as good as we'd hoped for this purpose.

One thing I notice is that your database is substantially larger than what I saw -- you had around 14GBs, whereas I had only around 9.5GBs. I suspect this is because the benchmark generates messages of various sizes between 8 and 100 bytes, which should result in about 8-9GBs of data after inserting 100M such messages with 24-byte keys. Did you modify the benchmark to always generate 100-byte messages?

1 reply

fede-vaccaro Apr 6, 2026
Author

Did you modify the benchmark to always generate 100-byte messages?

Yes! I apologize for not specifying, I forgot, but it was needed for consistency with other benchmarks (and did not change the final result, anyway).

By the way, I might have gotten closer to solving the mistery: If, for the splinter_perf_lookups test I use a permutation of the indices (so instead of hash(i) i do hash(perm[i])) the results are what I would expect:

Lookup result:

splinter_perf_lookups() starting num_lookup_threads=20, max_async_inflight=64 ...
Thread 8 lookups  99% complete for table 00
Total time=117s, per-splinter per-thread lookup time per tuple 23489 ns
splinter total lookup rate: 851436 lookups/second
100% lookups were async
max lookup latency ns (sync=0, async=25409161)
Statistics are not enabled
splinter_perf_lookups() completed.

iostat sample:

$ iostat 1 -x
Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1       868467,00 3473868,00     0,00   0,00    0,10     4,00   92,00    440,00    22,00  19,30    0,20     4,78    1,00     12,00     0,00   0,00    1,00    12,00    4,00    1,50   86,99  91,20

This is the patch:

diff --git a/tests/functional/splinter_test.c b/tests/functional/splinter_test.c
index d6f4c81..cfffb99 100644
--- a/tests/functional/splinter_test.c
+++ b/tests/functional/splinter_test.c
@@ -25,6 +25,7 @@
 #include "test_common.h"
 
 #include "random.h"
+#include "btree.h"
 #include "poison.h"
 
 #define TEST_INSERT_GRANULARITY 4096
@@ -73,6 +74,7 @@ typedef struct test_splinter_thread_params {
    uint64             seed;
    uint64             range_lookups_done;
    uint64             progress;
+   const uint64      *lookup_perm; // if non-NULL, permuted key indices for random lookups
 } test_splinter_thread_params;
 
 /*
@@ -283,6 +285,9 @@ test_trunk_lookup_thread(void *arg)
             test_async_lookup *async_lookup = params->async_lookup[spl_idx];
             test_async_ctxt   *ctxt;
             uint64             lookup_num = lookup_base[spl_idx] + op_offset;
+            uint64             key_idx    = params->lookup_perm
+                                              ? params->lookup_perm[lookup_num]
+                                              : lookup_num;
             timestamp          ts;
 
             if (async_lookup->max_async_inflight == 0) {
@@ -290,7 +295,7 @@ test_trunk_lookup_thread(void *arg)
 
                test_key(&keybuf,
                         test_cfg[spl_idx].key_type,
-                        lookup_num,
+                        key_idx,
                         thread_number,
                         test_cfg[spl_idx].semiseq_freq,
                         core_max_key_size(spl),
@@ -304,7 +309,7 @@ test_trunk_lookup_thread(void *arg)
                platform_assert(SUCCESS(rc));
                verify_tuple(spl,
                             test_cfg->gen,
-                            lookup_num,
+                            key_idx,
                             key_buffer_key(&keybuf),
                             merge_accumulator_to_message(&data),
                             expected_found);
@@ -312,12 +317,12 @@ test_trunk_lookup_thread(void *arg)
                ctxt = test_async_ctxt_get(spl, async_lookup, &vtarg);
                test_key(&ctxt->key,
                         test_cfg[spl_idx].key_type,
-                        lookup_num,
+                        key_idx,
                         thread_number,
                         test_cfg[spl_idx].semiseq_freq,
                         core_max_key_size(spl),
                         test_cfg[spl_idx].period);
-               ctxt->lookup_num = lookup_num;
+               ctxt->lookup_num = key_idx;
                async_ctxt_submit(spl,
                                  async_lookup,
                                  ctxt,
@@ -1103,6 +1108,25 @@ splinter_perf_lookups(platform_heap_id             hid,
    uint64          start_time = platform_get_timestamp();
    platform_status rc;
 
+   // Build a Fisher-Yates random permutation of [0, total_inserts) so that
+   // each key is looked up exactly once in uniformly random order.
+   uint64 *lookup_perm = TYPED_ARRAY_MALLOC(hid, lookup_perm, total_inserts);
+   platform_assert(lookup_perm != NULL);
+   for (uint64 i = 0; i < total_inserts; i++) {
+      lookup_perm[i] = i;
+   }
+   random_state perm_rs;
+   random_init(&perm_rs, 42, 0);
+   for (uint64 i = total_inserts - 1; i > 0; i--) {
+      uint64 j       = random_next_uint64(&perm_rs) % (i + 1);
+      uint64 tmp     = lookup_perm[i];
+      lookup_perm[i] = lookup_perm[j];
+      lookup_perm[j] = tmp;
+   }
+   for (uint64 i = 0; i < num_lookup_threads; i++) {
+      params[i].lookup_perm = lookup_perm;
+   }
+
    do_n_async_ctxt_inits(
       hid, num_lookup_threads, num_tables, max_async_inflight, cfg, params);
 
@@ -1113,6 +1137,7 @@ splinter_perf_lookups(platform_heap_id             hid,
                             hid,
                             test_trunk_lookup_thread);
    if (!SUCCESS(rc)) {
+      platform_free(hid, lookup_perm);
       return rc;
    }
 
@@ -1174,6 +1199,7 @@ splinter_perf_lookups(platform_heap_id             hid,
       cache_reset_stats(spl->cc);
    }
    platform_default_log("%s() completed.\n\n", __FUNCTION__);
+   platform_free(hid, lookup_perm);
    return rc;
 }
 
diff --git a/tests/functional/test.h b/tests/functional/test.h
index 62db05c..5800d09 100644
--- a/tests/functional/test.h
+++ b/tests/functional/test.h
@@ -296,7 +296,7 @@ test_config_init(system_config          *system_cfg, // OUT
    }
 
    gen->type             = MESSAGE_TYPE_INSERT;
-   gen->min_payload_size = GENERATOR_MIN_PAYLOAD_SIZE;
+   gen->min_payload_size = master_cfg->message_size;
    gen->max_payload_size = master_cfg->message_size;
    return rc;
 }

My feeling is that the xxhash distribution is fine. Just guessing: could it be that the B-tree structure kind of "overfits" the insertion order?

rtjohnso · 2026-04-07T01:13:00Z

rtjohnso
Apr 7, 2026
Maintainer

Thanks for the update. Actually, I think your result supports my concern that XXH64 is the problem. My concern is not that XXH64 has a lot of collisions. Rather, I suspect that XXH64(i) may be "close" to XXH64(i+d) for many values of i and small values of d, so that queries that are close together in the query workload often go to the same branches or even the same leaves of branches.

I will do some experiments and possibly replace XXH64 with something else.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Not understanding read performance higher than NVME capacity #662

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Not understanding read performance higher than NVME capacity #662

Uh oh!

fede-vaccaro Apr 5, 2026

Replies: 2 comments · 1 reply

Uh oh!

rtjohnso Apr 6, 2026 Maintainer

Uh oh!

fede-vaccaro Apr 6, 2026 Author

Uh oh!

rtjohnso Apr 7, 2026 Maintainer

fede-vaccaro
Apr 5, 2026

Replies: 2 comments 1 reply

rtjohnso
Apr 6, 2026
Maintainer

fede-vaccaro Apr 6, 2026
Author

rtjohnso
Apr 7, 2026
Maintainer