feat: introduce turbo. by luoxiaojian · Pull Request #224 · alibaba/zvec

luoxiaojian · 2026-03-13T09:31:50Z

Greptile Summary

This PR introduces a "turbo" acceleration layer (zvec_turbo) that provides runtime-dispatched, architecture-optimised distance kernels for int8 SquaredEuclidean search, initially targeting AVX-512 VNNI. On VNNI-capable hosts the turbo functions are returned as a fast path in QuantizedIntegerMetric, transparently replacing the generic DistanceMatrixCompute and BaseDistanceBatchWithScoreUnquantized implementations.

Key changes:

New src/turbo/ library with AVX-512 VNNI kernels (l2_int8_distance_avx512_vnni, l2_int8_batch_distance_avx512_vnni, l2_int8_query_preprocess_avx512_vnni) and empty AVX2 stubs for future work.
cmake/option.cmake drops the global -march for x86 in AUTO_DETECT_ARCH mode; per-file march flags are now managed inside each library's own CMakeLists.txt.
QuantizedIntegerMetric is updated to try the turbo dispatch first and fall through to the original implementation when no turbo path is available.
The DistanceFunc / BatchDistanceFunc type aliases in turbo.h declare size_t for dimension and count parameters, while the concrete implementations accept int — creating a silent narrowing conversion when the std::function wrappers are invoked (see inline comment).
TURBO_MARCH_FLAG_AVX2 is computed by setup_compiler_march_for_x86 but is never applied to avx2_impl.cc, meaning future AVX2 code added to that file will silently compile without the required -march=core-avx2 flag.
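The try-turbo-then-fallback pattern described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `get_turbo_distance_func`, `select_distance_func`, `l2_int8_reference`, and the `has_vnni` flag are hypothetical stand-ins, and the real dispatch is described as consulting CpuFeatures rather than a boolean argument.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical alias mirroring the shape of DistanceFunc quoted in this review.
using DistanceFunc = std::function<void(const void *m, const void *q,
                                        std::size_t dim, float *out)>;

// Portable reference kernel: squared Euclidean distance over int8 vectors.
inline void l2_int8_reference(const void *m, const void *q, std::size_t dim,
                              float *out) {
  assert(out != nullptr);
  const auto *a = static_cast<const std::int8_t *>(m);
  const auto *b = static_cast<const std::int8_t *>(q);
  std::int64_t sum = 0;
  for (std::size_t i = 0; i < dim; ++i) {
    const std::int32_t d = std::int32_t(a[i]) - std::int32_t(b[i]);
    sum += d * d;  // |d| <= 255, so d * d fits comfortably in 32 bits
  }
  *out = static_cast<float>(sum);
}

// Stand-in for the turbo dispatch: returns an empty std::function when the
// host lacks the required feature. `has_vnni` replaces the real CPU check.
inline DistanceFunc get_turbo_distance_func(bool has_vnni) {
  if (!has_vnni) return nullptr;
  // A real build would return the VNNI kernel here; the sketch reuses the
  // reference kernel so that it runs on any machine.
  return l2_int8_reference;
}

// Try turbo first; fall through to the existing implementation otherwise.
inline DistanceFunc select_distance_func(bool has_vnni) {
  if (DistanceFunc f = get_turbo_distance_func(has_vnni)) return f;
  return l2_int8_reference;
}
```

Returning an empty `std::function` for "no fast path" keeps the caller's fallback logic to a single truthiness check, which matches the "return if non-null, else fall through" behaviour the summary describes.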

Confidence Score: 3/5

PR needs fixes before merging — a size_t→int narrowing mismatch in the public API and several items carried over from prior review threads remain unresolved.
The AVX-512 VNNI batch kernel and preprocess logic are structurally sound, and the integration into QuantizedIntegerMetric follows a clean try-turbo-then-fallback pattern. However, the DistanceFunc/BatchDistanceFunc type aliases use size_t while the implementations accept int, introducing a silent narrowing conversion that produces wrong results whenever a caller passes a dimension or batch count larger than INT_MAX. Additionally, several issues raised in prior review threads (bias correction, stray double-comment markers, missing trailing newline) have not yet been fully resolved, which keeps the confidence score at 3/5.
Files needing attention: src/include/zvec/turbo/turbo.h (size_t/int mismatch in function-type aliases) and src/turbo/euclidean/avx512_impl.cc (ongoing items from prior threads).

Important Files Changed

Filename	Overview
src/turbo/euclidean/avx512_impl.cc	Core AVX-512 VNNI kernel for L2 int8 distance. Sign-trick inner product, VNNI batch path with scalar tail (correct int8_t cast), query preprocessing (int8→uint8), and the L2 reconstruction formula. Has the noted mismatches with the DistanceFunc signature (size_t vs int dim). Previous thread issues (bias correction, scalar fallback types, stray comments) are documented separately.
src/include/zvec/turbo/turbo.h	Public API header for the turbo layer. DistanceFunc and BatchDistanceFunc aliases use size_t for dimension/count parameters, creating a narrowing-conversion mismatch with the int-typed implementations.
src/turbo/turbo.cc	Runtime dispatch layer that checks CpuFeatures::static_flags_ and returns the appropriate AVX-512 VNNI or nullptr implementations. Logic is clear; AVX2 stubs still return nullptr (no implementation yet).
src/turbo/CMakeLists.txt	Turbo library build definition. AVX-512 march flags are correctly applied to avx512_impl.cc; TURBO_MARCH_FLAG_AVX2 is computed but never applied, which will silently omit the flag when AVX2 code is eventually added.
src/core/metric/quantized_integer_metric.cc	Integrates turbo distance/preprocess functions as a fast path before falling back to the existing implementations. Logic is straightforward: try turbo, return if non-null, else fall through.
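For readers unfamiliar with the terms in the avx512_impl.cc row, the following scalar model shows how a query shifted from int8 to uint8, a bias correction, and the L2 reconstruction formula fit together. It is a sketch under stated assumptions: the PR's kernel source is not reproduced here, so the +128 offset and all function names below are illustrative guesses at the general technique, not the author's implementation. The underlying identities are sum((q_i + 128) * v_i) = <q, v> + 128 * sum(v_i) and ||q - v||^2 = ||q||^2 + ||v||^2 - 2<q, v>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Shift each signed query byte into the unsigned range: u = q + 128.
// (Assumed offset; VNNI dot-product instructions take unsigned x signed bytes.)
inline std::vector<std::uint8_t> preprocess_query(const std::int8_t *q,
                                                  std::size_t dim) {
  std::vector<std::uint8_t> u(dim);
  for (std::size_t i = 0; i < dim; ++i) {
    u[i] = static_cast<std::uint8_t>(static_cast<int>(q[i]) + 128);
  }
  return u;
}

// Scalar stand-in for an unsigned-by-signed byte dot product.
inline std::int64_t dot_u8_i8(const std::uint8_t *u, const std::int8_t *v,
                              std::size_t dim) {
  std::int64_t acc = 0;
  for (std::size_t i = 0; i < dim; ++i) acc += int(u[i]) * int(v[i]);
  return acc;
}

// Squared L2 distance recovered from the biased dot product.
inline std::int64_t l2_via_shifted_query(const std::int8_t *q,
                                         const std::int8_t *v,
                                         std::size_t dim) {
  const std::vector<std::uint8_t> u = preprocess_query(q, dim);
  const std::int64_t biased = dot_u8_i8(u.data(), v, dim);
  std::int64_t sum_v = 0, qq = 0, vv = 0;
  for (std::size_t i = 0; i < dim; ++i) {
    sum_v += v[i];
    qq += int(q[i]) * int(q[i]);
    vv += int(v[i]) * int(v[i]);
  }
  const std::int64_t dot = biased - 128 * sum_v;  // bias correction
  return qq + vv - 2 * dot;                       // L2 reconstruction
}

// Direct definition, used to check the reconstruction.
inline std::int64_t l2_direct(const std::int8_t *q, const std::int8_t *v,
                              std::size_t dim) {
  std::int64_t sum = 0;
  for (std::size_t i = 0; i < dim; ++i) {
    const int d = int(q[i]) - int(v[i]);
    sum += d * d;
  }
  return sum;
}
```

If the bias term `128 * sum_v` is omitted or given the wrong sign, the two functions disagree for any vector whose elements do not sum to zero, which is why the bias-correction item from the earlier threads matters for correctness.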

Sequence Diagram

sequenceDiagram
    participant Caller
    participant QIM as QuantizedIntegerMetric
    participant TD as turbo dispatch<br/>(turbo.cc)
    participant CPU as CpuFeatures
    participant VNNI as avx512_impl.cc
    participant Legacy as Legacy impl

    Caller->>QIM: get_query_preprocess_func()
    QIM->>TD: get_query_preprocess_func(kSquaredEuclidean, kInt8, kDefault)
    TD->>CPU: AVX512_VNNI?
    alt AVX-512 VNNI available
        CPU-->>TD: true
        TD-->>QIM: l2_int8_query_preprocess_avx512_vnni
        QIM-->>Caller: turbo preprocess func
        Caller->>VNNI: preprocess(query, dim)<br/>(int8 → uint8, first d bytes)
    else no VNNI
        CPU-->>TD: false
        TD-->>QIM: nullptr
        QIM-->>Caller: legacy preprocess func
    end

    Caller->>QIM: batch_distance()
    QIM->>TD: get_batch_distance_func(kSquaredEuclidean, kInt8, kDefault)
    TD->>CPU: AVX512_VNNI?
    alt AVX-512 VNNI available
        CPU-->>TD: true
        TD-->>QIM: l2_int8_batch_distance_avx512_vnni
        QIM-->>Caller: turbo batch func
        Caller->>VNNI: batch_distance(vectors[], query, n, dim, out[])<br/>(VNNI inner products + bias correction + L2 formula)
    else no VNNI
        CPU-->>TD: false
        TD-->>QIM: nullptr
        QIM-->>Caller: legacy batch func (BaseDistanceBatchWithScoreUnquantized)
        Caller->>Legacy: ComputeBatch(...)
    end

    Caller->>QIM: distance_matrix(m, n)
    QIM->>TD: get_distance_func(kSquaredEuclidean, kInt8, kDefault)
    TD->>CPU: AVX512_VNNI?
    alt AVX-512 VNNI available
        CPU-->>TD: true
        TD-->>QIM: l2_int8_distance_avx512_vnni
        QIM-->>Caller: turbo single-distance func
        Caller->>VNNI: distance(a, b, dim, out)<br/>(sign-trick inner product + L2 formula)
    else no VNNI
        CPU-->>TD: false
        TD-->>QIM: nullptr
        QIM-->>Caller: legacy DistanceMatrixCompute
        Caller->>Legacy: compute(m, n)
    end

Last reviewed commit: 67663e0

Greptile also left 2 inline comments on this PR.

src/turbo/euclidean/avx512_impl.cc

src/turbo/turbo.cc

src/turbo/euclidean/avx512_impl.cc

src/turbo/euclidean/avx512_impl.h

luoxiaojian · 2026-03-13T09:56:46Z

@greptile

greptile-apps · 2026-03-13T10:03:27Z

src/include/zvec/turbo/turbo.h

+using DistanceFunc =
+    std::function<void(const void *m, const void *q, size_t dim, float *out)>;
+using BatchDistanceFunc = std::function<void(
+    const void **m, const void *q, size_t num, size_t dim, float *out)>;


size_t → int narrowing in function-type aliases

DistanceFunc declares size_t dim and BatchDistanceFunc declares size_t num, size_t dim, but the concrete implementations (l2_int8_distance_avx512_vnni, l2_int8_batch_distance_avx512_vnni) accept int dim and int n. When these function pointers are stored inside the std::function wrappers and later called with size_t arguments, the std::function machinery performs a silent narrowing conversion from size_t to int at the call site. For any dimension or batch count that exceeds INT_MAX the values will be silently truncated, producing wrong results without any diagnostic.

Either align the concrete function signatures to use size_t:

void l2_int8_distance_avx512_vnni(const void *a, const void *b, size_t dim,
                                  float *distance);
void l2_int8_batch_distance_avx512_vnni(const void *const *vectors,
                                        const void *query, size_t n,
                                        size_t dim, float *distances);

or change the aliases to use int to match the actual implementations:

using DistanceFunc =
    std::function<void(const void *m, const void *q, int dim, float *out)>;
using BatchDistanceFunc = std::function<void(
    const void **m, const void *q, int num, int dim, float *out)>;
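The narrowing described in this comment can be reproduced in isolation. The sketch below uses a hypothetical `int_dim_kernel` and `SizeTFunc` alias rather than the PR's functions; it only demonstrates that `std::function` accepts an int-typed callable behind a size_t-typed signature and converts the argument without a diagnostic.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <functional>

// Hypothetical int-typed kernel that records the dimension it received.
inline void int_dim_kernel(int dim, long long *seen) { *seen = dim; }

// Wrapper type declared with size_t, analogous to the aliases quoted above.
using SizeTFunc = std::function<void(std::size_t dim, long long *seen)>;

// Calls the int-typed kernel through the size_t-typed wrapper and reports
// the value the kernel actually observed.
inline long long dim_seen_by_kernel(std::size_t dim) {
  SizeTFunc f = int_dim_kernel;  // accepted: size_t converts implicitly to int
  long long seen = 0;
  f(dim, &seen);
  return seen;
}
```

On platforms where size_t is wider than int, passing `static_cast<std::size_t>(INT_MAX) + 1` yields a value the kernel cannot represent, so what it sees differs from what the caller passed. Small dimensions such as 128 pass through unchanged, which is why the mismatch is easy to miss in testing.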

greptile-apps · 2026-03-13T10:03:28Z

src/turbo/CMakeLists.txt

+        setup_compiler_march_for_x86(TURBO_MARCH_FLAG_SSE TURBO_MARCH_FLAG_AVX2 TURBO_MARCH_FLAG_AVX512)
+        set_source_files_properties(
+            ${CMAKE_CURRENT_SOURCE_DIR}/euclidean/avx512_impl.cc
+            PROPERTIES
+            COMPILE_FLAGS "${TURBO_MARCH_FLAG_AVX512}"
+        )


TURBO_MARCH_FLAG_AVX2 computed but never applied

setup_compiler_march_for_x86 is called with all three flag variables, and TURBO_MARCH_FLAG_AVX512 is correctly applied to avx512_impl.cc. However TURBO_MARCH_FLAG_AVX2 is never used in any set_source_files_properties call — avx2_impl.cc (and avx2_impl.h) are currently empty placeholders, but when AVX2 code is eventually added there will be no compile-time reminder to apply the flag. Consider adding a commented-out placeholder now so it is not forgotten:

# set_source_files_properties(
#     ${CMAKE_CURRENT_SOURCE_DIR}/euclidean/avx2_impl.cc
#     PROPERTIES
#     COMPILE_FLAGS "${TURBO_MARCH_FLAG_AVX2}"
# )
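A complementary option, offered here only as a sketch, is a source-level check inside the AVX2 translation unit itself, so that a missing flag is caught at compile time. The names `kCompiledWithAvx2` and `avx2_build_status` are hypothetical; the `__AVX2__` macro is the one GCC and Clang define when AVX2 code generation is enabled (MSVC defines it under /arch:AVX2).

```cpp
#include <cassert>
#include <cstring>

// Reflects whether this translation unit was compiled with AVX2 enabled.
#if defined(__AVX2__)
constexpr bool kCompiledWithAvx2 = true;
#else
constexpr bool kCompiledWithAvx2 = false;
#endif

// Once the file contains real intrinsics, a hard check could be enabled:
//   static_assert(kCompiledWithAvx2,
//                 "compile this file with TURBO_MARCH_FLAG_AVX2");

// Human-readable status, useful for a build-time or start-up log line.
inline const char *avx2_build_status() {
  return kCompiledWithAvx2 ? "AVX2 enabled" : "AVX2 flag missing";
}
```

The `static_assert` is left commented out so the sketch compiles with or without the flag; enabling it turns a forgotten `COMPILE_FLAGS` entry into a build error.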

introduce turbo

d846025

greptile-apps bot reviewed Mar 13, 2026


luoxiaojian added 2 commits March 13, 2026 17:45

format

8702af1

fix

b5303d8

comment

67663e0

greptile-apps bot reviewed Mar 13, 2026


add guard

f6230c0

luoxiaojian wants to merge 5 commits into alibaba:main from luoxiaojian:turbo
