
feat: introduce turbo. #224

Open
luoxiaojian wants to merge 5 commits into alibaba:main from luoxiaojian:turbo

Conversation

@luoxiaojian
Collaborator

@luoxiaojian luoxiaojian commented Mar 13, 2026

Greptile Summary

This PR introduces a "turbo" acceleration layer (zvec_turbo) that provides runtime-dispatched, architecture-optimised distance kernels for int8 SquaredEuclidean search, initially targeting AVX-512 VNNI. On VNNI-capable hosts the turbo functions are returned as a fast path in QuantizedIntegerMetric, transparently replacing the generic DistanceMatrixCompute and BaseDistanceBatchWithScoreUnquantized implementations.

Key changes:

  • New src/turbo/ library with AVX-512 VNNI kernels (l2_int8_distance_avx512_vnni, l2_int8_batch_distance_avx512_vnni, l2_int8_query_preprocess_avx512_vnni) and empty AVX2 stubs for future work.
  • cmake/option.cmake drops the global -march for x86 in AUTO_DETECT_ARCH mode; per-file march flags are now managed inside each library's own CMakeLists.txt.
  • QuantizedIntegerMetric is updated to try the turbo dispatch first and fall through to the original implementation when no turbo path is available.
  • The DistanceFunc / BatchDistanceFunc type aliases in turbo.h declare size_t for dimension and count parameters, while the concrete implementations accept int — creating a silent narrowing conversion when the std::function wrappers are invoked (see inline comment).
  • TURBO_MARCH_FLAG_AVX2 is computed by setup_compiler_march_for_x86 but is never applied to avx2_impl.cc, meaning future AVX2 code added to that file will silently compile without the required -march=core-avx2 flag.

Confidence Score: 3/5

  • PR needs fixes before merging: a size_t-to-int narrowing mismatch in the public API and several items carried over from prior review threads remain unresolved.
  • The AVX-512 VNNI batch kernel and preprocess logic are structurally sound, and the integration into QuantizedIntegerMetric follows a clean try-turbo-then-fallback pattern. However the DistanceFunc/BatchDistanceFunc type aliases use size_t while the implementations accept int, introducing a silent narrowing conversion that can produce wrong results for any caller that passes a dimension or batch count as size_t. Additionally, several issues raised in prior review threads (bias correction, stray double-comment markers, missing trailing newline) have not yet been fully resolved, keeping overall confidence low.
  • src/include/zvec/turbo/turbo.h (size_t/int mismatch in function-type aliases) and src/turbo/euclidean/avx512_impl.cc (ongoing items from prior threads).
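
The try-turbo-then-fallback pattern described above can be sketched as follows. This is a minimal illustration, not the PR's code: `fast_distance`, `legacy_distance`, `get_turbo_distance_func`, `select_distance_func`, and the `has_vnni` flag are hypothetical stand-ins, and only the `DistanceFunc` alias mirrors the one quoted from turbo.h.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Mirrors the alias quoted from turbo.h in this review.
using DistanceFunc =
    std::function<void(const void *m, const void *q, std::size_t dim, float *out)>;

// Hypothetical kernels: each writes a marker value so the caller can tell
// which path was selected.
void fast_distance(const void *, const void *, std::size_t, float *out) {
  *out = 1.0f;
}
void legacy_distance(const void *, const void *, std::size_t, float *out) {
  *out = 2.0f;
}

// Hypothetical dispatcher: returns an empty std::function when the CPU
// feature is missing, which is the "nullptr" case in the review summary.
DistanceFunc get_turbo_distance_func(bool has_vnni) {
  if (has_vnni) return fast_distance;
  return nullptr;
}

// Try the turbo path first; fall through to the legacy implementation.
DistanceFunc select_distance_func(bool has_vnni) {
  if (DistanceFunc turbo = get_turbo_distance_func(has_vnni)) return turbo;
  return legacy_distance;
}

// Small helper that runs whichever function was selected.
float run_selected(bool has_vnni) {
  float out = 0.0f;
  select_distance_func(has_vnni)(nullptr, nullptr, 0, &out);
  return out;
}
```

Because an empty `std::function` converts to `false`, the caller needs no separate capability check; the feature test stays inside the dispatcher.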

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/turbo/euclidean/avx512_impl.cc | Core AVX-512 VNNI kernel for L2 int8 distance. Sign-trick inner product, VNNI batch path with scalar tail (correct int8_t cast), query preprocessing (int8→uint8), and the L2 reconstruction formula. Has the noted mismatches with the DistanceFunc signature (size_t vs int dim). Previous thread issues (bias correction, scalar fallback types, stray comments) are documented separately. |
| src/include/zvec/turbo/turbo.h | Public API header for the turbo layer. DistanceFunc and BatchDistanceFunc aliases use size_t for dimension/count parameters, creating a narrowing-conversion mismatch with the int-typed implementations. |
| src/turbo/turbo.cc | Runtime dispatch layer that checks CpuFeatures::static_flags_ and returns the appropriate AVX-512 VNNI or nullptr implementations. Logic is clear; AVX2 stubs still return nullptr (no implementation yet). |
| src/turbo/CMakeLists.txt | Turbo library build definition. AVX-512 march flags are correctly applied to avx512_impl.cc; TURBO_MARCH_FLAG_AVX2 is computed but never applied, which will silently omit the flag when AVX2 code is eventually added. |
| src/core/metric/quantized_integer_metric.cc | Integrates turbo distance/preprocess functions as a fast path before falling back to the existing implementations. Logic is straightforward: try turbo, return if non-null, else fall through. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant QIM as QuantizedIntegerMetric
    participant TD as turbo dispatch<br/>(turbo.cc)
    participant CPU as CpuFeatures
    participant VNNI as avx512_impl.cc
    participant Legacy as Legacy impl

    Caller->>QIM: get_query_preprocess_func()
    QIM->>TD: get_query_preprocess_func(kSquaredEuclidean, kInt8, kDefault)
    TD->>CPU: AVX512_VNNI?
    alt AVX-512 VNNI available
        CPU-->>TD: true
        TD-->>QIM: l2_int8_query_preprocess_avx512_vnni
        QIM-->>Caller: turbo preprocess func
        Caller->>VNNI: preprocess(query, dim)<br/>(int8 → uint8, first d bytes)
    else no VNNI
        CPU-->>TD: false
        TD-->>QIM: nullptr
        QIM-->>Caller: legacy preprocess func
    end

    Caller->>QIM: batch_distance()
    QIM->>TD: get_batch_distance_func(kSquaredEuclidean, kInt8, kDefault)
    TD->>CPU: AVX512_VNNI?
    alt AVX-512 VNNI available
        CPU-->>TD: true
        TD-->>QIM: l2_int8_batch_distance_avx512_vnni
        QIM-->>Caller: turbo batch func
        Caller->>VNNI: batch_distance(vectors[], query, n, dim, out[])<br/>(VNNI inner products + bias correction + L2 formula)
    else no VNNI
        CPU-->>TD: false
        TD-->>QIM: nullptr
        QIM-->>Caller: legacy batch func (BaseDistanceBatchWithScoreUnquantized)
        Caller->>Legacy: ComputeBatch(...)
    end

    Caller->>QIM: distance_matrix(m, n)
    QIM->>TD: get_distance_func(kSquaredEuclidean, kInt8, kDefault)
    TD->>CPU: AVX512_VNNI?
    alt AVX-512 VNNI available
        CPU-->>TD: true
        TD-->>QIM: l2_int8_distance_avx512_vnni
        QIM-->>Caller: turbo single-distance func
        Caller->>VNNI: distance(a, b, dim, out)<br/>(sign-trick inner product + L2 formula)
    else no VNNI
        CPU-->>TD: false
        TD-->>QIM: nullptr
        QIM-->>Caller: legacy DistanceMatrixCompute
        Caller->>Legacy: compute(m, n)
    end
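
The "int8 → uint8" preprocessing, "bias correction", and "L2 formula" steps named in the diagram can be illustrated with a scalar model. This is a sketch of the underlying algebra only, assuming the query is shifted by 128 so that it fits the unsigned-by-signed operand shape of the AVX-512 VNNI byte dot product; it is not the PR's kernel, and every function name here is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Straightforward reference: sum of squared differences.
std::int32_t l2_reference(const std::int8_t *q, const std::int8_t *v,
                          std::size_t dim) {
  std::int32_t acc = 0;
  for (std::size_t i = 0; i < dim; ++i) {
    const std::int32_t d = std::int32_t{q[i]} - std::int32_t{v[i]};
    acc += d * d;
  }
  return acc;
}

// Same result computed from an unsigned-by-signed dot product.
std::int32_t l2_via_shifted_dot(const std::int8_t *q, const std::int8_t *v,
                                std::size_t dim) {
  // Preprocess: shift the query from [-128, 127] to [0, 255].
  std::vector<std::uint8_t> shifted(dim);
  for (std::size_t i = 0; i < dim; ++i) {
    shifted[i] = static_cast<std::uint8_t>(q[i] + 128);
  }

  std::int32_t shifted_dot = 0, sum_v = 0, sq_q = 0, sq_v = 0;
  for (std::size_t i = 0; i < dim; ++i) {
    shifted_dot += std::int32_t{shifted[i]} * std::int32_t{v[i]};
    sum_v += v[i];
    sq_q += std::int32_t{q[i]} * std::int32_t{q[i]};
    sq_v += std::int32_t{v[i]} * std::int32_t{v[i]};
  }

  // Bias correction: sum((q + 128) * v) = sum(q * v) + 128 * sum(v).
  const std::int32_t dot = shifted_dot - 128 * sum_v;

  // L2 reconstruction: |q - v|^2 = |q|^2 + |v|^2 - 2 * (q . v).
  return sq_q + sq_v - 2 * dot;
}
```

In a vectorised kernel the shifted dot product is the part that would map onto VNNI instructions, while the norms and `sum_v` could be accumulated alongside it or precomputed; which of these the PR does is not shown in this review.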

Last reviewed commit: 67663e0

Greptile also left 2 inline comments on this PR.

@luoxiaojian
Collaborator Author

@greptile

Comment on lines +21 to +24
using DistanceFunc =
std::function<void(const void *m, const void *q, size_t dim, float *out)>;
using BatchDistanceFunc = std::function<void(
const void **m, const void *q, size_t num, size_t dim, float *out)>;

size_t-to-int narrowing in function-type aliases

DistanceFunc declares size_t dim and BatchDistanceFunc declares size_t num, size_t dim, but the concrete implementations (l2_int8_distance_avx512_vnni, l2_int8_batch_distance_avx512_vnni) accept int dim and int n. When these function pointers are stored inside the std::function wrappers and later called with size_t arguments, the std::function machinery performs a silent narrowing conversion from size_t to int at the call site. For any dimension or batch count that exceeds INT_MAX the values will be silently truncated, producing wrong results without any diagnostic.

Either align the concrete function signatures to use size_t:

void l2_int8_distance_avx512_vnni(const void *a, const void *b, size_t dim,
                                  float *distance);

void l2_int8_batch_distance_avx512_vnni(const void *const *vectors,
                                        const void *query, size_t n, size_t dim,
                                        float *distances);

or change the aliases to use int to match the actual implementations:

using DistanceFunc =
    std::function<void(const void *m, const void *q, int dim, float *out)>;
using BatchDistanceFunc = std::function<void(
    const void **m, const void *q, int num, int dim, float *out)>;
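
The narrowing described in this comment can be reproduced in isolation. The sketch below assumes a hypothetical `kernel_int_dim` that takes `int dim` and merely records the value it receives; it is wrapped in a `std::function` whose signature uses `std::size_t`, as the aliases in turbo.h do.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <functional>

// Records the dimension the kernel actually received.
static long long g_seen_dim = 0;

// Hypothetical stand-in for an implementation that accepts int dim.
void kernel_int_dim(const void *, const void *, int dim, float *out) {
  g_seen_dim = dim;
  *out = 0.0f;
}

// Alias with size_t dim, as in the header under review.
using DistanceFunc =
    std::function<void(const void *, const void *, std::size_t, float *)>;

// Calls the int-typed kernel through the size_t-typed wrapper and reports
// what the kernel saw.
long long dim_seen_by_kernel(std::size_t dim) {
  DistanceFunc f = kernel_int_dim;  // accepted: size_t converts to int
  float out = 0.0f;
  f(nullptr, nullptr, dim, &out);   // size_t -> int conversion happens here
  return g_seen_dim;
}
```

Small values pass through unchanged, but a value just above `INT_MAX` reaches the kernel as a negative number on mainstream compilers. The assignment to `DistanceFunc` typically compiles without a warning because the conversion happens inside the standard library's call wrapper.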

Comment on lines +6 to +11
setup_compiler_march_for_x86(TURBO_MARCH_FLAG_SSE TURBO_MARCH_FLAG_AVX2 TURBO_MARCH_FLAG_AVX512)
set_source_files_properties(
${CMAKE_CURRENT_SOURCE_DIR}/euclidean/avx512_impl.cc
PROPERTIES
COMPILE_FLAGS "${TURBO_MARCH_FLAG_AVX512}"
)

TURBO_MARCH_FLAG_AVX2 computed but never applied

setup_compiler_march_for_x86 is called with all three flag variables, and TURBO_MARCH_FLAG_AVX512 is correctly applied to avx512_impl.cc. However TURBO_MARCH_FLAG_AVX2 is never used in any set_source_files_properties call — avx2_impl.cc (and avx2_impl.h) are currently empty placeholders, but when AVX2 code is eventually added there will be no compile-time reminder to apply the flag. Consider adding a commented-out placeholder now so it is not forgotten:

# set_source_files_properties(
#     ${CMAKE_CURRENT_SOURCE_DIR}/euclidean/avx2_impl.cc
#     PROPERTIES
#     COMPILE_FLAGS "${TURBO_MARCH_FLAG_AVX2}"
# )
