
RISC-V: Add RVV vectorized FindMatchLength optimization #233

Open
zhanchangbao-sanechips wants to merge 1 commit into google:main from zhanchangbao-sanechips:rvvopt

@zhanchangbao-sanechips

Summary

This PR adds a RISC-V Vector (RVV) optimization for the FindMatchLength() function in the Snappy compression library. The new path uses RVV instructions to compare 16 bytes in parallel, which improves ZFlat compression throughput by 2.67% on average, and by up to 13.03% on individual inputs, on the tested RISC-V hardware.

Motivation

The Snappy compression algorithm spends a significant portion of its time in FindMatchLength() during the compression phase. On RISC-V platforms with RVV support, we can accelerate this critical path by using vector instructions to perform parallel byte comparisons.

Changes Made

  • Added RVV vectorized loop in FindMatchLength() to process 16-byte blocks in parallel
  • Used RVV intrinsics: __riscv_vsetvl_e8m1(), __riscv_vle8_v_u8m1(), __riscv_vmsne_vv_u8m1_b8(), __riscv_vfirst_m_b8()
  • Maintained full backward compatibility: non-RISC-V platforms are completely unaffected
  • Preserved original 8-byte scalar loop as fallback for remaining data (< 32 bytes)
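
As a rough illustration of how the four intrinsics listed above fit together, the sketch below finds the first differing byte in a 16-byte block. `FirstMismatch16` is a hypothetical helper name, not a function from this PR. The `__riscv_vector` guard is an assumption made so the sketch is self-contained (the PR itself keys on `SNAPPY_HAVE_RVV`), and the non-RVV branch is a plain scalar stand-in so the code builds on any platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

// Hypothetical helper (not from the PR): returns the index of the first byte
// at which a[] and b[] differ within a 16-byte block, or 16 if all 16 match.
// Both pointers must have at least 16 readable bytes.
inline size_t FirstMismatch16(const uint8_t* a, const uint8_t* b) {
  assert(a != nullptr && b != nullptr);
  constexpr size_t kBlock = 16;
#if defined(__riscv_vector)
  size_t done = 0;
  while (done < kBlock) {
    // vsetvl may grant fewer elements than requested on narrow vector units,
    // so the loop advances by the granted length.
    const size_t vl = __riscv_vsetvl_e8m1(kBlock - done);
    vuint8m1_t va = __riscv_vle8_v_u8m1(a + done, vl);
    vuint8m1_t vb = __riscv_vle8_v_u8m1(b + done, vl);
    // Mask bit i is set where va[i] != vb[i].
    vbool8_t ne = __riscv_vmsne_vv_u8m1_b8(va, vb, vl);
    // Index of the first set mask bit, or -1 if none is set.
    const long first = __riscv_vfirst_m_b8(ne, vl);
    if (first >= 0) return done + static_cast<size_t>(first);
    done += vl;
  }
  return kBlock;
#else
  // Scalar stand-in with the same contract.
  for (size_t i = 0; i < kBlock; ++i) {
    if (a[i] != b[i]) return i;
  }
  return kBlock;
#endif
}
```

A caller would add the returned index to its running match length and stop as soon as the result is less than 16.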

Implementation Details

The RVV optimization is strategically placed between SNAPPY_PREFETCH and the scalar 8-byte loop:

  1. RVV loop: Handles 32+ byte chunks with 16-byte parallelism using vector comparisons
  2. Scalar 8-byte loop: Handles 16-31 byte remainder (original code preserved)
  3. Byte-by-byte loop: Handles final <16 bytes (original code preserved)

This layered approach uses the vector path while plenty of input remains and keeps the existing scalar paths for short remainders, so the original code still handles every tail case.
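
The three tiers could be sketched roughly as follows. This is a portable illustration, not the PR's code: `LayeredMatchLength` and `BlockMismatch16` are hypothetical names, the 16-byte block compare is written in scalar C++ as a stand-in for the RVV compare, and the thresholds (32 and 16 bytes remaining) are taken from the list above. The sketch also assumes `s1` has at least as many readable bytes as `s2`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for the vector compare: index of the first differing byte within
// 16 bytes, or 16 if the whole block matches.
inline size_t BlockMismatch16(const uint8_t* a, const uint8_t* b) {
  size_t i = 0;
  while (i < 16 && a[i] == b[i]) ++i;
  return i;
}

// Hypothetical sketch of the tiered structure. Returns how many leading bytes
// of s1 and s2 are equal, reading s2 only up to s2_limit.
inline size_t LayeredMatchLength(const uint8_t* s1, const uint8_t* s2,
                                 const uint8_t* s2_limit) {
  assert(s2 <= s2_limit);
  const size_t n = static_cast<size_t>(s2_limit - s2);
  size_t matched = 0;

  // Tier 1: 16 bytes per step while at least 32 bytes remain.
  while (n - matched >= 32) {
    const size_t k = BlockMismatch16(s1 + matched, s2 + matched);
    matched += k;
    if (k < 16) return matched;  // Mismatch located inside this block.
  }

  // Tier 2: 8 bytes per step while at least 16 bytes remain.
  while (n - matched >= 16) {
    uint64_t a;
    uint64_t b;
    std::memcpy(&a, s1 + matched, sizeof(a));
    std::memcpy(&b, s2 + matched, sizeof(b));
    if (a != b) break;  // Tier 3 pins down the exact byte.
    matched += 8;
  }

  // Tier 3: byte-by-byte for whatever is left.
  while (matched < n && s1[matched] == s2[matched]) ++matched;
  return matched;
}
```

In this sketch tier 2 simply hands a mismatching word to the byte loop. Production code would more likely locate the differing byte from the XOR of the two words.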

Performance Results

Test Environment

  • Hardware: Banana Pi K1 (SpacemiT X60)
  • CPU: 8-core X60 @ 1.6GHz
  • Vector Length: VLEN=256 bits
  • Compiler: GCC with RVV support

ZFlat (Compression) - Key Improvements

| Benchmark     | Data    | Before (MiB/s) | After (MiB/s) | Improvement |
|---------------|---------|----------------|---------------|-------------|
| BM_ZFlat/11/1 | gaviota | 69.94          | 79.05         | +13.03%     |
| BM_ZFlat/10/1 | pb      | 132.41         | 145.61        | +9.97%      |
| BM_ZFlat/4/1  | pdf     | 616.88         | 672.88        | +9.08%      |
| BM_ZFlat/0/1  | html    | 117.95         | 127.54        | +8.13%      |
| BM_ZFlat/5/1  | html4   | 100.62         | 106.11        | +5.45%      |
| BM_ZFlat/6/1  | txt1    | 46.00          | 48.09         | +4.55%      |
| BM_ZFlat/11/2 | gaviota | 32.15          | 33.55         | +4.35%      |
| BM_ZFlat/1/1  | urls    | 41.78          | 43.37         | +3.79%      |
| ZFlat Average | -       | -              | -             | +2.67%      |
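
The Improvement column appears to be the relative throughput change, (after - before) / before. A minimal sketch of that arithmetic follows. `ImprovementPercent` is a hypothetical helper, not part of the benchmark tooling. Because the table shows throughputs rounded to two decimals, recomputing from them can differ from the listed percentages in the last digit.

```cpp
#include <cassert>

// Hypothetical helper: relative throughput change in percent.
// Positive values mean the "after" build is faster.
inline double ImprovementPercent(double before_mibps, double after_mibps) {
  assert(before_mibps > 0.0);
  return (after_mibps - before_mibps) / before_mibps * 100.0;
}
```

For the gaviota row, `ImprovementPercent(69.94, 79.05)` evaluates to roughly 13.03, which matches the table.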

Other Operations

| Operation          | Average Improvement | Assessment               |
|--------------------|---------------------|--------------------------|
| UIOVecSource       | +2.33%              | Unexpected bonus         |
| UFlat (Decompress) | -0.50%              | Within measurement noise |
| UValidate          | -0.11%              | Within measurement noise |
| UIOVecSink         | -0.07%              | Within measurement noise |
| UFlatSink          | -1.18%              | Dominated by JPG regression |

Key Observations

  • The gaviota, pb, pdf, and html inputs show the largest improvements (+5% to +13%); txt1 gains +4.55%
  • Pre-compressed data (jpg) shows minor regression: -1.75% (acceptable, as JPG is rarely Snappy-compressed)
  • Overall positive impact with no significant side effects on decompression

Test Repeatability

Three independent test runs confirm consistent and reproducible results:

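| Run     | ZFlat Improvement | UFlat Improvement | Notes                 |
|---------|-------------------|-------------------|-----------------------|
| 1       | +2.29%            | -0.52%            | Initial test          |
| 2       | +2.67%            | -0.52%            | Consistent with run 1 |
| 3       | +2.67%            | -0.50%            | Confirms stability    |
| Average | +2.54%            | -0.51%            | Highly consistent     |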

Stability: 21 out of 24 test cases showed <1% variance across all three runs, indicating high test-retest reliability.

Compatibility and Portability

RISC-V with RVV Support

  • Automatically detected at compile time via __riscv && SNAPPY_HAVE_RVV
  • Uses vectorized path for optimal performance

RISC-V without RVV Support

  • Gracefully falls back to existing scalar code
  • No performance degradation

Non-RISC-V Platforms (x86_64, ARM64, etc.)

  • No change to compiled code - the RVV code is completely guarded by preprocessor conditionals
  • Zero performance impact - existing optimizations (SSE, NEON, CRC32) continue to work unchanged
  • Zero maintenance burden - no modifications to existing platform-specific code paths
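
The compile-time selection described in this section might look roughly like the guard below. `__riscv` and `SNAPPY_HAVE_RVV` come from the description above; the `SKETCH_USE_RVV` macro and `UsesRvvPath()` are hypothetical names used only for illustration, and how the build system defines `SNAPPY_HAVE_RVV` is not shown here. The sketch assumes `SNAPPY_HAVE_RVV` is either undefined or defined to 0 or 1.

```cpp
// An undefined SNAPPY_HAVE_RVV evaluates to 0 inside #if, so builds that
// never define it take the scalar branch.
#if defined(__riscv) && SNAPPY_HAVE_RVV
#define SKETCH_USE_RVV 1
#else
#define SKETCH_USE_RVV 0
#endif

// Hypothetical query, usable in static_assert or ordinary code.
constexpr bool UsesRvvPath() { return SKETCH_USE_RVV != 0; }
```

On x86_64 or ARM64 builds `__riscv` is undefined, so `UsesRvvPath()` is false and only the existing code paths are compiled.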

Testing

  • snappy_unittest passes all tests
  • snappy_benchmark verified on RISC-V hardware (Banana Pi K1)
  • Three independent benchmark runs to check repeatability
  • No regressions on x86_64 (verified by CI)
  • Backward compatibility verified: non-RISC-V builds unchanged

Checklist

  • Code follows Google C++ style guide
  • Comments added for non-obvious logic and RVV-specific operations
  • Performance data included with multiple independent test runs
  • Full backward compatibility maintained
  • No breaking changes to API or behavior
  • All existing unit tests pass

Future Work (Out of Scope for This PR)

  • RISC-V Zba extension optimization for hash table lookups (separate PR)
  • RVV optimization for MemCopy64 operations (separate PR)
  • Dynamic heuristic for RVV/scalar path selection based on data characteristics

Screenshots

Unit Tests - All Pass

[Screenshot: snappy_unittest_pass]

Benchmark - Before Optimization

[Screenshot: benchmark_before]

Benchmark - After Optimization

[Screenshot: benchmark_after]

Add vectorized match length computation using RVV instructions.
Processes 16 bytes in parallel with __riscv_vle8_v_u8m1 and
__riscv_vmsne_vv_u8m1_b8.