RISC-V: Add RVV vectorized FindMatchLength optimization by zhanchangbao-sanechips · Pull Request #233 · google/snappy

zhanchangbao-sanechips · 2026-04-16T07:24:33Z

Summary

This PR adds RISC-V Vector (RVV) optimization for the FindMatchLength() function in the Snappy compression library. The optimization leverages RVV instructions to compare 16 bytes in parallel, resulting in improved compression performance on RISC-V platforms.

Motivation

During compression, Snappy spends a significant portion of its time in FindMatchLength(), which counts how many bytes at the current input position match an earlier occurrence found through the hash table. On RISC-V platforms with RVV support, we can accelerate this critical path by using vector instructions to compare bytes in parallel.

Changes Made

Added RVV vectorized loop in FindMatchLength() to process 16-byte blocks in parallel
Used RVV intrinsics: __riscv_vsetvl_e8m1(), __riscv_vle8_v_u8m1(), __riscv_vmsne_vv_u8m1_b8(), __riscv_vfirst_m_b8()
Maintained full backward compatibility: non-RISC-V platforms are completely unaffected
Preserved original 8-byte scalar loop as fallback for remaining data (< 32 bytes)
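To show how the four intrinsics listed above fit together, here is a minimal sketch of a single 16-byte comparison step. It is an illustration, not the PR's actual code. `FirstMismatch16` is a hypothetical helper name: it returns the index of the first differing byte in a 16-byte block, or 16 if the block matches. The RVV branch assumes a toolchain that defines `__riscv_vector` and ships `<riscv_vector.h>`. On other targets, a scalar loop with the same contract is compiled instead.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

// Hypothetical helper (illustration only, not from the PR): returns the index
// of the first byte in [0, 16) at which a and b differ, or 16 if all match.
inline size_t FirstMismatch16(const char* a, const char* b) {
#if defined(__riscv_vector)
  const uint8_t* pa = reinterpret_cast<const uint8_t*>(a);
  const uint8_t* pb = reinterpret_cast<const uint8_t*>(b);
  size_t done = 0;
  // Strip-mine so the sketch stays correct even if VLEN/8 < 16; with the
  // full V extension (VLEN >= 128) this loop body runs exactly once.
  while (done < 16) {
    size_t vl = __riscv_vsetvl_e8m1(16 - done);
    vuint8m1_t va = __riscv_vle8_v_u8m1(pa + done, vl);
    vuint8m1_t vb = __riscv_vle8_v_u8m1(pb + done, vl);
    vbool8_t ne = __riscv_vmsne_vv_u8m1_b8(va, vb, vl);  // 1 where bytes differ
    long first = __riscv_vfirst_m_b8(ne, vl);            // -1 if no bit is set
    if (first >= 0) return done + static_cast<size_t>(first);
    done += vl;
  }
  return 16;
#else
  // Portable model of the same contract for non-RVV builds.
  for (size_t i = 0; i < 16; ++i) {
    if (a[i] != b[i]) return i;
  }
  return 16;
#endif
}
```

On the test hardware (VLEN=256), `__riscv_vsetvl_e8m1(16)` returns 16 because VLMAX for e8m1 is 32. A single load/compare/`vfirst` sequence therefore covers the whole block, and the position of the first set mask bit is directly the match length within the block.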

Implementation Details

The RVV loop is placed after SNAPPY_PREFETCH and before the scalar 8-byte loop, giving three tiers:

RVV loop: Runs while at least 32 bytes remain, comparing 16 bytes per iteration with vector instructions
Scalar 8-byte loop: Handles 16-31 byte remainder (original code preserved)
Byte-by-byte loop: Handles final <16 bytes (original code preserved)

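This layered approach keeps the change additive. Only long matches reach the RVV loop, and the original scalar loops still handle every tail, so inputs with fewer than 32 bytes remaining follow the same path as before.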
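The three tiers can be modeled portably as follows. This is a simplified sketch under stated assumptions, not Snappy's code. `FindMatchLengthSketch` and `FirstMismatch16Scalar` are hypothetical names. The sketch returns only the match length, whereas Snappy's real `FindMatchLength` has a different signature. A scalar loop stands in for the RVV comparison. The loop bounds follow the thresholds described above: the 16-byte tier runs while at least 32 bytes remain, the 8-byte tier while at least 16 remain, and the byte loop finishes the tail.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Scalar stand-in for the RVV step (illustration only): index of the first
// differing byte in [0, 16), or 16 if the whole block matches.
static size_t FirstMismatch16Scalar(const char* a, const char* b) {
  for (size_t i = 0; i < 16; ++i) {
    if (a[i] != b[i]) return i;
  }
  return 16;
}

// Hypothetical, simplified model of the layered loop: returns how many bytes
// starting at s2 equal the bytes starting at s1, reading s2 only below s2_limit.
size_t FindMatchLengthSketch(const char* s1, const char* s2,
                             const char* s2_limit) {
  size_t matched = 0;

  // Tier 1 (the RVV loop in the PR): 16 bytes per step while >= 32 remain.
  while (s2_limit - s2 >= 32) {
    size_t n = FirstMismatch16Scalar(s1 + matched, s2);
    matched += n;
    if (n < 16) return matched;
    s2 += 16;
  }

  // Tier 2 (original scalar loop): 8 bytes per step while >= 16 remain.
  while (s2_limit - s2 >= 16) {
    uint64_t a, b;
    std::memcpy(&a, s1 + matched, sizeof(a));
    std::memcpy(&b, s2, sizeof(b));
    if (a != b) break;  // mismatch lies in these 8 bytes; tier 3 locates it
    s2 += 8;
    matched += 8;
  }

  // Tier 3 (original byte loop): one byte at a time up to s2_limit.
  while (s2 < s2_limit && s1[matched] == *s2) {
    ++s2;
    ++matched;
  }
  return matched;
}
```

For brevity, this sketch lets the byte loop pinpoint a mismatch detected by the 8-byte tier. The structural point is that the vector tier only ever consumes whole 16-byte blocks while at least 32 bytes remain, so the original scalar tiers always see the same kind of tail they handled before the change.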

Performance Results

Test Environment

Hardware: Banana Pi K1 (SpacemiT X60)
CPU: 8-core X60 @ 1.6GHz
Vector Length: VLEN=256 bits
Compiler: GCC with RVV support

ZFlat (Compression) - Key Improvements

Benchmark	Data	Before (MiB/s)	After (MiB/s)	Improvement
BM_ZFlat/11/1	gaviota	69.94	79.05	+13.03%
BM_ZFlat/10/1	pb	132.41	145.61	+9.97%
BM_ZFlat/4/1	pdf	616.88	672.88	+9.08%
BM_ZFlat/0/1	html	117.95	127.54	+8.13%
BM_ZFlat/5/1	html4	100.62	106.11	+5.45%
BM_ZFlat/6/1	txt1	46.00	48.09	+4.55%
BM_ZFlat/11/2	gaviota	32.15	33.55	+4.35%
BM_ZFlat/1/1	urls	41.78	43.37	+3.79%
ZFlat Average	-	-	-	+2.67%
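The PR does not state how the Improvement column is computed, but the rows match the relative throughput change. Assuming that definition, the formula and the gaviota row as a worked example are:

```latex
\text{Improvement} = \left(\frac{\text{After}}{\text{Before}} - 1\right) \times 100\%,
\qquad
\text{gaviota: } \left(\frac{79.05}{69.94} - 1\right) \times 100\% \approx 13.03\%
```

Recomputing other rows from the displayed two-decimal MiB/s values can differ from the listed percentage by a few hundredths, which is consistent with rounding of the displayed throughputs.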

Other Operations

Operation	Average Improvement	Assessment
UIOVecSource	+2.33%	Unexpected bonus
UFlat (Decompress)	-0.50%	Within measurement noise
UValidate	-0.11%	Within measurement noise
UIOVecSink	-0.07%	Within measurement noise
UFlatSink	-1.18%	Dominated by JPG regression

Key Observations

Text-like data (html, pdf, txt, gaviota) shows the largest improvements, from +4.55% (txt1) to +13.03% (gaviota)
Already-compressed data (jpg) shows a minor regression of -1.75% (acceptable, as JPEG data gains little from Snappy and is rarely compressed with it)
Overall positive impact with no significant side effects on decompression

Test Repeatability

Three independent test runs confirm consistent and reproducible results:

Run	ZFlat Improvement	UFlat Improvement	Notes
1	+2.29%	-0.52%	Initial test
2	+2.67%	-0.52%	Consistent with run 1
3	+2.67%	-0.50%	Confirms stability
Average	+2.54%	-0.51%	Highly consistent
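The Average row is the arithmetic mean of the three runs:

```latex
\overline{\Delta}_{\text{ZFlat}} = \frac{2.29 + 2.67 + 2.67}{3} \approx 2.54\%,
\qquad
\overline{\Delta}_{\text{UFlat}} = \frac{-0.52 - 0.52 - 0.50}{3} \approx -0.51\%
```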

Stability: 21 out of 24 test cases showed <1% variance across all three runs, indicating high test-retest reliability.

Compatibility and Portability

RISC-V with RVV Support

Automatically detected at compile time via __riscv && SNAPPY_HAVE_RVV
Uses vectorized path for optimal performance

RISC-V without RVV Support

Gracefully falls back to existing scalar code
No performance degradation

Non-RISC-V Platforms (x86_64, ARM64, etc.)

No change to compiled code - the RVV path is completely guarded by preprocessor conditionals
Zero performance impact - existing optimizations (SSE, NEON, CRC32) continue to work unchanged
Zero maintenance burden - no modifications to existing platform-specific code paths
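Here is a minimal sketch of the compile-time guard described above. The PR names the condition `__riscv && SNAPPY_HAVE_RVV` but does not show how `SNAPPY_HAVE_RVV` is set, so deriving it from the compiler's `__riscv_vector` macro is an assumption. `UsesRvvPath` is a hypothetical function that only reports which branch a given build compiles.

```cpp
// Assumption (not from the PR): default SNAPPY_HAVE_RVV from the compiler's
// V-extension macro when the build system has not already defined it.
#ifndef SNAPPY_HAVE_RVV
#if defined(__riscv_vector)
#define SNAPPY_HAVE_RVV 1
#else
#define SNAPPY_HAVE_RVV 0
#endif
#endif

// Hypothetical helper: reports which FindMatchLength path this build compiles.
constexpr bool UsesRvvPath() {
#if defined(__riscv) && SNAPPY_HAVE_RVV
  return true;   // RVV loop compiled in, ahead of the original scalar loops
#else
  return false;  // only the original scalar loops are compiled
#endif
}
```

On x86_64 or ARM64, `__riscv` is undefined, so the preprocessor drops the RVV branch before compilation. This is why those builds see no new code and no new dependency on `<riscv_vector.h>`.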

Testing

snappy_unittest passes all tests
snappy_benchmark verified on RISC-V hardware (Banana Pi K1)
Three independent test runs for statistical validity
No regressions on x86_64 (verified by CI)
Backward compatibility verified: non-RISC-V builds unchanged

Checklist

Code follows the Google C++ Style Guide
Comments added for non-obvious logic and RVV-specific operations
Performance data included with multiple independent test runs
Full backward compatibility maintained
No breaking changes to API or behavior
All existing unit tests pass

Future Work (Out of Scope for This PR)

RISC-V Zba extension optimization for hash table lookups (separate PR)
RVV optimization for MemCopy64 operations (separate PR)
Dynamic heuristic for RVV/scalar path selection based on data characteristics

Screenshots

Unit Tests - All Pass

Benchmark - Before Optimization

Benchmark - After Optimization

Add vectorized match length computation using RVV instructions. Processes 16 bytes in parallel with __riscv_vle8_v_u8m1 and __riscv_vmsne_vv_u8m1_b8.

RISC-V: Add RVV optimization for FindMatchLength()

6078470


zhanchangbao-sanechips wants to merge 1 commit into google:main from zhanchangbao-sanechips:rvvopt
