tr: add ASCII range translation fast path #12118

Open
parasol-aser wants to merge 3 commits into uutils:main from parasol-aser:perf/P002

Conversation

@parasol-aser

What

Adds a narrow fast path for bytewise ASCII range translations such as:

tr 'a-z' 'A-Z'

The change detects translation tables that modify one contiguous ASCII range by a constant wrapping delta, then processes that range with an AVX2 range compare plus masked add on x86/x86_64 hosts that support AVX2. Other translations continue to use the existing single-byte or table-lookup paths, and non-AVX2 hosts use the scalar fallback.
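
The detection step described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the `AsciiRange` struct, `detect_ascii_range`, and the `table_for` helper are hypothetical names, and the sketch assumes the translation is already available as a 256-entry byte table.

```rust
/// Hypothetical description of a fast-path candidate: bytes in `lo..=hi`
/// get `delta` added (wrapping); every other byte maps to itself.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
struct AsciiRange {
    lo: u8,
    hi: u8,
    delta: u8,
}

/// Returns `Some` only if `table` changes exactly one contiguous ASCII range
/// and shifts every byte in that range by the same wrapping delta.
fn detect_ascii_range(table: &[u8; 256]) -> Option<AsciiRange> {
    // First and last byte that the table actually changes.
    let lo = (0..=255u8).find(|&b| table[b as usize] != b)?;
    let hi = (0..=255u8).rev().find(|&b| table[b as usize] != b)?;
    if hi > 0x7f {
        return None; // the changed range must be pure ASCII
    }
    let delta = table[lo as usize].wrapping_sub(lo);
    // Every byte between lo and hi must move by the same delta. A byte that
    // maps to itself inside the range has delta 0 and is rejected here.
    let uniform = (lo..=hi).all(|b| table[b as usize].wrapping_sub(b) == delta);
    uniform.then_some(AsciiRange { lo, hi, delta })
}

/// Example helper: identity table with `from` remapped onto the range
/// starting at `to_start`, as in `tr 'a-z' 'A-Z'`.
fn table_for(from: std::ops::RangeInclusive<u8>, to_start: u8) -> [u8; 256] {
    let mut table = [0u8; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        *slot = i as u8;
    }
    for (offset, b) in from.enumerate() {
        table[b as usize] = to_start.wrapping_add(offset as u8);
    }
    table
}
```

For `tr 'a-z' 'A-Z'` the delta is `b'A' - b'a'` taken modulo 256, i.e. `0xE0`. A table that leaves a gap inside the range, or that touches any byte above `0x7f`, is not a candidate in this sketch and would stay on the existing paths.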

Why

Before this change, tr 'a-z' 'A-Z' mapped every byte through a scalar 256-byte translation table. This is a common case, and it can be handled more directly by checking whether each byte falls within the translated ASCII range and adding the fixed delta.
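
To make the comparison concrete, here is a scalar sketch of the two approaches. The function names are illustrative and not taken from the uutils source; the point is only that the range check plus wrapping add produces the same bytes as the table lookup for this kind of translation.

```rust
/// Table-lookup path: one indexed load per input byte.
fn translate_with_table(buf: &mut [u8], table: &[u8; 256]) {
    for b in buf.iter_mut() {
        *b = table[*b as usize];
    }
}

/// Range path: test the byte against the range bounds and add the fixed
/// delta (wrapping) on a hit; all other bytes pass through unchanged.
fn translate_with_range(buf: &mut [u8], lo: u8, hi: u8, delta: u8) {
    for b in buf.iter_mut() {
        let byte = *b;
        if (lo..=hi).contains(&byte) {
            *b = byte.wrapping_add(delta);
        }
    }
}
```

The range form has no data-dependent memory access, which is what makes it straightforward to vectorize.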

Measurements

Environment:

  • CPU: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
  • OS: Linux x86_64
  • Rust: rustc 1.92.0
  • Candidate branch: perf/P002
  • Baseline commit: 4b5a2af7a916910bfeaf46b298a963d8a038565a
  • hyperfine was not installed, so this used /usr/bin/time, 2 warmups, and 12 measured runs.

Input was corpus/large_text.txt repeated 16 times, 1,342,178,256 bytes total.

/usr/bin/time -f '%e %M' ./runs/P002-rerun/bin/tr-baseline 'a-z' 'A-Z' < runs/P002-rerun/input/large_text_x16.txt > /dev/null
/usr/bin/time -f '%e %M' ./runs/P002-rerun/bin/tr-p002 'a-z' 'A-Z' < runs/P002-rerun/input/large_text_x16.txt > /dev/null
/usr/bin/time -f '%e %M' /usr/bin/tr 'a-z' 'A-Z' < runs/P002-rerun/input/large_text_x16.txt > /dev/null

| implementation | mean wall time | stddev | throughput |
| --- | --- | --- | --- |
| uutils baseline | 1.068 s | 0.129 s | 1198.1 MiB/s |
| uutils candidate | 0.421 s | 0.029 s | 3041.6 MiB/s |
| GNU tr | 1.251 s | 0.209 s | 1023.3 MiB/s |

The candidate is 2.54x faster than the uutils baseline on this workload, a 60.6% wall-time reduction. The earlier 80 MiB pipeline benchmark also showed a 53.8% reduction, from 0.065 s to 0.030 s.
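
For readers who want to re-derive these figures, the arithmetic is sketched below. The helper names are made up for this example. Throughput recomputed from the rounded means in the table lands within about 0.1% of the reported values, presumably because the reported numbers come from unrounded timings.

```rust
const MIB: f64 = 1024.0 * 1024.0;

/// Throughput in MiB/s for `bytes` processed in `seconds`.
fn throughput_mib_per_s(bytes: u64, seconds: f64) -> f64 {
    bytes as f64 / MIB / seconds
}

/// How many times faster the candidate is than the baseline.
fn speedup(baseline_s: f64, candidate_s: f64) -> f64 {
    baseline_s / candidate_s
}

/// Wall-time reduction as a percentage of the baseline.
fn reduction_percent(baseline_s: f64, candidate_s: f64) -> f64 {
    (1.0 - candidate_s / baseline_s) * 100.0
}
```

With the table means, `speedup(1.068, 0.421)` is about 2.54 and `reduction_percent(1.068, 0.421)` is about 60.6; `reduction_percent(0.065, 0.030)` is about 53.8 for the earlier pipeline benchmark.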

Correctness

For the 1.3 GB input, baseline uutils, candidate uutils, and GNU tr produced the same transformed output SHA256:

6f2d6cb371ca0b423a90a5690ee7f6dac0be6a7d889f308ff5b15f2957e853db

Tests

cargo test --release --test tests -- --nocapture --test-threads=1 test_tr::test_ascii_range_translate_alignment_boundaries
cargo clippy --release -p uu_tr -- -D warnings
cargo fmt --check --package uu_tr

The regression test covers 0-, 1-, 31-, 32-, and 33-byte inputs around the 32-byte AVX2 vector width, all byte values, and a UTF-8/non-ASCII boundary case, with GNU parity when GNU tr is available.

Caveats

The speedup is from the AVX2 path on this x86_64 host. Non-AVX2 targets use the scalar fallback and should be behavior-preserving, but I did not benchmark those targets here.

jeffhuang added 2 commits May 1, 2026 22:36
Files: src/uu/tr/src/operation.rs, src/uu/tr/src/simd.rs.

Mechanism: detect translation tables that change one contiguous ASCII range by a constant wrapping delta, then process those chunks with an AVX2 range-compare and masked add kernel with scalar fallback. The existing single-byte and table-lookup paths remain for non-affine translations.
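
A rough sketch of what such a kernel can look like is given below. It is an assumption-laden illustration rather than the contents of simd.rs: the function names are hypothetical, and the only detail taken from this PR is the family of intrinsics it mentions (`loadu`, `cmpgt`, `blendv`, `storeu`). The sketch relies on the range being ASCII so that signed byte compares are valid.

```rust
/// Scalar fallback: shift bytes in `lo..=hi` by a wrapping `delta`.
fn translate_range_scalar(buf: &mut [u8], lo: u8, hi: u8, delta: u8) {
    for b in buf.iter_mut() {
        if *b >= lo && *b <= hi {
            *b = b.wrapping_add(delta);
        }
    }
}

/// AVX2 kernel sketch. Precondition: `lo <= hi <= 0x7f`.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn translate_range_avx2(buf: &mut [u8], lo: u8, hi: u8, delta: u8) {
    #[cfg(target_arch = "x86")]
    use std::arch::x86::*;
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    debug_assert!(lo <= hi && hi <= 0x7f);
    let mut chunks = buf.chunks_exact_mut(32);
    unsafe {
        // Signed compares work because the range is ASCII: bytes >= 0x80
        // look negative and always fail the lower-bound test.
        let below_lo = _mm256_set1_epi8((lo as i8).wrapping_sub(1)); // lo == 0 gives -1
        let hi_v = _mm256_set1_epi8(hi as i8);
        let delta_v = _mm256_set1_epi8(delta as i8);
        for chunk in &mut chunks {
            let ptr = chunk.as_mut_ptr() as *mut __m256i;
            let v = _mm256_loadu_si256(ptr as *const __m256i);
            let ge_lo = _mm256_cmpgt_epi8(v, below_lo); // v >= lo
            let gt_hi = _mm256_cmpgt_epi8(v, hi_v); // v > hi
            let in_range = _mm256_andnot_si256(gt_hi, ge_lo); // !gt_hi & ge_lo
            let shifted = _mm256_add_epi8(v, delta_v); // wrapping byte add
            // Take `shifted` where the mask is set, the original byte elsewhere.
            _mm256_storeu_si256(ptr, _mm256_blendv_epi8(v, shifted, in_range));
        }
    }
    // Fewer than 32 bytes remain; finish them with the scalar loop.
    translate_range_scalar(chunks.into_remainder(), lo, hi, delta);
}

/// Runtime dispatch: use AVX2 when the CPU reports it, otherwise scalar.
fn translate_range(buf: &mut [u8], lo: u8, hi: u8, delta: u8) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if lo <= hi && hi <= 0x7f && is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was confirmed at runtime just above.
            unsafe { translate_range_avx2(buf, lo, hi, delta) };
            return;
        }
    }
    translate_range_scalar(buf, lo, hi, delta);
}
```

Testing `v > hi` and negating it, rather than testing `v < hi + 1`, avoids overflowing `i8` when `hi` is `0x7f`.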

Predicted delta: tr/tr_lower_to_upper_large_text_stdout_discarded should improve by 10-20% versus the 0.065s baseline on AVX2 hosts.
Covers ASCII range translation for a-z to A-Z at input lengths 0, 1, 31, 32, and 33 around the AVX2 lane width, plus all byte values and a UTF-8 boundary/non-ASCII case.

Assertions cover exit code success, empty stderr, byte-exact stdout, and GNU tr parity when a GNU tr binary is available on PATH.

Test command used in this repo: cargo test --release --test tests -- --nocapture --test-threads=1 test_tr::test_ascii_range_translate_alignment_boundaries. The requested package-scoped command cargo test --release -p uu_tr --test test_tr -- --nocapture --test-threads=1 test_ascii_range_translate_alignment_boundaries is unavailable because uu_tr has no test_tr target.

parasol-aser marked this pull request as ready for review May 2, 2026 03:12

@github-actions

github-actions Bot commented May 2, 2026

GNU testsuite comparison:

Skip an intermittent issue tests/cut/bounded-memory (fails in this run but passes in the 'main' branch)
Note: The gnu test tests/cp/link-heap is now being skipped but was previously passing.
Note: The gnu test tests/rm/many-dir-entries-vs-OOM is now being skipped but was previously passing.
Note: The gnu test tests/env/env-signal-handler was skipped on 'main' but is now failing.

@oech3

This comment was marked as resolved.

The intrinsics blendv, cmpgt, loadu, and storeu introduced in the AVX2
range translation kernel are flagged by cspell. Annotate the file so the
style/spelling CI job stays green.
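
For illustration, a cspell in-file directive of this kind is a single comment line. The exact line added in this PR is not shown in the conversation, so the placement (top of the SIMD source file) and the word list (taken from the commit message above) are assumptions:

```rust
// spell-checker:ignore blendv cmpgt loadu storeu
```
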
@parasol-aser
Author

> You need to add spell-checker: ignore

@oech3 added, thanks.

@sylvestre
Contributor

could you please write a benchmark that can be executed with codspeed
and run the benchmark with hyperfine and not time? thanks

@parasol-aser
Author

@sylvestre done in ffc64b6:

  • codspeed bench at src/uu/tr/benches/tr_bench.rs (divan), uu_tr added to .github/workflows/benchmarks.yml. Run locally: cargo bench -p uu_tr.
  • Re-ran with hyperfine (3 warmups, 20 runs) on the 1.3 GB input:

| Command | Mean [s] | Throughput | Relative |
| --- | --- | --- | --- |
| uutils baseline | 1.023 ± 0.029 | 1251 MiB/s | 2.24× |
| uutils candidate | 0.456 ± 0.028 | 2807 MiB/s | 1.00× |
| GNU tr | 1.087 ± 0.019 | 1177 MiB/s | 2.38× |

PR description updated with the full table and bench details.

@oech3
Contributor

oech3 commented May 7, 2026

Could you split the benchmark into a separate PR, without the unused import?

@parasol-aser
Author

@oech3 split per your suggestion: bench moved to #12189 (against main, fixed the unused-import that the Windows clippy job was flagging by gating the uses with the same #[cfg(unix)] as the bench fns). This PR is now perf-only.

@sylvestre — codspeed coverage will land via #12189.
