@WaterWhisperer
Contributor

Fixes: #9971

GNU join uses LC_COLLATE for field comparison. This PR (ref expr) implements locale-aware string comparison using uucore's i18n::collator module.

Reproduce:

coreutils$ export LC_ALL=en_US.UTF-8
coreutils$ cat > f1 <<'EOF'
ab:d  1
abc:d 2
EOF
coreutils$ cat > f2 <<'EOF'
ab:d  x
abc:d y
EOF
coreutils$ sort -k1,1 f1 > f1.sorted
coreutils$ sort -k1,1 f2 > f2.sorted
coreutils$ /usr/bin/join --check-order f1.sorted f2.sorted
abc:d 2 y
ab:d 1 x
coreutils$ ./target/release/join --check-order f1.sorted f2.sorted 
abc:d 2 y
ab:d 1 x
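
A rough illustration of why the comparison function matters here. This is a toy model, not glibc's collation tables and not the uucore collator: `toy_collate` is a hypothetical comparator that skips punctuation on a first pass, an assumption chosen only because it reproduces the ordering shown above. `in_order` mimics the spirit of `--check-order` for a list of join keys.

```rust
use std::cmp::Ordering;

/// Plain byte-wise comparison, as used in the C/POSIX locale.
fn byte_compare(a: &str, b: &str) -> Ordering {
    a.as_bytes().cmp(b.as_bytes())
}

/// Toy stand-in for a locale collator (assumption, for illustration only):
/// compare alphanumeric characters first, then break ties on the full string.
fn toy_collate(a: &str, b: &str) -> Ordering {
    let primary = |s: &str| -> String { s.chars().filter(|c| c.is_alphanumeric()).collect() };
    primary(a).cmp(&primary(b)).then_with(|| a.cmp(b))
}

/// Returns true when every key is <= the next one under `cmp`.
fn in_order(keys: &[&str], cmp: fn(&str, &str) -> Ordering) -> bool {
    keys.windows(2).all(|w| cmp(w[0], w[1]) != Ordering::Greater)
}
```

Under this toy model, the keys `["abc:d", "ab:d"]` count as sorted for `toy_collate` but as out of order for `byte_compare`, which is the kind of mismatch a byte-only join would report on locale-sorted input.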

@sylvestre
Contributor

Did you run some benchmarks?

@WaterWhisperer
Contributor Author

> Did you run some benchmarks?

Thanks for reminding me. Here are the benchmark results.
[Screenshot 2026-01-02 18-06-01]
[Screenshot 2026-01-02 18-06-59]
It seems there's a significant performance drop :( so I'm trying to find ways to improve it.

@WaterWhisperer
Contributor Author

I have a few optimization ideas:

  • Cache the locale check at startup: The locale is currently checked on every comparison. We could check it once and store a flag in the Input struct.
  • Fast path for the C locale: Detect the C/POSIX locale early and use direct byte comparison.
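
The two ideas above could be sketched roughly as follows. This is a minimal sketch, not the code in this PR: `CompareMode`, `collate_locale_from`, `mode_for`, and `locale_compare_stand_in` are hypothetical names, the `Input` struct here is a simplified stand-in for the real one, and the locale-aware branch uses a case-insensitive comparison as a placeholder where the real code would call the collator.

```rust
use std::cmp::Ordering;
use std::env;

/// Hypothetical flag, decided once at startup and stored in `Input`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CompareMode {
    Bytes,  // C/POSIX locale: plain byte comparison
    Locale, // any other locale: go through the collator
}

/// Resolve the collation locale using the POSIX precedence
/// LC_ALL > LC_COLLATE > LANG; empty values count as unset.
fn collate_locale_from<F: Fn(&str) -> Option<String>>(lookup: F) -> String {
    for name in ["LC_ALL", "LC_COLLATE", "LANG"] {
        if let Some(value) = lookup(name) {
            if !value.is_empty() {
                return value;
            }
        }
    }
    "C".to_string()
}

/// Assumption: only the exact names "C" and "POSIX" take the fast path.
fn mode_for(locale: &str) -> CompareMode {
    match locale {
        "C" | "POSIX" => CompareMode::Bytes,
        _ => CompareMode::Locale,
    }
}

/// Read the environment once; call this at startup, not per comparison.
fn mode_from_env() -> CompareMode {
    mode_for(&collate_locale_from(|name| env::var(name).ok()))
}

/// Placeholder for the real collator call (assumption, illustration only).
fn locale_compare_stand_in(a: &[u8], b: &[u8]) -> Ordering {
    let a = String::from_utf8_lossy(a).to_lowercase();
    let b = String::from_utf8_lossy(b).to_lowercase();
    a.cmp(&b)
}

/// Simplified stand-in for join's `Input`, carrying the cached flag.
struct Input {
    mode: CompareMode,
}

impl Input {
    fn new(mode: CompareMode) -> Self {
        Input { mode }
    }

    /// Per-comparison cost is a single branch on the cached flag.
    fn compare(&self, a: &[u8], b: &[u8]) -> Ordering {
        match self.mode {
            CompareMode::Bytes => a.cmp(b),
            CompareMode::Locale => locale_compare_stand_in(a, b),
        }
    }
}
```

With something like `Input::new(mode_from_env())` built once at startup, the environment lookup stays out of the hot comparison loop, and the C/POSIX case never touches the collator.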

Initial testing shows this could bring the C locale overhead from ~16% down to ~1%.
Is this performance trade-off acceptable for the project? Would you prefer a different approach?

Happy to iterate on this based on your guidance! @sylvestre

@sylvestre
Contributor

Why not both :)

@WaterWhisperer
Contributor Author

> Why not both :)

Yeah, that's exactly what I did.

@sylvestre
Contributor

I don't see the change :)

Note that we will need a benchmark for join,
in a separate PR if you are interested in doing it.

@WaterWhisperer
Contributor Author

WaterWhisperer commented Jan 2, 2026

> I don't see the change :)

Sorry, I just pushed.

> Note that we will need a benchmark for join,
> in a separate PR if you are interested in doing it.

Sure!

Development

Successfully merging this pull request may close these issues.

join: locale collation should be considered
