Document Unicode locale extension key ks in Collator::setStrength #5559

@masakielastic

Affected page

https://www.php.net/manual/en/collator.setstrength.php

Current issue

The Collator::setStrength() documentation explains ICU collation strength
levels and lists the corresponding Collator constants, such as
Collator::PRIMARY, Collator::SECONDARY, Collator::TERTIARY,
Collator::QUATERNARY, and Collator::IDENTICAL.

However, the page does not mention that collation strength may also be
requested through the Unicode locale extension key ks when creating a
Collator from a locale identifier.

For example, the following two collators report the same strength:

$collator1 = new Collator('en_US');
$collator1->setStrength(Collator::IDENTICAL);

$collator2 = new Collator('en_US-u-ks-identic');

var_dump($collator1->getStrength() === $collator2->getStrength());
// bool(true)
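
To illustrate what the matching strengths mean in practice, here is a minimal sketch. It assumes the intl extension is loaded and that the underlying ICU version honors the ks key, as the verification output later in this issue suggests. The variable names are illustrative only.

```php
<?php

// Primary strength requested through the locale identifier:
// only base letters are significant, so case differences are ignored.
$primary = new Collator('en_US-u-ks-level1');

// No ks key: ICU picks the default strength for the locale.
$default = new Collator('en_US');

$primaryResult = $primary->compare('a', 'A');
$defaultResult = $default->compare('a', 'A');

var_dump($primaryResult === 0); // "a" and "A" compare as equal at primary strength
var_dump($defaultResult !== 0); // case is significant at the default (tertiary) strength
```

The first collator should behave the same as one created with `new Collator('en_US')` followed by `setStrength(Collator::PRIMARY)`.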

Suggested improvement

Add a short note explaining that the Unicode locale extension key ks can
be used to request a collation strength in the locale identifier.

The ks values correspond to the Collator constants as follows:

  • ks-level1 corresponds to Collator::PRIMARY
  • ks-level2 corresponds to Collator::SECONDARY
  • ks-level3 corresponds to Collator::TERTIARY
  • ks-level4 corresponds to Collator::QUATERNARY
  • ks-identic corresponds to Collator::IDENTICAL

Also mention that there is no ks value corresponding to
Collator::DEFAULT_STRENGTH. To get the default strength for the locale,
omit the ks key and let ICU choose it.

This would help users understand the relationship between
Collator::setStrength() and strength requested through locale identifiers.

It is also useful for APIs that accept a locale identifier but do not accept
a Collator object directly.

Additional context (optional)

This behavior is based on the Unicode LDML Collation setting options.
The ks Unicode locale extension key is defined as the BCP 47 key for
collation strength, with values such as level1, level2, level3,
level4, and identic.

Specification reference:
https://www.unicode.org/reports/tr35/dev/tr35-collation.html#Setting_Options

The following script verifies that these ks values are reflected in
Collator::getStrength():

<?php

$locales = [
    'PRIMARY'    => 'en_US-u-ks-level1',
    'SECONDARY'  => 'en_US-u-ks-level2',
    'TERTIARY'   => 'en_US-u-ks-level3',
    'QUATERNARY' => 'en_US-u-ks-level4',
    'IDENTICAL'  => 'en_US-u-ks-identic',
    'DEFAULT'    => 'en_US',
];

foreach ($locales as $label => $locale) {
    $collator = new Collator($locale);

    printf(
        "%-10s %-25s strength = %d\n",
        $label,
        $locale,
        $collator->getStrength()
    );
}

Example output:

PRIMARY    en_US-u-ks-level1         strength = 0
SECONDARY  en_US-u-ks-level2         strength = 1
TERTIARY   en_US-u-ks-level3         strength = 2
QUATERNARY en_US-u-ks-level4         strength = 3
IDENTICAL  en_US-u-ks-identic        strength = 15
DEFAULT    en_US                     strength = 2

This confirms the following correspondence:

ks-level1  -> Collator::PRIMARY
ks-level2  -> Collator::SECONDARY
ks-level3  -> Collator::TERTIARY
ks-level4  -> Collator::QUATERNARY
ks-identic -> Collator::IDENTICAL

The DEFAULT row omits the ks key, so it shows the default strength that
ICU chooses for the locale (2, the value of Collator::TERTIARY, for en_US).
It does not represent a ks value corresponding to
Collator::DEFAULT_STRENGTH, because no such ks value exists.
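
The same check can be written against the constants themselves instead of reading the integers from the output. This is a sketch under the same assumptions as the script above (intl loaded, ICU honoring ks); the `$expected` and `$mismatches` names are hypothetical helpers, not part of any API.

```php
<?php

// Map each ks value to the Collator constant it is expected to select.
$expected = [
    'level1'  => Collator::PRIMARY,
    'level2'  => Collator::SECONDARY,
    'level3'  => Collator::TERTIARY,
    'level4'  => Collator::QUATERNARY,
    'identic' => Collator::IDENTICAL,
];

// Collect every ks value whose reported strength differs from the constant.
$mismatches = [];
foreach ($expected as $ks => $constant) {
    $collator = new Collator("en_US-u-ks-$ks");
    if ($collator->getStrength() !== $constant) {
        $mismatches[] = $ks;
    }
}

// An empty list means every ks value mapped to its constant.
var_dump($mismatches === []);
```

Comparing against the constants avoids relying on their numeric values, which is what a documentation note would want to state anyway.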
