
fix: metasound truncation order, add 4 missing consonants per paper #1384

Open
phoneee wants to merge 1 commit into PyThaiNLP:dev from phoneee:fix/metasound-algorithm


@phoneee phoneee commented Mar 29, 2026

What do these changes do

Fix metasound:

  • Filter out the spaces left by karan (์) removal before truncating to the requested length
  • Add ฏ, ฑ, ถ, ธ to _C2 per the original paper (Snae & Brückner, 2009, Table p.507)
  • Remove duplicate ข in prayut_and_somchaip
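
To illustrate why the order of the first fix matters, here is a minimal, self-contained sketch. It is not the code in `pythainlp/soundex/metasound.py`; the helper names (`blank_karan`, `truncate_then_filter`, `filter_then_truncate`) are hypothetical, and it assumes karan handling replaces the karan mark and the consonant before it with a space.

```python
KARAN = "\u0e4c"  # Thai thanthakhat mark (์)


def blank_karan(chars):
    """Replace each karan mark and the consonant before it with a space.

    Assumption: this mirrors how karan removal leaves spaces behind.
    """
    out = list(chars)
    for i, ch in enumerate(out):
        if ch == KARAN:
            if i > 0:
                out[i - 1] = " "
            out[i] = " "
    return out


def truncate_then_filter(chars, length):
    # Problematic order: the spaces still occupy slots when the list is cut,
    # so real consonants after them can be pushed out of the window.
    return [c for c in blank_karan(chars)[:length] if c != " "]


def filter_then_truncate(chars, length):
    # Fixed order: drop the spaces first, then cut to the requested length.
    return [c for c in blank_karan(chars) if c != " "][:length]


# Consonants and karan of a word with a karan in the middle: ฟ ล ์ ม
sample = list("ฟล" + KARAN + "ม")

lost = truncate_then_filter(sample, 3)   # ม falls outside the window
kept = filter_then_truncate(sample, 3)   # ม survives
```

With a length of 3, the first ordering keeps only ฟ, while the second keeps both ฟ and ม, which is the behavior this PR aims for.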

Fixes #1383

  • Passed code styles and structures
  • Passed code linting checks and unit tests

…licate

Three fixes in soundex module:
- Filter karan spaces before truncating to requested length
- Add ฏ,ฑ,ถ,ธ to _C2 (dental class, same sound as ท,ด)
- Remove duplicate ข in prayut_and_somchaip _C2

@bact bact added this to PyThaiNLP Mar 29, 2026
@bact bact added the bug bugs in the library label Mar 29, 2026
@bact bact requested a review from Copilot March 29, 2026 16:27

Copilot AI left a comment


Pull request overview

Fixes MetaSound’s implementation to match the referenced paper and resolves a truncation-order bug that produced incorrect codes when karan (์) removal introduced spaces.

Changes:

  • Filter out karan-introduced spaces before truncating MetaSound output to the requested length.
  • Add ฏ, ฑ, ถ, ธ to MetaSound _C2 consonant group (per Snae & Brückner, 2009).
  • Remove a duplicate ข in prayut_and_somchaip._C2 and add regression tests for the MetaSound fixes.
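
The effect of the `_C2` change can be sketched with a simplified group lookup. This is an illustration only: the group below lists just the letters named in this PR rather than the full `_C2` string, and the code digits `"2"` (for the group) and `"0"` (fallback for unclassified consonants) are assumptions, not taken from the library source.

```python
# Hypothetical, partial dental group: only letters mentioned in this PR.
C2_BEFORE = set("ทด")
C2_AFTER = C2_BEFORE | set("ฏฑถธ")  # the four consonants added by the fix


def code_for(ch, c2_group):
    """Return the assumed code digit for one consonant.

    Assumption: members of the group map to "2", everything else to "0".
    """
    return "2" if ch in c2_group else "0"


added = "ฏฑถธ"
codes_before = [code_for(ch, C2_BEFORE) for ch in added]
codes_after = [code_for(ch, C2_AFTER) for ch in added]
```

Under these assumptions, each of the four consonants falls through to the fallback code before the fix and shares a code with ท and ด afterwards.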

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/core/test_soundex.py | Adds regression tests covering karan-truncation order and the 4 missing _C2 consonants. |
| pythainlp/soundex/prayut_and_somchaip.py | Removes duplicated ข from _C2 (no functional behavior change expected). |
| pythainlp/soundex/metasound.py | Fixes truncation order and expands _C2 to include ฏ, ฑ, ถ, ธ. |
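
As a rough check of the "no functional behavior change expected" note, the sketch below shows that removing a repeated character from a group string does not change membership tests. The group string and the helper `duplicated_chars` are hypothetical; the real `_C2` in `prayut_and_somchaip.py` is not reproduced here.

```python
from collections import Counter


def duplicated_chars(group: str) -> list[str]:
    """Return the characters that appear more than once in a group string."""
    counts = Counter(group)
    return [ch for ch, n in counts.items() if n > 1]


# Hypothetical group string with ข listed twice; not the real _C2.
group_with_dup = "กขขค"

# dict.fromkeys keeps the first occurrence of each character, in order.
deduped = "".join(dict.fromkeys(group_with_dup))

# Membership is the same before and after de-duplication.
same_membership = all(
    (ch in group_with_dup) == (ch in deduped) for ch in "กขคง"
)
```

A helper like `duplicated_chars` could also be run over each consonant group in a test to catch similar duplicates early.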


Labels

bug bugs in the library

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

bug: metasound missing 4 consonants from _C2 per paper; truncation before space removal

3 participants