@ngachchi ngachchi commented Oct 30, 2024

What does this PR do?

This PR introduces Hindi support for a wide range of numerical and temporal formats, including:

  • Cardinal numbers: Natural numbers (e.g., एक, दो, तीन)
  • Decimal numbers: Numbers with decimal points (e.g., दशमलव दो दशमलव पांच)
  • Fractions: Rational numbers expressed as ratios (e.g., एक बटा दो)
  • Dates: Various date formats (e.g., आज, कल, १४ नवंबर २०२४)
  • Time: Time formats (e.g., दो बजकर पांच मिनट)
  • Money: Monetary amounts (e.g., दस रुपये पचास पैसे)
  • Measure: Units of measurement (e.g., दस किलोमीटर)
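To make the direction of text normalization (written form to spoken form) concrete, here is a minimal pure-Python sketch. It is not the PR's grammar: the lookup tables and the names `HINDI_CARDINALS`, `HINDI_UNITS`, `verbalize_cardinal`, and `normalize` are hypothetical, cover only 0 to 10 and a single unit, and exist purely for illustration.

```python
import re

# Hypothetical, deliberately tiny lookup tables (illustration only).
HINDI_CARDINALS = {
    0: "शून्य", 1: "एक", 2: "दो", 3: "तीन", 4: "चार", 5: "पांच",
    6: "छह", 7: "सात", 8: "आठ", 9: "नौ", 10: "दस",
}
HINDI_UNITS = {"km": "किलोमीटर"}

# Map Devanagari digits to ASCII digits so both scripts are accepted.
DEVANAGARI_TO_ASCII = str.maketrans("०१२३४५६७८९", "0123456789")


def verbalize_cardinal(token: str) -> str:
    """Return the spoken Hindi form of a written cardinal between 0 and 10."""
    value = int(token.translate(DEVANAGARI_TO_ASCII))
    if value not in HINDI_CARDINALS:
        raise ValueError(f"cardinal {value} is outside this toy table")
    return HINDI_CARDINALS[value]


def normalize(text: str) -> str:
    """Replace each number (optionally followed by 'km') with its spoken form."""

    def replace(match: re.Match) -> str:
        number = verbalize_cardinal(match.group(1))
        unit = match.group(2)
        if unit is None:
            return number
        return f"{number} {HINDI_UNITS[unit]}"

    # Group 1: ASCII or Devanagari digits. Group 2: optional unit abbreviation.
    return re.sub(r"([0-9०-९]+)(?:\s*(km)\b)?", replace, text)


print(normalize("10 km"))
print(normalize("३"))
```

A production grammar would handle arbitrary magnitudes, the other classes listed above, and ambiguity between them; this sketch only shows the input and output shape of the task.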

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • Have you followed the CodeQL results and removed unused variables and imports (the report is at the bottom of the PR in the GitHub review box)?
  • Have you added the correct license header Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py.
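The `__init__.py` item in the checklist above is easy to miss for data folders. A small helper along the following lines could list the folders that lack one. This is a sketch under assumptions: `find_missing_inits` is a hypothetical name, not a tool shipped with the repository, and the path in the usage comment is only an example.

```python
import os
from pathlib import Path


def find_missing_inits(root: str) -> list[Path]:
    """Return every directory under root (root included) without an __init__.py."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip hidden folders and bytecode caches; sort for a stable order.
        dirnames[:] = sorted(
            d for d in dirnames if not d.startswith(".") and d != "__pycache__"
        )
        if "__init__.py" not in filenames:
            missing.append(Path(dirpath))
    return missing


# Example usage (the path is an assumption about where the grammar lives):
# for folder in find_missing_inits("nemo_text_processing/text_normalization/hi"):
#     print(folder)
```

Running it from the repository root before opening the PR would flag data folders that contain only .TSV files and no `__init__.py`.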

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

@ngachchi ngachchi changed the title from "Hindi ITN Support for Cardinal, Decimal, Fraction, Date, Time, Money and Measure" to "Hindi TN Support for Cardinal, Decimal, Fraction, Date, Time, Money and Measure" Oct 30, 2024
@github-advanced-security github-advanced-security bot left a comment

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Signed-off-by: Namrata Gachchi <[email protected]>
@ngachchi ngachchi left a comment

The remaining files from the whitelist data class will be removed, leaving a single file.

@ngachchi ngachchi requested a review from mgrafu October 30, 2024 13:38
zoobereq and others added 2 commits November 13, 2024 14:14
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason why we have English vocab as part of the Hindi TN grammar? I believe the best approach for now would be monolingual.

@mgrafu mgrafu merged commit c8a937a into NVIDIA:main Nov 18, 2024
4 checks passed
ngachchi added a commit to ngachchi/NeMo-text-processing that referenced this pull request Jun 23, 2025
…nd Measure (NVIDIA#241)

* Hindi TN changes

Signed-off-by: Namrata Gachchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Updated date for Hindi TN cache

Signed-off-by: Namrata Gachchi <[email protected]>

* additional whitelist class .tsv files and unused imports removed

Signed-off-by: Namrata Gachchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* incorporated suggestions for unused statements and another for closing the file opened

Signed-off-by: Namrata Gachchi <[email protected]>

* Combined Hindi TN and ITN separate blocks into single

Signed-off-by: Namrata Gachchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added __init__.py files and removed unused commented lines

Signed-off-by: Namrata Gachchi <[email protected]>

* commented irrelevant references and unused snippets from whitelist and word file

Signed-off-by: Namrata Gachchi <[email protected]>

* Whitelist and Word class changes

Signed-off-by: Namrata Gachchi <[email protected]>

* post processor changes with minor fixes

Signed-off-by: Namrata Gachchi <[email protected]>

* remove space before punctuation for sparrowhawk file

Signed-off-by: Namrata Gachchi <[email protected]>

* minor fixes for measure class

Signed-off-by: Namrata Gachchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Updated Jenkinsfile

Signed-off-by: Namrata Gachchi <[email protected]>

* removed unused imports and statements

Signed-off-by: Namrata Gachchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updated date stamp for HI cache and commented ITN grammars

Signed-off-by: Namrata Gachchi <[email protected]>

* Updates the cache

Signed-off-by: Simon Zuberek <[email protected]>

* Disables Hindi ITN L0 checks

Signed-off-by: Simon Zuberek <[email protected]>

* Reapplies ITN CI Checks

Signed-off-by: Simon Zuberek <[email protected]>

* Adds missing inits

Signed-off-by: Simon Zuberek <[email protected]>

* resolved the failing sparrowhawk test cases

Signed-off-by: Namrata Gachchi <[email protected]>

---------

Signed-off-by: Namrata Gachchi <[email protected]>
Signed-off-by: Simon Zuberek <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Simon Zuberek <[email protected]>
Signed-off-by: Namrata Gachchi <[email protected]>
FredHaa pushed a commit to FredHaa/NeMo-text-processing that referenced this pull request Aug 15, 2025
…nd Measure (NVIDIA#241)

3 participants