Add unicode box characters #1700

Merged
rocky merged 7 commits into master from add-unicode-box-characters on Feb 22, 2026

Member

@rocky rocky commented Feb 20, 2026

Handle boxing escape characters inside a string, in built-in functions Characters, StringLength, StringTake, and ToCharacterCode.

Tracks API changes in MathicsScanner, so this must not be merged before Mathics3/mathics-scanner#152.

Fixes #1622
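As a rough illustration of the behavior described above, here is a minimal sketch of how a two-character box escape such as `\(` or `\^` can be folded into one character, so that length and character-splitting operations count it once. The mapping and the name `encode_box_escapes` are hypothetical, and the code points are placeholders picked from the Unicode Private Use Area; the actual values used by WMA and mathics-scanner are not stated in this thread.

```python
# Hypothetical placeholder code points in the Unicode Private Use Area.
# These are NOT the values WMA or mathics-scanner actually use.
BOX_ESCAPE_TO_CHAR = {
    r"\(": "\uE000",
    r"\)": "\uE001",
    r"\!": "\uE002",
    r"\^": "\uE003",
    r"\_": "\uE004",
}


def encode_box_escapes(source: str) -> str:
    """Replace each two-character box escape with one placeholder character."""
    out = []
    i = 0
    while i < len(source):
        pair = source[i : i + 2]
        if pair in BOX_ESCAPE_TO_CHAR:
            out.append(BOX_ESCAPE_TO_CHAR[pair])
            i += 2
        else:
            out.append(source[i])
            i += 1
    return "".join(out)


encoded = encode_box_escapes(r"\(a\^b\)")
print(len(encoded))               # 5: each escape counts as one character
print([ord(c) for c in encoded])  # [57344, 97, 57347, 98, 57345]
```

With this kind of encoding, a length, a character list, and a character-code list all fall out of ordinary string operations on the encoded text.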

Handle unicode for boxing operators inside strings.
@rocky rocky marked this pull request as draft February 20, 2026 14:34
@rocky rocky requested a review from mmatera February 20, 2026 14:34
and add more tests. Check more argument counts on some builtin functions.
* Add argument error-check test.
* Add test to ensure an escape character is treated like one character.
@rocky rocky force-pushed the add-unicode-box-characters branch 7 times, most recently from 0f1d580 to e52afbe Compare February 21, 2026 01:25
@rocky rocky marked this pull request as ready for review February 21, 2026 01:28
@rocky rocky force-pushed the add-unicode-box-characters branch 3 times, most recently from 9c8df99 to 317b90f Compare February 21, 2026 03:19
@rocky rocky force-pushed the add-unicode-box-characters branch from 317b90f to bc42ee9 Compare February 21, 2026 12:01
-strform_str = safe_backquotes(strform.value)
+strform_str = safe_backquotes(replace_box_unicode_with_ascii(strform.value))
parts = strform_str.split("`")
# Rocky: This looks like a hack to me: is it needed?
Contributor

Indeed, it is a hack to allow escaping the backquote. Maybe with the latest changes in the scanner, it is no longer needed.

Member Author

@rocky rocky Feb 22, 2026

Looks like we still need it in order to make:

    >> StringForm["`` is Global\\`a", a]
     = a is Global`a

work. It might require a further change in the scanner. Or maybe there is something deeper here; a better change to StringForm is needed?

Either way, it's something that requires more thought, and since this PR is already too large for a single change, I'd like to get this merged and come back to it later.
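To make the issue in this thread concrete, here is a minimal sketch of why an escaped backquote has to be protected before a template is split on backquotes. This is not the Mathics implementation: `format_string_form` and `SENTINEL` are hypothetical names, and the slot rules are simplified (an empty slot takes the next unused argument in order, and a numbered slot does not move that counter, which is an assumption of this sketch).

```python
# Hypothetical placeholder that stands in for an escaped backquote.
SENTINEL = "\uE000"


def format_string_form(template: str, *args: object) -> str:
    """Fill `` and `n` slots, treating backslash-backquote as a literal backquote."""
    # Hide escaped backquotes so the split below does not see them.
    protected = template.replace("\\`", SENTINEL)
    parts = protected.split("`")

    out = []
    next_arg = 0
    for i, part in enumerate(parts):
        if i % 2 == 0:
            # Even positions are literal text between slots.
            out.append(part)
        elif part == "":
            # An empty slot takes the next unused argument.
            out.append(str(args[next_arg]))
            next_arg += 1
        else:
            # A numbered slot is 1-based; no error handling in this sketch.
            out.append(str(args[int(part) - 1]))

    # Restore the literal backquotes that were hidden earlier.
    return "".join(out).replace(SENTINEL, "`")


print(format_string_form("`` is Global\\`a", "a"))  # a is Global`a
```

Without the protection step, the escaped backquote would be read as the start of another slot and the split would no longer line up with the intended arguments.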

Contributor

Thanks for checking this. We can look again in another round.

Member Author

> Thanks for checking this. We can look again in another round.

Teamwork!

# These tests are commented out due to the bug reported in issue #906
# Octal and hexadecimal notation works alone, but fails
# as a part of another expression. For example,
# F[\.78\.79\.7A] or "\.78\.79\.7A" produces a syntax error in Mathics.
Contributor

@mmatera mmatera Feb 22, 2026

I do not see this fail in master with the previous version of Mathics-Scanner. Maybe this is a new bug? Now all these tests pass without problems.

Member Author

Thanks for checking. Tests reinstated now.
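For readers unfamiliar with the notation in the test comment above, here is a small sketch of what these escapes denote: `\.hh` is a two-digit hexadecimal code, `\:hhhh` a four-digit hexadecimal code, and `\nnn` a three-digit octal code. The function name `decode_numeric_escapes` is hypothetical and this is not the scanner's code; it only shows the expected result for the strings mentioned in the comment.

```python
import re

# One alternative per escape form; only one group matches at a time.
_NUMERIC_ESCAPE = re.compile(
    r"\\\.([0-9A-Fa-f]{2})"  # \.hh    two hex digits
    r"|\\:([0-9A-Fa-f]{4})"  # \:hhhh  four hex digits
    r"|\\([0-7]{3})"         # \nnn    three octal digits
)


def decode_numeric_escapes(text: str) -> str:
    """Replace numeric character escapes with the characters they denote."""

    def replace(match: re.Match) -> str:
        hex2, hex4, octal = match.groups()
        if octal is not None:
            return chr(int(octal, 8))
        return chr(int(hex2 or hex4, 16))

    return _NUMERIC_ESCAPE.sub(replace, text)


print(decode_numeric_escapes(r"\.78\.79\.7A"))     # xyz
print(decode_numeric_escapes(r"F[\.78\.79\.7A]"))  # F[xyz]
```

The second call corresponds to the case the comment describes, where the escapes appear as part of a larger expression.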

"StringTake[abc, {0, 0}]",
None,
),
# These tests are commented out due to the bug reported in issue #906
Contributor

I checked with this version, and the commented-out tests seem to pass without problems.

Member Author

Thanks for investigating and reporting back. Reinstated now.

Contributor

@mmatera mmatera left a comment

LGTM, thanks!

@rocky rocky force-pushed the add-unicode-box-characters branch from 9121c63 to 9f9ca46 Compare February 22, 2026 12:15
@rocky rocky merged commit a41c5f0 into master Feb 22, 2026
21 checks passed
@rocky rocky deleted the add-unicode-box-characters branch February 22, 2026 12:23
Member Author

rocky commented Feb 22, 2026

I'd like to mention something related to this change.

In the past, there was a thought that all operators had to be represented by Unicode symbols. (This kind of thing was last done circa tokenized BASIC in the mid-1970s.)

This is not that. Here, we are doing this only for Box Expression "operators" (where grouping is also treated as an "operator"). And even then, only inside strings.

To my mind, the thing that justifies this is the behavior of StringTake, Characters, StringLength, and ToCharacterCode. Especially ToCharacterCode, where the specific Unicode values are explicit.

Exactly why these functions behave this way is not readily apparent to me. I am sure there was some reason for this. But still, it feels like more of a hack than a fundamental principle.

And even here, WMA was, I think, careful to restrict this Unicode character idea to the private-use section of Unicode.
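As a side note on the private-use point, here is a small sketch of how one could check whether a character falls in a Unicode private-use range, using Python's standard `unicodedata` module. The helper name `is_private_use` is hypothetical, and the sample code point is only an example from the Private Use Area, not a value taken from WMA.

```python
import unicodedata


def is_private_use(char: str) -> bool:
    """Return True if the character's Unicode general category is Co (private use)."""
    return unicodedata.category(char) == "Co"


print(is_private_use("\uE000"))  # True: first code point of the BMP Private Use Area
print(is_private_use("a"))       # False
```

A check like this could be used in a test to assert that whatever code points are chosen for box operators never collide with assigned characters.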

Development

Successfully merging this pull request may close these issues.

Escape sequences in string parsing

2 participants