A more compact charindex #1331

Merged
eggrobin merged 3 commits into unicode-org:main from eggrobin:compactification on Apr 20, 2026
@eggrobin (Member) commented Apr 14, 2026

Compare https://eggrobin.github.io/unicode-annotations/charindex.html (old) and https://eggrobin.github.io/unicode-annotations/charindex-smol.html (with this change).
(Note that I will probably replace charindex.html with the -smol one after merging this.)

Take all the highly repetitive strings (both the actual property values and the HTML snippets), stick them in a giant string, and deflate that, replacing the strings throughout the data structures with their indices in the giant string: it goes from 22 MiB to 1388 KiB (6.3%). Also, don’t try to pretty-print a map with 66666 entries.
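
The deflate step described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `PoolCompressionSketch`, the `'\u001E'` separator value, and the sample strings are assumptions, and the JDK's `Deflater` is used here with its default zlib wrapping, which may differ from the format the generated page actually consumes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: compress one big string of pooled, repetitive records.
final class PoolCompressionSketch {
    // Assumption: a separator that does not occur in the pooled strings.
    static final char RECORD_SEPARATOR = '\u001E';

    /** Deflates the UTF-8 bytes of the pooled string (zlib-wrapped). */
    static byte[] deflate(String pooled) {
        byte[] input = pooled.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inverse of deflate(), used here only to check the round trip. */
    static String inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Made-up, highly repetitive records standing in for property values
        // and HTML snippets.
        StringBuilder pooled = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            pooled.append("<td class=\"gc\">Lowercase_Letter ").append(i % 50).append("</td>")
                  .append(RECORD_SEPARATOR);
        }
        byte[] compressed = deflate(pooled.toString());
        System.out.println("raw chars: " + pooled.length()
                + ", deflated bytes: " + compressed.length);
        System.out.println("round trip ok: " + inflate(compressed).contentEquals(pooled));
    }
}
```

Because the data structures only hold integer offsets into the pooled string, they stay small, and the repetitive text is paid for once, in compressed form.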

This brings the generated charindex.html from 42.3 MiB to 8.91 MiB (21% of its size).
The page gets compressed by the server, and the compressed size doesn’t change much (4477 kB vs. 3175 kB, says Chrome), so this doesn’t change download times very much.

However, this massively reduces the time spent parsing JS. When the page is loaded from disk cache, the time to DOMContentLoaded goes from 2.10 s to 636 ms.

@eggrobin eggrobin requested a review from markusicu April 14, 2026 23:23
Comment on lines +303 to +306
    final int snippetIndex =
            stringIndices.getOrDefault(snippet, allTheStrings.length());
    if (snippetIndex == allTheStrings.length()) {
        allTheStrings.append(snippet).append(RECORD_SEPARATOR);
Member

Please create & use a helper function that takes a string (without the separator) and returns the index. Internally, figure out whether to reuse or append.
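
A helper of the kind requested here might look like the sketch below. It is only an illustration under stated assumptions: the class `StringPool`, the methods `indexFor`, `get`, and `pooled`, and the `'\u001E'` value for `RECORD_SEPARATOR` are hypothetical; only the names `allTheStrings`, `stringIndices`, and `RECORD_SEPARATOR` come from the excerpt above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a reuse-or-append helper around the pooled string.
final class StringPool {
    // Assumption: the real separator value is not shown in the excerpt.
    static final char RECORD_SEPARATOR = '\u001E';

    private final StringBuilder allTheStrings = new StringBuilder();
    private final Map<String, Integer> stringIndices = new HashMap<>();

    /**
     * Takes a string without the separator and returns its index in the pool,
     * reusing an existing entry when there is one and appending otherwise.
     */
    int indexFor(String s) {
        Integer existing = stringIndices.get(s);
        if (existing != null) {
            return existing;
        }
        int index = allTheStrings.length();
        allTheStrings.append(s).append(RECORD_SEPARATOR);
        stringIndices.put(s, index);
        return index;
    }

    /** Reads back the string that starts at the given index. */
    String get(int index) {
        int end = allTheStrings.indexOf(String.valueOf(RECORD_SEPARATOR), index);
        return allTheStrings.substring(index, end);
    }

    /** The whole pooled string, separators included. */
    String pooled() {
        return allTheStrings.toString();
    }
}
```

Callers then write `pool.indexFor(snippet)` everywhere, so every string, whether a property value or an HTML snippet, goes through the same duplicate check.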

Member Author

Done. And systematically checking for a pre-existing string (I was doing that for the property values but not for the HTML) brought the size down to 8.76 MiB, from the 8.91 MiB mentioned above.

Comment thread unicodetools/src/main/java/org/unicode/text/tools/Indexer.java Outdated
@eggrobin eggrobin merged commit d44cfa6 into unicode-org:main Apr 20, 2026
14 of 15 checks passed