Skip to content

security: reject nested-quantifier regex in Split/Replace to prevent ReDoS (CWE-1333)#2060

Open
Allen930311 wants to merge 1 commit into
huggingface:mainfrom
Allen930311:fix/redos-user-supplied-regex
Open

security: reject nested-quantifier regex in Split/Replace to prevent ReDoS (CWE-1333)#2060
Allen930311 wants to merge 1 commit into
huggingface:mainfrom
Allen930311:fix/redos-user-supplied-regex

Conversation

@Allen930311
Copy link
Copy Markdown

Summary

The Split pre-tokenizer and Replace normalizer both accept arbitrary regex patterns supplied by users via JSON configuration. These patterns are compiled and executed by Oniguruma, a backtracking regex engine.

An attacker can craft a tokenizer JSON that contains a pattern with nested quantifiers (e.g. (a+)+, (a*)*, ([a-z]+)+). When such a tokenizer is loaded by a service and then used to tokenize carefully chosen input, Oniguruma performs exponential backtracking — O(2^n) — hanging the process for seconds (or indefinitely).

CWE: CWE-1333 (Inefficient Regular Expression Complexity) / CWE-400 (Uncontrolled Resource Consumption)

Proof of Concept

from tokenizers import Tokenizer
import time

MALICIOUS_JSON = """
{
    "version": "1.0",
    "truncation": null, "padding": null, "added_tokens": [],
    "normalizer": null,
    "pre_tokenizer": {
        "type": "Split",
        "pattern": {"Regex": "(a+)+"},
        "behavior": "Removed",
        "invert": false
    },
    "post_processor": null,
    "decoder": null,
    "model": {"type": "WordLevel", "vocab": {}, "unk_token": "[UNK]"}
}
"""

tok = Tokenizer.from_str(MALICIOUS_JSON)

for n in [5, 10, 15, 20, 25]:
    evil = "a" * n + "b"
    t = time.time()
    try:
        tok.encode(evil)
    except Exception:
        pass
    print(f"n={n:2d}: {time.time() - t:.4f}s")

Expected output (exponential growth):

n= 5: 0.0001s
n=10: 0.0010s
n=15: 0.0330s
n=20: 1.048s
n=25: ~33s   ← DoS

The same PoC works for the Replace normalizer with a Regex pattern.

Changes

File Change
tokenizers/src/utils/mod.rs Add check_redos_risk(pattern) using regex-syntax AST analysis
tokenizers/src/pre_tokenizers/split.rs Call check in Split::new() for SplitPattern::Regex
tokenizers/src/normalizers/replace.rs Call check in Replace::new() for ReplacePattern::Regex

regex-syntax is already a declared dependency — no new crates required.

Patterns using Oniguruma-specific syntax that regex-syntax cannot parse are silently accepted (best-effort). Hardcoded library patterns (ByteLevel, etc.) are not affected.

Test plan

  • Split::new(SplitPattern::Regex("(a+)+".into()), ...) → returns Err (rejected)
  • Replace::new(ReplacePattern::Regex("(a*)*".into()), ...) → returns Err (rejected)
  • Normal patterns like \s+, \w+, [a-z]+ → accepted unchanged
  • Oniguruma-specific patterns like \p{L}+ → accepted unchanged (best-effort)

… CWE-1333)

The Split pre-tokenizer and Replace normalizer both accept user-supplied
regex patterns that are compiled and executed by Oniguruma, a backtracking
regex engine.  An attacker can craft a pattern with nested quantifiers (e.g.
`(a+)+`) that causes catastrophic backtracking — O(2^n) — when tokenising
carefully chosen input, resulting in a denial-of-service (CWE-1333 / CWE-400).

Fix:
- Add `utils::check_redos_risk()` that parses the pattern with `regex-syntax`
  (already a dependency) and returns an error when nested quantifiers are
  detected.  Patterns using Oniguruma-specific syntax that `regex-syntax`
  cannot parse are passed through unchanged (best-effort check).
- Call `check_redos_risk()` in `Split::new()` and `Replace::new()` for the
  `Regex` variant, before the Oniguruma compilation step.

No new dependencies are introduced; `regex-syntax` is already in Cargo.toml.

Affected call sites:
  tokenizers/src/pre_tokenizers/split.rs   — SplitPattern::Regex
  tokenizers/src/normalizers/replace.rs    — ReplacePattern::Regex

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant