<regex>: Revising the stack and complexity limits #5944

@muellerj2

Description

With #5939, the non-recursive <regex> rewrite has reached the point where we should decide how we adjust the stack and complexity limits in the implementation.

While rewriting the matcher, I made sure that the existing counters kept working roughly as before (except that the counts are sometimes a bit lower), so the set of accepted input strings should at least not have shrunk.

But while these counters were natural for the recursive matcher (they counted recursive calls), they are now adjusted in lots of different places, which creates quite a mess. I think we should now change the counters and limits to something more natural and maintainable.

Why are such limits needed?

In theoretical computer science courses, it is taught that a regular expression can be matched by a deterministic finite automaton (DFA), which basically translates into a linear-time algorithm. But (a) a DFA might have exponential size compared to the non-deterministic finite automaton (NFA) that the regular expression was parsed into, and (b) practical regex grammars can be used to accept non-regular languages. For example, matching a regex grammar supporting backreferences is (probably) not possible in linear time, because this problem is NP-hard. The limits are required to defend against super-linear running times when matching regexes against input strings.
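
To make the failure mode concrete, here is a minimal illustration (my own example, not taken from the implementation or its tests): a classic catastrophic-backtracking pattern that explores exponentially many trajectories, which is exactly the kind of blow-up the complexity limit is meant to cut off.

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // "(a+)+$" forces a backtracking matcher to try exponentially many ways
    // of splitting the 'a's between the inner and outer loop before it can
    // conclude that there is no match.
    const std::regex re("(a+)+$");
    const std::string subject = std::string(64, 'a') + "b";
    try {
        std::cout << std::regex_search(subject, re) << '\n';
    } catch (const std::regex_error& e) {
        // With the current constant limit, MSVC's <regex> is expected to
        // throw error_complexity here instead of running for an extremely
        // long time.
        std::cout << "regex_error: " << e.what() << '\n';
    }
}
```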

The current limits are too strict for long input strings

The limits are supposed to defend against super-linear running time, but because they are currently constants, they end up defending against linear running time as well. This means that the current implementation can only successfully match any given regex against input strings up to some maximum constant length. This is a pointlessly strict limit, because users could easily impose the same restriction themselves by explicitly limiting the length of the input strings they pass to the regex algorithms.
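
As an illustration (again my own example, not from the issue's tests): with the current constant limit, even a trivially linear match can fail once the input is long enough, so shortening the input is the only way to get an answer at all.

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // A pattern that any reasonable matcher handles in linear time.
    const std::regex re("a*");
    // An input much longer than the constant complexity limit can accommodate.
    const std::string subject(50'000'000, 'a');
    try {
        std::cout << std::regex_match(subject, re) << '\n';
    } catch (const std::regex_error& e) {
        // On the current implementation this is expected to throw
        // error_complexity even though the match itself is cheap.
        std::cout << "regex_error: " << e.what() << '\n';
    }
}
```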

This is particularly bad for the stack limit, which is currently set to 1000. A constant like that made sense for the recursive matcher, because it was supposed to bound the usage of the thread's stack, which has a fixed size of usually 1 MB. But now this stack lives on the heap and can grow.

The stack counter

After #5939, we have reached the point that the stack counter is bounded from above by what is essentially _Frames_count times some factor (I would have to do the exact math, but the factor should be roughly $3 \cdot \texttt{\_Ncap} + 1$ for ECMAScript, which in turn is at most 3001 because of the capture group limit in the parser). So I propose to do away with the current stack counter and replace it with _Frames_count, if we want to keep applying a limit on the stack size as well.

The stack limit

The more important question is how we should adjust the stack limit. The problem is that one can craft regexes such that the size of the stack is exponential in the size of the regex (think of the regex (|){2147483641} and nested versions of it). I'm not sure what is an appropriate limit here, although I think it should usually be linear in the length of the input string (something like $c \cdot n + d$ for some reasonably chosen constants $c$ and $d$, where $n$ is the length of the input string). Alternatively, we could also decide that we do not implement some specific stack limit (beyond the maximum enforced by the vector implementation), but leave the defense against misbehaving regexes to the complexity limit.
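
If we went with a length-proportional stack limit, the check itself would be cheap. Here is a hypothetical sketch; `frame`, `push_frame`, and the constants are placeholders of my own, not the actual `<regex>` internals or a concrete proposal:

```cpp
#include <cstddef>
#include <regex>
#include <vector>

// Hypothetical sketch only: enforce a length-proportional stack limit
// whenever a new frame is pushed onto the heap-allocated matcher stack.
struct frame { /* saved matcher state would go here */ };

void push_frame(std::vector<frame>& stack, const frame& f, std::size_t input_length) {
    constexpr std::size_t c = 16;   // illustrative only
    constexpr std::size_t d = 1000; // illustrative baseline for short inputs
    if (stack.size() >= c * input_length + d) {
        throw std::regex_error(std::regex_constants::error_stack);
    }
    stack.push_back(f);
}
```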

The complexity counter

The complexity counter currently roughly counts the number of loop repetitions, entered assertions and attempted trajectories, but it does not count all state transitions (and as an additional complication, matching some states such as backreferences potentially involves a linear amount of work).

It is clear that any revised complexity counter must at least account for loop repetitions and attempted trajectories:

  • Counting loop repetitions alone is not sufficient because it is possible to construct a reduction from 3-SAT to REGEX MATCHING WITH BACKREFERENCES (proving the NP-hardness of matching regexes with backreferences) such that the constructed regex does not contain any loops.
  • Counting attempted trajectories alone is not sufficient because it does not defend against regexes containing (){2147483641} and nested versions of it (see the snippet below this list).
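
To spell out the second bullet, here is a small snippet of my own: an empty group with a huge repetition count offers only a single trajectory, yet following it means billions of loop repetitions, so a counter that ignores loop repetitions would never intervene.

```cpp
#include <iostream>
#include <regex>

int main() {
    try {
        // An empty group repeated ~2^31 times: there is only a single
        // trajectory to attempt, but following it means spinning through
        // billions of loop repetitions on an empty input.
        const std::regex re("(){2147483641}");
        std::cout << std::regex_match("", re) << '\n';
    } catch (const std::regex_error& e) {
        // A counter that includes loop repetitions (as the current one does)
        // is expected to cut this off via error_complexity.
        std::cout << "regex_error: " << e.what() << '\n';
    }
}
```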

I think the natural thing to do is to count the number of state transitions in the NFA (perhaps also accounting for the amount of work performed for each state), which is an upper bound on the number of loop repetitions and attempted trajectories. However, I do not think there is a nice way to bound this suggested complexity counter from above by the existing one, so whatever adjusted limit we choose, this change will probably mean rejecting some regexes against some inputs that the old counter accepted.
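
For concreteness, here is a hypothetical sketch of such a counter; none of the names resemble the actual `<regex>` internals, and the cost model is only schematic:

```cpp
#include <cstdint>
#include <regex>

// Hypothetical sketch only: charge every NFA state transition against a
// budget and give up once the budget is exhausted.
struct transition_budget {
    std::uint64_t used = 0;
    std::uint64_t limit = 0;

    // States whose matching involves more than constant work (for example a
    // backreference comparing a previously captured substring) would pass a
    // proportionally larger cost.
    void charge(std::uint64_t cost = 1) {
        used += cost;
        if (used > limit) {
            throw std::regex_error(std::regex_constants::error_complexity);
        }
    }
};
```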

The complexity limit

The limit on the complexity counter is currently the constant 10,000,000, which is often lenient enough in practice, but pointlessly prevents matching very long input strings (such as the sequence of $2^{32}-1$ a's in the test case of #439). Conversely, I think this constant is probably also too large for very small strings, as it allows a disproportionate amount of work.

I think we should go for some limit which is roughly linear in the length of the input string, or perhaps linear in the length of the input string times the length of the regex (which is roughly the size of the parsed NFA), something like $c \cdot (n + 1)$ or $c \cdot m \cdot (n + 1)$, where $n$ is the length of the input string, $m$ is the number of states in the NFA and $c$ is a constant. This would allow users to limit the amount of time by limiting the length of the input string and the length of the regex. I'm open to alternative suggestions, though, both for the formula and for the constants used to calculate the limit.
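
A sketch of the second variant, $c \cdot m \cdot (n + 1)$; the constant and the function name are purely illustrative, not a proposed value or API:

```cpp
#include <cstdint>

// Hypothetical sketch of a complexity limit that scales with both the input
// length n and the NFA size m.
constexpr std::uint64_t proposed_complexity_limit(std::uint64_t n, std::uint64_t m) {
    constexpr std::uint64_t c = 256; // illustrative only
    return c * m * (n + 1);
}
```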

A complication: Input string iterators that are not random access

Most of my suggestions for the limits require that we know the length of the input string. That's easy to compute if the range of the input string is represented by random access iterators, but the standard only requires bidirectional iterators. I think we can choose between three alternatives here, with different pros and cons:

  • Calculate the limits in the same way as for random access iterators. This means we have to iterate once through the whole range (see the sketch after this list), which might result in unnecessary work, especially if matching fails early in the input string.
  • Fall back to constants for the limits. But then matching long input strings will keep failing for such iterators.
  • Adjust the limits on the fly, i.e., increase them whenever another character is read from the input string. This would increase the complexity of the matcher, though.
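
For the first alternative, the length computation itself is a one-liner; a minimal sketch (my own, not the issue's proposal):

```cpp
#include <cstddef>
#include <iterator>

// std::distance is O(1) for random access iterators but walks the whole
// range for plain bidirectional iterators, which is exactly the up-front
// cost described in the first alternative above.
template <class BidIt>
std::size_t input_length(BidIt first, BidIt last) {
    return static_cast<std::size_t>(std::distance(first, last));
}
```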
