With #5939, the non-recursive <regex> rewrite has reached the point where we should decide how we adjust the stack and complexity limits in the implementation.
While rewriting the matcher, I made sure that the existing counters kept working roughly the same (except that the counts are sometimes a bit lower), so the set of allowed input strings should at least not have shrunk.
But while these counters were natural for the recursive matcher (they counted recursive calls), they are now adjusted in lots of different places, which creates quite a mess. I think we should now change the counters and limits to something more natural and maintainable.
Why are such limits needed?
In theoretical computer science courses, it is taught that a regular expression can be matched by a deterministic finite automaton (DFA), which basically translates into a linear-time algorithm. But (a) a DFA might have exponential size compared to the non-deterministic finite automaton (NFA) that the regular expression was parsed into, and (b) practical regex grammars can be used to accept non-regular languages. For example, matching a regex grammar supporting backreferences is (probably) not possible in linear time, because this problem is NP-hard. The limits are required to defend against super-linear running times when matching regexes against input strings.
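For illustration, here is a small self-contained program (not tied to any particular implementation) showing how such super-linear behavior surfaces to users. Whether the match below merely takes an extremely long time or throws a regex_error (for example with error_complexity) depends on the implementation and its limits:

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Classic catastrophic-backtracking pattern: against a string of 'a's with no
    // trailing 'b', a backtracking matcher attempts exponentially many trajectories.
    const std::regex pattern{"(a+)+b"};
    const std::string input(40, 'a');

    try {
        std::cout << std::regex_match(input, pattern) << '\n';
    } catch (const std::regex_error& e) {
        // With limits in place, the implementation can bail out instead of running
        // for an astronomically long time (e.g. by reporting error_complexity).
        std::cout << "regex_error: " << e.what() << '\n';
    }
}
```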
The current limits are too strict for long input strings
The limits are supposed to defend against super-linear running time, but because they are currently constants, they effectively cap linear running time as well. This means that the current implementation can successfully match each regex only against input strings up to some fixed maximum length. That is a pointlessly strict restriction: users who want it can easily get the same effect by explicitly limiting the length of the input strings they pass to the regex algorithms.
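As a hedged illustration of the problem (the exact constants and behavior are implementation-specific), a match that only needs linear work can still be rejected once the input exceeds some fixed length:

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // A harmless linear-time match: every character is consumed exactly once.
    const std::regex pattern{"a*"};
    const std::string input(20'000'000, 'a');

    try {
        std::cout << std::regex_match(input, pattern) << '\n';
    } catch (const std::regex_error& e) {
        // With a constant complexity limit, even this match can be rejected once
        // the input grows past some fixed length.
        std::cout << "regex_error: " << e.what() << '\n';
    }
}
```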
This is particularly bad for the stack limit, which is currently set to 1000. A constant like that made sense for the recursive matcher, because its purpose was to limit the usage of the thread's stack, which usually has a fixed size of about 1 MB. But now this stack lives on the heap and can grow.
The stack counter
After #5939, we have reached the point where the stack counter is bounded from above by what is essentially _Frames_count times some factor (I would have to do the exact math, but the factor should be roughly on the order of _Frames_count itself). So the natural replacement would be to bound _Frames_count directly, if we want to keep applying a limit on the stack size as well.
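A minimal sketch of what that could look like, assuming a frame stack along the lines of the _Frames stack from #5939; all names below are hypothetical and not the actual implementation:

```cpp
#include <cstddef>
#include <regex>
#include <vector>

// Simplified sketch; "Frame", "frames" and "max_stack_frames" are made-up names
// standing in for the matcher state kept on the heap-allocated frame stack.
struct Frame { /* saved matcher state would go here */ };

struct FrameStack {
    std::vector<Frame> frames;           // corresponds to _Frames / _Frames_count
    std::size_t max_stack_frames = 1000; // placeholder limit, to be revisited

    void push(const Frame& frame) {
        if (frames.size() >= max_stack_frames) {
            // Report the same error the old recursion-depth check reported.
            throw std::regex_error(std::regex_constants::error_stack);
        }
        frames.push_back(frame);
    }
};
```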
The stack limit
The more important question is how we should adjust the stack limit. The problem is that one can craft regexes for which the size of the stack is exponential in the size of the regex (think of the regex (|){2147483641} and nested versions of it). I'm not sure what an appropriate limit is here, although I think it should usually be linear in the length of the input string (something like a constant base plus a constant factor times the input length).
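As a sketch of the kind of formula I mean (the concrete constants are placeholders, not a proposal):

```cpp
#include <cstddef>

// Purely illustrative formula: a constant base (so short inputs keep working as
// they do today) plus a linear term in the input length; the factor would need
// to be tuned against real-world regexes.
constexpr std::size_t stack_frame_limit(std::size_t input_length) {
    constexpr std::size_t base   = 1000;
    constexpr std::size_t factor = 4;
    return base + factor * input_length;
}
```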
The complexity counter
The complexity counter currently roughly counts the number of loop repetitions, entered assertions and attempted trajectories, but it does not count all state transitions (and as an additional complication, matching some states such as backreferences potentially involves a linear amount of work).
It is clear that any revised complexity counter must at least account for loop repetitions and attempted trajectories:
- Counting loop repetitions alone is not sufficient because it is possible to construct a reduction from 3-SAT to REGEX MATCHING WITH BACKREFERENCES (proving the NP-hardness of matching regexes with backreferences) such that the constructed regex does not contain any loops.
- Counting attempted trajectories alone is not sufficient because it does not defend against regexes containing (){2147483641} and nested versions of it.
I think the natural thing to do is to count the number of state transitions in the NFA (perhaps also accounting for the amount of work performed for each state), which is an upper bound on the number of loop repetitions and attempted trajectories. However, I do not think there is a nice way to upper-bound this suggested complexity counter by the existing one, so doing this change will probably mean that we will reject matching some regexes against some inputs that were accepted with the old counter whatever adjusted limit we choose.
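A rough sketch of such a transition-based counter, with made-up names and a per-state cost parameter to account for non-constant work like backreference comparisons:

```cpp
#include <regex>

// Illustrative sketch of the proposed counter (all names are invented): charge one
// unit per NFA state transition, and charge extra for states whose matching does
// non-constant work, such as a backreference comparison over k characters.
struct ComplexityBudget {
    long long remaining = 10'000'000; // placeholder; the actual limit is discussed below

    void charge(long long cost = 1) {
        remaining -= cost;
        if (remaining < 0) {
            throw std::regex_error(std::regex_constants::error_complexity);
        }
    }
};

// In the matcher's main loop, roughly:
//   budget.charge();               // one NFA state transition
//   budget.charge(backref_length); // extra cost when matching a backreference node
```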
The complexity limit
The limit on the complexity counter is currently the constant 10000000, which is often lenient enough in practice, but it pointlessly prevents matching very long input strings (such as a sequence of many millions of characters), even when the match itself would only take linear time.
I think we should go for a limit that is roughly linear in the length of the input string, or perhaps linear in the length of the input string times the length of the regex (which is roughly the size of the parsed NFA), something like a constant base plus a factor times that product.
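Again as a sketch only, with placeholder constants:

```cpp
#include <cstddef>

// Purely illustrative formula: keep roughly today's allowance as a floor and add a
// term proportional to (input length x NFA size); both constants would need tuning.
constexpr long long complexity_limit(std::size_t input_length, std::size_t nfa_size) {
    constexpr long long base   = 10'000'000;
    constexpr long long factor = 16;
    return base + factor * static_cast<long long>(input_length)
                         * static_cast<long long>(nfa_size);
}
```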
A complication: Input string iterators that are not random access
Most of my suggestions for the limits require that we know the length of the input string. That is easy to compute if the range of the input string is given by random access iterators, but the standard only requires bidirectional iterators. I think we can choose between three alternatives here, with different pros and cons (a small sketch of the iterator-category dispatch follows the list):
- Calculate the limits in the same way as for random access iterators, but this means that we have to iterate through the whole range once up front. This might result in unnecessary computational work, especially if matching fails early in the input string.
- Fall back to constants for the limits. But this means that matching long input strings will keep failing for such iterators.
- Adjust the limits on the fly, i.e., the limits are increased whenever another character is read from the input string. This would increase the complexity of the matcher, though.
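A small sketch of how the first two alternatives could be dispatched on the iterator category (the helper name is made up):

```cpp
#include <cstddef>
#include <iterator>
#include <optional>
#include <type_traits>

// Hypothetical helper: compute the input length up front only when the iterators
// are random access; otherwise report "unknown" so the caller can fall back to
// constant limits (or grow the limits on the fly as characters are read).
template <class BidIt>
std::optional<std::size_t> known_input_length(BidIt first, BidIt last) {
    using category = typename std::iterator_traits<BidIt>::iterator_category;
    if constexpr (std::is_base_of_v<std::random_access_iterator_tag, category>) {
        return static_cast<std::size_t>(std::distance(first, last));
    } else {
        return std::nullopt; // avoid an O(n) pre-pass over a bidirectional-only range
    }
}
```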