Preventing Regular Expression Denial of Service (ReDoS)
The previous topic explains catastrophic backtracking with practical examples from the perspective of somebody trying to get their regular expressions to work and perform well on their own PC. You should understand those examples before reading this topic.
It’s annoying when catastrophic backtracking happens on your PC. But when it happens in a server application with multiple concurrent users, it can really be catastrophic. Too many users running regexes that exhibit catastrophic backtracking will bring down the whole server. And “too many” need only be as few as the number of CPU cores in the server.
If the server accepts regexes from the user, then the user can easily provide one that exhibits catastrophic backtracking on any subject. If the server accepts subject data from the user, then the user may be able to provide subjects that trigger catastrophic backtracking in regexes used by the server, if those regexes are predisposed to catastrophic backtracking. When the user can do either of those things, the server is susceptible to regular expression denial of service (ReDoS). When enough users (or one actor masquerading as many users) provide malicious regexes and/or subjects to match against, the server will be spending nearly all its CPU cycles on trying to match those regexes.
Handling Regexes Provided by The User
If your application allows the user to provide their own regexes, then your only real defense is to use a text-directed regex engine. Those engines don’t backtrack. Their performance depends on the length of the subject string, not the complexity of the regular expression. But they also don’t support features like backreferences that depend on backtracking and that many users expect.
If your application uses a backtracking engine with user-provided regexes, then you can only mitigate the consequences of catastrophic backtracking. And you’ll really need to do so. It’s very easy for people with limited regex skills to accidentally craft one that degenerates into catastrophic backtracking.
You’ll need to use a regex engine that aborts the match attempt when catastrophic backtracking occurs rather than running until the script crashes or the OS kills it. You can easily test this. When the regex (x\w{1,10})+y
is attempted on an ever growing string of x
’s there should be a reasonable limit on how long it takes for the regex engine to give up. Ideally your engine will allow you to configure this limit for your purposes. The .NET engine, for example, allows you to pass a timeout to the Regex() constructor. The PCRE engine allows you to set recursion limits. The lower your limits the better the protection against ReDoS, but higher the risk of aborting legitimate regexes that would find a valid match given slightly more time. Low recursion limits may prevent long regex matches. Low timeouts may abort searches through large files too early.
If your regex engine has no such features, you could implement your own timeout. Spawn a separate thread to execute the regular expression. Wait on the thread with a timeout. If the thread finishes before the wait times out, process its result. Otherwise, kill the thread and tell the user the regex is too complex. The safe_regexp package implements this for Ruby.
Reviewing Regexes in The Application
If the server only uses regexes that are hard-coded in your application, then you can prevent regex-based denial of service attacks entirely. You need to make sure that your regexes won’t exhibit catastrophic backtracking regardless of the subjects they’re used on. This isn’t particularly difficult for somebody with a solid grasp of regular expressions. But it does require care and attention. It’s not enough to just test that the regex matches valid subjects. You need to make sure, by looking at the regex independently of any subject data, that it is not possible for multiple permutations of the same regex to match the same thing.
Permutations occur when you give the regular expression a choice. You can do this with alternation and with quantifiers. So these are the regex tokens you need to inspect. Possessive quantifiers are excepted, because they never backtrack.
Alternation
Alternatives must be mutually exclusive. If multiple alternatives can match the same text then the engine will try both if the remainder of the regex fails. If the alternatives are in a group that is repeated, you have catastrophic backtracking.
A classic example is (.|\s)*
to match any amount of any text when the regex flavor does not have a “dot matches line breaks” mode. If this is part of a longer regex then a subject string with a sufficiently long run of spaces will break the regex. The engine will try every possible combination of the spaces being matched by .
or \s
. For example, 3 spaces could be matched as ...
, ..\s
, .\s.
, .\s\s
, \s..
, \s.\s
, \s\s.
, or \s\s\s
. That’s 2^N permutations. The fix is to use (.|\n)*
to make the alternatives mutually exclusive. Even better to be more specific about which characters are really allowed, such as [\r\n\t\x20-\x7E]*
for ASCII printables, tabs, and line breaks.
It is acceptable for two alternatives to partially match the same text. [0-9]*\.[0-9]+|[0-9]+
is perfectly fine to match a floating point number with optional integer part and optional fraction. Though a subject that consists of only digits is initially matched by [0-9]*
and does cause some backtracking when \.
fails, this backtracking never becomes catastrophic. Even if you put this inside a group in a longer regex, the group only does a minimal amount of backtracking. (But the group mustn’t have a quantifier or it will fall foul of the rule for nested quantifiers.)
Quantifiers in Sequence
Quantified tokens that are in sequence must either be mutually exclusive with each other or be mutually exclusive with what comes between them. Otherwise both can match the same text and all combinations of the two quantifiers will be tried when the remainder of the regex fails to match. A token inside a group with alternation is still in sequence with any token before or after the group.
A classic example is a.*?b.*?c
to match 3 things with “anything” between them. When c
can’t be matched the first .*?
expands character by character until the end of the line or file. For each expansion the second .*?
expands character by character to match the remainder of the line or file. The fix is to realize that you can’t have “anything” between them. The first run needs to stop at b
and the second run needs to stop at c
. With single characters a[^b]*b[^c]*c
is an easy solution. The negated character classes guarantee the repetition stops at the delimiter. If your regex flavor supports possessive quantifiers then you can use a[^b]*+b[^c]*+c
to further increase performance.
For a more complex example and solution, see matching a complete HTML file in the previous topic. This explains how you can use atomic grouping to prevent backtracking in more complex situations.
Nested Quantifiers
A group that contains a token with a quantifier must not have a quantifier of its own unless the quantified token inside the group can only be matched with something else that is mutually exclusive with it. That ensures that there is no way that fewer iterations of the outer quantifier with more iterations of the inner quantifier can match the same text as more iterations of the outer quantifier with fewer iterations of the inner quantifier.
The regex (x\w{1,10})+y
matches a sequence of one or more codes that start with an x
followed by 1 to 10 word characters, all followed by a y
. All is well as long as the y
can be matched. When the y
is missing, backtracking occurs. If the string doesn’t have too many x
’s then backtracking happens very quickly. Things only turn catastrophic when the subject contains a long sequence of x
’s. x
and x
are not mutually exclusive. So the repeated group can match xxxx
in one iteration as x\w\w\w
or in two iterations as x\wx\w
.
To solve this, you first need to consider whether x
and y
should be allowed in the 1 to 10 characters that follow it. Excluding the x
eliminates most backtracking. What’s left won’t be catastrophic. You could exclude it with character class subtraction as in (x[\w-[x]]{1,10})+y
or with character class intersection as in (x[\w&&[^x]]{1,10})+y
. If you don’t have those features you’ll need to spell out the characters you want to allow: (x[a-wyz0-9_]{1,10})+y
.
If the x
should be allowed then your only solution is to disallow the y
in the same way. Then you can make the group atomic or the quantifier possessive to eliminate the backtracking.
If both x
and y
should be allowed in the sequences of 1 to 10 characters, then there is no regex-only solution. You can’t make the group atomic or the quantifier possessive as then \w{1,10}
matches the final y
which causes y
to fail.
Other Defensive Techniques
In addition to preventing catastrophic backtracking as explained above, you should make your regular expressions as strict as possible. The stricter the regex, the less backtracking it does and thus the better it performs. Even if you can’t measure the performance difference because the regex is used infrequently on short strings, proper technique is a habit. It also reduces the chance that a less experienced developer introduces catastrophic backtracking when they extend your regex later.
Make groups that contain alternatives atomic as much as you can. Use \b(?>one|two|three)\b
to match a list of words.
Make quantifiers possessive as much as you can. If a repeated token is mutually exclusive with what follows, enforce that with a possessive quantifier.
Use (negated) character classes instead of the dot. It’s rare that you really want to allow “anything”. A double-quoted string, for example, can’t contain “anything”. It can’t contain unescaped double quotes. So use "[^"\n]*+"
instead of ".*?"
. Though both find exactly the same matches when used on their own, the latter can lead to catastrophic backtracking when pasted into a longer regex. The former never backtracks regardless of anything else the regex needs to match.
Why Use Regexes at All?
Some would certainly argue that the above only shows that regexes are dangerous and that they should not be used. They’ll then force developers to do the job with procedural code. Procedural code to match non-trivial patterns quickly becomes long and complicated, increasing the chance of bugs and the cost to develop and maintain the code. Many pattern matching problems are naturally solved with recursion. And when a large subject string can’t be matched, runaway recursion leads to stack overflows that crash the application.
Developers need to learn to correctly use their tools. This is no different for regular expressions than for anything else.