Using Selective Memoization to Defeat Regular Expression Denial of Service

James Davis
Sep 4, 2020


Regexes across the system stack. ReDoS may occur when a slow regex meets unsanitized input on a slow regex engine.

This is a brief for the research paper Using Selective Memoization to Defeat Regular Expression Denial of Service (ReDoS), published at IEEE S&P 2021. I led the work, with help from Francisco Servant and Dongyoon Lee.

In this article I use the word “regex” as shorthand for “regular expression”.

Summary

Attackers can use regex-based denial of service (ReDoS) attacks to damage vulnerable web services. These attacks exploit the slow backtracking algorithm that many regex engines use to evaluate regexes. We present novel memoization optimizations that provably improve the worst-case behavior of these engines to linear time. Nothing in life is free, so these optimizations cost space. We present two space-reduction schemes with varying time/space tradeoffs. We also show how to support the extended features of backreferences and lookaround assertions within our memoization framework. Our prototype achieves linear time complexity with constant space costs for 90% of a large corpus of super-linear regexes.

Motivation

Regexes are a denial of service vector in most mainstream programming languages [1]. Regexes are used for input sanitization across the system stack, rendering many web services vulnerable to ReDoS [2]. (See the figure at the top of the article.) One high-profile ReDoS incident occurred at Cloudflare in 2019, affecting the availability of thousands of Internet services [3].

Background

The root cause of ReDoS is that regex engines can take quite a long time to evaluate some regexes on particular inputs. When these regexes are used to process untrusted client input, ReDoS can result.

Programming languages use a regex engine to determine whether a regex matches an input string. The behavior of these regex engines can be modeled using a non-deterministic finite automaton (NFA), with extensions for capture groups, irregular features, etc. An NFA describes a set of strings in terms of reachability: “Is there a path from the left side of the diagram to the right side, where the next input character must always match the label on the edge I take?”

The next figure shows example NFAs for some regex operators. For example:

  • To describe a* we can use a loop that consumes a’s, and once we’re done we can proceed along the “epsilon” edge (consumes no characters).
  • To describe a|b, we can introduce a fork. You can go up to match an a or down to match a b. So there’s a path to the right side of the diagram that consumes an a and a path that consumes a b, just like you’d expect from the meaning of the regex a|b.

Regex operators and NFA equivalents for character, concatenation, repetition, and disjunction (OR). Edges are labeled with the input characters that must be consumed in order to traverse them. The funny “e” symbols are called epsilon-edges and can be taken without consuming a character.
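
To make these constructions concrete, here is a minimal sketch in Python of how an engine might assemble such fragments. The NFA representation and function names are illustrative assumptions, not taken from any particular engine.

    # Illustrative NFA fragments, Thompson-style. EPSILON edges consume
    # no input character. All names here are hypothetical.
    EPSILON = None

    class NFA:
        def __init__(self):
            self.edges = []    # (src, label, dst) triples
            self.n = 0         # number of states allocated so far

        def state(self):
            self.n += 1
            return self.n - 1

        def edge(self, src, label, dst):
            self.edges.append((src, label, dst))

    def star(nfa, char):
        """Fragment for char*: a loop that consumes char, plus an
        epsilon edge out for the "done" case."""
        enter, exit = nfa.state(), nfa.state()
        nfa.edge(enter, char, enter)     # stay in the loop, consuming char
        nfa.edge(enter, EPSILON, exit)   # or leave without consuming anything
        return enter, exit

    def union(nfa, c1, c2):
        """Fragment for c1|c2: a fork whose branches rejoin."""
        enter, exit = nfa.state(), nfa.state()
        nfa.edge(enter, c1, exit)        # upper branch consumes c1
        nfa.edge(enter, c2, exit)        # lower branch consumes c2
        return enter, exit

Concatenation simply wires the exit of one fragment to the entrance of the next.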

Regex engines can be thought of as parsing the regex into an NFA. Some regex engines (e.g. Python) explicitly generate an NFA-like entity, following what Cox calls the Virtual Machine approach [4]. Others (e.g. Perl) implicitly use an NFA, but more closely resemble a predictive parser in their implementation.

These engines then simulate the NFA on the input string.

  • Most engines use a backtracking algorithm inherited from Spencer [5]: they start at the entrance (left side of the graph) and follow edges based on whether their label matches the current index in the string. Given a choice of edges (e.g. due to an OR), they try one and save the others for later. This is essentially a depth-first search through the automaton graph for a path to the accept state. These engines exhibit exponential worst-case behavior.
  • A few engines (Rust, Go) use Thompson’s “lockstep” algorithm [6]. When there is a choice of edges, they try taking all of them, and track all of the NFA states they might be in at any moment. This is akin to a breadth-first search. These engines exhibit linear worst-case behavior.
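
To make the contrast concrete, here is a minimal sketch of both strategies over a toy NFA encoded as {state: [(label, next_state), ...]}, where a label of None is an epsilon edge. This is an illustrative sketch (it ignores epsilon-cycles for brevity), not production engine code.

    def backtrack(nfa, accept, s, state=0, i=0):
        """Spencer-style depth-first search: worst-case exponential."""
        if state == accept and i == len(s):
            return True
        for label, nxt in nfa.get(state, []):
            if label is None:                     # epsilon edge: consume nothing
                if backtrack(nfa, accept, s, nxt, i):
                    return True
            elif i < len(s) and s[i] == label:    # consume one character
                if backtrack(nfa, accept, s, nxt, i + 1):
                    return True
        return False                              # every choice failed: backtrack

    def lockstep(nfa, accept, s):
        """Thompson-style breadth-first simulation: worst-case linear in s."""
        def closure(states):                      # all states reachable via epsilons
            stack, seen = list(states), set(states)
            while stack:
                for label, nxt in nfa.get(stack.pop(), []):
                    if label is None and nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
            return seen

        current = closure({0})
        for c in s:                               # one pass over the input
            current = closure({nxt for st in current
                                   for label, nxt in nfa.get(st, [])
                                   if label == c})
        return accept in current

    # Example: the NFA for a|b, with state 0 as entrance and 1 as accept.
    AB = {0: [('a', 1), ('b', 1)]}
    assert backtrack(AB, 1, 'b') and lockstep(AB, 1, 'b')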

We are concerned with Spencer-style regex engines. Their exponential behavior can be seen by considering the NFA for (a|a)+$, shown here:

Regex with exponential worst-case behavior in a Spencer-style backtracking regex engine.

Consider a backtracking simulation of this NFA on the string a…ab. Let’s observe two things:

  1. There are two paths to the accept state for the first a, twice that many for aa, and twice again for aaa. There are exponentially many paths to explore in the number of a’s. Each of these paths results in failure at the moment the b is observed.
  2. Most of these paths are redundant! The paths split and then join again, so we need only explore one of them per a in order to correctly reject the input string.

(More examples of worse-than-linear behavior are available in my cheat sheet.)
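
You can observe the blowup directly in a Spencer-style engine such as CPython’s re module; each increase of two a’s roughly quadruples the running time:

    # Demonstration: exponential backtracking in Python's re module.
    # The mismatched trailing 'b' forces the engine to try every path.
    import re
    import time

    pattern = re.compile(r'(a|a)+$')
    for n in range(18, 27, 2):
        start = time.perf_counter()
        pattern.match('a' * n + 'b')   # always fails, slowly
        print(n, round(time.perf_counter() - start, 2), 'seconds')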

Why would you use backtracking and not lockstep?

I don’t know for certain, but here are two possibilities:

  1. Ease of implementation and extension. Within a backtracking framework, each new regex feature (backreferences, lookaround assertions, etc.) can be defined naturally in terms of a walk down the regex AST — create a new rule for matching the new feature, and you’re good to go. Some of these features are sufficiently non-regular that they incur substantial space costs in a lockstep approach. For others we lack the automata theory to express them as an NFA.
  2. The weight of history / Technical inertia. Today’s regex engines often descend from Henry Spencer’s original (backtracking) implementation and have since been extended. Many engines are decades old now. They are used in myriad critical places. Any semantic regression is bad news. Converting a regex engine to a different algorithm would be a substantial and risky engineering effort. For an application, switching regex engines is also risky because of semantic subtleties and lack of extended feature support [1].

Approach

Our approach is based on the following assumptions:

  1. A substantial rewrite of a regex engine is unlikely. The engineering costs are high and so are the risks.
  2. However, it would be nice to deal with ReDoS. Empirical research shows that super-linear regexes are common and are used in performance-critical contexts [1,2].

Therefore, our aim was to eliminate ReDoS within the backtracking framework (low engineering cost), while minimizing the externally-visible effects (low application cost).

Memoization

For many problematic regexes, the worst-case behavior is caused by redundancy in the search algorithm. We apply Michie’s memoization technique to eliminate this redundancy [7]. The idea of memoization is that if you have to compute the same function many times, then you can record the inputs you see and their outputs. If you see the same input again, you can return the corresponding output without re-executing the function.

To apply this idea to regex engines, note that the outcome of a regex match is determined by two things:

  1. The current position in the NFA (out of the m vertices)
  2. The index (offset) into the input string (out of the n indices)
  3. (OK, 3 things. The captured substrings also matter for backreferences. See Section IX of the paper for details.)

For a given input, we may not explore every position in the NFA nor every index in the input string. But once we have explored any pair and failed to find a match, we don’t need to explore that pair again. Once the table is full or the input is exhausted, we’re done and can return a mismatch.

There are n*m pairs to track, and the cost of evaluating each pair depends on the structure of the NFA (at most m vertices to consider). So we can fill out this table in O(m²n) steps. Most server-side regexes are fixed (fixed m), so the time complexity is linear in the length of the input.
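
As a sketch, here is the backtracking matcher from earlier extended with a full memo table. The table records (state, index) pairs whose search subtrees have already failed; the names and encoding are illustrative, not the paper’s artifact.

    def backtrack_memo(nfa, accept, s, state=0, i=0, failed=None):
        """Backtracking with full memoization: worst-case O(m^2 * n),
        since each of the m*n (state, index) pairs is expanded at most
        once, and each expansion considers at most m edges."""
        if failed is None:
            failed = set()                  # the memo table
        if state == accept and i == len(s):
            return True
        if (state, i) in failed:
            return False                    # already explored and failed: skip
        for label, nxt in nfa.get(state, []):
            if label is None:
                if backtrack_memo(nfa, accept, s, nxt, i, failed):
                    return True
            elif i < len(s) and s[i] == label:
                if backtrack_memo(nfa, accept, s, nxt, i + 1, failed):
                    return True
        failed.add((state, i))              # record the failure before backtracking
        return False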

Reducing memoization costs

Linear time complexity is great — loads better than exponential. However, regexes may be big [8] (large m) and inputs long (large n), and so an n*m memo table may be problematically large. Can we reduce the cost?

We propose two ideas to reduce the space cost: selective memoization and run-length encoding.

Selective memoization

The memo table contains NUM VERTICES x INPUT LENGTH entries. Using selective memoization, we reduce the number of vertices we track.

  1. We can memoize only the vertices with in-degree > 1 (i.e. more than one incoming edge). These vertices result from disjunctions like a|a and repetitions like a{1,3} and a*. Why does this work? For there to be redundancy in the NFA simulation, we must reach the same position (vertex + offset) along two different paths. At some point those paths must merge, which can only happen at a vertex with in-degree > 1. If we memoize the visits to that vertex, then we reach it at most once for each offset — never redundantly.
  2. We can memoize only the vertices that are “cycle ancestors” — those to which back-edges are directed as a result of unbounded repetition. This does not completely eliminate redundancy, but it bounds it to a constant that depends only on the NFA, not on the input. Why does this work? Slow regex behavior arises when ambiguity in the NFA is compounded by the presence of a cycle. When we memoize, we eliminate all but one of the possible cycles, limiting the redundancy to the ambiguity of the corresponding loop-free automaton.

In the next figure, the grey nodes indicate the vertices that we memoize under the in-degree scheme (left) and ancestor scheme (right).

(Left) In-degree>1 memoization for (a|a). (Right) Ancestor memoization for (a|a)+.

This approach reduces the cost of storing the memo table to NUM IN-DEG VERTICES x INPUT LENGTH or to NUM ANCESTOR VERTICES x INPUT LENGTH. For regexes that do not use these features much, this approach reduces the cost of the simulation to a few copies of the input string — probably good enough. Of course, if we can pay less, that would be even better.
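
Both vertex sets are cheap to compute from the NFA graph. Here is a hedged sketch, reusing the {state: [(label, next_state), ...]} encoding from earlier; in backtrack_memo, you would then consult and update the memo table only when the current state is in the chosen set.

    def indeg_gt1(nfa):
        """In-degree scheme: vertices with more than one incoming edge."""
        indeg = {}
        for st, edges in nfa.items():
            for _, nxt in edges:
                indeg[nxt] = indeg.get(nxt, 0) + 1
        return {v for v, d in indeg.items() if d > 1}

    def cycle_ancestors(nfa):
        """Ancestor scheme: vertices targeted by a back-edge. A DFS from
        the entrance flags edges that point back into the current path."""
        ancestors, on_path, seen = set(), set(), set()

        def dfs(st):
            seen.add(st)
            on_path.add(st)
            for _, nxt in nfa.get(st, []):
                if nxt in on_path:
                    ancestors.add(nxt)      # back-edge: nxt heads a cycle
                elif nxt not in seen:
                    dfs(nxt)
            on_path.discard(st)

        dfs(0)
        return ancestors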

Table encodings

So far we’ve talked about the memo table as a two-dimensional array. We get fast access times, but must pay for all of the space up front.

We could instead use that favorite recourse of the software engineer: a hash table. Now we get pretty-fast access times, and we only pay for the space we use. However, on pathological inputs we’ll still end up visiting O(m*n) slots. This is not helping.

Perhaps we can compress the data in the table. Why might this work?

  • Compression is effective when the data is not random. Some structure is needed.
  • The semantics of regex engines give us the structure we want. Regex match semantics impose an order in which to explore the NFA. In particular, updates will tend to accrete to adjacent indices as characters are processed. This is particularly true for the common case of quantified wildcards like .* and .+.

We propose to use run-length encoding [9] (RLE) to represent the memo table. We use a binary tree, with runs as elements keyed by their offsets. This offers logarithmic access time in the number of runs, quite fast if the compression is effective. RLE can reduce the cost of storing the memo table, with best-case constant cost and worst-case m*n still. (I believe we can improve the worst-case cost, but I haven’t finished the math yet.)
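
Here is a hedged sketch of one vertex’s row of such a table. A sorted Python list with bisect stands in for the binary tree (a real implementation would want logarithmic insertion as well); merging two runs that an insertion makes adjacent is omitted for brevity.

    import bisect

    class RLETable:
        """Run-length-encoded set of visited offsets for one vertex."""

        def __init__(self):
            self.runs = []   # disjoint [start, end] runs, sorted by start

        def _find(self, i):
            """Index of the run containing offset i, or None."""
            k = bisect.bisect_right(self.runs, [i, float('inf')]) - 1
            return k if k >= 0 and self.runs[k][1] >= i else None

        def contains(self, i):
            return self._find(i) is not None

        def add(self, i):
            if self.contains(i):
                return
            k = bisect.bisect_left(self.runs, [i, i])
            if k > 0 and self.runs[k - 1][1] == i - 1:
                self.runs[k - 1][1] = i        # extend the run ending at i-1
            elif k < len(self.runs) and self.runs[k][0] == i + 1:
                self.runs[k][0] = i            # extend the run starting at i+1
            else:
                self.runs.insert(k, [i, i])    # start a fresh run

    # Adjacent visits compress into a single run:
    t = RLETable()
    for i in (3, 4, 5, 9):
        t.add(i)
    assert t.runs == [[3, 5], [9, 9]] and t.contains(4)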

Extended regexes

One of the appeals of backtracking is to support extended regex features. The most common of these are lookaround assertions and backreferences. In typical regex engines, evaluating both of these features takes worst-case exponential time.

I won’t get into the analysis here, but:

  1. For most regexes that use lookaround assertions, the time cost can be reduced to linear in n with no increase in the space cost.
  2. For most regexes that use backreferences, the time cost is a small polynomial in n. However, these come with a greater space cost (a sketch of why follows).
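
To give a flavor of where the extra space goes with backreferences, here is a hedged sketch of the idea (the paper’s Section IX gives the real formulation; the names here are hypothetical): the captured substrings that a backreference consults must join the memo key, multiplying the number of distinct table entries.

    def memo_key(state, index, captures):
        """Memo key extended for backreferences. `captures` holds the
        (start, end) spans of the capture groups that some backreference
        actually consults; each extra span enlarges the key space, which
        is where the greater space cost comes from."""
        return (state, index, tuple(captures))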

Evaluation

Prototype

We implemented the selective memoization and encoding schemes within a Spencer-style backtracking regex engine published by Cox [10]. We extended this engine to support many regular features as well as lookaround assertions and backreferences. Our prototype supports approximately 66% of the corpus we published in [1], including 84% of the super-linear regexes labeled therein.

Time cost

It works! The behavior of our prototype follows the predictions of our theorems: linear time cost in n. Here are some examples.

Case studies of applying our techniques to regular and extended regexes (K- and E-regexes). The regexes are: (a) Exponential Microsoft username regex (responsibly disclosed); (b) Cloudflare regex; (c,d) Hand-crafted examples. All K-regexes and REWZWA can be handled in time linear in n. For REWBR, memoization reduces the degree of the exponent.

Space cost

The selective memoization schemes offer order-of-magnitude reductions in the size of the memoization table.

Sizes of the vertex sets for the selective memoization schemes. Whiskers indicate the (1, 99)th percentiles. Outliers are not shown. Among the supported regexes, the results are similar for all regexes and the super-linear subset.

In our experiments, using run-length encoding (RLE) reduces the space cost to constant for 90% of the regexes we tried.

Selective memoization significantly reduces the per-query storage cost. The overheads of a hash table appear to outweigh the benefits of the “only track what you’ve seen” approach (orange bars). For RLE, the 90th percentile cost was about 10 runs, resulting in effectively constant costs in this experiment. This experiment used input strings of length 20KB to imitate the Stack Overflow outage.

Conclusions

The ReDoS problem has been rumored since the 1960s [6] and defined carefully since 2002, but has thus far defied a practical solution. Although there are alternative algorithmic frameworks with better time complexity guarantees, they complicate implementations and limit extensibility and expressiveness. Meanwhile, a memoization-based optimization would fit nicely into the popular backtracking framework, but these optimizations have been rejected for their high space complexity.

We revisited this longstanding problem from a data-driven perspective. We proposed two selective memoization schemes that offer comparable time complexity guarantees with lower space complexity. Using our newly-available large-scale regex corpus, we evaluated these schemes and found that they offer an order of magnitude reduction in memoization space costs for typical regexes. Furthermore, leveraging insights into regex engine search semantics, we showed that memoization space costs can be further reduced to constant for typical regexes. In short, we have demonstrated a provably sound ReDoS defense that fits within a backtracking framework, and shown how to reduce its space cost from problematic to constant for the majority of super-linear regexes.

It is our hope that our proposal will be incorporated into mainstream regex engines to eliminate the problem of ReDoS.

In closing, we note that the References in this blog are primarily from two decades: 1960–1970 and 2009–2019. The general techniques we use to solve ReDoS are old; what’s changed is the empirical evidence that ReDoS is a problem worth solving.

More information

  1. The full paper is available here.
  2. The artifact associated with the paper is available here.
  3. The slides are available here (talk recording here).

References

[1] Davis et al., 2019. Why Aren’t Regexes a Lingua Franca?

[2] Staicu and Pradel, 2018. Freezing the Web.

[3] Graham-Cumming, 2019. Details of the Cloudflare outage on July 2, 2019.

[4] Cox, 2009. Regular Expression Matching: the Virtual Machine Approach.

[5] Spencer, 1994. A regular-expression matcher.

[6] Thompson, 1968. Regular expression search algorithm.

[7] Michie, 1968. “Memo” functions and machine learning.

[8] Davis et al., 2019. Testing Regex Generalizability And Its Implications: A Large-Scale Many-Language Measurement Study.

[9] Robinson & Cherry, 1967. Results of a Prototype Television Bandwidth Compression Scheme.

[10] Cox, 2010. re1: A simple regular expression engine, easy to read and study.


Written by James Davis

I am a professor in ECE@Purdue. My research assistants and I blog here about research findings and engineering tips.