Using Selective Memoization to Defeat Regular Expression Denial of Service

Regexes are used across the system stack. ReDoS may occur when a slow (super-linear) regex meets unsanitized input on a slow (backtracking) regex engine.

This is a brief for the research paper Using Selective Memoization to Defeat Regular Expression Denial of Service (ReDoS), published at IEEE S&P 2021. I led the work, with help from Francisco Servant and Dongyoon Lee.

In this article I use the word “regex” as shorthand for “regular expression”.

Summary

Motivation

Background

Programming languages use a regex engine to determine whether a regex matches an input string. The behavior of these regex engines can be modeled using a non-deterministic finite automaton (NFA), with extensions for capture groups, irregular features, etc. An NFA describes a set of strings in terms of reachability: “Is there a path from the left side of the diagram to the right side, where the next input character must always match the label on the edge I take?”

The next figure shows example NFAs for some regex operators. For example:

  • To describe a* we can use a loop that consumes a’s, and once we’re done we can proceed along the “epsilon” edge (consumes no characters).
  • To describe a|b, we can introduce a fork. You can go up to match an a or down to match a b. So there’s a path to the right side of the diagram that consumes an a and a path that consumes a b, just like you’d expect from the meaning of the regex a|b.
Regex operators and NFA equivalents for character, concatenation, repetition, and disjunction (OR). Edges are labeled with the input characters that must be consumed in order to traverse them. The funny “e” symbols are called epsilon-edges and can be taken without consuming a character.
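
To make these constructions concrete, here is a minimal Python sketch of the same four building blocks. The representation (an NFA as a dictionary of labeled edges, with None standing in for an epsilon-edge) and all of the names are my own illustration, not the paper's implementation.

    # Thompson-style construction of the NFAs pictured above (sketch only).
    # An NFA is a triple (edges, start, accept); edges maps each state to a
    # list of (label, next_state) pairs, and a label of None is an epsilon-edge.
    # Sub-NFAs are assumed to be used once (edge lists are shared, not copied).
    import itertools

    _ids = itertools.count()

    def _fresh(edges):
        s = next(_ids)
        edges[s] = []
        return s

    def char(c):                  # a single edge that consumes the character c
        edges = {}
        s, t = _fresh(edges), _fresh(edges)
        edges[s].append((c, t))
        return edges, s, t

    def concat(n1, n2):           # glue: epsilon from n1's accept to n2's start
        e1, s1, t1 = n1
        e2, s2, t2 = n2
        edges = {**e1, **e2}
        edges[t1].append((None, s2))
        return edges, s1, t2

    def alt(n1, n2):              # a|b: fork at a new start, rejoin at a new accept
        e1, s1, t1 = n1
        e2, s2, t2 = n2
        edges = {**e1, **e2}
        s, t = _fresh(edges), _fresh(edges)
        edges[s] += [(None, s1), (None, s2)]
        edges[t1].append((None, t))
        edges[t2].append((None, t))
        return edges, s, t

    def star(n):                  # a*: loop back after each pass, or skip entirely
        e, s0, t0 = n
        edges = dict(e)
        s, t = _fresh(edges), _fresh(edges)
        edges[s] += [(None, s0), (None, t)]
        edges[t0] += [(None, s0), (None, t)]
        return edges, s, t

    # The running example (a|a)+, built as (a|a)(a|a)*.
    example = concat(alt(char("a"), char("a")), star(alt(char("a"), char("a"))))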

Regex engines can be thought of as parsing the regex into an NFA. Some regex engines (e.g. Python) explicitly generate an NFA-like entity, following what Cox calls the Virtual Machine approach [4]. Others (e.g. Perl) implicitly use an NFA, but more closely resemble a predictive parser in their implementation.

These engines then simulate the NFA on the input string.

  • Most engines use a backtracking algorithm inherited from Spencer [5]: they start at the entrance (left side of the graph) and follow edges whose labels match the character at the current index in the string. Given a choice of edges (e.g. due to an OR), they try one and save the others for later. This is essentially a depth-first search through the automaton graph for a path to the accept state. These engines exhibit exponential worst-case behavior.
  • A few engines (Rust, Go) use Thompson’s “lockstep” algorithm [6]. When there is a choice of edges, they try taking all of them, and track all of the NFA states they might be in at any moment. This is akin to a breadth-first search. These engines exhibit linear worst-case behavior.
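
To make the contrast concrete, here is a toy sketch of both strategies over the same kind of epsilon-NFA as above (a dictionary mapping each state to its (label, next-state) edges, with None for epsilon). The functions are mine and are far simpler than any production engine; in particular, the backtracker has no guard against epsilon-cycles, which is fine for the small examples in this article.

    # Toy contrast between the two strategies on the same kind of epsilon-NFA.
    # edges: dict mapping state -> list of (label, next_state); None = epsilon.

    def backtrack_match(edges, start, accept, s):
        """Spencer-style: depth-first search, one path at a time."""
        def walk(state, i):
            if state == accept and i == len(s):
                return True
            for label, nxt in edges[state]:
                if label is None:                    # epsilon: consume nothing
                    if walk(nxt, i):
                        return True
                elif i < len(s) and s[i] == label:   # consume one character
                    if walk(nxt, i + 1):
                        return True
            return False                             # dead end: backtrack
        return walk(start, 0)

    def lockstep_match(edges, start, accept, s):
        """Thompson-style: track the set of reachable states, breadth-first."""
        def eps_closure(states):
            stack, seen = list(states), set(states)
            while stack:
                st = stack.pop()
                for label, nxt in edges[st]:
                    if label is None and nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
            return seen
        current = eps_closure({start})
        for c in s:                                  # a single pass over the input
            current = eps_closure({nxt for st in current
                                   for label, nxt in edges[st] if label == c})
        return accept in current

    # NFA for a|b, written out by hand: 0 forks, the branches rejoin at 5.
    edges = {0: [(None, 1), (None, 3)], 1: [("a", 2)], 2: [(None, 5)],
             3: [("b", 4)], 4: [(None, 5)], 5: []}
    assert backtrack_match(edges, 0, 5, "a") and lockstep_match(edges, 0, 5, "b")

On the NFA for (a|a)+ discussed next, backtrack_match explores exponentially many paths on a mismatching input, while lockstep_match makes a single pass over the string.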

We are concerned with Spencer-style regex engines. Their exponential behavior can be seen by considering the NFA for (a|a)+$, shown here:

Regex with exponential worst-case behavior in a Spencer-style backtracking regex engine.

Consider a backtracking simulation of this NFA on the string a…ab. Let’s observe two things:

  1. There are two paths through the NFA that consume the first a, twice that many for aa, and twice again for aaa. There are exponentially many paths to explore in the number of a’s. Each of these paths ends in failure at the moment the b is observed.
  2. Most of these paths are redundant! The paths split and then join again, so we need to explore only one of them per a in order to correctly reject the input string.

(More examples of worse-than-linear behavior are available in my cheat sheet.)
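
You can observe the blowup directly in CPython, whose re module is a backtracking engine. The exact timings below are machine-dependent; the point is only that each additional a roughly doubles the time to reject the input:

    # Exponential blowup of (a|a)+$ in CPython's backtracking re module.
    # Each additional 'a' roughly doubles the time to reject the input.
    import re
    import time

    pattern = re.compile(r"(a|a)+$")
    for n in range(18, 27, 2):
        attack = "a" * n + "b"     # the trailing 'b' forces exhaustive backtracking
        start = time.perf_counter()
        assert pattern.match(attack) is None
        print(n, round(time.perf_counter() - start, 3), "seconds")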

Why would you use backtracking and not lockstep?

  1. Ease of implementation and extension. Within a backtracking framework, each new regex feature (backreferences, lookaround assertions, etc.) can be defined naturally in terms of a walk down the regex AST — create a new rule for matching the new feature, and you’re good to go. Some of these features are sufficiently non-regular that they incur substantial space costs in a lockstep approach. For others we lack the automata theory to express them as an NFA.
  2. The weight of history / Technical inertia. Many regex engines are descended from Henry Spencer’s original (backtracking) implementation and have since been extended. Many engines are decades old now. They are used in myriad critical places. Any semantic regression is bad news. Converting a regex engine to a different algorithm would be a substantial and risky engineering effort. For an application, switching regex engines is also risky because of semantic subtleties and lack of extended feature support [1].

Approach

  1. A substantial rewrite of a regex engine is unlikely. The engineering costs are high and so are the risks.
  2. However, it would be nice to deal with ReDoS. Empirical research shows that super-linear regexes are common and are used in performance-critical contexts [1,2].

Therefore, our aim was to eliminate ReDoS within the backtracking framework (low engineering cost), while minimizing the externally-visible effects (low application cost).

Memoization

Memoization is the classic space-time tradeoff: record the result of a sub-computation so that it never has to be repeated [7]. To apply this idea to regex engines, note that the outcome of a regex match is determined by two things:

  1. The current position in the NFA (out of the m vertices)
  2. The index (offset) into the input string (out of the n indices)
  3. (OK, 3 things. The captured substrings also matter for backreferences. See Section IX of the paper for details.)

For a given input, we may not explore every position in the NFA nor every index in the input string. But once we have explored any pair and failed to find a match, we don’t need to explore that pair again. Once we have exhausted the reachable pairs without finding a match, we’re done and can return a mismatch.

There are n*m pairs to track, and the cost of evaluating each pair depends on the structure of the NFA (at most m vertices to consider). So we can fill out this table in O(m²n) steps. Most server-side regexes are fixed, so m is a constant and the time complexity is linear in the length of the input.
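
As a toy illustration (not the paper's implementation), here is the earlier backtracking simulation with exactly one change: a memo table recording the (vertex, offset) pairs that have already failed.

    # Backtracking simulation with full memoization: once a (vertex, offset)
    # pair has failed, it is never re-explored, so the work is bounded by the
    # number of such pairs rather than by the number of paths.

    def memo_match(edges, start, accept, s):
        failed = set()                        # memo table of failed (state, i) pairs

        def walk(state, i):
            if (state, i) in failed:
                return False                  # already known to be a dead end
            if state == accept and i == len(s):
                return True
            for label, nxt in edges[state]:
                if label is None:
                    if walk(nxt, i):
                        return True
                elif i < len(s) and s[i] == label:
                    if walk(nxt, i + 1):
                        return True
            failed.add((state, i))            # record the failure before backtracking
            return False

        return walk(start, 0)

    # NFA for (a|a), written out by hand: 0 forks to two 'a' edges that rejoin at 5.
    edges = {0: [(None, 1), (None, 3)], 1: [("a", 2)], 2: [(None, 5)],
             3: [("a", 4)], 4: [(None, 5)], 5: []}
    assert memo_match(edges, 0, 5, "a") and not memo_match(edges, 0, 5, "b")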

Reducing memoization costs

We propose two ideas to reduce the space cost: selective memoization and run-length encoding.

Selective memoization

The memo table contains NUM VERTICES x INPUT LENGTH entries. Using selective memoization, we reduce the number of vertices we track.

  1. We can memoize only the vertices with in-degree > 1 (i.e. more than one incoming edge). These vertices result from disjunctions like a|a and repetitions like a{1,3} and a*. Why does this work? For there to be redundancy in the NFA simulation, we must reach the same position (vertex + offset) along two different paths. At some point those paths must merge, which can only happen at a vertex with in-degree > 1. If we memoize the visits to that vertex, then we reach it at most once for each offset — never redundantly.
  2. We can memoize only the vertices that are “cycle ancestors”: those to which back-edges are directed as a result of unbounded repetition. This does not completely eliminate redundancy, but it bounds it to a constant that depends only on the NFA, not on the input. Why does this work? Slow regex behavior arises when ambiguity in the NFA is compounded by the presence of a cycle. When we memoize, we eliminate all but one of the possible cycles, limiting the redundancy to the ambiguity of the corresponding loop-free automaton.

In the next figure, the grey nodes indicate the vertices that we memoize under the in-degree scheme (left) and ancestor scheme (right).

(Left) In-degree>1 memoization for (a|a). (Right) Ancestor memoization for (a|a)+.

This approach reduces the cost of storing the memo table to NUM IN-DEG VERTICES x INPUT LENGTH or to NUM ANCESTOR VERTICES x INPUT LENGTH. For regexes that do not use these features heavily, this reduces the space cost of memoization to the equivalent of a few copies of the input string, which is probably good enough. Of course, if we can pay less, that would be even better.
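
Here is one way the two vertex sets could be computed from the NFA graph. This is my own sketch: the in-degree rule is straightforward, and I approximate the "cycle ancestors" as the targets of back-edges found by a depth-first search, which is not necessarily how the paper's artifact selects them.

    # Choosing which NFA vertices to memoize under the two selective schemes.
    # edges: dict mapping state -> list of (label, next_state); labels are ignored here.

    def indegree_gt1_vertices(edges):
        indeg = {v: 0 for v in edges}
        for v in edges:
            for _label, w in edges[v]:
                indeg[w] += 1
        return {v for v, d in indeg.items() if d > 1}

    def cycle_ancestor_vertices(edges, start):
        """Approximate 'cycle ancestors' as the targets of DFS back-edges."""
        ancestors, visited, on_stack = set(), set(), set()

        def dfs(v):
            visited.add(v)
            on_stack.add(v)
            for _label, w in edges[v]:
                if w in on_stack:             # back-edge: w is where a cycle re-enters
                    ancestors.add(w)
                elif w not in visited:
                    dfs(w)
            on_stack.discard(v)

        dfs(start)
        return ancestors

    # NFA for (a|a)+, written out by hand: vertex 5 is where the two branches
    # rejoin; its back-edge returns to the loop head, vertex 0.
    edges = {0: [(None, 1), (None, 3)], 1: [("a", 2)], 2: [(None, 5)],
             3: [("a", 4)], 4: [(None, 5)], 5: [(None, 0), (None, 6)], 6: []}
    print(indegree_gt1_vertices(edges))       # {5}: the only vertex with two incoming edges
    print(cycle_ancestor_vertices(edges, 0))  # {0}: the target of the loop's back-edge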

Table encodings

Rather than a dense table with one slot per (vertex, offset) pair, we could use that favourite recourse of the software engineer: a hash table. Now we get pretty-fast access times, and we only pay for the space we use. However, on pathological inputs we’ll still end up filling O(m*n) slots. This is not helping.

Perhaps we can compress the data in the table. Why might this work?

  • Compression is effective when the data is not random. Some structure is needed.
  • The semantics of regex engines give us the structure we want. Regex match semantics impose an order in which to explore the NFA. In particular, updates will tend to accrete to adjacent indices as characters are processed. This is particularly true for the common case of the catch-all quantifier .* or .+.

We propose to use run-length encoding [9] (RLE) to represent the memo table. We use a binary tree, with runs as elements keyed by their offsets. This offers logarithmic access time in the number of runs, which is quite fast if the compression is effective. RLE can reduce the cost of storing the memo table, with best-case constant cost, though the worst case is still m*n. (I believe we can improve the worst-case cost, but I haven’t finished the math yet.)
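
As an illustration of the encoding (not the paper's data structure), here is a run-length-encoded set of visited offsets for a single memoized vertex. I use Python's bisect module over a sorted list of runs rather than an explicit binary tree; lookups and insertions are still logarithmic in the number of runs.

    # A run-length-encoded set of visited string offsets for one memoized vertex.
    # Adjacent offsets are stored as runs [start, end), so long contiguous
    # stretches of visits (the common case under .*-style scans) cost O(1) space.
    import bisect

    class RleOffsetSet:
        def __init__(self):
            self.starts = []             # sorted run starts
            self.ends = []               # parallel run ends (exclusive)

        def contains(self, i):
            k = bisect.bisect_right(self.starts, i) - 1
            return k >= 0 and i < self.ends[k]

        def add(self, i):
            if self.contains(i):
                return
            k = bisect.bisect_right(self.starts, i)
            left = k > 0 and self.ends[k - 1] == i                      # run ending just before i
            right = k < len(self.starts) and self.starts[k] == i + 1    # run starting just after i
            if left and right:           # i bridges two runs: merge them
                self.ends[k - 1] = self.ends[k]
                del self.starts[k], self.ends[k]
            elif left:
                self.ends[k - 1] = i + 1
            elif right:
                self.starts[k] = i
            else:                        # isolated offset: a new run of length 1
                self.starts.insert(k, i)
                self.ends.insert(k, i + 1)

    table = RleOffsetSet()
    for offset in [3, 4, 5, 9, 6]:
        table.add(offset)
    print(table.starts, table.ends)      # [3, 9] [7, 10]: two runs cover five offsets
    assert table.contains(5) and not table.contains(8)

In the common case described above, where visits accrete to adjacent offsets, the whole table collapses to a handful of runs, which is consistent with the roughly constant space costs reported in the evaluation below.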

Extended regexes

I won’t get into the analysis here, but:

  1. For most regexes that use lookaround assertions, the time cost can be reduced to linear in n with no increase in the space cost.
  2. For most regexes that use backreferences, the time cost is a small polynomial in n. However, these come with a greater space cost.

Evaluation

Prototype

Time cost

Case studies of applying our techniques to regular and extended regexes (K-regexes and E-regexes). The regexes are: (a) an exponential Microsoft username regex (responsibly disclosed); (b) the Cloudflare regex [3]; (c, d) hand-crafted examples. All K-regexes and REWZWA (regexes with zero-width assertions, i.e. lookarounds) can be handled in time linear in n. For REWBR (regexes with backreferences), memoization reduces the degree of the exponent.

Space cost

Sizes of the vertex sets for the selective memoization schemes. Whiskers indicate the 1st and 99th percentiles. Outliers are not shown. Among the supported regexes, the results are similar for the full set and for the super-linear subset.

In our experiments, using run-length encoding (RLE) reduces the space cost to constant for 90% of the regexes we tried.

Selective memoization significantly reduces the per-query storage cost. The overheads of a hash table appear to outweigh the benefits of the “only track what you’ve seen” approach (orange bars). For RLE, the 90th percentile cost was about 10 runs, resulting in effectively constant costs in this experiment. This experiment used input strings of length 20KB to imitate the Stack Overflow outage.

Conclusions

We revisited this longstanding problem from a data-driven perspective. We proposed two selective memoization schemes that offer comparable time complexity guarantees with lower space complexity. Using our newly-available large-scale regex corpus, we evaluated these schemes and found that they offer an order of magnitude reduction in memoization space costs for typical regexes. Furthermore, leveraging insights into regex engine search semantics, we showed that memoization space costs can be further reduced to constant for typical regexes. In short, we have demonstrated a provably sound ReDoS defense that fits within a backtracking framework, and shown how to reduce its space cost from problematic to constant for the majority of super-linear regexes.

It is our hope that our proposal will be incorporated into mainstream regex engines to eliminate the problem of ReDoS.

In closing, we note that the References in this blog are primarily from two decades: 1960–1970 and 2009–2019. The general techniques we use to solve ReDoS are old; what’s changed is the empirical evidence that ReDoS is a problem worth solving.

More information

  1. The artifact associated with the paper is available here.
  2. The slides are available here (talk recording here).

References

[2] Staicu and Pradel, 2018. Freezing the Web.

[3] Graham-Cumming, 2019. Details of the Cloudflare outage on July 2, 2019.

[4] Cox, 2009. Regular Expression Matching: the Virtual Machine Approach.

[5] Spencer, 1994. A regular-expression matcher.

[6] Thompson, 1968. Regular expression search algorithm.

[7] Michie, 1968. “Memo” functions and machine learning.

[8] Davis et al., 2019. Testing Regex Generalizability And Its Implications: A Large-Scale Many-Language Measurement Study.

[9] Robinson & Cherry, 1967. Results of a Prototype Television Bandwidth Compression Scheme.

[10] Cox, 2010. re1: A simple regular expression engine, easy to read and study.

I am a professor in ECE@Purdue. I hold a PhD in computer science from Virginia Tech. I try to summarize my research findings in practitioner-friendly ways.