Using Selective Memoization to Defeat Regular Expression Denial of Service

Regexes across the system stack. ReDoS may occur when a super-linear ("slow") regex meets unsanitized input in a regex engine with slow worst-case behavior.




  • To describe a* we can use a loop that consumes a’s, and once we’re done we can proceed along the “epsilon” edge (consumes no characters).
  • To describe a|b, we can introduce a fork. You can go up to match an a or down to match a b. So there’s a path to the right side of the diagram that consumes an a and a path that consumes a b, just like you’d expect from the meaning of the regex a|b.
Regex operators and NFA equivalents for character, concatenation, repetition, and disjunction (OR). Edges are labeled with the input characters that must be consumed in order to traverse them. The funny “e” symbols are called epsilon-edges and can be taken without consuming a character.
  • Most engines use a backtracking algorithm inherited from Spencer [5]: they start at the entrance (left side of the graph) and follow edges based on whether their label matches the current index in the string. Given a choice of edges (e.g. due to an OR), they try one and save the others for later. This is essentially a depth-first search through the automaton graph for a path to the accept state. These engines exhibit exponential worst-case behavior.
  • A few engines (Rust, Go) use Thompson’s “lockstep” algorithm [6]. When there is a choice of edges, they try taking all of them, and track all of the NFA states they might be in at any moment. This is akin to a breadth-first search. These engines exhibit linear worst-case behavior.
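To make the contrast concrete, here is a minimal sketch of the two strategies. The NFA encoding is hypothetical (an adjacency list mapping each state to `(label, next_state)` edges, with label `None` for an epsilon edge); it is illustrative only, not any engine's actual internals:

```python
def backtrack(nfa, accept, state, s, i, seen_eps=frozenset()):
    """Spencer-style depth-first search: try one edge, save the rest for later."""
    if state == accept and i == len(s):
        return True
    for label, nxt in nfa.get(state, []):
        if label is None and (state, i) not in seen_eps:
            # epsilon edge: advance the NFA without consuming a character
            # (seen_eps guards against epsilon cycles at a fixed offset)
            if backtrack(nfa, accept, nxt, s, i, seen_eps | {(state, i)}):
                return True
        elif i < len(s) and label == s[i]:
            # consuming edge: advance both NFA and string; the guard resets
            if backtrack(nfa, accept, nxt, s, i + 1):
                return True
    return False

def lockstep(nfa, accept, start, s):
    """Thompson-style breadth-first simulation: track the *set* of live states."""
    def eps_closure(states):
        stack, out = list(states), set(states)
        while stack:
            st = stack.pop()
            for label, nxt in nfa.get(st, []):
                if label is None and nxt not in out:
                    out.add(nxt)
                    stack.append(nxt)
        return out

    current = eps_closure({start})
    for ch in s:
        current = eps_closure({nxt for st in current
                               for label, nxt in nfa.get(st, [])
                               if label == ch})
    return accept in current

# Hypothetical NFA for (a|a)*: states 0-5, accept state 5
NFA = {
    0: [(None, 1), (None, 5)],
    1: [(None, 2), (None, 3)],
    2: [('a', 4)],
    3: [('a', 4)],
    4: [(None, 1), (None, 5)],
}
```

Both return the same answers, but on a rejecting input like `"a" * 30 + "b"` the backtracker explores exponentially many paths while the lockstep simulation does O(m) work per character.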
Regex with exponential worst-case behavior in a Spencer-style backtracking regex engine.
  1. There are two paths to the accept state for the first a, twice that many for aa, and twice again for aaa. There are exponentially many paths to explore in the number of a’s. Each of these paths results in failure at the moment the b is observed.
  2. Most of these paths are redundant! The paths split and then join again, so we need to explore only one of them per a in order to correctly reject the input string.

Why would you use backtracking and not lockstep?

  1. Ease of implementation and extension. Within a backtracking framework, each new regex feature (backreferences, lookaround assertions, etc.) can be defined naturally in terms of a walk down the regex AST — create a new rule for matching the new feature, and you’re good to go. Some of these features are sufficiently non-regular that they incur substantial space costs in a lockstep approach. For others we lack the automata theory to express them as an NFA.
  2. The weight of history / Technical inertia. Regex engines are often descended from Henry Spencer’s original (backtracking) implementation and have been extended since. Many engines are decades old now. They are used in myriad critical places. Any semantic regression is bad news. Converting a regex engine to a different algorithm would be a substantial and risky engineering effort. For an application, switching regex engines is also risky because of semantic subtleties and lack of extended feature support [1].


  1. A substantial rewrite of a regex engine is unlikely. The engineering costs are high and so are the risks.
  2. However, it would be nice to deal with ReDoS. Empirical research shows that super-linear regexes are common and are used in performance-critical contexts [1,2].


  1. The current position in the NFA (out of the m vertices)
  2. The index (offset) into the input string (out of the n indices)
  3. (OK, 3 things. The captured substrings also matter for backreferences. See Section IX of the paper for details.)
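The memoization insight follows directly: if the engine records every (vertex, offset) pair from which matching has already failed, it never re-explores a redundant path, so total work is bounded by m·n. A minimal sketch (the NFA is a hypothetical adjacency list, state → `[(label, next_state)]`, with label `None` marking an epsilon edge; this is illustrative, not any engine's actual layout):

```python
def memo_backtrack(nfa, accept, state, s, i, memo=None, seen_eps=frozenset()):
    """Spencer-style backtracking, memoizing failed (vertex, offset) visits."""
    if memo is None:
        memo = set()
    if (state, i) in memo:                 # this position has already failed
        return False
    if state == accept and i == len(s):
        return True
    for label, nxt in nfa.get(state, []):
        if label is None and (state, i) not in seen_eps:
            # epsilon edge: advance the NFA without consuming a character
            if memo_backtrack(nfa, accept, nxt, s, i, memo,
                              seen_eps | {(state, i)}):
                return True
        elif i < len(s) and label == s[i]:
            # consuming edge: the epsilon-cycle guard resets
            if memo_backtrack(nfa, accept, nxt, s, i + 1, memo):
                return True
    memo.add((state, i))                   # at most m * n entries in total
    return False

# Hypothetical NFA for (a|a)*: states 0-5, accept state 5
NFA = {
    0: [(None, 1), (None, 5)],
    1: [(None, 2), (None, 3)],
    2: [('a', 4)],
    3: [('a', 4)],
    4: [(None, 1), (None, 5)],
}
```

Without the memo table, rejecting `"a" * 30 + "b"` against this NFA explores on the order of 2^30 paths; with it, each (vertex, offset) pair is abandoned at most once.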

Reducing memoization costs

  1. We can memoize only the vertices with in-degree > 1 (i.e. more than one incoming edge). These vertices result from disjunctions like a|a and repetitions like a{1,3} and a*. Why does this work? For there to be redundancy in the NFA simulation, we must reach the same position (vertex + offset) along two different paths. At some point those paths must merge, which can only happen at a vertex with in-degree > 1. If we memoize the visits to that vertex, then we reach it at most once for each offset — never redundantly.
  2. We can memoize only the vertices that are “cycle ancestors” — those to which back-edges are directed as a result of unbounded repetition. This does not completely eliminate redundancy, but it bounds it to a constant that depends only on the NFA, not on the input. Why does this work? Slow regex behavior arises when ambiguity in the NFA is compounded by the presence of a cycle. When we memoize, we eliminate all but one of the possible cycles, limiting the redundancy to the ambiguity of the corresponding loop-free automaton.
(Left) In-degree>1 memoization for (a|a). (Right) Ancestor memoization for (a|a)+.
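Both vertex-selection schemes reduce to simple graph computations over the NFA. A sketch, again over a hypothetical adjacency-list encoding (state → `[(label, next_state)]`), with both function names invented for illustration:

```python
from collections import defaultdict

def indegree_gt1_vertices(nfa):
    """Select vertices with more than one incoming edge: the only
    places where two search paths can merge and become redundant."""
    indeg = defaultdict(int)
    for state, edges in nfa.items():
        for _label, nxt in edges:
            indeg[nxt] += 1
    return {v for v, d in indeg.items() if d > 1}

def cycle_ancestors(nfa, start):
    """Select the targets of back edges (vertices that close a cycle),
    found with a depth-first search from the start state."""
    ancestors, on_stack, visited = set(), set(), set()
    def dfs(v):
        visited.add(v)
        on_stack.add(v)
        for _label, nxt in nfa.get(v, []):
            if nxt in on_stack:
                ancestors.add(nxt)       # back edge: nxt heads a cycle
            elif nxt not in visited:
                dfs(nxt)
        on_stack.discard(v)
    dfs(start)
    return ancestors

# Hypothetical NFA for (a|a)*: states 0-5, accept state 5
NFA = {
    0: [(None, 1), (None, 5)],
    1: [(None, 2), (None, 3)],
    2: [('a', 4)],
    3: [('a', 4)],
    4: [(None, 1), (None, 5)],
}
```

On this example the in-degree scheme selects three vertices, while the cycle-ancestor scheme selects only the loop head, illustrating the space trade-off between the two.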

Table encodings

  • Compression is effective when the data is not random. Some structure is needed.
  • The semantics of regex engines give us the structure we want. Regex match semantics impose an order in which to explore the NFA. In particular, updates will tend to accrete to adjacent indices as characters are processed. This is particularly true for the common case of the catch-all quantifier .* or .+.
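As one plausible illustration of this structure, a single vertex's memo row can be stored as run-length-encoded intervals. The class below is a sketch invented for this post, not the paper's implementation:

```python
import bisect

class RLERow:
    """Run-length encoded set of visited offsets for one memoized vertex.
    Runs are (start, length) pairs, kept sorted and merged when adjacent.
    This stays compact when visits accrete at neighboring offsets, as
    regex match semantics encourage (e.g. while consuming through .*)."""

    def __init__(self):
        self.runs = []  # sorted, disjoint (start, length) pairs

    def add(self, i):
        pos = bisect.bisect(self.runs, (i, float('inf')))  # first run with start > i
        if pos > 0:
            s, l = self.runs[pos - 1]
            if s <= i < s + l:
                return                          # already recorded
            if i == s + l:                      # extends previous run upward
                self.runs[pos - 1] = (s, l + 1)
                self._merge(pos - 1)
                return
        if pos < len(self.runs) and self.runs[pos][0] == i + 1:
            s, l = self.runs[pos]               # extends next run downward
            self.runs[pos] = (i, l + 1)
            return
        self.runs.insert(pos, (i, 1))           # new singleton run

    def _merge(self, pos):
        if pos + 1 < len(self.runs):
            s1, l1 = self.runs[pos]
            s2, l2 = self.runs[pos + 1]
            if s1 + l1 == s2:                   # runs now touch: coalesce
                self.runs[pos] = (s1, l1 + l2)
                del self.runs[pos + 1]

    def __contains__(self, i):
        return any(s <= i < s + l for s, l in self.runs)
```

A vertex whose visits sweep left to right through the input collapses to a single run, so the per-vertex cost is proportional to the number of runs rather than to n.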

Extended regexes

  1. For most regexes that use lookaround assertions, the time cost can be reduced to linear in n with no increase in the space cost.
  2. For most regexes that use backreferences, the time cost is a small polynomial of n. However, these come with a greater space cost.



Time cost

Case studies of applying our techniques to regular and extended regexes (K- and E-). The regexes are: (a) Exponential Microsoft username regex (responsibly disclosed); (b) Cloudflare regex; (c,d) Hand-crafted examples. All K-regexes and REWZWA can be handled in time linear in n. For REWBR, memoization reduces the degree of the exponent.

Space cost

Sizes of the vertex sets for the selective memoization schemes. Whiskers indicate the (1, 99)th percentiles. Outliers are not shown. Among the supported regexes, the results are similar for all regexes and the super-linear subset.
Selective memoization significantly reduces the per-query storage cost. The overheads of a hash table appear to outweigh the benefits of the “only track what you’ve seen” approach (orange bars). For RLE, the 90th percentile cost was about 10 runs, resulting in effectively constant costs in this experiment. This experiment used input strings of length 20KB to imitate the Stack Overflow outage.


More information

  1. The full paper is available here.
  2. The artifact associated with the paper is available here.
  3. The slides are available here (talk recording here).



