Exploiting Input Sanitization for Regex Denial of Service

9 min readMar 7, 2022

This is a brief for the research paper “Exploiting Input Sanitization for Regex Denial of Service”, published at ICSE 2022. This work was led by my students Efe Barlas and Xin Du. The full paper is here. They also wrote this brief, which I have lightly edited.

In this article we use the word “regex” as shorthand for “regular expression”.

Summary

Regexes are common in software and are great tools for dealing with string processing tasks. Unfortunately, in many programming languages, the default regex engine may consume a lot of computational power. Some regexes take polynomial or even exponential time to process a string, based on the string length. If this processing happens on the server side, it could consume a significant amount of resources. This property creates a security threat: if an attacker knows the regexes used on the server side, then they can craft strings to trigger a costly regex match, and cause regex-based denial of service (ReDoS). For example, Cloudflare had a regex-based outage in 2019 that affected thousands of its customers [1].

Although it is hard for hackers to get access to the backend code and learn what regexes are used there, web services sometimes publish regexes backend to help users understand requirements. We wondered if this decision exposes web services to ReDoS risks.

We conducted a black-box study to (ethically) measure ReDoS vulnerabilities in live web services We apply an assumption that client-side sanitization logic, including regexes, is consistent with the sanitization logic on the server-side. We investigated both HTML forms and API specifications. Our result showed that regexes in API specifications pose a greater ReDoS threat. We discovered ReDoS vulnerabilities in several web domains, including domains operated by Microsoft and by Amazon.
To mitigate against ReDoS, we believe that there are better solutions than not publishing regexes at all [9]. This leaves two options: figuring out how to easily turn unsafe regexes into safe regexes (researchers don’t yet know how to do this perfectly) or adopting a safe regex engine for input sanitization.

We think the second option is a good long-term choice. To help web services switch to a safe regex engine and eliminate the threat of ReDoS in input sanitization, we provided engineers with an easy way to use a safe regex engine — that way they can write whatever regexes they want, and not need to worry about ReDoS. As a proof of concept, we added this feature to Ajv [13]. Using our feature [10], engineers can use a safe regex engine with one line of code.

Background

Regexes

Many software engineers use regexes to check whether the input from users meets requirements. For example, regexes can be used to examine whether passwords have both uppercase and lowercase letters. Another use case is for API specifications to help users better understand the data format of each API.

Unsafe Regexes and ReDoS

Some regexes may take a lot of time when processing through certain strings. This can happen when an unsafe regex is used under an unsafe regex engine. You might be curious what makes a regex unsafe. It is this: when a regex can parse a string in multiple ways, the engine may try every single one of them, resulting in problematic time complexities ranging from polynomial to exponential [11]. For a more detailed explanation, take a look at this post.

If an attacker is aware that there are unsafe regexes in a web service’s backend, they can craft and send attack strings. The attack is asymmetrical — it is cheap for the attacker to send the string, but it could take the server a very long time to process it. It’s like your toddler spilling milk everywhere: takes 1 second to spill, but 5 minutes to clean up. As a result, the access of other normal users to this web service may be influenced. Such an attack is called Regex-based Denial of Service (ReDoS).

But how could attackers possibly know about regexes that a particular web service uses? The next section will explain.

Two Common Places for Finding Clues

Engineers may reveal sanitization regexes in (at least) three places on the client side: in HTML forms and in JavaScript handlers (figure (a)); and in API specifications (figure (b)).

If we pretend to be a hacker who is interested in what regexes are used in the backend, there are two common places where we can find clues.

The first is HTML forms, where regexes may be used to examine the input before it is sent to the web service’s backend. Within HTML forms, there are two ways to include regex checks. One is to use the pattern attributes for an input tag. This is a built-in feature of HTML [6]. The other way is to include custom JavaScript which then uses regexes to check the input. This JavaScript code snippet is usually triggered when users type in texts or press the submit button. Web programmers often use JavaScript to do form validation, because it allows them to do much more than what HTML built-in features allow. The two ways are both shown on the left side of the figure.

The second common place is API specifications. Many web services have APIs that users can connect to programmatically. In order to specify the input requirement, such as the path, data format, and return value, of those APIs, web services describe their APIs in various forms. Some are more casual, and some follow a specific format. A common format to use is called OpenAPI specification [7]. An example of an OpenAPI specification is shown on the right side of the figure.

Approach

Our approach relies on an assumption: that client-side sanitization logic, including regexes, is consistent with the sanitization logic on the server-side. This is a common practice among programmers and also recommended by the popular web programming website Mozilla [8]. Because of this assumption, if we can find regexes that are used on the client-side, the regexes that are used on the server-side might be the same. We can then generate strings that trigger super-linear matching time on the regex and send it to the server-side. Based on the difference between response times, we can determine whether the server-side regex is actually vulnerable. We took measures so that our experiments did not impact the availability of the web service.

Finding Regexes

Because there are two common places (HTML forms and API specs) to find clues, we have two different ways of finding regexes from them. For HTML forms, we first randomly selected 1000 web services from the Tranco Top 1M list [2]. Then, we used a web crawler called Apify [3] to crawl those 1000 web services and find web pages that we can analyze. For those web pages that have HTML forms on them, we drove an automated browser via OpenWPM [4]. We used OpenWPM to populate form fields with values. We also added our monkey patch code to those regex-related JavaScript functions in the browser environment. During the process, if any regex-related JavaScript functions were triggered, our monkey-patching code would notify our database which records down the regex and string being used. Regexes used in pattern attributes are easier to find. We simply parsed the HTML code.

For APIs, we collected API specifications from an online directory called apis.guru [12]. We parsed each specification and identified the regex(es) that constrain any string inputs referred to by at least one request schema.

Detecting Super-linear Regexes

Once we found all the regexes through the two places, we then examined them to see how many of them were actually super-linear (so that they actually pose a threat). We used a state-of-the-art vulnerable regex detector [5]. We define a regex as super-linear if it exhibits a more-than-linear increase in match time as the attack string’s length gets longer.

Fingerprinting Unsafe Regex Use

For those web pages and API endpoints that were found to potentially have super-linear regexes, we ethically probed them with carefully designed input. We then monitored the difference between response times and determined whether they were actually vulnerable to ReDoS or not.

Results & Discussion

We analyzed 475 domains’ OpenAPI specifications, and 696 domains’ HTML forms & JavaScript files. 17% of API domains and 39% of HTML form domains used regexes. Interestingly, regex reuse is a lot more common in HTML forms than in API specifications: we found only 33 unique regexes in 4.9k HTML pattern attributes, compared to 1841 unique regexes across 2681 regex uses in API domains.

We found that super-linear regex use is unlikely, as only 15 API domains and 6 HTML form domains used super-linear regexes. However, in our experiments we found response time deviations for 6 out of 15 API domains, 2 of which were major tech companies. We did not find response time deviations in any of the HTML domains. It seems that web services are likely to be ReDoS-vulnerable if they are using super-linear regexes in their API specifications, but the sample size (N=15) is too small for making a convincing claim.

We think that the higher risk of ReDoS for API specifications might be caused by:

API specification-based code generators. API specifications are an important step to designing a web service. Although the safety of regexes isn’t a primary concern during the specification phase, when that specification is used to generate code, software engineers can unintentionally introduce ReDoS vulnerabilities to their web service.
Tools which compose API specifications by parsing code. Software engineers using these tools may unintentionally publish the regexes they’re using within the backend.

It’s easy to tell software engineers to “develop safe regexes”, but it’s hard for software engineers to use this advice in practice (because regexes are hard). In most cases, it’s much easier for the engineer to use a safe tool — switching to use a safe regex engine for simple regexes. So we decided to make a pull request to a popular input validation tool called Ajv. Ajv [13] is used by popular API specification-based middleware (code generators and validators) to validate requests. Our pull request, now merged and released, lets the engineers choose the regex engine they wish to use. This patch enables Ajv’s millions of dependents to eliminate the risk of ReDoS in their input sanitization.

Conclusion

ReDoS has received much recent attention. Because of this interest, we did the first black-box measurement study of ReDoS vulnerabilities on live web services. Our method is based on the observation that server-side input sanitization may be mirrored on the client-side. We examined the extent to which super-linear regexes on the client side can be exploited as ReDoS vulnerabilities on the server side. We compared two common interface types: HTML forms (𝑁 =1,000 domains) and APIs (𝑁 = 475 domains).

The results for the two interface types came out to be different. Although regexes are common in both types of interfaces, the regexes that could cause the most trouble, which are super-linear regexes, are only common in APIs. We found ReDoS vulnerabilities in the APIs of 6 domains (15 subdomains), two of which were major tech companies.

Our findings add weight to the concerns about the risks of ReDoS in practice. We hope that our findings motivate companies to mitigate their ReDoS risks, and that our patch to Ajv gives them an easy avenue by which to do so.

References

[1] Graham-Cumming, 2019. Details of the Cloudflare outage on July 2, 2019. https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/

[2] Tranco Top 1M List. https://tranco-list.eu/

[3] Apify Web Crawler. https://apify.com/

[4] OpenWPM. https://github.com/openwpm/OpenWPM

[5] vuln-regex-detector. https://github.com/davisjam/vuln-regex-detector

[6] HTML input tag. https://developer.mozilla.org/en-US/docs/Web/HTML/Element/input

[7] OpenAPI specification. https://swagger.io/specification/

[8] Mozilla Constraint Validation. https://developer.mozilla.org/en-US/docs/Web/API/Constraint_validation

[9] Security through obscurity. https://en.wikipedia.org/wiki/Security_through_obscurity

[10] Ajv documentation. https://ajv.js.org/security.html#redos-attack

[11] Nicolaas Weideman, Brink van der Merwe, Martin Berglund, and Bruce Watson. 2016. Analyzing matching time behavior of backtracking regular expression matchers by using ambiguity of NFA. In International Conference on Implementation and Application of Automata. Springer, 322–334.

[12] apis.guru.

[13] Ajv. https://ajv.js.org/