Exploiting Input Sanitization for Regex Denial of Service

Summary

Regexes are common in software and are great tools for dealing with string processing tasks. Unfortunately, in many programming languages, the default regex engine may consume a lot of computational power. Some regexes take polynomial or even exponential time to process a string, based on the string length. If this processing happens on the server side, it could consume a significant amount of resources. This property creates a security threat: if an attacker knows the regexes used on the server side, then they can craft strings to trigger a costly regex match, and cause regex-based denial of service (ReDoS). For example, Cloudflare had a regex-based outage in 2019 that affected thousands of its customers [1].

  1. We conducted a black-box study to (ethically) measure ReDoS vulnerabilities in live web services We apply an assumption that client-side sanitization logic, including regexes, is consistent with the sanitization logic on the server-side. We investigated both HTML forms and API specifications. Our result showed that regexes in API specifications pose a greater ReDoS threat. We discovered ReDoS vulnerabilities in several web domains, including domains operated by Microsoft and by Amazon.
  2. To mitigate against ReDoS, we believe that there are better solutions than not publishing regexes at all [9]. This leaves two options: figuring out how to easily turn unsafe regexes into safe regexes (researchers don’t yet know how to do this perfectly) or adopting a safe regex engine for input sanitization.

Background

Regexes

Many software engineers use regexes to check whether the input from users meets requirements. For example, regexes can be used to examine whether passwords have both uppercase and lowercase letters. Another use case is for API specifications to help users better understand the data format of each API.

Unsafe Regexes and ReDoS

Some regexes may take a lot of time when processing through certain strings. This can happen when an unsafe regex is used under an unsafe regex engine. You might be curious what makes a regex unsafe. It is this: when a regex can parse a string in multiple ways, the engine may try every single one of them, resulting in problematic time complexities ranging from polynomial to exponential [11]. For a more detailed explanation, take a look at this post.

Two Common Places for Finding Clues

Engineers may reveal sanitization regexes in (at least) three places on the client side: in HTML forms and in JavaScript handlers (figure (a)); and in API specifications (figure (b)).

Approach

Overview of our research methodology

Finding Regexes

Because there are two common places (HTML forms and API specs) to find clues, we have two different ways of finding regexes from them. For HTML forms, we first randomly selected 1000 web services from the Tranco Top 1M list [2]. Then, we used a web crawler called Apify [3] to crawl those 1000 web services and find web pages that we can analyze. For those web pages that have HTML forms on them, we drove an automated browser via OpenWPM [4]. We used OpenWPM to populate form fields with values. We also added our monkey patch code to those regex-related JavaScript functions in the browser environment. During the process, if any regex-related JavaScript functions were triggered, our monkey-patching code would notify our database which records down the regex and string being used. Regexes used in pattern attributes are easier to find. We simply parsed the HTML code.

Detecting Super-linear Regexes

Once we found all the regexes through the two places, we then examined them to see how many of them were actually super-linear (so that they actually pose a threat). We used a state-of-the-art vulnerable regex detector [5]. We define a regex as super-linear if it exhibits a more-than-linear increase in match time as the attack string’s length gets longer.

Fingerprinting Unsafe Regex Use

For those web pages and API endpoints that were found to potentially have super-linear regexes, we ethically probed them with carefully designed input. We then monitored the difference between response times and determined whether they were actually vulnerable to ReDoS or not.

Results & Discussion

We analyzed 475 domains’ OpenAPI specifications, and 696 domains’ HTML forms & JavaScript files. 17% of API domains and 39% of HTML form domains used regexes. Interestingly, regex reuse is a lot more common in HTML forms than in API specifications: we found only 33 unique regexes in 4.9k HTML pattern attributes, compared to 1841 unique regexes across 2681 regex uses in API domains.

  • API specification-based code generators. API specifications are an important step to designing a web service. Although the safety of regexes isn’t a primary concern during the specification phase, when that specification is used to generate code, software engineers can unintentionally introduce ReDoS vulnerabilities to their web service.
  • Tools which compose API specifications by parsing code. Software engineers using these tools may unintentionally publish the regexes they’re using within the backend.

Conclusion

ReDoS has received much recent attention. Because of this interest, we did the first black-box measurement study of ReDoS vulnerabilities on live web services. Our method is based on the observation that server-side input sanitization may be mirrored on the client-side. We examined the extent to which super-linear regexes on the client side can be exploited as ReDoS vulnerabilities on the server side. We compared two common interface types: HTML forms (𝑁 =1,000 domains) and APIs (𝑁 = 475 domains).

References

[1] Graham-Cumming, 2019. Details of the Cloudflare outage on July 2, 2019. https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
James Davis

James Davis

139 Followers

I am a professor in ECE@Purdue. My research assistants and I blog here about research findings and engineering tips.