Why Aren’t Regexes a Lingua Franca?
An Empirical Study on the Re-use and Portability of Regular Expressions
This is a brief for the research paper Why Aren’t Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions, presented at ESEC/FSE 2019. I was the first author, alongside Mischa Michael, Christy Coghlan, Francisco Servant, and Dongyoon Lee.
In this article I use the word “regex” as shorthand for “regular expression”.
Summary
In this paper we investigated how software developers go about writing regexes, and we studied the extent to which their practices may contribute to software bugs due to portability problems. We found that:
- From a survey of 158 software developers, many say that they re-use regexes, and many think of regexes as a universal language (a “lingua franca”).
- From an analysis of about 200,000 software projects written in 8 programming languages, many software projects actually do contain the same regexes, even across programming languages. We traced some of these regexes back to RegExLib and to Stack Overflow questions tagged “regex”.
- We measured portability by testing the behavior of real-world regexes in these 8 programming languages. We extracted about 540,000 unique regexes from these projects. Although 76% of the regexes will compile in all 8 of these programming languages — 88% if you don’t count Rust (syntactic) — they do not always match the same strings (semantic) or have the same worst-case matching time (performance).
Motivation
I spent a few years working for IBM as a software tester on the GPFS project. One of my responsibilities was to make sure that some of the command-line interfaces were working properly. I used a lot of regexes to parse the output of these commands, in a mix of Bash/grep, Perl, and Python. Sometimes I Googled for regexes (e.g. when a command would emit an email and I wanted a “pretty good” regex for emails), and other times I would write the regex from scratch. And sometimes I would initially use a regex in grep and then move it into a Perl or Python program.
When I wrote, copied, and moved a regex from language to language, I always wondered whether the regex would work the same way in the new language. It would usually compile and seem to work right, which was good enough for my purposes, but ever since I’ve been curious about whether and how portable regexes really are.
The consequences of regex non-portability
Regexes are hard to get right, and if you copy regexes from another source, or if you write them based on your experience with regexes in a different language, then you can run into three regex portability problems. Of course these problems can also emerge without needing to cross programming language boundaries, but the copy-paste’ability of regexes introduces interesting additional wrinkles.
Syntactic portability problems
Copying regexes can lead to runtime exceptions and program crashes due to syntactic portability problems.
Some regexes will compile in some languages and not others. An easy example is regexes that use the Python (?P…) notation for capture groups or named backreferences; these regexes won’t compile in some other programming languages and will throw an error. These errors can result in runtime exceptions and program crashes [1].
Semantic portability problems
Copying regexes can lead to hard-to-debug program misbehavior due to semantic portability problems.
Some regexes will compile in two different programming languages, but will have different match behavior. For example, the notation /\Q…\E/ will quote the text in between the \Q and \E in Java, PHP, Go, and Perl, but in JavaScript, Ruby, and Python these symbols will be treated as the literals Q and E. It’s easy to imagine a Java developer writing (or copying) such a regex into JavaScript, and being surprised when the regex matches a different set of strings. If the regex is short, this behavior and the cause might be obvious, but I don’t think this is the case for all regexes. Some regexes are quite complex and it’s hard to generate input to exercise all paths and to encode this into unit tests [2].
Performance portability problems
Copying regexes can lead to slow software, even denial of service, due to performance portability problems.
Most programming languages implement their own regex engine, whose job it is to accept a regex and a string and determine whether or not the regex matches the string. Regex engines differ in the syntax and semantics they support, but they also differ in their performance, the time and/or space cost to perform a match. Regex engines are pretty fast, and usually they aren’t the bottleneck in a computer program. But some regex engines have a large worst-case cost for particular combinations of regexes and inputs, with time costs ranging from quadratic to exponential [3]. Since regexes are commonly used for input validation, this property exposes service providers to Regular expression Denial of Service (ReDoS). Many web services have been affected by ReDoS [4], most notably CloudFlare [5].
Regex performance portability problems can happen when a slow regex is copied from a context where performance is not a problem to a customer-facing role where performance is key. For example, a regex might be copied from software written in Rust or Go (which guarantee linear-time matches) into a language where the regex has exponential worst-case time complexity.
Research plan
We wanted to learn a lot about regex portability practices and problems. We divided our project into three tasks.
- Talk to software developers and determine whether my personal regex practices were shared by others.
- Corroborate re-use practices using a set of real regexes.
- Perform experiments to determine the extent to which the regex portability problems described above will occur when using real regexes.
Talking to developers
We developed a survey about regex development practices. We sent it out to all the software developers we knew personally, and also posted it in on HackerNews and Reddit. We got 158 responses, about half from friends and friends-of-friends, and about half from the broader Internet.
The next three figures and captions describe what we learned. In short, we found evidence that an appreciable fraction of software developers treat regexes as a lingua franca, acting as though they can cross programming language boundaries without portability problems.
A big thank-you to everybody who filled out the survey!
Corroborating regex re-use practices
The next phase of our research was to corroborate the regex re-use practices that developers reported. Developers reported re-using regexes both from Internet sources like Stack Overflow and RegExLib, and from other source code, so we looked for regexes in both places and compared them.
Regexes in source code
We looked at the major software module registry in 8 programming languages and mapped its modules to GitHub projects as well as we could. We cloned the top 25,000 projects from each registry based on the number of GitHub stars. Then we used static analysis frameworks to extract the regexes declared like /abc/ or new RegExp(“abc”) (or the equivalent in each language) in each project. (This means we didn’t get regexes that are declared using variables, e.g. new RegExp(r)).
We extracted 537,806 unique regexes across 193,524 software projects written in 8 languages — a truly polyglot regex corpus.
Internet regexes
We downloaded a Stack Overflow snapshot from 2018, pulled all the questions and answers tagged with “regex”, and extracted all the code blocks from those posts.
We ran an empty search (i.e. “all regexes”) on RegExLib and scraped the resulting web pages for the regex fields.
These did not always produce strings corresponding to regexes, but we only considered the regexes from these collections that also appeared in source code. We describe this next.
Regex re-use
For each software module, we compared its regexes to:
- The regexes in other modules in the same programming language (inter-language duplicates).
- The regexes in modules in other programming languages (intra-language duplicates).
- The regexes in the Internet sources.
The next figure summarizes the re-use practices we measured.
Regex experiments
We next wanted to measure the extent to which developers might encounter portability problems if they were to copy/paste a regex from one programming language to another.
We performed experiments simulating copy/pasting every regex in our corpus, starting in each of the 8 programming languages in our study and ending in each of the other 7 languages. We tested:
- Whether it would compile in the new programming language;
- Whether it would match the same set of strings; and
- Whether it would have the same worst-case performance.
Tools
Evaluating a regex
We implemented a simple driver in each programming language. Each driver read in a regex and a set of inputs, compiled the regex, and evaluated the regex against each input to determine:
- The match/mismatch behavior;
- The matched string; and
- The contents of any capture groups.
These drivers used partial-match semantics and the default regex flags.
Generating input
We used an ensemble of five regex input generators and generated a median of 2,410 distinct input strings for each regex.
Comparing semantic behavior
If the hypothetical source and destination programming languages both supported a regex, we compared the match behavior for any inconsistencies in match/mismatch outcomes or in the specific string(s) matched.
Worst-case behavior
In my previous project [3] I built tools to test the worst-case performance of a regex using an ensemble of super-linear regex detectors (GitHub project). I leveraged those tools here, extending them with a new detector and overcoming a shortcoming of the detectors using “regex variants” (Section 7.2.1).
Syntax portability
About 76% of all regexes in our corpus compiled in all 8 programming languages. Rust had the most limited syntax; 88% of regexes compiled in all of the other 7 programming languages.
Semantic portability
We observed semantic portability problems for 15% of regexes.
In the paper we analyze the root causes of the differences (Table 4).
Performance portability
About 10% of the regexes in our corpus had worse-than-linear worst-case behavior.
In the paper we discuss the implementation of the PHP and Perl regex engines to explain why they are faster than the engines in the “Slow” family (Section 7.2.3).
Conclusions
Regexes are not a universal language — they are not a lingua franca! This finding is, in itself, perhaps not surprising. Yet it has many implications.
Developers
Be careful with regexes, especially when copy/pasting them across language boundaries or when you aren’t sure of the language of origin.
- For syntax, make sure your test suite actually evaluates the regex.
- For semantics, make sure the inputs in your test suite cover all branches of the regex.
- For performance/security, be careful about applying regexes to untrusted input. Make sure you limit the length of the input, but as discussed in [3] even length limitations may not suffice.
Researchers
Here are some ideas for next steps:
- In the paper we discuss several bugs we found in the programming languages’ regex engines (bug reports). Let’s test the regex engines!
- In the paper we discuss under-specified regex semantics in many programming languages. Can we formally define these semantics (e.g. to support testing the regex engines)?
- Can we unify the regex semantics across programming languages so that regexes truly become a lingua franca?
- Can we help developers search for regexes, e.g. by mining the polyglot regex corpus we created?
- Can we develop a regex translator that takes into account syntax, semantic, and performance differences?
- Can we adapt and extend the Perl and PHP engine techniques to reduce or eliminate performance problems (/ReDoS) in practice?
If you like these ideas or have additional ideas, please let me know in the comments.
More information
- The full paper is available here.
- The PowerPoint slides are available here.
- The research artifact is on Zenodo here (backed by this GitHub project). The artifact contains (1) the polyglot regex corpus, (2) the input generator ensemble, and (3) drivers to test regexes for semantic and performance consistency.
References
[1] Spishak, Dietl, and Ernst, 2012. A type system for regular expressions.
[2] Wang and Stolee, 2018. How well are regular expressions tested in the wild?
[3] Davis et al., 2018. The Impact of Regular Expression Denial of Service (ReDoS) in Practice.
[4] Staicu and Pradel, 2018. Freezing the Web.
[5] Graham-Cumming, 2019. Cloudflare outage caused by bad software deploy.