Node.fz: Fuzzing the Server-Side Event-Driven Architecture

8 min readSep 29, 2019

This is a brief for the research paper Node.fz: Fuzzing the Server-Side Event-Driven Architecture, presented at the European Conference on Computer Systems (EuroSys) 2017. I was the first author, supported by Arun Thekumparampil and Dongyoon Lee.

Summary

The event-driven architecture (EDA)has reached the mainstream on the server side.

To use the EDA, developers register callbacks for client events, stringing together complex response generation using asynchronous ordering primitives like nested callbacks, async/await, futures, or promises. The event-driven architecture is typically implemented with a single thread, avoiding explicit race conditions due to concurrency. But if a developer mis-encodes the ordering constraints amongst the stages of response generation, then implicit race conditions can occur when their asynchronous response stages fire in unexpected orders.

In this work we characterized 12 race conditions previously identified in open-source Node.js applications. This investigation informed our design of Node.fz, a data race detection tool for Node.js applications. Using Node.fz, we were able to more consistently recreate these bugs as well as to identify new bugs in three applications.

Background

Production web servers have traditionally been implemented using the one-thread-per-client architecture (OTPCA). The Apache web server is the most prominent implementor of this architecture. In the OTPCA, the server maintains a pool of thousands of threads. When a client connects and makes a request, a thread is assigned to generate a response. An OTPCA-based server relies on the operating system scheduler to ensure that each client (thread) makes progress.

One-Thread-Per-Client Architecture (OTPCA). The Apache web server is the most famous example of this architecture.

The OTPCA simplifies the implementation of the server, but managing all of those threads introduces a lot of overhead that is unnecessary for some web server use cases. In particular, each thread costs memory (thread metadata) and time (scheduling). If a web server could make do with fewer threads, it could run on fewer resources — costing its manager less to run. For example, a leaner web server could run on a lightweight AWS VM instead of a beefy one.

In the 1990s, academics debated whether and how a leaner web server could be architected and implemented. They decided that an alternative to the OTPCA was some variation on the event-driven architecture (EDA). In EDA-based servers, all client requests are handled by a small number of threads, and the application-level developer must implement their own scheduler to ensure that each client is fairly treated.

The event-driven architecture (EDA). This architecture has recently been popularized for use in web servers through the Node.js and Twisted web frameworks. This embodiment of the EDA is called the Asymmetric Multiprocess Event-Driven Architecture (AMPED). See the SEDA paper for an alternative.

The OTPCA won the debate in practice, and the Apache web server dominated the web server world for many years. But recently, the EDA has made a comeback. For example, it is used both in reverse proxies like NGINX and in applications built with web development frameworks like Node.js (JavaScript) and Twisted (Python).

In these web development frameworks, the EDA has taken a form called the “asymmetric multiprocess event-driven architecture” (AMPED), which I’ll just call the EDA in this post. In the EDA, all responses to clients are sent by one thread called the event loop. In order for this thread to successfully juggle many clients, the creation of a response is partitioned into synchronous stages (run on the event loop) separated by I/O (run by a “worker pool” of helper threads). When each I/O completes, it is wrapped up into an event that is delivered to the event loop to advance the generation of the response.

Bug study: What kinds of concurrency bugs does Node.js software contain?

Although the EDA offers improved scalability for certain workloads, it can be difficult to correctly implement an EDA-based application. In particular, partitioning the creation of a response into distinct stages — some of which can execute in any order, others of which must be executed in a particular order — is difficult. Developers must correctly make use of the asynchronous primitives available in the framework or programming language, such as callbacks, promises, async/await, and futures.

Race conditions may result if this partitioning is implemented incorrectly. These race conditions are not the traditional multithreaded race condition, in which two threads may access the same memory in an unordered manner. Remember, in the EDA there is only one event loop thread, so it cannot “race” with itself. So race conditions in the EDA take the form of the event loop handling events in an order that the developer did not expect. These events can be associated with a single request, or they might be due to a bad interaction between requests from two different clients.

Example race condition (kue bug #483). The delayed() method was supposed to be called only after the update() method completed. But these methods are both asynchronous — they return before they complete, and instead indicate completion using a callback. As a result, the original code on line 5 did not enforce any ordering between them. Sometimes update() would complete before delayed() was triggered, and the bug would notmanifest. But jobs would be improperly cleaned up if the delayed() event completed first. The repair (lines 6–8) used a callback to ensure that delayed() would only be called after update() completed.

We searched GitHub for issues with the phrase “race condition” associated with Node.js applications. We studied the 12 bugs listed in the following table, characterizing them across several dimensions.

Characteristics of concurrency bugs in Node.js software, sorted by software type (see Table 1 in the paper) and race type. Race type is atomicity violation (AV) or ordering violation (OV); commutative ordering violations are marked with a (C). Races were either “solo” (intra-request) or due to competing concurrent requests. The racing events were network responses (NW) (typically from an external resource like a database), calls to the racy API (Call), timers (Timer), file system interactions (FS — uses worker pool), and “application-dependent asynchronous step” (X).

We drew several insights from this investigation:

Like client-side JavaScript, server-side JavaScript software written for Node.js suffers from race conditions. We observed both atomicity violations and ordering violations in the races we examined, including a new subtype called commutative ordering violation.
It’s no surprise that we observed races on shared memory (e.g. writes to variables and arrays). But due to the “open system” nature of server-side software, we also saw races on external resources (e.g. network traffic, timers, user method calls, queries to database, I/O to file system). This style of race has not been reported in client-side EDA (JavaScript) concurrency studies, and significantly complicates the task of anyone seeking to build a Node.js data race detector.
Race conditions may result in severe consequences including server crashes and inconsistent database states.
While the Node.js community has excellent techniques to fix the OV bugs in our study, they do not seem to have tools to help detect these or the AV bugs.

What changes to Node.js would help developers find such bugs earlier?

Node.fz amplifies Node.js’s internal nondeterminism, allowing applications to explore a broader schedule space for the same input.

The race conditions discussed in our bug study are difficult to find dynamically due to non-determinism in EDA-based systems like Node.js. This non-determinism, arising from the order in which inputs and intermediate events are handled by the event loop and the worker pool, masks the race conditions that cause intra- and inter-callback chain races. Inspired by the success of schedule fuzzing approaches to find race conditions in the multi-threaded context, we developed Node.fz, an EDA schedule fuzzing scheme designed for Node.js.

Highlights of Node.fz, our fuzzed EDA scheme. Dotted lines indicate architectural changes.

At a high level, Node.fz takes control of the event loop’s event queue and the worker pool’s task and done queues. Node.fz then fuzzes these queues to explore alternative schedules.

By shuffling the entries in the event queue before executing each callback, Node.fz yields schedules with alternative input and intermediate event arrival orders.
By shuffling the entries in the worker pool’s task and done queues, Node.fz produces schedules with alternative worker pool task processing and completion order.

Node.fz amplifies the non-determinism in Node.js using the techniques of de-multiplexing, event shuffling, and event delaying, achieving a greater exploration of the possible application schedule space without requiring any developer intervention. As a drop-in replacement for Node.js, developers can easily make use of Node.fz during development and test and then seamlessly switch to the optimized Node.js binary in production. Developers then have the assurance that their applications will be stable under a wider variety of deployment conditions.

To test the effectiveness of Node.fz, we developed test cases to trigger each of the bugs from our bug study. We then ran these cases using Node.js and variations of our Node.fz prototype. As the next figure shows, Node.fz was able to reproduce these race conditions much more consistently than default Node.js. We therefore believe that the Node.fz technique could help Node.js application developers identify their race conditions during testing, not production.

Bug reproduction rates for **nodeV** (vanilla Node.js), **nodeNFZ** (the Node.fz prototype with no fuzzing configured), **nodeFZ** (our prototype with default parameters), and ***nodeFZ(guided)*** (the Node.fz prototype configured to target a particular bug, only used for the KUE (known) bug). In the majority of these cases, only Node.fz was able to cause the bug to manifest.

Conclusions

We presented Node.fz, a novel schedule fuzzing test aid for server-side EDA programs, targeting the Node.js environment. The design of Node.fz was based on the first concurrency bug study of real-world Node.js (and EDA) software, in which we discussed the forms that atomicity and ordering violations take in the EDA. Based on the root causes of the bugs in our study, we designed Node.fz to shuffle the order of input events and callback chains as they appear in the Node.js runtime.

Our results show that Node.fz can trigger known bugs more frequently, expose new bugs, and expand the schedule space explored by a test suite, all with an acceptable overhead.

Reflections

In this paper I began to develop my general approach to research: using empirical software engineering (ESE) research to motivate systems research and development. I think ESE work benefits systems work in a few ways:

You learn a lot about the problem space and see what kinds of issues developers are having.
You can design systems that are well aligned with the problems practitioners face.
You can use the ESE results to evaluate the system. For example, if the ESE component is a bug study and you want to build a bug detection system, you can see if your system finds the bugs from your study.
ESE reports further the scientific venture, giving other researchers tools and a benchmark for their own system evaluations.
ESE reports can be used as a reference point for standards committees and project core developers.

Looking back I see some things I don’t like about the work — the bug study is a bit small, and I wish we’d taken more time to discern how reasonable a systematic schedule exploration approach would have been. But we were the first to begin discussing Node.js from a systems perspective, and it has been cool to see follow-up research from other groups that repair some of our weaknesses (e.g. [1]).

I think if I were to do the work again, I would divide it into two papers:

A much larger bug study (cf. [1]) to improve generality and to consider other classes of problems.
A more thorough bug detection system.

I think it’s hard to fit both the ESE and systems aspects of a research project into a single conference paper and do them both justice. Some systems conferences have larger page limits, which permitted me to try this in Node.cure [2], but the page limits in the Software Engineering conferences preclude trying to put both into one work.

More information

The full paper is here.
The PowerPoint slides are available here.
The research artifact is available here.
Adrian Colyer covered Node.fz in The Morning Paper here.

References

Wang et al., 2017. A Comprehensive Study on Real World Concurrency Bugs in Node.js.
Davis et al., 2018. A Sense of Time for JavaScript and Node.js (and my Medium post about it)