Behind the scenes of science
This notification was thrilling for two reasons:
- This was the first paper I had owned from start to finish.
- The paper had been rejected a lot of times.
This post presents the saga of the paper, and includes the different stages of the manuscript and the reviews each version received. I conclude with some reflections about the process.
My intention in writing the post is to give a behind-the-scenes look at the life of an oft-rejected paper. I have heard rumors of such works before, but I am not aware of a description of the process. I hope this post is interesting to current and future graduate students, both as a “fossil record” and as an encouragement when dealing with rejection.
General idea of the paper
Suppose a server handles many clients on one thread. If a client can convince the server to spend longer than expected handling a request, then this can be used to carry out a Denial of Service (DoS) attack. My paper calls this type of attack Event Handler Poisoning (EHP), since such servers typically use the Event-Driven Architecture (EDA).
The most prominent example of a server architecture like this is Node.js, in which a single Event Loop (thread) handles client interactions, with support from a small fixed-size Worker Pool (threadpool). If a request might cause the Event Loop or a Worker to block, then a DoS attack can result, since while a thread is blocked on one client the other clients will starve.
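To make the starvation effect concrete, here is a minimal sketch of a single-threaded server modeled as a queue. It is a toy simulation, not Node.js itself: the `SimRequest` type, the `serveOnOneThread` function, and the client names and costs are all made up for illustration.

```typescript
// Hypothetical model: each request has an arrival time and a handling cost (ms).
interface SimRequest {
  client: string;
  arrivalMs: number;
  costMs: number;
}

interface Served {
  client: string;
  latencyMs: number; // completion time minus arrival time
}

// Serve requests one at a time, in arrival order, on a single simulated thread.
function serveOnOneThread(requests: SimRequest[]): Served[] {
  const queue = requests.slice().sort((a, b) => a.arrivalMs - b.arrivalMs);
  let clockMs = 0;
  return queue.map((req) => {
    const startMs = Math.max(clockMs, req.arrivalMs);
    clockMs = startMs + req.costMs; // the thread is busy until this request finishes
    return { client: req.client, latencyMs: clockMs - req.arrivalMs };
  });
}

const benign: SimRequest[] = [
  { client: "alice", arrivalMs: 0, costMs: 1 },
  { client: "bob", arrivalMs: 1, costMs: 1 },
  { client: "carol", arrivalMs: 2, costMs: 1 },
];

// The same workload, preceded by one request that takes 5 seconds to handle.
const poisoned: SimRequest[] = [
  { client: "mallory", arrivalMs: 0, costMs: 5000 },
  ...benign,
];

const normalLatencies = serveOnOneThread(benign);     // every client waits 1 ms
const poisonedLatencies = serveOnOneThread(poisoned); // alice, bob, and carol each wait 5001 ms
```

In this model a single expensive request is enough to delay every client queued behind it, which is the essence of the attack.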
Our workshop paper sketched the attack. Our four conference submissions described the attack in more detail and proposed a defense: first C-WCET Partitioning (rejected from CCS’17 and NDSS’18), then First-Class Timeouts (rejected from S&P’18, conditionally and then fully accepted at USENIX Security’18).
A brief timeline
- At EuroSec’17 (a workshop) we described the Event Handler Poisoning attack and sketched some possible defenses. That paper is available here.
- To CCS’17 and NDSS’18 we submitted manuscripts describing a design pattern, C-WCET Partitioning, that developers could use to avoid these attacks, and implemented two examples in Node.js (RE2 regexes, kernel AIO). These manuscripts were rejected with the complaint that our solution would only prevent two specific attacks, and that achieving it in general would incur hefty refactoring and maintenance costs.
- After the NDSS’18 rejection we concluded that our design pattern was ineffective and pursued a new technique suggested by the CCS’17 and NDSS’18 referees. To IEEE S&P’18 (Oakland’18) we submitted a paper describing this technique, First-Class Timeouts. The referees rejected the work, but we believed it was still viable.
- To USENIX Security’18 we submitted a revised version of the paper emphasizing the novelty of the technique and giving a richer description of the system. The USENIX Security referees conditionally accepted this work with a few requests.
For CCS’17 and NDSS’18 my friend Ayaan Kazerouni helped with some analysis of the npm ecosystem in support of our C-WCET Partitioning concept. After that direction dead-ended, Ayaan bowed out to pursue his own projects and requested to be removed from the author list if we changed our defense to the timeout approach suggested by the NDSS referees.
In Fall 2017 I implemented the First-Class Timeouts prototype with help from Eric R. Williamson. He and I rewrote the second half of the manuscript to describe the new defense.
The saga in detail
This paper began life as a project in Dr. Daphne Yao’s Spring 2017 Security course. After completing the course project with my partner Gregor, I wrote it up more formally and it was accepted to EuroSec’17.
In conjunction with the EuroSec’17 workshop, I prepared a conference-length submission to CCS’17. We extended the contributions of the EuroSec’17 paper and implemented defenses against one CPU-bound EHP attack (ReDoS) and one I/O-bound EHP attack (Slow Files) in Node.js.
Our defenses were based on our proposed “C-WCET Partitioning” design principle. In Constant Worst-Case Execution Time (C-WCET) Partitioning, all operations performed by the Event Loop or Worker Pool are partitioned into constant-time pieces, with state preserved across partitions using a technique like baton passing or closures. An EDA-based server partitioned in this manner will serve small requests promptly, while arbitrarily expensive (malicious) requests will regularly (in constant time) defer to the small requests.
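The partitioning idea can be sketched as follows. This is a minimal illustration rather than the code from our prototype: `makePartitionedSum`, `runRoundRobin`, and the `Task` type are hypothetical names, and the round-robin loop is only a stand-in for the Event Loop (in real Node.js each partition would presumably be scheduled with something like `setImmediate`).

```typescript
// One partition of work: does a bounded amount of work and
// returns true once the whole task is finished.
type Partition = () => boolean;

interface Task {
  name: string;
  step: Partition;
}

// Build a partitioned summation. The loop state (`next`, `total`) is
// preserved across partitions in the closure.
function makePartitionedSum(
  values: number[],
  chunkSize: number,
  onDone: (total: number) => void
): Partition {
  let next = 0;
  let total = 0;
  return () => {
    const end = Math.min(next + chunkSize, values.length);
    for (; next < end; next++) {
      total += values[next]; // at most `chunkSize` additions per partition
    }
    if (next >= values.length) {
      onDone(total);
      return true;
    }
    return false;
  };
}

// Stand-in for the Event Loop: run one partition of each pending task per turn.
function runRoundRobin(tasks: Task[]): string[] {
  const finishOrder: string[] = [];
  let pending = tasks.slice();
  while (pending.length > 0) {
    const stillPending: Task[] = [];
    for (const task of pending) {
      if (task.step()) finishOrder.push(task.name);
      else stillPending.push(task);
    }
    pending = stillPending;
  }
  return finishOrder;
}

const results: Record<string, number> = {};
const large: number[] = [];
for (let i = 0; i < 100000; i++) large.push(1);
const small = [1, 2, 3];

const finishOrder = runRoundRobin([
  { name: "large", step: makePartitionedSum(large, 1000, (t) => { results["large"] = t; }) },
  { name: "small", step: makePartitionedSum(small, 1000, (t) => { results["small"] = t; }) },
]);
// finishOrder is ["small", "large"]: the small request completes after the
// large request's first partition instead of waiting for all 100 of them.
```

The cost of this approach is visible even in the toy: the summation had to be rewritten around explicit state, which hints at the refactoring burden for real code.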
- Our ReDoS defense used a hybrid regex engine. When possible we evaluated regexes using Russ Cox’s linear-time regex engine, RE2.
- Our Slow Files defense replaced Node’s synchronous file I/O (done “asynchronously” but blocking on the Worker Pool) with true asynchronous I/O supported by the Linux kernel (aka Kernel AIO or KAIO; read more in Vasily Tarasov’s explanation).
Neither of these defenses was complete. Our ReDoS defense was an O(n)-partitioning for supported regex evaluations, and if the regex was unsupported by RE2 then we fell back to Node’s built-in exponential-time regex engine, Irregexp. And KAIO on Linux only supports regular files, so a read from a slow device file like /dev/random would still be offloaded to the Worker Pool where it would block.
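The engine-selection step of a hybrid regex engine might look something like the sketch below. It is an assumption-laden illustration, not our prototype's logic: `usesUnsupportedFeature` and `chooseEngine` are hypothetical names, and the scan is a rough heuristic that only looks for backreferences and lookaround assertions, two features that RE2 does not support.

```typescript
type EngineChoice = "linear-time" | "backtracking";

// Heuristic scan of a regex source string for features that a linear-time
// engine like RE2 rejects: backreferences (\1, \k<name>) and lookaround.
function usesUnsupportedFeature(source: string): boolean {
  let inClass = false; // true while inside a [...] character class
  for (let i = 0; i < source.length; i++) {
    const ch = source.charAt(i);
    if (ch === "\\") {
      const next = source.charAt(i + 1);
      if (!inClass) {
        if (next >= "1" && next <= "9") return true; // numbered backreference
        if (next === "k" && source.charAt(i + 2) === "<") return true; // named backreference
      }
      i++; // skip the escaped character
    } else if (inClass) {
      if (ch === "]") inClass = false;
    } else if (ch === "[") {
      inClass = true;
    } else if (ch === "(" && source.charAt(i + 1) === "?") {
      const rest = source.slice(i + 2, i + 4);
      if (rest.charAt(0) === "=" || rest.charAt(0) === "!") return true; // lookahead
      if (rest === "<=" || rest === "<!") return true; // lookbehind
    }
  }
  return false;
}

// Prefer the linear-time engine; fall back to the backtracking engine otherwise.
function chooseEngine(source: string): EngineChoice {
  return usesUnsupportedFeature(source) ? "backtracking" : "linear-time";
}

const nestedQuantifier = chooseEngine("(a+)+$");  // "linear-time"
const backreference = chooseEngine("(a)\\1");     // "backtracking"
const lookahead = chooseEngine("foo(?=bar)");     // "backtracking"
```

Any regex routed to the backtracking engine keeps its worst-case behavior, which is why this defense was incomplete.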
We did not change our prototype for the NDSS’18 submission. Instead we spent our time rewriting the manuscript to clarify our findings, in the hope that a clearer presentation of our ideas, with their strengths and weaknesses, would be acceptable.
It was not. Encouragingly, we had one referee rate the manuscript a “strong accept” and argue in favor of the work, but this referee was persuaded by the other referees to reject the manuscript during the PC discussion.
At Oakland, where we submitted the First-Class Timeouts version, the referees criticized the novelty of the work and rejected the paper.
USENIX Security 2018
Similar to what we did between CCS and NDSS, after Oakland we did not touch our prototype but instead focused on effectively communicating our ideas. In a major rhetorical shift, we changed the language from “Timeout Approach” to “First-Class Timeouts” to better emphasize the novelty of our proposal. We didn’t want the referees to think we were just proposing timeouts, since First-Class Timeouts really require re-thinking how EDA-based servers should be written.
Interestingly, as at NDSS, we had one referee rate the manuscript a “strong accept” and argue in favor of the work. This time our champion persuaded the other referees to accept the manuscript during the PC discussion, though the referees attached conditions to our acceptance and assigned us to a shepherd before final acceptance. We rewrote the manuscript to address the concerns of the referees, and the shepherd agreed to accept it to USENIX Security.
- Don’t give up! This paper was submitted to all four of the top security conferences and was rejected from three of them.
- Learn from your referees and mistakes. After unsuccessfully trying a more vigorous defense of our C-WCET Partitioning approach at NDSS, we came to agree with the reviewers’ criticisms. We completely re-implemented the prototype and significantly rewrote the paper. If it ain’t broke, don’t fix it; but if it is broke, admit it and try something else.
- Peer review works. I think each iteration of our paper was an improvement over the previous one.
- Between EuroSec and CCS we added a prototype.
- Between CCS and NDSS we vastly improved the writing.
- Between NDSS and Oakland we designed and implemented a new prototype.
- Between Oakland and USENIX Security we vastly improved the writing.
While repeated rejections were frustrating, it was clear that at each step the referees made good points. Their feedback resulted in a better paper each time.
- If your paper proposes a solution to a problem, the solution must satisfactorily solve the problem. If your solution does not yet do so, keep working on it or prove that it is impossible to do better.
The primary reason the first two submissions failed was that the C-WCET Partitioning Principle does not cleanly solve the EHP problem.
- Achieving C-WCET Partitioning for CPU-bound tasks incurs enormous engineering costs (e.g. re-implementing expensive language APIs or 3rd-party APIs) and carries a significant maintenance burden.
- It is not at all clear how to partition I/O. Relying on KAIO seems to shift the burden to an external service (the OS) but this is robbing Peter to pay Paul.
- Audience matters. The Oakland referees strongly critiqued our Timeout Approach, but the USENIX Security referees enjoyed it. Of course, this was in part because we improved the USENIX Security manuscript based on the feedback from Oakland, but I’ve since learned that it was probably also because the Oakland community places greater emphasis on novelty while the USENIX Security community also values systems/engineering and evaluative work. Ask more experienced folks which venue(s) they think would receive your work best, and check past conference proceedings to find works that look like yours.
- We repeatedly encountered referees who felt that the EuroSec version of the paper competed with the conference version of the paper. I am not sure whether this was an error on our part (putting too much into the workshop) or whether the referees were unfairly penalizing us for sharing a preliminary version of the work in a workshop. Ultimately I don’t think the workshop was the dealbreaker for CCS, NDSS, or Oakland, but it was annoying to keep having to address that criticism in our rebuttals.