Ethical conduct in cybersecurity research

James Davis
Apr 22, 2021

Follow-up note on 19 May 2021: This post was written concurrently with discussions across the cybersecurity research community. Since then: the authors withdrew their paper; the conference chairs described significant changes for the next edition of the conference; the Linux community issued a statement. There is also a related comment from Ted Ts’o (Linux contributor) at the bottom of this blog post.

In April 2021, the Linux developer community issued a blanket ban on contributions from the University of Minnesota. This remarkable outcome occurred as a result of a research project by a team at UMN. The incident has made headlines in a variety of tech media outlets, e.g. Neowin. I’d like to take a deeper look, and discuss the case from the perspective of research ethics and experimental design.

I have carefully read the authors’ paper [1] and their FAQ [2]. I believe that they acted in good faith. In their prior work they have made exemplary efforts to promote cybersecurity in major software projects, for which I thank them. However, in this case, they made an ethical misstep out of a misapprehension of what constitutes ethical conduct and human-subjects research in computing.

This post is intended to clarify these topics. I hope it is helpful to other researchers, and perhaps to some Institutional Review Board (IRB) staff.

What happened?

Researchers at the University of Minnesota conducted a research project titled On the Feasibility of Stealthily Introducing Vulnerabilities in Open-Source Software via Hypocrite Commits.

Open-source software is software whose source code is publicly visible. Typically, this code is maintained by a core community (“The Maintainers”), and contributions are also solicited from the users of the software to make it better suit their needs. One of the premises of open-source software is that “given enough eyeballs, all bugs are shallow” [3]; although software defects happen, the belief is that by working together these defects can be identified and eradicated more quickly. However, this philosophy supposes mostly-good actors. Some versions of the open-source model permit contributions from unknown users! If these users act maliciously, the defective behavior they introduce may be accepted into the codebase and subsequently exploited. Open-source communities use a variety of mechanisms to avoid this unhappy outcome, including human review and automatic tools.

As the title suggests, the authors investigate the feasibility of introducing vulnerabilities through deliberately-incorrect code. They refer to such submissions as “hypocrite commits” — the code contributions say one thing but do another. The authors specifically study such commits in the Linux kernel, a hugely important open-source project used in billions of devices around the world (including Android phones, much of “the Cloud”, and most supercomputers). In their investigation, the authors applied two research methods:

  1. A historical analysis of Linux, examining the process by which previous problematic commits entered the codebase without being caught by human or automatic review. This research method is a specific form of “mining software repositories”, wherein researchers seek to learn by analyzing the history of a software project (a small sketch of this style of analysis appears after this list).
  2. An experiment in which they determine whether the Linux maintainers are capable of detecting three security vulnerabilities submitted by the authors. This research method is a specific form of human-subjects research, in which the researchers learn something from human behavior.
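
To make research method #1 concrete, here is a minimal sketch of the kind of historical analysis one might run, assuming a local clone of the Linux kernel and using the kernel’s “Fixes:” commit-message convention to pair bug-fixing commits with the commits that introduced the bugs they fix. This is an illustration of the general technique, not the authors’ actual methodology; the clone path, date cutoff, and sample size are placeholders.

```python
# A minimal mining-software-repositories sketch (not the authors' methodology).
# Assumes a local clone of the Linux kernel at LINUX_DIR, and relies on the
# kernel's "Fixes: <sha> (...)" commit-message convention to pair bug-fixing
# commits with the commits that introduced the bugs they fix.
import re
import subprocess
from collections import Counter

LINUX_DIR = "/path/to/linux"  # placeholder: path to a local kernel clone


def git(*args):
    """Run a git command inside the kernel clone and return its stdout."""
    result = subprocess.run(["git", "-C", LINUX_DIR, *args],
                            capture_output=True, text=True, check=True)
    return result.stdout


# 1. Collect recent commits whose messages carry a "Fixes:" tag.
#    %x00 and %x01 insert NUL/SOH separators so multi-line bodies split safely.
log = git("log", "--since=2019-01-01", "--pretty=format:%H%x00%B%x01")
fixes_tag = re.compile(r"^Fixes:\s+([0-9a-f]{8,40})\b", re.MULTILINE)

pairs = []  # (bug-fixing commit, bug-introducing commit)
for entry in log.split("\x01"):
    if "\x00" not in entry:
        continue
    fixing_sha, body = entry.split("\x00", 1)
    for buggy_sha in fixes_tag.findall(body):
        pairs.append((fixing_sha.strip(), buggy_sha))


# 2. Roughly characterize the bug-introducing commits, e.g. by the year they landed.
def commit_year(sha):
    return git("show", "-s", "--format=%ci", sha)[:4]  # %ci = ISO-ish commit date


introduced_by_year = Counter()
for _fixing, buggy in pairs[:200]:  # small sample keeps the sketch fast
    try:
        introduced_by_year[commit_year(buggy)] += 1
    except subprocess.CalledProcessError:
        pass  # the abbreviated SHA no longer resolves; skip it

print(f"{len(pairs)} bug-fixing commits carry a Fixes: tag")
print("Bug-introducing commits by year:", dict(introduced_by_year))
```

A real study would go further: validating the pairs by hand, classifying the bugs, and asking how each one slipped past review, since “Fixes:” tags can be missing or wrong.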

This project was published at IEEE Security & Privacy (S&P) 2021, one of the most prestigious research conferences in computing. At the time of writing, the paper is available [1], along with a FAQ from the authors that responds to criticism [2].

When the paper was originally accepted, there was some pushback about ethics from other security researchers. The authors made some modifications to the final version of the paper. You can see their remarks in section 6.A “Ethical considerations” and in section 8.D “Feedback of the Linux Community.”

As a consequence of the researchers’ engagement with the Linux community, in April 2021 Linux chief Greg Kroah-Hartman banned all future contributions from the University of Minnesota [4].

This outcome is shocking. Let’s recap:

  • Human-subjects research in cybersecurity was conducted under the oversight of UMN’s IRB, reviewed favorably by academic peers, and accepted to a (prestigious) conference under the aegis of the sponsoring professional organization, the IEEE.
  • The human subjects involved felt that the researchers had experimented on them unethically, and banned their sponsoring organization from future involvement with their open-source project.

I hope it is not outlandish for me to suggest that there may be some mismatch between what the academic community accepted as ethical conduct, and what the subjects perceived as ethical conduct. Let’s dive into the details.

What is human-subjects research?

Here is the US federal government’s definition of human-subjects research [5]:

Human subject means a living individual about whom an investigator (whether professional or student) conducting research: (i) Obtains information or biospecimens through intervention or interaction with the individual, and uses, studies, or analyzes the information or biospecimens; …

This definition applies to federally-funded research. That includes funding from the National Science Foundation (NSF), which paid for this study [1].

This is an excerpt from a lengthy federal document, with plenty of sub-definitions and clarifications. The document also lists some exemptions, e.g. for matters of public record.

[Figure: a handy flowchart for deciding whether an activity counts as human-subjects research.]
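
For readers who prefer code to flowcharts, here is a rough sketch of the decision logic as I read the definition above. The field names are my own simplification, and the function is for illustration only; an IRB, not a script, makes the actual determination.

```python
# A rough, simplified sketch of the federal definition's decision logic.
# Illustration only: the field names are my own, and an IRB makes the real call.
from dataclasses import dataclass


@dataclass
class Activity:
    is_research: bool                  # a systematic investigation producing generalizable knowledge
    involves_living_individuals: bool  # "a living individual about whom..."
    intervention_or_interaction: bool  # the investigator intervenes or interacts with those individuals
    obtains_information: bool          # ...and obtains information through that interaction
    uses_studies_or_analyzes: bool     # ...and uses, studies, or analyzes that information
    likely_exempt: bool                # e.g., some matters of public record


def looks_like_human_subjects_research(a: Activity) -> bool:
    """Return True if the activity appears to meet the definition sketched above."""
    if not (a.is_research and a.involves_living_individuals):
        return False
    if a.likely_exempt:
        return False  # exemptions still warrant a conversation with the IRB
    return (a.intervention_or_interaction
            and a.obtains_information
            and a.uses_studies_or_analyzes)


# Example: analyzing only public mailing-list archives, with no researcher interaction.
archive_study = Activity(is_research=True, involves_living_individuals=True,
                         intervention_or_interaction=False, obtains_information=True,
                         uses_studies_or_analyzes=True, likely_exempt=True)
print(looks_like_human_subjects_research(archive_study))  # False on this sketch
```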

Was this human-subjects research?

Arguments for “yes”

Let’s map the federal definition of human-subjects research to the authors’ research method #2 from above. The authors:

  • Interacted with living individuals (the Linux maintainers)
  • Obtained information (whether the maintainers would approve a malicious commit)
  • Analyzed that information (discussion of factors that lead to patch acceptance, and, as discussed in section 8.D, the maintainers’ perspectives on this experiment).

I am not a lawyer, but that sure looks like the definition of human-subjects research to me.

Arguments for “no”

  • In their FAQ, the authors write “This is not considered human research. This project studies some issues with the patching process instead of individual behaviors” [2].
  • The UMN IRB agreed with the authors. They (retroactively) approved the research protocol, and according to the authors this approval stated that the work did not involve human subjects [2].*
  • The IEEE S&P review committee agreed with the authors. According to the conference’s call for papers, which includes a section on “Ethical Considerations for Human Subjects Research”, the reviewers reserve the right to reject papers that do not meet ethical standards. Since the paper was accepted even though the experiment was conducted without prior oversight from the institution’s human-subjects review board (the IRB), it appears that the research community* agrees that this was not human-subjects research — or at least, not unethical human-subjects research.

*I hesitate to paint with a broad brush, and I understand that individual community members may feel as I do…but the paper was accepted at the conference.

Settling the argument

We are at an impasse. My naive reading of the federal guidelines says the authors conducted human-subjects research. The research community seems to feel otherwise.

How shall we settle the stalemate? I do not think that decision should be up to the experimenter — let us consult the possible experimental subjects. Do they feel they were experimented on?

In his email message banning the University of Minnesota, Linux chief Greg Kroah-Hartman wrote “Our community does not appreciate being experimented on, and being “tested” by submitting known patches that are either do nothing on purpose, or introduce bugs on purpose. If you wish to do work like this, I suggest you find a different community to run your experiments on, you are not welcome here” [4].

If humans feel they have been experimented on, we should call this “human-subjects research” — despite what the authors, UMN’s IRB, and the research community say.

Did the Linux developers overreact?

The Linux developer community responded to this experiment by banning all contributions from the organization that sponsored the research: the University of Minnesota. This affects both the researchers who conducted this study and all other UMN researchers, students, and staff members. Is this an overreaction? No.

The researchers did not act alone. They obtained approval from their university’s human-subjects ethics oversight board, the IRB. The approval was retroactive, but it was approval! From the perspective of the unwitting experimental subjects, UMN can no longer be trusted to provide ethical oversight of research that involves the Linux developer community. It is the university that failed to provide responsible oversight.

But by the same token, the approval of the research community is problematic. IEEE S&P has granted the work its imprimatur; thus, the leaders of the cybersecurity research community agree that the authors behaved appropriately. How do you think the Linux maintainers feel about that? And Linux is a leader in the open-source movement — how might other open-source communities react?

Humans within sociotechnical systems

How can it be that these various parties had such misaligned perspectives? I suggest that the academic community failed to consider the notion of a sociotechnical system. Let me illustrate:

[Figure: the experiment conducted by the authors, with the Linux review process depicted as a sociotechnical system.]

On the academic interpretation, the research was apparently conducted on a purely technical entity: the “review process”. If the entity being studied is technical, not human, then the work is not human-subjects research.

But this architectural view is only a partial picture, an incomplete model of the actual system. It ignores the role of the Linux maintainers — living individuals — who carry out the review process. Because these humans perform the process being studied, they are indeed (indirect) subjects of the experiment.

The role of humans was made explicit by the authors themselves:

In the experiment, we aim to demonstrate the practicality of stealthily introducing vulnerabilities through hypocrite commits. Our goal is not to introduce vulnerabilities to harm OSS. Therefore, we safely conduct the experiment to make sure that the introduced UAF bugs will not be merged into the actual Linux code. In addition to the minor patches that introduce UAF conditions, we also prepare the correct patches for fixing the minor issues. We send the minor patches to the Linux community through email to seek their feedback. Fortunately, there is a time window between the confirmation of a patch and the merging of the patch. Once a maintainer confirmed our patches, e.g., an email reply indicating “looks good”, we immediately notify the maintainers of the introduced UAF and request them to not go ahead to apply the patch.

Although the research is focused on the review process, human subjects are involved in every step: the authors email patches that introduce use-after-free (“UAF”) conditions to humans, the humans review the patches and reply, and the authors then tell the humans not to proceed. The review process is a sociotechnical system. There is a human inside the box. We cannot pretend otherwise.

Designing an ethical version of this study

I believe that the researchers conducted human-subjects research, and will proceed under this supposition. Did their research protocol honor the human subjects?

To decide this, let us examine the ethical standard to which human-subjects researchers are held: that of their institution’s Institutional Review Board, or IRB. An IRB is charged with ensuring that human-subjects research is conducted ethically. Among other things, it decides whether the benefits of the experiment outweigh its risks, and it is supposed to take the perspective of the human subjects into consideration.

The current study design is flawed

I suggest that the experiment was low-reward and high-risk.

  1. Low-reward. Let’s recall that the authors began with a historical analysis of problematic commits in Linux. The authors concluded this analysis by noting that many “potentially-hypocrite commits” had already entered the Linux project. It is not clear what additional knowledge the research community would gain from crafting and submitting new malicious patches.
  2. High-risk. First, the protocol involved deceiving human subjects. Deception is an unusual element of a research protocol, and it should be scrutinized for the way it treats the subjects. Second, the human subjects were non-consenting. Their time is valuable; from their perspective, they were volunteered to waste it on the researchers’ game. Third, the protocol could have resulted in security vulnerabilities reaching the Linux core and spreading to billions of devices. Linux is part of the critical infrastructure for systems across the globe.

The authors attempted to control for the third risk by “immediately notifying” the maintainers after their malicious patches were approved. There are several ways this protocol might have failed, including:

  • The maintainer might have lost or discarded their follow-up email. Emails are lost and ignored all the time.
  • The authors sent their patches to the general Linux mailing list. Even if the patches were never merged into mainline Linux, Linux uses a distributed development model, so any community member could have incorporated the patches into their own versions. [Credit: Santiago Torres-Arias pointed this out to me.]
  • The authors themselves might have been incapacitated after the patch was approved. Given the timing, the work was presumably conducted during the COVID-19 pandemic. It’s not a great stretch of the imagination to see the whole research team laid low by COVID-19, just in time for the malicious patch to be merged, published, and exploited.

Improved study designs

So, how might we modify this study to obtain interesting findings without the ethical issues? Based on the sociotechnical system depicted above, here are a few ways that a similar experiment might be conducted more ethically (pending IRB approval, anyway):

  • Change the patch. Submit non-critical commits, e.g. non-functional problems like typos instead of security vulnerabilities. See if these commits are accepted. This still involves deceit and non-consent, but removes the risk of damaging a critical global system.
  • Inform some of the participants: “CTO-approved pentesting”. Obtain approval from the Linux chiefs (e.g. Greg K-H), who can debrief the experimented-on maintainers afterward. This still incorporates elements of deceit and non-consent, but it obtains organizational buy-in and substantially decreases the risk of merging malicious commits into the Linux codebase.
  • Inform the participants. Involve the Linux maintainer community throughout the experiment. Everyone consents to the experiment, and there is limited risk of malicious commits reaching the Linux codebase.
  • Simulate: Ask the Linux maintainers to separately review a bunch of commits “in the style of their Linux review process”, with most commits benign and a few malicious. Again, everyone consents, and this time there is no risk of damaging the Linux codebase. (A sketch of how such a simulation might be scored appears after this list.)
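
To make the “simulate” option concrete, here is a minimal sketch of how such a study might be instrumented and scored: consenting maintainers review a randomized mix of benign and seeded-flaw patches offline, and we measure how often the seeded flaws are caught. The patch identifiers and verdicts below are hypothetical placeholders.

```python
# A minimal sketch of the "simulate" design. Consenting reviewers see a shuffled
# mix of benign and deliberately flawed patches; we then score detection rates.
# Patch contents and reviewer verdicts here are hypothetical placeholders.
import random
from dataclasses import dataclass


@dataclass
class Patch:
    patch_id: str
    seeded_flaw: bool  # True if the research team deliberately seeded a bug


def build_review_queue(benign, seeded, rng):
    """Shuffle benign and seeded patches so reviewers cannot guess which is which."""
    queue = benign + seeded
    rng.shuffle(queue)
    return queue


def score(verdicts, queue):
    """verdicts[patch_id] is True if the reviewer rejected (i.e., caught) that patch."""
    seeded = [p for p in queue if p.seeded_flaw]
    benign = [p for p in queue if not p.seeded_flaw]
    detection_rate = sum(verdicts[p.patch_id] for p in seeded) / len(seeded)
    false_alarm_rate = sum(verdicts[p.patch_id] for p in benign) / len(benign)
    return detection_rate, false_alarm_rate


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the patch assignment is reproducible
    benign = [Patch(f"benign-{i}", False) for i in range(20)]
    seeded = [Patch(f"seeded-{i}", True) for i in range(3)]
    queue = build_review_queue(benign, seeded, rng)

    # Placeholder verdicts; a real study would collect these from consenting reviewers.
    verdicts = {p.patch_id: rng.random() < (0.7 if p.seeded_flaw else 0.1) for p in queue}

    detection_rate, false_alarm_rate = score(verdicts, queue)
    print(f"Seeded-flaw detection rate: {detection_rate:.2f}")
    print(f"False-alarm rate on benign patches: {false_alarm_rate:.2f}")
```

Deciding what counts as a realistic seeded flaw, and keeping the review conditions close to the real mailing-list workflow, is where the realism-ethics trade-off discussed below comes in.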

Each of these changes would decrease the realism of the experiment, and might decrease the generalizability of the results. For example, participants may change their behavior if they know they are being observed (the Hawthorne effect). But there is a realism-ethics trade-off, and researchers need to stay on the “ethical” side of that trade-off.

Beyond this case study

Let’s apply sociotechnical reasoning to other cybersecurity experiments. Here are some examples with my perspective:

  • Finding vulnerabilities in source code or binaries: These studies examine technical artifacts. They need not involve a social component. However, these studies sometimes include discussion with developers. If the researchers report on these interactions, then IRB approval may be necessary. The cybersecurity and systems research communities generally do not seek IRB approval for this class of low-grade interactions. Often these interactions occur publicly, in the spirit of the open-source community. Although the data are public, the humans involved are responding to the researchers’ stimulus. I am not certain whether the research community’s practice is consistent with the IRB’s aims here.
  • Mining software repositories or public discussions: These studies examine human-generated artifacts (e.g. code, comments) and human data (e.g. posts on Stack Overflow or Twitter). The data are publicly accessible, so the research is likely exempt. The authors might consult the IRB to ensure their analysis plan is acceptable.
  • Probing sandboxed systems: These studies set up a software system under the control of the researchers, in a “research sandbox”. Only the research team interacts with the system. No human subjects are involved; I suggest no IRB oversight is needed.
  • Probing systems in the wild: These studies probe a live system operated by some external entity, e.g. a REST API hosted by a company. Live systems are sociotechnical systems. If the researchers’ investigation is “read-only” and conducted at a limited scale, this smacks of a purely technical study (a sketch of such a probe appears after this list). However, if the researchers’ experiment involves either (a) “writing to” — interacting with — the live system, or (b) an intensive workload, e.g. attempting to crawl the entire Internet, then the social side of the system may be called into play. Perhaps an on-call beeper goes off, or perhaps a legitimate user cannot access a service because it is being probed intensively by the research team. I do not know how an IRB might treat this case, but I suggest it be consulted. Ethical norms within the research community should also govern your behavior here.
  • Observing malicious behavior: These studies may deploy a deliberately insecure honeypot and then observe how it is exploited. The exploit may be automated (a technical artifact) or manual (human behavior). The researcher cannot know in advance whether the data derive from a human subject, so they should consult their institution’s IRB.
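
As promised in the “probing systems in the wild” item above, here is a minimal sketch of what a “read-only”, limited-scale probe might look like: a hard request budget, a conservative delay between requests, and a User-Agent that identifies the research team. The endpoint, budget, and contact address are placeholders, and a real study should also respect the operator’s terms of service and any published crawling policy.

```python
# A minimal sketch of a read-only, rate-limited probe of a live endpoint.
# The URL, request budget, and contact address are hypothetical placeholders.
import time
import urllib.request

TARGET = "https://api.example.com/health"  # placeholder endpoint
MAX_REQUESTS = 20                          # hard cap on the total number of probes
DELAY_SECONDS = 5.0                        # stay far below normal traffic rates


def probe_once(url):
    """Issue a single GET request (read-only) and return the HTTP status code."""
    req = urllib.request.Request(url, headers={
        # Identify the research team so operators can opt out or get in touch.
        "User-Agent": "example-research-probe/0.1 (contact: researcher@example.edu)",
    })
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    for i in range(MAX_REQUESTS):
        try:
            print(f"probe {i}: HTTP {probe_once(TARGET)}")
        except Exception as exc:  # any error ends the run rather than hammering the service
            print(f"probe {i}: {exc}; stopping early")
            break
        time.sleep(DELAY_SECONDS)  # rate limit between probes
```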

Conclusions

As a member of the research community, I find this outcome troubling. I also think it is the responsibility of concerned community members to speak up when they see unethical behavior. Here is my voice.

As a concerned community member, I recommend that the following steps be taken:

  • The authors should retract the paper. Although they acted in good faith, their study was unethical. A retraction is the right choice to honor the integrity of the research process.
  • IEEE S&P should remove the paper from the proceedings — not as a result of the retraction, but as a separate step.
  • IEEE, the conference sponsor, should weigh issuing a statement about proper research conduct [6].
  • The University of Minnesota should repair relations with the Linux maintainer community. They have already acknowledged an internal investigation, which is a promising first step.
  • The academic cybersecurity community should clarify its standards for human-subjects research and for ethical research. These standards should be drafted promptly — hopefully in time for discussion at S&P’21 in May! — and be the subject of a keynote at each of the upcoming “Big Four” meetings of the community. Going forward, these standards should be clearly listed on each of the “Big Four” cybersecurity conference websites as part of the call for papers. Clearly we have failed to communicate these standards to at least one research group. Let’s not wait for more mistakes.

References

[1] On the Feasibility of Stealthily Introducing Vulnerabilities in Open-Source Software via Hypocrite Commits. Wu & Lu, IEEE S&P’21.

[2] Clarifications on the “Hypocrite Commit” work. Wu & Lu, 2021.

[3] The Cathedral and the Bazaar. Raymond. 1997.

[4] Re: [PATCH] SUNRPC: Add a check for gss_release_msg. Linux kernel mailing list. Kroah-Hartman. 21 April 2021.

[5] Code of Federal Regulations, TITLE 45 PUBLIC WELFARE, DEPARTMENT OF HEALTH AND HUMAN SERVICES, PART 46 PROTECTION OF HUMAN SUBJECTS, §46.102. U.S. Department of Health & Human Services. 2019.

[6] IEEE Code of Ethics. IEEE Policies, Section 7 — Professional Activities. Institute of Electrical and Electronics Engineers. 2014.

James Davis

I am a professor in ECE@Purdue. My research assistants and I blog here about research findings and engineering tips.