Reflections on Software Failure Analysis

James Davis
6 min read · Jan 10, 2023


This is a brief for the research paper "Reflecting on Software Failure Analysis", published at the ACM 2022 European Software Engineering Conference / Foundations of Software Engineering conference, in the Ideas, Visions, and Reflections Track (ESEC/FSE'22-IVR). The work was led by Paschal Amusuo, who wrote this brief; I have lightly edited it. The full paper is available here.

Background

What are Software Failure Studies?

In software engineering research, some researchers study the characteristics of defects found in specific software systems or classes of software. We refer to the resulting papers as "failure study papers" and the researchers who conduct them as "investigators."

The results of these works provide many benefits to the software engineering discipline. For example, knowledge of the dominant types of defects in a specific software system, and the factors that lead to these defects, can help software engineers detect or avoid introducing these defects when reviewing or writing software. In addition, knowledge of these defects’ characteristics can help other software engineering researchers develop tools and techniques for effectively detecting these defects in a software code base. As a result, studying defect characteristics can reduce the occurrence of defects in the wild.

Process model for (software) failure study.

Conducting a software failure study usually begins with defining the project's scope. The project scope includes the software class to be studied (e.g., bugs in deep learning frameworks), the representative software (e.g., PyTorch, TensorFlow), the bug sources (e.g., GitHub issues and bug-fixing PRs), and the research questions (e.g., "What are the root causes of bugs?" and "What fix strategies are used?").

Once the scope is defined, the defect reports are collected, and data relevant to the research questions are extracted from them. The bug data are then analyzed to obtain results. Finally, based on the results, recommendations are made for both research and industry.
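As a rough illustration (not part of the paper), this process model can be pictured as a small pipeline skeleton; every name and field below is invented for the sketch:

```python
from dataclasses import dataclass

@dataclass
class StudyScope:
    """Scope of a hypothetical failure study; all field names are illustrative."""
    software_class: str            # e.g., "deep learning frameworks"
    representative_software: list  # e.g., ["PyTorch", "TensorFlow"]
    bug_sources: list              # e.g., ["GitHub issues", "bug-fixing PRs"]
    research_questions: list       # e.g., ["Root causes of bugs?", "Fix strategies?"]

def run_failure_study(scope, collect, extract, analyze, recommend):
    """Chain the four stages of the process model; each stage is supplied as a callable."""
    reports = collect(scope)                                        # 1. collect defect reports
    data = [extract(r, scope.research_questions) for r in reports]  # 2. extract relevant data
    results = analyze(data)                                         # 3. analyze the bug data
    return recommend(results)                                       # 4. recommendations for research and industry
```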

What don’t we know?

To the best of our knowledge, no critical analysis had previously been performed on software failure analysis studies. In this project, we critique the conduct of software failure analysis research over the last 20 years. Through a systematic literature review, we identified several flaws and challenges that affect this research direction. Based on these flaws and challenges, we discuss future research directions the software engineering community can pursue to support the conduct of failure studies. These directions focus on answering questions relevant to the efficient conduct and the impact of failure studies.

Our Approach

Our literature review process

As a data source, we sampled published failure studies. We searched three major scholarly databases (Google Scholar, the ACM Digital Library, and IEEE Xplore) using the search phrase "(empirical OR comprehensive OR taxonomy OR characteristics) AND (bug OR bugs OR faults OR defects OR failures OR vulnerabilities) AND (study OR review)." To ensure our review also included recent papers from top venues, we manually searched the proceedings of prominent software engineering conferences (ICSE, ESEC/FSE, and ASE) and journals (IEEE TSE, ESEM, JSS). These two approaches returned 92 research papers.

Next, we filtered the list of returned papers using our inclusion and exclusion criteria and got a total of 52 papers, forming our primary dataset.

We reviewed each paper (two readers per paper) and extracted data relevant to five dimensions: study scope, methodology, research questions, quality measures employed, and replicability. We analyzed the extracted data and identified several flaws that were prevalent in the conduct of these studies. These flaws form the main content of our paper.
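For concreteness, the record we kept per reviewed paper can be pictured roughly as follows; the field names are ours for this sketch, not the exact schema used in the paper:

```python
from dataclasses import dataclass

@dataclass
class PaperReview:
    """One reviewed failure study, organized along the five extraction dimensions."""
    title: str
    scope: str                 # studied software class and representative systems
    methodology: str           # how defects were collected and analyzed
    research_questions: list   # the questions the paper poses
    quality_measures: list     # e.g., inter-rater agreement checks
    replicable: bool           # are artifacts/data shared for replication?
    readers: tuple = ()        # the two readers who independently reviewed the paper
```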

Flaws in Failure Study Methods

The figure below shows the various flaws we identified during our review. In this summary, we discuss three flaws. The rest can be found in our paper.

Observed flaws in typical papers that follow the (software) failure study model described above.

1. Bias towards open-source software

We observed a strong bias among software engineering researchers toward studying open-source systems. Of the 52 papers we reviewed, only three studied proprietary software; the other 49 studied defects in open-source software. We understand that open-source systems are a natural preference because of their publicly available code, documentation, defect reports, and evolution history. However, prior research has already shown differences between open-source and closed-source systems. Hence, would the results of these papers transfer to proprietary software?

(a) Bias towards open source software. (b) Bias against re-use.

2. Inconsistent defect taxonomies

Many failure study papers employ a taxonomy to analyze the defects they study. These taxonomies help researchers characterize the root causes of the defects, their manifestation, or some other characteristic. In our study, we noticed inconsistencies in the taxonomies used by similar papers. Only 10 of the 52 papers we studied reported reusing an existing taxonomy; the rest invented their own. While developing a taxonomy is a valuable research activity (we've done it ourselves!), we observed this behavior even when existing taxonomies were available for reuse. For example, Cao et al. characterized performance bugs in deep learning systems using a self-generated taxonomy, although they could have adapted taxonomies from prior research on performance defects. This trend makes it difficult to compare results across papers, hampering the development of general failure knowledge across software systems.

3. Non-integration of Practicing Software Engineers in Study

Our study shows that practicing software engineers are not involved in the research teams that conduct failure studies. Incorporating the perspectives of the software engineers who create and fix these defects can help provide deeper insights into the causes and characteristics of these defects.

In addition, we observed that while failure studies make research recommendations to extend the state of the art, these recommendations are not always relevant to current software engineering practice. Only 27% of the reviewed papers proposed recommendations that could be applied in software engineering practice. This contrasts with other engineering disciplines, where failure studies result in changes to professional practice that can prevent future occurrences of similar failures.

Proposed Research Agenda

In this section, we discuss various research efforts and directions that can help the conduct of failure studies.

1. Defect Causal Chains

To correctly identify the root causes of defects, we suggest that investigators also use additional sources that provide more information about the defects. We believe that documents such as pull request comments, meeting logs, and design documents may be helpful. Unfortunately, obtaining and analyzing these documents is still challenging, which presents research opportunities.
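As a hedged sketch of what gathering one such source could look like, the snippet below pulls the review comments on a pull request via the GitHub REST API; the repository and PR number are placeholders, and this is an illustration rather than tooling from the paper:

```python
import requests

# Placeholders: substitute a real repository and pull request number.
OWNER, REPO, PR_NUMBER = "example-org", "example-repo", 1234
url = f"https://api.github.com/repos/{OWNER}/{REPO}/pulls/{PR_NUMBER}/comments"

# Fetch the review comments on the pull request (an API token would raise the rate limit).
response = requests.get(url, headers={"Accept": "application/vnd.github+json"})
response.raise_for_status()

for comment in response.json():
    # Each review comment records who said what, and where in the diff it was made.
    print(comment["user"]["login"], "on", comment.get("path"), "->", comment["body"][:80])
```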

2. Standardizing the conduct of failure studies

To reduce the inconsistencies we discovered in the conduct of failure studies, we propose two ways to standardize failure studies. First, we suggest the addition of a standard for failure studies in the SIGSOFT empirical standards. Second, we recommend developing a defect-type taxonomy map for software defects, similar to the CWE used for classifying security vulnerabilities.
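To make the second suggestion concrete, a shared defect-type map could work roughly like the sketch below, where study-specific labels are normalized to entries in a community-maintained taxonomy; every identifier and label here is invented for illustration:

```python
# Hypothetical shared taxonomy, analogous in spirit to how CWE identifiers
# normalize vulnerability categories across different sources.
SHARED_TAXONOMY = {
    "SFT-001": "Incorrect algorithm/logic",
    "SFT-002": "API misuse",
    "SFT-003": "Concurrency defect",
    "SFT-004": "Performance defect",
}

# Each study contributes a mapping from its own labels to shared identifiers.
STUDY_A_MAP = {"wrong computation": "SFT-001", "slow training loop": "SFT-004"}
STUDY_B_MAP = {"API contract violation": "SFT-002", "race condition": "SFT-003"}

def normalize(label, study_map):
    """Translate a study-specific defect label into the shared taxonomy entry."""
    shared_id = study_map.get(label)
    return shared_id, SHARED_TAXONOMY.get(shared_id, "Unmapped")

print(normalize("slow training loop", STUDY_A_MAP))  # ('SFT-004', 'Performance defect')
```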

3. Increasing impact on engineering practice

Given the bias toward open-source software that we reported, we suggest increased collaboration between software failure study investigators and software engineering companies. This would give investigators access to defect reports in closed-source software and help ensure that the results of these studies are also relevant to industry practitioners. We also recommend increased research emphasis on replication studies to verify whether the results of failure studies conducted on open-source software also hold for proprietary software.

4. Tool support for defect analysis

Given our observation that investigators conduct failure studies manually, we suggest several directions that could help automate the conduct of these studies. NLP techniques have been used successfully to identify defects in requirements documents, detect duplicate defect reports, and extract tasks and user stories from app store reviews. Hence, investigators can explore using NLP to identify target defect reports, characterize the reports, or extract relevant information about the defects. This could significantly reduce the time and resources spent conducting failure studies.
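As one hedged example of that direction, the sketch below uses TF-IDF similarity (via scikit-learn, over made-up report snippets) to flag likely duplicate defect reports; it illustrates the idea and is not tooling from the paper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy defect reports; in a real study these would come from issue trackers.
reports = [
    "Training crashes with CUDA out-of-memory error on large batch size",
    "Out of memory crash during training when batch size is increased",
    "Model accuracy drops after upgrading the optimizer implementation",
]

# Represent each report as a TF-IDF vector and compare all pairs.
vectors = TfidfVectorizer(stop_words="english").fit_transform(reports)
similarity = cosine_similarity(vectors)

# Flag report pairs whose textual similarity exceeds a hand-picked threshold.
THRESHOLD = 0.4
for i in range(len(reports)):
    for j in range(i + 1, len(reports)):
        if similarity[i, j] > THRESHOLD:
            print(f"Possible duplicates: report {i} and report {j} "
                  f"(similarity {similarity[i, j]:.2f})")
```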

Conclusions

In this project, we reflected on the conduct of failure studies in software engineering by surveying 52 published failure study papers. We identified eight recurring flaws that have marred the conduct of failure studies. These flaws impede the correctness, reliability, and impact of the reported results of these studies.

Motivated by these challenges, we identify various ways the research community can support the conduct of failure studies. We encourage further research on identifying and analyzing causal chains for defects and on tool support to simplify defect analysis, and we recommend efforts to standardize the conduct of failure studies. With these steps, software failure studies may improve software engineering quality.

The full paper is available here.

Written by James Davis

I am a professor in ECE@Purdue. My research assistants and I blog here about research findings and engineering tips.
