FAIL: Analyzing Software Failures from the News Using LLMs

James Davis
9 min read · Nov 9, 2024


This is a brief for the research paper “FAIL: Analyzing Software Failures from the News Using LLMs”, published at the 2024 IEEE/ACM International Conference on Automated Software Engineering (ASE). This work was led by Dharun Anandayuvaraj. The full paper is available here (preprint here). Dharun Anandayuvaraj wrote this brief, which I have lightly edited.

I’m personally very pleased with this work, because it integrated several years’ worth of Dharun’s thinking about analyzing software failures. He shared earlier versions of these ideas at ASE-NIER’22, SERP4IoT’23, and SCORED’23. Anyway, let’s take a look!

Background

Software failures are no fun — we all know the feeling of trying to access a website and seeing it crash or otherwise misbehave. While YouTube being down isn’t exactly a crisis, software failures can be pretty problematic when software is integrated into safety-critical and safety-sensitive systems.

One rather unfortunate example is Dexcom’s glucose monitoring system. This system is supposed to monitor blood sugar levels and alert users if their blood sugar reaches a critical level. The Dexcom system processes glucose readings in the Cloud, and then passes the results back to the customer in the form of data and warnings about unhealthy blood sugar levels. In 2019, Dexcom’s cloud system had an outage. During the outage, the people (both adults and kids) who relied on this service were not notified through the app, giving them a false sense of safety even as their blood sugar levels became critical. The lack of notification reflects a design failure: Dexcom failed to incorporate a failsafe — a backup communication mechanism — for this scenario.

Design failure in Dexcom’s IoT diabetes monitoring services. There was no backup communication path and no clear indication that the primary path had failed, leaving customers unaware of the lack of up-to-date information about their blood sugar levels.
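To make the missing failsafe concrete, here is a minimal sketch of the kind of client-side staleness watchdog that could fall back to a local alert when cloud readings stop arriving. This is purely illustrative; it is not Dexcom’s code, and the class, thresholds, and `local_alert` method are assumptions for the sake of the example.

```python
from datetime import datetime, timedelta

# Illustrative threshold: if no fresh reading arrives within this window,
# treat the cloud path as failed rather than assuming all is well.
MAX_READING_AGE = timedelta(minutes=15)

class GlucoseMonitorClient:
    """Hypothetical client-side watchdog for a cloud-backed glucose monitor."""

    def __init__(self) -> None:
        self.last_reading_time: datetime | None = None

    def on_cloud_reading(self, value_mg_dl: float, timestamp: datetime) -> None:
        # Normal path: the cloud service delivered a fresh reading.
        self.last_reading_time = timestamp
        if value_mg_dl < 55 or value_mg_dl > 250:  # example critical thresholds
            self.local_alert(f"Critical glucose level: {value_mg_dl} mg/dL")

    def check_staleness(self, now: datetime) -> None:
        # Failsafe path: no data means "unknown", not "fine".
        if self.last_reading_time is None or now - self.last_reading_time > MAX_READING_AGE:
            self.local_alert(
                "No recent glucose data; readings may be unavailable. "
                "Check your sensor or use a fingerstick meter."
            )

    def local_alert(self, message: str) -> None:
        # On a real device this would sound an alarm on the phone itself,
        # independent of the cloud service.
        print(f"[ALERT] {message}")
```

The point is the second path: the device itself, not the cloud, decides when silence has gone on too long.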

Now listen, I get it. Engineering is hard. I think making an error from time to time is unavoidable. But here is how I define good engineering: the same errors don’t occur again and again. So the next string of headlines gives me some concern.

Similar incidents (glucose monitor outages and data loss) occurred across a 5-year span at two different companies.

Failure-Aware Software Engineering

So naturally we wonder: how can we help prevent such recurring failures? In engineering, a fundamental principle is to analyze failures and act to mitigate them in the future. This principle has been successful in many engineering disciplines, and has contributed to low failure rates in medical devices, aircraft, railways, and so on.

Lessons learnt from failures should be used during the engineering life cycle.

However, in software engineering we don’t do this very consistently. Specifically, engineers don’t often share failure stories, nor do they learn from them across organizations and industries. There are bug trackers and JIRA tickets, to be sure, but many surveys and interviews suggest that there is little culture of formally reflecting on the lessons learned and acting to apply them in the future.

To help address this gap, researchers often study software problems from the news. Although organizations aren’t always willing to publicly disclose their own failures, news agencies often report on public-facing engineering failures. These reports may not contain detailed failure information. But they often contain information related to system and design level causes, impacts, and lessons learned from an incident. Such information could be used by organizations, government bodies, and academics to formulate best practices, draft regulations, or discover research directions.

However, the prior works that have studied such failure information rely on costly expert manual analysis. For example, the only prior large-scale study related to software failures relied on manual analysis to study the consequences of ~4,000 software problems from the news. The authors reported that this took 1,000 person-hours over 2 months. We ourselves undertook a much smaller-scale study (ASE-NIER’22), and dear reader, it was interesting but, hmm, rather time-consuming.

The FAIL System

Concept

It’s 2024, and the smell of Large Language Models (LLMs) is in the air. We wanted to reduce the costs and improve the scalability of these manual analyses. So we designed and implemented the Failure Analysis Investigation with LLMs (FAIL).

FAIL is a pipeline that employs Large Language Models (LLMs) to help automate the task of collecting and analyzing software failures reported in the news. Using this pipeline we created a database of failure reports for ~2,500 software failures from 2010 to 2022 at a cost of ~$50 per year of data (using a commercial LLM service).

The FAIL concept: Instead of manual analysis, can we figure out how to read through all the news articles associated with a (software) engineering failure and learn something useful? In this figure, our FAIL tool successfully finds and links together all of the pictured articles. The yellow highlighting shows distinct failure knowledge that we can integrate into a richer postmortem.

Design

The next figure gives an overview of our design.

The design of the FAIL system — how we broke the problem down into parts.
  1. In the first step, FAIL uses Google News to search for software failures covered by popular news sources. FAIL uses a specific set of keywords related to software failures drawn from a prior work. FAIL found and scraped ~120,000 articles from 2010 to 2022.
  2. Even though FAIL uses a specific set of queries and keywords, many of the retrieved articles didn’t actually report on software failures. So FAIL uses an LLM to classify whether each article actually reports on a software failure. When compared to manual analysis, FAIL performed this step with an F1 score of 90%. FAIL found that ~6,500 articles actually reported on software failures. (Steps 2–4 are sketched in code after the figure below.)
  3. To ensure that the articles contain enough information for failure analysis, FAIL prompts an LLM to classify whether each article contains enough postmortem information. When compared to manual analysis, FAIL performed this step with an F1 score of 91%. FAIL found that ~4,000 articles contained enough information to conduct failure analysis.
  4. Once FAIL has collected articles that report on software failures, there may be multiple articles reporting on the same incident. For example, an incident like the Boeing 737 crashes may have many articles that need to be indexed under one incident. So first, FAIL uses an LLM to summarize the failure reported by each article. Then FAIL uses a sentence transformer to convert each summary into an embedding, which it uses to calculate the similarity between the failures reported by the articles. For highly similar summaries, FAIL prompts an LLM to classify whether they actually report on the same incident. This enables FAIL to group articles that report on the same incident. When compared to manual analysis, FAIL performed this step with a V-measure of 0.98 out of 1, which means that FAIL indexed articles into incidents very well. FAIL merged the ~4,000 articles into ~2,500 incidents.
  5. Then FAIL creates a failure report for each incident. FAIL does this by extracting failure information described by several prior works on failure analysis: traditional postmortem information, which is open-ended, plus information to taxonomize the faults of the incident and additional details about the incident, which are multiple-choice. To extract this information, we iteratively developed prompts for each attribute of the failure report. To create a failure report for an incident, FAIL passes the relevant context and queries each prompt. The failure reports are then stored in a database. When compared to manual analysis, FAIL extracted 89% of the information that manual analysts extracted. However, FAIL did hallucinate slightly: on average it introduced irrelevant facts 6% of the time.
Illustration of FAIL creating a failure report. Included is a table of all of the failure information extracted for each incident.
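To make the pipeline more concrete, here is a minimal sketch of steps 2–4, assuming an OpenAI-style chat API and the sentence-transformers library. The prompts, model names, and the 0.8 similarity threshold are placeholders, not the exact values used in the paper.

```python
# Minimal sketch of FAIL's classify -> summarize -> embed -> merge steps.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def ask_llm(prompt: str, text: str) -> str:
    """Send one prompt plus article text to the LLM and return its answer."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": f"{prompt}\n\nArticle:\n{text}"}],
    )
    return resp.choices[0].message.content.strip()

def reports_software_failure(article: str) -> bool:
    # Step 2: keep only articles that actually describe a software failure.
    return ask_llm("Does this article report a software failure? Answer yes or no.",
                   article).lower().startswith("yes")

def has_postmortem_info(article: str) -> bool:
    # Step 3: keep only articles with enough detail for failure analysis.
    return ask_llm("Does this article contain enough postmortem information "
                   "(causes, impacts, lessons)? Answer yes or no.",
                   article).lower().startswith("yes")

def merge_into_incidents(articles: list[str], threshold: float = 0.8) -> list[list[int]]:
    # Step 4: summarize each article, embed the summaries, and group articles
    # whose summaries are highly similar AND which an LLM confirms describe
    # the same incident. Returns groups of article indices.
    summaries = [ask_llm("Summarize the software failure in 2-3 sentences.", a)
                 for a in articles]
    embeddings = embedder.encode(summaries, convert_to_tensor=True)
    incidents: list[list[int]] = []
    for i in range(len(articles)):
        placed = False
        for group in incidents:
            rep = group[0]  # compare against the group's first article
            if util.cos_sim(embeddings[i], embeddings[rep]).item() > threshold:
                same = ask_llm("Do these two summaries describe the same incident? "
                               "Answer yes or no.",
                               summaries[rep] + "\n---\n" + summaries[i])
                if same.lower().startswith("yes"):
                    group.append(i)
                    placed = True
                    break
        if not placed:
            incidents.append([i])
    return incidents
```

Step 5 (building the full failure report) follows the same pattern, with one prompt per attribute of the report.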

Evaluation of the Resulting Database

Using the failure reports in the database, we conducted a large-scale systematic study of recent software failures from the news, answering research questions similar to those of prior works that manually studied software failures. We investigated the following research questions:

  1. What are the characteristics of the causes of recent software failures?
  2. What are the characteristics of the impacts of recent software failures?
  3. How do the causes affect the impacts of recent software failures?
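To give a flavor of how such questions can be answered from the resulting database, here is a small sketch assuming the failure reports have been exported to a SQLite table. The table name `failure_reports` and the columns `phase_introduced` and `consequence` are illustrative assumptions, not the paper’s actual schema.

```python
import sqlite3

# Hypothetical schema: one row per incident, with categorical attributes
# extracted by FAIL (names here are illustrative, not the paper's).
conn = sqlite3.connect("fail_reports.db")

# RQ1: in which phase were failure-causing factors introduced (e.g., design, operation)?
for phase, count in conn.execute(
        "SELECT phase_introduced, COUNT(*) FROM failure_reports "
        "GROUP BY phase_introduced ORDER BY COUNT(*) DESC"):
    print(f"{phase}: {count}")

# RQ2: distribution of consequences (e.g., inconvenience, harm to goods/money/data).
for consequence, count in conn.execute(
        "SELECT consequence, COUNT(*) FROM failure_reports "
        "GROUP BY consequence ORDER BY COUNT(*) DESC"):
    print(f"{consequence}: {count}")

# RQ3: cross-tabulate causes against impacts.
for phase, consequence, count in conn.execute(
        "SELECT phase_introduced, consequence, COUNT(*) FROM failure_reports "
        "GROUP BY phase_introduced, consequence"):
    print(f"{phase} -> {consequence}: {count}")

conn.close()
```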

We present a few interesting findings in this post; please take a look at our paper for more.

Design issues are common: First, we found that most of the failures were due to factors introduced during design as well as during operation (see next figure). This suggests that improvements may be needed in how software is designed to handle failure scenarios (fault trees and FMEA, anyone?), or that users may need more guidance to be mindful of the systems’ usage context.

Most failures were due to factors introduced by design as well as operation.

For example with the diabetes monitor, either the design should have accounted for the failure mode of the cloud going down, or the users should have been provided much clearer guidance that they should not trust the data. (This second option is indeed what the FDA said Dexcom should do in its advertising and medical advice, but y’know, I think that was a pretty unrealistic bar to set. Dexcom’s advertising certainly implies that finger pricking will be a thing of the past).

The consequences of failure are getting more serious: Next, we compared our results for failures from 2010 to 2022 with a prior work’s results for failures from 1980 to 2012. The next bar chart shows the distribution of failure consequences, with the 1980–2012 data plotted in orange and the 2010–2022 data plotted in blue.

  • In the past, most software failures had no documented consequences; now, most do.
  • The biggest consequence of software failures used to be mere time-based inconvenience; now the biggest consequence is harm to goods, money, and data.
  • There has also been an increase in physical harm, and even regularly documented fatalities due to software failures.
Comparing failures 1980–2012 with failures 2010–2022, there appears to have been an increase in consequential software failures.

Failures recur: Finally, we observed that most failures had similar failures that recurred! Recurrence in this context often meant similar designs → similar failure modes → similar failures. About half of the failures seem to have had similar failures recur both within and across organizations, and many seem to have recurred within organizations. Take this one with a grain of salt — for this data we are relying on statements in the articles that a similar failure had previously occurred, and sometimes we have only the journalist’s word for it.

Many failures were recurring: similar designs have similar failure modes, which lead to similar failures.

What then shall we do?

Given that failures are growing more consequential, and are recurring, I think the software engineering community needs to do a better job at learning from failures. We specifically propose a Failure-Aware Software Development Lifecycle (FA-SDLC). After we capture failure knowledge in a postmortem, we can apply it during different phases of the SDLC — for example, past failures could inform requirements, design, testing, and incident management.

One specific direction we are working on is how to leverage failure knowledge to influence design decisions, moving towards the FA-SDLC. Our SERP4IoT’23 paper was a good start — watch this space.

Conclusions

Software engineering organizations need to spend less time “moving fast and breaking things” and more time “moving slowly and not killing people”. This is an engineer’s ethical duty to society, which takes precedence over any duty to their employers. We hope that FAIL can become an ongoing community resource with regular updates, and that the resulting database can inform software engineering practices and research, policy making, and education.


Written by James Davis

I am a professor in ECE@Purdue. My research assistants and I blog here about research findings and engineering tips.
