Reflecting on Recurring Failures in IoT Development
This is a brief for the research paper “Reflecting on Recurring Failures in IoT Development”, published at the IEEE/ACM 2022 conference on Automated Software Engineering (ASE) — New Ideas and Emerging Results Track. This work was led by Dharun Anandayuvaraj. The full paper is available here. Dharun Anandayuvaraj wrote this brief, which I have lightly edited.
Background
Internet of Things (IoT) systems have become pervasive and increasingly interactive with the physical world. They enable software to interact directly with the physical environment, where faults and defects can be dangerous or safety-critical.
For example, in 2019 an IoT diabetes monitoring service had a cloud outage. The service is supposed to monitor blood sugar levels and alert users if they reach a critical level. During the outage, the children and adults who relied on the service were not notified through the app that it was down, which left them with a false sense of safety. The company had not designed a fail-safe for this scenario, even though this type of incident had occurred at the company in the past.
So, naturally, we wonder whether past failures could provide design insight. Because IoT systems have diverse characteristics that enable diverse faults, answering this question starts with characterizing IoT failures in the wild. As a first step towards this goal, we conducted a systematic study of IoT failures as reported in the media.
Approach
First, to characterize IoT failures in the wild, we needed detailed information about the failures. Since most IoT systems are proprietary, we wondered if we could collect relevant information from news sources. So we searched for news articles that described failures of IoT systems, published by reputable sources within recent years. From these articles we identified the sources, the impacts, and, where available, the repair recommendations for the IoT failures.
Then we utilized a pre-defined framework for characterizing IoT failures. This framework included a standard model of IoT systems, and the ways in which they fail.
We used this framework to perform a structured analysis of the qualitative news reports, coding each failure and categorizing it according to a taxonomy of faults.
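To make the coding step concrete, here is a minimal sketch of how one coded news report might be represented and tallied. It is not the paper's instrument: the class names, field names, and the `FaultSource` labels are assumptions chosen for illustration, and the records are made-up placeholders rather than data from the study.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class FaultSource(Enum):
    # Illustrative labels only; the paper's taxonomy may use different categories.
    APPLICATION = "application"
    COMMUNICATION = "communication"
    OTHER = "other"


@dataclass(frozen=True)
class CodedFailure:
    """One news-reported failure after manual coding (hypothetical schema)."""
    system: str
    domain: str
    source: FaultSource
    impact: str
    repair_recommendation: Optional[str] = None  # not every article offers one


def tally_sources(failures: Iterable[CodedFailure]) -> Counter:
    """Count how often each fault source appears in the coded sample."""
    return Counter(f.source for f in failures)


# Placeholder records, NOT the study's data.
examples = [
    CodedFailure("system A", "consumer healthcare", FaultSource.APPLICATION, "impact A"),
    CodedFailure("system B", "consumer healthcare", FaultSource.APPLICATION, "impact B"),
    CodedFailure("system C", "automotive", FaultSource.COMMUNICATION, "impact C"),
]

if __name__ == "__main__":
    for source, count in tally_sources(examples).most_common():
        print(source.value, count)
```

Keeping the repair recommendation optional mirrors the fact that we recorded it only when an article provided one.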
Results
Using these methods, we studied 22 IoT failures, covering 5 categories of applications:
Sources of IoT faults
From these reports, we identified the common sources of IoT failures. In our sample, faults most often originated at the application level, followed by communication connectivity.
We also note that even though IoT systems comprise both software and hardware, our data indicate that software was the primary cause of failures. This held regardless of whether the failure-triggering events came from within or outside the system.
Furthermore, we observed failure trends within application domains. For example, a common failure within the consumer healthcare domain was the lack of fail-safe systems, especially during network loss. In one case, a baby monitoring system failed in much the same way as the diabetes monitor example from Figure 1: during a server outage, parents were not notified that they were no longer getting fresh data from the baby monitor.
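One way to picture the kind of fail-safe that was missing is a client-side staleness check, sketched below. This is an illustration under our own assumptions, not something taken from the paper or from either product: the function names, the messages, and the 300-second threshold are all hypothetical.

```python
from typing import Optional

# Hypothetical threshold: how old a reading may be before the app warns the user.
DEFAULT_MAX_AGE_S = 300.0


def is_stale(last_reading_ts: float, now: float, max_age_s: float = DEFAULT_MAX_AGE_S) -> bool:
    """Return True if the most recent reading is older than max_age_s seconds."""
    return (now - last_reading_ts) > max_age_s


def status_message(last_reading_ts: Optional[float], now: float,
                   max_age_s: float = DEFAULT_MAX_AGE_S) -> str:
    """Choose what the app shows; missing or stale data must never look healthy."""
    if last_reading_ts is None or is_stale(last_reading_ts, now, max_age_s):
        return "WARNING: no fresh data; do not rely on this display"
    return "OK: data is current"


if __name__ == "__main__":
    print(status_message(1000.0, 1100.0))  # reading is 100 s old
    print(status_message(1000.0, 2000.0))  # reading is 1000 s old
```

The point of the sketch is that the check runs on the user's device, so it still works when the cloud service is unreachable.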
We also observed failure trends across application domains. For example, a common failure across the automotive and critical infrastructure domains was improper isolation between non-critical and safety-critical software, which exposed the safety-critical software to attackers. We observed this across cars, oil pipelines, and power plants.
Impacts of IoT failures
These examples show that the impacts of IoT failures can be significant. We observed general trends of common impacts, primarily within application domains. Common impacts included:
- Compromised critical functions (6 cases)
- Fatal collisions (5 cases)
- Exposed safety-critical functions (3 cases)
- False sense of safety (2 cases)
Each of these cases led to extensive human impact and monetary cost.
Reflection
From our results, we reflect that news reports documenting IoT failures provide system-level information about the failures and their impacts. Beyond news reports, we would likely benefit from learning from more detailed engineering reports, such as postmortems. We believe that such an approach could enable engineers to focus on faults that lead to catastrophic failures in IoT systems.
Specifically, our data suggest that many of these failures can be traced to problems in software and system engineering. We also observed echoes of past failures in modern IoT systems. For example, failures caused by improper isolation in critical infrastructure during the 1990s produced well-known software engineering lessons, yet the same kind of failure still recurs in modern IoT systems.
This suggests that development teams face challenges in preventing recurring failures. To help address them, we suggest development processes that place greater emphasis on learning from past failures.
A Failure-Aware Software Development Life Cycle
We propose three research directions towards a Failure-Aware Software Development Life Cycle for IoT.
- Infrastructure: A failure encyclopedia: First, to help IoT engineers anticipate failures, we believe they would benefit from an encyclopedia of previous IoT failures. A catalog of case studies could outline the failure, the underlying fault(s), the impact, and lessons learned from a system failure. Case studies could be built from within teams & organizations, as well as from external sources such as news reports or other organizations. These case studies could inform software engineering judgment to build systems resilient to past failures.
- Process: An empirical basis for postmortems: Second, we recommend research to establish an empirical basis for software failure postmortems. Our analysis of failures was restricted because we could only observe what was reported by journalists. But IoT engineers working on the affected products could conduct a more detailed analysis to benefit their own and other teams, through a failure postmortem. Although postmortems are widely recommended, they are often omitted. And we know surprisingly little about postmortems in practice. For example, what are effective personal & team practices to collect, analyze, and document system failures? Or how can postmortem knowledge be integrated into the SDLC, managing the tradeoff between agility and risk management?
- Tools: Automation: Finally, there are many opportunities to automate elements of this research agenda. First, we could leverage text mining techniques to extract system postmortem information from diverse representations, including news reports, user complaints, and open-source issue reports. This would facilitate the organization of a large encyclopedia of failures. Then, an engineering team could query this database for relevant failures to guide their system design or maintenance work. This could be enabled by filtering cases relevant to specific contexts, or by transferring lessons across contexts. Finally, during validation, an engineering team could scan their system model or codebase for known hazards using program analysis.
Answering these questions is a starting point for establishing an empirical basis for software failure postmortems.
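As a rough illustration of the encyclopedia and query ideas above, the sketch below stores case studies with the fields named in the first bullet (failure, faults, impact, lessons learned) and filters them by domain and keyword. Everything here is hypothetical: the `CaseStudy` schema, the `query` function, and the two sample entries, which are loosely paraphrased from this brief and whose "lessons" are our own illustrative wording. Plain substring matching stands in for the text mining techniques the agenda calls for.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Sequence


@dataclass
class CaseStudy:
    """One encyclopedia entry (hypothetical schema)."""
    title: str
    domain: str
    failure: str
    faults: List[str]
    impact: str
    lessons: List[str] = field(default_factory=list)

    def text(self) -> str:
        """All searchable text of the entry, lower-cased."""
        parts = [self.title, self.failure, self.impact, *self.faults, *self.lessons]
        return " ".join(parts).lower()


def query(cases: Iterable[CaseStudy], domain: Optional[str] = None,
          keywords: Sequence[str] = ()) -> List[CaseStudy]:
    """Return the cases in `domain` (if given) that mention every keyword."""
    wanted = [k.lower() for k in keywords]
    results = []
    for case in cases:
        if domain is not None and case.domain != domain:
            continue
        searchable = case.text()
        if all(k in searchable for k in wanted):
            results.append(case)
    return results


# Illustrative entries only; not records from the study.
ENCYCLOPEDIA = [
    CaseStudy(
        title="Monitoring service outage",
        domain="consumer healthcare",
        failure="server outage left users without fresh readings or a warning",
        faults=["no fail-safe for network loss"],
        impact="false sense of safety",
        lessons=["warn users when data is stale"],
    ),
    CaseStudy(
        title="Exposed vehicle controls",
        domain="automotive",
        failure="attackers reached safety-critical functions",
        faults=["improper isolation between non-critical and safety-critical software"],
        impact="exposed safety-critical functions",
        lessons=["separate non-critical and safety-critical software"],
    ),
]

if __name__ == "__main__":
    for case in query(ENCYCLOPEDIA, keywords=["isolation"]):
        print(case.title)
```

Omitting the `domain` argument is one simple way to look for lessons that transfer across contexts; a real tool would need richer matching than keywords.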
Conclusion
We studied real-world IoT systems and observed recurring failure trends both within and across application domains. To alleviate recurring failure trends, we recommend a research agenda towards a Failure-Aware Software Development Life Cycle for IoT.