EdgeWise: A Better Stream Processing Engine for the Edge
This is a brief for the research paper EdgeWise: A Better Stream Processing Engine for the Edge, presented at the USENIX Annual Technical Conference (ATC) 2019. I was the third author, working with Xinwei Fu, Talha Ghaffar, and Dongyoon Lee.
Summary
We wanted to be able to perform streaming data analysis (i.e. low-latency, medium-throughput computation) on Edge-class devices (i.e. few-core, small-memory, everything a server isn’t). We examined existing stream processing engines and identified a design decision that limits their effectiveness on Edge-class devices: these engines assume they are running on non-resource-constrained systems, and allocate many threads regardless of the number of physical cores. We showed, using a mix of theory and experiments, that tailoring the number of threads to the number of cores can yield large latency gains without a loss in throughput, provided you add an intelligent scheduler to manage resources wisely.
Motivation and Background
Mechanical Engineering researchers at Virginia Tech have worked to instrument Goodwin Hall, a new engineering building on campus, with hundreds of sensors that collect gigabytes of data every day. There’s no particular reason to stop at hundreds of sensors and gigabytes of data, of course, other than limitations on the ability to analyze that data.
We met with Dr. Pablo Tarazaga’s team and talked about the analyses they’d like to run on the data, the eventual scale they envision, and the problems they’ve encountered along the way. Ideally, they would like to have fast analysis times for their sensor data (low latency), and they would like to be able to analyze a lot of data (high throughput). Of course, who wouldn’t?
Edge Computing
You can solve a lot of problems using Cloud technologies. The Cloud is great for building “big” applications. It can automatically scale up to handle bursty traffic, offer unlimited storage for your big database o’ customer data, and mask hardware failures from your customers.
But sometimes the Cloud is not the right solution. In the IoT context, sensors generate data — a LOT of data! — at the Edge of the network. Imagine if Goodwin Hall produced terabytes of data a day. Analyzing that data in the Cloud can be expensive:
- You have to pay ($) for the network bandwidth to ship data to the Cloud and to ingest it. Perhaps your sensors generate enough data that it would be cost-prohibitive to ship, even after sampling, compression, etc.
- You have to pay (time) for the network latency to and from the Cloud. Perhaps the analysis you want to run won’t take that long, so the travel time will dominate the total time spent to analyze the data. If your sensor is connected to an Internet backbone, the travel time won’t be that long. If your sensor is in the middle of a desert and bounces radio waves off of satellites to call home, it might be another story.
Arguments like these form the basis of the Fog Computing or Edge Computing paradigm [1]. In this paradigm, the Gateways that connect sensors to the Cloud are treated as system components in their own right, offering limited computational power but much lower network latency than going all the way to the Cloud.
In the Goodwin Hall building, the sensors are connected to routers and other local compute, which can act as the Gateways in our embodiment of Fog Computing. Each of these Gateways is not beefy enough to run analyses by itself, but by networking them together we expect to have enough computational capacity to handle the sensor data load.
Stream processing engines
OK, so we’ve decided to perform data analysis at the Edge, on a network of resource-constrained Gateway devices, rather than shipping everything to the Cloud. We turn to stream processing as a natural way to describe the analyses.
Popular stream processing engines like Apache Storm ask you to express an analysis as though it’s an assembly line: “Do A on the raw materials, then do B on the result, then you can perform C and D concurrently and do E after them.” Generalizing a little bit, you end up with a directed acyclic graph (DAG) whose vertices represent operations to perform and whose edges represent the data flow, the dependencies between the operations.
For example, in a natural language processing analysis, the sequence of operations might be to split a document into sentences, tokenize each sentence into words, filter to remove stopwords, stem the survivors, and finally compute word frequencies. In the context of Goodwin Hall’s vibration sensors, the analysis steps might be to remove outliers, then smooth the data, then group data from related sensors, and finally apply predictive models to convert vibrations into footfalls in near-real time.
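As a sketch, the NLP pipeline above can be written down as a small DAG, with each operator listing its upstream dependencies. The operator names are just illustrative labels taken from the example, not a real Storm topology:

```python
# A minimal sketch of the DAG model: each operator maps to the list of
# operators it depends on. Names are illustrative, not a real topology.
dag = {
    "split_sentences": [],                   # source: raw documents in
    "tokenize":         ["split_sentences"],
    "remove_stopwords": ["tokenize"],
    "stem":             ["remove_stopwords"],
    "word_frequency":   ["stem"],            # sink: word counts out
}

def topological_order(dag):
    """Return the operators in an order that respects the data-flow edges."""
    order, seen = [], set()
    def visit(op):
        if op in seen:
            return
        seen.add(op)
        for dep in dag[op]:
            visit(dep)       # schedule dependencies first
        order.append(op)
    for op in dag:
        visit(op)
    return order

print(topological_order(dag))
```

A real engine would attach user code to each vertex and stream tuples along the edges; the DAG itself is just this dependency structure.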
The next figure shows the Edge-centric analysis framework we have in mind:
The question is, are existing stream processing engines suitable for use at the Edge?
Research plan
- Study existing stream processing engines to see how suitable they are in our context.
- If unsuitable, identify the problem and design a solution.
The (un-)suitability of stream processing engines
We looked at the major stream processing engines: Apache Storm, Apache Flink, and Apache Heron.
Each follows the DAG model, of course, but when we studied how they map this model into practice we saw a mismatch with our planned deployment. Let’s see why! These systems follow a “One Worker Per Operator Architecture”, implementing the DAG by creating a thread/queue pair for each operator/inbound-edge pair. Each thread performs its operation on each item in its queue, with scheduling left to the operating system. When the topology is relatively empty (little data), most threads sleep until new data enters their queues, and there is not much contention between the threads. But when the system is loaded, all of the threads are awake and contending for processor time, and most of their queues hold data.
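A minimal sketch of this design, with a single illustrative operator (the real engines wire up one such thread/queue pair per operator, and the queues of successive operators form the DAG edges):

```python
import queue
import threading

def make_operator_thread(name, inbound, downstream, fn):
    """One-worker-per-operator sketch: a dedicated thread drains its own queue."""
    def worker():
        while True:
            item = inbound.get()        # sleep until data arrives
            if item is None:            # sentinel: shut the operator down
                break
            out = fn(item)
            if downstream is not None:
                downstream.put(out)     # data flows along a DAG edge
    return threading.Thread(target=worker, name=name)

# Wire up a single "double" operator and push one tuple through it.
q_in, q_out = queue.Queue(), queue.Queue()
t = make_operator_thread("double", q_in, q_out, lambda x: 2 * x)
t.start()
q_in.put(21)
q_in.put(None)
t.join()
result = q_out.get()
print(result)  # 42
```

With one thread like this per operator, a large topology on a few-core device has far more runnable threads than cores, which is exactly the contention problem described above.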
This situation is not particularly concerning in the Cloud. In a resource-rich environment, you can always slow down your data ingest or spin up new machines to pick up the load, and momentary queue imbalances won’t exhaust the system’s memory*. But in the resource-constrained Edge:
- Contention. The contention between the threads is pure overhead, and since we have relatively few processors the proportion of contention can be significant for larger analysis topologies.
- Memory limits. We have limited memory and we can’t magic new processors into existence, so if the system becomes overloaded then imbalanced queue lengths can leave us with nowhere convenient to put the data in the system.
From this analysis we concluded that these stream processing engines might not be suitable for use in our Edge context.
*See the paper for more on why these imbalances can be problematic. In short, due to backpressure, imbalanced queue lengths are not a zero-sum game in stream processing engines.
Redesigning stream processing engines for the Edge
Based on our analysis so far, we set out to find appropriate modifications to existing stream processing engines that would permit them to accommodate the Edge context. The result was our system, EdgeWise.
The resource constraints at the Edge guided us to a classic idea — we can modify a stream processing engine to use a worker pool, so that we only allocate as many threads as there are cores. Look ma, no wasted Therbligs!
But merely adding a thread pool only helps with problem #1 (excessive thread contention): you still have to decide which tasks the threads should work on. So to solve problem #2 (imbalanced queues), we implemented a scheduler.
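Here is a sketch of the worker-pool alternative, with a deliberately naive round-robin pick policy standing in for the scheduler. All names here are illustrative, not EdgeWise’s actual code:

```python
import os
import queue
import threading
import time

NUM_WORKERS = os.cpu_count() or 4  # one worker per core, not per operator

def run_worker(operators, stop):
    """Each worker repeatedly picks a pending operator and processes one item."""
    while not stop.is_set():
        progressed = False
        for op_queue, op_fn in operators:   # naive round-robin pick policy
            try:
                item = op_queue.get_nowait()
            except queue.Empty:
                continue
            op_fn(item)
            progressed = True
        if not progressed:
            stop.wait(0.001)                # idle briefly when all queues are empty

# Demo: one operator that records what it sees, drained by the pool.
results = []
ops = [(queue.Queue(), results.append)]
stop = threading.Event()
workers = [threading.Thread(target=run_worker, args=(ops, stop))
           for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()
for i in range(5):
    ops[0][0].put(i)
while not ops[0][0].empty():
    time.sleep(0.001)                       # wait for the pool to drain the queue
stop.set()
for w in workers:
    w.join()
print(sorted(results))  # [0, 1, 2, 3, 4]
```

The key structural change is that the number of threads is fixed by the hardware, and the choice of which operator runs next moves out of the OS and into the engine, where a smarter policy can be plugged in.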
To design the scheduler, we turned to queuing theory. We couldn’t find an exact fit for our problem in the academic literature (it’s probably buried in some journal of industrial engineering…), so we applied the building blocks from [2] to our situation. Underneath the formalization in our paper, the result is intuitive: Have the workers spend more time on more expensive operations, accounting for both the cost of the operation and the number of times it must be performed for each input tuple. Our analysis shows that this policy will offer better latency without affecting throughput, thanks to the non-linear ripple effect of wiser scheduling decisions through the topology.
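To make that intuition concrete, here is one plausible reading of such a policy as code; this is not the paper’s exact weighting, just the idea of ranking operators by pending work (queue length, scaled by per-tuple cost and by how many times the operation runs per input tuple):

```python
def pick_operator(ops):
    """ops: list of (name, queue_length, per_tuple_cost, invocations_per_input).
    Pick the operator with the most pending work to drain."""
    def pending_work(op):
        _, queue_length, cost, invocations = op
        return queue_length * cost * invocations
    return max(ops, key=pending_work)[0]

# Illustrative numbers: a cheap operator with a long queue loses out to an
# expensive operator with a shorter one (3 * 2.0 > 10 * 0.2).
ops = [
    ("smooth",  10, 0.2, 1.0),   # 10 pending tuples, cheap per tuple
    ("predict",  3, 2.0, 1.0),   # 3 pending tuples, expensive per tuple
]
print(pick_operator(ops))  # predict
```

A worker pool driven by a policy like this spends its limited cycles where the backlog is most expensive to clear, which is the behavior the queuing-theory analysis rewards.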
Experiments
Experimental design
- We used Raspberry Pi v3 devices as representative Gateway-class devices.
- We evaluated four analysis topologies from the RIoTBench benchmark suite [3].
- We evaluated each topology five times in each configuration. We did not take further measurements because we observed low variance in each configuration.
- We conducted both single-node and clustered experiments to see the effect of our implementation at finer and coarser granularities.
Results
Here is one figure from our single-node experiment:
And one figure from our clustered experiment:
The full paper has experiments and figures galore.
Conclusions and future work
In this paper we explore the implications of the Fog/Edge Computing paradigm in a promising use case: stream processing. We found that existing stream processing engines were designed for the Cloud and behave poorly in the Edge context. Using a congestion-aware scheduler and a fixed-size worker pool, we showed that these engines can obtain higher throughput and lower latency at the Edge.
I began this story with a description of Virginia Tech’s Goodwin Hall, and we designed this system to permit larger-scale near-real-time analyses of the data that Dr. Tarazaga’s team is collecting. In this paper we only evaluated our system on the RIoTBench benchmark suite, which was good for reproducibility but bad for our actual goal: unlocking the full potential of the Goodwin Hall infrastructure and helping the research community see the benefits and drawbacks of “smart buildings”. The next phase of this project is, of course, to deploy and evaluate the system and see how well it meets the needs of our partners. I am looking forward to the results!
Reflections
My favorite part about this paper is the final sentence: “Sometimes the answers in system design lie not in the future but in the past.” Our observation about thread pools was not new; we found it articulated in research papers from the early 2000s, and it predates those as well. The concept of “Let’s use a dedicated scheduler” is hardly new either. But the application to stream processing engines in the Edge context added enough new wrinkles that our paper could stand on its own. I enjoy the barefaced acknowledgment that the novelty lies less in the solution and more in the interpretation. It’s a distinction that I’m glad the computer science research community appreciates.
I think computer science is a fundamentally applied discipline, and some of the most interesting problems emerge at the intersection of computing and other disciplines. The whole point of computing is to make the real world cooler! Something I really enjoyed about this project was the opportunity to cross discipline boundaries with Dr. Tarazaga’s research group. The interdisciplinary research setting benefited both parts of the collaboration — from the computer science perspective, we identified a shortcoming in popular stream processing engines and proposed a better design; from the mechanical engineering perspective, they will be able to benefit from the state-of-the-art system we built.
This experience reminded me of a story I heard the mathematician Ron Fagin tell at IBM Research. He said one of his most highly-cited results came from a conversation with an applied team, and that collaborating across disciplinary boundaries (in that case, the “math guys” and the databases group) led to really interesting problems and solutions. I certainly found the experience working with Dr. Tarazaga’s group invigorating, and I hope to have similar experiences in the future.
More information
- The full paper is available here.
- The PowerPoint slides and presentation video are available here.
- The research artifact is available here.
References
[1] Bonomi et al., 2014. Fog computing: A platform for internet of things and analytics.
[2] Gross et al., 2008. Fundamentals of queueing theory.
[3] Shukla, Chaturvedi, and Simmhan, 2017. RIoTBench: An IoT benchmark for distributed stream processing systems.