bits and bytes: Aggregation Paradigm in Real-Time Stream Processing

Very recently, I joined a real-time streaming start-up called - DataTorrent (https://www.datatorrent.com/). I was extremely curious and excited to know more about the DataTorrent's streaming platform, as just few months earlier I was looking to build an analytics product, and for that I was exploring Twitter's real-time streaming computation system - Storm (http://storm-project.net/). After exploring both the platforms, I feel that both have very similar terminology and notion, but a lot of noticeable differences at the same time.

In this post I will touch upon, one of the most important computational aspect in real-time computing – the aggregation computation time window and how DataTorrent and Storm differ on this aspect.

DataTorrent's real-time streaming platform supports various streaming computation models whereas Storm's distributed real-time system has no built-in support.

Common Business Scenarios

As a real-time streaming application developer, I want to be able to associate arbitrary amount of time period as needed to do the aggregations. For example, consider following use-cases related to some very common business applications -

1. Calculate simple moving average of a stock price within 'n' minute window.

2. Calculate volume of transactions for a given stock within 'n' minute window.

3. Track top trending word on Twitter within 'n' minute window.

As you can see there are a lot of such common scenarios that need a windowing concept to be supported by a real-time streaming computation platform. If a platform can't support such a concept, then it becomes responsibility of an application developer to keep track of time window and then do the aggregation computation in-order to emitting accurate results.

DataTorrent’s Windowing Abstraction to Support Aggregation

I particularly like native support for application window abstraction in DataTorrent's platform. This helps implement above mentioned common business requirements easily. As an application developer, I didn’t have to worry at-all about keeping track of time window and I could simply focus on the application business logic. In order to support uses cases such as calculating volume of transaction within ‘n’ minute window, I only needed to define my Aggregate Application Window as an attribute. That’s about everything I had to do to compute the required aggregation in a given time window. The same functional code works as-is if I change my aggregation window.

How Is Vanilla Storm Different?

The vanilla Storm Topology (non-Trident) is quite different or rather quite simplistic as compared to DataTorrent's. Instead of providing time window as an abstraction, it focuses more on the Ack and Fail mechanism for individual tuples that are generated by the Spout and subsequent downstream Bolts. Developers need to pro-actively write Acks. Thus ack’ing has to be done by user code in order to support “at-least once” semantics. DataTorrent has in-built “at least once” semantics without any user code. I couldn’t find any support for windowing abstraction if someone wants to handle above mentioned business scenarios. It too becomes the responsibility of an application developer to handle the time or windowing related information as part of tuples and manage the aggregation computation accordingly.

Is Trident Topology The Answer?

Not really! Trident topologies do provide a notion of Transaction Id which is nothing but a batch of tuples as emitted by the Spout. Here again, it looks like Trident topology is only batching multiple tuples together for Ack and Fail mechanism along with making sure batches are scheduled for commit in sequence. This can help support “exactly once” semantics, but again, can’t really help the application developer to handle aggregation computation for a given time window.

bits and bytes

Thursday, November 14, 2013

Aggregation Paradigm in Real-Time Stream Processing – DataTorrent v/s Storm

No comments:

Post a Comment