I recently joined a real-time streaming start-up called DataTorrent (https://www.datatorrent.com/). I was curious and excited to learn more about DataTorrent's streaming platform because, just a few months earlier, I had been looking to build an analytics product and had been exploring Storm (http://storm-project.net/), Twitter's real-time stream computation system. Having explored both platforms, I feel that they share much of their terminology and many of their concepts, yet differ noticeably in several ways.
In this post I will touch upon one of the most important aspects of real-time computing: the time window over which aggregations are computed, and how DataTorrent and Storm differ on it.
DataTorrent's real-time streaming platform supports various streaming computation models, windowed aggregation among them, whereas Storm's distributed real-time system has no built-in support for it.
Common Business Scenarios
As a developer of real-time streaming applications, I want to be able to aggregate over an arbitrary time period. For example, consider the following use cases from some very common business applications:
1. Calculate the simple moving average of a stock price within an 'n' minute window.
2. Calculate the volume of transactions for a given stock within an 'n' minute window.
3. Track the top trending word on Twitter within an 'n' minute window.
As you can see, there are many such common scenarios that need a real-time streaming computation platform to support a windowing concept. If a platform does not support such a concept, it becomes the application developer's responsibility to keep track of the time window and then perform the aggregation in order to emit accurate results.
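To make that burden concrete, here is a minimal sketch in plain Java of the bookkeeping a developer ends up writing for the transaction-volume use case. It is only an illustration under my own assumptions: ManualWindowVolume and its methods are hypothetical names, the windows are fixed (tumbling) windows, and event times are assumed to arrive in non-decreasing order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: sums trade volume per symbol in fixed (tumbling) time windows. */
class ManualWindowVolume {
    private final long windowMillis;
    private long windowStart = -1; // start of the window currently being filled
    private final Map<String, Long> volume = new HashMap<>();
    private final List<Map<String, Long>> emitted = new ArrayList<>();

    ManualWindowVolume(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Feed one trade; event times are assumed to arrive in non-decreasing order. */
    void onTrade(String symbol, long quantity, long eventTimeMillis) {
        long start = eventTimeMillis - (eventTimeMillis % windowMillis);
        if (windowStart >= 0 && start != windowStart) {
            flush(); // the tuple belongs to a later window, so close the current one
        }
        windowStart = start;
        volume.merge(symbol, quantity, Long::sum);
    }

    /** Emit the totals of the current window (if any) and reset the state. */
    void flush() {
        if (!volume.isEmpty()) {
            emitted.add(new HashMap<>(volume));
            volume.clear();
        }
    }

    List<Map<String, Long>> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        ManualWindowVolume agg = new ManualWindowVolume(60_000); // 1-minute window
        agg.onTrade("ACME", 100, 1_000);
        agg.onTrade("ACME", 50, 59_000);
        agg.onTrade("ACME", 70, 61_000); // falls in the next window
        agg.flush();
        System.out.println(agg.emitted());
    }
}
```

Even in this toy version, the window arithmetic, the flushing, and the ordering assumption all live in application code, next to the one line that actually does the aggregation.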
DataTorrent’s Windowing Abstraction to Support Aggregation
I particularly like the native support for the application window abstraction in DataTorrent's platform. It makes the common business requirements mentioned above easy to implement. As an application developer, I didn’t have to worry at all about keeping track of the time window and could simply focus on the application's business logic. To support use cases such as calculating the volume of transactions within an ‘n’ minute window, I only needed to define my aggregate application window as an attribute. That was all I had to do to compute the required aggregation over a given time window.
The same functional code works as-is if I change my aggregation window.
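The idea can be sketched in plain Java. To be clear, this is not DataTorrent's API: WindowedOperator, VolumeOperator, and WindowDriver are hypothetical names of my own, and the driver merely stands in for a platform that decides where windows begin and end. I also assume the input arrives already split into base intervals (say, one list of tuples per minute), with the window size expressed as a number of such intervals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical operator contract: the platform, not the operator, decides window boundaries. */
interface WindowedOperator<T, R> {
    void beginWindow();
    void process(T tuple);
    R endWindow();
}

/** Business logic only: sums transaction volume and never looks at a clock. */
class VolumeOperator implements WindowedOperator<Long, Long> {
    private long total;

    public void beginWindow() { total = 0; }
    public void process(Long quantity) { total += quantity; }
    public Long endWindow() { return total; }
}

/** Stand-in for the platform: groups base intervals into windows of a configured size. */
class WindowDriver {
    static <T, R> List<R> run(WindowedOperator<T, R> op,
                              List<List<T>> baseIntervals,
                              int intervalsPerWindow) {
        List<R> results = new ArrayList<>();
        for (int i = 0; i < baseIntervals.size(); i++) {
            if (i % intervalsPerWindow == 0) {
                op.beginWindow();
            }
            for (T tuple : baseIntervals.get(i)) {
                op.process(tuple);
            }
            boolean lastOfWindow = (i + 1) % intervalsPerWindow == 0
                    || i == baseIntervals.size() - 1;
            if (lastOfWindow) {
                results.add(op.endWindow());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // One inner list per minute of transaction quantities (made-up sample data).
        List<List<Long>> minutes = Arrays.asList(
                Arrays.asList(10L, 20L), Arrays.asList(5L),
                Arrays.asList(7L, 3L), Arrays.asList(1L));
        System.out.println(run(new VolumeOperator(), minutes, 1)); // 1-minute windows
        System.out.println(run(new VolumeOperator(), minutes, 2)); // 2-minute windows
    }
}
```

Only the last argument to run changes between the two calls; VolumeOperator is untouched, which is the property I found so convenient.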
How Is Vanilla Storm Different?
The vanilla (non-Trident) Storm topology is quite different from DataTorrent's, or rather quite simplistic by comparison. Instead of providing the time window as an abstraction, it focuses on the ack and fail mechanism for the individual tuples emitted by the spout and the downstream bolts. Developers need to write the acks explicitly, so acking has to be done in user code to get “at least once” semantics. DataTorrent has built-in “at least once” semantics without any user code. I couldn’t find any windowing abstraction in Storm for handling the business scenarios mentioned above. Here too, it becomes the application developer's responsibility to carry time or window information in the tuples and manage the aggregation accordingly.
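As a rough illustration of what "carrying time in the tuples" means for the moving-average use case, here is a plain-Java sketch of a helper that a bolt could delegate to. It does not use any Storm classes; SlidingAverage and Tick are hypothetical names, and I assume each price arrives with its own timestamp, in time order, and that the window length is positive.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: moving average of prices over the last windowMillis. */
class SlidingAverage {
    /** A price together with the timestamp carried in its tuple. */
    private static final class Tick {
        final long timeMillis;
        final double price;

        Tick(long timeMillis, double price) {
            this.timeMillis = timeMillis;
            this.price = price;
        }
    }

    private final long windowMillis;
    private final Deque<Tick> ticks = new ArrayDeque<>();
    private double sum;

    SlidingAverage(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Add a timestamped price and return the average over the window ending at that time. */
    double add(long timeMillis, double price) {
        ticks.addLast(new Tick(timeMillis, price));
        sum += price;
        // Evict ticks outside (timeMillis - windowMillis, timeMillis]; the tick just
        // added always stays, so the deque is never empty here.
        while (ticks.peekFirst().timeMillis <= timeMillis - windowMillis) {
            sum -= ticks.removeFirst().price;
        }
        return sum / ticks.size();
    }

    public static void main(String[] args) {
        SlidingAverage avg = new SlidingAverage(60_000); // 1-minute window
        System.out.println(avg.add(0, 10.0));
        System.out.println(avg.add(30_000, 20.0));
        System.out.println(avg.add(70_000, 30.0)); // the tick at time 0 has expired
    }
}
```

The eviction loop and the timestamp field are exactly the kind of window management that ends up in user code when the platform has no windowing abstraction.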
Is Trident Topology the Answer?
Not really! Trident topologies do provide the notion of a transaction id, which simply identifies a batch of tuples emitted by the spout. Here again, it looks like a Trident topology only batches multiple tuples together for the ack and fail mechanism and makes sure that batches are committed in sequence. This can help support “exactly once” semantics but, again, doesn't really help the application developer handle aggregation over a given time window.