Introduction
The Internet of Things will put new demands on Hadoop ingest methods, specifically on their ability to capture raw sensor data in the form of binary streams. As discussed previously, big data removes earlier storage constraints and allows raw sensor data to be streamed at whatever granularity the sensors themselves dictate. This post focuses on how three popular Hadoop ingest tools (Flume, Kafka, and Amazon's Kinesis) compare with respect to the initial capture of that data, particularly around configuration, monitoring, and scale. Future posts will examine processing techniques as the data makes its way through the Hadoop data pipeline, including a look at Spark Streaming.
What is meant by binary stream data?
Most data, usage logs for example, arrives as streams of text events produced by some action, such as a user click, and can be serialized into discrete chunks per event. A binary stream can, of course, also be broken into discrete data points, but instead of being event-based, the data is a continuous stream collected at a specific frequency. Consider a temperature sensor with a resolution of 1,000 readings per second. It may not be feasible to serialize the data at that frequency, so ingest focuses on pure capture, pushing the serialization work out to a distributed system like Hadoop.
For this post, the ingest tools were put through their paces using audio ingest as an example use case. A single audio stream has a modest bitrate of 128 kb/s, but if one wanted to, for instance, record every radio station within a listening area, the aggregate becomes substantial: at, say, 100 stations that is roughly 1.6 MB/s, or on the order of 140 GB per day.
All code for this post can be found on GitHub.
Flume
The first tool evaluated is Flume, a popular log-oriented data ingest platform that is customizable enough to be included in this evaluation. Flume's interface with the raw data is called a source. There are many pre-implemented sources, but none that could natively stream binary data produced at a URL endpoint, so a custom source implementation was required. There are two implementation patterns to follow, PollableSource or EventDrivenSource, and the continuous stream of data aligned better with the event-driven pattern. Much of the example code was modeled on the Netcat source implementation. The source-specific configuration followed the existing Flume configuration patterns nicely. For example, including MBean counters allows the source metrics to appear on existing dashboards. Once compiled, the source jar is dropped into the plugins.d directory, a convenient way to organize custom code.
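To give a flavor of the pattern, here is a minimal sketch of such an event-driven source. The class name, property names (streamUrl, chunkSize), and chunking logic are illustrative rather than the exact code from the repository.

```java
import java.io.InputStream;
import java.net.URL;

import org.apache.flume.Context;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.instrumentation.SourceCounter;
import org.apache.flume.source.AbstractSource;

public class BinaryStreamSource extends AbstractSource
        implements EventDrivenSource, Configurable {

    private String streamUrl;          // endpoint to read from (hypothetical property name)
    private int chunkSize;             // bytes per Flume event (hypothetical property name)
    private SourceCounter sourceCounter;
    private volatile boolean running;
    private Thread reader;

    @Override
    public void configure(Context context) {
        streamUrl = context.getString("streamUrl");
        chunkSize = context.getInteger("chunkSize", 4096);
        sourceCounter = new SourceCounter(getName());   // exposes MBean counters
    }

    @Override
    public void start() {
        running = true;
        sourceCounter.start();
        reader = new Thread(() -> {
            try (InputStream in = new URL(streamUrl).openStream()) {
                byte[] buffer = new byte[chunkSize];
                int read;
                // Read fixed-size chunks and hand each one to the channel processor.
                while (running && (read = in.read(buffer)) != -1) {
                    byte[] body = new byte[read];
                    System.arraycopy(buffer, 0, body, 0, read);
                    getChannelProcessor().processEvent(EventBuilder.withBody(body));
                    sourceCounter.incrementEventAcceptedCount();
                }
            } catch (Exception e) {
                // In real code, log and surface the failure to Flume's lifecycle.
            }
        });
        reader.start();
        super.start();
    }

    @Override
    public void stop() {
        running = false;
        sourceCounter.stop();
        super.stop();
    }
}
```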
In order to achieve scale and resiliency, the Flume source hands messages off to a channel. To scale to higher throughput, Flume provides channel selectors, with the option to multiplex across multiple channels, which allows for horizontal scalability; a sample configuration is sketched below. The throughput of a single channel is determined by its backing store.
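Here is one way a multiplexing selector might look in the agent configuration. The agent, channel, and header names are hypothetical, and the mapping assumes a "station" header has been set on each event.

```
# hypothetical agent a1 fanning one source out across two channels
a1.sources = audioSrc
a1.channels = memCh1 memCh2
a1.sources.audioSrc.channels = memCh1 memCh2
a1.sources.audioSrc.selector.type = multiplexing
a1.sources.audioSrc.selector.header = station
a1.sources.audioSrc.selector.mapping.stationA = memCh1
a1.sources.audioSrc.selector.mapping.stationB = memCh2
a1.sources.audioSrc.selector.default = memCh1
```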
Flume high availability can be configured by running multiple collector hosts, with clients set up to fail over when one collector fails. The failed collector's events are replayed when the node comes back online. RAID can be used to alleviate concerns about data loss on a single-node failure.
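On the client agent, failover across collectors is typically declared as a sink group with a failover processor. A minimal sketch, with hypothetical sink and agent names:

```
# two sinks pointing at different collector hosts, grouped for failover
a1.sinks = collector1 collector2
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = collector1 collector2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.collector1 = 10
a1.sinkgroups.g1.processor.priority.collector2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000
```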
Pros:
- Good documentation with many existing implementation patterns to follow
- Easy integration with existing monitoring frameworks that examine MBean counters
- Integration with Cloudera Manager to monitor Flume processes
Cons:
- Event- rather than stream-centric
- Capacity planning is not an exact science and is best confirmed through trials
- Throughput depends on the channel's backing store
Kafka
Kafka is a distributed commit log that is gaining popularity as a data ingestion service. Kafka's interface with the stream is called a producer. More producer implementations are appearing but, again, there was no existing implementation that could stream the audio data of interest. Implementing the producer interface resulted in a standalone process purpose-built for producing messages to Kafka. Monitoring of that process has to be considered and integrated into the system separately. Hadoop distributions don't yet ship Kafka integration to help with monitoring, and even when they do, their focus will be on the Kafka broker processes themselves rather than on producer processes.
Kafka achieves additional scale via topic partitions; the producer distributes data across the partitions, which are in turn spread over the nodes in the cluster. The more partitions, the higher the throughput. It is the user's responsibility to determine an appropriate partitioning scheme. This can be tricky without examining the stream's contents, but often there is other metadata that can be used. In the audio ingest example, the streams can be partitioned on the URL of the audio source, provided a single source does not exceed the throughput limit of one partition.
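A minimal sketch of such a standalone producer, using the newer Java producer API and keying each chunk by the stream's URL so a given station always lands in the same partition. The topic name, broker list, and chunk size are illustrative, not taken from the repository.

```java
import java.io.InputStream;
import java.net.URL;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AudioStreamProducer {

    public static void main(String[] args) throws Exception {
        String streamUrl = args[0];          // e.g. a radio stream endpoint
        String topic = "audio";              // illustrative topic name

        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
             InputStream in = new URL(streamUrl).openStream()) {
            byte[] buffer = new byte[4096];
            int read;
            // Each fixed-size chunk becomes one record; the URL key pins the
            // stream to a single partition, preserving per-stream ordering.
            while ((read = in.read(buffer)) != -1) {
                byte[] chunk = new byte[read];
                System.arraycopy(buffer, 0, chunk, 0, read);
                producer.send(new ProducerRecord<>(topic, streamUrl, chunk));
            }
        }
    }
}
```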
Kafka treats resiliency as a first-class feature via configurable replicas of each topic. Additional replicas have negligible impact on throughput.
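For reference, both the partition count and the replication factor are declared when the topic is created; with the stock command-line tool the invocation looks roughly like this (ZooKeeper host, topic name, and counts illustrative):

```
kafka-topics.sh --create --zookeeper zk1:2181 \
  --topic audio --partitions 8 --replication-factor 3
```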
Pros:
- High achievable ingest rates with clear scaling pattern
- High resiliency via distributed replicas with little impact on throughput
Cons:
- No current framework for monitoring and configuring producers
AWS Kinesis
Kinesis is very similar to Kafka, as the original Kafka author points out. The AWS Kinesis SDK does not provide any default producers, only an example application. Using that example as the basis, the Kinesis implementation of the audio ingest example followed nicely. The Kinesis service integrates well with other AWS services, making it easy to scale and process data (more about that in another post). Because Kinesis is a cloud service, communication from an on-premises source will incur more latency than an on-premises Kafka cluster installation.
The Kinesis producer implementation followed the Kafka example very closely and suffers from the same hassle of monitoring another producer process.
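A minimal sketch of the equivalent producer using the AWS SDK for Java, again chunking the stream and using the source URL as the partition key; the Kinesis stream name and chunk size are illustrative.

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.ByteBuffer;

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

public class AudioStreamKinesisProducer {

    public static void main(String[] args) throws Exception {
        String streamUrl = args[0];            // e.g. a radio stream endpoint
        String streamName = "audio-ingest";    // illustrative Kinesis stream name

        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        try (InputStream in = new URL(streamUrl).openStream()) {
            byte[] buffer = new byte[4096];
            int read;
            // Each chunk becomes one record; the URL partition key maps a given
            // station to a single shard, mirroring the Kafka partitioning scheme.
            while ((read = in.read(buffer)) != -1) {
                PutRecordRequest request = new PutRecordRequest()
                        .withStreamName(streamName)
                        .withPartitionKey(streamUrl)
                        .withData(ByteBuffer.wrap(buffer, 0, read));
                kinesis.putRecord(request);
            }
        }
    }
}
```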
Pros:
- High achievable ingest rates with clear scaling pattern
- Similar throughput and resiliency characteristics to Kafka
- Integrates with other AWS services like EMR and Data Pipeline
Cons:
- No current framework for monitoring and configuring producers
- Cloud service; possible increase in latency from source to Kinesis
Conclusion
For the basic audio streaming example, each ingest tool was able to capture the stream with a bit of custom ingest code. Each solution requires understanding the scaling and resiliency configuration needed to accommodate your sensors' data rates and guarantee that data is not lost. Kafka and Kinesis have very similar scaling and resiliency patterns. Kinesis is a fully managed AWS service with integration into other AWS services, while Kafka has been gaining popularity and may see future integration from the Hadoop distribution vendors. Both Kafka and Kinesis require custom monitoring and management of the actual producer processes, whereas Flume processes and their metrics can be gathered automatically with tools like Cloudera Manager. Flume, however, lacks the clear scaling and resiliency configurations that are foundational to Kafka and Kinesis.
The Internet of Things will push Hadoop ingest tools to new levels of scale. Take a look at the example audio streaming code to get a sense of how to implement ingest of binary stream data into Hadoop.