Big Data: Principles and best practices of scalable realtime data systems

Book cover image

It wouldn’t be an exaggeration to say that Nathan Marz, as the original developer of Storm (together with many other relevant pieces of software, such as Cascalog) is among the inventors of the whole Big Data thing. Storm has enabled complicated real-time pipelines to be built, without the headaches of coordinating data transmissons and routing. It is thus a boon that he, together with James Warren, went on to write a book on the exact same topic, sharing the tips and ideas that went into building Storm. As such, it is not a surprise that the book is a great overview of the field and fundamental techniques, and has become standard reading already.

The demands on a big data system, essentially an OLAP application that has to scale linearly with the amount of input, is fundamentally different from those on an OLTP system, which developers are normally used to developing. These requirements are robustness (also in the face of human error), scalability, modularity and ad hoc queries (i.e. joins). The path taken by many development teams in the face of these differing requirements is the orchestration of existing tools, coupled with simply more code. As the authors point out, this approach leads to overly complex, fragile systems, because the transactional tools were not built for reliable and robust computation, and bring in too much complexity overhead with them. The alternative is to start with design principles and accompanying family of software tools that give you these requirements from the beginning, when placed into the right design philosophy. The name given by the authors to this design philosophy is Lambda Architecture.

The lambda architecture starts with the principle of immutable data. Data is the raw bits and bytes the system receives, and cannot be derived from anything else. It is at the beginning of the information dependency chain, so to say. Turning this data into a useful form and storing it is the job of three different layers of processing: Batch layer, serving layer and speed layer. The batch layer is responsible for running preprocessing on the original data to turn it into more accessible form. It has to be performant, scalable, and tolerant to human error. These properties are achieved by using simple storage solutions such as the file system, recomputation algorithms on immutable data, and parallel computation. What the batch layer does not need to be is low-latency. The computations are allowed to run over longer periods of time, in the order of tens of minutes, and work on complete sets of raw data. A central theme of the book, alluded to above already, is avoiding accidental complexity by reducing each layer to the necessary minimum of functionality. In the batch layer, this translates to keeping data immutable in terms of storage, and using recomputation algorithms to create the batch views. Recomputation algorithms have three advantages compared to incremental ones: They can be faster, error correction is recomputation, and they tend to be simpler.

The obvious choice for batch processing is Hadoop, and it is not any different in this book. Hadoop combines scalable, distributed storage and parallelized computation with HDFS and MapReduce. The authors go into some detail on storing and processing data on Hadoop using the Pail data partitioning library and the JCascalog data processing and querying library. One of the weaknesses of the book is obvious in this chapter. Hadoop is not a breeze to install, and the code examples in the batch processing chapters are there only for the reading; they are not particularly ‘hackable’. Also, the code is in Java, which might make sense considering the target audience and the fact that Hadoop and the other big data tools are written in it, but it’s not the prettiest code to look at. I ended up not even skimming the Java code, since it’s not my favorite way of spending time, and just read the textual explanations. The examples picked by the authors (unique views per time window with multiple IDs per user and bounce rate analysis) are fortunately not too simplistic. I can imagine that the code examples are relevant for people who use Java to implement similar things.

The batch layer processes the mass of incoming data to precompute batch views, condensed data that can be stored and easily combined to generate information that is of interest. Data is condensed in two senses: Accumulation and correlation. Accumulation is the calculation of measures of data, such as counts or averages, while also filtering those parts that are irrelevant. Correlation is the combination of rows of data based on column attributes, the simplest being concatenation, more complex cases being kinds of joins. The authors give examples of these operations for sample algorithms first in the standard Java way of writing MapReduce, then using an alternative library called JCascalog. JCascalog allows the description of parallel computations in a style much more similar to pipe diagrams, decoupling computation from physical storage. The discussion of JCascalog is rather in depth, but again, it would have been very useful to have a virtual machine or similar container in which the reader could easily poke around the examples, and maybe even solve a few exercises. The code examples will not be making developers like me who eschew certain programming models such as extensive internal state big friends of Java, as the authors state that components like the aggregators function by “adjusting some internal state for each observed tuple” (p. 129). This appears to be a general pattern: Because Java allows only limited kinds of generic programming, a lot is done using strings and internal state.

The following two chapters are dedicated to the design and implementation of a sample batch layer for website analytics. The individual features selected for this example are of varying complexity on differing dimensions. Pageview counts by URL per time period splits data on the time dimension, whereas unique visitors by URL per time period also requires keeping track of which user visited which page. The unique visitors task is complicated by the fact that the same user can be identified with different IDs, and ID equivalence can come in after the user visits a page. The last feature, bounce-rate analysis, is again different in that it requires tracking the time difference between different visits by the same user. The implementation is explained in detail on actual code, which is a bit tedious at times, but would definitely be useful when you’re working on actually implementing something.

The condensed data created by the batch layer is saved in the serving layer. The batch layer, by virtue of having access to all of the batch data, can condense it by the previously mentioned processes of aggregation and correlation so that there is not only less of it, but the data is mutated to enable efficient queries at the serving layer. These queries can require things like joins, grouping on columns, or calculating set cardinalities (made faster by approximate algorithms such as HyperLogLog). The serving layer has to be designed with the aim of presenting the condensed data in a reliable and rapid manner. Therefore, it should again be distributed to enable fault tolerance, and should allow indexes and collocation for fast retrieval of ranges of values. The first chapter on the serving layer goes into considerable depth to explain how an incremental approach that unifies read and write functionality would not be able to achieve similar performance to the batch & serving layer split. Afterwards, a sample serving layer that stores the results of the previously built batch layer is built, using ElephantDB as the storage engine. ElephantDB is a distributed key-value store explicitly built for exporting data out of Hadoop. One of its major features is that creation of indexes is completely separate from serving them. The indexes are created from shards of data at the end of a Hadoop job, and then fetched by the ElephantDB process during suitable load conditions. It is still not the ideal serving layer database, though, because it does not offer range queries or built-in HyperLogLog sets.

The last component of the lambda architecture is the speed layer. This layer is responsible for real-time processing of fresh data in a limited time window. In order to achieve speed, incremental algorithms are used in this layer, but error correction is still not done by correcting results, but by letting invalid results fall out of the window of processing. The requirements for view data storage in this layer is different from those in the serving layer. Since incremental algorithms are used, batch creation of sharded indexes is not enough; random writes should also be allowed. The correctness requirements are also laxer. Since the results will be improved when the batch layer kicks in and processes the complete dataset once it falls out of the real time view window, approximations for the sake of speed are welcome in the speed layer. This is called eventual accuracy. Due to the use of incremental algorithms and the general availability requirements on all layers, speed layer storage faces particular complexities. One of these is the CAP theorem, which concerns the consistency vs availability trade-off in the presence of network partitioning. Since distributed storage systems are used, partitioning is a condition that is definitely to be accounted for, and in the presence of partitioning, special methods called conflict-free replicated data types (CRDTs) have to be used to achieve incremental algorithms. There are two sets of tools that can be used to deal with these complexities. The first is asynchronous updates, where the data in the store is updated not individually from each speed layer process, but queued in a bust which can also buffer for batch updates. Another is expiring the views that are old enough to be included in the batch layer, and can be incorrect.

The sample implementation for the speed layer starts with a storage for realtime views, built on the Cassandra data store. Cassandra is a column-oriented database which the authors prefer to describe as a map with sorted maps as values. The data is arranged in column groups, which are themselves key-value mappings, where the values are also sorted key-values themselves. These are collocated, so doing efficient queries of the first level of key-values is possible. A number of different patterns for processing data in real time and feeding into the data store are then discussed, such as single-consumer vs multi-consumer queues, one-at-a-time vs micro-batched processing, and queues-and-workers model vs the Storm model. Storm was also written originally by Marz, and uses an alternative processing model for fast stream processing. The processing pipeline is represented in Storm as a topology that consists of streams (sequences of tuples), spouts (sources of tuples) and bolts (which take streams and produce other streams). The path followed by a tuple in this topology corresponds to a directed acyclic graph (DAG), which can be thought of as an alternative to queues, in that instead of maintaining intermediate queues that track what is processed and what is not, the position of a tuple in the DAG is stored. This turns out to be a relatively cheap process, requiring only 20 bytes per tuple. When a tuple is found to fail, it is reprocessed starting from the spout. This way, an at-least-once guarantee similar to that provided by the queues can be given by Storm.

In the illustration for speed layer stream processing, a Strom topology for calculating the uniques-over-time view, and another for bounce rate analysis are built, with the help of Kafka and Zookeeper. The first example serves to illustrate simple Storm topologies, whereas second is for more complicated micro-batch processing. The first example is very Java-centric, also due to the fact that Zookeeper is used, and it reads like an exploded version of a more concise language. The second example includes a more interesting discussion of one-at-a-time vs micro-batch processing. One-at-a-time guarantees that a tuple will be processed, but failure tracking and replay happen at a per-tuple level. It fails to give certain guarantees that are required in precision for certain kinds of tasks that require an exactly-once semantics, such as counting. Exactly-once semantics can be achieved using micro-batch processing, in which batches of tuples are processed together, and the state is stored in terms of IDs for these batches. Each bolt stores the ID of the last batch that it processed, and when a batch errors, whether it was processed can be found by comparing IDs. In the demonstration section, the bounce rate analysis task is implemented using Trident, a library for building pipelines on Storm, Kafka and Cassandra.

As you can see from the length of this review, Big Data is a book with a lot of substance. Here is what this book does not tell you, however: How to analyze the data and derive insights out of it. Other than that, pretty much any topic relevant to big data systems is mentioned. If you are working on a big data system, there is no way around this book.