How did we manage to tame HBase in one of our biggest Big Data environments?
A little over a year ago, I joined the Ibis Instruments Big Data team as a Big Data Software Engineer. Over the previous several years, this team had developed its own Big Data solution – Ibis Performance Insights (iPi). We have since built several iPi Big Data environments and, as usually happens in the developer world, each of them confronted us with its own challenges and obstacles.
What is the point of challenges and obstacles if not to learn from them? That is why I decided to share with other developers one of our most significant experiences: the story of HBase and its influence on one of our biggest and most important Big Data environments.
The first important thing for understanding the problem is the iPi architecture. Therefore…
What is iPi?
iPi is a Big Data solution based on open-source technologies. At its core is the powerful Hadoop platform which, together with the other Big Data components, forms one large ecosystem in charge of storing and processing large quantities of data in near real time.
This platform is distributed across several machines which together make up the Big Data cluster, organized in a master-slave architecture. Two of these machines, the so-called namenodes, administer the cluster and store important system metadata, while the slave machines, or datanodes, carry out tasks assigned by the active namenode and are in charge of reliable and efficient data storage and processing.
The second significant part of iPi is the web application, which serves for displaying results obtained by using the Big Data cluster, but for this post to maintain its conciseness and clarity, this application will be the topic of one of the subsequent posts.
Since I have already written several lines about the Hadoop platform, you can guess that we delegated the role of reliable storage to one of Hadoop's basic components: HDFS (Hadoop Distributed File System). HDFS is extremely important because it can store very large quantities of data and achieves fault tolerance simply by replicating data in a configurable number of copies. However, HDFS pays for these qualities with high I/O latency, and that is where HBase comes into play: a distributed, scalable data store that runs on top of HDFS.
HBase is a key component of iPi because it is designed for fast access to tables with billions of rows and millions of columns. This lets it efficiently serve applications that require quick, random access to large data sets, which suits our iPi web application especially well.
Additionally, to optimize fast reads and the ingestion of large quantities of time-series data, a specialized layer runs on top of HBase for quick data entry and storage; an instance of this tool exists on every machine in the Big Data cluster. All of these instances point at the same HBase table, with the instances on the datanodes used solely for writing and the instances on the namenodes used for reading data.
I have already mentioned that the Big Data environment this post is about is one of our biggest and most important. Its cluster consists of 11 machines, each with roughly 10TB of storage and 64GB of RAM, and its nine datanodes currently have the task of reliably and efficiently storing and processing 50TB of compressed data, a number that will only continue to grow.
The Big Data field grows more popular and evolves further every day, so if you wish to remain relevant in it, you cannot allow yourself to stagnate in any respect. That is why we as a team tried to develop and implement as many new features within iPi as possible, in as little time as possible, which often meant a significantly larger read and write load on the cluster. Even though space usage on HDFS is close to 50 percent, the number of read and write tasks directed at the cluster increased dramatically. It quickly became apparent that the initial configuration with which we had started the whole iPi story was no longer adequate, and the system's performance suffered badly for it.
Although in the previous section I had only praise for HBase and its performance, in our case HBase was the very epicentre of all our problems. HBase is written in Java, which means that automatic memory management, i.e. Java garbage collection, has a crucial influence on its performance.
The data we use within Big Data environments is organized in HBase as one big table distributed across the datanodes in the cluster. This table is split into parts called regions, and each datanode runs an HBase RegionServer process which is in charge of accessing the regions on its datanode during read and write operations.
When the number of read and write requests directed at a RegionServer increases dramatically and the initial configuration cannot support the load, the result is the so-called "stop the world" pauses. During these pauses the RegionServer devotes all of its resources to cleaning up the heap, which halts the regular work of the system.
In our case, these pauses would last up to several minutes in the most critical situations, meaning that for those several minutes the RegionServer could serve no read or write requests at all. Such situations directly affected the web application, which is a big problem both for us and for the user: the application worked very slowly, and at times the data displayed in reports was unrealistic and meaningless. Clearly, we needed an efficient, long-term solution, and quickly.
There are three options for resolving such situations: reducing the system load, adding new datanodes to the cluster, or HBase tuning. In our case the first option was not a solution, because our long-term plans do not include any kind of iPi downgrade. The second option, which is in fact the simplest, was impossible to implement on such a short deadline. We were therefore forced to change the HBase configuration, i.e. to perform HBase tuning.
How did we do this?
The first step in resolving our problem was a detailed understanding of the process that leads to the "stop the world" pauses.
Our initial configuration used the Garbage-First (G1) collector, which is quite understandable given the 8GB heap size: G1 is one of the most efficient, best-performing garbage collectors for multiprocessor machines with heaps larger than 6GB. That is why we did not wish to change this parameter, but we did keep in mind the possibility of increasing the heap.
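In `hbase-env.sh` terms, that starting point looked roughly like this. This is a sketch with illustrative values matching the description above (G1 collector, 8GB heap, 200ms pause target); the post does not show the actual file, only `HBASE_REGIONSERVER_OPTS` is the standard HBase variable name:

```shell
# hbase-env.sh -- initial RegionServer JVM settings (illustrative sketch)
# G1 collector on an 8 GB heap, as described above
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms8g -Xmx8g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200"
```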
Furthermore, our initial configuration set the desired GC pause to 200ms. Watching the garbage-collection duration graphs, we noticed that this target was being met only about two percent of the time. GC duration oscillated wildly, ranging from 150ms all the way up to 80s, and the problematic collections occurred every hour. The heap-usage graph, meanwhile, showed a constant value of 6GB.
Then we looked at two of the most significant in-memory components of the RegionServer: MemStore and BlockCache. Simply put, MemStore is a write cache that holds new data which has not yet been written to disk. When enough data accumulates in a MemStore (in our case 128MB), it is flushed into HFiles and written to disk. Similarly, BlockCache is a read cache. In our case, both components were configured so that each could take up at most 40% of the heap.
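In `hbase-site.xml`, that sizing corresponds to settings along these lines. The property names are the standard HBase ones and the values mirror the configuration just described; treat this as an illustration rather than a copy of our actual file:

```xml
<!-- hbase-site.xml: in-memory sizing as described above (illustrative) -->
<property>
  <name>hbase.regionserver.global.memstore.size</name>
  <value>0.4</value> <!-- all MemStores together may use up to 40% of the heap -->
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value> <!-- BlockCache may use up to 40% of the heap -->
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value> <!-- flush a region's MemStore at 128 MB -->
</property>
```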
One of the most significant clues pointing us towards a solution was when the garbage collections lasted longest: that time coincided with the periods when the largest quantities of data were being written to the cluster. Furthermore, the HBase RegionServer logs from those periods showed that files of only around 5MB were being written to disk, nowhere near our configured flush size of 128MB. This is exactly what happens when total MemStore usage is reached: HBase forcibly writes files to disk (flushing) and blocks every update. The Flush Queue Size warnings raised on the cluster during the longest garbage collections support this claim as well.
Since the MemStore was clearly too small for our cluster's write load, we decided to increase the heap to 12GB, allocating 50% of it to MemStore and 30% to BlockCache. In this way the BlockCache, which previously had sufficient space, did not change significantly in size, while the MemStore was expanded substantially.
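The tuned sizing can be sketched in `hbase-site.xml` like this, using the same standard properties with our new values (the 12GB heap itself is set via `-Xmx` in `hbase-env.sh`, not here):

```xml
<!-- hbase-site.xml: tuned in-memory sizing (sketch; values from our new setup) -->
<property>
  <name>hbase.regionserver.global.memstore.size</name>
  <value>0.5</value> <!-- MemStore raised to 50% of the (now 12 GB) heap -->
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.3</value> <!-- BlockCache at 30%; absolute size stays roughly the same -->
</property>
```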
Unfortunately, there is no universal formula that tells us how much heap to assign to a given system. Our choice of heap size is the result of numerous tests, but the general recommendation is that the heap should not exceed 16GB. Of course, the machine's capacity must be kept in mind as well.
In addition to these parameters, we also had to set some others to eliminate the problem completely. The desired GC pause stayed at the same value of 200ms, and, following Oracle's recommendations, we introduced some additional parameters that help achieve that target. A safe recommendation is the flag -XX:+ParallelRefProcEnabled, which has been reported to reduce GC pauses by as much as 30%, and that matched our experience.
Finally, the parameter -XX:InitiatingHeapOccupancyPercent sets the minimum heap occupancy at which the concurrent marking phase should be initiated. This parameter can help performance greatly, but if set improperly it can hurt performance just as badly, regardless of heap size. The default value is 45%, although there is an unofficial rule that this value should be larger than the combined share of MemStore and BlockCache. Since we did not wish to take risks with an important environment, we concluded through testing that 70% is the optimal value in our case.
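Putting the GC-related changes together, the final RegionServer JVM options can be sketched as follows in `hbase-env.sh`. Again, this is an illustration assembled from the values named in this post, not a verbatim copy of our configuration:

```shell
# hbase-env.sh -- tuned RegionServer JVM settings (sketch from the values above)
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms12g -Xmx12g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -XX:+ParallelRefProcEnabled \
  -XX:InitiatingHeapOccupancyPercent=70"
```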
What did we achieve?
After changing the HBase configuration, we have had no problems with system performance. The web application works smoothly again, and the earlier warnings no longer appear. The average GC pause is 215ms, with a standard deviation of 10ms, which is a solid result.
We are also aware that there is a much larger selection of parameters, but while searching for the optimal configuration we wanted to keep it as simple as possible, because every additional complication carries its own risk.
HBase garbage collection is a crucial but very unpredictable factor in the optimal operation of a Big Data environment. If the JVM parameters are not set properly, the consequences for the whole system can be catastrophic. In our case the catastrophe was averted and an optimal solution was found. Unfortunately, there is no universal formula leading to the best solution, because every system has different requirements and purposes, but there are guidelines that can make the road easier for you.