This article summarizes the presentation shared by Shi Xiaogang on the Flink Meetup in Beijing on August 11, 2018. Shi Xiaogang is currently engaged in Blink R&D in the Alibaba Big Data team and is responsible for the R&D of Blink state management and fault tolerance. Alibaba Blink is a real-time computing framework built based on Apache's Flink, aimed at simplifying the complexity of real-time computing on Alibaba's ecosystem.
In this article, we will cover the following content:
The result of a computing task relies not only on the input objects, but also on the current state of the data. In fact, most computing tasks involve stateful data. For example, WordCount is a variable used to calculate the count of words. The word count is an output that accumulates new input objects into the existing word count. In this case, the word count is a stateful variable.
Traditional stream computing systems lack efficient support for program states such as:
In traditional batch processing, data is partitioned, and each task processes a partition. When all partitions are executed, the outputs are aggregated as the final result. In this process, the state is not demanding.
However, stream computing has high requirements for the state because an unlimited stream is imported to the stream system, which runs for a long time, say, a couple of days or even several months, without interruption. In this case, the state data must be properly managed. Unfortunately, the traditional stream computing system does not completely support the state management. For example, Storm does not support any program state. A solution is to use Storm with HBase. The state data is stored in HBase, Storm reads the state data for calculation, and then writes the updated data into HBase again. The following problems may occur:
Flink provides rich interfaces for accessing state and efficient fault tolerance. Flink has been designed to provide rich APIs for state access and efficient fault tolerance, as shown in the following figure:
Flink has two types of states based on data partitioning and resizing modes: Keyed States and Operator States.
Use of Keyed States:
Flink also provides multiple data structure types in Keyed States.
Dynamic resizing of Keyed States:
Use of Operator States:
Operator States do not support as many data structures as Keyed States. They only support List currently.
Multiple resizing modes of Operator States:
Operator States support dynamic and flexible resizing. The following describes three resizing methods that Operator States support:
The preceding are the three resizing methods supported by Flink Operator States. You can select any of them as required.
You can enable checkpointing for your program. Flink backs up the program state at a certain interval. In case of a failure, Flink recovers all tasks to the state of the last checkpoint and restarts running the tasks from that checkpoint.
Flink supports two modes to guarantee consistency: at least once and exactly once.
Flink also provides a mechanism that allows the state to be stored in the memory. Flink restores the state during the checkpoint operation.
The running jobs must be stopped before a component is upgraded. After the component upgrade is completed, the jobs must be resumed. Flink provides two modes to resume the jobs:
The following lists three StateBackends provided by Flink for state management and fault tolerance:
You can select any of the modes as required. You can store a small amount of data in MemoryStateBackend or FsStateBackend and store a large amount of data in RocksDBStateBackend.
The following describes HeapKeyedStateBackend and RocksDBKeyedStateBackend:
The checkpoint operation is implemented based on the Chandy-Lamport algorithm.
When backing up data of each node, Flink traverses and writes all the data to external storage, which affects the backup performance. Full checkpointing has been optimized to improve the performance.
RocksDB data is updated to the memory and is written to the disk when the memory is full. By using the incremental checkpoint mechanism, the newly generated files are copied to the persistent storage, while the previously generated files do not need to be copied to the persistent storage. In this way, the amount of data to be copied is reduced, thus improving the performance.
Alibaba has supported the Flink research since 2015. In October 2015, it started the Blink project and optimized and improved Flink in large-scale production environments. In the Double 11 Shopping Festival in 2016, Alibaba used the Blink system to provide services for search, recommendation, and advertising. In May 2017, Blink became Alibaba's real-time computing engine.