
Apache Flink - A Solution for Real-Time Analytics

In today's world, data is being generated at an unprecedented rate. Every click, every tap, every swipe, every tweet, every post, every like, every share, every search, and every view generates a trail of data. Businesses are struggling to keep up with the speed and volume of this data, and traditional batch-processing systems cannot handle the scale and complexity of this data in real-time.

This is where streaming analytics comes into play, providing faster insights and more timely decision-making. Streaming analytics is particularly useful for scenarios that require quick reactions to events, such as financial fraud detection or IoT data processing. It can handle large volumes of data and provide continuous monitoring and alerts in real-time, allowing for immediate action to be taken when necessary.

Stream processing, or real-time analytics, is a method of analyzing and processing data as it is generated, rather than in batches. Popular open-source stream processing engines include Apache Flink, Apache Spark Streaming, and Apache Kafka Streams. In this blog, we will cover the fundamentals of Apache Flink and how it can be used for streaming analytics.

Introduction

Apache Flink is an open-source stream processing framework first introduced in 2014. Flink has been designed to process large amounts of data streams in real-time, and it supports both batch and stream processing. It is built on top of the Java Virtual Machine (JVM) and is written in Java and Scala.

Flink is a distributed system that can run on a cluster of machines, and it has been designed to be highly available, fault-tolerant, and scalable. It supports a wide range of data sources and provides a unified API for batch and stream processing, which makes it easy to build complex data processing applications.

Advantages of Apache Flink

Real-time analytics is the process of analyzing data as it is generated. It requires a system that can handle large volumes of data in real-time and provide insights into the data as soon as possible. Apache Flink has been designed to meet these requirements and has several advantages over other real-time data processing systems.

  1. Low Latency: Flink processes data streams in real-time, which means it can provide insights into the data almost immediately. This makes it an ideal solution for applications that require low latency, such as fraud detection and real-time recommendations.
  2. High Throughput: Flink has been designed to handle large volumes of data and can scale horizontally to handle increasing volumes of data. This makes it an ideal solution for applications that require high throughput, such as log processing and IoT applications.
  3. Flexible Windowing: Flink provides a flexible windowing API that enables the creation of complex windows for processing data streams. Windows can be based on time, count, or custom triggers, which makes it easy to build complex data processing applications (see the sketch after this list).
  4. Fault Tolerance: Flink is designed to be highly available and fault-tolerant. It can recover from failures quickly and can continue processing data even if some of the nodes in the cluster fail.
  5. Compatibility: Flink is compatible with a wide range of data sources, including Kafka, Hadoop, and Elasticsearch. This makes it easy to integrate with existing data processing systems.
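
For instance, here is a minimal Java sketch of a tumbling time window, assuming a stream named pairs of (word, count) tuples (the names and the 1-minute interval are illustrative):

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    // Sum the counts per word over 1-minute processing-time windows.
    DataStream<Tuple2<String, Integer>> windowedCounts = pairs
            .keyBy(pair -> pair.f0)
            .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
            .sum(1);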

Flink Architecture

Apache Flink processes data streams in a distributed manner. A Flink cluster consists of several nodes, each of which is responsible for processing a portion of the data. The nodes coordinate with each other over Flink's own RPC and data-exchange layer, while data typically enters and leaves the cluster through external systems such as Apache Kafka.

The Flink cluster processes data streams in parallel by dividing the data into small chunks, or partitions, and processing them independently. Each partition is processed by a task, which is a unit of work that runs on a node in the cluster.

Flink provides several APIs for building data processing applications, including the DataStream API, the DataSet API, and the Table API. The diagram below illustrates what a Flink cluster looks like.

Apache Flink Cluster
  • A Flink application runs on a cluster.
  • A Flink cluster has a job manager and one or more task managers.
  • The job manager is responsible for the allocation and management of computing resources.
  • Task managers are responsible for the execution of a job.

Flink Job Execution

  1. The client submits the job graph to the job manager
  • A client prepares and sends a dataflow/job graph to the job manager.
  • The client can be your Java/Scala/Python Flink application or the CLI.
  • The client is not part of the runtime or program execution.
  • After submitting the job, the client can either disconnect (detached mode) or stay connected to receive progress reports (attached mode).

Given below is an illustration of what the job graph generated from the code looks like:

Job Graph
  2. The job manager converts the job graph into an execution graph
  • The execution graph is a parallelized version of the job graph.
  • For each job vertex, it contains one execution vertex per parallel subtask.
  • An operator with a parallelism of 100 will have one job vertex and 100 execution vertices.

Given below is an illustration of what an execution graph looks like:

Execution Graph


  3. The job manager submits the parallel instances of the execution graph to the task managers
  • Execution resources in Flink are defined through task slots.
  • Each task manager has one or more task slots, each of which can run one pipeline of parallel tasks.
  • A pipeline consists of multiple successive tasks.

Parallel instances of execution graph being submitted to task slots

Flink Program

Flink programs look like regular programs that transform DataStreams. Each program consists of the same basic parts:

  • Obtain an execution environment 

The execution environment is the context in which a program is executed. This is how the execution environment is set up in Flink code:

CODE: https://gist.github.com/velotiotech/c64a785c30c0ba4ce8d4f653c139f946.js
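
For reference, a minimal Java sketch of this step (using the DataStream API) looks like this:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FlinkJob {
        public static void main(String[] args) throws Exception {
            // Returns a local environment when run from the IDE and the
            // cluster environment when the job is submitted to a cluster.
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        }
    }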

  • Connect to a data source

We can use the execution environment to connect to a data source, which can be a file system, a streaming system such as Kafka, or an in-memory collection. This is how we can connect to a data source in Flink:

CODE: https://gist.github.com/velotiotech/70cbeccec31982a1ca36ffc7dd4357a9.js
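
As an illustration, a sketch that reads lines of text from a socket, one of Flink's built-in sources (the host and port are placeholders):

    import org.apache.flink.streaming.api.datastream.DataStream;

    // Read a text stream from a socket. Flink also ships connectors for
    // Kafka, Kinesis, and files, plus env.fromElements(...) for collections.
    DataStream<String> lines = env.socketTextStream("localhost", 9999);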

  • Perform Transformations

We can perform transformations on the events/data that we receive from the data sources.
A few of the data transformation operations are map, filter, keyBy, flatMap, etc.

  • Specify where to send the data

Once we have performed the transformations/analytics on the data flowing through the stream, we can specify where to send the results.
The destination can be a file system, a database, or another data stream.

CODE: https://gist.github.com/velotiotech/a83b5dec258d060041249d5f328ef16f.js
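
For example, assuming a transformed stream named result of type DataStream&lt;String&gt;, a sketch of wiring up a sink (the output path is a placeholder):

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.connector.file.sink.FileSink;
    import org.apache.flink.core.fs.Path;

    // Simplest sink: print every record to stdout (handy for debugging).
    result.print();

    // Or write records as rolling files into a directory.
    FileSink<String> sink = FileSink
            .forRowFormat(new Path("/tmp/flink-output"), new SimpleStringEncoder<String>("UTF-8"))
            .build();
    result.sinkTo(sink);

    // Nothing runs until the job is actually launched:
    env.execute("My Flink Job");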

Flink Transformations

  1. Map: Takes one element at a time from the stream, performs a transformation on it, and emits exactly one element (of possibly different type) as output.

Given below is an example of Flink’s map operator:

CODE: https://gist.github.com/velotiotech/1dd2cdf8ab283f26f76c70a3f18c571b.js
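
For instance, continuing with the lines stream from above, a map that turns each line into its length:

    // Map: one input element -> exactly one output element (type may change).
    DataStream<Integer> lengths = lines.map(line -> line.length());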

  2. Filter: Evaluates a boolean function for each element and retains those for which the function returns true.

Given below is an example of Flink’s filter operator:

CODE: https://gist.github.com/velotiotech/3ba200fae3c59410bc279aa18cd7ccef.js
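
For instance, keeping only the lines that contain the word "ERROR" (the predicate is illustrative):

    // Filter: keep only the elements for which the predicate returns true.
    DataStream<String> errors = lines.filter(line -> line.contains("ERROR"));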

  3. Reduce: A “rolling” reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value.

Given below is an example of Flink’s reduce operator:

CODE: https://gist.github.com/velotiotech/93a2da4a0b99f839b15c85c139437d72.js
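
As a sketch, assuming a stream named pairs of (word, count) tuples, a rolling sum per word:

    import org.apache.flink.api.java.tuple.Tuple2;

    // Reduce runs on a keyed stream: it combines the current element with
    // the last reduced value for that key and emits the new value.
    // e.g., input ("a",1) ("b",1) ("a",1)  ->  output ("a",1) ("b",1) ("a",2)
    DataStream<Tuple2<String, Integer>> counts = pairs
            .keyBy(pair -> pair.f0)
            .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));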


  4. KeyBy:
  • Logically partitions a stream into disjoint partitions. 
  • All records with the same key are assigned to the same partition. 
  • Internally, keyBy() is implemented with hash partitioning.

The figure below illustrates how the keyBy operator works in Flink.
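
A sketch, again assuming the pairs stream of (word, count) tuples:

    import org.apache.flink.streaming.api.datastream.KeyedStream;

    // keyBy: hash-partitions the stream by key, so all records with the
    // same key go to the same parallel subtask (and share keyed state).
    KeyedStream<Tuple2<String, Integer>, String> byWord = pairs.keyBy(pair -> pair.f0);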

Fault Tolerance

  • Flink combines stream replay and checkpointing to achieve fault tolerance.
  • A checkpoint marks a specific point in each input stream along with the corresponding state of each operator.
  • Whenever checkpointing is done, a snapshot of the state of all the operators is saved in the state backend, which is generally the job manager’s memory or configurable durable storage.
  • Whenever there is a failure, operators are reset to the most recent state in the state backend, and event processing is resumed.

Checkpointing

  • Checkpointing is implemented using stream barriers.
  • Barriers are injected into the data stream at the sources, e.g., Kafka, Kinesis, etc.
  • Barriers flow with the records as part of the data stream.
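
Checkpointing is disabled by default; a minimal sketch of turning it on (the interval shown is illustrative):

    import org.apache.flink.streaming.api.CheckpointingMode;

    // Take a checkpoint every 10 seconds with exactly-once guarantees.
    env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);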

Refer to the diagram below to understand how checkpoint barriers flow with the events:

Checkpoint Barriers



Saving Snapshots
  • Operators snapshot their state at the point in time when they have received all snapshot barriers from their input streams, and before emitting the barriers to their output streams.
  • Once a sink operator (the end of a streaming DAG) has received barrier n from all of its input streams, it acknowledges snapshot n to the checkpoint coordinator.
  • After all sinks have acknowledged a snapshot, it is considered completed.

The diagram below illustrates how checkpointing is achieved in Flink with the help of barrier events, state backends, and the checkpoint table.

Checkpointing



Recovery

  • Upon failure, Flink selects the latest completed checkpoint.
  • The system then re-deploys the entire distributed dataflow.
  • Each operator is given the state that was snapshotted as part of the checkpoint.
  • The sources are set to start reading the stream from the position recorded in the snapshot.
  • For example, in Apache Kafka, that means telling the consumer to start fetching messages from the offset recorded in the snapshot.

Scalability  

A Flink job can be scaled up and scaled down as per requirement.

This can be done manually by:

  1. Triggering a savepoint (a manually triggered checkpoint)
  2. Adding/removing nodes to/from the cluster
  3. Restarting the job from the savepoint (see the sketch below)
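
With the Flink CLI, these manual steps might look like the following (the job ID, savepoint path, parallelism, and jar name are placeholders):

    # 1. Take a savepoint of the running job
    flink savepoint <jobId> s3://my-bucket/savepoints

    # 2. Stop the job and add/remove task managers as needed

    # 3. Resume from the savepoint, optionally with a new parallelism
    flink run -s s3://my-bucket/savepoints/savepoint-xxxx -p 8 my-flink-job.jar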

OR 

Automatically by Reactive Scaling

  • In Reactive Mode, a job is configured so that it always uses all the resources available in the cluster.
  • Adding a task manager scales your job up, and removing resources scales it down.
  • Reactive Mode restarts the job on a rescaling event, restoring it from the latest completed checkpoint.
  • The only downside is that it works only in standalone mode.
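
Reactive Mode is enabled through the scheduler configuration, e.g., in flink-conf.yaml:

    # flink-conf.yaml
    scheduler-mode: reactive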

Alternatives  

  • Spark Streaming: An extension of the Apache Spark engine that processes streams as a series of micro-batches. Flink, by contrast, processes events one at a time, which makes it better suited for low-latency processing and more complex streaming scenarios.
  • Apache Storm: It is another open-source stream processing system that has a steeper learning curve than Flink and uses a different architecture based on spouts and bolts.
  • Apache Kafka Streams: It is a lightweight stream processing library built on top of Kafka, but it is not as feature-rich as Flink or Spark, and is better suited for simpler stream processing tasks.

Conclusion  

In conclusion, Apache Flink is a powerful solution for real-time analytics. With its ability to process data in real-time and support for streaming data sources, it enables businesses to make data-driven decisions with minimal delay. The Flink ecosystem also provides a variety of tools and libraries that make it easy for developers to build scalable and fault-tolerant data processing pipelines.

One of the key advantages of Apache Flink is its support for event-time processing, which allows it to handle delayed or out-of-order data in a way that accurately reflects the sequence of events. This makes it particularly useful for use cases such as fraud detection, where timely and accurate data processing is critical.

Additionally, Flink's support for multiple programming languages, including Java, Scala, and Python, makes it accessible to a broad range of developers. And with its seamless integration with popular big data platforms like Hadoop and Apache Kafka, it is easy to incorporate Flink into existing data infrastructure.

In summary, Apache Flink is a powerful and flexible solution for real-time analytics, capable of handling a wide range of use cases and delivering timely insights that drive business value.
