Building a Real-Time Data Streaming Pipeline using Apache Kafka, Flink and Postgres

Kavit (zenwraight)
3 min read · Nov 25, 2023

For folks who prefer to follow along with a video, have a look below:

In the era of big data, organizations are increasingly turning to real-time data processing to gain instant insights and make data-driven decisions. In this comprehensive guide, we’ll walk through the process of building a robust real-time data streaming pipeline using three powerful technologies: Apache Kafka, Apache Flink, and Postgres.

Why Real-Time Data Streaming?

Real-time data streaming enables organizations to process and analyze data as it is generated, providing instantaneous insights and facilitating quicker decision-making. Whether you’re in finance, e-commerce, or IoT, a well-designed real-time data pipeline is essential for staying competitive in today’s fast-paced business landscape.

Prerequisites

Before we dive into the implementation, make sure you have a basic understanding of the following technologies:

  • Apache Kafka: An open-source stream processing platform that facilitates the building of real-time data pipelines and streaming applications.
  • Apache Flink: A distributed stream processing framework that enables powerful analytics and event-driven applications.
  • Postgres: A powerful, open-source relational database management system.

Step 1: Designing Postgres Schema

We will start by creating a SQL file for the weather table, so that when the container comes up it executes this file and creates the table inside the Postgres container.
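The original SQL file isn't reproduced here, but a minimal sketch could look like the following; the table and column names (weather, city, average_temperature) are assumptions based on the aggregation the Flink job performs later:

```sql
-- weather_table.sql: the table the Flink job will eventually write its averages into
CREATE TABLE IF NOT EXISTS weather (
    id                  SERIAL PRIMARY KEY,
    city                VARCHAR(255) NOT NULL,
    average_temperature DOUBLE PRECISION NOT NULL
);
```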

Step 2: Create Dockerfile for Postgres container

The path to this Dockerfile will be referenced in the docker-compose.yml file.

Any SQL file placed inside docker-entrypoint-initdb.d is executed automatically when the container initializes its data directory for the first time.
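A minimal sketch of such a Dockerfile, assuming the SQL file from Step 1 is named weather_table.sql and sits next to the Dockerfile:

```dockerfile
FROM postgres:15
# Files in this directory run automatically on the container's first initialization
COPY weather_table.sql /docker-entrypoint-initdb.d/
```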

Step 3: Setting up Kafka Python Producer

We will create a new folder named kafka-producer and add a Python script that runs indefinitely inside the Docker container.

The script periodically generates a (city, temperature) pair and produces it to the Kafka topic weather.
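A sketch of what such a producer could look like, assuming the kafka-python client, a broker reachable at kafka:9092, and a hypothetical list of cities:

```python
# producer.py - hypothetical sketch of the infinite producer loop
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",  # broker address is an assumption
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

CITIES = ["New York", "London", "Tokyo", "Mumbai", "Sydney"]

while True:
    # Generate a random (city, temperature) pair and send it to the "weather" topic
    message = {
        "city": random.choice(CITIES),
        "temperature": round(random.uniform(10.0, 40.0), 2),
    }
    producer.send("weather", message)
    producer.flush()
    time.sleep(10)  # produce a new reading every 10 seconds
```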

Step 4: Adding Dependencies and wait-for-it.sh

As a precaution, I like to add a file called wait-for-it.sh, which waits for the Kafka broker (and ZooKeeper) to come up before we start producing messages to the topic. This helps us avoid unwanted exceptions at runtime.

Along with the wait-for-it.sh file, we will also include a requirements.txt file listing the dependencies required by our Kafka Python producer.
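Assuming the producer above uses the kafka-python client, requirements.txt can stay very small (the pinned version is only an example):

```text
kafka-python==2.0.2
```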

Step 5: Dockerize Kafka Python Producer

Let's add a Dockerfile for our Kafka Python producer.
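A sketch of that Dockerfile, assuming the script is named producer.py and the broker is reachable at kafka:9092; wait-for-it.sh blocks until the broker port is open before the producer starts:

```dockerfile
FROM python:3.9-slim
WORKDIR /app

COPY requirements.txt wait-for-it.sh producer.py ./
RUN pip install --no-cache-dir -r requirements.txt \
    && chmod +x wait-for-it.sh

# Block until the Kafka broker accepts connections, then start the producer
CMD ["./wait-for-it.sh", "kafka:9092", "--", "python", "producer.py"]
```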

Step 6: Setting up Apache Flink Consumer

Now that we have our Python producer set up and periodically generating (city, temperature) messages, we will set up our Flink consumer, which consumes the messages and computes the average temperature over 1-minute windows.

We will need four main files for our use case.

  1. pom.xml - contains all the dependencies required by our Main.java file.
  2. Main.java - contains the main logic: consuming messages, deserializing each message into a Weather object, aggregating the average temperature over 1-minute windows, and sinking the result to the Postgres DB.
  3. Weather.java - the Weather class we use to create Weather objects from each message we consume.
  4. WeatherDeserializationSchema.java - contains the logic to deserialize a raw byte Kafka message into a Weather class instance.

pom.xml
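The full pom.xml isn't reproduced here; the key dependencies would look roughly like the following (versions are assumptions and should be aligned with your Flink installation):

```xml
<dependencies>
    <!-- Flink streaming runtime -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java</artifactId>
        <version>1.16.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients</artifactId>
        <version>1.16.0</version>
    </dependency>
    <!-- Kafka source and JDBC sink connectors -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka</artifactId>
        <version>1.16.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-jdbc</artifactId>
        <version>1.16.0</version>
    </dependency>
    <!-- Postgres JDBC driver and JSON deserialization -->
    <dependency>
        <groupId>org.postgresql</groupId>
        <artifactId>postgresql</artifactId>
        <version>42.5.1</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.14.2</version>
    </dependency>
</dependencies>
```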

Main.java
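A condensed sketch of the job is below. The topic, table, and connection details (kafka:9092, jdbc:postgresql://postgres:5432/postgres, the postgres credentials) are assumptions, and AverageAggregator is a small hypothetical helper defined inline to compute the per-city average:

```java
// Main.java - hypothetical sketch: Kafka source -> 1-minute window average -> JDBC sink
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class Main {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: consume JSON weather messages from the "weather" topic
        KafkaSource<Weather> source = KafkaSource.<Weather>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("weather")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new WeatherDeserializationSchema())
                .build();

        DataStream<Weather> weather =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-weather");

        // Average temperature per city over 1-minute tumbling windows
        DataStream<Weather> averaged = weather
                .keyBy(w -> w.city)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .aggregate(new AverageAggregator());

        // Sink: write one row per (city, window) into Postgres
        averaged.addSink(JdbcSink.sink(
                "INSERT INTO weather (city, average_temperature) VALUES (?, ?)",
                (statement, w) -> {
                    statement.setString(1, w.city);
                    statement.setDouble(2, w.temperature);
                },
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl("jdbc:postgresql://postgres:5432/postgres")
                        .withDriverName("org.postgresql.Driver")
                        .withUsername("postgres")
                        .withPassword("postgres")
                        .build()));

        env.execute("Weather average pipeline");
    }

    /** Accumulates (city, sum, count) and emits the per-window average as a Weather record. */
    public static class AverageAggregator
            implements AggregateFunction<Weather, Tuple3<String, Double, Integer>, Weather> {

        @Override
        public Tuple3<String, Double, Integer> createAccumulator() {
            return Tuple3.of("", 0.0, 0);
        }

        @Override
        public Tuple3<String, Double, Integer> add(Weather value, Tuple3<String, Double, Integer> acc) {
            return Tuple3.of(value.city, acc.f1 + value.temperature, acc.f2 + 1);
        }

        @Override
        public Weather getResult(Tuple3<String, Double, Integer> acc) {
            return new Weather(acc.f0, acc.f1 / acc.f2);
        }

        @Override
        public Tuple3<String, Double, Integer> merge(Tuple3<String, Double, Integer> a,
                                                     Tuple3<String, Double, Integer> b) {
            return Tuple3.of(a.f0, a.f1 + b.f1, a.f2 + b.f2);
        }
    }
}
```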

Weather.java
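A plain POJO works here; the field names mirror the JSON keys the producer emits (again an assumption), and the no-argument constructor is needed so Flink and Jackson can instantiate it:

```java
// Weather.java - simple POJO matching the producer's JSON payload
public class Weather {
    public String city;
    public double temperature;

    public Weather() {
        // no-arg constructor required by Flink's POJO rules and Jackson
    }

    public Weather(String city, double temperature) {
        this.city = city;
        this.temperature = temperature;
    }

    @Override
    public String toString() {
        return "Weather{city='" + city + "', temperature=" + temperature + "}";
    }
}
```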

WeatherDeserializationSchema.java
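A sketch of the deserializer, following the common pattern of extending Flink's AbstractDeserializationSchema and using a Jackson ObjectMapper:

```java
// WeatherDeserializationSchema.java - converts raw Kafka bytes into Weather objects
import java.io.IOException;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;

public class WeatherDeserializationSchema extends AbstractDeserializationSchema<Weather> {

    private transient ObjectMapper mapper;

    @Override
    public void open(InitializationContext context) {
        // Created per task instance, so the mapper itself is never serialized with the job graph
        mapper = new ObjectMapper();
    }

    @Override
    public Weather deserialize(byte[] message) throws IOException {
        return mapper.readValue(message, Weather.class);
    }
}
```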

Conclusion

Today we looked at how to build a real-time data streaming pipeline using Apache Kafka, Apache Flink, and Postgres.

Follow me for updates like this.

Connect with me on:

Twitter 👦🏻:- https://twitter.com/kmmtmm92

Youtube 📹:- https://www.youtube.com/channel/UCpmw7QtwoHXV05D3NUWZ3oQ

Github 💭:- https://github.com/Kavit900

Instagram 📸:- https://www.instagram.com/code_with_kavit/
