All In One Custom Docker Image for Streaming/Real Time Data Preprocessing Developer on Apache Samza with Kafka

George Jen
8 min readNov 19, 2021

George Jen, Jen Tek LLC

We have built a custom docker image that includes everything a data engineering developer would need:

CentOS 8 image
Python3
Java Development Toolkit 1.8
Jupyter-notebook server to run Python and Scala Spylon Kernel from the host
ssh-server for the ease of connecting to the container using ssh and scp, as oppose using docker exec and docker cp
Apache Spark
Graphframes for Apache Spark for Graph computing application with Python
Hadoop
Hive

Some utilities for troubleshooting such as telnet

Additionally, we want to add an ETL/ELT node using Apache Airflow into the big data cluster. We have built another custom docker image that includes the following:

CentOS 8 image
Python3
Python libraries including Numpy, Pandas, and AWS Boto3
AWS Cli
Java Development Toolkit 1.8
Developer tools including C/C++ compiler
RUST compiler
Jupyter-notebook server to run Python from the host
ssh-server for the ease of connecting to the container using ssh and scp, as oppose using docker exec and docker cp
Apache Spark for composing Spark application such as streaming and data preprocessing for Airflow in Python, Java and Scala
Graphframes for Graph computing application with Python
Apache Airflow, the ETL tool
Open Source Minio S3 server to have local S3 buckets.

All containers have root user (password root) and user hadoop (password 123456) that owns Spark, Hadoop, Hive, Airflow and Jupyter-notebook servers, etc.

For implementation detail on the above custom docker images with which, we have built the big data cluster containing one Spark master node, many Spark worker nodes, one Hadoop/Hive node, one nginx load balancer node and Airflow ETL/ELT node, please see my writing here at medium.com.

Now, we want to add a node into our big data cluster. That node is to handle streaming and process the streaming data in real time. Hence we have built a new custom docker image that includes the following:

CentOS 8 image
Python3
Java Development Toolkit 1.8
Maven build tool for Java
Sbt build tool for Scala
Jupyter-notebook server to run Python and Scala from the host
ssh-server for the ease of connecting to the container using ssh and scp, as oppose using docker exec and docker cp
Apache Samza
Apache Samza example application called Hello Samza with Kafka, Yarn and Zookeeper

Note, Hello Samza application bundles Samza with ZooKeeper, Kafka, and YARN. The application launches a job that consumes a feed of real-time edits from Wikipedia, and produces them to a Kafka topic, subsequently launches job parses the messages from the Kafka producer, and extracts information about the size of the wiki edit, who made the change, etc.

Despite the example Samza application bearing the name called “Hello Samza”, it is not trivial and please do not relate it to any code that merely prints “Hello World”.

Here is the detail of Hello Semza example application.

We intend to start Samza/Kafka streaming node along with Airflow ETL/ELT node, Spark master node, Spark worker nodes, Hadoop/Hive node and nginx load balancer node as one cluster. You can however, edit docker-compose.yml to selectively launch container node(s) you desire.

Based upon the above requirements, we have built three docker images, jentekllc/bigdata:latest, jentekllc/etl:latest and jentekllc/samza:latest

Here is the docker image jentekllc/bigdata:latest, here is the docker image jentekllc/etl:latest, here is docker image jentekllc/samza:latest and here is the tarball consisting support files including docker-compose.yml.

System Requirement.

It requires at least 8GB memory available to the docker daemon.

If using docker desktop with Mac and Windows, memory resource to docker defaults to 2GB. You need to set it to 8GB or more.

For Docker desktop with Mac, click docker icon on the top, then click Preferences in the dropdown list, then click Resources on the left, move the memory bar scale to 8GB or greater from 2GB default.

For Docker desktop with Windows, follow documentation on increasing memory capacity allocation to docker.

Load docker images.

Download the files from the links above or below:

bigdata.tgz (3.08GB)
etl.tgz (2.53GB)
docker_samza.tar.gz (6.27GB)
additional_files_20211118.tar (230KB)

Create a folder and place the 3 downloaded files into the new folder. Change directory into the folder.

Run below command:

$ docker load < bigdata.tgz
$ docker load < etl.tgz
$ docker load < docker_samza.tar.gz

It will take while for the docker loads to complete. Then run below command to confirm:

$ docker image ls

REPOSITORY TAG IMAGE ID CREATED SIZE
jentekllc/etl latest c3b44db4c5d6 30 hours ago 6.61GB
jentekllc/bigdata latest c7d3fb6d8221 31 hours ago 5.53GB
jentekllc/samza latest e9086c0ddaab 2 days ago 12.3GB

Expand the additional_files_20211118.tar by

$ tar -xf additional_files.tar

Six files will be extracted:

core-site.xml
hive-site.xml
nginx.conf
start_etl.sh
start_s3.sh
docker-compose.yml

Startup the docker cluster.

In this demo, I will set up the following cluster:

One Apache Airflow ETL/ELT node
One Apache Kafka and Samza Streaming/Real Time Processing node
One Spark Master node
Three Spark Worker nodes
One Hadoop/Hive node
One Nginx Load Balancer node

To start them together, simply run:

$ nohup docker-compose -p j up --scale spark-worker=3 &

To confirm these containers have been started, run below command:

$ docker psCONTAINER ID   IMAGE                      COMMAND                  CREATED             STATUS             PORTS                                                                                                                                                     NAMES
034cde65b534 nginx:latest "/docker-entrypoint.…" About an hour ago Up About an hour 80/tcp, 0.0.0.0:5000->5000/tcp nginx-lb
26938b83703c jentekllc/bigdata:latest "/run_sshd_worker.sh" About an hour ago Up About an hour 22/tcp, 0.0.0.0:50864->38080/tcp j_spark-worker_3
88b9ada1ab4c jentekllc/bigdata:latest "/run_sshd_worker.sh" About an hour ago Up About an hour 22/tcp, 0.0.0.0:50863->38080/tcp j_spark-worker_1
2f18f6b5f9a9 jentekllc/bigdata:latest "/run_sshd_worker.sh" About an hour ago Up About an hour 22/tcp, 0.0.0.0:50862->38080/tcp j_spark-worker_2
e497e89b4e67 jentekllc/samza:latest "/home/hadoop/start_…" About an hour ago Up About an hour 0.0.0.0:50022->22/tcp, 0.0.0.0:58088->8088/tcp, 0.0.0.0:58888->8888/tcp, 0.0.0.0:58889->8889/tcp samza-server
23e310732fe2 jentekllc/etl:latest "/start_etl.sh" About an hour ago Up About an hour 0.0.0.0:40022->22/tcp, 0.0.0.0:48080->8080/tcp, 0.0.0.0:48888->8888/tcp, 0.0.0.0:48889->8889/tcp, 0.0.0.0:49000->9000/tcp, 0.0.0.0:40000->30000/tcp etl-server
6d3c81c0d6ff jentekllc/bigdata:latest "/run_sshd_master.sh" About an hour ago Up About an hour 0.0.0.0:4040->4040/tcp, 0.0.0.0:7077->7077/tcp, 0.0.0.0:8080->8080/tcp, 0.0.0.0:8088->8088/tcp, 0.0.0.0:8888-8889->8888-8889/tcp, 0.0.0.0:20022->22/tcp spark-master
73ce7dd06186 jentekllc/bigdata:latest "/run_sshd_hive.sh" About an hour ago Up About an hour 0.0.0.0:30022->22/tcp, 0.0.0.0:38088->8088/tcp, 0.0.0.0:39000->9000/tcp, 0.0.0.0:39083->9083/tcp

To stop the cluster, run:

$ docker-compose -p j down

Container ports exposed to the host, as defined in docker-compose.yml.

For container spark-master

Spark UI port 8080 port is open to the host
jupyter-notebook server port 8888 and 8889 are open to the host
ssh server port 22 is opened to the host as port 20022 because port 22 is used on the host

For container Hadoop/Hive

ssh server port 22 is opened to the host as port 30022 because port 22 is used on the host

For container nginx-lb, the load balancer

nginx port is 5000 is opened to the host.

For Apache Airflow

ssh server port 22 is opened to the host as port 50022 because port 22 is used on the host.
Apache Airflow web server port 8080 is opened as port 48080 because port 8080 is used by the host.
Jupyter notebook server port 8888, 8889 are opened as port 48888, 48889 on the host.
Minio server port 9000 is opened as port 49000 on the host
Minio server console port 30000 is opened as port 40000 on the host

For Apache Samza

ssh server port 22 is opened to the host as port 50022 because port 22 is used on the host.
YARN port 8088 is opened as port 58088 because 8088 is used on the host.
Jupyter notebook server port 8888, 8889 are opened as port 58888, 58889 on the host.

Access examples from the host.

Based upon the ports exposed to the host, from the host, Spark master node can be accessed by

http://localhost:8080

Because spark-worker is scaled to 3 nodes, each with unique hostname and IP address, need to access via nginx port at 5000. nginx will show each of the worker nodes in round robin fashion when the web page is refreshed.

http://localhost:5000

Jupyter-notebook server on Spark master node can be accessed from the host by (No password needed)

http://localhost:8888

Jupyter-notebook server on Airflow ETL node can be accessed from the host by

http://localhost:48888

Apache Airflow web server can be accessed from the host by

http://localhost:48080

Note: user name for Airflow web server is admin, password is 123456.

After Signing in:

Minio server can be accessed from the host by

http://localhost:49000

User name for Minio server is minioadmin, password is minioadmin.

After login

Samza jobs are managed by YARN, which can be accessed by

http://localhost:58088

Jupyter-notebook server on Apache Samza node can be accessed at for quick coding with Python and Scala

http://localhost:58888

You can also ssh into spark-master node to submit your Spark applications after you have uploaded your Spark application files by scp to the spark-master node or simply run ad-hoc SQL statements on Spark SQL.

#From host
$ ssh -p 20022 hadoop@localhost
#Inside container
$ cd $SPARK_HOME
$ bin/spark-submit /spark/examples/src/main/python/pi.py
$ spark-sql
spark-sql
2021-11-06 23:26:42,842 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java class es where applicable
Setting default log level to "WARN".
Spark master: local[*], Application Id: local-1636241205417
spark-sql>

You can ssh into Apache Airflow ETL node to test your bash scripts for the DAGs.

$ ssh -p 40022 hadoop@localhost

You can also log into Samza node by

$ ssh -p 50022 hadoop@localhost

Password for user hadoop is 123456, password for root is root

Inside the Samza container, you would see Java process Kafka, ZooKeeper, Yarn and Samza jobs are running

$ jps144 ResourceManager
1233 ClusterBasedJobCoordinatorRunner
1729 LocalContainerRunner
1848 LocalContainerRunner
1689 LocalContainerRunner
586 Kafka
1946 Jps
426 NodeManager
76 QuorumPeerMain
1324 ClusterBasedJobCoordinatorRunner
1421 ClusterBasedJobCoordinatorRunner

There are two folders under /home/hadoop:

samza: Samza local build, a git workspace
hello-samza: Example application, a git workspace

$ file * | grep directoryhello-samza:        directory
samza: directory

About us

We containerize open source projects. Our containers usually include the associated development environment.

The docker images referenced in this writing are for educational purpose only. There is no warranty on these docker images and their associated files.

You should change passwords for all services in all containers.

Thank you for reading.

Subscription

I am a frequent content contributor here, you can subscribe my writings on data engineering and other computer science subjects.

--

--

George Jen

I am founder of Jen Tek LLC, a startup company in East Bay California developing AI powered, cloud based documentation/publishing software as a service.