All-in-one custom Docker image for ETL pipeline/data preprocessing developers on Apache Airflow

Nov 7, 2021

George Jen, Jen Tek LLC

Introduction

We have built a custom docker image that includes everything a data engineering developer would need:

CentOS 8 image
Python3
Java Development Toolkit 1.8
Jupyter-notebook server to run Python from the host
ssh-server for the ease of connecting to the container using ssh and scp, as opposed to using docker exec and docker cp
Apache Spark
Graphframes for Apache Spark for Graph computing application with Python
Hadoop
Hive
Some utilities for troubleshooting such as telnet

Additionally, the container has a root user (password root) and a user hadoop (password 123456) that owns Spark, Hadoop and Hive.

For implementation details on the above custom docker image, with which we have built the big data cluster containing one Spark master node, multiple Spark worker nodes, one Hadoop/Hive node and one nginx load balancer node, please see my earlier writing here.

About the Airflow ETL container node

Now, we want to add an ETL node using Apache Airflow into the big data cluster. Therefore, we need another custom docker image that includes the following:

CentOS 8 image
Python3
Python libraries including Numpy, Pandas, and AWS Boto3
AWS CLI
Java Development Toolkit 1.8
Development tools including C/C++ compiler
Jupyter-notebook server to run Python from the host
ssh-server for the ease of connecting to the container using ssh and scp, as opposed to using docker exec and docker cp
Apache Spark for composing Spark applications, for example streaming and data preprocessing with Airflow, in Python, Java and Scala
Graphframes for graph computing applications with Python (a short sketch follows this list)
Apache Airflow, the ETL tool
Open source Minio S3-compatible server for local S3-like buckets
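As a quick illustration of what the Graphframes piece is for, here is a minimal PySpark sketch. The vertex and edge data are made up for illustration, and it assumes the graphframes package bundled in the image is visible to Spark:

# Minimal GraphFrames sketch (illustrative data; assumes the graphframes
# package shipped in the image is on the Spark classpath).
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graphframes-demo").getOrCreate()

vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()                                        # in-degree of each vertex
g.pageRank(resetProbability=0.15, maxIter=5).vertices.show()  # PageRank scores

spark.stop()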

Each of the docker containers has its own IP address and hostname. It has the same look and feel as if it were a VM.

We intend to use docker-compose to start the Airflow ETL node along with Spark master node, Spark worker nodes, Hadoop/Hive node and nginx load balancer node, as one cluster.

Based upon the above requirements, we have built two docker images: jentekllc/bigdata:latest and jentekllc/etl:latest.

Here is the docker image jentekllc/bigdata:latest, here is the docker image jentekllc/etl:latest, and here is the tarball consisting of support files, including docker-compose.yml.

System requirements.

It requires at least 8GB memory available to the docker daemon.

If using Docker desktop on Mac or Windows, the memory available to docker defaults to 2GB. You need to set it to 8GB or more.

For Docker desktop on Mac, click the docker icon at the top, click Preferences in the dropdown list, then click Resources on the left and move the memory slider from the 2GB default to 8GB or greater.

For Docker desktop on Windows, follow the documentation on increasing the memory allocation for docker.

Load docker images.

Download the files from the links above or below:

bigdata.tgz (3.06GB)
etl.tgz (2.38GB)
additional_files.tar (231KB)

Create a folder and place the 3 downloaded files into the new folder. Change directory into the folder.

Run below command:

$ docker load < bigdata.tgz
$ docker load < etl.tgz

It might take a while for the docker load commands to complete. Then run the command below to confirm:

$ docker image ls
REPOSITORY          TAG      IMAGE ID       CREATED        SIZE
jentekllc/bigdata   latest   213ff7954dd2   4 hours ago    5.48GB
jentekllc/etl       latest   35bfbb518ca6   25 hours ago   6.29GB

Expand the tar file additional_files.tar by running:

$ tar -xvf additional_files.tar

The following files will be created:

core-site.xml
hive-site.xml
nginx.conf
start_etl.sh
start_s3.sh
docker-compose.yml

Start up the docker cluster.

In this demo, I will set up the following cluster:

One Apache Airflow ETL node
One Spark master node
Three Spark worker nodes
One Hadoop/Hive node
One nginx load balancer node

To start them, simply run:

$ nohup docker-compose -p j up --scale spark-worker=3 &

To confirm these containers have been started, run the command below:

$ docker ps
CONTAINER ID   IMAGE                      COMMAND                  CREATED       STATUS       PORTS                                                                                                                                                 NAMES
e9f774390757   nginx:latest               "/docker-entrypoint.…"   4 hours ago   Up 4 hours   80/tcp, 0.0.0.0:5000->5000/tcp                                                                                                                        nginx-lb
fc2347cd7cf7   jentekllc/bigdata:latest   "/run_sshd_worker.sh"    4 hours ago   Up 4 hours   22/tcp, 0.0.0.0:50847->38080/tcp                                                                                                                      j-spark-worker-1
472295e76f78   jentekllc/bigdata:latest   "/run_sshd_worker.sh"    4 hours ago   Up 4 hours   22/tcp, 0.0.0.0:50848->38080/tcp                                                                                                                      j-spark-worker-3
f2f73912f821   jentekllc/bigdata:latest   "/run_sshd_worker.sh"    4 hours ago   Up 4 hours   22/tcp, 0.0.0.0:50849->38080/tcp                                                                                                                      j-spark-worker-2
05027c6b5a19   jentekllc/bigdata:latest   "/run_sshd_master.sh"    4 hours ago   Up 4 hours   0.0.0.0:7077->7077/tcp, 0.0.0.0:8080->8080/tcp, 0.0.0.0:8888-8889->8888-8889/tcp, 0.0.0.0:20022->22/tcp                                              spark-master
9005a08d2d9e   jentekllc/bigdata:latest   "/run_sshd_hive.sh"      4 hours ago   Up 4 hours   0.0.0.0:9000->9000/tcp, 0.0.0.0:9083->9083/tcp, 0.0.0.0:30022->22/tcp                                                                                hadoop-hive
e2911cae487d   jentekllc/etl:latest       "/start_etl.sh"          4 hours ago   Up 4 hours   0.0.0.0:30000->30000/tcp, 0.0.0.0:40022->22/tcp, 0.0.0.0:18080->8080/tcp, 0.0.0.0:18888->8888/tcp, 0.0.0.0:18889->8889/tcp, 0.0.0.0:19000->9000/tcp   etl-server

To stop the cluster, run:

$ docker-compose -p j down

Ports exposed to the host.

The following container ports are exposed to the host, as defined in docker-compose.yml (a quick reachability check follows these lists).

For container spark-master

Spark UI port 8080 is opened to the host
Jupyter-notebook server ports 8888 and 8889 are opened to the host
ssh server port 22 is opened to the host as port 20022 because port 22 is used by the host

For container Hadoop-Hive:

ssh server port 22 is opened to the host as port 30022 because port 22 and port 20022 are used by the host

For container nginx-lb, the load balancer,

nginx port 5000 is opened to the host.

For container Apache Airflow ETL:

ssh server port 22 is opened to the host as port 40022.
Apache Airflow web server port 8080 is opened to the host as port 18080.
Jupyter-notebook server ports 8888 and 8889 are opened to the host as ports 18888 and 18889.
Minio server port 9000 is opened to the host as port 19000.
Minio server console port 30000 is opened to the host as port 30000.
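To quickly confirm these host-side ports are reachable once the cluster is up, a small check like the one below can help. It is only a sketch; the port list simply mirrors the mappings described above:

# Quick reachability check for the host ports mapped in docker-compose.yml
# (the port list mirrors the mappings described above).
import socket

PORTS = {
    8080: "Spark master UI",
    8888: "Jupyter on spark-master",
    20022: "ssh on spark-master",
    30022: "ssh on hadoop-hive",
    5000: "nginx load balancer (worker UIs)",
    40022: "ssh on etl-server",
    18080: "Airflow web server",
    18888: "Jupyter on etl-server",
    19000: "Minio S3 API",
    30000: "Minio console",
}

for port, service in PORTS.items():
    try:
        with socket.create_connection(("localhost", port), timeout=3):
            print(f"{port:>5}  OK      {service}")
    except OSError:
        print(f"{port:>5}  CLOSED  {service}")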

Access examples from the host.

Based upon the ports exposed to the host, the Spark master node UI can be accessed from the host at

http://localhost:8080

Because spark-worker is scaled to 3 nodes, each with its own hostname and IP address, the worker UIs need to be accessed through nginx on port 5000. nginx will show each of the worker nodes in round-robin fashion when the web page is refreshed.

http://localhost:5000
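To watch the round robin without a browser, you can fetch the page a few times and print each response's HTML title, which for the Spark worker UI typically includes the worker's address. This is only a rough sketch:

# Fetch the nginx-balanced worker UI a few times; the <title> of the Spark
# worker UI typically contains the worker's address, so repeated requests
# should cycle through the three workers.
import re
import urllib.request

for i in range(3):
    with urllib.request.urlopen("http://localhost:5000", timeout=5) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    print(f"request {i + 1}: {match.group(1).strip() if match else 'no title found'}")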

Jupyter-notebook server on Spark master node can be accessed from the host by

http://localhost:8888

Jupyter-notebook server on Airflow ETL node can be accessed from the host by

http://localhost:18888

Note: on the initial login to the Jupyter notebook, if a password is asked for, it is 123456.
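For example, from the notebook on the spark-master node (or the one on the Airflow ETL node) you can start a SparkSession against the cluster. This is only a sketch; the master URL assumes the in-cluster hostname spark-master and the 7077 port shown in the docker ps output:

# Minimal PySpark sketch to run from a Jupyter notebook inside the cluster;
# the master URL assumes the in-cluster hostname spark-master and port 7077.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://spark-master:7077")
         .appName("notebook-smoke-test")
         .getOrCreate())

df = spark.range(1_000_000)                      # tiny synthetic dataset
print("row count:", df.count())
print("sum of ids:", df.groupBy().sum("id").collect()[0][0])

spark.stop()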

Apache Airflow web server can be accessed from the host by

http://localhost:18080

Note: the user name for the Airflow web server is admin and the password is 123456.


Minio server can be accessed from the host by

http://localhost:19000

User name for Minio server is minioadmin, password is minioadmin.
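From Python, the same Minio endpoint can be reached with boto3 through its S3-compatible API. The endpoint and credentials below are the defaults described above; the bucket and object names are placeholders:

# Talk to the Minio server through its S3-compatible API with boto3.
# Endpoint and credentials are the defaults described above; the bucket
# and key names are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:19000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello from boto3")

for obj in s3.list_objects_v2(Bucket="demo-bucket").get("Contents", []):
    print(obj["Key"], obj["Size"])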


You can also ssh into the spark-master node to submit your Spark applications, after uploading your application files to the spark-master node with scp, or simply run ad hoc SQL statements in Spark SQL.

ssh -p 20022 hadoop@localhost
$ cd $SPARK_HOME
$ bin/spark-submit /spark/examples/src/main/python/pi.py
$ spark-sql
2021-11-06 23:26:42,842 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
Spark master: local[*], Application Id: local-1636241205417
spark-sql>

You can ssh into the Apache Airflow ETL node to test the bash scripts for your DAGs.

ssh -p 40022 hadoop@localhost
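As a starting point for such a DAG, here is a minimal sketch that wraps bash steps in BashOperator tasks. It assumes Airflow 2.x imports; the dag_id, schedule and bash commands are placeholders to replace with your own scripts:

# Minimal Airflow DAG sketch (assumes Airflow 2.x); drop it into the DAGs
# folder on the etl-server container. The dag_id, schedule and commands
# are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="etl_bash_demo",
    start_date=datetime(2021, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'extract step: replace with your bash script'",
    )
    load = BashOperator(
        task_id="load",
        bash_command="echo 'load step: replace with your bash script'",
    )

    extract >> load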

About us

We containerize open source projects. Our containers usually include the associated development environment.

The docker images referenced in this writing are for educational purposes only. There is no warranty on these docker images or their associated files.

You should change passwords for all services in all containers.

Thank you for reading.

Subscription

I am a frequent content contributor here; you can subscribe to my writings on data engineering and other computer science subjects.


George Jen

I am the founder of Jen Tek LLC, a startup company in the East Bay, California, developing AI-powered, cloud-based documentation/publishing software as a service.