All-in-one custom Docker image for ETL pipeline and data preprocessing developers on Apache Airflow
George Jen, Jen Tek LLC
Introduction
We have built a custom docker image that includes everything a data engineering developer would need:
CentOS 8 image
Python3
Java Development Toolkit 1.8
Jupyter-notebook server to run Python from the host
ssh-server for the ease of connecting to the container using ssh and scp, as opposed to using docker exec and docker cp
Apache Spark
Graphframes for Apache Spark, for graph computing applications with Python
Hadoop
Hive
Some utilities for troubleshooting such as telnet
Additionally, the container has a root user (password root) and a user hadoop (password 123456) that owns Spark, Hadoop and Hive.
For implementation details of the above custom Docker image, with which we have built the big data cluster containing one Spark master node, many Spark worker nodes, one Hadoop/Hive node and one nginx load balancer node, please see my writing here.
About the Airflow ETL container node
Now, we want to add an ETL node using Apache Airflow into the big data cluster. Therefore, we need another custom docker image that includes the following:
CentOS 8 image
Python3
Python libraries including Numpy, Pandas, and AWS Boto3
AWS CLI
Java Development Toolkit 1.8
Development tools including C/C++ compiler
Jupyter-notebook server to run Python from the host
ssh-server for the ease of connecting to the container using ssh and scp, as opposed to using docker exec and docker cp
Apache Spark for composing Spark applications, for example streaming and data preprocessing with Airflow, in Python, Java and Scala
Graphframes for graph computing applications with Python
Apache Airflow, the ETL tool
Open-source Minio S3-compatible server for local S3-like buckets.
Each of the docker containers has its own IP address and hostname. It has the same look and feel as if it were a VM.
We intend to use docker-compose to start the Airflow ETL node along with Spark master node, Spark worker nodes, Hadoop/Hive node and nginx load balancer node, as one cluster.
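To give a flavor of the kind of pipeline this node is meant to host, here is a minimal sketch of an Airflow DAG that uses the bundled Pandas library for a small cleanup task. This is only an illustration, assuming Airflow 2.x; the DAG id and file paths below are placeholders, not something shipped in the image.
# pandas_clean_dag.py -- minimal sketch, assuming Airflow 2.x; paths are placeholders
from datetime import datetime
import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def clean_csv():
    # Tiny Pandas transform: drop rows with nulls and write the result back out.
    df = pd.read_csv("/tmp/raw_events.csv")                    # placeholder input path
    df.dropna().to_csv("/tmp/clean_events.csv", index=False)   # placeholder output path

with DAG(
    dag_id="pandas_clean_demo",
    start_date=datetime(2021, 11, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="clean_csv", python_callable=clean_csv)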
Based upon the above requirements, we have built two Docker images, jentekllc/bigdata:latest and jentekllc/etl:latest.
Here is the docker image jentekllc/bigdata:latest, here is the docker image jentekllc/etl:latest, and here is the tarball consisting of support files including docker-compose.yml.
System requirements.
It requires at least 8GB memory available to the docker daemon.
If you are using Docker Desktop on Mac or Windows, the memory allocated to Docker defaults to 2GB. You need to set it to 8GB or more.
For Docker Desktop on Mac, click the Docker icon at the top, click Preferences in the dropdown list, click Resources on the left, and move the memory slider from the 2GB default to 8GB or greater.
For Docker Desktop on Windows, follow the documentation on increasing the memory allocated to Docker.
Load docker images.
Download the files from the links above or below:
bigdata.tgz (3.06GB)
etl.tgz (2.38GB)
additional_files.tar (231KB)
Create a folder and place the 3 downloaded files into the new folder. Change directory into the folder.
Run the commands below:
$ docker load < bigdata.tgz
$ docker load < etl.tgz
It might take a while for the docker load commands to complete. Then run the command below to confirm:
$ docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
jentekllc/bigdata latest 213ff7954dd2 4 hours ago 5.48GB
jentekllc/etl latest 35bfbb518ca6 25 hours ago 6.29GB
Expand the tar file additional_files.tar by running:
$ tar -xvf additional_files.tar
The following files will be created:
core-site.xml
hive-site.xml
nginx.conf
start_etl.sh
start_s3.sh
docker-compose.yml
Start up the docker cluster.
In this demo, I will set up the following cluster:
One Apache Airflow ETL node
One Spark master node
Three Spark worker nodes
One Hadoop/Hive node
One nginx load balancer node
To start them, simply run:
$ nohup docker-compose -p j up --scale spark-worker=3 &
To confirm these containers have been started, run the command below:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e9f774390757 nginx:latest "/docker-entrypoint.…" 4 hours ago Up 4 hours 80/tcp, 0.0.0.0:5000->5000/tcp nginx-lb
fc2347cd7cf7 jentekllc/bigdata:latest "/run_sshd_worker.sh" 4 hours ago Up 4 hours 22/tcp, 0.0.0.0:50847->38080/tcp j-spark-worker-1
472295e76f78 jentekllc/bigdata:latest "/run_sshd_worker.sh" 4 hours ago Up 4 hours 22/tcp, 0.0.0.0:50848->38080/tcp j-spark-worker-3
f2f73912f821 jentekllc/bigdata:latest "/run_sshd_worker.sh" 4 hours ago Up 4 hours 22/tcp, 0.0.0.0:50849->38080/tcp j-spark-worker-2
05027c6b5a19 jentekllc/bigdata:latest "/run_sshd_master.sh" 4 hours ago Up 4 hours 0.0.0.0:7077->7077/tcp, 0.0.0.0:8080->8080/tcp, 0.0.0.0:8888-8889->8888-8889/tcp, 0.0.0.0:20022->22/tcp spark-master
9005a08d2d9e jentekllc/bigdata:latest "/run_sshd_hive.sh" 4 hours ago Up 4 hours 0.0.0.0:9000->9000/tcp, 0.0.0.0:9083->9083/tcp, 0.0.0.0:30022->22/tcp hadoop-hive
e2911cae487d jentekllc/etl:latest "/start_etl.sh" 4 hours ago Up 4 hours 0.0.0.0:30000->30000/tcp, 0.0.0.0:40022->22/tcp, 0.0.0.0:18080->8080/tcp, 0.0.0.0:18888->8888/tcp, 0.0.0.0:18889->8889/tcp, 0.0.0.0:19000->9000/tcp etl-server
To stop the cluster, run:
$ docker-compose -p j down
Ports exposed to the host.
Container ports exposed to the host, as defined in docker-compose.yml.
For container spark-master:
Spark UI port 8080 is opened to the host
Jupyter-notebook server ports 8888 and 8889 are opened to the host
ssh server port 22 is opened to the host as port 20022 because port 22 is used by the host
For container hadoop-hive:
ssh server port 22 is opened to the host as port 30022 because ports 22 and 20022 are used by the host
For container nginx-lb, the load balancer:
nginx port 5000 is opened to the host.
For container etl-server, the Apache Airflow ETL node:
ssh server port 22 is opened to the host as port 40022.
Apache Airflow web server port 8080 is opened to the host as port 18080.
Jupyter-notebook server ports 8888 and 8889 are opened to the host as ports 18888 and 18889.
Minio server port 9000 is opened to the host as port 19000.
Minio server console port 30000 is opened to the host as port 30000.
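If a page or service does not come up, a quick way to verify these mappings from the host is a short Python check like the sketch below. The port list simply mirrors the mappings above; adjust it if you change docker-compose.yml.
import socket

# Host ports published by docker-compose, as listed above.
PORTS = {
    "Spark master UI": 8080,
    "nginx (worker UIs)": 5000,
    "Jupyter on spark-master": 8888,
    "Airflow web server": 18080,
    "Jupyter on ETL node": 18888,
    "Minio S3 API": 19000,
    "Minio console": 30000,
}

for name, port in PORTS.items():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(2)
        status = "open" if s.connect_ex(("localhost", port)) == 0 else "closed"
        print(f"{name:24s} localhost:{port:<6d} {status}")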
Access examples from the host.
Based upon the ports exposed to the host, the Spark master node UI can be accessed from the host by
http://localhost:8080
Because spark-worker is scaled to 3 nodes, each with a unique hostname and IP address, the worker UIs are accessed via the nginx port 5000. nginx will show each of the worker nodes in round-robin fashion when the web page is refreshed.
http://localhost:5000
The Jupyter-notebook server on the Spark master node can be accessed from the host by
http://localhost:8888
The Jupyter-notebook server on the Airflow ETL node can be accessed from the host by
http://localhost:18888
Note: on the initial login to the Jupyter notebook, if a password is asked, it is 123456.
The Apache Airflow web server can be accessed from the host by
http://localhost:18080
Note: the user name for the Airflow web server is admin and the password is 123456.
The Minio server can be accessed from the host by
http://localhost:19000
The user name for the Minio server is minioadmin and the password is minioadmin.
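From the host you can also reach this endpoint programmatically with the Boto3 library that is baked into the ETL image. The sketch below is only an illustration; the bucket and object names are placeholders.
import boto3

# Point Boto3 at the local Minio S3-compatible endpoint exposed on the host.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:19000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

s3.create_bucket(Bucket="demo-bucket")  # placeholder bucket name; errors if it already exists
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello from the host")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])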
You can also ssh into the spark-master node to submit your Spark applications, after uploading your Spark application files to it by scp, or simply run ad hoc SQL statements on Spark SQL.
$ ssh -p 20022 hadoop@localhost
$ cd $SPARK_HOME
$ bin/spark-submit /spark/examples/src/main/python/pi.py
$ spark-sql
2021-11-06 23:26:42,842 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
Spark master: local[*], Application Id: local-1636241205417
spark-sql>
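Beyond the bundled pi.py example, you can scp your own PySpark scripts to the spark-master node and submit them the same way. Below is a minimal, hypothetical preprocessing sketch; the input path, output path and column names are placeholders.
# preprocess.py -- hypothetical sketch; submit with bin/spark-submit preprocess.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("preprocess-demo").getOrCreate()

# Read a CSV file, drop rows with nulls, keep a filtered projection, write Parquet.
df = spark.read.csv("/tmp/input.csv", header=True, inferSchema=True)     # placeholder path
cleaned = df.dropna().filter(col("amount") > 0).select("id", "amount")   # placeholder columns
cleaned.write.mode("overwrite").parquet("/tmp/output_parquet")           # placeholder path

spark.stop()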
You can ssh into the Apache Airflow ETL node to test your bash scripts for the DAGs.
ssh -p 40022 hadoop@localhost
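Once a bash script behaves as expected from the shell, it can be wired into a DAG and dropped into the Airflow dags folder on this node. Here is a minimal sketch using a BashOperator, assuming Airflow 2.x; the script path and schedule are placeholders.
# bash_etl_dag.py -- minimal sketch, assuming Airflow 2.x; the script path is a placeholder
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="bash_etl_pipeline",
    start_date=datetime(2021, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the same bash script you tested over ssh.
    run_preprocess = BashOperator(
        task_id="run_preprocess",
        bash_command="/home/hadoop/scripts/preprocess.sh ",  # placeholder path; trailing space keeps Airflow from treating .sh as a Jinja template file
    )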
About us
We containerize open source projects. Our containers usually include the associated development environment.
The docker images referenced in this writing are for educational purposes only. There is no warranty on these docker images and their associated files.
You should change passwords for all services in all containers.
Thank you for reading.
Subscription
I am a frequent content contributor here; you can subscribe to my writings on data engineering and other computer science subjects.