All in one custom and comprehensive Docker Image for the data engineering developer on Apache Spark
George Jen, Jen Tek LLC
We want to build a custom docker image that includes everything a data engineering developer would need:
CentOS 8 image
Java Development Kit (JDK) 1.8
Jupyter-notebook server to run Python from the host
ssh-server, for ease of connecting to the container using ssh and scp, as opposed to using docker exec and docker cp
GraphFrames for Apache Spark, for graph computing applications with Python
In addition to the user root (password root), the container has the user hadoop (password 123456). User hadoop owns all files belonging to Spark, Hadoop and Hive.
About the container node
The cluster includes one Spark master node, many Spark worker nodes, one Hadoop/Hive node and one nginx load balancer node.
Each container node in the cluster has its own IP address and hostname. All container nodes in the cluster are accessible by ssh internally, because we want each container node to have the same look and feel as a fat VM, but with a much smaller footprint.
We use docker-compose to start and stop the cluster.
We want to be able to scale Spark workers to as many nodes as the host memory capacity allows. Because there will be multiple Spark worker nodes, we need to include a load balancer, nginx.
Based upon the above requirements, we have built the docker image, jentekllc/bigdata:latest from scratch.
It requires at least 4GB memory made available to docker.
If you use Docker Desktop on Mac or Windows, the memory available to Docker defaults to 2GB. You need to set it to 4GB or more.
For Docker Desktop on Mac, click the Docker icon, then click Preferences, then move the memory slider from the 2GB default to 4GB or greater.
For Docker Desktop on Windows, follow the Docker documentation on increasing the memory allocated to Docker.
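Before loading the image, it can help to confirm how much memory the Docker engine actually sees. The following is a minimal Python sketch, assuming the docker CLI is on the PATH; it reads the MemTotal field from docker info. The function names and the 10% tolerance are illustrative assumptions and not part of the image: a Docker Desktop VM configured for 4GB typically reports slightly less than 4 GiB.

```python
import subprocess

GIB = 1024 ** 3


def docker_memory_bytes() -> int:
    """Ask the Docker engine for its total memory, in bytes."""
    out = subprocess.run(
        ["docker", "info", "--format", "{{.MemTotal}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip())


def meets_minimum(mem_bytes: int, minimum_gib: float = 4.0,
                  tolerance: float = 0.10) -> bool:
    """Return True if mem_bytes reaches the minimum, less the tolerance.

    The tolerance is an assumption: a VM configured for 4GB usually
    reports a little less than 4 GiB to the engine.
    """
    return mem_bytes >= minimum_gib * GIB * (1.0 - tolerance)


def report(mem_bytes: int) -> str:
    """Format a one-line verdict for the given memory size."""
    verdict = "OK" if meets_minimum(mem_bytes) else "too low, raise it to 4GB or more"
    return f"Docker memory: {mem_bytes / GIB:.2f} GiB ({verdict})"
```

With Docker running, print(report(docker_memory_bytes())) prints the verdict for the local engine.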
Load docker image:
Download the files from the links above:
Create a folder and place the 2 downloaded files into the new folder. Change directory into the folder.
Run the command below:
$ docker load < bigdata.tgz
It might take a while for the docker load to complete. Then run the command below to confirm:
$ docker image ls
REPOSITORY          TAG      IMAGE ID       CREATED       SIZE
jentekllc/bigdata   latest   b2b671d197f7   4 hours ago   5.51GB
Extract the tar file bigdata_docker.tar with:
$ tar -xf bigdata_docker.tar
Four files will be created:
Start up the Docker cluster
In this demo, I will set up the following cluster:
One Spark master node
Three Spark worker nodes
One Hadoop/Hive node
One nginx load balancer node
To start them, simply run:
$ nohup docker-compose -p j up --scale spark-worker=3 &
To confirm that these containers have started, run the command below:
$ docker ps
CONTAINER ID   IMAGE                      COMMAND                  CREATED              STATUS              PORTS   NAMES
f23c2863e235   nginx:latest               "/docker-entrypoint.…"   About a minute ago   Up 56 seconds       80/tcp, 0.0.0.0:5000->5000/tcp   nginx-lb
1cb418088d2c   jentekllc/bigdata:latest   "/run_sshd_worker.sh"    About a minute ago   Up 57 seconds       22/tcp, 0.0.0.0:49851->38080/tcp   j-spark-worker-3
997537fb1887   jentekllc/bigdata:latest   "/run_sshd_worker.sh"    About a minute ago   Up 57 seconds       22/tcp, 0.0.0.0:49852->38080/tcp   j-spark-worker-1
61bd4afc30a0   jentekllc/bigdata:latest   "/run_sshd_worker.sh"    About a minute ago   Up 58 seconds       22/tcp, 0.0.0.0:49850->38080/tcp   j-spark-worker-2
16a493eb513d   jentekllc/bigdata:latest   "/run_sshd_master.sh"    About a minute ago   Up About a minute   0.0.0.0:7077->7077/tcp, 0.0.0.0:8080->8080/tcp, 0.0.0.0:8888-8889->8888-8889/tcp, 0.0.0.0:20022->22/tcp   spark-master
2707ab560407   jentekllc/bigdata:latest   "/run_sshd_hive.sh"      About a minute ago   Up About a minute   0.0.0.0:9000->9000/tcp, 0.0.0.0:9083->9083/tcp, 0.0.0.0:30022->22/tcp   hadoop-hive
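Reading that listing by eye gets tedious once the worker count grows. The following is a minimal Python sketch of an automated check, assuming the docker CLI is on the PATH and that the containers carry the names shown in the listing above. The function names are hypothetical, and the worker names assume the project name j and three workers.

```python
import subprocess

# Names taken from the `docker ps` listing above; the worker names
# assume the project name "j" and three workers.
EXPECTED = {
    "nginx-lb", "spark-master", "hadoop-hive",
    "j-spark-worker-1", "j-spark-worker-2", "j-spark-worker-3",
}


def docker_ps() -> str:
    """Return one 'name<TAB>status' line per running container."""
    return subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True, text=True, check=True,
    ).stdout


def missing_containers(ps_output: str, expected=EXPECTED) -> set:
    """Return the expected containers that are not listed as 'Up'."""
    up = set()
    for line in ps_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, status = line.partition("\t")
        if status.startswith("Up"):
            up.add(name.strip())
    return set(expected) - up
```

With the cluster running, missing_containers(docker_ps()) should return an empty set; any name it does return is a container that failed to start.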
To stop the cluster, run:
$ docker-compose -p j down
Ports exposed to the host
The following container ports are exposed to the host, as defined in docker-compose.yml.
For container spark-master:
Spark UI port 8080 is open to the host
jupyter-notebook server ports 8888 and 8889 are open to the host
ssh server port 22 is open to the host as port 20022, because port 22 is already used by the host
For container hadoop-hive:
ssh server port 22 is open to the host as port 30022, because ports 22 and 20022 are already in use on the host
For container nginx-lb, the load balancer:
nginx port 5000 is open to the host
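The port list above can be verified from the host with a short script. This is a minimal Python sketch using only the standard library; the labels and function names are illustrative. Note that a successful TCP connection only shows that the port is published on the host, not that the service behind it is healthy.

```python
import socket

# Host ports listed above, as published by docker-compose.yml.
HOST_PORTS = {
    "spark-master Spark UI": 8080,
    "spark-master jupyter-notebook": 8888,
    "spark-master jupyter-notebook (second port)": 8889,
    "spark-master ssh": 20022,
    "hadoop-hive ssh": 30022,
    "nginx-lb": 5000,
}


def port_is_open(port: int, host: str = "localhost", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: treat all as "not open".
        return False


def check_ports(ports=HOST_PORTS) -> dict:
    """Map each service label to True (reachable) or False."""
    return {label: port_is_open(port) for label, port in ports.items()}
```

Calling check_ports() while the cluster is up returns a dictionary of labels and booleans, which makes a missing port mapping easy to spot.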
Based upon the ports exposed to the host, the Spark master node UI can be accessed from the host in a browser on port 8080.
Because spark-worker is scaled to 3 nodes, each with a unique hostname and IP address, the worker UIs need to be accessed through nginx on port 5000. nginx shows each of the worker nodes in round-robin fashion each time the web page is refreshed.
The jupyter-notebook server can be accessed from the host on ports 8888 and 8889.
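To make the round-robin behavior concrete, here is a small Python illustration of the rotation itself. It does not talk to nginx, and the worker names are placeholders matching the three-worker demo; the order nginx actually uses depends on its configuration, which is not shown here.

```python
from itertools import cycle


def round_robin(backends):
    """Return an iterator that yields backends in order, wrapping around forever."""
    return cycle(backends)


# Placeholder worker names, matching the three-worker demo.
workers = ["j-spark-worker-1", "j-spark-worker-2", "j-spark-worker-3"]
picker = round_robin(workers)

# Each "page refresh" is served by the next worker in turn.
served = [next(picker) for _ in range(5)]
print(served)
```

After the third refresh the rotation wraps back to the first worker, which is why repeated refreshes of the page on port 5000 eventually show every worker.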
You can also ssh into spark-master node to submit your Spark applications after you have uploaded your Spark application files by scp to the spark-master node.
$ ssh -p 20022 hadoop@localhost
$ cd $SPARK_HOME
$ bin/spark-submit /spark/examples/src/main/python/pi.py
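The upload and submit steps can also be scripted from the host. The following Python sketch only builds the command lines. The file name my_app.py is hypothetical, and the sketch assumes that SPARK_HOME is also set for non-interactive ssh sessions in the container, which may not hold. Note that scp takes the port as -P (capital), while ssh uses -p.

```python
import shlex

SSH_PORT = 20022            # host port mapped to spark-master's port 22
TARGET = "hadoop@localhost"


def upload_command(local_file: str) -> list:
    """Build the scp command; the trailing ':' copies into hadoop's home directory."""
    return ["scp", "-P", str(SSH_PORT), local_file, f"{TARGET}:"]


def submit_command(remote_file: str) -> list:
    """Build the ssh command that runs spark-submit on spark-master.

    Assumes SPARK_HOME is defined in the remote, non-interactive shell.
    """
    remote = f"$SPARK_HOME/bin/spark-submit {shlex.quote(remote_file)}"
    return ["ssh", "-p", str(SSH_PORT), TARGET, remote]
```

Each list can be passed to subprocess.run(..., check=True) from a terminal; both commands prompt for the hadoop password.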
We containerize open source projects. Our containers usually include the associated development environment.
The docker images referenced in this writing are for educational purposes only. There is no warranty on these docker images and their associated files.
You should change passwords for all services in all containers.
Thank you for reading.
I am a frequent content contributor here; you can subscribe to my writings on data engineering and other computer science subjects.