All in one custom and comprehensive Docker Image for the data engineering developer on Apache Spark

George Jen, Jen Tek LLC

Oct 18, 2021


We want to build a custom docker image that includes everything a data engineering developer would need:

CentOS 8 image
Java Development Kit (JDK) 1.8
Jupyter-notebook server to run Python from the host
ssh server for the ease of connecting to the container using ssh and scp, as opposed to using docker exec and docker cp
Apache Spark
GraphFrames for Apache Spark, for graph computing applications with Python

In addition to the user root (password root), the container has the user hadoop (password 123456). User hadoop owns all files belonging to Spark, Hadoop, and Hive.

About the container nodes

The cluster includes one Spark master node, many Spark worker nodes, one Hadoop/Hive node and one nginx load balancer node.

Each container node in the cluster has its own IP address and hostname. All container nodes in the cluster are accessible by ssh internally, because we want each container node to have the same look and feel as a full VM, but with a much smaller footprint.

We use docker-compose to start and stop the cluster.

We want to be able to scale the Spark workers to as many nodes as the host memory capacity allows. Given that there will be multiple Spark worker nodes, we need to include a load balancer, nginx.

Based upon the above requirements, we have built the docker image, jentekllc/bigdata:latest from scratch.

Here is the link to the docker image, and here is the link to the associated files, including docker-compose.yml.

System Requirement:

It requires at least 4GB of memory made available to Docker.

If you use Docker Desktop on Mac or Windows, Docker defaults its memory capacity to 2GB. You need to set it to 4GB or more.

For Docker Desktop on Mac, click the Docker icon, then click Preferences, then move the memory slider to 4GB or greater from the 2GB default setting.

For Docker Desktop on Windows, follow the documentation on increasing the memory allocated to Docker.
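If you are unsure how much memory Docker currently has, you can query the daemon directly (this assumes the docker CLI is on your PATH and the daemon is running):

```shell
# Print the total memory available to the Docker daemon, in bytes.
# 4GB corresponds to 4294967296 bytes.
docker info --format '{{.MemTotal}}'
```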

Load docker image:

Download the files from the links above:

bigdata.tgz (3GB)
bigdata_docker.tar

Create a folder and place the two downloaded files into the new folder. Change directory into the folder.

Run the command below:

$ docker load < bigdata.tgz

It might take a while for the docker load to complete. Then run the command below to confirm:

$ docker image ls
REPOSITORY          TAG       IMAGE ID       CREATED       SIZE
jentekllc/bigdata   latest    b2b671d197f7   4 hours ago   5.51GB

Expand the tar file bigdata_docker.tar:

$ tar -xf  bigdata_docker.tar

Four files will be created:


Start up the docker cluster

In this demo, I will set up the following cluster:

One Spark master node
Three Spark worker nodes
One Hadoop/Hive node
One nginx load balancer node

To start them, simply run:

$ nohup docker-compose -p j up --scale spark-worker=3 &

To confirm these containers have been started, run the command below:

$ docker ps
CONTAINER ID   IMAGE                      COMMAND                  CREATED              STATUS              PORTS                                           NAMES
f23c2863e235   nginx:latest               "/docker-entrypoint.…"   About a minute ago   Up 56 seconds       80/tcp, >5000/tcp                               nginx-lb
1cb418088d2c   jentekllc/bigdata:latest   "/"                      About a minute ago   Up 57 seconds       22/tcp, >38080/tcp                              j-spark-worker-3
997537fb1887   jentekllc/bigdata:latest   "/"                      About a minute ago   Up 57 seconds       22/tcp, >38080/tcp                              j-spark-worker-1
61bd4afc30a0   jentekllc/bigdata:latest   "/"                      About a minute ago   Up 58 seconds       22/tcp, >38080/tcp                              j-spark-worker-2
16a493eb513d   jentekllc/bigdata:latest   "/"                      About a minute ago   Up About a minute   >7077/tcp, >8080/tcp, >8888-8889/tcp, >22/tcp   spark-master
2707ab560407   jentekllc/bigdata:latest   "/"                      About a minute ago   Up About a minute   >9000/tcp, >9083/tcp, >22/tcp                   hadoop-hive

To stop the cluster, run:

$ docker-compose -p j down

Ports exposed to the host

The following container ports are exposed to the host, as defined in docker-compose.yml.

For container spark-master

Spark UI port 8080 is opened to the host
jupyter-notebook server ports 8888 and 8889 are opened to the host
ssh server port 22 is opened to the host as port 20022 because port 22 is used by the host

For container hadoop-hive:

ssh server port 22 is opened to the host as port 30022 because ports 22 and 20022 are already used on the host

For container nginx-lb, the load balancer,

nginx port 5000 is opened to the host.
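Internally, nginx balances requests across the workers in round-robin order, which is its default method. The image's actual configuration file is not shown here, but an nginx setup of this shape would produce that behavior; the upstream hostnames and port below are illustrative assumptions, not the file shipped in the image:

```
upstream spark_workers {
    # round-robin is nginx's default balancing method
    server j-spark-worker-1:8081;
    server j-spark-worker-2:8081;
    server j-spark-worker-3:8081;
}
server {
    listen 5000;
    location / {
        proxy_pass http://spark_workers;
    }
}
```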

Based upon the ports exposed to the host, the Spark master node UI can be accessed from the host at http://localhost:8080.


Because spark-worker is scaled to three nodes, each with a unique hostname and IP address, the worker UIs need to be accessed via the nginx port at 5000. nginx will show each of the worker nodes in round-robin fashion as the web page is refreshed.
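One way to observe the round-robin behavior from the host is to fetch the page through nginx a few times and pull the worker hostname out of each response (a sketch, assuming the cluster is up and the worker UI page contains its spark-worker hostname):

```shell
# Hit the nginx load balancer repeatedly; each response should come
# from the next worker in round-robin order.
for i in 1 2 3; do
  curl -s http://localhost:5000/ | grep -o 'spark-worker[^"< ]*' | head -1
done
```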


The jupyter-notebook server can be accessed from the host at http://localhost:8888.


You can also ssh into the spark-master node to submit your Spark applications, after uploading your Spark application files to it by scp.

$ ssh -p 20022 hadoop@localhost
$ cd $SPARK_HOME
$ bin/spark-submit /spark/examples/src/main/python/
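Putting the pieces together, a typical workflow is to copy an application file to the spark-master container and submit it there. `my_app.py` below is a hypothetical file name; note that scp takes the port as uppercase -P while ssh takes it as lowercase -p:

```shell
# Copy a (hypothetical) PySpark script to user hadoop's home on spark-master.
scp -P 20022 my_app.py hadoop@localhost:
# Run spark-submit remotely; single quotes keep $SPARK_HOME from expanding
# on the host so it resolves inside the container instead.
ssh -p 20022 hadoop@localhost 'cd $SPARK_HOME && bin/spark-submit ~/my_app.py'
```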

About us

We containerize open source projects. Our containers usually include the associated development environment.

The docker images referenced in this writing are for educational purposes only. There is no warranty on these docker images or their associated files.

You should change passwords for all services in all containers.

Thank you for reading.


I am a frequent content contributor here; you can subscribe to my writings on data engineering and other computer science subjects.



George Jen

I am the founder of Jen Tek LLC, a startup company in East Bay, California, developing AI-powered, cloud-based documentation/publishing software as a service.