All-in-One Custom PostgreSQL Docker Image for the Database Developer

Nov 26, 2021

George Jen, Jen Tek LLC

We have built a custom docker image that includes everything a data engineering developer would need:

CentOS 8 image
Python3
Java Development Toolkit 1.8
Jupyter-notebook server to run Python and Scala from the host
ssh-server for the ease of connecting to the container using ssh and scp, as opposed to using docker exec and docker cp
Apache Spark
GraphFrames for Apache Spark, for graph computing applications with Python
Hadoop
Hive

Some utilities for troubleshooting such as telnet

Then, we want to add an ETL/ELT node based on Apache Airflow to the big data cluster. We have built another custom docker image that includes the following:

CentOS 8 image
Python3
Python libraries including NumPy, pandas, and AWS Boto3
AWS CLI
Java Development Toolkit 1.8
Developer tools including C/C++ compilers
Rust compiler
Jupyter-notebook server to run Python from the host
ssh-server for the ease of connecting to the container using ssh and scp, as opposed to using docker exec and docker cp
Apache Spark, for composing Spark applications such as streaming and data preprocessing jobs for Airflow in Python, Java, and Scala
GraphFrames for graph computing applications with Python
Apache Airflow, the ETL tool
Open source Minio S3 server to provide local S3 buckets

Additionally, we want to add a node to the big data cluster to handle streaming and process the streaming data in real time. We have built a custom docker image that includes the following:

CentOS 8 image
Python3
Java Development Toolkit 1.8
Maven build tool for Java
Sbt build tool for Scala
Jupyter-notebook server to run Python and Scala from the host
ssh-server for the ease of connecting to the container using ssh and scp, as opposed to using docker exec and docker cp
Apache Samza
The Apache Samza example application Hello Samza, which bundles Samza with ZooKeeper, Kafka, and YARN. The application launches a job that consumes a feed of real-time edits from Wikipedia and produces them to a Kafka topic, then launches a second job that parses the messages from that topic and extracts information such as the size of each edit and who made the change.

For implementation details on the above custom docker images, with which we have built the big data cluster containing one Spark master node, multiple Spark worker nodes, one Hadoop/Hive node, one nginx load balancer node, one Airflow ETL/ELT node, and one Apache Kafka and Samza node, please see my earlier writings here on medium.com.

While the cluster already includes Apache Hive, an open-source OLAP system for data warehousing, we want to add a PostgreSQL node as an OLTP system to handle the storage and retrieval of structured transactional data. And, for the ease of accessing PostgreSQL via a web interface, we want to add one pgadmin4 node. Hence we have created a custom PostgreSQL docker image that includes the following:

Debian Linux 11.1
Python3
Java Development Toolkit 1.8
Jupyter-notebook server to run Python from the host
ssh-server for the ease of connecting to the container using ssh and scp, as opposed to using docker exec and docker cp
PostgreSQL 14
PostgreSQL JDBC driver in container folder /drivers (postgresql-42.3.1.jar); see the usage sketch after this list
PostgreSQL ODBC driver configuration in container folder /etc (odbcinst.ini)
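As a minimal sketch of using the bundled JDBC driver, you could copy the jar out of the PostgreSQL container and hand it to Spark on whichever node runs your Spark client. This assumes the postgres user can read /drivers; the ssh port 10022 is described in the port mappings later in this article.

#From the host: copy the bundled JDBC driver out of the PostgreSQL container
$ scp -P 10022 postgres@localhost:/drivers/postgresql-42.3.1.jar .
#Then, for example, start PySpark with the driver on the classpath so the Spark JDBC data source can reach PostgreSQL
$ pyspark --jars postgresql-42.3.1.jar --driver-class-path postgresql-42.3.1.jar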

All containers except PostgreSQL have a root user (password root) and a user hadoop (password 123456) that owns the Spark, Hadoop, Hive, Airflow, Samza, and Jupyter-notebook servers. For the PostgreSQL container, the root password is root and the postgres user password is 123456.

We intend to start the PostgreSQL and pgadmin4 nodes along with the Airflow ETL/ELT node, Spark master node, Spark worker nodes, Hadoop/Hive node, and nginx load balancer node as one cluster. You can, however, edit docker-compose.yml to selectively launch the desired container nodes. We have also included a special docker-compose.yml file to launch the PostgreSQL and pgadmin4 containers only, if so desired.

Based upon the above requirements, we have built four docker images, jentekllc/bigdata:latest, jentekllc/etl:latest, jentekllc/samza:latest and jentekllc/postgresql:latest.

Here is the docker image jentekllc/bigdata:latest, here is the docker image jentekllc/etl:latest, here is the docker image jentekllc/samza:latest, here is the docker image jentekllc/postgresql:latest, and here is the tarball consisting of the support files, including docker-compose.yml.

System Requirements.

It requires at least 8GB memory available to the docker daemon.

If using Docker Desktop on Mac or Windows, the memory allocated to Docker defaults to 2GB. You need to set it to 8GB or more.

For Docker Desktop on Mac, click the Docker icon at the top, then click Preferences in the dropdown list, then click Resources on the left, and move the memory slider from the 2GB default to 8GB or greater.

For Docker Desktop on Windows, follow the documentation on increasing the memory allocated to Docker.
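To confirm how much memory the Docker daemon actually has after the change, you can query docker info from a terminal; the value is reported in bytes and should correspond to 8GB or more.

#Total memory visible to the Docker daemon, in bytes
$ docker info --format '{{.MemTotal}}'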

Load docker images.

Download the files from the links above or below:

bigdata.tgz (3.08GB)
etl.tgz (2.53GB)
docker_samza.tar.gz (6.27GB)
postgresql.tgz (515.2MB)
additional_files_20211125.tar.gz (5MB)

Create a folder and place the 5 downloaded files into the new folder. Change directory into the folder.

Run the commands below:

$ docker load < bigdata.tgz
$ docker load < etl.tgz
$ docker load < docker_samza.tar.gz
$ docker load < postgresql.tgz

It will take a while for the docker load commands to complete. Then run the command below to confirm:

$ docker image ls
REPOSITORY             TAG       IMAGE ID       CREATED        SIZE
jentekllc/postgresql   latest    1557485c708b   22 hours ago   1.43GB
jentekllc/etl          latest    c3b44db4c5d6   8 days ago     6.61GB
jentekllc/bigdata      latest    c7d3fb6d8221   8 days ago     5.53GB
jentekllc/samza        latest    e9086c0ddaab   10 days ago    12.3GB

Expand additional_files_20211125.tar.gz by running:

$ tar -xzf additional_files_20211125.tar.gz

The following files and folders will be extracted:

Files:

docker-compose.yml
docker-compose.yml.postgresql.pgadmin4 (To start up the PostgreSQL and pgadmin4 containers only, rename this file to docker-compose.yml; it is recommended to save the original docker-compose.yml first. See the example after the folder list.)
core-site.xml
docker_run_postgresql_only.sh (To start PostgreSQL container only)
start_etl.sh
hive-site.xml
start_s3.sh
nginx.conf

Folders:

data (mounted into the PostgreSQL container at /var/lib/postgresql/data to persist the database)
pgadmin4 (mounted into the pgadmin4 container at /var/lib/pgadmin to persist the pgadmin4 configuration database)
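For example, to switch to the PostgreSQL/pgadmin4-only compose file while keeping the original, something like the following works (the backup file name is arbitrary):

#Save the full-cluster compose file, then make the PostgreSQL/pgadmin4-only file the active one
$ cp docker-compose.yml docker-compose.yml.full
$ cp docker-compose.yml.postgresql.pgadmin4 docker-compose.yml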

Start up the docker cluster.

In this demo, I will set up the following cluster:

One Apache Airflow ETL/ELT node
One Apache Kafka and Samza Streaming/Real Time Processing node
One PostgreSQL node for the storage and retrieval of the structured transactional data
One pgadmin4 node for the ease of PostgreSQL administration and running query over web interface
One Spark Master node
Three Spark Worker nodes
One Hadoop/Hive node
One Nginx Load Balancer node

To start them together, simply run:

$ nohup docker-compose -p j up --scale spark-worker=3 &
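Since the command runs in the background under nohup, the docker-compose output goes to nohup.out in the current directory; you can follow it while the containers come up:

$ tail -f nohup.out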

To confirm these containers have started, run the command below:

$ docker ps
CONTAINER ID   IMAGE                         COMMAND                  CREATED         STATUS         PORTS   NAMES
ea6ccaad5924   nginx:latest                  "/docker-entrypoint.…"   9 minutes ago   Up 9 minutes   80/tcp, 0.0.0.0:5000->5000/tcp   nginx-lb
0ae10942def2   jentekllc/bigdata:latest      "/run_sshd_worker.sh"    9 minutes ago   Up 9 minutes   22/tcp, 0.0.0.0:50000->38080/tcp   j_spark-worker_3
9075364e60f3   jentekllc/bigdata:latest      "/run_sshd_worker.sh"    9 minutes ago   Up 9 minutes   22/tcp, 0.0.0.0:50001->38080/tcp   j_spark-worker_2
c92b4e131c1d   jentekllc/bigdata:latest      "/run_sshd_worker.sh"    9 minutes ago   Up 9 minutes   22/tcp, 0.0.0.0:50002->38080/tcp   j_spark-worker_1
0ed5d7ed0d22   jentekllc/samza:latest        "/home/hadoop/start_…"   9 minutes ago   Up 9 minutes   0.0.0.0:50022->22/tcp, 0.0.0.0:58088->8088/tcp, 0.0.0.0:58888->8888/tcp, 0.0.0.0:58889->8889/tcp   samza-server
0178dc3df8b1   jentekllc/bigdata:latest      "/run_sshd_hive.sh"      9 minutes ago   Up 9 minutes   0.0.0.0:30022->22/tcp, 0.0.0.0:38088->8088/tcp, 0.0.0.0:39000->9000/tcp, 0.0.0.0:39083->9083/tcp   hadoop-hive
e11a0f1c4dd9   dpage/pgadmin4:latest         "/entrypoint.sh"         9 minutes ago   Up 9 minutes   443/tcp, 0.0.0.0:20080->80/tcp   pgadmin4-container
4effecf897e6   jentekllc/etl:latest          "/start_etl.sh"          9 minutes ago   Up 9 minutes   0.0.0.0:40022->22/tcp, 0.0.0.0:48080->8080/tcp, 0.0.0.0:48888->8888/tcp, 0.0.0.0:48889->8889/tcp, 0.0.0.0:49000->9000/tcp, 0.0.0.0:40000->30000/tcp   etl-server
7b25d8dcd063   jentekllc/postgresql:latest   "/start_postgres_ssh…"   9 minutes ago   Up 9 minutes   5050/tcp, 0.0.0.0:10022->22/tcp, 0.0.0.0:15432->5432/tcp, 0.0.0.0:18888->8888/tcp, 0.0.0.0:18889->8889/tcp   postgresql-container
af0f519a43ce   jentekllc/bigdata:latest      "/run_sshd_master.sh"    9 minutes ago   Up 9 minutes   0.0.0.0:4040->4040/tcp, 0.0.0.0:7077->7077/tcp, 0.0.0.0:8080->8080/tcp, 0.0.0.0:8088->8088/tcp, 0.0.0.0:8888-8889->8888-8889/tcp, 0.0.0.0:20022->22/tcp   spark-master

To stop the cluster, run:

$ docker-compose -p j down

Container ports exposed to the host, as defined in docker-compose.yml, are listed below; a quick way to verify them from the host is shown after the lists.

For container spark-master

Spark UI port 8080 is open to the host
Jupyter-notebook server ports 8888 and 8889 are open to the host
ssh server port 22 is opened to the host as port 20022 because port 22 is used on the host

For container Hadoop/Hive

ssh server port 22 is opened to the host as port 30022 because port 22 is used on the host

For container nginx-lb, the load balancer

nginx port 5000 is opened to the host.

For Apache Airflow

ssh server port 22 is opened to the host as port 40022 because port 22 is used on the host.
Apache Airflow web server port 8080 is opened as port 48080 because port 8080 is already used on the host.
Jupyter-notebook server ports 8888 and 8889 are opened as ports 48888 and 48889 on the host.
Minio server port 9000 is opened as port 49000 on the host.
Minio server console port 30000 is opened as port 40000 on the host.

For Apache Samza

ssh server port 22 is opened to the host as port 50022 because port 22 is used on the host.
YARN port 8088 is opened as port 58088 because port 8088 is already used on the host.
Jupyter-notebook server ports 8888 and 8889 are opened as ports 58888 and 58889 on the host.

For PostgreSQL

ssh server port 22 is opened to the host as port 10022 because port 22 is used on the host.
Jupyter-notebook server ports 8888 and 8889 are opened as ports 18888 and 18889 on the host.
PostgreSQL server port 5432 is opened as port 15432 on the host.

For pgadmin4

HTTP port 80 is opened to the host as port 20080.
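To verify these mappings from the host for any running container, docker port lists the published ports; for example, for the PostgreSQL container:

#Lists mappings such as 5432/tcp -> 0.0.0.0:15432
$ docker port postgresql-container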

Access examples from the host.

Based upon the ports exposed to the host, the Spark master node can be accessed from the host at

http://localhost:8080

Because spark-worker is scaled to 3 nodes, each with a unique hostname and IP address, they need to be accessed through the nginx port 5000. nginx will show each of the worker node UIs in round-robin fashion when the web page is refreshed.

http://localhost:5000

Jupyter-notebook server on Spark master node can be accessed from the host by (No password needed)

http://localhost:8888

Jupyter-notebook server on Airflow ETL node can be accessed from the host by

http://localhost:48888

Apache Airflow web server can be accessed from the host by

http://localhost:48080

Note: user name for Airflow web server is admin, password is 123456.


Minio server can be accessed from the host by

http://localhost:49000

User name for Minio server is minioadmin, password is minioadmin.
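If the AWS CLI is installed on the host, you can also point it at the local Minio endpoint through the mapped API port to work with the local S3 buckets, for example to list them (the region value is arbitrary for Minio):

#Use the Minio credentials above and the mapped API port 49000
$ export AWS_ACCESS_KEY_ID=minioadmin
$ export AWS_SECRET_ACCESS_KEY=minioadmin
$ export AWS_DEFAULT_REGION=us-east-1
$ aws --endpoint-url http://localhost:49000 s3 ls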


Samza jobs are managed by YARN, which can be accessed by

http://localhost:58088

Jupyter-notebook server on the Apache Samza node can be accessed at the address below for quick coding with Python and Scala

http://localhost:58888

You can also ssh into the spark-master node to submit your Spark applications after uploading the application files to the spark-master node by scp, or simply run ad hoc SQL statements in Spark SQL.

#From host
$ ssh -p 20022 hadoop@localhost
#Inside container
$ cd $SPARK_HOME
$ bin/spark-submit /spark/examples/src/main/python/pi.py
$ spark-sql
2021-11-06 23:26:42,842 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
Spark master: local[*], Application Id: local-1636241205417
spark-sql>
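For example, a Spark application file can be uploaded to the spark-master node with scp through the same mapped ssh port before submitting it (my_app.py is just a placeholder name):

#From host: copy a hypothetical application file to the hadoop user's home directory on spark-master
$ scp -P 20022 my_app.py hadoop@localhost:
#Inside the container, submit it
$ $SPARK_HOME/bin/spark-submit ~/my_app.py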

You can ssh into the Apache Airflow ETL node to test your bash scripts for the DAGs.

$ ssh -p 40022 hadoop@localhost
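Once inside the container, you can also exercise the Airflow CLI directly, for example to list the registered DAGs (this assumes an Airflow 2.x CLI available on the hadoop user's PATH):

#Inside the etl-server container
$ airflow dags list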

You can also log into the Samza node by

$ ssh -p 50022 hadoop@localhost

Password for user hadoop is 123456; password for root is root.

Inside the Samza container, you will see Java processes for Kafka, ZooKeeper, YARN, and the Samza jobs running:

$ jps
144 ResourceManager
1233 ClusterBasedJobCoordinatorRunner
1729 LocalContainerRunner
1848 LocalContainerRunner
1689 LocalContainerRunner
586 Kafka
1946 Jps
426 NodeManager
76 QuorumPeerMain
1324 ClusterBasedJobCoordinatorRunner
1421 ClusterBasedJobCoordinatorRunner

There are two folders under /home/hadoop:

samza: Samza local build, a git workspace
hello-samza: Example application, a git workspace

$ file * | grep directory
hello-samza:        directory
samza:              directory
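To watch the Wikipedia feed actually flowing, you can consume one of the Kafka topics from inside the Samza container with the console consumer that hello-samza bundles. The deploy path and topic name below follow the hello-samza tutorial layout and may differ in this build:

#Inside the Samza container, peek at the raw Wikipedia edits topic
$ cd /home/hadoop/hello-samza
$ deploy/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic wikipedia-raw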

The pgadmin4 server can be reached at

http://localhost:20080

User email address is admin@yourname.com, password is 123456.


There is a default database called postgres and a default database user postgres whose default password is password. The pgadmin4 container has been pre-configured with a server connection named local that connects to the PostgreSQL server running in the cluster.

Jupyter-notebook server on the PostgreSQL node can be accessed at

http://localhost:18888

You can ssh into the PostgreSQL node to use the native psql client if you do not have a psql client installed on the host.

#On the host, password for user postgres is 123456
$ ssh -p 10022 postgres@localhost
#Inside the container
$ psql
psql (14.1 (Debian 14.1-1.pgdg110+1))
Type "help" for help.

postgres=# \dn
  List of schemas
  Name  |  Owner
--------+----------
 public | postgres
(1 row)
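If you do have a psql client on the host, you can instead connect straight through the mapped port 15432 as the database user postgres (enter that user's database password when prompted):

#From the host, no ssh needed
$ psql -h localhost -p 15432 -U postgres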

About us

We containerize open source projects. Our containers usually include the associated development environment with popular programming languages.

All docker images referenced in this writing are for educational purposes only. There is no warranty on these docker images or their associated files.

You should change passwords for all services in all containers.

Thank you for reading.

Subscription

I am a frequent content contributor here; you can subscribe to my writings on data engineering and other computer science subjects.


George Jen

I am the founder of Jen Tek LLC, a startup company in the East Bay, California, developing AI-powered, cloud-based documentation/publishing software as a service.