All-in-one custom Docker image for ETL pipeline/data preprocessing developers on Apache Airflow

George Jen, Jen Tek LLC

Introduction

We have built a custom Docker image that includes everything a data engineering developer would need:

CentOS 8 image
Python 3
Java Development Kit (JDK) 1.8
Jupyter Notebook server for running Python from the host
SSH server, for ease of connecting to the container with ssh and scp, as opposed to using docker exec and docker cp
Apache Spark
GraphFrames for Apache Spark, for graph computing