An all-in-one custom Docker image for ETL pipeline / data preprocessing developers on Apache Airflow

$ docker load < bigdata.tgz
$ docker load < etl.tgz
$ docker image ls
REPOSITORY          TAG       IMAGE ID       CREATED        SIZE
jentekllc/bigdata   latest    213ff7954dd2   4 hours ago    5.48GB
jentekllc/etl       latest    35bfbb518ca6   25 hours ago   6.29GB
$ tar -xvf additional_files.tar
$ nohup docker-compose -p j up --scale spark-worker=3 &
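Since the stack is launched with `-p j`, Compose prefixes every container with the project name `j`; a small sketch (assuming the default Compose naming scheme) of the worker names `--scale spark-worker=3` should produce:

```shell
# Hedged sketch: with "-p j" the compose project is named "j", so
# --scale spark-worker=3 brings up containers j-spark-worker-1..3
# under the default Compose naming scheme.
worker_names() {
  project=$1; n=$2
  for i in $(seq 1 "$n"); do
    echo "${project}-spark-worker-${i}"
  done
}
worker_names j 3
```

These are exactly the names that show up in the `docker ps` listing below (though not necessarily in order).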
$ docker ps
CONTAINER ID   IMAGE                      COMMAND                  CREATED       STATUS       PORTS                                                                                                  NAMES
e9f774390757   nginx:latest               "/docker-entrypoint.…"   4 hours ago   Up 4 hours   80/tcp, 0.0.0.0:5000->5000/tcp                                                                         nginx-lb
fc2347cd7cf7   jentekllc/bigdata:latest   "/run_sshd_worker.sh"    4 hours ago   Up 4 hours   22/tcp, 0.0.0.0:50847->38080/tcp                                                                       j-spark-worker-1
472295e76f78   jentekllc/bigdata:latest   "/run_sshd_worker.sh"    4 hours ago   Up 4 hours   22/tcp, 0.0.0.0:50848->38080/tcp                                                                       j-spark-worker-3
f2f73912f821   jentekllc/bigdata:latest   "/run_sshd_worker.sh"    4 hours ago   Up 4 hours   22/tcp, 0.0.0.0:50849->38080/tcp                                                                       j-spark-worker-2
05027c6b5a19   jentekllc/bigdata:latest   "/run_sshd_master.sh"    4 hours ago   Up 4 hours   0.0.0.0:7077->7077/tcp, 0.0.0.0:8080->8080/tcp, 0.0.0.0:8888-8889->8888-8889/tcp, 0.0.0.0:20022->22/tcp   spark-master
9005a08d2d9e   jentekllc/bigdata:latest   "/run_sshd_hive.sh"      4 hours ago   Up 4 hours   0.0.0.0:9000->9000/tcp, 0.0.0.0:9083->9083/tcp, 0.0.0.0:30022->22/tcp                                    hadoop-hive
e2911cae487d   jentekllc/etl:latest       "/start_etl.sh"          4 hours ago   Up 4 hours   0.0.0.0:30000->30000/tcp, 0.0.0.0:40022->22/tcp, 0.0.0.0:18080->8080/tcp, 0.0.0.0:18888->8888/tcp, 0.0.0.0:18889->8889/tcp, 0.0.0.0:19000->9000/tcp   etl-server
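A quick way to confirm all three workers came up is to count the `spark-worker` names in the `docker ps` output. In this sketch the listing is stubbed with the container names from above so it runs without Docker; on a live host you would pipe `docker ps --format '{{.Names}}'` into the same `grep`:

```shell
# Count running spark-worker containers. The docker ps output is
# stubbed here; on a live host replace the printf with:
#   docker ps --format '{{.Names}}'
ps_names='nginx-lb
j-spark-worker-1
j-spark-worker-3
j-spark-worker-2
spark-master
hadoop-hive
etl-server'
workers=$(printf '%s\n' "$ps_names" | grep -c 'spark-worker')
echo "$workers"   # expected: 3
```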
$ docker-compose -p j down
http://localhost:8080 (spark-master)
http://localhost:5000 (nginx-lb)
http://localhost:8888 (spark-master)
http://localhost:18888 (etl-server)
http://localhost:18080 (etl-server)
http://localhost:19000 (etl-server)
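When bookmarking these consoles, a small helper that maps each published port back to its container (derived from the port bindings in the `docker ps` listing above; the helper itself is illustrative, not part of the image) can save a trip back to the terminal:

```shell
# Illustrative helper: map a published host port to its container,
# per the docker ps port bindings shown earlier.
service_for_port() {
  case "$1" in
    5000)                           echo nginx-lb ;;
    7077|8080|8888|8889)            echo spark-master ;;
    9000|9083)                      echo hadoop-hive ;;
    18080|18888|18889|19000|30000)  echo etl-server ;;
    *)                              echo unknown ;;
  esac
}
service_for_port 8080   # prints: spark-master
```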
$ ssh -p 20022 hadoop@localhost
$ cd $SPARK_HOME
$ bin/spark-submit /spark/examples/src/main/python/pi.py
$ spark-sql
2021-11-06 23:26:42,842 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
Spark master: local[*], Application Id: local-1636241205417
spark-sql>
ssh -p 40022 hadoop@localhost

--

I am the founder of Jen Tek LLC, a startup in the East Bay, California, developing AI-powered, cloud-based documentation/publishing software as a service.

George Jen
