All-in-one custom Docker image for ETL pipeline/data preprocessing development on Apache Airflow

$ docker load < bigdata.tgz
$ docker load < etl.tgz
$ docker image ls
REPOSITORY          TAG      IMAGE ID       CREATED        SIZE
jentekllc/bigdata   latest   213ff7954dd2   4 hours ago    5.48GB
jentekllc/etl       latest   35bfbb518ca6   25 hours ago   6.29GB
$ tar -xvf additional_files.tar
$ nohup docker-compose -p j up --scale spark-worker=3 &
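The `-p j` flag sets the compose project name (hence container names like `j-spark-worker-1`), and `--scale spark-worker=3` starts three worker containers. The actual `docker-compose.yml` ships in `additional_files.tar` and is authoritative; purely as a hypothetical sketch of how services matching the containers shown below could be declared, it might look like:

```yaml
# Hypothetical sketch only -- the real docker-compose.yml is in additional_files.tar.
version: "3"
services:
  spark-master:
    image: jentekllc/bigdata:latest
    command: /run_sshd_master.sh
    ports:
      - "7077:7077"     # Spark master RPC
      - "8080:8080"     # Spark master web UI
      - "20022:22"      # SSH into the container
  spark-worker:
    image: jentekllc/bigdata:latest
    command: /run_sshd_worker.sh
    ports:
      - "38080"         # ephemeral host port assigned per scaled worker
  nginx-lb:
    image: nginx:latest
    ports:
      - "5000:5000"     # load balancer in front of the scaled workers
```

Leaving the host side of the worker's port unspecified is what lets `--scale` start several copies without port collisions; Docker assigns each replica an ephemeral host port (50847-50849 in the `docker ps` output below).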
$ docker ps
CONTAINER ID   IMAGE                      COMMAND                  CREATED       STATUS       PORTS                                                                  NAMES
e9f774390757   nginx:latest               "/docker-entrypoint.…"   4 hours ago   Up 4 hours   80/tcp, 0.0.0.0:5000->5000/tcp                                         nginx-lb
fc2347cd7cf7   jentekllc/bigdata:latest   "/run_sshd_worker.sh"    4 hours ago   Up 4 hours   22/tcp, 0.0.0.0:50847->38080/tcp                                       j-spark-worker-1
472295e76f78   jentekllc/bigdata:latest   "/run_sshd_worker.sh"    4 hours ago   Up 4 hours   22/tcp, 0.0.0.0:50848->38080/tcp                                       j-spark-worker-3
f2f73912f821   jentekllc/bigdata:latest   "/run_sshd_worker.sh"    4 hours ago   Up 4 hours   22/tcp, 0.0.0.0:50849->38080/tcp                                       j-spark-worker-2
05027c6b5a19   jentekllc/bigdata:latest   "/run_sshd_master.sh"    4 hours ago   Up 4 hours   0.0.0.0:7077->7077/tcp, 0.0.0.0:8080->8080/tcp, 0.0.0.0:8888-8889->8888-8889/tcp, 0.0.0.0:20022->22/tcp   spark-master
9005a08d2d9e   jentekllc/bigdata:latest   "/run_sshd_hive.sh"      4 hours ago   Up 4 hours   0.0.0.0:9000->9000/tcp, 0.0.0.0:9083->9083/tcp, 0.0.0.0:30022->22/tcp   hadoop-hive
e2911cae487d   jentekllc/etl:latest       "/start_etl.sh"          4 hours ago   Up 4 hours   0.0.0.0:30000->30000/tcp, 0.0.0.0:40022->22/tcp, 0.0.0.0:18080->8080/tcp, 0.0.0.0:18888->8888/tcp, 0.0.0.0:18889->8889/tcp, 0.0.0.0:19000->9000/tcp   etl-server
$ docker-compose -p j down
Once the containers are up, the following endpoints are exposed (each mapping taken from the `docker ps` output above):

http://localhost:8080 — Spark master web UI (spark-master)
http://localhost:5000 — nginx load balancer (nginx-lb)
http://localhost:8888 — port 8888 on spark-master (typically a Jupyter notebook server)
http://localhost:18888 — port 8888 on etl-server
http://localhost:18080 — port 8080 on etl-server (8080 is the Airflow webserver's default port)
http://localhost:19000 — port 9000 on etl-server
$ ssh -p 20022 hadoop@localhost
$ cd $SPARK_HOME
$ bin/spark-submit /spark/examples/src/main/python/pi.py
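The bundled `pi.py` is Spark's stock example: it estimates π by Monte Carlo sampling, distributing the sampling across the cluster with `parallelize`/`map`/`reduce`. Stripped of the Spark parallelization, the core idea is a sketch like:

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi by sampling random points in the unit square and
    counting the fraction that land inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of quarter circle / area of square = pi/4, so scale by 4.
    return 4.0 * inside / num_samples

print(estimate_pi(1_000_000))  # roughly 3.14
```

In the Spark version, the loop body becomes a function mapped over a parallelized range of sample indices, and the counts are combined with a `reduce`, so the same estimate is computed across all three workers.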
$ spark-sql
2021-11-06 23:26:42,842 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
Spark master: local[*], Application Id: local-1636241205417
spark-sql>
$ ssh -p 40022 hadoop@localhost


I am founder of Jen Tek LLC, a startup company in East Bay California developing AI powered, cloud based documentation/publishing software as a service.

George Jen
