
Create 3 node Hadoop and Spark Cluster Monitored by SignalFx from Splunk

George Jen, Jen Tek LLC

Summary

This article provides step-by-step instructions for creating a 3-node Hadoop and Hive cluster and a 3-node Spark cluster, running both in Standalone mode and managed by Hadoop's YARN, and for setting up monitoring with SignalFx from Splunk, using VirtualBox CentOS VMs on a single Windows 10 host PC. I hope it is helpful.

Objective and Task

I need to test a monitoring solution on a Hadoop and Spark installation, to monitor key Hadoop cluster and Spark application metrics as well as OS (Linux) indicators. There are a few choices whose features are worth comparing:

Ganglia

http://ganglia.sourceforge.net/

SignalFx from Splunk

https://www.signalfx.com/

In this exercise, the focus is on SignalFx.

Required

Nothing, except a computer running Windows 10 with 32GB RAM, an Intel i7 CPU with 4 cores and 8 threads (the more recent the i7 generation, the better), and plenty of free hard drive space.

Tasks

Install VirtualBox software onto the Windows 10

Download and install from

https://www.virtualbox.org/wiki/Downloads

Create VirtualBox VM with CentOS 7.1.

There are plenty of tutorials on the internet you can search for; one of them is here:

https://www.avoiderrors.com/install-centos-7-virtual-box/

After creating and starting up the CentOS VM, create OS user hadoop in group hadoop.

Set the home directory of OS user hadoop to /opt/hadoop, because the /home file system is too small for this exercise.
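A minimal sketch of that user creation, run as root (or via sudo) inside each VM; adjust the group or home directory if your layout differs:

sudo groupadd hadoop
sudo useradd -g hadoop -d /opt/hadoop -m -s /bin/bash hadoop
sudo passwd hadoop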

Locate VirtualBox VM files to be duplicated

After creating a CentOS 7.1 VirtualBox VM, shut down the VM and locate the VM disk file from Settings -> Storage.

Three VMs are needed, so make 2 additional copies of the VM disk file. Name the 3 files CentOS71master.vdi, CentOS71worker1.vdi and CentOS71worker2.vdi.
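For example, if the original disk file sits in D:\vm (an assumption; substitute your actual path), the copies can be made from a Windows command prompt:

cd /d D:\vm
copy CentOS71master.vdi CentOS71worker1.vdi
copy CentOS71master.vdi CentOS71worker2.vdi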

After copying, change the UUID of the copied VM disk files, because all 3 files have the same UUID even though the file names are different. VirtualBox will refuse to launch a VM if another VM with the same UUID already exists.

Change VirtualBox UUIDs for all 3 VMs

On the Windows host, open a command window as administrator and run the commands below:

cd C:\Program Files\Oracle\VirtualBox
VBoxManage internalcommands sethduuid "D:\vm\CentOS71master.vdi"
VBoxManage internalcommands sethduuid "D:\vm\CentOS71worker1.vdi"
VBoxManage internalcommands sethduuid "D:\vm\CentOS71worker2.vdi"

Create 3 VMs

Create 3 VMs in VirtualBox named SparkMaster, SparkWorker1 and SparkWorker2 that link to D:\vm\CentOS71master.vdi, D:\vm\CentOS71worker1.vdi and D:\vm\CentOS71worker2.vdi respectively.

VirtualBox Settings

Additional VirtualBox VM settings on SparkMaster, SparkWorker1 and SparkWorker2

CPU and Memory

Assign 8GB RAM, 2 CPU for each VM.

Network

Each VM will be attached to a NAT Network, an internal network that lets all 3 VMs reach each other and connect to the outside world, but prevents them from being reached from outside. To get in, I add a second network adapter to the SparkMaster VM, configured as NAT, and connect to it from the host machine through port forwarding. The SparkMaster VM therefore serves as the bastion host, used as a jump box to reach the other VMs on the internal network.

If you need to know what NAT and NAT Network are, please refer to the Wikipedia article:

https://en.wikipedia.org/wiki/Network_address_translation

Find IP addresses of all 3 VMs

Start up all 3 VMs in VirtualBox. Open a terminal in each VM, run the command below and record the IP addresses displayed:

sudo ifconfig

Here are the 3 VMs. The OS user is hadoop on all 3 machines, with the same password.

VM            Intended Hostname    Internal IP Address    NAT IP Address
SparkMaster   masternode           192.168.0.104          10.0.3.15
SparkWorker1  workernode1          192.168.0.105
SparkWorker2  workernode2          192.168.0.106

Setup NAT port forwarding

On the SparkMaster VM, setup port forwarding:

The SparkMaster VM has an additional network adapter configured as NAT, which has IP address 10.0.3.15. In the VirtualBox network settings for that NAT adapter, set up port forwarding mapping host port 30 to guest port 22 (the ssh server port) on IP address 10.0.3.15. That way, I can connect to SparkMaster with PuTTY or ssh through host port 30. From SparkMaster, I can then connect on to SparkWorker1 and SparkWorker2, the other 2 VMs.

Additionally, I set up port forwarding for web access to Hadoop, YARN and Apache Spark:

Host port 31, map to guest port 50070 (Hadoop Web Access Port)
Host port 32, map to guest port 8088 (Yarn Web Access Port)
Host port 33, map to guest port 8080 (Spark Web Access Port)
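If you prefer the command line over the VirtualBox GUI, the same forwarding rules can be created with VBoxManage on the Windows host (a sketch, assuming the NAT adapter is adapter 2 of the SparkMaster VM and the VM is powered off):

cd C:\Program Files\Oracle\VirtualBox
VBoxManage modifyvm "SparkMaster" --natpf2 "ssh,tcp,,30,,22"
VBoxManage modifyvm "SparkMaster" --natpf2 "hadoop-web,tcp,,31,,50070"
VBoxManage modifyvm "SparkMaster" --natpf2 "yarn-web,tcp,,32,,8088"
VBoxManage modifyvm "SparkMaster" --natpf2 "spark-web,tcp,,33,,8080"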

Hadoop, Hive and Spark Setup Preparation

Use putty.exe (ssh) to connect to SparkMaster VM through port forwarding from host Windows.

Define /etc/hosts

Edit /etc/hosts and add the lines below:

192.168.0.104 masternode
192.168.0.105 workernode1
192.168.0.106 workernode2

Replicate /etc/hosts across masternode, workernode1 and workernode2
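One way to replicate it (a sketch, assuming the hadoop user has sudo rights on the workers) is to copy the file to a staging location and then move it into place:

scp /etc/hosts workernode1:/tmp/hosts
ssh -t workernode1 "sudo cp /tmp/hosts /etc/hosts"
scp /etc/hosts workernode2:/tmp/hosts
ssh -t workernode2 "sudo cp /tmp/hosts /etc/hosts"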

Set hostname to masternode, workernode1, workernode2

On 192.168.0.104 (to set as masternode)

sudo hostnamectl set-hostname masternode

On 192.168.0.105 (to set as workernode1)

sudo hostnamectl set-hostname workernode1

On 192.168.0.106 (to set as workernode2)

sudo hostnamectl set-hostname workernode2

Passwordless ssh setup

Passwordless ssh from masternode to masternode (itself), workernode1 and workernode2 for OS user hadoop

Configure ssh key-based authentication for the hadoop account by running the commands below. Press Enter to accept all defaults and leave the passphrase field blank so that ssh can log in automatically.

ssh-keygen -t rsa
ssh-copy-id masternode
ssh-copy-id workernode1
ssh-copy-id workernode2

Test passwordless ssh

From masternode

ssh workernode1

You should be able to log into workernode1 without a password.

From masternode

ssh workernode2

You should be able to log into workernode2 without a password.

Pre-requisite Installation

Log into masternode, workernode1 and workernode2

Install developer tools

sudo yum group install "Development Tools"

Install JDK 1.8 (Spark works with JDK 8)

Log into masternode, workernode1 and workernode2

mkdir ~/jdk

Download from

https://www.oracle.com/java/technologies/javase/javase8-archive-downloads.html

Place the downloaded file jdk-8u202-linux-x64.tar.gz in ~/jdk and expand the compressed tar archive:

cd ~/jdk
tar -xvzf jdk-8u202-linux-x64.tar.gz

JAVA_HOME is ~/jdk/jdk1.8.0_202

Test java and javac

cd ~/jdk/jdk1.8.0_202
bin/java -version
java version "1.8.0_202"
Java(TM) SE Runtime Environment (build 1.8.0_202-b08)
Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)
bin/javac -version
javac 1.8.0_202

Add the lines below to ~/.bash_profile on each of the VMs:

export JAVA_HOME=~/jdk/jdk1.8.0_202
export PATH=$JAVA_HOME/bin:$PATH

Log out and log back in.

Get Hadoop, Hive and Spark

Identify the proper Hadoop and Spark versions to install.

At the time of this writing, the latest release of Spark is Spark 3.0.0, prebuilt for Apache Hadoop 2.7.7.

Now we know which versions of Spark and Hadoop we need: Spark 3.0.0 and Hadoop 2.7.7 or greater.

Setup Hadoop

Since Hadoop needs to be 2.7.7 or greater, get Hadoop 2.9.2

https://mirrors.sonic.net/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz

Hadoop install

Log into masternode and run the command lines below:

cd ~
wget https://mirrors.sonic.net/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
tar -xvzf hadoop-2.9.2.tar.gz
ln -s hadoop-2.9.2 hadoop

Add the environment variables below to ~/.bash_profile:

export HADOOP_HOME=/opt/hadoop/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_CONF_DIR
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH

Log out and log back in.

Hadoop configuration:

$HADOOP_HOME/etc/hadoop/slaves

workernode1
workernode2

$HADOOP_HOME/etc/hadoop/core-site.xml

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://masternode:9000/</value>
</property>
</configuration>

$HADOOP_HOME/etc/hadoop/hdfs-site.xml

<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop/volume/datanode</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop/volume/namenode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

Note: since HDFS data lives under /opt/hadoop/volume, create the path /opt/hadoop/volume on each of the 3 VMs. Do not create the namenode and datanode sub-folders; they will be created automatically later.
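For example, once passwordless ssh is working, the directory can be created on all 3 nodes from masternode with a short loop (a sketch):

for node in masternode workernode1 workernode2; do
  ssh $node "mkdir -p /opt/hadoop/volume"
done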

$HADOOP_HOME/etc/hadoop/yarn-site.xml

<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>masternode</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4068</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4068</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1536</value>
</property>
</configuration>

Please note, these memory settings assume 8GB of total memory on each VM. If your VMs have a different amount of memory, adjust the values proportionally.

Now replicate /opt/hadoop/hadoop, including its contents and sub-folders, from masternode to workernode1 and workernode2.
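One way to replicate it (a sketch, assuming the workers use the same /opt/hadoop home) is to copy the versioned directory and recreate the symbolic link:

scp -r /opt/hadoop/hadoop-2.9.2 workernode1:/opt/hadoop/
ssh workernode1 "ln -s /opt/hadoop/hadoop-2.9.2 /opt/hadoop/hadoop"
scp -r /opt/hadoop/hadoop-2.9.2 workernode2:/opt/hadoop/
ssh workernode2 "ln -s /opt/hadoop/hadoop-2.9.2 /opt/hadoop/hadoop"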

Hive Setup

spark-sql works directly with a Hive installation through the thrift server built into Hive, so before setting up Spark, set up Hive as well.

At the time of this writing, the latest Hive version is 3.1.2:

http://mirrors.advancedhosters.com/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

Log into masternode and run the commands below:

cd ~
wget http://mirrors.advancedhosters.com/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
tar -xvzf apache-hive-3.1.2-bin.tar.gz
ln -s apache-hive-3.1.2-bin hive

Add the environment variables below to ~/.bash_profile:

export HIVE_HOME=/opt/hadoop/hive
export PATH=$HIVE_HOME/bin:$PATH

Log out and log back in

HIVE configuration

$HIVE_HOME/conf/hive-site.xml

The minimal required edit is to set the warehouse location (the default database directory) to a path under $HIVE_HOME:

<property>
<name>hive.metastore.warehouse.dir</name>
<value>/opt/hadoop/hive</value>
<description>location of default database for the warehouse</description>
</property>
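If $HIVE_HOME/conf/hive-site.xml does not exist yet, one way to create a minimal file containing just this property is a shell heredoc (a sketch; alternatively, copy and edit the full hive-default.xml.template shipped with Hive):

cd $HIVE_HOME/conf
cat > hive-site.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/opt/hadoop/hive</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>
EOF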

Spark Setup

At the time of this writing, Spark 3.0.0 on Hadoop 2.7 is the release we need.

https://mirrors.sonic.net/apache/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz

Log into masternode and run the commands below:

cd ~
wget https://mirrors.sonic.net/apache/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
tar -xvzf spark-3.0.0-bin-hadoop2.7.tgz
ln -s spark-3.0.0-bin-hadoop2.7 spark

Append the environment variables below to ~/.bash_profile:

export SPARK_HOME=/opt/hadoop/spark
export PATH=$SPARK_HOME/bin:$PATH

Log out and log back in

Now replicate ~/.bash_profile from masternode to workernode1 and workernode2:

scp ~/.bash_profile workernode1:/opt/hadoop/.bash_profile
scp ~/.bash_profile workernode2:/opt/hadoop/.bash_profile

Spark Configuration

$SPARK_HOME/conf/slaves

localhost
workernode1
workernode2

All 3 nodes, including masternode, run a Spark worker process; masternode also runs the Spark master process.

Create $SPARK_HOME/conf/hive-site.xml as a soft link to $HIVE_HOME/conf/hive-site.xml, so Spark knows there is a Hive installation on masternode:

cd $SPARK_HOME/conf
ln -s $HIVE_HOME/conf/hive-site.xml hive-site.xml

Replicate /opt/hadoop/spark from masternode to workernode1 and workernode2

Additionally, scp $HIVE_HOME/conf/hive-site.xml from masternode to workernode1 and workernode2
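A sketch of this replication, assuming the workers use the same /opt/hadoop layout; since Hive itself is not installed on the workers, I place the hive-site.xml copy in the Spark conf directory there:

for node in workernode1 workernode2; do
  scp -r /opt/hadoop/spark-3.0.0-bin-hadoop2.7 $node:/opt/hadoop/
  ssh $node "ln -s /opt/hadoop/spark-3.0.0-bin-hadoop2.7 /opt/hadoop/spark"
  scp $HIVE_HOME/conf/hive-site.xml $node:/opt/hadoop/spark/conf/
done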

Preparing to start up the Hadoop cluster and Hive service

Hadoop start up preparation

Format the HDFS

hdfs namenode -format

Hive start up preparation

Create the needed HDFS directory paths and set permissions:

hdfs dfs -mkdir /tmp
hdfs dfs -mkdir /tmp/hive/
hdfs dfs -mkdir /user/
hdfs dfs -mkdir /user/hive
hdfs dfs -mkdir /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod -R 777 /tmp

Initialize Hive metastore

schematool -dbType derby -initSchema -verbose

Launch Hadoop

start-dfs.sh
start-yarn.sh

Inspect Hadoop started processes

masternode runs Namenode, SecondaryNameNode and ResourceManager (yarn)

Log into masternode, run jps

25120 NameNode
25522 ResourceManager
25348 SecondaryNameNode
5659 Jps

workernode1 runs Hadoop Datanode and NodeManager

Log into workernode1, run jps

18275 NodeManager
18151 DataNode
30839 Jps

Workernode2 runs Hadoop Datanode and NodeManager

Log into workernode2, run jps

18275 NodeManager
18151 DataNode
16638 Jps

Launch Hive Service

cd $HIVE_HOME
nohup hive --service metastore &

Inspect Hive service

Log into masternode, run jps

25120 NameNode
25522 ResourceManager
25348 SecondaryNameNode
9752 Jps
26927 RunJar

masternode runs RunJar, which is the Hive service.

Launch Apache Spark Standalone Mode

$SPARK_HOME/sbin/start-all.sh

Log into masternode, run jps

25120 NameNode
25522 ResourceManager
25348 SecondaryNameNode
30200 Master
30265 Worker
9821 Jps
26927 RunJar

masternode runs Spark Master and Worker

Log into workernode1, run jps

23457 NodeManager
23286 DataNode
27229 Worker
5774 Jps

workernode1 runs Spark Worker

Log into workernode2, run jps

31265 Jps
21698 Worker
18275 NodeManager
18151 DataNode

workernode2 runs Spark Worker

I can connect to the Hadoop master node through its web interface (guest port 50070, forwarded to host port 31).

Connecting to the ResourceManager (YARN) through its web interface (guest port 8088, forwarded to host port 32) shows the 2 DataNodes.

Connecting to the Spark master through its web interface (guest port 8080, forwarded to host port 33) shows the 3 Workers.

Monitor Hadoop and Spark by SignalFx from Splunk

Sign up with SignalFx at http://www.signalfx.com and log into the SignalFx website with the username and password you signed up with. From “My Profile” you can get the 2 pieces of information you need:

<realm> and <API Access Token>

Setup Signalfx agent

Log into masternode and run the commands below:

curl -sSL https://dl.signalfx.com/signalfx-agent.sh > /tmp/signalfx-agent.sh;
sudo sh /tmp/signalfx-agent.sh --realm <realm> -- <API access token>

You need to replace <realm> and <API access token> with your actual realm and API access token from “My Profile”.

Log into workernode1 and run the commands below:

curl -sSL https://dl.signalfx.com/signalfx-agent.sh > /tmp/signalfx-agent.sh;
sudo sh /tmp/signalfx-agent.sh --realm <realm> -- <access token>

Log into workernode2 and run the commands below:

curl -sSL https://dl.signalfx.com/signalfx-agent.sh > /tmp/signalfx-agent.sh;
sudo sh /tmp/signalfx-agent.sh --realm <realm> -- <access token>

At the end of the agent install, you will see the confirmation below:

Complete!
The SignalFx Agent has been successfully installed.
Make sure that your system’s time is relatively accurate or else datapoints may not be accepted.
The agent’s main configuration file is located at /etc/signalfx/agent.yaml.

The SignalFx agent monitor configuration file is /etc/signalfx/agent.yaml.

Log into masternode and list /etc/signalfx:

ls /etc/signalfx/
agent.yaml api_url cluster ingest_url token

Among the files are the agent.yaml configuration file and the ingest_url file.

Display contents of /etc/signalfx/agent.yaml

cat /etc/signalfx/agent.yaml

# *Required* The access token for the org that you wish to send metrics to.
signalFxAccessToken: {"#from": "/etc/signalfx/token"}
ingestUrl: {"#from": "/etc/signalfx/ingest_url", default: "https://ingest.signalfx.com"}
apiUrl: {"#from": "/etc/signalfx/api_url", default: "https://api.signalfx.com"}
cluster: {"#from": "/etc/signalfx/cluster", optional: true}
intervalSeconds: 10
logging:
  # Valid values are 'debug', 'info', 'warning', and 'error'
  level: info
# observers are what discover running services in the environment
observers:
  - type: host
monitors:
  - {"#from": "/etc/signalfx/monitors/*.yaml", flatten: true, optional: true}
  - type: host-metadata
  - type: processlist
  - type: cpu
  - type: filesystems
  - type: disk-io
  - type: net-io
  - type: load
  - type: memory
  - type: vmem

By default, the agent daemon collects basic OS stats and sends them to SignalFx through the URL in the ingest_url file.

To monitor Hadoop and Spark, the following needs to be appended to /etc/signalfx/agent.yaml.

To monitor Hadoop, the agent collects metrics from the YARN web interface, which on masternode listens on port 8088 at 192.168.0.104.

The monitor parameter setting is:

monitors:
  - type: collectd/hadoop
    host: 192.168.0.104
    port: 8088

To monitor Spark running in Standalone mode (not managed by YARN), the monitor parameter setting is:

monitors:
  - type: collectd/spark
    host: 192.168.0.104
    port: 8080
    clusterType: Standalone
    isMaster: true
    collectApplicationMetrics: true

Run sudo vi /etc/signalfx/agent.yaml, append the Hadoop and Spark monitor entries, then save and exit. Note this only needs to be done on masternode. The monitors list should end up looking like this:

- {"#from": "/etc/signalfx/monitors/*.yaml", flatten: true, optional: true}
- type: host-metadata
- type: processlist
- type: cpu
- type: filesystems
- type: disk-io
- type: net-io
- type: load
- type: memory
- type: vmem
#Monitor Hadoop through Yarn at port 8088
- type: collectd/hadoop
host: 192.168.0.104
port: 8088
#Monitor Spark running in Standalone mode at port 8080- type: collectd/spark
host: 192.168.0.104
port: 8080
clusterType: Standalone
isMaster: true
collectApplicationMetrics: true

The SignalFx agent runs as a service on CentOS. Restart signalfx-agent on masternode:

sudo service signalfx-agent stop
sudo service signalfx-agent start
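To verify that the agent picked up the new monitors, you can check the agent's status output and the service log; a sketch, assuming a recent agent version that provides the status subcommand and the systemd unit name used by the installer:

sudo signalfx-agent status monitors
sudo journalctl -u signalfx-agent -n 50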

Monitor Hadoop and Spark metrics in dashboards in SignalFx

Log into signalfx.com and you will be able to see the built-in dashboards for Spark and Hadoop. You can create custom dashboards too.

Among the built-in dashboards are one that shows Hadoop MapReduce jobs and one that shows the Spark master process.

Launch Spark on Yarn

So far, Apache Spark has been launched in Standalone mode. Spark can also be launched on YARN, the ResourceManager of the Hadoop cluster. Note that Spark on YARN means there are no dedicated Spark master and worker processes, because YARN manages each Spark driver application submitted by spark-submit. A Spark driver application is, in the end, really a Java application.

To launch Spark on YARN, you do not need to run the Spark master and worker processes. Therefore, stop the Spark Standalone cluster by logging into masternode and running:

$SPARK_HOME/sbin/stop-all.sh

Configure Spark to be launched on Yarn

Define default parameters in spark-defaults.conf

Copy $SPARK_HOME/conf/spark-defaults.conf.template to $SPARK_HOME/conf/spark-defaults.conf

Set the parameters below in $SPARK_HOME/conf/spark-defaults.conf. The memory settings are based on 8GB per VM; adjust proportionally if your installed memory capacity is different.

spark.master yarn
spark.executor.memory 1g
spark.driver.memory 512m
spark.yarn.am.memory 512m

Then replicate $SPARK_HOME/conf/spark-defaults.conf from masternode to workernode1 and workernode2:

cd $SPARK_HOME/conf
scp spark-defaults.conf workernode1:/opt/hadoop/spark/conf/
scp spark-defaults.conf workernode2:/opt/hadoop/spark/conf/

SignalFx agent configuration modification

Now that Spark is ready to be launched on YARN, the Spark monitor in /etc/signalfx/agent.yaml needs to be updated:

monitors:
  - {"#from": "/etc/signalfx/monitors/*.yaml", flatten: true, optional: true}
  - type: host-metadata
  - type: processlist
  - type: cpu
  - type: filesystems
  - type: disk-io
  - type: net-io
  - type: load
  - type: memory
  - type: vmem
  # Monitor Hadoop through Yarn at port 8088
  - type: collectd/hadoop
    host: 192.168.0.104
    port: 8088
  # Monitor Spark running on Yarn, on Yarn port 8088
  - type: collectd/spark
    host: 192.168.0.104
    port: 8088
    clusterType: Yarn
    isMaster: true
    collectApplicationMetrics: true

Restart signalfx-agent service

sudo service signalfx-agent stop
sudo service signalfx-agent start

To test, simply launch spark-sql

spark-sql
Spark master: yarn, Application Id: application_1596433353347_0001
spark-sql>
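From the prompt, a trivial query is enough to confirm that the session works and that YARN accepted the application (example output; your application ID will differ):

spark-sql> select 1 + 1;
2
spark-sql> show databases;
default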

Monitor Spark on Yarn metrics in SignalFx dashboards

Log into signalfx.com

The Spark application running on YARN shows up in the dashboard.

Thank you for your time.


Written by George Jen

I am founder of Jen Tek LLC, a startup company in East Bay California developing AI powered, cloud based documentation/publishing software as a service.
