Create a 3-Node Hadoop and Spark Cluster Monitored by SignalFx from Splunk
George Jen, Jen Tek LLC
Summary
This writing provides step-by-step instructions to create a 3-node Hadoop and Hive cluster and a 3-node Spark cluster, running both in Standalone mode and managed by Hadoop's YARN, and to set up monitoring with SignalFx from Splunk, all on VirtualBox CentOS VMs hosted on a single Windows 10 PC. I hope it is helpful.
Objective and Task
I need to test a monitoring solution on a Hadoop and Spark installation, covering key Hadoop cluster and Spark application metrics as well as OS (Linux) indicators. There are several options to compare:
Ganglia
http://ganglia.sourceforge.net/
SignalFx from Splunk
In this exercise, the focus is on SignalFx.
Required
Nothing except a computer running Windows 10 with 32GB RAM, an Intel i7 CPU with 4 cores and 8 threads (the more recent the i7 generation, the better), and plenty of free hard drive space.
Tasks
Install VirtualBox onto the Windows 10 host
Download and install from
https://www.virtualbox.org/wiki/Downloads
Create VirtualBox VM with CentOS 7.1.
There are plenty of tutorials on the internet; one of them is here:
https://www.avoiderrors.com/install-centos-7-virtual-box/
After creating and starting up the CentOS VM, create OS user hadoop in group hadoop.
Set the home directory of OS user hadoop to /opt/hadoop, because the /home file system is too small for this exercise.
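A minimal sketch of creating this user inside the VM (run with root privileges; defaults other than the group and home directory are your choice):
sudo groupadd hadoop
# -m creates the home directory /opt/hadoop if it does not exist yet
sudo useradd -g hadoop -d /opt/hadoop -m hadoop
sudo passwd hadoop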
Locate VirtualBox VM files to be duplicated
After creating the CentOS 7.1 VirtualBox VM, shut it down and locate the VM disk file under Settings->Storage.
Since 3 VMs are needed, make 2 additional copies of the disk file and name the 3 files CentOS71master.vdi, CentOS71worker1.vdi and CentOS71worker2.vdi.
After copying, change the UUID of each of the 3 disk files, because all 3 copies share the same UUID even though the file names differ. VirtualBox will refuse to launch a VM if another disk with the same UUID is already registered.
Change VirtualBox UUIDs for all 3 VMs
On the Windows host, open a command window as administrator and run the commands below:
cd C:\Program Files\Oracle\VirtualBox
VBoxManage internalcommands sethduuid "D:\vm\CentOS71master.vdi"
VBoxManage internalcommands sethduuid "D:\vm\CentOS71worker1.vdi"
VBoxManage internalcommands sethduuid "D:\vm\CentOS71worker2.vdi"
Create 3 VMs
Create 3 VMs in VirtualBox named SparkMaster, SparkWorker1 and SparkWorker2, attached to D:\vm\CentOS71master.vdi, D:\vm\CentOS71worker1.vdi and D:\vm\CentOS71worker2.vdi respectively.
VirtualBox Settings
Additional VirtualBox VM settings on SparkMaster, SparkWorker1 and SparkWorker2
CPU and Memory
Assign 8GB RAM, 2 CPU for each VM.
Network
Each VM is attached to a NAT Network, an internal network on which all 3 VMs can reach each other and connect to the outside world, but cannot be reached from outside. In addition, on the SparkMaster VM I add a second network adapter configured as NAT and connect to it from the host machine through port forwarding. SparkMaster therefore serves as the bastion host, a jump box for reaching the other VMs on the internal network.
For background on NAT and NAT Network, see the Wikipedia article:
https://en.wikipedia.org/wiki/Network_address_translation
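If you prefer the command line over the VirtualBox GUI, here is a minimal sketch of this network layout from the Windows host (the NAT Network name SparkNet and the 192.168.0.0/24 range are my own choices; the VMs must be powered off when running modifyvm):
cd C:\Program Files\Oracle\VirtualBox
REM create the internal NAT Network shared by all 3 VMs
VBoxManage natnetwork add --netname SparkNet --network "192.168.0.0/24" --enable --dhcp on
REM attach adapter 1 of each VM to the NAT Network
VBoxManage modifyvm SparkMaster --nic1 natnetwork --nat-network1 SparkNet
VBoxManage modifyvm SparkWorker1 --nic1 natnetwork --nat-network1 SparkNet
VBoxManage modifyvm SparkWorker2 --nic1 natnetwork --nat-network1 SparkNet
REM add a second, plain NAT adapter on SparkMaster for port forwarding from the host
VBoxManage modifyvm SparkMaster --nic2 nat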
Find IP addresses of all 3 VMs
Start up all 3 VMs in VirtualBox, open a terminal in each VM, run the command below and record the IP addresses displayed:
sudo ifconfig
Here are the 3 VMs. The OS user is hadoop on all 3 machines, with the same password on each.
VM            Intended Hostname   Internal IP Address   NAT IP Address
SparkMaster   masternode          192.168.0.104         10.0.3.15
SparkWorker1  workernode1         192.168.0.105
SparkWorker2  workernode2         192.168.0.106
Setup NAT port forwarding
On the SparkMaster VM, set up port forwarding:
The SparkMaster VM has an additional network adapter configured as NAT, with IP address 10.0.3.15. In the VirtualBox network settings for that NAT adapter, set up a port forwarding rule that maps host port 30 to guest port 22 on 10.0.3.15. Port 22 is the ssh server port, so I can connect to SparkMaster with PuTTY or ssh via host port 30. From SparkMaster, I can then connect on to SparkWorker1 and SparkWorker2, the other 2 VMs.
Additionally, I set up port forwarding for web access to Hadoop, YARN and Apache Spark (a VBoxManage sketch of all the rules follows this list):
Host port 31, map to guest port 50070 (Hadoop Web Access Port)
Host port 32, map to guest port 8088 (Yarn Web Access Port)
Host port 33, map to guest port 8080 (Spark Web Access Port)
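If you prefer the command line to the GUI for these rules, a sketch with VBoxManage on the Windows host (the rule names are my own; --natpf2 refers to the second, NAT adapter of SparkMaster, and the VM should be powered off):
cd C:\Program Files\Oracle\VirtualBox
VBoxManage modifyvm SparkMaster --natpf2 "ssh,tcp,,30,10.0.3.15,22"
VBoxManage modifyvm SparkMaster --natpf2 "hdfs-web,tcp,,31,10.0.3.15,50070"
VBoxManage modifyvm SparkMaster --natpf2 "yarn-web,tcp,,32,10.0.3.15,8088"
VBoxManage modifyvm SparkMaster --natpf2 "spark-web,tcp,,33,10.0.3.15,8080"
Then, from the Windows host, PuTTY or ssh to localhost port 30 reaches sshd on SparkMaster.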
Hadoop, Hive and Spark Setup Preparation
Use putty.exe (ssh) from the Windows host to connect to the SparkMaster VM through the port forwarding set up above.
Define /etc/hosts
Edit /etc/hosts and add the lines below:
192.168.0.104 masternode
192.168.0.105 workernode1
192.168.0.106 workernode2
Replicate /etc/hosts across masternode, workernode1 and workernode2
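A minimal sketch of that replication from the SparkMaster VM (scp and ssh will still prompt for passwords here, since passwordless ssh is only set up in the next step):
# copy to a temporary location first, then move into place with sudo on each worker
scp /etc/hosts hadoop@192.168.0.105:/tmp/hosts
ssh -t hadoop@192.168.0.105 'sudo cp /tmp/hosts /etc/hosts'
scp /etc/hosts hadoop@192.168.0.106:/tmp/hosts
ssh -t hadoop@192.168.0.106 'sudo cp /tmp/hosts /etc/hosts'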
Set hostname to masternode, workernode1, workernode2
On 192.168.0.104 (to set as masternode)
sudo hostnamectl set-hostname masternode
On 192.168.0.105 (to set as workernode1)
sudo hostnamectl set-hostname workernode1
On 192.168.0.106 (to set as workernode2)
sudo hostnamectl set-hostname workernode2
Passwordless ssh setup
Set up passwordless ssh from masternode to masternode (itself), workernode1 and workernode2 for OS user hadoop.
Configure ssh key-based authentication for the hadoop account by running the commands below. Press Enter to accept all defaults and leave the passphrase field blank so that ssh can log in automatically.
ssh-keygen -t rsa
ssh-copy-id masternode
ssh-copy-id workernode1
ssh-copy-id workernode2
Test passwordless ssh
From masternode
ssh workernode1
You should be able to log into workernode1 without a password.
From masternode
ssh workernode2
You should be able to log into workernode2 without a password.
Pre-requisite Installation
Log into masternode, workernode1 and workernode2
Install developer tools
sudo yum group install "Development Tools"
Install JDK 1.8 (Spark works with JDK 8)
Log into masternode, workernode1 and workernode2
mkdir ~/jdk
Download from
https://www.oracle.com/java/technologies/javase/javase8-archive-downloads.html
Place the downloaded file jdk-8u202-linux-x64.tar.gz in ~/jdk and expand the compressed tar file:
cd ~/jdk
tar -xvzf jdk-8u202-linux-x64.tar.gz
JAVA_HOME is ~/jdk/jdk1.8.0_202
Test java and javac
cd ~/jdk/jdk1.8.0_202
bin/java -version
java version "1.8.0_202"
Java(TM) SE Runtime Environment (build 1.8.0_202-b08)
Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)
bin/javac -version
javac 1.8.0_202
Add the lines below to ~/.bash_profile on each of the VMs:
export JAVA_HOME=~/jdk/jdk1.8.0_202
export PATH=$JAVA_HOME/bin:$PATH
Log out and log back in.
Get Hadoop, Hive and Spark
Identify the proper Hadoop and Spark versions to install.
At the time of this writing, the latest release of Spark is Spark 3.0.0, prebuilt for Apache Hadoop 2.7.
So the versions we need are Spark 3.0.0 and Hadoop 2.7.7 or greater.
Setup Hadoop
Since Hadoop needs to be 2.7.7 or greater, get Hadoop 2.9.2
https://mirrors.sonic.net/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
Hadoop install
Log into masternode and run the command lines below:
cd ~
wget https://mirrors.sonic.net/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
tar -xvzf hadoop-2.9.2.tar.gz
ln -s hadoop-2.9.2 hadoop
Add the environment variables below to ~/.bash_profile:
export HADOOP_HOME=/opt/hadoop/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_CONF_DIR
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
Log out and log back in.
Hadoop configuration:
$HADOOP_HOME/etc/hadoop/slaves
workernode1
workernode2
$HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://masternode:9000/</value>
</property>
</configuration>
$HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop/volume/datanode</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop/volume/namenode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Note: since HDFS stores its data under /opt/hadoop/volume, create the path /opt/hadoop/volume on all 3 VMs. Do not create the namenode and datanode sub-folders; they will be created automatically later.
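A one-line sketch to create the directory on all 3 nodes from masternode (assuming /etc/hosts and passwordless ssh are already in place):
for h in masternode workernode1 workernode2; do ssh $h 'mkdir -p /opt/hadoop/volume'; done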
$HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>masternode</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4068</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4068</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1536</value>
</property>
</configuration>
Please note, these memory settings assume a total of 8GB of memory on each VM. If your VMs have a different amount of memory, adjust them proportionally; for example, with 16GB per VM you could roughly double yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb.
Now replicate /opt/hadoop/hadoop including its contents and its sub-folders from masternode to workernode1 and workernode2.
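A minimal sketch of that replication; the symbolic link is recreated on each worker so that /opt/hadoop/hadoop resolves the same way everywhere:
scp -r ~/hadoop-2.9.2 workernode1:/opt/hadoop/
scp -r ~/hadoop-2.9.2 workernode2:/opt/hadoop/
ssh workernode1 'ln -s /opt/hadoop/hadoop-2.9.2 /opt/hadoop/hadoop'
ssh workernode2 'ln -s /opt/hadoop/hadoop-2.9.2 /opt/hadoop/hadoop'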
Hive Setup
spark-sql works with a Hive installation directly through the thrift server built into Hive, so before setting up Spark, set up Hive as well.
At the time of this writing, the latest Hive version is 3.1.2:
http://mirrors.advancedhosters.com/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
Log into masternode and run the commands below:
cd ~
wget http://mirrors.advancedhosters.com/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
tar -xvzf apache-hive-3.1.2-bin.tar.gz
ln -s apache-hive-3.1.2-bin hive
Add the environment variables below to ~/.bash_profile:
export HIVE_HOME=/opt/hadoop/hive
export PATH=$HIVE_HOME/bin:$PATH
Log out and log back in
Hive configuration
$HIVE_HOME/conf/hive-site.xml
The minimal required edit is to set the warehouse location to a directory under $HIVE_HOME:
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/opt/hadoop/hive</value>
<description>location of default database for the warehouse</description>
</property>
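A fresh Hive install ships only template configuration files, so if $HIVE_HOME/conf/hive-site.xml does not exist yet, a minimal sketch to create it with just the property above:
cat > $HIVE_HOME/conf/hive-site.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/opt/hadoop/hive</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>
EOF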
Spark Setup
At the time of this writing, Spark 3.0.0 on Hadoop 2.7 is the release we need.
https://mirrors.sonic.net/apache/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
Log into masternode and run the commands below:
cd ~
wget https://mirrors.sonic.net/apache/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
tar -xvzf spark-3.0.0-bin-hadoop2.7.tgz
ln -s spark-3.0.0-bin-hadoop2.7 spark
Append the environment variables below to ~/.bash_profile:
export SPARK_HOME=/opt/hadoop/spark
export PATH=$SPARK_HOME/bin:$PATH
Log out and log back in
Now replicate ~/.bash_profile from masternode to workernode1 and workernode2:
scp ~/.bash_profile workernode1:/opt/hadoop/.bash_profile
scp ~/.bash_profile workernode2:/opt/hadoop/.bash_profile
Spark Configuration
$SPARK_HOME/conf/slaves
localhost
workernode1
workernode2
All 3 nodes, including masternode, run a Spark worker process; masternode also runs the Spark master process.
$SPARK_HOME/conf/hive-site.xml is a soft link to $HIVE_HOME/conf/hive-site.xml, so Spark knows there is a Hive installation on masternode:
cd $SPARK_HOME/conf
ln -s $HIVE_HOME/conf/hive-site.xml hive-site.xml
Replicate /opt/hadoop/spark from masternode to workernode1 and workernode2
Additionally, scp $HIVE_HOME/conf/hive-site.xml from masternode to workernode1 and workernode2
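A minimal sketch of that replication (paths follow the layout used on masternode; placing hive-site.xml under /opt/hadoop/hive/conf on the workers is my choice of destination):
scp -r ~/spark-3.0.0-bin-hadoop2.7 workernode1:/opt/hadoop/
scp -r ~/spark-3.0.0-bin-hadoop2.7 workernode2:/opt/hadoop/
ssh workernode1 'ln -s /opt/hadoop/spark-3.0.0-bin-hadoop2.7 /opt/hadoop/spark'
ssh workernode2 'ln -s /opt/hadoop/spark-3.0.0-bin-hadoop2.7 /opt/hadoop/spark'
ssh workernode1 'mkdir -p /opt/hadoop/hive/conf'
ssh workernode2 'mkdir -p /opt/hadoop/hive/conf'
scp $HIVE_HOME/conf/hive-site.xml workernode1:/opt/hadoop/hive/conf/
scp $HIVE_HOME/conf/hive-site.xml workernode2:/opt/hadoop/hive/conf/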
Preparing to start up the Hadoop cluster and Hive service
Hadoop start up preparation
Format the HDFS
hdfs namenode -format
Hive start up preparation
Create the needed HDFS directory paths and set permissions
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir /tmp/hive/
hdfs dfs -mkdir /user/
hdfs dfs -mkdir /user/hive
hdfs dfs -mkdir /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod -R 777 /tmp
Initialize Hive metastore
schematool -dbType derby -initSchema --verbose
Launch Hadoop
start-dfs.sh
start-yarn.sh
Inspect Hadoop started processes
masternode runs Namenode, SecondaryNameNode and ResourceManager (yarn)
Log into masternode, run jps
25120 NameNode
25522 ResourceManager
25348 SecondaryNameNode
5659 Jps
workernode1 runs Hadoop Datanode and NodeManager
Log into workernode1, run jps
18275 NodeManager
18151 DataNode
30839 Jps
workernode2 runs Hadoop Datanode and NodeManager
Log into workernode2, run jps
18275 NodeManager
18151 DataNode
16638 Jps
Launch Hive Service
cd $HIVE_HOME
nohup hive --service metastore &
Inspect Hive service
Log into masternode, run jps
25120 NameNode
25522 ResourceManager
25348 SecondaryNameNode
9752 Jps
26927 RunJar
masternode runs RunJar, which is the Hive metastore service
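An optional quick check that the metastore is accepting connections, assuming the default metastore port 9083:
ss -tln | grep 9083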
Launch Apache Spark Standalone Mode
$SPARK_HOME/sbin/start-all.sh
Log into masternode, run jps
25120 NameNode
25522 ResourceManager
25348 SecondaryNameNode
30200 Master
30265 Worker
9821 Jps
26927 RunJar
masternode runs Spark Master and Worker
Log into workernode1, run jps
23457 NodeManager
23286 DataNode
27229 Worker
5774 Jps
workernode1 runs Spark Worker
Log into workernode2, run jps
31265 Jps
21698 Worker
18275 NodeManager
18151 DataNode
workernode2 runs Spark Worker
I can connect to the Hadoop NameNode through its web interface, which shows the 2 DataNodes.
I can connect to the ResourceManager (YARN) through its web interface, which shows the 2 worker nodes.
I can connect to the Spark Master through its web interface, which shows the 3 Workers.
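From the Windows host, these web interfaces are reachable through the port forwarding rules set up earlier, for example:
http://localhost:31 (forwarded to guest port 50070, Hadoop NameNode web UI)
http://localhost:32 (forwarded to guest port 8088, YARN ResourceManager web UI)
http://localhost:33 (forwarded to guest port 8080, Spark Master web UI)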
Monitor Hadoop and Spark by SignalFx from Splunk
Sign up with SignalFx at
http://www.signalfx.com and log into the SignalFx website with the username and password you signed up with. From “My Profile” you can get the 2 pieces of information needed:
<realm> and <API Access Token>
Set up the SignalFx agent
Log into masternode and run the commands below:
curl -sSL https://dl.signalfx.com/signalfx-agent.sh > /tmp/signalfx-agent.sh;
sudo sh /tmp/signalfx-agent.sh --realm <realm> -- <API access token>
You need to replace <realm> and <API access token> with your actual realm and API access token from “My Profile”.
Log into workernode1 and run the commands below:
curl -sSL https://dl.signalfx.com/signalfx-agent.sh > /tmp/signalfx-agent.sh;
sudo sh /tmp/signalfx-agent.sh --realm <realm> -- <access token>
Log into workernode2 and run the commands below:
curl -sSL https://dl.signalfx.com/signalfx-agent.sh > /tmp/signalfx-agent.sh;
sudo sh /tmp/signalfx-agent.sh --realm <realm> -- <access token>
At the end of the agent install, you will see the confirmation below:
Complete!
The SignalFx Agent has been successfully installed.
Make sure that your system’s time is relatively accurate or else datapoints may not be accepted.
The agent’s main configuration file is located at /etc/signalfx/agent.yaml.
The SignalFx agent monitor configuration file is /etc/signalfx/agent.yaml.
Log into masternode and list /etc/signalfx:
ls /etc/signalfx/
agent.yaml api_url cluster ingest_url token
Among these files are agent.yaml and ingest_url.
Display contents of /etc/signalfx/agent.yaml
cat /etc/signalfx/agent.yaml
# *Required* The access token for the org that you wish to send metrics to.
signalFxAccessToken: {"#from": "/etc/signalfx/token"}
ingestUrl: {"#from": "/etc/signalfx/ingest_url", default: "https://ingest.signalfx.com"}
apiUrl: {"#from": "/etc/signalfx/api_url", default: "https://api.signalfx.com"}
cluster: {"#from": "/etc/signalfx/cluster", optional: true}
intervalSeconds: 10
logging:
# Valid values are 'debug', 'info', 'warning', and 'error'
level: info
# observers are what discover running services in the environment
observers:
- type: host
monitors:
- {"#from": "/etc/signalfx/monitors/*.yaml", flatten: true, optional: true}
- type: host-metadata
- type: processlist
- type: cpu
- type: filesystems
- type: disk-io
- type: net-io
- type: load
- type: memory
- type: vmem
By default, the agent collects basic OS stats and sends them to SignalFx through the URL in the ingest_url file.
To monitor Hadoop and Spark, append the following to /etc/signalfx/agent.yaml.
To monitor Hadoop, the agent collects metrics from the YARN web interface at port 8088. On masternode, the YARN web interface listens on port 8088 at 192.168.0.104.
Following is the monitor parameter setting:
monitors:
- type: collectd/hadoop
host: 192.168.0.104
port: 8088
To monitor Spark running in Standalone mode (not managed by YARN):
Following is the monitor parameter setting:
monitors:
- type: collectd/spark
host: 192.168.0.104
port: 8080
clusterType: Standalone
isMaster: true
collectApplicationMetrics: true
Run sudo vi /etc/signalfx/agent.yaml, append the Hadoop and Spark lines, then save and exit. Note: this only needs to be done on masternode. The monitors section should now look like this:
- {"#from": "/etc/signalfx/monitors/*.yaml", flatten: true, optional: true}
- type: host-metadata
- type: processlist
- type: cpu
- type: filesystems
- type: disk-io
- type: net-io
- type: load
- type: memory
- type: vmem
#Monitor Hadoop through Yarn at port 8088
- type: collectd/hadoop
host: 192.168.0.104
port: 8088
#Monitor Spark running in Standalone mode at port 8080
- type: collectd/spark
host: 192.168.0.104
port: 8080
clusterType: Standalone
isMaster: true
collectApplicationMetrics: true
The SignalFx agent runs as a service on CentOS. Restart signalfx-agent on masternode:
sudo service signalfx-agent stop
sudo service signalfx-agent start
Monitor Hadoop and Spark metrics in dashboards in SignalFx
Log into signalfx.com and you will be able to see the built-in dashboards for Spark and Hadoop. You can create custom dashboards too.
The built-in dashboards include a view of Hadoop MapReduce jobs and a view of the Spark master process.
Launch Spark on Yarn
So far Apache Spark has been launched in Standalone mode. Spark can also be launched on YARN, the ResourceManager of the Hadoop cluster. Note that Spark on YARN means there are no dedicated Spark master and worker processes, because YARN manages each Spark driver application submitted by spark-submit; a Spark driver application is, in the end, just a Java application.
Since Spark on YARN does not need the Spark master and worker processes, stop the Spark Standalone cluster by logging into masternode and running:
$SPARK_HOME/sbin/stop-all.sh
Configure Spark to be launched on Yarn
Define default parameters in spark-defaults.conf
Copy $SPARK_HOME/conf/spark-defaults.conf.template to $SPARK_HOME/conf/spark-defaults.conf
Set the parameters below in $SPARK_HOME/conf/spark-defaults.conf. The memory settings are based on 8GB per VM; scale them proportionally if your installed memory is different.
spark.master yarn
spark.executor.memory 1g
spark.driver.memory 512m
spark.yarn.am.memory 512m
Then replicate $SPARK_HOME/conf/spark-defaults.conf from masternode to workernode1 and workernode2
scp spark-defaults.conf workernode1:/opt/hadoop/spark/conf/
scp spark-defaults.conf workernode2:/opt/hadoop/spark/conf/
SignalFx agent configuration modification
Now that Spark is ready to be launched on YARN, update /etc/signalfx/agent.yaml so the Spark monitor points at YARN:
monitors:
- {"#from": "/etc/signalfx/monitors/*.yaml", flatten: true, optional: true}
- type: host-metadata
- type: processlist
- type: cpu
- type: filesystems
- type: disk-io
- type: net-io
- type: load
- type: memory
- type: vmem
#Monitor Hadoop through Yarn at port 8088
- type: collectd/hadoop
host: 192.168.0.104
port: 8088
#Monitor Spark running on Yarn, on Yarn port 8088
- type: collectd/spark
host: 192.168.0.104
port: 8088
clusterType: Yarn
isMaster: true
collectApplicationMetrics: true
Restart signalfx-agent service
sudo service signalfx-agent stop
sudo service signalfx-agent start
To test, simply launch spark-sql
spark-sql
Spark master: yarn, Application Id: application_1596433353347_0001
spark-sql>
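As an additional quick check, a batch job can be submitted to YARN with spark-submit; this sketch assumes the examples jar shipped with Spark 3.0.0:
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.0.jar 100
The resulting application should then appear in the YARN web interface and, like the spark-sql session, in the SignalFx Spark dashboard.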
Monitor Spark on Yarn metrics in SignalFx dashboards
Log into signalfx.com
The Spark application shows up in the dashboard for Spark on Yarn.
Thank you for your time.