Build Twitter Scala API Library for Spark Streaming using sbt

George Jen, Jen Tek LLC
Jun 25, 2020

If you need to write Scala code that uses Apache Spark Streaming to stream tweets from Twitter, you will need to import the Twitter API library as below:

import org.apache.spark.streaming.twitter._

Since this library does not ship with Apache Spark, you will need to build its jar file and place that jar on the classpath.

Here is how.

To start, determine your Spark and Scala versions, which can be found by running $SPARK_HOME/bin/spark-shell:

$SPARK_HOME/bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/06/12 14:06:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://10.0.2.15:4040
Spark context available as 'sc' (master = local[*], app id = local-1591995983793).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.

In this example, the Spark version is 2.2.1 and the Scala version is 2.11.8.
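If you only want the version numbers, the same banner can be printed without opening an interactive session; a sketch assuming a standard Spark layout under $SPARK_HOME:

```shell
# Print Spark's version banner non-interactively; the output also names
# the Scala version Spark was built with (e.g. "Using Scala version 2.11.8").
$SPARK_HOME/bin/spark-submit --version
```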

To build the “twitter jar” file, you need to manually create a directory structure (called twitter in this example) that contains the subfolders project and src. The subfolder names are significant. The directory structure looks like this:

(base) [hadoop@master twitter]$ tree -a
.
├── build.sbt
├── project
│ ├── assembly.sbt
│ └── build.properties
└── src
└── main
└── scala
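The skeleton above can be created in one step; a minimal sketch, assuming you are in the directory where the project should live (the name twitter is just an example):

```shell
# Create the sbt project skeleton expected by the build.
# "project" and "src/main/scala" are significant names; "twitter" is arbitrary.
mkdir -p twitter/project twitter/src/main/scala
touch twitter/build.sbt twitter/project/assembly.sbt twitter/project/build.properties
```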

If you do not have the tree command on Linux (CentOS in this example), install it:

sudo yum install tree -y

then test that it works:

tree -d

There is one file in the root directory, called build.sbt, which you need to create:

vi build.sbt

// this file was written for Spark 2.2.1 and Scala 2.11.8
// George Jen
// Jen Tek LLC
version := "1"
name := "JentekLLC-spark-streaming-from-Twitter"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.2.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.1" % "provided"
libraryDependencies += "org.twitter4j" % "twitter4j-core" % "4.0.4"
libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "4.0.4"
libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter" % "2.2.1"
assemblyMergeStrategy in assembly := {
  case PathList("org", "aopalliance", xs @ _*) => MergeStrategy.last
  case PathList("javax", "inject", xs @ _*) => MergeStrategy.last
  case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
  case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
  case PathList("org", "apache", xs @ _*) => MergeStrategy.last
  case PathList("com", "google", xs @ _*) => MergeStrategy.last
  case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
  case PathList("com", "codahale", xs @ _*) => MergeStrategy.last
  case PathList("com", "yammer", xs @ _*) => MergeStrategy.last
  case "about.html" => MergeStrategy.rename
  case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
  case "META-INF/mailcap" => MergeStrategy.last
  case "META-INF/mimetypes.default" => MergeStrategy.last
  case "plugin.properties" => MergeStrategy.last
  case "log4j.properties" => MergeStrategy.last
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}

You need to set your Scala version; in this example it is 2.11.8, but yours might be different:

scalaVersion := "2.11.8"

Then fill in your Spark version in these lines (mine is 2.2.1):

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.2.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.1" % "provided"
libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter" % "2.2.1"

Save and exit
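As an aside, the %% in those dependency lines makes sbt append the Scala binary version (the first two version components) to the artifact name, so "org.apache.spark" %% "spark-core" with scalaVersion 2.11.8 resolves to spark-core_2.11. A small hypothetical helper illustrates how that suffix is derived:

```shell
# Hypothetical helper: derive the Scala binary version (e.g. 2.11) that
# sbt's %% operator appends, from a full Scala version (e.g. 2.11.8).
scala_binary_version() {
  echo "$1" | cut -d. -f1-2
}

scala_binary_version "2.11.8"    # prints 2.11
scala_binary_version "2.12.10"   # prints 2.12
```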

Under the project subfolder, there are two files: assembly.sbt and build.properties.

vi project/assembly.sbt

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

Save and exit

vi project/build.properties

sbt.version=1.3.6

Save and exit

You do not need to place any source code files in the src/main/scala folder. We only need sbt to download the required Twitter jar files from the Maven repository and assemble them into one combined jar file.

If you do not have sbt installed, you can get it at

https://www.scala-sbt.org/download.html

Download the zip file, sbt-1.3.10.zip (note that the file name may change over time):

mkdir ~/sbt
cd ~/sbt
wget https://piccolo.link/sbt-1.3.10.zip
unzip sbt-1.3.10.zip
sudo cp sbt/bin/sbt /usr/local/bin

Now, you should be able to build the Twitter jar file using sbt:

cd ~/twitter
#Run tree to verify you have the right
#directory structure with the right files
(base) [hadoop@master twitter]$ tree -a
.
├── build.sbt
├── project
│ ├── assembly.sbt
│ └── build.properties
└── src
└── main
└── scala
#Now you run sbt at the root of the directory
#that has the build.sbt file
sbt assembly

The first time, it will take a while and download a lot of files; eventually, it will show something like the below:
(base) [hadoop@master twitter]$ sbt assembly
[info] Loading settings for project twitter-build from assembly.sbt ...
[info] Loading project definition from /opt/hadoop/sbt/twitter/project
[info] Loading settings for project twitter from build.sbt ...
[info] Set current project to JentekLLC-spark-streaming-from-Twitter (in build file:/opt/hadoop/sbt/twitter/)
[info] Strategy 'discard' was applied to 12 files (Run the task at debug level to see details)
[info] Strategy 'last' was applied to a file (Run the task at debug level to see details)
[info] Strategy 'rename' was applied to 3 files (Run the task at debug level to see details)
[success] Total time: 8 s, completed Jun 12, 2020 3:47:03 PM

It produces a new directory called target, with a subdirectory called scala-<version>; simply cd into it:

(base) [hadoop@master twitter]$ ls
build.sbt project src target
(base) [hadoop@master twitter]$ cd target
(base) [hadoop@master target]$ ls
scala-2.11 streams
(base) [hadoop@master target]$ cd scala*
(base) [hadoop@master scala-2.11]$ pwd
/opt/hadoop/sbt/twitter/target/scala-2.11

Now you should see the twitter jar file that was produced:

(base) [hadoop@master scala-2.11]$ ls
JentekLLC-spark-streaming-from-Twitter-assembly-1.jar  update

Then copy the jar file into $SPARK_HOME/jars/, which is the default classpath for Java libraries:

cp JentekLLC-spark-streaming-from-Twitter-assembly-1.jar $SPARK_HOME/jars/

Then verify that JentekLLC-spark-streaming-from-Twitter-assembly-1.jar is in $SPARK_HOME/jars/:

ls $SPARK_HOME/jars/Jentek*
/opt/spark/jars/JentekLLC-spark-streaming-from-Twitter-assembly-1.jar
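As an alternative to copying into $SPARK_HOME/jars/, the assembly jar can be supplied per invocation with Spark's --jars flag; a sketch using the path from this walkthrough:

```shell
# Put the assembly jar on the classpath for a single session instead of
# installing it globally under $SPARK_HOME/jars/.
$SPARK_HOME/bin/spark-shell \
  --jars /opt/hadoop/sbt/twitter/target/scala-2.11/JentekLLC-spark-streaming-from-Twitter-assembly-1.jar
```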

Since we are going to load Twitter tweets into a Hive table, set the permissions on HDFS /tmp/hive:

hdfs dfs -chmod 777 /tmp/hive
hdfs dfs -ls /tmp/
drwx------ - hadoop supergroup 0 2020-05-09 14:16 /tmp/hadoop-yarn
drwxrwxrwx - hadoop supergroup 0 2018-05-05 22:00 /tmp/hive

Assuming you have already started Hadoop, Hive, Spark, and the Jupyter Notebook server in the virtualenv spark (activated with conda activate spark; if not, visit the relevant section of this eBook for directions), launch your web browser, connect to the Jupyter notebook server, start a new notebook with the Scala Spylon kernel, and run the code below:

https://github.com/geyungjen/jentekllc/blob/master/Spark/Scala/STREAMING/twitter_streaming_project_code.ipynb

You need to replace the placeholders with your own Twitter consumer key, consumer secret, access token, and access token secret. If you need direction, view my video presentation below:
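For orientation, the credential wiring in such a notebook generally follows the pattern below. This is a minimal standalone sketch, not the notebook's exact code; the four OAuth strings are placeholders, and spark-streaming-twitter reads them from twitter4j system properties when you pass None for the authorization:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter._

object TweetStreamSketch {
  def main(args: Array[String]): Unit = {
    // Placeholders: substitute your own Twitter app credentials.
    System.setProperty("twitter4j.oauth.consumerKey", "YOUR_CONSUMER_KEY")
    System.setProperty("twitter4j.oauth.consumerSecret", "YOUR_CONSUMER_SECRET")
    System.setProperty("twitter4j.oauth.accessToken", "YOUR_ACCESS_TOKEN")
    System.setProperty("twitter4j.oauth.accessTokenSecret", "YOUR_ACCESS_TOKEN_SECRET")

    val conf = new SparkConf().setAppName("TweetStreamSketch").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // None tells TwitterUtils to build OAuth from the system properties above.
    val tweets = TwitterUtils.createStream(ssc, None)
    tweets.map(_.getText).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```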

Thank you for your time viewing this writing and the step-by-step how-to video.


George Jen

I am the founder of Jen Tek LLC, a startup company in the East Bay, California, developing AI-powered, cloud-based documentation/publishing software as a service.