Quick and easy setup of graphframes with Apache Spark for Python


George Jen, Jen Tek LLC

Abstract

Apache Spark natively includes a graph computing library called GraphX, a distributed graph processing framework. It is built on the Spark platform and provides a simple, easy-to-use, and rich interface for graph computing and graph mining, which goes a long way toward meeting the demand for distributed graph processing.

However, GraphX currently supports only Scala: you need to write Scala code to import and invoke the GraphX APIs, because Scala is native to Spark, Spark itself being written in Scala. This is so even though Spark (apart from the GraphX module) supports Python, Java, and R in addition to Scala.

If you want to build a graph application that uses Apache Spark as a big data engine from another Spark-supported programming language, such as Python or Java, you have an alternative: a library called graphframes. (You can also import and invoke the graphframes library from Scala if you want to.)

In this writing, I will show you how to quickly set up graphframes for Python; I will cover Java and Scala later. Is there no straightforward way to set up graphframes today? You are correct, and that is why there is a market for this writing.

Set up and configure graphframes for pyspark

According to the graphframes install documentation, assuming you already have pyspark, you simply run the command below on your OS command line:

pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

This used to work, but no longer does. Running the above command produces the following error:

...
https://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.6.0-spark2.3-s_2.11/graphframes-0.6.0-spark2.3-s_2.11.jar

::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: graphframes#graphframes;0.6.0-spark2.3-s_2.11: not found
::::::::::::::::::::::::::::::::::::::::::::::
...

The reason is that the graphframes package is no longer hosted on bintray.com; in simple terms, the following URL no longer works:

https://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.6.0-spark2.3-s_2.11/graphframes-0.6.0-spark2.3-s_2.11.jar

There have been no clear instructions on the internet.

I have managed to set up graphframes and have summarized the steps into clear, straightforward instructions. If you want to set up graphframes for Python, read on!

To demonstrate, I use a Mac mini with a freshly installed macOS Catalina; you can extrapolate the steps to Windows and Linux. Start with Anaconda Python for a recent version of Python 3. If you already have Python 3 and JDK 8, you may skip the Anaconda and JDK installation steps. Anaconda can be downloaded at:

https://www.anaconda.com/products/individual-b

Install the JDK. I recommend downloading and installing JDK 1.8, as Apache Spark works best with a late release of JDK 1.8. It can be downloaded at:

https://www.oracle.com/java/technologies/javase/javase8-archive-downloads.html
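
Once installed, you can verify the JDK release from Python as well as from a terminal (a quick check; expect a 1.8.x version per the recommendation above):

import subprocess
# java prints its version banner (to stderr); expect something like 1.8.0_xxx
subprocess.run(["java", "-version"])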

Download the graphframes jar file, since it cannot be installed directly from Maven. It is available at:

https://spark-packages.org/package/graphframes/graphframes

Choose the latest jar file; at the time of writing, that was Version: 0.8.1-spark3.0-s_2.12.

Create a folder called jars in the home directory (~/jars) and move the downloaded graphframes-0.8.1-spark3.0-s_2.12.jar there:

(base) user@users-Mac-mini jars % ls
graphframes-0.8.1-spark3.0-s_2.12.jar

~/jars/graphframes-0.8.1-spark3.0-s_2.12.jar is needed by Py4J, which enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine. Py4J is bundled with pyspark. Spark classes are Java objects, and Py4J allows Python to invoke them.
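
To see what Py4J is doing, here is a minimal sketch you can try once pyspark is launched with the --jars option shown later. Note that sparkContext._jvm is an internal pyspark attribute, used here only for illustration:

# Py4J gateway into the JVM that pyspark started
jvm = spark.sparkContext._jvm
# any class on Spark's classpath is reachable from Python
print(jvm.java.lang.System.getProperty("java.version"))
# with the graphframes jar on the classpath, the Java side resolves to a
# Py4J JavaClass; without the jar, this is only an unresolved JavaPackage
print(jvm.org.graphframes.GraphFrame)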

Next, for pyspark to import graphframes, you need to extract the graphframes Python library from graphframes-0.8.1-spark3.0-s_2.12.jar, which bundles both the Java and the Python libraries. The Java library starts with org.graphframes; the Python library starts with graphframes. A jar file is like a tar ball, so you can simply use "jar -xvf" to extract it. The following command extracts the graphframes folder from the jar file:

cd ~/jars
jar -xvf graphframes-0.8.1-spark3.0-s_2.12.jar graphframes
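
If the jar tool is not on your PATH, note that a jar file is also a zip archive, so the Python standard library can do the same extraction (a sketch, with paths assumed from the steps above):

# extract only the graphframes/ Python package from the jar (a zip archive)
import os
import zipfile

jar = os.path.expanduser("~/jars/graphframes-0.8.1-spark3.0-s_2.12.jar")
with zipfile.ZipFile(jar) as zf:
    members = [m for m in zf.namelist() if m.startswith("graphframes/")]
    zf.extractall(os.path.expanduser("~/jars"), members=members)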

~/jars/graphframes then needs to be on the Python search path, either via PYTHONPATH or sys.path.
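
For example, either of the following makes the package importable; the first is per-session, the second persists across sessions:

# option 1: per session, before "import graphframes"
import os, sys
sys.path.append(os.path.expanduser("~/jars"))

# option 2: persistent, in your shell profile (shell line shown as a comment):
#   export PYTHONPATH=$HOME/jars:$PYTHONPATH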

After the Anaconda Python 3 setup (python defaults to Anaconda Python 3) and the JDK installation are complete, open a terminal to install pyspark:

pip install --upgrade pip
pip install pyspark
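
You can confirm the install and its version from Python (the version pip resolves may differ from the one shown in this article):

import pyspark
print(pyspark.__version__)  # 3.0.1 at the time of writing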

Now we are ready to play with pyspark and graphframes:

# pyspark option "--jars ~/jars/graphframes-0.8.1-spark3.0-s_2.12.jar" is for Py4J to locate the Java class APIs
(base) user@users-Mac-mini ~ % pyspark --jars ~/jars/graphframes-0.8.1-spark3.0-s_2.12.jar
Python 3.8.8 (default, Apr 13 2021, 12:59:45)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
21/06/07 00:35:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/
Using Python version 3.8.8 (default, Apr 13 2021 12:59:45)
SparkSession available as 'spark'.
>>> import sys
>>> sys.path.append("/Users/user/jars")
>>> import graphframes
>>> v = spark.createDataFrame([
... ("a", "Alice", 34),
... ("b", "Bob", 36),
... ("c", "Charlie", 30),
... ], ["id", "name", "age"])
>>> e = spark.createDataFrame([
... ("a", "b", "friend"),
... ("b", "c", "follow"),
... ("c", "b", "follow"),
... ], ["src", "dst", "relationship"])
>>> from graphframes import *
>>> g = GraphFrame(v, e)
>>> g.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
| c| 1|
| b| 2|
+---+--------+
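
The same GraphFrame supports richer queries than degree counts; for example, motif finding and PageRank (the parameter values below are only illustrative):

# motif finding: pairs of vertices that follow each other in both directions
g.find("(a)-[e]->(b); (b)-[e2]->(a)").show()

# PageRank: resetProbability and maxIter values are illustrative
pr = g.pageRank(resetProbability=0.15, maxIter=10)
pr.vertices.select("id", "pagerank").show()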

Making graphframes work with plain Python

While pyspark works, you would run into errors if you ran the above code in plain python. Making graphframes work in plain Python provides better value, as most people use Python IDEs such as PyCharm or Jupyter notebooks.

The following additional configuration steps are needed to make python work with graphframes directly.

You need to install Spark.

Why do you need Spark separately? Because plain Python (unlike the pyspark launcher) needs Spark's default classpath, i.e., $SPARK_HOME/jars/.

Spark can be downloaded at:

https://spark.apache.org/downloads.html

Which version of Spark should you download? I checked the pyspark installed by pip in Anaconda:

(base) user@users-Mac-mini /Applications % pyspark
Python 3.8.8 (default, Apr 13 2021, 12:59:45)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
21/06/07 16:52:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/
Using Python version 3.8.8 (default, Apr 13 2021 12:59:45)
SparkSession available as 'spark'.

The pip-installed pyspark is Spark version 3.0.1; therefore, download the same version from:

https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz

Expand the downloaded spark-3.0.1-bin-hadoop2.7.tgz and rename the expanded folder from spark-3.0.1-bin-hadoop2.7 to ~/spark.
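
Equivalently, from Python (a sketch using only the standard library, run from the download directory):

# expand the Spark tarball and rename the folder to ~/spark
import os
import shutil
import tarfile

with tarfile.open("spark-3.0.1-bin-hadoop2.7.tgz") as t:
    t.extractall()
shutil.move("spark-3.0.1-bin-hadoop2.7", os.path.expanduser("~/spark"))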

Append the lines below to ~/.zshrc (macOS uses zsh, not bash):

export SPARK_HOME=/Users/user/spark
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH

Run source ~/.zshrc, or log out and log back in.
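
A quick check that the new environment is visible to Python (expect the path from the export above):

import os
print(os.environ.get("SPARK_HOME"))  # expect /Users/user/spark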

Copy the graphframes jar file to $SPARK_HOME/jars:

cp ~/jars/graphframes-0.8.1-spark3.0-s_2.12.jar $SPARK_HOME/jars/
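
To confirm the jar landed on Spark's default classpath (a quick check, assuming the paths above):

import glob, os
print(glob.glob(os.path.join(os.environ["SPARK_HOME"], "jars", "graphframes*")))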

Now graphframes works directly with Python:

(base) user@users-Mac-mini ~ % python
Python 3.8.8 (default, Apr 13 2021, 12:59:45)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path.append("/Users/user/jars")
>>> import graphframes
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("SQL example")\
... .master("local[*]")\
... .config("spark.sql.warehouse.dir", "/tmp")\
... .getOrCreate()
21/06/07 17:06:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>>
>>> v = spark.createDataFrame([
... ("a", "Alice", 34),
... ("b", "Bob", 36),
... ("c", "Charlie", 30),
... ], ["id", "name", "age"])
>>> e = spark.createDataFrame([
... ("a", "b", "friend"),
... ("b", "c", "follow"),
... ("c", "b", "follow"),
... ], ["src", "dst", "relationship"])
>>> from graphframes import *
>>> g = GraphFrame(v, e)
>>> g.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
| c| 1|
| b| 2|
+---+--------+

Summary

While Scala is a promising language for functional programming with Spark, more folks are proficient in Python and prefer to use Python to code applications that make Spark API calls. graphframes makes it possible to combine Spark SQL and graph computing in Python applications.

I hope you found this writing helpful; thank you for your time.
