Quick and easy setup graphframes with Apache Spark for Python

Abstract

Apache Spark natively includes a library for graph computing called GraphX, a distributed graph processing framework. It is built on the Spark platform and provides a simple, easy-to-use and rich interface for graph computing and graph mining, which greatly eases distributed graph processing. GraphFrames brings the same capability to Spark DataFrames, with a Python API.

Setup and configure graphframes for pyspark

According to the graphframes install documentation, assuming you already have pyspark, you simply run the command below at your OS command line:

pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
...
https://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.6.0-spark2.3-s_2.11/graphframes-0.6.0-spark2.3-s_2.11.jar

::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: graphframes#graphframes;0.6.0-spark2.3-s_2.11: not found
::::::::::::::::::::::::::::::::::::::::::::::
...

The dependency cannot be resolved because Bintray, which hosted the spark-packages repository, was shut down in 2021, so the jar has to be downloaded manually. Get a version matching your Spark and Scala versions (here, Spark 3.0 with Scala 2.12) and place it in a local directory such as ~/jars:
(base) user@users-Mac-mini jars % ls
graphframes-0.8.1-spark3.0-s_2.12.jar
cd ~/jars
# extract the graphframes Python package from the jar so it can be added to sys.path
jar -xvf graphframes-0.8.1-spark3.0-s_2.12.jar graphframes
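Since a jar is just a zip archive, the same extraction can also be done from Python with the standard-library zipfile module. This is a minimal sketch; extract_pkg is a hypothetical helper name, not part of graphframes:

```python
import os
import zipfile

def extract_pkg(jar_path, pkg, dest):
    """Extract a top-level package directory (e.g. 'graphframes') from a jar.

    A jar is just a zip archive, so zipfile can read it directly.
    """
    with zipfile.ZipFile(jar_path) as jar:
        # keep only entries under the requested package directory
        members = [n for n in jar.namelist() if n.startswith(pkg + "/")]
        jar.extractall(dest, members=members)
    return members

# Usage (same paths as in this article; adjust to your machine):
# extract_pkg(os.path.expanduser("~/jars/graphframes-0.8.1-spark3.0-s_2.12.jar"),
#             "graphframes", os.path.expanduser("~/jars"))
```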
pip install --upgrade pip
pip install pyspark
# pyspark option "--jars ~/jars/graphframes-0.8.1-spark3.0-s_2.12.jar" is for Py4J to locate the Java class APIs
(base) user@users-Mac-mini ~ % pyspark --jars ~/jars/graphframes-0.8.1-spark3.0-s_2.12.jar
Python 3.8.8 (default, Apr 13 2021, 12:59:45)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
21/06/07 00:35:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/
Using Python version 3.8.8 (default, Apr 13 2021 12:59:45)
SparkSession available as 'spark'.
>>> import sys
>>> sys.path.append("/Users/user/jars")
>>> import graphframes
>>> v = spark.createDataFrame([
... ("a", "Alice", 34),
... ("b", "Bob", 36),
... ("c", "Charlie", 30),
... ], ["id", "name", "age"])
>>> e = spark.createDataFrame([
... ("a", "b", "friend"),
... ("b", "c", "follow"),
... ("c", "b", "follow"),
... ], ["src", "dst", "relationship"])
>>> from graphframes import *
>>> g = GraphFrame(v, e)
>>> g.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
| c| 1|
| b| 2|
+---+--------+
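For intuition, inDegrees simply counts incoming edges per vertex. The same numbers can be reproduced from the edge list above with plain Python, no Spark needed:

```python
from collections import Counter

# Same edge list as the GraphFrame example above
edges = [("a", "b", "friend"),
         ("b", "c", "follow"),
         ("c", "b", "follow")]

# inDegree of a vertex = number of edges whose destination is that vertex
in_degrees = Counter(dst for _src, dst, _rel in edges)
print(dict(in_degrees))  # {'b': 2, 'c': 1}
```

Note that GraphFrames omits vertex "a" from the result, since it has no incoming edges.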

Making graphframes work with plain Python

While the pyspark shell works, you would run into an error if you tried to run the above code in plain Python. Making graphframes work in plain Python provides better value, since most people use Python IDEs such as PyCharm, or Jupyter notebooks.

First, set the following environment variables (for example in your shell profile) so that plain Python can locate Spark and Py4J, and copy the GraphFrames jar into Spark's jars directory:

export SPARK_HOME=/Users/user/spark
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
cp ~/jars/graphframes-0.8.1-spark3.0-s_2.12.jar $SPARK_HOME/jars/
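If you prefer not to touch your shell profile, the same environment can be prepared from inside Python before importing pyspark. This is a sketch assuming the paths used above; the Py4J version in the zip name varies by Spark release:

```python
import os
import sys

SPARK_HOME = "/Users/user/spark"  # path from this article; adjust to your install

os.environ["SPARK_HOME"] = SPARK_HOME
os.environ["PATH"] = os.environ.get("PATH", "") + os.pathsep + os.path.join(SPARK_HOME, "bin")

# Mirror the PYTHONPATH entries so 'import pyspark' resolves
for p in (os.path.join(SPARK_HOME, "python"),
          os.path.join(SPARK_HOME, "python", "lib", "py4j-0.10.9-src.zip")):
    if p not in sys.path:
        sys.path.insert(0, p)
```

The third-party findspark package automates essentially this same setup.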
(base) user@users-Mac-mini ~ % python
Python 3.8.8 (default, Apr 13 2021, 12:59:45)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path.append("/Users/user/jars")
>>> import graphframes
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("SQL example")\
... .master("local[*]")\
... .config("spark.sql.warehouse.dir", "/tmp")\
... .getOrCreate()
21/06/07 17:06:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>>
>>> v = spark.createDataFrame([
... ("a", "Alice", 34),
... ("b", "Bob", 36),
... ("c", "Charlie", 30),
... ], ["id", "name", "age"])
>>> e = spark.createDataFrame([
... ("a", "b", "friend"),
... ("b", "c", "follow"),
... ("c", "b", "follow"),
... ], ["src", "dst", "relationship"])
>>> from graphframes import *
>>> g = GraphFrame(v, e)
>>> g.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
| c| 1|
| b| 2|
+---+--------+
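GraphFrames supports richer queries than degree counts — for example, motif finding with g.find("(a)-[e]->(b); (b)-[e2]->(a)") returns mutually connected vertex pairs. On this tiny edge list the same pattern can be checked in plain Python, as a sanity check:

```python
# Same edge list as above
edges = [("a", "b", "friend"),
         ("b", "c", "follow"),
         ("c", "b", "follow")]

# Pairs (x, y) such that both x -> y and y -> x exist (the motif above)
edge_set = {(src, dst) for src, dst, _rel in edges}
mutual = sorted({tuple(sorted(p)) for p in edge_set if p[::-1] in edge_set})
print(mutual)  # [('b', 'c')]
```

Here only "b" and "c" follow each other, so they form the single mutual pair.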

Summary

While Scala is a promising language for functional programming with Spark, many more developers are proficient in Python and prefer to use it to write applications that make Spark API calls. GraphFrames makes it possible to combine Spark SQL and DataFrames to build graph computing applications in Python.

George Jen

I am founder of Jen Tek LLC, a startup company in East Bay California developing AI powered, cloud based documentation/publishing software as a service.