Quick and easy GraphFrames setup with Apache Spark for Python

Abstract

Set up and configure GraphFrames for PySpark.

The usual `--packages` route fails: the dependency resolver tries to fetch the jar from the Bintray-hosted spark-packages repository (retired in mid-2021) and reports UNRESOLVED DEPENDENCIES:

pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
...
https://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.6.0-spark2.3-s_2.11/graphframes-0.6.0-spark2.3-s_2.11.jar

::::::::::::::::::::::::::::::::::::::::::::::
::          UNRESOLVED DEPENDENCIES         ::
::::::::::::::::::::::::::::::::::::::::::::::
:: graphframes#graphframes;0.6.0-spark2.3-s_2.11: not found
::::::::::::::::::::::::::::::::::::::::::::::
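Since the Bintray repository is gone, one possible workaround (assuming the community mirror at repos.spark-packages.org still hosts the artifact, which I have not verified here) is to point Spark at it explicitly with `--repositories`:

```shell
# Hypothetical workaround: resolve GraphFrames from the spark-packages
# mirror instead of the retired Bintray repository. The version shown
# matches the Spark 3.0 / Scala 2.12 build used later in this post.
pyspark \
  --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
  --repositories https://repos.spark-packages.org
```

The rest of this post takes the offline route instead: download the jar once and wire it in manually.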
Download graphframes-0.8.1-spark3.0-s_2.12.jar manually into ~/jars:

(base) user@users-Mac-mini jars % ls
graphframes-0.8.1-spark3.0-s_2.12.jar

Then extract the bundled graphframes Python package from the jar:

cd ~/jars
jar -xvf graphframes-0.8.1-spark3.0-s_2.12.jar graphframes
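The `jar -xvf` step above simply unzips the `graphframes/` directory (the Python package bundled inside the jar). If the JDK's `jar` tool is not on your PATH, a jar is just a zip archive, so the same extraction can be sketched with Python's standard zipfile module (file names here are illustrative):

```python
import zipfile

def extract_python_package(jar_path, package, dest="."):
    """Extract a bundled Python package (e.g. 'graphframes/') from a jar.

    A jar file is an ordinary zip archive, so zipfile reads it directly;
    only members under the requested package directory are extracted.
    """
    with zipfile.ZipFile(jar_path) as jar:
        members = [n for n in jar.namelist() if n.startswith(package + "/")]
        jar.extractall(dest, members=members)
        return members

# Example (assumes the jar sits in the current directory):
# extract_python_package("graphframes-0.8.1-spark3.0-s_2.12.jar", "graphframes")
```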
pip install --upgrade pip
pip install pyspark
# the pyspark option "--jars ~/jars/graphframes-0.8.1-spark3.0-s_2.12.jar" lets Py4J locate the GraphFrames Java class APIs
(base) user@users-Mac-mini ~ % pyspark --jars ~/jars/graphframes-0.8.1-spark3.0-s_2.12.jar
Python 3.8.8 (default, Apr 13 2021, 12:59:45)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
21/06/07 00:35:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/
Using Python version 3.8.8 (default, Apr 13 2021 12:59:45)
SparkSession available as 'spark'.
>>> import sys
>>> sys.path.append("/Users/user/jars")
>>> import graphframes
>>> v = spark.createDataFrame([
... ("a", "Alice", 34),
... ("b", "Bob", 36),
... ("c", "Charlie", 30),
... ], ["id", "name", "age"])
>>> e = spark.createDataFrame([
... ("a", "b", "friend"),
... ("b", "c", "follow"),
... ("c", "b", "follow"),
... ], ["src", "dst", "relationship"])
>>> from graphframes import *
>>> g = GraphFrame(v, e)
>>> g.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
| c| 1|
| b| 2|
+---+--------+
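For reference, `inDegrees` just counts incoming edges per destination vertex. The same table can be sanity-checked in plain Python (no Spark needed) from the edge list above:

```python
from collections import Counter

# Edge list from the example above: (src, dst, relationship)
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
]

# inDegree = number of edges arriving at each vertex (the dst column)
in_degrees = Counter(dst for _src, dst, _rel in edges)
print(in_degrees)  # Counter({'b': 2, 'c': 1})
```

This matches the `g.inDegrees.show()` output: b has two incoming "follow"/"friend" edges, c has one. Vertex a has no incoming edges, so it does not appear at all.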

Making GraphFrames work with plain Python

(base) user@users-Mac-mini /Applications % pyspark
Python 3.8.8 (default, Apr 13 2021, 12:59:45)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
21/06/07 16:52:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/
Using Python version 3.8.8 (default, Apr 13 2021 12:59:45)
SparkSession available as 'spark'.
To use GraphFrames from a plain python interpreter, set SPARK_HOME and PYTHONPATH so Python can find PySpark and Py4J, and copy the GraphFrames jar into Spark's jars directory:

export SPARK_HOME=/Users/user/spark
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
cp ~/jars/graphframes-0.8.1-spark3.0-s_2.12.jar $SPARK_HOME/jars/
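The py4j-0.10.9-src.zip entry on PYTHONPATH works because Python can import modules directly from zip archives. A small self-contained sketch of that mechanism, using a throwaway zip rather than the real Py4J archive:

```python
import os
import sys
import tempfile
import zipfile

# Build a throwaway zip containing one module, mimicking py4j-*-src.zip
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "demo-src.zip")
with zipfile.ZipFile(zip_path, "w") as z:
    z.writestr("demo_mod.py", "ANSWER = 42\n")

# Appending the zip to sys.path (which is what a PYTHONPATH entry does at
# interpreter startup) makes its modules importable via zipimport.
sys.path.append(zip_path)
import demo_mod
print(demo_mod.ANSWER)  # 42
```

The module name `demo_mod` is made up for illustration; the point is that a .zip on the path behaves like a directory of modules.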
(base) user@users-Mac-mini ~ % python
Python 3.8.8 (default, Apr 13 2021, 12:59:45)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path.append("/Users/user/jars")
>>> import graphframes
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("SQL example")\
... .master("local[*]")\
... .config("spark.sql.warehouse.dir", "/tmp")\
... .getOrCreate()
21/06/07 17:06:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>>
>>> v = spark.createDataFrame([
... ("a", "Alice", 34),
... ("b", "Bob", 36),
... ("c", "Charlie", 30),
... ], ["id", "name", "age"])
>>> e = spark.createDataFrame([
... ("a", "b", "friend"),
... ("b", "c", "follow"),
... ("c", "b", "follow"),
... ], ["src", "dst", "relationship"])
>>> from graphframes import *
>>> g = GraphFrame(v, e)
>>> g.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
| c| 1|
| b| 2|
+---+--------+

Summary

Since GraphFrames can no longer be resolved with `--packages` from the retired Bintray repository, download the jar manually instead. For the pyspark shell, pass the jar with `--jars` and append the jar's directory to sys.path so the Python package imports. For a plain python interpreter, set SPARK_HOME and PYTHONPATH, copy the jar into $SPARK_HOME/jars, and build the SparkSession yourself.
