Quick and easy setup of GraphFrames with Apache Spark for Python

Abstract

Set up and configure GraphFrames for PySpark.

The usual route is to let pyspark resolve the package at launch with --packages, but this now fails: the Bintray repository that hosted Spark Packages has been shut down, so the resolver cannot reach the jar:

pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
...
https://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.6.0-spark2.3-s_2.11/graphframes-0.6.0-spark2.3-s_2.11.jar

::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: graphframes#graphframes;0.6.0-spark2.3-s_2.11: not found
::::::::::::::::::::::::::::::::::::::::::::::
...
The workaround is to download a jar matching your Spark and Scala versions (here graphframes 0.8.1 built for Spark 3.0 and Scala 2.12) and place it in a local directory:

(base) user@users-Mac-mini jars % ls
graphframes-0.8.1-spark3.0-s_2.12.jar

The jar also bundles the graphframes Python package; extract that directory so it can be added to the Python path later:

cd ~/jars
jar -xvf graphframes-0.8.1-spark3.0-s_2.12.jar graphframes
Upgrade pip and install pyspark:

pip install --upgrade pip
pip install pyspark
# the pyspark option "--jars ~/jars/graphframes-0.8.1-spark3.0-s_2.12.jar" lets Py4J locate the Java class APIs
(base) user@users-Mac-mini ~ % pyspark --jars ~/jars/graphframes-0.8.1-spark3.0-s_2.12.jar
Python 3.8.8 (default, Apr 13 2021, 12:59:45)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
21/06/07 00:35:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/
Using Python version 3.8.8 (default, Apr 13 2021 12:59:45)
SparkSession available as 'spark'.
>>> import sys
>>> sys.path.append("/Users/user/jars")
>>> import graphframes
>>> v = spark.createDataFrame([
... ("a", "Alice", 34),
... ("b", "Bob", 36),
... ("c", "Charlie", 30),
... ], ["id", "name", "age"])
>>> e = spark.createDataFrame([
... ("a", "b", "friend"),
... ("b", "c", "follow"),
... ("c", "b", "follow"),
... ], ["src", "dst", "relationship"])
>>> from graphframes import *
>>> g = GraphFrame(v, e)
>>> g.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
| c| 1|
| b| 2|
+---+--------+
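For intuition (this sketch is not in the original post): inDegrees simply counts, for each vertex, the edges that arrive at it, i.e. it groups the edge list by dst. The same computation on the example edge list in plain Python:

```python
from collections import Counter

# Example edge list from above: (src, dst, relationship)
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
]

# inDegree of a vertex = number of edges whose dst is that vertex;
# vertices with no incoming edges (like "a") do not appear, matching
# the GraphFrame output above.
in_degrees = Counter(dst for _, dst, _ in edges)
print(dict(in_degrees))  # {'b': 2, 'c': 1}
```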

Making GraphFrames work with plain Python

(base) user@users-Mac-mini /Applications % pyspark
Python 3.8.8 (default, Apr 13 2021, 12:59:45)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
21/06/07 16:52:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/
Using Python version 3.8.8 (default, Apr 13 2021 12:59:45)
SparkSession available as 'spark'.
To use GraphFrames from a plain python interpreter instead of the pyspark shell, point the environment at the Spark installation and its bundled Py4J, then copy the graphframes jar into Spark's jars directory:

export SPARK_HOME=/Users/user/spark
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
cp ~/jars/graphframes-0.8.1-spark3.0-s_2.12.jar $SPARK_HOME/jars/
(base) user@users-Mac-mini ~ % python
Python 3.8.8 (default, Apr 13 2021, 12:59:45)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path.append("/Users/user/jars")
>>> import graphframes
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("SQL example")\
... .master("local[*]")\
... .config("spark.sql.warehouse.dir", "/tmp")\
... .getOrCreate()
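If you would rather not copy the jar into $SPARK_HOME/jars/, an alternative (not used in this walkthrough, but standard Spark configuration) is to register the jar per-session through the spark.jars property when building the session; the path below is an example:

```python
from pyspark.sql import SparkSession

# Alternative to copying the jar into $SPARK_HOME/jars/: point this
# session at the graphframes jar explicitly (example path).
spark = (
    SparkSession.builder
    .appName("SQL example")
    .master("local[*]")
    .config("spark.sql.warehouse.dir", "/tmp")
    .config("spark.jars", "/Users/user/jars/graphframes-0.8.1-spark3.0-s_2.12.jar")
    .getOrCreate()
)
```

You still need the extracted graphframes package on sys.path (or PYTHONPATH) for the Python import to succeed; spark.jars only covers the JVM side.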
21/06/07 17:06:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>>
>>> v = spark.createDataFrame([
... ("a", "Alice", 34),
... ("b", "Bob", 36),
... ("c", "Charlie", 30),
... ], ["id", "name", "age"])
>>> e = spark.createDataFrame([
... ("a", "b", "friend"),
... ("b", "c", "follow"),
... ("c", "b", "follow"),
... ], ["src", "dst", "relationship"])
>>> from graphframes import *
>>> g = GraphFrame(v, e)
>>> g.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
| c| 1|
| b| 2|
+---+--------+
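Before launching Python, it can help to sanity-check the manual layout. The helper below is hypothetical (not part of the original post): it just verifies that the downloaded jar and the graphframes package extracted from it are both where the steps above put them.

```python
import os

def check_graphframes_layout(jars_dir: str, jar_name: str) -> dict:
    """Report whether the manual GraphFrames layout is in place:
    the downloaded jar file and the 'graphframes' package directory
    extracted from it (via jar -xvf) in the same directory."""
    return {
        "jar_present": os.path.isfile(os.path.join(jars_dir, jar_name)),
        "package_extracted": os.path.isdir(os.path.join(jars_dir, "graphframes")),
    }
```

If package_extracted is False, `import graphframes` will fail even with the jar on Spark's classpath, since Py4J only exposes the Java side.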

Summary

Installing GraphFrames via pyspark --packages no longer works because the Bintray repository that hosted the package has been shut down. The workaround is to download a graphframes jar matching your Spark and Scala versions and extract the bundled graphframes Python package from it. For the pyspark shell, pass the jar with --jars and append the extracted package's directory to sys.path. For a plain python interpreter, set SPARK_HOME, PATH, and PYTHONPATH, and copy the jar into $SPARK_HOME/jars/.

I am founder of Jen Tek LLC, a startup company in East Bay California developing AI powered, cloud based documentation/publishing software as a service.
