Recently, I have been working with the Python API for Spark to use distributed computing techniques to perform analytics at scale. When you write Spark code in Scala or Java, you can bundle your dependencies in the jar file that you submit to Spark. When writing Spark code in Python, however, dependency management becomes more difficult because each of the Spark executor nodes performing computations needs to have all of the Python dependencies installed locally.

Typically, Python deals with dependencies using pip and virtualenv. However, even if you follow this convention, you still need to install your job's dependencies on each Spark executor machine in the cluster.

A way around this is to bundle the dependencies in a zip file and pass them to Spark when you submit your job using the --py-files flag. The command will look something like this:

$ /path/to/spark-submit --py-files deps.zip my_spark_job.py
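
As a point of reference, here is a minimal sketch of what my_spark_job.py might look like once deps.zip is shipped alongside it; the aniso8601 import and the RDD logic are illustrative assumptions rather than code from an actual job. Packages inside the zip become importable on the executors, and SparkContext.addPyFile offers an in-code alternative to the --py-files flag.

# my_spark_job.py -- an illustrative sketch, not a real job
from pyspark import SparkConf, SparkContext


def parse_duration_seconds(value):
    # aniso8601 lives inside deps.zip; the import happens on the executor,
    # which finds the package because --py-files put deps.zip on its path.
    import aniso8601
    return aniso8601.parse_duration(value).total_seconds()


if __name__ == "__main__":
    sc = SparkContext(conf=SparkConf().setAppName("deps-zip-example"))

    # Equivalent to passing --py-files, if you prefer to attach the zip in code:
    # sc.addPyFile("/path/to/deps.zip")

    durations = sc.parallelize(["PT1M", "PT2H", "PT30S"])
    print(durations.map(parse_duration_seconds).collect())

    sc.stop()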

Building deps.zip is easiest if you use virtualenvwrapper. If you don’t have virtualenvwrapper set up already, I like this guide to get started. When you install dependencies within a virtualenv via pip, they are placed in $VIRTUAL_ENV/lib/python2.7/site-packages, where $VIRTUAL_ENV is ~/Envs/<env_name>. $VIRTUAL_ENV is also available as a bash variable if you are using virtualenvwrapper and have run workon <env_name>. To create the deps.zip file, cd into the site-packages folder and run:

$ zip -r /target/path/for/deps.zip .

This command will zip all files and folders in site-packages at the top level of deps.zip. This distinction is worth noting because the files and folders must appear at the top level when deps.zip is unzipped. To make sure this works, create the zip file from inside the site-packages folder – do not zip a folder containing all of the files and folders. If you run the zip command properly, you will see

$ zip -r ../deps.zip .
adding: __init__.py (stored 0%)
adding: _markerlib/ (stored 0%)
adding: _markerlib/__init__.py (deflated 55%)
adding: _markerlib/__init__.pyc (deflated 60%)
adding: _markerlib/markers.py (deflated 63%)
adding: _markerlib/markers.pyc (deflated 61%)
adding: aniso8601/ (stored 0%)

...

rather than

$ zip -r deps.zip folder_name/*
adding: folder_name/__init__.py (stored 0%)
adding: folder_name/_markerlib/ (stored 0%)
adding: folder_name/_markerlib/__init__.py (deflated 55%)
adding: folder_name/_markerlib/__init__.pyc (deflated 60%)
adding: folder_name/_markerlib/markers.py (deflated 63%)
adding: folder_name/_markerlib/markers.pyc (deflated 61%)
adding: folder_name/aniso8601/ (stored 0%)

...

Note: OSX obscures this distinction when you unzip deps.zip in Finder. In the former case, OSX will unzip all of the files and folders into a new folder with the same name as the zip file. For example, if your zip file is named my_deps.zip, OSX will create a folder named my_deps and unzip the contents of my_deps.zip into it. In the latter case, also unzipping with Finder, OSX will unzip the contents as they were zipped, yielding a folder named folder_name. The results look similar, but only the former will work when zipping dependencies for Spark. The distinction is more obvious if you use zip and unzip on the command line: the former case extracts all files and folders into the current working directory, while the latter extracts a folder containing those same files and folders into the current working directory.
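
If you would rather script this step, the same archive can be built with Python’s zipfile module. The sketch below is only an illustration (the python2.7 path mirrors the one above and may differ on your machine); the important detail is that each arcname is computed relative to site-packages, so packages land at the top level of deps.zip just as they do when you run zip from inside that folder.

# build_deps.py -- a sketch of building deps.zip programmatically
import os
import zipfile

site_packages = os.path.join(
    os.environ["VIRTUAL_ENV"], "lib", "python2.7", "site-packages"
)

with zipfile.ZipFile("deps.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for root, dirs, files in os.walk(site_packages):
        for name in files:
            path = os.path.join(root, name)
            # Relative arcnames keep packages at the top level of the archive,
            # mirroring `cd site-packages && zip -r ../deps.zip .`
            zf.write(path, os.path.relpath(path, site_packages))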

You should be ready to run PySpark jobs in a “jarified” way.

Afternote: I’ve run into issues getting boto3 to run on a remote Spark cluster using this method.