Recently, I have been working with the Python API for Spark to use distrbuted computing techniques to perform analytics at scale. When you write Spark code in Scala or Java, you can bundle your dependencies in the jar file that you submit to Spark. However, when writing Spark code in Python, dependency management becomes more difficult because each of the Spark executor nodes performing computations needs to have all of the Python dependencies installed locally.
Typically, Python deals with dependencies using pip and virtualenv. However, even if you follow this convention, you will still need to install your Spark code dependencies on each Spark executor machine in the cluster.
A way around this is to bundle the dependencies in a zip file and pass them to Spark when you submit your job using the
--py-files flag. The command will look something like this:
$ /path/to/spark-submit --py-files deps.zip my_spark_job.py
deps.zip is easiest if you use virtualenvwrapper. If you don’t have virtualenvwrapper set up already, I like this guide to get started. When you install dependencies within a virtualenv via
pip, they are placed in the folder
$VIRTUAL_ENV also becomes available as a bash variable if you are using virtualenvwrapper have used
workon <env_name>. To create the
cd into the
site-packages folder and run:
$ zip -r /target/path/for/deps.zip .
This command will zip all files and folders in
site-packages at the top level of
deps.zip. This distinction is worth noting because the files and folders must appear at the top level when
unzipped. For make sure this works, create file from inside the
site-packages folder – do not zip a folder containing all of the files and folders. If you run the zip command properly, you will see
$ zip -r ../deps.zip .
adding: __init__.py (stored 0%)
adding: _markerlib/ (stored 0%)
adding: _markerlib/__init__.py (deflated 55%)
adding: _markerlib/__init__.pyc (deflated 60%)
adding: _markerlib/markers.py (deflated 63%)
adding: _markerlib/markers.pyc (deflated 61%)
adding: aniso8601/ (stored 0%)
$ zip -r deps.zip folder_name/*
adding: folder_name/__init__.py (stored 0%)
adding: folder_name/_markerlib/ (stored 0%)
adding: folder_name/_markerlib/__init__.py (deflated 55%)
adding: folder_name/_markerlib/__init__.pyc (deflated 60%)
adding: folder_name/_markerlib/markers.py (deflated 63%)
adding: folder_name/_markerlib/markers.pyc (deflated 61%)
adding: folder_name/aniso8601/ (stored 0%)
Note: OSX obfuscates this distinction when you unzip
deps.zip in Finder. For the former case, OSX will unzip all of the files and folders to a new folder with the same name as the zip file. For example, if your zip file is named
my_deps.zip, OSX will create a folder named
my_deps and unzip the contents of
my_deps.zip to that folder. For the later case, also unzipping with Finder, OSX will unzip the contents as they were zipped, yielding a folder named
folder_name. The results are similar, but only the former case will work when you zipping dependencies for Spark. The distinction becomes more obvious if you use
unzip, as the former case will extract all files and folders to the current working directory, while the latter case will extract to a folder containing those same files and folders in the current working directory.
You should be ready to run PySpark jobs in a “jarified” way.
Afternote: I’ve run into issues getting
boto3 to run on a remote Spark cluster using this method.