---
title: "PySpark dependencies"
createdAt: "2015-11-09T07:19:00.000Z"
updatedAt: "2015-11-09T07:19:00.000Z"
tags: ["code","spark","python","pyspark"]
draft: false
aliases: ["/code/2015/11/10/pyspark.html","/posts/2015-11-09-pyspark"]
atUri: "at://did:plc:mracrip6qu3vw46nbewg44sm/site.standard.document/3mhnqy7r5tj2k"
readingTime: false
---

Recently, I have been working with the Python API for [Spark][pyspark] to use distrbuted computing techniques to perform analytics at scale. When you write Spark code in Scala or Java, you can bundle your dependencies in the jar file that you submit to Spark. However, when writing Spark code in Python, dependency management becomes more difficult because each of the Spark executor nodes performing computations needs to have all of the Python dependencies installed locally.

Typically, Python deals with dependencies using [pip][pip_link] and [virtualenv][virtualenv_link]. However, even if you follow this convention, you will still need to install your Spark code dependencies on each Spark executor machine in the cluster.

A way around this is to bundle the dependencies in a zip file and pass them to Spark when you submit your job using the `--py-files` flag. The command will look something like this:

```sh
$ /path/to/spark-submit --py-files deps.zip my_spark_job.py
```

Building the `deps.zip` is easiest if you use [virtualenvwrapper][virtualenvwrapper_link]. If you don't have virtualenvwrapper set up already, I like this [guide][venv_guide] to get started. When you install dependencies within a virtualenv via `pip`, they are placed in the folder `$VIRTUAL_ENV'/lib/python2.7/site-packages` where `$VIRTUAL_ENV` is `~/Envs/<env_name>`. `$VIRTUAL_ENV` also becomes available as a bash variable if you are using virtualenvwrapper have used `workon <env_name>`. To create the `deps.zip` file, `cd` into the `site-packages` folder and run:

```sh
$ zip -r /target/path/for/deps.zip .
```

This command will zip all files and folders in `site-packages` at the top level of `deps.zip`. This distinction is worth noting because the files and folders must appear at the top level when `deps.zip` is `unzip`ped. For make sure this works, create file from inside the `site-packages` folder -- _do not_ zip a _folder_ containing all of the files and folders. If you run the zip command properly, you will see

```sh
$ zip -r ../deps.zip .
adding: __init__.py (stored 0%)
adding: _markerlib/ (stored 0%)
adding: _markerlib/__init__.py (deflated 55%)
adding: _markerlib/__init__.pyc (deflated 60%)
adding: _markerlib/markers.py (deflated 63%)
adding: _markerlib/markers.pyc (deflated 61%)
adding: aniso8601/ (stored 0%)

...
```

rather than

```sh
$ zip -r deps.zip folder_name/*
adding: folder_name/__init__.py (stored 0%)
adding: folder_name/_markerlib/ (stored 0%)
adding: folder_name/_markerlib/__init__.py (deflated 55%)
adding: folder_name/_markerlib/__init__.pyc (deflated 60%)
adding: folder_name/_markerlib/markers.py (deflated 63%)
adding: folder_name/_markerlib/markers.pyc (deflated 61%)
adding: folder_name/aniso8601/ (stored 0%)

...
```

Note: OSX obfuscates this distinction when you unzip `deps.zip` in Finder. For the former case, OSX will unzip all of the files and folders to a new folder with the same name as the zip file. For example, if your zip file is named `my_deps.zip`, OSX will create a folder named `my_deps` and unzip the contents of `my_deps.zip` to that folder. For the later case, also unzipping with Finder, OSX will unzip the contents as they were zipped, yielding a folder named `folder_name`. The results are similar, but only the former case will work when you zipping dependencies for Spark. The distinction becomes more obvious if you use `zip` and `unzip`, as the former case will extract all files and folders to the current working directory, while the latter case will extract to a folder containing those same files and folders in the current working directory.

You should be ready to run PySpark jobs in a "jarified" way.

Afternote: I've run into issues getting [`boto3`](https://github.com/boto/boto3) to run on a remote Spark cluster using this method.

[pyspark]: http://spark.apache.org/docs/latest/api/python/
[pip_link]: https://pip.readthedocs.org/en/stable/
[virtualenv_link]: https://virtualenv.readthedocs.org/en/latest/
[virtualenvwrapper_link]: https://virtualenvwrapper.readthedocs.org/en/latest/
[venv_guide]: http://mkelsey.com/2013/04/30/how-i-setup-virtualenv-and-virtualenvwrapper-on-my-mac/