@tommycarpi
Last active September 3, 2021

Link Apache Spark with IPython Notebook

How to link Apache Spark 1.6.0 with IPython notebook (Mac OS X)

Tested with

Python 2.7, OS X 10.11.3 El Capitan, Apache Spark 1.6.0 & Hadoop 2.6

Download Apache Spark & Build it

Download Apache Spark and build it or download the pre-built version.

I suggest downloading the pre-built version with Hadoop 2.6.
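
If you go with the pre-built version, unpacking it is a single command. This is a sketch that assumes the Spark 1.6.0 / Hadoop 2.6 tarball was saved to ~/Downloads; adjust the paths to wherever you actually put it

cd ~/Downloads
tar -xzf spark-1.6.0-bin-hadoop2.6.tgz

The extracted folder contains the bin directory you will add to your PATH in the linking step below.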

Install Anaconda

Download and install Anaconda.

Install Jupyter

Once you have installed Anaconda, open your terminal and type

conda install jupyter
conda update jupyter
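
To confirm the install worked, you can ask Jupyter for its version (the exact output depends on the version you got)

jupyter --version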

Link Spark with IPython Notebook

Open terminal and type

echo "export PATH=$PATH:/path_to_downloaded_spark/spark-1.6.0/bin" >> .profile
echo "export PYSPARK_DRIVER_PYTHON=ipython" >> .profile
echo "export PYSPARK_DRIVER_PYTHON_OPTS='notebook' pyspark" >> .profile

Now source it to make the changes available in the current terminal

source ~/.profile

or Cmd+Q your terminal and reopen it.
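
As a quick sanity check (a sketch; the exact paths will differ on your machine), the new variables and the Spark binaries should now be visible from the shell

echo $PYSPARK_DRIVER_PYTHON
which pyspark

The first command should print ipython and the second should print the path to the bin folder you added above.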

Run IPython Notebook

Now, using your terminal, go to whatever folder you want and type pyspark. For example

cd Documents/my_spark_folder
pyspark

Now the IPython notebook should open in your browser.

To check whether Spark is correctly linked, create a new Python 2 notebook inside IPython Notebook, type sc and run that cell. You should see something like this

In [1]: sc
Out[1]: <pyspark.context.SparkContext at 0x1049bdf90>
@Nomii5007

Hello sir, I set both PYSPARK_DRIVER_PYTHON=ipython and PYSPARK_DRIVER_PYTHON_OPTS=notebook as environment variables, but when I ran the command PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS=notebook spark-1.6.1-bin-hadoop2.6\bin\pyspark --packages com.databricks:spark-csv_2.11:1.4.0 --master local[*] to start the notebook, it gave an error saying it is not recognized as an internal or external command. What should I do to make it work? I want to load spark-csv into my notebook.

@jtitusj commented Aug 2, 2016

Hello, is there a way to link Spark with Jupyter but using Scala instead of Python? The spark-kernel project could do that up to Spark 1.5, but I think they stopped development. I want to integrate Spark 2.0 with Jupyter using Scala.

@bsullins commented Nov 2, 2016

Thanks so much for this!

@bsullins commented Nov 2, 2016

I should add, this works w/ Spark 2.0.1 as well

@BethanyG commented Feb 4, 2017

Note: instead of ipython for PYSPARK_DRIVER_PYTHON, use jupyter. The project officially changed names to jupyter, and the ipython name triggers a warning - it will be deprecated soon. Otherwise, excellent instructions! The only set (out of 5+ I checked) that actually worked.
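
The change amounts to swapping the driver in the two ~/.profile lines from the guide; a sketch of the updated exports

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'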

@tommycarpi (Author)

@Nomii5007 you should not run the command that way. Follow the instructions, then move to the folder where you have your notebook (using the terminal) and type pyspark

@jtitusj sorry, I never used it with Scala, so I cannot be of any help :(

@bsullins you are welcome :)

@BethanyG yep, I wrote this guide because I could not find any that really worked, so after all the struggle I thought it was worth sharing. Anyway, thanks for the update. Could you please share the versions of Spark/Python/Hadoop etc. so that I can update the guide and give you credit?

@jerrytim commented Feb 8, 2017

Hello, I followed these instructions but it didn't work. My environment was Python 2.7, OS X 10.11.6 El Capitan, Apache Spark 2.1.0 & Hadoop 2.7 (pre-built version with Hadoop 2.7). I have used Anaconda and Jupyter for a long time, so I started from the step "Link Spark with IPython Notebook". After all that, I typed "pyspark" in my terminal in various folders but only got "command not found". Any idea? Thanks in advance.

@tommycarpi (Author)

Have you run source ~/.profile or closed and reopened the terminal? Otherwise, try updating conda and jupyter.

@hamedhsn

@Nomii5007
pass the packages when you run pyspark like:
pyspark --packages com.databricks:spark-avro_2.11:3.0.0,com.databricks:spark-redshift_2.11:2.0.1,com.databricks:spark-csv_2.11:1.5.0,com.amazonaws:aws-java-sdk-s3:1.11.73,com.amazonaws:aws-java-sdk-core:1.11.73
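
Putting that together with the guide above (a sketch: it assumes the exports from the linking step are already in ~/.profile and that the package versions match your Spark/Scala build), starting a notebook with spark-csv available looks like

cd Documents/my_spark_folder
pyspark --packages com.databricks:spark-csv_2.11:1.4.0 --master local[*]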

@vkannanaen

I have spark-2.1.1, hadoop 2.7, Python 3.6, Java etc. Still the same issue.
