Install Apache Spark with Python or Ruby-Spark on OS X Yosemite
Enjoy the power of Spark in a local REPL shell, including Ruby.
Having recently joined “EdX’s BerkeleyX CS 100.1X: Introduction to Big Data with Apache Spark”, I am excited to learn how much we can do with this cluster computing framework. I followed Sourabh Bajaj’s tutorial to set up VirtualBox and Vagrant, then extracted and ran the course Vagrantfile. After that, a command like vagrant up --provider=virtualbox
will bring up the complete environment, accessible from an IPython Notebook in the browser.
It’s great to have VirtualBox running, but I would really like a separate local REPL environment to play with. It’s now possible to get the latest Spark with brew install scala sbt hadoop apache-spark
, then dev/change-version-to-2.11.sh
and finally build/sbt -Pyarn -Phadoop-2.6 -Dscala-2.11 assembly
(the last two are run from the root of the Spark source tree). Marek’s tutorial and Prabeesh’s tutorial were a great help here (John’s tutorial didn’t work for me).
Now your local Python Spark REPL is ready! You can test it with some code, such as Clark Updike’s example.
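Clark Updike’s example is not reproduced in this post. As a stand-in, here is a minimal sketch of the kind of thing you might paste into the pyspark shell: a word count built from the usual RDD calls. It assumes the shell has already defined a SparkContext named sc, which pyspark normally does. The helper names tokenize and word_counts are made up for this illustration.

```python
from operator import add


def tokenize(line):
    """Split one line of text into lowercase words."""
    return line.lower().split()


def word_counts(sc, lines):
    """Count words in `lines` using an existing SparkContext `sc`.

    `tokenize` and `word_counts` are hypothetical helper names for this
    sketch; the RDD methods themselves are standard PySpark.
    """
    return (sc.parallelize(lines)        # distribute the list as an RDD
              .flatMap(tokenize)         # one record per word
              .map(lambda w: (w, 1))     # pair each word with a count of 1
              .reduceByKey(add)          # sum the counts for each word
              .collectAsMap())           # bring the result back as a dict


# In the pyspark shell, where `sc` is already defined, try:
# word_counts(sc, ["to be or not to be", "to do"])
```

Nothing here imports pyspark directly, so the functions can be defined anywhere, but word_counts only does real work once you hand it the shell’s sc.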
Moreover, it’s now possible to use Apache Spark from Ruby with the Ruby-Spark gem, in addition to Java, Scala, Python or R. This fabulous gem does not support DataFrames yet, and its shell is based on the pry gem. One glitch I hit was pry’s incompatibility with pry-debugger in Ruby 2.2.2, so I had to run gem uninstall pry-debugger
and make sure pry worked on its own first.
After compiling for some time, your access to Apache Spark™ is granted! Let’s write some RDD code now. According to the Ruby-Spark gem author’s benchmark, starting the shell with SPARK_RUBY_SERIALIZER="oj" ruby-spark shell
uses oj as the serializer, which is faster than the default Marshal serializer. Passing anonymous functions also works differently from native Ruby. I don’t mind, though: it just works! Have a lot of fun!