8/20/2015

My Intro to Spark

Spark has a lot of momentum.  It works standalone or with Hadoop, runs on a laptop, a Hadoop cluster, or the cloud, and works with SQL, R, Java, Scala, Python, MLlib machine learning, and data frames.

Thought we'd better give it a try: http://spark.apache.org/

Quick Start: http://spark.apache.org/docs/latest/quick-start.html

Download: http://spark.apache.org/downloads.html

Training: https://spark-summit.org/2014/training

My first question: does Spark run on Windows?

Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It’s easy to run locally on one machine — all you need is to have java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.

As stated here: http://spark.apache.org/docs/latest/
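Just to see what "run locally on one machine" looks like, here is a minimal PySpark sketch - assuming Spark and Java are installed and pyspark is importable; the "local[*]" master setting and the app name are my own choices, not from the docs:

# Minimal local Spark smoke test (assumed setup: pyspark importable,
# e.g. launched through bin/pyspark or with SPARK_HOME/python on the path).
from pyspark import SparkContext

# "local[*]" runs Spark on this one machine using all available cores.
sc = SparkContext("local[*]", "LocalSmokeTest")

# Distribute a small range of numbers and sum them.
total = sc.parallelize(range(1000)).sum()
print(total)  # expect 499500

sc.stop()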

Found some articles on installing Spark on Windows:

How to build SPARK on Windows: https://docs.sigmoidanalytics.com/index.php/How_to_build_SPARK_on_Windows

Stack Overflow: http://stackoverflow.com/questions/25481325/how-to-set-up-spark-on-windows

How to run Apache Spark on Windows7 in standalone mode: http://nishutayaltech.blogspot.in/2015/04/how-to-run-apache-spark-on-windows7-in.html

How to build SPARK on Windows: http://arindampaul.tumblr.com/post/42924689925/how-to-build-spark-on-windows

Starting now.

Download SBT: 
http://www.scala-sbt.org/release/tutorial/Setup.html
http://www.scala-sbt.org/release/tutorial/Installing-sbt-on-Windows.html

Download Spark: http://spark.apache.org/downloads.html

I selected the "Source Code" package, spark-1.4.1.tgz.

Download Scala: http://www.scala-lang.org/download/

1. Installed SBT first.

2. Then installed the Scala programming language distribution - be sure not to have spaces in your install destination path,
e.g. C:\Users\jbloom\Desktop\Spark\Scala

3. Then Spark.  Copied spark-1.4.1.tgz from the download directory to the destination directory C:\Users\jbloom\Desktop\Spark, then extracted the contents into a folder using WinRAR.

Created a file called "spark-env.cmd" in the \Scala\conf folder and added the contents:

set SCALA_HOME=C:\Users\jbloom\Desktop\Spark\Scala\bin
set SPARK_CLASSPATH=C:\...\SPARK\source\spark-0.6.2-sources\spark-0.6.2\core\target\scala-2.9.2\spark-core_2.9.2-0.6.2.jar;C:\...\scala\scala\lib\scala-library.jar;C:\...\scala\scala\lib\scala-compiler.jar;


That's as far as I got yesterday.  If anyone has a good link on how to build Spark on Windows, that would be great - please post it in the comments or send it along.

---------------------------------------------------------


To continue testing Spark, I decided to download a pre-built VM from Hortonworks, as my company, Bloom Consulting, is a partner.

Version 2.3

http://hortonworks.com/products/hortonworks-sandbox/#install 



Imported the VM into VMware Player...



Loaded fine...

It turns out my laptop has 6 GB of RAM.  VMware Player was set to give the VM 8 GB, so I adjusted it down to 3 GB, and it stopped crashing the computer and forcing reboots.

Hadoop loaded:



At the console, it prompts for a login...

username: root
password: hadoop

Some Spark info: http://hortonworks.com/hadoop/spark/

Spark Tutorials: http://hortonworks.com/hadoop/spark/#tutorials

5 Minute tutorial: http://hortonworks.com/hadoop-tutorial/hands-on-tour-of-apache-spark-in-5-minutes/

Step 1: get file from Wikipedia
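(The download itself happens at the sandbox command line, but a rough Python 2 equivalent is sketched below; the Wikipedia URL and the /tmp path are my assumptions, based only on the "Hortonworks" file name used later in the tutorial.)

# Hypothetical Python 2 stand-in for the tutorial's download step.
# The URL and local destination path are assumptions, not taken from the tutorial.
import urllib2

url = "http://en.wikipedia.org/wiki/Hortonworks"
data = urllib2.urlopen(url).read()

with open("/tmp/Hortonworks", "wb") as f:
    f.write(data)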



Step 2: move the file to HDFS
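(The tutorial does this with the standard "hadoop fs -put" command at the shell; shelling out from Python is one equivalent - the local and HDFS paths below are assumptions.)

# Copy the downloaded file from the local filesystem into HDFS by
# calling the standard "hadoop fs -put" command; paths are assumptions.
import subprocess

subprocess.check_call(["hadoop", "fs", "-put", "/tmp/Hortonworks", "/tmp/Hortonworks"])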



Step 3: start pyspark shell:

This launched Python 2.6.6 and Spark... took half a minute to load...



Step 4: with the newly created Hortonworks file in HDFS, instantiate an RDD from it using the Spark context "sc"
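In the pyspark shell, sc already exists, so this step is a single textFile call - the variable name and HDFS path below are my assumptions, not copied from the tutorial:

# sc (the SparkContext) is created automatically by the pyspark shell.
# Build an RDD of lines from the file that was put into HDFS above.
myLines = sc.textFile("hdfs:///tmp/Hortonworks")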




Step 5: transform the RDD using a Python command to remove blank rows
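A sketch of that transformation, keeping only non-empty lines (variable names are mine):

# filter() is a lazy transformation - nothing executes until an action runs.
# Keep only lines that contain at least one character.
myLines_filtered = myLines.filter(lambda line: len(line) > 0)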



Step 6: apply a "Count" function

resulting in a row count of 308.
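count() is the action that actually triggers the work; with the names from the sketches above it looks like this (308 is the number reported in this run, not something I re-verified):

# count() is an action: Spark now reads the file from HDFS, applies the
# filter, and returns the number of remaining (non-blank) lines.
myLines_filtered.count()   # returned 308 in this run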



And that concludes the first hands-on tutorial provided by Hortonworks for Spark on Hadoop in 5 minutes.


I've been downloading Hortonworks Sandboxes for a while now, and I have to say this was the easiest one so far.

Here are a few Hortonworks posts from this blog:

First Try #Hortonworks Hadoop: http://www.bloomconsultingbi.com/2013/08/first-try-hortonworks-hadoop.html

#SSIS ODBC Connection to #Hortonworks #Hadoop: http://www.bloomconsultingbi.com/2013/10/ssis-odbc-connection-to-hortonworks.html

Installation #Hortonworks #Hadoop 1.3 Part 1: http://www.bloomconsultingbi.com/2013/10/installation-hortonworks-hadoop-13-part.html 

Installation #Hortonworks #Hadoop 1.3 Part 2: http://www.bloomconsultingbi.com/2013/10/installation-hortonworks-hadoop-13-part_22.html

Hortonworks Ambari on Sandbox 2.0 VM: http://www.bloomconsultingbi.com/2014/03/hortonworks-ambari-on-sandbox-20-vm.html

Install Mahout, RHadoop, Configure Map Reduce Job:
http://www.bloomconsultingbi.com/2014/03/install-mahout-rhadoop-configure-map.html 

Bloom Consulting - Hadoop: http://www.bloomconsultingbi.com/2015/01/bloom-consulting-hadoop.html

Here's a site that provides tutorials on Scala from Udemy: https://blog.udemy.com/scala-tutorial-getting-started-with-scala/






Enjoy~!

