Spark is the most active Apache project and gets a lot of press in the big data world. So how do you know if Spark is right for your project, and what is the difference between Spark and Hadoop when run on HDInsight? I'll cover some of the differences between Spark and Hadoop and some things to consider for your next project.
Spark and Hadoop are both big data frameworks. Spark can run on top of Hadoop, but it does not have to. You can run Spark in local standalone mode on your laptop or in a distributed manner on a cluster. Spark has its own resource manager (the Standalone Scheduler) and also supports other resource managers like Mesos and YARN. On HDInsight, Spark uses its own resource manager by default, not YARN. On HDInsight, Spark has a SparkMaster service on the headnodes and a SparkSlave service on the workernodes. These services start and manage the JVMs for Spark.
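To make that concrete, here is a minimal sketch of pointing the same Spark application at different cluster managers just by changing the master URL. The hostnames and ports are placeholders, not real cluster addresses:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Point the same application at different cluster managers by changing
// the master URL. Hostnames and ports here are placeholders.
val conf = new SparkConf()
  .setAppName("ResourceManagerDemo")
  .setMaster("local[4]")              // local mode on your laptop, 4 threads
//.setMaster("spark://headnode:7077") // Spark's own Standalone Scheduler
//.setMaster("mesos://master:5050")   // Mesos
//.setMaster("yarn-client")           // YARN (Spark 1.x syntax)

val sc = new SparkContext(conf)
println(s"Running against master: ${sc.master}")
```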
One of the main reasons Spark is run on top of Hadoop is that Spark does not have a distributed file system like HDFS or Windows Azure Storage. Running Spark on top of Hadoop gives Spark access to distributed data that most big data projects require.
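Assuming a SparkContext named sc like the one above, reading distributed data is just a matter of the URI scheme. The container, storage account, and paths below are hypothetical:

```scala
// Read distributed data from Windows Azure Storage (wasb://).
// Container, storage account, and paths are hypothetical.
val lines = sc.textFile(
  "wasb://mycontainer@myaccount.blob.core.windows.net/data/input.txt")

// The same call reads from HDFS when that is the underlying file system:
// val hdfsLines = sc.textFile("hdfs:///data/input.txt")
println(s"Line count: ${lines.count()}")
```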
Spark can be faster for some circumstances and workloads. Spark can handle a lot of operations in memory, which reduces the time spent writing to and reading from physical disk; memory access is faster than disk access. MapReduce, on the other hand, writes data back to disk after operations in order to ensure recoverability on failure. Spark uses RDDs, Resilient Distributed Datasets. RDDs are datasets of objects that are distributed across the nodes of the cluster. RDDs are automatically recoverable on failure, so intermediate data does not have to be written to disk. RDDs are also partitioned. Figuring out an RDD's correct partition size can be a challenge for optimal performance.
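Here is a short sketch of caching an RDD in memory and adjusting its partitioning. The file path and partition counts are illustrative, not tuning advice:

```scala
// Cache an RDD in memory and control its partitioning.
val logs   = sc.textFile("wasb:///example/data/sample.log", 8) // ask for 8 partitions
val errors = logs.filter(_.contains("ERROR"))

errors.cache()          // keep this intermediate dataset in memory
println(errors.count()) // first action: reads from storage and fills the cache
println(errors.count()) // second action: served from memory

println(s"Partitions: ${errors.partitions.length}")
val coarser = errors.coalesce(2)     // fewer, larger partitions (no shuffle)
val finer   = errors.repartition(16) // more, smaller partitions (shuffles)
```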
A Spark job is broken up into stages and tasks. Each task has its own thread and is scheduled on an executor. An executor can run multiple tasks, which means executors are multi-threaded. The executors also hold Spark's cache, which stores the RDDs. As tasks are scheduled on an executor, each task runs code against one of the RDD's partitions. An executor's multi-threaded nature helps improve performance.
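How many threads each executor runs is driven by configuration. As a rough sketch, these are standard Spark properties that shape executors; the values are illustrative, and note that spark.executor.instances applies when running under YARN:

```scala
import org.apache.spark.SparkConf

// Standard Spark properties that shape executors. Values are illustrative.
val executorConf = new SparkConf()
  .setAppName("ExecutorDemo")
  .set("spark.executor.memory", "4g")   // heap per executor, shared by tasks and cache
  .set("spark.executor.cores", "4")     // concurrent tasks (threads) per executor
  .set("spark.executor.instances", "4") // executor count (applies under YARN)
```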
Both Spark and Hadoop have shuffle operations. During a shuffle, Spark writes intermediate data to physical disk. On HDInsight, the shuffle's intermediate data is written to local disk on the virtual machines, not to the default storage account on Windows Azure Storage. The shuffle can be a bottleneck for both Spark and Hadoop.
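A classic example of a shuffle boundary is reduceByKey, where records with the same key must be moved to the same partition. The paths below are hypothetical:

```scala
// reduceByKey forces a shuffle: all pairs with the same word must end up
// in the same partition. Paths are hypothetical.
val counts = sc.textFile("wasb:///example/data/sample.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _) // shuffle boundary: the stage ends here

counts.saveAsTextFile("wasb:///example/output/wordcounts")
// The shuffle's intermediate files land under spark.local.dir on each
// worker's local disk, not in the default storage account.
```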
Spark offers a rich choice of programming languages. It supports Java, Scala, Python, and, as of Spark 1.4, R. This gives your development team a wide range of languages to choose from. Spark itself is written in Scala. Scala is a functional programming language and is not as well-known as Java. Python is widely known and has a large developer base to draw from. Python and R are widely used by data scientists for machine learning.
Spark uses "lazy execution". Spark commands are either transformations or actions. A transformation command builds up the plans lineage (metadata) and is not executed. The return type of a transformation is a RDD. Actions take the linage and executes it. Actions usually writes data back to the driver application or writes data to disk. Getting used to Sparks "lazy execution" can take some getting used to.
Spark can be a one-stop shop instead of stitching together multiple projects in Hadoop. Spark's core contains functionality for scheduling, memory management, fault tolerance, and different storage systems. It also has packages for Spark SQL, Spark Streaming, Spark Machine Learning (MLlib), and Spark GraphX processing. Instead of using multiple Hadoop ecosystem projects like Storm, Hive, Sqoop, and others to create a solution, you might be able to use Spark alone to create the same solution.
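For example, a SQL query can run inside the same application that does core RDD work, with no separate Hive deployment. This sketch uses the Spark 1.x SQLContext API with made-up data:

```scala
import org.apache.spark.sql.SQLContext

// Run a SQL query inside the same application that does core RDD work.
// Spark 1.x API; the data is made up.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Ada", 36), Person("Linus", 45))).toDF()
people.registerTempTable("people")

val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)
```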
Spark moves big data closer to interactive processing. Spark on HDInsight offers multiple ways for an end user to interact with the cluster. It has a Spark Dashboard to help manage and troubleshoot. It has IPython and Zeppelin notebooks to run interactive queries from your desktop. It has a Spark Job Submission Service, so you can use a REST API to copy a local .jar or Python script to Windows Azure Storage and then execute it on the Spark cluster in batch mode. This can be done or scheduled remotely from your desktop, so you don't have to remote desktop into the cluster to execute it. It also supports the Spark ODBC driver, so you can use Azure Power BI or Tableau for interactive analysis. Spark on HDInsight gives end users a rich way to interact with the cluster.
This should give you a sense of some of the similarities and differences between Spark and Hadoop and how they interact with each other. Any big data project has a lot of challenges. For your next project, give Spark on HDInsight a look and see if it is right for you and your team!
Bill