Introduction to Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. The Apache Spark ecosystem consists of the following components/libraries:
  • Spark Core API (RDD) 
  • Spark SQL (SQL, DataFrame) 
  • Spark Streaming 
  • MLlib/Spark ML (Machine Learning) 
  • GraphX

Apache Hadoop vs Apache Spark

Hadoop has two core components: HDFS (Hadoop Distributed File System) and MapReduce.

HDFS - A reliable and scalable storage solution for big datasets
MapReduce - A distributed programming model for big data computation

Advantages of Apache Spark

Hadoop falls short in the two areas below:
  • Iterative Machine Learning 
  • Interactive Data Analysis
What Apache Spark provides:
  • Iterative Machine Learning - Intermediate data is kept in memory, reducing the number of disk reads and writes, so iterative computations run faster and more efficiently 
  • Interactive Data Analysis - A rich set of functions for data analysis, with analysis sped up by caching data in memory

How Apache Spark achieves faster computation compared to Hadoop

In-memory computing at distributed scale - Caching data in memory

  • The Spark execution engine translates user code into a series of tasks. These operations form a DAG (Directed Acyclic Graph): execution flows from one operation/task to the next and never comes back to re-execute the same task (no cyclic flow) 
  • Spark tracks each operation on a dataset using a concept called RDD (Resilient Distributed Datasets) 
  • Spark Architecture is built from the ground up for speed and efficiency

Happy Learning !!!
