RDD (Resilient Distributed Dataset) in Apache Spark

Resilient Distributed Dataset (RDD)

A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel.

There are two types of operations that we can perform on an RDD:

Transformations - lazy operations that create a new RDD from an existing one
Example: map, filter
Actions - operations that run a computation on the RDD and return a value to the driver program
Example: count, collect, take
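As a quick illustration of the difference, here is a small sketch as you might type it in spark-shell (where spark is the pre-built SparkSession; the numbers are purely illustrative):

```scala
// Build an RDD from a local collection
val numbers = spark.sparkContext.parallelize(1 to 10)

// Transformations: lazy, each returns a new RDD, nothing is computed yet
val doubled = numbers.map(_ * 2)
val bigOnes = doubled.filter(_ > 10)

// Actions: trigger the computation and return values to the driver
bigOnes.count()    // 5
bigOnes.collect()  // Array(12, 14, 16, 18, 20)
```

Until count or collect is called, Spark only records the lineage of transformations; no work runs on the cluster.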

Architecture of Apache Spark

Exploring RDD in Apache Spark

val lines_rdd = spark.sparkContext.textFile(yarn_log_file_path)        // lazy: no data is read yet
val error_lines_rdd = lines_rdd.filter(line => line.contains("ERROR")) // lazy transformation
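Both lines above are transformations, so Spark has not yet read the file. A minimal sketch of completing the job with actions (reusing the RDD names from the snippet above):

```scala
// Actions trigger the actual file read and filtering
val errorCount = error_lines_rdd.count()   // total number of ERROR lines
val firstErrors = error_lines_rdd.take(5)  // first five ERROR lines, returned to the driver

firstErrors.foreach(println)
```

collect would also work here, but take is safer on large log files because it brings back only a bounded number of lines.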

Happy Learning!!!
