Why reduceByKey RDD transformation is preferred over groupByKey in PySpark | PySpark 101 | Part 13


Prerequisites

  • Apache Spark
  • PyCharm Community Edition

Walk-through

In this article, I will walk you through how to use the reduceByKey RDD transformation in a PySpark application using PyCharm Community Edition.

reduceByKey: Merge the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a combiner in MapReduce.
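
This combiner-style local merge is exactly why reduceByKey is preferred over groupByKey for aggregations: with groupByKey, every (key, value) pair is shuffled across the network before any combining happens. The short sketch below is not part of the walk-through script and uses illustrative names; it simply computes the same word count both ways so the difference is visible.

# Minimal sketch comparing reduceByKey with groupByKey (illustrative names).
from pyspark.sql import SparkSession

spark = SparkSession \
        .builder \
        .appName("reduceByKey vs groupByKey - sketch") \
        .master("local[*]") \
        .getOrCreate()

pairs_rdd = spark.sparkContext \
        .parallelize(["Three", "Five", "One", "Five", "One"], 3) \
        .map(lambda word: (word, 1))

# reduceByKey merges counts inside each partition first (like a combiner),
# so at most one (word, partial_count) record per key per partition is shuffled.
print(pairs_rdd.reduceByKey(lambda a, b: a + b).collect())

# groupByKey shuffles every individual (word, 1) pair and only then sums the
# values, which moves far more data across the network on large inputs.
print(pairs_rdd.groupByKey().mapValues(lambda counts: sum(counts)).collect())

spark.stop()

Both calls should print the same counts for this small list, for example ('Five', 2), ('One', 2) and ('Three', 1), although the order of the pairs may differ between runs.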

# Importing Spark Related Packages
from pyspark.sql import SparkSession

# Importing Python Related Packages
import time

if __name__ == "__main__":
    print("PySpark 101 Tutorial")

    # reduceByKey - Merge the values for each key using an associative and commutative reduce function.
    # This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a combiner in MapReduce.

    spark = SparkSession \
            .builder \
            .appName("Part 13 - Why reduceByKey RDD transformation is preferred instead of groupByKey in PySpark | PySpark 101") \
            .master("local[*]") \
            .enableHiveSupport() \
            .getOrCreate()

    str_list = ["Three", "Five", "One", "Five", "One"]
    print("Printing str_list: ")
    print(str_list)

    str_rdd = spark.sparkContext.parallelize(str_list, 3)

    print("Get Partition Count: ")
    print(str_rdd.getNumPartitions())

    # Turn each word into a (word, 1) pair, then sum the counts per key.
    kv_rdd = str_rdd \
            .map(lambda e: (e, 1)) \
            .reduceByKey(lambda a, b: a + b)

    print(kv_rdd.collect())

    print("Printing current datetime - 1: ")
    print(time.strftime("%Y-%m-%d %H:%M:%S"))
    input_file_path = "hdfs://datamaking:9000/data/input/kv_pair_data/words_datagen.txt"
    lines_rdd = spark.sparkContext.textFile(input_file_path)
    # Each line holds comma-separated words; split them out and pair each word with a count of 1.
    words_rdd = lines_rdd.flatMap(lambda e: e.split(','))
    words_kv_pair_rdd = words_rdd.map(lambda e: (e, 1))

    results_kv_rdd = words_kv_pair_rdd.reduceByKey(lambda x, y: x + y)

    print(results_kv_rdd.collect())

    print("Printing current datetime - 2: ")
    print(time.strftime("%Y-%m-%d %H:%M:%S"))
    #print("Please wait for 10 minutes before stopping SparkSession object ... ")
    #time.sleep(600)
    #print(time.strftime("%Y-%m-%d %H:%M:%S"))
    print("Stopping the SparkSession object")
    spark.stop()
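
If you want to see the difference on the file-based word count as well, the same aggregation can be repeated with groupByKey just before spark.stop() and the printed timestamps compared. The fragment below is a sketch that reuses the script's words_kv_pair_rdd variable; it is not part of the original code.

    # Sketch: the same word count via groupByKey, for timing comparison only.
    # Every (word, 1) pair is shuffled before the values are summed, so this
    # stage typically moves much more data across the network than reduceByKey.
    grouped_results_rdd = words_kv_pair_rdd \
            .groupByKey() \
            .mapValues(lambda counts: sum(counts))

    print(grouped_results_rdd.collect())
    print(time.strftime("%Y-%m-%d %H:%M:%S"))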




Summary

In this article, we used the reduceByKey RDD transformation in a PySpark application built with PyCharm Community Edition. Please work through the steps above, share your feedback, and post any questions or doubts you may have. Thank you.

Happy Learning !!!
