How to use sortByKey RDD transformation in PySpark | PySpark 101 | Part 15


Prerequisite

  • Apache Spark
  • PyCharm Community Edition

Walk-through

In this article, I am going to walk you through how to use the sortByKey RDD transformation in a PySpark application, using PyCharm Community Edition.

sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: x) - Sorts this RDD, which is assumed to consist of (key, value) pairs, by key. Keys are sorted in ascending order by default.
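
As a quick illustration of the optional parameters (ascending and numPartitions are both part of the standard sortByKey signature), here is a minimal sketch; it assumes a SparkSession named spark has already been created, as in the walk-through code below:

pairs = spark.sparkContext.parallelize([(2, "b"), (3, "c"), (1, "a")])

# Sort by key in descending order, producing a single output partition
print(pairs.sortByKey(ascending=False, numPartitions=1).collect())
# [(3, 'c'), (2, 'b'), (1, 'a')]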

# Importing Spark Related Packages
from pyspark.sql import SparkSession

if __name__ == "__main__":
    print("PySpark 101 Tutorial")

    # enableHiveSupport() is kept from the series template; it is not needed
    # for this example and can be dropped if Hive is not configured
    spark = SparkSession \
            .builder \
            .appName("Part 15 - How to use sortByKey RDD transformation in PySpark | PySpark 101") \
            .master("local[*]") \
            .enableHiveSupport() \
            .getOrCreate()

    key_value_pair_list = [(3, "Three"), (5, "Five"), (1, "One"), (4, "Four"), (2, "Two")]
    print("Printing key_value_pair_list: ")
    print(key_value_pair_list)

    # Create a pair RDD spread across 2 partitions
    key_value_pair_rdd = spark.sparkContext.parallelize(key_value_pair_list, 2)

    print("Get Partition Count: ")
    print(key_value_pair_rdd.getNumPartitions())

    print(key_value_pair_rdd.sortByKey().collect())

    input_file_path = "file:///home/dmadmin/datamaking/data/pyspark101/input/tech.txt"
    tech_rdd = spark.sparkContext.textFile(input_file_path)
    # Pair each line of the file with a count of 1, so the line itself becomes the key
    tech_key_value_pair_rdd = tech_rdd.map(lambda e: (e, 1))
    print("Before sortByKey: ")
    print(tech_key_value_pair_rdd.collect())
    print("After sortByKey: ")

    print(tech_key_value_pair_rdd.sortByKey().collect())

    print("Stopping the SparkSession object")
    spark.stop()
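
sortByKey also accepts a keyfunc argument that transforms each key for comparison purposes only. The snippet below is a minimal, self-contained sketch (the sample data and app name are made up for illustration) showing how keyfunc enables a case-insensitive sort of string keys:

from pyspark.sql import SparkSession

spark = SparkSession \
        .builder \
        .appName("sortByKey keyfunc sketch") \
        .master("local[*]") \
        .getOrCreate()

# Mixed-case string keys
words = spark.sparkContext.parallelize([("banana", 1), ("Cherry", 2), ("apple", 3)])

# Plain sort: upper-case letters sort before lower-case ones
print(words.sortByKey().collect())
# [('Cherry', 2), ('apple', 3), ('banana', 1)]

# keyfunc is applied to each key during comparison; the stored keys are unchanged
print(words.sortByKey(keyfunc=lambda k: k.lower()).collect())
# [('apple', 3), ('banana', 1), ('Cherry', 2)]

spark.stop()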


Summary

In this article, we have successfully used the sortByKey RDD transformation in a PySpark application, using PyCharm Community Edition. Please work through all the steps, share your feedback, and post any queries or doubts you may have. Thank you.

Happy Learning !!!
