Practical RDD transformation: repartition in PySpark using Jupyter Notebook | PySpark 101 | Part 21


Prerequisite

  • Apache Spark
  • Jupyter Notebook

Walk-through

In this article, I will walk you through how to use the repartition RDD transformation in PySpark code using a Jupyter Notebook.

repartition: Returns a new RDD that has exactly the number of partitions specified in the numPartitions argument. It can increase or decrease the level of parallelism of the RDD. Internally, repartition uses a shuffle to redistribute the data. If you are reducing the number of partitions of an RDD, consider using `coalesce` instead, which can avoid performing a shuffle.
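To make this concrete, here is a minimal sketch (not the original notebook code) showing repartition and coalesce on a small RDD; the app name "RepartitionDemo" and the sample data are placeholders for illustration.

```python
from pyspark.sql import SparkSession

# Illustrative session setup; in a notebook, a SparkSession may already exist.
spark = SparkSession.builder.appName("RepartitionDemo").getOrCreate()
sc = spark.sparkContext

# Create an RDD of 20 numbers spread across 4 partitions.
rdd = sc.parallelize(range(20), 4)
print(rdd.getNumPartitions())        # 4

# Increase parallelism: repartition always performs a full shuffle.
rdd_more = rdd.repartition(8)
print(rdd_more.getNumPartitions())   # 8

# Decrease parallelism: coalesce can avoid a shuffle when reducing partitions.
rdd_fewer = rdd.coalesce(2)
print(rdd_fewer.getNumPartitions())  # 2

# Inspect how many elements landed in each partition.
print(rdd_more.glom().map(len).collect())

spark.stop()
```

Note that repartition(8) triggers a full shuffle even though it only increases parallelism, while coalesce(2) merges existing partitions in place, which is why coalesce is the cheaper choice when reducing the partition count.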

Summary

In this article, we have successfully used the repartition RDD transformation in PySpark code using a Jupyter Notebook. Please work through these steps, share your feedback, and post any queries or doubts you may have. Thank you.

Happy Learning !!!
