Prerequisites
- Apache Spark
- IntelliJ IDEA Community Edition
Walk-through
This article walks through how to create a Spark DataFrame from an XML file in an Apache Spark application, using IntelliJ IDEA Community Edition.
part_5_create_dataframe_from_xml_file.scala
package com.datamaking.apache.spark.dataframe

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
// Adds the .xml(...) method to spark.read
import com.databricks.spark.xml._

object part_5_create_dataframe_from_xml_file {
  def main(args: Array[String]): Unit = {
    println("Apache Spark Application Started ...")

    val spark = SparkSession.builder()
      .appName("Create DataFrame from XML File")
      .master("local[*]")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    //Code Block 1 Starts Here
    // Approach 1: let spark-xml infer the schema from the data.
    // Each <user> element becomes one row.
    println("Approach 1: ")
    val xml_file_path = "D:\\apache_spark_dataframe\\data\\xml\\user_detail.xml"

    val users_df_1 = spark.read.option("rowTag", "user").xml(xml_file_path)

    users_df_1.show(10, false)
    users_df_1.printSchema()
    //Code Block 1 Ends Here

    //Code Block 2 Starts Here
    // Approach 2: supply an explicit schema instead of inferring it.
    println("Approach 2: ")
    val user_schema = StructType(Array(
      StructField("user_id", IntegerType, true),
      StructField("user_name", StringType, true),
      StructField("user_city", StringType, true)
    ))

    val users_df_2 = spark.read.schema(user_schema).option("rowTag", "user").xml(xml_file_path)

    users_df_2.show(10, false)
    users_df_2.printSchema()
    //Code Block 2 Ends Here

    spark.stop()
    println("Apache Spark Application Completed.")
  }
}
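The article does not show the contents of user_detail.xml, so here is a minimal sketch for generating a small input file to try the program with. The layout is an assumption inferred from the "user" row tag and the three fields in the explicit schema. The root element name (users), the sample values, and the object name SampleUserXml are all hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

object SampleUserXml {
  // Hypothetical sample rows: (user_id, user_name, user_city)
  val sampleUsers: Seq[(Int, String, String)] = Seq(
    (1, "John", "London"),
    (2, "Martin", "New York"),
    (3, "Abhay", "Bangalore")
  )

  // Render the rows as XML. Each <user> element is meant to map to one
  // DataFrame row when the reader is given rowTag = "user".
  // Note: values are not XML-escaped, which is fine for this sample data only.
  def render(users: Seq[(Int, String, String)]): String = {
    val rows = users.map { case (id, name, city) =>
      s"""  <user>
         |    <user_id>$id</user_id>
         |    <user_name>$name</user_name>
         |    <user_city>$city</user_city>
         |  </user>""".stripMargin
    }
    val header = """<?xml version="1.0" encoding="UTF-8"?>"""
    (Seq(header, "<users>") ++ rows ++ Seq("</users>")).mkString("\n")
  }

  def main(args: Array[String]): Unit = {
    // Write to the path given as the first argument, or to ./user_detail.xml
    val target: Path = Paths.get(args.headOption.getOrElse("user_detail.xml"))
    Files.write(target, render(sampleUsers).getBytes(StandardCharsets.UTF_8))
    println(s"Wrote ${sampleUsers.size} users to ${target.toAbsolutePath}")
  }
}
```

Run it once, then point xml_file_path in the Spark program at the generated file.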
build.sbt
name := "apache_spark_dataframe_practical_tutorial"

version := "1.0"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"

// https://mvnrepository.com/artifact/com.databricks/spark-xml
libraryDependencies += "com.databricks" %% "spark-xml" % "0.7.0"

// https://mvnrepository.com/artifact/mysql/mysql-connector-java
libraryDependencies += "mysql" % "mysql-connector-java" % "8.0.18"

// https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "2.4.1"

// https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.4.1"

// https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10_2.12
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.4"

// https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "2.3.1"
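The XML program in this article imports only from spark-sql and spark-xml. The other connectors in the build file (MySQL, MongoDB, Cassandra, Kafka) are presumably there for other examples in the same project. If you only want to run this example, a trimmed build.sbt along these lines should be enough. This is a sketch that reuses the versions listed above, and the project name here is hypothetical.

```scala
// Hypothetical project name; any name works
name := "spark_xml_dataframe_example"

version := "1.0"

// Must match the Scala version the Spark and spark-xml artifacts were built for
scalaVersion := "2.11.8"

// DataFrame API and SparkSession
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"

// XML data source that provides spark.read...xml(path)
libraryDependencies += "com.databricks" %% "spark-xml" % "0.7.0"
```

The %% operator appends the Scala binary version (here _2.11) to the artifact name, so the scalaVersion setting and the dependency builds stay in step.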
Summary
In this article, we created a Spark DataFrame from an XML file in an Apache Spark application using IntelliJ IDEA Community Edition. Please go through the steps above, share your feedback, and post any questions you have. Thank you, and happy learning!