Below are the steps for creation Spark Scala SBT Project in Intellij:
1. Open Intellij via Run as Administrator and create a New project of type scala and sbt.
If this option is not available, open Intellij and go to settings -> pluging and type the plugin Scala and install it.
Also Install sbt plugin from the plugins window.
2. Select scala version which is compatible with spark, eg if spark version is 2.3 then select scala version as 2.11 and not 2.12 as spark 2.3 is compatible with scala 2.11. So, selected 2.11.8
Sample is available under https://drive.google.com/open?id=19YpCwLzuFZSqBReaceVOFS-BwlArOEpf
Debugging Spark Application
Remote Debugging
http://www.bigendiandata.com/2016-08-26-How-to-debug-remote-spark-jobs-with-IntelliJ/
1. Generate JAR file and copy it to a location in cluster.
2. Execute the command,
export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000
3. In Intellij Run -> Edit Configurations -> Select Remote and Create a configuration with port number 4000 and the Host name of the machine in which JAR is copied.
4. submit the Spark application Eg: spark-submit --class com.practice.SayHello demo_2.11-0.1.jar
5. Click on Debug in Intellij for the configuration create in step3 and this would connect to the Spark Application.
To write data to Hive tables from Spark Dataframe below are the 2 steps:
1. In spark-submit add the entry of hive site file as --files /etc/spark/conf/hive-site.xml
2. Enable Hive support in spark session enableHiveSupport(). eg:
val spark = SparkSession.builder.appName("Demo App").enableHiveSupport().getOrCreate()
Sample Code:
val date_add = udf((x: String) => {
val sdf = new SimpleDateFormat("yyyy-MM-dd")
val result = new Date(sdf.parse(x).getTime())
sdf.format(result)
} )
val dfraw2 = dfraw.withColumn("ingestiondt",date_add($"current_date"))
dfraw2.write.format("parquet").mode(SaveMode.Append).partitionBy("ingestiondt").option("path", "s3://ed-raw/cdr/table1").saveAsTable("db1.table1")
To write data to Hive tables from Spark Dataframe below are the 2 steps:
1. In spark-submit add the entry of hive site file as --files /etc/spark/conf/hive-site.xml
2. Enable Hive support in spark session enableHiveSupport(). eg:
val spark = SparkSession.builder.appName("Demo App").enableHiveSupport().getOrCreate()
Sample Code:
val date_add = udf((x: String) => {
val sdf = new SimpleDateFormat("yyyy-MM-dd")
val result = new Date(sdf.parse(x).getTime())
sdf.format(result)
} )
val dfraw2 = dfraw.withColumn("ingestiondt",date_add($"current_date"))
dfraw2.write.format("parquet").mode(SaveMode.Append).partitionBy("ingestiondt").option("path", "s3://ed-raw/cdr/table1").saveAsTable("db1.table1")