Below are the steps to create a Maven project.
1. Get Scala IDE. Open Eclipse -> Help -> Eclipse Marketplace and search for Scala. In the list, select 'Scala IDE 4.2.X' and click Install.
2. Select the Scala perspective. Right click the left pane -> New -> Project -> Maven and create a new project with GroupID "org.practice.Leela" and ArtifactID "Projs".
3. Wait 3-5 minutes for the project to get updated. Right click the project -> Configure -> Add Scala Nature.
Add Scala directory
Right click the project -> Java Build Path -> Source -> Add Folder -> select 'main' -> Add New Folder 'scala' -> Inclusion patterns, enter **/*.scala -> Finish -> OK
4. Right click the package -> New -> Scala Object -> "SparkWordcount"
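The "Add Scala directory" step above can also be done from a terminal. The sketch below creates src/main/scala and src/test/scala, the two folders the POM in this post points at. The function name and its argument are hypothetical, and Eclipse may still need a Maven -> Update Project (or the build path step above) before it picks the folders up.

```shell
#!/bin/sh
# Sketch: create the Scala source folders from the command line.
# "project_dir" is a hypothetical argument; pass your own project root.
create_scala_dirs() {
    project_dir="$1"
    # -p creates missing parent folders and does not fail if they already exist
    mkdir -p "$project_dir/src/main/scala" "$project_dir/src/test/scala"
}

# Example (path is an assumption):
#   create_scala_dirs "$HOME/workspace/Projs"
```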
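Below is the latest POM.xml (from Narendra's machine).
Make sure groupId, artifactId and version stay as they were in your own project.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">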
<modelVersion>4.0.0</modelVersion>
<groupId>org.tek</groupId>
<artifactId>Sparkprojs</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>hydraulic</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<scala.version>2.11.8</scala.version>
<java.version>1.7</java.version>
</properties>
<dependencies>
<dependency>
<artifactId>scala-library</artifactId>
<groupId>org.scala-lang</groupId>
<version>${scala.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>3.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.1.1</version>
</dependency>
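<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka_2.11 -->
<!-- Note: spark-streaming-kafka-0-10 already brings in its own Kafka client as a transitive dependency; the explicit 0.8.2.1 Kafka artifacts below may conflict with it -->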
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.8.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.8.2.1</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.11</artifactId>
<version>1.5.0</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.5</version>
</dependency>
<!-- tests -->
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_2.11</artifactId>
<version>2.2.2</version>
<scope>test</scope>
</dependency>
<dependency>
<artifactId>junit</artifactId>
<groupId>junit</groupId>
<version>4.10</version>
<scope>test</scope>
</dependency>
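<!-- https://mvnrepository.com/artifact/net.alchim31.maven/scala-maven-plugin (this is a build plugin, also declared under build/plugins below, so this dependency entry is likely unnecessary) -->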
<dependency>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.1.6</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.1.6</version>
<executions>
<execution>
<phase>compile</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.0.2</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
<executions>
<execution>
<phase>compile</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
5. In the POM.xml file, add the above dependencies under the <dependencies></dependencies> tags and save. The dependencies will now be downloaded.
Note: In some cases you may run into Scala compatibility issues such as https://stackoverflow.com/questions/34507966/spark-build-path-is-cross-compiled-with-an-incompatible-version-of-scala-2-10-0
Error:
"akka-actor_2.10-2.3.11.jar of Sparkexamples build path is cross-compiled with an incompatible version of Scala (2.10.0). In case this report is mistaken, this check can be disabled in the compiler preference page. Sparkexamples Unknown Scala Version Problem"
To fix this, set the Scala version in Eclipse to match the one in the POM: Right click the project -> Scala -> Set the Scala Installation -> select the matching Fixed Installation (2.10.x here, as the version specified in Maven is 2.10.5).
In case of an alchim31 (scala-maven-plugin) error:
Copy http://alchim31.free.fr/m2e-scala/update-site/
In Eclipse -> Help -> Install New Software, paste it in 'Work with', select all and click Install.
OR (another way to fix the Scala version mismatch)
6. Right click the project -> Properties -> Java Build Path -> select the existing 2.11 Scala library -> Edit and select 2.10.5, which is the version added in POM.xml. In our case the right version is 2.10.5.
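A version mismatch like the one above can often be spotted in the POM itself, because Scala artifacts carry the Scala binary version as a suffix (_2.10, _2.11). The sketch below is only a rough text scan, not a real XML parser: it assumes each artifactId sits on one line and that the POM defines a <scala.version> property, as the POM in this post does. Both function names are hypothetical.

```shell
#!/bin/sh
# Print the major.minor part of the <scala.version> property in a POM.
scala_binary_version() {
    sed -n 's/.*<scala\.version>\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Print each distinct Scala suffix used by artifactIds in a POM, one per line.
# More than one line of output means artifacts for different Scala versions are mixed.
scala_suffixes() {
    grep -o '<artifactId>[^<]*_2\.[0-9][0-9]*</artifactId>' "$1" \
        | sed 's/.*_\(2\.[0-9][0-9]*\)<\/artifactId>/\1/' \
        | sort -u
}

# Example:
#   scala_binary_version pom.xml
#   scala_suffixes pom.xml
```

If the two commands disagree, either the artifact suffixes or the Scala installation selected in Eclipse needs to change.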
7. To add newer dependencies, search for the Maven dependency (for example on mvnrepository.com) and add it to the POM.xml file. For example:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.10</artifactId>
<version>1.6.0</version>
</dependency>
8. Right click the project -> Maven -> Update Project -> check the force update option and click OK.
9. Right click the project -> New -> Scala Object
and write the code below for testing.
package org.practice.Leela.Projs

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Wordcount {
  def main(args: Array[String]): Unit = {
    // Run locally in a single JVM; no cluster is needed for this test
    val conf = new SparkConf()
      .setAppName("wordCount")
      .setMaster("local")
    val sc = new SparkContext(conf)

    // Distribute a small list as an RDD and print each element
    val lst = List(1, 2, 3, 4)
    val rdd1 = sc.parallelize(lst)
    rdd1.foreach(println)

    // Release Spark resources
    sc.stop()
  }
}
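Export Jar
Maven jar:
Add the lines below after the dependencies in POM.xml. If the POM already has a <build> section, add only the <plugin> element to its existing <plugins> list.
<build>
<plugins>
<!-- jar-with-dependencies is a descriptor of maven-assembly-plugin, so it is configured on that plugin -->
<plugin>
<inherited>true</inherited>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.1.0</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>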
Right click the project -> Run As -> Maven install. This generates the JAR in the project's target directory. The JAR name can be specified in POM.xml.
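Outside Eclipse, the same build can be started with mvn clean install from the project root (assuming mvn is on the PATH). The sketch below only covers the follow-up step of locating the JAR that was just built, so its path can be passed to spark-submit. The function name is hypothetical.

```shell
#!/bin/sh
# Command-line equivalent of Run As -> Maven install (assumes mvn is on PATH):
#   mvn clean install

# Print the most recently modified JAR in <project_dir>/target.
# Prints nothing if no JAR has been built yet.
latest_jar() {
    ls -t "$1"/target/*.jar 2>/dev/null | head -n 1
}

# Example (run from the project root):
#   latest_jar .
```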
Eclipse jar: not recommended
Right click the driver class, e.g. SparkWC.scala -> Export -> JAR File -> specify the JAR file path and click Finish
Run the command: To run on the local machine with the JAR, input and output in local directories
spark-submit --class "Sparkjobs.Wordcount" --master "local[*]" /home/hadoop/Leela_out/Sparkprojs-1.0.jar /home/hadoop/Leela/WC/inpath /home/hadoop/Leela/WC/Outdir/
Input and output paths are passed as arguments; these paths are local file system paths. The JAR file is in the local file system.
http://10.188.193.152:4040 - Spark UI. This UI is available only while the Spark job is running.
To run Spark on YARN with input and output in HDFS
spark-submit --class "Sparkjobs.Wordcount" --master yarn --deploy-mode client /home/hadoop/Leela_out/Sparkprojs-1.0.jar /user/hadoop/Leela/WC/Leela_out /user/hadoop/Leela/WC/Outdir
Input and output paths are passed as arguments; these paths are HDFS paths. The JAR file is in the local file system.
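The two commands above differ only in the --master options and in whether the input and output paths are local or HDFS paths. The sketch below makes that explicit with a small helper that prints the command for either mode instead of running it. The function name and the mode keywords are hypothetical, and paths containing spaces are not handled.

```shell
#!/bin/sh
# Print (do not run) the spark-submit command for the word-count job.
# Usage: build_submit_cmd local|yarn JAR INPUT OUTPUT
build_submit_cmd() {
    mode="$1"; jar="$2"; input="$3"; output="$4"
    case "$mode" in
        local) master_opts='--master "local[*]"' ;;               # paths are local paths
        yarn)  master_opts='--master yarn --deploy-mode client' ;; # paths are HDFS paths
        *)     echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    echo "spark-submit --class Sparkjobs.Wordcount $master_opts $jar $input $output"
}

# Example:
#   build_submit_cmd yarn /home/hadoop/Leela_out/Sparkprojs-1.0.jar /user/hadoop/Leela/WC/Leela_out /user/hadoop/Leela/WC/Outdir
```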
How to submit the Job with the built JAR file?
# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
# Run on a YARN cluster (use --deploy-mode client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
--master spark://207.184.161.138:7077 \
examples/src/main/python/pi.py \
1000
Latest POM.xml as of July 2017
<dependencies>
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.8.2.1</version>
</dependency>
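<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 (note: 2.1.1 here, while the other Spark modules in this POM are 2.1.0; Spark modules are normally kept on one version) -->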
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>3.2.0</version>
</dependency>
</dependencies>
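<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.1.0</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>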
Note: If the dependency in Maven is,
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>1.1.0-cdh5.8.0</version>
</dependency>
then change the version to <version>1.1.0</version> by removing -cdhxxxx. Leaving the -cdh suffix in place leads to an error.
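When several dependencies were copied with a -cdh suffix, the edit above can be scripted. The sketch below is a plain text filter that assumes each <version> element sits on one line, as in the POM snippets in this post. The function name is hypothetical, and whether the unsuffixed version suits your cluster is something to check separately.

```shell
#!/bin/sh
# Remove a Cloudera "-cdhX.Y.Z" suffix from <version> values read on stdin.
# Versions without the suffix pass through unchanged.
strip_cdh_suffix() {
    sed 's/\(<version>[^<]*\)-cdh[0-9][0-9.]*\(<\/version>\)/\1\2/'
}

# Example (writes a new file instead of editing pom.xml in place):
#   strip_cdh_suffix < pom.xml > pom.stripped.xml
```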
Eg:
----------------------------- Hive Code ----------
The JAR below reads a CSV file and saves the contents as an Avro file. A Hive table is then created on top of it.
/Data/sathwik_r_and_d/spark/spark-2.1.0-bin-hadoop2.6/bin/spark-submit --class com.hydraulics.HydraulicsCode --master "local[*]" --packages "com.databricks:spark-avro_2.11:3.2.0,com.databricks:spark-csv_2.11:1.5.0" /Data/sathwik_r_and_d/hptest/hydraulics/hydra_jar_file.jar /Data/sathwik_r_and_d/hptest/hydraulics/history_export_files/data-export-2017-01-01toCurrent.csv /user/new_hydra_data_second_time/
SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;
set hive.execution.engine=spark;
CREATE TABLE hydraulics_temp_reload_v3034567891011131416171819
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/new_hydra_data_second_time/'
TBLPROPERTIES (
'avro.schema.url' = '/user/eoil_v_5.avsc',
'hive.input.dir.recursive' = 'TRUE',
'hive.mapred.supports.subdirectories' = 'TRUE',
'hive.supports.subdirectories' = 'TRUE',
'mapred.input.dir.recursive' = 'TRUE');
select count(*) from hydraulics_temp_reload_v3034567891011131416171819;