Friday, February 24, 2017

Building a Maven Project

Below are the steps to create a Maven project:

1. Get the Scala IDE. Open Eclipse -> Help -> Eclipse Marketplace and search for Scala. Select 'Scala IDE 4.2.x' from the list and click Install.

2. Switch to the Scala perspective. Right-click in the left pane -> New -> Project -> Maven and create a new project with GroupId "org.practice.Leela" and ArtifactId "Projs".

3. Wait 3-5 minutes for the project to finish updating. Right-click the project -> Configure -> Add Scala Nature.

Add the Scala source directory

Right-click the project -> Java Build Path -> Source -> Add Folder -> select 'main' -> add a new folder 'scala' -> under Inclusion patterns enter **/*.scala -> Finish -> OK.

4. Right-click the package -> New -> Scala Object -> name it "SparkWordcount".


Below is the latest pom.xml (from Narendra's machine).

Make sure the groupId, artifactId and version match the values your project was created with.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.tek</groupId>
  <artifactId>Sparkprojs</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>hydraulic</name>
  <url>http://maven.apache.org</url>

   <properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <scala.version>2.11.8</scala.version>
  <java.version>1.7</java.version>
 </properties>

  <dependencies>
  <dependency>
   <artifactId>scala-library</artifactId>
   <groupId>org.scala-lang</groupId>
   <version>${scala.version}</version>
  </dependency>

  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 -->
  <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.11</artifactId>
   <version>2.1.1</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.10 -->
  <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-streaming_2.11</artifactId>
   <version>2.1.1</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -->
  <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-sql_2.11</artifactId>
   <version>2.1.1</version>
  </dependency>
  <dependency>
   <groupId>com.databricks</groupId>
   <artifactId>spark-avro_2.11</artifactId>
   <version>3.2.0</version>
  </dependency>
  <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
   <version>2.1.1</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka_2.11 -->
  <dependency>
   <groupId>org.apache.kafka</groupId>
   <artifactId>kafka_2.11</artifactId>
   <version>0.8.2.1</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
  <dependency>
   <groupId>org.apache.kafka</groupId>
   <artifactId>kafka-clients</artifactId>
   <version>0.8.2.1</version>
  </dependency>
  <dependency>
   <groupId>com.databricks</groupId>
   <artifactId>spark-csv_2.11</artifactId>
   <version>1.5.0</version>
  </dependency>
  <dependency>
   <groupId>com.google.code.gson</groupId>
   <artifactId>gson</artifactId>
   <version>2.5</version>
  </dependency>
  <!-- tests -->
  <dependency>
   <groupId>org.scalatest</groupId>
   <artifactId>scalatest_2.11</artifactId>
   <version>2.2.2</version>
   <scope>test</scope>
  </dependency>
  <dependency>
   <artifactId>junit</artifactId>
   <groupId>junit</groupId>
   <version>4.10</version>
   <scope>test</scope>
  </dependency>
  <!-- Note: scala-maven-plugin is a build plugin, not a library dependency;
       it is configured under <build><plugins> below. -->
 </dependencies>

 <build>
  <sourceDirectory>src/main/scala</sourceDirectory>
  <testSourceDirectory>src/test/scala</testSourceDirectory>
  <plugins>
   <plugin>
    <groupId>net.alchim31.maven</groupId>
    <artifactId>scala-maven-plugin</artifactId>
    <version>3.1.6</version>
    <executions>
     <execution>
      <phase>compile</phase>
      <goals>
       <goal>compile</goal>
      </goals>
     </execution>
    </executions>
   </plugin>
   <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <version>2.0.2</version>
    <configuration>
     <source>${java.version}</source>
     <target>${java.version}</target>
    </configuration>
    <executions>
     <execution>
      <phase>compile</phase>
      <goals>
       <goal>compile</goal>
      </goals>
     </execution>
    </executions>
   </plugin>
  </plugins>
 </build>
</project>



5. In the pom.xml file, add the above dependencies under the <dependencies></dependencies> tags and save. The dependencies will then be downloaded.

Note: In some cases you may run into Scala compatibility issues like https://stackoverflow.com/questions/34507966/spark-build-path-is-cross-compiled-with-an-incompatible-version-of-scala-2-10-0

Errors:
"akka-actor_2.10-2.3.11.jar of Sparkexamples build path is cross-compiled with an incompatible version of Scala (2.10.0). In case this report is mistaken, this check can be disabled in the compiler preference page.    Sparkexamples        Unknown    Scala Version Problem"

To fix this, set the correct Scala version: right-click the project -> Scala -> Set the Scala Installation -> select the fixed installation 2.10.x, since the version specified in Maven is 2.10.5.

In case of an alchim31 error:

Copy http://alchim31.free.fr/m2e-scala/update-site/
In Eclipse -> Help -> Install New Software, paste it into 'Work with', select all and install.

                                  OR

6. Right-click the project -> Properties -> Java Build Path -> select the existing 2.11 Scala library -> Edit and select 2.10.5, the version added in pom.xml. In our case the right version is 2.10.5.

7. To add newer dependencies, search for them on the Maven repository and add them to the pom.xml file, e.g.:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.6.0</version>
</dependency>


8. Right-click the project -> Maven -> Update Project -> check the force option and click OK.

9. Right-click the project -> New -> Scala Object
and write the code below for testing.

package org.practice.Leela.Projs

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Wordcount {
  
  def main(args: Array[String]): Unit = {
    
    val conf = new SparkConf()
    .setAppName("wordCount")
    .setMaster("local")
    
    val sc = new SparkContext(conf)
    val lst = List(1,2,3,4)
    val rdd1 = sc.parallelize(lst)
    rdd1.foreach(println)
  }
}
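To quickly verify the setup, run the object from Eclipse (Run As -> Scala Application); the elements 1 to 4 should be printed to the console.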

Export the JAR

Maven JAR:

To build a JAR that bundles the dependencies, add the maven-assembly-plugin section below after the dependencies in pom.xml:

   <build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.6</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
  </build>



Right-click the project -> Run As -> Maven install. This will generate the Maven JAR under the target directory. The JAR name can be controlled in pom.xml (via the <finalName> element in the <build> section).

Eclipse JAR (not recommended):
Right-click the driver class (e.g. SparkWC.scala) -> Export -> JAR File -> specify the JAR file path and click Finish.

Run the command below to run on the local machine, with the JAR, input and output all in local directories:

spark-submit --class "Sparkjobs.Wordcount" --master local[*] /home/hadoop/Leela_out/Sparkprojs-1.0.jar /home/hadoop/Leela/WC/inpath /home/hadoop/Leela/WC/Outdir/

The input and output paths are passed as arguments; these are local file system paths. The JAR file is also on the local file system.
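For reference, a driver that actually uses those two arguments could look like the minimal sketch below (the package name Sparkjobs is taken from the --class option above; the args(0)/args(1) ordering and the word-splitting logic are assumptions):

package Sparkjobs

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Wordcount {

  def main(args: Array[String]): Unit = {

    // args(0) = input path, args(1) = output path (assumed from the spark-submit command above)
    // No setMaster here: the master is supplied by spark-submit (--master local[*] or yarn)
    val conf = new SparkConf().setAppName("wordCount")
    val sc = new SparkContext(conf)

    sc.textFile(args(0))                       // read the input file(s)
      .flatMap(line => line.split("\\s+"))     // split each line into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)                      // count occurrences of each word
      .saveAsTextFile(args(1))                 // write word counts to the output directory

    sc.stop()
  }
}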

 http://10.188.193.152:4040 - Spark UI. This UI will appear only while the Spark job is running.


To run the Spark job with data on HDFS (via YARN):

spark-submit --class "Sparkjobs.Wordcount" --master yarn --deploy-mode client /home/hadoop/Leela_out/Sparkprojs-1.0.jar /user/hadoop/Leela/WC/Leela_out /user/hadoop/Leela/WC/Outdir

The input and output paths are passed as arguments; these are HDFS paths. The JAR file is still on the local file system.

How to submit the Job with the built JAR file?

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100
# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \  # can be client for client mode
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

Latest pom.xml (dependencies and build sections) as of July 2017:

  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.8</version>
</dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.0</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.10 -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.11</artifactId>
    <version>0.8.2.1</version>
</dependency>
  <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
 <dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>3.2.0</version>
</dependency>
  </dependencies>
<build>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

Note: If the dependency in Maven is,
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.1.0-cdh5.8.0</version>
</dependency>

then change the version to <version>1.1.0</version> by removing the -cdhxxxx suffix. The CDH-specific versions are not available in the default Maven repositories and lead to a resolution error.


Eg:
----------------------------- Hive Code -----------------------------
The JAR below reads a CSV file and saves the contents as an Avro file. A Hive table is then created on top of it.

/Data/sathwik_r_and_d/spark/spark-2.1.0-bin-hadoop2.6/bin/spark-submit --class com.hydraulics.HydraulicsCode --master local[*] --packages "com.databricks:spark-avro_2.11:3.2.0,com.databricks:spark-csv_2.11:1.5.0" /Data/sathwik_r_and_d/hptest/hydraulics/hydra_jar_file.jar /Data/sathwik_r_and_d/hptest/hydraulics/history_export_files/data-export-2017-01-01toCurrent.csv /user/new_hydra_data_second_time/
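The HydraulicsCode class itself is not shown in this post. As a rough sketch of what such a CSV-to-Avro job might look like with the spark-csv and spark-avro packages used in the command above (the object name matches the --class option; the read options and argument ordering are assumptions):

package com.hydraulics

import org.apache.spark.sql.SparkSession

object HydraulicsCode {

  def main(args: Array[String]): Unit = {

    // args(0) = input CSV file, args(1) = output directory (assumed from the spark-submit command above)
    val spark = SparkSession.builder().appName("CsvToAvro").getOrCreate()

    // Read the CSV export; header/inferSchema are assumptions about the file layout
    val df = spark.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(args(0))

    // Write the same rows out as Avro so the Hive table below can be created on top of them
    df.write
      .format("com.databricks.spark.avro")
      .save(args(1))

    spark.stop()
  }
}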

SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;

set hive.execution.engine=spark;
CREATE TABLE hydraulics_temp_reload_v3034567891011131416171819
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/new_hydra_data_second_time/'
TBLPROPERTIES ('avro.schema.url'='/user/eoil_v_5.avsc',"hive.input.dir.recursive" = "TRUE", 
    "hive.mapred.supports.subdirectories" = "TRUE",
    "hive.supports.subdirectories" = "TRUE", 
    "mapred.input.dir.recursive" = "TRUE");

select count(*) from hydraulics_temp_reload_v3034567891011131416171819;
