Wednesday, March 28, 2018

Kafka Connect


Kafka Connect - https://www.confluent.io/product/connectors/

Kafka Connect is a framework included in Apache Kafka that integrates Kafka with other systems. Its purpose is to make it easy to add new systems to your scalable and secure stream data pipelines.

To copy data between Kafka and another system, users instantiate Kafka Connectors for the systems they want to pull data from or push data to. Source Connectors import data from another system (e.g. a relational database into Kafka) and Sink Connectors export data (e.g. the contents of a Kafka topic to an HDFS file).
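For example, a sink connector is just a properties file handed to a Connect worker. A minimal sketch for the HDFS sink, following the Confluent quickstart (the topic name and HDFS URL are placeholders for this environment):

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test_hdfs
hdfs.url=hdfs://localhost:9000
flush.size=3

Run it with the standalone worker, e.g. bin/connect-standalone etc/schema-registry/connect-avro-standalone.properties quickstart-hdfs.properties (paths as in the Confluent distribution).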


https://kafka.apache.org/documentation/#connectapi - Look under Transformations

https://docs.confluent.io/current/connect/connect-hdfs/docs/hdfs_connector.html
https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect

The links above cover customizing the connectors and writing your own.
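Single Message Transforms (KIP-66, linked above) can often do light per-record cleanup with no custom code at all. A sketch using the built-in ReplaceField transform, assuming the unwanted field is literally named "fingerprint", added to the connector's properties:

transforms=dropFingerprint
transforms.dropFingerprint.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.dropFingerprint.blacklist=fingerprint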

The connector JAR files are under /home/gorrepat/confluent/confluent-4.0.0/share/java/kafka-connect-hdfs


https://github.com/confluentinc/schema-registry/blob/master/avro-converter/src/main/java/io/confluent/connect/avro/AvroConverter.java


https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/json/JsonFormat.java

Custom logic (fingerprint removal) needs to be added in write() of https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/json/JsonRecordWriterProvider.java
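A minimal sketch of that change, not the exact class: the anonymous RecordWriter returned by JsonRecordWriterProvider serializes record.value(), so the value can be cleaned just before serialization. removeFingerprint() is a hypothetical helper; writer and LINE_SEPARATOR stand for the fields the class already uses.

@Override
public void write(SinkRecord record) {
  // Hypothetical helper: strip the fingerprint fields from the value
  // before it is serialized to HDFS.
  Object cleaned = removeFingerprint(record.value());
  try {
    // Same shape as the existing logic: one JSON document per line.
    writer.writeObject(cleaned);
    writer.writeRaw(LINE_SEPARATOR);
  } catch (IOException e) {
    throw new ConnectException(e);
  }
}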

Friday, March 9, 2018

Apache Phoenix



Apache Phoenix is installed under /opt/mas/phoenix/

Launch the Phoenix CLI against ZooKeeper:

bin/sqlline.py localhost:2181
OR
bin/sqlline.py xvzw160.xdev.motive.com,xvzw161.xdev.motive.com,xvzw162.xdev.motive.com


Create Table:
CREATE TABLE IF NOT EXISTS STOCK_SYMBOL (SYMBOL VARCHAR NOT NULL PRIMARY KEY, COMPANY VARCHAR);

List tables in Phoenix:
select DISTINCT("TABLE_NAME") from SYSTEM.CATALOG;

Insert data into the table:
UPSERT INTO STOCK_SYMBOL VALUES ('CRM','SalesForce.com');
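To verify, query the row back (in Phoenix, UPSERT doubles as insert-or-update):

SELECT * FROM STOCK_SYMBOL;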

Bulk-load CSV data into table STOCK_SYMBOL (psql.py is the CSV loader; sqlline.py is only the interactive shell):
bin/psql.py -t STOCK_SYMBOL xvzw160.xdev.motive.com,xvzw161.xdev.motive.com,xvzw162.xdev.motive.com /opt/mas/phoenix/examples/STOCK_SYMBOL.csv

To bulk-load a DDL file, CSV data, and a follow-up query file in one shot:
./psql.py xvzw160.xdev.motive.com,xvzw161.xdev.motive.com,xvzw162.xdev.motive.com /opt/mas/phoenix/examples/WEB_STAT.sql /opt/mas/phoenix/examples/WEB_STAT.csv /opt/mas/phoenix/examples/WEB_STAT_QUERIES.sql

Then connect with sqlline.py to inspect the results:
./sqlline.py xvzw160.xdev.motive.com,xvzw161.xdev.motive.com,xvzw162.xdev.motive.com

Importing a Java project and integrating with Maven



Point 1: Local Maven repository:
Usually there is a local repository in which all the JARs are placed. By default JARs are downloaded from the central Maven repository; to make the build resolve from our local repository instead, update settings.xml under Window -> Preferences -> Maven -> User Settings.
Jenkins builds also pick up JARs from this local repository.
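For reference, the local repository location itself is configured in settings.xml; the path below is just an example:

<settings>
  <!-- Maven resolves and caches JARs here; Eclipse's User Settings and
       Jenkins both pick this up when pointed at this settings.xml. -->
  <localRepository>C:/Users/gorrepat/.m2/repository</localRepository>
</settings>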

JARs can also be added to the project externally via Build Path -> Add External JARs... In Maven, they can be added via pom.xml by adding <scope> and <systemPath> to the dependency, as below:

    <dependency>
        <groupId>com.loopj.android.http</groupId>
        <artifactId>android-async-http</artifactId>
        <version>1.3.2</version>
        <type>jar</type>
        <scope>system</scope>
        <systemPath>${project.basedir}/libs/android-async-http-1.3.2.jar</systemPath>
    </dependency>

Point 2: If any of the JARs fail to download, open the repository URL in a browser and check whether the JARs actually exist at that location.

Point 3: In order to add a new repository to download JARs from, add the entry in settings.xml.
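A sketch of such an entry, with a placeholder URL standing in for the real internal repository:

<profiles>
  <profile>
    <id>internal-repo</id>
    <repositories>
      <repository>
        <id>internal</id>
        <!-- Placeholder: put the real repository URL here. -->
        <url>http://repo.internal.example.com/maven2</url>
      </repository>
    </repositories>
  </profile>
</profiles>
<activeProfiles>
  <activeProfile>internal-repo</activeProfile>
</activeProfiles>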

Case: you found a sample project in an article; follow these steps.

1. Check whether the code actually fits the need.
2. Import it into the workspace and try to build it from the Maven command prompt. Before this, verify that Maven is installed properly and the Maven home path is set. Maven 3.2.2 has been the most stable in my experience:

C:\Users\gorrepat>mvn -version
Apache Maven 3.2.2 (45f7c06d68e745d05611f7fd14efb6594181933e; 2014-06-17T19:21:42+05:30)
Maven home: C:\SWSetup\apache-maven-3.2.2-bin\apache-maven-3.2.2
Java version: 1.8.0_151, vendor: Oracle Corporation
Java home: C:\Program Files\Java\jdk1.8.0_151\jre
Default locale: en_IN, platform encoding: Cp1252
OS name: "windows 10", version: "10.0", arch: "amd64", family: "dos"

3. From the command prompt, go to the workspace location and run "mvn clean install".
4. Resolve any JARs that failed to download using Points 1 and 2.
5. Upgrade to the latest JAR versions and try to build again.
6. Try to include the same versions that are available on the cluster.

Case: if we get a sample class and need to create a project out of it, create a new project with the skeleton below and find the JARs based on the package names.
E.g. search Google for "<PACKAGE NAME> jar", such as "org.apache.hadoop.hbase.TableName jar".

That tells us which JAR provides the class, so it can be included in pom.xml.
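For instance, org.apache.hadoop.hbase.TableName resolves via the hbase-client artifact, so the dependency would look like this (the version is an example; match the cluster):

<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>1.2.0</version>
</dependency>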

A sample pom.xml that bundles all the dependencies is below; the key is "<descriptorRef>jar-with-dependencies</descriptorRef>", which tells the assembly plugin to build a single fat JAR:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.spark.streaming</groupId>
  <artifactId>JavaSparkStr</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <!-- Keep the Spark artifacts on one version and the Scala suffix (_2.11)
       consistent across all dependencies; match what is on the cluster. -->
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.2.1</version>
      <scope>provided</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>2.2.1</version>
      <scope>provided</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.11</artifactId>
      <version>0.10.1.1</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>2.2.1</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.2.1</version>
    </dependency>
  </dependencies>
<build>
    <plugins>   
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
      <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.5.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>     
    </plugins>
</build>
</project>

To make a runnable JAR, either create a manifest file or add the entries below to pom.xml, inside an <archive> element in the maven-assembly-plugin's <configuration>:

          <archive>
            <manifest>
              <addClasspath>true</addClasspath>
              <classpathPrefix>lib/</classpathPrefix>
              <mainClass>consumerTest.KafkaConsumer</mainClass>
            </manifest>
          </archive>

The latest Maven projects compiled with Java 1.8 are under https://drive.google.com/open?id=1yp3Q6jucpGW4bCZ8P9MkDpy7q-x0j7hK



To run a Java standalone program:

java -cp samples-0.0.1-SNAPSHOT-jar-with-dependencies.jar producer.SimpleStringProducer
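Since the manifest above sets mainClass, the fat JAR can also be run directly (assuming the assembly was built with that manifest):

java -jar samples-0.0.1-SNAPSHOT-jar-with-dependencies.jar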