Monday, February 6, 2017

HIVE-HBASE Integration



HIVE and HBASE integration

From cloudera, HIVE files can be accessed via cd /usr/lib/hive/lib/
to open HIVE-site.xml,
cd /usr/lib/hive/conf
cat hive-site.xml

To allow Hive scripts to use HBase, add the following statements to the top of each script.
OR
adding these JARS BY populating in the hive.aux.jars.path property in hive-site.xml and Restart HIVE

ADD JAR /usr/lib/hive/lib/zookeeper.jar;
ADD JAR /usr/lib/hive/lib/hive-hbase-handler-0.13.1-cdh5.3.0.jar;
ADD JAR /usr/lib/hive/lib/guava-11.0.2.jar;
ADD JAR /usr/lib/hbase/hbase-client-0.98.6-cdh5.3.0.jar;
ADD JAR /usr/lib/hbase/lib/hbase-common-0.98.6-cdh5.3.0.jar;
ADD JAR /usr/lib/hbase/lib/hbase-protocol-0.98.6-cdh5.3.0.jar;
ADD JAR /usr/lib/hbase/lib/hbase-server-0.98.6-cdh5.3.0.jar;
ADD JAR /usr/lib/hbase/lib/hbase-shell-0.98.6-cdh5.3.0.jar;
ADD JAR /usr/lib/hbase/lib/hbase-thrift-0.98.6-cdh5.3.0.jar;

Data,
arun 1 cse 1
sunil 2 cse 1
raj 3 cse 1
naveen 4 cse 1
venki 5 cse 2
prasad 6 cse 2
sudha 7 cse 2
ravi 1 mech 1
raju 2 mech 1
roja 3 mech 1
anil 4 mech 2
rani 5 mech 2
anvith 6 mech 2
madhu 7 mech 2

In general HBase expects a Key to be specified. In our data, there is no id, but we have specified key int as the first column. so because of this data in the file cannot be loaded into hive table.
CREATE TABLE IF NOT EXISTS student_hive (key int, name String, id int, course String, year int) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with SERDEPROPERTIES ("hbase.columns.mapping" = ":key, Course_Details: hname,Course_Details: hbid, Course_Details: hbcourse, Course_Details: hbyear") TBLPROPERTIES ("hbase.table.name" = "Studentdata_hbase");
//The above command is INVALID

So, CREATE the HIVE table with the below command where name String acts as key for HBase. Here ":key" is specified at the begining of 'hbase.columns.mapping' property which automatically maps to first column(name String in HIVE. Also this automatically creates HBase table with name 'Studentdata_hbase' with columnfamily name as 'Course_Details'. Refer http://hadooptutorial.info/hbase-integration-with-hive/

CREATE TABLE IF NOT EXISTS student_hive (name String, id int, course String, year int) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with SERDEPROPERTIES ("hbase.columns.mapping" = ":key, Course_Details: hbid, Course_Details: hbcourse, Course_Details: hbyear") TBLPROPERTIES ("hbase.table.name" = "Studentdata_hbase");

Now Load data to the HIVE table using,
From Local file - load data local inpath
OR
Insert from a table - INSERT into student_hive SELECT student_partition2.name, student_partition2.id, student_partition2.course FROM student_partition2;

Note: Initially data already exists in HBase table. After  Hive table creation, the same data in HBase table would be reflected in the newly created Hive table. Till now all this is with he existing table and consider the existing data is of 10,000 records.

Our need is to insert new data which is of 5000 records and is in an ORC table student_partition2. After 'Insert into --' statement the new 5000 records from ORC table will be added up to the HBase table and combinely 15000 records are available. This is a simple append process.

6 comments: