Tuesday, November 26, 2019

Updating data in a Hive table


Updates can be applied without the ORC file format and without enabling Hive transactions (transaction=false), but only when the target table is partitioned. It is a two-step process:

1. Build the data set for the affected partition as a UNION ALL of the new/updated records and the existing records that are not being updated.

-- New/updated records: take the incoming version (tbl2) for every key that already exists in the partition
select tbl2.street, tbl2.city, tbl2.zip, tbl2.state, tbl2.beds, tbl2.baths, tbl2.sq__ft,
       tbl2.sale_date, tbl2.price, tbl2.latitude, tbl2.longitude, tbl2.type
from (select * from samp_tbl_part where type = "Multi-Family") tbl1
JOIN (select * from samp_tbl where type = "Multi-Family") tbl2
  ON tbl1.zip = tbl2.zip
UNION ALL
-- Non-updated records: keep the existing partition rows that have no matching incoming record
select tbl1.*
from (select * from samp_tbl_part where type = "Multi-Family") tbl1
LEFT JOIN (select * from samp_tbl where type = "Multi-Family") tbl2
  ON tbl1.zip = tbl2.zip
where tbl2.zip is NULL;

2. INSERT OVERWRITE the partition with that result (the complete statement is shown after the table definitions below).

Example:
CREATE EXTERNAL TABLE `samp_tbl_part`(
  `street` string, 
  `city` string, 
  `zip` string, 
  `state` string, 
  `beds` string, 
  `baths` string, 
  `sq__ft` string, 
  `sale_date` string, 
  `price` string, 
  `latitude` string, 
  `longitude` string)
PARTITIONED BY ( 
  `type` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://quickstart.cloudera:8020/user/hive/sampledata/realestate_part';
  
Sample rows currently in the Multi-Family partition of samp_tbl_part:

220 OLD AIRPORT RD AUBURN 95603 CA 2 2 960 Mon May 19 00:00:00 EDT 2008 285000 38.939802 -121.054575 Multi-Family
398 LINDLEY DR SACRAMENTO 95815 CA 4 2 1744 Mon May 19 00:00:00 EDT 2008 416767 38.622359 -121.457582 Multi-Family
8198 STEVENSON AVE SACRAMENTO 95828 CA 6 4 2475 Fri May 16 00:00:00 EDT 2008 159900 38.465271 -121.40426 Multi-Family
1139 CLINTON RD SACRAMENTO 95825 CA 4 2 1776 Fri May 16 00:00:00 EDT 2008 221250 38.585291 -121.406824 Multi-Family
7351 GIGI PL SACRAMENTO 95828 CA 4 2 1859 Thu May 15 00:00:00 EDT 2008 170000 38.490606 -121.410173 Multi-Family
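
To confirm which partitions exist on the table and to inspect the rows of the partition being updated, something like the following can be used:

SHOW PARTITIONS samp_tbl_part;
select * from samp_tbl_part where type = "Multi-Family";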

CREATE EXTERNAL TABLE `samp_tbl`(
  `street` string, 
  `city` string, 
  `zip` string, 
  `state` string, 
  `beds` string, 
  `baths` string, 
  `sq__ft` string, 
  `type` string, 
  `sale_date` string, 
  `price` string, 
  `latitude` string, 
  `longitude` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'=',', 
  'line.delim'='\n', 
  'serialization.format'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://quickstart.cloudera:8020/user/hive/sampledata/realestate';

Incoming updated record in samp_tbl (zip 95825, state changed from CA to FL):

1139 CLINTON RD SACRAMENTO 95825 FL 4 2 1776 Multi-Family Fri May 16 00:00:00 EDT 2008 221250 38.585291 -121.406824
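
The incoming record can be staged into samp_tbl in whatever way is convenient. As a minimal sketch, assuming the updated record sits in a local comma-delimited file (the file name below is hypothetical), a LOAD DATA statement would do:

-- Assumes the updated record is in a comma-delimited local file; the path is an example only
LOAD DATA LOCAL INPATH '/tmp/realestate_updates.csv' INTO TABLE samp_tbl;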
  
Complete INSERT OVERWRITE statement (step 2):
INSERT OVERWRITE TABLE samp_tbl_part partition (type)
select tbl2.street, tbl2.city, tbl2.zip, tbl2.state, tbl2.beds, tbl2.baths, tbl2.sq__ft,
       tbl2.sale_date, tbl2.price, tbl2.latitude, tbl2.longitude, tbl2.type
from (select * from samp_tbl_part where type = "Multi-Family") tbl1
JOIN (select * from samp_tbl where type = "Multi-Family") tbl2
  ON tbl1.zip = tbl2.zip
UNION ALL
select tbl1.*
from (select * from samp_tbl_part where type = "Multi-Family") tbl1
LEFT JOIN (select * from samp_tbl where type = "Multi-Family") tbl2
  ON tbl1.zip = tbl2.zip
where tbl2.zip is NULL;
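
Because the partition value comes from the data itself rather than being hard-coded, dynamic partitioning typically has to be enabled in the session before running the statement:

-- Allow the partition value to be taken from the SELECT (dynamic partitioning)
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

After the overwrite, the result can be spot-checked; given the sample data above, the row for zip 95825 should now show state FL:

select * from samp_tbl_part where type = "Multi-Family" and zip = "95825";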
