Friday, April 7, 2017

Primary Key/Unique id creation in HIVE

Primary Key creation in HIVE.


There are 2 ways to create a Primary Key in HIVE:


OPTION 1:

Using the reflect UDF to generate a random UUID for each row.
Eg: select reflect("java.util.UUID", "randomUUID"), name, deptnum, branch, year from studentdata;

Disadvantage: mappers execute in parallel, so there is a chance of repeated (duplicate) UUIDs.
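
For example, the generated UUIDs can be written straight into a keyed table. A minimal sketch, assuming the studentdata table from the example above (the target table name studentdata_uuid and the column name rowid_pk are only illustrative):

    -- illustrative CTAS; table and key column names are assumptions
    create table studentdata_uuid as
    select reflect("java.util.UUID", "randomUUID") as rowid_pk,
           name, deptnum, branch, year
    from studentdata;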

OPTION 2:

Creating a key column by concatenating 2 columns so that the key has a meaningful value.
Eg: select name, deptnum, hash(concat(name,year)) from studentdata;

Steps:
1. Create a Temporary External table with all the required fields; this table does not have a Primary Key column.
Eg: create External table studentdata(name string, dept INT, course string, year int) ...

2. Now create the main table with an additional rowid_pk column:
   
    create table studentdata_PK(rowid_pk string, name string, dept INT, course string, year int)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    stored as AVRO
    location "/user/Leela/Hive/Student_AVRO";

3. Insert the data into the main table; OPTION 2 is implemented here using the concatenated key (a hashed variant is sketched after the statement).
    insert into studentdata_PK SELECT concat(name,year), * from studentdata;
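
If the hashed key described in OPTION 2 is preferred over the raw concatenation, the insert could instead look like the following sketch (hash() returns an int, so it is cast to string here to match the rowid_pk column type; the explicit column list is just for clarity):

    -- sketch of the hashed-key variant of the same insert
    insert into studentdata_PK
    select cast(hash(concat(name, year)) as string), name, dept, course, year
    from studentdata;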
   

Primary Key/Unique id creation in Spark


Get the existing max value of the key column from the Sink Table, then generate the new ids by offsetting a row_number() sequence over the source rows by that maximum:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, max, row_number}
import org.apache.spark.sql.types.LongType

// Current maximum key value in the sink table
val maxval = sinkdf.agg(max(colName)).rdd.map(x => x.mkString).collect()(0)

// Offset a row_number() sequence over the source rows by that maximum
val windowSpec = Window.orderBy("ID")
val result = srcdf.withColumn("Random_IDN", row_number().over(windowSpec) + lit(maxval).cast(LongType))
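
Note that Window.orderBy without a partitionBy moves all the source rows into a single partition to compute row_number(), so this approach is best suited to modest incremental loads.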
