show functions;
DESCRIBE FUNCTION <function_name>;
describe function month or describe function extended month;
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inOperators
Eg:
SELECT round(cgp) from student_record where student_id=15;
set hive.cli.print.header=true;
DESCRIBE FUNCTION <function_name>;
describe function month or describe function extended month;
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inOperators
Eg:
SELECT round(cgp) from student_record where student_id=15;
Date and time functions:
set hive.cli.print.header=true;
select pay_date,date_format(to_date(pay_date), 'yyyy-MM') from department_avg_sal;
pay_date date_format(to_date(pay_date), yyyy-MM)
2017-03-31 2017-03
2017-02-28 2017-02
2017-03-31 2017-03
2017-02-28 2017-02
select *, from_unixtime(CAST(tmp_time.time/1000 as BIGINT), 'yyyy-MM-dd') as created_timestamp from tmp_time;
OK
tmp_time.time tmp_time.name created_timestamp
1406185200000 Leela 2014-07-24
select unix_timestamp('2018-06-04 11:06:38', 'yyyy-MM-dd HH:mm:ss') as s from tmp_time LIMIT 1;
1528110398
Custom Functions:
UDF - applied on a single row eg: day()UDAF - User defined Aggregrate Functions eg: Min() - Applied to set of rows
UDTF - User defined Transactional functions - transform a single input row to multiple output rows - Eg: json_tuple()
JSON file parsing
{"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}}
CREATE TABLE hive_parsing_json_table ( json string );
LOAD DATA LOCAL INPATH '/tmp/hive-parsing-json.json' INTO TABLE hive_parsing_json_table;
LATERAL VIEW - forms a virtual table having the supplied table alias
select v1.Country, v1.Page, v4.impressions_s, v4.impressions_o
from hive_parsing_json_table hpjp
LATERAL VIEW json_tuple(hpjp.json, 'country', 'page', 'data') v1
as Country, Page, data
LATERAL VIEW json_tuple(v1.data, 'ad') v2
as Ad
LATERAL VIEW json_tuple(v2.Ad, 'impressions') v3
as Impressions
LATERAL VIEW json_tuple(v3.Impressions, 's' , 'o') v4
as impressions_s,impressions_o;
To flatten the multiline json to a single line JSON
CREATE EXTERNAL TABLE StudentsRaw (textcol string) STORED AS TEXTFILE LOCATION '/user/hadoop/Leela/Json';
CREATE TABLE StudentsOneLine(json_body string);
INSERT OVERWRITE TABLE StudentsOneLine
SELECT CONCAT_WS(' ',COLLECT_LIST(textcol)) AS singlelineJSON
FROM (SELECT INPUT__FILE__NAME,BLOCK__OFFSET__INSIDE__FILE, textcol FROM StudentsRaw DISTRIBUTE BY INPUT__FILE__NAME SORT BY BLOCK__OFFSET__INSIDE__FILE) x
GROUP BY INPUT__FILE__NAME; //This statement does the flattening job
Generate the create statement for an existing hive table
SHOW CREATE TABLE
shows the CREATE TABLE
statement that creates a given table.Eg: SHOW CREATE TABLE myTable;
COALESCE function:
Of the fields above
(field1, field2, … , fieldn)
, for each record returned it returns the value of the first field of the ones listed in the COALESCE function that is not NULL. If all of the fields are NULL, it returns NULL.
For example, let’s say one had 3 date fields,
datefield1, datefield2, and datefield3
from the table tblDates
.PRIMARY_KEY | DATEFIELD1 | DATEFIELD2 | DATEFIELD3 |
---|---|---|---|
1 | NULL | NULL | 1993-06-04 |
The code:
SELECT COALESCE(datefield1, datefield2, datefield3) as first_date_found
FROM
tblDates
WHERE
primary_key = 1
will return ‘1993-06-04’
Usecase of COALESCE follow http://www.thelandbeyondspreadsheets.com/what-is-the-hive-sql-coalesce-function-what-does-it-do-and-why-on-earth-is-it-useful/
Difference between COALESCE and NVL functions:
NVL
Syntax - NVL(arg1, arg2)
This will replace arg1 with arg2 if arg1 value is NULL
Example -
NVL(value, default value)
Returns default value if value is null else returns value
COALESCE
Syntax - coalesce(value1, value 2, …value n)
This will return the first value that is not NULL, or NULL if all values's are NULL
Example
coalesce(email,phonenumber,address)
If customer primary contact medium is email, if email is null then use phonenumber, and if phonenumber is also null then use address
So now to understand the context NVL takes only two arguments while coalesce gives more flexibility to use different fields in an order of preference when one is NULL.
Create Sequence numbering for rows for a Hive table / Primary key / unique key creation in Hive.
add jar hdfs:////user/HiveHbase/hive-contrib-1.1.0.jar;
CREATE TEMPORARY FUNCTION row_sequence as
'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
CREATE TABLE IF NOT EXISTS tbl1
(
ID int,
name String
) row format delimited fields terminated by ' ' stored as textfile;
CREATE TABLE IF NOT EXISTS tbl2
(
RowID BigInt,
ID int,
name String
) row format delimited fields terminated by ' ' stored as textfile;
INSERT INTO TABLE tbl1 values(56,"Name1");
INSERT INTO TABLE tbl1 values(57,"Name2");
INSERT INTO TABLE tbl1 values(58,"Name3");
INSERT INTO TABLE tbl1 values(59,"Name4");
INSERT OVERWRITE TABLE tbl2 SELECT row_sequence(), * FROM tbl1;
INSERT INTO TABLE tbl1 values(60,"Name5");
INSERT INTO TABLE tbl1 values(61,"Name6");
INSERT INTO TABLE tbl2 SELECT m.max+row_sequence(), * FROM tbl1 where ID > 59 ;
insert into table tbl2
select m.max+row_sequence() as inc , t.*
from (select * from tbl1 where ID > 59) t
join
(select max(RowID) as max from tbl2) m;
Source: http://kiran-javaworld.blogspot.in/2016/11/hive-auto-increment-column-incrementing.html
Alternate ways are:
1. Use the reflect UDF to generate UUIDs.
eg: reflect("java.util.UUID", "randomUUID")
2. USING row_number(). This is Hive's inbuild function. This can be used as like row_sequence(), however max() is not available for this. so use
select row_number() over() as colvals from tbl2 order by colvals desc LIMIT 1;
insert into table tbl2
select m.colvals +row_sequence() as inc , t.*
from (select * from tbl1 where ID > 65) t
join
(select row_number() over() as colvals from tbl2 order by colvals desc LIMIT 1) m;
Few queries of this.
Insert INTO TABLE tbl2 select row_number() over(), * from tbl1 WHERE ID > 63;
select row_number() over() as colvals from tbl1 order by colvals desc LIMIT 1;
Insert INTO TABLE tbl2 select 9+row_number() over(), * from tbl1 WHERE ID > 64;
Row_Number() and Rank() functions
hive> select fname,ip,ROW_NUMBER() OVER (ORDER BY ip ) as rownum from sales;
RANK: It is similar to ROW_NUMBER, but the equal rows are ranked with the same number. For example, if we use RANK in the previous query instead of ROW_NUM
hive> select fname,ip,RANK() OVER (ORDER BY ip) as ranknum, RANK() OVER (PARTITION BY ip order by fname ) from sales ;
Source: https://www.packtpub.com/mapt/book/big_data_and_business_intelligence/9781782161080/6/ch06lvl1sec75/analytics-functions-in-hive
https://codingsight.com/similarities-and-differences-among-rank-dense_rank-and-row_number-functions/
Accessing data from complex data types in Hive:
Complex data types like Array, Map, Struct, Union are also known as collections in Hive.
Complex data types like Array, Map, Struct, Union are also known as collections in Hive.
Fetching individual values from an array of strings in key value pairs:
Complex column types are:
`device`struct<id:bigint,deviceid:string,make:string,model:string>
and
`response` array<struct<name:string,value:string>>)
select abc_exttbl.device.id from abc_exttbl;
select abc_exttbl.response from abc_exttbl;
select abc_exttbl.response[0].name from abc_exttbl;
select abc_exttbl.response[0].value from abc_exttbl;
select abc_exttbl.response[1].value from abc_exttbl;
Get the lenght of elements in an array.
select size(abc_exttbl.response) from abc_exttbl;
Analytical functions in Hive
Analytics functions
- RANK
- ROW_NUMBER
- DENSE_RANK
- CUME_DIST
- PERCENT_RANK
- NTILE
Eg:
select pat_id, dept_id, ins_amt, row_number() over (order by ins_amt) as rn, rank() over (order by ins_amt ) as rk, dense_rank() over (order by ins_amt ) as dense_rk from patient;OVER with standard aggregates:
- COUNT
- SUM
- MIN
- MAX
- AVG
select pat_id, dept_id, count(*) over (partition by dept_id order by dept_id asc) as pat_cnt from patient;
Windowing functions
LEAD
LAG
FIRST_VALUE
LAST_VALUE
Eg:select pat_id, dept_id, ins_amt, lead(ins_amt,1,0) over (partition by dept_id order by dept_id asc ) as lead_ins_amt, lag(ins_amt,1,0) over (partition by dept_id order by dept_id asc ) as lag_ins_amt from patient;partition by works as creating a window for the specified column. In the above case a window set will be created for the same values of dept_id.
Reference:https://dwgeek.com/hadoop-hive-analytic-functions-examples.html/
Conditional Functionshttp://dwgeek.com/hadoop-hive-conditional-functions-if-case-coalesce-nvl-decode.html/
Few scenarios on null:1. select c1 + c2 from tbl1; //This adds NUMBERS in c1 and c2 columnsif c2 = NULL then result is NULL2. CASE(NULL == NULL)return 'yes'else 'No'returns No as NULL == NULL will not be equal to 1
good information given the hive functionas topic .i have some information on hadoop in our Blog if u want visit our
ReplyDeleteHadoop Training in Hyderabad