Sunday, February 19, 2017

PIG

1. Filtering records in Pig and running a Pig script.
Consider data as below:
7|Ron|ron@abc.com
8|Rina
9|Don|dmes@xyz.com
9|Don|dmes@xyz.com
10|Maya|maya@cnn.com

11|marry|mary@abc.com

Loading data in PIG,
data = LOAD 'file:///home/cloudera/Desktop/Spark/Pig/demo.txt' using PigStorage('|');
dump data;
(7,Ron,ron@abc.com)
(8,Rina)
(9,Don,dmes@xyz.com)
(9,Don,dmes@xyz.com)
(10,Maya,maya@cnn.com)
()
(11,marry,mary@abc.com)


Now we need to filter out the empty tuple and the duplicate record.

Use,
filterdata = FILTER data BY (($0 is not NULL) OR ($1 is not NULL) OR ($2 is not NULL));

fltr_new = DISTINCT filterdata;

dump fltr_new;
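
With the sample data above, the dump should produce something like this (ordering may differ):

(7,Ron,ron@abc.com)
(8,Rina)
(9,Don,dmes@xyz.com)
(10,Maya,maya@cnn.com)
(11,marry,mary@abc.com)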

To run a set of commands in one go, put them in a Pig script.

In local mode, browse to the location where the 1.pig file is available and use the command:
pig -x local 1.pig
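
Here 1.pig simply contains the commands shown above, for example:

data = LOAD 'file:///home/cloudera/Desktop/Spark/Pig/demo.txt' using PigStorage('|');
filterdata = FILTER data BY (($0 is not NULL) OR ($1 is not NULL) OR ($2 is not NULL));
fltr_new = DISTINCT filterdata;
dump fltr_new;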

In MapReduce mode

Since this is MapReduce mode, the data file has to be available in HDFS, so load the data file to HDFS:
 hadoop fs -put file:///home/cloudera/Desktop/Spark/Pig/demo.txt hdfs://quickstart.cloudera:8020/user/Pig/demo.txt

Copy the 1.pig script file to HDFS as well, so it can be executed from there:
hadoop fs -put file:///home/cloudera/Desktop/Spark/Pig/1.pig hdfs://quickstart.cloudera:8020/user/Pig/1.pig

Running from the terminal:
pig hdfs://quickstart.cloudera:8020/user/Pig/1.pig
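
Note: the script itself can also be kept on the local filesystem while still executing in MapReduce mode, as long as the paths inside it point at HDFS, for example:

pig -x mapreduce /home/cloudera/Desktop/Spark/Pig/1.pig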

Custom UDF in Pig

Steps:

Step 1. Open Eclipse and create a new Java project: Right click -> New -> Project -> Java Project; let's call it CustomUDF.
Step 2. Add a new package: Right click the Java project -> New -> Package; let's call it HadoopUDF.
Step 3. Add a Java class to this package.
Step 4. Add the Hadoop and Pig jars to this project.
Right click on the project -> Build Path -> Configure Build Path -> Libraries -> Add External JARs -> select the jar files from the Hadoop and Pig lib folders (plus the other jars in the Hadoop folder) -> click OK.
Step 5. The key step in writing a UDF is to extend the EvalFunc class, so the Java class should extend EvalFunc<String>; the type parameter (String here) is the return type of the UDF.

Step 6. Override the exec() function.

Below is the code:

package HadoopUDF;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UcFirst extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for an empty input tuple
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            // Take the second field of the tuple and upper-case it
            String str = input.get(1).toString();
            return str.toUpperCase();
        } catch (Exception ex) {
            throw new IOException("Exception in exec()", ex);
        }
    }
}

The class extends EvalFunc<T>, where T is the data type the UDF returns (String here). Each input row is passed to exec() as a Tuple; the function first checks whether the input is null or empty and returns null in that case, otherwise it takes the second field and returns it in upper case.

Step 7. Export the jar and use the UDF in Pig.
Export the jar: Right click the project -> Export -> Java -> JAR file -> select the project for export and specify the jar path -> Finish.

In the Pig terminal, register the jar file:
register file:///home/cloudera/Desktop/Spark/Pig/UDefFuncs.jar 

result = foreach fltr_new generate HadoopUDF.UcFirst($0, $1);

dump result; 

Note: Alternatively, you can define an alias for the UDF:

 define toUpper HadoopUDF.UcFirst();

result = foreach fltr_new generate toUpper($0, $1); 


The Pig script for this is:

data = LOAD 'hdfs://quickstart.cloudera:8020/user/Pig/demo.txt' using PigStorage('|');

filterdata = FILTER data BY (($0 is not NULL) OR ($1 is not NULL) OR ($2 is not NULL));

fltr_new = DISTINCT filterdata;

register /home/cloudera/Desktop/Spark/Pig/UDefFuncs.jar;

define toUpper HadoopUDF.UcFirst();

result = foreach fltr_new generate toUpper($0, $1);

dump result; 
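
With the sample demo.txt above, this should dump something like the following (the UDF upper-cases the second field passed to it):

(RON)
(RINA)
(DON)
(MAYA)
(MARRY)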




Some Pig Examples:



1. Use REGEX_EXTRACT_ALL to extract values from a given input.

A = LOAD 'FILENAME' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'expression')) AS (f0, f1, f2);
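
As an illustrative sketch against the demo.txt lines used earlier (e.g. 7|Ron|ron@abc.com); the regex here is just an assumption about that format:

lines = LOAD 'file:///home/cloudera/Desktop/Spark/Pig/demo.txt' AS (line:chararray);
-- keep only lines with all three fields, since REGEX_EXTRACT_ALL returns null when the pattern does not match the whole line
full_lines = FILTER lines BY line MATCHES '\\d+\\|\\w+\\|\\S+';
parsed = FOREACH full_lines GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '(\\d+)\\|(\\w+)\\|(\\S+)')) AS (id, name, email);
dump parsed;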

2. A = LOAD 'FILENAME' USING PigStorage(':');

3. To save a relation to a file:

STORE B INTO 'file path';

4. To filter records which have more than 3 fields:

C = FILTER A BY SIZE(TOTUPLE(*)) > 3;  -- the whole record is formed into a tuple and its size compared against 3

http://axbigdata.blogspot.in/2014/05/pig.html

5. Isolating records with NULL values:

corrupt_records = FILTER records BY temperature is null;
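
Conversely, to drop the null records and keep only the good ones, flip the condition:

good_records = FILTER records BY temperature is not null;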

6. FOREACH usage,

A = load '/home/hadoop/work/pig_inputs/foreach_A' AS (f0:chararray, f1:chararray, f2:int);
dump A;
(Joe,cherry,2)
(Ali,apple,3)
(Joe,banana,2)
(Eve,apple,7)
B = foreach A generate $0, $2+1;   -- project only the 0th and 2nd fields, adding 1 to the 2nd
dump B;
(Joe,3)
(Ali,4)
(Joe,3)
(Eve,8)


A few more Pig examples:

1949 76 1 3
1941 78 1 3 5

1. Counting the size of individual atoms (fields) in each row

A = LOAD 'file:///home/cloudera/Desktop/Hadoop/pig_0.15.0/pig_inputs/data' as (f1:chararray, f2:chararray, f3:chararray);
X = FOREACH A GENERATE SIZE(f1);
DUMP X;

Eg2: a = load 'file:///home/cloudera/Desktop/Hadoop/pig_0.15.0/pig_inputs/sample.txt';
B = FOREACH a generate SIZE($1);
DUMP B;

(1)
(2)
(3)
(3)
(2)

2. Saving output to a directory,

STORE a into 'file:///home/cloudera/Desktop/Hadoop/pig_0.15.0/pig_inputs/sample_out.txt' USING PigStorage(':');

3. GROUP - The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.
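
These GROUP examples assume a student_details relation loaded along these lines (the path and delimiter are assumptions; the schema is inferred from the output below):

student_details = LOAD 'hdfs://quickstart.cloudera:8020/user/Pig/student_details.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);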

group_data = GROUP student_details by age;
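
With the data used in the examples below, the dump of group_data should look roughly like this (the order of tuples inside each bag may differ):

(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})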

Grouping by Multiple Columns

grunt> group_multiple = GROUP student_details by (age, city);

grunt> Dump group_multiple;

((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})

Group All

You can group a relation by all the columns as shown below.

grunt> group_all = GROUP student_details All;

grunt> Dump group_all;

(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram),
(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar),
(4,Preethi,Agarwal,21,9848022330,Pune),(3,Rajesh,Khanna,22,9848022339,Delhi),
(2,siddarth,Battacharya,22,9848022338,Kolkata),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})

All the values will be assigned to a key 'all'.

4. COGROUP

5. JOIN

6. UNION

7. FILTER
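
No examples are given here for these four operators; as a minimal sketch, using hypothetical customers and orders relations:

joined = JOIN customers BY id, orders BY customer_id;
cogrouped = COGROUP customers BY id, orders BY customer_id;
combined = UNION customers, new_customers;
adults = FILTER customers BY age >= 18;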

8. Counting the number of unique values in a column (unique users here)

A B user1
C D user2
A D user3
A D user1

A = LOAD 'file:///home/cloudera/Desktop/Hadoop/pig_0.15.0/pig_inputs/Mytestdata' USING PigStorage(' ') AS (a1,a2,a3);
A_UNIQUE = FOREACH A GENERATE $2;
A_UNIQUE = DISTINCT A_UNIQUE;
A_UNIQUE_GROUP = GROUP A_UNIQUE ALL;
u_count = FOREACH A_UNIQUE_GROUP GENERATE COUNT(A_UNIQUE);
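
For the four rows above this should give (3), since the distinct users are user1, user2 and user3:

dump u_count;
(3)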

9. Counting the number of elements in a row:
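
No example is given for this one; a minimal sketch, reusing SIZE(TOTUPLE(*)) from example 4 in the earlier list:

elem_count = FOREACH A GENERATE SIZE(TOTUPLE(*));
dump elem_count;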

10. Total number of records:

a = load 'file:///home/cloudera/Desktop/Hadoop/pig_0.15.0/pig_inputs/data2' AS (f1:chararray);
a_grp = GROUP a All;

a_cnt = FOREACH a_grp GENERATE COUNT(a);
DUMP a_cnt;
(3)
