1. Filtering records in Pig and running a Pig script.
Consider data as below:
7|Ron|ron@abc.com
8|Rina
9|Don|dmes@xyz.com
9|Don|dmes@xyz.com
10|Maya|maya@cnn.com
11|marry|mary@abc.com
Loading data in PIG,
data = LOAD 'file:///home/cloudera/Desktop/Spark/Pig/demo.txt' using PigStorage('|');
dump data;
(7,Ron,ron@abc.com)
(8,Rina)
(9,Don,dmes@xyz.com)
(9,Don,dmes@xyz.com)
(10,Maya,maya@cnn.com)
()
(11,marry,mary@abc.com)
Now we need to filter out the empty tuple and the duplicate record. Use:
filterdata = FILTER data BY (($0 is not null) OR ($1 is not null) OR ($2 is not null));
fltr_new = DISTINCT filterdata;
dump fltr_new;
To run a set of commands in one go, use a Pig script (here, 1.pig).
In local mode, browse to the directory containing 1.pig and run:
pig -x local 1.pig
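For reference, 1.pig presumably contains the same statements run interactively above:

```
-- 1.pig: load the data, drop the empty tuple, remove duplicates
data = LOAD 'file:///home/cloudera/Desktop/Spark/Pig/demo.txt' USING PigStorage('|');
filterdata = FILTER data BY (($0 is not null) OR ($1 is not null) OR ($2 is not null));
fltr_new = DISTINCT filterdata;
dump fltr_new;
```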
In MapReduce mode
Since MapReduce mode reads from HDFS, first copy the data file to HDFS:
hadoop fs -put file:///home/cloudera/Desktop/Spark/Pig/demo.txt hdfs://quickstart.cloudera:8020/user/Pig/demo.txt
Copy the 1.pig script to HDFS as well, so it can be executed from there:
hadoop fs -put file:///home/cloudera/Desktop/Spark/Pig/1.pig hdfs://quickstart.cloudera:8020/user/Pig/1.pig
Run the script from the terminal:
pig hdfs://quickstart.cloudera:8020/user/Pig/1.pig
Custom UDF in Pig
Steps:
Step 1. Open Eclipse and create a new Java project (right-click -> New -> Project -> Java Project), say CustomUDF.
Step 2. Add a new package (right-click the Java project -> New -> Package), say HadoopUDF.
Step 3. Add a Java class to this package.
Step 4. Add the Hadoop and Pig jars to the project:
Right-click the project -> Build Path -> Configure Build Path -> Libraries -> Add External JARs -> select the jar files from the Hadoop and Pig lib folders (plus the other jars in the Hadoop folder) -> OK.
Step 5. A Pig UDF must extend the EvalFunc class, so the new Java class should extend EvalFunc<String>; the type parameter (String here) is the data type the UDF returns.
Step 6. Override the exec function.
Below is the code:
package HadoopUDF;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UcFirst extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Guard against empty input tuples.
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            // Field 1 (the name column here) is returned upper-cased.
            String str = input.get(1).toString();
            return str.toUpperCase();
        } catch (Exception ex) {
            throw new IOException("Exception in UcFirst.exec", ex);
        }
    }
}
In public class UcFirst extends EvalFunc<String>, the type parameter is the data type the UDF returns (String here). Each row of the input file is passed to exec as a Tuple; the method first checks whether the tuple is null or empty and, if so, returns null.
Step 7. Use the UDF from Pig.
Export the jar: right-click the project -> Export -> Java -> JAR file -> select the project and specify the jar path -> Finish.
In the Pig terminal, register the jar file:
register file:///home/cloudera/Desktop/Spark/Pig/UDefFuncs.jar
result = foreach fltr_new generate HadoopUDF.UcFirst($0, $1);
dump result;
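Since UcFirst upper-cases the field at index 1 of the tuple it receives (the name column of the filtered sample data), the dump should look roughly like this (a sketch, not captured output):

```
(RON)
(RINA)
(DON)
(MAYA)
(MARRY)
```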
Note: to give the UDF a shorter alias, use DEFINE:
define toUpper HadoopUDF.UcFirst();
result = foreach fltr_new generate toUpper($0, $1);
The complete Pig script for this is:
data = LOAD 'hdfs://quickstart.cloudera:8020/user/Pig/demo.txt' using PigStorage('|');
filterdata = FILTER data BY (($0 is not NULL) OR ($1 is not NULL) OR($2 is not NULL));
fltr_new = DISTINCT filterdata;
register /home/cloudera/Desktop/Spark/Pig/UDefFuncs.jar;
define toUpper HadoopUDF.UcFirst();
result = foreach fltr_new generate toUpper($0, $1);
dump result;
Source: http://axbigdata.blogspot.in/2014/05/pig.html
Few Pig examples:
Sample data:
1949 76 1 3
1941 78 1 3 5
1. Counting the size of each individual atom (field) in a row
A = LOAD 'file:///home/cloudera/Desktop/Hadoop/pig_0.15.0/pig_inputs/data' as (f1:chararray, f2:chararray, f3:chararray);
X = FOREACH A GENERATE SIZE(f1);
DUMP X;
Eg2: a = load 'file:///home/cloudera/Desktop/Hadoop/pig_0.15.0/pig_inputs/sample.txt';
B = FOREACH a generate SIZE($1);
DUMP B;
(1)
(2)
(3)
(3)
(2)
2. Saving output to a directory,
STORE a into 'file:///home/cloudera/Desktop/Hadoop/pig_0.15.0/pig_inputs/sample_out.txt' USING PigStorage(':');
3. GROUP - The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.
group_data = GROUP student_details by age;
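Assuming the same student_details relation used in the multi-column example below, grouping on age alone should produce one bag per age (a sketch derived from that output):

```
grunt> Dump group_data;
(21,{(1,Rajiv,Reddy,21,9848022337,Hyderabad),(4,Preethi,Agarwal,21,9848022330,Pune)})
(22,{(2,siddarth,Battacharya,22,9848022338,Kolkata),(3,Rajesh,Khanna,22,9848022339,Delhi)})
(23,{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar),(6,Archana,Mishra,23,9848022335,Chennai)})
(24,{(7,Komal,Nayak,24,9848022334,trivendram),(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
```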
Grouping by Multiple Columns
grunt> group_multiple = GROUP student_details by (age, city);
grunt> Dump group_multiple;
((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})
Group All
You can group a relation by all the columns as shown below.
grunt> group_all = GROUP student_details All;
grunt> Dump group_all;
(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334 ,trivendram),
(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar),
(4,Preethi,Agarwal,21,9848022330,Pune),(3,Rajesh,Khanna,22,9848022339,Delhi),
(2,siddarth,Battacharya,22,9848022338,Kolkata),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
All the values will be assigned to a key 'all'.
4. COGROUP
5. JOIN
6. UNION
7. FILTER
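Items 4-7 are listed without examples; below is a minimal sketch with two hypothetical relations (the file names and schemas are assumptions, not from the original post):

```
-- Hypothetical comma-separated files: customers (id,name) and orders (oid,amount)
customers = LOAD 'customers.txt' USING PigStorage(',') AS (id:int, name:chararray);
orders    = LOAD 'orders.txt'    USING PigStorage(',') AS (oid:int, amount:int);
cg = COGROUP customers BY id, orders BY oid;   -- one bag per relation for each key
j  = JOIN customers BY id, orders BY oid;      -- inner join on the key
u  = UNION customers, customers;               -- concatenate relations with matching schemas
f  = FILTER orders BY amount > 100;            -- keep only rows satisfying the condition
```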
8. Counting number of unique records
A B user1
C D user2
A D user3
A D user1
A = LOAD 'file:///home/cloudera/Desktop/Hadoop/pig_0.15.0/pig_inputs/Mytestdata' USING PigStorage(' ') AS (a1,a2,a3);
A_UNIQUE = FOREACH A GENERATE $2;
A_UNIQUE = DISTINCT A_UNIQUE;
A_UNIQUE_GROUP = GROUP A_UNIQUE ALL;
u_count = FOREACH A_UNIQUE_GROUP GENERATE COUNT(A_UNIQUE);
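On the four sample rows above there are three distinct users (user1, user2, user3), so dumping the result should give:

```
DUMP u_count;
(3)
```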
9. Counting the number of elements in a row
10. Total number of records
a = load 'file:///home/cloudera/Desktop/Hadoop/pig_0.15.0/pig_inputs/data2' AS (f1:chararray);
a_grp = GROUP a All;
a_cnt = FOREACH a_grp GENERATE COUNT(a);
DUMP a_cnt;
(3)
Some Pig Examples:
1. Use REGEX_EXTRACT_ALL to extract values from a given input
A = LOAD 'FILENAME' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, 'expression')) AS (f0, f1, f2);
2. A = LOAD 'FILENAME' USING PigStorage(':');
3. To save a relation to a file:
STORE B INTO 'file path';
4. To filter records that have more than 3 fields:
C = FILTER A BY SIZE(TOTUPLE(*)) > 3; -- the whole record is formed into a tuple and its length compared with 3
5. Handling NULL values
corrupt_records = FILTER records BY temperature is null;     -- selects the corrupt records
good_records = FILTER records BY temperature is not null;    -- eliminates them
6. FOREACH usage,
A = load '/home/hadoop/work/pig_inputs/foreach_A' AS (f0:chararray, f1:chararray, f2:int);
dump A;
(Joe,cherry,2)
(Ali,apple,3)
(Joe,banana,2)
(Eve,apple,7)
B = foreach A generate $0, $2+1; -- project only fields 0 and 2 (adding 1 to field 2); this is what FOREACH is for
dump B;
(Joe,3)
(Ali,4)
(Joe,3)
(Eve,8)