Saturday, April 22, 2017

OOZIE


3 Types of Jobs

1. Workflow Jobs
2. Coordinator Jobs
3. Bundle Jobs

Workflow Jobs

-> Jobs run from action nodes. Each type of action (Hive, Pig, shell script, etc.) has its own tag name.
-> To run/trigger the scheduled flow described in the XML, use the command:
oozie job -oozie http://host_name:8080/oozie -D oozie.wf.application.path=hdfs://namenodepath/pathof_workflow_xml/workflow.xml -run
-> The job tracker and name node have to be specified. The tags are <job-tracker> and <name-node>.
-> To check the job status, go to the Oozie web console at http://host_name:8080/
-> To run jobs in parallel, use fork nodes. A join node has to be used along with fork nodes (a sketch is shown below).
-> All the individual action nodes must go to the join node after completing their tasks. The action after the join is not taken until every forked action has completed and reached the join node.
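
For illustration, a minimal fork/join fragment could look like the one below (the node and action names are placeholders; hive_action_1 and hive_action_2 are normal <action> nodes defined elsewhere in the workflow, each ending with <ok to = "join_node" />):

<fork name = "fork_node">
   <path start = "hive_action_1" />
   <path start = "hive_action_2" />
</fork>

<!-- hive_action_1 and hive_action_2 run in parallel; each transitions to join_node on success -->

<join name = "join_node" to = "next_action" />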
-> Decision Nodes: 

Eg: In the case below, the HDFS EL function fs:exists is used:

boolean fs:exists(String path)

It returns true or false depending on whether the specified path exists.

If the path does not exist, the "Create_External_Table" action is called. If it exists, the "orc_table_exists" action is called.
<workflow-app xmlns = "uri:oozie:workflow:0.4" name = "simple-Workflow">
   <start to = "external_table_exists" />
   
   <decision name = "external_table_exists">
      <switch>
         <case to = "Create_External_Table">${fs:exists('/test/abc') eq 'false'}
            </case>
         <default to = "orc_table_exists" />
      </switch>
   </decision>

   <action name = "Create_External_Table">
      <hive xmlns = "uri:oozie:hive-action:0.4">
         <job-tracker>xyz.com:8088</job-tracker>
         <name-node>hdfs://rootname</name-node>
         <script>hdfs_path_of_script/external.hive</script>
      </hive>
      <ok to = "orc_table_exists" />
      <error to = "kill_job" />
   </action>

Property File:

Oozie workflows can be parameterized. The parameters come from a configuration file called a properties file. We can run multiple jobs with the same workflow by using multiple .properties files (one properties file per job).
Suppose we want to change the job tracker URL, the script name, or the value of a parameter.

We can specify a config file (.properties) and pass it while running the workflow.

In the properties file, variables like ${nameNode} can be defined, and these are passed to workflow.xml at runtime. A sample properties file is:

# properties
nameNode = hdfs://rootname
jobTracker = xyz.com:8088
script_name_external = hdfs_path_of_script/external.hive
script_name_orc=hdfs_path_of_script/orc.hive
script_name_copy=hdfs_path_of_script/Copydata.hive
database = database_name

When the above properties are used in the Workflow.xml, the file looks as below:
<workflow-app xmlns = "uri:oozie:workflow:0.4" name = "simple-Workflow">
   <start to = "Create_External_Table" />
   <action name = "Create_External_Table">
      <hive xmlns = "uri:oozie:hive-action:0.4">
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <script>${script_name_external}</script>
      </hive>
      <ok to = "Create_orc_Table" />
      <error to = "kill_job" />
   </action>

To run the Oozie job with a properties file, the below command has to be used:
oozie job -oozie http://hostname:8080/oozie -config edgenode_path/job1.properties -D oozie.wf.application.path=hdfs://namenodepath/PathofWorkflow_xml/workflow.xml -run

Note: The properties file should be on the edge node and not in HDFS. The same properties file can be used for different workflows.

A properties file can have more parameters than the ones used in workflow.xml, and no error will be thrown. However, if a parameter used in the workflow is not present in the properties file, an error will be thrown.


Coordinator Jobs: 

Coordinator applications allow users to schedule complex workflows, including workflows that run regularly. Workflow execution is controlled by a start time, a frequency, event predicates, etc.

A simple coordinator job is as below:

<coordinator-app xmlns = "uri:oozie:coordinator:0.2" name = "coord_copydata_from_external_orc"
   frequency = "5 * * * *" start = "2016-01-18T01:00Z" end = "2025-12-31T00:00Z"
   timezone = "America/Los_Angeles">
   
   <controls>
      <timeout>1</timeout>
      <concurrency>1</concurrency>
      <execution>FIFO</execution>
      <throttle>1</throttle>
   </controls>
   
   <action>
      <workflow>
         <app-path>pathof_workflow_xml/workflow.xml</app-path>
      </workflow>
   </action>
</coordinator-app>

At the specified start time, the workflow XML gets launched, and this in turn runs the Hive script at the 5th minute of every hour.

To run this coordinator job, use:

oozie job -oozie http://hostname:8080/oozie -config edgenode_path/job1.properties -D oozie.coord.application.path=hdfs://namenodepath/pathofcoordinatorXML/coordinator.xml -run

Bundle Jobs:

More than one coordinator job can be bundled together and launched as a unit.

To launch a bundle, set oozie.bundle.application.path to the HDFS path of the bundle.xml and execute the same command. A minimal bundle.xml sketch is shown below.
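
The exact layout varies, but a bundle definition roughly along these lines (coordinator names and HDFS paths below are placeholders) simply lists the coordinators to be launched together:

<bundle-app name = "my_bundle" xmlns = "uri:oozie:bundle:0.2">
   <coordinator name = "coord_copydata_from_external_orc">
      <app-path>hdfs://namenodepath/pathofcoordinatorXML/coordinator.xml</app-path>
   </coordinator>
   <coordinator name = "coord_second_job">
      <app-path>hdfs://namenodepath/pathofsecondcoordinatorXML/coordinator.xml</app-path>
   </coordinator>
</bundle-app>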

Tags for launching different actions:


Hive: 
<hive xmlns = "uri:oozie:hive-action:0.x">

MapReduce:
<map-reduce>
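
For example, a rough map-reduce action sketch using the old mapred API (the mapper/reducer class names and the input/output directories below are placeholders, not from the original post):

<action name = "mr_wordcount">
   <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
         <property>
            <name>mapred.mapper.class</name>
            <value>org.myorg.WordCount.Map</value>
         </property>
         <property>
            <name>mapred.reducer.class</name>
            <value>org.myorg.WordCount.Reduce</value>
         </property>
         <property>
            <name>mapred.input.dir</name>
            <value>/user/input_dir</value>
         </property>
         <property>
            <name>mapred.output.dir</name>
            <value>/user/output_dir</value>
         </property>
      </configuration>
   </map-reduce>
   <ok to = "next_action" />
   <error to = "kill_job" />
</action>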

Spark:
<spark xmlns="uri:oozie:spark-action:0.1">

Note: Via the command line, a Spark job can be submitted using spark-submit, and all the options like --class, --master, --deploy-mode, --executor-memory, --num-executors have to be specified.

eg:
./bin/spark-submit \
--class org.apache.amd.sparkjob \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
/HDFSpath/Jar_Path/sample.jar

Via OOZIE all these options can be set in the form of tags.

eg: 
    <action name="myfirstsparkjob">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>foo:8021</job-tracker>
            <name-node>bar:8020</name-node>
            <prepare>
                <delete path="${jobOutput}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
            </configuration>
            <master>yarn</master>
            <mode>cluster</mode>
            <name>Spark Example</name>
            <class>org.apache.spark.examples.mllib.JavaALS</class>
            <jar>/lib/spark-examples_2.10-1.1.0.jar</jar>
            <spark-opts>--executor-memory 20G --num-executors 50</spark-opts>
            <arg>inputpath=hdfs://localhost/input/file.txt</arg>
            <arg>value=2</arg>
        </spark>
        <ok to="myotherjob"/>
        <error to="errorcleanup"/>
    </action>

Pig:
<pig>
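
A minimal Pig action sketch (the script path, parameters, and node names are placeholders):

<action name = "pig_node">
   <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>hdfs_path_of_script/transform.pig</script>
      <param>INPUT=/user/input_dir</param>
      <param>OUTPUT=/user/output_dir</param>
   </pig>
   <ok to = "next_action" />
   <error to = "kill_job" />
</action>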

Sqoop:
<sqoop xmlns="uri:oozie:sqoop-action:0.x">
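
A rough Sqoop import sketch (the JDBC URL, table name, and target directory are placeholders); the full Sqoop command goes inside a single <command> element:

<action name = "sqoop_import">
   <sqoop xmlns = "uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>import --connect jdbc:mysql://db_host/database_name --table table_name --target-dir /user/sqoop_output -m 1</command>
   </sqoop>
   <ok to = "next_action" />
   <error to = "kill_job" />
</action>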

Decision:
<decision name="[NODE-NAME]">

HDFS action:
<fs>
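
A small fs (HDFS) action sketch (paths are placeholders); the fs action supports operations such as delete, mkdir, and move:

<action name = "cleanup_hdfs">
   <fs>
      <delete path = "${nameNode}/user/old_output" />
      <mkdir path = "${nameNode}/user/new_dir" />
      <move source = "${nameNode}/user/staging_dir" target = "/user/final_dir" />
   </fs>
   <ok to = "next_action" />
   <error to = "kill_job" />
</action>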

email:
<email xmlns = "uri:oozie:email-action:0.x">
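
A small email action sketch (the address and message text are placeholders). The email action sends mail through the SMTP server configured for the Oozie server in oozie-site.xml, so apart from this definition no extra code should be needed:

<action name = "send_status_mail">
   <email xmlns = "uri:oozie:email-action:0.1">
      <to>user@example.com</to>
      <subject>Workflow ${wf:id()} completed</subject>
      <body>The copy-data workflow finished successfully.</body>
   </email>
   <ok to = "end" />
   <error to = "kill_job" />
</action>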

Shell Action: Runs a shell command.
<shell xmlns = "uri:oozie:shell-action:0.x">
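
A small shell action sketch (script name and argument are placeholders); the <file> element ships the script from HDFS to the node where it runs:

<action name = "shell_node">
   <shell xmlns = "uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>myscript.sh</exec>
      <argument>arg1</argument>
      <file>hdfs_path_of_script/myscript.sh#myscript.sh</file>
   </shell>
   <ok to = "next_action" />
   <error to = "kill_job" />
</action>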

Query:
Will OOZIE send an automated e-mail using its own infrastructure, or do we need to write any code?
