1. Why Is a Block in HDFS So Large?
Hadoop is a distributed environment where data is stored in a distributed manner. A large block size matters because MapReduce jobs typically traverse (read) the whole data set (represented by an HDFS file, folder, or set of folders) and apply logic to it. Since we have to spend the full transfer time anyway to get all the data off the disk, we try to minimise the time spent seeking and read in big chunks, hence the large block size.
In more traditional disk-access software, we typically do not read the whole data set every time, so we would rather spend more time doing plenty of seeks on smaller blocks than lose time transferring data we will not need.
The seek time is the time spent before any data can be read; roughly speaking, it is the time needed to move the read head to where the data sits physically on the disk (plus other similar kinds of overhead).
To read 100 MB stored contiguously, we spend
[Seek Time] [Reading]
10 ms + 100 MB/(100 MB/s) = 1.01 s
so a big proportion of that time is spent actually reading the data and only a small part is spent seeking. If the same 100 MB were stored as 10 blocks, that would give
[Seek Time] [Reading]
10 * 10 ms + 100 MB/(100 MB/s) = 1.1 s
so roughly 10% of the time now goes to seeking instead of about 1%, and the overhead keeps growing as the blocks get smaller.
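To make these numbers concrete, here is a tiny back-of-the-envelope sketch (the 10 ms seek and 100 MB/s transfer figures are the same assumptions as above; the class name is purely illustrative). It shows how quickly the seek overhead grows as the blocks shrink:

import java.util.Arrays;

public class SeekVsTransfer {
    public static void main(String[] args) {
        double seekMs = 10.0;        // assumed: one disk seek ~ 10 ms
        double transferMBps = 100.0; // assumed: sustained read ~ 100 MB/s
        double dataMB = 100.0;       // the 100 MB of data from the example

        for (int blocks : new int[] {1, 10, 100}) {
            double seekSec = blocks * seekMs / 1000.0; // one seek per block
            double readSec = dataMB / transferMBps;    // transfer time is fixed
            double total = seekSec + readSec;
            System.out.printf("%4d block(s): %.2f s total, %.0f%% of it seeking%n",
                    blocks, total, 100.0 * seekSec / total);
        }
    }
}

With one 100 MB block only about 1% of the time is spent seeking; with 10 blocks it is roughly 9%, and with 100 small blocks half of the 2 s total goes to seeking, which is exactly why HDFS prefers large blocks.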
2. Data Locality?
Moving computation is cheaper than moving data: rather than shipping large data blocks across the network to where the code runs, Hadoop ships the (much smaller) computation to the nodes where the blocks already live.
3. Retry in Hadoop?
By default, the retry count in Hadoop is 4. The 4 retries apply at the operation level, not at the block level, i.e. per attempt over the complete file operation.
Scenario: if a file has 4 blocks and a failure occurs while reading Block 2, the next retry starts reading again at Block 2; it does not start over from Block 1.
Likewise, if the failure occurred in B2 after part of it had already been read, only B2 is re-read on the retry; the earlier blocks are not read again.
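Assuming the default of 4 refers to MapReduce task attempts (the usual interpretation), the controlling properties are mapreduce.map.maxattempts and mapreduce.reduce.maxattempts, both 4 by default in Hadoop 2.x. A minimal sketch for checking them from client code (the class name is illustrative):

import org.apache.hadoop.conf.Configuration;

public class RetrySettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Maximum attempts per map/reduce task before the whole job is marked failed.
        // The fallback values here (4) match the stock Hadoop 2.x defaults.
        System.out.println("map attempts:    " + conf.getInt("mapreduce.map.maxattempts", 4));
        System.out.println("reduce attempts: " + conf.getInt("mapreduce.reduce.maxattempts", 4));
    }
}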
4. What is InputSplit?
MapReduce data processing is driven by the concept of input splits. The number of input splits calculated for a specific application determines the number of mapper tasks.
Each of these mapper tasks is assigned, where possible, to a slave node where the input split is stored. The ResourceManager (or JobTracker, if you're in Hadoop 1) does its best to ensure that input splits are processed locally. This is the concept of data locality.
An example:
conf.setInputFormat(TextInputFormat.class); Here, by passing TextInputFormat to the setInputFormat function, we are telling Hadoop to read the input file as lines of text, so each line becomes one record passed to the map function. The input splits themselves are still derived from the file's blocks; each split is then read record by record (line by line).
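For context, a complete old-API (org.apache.hadoop.mapred) driver around that call could look roughly like the sketch below; the class name, the argument-based paths and the commented mapper/reducer lines are placeholders rather than code from this post:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class ExampleDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ExampleDriver.class);
        conf.setJobName("text-input-example");

        // Each split (by default one HDFS block) is read line by line,
        // and each line becomes one record handed to map().
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // A real job would also set its mapper/reducer here, e.g.:
        // conf.setMapperClass(MyMapper.class);
        // conf.setReducerClass(MyReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}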
FileInputFormat is the abstract class which defines how input files are read and split up. FileInputFormat provides the following functionality: 1. selects the files/objects that should be used as input; 2. defines the input splits that break a file up into tasks.
As per basic Hadoop functionality, if there are n splits then there will be n mappers.
XML, JSON and similar files often have to be processed sequentially, because a single logical record can span many lines; rather than plain TextInputFormat, such data is typically handled with a format like SequenceFileInputFormat (e.g. after packing the records into sequence files) or a custom input format.
5. HDFS & MapReduce
HDFS -> block size (128 MB by default)
MapReduce reads data via -> input split -> split size
By default,
split size = block size [only by default]
The block concept applies while writing data to HDFS.
The split concept applies while reading data from HDFS in MapReduce.
no. of splits = no. of mappers [always]
no. of splits = no. of blocks [by default, not always]
If the data is 384 MB and the block size is 128 MB, the file is split into 3 blocks; then,
no. of splits = 3 = no. of mappers for the complete operation [by default]
split size = block size = 128 MB
In the above case, 3 mappers of 128 MB each will be executed. Initially, 2 mappers start executing in parallel on 2 different blocks, and the 3rd mapper waits for one of them to complete.
This is because the property mapred.tasktracker.map.tasks.maximum defaults to 2 (map slots per TaskTracker), so only 2 mappers execute in parallel on a single node; if the value is increased to 3, all 3 mappers execute in parallel.
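To see where the 3 splits above come from, here is a simplified sketch of the well-known FileInputFormat split-size formula, splitSize = max(minSize, min(maxSize, blockSize)); the class name and the hard-coded sizes are only illustrative:

public class SplitMath {
    // Simplified version of the standard split-size formula used by FileInputFormat.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long fileSize = 384 * MB;   // the 384 MB file from the example above
        long blockSize = 128 * MB;  // HDFS block size
        long splitSize = computeSplitSize(blockSize, 1, Long.MAX_VALUE);

        long numSplits = (fileSize + splitSize - 1) / splitSize; // ceiling division
        System.out.println("split size = " + (splitSize / MB) + " MB, splits = " + numSplits);
        // Prints: split size = 128 MB, splits = 3  -> 3 map tasks by default
    }
}

If the split min/max size settings (mapreduce.input.fileinputformat.split.minsize / split.maxsize) are changed, the split size no longer equals the block size, which is why "no. of splits = no. of blocks" only holds by default.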
6. Create a HUE User
cd /usr/lib/hue
sudo build/env/bin/hue createsuperuser
Username (leave blank to use 'root'): <enter the super user name>
Email address: <your email id>
Password: <password with one upper case, number, and special character>
Password (again):
Superuser created successfully.