Monday, April 23, 2018

HBase Architecture and inserting data

                                            HBase

Architecture:

Master Slave Architecture

3 Major components:

-> Region Servers - Responsible for serving data to clients. Equivalent to DataNodes in HDFS.
-> Zookeeper - Maintains cluster state. A Zookeeper ensemble is usually configured for this purpose.
-> HMaster - Acts as the master of the cluster. Responsible for:
Assigning regions
Load balancing
Fault tolerance
Health monitoring


Region Servers, which run on DataNode machines, send heartbeats to the Zookeeper nodes. HMaster listens to the heartbeats of the Region Servers through Zookeeper. In case a heartbeat is not received for 3 seconds, HMaster treats the Region Server as down.

Only one HMaster is active at a time; if the active HMaster goes down, an inactive (standby) HMaster becomes active.


  • HBase tables are horizontally divided into regions. Default region size = 1 GB
  • A single Region Server can host multiple regions of the same table or of different tables.
  • Maximum number of regions for a Region Server = 1000
  • Regions of the same table can be on the same Region Server or on different Region Servers.
  • Initially these regions are allocated on the same Region Server; later, for better load balancing, a newly allocated region may be moved to another Region Server.

Writing Process to HBase:

Key components

  1. WAL (Write-Ahead Log) - When a client writes, the record is first written to the WAL. Although this is not where the data is finally stored, it is done for fault tolerance: if any error occurs while writing data, HBase can always replay the WAL.
  2. MemStore - After the WAL, the record is written to the MemStore (called "memcache" in some older write-ups), an in-memory store that holds all new and edited records. Once the MemStore limit is reached, its data is flushed into an HFile and the MemStore becomes empty. Each time the MemStore fills up a new HFile is created, so multiple HFiles get generated over time. A single region can have multiple MemStores (one per column family).
  3. HFile - The actual data is stored in HFiles, which live in HDFS.


In case of multiple MemStores, each MemStore flushes its data into a different HFile, which results in numerous small HFiles.
HDFS handles small files poorly, so minor compaction comes into the picture.

Minor Compaction: merges the small HFiles into one bigger HFile.
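
The flush size and the compaction trigger are configurable in hbase-site.xml. The snippet below is only a sketch; the property names are standard, but the values shown are just common defaults, not a recommendation:

      <!-- hbase-site.xml (illustrative values) -->
      <property>
        <name>hbase.hregion.memstore.flush.size</name>
        <!-- flush a MemStore to an HFile once it reaches ~128 MB -->
        <value>134217728</value>
      </property>
      <property>
        <name>hbase.hstore.compactionThreshold</name>
        <!-- consider a minor compaction once a store has this many HFiles -->
        <value>3</value>
      </property>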

https://www.edureka.co/blog/hbase-architecture/

Data can be written to HBase in the following ways:


1. Inserting data to HBase table via hbase shell


Put command: put '<table name>','<row key>','<column family:column name>','<value>'
Eg:
hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.6600 seconds
hbase(main):006:0> put 'emp','1','personal data:city','hyderabad'
0 row(s) in 0.0410 seconds
hbase(main):007:0> put 'emp','1','professional data:designation','manager'
0 row(s) in 0.0240 seconds
hbase(main):007:0> put 'emp','1','professional data:salary','50000'
0 row(s) in 0.0240 seconds
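
Note: the puts above assume the emp table already exists with the two column families used ('personal data' and 'professional data'). If not, it could have been created first with something along these lines:

hbase(main):001:0> create 'emp','personal data','professional data'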

Read data:

get command: get '<table name>','<row key>'

eg:
hbase(main):012:0> get 'emp', '1'

Read a specific column: get '<table name>','<row key>',{COLUMN => '<column family:column name>'}

eg:
hbase(main):015:0> get 'emp', '1', {COLUMN => 'personal data:name'}

Read complete table data: scan 'emp'
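
scan also accepts optional arguments; for example, to limit the output while exploring a large table (illustrative):

hbase(main):016:0> scan 'emp', {LIMIT => 1}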

Update data: an existing cell value can be updated using the same put command.

put '<table name>','<row key>','<column family:column name>','<new value>'
Eg:
hbase(main):002:0> put 'emp','1','personal data:city','Delhi'

Delete data: a cell value can be removed using the delete command.
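
For example, to delete the city cell written earlier (syntax: delete '<table name>','<row key>','<column family:column name>'):

hbase(main):017:0> delete 'emp','1','personal data:city'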

Drop an HBase table: disable it first and then drop it.

hbase(main):018:0> disable 'emp'
0 row(s) in 1.4580 seconds

hbase(main):019:0> drop 'emp'


2. Java and Spark APIs


Data is inserted through the Java API by building a Put object and passing it to the table's put() method.

Put sample code:

      // Required imports (surrounding class and method omitted for brevity)
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.util.Bytes;

      // Instantiating Configuration class
      Configuration config = HBaseConfiguration.create();

      // Instantiating HTable class for the existing 'emp' table
      HTable hTable = new HTable(config, "emp");

      // Instantiating Put class
      // accepts a row key
      Put p = new Put(Bytes.toBytes("row2"));

      // adding values using the add() method, which
      // accepts column family name, column qualifier (column name) and value
      p.add(Bytes.toBytes("personal"),
      Bytes.toBytes("name"),Bytes.toBytes("raju2"));

      p.add(Bytes.toBytes("personal"),
      Bytes.toBytes("city"),Bytes.toBytes("hyderabad2"));

      p.add(Bytes.toBytes("professional"),Bytes.toBytes("designation"),
      Bytes.toBytes("manager2"));

      p.add(Bytes.toBytes("professional"),Bytes.toBytes("salary"),
      Bytes.toBytes("60000"));

      // Saving the Put instance to the HTable
      hTable.put(p);

      // Releasing the table handle
      hTable.close();
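
Note that HTable and Put.add() belong to the older HBase client API; from HBase 1.x onwards they are deprecated in favour of Connection/Table and Put.addColumn(). A minimal sketch of the newer style, assuming the same 'emp' table:

      // Newer-style client API (HBase 1.x and later), shown as a sketch.
      // Also needs org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Table}
      // and org.apache.hadoop.hbase.TableName on top of the imports above.
      Connection connection = ConnectionFactory.createConnection(config);
      Table table = connection.getTable(TableName.valueOf("emp"));

      Put p2 = new Put(Bytes.toBytes("row3"));
      p2.addColumn(Bytes.toBytes("personal"),
              Bytes.toBytes("name"), Bytes.toBytes("raju3"));

      table.put(p2);
      table.close();
      connection.close();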

bulkPut() is also available (via JavaHBaseContext in the hbase-spark module), which does bulk insertion of data:

      // Build sample records in the form "rowKey,columnFamily,qualifier,value".
      // jsc, tableName and columnFamily are assumed to be defined already
      // (see the setup sketch after this example).
      List<String> list = new ArrayList<String>();
      list.add("1," + columnFamily + ",a,1");
      list.add("2," + columnFamily + ",a,2");
      list.add("3," + columnFamily + ",a,3");
      list.add("4," + columnFamily + ",a,4");
      list.add("5," + columnFamily + ",a,5");

      // Distribute the records as a Spark RDD
      JavaRDD<String> rdd = jsc.parallelize(list);
      Configuration conf = HBaseConfiguration.create();

      // JavaHBaseContext (from the hbase-spark module) bridges Spark and HBase
      JavaHBaseContext hbaseContext = new JavaHBaseContext(jsc, conf);

      // Write all records to the table in bulk
      hbaseContext.bulkPut(rdd,
              TableName.valueOf(tableName),
              new PutFunction());

  // Converts each "rowKey,columnFamily,qualifier,value" string into a Put
  public static class PutFunction implements Function<String, Put> {

    private static final long serialVersionUID = 1L;

    public Put call(String v) throws Exception {
      String[] cells = v.split(",");
      Put put = new Put(Bytes.toBytes(cells[0]));

      put.addColumn(Bytes.toBytes(cells[1]), Bytes.toBytes(cells[2]),
              Bytes.toBytes(cells[3]));
      return put;
    }

  }
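
For completeness, here is a rough sketch of how the surrounding pieces could be set up. The names tableName and columnFamily used above are not fixed by the API, and the values below are purely illustrative:

      // Illustrative setup for the bulkPut example
      // (assumes Spark and the hbase-spark connector are on the classpath)
      SparkConf sparkConf = new SparkConf().setAppName("JavaHBaseBulkPutExample");
      JavaSparkContext jsc = new JavaSparkContext(sparkConf);

      String tableName = "emp";          // target HBase table
      String columnFamily = "personal";  // column family used in the sample records

      // ... build the list, create the JavaHBaseContext and call bulkPut as above ...

      jsc.stop();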

3. Bulk upload

Bulk loads of delimited files can be done from the command line with the ImportTsv tool.
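
A typical invocation looks roughly like the following; the table name, column mapping and HDFS input path are placeholders:

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,personal:name,personal:city \
  emp /user/hadoop/emp.tsv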
