Monday, April 23, 2018

HBase Architecture and inserting data

HBase

Architecture:

Master Slave Architecture

3 Major components:

-> Region Servers - Responsible for serving data to clients. Equivalent to the DataNodes in HDFS.
-> ZooKeeper - Maintains the cluster state. A ZooKeeper ensemble is usually configured to maintain this state.
-> HMaster - Acts as the master of the cluster. It is responsible for:
Assigning regions
Load balancing
Fault tolerance
Health monitoring


Region Servers, which run on DataNode machines, send heartbeats to the ZooKeeper nodes. HMaster listens to the Region Servers' heartbeats through ZooKeeper. If a heartbeat is not received for 3 seconds, HMaster treats that Region Server as down.

Only one HMaster is active at any time; if the active HMaster goes down, a standby (inactive) HMaster becomes active.
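
The cluster state that ZooKeeper and HMaster track can also be inspected from client code through the Admin API. A minimal sketch (assuming an HBase 1.x client with the usual Configuration, Connection and Admin classes from org.apache.hadoop.hbase.client; error handling omitted):

      // Connect using hbase-site.xml found on the classpath
      Configuration config = HBaseConfiguration.create();
      Connection connection = ConnectionFactory.createConnection(config);
      Admin admin = connection.getAdmin();

      // Cluster state: active master plus live and dead region servers
      ClusterStatus status = admin.getClusterStatus();
      System.out.println("Active master: " + status.getMaster());
      for (ServerName rs : status.getServers()) {
          System.out.println("Live region server: " + rs);
      }
      System.out.println("Dead region servers: " + status.getDeadServers());

      admin.close();
      connection.close();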


  • HBase tables are horizontally divided into regions. Default region size = 1 GB.
  • A single Region Server can host multiple regions, of the same table or of different tables.
  • Maximum number of regions per Region Server = 1000.
  • Regions of the same table can live on the same Region Server or on different Region Servers.
  • Initially the regions of a table are allocated on the same Region Server; later, for better load balancing, newly created regions are moved to other Region Servers (see the sketch below for inspecting region placement via the Java API).
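
To see how a table's regions are placed across Region Servers, the RegionLocator API can be used. A small sketch under the same HBase 1.x client assumption (the 'emp' table matches the shell examples later in this post):

      // List each region of the 'emp' table and the region server hosting it
      Configuration config = HBaseConfiguration.create();
      Connection connection = ConnectionFactory.createConnection(config);
      RegionLocator locator = connection.getRegionLocator(TableName.valueOf("emp"));

      for (HRegionLocation location : locator.getAllRegionLocations()) {
          System.out.println(location.getRegionInfo().getRegionNameAsString()
                  + "  ->  " + location.getServerName());
      }

      locator.close();
      connection.close();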

Writing Process to HBase:

Key components

  1. WAL (Write-Ahead Log) - When a client writes, the record is first written to the WAL. Although the WAL is not where the data is finally stored, it exists for fault tolerance: if any error occurs while writing the data, HBase always has the WAL to replay from.
  2. MemStore (called "memcache" in some write-ups) - After the WAL, the record is written to the MemStore, which caches all written and edited records in memory. Once the MemStore limit is reached, its contents are flushed into an HFile and the MemStore is emptied. Each flush creates a new HFile, so over time multiple HFiles are generated. A single region can have multiple MemStores (one per column family).
  3. HFile - The actual data is stored in HFiles, which live in HDFS.


With multiple MemStores, each one flushes its data into a different HFile, which results in numerous small HFiles.
Hadoop handles small files poorly, and this is where minor compaction comes into the picture.

Minor compaction: merges several of the small HFiles into one larger HFile. (A major compaction merges all HFiles of a store into a single file.)
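
A MemStore flush and a compaction can also be requested explicitly through the Admin API, which makes it easy to watch HFiles being created and merged. A minimal sketch (same HBase 1.x client assumption; both calls are asynchronous requests to the region servers):

      // Flush the 'emp' MemStores to new HFiles, then request a compaction
      Configuration config = HBaseConfiguration.create();
      Connection connection = ConnectionFactory.createConnection(config);
      Admin admin = connection.getAdmin();

      admin.flush(TableName.valueOf("emp"));
      admin.compact(TableName.valueOf("emp"));         // minor compaction
      // admin.majorCompact(TableName.valueOf("emp")); // merge all HFiles of the store

      admin.close();
      connection.close();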

https://www.edureka.co/blog/hbase-architecture/

Ways of writing data to HBase:


1. Inserting data into an HBase table via the hbase shell


Put command: put '<table name>','<row key>','<column family:column name>','<value>'
Eg:
hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.6600 seconds
hbase(main):006:0> put 'emp','1','personal data:city','hyderabad'
0 row(s) in 0.0410 seconds
hbase(main):007:0> put 'emp','1','professional data:designation','manager'
0 row(s) in 0.0240 seconds
hbase(main):007:0> put 'emp','1','professional data:salary','50000'
0 row(s) in 0.0240 seconds

Read data:

get command: get '<table name>','<row key>'

eg:
hbase(main):012:0> get 'emp', '1'

Read a specific column: get '<table name>','<row key>',{COLUMN => '<column family:column name>'}

eg:
hbase(main):015:0> get 'emp', '1', {COLUMN => 'personal data:name'}

Read complete table data: scan 'emp'

Update Data: update an existing cell value using the put command

put '<table name>','<row key>','<column family:column name>','<new value>'
Eg:
hbase(main):002:0> put 'emp','1','personal data:city','Delhi'

Delete data using the delete command: delete '<table name>','<row key>','<column family:column name>'

Drop an HBase table: disable it first, then drop it

hbase(main):018:0> disable 'emp'
0 row(s) in 1.4580 seconds

hbase(main):019:0> drop 'emp'


2. Spark and Java APIs


The API class used to insert data into HBase is Put.

Put() sample code (older HTable-based client API; imports come from org.apache.hadoop.conf, org.apache.hadoop.hbase, org.apache.hadoop.hbase.client and org.apache.hadoop.hbase.util):

      // Instantiating Configuration class
      Configuration config = HBaseConfiguration.create();

      // Instantiating HTable class
      HTable hTable = new HTable(config, "emp");

      // Instantiating Put class
      // accepts a row name.
      Put p = new Put(Bytes.toBytes("row2"));

      // adding values using the add() method
      // accepts: column family name, column qualifier (column name), value
      p.add(Bytes.toBytes("personal"),
      Bytes.toBytes("name"),Bytes.toBytes("raju2"));

      p.add(Bytes.toBytes("personal"),
      Bytes.toBytes("city"),Bytes.toBytes("hyderabad2"));

      p.add(Bytes.toBytes("professional"),Bytes.toBytes("designation"),
      Bytes.toBytes("manager2"));

      p.add(Bytes.toBytes("professional"),Bytes.toBytes("salary"),
      Bytes.toBytes("60000"));

      // Saving the put Instance to the HTable.
      hTable.put(p);
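
To verify the insert, the row can be read back with a Get, reusing the hTable instance from the sample above (a short sketch in the same old-style API):

      // Read back the row that was just written
      Get g = new Get(Bytes.toBytes("row2"));
      Result result = hTable.get(g);

      byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
      byte[] salary = result.getValue(Bytes.toBytes("professional"), Bytes.toBytes("salary"));
      System.out.println("name: " + Bytes.toString(name)
              + ", salary: " + Bytes.toString(salary));

      // Release the table handle once done
      hTable.close();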

A bulkPut() is also available (via the hbase-spark module's JavaHBaseContext), which does bulk insertion of data. The snippet below assumes an existing JavaSparkContext jsc and String variables tableName and columnFamily:

      List<String> list= new ArrayList<String>();
      list.add("1," + columnFamily + ",a,1");
      list.add("2," + columnFamily + ",a,2");
      list.add("3," + columnFamily + ",a,3");
      list.add("4," + columnFamily + ",a,4");
      list.add("5," + columnFamily + ",a,5");

      JavaRDD<String> rdd = jsc.parallelize(list);
      Configuration conf = HBaseConfiguration.create();

      JavaHBaseContext hbaseContext = new JavaHBaseContext(jsc, conf);

      hbaseContext.bulkPut(rdd,
              TableName.valueOf(tableName),
              new PutFunction());

  public static class PutFunction implements Function<String, Put> {

    private static final long serialVersionUID = 1L;

    public Put call(String v) throws Exception {
      String[] cells = v.split(",");
      Put put = new Put(Bytes.toBytes(cells[0]));

      put.addColumn(Bytes.toBytes(cells[1]), Bytes.toBytes(cells[2]),
              Bytes.toBytes(cells[3]));
      return put;
    }

  }
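
Here bulkPut() applies PutFunction to every element of the RDD, converting each comma-separated string into a Put, and the resulting writes are issued in parallel from the Spark executors rather than from the driver.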

3. Bulk upload

Using the ImportTsv command-line tool (org.apache.hadoop.hbase.mapreduce.ImportTsv), which loads tab-separated files from HDFS into an HBase table. The general form (column names and paths are placeholders):
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,<colfamily:colname>,... '<table name>' '<hdfs input directory>'