HBase
Architecture: Master-Slave architecture
3 Major components:
-> Region Servers - Responsible for serving data to clients. Equivalent to DataNodes in HDFS.
-> ZooKeeper - Maintains cluster state. A ZooKeeper ensemble is usually configured for this purpose.
-> HMaster - Acts as the master of the cluster. Responsibilities:
   - Assigning regions
   - Load balancing
   - Fault tolerance
   - Health monitoring
Region Servers, which run on the DataNode machines, send heartbeats to the ZooKeeper nodes. HMaster listens for Region Server heartbeats via ZooKeeper. If a heartbeat is not received within the configured session timeout, HMaster treats that Region Server as down.
Only one HMaster is active at a time; if the active HMaster goes down, a standby (inactive) HMaster becomes the active one.
- HBase tables are horizontally divided into regions. Default region size = 1 GB.
- A single Region Server can host multiple regions, of the same table or of different tables.
- Max number of regions per Region Server = 1000.
- Regions of the same table can be on the same Region Server or on different Region Servers.
- Initially a table's regions are allocated on the same Region Server; later, for better load balancing, newly allocated regions are moved to other Region Servers. (Regions can also be created up front; see the shell sketch below.)
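As an illustration of regions, the hbase shell's create command accepts a SPLITS option that pre-splits a table at creation time, with each split key becoming a region boundary. A minimal sketch (reusing the 'emp' table and column families from the examples below; the split keys are arbitrary):
create 'emp', 'personal data', 'professional data', SPLITS => ['g', 'n', 't']
This creates the table with four regions instead of one; without SPLITS, a table starts as a single region and is split automatically as it grows past the region size limit.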
Writing process in HBase:
Key components:
- WAL (Write-Ahead Log) - When a client writes, the record is first appended to the WAL. The WAL is not where the data is ultimately stored; it is kept for fault tolerance, so that if an error occurs while writing data, HBase always has the WAL to replay from.
- MemStore - From the WAL, the record is written to the MemStore, which caches all written and edited records in memory. Once the MemStore limit is reached, all of its data is flushed to a new HFile and the MemStore becomes empty. Every flush creates another HFile, and in this way multiple HFiles get generated. A single region can have multiple MemStores (one per column family).
- HFile - The actual data is stored in HFiles, which live in HDFS.
Hadoop handles many small files poorly, which is where minor compaction comes into the picture.
Minor compaction: merges several small HFiles into one big HFile.
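Compactions run automatically in the background, but they can also be triggered manually from the hbase shell. A small illustration (using the 'emp' table from the examples below):
compact 'emp'          # request a minor compaction
major_compact 'emp'    # request a major compaction (rewrites all HFiles of each region into one)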
Reference: https://www.edureka.co/blog/hbase-architecture/
Writing data to HBase:
1. Inserting data into an HBase table via the hbase shell
Put command: put '<table name>','<row key>','<column family:column name>','<value>'
Eg:
hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.6600 seconds
hbase(main):006:0> put 'emp','1','personal data:city','hyderabad'
0 row(s) in 0.0410 seconds
hbase(main):007:0> put 'emp','1','professional data:designation','manager'
0 row(s) in 0.0240 seconds
hbase(main):007:0> put 'emp','1','professional data:salary','50000'
0 row(s) in 0.0240 seconds
Read data:
get command: get '<table name>','<row key>'
eg:
hbase(main):012:0> get 'emp', '1'
Read a specific column: get '<table name>','<row key>', {COLUMN => '<column family:column name>'}
eg:
hbase(main):015:0> get 'emp', '1', {COLUMN => 'personal data:name'}
Read complete table data: scan 'emp'
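scan also accepts options; for example, to restrict the scan to one column and the first two rows (COLUMNS and LIMIT shown here as an illustration):
scan 'emp', {COLUMNS => ['personal data:name'], LIMIT => 2}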
Update data: update an existing cell value using the put command:
put '<table name>','<row key>','<column family:column name>','<new value>'
Eg:
hbase(main):002:0> put 'emp','1','personal data:city','Delhi'
Delete data: delete a cell value using the delete command:
delete '<table name>','<row key>','<column family:column name>'
eg: delete 'emp','1','personal data:city'
Drop an HBase table: disable it, then drop it
hbase(main):018:0> disable 'emp'
0 row(s) in 1.4580 seconds
hbase(main):019:0> drop 'emp'
2. Spark and Java APIs
The Java API class used to insert data into HBase is Put.
Put sample code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Instantiate the Configuration class
Configuration config = HBaseConfiguration.create();

// Instantiate the HTable class for the 'emp' table
HTable hTable = new HTable(config, "emp");

// Instantiate the Put class; the constructor takes the row key
Put p = new Put(Bytes.toBytes("row2"));

// Add cells using the add() method; it takes the column family name,
// the qualifier (column name), and the value. The column families must
// match those of the 'emp' table created in the shell examples above.
p.add(Bytes.toBytes("personal data"), Bytes.toBytes("name"), Bytes.toBytes("raju2"));
p.add(Bytes.toBytes("personal data"), Bytes.toBytes("city"), Bytes.toBytes("hyderabad2"));
p.add(Bytes.toBytes("professional data"), Bytes.toBytes("designation"), Bytes.toBytes("manager2"));
p.add(Bytes.toBytes("professional data"), Bytes.toBytes("salary"), Bytes.toBytes("60000"));

// Save the Put instance to the HTable and release resources
hTable.put(p);
hTable.close();
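Note: HTable and Put.add() come from the older (pre-1.0) HBase client API and are deprecated in current releases. A minimal sketch of the same write using the newer Connection/Table API (same 'emp' table and cell as above; this sketch is not from the original notes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

Configuration config = HBaseConfiguration.create();
// Connections are heavyweight: create once, share, close when done
try (Connection connection = ConnectionFactory.createConnection(config);
     Table table = connection.getTable(TableName.valueOf("emp"))) {
    Put p = new Put(Bytes.toBytes("row2"));
    // addColumn() replaces the deprecated add(): family, qualifier, value
    p.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("name"), Bytes.toBytes("raju2"));
    table.put(p);
}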
A bulkPut() method is also available (on the hbase-spark connector's JavaHBaseContext), which does bulk insertion of data:
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.spark.JavaHBaseContext;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// jsc is an existing JavaSparkContext; tableName and columnFamily
// hold the target table and column family names.
// Each input string encodes "rowKey,columnFamily,qualifier,value".
List<String> list = new ArrayList<String>();
list.add("1," + columnFamily + ",a,1");
list.add("2," + columnFamily + ",a,2");
list.add("3," + columnFamily + ",a,3");
list.add("4," + columnFamily + ",a,4");
list.add("5," + columnFamily + ",a,5");
JavaRDD<String> rdd = jsc.parallelize(list);

Configuration conf = HBaseConfiguration.create();
JavaHBaseContext hbaseContext = new JavaHBaseContext(jsc, conf);

// bulkPut applies PutFunction to every element of the RDD
// and writes the resulting Puts to the table
hbaseContext.bulkPut(rdd,
    TableName.valueOf(tableName),
    new PutFunction());

public static class PutFunction implements Function<String, Put> {
  private static final long serialVersionUID = 1L;

  public Put call(String v) throws Exception {
    // split "rowKey,columnFamily,qualifier,value"
    String[] cells = v.split(",");
    Put put = new Put(Bytes.toBytes(cells[0]));
    put.addColumn(Bytes.toBytes(cells[1]), Bytes.toBytes(cells[2]),
        Bytes.toBytes(cells[3]));
    return put;
  }
}
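PutFunction is written as a separate static class implementing Spark's Function interface because Spark serializes it and ships it to the executors; that is also why it carries a serialVersionUID.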
3. Bulk upload
Using the ImportTsv command-line tool (org.apache.hadoop.hbase.mapreduce.ImportTsv).
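A typical invocation looks like the following sketch; the table name, column mapping, and HDFS paths are placeholders, not from the original notes:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
  <table name> <hdfs path to .tsv file>
Adding -Dimporttsv.bulk.output=<hdfs output path> makes ImportTsv write HFiles instead of issuing Puts; the generated HFiles can then be loaded with the completebulkload tool.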