Tuesday, May 23, 2017

Difference between Hive and Impala


Hive is batch-oriented, MapReduce-based processing, whereas Impala (created by Cloudera) is more like an MPP database (for example, Vertica).
If you run a query on Hive there is a job-startup overhead because it runs as MapReduce, but there is no such overhead in Impala.
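As a rough illustration of that overhead, consider sending the same aggregation to both engines (the table name web_logs below is just a made-up example). In Hive the statement is compiled into one or more MapReduce jobs, so a fixed job-submission cost is paid before any rows are processed; in Impala it runs straight away on the always-on daemons.

-- Hypothetical table; the statement itself is valid in both Hive and Impala.
-- Submitted through the hive CLI it becomes MapReduce jobs (startup delay);
-- submitted through impala-shell it executes immediately on the impalad daemons.
SELECT referrer, COUNT(*) AS hits
FROM web_logs
GROUP BY referrer
ORDER BY hits DESC
LIMIT 10;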
Fault tolerance
Hive is fault tolerant whereas Impala is not. For example, if you run a query in Hive as MapReduce and one of your datanodes goes down while the query is running, the output will still be produced because the failed tasks are retried. It is not the same with Impala: if the query fails, you have to start it all over again.
Memory based
Impala is memory-bound, which makes it many times faster than disk-based MapReduce. Heavy operations such as GROUP BY aggregations and conversions are done in memory.
Note: in newer versions of Impala there are settings that let a query spill to disk if it overflows the available memory.
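For example, a per-query memory cap can be set from an impala-shell session with the MEM_LIMIT query option; the table name sales and the 2gb value below are only illustrative, and the exact spill-to-disk behaviour depends on the Impala version and its scratch-directory configuration.

-- Cap the memory this session's queries may use per node (illustrative value).
SET MEM_LIMIT=2gb;
-- A memory-heavy aggregation: with the cap in place, newer Impala versions can
-- spill intermediate GROUP BY state to local scratch disks instead of failing.
SELECT customer_id, SUM(amount) AS total_spend
FROM sales
GROUP BY customer_id;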
While reading data Impala works across multiple nodes, but while writing data it uses a single virtual disk.

Hive is not meant for interactive computing; Impala is meant for that.

You can see Impala as "SQL on HDFS", while Hive is more "SQL on Hadoop".
In other words, Impala doesn't use MapReduce at all. It simply has daemons running on all your nodes which cache some of the data that is in HDFS, so that these daemons can return results quickly without having to go through a whole MapReduce job.
The reason for this is that there is a certain overhead involved in running a MapReduce job, so by short-circuiting MapReduce altogether you can get some pretty big gains in runtime.
That being said, Impala does not replace Hive; it is good for very different use cases. Impala doesn't provide the fault tolerance that Hive does, so if there is a problem during your query the result is gone. For ETL-type jobs, where the failure of one job would be costly, I would definitely recommend Hive, but Impala can be awesome for small ad-hoc queries, for example for data scientists or business analysts who just want to look at and analyze some data without building robust jobs. Also, from my personal experience, Impala is still not very mature, and I have seen occasional crashes when the amount of data is larger than the available memory.

Impala cannot always access all of the Hive tables.

Run the query 'invalidate metadata' in Impala and your tables will show up.
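For example, after creating or changing a table in Hive (the database and table names below are placeholders):

-- Reload metadata for all tables known to the Hive metastore
-- (simple, but expensive on large deployments).
INVALIDATE METADATA;
-- Or target only the table that was created or altered in Hive.
INVALIDATE METADATA db_name.new_table;
-- If only new data files were added to an existing table,
-- a lighter-weight REFRESH is enough.
REFRESH db_name.new_table;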
