Tuesday, June 6, 2017

Replicated, Skewed and Merge Joins in Pig


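Replicated Join - Works like the distributed cache: the smaller table is loaded into memory on every mapper and the join is performed map-side, so no reduce phase is needed. The smaller table has to be small enough to fit in memory.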
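In Pig this join is requested with USING 'replicated' on the JOIN statement, with the large relation listed first and the small one(s) after it. To show what happens on the map side, here is a minimal Python sketch of the idea, not Pig's actual implementation. The function name replicated_join and the sample relations orders and countries are made up for illustration.

```python
from collections import defaultdict

def replicated_join(big_rows, small_rows, big_key, small_key):
    """Map-side hash join: index the small relation in memory, stream the big one."""
    # Build an in-memory lookup table from the small relation (the "replicated" side).
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[small_key]].append(row)
    # Stream the big relation and probe the lookup table; no shuffle or reduce step.
    for row in big_rows:
        for match in lookup.get(row[big_key], []):
            yield row + match

# Hypothetical sample data: (order_id, country_code) and (country_code, name).
orders = [(1, "US"), (2, "IN"), (3, "US")]
countries = [("US", "United States"), ("IN", "India")]

joined = list(replicated_join(orders, countries, big_key=1, small_key=0))
for row in joined:
    print(row)
```

Because only the small relation is held in memory, the cost of the join is one pass over the big relation.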

Skewed Join - Used when the data is unbalanced across join keys, that is, when one reducer receives far more records than the others.

Eg: There may be many more records for USA than for India. The data is unbalanced, and the reducer processing the USA records takes much longer than the rest.

Skewed Join in Pig handles this scenario by sampling the input beforehand to identify the keys that have skewed data. The records for those keys are then automatically split across multiple reducers.
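In Pig this join is requested with USING 'skewed' on the JOIN statement. The Python sketch below illustrates the two steps described above (sample to find hot keys, then spread those keys over several reducers). It is a simplified model under my own assumptions, not Pig's partitioner: the names find_skewed_keys and assign_reducers, the threshold, and the fanout parameter are all hypothetical. A real skewed join also has to send the matching rows of the other relation to every reducer that holds a piece of the hot key, which is not modelled here.

```python
import zlib
from collections import Counter

def find_skewed_keys(sample, threshold):
    """Return the keys that occur more than `threshold` times in the sample."""
    counts = Counter(sample)
    return {key for key, n in counts.items() if n > threshold}

def assign_reducers(keys, skewed, num_reducers, fanout):
    """Return (key, reducer) pairs; skewed keys rotate over `fanout` reducers."""
    seen = Counter()
    assignment = []
    for key in keys:
        # crc32 gives a hash that is stable across runs, unlike Python's str hash.
        base = zlib.crc32(key.encode("utf-8")) % num_reducers
        if key in skewed:
            # Round-robin the hot key over `fanout` neighbouring reducers.
            offset = seen[key] % fanout
            seen[key] += 1
            assignment.append((key, (base + offset) % num_reducers))
        else:
            assignment.append((key, base))
    return assignment

# Hypothetical input: USA is the hot key.
keys = ["USA"] * 8 + ["India"] * 2
skewed = find_skewed_keys(keys, threshold=4)
assignment = assign_reducers(keys, skewed, num_reducers=4, fanout=3)

usa_reducers = {r for k, r in assignment if k == "USA"}
india_reducers = {r for k, r in assignment if k == "India"}
print("skewed keys:", skewed)
print("USA spread over", len(usa_reducers), "reducers")
print("India handled by", len(india_reducers), "reducer")
```

With a plain hash partitioner all eight USA records would land on one reducer; here they are shared by three.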

Merge Join - If both inputs are already sorted on the join key, this join can be used to skip the sort phase. Sorting is an expensive operation, so avoiding it saves a lot of time.
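In Pig this join is requested with USING 'merge' on the JOIN statement. The Python sketch below shows the general sort-merge idea on two lists that are already sorted by key: both are walked once in step, so no sorting is needed. It is an illustration only and does not reproduce how Pig implements the join internally. The function name merge_join and the sample data are made up.

```python
def merge_join(left, right):
    """Join two lists of (key, value) tuples that are already sorted by key."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left key is behind, advance left
        elif lk > rk:
            j += 1          # right key is behind, advance right
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for r in right[j:j_end]:
                    out.append((lk, left[i][1], r[1]))
                i += 1
            j = j_end
    return out

# Hypothetical pre-sorted inputs.
left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]

merged = merge_join(left, right)
for row in merged:
    print(row)
```

If either input is not actually sorted on the join key, this approach silently misses matches, which is why the pre-sorted requirement matters.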
