Tuesday, May 23, 2017

Difference between ORC and Parquet

Both of these formats support Snappy compression. ORC has some advantages over Parquet, notably ACID transaction support and built-in indexes (described below).

Difference between Row oriented and Column Oriented Formats:


The main difference I can describe relates to record-oriented vs. column-oriented formats. Record-oriented formats are what we're all used to: text files and delimited formats like CSV and TSV. AVRO is slightly cooler than those because it can change schema over time, e.g. adding or removing columns from a record. Other tricks of various formats (especially including compression) involve whether a format can be split -- that is, can you read a block of records from anywhere in the dataset and still know its schema? But here's more detail on columnar formats like Parquet.
Parquet, and other columnar formats handle a common Hadoop situation very efficiently. It is common to have tables (datasets) having many more columns than you would expect in a well-designed relational database -- a hundred or two hundred columns is not unusual. This is so because we often use Hadoop as a place to denormalize data from relational formats -- yes, you get lots of repeated values and many tables all flattened into a single one. But it becomes much easier to query since all the joins are worked out. There are other advantages such as retaining state-in-time data. So anyway it's common to have a boatload of columns in a table.

Let's say there are 132 columns, and some of them are really long text fields, each column laid out one after the other, using up maybe 10K per record.
While querying these tables is easy from a SQL standpoint, it's common that you'll want to get some range of records based on only a few of those hundred-plus columns. For example, you might want all of the records in February and March for customers with sales > $500.
To do this in a row format the query would need to scan every record of the dataset. Read the first row, parse the record into fields (columns) and get the date and sales columns, include it in your result if it satisfies the condition. Repeat. If you have 10 years (120 months) of history, you're reading every single record just to find 2 of those months. Of course this is a great opportunity to use a partition on year and month, but even so, you're reading and parsing 10K of each record/row for those two months just to find whether the customer's sales are > $500.
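
Here's a minimal sketch of that query in HiveQL (the sales table, its partition columns, and the sales_amount column are made-up stand-ins for the 132-column table described above):

-- Hypothetical wide table, partitioned by year and month; stored as a
-- row-oriented format (TEXTFILE by default).
CREATE TABLE sales (
  customer_id  BIGINT,
  sales_amount DECIMAL(10,2),
  notes        STRING          -- stand-in for the long text columns
  -- ...plus the rest of the hundred-odd denormalized columns
)
PARTITIONED BY (year INT, month INT);

-- Partition pruning limits the scan to February and March, but in a
-- row-oriented format every byte of every record in those two partitions
-- still has to be read and parsed just to evaluate the sales filter.
SELECT *
FROM sales
WHERE year = 2017
  AND month IN (2, 3)
  AND sales_amount > 500;
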
In a columnar format, each column (field) of a record is stored with others of its kind, spread all over many different blocks on the disk -- columns for year together, columns for month together, columns for customer employee handbook (or other long text), and all the others that make those records so huge all in their own separate place on the disk, and of course columns for sales together. Well heck, dates and months are numbers, and so are sales -- they are just a few bytes. Wouldn't it be great if we only had to read a few bytes for each record to determine which records matched our query? Columnar storage to the rescue!
Even without partitions, scanning the small fields needed to satisfy our query is super-fast -- they are all in order by record, and all the same size, so the disk seeks over much less data checking for included records. No need to read through that employee handbook and other long text fields -- just ignore them. So, by grouping columns with each other, instead of rows, you can almost always scan less data. Win!
But wait, it gets better. If your query only needed to know those values and a few more (let's say 10 of the 132 columns) and didn't care about that employee handbook column, once it had picked the right records to return, it would now only have to go back to the 10 columns it needed to render the results, ignoring the other 122 of the 132 in our dataset. Again, we skip a lot of reading.
(Note: for this reason, columnar formats are a lousy choice when doing straight transformations. For example, if you're joining all of two tables into one big(ger) result set that you're saving as a new table, the sources are going to get scanned completely anyway, so there's not a lot of benefit in read performance; and because columnar formats need to remember more about where things are, they use more memory than a similar row format.)
One more benefit of columnar: data is spread around. To get a single record, you can have 132 workers each read (and write) data from/to 132 different places on 132 blocks of data. Yay for parallelization!
And now for the clincher: compression algorithms work much better when they can find repeating patterns. You could compress AABBBBBBCCCCCCCCCCCCCCCC as 2A6B16C, but ABCABCBCBCBCCCCCCCCCCCCCC wouldn't get as small (well, actually, in this case it would, but trust me :-) ). So once again, less reading. And writing too.
So we read a lot less data to answer common queries, it's potentially faster to read and write in parallel, and compression tends to work much better.
Columnar is great when your input side is large, and your output is a filtered subset: from big to little is great. Not as beneficial when the input and outputs are about the same.

Columnar formats are best for big data use cases because the majority of analytical queries rely on aggregation-style analysis. An aggregation applied to a particular set of columns is many times faster than the same aggregation applied to row-based storage.

Typically, a warehouse table has 50+ columns because the data is stored in a denormalized form. The more columns a table has, the more advantageous columnar storage becomes.

Big data workloads are dominated by aggregate operations; applying MIN, MAX, SUM, or any other aggregation to a column is faster in a columnar format because the engine operates directly on that column.
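
As an illustrative HiveQL sketch (reusing the hypothetical sales table from above), an aggregation like the one below only needs to touch the sales_amount column when the table is stored in a columnar format:

-- Only the sales_amount column (plus partition metadata) has to be read
-- when the table is stored as ORC or Parquet; the wide text columns are skipped.
SELECT year, month,
       MIN(sales_amount) AS min_sale,
       MAX(sales_amount) AS max_sale,
       SUM(sales_amount) AS total_sales
FROM sales
GROUP BY year, month;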

Use-case comparison between ORC and Parquet


ORC: two use cases where ORC is a good fit.

1. ACID transactions:
ACID transactions in Hive are only possible when using ORC as the file format. Although ORC supports ACID transactions, they are not designed to support OLTP requirements.
ORC supports streaming ingest into Hive tables: streaming applications like Flume or Storm can write data into Hive, have transactions commit (say, once a minute), and queries will see either all of a transaction or none of it.
Not supporting OLTP requirements means that when a record is deleted or updated, the change is not immediately reflected to applications reading the data; it becomes visible only once the transaction commits.

HDFS is a write-once file system and ORC is a write-once file format, so edits are implemented using base files and delta files, in which insert, update, and delete operations are recorded.
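
A minimal sketch of an ACID table in Hive (the table and column names are made up; Hive 1.x/2.x also require the session settings shown below, bucketing, and compactor threads enabled on the metastore side):

SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

CREATE TABLE orders_acid (
  order_id  INT,
  clientid  INT,
  amount    DECIMAL(10,2)
)
CLUSTERED BY (order_id) INTO 4 BUCKETS
STORED AS ORC                              -- ACID requires ORC
TBLPROPERTIES ('transactional'='true');

-- Updates and deletes are written as delta files alongside the base files
-- and merged at read time (and compacted in the background).
UPDATE orders_acid SET amount = 99.99 WHERE order_id = 1;
DELETE FROM orders_acid WHERE order_id = 2;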

2. Ultra-fast access via indexing:
Indexing in ORC: ORC provides three levels of indexes within each file:

File level - the topmost indexing level: statistics about the values in each column across the entire file. In simple terms, this holds the list of all the columns in the file plus statistics of the values in each column.

Stripe level - the second level: a file is made up of multiple stripes, and this level holds statistics about the values in each column for each stripe.

Row level - statistics about the values in each column for each set of 10,000 rows within a stripe.

The file and stripe level column statistics are in the file footer so that they are easy to access to determine if the rest of the file needs to be read at all. Row level indexes include both the column statistics for each row group and the position for seeking to the start of the row group.


  • Hive performs best with ORC.
  • Hortonworks promotes ORC.




Parquet:

  • Parquet supports schema evolution.
  • Parquet + Snappy is splittable (see the sketch below).
  • Cloudera promotes Parquet.
  • Spark performs best with Parquet.
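
A minimal sketch of a Snappy-compressed Parquet table (the table and column names are illustrative; on some Hive versions the codec is instead chosen per session with SET parquet.compression=SNAPPY before the INSERT):

-- Parquet compresses individual column chunks/pages, so the resulting
-- files remain splittable even with Snappy.
CREATE TABLE ParquetExampleTable (
  clientid INT,
  name     STRING,
  address  STRING,
  age      INT
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");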

Creating a customized ORC table:

CREATE [EXTERNAL] TABLE OrcExampleTable
(clientid int, name string, address string, age int)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"="ZLIB",                 -- compression codec (ZLIB, SNAPPY, or NONE)
  "orc.compress.size"="262144",          -- 256 KB compression chunk size
  "orc.create.index"="true",             -- build the row-group indexes
  "orc.stripe.size"="268435456",         -- 256 MB stripes
  "orc.row.index.stride"="3000",         -- index entry every 3,000 rows
  "orc.bloom.filter.columns"="clientid,age,name");

AVRO is a row-oriented file format.

If all the fields are accessed frequently, then AVRO is the best choice.

If you are working with condition-based / subset-based data operations (filtering on, or reading, only a few columns), then Parquet and ORC are better; these are columnar formats.

Features of File Formats 

Schema Evolution: a simple example of adding and removing columns (using a Parquet table).

create table emp_par(id int, name String, location String)
row format delimited
stored as parquet;                                 -- initial table
insert into emp_par values(1,"Leela","India");

alter table emp_par ADD COLUMNS(state String);     -- added a new column

insert into emp_par values(2,"Kar","India","Andhra");
ALTER TABLE emp_par REPLACE COLUMNS(id int, name String, state String);  -- removed the location column
insert into emp_par values(3,"Saarika","Virginia");
ALTER TABLE emp_par ADD COLUMNS(location String);  -- added back the removed column

hive> select * from emp_par;   -- note the backward and forward compatibility
OK
emp_par.id    emp_par.name    emp_par.state    emp_par.location
1             Leela           NULL             India
2             Kar             Andhra           India
3             Saarika         Virginia         NULL
Time taken: 0.106 seconds, Fetched: 3 row(s)

All of the above schema-evolution operations can be performed with AVRO as well.

AVRO has a special table property called avro.schema.url. This property is available only for AVRO, not for Parquet or ORC.

A simple example is:

CREATE EXTERNAL TABLE user_profile1
STORED AS avro
TBLPROPERTIES('avro.schema.url'='/user/Leela/USER_PROFILE.avsc');



The contents of the AVSC file are below:



{
    "type" : "record",
    "name" : "user",
    "namespace" : "com.getindata.avro",
    "fields" : [ {"name" : "name", "type" : "string"}, 
                 {"name" : "birthday", "type" : "string", "aliases" : ["bday"]}, 
                 {"name" : "country", "type" : "string", "default" : "NULL"}
               ]
}


All of the columns can be specified in this file, so they need not be specified in the table-creation command. The schema file path has to be in HDFS.

Snappy vs GZIP

Snappy:
  • Storage Space: High (lower compression ratio, so files stay larger)
  • CPU Usage: Low
  • Splittable: Not by default, but yes when used inside a container format such as Parquet or ORC (see the sketch below)
GZIP:
  • Storage Space: Medium (better compression ratio)
  • CPU Usage: Medium
  • Splittable: No (a plain .gz file cannot be split)
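
For plain row-oriented output (e.g. text files), the codec is usually chosen with job/session settings rather than table properties. A hedged sketch, using the standard Hadoop codec classes:

-- Compress query output written as text; swap GzipCodec for SnappyCodec as needed.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
-- For Snappy inside a container format, prefer the table-level properties shown
-- earlier (orc.compress / parquet.compression), which keep the files splittable.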
