Wednesday, July 12, 2017

Redshift part 2

To see the tables in the database:

select distinct(tablename) from pg_table_def where schemaname = 'public';

To see the load errors, query the stl_load_errors system table:

select query, substring(filename,22,25) as filename,line_number as line,
substring(colname,0,12) as column, type, position as pos, substring(raw_line,0,30) as line_text,
substring(raw_field_value,0,15) as field_text,
substring(err_reason,0,45) as reason
from stl_load_errors
order by query desc
limit 10;

To delete a table, use:

drop table dwdate;

or

drop table dwdate cascade;    -- the cascade keyword also drops objects that depend on this table, such as views

Points to remember before copying data from a source:

1. Data source

You can use the COPY command to load data from an Amazon S3 bucket, an Amazon EMR cluster, a remote host using an SSH connection, or an Amazon DynamoDB table. For this tutorial, you will load from data files in an Amazon S3 bucket. When loading from Amazon S3, you must provide the name of the bucket and the location of the data files, by providing either an object path for the data files or the location of a manifest file that explicitly lists each data file and its location.

Key prefix
An object stored in Amazon S3 is uniquely identified by an object key, which includes the bucket name, folder names, if any, and the object name. A key prefix refers to a set of objects with the same prefix. The object path is a key prefix that the COPY command uses to load all objects that share the key prefix. For example, the key prefix custdata.txt can refer to a single file or to a set of files, including custdata.txt.001, custdata.txt.002, and so on.
Manifest file
If you need to load files with different prefixes, for example, from multiple buckets or folders, or if you need to exclude files that share a prefix, you can use a manifest file. A manifest file explicitly lists each load file and its unique object key. A manifest file is used to load the customer table later in this tutorial; a sketch of one is shown after this list.

2. To improve query performance, set a distkey and sortkey on each table, especially on columns used in join operations (see the table definition sketch after this list).

3. The 'diststyle all' option distributes a full copy of the table to every node, which suits small dimension tables (also shown in the sketch after this list).
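
As a rough illustration of point 1, a manifest is a small JSON file that lists each object to load; the bucket and file names below are made up:

{"entries": [
  {"url": "s3://my-first-bucket/custdata.txt.001", "mandatory": true},
  {"url": "s3://my-other-bucket/2017/custdata.txt.002", "mandatory": true}
]}

The COPY command then points at the manifest object and includes the manifest keyword, as in the customer load further down.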
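
A minimal sketch of how points 2 and 3 look in a table definition; the table and column names here are hypothetical:

create table orders_fact (
order_id integer not null,
cust_id integer not null,
order_date date
)
distkey (cust_id)      -- rows with the same cust_id are stored on the same slice, which helps joins on cust_id
sortkey (order_date);  -- sorting on order_date speeds up range filters on that column

create table region_dim (
region_id integer not null,
region_name varchar(25)
)
diststyle all;         -- a full copy of this small dimension table is kept on every node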

Copying CSV data:

copy part from 's3://sample-redshift-data/Sample1/part-csv.tbl'
credentials 'aws_access_key_id=AKIAJR65X2FNTIQPLZVQ;aws_secret_access_key=UACFV4mzBdssys9eJDRdGOSUPO4l6nbMb9fH8/Px'
csv
null as '\000';

Note: when null as '\000' is specified, the table column that receives the NULL value must be configured as nullable; that is, it must not include the NOT NULL constraint in the CREATE TABLE definition.
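
For example, in a simplified table definition (the table and columns here are hypothetical), only the nullable column could receive that NULL value:

create table part_demo (
p_partkey integer not null,   -- NOT NULL: the load fails if NULL AS produces a NULL for this column
p_comment varchar(25)         -- nullable: can receive the NULL value
);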

Copying data to dwdate table:

copy dwdate from 's3://sample-redshift-data/Sample1/dwdate-tab.tbl'
credentials 'aws_access_key_id=AKIAJR65X2FNTIQPLZVQ;aws_secret_access_key=UACFV4mzBdssys9eJDRdGOSUPO4l6nbMb9fH8/Px'
gzip
delimiter '\t'
dateformat 'auto';

Note: You can specify only one date format. If the load data contains inconsistent formats, possibly in different columns, or if the format is not known at load time, you can use DATEFORMAT with the 'auto' argument. When 'auto' is specified, COPY recognizes any valid date or time format and converts it to the default format. The 'auto' option also recognizes several formats that are not supported when using a DATEFORMAT or TIMEFORMAT string.
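
If the dates are known to be in one consistent format, an explicit format string can be given instead of 'auto'; the format string below and the credential placeholders are only illustrative:

copy dwdate from 's3://sample-redshift-data/Sample1/dwdate-tab.tbl'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
gzip
delimiter '\t'
dateformat 'YYYY-MM-DD';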

Copying data to customer table:

copy customer from 's3://sample-redshift-data/Sample1/customer-fw-manifest'
credentials 'aws_access_key_id=AKIAJR65X2FNTIQPLZVQ;aws_secret_access_key=UACFV4mzBdssys9eJDRdGOSUPO4l6nbMb9fH8/Px'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region :12, c_phone:15,c_mktsegment:10'
maxerror 10
acceptinvchars as '^'
manifest;

Note: The ACCEPTINVCHARS option is usually a better choice for managing invalid characters. ACCEPTINVCHARS instructs COPY to replace each invalid character with a specified valid character and continue with the load operation.
For this step, ACCEPTINVCHARS is added with the replacement character '^'.
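
Rows in which COPY replaced characters can be reviewed afterwards in the stl_replacements system table; a simple sketch:

select * from stl_replacements
order by query desc
limit 10;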

Copying data to supplier table:

copy supplier from 's3://awssampledbuswest2/ssbgz/supplier.tbl'
credentials 'aws_access_key_id=AKIAJR65X2FNTIQPLZVQ;aws_secret_access_key=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'
delimiter '|'
gzip
region 'us-west-2';

Note: the gzip keyword specifies that the source files in S3 are compressed in gzip format. The region option tells COPY which AWS Region the bucket is in; it is required when the bucket is not in the same Region as the cluster.
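
One way to confirm which files a load committed is to query the stl_load_commits system table; a rough sketch, where the filename filter is only an example:

select query, trim(filename) as filename, curtime, status
from stl_load_commits
where filename like '%supplier%'
order by query desc;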

Copying data to lineorder table:

copy lineorder from 's3://awssampledb/load/lo/lineorder-multi.tbl'
credentials 'aws_access_key_id=AKIAJR65X2FNTIQPLZVQ;aws_secret_access_key=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'
gzip
compupdate off
region 'us-east-1';

Note: Compare the time taken to load the data from a single large file with the time taken to load the same data split into multiple files.
Loading from multiple files is faster because the slices load the split files in parallel.
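
For the comparison, the same data could also be loaded from one large file and the two elapsed times compared. The single-file path below is an assumption based on the multi-file path above, and the credentials are placeholders:

copy lineorder from 's3://awssampledb/load/lo/lineorder-single.tbl'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
gzip
compupdate off
region 'us-east-1';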


Video 31 - Best practices to load data

1. Highest throughput is obtained when loading data from S3; DynamoDB and EMR come next.
2. When copying from a table that is already in the cluster, use CTAS ("CREATE TABLE ... AS") or "INSERT INTO ... SELECT". These are also fast because they run in parallel on the cluster's own resources (see the sketch after this list).
3. If you must use INSERT INTO table VALUES (..), the best approach is multi-row inserts, batching as many rows as possible into each statement (see the sketch after this list).
4. Redshift maps one core to one slice. The ideal load is one S3 file per core (slice), although this is not always achievable because the recommended file size per slice is between 1 MB and 1 GB.
   For example, with 10 dc1.large nodes of 2 cores each, 10 * 2 = 20 cores (the vCPU column in the node specifications gives the number of cores).
   So 20 files, one per slice, give an even load distribution.
5. Use the Linux split command to break a large data file into multiple files, for example "split -n l/500 datafile", which splits the file into 500 roughly equal-sized pieces without breaking lines.
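
A minimal sketch of the two in-cluster copy patterns from point 2; the new table names and the date filter are made up for illustration:

-- CREATE TABLE AS: creates and fills a new table in one parallel operation
create table lineorder_recent as
select * from lineorder
where lo_orderdate >= 19980101;

-- INSERT INTO ... SELECT: appends rows to an existing table in parallel
insert into lineorder_archive
select * from lineorder
where lo_orderdate < 19980101;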
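
And a sketch of a multi-row insert from point 3, which sends several rows in one statement instead of one INSERT per row; the table and values are made up:

insert into city_dim (city_id, city_name) values
(1, 'CHICAGO'),
(2, 'MUMBAI'),
(3, 'LONDON');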


