Tuesday, April 25, 2017

Topics Yet to be Explored

-> Avoid shuffling in spark using spark bucketing
Overwrite Partition. http://www.ericlin.me/2015/05/hive-insert-overwrite-does-not-remove-existing-data/
-> Futures in scala and thread pool in scala
->salting
->Ratio of data processing to CPU Cores and Memory
-> Which is better fo huge data(Peta Bytes) processing Spark / Hive
-> Skewing in bigdata, while joining 2 tables NULL values going to same partition.

-> Spark Execution Model, Executors, Stage etc.

1. Spark Streaming in Python and Scala
2. Create online dashboards in Python
3. Convert CSV to Avro in NIFI

Remote Debug Java
Windowing concept

Broadcast variables and Accumulator

4. In a Dataframe convert all the values in a column from string to Integer.

4. JDBC Connection in Spark

5. save dataframe to s3

6. difference bttween ORC and Parquet interms of usecases
Parquet is best for read intensive operations.

Lamda functions and different kinds of functions in Scala

https://alvinalexander.com/scala/passing-function-to-function-argument-anonymous
http://michaelpnash.github.io/scala-lambdas/

7. ORC is best which we would like to perform OLTP operations

8. Indexing in ORC

9. HBase usecases and cassandra usecases. Which cases HBase is preferable and cases in which Cassandra is preferable.

https://www.youtube.com/watch?v=nIPPTn4IC2Y&t=20s
https://www.youtube.com/watch?v=zCOFC8IVn3Q

cassandra works well with object storage. data stored in the form of object.
HBASE is block storage

Cassandra can be confugured outside Hadoop cluster

10. passing function as input to another function in scala

function as member variable

11. Difference between class and case class. Whats the need of using caseclass

case class = class + companion object.

In a class, to access protected variables Get and Set methods are to be implemented.But we don't want to access member variables using get and set functions. In this case companion object comes into picture which can directly access class member variables.

12. Can 2 consumer groups access complete data in a kafka topic independently
Yes.

13. If there are 2 consumers in a consumer group then can both these access all data in the kafka topic


14. DAG
RDD Lineage
Impala vs SparkSQL
Nifi vs flume
Java vs Scala

can 2 spark contexts run

python vs scala
In Python accumulators are not supported
Online Dashboards in Python for kafka-spark streaming:
https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/

Exercise 1:

1. Fetch data from http://samples.openweathermap.org/data/2.5/weather?q=London&appid=14e8adf775bccd703001ef35eefdfe3c in NIFI and GetJSON FILE.

2. Push JSON file to Kafka.

3. Get the data in Spark streaming and create a Hive table.

Exercise 2:

sqoop
=======
sqoop2   ->   kafka

sqoop   ->   Mysql
sqoop   ->   hdfs
sqoop   ->   hive


spark  core (All inputs should give in runtime)

read different files data
1) csv
2) avro
3) json

wordcount
staticwordcount
Dataframes
perform sql querys on Dataframes   (spark SQL )


Spark streaming

Source                          dest

kafka   =>  Spark streaming =>  hdfs
kafka   =>  Spark streaming =>  store in hive internal table
kafka   =>  Spark streaming =>  kafka

Oozie

step-1
exe wordcount program
step-2
Put this( exe wordcount program) in shell script and run from local

step-3
put that shell script in Workflow.xml and run
1) workflow
2) coord
3) Bundil jobs


Hive
====
create internal AND external TABLES
partitons
bucketing
create static and dynamic partitons



2. RegEx
3. Apache NIFI
4. SERDEPROPERTIES in HIVE
5. Self-Join with SubQueries
6. Shell Scripting
7. Difference between ORC and Parquet
8. Go through https://www.youtube.com/watch?v=eD3LkpPQIgM for deduplication implementation in MLlib

9. How to identify the failure in spark Job?
10. What happens if there is a failure in connection between Kafka and consumer for a while
What happens next?

11 Why can't spark streaming implement consumer API's

12. What is the JAVA version used and what are its new features?

13. HBase and Cassandra Architecture

14. Kerberos Security

15. Shuffling in case of colease and re-partition

16. Schema Evolution in AVRO

17. Data has date and time values in the format dd/mm/yyyy:00:00:00. Create Hive partitioned tables partitioned by Date.Need to use date in-built functions and extract date.

18. Download traffic data and find out the location that has highest traffic

19. Table design, performance, data modelling, denormalizing

20. Spark and Spark streaming job submission.

21. Copy data from s3 to cluster.

22. Create Hive table on s3 location

AWS Access Key ID [None]: AKIAINIHCWDZMRWJEI3A
AWS Secret Access Key [None]: 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

23. hive JDBC Connection via JAVA

Saturday, April 22, 2017

Sql Commands

Some SQL Commands:

Practice from http://www.w3resource.com/sql-exercises/

1. select * from customer where city='New York' AND NOT grade > 100;

2. select * from customer WHERE NOT (city = 'New York' OR grade>100);

3. SELECT * FROM orders WHERE NOT ((ord_date ='2012-09-10' AND salesman_id=5001) OR purch_amt>1000.00);  

4. select salesman_id, name, city, commission FROM salesman WHERE commission BETWEEN 0.10 AND 0.12;

5. SELECT * FROM  orders WHERE NOT(ord_date ='2012-08-17' OR customer_id>3005) ORDER BY ord_no DESC;  

6. select ord_no, purch_amt, (100*purch_amt)/6000 AS acheived, (100*(6000-purch_amt)/6000) AS unacheived from orders WHERE (100*purch_amt)/6000>50;

7. IN - to specify multiple value and as a shorthand for multiple OR operators

select * from emp_details WHERE lastname IN('Dosni','Mardy');

This is an alternative to, select * from emp_details WHERE EMP_LNAME='Dosni' OR EMP_LNAME='Mardy';

8. select * from emp_details where emp_dept IN(47,63);

9. select * from emp_details where EMP_LNAME LIKE ('Do%')
Finds any last name that starts with Do. This is case sensitive.

Result:
emp_idno emp_fname emp_lname emp_dept
843795                Enric Dosio             57
444527               Joseph Dosni             47

10. SELECT * FROM Customers WHERE City LIKE '%es%';

This finds the city that has es in the middle.

11. select * from salesman WHERE city IN ('Rome','Paris');

12. select * from salesman WHERE city NOT IN ('Rome','Paris');

13. SELECT * FROM customer where customer_id BETWEEN 3007 AND 3009;

                                OR
SELECT * FROM customer where customer_id IN(3007,3008,3009);

14. Filter all those orders with all information which purchase amount value is within the range 500 and 4000 except those orders of purchase amount value 948.50 and 1983.43

SELECT * FROM ORDERS WHERE((purch_amt BETWEEN 500 AND 4000) AND (NOT purch_amt BETWEEN 948.50 AND 1983.43)) ORDER BY purch_amt;

15. SELECT * FROM SALESMAN WHERE name LIKE 'L%' OR name LIKE 'A%';

16. SELECT * FROM salesman where name between 'A%' AND 'L%';

17. find that customer with all information who gets a grade except NULL value. 

SELECT * FROM customer where grade is NOT NULL;

18. find those rows from the table testtable which does not contain the character ( % ) in its column 'col1'
SELECT COL1 FROM testtable where col1 NOT LIKE '%/%%' ESCAPE '/';

19. Write a SQL statement to find those rows from the table testtable which does not contain the string ( _/ ) in its column 'col1'
SELECT * FROM testtable where col1 NOT LIKE ('%/_//%') ESCAPE '/';

In the above 2 statements _ and % are to be skipped, so need to specify the escape 
character / before them. Also need to specify / as Escape character.

Escape character need not be /. It can be either *. If * is used as escape character the the statement would be, 
SELECT * FROM testtable where col1 NOT LIKE ('%*_*/%') ESCAPE '*';

SELECT OrderID, Quantity, 
CASE
WHEN Quantity > 30 THEN "Above 30"
WHEN Quantity < 30 THEN "Below 30"
ELSE "Something else"
END AS Comment
FROM OrderDetails;


OrderIDQuantityComment
1024812Below 30
1024810Below 30
102485Below 30

20. find those salesmen with all information whose name containing the 1st character is 'N' and the 4th character is 'l' and rests may be any character. 

SELECT * FROM salesman WHERE name LIKE 'N__l%';  

21.  find all those customers with all information whose names are ending with the letter 'n'. 
select * from customer where cust_name LIKE '%n';

22. select SUM(purch_amt) from orders;

23. find the average purchase amount of all orders
select AVG(purch_amt) from orders;

24. SELECT COUNT(DISTINCT salesman_id) from orders;

25. SELECT MAX(purch_amt) from orders;

26. find the highest purchase amount ordered by the each customer with their ID and highest purchase amount.

select customer_id, MAX(purch_amt) as highest from orders GROUP BY customer_id ORDER BY customer_id;

27. a SQL statement to find the highest purchase amount on a date '2012-08-17' for each salesman with their ID.

select salesman_id, MAX(purch_amt) from orders WHERE ord_date='2012-08-17' GROUP BY salesman_id;

Note: HAVING is a replacement keyword for WHERE while performing aggregation operations.

28. SQL statement to find the highest purchase amount with their ID and order date, for only those customers who have highest purchase amount in a day is more than 2000.

SELECT  customer_id, ord_date, MAX(purch_amt) from orders GROUP BY ord_date, customer_id HAVING MAX(purch_amt) > 2000;  

29. find the highest purchase amount with their ID and order date, for those customers who have a higher purchase amount in a day is within the range 2000 and 6000. 


select customer_id, ord-date, MAX(purch_amt) from orders group BY customer_id, ord_date HAVING MAX(purch_amt) between 2000 AND 6000;

30.  find the highest purchase amount with their ID, for only those customers whose ID is within the range 3002 and 3007

select customer_id, MAX(purch_amt) from orders GROUP BY customer_id HAVING customer_id between 3002 AND 3007;

31. display customer details (ID and purchase amount) whose IDs are within the range 3002 and 3007 and highest purchase amount is more than 1000

SELECT customer_id, MAX(purch_amt) from orders WHERE customer_id BETWEEN 3002 AND 3007 GROUP BY customer_id HAVING MAX(purch_amt) > 1000;

Note: HAVING is performed after GROUP BY. That is exactly what it was invented for.

  • HAVING filters records that work on summarized GROUP BY results.
  • HAVING applies to summarized group records, whereas WHERE applies to individual records.
  • HAVING requires that a GROUP BY clause is present.

So, HAVING has to be used after GROUP BY and WHERE to be used before GROUP BY.

32. Find the highest purchase amount with their ID, for only those salesmen whose ID is within the range 5003 and 5008.

select customer_id, MAX(purch_amt) from orders WHERE salesman_id BETWEEN 5003 AND 5008 GROUP BY customer_id;


33. find the highest purchase amount with their ID, for only those salesmen whose ID is within the range 5003 and 5008

select customer_id, MAX(purch_amt) from orders WHERE salesman_id BETWEEN 5003 AND 5008 GROUP BY customer_id;

34. counts all orders for a date August 17th, 2012


SELECT COUNT(ord_no) from orders GROUP BY ord_date HAVING ord_date='2012-08-17';

                                                              OR

SELECT COUNT(ord_no) from orders WHERE ord_date='2012-08-17' GROUP BY ord_date;





35.  find the number of employees in each department along with the department code.



SELECT COUNT(*), EMP_DEPT FROM emp_details GROUP BY EMP_DEPT;


JOINING MORE THAN 2 Tables:


The logic is simple.See the below Query,


SELECT orders.ord_no, orders.ord_date, orders.purch_amt,customer.cust_name,salesman.name, salesman.commission * orders.purch_amt AS Commission_amt 

FROM orders 
JOIN customer ON orders.customer_id=customer.customer_id

JOIN salesman ON orders.salesman_id=salesman.salesman_id

In the above Left TABLE is orders and by JOINING Orders with customer table, we got customer.cust_name. These 2 tables were matched using the common column i.e customer_id.


To get salesman table details simply match the common columns in both the tables(orders and salesman) just like single JOIN statement. This is acheived by the below statement,


JOIN salesman ON orders.salesman_id=salesman.salesman_id


36.  EQUI JOIN: Not using JOIN Clause. Get values based on matching columns in 2 tables.


SELECT P.name,C.name FROM prod P, comp C WHERE P.id=C.id;


Using JOIN clause,


SELECT P.name, C.name FROM prod P JOIN comp C ON P.id=C.id;


37. CROSS JOIN


SELECT P.name, P.cost, d.* FROM prod P, discount D WHERE P.name LIKE '%TV';


Get the top 2nd Salaried employee details,


SELECT MAX(SALARY) FROM Employees WHERE Salary < ( SELECT Max(Salary) FROM Employees)
                                                      OR


SELECT MAX(SALARY) FROM employees WHERE NOT SALARY =(SELECT MAX(SALARY) FROM employees)


                                                       OR


SELECT SALARY FROM (SELECT SALARY FROM EMPLOYEES ORDER BY SALARY DESC LIMIT 2) AS Emp ORDER BY SALARY LIMIT 1


Part 1: SELECT SALARY FROM (SELECT SALARY FROM EMPLOYEES ORDER BY SALARY DESC LIMIT 2) AS Emp


Part 2: SELECT SALARY FROM (SELECT SALARY FROM EMPLOYEES ORDER BY SALARY DESC LIMIT 2) AS Emp ORDER BY SALARY


and apply LIMIT 1


                                                  OR

SELECT * FROM EMPLOYEES WHERE EMPLOYEE_ID  IN (SELECT EMPLOYEE_ID FROM EMPLOYEES WHERE SALARY IN (SELECT MAX(SALARY) FROM EMPLOYEES WHERE SALARY < (SELECT MAX(SALARY) FROM EMPLOYEES)))

Part 1: SELECT MAX(SALARY) FROM EMPLOYEES WHERE SALARY < (SELECT MAX(SALARY) FROM EMPLOYEES))  
//This gets 2nd highest salary from the list based on where condition

Part 2: SELECT * FROM EMPLOYEES WHERE EMPLOYEE_ID  IN (SELECT EMPLOYEE_ID FROM EMPLOYEES WHERE SALARY IN

38. SubQueries examples


SELECT * FROM EMPLOYEES WHERE SALARY BETWEEN (SELECT MIN(DISTINCT(SALARY)) FROM EMPLOYEES) AND 2500


SELECT * FROM EMPLOYEES WHERE NOT EMPLOYEE_ID IN (SELECT EMPLOYEE_ID FROM EMPLOYEES WHERE EMPLOYEE_ID BETWEEN 100 AND 200)


SELECT * FROM EMPLOYEES WHERE NOT DEPARTMENT_ID IN(SELECT DEPARTMENT_ID FROM EMPLOYEES WHERE EMPLOYEE_ID BETWEEN 100 AND 200)



SELECT * FROM EMPLOYEES WHERE DEPARTMENT_ID = ( SELECT DEPARTMENT_ID FROM EMPLOYEES WHERE FIRST_NAME = 'Clara') AND FIRST_NAME <> 'Clara'

SELECT * FROM EMPLOYEES WHERE DEPARTMENT_ID IN ( SELECT DEPARTMENT_ID FROM EMPLOYEES WHERE FIRST_NAME LIKE 'T%')

JOINING multiple tables and sub-query

SELECT COUNTRY_NAME, CITY, COUNT(DEPARTMENT_ID) FROM countries C
JOIN locations  L ON C.COUNTRY_ID=L.COUNTRY_ID
JOIN departments USING (LOCATION_ID)
WHERE DEPARTMENT_ID IN
(SELECT DEPARTMENT_ID FROM employees GROUP BY DEPARTMENT_ID HAVING COUNT(DEPARTMENT_ID) > 1)
GROUP BY COUNTRY_NAME,CITY;

NOT Operator:

select * from nobel_win where SUBJECT NOT LIKE 'P%'

Note: NOT cannot be used with =. Instead != or <> can be used.
NOT can be used for individual conditions like ,

where SUBJECT NOT LIKE 'P%'
where grade is NOT NULL

WHERE((purch_amt BETWEEN 500 AND 4000) AND (NOT purch_amt BETWEEN 948.50 AND 1983.43))

USING Clause:

The USING clause is something we don't need to mention in the JOIN condition when we are retrieving data from multiple tables. When we use USING clause, that particular column name should be present in both tables, and the SELECT query will automatically join those tables using the given column name in USING clause.


Eg:

Query 1:
SELECT *
FROM users
JOIN orders ON (orders.user_id = users.user_id)
WHERE users.user_id = 1;
Query 2:


SELECT *
FROM users
JOIN orders USING (user_id)
WHERE user_id = 1;


Scenarios of SQL:


1. Check if the user has installed all the prerequisites for the Graphics;


In one table get the list of all the installed softwares in the user machine.


Maybe,

user_list

SW1           VC++

SW2           .Net
SW3           Driver

From Sqoop Import the recommended softwares for that particulat Graphics card.

May be,

sys_list

SW1           VC++
SW2           .Net
SW3           Driver
SW4           Vulkan

Left JOIN and get the software name that has NULL

SELECT S.id FROM sys_list S LEFT JOIN user_list U
ON S.id=U.id WHERE S.id IS NULL;

This gives the software that the user has not installed.

2. Get the TOP 5 features which are most most popular based on Geographical location like country

SELECT COUNT(featureid), swname, country FROM
feature_list GROUP BY swname,country ORDER BY DESC LIMIT 5;

3. Suggesting Graphics card based on user preferences.




JOINS using Where condition


Eg:

SELECT i.Name
  FROM sys.TABLES AS tbl
  INNER JOIN sys.indexes AS i
    ON (i.object_id = tbl.object_id)
  WHERE (i.index_id > 0 AND i.is_hypothetical = 0) 
  AND (i.is_unique = 1 AND i.is_disabled = 0) 
  AND (tbl.Name = 'Warehouse')


CROSS JOIN: In CROSS JOIN each row in the first table JOINS with all the rows of another table.

Eg:
mysql> select * from school1;
+------+------+----------+
| id   | name | location |
+------+------+----------+
|    1 | abc1 | hyd      |
|    2 | abc2 | hyd2     |
|    3 | abc3 | hyd3     |
+------+------+----------+
3 rows in set (0.00 sec)

mysql> select * from school2;
+------+--------+-----------+
| id   | name   | location2 |
+------+--------+-----------+
|    1 | abc2_3 | hyd2_3    |
|    2 | abc2_2 | hyd2_2    |
|    3 | abc2_4 | hyd2_4    |
+------+--------+-----------+
3 rows in set (0.00 sec)

mysql> select * from school1 CROSS JOIN school2;
+------+------+----------+------+--------+-----------+
| id   | name | location | id   | name   | location2 |
+------+------+----------+------+--------+-----------+
|    1 | abc1 | hyd      |    1 | abc2_3 | hyd2_3    |
|    2 | abc2 | hyd2     |    1 | abc2_3 | hyd2_3    |
|    3 | abc3 | hyd3     |    1 | abc2_3 | hyd2_3    |
|    1 | abc1 | hyd      |    2 | abc2_2 | hyd2_2    |
|    2 | abc2 | hyd2     |    2 | abc2_2 | hyd2_2    |
|    3 | abc3 | hyd3     |    2 | abc2_2 | hyd2_2    |
|    1 | abc1 | hyd      |    3 | abc2_4 | hyd2_4    |
|    2 | abc2 | hyd2     |    3 | abc2_4 | hyd2_4    |
|    3 | abc3 | hyd3     |    3 | abc2_4 | hyd2_4    |
+------+------+----------+------+--------+-----------+

Note
1. CROSS keyword is optional.
select * from school1 JOIN (select id from school2 where id < 3) C; == select * from school1 CROSS JOIN (select id from school2 where id < 3) C;

2. As,'select id' is used in the sub-query for CROSS JOIN only ID would be joined with all the columns of school1.

mysql> select * from school1 JOIN (select id from school2 where id < 3) C;
+------+------+----------+------+
| id   | name | location | id   |
+------+------+----------+------+
|    1 | abc1 | hyd      |    1 |
|    1 | abc1 | hyd      |    2 |
|    2 | abc2 | hyd2     |    1 |
|    2 | abc2 | hyd2     |    2 |
|    3 | abc3 | hyd3     |    1 |
|    3 | abc3 | hyd3     |    2 |
+------+------+----------+------+

Few more examples:

mysql> select * from school1 JOIN (select count(*) val from school2 where id < 3) C where C.val = 0;
Empty set (0.00 sec)

mysql> select * from school1 JOIN (select count(*) val from school2 where id < 3) C where C.val = 2;
+------+------+----------+-----+
| id   | name | location | val |
+------+------+----------+-----+
|    1 | abc1 | hyd      |   2 |
|    2 | abc2 | hyd2     |   2 |
|    3 | abc3 | hyd3     |   2 |
+------+------+----------+-----+


The val column generated in the sub-query is CROSS JOINED and followed by where condition

CROSS JOIN

Consider that there are 2 tables,

Tbl1 has 2 Rows x 3 Columns
Tbl2 has 3 Rows x 2 Columns

select * from TBL1;

1 Leela USA
2 Annapurna India

select * from TBL2;

3 Karthik
4 Ramana
5 Sirisha

Now, select * from TBL1,TBL2 would result in Cartesian Product. Here all records where each row from the first table is combined with each row from the second table.

1 Leela USA 3 Karthik
1 Leela USA 4 Ramana
1 Leela USA 5 Sirisha
2 Annapurna India 3 Karthik
2 Annapurna India 4 Ramana
2 Annapurna India 5 Sirisha

USECASE 1: 
Need to fetch time from 2nd Row and make it as start time in a Row.

                       id                                                                 Time
1533027356180 7/31/2018 4:52
1532087175272 7/20/2018 6:03
1532071298705 7/19/2018 23:12
1532013014905 7/19/2018 8:01
1531477128107 7/13/2018 6:18
1531458468495 7/13/2018 1:00
1531448383317 7/12/2018 20:06

Like to Have the result as , such that previous id's time becomes Start_time for this id.

         id                                 Start_time                             End_time
1533027356180                7/20/2018 6:03                    7/31/2018 4:52
1532087175272                7/19/2018 23:12                  7/20/2018 6:03
1532071298705                7/19/2018 8:01                    7/19/2018 23:12
1532013014905                7/13/2018 6:18                    7/19/2018 8:01

Query: select T1.id, MAX(T1.tme) as end_time, MAX(T2.tme) as st_time from tb1 T1 join tb1 T2 on T1.id > T2.id group by T1.id ORDER BY T1.id DESC;

Note: In this case Self join is made on the condition T1.id > T2.id


Usecase 2:In Hive if there are multiple records of same Id but with different UpdateTimes then the record with Max(UpdateTime) is the valid one. Inorder to get this record

select * from tbl1 where (id,UpdateTime) = (select id, Max(UpdateTime) from tbl1 group by id) a1

OR

SELECT T1.* from tbl1 T1 JOIN (select id, Max(UpdateTime) from tbl1 group by id) a1
ON T1.id = a1.id AND T1.UpdateTime=a1.UpdateTime

OR

select * from (select *, row_number() over (partition by id order by UpdateTime Desc) as row_num FROM tbl1) tmp_tbl where row_num =1;