Tuesday, April 25, 2017

Topics Yet to be Explored

-> Avoid shuffling in Spark using bucketing (sketch below)
-> Overwrite partition: http://www.ericlin.me/2015/05/hive-insert-overwrite-does-not-remove-existing-data/
-> Futures and thread pools in Scala (sketch below)
-> Salting (walk-through below)
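
A minimal sketch of bucketing, assuming a Hive-backed metastore (the table name is illustrative): writing a table bucketed and sorted on the join key lets Spark plan bucketed joins without a shuffle. Note that bucketBy requires saveAsTable.

// Bucket and sort by the join key (hypothetical table name)
spark.range(100).withColumnRenamed("id", "ID")
  .write
  .bucketBy(8, "ID")
  .sortBy("ID")
  .saveAsTable("bucketed_fact")

And a minimal sketch of a Scala Future running on an explicit thread pool (the pool size is an arbitrary choice):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Wrap a fixed thread pool as an ExecutionContext for Futures to run on
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

val f: Future[Int] = Future { 21 * 2 }   // executes on the pool
println(Await.result(f, 10.seconds))     // Await used here for demonstration only

The walk-through below addresses the last item, salting: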

val df1 = Seq((1,"a"),(2,"b"),(1,"c"),(1,"x"),(1,"y"),(1,"g"),(1,"k"),(1,"u"),(1,"n")).toDF("ID","NAME")

df1.createOrReplaceTempView("fact")

val df2 = Seq((1,10),(2,30),(3,40)).toDF("ID","SALARY")

df2.createOrReplaceTempView("dim")

df1.show
df2.show

val salted_df1 = spark.sql("""select concat(ID, '_', FLOOR(RAND(123456)*19)) as salted_key, NAME from fact """)

salted_df1.createOrReplaceTempView("salted_fact")
salted_df1.show
val exploded_dim_df = spark.sql(""" select ID, SALARY, explode(array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19)) as salted_key from dim""")

// Shorter equivalent on Spark 2.4+ (note: explode(array(0 to 19)) is not valid Spark SQL):
// val exploded_dim_df = spark.sql(""" select ID, SALARY, explode(sequence(0, 19)) as salted_key from dim""")

exploded_dim_df.createOrReplaceTempView("salted_dim")
exploded_dim_df.show(50)
val result_df = spark.sql("""select split(fact.salted_key, '_')[0] as ID, dim.SALARY
            from salted_fact fact
            LEFT JOIN salted_dim dim
            ON fact.salted_key = concat(dim.ID, '_', dim.salted_key) """)
result_df.show


df1
+---+----+
| ID|NAME|
+---+----+
|  1|   a|
|  2|   b|
|  1|   c|
|  1|   x|
|  1|   y|
|  1|   g|
|  1|   k|
|  1|   u|
|  1|   n|
+---+----+

df2
+---+------+
| ID|SALARY|
+---+------+
|  1|    10|
|  2|    30|
|  3|    40|
+---+------+

salted_df1
+----------+----+
|salted_key|NAME|
+----------+----+
|      1_15|   a|
|       2_9|   b|
|       1_1|   c|
|       1_8|   x|
|       1_0|   y|
|       1_3|   g|
|       1_4|   k|
|      1_11|   u|
|      1_13|   n|
+----------+----+

exploded_dim_df
+---+------+----------+
| ID|SALARY|salted_key|
+---+------+----------+
|  1|    10|         0|
|  1|    10|         1|
|  1|    10|         2|
|  1|    10|         3|
|  1|    10|         4|
|  1|    10|         5|
|  1|    10|         6|
|  1|    10|         7|
|  1|    10|         8|
|  1|    10|         9|
|  1|    10|        10|
|  1|    10|        11|
|  1|    10|        12|
|  1|    10|        13|
|  1|    10|        14|
|  1|    10|        15|
|  1|    10|        16|
|  1|    10|        17|
|  1|    10|        18|
|  1|    10|        19|
|  2|    30|         0|
|  2|    30|         1|
|  2|    30|         2|
|  2|    30|         3|
|  2|    30|         4|
|  2|    30|         5|
|  2|    30|         6|
|  2|    30|         7|
|  2|    30|         8|
|  2|    30|         9|
|  2|    30|        10|
|  2|    30|        11|
|  2|    30|        12|
|  2|    30|        13|
|  2|    30|        14|
|  2|    30|        15|
|  2|    30|        16|
|  2|    30|        17|
|  2|    30|        18|
|  2|    30|        19|
|  3|    40|         0|
|  3|    40|         1|
|  3|    40|         2|
|  3|    40|         3|
|  3|    40|         4|
|  3|    40|         5|
|  3|    40|         6|
|  3|    40|         7|
|  3|    40|         8|
|  3|    40|         9|
+---+------+----------+
only showing top 50 rows

result_df
+---+------+
| ID|SALARY|
+---+------+
|  1|    10|
|  2|    30|
|  1|    10|
|  1|    10|
|  1|    10|
|  1|    10|
|  1|    10|
|  1|    10|
|  1|    10|
+---+------+

-> Ratio of data volume to CPU cores and memory
-> Which is better for huge-data (petabyte-scale) processing: Spark or Hive?
-> Skew in big data: when joining two tables, all NULL join keys hash to the same partition (one fix is sketched below)
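
One common fix, sketched here with illustrative table and column names: NULL keys can never match in an equi-join anyway, so replacing them with distinct dummy values spreads the skewed partition out without changing the join result.

import org.apache.spark.sql.functions._

// Toy fact table with many NULL join keys (names are illustrative)
val events = Seq((None: Option[Int], "a"), (None, "b"), (Some(1), "c")).toDF("ID", "NAME")
val dimDf = Seq((1, 100)).toDF("ID", "SALARY")

// Scatter NULL keys across partitions with unique dummy values
val scattered = events.withColumn("ID_fixed",
  when(col("ID").isNull, concat(lit("N_"), monotonically_increasing_id().cast("string")))
    .otherwise(col("ID").cast("string")))

// The dummy keys never match, exactly as NULLs never would
val joined = scattered.join(dimDf.withColumn("ID_str", col("ID").cast("string")),
  scattered("ID_fixed") === col("ID_str"), "left")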

-> Spark execution model: executors, stages, etc.

1. Spark Streaming in Python and Scala
2. Create online dashboards in Python
3. Convert CSV to Avro in NIFI

Remote Debug Java
Windowing concept
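
If this refers to SQL window functions (it could also mean streaming windows), a minimal sketch with toy data:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank salaries within each department (illustrative data)
val emp = Seq(("d1", "a", 100), ("d1", "b", 200), ("d2", "c", 150)).toDF("DEPT", "NAME", "SALARY")
val byDept = Window.partitionBy("DEPT").orderBy(desc("SALARY"))
emp.withColumn("rnk", row_number().over(byDept)).show()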

Broadcast variables and Accumulator
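
A minimal sketch of both (the lookup data and names are illustrative):

// Broadcast: ship a small lookup map to every executor once, not per task
val lookup = spark.sparkContext.broadcast(Map(1 -> "one", 2 -> "two"))

// Accumulator: executors add to it; only the driver should read it
val badRecords = spark.sparkContext.longAccumulator("badRecords")

val named = spark.sparkContext.parallelize(Seq(1, 2, 3)).map { id =>
  if (!lookup.value.contains(id)) badRecords.add(1)
  lookup.value.getOrElse(id, "unknown")
}
named.collect().foreach(println)
println(s"bad records: ${badRecords.value}")   // 1, since key 3 is missing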

4. In a DataFrame, convert all the values in a column from string to integer (sketch below).
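
A minimal sketch (column names are illustrative); note that non-numeric strings become NULL rather than throwing:

import org.apache.spark.sql.functions.col

val raw = Seq(("a", "25"), ("b", "oops")).toDF("NAME", "AGE")
val typed = raw.withColumn("AGE", col("AGE").cast("int"))   // AGE is now IntegerType
typed.printSchema()
typed.show()   // "oops" becomes null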

5. JDBC connection in Spark (sketch below).
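
A sketch of a parallel JDBC read; the URL, credentials, table name and bounds are all placeholders, and the matching JDBC driver must be on the classpath:

val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/testdb")   // placeholder host/db
  .option("dbtable", "employees")                     // placeholder table
  .option("user", "dbuser")
  .option("password", "dbpass")
  .option("partitionColumn", "ID")   // numeric column to split the read on
  .option("lowerBound", "1")
  .option("upperBound", "100000")
  .option("numPartitions", "8")
  .load()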

6. Save a DataFrame to S3 (sketch below).
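
A sketch using the s3a connector; the bucket, path and keys are placeholders (in practice IAM roles or core-site.xml are preferred over inline keys):

// Hadoop-level S3 credentials (placeholders)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")

val out = Seq((1, "a"), (2, "b")).toDF("ID", "NAME")
out.write.mode("overwrite").parquet("s3a://my-bucket/output/")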

7. Difference between ORC and Parquet in terms of use cases.
Parquet is best for read-intensive (analytical) operations.

Lambda functions and the different kinds of functions in Scala (sketch after the links below)

https://alvinalexander.com/scala/passing-function-to-function-argument-anonymous
http://michaelpnash.github.io/scala-lambdas/
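
A minimal sketch of the different ways of writing the same lambda:

// Three equivalent anonymous functions
val double1: Int => Int = (x: Int) => x * 2
val double2: Int => Int = x => x * 2
val double3: Int => Int = _ * 2

println(List(1, 2, 3).map(double3))   // List(2, 4, 6)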

8. ORC is best when we want to perform OLTP-style (transactional) operations; for example, Hive ACID tables require ORC.

9. Indexing in ORC

10. HBase use cases and Cassandra use cases: in which cases is HBase preferable, and in which is Cassandra preferable?

https://www.youtube.com/watch?v=nIPPTn4IC2Y&t=20s
https://www.youtube.com/watch?v=zCOFC8IVn3Q

Cassandra works well as object-style storage: data is stored in the form of objects. HBase sits on block storage (HDFS blocks).

Cassandra can be configured outside a Hadoop cluster.

11. Passing a function as input to another function in Scala

Function as a member variable (both sketched below)
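
A sketch covering both: a higher-order function that takes a function as input, and a function stored as a member variable whose behaviour can be swapped at runtime.

// Higher-order function: f is itself a function
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
println(applyTwice(_ + 1, 3))   // 5

// Function as a member variable
class Calculator(var op: (Int, Int) => Int) {
  def run(a: Int, b: Int): Int = op(a, b)
}
val calc = new Calculator(_ + _)
println(calc.run(2, 3))   // 5
calc.op = _ * _                 // swap the behaviour at runtime
println(calc.run(2, 3))   // 6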

12. Difference between class and case class. What is the need for a case class?

case class = class + companion object.

In a plain class, private members are normally exposed through hand-written get and set methods. If we don't want that boilerplate, the companion object helps: a companion object can access the class's private members directly, and a case class generates its companion object (with apply/unapply) plus public immutable fields automatically.
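
A short sketch of what the compiler generates for a case class:

case class Person(name: String, age: Int)

val p = Person("asha", 30)             // companion object's apply, no 'new'
println(p.name)                        // direct field access, no getters
println(p == Person("asha", 30))       // structural equality: true
println(p.copy(age = 31))              // Person(asha,31)

// A plain class gives none of this for free
class PersonC(val name: String, val age: Int)
println(new PersonC("asha", 30) == new PersonC("asha", 30))   // reference equality: false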

13. Can two consumer groups independently read the complete data of a Kafka topic?
Yes: each consumer group tracks its own offsets, so both groups receive every message.

14. If there are two consumers in one consumer group, can both access all the data in the topic?
No: within a group, each partition is assigned to exactly one consumer, so the two consumers split the partitions (and hence the data) between them.

15. DAG
RDD lineage
Impala vs Spark SQL
NiFi vs Flume
Java vs Scala

Can two SparkContexts run? (By default only one active SparkContext per JVM; spark.driver.allowMultipleContexts exists but is discouraged.)

Python vs Scala
(Note: PySpark does support accumulators and broadcast variables, though accumulator values are not displayed in the Spark UI.)
Online Dashboards in Python for kafka-spark streaming:
https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/

Exercise 1:

1. Fetch data from http://samples.openweathermap.org/data/2.5/weather?q=London&appid=14e8adf775bccd703001ef35eefdfe3c in NIFI and get the JSON file.

2. Push JSON file to Kafka.

3. Get the data in Spark streaming and create a Hive table.

Exercise 2:

sqoop
=======
sqoop2   ->   kafka

sqoop   ->   Mysql
sqoop   ->   hdfs
sqoop   ->   hive


spark core (all inputs should be given at runtime)

read data from different file formats:
1) csv
2) avro
3) json

wordcount
static wordcount
DataFrames
perform SQL queries on DataFrames (Spark SQL)


Spark streaming

Source                          dest

kafka   =>  Spark streaming =>  hdfs
kafka   =>  Spark streaming =>  store in hive internal table
kafka   =>  Spark streaming =>  kafka

Oozie

step-1
execute the wordcount program
step-2
put the wordcount invocation in a shell script and run it from local

step-3
put that shell script in workflow.xml and run it
1) workflow
2) coordinator
3) bundle jobs


Hive
====
create internal and external tables
partitions
bucketing
create static and dynamic partitions



2. RegEx
3. Apache NIFI
4. SERDEPROPERTIES in HIVE
5. Self-Join with SubQueries
6. Shell Scripting
7. Difference between ORC and Parquet
8. Go through https://www.youtube.com/watch?v=eD3LkpPQIgM for deduplication implementation in MLlib

9. How to identify a failure in a Spark job?
10. What happens if the connection between Kafka and the consumer is lost for a while?
What happens next?

11. Why can't Spark Streaming implement the consumer APIs?

12. Which Java version is used, and what are its new features?

13. HBase and Cassandra Architecture

14. Kerberos Security

15. Shuffling in the case of coalesce and repartition (sketch below)
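
A minimal sketch of the difference:

val big = spark.range(1000000)

// repartition always shuffles; it can increase or decrease the partition
// count and rebalances data evenly
val evenly = big.repartition(10)

// coalesce only merges existing partitions, so decreasing the count avoids
// a shuffle (increasing it needs a shuffle, e.g. RDD coalesce(n, shuffle = true))
val merged = big.coalesce(10)

println(evenly.rdd.getNumPartitions)   // 10
println(merged.rdd.getNumPartitions)   // 10, but reached without a shuffle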

16. Schema Evolution in AVRO

17. Data has date and time values in the format dd/mm/yyyy:00:00:00. Create Hive tables partitioned by date; use built-in date functions to extract the date (sketch below).
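
A sketch with assumed table and column names, parsing the stated dd/MM/yyyy:HH:mm:ss format and writing the extracted date as a dynamic partition (requires Hive support enabled):

spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

spark.sql("""
  INSERT OVERWRITE TABLE events_partitioned PARTITION (event_date)
  SELECT id, payload,
         to_date(from_unixtime(unix_timestamp(event_ts, 'dd/MM/yyyy:HH:mm:ss'))) AS event_date
  FROM events_raw
""")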

18. Download traffic data and find the location with the highest traffic.

19. Table design, performance, data modelling, denormalization

20. Spark and Spark streaming job submission.

21. Copy data from s3 to cluster.

22. Create Hive table on s3 location

AWS Access Key ID [None]: <redacted>
AWS Secret Access Key [None]: <redacted>

23. Hive JDBC connection via Java (Scala sketch below; the java.sql calls are identical in Java)
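
A sketch against HiveServer2 using the standard java.sql API; the host, port, database and the 'fact' table are assumptions, and the URL is for a non-kerberized cluster:

import java.sql.DriverManager

// Requires the hive-jdbc driver on the classpath
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT ID, NAME FROM fact LIMIT 10")
while (rs.next()) {
  println(s"${rs.getInt("ID")}  ${rs.getString("NAME")}")
}
rs.close(); stmt.close(); conn.close()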
