-> Avoid shuffling in Spark using bucketing
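A minimal sketch of bucketing (the DataFrames largeDF/otherDF and the table names are made up; bucketBy only works with saveAsTable):
// Write both sides bucketed (and sorted) on the join key with the same bucket count;
// Spark can then join them bucket-to-bucket without a full shuffle.
largeDF.write.bucketBy(16, "ID").sortBy("ID").saveAsTable("fact_bucketed")
otherDF.write.bucketBy(16, "ID").sortBy("ID").saveAsTable("dim_bucketed")
val joined = spark.table("fact_bucketed").join(spark.table("dim_bucketed"), "ID")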
-> Overwrite Partition: http://www.ericlin.me/2015/05/hive-insert-overwrite-does-not-remove-existing-data/
-> Futures in Scala and thread pools in Scala
-> Salting (worked example below)
var df1 = Seq((1,"a"),(2,"b"),(1,"c"),(1,"x"),(1,"y"),(1,"g"),(1,"k"),(1,"u"),(1,"n")).toDF("ID","NAME")
df1.createOrReplaceTempView("fact")
var df2 = Seq((1,10),(2,30),(3,40)).toDF("ID","SALARY")
df2.createOrReplaceTempView("dim")
df1.show
df2.show
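// Salt the skewed fact key: append a random suffix (0-18) so rows with the same ID spread across partitions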
val salted_df1 = spark.sql("""select concat(ID, '_', FLOOR(RAND(123456)*19)) as salted_key, NAME from fact """)
salted_df1.createOrReplaceTempView("salted_fact")
salted_df1.show
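// Replicate each dim row once for every salt value 0-19 so each salted fact key still finds its match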
val exploded_dim_df = spark.sql(""" select ID, SALARY, explode(array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19)) as salted_key from dim""")
//alternative on Spark 2.4+: val exploded_dim_df = spark.sql(""" select ID, SALARY, explode(sequence(0, 19)) as salted_key from dim""")
exploded_dim_df.createOrReplaceTempView("salted_dim")
exploded_dim_df.show(50)
val result_df = spark.sql("""select split(fact.salted_key, '_')[0] as ID, dim.SALARY
from salted_fact fact
LEFT JOIN salted_dim dim
ON fact.salted_key = concat(dim.ID, '_', dim.salted_key) """)
result_df.show
df1
+---+----+
| ID|NAME|
+---+----+
| 1| a|
| 2| b|
| 1| c|
| 1| x|
| 1| y|
| 1| g|
| 1| k|
| 1| u|
| 1| n|
+---+----+
df2
+---+------+
| ID|SALARY|
+---+------+
| 1| 10|
| 2| 30|
| 3| 40|
+---+------+
salted_df1
+----------+----+
|salted_key|NAME|
+----------+----+
| 1_15| a|
| 2_9| b|
| 1_1| c|
| 1_8| x|
| 1_0| y|
| 1_3| g|
| 1_4| k|
| 1_11| u|
| 1_13| n|
+----------+----+
exploded_dim_df
+---+------+----------+
| ID|SALARY|salted_key|
+---+------+----------+
| 1| 10| 0|
| 1| 10| 1|
| 1| 10| 2|
| 1| 10| 3|
| 1| 10| 4|
| 1| 10| 5|
| 1| 10| 6|
| 1| 10| 7|
| 1| 10| 8|
| 1| 10| 9|
| 1| 10| 10|
| 1| 10| 11|
| 1| 10| 12|
| 1| 10| 13|
| 1| 10| 14|
| 1| 10| 15|
| 1| 10| 16|
| 1| 10| 17|
| 1| 10| 18|
| 1| 10| 19|
| 2| 30| 0|
| 2| 30| 1|
| 2| 30| 2|
| 2| 30| 3|
| 2| 30| 4|
| 2| 30| 5|
| 2| 30| 6|
| 2| 30| 7|
| 2| 30| 8|
| 2| 30| 9|
| 2| 30| 10|
| 2| 30| 11|
| 2| 30| 12|
| 2| 30| 13|
| 2| 30| 14|
| 2| 30| 15|
| 2| 30| 16|
| 2| 30| 17|
| 2| 30| 18|
| 2| 30| 19|
| 3| 40| 0|
| 3| 40| 1|
| 3| 40| 2|
| 3| 40| 3|
| 3| 40| 4|
| 3| 40| 5|
| 3| 40| 6|
| 3| 40| 7|
| 3| 40| 8|
| 3| 40| 9|
+---+------+----------+
only showing top 50 rows
result_df
+---+------+
| ID|SALARY|
+---+------+
| 1| 10|
| 2| 30|
| 1| 10|
| 1| 10|
| 1| 10|
| 1| 10|
| 1| 10|
| 1| 10|
| 1| 10|
+---+------+
-> Which is better for huge data (petabytes) processing: Spark or Hive?
-> Skew in big data: while joining 2 tables, all NULL-valued keys go to the same partition.
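One common fix, sketched with hypothetical DataFrames a and b joined on a string column "key": give each NULL key on the skewed side a unique non-matching value, so those rows scatter across partitions instead of piling into one (they still come back with NULLs from the other side, as a left join requires).
import org.apache.spark.sql.functions._
// NULL keys can never match anyway; replace each with a unique junk value
val aSalted = a.withColumn("key",
  when(col("key").isNull, concat(lit("NULL_"), rand().cast("string"))).otherwise(col("key")))
val result = aSalted.join(b, Seq("key"), "left")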
-> Spark Execution Model, Executors, Stage etc.
1. Spark Streaming in Python and Scala
2. Create online dashboards in Python
3. Convert CSV to Avro in NIFI
Remote Debug Java
Windowing concept
Broadcast variables and Accumulator
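A quick sketch of both (the lookup map and counter are made up):
// Broadcast: ship a small read-only lookup to every executor once instead of with every task
val lookup = spark.sparkContext.broadcast(Map(1 -> "a", 2 -> "b"))
// Accumulator: a counter that tasks add to and only the driver reads
val badRecords = spark.sparkContext.longAccumulator("badRecords")
spark.sparkContext.parallelize(Seq(1, 2, 3, 99)).foreach { id =>
  if (!lookup.value.contains(id)) badRecords.add(1)
}
println(badRecords.value)   // read on the driver only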
4. In a DataFrame, convert all the values in a column from string to integer.
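A minimal way to do it, assuming a DataFrame df with a string column "age" holding numeric values:
import org.apache.spark.sql.functions.col
val casted = df.withColumn("age", col("age").cast("int"))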
4. JDBC Connection in Spark
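A sketch of a JDBC read (URL, table and credentials are placeholders; the matching JDBC driver jar must be on the classpath):
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/mydb")
  .option("dbtable", "employees")
  .option("user", "username")
  .option("password", "password")
  .load()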
5. Save a DataFrame to S3
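A sketch assuming the s3a connector and AWS credentials are already configured; bucket and path are placeholders:
df.write
  .mode("overwrite")
  .parquet("s3a://my-bucket/output/path/")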
6. Difference between ORC and Parquet in terms of use cases
Parquet is best for read-intensive operations.
Lambda functions and different kinds of functions in Scala
https://alvinalexander.com/scala/passing-function-to-function-argument-anonymous
http://michaelpnash.github.io/scala-lambdas/
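A small sketch of an anonymous function (lambda) passed to a higher-order function:
// a method that takes a function as an argument
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
applyTwice(n => n + 1, 5)       // anonymous function literal: 7
applyTwice(_ * 2, 5)            // placeholder syntax: 20
val inc: Int => Int = _ + 1     // function value assigned to a val
applyTwice(inc, 5)              // 7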
7. ORC is best when we want to perform transactional (ACID) operations in Hive, e.g. updates and deletes.
8. Indexing in ORC
9. HBase use cases and Cassandra use cases: in which cases is HBase preferable and in which is Cassandra preferable?
https://www.youtube.com/watch?v=nIPPTn4IC2Y&t=20s
https://www.youtube.com/watch?v=zCOFC8IVn3Q
Cassandra works well with object storage; data is stored in the form of objects.
HBase is block storage.
Cassandra can be configured outside a Hadoop cluster.
10. passing function as input to another function in scala
function as member variable
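A sketch of a function value held as a member variable (class and field names are made up):
class Discounter(rate: Double) {
  // a function stored as a member variable (a val of function type)
  val discount: Double => Double = price => price * (1 - rate)
}
val d = new Discounter(0.1)
d.discount(200.0)                    // 180.0
Seq(100.0, 50.0).map(d.discount)     // List(90.0, 45.0)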
11. Difference between a class and a case class. What's the need of using a case class?
case class = class + an auto-generated companion object (with apply/unapply), plus equals, hashCode, toString and copy.
In a regular class, protected member variables need getter and setter methods to be accessed. With a case class we don't need get/set functions: constructor parameters become public immutable vals, and the companion object's apply lets us create instances without new.
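A quick illustration of what the compiler generates for a case class (names are made up):
case class Employee(id: Int, name: String)
val e1 = Employee(1, "a")                 // companion object's apply, no 'new' needed
e1.name                                   // constructor params are public vals, no getters needed
val e2 = e1.copy(name = "b")              // copy
e1 == Employee(1, "a")                    // structural equals/hashCode -> true
e1 match { case Employee(id, _) => id }   // unapply enables pattern matching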
12. Can 2 consumer groups independently consume the complete data in a Kafka topic?
Yes - each consumer group maintains its own offsets, so each group gets the full stream.
13. If there are 2 consumers in one consumer group, can both of them read all the data in the Kafka topic?
No - within a consumer group each partition is assigned to only one consumer, so the two consumers split the partitions between them.
14. DAG
RDD Lineage
Impala vs SparkSQL
Nifi vs flume
Java vs Scala
can 2 spark contexts run
python vs scala
Note: PySpark does support accumulators (sc.accumulator); what it lacks is named accumulators that show up in the Spark UI.
Online Dashboards in Python for kafka-spark streaming:
https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/
Exercise 1:
1. Fetch data from http://samples.openweathermap.org/data/2.5/weather?q=London&appid=14e8adf775bccd703001ef35eefdfe3c in NiFi and get the JSON file.
2. Push JSON file to Kafka.
3. Get the data in Spark streaming and create a Hive table.
Exercise 2:
sqoop
=======
sqoop2 -> kafka
sqoop -> Mysql
sqoop -> hdfs
sqoop -> hive
spark core (all inputs should be given at runtime)
read data from different file formats (see the sketch after this list)
1) csv
2) avro
3) json
wordcount
staticwordcount
Dataframes
perform SQL queries on DataFrames (Spark SQL)
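A sketch of reading each format with paths supplied at runtime (assumes a main method where args carries the three paths; Avro needs the spark-avro package, or com.databricks.spark.avro on Spark < 2.4):
val Array(csvPath, avroPath, jsonPath) = args   // all inputs given at runtime
val csvDF  = spark.read.option("header", "true").option("inferSchema", "true").csv(csvPath)
val avroDF = spark.read.format("avro").load(avroPath)
val jsonDF = spark.read.json(jsonPath)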
Spark streaming
Source dest
kafka => Spark streaming => hdfs
kafka => Spark streaming => store in hive internal table
kafka => Spark streaming => kafka
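A minimal Structured Streaming sketch for the first flow (brokers, topic and paths are placeholders; needs the spark-sql-kafka package); the other two flows only change the sink:
val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "input-topic")
  .load()
val lines = kafkaDF.selectExpr("CAST(value AS STRING) AS value")
val query = lines.writeStream
  .format("parquet")                                     // HDFS sink
  .option("path", "hdfs:///data/streaming/output")
  .option("checkpointLocation", "hdfs:///data/streaming/checkpoint")
  .start()
query.awaitTermination()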
Oozie
step-1
execute the wordcount program
step-2
put this (the wordcount program) in a shell script and run it from local
step-3
put that shell script in workflow.xml and run it
1) workflow
2) coordinator
3) bundle jobs
Hive
====
create internal and external tables
partitions
bucketing
create static and dynamic partitions
2. RegEx
3. Apache NIFI
4. SERDEPROPERTIES in HIVE
5. Self-Join with SubQueries
6. Shell Scripting
7. Difference between ORC and Parquet
8. Go through https://www.youtube.com/watch?v=eD3LkpPQIgM for deduplication implementation in MLlib
9. How to identify the failure in spark Job?
10. What happens if there is a failure in connection between Kafka and consumer for a while
What happens next?
11. Why can't Spark Streaming implement the consumer APIs?
12. What is the JAVA version used and what are its new features?
13. HBase and Cassandra Architecture
14. Kerberos Security
15. Shuffling in case of coalesce and repartition
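The difference in one sketch (df is any existing DataFrame):
// repartition(n) always does a full shuffle; it can increase or decrease the partition count
val evenly = df.repartition(200)
// coalesce(n) only merges existing partitions (no full shuffle), so it can only reduce
// the partition count, and the result may be unevenly distributed
val fewer = df.coalesce(10)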
16. Schema Evolution in AVRO
17. Data has date and time values in the format dd/mm/yyyy:00:00:00. Create Hive tables partitioned by date. Need to use built-in date functions to extract the date.
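A sketch using Spark SQL and dynamic partitioning (the names raw_events, event_ts and events_by_date are assumptions; the raw column is a string like '21/07/2018:00:00:00'; to_timestamp needs Spark 2.2+):
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""CREATE TABLE IF NOT EXISTS events_by_date (id INT, payload STRING)
             PARTITIONED BY (event_date STRING) STORED AS ORC""")
spark.sql("""INSERT OVERWRITE TABLE events_by_date PARTITION (event_date)
             SELECT id, payload,
                    date_format(to_timestamp(event_ts, 'dd/MM/yyyy:HH:mm:ss'), 'yyyy-MM-dd') AS event_date
             FROM raw_events""")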
18. Download traffic data and find out the location that has highest traffic
19. Table design, performance, data modelling, denormalizing
20. Spark and Spark streaming job submission.
21. Copy data from s3 to cluster.
22. Create Hive table on s3 location
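A sketch of an external table over an S3 location (bucket, path and schema are placeholders; assumes the s3a connector and credentials are configured):
spark.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS sales_s3 (id INT, amount DOUBLE)
             STORED AS PARQUET
             LOCATION 's3a://my-bucket/warehouse/sales/'""")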
AWS Access Key ID [None]: <your-access-key-id>
AWS Secret Access Key [None]: <your-secret-access-key>
23. Hive JDBC connection via Java
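A sketch of the JDBC calls, written in Scala but using the same java.sql API a Java program would (host, database, credentials and table are placeholders; needs the hive-jdbc driver on the classpath):
import java.sql.DriverManager
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://hiveserver2-host:10000/default", "user", "password")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT * FROM some_table LIMIT 10")
while (rs.next()) println(rs.getString(1))
rs.close(); stmt.close(); conn.close()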