Monday, February 25, 2019

Configuring a Spark-submit Job


Configuring Spark-submit parameters

Before going further, let's discuss the parameters below, which I have used for a job.
spark.executor.cores=5 
spark.executor.instances=3
spark.executor.memory=20g
spark.driver.memory=5g 
spark.dynamicAllocation.enabled=true 
spark.dynamicAllocation.maxExecutors=10 

spark.executor.cores - Specifies the number of cores per executor. 5 is generally the optimum level of parallelism per executor; packing more cores into a single executor can lead to poor HDFS I/O throughput.

spark.executor.instances - Specifies the number of executors to run, so 3 executors x 5 cores = 15 parallel tasks.

spark.executor.memory - The amount of memory each executor gets, 20g as per the above configuration. The cores share this 20 GB, and since there are 5 cores, each core/task gets roughly 20/5 = 4 GB.

spark.driver.memory - Usually this can be smaller, as the driver coordinates the job and doesn't process the data. The driver also stores local variables, so it needs more space if any Map/Array collection is loaded with a large amount of data. It mainly handles task allocation to the executor nodes, DAG creation, writing logs, displaying output in the console, etc.

spark.dynamicAllocation.enabled - When this is enabled (some distributions turn it on by default) and there is scope for using more resources than specified (when the cluster is free), the job will ramp up with additional executors. On YARN, the Dominant Resource Calculator needs to be enabled to get the full power of the Capacity Scheduler: it allocates executors and cores exactly as we specified, whereas the default memory-based resource calculator considers only memory and accounts for just 1 core per executor. A sketch of that YARN setting is shown below.
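
A minimal sketch of enabling the Dominant Resource Calculator on YARN's Capacity Scheduler; the property goes into capacity-scheduler.xml (the file location and surrounding settings vary by distribution):

<property>
  <!-- consider both memory and vcores when sizing containers -->
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>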

spark.dynamicAllocation.maxExecutors - Sets the upper limit on the number of executors when dynamic allocation is enabled.
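
Putting the above together, a minimal spark-submit sketch could look like the following; the class name, jar name, and master/deploy mode are placeholders, not from any real job:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MySparkJob \
  --conf spark.executor.cores=5 \
  --conf spark.executor.instances=3 \
  --conf spark.executor.memory=20g \
  --conf spark.driver.memory=5g \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  my-spark-job.jar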

Coming to the actual story of assigning the parameters. 

Consider a case where data needs to be read from a partitioned table, with each partition containing multiple small/medium files. In this case use good executor memory, more executors, and the usual 5 cores, for example:
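
A rough sketch of such a configuration; the exact numbers are illustrative and depend on cluster capacity:

spark.executor.cores=5
spark.executor.instances=10
spark.executor.memory=20g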

Similar to the case above, but without multiple small/medium files at the source. In this case executors can be fewer, each with good memory, for example:
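
Again only a sketch, with illustrative numbers:

spark.executor.cores=5
spark.executor.instances=4
spark.executor.memory=24g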

In the case of an incremental load where the data pulled is small, but it needs to be pulled from multiple tables in parallel (using Futures), executors can be more numerous, with a little less executor memory and the usual 5 cores, for example:
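
One more illustrative sketch, assuming the parallel pulls are each lightweight:

spark.executor.cores=5
spark.executor.instances=12
spark.executor.memory=8g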

You also need to consider the amount of data being processed, how the joins are applied, the stages in the job, and whether broadcast joins are used. Basically this goes by trial and error after analyzing the above factors, and the best configuration can be chalked out in a UAT environment.

There is a good monitoring tool called Sparklens which can profile a job and provide its analysis from the first run. It is open source.

sparklens : https://docs.qubole.com/en/latest/user-guide/spark/sparklens.html
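
A minimal sketch of attaching Sparklens to a job; the package version here is an assumption, so check the Sparklens documentation for the build matching your Spark/Scala version:

spark-submit \
  --packages qubole:sparklens:0.3.2-s_2.11 \
  --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
  --class com.example.MySparkJob \
  my-spark-job.jar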
