Configuring Spark-submit parameters
Before going further, let's discuss the parameters below, which I have set for a job.
spark.executor.cores - Specifies the number of cores per executor. 5 is generally the optimum level of parallelism; a higher number of cores per executor can lead to bad HDFS I/O throughput.
spark.executor.instances - Specifies the number of executors to run; with the configuration above, 3 executors x 5 cores = 15 parallel tasks.
spark.executor.memory - The amount of memory each executor gets, 20g in the configuration above. The cores share this 20 GB, and as there are 5 cores, each core/task gets 20/5 = 4 GB.
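The arithmetic above can be sketched out directly (the numbers are the example values from this configuration):

```shell
# Example values from the configuration discussed above.
executors=3
cores_per_executor=5
executor_memory_gb=20

# Total parallel tasks = executors x cores per executor.
parallel_tasks=$((executors * cores_per_executor))            # 3 x 5 = 15

# Rough memory share per task = executor memory / cores.
mem_per_task_gb=$((executor_memory_gb / cores_per_executor))  # 20 / 5 = 4

echo "parallel tasks: $parallel_tasks"
echo "memory per task (GB): $mem_per_task_gb"
```

Note this is a rough rule of thumb: tasks in an executor share one JVM heap, so the per-task share is not a hard limit.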
spark.driver.memory - Usually this can be small, as the driver manages the job and doesn't process the data. The driver also stores local variables, so it needs more space if a Map/Array collection is loaded with a large amount of data. It mainly handles job allocation to executor nodes, DAG creation, writing logs, displaying output in the console, etc.
spark.dynamicAllocation.enabled - Enabled by default on many distributions; when there is scope to use more resources than specified (when the cluster is free), the job will ramp up with them. On YARN, the DominantResourceCalculator needs to be enabled to get the full benefit: it allocates executors and cores exactly as we specified, whereas the default memory-based resource calculator allocates only 1 core per executor.
spark.dynamicAllocation.maxExecutors - Specifies the upper bound on executors when dynamic allocation is enabled.
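Putting the parameters above together, a spark-submit invocation might look like this. The class name, JAR, and maxExecutors value are hypothetical placeholders:

```shell
# Sketch of a spark-submit using the values discussed above.
# my.company.MainJob and my-job.jar are placeholders for your application.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.executor.cores=5 \
  --conf spark.executor.instances=3 \
  --conf spark.executor.memory=20g \
  --conf spark.driver.memory=4g \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.shuffle.service.enabled=true \
  --class my.company.MainJob \
  my-job.jar
```

Note that dynamic allocation on YARN requires the external shuffle service (spark.shuffle.service.enabled=true), and for the cores setting to be respected by the Capacity Scheduler, yarn.scheduler.capacity.resource-calculator should be set to the DominantResourceCalculator in capacity-scheduler.xml.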
Now, coming to the actual story of assigning the parameters.
Consider a case where data needs to be read from a partitioned table, with each partition containing multiple small/medium files. In this case, have good executor memory, more executors, and the usual 5 cores.
A similar case to the above, but without multiple small/medium files at the source. In this case there can be fewer executors, each with good memory.
An incremental load where the data pulled is small, but it needs to be pulled from multiple tables in parallel (Futures). In this case there can be more executors, a little less executor memory, and the usual 5 cores.
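As an illustration, the three scenarios above might translate into parameter sets like these. The numbers are hypothetical starting points, not tuned values:

```shell
# Hypothetical starting points for the three scenarios above.

# 1. Partitioned table, many small/medium files per partition:
#    more executors, good memory, the usual 5 cores.
SCENARIO_1="--conf spark.executor.instances=10 --conf spark.executor.memory=10g --conf spark.executor.cores=5"

# 2. Same source, but without many small files:
#    fewer executors, good memory each.
SCENARIO_2="--conf spark.executor.instances=4 --conf spark.executor.memory=16g --conf spark.executor.cores=5"

# 3. Incremental load, small pulls from many tables in parallel (Futures):
#    more executors, a little less memory each.
SCENARIO_3="--conf spark.executor.instances=12 --conf spark.executor.memory=6g --conf spark.executor.cores=5"
```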
You also need to consider the amount of data being processed, the way joins are applied, the stages in the job, and whether broadcast is used or not. Basically this becomes trial and error after analyzing the above factors, and the best configuration can be chalked out in a UAT environment.
There is a good open-source monitoring tool called Sparklens which can monitor a job and provide its analysis from the first run.
sparklens : https://docs.qubole.com/en/latest/user-guide/spark/sparklens.html
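If I recall the Sparklens documentation correctly, it can be attached to an existing job without code changes via --packages. The package version shown is an assumption; check the link above for the current one:

```shell
# Attach Sparklens to a job (package version and JAR name are assumptions).
spark-submit \
  --packages qubole:sparklens:0.3.2-s_2.11 \
  --conf spark.extraListeners=com.qubole.sparklens.QuboleAppListener \
  --class my.company.MainJob \
  my-job.jar
```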