A real case of optimazing spark notebook

At the beginning of my optimazation, I tried to find some standard principles that can quickly and smoothly help me. Altough lots of information online that indicate where I can improve my spark, none gives an easy solution to fix the key issues from end to end. So, I summried what I understand through online materical and apply some strateges to my spark notebook which I found may have some performance issues.

Start with the some slow jobs, you can order by “Druations” column. Some guides may go to the details of slowest stages. But here I suggest go to the related SQL Query page to review the whole process on the query level. It will help us to quickly figure out the common query performance issues, like duplicate scans without filters or inappropriate join, etc.

*In Jobs tab of Spark UI, we can order by duration of each job to find the some low performance jobs.*

*In the page of job detail, we click “Associated SQL Query” firstly.*

One table read multible times. If we found one table is scaned multiple time at the beginning, we may consider create caches to save the temporary result into the memory/local disk, rather than go through the remote disk.

*In my query DAG, there are two table scan parallelly, but they are from the same table and read the same data out. record read are both 884,688,530.*

Too many similar branches. If we see many similar branches, it may be caused by lack of cache too. The medium results have to be recomputed when reaching the point where we need it.

*Rad/Orange lines show us three copies of data, but essinssically they are doing the same process. We have to think about how to cache the medium compute results.*

Duplicate slow stages seem in the stage tab. It may caused by uncautioned code mistake. The case blew shows same dataframe execute computing twice. But this piece code is very normal in python.

*two jobs spent same time and with same piece of code, we may consider it as missing cache.*

*Back to the notebook, we found there is a dataframe executed twice.*

To cache table is like blew:
sqlContext.cacheTable('mtu_dcu_trans_reading')

Figure out strange operations. This process will take longer time compared with previous two. We need to match the DAG to SQL in the notebook and understand the wide/narrow shuffle. Then pick some wild shuffle we didn’t expect.

When I went through the notebook with DAG, I found this “HashAggregate” is not my expect. Then I realized I write “Union” rather than “Union All”, this misktake took addtional time to reduce duplicates (which should not exist at all).

Reduce data size by adding project to select the columns we need. I think this is the easiest way to improve the performance, especially for the big tables.

*Simply add the column names into select query, it reduced the data reading size from 27.9GB to 23.5GB.*

Reduce data size by using filter as early as possible. We can check if filter push down into scan, or manuly move the filter when read the table.
Corroctly partition destination table. When are use Merge or Replacewhere, it is better to partition correctly.

*After we created reading_date as partiton column, the time cost reduced from 2.49 mins to 6.67 seconds.Similar optimazation will happen in merge and replacewhere.*

Reduce spill. When executor is out of memory, it will spill into disk which is much slower than memory. To solve this issue. we can do one or more of following steps:
- Increase executor memory by change a bigger cluster or set spark.executor.memory(if workable)
- Increase the fraction of execution and storage memory(spark.memory.fraction, default 0.6).
- Adjust spark.memory.storageFraction(default 0.5). Depend on when the spill occurs, either in cache or shuffle, we can adjust storage fraction.
- Adjust spark.sql.shuffle.partitions (default 200). Try to increase shuffle partition size to avoid spill.

Reference:

Spark Configuration, https://spark.apache.org/docs/2.3.0/configuration.html

Performance Tuning, https://spark.apache.org/docs/latest/sql-performance-tuning.html#performance-tuning

From Query Plan to Performance: Supercharging your Apache Spark Queries using the Spark UI SQL Tab, https://www.youtube.com/watch?v=_Ne27JcLnEc&t=553s

How to Read Spark DAGs | Rock the JVM, https://www.youtube.com/watch?v=LoFN_Q224fQ

Optimize performance with caching, https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/delta-cache

NEO_AKSA

A real case of optimazing spark notebook

Config-Driven Feature Engineering: A Generic Approach ( Ver 0.1)

Capturing Moments: Beyond the Lens

Reviving Kodak: Leveraging Color Science

Ensuring Exclusive Sub-Task Execution in Multiple Data Pipelines

Lessons on photography from the movie “Civil War”

Creating Read-Only External Table in Unity Catalog by Using Existing Delta Table in Azure Storage Account

Leave a Reply Cancel reply