NEO_AKSA

Optimize concurrency for merge operation in delta table

13.2.2021

Concurrency control is normal in OLTP operations, but for OLAP, not really. So I didn’t take care of it until I met the error blew:

com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were added to partition [cust_id=000] by a concurrent update. Please try the operation again.

This error caused by write conflict into single delta table by two merge operations. Base on conflicts matrix provided by databricks, we knew even in writeSerializable isolation levels, two merge operation can conflict.

To solve this problem, I did two steps:

Choose right partition columns. Here in my example, I only partitioned by cust_id, but the merge operation in two scripts, they need to update based on two columns: cust_id and report_type. So the first step is to change the partition columns.

partitonby(['cust_id','report_type'])

Change condition clause in merge operations. I do put two partition columns into my merge condition clause. It likes:

A.cust_id=B.cust_id and A.report_type=B.report_type # where A is the merged table

But it still got conflict even after right partitioned. So I tried to hard code part of condition. It works, no conflict happened anymore. So I guess the delta table can not infer partition correctly by join.

"A.cust_id={0} and A.report_type={1}".format(cust_id, report_type)

Reference:

Isolation Levels in delta lake: https://docs.databricks.com/delta/optimizations/isolation-level.html#isolation-levels

Concurrency control: https://docs.databricks.com/delta/concurrency-control.html

How to improve performance of Delta Lake MERGE INTO queries using partition pruning: https://kb.databricks.com/delta/delta-merge-into.html

João Carvalho

2021.02.24

Hi,

We tried to implement this solution to deal with merging to the same delta table, but we continue to get the concurrent update error. Did you just add the hard code part or did you also enable other table properties that we are missing? (see for instance https://github.com/delta-io/delta/pull/114)

Reply
1. neo_aksa
  
  2021.03.01
  
  would you mind sharing piece of your code?
  
  Reply

NEO_AKSA

Optimize concurrency for merge operation in delta table

Capturing Moments: Beyond the Lens

Reviving Kodak: Leveraging Color Science

Ensuring Exclusive Sub-Task Execution in Multiple Data Pipelines

Lessons on photography from the movie “Civil War”

Creating Read-Only External Table in Unity Catalog by Using Existing Delta Table in Azure Storage Account

Decoding A24’s Rise: A Blueprint for Indie Success

2 responses to “Optimize concurrency for merge operation in delta table”

Leave a Reply Cancel reply