Optimize concurrency for merge operation in delta table

Concurrency control is normal in OLTP operations, but for OLAP, not really. So I didn’t take care of it until I met the error blew:

com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were added to partition [cust_id=000] by a concurrent update. Please try the operation again.

This error caused by write conflict into single delta table by two merge operations. Base on conflicts matrix provided by databricks, we knew even in writeSerializable isolation levels, two merge operation can conflict.

To solve this problem, I did two steps:

  • Choose right partition columns. Here in my example, I only partitioned by cust_id, but the merge operation in two scripts, they need to update based on two columns: cust_id and report_type. So the first step is to change the partition columns.
partitonby(['cust_id','report_type'])
  • Change condition clause in merge operations. I do put two partition columns into my merge condition clause. It likes:
A.cust_id=B.cust_id and A.report_type=B.report_type # where A is the merged table

But it still got conflict even after right partitioned. So I tried to hard code part of condition. It works, no conflict happened anymore. So I guess the delta table can not infer partition correctly by join.

"A.cust_id={0} and A.report_type={1}".format(cust_id, report_type)

Reference:

Isolation Levels in delta lake: https://docs.databricks.com/delta/optimizations/isolation-level.html#isolation-levels

Concurrency control: https://docs.databricks.com/delta/concurrency-control.html

How to improve performance of Delta Lake MERGE INTO queries using partition pruning: https://kb.databricks.com/delta/delta-merge-into.html

2 responses to “Optimize concurrency for merge operation in delta table”

  1. João Carvalho Avatar
    João Carvalho

    Hi,

    We tried to implement this solution to deal with merging to the same delta table, but we continue to get the concurrent update error. Did you just add the hard code part or did you also enable other table properties that we are missing? (see for instance https://github.com/delta-io/delta/pull/114)

    1. neo_aksa Avatar
      neo_aksa

      would you mind sharing piece of your code?

Leave a Reply

Your email address will not be published. Required fields are marked *