How to break down Databricks DBU cost to the pipeline level

One day we noticed that our Databricks DBU cost had surged, but we didn't know which ADF job was responsible. I asked Azure support whether there was any way to track the cost down to the pipeline level. Unfortunately, they told me: “Regarding your question, please note that you can only track cost against the resource group or workspace(cluster). Unfortunately, there is no way to track Databricks cost at a job or user level.” OK, that's fine.

However, when I opened the cost details exported from Cost Management in Azure, I found a tag column on the DBU resource usage rows. One of the values looked like this:

"ClusterId": "1210-XXXX-XXXX","DatabricksInstancePoolCreatorId": "XXXXXX","DatabricksInstancePoolId": "1025-204618-XXXXX-pool-XXXXXX","ClusterName": "job-8810145-run-1","JobId": "8810145","RunName": "ADF_df-eus-prod-01_Standard_Module_run_pipeline_935e1d13-02c5-4ae3-9441-3a414d1ad0eb","Creator": "XXXX@XXXX.com","Vendor": "Databricks","DatabricksInstanceGroupId": "-8346123XXXXX6640367"

Is this our Databricks run information? Maybe we can do something with it! I consulted my coworker Sai, who told me that the JobId corresponds to the notebook run ID. That solves the whole problem, because we already record everything when a notebook runs: the ADF pipeline ID, customer information, notebook name, notebook run ID (JobId), and start and end times.

With a simple SQL query that joins the cost CSV table to the log table, we can attribute the DBU cost to each ADF pipeline.

select t1.ProductName, t2.notebook_name, sum(t1.PreTaxCost)
from cost_df t1
left join (
  select distinct notebook_name, notebook_runID, cust_name
  from delta.`/mnt/eus-logging/job-log-details`
) t2 on t2.notebook_runID = get_jobid(t1.Tags)
where t1.ProductName = 'Azure Databricks - Premium - Jobs Compute - DBU'
group by t1.ProductName, t2.notebook_name
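The query relies on a get_jobid helper whose body is not shown in this post. As a minimal sketch, assuming the tag column holds the comma-separated "key": "value" pairs shown earlier (with or without surrounding braces), it could be written like this. The parsing approach is my assumption, not necessarily how the original helper works.

```python
import json
from typing import Optional


def get_jobid(tags: Optional[str]) -> Optional[str]:
    """Return the JobId value from a cost-export tag string, or None."""
    if not tags:
        return None
    text = tags.strip()
    # The sample in this post has no outer braces, so add them when
    # they are missing before parsing the string as JSON.
    if not text.startswith("{"):
        text = "{" + text + "}"
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None  # unparseable tags simply do not match any run
    if not isinstance(parsed, dict):
        return None
    job_id = parsed.get("JobId")
    return str(job_id) if job_id is not None else None


# Shortened version of the tag value shown earlier in the post.
sample_tags = '"ClusterName": "job-8810145-run-1","JobId": "8810145","Vendor": "Databricks"'
print(get_jobid(sample_tags))  # 8810145
```

In a Databricks notebook, a function like this would probably be registered with spark.udf.register("get_jobid", get_jobid) so that the SQL above can call it. Rows without a JobId tag (for example, usage from interactive clusters) return None and fall into the unmatched group of the left join.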

For this to work, you have to log the ADF pipeline and notebook run information every time a notebook executes. I did this by adding the following calls at the start and end of each notebook.

# at the beginning
from datetime import datetime

# ADFRunID, cust_id and cust_name are assumed to be set earlier, e.g. from notebook parameters
notebook_name = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
notebook_RunID = getRunID()  # getRunID is a function defined in /Public/JOB_LOG
write_log_detail(ADFRunID, cust_id, cust_name, notebook_name, notebook_RunID, startTime=str(datetime.now()))  # notebook start time

# at the end
write_log_detail(ADFRunID, cust_id, cust_name, notebook_name, notebook_RunID, endTime=str(datetime.now()))  # notebook end time

Here, write_log_detail is the function that writes the record to the Delta table.
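The body of write_log_detail is not shown here, so the following is only a rough sketch of what such a function might look like, not the actual implementation. The columns notebook_name, notebook_runID and cust_name and the table path come from the query in this post; the other column names, the all-string schema, the one-row-per-call design, and the sink parameter (added so the function can be tried without a cluster) are assumptions.

```python
from datetime import datetime
from typing import Callable, Optional, Tuple

# notebook_name, notebook_runID and cust_name appear in the query in this
# post; the remaining column names are assumptions for illustration.
LOG_SCHEMA = (
    "adf_run_id string, cust_id string, cust_name string, notebook_name string, "
    "notebook_runID string, start_time string, end_time string"
)
LOG_PATH = "/mnt/eus-logging/job-log-details"

LogRow = Tuple[Optional[str], ...]


def append_to_delta(row: LogRow) -> None:
    """Append one row to the Delta log table.

    Only works inside Databricks, where a `spark` session is predefined.
    """
    df = spark.createDataFrame([row], schema=LOG_SCHEMA)  # noqa: F821
    df.write.format("delta").mode("append").save(LOG_PATH)


def write_log_detail(ADFRunID, cust_id, cust_name, notebook_name, notebook_RunID,
                     startTime: Optional[str] = None, endTime: Optional[str] = None,
                     sink: Callable[[LogRow], None] = append_to_delta) -> LogRow:
    """Build one log row and hand it to `sink` (the Delta writer by default)."""
    row = (str(ADFRunID), str(cust_id), cust_name, notebook_name,
           str(notebook_RunID), startTime, endTime)
    sink(row)
    return row


# Local dry run: collect rows in a list instead of writing to Delta.
rows = []
write_log_detail("adf-run-1", 42, "acme", "/Public/demo", 8810145,
                 startTime=str(datetime.now()), sink=rows.append)
print(rows[0][4])  # 8810145
```

With this design each notebook run produces two rows (one with a start time, one with an end time), which fits the select distinct on notebook_name, notebook_runID and cust_name in the cost query.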

Unfortunately, I have not yet found a way to break down the VM cost to the pipeline level. I think there should be a way; it just needs more time. I will update this post later.

2 responses to “How to break down databricks DBU cost to the pipeline level”

HR

2022.09.30

Hi,
Did you find a way to break down VM costs to the pipeline level?

1. neo_aksa
  
  2023.02.22
  
  If it is a job cluster, you can figure it out by parsing the cluster ID from the tag column. However, if you use an on-demand cluster, you can only attribute the cost to that cluster as a whole, unless you do some additional coding work in your notebook to record the start/end time, cluster ID, etc.
