Tableau 2019.3 starts to support Databricks

Since version 2019.3, Tableau starts to support databricks with the native connection driver. So we don’t have to use some wired unknown product or custom API to connect them together. Here is the step how to use it simply.

  • Download the new version of Tableau through pre-release page. (it will come to the official version soon)
  • Download the supported Simba ODBC driver. and install it.
  • After all these two steps, you will find Databrick option in your tableau connect.
Select Databricks connector

The next steps are figuring out the authorization information for databricks.

  • Login Databricks.
  • Go to Cluster – <your selection cluster name> – Advanced Options – JDBC/ODBC

Here you can find Server Hostname and Http path, copy them to authorization page.

For user name and password, you have to use token.

  • find the little logo on the right up corner. Press it then choose user setting.
  • Click “Generate new Token”. there will be a popup windows, save the series code which is your PASSWORD.
  • Enter “token” as username, the code you just got as the password. Then you can see the tables you created in databricks.

For Saving tables in databricks, it is also simple.

df_test = spark.read.format("parquet").load("/mnt/aclaradls01/test/dcu_health.parquet")
df_test.write.mode('overwrite').saveAsTable("df_test") 

Tips:

We can also use spark SQL to connect databricks. And even use databricks as API to connect to datalakes. It pretty much like a cluster running Spark. Here is the steps:

  1. create a view in databricks which links to the mounted datalake path. spark.sql(“create view df_test_view as SELECT * FROM parquet./mnt/<mounted_name>/<folder_name>/“)
  2. Add “spark.hadoop.hive.server2.enable.doAs false” into your cluster spark config.
  3. If you tried to load large size data which takes long time, add spark.executor.heartbeatInterval 10s and spark.network.timeout 9999s to your cluster spark config as well.

Reference:

Tableau with databricks. https://docs.databricks.com/user-guide/bi/tableau.html

How to build a data pipeline in Databricks

For a long term, I thought there was no pipeline concept in Databricks. For the most engineers they will write the whole script into one notebook rather than split into several activities like in Data factory. In this case, we have to rewirte everything in the script when the next pipeline coming.

But recently, I found there was indeed a way to achieve pipeline in databricks. Here is the document. And follows my understanding.

  • Create widgets as parameters in callee notebook
dbutils.widgets.text("input", "default") # define a text widget in callee notebook
......
dbutils.notebook.exit(defaultText) # set return variable
  • Call callee notebook through caller notebook(we can write our pipeline in this caller notebook, then call all following activity notebooks)
status = dbutils.notebook.run("pipeline",30,{"input":"pass success"}) # transfer input = “pass success” to a notebook called pipeline
  • Callee notebook can return a string to indicate the execution statues( you still remember dbutils.notebook.exit in callee notebook right?)
returned_string = dbutils.notebook.run("pipeline2", 60) 
  • The data exchange between two notebooks can be done with global tempview which locates in memory.
# in callee
sqlContext.range(10).toDF("value").createOrReplaceGlobalTempView("my_data")
dbutils.notebook.exit("my_data")

# in caller
returned_table = dbutils.notebook.run("pipeline2", 60)
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
display(table(global_temp_db + "." + returned_table))
  • We can also call function in another notebook, which means we can build a library for all notebook
%run ./aclaraLibary # in this notebook, there are two functions which are overloaded.
simple(1,2)
simple(1,2,3)