How to connect parquet files from Azure through Apache Drill

Parquet format is a very good choice for big data. It is fast, small and distributed. But not all software support this format. So, we have to use Apache Drill as intermediate layer to connect source and target together.

drill query flow

Today, we are going to take tableau as example to explain how to read parquet files from Azure Blobstorage to Tableau dashboard.

  • Install java JDK.
  • Install 7-zip.
  • Setting up windows environment.
    1. Add JAVA_HOME to your environment variable. set the value as ‘C:\progra~1\Java\jdk1.8.0_221‘. Please change the path if you have a different version.
    2. Add %JAVA_HOME%\bin to Path variable.
  • Create UDF directories manually in bash.
    mkdir "%userprofile%\drill"
    mkdir "%userprofile%\drill\udf"
    mkdir "%userprofile%\drill\udf\registry"
    mkdir "%userprofile%\drill\udf\tmp"
    mkdir "%userprofile%\drill\udf\staging"
    takeown /R /F "%userprofile%\drill"
  • download and unzip Apache Drill to your install folder.
  • Use cmd to start Drill.
cd  \apache-drill-1.16.0\bin
drill-embedded.bat

now, you can see apache drill has been running. If you have data files on your local machine, you can simply retrive data by running SQL select * from dfs.<file path>.

  • Download support jar for Azure.Put them into folder ‘apache-drill-1.16.0\jars\3rdparty’
    1. hadoop-azure-2.7.7.jar
    2. azure-storage-8.0.0.jar
  • Configure connection information.
    • goto ‘apache-drill-1.16.0\conf’, create a copy of core-site-example.xml at the same folder, and change the name to core-site.xml. Modify the contents as blew, remember to change STORAGE_ACCOUNT_NAME and AUTHENTICATION_KEY in term of your azure blob storage account:
<property>
  <name>fs.azure.account.key.STORAGE_ACCOUNT_NAME.blob.core.windows.net</name>
   <value>AUTHENTICATION_KEY</value>
</property>
  • Configure connection information.(con’t)
    • goto ‘http://localhost:8047/storage‘ to open WebUI. Create a new plugin called ‘AZure’, the contents is as same as cp. change the connection and set config as ‘null’ since we already set it up in core-site.xml.
 "type": "file",
 "connection":"wasbs://container_name@STORAGE_ACCOUNT_NAME.blob.core.windows.net",
    "config": null,
  • Try to read parquet files into tableau.
    1. Open Tableau and choose connection Apache Drill. Entry *localhost for Server. Click Signin.
    2. Create a new custom SQL. select * from `az.default`.`parquet file name`.
  • Create view.
    1. Install Drill JDBC Driver.
    2. Go to Start, find Drill Explorer.
    3. Connect to Server. Choose SQL Tab, write SQL here, then click Create as View. Now, you can see the view under dfs.temp.
    4. Tableau can read this view anytime.

Plus: Currently, Drill API didn’t support Datalake Gen2 since it only supports Blob Storage API which is without hierarchy.

How to build a data pipeline in Databricks

For a long term, I thought there was no pipeline concept in Databricks. For the most engineers they will write the whole script into one notebook rather than split into several activities like in Data factory. In this case, we have to rewirte everything in the script when the next pipeline coming.

But recently, I found there was indeed a way to achieve pipeline in databricks. Here is the document. And follows my understanding.

  • Create widgets as parameters in callee notebook
dbutils.widgets.text("input", "default") # define a text widget in callee notebook
......
dbutils.notebook.exit(defaultText) # set return variable
  • Call callee notebook through caller notebook(we can write our pipeline in this caller notebook, then call all following activity notebooks)
status = dbutils.notebook.run("pipeline",30,{"input":"pass success"}) # transfer input = “pass success” to a notebook called pipeline
  • Callee notebook can return a string to indicate the execution statues( you still remember dbutils.notebook.exit in callee notebook right?)
returned_string = dbutils.notebook.run("pipeline2", 60) 
  • The data exchange between two notebooks can be done with global tempview which locates in memory.
# in callee
sqlContext.range(10).toDF("value").createOrReplaceGlobalTempView("my_data")
dbutils.notebook.exit("my_data")

# in caller
returned_table = dbutils.notebook.run("pipeline2", 60)
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
display(table(global_temp_db + "." + returned_table))
  • We can also call function in another notebook, which means we can build a library for all notebook
%run ./aclaraLibary # in this notebook, there are two functions which are overloaded.
simple(1,2)
simple(1,2,3)