How to connect parquet files from Azure through Apache Drill

Parquet is a very good choice for big data: it is fast to read, compact, and well suited to distributed storage. However, not all software supports this format directly, so we can use Apache Drill as an intermediate layer that connects the source and the target.

In this post we use Tableau as the example and explain how to read parquet files from Azure Blob Storage into a Tableau dashboard.

Install the Java JDK.
Install 7-Zip.
Set up the Windows environment.
1. Add JAVA_HOME to your environment variables and set its value to 'C:\progra~1\Java\jdk1.8.0_221'. Change the path if you have a different version.
2. Add %JAVA_HOME%\bin to the Path variable.
Create the UDF directories manually in Command Prompt (cmd).

    mkdir "%userprofile%\drill"
    mkdir "%userprofile%\drill\udf"
    mkdir "%userprofile%\drill\udf\registry"
    mkdir "%userprofile%\drill\udf\tmp"
    mkdir "%userprofile%\drill\udf\staging"
    takeown /R /F "%userprofile%\drill"
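
Before starting Drill, it can help to confirm that the environment variable and the directories above are in place. The following is a minimal Python sketch of such a pre-flight check; the function name `preflight` and the `UDF_DIRS` list are illustrative and are not part of Drill.

```python
import os
from pathlib import Path

# Sub-directories created by the mkdir commands above, relative to the user profile.
UDF_DIRS = ["drill/udf/registry", "drill/udf/tmp", "drill/udf/staging"]


def preflight(env, profile_dir):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not Path(java_home).is_dir():
        problems.append(f"JAVA_HOME does not point to a directory: {java_home}")
    for rel in UDF_DIRS:
        if not (Path(profile_dir) / rel).is_dir():
            problems.append(f"missing directory: {rel}")
    return problems


if __name__ == "__main__":
    # Path.home() resolves to %userprofile% on Windows.
    for line in preflight(os.environ, Path.home()) or ["all checks passed"]:
        print(line)
```

The check only reports what is missing; it does not create directories or change ownership, so the commands above are still needed.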

Download and unzip Apache Drill into your install folder.
Use cmd to start Drill in embedded mode.

    cd \apache-drill-1.16.0\bin
    drill-embedded.bat

Apache Drill is now running. If you have data files on your local machine, you can retrieve the data by running a SQL query such as select * from dfs.`<file path>`.
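
Besides the command-line shell, a running Drill instance also accepts queries over its REST API on the Web UI port. The sketch below builds and sends such a request with the Python standard library. It assumes the default port 8047 and the `/query.json` endpoint; the file path in the example is hypothetical.

```python
import json
import urllib.request

DRILL_URL = "http://localhost:8047"  # assumed default Web UI / REST port


def build_query_request(sql, base_url=DRILL_URL):
    """Build (without sending) a POST request for Drill's /query.json endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, base_url=DRILL_URL):
    """Send the query to a running Drill instance and return the result rows."""
    with urllib.request.urlopen(build_query_request(sql, base_url)) as resp:
        return json.load(resp).get("rows", [])


if __name__ == "__main__":
    # Hypothetical local path; replace it with a file that exists on your machine.
    print(run_query("SELECT * FROM dfs.`C:/data/sample.parquet` LIMIT 5"))
```

Separating `build_query_request` from `run_query` makes it possible to inspect the request without a Drill instance running.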

Download the supporting JARs for Azure and put them into the folder 'apache-drill-1.16.0\jars\3rdparty'.
1. hadoop-azure-2.7.7.jar
2. azure-storage-8.0.0.jar
Configure connection information.
- Go to 'apache-drill-1.16.0\conf', create a copy of core-site-example.xml in the same folder, and rename the copy to core-site.xml. Modify its contents as shown below (the property goes inside the <configuration> element), replacing STORAGE_ACCOUNT_NAME and AUTHENTICATION_KEY with the values for your Azure Blob Storage account:

    <property>
      <name>fs.azure.account.key.STORAGE_ACCOUNT_NAME.blob.core.windows.net</name>
      <value>AUTHENTICATION_KEY</value>
    </property>
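
If you prefer to generate the file rather than edit it by hand, the property above can be produced with a few lines of Python. This is a minimal sketch: the function name `core_site_xml` is illustrative, and the output contains only this one property, so any other settings from the example file would have to be merged in separately.

```python
import xml.etree.ElementTree as ET


def core_site_xml(account_name, account_key):
    """Return core-site.xml text holding the storage account key property."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = (
        f"fs.azure.account.key.{account_name}.blob.core.windows.net"
    )
    ET.SubElement(prop, "value").text = account_key
    ET.indent(configuration)  # pretty-print; requires Python 3.9+
    return ET.tostring(configuration, encoding="unicode")


if __name__ == "__main__":
    # Placeholders from the article; substitute your own account name and key.
    print(core_site_xml("STORAGE_ACCOUNT_NAME", "AUTHENTICATION_KEY"))
```

Because the account key is a secret, avoid committing the generated file to version control.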

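Configure connection information (continued). Restart Drill first so that it picks up the new JARs and core-site.xml.
- Go to 'http://localhost:8047/storage' to open the Drill Web UI. Create a new storage plugin called 'AZure' and copy the contents of the cp plugin into it. Then change the connection and set config to null, since the account key is already set in core-site.xml: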

 "type": "file",
 "connection":"wasbs://container_name@STORAGE_ACCOUNT_NAME.blob.core.windows.net",
    "config": null,

Read the parquet files into Tableau.
1. Open Tableau and choose the Apache Drill connection. Enter localhost for Server, then click Sign In.
2. Create a new custom SQL query: select * from `az.default`.`parquet file name`. Here az stands for the name of the storage plugin you created in the Web UI, so use the name you gave it.
Create a view.
1. Install the Drill ODBC driver (Drill Explorer is installed with it).
2. Go to Start and find Drill Explorer.
3. Connect to the server. Choose the SQL tab, write your SQL there, then click Create as View. You can now see the view under dfs.tmp.
4. Tableau can read this view at any time.
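
A view can also be created with a plain SQL statement instead of Drill Explorer. The Python sketch below only builds that statement as a string. It assumes the view goes into Drill's writable dfs.tmp workspace; the view, schema, and file names are hypothetical.

```python
def quote_ident(name):
    """Wrap an identifier in backticks; reject names that contain a backtick."""
    if "`" in name:
        raise ValueError(f"backtick not allowed in identifier: {name!r}")
    return f"`{name}`"


def create_view_sql(view_name, schema, file_name, workspace="dfs.tmp"):
    """Build a CREATE OR REPLACE VIEW statement over one Parquet file."""
    return (
        f"CREATE OR REPLACE VIEW {workspace}.{quote_ident(view_name)} AS "
        f"SELECT * FROM {quote_ident(schema)}.{quote_ident(file_name)}"
    )


if __name__ == "__main__":
    # Hypothetical names; "az.default" follows the custom SQL shown above.
    print(create_view_sql("sales_view", "az.default", "sales.parquet"))
```

The resulting statement can be run from the Drill shell or the Query tab of the Web UI.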

Note: At the time of writing, this setup does not support Azure Data Lake Storage Gen2, because it only uses the Blob Storage API, which has no hierarchical namespace.

NEO_AKSA
