Skip to content

Doing Interesting Things in Life

做有趣的事情

  • Home
  • Personal Gallery
  • About me
  • Kids need help in China

Tag: metadata-driven

Spy into metadata-driven ELT on Datafactory and Databricks

Azure provides datafactory and azure databricks for handling with ELT pipeline on a scalable environment. Datafactory provides more integrated solution while databricks gives more flexible one.

Like software development, the data pipeline development also face the same problems, e.g, duplicate activities, too many pipelines, hard coding reducing flexibility, etc. So, I was wondering if any solution to fix them once in all.

Then I checked the document library for datafactory, there is one simple solution:

Figure 1: the key activities for simple EL solution

In this solution, they retrieve the data source and destination combined with some parameters in the copy activity through a Lookup-Todo activity, then using foreach to execute stage data. But this simple solution can only solve Extract and Loading. Without some Transformation(biz logic or validation), it is not a standard process.

Figure 2: the table structure for the simple solution

To get the more advanced solution, I found a article from Microsoft in 2008. Through it is very old, but the content interests me.

Figure 3: the concept from the article “build a metadata-driven etl platform by extending microsoft sql server integration services”

If we look into this architecture, there are already some similar tech in Azure:

  • Monitor – Azure monitor / Power BI
  • Logging Repository – Azure log analytics
  • Builder – Data factory API

So they only problem comes into how to build a metadata designer and repository. I think the metadata repository is the key. Currently, I still think about and collect information about it. Here is some initial ideas.

  • ETL process Metadata. Create a table to record the order of each activities.
  • Schema Metadata. Like the simple solution mentioned before. Maintain the source and destination information.
  • Business Rule. This is the core function for complex transformation. For databricks, we can use public library to achieve this.
  • ETL pattern library. Azure function or databricks library.

Once I got some new idea to get deeper of these four topics, I will continue to finish the architecture for metadata drive ELT.

Reference:

Real-world data movement and orchestration patterns using Azure Data Factory V2

https://azure.microsoft.com/en-us/resources/videos/ignite-2018-real-world-data-movement-and-orchestration-patterns-using-azure-data-factory-v2/

Complex Azure Orchestration w Dynamic Data Factory Pipelines

https://sqlbits.com/Sessions/Event18/Complex_Azure_Orchestration_with_Dynamic_Data_Factory_Pipeli

Quickstart: Create an Azure Data Factory and pipeline using Python

https://docs.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-python

December 17, 2019neo_aksaBig Data, Computer ScienceADF, databricks, ELT, metadata-drivenLeave a comment

Recent Posts

  • Some features need to be improved in Azure Data Products
  • I bought a raspberry Pi 4, and tried to replace my PC>:<
  • Windows package manager(winget)
  • Spark 3.0 new features – Learning from Dr.Kazuaki Ishizaki
  • Columnstore index for MS SQL SERVER

Tags

ADF Azure Azure Devops big data CDC CFD CNN ConvLSTM Curriculum databricks data pipeline data science deep learning Django EDA Efficiency ELT etl exploratory data analysis HPC kafka Linux LSTM machine learning Mandelbrot set math MPI netflix NLP OpenFOAM OpenMP PCA PredNet programming python raspberry pi Recommendation systems Singleton Pattern spaCy spark SQL SERVER SSIS Tableau wordpress WSL

Archives

  • January 2021
  • December 2020
  • October 2020
  • July 2020
  • April 2020
  • March 2020
  • February 2020
  • January 2020
  • December 2019
  • November 2019
  • October 2019
  • September 2019
  • August 2019
  • July 2019
  • June 2019
  • May 2019
  • April 2019
  • March 2019
  • February 2019
  • January 2019
  • December 2018
  • November 2018
  • October 2018
  • September 2018

Categories

  • Big Data
  • Computer Science
  • ETL&DW
  • HPC
  • Linux
  • Machine Learning
  • Others
  • Programming
  • Stats
  • Visualization
  • What if

My Github

https://github.com/neoaksa

Proudly powered by WordPress | Theme: ZackLive by Zack.