
A peek into metadata-driven ELT on Data Factory and Databricks

Azure provides Data Factory and Azure Databricks for running ELT pipelines in a scalable environment. Data Factory offers the more integrated solution, while Databricks offers the more flexible one.

Like software development, data pipeline development faces the same problems, e.g., duplicated activities, too many pipelines, and hard-coding that reduces flexibility. So I wondered whether there is a solution that fixes them once and for all.

Then I checked the Data Factory documentation, which offers one simple solution:

Figure 1: the key activities for simple EL solution

In this solution, a Lookup-Todo activity retrieves the data source and destination, together with some parameters for the copy activity, and a ForEach activity then runs the data staging step for each entry. But this simple solution only covers Extract and Load. Without a Transformation step (business logic or validation), it is not a standard ELT process.

Figure 2: the table structure for the simple solution
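The Lookup / ForEach / Copy pattern described above can be sketched in plain Python. This is only an illustration of the control flow, not Data Factory code: the `CopyTask` columns, the sample rows, and the helper names are all assumptions made for this sketch, and the real table structure is the one shown in Figure 2.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class CopyTask:
    """One row of a hypothetical control table (column names are assumptions)."""
    source_table: str
    sink_path: str
    enabled: bool = True


# Hypothetical control-table rows; in Data Factory a Lookup activity
# would return records like these from the metadata store.
CONTROL_TABLE = [
    CopyTask("dbo.Customer", "staging/customer"),
    CopyTask("dbo.Orders", "staging/orders"),
    CopyTask("dbo.Legacy", "staging/legacy", enabled=False),
]


def lookup_todo(rows: Iterable[CopyTask]) -> List[CopyTask]:
    """Stand-in for the Lookup step: keep only the rows that should run."""
    return [row for row in rows if row.enabled]


def run_foreach(tasks: Iterable[CopyTask],
                copy: Callable[[CopyTask], str]) -> List[str]:
    """Stand-in for the ForEach step: call the copy step once per task."""
    return [copy(task) for task in tasks]


def fake_copy(task: CopyTask) -> str:
    """Placeholder for the Copy activity; it only reports what it would do."""
    return f"copy {task.source_table} -> {task.sink_path}"


if __name__ == "__main__":
    for line in run_foreach(lookup_todo(CONTROL_TABLE), fake_copy):
        print(line)
```

Adding a new source then means adding a row to the control table rather than adding a new pipeline, which is the point of the metadata-driven approach.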

To get a more advanced solution, I found an article from Microsoft published in 2008. Though it is very old, its content still interests me.

Figure 3: the concept from the article “build a metadata-driven etl platform by extending microsoft sql server integration services”

If we look into this architecture, Azure already has similar technology for some of its parts:

  • Monitor – Azure Monitor / Power BI
  • Logging Repository – Azure Log Analytics
  • Builder – Data Factory API
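To make the Builder role a little more concrete, here is a minimal sketch of generating a pipeline definition from metadata rows. It loosely follows the JSON shape Data Factory uses for pipelines (`name`, `properties.activities`, `dependsOn`), but it is simplified: a real Copy activity also needs `typeProperties` with source and sink settings, which are left out. The function name `build_pipeline` and the task keys are assumptions for illustration, and nothing here calls Azure.

```python
import json
from typing import Dict, List


def build_pipeline(name: str, tasks: List[Dict[str, str]]) -> dict:
    """Build a simplified pipeline definition from metadata rows.

    Each task is a dict with hypothetical keys:
    'name', 'source_dataset', 'sink_dataset'.
    Activities are chained so each one waits for the previous to succeed.
    """
    activities = []
    previous = None
    for task in tasks:
        activity = {
            "name": task["name"],
            "type": "Copy",
            "dependsOn": [],
            "inputs": [{"referenceName": task["source_dataset"],
                        "type": "DatasetReference"}],
            "outputs": [{"referenceName": task["sink_dataset"],
                         "type": "DatasetReference"}],
        }
        if previous is not None:
            activity["dependsOn"].append(
                {"activity": previous, "dependencyConditions": ["Succeeded"]}
            )
        activities.append(activity)
        previous = task["name"]
    return {"name": name, "properties": {"activities": activities}}


if __name__ == "__main__":
    # Hypothetical metadata rows, e.g. read from the metadata repository.
    rows = [
        {"name": "CopyCustomer", "source_dataset": "SqlCustomer",
         "sink_dataset": "LakeCustomer"},
        {"name": "CopyOrders", "source_dataset": "SqlOrders",
         "sink_dataset": "LakeOrders"},
    ]
    print(json.dumps(build_pipeline("StageAll", rows), indent=2))
```

A definition like this could then be submitted through the Data Factory API or SDK; that step is not shown because it depends on credentials and resource names.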

So the only problem left is how to build the metadata designer and repository. I think the metadata repository is the key. I am still thinking about it and collecting information. Here are some initial ideas.

  • ETL process metadata. Create a table that records the order of the activities.
  • Schema metadata. As in the simple solution mentioned above, maintain the source and destination information.
  • Business rules. This is the core function for complex transformations. For Databricks, we can use a public library to achieve this.
  • ETL pattern library. An Azure Function or a Databricks library.
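As a rough sketch of the first and third ideas, the process metadata can be held as a dependency map from which an execution order is derived, and business rules can be stored as data and applied to rows. Everything named here (the activity names, the rule names, the column names) is hypothetical, and the rules are plain Python predicates rather than any particular public library.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, List, Set, Tuple

# ETL process metadata: activity -> activities it depends on (hypothetical names).
PROCESS_METADATA: Dict[str, Set[str]] = {
    "extract_orders": set(),
    "load_staging": {"extract_orders"},
    "validate_orders": {"load_staging"},
    "publish_orders": {"validate_orders"},
}

# Business rules kept as data: (rule name, column, predicate on the value).
Rule = Tuple[str, str, Callable[[Any], bool]]
BUSINESS_RULES: List[Rule] = [
    ("amount_non_negative", "amount", lambda v: v is not None and v >= 0),
    ("currency_known", "currency", lambda v: v in {"USD", "EUR", "CNY"}),
]


def execution_order(process: Dict[str, Set[str]]) -> List[str]:
    """Derive a run order in which every activity follows its dependencies."""
    return list(TopologicalSorter(process).static_order())


def failed_rules(row: Dict[str, Any], rules: List[Rule]) -> List[str]:
    """Return the names of the rules that the given row violates."""
    return [name for name, column, check in rules if not check(row.get(column))]


if __name__ == "__main__":
    print(execution_order(PROCESS_METADATA))
    print(failed_rules({"amount": -5, "currency": "USD"}, BUSINESS_RULES))
```

Keeping both the order and the rules as data means a change to the process is a change to the repository, not to the pipeline code.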

Once I have some new ideas that go deeper into these four topics, I will continue working on the architecture for metadata-driven ELT.

Reference:

Real-world data movement and orchestration patterns using Azure Data Factory V2

https://azure.microsoft.com/en-us/resources/videos/ignite-2018-real-world-data-movement-and-orchestration-patterns-using-azure-data-factory-v2/

Complex Azure Orchestration with Dynamic Data Factory Pipelines

https://sqlbits.com/Sessions/Event18/Complex_Azure_Orchestration_with_Dynamic_Data_Factory_Pipeli

Quickstart: Create an Azure Data Factory and pipeline using Python

https://docs.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-python

December 17, 2019 | neo_aksa | Categories: Big Data, Computer Science | Tags: ADF, databricks, ELT, metadata-driven
