Essentially, ADF (Azure Data Factory) only takes responsibility for the Extract and Load stages of the ELT pipeline. For the Transform stage, we have to use another tool, such as Databricks, writing notebooks that manipulate DataFrames or RDDs. However, since Microsoft launched "Data Flow" in ADF, it has become more and more similar to SSIS: most ETL work can now be done in ADF through a friendly GUI. For the ETL side, I copied a cheat sheet comparing the data flows between ADF and SSIS. It may be helpful.
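To make that division of labor concrete, here is a minimal sketch of the kind of Transform step a Databricks notebook might run on data that an ADF copy activity has already landed. The paths and column names (/mnt/landing/sales/, sale_ts, store_id, amount) are hypothetical.

```python
# A tiny example of the "T" done in a notebook after ADF has handled E and L:
# read what the ADF copy activity landed, reshape it, write it back.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adf-transform").getOrCreate()

raw = spark.read.parquet("/mnt/landing/sales/")   # landed by an ADF copy activity

# Aggregate raw sales events into a daily summary per store.
curated = (raw
           .withColumn("sale_date", F.to_date("sale_ts"))
           .groupBy("sale_date", "store_id")
           .agg(F.sum("amount").alias("daily_amount")))

curated.write.mode("overwrite").parquet("/mnt/curated/daily_sales/")
```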
Surprisingly, ADF doesn't provide SCD (Slowly Changing Dimension) handling out of the box; you have to build it manually. It isn't hard, though: just maintain start and end timestamps on each dimension row. Here is an article about it.
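As a rough illustration of the manual approach, here is a minimal SCD Type 2 sketch in PySpark. All table and column names (dw.dim_customer, staging.customer_updates, customer_id, start_date, end_date) are hypothetical, and it assumes the staging table holds only new or changed rows.

```python
# SCD Type 2 sketch: incoming rows close the current version of a customer
# (by setting end_date) and a new open-ended version is inserted.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-sketch").getOrCreate()

dim = spark.table("dw.dim_customer")            # existing dimension (assumed)
stg = spark.table("staging.customer_updates")   # new/changed rows only (assumed)

now = F.current_timestamp()
open_end = F.to_timestamp(F.lit("9999-12-31"))  # sentinel for "current" rows

# Current versions of customers that changed: close them by stamping end_date.
current = dim.where(F.col("end_date") == open_end)
expired = (current.join(stg.select("customer_id"), "customer_id", "left_semi")
                  .withColumn("end_date", now))

# New open-ended versions for every staged row.
new_rows = (stg.withColumn("start_date", now)
               .withColumn("end_date", open_end))

# Everything else in the dimension stays untouched.
untouched = dim.join(expired.select("customer_id", "start_date"),
                     ["customer_id", "start_date"], "left_anti")

result = untouched.unionByName(expired).unionByName(new_rows)
result.write.mode("overwrite").saveAsTable("dw.dim_customer_v2")
```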
Anyway, for me, I like this split: handle simple but repeatable activities with the GUI, and leave the complex operations to Databricks.
ETL is the most common tool in the process of building an EDW, and of course the first step in data integration. As big data emerges, we find more and more customers starting to use Hadoop and Spark. Personally, I agree with the idea that Spark will replace most ETL tools. Consider the broader shifts:
Business Intelligence -> big data
Data warehouse -> data lake
Applications -> Microservices
Traditional ETL tools bring well-known problems:
Data gets out of sync; every extra copy is a risk.
Performance issues and wasted server resources (capacity sized for peak load), although ETL tools can do limited parallel work.
Plain-text code hidden inside stages (typically VB or Java).
CSV files are not type safe (a schema-enforcement sketch follows this list).
An all-or-nothing approach in batch jobs.
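On the CSV point: one way Spark mitigates it is by enforcing an explicit schema at read time and failing fast on bad records instead of silently loading garbage. The file path and column names below are made up for the example.

```python
# Enforce a declared schema when reading CSV; FAILFAST raises on the first
# row that doesn't match, rather than coercing it to null as the default
# PERMISSIVE mode would.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DoubleType)

spark = SparkSession.builder.appName("csv-schema").getOrCreate()

schema = StructType([
    StructField("order_id", IntegerType(), nullable=False),
    StructField("customer", StringType(), nullable=True),
    StructField("amount",   DoubleType(), nullable=True),
])

orders = (spark.read
          .option("header", "true")
          .option("mode", "FAILFAST")
          .schema(schema)
          .csv("/data/raw/orders.csv"))
orders.printSchema()
```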
Spark for ETL
Parallel processing is built in.
Streaming can be used to parallelize ETL (see the sketch after this list).
With Hadoop as the data source, we don't need extra copies, which reduces risk.
Just one codebase (Scala or Python).
Machine learning is included.
Security, unit testing, performance measurement, exception handling, and monitoring all live in the same ecosystem.
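On the streaming point, here is a small Structured Streaming sketch of ETL as a continuous job: new CSV files landing in an input directory are parsed with an explicit schema and appended as Parquet. The paths and columns are hypothetical.

```python
# "ETL as a stream": each new file in the landing directory is picked up,
# lightly transformed, and continuously loaded into the lake.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("streaming-etl").getOrCreate()

schema = StructType([
    StructField("event_id", IntegerType()),
    StructField("payload",  StringType()),
])

events = (spark.readStream
          .schema(schema)                 # streaming file sources require a schema
          .csv("/data/landing/events/"))

# Light transform step, then continuous load into Parquet.
cleaned = events.where("event_id IS NOT NULL")

query = (cleaned.writeStream
         .format("parquet")
         .option("path", "/data/lake/events/")
         .option("checkpointLocation", "/data/chk/events/")
         .outputMode("append")
         .start())
query.awaitTermination()
```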