Before we talk about Delta Lake, we should take a moment to look at the data lake and understand why we need one. Below is the best definition I have found:
> A data lake is a repository for structured, unstructured, and semi-structured data. Data lakes are much different from data warehouses since they allow data to be in its rawest form without needing to be converted and analyzed first. (Devin Pickell)
A data lake stores everything raw, in every format; in a nutshell, it is a repository. In Azure, for example, Data Lake Storage Gen2 is a repository built on top of Blob storage.
What’s the difference between a data lake and a data warehouse?
Data warehouses and data lakes are used in different scenarios. A data warehouse serves classic data analysis, with data stored in structured, relational tables; a data lake serves big data and ELT scenarios, where every piece of raw data lands in the lake first and is processed afterwards. And because a data lake sits on scalable cloud cluster storage, it can churn through raw data very quickly.
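Side by side, the differences look like this:

| | Data warehouse | Data lake |
| --- | --- | --- |
| Data | Structured, relational tables | Raw data in any format |
| Schema | Schema on write | Schema on read |
| Loading | ETL | ELT |
| Typical use | Classic data analysis | Big data, exploratory workloads |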
What are the downsides of a data lake?
- No transactions.
- Schema on read.
- No version control.
These three downsides are critical when applying a data lake to a real project, because they hurt both data quality and efficiency. They are a big part of why so many big data projects fail: a data lake cannot provide data as stable as a database can. The sketch below shows how the first two bite in practice.
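A minimal PySpark sketch of the problem (run in a `pyspark` shell where `spark` exists; the path and column names are made up for illustration):

```python
# Two writers land files with different schemas in the same directory.
# Plain Parquet on a data lake accepts both: there is no enforcement,
# no transaction, and no version to roll back to.
spark.createDataFrame([(1, "click")], ["id", "event"]) \
    .write.mode("append").parquet("/tmp/lake/events")
spark.createDataFrame([(2, 3.14)], ["id", "score"]) \
    .write.mode("append").parquet("/tmp/lake/events")

# The conflict only surfaces at read time ("schema on read"): Spark
# infers the schema from a subset of the files, so one of the columns
# silently disappears depending on which file it samples.
spark.read.parquet("/tmp/lake/events").printSchema()
```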
Enter Delta Lake!
Delta Lake adds a set of abilities that turn a data lake into a stable foundation for analysis:
- Open-source storage layer based on Parquet files
- ACID transactions with serializable isolation
- Scalable metadata handling through Spark’s distributed processing
- Unified streaming and batch: a table works as a batch table and as a streaming source and sink
- Schema enforcement (a quick demonstration follows this list)
- Time travel: data versioning
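Schema enforcement is the direct fix for the Parquet pitfall shown earlier: a bad write fails fast instead of silently landing in the lake. A minimal sketch, assuming a pyspark session with Delta loaded (the path is illustrative):

```python
from pyspark.sql.utils import AnalysisException

# the first write defines the table schema: (id, event)
spark.createDataFrame([(1, "click")], ["id", "event"]) \
    .write.format("delta").save("/tmp/delta/enforced")

# an append with a different schema is rejected by Delta
try:
    spark.createDataFrame([(2, 3.14)], ["id", "score"]) \
        .write.format("delta").mode("append").save("/tmp/delta/enforced")
except AnalysisException as e:
    print("rejected:", e)
```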
Let’s look at some simple samples. If you are familiar with Spark, you can switch to Delta Lake very quickly.
```bash
# set up: upgrade PySpark, then start the shell with the Delta package
# (pick the delta-core version that matches your Spark/Scala build)
pip install --upgrade pyspark
pyspark --packages io.delta:delta-core_2.12:0.1.0
```

```python
# write a table
df = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                    header="true", inferSchema="true")
df.write.format("delta").save("/delta/diamonds")

# append
df.write.format("delta").mode("append").save("/delta/events")

# we can also partition the table
df.write.format("delta").partitionBy("date").save("/delta/events")

# update: overwrite only the rows that match the predicate
data.write.format("delta") \
    .mode("overwrite") \
    .option("replaceWhere", "date >= '2017-01-01' AND date <= '2017-01-31'") \
    .save("/tmp/delta-table")
# by default, overwrite does not change the schema;
# it only does so with option("overwriteSchema", "true")

# add new columns automatically
df.write.format("delta").mode("append").option("mergeSchema", "true").save("/delta/events")

# read
df = spark.read.format("delta").load("/tmp/delta-table")
```

```sql
-- or query the path directly with SQL
SELECT * FROM delta.`/tmp/delta-table`
```

```python
# time travel: read an older snapshot of the table
df1 = spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/delta/events")
# for timestamp_string, only date or timestamp strings are accepted
df2 = spark.read.format("delta").option("versionAsOf", version).load("/delta/events")

# structured streaming: a Delta table works as a source and as a sink
streamingDf = spark.readStream.format("delta").load("/delta/events")
stream = streamingDf.selectExpr("value as id") \
    .writeStream.format("delta") \
    .option("checkpointLocation", "/tmp/checkpoint") \
    .start("/tmp/delta-table")

# cap the micro-batch size
spark.readStream.format("delta").option("maxFilesPerTrigger", 1000).load("/delta/events")
# ignore updates and deletes with "ignoreDeletes" and "ignoreChanges"

# output modes: "append" adds new rows; "complete" rewrites the whole result
# (use "complete" for aggregation queries); a checkpoint location tracks progress
streamingDf.writeStream.format("delta").outputMode("append") \
    .option("checkpointLocation", "/delta/eventsByCustomer/_checkpoints/streaming-agg") \
    .start("/delta/eventsByCustomer")

stream2 = spark.readStream.format("delta").load("/tmp/delta-table") \
    .writeStream.format("console").start()
stream.stop()
```
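The snippets above go through the DataFrame reader/writer; Delta Lake also ships a table utility API for row-level changes and for inspecting history. A minimal sketch, assuming a newer delta-core release (0.4.0+) and an existing table at /tmp/delta-table:

```python
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/tmp/delta-table")

# transactional row-level operations that a plain data lake cannot offer
dt.delete("id < 10")                                  # conditional delete
dt.update(condition="id = 42", set={"id": "id + 1"})  # conditional update

# every commit is recorded in the transaction log,
# which is exactly what makes time travel possible
dt.history().select("version", "timestamp", "operation").show()
```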
In the next article, I will talk about when to use Delta Lake and how to optimize it in Databricks.
Here is one of my examples, which converts three small files into one Delta table: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2499619125013057/2952616815059893/2608746070181068/latest.html
- Delta Lake documentation: https://delta.io
- Devin Pickell, “What Is a Data Lake and Why Is It Essential for Big Data?”: https://learn.g2.com/what-is-a-data-lake
- Delta Lake documentation for Databricks: https://docs.databricks.com/delta