Hadoop and Spark by Leela Prasad: February 2025

Monday, February 17, 2025

Delta Lake

Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata handling, and unified batch + streaming processing to big data workloads on top of Apache Spark and cloud object storage (like Azure Data Lake Storage, AWS S3, or GCS).

✅ Key Features of Delta Lake

Feature	Description
ACID Transactions	Guarantees data consistency through commits and rollbacks.
Schema Evolution	Accommodates schema changes during writes when enabled (for example, via the mergeSchema option).
Time Travel	Query older versions of data using versioning.
Unified Batch + Streaming	Use the same Delta table for both batch and real-time data processing.
Scalable Metadata	Supports billions of files and petabytes of data efficiently.
Data Reliability	Enforces schema, prevents partial or dirty writes.
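
Two of the features above map to single reader/writer options. The sketch below assumes an existing SparkSession with Delta Lake configured; the helper names and the paths passed to them are hypothetical, while versionAsOf and mergeSchema are the Delta options for time travel and for schema evolution on write.

```python
def read_delta_version(spark, path, version):
    """Time travel: read the Delta table at `path` as of a commit version."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(path)
    )


def append_with_schema_evolution(df, path):
    """Append `df`, letting Delta add any new columns to the table schema."""
    (
        df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save(path)
    )
```

Without mergeSchema, an append whose columns do not match the table schema is rejected, which is the schema enforcement listed under Data Reliability.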

🏗️ Delta Architecture: Bronze, Silver, Gold

Bronze Layer: Raw ingestion (streaming or batch from sources like Kafka, Event Hubs)
Silver Layer: Cleaned and transformed data
Gold Layer: Aggregated and business-ready data (used in dashboards)

Bronze Layer (Raw / Ingest)

Goal: Immutable as-received capture of source data with minimal transformation. Establish lineage and replayability.

Characteristics

Schema: Often semi-structured (JSON, CSV, Avro) stored as ingested; may include _ingest_ts, _source_file, _source_system.
Data Quality: Not validated beyond basic ingestion success.
Storage Pattern: Partition by ingest date/time (e.g., ingest_date=YYYY-MM-DD) to simplify retention & replay.
Use Cases: Reprocessing, audit, debugging, forensic analysis, schema drift detection.
Write Pattern: Append-only writes from streaming sources (readStream from Event Hubs/Kafka; Auto Loader for files).
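
As a plain-Python illustration of the lineage columns and the ingest-date partition layout described above, here is a minimal sketch. The function names, the sample file name, and the source system value are hypothetical; in a Spark job the same columns would be added to the DataFrame before the append.

```python
from datetime import datetime, timezone


def add_ingest_metadata(record, source_file, source_system, now=None):
    """Return a copy of `record` with Bronze lineage columns attached."""
    ts = now or datetime.now(timezone.utc)
    return {
        **record,
        "_ingest_ts": ts.isoformat(),
        "_source_file": source_file,
        "_source_system": source_system,
    }


def bronze_partition_path(base_path, ingest_ts):
    """Build the ingest_date=YYYY-MM-DD partition directory for a timestamp."""
    return f"{base_path.rstrip('/')}/ingest_date={ingest_ts:%Y-%m-%d}"


ts = datetime(2025, 2, 17, 9, 30, tzinfo=timezone.utc)
row = add_ingest_metadata({"body": '{"eventId": "e1"}'}, "events_001.json", "kafka", now=ts)
print(row["_ingest_ts"])  # 2025-02-17T09:30:00+00:00
print(bronze_partition_path("/mnt/datalake/bronze/events", ts))
# /mnt/datalake/bronze/events/ingest_date=2025-02-17
```

The payload itself is left untouched, which keeps the Bronze record replayable exactly as received.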

Silver Layer (Clean / Conform / Enrich)

Goal: Turn raw data into analytics-grade canonical entities that are trustworthy and joinable across domains.

Transforms Typically Applied

Step	Examples
Data Cleansing	Drop corrupt rows; parse JSON; enforce data types.
Deduplication	Use event IDs, hashes, or window-based dedupe.
Normalization	Explode arrays; flatten nested structures.
Conformance	Standardize units, currencies, time zones (UTC), enums.
Joins / Enrichment	Lookup dimension tables (users, products, geo).
Watermark + Late Data Handling	Structured Streaming with `withWatermark` to discard/mark very late events.
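
The deduplication and watermark rows can be combined in one step. This is a minimal sketch, assuming a streaming DataFrame with a timestamp column; the column name event_time, the 10-minute delay, and the helper name are assumptions, not values from the pipeline above.

```python
def dedupe_with_watermark(events, id_col="eventId", time_col="event_time", delay="10 minutes"):
    """Drop duplicate events while bounding the state Spark has to keep.

    `events` is assumed to be a streaming DataFrame whose `time_col`
    is a timestamp. Events arriving later than `delay` behind the
    latest observed event time are dropped.
    """
    return (
        events
        .withWatermark(time_col, delay)
        # Including the event-time column in the keys lets Spark expire
        # deduplication state once the watermark has passed it.
        .dropDuplicates([id_col, time_col])
    )
```

The trade-off is that a duplicate arriving after the watermark has moved past its event time is treated as late data rather than matched against the original.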

Table Modeling

Often entity-level (e.g., silver.sales_orders, silver.user_profile).
Partition on business date (event_date, transaction_date) when high volume.
Use MERGE INTO for CDC updates (upsert SCD Type 1/2 patterns).
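
An SCD Type 1 upsert with MERGE INTO could look like the statement generated below. This is a sketch only: the source view user_profile_changes and the columns user_id, email, and country are hypothetical, and the resulting string would be passed to spark.sql.

```python
def build_merge_sql(target, source, key_cols, update_cols):
    """Build a Type 1 (overwrite in place) MERGE statement."""
    all_cols = list(key_cols) + list(update_cols)
    on = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    sets = ", ".join(f"t.{c} = s.{c}" for c in update_cols)
    cols = ", ".join(all_cols)
    vals = ", ".join(f"s.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on}\n"
        f"WHEN MATCHED THEN UPDATE SET {sets}\n"
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals})"
    )


sql = build_merge_sql(
    "silver.user_profile",
    "user_profile_changes",
    key_cols=["user_id"],
    update_cols=["email", "country"],
)
print(sql)
```

A Type 2 pattern would instead close out the matched row (for example by setting an end date) and insert a new current row, which needs additional columns not shown here.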

Gold Layer (Curated / Business & Analytics)

Goal: High-trust, consumption-optimized data products: aggregates, KPIs, dimensional models, ML feature tables.

Patterns

Consumption Style	Modeling Approach	Notes
BI Reporting	Star/Snowflake (Fact + Dim tables)	Fast ad hoc BI (Power BI / Synapse).
Metrics/KPIs	Pre-aggregated summary tables	Daily/Hourly rollups, incremental refresh.
ML Features	Feature store–style Delta tables	Point-in-time correctness; training vs inference views.
Data Sharing	Cleaned, governed shareable tables	Unity Catalog + Delta Sharing.

Example Incremental ETL Flow (PySpark)

The Bronze Delta table is physically stored in an ADLS Gen2 container mounted under /mnt/datalake/.

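# Bronze -> Silver incremental clean
from pyspark.sql.functions import from_json

# event_schema (a StructType or DDL string describing the JSON payload)
# is assumed to be defined elsewhere.
bronze_path = "/mnt/datalake/bronze/events"
silver_table = "refined.silver.events_clean"

# Read the Bronze Delta table as a stream so each run picks up only new commits.
bronze_stream = (
    spark.readStream
    .format("delta")
    .load(bronze_path)
)

# Parse the JSON payload, flatten it, and drop duplicate events.
cleaned = (
    bronze_stream
    .filter("body IS NOT NULL")
    .selectExpr("cast(body as string) as json_str", "ingest_ts")
    .select(from_json("json_str", event_schema).alias("e"), "ingest_ts")
    .select("e.*", "ingest_ts")
    # Without a watermark, streaming dropDuplicates keeps every eventId in state indefinitely.
    .dropDuplicates(["eventId"])
)

# The checkpoint records progress so the stream resumes where it left off.
(
    cleaned.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/datalake/checkpoints/silver/events_clean")
    .toTable(silver_table)
)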
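
The stream above references event_schema without defining it. One possible definition is sketched below as a DDL string, which from_json accepts in place of a StructType. The field names are taken from the columns used in this post (eventId, userId, channel, event_date); the types are assumptions.

```python
# Hypothetical payload schema in DDL form. Field names come from the
# columns used in the pipeline; the types are assumed.
event_schema = ", ".join([
    "eventId STRING",
    "userId STRING",
    "channel STRING",
    "event_date DATE",
])

print(event_schema)
# eventId STRING, userId STRING, channel STRING, event_date DATE
```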

Silver → Gold Aggregation (Triggered Batch Refresh)

from pyspark.sql.functions import count, countDistinct

gold_table = "curated.gold.daily_event_metrics"

silver_df = spark.table("refined.silver.events_clean")

daily = (
  silver_df
    .groupBy("event_date", "channel")
    .agg(count("*").alias("event_count"),
         countDistinct("userId").alias("unique_users"))
)

(daily.write
      .format("delta")
      .mode("overwrite")
      .option("overwriteSchema", "true")
      .saveAsTable(gold_table))
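
The refresh above rewrites the whole Gold table on every run. For the incremental refresh mentioned under Metrics/KPIs, one option is Delta's replaceWhere, which overwrites only the rows matching a predicate. This is a minimal sketch, assuming the Gold table already exists and that a single event_date is being recomputed; the helper name is hypothetical.

```python
def refresh_gold_partition(daily_df, gold_table, event_date):
    """Overwrite only the rows for one event_date instead of the whole table."""
    predicate = f"event_date = '{event_date}'"
    (
        daily_df
        # Delta checks that the written rows satisfy the replaceWhere
        # predicate, so filter the DataFrame with the same condition.
        .where(predicate)
        .write
        .format("delta")
        .mode("overwrite")
        .option("replaceWhere", predicate)
        .saveAsTable(gold_table)
    )
```

A call such as refresh_gold_partition(daily, "curated.gold.daily_event_metrics", "2025-02-17") would leave every other date in the table untouched.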

Apache Iceberg

Apache Iceberg is an open table format for lakehouse architectures; its key component is its metadata layer.

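A lakehouse built on Iceberg has three layers:

1. Storage

2. Processing

3. Metadata

Metadata of Metadata: Iceberg's metadata is itself layered. A table metadata file points to a manifest list for each snapshot, the manifest list points to manifest files, and the manifest files track the data files.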
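
The "metadata of metadata" idea can be modelled in a few lines of plain Python. This is a deliberately simplified sketch, not Iceberg's actual file layout or API: the class names and S3 paths are hypothetical, and the manifest list is folded into the snapshot object.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Manifest:
    """Manifest file: lists the data files it tracks."""
    data_files: list


@dataclass
class Snapshot:
    """One table version; stands in for a manifest list."""
    snapshot_id: int
    manifests: list


@dataclass
class TableMetadata:
    """Table metadata file: snapshot log plus a pointer to the current one."""
    snapshots: dict = field(default_factory=dict)
    current_snapshot_id: Optional[int] = None

    def commit(self, snapshot):
        """Register a snapshot and make it current (a pointer swap)."""
        self.snapshots[snapshot.snapshot_id] = snapshot
        self.current_snapshot_id = snapshot.snapshot_id

    def plan_files(self, snapshot_id=None):
        """Resolve snapshot -> manifests -> data files, as a reader would."""
        sid = self.current_snapshot_id if snapshot_id is None else snapshot_id
        return [f for m in self.snapshots[sid].manifests for f in m.data_files]


m1 = Manifest(["s3://bucket/tbl/data/a.parquet"])
m2 = Manifest(["s3://bucket/tbl/data/b.parquet"])
table = TableMetadata()
table.commit(Snapshot(1, [m1]))
table.commit(Snapshot(2, [m1, m2]))  # an append reuses the existing manifest

print(table.plan_files())               # current snapshot: a.parquet and b.parquet
print(table.plan_files(snapshot_id=1))  # older snapshot: a.parquet only
```

Because old snapshots keep pointing at their own manifests, reading an earlier version needs no copy of the data, only a different starting pointer.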

Key Points:

Storage
- Data files (Parquet, ORC, Avro) are stored in Amazon S3.
- Iceberg maintains metadata and snapshots to track table versions.
ACID Transactions
- Supports atomic inserts, updates, and deletes directly on S3 data.
- Prevents read/write conflicts in concurrent jobs.
Schema Evolution & Partitioning
- Allows adding, dropping, or renaming columns without rewriting entire tables.
- Supports hidden partitioning for efficient queries.
Query Engine Compatibility
- Works with Spark, Flink, Trino, Presto, Athena, and Glue.
- Enables time travel to query historical snapshots of data.
Lakehouse Advantage
- Combines data lake storage (S3) with data warehouse-like capabilities.
- Efficient for batch analytics, streaming, and ML pipelines.
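
Hidden partitioning and schema evolution from the key points above can be sketched with Spark SQL. The example assumes a SparkSession with an Iceberg catalog named glue_catalog; the table glue_catalog.db.events, its columns, and the helper name are hypothetical.

```python
def create_and_evolve(spark, table="glue_catalog.db.events"):
    """Create a hidden-partitioned Iceberg table, then evolve its schema."""
    statements = [
        # days(event_ts) is a partition transform: queries filter on event_ts
        # directly and Iceberg prunes partitions without a separate date column.
        f"""CREATE TABLE IF NOT EXISTS {table} (
                id BIGINT,
                name STRING,
                event_ts TIMESTAMP
            ) USING iceberg
            PARTITIONED BY (days(event_ts))""",
        # These are metadata-only changes; existing data files are not rewritten.
        f"ALTER TABLE {table} ADD COLUMN country STRING",
        f"ALTER TABLE {table} RENAME COLUMN name TO full_name",
        f"ALTER TABLE {table} DROP COLUMN country",
    ]
    for stmt in statements:
        spark.sql(stmt)
    return statements
```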

🔹 Example Workflow on AWS:

Store raw/processed Parquet files in S3.
Create an Iceberg table referencing the S3 location.
Query and update data using Spark SQL or Athena with ACID guarantees.
Enable time travel for auditing or rollback.

Create Iceberg Table in Athena:

CREATE TABLE iceberg_users (
  id INT,
  name STRING,
  event_date DATE
)
PARTITIONED BY (day(event_date))
LOCATION 's3://your-s3-bucket/iceberg-warehouse/iceberg_users/'
TBLPROPERTIES (
  'table_type' = 'ICEBERG',
  'format' = 'parquet'
);
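
Once the table exists, Athena can query earlier versions of it with FOR VERSION AS OF or FOR TIMESTAMP AS OF. The helper below only builds the query text; its name and the sample timestamp are hypothetical, and submitting the query (through the console or an SDK) is left out.

```python
def athena_time_travel_sql(table, snapshot_id=None, as_of=None):
    """Build an Athena time travel query for an Iceberg table.

    Pass exactly one of `snapshot_id` (an Iceberg snapshot ID) or
    `as_of` (a timestamp string such as '2025-02-17 00:00:00 UTC').
    """
    if (snapshot_id is None) == (as_of is None):
        raise ValueError("pass exactly one of snapshot_id or as_of")
    if snapshot_id is not None:
        return f"SELECT * FROM {table} FOR VERSION AS OF {int(snapshot_id)}"
    return f"SELECT * FROM {table} FOR TIMESTAMP AS OF TIMESTAMP '{as_of}'"


print(athena_time_travel_sql("iceberg_users", as_of="2025-02-17 00:00:00 UTC"))
# SELECT * FROM iceberg_users FOR TIMESTAMP AS OF TIMESTAMP '2025-02-17 00:00:00 UTC'
```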

Reference: https://www.youtube.com/watch?v=iGvj1gjbwl0

🔹 Operations using PySpark:

Writing Data to an Iceberg Table

Data files are stored as Parquet in S3.

Metadata and snapshot files are also written to S3; the Glue catalog stores a pointer to the table's current metadata file.

from pyspark.sql import SparkSession

# Initialize SparkSession with Iceberg support.
# The Iceberg Spark runtime and AWS JARs must be on the classpath.
spark = (
    SparkSession.builder
    .appName("IcebergReadWrite")
    # Enables Iceberg SQL extensions (row-level DELETE/UPDATE/MERGE, CALL procedures)
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-iceberg-bucket/warehouse/")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

# Sample DataFrame
data = [
    (1, "Alice", 2025),
    (2, "Bob", 2024),
    (3, "Charlie", 2025),
]
columns = ["id", "name", "year"]
df = spark.createDataFrame(data, columns)

# Write to Iceberg table in Glue catalog (S3 backend)
df.writeTo("glue_catalog.db.iceberg_users").createOrReplace()

Reading from an Iceberg Table

# Read Iceberg table as a DataFrame
df_read = spark.read.table("glue_catalog.db.iceberg_users")

df_read.show()

Performing Updates/Deletes (ACID)

# Example: Delete rows where year = 2024
spark.sql("""
    DELETE FROM glue_catalog.db.iceberg_users WHERE year = 2024
""")

# Example: Insert new rows
new_data = [(4, "David", 2025)]
spark.createDataFrame(new_data, columns) \
     .writeTo("glue_catalog.db.iceberg_users") \
     .append()
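
The delete and append above can be complemented by an upsert, which updates matching rows and inserts the rest in one atomic commit. This is a minimal sketch, assuming the session was created with Iceberg's SQL extensions enabled and that updates_df has the same id, name, and year columns; the helper name and the temp view name user_updates are hypothetical.

```python
def upsert_users(spark, updates_df, target="glue_catalog.db.iceberg_users"):
    """Upsert rows from `updates_df` into `target`, matching on id."""
    # Expose the incoming rows to SQL under a temporary (hypothetical) view name.
    updates_df.createOrReplaceTempView("user_updates")
    spark.sql(f"""
        MERGE INTO {target} AS t
        USING user_updates AS s
        ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET t.name = s.name, t.year = s.year
        WHEN NOT MATCHED THEN INSERT *
    """)
```

For a single-row change, a plain UPDATE statement with a WHERE clause on id would do the same job without a source view.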


Time Travel/Snapshot
# List all snapshots
spark.sql("SELECT * FROM glue_catalog.db.iceberg_users.snapshots").show()

# Query a previous snapshot by its snapshot_id
# (replace the placeholder with a value from the snapshots table above)
df_time_travel = (
    spark.read
    .format("iceberg")
    .option("snapshot-id", "<snapshot_id>")
    .load("glue_catalog.db.iceberg_users")
)

df_time_travel.show()
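
Time travel can also be done by timestamp, and a table can be rolled back to an earlier snapshot. The sketch below assumes the session has Iceberg's SQL extensions enabled (needed for CALL); the helper names are hypothetical, and the snapshot ID and timestamp would come from the snapshots table.

```python
def read_as_of(spark, epoch_millis, table="glue_catalog.db.iceberg_users"):
    """Read the table as it was at a point in time (milliseconds since epoch)."""
    return (
        spark.read
        .format("iceberg")
        .option("as-of-timestamp", str(epoch_millis))
        .load(table)
    )


def rollback_to_snapshot(spark, snapshot_id, catalog="glue_catalog", table="db.iceberg_users"):
    """Make an earlier snapshot the table's current state."""
    spark.sql(
        f"CALL {catalog}.system.rollback_to_snapshot('{table}', {int(snapshot_id)})"
    )
```

Reading as of a timestamp leaves the table unchanged, whereas a rollback moves the current-snapshot pointer for every reader.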