Storage
- Data files (Parquet, ORC, Avro) are stored in Amazon S3.
- Iceberg maintains metadata and snapshots to track table versions.
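Iceberg exposes this bookkeeping through queryable metadata tables. A minimal sketch, assuming a SparkSession already configured with an Iceberg catalog (as in the workflow example further down) and a hypothetical table named glue_catalog.analytics.events:

```python
# Each committed write produces a new snapshot, listed in the .snapshots metadata table.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM glue_catalog.analytics.events.snapshots
""").show(truncate=False)

# The .files metadata table lists the Parquet/ORC/Avro data files in S3
# that back the current snapshot.
spark.sql("""
    SELECT file_path, file_format, record_count
    FROM glue_catalog.analytics.events.files
""").show(truncate=False)
```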
ACID Transactions
- Supports atomic inserts, updates, and deletes directly on S3 data.
- Prevents read/write conflicts in concurrent jobs.
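For illustration, a minimal sketch of row-level operations through Spark SQL, assuming the Iceberg SQL extensions are enabled, the same hypothetical glue_catalog.analytics.events table, and an assumed temp view named updates holding staged changes:

```python
# Row-level DELETE and UPDATE commit atomically as new snapshots; readers keep
# seeing the previous snapshot until the commit succeeds.
spark.sql("DELETE FROM glue_catalog.analytics.events WHERE event_type = 'test'")

spark.sql("""
    UPDATE glue_catalog.analytics.events
    SET event_type = 'page_view'
    WHERE event_type = 'view'
""")

# Upserts via MERGE INTO; 'updates' is an assumed temp view of incoming rows.
# Concurrent writers use optimistic concurrency, so conflicting commits retry or fail cleanly.
spark.sql("""
    MERGE INTO glue_catalog.analytics.events AS t
    USING updates AS s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```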
Schema Evolution & Partitioning
- Allows adding, dropping, or renaming columns without rewriting entire tables.
- Supports hidden partitioning for efficient queries.
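A minimal sketch of schema and partition evolution via Spark SQL DDL, again assuming the hypothetical glue_catalog.analytics.events table:

```python
# Column changes are metadata-only operations; existing data files are not rewritten.
spark.sql("ALTER TABLE glue_catalog.analytics.events ADD COLUMN country STRING")
spark.sql("ALTER TABLE glue_catalog.analytics.events RENAME COLUMN country TO geo_country")
spark.sql("ALTER TABLE glue_catalog.analytics.events DROP COLUMN geo_country")

# Hidden partitioning: partition on a transform of an existing column. Queries that
# filter on event_ts are pruned automatically, without a separate partition column.
spark.sql("ALTER TABLE glue_catalog.analytics.events ADD PARTITION FIELD days(event_ts)")
```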
Query Engine Compatibility
- Works with Spark, Flink, Trino, Presto, Athena, and Glue.
- Enables time travel to query historical snapshots of data.
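Time travel is available directly from Spark SQL (and similarly from Athena or Trino). A minimal sketch with a placeholder timestamp and the same hypothetical table:

```python
# Query the table as it existed at a past point in time (placeholder timestamp).
spark.sql("""
    SELECT count(*) FROM glue_catalog.analytics.events
    TIMESTAMP AS OF '2024-01-01 00:00:00'
""").show()

# Or pin an exact snapshot id taken from the snapshots metadata table.
oldest = spark.sql("""
    SELECT snapshot_id FROM glue_catalog.analytics.events.snapshots
    ORDER BY committed_at
    LIMIT 1
""").first()["snapshot_id"]

spark.sql(
    f"SELECT count(*) FROM glue_catalog.analytics.events VERSION AS OF {oldest}"
).show()
```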
Lakehouse Advantage
- Combines data lake storage (S3) with data warehouse-like capabilities.
- Efficient for batch analytics, streaming, and ML pipelines.
🔹 Example Workflow on AWS:
- Store raw/processed Parquet files in S3.
- Create an Iceberg table referencing the S3 location.
- Query and update data using Spark SQL or Athena with ACID guarantees.
- Enable time travel for auditing or rollback.
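A minimal end-to-end sketch of this workflow from PySpark, assuming hypothetical names (catalog glue_catalog, bucket my-bucket, database analytics) and the Iceberg Spark runtime plus AWS bundle on the classpath:

```python
from pyspark.sql import SparkSession

# Configure a Glue-backed Iceberg catalog whose warehouse lives in S3.
spark = (
    SparkSession.builder
    .appName("iceberg-on-aws")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/warehouse/")
    .getOrCreate()
)

# Create an Iceberg table; its data and metadata files land under the S3 warehouse
# path, and the table is registered in the Glue Data Catalog so Athena can query it.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.analytics.events (
        event_id   BIGINT,
        event_type STRING,
        event_ts   TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Atomic insert; the commit becomes a new snapshot that time travel can later reference.
spark.sql("""
    INSERT INTO glue_catalog.analytics.events
    VALUES (1, 'click', current_timestamp())
""")
```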
🔹 Operations using PySpark:
Writing Data to an Iceberg Table
- Data files are stored as Parquet in S3.
- Metadata and snapshots are tracked by Iceberg in the AWS Glue Data Catalog.
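A minimal sketch of an append using the DataFrameWriterV2 API, assuming the SparkSession and hypothetical glue_catalog.analytics.events table from the workflow sketch above:

```python
from datetime import datetime

# Build a small DataFrame matching the (hypothetical) table schema.
df = spark.createDataFrame(
    [(100, "click", datetime(2024, 1, 1, 12, 0, 0))],
    schema="event_id BIGINT, event_type STRING, event_ts TIMESTAMP",
)

# Append writes new Parquet data files to S3 and commits a snapshot whose
# metadata is tracked through the Glue-backed Iceberg catalog.
df.writeTo("glue_catalog.analytics.events").append()

# If the table does not exist yet, it can be created from the DataFrame instead:
# df.writeTo("glue_catalog.analytics.events").using("iceberg").createOrReplace()
```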