Saturday, August 2, 2025

Kubernetes

 

Kubernetes is a container orchestration tool.

Containers run on worker nodes, and each worker node can host multiple containers.


Features:

  • High Availability
  • Scalability
  • Disaster Recovery

Basic Architecture:

Its basic architecture is built from Pods and containers.

There is one Pod per application, and each Pod contains one or more containers.

Typically, one Pod contains one container. A Pod holds more than one container when the application needs closely coupled resources running next to it, for example a database in one container and a messaging system in another.
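As a rough illustration of the multi-container case, the sketch below builds a Pod manifest as a plain Python dictionary and prints it as JSON (kubectl accepts JSON manifests as well as YAML). The helper name build_pod_manifest, the Pod name, and the image names are hypothetical placeholders, not values from a real deployment.

```python
import json


def build_pod_manifest(pod_name, containers):
    """Return a minimal Pod manifest; containers is a list of (name, image) pairs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "containers": [
                {"name": name, "image": image} for name, image in containers
            ]
        },
    }


# Hypothetical example: an application container plus a messaging container
manifest = build_pod_manifest(
    "orders-app",
    [
        ("app", "example.com/orders-app:1.0"),
        ("messaging", "example.com/messaging:1.0"),
    ],
)
print(json.dumps(manifest, indent=2))
```

Both containers in this Pod are scheduled together on the same worker node.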


Why Spark on Kubernetes?

1. Multiple kinds of applications can run side by side, each packaging its own library dependencies in its container.
Eg: The following kinds of applications can run simultaneously in one cluster, each with its own library dependencies.
  •      Spark applications can run in their own containers.
  •      ML applications can run in separate containers. ML libraries need not be installed on the cluster.
2. A traditional Spark cluster is a shared resource, so an upgrade from Spark 2.0 to Spark 3.0 forces all applications to migrate. With Kubernetes, only the applications that need the upgrade change their dependencies, and they run in their own independent containers.



AWS EKS

[Diagram not included: Amazon EMR deployment options, showing the two deployment models for Amazon EMR.]

Job Submission Options:
  1. AWS Command line Interface
  2. AWS Tools and AWS SDK
  3. Apache Airflow


Sample command to launch a PySpark job on Amazon EMR on EKS:

aws emr-containers start-job-run \
    --virtual-cluster-id <EMR-EKS-virtual-cluster-id> \
    --name <your-job-name> \
    --execution-role-arn <your-emr-on-eks-execution-role-arn> \
    --release-label emr-6.x.x \
    --job-driver '{
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://<your-s3-bucket>/scripts/your_pyspark_script.py",
            "entryPointArguments": ["arg1", "arg2"],
            "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=1G"
        }
    }' \
    --configuration-overrides '{
        "applicationConfiguration": [
            {
                "classification": "spark-defaults",
                "properties": {
                    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
                    "spark.kubernetes.container.image": "<your-custom-emr-on-eks-image-uri>"
                }
            }
        ],
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://<your-s3-bucket>/logs/"
            },
            "cloudWatchMonitoringConfiguration": {
                "logGroupName": "<your-cloudwatch-log-group-name>",
                "logStreamPrefix": "<your-log-stream-prefix>"
            }
        }
    }'

--virtual-cluster-id: The ID of the EMR virtual cluster registered with your EKS cluster.
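The same job can also be submitted through the AWS SDK, the second option in the list above. The sketch below assumes the boto3 library and mirrors the main CLI arguments. The helper names build_job_request and submit_job are made up for this post, and every angle-bracket value is a placeholder to replace with your own.

```python
def build_job_request(virtual_cluster_id, job_name, role_arn, entry_point, log_uri):
    """Build the keyword arguments for the emr-containers StartJobRun call."""
    return {
        "virtualClusterId": virtual_cluster_id,
        "name": job_name,
        "executionRoleArn": role_arn,
        "releaseLabel": "emr-6.x.x",  # placeholder, as in the CLI example
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": entry_point,
                "entryPointArguments": ["arg1", "arg2"],
                "sparkSubmitParameters": (
                    "--conf spark.executor.instances=2 "
                    "--conf spark.executor.memory=2G "
                    "--conf spark.driver.memory=1G"
                ),
            }
        },
        "configurationOverrides": {
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {"logUri": log_uri}
            }
        },
    }


def submit_job(request):
    """Submit the job run; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the request can be built without boto3

    client = boto3.client("emr-containers")
    response = client.start_job_run(**request)
    return response["id"]  # ID of the new job run


request = build_job_request(
    "<EMR-EKS-virtual-cluster-id>",
    "<your-job-name>",
    "<your-emr-on-eks-execution-role-arn>",
    "s3://<your-s3-bucket>/scripts/your_pyspark_script.py",
    "s3://<your-s3-bucket>/logs/",
)
print(request["jobDriver"]["sparkSubmitJobDriver"]["entryPoint"])
```

Calling submit_job(request) would start the run. It is left out here because it needs a real virtual cluster and IAM role.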

Reference: https://www.youtube.com/watch?v=avXbYBPzpIE&t=649s

Thursday, July 31, 2025

AWS Glue vs Running Spark jobs on EMR using spark-submit

 

A comparison of AWS Glue and running Spark jobs on EMR with spark-submit, from a data engineering perspective:


1️⃣ AWS Glue

  • Type: Serverless ETL service (managed Spark)

  • When to Use: Lightweight to medium ETL/ELT workloads, event-driven or scheduled jobs.

  • Pros:

    • Fully managed, no cluster management.

    • Auto-scaling and pay-per-use.

    • Built-in crawler, schema inference, and Data Catalog integration.

    • Native connectors to S3, RDS, Redshift, DynamoDB.

  • Cons:

    • Limited cluster customization.

    • Startup latency (1–2 min warmup).

    • Less control over the Spark version and tuning.

Example:
ETL pipelines that transform S3 data to Redshift or Iceberg tables on S3.


2️⃣ Amazon EMR with spark-submit

  • Type: Managed Hadoop/Spark cluster service (full control).

  • When to Use: Heavy processing, streaming jobs, or when you need fine-grained control.

  • Pros:

    • Full control over cluster configuration (memory, cores, Spark version).

    • Supports complex, long-running, streaming, and ML pipelines.

    • Can integrate with S3 (EMRFS), HDFS, Iceberg, Delta Lake.

  • Cons:

    • You manage cluster lifecycle (start/stop or auto-terminate).

    • Higher ops overhead and cost if not managed well.

Example:
Petabyte-scale ETL, Spark Streaming with Kafka, or ML pipelines needing custom Spark configs.






💡 Rule of Thumb:

  • Glue → Simpler, serverless ETL on S3/Redshift/Iceberg.

  • EMR → Complex, large-scale, or streaming workloads needing control and custom tuning.

Monday, February 17, 2025

Delta Lake

Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata handling, and unified batch + streaming processing to big data workloads on top of Apache Spark and cloud object storage (like Azure Data Lake Storage, AWS S3, or GCS).


Key Features of Delta Lake

Feature | Description
ACID Transactions | Guarantees data consistency through commits and rollbacks.
Schema Evolution | Automatically handles changes in data schema during writes.
Time Travel | Query older versions of data using versioning.
Unified Batch + Streaming | Use the same Delta table for both batch and real-time data processing.
Scalable Metadata | Supports billions of files and petabytes of data efficiently.
Data Reliability | Enforces schema, prevents partial or dirty writes.


🏗️ Delta Architecture: Bronze, Silver, Gold

  1. Bronze Layer: Raw ingestion (streaming or batch from sources like Kafka, Event Hubs)

  2. Silver Layer: Cleaned and transformed data

  3. Gold Layer: Aggregated and business-ready data (used in dashboards)


Bronze Layer (Raw / Ingest)

Goal: Immutable as-received capture of source data with minimal transformation. Establish lineage and replayability.

Characteristics

  • Schema: Often semi-structured (JSON, CSV, Avro) stored as ingested; may include _ingest_ts, _source_file, _source_system.

  • Data Quality: Not validated beyond basic ingestion success.

  • Storage Pattern: Partition by ingest date/time (e.g., ingest_date=YYYY-MM-DD) to simplify retention & replay.

  • Use Cases: Reprocessing, audit, debugging, forensic analysis, schema drift detection.

  • Append-only writes from streaming (readStream from Event Hubs/Kafka; Auto Loader for files).
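To make the lineage columns and the partition layout concrete, here is a small pure-Python sketch (no Spark involved). It tags a raw record with the metadata columns named above and derives its ingest_date partition directory. The function names and the sample record are assumptions for illustration only.

```python
from datetime import datetime, timezone


def tag_raw_record(record, source_file, source_system, ingest_ts=None):
    """Return a copy of the raw record with Bronze lineage columns added."""
    ingest_ts = ingest_ts or datetime.now(timezone.utc)
    tagged = dict(record)  # the payload itself is left untouched
    tagged["_ingest_ts"] = ingest_ts.isoformat()
    tagged["_source_file"] = source_file
    tagged["_source_system"] = source_system
    return tagged


def partition_dir(base_path, ingest_ts):
    """Build the ingest_date=YYYY-MM-DD partition directory for a timestamp."""
    return f"{base_path}/ingest_date={ingest_ts:%Y-%m-%d}"


ts = datetime(2025, 2, 17, 8, 30, tzinfo=timezone.utc)
row = tag_raw_record({"eventId": "e-1", "body": "{}"}, "events_001.json", "kafka", ts)
print(row["_ingest_ts"])  # 2025-02-17T08:30:00+00:00
print(partition_dir("/mnt/datalake/bronze/events", ts))
```

Keeping the payload unchanged and only adding columns is what makes replay from Bronze possible later.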


Silver Layer (Clean / Conform / Enrich)

Goal: Turn raw data into analytics-grade canonical entities that are trustworthy and joinable across domains.

Transforms Typically Applied

Step | Examples
Data Cleansing | Drop corrupt rows; parse JSON; enforce data types.
Deduplication | Use event IDs, hashes, or window-based dedupe.
Normalization | Explode arrays; flatten nested structures.
Conformance | Standardize units, currencies, time zones (UTC), enums.
Joins / Enrichment | Lookup dimension tables (users, products, geo).
Watermark + Late Data Handling | Structured Streaming with withWatermark to discard/mark very late events.
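The watermark idea in the last row can be sketched in plain Python. This is a simplified per-event model, not how Structured Streaming is implemented (Spark advances the watermark between micro-batches). The function name and the sample events are hypothetical.

```python
from datetime import datetime, timedelta


def split_late_events(events, delay):
    """Separate events into (accepted, late) using a running watermark.

    events: iterable of (event_id, event_time) pairs in arrival order.
    The watermark is the maximum event time seen so far minus `delay`.
    """
    max_seen = None
    accepted, late = [], []
    for event_id, event_time in events:
        if max_seen is not None and event_time < max_seen - delay:
            late.append(event_id)  # older than the watermark: mark as late
            continue
        accepted.append(event_id)
        if max_seen is None or event_time > max_seen:
            max_seen = event_time
    return accepted, late


t0 = datetime(2025, 2, 17, 12, 0)
events = [
    ("a", t0),
    ("b", t0 + timedelta(minutes=30)),
    ("c", t0 + timedelta(minutes=5)),   # 25 minutes behind the max: late
    ("d", t0 + timedelta(minutes=25)),  # 5 minutes behind the max: accepted
]
accepted, late = split_late_events(events, delay=timedelta(minutes=10))
print(accepted, late)  # ['a', 'b', 'd'] ['c']
```

In a real pipeline the late list would be dropped or routed to a separate table, depending on whether late events still matter.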

Table Modeling

  • Often entity-level (e.g., silver.sales_orders, silver.user_profile).

  • Partition on business date (event_date, transaction_date) when high volume.

  • Use MERGE INTO for CDC updates (upsert SCD Type 1/2 patterns).
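A minimal sketch of the MERGE INTO upsert (SCD Type 1) from the last bullet. The helper only assembles the SQL text, so it runs without Spark; executing the statement needs a Spark session with Delta Lake. The table name silver.user_profile comes from the example above, while the staging view updates and the key column user_id are assumed for illustration.

```python
def build_upsert_sql(target, source, key):
    """Return a Delta Lake MERGE statement that overwrites matched rows (SCD Type 1)."""
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


sql = build_upsert_sql("silver.user_profile", "updates", "user_id")
print(sql)
# With an active Spark session and Delta Lake available: spark.sql(sql)
```

An SCD Type 2 variant would close the old row and insert a new version instead of overwriting it, which takes more clauses than this sketch shows.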


Gold Layer (Curated / Business & Analytics)

Goal: High-trust, consumption-optimized data products: aggregates, KPIs, dimensional models, ML feature tables.

Patterns

Consumption Style | Modeling Approach | Notes
BI Reporting | Star/Snowflake (Fact + Dim tables) | Fast ad hoc BI (Power BI / Synapse).
Metrics/KPIs | Pre-aggregated summary tables | Daily/Hourly rollups, incremental refresh.
ML Features | Feature store–style Delta tables | Point-in-time correctness; training vs inference views.
Data Sharing | Cleaned, governed shareable tables | Unity Catalog + Delta Sharing.

Example Incremental ETL Flow (PySpark)

The Delta table is physically stored in your ADLS Gen2 container, mounted under /mnt/datalake/

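# Bronze -> Silver incremental clean
from pyspark.sql.functions import from_json

# event_schema is the StructType of the JSON payload; define it before running this.
bronze_path = "/mnt/datalake/bronze/events"
silver_table = "refined.silver.events_clean"

# Read the Bronze Delta table as a stream
bronze_stream = (
  spark.readStream
       .format("delta")
       .load(bronze_path)
)

# Parse the JSON body, flatten it into columns, and drop repeated events
cleaned = (
  bronze_stream
    .filter("body IS NOT NULL")
    .selectExpr("cast(body as string) as json_str", "ingest_ts")
    .select(from_json("json_str", event_schema).alias("e"), "ingest_ts")
    .select("e.*", "ingest_ts")
    # Without a watermark, the stream keeps state for every eventId seen so far
    .dropDuplicates(["eventId"])
)

# Append the cleaned rows to the Silver table
(cleaned.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/datalake/checkpoints/silver/events_clean")
    .toTable(silver_table))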


Silver → Gold Aggregation (Triggered Batch Refresh)

from pyspark.sql.functions import count, countDistinct

gold_table = "curated.gold.daily_event_metrics"
silver_df = spark.table("refined.silver.events_clean")

# Daily event count and distinct users per channel
daily = (
  silver_df
    .groupBy("event_date", "channel")
    .agg(count("*").alias("event_count"),
         countDistinct("userId").alias("unique_users"))
)

# Full refresh of the Gold table on each run
(daily.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(gold_table))

Apache Iceberg

Apache Iceberg is an open table format for lakehouse systems, and metadata is its key component.



1. Storage
2. Processing
3. Metadata
        Metadata of Metadata

Key Points:
  1. Storage

    • Data files (Parquet, ORC, Avro) are stored in Amazon S3.

    • Iceberg maintains metadata and snapshots to track table versions.

  2. ACID Transactions

    • Supports atomic inserts, updates, and deletes directly on S3 data.

    • Prevents read/write conflicts in concurrent jobs.

  3. Schema Evolution & Partitioning

    • Allows adding, dropping, or renaming columns without rewriting entire tables.

    • Supports hidden partitioning for efficient queries.

  4. Query Engine Compatibility

    • Works with Spark, Flink, Trino, Presto, Athena, and Glue.

    • Enables time travel to query historical snapshots of data.

  5. Lakehouse Advantage

    • Combines data lake storage (S3) with data warehouse-like capabilities.

    • Efficient for batch analytics, streaming, and ML pipelines.


🔹 Example Workflow on AWS:

  1. Store raw/processed Parquet files in S3.

  2. Create an Iceberg table referencing the S3 location.

  3. Query and update data using Spark SQL or Athena with ACID guarantees.

  4. Enable time travel for auditing or rollback.


Create Iceberg Table in Athena:

    CREATE TABLE iceberg_users (
        id INT,
        name STRING,
        event_date DATE
    )
    PARTITIONED BY (day(event_date))
    LOCATION 's3://your-s3-bucket/iceberg-warehouse/iceberg_users/'
    TBLPROPERTIES (
        'table_type' = 'ICEBERG',
        'write.format.default' = 'parquet'
    );

Reference: https://www.youtube.com/watch?v=iGvj1gjbwl0

🔹 Operations using PySpark:

 Writing Data to an Iceberg Table

  • Data files are stored as Parquet in S3.

  • Metadata and snapshots are tracked by Iceberg in Glue catalog.

from pyspark.sql import SparkSession

# Initialize SparkSession with Iceberg support
spark = SparkSession.builder \
    .appName("IcebergReadWrite") \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-iceberg-bucket/warehouse/") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .getOrCreate()

# Sample DataFrame
data = [
    (1, "Alice", 2025),
    (2, "Bob", 2024),
    (3, "Charlie", 2025)
]
columns = ["id", "name", "year"]

df = spark.createDataFrame(data, columns)

# Write to Iceberg table in Glue catalog (S3 backend)
df.writeTo("glue_catalog.db.iceberg_users").createOrReplace()

Reading from an Iceberg Table

# Read Iceberg table as a DataFrame
df_read = spark.read.table("glue_catalog.db.iceberg_users")
df_read.show()

Performing Updates/Deletes (ACID)

from pyspark.sql.functions import col

# Example: Delete rows where year = 2024
spark.sql("""
    DELETE FROM glue_catalog.db.iceberg_users
    WHERE year = 2024
""")

# Example: Insert new rows
new_data = [(4, "David", 2025)]
spark.createDataFrame(new_data, columns) \
    .writeTo("glue_catalog.db.iceberg_users") \
    .append()


Time Travel/Snapshot

# List all snapshots
spark.sql("SELECT * FROM glue_catalog.db.iceberg_users.snapshots").show()

# Query previous snapshot using 'as of' snapshot_id
df_time_travel = spark.read \
    .option("snapshot-id", "<snapshot_id>") \
    .table("glue_catalog.db.iceberg_users")
df_time_travel.show()

Sunday, September 22, 2024

Classes and Object Oriented Python

 

In Python, you define a class by using the class keyword followed by a name and a colon. Then you use .__init__() to declare which attributes each instance of the class should have:

# dog.py

class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age


In the body of .__init__(), there are two statements using the self variable:

  1. self.name = name creates an attribute called name and assigns the value of the name parameter to it.
  2. self.age = age creates an attribute called age and assigns the value of the age parameter to it.


To instantiate this Dog class, you need to provide values for name and age. If you don’t, then Python raises a TypeError:

>>> Dog()
Traceback (most recent call last):
  ...
TypeError: __init__() missing 2 required positional arguments: 'name' and 'age'

To pass arguments to the name and age parameters, put values into the parentheses after the class name:

>>> miles = Dog("Miles", 4)
>>> buddy = Dog("Buddy", 9)

When you instantiate the Dog class, Python creates a new instance of Dog and passes it to the first parameter of .__init__(). This essentially removes the self parameter, so you only need to worry about the name and age parameters.

What is the use of self in Python

When working with classes in Python, the term “self” refers to the instance of the class that is currently being used. It is customary to use “self” as the first parameter in instance methods of a class. Whenever you call a method of an object created from a class, the object is automatically passed as the first argument using the “self” parameter. This enables you to modify the object’s properties and execute tasks unique to that particular instance.


The .__init__() method is similar to a constructor in C++ or Java: Python calls it automatically when an instance is created, and it sets up the attributes of the new object.


Instance methods are functions that you define inside a class and can only call on an instance of that class. Just like .__init__(), an instance method always takes self as its first parameter.

# dog.py

class Dog:
    species = "Canis familiaris"

    def __init__(self, name, age):
        self.name = name
        self.age = age

    # Instance method
    def description(self):
        return f"{self.name} is {self.age} years old"

    # Another instance method
    def speak(self, sound):
        return f"{self.name} says {sound}"

Creating object and calling the methods

>>> miles = Dog("Miles", 4)

>>> miles.description()
'Miles is 4 years old'

>>> miles.speak("Woof Woof")
'Miles says Woof Woof'

>>> miles.speak("Bow Wow")
'Miles says Bow Wow'
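The calls above never pass self explicitly. A short sketch of what Python does behind the scenes: calling a method on an instance is equivalent to calling the function on the class and passing the instance as the first argument. The Dog class is repeated in trimmed form so the snippet runs on its own.

```python
class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def description(self):
        return f"{self.name} is {self.age} years old"


miles = Dog("Miles", 4)

# Calling through the instance passes miles as self automatically
via_instance = miles.description()

# Calling through the class requires passing the instance explicitly
via_class = Dog.description(miles)

print(via_instance)               # Miles is 4 years old
print(via_instance == via_class)  # True
```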


Inheritance

The Base class Dog can be inherited by the child classes as below:

# dog.py

# ...

class JackRussellTerrier(Dog):
    def speak(self, sound="Arf"):
        return f"{self.name} says {sound}"

# ...    

Child class objects can be created as

>>> miles = JackRussellTerrier("Miles", 4)
>>> miles.speak()
'Miles says Arf'
Instances of child classes inherit all of the attributes and methods of the parent class
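A quick way to check this is to call a method that only the parent defines. The sketch below repeats trimmed versions of both classes so it is self-contained.

```python
class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def description(self):
        return f"{self.name} is {self.age} years old"

    def speak(self, sound):
        return f"{self.name} says {sound}"


class JackRussellTerrier(Dog):
    def speak(self, sound="Arf"):
        return f"{self.name} says {sound}"


miles = JackRussellTerrier("Miles", 4)

print(miles.description())     # inherited from Dog: Miles is 4 years old
print(miles.speak())           # defined in the child: Miles says Arf
print(isinstance(miles, Dog))  # True, a child instance is also a Dog
```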

Parent Class functionality extension
# dog.py

# ...

class JackRussellTerrier(Dog):
    def speak(self, sound="Arf"):
        return f"{self.name} says {sound}"

# ...    
Here, speak() is overridden in the derived class.

You can access the parent class from inside a method of a child class by using super():
# dog.py

# ...

class JackRussellTerrier(Dog):
    def speak(self, sound="Arf"):
        return super().speak(sound)

# ...
When you call super().speak(sound) inside JackRussellTerrier, Python searches the parent class, Dog, for a .speak() method and calls it with the variable sound.
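To see that the parent method really runs, the sketch below changes the wording in Dog.speak() (an assumption made only for this demo, "barks:" instead of "says") and leaves the child class untouched. The child picks up the new wording because it delegates through super().

```python
class Dog:
    def __init__(self, name):
        self.name = name

    def speak(self, sound):
        # Wording changed for the demo so the parent's output is recognizable
        return f"{self.name} barks: {sound}"


class JackRussellTerrier(Dog):
    def speak(self, sound="Arf"):
        # Delegate to Dog.speak(); the child only supplies the default sound
        return super().speak(sound)


miles = JackRussellTerrier("Miles")
print(miles.speak())        # Miles barks: Arf
print(miles.speak("Grrr"))  # Miles barks: Grrr
```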



Garbage Collection in Python

Python allocates and deallocates memory automatically. Unlike C or C++, where heap memory obtained through dynamic allocation has to be requested and released by hand, the programmer does not have to preallocate or free memory.

Python automatically schedules garbage collection based upon a threshold of object allocations and object deallocations. When the number of allocations minus the number of deallocations is greater than the threshold number, the garbage collector is run.
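The thresholds and the current counters can be inspected with the standard gc module. The sketch below also builds a small reference cycle, which reference counting alone cannot free, and forces a collection with gc.collect() instead of waiting for the threshold to be crossed. The Node class is made up for the demo.

```python
import gc
import weakref


class Node:
    """Small object that references itself to form a cycle."""

    def __init__(self):
        self.me = self  # reference cycle: the object points to itself


print("Thresholds:", gc.get_threshold())  # one threshold per generation
print("Current counts:", gc.get_count())

node = Node()
probe = weakref.ref(node)  # lets us observe whether the object still exists
del node                   # the name is gone, but the cycle still references the object

gc.collect()               # a full collection finds the unreachable cycle
print("Object freed:", probe() is None)  # Object freed: True
```

The weak reference is used only to observe the object without keeping it alive.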

The garbage collection can be invoked manually in the following way: 
# Importing gc module
import gc

# gc.collect() returns the number of
# unreachable objects it found
collected = gc.collect()

# Prints how many objects were collected
print("Garbage collector: collected",
      "%d objects." % collected)