Databricks provides a free Community Edition that can be used to explore its capabilities and try things out in its notebooks. Both Python and Scala are supported.
Filesystem: its filesystem is called DBFS (the Databricks File System).
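For example, a few DBFS operations from a notebook cell (a minimal sketch; it assumes it runs inside a Databricks notebook, where dbutils and display are predefined):
# write a small file to DBFS, then list and read it back
dbutils.fs.put("/tmp/hello.txt", "hello from DBFS", True)  # True = overwrite
display(dbutils.fs.ls("/tmp"))
print(dbutils.fs.head("/tmp/hello.txt"))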
pipenv install pytest
Now, I have a simple function in a file:
t.py
def square(x: float):
    return x * x
t_test.py
import t

def test_square():
    assert t.square(5) == 25
Now, enhancing the test to run multiple cases using pytest's parametrize decorator.
t_test.py
import t
import pytest

@pytest.mark.parametrize(
    ('input_n', 'expected'),
    (
        (5, 25),
        (3., 9.),
    ),
)
def test_square(input_n, expected):
    assert t.square(input_n) == expected
Now, grouping tests in a class:
t_test.py
import t
import pytest

@pytest.mark.parametrize(
    ('input_n', 'expected'),
    (
        (5, 25),
        (3., 9.),
    ),
)
def test_square(input_n, expected):
    assert t.square(input_n) == expected

class TestSquare:
    def test_square(self):
        assert t.square(3) == 9
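To run the whole file (both the parametrized function and the class) inside the project's environment:
pipenv run pytest t_test.py -v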
Pipenv is a Python virtualenv management tool that supports a multitude of systems and nicely bridges the gaps between pip, python (using system python, pyenv or asdf) and virtualenv. Linux, macOS, and Windows are all first-class citizens in pipenv.
Pipenv is a recommended way to install Python packages and work in a virtual environment. With pip, the package manager bundled with Python, anything you install is installed globally, so there is no encapsulated environment per project (e.g., a Spark project and an ML project might need entirely different packages). Pipenv creates a virtual environment per project and makes it easy to add or remove packages specific to that project's needs.
Pipenv automatically creates and manages a virtualenv for your projects, and adds/removes packages from your Pipfile as you install/uninstall packages. It also generates a project Pipfile.lock, which is used to produce deterministic builds.
Pipenv is primarily meant to provide users and developers of applications with an easy method to arrive at a consistent working project environment.
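For reference, a minimal Pipfile might look like this after running pipenv install pytest (the Python version here is just a placeholder):
Pipfile
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
pytest = "*"

[requires]
python_version = "3.10"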
A few useful commands:
pip install --user pipenv
pipenv --version
pipenv shell
pipenv install -r requirements.txt
pipenv run pip freeze
pipenv graph
Problems: redundancy.
Different kinds of normal forms: 1NF, 2NF, 3NF, etc.
Star schemas denormalize the data, which means adding redundant columns to some dimension tables to make querying and working with the data faster and easier. The purpose is to trade some redundancy (duplication of data) in the data model for increased query speed, by avoiding computationally expensive join operations.
In this model, the fact table is normalized but the dimensions tables are not. That is, data from the fact table exists only on the fact table, but dimensional tables may hold redundant data.
Resources: https://www.databricks.com/glossary/star-schema
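As a sketch, a typical star-schema query in PySpark needs only one join per dimension (table and column names such as fact_sales and dim_product are hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# fact_sales: one normalized row per sale; dim_product: a denormalized
# dimension carrying all product attributes on a single table
revenue = (
    spark.table("fact_sales")
    .join(spark.table("dim_product"), "product_id")  # one join per dimension
    .groupBy("category")
    .agg(F.sum("amount").alias("revenue"))
)
revenue.show()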
Snowflake schema:
A snowflake schema is a multi-dimensional data model that is an extension of a star schema, where dimension tables are broken down into subdimensions. Snowflake schemas are commonly used for business intelligence and reporting in OLAP data warehouses, data marts, and relational databases.
In a snowflake schema, engineers break down individual dimension tables into logical subdimensions. This makes the data model more complex, but it can be easier for analysts to work with, especially for certain data types.
It's called a snowflake schema because its entity-relationship diagram (ERD) looks like a snowflake.
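Continuing the hypothetical PySpark sketch from above, the same query against a snowflake schema needs an extra join, because the category attributes now live in their own subdimension table:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# dim_product now references dim_category instead of duplicating its columns
revenue = (
    spark.table("fact_sales")
    .join(spark.table("dim_product"), "product_id")
    .join(spark.table("dim_category"), "category_id")  # extra hop into the subdimension
    .groupBy("category_name")
    .agg(F.sum("amount").alias("revenue"))
)
revenue.show()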
References: https://www.youtube.com/watch?v=9ToVk0Fgsz0
Normalization is the method of arranging data in a database efficiently. It involves constructing tables and setting up relationships between those tables according to certain rules. Following these rules removes redundancy and inconsistent dependencies, making the data more flexible. For example, repeating a customer's name and address on every order row duplicates data; moving those attributes to a customers table referenced by key removes the duplication.
There are six defined normal forms: 1NF, 2NF, 3NF, BCNF, 4NF and 5NF. Normalization should eliminate redundancy, but not at the cost of integrity.
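A minimal sketch of this idea with sqlite3 (hypothetical tables and data):
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 3NF-style design: customer attributes live in exactly one place;
# orders reference them by key instead of repeating name/city per row
cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
            "customer_id INTEGER REFERENCES customers(customer_id), amount REAL)")

cur.execute("INSERT INTO customers VALUES (1, 'Asha', 'Pune')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 99.0), (11, 1, 45.5)])

# one update fixes the city for every order -- no duplicated data to drift
cur.execute("UPDATE customers SET city = 'Mumbai' WHERE customer_id = 1")
print(cur.execute("SELECT o.order_id, c.name, c.city FROM orders o "
                  "JOIN customers c USING (customer_id)").fetchall())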
A data lake and a data warehouse are two different approaches to managing and storing data.
A data lake is an unstructured or semi-structured data repository that allows for the storage of vast amounts of raw data in its original format. Data lakes are designed to ingest and store all types of data — structured, semi-structured or unstructured — without any predefined schema. Data is often stored in its native format and is not cleansed, transformed or integrated, making it easier to store and access large amounts of data.
A data warehouse, on the other hand, is a structured repository that stores data from various sources in a well-organized manner, with the aim of providing a single source of truth for business intelligence and analytics. Data is cleansed, transformed and integrated into a schema that is optimized for querying and analysis.