Databricks provides a free Community Edition that can be used to explore its capabilities and try things out in its notebooks. Both Python and Scala are supported.
Filesystem: its filesystem is called DBFS (the Databricks File System).
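For example, a few DBFS operations from a notebook cell (a minimal sketch; it assumes it runs inside a Databricks notebook, where dbutils and display are predefined):
# write a small file to DBFS, then list and read it back
dbutils.fs.put("/tmp/hello.txt", "hello from DBFS", True)  # True = overwrite
display(dbutils.fs.ls("/tmp"))
print(dbutils.fs.head("/tmp/hello.txt"))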
pipenv install pytest
Now, I have a simple function in a file:
t.py
def square(x: float):
    return x * x
t_test.py
import t

def test_square():
    assert t.square(5) == 25
Now, enhancing the test to run multiple cases using pytest's parametrize decorator.
t_test.py
import t
import pytest

@pytest.mark.parametrize(
    ('input_n', 'expected'),
    (
        (5, 25),
        (3., 9.),
    ),
)
def test_square(input_n, expected):
    assert t.square(input_n) == expected
Now, grouping tests in a class:
t_test.py
import t
import pytest

@pytest.mark.parametrize(
    ('input_n', 'expected'),
    (
        (5, 25),
        (3., 9.),
    ),
)
def test_square(input_n, expected):
    assert t.square(input_n) == expected

class TestSquare:
    def test_square(self):
        assert t.square(3) == 9
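To run the whole file (both the parametrized function and the class) inside the project's environment:
pipenv run pytest t_test.py -v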
Pipenv is a Python virtualenv management tool that supports a multitude of systems and nicely bridges the gaps between pip, python (using system python, pyenv or asdf) and virtualenv. Linux, macOS, and Windows are all first-class citizens in pipenv.
Pipenv is a recommended way to install Python packages and work in a virtual environment. With pip, the package manager bundled with Python, anything you install is installed globally, so there is no encapsulated environment per project (e.g., a Spark project and an ML project might need entirely different packages). Pipenv creates a virtual environment per project and makes it easy to add or remove packages specific to that project's needs.
Pipenv automatically creates and manages a virtualenv for your projects, and adds/removes packages from your Pipfile as you install/uninstall packages. It also generates a project Pipfile.lock, which is used to produce deterministic builds.
Pipenv is primarily meant to provide users and developers of applications with an easy method to arrive at a consistent working project environment.
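For reference, a minimal Pipfile might look like this after running pipenv install pytest (the Python version here is just a placeholder):
Pipfile
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
pytest = "*"

[requires]
python_version = "3.10"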
A few useful commands:
pip install --user pipenv
pipenv --version
pipenv shell
pipenv install -r requirements.txt
pipenv run pip freeze
pipenv graph
Problems: redundancy.
Different kinds of normal forms: 1NF, 2NF, 3NF, etc.
Star schemas denormalize the data, which means adding redundant columns to some dimension tables to make querying and working with the data faster and easier. The purpose is to trade some redundancy (duplication of data) in the data model for increased query speed, by avoiding computationally expensive join operations.
In this model, the fact table is normalized but the dimensions tables are not. That is, data from the fact table exists only on the fact table, but dimensional tables may hold redundant data.
Resources: https://www.databricks.com/glossary/star-schema
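As a sketch, a typical star-schema query in PySpark needs only one join per dimension (table and column names such as fact_sales and dim_product are hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# fact_sales: one normalized row per sale; dim_product: a denormalized
# dimension carrying all product attributes on a single table
revenue = (
    spark.table("fact_sales")
    .join(spark.table("dim_product"), "product_id")  # one join per dimension
    .groupBy("category")
    .agg(F.sum("amount").alias("revenue"))
)
revenue.show()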
Snowflake schema:
A snowflake schema is a multi-dimensional data model that is an extension of a star schema, where dimension tables are broken down into subdimensions. Snowflake schemas are commonly used for business intelligence and reporting in OLAP data warehouses, data marts, and relational databases.
In a snowflake schema, engineers break down individual dimension tables into logical subdimensions. This makes the data model more complex, but it can be easier for analysts to work with, especially for certain data types.
It's called a snowflake schema because its entity-relationship diagram (ERD) looks like a snowflake.
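Continuing the hypothetical PySpark sketch from above, the same query against a snowflake schema needs an extra join, because the category attributes now live in their own subdimension table:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# dim_product now references dim_category instead of duplicating its columns
revenue = (
    spark.table("fact_sales")
    .join(spark.table("dim_product"), "product_id")
    .join(spark.table("dim_category"), "category_id")  # extra hop into the subdimension
    .groupBy("category_name")
    .agg(F.sum("amount").alias("revenue"))
)
revenue.show()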
References: https://www.youtube.com/watch?v=9ToVk0Fgsz0
Normalization is the method of arranging data in a database efficiently. It involves constructing tables and setting up relationships between those tables according to certain rules. Following these rules removes redundancy and inconsistent dependencies, making the data more flexible. For example, repeating a customer's name and address on every order row duplicates data; moving those attributes to a customers table referenced by key removes the duplication.
There are six defined normal forms: 1NF, 2NF, 3NF, BCNF, 4NF and 5NF. Normalization should eliminate redundancy, but not at the cost of integrity.
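A minimal sketch of this idea with sqlite3 (hypothetical tables and data):
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 3NF-style design: customer attributes live in exactly one place;
# orders reference them by key instead of repeating name/city per row
cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
            "customer_id INTEGER REFERENCES customers(customer_id), amount REAL)")

cur.execute("INSERT INTO customers VALUES (1, 'Asha', 'Pune')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 99.0), (11, 1, 45.5)])

# one update fixes the city for every order -- no duplicated data to drift
cur.execute("UPDATE customers SET city = 'Mumbai' WHERE customer_id = 1")
print(cur.execute("SELECT o.order_id, c.name, c.city FROM orders o "
                  "JOIN customers c USING (customer_id)").fetchall())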
A data lake and a data warehouse are two different approaches to managing and storing data.
A data lake is an unstructured or semi-structured data repository that allows for the storage of vast amounts of raw data in its original format. Data lakes are designed to ingest and store all types of data — structured, semi-structured or unstructured — without any predefined schema. Data is often stored in its native format and is not cleansed, transformed or integrated, making it easier to store and access large amounts of data.
A data warehouse, on the other hand, is a structured repository that stores data from various sources in a well-organized manner, with the aim of providing a single source of truth for business intelligence and analytics. Data is cleansed, transformed and integrated into a schema that is optimized for querying and analysis.